Scraping Ctrip Hotel Data (New)

Preface: Because Ctrip keeps changing its pages and tightening its anti-crawler defenses, many existing Ctrip scrapers can no longer retrieve any data.
Core idea of this article: obtain Ctrip hotel data by supplying up-to-date cookie values.
It covers the following four parts:

  1. headers
  2. data
  3. JSON parsing
  4. Complete code

Preface

Environment: Python 3.6 + requests
The script also performs some file-writing (CSV output) operations.

1. headers

The crawler has to imitate a browser when it sends requests, so the headers are essential; all of the values can be copied straight from the request shown in the browser's developer tools.

headers = {
        "Connection": "keep-alive",
        "Cookie":cookies,
        "origin": "https://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",     
        "referer": "https://hotels.ctrip.com/hotel/qamdo575",
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8"
    }

The most important entry here is the Cookie. Without it, verification fails and the response comes back empty; moreover, the cookie has to be taken from a logged-in session.
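One convenient way to manage this (a minimal sketch, not part of the original script) is to keep the Cookie string copied from the browser in a local text file and read it in before building the headers; the file name cookie.txt is purely an assumption for illustration.

# Minimal sketch: read the Cookie value copied from a logged-in browser
# session out of a local file. "cookie.txt" is a hypothetical file name.
from pathlib import Path

cookies = Path("cookie.txt").read_text(encoding="utf-8").strip()

# cookies is then dropped into the headers dict shown above:
# headers = {..., "Cookie": cookies, ...}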

2. The data payload

Because the data is fetched through Ctrip's AJAX endpoint rather than by parsing HTML, the main task is to assemble the matching data payload; only then does the endpoint return valid results. In the browser's developer tools, the required fields can be read off the form data of the captured request.

data = {
    "StartTime": "2020-10-09",   # check-in date
    "DepTime": "2020-10-10",     # check-out date
    "RoomGuestCount": "1,1,0",   # rooms, adults, children
    "cityId": 575,               # Ctrip city id (575 = Qamdo / Changdu)
    "cityPY": "qamdo",           # city name in pinyin
    "cityCode": "0895",          # city dialing code
    "page": page                 # page number of the result list
}
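requests encodes this dict as application/x-www-form-urlencoded form data, matching the Content-Type set in the headers. If you want to double-check exactly what will be sent (an optional sanity check, not part of the original script), the prepared request body can be inspected before firing the request:

import requests

url = "https://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx"

# Build the request without sending it, just to look at the encoded body
# (headers and data are the dicts shown above).
prepared = requests.Request("POST", url, headers=headers, data=data).prepare()
print(prepared.body)   # e.g. StartTime=2020-10-09&DepTime=2020-10-10&...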

3. JSON parsing

Once the right data endpoint has been found, we use the requests library to send a GET or POST request, attaching the headers and data assembled above, and receive the corresponding JSON.
The individual attribute values, such as the hotel URL, score, and address, can then be picked out of that JSON by key.

html = requests.post(url, headers=headers, data=data)
hotel_list = html.json()["hotelPositionJSON"]
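Extracting fields then comes down to ordinary dict and list indexing. A minimal sketch (the key names match those used in the complete script below; the status check is an extra safeguard, not part of the original code):

# Minimal sketch of reading fields from the returned JSON.
resp = requests.post(url, headers=headers, data=data)
resp.raise_for_status()                      # fail loudly on HTTP errors
hotel_list = resp.json().get("hotelPositionJSON", [])

for item in hotel_list:
    # each item is a dict describing one hotel
    print(item["name"], item["score"], item["address"], item["url"])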

4. Complete code

# coding=utf8
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import random
import time
import csv
import json
import re
from tqdm import tqdm
# Pandas display option
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width',1000)

url = "https://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx"
filename = "F:\\aaa\\changdu.csv"
def Scrap_hotel_lists():
    # Paste the full Cookie string from a logged-in browser session here
    cookies = '......'
    headers = {
        "Connection": "keep-alive",
        "Cookie":cookies,
        "origin": "https://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",     
        "referer": "https://hotels.ctrip.com/hotel/qamdo575",
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8"
    }
    # containers for the fields collected for each hotel
    id = []
    name = []
    hotel_url = []
    address = []
    score = []
    star = []
    stardesc=[]
    lat=[]
    lon=[]
    dpcount=[]
    dpscore=[]
    for page in tqdm(range(1, 13), desc='Scraping', ncols=10):
        data = {
            "StartTime": "2020-10-09",   # check-in date
            "DepTime": "2020-10-10",     # check-out date
            "RoomGuestCount": "1,1,0",   # rooms, adults, children
            "cityId": 575,               # Ctrip city id (575 = Qamdo / Changdu)
            "cityPY": "qamdo",           # city name in pinyin
            "cityCode": "0895",          # city dialing code
            "page": page
        }
        html = requests.post(url, headers=headers, data=data)
        hotel_list = html.json()["hotelPositionJSON"]
        for item in hotel_list:
            print(item)
            id.append(item['id'])
            name.append(item['name'])
            hotel_url.append(item['url'])
            address.append(item['address'])
            score.append(item['score'])
            stardesc.append(item['stardesc'])
            lat.append(item['lat'])
            lon.append(item['lon'])
            dpcount.append(item['dpcount'])
            dpscore.append(item['dpscore'])
            if item['star'] == '':
                star.append('NaN')
            else:
                star.append(item['star'])
        time.sleep(random.randint(3, 5))   # polite delay between pages
    hotel_array = np.array((id, name, score, hotel_url, address,star,stardesc,lat,lon,dpcount,dpscore)).T
    list_header = ['id', 'name', 'score', 'url', 'address',
                   'star','stardesc','lat','lon','dpcount','dpscore']
    array_header = np.array((list_header))
    hotellists = np.vstack((array_header, hotel_array))
    with open(filename, 'w', encoding="utf-8-sig", newline="") as f:
        csvwriter = csv.writer(f, dialect='excel')
        csvwriter.writerows(hotellists)
if __name__ == "__main__":
    Scrap_hotel_lists()
    df = pd.read_csv(filename, encoding='utf8')
    print(df)

Note: the Ctrip site is redesigned frequently; this program is intended for learning purposes only.

Original post: https://blog.csdn.net/weixin_45026680/article/details/108609247