
A Simple Python Web Scraper: A Detailed Walkthrough
I. Introduction to Web Scraping

Using Python to extract data from web pages is one of the most common forms of scraping. Web pages are generally written in HTML and contain a large number of tags, and the content we want lives inside those tags. Besides a grasp of basic Python syntax, you also need a basic understanding of HTML structure and tag selection. The example below, which scrapes the XXX site, will lead you into the world of web scraping.

II. Implementation Steps

1. Import dependencies

Fetching page content relies on requests; if the package is not installed, run pip install requests in the terminal and it will be downloaded automatically.

For parsing the fetched content, commonly used libraries include BeautifulSoup, parsel, re, and so on. As above, install any missing dependency from the terminal:

Install BeautifulSoup: pip install bs4
Install parsel: pip install parsel

Then import the dependencies:

from bs4 import BeautifulSoup
import requests
import parsel
import re

2. Fetch the data

The simplest way to fetch a page and get its text:

response = requests.get(url).text

Many sites require a logged-in user. In that case, disguise the request with headers, whose values you can copy from the browser's developer tools (F12):

headers = {
    'Cookie': 'cookie, not a real one',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

A fuller set of headers might look like this:

headers = {
    'Host': '',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate'
}

Fetch the page with the disguised request:

response = requests.get(url=url, headers=headers).text

Some sites additionally involve SSL certificates or network restrictions, in which case you also need to configure proxies:

proxies = {
    'http': 'http://127.0.0.1:9000',
    'https': 'http://127.0.0.1:9000'
}
response = requests.get(url=url, headers=headers, proxies=proxies).text

3. Parse the data

There are several ways to parse the data, such as XPath, CSS selectors, and re. CSS, as the name suggests, selects content by HTML tags and classes; re parses with regular expressions. (A short sketch comparing these styles appears after the complete example below.)

4. Write to a file

with open(titleName + '.txt', mode='w', encoding='utf-8') as f:
    f.write(content)

The open function opens a file I/O stream, and the with statement frees you from closing the stream by hand, much like the resource acquired in the try() of Java's try-with-resources. The first argument is the file name, mode is the write mode, and encoding is the character encoding; there are more parameters you can explore on your own. write writes the content into the file.

III. Complete Example

import requests
import parsel

# headers were introduced in step 2; the original snippet used them without defining them
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

link = ''  # target address
link_data = requests.get(url=link).text
link_selector = parsel.Selector(link_data)
href = link_selector.css('.DivTr a::attr(href)').getall()
for index in href:
    url = f'https:{index}'  # the hrefs are protocol-relative, so prepend the scheme
    print(url)
    response = requests.get(url, headers=headers)  # headers must be passed by keyword
    html_data = response.text
    selector = parsel.Selector(html_data)
    title = selector.css('.c_l_title h1::text').get()
    content_list = selector.css('div.noveContent p::text').getall()
    content = '\n'.join(content_list)
    with open(title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(content)
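
As promised in step 3, here is a minimal sketch comparing the three parsing styles on the same extraction. The HTML fragment is invented for illustration (it merely reuses the .c_l_title class from the complete example), so treat it as an assumption rather than the target site's real markup:

import re
import parsel

# a made-up fragment shaped like the title markup in the example above
html = '<div class="c_l_title"><h1>Chapter 1</h1></div>'
selector = parsel.Selector(html)

# CSS selector: the <h1> inside the element with class c_l_title
print(selector.css('.c_l_title h1::text').get())                    # Chapter 1

# XPath: the same element, addressed by path and class attribute
print(selector.xpath('//div[@class="c_l_title"]/h1/text()').get())  # Chapter 1

# re: a regular expression run over the raw HTML string
print(re.search(r'<h1>(.*?)</h1>', html).group(1))                  # Chapter 1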
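
Step 1 also installs bs4, but BeautifulSoup is never used in the example. For completeness, the same title extraction with BeautifulSoup might look like the sketch below; the HTML is the same invented fragment, and 'html.parser' is the standard-library parser, so nothing beyond bs4 itself is needed:

from bs4 import BeautifulSoup

html = '<div class="c_l_title"><h1>Chapter 1</h1></div>'
soup = BeautifulSoup(html, 'html.parser')

# select the <h1> inside the element with class c_l_title and read its text
print(soup.select_one('.c_l_title h1').get_text())  # Chapter 1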
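
One last practical note on step 2: the requests.get calls in this article carry no timeout or error handling. A minimal hardened sketch, with a placeholder URL and an abbreviated User-Agent (both assumptions, not values from the original), might look like this:

import requests

url = 'https://example.com'              # placeholder target
headers = {'User-Agent': 'Mozilla/5.0'}  # abbreviated for brevity

response = requests.get(url, headers=headers, timeout=10)  # fail fast instead of hanging
response.raise_for_status()                                # raise on 4xx/5xx responses
response.encoding = response.apparent_encoding             # guard against mojibake on Chinese pages
print(response.text[:200])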