
A Simple Python Web Scraper: A Detailed Walkthrough
I. Introduction to Web Scraping

Using Python to extract data from web pages is one of the most common forms of scraping. Web pages are generally written in HTML and contain a large number of tags, and the content we want lives inside those tags. Besides a grasp of basic Python syntax, you also need a basic understanding of HTML structure and tag selection. The example below, which scrapes the XXX site, will lead you into the world of web scraping.

II. Implementation Steps

1. Import dependencies

Fetching page content relies on requests; if the package is not installed, run pip install requests in the terminal and it will be downloaded automatically.

For parsing the fetched content, commonly used libraries include BeautifulSoup, parsel, re, and so on. As above, install any missing dependency from the terminal:

Install BeautifulSoup: pip install bs4
Install parsel: pip install parsel

Then import the dependencies:

from bs4 import BeautifulSoup
import requests
import parsel
import re

2. Fetch the data

The simplest way to fetch a page and get its text:

response = requests.get(url).text

Many sites require a logged-in user. In that case, disguise the request with headers, whose values you can copy from the browser's developer tools (F12):

headers = {
    'Cookie': 'cookie, not a real one',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

A fuller set of headers might look like this:

headers = {
    'Host': '',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate'
}

Fetch the page with the disguised request:

response = requests.get(url=url, headers=headers).text

Some sites additionally involve SSL certificates or network restrictions, in which case you also need to configure proxies:

proxies = {
    'http': 'http://127.0.0.1:9000',
    'https': 'http://127.0.0.1:9000'
}
response = requests.get(url=url, headers=headers, proxies=proxies).text

3. Parse the data

There are several ways to parse the data, such as XPath, CSS selectors, and re. CSS, as the name suggests, selects content by HTML tags and classes; re parses with regular expressions. (A short sketch comparing these styles appears after the complete example below.)

4. Write to a file

with open(titleName + '.txt', mode='w', encoding='utf-8') as f:
    f.write(content)

The open function opens a file I/O stream, and the with statement frees you from closing the stream by hand, much like the resource acquired in the try() of Java's try-with-resources. The first argument is the file name, mode is the write mode, and encoding is the character encoding; there are more parameters you can explore on your own. write writes the content into the file.

III. Complete Example

import requests
import parsel

# headers were introduced in step 2; the original snippet used them without defining them
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

link = ''  # target address
link_data = requests.get(url=link).text
link_selector = parsel.Selector(link_data)
href = link_selector.css('.DivTr a::attr(href)').getall()
for index in href:
    url = f'https:{index}'  # the hrefs are protocol-relative, so prepend the scheme
    print(url)
    response = requests.get(url, headers=headers)  # headers must be passed by keyword
    html_data = response.text
    selector = parsel.Selector(html_data)
    title = selector.css('.c_l_title h1::text').get()
    content_list = selector.css('div.noveContent p::text').getall()
    content = '\n'.join(content_list)
    with open(title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(content)
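
As promised in step 3, here is a minimal sketch comparing the three parsing styles on the same extraction. The HTML fragment is invented for illustration (it merely reuses the .c_l_title class from the complete example), so treat it as an assumption rather than the target site's real markup:

import re
import parsel

# a made-up fragment shaped like the title markup in the example above
html = '<div class="c_l_title"><h1>Chapter 1</h1></div>'
selector = parsel.Selector(html)

# CSS selector: the <h1> inside the element with class c_l_title
print(selector.css('.c_l_title h1::text').get())                    # Chapter 1

# XPath: the same element, addressed by path and class attribute
print(selector.xpath('//div[@class="c_l_title"]/h1/text()').get())  # Chapter 1

# re: a regular expression run over the raw HTML string
print(re.search(r'<h1>(.*?)</h1>', html).group(1))                  # Chapter 1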
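
Step 1 also installs bs4, but BeautifulSoup is never used in the example. For completeness, the same title extraction with BeautifulSoup might look like the sketch below; the HTML is the same invented fragment, and 'html.parser' is the standard-library parser, so nothing beyond bs4 itself is needed:

from bs4 import BeautifulSoup

html = '<div class="c_l_title"><h1>Chapter 1</h1></div>'
soup = BeautifulSoup(html, 'html.parser')

# select the <h1> inside the element with class c_l_title and read its text
print(soup.select_one('.c_l_title h1').get_text())  # Chapter 1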
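
One last practical note on step 2: the requests.get calls in this article carry no timeout or error handling. A minimal hardened sketch, with a placeholder URL and an abbreviated User-Agent (both assumptions, not values from the original), might look like this:

import requests

url = 'https://example.com'              # placeholder target
headers = {'User-Agent': 'Mozilla/5.0'}  # abbreviated for brevity

response = requests.get(url, headers=headers, timeout=10)  # fail fast instead of hanging
response.raise_for_status()                                # raise on 4xx/5xx responses
response.encoding = response.apparent_encoding             # guard against mojibake on Chinese pages
print(response.text[:200])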