Thursday, March 12, 2020

[Python] Web Scraping Study Notes (Key Points)

References:
1. https://www.url-encode-decode.com/
2. https://httpbin.org/
3. SelectorGadget: https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=zh-TW
4. XPath Helper: https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=zh-TW

Textbook: Python 網路爬蟲與資料視覺化應用實務 (旗標出版社, Flag Publishing)
Requests

import requests

requests.get() parameters
url = "http://www.google.com"
r = requests.get(url, timeout=0.3)                # set the timeout (in seconds)
r = requests.get(url, cookies={"over18": "1"})    # send an "over 18" cookie to get past PTT's age check
r = requests.get(url, params={'maxResult': 5})    # pass query-string parameters
r = requests.get(url, headers=url_headers)        # pass custom request headers (url_headers is a dict)
# note: nextPg = response.xpath(xpath) is an lxml/Scrapy selector call and belongs to Ch6, not to requests


Exception handling:
RequestException   request error (base class)
HTTPError          invalid response
ConnectionError    connection error
Timeout            request timed out
TooManyRedirects   too many redirects
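
A minimal sketch of catching these exceptions; the URL and timeout are just example values (httpbin.org is from the reference list above):

import requests

try:
    r = requests.get("https://httpbin.org/get", timeout=0.3)
    r.raise_for_status()                           # raises HTTPError for 4xx/5xx responses
    print(r.status_code)
except requests.exceptions.Timeout:
    print("request timed out")
except requests.exceptions.TooManyRedirects:
    print("too many redirects")
except requests.exceptions.HTTPError as err:
    print("invalid response:", err)
except requests.exceptions.ConnectionError:
    print("connection error")
except requests.exceptions.RequestException as err:
    print("request failed:", err)                  # base class last, since the others are subclasses of it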



Using bs4 (BeautifulSoup) together with requests


from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.google.com")
soup = BeautifulSoup(r.text, "lxml")
print(soup)

To write soup out to a file it must be formatted first, which means calling
prettify()
Example:
fp = open("xxx.txt", "w", encoding="utf8")
fp.write(soup.prettify())
print("writing")
fp.close()

============================================================
find()

find() returns the first tag that matches a single condition
find_all() searches the whole HTML page and returns every match

find_all() with extra constraints =>
soup.find_all("tag", class_="class_name", limit=limit_num)
(filter by tag, id or class_, and cap the number of results with limit=)
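
A small sketch of find() / find_all(); the sample HTML, tag names and class name here are assumptions:

from bs4 import BeautifulSoup

html_doc = '<p id="note">C</p><div class="item">A</div><div class="item">B</div>'
soup = BeautifulSoup(html_doc, "lxml")

print(soup.find("p", id="note").text)                      # first <p> with id="note"  -> C
for div in soup.find_all("div", class_="item", limit=2):   # at most two matching <div>
    print(div.text)                                        # A then B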

============================================================
Ch4
select()

BeautifulSoup also provides select() and select_one()
select() returns a list of every element matching the CSS selector
select_one() returns only the first match
The CSS selectors here use nth-of-type, so "nth-child" selectors must be rewritten as nth-of-type
e.g. tag.select("div:nth-of-type(3)")
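
A short sketch of select() / select_one(); the sample HTML is an assumption:

from bs4 import BeautifulSoup

html_doc = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html_doc, "lxml")

print(soup.select_one("ul li").text)              # first match          -> one
print([li.text for li in soup.select("ul li")])   # list of every match  -> ['one', 'two', 'three']
print(soup.select("li:nth-of-type(3)")[0].text)   # the third <li>       -> three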

============================================================
Ch5
Traversing ancestors, siblings, children and descendants

tag.contents and tag.children both give the direct child tags,
and for child in tag.descendants walks every descendant tag below it.
for tag in tag_ul.descendants    # iterate the descendant tags
for tag in tag_ul.parents        # iterate the ancestor tags

next_sibling / previous_sibling traverse sibling tags


find_previous_sibling()   # find the previous sibling tag (automatically skips NavigableString objects)
find_next_sibling()       # find the next sibling tag (automatically skips NavigableString objects)

Traversing to the next/previous element
next_element        # the next parsed element
previous_element    # the previous parsed element
Usage:
next_tag = tag.next_element
print(type(next_tag), next_tag.name)
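
A minimal traversal sketch; the sample <ul> list is an assumption:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>A</li><li>B</li><li>C</li></ul>", "lxml")
tag_ul = soup.ul

for child in tag_ul.children:                 # the three direct <li> children
    print(child.name, child.text)

first_li = tag_ul.li
print(first_li.find_next_sibling().text)      # next sibling tag     -> B
print(repr(first_li.next_element))            # next parsed element  -> 'A' (a NavigableString)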

===========================================================
Modifying the tree
Modifications only change the Python object tree; they never change the original page source.
tag.string = "str_value"            # modify the tag's text
del tag["attr_name"]                # delete an attribute
new_tag = soup.new_tag("tag_name")  # create a new tag (1): new_tag() is called on the BeautifulSoup object
tag.append(new_tag)                 # append the new tag (2)
insert_before()                     # insert before a given tag
insert_after()                      # insert after a given tag
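
A small sketch of modifying the parse tree; the sample HTML and values are assumptions:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="old">hello</p>', "lxml")
p = soup.p
p.string = "goodbye"             # modify the text
del p["class"]                   # delete the class attribute
new_b = soup.new_tag("b")        # create a new <b> tag
new_b.string = "bold"
p.append(new_b)                  # append it inside <p>
print(p)                         # <p>goodbye<b>bold</b></p>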

Ch5-4 Reading and writing CSV & JSON
Writing CSV:
import csv

with open(csv_file, 'w+', newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(["Data1", "Data2", "Data3"])   # write the Data1~Data3 header row for the Excel/CSV file

JSON
import json

j_str = json.dumps(json_data)      # convert the json_data dict into a JSON string
json_data2 = json.loads(j_str)     # convert the JSON string back into a dict

j_file = "jfile.json"
with open(j_file, 'w') as fp:      # open j_file for writing
    json.dump(data, fp)            # write data out to fp
with open(j_file, 'r') as fp:      # open j_file for reading
    data2 = json.load(fp)          # read the file back into a dict
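
A self-contained sketch putting the two together; the file names and sample values are assumptions:

import csv, json

rows = [["Data1", "Data2", "Data3"], ["a", "b", "c"]]
with open("example.csv", "w", newline="", encoding="utf8") as fp:
    csv.writer(fp).writerows(rows)              # header row plus one data row

record = {"Data1": "a", "Data2": "b", "Data3": "c"}
with open("example.json", "w", encoding="utf8") as fp:
    json.dump(record, fp, ensure_ascii=False)   # write the dict out as JSON

with open("example.json", "r", encoding="utf8") as fp:
    print(json.load(fp))                        # read it back as a dict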

Ch5-5 Downloading images
response = requests.get(url, stream=True)
with open(path, 'wb') as fp:
    for chunk in response:
        fp.write(chunk)

Or open the download with urllib.request.urlopen(url)   # considered the more efficient option
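
A minimal urllib.request version of the same download; the URL (the book-cover image from Ch6 below) and the output filename are example values:

import urllib.request

url = "https://www.flag.com.tw/assets/img/bookpic/F4473.jpg"
with urllib.request.urlopen(url) as response, open("F4473.jpg", "wb") as fp:
    fp.write(response.read())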

===========================================================
Ch6 XPath and lxml
from lxml import html


After locating the resource URL inside the tag, left-click the three-dot menu in DevTools and choose Copy > Copy XPath, as in the figure below:

(screenshot: the browser DevTools right-click/Copy XPath menu)
tag_img = tree.xpath("/html/body/section[2]/table/tbody/tr[21]/td[1]/a/img")[0]
print(tag_img.attrib["src"])

=>https://www.flag.com.tw/assets/img/bookpic/F4473.jpg
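
Putting the lxml pieces together, a hedged sketch; the URL is just an example (the XPath above came from the book's page):

import requests
from lxml import html

r = requests.get("https://www.flag.com.tw/")          # assumed example URL
tree = html.fromstring(r.text)                        # build the element tree
for tag_img in tree.xpath("//img"):                   # every <img> node on the page
    print(tag_img.attrib.get("src"))                  # print each image URL (may be relative)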

6-4 XPath basic syntax: axis::node-test[predicate]
*            : wildcard, matches all element and attribute nodes
axis         : the starting point of the step (see the axis definitions for details)
node test    : which nodes along that axis match
predicate    : a further filtering condition
Using the book's example /child::library/child::book[2]/child::author:
/child::library   the child node of the root is library
/child::book[2]   the second child node named book
/child::author    the child node named author
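
A tiny sketch of the axis::node-test[predicate] form with lxml; the XML content is an assumption modelled on the book's library/book example:

from lxml import etree

xml = """<library>
  <book><author>Alice</author></book>
  <book><author>Bob</author></book>
</library>"""
root = etree.fromstring(xml)

# /child::library/child::book[2]/child::author is equivalent to /library/book[2]/author
authors = root.xpath("/child::library/child::book[2]/child::author")
print(authors[0].text)    # Bob, the author of the second book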

With XPath Helper open, hold Shift and move over the part of the page you want to query; the matching XPath query is generated automatically.

(screenshot: the XPath Helper panel showing the generated query)
====================================================================
Ch7 Selenium

Install Selenium:  python -m pip install -U selenium

Also install ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads

from selenium import webdriver    # import webdriver

driver = webdriver.Chrome("./chromedriver")

driver.get(url)
tag = driver.find_element_by_tag_name("tag")
soup = BeautifulSoup(driver.page_source, "lxml")

find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()

Running a Google search:
from selenium.webdriver.common.keys import Keys

keyword = driver.find_element_by_css_selector("#lst-ib")
keyword.send_keys("search term")
keyword.send_keys(Keys.ENTER)

List of Keys key constants:
https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/Keys.html
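
A fuller sketch of the search flow above, using the same Selenium 3 style API as the book; the chromedriver path and search term are assumptions, and the #lst-ib selector is the book's example and may have changed on Google since then:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("./chromedriver")          # path to the downloaded chromedriver
driver.get("https://www.google.com")
keyword = driver.find_element_by_css_selector("#lst-ib")
keyword.send_keys("python web scraping")
keyword.send_keys(Keys.ENTER)                        # press Enter to submit the search
print(driver.title)                                  # title of the result page
driver.quit()                                        # close the browser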


Selenium ActionChains: chain together a series of automated page actions
click()            # click an element
click_and_hold()   # hold down the left mouse button
context_click()    # right-click
double_click()     # double-click an element
move_to_element()  # move the mouse to the middle of an element
key_up()           # release a key
key_down()         # press a key
perform()          # execute all stored actions
send_keys()        # send keystrokes to the current element
release()          # release the mouse button

from selenium.webdriver.common.action_chains import ActionChains

act = ActionChains(driver)
act.move_to_element(ele)
act.click(ele)
act.perform()

===============================================================
Ch8

In Anaconda3 / Anaconda Prompt, run: conda install -c conda-forge scrapy

===============================================================
Ch9

Choosing a scraping tool
Scraping a few pages                      => BeautifulSoup
Scraping dynamic (JavaScript) pages       => Selenium
Scraping large amounts of data site-wide  => the Scrapy framework

9-1 Use urljoin together with .format() to build a series of URLs
For example:
from urllib.parse import urljoin

catalog = ["2317", "3018", "2308"]
for i in catalog:
    url = urljoin("https://tw.stock.yahoo.com/q/", "q?s={0}".format(i))
    print(url)

===============================================================
Ch10
SQL statements
SELECT * FROM table_name
WHERE column LIKE '%a%'    -- LIKE means "contains": %a% matches any string that contains "a"

import pymysql

db = pymysql.connect(host="localhost", user="root", password="", database="mybooks", charset="utf8")
cursor = db.cursor()                       # get a cursor
cursor.execute("SELECT * FROM books")      # run the query
data = cursor.fetchall()                   # fetch the result rows

sql = """INSERT INTO abc (......) VALUES (......)"""   # insert statement
try:
    cursor.execute(sql)    # execute the insert
    db.commit()            # commit the transaction
except:
    db.rollback()          # roll back on error
db.close()
