References:
1. https://www.url-encode-decode.com/
2. https://httpbin.org/
3. SelectorGadget: https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=zh-TW
4. XPath Helper: https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=zh-TW
Request
import requests
Parameters for requests.get:
r = requests.get(url, timeout=0.3)               # set a timeout
r = requests.get(url, cookies={"over18": "1"})   # send the "over 18" cookie so PTT lets the request through
nextPg = response.xpath(xpath)                   # (grab the next-page link from a response with XPath)
Typical values passed in:
url = "http://www.google.com"
timeout = 0.3
params = {'maxResult': 5}
headers = url_headers
Exception handling (a sketch follows the list):
RequestException    general request error
HTTPError           invalid HTTP response
ConnectionError     connection error
Timeout             request timed out
TooManyRedirects    too many redirects
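A minimal sketch combining requests.get with the exception classes above (the URL and timeout value are placeholders):
import requests
from requests.exceptions import RequestException, HTTPError, ConnectionError, Timeout, TooManyRedirects

try:
    r = requests.get("https://httpbin.org/get", timeout=0.3)   # placeholder URL from the reference list
    r.raise_for_status()                                        # raise HTTPError on 4xx/5xx responses
    print(r.status_code)
except Timeout:
    print("request timed out")
except TooManyRedirects:
    print("too many redirects")
except (HTTPError, ConnectionError, RequestException) as e:
    print("request failed:", e)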
Using bs4 (BeautifulSoup) together with requests
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.google.com")
soup = BeautifulSoup(r.text, "lxml")
print(soup)
To write the soup to a file in readable form, format it first by calling prettify()
Example:
fp = open("xxx.txt", "w", encoding="utf8")
fp.write(soup.prettify())
print("writing")
fp.close()
============================================================
find()
find() returns the first element that matches the given condition
find_all() searches the whole HTML page and returns every match
find_all() with extra filter conditions =>
soup.find_all("tag", class_="class_name", limit=limit_num)
(filters can be the tag name, id, and limit=; the class filter is spelled class_ because class is a Python keyword)
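A small sketch of find() vs find_all(), assuming a soup built from a placeholder HTML string:
from bs4 import BeautifulSoup

html_doc = '<div class="item"><a id="first" href="/a">A</a></div><div class="item"><a href="/b">B</a></div>'  # placeholder HTML
soup = BeautifulSoup(html_doc, "lxml")

first_a = soup.find("a")                                  # first match only
items = soup.find_all("div", class_="item", limit=2)      # every match, up to limit
by_id = soup.find("a", id="first")                        # filter by id
print(first_a["href"], len(items), by_id.text)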
============================================================
Ch4
select()
BeautifulSoup also provides select() and select_one()
select() returns a list of every element that matches the CSS selector
select_one() returns only the first match
The CSS selector support uses nth-of-type, so "nth-child"-style selectors have to be rewritten as nth-of-type
e.g. tag.select("div:nth-of-type(3)")
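A minimal sketch of select() vs select_one() on a placeholder soup:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "lxml")   # placeholder HTML
items = soup.select("ul li")                  # list of every matching element
first = soup.select_one("ul li")              # only the first match
third = soup.select("ul li:nth-of-type(3)")   # nth-of-type instead of nth-child
print(len(items), first.text, third[0].text)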
============================================================
Ch5
Traversing ancestors, siblings, children, and descendants
tag.contents and tag.children both return a tag's direct children
and for child in tag.descendants walks every descendant tag below it
for child in tag_ui.descendants    # loop over descendant tags
for parent in tag_ui.parents       # loop over ancestor tags
next_sibling          walk to the next sibling tag
previous_sibling      walk to the previous sibling tag
find_previous_sibling()   # find the previous sibling (this function automatically skips NavigableString objects)
find_next_sibling()       # find the next sibling (this function automatically skips NavigableString objects)
Walking to the previous/next element
next_element          # the next element
previous_element      # the previous element
Usage:
next_tag = tag.next_element
print(type(next_tag), next_tag.name)
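A short sketch of the traversal attributes above on a placeholder HTML snippet:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "lxml")   # placeholder HTML
tag_ul = soup.find("ul")

for child in tag_ul.children:        # direct children only
    print("child:", child.name)
for node in tag_ul.descendants:      # every descendant, tags and strings
    print("descendant:", repr(node))

first_li = tag_ul.find("li")
print(first_li.find_next_sibling().text)   # "b" (skips NavigableString nodes)
print(first_li.next_element)               # next node in document order: the string "a"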
===========================================================
Modifying the tree
Modifications only change the Python object tree; the original page source is untouched
tag.string = "str_value"              # modify the tag's text
del tag["attr_name"]                  # delete an attribute
new_tag = soup.new_tag("tag_name")    # create a new tag (new_tag() is a method of the BeautifulSoup object)
tag.append(new_tag)                   # append the new tag
insert_before()                       # insert before a given node
insert_after()                        # insert after a given node
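A hedged sketch of these modification calls on a placeholder soup (tag and attribute names are illustrative):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="msg" class="old">hello</p>', "lxml")   # placeholder HTML
tag = soup.p

tag.string = "new text"                 # modify the tag's text
del tag["class"]                        # delete the class attribute
new_tag = soup.new_tag("b")             # new_tag() is called on the BeautifulSoup object
new_tag.string = "bold"
tag.append(new_tag)                     # append the new tag inside <p>
tag.insert_before(soup.new_tag("hr"))   # insert a node before <p>
print(soup.body.prettify())             # only the in-memory tree changes, not the original page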
Ch5-4 CSV & JSON access
Writing CSV
import csv
with open(csv_file, 'w+', newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(["Data1", "Data2", "Data3"])   # write the Data1~Data3 header row (readable in Excel)
JSON
import json
j_str = json.dumps(json_data)       # dumps(): convert the json_data dict into a JSON string
json_data2 = json.loads(j_str)      # loads(): convert the JSON string back into a dict
j_file = "jfile.json"
with open(j_file, 'w') as fp:       # open j_file for writing
    json.dump(data, fp)             # dump(): write the data dict into fp
(read it back with json.load(fp) after opening the file with 'r')
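A self-contained sketch tying the CSV and JSON calls together (file names are placeholders):
import csv
import json

with open("books.csv", "w", newline="", encoding="utf8") as fp:   # placeholder file name
    writer = csv.writer(fp)
    writer.writerow(["Data1", "Data2", "Data3"])   # header row
    writer.writerow(["a", "b", "c"])               # one data row

data = {"title": "example", "price": 100}
j_str = json.dumps(data)                 # dict -> JSON string
assert json.loads(j_str) == data         # JSON string -> dict
with open("jfile.json", "w") as fp:
    json.dump(data, fp)                  # dict -> file
with open("jfile.json", "r") as fp:
    assert json.load(fp) == data         # file -> dict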
Ch5-5 Downloading images
response = requests.get(url, stream=True)
with open(path, 'wb') as fp:
    for chunk in response:
        fp.write(chunk)
Alternatively, open the download with urllib.request.urlopen(url)   # can be more efficient
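A minimal sketch of the urllib alternative, streaming the body to disk (the URL and file name are placeholders):
import shutil
import urllib.request

url = "https://www.flag.com.tw/assets/img/bookpic/F4473.jpg"   # placeholder: the image URL from the Ch6 example
with urllib.request.urlopen(url) as response, open("F4473.jpg", "wb") as fp:
    shutil.copyfileobj(response, fp)   # copy in chunks instead of loading everything into memory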
===========================================================
Ch6 XPath and lxml
from lxml import html
In DevTools, find the tag that holds the resource URL, right-click it (or click the three dots) and choose Copy > Copy XPath, then use the copied path like this:
tag_img = tree.xpath("/html/body/section[2]/table/tbody/tr[21]/td[1]/a/img")[0]
print(tag_img.attrib["src"])
=>https://www.flag.com.tw/assets/img/bookpic/F4473.jpg
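A runnable sketch of the workflow, assuming the page is fetched with requests first and that the copied XPath still matches the current layout (the listing URL is a placeholder):
import requests
from lxml import html

r = requests.get("https://www.flag.com.tw/books", timeout=3)   # placeholder listing page
tree = html.fromstring(r.text)                                 # build the tree the XPath runs against
imgs = tree.xpath("/html/body/section[2]/table/tbody/tr[21]/td[1]/a/img")
if imgs:
    print(imgs[0].attrib["src"])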
6-4 XPath basic syntax: axis::node-test[predicate]
* : wildcard, matches all element and attribute nodes
Axis: the starting point of the step (see the full axis definitions for details)
Node test: which nodes along that axis match
Predicate: a further filter on the matched nodes
Take the book's example /child::library/child::book[2]/child::author :
/child::library   the child of the root is library
/child::book[2]   the second book child node
/child::author    the author child node
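A tiny demo of the axis::node-test[predicate] form with lxml.etree, on a made-up XML snippet mirroring the library example:
from lxml import etree

xml = """<library>
  <book><author>First Author</author></book>
  <book><author>Second Author</author></book>
</library>"""                                        # made-up XML for illustration

root = etree.fromstring(xml)
authors = root.xpath("/child::library/child::book[2]/child::author")
print(authors[0].text)    # -> Second Author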
With XPath Helper open, hold Shift and move over the element you want to query, and the matching XPath is filled in for you
====================================================================
Ch7 Selenium
Install selenium: python -m pip install -U selenium
Also install ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads
from selenium import webdriver    # import webdriver
driver = webdriver.Chrome("./chromedriver")
driver.get(url)
tag = driver.find_element_by_tag_name("tag")
soup = BeautifulSoup(driver.page_source, "lxml")
find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()
Running a Google search
from selenium.webdriver.common.keys import Keys
keyword = driver.find_element_by_css_selector("#lst-ib")   # "#lst-ib" was the id of Google's search box at the time
keyword.send_keys("search term")
keyword.send_keys(Keys.ENTER)
Keys key-constant list:
https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/Keys.html
Selenium ActionChains build up a series of automated browser actions
click()            # click an element
click_and_hold()   # press and hold the left mouse button
context_click()    # right-click an element
double_click()     # double-click an element
move_to_element()  # move the mouse to the middle of an element
key_up()           # release a key
key_down()         # press a key
perform()          # perform all stored actions
send_keys()        # send keystrokes to the current element
release()          # release the mouse button
from selenium.webdriver.common.action_chains import ActionChains
act = ActionChains(driver)
act.move_to_element(ele)
act.click(ele)
act.perform()
===============================================================
Ch8
In the Anaconda3/Anaconda Prompt run: conda install -c conda-forge scrapy
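After installing, a spider looks roughly like the sketch below; the site and selectors come from the standard Scrapy tutorial, not from the book:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]    # Scrapy's demo site

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it with: scrapy runspider quotes_spider.py -o quotes.json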
===============================================================
Ch9
Choosing a scraping tool
Scraping a few pages                      => BeautifulSoup
Scraping dynamic (JavaScript) pages       => Selenium
Large amounts of data from a whole site   => the Scrapy framework
9-1 Use urljoin together with .format to generate a sequence of URLs
For example:
from urllib.parse import urljoin
catalog = ["2317", "3018", "2308"]
for i in catalog:
    url = urljoin("https://tw.stock.yahoo.com/", "q/q?s={0}".format(i))
    print(url)
===============================================================
Ch10
SQL statements
SELECT * FROM table_name
WHERE condition; LIKE means "contains", e.g. '%a%' matches any string containing a
import pymysql
db = pymysql.connect("localhost", "root", "", "mybooks", charset="utf8")
cursor = db.cursor()                       # get a cursor
cursor.execute("SELECT * FROM books")      # run a query
sql = """INSERT INTO abc (......) VALUES (......)"""   # insert statement
try:
    cursor.execute(sql)   # execute the insert
    db.commit()           # commit the transaction
except Exception:
    db.rollback()         # roll back on failure
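A hedged sketch of reading rows back, including a LIKE filter (the table and column names are placeholders):
import pymysql

db = pymysql.connect(host="localhost", user="root", password="",
                     database="mybooks", charset="utf8")
try:
    with db.cursor() as cursor:
        # '%a%' matches any title containing the letter a
        cursor.execute("SELECT * FROM books WHERE title LIKE %s", ("%a%",))
        for row in cursor.fetchall():
            print(row)
finally:
    db.close()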