[Personal notes] Python web scraping: fetching and parsing HTML page data

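A quick record of how I have been doing this lately.
The thread moves too fast for me to keep up, and when I follow it on my phone too many replies get swallowed.
So I decided to scrape everything in one go on my computer, paste it into a Word file, and import that to my phone to read at my own pace.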


Approach (basic)

Save the whole web page as a file (html) and scrape that.

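python
from bs4 import BeautifulSoup

# Load the saved page; naming the parser avoids bs4's "no parser specified" warning.
with open("C:/Users/ouoholly/Desktop/data.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Collect every <div> whose class is "user" or "content", in page order.
stories = soup.find_all('div', {"class": ["user", "content"]})

for s in stories:
    print(s.text + "\n")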
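
To see what that find_all call returns, here is a small sketch that runs on a made-up snippet instead of the saved file. The sample markup and the names in it (alice, bob) are assumptions for illustration only; the real page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the saved page.
SAMPLE_HTML = """
<div class="user">alice</div>
<div class="content">first post</div>
<div class="meta">ignored</div>
<div class="user">bob</div>
<div class="content">second post</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list of class names matches a <div> that carries any one of them.
divs = soup.find_all("div", class_=["user", "content"])
texts = [d.get_text(strip=True) for d in divs]

print(texts)  # ['alice', 'first post', 'bob', 'second post']
```

The matches come back in page order, so each user name is followed directly by its post, and the "meta" div is skipped.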

Advanced 1 (add bold styling)

python
from bs4 import BeautifulSoup

with open("C:/Users/ouoholly/Desktop/data.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

u = soup.find_all('div', class_="user")
c = soup.find_all('div', class_="content")

# Pair each user name with its post by position and print the name in bold.
for (uu, cc) in zip(u, c):
    print("\033[1m" + uu.text + "\033[0m" + "\n" + cc.text + "\n")

Explanation:
\033[1m starts bold
\033[0m ends bold (it resets all styling)


Advanced 2 (add color styling)

python
from bs4 import BeautifulSoup

with open("C:/Users/ouoholly/Desktop/data.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# ANSI escape codes used to style the terminal output.
class color:
    PURPLE = '\033[95m'
    BOLD = '\033[1m'
    END = '\033[0m'

u = soup.find_all('div', class_="user")
c = soup.find_all('div', class_="content")

# Print each user name in bold purple, followed by the post in the default style.
for (uu, cc) in zip(u, c):
    print(color.PURPLE + color.BOLD + uu.text + color.END + "\n" + cc.text + "\n")

Format

\033[(NUMBER)(NUMBER);(NUMBER)(NUMBER)m
First digit: 3 = set the text color; 4 = set the background color
Second digit: the color code

Color codes

0 black
1 red
2 green
3 yellow
4 blue
5 magenta
6 cyan
7 white
9 default

Examples

\033[43m yellow background
\033[41m red background
\033[43;31m yellow background, red text
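
The format and color table above can be wrapped in a tiny helper so the codes don't have to be typed by hand. This is only a sketch: the names COLOR_CODES, sgr and style are my own for illustration, not part of any library. The order of the numbers inside the sequence does not matter, so 31;43 gives the same result as 43;31.

```python
# Second digit of the code, taken from the color table above.
COLOR_CODES = {
    "black": 0, "red": 1, "green": 2, "yellow": 3, "blue": 4,
    "magenta": 5, "cyan": 6, "white": 7, "default": 9,
}

RESET = "\033[0m"


def sgr(fg=None, bg=None, bold=False):
    """Build an escape sequence such as '\\033[31;43m' (hypothetical helper)."""
    parts = []
    if bold:
        parts.append("1")
    if fg is not None:
        parts.append(f"3{COLOR_CODES[fg]}")  # 3x = text color
    if bg is not None:
        parts.append(f"4{COLOR_CODES[bg]}")  # 4x = background color
    return "\033[" + ";".join(parts) + "m"


def style(text, **kwargs):
    """Wrap text in the requested style and reset afterwards."""
    return sgr(**kwargs) + text + RESET


print(style("yellow background, red text", fg="red", bg="yellow"))
print(style("bold only", bold=True))
```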


Advanced 3 (highlight specific text)

python
from bs4 import BeautifulSoup
from termcolor import colored

with open("C:/Users/ouoholly/Desktop/data.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

class color:
    PURPLE = '\033[95m'
    BOLD = '\033[1m'
    END = '\033[0m'

# Highlight ಠ_ಠ and pizza1234 whenever they appear (names are compared in lowercase).
searchusers = ['ಠ_ಠ', 'pizza1234']

u = soup.find_all('div', class_="user")
c = soup.find_all('div', class_="content")

for (uu, cc) in zip(u, c):
    colortext = []

    # Check each word of the user line; the printed text keeps its original case.
    for t in uu.text.split():
        if t.lower() in searchusers:
            colortext.append(colored(t, 'magenta', 'on_red'))
        else:
            colortext.append(t)

    print(color.PURPLE + color.BOLD + " ".join(colortext) + color.END + "\n" + cc.text + "\n")
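
One thing to watch: as far as I can tell, termcolor's colored() ends its output with the reset code \033[0m, so any words after a highlighted name lose the purple bold style. Below is a sketch of one way around that using plain escape codes: re-apply the base style after every highlighted word. The function name highlight_tokens and the constants are made up for this example.

```python
BASE = "\033[95m\033[1m"    # purple + bold, as in the script above
HIGHLIGHT = "\033[35;41m"   # magenta text on a red background
RESET = "\033[0m"


def highlight_tokens(text, targets, base=BASE):
    """Highlight whole words found in targets, then restore the base style."""
    wanted = {t.lower() for t in targets}
    out = []
    for token in text.split():
        if token.lower() in wanted:
            # Reset after the highlight, then switch the base style back on.
            out.append(HIGHLIGHT + token + RESET + base)
        else:
            out.append(token)
    return base + " ".join(out) + RESET


print(highlight_tokens("Pizza1234 replied to alice", ["ಠ_ಠ", "pizza1234"]))
```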

termcolor color options

https://pypi.org/project/termcolor/

Text colors    Text highlights    Attributes
grey           on_grey            bold
red            on_red             dark
green          on_green           underline
yellow         on_yellow          blink
blue           on_blue            reverse
magenta        on_magenta         concealed
cyan           on_cyan
white          on_white

Advanced 4 (show only posts by specific users)

python
from bs4 import BeautifulSoup
from termcolor import colored
import re

with open("C:/Users/ouoholly/Desktop/data.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

u = soup.find_all('div', class_="user")
c = soup.find_all('div', class_="content")

# Show only posts by ಠ_ಠ and pizza1234.
pattern = re.compile('ಠ_ಠ|pizza1234')

for (uu, cc) in zip(u, c):
    # Skip the post unless the user line contains one of the target names.
    if pattern.search(uu.text):
        print(colored(uu.text, 'magenta', attrs=['bold']))
        print(cc.text + '\n')
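
If the list of names gets longer, or a name contains characters that mean something in a regular expression (such as . or +), the pattern can be built from a list with re.escape. A minimal sketch; build_user_pattern and the name "a.b+c" are made up for illustration.

```python
import re


def build_user_pattern(usernames):
    """Compile a pattern that matches any of the given names literally."""
    if not usernames:
        raise ValueError("need at least one user name")
    # re.escape keeps characters like . and + from acting as regex operators.
    return re.compile("|".join(re.escape(name) for name in usernames))


pattern = build_user_pattern(["ಠ_ಠ", "pizza1234", "a.b+c"])

print(bool(pattern.search("reply from pizza1234")))  # True
print(bool(pattern.search("aXbbc")))                 # False
```

Without re.escape, "a.b+c" would be read as a regular expression and would also match unrelated text such as "aXbbc".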

Advanced 5 (show all posts that mention specific users)

python
from bs4 import BeautifulSoup
from termcolor import colored
import re

with open("C:/Users/ouoholly/Desktop/data.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

u = soup.find_all('div', class_="user")
c = soup.find_all('div', class_="content")

# Show posts by ಠ_ಠ and pizza1234, plus posts by others that mention them.
pattern = re.compile('ಠ_ಠ|pizza1234')

for (uu, cc) in zip(u, c):
    # Search the user line and the post body together.
    text = uu.text + cc.text
    if pattern.search(text):
        print(colored(uu.text, 'magenta', attrs=['bold']))
        print(cc.text + '\n')
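
Versions 4 and 5 differ only in which text gets searched, so they can share one function with a switch. This sketch works on plain (user, content) tuples so it runs without the saved page; filter_posts, the flag name and the sample posts are assumptions for illustration.

```python
def filter_posts(posts, names, include_mentions=False):
    """Keep posts written by any of names, or also mentioning them if asked."""
    kept = []
    for user, content in posts:
        # Version 4 searches the user line only; version 5 adds the post body.
        searched = user + " " + content if include_mentions else user
        if any(name in searched for name in names):
            kept.append((user, content))
    return kept


posts = [
    ("ಠ_ಠ", "hello"),
    ("alice", "replying to pizza1234"),
    ("bob", "unrelated"),
]
targets = ["ಠ_ಠ", "pizza1234"]

print(filter_posts(posts, targets))
print(filter_posts(posts, targets, include_mentions=True))
```

With the real page, the tuples could come from zip(u, c) by taking uu.text and cc.text.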

Copying the output

Plain copy and paste is the simplest way:

  1. Select all: select the first few words, then press Ctrl+Shift+End
  2. Copy: Ctrl+C
  3. Paste: Ctrl+V into MS Word (formatting such as colors and bold is kept)
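
If copy and paste loses the formatting, another option is to skip the terminal and build a small HTML page, which Word should be able to open directly. This is only a sketch of that idea, not what I actually use; posts_to_html and the sample posts are made up for illustration.

```python
import html


def posts_to_html(posts, name_color="purple"):
    """Render (user, content) pairs as HTML with bold, colored user names."""
    rows = []
    for user, content in posts:
        # html.escape keeps characters like < and & from breaking the markup.
        rows.append(
            f'<p><b style="color:{name_color}">{html.escape(user)}</b><br>'
            f"{html.escape(content)}</p>"
        )
    head = '<html><head><meta charset="utf-8"></head><body>'
    return head + "\n" + "\n".join(rows) + "\n</body></html>"


page = posts_to_html([("ಠ_ಠ", "hello"), ("alice", "1 < 2 & so on")])
print(page)
```

Writing page to a file such as posts.html (a hypothetical name) with encoding="utf-8" gives something that can be opened in Word or a browser.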

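Recommended tutorials

I especially recommend this one. It explains everything very clearly and gives plenty of examples, each with its input and output spelled out, so it is easy for beginners to follow along.

"Python 使用 Beautiful Soup 抓取與解析網頁資料,開發網路爬蟲教學" (Scraping and parsing web pages with Beautiful Soup in Python: a web crawler tutorial) - G. T. Wang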





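Author: ouoholly
Post link: https://ouoholly.github.io/post/python-web-crawl-shhh/
Copyright notice: Unless stated otherwise, all posts on this blog are licensed under CC BY-NC-SA 4.0. Partial quotation and introductions are welcome (please leave a comment to ask first if you want to repost a full article). When reposting or quoting, please credit the source, ouoholly 的倉庫. Thank you!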
