Language: Python
Package: beautifulsoup4
Official documentation
Purpose: parsing HTML
Quick example: fetch a page with requests and collect its links.

import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.google.com.tw/")
c = result.content
soup = BeautifulSoup(c, "html.parser")
links = soup.find_all("a")

data = {}
for a in links:
    title = a.text.strip()
    data[title] = a.attrs['href']
BeautifulSoup
- BeautifulSoup(markup="", features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, **kwargs)
- markup
- The HTML to parse
- features
- The parser to use
- parse_only
- Parse only the elements matched by the given SoupStrainer (see the sketch under Example below)
- from_encoding
- Specify the encoding; if omitted, it is detected automatically
- BeautifulSoup(markup, from_encoding="iso-8859-8")
- exclude_encodings
- Encodings to rule out, given as a list
- BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
- Attributes
- .contains_replacement_characters
- True if special characters were replaced while decoding the document
- SoupStrainer(name=None, attrs={}, text=None, **kwargs)
- For the parameters, see Filters under the search methods
Example
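A minimal sketch of parse_only with a SoupStrainer, assuming you only care about <a> tags; the markup here is made up for illustration:

from bs4 import BeautifulSoup, SoupStrainer

only_a_tags = SoupStrainer("a")
markup = '<p>Links: <a href="/one">one</a> and <a href="/two">two</a></p>'
soup = BeautifulSoup(markup, "html.parser", parse_only=only_a_tags)
print(soup.prettify())
# <a href="/one">
#  one
# </a>
# <a href="/two">
#  two
# </a>

Everything outside the matched <a> tags is never added to the tree, which keeps memory usage down on large documents.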
Parsers
Different parsers can produce different results for the same markup (see the sketch after the table).

| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built in, no extra dependency; decent speed | Not as fast as lxml, not as lenient as html5lib |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | The only XML parser supported; very fast | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a browser does; creates valid HTML5 | Very slow; external Python dependency |
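A hedged illustration of why the parser choice matters: the same broken markup is repaired differently by each parser (lxml and html5lib must be installed separately, and the exact output can vary with library versions):

from bs4 import BeautifulSoup

broken = "<a></p>"
BeautifulSoup(broken, "html.parser")
# <a></a>
BeautifulSoup(broken, "lxml")
# <html><body><a></a></body></html>
BeautifulSoup(broken, "html5lib")
# <html><head></head><body><a><p></p></a></body></html>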
Object types
Tag
Corresponds to a tag in the HTML markup.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
- name
- The tag's name
- Usage
- Get: tag.name
- Change: tag.name = "abc"
- Structural equality: tagA == tagB
- Same object: tagA is tagB
- Copying
import copy
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_copy = copy.copy(soup.a)
print(a_copy)
# <a href="http://example.com/">I linked to <i>example.com</i></a>
soup.a == a_copy
# True
soup.a is a_copy
# False
- Attributes
- The tag's attributes, such as class, id, ...
- Usage
- Get
- tag.attrs
- tag['class'], tag['id'], ...
- tag.get('class'), tag.get('id'), ...
- Multi-valued attributes such as class come back as a list (see the sketch after this list)
- Modify
- tag['class'] = 'verybold'
- Delete
- del tag['class']
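A short sketch of the multi-valued attribute behaviour mentioned above: class comes back as a list, while an attribute such as id stays a single string.

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.p['class']
# ['body', 'strikeout']

id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
id_soup.p['id']
# 'my id'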
NavigableString
The string wrapped by a tag; if it is ambiguous (the tag has more than one child), tag.string returns None.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
soup.b.string
# 'Extremely bold'
type(soup.b.string)
# <class 'bs4.element.NavigableString'>
soup = BeautifulSoup('<b class="boldest">Extremely bold<i>abc</i></b>', "html.parser")
soup.b.string
# None
- Usage
- Get: tag.string
- Modify: tag.string.replace_with("abc")
- To use the string outside Beautiful Soup, convert it with str() (or unicode() on Python 2) first, so you do not keep a reference to the whole parse tree and waste memory (see the sketch below)
- Not supported
- .contents
- .string
- find()
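A small sketch of the str() conversion mentioned above; the converted value is a plain Python str that no longer holds a reference to the parse tree:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
navigable = soup.b.string
type(navigable)
# <class 'bs4.element.NavigableString'>
plain = str(navigable)  # plain copy, safe to keep after discarding the soup
type(plain)
# <class 'str'>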
BeautifulSoup
Represents the whole document.
- Mostly behaves like a Tag, but has no attributes
- soup.name  # '[document]'
Comments and other special strings
These are just subclasses of NavigableString.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
- Types
- Comment
- XML-related
- CData
- ProcessingInstruction
- Declaration
- Doctype
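Continuing the comment example above, a hedged sketch of building one of these special string types by hand (CData here) and swapping it into the tree:

from bs4 import CData

cdata = CData("A CDATA block")
comment.replace_with(cdata)  # `comment` is the Comment object from the example above
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>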
Usage examples
The HTML document used in the examples below:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>"""

Parse it with:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Navigating the tree
- .tagName
- Returns the first tag with that name found below the current element
- Examples
- soup.head
- <head><title>The Dormouse's story</title></head>
- soup.body.b
- <b>The Dormouse's story</b>
- soup.a
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- body = soup.body
  body.b
- <b>The Dormouse's story</b>
- .contents
- Returns all direct children of the current element as a list
- Examples
- soup.body.contents
- [
- '\n',
- <p class="title"><b>The Dormouse's story</b></p>,
- '\n',
- <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
- '\n',
- <p class="story">...</p>,
- '\n'
- ]
- soup.contents
- [
- '\n',
- <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
- ]
- .children
- An iterator over the current element's direct children
- Example
- for child in soup.body.children:
      print(child)
- .descendants
- A generator over all of the current element's descendants: children, their children, and so on (a small counting sketch appears at the end of this section)
- Example
- for child in soup.body.descendants:
      print(child)
- .string
- If the current element has exactly one NavigableString-type child, returns that string;
  if there is more than one child, returns None
- Examples
- soup.body.string
- None
- soup.body.p.string
- "The Dormouse's story"
- .strings
- A generator over all NavigableString descendants of the current element
- Example
- for string in soup.body.strings:
      print(repr(string))
- .stripped_strings
- Same as .strings, but skips whitespace-only strings and strips leading/trailing whitespace (including \n) from the rest
- for string in soup.body.stripped_strings:
      print(repr(string))
- .parent
- The current element's parent node
- Examples
- soup.title.parent
- <head><title>The Dormouse's story</title></head>
- soup.parent
- None
- .parents
- A generator over all of the current element's ancestors
- Example
- for parent in soup.title.parents:
      print(parent.name)
- .next_sibling
- The sibling node immediately after the current element
- Examples
- soup.head.next_sibling
- '\n'
- soup.head.next_sibling.next_sibling.name
- 'body'
- soup.title.next_sibling
- None
- .next_siblings
- A generator over all siblings after the current element
- Example
- for sibling in soup.a.next_siblings:
      print(repr(sibling))
- .previous_sibling
- The sibling node immediately before the current element
- Examples
- soup.body.previous_sibling
- '\n'
- soup.body.previous_sibling.previous_sibling
- <head><title>The Dormouse's story</title></head>
- soup.head.previous_sibling
- None
- .previous_siblings
- A generator over all siblings before the current element
- Example
- for sibling in soup.find(id="link3").previous_siblings:
      print(repr(sibling))
- .next_element
- The element parsed immediately after the current element
- Examples
- soup.head.next_element
- <title>The Dormouse's story</title>
- soup.a.next_element
- 'Elsie'
- The parser enters the <a> tag first, then the string 'Elsie', then the closing </a> tag
- .next_elements
- A generator over everything parsed after the current element
- Example
- for element in soup.a.next_elements:
      print(repr(element))
- .previous_element
- The element parsed immediately before the current element
- Examples
- soup.body.previous_element
- '\n'
- soup.p.string.previous_element
- <b>The Dormouse's story</b>
- .previous_elements
- A generator over everything parsed before the current element
- Example
- for element in soup.p.string.previous_elements:
      print(repr(element))
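To make the .children / .descendants / .stripped_strings distinction concrete, a small counting sketch over the same soup (the exact numbers depend on the whitespace in html_doc, so treat them as indicative):

body = soup.body
len(list(body.children))     # only the direct children: the three <p> tags plus newline strings
len(list(body.descendants))  # much larger: every nested tag and string below <body>
list(body.stripped_strings)[:3]
# ["The Dormouse's story", 'Once upon a time there were three little sisters; and their names were', 'Elsie']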
Searching the tree
- Filters (the kinds of values that can be passed to search arguments such as name, string, and **kwargs)
- String (the string parameter is the same as text)
- Example
- 'b'
- soup.find_all('b')
- [<b>The Dormouse's story</b>]
- Regular Expression
- Example
- import re
  for tag in soup.find_all(re.compile("^b")):
      print(tag.name)
- body
- b
- List
- Example
- soup.find_all(["a", "b"])
- [
- <b>The Dormouse's story</b>,
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- True
- Example
- soup.find_all(id=True)
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- Function
- Must return True or False; it can receive an attribute value (as here) or a whole tag (see the sketch after this block)
- Example
- def not_lacie(href):
      return href and not re.compile("lacie").search(href)
  soup.find_all(href=not_lacie)
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
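The function filter can also receive the whole tag rather than an attribute value; a sketch in the same spirit as the example above, run against the Dormouse soup:

def has_class_but_no_id(tag):
    # keep tags that define class but not id
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="story">...</p>]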
- find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)
- Searches all of the current element's descendants
- Shorthand
- soup.find_all(...) == soup(...)
- name
- Matches tags whose name is name
- Plain strings (text nodes) are ignored automatically
- Example
- soup.find_all('title')
- [<title>The Dormouse's story</title>]
- attrs
- Matches tags with the given attribute values
- Example
- soup.find_all(attrs={"class": "sister"})
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- recursive
- By default, all descendants of the current element are searched;
  set recursive=False to search only the direct children
- Example
- soup.html.find_all("title", recursive=False)
- []
- string
- Searches string content;
  when not combined with other arguments, only NavigableString objects are returned
- Example
- soup.find_all(string="Elsie")
- ['Elsie']
- soup.find_all("a", string="Elsie")
- [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
- limit
- Limits the number of results returned
- Example
- soup.find_all("a", limit=2)
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- ]
- **kwargs
- Any keyword whose name is not a built-in parameter name is treated as a tag attribute, and its value is used to match that attribute
- Example
- soup.find_all(href=re.compile("elsie"), id='link1')
- [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
- soup.find_all("a", class_="sister")
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- Some attributes cannot be used as keyword arguments, e.g. HTML5 data-* attributes; pass them through attrs instead (see the sketch after this block)
- class is a Python reserved word, so write class_ with a trailing underscore
- Because class is a multi-valued attribute, searching with the CSS class names in a different order than in the document finds nothing
- Example
- css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
- css_soup.find_all("p", class_="body strikeout")
- [<p class="body strikeout"></p>]
- css_soup.find_all("p", class_="strikeout body")
- []
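A sketch of the data-* case mentioned above: a data-* keyword argument is a Python syntax error, so attrs is the way in.

from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
# data_soup.find_all(data-foo="value")  # SyntaxError: keyword can't be an expression
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]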
- find(name=None, attrs={}, recursive=True, text=None, **kwargs)
- Like find_all, but returns only the first match, or None if nothing is found
- Shorthand
- soup.find(tag) == soup.tag
- find_parents(name=None, attrs={}, limit=None, **kwargs)
- Searches all of the current element's ancestors
- Example
- soup.b.find_parents('p')
- [<p class="title"><b>The Dormouse's story</b></p>]
- find_parent(name=None, attrs={}, **kwargs)
- Returns the first matching ancestor of the current element
- Example
- soup.b.find_parent('p')
- <p class="title"><b>The Dormouse's story</b></p>
- find_next_siblings(name=None, attrs={}, text=None, limit=None, **kwargs)
- Searches all siblings after the current element
- Example
- soup.a.find_next_siblings()
- [
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- find_next_sibling(name=None, attrs={}, text=None, **kwargs)
- Returns the first matching sibling after the current element
- Example
- soup.a.find_next_sibling()
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- find_previous_siblings(name=None, attrs={}, text=None, limit=None, **kwargs)
- Searches all siblings before the current element
- Example
- soup.find(id='link3').find_previous_siblings()
- [
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- ]
- find_previous_sibling(name=None, attrs={}, text=None, **kwargs)
- Returns the first matching sibling before the current element
- Example
- soup.find(id='link3').find_previous_sibling()
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs)
- Searches everything parsed after the current element
- Example
- soup.a.find_all_next('p')
- [<p class="story">...</p>]
- find_next(name=None, attrs={}, text=None, **kwargs)
- Returns the first match among everything parsed after the current element
- Example
- soup.a.find_next('p')
- <p class="story">...</p>
- find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs)
- Searches everything parsed before the current element
- Example
- soup.find(class_="story").find_all_previous('b')
- [<b>The Dormouse's story</b>]
- find_previous(name=None, attrs={}, text=None, **kwargs)
- Returns the first match among everything parsed before the current element
- Example
- soup.find(class_="story").find_previous('b')
- <b>The Dormouse's story</b>
- select(selector, limit=None)
- Finds all matching elements using CSS selector syntax (a few extra selector one-liners appear after this section)
- Example
- soup.select('a.sister')
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- select_one(selector)
- Finds the first matching element using CSS selector syntax
- Example
- soup.select_one('a.sister')
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- get_text(separator="", strip=False, types=(NavigableString, CData))
- Returns the text contained in the element (the separator and strip parameters are sketched after this section)
- Example
- soup.get_text()
- "\nThe Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
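A few more hedged one-liners for select() and get_text(), run against the Dormouse soup; the selectors use standard CSS syntax, and support for the fancier ones depends on the Beautiful Soup / soupsieve version:

soup.select("p > a#link2")  # direct <a> child of a <p>, with a specific id
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select('a[href^="http://example.com/"]')  # attribute prefix selector
# all three sister links
soup.get_text("|", strip=True)  # custom separator, whitespace stripped
# "The Dormouse's story|The Dormouse's story|Once upon a time there were three little sisters; and their names were|Elsie|,|Lacie|and|..."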
Modifying the tree
- Change an element's name and attributes
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
- Replace a tag's contents with .string
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>
- Append content inside an element: append(tag)
soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")
soup
# <a>FooBar</a>
soup.a.contents
# ['Foo', 'Bar']
- Create a new element: new_tag(name, namespace=None, nsprefix=None, **attrs)
soup = BeautifulSoup("<b></b>", "html.parser")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
- Insert at a given position: insert(position, new_child)
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]
- Insert before the current element: insert_before(predecessor)
soup = BeautifulSoup("<b>stop</b>", "html.parser")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>
- Insert after the current element: insert_after(successor)
soup = BeautifulSoup("<b><i>Don't</i>stop</b>", "html.parser")
soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, ' ever ', 'stop']
- Remove an element's contents: clear(decompose=False)
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
tag.clear()
tag
# <a href="http://example.com/"></a>
- Detach the current element from the tree and return it: extract()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
i_tag = soup.i.extract()
a_tag
# <a href="http://example.com/">I linked to</a>
i_tag
# <i>example.com</i>
print(i_tag.parent)
# None
- Destroy the current element without returning it: decompose()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>
- Replace the current element and return the element that was replaced: replace_with(replace_with)
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
old_tag = a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
old_tag
# <i>example.com</i>
- Wrap the current element and return the wrapper: wrap(wrap_inside)
soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
- Strip a tag from around its contents and return the stripped tag: unwrap()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
a_tag.i.unwrap()
# <i></i>
a_tag
# <a href="http://example.com/">I linked to example.com</a>
Output methods
- prettify(encoding=None, formatter="minimal")
- Pretty-printed output (see the formatter sketch at the end of this section)
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
soup.prettify()
# '<a href="http://example.com/">\n I linked to\n <i>\n  example.com\n </i>\n</a>'
print(soup.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
- str(soup): when you just want the resulting string and do not care about formatting
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
str(soup)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'
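A hedged sketch of the formatter argument: "minimal" (the default) escapes only what is needed to produce valid HTML, while "html" also converts Unicode characters to named entities where it can.

from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "html.parser")
print(soup.prettify(formatter="minimal"))
# <p>
#  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
# </p>
print(soup.prettify(formatter="html"))
# <p>
#  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
# </p>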