Language: Python
Package: beautifulsoup4
Official documentation
Purpose: parsing HTML
Quick example: fetch a page with requests and collect its links.

import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.google.com.tw/")
c = result.content
soup = BeautifulSoup(c, "html.parser")
links = soup.find_all("a")

data = {}
for a in links:
    title = a.text.strip()
    data[title] = a.attrs['href']
BeautifulSoup
- BeautifulSoup(markup="", features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, **kwargs)
- markup
- The HTML to parse
- features
- The parser to use
- parse_only
- Parse only the elements matched by the given SoupStrainer (see the sketch under Example below)
- from_encoding
- Specify the encoding; if omitted, it is detected automatically
- BeautifulSoup(markup, from_encoding="iso-8859-8")
- exclude_encodings
- Encodings to rule out, given as a list
- BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
- Attributes
- .contains_replacement_characters
- True if special characters were replaced while decoding the document
- SoupStrainer(name=None, attrs={}, text=None, **kwargs)
- For the parameters, see Filters under the search methods
Example
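A minimal sketch of parse_only with a SoupStrainer, assuming you only care about <a> tags; the markup here is made up for illustration:

from bs4 import BeautifulSoup, SoupStrainer

only_a_tags = SoupStrainer("a")
markup = '<p>Links: <a href="/one">one</a> and <a href="/two">two</a></p>'
soup = BeautifulSoup(markup, "html.parser", parse_only=only_a_tags)
print(soup.prettify())
# <a href="/one">
#  one
# </a>
# <a href="/two">
#  two
# </a>

Everything outside the matched <a> tags is never added to the tree, which keeps memory usage down on large documents.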
Parsers
Different parsers can produce different results for the same markup (see the sketch after the table).

| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built in, no extra dependency; decent speed | Not as fast as lxml, not as lenient as html5lib |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | The only XML parser supported; very fast | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a browser does; creates valid HTML5 | Very slow; external Python dependency |
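A hedged illustration of why the parser choice matters: the same broken markup is repaired differently by each parser (lxml and html5lib must be installed separately, and the exact output can vary with library versions):

from bs4 import BeautifulSoup

broken = "<a></p>"
BeautifulSoup(broken, "html.parser")
# <a></a>
BeautifulSoup(broken, "lxml")
# <html><body><a></a></body></html>
BeautifulSoup(broken, "html5lib")
# <html><head></head><body><a><p></p></a></body></html>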
Object types
Tag
Corresponds to a tag in the HTML markup.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
- name
- The tag's name
- Usage
- Get: tag.name
- Change: tag.name = "abc"
- Structural equality: tagA == tagB
- Same object: tagA is tagB
- Copying
import copy
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_copy = copy.copy(soup.a)
print(a_copy)
# <a href="http://example.com/">I linked to <i>example.com</i></a>
soup.a == a_copy
# True
soup.a is a_copy
# False
- Attributes
- The tag's attributes, such as class, id, ...
- Usage
- Get
- tag.attrs
- tag['class'], tag['id'], ...
- tag.get('class'), tag.get('id'), ...
- Multi-valued attributes such as class come back as a list (see the sketch after this list)
- Modify
- tag['class'] = 'verybold'
- Delete
- del tag['class']
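A short sketch of the multi-valued attribute behaviour mentioned above: class comes back as a list, while an attribute such as id stays a single string.

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.p['class']
# ['body', 'strikeout']

id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
id_soup.p['id']
# 'my id'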
NavigableString
The string wrapped by a tag; if it is ambiguous (the tag has more than one child), tag.string returns None.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
soup.b.string
# 'Extremely bold'
type(soup.b.string)
# <class 'bs4.element.NavigableString'>
soup = BeautifulSoup('<b class="boldest">Extremely bold<i>abc</i></b>', "html.parser")
soup.b.string
# None
- Usage
- Get: tag.string
- Modify: tag.string.replace_with("abc")
- To use the string outside Beautiful Soup, convert it with str() (or unicode() on Python 2) first, so you do not keep a reference to the whole parse tree and waste memory (see the sketch below)
- Not supported
- .contents
- .string
- find()
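A small sketch of the str() conversion mentioned above; the converted value is a plain Python str that no longer holds a reference to the parse tree:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
navigable = soup.b.string
type(navigable)
# <class 'bs4.element.NavigableString'>
plain = str(navigable)  # plain copy, safe to keep after discarding the soup
type(plain)
# <class 'str'>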
BeautifulSoup
Represents the whole document.
- Mostly behaves like a Tag, but has no attributes
- soup.name  # '[document]'
Comments and other special strings
These are just subclasses of NavigableString.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
- Types
- Comment
- XML-related
- CData
- ProcessingInstruction
- Declaration
- Doctype
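Continuing the comment example above, a hedged sketch of building one of these special string types by hand (CData here) and swapping it into the tree:

from bs4 import CData

cdata = CData("A CDATA block")
comment.replace_with(cdata)  # `comment` is the Comment object from the example above
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>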
Usage examples
The HTML document used in the examples below:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>"""

Parse it with:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Navigating the tree
- .tagName
- Returns the first tag with that name found below the current element
- Examples
- soup.head
- <head><title>The Dormouse's story</title></head>
- soup.body.b
- <b>The Dormouse's story</b>
- soup.a
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- body = soup.body
  body.b
- <b>The Dormouse's story</b>
- .contents
- Returns all direct children of the current element as a list
- Examples
- soup.body.contents
- [
- '\n',
- <p class="title"><b>The Dormouse's story</b></p>,
- '\n',
- <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
- '\n',
- <p class="story">...</p>,
- '\n'
- ]
- soup.contents
- [
- '\n',
- <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
- ]
- .children
- An iterator over the current element's direct children
- Example
- for child in soup.body.children:
      print(child)
- .descendants
- A generator over all of the current element's descendants: children, their children, and so on (a small counting sketch appears at the end of this section)
- Example
- for child in soup.body.descendants:
      print(child)
- .string
- If the current element has exactly one NavigableString-type child, returns that string;
  if there is more than one child, returns None
- Examples
- soup.body.string
- None
- soup.body.p.string
- "The Dormouse's story"
- .strings
- A generator over all NavigableString descendants of the current element
- Example
- for string in soup.body.strings:
      print(repr(string))
- .stripped_strings
- Same as .strings, but skips whitespace-only strings and strips leading/trailing whitespace (including \n) from the rest
- for string in soup.body.stripped_strings:
      print(repr(string))
- .parent
- The current element's parent node
- Examples
- soup.title.parent
- <head><title>The Dormouse's story</title></head>
- soup.parent
- None
- .parents
- A generator over all of the current element's ancestors
- Example
- for parent in soup.title.parents:
      print(parent.name)
- .next_sibling
- The sibling node immediately after the current element
- Examples
- soup.head.next_sibling
- '\n'
- soup.head.next_sibling.next_sibling.name
- 'body'
- soup.title.next_sibling
- None
- .next_siblings
- A generator over all siblings after the current element
- Example
- for sibling in soup.a.next_siblings:
      print(repr(sibling))
- .previous_sibling
- The sibling node immediately before the current element
- Examples
- soup.body.previous_sibling
- '\n'
- soup.body.previous_sibling.previous_sibling
- <head><title>The Dormouse's story</title></head>
- soup.head.previous_sibling
- None
- .previous_siblings
- A generator over all siblings before the current element
- Example
- for sibling in soup.find(id="link3").previous_siblings:
      print(repr(sibling))
- .next_element
- The element parsed immediately after the current element
- Examples
- soup.head.next_element
- <title>The Dormouse's story</title>
- soup.a.next_element
- 'Elsie'
- The parser enters the <a> tag first, then the string 'Elsie', then the closing </a> tag
- .next_elements
- A generator over everything parsed after the current element
- Example
- for element in soup.a.next_elements:
      print(repr(element))
- .previous_element
- The element parsed immediately before the current element
- Examples
- soup.body.previous_element
- '\n'
- soup.p.string.previous_element
- <b>The Dormouse's story</b>
- .previous_elements
- A generator over everything parsed before the current element
- Example
- for element in soup.p.string.previous_elements:
      print(repr(element))
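To make the .children / .descendants / .stripped_strings distinction concrete, a small counting sketch over the same soup (the exact numbers depend on the whitespace in html_doc, so treat them as indicative):

body = soup.body
len(list(body.children))     # only the direct children: the three <p> tags plus newline strings
len(list(body.descendants))  # much larger: every nested tag and string below <body>
list(body.stripped_strings)[:3]
# ["The Dormouse's story", 'Once upon a time there were three little sisters; and their names were', 'Elsie']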
Searching the tree
- Filters (the kinds of values that can be passed to search arguments such as name, string, and **kwargs)
- String (the string parameter is the same as text)
- Example
- 'b'
- soup.find_all('b')
- [<b>The Dormouse's story</b>]
- Regular Expression
- Example
- import re
  for tag in soup.find_all(re.compile("^b")):
      print(tag.name)
- body
- b
- List
- Example
- soup.find_all(["a", "b"])
- [
- <b>The Dormouse's story</b>,
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- True
- Example
- soup.find_all(id=True)
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- Function
- Must return True or False; it can receive an attribute value (as here) or a whole tag (see the sketch after this block)
- Example
- def not_lacie(href):
      return href and not re.compile("lacie").search(href)
  soup.find_all(href=not_lacie)
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
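The function filter can also receive the whole tag rather than an attribute value; a sketch in the same spirit as the example above, run against the Dormouse soup:

def has_class_but_no_id(tag):
    # keep tags that define class but not id
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="story">...</p>]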
- find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)
- Searches all of the current element's descendants
- Shorthand
- soup.find_all(...) == soup(...)
- name
- Matches tags whose name is name
- Plain strings (text nodes) are ignored automatically
- Example
- soup.find_all('title')
- [<title>The Dormouse's story</title>]
- attrs
- Matches tags with the given attribute values
- Example
- soup.find_all(attrs={"class": "sister"})
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- recursive
- By default, all descendants of the current element are searched;
  set recursive=False to search only the direct children
- Example
- soup.html.find_all("title", recursive=False)
- []
- string
- Searches string content;
  when not combined with other arguments, only NavigableString objects are returned
- Example
- soup.find_all(string="Elsie")
- ['Elsie']
- soup.find_all("a", string="Elsie")
- [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
- limit
- Limits the number of results returned
- Example
- soup.find_all("a", limit=2)
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- ]
- **kwargs
- Any keyword whose name is not a built-in parameter name is treated as a tag attribute, and its value is used to match that attribute
- Example
- soup.find_all(href=re.compile("elsie"), id='link1')
- [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
- soup.find_all("a", class_="sister")
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- Some attributes cannot be used as keyword arguments, e.g. HTML5 data-* attributes; pass them through attrs instead (see the sketch after this block)
- class is a Python reserved word, so write class_ with a trailing underscore
- Because class is a multi-valued attribute, searching with the CSS class names in a different order than in the document finds nothing
- Example
- css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
- css_soup.find_all("p", class_="body strikeout")
- [<p class="body strikeout"></p>]
- css_soup.find_all("p", class_="strikeout body")
- []
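A sketch of the data-* case mentioned above: a data-* keyword argument is a Python syntax error, so attrs is the way in.

from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
# data_soup.find_all(data-foo="value")  # SyntaxError: keyword can't be an expression
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]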
- find(name=None, attrs={}, recursive=True, text=None, **kwargs)
- Like find_all, but returns only the first match, or None if nothing is found
- Shorthand
- soup.find(tag) == soup.tag
- find_parents(name=None, attrs={}, limit=None, **kwargs)
- Searches all of the current element's ancestors
- Example
- soup.b.find_parents('p')
- [<p class="title"><b>The Dormouse's story</b></p>]
- find_parent(name=None, attrs={}, **kwargs)
- Returns the first matching ancestor of the current element
- Example
- soup.b.find_parent('p')
- <p class="title"><b>The Dormouse's story</b></p>
- find_next_siblings(name=None, attrs={}, text=None, limit=None, **kwargs)
- Searches all siblings after the current element
- Example
- soup.a.find_next_siblings()
- [
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- find_next_sibling(name=None, attrs={}, text=None, **kwargs)
- Returns the first matching sibling after the current element
- Example
- soup.a.find_next_sibling()
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- find_previous_siblings(name=None, attrs={}, text=None, limit=None, **kwargs)
- Searches all siblings before the current element
- Example
- soup.find(id='link3').find_previous_siblings()
- [
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- ]
- find_previous_sibling(name=None, attrs={}, text=None, **kwargs)
- Returns the first matching sibling before the current element
- Example
- soup.find(id='link3').find_previous_sibling()
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs)
- Searches everything parsed after the current element
- Example
- soup.a.find_all_next('p')
- [<p class="story">...</p>]
- find_next(name=None, attrs={}, text=None, **kwargs)
- Returns the first match among everything parsed after the current element
- Example
- soup.a.find_next('p')
- <p class="story">...</p>
- find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs)
- Searches everything parsed before the current element
- Example
- soup.find(class_="story").find_all_previous('b')
- [<b>The Dormouse's story</b>]
- find_previous(name=None, attrs={}, text=None, **kwargs)
- Returns the first match among everything parsed before the current element
- Example
- soup.find(class_="story").find_previous('b')
- <b>The Dormouse's story</b>
- select(selector, limit=None)
- Finds all matching elements using CSS selector syntax (a few extra selector one-liners appear after this section)
- Example
- soup.select('a.sister')
- [
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- ]
- select_one(selector)
- Finds the first matching element using CSS selector syntax
- Example
- soup.select_one('a.sister')
- <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- get_text(separator="", strip=False, types=(NavigableString, CData))
- Returns the text contained in the element (the separator and strip parameters are sketched after this section)
- Example
- soup.get_text()
- "\nThe Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
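A few more hedged one-liners for select() and get_text(), run against the Dormouse soup; the selectors use standard CSS syntax, and support for the fancier ones depends on the Beautiful Soup / soupsieve version:

soup.select("p > a#link2")  # direct <a> child of a <p>, with a specific id
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select('a[href^="http://example.com/"]')  # attribute prefix selector
# all three sister links
soup.get_text("|", strip=True)  # custom separator, whitespace stripped
# "The Dormouse's story|The Dormouse's story|Once upon a time there were three little sisters; and their names were|Elsie|,|Lacie|and|..."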
Modifying the tree
- Change an element's name and attributes
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
- Replace a tag's contents with .string
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>
- Append content inside an element: append(tag)
soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")
soup
# <a>FooBar</a>
soup.a.contents
# ['Foo', 'Bar']
- Create a new element: new_tag(name, namespace=None, nsprefix=None, **attrs)
soup = BeautifulSoup("<b></b>", "html.parser")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
- Insert at a given position: insert(position, new_child)
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]
- Insert before the current element: insert_before(predecessor)
soup = BeautifulSoup("<b>stop</b>", "html.parser")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>
- Insert after the current element: insert_after(successor)
soup = BeautifulSoup("<b><i>Don't</i>stop</b>", "html.parser")
soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, ' ever ', 'stop']
- Remove an element's contents: clear(decompose=False)
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
tag.clear()
tag
# <a href="http://example.com/"></a>
- Detach the current element from the tree and return it: extract()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
i_tag = soup.i.extract()
a_tag
# <a href="http://example.com/">I linked to</a>
i_tag
# <i>example.com</i>
print(i_tag.parent)
# None
- Destroy the current element without returning it: decompose()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>
- Replace the current element and return the element that was replaced: replace_with(replace_with)
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
old_tag = a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
old_tag
# <i>example.com</i>
- Wrap the current element and return the wrapper: wrap(wrap_inside)
soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
- Strip a tag from around its contents and return the stripped tag: unwrap()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
a_tag.i.unwrap()
# <i></i>
a_tag
# <a href="http://example.com/">I linked to example.com</a>
Output methods
- prettify(encoding=None, formatter="minimal")
- Pretty-printed output (see the formatter sketch at the end of this section)
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
soup.prettify()
# '<a href="http://example.com/">\n I linked to\n <i>\n  example.com\n </i>\n</a>'
print(soup.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
- str(soup): when you just want the resulting string and do not care about formatting
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
str(soup)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'
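A hedged sketch of the formatter argument: "minimal" (the default) escapes only what is needed to produce valid HTML, while "html" also converts Unicode characters to named entities where it can.

from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "html.parser")
print(soup.prettify(formatter="minimal"))
# <p>
#  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
# </p>
print(soup.prettify(formatter="html"))
# <p>
#  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
# </p>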