Learning Beautiful Soup 4

Official documentation (Chinese translation): https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Sample HTML:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Structured data:

A few simple ways to navigate the parsed document:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser') # parse the sample with Python's built-in parser

soup.prettify() # returns the document as a string, indented in a standard layout
soup.title # <title>The Dormouse's story</title>  the whole <title> tag
soup.title.name # 'title'
soup.title.string # "The Dormouse's story"
soup.title.parent.name # 'head'  name of the tag that contains <title>
soup.p # <p class="title"><b>The Dormouse's story</b></p>  the first <p> tag
soup.p['class'] # ['title']  class is a multi-valued attribute, so the value is a list
soup.a # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# find_all returns every <a> tag in the document
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Extract the URLs of all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
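
The loop above can also be written as a list comprehension. The following is a minimal sketch, not part of the original notes: it uses a shortened copy of the sample document in which the third <a> tag deliberately has no href, to show how a missing attribute behaves.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (an assumption for this sketch);
# the third <a> tag has no href on purpose.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)

# .get() returns None for a missing attribute instead of raising KeyError
missing = soup.find(id="link3").get("href")
print(missing)
```

Using link.get('href') is the safer choice when some tags may lack the attribute; link['href'] raises KeyError in that case.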

Extract all the text from the document:

soup.get_text()
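
get_text() also accepts a separator and a strip flag, which helps when the raw text runs together or carries stray whitespace. A small sketch on a one-line fragment made up for this example:

```python
from bs4 import BeautifulSoup

# One-line fragment invented for this sketch
html_doc = (
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie">Elsie</a> and '
    '<a href="http://example.com/lacie">Lacie</a> lived in a well.</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Default: every text node concatenated as-is
plain = soup.get_text()
print(plain)

# separator is placed between text nodes; strip=True trims each node
# and drops the ones that are empty after trimming
joined = soup.get_text(separator="|", strip=True)
print(joined)
```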

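Searching the tree:

Beautiful Soup has two main search methods: find() and find_all().

Both take filters. The three most commonly used filter types are strings, regular expressions, and lists.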

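Strings: the simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for an exact match against it. The following example finds all <b> tags in the document: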

soup.find_all('b') # [<b>The Dormouse's story</b>]

If you pass a byte string, Beautiful Soup assumes it is encoded as UTF-8. Passing a Unicode string instead avoids decoding errors.

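Regular expressions: pass a pattern object created with re.compile() to a search method, and Beautiful Soup uses that regular expression to select the matching data.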

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
# finds every tag whose name starts with "b"
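
For comparison, a pattern without the ^ anchor matches anywhere in the tag name, and a compiled pattern can filter attribute values as well. This is a sketch on a shortened copy of the sample document, not taken from the original notes:

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the sample document (an assumption for this sketch)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Anchored pattern: tag names that start with "b"
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
print(starts_with_b)

# Unanchored pattern: tag names that contain "t" anywhere
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]
print(contains_t)

# A compiled pattern can also be used as an attribute filter
elsie_links = soup.find_all("a", href=re.compile("elsie"))
print(elsie_links)
```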

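Lists: if you pass a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> tags and all <b> tags in the document: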

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
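
find_all():

find_all( name , attrs , recursive , text , **kwargs )

(From Beautiful Soup 4.4.0 onward the text argument is named string; the old name is still accepted.)

Note: searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. Starting with Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ argument.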
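
A minimal sketch of searching by CSS class, on a shortened copy of the sample document (an assumption for this example). It shows the class_ keyword next to the equivalent attrs dictionary form:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (an assumption for this sketch)
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ avoids the clash with Python's reserved word "class"
by_keyword = soup.find_all("a", class_="sister")

# The attrs dictionary is an equivalent way to express the same filter
by_attrs = soup.find_all("a", attrs={"class": "sister"})

# Only the <p> whose class is "story" matches here
story_paragraphs = soup.find_all("p", class_="story")

print([a["id"] for a in by_keyword])
print(len(story_paragraphs))
```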

find() works much like find_all(), except that find() returns only the first match, while find_all() returns all of them.
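
The difference is easiest to see when nothing matches. A short sketch, again on a fragment made up for this example:

```python
from bs4 import BeautifulSoup

# Fragment invented for this sketch; it contains no <table> tag
html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.find("a")          # a single Tag: the first match
all_links = soup.find_all("a")       # a list of every match

no_table = soup.find("table")        # None when nothing matches
no_tables = soup.find_all("table")   # an empty list when nothing matches

# limit=1 stops after the first match but still returns a list
limited = soup.find_all("a", limit=1)

print(first_link["id"], len(all_links), no_table, len(no_tables), len(limited))
```

Because find() can return None, check the result before accessing attributes on it.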