Python Web Crawler 3 – Parsing Web Pages with BeautifulSoup

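Part 1 showed how to cut content out of a web page with regular expressions. HTML, however, is a nested markup language whose structure regular expressions cannot fully describe, so extracting content with regular expressions alone quickly becomes difficult.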
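To make the limitation concrete, here is a small sketch that uses only the standard library. The sample markup and the OuterText class are made up for illustration. A non-greedy regular expression stops at the first closing tag, while a parser that tracks nesting depth collects all of the text.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: one <div> nested inside another.
HTML = '<div id="outer">before<div>inner</div>after</div>'

# The non-greedy group ends at the FIRST </div>, which belongs to the inner div.
match = re.search(r'<div id="outer">(.*?)</div>', HTML)
regex_result = match.group(1)


class OuterText(HTMLParser):
    """Collects every text node that sits inside at least one <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = OuterText()
parser.feed(HTML)
parser.close()
parser_result = "".join(parser.chunks)

print(regex_result)   # before<div>inner
print(parser_result)  # beforeinnerafter
```

The regular expression returns a truncated fragment, whereas the parser sees the whole nested structure. BeautifulSoup builds on this kind of parsing and adds a much more convenient query interface.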

This time we use a new tool: Python's BeautifulSoup library, which extracts data from HTML and XML files.

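BeautifulSoup has to be installed before it can be used (for example with pip install beautifulsoup4). The documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ covers installation and usage.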

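Open our search results page, https://www.torrentkitty.tv/search/蝙蝠侠/. Chrome is the best choice here, because its Developer Tools (similar to FireBug) can generate a CSS selector for any element (select the target -> Copy -> Copy selector). This is very handy; the generated selector sometimes causes problems, but it is still a useful reference.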

First, try fetching the title with BeautifulSoup:

    soup = BeautifulSoup(html_doc, "html.parser")
    soup.select("title")

Simple, isn't it? This uses a CSS selector to pick content out of the HTML.
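As a self-contained illustration, the snippet below runs the same selector against a tiny inline document. The html_doc string is a made-up stand-in for the downloaded page. It also shows select_one(), which returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page downloaded from the site.
html_doc = "<html><head><title>Search results</title></head><body></body></html>"

soup = BeautifulSoup(html_doc, "html.parser")

titles = soup.select("title")     # list of every matching tag
first = soup.select_one("title")  # first matching tag, or None if nothing matches

print(titles)            # [<title>Search results</title>]
print(first.get_text())  # Search results
```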

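In the search results, clicking the "Open" button to the right of each result starts the download. Developer Tools shows that the "Open" button is actually a hyperlink pointing to a magnet link, and that magnet link is what we want to collect. Pick any "Open" button with the Developer Tools element picker, find the selected element's source in the Elements panel (easy to spot, since the selected item has a blue background), then right-click -> Copy -> Copy selector to get the button's CSS selector:

    #archiveResult > tbody > tr:nth-child(10) > td.action > a:nth-child(2)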

Dropped into the code, though, this selector does not work:

    def detect(html_doc):
        soup = BeautifulSoup(html_doc, "html.parser")
        print(len(soup.select("#archiveResult > tbody > tr:nth-child(10) > td.action > a:nth-child(2)")))

The output is 0. soup.select() returns a list, and a length of 0 means nothing matched.

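There are two likely reasons. First, tbody is probably not in the page source at all: browsers insert a <tbody> element into tables when building the DOM, so Chrome reports it, but html.parser works on the raw source and never sees one. Second, BeautifulSoup versions before 4.7 do not support the nth-child pseudo-class, only nth-of-type. Adjusting the selector accordingly fixes the problem:

    print(soup.select("#archiveResult > tr:nth-of-type(10) > td.action > a:nth-of-type(2)"))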

Running the corrected selector prints the source of a single hyperlink:

[<a title="[BT乐园·bt606.com]蝙蝠侠大战超人：正义黎明.BD1080P.X264.AAC.中英字幕" href="magnet:?xt=urn:btih:DD6A680A7AE85F290A76826AA4D2E72194975EC8&dn=%5BBT%E4%B9%90%E5%9B%AD%C2%B7bt606.com%5D%E8%9D%99%E8%9D%A0%E4%BE%A0%E5%A4%A7%E6%88%98%E8%B6%85%E4%BA%BA%EF%BC%9A%E6%AD%A3%E4%B9%89%E9%BB%8E%E6%98%8E.BD1080P.X264.AAC.%E4%B8%AD%E8%8B%B1%E5%AD%97%E5%B9%95&tr=http%3A%2F%2Ftracker.ktxp.com%3A6868%2Fannounce&tr=http%3A%2F%2Ftracker.ktxp.com%3A7070%2Fannounce&tr=udp%3A%2F%2Ftracker.ktxp.com%3A6868%2Fannounce&tr=udp%3A%2F%2Ftracker.ktxp.com%3A7070%2Fannounce&tr=http%3A%2F%2Fbtfans.3322.org%3A8000%2Fannounce&tr=http%3A%2F%2Fbtfans.3322.org%3A8080%2Fannounce&tr=http%3A%2F%2Fbtfans.3322.org%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.bittorrent.am%2Fannounce&tr=udp%3A%2F%2Ftracker.bitcomet.net%3A8080%2Fannounce&tr=http%3A%2F%2Ftk3.5qzone.net%3A8080%2F&tr=http%3A%2F%2Ftracker.btzero.net%3A8080%2Fannounce&tr=http%3A%2F%2Fscubt.wjl.cn%3A8080%2Fannounce&tr=http%3A%2F%2Fbt.popgo.net%3A7456%2Fannounce&tr=http%3A%2F%2Fthetracker.org%2Fannounce&tr=http%3A%2F%2Ftracker.prq.to%2Fannounce&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=http%3A%2F%2Ftracker.dmhy.org%3A8000%2Fannounce&tr=http%3A%2F%2Fbt.titapark.com%3A2710%2Fannounce&tr=http%3A%2F%2Ftracker.tjgame.enorth.com.cn%3A8000%2Fannounce&" rel="magnet">Open</a>]
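The effect of the tbody step can be reproduced offline. The markup below is a made-up table shaped like the results page (the /information/... paths and hashes are placeholders), written without a <tbody>, as the raw page source presumably is.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a results table whose source contains no <tbody>.
html_doc = """
<table id="archiveResult">
  <tr><td class="action"><a href="/information/1">Detail</a><a href="magnet:?xt=urn:btih:AAA">Open</a></td></tr>
  <tr><td class="action"><a href="/information/2">Detail</a><a href="magnet:?xt=urn:btih:BBB">Open</a></td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# html.parser does not add a <tbody>, so the browser-style selector finds nothing.
with_tbody = soup.select("#archiveResult > tbody > tr > td.action > a:nth-of-type(2)")
# Dropping the tbody step matches the second link in each row.
without_tbody = soup.select("#archiveResult > tr > td.action > a:nth-of-type(2)")

print(len(with_tbody))     # 0
print(len(without_tbody))  # 2
```

A selector copied from the browser describes the DOM after the browser has repaired the markup, so it is worth checking it against the raw source before relying on it.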

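The href and title attributes of the hyperlink are our targets. BeautifulSoup provides a way to read attributes: every element returned by select() carries an attrs dictionary from which the attribute values can be read: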

    def detect(html_doc):
        html_soup = BeautifulSoup(html_doc, "html.parser")
        anchor = html_soup.select("#archiveResult > tr:nth-of-type(10) > td.action > a:nth-of-type(2)")[0]
        print(anchor.attrs['href'])
        print(anchor.attrs['title'])

That is the general idea.
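A short sketch of the attribute access on a single made-up link follows. Besides the attrs dictionary, a tag can be indexed directly, and get() avoids a KeyError when an attribute is absent (data-size is a hypothetical attribute used only to show the missing case).

```python
from bs4 import BeautifulSoup

# Hypothetical link in the same shape as the "Open" button.
html_doc = '<a title="Example file" href="magnet:?xt=urn:btih:AAA" rel="magnet">Open</a>'
anchor = BeautifulSoup(html_doc, "html.parser").select_one("a")

href = anchor.attrs["href"]        # dictionary access, as in detect()
same_href = anchor["href"]         # shorthand for the line above
title = anchor.get("title")        # like dict.get: returns None if absent
missing = anchor.get("data-size")  # attribute not present on this tag

print(href)     # magnet:?xt=urn:btih:AAA
print(title)    # Example file
print(missing)  # None
```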

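The ugliest part of the program, however, is how it finds the hyperlinks: fetching them one at a time is not practical. Fortunately, BeautifulSoup can select elements by attribute value, so after a final adjustment the code looks like this: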

    def detect(html_doc):
        html_soup = BeautifulSoup(html_doc, "html.parser")
        anchors = html_soup.select('a[href^="magnet:?xt"]')
        for anchor in anchors:
            print(anchor.attrs['title'])
            print(anchor.attrs['href'])

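In the code above, a[href^="magnet:?xt"] selects every <a> tag whose href attribute starts with "magnet:?xt". (Does the "^" look familiar? It plays the same role as the "^" anchor in regular expressions.) select() returns a list of <a> tags; the loop then walks the list and reads the attribute values from each tag's attrs dictionary.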
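The prefix selector can be tried without touching the network. In this sketch, extract_magnets is a hypothetical variant of detect() that returns the data instead of printing it, and the table markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each row has a detail link and a magnet link.
html_doc = """
<table id="archiveResult">
  <tr><td class="action"><a href="/information/1" title="First">Detail</a>
      <a href="magnet:?xt=urn:btih:AAA" title="First">Open</a></td></tr>
  <tr><td class="action"><a href="/information/2" title="Second">Detail</a>
      <a href="magnet:?xt=urn:btih:BBB" title="Second">Open</a></td></tr>
</table>
"""


def extract_magnets(html_doc):
    """Return (title, href) pairs for every link whose href starts with 'magnet:?xt'."""
    soup = BeautifulSoup(html_doc, "html.parser")
    anchors = soup.select('a[href^="magnet:?xt"]')
    return [(a.get("title", ""), a["href"]) for a in anchors]


results = extract_magnets(html_doc)
for title, href in results:
    print(title, href)
```

Only the two magnet links are returned; the detail links are skipped because their href values do not start with the prefix. Returning a list rather than printing makes the function easier to reuse and to test.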

The complete code is as follows:

    #!/usr/bin/env python3
    # encoding: utf-8

    from urllib import request
    from urllib import parse

    from bs4 import BeautifulSoup

    DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
    DEFAULT_TIMEOUT = 360


    def get(url):
        """Send a GET request and return the response body decoded as UTF-8."""
        req = request.Request(url, headers=DEFAULT_HEADERS)
        response = request.urlopen(req, timeout=DEFAULT_TIMEOUT)
        content = ""
        if response:
            content = response.read().decode("utf8")
            response.close()
        return content


    def post(url, **paras):
        """Send the keyword arguments as a form-encoded POST body and return the decoded response."""
        param = parse.urlencode(paras).encode("utf8")
        req = request.Request(url, param, headers=DEFAULT_HEADERS)
        response = request.urlopen(req, timeout=DEFAULT_TIMEOUT)
        content = ""
        if response:
            content = response.read().decode("utf8")
            response.close()
        return content


    def detect(html_doc):
        """Print the title and href of every magnet link in the page."""
        html_soup = BeautifulSoup(html_doc, "html.parser")
        anchors = html_soup.select('a[href^="magnet:?xt"]')
        for anchor in anchors:
            print(anchor.attrs['title'])
            print(anchor.attrs['href'])


    def main():
        url = "https://www.torrentkitty.tv/search/"
        # The keyword is percent-encoded here and then form-encoded again inside post().
        html_doc = post(url, q=parse.quote("超人"))
        detect(html_doc)


    if __name__ == "__main__":
        main()
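Each collected href packs several fields into one string. As an optional follow-up, the sketch below splits a magnet link into its parts with the standard library. parse_magnet and the sample link are made up for illustration, and the field names in the returned dictionary are an arbitrary choice.

```python
from urllib.parse import parse_qs


def parse_magnet(href):
    """Split a magnet URI into its query parameters (values are percent-decoded)."""
    prefix, _, query = href.partition("?")
    if prefix != "magnet:":
        raise ValueError("not a magnet link: " + href)
    params = parse_qs(query)
    return {
        # xt looks like "urn:btih:<hash>"; keep only the part after the last colon.
        "info_hash": params.get("xt", [""])[0].rsplit(":", 1)[-1],
        "name": params.get("dn", [""])[0],
        "trackers": params.get("tr", []),
    }


# Made-up link in the same shape as the hrefs printed by detect().
sample = "magnet:?xt=urn:btih:DD6A680A&dn=Example%20Name&tr=http%3A%2F%2Ftracker.example.com%3A6868%2Fannounce"
info = parse_magnet(sample)

print(info["info_hash"])  # DD6A680A
print(info["name"])       # Example Name
print(info["trackers"])   # ['http://tracker.example.com:6868/announce']
```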
