Scraping Images from the Meizitu Site (re + requests)

Scraping the Meizitu Site (re + requests)

Heh, enough talk, let's get going!

The site's address is: https://mzitu.com

The most basic version: ver0.1

The source code is as follows:

import os
import re

import requests

# Browser-like headers for the listing pages and the gallery pages.
# Two entries copied from the browser are left out on purpose:
# 'br' in accept-encoding (requests only decodes Brotli when a brotli
# package is installed) and 'if-modified-since' (the server may answer
# with an empty 304 response).
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': 'Hm_lvt_dbc355aef238b6c32b43eacbbf161c3c=1571725635,1571743086; Hm_lpvt_dbc355aef238b6c32b43eacbbf161c3c=1571743406',
    'referer': 'https://www.mzitu.com/page/2/',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}

os.makedirs('download', exist_ok=True)

# Change the range here to choose which listing pages to crawl
for page in range(0, 4):
    url = 'https://www.mzitu.com/page/%s/' % page

    response = requests.get(url, headers=headers)
    html = response.text
    # NOTE: the original pattern was lost when this page was extracted
    # (the source shows an empty string). The pattern below is an ASSUMED
    # stand-in that matches gallery links of the form
    # https://www.mzitu.com/<digits>; adjust it to the real page markup.
    meizi_url_list = re.findall(r'href="(https://www\.mzitu\.com/\d+)"', html)

    for meizili in meizi_url_list:
        # Change the range here to choose how many pages of one gallery to crawl
        for n in range(1, 10):
            urlli = meizili + '/%s' % n
            responseli = requests.get(urlli, headers=headers)
            htmlli = responseli.text

            # URL of the jpg on this gallery page
            pattern = r'img\ssrc="(.+?)"\sa'
            try:
                meizili_url = re.findall(pattern, htmlli)[0]
                print(meizili_url, type(meizili_url))
                # Send the gallery URL as the Referer when requesting the image
                img_headers = {
                    'Referer': meizili,
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
                }
                aaa = requests.get(meizili_url, headers=img_headers).content

                filename = 'download/' + meizili_url.split('/')[-1]
                with open(filename, 'wb') as pic:
                    pic.write(aaa)
            except (IndexError, requests.RequestException):
                # No image matched on this page, or the request failed: skip it
                continue
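
To see what the image pattern actually captures without hitting the network, here is a minimal sketch that runs it against a made-up HTML fragment. The helper names `extract_image_url` and `local_filename`, the sample markup, and the `img.example.com` host are all assumptions for illustration; only the regular expression and the `split('/')[-1]` filename rule come from the script above.

```python
import re

# Same image pattern as in the script above
IMG_PATTERN = re.compile(r'img\ssrc="(.+?)"\sa')


def extract_image_url(html):
    """Return the first image URL matched by IMG_PATTERN, or None."""
    match = IMG_PATTERN.search(html)
    return match.group(1) if match else None


def local_filename(img_url, folder='download'):
    """Build the save path from the last path segment of the image URL."""
    return folder + '/' + img_url.split('/')[-1]


# Hypothetical gallery-page fragment, made up for illustration only
sample_html = (
    '<p><a href="https://www.mzitu.com/1/2">'
    '<img src="https://img.example.com/2019/10/01a.jpg" alt="demo" />'
    '</a></p>'
)

img_url = extract_image_url(sample_html)
print(img_url)                                    # captured image URL
print(local_filename(img_url))                    # where it would be saved
print(extract_image_url('<p>no image here</p>'))  # None when nothing matches
```

Returning `None` when nothing matches is one possible alternative to indexing `[0]` and catching the resulting `IndexError`; the caller can then skip the page with a plain `if` check.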

As you can see, the resources we wanted were crawled successfully.

image

Just a small first try!

Author: jiangzuojiben
Article link: http://jiangzuojiben.github.io/2019/10/22/%E5%A6%B9%E5%AD%90%E5%9B%BE%E7%BD%91%E7%AB%99%E7%88%AC%E5%8F%96%E7%BE%8E%E5%9B%BE-re-requests/
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit jiangzuojiben when reposting.