Getting Started with BeautifulSoup

Fetching the Page

After importing the libraries, first use the requests library to fetch the page source.

import requests
from bs4 import BeautifulSoup
import re

try:
    headers = {
        'user-agent': 'Mozilla/5.0'
    }
    url = 'https://blog.csdn.net/JeronZhou'
    r = requests.get(url, headers=headers)  # headers are required
    r.raise_for_status()                    # raise HTTPError on a 4xx/5xx status
    r.encoding = r.apparent_encoding        # decode with the encoding detected from the content

    text = r.text
except Exception as e:
    print(e)

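When fetching content with requests, keep the following points in mind.
Building the request headers

If you request a URL without building your own headers, the request headers look like this: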
>>> r=requests.get("https://www.baidu.com/")
>>> r.request.headers
{'User-Agent': 'python-requests/2.23.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
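To compare the default and custom headers without sending anything, a request can be prepared but not sent. This is a minimal sketch: the URL is a placeholder, and the exact default User-Agent string depends on the installed requests version.

```python
import requests

url = "https://example.com/"  # placeholder URL; nothing is sent over the network

session = requests.Session()

# Prepare (but do not send) a request that uses the library defaults.
default_req = session.prepare_request(requests.Request("GET", url))

# Prepare the same request with a browser-like User-Agent.
custom_req = session.prepare_request(
    requests.Request("GET", url, headers={"User-Agent": "Mozilla/5.0"})
)

print(default_req.headers["User-Agent"])  # begins with "python-requests/"
print(custom_req.headers["User-Agent"])
```

A site that filters on User-Agent can tell the first request apart from a browser, which is why the custom header is set in the code above.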
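Setting the encoding

r.encoding is the response encoding guessed from the HTTP headers (normally the charset).

r.apparent_encoding, by contrast, is detected by analyzing the content itself.

The difference looks like this: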
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
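Why the mismatch matters can be shown with the standard library alone. The sketch below assumes a made-up page body encoded as UTF-8 and decodes it both ways; the sample string is illustrative, not taken from any real response.

```python
# Hypothetical page body: UTF-8 bytes, as a server might send them.
raw = "你好, BeautifulSoup".encode("utf-8")

# Decoding as ISO-8859-1 never fails, because every byte maps to a character,
# but each multi-byte UTF-8 character turns into several garbled ones.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding detected from the content recovers the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

Setting r.encoding = r.apparent_encoding before reading r.text avoids the first case when the headers carry no usable charset.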
Status code

You can check r.status_code to confirm that the status code is 200,

or call r.raise_for_status(), which raises an HTTPError when the status is 4xx or 5xx.
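The behavior of raise_for_status() can be tried offline. In this rough sketch a Response object is built by hand purely for illustration (real responses come from requests.get), and check is a hypothetical helper name.

```python
import requests

def check(status_code):
    """Build a bare Response with the given status and report how it is handled."""
    resp = requests.Response()       # constructed by hand, no network involved
    resp.status_code = status_code
    try:
        resp.raise_for_status()      # raises only for 4xx and 5xx statuses
    except requests.HTTPError as err:
        return f"error: {err}"
    return "ok"

print(check(200))
print(check(404))
```

Note that raise_for_status() does nothing for other non-200 statuses such as redirects, so checking r.status_code == 200 is the stricter test.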

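Parsing the Content
1. Use the BeautifulSoup class to "make the soup"
2. Note that html.parser is the parser from Python's standard library, so it needs no extra install; once lxml is installed, you can switch to soup=BeautifulSoup(text,'lxml')
3. Use regular expressions to match tags
4. To filter by the class attribute, use the bs4 keyword argument class_, which avoids a clash with the Python reserved word class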
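soup=BeautifulSoup(text,'html.parser')

# A compiled regex is tested against tag names with search(), so r'img' matches any tag whose name contains "img"
img=soup.find_all(re.compile(r'img'),src=re.compile(r'profile'))
# class_ filters on the HTML class attribute
username=soup.find_all(re.compile(r'h1'),class_="user-profile-title")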
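The same filters can be tried on a small inline document, which makes the result easy to inspect. The HTML below is invented for illustration and is not the real CSDN markup; plain tag-name strings are used instead of regexes so that only tags named exactly img and h1 match.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for a profile page.
html = """
<html><body>
  <h1 class="user-profile-title">example-user</h1>
  <img src="/images/profile_avatar.png">
  <img src="/images/banner.png">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag name given as a string, attribute value filtered with a regex.
imgs = soup.find_all("img", src=re.compile(r"profile"))

# class_ filters on the HTML class attribute.
titles = soup.find_all("h1", class_="user-profile-title")

print([img["src"] for img in imgs])
print([h.get_text(strip=True) for h in titles])
```

find_all always returns a list, so an empty list (rather than an exception) signals that nothing matched.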

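Closing Notes and the robots Protocol

Before crawling, did you notice that CSDN follows the robots protocol? (If you have not looked at it, see https://www.csdn.net/robots.txt)

To avoid causing trouble for site administrators, please follow the robots protocol as closely as you can when crawling. If, while learning, you cannot avoid departing from it, at least keep your crawler's request rate close to that of a normal human visitor so that it does not take up too much of the server's resources.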
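Both habits can be sketched with the standard library. The rules, URLs, agent name, and the helper polite_fetch_plan below are all hypothetical; the rules are inlined so the sketch runs offline, whereas a real crawler would load the site's robots.txt with RobotFileParser.set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs offline.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

def polite_fetch_plan(parser, agent, urls, default_delay=1.0):
    """Return the URLs the agent may fetch and the pause to use between requests."""
    delay = parser.crawl_delay(agent) or default_delay
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    return allowed, delay

parser = RobotFileParser()
parser.parse(SAMPLE_RULES)

urls = [
    "https://example.com/index.html",
    "https://example.com/private/a.html",
]
allowed, delay = polite_fetch_plan(parser, "my-crawler", urls)

for u in allowed:
    print("allowed:", u)
print("pause between requests (seconds):", delay)
```

In an actual crawl loop, calling time.sleep(delay) between requests keeps the rate modest.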

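I also hope you will keep supporting 大可. Feel free to submit any questions and I will answer them promptly. You are also welcome to visit my site: https://cheungducknew.github.io/

This tutorial is for learning purposes only; I am not responsible if others use it for any other purpose.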