Learn Python3 Step by Step | Scraping 2019 NSFC Grant Information

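This is the eighth post in my Python3 study notes. I am new to web scraping, so corrections are welcome.

The NSFC (National Natural Science Foundation of China) review results were released recently, to the delight of some labs and the disappointment of others. Today I will use the NSFC results page as a simple introduction to web scraping in Python. The main package we need is one you will run into often:

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.

First, install the Python packages.

The urllib library ships with Python, so it needs no separate installation.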

pip3 install beautifulsoup4 requests

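In PyCharm you can run the install from the terminal panel at the bottom of the window. Once it finishes, a quick test confirms everything works: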

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
print(soup.p.string)

---
Hello

A classic example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

A BeautifulSoup object can print the document back out in a standard, indented layout:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

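print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>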

It also provides several ways to get at individual elements:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
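
To tie these accessors together, here is a small sketch that collects the text and href of every link in one pass. It uses a trimmed copy of the sample document so that it runs on its own; the variable name links is only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document, so the snippet is self-contained.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag; .get() reads an attribute and
# returns None instead of raising if the attribute is missing.
links = [(a.get_text(), a.get("href"))
         for a in soup.find_all("a", class_="sister")]

for text, href in links:
    print(text, href)

# Indexing a tag with a key works like a dict lookup on its attributes.
print(soup.find(id="link3")["href"])
```

The same pattern (find the tags, then read text and attributes) is what the NSFC example relies on.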

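A small NSFC example:

Open the search page:

Then right-click and view the page source:

The code below extracts the information shown in the screenshot above: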

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    # Search results for 2019 with subject filter C, page 1
    target = 'http://fund.sciencenet.cn/search?yearStart=2019&filter%5Bsubject%5D%5B0%5D=C&submit=list&page=1'
    req = requests.get(url=target)
    html = req.text
    # Name the parser explicitly; otherwise bs4 guesses one and warns
    div_bf = BeautifulSoup(html, 'html.parser')
    # The result list lives in <div class="resultLst">
    div = div_bf.find_all('div', class_='resultLst')
    # Project titles are in <a> tags, the details in <span> tags
    A = div[0].find_all('a')
    span = div[0].find_all('span')
    for eachs in A:
        print(eachs.string)
    for each in span:
        # Print every child node of the span (label, value, ...)
        for child in each.children:
            print(child.string)

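The output looks like this (负责人 = principal investigator, 申请单位 = applicant institution, 批准年度 = year approved, 金额 = amount, where 万 means 10,000 yuan, 关键词 = keywords):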

伞形科“东亚分支”系统分类学研究及其药用植物种源鉴定
UV-B诱导叶用莴苣维生素C积累的分子机制研究
...
负责人：
周静
申请单位：
昆明医科大学
批准年度：
2019
金额：
40万
关键词：
None
负责人：
周华
申请单位：
江西省科学院
批准年度：
2019
金额：
37万
关键词：
None
...
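
One fragile spot in the script is div[0], which raises IndexError if the page comes back without a resultLst block (for example after a layout change or a blocked request). The sketch below moves the parsing into a function that returns empty lists in that case. The function name parse_results and the HTML fragment in SAMPLE are made up for illustration; only the resultLst class name comes from the script, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only to exercise the function; the real
# page structure inside <div class="resultLst"> is not reproduced here.
SAMPLE = """
<div class="resultLst">
  <p><a href="#">Project A</a></p>
  <p><span>负责人：<i>周静</i></span></p>
</div>
"""


def parse_results(html):
    """Return (titles, details) from a results page, or two empty lists."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="resultLst")
    if container is None:
        # No result block found: report nothing instead of crashing.
        return [], []
    titles = [a.get_text(strip=True) for a in container.find_all("a")]
    # Join the text pieces inside each span with a single space.
    details = [span.get_text(" ", strip=True)
               for span in container.find_all("span")]
    return titles, details


titles, details = parse_results(SAMPLE)
print(titles)
print(details)
print(parse_results("<html></html>"))
```

In the real script the html argument would be req.text from the requests.get call.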

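Because my skills are still limited and space is short, I have not shown everything that was scraped. Rearranged into tab-separated form, the data can be used for filtering and statistics. Looping over the result pages then lets the scraper collect the complete set of records for the chosen search conditions.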
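
These two follow-up steps can be sketched roughly as follows. This is only an illustration under two assumptions: that every record lists the five labels in the order seen in the output, with each value right after its label, and that further result pages are reached by changing the page= value in the URL. The names group_records, to_tsv, and page_urls are my own, not part of any library.

```python
import csv
import io

# Labels exactly as printed by the scraper; assumed to appear in this order.
LABELS = ["负责人：", "申请单位：", "批准年度：", "金额：", "关键词："]

# Assumption: later pages differ only in the page= value of the search URL.
SEARCH_URL = ("http://fund.sciencenet.cn/search?yearStart=2019"
              "&filter%5Bsubject%5D%5B0%5D=C&submit=list&page={page}")


def group_records(lines):
    """Group a flat label/value sequence into one dict per project."""
    records, current, pending = [], {}, None
    for line in lines:
        text = "" if line is None else str(line).strip()
        if text in LABELS:
            # Seeing the first label again means a new record begins.
            if text == LABELS[0] and current:
                records.append(current)
                current = {}
            pending = text
            current[pending] = ""
        elif pending is not None:
            # The item right after a label is taken as its value.
            current[pending] = text
            pending = None
    if current:
        records.append(current)
    return records


def to_tsv(records):
    """Render the records as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([label.rstrip("：") for label in LABELS])
    for record in records:
        writer.writerow([record.get(label, "") for label in LABELS])
    return buffer.getvalue()


def page_urls(first, last):
    """Build the search URLs for pages first..last (inclusive)."""
    return [SEARCH_URL.format(page=n) for n in range(first, last + 1)]


sample = ["负责人：", "周静", "申请单位：", "昆明医科大学",
          "批准年度：", "2019", "金额：", "40万", "关键词：", None]
print(to_tsv(group_records(sample)), end="")
print(page_urls(1, 3))
```

Each URL from page_urls could be fetched and parsed the same way as page 1, with the resulting strings passed to group_records.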

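Finally, the rest of this series has also been collected and shared; I hope you find it useful.

Learn Python3 Step by Step | Python3 Scripting Mini-Projects (Code Included)

References: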

https://beautifulsoup.readthedocs.io/zh_CN/latest/
