[Featured] Requests, the Handiest Crawling Tool (HTTP for Humans)

Py站长 • 12 years ago • 23519 views

Why I recommend it:

Official introduction (very powerful!):

“Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

Things shouldn’t be this way. Not in Python.”
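The Stack Overflow question "Should I use urllib or urllib2 or requests?" also recommends it.

It is very pleasant to use. If you scrape web pages often, it is worth considering: there is a 10% improvement in crawling efficiency.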

Source code: https://github.com/kennethreitz/requests

Commonly used features are listed below.

Authentication, status codes, headers, encoding, JSON

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}
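
A tuple passed as auth= is sent as HTTP Basic authentication. The sketch below builds the request locally with Request.prepare() instead of sending it, so the header can be inspected without a network connection; 'user' and 'pass' are the placeholder credentials from the example above, not real ones.

```python
import base64

import requests

# Build the request locally without sending it, so no network is needed.
# 'user' / 'pass' are placeholder credentials.
prepared = requests.Request(
    "GET", "https://api.github.com/user", auth=("user", "pass")
).prepare()

# HTTP Basic auth is the string "Basic " followed by base64("user:pass").
auth_header = prepared.headers["Authorization"]
expected = "Basic " + base64.b64encode(b"user:pass").decode("ascii")
print(auth_header == expected)
```

Because Basic credentials are only base64-encoded, not encrypted, they should only travel over HTTPS.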

Making requests

import requests
URL = "http://www.bsdmap.com/"
r = requests.get(URL)
r = requests.post(URL)
r = requests.put(URL)
r = requests.delete(URL)
r = requests.head(URL)
r = requests.options(URL)
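
Each helper above is shorthand for requests.request(method, url). A small sketch that prepares the same six requests locally (nothing is sent) and reads back the HTTP method each one would use; the list name `methods` is just for illustration.

```python
import requests

URL = "http://www.bsdmap.com/"

# requests.get(URL) is shorthand for requests.request("GET", URL), and the
# other verbs work the same way. Preparing the requests locally shows the
# method that would go on the wire, without touching the network.
methods = ["get", "post", "put", "delete", "head", "options"]
prepared = [requests.Request(m, URL).prepare() for m in methods]
sent_methods = [p.method for p in prepared]
print(sent_methods)
```

The generic form is handy when the method name comes from configuration or a loop rather than being fixed in the code.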

Passing parameters in the URL

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get("http://httpbin.org/get", params=payload)
>>> print(r.url)
http://httpbin.org/get?key1=value1&key2=value2
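
To see how params= is turned into a query string without calling httpbin, the request can be prepared locally. This is a sketch: the list value and the "hello world" value are made-up inputs that show repeated keys and escaping.

```python
from urllib.parse import parse_qs, urlsplit

import requests

# Illustrative values: a plain string, a list (repeated key), and a value
# containing a space that has to be escaped.
payload = {"key1": "value1", "key2": ["a", "b"], "q": "hello world"}

prepared = requests.Request(
    "GET", "http://httpbin.org/get", params=payload
).prepare()
url = prepared.url
print(url)

# Parse the query string back to confirm nothing was lost in encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Letting Requests build the query string avoids hand-escaping special characters in the URL.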

Response content

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
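
Setting r.encoding changes how r.text is decoded on the next access. The sketch below demonstrates this against a throwaway local server, since the timeline URL above may not be reachable; the Handler class and the "héllo" body are made up for the demonstration. The server sends UTF-8 bytes with no charset in the header, so Requests falls back to ISO-8859-1 for text/* content.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class Handler(BaseHTTPRequestHandler):
    """Hypothetical test server: UTF-8 bytes, but no charset in the header."""

    def do_GET(self):
        body = "héllo".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: let the OS choose
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy env vars so localhost is hit directly
try:
    r = session.get(f"http://127.0.0.1:{server.server_port}/", timeout=5)
    guessed = r.encoding   # what Requests inferred from the headers
    garbled = r.text       # decoded with the guessed encoding
    r.encoding = "utf-8"   # override with the encoding we know is right
    fixed = r.text         # decoded again, now correctly
finally:
    session.close()
    server.shutdown()
    server.server_close()

print(guessed, repr(garbled), repr(fixed))
```

When the real encoding is unknown, r.apparent_encoding offers a guess based on the body bytes.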

Binary content

You can also access the response body as bytes, for non-text requests:

>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

The gzip and deflate transfer-encodings are automatically decoded for you.

For example, to create an image from binary data returned by a request, you can use the following code:

>>> from PIL import Image
>>> from io import BytesIO
>>> i = Image.open(BytesIO(r.content))
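
For large files it can be better not to load all of r.content into memory. A minimal sketch, assuming a streaming download is wanted: stream=True with iter_content() writes the body to disk in chunks. The local server, the PAYLOAD bytes and the download() helper are made up for the demonstration.

```python
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAYLOAD = bytes(range(256)) * 64  # stand-in for image bytes (made-up data)


class Handler(BaseHTTPRequestHandler):
    """Hypothetical local server that returns PAYLOAD for any GET."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):  # keep the example output quiet
        pass


def download(session, url, path, chunk_size=4096):
    """Stream the response body to `path` without holding it all in memory."""
    with session.get(url, stream=True, timeout=5) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return path


server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy env vars so localhost is hit directly
target = os.path.join(tempfile.mkdtemp(), "test.png")
try:
    download(session, f"http://127.0.0.1:{server.server_port}/", target)
    with open(target, "rb") as f:
        saved = f.read()
finally:
    session.close()
    server.shutdown()
    server.server_close()

print(len(saved))
```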

JSON

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.json()
[{'repository': {'open_issues': 0, 'url': 'https://github.com/...

Timeouts

>>> requests.get('http://github.com', timeout=0.001)
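
With a timeout as small as 0.001 seconds the call above will normally raise an exception rather than return. A sketch of catching it, using a local socket that accepts connections but never answers as a stand-in for a slow server (this setup is an assumption made for the demo):

```python
import socket

import requests

# A listening socket that never answers: the TCP connection succeeds, but no
# HTTP response ever arrives, so the read timeout fires.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
silent.listen(1)
port = silent.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy env vars so localhost is hit directly
try:
    # timeout=(connect, read) in seconds; a single number applies to both.
    session.get(f"http://127.0.0.1:{port}/", timeout=(2, 0.2))
    outcome = "ok"
except requests.exceptions.Timeout:
    # Base class of both ConnectTimeout and ReadTimeout.
    outcome = "timed out"
finally:
    session.close()
    silent.close()

print(outcome)
```

Note that the timeout limits how long Requests waits for the connection and for each read from the socket, not the total download time.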

Custom headers

>>> import json
>>> import requests
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> headers = {'content-type': 'application/json'}

>>> r = requests.post(url, data=json.dumps(payload), headers=headers)
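
Newer Requests releases also accept a json= argument that serializes the payload and sets the Content-Type header in one step. The sketch below prepares both variants locally, without sending them, and compares the result; the endpoint is the placeholder URL from the example above.

```python
import json

import requests

url = "https://api.github.com/some/endpoint"  # placeholder endpoint
payload = {"some": "data"}

# Manual version: serialize the body and set the header yourself.
manual = requests.Request(
    "POST",
    url,
    data=json.dumps(payload),
    headers={"content-type": "application/json"},
).prepare()

# json= lets Requests do both steps.
automatic = requests.Request("POST", url, json=payload).prepare()

print(manual.headers["Content-Type"], automatic.headers["Content-Type"])
print(json.loads(manual.body) == json.loads(automatic.body))
```

Header lookups are case-insensitive, which is why 'content-type' and 'Content-Type' refer to the same entry.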

For more, see the official documentation:

http://docs.python-requests.org/en/latest/user/quickstart/

http://docs.python-requests.org/en/latest/user/advanced/#advanced

Article URL: http://www.python88.com/topic/120

Posts [ 13 ] | Latest post 10 years ago

• #1

Py站长 10 years ago

@olivetree I haven't read the source, but one thing is certain: it is much more pleasant to use.

• #2

olivetree 10 years ago

Requests is built on urllib3, so what does it improve on top of urllib3?

• #3

olivetree 10 years ago

@Django中国社区 It is already compressed by default.

• #4

Py站长 10 years ago

@olivetree It follows the HTTP protocol, so if gzip is set, compression should happen automatically.
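
On the compression question: a quick way to check what Requests asks for by default is to look at the headers a fresh Session would send. This sketch only inspects the defaults and sends nothing.

```python
import requests

# Headers that a new Session sends with every request by default.
defaults = requests.Session().headers
accept_encoding = defaults["Accept-Encoding"]

# Typically includes gzip and deflate; extra codecs may appear if optional
# decompression libraries are installed.
print(accept_encoding)
```

A server that supports gzip can then compress its response, and Requests decodes it transparently when r.text or r.content is read.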

• #5

olivetree 10 years ago

Can it transfer data in compressed form?

• #6

Py站长 11 years ago

@zyloveszjj Nice~

• #7

lzjun567 11 years ago

It is really convenient when used together with BeautifulSoup.

• #8

zyloveszjj 11 years ago

The image-saving part can be written like this: with open('test.png', 'wb') as f: f.write(res.content). Personally I find that much more convenient than using StringIO~

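• #9

boostbob 11 years ago

Looks good. I have written crawlers in Java, and urllib is what usually gets recommended, but this looks like it would be much nicer to use....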

• #10

powgolf 12 years ago

@Django中国社区 Thanks!

• #11

Py站长 12 years ago

@powgolf

Something like this:

import requests

# `options` is defined elsewhere in the poster's code (not shown here).
payload = {
    options.VERSION: '1_0',
    options.PRODUCT_LINE: 'dan',
    options.SERVICE: 'dnwebbilling',
    options.ENV: 'online',
    options.MAIN: 'master',
    options.FILENAME: 'a.properties',
}
r = requests.get("http://localhost:8000/obj", params=payload)
print(r.url)  # URL of the final response
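
For the redirect case specifically, r.url holds the URL of the final response and r.history lists the intermediate responses. A sketch against a throwaway local server; the Handler class and the /old and /new paths are made up for the demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class Handler(BaseHTTPRequestHandler):
    """Hypothetical local server: /old redirects to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"arrived"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy env vars so localhost is hit directly
try:
    r = session.get(base + "/old", timeout=5)
    final_url = r.url                                   # after all redirects
    hops = [(h.status_code, h.url) for h in r.history]  # redirects followed

    # With allow_redirects=False the redirect itself is returned instead.
    no_follow = session.get(base + "/old", allow_redirects=False, timeout=5)
    location = no_follow.headers["Location"]
finally:
    session.close()
    server.shutdown()
    server.server_close()

print(final_url, hops, location)
```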

• #12

powgolf 12 years ago

A question: how do I get the final URL after redirects with requests, like geturl() in urllib2? Thanks.

• #13

Py站长 12 years ago

One thing to note: when writing a crawler, you can set keep-alive=False in Requests; otherwise the site may block you.
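
How keep-alive is switched off depends on the Requests version. A minimal sketch, assuming the goal is simply to avoid reusing connections: sending a Connection: close header asks the server to close the connection after each response. This is a stand-in for the setting mentioned above, not an option with that exact name.

```python
import requests

# A fresh Session advertises keep-alive by default.
default_connection = requests.Session().headers["Connection"]

session = requests.Session()
# Ask the server to close the connection after each response instead of
# reusing it (assumed intent of "keep-alive=False").
session.headers["Connection"] = "close"

# Prepare a request through the session to see the merged headers; nothing
# is sent over the network here.
prepared = session.prepare_request(
    requests.Request("GET", "http://www.bsdmap.com/")
)
connection_header = prepared.headers["Connection"]
print(default_connection, "->", connection_header)
```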
