The urllib Library (1)

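I. Introduction

The urllib library can be used to fetch network resources that do not require authentication, to handle cookies, and so on. It consists of four modules: urllib.request, urllib.parse, urllib.error, and urllib.robotparser. urllib.request provides the basic way to construct HTTP requests and can simulate the request a browser would send; it also handles authentication, redirection, browser cookies, and more. urllib.parse is mainly used to parse URLs, urllib.error contains the exception types that may be raised, and urllib.robotparser parses robots.txt files.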

II. Basic Usage

1. Sending a simple request (urllib.request.urlopen)

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print (response.read().decode('utf-8'))

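For an HTTP URL, the returned response is an http.client.HTTPResponse object. Its main methods include read(), readinto(), getheader(name), getheaders(), and fileno(), and its attributes include msg, version, status, reason, debuglevel, and closed.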

response.status                  # get the status code
200

response.getheader('server')     # get the value of one header
BWS/1.1

response.info()                  # get all headers
Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 227
Content-Type: text/html
Date: Wed, 28 Mar 2018 00:49:16 GMT
Last-Modified: Thu, 15 Mar 2018 08:23:00 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=A818A688506452DAF0B4135CD343ACE4; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1522198156; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Strict-Transport-Security: max-age=0
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
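A small sketch of reading a response body together with its headers. To stay runnable without network access it opens a data: URL, which is an assumption made for this example rather than something the text above uses. The response for such a URL is not an HTTPResponse, so the sketch sticks to read() and info(), which both kinds of response provide. The helper name read_text is hypothetical.

```python
import urllib.request

def read_text(url, default_charset='utf-8'):
    """Open url and decode the body using the charset named in its headers."""
    with urllib.request.urlopen(url) as response:
        headers = response.info()   # message object holding the headers
        charset = headers.get_content_charset() or default_charset
        body = response.read().decode(charset)
    return headers.get_content_type(), charset, body

# data: URL carrying the percent-encoded UTF-8 bytes of two Chinese characters
demo_url = 'data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD'
content_type, charset, text = read_text(demo_url)
print(content_type, charset, len(text))
```

Taking the charset from the headers avoids hard-coding 'utf-8' for sites that serve another encoding such as GBK.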

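2. GET and POST (the data parameter of urlopen)

First explanation:

In a GET request the data is appended to the URL: '?' separates the URL from the data and '&' joins the parameters. Letters and digits are sent as-is, a space becomes '+', and Chinese or other special characters are percent-encoded (this is encoding, not encryption). In a POST request the submitted data is placed in the body of the HTTP message, so it does not show up in the URL.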
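As a rough illustration of the GET rules above, urllib.parse.urlencode() builds the query string: spaces become '+' and non-ASCII text is percent-encoded, as UTF-8 by default. The base URL and parameter names are placeholders, and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'http://www.example.com/search'   # placeholder URL
params = {'q': 'hello world', 'lang': '中文'}

query = urlencode(params)           # q=hello+world&lang=%E4%B8%AD%E6%96%87
get_url = base_url + '?' + query    # '?' separates the URL from the data

# Decoding the query string gives the original values back
decoded = parse_qs(urlsplit(get_url).query)
print(get_url)
```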

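Note that in urllib, passing data to urlopen() sends the request as a POST rather than a GET; for a GET, the encoded parameters are appended to the URL instead. The following passes form values through data: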

import urllib.request
import urllib.parse

url = 'http://www.baidu.com'  # placeholder URL
values = {'username': 'me', 'password': 'you'}
# urlencode() returns a str; urlopen() needs bytes for data
data = bytes(urllib.parse.urlencode(values), encoding='utf8')
response = urllib.request.urlopen(url, data)  # data is set, so this is a POST
print(response.read().decode('utf-8'))

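On encoding: depending on whether a site uses GBK or UTF-8, the same text is percent-encoded differently in the URL.

With GBK, a Chinese character becomes %xx%xx (two groups);

with UTF-8, a common Chinese character becomes %xx%xx%xx (three groups).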

print(bytes('你好', 'utf8'))
b'\xe4\xbd\xa0\xe5\xa5\xbd'
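The difference between the two encodings can be checked with urllib.parse.quote(), which takes an encoding argument. This is only a sketch of the rule described above; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

text = '你好'
as_utf8 = quote(text, encoding='utf-8')   # three %xx groups per character
as_gbk = quote(text, encoding='gbk')      # two %xx groups per character

print(as_utf8)   # %E4%BD%A0%E5%A5%BD
print(as_gbk)    # %C4%E3%BA%C3

# Decoding must use the same encoding that was used for encoding
restored = unquote(as_gbk, encoding='gbk')
```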

A POST request often needs header information or other parameters as well; usually setting User-Agent alone is enough.

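from urllib import request, parse

url = 'http://www.baidu.com'  # placeholder URL
headers = {}  # add request headers here, e.g. a User-Agent entry
values = {'username': 'me', 'password': 'you'}
data = bytes(parse.urlencode(values), encoding='utf8')
# use a different name so the request module is not shadowed
req = request.Request(url, data, headers)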

Second explanation:

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

# Result in the original article
{
     "args": {},
     "data": "",
     "files": {},
     "form": {
         "word": "hello"
     },
     "headers": {
         "Accept-Encoding": "identity",
         "Content-Length": "10",
         "Content-Type": "application/x-www-form-urlencoded",
         "Host": "httpbin.org",
         "User-Agent": "Python-urllib/3.5"
     },
     "json": null,
     "origin": "123.124.23.253",
     "url": "http://httpbin.org/post"
}
# My result
b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "json": null, \n  "origin": "116.7.245.184", \n  "url": "http://httpbin.org/post"\n}\n'

The parameters we passed appear in the form field, which shows that the request simulated a form submission and sent the data via POST.

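3. Request (modifying header information)

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The url parameter is the URL to request. It is the only required parameter; the rest are optional.
The data parameter, if given, must be of type bytes. If the data is a dictionary, encode it first with urlencode() from urllib.parse and then convert the result to bytes.
The headers parameter is a dictionary of request headers. They can be passed through headers when constructing the request, or added afterwards by calling add_header() on the request instance.

The most common use of request headers is to disguise the client as a browser by changing User-Agent. The default User-Agent is Python-urllib followed by the Python version (for example Python-urllib/3.6). To pose as Firefox, for instance, you can set it to:

Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

The origin_req_host parameter is the host name or IP address of the party that originated the request.
The unverifiable parameter indicates whether the request is unverifiable, meaning the user had no option to approve it. The default is False. For example, if we request an image embedded in an HTML document and the user had no option to approve the automatic fetching of that image, unverifiable should be True.
The method parameter is a string naming the HTTP method to use, such as GET, POST, or PUT. The default is GET, or POST when data is provided.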

Methods and attributes

Request.add_header(key, val)  # add an entry to the request headers
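A minimal sketch of how the parameters described above interact. It only constructs Request objects and sends nothing. The shortened User-Agent string and the Accept-Language value are placeholders chosen for the example.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'

# Without data the request method defaults to GET
get_req = request.Request(url)

# With bytes data the default method becomes POST
form = parse.urlencode({'word': 'hello'}).encode('utf-8')
post_req = request.Request(url, data=form, headers={'User-Agent': 'Mozilla/5.0'})
post_req.add_header('Accept-Language', 'en')

# An explicit method overrides the default
put_req = request.Request(url, data=form, method='PUT')

print(get_req.get_method(), post_req.get_method(), put_req.get_method())
# Request stores header names capitalized, e.g. 'User-agent'
print(post_req.header_items())
```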

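The headers parameter

Common request headers:

Accept: the media types the client can accept.
text/html means HTML text.
application/xhtml+xml means XHTML.

Accept-Encoding: the content encodings (compression formats) the client can accept in the response.

It is commonly set to gzip, deflate.

Accept-Language: the preferred natural language (note that language and encoding are different concepts).

Connection: whether the connection between client and server is kept open after the response; either keep-alive or close.

Cookie: state information that the client sends back to the server.
Host: the host name (and optionally the port) of the server the request is addressed to.
If-Modified-Since: asks the server to return the page only if it has changed since the given time.
User-Agent: identifies the client, such as its OS and browser type, to the server.

For comparison, common response headers (these are returned by the server and are not set through headers):

Age: how long, in seconds, the response has been held in a cache.
Cache-Control: caching directives, for example whether the client may cache the page.
Content-Encoding: the encoding of the data the server sends to the client, often gzip.
Content-Length: the total number of bytes the server sends in the body; often used to tell whether the body has been fully received.
Content-Type: the media type of the returned data, for example text/html for an HTML document.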

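4. Handlers and openers

A handler can be thought of as a processor for one concern: some handle login authentication, some handle cookies, some handle proxy settings. With them we can do almost anything an HTTP request involves.

BaseHandler in urllib.request is the parent class of all other handlers. It defines the interface that subclasses implement, such as default_open() and protocol_request() (where protocol stands for a scheme name such as http).

Some of the related classes (not a complete list):

HTTPDefaultErrorHandler: handles HTTP error responses; errors are raised as HTTPError exceptions.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies; by default the proxy settings are read from environment variables.
HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords (a helper class rather than a handler).
HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, it can be used to handle it.

Another important class is OpenerDirector, which we can call an opener. The urlopen() function we used earlier is in fact a ready-made opener that urllib provides. Request and urlopen() wrap the most common request operations and are enough for basic requests. To implement more advanced features we need to go one level deeper and configure a lower-level object ourselves, and that object is the opener.

An opener has an open() method whose return value is the same kind of object urlopen() returns. How does it relate to handlers? In short, an opener is built from handlers.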
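To make the handler and opener relationship concrete, here is a sketch that runs without network access. It defines a handler for a made-up 'echo' scheme; EchoHandler and the scheme name are hypothetical and exist only for this example. The opener finds the handler through its method name, following the protocol_open naming pattern that BaseHandler subclasses use.

```python
import io
from email.message import Message
from urllib.request import BaseHandler, build_opener
from urllib.response import addinfourl

class EchoHandler(BaseHandler):
    """Hypothetical handler for a made-up 'echo' scheme."""

    def echo_open(self, req):
        # The opener calls <scheme>_open() on a handler that defines it
        headers = Message()
        headers['Content-Type'] = 'text/plain; charset=utf-8'
        body = req.host.encode('utf-8')   # for 'echo://hello' the host is 'hello'
        return addinfourl(io.BytesIO(body), headers, req.full_url)

# build_opener() combines the default handlers with the ones passed in
opener = build_opener(EchoHandler())
with opener.open('echo://hello') as response:
    echoed = response.read().decode('utf-8')
print(echoed)   # hello
```

Real handlers such as HTTPCookieProcessor or ProxyHandler are plugged into an opener in the same way, by passing instances to build_opener().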

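4.1 Authentication

Some sites pop up a prompt as soon as they are opened, asking for a user name and password, and only show the page once authentication succeeds. HTTPBasicAuthHandler can handle this: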

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

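Here we first instantiate an HTTPBasicAuthHandler. Its argument is an HTTPPasswordMgrWithDefaultRealm object, to which the user name and password were added with add_password(). This gives us a handler that takes care of authentication.

Next, we pass this handler to build_opener() to build an opener. When this opener sends a request, it supplies the stored credentials whenever the server asks for them.

Finally, we open the link with the opener's open() method, which completes the authentication. The result is the source of the page as seen after authentication.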
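The password lookup behind this can be inspected without a server. The sketch below assumes the same placeholder URL and credentials as above; the realm name and the extra URLs are made up for the example. It also shows the header value that Basic authentication is built from.

```python
import base64
from urllib.request import HTTPPasswordMgrWithDefaultRealm

url = 'http://localhost:5000/'
manager = HTTPPasswordMgrWithDefaultRealm()
manager.add_password(None, url, 'username', 'password')   # None = any realm

# Credentials apply to the registered URL and to paths below it
found = manager.find_user_password('Some Realm', url + 'private/page')
# A different host has no stored credentials
missing = manager.find_user_password('Some Realm', 'http://example.com/')

# Basic authentication sends "user:password" base64-encoded in a header
user, password = found
token = base64.b64encode(f'{user}:{password}'.encode('utf-8')).decode('ascii')
auth_header = 'Basic ' + token
print(found, missing, auth_header)
```

Because base64 is an encoding and not encryption, Basic authentication should only be relied on over HTTPS.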

4.2 Setting a proxy

To avoid excessive load, servers sometimes limit how many requests a crawler may make. In that case a proxy is needed.

import urllib.request
from urllib.error import URLError

proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:9743'})  # replace with the proxy you found
opener = urllib.request.build_opener(proxy_handler)
try:
    response = opener.open('http://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

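4.3 Cookies

So far every URL has been opened with the default opener, except in the proxy example, where we built our own. The default opener, used through urlopen(), mainly takes the url, data, and timeout parameters. Cookies are data that some websites store on the user's local machine (usually encrypted) in order to identify the user and track the session; to work with them we build an opener with a cookie handler.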

import urllib.request
from http import cookiejar

cookie = cookiejar.CookieJar()                        # create a CookieJar instance to hold cookie data
handler = urllib.request.HTTPCookieProcessor(cookie)  # create the cookie handler
opener = urllib.request.build_opener(handler)         # build the opener
response = opener.open('http://www.baidu.com')
for item in cookie:
    print('name = ' + item.name)
    print('value = ' + item.value)

To save cookie data to a file, use a FileCookieJar subclass (MozillaCookieJar or LWPCookieJar); the two write different file formats.

from http import cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = cookiejar.MozillaCookieJar(filename)  # file-backed cookie jar; LWPCookieJar works the same way
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# keep session cookies and expired cookies in the file as well
cookie.save(ignore_discard=True, ignore_expires=True)

Reading cookies from a file

import urllib.request
from http import cookiejar  # cookielib is the Python 2 name of this module

cookie = cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
req = urllib.request.Request('http://www.baidu.com')
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(req)
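The effect of ignore_discard can be seen without contacting any site. In this sketch the cookies are built by hand, so the helper make_cookie, the domain, and the cookie names are all made up for the example. One cookie has an expiry time and the other is a session cookie.

```python
import os
import tempfile
import time
from http import cookiejar

def make_cookie(name, value, expires):
    """Build a cookie by hand; expires=None makes it a session cookie."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='.example.com', domain_specified=True, domain_initial_dot=True,
        path='/', path_specified=True,
        secure=False, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'cookie.txt')
    jar = cookiejar.MozillaCookieJar(filename)
    jar.set_cookie(make_cookie('persistent', 'p1', int(time.time()) + 3600))
    jar.set_cookie(make_cookie('session', 's1', None))
    jar.save(ignore_discard=True, ignore_expires=True)

    # Without ignore_discard the session cookie is skipped on load
    strict = cookiejar.MozillaCookieJar()
    strict.load(filename)
    # With ignore_discard=True both cookies come back
    lenient = cookiejar.MozillaCookieJar()
    lenient.load(filename, ignore_discard=True, ignore_expires=True)

strict_names = sorted(c.name for c in strict)
lenient_names = sorted(c.name for c in lenient)
print(strict_names, lenient_names)
```

Login state is often carried by session cookies, which is why the examples above pass ignore_discard=True when saving and loading.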

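5. Exception handling (the urllib.error module)

The urllib.error module defines the exceptions raised by the request module. If something goes wrong, the request module raises one of the exceptions defined there.

5.1 URLError

The URLError class comes from urllib.error. It inherits from OSError and is the base class of the module's exceptions, so exceptions produced by the request module can be handled by catching it. It has a reason attribute that gives the cause of the error.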

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

Output

Not Found

5.2 HTTPError

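HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication request. It has the following three attributes.

code: the HTTP status code, for example 404 for a page that does not exist or 500 for an internal server error.
reason: the cause of the error, as in the parent class.
headers: the headers of the HTTP response that caused the error.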

from urllib import request,error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

Output

Not Found
404
Server: nginx/1.4.6 (Ubuntu)
Date: Wed, 03 Aug 2016 08:54:22 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
X-Powered-By: PHP/5.5.9-1ubuntu4.14
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache
Link: <http://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

A better way to write it is to catch the subclass HTTPError first and then the parent class URLError:

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

Sometimes the reason attribute is not a string but an object.

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Output

<class 'socket.timeout'>
TIME OUT

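Here the reason attribute is an instance of socket.timeout, so we can use isinstance() to check its type and handle the exception in more detail. (On Python 3.10 and later, socket.timeout is an alias of the built-in TimeoutError, so the printed class name is TimeoutError.)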
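The checks from this section can be gathered into one helper. The sketch below builds the exceptions by hand, as stand-ins for what urlopen() would raise, so it runs without network access. The function name describe_error and the messages it returns are hypothetical.

```python
import io
import socket
from urllib import error

def describe_error(exc):
    """Return a short description; check the subclass before the parent."""
    if isinstance(exc, error.HTTPError):
        return 'HTTP %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return 'timed out'
        return 'URL error: %s' % exc.reason
    return 'other error: %r' % exc

# Exceptions built by hand, standing in for what urlopen() would raise
not_found = error.HTTPError('http://example.com/missing', 404, 'Not Found', {}, io.BytesIO())
timed_out = error.URLError(socket.timeout('timed out'))
refused = error.URLError('connection refused')

for exc in (not_found, timed_out, refused):
    print(describe_error(exc))
```

In real code the same function would be called from an except error.URLError as e: block, since HTTPError is a subclass and is caught by it as well.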
