Python's urllib and shell curl return different results; how can I make them match?


#1

Post by lzhp1501 » 2016-12-12 22:07

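Shell curl and Python 3's urllib urlopen give me different results.

With curl I get the page content without the source code.

With Python's urlopen, after calling read(), it looks as if the source code was scraped as well.

How do I strip the source so the output matches what curl prints?

What Python returns looks like this: view-source:https://www.baidu.com/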
Last edited by lzhp1501 on 2016-12-13 20:52; edited 1 time in total.
I said I'm not handsome, and everyone hit me and called me a hypocrite~

#2

Post by vickycq » 2016-12-12 22:19

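Please paste all the relevant output here and describe the problem concretely.

As shown below, there's no problem; the two are identical.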
Python 3.5.2+ (default, Nov 22 2016, 01:00:20)
[GCC 6.2.1 20161119] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.request import urlopen
>>> html = urlopen("http://mirrors.163.com/")
>>> page = html.read()
>>> page.decode('utf-8')
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1- ... d">\n<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n <meta http-equiv="content-type" content="text/html; charset=utf-8"/>\n <link rel="stylesheet" type="text/css" href="/.media/mirror.css" media="screen" />\n <link rel="shortcut icon" href="/.media/favicon.ico" />\n <title>欢迎访问网易开源镜像站</title>\n</head>\n\n<body>\n\n<h1>欢迎访问网易开源镜像站</h1>\n\n<table id="distro-table" cellpadding="0" cellspacing="0">\n <colgroup>\n <col width="50%"/>\n <col width="25%"/>\n <col width="25%"/>\n </colgroup>\n <thead>\n <tr>\n <th>镜像名</th>\n <th>上次更新时间</th>\n <th>使用帮助</th>\n </tr>\n </thead>\n <tbody>\n <tr class="odd">\n <td><a href="/archlinux/">archlinux/</a></td>\n <td>2016-12-12 18:27</td>\n <td><a href="/.help/archlinux.html">archlinux使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/archlinux-cn/">archlinux-cn/</a></td>\n <td>2016-12-12 05:28</td>\n <td><a href="/.help/archlinux-cn.html">archlinux-cn使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/centos/">centos/</a></td>\n <td>2016-12-12 15:44</td>\n <td><a href="/.help/centos.html">centos使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/ceph/">ceph/</a></td>\n <td>2016-12-07 02:38</td>\n <td><a href="/.help/ceph.html">ceph使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/cpan/">cpan/</a></td>\n <td>2016-12-12 07:55</td>\n <td><a href="/.help/cpan.html">cpan使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/cygwin/">cygwin/</a></td>\n <td>2016-12-12 17:27</td>\n <td><a href="/.help/cygwin.html">cygwin使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/debian/">debian/</a></td>\n <td>2016-12-12 04:27</td>\n <td><a href="/.help/debian.html">debian使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/debian-backports/">debian-backports/</a></td>\n <td>2016-03-31 04:28</td>\n <td><a href="/.help/debian-backports.html">debian-backports使用帮助</a></td>\n </tr>\n <tr class="odd">\n 
<td><a href="/debian-cd/">debian-cd/</a></td>\n <td>2016-12-12 03:18</td>\n <td><a href="/.help/debian-cd.html">debian-cd使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/debian-security/">debian-security/</a></td>\n <td>2016-12-12 22:20</td>\n <td><a href="/.help/debian-security.html">debian-security使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/gentoo/">gentoo/</a></td>\n <td>2016-12-09 18:16</td>\n <td><a href="/.help/gentoo.html">gentoo使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/gentoo-portage/">gentoo-portage/</a></td>\n <td>2016-12-12 20:04</td>\n <td><a href="/.help/gentoo-portage.html">gentoo-portage使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/slackware/">slackware/</a></td>\n <td>2016-12-12 05:51</td>\n <td><a href="/.help/slackware.html">slackware使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/tinycorelinux/">tinycorelinux/</a></td>\n <td>2016-12-12 21:37</td>\n <td><a href="/.help/tinycorelinux.html">tinycorelinux使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/ubuntu/">ubuntu/</a></td>\n <td>2016-12-12 00:54</td>\n <td><a href="/.help/ubuntu.html">ubuntu使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/ubuntu-releases/">ubuntu-releases/</a></td>\n <td>2016-12-12 07:14</td>\n <td><a href="/.help/ubuntu-releases.html">ubuntu-releases使用帮助</a></td>\n </tr>\n </tbody>\n</table>\n<div id="footer">\n <a target="_blank" href="http://www.163.com/">网易首页</a>\n <a target="_blank" href="/.help/index.html">使用帮助</a>\n <a href="mailto:mirror@service.netease.com">联系我们</a>\n <a target="_blank" href="http://corp.163.com/eng/about/overview.html">About NetEase</a>\n</div>\n\n</body>\n</html>\n'
$ curl http://mirrors.163.com/ > page
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4655 100 4655 0 0 29142 0 --:--:-- --:--:-- --:--:-- 29276

$ python3
Python 3.5.2+ (default, Nov 22 2016, 01:00:20)
[GCC 6.2.1 20161119] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('page')
>>> page = f.read()
>>> f.close()
>>> page
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1- ... d">\n<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n <meta http-equiv="content-type" content="text/html; charset=utf-8"/>\n <link rel="stylesheet" type="text/css" href="/.media/mirror.css" media="screen" />\n <link rel="shortcut icon" href="/.media/favicon.ico" />\n <title>欢迎访问网易开源镜像站</title>\n</head>\n\n<body>\n\n<h1>欢迎访问网易开源镜像站</h1>\n\n<table id="distro-table" cellpadding="0" cellspacing="0">\n <colgroup>\n <col width="50%"/>\n <col width="25%"/>\n <col width="25%"/>\n </colgroup>\n <thead>\n <tr>\n <th>镜像名</th>\n <th>上次更新时间</th>\n <th>使用帮助</th>\n </tr>\n </thead>\n <tbody>\n <tr class="odd">\n <td><a href="/archlinux/">archlinux/</a></td>\n <td>2016-12-12 18:27</td>\n <td><a href="/.help/archlinux.html">archlinux使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/archlinux-cn/">archlinux-cn/</a></td>\n <td>2016-12-12 05:28</td>\n <td><a href="/.help/archlinux-cn.html">archlinux-cn使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/centos/">centos/</a></td>\n <td>2016-12-12 15:44</td>\n <td><a href="/.help/centos.html">centos使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/ceph/">ceph/</a></td>\n <td>2016-12-07 02:38</td>\n <td><a href="/.help/ceph.html">ceph使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/cpan/">cpan/</a></td>\n <td>2016-12-12 07:55</td>\n <td><a href="/.help/cpan.html">cpan使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/cygwin/">cygwin/</a></td>\n <td>2016-12-12 17:27</td>\n <td><a href="/.help/cygwin.html">cygwin使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/debian/">debian/</a></td>\n <td>2016-12-12 04:27</td>\n <td><a href="/.help/debian.html">debian使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/debian-backports/">debian-backports/</a></td>\n <td>2016-03-31 04:28</td>\n <td><a href="/.help/debian-backports.html">debian-backports使用帮助</a></td>\n </tr>\n <tr class="odd">\n 
<td><a href="/debian-cd/">debian-cd/</a></td>\n <td>2016-12-12 03:18</td>\n <td><a href="/.help/debian-cd.html">debian-cd使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/debian-security/">debian-security/</a></td>\n <td>2016-12-12 22:20</td>\n <td><a href="/.help/debian-security.html">debian-security使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/gentoo/">gentoo/</a></td>\n <td>2016-12-09 18:16</td>\n <td><a href="/.help/gentoo.html">gentoo使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/gentoo-portage/">gentoo-portage/</a></td>\n <td>2016-12-12 20:04</td>\n <td><a href="/.help/gentoo-portage.html">gentoo-portage使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/slackware/">slackware/</a></td>\n <td>2016-12-12 05:51</td>\n <td><a href="/.help/slackware.html">slackware使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/tinycorelinux/">tinycorelinux/</a></td>\n <td>2016-12-12 21:37</td>\n <td><a href="/.help/tinycorelinux.html">tinycorelinux使用帮助</a></td>\n </tr>\n <tr class="odd">\n <td><a href="/ubuntu/">ubuntu/</a></td>\n <td>2016-12-12 00:54</td>\n <td><a href="/.help/ubuntu.html">ubuntu使用帮助</a></td>\n </tr>\n <tr class="even">\n <td><a href="/ubuntu-releases/">ubuntu-releases/</a></td>\n <td>2016-12-12 07:14</td>\n <td><a href="/.help/ubuntu-releases.html">ubuntu-releases使用帮助</a></td>\n </tr>\n </tbody>\n</table>\n<div id="footer">\n <a target="_blank" href="http://www.163.com/">网易首页</a>\n <a target="_blank" href="/.help/index.html">使用帮助</a>\n <a href="mailto:mirror@service.netease.com">联系我们</a>\n <a target="_blank" href="http://corp.163.com/eng/about/overview.html">About NetEase</a>\n</div>\n\n</body>\n</html>\n'
>>>
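Comparing two long reprs by eye is error-prone, and open('page') in text mode decodes with the locale's encoding, which may differ from the page's. A minimal sketch of a byte-level comparison follows; the helper names digest and first_difference are made up for illustration and do not come from any library.

```python
import hashlib


def digest(body: bytes) -> str:
    """Return the SHA-256 hex digest of a raw response body."""
    return hashlib.sha256(body).hexdigest()


def first_difference(a: bytes, b: bytes) -> int:
    """Return the index of the first differing byte, or -1 if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the overlapping part: either the bodies are equal,
    # or one is a strict prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))
```

To use it, read the curl output with open('page', 'rb').read(), read the urlopen response with html.read(), and pass both bytes objects to the helpers. Equal digests mean the two tools received the same body.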
Debian Chinese Forum - forums.debiancn.org
All Debian GNU/Linux users are welcome

#3

Post by onlylove » 2016-12-12 23:10

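vickycq wrote: Please paste all the relevant output here and describe the problem concretely.

As shown below, there's no problem; the two are identical.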
Don't be so sure. Try the Baidu homepage.

#4

Post by b33e » 2016-12-12 23:25

They should be the same; I don't quite follow what you mean.

#5

Post by b33e » 2016-12-12 23:35

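Then Baidu must have changed something. I remember that when I first started learning, the Baidu homepage was exactly what I scraped.
The requests library still works, though.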
In [1]: import requests

In [2]: r=requests.get('https://www.baidu.com')

In [3]: r.encoding='utf-8'

In [4]: r.text
Out[4]: '<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2 ... title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?lo ... z_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');\r\n </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2016&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
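In [3] sets the encoding by hand. With urllib the body arrives as bytes, so a comparable step is to read the charset from the Content-Type header instead of hard-coding 'utf-8'. This is a rough sketch using only the standard library; the function names are made up for illustration, and the utf-8 fallback is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present,
    # in which case we fall back to the (assumed) default.
    return msg.get_content_charset() or default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body using the charset the server declared."""
    charset = charset_from_content_type(content_type)
    # errors="replace" substitutes undecodable bytes instead of raising.
    return body.decode(charset, errors="replace")
```

On a live urlopen response, the headers attribute is already a message object, so resp.headers.get_content_charset() gives the same answer without parsing anything yourself.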

#6

Post by lzhp1501 » 2016-12-13 14:28

Yes, just as the other reply said, curl on Baidu and urlopen on Baidu return different results.




#7

Post by lzhp1501 » 2016-12-13 19:29

Right, what urlopen returns is a great deal longer than what curl returns.


#8

Post by onlylove » 2016-12-13 19:59

This is roughly what's going on.
Attachment: 0229.jpg

#9

Post by lzhp1501 » 2016-12-13 20:33

Exactly, what Python's urlopen returns looks just like the picture.

What curl returns is the plain HTML content.

So how do I make the Python result the same as curl's?

#10

Post by onlylove » 2016-12-16 22:31

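Try the requests library. It seems to be because urllib does not interpret JS scripts. (I don't use Python much, so I don't know whether urllib has a way to support JS, and I'm not sure this guess is right.)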
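Neither curl nor requests runs JavaScript either, so JS handling is unlikely to be the whole story. One plausible cause, offered as an assumption and not confirmed in this thread, is that the server picks a different page based on request headers such as User-Agent. urllib identifies itself as Python-urllib/3.x by default. The sketch below sends a curl-style User-Agent through urllib; build_request, fetch, and the version in CURL_LIKE_UA are made up for illustration.

```python
from urllib.request import Request, urlopen

# Assumption: an illustrative curl-style User-Agent string. Run `curl --version`
# to see which version your own curl reports.
CURL_LIKE_UA = "curl/7.50.1"


def build_request(url: str, user_agent: str = CURL_LIKE_UA) -> Request:
    """Build a GET request that identifies itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent, "Accept": "*/*"})


def fetch(url: str, user_agent: str = CURL_LIKE_UA, timeout: float = 10.0) -> bytes:
    """Fetch the raw response body using the custom headers."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.read()
```

Calling fetch('https://www.baidu.com/') and comparing the bytes with the curl output would show whether the header accounts for the difference. If the bodies still differ, other headers or cookies may be involved.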

#11

Post by lzhp1501 » 2016-12-19 11:39

Thanks! In the end I went with requests, and the result is exactly the same.