Parsing a web page with BeautifulSoup in Python

opp · #1

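First off, I'm new to programming. Python happened to be the first language I came across, so I started with it, mainly because it seemed relatively easy to learn. I mention this only because I want to ask a question, not start another debate about which language I should use instead. I'll move on to other languages once I have the basics of this one down.
The problem: we regularly track ship movements, and opening the web page over and over gets tedious. From the site below I need to pick out the "Chinese ship name" (中文船名) of every entry whose "cargo to unload" (卸货名称) is "iron ore" (铁矿砂). The Chinese ship name is a link, and I need the information in the remarks (备注) on the page it opens. Clicking through by hand every time is a real chore.
Right now I'm stuck right after calling BeautifulSoup(html): I don't know how to pull the data out. Any pointers would be much appreciated, and thanks in advance. The page source doesn't look complicated, but I just can't get it parsed.
The output format I want is: ship name, cargo name, remarks
http://www.porttrade.net/workinfo/PrintMD.aspx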

oneleaf · #2

Just use regular expressions.

opp · #3

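oneleaf wrote: Just use regular expressions.

Even oneleaf showed up, thanks for dropping by!
The HTML tags here are broken across many lines, so they're not as easy to handle as the usual kind. I'll dig into it some more and try again.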

oneleaf · #4

Regular expressions can match across line breaks: pass re.DOTALL so that "." also matches newlines.
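
As a minimal sketch of what re.DOTALL changes: by default "." does not match a newline, so a pattern cannot cross from one line of a tag to the next. The HTML fragment and the helper name cell_contents below are made up for illustration and are not taken from the real page.

```python
import re

# Hypothetical fragment: one table cell split across several lines.
SAMPLE = "<td>\n  铁矿砂\n</td>"

# Non-greedy, so each match stops at the first closing </td>.
CELL = r"<td>(.*?)</td>"

def cell_contents(html, flags=0):
    """Return the stripped text of every <td>...</td> the pattern finds."""
    return [m.strip() for m in re.findall(CELL, html, flags)]

print(cell_contents(SAMPLE))             # "." stops at the newline, nothing is found
print(cell_contents(SAMPLE, re.DOTALL))  # "." crosses newlines, the cell is found
```

The non-greedy ".*?" matters as much as the flag: with DOTALL a greedy ".*" would run on to the last closing tag in the whole page.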

zhw2101024 · #5

Has the OP tried Python's htmlparser? I haven't used it myself, but after all these years Python ought to have a decent HTML parsing library.

opp · #6

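oneleaf wrote: Regular expressions support matching across lines. re.DOTALL

Maybe I'm just slow, but once newlines are ignored the pattern pulls in a lot of unrelated content. Since this is a table, every tag looks the same, and I still can't work out how to pick only the rows whose 卸货名称 (cargo to unload) is 铁矿砂 (iron ore). Only once that step works can I think about getting the href of the link that comes before it. I really can't think of a way.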
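
One way around the unrelated matches is to work in two steps: cut the page into rows with a non-greedy pattern first, then test the cells of each row separately, so a match can never run from one row into the next. The sketch below uses made-up markup. The real page's column layout is not shown in this thread, so the cell positions (ship link in the first cell, cargo in the second) and the name ships_with_cargo are assumptions.

```python
import re

# Hypothetical markup; tags are split across lines as described in the thread.
SAMPLE = """
<table>
<tr>
  <td><a href="ship1.aspx">
      Ship A</a></td>
  <td>
      铁矿砂
  </td>
</tr>
<tr>
  <td><a href="ship2.aspx">Ship B</a></td>
  <td>煤炭</td>
</tr>
</table>
"""

FLAGS = re.DOTALL | re.IGNORECASE
ROW = re.compile(r"<tr\b.*?</tr>", FLAGS)
CELL = re.compile(r"<td\b[^>]*>(.*?)</td>", FLAGS)
LINK = re.compile(r'<a\b[^>]*href="([^"]*)"[^>]*>(.*?)</a>', FLAGS)

def ships_with_cargo(html, cargo):
    """Return (ship name, href) for rows whose second cell contains cargo."""
    found = []
    for row in ROW.findall(html):
        cells = CELL.findall(row)
        # Assumed layout: cell 0 holds the ship link, cell 1 the cargo name.
        if len(cells) < 2 or cargo not in cells[1]:
            continue
        link = LINK.search(cells[0])
        if link:
            found.append((link.group(2).strip(), link.group(1)))
    return found

print(ships_with_cargo(SAMPLE, "铁矿砂"))
```

This only holds up while the table has no nested tables inside its rows; for anything messier a real HTML parser is the safer choice.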

opp · #7

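zhw2101024 wrote: Has the OP tried Python's htmlparser? I haven't used it myself, but after all these years Python ought to have a decent HTML parsing library.

I haven't tried that one. These days BeautifulSoup is the usual choice for parsing tags, but this page's source uses so many tags, all broken across lines, that it's not as easy to extract from as an ordinary page.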
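
For what it's worth, line breaks inside tags are less of a problem for BeautifulSoup than they look, because it works on the parsed tree rather than on the raw text. A rough sketch with the bs4 package, on made-up markup; the column positions and the name filter_rows are assumptions, not the real page's layout.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; tags are split across lines as described in the thread.
SAMPLE = """
<table>
<tr>
  <td><a href="ship1.aspx">
      Ship A</a></td>
  <td>
      铁矿砂
  </td>
</tr>
<tr>
  <td><a href="ship2.aspx">Ship B</a></td>
  <td>煤炭</td>
</tr>
</table>
"""

def filter_rows(html, cargo):
    """Return (ship name, href) for rows whose second cell contains cargo."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue
        # get_text(strip=True) drops the whitespace left by the line breaks.
        if cargo not in cells[1].get_text(strip=True):
            continue
        link = cells[0].find("a")
        if link is not None and link.has_attr("href"):
            found.append((link.get_text(strip=True), link["href"]))
    return found

print(filter_rows(SAMPLE, "铁矿砂"))
```

The idea is to loop over rows first and look at the cells of one row at a time, instead of searching the whole document for a cell that says 铁矿砂 and then trying to walk back to the link.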

buntutu · #8

I'd suggest the lxml module, using XPath to extract the content. XPath is powerful:

Code:

# Python 3; the original post used Python 2's urllib2.
from urllib.request import urlopen
from lxml.html import fromstring

URL = "http://www.porttrade.net/workinfo/PrintMD.aspx"

def get_ships(url):
    # Download the page and decode it as GBK, replacing undecodable bytes.
    with urlopen(url) as fd:
        content = fd.read().decode("GBK", "replace")

    doc = fromstring(content)
    # Each ship name is an <a> inside a data row of the DataGrid2 table.
    anchors = doc.xpath('//table[@id="DataGrid2"]//tr[@class="black_9"]//a')
    names = [x.text_content().strip() for x in anchors]
    for n in names:
        print(n)
    return names

get_ships(URL)

buntutu · #9

Oh, so the cargo has to be specified as well. That makes it a bit more complicated.

Code:

#!/usr/bin/env python3
# vim:fileencoding=utf-8:sw=4:et
# Python 3; the original post used Python 2's urllib2.

from urllib.request import urlopen
from lxml.html import fromstring

URL = "http://www.porttrade.net/workinfo/PrintMD.aspx"

def cell_text(anchor, offset):
    # Text of the offset-th <td> after the cell holding the anchor, or "" if missing.
    td = anchor.xpath("./ancestor::td[1]/following-sibling::td[%d]" % offset)
    return td[0].text_content().strip() if td else ""

def get_ships_for_catalog(url, cata):
    # Download the page and decode it as GBK, replacing undecodable bytes.
    with urlopen(url) as fd:
        content = fd.read().decode("GBK", "replace")

    doc = fromstring(content)
    # Turn relative hrefs into absolute URLs so the links can be fetched later.
    doc.make_links_absolute(url)
    anchors = doc.xpath('//table[@id="DataGrid2"]//tr[@class="black_9"]//a')

    # Collect the ship name, the two columns checked for the cargo, and the link.
    ship_infos = []
    for anchor in anchors:
        cata1 = cell_text(anchor, 5)
        cata2 = cell_text(anchor, 7)
        ship_name = anchor.text_content().strip()
        link = anchor.get("href", "")
        ship_infos.append([ship_name, cata1, cata2, link])

    # Keep only the ships where either column mentions the requested cargo.
    names = [(x[0], x[3]) for x in ship_infos if cata in x[1] or cata in x[2]]
    for n in names:
        print(": ".join(n))
    return names

def main():
    get_ships_for_catalog(URL, "铁矿砂")

if __name__ == '__main__':
    main()
