[python][Solved] How do I turn relative links into absolute paths?
Posted: 2009-02-26 17:17
I extracted the following snippet from the page http://www.opensolaris.org/os/community/on/fla ... ll/
All the links in it are relative paths, and I want to convert them to absolute paths:
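HTML is not a regular language, so regular expressions are not the best tool for this; an HTML parser is a better fit.
I recommend Beautiful Soup: a few lines of code take care of it. Haha~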
Code:
<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="../pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href="/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br /></td></tr>
Code:
<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="http://www.opensolaris.org/os/community/on/flag-days/all//pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br /></td></tr>
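For reference, here is a minimal sketch of how the standard library's urljoin resolves the three hrefs from the snippet. It assumes the page address is the prefix that appears in the desired output (http://www.opensolaris.org/os/community/on/flag-days/all/); the variable names are only illustrative. Note that "../" climbs one directory, so under this assumption the first link resolves to .../flag-days/pages/2009022501/ rather than .../flag-days/all//pages/2009022501/.

```python
from urllib.parse import urljoin

# Assumed page address, taken from the prefix in the desired output above.
page_url = "http://www.opensolaris.org/os/community/on/flag-days/all/"

# The three href values found in the extracted snippet.
hrefs = [
    "../pages/2009022501/",
    "/os/community/arc/caselog/2008/777/",
    "/os/community/arc/caselog/2008/663/",
]

# urljoin applies the usual URL resolution rules:
# "../" goes up one directory, a leading "/" restarts from the site root.
resolved = [urljoin(page_url, h) for h in hrefs]
for url in resolved:
    print(url)
```

The two hrefs that start with "/" only keep the scheme and host of the page address, so they come out the same for any page on that site.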
Code:
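from urllib.parse import urljoin
from bs4 import BeautifulSoup

# html_text is assumed to already hold the HTML snippet shown above.
# The base should be the address of the page itself, so that "../" resolves correctly.
base_url = "http://www.opensolaris.org/os/community/on/flag-days/all/"
soup = BeautifulSoup(html_text, "html.parser")
# Collect the href of every <a> tag that has one.
links = [a["href"] for a in soup.find_all("a", href=True)]
for href in set(links):
    # Match the whole attribute (double-quoted, as in the snippet) so other text is left alone.
    html_text = html_text.replace('href="%s"' % href, 'href="%s"' % urljoin(base_url, href))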
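The code above finds the links with a parser but still patches the raw HTML string. As an alternative, here is a rough sketch that rewrites the href while parsing, so nothing depends on string matching. It is not the Beautiful Soup approach: it swaps in the standard library's html.parser so that it runs without extra packages. The names LinkAbsolutizer and absolutize_links are made up for this example, and the base address is the same assumed page address as in the desired output.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAbsolutizer(HTMLParser):
    """Copy the HTML through, resolving each <a href> against base_url."""

    def __init__(self, base_url):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _emit_start(self, tag, attrs, closing):
        if tag != "a" or not any(name == "href" for name, _ in attrs):
            # Tags other than links are copied exactly as they appeared.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute written without a value
                parts.append(name)
                continue
            if name == "href":
                value = urljoin(self.base_url, value)
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def absolutize_links(html_text, base_url):
    parser = LinkAbsolutizer(base_url)
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


# A shortened version of the snippet from the question.
snippet = ('<td><a href="../pages/2009022501/">Flag Day</a><br />'
           '<a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a></td>')
result = absolutize_links(
    snippet, "http://www.opensolaris.org/os/community/on/flag-days/all/")
print(result)
```

Only the <a> tags that carry an href are rebuilt; every other start tag is passed through untouched, so the rest of the markup should come out as it went in (end tags are written back in lowercase).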