All times are UTC + 8 hours



Post #1
 Post subject: Repost: reading RSS with the bash shell
Posted: 2009-06-11 10:22

Joined: 2007-11-10 8:57
Posts: 198
Thanks given: 0
Thanks received: 0
Tapping RSS with Shell Scripts

If you're like me, you want to keep up with the latest news and information. Shell scripts help me do just that. In this article I'll show you how I wrote a shell script that watches the news at Slashdot.org and automatically shows me the latest story headlines every time I launch a Terminal application.

First Things First

Before any shell script work begins, the first step is to figure out the URL of the RSS page on Slashdot.

TIP: RSS is Really Simple Syndication, an XML-format data stream that's much more easily parsed and tracked than HTML pages, at least programmatically.

The Slashdot home page doesn't make it particularly easy to find, but the very bottom line, the very rightmost link, is "rss", and the URL behind that link is http://slashdot.org/index.rss.

To look at it from within the Terminal, I'm going to utilize the powerful curl application, piping the output to head to ensure that I'm not drowned in output:

$ curl --silent 'http://slashdot.org/index.rss' | head
<?xml version="1.0" encoding="ISO-8859-1"?>

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"

Yes, this looks fairly scary as output goes, I admit, but with a little help from the grep utility, this can quickly become a lot more user-friendly. In this case, let's just pull out the lines that are tagged as either the <title> or the <description>:

$ curl --silent 'http://slashdot.org/index.rss' | grep -E '(title>|description>)' | head
<title>Slashdot</title>
<description>News for nerds, stuff that matters</description>
<title>Slashdot</title>
<title>Yahoo To Charge For Search Listings</title>
<description>ibi writes "Yahoo will start taking payments
to "tilt the playing field" for companies that want their
listings given more prominence by Yahoo's search engine. ...</description>
<title>Infinium Labs Threatens HardOCP Again</title>
<description>XBox4Evr writes "In a follow-up from two weeks ago,
Infinium Labs is again threatening the tech web site HardOCP
with legal action. This in itself, is no big ...</description>
<title>SCO Postpones Lawsuit, Now Threatening Two</title>
<description>zzxc writes "In a surprise turn of events, SCO says
that they need more time to prepare an announcement of who
they are going to sue. According to SCO, the ...</description>
<title>Gyroscopic Wireless Mouse</title>

Not bad. In fact, that's really almost all we need. So let's turn this into a shell script.
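
The filtering step can also be tried offline. The following is a minimal sketch, assuming a tiny made-up feed fragment in place of the live Slashdot data; the function names sample_feed and extract_fields and the example.com link are illustrative only and are not part of the original script.

```shell
#!/bin/sh
# sample_feed: made-up stand-in for the real feed, so no network is needed.
sample_feed() {
cat <<'EOF'
<channel>
<title>Slashdot</title>
<link>http://slashdot.org/</link>
<description>News for nerds, stuff that matters</description>
</channel>
<item>
<title>Gyroscopic Wireless Mouse</title>
<link>http://example.com/story</link>
</item>
EOF
}

# extract_fields: keep only the lines that carry a title or description tag.
extract_fields() {
  grep -E '(title>|description>)'
}

sample_feed | extract_fields
```

Only the three title and description lines survive; the channel, item, and link lines are dropped.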

Headlines Only

Turning this command line into a shell script is a breeze: just open up your favorite Terminal command-line editor. (I use vi, but I've been trapped in Unix since 1980, so it's already subverted my neural pathways. You might prefer pico or even BBEdit or similar.) Whichever you choose, type in the following, a standard shell script preamble:

#!/bin/sh

This tells the operating system that when this particular file is executed, it should be given to the shell (sh) to be run. Then let's create a variable that contains the URL:

url="http://slashdot.org/index.rss"

Now we can reference $url and the entire script has become more portable and easily modified. The next line is the entire command:

curl --silent "$url" | grep -E '(title>|description>)'

NOTE: If you get a "command not found" error with curl, you might need to specify a full path. In Panther (Mac OS X 10.3), the curl command can be found at /usr/bin/curl in standard installations.

This script produces the output already seen, so let's make two tweaks to make it more useful. First, the first three lines of output (the Slashdot title and description, plus a second Slashdot title) never change, so we may as well strip them out. This can be done in a variety of ways, but I'm going to turn to the sed command, which has many hidden powers. One of them is the '-n' flag, which stops sed from printing its input by default. The value of this? We can then give sed an address and a 'p' (print) command and output only the lines we select. Here '4,$p' prints from line 4 through the last line:

curl --silent "$url" | grep -E '(title>|description>)' | \
sed -n '4,$p'
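
To see the line-range trick in isolation, here is a small sketch that feeds sed five numbered lines instead of the feed. The function name skip_header is made up for this illustration.

```shell
#!/bin/sh
# skip_header: print from line 4 to the end of input, dropping lines 1-3.
skip_header() {
  sed -n '4,$p'
}

# Five lines go in; only line4 and line5 come out.
printf 'line1\nline2\nline3\nline4\nline5\n' | skip_header
```

The same effect is available from tail -n +4, which starts printing at line 4; sed is kept here because the rest of the pipeline already relies on it.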

Notice the trailing backslash here: rather than have our command pipe stretch longer and longer, the backslash (which must be the very last character on the line) lets me wrap the command across multiple lines and makes it generally more readable.

We're getting close to trying the script. The only other tweak worth making is to strip out the <title>, </title>, <description>, and </description> tags themselves. This too can be done with sed, in a typically Unix-y fashion:

curl --silent "$url" | grep -E '(title>|description>)' | \
sed -n '4,$p' | \
sed -e 's/<title>//' -e 's/<\/title>//' -e 's/<description>/  /' \
-e 's/<\/description>//'

The XML tags are effectively stripped out, except the <description> tag is replaced by two spaces, just for formatting. The result, assuming you've saved this as slash-rss.sh, as I have:

$ sh slash-rss.sh | head -4
Yahoo To Charge For Search Listings
ibi writes "Yahoo will start taking payments to "tilt the
playing field" for companies that want their listings given more
prominence by Yahoo's search engine. ...
Infinium Labs Threatens HardOCP Again
XBox4Evr writes "In a follow up from two weeks ago, Infinium Labs
is again threatening the tech web site HardOCP with legal action. This in
itself, is no big ...

This shows the top two stories (4 lines = two titles + two descriptions). Not bad. Not beautiful, but certainly functional for a first script.
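
The tag-stripping step can be checked on its own with a couple of hand-written lines. This is a sketch under assumptions: the function name strip_tags and the sample description text are invented for the example, and the substitutions mirror the sed expressions used in the script.

```shell
#!/bin/sh
# strip_tags: remove the title and description tags, replacing the opening
# description tag with two spaces so descriptions end up indented.
strip_tags() {
  sed -e 's/<title>//' -e 's/<\/title>//' \
      -e 's/<description>/  /' -e 's/<\/description>//'
}

printf '%s\n' '<title>Gyroscopic Wireless Mouse</title>' \
              '<description>A sample description.</description>' | strip_tags
```

If the escaped slashes are hard to read, sed accepts another delimiter, so 's|</title>||' is equivalent to 's/<\/title>//'.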

I always spend way too much time fine-tuning scripts to get just the output I want, so let's continue working on this to ensure that the output is more readable, shall we? It's so easy, you'll be amazed:

curl --silent "$url" | grep -E '(title>|description>)' | \
sed -n '4,$p' | \
sed -e 's/<title>//' -e 's/<\/title>//' -e 's/<description>/  /' \
-e 's/<\/description>//' | \
fmt

The results, piped through head again:

$ sh slash-rss.sh | head
Yahoo To Charge For Search Listings
ibi writes "Yahoo will start taking payments to "tilt the playing
field" for companies that want their listings given more prominence
by Yahoo's search engine. ...
Infinium Labs Threatens HardOCP Again
XBox4Evr writes "In a follow up from two weeks ago, Infinium
Labs is again threatening the tech web site HardOCP with legal
action. This in itself, is no big ...
SCO Postpones Lawsuit, Now Threatening Two
zzxc writes "In a surprise turn of events, SCO says that they

The problem now is that the head needs to be between the sed invocations and the fmt command, since we have no way of knowing how many lines each description is going to produce when fed through fmt. The solution is to build the next generation of this script!
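
A quick way to convince yourself of this is to count lines before and after fmt. The sketch below uses one long made-up line in place of a description; long_line and count_lines are hypothetical helper names, and the exact number of lines fmt produces depends on its default width, so none is claimed here.

```shell
#!/bin/sh
# long_line: one 200-character line of sample text ("word " repeated 40 times).
long_line() {
  i=0
  while [ "$i" -lt 40 ]; do
    printf 'word '
    i=$((i + 1))
  done
  printf '\n'
}

# count_lines: number of lines on stdin, without the padding some wc versions add.
count_lines() {
  wc -l | tr -d ' '
}

echo "before fmt: $(long_line | count_lines) line(s)"
echo "after fmt:  $(long_line | fmt | count_lines) line(s)"
```

One input line turns into several output lines, which is why a line count applied after fmt no longer corresponds to a number of stories.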

Headlines, As Many As You Want

The obvious solution is to add a command flag that lets you specify how many headlines you want: multiply it by two and you'll know what value to feed head within the script. Here's how that looks as part of a shell script ($# is the number of arguments and $1 is the first argument):

#!/bin/sh

url="http://slashdot.org/index.rss"

if [ $# -eq 1 ] ; then
  headarg=$(( $1 * 2 )) # $(( )) is arithmetic expansion, so -1 becomes -2
else
  headarg="-8" # default is four headlines
fi

curl --silent "$url" | grep -E '(title>|description>)' | \
sed -n '4,$p' | \
sed -e 's/<title>//' -e 's/<\/title>//' -e 's/<description>/  /' \
-e 's/<\/description>//' | \
head $headarg | fmt

Now I can specify that I only want the top headline, the newest entry on the Slashdot site, by simply specifying '-1' when I invoke the script:

$ sh slash-rss.sh -1
Yahoo To Charge For Search Listings
ibi writes "Yahoo will start taking payments to "tilt the playing
field" for companies that want their listings given more prominence
by Yahoo's search engine. ...

That's pretty cool, I think. I could tweak it forever, but let's stop here and see how to turn this into a Unix command just like ls and cd.
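
The script above trusts whatever it is given as its first argument. A slightly more defensive variant is sketched below, assuming you want to accept only flags from -1 to -99 and fall back to the default otherwise; the function name head_lines is hypothetical and is not part of the original script.

```shell
#!/bin/sh
# head_lines: turn a "-N" flag (N headlines) into the number of lines to keep.
# Anything that is not -1 through -99 falls back to 8 lines (four headlines).
head_lines() {
  case "${1-}" in
    -[1-9]|-[1-9][0-9]) echo $(( ${1#-} * 2 )) ;;
    *) echo 8 ;;
  esac
}

echo "flag -1  -> $(head_lines -1) lines"
echo "no flag  -> $(head_lines) lines"
```

In the pipeline, the result would be used as head -n "$(head_lines "$1")" in place of head $headarg.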

TIP: You can download this shell script in finished form.

Turning It Into a Command

There are two ways to turn a shell script into a command: create an alias, or make the script executable and ensure it's in your PATH. If you're using Bash, you can create an alias like this:

alias slashdot="sh slash-rss.sh"

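Then you can see the headlines by just typing slashdot on your command line. Because the alias uses a relative path, this works only from the directory that holds slash-rss.sh; put the script's full path in the alias to make it work everywhere.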

To make the shell script itself executable, first make sure you've saved it in a directory that's in your PATH by typing:

$ echo $PATH
/bin:/sbin:/usr/bin:/usr/sbin:/sw/bin:/usr/X11R6/bin:
/usr/local/bin:/Users/dt/bin:/sw/bin

You can see that my PATH includes /Users/dt/bin - that's where I save this script and similar. Once it's in the right place, you'll need to make it executable by using the chmod command:

$ chmod +x slash-rss.sh

Optionally, you could rename the script to be a bit more friendly, of course.
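
Rather than reading the echo $PATH output by eye, the check can be scripted. This is a minimal sketch; the function name in_path is made up, and "$HOME/bin" is only an example directory.

```shell
#!/bin/sh
# in_path DIR: succeed if DIR is one of the colon-separated entries of PATH.
in_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if in_path "$HOME/bin"; then
  echo "$HOME/bin is in PATH"
else
  echo "$HOME/bin is not in PATH"
fi
```

Wrapping PATH in colons lets the first and last entries match the same pattern as the ones in the middle.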

Finally, Having It Auto-Execute Upon Terminal Launch

If you're running the Bash shell, which you probably are if you're in Panther, then it's a breeze: move to your home directory and append an invocation of the script to your .bash_login file (Bash reads this file at login only if there is no .bash_profile):

$ cd
$ echo "sh slash-rss.sh -2" >> .bash_login

Make extra sure that you use two >>, not one, on that last command! A single > would overwrite .bash_login instead of appending to it.
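
Running that echo twice would add the line twice. One way to guard against it is sketched below, assuming you want the line appended only when it is not already present; add_login_line is a hypothetical helper, and the demo writes to a temporary file rather than your real .bash_login.

```shell
#!/bin/sh
# add_login_line FILE LINE: append LINE to FILE unless an identical line exists.
add_login_line() {
  file=$1
  line=$2
  if [ -f "$file" ] && grep -qxF -e "$line" "$file"; then
    return 0
  fi
  printf '%s\n' "$line" >> "$file"
}

# Demo on a scratch file: the second call is a no-op, so one line is printed.
demo=$(mktemp)
add_login_line "$demo" "sh slash-rss.sh -2"
add_login_line "$demo" "sh slash-rss.sh -2"
cat "$demo"
rm -f "$demo"
```

For real use you would call add_login_line "$HOME/.bash_login" "sh slash-rss.sh -2".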

Now the next time you start up a Terminal application window, you'll see:

Last login: Tue Mar 2 23:09:36 on ttyp3
Welcome to Darwin!
Yahoo To Charge For Search Listings
ibi writes "Yahoo will start taking payments to "tilt the playing
field" for companies that want their listings given more prominence
by Yahoo's search engine. ...
Infinium Labs Threatens HardOCP Again
XBox4Evr writes "In a follow up from two weeks ago, Infinium
Labs is again threatening the tech web site HardOCP with legal
action. This in itself, is no big ...
$

It's also worth noting that this use of shell scripts to parse and format XML has more applications. For example, go to http://www.casino-bookstore.com/ and have a close look at the "Latest Gambling News" box: it's using an almost identical script to keep track of the gambling news XML feed from about.com. Another example? Go to http://www.healthy-bookstore.com/ and look at the medicinenet news feed. Again, it's using curl and sed to turn the XML data into HTML data.
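
The scripts behind those sites are not shown, so the following is only a guess at the general shape: a minimal sketch that turns the title lines of a feed into an HTML list with grep and sed. The function name titles_to_html and the choice of a ul list are assumptions made for illustration.

```shell
#!/bin/sh
# titles_to_html: read feed lines on stdin and emit each title as a list item.
titles_to_html() {
  echo '<ul>'
  grep '<title>' | sed -e 's/.*<title>/<li>/' -e 's/<\/title>.*/<\/li>/'
  echo '</ul>'
}

printf '%s\n' '<title>Gyroscopic Wireless Mouse</title>' \
              '<link>http://example.com/story</link>' | titles_to_html
```

In practice the input would come from curl --silent "$url" rather than printf.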

Original article: http://www.askdavetaylor.com/can_i_track_an_rss_feed_with_a_shell_script.html
Copyright belongs to the author.


Post #2
 Post subject: Re: Repost: reading RSS with the bash shell
Posted: 2009-06-11 11:01

Joined: 2007-11-19 21:51
Posts: 6956
Location: Chengdu
Thanks given: 0
Thanks received: 4
Not bad :em11


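Post #3
 Post subject: Re: Repost: reading RSS with the bash shell
Posted: 2009-06-11 11:31

Joined: 2005-08-14 21:55
Posts: 58428
Location: Changsha
Thanks given: 4
Thanks received: 272
Let me just give you one of mine. Why go to all that trouble? This one shows the titles.

Code: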
☎ cat rss.pl
#!/usr/bin/perl

# Pipe a string through enconv and ascii2uni to normalize its encoding,
# replacing newlines with spaces.
sub cv {
        open(CV, "|/usr/bin/enconv|ascii2uni -a D -q|tr '\n' ' '") or die("Command not found: enconv.\n");
        print CV $_[0];
        close CV;
}

# Known feeds; an argument that is not a URL is matched against these in order.
@RSS=(
"http://cn.engadget.com/rss.xml",
"http://forum.ubuntu.org.cn/feed.php",
"http://linuxtoy.org/feed/",
"http://feed.feedsky.com/ldcn",
"http://www.cnbeta.com/backend.php",
"http://solidot.org/index.rss",
"http://feed.feedsky.com/lerosua",
"http://eexpress.blog.ubuntu.org.cn/feed/",
"http://yaoms.blogspot.com/feeds/posts/default?alt=rss",
"http://www.ibm.com/developerworks/cn/views/rss/customrssatom.jsp?zone_by=Linux&max_entries=10&feed_by=rss",
"http://imtx.cn/feed/latest/",
);

# With no argument, list the known feeds and exit.
if(!$ARGV[0]){
        print "All RSS feed URLs, matched in this order: ".join(" ►  ",@RSS);
        exit;
}

use LWP::UserAgent;
my $url=shift;
# An argument that is not a URL selects the first feed whose URL matches it.
if($url!~/^http/){
        foreach(@RSS){
                if($_=~/$url/) {$url=$_; goto FOUND;}
        }
        die "No matching URL in the list.\n";
}
FOUND:

my $ua=new LWP::UserAgent();
my $re= $ua->get($url);
die if (!$re->is_success);
my $html= $re->content;

$n=8;   # maximum number of entries to print
print "RSS news: ";
# Collect every RSS title and link in the page
while($html=~m{<title>(.*?)</title>.*?<link>(.*?)</link>}gsi){
        $_="► $1 --> $2 ";
        # Decode entities, with &amp; last so that &amp;lt; is not decoded twice
        s/&gt;/>/g; s/&lt;/</g; s/&quot;/"/g; s/&nbsp;/ /g; s/&amp;/&/g;
        s/<!\[CDATA\[//g; s/]]>//g; s/&p=[0-9#p]*//g;
        cv $_;
        $n--; last if ($n==0);
        if($n==4){print "\n"; sleep 1;}
}
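
For anyone who would rather stay in the shell, the entity-decoding step of the Perl script can be approximated with sed. This is a rough sketch, not a translation of the whole script: decode_entities is a made-up name, and only four common entities are handled.

```shell
#!/bin/sh
# decode_entities: decode a few common XML entities on stdin.
# &amp; is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
decode_entities() {
  sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' -e 's/&amp;/\&/g'
}

printf '%s\n' 'Tom &amp; Jerry say &quot;1 &lt; 2&quot;' | decode_entities
```

The backslash in the last expression matters: in a sed replacement, a bare & stands for the matched text, so a literal ampersand has to be written as \&.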


_________________
● 鸣学


Post #4
 Post subject: Re: Repost: reading RSS with the bash shell
Posted: 2009-06-11 12:01

Joined: 2007-11-19 21:51
Posts: 6956
Location: Chengdu
Thanks given: 0
Thanks received: 4
What happened to ee's dog? Keeping a cat now? :em04

