CROSSBOW-COOKBOOK(7)          Miscellaneous Information Manual          CROSSBOW-COOKBOOK(7)
NAME
crossbow-cookbook — cookbookish examples of crossbow(1) usage
DESCRIPTION
This manual page contains short recipes demonstrating how to use the crossbow feed aggregator.
We want a periodic notification, via local mail, of the availability of new stories on a site.
The configuration in crossbow.conf(5) would look like this:
    feed debian_micro
        url https://micronews.debian.org/feeds/feed.rss
        format %ft: %l\n
The invocation of crossbow(1) will emit on stdout(3) a line like the following for each new item:
    Debian micronews: https://micronews.debian.org/....html
By placing the following line in a crontab(5), a check for updates will run automatically every two hours:
    0 0-23/2 * * * crossbow
Assuming that local mail delivery is enabled, and since the output of a cronjob is mailed to the owner of the crontab(5), the user will receive a mail with one line for each entry that appeared in the last two hours.
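If cron(8) on a given system does not mail job output to the crontab owner, a comparable result can be obtained by piping the output to mail(1) explicitly. The following line is only a sketch: it assumes the ifne(1) utility from moreutils, which runs its command only when stdin is not empty (so no message is sent when there are no new entries), and the recipient address is a placeholder.

    # Illustrative variant: ifne(1) and the address are assumptions, not crossbow features.
    0 0-23/2 * * * crossbow | ifne mail -s "crossbow updates" user@example.org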
Let's consider a feed whose XML reports the whole article for each entry. We want to store each article in a separate file, under a specific directory on the filesystem.
The configuration in crossbow.conf(5) would look like this:
    feed cosmic.voyage
        url gopher://cosmic.voyage:70/0/atom.xml
        handler pipe
        command sed -n w%n.txt
        chdir ~/scifi_stories/cosmic.voyage/
The invocation of crossbow(1) will spawn one sed(1) process for each new entry. The content, corresponding to the %d placeholder, will be piped to the subprocess. This in turn will write it to the specified file (the w command) while emitting nothing on stdout(3) (the -n flag).

As a result, the ~/scifi_stories/cosmic.voyage directory will be populated with files named 000000.txt, 000001.txt, 000002.txt and so on, since %n is expanded to an incremental numeric value. See crossbow-format(5).
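After a few runs, the target directory might look as follows (an illustrative listing, using the numbering scheme described above):

    $ ls ~/scifi_stories/cosmic.voyage/
    000000.txt  000001.txt  000002.txt  000003.txt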
Security remark: unless the feed is trusted, naming filesystem paths after entry properties other than %n is strongly discouraged. Consider for example the case where %t is used as a file name, and the title of a post is something like ../../.profile. %n is safe to use, since its value does not depend on the feed content.
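If title-based file names are desired anyway, the title should be sanitized by a small wrapper before it ever reaches the filesystem. The following script is only a sketch: its name (crossbow-save-by-title) and argument order are made up for this example and are not part of crossbow.

    #!/bin/sh
    # Illustrative sketch: store stdin under a sanitized, title-derived name.
    # Usage: crossbow-save-by-title <title> <incremental id>  (hypothetical helper)
    set -e

    title="${1:?missing title}"
    id="${2:?missing incremental id}"

    # Reduce the title to a safe slug: drop new-lines, turn slashes,
    # punctuation and blanks into dots, squeeze repetitions.
    slug="$(printf %s "$title" | tr -d '\n' | tr '/[:punct:][:space:]' . | tr -s .)"

    # Prefix the incremental id so names stay unique even for identical titles.
    cat > "${id}_${slug}.txt"

With a pipe handler it could then be referenced as command crossbow-save-by-title %t %n, again purely as an illustration.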
This scenario is similar to the previous one, but it tackles the situation where the feed entry does not carry the full content, while the entry's link field contains a valid URL, meant to be opened with a web browser.
In this case we can leverage curl(1) to do the retrieval:
    feed debian_micro
        url https://micronews.debian.org/feeds/feed.rss
        handler exec
        command curl -o %n.html %l
        chdir ~/debian_micronews/
The "%n" and "%l" placeholders do not need to be quoted: they are handled safely even when their expansions contain white spaces. See crossbow-format(5).
It is of course possible to use any non-interactive download manager in place of curl(1), or maybe a specialized script that fetches the entry link and scrapes the content out of it.
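As a sketch of such a specialized script, the following fetches the linked page with lynx(1) and stores a plain-text rendering of it. The script name (crossbow-fetch-text) and its argument order are illustrative, not part of crossbow.

    #!/bin/sh
    # Illustrative sketch: fetch an entry link and store a plain-text rendering.
    # Usage: crossbow-fetch-text <link> <incremental id>  (hypothetical helper)
    set -e

    link="${1:?missing link}"
    id="${2:?missing incremental id}"

    # lynx -dump retrieves the page and renders it as plain text.
    lynx -dump "$link" > "${id}.txt"

It would be plugged into the configuration just like curl(1) above, for instance as command crossbow-fetch-text %l %n with an exec handler.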
We want to turn individual feed entries into plain (HTML-free) text messages, and deliver them via email.
Our goal can be achieved by means of a generic shell script like the following:
    #!/bin/sh
    set -e

    feed_title="$1"
    post_title="$2"
    link="$3"

    lynx "${link:--stdin}" -dump -force_html |
        sed "s/^~/~~/" | # Escape dangerous tilde expressions
        mail -s "${feed_title:+${feed_title}: }${post_title:-...}" "${USER:?}"
The script can be installed in the PATH, e.g. as /usr/local/bin/crossbow-to-mail, and then integrated in crossbow(1) as follows:
    feed debian_micro
        url https://micronews.debian.org/feeds/feed.rss
        handler pipe
        command crossbow-to-mail %ft %t

    feed lobsters.c
        url https://lobste.rs/t/c.rss
        handler exec
        command crossbow-to-mail %ft %t %l
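The script can also be tried by hand before relying on the configuration; the following one-liner, given only as an example, feeds it a trivial HTML fragment on stdin and should result in a message in the invoking user's local mailbox (assuming local delivery works):

    printf '<p>Hello from crossbow</p>' | crossbow-to-mail 'Test feed' 'Test post'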
Note: The crossbow-to-mail script leverages lynx(1) to download and parse the HTML into textual form. Any other tool capable of rendering HTML as plain text could be used in its place.
Security remark: The "s/^~/~~/" sed(1) expression prevents accidental or malicious tilde escapes from being interpreted by the mail(1) program. The mutt(1) mail user agent, if available, can be used as a safer drop-in replacement.
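Concretely, such a drop-in substitution would only touch the last stage of the pipeline, along these lines (a possible variant, assuming mutt(1) sends in batch mode when its input is not a terminal):

    lynx "${link:--stdin}" -dump -force_html |
        sed "s/^~/~~/" |
        mutt -s "${feed_title:+${feed_title}: }${post_title:-...}" "${USER:?}"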
Many sites specialized in multimedia delivery can be scraped using tools such as youtube-dl(1). If the website offers a feed to subscribe to, crossbow(1) can be combined with these tools in order to incrementally maintain a local collection of files.
For example, YouTube provides feeds for users, channels and playlists. Each of these entities is assigned a unique identifier, which can easily be figured out by looking at the web URL.
What follows is a convenient wrapper script that ensures proper file naming (although it is always wiser to use %n, as explained above):
    #!/bin/sh

    link="${1:?missing link}"
    incremental_id="${2:?missing incremental id}"
    format="$3"

    # Transform a title in a reasonably safe 'slug'
    slugify() {
        tr -d \\n |                      # explicitly drop new-lines
        tr /[:punct:][:space:] . |       # turn all sly chars into dots
        tr -cs [:alnum:]                 # squeeze repetitions
    }

    fname="$(
        youtube-dl \
            --get-filename \
            -o "%(id)s_%(title)s.%(ext)s" \
            "$link"
    )" || exit 1

    youtube-dl \
        ${format:+-f "$format"} \
        -o "$(printf %s_%s "$incremental_id" "$fname" | slugify)" \
        --no-progress \
        "$link"
Once again, the script can be installed in the PATH, e.g. as /usr/local/bin/crossbow-ytdl, and then integrated in crossbow(1) as follows:
    feed computerophile
        url https://youtube.com/feeds/videos.xml?user=Computerphile
        handler exec
        command crossbow-ytdl %l %n

    feed nodumb
        url https://youtube.com/feeds/videos.xml?channel_id=UCVnIvJuTZqM5nnwGFpA57_Q
        handler exec
        command crossbow-ytdl %l %n
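The wrapper can also be exercised by hand before scheduling anything; in the example below the video URL is a placeholder and the optional third argument selects a youtube-dl(1) format:

    crossbow-ytdl 'https://www.youtube.com/watch?v=XXXXXXXXXXX' 000042 bestaudio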
SEE ALSO
crossbow(1), lynx(1), sed(1), youtube-dl(1), crontab(5), cron(8)
AUTHORS
Giovanni Simoni <dacav@fastmail.com>
October 9, 2021