CROSSBOW(7) Miscellaneous Information Manual (urm) CROSSBOW(7)

crossbow-cookbook - cookbookish examples of crossbow(1) usage

This manual page contains short recipes demonstrating how to use the crossbow feed aggregator.

  1. Simple local mail notification
  2. Incremental files collection
  3. Download the full article
  4. One mail per entry
  5. Maintain a local multimedia collection

We want a periodic notification, via local mail, of the availability of new stories on a site.

The configuration in crossbow.conf(5) would look like this:

feed debian_micro
  url https://micronews.debian.org/feeds/feed.rss
  format %ft: %l\n

The invocation of crossbow(1) will emit on stdout(3) a line like the following for each new item:

Debian micronews: https://micronews.debian.org/....html

By placing the following line in a crontab(5), a check for updates will run automatically every two hours:

0 0-23/2 * * * crossbow

Assuming that local mail delivery is enabled, and since the output of a cronjob is mailed to the owner of the crontab(5), the user will receive a mail with one line for each entry that appeared in the last two hours.

Let's consider a feed whose XML carries the whole article for each entry. We want to store each article in a separate file, under a specific directory on the filesystem.

The configuration in crossbow.conf(5) would look like this:

feed cosmic.voyage
  url gopher://cosmic.voyage:70/0/atom.xml
  handler pipe
  command sed -n w%n.txt
  chdir ~/scifi_stories/cosmic.voyage/

The invocation of crossbow(1) will spawn one sed(1) process for each new entry. The content, corresponding to the %d placeholder, will be piped to the subprocess. This in turn will write it to the specified file (w command), but not to stdout(3) (-n flag).

As a result, the ~/scifi_stories/cosmic.voyage directory will be populated with files named 000000.txt, 000001.txt, 000002.txt, and so on, since %n expands to an incremental numeric value. See crossbow-format(5).

Security remark: unless the feed is trusted, naming filesystem paths after entry properties other than %n is strongly discouraged. Consider for example the case where %t is used as a file name, and the title of a post is something like ../../.profile. %n is safe to use, since its value does not depend on the feed content.
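
The risk described above can be made concrete with a small check. What follows is a minimal sketch, not part of crossbow(1): the is_safe_name function is a hypothetical helper that rejects any name able to escape the target directory. It only looks at path structure, so a name it accepts may still be undesirable for other reasons, and %n remains the safer choice.

```shell
#!/bin/sh

# Hypothetical helper (an assumption for illustration, not shipped with
# crossbow): succeed only if the name cannot refer to a path outside
# the current directory.
is_safe_name() {
    case "$1" in
        ''|.|..|*/*) return 1 ;;    # empty, dot entries, or contains a slash
    esac
    return 0
}

# A %n-style name passes, a hostile %t-style title does not.
for name in 000001.txt ../../.profile; do
    if is_safe_name "$name"; then
        echo "accepted: $name"
    else
        echo "rejected: $name"
    fi
done
```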

This scenario is similar to the previous one, but it covers the case where the feed entry does not contain the full content, and the entry's link field instead holds a valid URL meant to be opened in a web browser.

In this case we can leverage curl(1) to do the retrieval:

feed debian_micro
  url https://micronews.debian.org/feeds/feed.rss
  handler exec
  command curl -o %n.html %l
  chdir ~/debian_micronews/

The "%n" and "%l" placeholders do not need to be quoted: they are handled safely even when their expansions contain whitespace. See crossbow-format(5).

It is of course possible to use any non-interactive download manager in place of curl(1), or maybe a specialized script that fetches the entry link and scrapes the content out of it.

We want to turn individual feed entries into plain (HTML-free) text messages, and deliver them via email.

Our goal can be achieved by means of a generic shell script like the following:

#!/bin/sh

set -e

feed_title="$1"
post_title="$2"
link="$3"

lynx "${link:--stdin}" -dump -force_html |
    sed "s/^~/~~/" |    # Escape dangerous tilde expressions
    mail -s "${feed_title:+${feed_title}: }${post_title:-...}" "${USER:?}"
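
The subject line in the script relies on two shell parameter expansions: ${var:+word} yields word only when var is set and non-empty, while ${var:-word} falls back to word when var is unset or empty. As an illustration, the same expression can be isolated in a function; the name subject is hypothetical and the titles are sample values.

```shell
#!/bin/sh

# Hypothetical wrapper around the subject expression used in the script.
subject() {
    feed_title="$1"
    post_title="$2"
    # Prefix "<feed title>: " only if the feed title is non-empty;
    # use "..." as a placeholder when the post title is empty.
    printf '%s\n' "${feed_title:+${feed_title}: }${post_title:-...}"
}

subject "Debian micronews" "New release"   # Debian micronews: New release
subject "" "New release"                   # New release
subject "Debian micronews" ""              # Debian micronews: ...
```

The same mechanism is at work in "${link:--stdin}": when no link is given, lynx(1) receives the -stdin option and reads the piped content instead.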

The script can be installed in the PATH, e.g. as /usr/local/bin/crossbow-to-mail, and then integrated in crossbow(1) as follows:

  • If the tracked feed encloses the whole content in the XML:
    feed debian_micro
      url https://micronews.debian.org/feeds/feed.rss
      handler pipe
      command crossbow-to-mail %ft %t
  • If the feed entries only relay the link to the article:
    feed lobsters.c
      url https://lobste.rs/t/c.rss
      handler exec
      command crossbow-to-mail %ft %t %l

Remark: The crossbow-to-mail script leverages lynx(1) to download and parse the HTML into textual form. Any other tool that renders HTML as plain text can be used in its place.

Security remark: The "s/^~/~~/" sed(1) substitution prevents accidental or malicious tilde escapes from being interpreted by the mail(1) program. The mutt(1) mail user agent, if available, can be used as a safer drop-in replacement.
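
The effect of the substitution can be observed in isolation. This is a minimal sketch with made-up input lines; escape_tildes is a hypothetical name for the sed(1) stage of the script. Only a tilde in the first column is doubled, which is the position where mail programs with tilde escapes would look for a command.

```shell
#!/bin/sh

# Hypothetical name for the sed(1) stage of crossbow-to-mail:
# double a tilde found at the beginning of a line.
escape_tildes() {
    sed 's/^~/~~/'
}

# The second sample line is the kind of input a hostile feed might
# carry. It comes out as "~~! touch /tmp/owned", while the other two
# lines pass through unchanged.
printf '%s\n' \
    'Regular line' \
    '~! touch /tmp/owned' \
    'A mid-line ~ is left alone' |
    escape_tildes
```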

Many sites specialized in multimedia delivery can be scraped using tools such as youtube-dl(1). If the web site offers a feed to subscribe to, crossbow(1) can be combined with these tools to incrementally maintain a local collection of files.

For example, YouTube provides feeds for users, channels and playlists. Each of these entities is assigned a unique identifier, which can easily be found by looking at the web URL.

What follows is a convenient wrapper script that ensures proper file naming (although it is always wiser to use %n, as explained above):

#!/bin/sh

link="${1:?missing link}"
incremental_id="${2:?missing incremental id}"
format="$3"

# Transform a title in a reasonably safe 'slug'
slugify() {
    tr -d '\n' |                      # explicitly drop new-lines
    tr '/[:punct:][:space:]' . |      # turn all sly chars into dots
    tr -cs '[:alnum:]'                # squeeze repetitions
}

fname="$(
    youtube-dl \
        --get-filename \
        -o "%(id)s_%(title)s.%(ext)s" \
        "$link"
)" || exit 1

youtube-dl \
    ${format:+-f "$format"} \
    -o "$(printf %s_%s "$incremental_id" "$fname" | slugify)" \
    --no-progress \
    "$link"

Once again, the script can be installed in the PATH, e.g. as /usr/local/bin/crossbow-ytdl, and then integrated in crossbow(1) as follows:

  • To save each published video:
    feed computerophile
      url https://youtube.com/feeds/videos.xml?user=Computerphile
      handler exec
      command crossbow-ytdl %l %n
  • To save only the audio of each published video:
    feed nodumb
      url https://youtube.com/feeds/videos.xml?channel_id=UCVnIvJuTZqM5nnwGFpA57_Q
      handler exec
      command crossbow-ytdl %l %n bestaudio
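
To see what file names the wrapper script produces, its slugify function can be tried on its own. This is a sketch under assumptions: the incremental id and the video file name below are invented sample values, and the first tr(1) stage relies on the second operand being padded with its last character, which common implementations such as GNU and BSD tr(1) do.

```shell
#!/bin/sh

# Same pipeline as in the wrapper script, with the tr(1) operands
# quoted so that the shell cannot expand them as glob patterns.
slugify() {
    tr -d '\n' |                      # drop new-lines
    tr '/[:punct:][:space:]' . |      # punctuation and blanks become dots
    tr -cs '[:alnum:]'                # squeeze runs of repeated dots
}

# Sample values (assumptions): "000003" plays the role of %n, the
# second argument mimics a name reported by youtube-dl.
printf %s_%s 000003 'dQw4_My Video: Part 1/2.mp4' | slugify
echo    # slugify output carries no trailing new-line
```

With these inputs the resulting name is 000003.dQw4.My.Video.Part.1.2.mp4: slashes and other separators are gone, so the name cannot point outside the current directory.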

crossbow(1), lynx(1), sed(1), youtube-dl(1), crontab(5), cron(8)

Giovanni Simoni <dacav@fastmail.com>

October 9, 2021