CROSSBOW-FORMAT(5) File Formats Manual (urm) CROSSBOW-FORMAT(5)

crossbow-format - format string reference

The crossbow(1) feed aggregator processes each new entry by means of a handler, according to the feed settings in the crossbow.conf(5) configuration file.

The print handler prints a textual representation of an entry on stdout(3). The optional format setting can define a template to be used in place of the default. The exec and pipe handlers process individual entries by invoking an external program. They both require the definition of a command setting, to expand into the argument vector of the invoked subprocess.

The values of format and command are interpreted as format strings: they are allowed to contain placeholders that are replaced with the fields of the processed entry. All the available placeholders are listed below (see Supported placeholders).

The placeholder syntax resembles that of the printf(3) function, with some differences and simplifications.

In the context of the exec or pipe handlers, the argument vector of a subprocess is constructed from the command setting.

The value is split on white-space into an intermediate array of tokens. When a new entry is processed, each element of the intermediate array is further evaluated for placeholder expansion. The resulting array is then used as the argument vector for the subprocess. Its first element determines the command to execute.

Details:

  • White-space characters are those recognized by isspace(3).
  • Since the tokenization happens before the placeholder expansion, an expanded field can never be split across two arguments, even if it contains white-space.
  • Multiple consecutive white-space characters are considered part of the same delimiter. If needed, it is still possible to pass an empty argument by using a zero width break (see Zero width break below).
  • The subprocess is invoked via execvp(3). The PATH environment variable is honoured.
  • It is not possible to violate the convention by which the value of ‘argv[0]’ corresponds to the name of the executable.
  • For security reasons the command line is not parsed by a shell interpreter (see Security below). Special operators, such as input redirection, are therefore not available.
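
The points above can be illustrated with a minimal Python sketch. It is not the crossbow(1) implementation: the helper name tokenize_then_expand is hypothetical, a token is expanded only when it consists of exactly one placeholder, and str.split() stands in for the isspace(3) based tokenization. A child Python process that prints its own arguments plays the role of the external program.

```python
import subprocess
import sys

def tokenize_then_expand(command, fields):
    """Split on runs of white-space first, expand placeholders afterwards."""
    tokens = command.split()
    # Simplification: a token is replaced only if it is exactly one placeholder.
    return [fields.get(token, token) for token in tokens]

fields = {"%t": "two words"}  # hypothetical entry title containing a space
args = tokenize_then_expand("%t  |  wc -c", fields)

# Stand-in for the external program: print the received arguments.
dump_args = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]

# An argument list is handed over as-is: no shell is involved.
result = subprocess.run(dump_args + args, capture_output=True, text=True, check=True)
print(result.stdout, end="")
```

In this sketch the title reaches the child process as a single argument, and the ‘|’ character arrives as a plain string rather than as a pipe operator.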

The language recognized by the output format parser allows placeholders to be composed of multiple characters. While this feature makes it easier to have mnemonic placeholders (such as ‘%a’ for "Author" and ‘%am’ for "Author eMail"), it introduces some additional edge cases.

The zero width break sequence ‘\:’ has been introduced to cover a case of ambiguity which can be easily explained by means of an example.

Let ‘%x’ and ‘%xn’ be valid placeholders. In such a case, obtaining the expansion of ‘%x’ followed by a literal ‘n’ would be impossible: the sequence ‘%x\n’ would be rendered as the expansion of ‘%x’ followed by a new-line, while ‘%xn’ would be rendered as the expansion of ‘%xn’. Using the backslash would have worked if ‘\n’ were not a recognized escape sequence.

The zero width break can be used to force the termination of an escape sequence, so that whatever follows can be interpreted independently of it.

The behaviour is summarized by the following table, where expand(x) expresses the expansion of the ‘%x’ placeholder into the string representation of the corresponding entry field, and the "." operator expresses string concatenation.

Format    Expansion            Notes
%x        expand(x)
%xn       expand(xn)
%x\m      expand(x) . "m"      Works because ‘\m’ is not a recognized escape sequence.
%x\n      expand(x) . "\n"
%x\:n     expand(x) . "n"
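
The table can be reproduced with a small Python sketch of an expander. This is an illustration under stated assumptions, not the actual parser: it assumes that the longest matching placeholder name wins, that only ‘\n’ and ‘\:’ receive special treatment among escapes, and that an unknown placeholder is an error. The placeholders ‘%x’ and ‘%xn’ are the hypothetical ones used in the text.

```python
def expand(fmt, fields):
    """Expand placeholders and escapes in fmt (simplified model)."""
    escapes = {"n": "\n", ":": ""}  # new-line and zero width break
    names = sorted(fields, key=len, reverse=True)  # try longest names first
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "\\" and i + 1 < len(fmt):
            nxt = fmt[i + 1]
            # Unrecognized escapes yield the escaped character itself.
            out.append(escapes.get(nxt, nxt))
            i += 2
        elif ch == "%":
            for name in names:
                if fmt.startswith(name, i + 1):
                    out.append(fields[name])
                    i += 1 + len(name)
                    break
            else:
                raise ValueError(f"unknown placeholder at offset {i}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)

fields = {"x": "X", "xn": "XN"}
for fmt in ["%x", "%xn", "%x\\m", "%x\\n", "%x\\:n"]:
    print(repr(fmt), "->", repr(expand(fmt, fields)))
```

The last row is the interesting one: the zero width break ends the ‘%x’ placeholder, so the following ‘n’ is emitted literally.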

Additionally, although not originally intended for this purpose, the zero width break can be used to pass empty strings as subprocess arguments. This is demonstrated in the following example, where the configuration prints the entry author, followed by an empty string, and by the entry title:

feed foobar
  url https://example.conf/feed.xml
  handler exec
  command printf \%s\%s\%s\\n %a \: %t

Note: the literal ‘%s’ is intended to be interpreted by the printf(1) command. The corresponding percent character is escaped, since it is meant to be a literal percent, and not to be expanded by crossbow(1).
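
To make the resulting argument vector explicit, the following Python sketch models the expansion of the command above. It is a rough approximation covering only backslash escapes and the ‘%a’, ‘%am’ and ‘%t’ placeholders; the function names and the sample field values are hypothetical.

```python
import re

# Hypothetical subset of the syntax: backslash escapes plus %a, %am and %t.
PATTERN = re.compile(r"\\(.)|%(am|a|t)")
ESCAPES = {":": "", "n": "\n"}  # zero width break and new-line

def expand(token, fields):
    def substitute(match):
        escaped, name = match.groups()
        if name is not None:
            return fields[name]
        # Other escaped characters (such as % or \) are taken literally.
        return ESCAPES.get(escaped, escaped)
    return PATTERN.sub(substitute, token)

def build_argv(command, fields):
    # Tokenize on white-space first, then expand each token on its own.
    return [expand(token, fields) for token in command.split()]

command = r"printf \%s\%s\%s\\n %a \: %t"
argv = build_argv(command, {"a": "Jane", "am": "jane@example.org", "t": "Hello"})
print(argv)
```

Under these assumptions the token ‘\:’ becomes an empty string that still occupies its own slot in the argument vector, between the author and the title.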

The following table shows the supported placeholders and the corresponding entry properties for the RSS and Atom feed formats.

Depending on the feed format, some placeholders may refer to unavailable entry properties. If this is the case they are expanded with an empty string, with some exceptions (see Notes).

Placeholder  RSS                   Atom
%a           -                     author.name
%am          author                author.email
%au          -                     author.uri
%ca          category[0] (1)       -
%co          comments              -
%c           content:encoded (2)   content
%g           guid                  id
%gp          guid_isPermaLink (3)  - (4)
%l           link                  link[0] (1)
%pd          pubDate               published
%s           description           summary
%t           title                 title

Notes:

(1) Some of the entry properties can have multiple values. Only the first one is currently available. This limitation might be lifted with future updates.
(2) The ‘<content:encoded>’ element of RSS is an extension provided by the http://purl.org/rss/1.0/modules/content/ XML namespace. If such namespace is not enabled for the feed, the ‘%c’ placeholder will be expanded with the regular ‘<content>’ tag of RSS.
(3) Boolean, represented with ‘0’ or ‘1’.
(4) According to the Atom definition, the ‘<id>’ element conveys a permanent, universally unique identifier. ‘%gp’ is therefore undefined, but always expanded as ‘1’.

Some additional placeholders, not referring to any entry field, are also available:

%fi   The unique name of the feed.
%ft   The feed title.
%n    A per-feed, zero-padded, six-digit incremental number.

The incremental number expanded in place of ‘%n’ is initialized to zero for new feeds, and is incremented for every new feed entry. This is an important security feature: the value of this number is not controlled by the feed content, so it can safely be used as a filename. The value is persisted across executions, and incremented even if the item was not successfully processed, so that the same value is never used twice.
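
The behaviour described above can be modelled with a short Python sketch. The storage format (a JSON file) and the function name next_number are hypothetical and do not reflect how crossbow(1) persists its state; the sketch also assumes that the first entry of a new feed receives the value zero. The counter is saved before the entry is handled, so a failing handler cannot cause a value to be reused.

```python
import json
import os
import tempfile

def next_number(state_path):
    """Return the next six-digit identifier and persist the counter."""
    try:
        with open(state_path) as fh:
            counter = json.load(fh)["counter"]
    except FileNotFoundError:
        counter = 0  # new feed: start from zero
    # Persist the increment before the entry is processed.
    with open(state_path, "w") as fh:
        json.dump({"counter": counter + 1}, fh)
    return f"{counter:06d}"

# Hypothetical per-feed state file in a scratch directory.
state = os.path.join(tempfile.mkdtemp(), "foobar.json")
first = next_number(state)
second = next_number(state)
print(first, second)
```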

Since the exec and pipe handlers process entries by passing parameters to a subprocess, it is important to keep security in mind when configuring the corresponding command setting.

  1. Inserting an interpreted code snippet within a command template is strongly discouraged, as it might constitute an easy target for code injection.

    Consider, for example, the following configuration:

    feed foobar
      url https://example.conf/feed.xml
      handler exec
      command sh -c echo\ "%t"\ |\ wc\ -c

    The provided command is dangerous in that the entry title, expanded in place of ‘%t’, might be exploited by a malicious XML like the following:

    <?xml version="1.0" encoding="utf8"?>
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
     <channel>
      <item>
        <guid>inject</guid>
        <title>"; echo pwned; echo "</title>
      </item>
     </channel>
    </rss>

    The correct way of achieving the same result consists in moving the code into a shell script, and making the entry properties available to it by means of safe, uninterpreted parameter passing:

    feed foobar
      url https://example.conf/feed.xml
      handler exec
      command /usr/local/bin/count_bytes %t

  2. Extra care should be taken when the expansion of a placeholder directly determines the name of a file. Consider, for example, the following configuration:

    feed foobar
      url https://example.conf/feed.xml
      handler pipe
      command sed -n w%t

    This is an effective (yet dangerous) way of dumping the entry contents into a file named after the entry title. A specially crafted XML can exploit a similar configuration to attempt the replacement of sensitive files:

    <?xml version="1.0" encoding="utf8"?>
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
     <channel>
      <item>
        <guid>shenanigans</guid>
        <title>.bashrc</title>
        <description>echo pwned</description>
      </item>
      <item>
        <guid>shenanigans_1</guid>
        <title>../.bashrc</title>
        <description>echo pwned</description>
      </item>
      <item>
        <guid>shenanigans_2</guid>
        <title>../../.bashrc</title>
        <description>echo pwned</description>
      </item>
      ...
     </channel>
    </rss>

    The correct way of achieving the same result consists in using the ‘%n’ placeholder in place of ‘%t’, obtaining a safer (although admittedly less descriptive) file naming.
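
The risk described in the second point can be checked with a small Python sketch. It only resolves paths and never touches the file system; the working directory /home/user/feeds and both helper names are hypothetical. Titles taken from the malicious feed above resolve outside the working directory, while a value such as the one produced by ‘%n’ cannot.

```python
import posixpath

def resolve(workdir, name):
    """Path that a file called `name` would get when created in workdir."""
    return posixpath.normpath(posixpath.join(workdir, name))

def stays_inside(workdir, name):
    """True if the resolved path is still located under workdir."""
    workdir = posixpath.normpath(workdir)
    return posixpath.commonpath([workdir, resolve(workdir, name)]) == workdir

workdir = "/home/user/feeds"  # hypothetical working directory
for name in ["../.bashrc", "../../.bashrc", "000042"]:
    print(name, "->", resolve(workdir, name), stays_inside(workdir, name))
```

A plain ‘.bashrc’ title would pass this check, yet it is still dangerous when the handler runs from the home directory. A containment test alone is therefore no substitute for using ‘%n’.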

See crossbow-cookbook(7).

crossbow(1), crossbow-conf(5), crossbow-cookbook(7)

Giovanni Simoni <dacav@fastmail.com>

September 30, 2021