Supercazzola is my own web tar pit, designed to dynamically generate an endless graph of webpages.

I wrote it to poison web scrapers that ignore my robots.txt. It was inspired by a similar project available on maurycyz.com.

To see it in action, feel free to take a (somewhat boring) tour under /L/. Bot activity can be monitored in real time (hint: the info endpoint for this site is publicly accessible under /Lq/).

Dependencies

This software requires cmake, pkg-config and libevent >= 2. It has been tested on GNU/Linux and FreeBSD.

License

3-Clause BSD License.

Resources

How it works

One or more text files are processed offline to construct a Markov chain, which is compiled into a binary image. The image is then loaded by the main daemon, spamgend(8), which will use it to generate pseudo-random HTML pages on demand.
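The offline step can be pictured with a minimal sketch. It assumes a first-order, word-level chain split on whitespace; the tokenization and the binary image format actually used by mchain(1) are not described here, and `build_chain` is a hypothetical name used only for illustration.

```python
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        # Repeated successors are kept, so frequent transitions are
        # picked more often during a random walk.
        chain[current].append(following)
    return dict(chain)


corpus = "the cat sat on the mat and the cat slept"
chain = build_chain(corpus)
print(len(chain), "states")
print(chain["the"])
```

A word that never has a successor in the corpus (here "slept") does not become a state, so a walk that reaches it simply ends.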

When a page is requested from the server, a hash of the URI path is used as a seed for a xorshift DRNG. The resulting pseudo-random sequence is used to traverse the Markov chain and generate the whole page content. Each generated page embeds a number of pseudo-random links to other pseudo-random pages, thus forming an Eternal Garbage Braid (EGB).
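The sketch below illustrates why the same path always yields the same page. It is not the code of spamgend(8): the hash (64-bit FNV-1a), the xorshift variant (64-bit with shifts 13, 7, 17), the toy chain and the link format are all assumptions chosen for the example.

```python
MASK64 = (1 << 64) - 1

# Toy chain made up for this example: word -> possible successors.
TOY_CHAIN = {
    "the": ["monster", "creature", "night"],
    "monster": ["spoke", "fled"],
    "creature": ["spoke"],
    "night": ["fell"],
    "spoke": ["to"],
    "to": ["the"],
}


def fnv1a64(data: bytes) -> int:
    """64-bit FNV-1a hash (assumed here; the real hash is not specified)."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & MASK64
    return h


class XorShift64:
    """Marsaglia xorshift generator with a 64-bit state."""

    def __init__(self, seed):
        # The state must never be zero, or the generator gets stuck.
        self.state = (seed & MASK64) or 0x9E3779B97F4A7C15

    def next(self):
        x = self.state
        x ^= (x << 13) & MASK64
        x ^= x >> 7
        x ^= (x << 17) & MASK64
        self.state = x
        return x


def sentence(rng, chain, max_len):
    """Random walk on the chain, stopping at a dead end or at max_len words."""
    starts = sorted(chain)
    word = starts[rng.next() % len(starts)]
    words = [word]
    while len(words) < max_len and word in chain:
        successors = chain[word]
        word = successors[rng.next() % len(successors)]
        words.append(word)
    return " ".join(words)


def render_page(path, n_sentences=3, n_links=2, max_len=8):
    """Derive sentences and outbound links from the request path alone."""
    rng = XorShift64(fnv1a64(path.encode("utf-8")))
    sentences = [sentence(rng, TOY_CHAIN, max_len) for _ in range(n_sentences)]
    links = [f"/{rng.next() & 0xFFFFFFFF:08x}" for _ in range(n_links)]
    return sentences, links


print(render_page("/some/page"))
```

Because the generator is seeded from the path only, no page needs to be stored: requesting the same URI twice regenerates identical content, and every link leads to another page produced the same way.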

Bot monitoring

Real-world observations indicate that scrapers commonly spread their requests across numerous hosts, often in ridiculous numbers, to evade detection while scraping without authorization.

spamgend(8) identifies individual actors by embedding a reasonably unique identifier within the generated page links. If the requested path lacks an identifier, a new one is created by hashing the IP address of the requesting peer. The same identifier being used by multiple hosts implies that all of them are taking part in the same scraping operation.

A similar technique is used to track the depth of the scraping operation, as each page includes a depth value in its outbound links. This value is derived by incrementing the depth of the current page by one. If a page’s path does not include a depth value, it is implicitly assigned a depth of 0.
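The identifier and depth propagation can be sketched as follows. The path layout `/<id>/<depth>/<page>`, the use of SHA-256 and the function names are hypothetical; the real link format of spamgend(8) is not documented here. As a simplification, the sketch treats the identifier and the depth as either both present or both absent.

```python
import hashlib


def identify(path, peer_ip):
    """Return (identifier, depth) for a request.

    Assumed layout: /<id>/<depth>/<page>. Any other path is treated as
    a fresh entry: identifier derived from the peer IP, depth 0.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) == 3 and parts[1].isascii() and parts[1].isdigit():
        return parts[0], int(parts[1])
    digest = hashlib.sha256(peer_ip.encode("ascii")).hexdigest()
    return digest[:8], 0


def outbound_links(path, peer_ip, page_names):
    """Build links that carry the same identifier and depth + 1."""
    ident, depth = identify(path, peer_ip)
    return [f"/{ident}/{depth + 1}/{name}" for name in page_names]


# First visit: no identifier in the path, so it is derived from the IP.
first = outbound_links("/", "192.0.2.1", ["a", "b"])
print(first)

# A different host following one of those links keeps the same identifier.
print(outbound_links(first[0], "198.51.100.7", ["c"]))
```

The second call shows the property the monitoring relies on: the identifier travels with the link, not with the host, so it survives a change of IP address.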

spamgend(8) does not keep logs, but aggregated data is made available in the form of a histogram by the info endpoint (see below).

The recommended setup consists of forwarding requests from a reverse proxy to the spam endpoint. The advantages are:

  • Seamless integration with an existing website

  • Pages can be served via HTTPS (TLS is not implemented by spamgend(8))

  • Bot monitoring data becomes available in the reverse proxy access log, making it possible to take broader countermeasures against identified scrapers (e.g. ban via firewall, or redirect their requests for regular pages to even more garbage!)
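As an illustration of the last point, the sketch below groups access-log records by identifier and reports identifiers seen from more than one host. It assumes the records were already reduced to (IP, path) pairs, that the spam endpoint is exposed under `/spam/`, and that the identifier is the first path component after the prefix; all three are assumptions made for the example, not the documented link format.

```python
from collections import defaultdict

PREFIX = "/spam/"  # assumed reverse proxy location of the spam endpoint


def shared_identifiers(records, min_hosts=2):
    """Return {identifier: set of IPs} for identifiers used by several hosts."""
    hosts = defaultdict(set)
    for ip, path in records:
        if not path.startswith(PREFIX):
            continue  # regular page, not part of the tar pit
        parts = path[len(PREFIX):].split("/")
        if len(parts) >= 2 and parts[0]:
            hosts[parts[0]].add(ip)
    return {ident: ips for ident, ips in hosts.items() if len(ips) >= min_hosts}


records = [
    ("192.0.2.1", "/spam/ab12/0/p1"),
    ("198.51.100.7", "/spam/ab12/1/p2"),
    ("203.0.113.9", "/spam/ff00/0/p3"),
    ("203.0.113.9", "/index.html"),
]
print(shared_identifiers(records))
```

The resulting IP sets are candidates for the countermeasures mentioned above, such as a firewall ban list.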

The purpose of spamgend(8) is to mess with greedy AI bots that violate netiquette. It is therefore highly recommended to list the URI path leading to the spam endpoint in your robots.txt, so that legitimate scrapers are not poisoned.

User-agent: *
Disallow: /spam/
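To double-check that a well-behaved crawler would stay out of the tar pit, the rules can be fed to the robots.txt parser from the Python standard library. The agent name `ExampleBot` and the sample paths are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /spam/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """True if a crawler honouring robots.txt may fetch the given path."""
    return parser.can_fetch(agent, path)


print(allowed("/spam/some/page"))
print(allowed("/index.html"))
```

Any client that still requests pages under the disallowed path is, by definition, ignoring the file.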

Binaries

  • mchain(1) - Compile a Markov chain from one or more text files

  • spamgen(1) - Generate pseudo-random sentences out of a compiled Markov chain

  • spamgend(8) - Web daemon generating pseudo-random HTML pages out of a compiled Markov chain

How-To

The following instructions describe provisioning and installation on FreeBSD, but they can be easily adapted to other operating systems (e.g. GNU/Linux).

  1. Build the software

    • Install pkg-config, libevent2 and cmake:

      root@freebsd:~ # pkg install -y devel/pkgconf devel/libevent devel/cmake-core
    • Unpack, build and install the package:

      root@freebsd:~ # tar -xzf ./supercazzola-*.tar.gz
      root@freebsd:~ # cmake -S ./supercazzola-*/ -B ./build
      root@freebsd:~ # cmake --build ./build
      root@freebsd:~ # cmake --install ./build
  2. Create and install a Markov chain

    • Get some long text, e.g. Frankenstein from Gutenberg.org, and turn it into a Markov chain:

      root@freebsd:~ # fetch 'https://www.gutenberg.org/ebooks/84.txt.utf-8'
      84.txt.utf-8                                           438 kB  589 kBps    00s
      root@freebsd:~ # mkdir /usr/local/share/spamgend
      root@freebsd:~ # mchain ./84.txt.utf-8 /usr/local/share/spamgend/default.markov
      mchain: number of states:  42181 (build-time max: 81920)
      mchain: number of edges:   65106
      mchain: spamgend(8) mallocs:  858296 bytes
    • Sample the results with spamgen(1):

      root@freebsd:~ # spamgen -k /usr/local/share/spamgend/default.markov
    • Test the result by running the daemon in the foreground:

      root@freebsd:~ # spamgend -f
      spamgend 2171 - - listening on localhost:7180
      spamgend 2171 - - listening on localhost:7181
  3. Configure spamgend(8)

    spamgend(8) will try to read configuration data from /usr/local/etc/spamgend/spamgend.conf or from the file specified with -c on the command line.

    If the default file does not exist, and if no alternative file is specified, spamgend(8) will be configured with default settings.

    See Configuration below.

  4. Start spamgend(8)

    • Enable the spamgend(8) service

      root@freebsd:~ # service spamgend enable
      spamgend enabled in /etc/rc.conf
      root@freebsd:~ # service spamgend start
      Starting spamgend.
    • The daemon will log via syslog(3) on the "daemon" facility (check "/var/log/daemon.log" if needed).

      root@freebsd:~ # tail -n2 /var/log/daemon.log
      Dec  8 23:53:23 freebsd spamgend[3500]: listening on localhost:7180
      Dec  8 23:53:23 freebsd spamgend[3500]: listening on localhost:7181
  5. Sit and enjoy some spam

    spamgend(8) will serve spam on the spam endpoint (http://localhost:7180 by default) and provide information about the visitors on the info endpoint (http://localhost:7181 by default).

Configuration

The configuration file of spamgend(8) contains key-value pairs or key-only toggles, one per line. Empty lines and lines starting with # are ignored. Key-only lines (toggles) are permitted only for settings of boolean type, and are interpreted as true.
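The parsing rules above can be sketched as follows. One detail is an assumption: the separator between key and value is not stated in this document, and the sketch uses `=` purely for illustration. `parse_config` is a hypothetical helper, not part of the package.

```python
# Keys documented as boolean; only these may appear as key-only toggles.
BOOLEAN_KEYS = {"daemon.foreground"}


def parse_config(text):
    """Parse the configuration text into a dict (values kept as strings)."""
    settings = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty line or comment
        # ASSUMPTION: "=" separates key and value.
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep:
            if key not in BOOLEAN_KEYS:
                raise ValueError(f"line {lineno}: {key!r} needs a value")
            settings[key] = True  # key-only toggle means true
        else:
            settings[key] = value
    return settings


SAMPLE = """\
# run in the foreground
daemon.foreground

spam_ep.bind = localhost:7180
spam_ep.n_paragraphs = 3
"""
print(parse_config(SAMPLE))
```

Rejecting a key-only line for a non-boolean setting mirrors the rule that toggles are permitted only for boolean types.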

The accepted keys and their meanings are listed below:

daemon.foreground

Tells spamgend(8) not to invoke daemon(3)

  • Type: boolean

  • Default: false

daemon.gid

Tells spamgend(8) to drop privileges via setgid(2) to the supplied gid. If daemon.gid is not specified, spamgend(8) will not try to use setgid(2), but it will still ensure the process is not executed with gid 0.

  • Type: string

  • Default: undefined

daemon.pidfile

Location on the filesystem of the pidfile. The pidfile is created before dropping privileges, and is therefore not unlinked when the daemon terminates. The init system is responsible for unlinking this file.

  • Type: string

  • Default: /var/run/spamgend.pid

daemon.uid

Tells spamgend(8) to drop privileges via setuid(2) to the supplied uid. If daemon.uid is not specified, spamgend(8) will not try to use setuid(2), but it will still ensure the process is not executed with uid 0.

  • Type: string

  • Default: undefined

info_ep.backlog

TCP backlog of the info endpoint. See listen(2).

  • Type: integer

  • Default: 32

info_ep.bind

Bind address of the info endpoint. See bind(2).

  • Type: string

  • Default: localhost:7181

spam_ep.backlog

TCP backlog of the spam endpoint. See listen(2).

  • Type: integer

  • Default: 32

spam_ep.bind

Bind address of the spam endpoint. See bind(2).

  • Type: string

  • Default: localhost:7180

spam_ep.max_sentence_len

Maximum length of a pseudo-random sentence served on the spam endpoint. Sentences are allowed to be shorter, according to the length of the pseudo-random walk on the Markov chain.

  • Type: integer

  • Default: 40

spam_ep.mkvchain

File system path of the Markov chain. Markov chain files are constructed via mchain(1).

  • Type: string

  • Default: /usr/local/share/spamgend/default.markov

spam_ep.n_paragraphs

Number of paragraphs in each page served by the spam endpoint.

  • Type: integer

  • Default: 3

spam_ep.n_references

Number of outbound links in each page served by the spam endpoint.

  • Type: integer

  • Default: 7

spam_ep.n_sentences

Number of pseudo-random sentences per paragraph served by the spam endpoint.

  • Type: integer

  • Default: 5

spam_ep.uri_prefix

Expected prefix of URIs in the spam endpoint. This option is useful when spamgend(8) is made reachable through a reverse proxy that prepends a prefix to each request URI.

  • Type: string

  • Default: undefined