Supercazzola

Supercazzola is my own scraper tar pit, designed to dynamically generate an endless graph of web pages.

I wrote it to poison web crawlers that ignore my robots.txt.

This software requires cmake, pkg-config and libevent >= 2 as dependencies. It has been tested to work under GNU/Linux and FreeBSD.

History

This work was inspired by a similar project available on maurycyz.com.

License

3-Clause BSD License. See COPYING.txt

Binaries

mchain(1) - Compile a Markov chain from one or more text files
spamgen(1) - Generate random sentences out of a compiled Markov chain
spamd(8) - Web daemon generating random HTML pages out of a compiled Markov chain

How-To

The following instructions describe provisioning and installation on FreeBSD, but they can easily be adapted to other operating systems (e.g. GNU/Linux).

Build the software

Install pkg-config, libevent2 and cmake:

root@freebsd:~ # pkg install -y devel/pkgconf devel/libevent devel/cmake-core

Unpack, build and install the package:

root@freebsd:~ # tar -xzf ./supercazzola-*.tar.gz
root@freebsd:~ # cmake -S ./supercazzola-*/ -B ./build
root@freebsd:~ # cmake --build ./build
root@freebsd:~ # cmake --install ./build

Create and install a Markov chain

Get some long text, e.g. Frankenstein from Gutenberg.org, and turn it into a Markov chain:

root@freebsd:~ # fetch 'https://www.gutenberg.org/ebooks/84.txt.utf-8'
84.txt.utf-8                                           438 kB  589 kBps    00s
root@freebsd:~ # mkdir /usr/local/share/spamd
root@freebsd:~ # mchain ./84.txt.utf-8 /usr/local/share/spamd/default.markov
mchain: number of states:  42181 (build-time max: 81920)
mchain: number of edges:   65106
mchain: spamd(8) mallocs:  858296 bytes
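
To make the state and edge counts printed by mchain(1) more concrete, here is a minimal Python sketch of a first-order, word-level Markov chain. It is only an illustration under stated assumptions: the tokenization, the chain order and the file format actually used by mchain(1) are not described here, and build_chain and count_states_and_edges are hypothetical helpers, not part of this package.

```python
from collections import defaultdict


def build_chain(text):
    """Build a first-order, word-level Markov chain from text.

    Assumption: states are whitespace-separated words, and an edge links
    each word to the word that follows it. mchain(1) may work differently.
    """
    words = text.split()
    chain = defaultdict(lambda: defaultdict(int))
    for current, following in zip(words, words[1:]):
        chain[current][following] += 1  # edge weight = number of occurrences
    if words:
        # The last word is a state too, even if it has no outgoing edge.
        _ = chain[words[-1]]
    return {state: dict(edges) for state, edges in chain.items()}


def count_states_and_edges(chain):
    """Return (number of states, number of distinct edges)."""
    n_edges = sum(len(edges) for edges in chain.values())
    return len(chain), n_edges


sample = "the cat sat on the mat and the cat slept"
chain = build_chain(sample)
print(count_states_and_edges(chain))  # (7, 8)
```

A longer input text yields more states and edges, and therefore more varied output.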

Sample the results with spamgen(1):

root@freebsd:~ # spamgen -k /usr/local/share/spamd/default.markov
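
Conceptually, a sentence is the result of a random walk over the chain. The following Python sketch shows the idea on a tiny hand-written chain. The chain contents, the random_sentence helper and the stop conditions (dead end or maximum length) are assumptions made for illustration, not the actual spamgen(1) implementation.

```python
import random

# Hypothetical toy chain: each word maps to its successors and their weights.
CHAIN = {
    "the": {"monster": 2, "night": 1},
    "monster": {"fled": 1, "wept": 1},
    "night": {"fell": 1},
    "fled": {},
    "wept": {},
    "fell": {},
}


def random_sentence(chain, max_len, rng):
    """Walk the chain at random; stop at a dead end or after max_len words."""
    if max_len <= 0 or not chain:
        return ""
    word = rng.choice(sorted(chain))  # assumption: any state may start a walk
    words = [word]
    while len(words) < max_len and chain.get(word):
        successors = chain[word]
        # Pick the next word with probability proportional to its weight.
        word = rng.choices(list(successors), weights=list(successors.values()))[0]
        words.append(word)
    return " ".join(words)


rng = random.Random(84)  # fixed seed, only to make the demo repeatable
print(random_sentence(CHAIN, max_len=40, rng=rng))
```

The walk may end well before max_len is reached, which is why generated sentences vary in length.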

Test the result by running the daemon in the foreground:

root@freebsd:~ # spamd -f
spamd 2171 - - listening on localhost:7180
spamd 2171 - - listening on localhost:7181

Configure spamd(8)

spamd(8) will try to read configuration data from /usr/local/etc/spamd/spamd.conf or from the file specified with -c on the command line.

If the default file does not exist, and if no alternative file is specified, spamd(8) will be configured with default settings.

See "Configuration" below.

Start spamd

Enable and start the spamd(8) service:

root@freebsd:~ # service spamd enable
spamd enabled in /etc/rc.conf
root@freebsd:~ # service spamd start
Starting spamd.

The daemon will log via syslog(3) on the "daemon" facility (check "/var/log/daemon.log" if needed).

root@freebsd:~ # tail -n2 /var/log/daemon.log
Dec  8 23:53:23 freebsd spamd[3500]: listening on localhost:7180
Dec  8 23:53:23 freebsd spamd[3500]: listening on localhost:7181

Sit and enjoy some spam

spamd(8) will serve spam on the spam endpoint (http://localhost:7180 by default) and provide information about the visitors on the info endpoint (http://localhost:7181 by default).

Intended use

The recommended setup is to forward requests from a web server acting as a reverse proxy to the spam endpoint. spamd(8) honors the X-Forwarded-For header when determining the IP address of the peer for statistical purposes, and lets you specify a prefix to strip from the request URI (see spam_ep.uri_prefix below).
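
The two reverse-proxy adjustments can be sketched in Python as follows. This is a rough illustration, not the code of spamd(8): the function names are hypothetical, and choosing the left-most X-Forwarded-For entry, as well as leaving non-matching URIs untouched, are assumptions.

```python
def client_address(headers, socket_peer):
    """Pick the address to record for statistics.

    Assumption: the left-most X-Forwarded-For entry identifies the client,
    and the socket peer is used when the header is missing or empty.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    return first or socket_peer


def strip_prefix(uri, prefix):
    """Remove prefix from the request URI when it is present."""
    if prefix and uri.startswith(prefix):
        stripped = uri[len(prefix):]
        # Keep the result an absolute path.
        return stripped if stripped.startswith("/") else "/" + stripped
    return uri  # assumption: URIs without the prefix are left untouched


print(client_address({"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}, "127.0.0.1"))
print(strip_prefix("/spam/a/b", "/spam"))
```

With a prefix of /spam, a proxied request for /spam/a/b would be handled as /a/b.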

The purpose of spamd(8) is to mess with greedy AI bots that violate netiquette. Well-behaved crawlers should never end up in the tar pit, so it is highly recommended to list the URI path leading to the spam endpoint in your robots.txt:

User-agent: *
Disallow: /spam/
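
To check that such a rule keeps compliant crawlers away from the tar pit, the robots.txt content can be evaluated with Python's standard urllib.robotparser module. In this small sketch the bot name and the sample paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /spam/
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# "ExampleBot" and both paths are hypothetical.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "/spam/some/page"))  # False
print(is_allowed(ROBOTS_TXT, "ExampleBot", "/blog/post"))       # True
```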

Configuration

The configuration file of spamd(8) contains key-value pairs or key-only toggles, one per line. Empty lines and lines starting with # are ignored. Key-only lines (toggles) are permitted only for settings of boolean type, and are interpreted as true.
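
The parsing rules above can be sketched in Python as follows. Beware that the separator between key and value is not stated in this document: the sketch assumes "=", which may not match what spamd(8) expects, and parse_config is a hypothetical helper.

```python
def parse_config(text, separator="="):
    """Parse key-value pairs and key-only toggles, one per line.

    Assumption: keys and values are separated by `separator` ("=" here);
    the real spamd(8) syntax may differ.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments are skipped
        key, sep, value = line.partition(separator)
        if sep:
            settings[key.strip()] = value.strip()
        else:
            settings[line] = True  # key-only toggle, interpreted as true
    return settings


# Hypothetical configuration text, written in the assumed syntax.
EXAMPLE = """\
# run in the foreground
daemon.foreground

spam_ep.bind = localhost:7180
spam_ep.n_paragraphs = 3
"""

print(parse_config(EXAMPLE))
```

Values are kept as strings here. A real parser would also convert them to the type declared for each key.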

The accepted keys and their meanings are listed below:

daemon.foreground
Tells spamd(8) not to invoke daemon(3).
Type: boolean
Default: false

daemon.gid
Tells spamd(8) to drop privileges via setgid(2) to the supplied gid. If daemon.gid is not specified, spamd(8) will not try to use setgid(2), but it will still ensure that the process is not running with gid 0.
Type: string
Default: undefined

daemon.pidfile
Location of the pidfile on the file system. The pidfile is created before privileges are dropped, so the daemon does not unlink it when it terminates. The init system is responsible for unlinking this file.
Type: string
Default: /var/run/spamd.pid

daemon.uid
Tells spamd(8) to drop privileges via setuid(2) to the supplied uid. If daemon.uid is not specified, spamd(8) will not try to use setuid(2), but it will still ensure that the process is not running with uid 0.
Type: string
Default: undefined

info_ep.backlog
TCP backlog of the info endpoint. See listen(2).
Type: integer
Default: 32

info_ep.bind
Bind address of the info endpoint. See bind(2).
Type: string
Default: localhost:7181

spam_ep.backlog
TCP backlog of the spam endpoint. See listen(2).
Type: integer
Default: 32

spam_ep.bind
Bind address of the spam endpoint. See bind(2).
Type: string
Default: localhost:7180

spam_ep.max_sentence_len
Maximum length of a pseudo-random sentence served on the spam endpoint. Sentences may be shorter, depending on the length of the random walk on the Markov chain.
Type: integer
Default: 40

spam_ep.mkvchain
File system path of the Markov chain. Markov chain files are constructed via mchain(1).
Type: string
Default: /usr/local/share/spamd/default.markov

spam_ep.n_paragraphs
Number of paragraphs in each page served by the spam endpoint.
Type: integer
Default: 3

spam_ep.n_references
Number of outbound links in each page served by the spam endpoint.
Type: integer
Default: 7

spam_ep.n_sentences
Number of pseudo-random sentences per paragraph served by the spam endpoint.
Type: integer
Default: 5

spam_ep.uri_prefix
Expected prefix of URIs in the spam endpoint. This option is useful when spamd(8) is reachable through a reverse proxy that prepends a prefix to each request URI.
Type: string
Default: undefined
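
To show how spam_ep.n_paragraphs, spam_ep.n_sentences and spam_ep.n_references shape a page, here is a rough Python sketch. The HTML layout, the random link scheme and the render_page helper are assumptions made for illustration; the pages actually produced by spamd(8) may look different.

```python
import html
import random


def render_page(sentence_source, rng, n_paragraphs=3, n_sentences=5,
                n_references=7):
    """Assemble a page from generated sentences and random outbound links.

    sentence_source is any callable returning one sentence, for instance
    a random walk on a Markov chain.
    """
    paragraphs = []
    for _ in range(n_paragraphs):
        sentences = [sentence_source() for _ in range(n_sentences)]
        paragraphs.append("<p>" + html.escape(" ".join(sentences)) + "</p>")
    links = []
    for _ in range(n_references):
        # Assumption: each link points to a fresh random path, so that a
        # crawler following the links never runs out of pages.
        target = f"/{rng.getrandbits(32):08x}"
        links.append(f'<a href="{target}">{target}</a>')
    body = "".join(paragraphs) + "".join(links)
    return "<html><body>" + body + "</body></html>"


page = render_page(lambda: "It was a dreary night.", random.Random(7))
print(page.count("<p>"), page.count("<a href="))  # 3 7
```

Since every page links to further generated pages, the set of reachable pages is effectively endless.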

TODO (higher priority first)

control panel pages + bake in git version
reload config file upon SIGHUP
spamd(8): handle a missing file more gracefully (e.g. default page?)
mchain option to pick output format (dot or binary)
Add sanitizers to cmake (default ON, possibly OFF, only in Debug build-type)
