Supercazzola is my own web tar pit, designed to dynamically generate an endless graph of webpages.
I wrote it to poison web scrapers that ignore my robots.txt. It was inspired by a similar project available on maurycyz.com.
Dependencies
This software requires cmake, pkg-config and libevent >= 2.
It has been tested to work under GNU/Linux and FreeBSD.
License
3-Clause BSD License.
Resources
How it works
One or more text files are processed offline to construct a Markov chain, which
is compiled into a binary image.
The image is then loaded by the main daemon, spamgend(8), which will use it to
generate pseudo-random HTML pages on demand.
When a page is requested from the server, a hash of the URI path is used as the seed for an xorshift DRNG (deterministic random number generator). The resulting pseudo-random sequence is used to traverse the Markov chain and generate the whole page content. Each generated page embeds a number of pseudo-random links to other pseudo-random pages, thus forming an Eternal Garbage Braid (EGB).
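The seeding scheme described above can be illustrated with a minimal Python sketch. It is not the actual spamgend(8) code: the FNV-1a hash, the 64-bit xorshift variant with shifts (13, 7, 17), and the toy `CHAIN` table are all assumptions chosen for illustration. What it tries to show is the property that matters: the same URI path always yields the same text, so no page ever has to be stored.

```python
# Illustrative sketch only: hash function, xorshift variant and chain are assumptions.
MASK64 = (1 << 64) - 1


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash (assumed; the real daemon may hash differently)."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & MASK64
    return h


class XorShift64:
    """Marsaglia's xorshift64 generator with shifts (13, 7, 17)."""

    def __init__(self, seed: int) -> None:
        # The state must never be zero, or the generator stays at zero forever.
        self.state = (seed & MASK64) or 0x9E3779B97F4A7C15

    def next(self) -> int:
        x = self.state
        x ^= (x << 13) & MASK64
        x ^= x >> 7
        x ^= (x << 17) & MASK64
        self.state = x
        return x


# Toy Markov chain: each word maps to its possible successors.
CHAIN = {
    "the": ["monster", "creator", "night"],
    "monster": ["fled", "spoke"],
    "creator": ["fled", "wept"],
    "night": ["fell"],
    "fled": ["the"],
    "spoke": ["the"],
    "wept": ["the"],
    "fell": ["the"],
}


def sentence_for(path: str, max_len: int = 8) -> str:
    """Walk the chain, drawing every choice from a DRNG seeded by the path."""
    rng = XorShift64(fnv1a_64(path.encode("utf-8")))
    word = "the"
    words = [word]
    while len(words) < max_len:
        successors = CHAIN[word]
        word = successors[rng.next() % len(successors)]
        words.append(word)
    return " ".join(words)


print(sentence_for("/spam/some/page"))
print(sentence_for("/spam/some/page"))  # same path, same sentence
```

Because the output depends only on the path, the daemon in this sketch would need no per-page state, and a revisiting crawler sees a stable page.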
Bots monitoring
Real-world observations show that scrapers commonly spread their requests across many hosts, often in ridiculous numbers, to evade detection while scraping without authorization.
spamgend(8) identifies individual actors by embedding a reasonably unique
identifier within the generated page links.
If the requested path lacks an identifier, a new one is created by hashing the
IP address of the requesting peer.
The same identifier being used by multiple hosts implies that all of them are
taking part in the same scraping operation.
A similar technique is used to track the depth of the scraping operation, as each page includes a depth value in its outbound links. This value is derived by incrementing the depth of the current page by one. If a page’s path does not include a depth value, it is implicitly assigned a depth of 0.
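The identifier and depth propagation can be sketched as follows. The URL layout `/<identifier>/<depth>/<page>`, the use of SHA-256, and the function names are hypothetical; the layout actually used by spamgend(8) is not described here. The sketch only aims to show how an identifier minted for one host survives when a second host follows the links.

```python
import hashlib


def actor_id(peer_ip: str) -> str:
    """Derive a short identifier from the peer address (SHA-256 is an assumption)."""
    return hashlib.sha256(peer_ip.encode("ascii")).hexdigest()[:8]


def parse_path(path: str, peer_ip: str) -> tuple[str, int]:
    """Return (identifier, depth) for a request.

    Assumed layout: /<identifier>/<depth>/<anything>. A path that does not
    follow it gets a fresh identifier and an implicit depth of 0.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) >= 2 and parts[1].isdecimal():
        return parts[0], int(parts[1])
    return actor_id(peer_ip), 0


def outbound_links(path: str, peer_ip: str, count: int = 3) -> list[str]:
    """Build links that carry the identifier and the current depth plus one."""
    ident, depth = parse_path(path, peer_ip)
    return [f"/{ident}/{depth + 1}/page{i}" for i in range(count)]


first = outbound_links("/", "203.0.113.7")
print(first)
# A different host following one of those links keeps the first host's identifier.
print(outbound_links(first[0], "198.51.100.20"))
```

In this sketch the second host's own address is never hashed, which is what lets several hosts be tied to one scraping operation.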
spamgend(8) does not keep logs, but aggregated data is made
available in the form of a histogram by the info endpoint (see below).
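As a rough illustration of keeping aggregated data instead of logs, the sketch below counts how many distinct hosts share each identifier and how deep each operation has gone. The class name, its fields, and the shape of the histogram are assumptions; the actual output of the info endpoint is not specified in this document.

```python
from collections import defaultdict


class Aggregator:
    """Keeps per-identifier counters instead of a request log (hypothetical)."""

    def __init__(self) -> None:
        self.hosts = defaultdict(set)      # identifier -> peer addresses seen
        self.max_depth = defaultdict(int)  # identifier -> deepest page requested

    def record(self, ident: str, peer_ip: str, depth: int) -> None:
        """Account for one request without storing the request itself."""
        self.hosts[ident].add(peer_ip)
        self.max_depth[ident] = max(self.max_depth[ident], depth)

    def histogram(self) -> dict[int, int]:
        """Map 'hosts sharing an identifier' to 'number of such identifiers'."""
        result: dict[int, int] = defaultdict(int)
        for peers in self.hosts.values():
            result[len(peers)] += 1
        return dict(result)


agg = Aggregator()
agg.record("a1b2c3d4", "203.0.113.7", 0)
agg.record("a1b2c3d4", "198.51.100.20", 1)
agg.record("ffee0011", "192.0.2.55", 0)
print(agg.histogram())
```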
Recommended setup
The recommended setup consists of forwarding requests from a reverse proxy to the spam endpoint. The advantages are:
- Seamless integration with an existing website
- Pages can be served via HTTPS (TLS is not implemented by spamgend(8))
- Bots monitoring data becomes available in the reverse proxy access log, making it possible to take broader countermeasures against identified scrapers (e.g. ban via firewall, or redirect their requests for regular pages to even more garbage!)
The purpose of spamgend(8) is to mess with greedy AI bots that violate
netiquette.
It is therefore highly recommended to list the URI path leading
to the spam endpoint in your robots.txt, so that legitimate scrapers
are not poisoned.
    User-agent: *
    Disallow: /spam/
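To check that such a rule keeps compliant crawlers away from the tar pit, the rule can be evaluated with Python's standard `urllib.robotparser` module. The host name `example.org`, the `ExampleBot` user agent, and the helper `allowed` are placeholders introduced for this sketch.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /spam/
"""

# Parse the rules from memory instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a crawler honouring robots.txt may fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("https://example.org/spam/abc"))   # under the spam prefix
print(allowed("https://example.org/blog/post"))  # a regular page
```

A crawler that respects the file never enters the spam prefix, so only those that ignore it end up in the generated pages.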
Binaries
- mchain(1) - Compile a Markov chain from one or more text files
- spamgen(1) - Generate pseudo-random sentences out of a compiled Markov chain
- spamgend(8) - Web daemon generating pseudo-random HTML pages out of a compiled Markov chain
How-To
The following instructions cover provisioning and installation on FreeBSD, but they can easily be adapted to other operating systems (e.g. GNU/Linux).
- Build the software
  - Install cmake, pkg-config and libevent2:

        root@freebsd:~ # pkg install -y devel/pkgconf devel/libevent devel/cmake-core

  - Unpack, build and install the package:

        root@freebsd:~ # tar -xzf ./supercazzola-*.tar.gz
        root@freebsd:~ # cmake -S ./supercazzola-*/ -B ./build
        root@freebsd:~ # cmake --build ./build
        root@freebsd:~ # cmake --install ./build

- Create and install Markov chain
  - Get some long text, e.g. Frankenstein from Gutenberg.org, and turn it into a Markov chain:

        root@freebsd:~ # fetch 'https://www.gutenberg.org/ebooks/84.txt.utf-8'
        84.txt.utf-8 438 kB 589 kBps 00s
        root@freebsd:~ # mkdir /usr/local/share/spamgend
        root@freebsd:~ # mchain ./84.txt.utf-8 /usr/local/share/spamgend/default.markov
        mchain: number of states: 42181 (build-time max: 81920)
        mchain: number of edges: 65106
        mchain: spamgend(8) mallocs: 858296 bytes

  - Sample results with spamgen(1):

        root@freebsd:~ # spamgen -k /usr/local/share/spamgend/default.markov

  - Test the result by running the daemon in the foreground:

        root@freebsd:~ # spamgend -f
        spamgend 2171 - - listening on localhost:7180
        spamgend 2171 - - listening on localhost:7181

- Configure spamgend(8)

  spamgend(8) will try to read configuration data from /usr/local/etc/spamgend/spamgend.conf or from the file specified with -c on the command line. If the default file does not exist, and if no alternative file is specified, spamgend(8) will be configured with default settings. See Configuration below.

- Start spamgend(8)
  - Enable and start the spamgend(8) service:

        root@freebsd:~ # service spamgend enable
        spamgend enabled in /etc/rc.conf
        root@freebsd:~ # service spamgend start
        Starting spamgend.

  - The daemon will log via syslog(3) on the "daemon" facility (check "/var/log/daemon.log" if needed):

        root@freebsd:~ # tail -n2 /var/log/daemon.log
        Dec 8 23:53:23 freebsd spamgend[3500]: listening on localhost:7180
        Dec 8 23:53:23 freebsd spamgend[3500]: listening on localhost:7181

- Sit and enjoy some spam

  spamgend(8) will serve spam on the spam endpoint (http://localhost:7180 by default) and provide information about the visitors on the info endpoint (http://localhost:7181 by default).
Configuration
The configuration file of spamgend(8) contains key-value pairs or key-only
toggles, one per line. Empty lines and lines starting with # are ignored.
Key-only lines (toggles) are permitted only for settings of boolean type, and
are interpreted as true.
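The rules above can be made concrete with a small parser sketch. It is not the parser used by spamgend(8). In particular, the `=` separator between key and value and the spelling `true` for boolean values are assumptions, since this document does not specify them; values are kept as plain strings.

```python
def parse_config(text: str) -> dict[str, object]:
    """Parse key-value lines and key-only boolean toggles.

    ASSUMPTION: keys and values are separated by '='. The separator actually
    accepted by spamgend(8) is not specified in this document.
    """
    booleans = {"daemon.foreground"}  # the only boolean key in the key list
    settings: dict[str, object] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments are skipped
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep:
            # Key-only line: valid only as a boolean toggle, meaning true.
            if key not in booleans:
                raise ValueError(f"line {lineno}: {key!r} needs a value")
            settings[key] = True
        elif key in booleans:
            settings[key] = value.lower() == "true"  # assumed spelling
        else:
            settings[key] = value
    return settings


SAMPLE = """\
# run in the foreground, listen on a non-default port
daemon.foreground
spam_ep.bind = localhost:8080
spam_ep.n_paragraphs = 4
"""

print(parse_config(SAMPLE))
```

Rejecting a key-only line for a non-boolean setting mirrors the rule that toggles are permitted only for boolean keys.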
The accepted keys and their meanings are listed below:
daemon.foreground
  - Tells spamgend(8) not to invoke daemon(3).
  - Type: boolean
  - Default: false
daemon.gid
  - Tells spamgend(8) to drop privileges via setgid(2) to the supplied gid. If daemon.gid is not specified, spamgend(8) will not try to use setgid(2), but it will still ensure the process is not executed with gid 0.
  - Type: string
  - Default: undefined
daemon.pidfile
  - Location on the filesystem of the pidfile. The pidfile is created before privileges are dropped, so the daemon cannot unlink it when it terminates. The init system is responsible for unlinking this file.
  - Type: string
  - Default: /var/run/spamgend.pid
daemon.uid
  - Tells spamgend(8) to drop privileges via setuid(2) to the supplied uid. If daemon.uid is not specified, spamgend(8) will not try to use setuid(2), but it will still ensure the process is not executed with uid 0.
  - Type: string
  - Default: undefined
info_ep.backlog
  - TCP backlog of the info endpoint. See listen(2).
  - Type: integer
  - Default: 32
info_ep.bind
  - Bind address of the info endpoint. See bind(2).
  - Type: string
  - Default: localhost:7181
spam_ep.backlog
  - TCP backlog of the spam endpoint. See listen(2).
  - Type: integer
  - Default: 32
spam_ep.bind
  - Bind address of the spam endpoint. See bind(2).
  - Type: string
  - Default: localhost:7180
spam_ep.max_sentence_len
  - Maximum length of a pseudo-random sentence served on the spam endpoint. Sentences are allowed to be shorter, according to the length of the pseudo-random walk on the Markov chain.
  - Type: integer
  - Default: 40
spam_ep.mkvchain
  - File system path of the Markov chain. Markov chain files are constructed via mchain(1).
  - Type: string
  - Default: /usr/local/share/spamgend/default.markov
spam_ep.n_paragraphs
  - Number of paragraphs in each page served by the spam endpoint.
  - Type: integer
  - Default: 3
spam_ep.n_references
  - Number of outbound links in each page served by the spam endpoint.
  - Type: integer
  - Default: 7
spam_ep.n_sentences
  - Number of pseudo-random sentences per paragraph served by the spam endpoint.
  - Type: integer
  - Default: 5
spam_ep.uri_prefix
  - Expected prefix of URIs on the spam endpoint. This option is useful when spamgend(8) is made reachable through a reverse proxy that prepends a prefix to each request URI.
  - Type: string
  - Default: undefined