Module ReferrerCop
In: referrercop

ReferrerCop

Parses an Apache log file or AWStats data file and filters out entries for referrers that are known spammers.

Visit wonko.com/software/referrercop for news, usage examples, and updates (including updated blacklists).

Version: 1.0.4 (10/17/2005)
Author: Ryan Grove (ryan@wonko.com)
Copyright: Copyright © 2005 Ryan Grove
License: ReferrerCop is open source software distributed under the terms of the GNU General Public License.

Dependencies

Usage

      referrercop [-f | -i | -n | -s] [options] [<file> ...]
      referrercop -u <url> [options]
      referrercop -U [options]
      referrercop {-h | -V}

Modes:

 -f, --filter             Filter the specified files (or standard input if no
                          files are specified), sending the results to
                          standard output. This is the default mode.
 -i, --in-place           Filter the specified files in place, replacing each
                          file with the filtered version. A backup of the
                          original file will be created with a .bak extension.
 -n, --extract-ham        Extract ham (nonspam) URLs from the input data and
                          send them to standard output. Duplicates will be
                          suppressed.
 -s, --extract-spam       Extract spam URLs from the input data and send
                          them to standard output. Duplicates will be
                          suppressed.
 -u, --url <url>          Test the specified URL.
 -U, --update             Check for an updated version of the default
                          blacklist and download it if available.

Options:

 -b, --blacklist <file>   Blacklist to use instead of the default list.
 -v, --verbose            Print verbose status and statistical info to stderr.
 -w, --whitelist <file>   Whitelist to use instead of the default list.

Information:

 -h, --help               Display usage information (this message).
 -V, --version            Display version information.
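
The in-place mode described above (filter a file, keep the original as a .bak copy) can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not ReferrerCop's actual code: the helper name filter_file_in_place and its block-based interface are hypothetical.

```ruby
require 'fileutils'

# Hypothetical helper (not part of ReferrerCop): rewrites +path+ in place,
# keeping only the lines for which the block returns a true value. The
# original file is first copied to "<path>.bak", mirroring the documented
# behaviour of the -i mode.
def filter_file_in_place(path)
  backup = path + '.bak'
  FileUtils.cp(path, backup)

  # Read from the backup so the original content is never lost mid-write.
  kept = File.readlines(backup).select { |line| yield(line) }
  File.open(path, 'w') { |file| file.write(kept.join) }

  backup
end
```

A call such as filter_file_in_place('access.log') { |line| !line.include?('spam.example') } would leave the unfiltered data in access.log.bak.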

Methods

Constants

APP_NAME = 'ReferrerCop'
APP_VERSION = '1.0.4'
UPDATE_SERVER = 'wonko.com'
UPDATE_PORT = 80
UPDATE_PATH = '/files/referrercop/blacklist.refcop.gz'
CONFIG_PATHS = [ '.', File::SEPARATOR + File.join('usr', 'local', 'share', 'referrercop'), File::SEPARATOR + File.join('usr', 'share', 'referrercop') ]
  List of paths that will be searched for blacklist/whitelist files if they aren’t specified on the command line.
REGEXPS = {
  :apache_combined => /^[^\s]+ - [^\s]+ \[.+\] "[A-Z]+ [^\s]+(?: [^\s]+")? [0-9]+ [0-9-]+ "(.*)" ".*"$/i,
  :awstats_header => /^AWSTATS DATA FILE /,
  :awstats_map => /^BEGIN_MAP.*^END_MAP$/m,
  :awstats_pagerefs_extract => /^BEGIN_PAGEREFS.*?$.*?^(.*?)^END_PAGEREFS$/m,
  :awstats_pagerefs_replace => /^BEGIN_PAGEREFS.*?^END_PAGEREFS$/m,
  :awstats_url => /^(https?:\/\/[^\s]+)/i,
  :text_url => /^(https?:\/\/[^\s]+)/i
}
  Common regular expressions used throughout the application.
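
As a rough illustration of how the :apache_combined expression can be used, the sketch below applies it to a single log entry and reads the referrer from the first capture group. The pattern is copied from REGEXPS; the method name extract_referrer and the sample log line are made up for this example.

```ruby
# Pattern copied from REGEXPS[:apache_combined]; the single capture group
# holds the referrer field of a combined log entry.
APACHE_COMBINED = /^[^\s]+ - [^\s]+ \[.+\] "[A-Z]+ [^\s]+(?: [^\s]+")? [0-9]+ [0-9-]+ "(.*)" ".*"$/i

# Hypothetical helper: returns the referrer of a combined log entry, or nil
# if the line does not look like one.
def extract_referrer(line)
  match = APACHE_COMBINED.match(line)
  match ? match[1] : nil
end

# Made-up log entry for illustration.
sample = '192.0.2.1 - - [17/Oct/2005:10:15:00 -0700] ' \
         '"GET /index.html HTTP/1.1" 200 2326 ' \
         '"http://example.com/page" "Mozilla/5.0"'

puts extract_referrer(sample)
```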

Public Class methods

Determines the format of input and extracts URLs of the specified type.

type should be either :ham or :spam.

Extracts URLs of the specified type (:ham or :spam) from an Apache combined log file.
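
A minimal sketch of this kind of extraction follows. It is not the module's own implementation: the name extract_referrer_urls is hypothetical, the spam decision is delegated to a caller-supplied block, and skipping entries with an empty referrer ("-") is an assumption of the sketch.

```ruby
# Pattern copied from REGEXPS[:apache_combined]; group 1 is the referrer.
APACHE_COMBINED_RE = /^[^\s]+ - [^\s]+ \[.+\] "[A-Z]+ [^\s]+(?: [^\s]+")? [0-9]+ [0-9-]+ "(.*)" ".*"$/i

# Hypothetical helper, not ReferrerCop's own method. +input+ is any object
# that responds to each_line (a String, File, or StringIO); the block
# decides whether a URL counts as spam.
def extract_referrer_urls(input, type)
  unless [:ham, :spam].include?(type)
    raise ArgumentError, 'type should be either :ham or :spam'
  end

  urls = []
  input.each_line do |line|
    match = APACHE_COMBINED_RE.match(line)
    next if match.nil?

    url = match[1]
    # Assumption: requests without a referrer are not reported at all.
    next if url.empty? || url == '-'

    spam = yield(url) ? true : false
    urls << url if spam == (type == :spam)
  end

  urls.uniq # duplicates are suppressed, as in the -n and -s modes
end
```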

Determines the format of input and filters it for referrer spam. The filtered data will be sent to output.

Parses and filters Apache combined log entries from input. The filtered log entries will be sent to output.

Parses and filters AWStats data from input. The filtered data will be sent to output.

Parses and filters input as a list of URLs (one per line). The filtered URLs will be sent to output.
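
Filtering a plain URL list could look roughly like the sketch below, which uses the :text_url expression from REGEXPS. The method name filter_url_list is hypothetical, the spam test is supplied by the caller as a block, and passing unrecognized lines through unchanged is an assumption rather than documented behaviour.

```ruby
# Pattern copied from REGEXPS[:text_url]; group 1 is the URL.
TEXT_URL = /^(https?:\/\/[^\s]+)/i

# Hypothetical helper: copies each line of +input+ to +output+ unless the
# URL on that line is judged to be spam by the block.
def filter_url_list(input, output)
  input.each_line do |line|
    match = TEXT_URL.match(line)
    # Assumption: lines without a recognizable URL are kept as they are.
    output << line if match.nil? || !yield(match[1])
  end
  output
end
```

Because the helper only relies on each_line and <<, it works with $stdin and $stdout as well as with files or StringIO objects.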

Examines input and returns its type. The following input types are supported:

:apache_combined
Apache combined log file.
:awstats
AWStats data file.
:text
Unrecognized format (assumed to be a list of URLs).
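
One plausible way to make this decision is to test the input against the header and log-entry expressions from REGEXPS and fall back to :text. The sketch below assumes exactly that; the method name detect_input_type and the order of the checks are guesses, not taken from the source.

```ruby
# Patterns copied from REGEXPS.
AWSTATS_HEADER_RE = /^AWSTATS DATA FILE /
APACHE_LINE_RE = /^[^\s]+ - [^\s]+ \[.+\] "[A-Z]+ [^\s]+(?: [^\s]+")? [0-9]+ [0-9-]+ "(.*)" ".*"$/i

# Hypothetical helper: classifies a sample of the input (for example its
# first few lines) as one of the three supported types.
def detect_input_type(sample)
  return :awstats if sample =~ AWSTATS_HEADER_RE
  return :apache_combined if sample =~ APACHE_LINE_RE

  :text # anything else is assumed to be a list of URLs
end
```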

Loads filename as a blacklist, compiling all regular expressions for speed. If filename is nil and a blacklist exists at one of the paths specified in CONFIG_PATHS, that blacklist will be loaded.

Loads filename as a whitelist, compiling all regular expressions for speed. If filename is nil and a whitelist exists at one of the paths specified in CONFIG_PATHS, that whitelist will be loaded.
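
The loading behaviour described for both lists can be sketched as follows. Several details here are assumptions, since this page does not specify them: the file format (one regular expression per line, with blank lines and lines starting with # ignored), the default file name passed by the caller, and the helper name load_pattern_list.

```ruby
# Search paths, mirroring CONFIG_PATHS.
SEARCH_PATHS = [
  '.',
  File::SEPARATOR + File.join('usr', 'local', 'share', 'referrercop'),
  File::SEPARATOR + File.join('usr', 'share', 'referrercop')
]

# Hypothetical helper: returns an array of compiled Regexp objects. If
# +filename+ is nil, the first file named +default_name+ found in
# +search_paths+ is used; if none exists, the list is empty.
def load_pattern_list(filename, default_name, search_paths = SEARCH_PATHS)
  if filename.nil?
    candidates = search_paths.map { |dir| File.join(dir, default_name) }
    filename = candidates.find { |candidate| File.file?(candidate) }
  end
  return [] if filename.nil?

  lines = File.readlines(filename).map { |line| line.strip }
  # Assumed format: one pattern per line; blanks and # comments are skipped.
  patterns = lines.reject { |line| line.empty? || line.start_with?('#') }

  # Compile once up front so each URL check reuses the same Regexp objects.
  patterns.map { |pattern| Regexp.new(pattern) }
end
```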

Returns true if the passed URL is referrer spam, false otherwise.
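
With both lists compiled to regular expressions, the check itself can be very small. The sketch below assumes that a whitelist match always overrides the blacklist, which seems a natural reading but is not stated on this page; the method name spam_url? is hypothetical.

```ruby
# Hypothetical helper: +blacklist+ and +whitelist+ are arrays of Regexp
# objects. Returns true if +url+ is considered referrer spam.
def spam_url?(url, blacklist, whitelist = [])
  # Assumption: a whitelisted URL is never treated as spam.
  return false if whitelist.any? { |pattern| pattern =~ url }

  blacklist.any? { |pattern| pattern =~ url }
end

# Made-up patterns for illustration.
blacklist = [/casino/i, /pills/i]
whitelist = [/^https?:\/\/casino-history\.example\//i]

puts spam_url?('http://cheap-pills.example/', blacklist, whitelist)
puts spam_url?('http://casino-history.example/museum', blacklist, whitelist)
```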
