4. Monitoring

Here are some of the things that you may find in your Slony-I logs, along with explanations of what they mean.

4.1. CONFIG notices

These entries are pretty straightforward. They are informative messages about your configuration.

Here are some typical entries that you will probably run into in your logs:

CONFIG main: local node id = 1
CONFIG main: loading current cluster configuration
CONFIG storeNode: no_id=3 no_comment='Node 3'
CONFIG storePath: pa_server=5 pa_client=1 pa_conninfo="host=127.0.0.1 dbname=foo user=postgres port=6132" pa_connretry=10
CONFIG storeListen: li_origin=3 li_receiver=1 li_provider=3
CONFIG storeSet: set_id=1 set_origin=1 set_comment='Set 1'
CONFIG main: configuration complete - starting threads
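
These messages reflect the cluster configuration that slon loads from Slony-I's own tables at startup. If you want to compare the log output against what is stored, queries along these lines show the same data (a sketch, assuming a hypothetical cluster named "foo", whose schema is therefore "_foo"):

-- storeNode, storePath, storeListen, and storeSet entries, respectively
SELECT no_id, no_comment FROM "_foo".sl_node;
SELECT pa_server, pa_client, pa_conninfo, pa_connretry FROM "_foo".sl_path;
SELECT li_origin, li_receiver, li_provider FROM "_foo".sl_listen;
SELECT set_id, set_origin, set_comment FROM "_foo".sl_set;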

4.2. DEBUG notices

Debug notices are always prefaced by the name of the thread that the notice originates from. You will see messages from the following threads:

localListenThread

This is the local thread that listens for events on the local node.

remoteWorkerThread-X

The thread processing remote events. You can expect to see one of these for each node that this node communicates with.

remoteListenThread-X

Listens for events on a remote node database. You may expect to see one of these for each node in the cluster.

cleanupThread

Takes care of things like vacuuming, cleaning out the confirm and event tables, and deleting old data.

syncThread

Generates SYNC events.
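
The SYNC events that syncThread generates (and that the remoteWorkerThread-X threads consume on other nodes) are recorded in sl_event, so they can also be watched from SQL. A sketch, again assuming a hypothetical cluster named "foo":

-- The most recent SYNC events recorded on this node
SELECT ev_origin, ev_seqno, ev_timestamp
  FROM "_foo".sl_event
 WHERE ev_type = 'SYNC'
 ORDER BY ev_seqno DESC
 LIMIT 5;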

4.3. How to read Slony-I logs

Note that as far as slon is concerned, there is no "master" or "slave." They are just nodes.

Initially, you can expect to see some events propagating back and forth between the nodes. First, there should be events indicating the creation of the nodes and paths. If you don't see those, the nodes probably aren't able to communicate with one another, and nothing else will happen.

After that, you'll mainly see two sorts of behaviour: on the origin node, the syncThread periodically generating SYNC events, and on the subscriber nodes, the remoteWorkerThread-X threads picking those SYNC events up, applying the replicated data, and confirming the events.
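
One quick way to verify that events are flowing is to look at sl_confirm, which records which events each node has confirmed receiving. A sketch, assuming a hypothetical cluster named "foo"; if last_confirmed stops advancing for some pair of nodes, those nodes are not communicating:

-- Latest event confirmed between each pair of nodes
SELECT con_origin, con_received, max(con_seqno) AS last_confirmed
  FROM "_foo".sl_confirm
 GROUP BY con_origin, con_received
 ORDER BY con_origin, con_received;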

4.4. Nagios Replication Checks

The script pgsql_replication_check.pl in the tools directory represents some of the best answers arrived at in attempts to build replication tests to plug into the Nagios system monitoring tool.

A former script, test_slony_replication.pl, took a "clever" approach: a "test script" was run periodically, which rummaged through the Slony-I configuration to find the origin and subscribers, injected a change, and watched for its propagation through the system. That approach proved problematic in practice, which motivated the simpler replacement described next.

The new script, pgsql_replication_check.pl, takes the minimalist approach of assuming that the system is an online system seeing regular "traffic," so that you can define a view specifically for the replication test, called replication_status, which is expected to see regular updates. The view simply looks for the youngest "transaction" on the node, and lists its timestamp, its age, and whatever bit of application information seems useful to see.
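
What such a view looks like depends entirely on the application; the following is only a sketch, in which some_table, trans_on, and id are hypothetical names for an application table, its timestamp column, and an identifier column:

-- Hypothetical example; substitute your own table and columns
CREATE VIEW replication_status AS
  SELECT trans_on AS ts,
         now() - trans_on AS age,
         id AS details
    FROM some_table
   ORDER BY trans_on DESC
   LIMIT 1;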

An instance of the script will need to be run for each node that is to be monitored; that is the way Nagios works.

4.5. Monitoring Slony-I using MRTG

One user reported on the Slony-I mailing list how to configure MRTG (the Multi Router Traffic Grapher) to monitor Slony-I replication.

... Since I use mrtg to graph data from multiple servers I use snmp (net-snmp to be exact). On the database server, I added the following line to the snmpd configuration:

exec replicationLagTime  /cvs/scripts/snmpReplicationLagTime.sh 2

where /cvs/scripts/snmpReplicationLagTime.sh looks like this:

#!/bin/bash
# Print the replication lag, in whole seconds, for the node whose ID is
# given as the first argument ($1); the cluster here is named "irr".
/home/pgdba/work/bin/psql -U pgdba -h 127.0.0.1 -p 5800 -d _DBNAME_ -qAt -c \
"select cast(extract(epoch from st_lag_time) as int8) FROM _irr.sl_status
WHERE st_received = $1"

Then, in mrtg configuration, add this target:

Target[db_replication_lagtime]:extOutput.3&extOutput.3:public@db::30:::
MaxBytes[db_replication_lagtime]: 400000000
Title[db_replication_lagtime]: db: replication lag time
PageTop[db_replication_lagtime]: <H1>db: replication lag time</H1>
Options[db_replication_lagtime]: gauge,nopercent,growright

4.6. test_slony_state

This script is in its preliminary stages, and may be used to do some analysis of the state of a Slony-I cluster.

You specify arguments including database, host, user, cluster, password, and port to connect to any of the nodes on a cluster. You also specify a mailprog command (which should be a program equivalent to Unix mailx) and a recipient of email.

The script then rummages through sl_path to find all of the nodes in the cluster, along with the DSNs that allow it to connect, in turn, to each of them.
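
In SQL terms, that rummaging amounts to something like the following sketch (again assuming a hypothetical cluster named "foo"):

-- The DSNs available for reaching each node in the cluster
SELECT DISTINCT pa_server, pa_conninfo
  FROM "_foo".sl_path
 ORDER BY pa_server;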

For each node, the script examines the state of replication and does some diagnosis work based on parameters set in the script; if you don't like the values, pick your favorites!