Dar Documentation


Detailed Notes on Various Topics





Introduction

Here follows a collection of notes. These have been written after the implementation of a given feature, mainly for further reference but also for user information. The idea behind these notes is, on one side, to record the implementation choices and the arguments that led to them, and on the other side to let the user learn about the choices made and be able to bring his remarks, without having to dig deeply into the code to learn dar's internals.




EA & differential backup

Brief presentation of EA:

EA stands for Extended Attributes. In a Unix filesystem, a regular file is composed of a set of bytes (the data) and an inode. The inode adds properties to the file, such as owner, group, permissions, dates (last modification date of the data [mtime], last access date to the data [atime], and last inode change date [ctime]), etc. Last, the name of the file is not contained in the inode, but in the directory(ies) it is linked to. When a file is linked more than once in the directory tree, we speak about "hard links". This way the same data and associated inode appear several times in the same or different directories. This is not the same as a symbolic link, which is a file that contains the path to another file (which may or may not exist). A symbolic link has its own inode. OK, now let's talk about EA:

Extended Attributes are a recent feature of Unix filesystems. They extend the attributes provided by the inode and associated with the data. They are not part of the inode, nor part of the data, nor part of a given directory. They are stored beside the inode and are a set of key/value pairs. The owner of the file can add or define any key and eventually associate data to it. He can also list and remove a particular key. What are they used for? They are a way to associate arbitrary information with a file.
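
For illustration, on a filesystem with EA support, the key/value pairs of the "user" namespace can be manipulated with the setfattr and getfattr tools from the attr package (the attribute name and value below are arbitrary):

setfattr -n user.comment -v "my note" some_file
getfattr -d some_file
setfattr -x user.comment some_file

the first command defines the key "user.comment" and associates a value to it, the second lists the user EA and their values, and the third removes the key.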

One particularly interesting use of EA is ACL: Access Control Lists. ACL can be implemented using EA and add a finer grain in assigning access permissions to a file. For more information on EA and ACL, see the site of Andreas Grunbacher.

EA & Differential Backup

To determine whether an EA has changed, dar looks at the ctime value. If ctime has changed (due to an EA change, but also to a permission or owner change), dar saves the EA. ctime also changes if atime or mtime changes. So if you access a file or modify it, dar will consider that the EA have changed too. This is not really fair, I admit.

Something better would be to compare EA one by one, and record those that have changed or have been deleted. But to be able to compare all EA and their values, the reference EA must reside in memory. As EA can grow up to 64 KB per file, this can lead to a quick saturation of the virtual memory, which is already solicited enough by the catalogue.

These two schemes imply a different pattern for storing EA in the archive. In the first case (no EA in memory except at the time of operation on them), to avoid skipping in the archive (and asking the user to change disks too often), EA must be stored beside the data of the file (if present). Thus they must be distributed all along the archive (except at the end, which only contains the catalogue).

In the second case (EA are loaded in memory for comparison), EA must reside beside or within the catalogue, in any case at the end of the archive, so that the user does not need all the disks just to take an archive as reference.

As the catalogue already grows fast with the number of files to save (from a few bytes for a hard link to around 400 bytes per directory inode), the memory saving option has been adopted.

Thus, EA change detection is based on the ctime change. Unfortunately, no system call permits restoring ctime. Thus, restoring a differential backup after its reference has been restored will present the restored inodes as more recent than those in the differential archive, so the -r option would prevent any EA restoration. In consequence, -r has been disabled for EA: it only concerns data contents. If you don't want to restore any EA but just more recent data, you can use the following: -r -u "*"


Dar and remote backup server

The situation is the following: you have a host (called "local" in the following), on which resides an operational system, which you want to backup regularly without perturbing users. For security reasons you want to store the backup on another host (called "remote host" in the following), only used for backup. Of course, you do not have much space on the local host to store the archive.

Between these two hosts, you could use NFS, in which case nothing special would be necessary to use dar as usual. But if for security reasons you don't want to use NFS (insecure network, local users must not have access to backups), and prefer to communicate through an encrypted session (using ssh for example), then you need the dar features brought by version 1.1.0:

dar can now output its archive to stdout instead of a given file. To activate it, use "-" as basename. Here is an example:

dar -c - -R / -z | some_program
or
dar -c - -R / -z > named_pipe_or_file

Note that file splitting is not available, as it has not much meaning when writing to a pipe (a pipe has no name, and there is no way to skip (or seek) in a pipe, while dar needs to set back a flag in a slice header when it is not the last slice of the set). At the other end of the pipe (on the remote host), the data can be redirected to a file, with a proper filename (something that matches "*.1.dar"):

some_other_program > backup_name.1.dar
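
For example, if ssh provides the encrypted session, both ends of the pipe can be combined in a single command run on the local host (the account and destination path below are only illustrative):

dar -c - -R / -z | ssh user@remote_host "cat > /path/to/backup_name.1.dar"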

It is also possible to redirect the output to dar_xform, which can in turn, on the remote host, split the data flow into several files, pausing between them exactly as dar is able to do:

some_other_program | dar_xform -s 100M - backup_name

this will create backup_name.1.dar and so on. The resulting archive is totally compatible with those directly generated by dar. OK, you are happy, you can backup the local filesystem to a remote server through a secure socket session, in a full featured dar archive, without using NFS. But now you want to make a differential backup taking this archive as reference. How to do that? The simplest way is to use the new feature called "isolation", which extracts the catalogue from the archive and stores it in a little file. On the remote backup server you would type:

dar -A backup_name -C CAT_backup_name -z

if the catalogue is too big to fit on a floppy, you can split it as usual using dar:

dar -A backup_name -C CAT_backup_name -z -s 1440k

the generated archive (CAT_backup_name.1.dar, and so on) only contains the catalogue, but can still be used as reference for a new backup. You just need to transfer it back to the local host, either using floppies, or through a secured socket session, or even directly isolating the catalogue to a pipe that goes from the remote host to the local host:

on remote host:
dar -A backup_name -C - -z | some_program

on local host:
some_other_program > CAT_backup_name.1.dar

or use dar_xform as previously if you need splitting:
some_other_program | dar_xform -s 1440k CAT_backup_name

then you can make your differential backup as usual:
dar -A CAT_backup_name -c - -z -R / | some_program

or if this time you prefer to save the archive locally:
dar -A CAT_backup_name -c backup_diff -z -R /

For differential backups, instead of isolating the catalogue, it is also possible to read an archive or its extracted catalogue through pipes. Yes, two pipes are required for dar to be able to read an archive. The first goes from dar to the external program "dar_slave" and carries orders (asking for some portions of the archive), and the other pipe goes from "dar_slave" back to "dar" and carries the requested data for reading.

By default, if you specify "-" as basename for -l, -t, -d, -x, or for -A (used with -C or -c), dar and dar_slave will use their standard input and output to communicate. Thus you need an additional program to connect the input of the first to the output of the second, and vice versa. Warning: you cannot use named pipes that way, because dar and dar_slave would get blocked upon opening the first named pipe, waiting for the peer to open it too, even before they have started (dead lock at shell level). For named pipes, there are the -i and -o options that help: they receive a filename as argument, which may be a named pipe. The -i argument is used instead of stdin and -o instead of stdout. Note that for dar, -i and -o are only available if "-" is used as basename. Let's take an example:

You now want to restore an archive from your remote backup server. Thus, on it, you have to run dar_slave this way:

on remote server:
some_prog | dar_slave backup_name | some_other_prog
or
dar_slave -o /tmp/pipe_todar -i /tmp/pipe_toslave backup_name

and on the local host you have to run dar this way:

some_prog | dar -x - -v ... | some_other_prog
or
dar -x - -i /tmp/pipe_todar -o /tmp/pipe_toslave -v ...

It does not matter whether dar or dar_slave is run first; dar can use -i and/or -o, while dar_slave does not. What is important here is to connect their inputs and outputs one way or another, it does not matter how. The only restriction is that the communication channel must be perfect: no data loss, no duplication, no order change; thus communication over TCP should be fine.

Of course, you can also isolate a catalogue through pipes, test an archive, make a comparison, use a reference catalogue this way, etc., and even then, output the resulting archive to a pipe! If -C or -c is used with "-" while -A is also used with "-", it is then mandatory to use -o: the output catalogue will be generated on standard output, thus to send orders to dar_slave you must use another channel, with -o:

       LOCAL HOST                                   REMOTE HOST
   +-----------------+                     +-----------------------------+
   |   filesystem    |                     |     backup of reference     |
   |       |         |                     |            |                |
   |       |         |                     |            |                |
   |       V         |                     |            V                |
   |    +-----+      | backup of reference |      +-----------+          |
   |    | DAR |--<-]=========================[-<--| DAR_SLAVE |          |
   |    |     |-->-]=========================[->--|           |          |
   |    +-----+      | orders to dar_slave |      +-----------+          |
   |       |         |                     |      +-----------+          |
   |       +--->---]=========================[->--| DAR_XFORM |--> backup|
   |                 |        saved data   |      +-----------+ to slices|
   +-----------------+                     +-----------------------------+

on local host:
dar -c - -A - -i /tmp/pipe_todar -o /tmp/pipe_toslave | some_prog

on the remote host:

dar_slave -i /tmp/pipe_toslave -o /tmp/pipe_todar full_backup
(dar_slave provides the full_backup archive for dar's -A option)

some_other_prog | dar_xform - diff -s 140M -p ...
(dar_xform makes slices of the output archive provided by dar)

Last, if you don't want to mess with pipes, you still have the possibility to create a VPN and mount an NFS partition over it. In some cases it may be sufficient for what you want to do. See the VPN-HOWTO for a simple implementation using sshd and pppd.


Bytes, bits, kilo, mega etc.


You probably know the metric system a bit, where a dimension is expressed by a base unit (the meter for distance, the liter for volume, the joule for energy, the volt for electrical potential, the bar for pressure, the watt for power, the second for time, etc.), and scaled using prefixes:

deci  (d) = 0.1
centi (c) = 0.01
milli (m) = 0.001
micro (u) = 0.000,001 (symbol is not "u" but the "mu" Greek letter)
nano  (n) = 0.000,000,001
pico  (p) = 0.000,000,000,001
femto (f) = 0.000,000,000,000,001
atto  (a) = 0.000,000,000,000,000,001
zepto (z) = 0.000,000,000,000,000,000,001
yocto (y) = 0.000,000,000,000,000,000,000,001
deca (da) = 10
hecto (h) = 100
kilo  (k) = 1,000  (yes, a lower case letter)
mega  (M) = 1,000,000
giga  (G) = 1,000,000,000
tera  (T) = 1,000,000,000,000
peta  (P) = 1,000,000,000,000,000
exa   (E) = 1,000,000,000,000,000,000
zetta (Z) = 1,000,000,000,000,000,000,000
yotta (Y) = 1,000,000,000,000,000,000,000,000

This way two milliseconds are 0.002 second, and 5 kilometers are 5,000 meters. All was fine and nice up to the recent time when computer science appeared: in this discipline, the need to measure the size of information storage arose. The smallest size is the bit (contraction of binary digit), binary because it has two possible states: "0" and "1". Grouping bits by 8, computer scientists called the result a byte. A byte has 256 different states (2 power 8). The ASCII (American Standard Code for Information Interchange) code arrived and assigned a letter, or more generally a character, to some values of a byte (A is assigned to 65, space to 32, etc.). And as most text is composed of a set of characters, they started to count sizes in bytes. Time after time, following technology evolution, memory sizes approached 1000 bytes.

But as memory is accessed through a bus, which is a fixed number of cables (or integrated circuits), on which only two possible voltages are authorized to mean 0 or 1, the total amount of bytes that a bus can address is always a power of 2. With a two-cable bus, you can have 4 values (00, 01, 10 and 11, where a digit is the state of a cable), so you can address 4 bytes. Giving a value to each cable defines an address to read or write in the memory. Unfortunately, 1000 is not a power of 2, so approaching 1000 bytes, it was decided that a "kilobyte" would be 1024 bytes, which is 2 power 10. Some time after, and by extension, a megabyte was defined to be 1024 kilobytes, a gigabyte to be 1024 megabytes, and so on, with the exception of the 1.44 MB floppy, whose capacity is 1440 kilobytes: here "mega" means 1000 kilo...

In parallel, in the telecommunications domain, going from analog to digital signal made the bit be used as well. In place of the analog signal took place a flow of bits, representing the samples of the original signal. For telecommunications the problem was more a problem of flow rate: how many bits can be transmitted per second. In ancient times appeared the 1200 bits per second, then 64000, also designated as 64 Kbit/s. Thus here, kilo keeps its usual meaning of 1000 times the base unit (except that the K is uppercase while it should be lowercase). You can also find Ethernet at 10 Mbit/s, which is 10,000,000 bits per second, same thing with Token-Ring which had 4, 16 or 100 Mbit per second (4,000,000, 16,000,000 or 100,000,000 bits/s). But even for telecommunications, kilo is not always 1000 times the base unit: the E1 bandwidth at 2 Mbit/s, for example, is in fact 32*64 Kbit/s, thus 2048 Kbit/s... not 2000 Kbit/s.

Anyway, back to dar: you have the possibility to give a size in bytes or using a single letter as suffix (k or K, M, G, T, P, E, Z, Y), thus the possibility to provide a size in kilo, mega, giga, tera, peta, exa, zetta or yotta bytes, with the computer science definition of these terms (powers of 1024) by default.

These suffixes are for convenience, so that you do not have to compute powers of 1024 yourself. For example, if you want to fill a CD-R you can use the "-s 650M" option, which is equivalent to "-s 681574400" (650 * 1024 * 1024); choose the one you prefer, the result is the same :-). Now, if you want 2 megabyte slices in the sense of the metric system, simply use "-s 2000000" or read below:

Starting with version 2.2.0, you can alter the meaning of all the suffixes used by dar: the

--alter=SI-units

option (which can be shortened to -aSI or -asi) changes the meaning of the suffixes that follow on the command-line to the metric system (International System, SI), up to the end of the line or up to a

--alter=binary-units

argument (which can be shortened to -abinary), after which we are back to the computer science meaning of kilo, mega, etc., up to the end of the line or up to the next --alter=SI-units. Thus, in place of -s 2000000 one could use:

   -aSI -s 2M

Yes, and to make things more confusing, marketing arrived and made sellers count gigabytes a third way: I remember some time ago I bought a hard disk described as "2.1 GB" (OK, that was several years ago), but in fact it had only 2,097,152 kilobytes (2 GiB) available. This is far from 2,202,009 kilobytes (= 2.1 GiB in the computer science meaning), and a bit more than 2,000,000 kilobytes (metric system). OK, if it had had these 2,202,009 kilobytes (computer science meaning of 2.1 GB), this hard disk would have been sold under the label "2.5 GB"! ... just kidding :-)

Note that to distinguish the power-of-two from the power-of-ten meanings of kilo, mega, tera and so on, new prefixes have been defined:
Ki = 1024
Mi = 1024*1024
Gi = 1024*1024*1024
and so on for Ti, Pi, Ei, Zi and Yi.

Thus, for example, we have KiB for 1024 bytes and Kibit for 1024 bits, while KB keeps meaning 1000 bytes and Kbit 1000 bits.



Archive structure in brief


The Slice Level

A slice is composed of a header and data

+--------+-------------------------------------------+
| header |  Data                                     |
|        |                                           |
+--------+-------------------------------------------+

the slice header is composed of
  • a magic number that tells this is a dar slice
  • an internal_name which is unique to a given archive
  • a flag that tells whether the slice is the last of the archive
  • an extension flag, that tells whether an extension field is present. A possible extension field is the size of the following slices (-s option) when -S is used
+-------+----------+------+-----------+
| Magic | internal | flag | extension |
| Num.  | name     | byte | byte      |
+-------+----------+------+-----------+

Or for the first slice if -s and -S are used together
+-------+----------+------+-----------+-------------------+
| Magic | internal | flag | extension | following slices  |
| Num.  | name     | byte | byte      | size              |
+-------+----------+------+-----------+-------------------+

The header is the first thing to be written; if the slice proves not to be the last one, the flag field is overwritten to say so. The header is also the first part to be read.

To know where a given position takes place in the archive, dar must first read the first slice's header, to know the size of the first slice; and if the extension field is present, the size of the following slices is read from it. This is why at start-up dar always asks for the first slice.

Archive Level

The archive level describes the structure of the slices' data fields, once stuck together across slices.

+---------+-----------------------------------------+-----------+------+
| version |     Data                                | catalogue | term |
| header  |                                         |           |      |
+---------+-----------------------------------------+-----------+------+

the header version is composed of:
  • edition version of the archive
  • compression algorithm used
  • command line used for creating the archive
  • flag that tells if root or user EA have been saved
+---------+------+---------------+------+
| edition | algo | command line  | flag |
|         |      |               |      |
+---------+------+---------------+------+

The data is a sequence of file contents, with EA if present:

  ....--+---------------------+----+------------+-----------+----+---....
        |  file data          | EA | file data  | file data | EA |
        | (may be compressed) |    | (no EA)    |           |    |
  ....--+---------------------+----+------------+-----------+----+---....

the catalogue contains all inode, directory structure and hard link information. The directory structure is stored in a simple way: the inode of a directory comes first, then the inodes of the files it contains, then a special entry named "EOD" for End of Directory. Consider the following tree:

 - toto
    | titi
    | tutu
    | tata
    |   | blup
    |   +--
    | boum
    | coucou
    +---

it would generate the following sequence for catalogue storage:

+-------+------+------+------+------+-----+------+--------+-----+
|  toto | titi | tutu | tata | blup | EOD | boum | coucou | EOD |
|       |      |      |      |      |     |      |        |     |
+-------+------+------+------+------+-----+------+--------+-----+

EOD takes one byte, and this way there is no need to store the full path of each file: just the filename is recorded.

The terminator stores the position of the beginning of the catalogue; it is the last thing to be written. Thus dar first reads the terminator, then the catalogue.

All Together

Here is an example of how data can be structured in a four slice archive:

+-------------+--------+------------------------+
| slice + ext | version|  file data + EA        |
| header      | header |                        |
+-------------+--------+------------------------+

the first slice has been defined smaller using the -S option

+--------+--------------------------------------------+
| slice  |           file data + EA                   |
| header |                                            |
+--------+--------------------------------------------+

+--------+--------------------------------------------+
| slice  |           file data + EA                   |
| header |                                            |
+--------+--------------------------------------------+

+--------+---------------------+-----------+------+
| slice  |   file data + EA    | catalogue | term |
| header |                     |           |      |
+--------+---------------------+-----------+------+

the last slice is smaller because there was not enough data to make it full.

The archive is written quite sequentially, except that when a new slice is created, the flag in the previous slice's header has to be changed to mean it is not the terminal one.

For reading, dar first reads the slice header of the first slice, then the version header, then the terminator and the catalogue (located on the last slice), and then proceeds with the operation. If it is extracting the whole archive, dar goes back to the first slice and asks for all slices one by one.

Other Levels

Things get a bit more complicated if we consider compression and encryption. The way the problem is addressed in dar's code is a bit like the way networks are designed in computer science, using the notion of _layer_. Here there is an additional constraint: a given layer may or may not be present (encryption, compression, slicing for example). So all layers must present the same interface to the layer above them. This interface is defined by the pure virtual class "generic_file", which provides generic methods for reading, writing, skipping, and getting the current offset when writing to or reading from a file. This way, the compressor class acts like a file which compresses data written to it and writes the compressed data to another "generic_file". The blowfish and scramble classes act the same, but in place of compressing/uncompressing they encrypt/decrypt the data to/from another generic_file object. The slicing we have seen above follows the same principle: this is a "sar" object that transfers data written to it to several "fichier" objects. The "fichier" class also inherits from the generic_file class, and is a wrapper for the plain filesystem calls.

Here are now the layers:

              +----+--+----+-...........+---------+
archive       |file|EA|file|            |catalogue|
layout        |data|  |data|            |         |
              +----+--+----+-...........+---------+
                   |           |              |
                   |           |              |
                   V           V              V
              +-----------------------------------+
compression   |         (compressed)  data        |
              +-----------------------------------+
                    |                      |
                    |                      |          / Terminator
                    |                      |          |
                    |                      |          V
elastic  +---+      |                      |       +----+---+
buffers  |EEE|      |                      |       |TTTT|EEE|
         +---+      |                      |       +----+---+
           |        |                      |              |
           V        V                      V              V
         +--------------------------------------------------+
cipher   |        (encrypted) data                          |
         +--------------------------------------------------+
header\               |                         |
version|              |                         |
       |              |                         |
       |              |                         |
       V              V                         V
     +------------------------------------------------------+
sar  |VVV|                  data                            |
     +------------------------------------------------------+
        |         |  |         |   |        |   |        |  |
slice   |         |  |         |   |        |   |        |  |
headers |         |  |         |   |        |   |        |  |
 |  |   |         |  |         |   |        |   |        |  |
 |  +---|------\  |  |         |   |        |   |        |  |
 V      V      V  V  V         V   V        V   V        V  V
+---------+  +---------+  +---------+  +---------+  +-------+
|HH| data |  |HH| data |  |HH| data |  |HH| data |  |HH|data|
+---------+  +---------+  +---------+  +---------+  +-------+
  slice 1      slice 2      slice 3      slice 4      slice 5



Question: why not put the slicing information at the end, so that dar does not have to ask for the first and then the last slice?

Slicing and archiving are approached in two independent ways. For slicing, putting the header at the end of the file would require much more complicated and heavy code, because slices have variable sizes, and headers too. It would then cost more memory and processing to manage the end of a slice: when reading data, you must not provide the header as data.

Keeping slicing and archiving as two independent classes is really necessary for dar to evolve. Without this, it would not have been so easy to create dar_xform, which is only concerned with the "sar" class (the C++ class that implements slicing). So putting the slicing information in the catalogue or terminator is really a bad idea from a long term evolution and maintenance point of view.

Maybe the sar class will one day receive a new implementation in which headers get stored at the end of slices, and overall slicing information is stored at the end of the last slice. This way, dar would not ask for the first slice before asking for the last one; in the meanwhile, you will have to provide the first slice first in any case.


EA Support & Compilation Problems

If you just want to compile DAR with EA support available, you just need the attr-x.x.x.src.tar.gz package to have the libattr library and header files installed. If you want to use EA, then you need to have EA support in your kernel.

[What follows in this chapter is becoming obsolete; you may skip it, as today EA support comes as standard in at least Linux kernels.]

I personally got some problems compiling dar with EA support, due to EA package installation problems:

when installing the EA package, the /usr/include/attr directory is not created, nor is the xattr.h file put in it. To solve this problem, create it manually, and copy xattr.h (and also attributes.h, even if it is not required by dar) into it, giving it proper permissions (world readable). These include files can be found in the "include" subdir of the xattr package; as root, type the following, replacing <package> by the path where your package has been compiled:

cd /usr/include
mkdir attr
chmod 0755 attr
cp <package>/include/xattr.h <package>/include/attributes.h attr
cd attr
chmod 0644 *


The second problem arises at link time: the static version of the library does not exist. You can build it using the following commands (after package compilation):

As previously, as root type:

cd <package>/libattr
ar r libattr.a syscalls.o libattr.o
mv libattr.a /usr/lib
chmod 0644 /usr/lib/libattr.a


dar should now be able to compile with support for EA activated.




Running DAR in background


DAR can be run in background:

dar [command-line arguments] < /dev/null &
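
If the session from which dar is launched may end before the backup completes, the classic nohup sketch applies (the log file location is arbitrary):

nohup dar -c full_backup -R / -z < /dev/null > /tmp/dar.log 2>&1 &

Redirecting standard input from /dev/null, as above, matters in both cases: dar must not wait for an answer from a terminal when running unattended.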


Running command or scripts from DAR

This concerns options -E and -F. They both receive a string as argument. Thus, if the argument must be a command with its own arguments, you have to put them between quotes so that they appear as a single string to the shell that interprets the dar command-line. For example, if you want to call

df .

[These are two words: "df" (the command) and "." its argument] then you have to use the following on DAR's command-line:

-E "df ."
or
-E 'df .'


DAR provides several substitution strings:

  • %% is replaced by a single %. Thus, if you need a % in your command line you MUST replace it by %% in the argument string of -E or -F
  • %p is replaced by the path to the slices
  • %b is replaced by the basename of the slices
  • %n is replaced by the number of the slice
  • %c is replaced by the context: "init", "operation" or "last_slice", depending on the situation (see below).
The number of the slice (%n) is either the just written slice or the next slice to be read. For example, if you make a backup (-c or -C), this is the number of the last slice completed. Else (using -t, -d, -A (with -c or -C), -l or -x), this is the number of the slice that will be required very soon. As for %c (the context), it is substituted as follows:

  • "init" when the slice is asked before the catalogue is read
  • "operation" once the catalogue is read and/or data treatment has begun.
  • "last_slice" when the last slice has been written (archive creation only)

What is the use of this feature? For example, suppose you want to burn the brand-new slices on CD as soon as they are available.

Let's build a little script for that:

%cat burner
#!/bin/tcsh -f

if("$1" == "" || "$2" == "") then
  echo "usage: $0 <filename> <number>"
  exit 1
endif

mkdir T
mv $1 T
mkisofs -o /tmp/image.iso -r -J -V "archive_$2" T
cdrecord dev=0,0 speed=8 -data /tmp/image.iso
rm /tmp/image.iso
if(! diff /mnt/cdrom/$1 T/$1) then
  exit 2
else
  rm -rf T
endif
%

This little script receives the slice filename and its number as arguments; what it does is burn a CD with the slice, then compare the resulting CD with the original. Upon failure, the script returns 2 (or 1 if the syntax is not correct on the command-line). Note that this script is only here for illustration; there are many more interesting user scripts made by several dar users. These are available in the examples part of the documentation.

One could then use it this way:

-E "./burner %p/%b.%n.dar %n"

which can make the following DAR command-line:

dar -c ~/tmp/example -z -R / usr/local -s 650M -E "./burner %p/%b.%n.dar %n" -p

First note that, as our script does not change the CD in the device, we need to pause between slices (-p option). The pause takes place after the execution of the command (-E option). Thus we could add to the script a command to send a mail or play music to inform us that the slice has been burnt. The advantage here is that we don't have to come twice per slice: once when the slice is ready, and once when the slice is burnt.

Another example:

you want to send a huge file by email. (OK, it's better to use FTP, but sometimes people think that the less you can do, the more they control you, and thus they disable many services, either by fear of the unknown or by stupidity.) So you only have mail available to transfer your data:

dar -c toto -s 2M my_huge_file -E "uuencode %b.%n.dar %b.%n.dar | mail -s 'slice %n' your@email.address ; rm %b.%n.dar ; sleep 300"

Here we make an archive with slices of 2 Megabytes, because our mail system does not allow larger emails. We save only one file: "my_huge_file" (but we could even save the whole filesystem it would also work). The command we execute each time a slice is ready is:

  1. uuencode the file and send the output by email to our address,
  2. remove the slice,
  3. wait 5 minutes, so as not to overload the mail system too much. This is also useful if you have a small mailbox, from which it takes time to retrieve mail.
Note that we did not use the %p substitution string, as the slices are saved in the current directory.

The last example concerns extraction: when the slices cannot all be present in the filesystem, you need a script or a command to fetch the slice that is about to be requested. It could use ftp, lynx, ssh, etc. I let you write the script as an exercise. :-)
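
As a starting point for this exercise, here is a minimal sketch using scp over ssh (the host "backup_server" and its /archives directory are of course hypothetical); it fetches each slice into the slice directory just before dar asks for it:

dar -x backup -E "scp backup_server:/archives/%b.%n.dar %p" -R /some/dir

The command given to -E could be extended to also remove the previously read slice, to save local disk space.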


Makefile targets (only concerns versions 1.x.x)


Here follow the user-available targets and macros to be used with make.

A target is a word that can be given as argument to make, like "all" or "clean".
A macro is a variable that can be set in the Makefile or overridden on the make command-line, this way:
make INSTALL_ROOT_DIR="/some/where/else"

Targets

default : builds dar dar_xform and dar_slave (used if no target is given)
all     : builds dar dar_xform dar_slave and test programs
depend  : rebuilds file dependencies. This modifies the Makefile
install : install dar software and manual pages as described by
          INSTALL_ROOT_DIR BIN_DIR and MAN_DIR macros
install-doc : install documentation (tutorial notes, etc.) as described by
          the INSTALL_ROOT_DIR and INSTALL_DOC_DIR macros
uninstall : remove dar software, man pages and documentation if present,
          as described by the macros used for installation
test    : only builds test programs
clean_all : remove all generated files (temporary and final files)
clean   : remove all files except the C++ generated files with the "usage"
          extension
usage   : builds the dar-help program, and generates "*.usage" files that
          contain the generated C++ code corresponding to the help text
          displayed with option -h.
clean_usage : remove C++ generated files

Macros

DAR_VERSION       : DO NOT CHANGE IT ! : get dar version from source
INSTALL_ROOT_DIR  : can be changed     : base directory for installation
BIN_DIR           : can be changed     : subdir where to store binaries
MAN_DIR           : can be changed     : subdir where to store man pages
DOC_DIR           : can be changed     : subdir where to store doc files
EA_SUPPORT        : set or unset it    : if set add support for EA
FILEOFFSET        : set or unset it    : if set, support for large files
USE_SYS_SIGLIST   : set or unset it    : if set, uses the sys_siglist vector
OS_BITS           : set or unset it    : if set, change int macros for alpha OS
CXX               : can be changed     : point to your C++ compiler
CC                : can be changed     : point to your C compiler


Scrambling



How does it work?

Take the pass phrase. It is a string, thus a sequence of bytes, thus a sequence of integers, each one between 0 and 255 (0 and 255 included). The data to "scramble" is also a sequence of bytes, usually much longer than the pass phrase. The principle is to add the pass phrase to the data, byte by byte, modulo 256. The pass phrase is repeated all along the archive. Let's take an example:

the pass phrase is "he\220lo" (where \220 is the character whose value is 220), and the data is "example"

taken from ASCII standard:
h = 104
l = 108
o = 111
e = 101
x = 120
a = 97
m = 109
p = 112

        e       x       a       m       p       l       e
        101     120     97      109     112     108     101

+       h       e       \220    l       o       h       e
        104     101     220     108     111     104     101

---------------------------------------------------------------

        205     221     317     217     223     212     202

---------------------------------------------------------------
modulo
256 :   205     221     61      217     223     212     202
        \205    \221    =       \217    \223    \212    \202


thus the data "example" will be written in the archive as "\205\221=\217\223\212\202"

This method allows decoding any portion without knowing the rest of the data. It does not consume many resources to compute, but it is terribly weak and easy to crack. Of course, the data is more difficult to retrieve without the key when the key is long. Today dar can also use strong encryption (blowfish algorithm for now) and, thanks to encryption blocks, can still avoid reading the whole archive to restore any single file.



dar_manager



dar_manager is the latest member of the dar suite of programs. Its role is to gather information about several backups, in order to easily and automatically restore the latest version of a given set of files spread over many different backups (up to 65534).

The first thing to do is to build a "database" from archives or their extracted catalogues. You may have several databases; they will be independent of each other. Each database is stored in a single (compressed) file.

When you need a particular file to be restored, using the collected information, dar_manager will call dar with the proper options to restore the file(s) from the correct archive. This is particularly useful when making differential backups: because not all files are saved in each backup (unchanged files are skipped), the last version of a given file may be located in an archive made a long time ago.

As dar_manager calls dar, it must know the path to each archive. By default, it uses the path and basename given with the -A option. But you might be feeding the database with an extracted catalogue (dar's -C option) while the real archive is stored on a CD with a different basename. You may either use the -b and -p options to change the basename and path at any time, or set a different basename and path when you add the archive, by giving extra optional arguments to the -A option.
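
For illustration, a typical session could look like this (database and archive names are only examples):

dar_manager -C my_base.dmd
dar_manager -B my_base.dmd -A /backups/full_backup
dar_manager -B my_base.dmd -A /backups/diff_backup

the first command creates an empty database, the two others add an archive (given by its path and basename) to it.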

Next point: you may need some special options to always be passed to dar; this is the purpose of the -o option.

Last point: dar_manager looks for dar using the PATH variable, but you can also specify which dar command to use (-d option), if it is not in the PATH or for security reasons.

Normal operation is to update your database(s) after each new archive has been created. When you need to restore a particular file or set of files, you will just have to call dar_manager with the -r option:

dar_manager -r file1 home/my_directory tmp/file2 ...

Likewise, when some archives get destroyed due to archive rotation, you can safely remove them from the dar_manager database.
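
Assuming the archive numbers are those shown by the database listing, this could look like:

dar_manager -B my_base.dmd -l
dar_manager -B my_base.dmd -D 3

where -l lists the archives present in the database with their numbers, and -D removes the given archive (here, number 3) from the database.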



Files' extension used

dar suite programs use several types of files:

  • slices  (dar, dar_xform, dar_slave, dar_manager)
  • configuration files (dar, dar_xform, dar_slave)
  • databases  (dar_manager)
  • user command  (dar, dar_xform, dar_slave)
  • filter list (dar's -[ and -] options)
While for slices the extension and even the filename format (basename.slicenumber.dar) cannot be customized, there is no mandatory rule for the other types of files.

In case you have no idea how to name these, here is what I use:
configuration files receive ".dcf" as extension (Dar Configuration File),
while databases receive     ".dmd" as extension (Dar Manager Database),
for user commands I propose ".duc" as extension (Dar User Command),
and for filter lists I suggest ".dfl" as extension (Dar Filter List).

but you are totally free to use the filenames you want! ;-)



dar and ssh (see also the "Dar and remote backup server" chapter above)

As reported "DrMcCoy" in the historical forum "Dar Technical Questions", the netcat program can be very helpful if you plane to backup over the network.

The context of the following examples is this: a "local" host named "flower" has to be backed up to, or restored from, a remote host called "honey" (OK, the machine names are silly...).

Example of use with netcat. Note that the netcat command name is "nc".

Creating a full backup of "flower" saved on "honey"
on honey:
nc -l -p 5000 > backup.1.dar

then on flower:
dar -c - -R / -z | nc -w 3 honey 5000

but this will produce only one slice; instead, you could use the following to get several slices on honey:

on honey:
nc -l -p 5000 | dar_xform -s 10M -S 5M -p - backup

on flower:
dar -c - -R / -z | nc -w 3 honey 5000

By the way, note that dar_xform can also launch a user script between slices, exactly the same way as dar does, thanks to the -E and -F options.

Testing the archive
Testing the archive can be done on honey, but you could also do it remotely, even if it is not very useful!

on honey:
nc -l -p 5000 | dar_slave backup | nc -l -p 5001

on flower:
nc -w 3 honey 5001 | dar -t - | nc -w 3 honey 5000

Note also that dar_slave can run a script between slices: if for example you need to load slices from a robot, this can be done automatically; or if you just want to mount/unmount a removable medium, eject or load it and ask the user to change it...

Comparing with original filesystem
on honey:
nc -l -p 5000 | dar_slave backup | nc -l -p 5001

on flower:
nc -w 3 honey 5001 | dar -d - -R / | nc -w 3 honey 5000

Making a differential backup
Here the problem is that dar needs two pipes to send orders to and read data coming from dar_slave, and a third pipe to write out the new archive. This cannot be realized with stdin and stdout alone, as previously done. Thus we will need a named pipe (created by the mkfifo command).

on honey:
nc -l -p 5000 | dar_slave backup | nc -l -p 5001
nc -l -p 5002 | dar_xform -s 10M -p - diff_backup

on flower:
mkfifo toslave
nc -w 3 honey 5000 < toslave &
nc -w 3 honey 5001 | dar -A - -o toslave -c - -R / -z | nc -w 3 honey 5002


With netcat the data goes in clear over the network. You could use ssh instead if you want encryption over the network. The principles are the same.

Example of use with ssh

Creating a full backup of "flower" saved on "honey"
We assume you have an sshd daemon on flower.
on honey:
ssh flower dar -c - -R / -z > backup.1.dar

or still on honey:
ssh flower dar -c - -R / -z | dar_xform -s 10M -S 5M -p - backup

Testing the archive
on honey:
dar -t backup

or from flower: (assuming you have a sshd daemon on honey)

ssh honey dar -t backup

Comparing with original filesystem
on flower:
mkfifo todar toslave
ssh honey dar_slave backup > todar < toslave &
dar -d - -R / -i todar -o toslave


Important. Depending on the shell you use, it may be necessary to invert the order in which "> todar" and "< toslave" are given on command line. The problem is that the shell hangs trying to open the pipes. Thanks to "/PeO" for his feedback.

or on honey:
mkfifo todar toslave
ssh flower dar -d - -R / > toslave < todar &
dar_slave -i toslave -o todar backup


Making a differential backup
on flower:
mkfifo todar toslave
ssh honey dar_slave backup > todar < toslave &

and on honey:
ssh flower dar -c - -A - -i todar -o toslave > diff_linux.1.dar
or
ssh flower dar -c - -A - -i todar -o toslave | dar_xform -s 10M -S 5M -p - diff_linux



Overflow in arithmetic integer operations


Some code explanations about the detection of overflows in integer arithmetic operations. We speak about *unsigned* integers, and we have only portable, standard ways to detect overflows when using 32-bit or 64-bit integers in place of infinint.

Written in binary, a number is a finite sequence of digits (0 or 1). To obtain the original number from the binary representation, we must multiply each digit by a power of two and sum. For example, the binary representation "101101" denotes the number N where:

N = 2^5 + 2^3 + 2^2 + 2^0

In this context, we will say that 5 is the maximum power of N (the power of the highest non-null binary digit).

For the addition "+" operation, if an overflow occurs, the result is less than both operands, so overflow is not difficult to detect. Indeed, the overflowed result is the real sum minus 2^N (where N is the number of bits of the integer field); if that result were greater than or equal to the first operand, the second operand would have to be greater than or equal to 2^N, which is impossible. Thus, in case of overflow, the resulting value is less than each of the operands. For example, with 8-bit unsigned integers, 200 + 100 gives 300 modulo 256 = 44, and 44 < 200 reveals the overflow.

For the subtraction "-" operation, there is an overflow if and only if the second operand is greater than the first (the result must be unsigned, thus positive). Detection is thus even simpler.

For the division "/" and modulo "%" operations, there is never an overflow (only division by zero is forbidden).

For the multiplication "*" operation, a heuristic has been chosen to quickly detect overflow; the drawback is that it may trigger false overflows when numbers get near the maximum possible integer value. Here is the heuristic used:

given A and B, two integers whose maximum powers are m and n respectively, we have:

A < 2^(m+1)
and
B < 2^(n+1)

thus we also have:

A.B < 2^(m+1).2^(n+1)

which is:

A.B < 2^(m+n+2)

As a consequence, we know that the maximum power of the product of A by B is at most m+n+1. While m+n+1 is less than or equal to the maximum power of the integer field, there will be no overflow; otherwise we consider that an overflow will occur, even if this may not always be the case (this is a heuristic algorithm).
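
As an illustration of the heuristic, assume 8-bit unsigned integers, for which the maximum power is 7:

A = 12 = 1100 in binary    thus m = 3
B = 10 = 1010 in binary    thus n = 3
m+n+1 = 7 <= 7             no overflow assumed; indeed A*B = 120 < 256

A = 16 = 10000 in binary   thus m = 4
B =  8 =  1000 in binary   thus n = 3
m+n+1 = 8 > 7              overflow assumed, while A*B = 128 would fit:
                           this is a false positive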


Using data protection with DAR & PAR

Parchive (PAR in the following) is a very nice program that makes it possible to recover a file that has been corrupted. It creates redundancy data stored in a separate file (or set of files), which can be used to repair the original file. This additional data may also be damaged; PAR will be able to repair the original file as well as the redundancy files, up to a certain point, of course. This point is defined by the percentage of redundancy you define for a given file. For more information, check the official PAR site here:

         http://parchive.sourceforge.net

First, as a reminder of the "Files' extension used" chapter above, dar can use several types of files:
  • DUC files (Dar User Commands) are any scripts or commands. They are bound to dar through the -E option, and are launched after slice creation or before slice reading.
  • DCF files (Dar Configuration Files) are text files that extend the command-line arguments. They must respect a well defined syntax (see 'conditional syntax' in the man page for details). These configuration files are bound to dar through the -B option.

In the following we will present two DUC files and one DCF file. All these files are distributed and normally installed under the $prefix/share/dar directory, where $prefix is /usr/local by default, or else the path given to the --prefix=... configure option.

The purpose here is to show how to use PAR with DAR. Two DUC files are available for that. They are intended to be called between slices by dar, thanks to dar's -E option. The first, "dar_par_create.duc", generates redundancy files for each slice. These files will be required later if corruption occurs on a slice. This script is expected to be used when creating an archive (-c option or -C option):

dar -c some_archive -E "/usr/local/share/dar/samples/dar_par_create.duc %p %b %n %e %c 20" ... and other options to dar

The second "dar_par_test.duc", tests the coherence of the redundancy files together with the slice they protect. If a corruption is detected, the scripts asks Parchive to repair the slice (thus this will fail on CD-ROM, as the filesystem is Read-Only, but we will see further how to fix that). This script is also expected to be used as argument to -E option, but this time, when testing an archive (dar's -t option), or "diffing" (-d option) an archive with a filesystem.

dar -t some_archive -E "/usr/local/share/dar/samples/dar_par_test.duc %p %b %n %e %c" ... and other options to dar

In both previous examples, %p %b %n %e %c are macros that dar replaces respectively by the path, basename, slice number, extension and context of the last slice created or the next slice to read, depending on the operation asked (backup or restore for example). See dar's man page and the "Running command or scripts from DAR" chapter of this file above for more. Now, to avoid having to type all these -E options, a DCF file named "dar_par.dcf" is provided. You can thus replace the -E options and arguments by

    -B /usr/local/share/dar/samples/dar_par.dcf

you could thus type

dar -c some_archive -B /usr/local/share/dar/samples/dar_par.dcf ...
dar -t some_archive -B /usr/local/share/dar/samples/dar_par.dcf ...
dar -l some_archive -B /usr/local/share/dar/samples/dar_par.dcf ...
etc.


If you plan to always use Parchive with dar, you can even add "-B /usr/local/share/dar/samples/dar_par.dcf" to your $HOME/.darrc file! This way, dar will always use Parchive (unless the -N option is given on the command-line). Here is an example:

#cat ~/.darrc
all:
-B /usr/local/share/dar/samples/dar_par.dcf


 ... and so on.


What to do with the extra data files generated by PAR?

You can put them after each slice on a CD-R, which requires you to set the slice size a few megabytes below the size of the CD-R, to have enough room left to add the extra PAR files. Note that the amount of data generated by PAR depends on the redundancy rate specified on the command-line (PAR's -r option). You can also gather the PAR data of all the slices of your archive and put them on a separate disk. That's up to you.

The problem is when the corrupted slice is on a CD-R and thus cannot be repaired in place. You then need to copy it to your hard disk and run PAR to repair it. But with the standard 'cp' command, copying the corrupted slice to disk stops at the I/O error, so you would not get the data present after the corruption, and might miss too much data for the slice to be repairable by PAR. To solve this problem, the command 'dar_cp' (which is a replacement for 'cp' and which is installed with dar) can be used. It skips over I/O errors and continues the copy with the good, readable data. This way you will only miss the corrupted data, which should most of the time be recoverable thanks to the PAR redundancy files.

Once repaired, you can burn it again, but you may rather put symbolic links in the directory where your repaired slice resides, each symbolic link pointing to one of the *other* slices on CD-R, which are not corrupted. Dar must be given a single directory from which all slices can be fetched.
 
Let's take an example. You have an archive whose basename is 'coucou', made of 183 slices of 650 MB each, except the last one which is only 459 MB.

Calling

dar -B /usr/local/share/dar/samples/dar_par.dcf -t /mnt/cdrom/coucou

shows that Parchive detected an error on slice 80 but could not repair it, as the filesystem (on CD-R) is read-only. Thus, you need to copy the slice to your hard disk to repair it:

dar_cp /mnt/cdrom/coucou.80.dar /tmp

We also need to copy the redundancy files:

dar_cp /mnt/cdrom/coucou.80.dar.par2 /tmp
dar_cp /mnt/cdrom/coucou.80.dar.vol000+992.par2 /tmp


then we repair it thanks to Parchive:

cd /tmp
par2repair coucou.80.dar


if this succeeded, you can burn the slice back to a new CD-R with all its parity files. But you can also add, beside the coucou.80.dar slice, as many symbolic links as there are other slices on CD:

cd /tmp
ln -s /mnt/cdrom/coucou.1.dar .
ln -s /mnt/cdrom/coucou.2.dar .
etc... up to (but except slice 80)
ln -s /mnt/cdrom/coucou.183.dar .

then you can restore your archive giving /tmp in place of /mnt/cdrom to dar:

dar -x /tmp/coucou -R ... etc.

dar will find all slices on CD-R (thanks to the symbolic links we created), except slice 80, which is the repaired one and which dar will find on the hard disk.


Dar User Command

Since version 1.2.0, dar users can have dar call their commands or scripts between slices, thanks to the -E and -F options. To be able to easily share your commands or scripts, I propose the following convention:

- use the ".duc" extension to show anyone the script/command respect the following
- must be called from dar with the following arguments:

example.duc %p %b %n %e %c [other optional arguments]

- it must provide brief help on what it does and what arguments it expects when called without arguments. This is the standard "usage:" convention.

Then, any user could share their "dar user commands" (i.e.: DUC files) without having to bother much about how to use them. Moreover, it would be easy to chain them:

if for example two persons created their own scripts, one "burn.duc" which burns a slice on CD-R(W) and one "par.duc" which makes a Parchive redundancy file from a slice, anybody could use both at a time, giving the following argument to dar:

-E "par.duc %p %b %n %e %c 1 ; burn.duc %p %b %n %e %c"

or since version 2.1.0 with the following argument:

-E "par.duc %p %b %n %e %c 1" -E "burn.duc %p %b %n %e %c"

Of course, a script need not use all its arguments. In the case of burn.duc for example, the %c (context) is probably useless and not used inside the script, while it is still possible to give it all the "normal" arguments of a DUC file: extra unused arguments are simply ignored.

If you have interesting DUC scripts, you are welcome to contact me by email, so that I can add them to the web site and to the following releases. For now, check the doc/samples directory for a few examples of DUC files.

Note that starting with version 2.1.0, several -B options are possible; each given command will be called in the order of appearance of the corresponding -B option.


Examples of file filtering

File filtering defines which files are saved, listed, restored, compared, tested, and so on. In brief, in the following we will say which files are elected for the operation, meaning by "operation" either a backup, a restoration, an archive contents listing, an archive comparison, etc.

File filtering is done using the following options: -X, -I, -P, -R, -[, -] or -g.

OK, let's start with some concrete examples:

dar -c toto

this will backup the current directory and everything located in it, building the toto archive, also located in the current directory. Usually you should get a warning telling you that you are about to backup the archive itself.

Now let's see something less obvious:

dar -c toto -R / -g home/ftp

the -R option tells dar to consider all files under the / root directory, while the -g "home/ftp" argument tells dar to restrict the operation to the home/ftp subdirectory of the given root directory, thus here /home/ftp.

But this is a little bit different from the following:

dar -c toto -R /home/ftp

here dar will save any file under /home/ftp without any restriction. So what is the difference? Exactly the same files will be saved as just above, but the file /home/ftp/welcome.msg, for example, will be stored as <ROOT>/welcome.msg, where <ROOT> will be replaced by the argument given to the -R option (which defaults to ".") at restoration or comparison time; while in the previous example the same file would have been stored with the path <ROOT>/home/ftp/welcome.msg.

dar -c toto -R / -P home/ftp/pub -g home/ftp -g etc

as previously, but the -P option makes all files under /home/ftp/pub not be considered for the operation. Additionally, the /etc directory and its subdirectories are saved.

dar -c toto -R / -P etc/password -g etc

here we save all of /etc except the /etc/password file. Arguments given to -P can be plain files too. But when they are directories, the exclusion applies to the directory itself and its contents. Note that using -X to exclude "password" does not have exactly the same effect:

dar -c toto -R / -X "password" -g etc

will save all the /etc directory except any file named "password". Thus, of course, /etc/password will not be saved, but if it exists, /etc/rc.d/password will not be saved either, as long as it is not a directory. Yes, if a directory /etc/rc.d/password exists, it will not be affected by the -X option. Like the -I option, the -X option does not apply to directories. The reason is to be able to filter some kinds of files without excluding particular directories; suppose for example you want to save all MP3 files and only MP3 files:

dar -c toto -R / -I "*.mp3" -I "*.MP3" home/ftp

will save any file ending in .mp3 or .MP3 under the /home/ftp directory and its subdirectories. If instead -I (or -X) applied to directories, we would only be able to recurse into subdirectories ending in ".mp3" or ".MP3". If you had a directory named "/home/ftp/Music" for example, full of mp3 files, you would not have been able to save it.

Note that glob expressions (with the wildcards '*', '?' and so on) can do much more complicated things, like "*.[mM][pP]3". You could replace the previous example by

dar -c toto -R / -I "*.[mM][pP]3" home/ftp

this would cover all .mp3, .mP3, .Mp3 and .MP3 files. One step further, the -acase option makes the filtering arguments that follow it case sensitive (which is the default), while -ano-case (-an for short) makes the filtering arguments that follow it case insensitive. In shorter form we have:

dar -c toto -R / -an -I "*.mp3" home/ftp

Last, a very complete example:

dar -c toto -R / -P "*/.mozilla/*/[Cc]ache" -X "*~" -X ".*~" -I "*.[Mm][pP][123]" -g home/ftp -g "fake"

So what?

OK, here we save everything under /home/ftp and /fake, but we do not save the contents of directories matching "*/.mozilla/*/[Cc]ache", like for example the "/home/ftp/.mozilla/ftp/abcd.slt/Cache" directory and its contents. In the directories we do save, we keep any file matching "*.[Mm][pP][123]" except those ending with a tilde ('~' character); thus, for example, files named "toto.mp3" or ".bloup.Mp2" are saved.

Now the inside algorithm (sketched in code after the list):

 a file is elected for the operation if
 1 - its name does not match any -X option, or it is a directory
*and*
 2 - if some -I options are given, the file is either a directory or matches at least one of the given -I options
*and*
 3 - its path and filename do not match any -P option
*and*
 4 - if some non-option arguments are given (building a [list of paths]), the path to the file is a member of [list of paths] or a subdirectory of one of the paths given to a -g option
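
Put as code, the unordered election could look like the following minimal sketch, assuming glob matching via fnmatch(3). All the names and signatures here are hypothetical illustrations, not libdar's actual internals; in particular, real recursion must also traverse the parent directories leading to a -g subtree, and a -P match on a directory prunes its whole subtree.

    #include <fnmatch.h>
    #include <string>
    #include <vector>

    static bool matches(const std::vector<std::string> & masks, const std::string & s)
    {
        for (const std::string & m : masks)
            if (fnmatch(m.c_str(), s.c_str(), 0) == 0)
                return true;
        return false;
    }

    // 'path' is relative to the -R root (e.g. "home/ftp/welcome.msg"),
    // 'name' is the filename alone (e.g. "welcome.msg")
    bool is_elected(const std::string & path,
                    const std::string & name,
                    bool is_dir,
                    const std::vector<std::string> & X,     // -X masks
                    const std::vector<std::string> & I,     // -I masks
                    const std::vector<std::string> & P,     // -P masks
                    const std::vector<std::string> & trees) // -g / non-option paths
    {
        if (!is_dir && matches(X, name))           // rule 1: -X never excludes directories
            return false;
        if (!is_dir && !I.empty() && !matches(I, name)) // rule 2: same for -I
            return false;
        if (matches(P, path))                      // rule 3: -P applies to the whole path
            return false;
        if (!trees.empty())                        // rule 4: restrict to the given subtrees
        {
            bool in_tree = false;
            for (const std::string & t : trees)
                if (path == t || path.rfind(t + "/", 0) == 0)
                    in_tree = true;
            if (!in_tree)
                return false;
        }
        return true;
    }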

This is the unordered method; since version 2.2.x there is also an ordered method (activated by the -am option), which gives even more power to filters. The dar man page will give you all the details.

In parallel to file filtering, you will find Extended Attributes filtering through the -u and -U options (they work the same way as -X and -I but apply to EA). You will also find file compression filtering (-Z and -Y options), which defines which files to compress or not to compress; here too they work the same way as the -X and -I options. The -ano-case / -acase options apply here as well, as does the -am option. Last, all these filterings (file, EA, compression) can also use regular expressions in place of glob expressions (thanks to the -ag / -ar options).



Strong encryption


Several cyphers are available. Keep in mind that "scrambling" is not a strong encryption cypher; all the others are.

To be able to use a strongly encrypted archive you need to know the three parameters used at creation time:
  • the cypher used (blowfish, ...)
  • the key or password used
  • the encryption block size used
No information about these parameters is stored in the generated archive. If you make an error on just one of them, you will not be able to use your archive. If you forget one of them, nobody can help you: you can just consider the data in this archive as lost. This is the drawback of strong encryption.
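
For illustration, using the option names found in recent dar versions (check your version's man page for the exact syntax), an archive created with

dar -c toto -K bf:some_password -# 20480

uses the blowfish cypher, the password "some_password" and a 20 kB encryption block; the same three values must be supplied again, identically, to read the archive.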

How is it implemented?

To avoid completely breaking the possibility to directly access a file, the archive is not encrypted as a whole (as an external program would do). The encryption is done block of data by block of data. Each block can be decrypted independently, and if you want to read some data somewhere, you need to decrypt the whole block(s) it lies in.

In consequence, the larger the block size is, the stronger the encryption is. But the larger the block size is, the longer it also takes to recover a given file, in particular when the file to restore is much smaller than the encryption block size used.

An encryption block size can range from 10 bytes to 4 GB.

If encryption is used as well as compression, compression is done first, then encryption is done on compressed data.

An "elastic buffer" is introduced at the beginning and at the end of the archive, to protect against plain text attack.  The elastic buffer size randomly varies and is defined at execution time. It is composed of random (srand()) values. Two marks characters '>' and '<' delimit the size field, which indicate the byte size of the elastic buffer. The size field is randomly placed in the buffer. Last, the buffer is encrypted with the rest of the data. Typical elastic buffer size range from 1 byte to 10 kB, for both initial and terminal elastic buffers.

Elastic buffers are also used inside encryption blocks. The underlying cypher may not be able to encrypt exactly at the requested block size boundary. If necessary, a small elastic buffer is appended to the data before encryption, to be able, at restoration time, to tell the amount of real data from the amount of noise around it.

Let's take an example with blowfish. Blowfish encrypts by multiples of 8 bytes (cypher block chaining). An elastic buffer is always added to the data of an encryption block; its minimal size is 1 byte.

Thus, if you request an encryption block of 3 bytes, these 3 bytes will be padded with an elastic buffer of 5 bytes, for these 8 bytes to be encrypted. This makes for a very poor ratio, as only 3 bytes out of 8 are significant.

If you request an encryption block of 8 bytes, as there is no room for the minimal elastic buffer of 1 byte, a second 8-byte block is used to hold the elastic buffer, so the real encryption block will be 16 bytes.

Ideally, an encryption block of 7 bytes will use 8 bytes, with 1 byte for the elastic buffer.

This overhead tends to disappear as the encryption block size grows, so it should not be a problem in normal conditions. An encryption block of 3 bytes is not a good idea for a strong encryption scheme anyway; for information, the default encryption block size is 10 kB.
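
The padding rule can be summed up by a tiny computation (an illustration of the text above, not libdar code): the requested amount of data, plus at least one elastic byte, rounded up to the next multiple of the cypher block size.

    #include <cstdio>

    // size actually encrypted for a requested encryption block,
    // with a cypher working on 8-byte boundaries (blowfish)
    unsigned long encrypted_size(unsigned long requested,
                                 unsigned long cypher_block = 8)
    {
        unsigned long padded = requested + 1; // at least 1 elastic byte
        return ((padded + cypher_block - 1) / cypher_block) * cypher_block;
    }

    int main()
    {
        std::printf("%lu\n", encrypted_size(3)); // ->  8 : 3 data + 5 elastic bytes
        std::printf("%lu\n", encrypted_size(8)); // -> 16 : a second block holds the elastic buffer
        std::printf("%lu\n", encrypted_size(7)); // ->  8 : 7 data + 1 elastic byte
        return 0;
    }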



libdar and thread-safe requirement


This is for those who plan to use libdar in their own programs.

If you plan to have only one thread using libdar, there is no problem; you will, of course, have to call one of the get_version() functions first, as usual. Things change if you intend to have several concurrent threads using the libdar library.

libdar is thread-safe under certain conditions:

Several 'configure' options have an impact on thread-safe support:

--enable-test-memory is a debug option that prevents libdar from being thread-safe, so don't use it.
--enable-special-alloc (set by default) makes a thread-safe library only if POSIX mutexes are available (pthread_mutex_t type).
--disable-thread-safe avoids looking for mutexes, so unless --disable-special-alloc is also used, the generated library will not be thread-safe.

You can check the thread-safe capability of a library thanks to the get_compile_time_feature(...) call of the API, or use the 'dar -V' command to quickly see the corresponding values, checking with 'ldd' which libdar library has been dynamically linked to dar, if applicable.

IMPORTANT:
More than ever, it is mandatory to call get_version() before any other libdar call; once this call returns, libdar is ready for thread-safe use. Note that even if its prototype has not changed, get_version() *may* now throw an exception, so use get_version_noexcept() if you don't want to manage exceptions.
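
As a minimal sketch of the required calling sequence (assuming the two-argument get_version() prototype and the dar/libdar.hpp header of that API generation; check the API reference of your version for the exact names):

    #include <dar/libdar.hpp>
    #include <thread>

    int main()
    {
        // mandatory first call: once it returns, libdar is ready for
        // concurrent use (it may throw an exception, see
        // get_version_noexcept() for the exception-free variant)
        libdar::U_I major, minor;
        libdar::get_version(major, minor);

        std::thread t1([] { /* libdar calls are allowed here... */ });
        std::thread t2([] { /* ...and concurrently here */ });
        t1.join();
        t2.join();
        return 0;
    }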

For more information about libdar and its API, check the doc/api_tutorial.html document and the API reference manual under doc/html/index.html


Dar_manager and delete files


This is for further reference and explanations.

In a dar archive, when a file has been deleted since the backup of reference (in the case of a differential archive), an entry of a special type (called "detruit") is put in the catalogue of the archive; it only contains the name of the missing file.

In a dar_manager database, to each file that has been found in one of the archives used to build the database corresponds a list of associations. These associations relate the mtime (the date of last modification of the file) to the number of the archive where the file has been found in that state.
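
Sketched as a data structure (an illustration of the description above, not dar_manager's actual code):

    #include <ctime>
    #include <map>
    #include <string>
    #include <vector>

    struct file_state
    {
        time_t   mtime;       // last modification date of that version
        unsigned archive_num; // archive holding the file in that state
    };

    // each known path maps to the list of states found across the archives
    using dm_database = std::map<std::string, std::vector<file_state>>;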

There is thus no way to record "detruit" entries in a dar_manager database, as no date is associated with them. Indeed, in a dar archive we can only notice that a file has been destroyed because it is absent from the filesystem while present in the catalogue of the archive of reference. Thus we know the file has been destroyed between the date the archive of reference was made and the date the current archive was made. Unfortunately, no date is recorded in a dar archive telling at which time it was made.

From dar_manager, inspecting a catalogue, there is thus no way to give a meaningful date to a "detruit" entry. In consequence, for a given file which has been removed, then recreated, then removed again along a series of differential backups, it is not possible to place the times when this file has been removed within the series of dates when it existed.

The ultimate consequence is that if the user asks dar_manager to restore a directory in the state it had just before a given date (-w option), it is not possible to know whether a given file existed at that time. We can effectively see that it was not present in a given archive, but as we don't know the date of that archive, we cannot determine whether it is before or after the date requested by the user. And as dar_manager is not able to restore the non-existence of a file at a given time, we must use dar directly with the archive that was made at the date we wish.

Note that having a date stored in each dar archive would not solve the problem without some more information. First, we would have to assume that the date is consistent from host to host and from time to time (what if the user changes the system time due to daylight saving or moves around the Earth, or if two users in two different places share a filesystem --- with rsync, NFS, or other means --- and do backups alternately...). Let's assume the system time is significant, and let's imagine what would happen if the date of archive construction were stored in each archive.

Then, when a "detruit" object is met in an archive, it can be given the date the archive was built, and thus be ordered within the series of dates at which the corresponding file was found in other archives. So when the user asks for the restoration of a directory, a given file's state at a given date becomes possible to know, and the restoration from the corresponding archive will do what we expect: either remove the file (if the selected backup contains a "detruit" object) or restore the file in the state it had.

Suppose now a dar_manager database built from a series of full backups. There will thus not be any "detruit" objects, but a file may be present or missing in a given archive. The solution is thus that once an archive has been integrated into the database, the last step is to scan the whole database for files that have no date associated with this last archive: we can assume these files were not present, and add the date of the archive creation together with the information that the file was removed at that time. Moreover, if the last archive adds a file which was not known in the archives already present in the database, we must consider that this file was deleted in each of these previous archives; but then we must record the creation dates of all these previous archives to be able to put this information properly in the database.

In that case, however, we would not be able to make dar remove a file, as no "detruit" object exists (all archives are full backups), and dar_manager would have to remove the entry from the filesystem itself. Beside the fact that it is not the role of dar_manager to directly interact with the filesystem, dar_manager would have to record an additional piece of information: whether a file is considered deleted because a "detruit" object was found in an archive, or because no entry was found for it in a given archive. This is necessary to know whether to rely on dar to remove the file or to make dar_manager do it itself; or, maybe better, never rely on dar to remove a file and always let dar_manager do it itself.

Assuming we accept to make dar_manager able to remove entries from the filesystem without relying on dar, we must store the creation date in each archive, and store these dates for each archive in dar_manager databases. Then, instead of using the mtime of each file, we could do something much simpler in the database: for each file, record whether it was present or not in each archive used to build the database, and beside this, store only the creation date of each archive. This way, dar_manager would only have, for each file, to take the last state of the file (deleted or present) before the given date (or the last known state if no date is given) and either restore the file from the corresponding archive or remove it.
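
Sketched as a data structure (hypothetical, since this scheme is only discussed here, not implemented):

    #include <ctime>
    #include <map>
    #include <string>
    #include <vector>

    struct simple_db
    {
        std::vector<time_t> archive_date;              // creation date of archive i
        std::map<std::string, std::vector<bool>> seen; // seen[path][i]: present in archive i?
    };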

But if a user has removed a file by accident and only notices this mistake after several backups, it would become painful to restore this file, as the user would have to find manually at which date it was present in order to feed dar_manager with the proper -w option. This is worse than looking for the last archive that contains the file we are looking for.

Here we are back to the difference between the restoration of a file and the restoration of a state. By state, I mean the state a directory tree had at a given time, like a photograph. In its original version, dar_manager was aimed at restoring files, whether or not they exist in the last archive added to a database: it simply finds the last archive where the file is present. Making dar_manager restore a state, and thus consider files that have been removed at a given date, is no more no less than restoring from a given archive, directly with dar. So, all this discussion about the fact that dar_manager is not able to handle files that have been removed, only to arrive at the fact that adding this feature to dar_manager would make it quite useless... sigh. But it was necessary.



Native Language Support / gettext / libintl


Native Language Support (NLS) is the ability of a given program to display its messages in several languages. For dar, this is implemented using the gettext tools. These tools must be installed on the system for dar to be able to display messages in another language than English.

Things are the following:
- On a system without gettext, dar will not use gettext at all. All messages will be in English (OK, maybe better said, Frenglish) ;-)
- On a system with gettext, dar will use the system's gettext, unless you use the --disable-nls option with the configure script.

If NLS is available you just have to set the LANG environment variable to your locale settings to change the language in which dar displays its messages (see ABOUT-NLS for more about the LANG variable).
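
For example, assuming a French locale is installed on your system (the exact locale name varies from system to system), running the following from a POSIX shell would display dar's messages in French:

LANG=fr_FR dar -V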

Just for information, gettext() is the name of the call that translates strings in the program. This call is implemented in the library called 'libintl' (intl for Internationalization). By translating strings, gettext makes Native Language Support (NLS) possible; in other words, it lets the messages of your preferred programs be displayed in your native language, for those who do not have English as their mother tongue.

This needed to be said, because the links between "gettext", "libintl" and "NLS" might otherwise be missed.

READ the ABOUT-NLS file at the root of the source package to learn more about the way to display dar's messages in your own language. Note that not all languages are supported yet; it is up to you to send me a translation in your language and/or contact a translating team as explained in ABOUT-NLS.

To know which languages are supported by dar, read the po/LINGUAS file and check for the presence of the corresponding *.po files in that directory.