The file "analyze.cfg" is used to set up avida when it is in analysis-only mode, which can be started by running "avida -a". It is used to perform additional tests on genotypes after a run has completed.
This analysis language is basically a simple programming language. The structure of a program involves loading in genotypes in one or more "batches", and then either manipulating single batches, or doing comparisons between batches. Currently there can be up to 300 batches of genotypes, but we will eventually remove this limit.
The rest of this file describes how individual commands work, as well as some notes on other language features, like how to use variables. As a formatting guide, command arguments will be presented between brackets, such as [filename]. If that argument is mandatory, it will be in blue. If it is optional, it will be in green, and (if relevant) a default value will be listed, such as [filename="output.dat"].
There are currently six ways to load in genotypes:
Table 1: Genotype Loading Commands
LOAD_ORGANISM [filename] Load in a normal single-organism file of the type that is output from avida. These consist of lots of organismal information inside of comments, and then the full genome of the organism with one instruction per line. |
LOAD_BASE_DUMP [filename] Load in a basic dump file from avida. Each line contains a genotype sequence, but little additional information. |
LOAD_DETAIL_DUMP [filename] Load in a detail file. These are similar to the basic dump files, but contain a lot more information on each line. These files are saved from avida typically beginning with the word "detail" or "historic". |
LOAD [filename] Load in a file. This command can be used in place of LOAD_BASE_DUMP and LOAD_DETAIL_DUMP, since it automatically recognizes the type of the file being loaded. |
LOAD_SEQUENCE [sequence] Load in a user-provided sequence as the genotype. Avida has a symbol associated with each instruction; this command is simply followed by a sequence of such symbols that is then translated back into a proper genotype. |
LOAD_MULTI_DETAIL [start-UD] [step-UD] [stop-UD] [dir='./'] [start batch=0] Allows the user to load in multiple detail files at once, one per batch. This is helpful when you're trying to do parallel analysis on many detail files, or else to create a phylogenetic depth map. Example: LOAD_MULTI_DETAIL 100 100 100000 ../my_run/run100/ This would load in the files detail_pop.100 through detail_pop.100000 in steps of 100, from the directory of my choosing. Since 1000 files will be loaded and we didn't specify starting batch, they will be put in batches 0 through 999. |
A future addition to this list is a command that will use the "dominant.dat" file to identify all of the dominant genotypes from a run, and then look up and load their individual genomes from the genebank directory. Also, the commands LOAD_BASE_DUMP and LOAD_DETAIL_DUMP currently require fixed-format files. New output files from avida have tags listed for their column names, and as such we already have a working prototype of a generic "LOAD" command that will figure out the file format and be able to load in all of the data properly.
All of the load commands place the new genotypes into the "current" batch, which can be set with the "SET_BATCH" command. Below is the list of control functions that allow you to manipulate the batches.
Table 2: Batch Control Commands
SET_BATCH [id] Set the batch that is currently active; the initial active batch at the start of a program is 0. |
NAME_BATCH [name] Attach a name to the current batch. Some of the printing methods will print data from multiple batches, and we want the data from each batch to be attached to a meaningful identifier. |
PURGE_BATCH [id=current] Remove all genotypes in the specified batch (if no argument is given, the current batch is purged). |
DUPLICATE [id1] [id2=current] Copy the genotypes from batch id1 into id2. By default, copy id1 into the current batch. Note that DUPLICATE does not clear the target batch, so you should purge the target batch first if you don't want to just add more genotypes to the ones already in that batch. |
STATUS Print out (to the screen) the genotype count of each non-empty batch and identify the currently active batch. |
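As a minimal sketch of how the loading and batch control commands fit together, the fragment below loads two detail files into separate named batches and then makes a working copy of the first. The file names are hypothetical, and the '#' comment lines assume the comment convention used by avida's other configuration files.

```
# Load an early population into batch 0 (file name is hypothetical)
SET_BATCH 0
LOAD_DETAIL_DUMP detail_pop.1000
NAME_BATCH early

# Load a late population into batch 1 (file name is hypothetical)
SET_BATCH 1
LOAD_DETAIL_DUMP detail_pop.100000
NAME_BATCH late

# Make a working copy of batch 0 in batch 2, clearing batch 2 first
PURGE_BATCH 2
DUPLICATE 0 2

# Report the genotype count of each non-empty batch
STATUS
```

Purging batch 2 before the DUPLICATE is only needed if that batch might already hold genotypes; it is included here to show the pattern.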
There are several other commands that will allow you to interact with the analysis mode in some very important ways, but don't actually trigger any analysis tests or output. Below is a list of some of the more important control commands.
Table 3: More Analysis Control Commands
VERBOSE Toggle verbose/minimal messages. Verbose messages will print all of the details of what is happening to the screen. Minimal messages will only briefly state the process being run. Verbose messages are recommended if you're in interactive mode. |
SYSTEM [command] Run the command listed on the command line. This is particularly useful if you need to unzip files before you can use them, or if you want to delete files no longer in use. |
INCLUDE [filename] Include another file into this one and run its contents immediately. This is useful if you have some pre-written routines that you want to have available in several analysis files. Watch out because there are currently no protections against circular includes. |
INTERACTIVE Place Avida analysis into interactive mode so that you can type commands and have them immediately acted upon. You can place this anywhere within the analyze file, so that you can have some processing done before interactive mode starts. You can type "quit" at any point to continue with the normal processing of the file. |
DEBUG [message] ECHO [message] These are both "echo" commands that will print a message (the arguments given) onto the screen. If there are any variables (see below) in the message, they will be translated before printing, so this is a good way of debugging your programs. |
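The control commands above might be combined at the top of an analyze file as in the following sketch. The compressed file name and the included file name are hypothetical, "gunzip" is assumed to be available on the system, and the '#' comment lines assume the usual avida configuration comment convention.

```
# Toggle the message level (the initial setting is not assumed here)
VERBOSE

# Unzip a saved population before loading it (hypothetical file name)
SYSTEM gunzip detail_pop.100000.gz

# Pull in shared routines from another file (hypothetical file name)
INCLUDE my_functions.cfg

ECHO Setup complete, entering interactive mode

# Hand control to the user; typing "quit" resumes the rest of the file
INTERACTIVE
```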
Now that we know how to interact with analysis mode and load in genotypes, it is important to be able to manipulate them. The next batch of commands will do basic analysis on genotypes, and allow the user to prune batches to only include those genotypes that are needed.
Table 4: Genotype Manipulation Commands
RECALCULATE [use_resources=0] Run all of the genotypes in the current batch through a test CPU and record the measurements taken (fitness, gestation time, etc.). This overrides any values that may have been loaded in with the genotypes. The use_resources flag signifies whether or not the test CPU will use resources when it runs. For more information on resources, see the summary below. |
FIND_GENOTYPE [type="num_cpus" ...] Remove all genotypes but the one selected. Type indicates which genotype to choose. Options available for type are "num_cpus" (to choose the genotype with the maximum organismal abundance at time of printing), "total_cpus" (number of organisms ever of this genotype), "fitness", or "merit". If the type entered is numerical, it is used as an id number to indicate the desired genotype (if no such id exists, a warning will be given). Multiple arguments can be given to this command, in which case all those genotypes in that list will be preserved and the remainder deleted. |
FIND_ORGANISM [random] Picks out a random organism from the population and removes all others. It is different from FIND_GENOTYPE because it takes into account the relative number of organisms within each genotype. To pick more than one organism, list the word 'random' multiple times. This is essentially sampling without replacement from the population. |
FIND_LINEAGE [type="num_cpus"] Delete everything except the lineage from the chosen genotype back to the most distant ancestor available. This command will only function properly if parental information was loaded in with the genotypes. Type is the same as the FIND_GENOTYPE command. |
FIND_SEX_LINEAGE [type="num_cpus"] [parent_method="rec_region_size"] Delete everything except the lineage from the chosen genotype back to the most distant ancestor available. Similar to FIND_LINEAGE but works in sexual populations. To simplify things, only the maternal lineage plus immediate fathers are saved, i.e. information about the father's parents is discarded. The second option, parent_method, determines which parent is considered the "mother" in each particular recombination. If parent_method is "rec_region_size" (the default), the "mother" is the parent contributing more code to the offspring genome; if it is "genome_size", the "mother" is the parent with the longer genome, no matter how much of it was contributed to the offspring. This command will only function properly if parental information was loaded in with the genotypes. Type is the same as the FIND_GENOTYPE command. |
ALIGN Create an alignment of all the genome sequences; it will place '_'s in the sequences to show the alignment. Note that a "FIND_LINEAGE" must first be run on the batch in order for the alignment to be possible. |
SAMPLE_ORGANISMS [fraction] [test_viable=0] Keep only "fraction" of organisms in the current batch. This is done per organism, not per genotype. Thus, genotypes of high abundance may only have their abundance lowered, while genotypes of abundance 1 will either stay or be removed entirely. If test_viable is set to 1, sample only from the viable organisms. |
SAMPLE_GENOTYPES [fraction] [test_viable=0] Keep only "fraction" of genotypes in the current batch. If test_viable is set to 1, sample only from the viable genotypes. |
RENAME [start_id=0] Change the id numbers of all the genotypes to start at a given value. Often in long runs we will be dealing with ID's in the millions. In particular, after reducing a batch to a lineage, we will often want to number the genotypes in order from the ancestor to the final one. |
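A typical use of the manipulation commands is to reduce a population to the lineage of its most abundant genotype. The sketch below follows the ordering constraints stated in the table (FIND_LINEAGE needs parental information, and ALIGN needs a prior FIND_LINEAGE). The file names are hypothetical, loading both a "detail" and a "historic" file is an assumption about where the parental information comes from, and the '#' comment lines assume the usual avida configuration comment convention.

```
# Load the current population and its ancestors (hypothetical file names)
LOAD_DETAIL_DUMP detail_pop.100000
LOAD_DETAIL_DUMP historic_dump.100000

# Keep only the line of descent leading to the most abundant genotype
FIND_LINEAGE num_cpus

# Re-measure fitness, gestation time, etc. in a test CPU
RECALCULATE

# Align the sequences along the lineage
ALIGN

# Renumber the remaining genotypes starting from 0
RENAME 0
```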
Next, we are going to look at the standard output commands that will be used to save information generated in analyze mode.
Table 5: Basic Output Commands
PRINT [dir="genebank/"] [filename] Print the genotypes from the current batch as individual files (one genotype per file) in the directory given. If no filename is specified, the files will be named by the genotype name, with a ".gen" appended to them. Specifying the filename is useful when printing a single genotype. |
TRACE [dir="genebank/"] [use_resources=0] Trace all of the genotypes and print a listing of their execution. This will show step-by-step the status of all of the CPU components and the genome during the course of the execution. The filename used for each trace will be the genotype's name with a ".trace" appended. The use_resources flag signifies whether or not the test CPU will use resources when it runs. For more information on resources, see the summary below. |
PRINT_TASKS [file="tasks.dat"] This will print out the tasks doable by each genotype, one per line in the output file specified. Note that this information must either have been loaded in, or a RECALCULATE must have been run to collect it. |
DETAIL [file="detail.dat"] [format ...] Print out all of the stats for each genotype, one per line. The format indicates the layout of columns in the file. If the filename specified ends in ".html", html formatting will be used instead of plain text. For the format, see the section on "Output Formats" below. |
DETAIL_TIMELINE [file="detail_timeline.dat"] [time_step=100] [max_time=100000] Details a time-sequence of dump files. |
DETAIL_BATCHES [file="detail_batches.dat"] [format ...] Details all batches. |
DETAIL_INDEX [file] [min_batch] [max_batch] [format ...] Detail all the batches between min_batch and max_batch. |
DETAIL_AVERAGE [file="detail.dat"] [format ...] Detail the current batch, but print out the average for each argument, as opposed to the individual values for each genotype, the way DETAIL would. Arguments are the same as for DETAIL. It takes into account the relative abundance of each genotype in the batch when calculating the averages. |
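The output commands can be sketched on a single genotype as follows. The input and output file names are hypothetical, and the '#' comment lines assume the usual avida configuration comment convention. RECALCULATE is run first because PRINT_TASKS needs task information that may not have been loaded with the genotypes.

```
# Reduce the batch to its most abundant genotype (hypothetical file name)
LOAD_DETAIL_DUMP detail_pop.100000
FIND_GENOTYPE num_cpus
RECALCULATE

# Save the genome under an explicit name instead of the default ".gen" name
PRINT genebank/ dominant.gen

# Write a step-by-step execution trace into the same directory
TRACE genebank/

# List the tasks this genotype performs
PRINT_TASKS dominant_tasks.dat
```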
And at last, we have the actual analysis commands that perform tests on the data and output the results.
Table 6: Analysis Commands
LANDSCAPE [file="landscape.dat"] [dist=1] [num_test=(all)] For each genotype in the current batch, test all possible mutations (or combinations of mutations if dist > 1) and summarize the results, one per line in the specified file. If num_test is specified, randomly pick that many mutants to test, instead of testing all (important for large dist). |
ANALYZE_EPISTASIS [file="epistasis.dat"] [num_test=(all)] For each genotype in the current batch, test possible double mutants and the single mutations composing them; print both the individual relative fitnesses and the double mutant relative fitness. By default all double mutants are tested. If in a hurry, specify the number to be tested (num_test). |
MAP_TASKS [dir="phenotype/"] [flags ...] [format ...] Construct a genotype-phenotype array for each genotype in the current batch. The format is the list of stats that you want to include as columns in the array. Additionally you can have special format flags; the possible flags are "html" to print output in HTML format, and "link_maps" to create html links between consecutive genotypes in a lineage. |
MAP_MUTATIONS [dir="mutations/"] [flags ...] Construct a genome-mutation array for each genotype in the current batch. Each row of the chart corresponds to a line in the genome, and each column to one of the available instructions. Each cell gives the fitness that would result if the site in that row were mutated to the instruction in that column. If the "html" flag is used, the charts will be output in HTML format. |
MAP_DEPTH [filename='depth_map.dat'] [min_batch=0] [max_batch=cur_batch-1] This will create a depth map (like those we use for phylogeny visualization) in the filename specified. You can direct which batches to take this from, but by default it will work perfectly after a LOAD_MULTI_DETAIL. |
AVERAGE_MODULARITY [file="modularity.dat"] [task.0 task.1 task.2 task.3 task.4 task.5 task.6 task.7 task.8] Calculate several modularity measures, such as how many tasks an instruction is involved in, the number of sites required for each task, etc. The measures are averaged over all the organisms in the current batch that perform any tasks. For the full output list, do "AVERAGE_MODULARITY legend.dat". At the moment this command does not support html output format and works only with 1- and 2-input tasks. |
HAMMING [file="hamming.dat"] [b1=current] [b2=b1] Calculate the hamming distance between batches b1 and b2. If only one batch is given, calculations are on all pairs within that batch. |
LEVENSTEIN [file="lev.dat"] [b1] [b2=b1] Calculate the Levenshtein distance (edit distance) between batches b1 and b2. This metric is similar to hamming distance, but calculates the minimum number of single insertions, deletions, and mutations needed to move from one sequence to the other. |
SPECIES [file="species.dat"] [batch1] [batch2] [num_recombinants] Calculates the percentage of non-viable recombinants between all pairs of organisms from batches 1 and 2. The number of random recombination events for each pair of organisms is specified by num_recombinants. Recombination is done in the same way as in the birth chamber when divide-sex is executed. Output: Batch1Name Batch2Name AveDistance Count FailCount |
RECOMBINE [batch1] [batch2] [batch3] [num_recombinants] Similar to the SPECIES command, but instead of calculating things on the spot, it just creates all the recombinant genotypes using organisms from batches 1 and 2 and puts them in batch 3. |
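The following sketch shows how a few of the analysis commands might be chained: it compares two populations by sequence distance and then examines the one-step mutational neighborhood of a single genotype. The input file names are hypothetical, and the '#' comment lines assume the usual avida configuration comment convention.

```
# Load two populations into batches 0 and 1 (hypothetical file names)
SET_BATCH 0
LOAD_DETAIL_DUMP detail_pop.1000
SET_BATCH 1
LOAD_DETAIL_DUMP detail_pop.100000

# Distances between batch 0 and batch 1
HAMMING hamming.dat 0 1
LEVENSTEIN lev.dat 0 1

# Reduce batch 1 (still the current batch) to its most abundant genotype
# and test all of its single mutations
FIND_GENOTYPE num_cpus
LANDSCAPE landscape.dat 1
```

Reducing the batch before LANDSCAPE keeps the run short, since the command tests every genotype in the current batch.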
This summary is given to help explain the use and constraints for using resources.
When a command specifies the use of resources for the test CPU, it should not affect the state of the test CPU after the command has finished. However, this means that the test CPU is no longer guaranteed to be reentrant. Each command will set up the environment and the resource count in the test CPU with its own environment and resource count. When the command has finished, it will set the test CPU's environment and resource count back to what they were before the command was executed.
Resource usage for the test CPU occurs by setting the environment and then setting up the resource count using the environment. Once the resource count has been set up, it will not change during the use of the test CPU. When an organism performs an IO, completing a task, the concentrations are not changed. This was a design decision, but is easily changed.
In analyze mode, a new data structure was included which contains a time-ordered list of resource concentrations. This list can be used to set up resources from different time points. By using the FillResources function, you can have the resource library updated with resource concentrations from the time point closest to the user-specified time point. If the LOAD_RESOURCES command is not called, the list defaults to a single entry, which is the initial concentrations of the resources specified in the environment configuration file.
Table 7: Test CPU Resource Related Commands
PRINT_TEST_CPU_RESOURCES This command first prints whether or not the test CPU is using resources. Then it will print the concentration for each resource. |
LOAD_RESOURCES [file_name="resource.dat"] This command loads a time oriented list of resource concentrations. The command takes a file name containing this type of data, and defaults to resource.dat. The format of the file must be the same as resource.dat, and each line must be in the correct chronological order with oldest first. |
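A minimal sketch of a resource-aware recalculation, using only the commands described above, might look like this. The population file name is hypothetical, "resource.dat" is the documented default, and the '#' comment lines assume the usual avida configuration comment convention.

```
# Load the time-ordered resource concentrations saved during the run
LOAD_RESOURCES resource.dat

# Load a population (hypothetical file name)
LOAD_DETAIL_DUMP detail_pop.100000

# Re-measure the genotypes with use_resources set to 1
RECALCULATE 1

# Show whether the test CPU is using resources, and the concentrations
PRINT_TEST_CPU_RESOURCES
```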
Several commands (such as DETAIL and MAP_TASKS) require format parameters to specify what genotypic features should be output. Before such commands are used, other collection functions may need to be run.
Allowable formats after a normal load (assuming these values were available from the input file to be loaded in) are:
id (Genome ID) | parent_id (Parent ID) | num_cpus (Number of CPUs) |
total_cpus (Total CPUs Ever) | length (Genome Length) | update_born (Update Born) |
update_dead (Update Dead) | depth (Tree Depth) | sequence (Genome Sequence) |
After a RECALCULATE, the additional formats become available:
viable (Is Viable [0/1]) | copy_length (Copied Length) | exe_length (Executed Length) |
merit (Merit) | comp_merit (Computational Merit) | gest_time (Gestation Time) |
efficiency (Replication Efficiency) | fitness (Fitness) | div_type (Divide type used; 1 is default) |
task.n (# of times task number n is done) | task.n:binary (is task n done, 0/1) |
If a FIND_LINEAGE was done before the RECALCULATE, the parent genotype for each regular genotype will be available, enabling the additional formats:
parent_dist (Parent Distance) | comp_merit_ratio (Computational Merit Ratio with parent) |
efficiency_ratio (Replication Efficiency Ratio with parent) | fitness_ratio (Fitness Ratio with parent) |
parent_muts (Mutations from Parent) | html.sequence (Genome Sequence in Color; html format) |
Finally, if an ALIGN is run, one additional format is available: alignment (Aligned Sequence)
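To illustrate how the format keywords are passed, the sketch below prints a few of the formats listed above with DETAIL and DETAIL_AVERAGE. The input and output file names are hypothetical, and the '#' comment lines assume the usual avida configuration comment convention. RECALCULATE is run first so that formats such as fitness, gest_time, and task.n are available.

```
# Load a population (hypothetical file name) and measure it
LOAD_DETAIL_DUMP detail_pop.100000
RECALCULATE

# One line per genotype, plain text
DETAIL genotypes.dat id num_cpus length fitness task.0:binary

# Same idea as an HTML table, selected by the ".html" extension
DETAIL genotypes.html id fitness sequence

# Abundance-weighted averages over the whole batch
DETAIL_AVERAGE averages.dat length fitness gest_time
```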
For the moment, all variables can only be a single character (letter or number) and begin with a $ whenever they need to be translated to their value. Lowercase letters are global variables, capital letters are local to a function (described later), and numbers are arguments to a function. A $$ will act as a single dollar sign, if needed.
Table 8: Variable-Related Commands
SET [variable] [value] Sets the variable to the value. |
FOREACH [variable] [value] [value ...] Set the variable to each of the values listed, and run the code that follows between here and the next END command once for each of those values. |
FORRANGE [variable] [min_value] [max_value] [step_value=1] Set the variable to each of the values between min and max (at steps given), and run the code that follows between here and the next END command, once for each of those values. |
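As a sketch of how variables and loops combine, the fragment below processes a series of detail files, stepping the update number from 1000 toward 5000 in steps of 1000. The directory and file names are hypothetical, and the '#' comment lines assume the usual avida configuration comment convention. Both variables are lowercase single characters, so they are global.

```
# Directory holding the saved populations (hypothetical path)
SET d ../my_run

FORRANGE u 1000 5000 1000
  # Start each pass with an empty current batch
  PURGE_BATCH
  LOAD_DETAIL_DUMP $d/detail_pop.$u
  RECALCULATE
  DETAIL detail.$u.dat id length fitness
  ECHO Finished update $u
END
```

Because variables are a single character, "$u" is translated on its own and the following ".dat" is left as literal text.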
These functions are currently very primitive with fixed inputs of $0 through $9. $0 is always the function name, and then there can be up to 9 other arguments passed through. Once a function is created, it can be run just like any other command.
Table 9: Function-Related Commands
FUNCTION [name] This will create a function of the given name, including in it all of the commands up until an END is found. These commands will be bound to the function, but are not executed until the function is run as a command. Inside the function, the variables $1 through $9 can be used to access arguments passed in. |
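A function can wrap a sequence like the ones above so that it can be reused with different arguments. In this sketch the function name DOMINANT_DETAIL and all file names are hypothetical, and the '#' comment lines assume the usual avida configuration comment convention. $1 and $2 are the first and second arguments passed when the function is called.

```
FUNCTION DOMINANT_DETAIL
  # $1 = input detail file, $2 = output file
  PURGE_BATCH
  LOAD_DETAIL_DUMP $1
  FIND_GENOTYPE num_cpus
  RECALCULATE
  DETAIL $2 id length fitness
END

# Run the function like any other command
DOMINANT_DETAIL detail_pop.1000 dominant_1000.dat
DOMINANT_DETAIL detail_pop.2000 dominant_2000.dat
```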
Currently there are no conditionals or mathematical commands in this scripting language. These are both planned for the future.