NABEECO v0.1 supplementary README
[1st Feb 2013]

Newer versions of this text, as well as the source code, may be found at:

http://nabeeco.mpi-inf.mpg.de/


============
FILE FORMATS
============

*** NTW file ***
Text file with protein interactions. Each line represents a directed interaction.
Two entries in the opposite direction each are considered an undirected edge.
A .ntw file can store multiple networks, therefore the first word in each line
is the network name this interaction belongs to.
Protein names are truncated at the first pipe `|` character, and are case sensitive!
- This is handled in the same way for every other file type, too.

It looks like this:

Header Line
group_name protein1 protein2
group_name protein2 protein3
group_name protein2 protein1
[...]

The header line is ignored.
Anything else after the first 3 words per line is ignored.


*** SIF file ***
Text file with protein interactions. Common format for import/export with Cytoscape.
Only one network can be stored in a .sif file.

It looks like this:

protein1 interaction protein2
protein2 interaction protein3
protein42 interaction protein55
[...]

An interaction can be one of the following:
DirectedEdge
directed
d
UndirectedEdge
undirected
u

The interaction type is not case sensitive.
If the interaction type is none of the above, undirected is assumed.


*** SIGS file ***
Text file with graphlet signature vectors for one network.

It looks like this:

proteinName 1 3 4 4 1 2 3 2 4 6 6 3 2 1 3 5 3 2 1 [...]
[...]

The first word is the protein name, followed by 73 integer numbers which represent
the number of occurances of each graphlet.


*** DIST file ***
Text file with a matrix of numbers, used for precomputed BLAST E-values.

It looks like this:

3 5
1 proteinA1
1 proteinA2
1 proteinA3
2 proteinB1
2 proteinB2
2 proteinB3
2 proteinB4
2 proteinB5
0.11 0.12 0.13 0.14 0.15
0.21 0.22 0.23 0.24 0.25
0.31 0.32 0.33 0.34 0.35

Assuming two networks A (3 nodes) and B (5 nodes).
3 and 5, respectively, are the number of nodes in each group (this directly corresponds
with the --groups setting. For this matrix, it is important which network is the first one!).
The two initial numbers are followed by the protein names in the order they appear
(group 1 ~ columns, group 2 ~ rows).
The rest are the individual distances.

For example: d(proteinA2, proteinB1) = 0.21


=======================
INTERNAL CONFIG OPTIONS
=======================


These are the available parameters for --config opt=value (or -c opt=value).
If in doubt, leave these at the defaults, or grep the source code for the variable of interest.

The recongized options are listed with their default values.
Some of these have a direct equivalent command line parameter, for example:
--pop <N> = --config maxAgents=N


*** Final weights ***

* weightGED = 1
* weightGraphlets = 0
* weightNodeDist = 0
* weightPairsum = 0

These weights scale the relative influence of every scoring function
when evaluating the final score of an individual; the resulting value defines survival chance.
If a weight is zero, the corresponding function is not used for the final score.

weightNodeDist is the simple node degree distance, which is equal to the 2-graphlet signature distance.
It is automatically used instead of the graphlet score if no graphlet signatures are available,
but the graphlet score is used.

weightPairsum is all pair scores summed up (see below).


*** Pairwise weights ***

* pairWeightGED = 1
* pairWeightGraphlets = 1
* pairWeightNodeDist = 0

These are similar to the final weights, but they are applied to every pair;
the resulting value measures the difference of one specific node pair and
the probability of it being picked up and re-assigned by the EA during offspring generation.
For example, pairs with a low score are likely to be kept intact,
whereas pairs with a high score are likely to be broken and re-assigned.

It is important that the pair score and the final score use sane multipliers,
that is, offspring generation should go in a direction that will be considered
good in the evaluation phase. In other words, both scores must correlate,
otherwise the EA will work against itself and give no useful results.

pairWeightNodeDist is the same as above, but per-pair.


*** BLAST threshold ***

* thresholdBLAST = 0.000001

If the BLAST score for a given pair is below or equal to this threshold,
it is used exclusively to score this pair, otherwise a combination of
the other pairwise scores is used instead, appropriately scaled by their weights. 


** Default pair score **

* pairNullValue = 1

This is used for scores where no better value is available, i.e. when mapping a valid node against nil.
	
This value controls how "forcefully" the two graphs are aligned, in other words,
how high the penalty for mapping a node to nil is.
A value of 1 is the worst score possible, so the algorithm will try to prevent node <-> nil pairs at all costs.
Lower values will cause exceptionally bad pairs to be considered worse than mapping against nil,
so preferably hard-to-align nodes will be paired with nil and thus counted as inserted or deleted.
	
In short: a value of 1 produces an alignment as compact as possible, therefore minimizing the GED,
lower values produce more biologically meaningful alignments, and very low values align only nodes
that fit really well, thus causing the alignment to fall apart into unconnected subgraphs.


*** Data preprocessing ***

* forceUndirectedEdges = 0
If enabled, any directed edges will be transformed into undirected edges.
This is useful when the input network data are known to be undirected,
but only edges in one direction are present. Additionally, if only undirected edges exist,
a slightly faster implementation for GED score calculation is automatically chosen.

* matchSameNames = 1
This allows pre-matching proteins that, based on their name, are known to be equal.
Pre-matched pairs will never be separated, even if the other data are incomplete
and the resulting score would cause these pairs to be broken quickly.
(Names are case sensitive!)


*** GED penalties ***

* edgeAddedGEDScore = 1
* edgeRemovedGEDScore = 1
* edgeSubstitutedGEDScore = 0
* edgeFlippedGEDScore = 0.8
* edgeDirectedToUndirectedGEDScore = 0.2
* edgeUndirectedToDirectedGEDScore = 0.2
* nodeAddedGEDScore = 0
* nodeRemovedGEDScore = 0

The raw GED for a mapping is calculated from the number of removed/added/substituted/flipped
edges required to transform one graph into the other graph, given this mapping.
The raw counts are multiplied with these penalties and summed up, producing the actual GED score.
A substitution is desirable and should not contribute to the GED score;
the same is true for node insertions and deletions, because any of the latter
already cause insertion/deletion of all edges associated with that node.


*** Population settings ***

* maxAgents = 400
This controls how many new individuals are created in each offspring generation phase.
The number of new individuals in each iteration has the greatest impact on convergence
and execution speed, and some impact on memory utilization.
Each individual has to be created and later evaluated, both taking most of the runtime of the algorithm.
A value of 100 - 1000, depending on the network size, has been found to work well
(1000 for smaller networks up to 500 nodes, 200 or more for large graphs up to 10000 nodes).

* basicHealth = 100
* maxHealthDrop = 100
The first value is used as the initial health for each individual,
which is then decreased by the second value relative to it.
The worst individual will be reduced by the full amount, better ones by linearly interpolated amounts.
This setting is only useful to prolong or shorten survival times,
and has no direct influence on convergence, only on the total number
of individuals existing at any given time.


*** Miscellaneous ***

* varGroup = 0
If this is 0, G1 is aligned to G2 (the nodes of G2 are the reference mapping);
if it is 1, G2 is aligned to G1 (the nodes of G1 are the reference mapping).
The setting has a little impact on execution speed if the number of nodes
in both graphsdiffers greatly; some optimizations can be performed faster
if the reference mapping contains nodes of the larger graph.
(It is never necessary to touch this setting, unless you have a DIST file that requires
a certain network order, but you want to explicitly specify which network is permuted
and which one stays the static reference. Implementation details...)

* numThreads = 0
Number of worker threads to use.
By default the program will detect the number of available CPUs automatically
and create one thread per CPU.