8 Jan 2002 mcl version 1.00
mcl - the Amsterdam implementation of the Markov Cluster Algorithm (MCL algorithm)
mcl <-|fname>
[-I f (inflation)]
[-o str (fname)]
These two options are sufficient in 95 percent of the cases or more.
mcl <-|fname> [-I f (inflation)] [-o str (fname)] [-c f (centering)] [-p f (cutoff)] [-P n (1/cutoff)] [-S n (selection number)] [-R n (recovery number)] [-pct f (recover percentage)] [-scheme k (use preset scheme)] [--show-schemes (show preset schemes)] [-warn-pct n (prune warn percentage)] [-warn-factor n (prune warn factor)] [--rigid (pruning)] [-ae f (adaptive pruning exponent)] [-af f (adaptive pruning factor)] [--adapt (pruning)] [-nx x (track worst n)] [-ny y (track worst n)] [-v str (verbosity type on)] [-V str (verbosity type off)] [--silent (very)] [--verbose (very)] [-progress k (gauge)] [-te k (#expansion threads)] [-ti k (#inflation threads)] [--clone (when threading)] [-cloneat n (trigger)] [-t k (#threads)] [-l n (initial iteration number)] [-L n (main iteration number)] [-i f (initial inflation)] [-a f (loop weight)] [-dumpi i:j (interval i..j-1)] [-dumpm k (dump i+0..i+k..)] [-dumpstem stem (file stem)] [-dump str (type)] [-digits n (printing precision)] [--show (print matrices to screen)] [--ascii (output format)] [--binary (output format)] [--overlap (keep it)] [--expand-only (factor out computation)] [--inflate-first (rather than expand)] [-preprune n (input matrix)] [-z (show current settings)]
mcl implements the MCL algorithm, short for the Markov cluster algorithm, a cluster algorithm for graphs developed by Stijn van Dongen at the Centre for Mathematics and Computer Science in Amsterdam, the Netherlands. The algorithm simulates flow using two simple algebraic operations on matrices. The theory behind it is extensively described elsewhere (see REFERENCES). The program described here is a fast threaded implementation written by the algorithm's creator with contributions by several others. Anton Enright co-implemented threading; see the HISTORY/CREDITS section for a complete account. The implementation is used for the TRIBES project in which large numbers of proteins are clustered into families, and has become all the better from the feedback this has generated. See the APPLICABILITY section for a description of the type of graph mcl likes best, and for a qualitative assessment of its speed.
The -I f option is the main control, affecting cluster granularity. Using mcl is as simple as typing (assuming a file proteins contains a matrix/graph in mcl input format)
mcl proteins -I 2.0
The above will result in a clustering written to the file named out.mcl. The mcl input format is described in the mcxformat section. Clusterings are also stored as matrices - this is again discussed in the mcxformat section. In finding good mcl parameter settings for a particular domain, or in finding cluster structure at different levels of granularity, one typically runs mcl multiple times for varying values of f (refer to the -I option for further information).
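For example, to obtain clusterings at three levels of granularity (the output file names are illustrative):

   mcl proteins -I 1.4 -o out.14
   mcl proteins -I 2.0 -o out.20
   mcl proteins -I 4.0 -o out.40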
mcl expects a nonnegative matrix in the input file, or equivalently, a weighted (possibly directed) graph. NOTE - mcl interprets the matrix entries or graph edge weights as similarities, and it likes symmetric input graphs best. It can handle asymmetric graphs, but any node pair (i,j) for which w(i,j) is much smaller than w(j,i) or vice versa will presumably have a slightly negative effect on the clusterings output by mcl. Many such node pairs will have a distinctly negative effect, so try to make your input graphs symmetric. How your edge weights are computed may affect mcl's performance. In protein clustering, it is best to choose the negated logarithm of the BLAST probabilities (see REFERENCES).
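A minimal Python sketch of one way to symmetrize edge weights before constructing the mcl input (the edge representation and the choice of taking the maximum are illustrative; averaging w(i,j) and w(j,i) is another option):

   def symmetrize(edges):
       # edges maps ordered pairs (i, j) to nonnegative weights w(i,j).
       sym = {}
       for (i, j), w in edges.items():
           key = (min(i, j), max(i, j))
           sym[key] = max(sym.get(key, 0.0), w)
       return sym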
mcl's default parameters should make it quite fast in almost all circumstances. Taking default parameters, mcl has been used to generate good protein clusters on 133k proteins, taking 10 minutes running time on a Compaq ES40 system with four alpha EV6.7 processors.
For large graphs, there are several groups of parameters available for tuning the mcl computing process, should it be necessary. The easiest thing to do is just vary the -scheme option. This triggers different settings for the group of pruning parameters {-p/-P, -R, -S, and -pct}. The default setting corresponds to -scheme 2. There is an additional group of control parameters {--adapt, --rigid, -ae, -af}, which may be helpful in speeding up mcl. When doing multiple mcl runs for the same graph with different -I settings (for obtaining clusterings at different levels of granularity), it can be useful to factor out the first bit of computation that is common to all runs, by using the --expand-only option once and then using --inflate-first for each run in the set. Whether mcl considers a graph large depends mainly on the graph connectivity; a highly connected graph on 50,000 nodes is large to mcl (so that you might want to tune resources) whereas a sparsely connected graph on 500,000 nodes may be business as usual. If graphs are really huge, the time to read a graph from file can be shortened by converting the input graph from ascii mcl format to binary mcl format with mcxconvert.
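A hypothetical session factoring out the common expansion step might look like this (file names are illustrative, and this sketch assumes the expanded matrix is simply written with -o and fed back as input):

   mcl proteins --expand-only -o proteins.exp
   mcl proteins.exp --inflate-first -I 1.4 -o out.14
   mcl proteins.exp --inflate-first -I 2.0 -o out.20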
Two other groups of interest are the thread-related options (you can specify the number of threads to use) {-t, -te, -ti, --clone, -cloneat} and the verbosity-related options {--verbose, --silent, -v, -V}. The actual settings are shown with -z, and for graphs with at most 12 nodes or so you can view the MCL matrix iterands on screen by supplying --show (this can help build some intuition).
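For instance, on a small test graph (the file name tiny.mcl is illustrative):

   mcl tiny.mcl -I 2.0 --show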
The first argument is the input file name (see the mcxformat section for its expected format), or a single hyphen to read from stdin. The rationale is that you typically do several runs with different parameters, and in command line mode it is pleasant if you do not have to skip over an immutable parameter all the time.
In the OPTIONS section options are listed in order of importance, with related options grouped together.
The creator of this page feels that manual pages are a valuable resource, that online html documentation is also a good thing to have, and that info pages are way way ahead of their time. The NOTES section explains how this page was created.
A second option affecting cluster granularity is the -c option. It can increase granularity.
With low values for -I, like -I 1.2, you should be
prepared to use more resources in order to maintain quality of
clusterings, i.e. increase the arguments to the
-P/-S/-R options or simply increase
the argument to -scheme.
The following resource control options are discussed as a group:
After computing a new (column stochastic) matrix vector during expansion (which is matrix multiplication, in this case squaring), the vector is successively exposed to different pruning strategies. The intent of pruning is that many small entries are removed while retaining much of the stochastic mass of the original vector. After pruning, vectors are rescaled to be stochastic again. MCL iterands are theoretically known to be sparse in a weighted sense, and this manoeuvre effectively perturbs the MCL process a little in order to obtain matrices that are genuinely sparse, thus keeping the computation tractable. An example of monitoring pruning can be found in the discussion of -v pruning under the --verbose option.
mcl proceeds as follows. First, entries that are smaller than cutoff are removed, resulting in a vector with at most 1/cutoff entries. The cutoff can be supplied either by -p, or as the inverse value by -P. The latter is more intuitive, if your intuition is like mine (and the P stands for precision or pruning by the way). The cutoff just described is rigid; it is the same for all vectors. The --adapt option causes the computation of a cutoff that depends on a vector's homogeneity properties, and this option may speed up mcl considerably.
Second, if the remaining stochastic mass (i.e. the sum of all remaining entries) is less than pct/100 and the number of remaining entries is less than r (as specified by the -R flag), mcl will try to regain ground by recovering the largest discarded entries. The total number of entries is not allowed to grow larger than r. If recovery was not necessary, mcl tries to prune the vector further down to at most s entries (if applicable), as specified by the -S flag. If this results in a vector that satisfies the recovery condition then recovery is attempted, exactly as described above. The latter will not occur of course if r <= s.
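As an illustration, here is a minimal Python sketch of this prune/select/recover pipeline for a single stochastic vector (names and data representation are illustrative; the real implementation is threaded C code operating on sparse matrix columns):

   import heapq

   def prune_vector(v, cutoff, s, r, pct):
       # v maps indices to the entries of a stochastic vector.
       kept    = {i: w for i, w in v.items() if w >= cutoff}
       dropped = {i: w for i, w in v.items() if w < cutoff}

       def needs_recovery(k):
           return sum(k.values()) < pct / 100.0 and len(k) < r

       def recover(k, d):
           # regain ground: recover the largest discarded entries,
           # never letting the total grow beyond r entries.
           for i, w in heapq.nlargest(r - len(k), d.items(),
                                      key=lambda kv: kv[1]):
               k[i] = w

       if needs_recovery(kept):
           recover(kept, dropped)
       elif len(kept) > s:
           # selection: prune further down to the s largest entries.
           sel = dict(heapq.nlargest(s, kept.items(), key=lambda kv: kv[1]))
           dropped.update({i: w for i, w in kept.items() if i not in sel})
           kept = sel
           if needs_recovery(kept):
               recover(kept, dropped)

       total = sum(kept.values())
       return {i: w / total for i, w in kept.items()}   # rescale to stochastic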
The default setting is something like -P 2000 -S 500 -R 600. Check the -z flag to be sure. There is a set of precomposed settings, which can be triggered with the -scheme k option. k=2 is the default scheme; higher values for k result in costlier and more accurate computations, lower values in cheaper and less accurate ones. The schemes are listed using the --show-schemes option. It is advisable to use the -scheme option only in interactive mode, and to use the explicit expressions when doing batch processing. The reason is that there is no guarantee whatsoever that the schemes will not change between different releases. This is because the scheme options should reflect good general purpose settings, and it may become apparent that other schemes are better.
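For example, in a batch script prefer the explicit form

   mcl proteins -I 1.4 -P 2000 -S 500 -R 600 -pct 95

over the currently equivalent

   mcl proteins -I 1.4 -scheme 2 -pct 95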
Note that 'less accurate' or 'more accurate' computations may still generate the same output clusterings. Use clmdist to compare output clusterings for different resource parameters. Refer to clmdist for a discussion of this issue.
As a reminder of the existence of pruning and its importance for both
speed and accuracy, mcl reports three numbers when it is done, the
'jury marks'.
These are somewhat (but not totally) indicative of the quality of
pruning, and they are excerpts from the output produced by
-v pruning, namely the numbers listed under the nx column
for the three first iterations. Each number gives the average stochastic
mass of the nx worst instances of pruning, i.e. those vectors
for which the most mass was removed. The average is listed as a
percentage. The numbers should preferably be higher than 70. If they are
in the vicinity of 80 or 90, mcl is doing fine as far as pruning is
concerned. Choose a higher scheme if you think them too low.
For very dense graphs that do have strong cluster structure,
the jury marks can sink as low as the 30s and 40s, but the
clusterings generated by mcl may still be good.
Refer to the -v option for more information, and
note that the jury becomes friendlier when the
-nx option is increased, and harsher when it is decreased.
-warn-pct takes an integer between 0 and 100 as parameter,
-warn-factor takes a real positive number. They default to
something like 30 and 50.0. If you want to see fewer warnings, decrease
warn-pct and increase warn-factor. Set warn-factor to zero
if you want no warnings.
The --adapt behaviour only affects the first pruning stage, i.e. the computation of the first threshold (see the discussion under the -P option). It does not interfere with either selection or recovery. It is however affected by the threshold specified by the -P option. When using --adapt, you typically use the -P option as well, and you can and should use a higher value than you would without --adapt.
All that said, --adapt triggers this behaviour: Given a stochastic vector v, its mass center of order two is computed, which is the sum of the squares of its entries. The mass center of v, call it c, is strongly related to its homogeneity properties (see REFERENCES). The threshold T is computed as 1/f * pow(c, e), where e and f are the arguments to the -ae e and -af f options respectively (check -z for the respective defaults). Decreasing either e or f makes T larger. Finally, T is maxed with the rigid threshold value, which can be altered using either -p f or -P n. The latter is why you should increase the -P parameter n (so that the rigid threshold is decreased) once you switch to adaptive pruning. The adaptive threshold should be the main factor controlling pruning, with the rigid threshold acting as a safeguard that does not take over too often.
This may seem complicated, but the rules are actually quite simple, and
you may just disregard the definition of T. The usefulness of these
options will vary. If you want to speed up mcl, try adding
--adapt to your settings.
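For the curious, the rule amounts to the following Python sketch (the function name is illustrative; c, e, f, and the rigid threshold are as described above):

   def adaptive_cutoff(v, e, f, rigid):
       # mass center of order two: the sum of the squares of the entries
       c = sum(w * w for w in v.values())
       return max((1.0 / f) * c ** e, rigid)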
progress
pruning
explain
all
where all means all three previous modes.
--verbose and -v all
turn them all on, --silent and -V all
turn them all off. -v str and -V str
turn on/off the single mode str (for str
equal to one of progress, pruning, or explain).
Each verbosity mode is given its own entry below.
Pruning verbosity mode causes mcl to emit several statistics related to the pruning process, each of which is described below. Use -v explain to get explanatory headers in the output as well.
Selection and recovery
The number of selections and recoveries mcl had to perform during each
iteration is shown. It also shows the number of vectors for which the
mass after final pruning was below the fraction defined by the
-pct option as a percentage (default probably 90
or 95).
Initial and final vector footprint distributions
The distribution of the vector footprints (i.e. the number of nonzero
entries) before and after pruning is shown. This is assembled in a terse
(horrid if you will) ascii output format, looking as follows
(with some context stripped, noting that the data for three
multiplications is shown):
   ----------------------------------------------------
    mass percentages  | distribution of vec footprints|
                      |__ compose ________ prune _____|
     prune  |  final  |000 00 0       |000 00 0       |
   all ny nx|all ny nx|8532c8532c8532c|8532c8532c8532c|
   ---------.---------.---------------.---------------.
    98 88 86 98 91 86 ____0224567899@ ______02346899@
    98 89 86 98 94 91 __002456789@@@@ ______02346899@
    98 90 89 99 95 94 __002355689@@@@ ______02346789@
    ...
This particular output was generated (and truncated after three rounds of expansion and inflation) from clustering a protein graph on 9058 nodes with settings -I 1.4, -P 2000, -S 500, -R 600, and -pct 95 (which was supplied more succinctly as -scheme 2 -pct 95).
The header entries 8532c85.. with the zeroes on top indicate thresholds going from 8000, 5000, 3000, 2000, 1250, 800, all the way down to 30, 20, and 12. The character 'c' signifies the base 12.5 (for no apparent reason). The second entry '2' (after '0') signifies that roughly 20 percent of all the vectors had footprint (#nonzero entries) between 800 and 1250. Likewise, 80 percent had footprint between 500 and 800. The '0' entries signify a fraction somewhere below 5 percent, and the '@' entries signify a fraction somewhere above 95 percent.
Two columns are listed, one for the composed vector footprints (i.e. after squaring), and the other for the vector footprints right after initial pruning took place (i.e. before selection and recovery, after either adaptive or rigid pruning). This may give an idea of the soundness of the initial pruning process (overly severe, or overly mild), and the extent to which you want to apply selection and/or recovery.
Initial and final vector mass distributions
The mass averages of the pruned vectors after the first selection
stage are shown, and the mass averages of the vectors as finally
pruned, i.e. after selection and recovery. Note that the latter
corresponds to a different stage than what is shown for the vector
footprints, if either selection or recovery is turned on.
The mass averages are shown as percentages: '88' means that overall
88 percent of the stochastic mass of the matrix was kept. For both
cases, three averages are shown: the average of all vectors,
the average of the worst x cases, and the average of the worst y
cases. The values x and y default to something like x=10 and y=100;
check the -z option to be sure. They can be
changed using -nx x and -ny y.
The jury marks referred to earlier are in this particular case [86,91,94].
In the example above it is clearly seen that many entries could be
removed while retaining much of the stochastic mass. The effect of the
recovery (-R) parameter is also clear: the final averages are
higher than the initial averages, as a result of mcl undoing some
overenthusiastic pruning.
The --clone option says to give each thread its own copy of the matrix being expanded/squared. This behaviour can be further controlled using the -cloneat k option: copies are only made if the source matrix (the one to be squared) has on average at least k positive entries per vector.
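For example, to run with four threads and per-thread matrix copies:

   mcl proteins -I 2.0 -t 4 --clone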
When threading, it is best not to turn on pruning verbosity
mode if you are letting mcl run unattended, unless you want to
scrutinize its output later. This is because it makes mcl run
somewhat slower, although the difference is not dramatic.
With the invocation
mcl proteins -i 1.4 -l 2 -I 4.0
one lets expansion prevail during the first two iterations,
followed by inflation catching up (in a figurative way of writing).
This may be useful in certain cases, but this type of experiment
is certainly secondary to simply varying -I (capital eye).
att
ite
cls
Repeated use is allowed. The att option says to output a vector measuring for each node how much it is attracted to itself (which measures the extent to which nodes are situated in the core of a cluster). It is somewhat forlorn because the other mcl utilities (see the SEE ALSO section) cannot yet utilize its output. The ite option writes mcl iterands to file. The cls option writes clusterings associated with mcl iterands to file.
The -dumpstem option sets the stem for the file names of
dumped objects (default mcl). The -dumpi and -dumpm
options select which iterands are dumped.
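For example, to dump the clusterings associated with iterands 2, 3, and 4, using file stem prot:

   mcl proteins -I 1.4 -dump cls -dumpstem prot -dumpi 2:5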
This option has more than theoretical use because mcl is able to generate clusterings associated with intermediate iterands. For these clusterings, overlap is more than a theoretical possibility, and will often occur. If you specify the -L k option, mcl will output the clustering associated with the last iterand computed, and it may well contain overlap.
This option has no effect on the clusterings that are
output when using -dump cls -
the default for those is that overlap is not touched,
and this default can not yet be overridden.
Note that the -scheme parameters affect the
matrix computed with --expand-only. Other options that affect
the matrix resulting from this option:
-preprune, -c,
and -digits. The latter sets the precision
for output in native ascii format. This overloads the
-digits option, as it has a different meaning when
--expand-only is not supplied.
mcl will work very well for graphs in which the diameter of the natural clusters is not too large. The presence of many edges between different clusters is not problematic; as long as there is cluster structure, mcl will find it. It is less likely to work well for graphs with clusters (inducing subgraphs) of large diameter, e.g. grid-like graphs derived from Euclidean data. So mcl in its canonical form is certainly not fit for boundary detection or image segmentation. I experimented with a modified mcl and image segmentation in the thesis pointed to below (see REFERENCES). This was fun and not entirely unsuccessful, but not something to be pursued further.
mcl likes symmetric input graphs best, and it really dislikes graphs with node pairs (i,j) for which an arc going from i to j is present and the counter-arc from j to i is absent. Try to make your input graph symmetric. Furthermore, mcl interprets edge weights in graphs as similarities. If you are used to working with dissimilarities, you will have to convert those to similarities using some conversion formula. The most important thing is that you feel confident that the similarities are reasonable, i.e. if X is similar to Y with weight 2, and X is similar to Z with weight 200, then this should mean that the similarity of Y (to X) is negligible compared with the similarity of Z (to X).
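As a sketch, two common recipes for converting a dissimilarity d to a similarity (illustrative only; mcl does not prescribe any particular formula, and the right choice depends on your data):

   def sim_reciprocal(d):
       # large distances map to small similarities
       return 1.0 / (1.0 + d)

   def sim_flip(d, d_max):
       # linear flip; requires knowing the maximum distance d_max
       return d_max - d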
mcl has a worst-case time complexity O(N*k^2), where N is the number of nodes in the graph, and k is the maximum number of neighbours tracked during computations. k depends on the -P and -S options. If the -S option is used (which is the default setting) then k equals the value corresponding to this option. Typical values for k are in the range 500..1000. The average case is much better than the worst case though, as cluster structure itself has the effect of helping mcl's pruning schemes, certainly if the diameter of natural clusters is not large.
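For instance, with N = 100,000 nodes and k = 1,000, the worst case is on the order of 100,000 * 1,000^2 = 10^11 basic operations; in practice, pruning and the presence of cluster structure keep the actual operation count far below this bound.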
There are currently no resource nor configuration files. The mcl matrix format is described in the mcxformat section.
Currently, mcl makes no use of environment variables.
If mcl issues a diagnostic error, it will most likely be because the input matrix could not be parsed successfully. mcl tries to be helpful in describing the kind of parse error. The mcl matrix format is described in the mcxformat section.
No known bugs at this time. Please send bug reports to mcl-bugs@mdcc.cx.
The MCL algorithm was conceived in spring 1996 by the present author. The first implementation of the MCL algorithm followed that spring and summer. It was written in Perl and proved the viability of the algorithm. The implementation described here began its life in autumn 1997. The first versions of the vital matrix library were designed jointly by Stijn van Dongen and Annius Groenink in the period October 1997 - May 1999. The efficient matrix-vector multiplication routine was written by Annius. This routine is, without significant changes, still one of the cornerstones of this MCL implementation.
Since May 1999 all MCL libraries have seen much development and redesign by the present author. Matrix-matrix multiplication has been rewritten several times to take full advantage of the sparseness properties of the stochastic matrices brought forth by the MCL algorithm. This mostly concerns the issue of pruning - removal of small elements in a stochastic column in order to keep matrices sparse.
Very instructive was that around April 2001 Rob Koopman pointed out that selecting the k largest elements out of a collection of n is best done using a min-heap. This was the key to the second major rewrite (now counting three) of the MCL pruning schemes, resulting in much faster code, generally producing a more accurate computation of the MCL process.
In May 2001 Anton Enright initiated the parallelization of the mcl code and threaded inflation. From this example, Stijn threaded expansion. This was great, as the MCL data structures and operands (normal matrix multiplication and Hadamard multiplication) just beg for parallelization.
Joost van Baal set up the mcl CVS tree and packaged mcl for Debian GNU/Linux. He completely autotooled the sources, so much so that at first I found it hard to find them amidst bootstrap, aclocal.m4, depcomp, and other beauties.
Jan van der Steen shared his elegant mempool code. Philip Lijnzaad and Shawn Hoon sent useful bug reports.
mcxformat - a description of the mcl matrix format.
mcx - an interpreter for a stack language that enables interaction with the mcl matrix libraries. It can be used both from the command line and interactively, and supports a rich set of operations such as transposition, scaling, column scaling, multiplication, Hadamard powers and products, et cetera. The general aim is to provide handles for simple number and matrix arithmetic, and for graph, set, and clustering operations. The following is a very simple example of implementing and using mcl in this language.
   2.0 .i def                    # define inflation value.
   /small lm                     # load matrix in file 'small'.
   dim id add                    # add identity matrix.
   st .x def                     # make stochastic, bind to x.
   { xpn .i infl vm } .mcl def   # define one mcl iteration.
   20 .x .mcl repeat             # iterate 20 times.
   imac                          # interpret matrix as clustering.
   vm                            # view matrix (clustering).
One of the more interesting things that can be done is doing mcl runs with more complicated inflation profiles than the two-constant approach used in mcl itself.
Several other utilities come with mcl, facilitating analysis and comparison of different clusterings.
clmdist - compute the split/join distance between two partitions. The split/join distance is better suited for measuring partition similarity than the long-known equivalence mismatch coefficient. The former measures the number of node moves required to transform one partition into the other; the latter measures differences between volumes of edges of unions of complete graphs associated with partitions.
clminfo - compute a performance measure saying how well a clustering captures the edge weights of the input graph. Useful for comparing different clusterings on the same graph, best used in conjunction with clmdist, because comparing clusterings at different levels of granularity affects the interpretation of the performance measure. The latter issue is discussed in the clmdist entry.
clmmeet - compute the intersection of a set of clusterings, i.e. the largest clustering that is a subclustering of all. Useful for measuring the consistency of a set of different clusterings at supposedly different levels of granularity (in conjunction with clmdist).
clmconf - inspect local cluster structure. Computes how well nodes fit into the cluster in which they are located (for a given clustering) by looking at the (weighted) percentage of their edges going to that same cluster. Also computes the cohesiveness of a cluster, by computing and averaging the above over all nodes in the cluster.
mcxsubs - compute a submatrix of a given matrix, where row and column index sets can be specified as lists of indices combined with lists of clusters in a given clustering. Useful for inspecting local cluster structure.
mcxconvert - convert matrices from ascii mcl format to binary mcl format or vice versa.
Graph Clustering by Flow Simulation (thesis)
http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm
A cluster algorithm for graphs (technical report)
http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z
A stochastic uncoupling process for graphs (technical report)
http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z
Performance criteria for graph clustering and
Markov cluster experiments (technical report)
http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z
An efficient algorithm for large-scale detection of protein families
(preprint)
Not yet available.
This page was generated from ZOEM manual macros. Both html and roff pages can be created from the same source without having to bother with all the usual conversion problems, while keeping some level of sophistication in the typesetting. The ZOEM primitives only provide macro expansion and filter capabilities; the proof of the typesetting is in striking the macros right.