Edit Graph
Running thapbi_pict edit-graph
This is not run as part of the pipeline command, and must be run separately:
$ thapbi_pict edit-graph -h
...
This command does not use metadata, but can optionally use the intermediate TSV files. It requires the sample tally file:
$ thapbi_pict edit-graph -i summary/thapbi-pict.ITS1.tally.tsv \
-o summary/thapbi-pict.edit-graph.onebp.xgmml
...
This will generate an XGMML (eXtensible Graph Markup and Modeling Language) file by default, but you can also request other formats including PDF (which requires additional dependencies including GraphViz):
$ thapbi_pict edit-graph -i summary/thapbi-pict.ITS1.tally.tsv \
-o summary/thapbi-pict.edit-graph.onebp.pdf -f pdf
...
Nodes and edges
In this context, we are talking about a graph in the mathematical sense of nodes connected by edges. Our nodes are unique sequences (which we can again label by the MD5 checksum), and the edges represent how similar two sequences are. Specifically, we are using the Levenshtein edit distance. This means an edit distance of one could be a single base substitution, insertion or deletion.
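To make the distance measure concrete, here is a minimal pure-Python sketch of the Levenshtein edit distance using the standard dynamic programming recurrence. The function name `levenshtein` is chosen for illustration only; it is not claimed to be how thapbi_pict computes distances internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between strings a and b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (before the current character) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("ACGT", "AGGT"))   # a single base substitution
print(levenshtein("ACGT", "ACGGT"))  # a single base insertion
print(levenshtein("ACGT", "AGT"))    # a single base deletion
```

All three example pairs are one edit apart, so each would be joined by a one base pair edge in the graph.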
The tool starts by compiling a list of all the unique sequences in your samples (i.e. all the rows in the thapbi_pict read-summary report), and optionally all the unique sequences in the database. It then computes the edit distance between them all (this can get slow).
We build the network graph by adding edges for edits of up to three base pairs (by default). This gives small connected components or sub-graphs which are roughly at the species level.
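The all-against-all comparison and the three base pair cutoff can be sketched as follows. This is an illustrative toy, not the tool's implementation: the helper names (`build_edges`, `connected_components`) and the example sequences are made up, and a real run would use a faster distance routine than this pure-Python one.

```python
from itertools import combinations


def levenshtein(a, b):
    """Plain dynamic programming Levenshtein edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (char_a != char_b),
                               current[j - 1] + 1,
                               previous[j] + 1))
        previous = current
    return previous[-1]


def build_edges(sequences, max_edit=3):
    """Return {(i, j): distance} for all index pairs within max_edit edits."""
    edges = {}
    # Every pair is compared once, so the work grows quadratically.
    for (i, a), (j, b) in combinations(enumerate(sequences), 2):
        distance = levenshtein(a, b)
        if distance <= max_edit:
            edges[(i, j)] = distance
    return edges


def connected_components(node_count, edges):
    """Group node indices into connected components using union-find."""
    parent = list(range(node_count))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for node in range(node_count):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Toy sequences (hypothetical): three close variants and one unrelated.
sequences = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC"]
edges = build_edges(sequences)
print(edges)
print(connected_components(len(sequences), edges))
```

Here the first three sequences fall into one connected component, while the unrelated fourth sequence is left as a singleton.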
Redundant edges are dropped. For example, if A is one edit away from B, and B is one edit away from C, there is no need to draw the two-edit line from A to C.
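One way to express this pruning rule, assuming an edge is redundant whenever some intermediate node splits its distance exactly into two shorter edges, is sketched below. The function name `drop_redundant` and the dictionary layout are illustrative assumptions rather than the tool's actual code.

```python
def drop_redundant(edges):
    """Drop edge (a, c) if a node b gives dist(a, b) + dist(b, c) == dist(a, c).

    edges maps (node, node) pairs to their edit distance.
    """
    # Build a symmetric lookup: dist[a][b] is the distance of edge a-b.
    dist = {}
    for (a, b), d in edges.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d

    kept = {}
    for (a, c), d in edges.items():
        redundant = any(
            b in dist[c] and d_ab + dist[c][b] == d
            for b, d_ab in dist[a].items()
            if b != c
        )
        if not redundant:
            kept[(a, c)] = d
    return kept


# A is one edit from B, B is one edit from C, so A-C (two edits) is implied.
edges = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 2}
print(drop_redundant(edges))
```

Checking against the full edge set (rather than the already pruned one) means a three-edit edge is also dropped when it is implied by a one-edit edge plus a two-edit edge, even if that two-edit edge is itself redundant.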
We draw the nodes as circles, scaled by the number of samples that unique sequence appeared in. If that exact sequence is in the database, it is colored according to genus; sequences not in the database default to grey.
Color | RGB value | Meaning |
---|---|---|
Red | | Phytophthora |
Lime | | Peronospora |
Blue | | Hyaloperonospora |
Yellow | | Bremia |
Cyan | | Pseudoperonospora |
Magenta | | Plasmopara |
Maroon | | Nothophytophthora |
Olive | | Peronosclerospora |
Green | | Perofascia |
Purple | | Paraperonospora |
Teal | | Protobremia |
Dark red | | Other known genus |
Dark orange | | Conflicting genus |
Orange | | Synthetic sequence |
Grey | | Not in the database |
The edges are all grey, solid for a one base pair edit distance, dashed for a two base pair edit distance, and dotted for a three base pair edit distance.
Viewing the PDF
You should be able to open the PDF file easily, and get something like this - lots of red circles for Phytophthora, some grey circles for sequences not in the database, and plenty of grey straight line edges between them.

In the PDF (and XGMML) output, nodes are colored by genus (red for Phytophthora), but only labelled if in the database at species level.
The edges are all grey: solid for a one base pair edit distance, dashed for a two base pair edit distance, and dotted for a three base pair edit distance.
Viewing the XGMML
The PDF file is easy to open and interesting to look at, but it is read-only and non-interactive. This is where the XGMML output shines. You will need to install the free open source tool Cytoscape to use this.
Open Cytoscape, and from the top level menu select File, Import, Network from file..., then select summary/thapbi-pict.edit-graph.onebp.xgmml (the XGMML file created above).
You should get something like this, where initially all the nodes are drawn on top of each other:

From the top level menu select “Layout”, “Prefuse Force Directed Layout”, “Edit-distance-weight”, and you should then see something prettier. If you zoom in you should see something like this:

This time you can interact with the graph: move nodes about with the mouse, try different layouts, and view and search the attributes of the nodes and edges.
Here the nodes are labelled with the species if they were in the database at species level, or otherwise as the start of the MD5 checksum in curly brackets (so that they sort nicely). The default node colors are as in the PDF output, likewise the grey edge styles.
The node attributes include the full MD5 (so you can look up the full sequence or classification results for any node of interest), sample count, total read abundance (both numbers shown in the thapbi_pict summary reports), genus (allowing you to do your own color scheme), and species if known.
The edge attributes include Edit-distance (values 1, 2, 3 for the number of base pairs difference between sequences) and the matching Edit-distance-weight (values 3, 2, 1, used earlier for the layout, where we prioritise the small edit distance edges).
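If you want to inspect these edge attributes outside Cytoscape, XGMML is plain XML and can be read with the Python standard library. The sketch below parses a small hand-written XGMML fragment; the fragment itself, the node identifiers, and the assumption that each attribute is stored as an `att` element with `name` and `value` are illustrative, so check them against your own file before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical minimal XGMML fragment, written by hand for this example.
SAMPLE = """<graph label="toy" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="n1" label="seq1"/>
  <node id="n2" label="seq2"/>
  <node id="n3" label="seq3"/>
  <edge source="n1" target="n2">
    <att name="Edit-distance" type="integer" value="1"/>
    <att name="Edit-distance-weight" type="integer" value="3"/>
  </edge>
  <edge source="n2" target="n3">
    <att name="Edit-distance" type="integer" value="3"/>
    <att name="Edit-distance-weight" type="integer" value="1"/>
  </edge>
</graph>
"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def edge_attributes(xgmml_text):
    """Return a list of (source, target, {attribute name: value}) tuples."""
    root = ET.fromstring(xgmml_text)
    rows = []
    for element in root.iter():
        if local_name(element.tag) != "edge":
            continue
        attributes = {
            child.get("name"): child.get("value")
            for child in element
            if local_name(child.tag) == "att"
        }
        rows.append((element.get("source"), element.get("target"), attributes))
    return rows


rows = edge_attributes(SAMPLE)
distance_counts = Counter(int(atts["Edit-distance"]) for _, _, atts in rows)
print(distance_counts)

# The listed values (1, 2, 3 against 3, 2, 1) imply weight = 4 - distance.
for source, target, atts in rows:
    distance = int(atts["Edit-distance"])
    weight = int(atts["Edit-distance-weight"])
    print(source, target, distance, weight, weight == 4 - distance)
```

To try this on a real file, read its contents into a string and pass that to `edge_attributes` in place of `SAMPLE`.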
Halo effect
In this final screenshot we have zoomed in and selected all 11 nodes in the connected component centered on P. pseudosyringae (Cytoscape highlights selected nodes in yellow):

The node table view is automatically filtered to show just these nodes, and we can see that all the grey nodes appeared in only one sample each, while the P. pseudosyringae entry in the database appeared in 66 samples and the one base away P. ilicis sequence was in 6 samples.
This kind of grey-node halo around highly abundant sequences is more common when plotting larger datasets. It is consistent with PCR artefacts occurring in just one (or two) samples giving rise to (almost) unique sequences based on the template sequence.
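A rough way to flag such halos programmatically is to look for widely seen nodes whose neighbours each appear in only one or two samples. The sketch below does this on a toy graph; the function name `halo_nodes`, the thresholds `hub_min` and `halo_max`, and the node names are all hypothetical choices for illustration, with only the sample counts of 66, 6 and 1 echoing the example above.

```python
def halo_nodes(sample_counts, edges, hub_min=10, halo_max=2):
    """Map each widely seen node to its neighbours seen in few samples.

    sample_counts maps node name to the number of samples it appeared in.
    edges is an iterable of (node, node) pairs.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    halos = {}
    for node, count in sample_counts.items():
        if count < hub_min:
            continue  # not abundant enough to be treated as a hub
        halo = sorted(
            other for other in neighbours.get(node, ())
            if sample_counts[other] <= halo_max
        )
        if halo:
            halos[node] = halo
    return halos


# Toy graph: one hub, one moderately common neighbour, three rare variants.
sample_counts = {"hub": 66, "second": 6, "x1": 1, "x2": 1, "x3": 1}
edges = [("hub", "second"), ("hub", "x1"), ("hub", "x2"), ("second", "x3")]
print(halo_nodes(sample_counts, edges))
```

Suitable thresholds depend on the dataset size, so treat the defaults here as placeholders to tune rather than recommendations.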