Summary reports
Running thapbi_pict summary
The reports from the pipeline can be generated separately by the thapbi_pict summary command:
$ thapbi_pict summary -h
To mimic what the pipeline command would do, run the following:
$ thapbi_pict summary -i intermediate/ \
summary/thapbi-pict.ITS1.onebp.tsv \
-o summary/thapbi-pict.ITS1
Note that the trailing backslash (\) at the end of a line indicates that the command continues on the next line. You can type it exactly like this at a standard Linux command prompt (or include it in a copy and paste), or enter everything as one very long command.
We will look at the output in a moment, alongside the equivalent reports generated with metadata (see linked discussion about column numbers):
$ thapbi_pict summary -i intermediate/ \
summary/thapbi-pict.ITS1.onebp.tsv \
-o summary/with-metadata.ITS1 \
-t metadata.tsv -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -x 16
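Typing out fifteen column numbers by hand is error-prone. As a small convenience sketch (plain shell, not a feature of thapbi_pict itself), the comma-separated list could be built with the standard seq and paste utilities. The variable name cols is only for illustration:

```shell
# Build the comma-separated list 1,2,...,15 for the -c argument.
cols=$(seq 1 15 | paste -sd, -)
echo "$cols"
```

The resulting value could then be passed as -c "$cols" in place of the explicit list in the command above.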
Both the read report and the sample report are tables, each produced as computer-friendly plain text tab-separated values (TSV) and as human-friendly Excel (with colors and conditional formatting).
Read Report
The heart of the read report is a large table of unique sequences (ASVs, as rows) versus sequenced samples (columns), with read abundance counts. There are additional columns with sequence information and, when metadata is present, extra rows at the start with sample information.
This read report has a row for each unique sequence. The first columns are the marker name (here always “ITS1”), the unique sequence MD5 checksum, any species prediction, the sequence itself, the number of samples it was detected in above the threshold, the maximum number of reads with this sequence in any one sample, and the total number of reads (from samples where it was above the threshold). Then the main columns (one per sample) list the abundance of each unique sequence in that sample (if above the threshold).
In the Excel version, conditional formatting is used to highlight the non-zero counts with a red background. Furthermore, with metadata it will attempt to assign repeated bands of background color to groups (pink, orange, yellow, green, blue). In this example, each sample site gets a new color:

Typical sample naming schemes will place replicates in neighbouring columns, so you should see very similar patterns of red (non-zero) counts. In this dataset, scanning horizontally, some sequences clearly show presence/absence patterns consistent with the samples.
The default row sorting will result in a dominant sequence being followed by any close variants assigned to the same species. Many of these rows will represent PCR artefacts found in just one or two samples. This contributes to the “halo” effect seen in the Edit Graph representation, discussed next.
Sample Report
The heart of the sample report is a table of samples (rows) versus species predictions (columns), with read abundance counts. There are additional columns with sample read counts and, when metadata is present, extra columns at the start with sample information.
Here is a screenshot of the summary/with-metadata.ITS1.samples.onebp.xlsx file opened in Excel:

The metadata is in the first columns, then the sequence filename stem, a text summary of the species predictions, some inferred sequence count data, and then one column for each unique species or ambiguous species combination.
Using the metadata each site has one or more rows in the same background color (pink, orange, yellow, green, blue, repeated), with one row for each time it was sequenced (the per-site sampling).
The values are total read counts for that row/column, with conditional formatting applied so non-zero entries have a bright red background.
For example, the final rows are the two DNA mixture controls (blue and pink) and the negative controls (orange). These have almost no metadata, and the negative controls' read counts are all zero.
The plain text table with-metadata.ITS1.samples.onebp.tsv is the same, but without the colors and formatting. The files generated without metadata (thapbi-pict.ITS1.samples.onebp.tsv etc) lack the extra columns and the background color bands.
The files without metadata start with the FASTQ filename stem as the inferred sample name in column 1:
$ cut -f 1 summary/thapbi-pict.ITS1.samples.onebp.tsv | head
#Sequencing sample
In contrast, with metadata the 15 extra columns are inserted before this, and are used to sort the samples:
$ cut -f 1,16 summary/with-metadata.ITS1.samples.onebp.tsv | head
#Site Sequencing sample
01 Site_1_sample_1
01 Site_1_sample_2
01 Site_1_sample_3
01 Site_1_sample_4
01 Site_1_sample_5
01 Site_1_sample_6
01 Site_1_sample_7
01 Site_1_sample_8
01 Site_1_sample_9-2
Like the FASTQ filename stems, the metadata is still sorted as strings, but by using leading zeros and YYYY-MM-DD style for any dates, you can achieve a logical presentation.
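To illustrate why leading zeros matter when values are sorted as strings, here is a minimal shell sketch using the standard sort utility. The values and variable names are made up for the demonstration, and LC_ALL=C is set only to make the byte-wise ordering explicit:

```shell
# Plain string sorting puts 10 before 2 ...
unpadded=$(printf '%s\n' 1 2 10 | LC_ALL=C sort | paste -sd, -)
echo "$unpadded"

# ... whereas zero-padded values sort in the expected numeric order.
padded=$(printf '%02d\n' 1 2 10 | LC_ALL=C sort | paste -sd, -)
echo "$padded"

# Dates written as YYYY-MM-DD likewise sort chronologically as strings.
dates=$(printf '%s\n' 2019-10-01 2019-02-15 | LC_ALL=C sort | paste -sd, -)
echo "$dates"
```

The first list comes out as 1,10,2, while the padded list comes out as 01,02,10 and the dates appear in chronological order.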
After the sequencing sample name (the FASTQ filename stem), we have the classification summary as a comma-separated list, which attempts to summarise the later per-species columns. Species listed here with (*) are where sequences matched multiple species equally well. For example, Phytophthora andina, P. infestans, and P. ipomoeae share an identical ITS1 marker.
The next columns are derived from the data itself: the read counts in the samples as raw FASTQ, after read merging with Flash, and after primer trimming with Cutadapt; information about the abundance thresholds used (omitted below); the maximum ASV read count for non-spike-in or spike-in sequences; the number of singletons; the total number of reads for the accepted ASVs (i.e. passing the abundance threshold); and the number of unique ASVs accepted. It may be easier to look at this in Excel, but at the command line:
$ cut -f 16,18-20,24-28 summary/with-metadata.ITS1.samples.onebp.tsv | head
As a table:
Sequencing sample | Raw FASTQ | Flash | Cutadapt | Max non-spike | Max spike-in | Singletons | Accepted | Unique |
Site_1_sample_1   | 6136      | 5900  | 5886     | 2269          | 0            | 692        | 4180     | 8      |
Site_1_sample_2   | 6135      | 5955  | 5947     | 2532          | 0            | 671        | 4548     | 8      |
Site_1_sample_3   | 6778      | 6484  | 6470     | 2146          | 0            | 579        | 5060     | 5      |
Site_1_sample_4   | 4145      | 3984  | 3974     | 1499          | 0            | 469        | 2852     | 7      |
Site_1_sample_5   | 4722      | 4232  | 4213     | 3130          | 0            | 433        | 3130     | 1      |
Site_1_sample_6   | 12633     | 12070 | 12034    | 5864          | 0            | 1217       | 9208     | 4      |
Site_1_sample_7   | 7560      | 7170  | 7141     | 3372          | 0            | 741        | 5402     | 5      |
Site_1_sample_8   | 6324      | 5956  | 5942     | 2037          | 0            | 630        | 4524     | 5      |
Site_1_sample_9-2 | 4542      | 4335  | 4331     | 2780          | 0            | 385        | 3436     | 2      |
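Hard-coded column numbers such as those passed to cut depend on how many metadata columns were requested. As a hedged sketch, the column numbers can instead be looked up from the header line with awk. The toy file below is hypothetical stand-in data with only three columns (the header name "Classification summary" is assumed for illustration), not a real report:

```shell
# Toy three-column TSV standing in for a real sample report (hypothetical data).
tsv=$(mktemp)
printf '#Site\tSequencing sample\tClassification summary\n' > "$tsv"
printf '01\tSite_1_sample_1\tPhytophthora gonapodyides\n' >> "$tsv"

# Print every header name alongside its column number.
awk -F '\t' 'NR == 1 { for (i = 1; i <= NF; i++) print i "\t" $i }' "$tsv"

# Look up one column number by its header name, then use it with cut.
col=$(awk -F '\t' -v name="Sequencing sample" \
  'NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) print i }' "$tsv")
names=$(cut -f "$col" "$tsv")
echo "$names"

rm -f "$tsv"
```

On a real report, the first awk command gives a quick way to confirm which numbers to pass to cut -f.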
Finally, we get to the main part of the sample table, one column per classifier result, with the number of reads. Picking out some examples:
$ cut -f 16,31,41,65 summary/with-metadata.ITS1.samples.onebp.tsv | head
As a table:
Sequencing sample | Phytophthora austrocedri | Phytophthora gonapodyides | Unknown |
Site_1_sample_1   | 165                      | 1158                      | 0       |
Site_1_sample_2   | 445                      | 718                       | 101     |
Site_1_sample_3   | 0                        | 1110                      | 1313    |
Site_1_sample_4   | 204                      | 861                       | 0       |
Site_1_sample_5   | 0                        | 3130                      | 0       |
Site_1_sample_6   | 0                        | 0                         | 0       |
Site_1_sample_7   | 0                        | 902                       | 161     |
Site_1_sample_8   | 0                        | 1863                      | 116     |
Site_1_sample_9-2 | 0                        | 0                         | 656     |
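Because the TSV output is plain text, simple questions such as "in how many samples was each species seen?" can be answered with standard tools. The following is a minimal sketch using awk on the nine example rows from the table above, rebuilt here as a small temporary file; on a real report you would first select the columns of interest with cut:

```shell
# The example columns from the table above, rebuilt as a small TSV.
tsv=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  '#Sequencing sample' 'Phytophthora austrocedri' 'Phytophthora gonapodyides' 'Unknown' \
  Site_1_sample_1 165 1158 0 \
  Site_1_sample_2 445 718 101 \
  Site_1_sample_3 0 1110 1313 \
  Site_1_sample_4 204 861 0 \
  Site_1_sample_5 0 3130 0 \
  Site_1_sample_6 0 0 0 \
  Site_1_sample_7 0 902 161 \
  Site_1_sample_8 0 1863 116 \
  Site_1_sample_9-2 0 0 656 > "$tsv"

# For each species column, count the samples with a non-zero read count.
counts=$(awk -F '\t' '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
  { for (i = 2; i <= n; i++) if ($i > 0) hits[i]++ }
  END { for (i = 2; i <= n; i++) print name[i] ": " (hits[i] + 0) }
' "$tsv")
echo "$counts"

rm -f "$tsv"
```

For these rows, this reports Phytophthora austrocedri in 3 samples, Phytophthora gonapodyides in 7, and Unknown in 5.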
Generally we hope to see a single species prediction for each ASV. However, when there are conflicts, such as equally good matches or a reference sequence that is shared between species, all the matching species are reported. For example:
$ cut -f 16,35 summary/with-metadata.ITS1.samples.onebp.tsv | head
As a table:
Sequencing sample | Phytophthora chlamydospora;Phytophthora x stagnum |
Site_1_sample_1   | 0                                                 |
Site_1_sample_2   | 0                                                 |
Site_1_sample_3   | 0                                                 |
Site_1_sample_4   | 0                                                 |
Site_1_sample_5   | 0                                                 |
Site_1_sample_6   | 1217                                              |
Site_1_sample_7   | 0                                                 |
Site_1_sample_8   | 0                                                 |
Site_1_sample_9-2 | 0                                                 |
In this example, Site_1_sample_6 had sequences matching both Phytophthora chlamydospora and Phytophthora x stagnum. These species are listed with a (*) suffix in the earlier classification summary column:
$ grep Site_1_sample_6 summary/with-metadata.ITS1.samples.onebp.tsv | cut -f 16,17
Site_1_sample_6 Phytophthora castanetorum, Phytophthora chlamydospora(*), Phytophthora pseudosyringae, Phytophthora syringae, Phytophthora x stagnum(*)
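If you only want the ambiguous entries from a classification summary, the (*) suffix makes them easy to pick out. This is a small sketch using standard tools on the summary text shown above, assuming (as in that output) that entries are separated by a comma and a space; the variable names are illustrative:

```shell
# Classification summary for Site_1_sample_6, copied from the output above.
summary='Phytophthora castanetorum, Phytophthora chlamydospora(*), Phytophthora pseudosyringae, Phytophthora syringae, Phytophthora x stagnum(*)'

# Split into one species per line, trim the leading space,
# then keep only the entries carrying the (*) marker.
ambiguous=$(printf '%s\n' "$summary" | tr ',' '\n' | sed 's/^ //' | grep -F '(*)')
echo "$ambiguous"
```

Here this leaves just the two entries Phytophthora chlamydospora(*) and Phytophthora x stagnum(*).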