Summary reports

Running thapbi_pict summary

The reports from the pipeline can be generated separately by the thapbi_pict summary command:

$ thapbi_pict summary -h
...

To mimic what the pipeline command would do, run the following:

$ thapbi_pict summary -i intermediate/ \
  summary/thapbi-pict.ITS1.onebp.tsv \
  -o summary/thapbi-pict.ITS1
...

Note the trailing backslash \ at the end of a line indicates that the command continues on the next line. You can type this exactly as shown at a standard Linux command prompt (or include it in a copy and paste), or enter the whole thing as one very long command.

We will look at the output in a moment, alongside the equivalent reports generated with metadata (see linked discussion about column numbers):

$ thapbi_pict summary -i intermediate/ \
  summary/thapbi-pict.ITS1.onebp.tsv \
  -o summary/with-metadata.ITS1 \
  -t metadata.tsv -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -x 16
...
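
Typing out fifteen column numbers by hand is error-prone. As a minimal sketch (Python is used here purely for illustration, and the variable names are hypothetical), the same command can be assembled programmatically using only the options shown above. Nothing is executed; the command is only built and printed.

```python
import shlex

# Metadata columns 1 to 15, formatted as the comma-separated list given to -c
metadata_columns = ",".join(str(i) for i in range(1, 16))

# Only the options shown in the command above are used here.
command = [
    "thapbi_pict", "summary",
    "-i", "intermediate/", "summary/thapbi-pict.ITS1.onebp.tsv",
    "-o", "summary/with-metadata.ITS1",
    "-t", "metadata.tsv",
    "-c", metadata_columns,
    "-x", "16",
]

# Print a shell-quoted, single-line version ready to paste at the prompt
print(shlex.join(command))
```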

Both the read report and sample report are tables, each produced as computer-friendly plain text tab-separated values (TSV) and as human-friendly Excel (with colors and conditional formatting).

Read Report

The heart of the read report is a large table of unique sequences (ASVs, as rows) versus sequenced samples (as columns), with read abundance counts. There are additional columns with sequence information and, when metadata is present, extra rows at the start with sample information.

This read report has a row for each unique sequence. The first columns are the marker name (here always “ITS1”), the unique sequence MD5 checksum, any species prediction, the sequence itself, the number of samples it was detected in above the threshold, the maximum number of reads with this sequence in any one sample, and the total number of reads (from samples where it was above the threshold). Then the main columns (one per sample) list the abundance of each unique sequence in that sample (if above the threshold).
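
The MD5 checksum column gives each unique sequence a short, fixed-length identifier. The sketch below shows how such a checksum can be computed with Python's standard library. It assumes the sequence is upper-cased before hashing, which is an assumption made here for illustration rather than something stated above; the function name and the example sequence are hypothetical.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """Return the hexadecimal MD5 digest of a nucleotide sequence.

    Assumption: the sequence is upper-cased before hashing, so that
    the checksum does not depend on the case used in the input.
    """
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


# Hypothetical short sequence, for illustration only
print(sequence_md5("ACGT"))
```

Identical sequences always give the same 32-character checksum, which makes the value convenient for matching rows between reports.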

In the Excel version, conditional formatting is used to highlight the non-zero counts with a red background. Furthermore, with metadata it will attempt to assign repeated bands of background color to groups (pink, orange, yellow, green, blue). In this example, each sample site gets a new color:

Screenshot of Excel showing summary/with-metadata.ITS1.reads.onebp.xlsx file.

Typical sample naming schemes will result in replicates appearing as neighbouring columns, meaning you should see very similar patterns of red (non-zero) in adjacent columns. Scanning horizontally in this dataset, some sequences do show clear presence/absence patterns consistent with the sample groupings.

The default row sorting will result in a dominant sequence being followed by any close variants assigned to the same species. Many of these rows will represent PCR artefacts found in just one or two samples. This contributes to the “halo” effect seen in the Edit Graph representation, discussed next.

Sample Report

The heart of the sample report is a table of samples (rows) versus species predictions (columns), with read abundance counts. There are additional columns with sample read counts and, when metadata is present, extra columns at the start with sample information.

Here is a screenshot of the summary/with-metadata.ITS1.samples.onebp.xlsx file opened in Excel:

Excel screenshot showing with-metadata.ITS1.samples.onebp.xlsx

The metadata is in the first columns, then the sequence filename stem, a text summary of the species predictions, some inferred sequence count data, and then one column for each unique species or ambiguous species combination.

Using the metadata, each site has one or more rows in the same background color (pink, orange, yellow, green, blue, repeated), with one row for each time it was sequenced (the per-site sampling).

The values are total read counts for that row/column, with conditional formatting applied so non-zero entries have a bright red background.

For example, the final rows are the two DNA mixture controls (blue and pink) and the negative controls (orange). These have almost no metadata, and the negative controls' read counts are all zero.

The plain text table with-metadata.ITS1.samples.onebp.tsv is the same, but without the colors and formatting. The files generated without metadata (thapbi-pict.ITS1.samples.onebp.xlsx etc) lack the extra columns and the background color bands.

The files without metadata start with the FASTQ filename stem as the inferred sample name in column 1:

$ cut -f 1 summary/thapbi-pict.ITS1.samples.onebp.tsv | head
#Sequencing sample
DNA10MIX_bycopynumber
DNA10MIX_diluted25x
DNA10MIX_undiluted
DNA15MIX
NEGATIVE_firstplate
NEGATIVE_secondplate
Site_10_sample_7
Site_10_sample_8
Site_11_sample_1

In contrast, with metadata the 15 extra columns are inserted before this, and are used to sort the samples:

$ cut -f 1,16 summary/with-metadata.ITS1.samples.onebp.tsv | head
#Site  Sequencing sample
   Site_1_sample_1
   Site_1_sample_2
   Site_1_sample_3
   Site_1_sample_4
   Site_1_sample_5
   Site_1_sample_6
   Site_1_sample_7
   Site_1_sample_8
   Site_1_sample_9-2

Like the FASTQ filename stems, the metadata is still sorted as strings, but by using leading zeros and YYYY-MM-DD style for any dates, you can achieve a logical presentation.
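
To see why leading zeros and YYYY-MM-DD dates matter, here is a small sketch of plain string sorting. The site labels and dates are hypothetical and chosen only to illustrate the ordering.

```python
# Hypothetical site labels, for illustration only
plain = ["Site_1", "Site_2", "Site_10", "Site_11"]
padded = ["Site_01", "Site_02", "Site_10", "Site_11"]

# Without padding, "Site_10" sorts before "Site_2" because strings are
# compared character by character and "1" comes before "2".
print(sorted(plain))

# With leading zeros, string order and numeric order agree.
print(sorted(padded))

# Hypothetical dates: ISO style (YYYY-MM-DD) sorts chronologically as text.
dates = ["2020-01-15", "2019-12-31", "2020-01-02"]
print(sorted(dates))
```

This is the same effect seen in the earlier listing without metadata, where Site_10_sample_7 appears ahead of the Site_1 samples.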

After the sequencing sample name (the FASTQ filename stem), we have the classification summary as a comma-separated list, which attempts to summarise the later per-species columns. Species listed here with (*) are where sequences matched multiple species equally well. For example, Phytophthora andina, P. infestans, and P. ipomoeae share an identical ITS1 marker.

The next columns are derived from the data itself: read counts in the samples as raw FASTQ, after read merging with Flash, and after primer trimming with Cutadapt; information about the abundance thresholds used (omitted below); the maximum ASV read count for non-spike-in or spike-in sequences; the number of singletons; the total number of reads for the accepted ASVs (i.e. passing the abundance threshold); and the number of unique ASVs accepted. It may be easier to look at this in Excel, but at the command line:

$ cut -f 16,18-20,24-28 summary/with-metadata.ITS1.samples.onebp.tsv | head
<SEE TABLE BELOW>

As a table:

Sequencing sample	Raw FASTQ	Flash	Cutadapt	Max non-spike	Singletons	Accepted	Unique
Site_1_sample_1	6136	5900	5886	2269	692	4180	8
Site_1_sample_2	6135	5955	5947	2532	671	4548	8
Site_1_sample_3	6778	6484	6470	2146	579	5060	5
Site_1_sample_4	4145	3984	3974	1499	469	2852	7
Site_1_sample_5	4722	4232	4213	3130	433	3130	1
Site_1_sample_6	12633	12070	12034	5864	1217	9208	4
Site_1_sample_7	7560	7170	7141	3372	741	5402	5
Site_1_sample_8	6324	5956	5942	2037	630	4524	5
Site_1_sample_9-2	4542	4335	4331	2780	385	3436	2
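
These counts make it easy to check how many reads survive to the end. The following sketch computes the fraction of raw reads that end up in accepted ASVs for the first three rows of the table above. The column headings are copied from that table (the headings in the real TSV file may differ), and only a subset of the columns is included.

```python
import csv
import io

# First three rows of the table above, tab-separated, with a subset of
# its columns. Headings are as shown in the table, not necessarily those
# in the real TSV file.
excerpt = """\
Sequencing sample\tRaw FASTQ\tFlash\tCutadapt\tAccepted
Site_1_sample_1\t6136\t5900\t5886\t4180
Site_1_sample_2\t6135\t5955\t5947\t4548
Site_1_sample_3\t6778\t6484\t6470\t5060
"""

fractions = {}
for row in csv.DictReader(io.StringIO(excerpt), delimiter="\t"):
    raw = int(row["Raw FASTQ"])
    accepted = int(row["Accepted"])
    fractions[row["Sequencing sample"]] = accepted / raw

for sample, fraction in fractions.items():
    print(f"{sample}\t{fraction:.1%}")
```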

Finally, we get to the main part of the sample table, one column per classifier result, with the number of reads. Picking out some examples:

$ cut -f 16,31,41,65 summary/with-metadata.ITS1.samples.onebp.tsv | head
<SEE TABLE BELOW>

As a table:

Sequencing sample	Phytophthora austrocedri	Phytophthora gonapodyides	Unknown
Site_1_sample_1	165	1158	0
Site_1_sample_2	445	718	101
Site_1_sample_3	0	1110	1313
Site_1_sample_4	204	861	0
Site_1_sample_5	0	3130	0
Site_1_sample_6	0	0	0
Site_1_sample_7	0	902	161
Site_1_sample_8	0	1863	116
Site_1_sample_9-2	0	0	656
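
The red highlighting in Excel is effectively a presence/absence view of these counts. As a rough sketch of the same idea in plain text, the code below lists which columns have non-zero counts for the first three samples in the table above. The variable names are hypothetical.

```python
import csv
import io

# First three rows of the table above, tab-separated
excerpt = """\
Sequencing sample\tPhytophthora austrocedri\tPhytophthora gonapodyides\tUnknown
Site_1_sample_1\t165\t1158\t0
Site_1_sample_2\t445\t718\t101
Site_1_sample_3\t0\t1110\t1313
"""

detected = {}
for row in csv.DictReader(io.StringIO(excerpt), delimiter="\t"):
    sample = row.pop("Sequencing sample")
    # Keep only the columns with a non-zero read count
    detected[sample] = [name for name, count in row.items() if int(count) > 0]

for sample, names in detected.items():
    print(sample, "; ".join(names), sep="\t")
```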

Generally we hope to see a single species prediction for each ASV. However, when there are conflicts, such as equally good matches or a reference sequence that is shared between species, both are reported. For example:

$ cut -f 16,35 summary/with-metadata.ITS1.samples.onebp.tsv | head
<SEE TABLE BELOW>

As a table:

Sequencing sample	Phytophthora chlamydospora;Phytophthora x stagnum
Site_1_sample_1	0
Site_1_sample_2	0
Site_1_sample_3	0
Site_1_sample_4	0
Site_1_sample_5	0
Site_1_sample_6	1217
Site_1_sample_7	0
Site_1_sample_8	0
Site_1_sample_9-2	0

In this example, Site_1_sample_6 had sequences matching both Phytophthora chlamydospora and Phytophthora x stagnum. These species are listed with a (*) suffix in the earlier classification summary column:

$ grep Site_1_sample_6 summary/with-metadata.ITS1.samples.onebp.tsv | cut -f 16,17
Site_1_sample_6  Phytophthora castanetorum, Phytophthora chlamydospora(*), Phytophthora pseudosyringae, Phytophthora syringae, Phytophthora x stagnum(*)
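
If you want to pick out the ambiguous predictions programmatically, the classification summary is simple to parse. This is a minimal sketch, assuming entries are separated by commas and that ambiguity is marked only by the (*) suffix as in the output above; the function name is hypothetical.

```python
def parse_classification_summary(summary: str):
    """Split a classification summary into (species, ambiguous) pairs.

    Assumes comma-separated entries, with a "(*)" suffix marking species
    where sequences matched multiple species equally well.
    """
    results = []
    for entry in summary.split(","):
        name = entry.strip()
        if not name:
            continue
        ambiguous = name.endswith("(*)")
        if ambiguous:
            name = name[: -len("(*)")].strip()
        results.append((name, ambiguous))
    return results


# The summary for Site_1_sample_6, copied from the output above
summary = (
    "Phytophthora castanetorum, Phytophthora chlamydospora(*), "
    "Phytophthora pseudosyringae, Phytophthora syringae, "
    "Phytophthora x stagnum(*)"
)
ambiguous_species = [
    name for name, flag in parse_classification_summary(summary) if flag
]
print(ambiguous_species)
```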