Presence and absence

This example is a controlled setup where we know what the classifier ought ideally to report for every single sample.

We have replicated negative controls (which should have no marker sequences present), and plenty of positive controls (which should have the expected 19 species only).

Of course, just as in the original author’s analysis, not everything we think was present is detected, and vice versa, lots of unwanted things are detected. These kinds of controls are perfect for discussing what to use as a minimum abundance threshold - how many reads do we need to declare a species as present in a sample?

Negative controls

If you have called the provided setup.py to download the FASTQ files and run.py to call THAPBI PICT, it would have used an optimistic minimum abundance threshold of 10 copies of each unique sequence (the default is a far more pesimitic 100).

This is not a good idea in general, but for your negative controls it can be interesting to deliberately set no threshold, and accept any sequence even if only supported by one read.

(Be sure to remove the intermediate FASTA files if you try this, as otherwise THAPBI PICT would not replace the older higher threshold files).

If you do this, just how bad are the contamination levels? These little tables were extracted manually from the sample level reports run with -a 1 (accepting even sequences seen in only one read). The counts are the total number of reads in each sample, while max is the highest single sequence’s abundance.

Amplicon library one, ITS1 using the BITS/B58S3 primer pair, samples replicated in duplicate:

Description	MiSeq-name	Accession	Count	Max
negative control from DNA extraction	NegDNAA_S163	SRR5314317	112	64
negative control from DNA extraction	NegDNAB_S175	SRR5314316	132	101
negative control from PCR step	NegPCRA_S187	SRR5314315	1153	1085
negative control from PCR step	NegPCRB_S104	SRR5314314	4343	3961

Amplicon library two, ITS1 using the ITS1f/ITS2 primer pair:

Description	MiSeq-name	Accession	Count	Max
negative control with GoTaq	NegCtlGoTq_S20	SRR5838526	2	1
negative control with Phusion	NegCtlPhGn_S30	SRR5838523	8	4
negative control with reAmp	NegCtlPrmp_S10	SRR5838524	9	1

Amplicon library two, ITS2 using the ITS3‐KYO2 and ITS4‐KYO3 primer pair:

Description	MiSeq-name	Accession	Count	Max
negative control with GoTaq	NegCtlGoTq_S20	SRR5838526	14	2
negative control with Phusion	NegCtlPhGn_S30	SRR5838523	17	4
negative control with PreAmp	NegCtlPrmp_S10	SRR5838524	5	1

Looking at these numbers the levels in amplicon library two are commendably low, at most four copies of any unique sequence - suggesting using a minimum threshold of 10 here is quite reasonable.

Hereafter we will assume the minimum abundance threshold of 10 was used, and you are encouraged to look at the sample or read level reports (e.g. in Excel) while following along with this discussion.

However, the levels in amplicon library one are cause for concern. Starting with the negative control from the DNA extraction (given a green background in the Excel reports), we see both replicates had two unwanted sequences. Look at summary/AL1.BITS-B58S3.reads.onebp.xlsx in Excel, or the TSV version at the command line:

$ cut -f 1,2,7,35,36 summary/AL1.BITS-B58S3.reads.onebp.tsv | grep -v "[[:space:]]0[[:space:]]0$"
#                                               Sample-type      negative control     negative control
#                                               Group            from DNA extraction  from DNA extraction
#                                               Protocol         standard workflow    standard workflow
#                                               Condition        Neg_DNA              Neg_DNA
#                                               Replicate        1                    2
#                                               MiSeq-name       NegDNAA_S163         NegDNAB_S175
#                                               Raw FASTQ        12564                16297
#                                               Flash            11641                15829
#                                               Cutadapt         112                  131
#                                               Threshold pool   AL1                  AL1
#                                               Threshold        10                   10
#                                               Control          Sample               Sample
#                                               Max non-spike    64                   100
#                                               Singletons       14                   17
#                                               Accepted         98                   110
#                                               Unique           2                    2
#Marker       MD5                               Total-abundance  SRR5314317           SRR5314316
MAX or TOTAL  -                                 881219           64                   100
BITS-B58S3    d51507f661ebee38a85bec35b70b7ee1  47984            64                   100
BITS-B58S3    daadc4126b5747c43511bd3be0ea2438  34               34                   0
BITS-B58S3    e5b7a8b5dc0da33108cc8a881eb409f5  10               0                    10

Using a minimum of 10 has excluded lots of singletons etc here.

Both have d51507f661ebee38a85bec35b70b7ee1 as their more common unwanted sequence, a perfect match to Fusarium graminearum in the mock community (classifier summary column omitted above for a clearer layout).

The lower abundance sequence daadc4126b5747c43511bd3be0ea2438 gives perfect NCBI BLAST matches to several accessions of fungus Wallemia muriae, likewise e5b7a8b5dc0da33108cc8a881eb409f5 gives perfect NCBI BLAST matches to Wallemia muriae and Wallemia sebi. They have no match from the classifier.

Moving on to the worst case, the negative control from the PCR reaction (given a pale blue background in the Excel reports). Again, look at the Excel file, or if working at the terminal:

$ cut -f 1,2,7,37,38 summary/AL1.BITS-B58S3.reads.onebp.tsv | grep -v "[[:space:]]0[[:space:]]0$"
#                                               Sample-type      negative control   negative control
#                                               Group            from PCR step      from PCR step
#                                               Protocol         standard workflow  standard workflow
#                                               Condition        Neg_PCR            Neg_PCR
#                                               Replicate        1                  2
#                                               MiSeq-name       NegPCRA_S187       NegPCRB_S104
#                                               Raw FASTQ        19406              7285
#                                               Flash            12140              6128
#                                               Cutadapt         1153               4340
#                                               Threshold pool   AL1                AL1
#                                               Threshold        10                 10
#                                               Control          Sample             Sample
#                                               Max non-spike    1085               3958
#                                               Singletons       42                 127
#                                               Accepted         1085               4014
#                                               Unique           1                  4
#Marker       MD5                               Total-abundance  SRR5314315         SRR5314314
MAX or TOTAL  -                                 881219           1085               3958
BITS-B58S3    d51507f661ebee38a85bec35b70b7ee1  47984            1085               3958
BITS-B58S3    716f6111ac2ee192c23282e07d23078a  31294            0                  25
BITS-B58S3    5194a4ae3a27d987892a8fee7b1669b9  17               0                  17
BITS-B58S3    702929cef71042156acb3a28270d8831  14               0                  14

The minimum abundance excluded lots of singletons etc. The vast majority of those were slight variants of the dominant sequence, and can thus be explained as PCR noise.

Again, both samples have d51507f661ebee38a85bec35b70b7ee1 as their main (or only) unwanted sequence above the threshold, a perfect match to Fusarium graminearum in the mock community. Additionally 716f6111ac2ee192c23282e07d23078a matched Mortierella verticillata from the mock community.

Then 5194a4ae3a27d987892a8fee7b1669b9 gives perfect NCBI BLAST matches to fungus Trichosporon asahii and 702929cef71042156acb3a28270d8831 to fungus Candida tropicalis, which are unexpected contamination.

I concur with the author that the high levels of Fusarium graminearum are most likely cross-contamination from the mock-community samples:

Negative control samples in this sequencing run displayed some contamination by F. graminearum. This taxon was represented at slightly, but not dramatically, higher than expected relative abundances in the mock community samples; some of the increase over expected relative abundance may have been related to cross‐sample contamination.

Looking at the DNA extraction control alone, the THAPBI PICT default threshold of 100 seems reasonable. However, if we set that aside the likely Fusarium graminearum contamination, then the next worst contamination in any of these four controls is at 32 copies, so you might argue 100 is a little harsh?

Certainly I think for amplicon library one, a threshold of 10 is too low, but it could be defended for amplicon library two (where the controls had up to four copies of an unwanted sequence).

Missing positive controls

We will look at the ratios later, but were all 19 species in the mock community found? Perhaps the quickest way to answer this is to look at the classification assessment output. At the command line, looking at the BLAST based classifier as the most fuzzy of the three:

$ cut -f 1-5,9,11 summary/AL1.BITS-B58S3.assess.blast.tsv
<SEE TABLE BELOW>

Or open this in Excel. You should find:

#Species	TP	FP	FN	TN	F1	Ad-hoc-loss
OVERALL	345	5	168	71	0.80	0.334
Alternaria alternata	26	0	1	4	0.98	0.037
Aspergillus flavus	25	0	2	4	0.96	0.074
Candida apicola	27	0	0	4	1.00	0.000
Chytriomyces hyalinus	0	0	27	4	0.00	1.000
Claviceps purpurea	27	0	0	4	1.00	0.000
Fusarium graminearum	27	4	0	0	0.93	0.129
Fusarium oxysporum	27	0	0	4	1.00	0.000
Fusarium verticillioides	0	0	27	4	0.00	1.000
Mortierella verticillata	27	1	0	3	0.98	0.036
Naganishia albida	27	0	0	4	1.00	0.000
Neosartorya fischeri	24	0	3	4	0.94	0.111
Penicillium expansum	22	0	5	4	0.90	0.185
Rhizoctonia solani	19	0	8	4	0.83	0.296
Rhizomucor miehei	0	0	27	4	0.00	1.000
Rhizophagus irregularis	13	0	14	4	0.65	0.519
Saccharomyces cerevisiae	0	0	27	4	0.00	1.000
Saitoella complicata	27	0	0	4	1.00	0.000
Trichoderma reesei	27	0	0	4	1.00	0.000
Ustilago maydis	0	0	27	4	0.00	1.000

Or, open this plain text tab separated Excel.

Five expected species were never found (FN with zero true positives) at the 10 reads abundance threshold: Chytriomyces hyalinus, Fusarium verticillioides, Rhizomucor miehei, Saccharomyces cerevisiae and Ustilago maydis.

The author wrote:

Two of the expected 19 phylotypes, Fusarium verticillioides and Saccharomyces cerevisiae, were not detected in any of the samples. A large number of reads, presumably including many F. verticillioides reads, were binned into a phylotype as unclassified Fusarium. The primers used in ITS1 amplification for this sequencing library match the rRNA gene sequence of S. cerevisiae. However, the expected ITS1 amplicon length is 402 bases for this taxon, compared to a range of 141‐330 bases across the remaining taxa in the mock community. Examining the data at earlier stages of processing revealed that S. cerevisiae was originally represented in the data set, but was completely removed during quality screening (Table S3).

Chytriomyes hyalinus, Rhizomucor miehei and Ustilago maydis were detected at dramatically lower abundances than expected. Each of these taxa possesses sequence mismatches compared to the PCR primers that were used. The number of mismatches to the forward and reverse primers was as follows: for C. hyalinus, 2 and 1; for R. miehei, 0 and 2; and for U. maydis, 2 and 1. Thus, selection against these taxa may have been due to primer annealing efficiency.

That’s pretty consistent (we’ve talked about Fusarium verticillioides earlier), and suggests using a minimum abundance threshold of 10 in THAPBI PICT is a little stricter that the author’s pipeline.

Moving on to the second amplicon library, the larger ITS1 marker using the ITS1f/ITS2 primer is more successful:

$ cut -f 1-5,9,11 summary/AL2.ITS1f-ITS2.assess.blast.tsv
<SEE TABLE BELOW>

Or open this in Excel. You should find:

#Species	TP	FN	TN	F1	Ad-hoc-loss
OVERALL	398	115	57	0.87	0.224
Alternaria alternata	23	4	3	0.92	0.148
Aspergillus flavus	27	0	3	1.00	0.000
Candida apicola	12	15	3	0.62	0.556
Chytriomyces hyalinus	25	2	3	0.96	0.074
Claviceps purpurea	27	0	3	1.00	0.000
Fusarium graminearum	27	0	3	1.00	0.000
Fusarium oxysporum	27	0	3	1.00	0.000
Fusarium verticillioides	12	15	3	0.62	0.556
Mortierella verticillata	27	0	3	1.00	0.000
Naganishia albida	27	0	3	1.00	0.000
Neosartorya fischeri	23	4	3	0.92	0.148
Penicillium expansum	24	3	3	0.94	0.111
Rhizoctonia solani	24	3	3	0.94	0.111
Rhizomucor miehei	4	23	3	0.26	0.852
Rhizophagus irregularis	11	16	3	0.58	0.593
Saccharomyces cerevisiae	9	18	3	0.50	0.667
Saitoella complicata	27	0	3	1.00	0.000
Trichoderma reesei	25	2	3	0.96	0.074
Ustilago maydis	17	10	3	0.77	0.370

Everything was found, although Rhizomucor miehei in particular found rarely, followed by Saccharomyces cerevisiae. The original author wrote:

The ITS1 data set yielded 18 of the expected 19 taxa (Tables S3, S5); as in the first library, no reads were classified as F. verticillioides, although many reads were placed in unclassified Fusarium. Rhizomucor miehei and S. cerevisiae were substantially underrepresented. Compared to primers ITS1f and ITS2, R. miehei had three mismatches in the forward and two mismatches in the reverse. Saccharomyces cerevisiae had one mismatch in the forward primer and again likely suffered negative bias associated with amplicon length (Table 3) and low sequence quality (Table S3).

Again, broad agreement here, with the problem of Fusarium verticillioides discussed earlier.

And finally, amplicon library two for ITS2 using the ITS3-KYO2 and ITS4-KYO3 primers:

$ cut -f 1-5,9,11 summary/AL2.ITS3-KYO2-ITS4-KYO3.assess.blast.tsv
<SEE TABLE BELOW>

Or open this in Excel. You should find:

#Species	TP	FN	TN	F1	Ad-hoc-loss
OVERALL	313	200	57	0.76	0.390
Alternaria alternata	16	11	3	0.74	0.407
Aspergillus flavus	24	3	3	0.94	0.111
Candida apicola	0	27	3	0.00	1.000
Chytriomyces hyalinus	0	27	3	0.00	1.000
Claviceps purpurea	23	4	3	0.92	0.148
Fusarium graminearum	27	0	3	1.00	0.000
Fusarium oxysporum	27	0	3	1.00	0.000
Fusarium verticillioides	27	0	3	1.00	0.000
Mortierella verticillata	12	15	3	0.62	0.556
Naganishia albida	27	0	3	1.00	0.000
Neosartorya fischeri	16	11	3	0.74	0.407
Penicillium expansum	23	4	3	0.92	0.148
Rhizoctonia solani	11	16	3	0.58	0.593
Rhizomucor miehei	0	27	3	0.00	1.000
Rhizophagus irregularis	5	22	3	0.31	0.815
Saccharomyces cerevisiae	27	0	3	1.00	0.000
Saitoella complicata	26	1	3	0.98	0.037
Trichoderma reesei	22	5	3	0.90	0.185
Ustilago maydis	0	27	3	0.00	1.000

This time we’re missing Candida apicola, Chytriomyces hyalinus, Rhizomucor miehei and Ustilago maydis.

This too is in board agreement with the original author, although Candida apicola must have just dipped below our abundance threshold.

Different amplification biases were evident between the ITS1 and ITS2 loci. In the ITS2 data set, only 16 of the 19 taxa that were present could be detected; C. hyalinus, R. miehei and U. maydis were not observed (Tables S3, S6). … Rhizomucor miehei has one mismatch to the forward primer and three mismatches to the reverse primer. While neither C. hyalinus nor U. maydis have sequence mismatches compared to the primers, these two taxa have longer ITS2 amplicons than any others in the mock community (Table 3). These two taxa were originally represented with a small number of reads in the raw data, but were completely removed during quality screening (Table S3). Candida apicola, which possesses two mismatches to the reverse primer for this amplicon, was detected at substantially lower than expected frequencies (Figure 7; Figures S5, S6).

So, using THAPBI PICT on these amplicon datasets with a minimum abundance threshold of 10 gives broad agreement with the original analysis.