Presence and absence
This example is a controlled setup where we know what the classifier ought ideally to report for every single sample.
We have replicated negative controls (which should have no marker sequences present), and plenty of positive controls (which should have the expected 19 species only).
Of course, just as in the original author’s analysis, not everything we think was present is detected, and vice versa, lots of unwanted things are detected. These kinds of controls are perfect for discussing what to use as a minimum abundance threshold - how many reads do we need to declare a species as present in a sample?
Negative controls
If you have called the provided setup.py
to download the FASTQ files
and run.py
to call THAPBI PICT, it would have used an optimistic
minimum abundance threshold of 10 copies of each unique sequence (the
default is a far more pesimitic 100).
This is not a good idea in general, but for your negative controls it can be interesting to deliberately set no threshold, and accept any sequence even if only supported by one read.
(Be sure to remove the intermediate FASTA files if you try this, as otherwise THAPBI PICT would not replace the older higher threshold files).
If you do this, just how bad are the contamination levels? These little
tables were extracted manually from the sample level reports run with
-a 1
(accepting even sequences seen in only one read). The counts
are the total number of reads in each sample, while max is the highest
single sequence’s abundance.
Amplicon library one, ITS1 using the BITS/B58S3 primer pair, samples replicated in duplicate:
Description |
MiSeq-name |
Accession |
Count |
Max |
negative control from DNA extraction |
NegDNAA_S163 |
SRR5314317 |
112 |
64 |
negative control from DNA extraction |
NegDNAB_S175 |
SRR5314316 |
132 |
101 |
negative control from PCR step |
NegPCRA_S187 |
SRR5314315 |
1153 |
1085 |
negative control from PCR step |
NegPCRB_S104 |
SRR5314314 |
4343 |
3961 |
Amplicon library two, ITS1 using the ITS1f/ITS2 primer pair:
Description |
MiSeq-name |
Accession |
Count |
Max |
negative control with GoTaq |
NegCtlGoTq_S20 |
SRR5838526 |
2 |
1 |
negative control with Phusion |
NegCtlPhGn_S30 |
SRR5838523 |
8 |
4 |
negative control with reAmp |
NegCtlPrmp_S10 |
SRR5838524 |
9 |
1 |
Amplicon library two, ITS2 using the ITS3‐KYO2 and ITS4‐KYO3 primer pair:
Description |
MiSeq-name |
Accession |
Count |
Max |
negative control with GoTaq |
NegCtlGoTq_S20 |
SRR5838526 |
14 |
2 |
negative control with Phusion |
NegCtlPhGn_S30 |
SRR5838523 |
17 |
4 |
negative control with PreAmp |
NegCtlPrmp_S10 |
SRR5838524 |
5 |
1 |
Looking at these numbers the levels in amplicon library two are commendably low, at most four copies of any unique sequence - suggesting using a minimum threshold of 10 here is quite reasonable.
Hereafter we will assume the minimum abundance threshold of 10 was used, and you are encouraged to look at the sample or read level reports (e.g. in Excel) while following along with this discussion.
However, the levels in amplicon library one are cause for concern.
Starting with the negative control from the DNA extraction (given a green
background in the Excel reports), we see both replicates had two unwanted
sequences. Look at summary/AL1.BITS-B58S3.reads.onebp.xlsx
in Excel, or
the TSV version at the command line:
$ cut -f 1,2,7,35,36 summary/AL1.BITS-B58S3.reads.onebp.tsv | grep -v "[[:space:]]0[[:space:]]0$"
# Sample-type negative control negative control
# Group from DNA extraction from DNA extraction
# Protocol standard workflow standard workflow
# Condition Neg_DNA Neg_DNA
# Replicate 1 2
# MiSeq-name NegDNAA_S163 NegDNAB_S175
# Raw FASTQ 12564 16297
# Flash 11641 15829
# Cutadapt 112 131
# Threshold pool AL1 AL1
# Threshold 10 10
# Control Sample Sample
# Max non-spike 64 100
# Singletons 14 17
# Accepted 98 110
# Unique 2 2
#Marker MD5 Total-abundance SRR5314317 SRR5314316
MAX or TOTAL - 881219 64 100
BITS-B58S3 d51507f661ebee38a85bec35b70b7ee1 47984 64 100
BITS-B58S3 daadc4126b5747c43511bd3be0ea2438 34 34 0
BITS-B58S3 e5b7a8b5dc0da33108cc8a881eb409f5 10 0 10
Using a minimum of 10 has excluded lots of singletons etc here.
Both have d51507f661ebee38a85bec35b70b7ee1
as their more common unwanted
sequence, a perfect match to Fusarium graminearum in the mock community
(classifier summary column omitted above for a clearer layout).
The lower abundance sequence daadc4126b5747c43511bd3be0ea2438
gives
perfect NCBI BLAST matches to several accessions of fungus Wallemia muriae,
likewise e5b7a8b5dc0da33108cc8a881eb409f5
gives perfect NCBI BLAST matches
to Wallemia muriae and Wallemia sebi. They have no match from the
classifier.
Moving on to the worst case, the negative control from the PCR reaction (given a pale blue background in the Excel reports). Again, look at the Excel file, or if working at the terminal:
$ cut -f 1,2,7,37,38 summary/AL1.BITS-B58S3.reads.onebp.tsv | grep -v "[[:space:]]0[[:space:]]0$"
# Sample-type negative control negative control
# Group from PCR step from PCR step
# Protocol standard workflow standard workflow
# Condition Neg_PCR Neg_PCR
# Replicate 1 2
# MiSeq-name NegPCRA_S187 NegPCRB_S104
# Raw FASTQ 19406 7285
# Flash 12140 6128
# Cutadapt 1153 4340
# Threshold pool AL1 AL1
# Threshold 10 10
# Control Sample Sample
# Max non-spike 1085 3958
# Singletons 42 127
# Accepted 1085 4014
# Unique 1 4
#Marker MD5 Total-abundance SRR5314315 SRR5314314
MAX or TOTAL - 881219 1085 3958
BITS-B58S3 d51507f661ebee38a85bec35b70b7ee1 47984 1085 3958
BITS-B58S3 716f6111ac2ee192c23282e07d23078a 31294 0 25
BITS-B58S3 5194a4ae3a27d987892a8fee7b1669b9 17 0 17
BITS-B58S3 702929cef71042156acb3a28270d8831 14 0 14
The minimum abundance excluded lots of singletons etc. The vast majority of those were slight variants of the dominant sequence, and can thus be explained as PCR noise.
Again, both samples have d51507f661ebee38a85bec35b70b7ee1
as their main
(or only) unwanted sequence above the threshold, a perfect match to Fusarium
graminearum in the mock community.
Additionally 716f6111ac2ee192c23282e07d23078a
matched Mortierella
verticillata from the mock community.
Then 5194a4ae3a27d987892a8fee7b1669b9
gives perfect NCBI BLAST matches to
fungus Trichosporon asahii and 702929cef71042156acb3a28270d8831
to fungus
Candida tropicalis, which are unexpected contamination.
I concur with the author that the high levels of Fusarium graminearum are most likely cross-contamination from the mock-community samples:
Negative control samples in this sequencing run displayed some contamination by F. graminearum. This taxon was represented at slightly, but not dramatically, higher than expected relative abundances in the mock community samples; some of the increase over expected relative abundance may have been related to cross‐sample contamination.
Looking at the DNA extraction control alone, the THAPBI PICT default threshold of 100 seems reasonable. However, if we set that aside the likely Fusarium graminearum contamination, then the next worst contamination in any of these four controls is at 32 copies, so you might argue 100 is a little harsh?
Certainly I think for amplicon library one, a threshold of 10 is too low, but it could be defended for amplicon library two (where the controls had up to four copies of an unwanted sequence).
Missing positive controls
We will look at the ratios later, but were all 19 species in the mock community found? Perhaps the quickest way to answer this is to look at the classification assessment output. At the command line, looking at the BLAST based classifier as the most fuzzy of the three:
$ cut -f 1-5,9,11 summary/AL1.BITS-B58S3.assess.blast.tsv
<SEE TABLE BELOW>
Or open this in Excel. You should find:
#Species |
TP |
FP |
FN |
TN |
F1 |
Ad-hoc-loss |
---|---|---|---|---|---|---|
OVERALL |
345 |
5 |
168 |
71 |
0.80 |
0.334 |
Alternaria alternata |
26 |
0 |
1 |
4 |
0.98 |
0.037 |
Aspergillus flavus |
25 |
0 |
2 |
4 |
0.96 |
0.074 |
Candida apicola |
27 |
0 |
0 |
4 |
1.00 |
0.000 |
Chytriomyces hyalinus |
0 |
0 |
27 |
4 |
0.00 |
1.000 |
Claviceps purpurea |
27 |
0 |
0 |
4 |
1.00 |
0.000 |
Fusarium graminearum |
27 |
4 |
0 |
0 |
0.93 |
0.129 |
Fusarium oxysporum |
27 |
0 |
0 |
4 |
1.00 |
0.000 |
Fusarium verticillioides |
0 |
0 |
27 |
4 |
0.00 |
1.000 |
Mortierella verticillata |
27 |
1 |
0 |
3 |
0.98 |
0.036 |
Naganishia albida |
27 |
0 |
0 |
4 |
1.00 |
0.000 |
Neosartorya fischeri |
24 |
0 |
3 |
4 |
0.94 |
0.111 |
Penicillium expansum |
22 |
0 |
5 |
4 |
0.90 |
0.185 |
Rhizoctonia solani |
19 |
0 |
8 |
4 |
0.83 |
0.296 |
Rhizomucor miehei |
0 |
0 |
27 |
4 |
0.00 |
1.000 |
Rhizophagus irregularis |
13 |
0 |
14 |
4 |
0.65 |
0.519 |
Saccharomyces cerevisiae |
0 |
0 |
27 |
4 |
0.00 |
1.000 |
Saitoella complicata |
27 |
0 |
0 |
4 |
1.00 |
0.000 |
Trichoderma reesei |
27 |
0 |
0 |
4 |
1.00 |
0.000 |
Ustilago maydis |
0 |
0 |
27 |
4 |
0.00 |
1.000 |
Or, open this plain text tab separated Excel.
Five expected species were never found (FN with zero true positives) at the 10 reads abundance threshold: Chytriomyces hyalinus, Fusarium verticillioides, Rhizomucor miehei, Saccharomyces cerevisiae and Ustilago maydis.
The author wrote:
Two of the expected 19 phylotypes, Fusarium verticillioides and Saccharomyces cerevisiae, were not detected in any of the samples. A large number of reads, presumably including many F. verticillioides reads, were binned into a phylotype as unclassified Fusarium. The primers used in ITS1 amplification for this sequencing library match the rRNA gene sequence of S. cerevisiae. However, the expected ITS1 amplicon length is 402 bases for this taxon, compared to a range of 141‐330 bases across the remaining taxa in the mock community. Examining the data at earlier stages of processing revealed that S. cerevisiae was originally represented in the data set, but was completely removed during quality screening (Table S3).
Chytriomyes hyalinus, Rhizomucor miehei and Ustilago maydis were detected at dramatically lower abundances than expected. Each of these taxa possesses sequence mismatches compared to the PCR primers that were used. The number of mismatches to the forward and reverse primers was as follows: for C. hyalinus, 2 and 1; for R. miehei, 0 and 2; and for U. maydis, 2 and 1. Thus, selection against these taxa may have been due to primer annealing efficiency.
That’s pretty consistent (we’ve talked about Fusarium verticillioides earlier), and suggests using a minimum abundance threshold of 10 in THAPBI PICT is a little stricter that the author’s pipeline.
Moving on to the second amplicon library, the larger ITS1 marker using the ITS1f/ITS2 primer is more successful:
$ cut -f 1-5,9,11 summary/AL2.ITS1f-ITS2.assess.blast.tsv
<SEE TABLE BELOW>
Or open this in Excel. You should find:
#Species |
TP |
FP |
FN |
TN |
F1 |
Ad-hoc-loss |
---|---|---|---|---|---|---|
OVERALL |
398 |
0 |
115 |
57 |
0.87 |
0.224 |
Alternaria alternata |
23 |
0 |
4 |
3 |
0.92 |
0.148 |
Aspergillus flavus |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Candida apicola |
12 |
0 |
15 |
3 |
0.62 |
0.556 |
Chytriomyces hyalinus |
25 |
0 |
2 |
3 |
0.96 |
0.074 |
Claviceps purpurea |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Fusarium graminearum |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Fusarium oxysporum |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Fusarium verticillioides |
12 |
0 |
15 |
3 |
0.62 |
0.556 |
Mortierella verticillata |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Naganishia albida |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Neosartorya fischeri |
23 |
0 |
4 |
3 |
0.92 |
0.148 |
Penicillium expansum |
24 |
0 |
3 |
3 |
0.94 |
0.111 |
Rhizoctonia solani |
24 |
0 |
3 |
3 |
0.94 |
0.111 |
Rhizomucor miehei |
4 |
0 |
23 |
3 |
0.26 |
0.852 |
Rhizophagus irregularis |
11 |
0 |
16 |
3 |
0.58 |
0.593 |
Saccharomyces cerevisiae |
9 |
0 |
18 |
3 |
0.50 |
0.667 |
Saitoella complicata |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Trichoderma reesei |
25 |
0 |
2 |
3 |
0.96 |
0.074 |
Ustilago maydis |
17 |
0 |
10 |
3 |
0.77 |
0.370 |
Everything was found, although Rhizomucor miehei in particular found rarely, followed by Saccharomyces cerevisiae. The original author wrote:
The ITS1 data set yielded 18 of the expected 19 taxa (Tables S3, S5); as in the first library, no reads were classified as F. verticillioides, although many reads were placed in unclassified Fusarium. Rhizomucor miehei and S. cerevisiae were substantially underrepresented. Compared to primers ITS1f and ITS2, R. miehei had three mismatches in the forward and two mismatches in the reverse. Saccharomyces cerevisiae had one mismatch in the forward primer and again likely suffered negative bias associated with amplicon length (Table 3) and low sequence quality (Table S3).
Again, broad agreement here, with the problem of Fusarium verticillioides discussed earlier.
And finally, amplicon library two for ITS2 using the ITS3-KYO2 and ITS4-KYO3 primers:
$ cut -f 1-5,9,11 summary/AL2.ITS3-KYO2-ITS4-KYO3.assess.blast.tsv
<SEE TABLE BELOW>
Or open this in Excel. You should find:
#Species |
TP |
FP |
FN |
TN |
F1 |
Ad-hoc-loss |
---|---|---|---|---|---|---|
OVERALL |
313 |
0 |
200 |
57 |
0.76 |
0.390 |
Alternaria alternata |
16 |
0 |
11 |
3 |
0.74 |
0.407 |
Aspergillus flavus |
24 |
0 |
3 |
3 |
0.94 |
0.111 |
Candida apicola |
0 |
0 |
27 |
3 |
0.00 |
1.000 |
Chytriomyces hyalinus |
0 |
0 |
27 |
3 |
0.00 |
1.000 |
Claviceps purpurea |
23 |
0 |
4 |
3 |
0.92 |
0.148 |
Fusarium graminearum |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Fusarium oxysporum |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Fusarium verticillioides |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Mortierella verticillata |
12 |
0 |
15 |
3 |
0.62 |
0.556 |
Naganishia albida |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Neosartorya fischeri |
16 |
0 |
11 |
3 |
0.74 |
0.407 |
Penicillium expansum |
23 |
0 |
4 |
3 |
0.92 |
0.148 |
Rhizoctonia solani |
11 |
0 |
16 |
3 |
0.58 |
0.593 |
Rhizomucor miehei |
0 |
0 |
27 |
3 |
0.00 |
1.000 |
Rhizophagus irregularis |
5 |
0 |
22 |
3 |
0.31 |
0.815 |
Saccharomyces cerevisiae |
27 |
0 |
0 |
3 |
1.00 |
0.000 |
Saitoella complicata |
26 |
0 |
1 |
3 |
0.98 |
0.037 |
Trichoderma reesei |
22 |
0 |
5 |
3 |
0.90 |
0.185 |
Ustilago maydis |
0 |
0 |
27 |
3 |
0.00 |
1.000 |
This time we’re missing Candida apicola, Chytriomyces hyalinus, Rhizomucor miehei and Ustilago maydis.
This too is in board agreement with the original author, although Candida apicola must have just dipped below our abundance threshold.
Different amplification biases were evident between the ITS1 and ITS2 loci. In the ITS2 data set, only 16 of the 19 taxa that were present could be detected; C. hyalinus, R. miehei and U. maydis were not observed (Tables S3, S6). … Rhizomucor miehei has one mismatch to the forward primer and three mismatches to the reverse primer. While neither C. hyalinus nor U. maydis have sequence mismatches compared to the primers, these two taxa have longer ITS2 amplicons than any others in the mock community (Table 3). These two taxa were originally represented with a small number of reads in the raw data, but were completely removed during quality screening (Table S3). Candida apicola, which possesses two mismatches to the reverse primer for this amplicon, was detected at substantially lower than expected frequencies (Figure 7; Figures S5, S6).
So, using THAPBI PICT on these amplicon datasets with a minimum abundance threshold of 10 gives broad agreement with the original analysis.