Minimum Abundance Threshold
THAPBI PICT has a default minimum absolute abundance threshold of 100 reads per marker per sample, and 0.1% of the reads per marker per sample, before accepting any unique sequence. Background contamination and PCR noise levels will vary, so having multiple Abundance & Negative Controls will help set this objectively.
In this dataset there is a single negative control for the MOL16S marker,
library BIM8M
aka SRR5534986
. However, we can also treat all the
SPH16S libraries as negative controls for the MOL16S marker, and vice versa.
You could do this automatically within THAPBI PICT via the -n
or
--negctrls
command line option, but as we shall see in this example it
will discard most of the data.
In order to examine an appropriate minimum abundance threshold, the run.sh
script provided uses -a 10 -f 0
to accept any unique sequence seen in
sample at least ten times (regardless the fraction of the sample read total).
This does allow unwanted noise though to the reports.
SPH16S
This was the more specific primer pair, expected to only amplify sphaeriid mussel species, so in general we expect less unique sequences than with the more general MOL16S primers.
Looking at some key columns in the sample report,
$ cut -f 1,2,4,7,9,14 summary/SPH16S.samples.onebp.tsv
<SEE TABLE BELOW>
Or, open SPH16S.samples.onebp.xlsx
in Excel. Focusing on the left hand
columns, you should see:
#Marker |
Group |
Library-name |
Raw FASTQ |
Cutadapt |
Accepted |
---|---|---|---|---|---|
MOL16S |
Aquarium |
BIR2M |
306311 |
2 |
0 |
MOL16S |
Aquarium |
BIR6M |
291954 |
14 |
0 |
MOL16S |
Control |
BIM8M |
2433 |
0 |
0 |
MOL16S |
Mock Community |
SC3PRO1 |
689712 |
17 |
0 |
MOL16S |
Mock Community |
SC3PRO2 |
405048 |
0 |
0 |
MOL16S |
Mock Community |
SC3PRO3 |
402219 |
16 |
0 |
MOL16S |
Mock Community |
NFSC3PRO3 |
349590 |
33 |
10 |
MOL16S |
Mock Community |
SC3PRO4 |
671241 |
6 |
0 |
MOL16S |
Mock Community |
NFSC3PRO4 |
420015 |
7 |
0 |
MOL16S |
Mock Community |
SC3PRO5 |
480606 |
13 |
0 |
MOL16S |
River |
BIM6M |
821849 |
0 |
0 |
MOL16S |
River |
BIM2M |
1119271 |
0 |
0 |
MOL16S |
River |
BIM4M |
709472 |
40 |
19 |
SPH16S |
Aquarium |
BIR2S |
498926 |
251148 |
209402 |
SPH16S |
Aquarium |
BIR6S |
240360 |
226012 |
191456 |
SPH16S |
Mock Community |
SPSC3PRO1 |
425271 |
317960 |
224689 |
SPH16S |
Mock Community |
SPSC3PRO2 |
341476 |
282516 |
204249 |
SPH16S |
Mock Community |
SPSC3PRO4 |
410780 |
303957 |
197507 |
Things to note:
In the “Raw FASTQ” column, the control has far fewer raw reads (good).
The “Cutadapt” column shows reads after SPH16S primer trimming. There are hundreds of thousands for the final five samples amplified with these primers (good). The first 13 samples were amplified with the MOL16S primers, but still have low levels of sequences matching the SPH16S primers (bad).
The “Read count” column is after applying the minimum abundance threshold (here 10). Two negative controls still have reads, lifting the threshold to 20 or more would fix this. These are Sphaerium simile in mock community
NFSC3PRO3
, and an unknown in river sampleBIM4M
.
So, using the MOL16S samples as negative controls suggests that for the SPH16S the default minimum abundance threshold is perhaps overly harsh - but using at least 20 would be wise.
MOL16S
We’ll initially looking at the same key columns in the sample report,
$ cut -f 1,2,4,7,9,14 summary/MOL16S.samples.onebp.tsv
<SEE TABLE BELOW>
Or, open MOL16S.samples.onebp.xlsx
in Excel. Focusing on the left hand
columns, you should see:
#Marker |
Group |
Library-name |
Raw FASTQ |
Cutadapt |
Accepted |
---|---|---|---|---|---|
MOL16S |
Aquarium |
BIR2M |
306311 |
297657 |
256386 |
MOL16S |
Aquarium |
BIR6M |
291954 |
286427 |
256470 |
MOL16S |
Control |
BIM8M |
2433 |
1014 |
551 |
MOL16S |
Mock Community |
SC3PRO1 |
689712 |
656661 |
550293 |
MOL16S |
Mock Community |
SC3PRO2 |
405048 |
377026 |
297877 |
MOL16S |
Mock Community |
SC3PRO3 |
402219 |
380347 |
304626 |
MOL16S |
Mock Community |
NFSC3PRO3 |
349590 |
328956 |
262963 |
MOL16S |
Mock Community |
SC3PRO4 |
671241 |
628644 |
494257 |
MOL16S |
Mock Community |
NFSC3PRO4 |
420015 |
364233 |
262726 |
MOL16S |
Mock Community |
SC3PRO5 |
480606 |
458896 |
383865 |
MOL16S |
River |
BIM6M |
821849 |
799349 |
703578 |
MOL16S |
River |
BIM2M |
1119271 |
954787 |
823782 |
MOL16S |
River |
BIM4M |
709472 |
367539 |
317363 |
SPH16S |
Aquarium |
BIR2S |
498926 |
25 |
0 |
SPH16S |
Aquarium |
BIR6S |
240360 |
27 |
0 |
SPH16S |
Mock Community |
SPSC3PRO1 |
425271 |
35 |
0 |
SPH16S |
Mock Community |
SPSC3PRO2 |
341476 |
168 |
27 |
SPH16S |
Mock Community |
SPSC3PRO4 |
410780 |
420 |
108 |
Looking at the same points, I see two problems:
The control sample BIM8M (SRR5534986) had almost a thousand unwanted MOL16S matches, reduced to 551 with a minimum abundance threshold of 10.
All the SPH16S mock community samples have unwanted MOS16S matches, the worst case being SPSC3PRO4 (SRR5534980) with over four hundred reads reduced to 108 with the minimum abundance threshold of 10.
To see exactly what is in these two problematic samples, we can turn to the
read report summary/MOL16S.reads.onebp.xlsx
in Excel, or the TSV version
at the command line (using grep to drop the rows ending with a zero count):
$ cut -f 1,2,3,7,10 summary/MOL16S.reads.onebp.tsv | grep -v "[[:space:]]0$"
# Marker MOL16S
# Group Control
# Sample Blank MOL16S
# Library-name BIM8M
# Raw FASTQ 2433
# Flash 1963
# Cutadapt 1014
# Threshold pool raw_data
# Threshold 10
# Control Sample
# Singletons 258
# Accepted 551
# Unique 4
#Marker MD5 onebp-predictions Total-abundance SRR5534986
MAX or TOTAL - - 4914872 478
MOL16S 20c0669e4c6f8436c9d42736df727c83 Sphaerium simile 152924 478
MOL16S e1d838b4f39bffe88d8c0e79b52700f1 Sphaerium simile 3215 13
MOL16S 778e3dace4b993135e11d450e6c559ff Sphaerium simile 249 11
MOL16S a36d3f7291c173c4243f22c2afbd111e Sphaerium simile 147 49
The unwanted sequences in the control sample are dominated by a single sequence (and three variants of it), which were matched to Sphaerium simile.
This is consistent with the original author’s analysis - although our pipeline has produced higher read counts:
Finally, our water blank sample had 71 reads, eight of those being singletons with the remaining belonging to Sphaerium striatinum (Table 9), likely due to amplicon contamination in the lab.
What about the other problematic sample? Again, you can find this in the Excel read report, or at the command line:
$ cut -f 1,2,7,25 summary/MOL16S.reads.onebp.tsv | grep -v "[[:space:]]0$"
# Marker SPH16S
# Group Mock Community
# Sample Mock Community 4 SPH16S
# Library-name SPSC3PRO4
# Raw FASTQ 410780
# Flash 375539
# Cutadapt 420
# Threshold pool raw_data
# Threshold 10
# Control Sample
# Singletons 272
# Accepted 108
# Unique 3
#Marker MD5 Total-abundance SRR5534980
MAX or TOTAL - 4914872 46
MOL16S ecdaa082b70f5e268f76128693531760 269109 45
MOL16S 98dc259e48de3e258cb93a34c38a9484 120026 17
MOL16S 20c0669e4c6f8436c9d42736df727c83 152924 46
The species names are too long to include in the above, listing them directly:
$ grep -E "(MD5|20c0669e4c6f8436c9d42736df727c83|ecdaa082b70f5e268f76128693531760|98dc259e48de3e258cb93a34c38a9484)" \
summary/MOL16S.reads.onebp.tsv | cut -f 2,3
<SEE TABLE BELOW>
Giving:
MD5 |
onebp-predictions |
---|---|
ecdaa082b70f5e268f76128693531760 |
Dreissena bugensis;Dreissena rostriformis |
98dc259e48de3e258cb93a34c38a9484 |
Dreissena polymorpha |
20c0669e4c6f8436c9d42736df727c83 |
Sphaerium simile |
The unwanted mock community sample content is split between Sphaerium and Dreissena, and suggest using a minimum threshold of perhaps 50 reads?
Minimum threshold
Clearly using a minimum abundance threshold of 10 is too low, and it should be increased to at perhaps 50 based on this. However, we have one exceptional sequence present at almost 500 copies. Setting the minimum that high seems excessive - but perhaps the THAPBI PICT default of 100 is more reasonable?