Minimum Abundance Threshold

THAPBI PICT has a default minimum absolute abundance threshold of 100 reads per marker per sample, and 0.1% of the reads per marker per sample, before accepting any unique sequence. Background contamination and PCR noise levels will vary, so having multiple Abundance & Negative Controls will help set this objectively.

In this dataset there is a single negative control for the MOL16S marker, library BIM8M aka SRR5534986. However, we can also treat all the SPH16S libraries as negative controls for the MOL16S marker, and vice versa. You could do this automatically within THAPBI PICT via the -n or --negctrls command line option, but as we shall see in this example it will discard most of the data.

In order to examine an appropriate minimum abundance threshold, the run.sh script provided uses -a 10 -f 0 to accept any unique sequence seen in sample at least ten times (regardless the fraction of the sample read total). This does allow unwanted noise though to the reports.

SPH16S

This was the more specific primer pair, expected to only amplify sphaeriid mussel species, so in general we expect less unique sequences than with the more general MOL16S primers.

Looking at some key columns in the sample report,

$ cut -f 1,2,4,7,9,14 summary/SPH16S.samples.onebp.tsv
<SEE TABLE BELOW>

Or, open SPH16S.samples.onebp.xlsx in Excel. Focusing on the left hand columns, you should see:

#Marker	Group	Library-name	Raw FASTQ	Cutadapt	Accepted
MOL16S	Aquarium	BIR2M	306311	2	0
MOL16S	Aquarium	BIR6M	291954	14	0
MOL16S	Control	BIM8M	2433	0	0
MOL16S	Mock Community	SC3PRO1	689712	17	0
MOL16S	Mock Community	SC3PRO2	405048	0	0
MOL16S	Mock Community	SC3PRO3	402219	16	0
MOL16S	Mock Community	NFSC3PRO3	349590	33	10
MOL16S	Mock Community	SC3PRO4	671241	6	0
MOL16S	Mock Community	NFSC3PRO4	420015	7	0
MOL16S	Mock Community	SC3PRO5	480606	13	0
MOL16S	River	BIM6M	821849	0	0
MOL16S	River	BIM2M	1119271	0	0
MOL16S	River	BIM4M	709472	40	19
SPH16S	Aquarium	BIR2S	498926	251148	209402
SPH16S	Aquarium	BIR6S	240360	226012	191456
SPH16S	Mock Community	SPSC3PRO1	425271	317960	224689
SPH16S	Mock Community	SPSC3PRO2	341476	282516	204249
SPH16S	Mock Community	SPSC3PRO4	410780	303957	197507

Things to note:

In the “Raw FASTQ” column, the control has far fewer raw reads (good).
The “Cutadapt” column shows reads after SPH16S primer trimming. There are hundreds of thousands for the final five samples amplified with these primers (good). The first 13 samples were amplified with the MOL16S primers, but still have low levels of sequences matching the SPH16S primers (bad).
The “Read count” column is after applying the minimum abundance threshold (here 10). Two negative controls still have reads, lifting the threshold to 20 or more would fix this. These are Sphaerium simile in mock community NFSC3PRO3, and an unknown in river sample BIM4M.

So, using the MOL16S samples as negative controls suggests that for the SPH16S the default minimum abundance threshold is perhaps overly harsh - but using at least 20 would be wise.

MOL16S

We’ll initially looking at the same key columns in the sample report,

$ cut -f 1,2,4,7,9,14 summary/MOL16S.samples.onebp.tsv
<SEE TABLE BELOW>

Or, open MOL16S.samples.onebp.xlsx in Excel. Focusing on the left hand columns, you should see:

#Marker	Group	Library-name	Raw FASTQ	Cutadapt	Accepted
MOL16S	Aquarium	BIR2M	306311	297657	256386
MOL16S	Aquarium	BIR6M	291954	286427	256470
MOL16S	Control	BIM8M	2433	1014	551
MOL16S	Mock Community	SC3PRO1	689712	656661	550293
MOL16S	Mock Community	SC3PRO2	405048	377026	297877
MOL16S	Mock Community	SC3PRO3	402219	380347	304626
MOL16S	Mock Community	NFSC3PRO3	349590	328956	262963
MOL16S	Mock Community	SC3PRO4	671241	628644	494257
MOL16S	Mock Community	NFSC3PRO4	420015	364233	262726
MOL16S	Mock Community	SC3PRO5	480606	458896	383865
MOL16S	River	BIM6M	821849	799349	703578
MOL16S	River	BIM2M	1119271	954787	823782
MOL16S	River	BIM4M	709472	367539	317363
SPH16S	Aquarium	BIR2S	498926	25	0
SPH16S	Aquarium	BIR6S	240360	27	0
SPH16S	Mock Community	SPSC3PRO1	425271	35	0
SPH16S	Mock Community	SPSC3PRO2	341476	168	27
SPH16S	Mock Community	SPSC3PRO4	410780	420	108

Looking at the same points, I see two problems:

The control sample BIM8M (SRR5534986) had almost a thousand unwanted MOL16S matches, reduced to 551 with a minimum abundance threshold of 10.
All the SPH16S mock community samples have unwanted MOS16S matches, the worst case being SPSC3PRO4 (SRR5534980) with over four hundred reads reduced to 108 with the minimum abundance threshold of 10.

To see exactly what is in these two problematic samples, we can turn to the read report summary/MOL16S.reads.onebp.xlsx in Excel, or the TSV version at the command line (using grep to drop the rows ending with a zero count):

$ cut -f 1,2,3,7,10 summary/MOL16S.reads.onebp.tsv | grep -v "[[:space:]]0$"
#                                                                  Marker           MOL16S
#                                                                  Group            Control
#                                                                  Sample           Blank MOL16S
#                                                                  Library-name     BIM8M
#                                                                  Raw FASTQ        2433
#                                                                  Flash            1963
#                                                                  Cutadapt         1014
#                                                                  Threshold pool   raw_data
#                                                                  Threshold        10
#                                                                  Control          Sample
#                                                                  Singletons       258
#                                                                  Accepted         551
#                                                                  Unique           4
#Marker       MD5                               onebp-predictions  Total-abundance  SRR5534986
MAX or TOTAL  -                                 -                  4914872          478
MOL16S        20c0669e4c6f8436c9d42736df727c83  Sphaerium simile   152924           478
MOL16S        e1d838b4f39bffe88d8c0e79b52700f1  Sphaerium simile   3215             13
MOL16S        778e3dace4b993135e11d450e6c559ff  Sphaerium simile   249              11
MOL16S        a36d3f7291c173c4243f22c2afbd111e  Sphaerium simile   147              49

The unwanted sequences in the control sample are dominated by a single sequence (and three variants of it), which were matched to Sphaerium simile.

This is consistent with the original author’s analysis - although our pipeline has produced higher read counts:

Finally, our water blank sample had 71 reads, eight of those being singletons with the remaining belonging to Sphaerium striatinum (Table 9), likely due to amplicon contamination in the lab.

What about the other problematic sample? Again, you can find this in the Excel read report, or at the command line:

$ cut -f 1,2,7,25 summary/MOL16S.reads.onebp.tsv | grep -v "[[:space:]]0$"
#                                               Marker           SPH16S
#                                               Group            Mock Community
#                                               Sample           Mock Community 4 SPH16S
#                                               Library-name     SPSC3PRO4
#                                               Raw FASTQ        410780
#                                               Flash            375539
#                                               Cutadapt         420
#                                               Threshold pool   raw_data
#                                               Threshold        10
#                                               Control          Sample
#                                               Singletons       272
#                                               Accepted         108
#                                               Unique           3
#Marker       MD5                               Total-abundance  SRR5534980
MAX or TOTAL  -                                 4914872          46
MOL16S        ecdaa082b70f5e268f76128693531760  269109           45
MOL16S        98dc259e48de3e258cb93a34c38a9484  120026           17
MOL16S        20c0669e4c6f8436c9d42736df727c83  152924           46

The species names are too long to include in the above, listing them directly:

$ grep -E "(MD5|20c0669e4c6f8436c9d42736df727c83|ecdaa082b70f5e268f76128693531760|98dc259e48de3e258cb93a34c38a9484)" \
  summary/MOL16S.reads.onebp.tsv | cut -f 2,3
<SEE TABLE BELOW>

Giving:

MD5	onebp-predictions
ecdaa082b70f5e268f76128693531760	Dreissena bugensis;Dreissena rostriformis
98dc259e48de3e258cb93a34c38a9484	Dreissena polymorpha
20c0669e4c6f8436c9d42736df727c83	Sphaerium simile

The unwanted mock community sample content is split between Sphaerium and Dreissena, and suggest using a minimum threshold of perhaps 50 reads?

Minimum threshold

Clearly using a minimum abundance threshold of 10 is too low, and it should be increased to at perhaps 50 based on this. However, we have one exceptional sequence present at almost 500 copies. Setting the minimum that high seems excessive - but perhaps the THAPBI PICT default of 100 is more reasonable?