High level overview

In short, all the samples have high coverage, much higher than in most of the examples we have used. Some of the samples yield over a million reads for the COI and 12S amplicons, which with the default fractional minimum abundance threshold of 0.1% (-f 0.001) would mean using over 1000 reads as the threshold. This was too stringent, so the worked example reduces this to 0.01% (with -f 0.0001), matching the author’s analysis, and lowers the absolute abundance threshold from the default of 100 to 50 (with -a 50).

Note that the rarest members of the mock communities are expected at 1 in 500 individuals (0.2%) or 1 in 1000 individuals (0.1%), which is at least ten times higher than the 0.01% fractional abundance threshold.
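As a rough check of this arithmetic, the sketch below converts the two fractional thresholds into read counts for a hypothetical sample with exactly one million reads for a marker, and compares the 1 in 500 and 1 in 1000 ratios with the -f 0.0001 threshold. It assumes, purely for illustration, that read counts are proportional to the number of individuals, which amplification bias may well break in practice. The function names are made up for this example and are not part of the tool.

```python
from fractions import Fraction


def reads_at_fraction(total_reads, fraction):
    """Reads corresponding to a fraction of the total, using exact arithmetic."""
    return total_reads * Fraction(fraction)


def margin_over_threshold(pool_size, fraction):
    """How many times larger 1/pool_size is than the fractional threshold."""
    return Fraction(1, pool_size) / Fraction(fraction)


total = 1_000_000  # hypothetical read count for one marker in one sample

for flag in ("0.001", "0.0001"):
    print(f"-f {flag}: threshold of {int(reads_at_fraction(total, flag))} reads")

for pool_size in (500, 1000):
    share = Fraction(1, pool_size)  # assumes reads proportional to individuals
    print(
        f"1 in {pool_size}: {float(share):.1%} of reads, "
        f"{float(margin_over_threshold(pool_size, '0.0001')):g}x the -f 0.0001 threshold"
    )
```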

Sequence yield

We’ll start by looking at the number of read-pairs found for each marker. After calling ./run.sh you should be able to inspect these report files at the command line or in Excel.

$ cut -f 3,6-8,10,12-14 summary/COI.samples.onebp.tsv
<SEE TABLE BELOW>

Or open the Excel version summary/COI.samples.onebp.xlsx, and focus on those early columns:

sample_alias	Raw FASTQ	Flash	Cutadapt	Threshold	Singletons	Accepted	Unique
100-Pool-1	478705	474621	109233	50	11074	86402	178
250-Pool-1	1845819	1829913	157310	50	23119	118383	251
500-Pool-1	647776	643030	51092	50	6446	36718	127
1000-Pool-1	855997	848914	66002	50	7967	49058	149
100-Pool-2	737998	732014	432826	50	29168	368249	418
250-Pool-2	2037475	2022814	1250718	126	85718	1042562	482
500-Pool-2	1908370	1895715	1231908	124	59702	1042441	442
1000-Pool-2	1068715	1060596	584017	59	33060	498955	445
100-Pool-3	950692	940342	249422	50	24964	189156	371
250-Pool-3	1631700	1615113	274422	50	39974	192944	562
500-Pool-3	923807	916621	358429	50	32221	284819	567
1000-Pool-3	1773647	1758637	468361	50	42263	374487	733
100-Pool-4	634017	628523	117499	50	14596	74799	175
250-Pool-4	2501145	2480381	441558	50	61904	324512	707
500-Pool-4	572779	568565	144488	50	18279	96537	306
1000-Pool-4	1198812	1189853	294607	50	30130	220678	470
100-Pool-5	1817929	1800594	434739	50	45224	329015	660
250-Pool-5	1632786	1617219	440995	50	58159	328842	729
500-Pool-5	807060	801471	321428	50	30944	247519	484
1000-Pool-5	1423279	1411512	332286	50	32751	255309	584
Trap-1	1759819	1719671	110882	50	19740	73024	251
Trap-10	2445993	2420303	308371	50	58670	204842	480
Trap-2	1127739	1107970	110856	50	24385	55757	92
Trap-3	2422054	2366037	161686	50	30631	110043	268
Trap-4	742893	732907	63107	50	11933	35225	77
Trap-5	3437292	3346620	346696	50	71464	208989	542
Trap-6	697389	689125	91284	50	17153	57037	149
Trap-7	2853448	2820200	223330	50	31011	169121	319
Trap-8	2196646	2161966	146646	50	28814	92632	220
Trap-9	2065455	2049024	70591	50	14636	40131	109

The marker-specific tables show the threshold applied was usually 50, the absolute value set via -a 50 at the command line. Occasionally it was raised to 0.01% of the sequences matching the primers for this marker, as set via -f 0.0001 at the command line. For example, 250-Pool-2 had 1250718 reads after Cutadapt, giving a threshold of 126.
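The Threshold column can be reproduced from the Cutadapt column with a few lines of Python. This is a minimal sketch, assuming the tool takes the larger of the absolute threshold and the fractional threshold rounded up to a whole read. That rule agrees with the table values but is inferred from them rather than taken from the tool's source, and effective_threshold is a hypothetical helper name.

```python
import math
from fractions import Fraction


def effective_threshold(marker_reads, absolute=50, fraction="0.0001"):
    """Larger of the absolute threshold and the rounded-up fractional one.

    The rounding rule is an assumption inferred from the table values.
    """
    fractional = math.ceil(marker_reads * Fraction(fraction))
    return max(absolute, fractional)


# Cutadapt read counts for COI, copied from the table above
cutadapt_reads = {
    "100-Pool-2": 432826,
    "250-Pool-2": 1250718,
    "500-Pool-2": 1231908,
    "1000-Pool-2": 584017,
}

for sample, reads in cutadapt_reads.items():
    print(sample, reads, effective_threshold(reads), sep="\t")
```

Only samples with more than 500000 reads for a marker exceed the absolute threshold of 50 under this rule, which is why most rows show 50.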

The numbers are similar for the 12S and 18S markers. Pooling all three markers gives:

$ cut -f 3,6,7,13,14 summary/pooled.samples.onebp.tsv
<SEE TABLE BELOW>

Alternatively, open the Excel file summary/pooled.samples.onebp.xlsx and focus on those early columns:

sample_alias	Raw FASTQ	Flash	Accepted	Unique
100-Pool-1	478705	474621	371045	703
250-Pool-1	1845819	1829913	1508292	689
500-Pool-1	647776	643030	522396	800
1000-Pool-1	855997	848914	692639	950
100-Pool-2	737998	732014	587902	886
250-Pool-2	2037475	2022814	1243165	837
500-Pool-2	1908370	1895715	1551757	1142
1000-Pool-2	1068715	1060596	863574	1024
100-Pool-3	950692	940342	684297	1479
250-Pool-3	1631700	1615113	1158575	1241
500-Pool-3	923807	916621	697552	1457
1000-Pool-3	1773647	1758637	1366298	1993
100-Pool-4	634017	628523	451801	879
250-Pool-4	2501145	2480381	1867605	1171
500-Pool-4	572779	568565	416456	925
1000-Pool-4	1198812	1189853	918004	1660
100-Pool-5	1817929	1800594	1369274	1918
250-Pool-5	1632786	1617219	1128901	1475
500-Pool-5	807060	801471	603390	1276
1000-Pool-5	1423279	1411512	1104412	1716
Trap-1	1759819	1719671	392775	919
Trap-10	2445993	2420303	492325	1079
Trap-2	1127739	1107970	129956	273
Trap-3	2422054	2366037	427533	953
Trap-4	742893	732907	232800	403
Trap-5	3437292	3346620	486282	1177
Trap-6	697389	689125	80003	170
Trap-7	2853448	2820200	1158684	842
Trap-8	2196646	2161966	683669	1024
Trap-9	2065455	2049024	1352408	689

The “Accepted” column is the number of reads matching the primer pairs and passing our abundance thresholds. As a fraction of the raw read-pairs, this varies from 61% to 82% for the mock community samples, but is considerably lower for the environmental traps, varying from 11% to 65%. Much of what was rejected from the traps would be noise and trace-level environmental DNA.
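The percentages quoted here appear to be the Accepted count divided by the Raw FASTQ count. The following sketch recomputes that ratio for four rows copied from the pooled table, chosen because they sit at the extremes of the two ranges. The helper name accepted_fractions is hypothetical, and the column headers are assumed to match those shown in the table.

```python
import csv
import io

# Four rows copied from the pooled table above (tab separated).
POOLED_TSV = (
    "sample_alias\tRaw FASTQ\tFlash\tAccepted\tUnique\n"
    "250-Pool-1\t1845819\t1829913\t1508292\t689\n"
    "250-Pool-2\t2037475\t2022814\t1243165\t837\n"
    "Trap-6\t697389\t689125\t80003\t170\n"
    "Trap-9\t2065455\t2049024\t1352408\t689\n"
)


def accepted_fractions(handle):
    """Map each sample to Accepted / Raw FASTQ from a tab separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["sample_alias"]: int(row["Accepted"]) / int(row["Raw FASTQ"])
        for row in reader
    }


fractions = accepted_fractions(io.StringIO(POOLED_TSV))
for sample, value in fractions.items():
    print(f"{sample}\t{value:.1%}")
```

The same function should work on the output of the cut command above once saved to a file, provided the header line matches the one shown.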

The “Unique” column is the number of accepted unique sequences. For the mock communities this should be at most 18, from at most six species in each pool and three markers. The observed counts are much higher, so we might want to denoise and/or raise the abundance threshold. Lowering the threshold further does raise the false positive rate inferred from the mock communities.
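To put the excess in perspective, this short sketch compares a few observed Unique counts from the pooled table with the ideal maximum. It assumes exactly one sequence per species per marker, ignoring any genuine within-species variation, and max_expected_unique is a hypothetical helper name.

```python
def max_expected_unique(species, markers):
    """Ideal ceiling on unique sequences: one per species per marker."""
    return species * markers


# Unique counts copied from the pooled table above
observed_unique = {"100-Pool-1": 703, "250-Pool-1": 689, "1000-Pool-3": 1993}

limit = max_expected_unique(species=6, markers=3)
for sample, unique in observed_unique.items():
    print(f"{sample}: {unique} unique, about {unique / limit:.0f}x the ideal maximum of {limit}")
```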