Presence and absence
Controls
Quoting the Muri et al. (2020) paper:
A low-frequency noise threshold of 0.001 (0.1%) was applied across the dataset to reduce the probability of false positives arising from cross-contamination or tag-jumping (De Barba et al. 2014; Hänfling et al. 2016). Based on the level of contamination found in sampling/filtration blanks and PCR negatives, a second arbitrary threshold was applied and all records occurring with less than 50 reads assigned were removed.
To match the paper, this example uses -a 50
for an absolute threshold of
50, and -f 0.001
for a 0.1% sample specific factional threshold.
At this threshold, the 4 cichlid “positive” samples, 6 PCR “negative”, and 8 “blank” controls are perfect - as far as the fish go. We do see unexpected human and chicken reads in the PCR negatives, and also ducks, cattle and pigs in the field “blanks”:
$ grep -E "(^#|positive|negative|blank)" summary/drained_ponds.12S.samples.onebp.tsv | cut -f 5,10-11,16,19
<SEE TABLE BELOW>
Or, filter/search summary/drained_ponds.12S.samples.onebp.tsv
in Excel:
control |
Sequencing sample |
Classification summary |
Threshold |
Accepted |
---|---|---|---|---|
blank |
SRR11949861 |
50 |
0 |
|
blank |
SRR11949885 |
50 |
0 |
|
blank |
SRR11949884 |
(Off-target) Homo sapiens, (Off-target) Sus scrofa |
50 |
544 |
blank |
SRR11949883 |
(Off-target) Bos taurus, (Off-target) Homo sapiens, (Off-target) Sus scrofa |
50 |
1629 |
blank |
SRR11949882 |
(Off-target) Anatidae (waterfowl) |
50 |
61 |
blank |
SRR11949881 |
(Off-target) Homo sapiens |
50 |
56 |
blank |
SRR11949880 |
(Off-target) Anatidae (waterfowl), (Off-target) Homo sapiens |
50 |
436 |
blank |
SRR11949834 |
(Off-target) Homo sapiens |
50 |
175 |
negative |
SRR11949908 |
50 |
0 |
|
negative |
SRR11949907 |
(Off-target) Gallus gallus, (Off-target) Homo sapiens |
50 |
606 |
negative |
SRR11949851 |
50 |
0 |
|
negative |
SRR11949850 |
50 |
0 |
|
negative |
SRR11949838 |
(Off-target) Homo sapiens |
50 |
71 |
negative |
SRR11949837 |
(Off-target) Homo sapiens |
50 |
356 |
positive |
SRR11949836 |
Astatotilapia calliptera(*), Maylandia zebra(*) |
50 |
39748 |
positive |
SRR11949835 |
Astatotilapia calliptera(*), Maylandia zebra(*) |
50 |
39244 |
positive |
SRR11949906 |
Astatotilapia calliptera(*), Maylandia zebra(*) |
65 |
62249 |
positive |
SRR11949849 |
Astatotilapia calliptera(*), Maylandia zebra(*) |
50 |
24566 |
Only in one sample (SRR11949906
, a positive control) was the percentage
based abundance threshold stricter than the absolute threshold (65 not 50),
and it still gives the highest number of reads.
Note that the positive samples only yield a single unique sequence (MD5
checksum 17dbc1c331d17cd075aabd6f710a039b
) which matches both the cichlid
control species Astatotilapia calliptera and Maylandia zebra.
High level overview
Looking over summary/drained_ponds.12S.samples.onebp.xlsx
in Excel, or the
TSV equivalent of the sample report, there are some general trends visible.
First, as noted above the controls are extremely clean with just the expected cichlid species for the positive controls, and a few off-target matches in the PCR negatives and field blanks as noted above.
Next, both the Middle Lake Sterivex Ethanol Buffer, and Middle Lake Sterivex
RNAlater Buffer are very clean. There are traces of human and waterfowl, and
but only three of these buffer samples shows any fish (eg SRR11949911
aka
5RNB
). In contrast, most of the Middle Lake Sterivex Longmire Buffer
samples do give fish reads.
Controls and buffers aside, all the field samples gave Anatidae (waterfowl) matches, most had human. There were traces of other birds and mammals such as pig and dog - with most of the new lake samples showing sheep (Ovis).
As to the fish, we see strong signal in most samples for Abramis brama, Carassius carassius, Cyprinus carpio, Rutilus rutilus and Tinca tinca.
Expected Fish
This paper was selected as an example because something is known about the expected content of all the biological samples - the lakes were drained and all the fish identified, counted and weighed. However, we cannot expect all the species present to have left DNA at all the sampling points within their lake, but that is a useful approximation for assessing the classifier:
$ cut -f 1-5,9,11 summary/drained_ponds.12S.assess.onebp.tsv
<SEE TABLE BELOW>
You might prefer to open this in Excel:
#Species |
TP |
FP |
FN |
TN |
F1 |
Ad-hoc-loss |
---|---|---|---|---|---|---|
OVERALL |
433 |
388 |
331 |
5877 |
0.55 |
0.624 |
(Off-target) Anatidae (waterfowl) |
0 |
70 |
0 |
29 |
0.00 |
1.000 |
(Off-target) Apodemus |
0 |
4 |
0 |
95 |
0.00 |
1.000 |
(Off-target) Ardea cinerea |
0 |
11 |
0 |
88 |
0.00 |
1.000 |
(Off-target) Bos taurus |
0 |
3 |
0 |
96 |
0.00 |
1.000 |
(Off-target) Canis lupus familiaris |
0 |
7 |
0 |
92 |
0.00 |
1.000 |
(Off-target) Capra hircus |
0 |
1 |
0 |
98 |
0.00 |
1.000 |
(Off-target) Columba |
0 |
47 |
0 |
52 |
0.00 |
1.000 |
(Off-target) Gallinula chloropus |
0 |
50 |
0 |
49 |
0.00 |
1.000 |
(Off-target) Gallus gallus |
0 |
13 |
0 |
86 |
0.00 |
1.000 |
(Off-target) Homo sapiens |
0 |
83 |
0 |
16 |
0.00 |
1.000 |
(Off-target) Ovis aries |
0 |
17 |
0 |
82 |
0.00 |
1.000 |
(Off-target) Ovis dalli |
0 |
1 |
0 |
98 |
0.00 |
1.000 |
(Off-target) Phalacrocorax carbo |
0 |
25 |
0 |
74 |
0.00 |
1.000 |
(Off-target) Sturnus |
0 |
3 |
0 |
96 |
0.00 |
1.000 |
(Off-target) Sus scrofa |
0 |
16 |
0 |
83 |
0.00 |
1.000 |
(Off-target) Turdus |
0 |
7 |
0 |
92 |
0.00 |
1.000 |
Abramis brama |
65 |
0 |
16 |
18 |
0.89 |
0.198 |
Acipenser spp. |
0 |
0 |
9 |
90 |
0.00 |
1.000 |
Alburnus mossulensis |
0 |
1 |
0 |
98 |
0.00 |
1.000 |
Astatotilapia calliptera |
4 |
0 |
0 |
95 |
1.00 |
0.000 |
Barbus barbus |
46 |
0 |
35 |
18 |
0.72 |
0.432 |
Carassius carassius |
64 |
0 |
17 |
18 |
0.88 |
0.210 |
Ctenopharyngodon idella |
3 |
15 |
6 |
75 |
0.22 |
0.875 |
Cyprinus carpio |
61 |
0 |
20 |
18 |
0.86 |
0.247 |
Maylandia zebra |
4 |
0 |
0 |
95 |
1.00 |
0.000 |
Perca fluviatilis |
40 |
0 |
41 |
18 |
0.66 |
0.506 |
Pseudorasbora parva |
0 |
2 |
0 |
97 |
0.00 |
1.000 |
Rutilus rutilus |
63 |
0 |
18 |
18 |
0.88 |
0.222 |
Scardinius erythrophthalmus |
6 |
0 |
75 |
18 |
0.14 |
0.926 |
Silurus glanis |
9 |
0 |
0 |
90 |
1.00 |
0.000 |
Spinibarbus denticulatus |
0 |
11 |
0 |
88 |
0.00 |
1.000 |
Squalidus gracilis |
0 |
1 |
0 |
98 |
0.00 |
1.000 |
Squalius cephalus |
6 |
0 |
75 |
18 |
0.14 |
0.926 |
Tinca tinca |
62 |
0 |
19 |
18 |
0.87 |
0.235 |
OTHER 37 SPECIES IN DB |
0 |
0 |
0 |
3663 |
0.00 |
0.000 |
False positives
We touched on the assorted “false positives” from the off-target 12S PCR amplification above. What is more interesting is the fish false positives. Let’s look at these starting with the most false positives.
Ctenopharyngodon idella
First, many middle lake samples unexpectedly have Ctenopharyngodon idella
(this is expected in the new lake samples). Why? They all stem from sequence
285edce3d193c92b1959e60bc130b518
which was matched to both C. idella
and Tinca tinca (expected in both lakes):
>285edce3d193c92b1959e60bc130b518
ACTATGCTCAGCCATAAACCTAGACATCCACCTACAATTAAACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCA
AAGGACCTGACGGTGCCTTAGACCCCC
This is both a one base pair edit away from AY897013.1 etc as C. idella, and from AB218686.1 etc as T. tinca. Reviewing the NCBI BLAST matches both sets of species are supported from multiple complete mitochondrion genomes and a range of research groups. In the context of this experiment, we could infer for the four middle lake samples this sequence was T. tinca.
Spinibarbus denticulatus
Next, we see 16 samples with unexpected cyprinid fish Spinibarbus
denticulatus. Referring to the read report, all are from a single sequence
4c53f6ed1ecdad3af2299999ec83d756
which has been matched perfectly to both
this unexpected species and expected species Carassius carassius:
>4c53f6ed1ecdad3af2299999ec83d756
ACTATGCTCAGCCGTAAACTTAGACATCCTACTACAATAGATGTCCGCCAGGGTACTACGAGCATTAGCTTAAAACCCAA
AGGACCTGACGGTGTCTCAGACCCCC
Given the actual fish in these lakes have been taxonomically identified, we can safely dismiss this - and perhaps drop AP013335.1 S. denticulatus from the ad-hoc DB?
A similar choice was made in compiling the ad hoc database, dropping all the Sander sp. entries for the following sequence in favour of just Perca fluviatilis as the sole expected Percidae:
>7e88b1bdeff6b6a361cc2175f4f630fd
ACTATGCCTAGCCATAAACATTGGTAGCACACTACACCCACTACCCGCCTGGGAACTACGAGCATCAGCTTGAAACCCAA
AGGACTTGGCGGTGCTTTAGATCCAC
This was based on the authors’ choice:
All fish OTUs were identified to species level with the exceptions of records matching the family Percidae. Percidae records were manually assigned to P. fluviatilis as this was the only species of the family identified in the study area during fish relocation.
Pseudorasbora parva
We see two samples containing Pseudorasbora parva, the invasive species which prompted these fish ponds to be drained as a control measure. You can find this in the read report, at the command line:
$ grep -E "(Pseudorasbora parva|samples|predictions)" \
summary/drained_ponds.12S.reads.onebp.tsv | cut -f 2,3,7,48,59
samples 2LMB 3LMF
MD5 onebp-predictions Total-abundance SRR11949854 SRR11949925
e819f3c222d6493572534fb6a5b7cda7 Pseudorasbora parva 520 323 197
Specifically we saw 323 reads in SRR11949854
aka 2LMB
and 197 reads in
SRR11949925
aka 3LMF
- both middle lake Sterivex (STX) samples.
Quoting the paper:
P. parva reads found in two Middle Lake-STX samples (279 and 148 reads) were also excluded from further analyses as after eradication this species was not physically present at the site surveyed.
The exact counts differ, but referring to the paper’s supplementary data the sample names match.
Other Fish
We also see one false positive for each of the two fish species Alburnus mossulensis, and Squalidus gracilis:
$ grep -E "(Alburnus mossulensis|samples|predictions)" \
summary/drained_ponds.12S.reads.onebp.tsv | cut -f 2,3,7,25
samples M3-MF2
MD5 onebp-predictions Total-abundance SRR11949859
916da937dccfd5d29502e83713e5d998 Abramis brama;Alburnus mossulensis 98 98
This sequence is ambiguous with equally good matches to expected species Abramis brama. Again, we might remove Alburnus mossulensis from the DB?
$ grep -E "(Squalidus gracilis|samples|predictions)" \
summary/drained_ponds.12S.reads.onebp.tsv | cut -f 2,3,7,20
samples M3-4F2
MD5 onebp-predictions Total-abundance SRR11949871
c0d532d1c6f8ffff9c72ac4a1873151c Squalidus gracilis 82 82
This sequence match is with AP011393.1 in the provided reference set.
False negatives
The classifier assessment shown above expected all the fish in each lake to be found at all the sites within that lake - an overly strong assertion which could explain many of the reported false negatives.
However, there is one clear false negative - neither this nor the original analysis found any Acipenser spp.
True positives
Rather than reviewing all of the true positives, I will note that in some cases we found more reads and thus declared a result in more samples. For example, we report Barbus barbus in 49 samples, versus:
In addition, Barbus barbus was detected at two sites (202 reads), …
We found Scardinius erythrophthalmus in six samples:
$ grep -E "(Scardinius erythrophthalmus|samples|predictions)" \
summary/drained_ponds.12S.reads.onebp.tsv | cut -f 7,8,12,13,83,84,85
samples M3-1F1 M3-5F1 M3-6F1 7RNF 8RNF MRNF
Total-abundance SRR11949879 SRR11949870 SRR11949868 SRR11949893 SRR11949886 SRR11949852
761 156 120 147 136 76 126
Quoting the original paper:
The presence of Scardinius erythrophthalmus was found at two sites with a low number of reads (38 and 25 reads) and, therefore, removed after applying the filter threshold
In these cases at least, we are seeing much higher read counts. Given the supplementary data provided, it could be possible to plot the read counts from the two methods against each other.
Conclusion
While not in-depth, this hopefully demonstrates the THAPBI PICT could be meaningfully applied to this 12S dataset which was originally analysed with metaBEAT v0.97.11.