Presence and absence
This example includes mock communities which are a controlled setup where we know what the classifier ought ideally to report for every sample - and all their expected marker sequences are in the classification database.
The thapbi_pict assess
command run via example script run.sh
uses a
configuration file with all the mock community species for MOL16S, and the
three sphaeriid mussel species for SPH16S - regardless of the target copy
number in the mixture (see Klymus et al. (2017) Table 2), or
presence/absence of the fish block.
Of course, just as in the original author’s analysis, not everything we think was present is detected. And vice versa, we see some things which are not classified.
SPH16S
This was the more specific primer pair, expected to only amplify sphaeriid mussel species, so in general we expect less unique sequences than with the more general MOL16S primers.
Only three members of the mock community should match. Looking at the
summary/SPH16S.assess.onebp.tsv
output file in Excel or at the
command line, when run at a minimum abundance threshold of 10, these are the
key numbers:
$ cut -f 1-5,9,11 summary/SPH16S.assess.onebp.tsv
<SEE TABLE BELOW>
Or open this in Excel. You should find:
#Species |
TP |
FP |
FN |
TN |
F1 |
Ad-hoc-loss |
---|---|---|---|---|---|---|
OVERALL |
9 |
5 |
0 |
656 |
0.78 |
0.357 |
Pisidium compressum |
3 |
0 |
0 |
7 |
1.00 |
0.000 |
Sphaerium corneum |
3 |
0 |
0 |
7 |
1.00 |
0.000 |
Sphaerium nucleus |
0 |
3 |
0 |
7 |
0.00 |
1.000 |
Sphaerium simile |
3 |
1 |
0 |
6 |
0.86 |
0.250 |
Sphaerium striatinum |
0 |
1 |
0 |
9 |
0.00 |
1.000 |
OTHER 62 SPECIES IN DB |
0 |
0 |
0 |
620 |
0.00 |
0.000 |
No false negatives (but we have set the threshold very low), but 5 false positives: Three cases of Sphaerium nucleus, and one each of S. simile and S. striatinum.
The S. nucleus matches are simply down to an ambiguous sequence in the
database from both this and expected species S. corneum. See also the output
from thapbi_pict conflicts -d SPH16S.sqlite
which can report this.
The S. striatinum prediction came from SPSC3PRO1
aka SRR5534978
, and
is down to several sequences one base pair away the expected S. simile
reference, but also one base pair away from an S. striatinum database entry.
We already discussed the trace level of 10 reads for Sphaerium simile in
mock community sample NFSC3PRO3
using the SOL16S primers. As suggested,
raising the minimum abundance threshold to at least 20 reads would solve this,
but the other false positives here are limitations of the reference set.
MOL16S
Looking at the summary/MOL16S.assess.onebp.tsv
output file in
Excel or at the command line, when run at a minimum abundance threshold of 10,
these are the key numbers:
$ cut -f 1-5,9,11 summary/MOL16S.assess.onebp.tsv
<SEE TABLE BELOW>
Or open this in Excel. You should find:
#Species |
TP |
FP |
FN |
TN |
F1 |
Ad-hoc-loss |
---|---|---|---|---|---|---|
OVERALL |
74 |
23 |
3 |
1220 |
0.85 |
0.260 |
Cipangopaludina chinensis |
7 |
0 |
0 |
4 |
1.00 |
0.000 |
Corbicula fluminea |
0 |
1 |
0 |
10 |
0.00 |
1.000 |
Dreissena bugensis |
0 |
8 |
0 |
3 |
0.00 |
1.000 |
Dreissena polymorpha |
7 |
1 |
0 |
3 |
0.93 |
0.125 |
Dreissena rostriformis |
7 |
1 |
0 |
3 |
0.93 |
0.125 |
Gillia altilis |
7 |
0 |
0 |
4 |
1.00 |
0.000 |
Melanoides tuberculata |
7 |
0 |
0 |
4 |
1.00 |
0.000 |
Mytilopsis leucophaeata |
7 |
0 |
0 |
4 |
1.00 |
0.000 |
Pisidium compressum |
7 |
0 |
0 |
4 |
1.00 |
0.000 |
Potamopyrgus antipodarum |
7 |
0 |
0 |
4 |
1.00 |
0.000 |
Sander vitreus |
4 |
0 |
3 |
4 |
0.73 |
0.429 |
Sphaerium corneum |
7 |
1 |
0 |
3 |
0.93 |
0.125 |
Sphaerium nucleus |
0 |
8 |
0 |
3 |
0.00 |
1.000 |
Sphaerium simile |
7 |
2 |
0 |
2 |
0.88 |
0.222 |
Sphaerium striatinum |
0 |
1 |
0 |
10 |
0.00 |
1.000 |
OTHER 105 SPECIES IN DB |
0 |
0 |
0 |
1155 |
0.00 |
0.000 |
This time we do have false negatives - three of the seven samples are missing
Sander vitreus. Two of these are from Community 3 where this is intended to
be at only 14 copies, the third was SC3PRO2
aka SRR5534972
for Mock
Community 2 MOL16S with Fish Block Primer, with a target abundance of 72
copies. Here the fish block worked.
Again we have lots of false positives, mostly sister species which reflects limitations of the reference set.
The exception is Corbicula fluminea. Referring to the sample summary report
MOL16S.samples.onebp.xlsx
, this is from SC3PRO1
aka SRR5534973
, and
at low abundance. This species was present in the aquaria sample sediment, but
as discussed in the paper did not amplify from there - so cross-contamination
seem less likely.
Unknowns
Looking at SPH16S.samples.onebp.xlsx
and MOL16S.samples.onebp.xlsx
even our controls have unknown reads. To study these, next I’d look at the
edit-graphs.