Presence and absence

This example includes mock communities which are a controlled setup where we know what the classifier ought ideally to report for every sample - and all their expected marker sequences are in the classification database.

There are five different mock communities, made up with different numbers of individuals. Running an overall assessment on the pooled species assignments from all three markers we have both false positives, and false negatives:

$ cut -f 1-5,9,11 summary/pooled.assess.onebp.tsv
<SEE TABLE BELOW>

As a table:

#Species	TP	FP	FN	TN	F1	Ad-hoc-loss
OVERALL	85	21	3	1104331	0.88	0.220
Acizzia alternata	16	1	0	3	0.97	0.059
Acizzia solanicola	16	1	0	3	0.97	0.059
Bactericera cockerelli	10	0	2	8	0.91	0.167
Diuraphis noxia	11	3	1	5	0.85	0.267
Metopolophium dirhodum	16	0	0	4	1.00	0.000
Rhopalosiphum nymphaeae	0	16	0	4	0.00	1.000
Rhopalosiphum padi	16	0	0	4	1.00	0.000
OTHER 55215 SPECIES IN DB	0	0	0	1104300	0.00	0.000

Our 3 false negatives on the pooled results are Bactericera cockerelli (2 cases, 1000-Pool-1` and ``250-Pool-4 matching the authors, with 500-Pool-4 just passing at 62 reads), and Diuraphis noxia (1 case, 500-Pool-3). The authors also reported 500-Pool-3 missing D. noxia, here it just passes the threshold at 54 reads.

Most of the false positives are 16 cases of Rhopalosiphum nymphaeae, which is unfortunately indistinguishable from community member R. padi with the 12S marker (see the thapbi_pict conflicts ... output).

Likewise splitting Acizzia alternata and solanicola is not possible with 18S, but we still have unwanted Acizzia alternata and solanicola from 12S in 500 Pool 2, with 53 and 96 reads respectively. These counts are low enough to consider raising the abundance threshold(s) to exclude them. In both cases they match the dominant 12S sequences from the other controls and could be due to Illumina tag switching?

The remaining false positives are 3 cases of Diuraphis noxia via the 18S marker, and in one case via the COI marker.

The authors only reported false positives for Diuraphis noxia in one sample, 1000-Pool-1 for both COI and 18S (Figure 3), traced to an unwanted nymph specimen (Figure 4). We see that too, but have two other false positives:

$ cut -f 3,37,38 summary/pooled.samples.onebp.tsv
<SEE TABLE BELOW>

Note 576 unwanted reads in 1000-Pool-1 for Diuraphis noxia (consistent with the author’s analysis), but also 503 reads in 100-Pool-5 and 63 in 1000-Pool-5 with a fuzzy match for Diuraphis noxia and/or Metopolophium dirhodum:

sample_alias	Diuraphis noxia	Diuraphis noxia;Metopolophium dirhodum
100-Pool-1	0	0
250-Pool-1	0	0
500-Pool-1	0	0
1000-Pool-1	576	0
100-Pool-2	1630	0
250-Pool-2	1454	0
500-Pool-2	1660	0
1000-Pool-2	228	0
100-Pool-3	2705	126
250-Pool-3	4059	0
500-Pool-3	0	0
1000-Pool-3	94	0
100-Pool-4	6446	113
250-Pool-4	5701	0
500-Pool-4	742	0
1000-Pool-4	684	54
100-Pool-5	0	503
250-Pool-5	0	0
500-Pool-5	0	0
1000-Pool-5	0	63
Trap-1	25009	135
Trap-10	126570	53
Trap-2	106	0
Trap-3	272	0
Trap-4	1351	0
Trap-5	4235	0
Trap-6	16758	0
Trap-7	2733	0
Trap-8	99678	53
Trap-9	37358	0

Consulting the read report, these Diuraphis noxia false positives from 100-Pool-5 (503 copies) and 1000-Pool-5 (just 63 copies) are from the same 18S sequence:

>d153aa679f3c184a2790cd26aac9c784
CCGCATTAAGGTGAAACCGCGAAAGGCTCATTAAATCAGTTGTGGTTCCTTAGATCGTACCCAAGTTACTTGGATAACTG
TGGTAATTCTAGAGCTAATACATGCCGACAGAGTTCCGACCGTCGCGGCGCCCTCGGGCGTCGCGCGCGGGAGGAACGCT
TTTATTAGATCAAAACCGGCCCGTCGCGGCGCGCTTCGTGCGCGTCCCGATCGCGGCCCGCGCAAAGACCTGGTGACTCT
GAATAACTTCGAGCTGATCGCACGGTCTCCGTACCGGCGACGCATCTTTCAAAT

With over 500 copies in 100-Pool-5 this cannot be dismissed as a difference in noise filtering versus the authors’ original analysis. Querying this on NCBI BLAST confirms it to be 1bp away from multiple Diuraphis noxia accessions, and a Metopolophium dirhodum voucher sequence (as in the DB here), but also third species via a sequence labelled as Acyrthosiphon pisum. Rather than reporting multiple conflicting species, the author’s pipeline likely assigned a lower rank?

Visualisation

We can look at this visually by reproducing Figure 3 from the original paper. The authors provided their R based analysis, from which I have exported the numbers used to draw the figure (see figure3original.R) giving a simple tab-separated file (figure3original.tsv). Likewise the Python script figure3reproduction.py will produce an equivalent table using the output from THABPI PICT (figure3reproduction.tsv). Finally, Python script recreate_figure3.py uses MatPlotLib to reproduce an annotated recreation of the original.

Original analysis:

https://user-images.githubusercontent.com/63959/216340233-e9424600-f413-4ce7-b125-b3243fe8e956.png

This re-analysis:

https://user-images.githubusercontent.com/63959/216340220-3247d8d5-ed19-47a7-b568-687883ef4a01.png

In the original paper the false positives and false negatives were marked with pluses and minus in colour coded circles, and these were added by hand. Here this annotation is automated, but is less aesthetically pleasing. The false negatives get a cross, false positives are shown with an exclamation mark, and furthermore low abundance (under 5%) true positives get a tick. Again, these are all species coloured.

Overall this seems to show very good agreement with the published analysis.