Assessing the classifier

This sample dataset includes two positive control mock communities. We know the species which went into the two different DNA mixes used, so for each sequenced positive control sample we can compare the expected list of species with the predicted list of species, and thus count true positives, false positives, false negatives, etc.

We will first do this by hand, and then explore the tool’s own built in assessment framework.

Counting species for one sample by hand

The woody hosts dataset had two positive control mixes. From the first plate, a set of 15 Phytophthora species (listed here alphabetically):

Phytophthora austrocedri
Phytophthora boehmeriae
Phytophthora cactorum
Phytophthora cambivora (now Phytophthora x cambivora)
Phytophthora chlamydospora
Phytophthora cinnamomi
Phytophthora gonapodyides
Phytophthora ilicis
Phytophthora kernoviae
Phytophthora lateralis
Phytophthora obscura
Phytophthora plurivora
Phytophthora pseudosyringae
Phytophthora ramorum
Phytophthora syringae

Quoting from the sample summary report, using the default settings for classification of DNA15MIX, we got:

$ grep DNA15MIX summary/thapbi-pict.ITS1.samples.onebp.tsv | cut -f 2
Phytophthora aleatoria(*), Phytophthora alpina(*), Phytophthora austrocedri, Phytophthora cactorum(*), Phytophthora gonapodyides, Phytophthora ilicis, Phytophthora kernoviae, Phytophthora obscura, Phytophthora pseudosyringae, Phytophthora ramorum

Or, as a list:

Phytophthora aleatoria (uncertain/ambiguous)
Phytophthora alpina (uncertain/ambiguous)
Phytophthora austrocedri
Phytophthora cactorum (uncertain/ambiguous)
Phytophthora gonapodyides
Phytophthora ilicis
Phytophthora kernoviae
Phytophthora obscura
Phytophthora pseudosyringae
Phytophthora ramorum

The good news is that eight are correct classifications (eight true positives, 8 TP), but two false positives (2 FP). Those false positives Phytophthora alpina and P. aleatoria are indistinguishable from P. cactorum, a problem flagged via the conflicts command:

$ thapbi_pict conflicts | grep cactorum
f27df8e8755049e831b1ea4521ad6eb3  species  Phytophthora aleatoria;Phytophthora alpina;Phytophthora cactorum
$ grep f27df8e8755049e831b1ea4521ad6eb3 intermediate/ITS1/DNA15MIX.fasta
>f27df8e8755049e831b1ea4521ad6eb3_981

The bad news is we are missing seven expected species (seven false negatives, 7 FN):

Phytophthora boehmeriae
Phytophthora chlamydospora
Phytophthora cinnamomi
Phytophthora lateralis
Phytophthora plurivora
Phytophthora syringae
Phytophthora x cambivora

We will return to interpretation after showing how to get the tool to compute these FP, FP and FN values.

The positive controls from the second plate had a different mix of ten Phytophthora species, again listed alphabetically:

Phytophthora boehmeriae
Phytophthora cactorum
Phytophthora capsici
Phytophthora castaneae
Phytophthora fallax
Phytophthora foliorum
Phytophthora obscura
Phytophthora plurivora
Phytophthora rubi
Phytophthora siskiyouensis

Again referring to the sample summary report from running with default settings, for DNA10MIX_undiluted and DNA10MIX_diluted25x we got:

Phytophthora agathidicida (uncertain/ambiguous)
Phytophthora capsici
Phytophthora castaneae (uncertain/ambiguous)
Phytophthora fallax
Phytophthora foliorum
Phytophthora gloveri (uncertain/ambiguous)
Phytophthora obscura
Phytophthora plurivora
Phytophthora rubi
Phytophthora siskiyouensis

Plus the results from DNA10MIX_bycopynumber were almost the same - but this time there wasn’t a sequence only matched to P. capsici, so that was also flagged as “(uncertain/ambiguous)”.

Leaving aside the ambiguous qualifier, there are ten species predictions, but only nine are correct (9 TP: P. capsici, P. castaneae, P. fallax, P. foliorum, P. obscura, P. plurivora, P. rubi, P. siskiyouensis), with two wrong guesses (2 FP: P. agathidicida and P. gloveri), and two missing predictions (2 FN: P. boehmeriae and P. cactorum).

The uncertain/ambiguous prediction of Phytophthora agathidicida is easily explained, it comes from a sequence present in all three samples with MD5 checksum 5122dde24762f8e3d6a54e3f79077254, and this exact sequence is in the database with entries for both Phytophthora castaneae (which was in the DNA control mixture) and also Phytophthora agathidicida (e.g. accession KP295308).

You can confirm this by looking at the sample tally TSV files, e.g. using grep to find the unique sequence matched to this species, and the sample counts for that sequence:

$ grep "Phytophthora agathidicida" summary/thapbi-pict.ITS1.onebp.tsv | cut -f 1,125,126
ITS1/29de890989becddc5e0b10ecbbc11b1a_1524  1642459;1642465  Phytophthora agathidicida;Phytophthora castaneae
$ grep -E "(Sequence|29de890989becddc5e0b10ecbbc11b1a)" \
  summary/thapbi-pict.ITS1.tally.tsv | cut -f 2-5
DNA10MIX_bycopynumber  DNA10MIX_diluted25x  DNA10MIX_undiluted  DNA15MIX
245                    655                  624                 0
$ thapbi_pict conflicts | grep 29de890989becddc5e0b10ecbbc11b1a
29de890989becddc5e0b10ecbbc11b1a  species  Phytophthora agathidicida;Phytophthora castaneae

The same applies to Phytophthora capsici and Phytophthora gloveri. i.e. These false positives are unavoidable.

As noted above, the woody hosts paper concluded the failure to detect P. boehmeriae in either DNA mix was due to inefficient primer annealing in a species mixture. We have an unexpected FN for P. cactorum though.

Running thapbi_pict assess for one sample

Comparing a few samples like this by hand is one thing, but doing it at scale requires automation. For assessing changes to the classifier method and database, we mainly run thapbi_pict assess against a set of single isolate positive controls. This requires a computer readable files listing the expected species in a particular format.

$ thapbi_pict assess -h
...

The “known” file uses the same column based layout as the intermediate TSV files, but while you can provide the expected species for each unique sequence in the sample, this can be simplified to a single wildcard * line followed by all the NCBI taxids (optional), and species names using semi-colon separators.

The simplest way to run the assess command is to tell it two TSV input filenames, named <sample_name>.known.tsv (the expected results) and <sample_name>.<method>.tsv (from running thapbi_pict classify` on <sample_name>.fasta). However, although early versions of the pipeline did this, it has for a long time combined the samples for classification - partly for speed.

Instead we typically pass the assess command the sample-tally TSV file listing how many of each unique sequence were found in each sample, the classifier TSV listing the species assigned to each sequence, and one or more per-sample <sample_name>.known.tsv expected results files.

The assess command will default to printing its tabular output to screen - shown here abridged after piping through the cut command to pull out just the first five columns from the 15 species mix:

$ thapbi_pict assess -i summary/thapbi-pict.ITS1.onebp.tsv \
  expected/DNA15MIX.known.tsv | cut -f 1-5
Assessed onebp vs known in 2 files (315 species; 1 samples)
#Species                     TP  FP  FN  TN
OVERALL                      8   2   7   298
Phytophthora aleatoria       0   1   0   0
Phytophthora alpina          0   1   0   0
Phytophthora austrocedri     1   0   0   0
Phytophthora boehmeriae      0   0   1   0
Phytophthora cactorum        1   0   0   0
Phytophthora chlamydospora   0   0   1   0
Phytophthora cinnamomi       0   0   1   0
Phytophthora gonapodyides    1   0   0   0
Phytophthora ilicis          1   0   0   0
Phytophthora kernoviae       1   0   0   0
Phytophthora lateralis       0   0   1   0
Phytophthora obscura         1   0   0   0
Phytophthora plurivora       0   0   1   0
Phytophthora pseudosyringae  1   0   0   0
Phytophthora ramorum         1   0   0   0
Phytophthora syringae        0   0   1   0
Phytophthora x cambivora     0   0   1   0
OTHER 298 SPECIES IN DB      0   0   0   298

More usually, you would output to a named file, and look at that:

$ thapbi_pict assess -i summary/thapbi-pict.ITS1.onebp.tsv \
  expected/DNA15MIX.known.tsv -o DNA15MIX.assess.tsv
Assessed onebp vs known in 2 files (315 species; 1 samples)
$ cut -f 1-5,9,11 DNA15MIX.assess.tsv
<SEE TABLE BELOW>

You should be able to open this DNA15MIX.assess.tsv file in R, Excel, etc, and focus on the same column selection:

#Species	TP	FP	FN	TN	F1	Ad-hoc-loss
OVERALL	8	2	7	298	0.64	0.529
Phytophthora aleatoria	0	1	0	0	0.00	1.000
Phytophthora alpina	0	1	0	0	0.00	1.000
Phytophthora austrocedri	1	0	0	0	1.00	0.000
Phytophthora boehmeriae	0	0	1	0	0.00	1.000
Phytophthora cactorum	1	0	0	0	1.00	0.000
Phytophthora chlamydospora	0	0	1	0	0.00	1.000
Phytophthora cinnamomi	0	0	1	0	0.00	1.000
Phytophthora gonapodyides	1	0	0	0	1.00	0.000
Phytophthora ilicis	1	0	0	0	1.00	0.000
Phytophthora kernoviae	1	0	0	0	1.00	0.000
Phytophthora lateralis	0	0	1	0	0.00	1.000
Phytophthora obscura	1	0	0	0	1.00	0.000
Phytophthora plurivora	0	0	1	0	0.00	1.000
Phytophthora pseudosyringae	1	0	0	0	1.00	0.000
Phytophthora ramorum	1	0	0	0	1.00	0.000
Phytophthora syringae	0	0	1	0	0.00	1.000
Phytophthora x cambivora	0	0	1	0	0.00	1.000
OTHER 298 SPECIES IN DB	0	0	0	298	0.00	0.000

The OVERALL line tells us that there were 8 true positives, 2 false positives, 7 false negatives, and 298 true negatives. The final number needs a little explanation. First, 8+2+7+298 = 315, which is the number of species in the database. With only one sample being considered, 298 is the number of other species in the database which the tool correctly says are not present.

Following this we get one line per species, considering this species in isolation (making this a traditional and simpler to interpret classification problem). Here there is only one sample, so this time TP+FP+FN+TN=1.

You can easily spot the 2 FP in this layout, Phytophthora alpina and P. aleatoria, or the 7 FN.

The additional columns (not all shown here) include traditional metrics like sensitivity, specificity, precision, F1, and Hamming loss. We’ve shown F1 or F-measure here (from zero to one for perfect recall), plus our own metric provisionally called Ad hoc loss which is a modification of the Hamming loss without using the true negative count (which we expect to always be very large as the database will contain many species, while a community might contain only ten). The TN column is omitted below.

Doing that for one of the 10 species mixtures:

$ thapbi_pict assess -i summary/thapbi-pict.ITS1.onebp.tsv \
  expected/DNA10MIX_undiluted.known.tsv -o DNA10MIX.assess.tsv
Assessed onebp vs known in 2 files (315 species; 1 samples)
$ cut -f 1-4,9,11 DNA10MIX.assess.tsv
<SEE TABLE BELOW>

As this is still only one sample, new table DNA10MIX.assess.tsv is very similar to what we had before:

#Species	TP	FP	FN	F1	Ad-hoc-loss
OVERALL	8	2	2	0.80	0.333
Phytophthora agathidicida	0	1	0	0.00	1.000
Phytophthora boehmeriae	0	0	1	0.00	1.000
Phytophthora cactorum	0	0	1	0.00	1.000
Phytophthora capsici	1	0	0	1.00	0.000
Phytophthora castaneae	1	0	0	1.00	0.000
Phytophthora fallax	1	0	0	1.00	0.000
Phytophthora foliorum	1	0	0	1.00	0.000
Phytophthora gloveri	0	1	0	0.00	1.000
Phytophthora obscura	1	0	0	1.00	0.000
Phytophthora plurivora	1	0	0	1.00	0.000
Phytophthora rubi	1	0	0	1.00	0.000
Phytophthora siskiyouensis	1	0	0	1.00	0.000
OTHER 303 SPECIES IN DB	0	0	0	0.00	0.000

It is clear from the metrics that the classifier is performing better on the second 10 species mock community.

Assessing multiple samples

Next, let’s run the assess command on all four positive control samples, by giving the combined intermediate filenames, and all the expected files:

$ thapbi_pict assess -i summary/thapbi-pict.ITS1.onebp.tsv \
  expected/ -o thabpi-pict.ITS1.assess.tsv
Assessed onebp vs known in 5 files (315 species; 4 samples)
$ cut -f 1-4,9,11 thabpi-pict.ITS1.assess.tsv
<SEE TABLE BELOW>

New table thabpi-pict.ITS1.assess.tsv is similar, but notice all the per-species lines have TP+FP+FN+TN=4 as there were 4 samples:

#Species	TP	FP	FN	F1	Ad-hoc-loss
OVERALL	32	8	13	0.75	0.396
Phytophthora agathidicida	0	3	0	0.00	1.000
Phytophthora aleatoria	0	1	0	0.00	1.000
Phytophthora alpina	0	1	0	0.00	1.000
Phytophthora austrocedri	1	0	0	1.00	0.000
Phytophthora boehmeriae	0	0	4	0.00	1.000
Phytophthora cactorum	1	0	3	0.40	0.750
Phytophthora capsici	3	0	0	1.00	0.000
Phytophthora castaneae	3	0	0	1.00	0.000
Phytophthora chlamydospora	0	0	1	0.00	1.000
Phytophthora cinnamomi	0	0	1	0.00	1.000
Phytophthora fallax	3	0	0	1.00	0.000
Phytophthora foliorum	3	0	0	1.00	0.000
Phytophthora gloveri	0	3	0	0.00	1.000
Phytophthora gonapodyides	1	0	0	1.00	0.000
Phytophthora ilicis	1	0	0	1.00	0.000
Phytophthora kernoviae	1	0	0	1.00	0.000
Phytophthora lateralis	0	0	1	0.00	1.000
Phytophthora obscura	4	0	0	1.00	0.000
Phytophthora plurivora	3	0	1	0.86	0.250
Phytophthora pseudosyringae	1	0	0	1.00	0.000
Phytophthora ramorum	1	0	0	1.00	0.000
Phytophthora rubi	3	0	0	1.00	0.000
Phytophthora siskiyouensis	3	0	0	1.00	0.000
Phytophthora syringae	0	0	1	0.00	1.000
Phytophthora x cambivora	0	0	1	0.00	1.000
OTHER 290 SPECIES IN DB	0	0	0	0.00	0.000

This time the OVERALL line says we had 32 TP, 8 FP, 13 FN. The total of TP+FP+FN+TN (not shown) is four times the number of samples times the number of species in the database (four samples).

Running assessment as part of pipeline

Provided they follow the expected naming convention, if you include your control files *.known.tsv as one of the pipeline inputs, it will call the classifier assessment after running the classifier and producing the main reports:

$ thapbi_pict pipeline -i raw_data/ expected/ -s intermediate/ \
  -o summary/with-metadata -n raw_data/NEGATIVE*.fastq.gz \
  -t metadata.tsv -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -x 16
...
$ ls -1 summary/with-metadata.*
summary/with-metadata.ITS1.assess.confusion.onebp.tsv
summary/with-metadata.ITS1.assess.onebp.tsv
summary/with-metadata.ITS1.assess.tally.onebp.tsv
summary/with-metadata.ITS1.onebp.tsv
summary/with-metadata.ITS1.reads.onebp.tsv
summary/with-metadata.ITS1.reads.onebp.xlsx
summary/with-metadata.ITS1.samples.onebp.tsv
summary/with-metadata.ITS1.samples.onebp.xlsx
summary/with-metadata.ITS1.tally.tsv
$ diff summary/with-metadata.ITS1.assess.onebp.tsv \
  thabpi-pict.ITS1.assess.tsv

Output file summary/with-metadata.ITS1.assess.onebp.tsv will match the output above.

Interpretation of the mock communities

Running our pipeline with the default settings results in a number of false positives (all unavoidable as they come from conflicting marker sequences in the database, see the thapbi_pict conflicts command), and some false negatives (on top of the explained absence of Phytophthora boehmeriae). Specifically we have 6 unexplained false negatives on the 15 species mix, and are missing Phytophthora cactorum in all three samples of the 10 species mix.

This means that with the default settings THAPBI PICT gives a more cautious set of predictions than the metapy tool used in the original data analysis (see Riddell et al. (2019) Table 1, Table 2) which appears to consider even singletons.

Attempting to compare the results in their Table 1 with our own numbers is complicated since it appears to show just one of the 10 species mixes (so the TP count is out of 10) while we used all three (for a TP count out of 30).

We can therefore pick a single representative sample for the 10 species mix, to make direct comparison more straightforward:

$ thapbi_pict assess -i summary/thapbi-pict.ITS1.onebp.tsv \
  expected/DNA15MIX.known.tsv expected/DNA10MIX_undiluted.known.tsv \
  | head -n 2 | cut -f 1-4,9,11
Assessed onebp vs known in 3 files (315 species; 2 samples)
#Species  TP  FP  FN  F1    Ad-hoc-loss
OVERALL   16  4   9   0.71  0.448

We can recover most of the missing species (the FN) by dropping the minimum abundance thresholds (which requires deleting the intermediate FASTA files, or using a different intermediate folder, and re-running with lower settings for -a and -f), at the cost of more FP.

For instance, we find traces of P. syringae with less than 10 reads in the 15 species mix (consistent with Table 2), and even P. boehmeriae with less than 10 reads in two of the 10 species mix (not reported in Table 2).

Interestingly even excluding only singletons (using -a 2 -f 0), we didn’t find any matches to Phytophthora cactorum in the three samples of the 10 species mix. However, there is a sequence perfectly matching database entries for P. idaei present at around 40 to 60 copies, and in light of the original paper, this is likely what was intended to be in the mixture as P. cactorum.

Again even excluding only singletons, we didn’t find any matches to P. plurivora in the 15 species mix (Table 2 in the original paper suggests present with only 2 reads).

We can optimise the threshold by maximising the F1 score and minimising ad-hoc-loss for these two samples. This is done at the end of the run.sh script with a simple parameter sweep of the absolute threshold (-a) with the fractional threshold unused (-f 0). This produces a simple table:

$ cut -f 1-4,9,11 summary/mocks_a2.assess-vs-abundance.tsv
<SEE TABLE BELOW>

Open the table in Excel if you prefer, the columns of particular interest:

#Threshold	TP	FP	FN	F1	Ad-hoc-loss
A=2	22	23	3	0.63	0.542
A=10	20	9	5	0.74	0.412
A=20	20	8	5	0.75	0.394
A=30	19	8	6	0.73	0.424
A=40	19	6	6	0.76	0.387
A=50	19	5	6	0.78	0.367
A=60	18	5	7	0.75	0.400
A=70	18	5	7	0.75	0.400
A=80	18	5	7	0.75	0.400
A=90	16	4	9	0.71	0.448
A=100	16	4	9	0.71	0.448

This suggests the optimal absolute abundance threshold for these two samples is in the region of 50 reads, giving 19 TP, 5 FP, and 6 FN for an F1 of 0.78 and ad-hoc-loss of 0.367. If we run the optimisation on all four samples (one with 15 species, three with 10 species), this suggests somewhere in between this and the default of 100.