THAPBI PICT has been designed as a framework which can be applied to multiple biological contexts, as demonstrated in the worked examples. Each new set of markers (i.e. PCR primer targets) will require a new reference database to be compiled, most likely starting from published sequences, but we also sequenced culture collections.
Applied to environmental samples, some primer pairs will amplify a much wider sequence space than others, either reflecting a more diverse genome region, or simply a longer sequence. Related to this, the fraction of observed sequences with a published reference will also vary, and thus so will the density of the references in sequence space. This in turn can change which classifier algorithm is most appropriate. Inspecting the edit-graph produced for all your samples and your initial database entries can help interpret this.
The default classifier allows perfect matches, or a single base pair difference (substitution, insertion or deletion). This requires good database coverage with unambiguous sequences, which we have been able to achieve for the Phytophthora ITS1 region targeted by default.
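The "perfect match or a single difference" rule above can be illustrated with a minimal sketch. This is not the tool's own implementation; the function name `within_one_edit` is hypothetical, and the code simply tests whether two strings are at most one substitution, insertion or deletion apart.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b are identical or differ by exactly one
    substitution, insertion or deletion (hypothetical helper)."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make sure a is the shorter (or equal length) string
    # Walk past the shared prefix to the first difference
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Same length: one substitution, so the remainders must agree
        return a[i + 1:] == b[i + 1:]
    # b is one longer: dropping b[i] must give the rest of a
    return a[i:] == b[i + 1:]


print(within_one_edit("ACGT", "ACGT"))  # perfect match
print(within_one_edit("ACGT", "AGGT"))  # one substitution
print(within_one_edit("ACGT", "ACT"))   # one deletion
print(within_one_edit("ACGT", "AGGA"))  # two substitutions
```

With a check this strict, any observed sequence two or more edits away from every reference goes unclassified, which is why good database coverage matters.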
THAPBI provides a default database which is used when the command line
--database setting is omitted. This is intended for use with a
Phytophthora ITS1 target region, and is used in the first
For further details see the
database/README.rst file in the source code, and the script
database/build_ITS1_DB.sh which automates this.
Ambiguous bases in database
Ideally all the reference sequences in your database will have unambiguous
sequences only (A, C, G and T). However, some published
species sequences will contain IUPAC ambiguity codes, especially if capillary
sequenced. How this is handled will depend on the classifier algorithm used.
For example, Phytophthora condilina accession KJ372262 has a single W, meaning
A or T. In this case for P. condilina in our curated set, we
could select the unambiguous accession
With the strictest
identity classifier, the
W will never be matched
(since the Illumina platform does not produce any ambiguous bases other than
N). With the default
onebp classifier, this can match but the W
would be the single allowed mismatch (and any database entry with more than
one ambiguity would never be matched). The
blast classifier uses NCBI
BLAST+ internally, and would handle the base as expected.
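The effect of an ambiguity code on exact and one-difference matching can be sketched as follows. The sequences are short made-up placeholders, not real ITS1 data, and the helper `mismatches` is hypothetical. For simplicity it counts substitutions only and compares characters literally, so W is just another letter that no Illumina read will contain.

```python
# Hypothetical reference containing a single IUPAC ambiguity code (W = A or T)
reference = "ACGTWGCA"

# Hypothetical reads; real reads contain only A, C, G, T (or N)
reads = ["ACGTAGCA", "ACGTTGCA", "ACGTAGCC"]


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


for read in reads:
    exact = read == reference                   # identity-style comparison
    one_off = mismatches(read, reference) <= 1  # one-substitution comparison
    print(read, exact, one_off)
```

The first two reads are compatible with W biologically, yet neither is an exact match, and each spends its single allowed difference on the ambiguous position. The third read has one further substitution and so fails both comparisons.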
Conflicting taxonomic assignments
With any amplicon marker, it is possible that distinct species will share the exact same sequence. For example, this happens with our ITS1 marker for model organism Phytophthora infestans and sister species P. andina and P. ipomoeae. In cases like this where the classifier finds multiple equally valid taxonomic assignments in the database, they are all reported. If the user prefers, their database could instead record a single assignment like Phytophthora infestans-complex.
Our default primers for Phytophthora can amplify related genera, not just Peronosporales, but also some Pythiales. Expanding the database coverage runs into two main problems. First, with fewer published sequences available, the default strict classifier may fail to match many sequences to a published sequence. Second, with past renaming and splitting of some genera, the taxonomic annotation can become less consistent.
The thapbi_pict conflicts subcommand can be used to report any conflicts
at species or genus level.
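To illustrate what a species or genus level conflict means, here is a minimal sketch that groups reference entries by sequence and flags any sequence carrying more than one name. It does not reproduce the subcommand's logic or the database schema. The function `find_conflicts`, the tuple layout, the placeholder sequences and the "sp. A" / "sp. B" names are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical (sequence, genus, species) entries; sequences are placeholders
entries = [
    ("ACGTACGT", "Phytophthora", "infestans"),
    ("ACGTACGT", "Phytophthora", "andina"),
    ("ACGTACGT", "Phytophthora", "ipomoeae"),
    ("TTGGCCAA", "Phytophthora", "condilina"),
    ("GGGGCCCC", "Phytophthora", "sp. A"),
    ("GGGGCCCC", "Pythium", "sp. B"),
]


def find_conflicts(entries):
    """Return (sequence, level, names) for sequences with several names."""
    genera = defaultdict(set)
    species = defaultdict(set)
    for seq, genus, sp in entries:
        genera[seq].add(genus)
        species[seq].add(f"{genus} {sp}")
    conflicts = []
    for seq in sorted(species):
        if len(genera[seq]) > 1:
            # Different genera share this sequence: report at genus level
            conflicts.append((seq, "genus", sorted(genera[seq])))
        elif len(species[seq]) > 1:
            # Same genus, several species: report at species level
            conflicts.append((seq, "species", sorted(species[seq])))
    return conflicts


for seq, level, names in find_conflicts(entries):
    print(seq, level, ";".join(names))
```

Species level conflicts within one genus, such as the P. infestans group, are often biologically expected. Genus level conflicts are more likely to point to inconsistent or outdated annotation.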