Reference database

Introduction

THAPBI PICT has been designed as a framework which can be applied to multiple biological contexts, as demonstrated in the worked examples. Each new marker or set of markers (i.e. PCR primer targets) will require a new reference database to be compiled, most likely starting from published sequences, although we also sequenced culture collections.

Applied to environmental samples, some primer pairs will amplify a much wider sequence space than others, either reflecting a more diverse genome region, or simply a longer sequence. Related to this, the fraction of observed sequences with a published reference will also vary, and thus so will the density of the references in sequence space. This in turn can change which classifier algorithm is most appropriate. Inspecting the edit-graph produced for all your samples and your initial database entries can help interpret this.

The default classifier allows perfect matches, or a single base pair difference (substitution, insertion or deletion). This requires good database coverage with unambiguous sequences, which we have been able to achieve for the Phytophthora ITS1 region targeted by default.
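The "perfect match or a single base pair difference" rule described above can be sketched in a few lines of Python. This is only an illustration of the matching criterion, not the THAPBI PICT implementation; the function name within_one_edit is hypothetical.

```python
def within_one_edit(query: str, ref: str) -> bool:
    """True if query equals ref, or differs by exactly one
    substitution, insertion or deletion."""
    if query == ref:
        return True
    len_q, len_r = len(query), len(ref)
    if abs(len_q - len_r) > 1:
        # Two or more bases longer/shorter needs at least two edits
        return False
    # Walk past the shared prefix to the first difference
    i = 0
    while i < min(len_q, len_r) and query[i] == ref[i]:
        i += 1
    if len_q == len_r:
        # Substitution: everything after the mismatch must agree
        return query[i + 1:] == ref[i + 1:]
    if len_q > len_r:
        # Query has one extra base (insertion relative to the reference)
        return query[i + 1:] == ref[i:]
    # Query is missing one base (deletion relative to the reference)
    return query[i:] == ref[i + 1:]


print(within_one_edit("ACGT", "ACGT"))  # perfect match
print(within_one_edit("ACGT", "AGGT"))  # one substitution
print(within_one_edit("ACGT", "ACT"))   # one deletion in the reference
print(within_one_edit("ACGT", "AGGA"))  # two substitutions
```

Because only one edit is tolerated, an observed sequence with no reference within that distance is left unclassified, which is why this approach depends on good database coverage.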

Provided database

THAPBI provides a default database which is used when the command line -d or --database setting is omitted. This is intended for use with a Phytophthora ITS1 target region, and is used in the first worked example.

For further details see the database/README.rst file in the source code, and the script database/build_ITS1_DB.sh which automates building this database.

Ambiguous bases in database

Ideally all the reference sequences in your database will have unambiguous sequences only (A, C, G and T). However, some published species sequences will contain IUPAC ambiguity codes, especially if capillary sequenced. How this is handled will depend on the classifier algorithm used.

For example Phytophthora condilina accession KJ372262 has a single W meaning A or T. In this case for P. condilina in our curated set, we could select the unambiguous accession MG707826 instead.

With the strictest identity classifier, the W will never be matched (since the Illumina platform does not produce any ambiguous bases other than N). With the default onebp classifier, this can match but the W would be the single allowed mismatch (and any database entry with more than one ambiguity would never be matched). The blast classifier uses NCBI BLAST+ internally, and would handle the base as expected.
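To see which reference sequences are affected, it can help to count the ambiguity codes in each candidate entry before adding it to a database. The sketch below assumes reads contain only A, C, G and T, and that each ambiguous base in a reference costs one mismatch, as described above. The helper names and the toy sequences are hypothetical, and this is not part of the THAPBI PICT API.

```python
UNAMBIGUOUS = set("ACGT")


def count_ambiguous(seq: str) -> int:
    """Count bases which are not A, C, G or T (IUPAC codes such as W, R, N)."""
    return sum(1 for base in seq.upper() if base not in UNAMBIGUOUS)


def matchable(seq: str, max_mismatches: int) -> bool:
    """Could an unambiguous read ever match this reference, given that
    each ambiguous base uses up one of the allowed mismatches?"""
    return count_ambiguous(seq) <= max_mismatches


# Toy sequences for illustration only, not real accessions
one_code = "ACGTWACGT"   # a single W (A or T)
two_codes = "ACWGTRACG"  # W and R

for name, seq in [("one_code", one_code), ("two_codes", two_codes)]:
    print(
        name,
        "identity:", matchable(seq, 0),  # identity allows no mismatches
        "onebp:", matchable(seq, 1),     # onebp allows a single mismatch
    )
```

Entries which fail this check under your chosen classifier are candidates for replacement with an unambiguous accession of the same species, where one is available.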

Conflicting taxonomic assignments

With any amplicon marker, it is possible that distinct species will share the exact same sequence. For example, this happens with our ITS1 marker for the model organism Phytophthora infestans and its sister species P. andina and P. ipomoeae. In cases like this, where the classifier finds multiple equally valid taxonomic assignments in the database, they are all reported. However, should the user wish, their database could instead record a single assignment such as Phytophthora infestans-complex.

Our default primers for Phytophthora can amplify related genera, not just Peronosporales, but also some Pythiales. Expanding the database coverage runs into two main problems. First, with fewer published sequences available, the default strict classifier may fail to match many sequences to a published sequence. Second, with past renaming and splitting of some genera, the taxonomic annotation can become less consistent.

The thapbi_pict conflicts subcommand can be used to report any conflicts at species or genus level.
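The idea behind such a conflict report can be illustrated with a minimal sketch: group the database entries by sequence, then flag any sequence recorded under more than one species or genus. This is an assumption about the general approach, not the code behind the conflicts subcommand. The function name find_conflicts, the toy sequences and the GenusA/GenusB names are all made up for illustration.

```python
from collections import defaultdict


def find_conflicts(entries):
    """Map each conflicting sequence to (level, sorted names).

    entries is an iterable of (sequence, genus, species) tuples.
    A genus-level conflict takes priority over a species-level one.
    """
    genera = defaultdict(set)
    species = defaultdict(set)
    for seq, genus, sp in entries:
        genera[seq].add(genus)
        species[seq].add(f"{genus} {sp}")
    conflicts = {}
    for seq in species:
        if len(genera[seq]) > 1:
            conflicts[seq] = ("genus", sorted(genera[seq]))
        elif len(species[seq]) > 1:
            conflicts[seq] = ("species", sorted(species[seq]))
    return conflicts


# Toy sequences standing in for real marker sequences
entries = [
    ("ACGTACGT", "Phytophthora", "infestans"),
    ("ACGTACGT", "Phytophthora", "andina"),
    ("ACGTACGT", "Phytophthora", "ipomoeae"),
    ("TTGGCCAA", "GenusA", "alpha"),  # made-up names
    ("TTGGCCAA", "GenusB", "beta"),   # made-up names
    ("GGGGCCCC", "Phytophthora", "condilina"),
]

conflicts = find_conflicts(entries)
for seq, (level, names) in conflicts.items():
    print(seq, level, "; ".join(names))
```

Species-level conflicts within one genus, like the P. infestans group, are often unavoidable for a given marker, while genus-level conflicts more often point to inconsistent taxonomic annotation that is worth reviewing.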