thapbi_pict.db_import module

Shared code for THAPBI PICT to import FASTA into our database.

This code is used for importing NCBI formatted FASTA files, our curated ITS1 sequence FASTA file databases, and FASTA files using other naming conventions.

thapbi_pict.db_import.import_fasta_file(fasta_file, db_url, fasta_entry_fn, entry_taxonomy_fn, marker, left_primer=None, right_primer=None, min_length=None, max_length=None, name=None, trim=True, debug=True, validate_species=False, genus_only=False, tmp_dir=None)

Import a FASTA file into the database.

thapbi_pict.db_import.load_taxonomy(session) → set[str]

Pre-load all the species and synonym names as a set.

thapbi_pict.db_import.lookup_genus(session, name: str)

Find genus entry via taxonomy/synonym table (if present).

thapbi_pict.db_import.lookup_species(session, name: str)

Find this species entry in the taxonomy/synonym table (if present).

thapbi_pict.db_import.main(fasta, db_url, marker, left_primer=None, right_primer=None, min_length=0, max_length=9223372036854775807, name=None, convention='simple', sep=None, validate_species=False, genus_only=False, ignore_prefixes=None, tmp_dir=None, debug=False)

Import FASTA file(s) into the database.

For curated FASTA files, use convention “simple” (default here and at the command line), and specify any multi-entry separator you are using.

For NCBI files, use convention “ncbi”. For the separator, use Ctrl+A (type -s $'\001' at the command line) if appropriate, or “” or None (function default) if single entries are expected.
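As a rough illustration of the multi-entry separator, the sketch below splits one FASTA title line on Ctrl+A (the character "\x01"). It is not part of THAPBI PICT: the helper name split_entries is hypothetical, and the combined title line is made up for the example.

```python
def split_entries(title, sep=None):
    """Split a FASTA title line into entries (hypothetical helper).

    An empty string or None as the separator means a single entry
    is expected, so the title is returned unsplit.
    """
    if not sep:
        return [title]
    return [entry.strip() for entry in title.split(sep)]


# Made-up title line holding two entries joined by Ctrl+A.
ncbi_title = (
    "LC159493.1 Phytophthora drechsleri genes"
    "\x01"
    "A57915.1 Sequence 20 from Patent EP0751227"
)
entries = split_entries(ncbi_title, sep="\x01")
```

Here entries holds two strings, one per original entry. With sep="" or sep=None the title is kept as a single entry.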

thapbi_pict.db_import.parse_curated_fasta_entry(text: str, known_species: list[str] | None = None) → tuple[int, str]

Split an entry of “Accession genus species etc” into fields.

Does not use the optional known_species argument.

Returns a two-tuple of taxid (0 unless taxid=… entry found), genus-species.

>>> parse_curated_fasta_entry("HQ013219 Phytophthora arenaria")
(0, 'Phytophthora arenaria')

Will look for an NCBI taxid after the species name (and ignore anything following that, such as other key=value entries):

>>> parse_curated_fasta_entry("P13660 Phytophthora aff infestans taxid=907744 etc")
(907744, 'Phytophthora aff infestans')

In this example we expect the NCBI taxid will be matched to a pre-loaded species name to be used in preference (i.e. ‘Phytophthora aff. infestans’ with a dot in it).
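The splitting behaviour shown in these examples could be approximated as follows. This is a minimal sketch and not the library's implementation: the function name sketch_parse_curated is hypothetical, and it assumes the taxid entry appears as a separate whitespace-delimited word.

```python
def sketch_parse_curated(text):
    """Return (taxid, species) from 'Accession genus species etc' (sketch)."""
    words = text.split()[1:]  # drop the first word, the accession
    taxid = 0
    name_words = []
    for word in words:
        if word.startswith("taxid="):
            value = word[len("taxid="):]
            if value.isdecimal():
                taxid = int(value)
            break  # ignore anything following the taxid entry
        name_words.append(word)
    return taxid, " ".join(name_words)
```

The sketch stops at the first taxid= word, so later key=value entries never become part of the species name.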

thapbi_pict.db_import.parse_ncbi_fasta_entry(text: str, known_species: list[str] | None = None) → tuple[int, str]

Split an entry of Accession Genus Species-name Description.

Returns a two-tuple: taxid (always zero), presumed genus-species (may be the empty string).

>>> parse_ncbi_fasta_entry("LC159493.1 Phytophthora drechsleri genes ...")
(0, 'Phytophthora drechsleri')
>>> parse_ncbi_fasta_entry("A57915.1 Sequence 20 from Patent EP0751227")
(0, '')
>>> parse_ncbi_fasta_entry("Y08654.1 P.cambivora ribosomal internal ...")
(0, '')

If a list of known species is given, the rightmost word is dropped repeatedly until the text matches a known name. This discards any description (and strain level information if the list is only to species level).

If there is no match to the provided names, heuristics are used but this defaults to the first two words.

Dividing the species name into genus, species, strain etc is not handled here.
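The right-to-left trimming against known names might look like the sketch below. It is an illustration only: the function name sketch_match_known is hypothetical, and where the real function falls back to heuristics this version simply returns an empty string.

```python
def sketch_match_known(text, known_species):
    """Trim words from the right until the name is known (sketch)."""
    words = text.split()[1:]  # drop the accession
    while words:
        candidate = " ".join(words)
        if candidate in known_species:
            return candidate
        words.pop()  # drop the rightmost word and try again
    # No match; the documented function would apply heuristics here.
    return ""


known = ["Phytophthora drechsleri"]
name = sketch_match_known("LC159493.1 Phytophthora drechsleri genes ...", known)
```

With a species-level list, trailing description words are discarded one at a time until only the known name remains.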

thapbi_pict.db_import.parse_ncbi_taxid_entry(text: str, know_species: list[str] | None = None) → tuple[int, str]

Find any NCBI taxid as a pattern in the text.

Returns a two-tuple of taxid (zero if not found), and an empty string (use the taxonomy table in the DB to get the genus-species).

Uses a regular expression based on taxid=<digits>, and only considers the first match:

>>> parse_ncbi_taxid_entry("HQ013219 Phytophthora arenaria [taxid=]")
(0, '')
>>> parse_ncbi_taxid_entry("HQ013219 Phytophthora arenaria [taxid=123] [taxid=456]")
(123, '')
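
A first-match search of this kind can be sketched with the standard re module. The exact pattern used by THAPBI PICT is not shown here, so the expression below is an assumption, and the function name sketch_find_taxid is hypothetical.

```python
import re


def sketch_find_taxid(text):
    """Return the first taxid=<digits> value in text, or 0 (sketch)."""
    # re.search returns only the first match; \d+ needs at least one
    # digit, so an empty "taxid=" does not count as a match.
    match = re.search(r"taxid=(\d+)", text)
    return int(match.group(1)) if match else 0
```

Because the pattern needs at least one digit, "[taxid=]" gives zero, and a second taxid later in the text is never examined.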

thapbi_pict.db_import.parse_obitools_fasta_entry(text: str, known_species: list[str] | None = None) → tuple[int, str]

Parse species from the OBITools extended FASTA header.

See https://pythonhosted.org/OBITools/attributes.html which explains that OBITools splits the FASTA line into identifier, zero or more key=value; entries, and a free text description.

We are specifically interested in the species_name, genus_name (used if species_name is missing), and taxid.

>>> entry = "AP009202 species_name=Abalistes stellaris; taxid=392897; ..."
>>> parse_obitools_fasta_entry(entry)
(392897, 'Abalistes stellaris')

Note this will not try to parse any key=value entries embedded in the first word (which is taken as the identifier).
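To make the header layout concrete, here is a minimal sketch of splitting an OBITools style line into its key=value; entries. It is not the library's code: the function name sketch_parse_obitools is hypothetical, and it naively treats any semi-colon separated chunk containing "=" as a key=value entry.

```python
def sketch_parse_obitools(text):
    """Return (taxid, name) from an OBITools style header (sketch)."""
    # The first word is the identifier and is not parsed further.
    parts = text.split(None, 1)
    rest = parts[1] if len(parts) > 1 else ""
    fields = {}
    for chunk in rest.split(";"):
        key, sep, value = chunk.partition("=")
        if sep:  # only chunks containing "=" are key=value entries
            fields[key.strip()] = value.strip()
    # Prefer species_name, falling back to genus_name if it is missing.
    name = fields.get("species_name") or fields.get("genus_name", "")
    taxid_text = fields.get("taxid", "")
    taxid = int(taxid_text) if taxid_text.isdecimal() else 0
    return taxid, name
```

Anything packed into the identifier itself, such as "id1;taxid=5" with no space, is left unparsed in this sketch too.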

thapbi_pict.db_import.parse_sintax_fasta_entry(text: str, known_species: list[str] | None = None) → tuple[int, str]

Extract the species from SINTAX taxonomy annotation.

See https://drive5.com/usearch/manual/tax_annot.html which defines this taxonomy annotation convention as used in USEARCH and VSEARCH. The tax=names field is separated from other fields in the FASTA description line by semi-colons, for example:

>>> entry = "X80725_S000004313;tax=d:...,g:Escherichia/Shigella,s:Escherichia_coli"
>>> parse_sintax_fasta_entry(entry)
(0, 'Escherichia coli')

If there is no species entry (prefix s:) then the genus is returned (prefix g:), else the empty string:

>>> parse_sintax_fasta_entry("AB008314;tax=d:...,g:Streptococcus;")
(0, 'Streptococcus')

If the species entry is missing the genus information (which may happen depending on how the file was generated), that is inferred heuristically:

>>> entry = "X80725_S000004313;tax=d:...,g:Escherichia,s:coli"
>>> parse_sintax_fasta_entry(entry)
(0, 'Escherichia coli')

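The genus entry can be ambiguous, as with the slash-separated “Escherichia/Shigella” here, in which case the species entry is used as given: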

>>> entry = ">X80725_S000004313;tax=d:...,g:Escherichia/Shigella,s:Escherichia_coli"
>>> parse_sintax_fasta_entry(entry)
(0, 'Escherichia coli')
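
The cases above can be reproduced with a short sketch of the tax= field handling. This is an illustration and not the library's implementation: the function name sketch_parse_sintax is hypothetical, and the rule for a missing genus (a one-word species entry gets the genus prepended) is an assumed stand-in for the heuristic mentioned above.

```python
def sketch_parse_sintax(text):
    """Return (0, name) from a SINTAX annotated header (sketch)."""
    genus = species = ""
    for field in text.split(";"):
        if not field.startswith("tax="):
            continue
        for rank in field[len("tax="):].split(","):
            if rank.startswith("g:"):
                genus = rank[2:].replace("_", " ")
            elif rank.startswith("s:"):
                species = rank[2:].replace("_", " ")
    if species and genus and " " not in species:
        # Assumed heuristic: a one-word species entry lacks its genus.
        species = f"{genus} {species}"
    # Fall back to the genus, then to the empty string.
    return 0, species or genus
```

The taxid is always zero here because the SINTAX annotation in these examples carries names only.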