thapbi_pict.taxdump module

Code for THAPBI PICT to deal with NCBI taxonomy dumps.

The code is needed initially for loading an NCBI taxdump folder (files names.dmp, nodes.dmp, merged.dmp etc) into a marker database.

thapbi_pict.taxdump.filter_tree(tree: dict[int, int], ranks: dict[str, set[int]], ancestors: set[int]) → tuple[dict[int, int], dict[str, set[int]]]

Return a filtered version of the tree & ranks dict.

NOTE: Does NOT preserve the original dict order.
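As a rough illustration only, the sketch below shows one plausible reading of this filtering: keep just the nodes that sit at or below one of the given ancestors. It assumes `tree` maps child taxid to parent taxid, and the name `filter_tree_sketch` is hypothetical; the real function may apply different rules.

```python
from __future__ import annotations


def filter_tree_sketch(
    tree: dict[int, int], ranks: dict[str, set[int]], ancestors: set[int]
) -> tuple[dict[int, int], dict[str, set[int]]]:
    """Keep only nodes at or below one of the ancestors (assumed semantics)."""
    keep: set[int] = set()
    for taxid in tree:
        path = []
        node = taxid
        while True:
            path.append(node)
            if node in ancestors or node in keep:
                # Everything on the path descends from a wanted ancestor.
                keep.update(path)
                break
            parent = tree.get(node)
            if parent is None or parent == node:
                # Reached the root (its own parent) or a missing entry.
                break
            node = parent
    # Building from a set means the original dict order is not preserved.
    new_tree = {t: tree[t] for t in keep if t in tree}
    new_ranks = {rank: ids & keep for rank, ids in ranks.items()}
    return new_tree, new_ranks


# Toy data with made-up taxids: two genera (10, 20) under the root (1).
toy_tree = {1: 1, 10: 1, 11: 10, 12: 10, 20: 1, 21: 20}
toy_ranks = {"genus": {10, 20}, "species": {11, 12, 21}}
new_tree, new_ranks = filter_tree_sketch(toy_tree, toy_ranks, {10})
```

Here only genus 10 and its two species survive; the root and the other genus are dropped.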

thapbi_pict.taxdump.get_ancestor(taxid: int, tree: dict[int, int], stop_nodes: set[int]) → int: Walk up the tree until reaching a stop node, or the root.
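The walk itself can be sketched in a few lines. This is a minimal illustration, not the module's code: it assumes `tree` maps child taxid to parent taxid and that the root is its own parent (as taxid 1 is in nodes.dmp). The name `walk_to_ancestor` and the genus taxid 100 are hypothetical.

```python
from __future__ import annotations


def walk_to_ancestor(taxid: int, tree: dict[int, int], stop_nodes: set[int]) -> int:
    """Follow child -> parent links until a stop node or the root is reached."""
    while taxid not in stop_nodes:
        parent = tree.get(taxid)
        if parent is None or parent == taxid:
            # Missing entry, or the root (which lists itself as parent).
            break
        taxid = parent
    return taxid


# Toy tree: varietas 4791 -> species 4790 -> placeholder genus 100 -> root 1.
toy_tree = {4791: 4790, 4790: 100, 100: 1, 1: 1}
at_species = walk_to_ancestor(4791, toy_tree, {4790})
at_root = walk_to_ancestor(4791, toy_tree, set())
```

With the species as a stop node the walk ends at 4790; with no stop nodes it runs to the root.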

thapbi_pict.taxdump.load_merged(merged_dmp: str, wanted: set[int] | None = None) → dict[int, int]: Load mapping of merged taxids of interest from NCBI taxdump merged.dmp file.
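For orientation, merged.dmp has two fields per line (old taxid, new taxid), separated by tab-pipe-tab and ending with tab-pipe. The sketch below parses lines in that layout. It takes an iterable of lines rather than a filename, the names `split_dmp_line` and `parse_merged` are hypothetical, and filtering `wanted` against the new taxid is an assumption about what "of interest" means.

```python
from __future__ import annotations

from collections.abc import Iterable


def split_dmp_line(line: str) -> list[str]:
    """Split one NCBI .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-record marker
    return line.split("\t|\t")


def parse_merged(lines: Iterable[str], wanted: set[int] | None = None) -> dict[int, int]:
    """Map old (merged) taxid -> current taxid."""
    merged = {}
    for line in lines:
        fields = split_dmp_line(line)
        old, new = int(fields[0]), int(fields[1])
        # Assumption: 'wanted' restricts on the current taxid.
        if wanted is None or new in wanted:
            merged[old] = new
    return merged


# Toy lines in merged.dmp layout (made-up taxids, not copied from a real dump).
toy_lines = ["900001\t|\t4790\t|\n", "900002\t|\t100\t|\n"]
all_merged = parse_merged(toy_lines)
only_4790 = parse_merged(toy_lines, wanted={4790})
```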

thapbi_pict.taxdump.load_names(names_dmp: str, wanted: set[int] | None = None) → tuple[dict[int, str], dict[int, set[str]]]: Load scientific names of species from NCBI taxdump names.dmp file.
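Each names.dmp line carries four fields: taxid, name, unique name, and name class. The following sketch shows how the two returned dicts might be built from such lines. It is illustrative only: `parse_names` and `split_dmp_line` are hypothetical names, input is an iterable of lines rather than a filename, and treating only the "synonym" name class as a synonym is an assumption.

```python
from __future__ import annotations

from collections.abc import Iterable


def split_dmp_line(line: str) -> list[str]:
    """Split one NCBI .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-record marker
    return line.split("\t|\t")


def parse_names(
    lines: Iterable[str], wanted: set[int] | None = None
) -> tuple[dict[int, str], dict[int, set[str]]]:
    """Return (taxid -> scientific name, taxid -> set of synonyms)."""
    names: dict[int, str] = {}
    synonyms: dict[int, set[str]] = {}
    for line in lines:
        fields = split_dmp_line(line)
        taxid, name, name_class = int(fields[0]), fields[1], fields[3]
        if wanted is not None and taxid not in wanted:
            continue
        if name_class == "scientific name":
            names[taxid] = name
        elif name_class == "synonym":  # assumption: other classes are ignored
            synonyms.setdefault(taxid, set()).add(name)
    return names, synonyms


# Toy lines in names.dmp layout; the synonym text is made up.
toy_lines = [
    "4790\t|\tPhytophthora nicotianae\t|\t\t|\tscientific name\t|\n",
    "4790\t|\tExample synonym\t|\t\t|\tsynonym\t|\n",
    "4791\t|\tPhytophthora nicotianae var. parasitica\t|\t\t|\tscientific name\t|\n",
]
names, synonyms = parse_names(toy_lines, wanted={4790})
```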

thapbi_pict.taxdump.load_nodes(nodes_dmp: str, wanted_ranks: Sequence[str] | None = None) → tuple[dict[int, int], dict[str, set[int]]]

Load the NCBI taxdump nodes.dmp file.

Returns two dicts: the parent/child relationships, and the ranks (values are sets of taxids).

The default is all ranks; alternatively, provide a (possibly empty) list or set of ranks of interest.
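To make the two return values concrete, here is a minimal sketch that builds them from lines in nodes.dmp layout, where the first three fields are taxid, parent taxid, and rank. It assumes the tree dict maps child to parent and that the rank filter affects only the ranks dict; `parse_nodes` and `split_dmp_line` are hypothetical names, and the toy lines omit the many trailing fields of a real dump.

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence


def split_dmp_line(line: str) -> list[str]:
    """Split one NCBI .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-record marker
    return line.split("\t|\t")


def parse_nodes(
    lines: Iterable[str], wanted_ranks: Sequence[str] | None = None
) -> tuple[dict[int, int], dict[str, set[int]]]:
    """Return (child taxid -> parent taxid, rank -> set of taxids)."""
    tree: dict[int, int] = {}
    ranks: dict[str, set[int]] = {}
    for line in lines:
        fields = split_dmp_line(line)
        taxid, parent, rank = int(fields[0]), int(fields[1]), fields[2]
        tree[taxid] = parent
        # None means record every rank; an empty sequence records none.
        if wanted_ranks is None or rank in wanted_ranks:
            ranks.setdefault(rank, set()).add(taxid)
    return tree, ranks


# Toy lines; taxid 100 is a placeholder genus.
toy_lines = [
    "1\t|\t1\t|\tno rank\t|\n",
    "100\t|\t1\t|\tgenus\t|\n",
    "4790\t|\t100\t|\tspecies\t|\n",
    "4791\t|\t4790\t|\tvarietas\t|\n",
]
tree, ranks = parse_nodes(toy_lines, ["genus", "species"])
```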

thapbi_pict.taxdump.main(tax: str, db_url: str, ancestors: str, debug: bool = True) → int: Load an NCBI taxdump into a database.

thapbi_pict.taxdump.not_top_species(tree: dict[int, int], ranks: dict[str, set[int]], names: dict[int, str], synonyms: dict[int, set[str]], top_species) → Iterator[tuple[int, str]]

Find all ‘minor’ species; takes a set of species taxids to ignore.

Will map assorted sub-species (i.e. any nodes under top_species) to the parent species, e.g. varietas ‘Phytophthora nicotianae var. parasitica’ NCBI:txid4791 will be mapped to species ‘Phytophthora nicotianae’ NCBI:txid4790 instead.

Will map anything else to the parent genus, although generally it will be skipped via the reject_species_name(…) function, e.g.

no-rank entry ‘unclassified Pythium’ NCBI:txid228096 would be mapped to Pythium NCBI:txid4797 - although we’re not interested in importing any unclassified entries.
no-rank entry ‘environmental samples’ NCBI:txid660914 would be mapped to genus ‘Hyaloperonospora’ NCBI:txid184462 - but we skip this.
entry ‘uncultured Hyaloperonospora’ NCBI:txid660915 would be mapped to genus ‘Hyaloperonospora’ NCBI:txid184462 - but we skip uncultured.

However, if you wanted to import this part of the tree:

clade entry ‘Skeletonema marinoi-dohrnii complex’ NCBI:txid1171708 would be mapped to genus ‘Skeletonema’ NCBI:txid2842

Yields (genus taxid, node name) tuples.

thapbi_pict.taxdump.species_or_species_groups(tree: dict[int, int], ranks: dict[str, set[int]], names: dict[int, str]) → Iterator[tuple[int, int]]

Find taxids for species or species groups.

Our “genus” list matches the NCBI rank “genus”, and includes child nodes as aliases (unless they fall on our “species” list or reject list of “environmental samples” or “unclassified <genus>”).

However, entries on our “species” list are either NCBI rank “species” or “species group” (in the latter case child species are taken as aliases).

Does not distinguish between “top level” species, or those under “no rank” nodes like “environmental samples” or “unclassified Phytophthora” (taxid 211524).

Yields (species taxid, genus taxid) tuples.
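The pairing of species with genera can be pictured with the sketch below. It is a simplified guess at the logic, not the module's implementation: it assumes `tree` maps child to parent, ignores the `names` argument, and skips species that sit under a species group (treating them as aliases of the group). The names `find_genus` and `species_with_genus` are hypothetical.

```python
from __future__ import annotations

from collections.abc import Iterator


def find_genus(
    taxid: int, tree: dict[int, int], genus_ids: set[int], group_ids: set[int]
) -> int | None:
    """Return the genus above taxid, or None if a species group or the root comes first."""
    node = tree.get(taxid)
    while node is not None:
        if node in genus_ids:
            return node
        if node in group_ids:
            return None
        parent = tree.get(node)
        node = None if parent == node else parent  # the root is its own parent
    return None


def species_with_genus(
    tree: dict[int, int], ranks: dict[str, set[int]]
) -> Iterator[tuple[int, int]]:
    """Yield (species or species-group taxid, genus taxid) pairs."""
    genus_ids = ranks.get("genus", set())
    group_ids = ranks.get("species group", set())
    for taxid in sorted(ranks.get("species", set()) | group_ids):
        genus = find_genus(taxid, tree, genus_ids, group_ids)
        if genus is not None:
            yield taxid, genus


# Toy tree with made-up taxids: genus 10; species group 15 (child species 11);
# species 12 directly under the genus; species 13 under a no-rank node 14;
# species 30 with no genus above it.
toy_tree = {1: 1, 10: 1, 15: 10, 11: 15, 12: 10, 14: 10, 13: 14, 30: 1}
toy_ranks = {"genus": {10}, "species group": {15}, "species": {11, 12, 13, 30}}
pairs = list(species_with_genus(toy_tree, toy_ranks))
```

Note that species 13, under the no-rank node, is reported just like the top-level species 12, matching the remark above that the two cases are not distinguished.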