thapbi_pict.classify module

Classifying prepared marker sequences using a marker database.

This implements the thapbi_pict classify ... command.

thapbi_pict.classify.apply_method_to_seqs(method_fn: Callable, input_seqs: dict[str, str], session, marker_name: str, min_abundance: int = 0, debug: bool = False) → Iterator[tuple[str, str, str, str]]

Call given method on each sequence in the dict.

Assumes any abundance filter has already been applied. Input is a dict of identifiers mapped to upper case sequences.

thapbi_pict.classify.consoliate_and_sort_taxonomy(genus_species_taxid: Iterable[tuple[str, str, int]]) → list[tuple[str, str, int]]

Remove any redundant entries and return a new sorted list.

Drops zero-taxid entries if there is a matching non-zero entry.

Drops genus-only entries if there are species-level entries. Note that the TaxID is ignored here: we would need to know the parent/child relationship to confirm that the genus being removed does have species-level children in the prediction set.
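The two dropping rules can be illustrated with a small stand-alone sketch. This is not the library's implementation: the function name consolidate_sketch is hypothetical, it assumes a genus-only entry carries an empty species string, and the taxid values below are placeholders chosen for illustration.

```python
def consolidate_sketch(entries):
    """Hypothetical re-statement of the two rules described above."""
    unique = set(entries)  # removes exact duplicates

    # Rule 1: drop a zero-taxid entry when the same genus/species pair
    # also appears with a non-zero taxid.
    with_taxid = {(g, s) for g, s, t in unique if t}
    unique = {(g, s, t) for g, s, t in unique if t or (g, s) not in with_taxid}

    # Rule 2: drop a genus-only entry (assumed: empty species string) when
    # any species-level entry exists for that genus. The taxid is ignored.
    genera_with_species = {g for g, s, _ in unique if s}
    unique = {(g, s, t) for g, s, t in unique if s or g not in genera_with_species}

    return sorted(unique)


example = [
    ("Phytophthora", "", 101),           # genus only, dropped by rule 2
    ("Phytophthora", "infestans", 102),
    ("Phytophthora", "infestans", 0),    # zero taxid, dropped by rule 1
    ("Pythium", "", 103),                # genus only, kept (no species entry)
]
print(consolidate_sketch(example))
```

Because rule 2 matches on the genus name alone, a genus-only entry is removed even if its taxid is not actually the parent of the species entries, which is the caveat noted above.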

thapbi_pict.classify.main(inputs: list[str], session, marker_name: str, method: str, out_dir: str, ignore_prefixes: tuple[str], tmp_dir: str, min_abundance: int = 0, biom=False, debug: bool = False, cpu: int = 0) → list[str | None]

Implement the thapbi_pict classify command.

For use in the pipeline command, returns a list of the TSV classifier output filenames.

The input files should have been prepared with the same or a lower minimum abundance; this argument acts as an additional filter, which is useful when exploring the best threshold.
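As a rough illustration of why the classify threshold only has an effect when it is at least as high as the one used during preparation, consider the sketch below. The dict of read counts and the function name filter_by_abundance are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def filter_by_abundance(counts, min_abundance=0):
    """Keep sequences whose read count meets the threshold (assumed inclusive)."""
    return {seq: n for seq, n in counts.items() if n >= min_abundance}


# Hypothetical counts surviving preparation with a threshold of 50.
prepared = {"ACGT": 250, "ACGA": 120, "TTGA": 60}

# A stricter threshold at classification time removes more sequences.
print(filter_by_abundance(prepared, min_abundance=100))

# A threshold at or below the preparation threshold changes nothing,
# since the rarer sequences were already discarded.
print(filter_by_abundance(prepared, min_abundance=10))
```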

thapbi_pict.classify.method_blast(input_seqs: dict[str, str], session, marker_name: str, tmp_dir: str, shared_tmp_dir: str, min_abundance: int = 0, debug: bool = False, cpu: int = 0) → Iterator[tuple[str, str, str, str]]

Classify using BLAST.

Another simplistic classifier, which runs the reads through blastn against a BLAST database built from our marker sequence database entries.

thapbi_pict.classify.method_cleanup() → None

Free any memory and/or delete any files on disk.

There is currently no need to generalise this for the different classifiers, but we could do so if, for example, we also needed to delete any files on disk.

thapbi_pict.classify.method_dist(input_seqs: dict[str, str], session, marker_name: str, tmp_dir: str, shared_tmp_dir: str, min_abundance: int = 0, debug: bool = False, cpu: int = 0) → Iterator[tuple[str, str, str, str]]

Classify using edit distance.
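The documentation does not state which edit distance definition or library is used. As a minimal sketch, assuming the common Levenshtein distance (insertions, deletions and substitutions each cost one), a pure Python version looks like this. The function name edit_distance is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("ACGT", "AGT"))  # one deletion apart
```

A classifier built on such a distance would accept database entries within some maximum distance of the query, which is presumably what the dist value set by the setup functions controls.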

thapbi_pict.classify.method_identity(input_seqs: dict[str, str], session, marker_name: str, tmp_dir: str, shared_tmp_dir: str, min_abundance: int = 0, debug: bool = False, cpu: int = 0) → Iterator[tuple[str, str, str, str]]

Classify using perfect identity.

This is a deliberately simple approach, in part for testing purposes. It looks for a perfectly identical entry in the database.
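The idea of perfect-identity classification can be sketched with a plain dict lookup. This is only an illustration: the real function queries the marker database through the session object, whereas the db_species dict, the function name and the note strings here are hypothetical, and the sketch yields three fields rather than the four in the signature above.

```python
def classify_identity_sketch(input_seqs, db_species):
    """Yield (identifier, species, note) for each input sequence.

    db_species is a hypothetical dict mapping upper case marker sequences
    to the set of species names recorded for that exact sequence.
    """
    for identifier, seq in input_seqs.items():
        species = db_species.get(seq.upper())
        if species:
            # Multiple species sharing the sequence are joined with semi-colons.
            yield identifier, ";".join(sorted(species)), "perfect match"
        else:
            yield identifier, "", "no perfect match"


db = {"ACGT": {"Phytophthora infestans"}}
for row in classify_identity_sketch({"read1": "ACGT", "read2": "ACGA"}, db):
    print(row)
```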

thapbi_pict.classify.method_substr(input_seqs: dict[str, str], session, marker_name: str, tmp_dir: str, shared_tmp_dir: str, min_abundance: int = 0, debug: bool = False, cpu: int = 0) → Iterator[tuple[str, str, str, str]]

Classify using perfect identity including as a sub-string.

Like the ‘identity’ method, but allows for a database where the marker has not been trimmed, or has been imperfectly trimmed (e.g. primer mismatch).
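The difference from the identity method is that the query only needs to occur somewhere inside a database entry. A minimal sketch, again using a hypothetical in-memory dict in place of the real database and a hypothetical function name:

```python
def substr_species(seq, db_species):
    """Return sorted species whose database sequence contains seq.

    db_species is a hypothetical dict mapping (possibly untrimmed) upper case
    database sequences to sets of species names.
    """
    hits = set()
    for db_seq, species in db_species.items():
        if seq in db_seq:  # plain substring test, no mismatches allowed
            hits.update(species)
    return sorted(hits)


db = {
    "GGACGTCC": {"Phytophthora infestans"},  # marker with untrimmed flanks
    "CCCC": {"Pythium ultimum"},
}
print(substr_species("ACGT", db))
```

A linear scan like this is fine for illustration, but it costs one substring test per database entry for every query.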

thapbi_pict.classify.perfect_match_in_db(session, marker_name: str, seq: str, debug: bool = False) → tuple[int | str, str, str]

Look up sequence in DB, returning taxid, genus_species and note as a tuple.

If the 100% matches in the DB give multiple species, then taxid and genus_species will be semi-colon separated strings.

thapbi_pict.classify.perfect_substr_in_db(session, marker_name: str, seq: str, debug: bool = False) → tuple[int | str, str, str]

Look up sequence in DB, returning taxid, genus_species and note as a tuple.

If the matches containing the sequence as a substring give multiple species, then taxid and genus_species will be semi-colon separated strings.

thapbi_pict.classify.setup_blast(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0)

Prepare a BLAST DB from the marker sequence DB entries.

thapbi_pict.classify.setup_dist2(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None

Prepare a set of all DB marker sequences; set dist to 2.

thapbi_pict.classify.setup_dist3(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None

Prepare a set of all DB marker sequences; set dist to 3.

thapbi_pict.classify.setup_dist4(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None

Prepare a set of all DB marker sequences; set dist to 4.

thapbi_pict.classify.setup_dist5(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None

Prepare a set of all DB marker sequences; set dist to 5.

thapbi_pict.classify.setup_dist6(session, marker_name, shared_tmp_dir, debug=False, cpu=0)

Prepare a set of all DB marker sequences; set dist to 6.

thapbi_pict.classify.setup_dist7(session, marker_name, shared_tmp_dir, debug=False, cpu=0)

Prepare a set of all DB marker sequences; set dist to 7.

thapbi_pict.classify.setup_dist8(session, marker_name, shared_tmp_dir, debug=False, cpu=0)

Prepare a set of all DB marker sequences; set dist to 8.

thapbi_pict.classify.setup_dist9(session, marker_name, shared_tmp_dir, debug=False, cpu=0)

Prepare a set of all DB marker sequences; set dist to 9.

thapbi_pict.classify.setup_onebp(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None

Prepare a set of all the DB marker sequences; set dist to 1.

thapbi_pict.classify.setup_seqs(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None

Prepare a set of all the DB marker sequences as upper case strings.

Also sets up the set of sequences in the DB, and a dict mapping genus to NCBI taxid.

thapbi_pict.classify.taxid_and_sp_lists(taxon_entries: Iterable) → tuple[int | str, str, str]

Return semi-colon separated summary of the taxonomy objects from DB.

Will discard genus level predictions (e.g. ‘Phytophthora’) if there is a species level prediction within that genus (e.g. ‘Phytophthora infestans’).

If there is a single result, returns a tuple of taxid (integer), genus-species, and debugging comment (strings).

If any of the fields has conflicting values, returns two semi-colon separated strings instead (in the same order so you can match taxid to species, sorting on the genus-species string).

thapbi_pict.classify.unique_or_separated(values: Sequence[str | int], sep: str = ';') → str

Return sole element, or a string joining all elements using the separator.
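The description leaves open what counts as a "sole element". The sketch below assumes it means that all values are equal, in which case that single value is returned as a string; otherwise every value is joined in order. Both the function name and this interpretation are assumptions, not the library's confirmed behaviour.

```python
from collections.abc import Sequence


def unique_or_separated_sketch(values: Sequence, sep: str = ";") -> str:
    """Return the single distinct value, or all values joined by sep."""
    if len(set(values)) == 1:
        # Assumed meaning of "sole element": every entry is identical.
        return str(values[0])
    return sep.join(str(value) for value in values)


print(unique_or_separated_sketch([102, 102]))
print(unique_or_separated_sketch(["infestans", "andina"]))
```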