thapbi_pict.classify module
Classifying prepared marker sequences using a marker database.
This implements the thapbi_pict classify ... command.
- thapbi_pict.classify.apply_method_to_seqs(method_fn: Callable, input_seqs: dict[str, str], session, marker_name: str, min_abundance: int = 0, debug: bool = False) → Iterator[tuple[str, str, str, str]]
Call given method on each sequence in the dict.
Assumes any abundance filter has already been applied. Input is a dict of identifiers mapped to upper case sequences.
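The calling pattern described above could be sketched as follows. This is a minimal illustration, not the package's implementation: `apply_to_each` and `toy_lookup` are hypothetical names, the callback is simplified to take only the sequence (the real `method_fn` presumably also needs the DB session and marker name), and the layout of the yielded four-field tuple (identifier, taxid, genus-species, note) is an assumption inferred from the signatures on this page.

```python
from __future__ import annotations

from collections.abc import Callable, Iterator


def apply_to_each(
    lookup: Callable[[str], tuple[int | str, str, str]],
    input_seqs: dict[str, str],
) -> Iterator[tuple[str, str, str, str]]:
    """Yield one result row per input sequence (hypothetical helper)."""
    for identifier, seq in input_seqs.items():
        taxid, genus_species, note = lookup(seq)
        # The taxid is converted to a string so every yielded field is text.
        yield identifier, str(taxid), genus_species, note


def toy_lookup(seq: str) -> tuple[int | str, str, str]:
    """Stand-in classifier backed by a made-up in-memory table."""
    known = {"ACGT": (1, "Examplea alpha")}  # invented sequence and species
    if seq in known:
        taxid, name = known[seq]
        return taxid, name, "Perfect match"
    return 0, "", "No match"


rows = list(apply_to_each(toy_lookup, {"seq1": "ACGT", "seq2": "TTTT"}))
```

Writing the helper as a generator lets the caller stream results to a file one row at a time instead of building the full list in memory.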
- thapbi_pict.classify.consoliate_and_sort_taxonomy(genus_species_taxid: Iterable[tuple[str, str, int]]) → list[tuple[str, str, int]]
Remove any redundant entries and return a new sorted list.
Drops zero-taxid entries if there is a matching non-zero entry.
Drops genus-only entries if there are species-level entries. Note that this ignores the TaxID; confirming that the genus being removed does have species-level children in the prediction set would require knowing the parent/child relationship.
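The two pruning rules above might be sketched like this. It is an illustrative reading of the description only: `consolidate_sketch` is a hypothetical name, an empty species string is assumed to mark a genus-only entry, "matching" is assumed to mean the same genus and species names, and a genus-only entry is dropped only when a species-level entry shares its genus name. The taxid values in the example are placeholders.

```python
from __future__ import annotations

from collections.abc import Iterable


def consolidate_sketch(
    entries: Iterable[tuple[str, str, int]],
) -> list[tuple[str, str, int]]:
    """Deduplicate (genus, species, taxid) tuples and return them sorted."""
    unique = set(entries)

    # Rule 1: drop zero-taxid entries when the same name has a non-zero taxid.
    with_taxid = {(g, s) for g, s, t in unique if t}
    unique = {(g, s, t) for g, s, t in unique if t or (g, s) not in with_taxid}

    # Rule 2: drop genus-only entries when that genus has a species-level
    # entry. Matching is by genus name only; the taxid is ignored here.
    genera_with_species = {g for g, s, _ in unique if s}
    unique = {
        (g, s, t) for g, s, t in unique if s or g not in genera_with_species
    }
    return sorted(unique)


example = consolidate_sketch(
    [
        ("Phytophthora", "", 101),
        ("Phytophthora", "infestans", 102),
        ("Phytophthora", "infestans", 0),
        ("Pythium", "", 0),
    ]
)
```

Here the zero-taxid duplicate of Phytophthora infestans and the genus-only Phytophthora entry are both removed, while the lone Pythium entry survives because nothing more specific replaces it.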
- thapbi_pict.classify.main(inputs: list[str], session, marker_name: str, method: str, out_dir: str, ignore_prefixes: tuple[str], tmp_dir: str, min_abundance: int = 0, biom=False, debug: bool = False, cpu: int = 0) → list[str | None]
Implement the thapbi_pict classify command.
For use in the pipeline command, returns a filename list of the TSV classifier output.
The input files should have been prepared with the same or a lower minimum abundance - this acts as an additional filter useful if exploring the best threshold.
- thapbi_pict.classify.method_blast(input_seqs: dict[str, str], session, marker_name: str, tmp_dir: str, shared_tmp_dir: str, min_abundance: int = 0, debug: bool = False, cpu: int = 0) → Iterator[tuple[str, str, str, str]]
Classify using BLAST.
Another simplistic classifier, which runs the reads through blastn against a BLAST database built from our marker sequence database entries.
- thapbi_pict.classify.method_cleanup() → None
Free any memory and/or delete any files on disk.
There is currently no need to generalise this for the different classifiers, but it could be generalised if, for example, we also needed to delete any files on disk.
- thapbi_pict.classify.method_dist(input_seqs: dict[str, str], session, marker_name: str, tmp_dir: str, shared_tmp_dir: str, min_abundance: int = 0, debug: bool = False, cpu: int = 0) → Iterator[tuple[str, str, str, str]]
Classify using edit distance.
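As a rough illustration of classifying by edit distance, the sketch below uses the textbook Levenshtein dynamic programme and keeps the closest database entries within a cut-off. It does not claim to reflect how the package computes distances; `edit_distance`, `closest_within`, `toy_db`, and the species names are all hypothetical, and `max_dist` is only loosely analogous to the "dist" setting mentioned for the setup functions.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion from a
                    curr[j - 1] + 1,  # insertion into a
                    prev[j - 1] + (ca != cb),  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def closest_within(
    query: str, db: dict[str, str], max_dist: int
) -> tuple[int | None, list[str]]:
    """Return the smallest distance found (or None) and the names at it."""
    best: int | None = None
    names: list[str] = []
    for db_seq, name in db.items():
        dist = edit_distance(query, db_seq)
        if dist > max_dist:
            continue  # too far away to count as a match
        if best is None or dist < best:
            best, names = dist, [name]
        elif dist == best:
            names.append(name)
    return best, sorted(names)


# Invented sequences and species names, for illustration only.
toy_db = {"ACGTACGT": "Examplea alpha", "ACGTTCGT": "Examplea beta"}
hit = closest_within("ACGTACGA", toy_db, 1)
miss = closest_within("GGGGGGGG", toy_db, 3)
```

Comparing the query against every database entry is quadratic per pair and linear in database size, so a real classifier would likely use an optimised library and pre-filtering.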
- thapbi_pict.classify.method_identity(input_seqs: dict[str, str], session, marker_name: str, tmp_dir: str, shared_tmp_dir: str, min_abundance: int = 0, debug: bool = False, cpu: int = 0) → Iterator[tuple[str, str, str, str]]
Classify using perfect identity.
This is a deliberately simple approach, in part for testing purposes. It looks for a perfectly identical entry in the database.
- thapbi_pict.classify.method_substr(input_seqs: dict[str, str], session, marker_name: str, tmp_dir: str, shared_tmp_dir: str, min_abundance: int = 0, debug: bool = False, cpu: int = 0) → Iterator[tuple[str, str, str, str]]
Classify using perfect identity including as a sub-string.
Like the ‘identity’ method, but allows for a database where the marker has not been trimmed, or has been imperfectly trimmed (e.g. primer mismatch).
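The difference between the identity and sub-string approaches can be seen in a small sketch. It assumes a hypothetical in-memory mapping from database sequence to a set of species names in place of the real marker database; `exact_match`, `substring_match`, `toy_db`, and the species names are invented for illustration.

```python
from __future__ import annotations


def exact_match(query: str, db: dict[str, set[str]]) -> list[str]:
    """Species whose database sequence is identical to the query."""
    return sorted(db.get(query, set()))


def substring_match(query: str, db: dict[str, set[str]]) -> list[str]:
    """Species whose database sequence contains the query anywhere."""
    hits: set[str] = set()
    for db_seq, species in db.items():
        if query in db_seq:
            hits.update(species)
    return sorted(hits)


# The second entry mimics an untrimmed database sequence with extra
# flanking bases around the marker.
toy_db = {
    "ACGTACGT": {"Examplea alpha"},
    "TTACGTACGTGG": {"Examplea beta"},
}
identity_hits = exact_match("ACGTACGT", toy_db)
substr_hits = substring_match("ACGTACGT", toy_db)
```

The exact lookup misses the untrimmed entry, while the sub-string search finds it at the cost of scanning every database sequence.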
- thapbi_pict.classify.perfect_match_in_db(session, marker_name: str, seq: str, debug: bool = False) → tuple[int | str, str, str]
Look up a sequence in the DB, returning taxid, genus_species, and note as a tuple.
If the 100% matches in the DB give multiple species, then taxid and genus_species will be semi-colon separated strings.
- thapbi_pict.classify.perfect_substr_in_db(session, marker_name: str, seq: str, debug: bool = False) → tuple[int | str, str, str]
Look up a sequence in the DB, returning taxid, genus_species, and note as a tuple.
If the matches containing the sequence as a substring give multiple species, then taxid and genus_species will be semi-colon separated strings.
- thapbi_pict.classify.setup_blast(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0)
Prepare a BLAST DB from the marker sequence DB entries.
- thapbi_pict.classify.setup_dist2(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None
Prepare a set of all DB marker sequences; set dist to 2.
- thapbi_pict.classify.setup_dist3(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None
Prepare a set of all DB marker sequences; set dist to 3.
- thapbi_pict.classify.setup_dist4(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None
Prepare a set of all DB marker sequences; set dist to 4.
- thapbi_pict.classify.setup_dist5(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None
Prepare a set of all DB marker sequences; set dist to 5.
- thapbi_pict.classify.setup_dist6(session, marker_name, shared_tmp_dir, debug=False, cpu=0)
Prepare a set of all DB marker sequences; set dist to 6.
- thapbi_pict.classify.setup_dist7(session, marker_name, shared_tmp_dir, debug=False, cpu=0)
Prepare a set of all DB marker sequences; set dist to 7.
- thapbi_pict.classify.setup_dist8(session, marker_name, shared_tmp_dir, debug=False, cpu=0)
Prepare a set of all DB marker sequences; set dist to 8.
- thapbi_pict.classify.setup_dist9(session, marker_name, shared_tmp_dir, debug=False, cpu=0)
Prepare a set of all DB marker sequences; set dist to 9.
- thapbi_pict.classify.setup_onebp(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None
Prepare a set of all the DB marker sequences; set dist to 1.
- thapbi_pict.classify.setup_seqs(session, marker_name: str, shared_tmp_dir: str, debug: bool = False, cpu: int = 0) → None
Prepare a set of all the DB marker sequences as upper case strings.
Also sets up the set of sequences in the DB, and a dict of genus to NCBI taxid.
- thapbi_pict.classify.taxid_and_sp_lists(taxon_entries: Iterable) → tuple[int | str, str, str]
Return semi-colon separated summary of the taxonomy objects from DB.
Will discard genus level predictions (e.g. ‘Phytophthora’) if there is a species level prediction within that genus (e.g. ‘Phytophthora infestans’).
If there is a single result, returns a tuple of taxid (integer), genus-species, and debugging comment (strings).
If any of the fields has conflicting values, returns two semi-colon separated strings instead (in the same order so you can match taxid to species, sorting on the genus-species string).
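The "same order" guarantee for the two semi-colon separated strings might look like the sketch below. It simplifies heavily: entries are assumed to be plain (taxid, genus-species) pairs rather than DB taxonomy objects, the debugging comment is omitted, and `summarise_sketch` is a hypothetical name. The taxid values are placeholders, not real NCBI identifiers.

```python
from __future__ import annotations

from collections.abc import Iterable


def summarise_sketch(
    entries: Iterable[tuple[int, str]],
) -> tuple[int | str, str]:
    """Collapse (taxid, genus_species) pairs into one summary pair."""
    # Sort on the genus-species string so both outputs share one order.
    ordered = sorted(set(entries), key=lambda pair: (pair[1], pair[0]))
    if len(ordered) == 1:
        # Single result: keep the taxid as an integer.
        return ordered[0]
    taxids = ";".join(str(taxid) for taxid, _ in ordered)
    names = ";".join(name for _, name in ordered)
    return taxids, names


single = summarise_sketch([(201, "Phytophthora infestans")])
multiple = summarise_sketch(
    [(202, "Phytophthora sojae"), (201, "Phytophthora infestans")]
)
```

Because both strings are built from the same sorted list, the n-th taxid always corresponds to the n-th species name.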
- thapbi_pict.classify.unique_or_separated(values: Sequence[str | int], sep: str = ';') → str
Return sole element, or a string joining all elements using the separator.
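A helper with this behaviour could be sketched as below. `sole_or_joined` is a hypothetical name, and treating repeated identical values as a "sole element" is an assumption about the intended meaning rather than documented behaviour.

```python
from __future__ import annotations

from collections.abc import Sequence


def sole_or_joined(values: Sequence[str | int], sep: str = ";") -> str | int:
    """Return the single distinct value, else all values joined by sep."""
    if len(set(values)) == 1:
        # Assumption: duplicates of one value count as a sole element.
        return values[0]
    return sep.join(str(value) for value in values)


one = sole_or_joined([4787])
several = sole_or_joined(["a", "b"])
custom = sole_or_joined([1, 2], sep=",")
```

Note that the sketch returns the sole element unchanged, so an integer input stays an integer rather than being converted to a string.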