thapbi_pict.assess module

Assess classification of marker reads at species level.

This implements the thapbi_pict assess ... command.

thapbi_pict.assess.class_list_from_tally_and_db_list(tally: dict[tuple[str, str], int], db_sp_list: list[str]) → list[str]

Sorted list of all class names used in a confusion table dict.

thapbi_pict.assess.extract_binary_tally(class_name: str, tally: dict[tuple[str, str], int]) → tuple[int, int, int, int]

Extract single-class TP, FP, FN, TN from multi-class confusion tally.

Reduces the multi-class expectation/prediction to binary: did they include the class of interest, or not?

Returns a 4-tuple of values, True Positives (TP), False Positives (FP), False Negatives (FN), True Negatives (TN), which sum to the tally total.
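The reduction described above can be sketched as follows. This is an illustration only, not the library's own code: it assumes the tally keys are (expected, predicted) pairs of semi-colon separated species strings, with an empty string meaning no species, and the helper name binary_tally_sketch is hypothetical.

```python
def binary_tally_sketch(class_name, tally):
    """Reduce a multi-class tally to (TP, FP, FN, TN) for one class."""
    tp = fp = fn = tn = 0
    for (expected, predicted), count in tally.items():
        # Assumption: each key is a semi-colon separated species string.
        in_expected = class_name in expected.split(";")
        in_predicted = class_name in predicted.split(";")
        if in_expected and in_predicted:
            tp += count
        elif in_predicted:
            fp += count
        elif in_expected:
            fn += count
        else:
            tn += count
    return tp, fp, fn, tn


# Made-up tally for illustration: (expected, predicted) -> count
example = {("A", "A"): 3, ("A;B", "A"): 2, ("", "B"): 1, ("", ""): 4}
counts_a = binary_tally_sketch("A", example)
counts_b = binary_tally_sketch("B", example)
```

Because each tally entry lands in exactly one of the four bins, the four values add up to the tally total, as stated above.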

thapbi_pict.assess.extract_global_tally(tally: dict[tuple[str, str], int], sp_list: list[str]) → tuple[int, int, int, int]

Process multi-label confusion matrix (tally dict) to TP, FP, FN, TN.

If the input data has no negative controls, there will be no true negatives (TN).

Returns a 4-tuple of values, True Positives (TP), False Positives (FP), False Negatives (FN), True Negatives (TN).

These values are analogous to the classical binary classifier approach, but are NOT the same. Even when applied to single-class expected and predicted values, the results differ:

  • Expect none, predict none - 1xTN

  • Expect none, predict A - 1xFP

  • Expect A, predict none - 1xFN

  • Expect A, predict A - 1xTP

  • Expect A, predict B - 1xFP (the B), 1xFN (missing A)

  • Expect A, predict A&B - 1xTP (the A), 1xFP (the B)

  • Expect A&B, predict A&B - 2xTP

  • Expect A&B, predict A - 1xTP, 1xFN (missing B)

  • Expect A&B, predict A&C - 1xTP (the A), 1xFP (the C), 1xFN (missing B)

The sum of TP, FP, FN and TN can exceed the tally total. For each tally entry, rather than exactly one of TP, FP, FN, TN being incremented (weighted by the tally count), several can be incremented.

If the input data has no negative controls, there will be no TN.
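The counting rules in the bulleted list can be reproduced with set operations on each (expected, predicted) pair. The sketch below is an assumption-laden illustration, not the library's implementation: the name global_tally_sketch is hypothetical, it ignores the sp_list argument of the real function, and it counts a TN only for the "expect none, predict none" case shown in the first bullet. How the real function counts TN across the species list may differ.

```python
def global_tally_sketch(tally):
    """Apply the bulleted counting rules to every tally entry."""
    tp = fp = fn = tn = 0
    for (expected, predicted), count in tally.items():
        # Assumption: semi-colon separated species, empty string = none.
        exp_set = {sp for sp in expected.split(";") if sp}
        pred_set = {sp for sp in predicted.split(";") if sp}
        tp += count * len(exp_set & pred_set)  # expected and predicted
        fp += count * len(pred_set - exp_set)  # predicted but not expected
        fn += count * len(exp_set - pred_set)  # expected but missed
        if not exp_set and not pred_set:
            tn += count  # only the none/none case, per the first bullet
    return tp, fp, fn, tn


# One entry per bullet above, each with a count of one.
cases = {
    ("", ""): 1,
    ("", "A"): 1,
    ("A", ""): 1,
    ("A", "A"): 1,
    ("A", "B"): 1,
    ("A", "A;B"): 1,
    ("A;B", "A;B"): 1,
    ("A;B", "A"): 1,
    ("A;B", "A;C"): 1,
}
totals = global_tally_sketch(cases)
```

With these nine entries the four totals add up to 15, which illustrates how the sum can exceed the tally total.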

thapbi_pict.assess.load_tsv(mapping: dict[tuple[str, str], str], classifier_file: str, min_abundance: int) → dict[tuple[str, str], str]

Update dict mapping of (marker, MD5) to semi-colon separated species string.

thapbi_pict.assess.main(inputs, known, db_url, method, min_abundance, assess_output, map_output, confusion_output, marker=None, ignore_prefixes=None, debug=False)

Implement the (sample/species level) thapbi_pict assess command.

The inputs argument is a list of filenames and/or folders.

Must provide:

  • at least one XXX.<method>.tsv file

  • at least one XXX.<known>.tsv file

These files can be sample-tally based classifier output covering multiple samples, or legacy per-sample <sample>.<known>.tsv files.

thapbi_pict.assess.save_confusion_matrix(tally: dict[tuple[str, str], int], db_sp_list: list[str], sp_list: list[str], filename: str, exp_total: int, debug: bool = False) → None

Output a multi-class confusion matrix as a tab-separated table.

thapbi_pict.assess.save_mapping(tally: dict[tuple[str, str], int], filename: str, debug: bool = False) → None

Output tally table of expected species to predicted species.

thapbi_pict.assess.sp_for_sample(fasta_files: list[str], min_abundance: int, pooled_sp: dict[tuple[str, str], str]) → str

Return semi-colon separated species string from FASTA files via dict.

thapbi_pict.assess.sp_in_tsv(classifier_files: list[str], min_abundance: int) → str

Return semi-colon separated list of species in column 2.

Will ignore genus level predictions.

thapbi_pict.assess.tally_files(expected_file: str, predicted_file: str, min_abundance: int = 0) → dict[tuple[str, str], set[str]]

Make dictionary tally confusion matrix of species assignments.

Rather than the values simply being an integer count, they are the set of MD5 identifiers (take the length for the count).
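As a small illustration of "take the length for the count", a set-valued tally of this shape can be converted to integer counts as below. The helper name counts_from_md5_sets and the MD5 placeholder strings are made up for the example; only the dictionary shape is taken from the signature above.

```python
def counts_from_md5_sets(tally_sets):
    """Convert {(expected, predicted): set of MD5} to integer counts."""
    return {pair: len(md5s) for pair, md5s in tally_sets.items()}


# Placeholder identifiers, not real MD5 checksums.
tally_sets = {
    ("A", "A"): {"md5_1", "md5_2"},
    ("A", "B"): {"md5_3"},
}
counts = counts_from_md5_sets(tally_sets)
```

The result has the dict[tuple[str, str], int] shape taken by the other functions in this module.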