thapbi_pict.denoise module

Apply UNOISE read-correction to denoise FASTA file(s).

This implements the thapbi_pict denoise ... command, which is a simplified version of the thapbi_pict sample-tally ... command intended to be easier to use outside the THAPBI PICT pipeline.

thapbi_pict.denoise.main(inputs: str | list[str], output: str, denoise_algorithm: str, total_min_abundance: int = 0, min_length: int = 0, max_length: int = 9223372036854775807, unoise_alpha: float | None = None, unoise_gamma: int | None = None, gzipped: bool = False, tmp_dir: str | None = None, debug: bool = False, cpu: int = 0)

Implement the thapbi_pict denoise command.

This is a simplified version of the thapbi_pict sample-tally command which pools one or more FASTA input files before running the UNOISE read correction algorithm to denoise the dataset. The input sequences should use the SWARM <prefix>_<abundance> style naming, which is used on output (taking the first loaded name if a sequence appears more than once).
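As a rough illustration of the naming convention and pooling step described above, the following sketch parses SWARM-style `<prefix>_<abundance>` names and pools abundances across inputs. The helper names `parse_swarm_name` and `pool_records` are hypothetical and are not part of thapbi_pict; the real loader may differ in details such as case handling.

```python
from collections import Counter


def parse_swarm_name(name: str) -> tuple[str, int]:
    """Split a SWARM-style ``<prefix>_<abundance>`` name (hypothetical helper)."""
    # Split on the last underscore so that prefixes may contain underscores.
    prefix, _, abundance = name.rpartition("_")
    return prefix, int(abundance)


def pool_records(records):
    """Pool (name, sequence) pairs from one or more FASTA files.

    Returns total abundance per sequence, plus the prefix of the first
    name seen for each sequence (mirroring "first loaded name" above).
    """
    counts = Counter()
    first_prefix = {}
    for name, seq in records:
        prefix, abundance = parse_swarm_name(name)
        counts[seq] += abundance
        first_prefix.setdefault(seq, prefix)  # keep only the first name seen
    return counts, first_prefix


records = [("a_10", "ACGT"), ("b_5", "ACGT"), ("c_3", "ACGA")]
counts, first_prefix = pool_records(records)
```

Here `ACGT` appears twice, so its pooled abundance is 15 and it keeps the prefix `a` from its first occurrence.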

Arguments min_length and max_length are applied while loading the input FASTA file(s).

Argument total_min_abundance is applied after read correction.
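The order of the two filters can be sketched as below. This is an assumption-laden illustration, not the module's code: the function names are hypothetical, and treating both the length bounds and the abundance threshold as inclusive is a guess. The default `max_length` of 9223372036854775807 in the signature equals `sys.maxsize` on 64-bit platforms, which in practice means no upper limit.

```python
import sys


def filter_by_length(counts, min_length=0, max_length=sys.maxsize):
    """Length filter applied while loading (bounds assumed inclusive)."""
    return {
        seq: n for seq, n in counts.items()
        if min_length <= len(seq) <= max_length
    }


def apply_min_abundance(totals, total_min_abundance=0):
    """Abundance filter applied after read correction (assumed inclusive)."""
    return {seq: n for seq, n in totals.items() if n >= total_min_abundance}


loaded = filter_by_length({"ACGT": 7, "AC": 9, "ACGTACGT": 2}, min_length=4)
# ... read correction would happen here, updating the totals ...
kept = apply_min_abundance(loaded, total_min_abundance=5)
```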

Results are sorted by decreasing abundance, then alphabetically by sequence.
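The stated output ordering can be expressed with a compound sort key. This is a minimal sketch; the function name `sort_by_abundance` is hypothetical.

```python
def sort_by_abundance(counts):
    """Sort (sequence, abundance) pairs: highest abundance first, ties by sequence."""
    # Negating the abundance gives descending order; the sequence breaks ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ordered = sort_by_abundance({"ACGT": 5, "AAAA": 5, "CCCC": 9})
```

With this input, `CCCC` comes first because it has the highest abundance, and the two sequences tied at 5 appear in alphabetical order.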

thapbi_pict.denoise.read_correction(algorithm: str, counts: dict[str, int], unoise_alpha: float | None = None, unoise_gamma: int | None = None, abundance_based: bool = False, tmp_dir: str | None = None, debug: bool = False, cpu: int = 0) → tuple[dict[str, str], dict[str, str]]

Apply the built-in UNOISE algorithm or invoke an external tool like VSEARCH.

Argument algorithm is a string, “unoise-l” for our reimplementation of the UNOISE2 algorithm, or “usearch” or “vsearch” to invoke those tools at the command line.

Argument counts is an (unsorted) dict of sequences (for the same amplicon marker) as keys, with their total abundance counts as values.

Returns a dict mapping input sequences to centroid sequences, and a dict of any chimeras detected (empty for some algorithms).
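One way to consume the first returned dict is to fold each input sequence's abundance into its centroid. The sketch below assumes the mapping contains every sequence that should be counted, including centroids mapped to themselves; how the module treats sequences missing from the mapping is not stated here. The function name `totals_per_centroid` is hypothetical.

```python
from collections import Counter


def totals_per_centroid(counts, corrections):
    """Sum the abundance of each input sequence into its centroid."""
    totals = Counter()
    for seq, centroid in corrections.items():
        totals[centroid] += counts[seq]
    return dict(totals)


counts = {"AAAA": 100, "AAAT": 3, "CCCC": 50}
corrections = {"AAAA": "AAAA", "AAAT": "AAAA", "CCCC": "CCCC"}
totals = totals_per_centroid(counts, corrections)
```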

thapbi_pict.denoise.unoise(counts: dict[str, int], unoise_alpha: float | None = 2.0, unoise_gamma: int | None = 4, abundance_based: bool = False, debug: bool = False) → tuple[dict[str, str], dict[str, str]]

Apply UNOISE2 algorithm.

Argument counts is an (unsorted) dict of sequences (for the same amplicon marker) as keys, with their total abundance counts as values.

If not specified (i.e. set to zero or None), unoise_alpha defaults to 2.0 and unoise_gamma to 4.
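The default handling and the role of alpha can be sketched as follows. The fallback logic follows the sentence above; the skew threshold beta(d) = 1 / 2^(alpha * d + 1) is taken from the published description of UNOISE2, and it is an assumption that this module applies it in exactly this form. All function names are hypothetical, and gamma is not modelled beyond its default.

```python
def resolve_unoise_params(unoise_alpha=None, unoise_gamma=None):
    """Fall back to the documented defaults when a value is zero or None."""
    alpha = unoise_alpha if unoise_alpha else 2.0
    gamma = unoise_gamma if unoise_gamma else 4
    return alpha, gamma


def max_skew(distance, alpha=2.0):
    """UNOISE2 abundance skew threshold: 1 / 2**(alpha * distance + 1)."""
    return 1.0 / 2 ** (alpha * distance + 1)


def is_noisy_variant(abundance, centroid_abundance, distance, alpha=2.0):
    """True if a sequence is rare enough to be merged into the centroid."""
    return abundance / centroid_abundance <= max_skew(distance, alpha)
```

With alpha at 2.0, a sequence one edit away from a centroid is treated as noise only if its abundance is at most one eighth of the centroid's, so `is_noisy_variant(3, 100, 1)` holds while `is_noisy_variant(50, 100, 1)` does not.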

Returns a dict mapping input sequences to centroid sequences, and an empty dict (no chimera detection performed).

thapbi_pict.denoise.usearch(counts: dict[str, int], unoise_alpha: float | None = None, unoise_gamma: int | None = None, abundance_based: bool = False, tmp_dir: str | None = None, debug: bool = False, cpu: int = 0) → tuple[dict[str, str], dict[str, str]]

Invoke USEARCH to run its implementation of the UNOISE3 algorithm.

Assumes USEARCH v10 or v11 (or later, if the command line API is the same). Parses the four-column tab-separated output.

Returns a dict mapping input sequences to centroid sequences, and a dict of MD5 checksums of any sequences flagged as chimeras.
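For readers unfamiliar with using checksums as sequence identifiers, the sketch below shows how an MD5 checksum of a sequence can be computed with the standard library. Upper-casing the sequence before hashing is an assumption made here for illustration, and the helper name `md5_of_sequence` is hypothetical; the exact layout of the chimera dict is not specified above.

```python
import hashlib


def md5_of_sequence(seq: str) -> str:
    """Return the hex MD5 digest of a sequence (case-insensitive by assumption)."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()


# Index sequences by checksum so that flagged checksums can be mapped back.
sequences = ["ACGT", "ACGA"]
by_checksum = {md5_of_sequence(seq): seq for seq in sequences}
```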

thapbi_pict.denoise.vsearch(counts: dict[str, int], unoise_alpha: float | None = None, unoise_gamma: int | None = None, abundance_based: bool = True, tmp_dir: str | None = None, debug: bool = False, cpu: int = 0) → tuple[dict[str, str], dict[str, str]]

Invoke VSEARCH to run its reimplementation of the UNOISE3 algorithm.

Argument counts is an (unsorted) dict of sequences (for the same amplicon marker) as keys, with their total abundance counts as values.

Returns a dict mapping input sequences to centroid sequences, and a dict of MD5 checksums of any sequences flagged as chimeras.