thapbi_pict.prepare module

Prepare raw amplicon sequencing reads (trimming, merging, etc).

This implements the thapbi_pict prepare-reads ... command.

thapbi_pict.prepare.find_fastq_pairs(filenames_or_folders: list[str], ext: tuple[str, ...] = ('.fastq', '.fastq.gz', '.fq', '.fq.gz'), ignore_prefixes: tuple[str] | None = None, debug: bool = False) list[tuple[str, str, str]]

Interpret a list of filenames and/or foldernames.

Returns a list of tuples (stem, left filename, right filename) where stem is intended for use in logging and output naming, and may include a directory name.

The filenames will be normalised relative to the current directory (so that we can directly compare file lists which could have been defined inconsistently by the user).

Will ignore “-” if present in the inputs.

thapbi_pict.prepare.load_marker_defs(session, spike_genus: str = '') dict[str, dict[str, int | str | list[tuple[str, str, set[str]]]]]

Load marker definitions and any spike-in sequences from the DB.

thapbi_pict.prepare.main(fastq: list[str], out_dir: str, session, flip: bool = False, min_abundance: int = 2, min_abundance_fraction: float = 0.0, ignore_prefixes: tuple[str] | None = None, merged_cache: str | None = None, tmp_dir: str | None = None, debug: bool = False, cpu: int = 0) list[str]

Implement the thapbi_pict prepare-reads command.

For use in the pipeline command, returns a filename listing of the FASTA files created.

thapbi_pict.prepare.make_nr_fasta(input_fasta_or_fastq: str, output_fasta: str, min_abundance: int = 0, min_len: int = 0, max_len: int = 9223372036854775807, weighted_input: bool = False, fastq: bool = False, gzipped: bool = False, header_dict: dict[str, str | int | None] | None = None, debug: bool = False) tuple[int, int, int, int]

Trim and make non-redundant FASTA/Q file from FASTA input.

Makes a non-redundant FASTA file with the sequences named >MD5_abundance\n.

For FASTQ files all input reads are treated as abundance one (using weighted_input=True gives an error).

If FASTA input and weighted_input=True, reads must follow >identifier_abundance\n naming and the abundance is used. Otherwise all treated as abundance one.

Makes a non-redundant FASTA file with the sequences named >MD5_abundance\n.

Returns the total number of accepted reads before de-duplication (integer), number of those unique (integer), and the total number of those which passed the minimum abundance threshold (integer), and number of those which are unique (integer).

thapbi_pict.prepare.marker_cut(marker_definitions, file_pairs: list[tuple[str, str, str]], out_dir: str, merged_cache: str, tmp: str, flip: bool, min_abundance: int, min_abundance_fraction: float, debug: bool = False, cpu: int = 0) list[str]

Apply primer-trimming for given markers.

thapbi_pict.prepare.merge_paired_reads(raw_R1: str, raw_R2: str, merged_fasta_gz: str, tmp: str, debug: bool = False, cpu: int = 0) tuple[int, int]

Create NR FASTA file by overlap merging the paired FASTQ files.

thapbi_pict.prepare.parse_cutadapt_stdout(stdout: str) tuple[int, int]

Extract FASTA count before and after cutadapt.

>>> parse_cutadapt_stdout(
...     "...\n"
...     "Total reads processed: 5,869\n"
...     "...\n"
...     "Reads written (passing filters): 5,861 (99.9%)\n"
...     "..."
... )
(5869, 5861)
thapbi_pict.prepare.parse_flash_stdout(stdout: str) tuple[int, int]

Extract FASTQ pair count before/after running flash.

>>> parse_flash_stdout(
...     "...\n"
...     "[FLASH] Read combination statistics:[FLASH]     Total pairs:      6105\n"
...     "[FLASH]     Combined pairs:   5869\n"
...     "..."
... )
(6105, 5869)
thapbi_pict.prepare.prepare_sample(fasta_name: str, trimmed_fasta: str, headers: dict[str, int | str | None], min_len: int, max_len: int, min_abundance: int, min_abundance_fraction: float, tmp: str, debug: bool = False, cpu: int = 0) tuple[int | None, int | None, int | None, int]

Create marker-specific FASTA file for sample from paired FASTQ.

Applies abundance threshold, and min/max length.

Returns pre-threshold total read count, accepted unique sequence count, accepted total read count, and the absolute abundance threshold used (higher of the given absolute threshold or the given fractional threshold).

thapbi_pict.prepare.run_cutadapt(long_in: str, out_template: str, marker_definitions: dict[str, Any], flip: bool = False, debug: bool = False, cpu: int = 0) tuple[int, int]

Run cutadapt on a single file (i.e. after merging paired FASTQ).

The input and/or output files may be compressed as long as they have an appropriate suffix (e.g. gzipped with .gz suffix).

Returns FASTA count before and after cutadapt.

thapbi_pict.prepare.run_flash(trimmed_R1: str, trimmed_R2: str, output_dir: str, output_prefix: str, debug: bool = False, cpu: int = 0) tuple[int, int]

Run FLASH on a pair of trimmed FASTQ files to merge overlapping pairs.

Returns two integers, FASTQ pair count for input and output files.

thapbi_pict.prepare.save_nr_fasta(counts: dict[str, int], output_fasta: str, min_abundance: int = 0, gzipped: bool = False, header_dict: dict[str, str | int | None] | None = None) tuple[int, int]

Save a dictionary of sequences and counts as a FASTA file.

Writes a FASTA file with header lines starting # (which not all tools will accept as valid FASTA format).

The output FASTA records are named >MD5_abundance\n, which is the default style used in SWARM. This could in future be generalised, for example >MD5;size=abundance;\n for the VSEARCH default.

Results are sorted by decreasing abundance then alphabetically by sequence.

Returns the total and number of unique sequences accepted (above any minimum abundance specified).

Use output_fasta=’-’ for standard out.