thapbi_pict.utils module

Helper functions for THAPB-PICT code.

thapbi_pict.utils.abundance_filter_fasta(input_fasta: str, output_fasta: str, min_abundance: int) → None: Apply a minimum abundance filter to a FASTA file.

thapbi_pict.utils.abundance_from_read_name(text: str, debug: bool = False) → int

Extract abundance from SWARM style read name.

>>> abundance_from_read_name("9e8f051c64c2b9cc3b6fcb27559418ca_988")
988

If fails, will return one.

thapbi_pict.utils.abundance_values_in_fasta(fasta_file: str, gzipped: bool = False) → tuple[int, int, dict[str, int]]: Return unique count, total abundance, and maximum abundances by spike-in.

thapbi_pict.utils.cmd_as_string(cmd)

Express a list command as a suitably quoted string.

Intended for using in debugging or error messages.

thapbi_pict.utils.expand_IUPAC_ambiguity_codes(seq: str)

Convert to upper case and iterate over possible unabmigous interpretations.

This is a crude recursive implementation, intended for use on sequences with just a few ambiguity codes in them - it may not scale very well!

thapbi_pict.utils.export_sample_biom(output_file: str, seqs: dict[tuple[str, str], str], seq_meta: dict[tuple[str, str], dict], sample_meta: dict[str, dict[str, str | int | None]], counts: dict[tuple[str, str, str], int], gzipped: bool = True) → bool

Export a sequence vs samples counts BIOM table, with metadata.

Similar to the export_sample_tsv file (our TSV output), expects same arguments as loaded from one of our TSV files via the parse_sample_tsv function.

Will save a BIOM v2 HDF5 file if possible and return True. If output fails (e.g. cannot import the biom Python library), returns False.

thapbi_pict.utils.export_sample_tsv(output_file: str, seqs: dict[tuple[str, str], str], seq_meta: dict[tuple[str, str], dict], sample_meta: dict[str, dict[str, str]], counts: dict[tuple[str, str, str], int], gzipped: bool = False) → None

Export a sequence vs sample counts TSV table, with metadata.

The TSV file ought to be readable by the parse_sample_tsv function, and is first generated in our pipeline by the sample-tally command, and then extended by the classify command to add taxonomic sequence metadata.

If the output tabular file argument is “-”, it writes to stdout (not supported with gzipped mode).

With no sequence metadata this should be accepted as a TSV BIOM file.

thapbi_pict.utils.file_to_sample_name(filename: str) → str

Given filename (with or without a directory name), return sample name only.

i.e. XXX.fasta, XXX.fastq.gz, XXX.method.tsv –> XXX

thapbi_pict.utils.find_paired_files(filenames_or_folders, ext1, ext2, ignore_prefixes=None, debug=False, strict=False)

Interpret a list of filenames and/or foldernames to find pairs.

Looks for paired files named XXX.ext1 and XXX.ext2 which can be in different directories - duplicated filenames (in different directories) are considered to be an error.

Having XXX.ext1 without XXX.ext2 is an error in strict mode, or a warning in debug mode, otherwise silently ignored.

Having XXX.ext2 without XXX.ext1 is silently ignored.

The arguments ext1 and ext2 should include the leading dot.

thapbi_pict.utils.find_requested_files(filenames_or_folders: list[str], ext: str | tuple[str, ...] = '.fasta', ignore_prefixes: tuple[str] | None = None, debug: bool = False) → list[str]

Interpret a list of filenames and/or foldernames.

The extensions argument can be a tuple.

thapbi_pict.utils.genus_species_name(genus: str, species: str) → str

Return name, genus with species if present.

Copes with species being None (or empty string).

thapbi_pict.utils.genus_species_split(name: str) → tuple[str, str]

Return (genus, species) splitting on first space.

If there are no spaces, returns (name, ‘’) instead.

thapbi_pict.utils.is_spike_in(sequence: str, spikes: list[tuple[str, str, set[str]]]) → str: Return spike-in name if sequence matches, else empty string.

thapbi_pict.utils.iskeyword(): x.__contains__(y) <==> y in x.

thapbi_pict.utils.kmers(sequence: str, k: int = 31) → set[str]: Make set of all kmers in the given sequence.

thapbi_pict.utils.load_fasta_header(fasta_file, gzipped=False) → dict: Parse our FASTA hash-comment line header as a dict.

thapbi_pict.utils.load_metadata(metadata_file, metadata_encoding, metadata_cols, metadata_groups=None, metadata_name_row=1, metadata_index=0, metadata_index_sep=';', ignore_prefixes=('Undetermined',), debug=False)

Load specified metadata as several lists.

The encoding argument can be None or “”, meaning use the default.

The columns argument should be a string like “1,3,5” - a comma separated list of columns to output. The column numbers are assumed to be one-based as provided by the command line user.

The name row indicates which row in the table contains the names or descriptions of the metadata columns (one-based).

The index column is assumed to contain one or more sequenced sample names separated by the character specified (default is semi-colon). This one-to-many mapping reflecting that a single field sample could be sequenced more than once (e.g. technical replicates). These sample names are matched against the file name stems, see function find_metadata.

The metadata table rows are sorted based on the requested columns.

Return values:

Dict mapping FASTQ stems to metadata tuples
Ordered dict mapping metadata tuples to lists of FASTQ stems
list of the N field names
Color grouping offset into the N values

thapbi_pict.utils.md5_hexdigest(filename: str, chunk_size: int = 1024) → str: Return the MD5 hex-digest of the given file.

thapbi_pict.utils.md5seq(seq: str) → str

Return MD5 32-letter hex digest of the (upper case) sequence.

>>> md5seq("ACGT")
'f1f8f4bf413b16ad135722aa4591043e'

thapbi_pict.utils.onebp_deletions(seq: str) → set[str]

Generate all variants of sequence with 1bp deletion.

Assumes unambiguous IUPAC codes A, C, G, T only.

thapbi_pict.utils.onebp_inserts(seq: str) → set[str]

Generate all variants of sequence with 1bp insert.

Assumes unambiguous IUPAC codes A, C, G, T only.

thapbi_pict.utils.onebp_substitutions(seq: str)

Generate all 1bp substitutions of the sequence.

Assumes unambiguous IUPAC codes A, C, G, T only.

thapbi_pict.utils.parse_sample_tsv(tabular_file: str, min_abundance: int = 0, debug: bool = False, force_upper: bool = True) → tuple[dict[tuple[str, str], str], dict[tuple[str, str], dict[str, str]], dict[str, dict[str, str]], dict[tuple[str, str, str], int]]

Parse file of sample abundances and sequence (etc).

Optional argument min_abundance is applied to the per sequence per sample values (i.e. the matrix elements, not the row/column totals).

Columns are: * Sequence label, <marker>/<identifier>_<abundance> * Column per sample giving the sequence count * Sequence itself * Optional additional columns for sequence metadata (e.g. chimera flags)

Supports optional sample metadata header too as # prefixed header lines.

Returns dictionaries of: * Sequence keyed on [<marker>, <identitifer>], string * Sequence metadata keyed [<marker>, <identitifer>], dict of key:value pairs * Sample metadata keyed on [<sample>], dict of key:value pairs * Counts keyed on 3-tuple [<marker>, <identifier>, <sample>], integer

thapbi_pict.utils.parse_species_tsv(tabular_file, min_abundance=0, req_species_level=False, allow_wildcard=False) → Iterator[tuple[str | None, str, str, str]]

Parse file of species assignments/predictions by sequence.

Yields tuples of marker name (from the file header line), sequence name, taxid, and genus_species.

thapbi_pict.utils.primer_clean(primer: str) → str

Handle non-IUPAC entries in primers, maps I for inosine to N.

>>> primer_clean("I")
'N'

Inosine is found naturally at the wobble position of tRNA, and can match any base. Structurally similar to guanine (G), it preferentially binds cytosine (C). It sometimes used in primer design (Ben-Dov et al, 2006), where degeneracy N would give similar results.

thapbi_pict.utils.reject_species_name(species: str) → bool

Reject species names like ‘environmental samples’ or ‘uncultured …’.

Will also reject names with “;” in them as used in the classifier and reports to combine multiple species entries.

thapbi_pict.utils.run(cmd, debug: bool = False, attempts: int = 1) → CompletedProcess

Run a command via subprocess, abort if fails.

Returns a subprocess.CompletedProcess object, or None if all attempts fail.

thapbi_pict.utils.species_level(prediction: str) → bool

Is this prediction at species level.

Returns True for a binomial name (at least one space), False for genus only or no prediction.

thapbi_pict.utils.split_read_name_abundance(text: str, debug: bool = False) → tuple[str, int]

Split SWARM style read name into prefix and abundance.

>>> split_read_name_abundance("9e8f051c64c2b9cc3b6fcb27559418ca_988")
('9e8f051c64c2b9cc3b6fcb27559418ca', 988)

If fails to detect the abundance, will return the original text as the prefix with an abundance of 1.

thapbi_pict.utils.valid_marker_name(text: str) → bool

Check the proposed string valid for use as a marker name.

Want to be able to use the string for file or directory names, and also column names etc in reports. At very least should reject whitespace, line breaks, and slashes.

Also rejecting all digits, as might want to accept integers as argument (e.g. cluster array job mapping task numbers to marker numbers).

Also rejecting the underscore, as may want to use it as a field separator in sequence names (e.g. marker_md5_abundance), and full stop as may use it as a field separator in filenames.

May want to relax this later, thus defining this central function.