Marker data

Either clone the THAPBI PICT source code repository, or decompress the latest source code release (.tar.gz file). You should find it contains a directory examples/oomycete_rps10/ which is for this example.

Shell scripts setup.sh and run.sh should reproduce the analysis discussed.

The documentation goes through running each step of the analysis gradually, including building a combined database, before calling pipeline command. We provide script run.sh to do the final run-though automatically, but encourage you to follow along the individual steps first.

FASTQ data

File PRJNA699663.tsv was download from the ENA and includes the FASTQ checksums, URLs, and sample metadata. With a little scripting to extract the relevant sample metadata for use with THAPBI PICT this was reformatted as metadata.tsv (see below).

Script setup.sh will download the raw FASTQ files for Foster et al. (2021) from https://www.ebi.ac.uk/ena/data/view/PRJNA699663 - you could also use https://www.ncbi.nlm.nih.gov/bioproject/PRJNA699663/ to get this.

It will download 64 raw FASTQ files (32 pairs), taking about 2.3GB on disk

If you have the md5sum tool installed (standard on Linux; we suggest conda install coreutils to install this on macOS), verify the FASTQ files downloaded correctly:

$ cd raw_data/
$ md5sum -c MD5SUM.txt
...
$ cd ..

There is no need to decompress the files.

Amplicon primers & reference sequences

This example looks at samples amplified with the same ITS1 primers as the THAPBI PICT default database, but also rps10 primers. The rps10 assay is a multiplex PCR reaction comprising two rps10 forward primers and seven rps10 reverse primers that differ slightly in sequence but anneal to the same position. We follow the authors in using IUPAC codes to approximate the rps10 marker primers, with GTTGGTTAGAGYARAAGACT for the left and ATRYYTAGAAAGAYTYGAACT for the right (reverse complement AGTTCRARTCTTTCTARRYAT).

In order to classify the rps10 sequences, we need to build a THABPI PICT database of full-length primer-trimmed references. Happily we can use the authors’ own reference sequences from OomyceteDB https://oomycetedb.cgrb.oregonstate.edu (after reformatting the FASTA headers to a pattern THAPBI PICT recognises).

Metadata

The provided file metadata.tsv has seven columns:

run_accession, eg “SRR13658667”
tax_id, eg “410658”
scientific_name, eg “soil metagenome”
library_name, eg “C5”
experiment_title, eg “Illumina MiSeq sequencing; ITS1 sequences of agricultural soil”
Marker, “ITS1” or “rps10”
Source, eg “Agricultural soil”

When calling THAPBI PICT, the meta data commands are given as follows:

$ thapbi_pict ... -t metadata.tsv -x 1 -c 3,7,4,6

Argument -t metadata.tsv says to use this file for the metadata.

The -x 1 argument indicates the filename stem can be found in column 1, Accession.

Argument -c 3,7,4,6 says which columns to display and sort by (do not include the indexed column again).

We have not given a -g argument to assign colour bands in the Excel reports, so it will default to the first column in -c, “scientific name” meaning we get four coloured bands for “aquatic metagenome”, “plant metagenome”, “soil metagenome”, and “synthetic metagenome”.

Other files

The setup script will create symlinks using the sample names under sub-folder expected/ pointing at the relevant mock-community known file. This is for automatically assessing the classifier performance.

Sub-folders under intermediate/ are used for intermediate files, a folder for each primer-pair. Subfolder summary/ is used for the generated reports.