Marker data

Either clone the THAPBI PICT source code repository, or decompress the latest source code release (.tar.gz file). Either way, you should find the directory examples/pest_insects/, which holds the files for this example.

Shell scripts setup.sh and run.sh should reproduce the analysis discussed.

FASTQ data

File PRJNA716058.tsv was downloaded from the ENA and includes the FASTQ checksums, URLs, and sample metadata.

Script setup.sh will download the raw FASTQ files for Batovska et al. (2021) from https://www.ebi.ac.uk/ena/data/view/PRJNA716058

It will download 60 raw FASTQ files (30 pairs), taking 7.9 GB on disk.

If you have the md5sum tool installed (standard on Linux), verify the FASTQ files downloaded correctly:

$ cd raw_data/
$ md5sum -c MD5SUM.txt
...
$ cd ../

There is no need to decompress the files.
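Where the md5sum tool is not available, the same check can be approximated with the Python standard library. This is a minimal sketch, not part of setup.sh: the function names are hypothetical, and it assumes MD5SUM.txt uses the usual md5sum layout of one "digest  filename" pair per line.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5_listing(listing, folder="."):
    """Return the filenames in an md5sum-style listing that are missing or differ.

    Assumption: each non-blank line is '<hex digest>  <filename>', where a
    leading '*' on the filename (md5sum binary mode) is ignored.
    """
    failed = []
    for line in Path(listing).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")
        target = Path(folder) / name
        if not target.is_file() or md5_of_file(target) != expected.lower():
            failed.append(name)
    return failed


# Only runs the check if the listing from the download step is present.
if __name__ == "__main__" and Path("raw_data/MD5SUM.txt").is_file():
    bad = check_md5_listing("raw_data/MD5SUM.txt", "raw_data")
    print("All files OK" if not bad else "FAILED: " + ", ".join(bad))
```

The files are read as compressed bytes, so this check also works without decompressing anything.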

Amplicon primers & reference sequences

Three separate markers are used here, as shown in the paper’s Supplementary Table S2, together with the shared Illumina adaptors used.

The authors provide their species-level reference sequences as a compressed FASTA file merged_arthropoda_rdp_species.fa.gz on the GitHub repository for the paper: https://github.com/alexpiper/HemipteraMetabarcodingMS

The worked example applies the three primer pairs to this FASTA file to make an amplicon-specific FASTA file for each marker.
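The idea of applying a primer pair to a reference sequence can be sketched in Python as follows. This is an illustration only, not how THAPBI PICT itself does it: the function names and the primers in the demo are hypothetical (the real primers are in Supplementary Table S2), matching is exact apart from IUPAC ambiguity codes, only one strand orientation is searched, and the primers are excluded from the returned amplicon by assumption.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement a DNA sequence, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a primer with ambiguity codes into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(seq, forward, reverse):
    """Return the sequence between the two primer sites, or None if not found.

    The reverse primer is given 5' to 3' as synthesised, so its reverse
    complement is what appears downstream on the forward strand. Ambiguous
    bases in the reference itself are not handled by this sketch.
    """
    pattern = re.compile(
        primer_regex(forward)
        + "([ACGT]+?)"
        + primer_regex(reverse_complement(reverse))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None


# Made-up reference sequences and primers, purely for demonstration.
references = {
    "ref_with_site": "TTTACGTAGATTACAAACGGAAA",
    "ref_without_site": "TTTTTTTTTTTT",
}
for name, seq in references.items():
    print(name, extract_amplicon(seq, "ACGTN", "CCRTT"))
```

Running this once per primer pair over every record in the reference FASTA, and keeping only the records where an amplicon is found, would give one amplicon-specific set of sequences per marker.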

Metadata

File metadata.tsv is based on the ENA metadata and the paper text. It has four columns:

run_accession, assigned by the public archive, e.g. “SRR14022295”
sample_alias, e.g. “100-Pool-1” or “Trap-1”
source, e.g. one of the mock communities like “Pool 1”, or “Trap”
individuals, e.g. “0100” (with leading zero for sorting) or “-” for traps.

When calling THAPBI PICT, the metadata arguments are given as follows:

$ thapbi_pict ... -t metadata.tsv -x 1 -c 3,4,2

Argument -t metadata.tsv says to use this file for the metadata.

Argument -c 3,4,2 says which columns to display and sort by. This means by source (i.e. which mock community, or environmental traps), then number of individuals in the mock, and finally the human readable sample alias. The purpose here is to group the samples logically (sorting on sample_alias alone would not work), in an order that also suits group colouring.

Argument -x 1 (default, so not needed) indicates the filename stem can be found in column 1, run accession.
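The effect of the column order given to -c can be illustrated with a short Python sketch. The rows below are invented for the illustration (the run accessions are placeholders and the sample aliases merely follow the pattern described above), and the sketch assumes values are compared as plain strings, which is why the leading zero in the individuals column matters.

```python
import csv
import io

# Illustrative rows only, in the four-column layout of metadata.tsv.
# The accessions are placeholders, not real ENA run accessions.
DEMO_TSV = """\
run_accession\tsample_alias\tsource\tindividuals
RUN_A\tTrap-1\tTrap\t-
RUN_B\t1000-Pool-1\tPool 1\t1000
RUN_C\t100-Pool-2\tPool 2\t0100
RUN_D\t100-Pool-1\tPool 1\t0100
"""


def sort_rows(rows, columns):
    """Sort rows on the given 1-based column numbers, in the order given."""
    return sorted(rows, key=lambda row: tuple(row[c - 1] for c in columns))


reader = csv.reader(io.StringIO(DEMO_TSV), delimiter="\t")
header = next(reader)
rows = list(reader)

# Source, then individuals, then sample alias, as with -c 3,4,2.
by_metadata = sort_rows(rows, [3, 4, 2])
# Sorting on the sample alias alone, for comparison.
by_alias = sort_rows(rows, [2])

for row in by_metadata:
    print(row[0], *(row[c - 1] for c in (3, 4, 2)), sep="\t")
```

With these rows, sorting on columns 3,4,2 keeps the two "Pool 1" samples together, whereas sorting on the sample alias alone places the "Pool 2" sample between them.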

Other files

Files mock_community_1.known.tsv, …, mock_community_5.known.tsv list the expected species in the five different mock community pools. The setup script will create symlinks using the sample names under sub-folder expected/ pointing at the relevant community known file. This is for automatically assessing the classifier performance.
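The symlink step can be pictured with the following Python sketch. It is not the code in setup.sh: the function name, the sample name, and the link naming pattern <sample>.known.tsv are assumptions made for illustration, and the demo works in a temporary folder with a placeholder file rather than the real example files.

```python
import os
import tempfile
from pathlib import Path


def link_expected(sample_to_pool, folder):
    """Create expected/<sample>.known.tsv symlinks to each pool's known file.

    sample_to_pool maps a sample name to its mock community number (1 to 5).
    The link naming pattern is an assumption made for this sketch.
    """
    expected = Path(folder) / "expected"
    expected.mkdir(exist_ok=True)
    for sample, pool in sample_to_pool.items():
        link = expected / f"{sample}.known.tsv"
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier run
        # Relative target, so the links still work if the folder is moved.
        link.symlink_to(Path("..") / f"mock_community_{pool}.known.tsv")
    return expected


# Demonstration in a temporary folder with a placeholder known file.
demo = Path(tempfile.mkdtemp())
(demo / "mock_community_1.known.tsv").write_text("placeholder\n")
link_expected({"SAMPLE_A": 1}, demo)
print(os.readlink(demo / "expected" / "SAMPLE_A.known.tsv"))
```

Because every sample from the same pool links to one shared file, a correction to a community's expected species list only has to be made in one place.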

Sub-folders under intermediate/ are used for intermediate files, a folder for each primer-pair.