Marker data

Either clone the THAPBI PICT source code repository, or decompress the latest source code release (.tar.gz file). You should find it contains a directory examples/soil_nematodes/ which is for this example.

Shell scripts setup.sh and run.sh should reproduce the analysis discussed.

FASTQ data

File PRJEB27581.tsv was download from the ENA and includes the FASTQ checksums, URLs, and sample metadata.

Script setup.sh will download the raw FASTQ files for Ahmed et al. (2019) from https://www.ebi.ac.uk/ena/data/view/PRJEB27581

It will download 32 raw FASTQ files (16 pairs), taking 12GB on disk.

If you have the md5sum tool installed (standard on Linux), verify the FASTQ files downloaded correctly:

$ cd raw_data/
$ md5sum -c MD5SUM.txt
...
$ cd ../

There is no need to decompress the files.

Amplicon primers & reference sequences

There were four separate markers used here, as shown in the paper’s Table 2 together with the shared Illumina adaptors used.

The authors do not provide copies of their reference sequence databases with the paper. Instead, files NF1-18Sr2b.fasta, SSUF04-SSUR22.fasta, D3Af-D3Br.fasta and JB3-JB5GED.fasta were based on the accessions listed in the paper and close matches in the NCBI found with BLASTN against the NT database. Note many of the species names have been reduced to just “Genus sp.” in line with the mock community entries, and all the fungal entries are listed as just “Fungi”.

Metadata

File metadata.tsv is based on the ENA metadata and the paper text. It has four columns:

run_accession, assigned by the public archive, e.g. “ERR2678656”
read_count, the number of paired reads in the raw FASTQ files.
sample, one of “MC1”, “MC2”, “MC3” for the mock communities, or “Blank”
marker, one of “NF1-18Sr2b”, “SSUF04-SSUR22”, “D3Af-D3Br” or “JB3-JB5GED”

When calling THAPBI PICT, the meta data commands are given as follows:

$ thapbi_pict ... -t metadata.tsv -x 1 -c 4,3

Argument -t metadata.tsv says to use this file for the metadata.

Argument -c 4,3 says which columns to display and sort by. This means sample and then marker. The purpose here is to group the samples logically (sorting on accession would not work), and suitable for group colouring.

Argument -x 1 (default, so not needed) indicates the filename stem can be found in column 1, run accession.

Other files

The provided negative_control.known.tsv and mock_community.known.tsv files lists the expected species in the negative controls (none) and the mock community samples (the same 23 species). Sub-folders under expected/ are created for each primer-pair, linking each accession name to either file as appropriate for assessing the classifier performance.

Sub-folders under intermediate/ are used for intermediate files, a folder for each primer-pair.