Marker data
Either clone the THAPBI PICT source code repository, or decompress the
latest source code release (.tar.gz
file). You should find it contains
a directory examples/great_lakes/
which is for this example.
Shell scripts setup.sh
and run.sh
should reproduce the analysis
discussed.
Subdirectories MOL16S/
and SPH16S/
are used for the different
amplicons (with different primer settings and reference databases).
FASTQ data
File PRJNA379165.tsv
was download from the ENA and includes the FASTQ
checksums, URLs, and sample metadata. Derived file metadata.tsv
contains
report-ready metadata about the samples (see below).
Script setup.sh
will download the raw FASTQ files for Klymus et al.
(2017) from https://www.ebi.ac.uk/ena/data/view/PRJNA379165
It will download 36 raw FASTQ files (18 pairs), taking 1.8GB on disk.
If you have the md5sum
tool installed (standard on Linux), verify the FASTQ
files downloaded correctly:
$ cd raw_data/
$ md5sum -c MD5SUM.txt
...
$ cd ../
There is no need to decompress the files.
Amplicon primers & reference sequences
The MOL16S
amplicon targeted a short fragment of the mtDNA 16S RNA gene
using degenerate primer pair MOL16S_F/MOL16S_R (RRWRGACRAGAAGACCCT
and
ARTCCAACATCGAGGT
).
The SPH16S
amplicon targeted sphaeriid mussel species where it amplified
an overlapping slightly downstream region of the mtDNA 16S RNA gene using
non-degenerate primers SPH16S_F/SPH16S_R (TAGGGGAAGGTATGAATGGTTTG
and
ACATCGAGGTCGCAACC
).
This means we need to run THAPBI PICT twice (once for each primer pair, against a different marker database each time).
Metadata
The provided file metadata.tsv
is based on metadata in the ENA, split into
separate columns for reporting. It has five columns:
Run accession, e.g. “SRR5534972”
Library name, e.g. “SC3PRO2”
Sample title, e.g. “Mock Community 2 MOL16S with Fish Block Primer”
Marker, “MOL16S” or “SPH16S”
Group, “Mock Community”, “Aquarium”, “River” or “Control”
When calling THAPBI PICT, the meta data commands are given as follows:
$ thapbi_pict ... -t metadata.tsv -x 1 -c 4,5,3,2
Argument -t metadata.tsv
says to use this file for the metadata.
Argument -c 4,5,3,2
says which columns to display and sort by. This means
Marker, Group, Sample Title, Library name. This splits up the samples first by
the expected marker, and then the group.
Argument -x 1
the filename stems can be found in that column one.
Other files
Files MOL16S.fasta
and SPH16S.fasta
are for building reference
databases. These were constructed from the accessions in the paper listed in
Table 1, Table 8, Supplementary Table 1, Supplementary Table 3, and some
additional accessions for the mock community. The sequences were primer
trimmed using cutadapt
(requiring both the left and right primer to be
present), and the description given cut to just species level (discarding
strain or isolate information).