Marker data
Either clone the THAPBI PICT source code repository, or decompress the
latest source code release (.tar.gz
file). You should find it contains
a directory examples/fecal_sequel/
which is for this example.
Shell scripts setup.sh
and run.sh
should reproduce the analysis
discussed.
FASTQ data
File PRJNA574765
was download from the ENA and includes the FASTQ
checksums, URLs, and the key metadata. Related file metadata.tsv
contains report-ready metadata about the samples (see below).
Script setup.sh
will download the raw FASTQ files for Walker et al.
(2019) from https://www.ebi.ac.uk/ena/data/view/PRJNA574765
It will download 120 raw FASTQ files (60 pairs), taking about 641MB on disk
If you have the md5sum
tool installed (standard on Linux), verify the FASTQ
files downloaded correctly:
$ cd raw_data/
$ md5sum -c MD5SUM.txt
...
$ cd ..
There is no need to decompress the files.
We focus on bioproject PRJNA574765 which has 60 samples and covers the mock communities. Additionally the paper describes PRJNA525109 (41 samples comparing genetic efficacy vs traditional survey techniques), and PRJNA525407 (9 samples looking at bat species assemblages in archaeological sites in Belize, with an expanded reference set).
Amplicon primers & reference sequences
The primer pair is SFF_145f (GTHACHGCYCAYGCHTTYGTAATAAT
) and SFF_351r
(CTCCWGCRTGDGCWAGRTTTCC
).
The reference set of COI sequences is taken from Supplementary S2 in the preceding paper (which also included bioproject PRJNA325503 with 9 samples):
Walker et al. (2016) Species From Feces: Order-Wide Identification of Chiroptera From Guano and Other Non-Invasive Genetic Samples. https://doi.org/10.1371/journal.pone.0162342
File COI_430_bats.fasta
of pre-trimmed bat COI markers is generated by
setup.sh
by downloading the FASTA file from Walker et al. (2016)
Supplementary S2, with underscores replaced with spaces in the record names.
Provided file observed_3_bats.fasta
contains alternative COI markers
observed in at least 10 samples, and their assumed species source. This is
for discussing the effect of the database.
Metadata
The provided file metadata.tsv
is based on PRJNA574765
but breaks up
the sample name into separate columns:
Accession, assigned by the public archive, e.g. “SRR10198789”
Rare, which of the 3 species is at low abundance, “COTO”, “EPFU” or “TABR”.
Ratio, either “1:64” (rare) or “1:192” (very rare)
Replicate, “01” to “10” (leading zero for alphabetical sorting)
The four letter appreviations are Corynorhinus townsendii (COTO), Eptesicus fuscus (EPFU) and Tadarida brasiliensis (TABR).
When calling THAPBI PICT, the meta data commands are given as follows:
$ thapbi_pict ... -t metadata.tsv -x 1 -c 2,3,4
Argument -t metadata.tsv
says to use this file for the metadata.
The -x 1
argument indicates the filename stem can be found in column 1,
Accession.
Argument -c 2,3,4
says which columns to display and sort by (do not
include the indexed column again). i.e. Rare species, ratio, replicate.
We have not given a -g
argument to assign colour bands in the Excel
reports, so it will default to the first column in -c
, meaning we get
three coloured bands for “COTO”, “EPFU” and “TABR”.
Other files
File mock_community.known.tsv
describes the three species of bats expected
in the mock communities (which use different ratios).