Marker data
Either clone the THAPBI PICT source code repository, or decompress the
latest source code release (.tar.gz file). You should find it contains
a directory examples/fungal_mock/ which is for this example.
Shell scripts setup.sh and run.sh should reproduce the analysis discussed.
FASTQ data
File PRJNA377530.tsv was downloaded from the ENA and includes the FASTQ
checksums, URLs, and sample metadata.
Script setup.sh will download the raw FASTQ files for Bakker (2018) from
https://www.ebi.ac.uk/ena/data/view/PRJNA377530
It will download 122 raw FASTQ files (61 pairs), taking 346MB on disk.
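As a quick sanity check after running setup.sh, the number of downloaded files can be compared against the 122 stated above. This is only a sketch: the function name count_fastq is hypothetical, and it assumes the files sit somewhere under raw_data/ and end in .fastq.gz.

```shell
# Count FASTQ files below a directory.
# Assumption: compressed files named *.fastq.gz; adjust the pattern if needed.
count_fastq() {
    find "$1" -type f -name '*.fastq.gz' | wc -l | tr -d ' '
}

# Only run the check if the (assumed) download directory exists.
if [ -d raw_data ]; then
    n=$(count_fastq raw_data)
    if [ "$n" -eq 122 ]; then
        echo "Found all 122 FASTQ files"
    else
        echo "Expected 122 FASTQ files, found $n" >&2
    fi
fi
```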
If you have the md5sum tool installed (standard on Linux), verify the FASTQ
files downloaded correctly:
$ cd raw_data/AL1/
$ md5sum -c MD5SUM.txt
...
$ cd ../../
There is no need to decompress the files.
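The checksum verification can also be looped over several folders in one go. The sketch below assumes a layout of raw_data/AL1/ and raw_data/AL2/, each holding its own MD5SUM.txt; only the AL1 folder is confirmed by the commands shown here, and the function name verify_md5_dirs is hypothetical.

```shell
# Run "md5sum -c MD5SUM.txt" inside each directory given as an argument.
# Returns non-zero if any directory lacks a checksum file or fails the check.
verify_md5_dirs() {
    status=0
    for dir in "$@"; do
        if [ ! -f "$dir/MD5SUM.txt" ]; then
            echo "No MD5SUM.txt in $dir" >&2
            status=1
            continue
        fi
        # Subshell so the change of directory does not leak out.
        # --quiet (GNU md5sum) suppresses the per-file OK lines.
        (cd "$dir" && md5sum -c --quiet MD5SUM.txt) || status=1
    done
    return "$status"
}

# Assumed layout; skipped entirely if raw_data/ is absent.
if [ -d raw_data ]; then
    verify_md5_dirs raw_data/AL1 raw_data/AL2 && echo "All checksums OK"
fi
```

Any file that fails is reported by md5sum itself, so a silent run followed by "All checksums OK" means every listed file matched.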
Amplicon primers & reference sequences
Amplicon library one (AL1) amplified a small region of ITS1 using primer pair
BITS/B58S3 (ACCTGCGGARGGATC and GAGATCCRTTGYTRAAAGTT), as shown in the
paper’s supplementary Table S4.
Amplicon library two (AL2) amplified a larger region of ITS1 using primer pair
ITS1f/ITS2 (CTTGGTCATTTAGAGGAAGTAA and GCTGCGTTCTTCATCGATGC), which
includes the first library’s target region entirely. Yields were similar, as
per supplementary Table S4 vs S5.
Additionally, amplicon library two (AL2) amplified ITS2 using primer pair
ITS3‐KYO2 with ITS4‐KYO3 (GATGAAGAACGYAGYRAA and CTBTTVCCKCTTCACTCG),
with lower yields as per supplementary Table S5 vs S6.
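Several of these primers contain IUPAC ambiguity codes (R, Y, B, V and K), so they cannot be searched for as literal strings. As an illustration only, the sketch below expands a primer into an extended regular expression that could be passed to grep -E when inspecting reads by eye. The function name iupac_to_regex is hypothetical, and this is not a description of how THAPBI PICT itself handles primers.

```shell
# Expand IUPAC ambiguity codes in a primer into a regular expression.
# Every replacement introduces only A, C, G, T and brackets, so the
# substitutions cannot interfere with one another.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' \
        -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' \
        -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' \
        -e 's/N/[ACGT]/g'
}

iupac_to_regex ACCTGCGGARGGATC      # first primer of the BITS/B58S3 pair
iupac_to_regex GATGAAGAACGYAGYRAA   # first primer of the ITS2 pair
```

The two calls print ACCTGCGGA[AG]GGATC and GATGAAGAACG[CT]AG[CT][AG]AA respectively.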
The example must run THAPBI PICT twice: first with a single-marker database for AL1 using the BITS/B58S3 primers, and then with a dual-marker database for AL2 using the ITS1f/ITS2 and ITS3‐KYO2/ITS4‐KYO3 primers. In fact the example runs it a third time, because we can also try the BITS/B58S3 primers on the second amplicon library: they amplify a subregion of what the ITS1f/ITS2 pair amplifies. See the primer discussion in the similar Redekar et al. (2019) worked example.
Files ITS1.fasta and ITS2.fasta were extracted from supplementary
materials appendix S2, with the species name alone added to the FASTA titles
(for input to thapbi_pict import with primer trimming).
Metadata
The amplicon-specific files metadata_AL1.tsv and metadata_AL2.tsv are
based on the metadata downloaded from the ENA, with some reformatting. The
split into amplicon one and two was based on supplementary Tables S4, S5 and
S6 (for the mock community samples) and reading the paper (for placing the
negative controls).
They have seven columns:
Accession, assigned by the public archive, e.g. “SRR5314337”
MiSeq-name, author’s filename stem, e.g. “FMockE.HC_S190”
Condition, based on original name without replicate suffix, e.g. “MockE_HC”
Replicate, numeric, e.g. “1”
Sample-type, either “fungal mock community” or “negative control”
Group, e.g. “even” or “staggered A”
Protocol, e.g. “high PCR cycle number” or “standard workflow”
When calling THAPBI PICT, the metadata arguments are given as follows:
$ thapbi_pict ... -t metadata_AL1.tsv -c 5,6,7,3,4,2 -x 1 -g 6
$ thapbi_pict ... -t metadata_AL2.tsv -c 5,6,7,3,4,2 -x 1 -g 6
Argument -t metadata.tsv says to use this file for the metadata.
Argument -c 5,6,7,3,4,2 says which columns to display and sort by. This
means Sample-type, Group, Protocol, Condition, Replicate, MiSeq-name. The
purpose here is to group the samples logically (sorting on Accession or
MiSeq-name would not work), in an order suitable for group colouring.
Argument -x 1 (default, so not needed) indicates the filename stem can be
found in column 1, Accession. Had we downloaded the files under the authors'
original names, -x 2 ought to work instead.
Argument -g 6 means assign colour bands using column 6, Group. This is
used in the Excel reports.
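To get a feel for what the column order 5,6,7,3,4,2 looks like before running the tool, the metadata table can be previewed with standard utilities. This is a rough approximation rather than THAPBI PICT's own logic: the function name preview_columns is hypothetical, the file is assumed to be tab-separated with a single header row, and sort here is plainly lexicographic.

```shell
# Print the chosen columns of a tab-separated file in the given order.
# Usage: preview_columns FILE COLUMNS   (COLUMNS is a comma-separated list)
preview_columns() {
    awk -F '\t' -v OFS='\t' -v cols="$2" '
        BEGIN { n = split(cols, idx, ",") }
        {
            line = $(idx[1] + 0)
            for (i = 2; i <= n; i++) line = line OFS $(idx[i] + 0)
            print line
        }' "$1"
}

if [ -f metadata_AL1.tsv ]; then
    # Assumption: the first row is a header; keep it on top and sort the rest.
    preview_columns metadata_AL1.tsv 5,6,7,3,4,2 |
        { IFS= read -r header; printf '%s\n' "$header"; sort; }
fi
```

Sorting on the reordered columns places rows with the same Sample-type, Group and Protocol next to each other, which is the grouping the -c argument is described as achieving.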
Other files
The provided negative_control.known.tsv and mock_community.known.tsv
files list the expected species in the negative controls (none) and the mock
community samples (the same 19 species, although not always in equal ratios).
Sub-folders under intermediate/ are used for intermediate files, with a folder
for each amplicon library (AL1 and AL2) and primer-pair combination.