Preparing the sequence data
Running thapbi-pict prepare-reads
Calling thapbi-pict prepare-reads
is the first action done by the top
level thapbi_pict pipeline
command.
$ thapbi_pict prepare-reads -h
...
Assuming you have the FASTQ files in raw_data/
as described above:
$ thapbi_pict prepare-reads -i raw_data/ -o intermediate/
...
For each input FASTQ file pair raw_data/<sample_name>_R1.fastq.gz
and
raw_data/<sample_name>_R2.fastq.gz
you should get a small FASTA file
intermediate/<marker_name>/<sample_name>.fasta
. In this case, there are
multiple replicates from each of 14 sample sites where the file name stem is
Site_<N>_sample_<X>
, plus the controls.
$ ls -1 intermediate/ITS1/*.fasta | wc -l
122
Note this is robust to being interrupted and restarted (e.g. a job might time out on a cluster).
You should find 122 small FASTA files in the intermediate/ITS1/
folder
Note that four of these FASTA files are empty, Site_13_sample_7.fasta
and
Site_9_sample_4-3.fasta
(nothing above the minimum threshold), and both
negative controls (good).
Warning
So far this example omits a key consideration - telling the tool which samples are negative controls, and/or manually setting the minimum read abundance. See below.
Intermediate FASTA files
What the prepare command does can be briefly summarised as follows:
Merge the overlapping paired FASTQ reads into single sequences (pairs which do not overlap are discarded, for example from unexpectedly long fragments, or not enough left after quality trimming).
Primer trim (reads without both primers are discarded).
Convert into a non-redundant FASTA file, with the sequence name recording the abundance (discarding sequences of low abundance).
If synthetic controls are defined in the DB, look for matches using k-mers. These will be discounted when using negative control samples to raise the minimum abundance threshold for the plate.
For each input <sample_name>_R1.fastq.gz
and <sample_name>_R2.fastq.gz
FASTQ pair we get a single much smaller FASTA file <sample_name>.fasta
.
Warning
The intermediate FASTA files can legitimately have no sequences which passed the thresholds. This can happen when a PCR failed, and is expected to happen on blank negative controls.
Warning
The intermediate FASTA files start with an atypical header made up of
lines starting #
. Some tools need this to be removed, but others will
accept this as valid FASTA format.
For example, here the header tells us this sample started with 6136 reads in the paired FASTQ files, down to just 4180 after processing (with the final step being the abundance threshold).
$ head -n 12 intermediate/ITS1/Site_1_sample_1.fasta
#marker:ITS1
#left_primer:GAAGGTGAAGTCGTAACAAGG
#right_primer:GCARRGACTTTCGTCCCYRC
#threshold_pool:raw_data
#raw_fastq:6136
#flash:5900
#cutadapt:5886
#abundance:5194
#threshold:2
#singletons:692
>2e4f0ed53888ed39a2aee6d6d8e02206_2269
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACTGTATCGAACAACTAGTTGG
GGGTCTTGTTTGGCGTGCGGCTGCTTCGGTAGCTGCTGCTAGGCGAGCCCTATCACGGCGAGCGTTTGGACTTCGGTCTG
AGCTAGTAGCTATTTTTTAAACCCATTCTTTAATACTGATTATACT
The sequence entries in the FASTA file are named <checksum>_<abundance>
.
Here <checksum>
is the MD5 checksum
of the sequence, and this is used as a unique shorthand. It is a 32 character
string of the digits 0
to 9
and lower cases letters a
to f
inclusive, like a559aa4d00a28f11b83012e762391259
. These MD5 checksums are
used later in the pipeline, including in reports. The <abundance>
is just
an integer, the number of paired reads which after processing had this unique
sequence.
Any description entry in the FASTA records after the identifier is the name of
the synthetic spike-in sequence in the database that was matched to using
k-mer counting (so 2e4f0ed53888ed39a2aee6d6d8e02206_2269
was not a
spike-in sequence).
The order of the FASTA sequences is in decreasing abundance, so the first
sequence shown 2e4f0ed53888ed39a2aee6d6d8e02206_2269
is the most common,
and so that read count 2269 also appears in the headers as the maximum
non-spike-in abundance (with no spike-in reads in this sample).
Note the sequence in the FASTA file is written as a single line in upper
case. With standard FASTA line wrapping at 60 or 80 characters, the ITS1
sequences would need a few lines each. However, they are still short enough
that having them on one line without line breaks is no hardship - and it is
extremely helpful for simple tasks like using grep
to look for a
particular sequence fragment at the command line.
Note that for this documentation, the FASTA output has had the sequences line wrapped at 80 characters.
$ grep "^>" intermediate/ITS1/Site_1_sample_1.fasta | head -n 8
>2e4f0ed53888ed39a2aee6d6d8e02206_2269
>c1a720b2005f101a9858107545726123_715
>96e0e2f0475bd1617a4b05e778bb04c9_330
>fb30156d7f66c8abf91f9da230f4d19e_212
>dcd6316eb77be50ee344fbeca6e005c7_194
>972db44c016a166de86a2bacab3f4226_182
>d9bc3879fdab3b4184c04bfbb5cf6afb_165
>ed15fefb7a3655147115fc28a8d6d671_113
The final output has just eight unique sequences accepted, happily none of
which match the synthetic controls. The most common is listed first, and had
MD5 checksum 2e4f0ed53888ed39a2aee6d6d8e02206
and was seen in 2269 reads.
You could easily find out which other samples had this unique sequence using
the command line search tool grep
as follows:
$ grep 2e4f0ed53888ed39a2aee6d6d8e02206 intermediate/*.fasta
...
Or, since we deliberately record the sequences without line wrapping, you
could use grep
with the actual sequence instead (which might spot some
slightly longer entries as well).
You can also answer this example question from the read report produced later.
Abundance thresholds
As you might gather from reading the command line help, there are two settings
to do with the minimum read absolute abundance threshold, -a
or
--abundance
(default 100), and -n
or --negctrls
for specifying
negative controls (default none).
(See also Abundance & Negative Controls which discusses the use of the fractional
abundance threshold -f
or --abundance-fraction
and how to set this
dynamically with synthetic control samples with -y
or --synthetic
.)
If any negative controls are specified, those paired FASTQ files are processed first. If any of these contained ITS1 sequences above the specified minimum absolute abundance threshold (default 100), that higher number is used as the minimum abundance threshold for the non-control samples. For example, say one control had several ITS1 sequences with a maximum abundance of 124, and another control had a maximum ITS1 abundance of 217, while the remaining controls had no ITS1 sequence above the default level. In that case, the tool would take maximum 217 as the abundance threshold for the non-control samples.
If you wished to lower the threshold from the default to 50, you could use:
$ rm -rf intermediate/ITS1/*.fasta # Are you sure?
$ thapbi_pict prepare-reads -i raw_data/ -o intermediate/ -a 50
...
Warning
By default thapbi_pict prepare-reads
and thapbi_pict pipeline
will
reuse existing intermediate FASTA files, so you must explicitly delete any
old FASTA files before the new abundance threshold will have any effect.
Warning
Setting the abundance threshold low (say under 50) risks background contamination coming through into the results. Do not do this without strong justification (e.g. look at suitable controls over multiple plates from your own laboratory procedure).
Warning
Setting the abundance threshold very low (under 10) has the additional problem that the number of unique sequences accepted will increase many times over. This will dramatically slow down the rest of the analysis. This is only advised for investigating single samples.
For the woody host data, each plate had a negative control sample which should
contain no ITS1 sequences. We can specify the negative controls with -n
or
--negctrls
by entering the four FASTQ filenames in full, but since they
have a common prefix we can use a simple wildcard:
$ thapbi_pict prepare-reads -i raw_data/ -o intermediate/ -n raw_data/NEGATIVE*.fastq.gz
...
For this sample data, happily neither of the negative controls have any ITS1 present above the default threshold, so this would have no effect.
For the THAPBI Phyto-Threats project we now run each 96-well PCR plate with multiple negative controls. Rather than a simple blank, these include a known mixture of synthetic sequences of the same length, same nucelotide composition, and also same di-nucleotide composition as real Phytophthora ITS1. This means we might have say 90 biological samples which should contain ITS1 but not the synthetics controls, and 6 negative controls which should contain synthetic controls but not ITS1.
We therefore run thapbi_pict prepare-reads
separately for each plate,
where any ITS1 contamination in the synthetic controls is used to set a plate
specific minimum abundance. This means we cannot run thapbi_pict pipeline
on multiple plates at once (although we could run it on each plate, we
generally want to produce reports over multiple plates).