Preparing the sequence data

Running thapbi-pict prepare-reads

Calling thapbi-pict prepare-reads is the first action done by the top level thapbi_pict pipeline command.

$ thapbi_pict prepare-reads -h

Assuming you have the FASTQ files in raw_data/ as described above:

$ thapbi_pict prepare-reads -i raw_data/ -o intermediate/

For each input FASTQ file pair raw_data/<sample_name>_R1.fastq.gz and raw_data/<sample_name>_R2.fastq.gz you should get a small FASTA file intermediate/<marker_name>/<sample_name>.fasta. In this case, there are multiple replicates from each of 14 sample sites where the file name stem is Site_<N>_sample_<X>, plus the controls.

$ ls -1 intermediate/ITS1/*.fasta | wc -l

Note this is robust to being interrupted and restarted (e.g. a job might time out on a cluster).

You should find 122 small FASTA files in the intermediate/ITS1/ folder

Note that four of these FASTA files are empty, Site_13_sample_7.fasta and Site_9_sample_4-3.fasta (nothing above the minimum threshold), and both negative controls (good).


So far this example omits a key consideration - telling the tool which samples are negative controls, and/or manually setting the minimum read abundance. See below.

Intermediate FASTA files

What the prepare command does can be briefly summarised as follows:

  • Merge the overlapping paired FASTQ reads into single sequences (pairs which do not overlap are discarded, for example from unexpectedly long fragments, or not enough left after quality trimming).

  • Primer trim (reads without both primers are discarded).

  • Convert into a non-redundant FASTA file, with the sequence name recording the abundance (discarding sequences of low abundance).

  • If synthetic controls are defined in the DB, look for matches using k-mers. These will be discounted when using negative control samples to raise the minimum abundance threshold for the plate.

For each input <sample_name>_R1.fastq.gz and <sample_name>_R2.fastq.gz FASTQ pair we get a single much smaller FASTA file <sample_name>.fasta.


The intermediate FASTA files can legitimately have no sequences which passed the thresholds. This can happen when a PCR failed, and is expected to happen on blank negative controls.


The intermediate FASTA files start with an atypical header made up of lines starting #. Some tools need this to be removed, but others will accept this as valid FASTA format.

For example, here the header tells us this sample started with 6136 reads in the paired FASTQ files, down to just 4180 after processing (with the final step being the abundance threshold).

$ head -n 12 intermediate/ITS1/Site_1_sample_1.fasta

The sequence entries in the FASTA file are named <checksum>_<abundance>. Here <checksum> is the MD5 checksum of the sequence, and this is used as a unique shorthand. It is a 32 character string of the digits 0 to 9 and lower cases letters a to f inclusive, like a559aa4d00a28f11b83012e762391259. These MD5 checksums are used later in the pipeline, including in reports. The <abundance> is just an integer, the number of paired reads which after processing had this unique sequence.

Any description entry in the FASTA records after the identifier is the name of the synthetic spike-in sequence in the database that was matched to using k-mer counting (so 2e4f0ed53888ed39a2aee6d6d8e02206_2269 was not a spike-in sequence).

The order of the FASTA sequences is in decreasing abundance, so the first sequence shown 2e4f0ed53888ed39a2aee6d6d8e02206_2269 is the most common, and so that read count 2269 also appears in the headers as the maximum non-spike-in abundance (with no spike-in reads in this sample).

Note the sequence in the FASTA file is written as a single line in upper case. With standard FASTA line wrapping at 60 or 80 characters, the ITS1 sequences would need a few lines each. However, they are still short enough that having them on one line without line breaks is no hardship - and it is extremely helpful for simple tasks like using grep to look for a particular sequence fragment at the command line.

Note that for this documentation, the FASTA output has had the sequences line wrapped at 80 characters.

$ grep "^>" intermediate/ITS1/Site_1_sample_1.fasta | head -n 8

The final output has just eight unique sequences accepted, happily none of which match the synthetic controls. The most common is listed first, and had MD5 checksum 2e4f0ed53888ed39a2aee6d6d8e02206 and was seen in 2269 reads.

You could easily find out which other samples had this unique sequence using the command line search tool grep as follows:

$ grep 2e4f0ed53888ed39a2aee6d6d8e02206 intermediate/*.fasta

Or, since we deliberately record the sequences without line wrapping, you could use grep with the actual sequence instead (which might spot some slightly longer entries as well).

You can also answer this example question from the read report produced later.

Abundance thresholds

As you might gather from reading the command line help, there are two settings to do with the minimum read absolute abundance threshold, -a or --abundance (default 100), and -n or --negctrls for specifying negative controls (default none).

(See also Abundance & Negative Controls which discusses the use of the fractional abundance threshold -f or --abundance-fraction and how to set this dynamically with synthetic control samples with -y or --synthetic.)

If any negative controls are specified, those paired FASTQ files are processed first. If any of these contained ITS1 sequences above the specified minimum absolute abundance threshold (default 100), that higher number is used as the minimum abundance threshold for the non-control samples. For example, say one control had several ITS1 sequences with a maximum abundance of 124, and another control had a maximum ITS1 abundance of 217, while the remaining controls had no ITS1 sequence above the default level. In that case, the tool would take maximum 217 as the abundance threshold for the non-control samples.

If you wished to lower the threshold from the default to 50, you could use:

$ rm -rf intermediate/ITS1/*.fasta  # Are you sure?
$ thapbi_pict prepare-reads -i raw_data/ -o intermediate/ -a 50


By default thapbi_pict prepare-reads and thapbi_pict pipeline will reuse existing intermediate FASTA files, so you must explicitly delete any old FASTA files before the new abundance threshold will have any effect.


Setting the abundance threshold low (say under 50) risks background contamination coming through into the results. Do not do this without strong justification (e.g. look at suitable controls over multiple plates from your own laboratory procedure).


Setting the abundance threshold very low (under 10) has the additional problem that the number of unique sequences accepted will increase many times over. This will dramatically slow down the rest of the analysis. This is only advised for investigating single samples.

For the woody host data, each plate had a negative control sample which should contain no ITS1 sequences. We can specify the negative controls with -n or --negctrls by entering the four FASTQ filenames in full, but since they have a common prefix we can use a simple wildcard:

$ thapbi_pict prepare-reads -i raw_data/ -o intermediate/ -n raw_data/NEGATIVE*.fastq.gz

For this sample data, happily neither of the negative controls have any ITS1 present above the default threshold, so this would have no effect.

For the THAPBI Phyto-Threats project we now run each 96-well PCR plate with multiple negative controls. Rather than a simple blank, these include a known mixture of synthetic sequences of the same length, same nucelotide composition, and also same di-nucleotide composition as real Phytophthora ITS1. This means we might have say 90 biological samples which should contain ITS1 but not the synthetics controls, and 6 negative controls which should contain synthetic controls but not ITS1.

We therefore run thapbi_pict prepare-reads separately for each plate, where any ITS1 contamination in the synthetic controls is used to set a plate specific minimum abundance. This means we cannot run thapbi_pict pipeline on multiple plates at once (although we could run it on each plate, we generally want to produce reports over multiple plates).