Abundance & Negative Controls
Any negative control sample is not expected to contain any of the target sequences (although it may contain spike-in synthetic control sequences).
On a typical 96-well plate of PCR products which will go on to be multiplexed for Illumina MiSeq sequencing, most of the samples are biological - but some should be negative controls (e.g. PCR blanks, or synthetic sequences). The presence of biological sequence reads in the negative control samples is indicative of some kind of cross contamination. Likewise, synthetic sequences in the biological samples are a warning sign.
The tool implements both absolute and fractional abundance thresholds, which can be specified at the command line. Moreover, control samples can be used to automatically raise the threshold for batches of samples. Simple negative controls can be used to set an absolute abundance threshold, but to set the fractional abundance threshold we need to be able to distinguish expected sequences from unwanted ones. For this we require known spike-in control sequences, which are clearly distinct from the biological markers.
Four synthetic sequences were designed for Phyto-Threats project which funded THAPBI PICT. These were of the typical expected ITS1 fragment length and base content, had the typical fixed 32bp header, but were otherwise shuffled with no biological meaning (avoiding any secondary structure forming). They were synthesised using Integrated DNA Technologies gBlocks Gene Fragments.
Our 96-well PCR plates included multiple control samples which were known ratios of these synthetic sequences, rather than environmental DNA.
The tool needs a way to distinguish biological marker sequences (for which we wanted to make as few assumptions as possible) from the synthetic ones (where the template sequences were known, subject only to PCR noise).
The spike-in controls are assumed to be in the database, by default under the synthetic “genus” but that is configurable. Similar sequences in the samples are considered to be spike-ins. While the PCR noise is typically just a few base pair changes, we also found large deletions relatively common. The matching is therefore relaxed, currently based on k-mer content.
Conversely, the presence of the synthetic controls in any of the biological samples is also problematic. Since our synthetic control sequences are in the default database, they can be matched by the chosen classifier, and appear in the reports.
Minimum Absolute Abundance Threshold
The initial absolute abundance threshold is set at the command line with
--abundance giving an integer value. If your samples have
dramatically different read coverage, then the fractional abundance threshold
may be more appropriate (see below).
During the sample tally step, the
--negctrls argument gives
the sample filenames of any negative controls to use to potentially increase
the absolute abundance threshold (see below). If you have no spike-in
controls, then any sequences in these negative controls can raise the
threshold - regardless of what they may or may not match in the reference
Minimum Fractional Abundance Threshold
The initial fractional abundance threshold is set at the command line with
--abundance-fraction to an floating point number between zero
and one, thus
-f 0.001 means 0.1%. This is a percentage of the reads
identified for each marker after merging the overlapping pairs and primer
During the read preparation step, the
--synctrls argument gives
the sample filenames of any synthetic controls to use to potentially increase
the absolute abundance threshold. This setting works in conjunction with the
database which must include the spike-in sequences under the genera specified
at the command lines with
--synthetic (by default “synthetic”).
Any control samples are processed first, before the biological samples, and high read counts can raise the threshold to that level for the other samples in that folder. This assumes if you have multiple 96-well plates, or other logical groups, their raw FASTQ files are separated into a sub-folder per plate.
Control samples given via
-n can raise the absolute abundance threshold
(any synthetic spike-in reads are ignored for this), while controls given via
-y can raise the fractional abundance threshold (but must have synthetic
spike-in reads in order to give a meaningful fraction).
For example, if running with the default minimum abundance threshold of 100
-a 100), and a negative control (set via
-n raw_data/CTRL*.fastq.gz) contains a non-spike-in (and thus presumably
biological) sequence at abundance 136, then the threshold for the non-control
samples in that folder is raised to 136.
Alternatively, you might have synthetic spike-in controls listed with
-y raw_data/SPIKES*.fastq.gz and use
-f 0.001 to set a default
fractional abundance threshold of 0.1%. Suppose a control had 100,000 reads
for a marker passing the overlap merging and primer matching, of which 99,800
matched the spike-ins leaving 200 unwanted presumably biological reads, of
which the most abundant was at 176 copies. Then the fractional abundance
threshold would be raised slightly to 176 / 100000 = 0.00176 or 0.176%.
Note that a control sample can be used with both
-y, so in
this second example that would also raise the absolute abundance threshold
to 176 reads.
Potentially a synthetic control sample can have unusually low read coverage, meaning even a low absolute number of non-spike-in reads (at noise level) would give a spuriously high inferred fractional abundance threshold. To guard against this corner case, as a heuristic half the absolute abundance threshold is applied to the synthetic control samples. Likewise, half of any fractional abundance threshold is applied to the negative control samples, which guards against spurious raising of the absolute threshold.
A similar problem would occur if you accidentally use
-y on a sample
without any expected spike-in controls. This would suggest result in an overly
high fractional threshold, treated as an error.