High level overview
The high level summary is that all the samples have high coverage, much higher than most of the examples we have used. The coverage also varies between samples - making a fractional minimum abundance threshold attractive here. There is minimal off target signal (from the other primer sets), and the no template blanks have lower yields. The read counts in the blanks are high, but happily do not appear to contain nematode sequence.
Per-marker yield
We’ll start by looking at the number of read-pairs found for each marker.
After calling ./run.sh
you should be able to inspect these report files
at the command line or in Excel.
$ cut -f 1,2,5-7,9,12 summary/NF1-18Sr2b.samples.onebp.tsv
<SEE TABLE BELOW>
Or open the Excel version summary/NF1-18Sr2b.samples.onebp.xlsx
, and focus
on those early columns:
#marker |
sample |
Raw FASTQ |
Flash |
Cutadapt |
Threshold |
Accepted |
---|---|---|---|---|---|---|
D3Af-D3Br |
Blank |
1193593 |
1039205 |
0 |
25 |
0 |
D3Af-D3Br |
MC1 |
3897994 |
3317661 |
0 |
25 |
0 |
D3Af-D3Br |
MC2 |
4228233 |
3685150 |
0 |
25 |
0 |
D3Af-D3Br |
MC3 |
4309817 |
3864130 |
0 |
25 |
0 |
JB3-JB5GED |
Blank |
69641 |
62060 |
0 |
25 |
0 |
JB3-JB5GED |
MC1 |
1236201 |
1157824 |
0 |
25 |
0 |
JB3-JB5GED |
MC2 |
2160885 |
2058441 |
1 |
25 |
0 |
JB3-JB5GED |
MC3 |
1204900 |
1139777 |
0 |
25 |
0 |
NF1-18Sr2b |
Blank |
260778 |
218813 |
187776 |
25 |
140063 |
NF1-18Sr2b |
MC1 |
2483453 |
2126062 |
2109488 |
25 |
1394883 |
NF1-18Sr2b |
MC2 |
2349364 |
1985981 |
1972923 |
25 |
1359884 |
NF1-18Sr2b |
MC3 |
2435278 |
2088185 |
2070379 |
25 |
1409844 |
SSUF04-SSUR22 |
Blank |
57199 |
46879 |
0 |
25 |
0 |
SSUF04-SSUR22 |
MC1 |
3162379 |
2633321 |
77 |
25 |
0 |
SSUF04-SSUR22 |
MC2 |
2790363 |
2370732 |
280 |
25 |
0 |
SSUF04-SSUR22 |
MC3 |
1953138 |
1640045 |
52 |
25 |
0 |
You should find the raw FASTQ numbers match the author’s Table 5, although that omits the blanks - which happily are all much lower.
The “Flash” column reports how many of those raw FASTQ read pairs could be
overlap merged into a single sequence - and our numbers range from 82% to 95%
(it is easy to add this calculation in Excel). This is very different from the
author’s results in Table 6, although we agree that the best yield was with
the JB3-JB5GED markers. Exploring the flash settings here, using -O
or
--allow-outies
was important here to maximize yield, but that alone does
not explain this discrepancy.
The “Cutadapt” column reports how many of those merged reads could be primer trimmed with the NF1-18Sr2b primers, and happily we get high numbers only for the NF1-18Sr2b samples, but low levels from the other samples. That could be barcode leakage in the demultiplexing, or actual unwanted DNA in the samples.
Then we have our “Threshold” and the final column highlighted here is the
“Read count” after applying our minimum abundance threshold - and now we only
get reads from the NF1-18Sr2b samples. These are all 25 specified at the
command line with -a 25
in the script, and -f 0
to disable the
fractional abundance threshold. This was done to reduce the false negatives
in the mock communities to be more in line with the original analysis.
We can repeat this for the other three primer sets, and the same pattern is observed - strong signal only for the matching samples (with the blanks giving strong but lower counts), and all non-matching samples zero after the minimum abundance threshold is applied.
Blank controls
The excellent news is even at this (much lower than default) minimum abundance threshold there are no recognisable nematode sequences in any of the blanks.
Looking at the same sample reports (or the more detailed read reports), we see that while the blank samples with no PCR template control give lots of reads, where they can be identified the organisms are not seen in the mock communities. Quoting the paper:
Blank samples only yielded sequences of fungi and streptophyta.
In our case, we found lots of fungi and also the genus Urtica (which is a green plant under streptophyta), but also some Blastocystis (Stramenopiles), Cercomonas (Rhizaria) and Sphaerularioidea (Opisthokonta).
$ for MARKER in NF1-18Sr2b SSUF04-SSUR22 D3Af-D3Br JB3-JB5GED; do \
grep $MARKER.Blank summary/$MARKER.samples.onebp.tsv | cut -f 1,2,4; \
done
<SEE TABLE EXCERPT BELOW>
Or manually looking at the four separate files - where column 4 is a text summary of the classifier output:
NF1-18Sr2b |
Blank |
Fungi (unknown species), Urtica sp., Unknown |
SSUF04-SSUR22 |
Blank |
Blastocystis sp., Fungi (unknown species), Unknown |
D3Af-D3Br |
Blank |
Cercomonas sp., Fungi (unknown species), Sphaerularioidea gen. sp. EM-2016, Unknown |
JB3-JB5GED |
Blank |
Unknown |
It should stressed that all the blank samples have unknown sequences (indeed the JB3-JB5GED blank sequences are all reported as unknown).