High level overview

The high level summary is that all the samples have high coverage, much higher than most of the examples we have used. The coverage also varies between samples - making a fractional minimum abundance threshold attractive here. There is minimal off target signal (from the other primer sets), and the no template blanks have lower yields. The read counts in the blanks are high, but happily do not appear to contain nematode sequence.

Per-marker yield

We’ll start by looking at the number of read-pairs found for each marker. After calling ./run.sh you should be able to inspect these report files at the command line or in Excel.

$ cut -f 1,2,5-7,9,12 summary/NF1-18Sr2b.samples.onebp.tsv
<SEE TABLE BELOW>

Or open the Excel version summary/NF1-18Sr2b.samples.onebp.xlsx, and focus on those early columns:

#marker

sample

Raw FASTQ

Flash

Cutadapt

Threshold

Accepted

D3Af-D3Br

Blank

1193593

1039205

0

25

0

D3Af-D3Br

MC1

3897994

3317661

0

25

0

D3Af-D3Br

MC2

4228233

3685150

0

25

0

D3Af-D3Br

MC3

4309817

3864130

0

25

0

JB3-JB5GED

Blank

69641

62060

0

25

0

JB3-JB5GED

MC1

1236201

1157824

0

25

0

JB3-JB5GED

MC2

2160885

2058441

1

25

0

JB3-JB5GED

MC3

1204900

1139777

0

25

0

NF1-18Sr2b

Blank

260778

218813

187776

25

140063

NF1-18Sr2b

MC1

2483453

2126062

2109488

25

1394883

NF1-18Sr2b

MC2

2349364

1985981

1972923

25

1359884

NF1-18Sr2b

MC3

2435278

2088185

2070379

25

1409844

SSUF04-SSUR22

Blank

57199

46879

0

25

0

SSUF04-SSUR22

MC1

3162379

2633321

77

25

0

SSUF04-SSUR22

MC2

2790363

2370732

280

25

0

SSUF04-SSUR22

MC3

1953138

1640045

52

25

0

You should find the raw FASTQ numbers match the author’s Table 5, although that omits the blanks - which happily are all much lower.

The “Flash” column reports how many of those raw FASTQ read pairs could be overlap merged into a single sequence - and our numbers range from 82% to 95% (it is easy to add this calculation in Excel). This is very different from the author’s results in Table 6, although we agree that the best yield was with the JB3-JB5GED markers. Exploring the flash settings here, using -O or --allow-outies was important here to maximize yield, but that alone does not explain this discrepancy.

The “Cutadapt” column reports how many of those merged reads could be primer trimmed with the NF1-18Sr2b primers, and happily we get high numbers only for the NF1-18Sr2b samples, but low levels from the other samples. That could be barcode leakage in the demultiplexing, or actual unwanted DNA in the samples.

Then we have our “Threshold” and the final column highlighted here is the “Read count” after applying our minimum abundance threshold - and now we only get reads from the NF1-18Sr2b samples. These are all 25 specified at the command line with -a 25 in the script, and -f 0 to disable the fractional abundance threshold. This was done to reduce the false negatives in the mock communities to be more in line with the original analysis.

We can repeat this for the other three primer sets, and the same pattern is observed - strong signal only for the matching samples (with the blanks giving strong but lower counts), and all non-matching samples zero after the minimum abundance threshold is applied.

Blank controls

The excellent news is even at this (much lower than default) minimum abundance threshold there are no recognisable nematode sequences in any of the blanks.

Looking at the same sample reports (or the more detailed read reports), we see that while the blank samples with no PCR template control give lots of reads, where they can be identified the organisms are not seen in the mock communities. Quoting the paper:

Blank samples only yielded sequences of fungi and streptophyta.

In our case, we found lots of fungi and also the genus Urtica (which is a green plant under streptophyta), but also some Blastocystis (Stramenopiles), Cercomonas (Rhizaria) and Sphaerularioidea (Opisthokonta).

$ for MARKER in NF1-18Sr2b SSUF04-SSUR22 D3Af-D3Br JB3-JB5GED; do \
  grep $MARKER.Blank summary/$MARKER.samples.onebp.tsv | cut -f 1,2,4; \
  done
<SEE TABLE EXCERPT BELOW>

Or manually looking at the four separate files - where column 4 is a text summary of the classifier output:

NF1-18Sr2b

Blank

Fungi (unknown species), Urtica sp., Unknown

SSUF04-SSUR22

Blank

Blastocystis sp., Fungi (unknown species), Unknown

D3Af-D3Br

Blank

Cercomonas sp., Fungi (unknown species), Sphaerularioidea gen. sp. EM-2016, Unknown

JB3-JB5GED

Blank

Unknown

It should stressed that all the blank samples have unknown sequences (indeed the JB3-JB5GED blank sequences are all reported as unknown).