Minimum Abundance Threshold

With less samples multiplexed per sample than our own work (which guided the default settings), these samples were sequenced at much higher depth:

$ cut -f 1,2,5 metadata.tsv
<SEE TABLE BELOW>

As a table:

run_accession

library_name

read_count

SRR7109326

m6-stds

817764

SRR7109327

m6-301-1

890561

SRR7109328

m6-mock3-32000b

943839

SRR7109329

m6-766-1

840068

SRR7109330

m6-744-2

704173

SRR7109331

m6-500-1

911793

SRR7109341

m6-712-2

872265

SRR7109342

m6-500-2

879762

SRR7109343

m6-757-1

903886

SRR7109344

m6-757-2

1210627

SRR7109345

m6-mock3-16000

922440

SRR7109406

m6-712-1

897159

SRR7109408

m6-744-1

778090

SRR7109409

m6-mock3-32000a

1125275

SRR7109411

m6-766-2

785776

SRR7109412

m6-755-1

957067

SRR7109414

m6-736-1

998817

SRR7109415

m6-301-2

1181567

SRR7109417

m6-755-2

1071829

SRR7109418

m6-736-2

919363

SRR7109420

m6-SynMock

1299238

The defaults are an absolute abundance threshold of 100, and a fractional threshold of 0.1% (i.e. -a 100 -f 0.001). After merging overlapping reads and primer matching we could expect over 650,000 reads per m6 sample, giving a threshold over 650 reads.

So, with this coverage the default fractional abundance threshold of 0.1% (i.e. -f 0.001) makes the default absolute abundance threshold of 100 (i.e. -a 100) redundant. However, on this dataset our defaults are quite cautious, and control samples can help set thresholds objectively.

In this dataset there is a single synthetic control for m6 sequencing run, library SynMock aka SRR7109420. We can tell THAPBI PICT at the command line to use this to set the fractional abundance threshold via -y or --synctrls, and/or set the absolute abundance threshold via -n or --negctrls (with a list of control file names). It turns out however that with the default thresholds the control is clean (no unwanted non-synthetic ITS2 reads).

Using the defaults

The first step in run.sh is to run the pipeline with the default abundance thresholds (stricter than the alternatives analyses below), giving just a few hundred unique ITS2 sequences:

$ grep -c "^ITS2" summary/defaults.ITS2.tally.tsv
360
$ grep -c "^ITS2" summary/defaults.ITS2.reads.1s5g.tsv
360

Look at summary/defaults.ITS2.samples.1s5g.xlsx or working at the command line with the TSV file:

$ cut -f 1,7,9,11-12,14-15 summary/defaults.ITS2.samples.1s5g.tsv
<SEE TABLE BELOW>

As a table:

#sample_alias

Cutadapt

Threshold

Max non-spike

Max spike-in

Accepted

Unique

301-1

807956

808

348111

0

528957

38

301-2

1108129

1109

457440

0

778850

31

500-1

819468

820

289229

0

516474

30

500-2

813470

814

214155

0

529967

34

712-1

820146

821

131937

0

533310

56

712-2

796363

797

299240

0

520290

34

736-1

943427

944

349965

0

669563

36

736-2

854919

855

282132

0

609025

25

744-1

706659

707

358089

0

493209

20

744-2

651528

652

136471

0

452421

36

755-1

887650

888

462493

0

616322

27

755-2

982087

983

589120

0

669602

17

757-1

835431

836

281533

0

578198

35

757-2

1099959

1100

224635

0

742540

28

766-1

792260

793

526535

0

583643

16

766-2

711176

712

251097

0

469397

26

BioMock

866253

867

56120

0

591947

23

BioMock

846519

847

65686

0

585715

23

BioMock

1023231

1024

84748

0

698170

22

BioMockStds

736334

737

35300

0

521693

26

SynMock

1199806

1200

0

103014

862950

18

The SynMock control is clean, no non-spike-in reads passed the default abundance thresholds.

So, there is scope to lower the default thresholds - but how low? We will start by reproducing the Illumina part of Figure 6, which was based on the m6 MiSeq sequencing run. This figure explores tag-switching in the demultiplexing, and in the authors’ analysis goes as low as 5 reads.

Excluding only singletons

The run.sh example continues by running the pipeline on the m6 dataset with -f 0 -a 2 to accept everything except singletons (sequences which are only seen once in a sample; including them gives about ten times as many unique sequences which slows everything down). Also, this analysis does not use the synthetic control to raise the threshold on the rest of the samples - we want to see any low level mixing. We then can compare our sample report against Figure 6.

Looking at the unique reads in the FASTA file, tally table, or in the reads report with metadata, we have nearly 200 thousand ITS2 sequences:

$ grep -c "^ITS2" summary/a2.ITS2.tally.tsv
196480
$ grep -c "^ITS2" summary/a2.ITS2.reads.onebp.tsv
196480

Look at summary/a2.ITS2.samples.onebp.xlsx or working at the command line with the TSV file:

$ cut -f 1,5-7,11-12,14-15 summary/a2.ITS2.samples.onebp.tsv
<SEE TABLE BELOW>

As a table:

#sample_alias

Raw FASTQ

Flash

Cutadapt

Max non-spike

Max spike-in

Accepted

Unique

301-1

890561

812674

807956

348111

0

687950

12638

301-2

1181567

1113606

1108129

457440

0

977003

13319

500-1

911793

823392

819468

289229

0

689174

14249

500-2

879762

817277

813470

214155

0

699634

12851

712-1

897159

823034

820146

131937

0

703189

17574

712-2

872265

800475

796363

299240

0

683057

13937

736-1

998817

948348

943427

349965

15

834461

12993

736-2

919363

858915

854919

282132

0

757097

9625

744-1

778090

710762

706659

358089

0

614988

7936

744-2

704173

654661

651528

136471

0

564238

8650

755-1

957067

891942

887650

462493

15

782052

12142

755-2

1071829

987280

982087

589120

0

848793

10587

757-1

903886

839105

835431

281533

0

725057

12729

757-2

1210627

1105530

1099959

224635

0

950457

15819

766-1

840068

794475

792260

526535

0

712126

7519

766-2

785776

714894

711176

251097

0

606887

11189

BioMock

943839

872263

866253

56120

0

744007

17274

BioMock

922440

859262

846519

65686

0

733784

16676

BioMock

1125275

1047383

1023231

84748

3

884514

18416

BioMockStds

817764

740627

736334

35300

0

628576

17202

SynMock

1299238

1204532

1199806

187

103014

1043525

14234

Here SynMock (SRR7109420) is the synthetic control, and it has some non-spike-in reads present, the most abundant at 187 copies. Conversely, samples 755-1 (SRR7109412), 736-1 (SRR7109414), and one of the BioMock samples (SRR7109409) have trace levels of unwanted synthetic spike-in reads, the most abundant at 15, 15 and 3 copies respectively. The counts differ, but these are all samples highlighted in Figure 6 (sharing the same Illumina i7 or i5 index for multiplexing). We don’t see this in the other BioMock samples, but our pipeline appears slightly more stringent.

As percentages, 187/1199806 gives 0.0156% which is nearly ten times lower than our default of 0.1%. The numbers the other way round are all even lower, 15/462496 gives 0.003%, 15/349965 gives 0.004%, and 3/1023234 gives 0.003%.

Using the synthetic control

Next the run.sh example uses the SynMock synthetic control to automatically raise the fractional abundance threshold from zero to 0.015% by including -a 100 -f 0 -y raw_data/SRR7109420_*.fastq.gz in the command line. This brings down the unique sequence count enough to just over three thousand, allowing use of a slower but more lenient classifier as well:

$ grep -c "^ITS2" summary/ctrl.ITS2.tally.tsv
3097
$ grep -c "^ITS2" summary/ctrl.ITS2.reads.1s5g.tsv
3097

Look at summary/ctrl.ITS2.samples.1s5g.xlsx or working at the command line with the TSV file:

$ cut -f 1,7,9,11-12,14-15 summary/ctrl.ITS2.samples.1s5g.tsv
<SEE TABLE BELOW>

Note we now get a threshold column showing the absolute threshold applied to each sample (using the inferred percentage), all above the absolute default of 100. You can see the total accepted read count has dropped, and the number of unique sequences accepted has dropped even more dramatically:

#sample_alias

Cutadapt

Threshold

Max non-spike

Max spike-in

Accepted

Unique

301-1

807956

126

348111

0

579502

262

301-2

1108129

173

457440

0

829870

189

500-1

819468

128

289229

0

568336

228

500-2

813470

127

214155

0

578432

215

712-1

820146

128

131937

0

569100

181

712-2

796363

125

299240

0

570488

243

736-1

943427

148

349965

0

708900

183

736-2

854919

134

282132

0

653753

220

744-1

706659

111

358089

0

540597

273

744-2

651528

102

136471

0

472785

129

755-1

887650

139

462493

0

694273

340

755-2

982087

154

589120

0

754928

338

757-1

835431

131

281533

0

610579

171

757-2

1099959

172

224635

0

781212

142

766-1

792260

124

526535

0

648524

301

766-2

711176

111

251097

0

508838

205

BioMock

866253

136

56120

0

607401

77

BioMock

846519

132

65686

0

603186

82

BioMock

1023231

160

84748

0

718660

85

BioMockStds

736334

115

35300

0

526317

48

SynMock

1199806

100

187

103014

885051

113

Note that Palmer et al. (2018) apply a threshold to individual sequences, but the thresholding strategy in THAPBI PICT applies the fractional threshold to all the samples (given in the same sub-folder as input, so you can separate your MiSeq runs, or your PCR plates, or just apply a global threshold).

In fact, looking at the read report summary/ctrl.ITS2.reads.1s5g.tsv it is clear that while this threshold may have excluded Illumina tag-switching, it has not excluded PCR noise - there are hundreds of low abundance sequences unique to a single sample. To address that we would have to use a considerably higher threshold, and the default 0.1% is a reasonable choice here, or apply a denoising algorithm like UNOISE.

Threshold selection

Excluding only singletons is too lenient, but how does the the synthetic control inferred threshold (0.0156%) compare to the default (0.1%)?

Here are the classifier assessment values using the lower inferred threshold which allows a lot of PCR noise:

$ head -n 2 summary/ctrl.ITS2.assess.1s5g.tsv
<SEE TABLE BELOW>

As a table:

#Species

TP

FP

FN

TN

sensitivity

specificity

precision

F1

Hamming-loss

Ad-hoc-loss

OVERALL

102

11

1

186

0.99

0.94

0.90

0.94

0.0400

0.105

Versus the stricter higher default abundance fraction which excludes most of the PCR noise:

$ head -n 2 summary/defaults.ITS2.assess.1s5g.tsv
<SEE TABLE BELOW>

As a table:

#Species

TP

FP

FN

TN

sensitivity

specificity

precision

F1

Hamming-loss

Ad-hoc-loss

OVERALL

92

8

11

189

0.89

0.96

0.92

0.91

0.0633

0.171

You could use the assessment metrics to help decide on your preferred threshold, depending on the best tradeoff for your use-case.

Personally, of the these two I would pick the higher default threshold since it appears to exclude lots of PCR noise as seen in the edit graphs. With the default 0.1% threshold:

Sequence edit-graph, with default $0.1\%$ abundance threshold, showing 360 ASVs.

Using the lower threshold there are roughly ten times as many ASVs. The more common ASV nodes become the centre of a halo of 1bp variants, typically each seen in a single sample, which we attribute to PCR noise:

Sequence edit-graph, with synthetic control inferred $0.0156\%$ abundance threshold, showing 3097 ASVs.

The best choice of threshold may lie somewhere in between?

Read-correction for denoising

Read-correction is an alternative or supplement to a stringent abundance filter for removing the noise of sequence variants presumed to be PCR artefacts. Use --denoise as part of the pipeline or sample-tally commands to enable our implementation of the UNOISE algorithm (Edgar 2016).

Adding this to the control-driven abundance threshold example drops the total unique read count from over 3 thousand to just over 700:

$ grep -c "^ITS2" summary/ctrl_denoise.ITS2.tally.tsv
704
$ grep -c "^ITS2" summary/ctrl_denoise.ITS2.reads.1s5g.tsv
704

This gives an edit graph visually somewhere in between the examples above, with the obvious variant halos collapsed, but some of the more complex chains of variants still present.

In terms of classifier assessment on the mock community, there is no change:

$ head -n 2 summary/ctrl_denoise.ITS2.assess.1s5g.tsv
<SEE TABLE BELOW>

As a table:

#Species

TP

FP

FN

TN

sensitivity

specificity

precision

F1

Hamming-loss

Ad-hoc-loss

OVERALL

102

11

1

186

0.99

0.94

0.90

0.94

0.0400

0.105

Looking at the reports, the read counts are of course different, but also some of the reads assigned a genus-only classification have been removed via the read-correction, so the taxonomy output does not directly match up either.