High-level overview

In summary, all the samples have high coverage, much higher than most of the examples we have used. Some of the samples yield over a million reads for the COI and 12S amplicons, which with the default fractional minimum abundance threshold of 0.1% (-f 0.001) would mean using over 1000 reads as the threshold. This was too stringent, so the worked example reduces this to 0.01% (with -f 0.0001), matching the author’s analysis, and drops the absolute abundance threshold from the default of 100 to 50 (with -a 50).

Note that the rarest members of the mock communities are expected at 1 in 500 individuals (0.2%) or 1 in 1000 individuals (0.1%), which is at least ten times higher than the 0.01% fractional abundance threshold.

Sequence yield

We’ll start by looking at the number of read-pairs found for each marker. After calling ./run.sh you should be able to inspect these report files at the command line or in Excel.

$ cut -f 3,6-8,10,12-14 summary/COI.samples.onebp.tsv
<SEE TABLE BELOW>

Or open the Excel version summary/COI.samples.onebp.xlsx, and focus on those early columns:

============  =========  =======  ========  =========  ==========  ========  ======
sample_alias  Raw FASTQ    Flash  Cutadapt  Threshold  Singletons  Accepted  Unique
============  =========  =======  ========  =========  ==========  ========  ======
100-Pool-1       478705   474621    109233         50       11074     86402     178
250-Pool-1      1845819  1829913    157310         50       23119    118383     251
500-Pool-1       647776   643030     51092         50        6446     36718     127
1000-Pool-1      855997   848914     66002         50        7967     49058     149
100-Pool-2       737998   732014    432826         50       29168    368249     418
250-Pool-2      2037475  2022814   1250718        126       85718   1042562     482
500-Pool-2      1908370  1895715   1231908        124       59702   1042441     442
1000-Pool-2     1068715  1060596    584017         59       33060    498955     445
100-Pool-3       950692   940342    249422         50       24964    189156     371
250-Pool-3      1631700  1615113    274422         50       39974    192944     562
500-Pool-3       923807   916621    358429         50       32221    284819     567
1000-Pool-3     1773647  1758637    468361         50       42263    374487     733
100-Pool-4       634017   628523    117499         50       14596     74799     175
250-Pool-4      2501145  2480381    441558         50       61904    324512     707
500-Pool-4       572779   568565    144488         50       18279     96537     306
1000-Pool-4     1198812  1189853    294607         50       30130    220678     470
100-Pool-5      1817929  1800594    434739         50       45224    329015     660
250-Pool-5      1632786  1617219    440995         50       58159    328842     729
500-Pool-5       807060   801471    321428         50       30944    247519     484
1000-Pool-5     1423279  1411512    332286         50       32751    255309     584
Trap-1          1759819  1719671    110882         50       19740     73024     251
Trap-10         2445993  2420303    308371         50       58670    204842     480
Trap-2          1127739  1107970    110856         50       24385     55757      92
Trap-3          2422054  2366037    161686         50       30631    110043     268
Trap-4           742893   732907     63107         50       11933     35225      77
Trap-5          3437292  3346620    346696         50       71464    208989     542
Trap-6           697389   689125     91284         50       17153     57037     149
Trap-7          2853448  2820200    223330         50       31011    169121     319
Trap-8          2196646  2161966    146646         50       28814     92632     220
Trap-9          2065455  2049024     70591         50       14636     40131     109
============  =========  =======  ========  =========  ==========  ========  ======

The marker-specific tables show the threshold applied was usually 50, the absolute value set via -a 50 at the command line. Occasionally this has been increased to 0.01% of the sequences matching the primers for this marker, set via -f 0.0001 at the command line (for example 126 for 250-Pool-2, which had 1250718 reads after Cutadapt).
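How the two settings combine can be sketched as taking the larger of the absolute threshold and the fractional threshold rounded up to a whole read. This is only an illustration, not the tool's own code: the function name effective_threshold is hypothetical, and the rounding up is an assumption chosen because it reproduces the Threshold column of the COI table from its Cutadapt column.

```shell
#!/bin/sh
# Hypothetical helper (not part of the pipeline):
#   effective_threshold CUTADAPT_READS [ABSOLUTE] [FRACTION]
# Assumption: the threshold applied is the larger of the absolute value
# and the fractional value rounded up to a whole number of reads.
effective_threshold() {
    awk -v n="$1" -v a="${2:-50}" -v f="${3:-0.0001}" 'BEGIN {
        t = n * f                 # fractional threshold, possibly non-integer
        c = int(t)
        if (c < t - 1e-9) c++     # round up, tolerating floating point noise
        print (c > a ? c : a)     # the larger of the two thresholds wins
    }'
}

effective_threshold 432826     # 100-Pool-2: prints 50 (absolute threshold wins)
effective_threshold 1250718    # 250-Pool-2: prints 126 (fractional threshold wins)
```

Calling effective_threshold 1000000 100 0.001 gives 1000, the figure that made the default settings too stringent for these samples.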

The numbers are similar for the 12S and 18S markers, or pooling them all:

$ cut -f 3,6,7,13,14 summary/pooled.samples.onebp.tsv
<SEE TABLE BELOW>

Again, alternatively open Excel file summary/pooled.samples.onebp.xlsx, and focus on those early columns:

============  =========  =======  ========  ======
sample_alias  Raw FASTQ    Flash  Accepted  Unique
============  =========  =======  ========  ======
100-Pool-1       478705   474621    371045     703
250-Pool-1      1845819  1829913   1508292     689
500-Pool-1       647776   643030    522396     800
1000-Pool-1      855997   848914    692639     950
100-Pool-2       737998   732014    587902     886
250-Pool-2      2037475  2022814   1243165     837
500-Pool-2      1908370  1895715   1551757    1142
1000-Pool-2     1068715  1060596    863574    1024
100-Pool-3       950692   940342    684297    1479
250-Pool-3      1631700  1615113   1158575    1241
500-Pool-3       923807   916621    697552    1457
1000-Pool-3     1773647  1758637   1366298    1993
100-Pool-4       634017   628523    451801     879
250-Pool-4      2501145  2480381   1867605    1171
500-Pool-4       572779   568565    416456     925
1000-Pool-4     1198812  1189853    918004    1660
100-Pool-5      1817929  1800594   1369274    1918
250-Pool-5      1632786  1617219   1128901    1475
500-Pool-5       807060   801471    603390    1276
1000-Pool-5     1423279  1411512   1104412    1716
Trap-1          1759819  1719671    392775     919
Trap-10         2445993  2420303    492325    1079
Trap-2          1127739  1107970    129956     273
Trap-3          2422054  2366037    427533     953
Trap-4           742893   732907    232800     403
Trap-5          3437292  3346620    486282    1177
Trap-6           697389   689125     80003     170
Trap-7          2853448  2820200   1158684     842
Trap-8          2196646  2161966    683669    1024
Trap-9          2065455  2049024   1352408     689
============  =========  =======  ========  ======

The “Accepted” column is the number of reads matching the primer pairs and passing our abundance thresholds. As a fraction of the raw read-pairs, this varies from 61% to 82% for the mock community samples, but is generally lower for the environmental traps, varying from 11% to 65%. Much of what was rejected would be noise and trace-level environmental DNA.
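The accepted fractions can be checked from the pooled table with a short awk filter. This is a minimal sketch: the helper name accepted_percent is hypothetical, it assumes the fraction is taken relative to the Raw FASTQ count, and it expects three tab-separated fields (sample_alias, Raw FASTQ, Accepted) rather than the full report file.

```shell
#!/bin/sh
# Hypothetical helper: reads tab-separated lines of
#   sample_alias <TAB> Raw FASTQ <TAB> Accepted
# and prints each sample with its accepted percentage to one decimal place.
# Assumption: the denominator is the raw read-pair count.
accepted_percent() {
    awk -F '\t' '{ printf "%s\t%.1f\n", $1, 100 * $3 / $2 }'
}

# Values copied from the pooled table: the extremes for the mock
# communities and for the environmental traps.
printf '%s\t%s\t%s\n' \
    250-Pool-1 1845819 1508292 \
    250-Pool-2 2037475 1243165 \
    Trap-6 697389 80003 \
    Trap-9 2065455 1352408 | accepted_percent
```

For these four samples it reports 81.7, 61.0, 11.5 and 65.5 percent respectively.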

The “Unique” column is the number of accepted unique sequences. For the mock communities this should be at most 18, given at most six species in each and three markers. The observed counts are much higher, so we might want to denoise and/or raise the abundance threshold. Dropping the threshold further does raise the false positive rate inferred from the mock communities.