Unexpected sequences

In the previous section, we highlighted several unexpected contaminants in the negative controls which could not be explained as cross-contamination from the mock community. Likewise, the read reports show plenty of unassigned sequences: reads which did not match the very narrow databases built from ITS1.fasta or ITS2.fasta, which contain only the markers expected from the mock community.

Some unexpected sequences might reflect additional alternative copies of ITS1 or ITS2 in the genomes. Others are likely external contamination - after all there are fungi practically everywhere. This seems to have happened in amplicon library one, at least in the high PCR cycle negative control. Meanwhile, amplicon library two does not have any obvious external contamination.

Amplicon library one - ITS1 (BITS/B58S3)

From the first amplicon library for ITS1 we saw the following sequences in the negative controls (and by chance, not in any mock community samples) - shown here with their highest single sample abundance, which supports using a minimum abundance threshold higher than 10:

MD5 checksum	Max	Species
`daadc4126b5747c43511bd3be0ea2438`	32	Wallemia muriae
`e5b7a8b5dc0da33108cc8a881eb409f5`	10	Wallemia muriae; Wallemia sebi
`5194a4ae3a27d987892a8fee7b1669b9`	17	Trichosporon asahii
`702929cef71042156acb3a28270d8831`	14	Candida tropicalis
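To make the threshold argument concrete, here is a minimal sketch which counts how many of the four negative control sequences above would survive at a few candidate thresholds. The abundances are copied from the table; the helper name and the rule that a sequence is kept when its peak count is at or above the threshold are assumptions for illustration.

```python
# Highest single-sample abundance for the four sequences seen only in the
# negative controls (values copied from the table above).
NEGATIVE_CONTROL_MAX = {
    "daadc4126b5747c43511bd3be0ea2438": 32,
    "e5b7a8b5dc0da33108cc8a881eb409f5": 10,
    "5194a4ae3a27d987892a8fee7b1669b9": 17,
    "702929cef71042156acb3a28270d8831": 14,
}


def passing_threshold(max_abundance, threshold):
    """Return the MD5s whose peak abundance is at or above the threshold.

    Assumption: a sequence is kept if its count in at least one sample
    reaches the threshold, so only the per-sequence maximum matters.
    """
    return sorted(
        md5 for md5, count in max_abundance.items() if count >= threshold
    )


for threshold in (10, 15, 20, 33):
    kept = passing_threshold(NEGATIVE_CONTROL_MAX, threshold)
    print(f"threshold {threshold}: {len(kept)} of {len(NEGATIVE_CONTROL_MAX)} kept")
```

Under these assumptions all four sequences pass a threshold of 10, and none would pass a threshold of 33.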

Here are the entries with a maximum sample abundance over 75 which the onebp classifier (and in some cases the blast based classifier) failed to match, along with the most likely match from reviewing an online NCBI BLAST search. You can easily extract these entries (and their sequences) from the bottom of the summary/AL1_BITS_B58S3.reads.*.tsv files:

MD5 checksum	Max	Species
`5ca0acd7dd9d76fdd32c61c13ca5c881`	4562	Epicoccum nigrum; Epicoccum layuense
`ee5382b80607f0f052a3ad3c4e87d0ce`	575	glomeromycetes, perhaps Rhizophagus
`880007c5a18be69c3f444efd144fc450`	236	Ascochyta or Neoascochyta?
`8e74f38b058222c58943fc6211d277fe`	149	Fusarium
`cae29429b90fc6539c440a140494aa25`	114	glomeromycetes, perhaps Rhizophagus
`85775735614d45d056ce5f1b67f8d2b2`	109	Fusarium
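If you would rather pull such entries out of a tab-separated report with a script, something along these lines may help. This is only a sketch: it assumes a plain TSV with a single header row, and the column names `md5`, `max_abundance` and `prediction` are hypothetical placeholders, so check the header of your own file and pass the real names. Treating an empty field or `-` as "no match" is also an assumption.

```python
import csv


def unmatched_above(handle, min_abundance=75, md5_col="md5",
                    max_col="max_abundance", pred_col="prediction"):
    """Yield (md5, peak abundance) for unclassified rows over the cut-off.

    The default column names are placeholders, not the real report header.
    A row counts as unmatched when its prediction is empty or "-".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        prediction = (row.get(pred_col) or "").strip()
        if prediction not in ("", "-"):
            continue  # the classifier assigned something to this entry
        try:
            peak = int(row[max_col])
        except (TypeError, ValueError):
            continue  # skip rows without a usable abundance value
        if peak > min_abundance:
            yield row[md5_col], peak


# Example usage (file name is a placeholder):
# with open("example.reads.tsv") as handle:
#     for md5, peak in unmatched_above(handle):
#         print(md5, peak, sep="\t")
```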

The sequence with the top abundance, 5ca0acd7dd9d76fdd32c61c13ca5c881, perfectly matches the fungi Epicoccum nigrum and Epicoccum layuense. Present at low levels in multiple samples, this was the dominant sequence in SRR5314339 aka FMockE.HC1_S178, which was a high PCR cycle number replicate of the even mixture. Perhaps this was a stray fragment of Epicoccum which by chance was amplified early in the PCR? This example was not highlighted in the original paper, but is exactly the kind of thing you should worry about with a high PCR cycle number.

Next ee5382b80607f0f052a3ad3c4e87d0ce and the less abundant sequence cae29429b90fc6539c440a140494aa25 look like glomeromycetes, perhaps Rhizophagus (from the mock community), but could be from a Glomus species. The blast classifier with the minimal curated reference set matches these to Rhizophagus irregularis, but the situation would be ambiguous in a more complete database.

Sequence 880007c5a18be69c3f444efd144fc450 has perfect matches to lots of unclassified fungi, and conflicting perfect matches including Ascochyta or Neoascochyta. This was seen only in the high PCR cycle number sample SRR5314339 as above.

Next 8e74f38b058222c58943fc6211d277fe and 85775735614d45d056ce5f1b67f8d2b2 have good BLAST matches to several different Fusarium species, so could also be from the mock community.

You can find all six of these sequences on the edit-graph, most as isolated grey nodes along the bottom, except cae29429b90fc6539c440a140494aa25 which is 3bp away from Rhizophagus irregularis and linked to it with a dashed line.
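Distances like "3bp away" can be checked by hand. The sketch below assumes they are plain Levenshtein edit distances (substitutions, insertions and deletions each costing one); the function names are illustrative and the sequences in the usage note are toy strings, not real ITS1 markers.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using two rows of the
    dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # match or substitute
            ))
        previous = current
    return previous[-1]


def closest_reference(query, references):
    """Return (name, distance) for the nearest entry in a dict of
    reference name to sequence."""
    return min(
        ((name, edit_distance(query, seq)) for name, seq in references.items()),
        key=lambda pair: pair[1],
    )
```

For instance `edit_distance("ACGTACGT", "ACGTTCGT")` is 1 (a single substitution), so a classifier allowing at most a one base pair difference would accept that pair but not one at distance 2 or 3.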

So some of the ITS1 sequences in amplicon library one are likely external contamination - particularly with the high PCR cycle negative control (which was likely included exactly because of this risk).

Amplicon library two - ITS1 (ITS1f/ITS2)

Using our blast classifier with the 19 species database, everything was assigned a match. The default onebp classifier was stricter. For example while the very common f1b689ef7d0db7b0d303e9c9206ee5ad (which with the BITS/B58S3 primers gave bb28f2b57f8fddefe6e7b5d01eca8aea) was matched to Fusarium oxysporum, all the variations of this were too far away from the database entries for a match.
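The same underlying marker gets a different identifier with each primer pair because the checksum covers the whole amplified fragment. A minimal sketch, assuming the identifiers are the MD5 of the upper case sequence string (the helper name here is illustrative, and the sequences are toy strings rather than real amplicons):

```python
import hashlib


def md5seq(seq):
    """Return the hex MD5 digest of a sequence, ignoring letter case."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()


# Toy example: a short fragment and a longer one which contains it, as
# would happen with primers sitting further apart on the same template.
short_fragment = "ACGTTGCAAC"
long_fragment = "GGAT" + short_fragment + "TTCA"

print(md5seq(short_fragment))
print(md5seq(long_fragment))  # a completely different 32 character digest
```

Because any change to the string changes the digest, checksums identify exact sequences only; they say nothing about how similar two sequences are.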

These primers amplified a larger fragment than that in amplicon library one. Focusing on those with a sample-abundance over 75 (as in the edit-graphs) which the onebp classifier did not match to the curated reference set:

Long sequence MD5 (ITS1f/ITS2).	Max	Species
`57b06dff740b38bd6a0375abd9db3972`	640	glomeromycetes, perhaps Rhizophagus
`eed6e5c3881a233cca219f7ffd886bbe`	315	glomeromycetes, perhaps Rhizophagus
`05007e829ab71427b49743994a14105f`	154	glomeromycetes, perhaps Rhizophagus
`93b2d56429637947243e1b5d54a065cf`	132	Fusarium
`610caedb1a5699836310fce9dbb9c5fa`	96	Fusarium
`54aecb27334809f56b7f940b9ca060a3`	93	Fusarium
`bd30cf52b7031ddd96e3d7588c1f0e1c`	90	Fusarium
`c40cad2530d633430c3805be3740c9a4`	88	Fusarium
`d44cd471b11f15e2e42070806737e5d1`	86	Fusarium
`831acf596cca4ef840c5543d82e23d16`	82	Fusarium
`d4145ba9e3ed6c8c2138ed15b147152d`	81	Fusarium

You can find all of these sequences on the edit-graph. Most of those labelled as likely Fusarium are a 1bp edit away from the large grey node f1b689 top left (except 610caedb1a5699836310fce9dbb9c5fa which is an isolated node placed bottom middle). Those labelled glomeromycetes are in the middle near, and in one case connected to, a dark red Rhizophagus irregularis node.

i.e. None of the ITS1 sequences in amplicon library two are clear cut external contamination.

Amplicon library two - ITS2

Finally, amplicon library two using the ITS3-KYO and ITS4-KYO3 primers for ITS2. Again, the blast based classifier matched everything to an entry in the mock community database. The stricter onebp classifier assigned most reads. Here are those few it failed to match with a maximum read abundance over 75:

MD5 checksum	Max	Species
`d1bb95fff4a7e9958fa3c7f13cc51343`	211	Fusarium
`2ef33e6acd8079d729b81d24b91fcf88`	133	Fusarium
`8edbf2c168b11f910458b0e567ae5fc6`	78	Aspergillus

These three all appear on the edit-graph, each separated from a red node (database entry) by a dashed or dotted line indicating a 2bp or 3bp edit distance.

Using an online NCBI BLAST search didn’t pin any of these down to species level, but they do all seem to be fungi. Again, there are quite a few Fusarium matches, which could be alternative ITS2 sequences present in the genomes but not in the curated reference set. Likewise the Aspergillus-like sequence could be from the Aspergillus flavus in the mock community.

i.e. None of the ITS2 sequences in amplicon library two are clear cut external contamination.