Universal plant DNA barcodes and mini-barcodes
As with the animal primers, for rbcL the paper used two targets: a long barcode
and a shorter mini-barcode. The same names are used in the run.sh script
provided, whose output is referred to below.
matK
The paper described two sets of primers for matK, although only one was used
for the MiSeq sequencing. This gave no sequences at the default abundance
threshold. Dropping the threshold to 50 showed three unique sequences in three
files, and even dropping to 10 only gave results from EM_2, EM_14 and S8.
NCBI BLAST of these sequences gave no perfect matches, but suggested
Sanguisorba sp. was present, as noted in the original paper for S8, which
is one of the traditional medicine samples.
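The effect of lowering the abundance threshold can be sketched in Python as follows. This is only an illustration, not THAPBI PICT's own implementation: the function name, the assumption that a sequence is kept when its count is at least the threshold, and all the sample names and read counts are invented for the example.

```python
def apply_abundance_threshold(counts, threshold):
    """Return only the unique sequences seen at least `threshold` times.

    `counts` maps sample name -> {sequence name -> read count}. Samples
    left with no sequences are dropped from the result.
    """
    kept = {}
    for sample, seq_counts in counts.items():
        passing = {seq: n for seq, n in seq_counts.items() if n >= threshold}
        if passing:
            kept[sample] = passing
    return kept


# Hypothetical samples and read counts, invented purely for illustration.
toy_counts = {
    "sample_A": {"seq_1": 60, "seq_2": 12},
    "sample_B": {"seq_3": 8},
}

for threshold in (50, 10, 5):
    kept = apply_abundance_threshold(toy_counts, threshold)
    print(threshold, sorted(kept))
```

Lowering the threshold can only add sequences, so a sample that stays empty even at a very low threshold most likely has little or no signal for that marker.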
rbcL - long target
Using our default abundance threshold and the authors’ minimum length of 140bp,
we got no sequences at all. Allowing a minimum length of 100 (our default)
gave the following sequence and a one-SNP variant, both from S3:
>3ec67342f519461a0ad40fef436b1b1d
GACTGCGGGGTTCAAAGCTGGTGTTAAAGATTATAGATTGACGTATTATACTCCTGAATTGGGGTTATCCGCTAAGAATT
ACGGTAGAGCAGTTTATGAATGTCTT
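To see why this sequence is accepted with a minimum length of 100 but rejected at 140, we can join the wrapped FASTA lines and measure the result. The helper below is a minimal sketch written for this page (the name read_fasta is our own, not part of THAPBI PICT); the record is copied from the output above.

```python
def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA text, joining wrapped lines."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line: emit the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


record = """\
>3ec67342f519461a0ad40fef436b1b1d
GACTGCGGGGTTCAAAGCTGGTGTTAAAGATTATAGATTGACGTATTATACTCCTGAATTGGGGTTATCCGCTAAGAATT
ACGGTAGAGCAGTTTATGAATGTCTT
"""

for name, seq in read_fasta(record):
    print(name, len(seq), "passes 100:", len(seq) >= 100, "passes 140:", len(seq) >= 140)
```

The reported length should fall between the two thresholds, which is consistent with the sequence only appearing once the minimum length is relaxed.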
The best NCBI BLAST matches are Astragalus, but with a break point. The
authors of the original paper report finding Astragalus danicus in S3.
Mini-rbcL - short target
This was by far the most diverse in terms of unique sequences recovered:
$ grep -c -v "^#" summary/Mini-rbcL.tally.tsv
278
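The same count can be obtained from Python, which is handy when comparing several markers in one go. This is a small sketch equivalent to the grep command above; the function name is our own, and it assumes (as the grep pattern does) that header and comment lines in the tally file start with "#".

```python
def count_non_comment_lines(path):
    """Count lines not starting with '#', like: grep -c -v "^#" path"""
    with open(path) as handle:
        return sum(1 for line in handle if not line.startswith("#"))
```

For example, count_non_comment_lines("summary/Mini-rbcL.tally.tsv") should agree with the grep count shown above.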
We see expected plant species like Lactuca sativa, Brassica oleracea, Aloe variegata and Dendrobium sp. - exactly how they are classified depends critically on how the database is built.
The traditional medicine samples have multiple unknown sequences likely of plant origin.
The edit-graph is the most complicated of those in this dataset - not
simply in terms of the number of nodes. This marker needs more careful
review before using THAPBI PICT’s default onebp classifier.
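As its name suggests, the onebp classifier allows matches up to one base pair away from a reference sequence. The following is a minimal sketch of such a test, assuming that "one base pair away" means a single substitution, insertion or deletion. It is not the tool's own code, and the function name is hypothetical.

```python
def within_one_edit(a, b):
    """Return True if a and b are identical or differ by a single
    substitution, insertion or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure a is the shorter (or equal length) sequence
    # Find the first position where the two sequences disagree.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the mismatch must agree.
        return a[i + 1:] == b[i + 1:]
    # Insertion in b: skipping one base of b must restore agreement.
    return a[i:] == b[i + 1:]
```

With a diverse marker, many distinct sequences can sit within one edit of each other, so a one base pair rule can link references from different species. This is why a tangled edit-graph calls for care.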
trnL-UAA
Not very diverse, only eight unique sequences recovered:
$ grep -c -v "^#" summary/trnL-UAA.tally.tsv
8
We see lots of Brassica. The difficulties with Brassica oleracea vs Brassica napus (and the genus in general) are discussed in the paper too.
trnL-P6-loop
Initially we saw no sequences with this marker, even with the abundance threshold disabled. This was strange, but easily explained by the following quote from the paper:
We implemented a minimum DNA barcode length of 200 nt, except for DNA barcodes with a basic length shorter than 200 nt, in which case the minimum expected DNA barcode length is set to 100 nt for ITS2, 140 nt for mini-rbcL, and 10 nt for the trnL (P6 loop) marker.
Therefore in run.sh we have changed the THAPBI PICT minimum length from
100 (our default) to 10 for this marker - and now get lots, over a hundred
unique sequences:
$ grep -c -v "^#" summary/trnL-P6-loop.tally.tsv
134
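The per-marker minimum lengths quoted from the paper can be written down as a small lookup table. This sketch is illustrative only: the function names are our own, and mapping the paper's "mini-rbcL" and "trnL (P6 loop)" onto the marker names Mini-rbcL and trnL-P6-loop used on this page is an assumption.

```python
# Minimum barcode lengths (in nt) as quoted from the paper.
PAPER_DEFAULT_MIN_LENGTH = 200
PAPER_MIN_LENGTH = {
    "ITS2": 100,
    "Mini-rbcL": 140,
    "trnL-P6-loop": 10,
}


def min_length_for(marker):
    """Return the paper's minimum length for a marker (200 if not listed)."""
    return PAPER_MIN_LENGTH.get(marker, PAPER_DEFAULT_MIN_LENGTH)


def filter_by_length(sequences, min_length):
    """Keep only the sequences at least `min_length` long."""
    return [seq for seq in sequences if len(seq) >= min_length]


# Made-up sequences of 50 and 120 bases, for illustration only.
toy = ["A" * 50, "A" * 120]
print(len(filter_by_length(toy, 100)))  # with a minimum length of 100
print(len(filter_by_length(toy, min_length_for("trnL-P6-loop"))))
```

A short marker like this one is lost almost entirely at a minimum length of 100, which matches what we saw before changing the setting.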
We find this dominated by Brassica oleracea in most samples. However, at our default abundance threshold we do not find Cycas revoluta, which is consistent with the original analysis reporting this at very low abundance.
Our reference set here has Aloe reynoldsii sequences, but none for the expected entry Aloe variegata.
An obvious false positive here is Cullen sp., which like the authors we found
in the S3 traditional medicine sample, but also unexpectedly in all the S1
samples.
ITS2
Quite diverse, with over fifty unique sequences recovered:
$ grep -c -v "^#" summary/ITS2.tally.tsv
59
This finds all the Brassica and Echinocactus sp., and most of the Euphorbia sp.
We do see unexpected matches to Lactuca sp. where Lactuca sativa was in
the experimental mixture. The dominant sequence present is just one base pair
away from a published sequence from that species (KM210323.1), but perfectly
matches published sequences from Lactuca altaica, L. serriola and
L. virosa - and that is what was in the sample database. If you open the
associated edit-graph file (ITS2.edit-graph.onebp.xgmml) in Cytoscape,
you can see this quite clearly.