Pipeline with metadata
Running thapbi-pict pipeline
Having run all the steps of the typical pipeline individually, we now return
to the top level thapbi_pict pipeline command:
$ thapbi_pict pipeline -h
...
Assuming you have the FASTQ files in raw_data/, we can run the pipeline
command as follows, and should get multiple output report files:
$ thapbi_pict pipeline -i raw_data/ -s intermediate/ \
-o summary/thapbi-pict
...
$ ls -1 summary/thapbi-pict.*
summary/thapbi-pict.ITS1.onebp.tsv
summary/thapbi-pict.ITS1.reads.onebp.tsv
summary/thapbi-pict.ITS1.reads.onebp.xlsx
summary/thapbi-pict.ITS1.samples.onebp.tsv
summary/thapbi-pict.ITS1.samples.onebp.xlsx
summary/thapbi-pict.ITS1.tally.tsv
summary/thapbi-pict.edit-graph.onebp.pdf
summary/thapbi-pict.edit-graph.onebp.xgmml
As described for the prepare-reads step, we should also specify which of the samples are negative controls, which may be used to increase the plate-level minimum abundance threshold:
$ thapbi_pict pipeline -i raw_data/ -s intermediate/ \
-o summary/thapbi-pict -n raw_data/NEGATIVE*.fastq.gz
...
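The wildcard given to -n is expanded by the shell before thapbi_pict sees it, so it can be worth previewing what it matches. The following is a minimal sketch, not part of the tool; the function name list_negative_controls is hypothetical, and the NEGATIVE*.fastq.gz naming is simply the pattern used above:

```shell
# Hypothetical helper: list the files a NEGATIVE*.fastq.gz wildcard matches
# in the given directory, or report that nothing matched.
list_negative_controls() {
    for f in "$1"/NEGATIVE*.fastq.gz; do
        if [ -e "$f" ]; then
            echo "$f"
        else
            # In sh and bash (default settings) an unmatched pattern is
            # left unexpanded, so the loop sees the literal pattern.
            echo "no match: $f"
        fi
    done
}

list_negative_controls raw_data
```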
And, as described for the summary reports, we can provide metadata:
$ thapbi_pict pipeline -i raw_data/ -s intermediate/ \
-o summary/with-metadata -n raw_data/NEGATIVE*.fastq.gz \
-t metadata.tsv -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -x 16
...
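Typing out a long comma-separated column list for -c is error prone. As a small convenience (an assumption about your shell environment, not a feature of thapbi_pict), the list can be generated with the standard seq and paste utilities and stored in a variable, here given the hypothetical name cols:

```shell
# Build the comma-separated list 1,2,...,15 for the -c argument.
# Piping through paste avoids the trailing separator that some
# seq implementations add when using their own -s option.
cols=$(seq 1 15 | paste -sd, -)
echo "$cols"
# prints 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
```

The variable can then be passed as -c "$cols" in place of the literal list.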
Finally, as we will review next, we can ask the pipeline to assess the results against any expected sample species classifications:
$ thapbi_pict pipeline -i raw_data/ expected/ -s intermediate/ \
-o summary/with-metadata -n raw_data/NEGATIVE*.fastq.gz \
-t metadata.tsv -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -x 16
...
$ ls -1 summary/with-metadata.*
summary/with-metadata.ITS1.onebp.tsv
summary/with-metadata.ITS1.assess.confusion.onebp.tsv
summary/with-metadata.ITS1.assess.onebp.tsv
summary/with-metadata.ITS1.assess.tally.onebp.tsv
summary/with-metadata.ITS1.reads.onebp.tsv
summary/with-metadata.ITS1.reads.onebp.xlsx
summary/with-metadata.ITS1.samples.onebp.tsv
summary/with-metadata.ITS1.samples.onebp.xlsx
summary/with-metadata.ITS1.tally.tsv
Here we also used -o (or --output) to specify a different stem for the
report filenames.
Conclusions
For the THAPBI Phyto-Threats project our datasets span multiple plates, but we
want to set plate-specific minimum abundance thresholds. That is taken care of
as long as each plate is in its own directory. For example, you might have
raw_data/plate_NNN/*.fastq.gz and run the pipeline with -i raw_data/.
However, while you could run the pipeline command on all the data in one go,
with access to a computer cluster it will likely be faster to run at least the
(slowest) prepare-reads stage on separate cluster nodes (e.g. one cluster
job for each plate).
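One way to split the work by plate is sketched below. It assumes a hypothetical layout of raw_data/plate_*/ and that prepare-reads accepts -i and -o as in the earlier prepare-reads step (check thapbi_pict prepare-reads -h for your version). The function name plate_commands is made up for illustration. It only prints one command per plate rather than running anything, so each line can be placed in a job script for whatever scheduler your cluster uses:

```shell
# Hypothetical helper: print one prepare-reads command per plate directory
# found under the given raw data directory (dry run, nothing is executed).
plate_commands() {
    raw_dir="$1"
    for plate in "$raw_dir"/plate_*/; do
        # Skip the unexpanded pattern when no plate directories exist.
        [ -d "$plate" ] || continue
        # Add -n with that plate's negative controls if applicable.
        echo "thapbi_pict prepare-reads -i ${plate} -o intermediate/"
    done
}

plate_commands raw_data
```

Printing the commands first makes it easy to check the plate list before submitting any jobs.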