Pipeline with metadata

Running thapbi_pict pipeline

Having run all the steps of the typical pipeline individually, we now return to the top-level thapbi_pict pipeline command:

$ thapbi_pict pipeline -h
...

Assuming the FASTQ files are in raw_data/, we can run the pipeline command as follows, which should produce multiple report files:

$ thapbi_pict pipeline -i raw_data/ -s intermediate/ \
  -o summary/thapbi-pict
...
$ ls -1 summary/thapbi-pict.*
summary/thapbi-pict.ITS1.onebp.tsv
summary/thapbi-pict.ITS1.reads.onebp.tsv
summary/thapbi-pict.ITS1.reads.onebp.xlsx
summary/thapbi-pict.ITS1.samples.onebp.tsv
summary/thapbi-pict.ITS1.samples.onebp.xlsx
summary/thapbi-pict.ITS1.tally.tsv
summary/thapbi-pict.edit-graph.onebp.pdf
summary/thapbi-pict.edit-graph.onebp.xgmml

As described for the prepare-reads step, we should also specify which of the samples are negative controls, which may be used to increase the plate-level minimum abundance threshold:

$ thapbi_pict pipeline -i raw_data/ -s intermediate/ \
  -o summary/thapbi-pict -n raw_data/NEGATIVE*.fastq.gz
...

And, as described for the summary reports, we can provide metadata:

$ thapbi_pict pipeline -i raw_data/ -s intermediate/ \
  -o summary/with-metadata -n raw_data/NEGATIVE*.fastq.gz \
  -t metadata.tsv -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -x 16
...
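Typing out fifteen column numbers by hand is easy to get wrong. As a small convenience, and assuming the standard seq and paste utilities are available, the same comma-separated list could be built in the shell. The variable name cols is purely illustrative:

```shell
# Build the comma-separated column list 1..15 rather than typing it out.
# "cols" is a hypothetical variable name, not something thapbi_pict defines.
cols=$(seq 1 15 | paste -sd, -)
echo "$cols"
```

The result could then be passed as -c "$cols" in place of the literal list in the command above.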

Finally, as we will review next, we can ask the pipeline to assess the results against any expected sample species classifications:

$ thapbi_pict pipeline -i raw_data/ expected/ -s intermediate/ \
  -o summary/with-metadata -n raw_data/NEGATIVE*.fastq.gz \
  -t metadata.tsv -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -x 16
...
$ ls -1 summary/with-metadata.*
summary/with-metadata.ITS1.assess.confusion.onebp.tsv
summary/with-metadata.ITS1.assess.onebp.tsv
summary/with-metadata.ITS1.assess.tally.onebp.tsv
summary/with-metadata.ITS1.onebp.tsv
summary/with-metadata.ITS1.reads.onebp.tsv
summary/with-metadata.ITS1.reads.onebp.xlsx
summary/with-metadata.ITS1.samples.onebp.tsv
summary/with-metadata.ITS1.samples.onebp.xlsx
summary/with-metadata.ITS1.tally.tsv

Here we also used -o (or --output) to specify a different stem for the report filenames.

Conclusions

For the THAPBI Phyto-Threats project our datasets span multiple plates, but we want to set plate-specific minimum abundance thresholds. That is taken care of as long as each plate is in its own directory. For example, you might have raw_data/plate_NNN/*.fastq.gz and run the pipeline with -i raw_data/.

Although you could run the pipeline command on all the data in one go, with access to a computer cluster it will likely be faster to run at least the slowest stage, prepare-reads, on separate cluster nodes (e.g. one cluster job per plate).
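As a rough sketch of the one-job-per-plate idea, the following prints one prepare-reads command for each plate directory without running anything. It assumes the raw_data/plate_NNN/ layout described above, a shared intermediate/ output directory, and that prepare-reads accepts -i and -o as in the earlier prepare-reads step. The function name plate_commands is hypothetical, and how the printed commands are submitted depends on your cluster scheduler:

```shell
#!/bin/sh
# Dry run: print one prepare-reads command per plate directory.
# plate_commands is a hypothetical helper, not part of thapbi_pict.
# Assumed layout: <raw_dir>/<plate>/*.fastq.gz
plate_commands() {
    raw_dir=${1%/}   # strip any trailing slash
    out_dir=${2%/}
    for plate in "$raw_dir"/*/; do
        # If the glob matched nothing it stays literal, so skip it.
        [ -d "$plate" ] || continue
        plate=${plate%/}
        echo "thapbi_pict prepare-reads -i $plate/ -o $out_dir/"
    done
}

plate_commands raw_data intermediate
```

Each printed line could be wrapped in a job submission for your scheduler, with the negative controls for that plate added via -n as in the commands above. Once all the jobs have finished, the remaining stages can be run on the combined intermediate files.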