# Overview
`titan-nf` (titan-nextflow) is a Nextflow wrapper around the Titan Genomic Characterization (Titan-GC) workflow. If you're asking yourself *Why the workflow inception?*, I encourage you to read the next section!
## Motivation

Titan-GC is a workflow developed by Kevin Libuit for the analysis of viral pathogens, specifically SARS-CoV-2. Titan-GC is written in the Workflow Description Language (WDL), with the Terra cloud platform as its target environment (and it does this quite well!).
A command-line interface (CLI) version of Titan-GC is available, which allows the user to run all Titan-GC workflows (clearlabs, illumina_pe, illumina_se, and ont) through a single command. This is done using Cromwell, via `cromwell run`, which does not support call-caching (e.g. jobs cannot be resumed; please let me know if I'm wrong!). There is `cromwell server`, which supports call-caching, but I think it's too much overhead for processing a single genome. I also explored miniwdl, which allows resuming jobs, but it currently does not support Singularity images.
Because of this, I decided to make a Nextflow wrapper, `titan-nf`, around the Titan-GC CLI. This wrapper runs one genome per process through Titan-GC. In other words, if one sample fails, the whole run doesn't fail; Nextflow will just retry that sample and keep everything moving.
!!! warning "titan-nf might not be for you"
    `titan-nf` is highly configured for our setup, so there's a good chance it will not work for you! If you would still like to use `titan-nf`, don't hesitate to reach out; I bet with a few tweaks we can make it work.
## titan-nf Overview

Now that you've made it this far, it's time to learn what `titan-nf` is doing. There are four steps to `titan-nf`:
- Update Pangolin Container
- Run `titan-gc-cli`
- Merge Results and Make a Backup
- Clean Up the `work` Directory
!!! info "TLDR for next sections"
    `titan-nf` automatically updates Pangolin, runs `titan-gc-cli` per sample, merges the results, creates a tarball for backups, and finally cleans up the `work` directory.

    That's pretty much it, no need to read the rest now! Head on over to Installation! 🎉
### Update Pangolin

I think we all know by now that SARS-CoV-2 is constantly evolving, which means Pangolin is constantly being updated. I help maintain the Bioconda recipe for Pangolin, and for each Pangolin release I make sure to update the versions for pangoLEARN, pango-designation, scorpio, and constellations. Once the latest version of Pangolin is merged into Bioconda, a container is automatically created on Biocontainers.

So, every time you run `titan-nf`, a quick check happens to see if you already have the latest Pangolin container.
!!! tip "Disable this by providing a Docker image"
    If you want to disable this feature, you can provide your own Docker image tag with `--pangolin_docker` (e.g. `--pangolin_docker quay.io/biocontainers/pangolin:3.1.8--pyhdfd78af_0`).
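To give a feel for how this fits together, here is a minimal sketch of a parameter short-circuiting the container choice. Only `--pangolin_docker` and the example tag come from titan-nf; the process name and config layout are assumptions:

```nextflow
// nextflow.config -- sketch only, not titan-nf's actual config.
// If --pangolin_docker is given on the command line it is used as-is;
// otherwise fall back to a pinned default (the real pipeline checks
// Biocontainers for the latest tag instead of pinning one).
params.pangolin_docker = null

process {
    withName: 'pangolin' {  // hypothetical process name
        container = params.pangolin_docker ?: 'quay.io/biocontainers/pangolin:3.1.8--pyhdfd78af_0'
    }
}
```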
### Run `titan-gc-cli`
I don't know why, and I don't plan to figure it out, but for some reason (resources maybe?) some steps in `titan-gc-cli` fail. If you're running `titan-gc-cli` on multiple samples, this means the whole run would crash, and because it uses `cromwell run`, the analysis would have to start all the way from the beginning. Unfortunately, the failures are not always reproducible, and the solution is just to retry running it.
This is the main reason behind using Nextflow. At this stage we are telling Nextflow to run `titan-gc-cli` on only one sample at a time. Keep in mind, though, Nextflow is still queuing up multiple samples. If one sample fails, it doesn't kill all the other runs; Nextflow will simply retry it. In the event a sample fails more times than Nextflow allows, then yes, Nextflow would crash and burn! BUT since we are using Nextflow, we can simply use the `-resume` option to pick up where things left off. A sketch of this pattern is shown below.
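Here is a minimal sketch of that one-sample-per-task pattern. It is not the actual titan-nf process; the names are placeholders and the exact `titan-gc-cli` invocation is elided:

```nextflow
// Sketch only: run Titan-GC on a single sample per task, and retry
// a failed sample instead of killing the whole run.
process TITAN_GC {
    tag "${sample}"          // label each task by sample name
    errorStrategy 'retry'    // re-submit a failed sample...
    maxRetries 3             // ...a few times before giving up

    input:
    tuple val(sample), path(reads)

    output:
    path "${sample}"

    script:
    """
    mkdir -p ${sample}
    # exact titan-gc-cli arguments elided; see its documentation.
    # It writes this sample's results into ${sample}/
    titan-gc ...
    """
}
```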
### Merge Results and Make a Backup

Since we used Nextflow to run all the samples one at a time, we have to take the results for all the samples and merge them. By doing so, we have replicated how the outputs would have looked had you run `titan-gc-cli` with multiple samples and it completed successfully.
Additionally, a tarball is created of the finished run and moved to a folder that is automatically synced to a Google Storage Bucket. In other words, we are creating a backup of the FASTQs and FASTAs! This also takes a burden off everyone, because now we'll know where the results for each run are located.
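Continuing the sketch from above, the merge step boils down to collecting every per-sample output into a single task (the process name, `params.backup_dir`, and the tarball name are made up for illustration):

```nextflow
// Sketch only: gather all per-sample results, merge them, and
// create a tarball in a folder that gets synced to cloud storage.
process MERGE_AND_BACKUP {
    publishDir params.backup_dir, mode: 'copy'  // hypothetical synced folder

    input:
    path results   // every per-sample output, staged together

    output:
    path "titan-run.tar.gz"

    script:
    """
    # per-sample summary tables would be concatenated here (elided)
    tar -czf titan-run.tar.gz ${results}
    """
}
```

In the workflow body this would be wired up as `MERGE_AND_BACKUP(TITAN_GC.out.collect())`, where `collect()` waits for every sample to finish and then emits all the outputs as a single list.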
### Clean Up the `work` Directory
So we started using Nextflow, and one of the features of Nextflow is that it runs each process in an isolated location within the `work` directory. Oftentimes this `work` directory can get very large. In our case, it gets even bigger because `cromwell` requires files to be copied between directories. Nextflow's default is to create a symbolic link to save space, but here we are forced to copy the data. This easily doubles the size of the `work` directory.
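In Nextflow terms, forcing a copy instead of a symlink is done with the `stageInMode` directive; whether titan-nf sets it globally like this or per-process is an assumption:

```nextflow
// nextflow.config -- stage inputs as real copies rather than
// symlinks, since cromwell needs files it can copy between
// directories.
process.stageInMode = 'copy'
```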
Because of this, `titan-nf` is set up to delete the `work` directory when it successfully finishes. By doing so we lose the ability to resume, but if it finished fine you probably weren't going to resume anyway!
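Nextflow has a built-in knob for exactly this behavior; a one-line sketch, assuming titan-nf relies on this option rather than its own cleanup script:

```nextflow
// nextflow.config -- delete the work directory automatically when
// (and only when) the run completes successfully. This is also why
// -resume no longer works after a clean finish.
cleanup = true
```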
If you made it this far, I commend you! That was probably unnecessarily verbose, but there was a TLDR! Anyways, head on over to Installation to get started!