Preparing Inputs¶
File Of File Names¶
The first step in using titan-nf is to make a file of file names (FOFN). The FOFN is really just a tab-delimited file that we can feed to Nextflow to process all our genomes.
The FOFN for titan-nf
should have five columns:
- sample: A unique sample name that will be used for naming output files
- theiacov_wf: Tells titan-nf which workflow to use (clearlabs, illumina_pe, illumina_se, ont)
- r1: If paired-end, the first pair of reads, else the single-end reads
- r2: If paired-end, the second pair of reads, else an empty FASTQ.
- primers: A BED formatted file of the primers used during sequencing
Here's an example of one:
sample theiacov_wf r1 r2 primers
sample01 clearlabs /home/robert_petit/test/fastqs/sample01.fastq.gz /home/robert_petit/.titan/EMPTY.fastq.gz /home/robert_petit/test/artic-v1.bed
sample02 illumina_pe /home/robert_petit/test/fastqs/sample02_R1.fastq.gz /home/robert_petit/test/fastqs/sample02_R2.fastq.gz /home/robert_petit/test/artic-v2.bed
sample03 illumina_se /home/robert_petit/test/fastqs/sample03.fastq.gz /home/robert_petit/.titan/EMPTY.fastq.gz /home/robert_petit/test/artic-v3.bed
sample04 ont /home/robert_petit/test/fastqs/sample04.fastq.gz /home/robert_petit/.titan/EMPTY.fastq.gz /home/robert_petit/test/midnight.bed
In the example above we have four samples. titan-nf
would read in this FOFN and process these samples like so:
- sample01 uses the Clear Labs workflow (clearlabs) with Artic V1 primers (artic-v1.bed)
- sample02 uses the Illumina PE workflow (illumina_pe) with Artic V2 primers (artic-v2.bed)
- sample03 uses the Illumina SE workflow (illumina_se) with Artic V3 primers (artic-v3.bed)
- sample04 uses the ONT workflow (ont) with Midnight primers (midnight.bed)
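Before handing a FOFN to titan-nf, a quick sanity check can save a failed run. The sketch below is not part of titan-nf: check_fofn is a hypothetical helper name, and it only assumes what is described above (a header row, five tab-delimited columns, and one of the four workflow names in the second column).

```shell
# check_fofn FILE
# Hypothetical helper (not shipped with titan-nf): exits non-zero if any data
# row lacks 5 tab-delimited columns or names an unknown workflow.
check_fofn() {
    awk -F'\t' '
        NR == 1 { next }    # skip the header row
        NF != 5 {
            printf "line %d: expected 5 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $2 !~ /^(clearlabs|illumina_pe|illumina_se|ont)$/ {
            printf "line %d: unknown workflow \"%s\"\n", NR, $2
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example use, assuming the FOFN was saved as samples.txt:
#   check_fofn samples.txt && echo "FOFN looks OK"
```

It does not check that the FASTQ or BED files exist; that would be a natural next thing to add.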
At this point you might be thinking, "I seriously have to create this file every time I want to run titan-nf?" And the answer is yes! But you don't need to create these FOFNs manually every time! There's a command, theiacov-gc-prepare.py, that we can use to create the FOFNs.
Generating a FOFN¶
theiacov-gc-prepare.py has been included to help you (hopefully!) generate your FOFNs quickly and easily.
Examples¶
theiacov-gc-prepare.py
by default expects nicely named FASTQs (e.g. sample01_R1.fastq.gz
, sample01_R2.fastq.gz
, sample02.fastq.gz
). But as we all know, the FASTQs we receive are rarely named so nicely.
If you're like me, and also have some oddly named FASTQs, then you too are going to have to play around with some of the theiacov-gc-prepare.py
parameters. In the next few sections I'll go through some examples, and the changes necessary to get our square block to fit in the triangle hole.
Cheat Sheet¶
If you have default names from these technologies, these parameters are likely to work for you.
Technology | Parameters to change |
---|---|
Clear Labs | --fastq_pattern *.fastq --fastq_ext .fastq |
Illumina Paired-End | --fastq_ext "_001.fastq.gz" |
ONT | --fastq_separator . |
You may need to tweak these a little.
Clear Labs¶
Alright, I'm assuming Clear Labs follows a standard naming schema, so yours should look like mine (if not, let me know!). For now, let's assume yours also look like this:
ls clearlabs-fastqs/ | head -n 5
21052354.BB6L11.2021-08-05.01.barcode26.fastq
21052630.BB6L11.2021-08-05.01.barcode25.fastq
21052685.BB6L11.2021-08-05.01.barcode27.fastq
21052694.BB6L11.2021-08-05.01.barcode31.fastq
21052712.BB6L11.2021-08-05.01.barcode30.fastq
These break down to ${SAMPLE_NAME}.${RUN_ID}.${BARCODE}.fastq
, which is not ${SAMPLE_NAME}.fastq
, so we'll need to play around with the theiacov-gc-prepare.py
parameters.
The first thing we need to fix is the extension. By default theiacov-gc-prepare.py expects the FASTQs to look like *.fastq.gz, but here we have *.fastq. This can be changed with the --fastq_pattern parameter.
theiacov-gc-prepare.py clearlabs-fastqs/ clearlabs /opt/titan/data/artic-v3.primers.bed --tsv --fastq_pattern *.fastq | head -n 5
sample theiacov_wf r1 r2 primers
21052354.BB6L11.2021-08-05.01.barcode26.fastq clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052354.BB6L11.2021-08-05.01.barcode26.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052630.BB6L11.2021-08-05.01.barcode25.fastq clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052630.BB6L11.2021-08-05.01.barcode25.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052685.BB6L11.2021-08-05.01.barcode27.fastq clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052685.BB6L11.2021-08-05.01.barcode27.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052694.BB6L11.2021-08-05.01.barcode31.fastq clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052694.BB6L11.2021-08-05.01.barcode31.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
Hey!!!! Progress! But the sample names (e.g. 21052354.BB6L11.2021-08-05.01.barcode26.fastq
) have .fastq
at the end. Let's get rid of it! To do this, we can use the --fastq_ext parameter, which defaults to .fastq.gz and is used to remove the extension from the file name. Here we want to remove .fastq, so we'll go with --fastq_ext .fastq.
theiacov-gc-prepare.py clearlabs-fastqs/ clearlabs /opt/titan/data/artic-v3.primers.bed --tsv --fastq_pattern *.fastq --fastq_ext .fastq | head -n 5
sample theiacov_wf r1 r2 primers
21052354.BB6L11.2021-08-05.01.barcode26 clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052354.BB6L11.2021-08-05.01.barcode26.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052630.BB6L11.2021-08-05.01.barcode25 clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052630.BB6L11.2021-08-05.01.barcode25.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052685.BB6L11.2021-08-05.01.barcode27 clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052685.BB6L11.2021-08-05.01.barcode27.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052694.BB6L11.2021-08-05.01.barcode31 clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052694.BB6L11.2021-08-05.01.barcode31.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
Looking better! Sample names (21052354.BB6L11.2021-08-05.01.barcode26
) no longer have the .fastq
extension. At this point, you could take this and feed it to titan-nf.
theiacov-gc-prepare.py
with Clear Labs data
If you have standard Clear Labs FASTQs as demonstrated above, then this command should work for you.
theiacov-gc-prepare.py fastqs/ clearlabs primers.bed --tsv --fastq_pattern *.fastq --fastq_ext .fastq > samples.txt
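One caveat with the command above: an unquoted *.fastq is seen by the shell first, not by theiacov-gc-prepare.py. If the directory you run the command from happens to contain files ending in .fastq, the shell swaps the pattern for those file names before the script starts. Quoting the pattern (--fastq_pattern "*.fastq") should avoid that. The sketch below only demonstrates the shell behaviour with echo in a scratch directory; it does not call theiacov-gc-prepare.py.

```shell
# Scratch directory with two FASTQ-like files, so the demo is self-contained.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.fastq" "$demo_dir/b.fastq"

# Unquoted: the shell expands the glob to the matching file names.
unquoted=$(cd "$demo_dir" && echo *.fastq)

# Quoted: the pattern is passed along exactly as typed.
quoted=$(cd "$demo_dir" && echo "*.fastq")

echo "unquoted: $unquoted"   # unquoted: a.fastq b.fastq
echo "quoted:   $quoted"     # quoted:   *.fastq

rm -r "$demo_dir"
```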
Clean Sample Names¶
The following might not work for you
I still don't like the sample names, but fixing them is outside the scope of theiacov-gc-prepare.py
. In the next section, I'll demonstrate how we can use sed
to get some proper sample names!
Again, you've been warned... Let's go back to the Clear Labs observed naming schema:
${SAMPLE_NAME}.${RUN_ID}.${BARCODE}.fastq
We removed the .fastq
so we have ${SAMPLE_NAME}.${RUN_ID}.${BARCODE}
. We can use sed
to get rid of .${RUN_ID}.${BARCODE}
part!
In our example the ${RUN_ID} is BB6L11.2021-08-05.01, and the ${BARCODE} looks like barcode00. Let's generalize this a bit...
`BB6L11.2021-08-05.01` is matched by `[A-Za-z0-9]+.[0-9]+-[0-9]+-[0-9]+.[0-9]+` (strictly, each . should be written \. to match only a literal dot)
Haha, that's hard to remember! So we can just substitute the whole run ID as-is.
barcode00 is an easy one though, it's just barcode[0-9]+ which will match barcode00, barcode01, ... barcode99.
Alright enough, you probably just want the sed
part, here it is!
sed -E 's=.BB6L11.2021-08-05.01.barcode[0-9]+\t=\t='
What does this do?
The sed
statement above looks for .BB6L11.2021-08-05.01.barcode[0-9]+\t
and if found replaces it with a \t
(tab). The tabs are there to be explicit, because the FASTQ names also have ${RUN_ID}.${BARCODE}
in them, but they end with the .fastq
, not a \t
. By default sed
would only replace the first occurrence of .BB6L11.2021-08-05.01.barcode[0-9]+
which is the sample
column, but I'd rather play it safe and use the \t
at the end.
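To see the effect of the trailing \t, here is a small sketch that runs the same substitution on a single made-up FOFN row (the /data/ path is only for illustration). There are two deliberate differences from the command above. The dots are escaped so they match only a literal dot. The tab is passed in through a shell variable because, as far as I know, \t inside a sed pattern is understood by GNU sed but not by every other sed.

```shell
# A literal tab, kept in a variable for portability.
TAB=$(printf '\t')

# One made-up row: sample name, workflow, and an r1 path that repeats the run ID.
row="21052354.BB6L11.2021-08-05.01.barcode26${TAB}clearlabs${TAB}/data/21052354.BB6L11.2021-08-05.01.barcode26.fastq"

# Only the match that is followed by a tab (the sample column) is replaced;
# the same text inside the path is followed by ".fastq", so it is left alone.
cleaned=$(printf '%s\n' "$row" | sed -E "s=\.BB6L11\.2021-08-05\.01\.barcode[0-9]+${TAB}=${TAB}=")

printf '%s\n' "$cleaned"
```

The printed row should start with 21052354 while the path keeps its full file name.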
Now, when we pipe the output of theiacov-gc-prepare.py
to this sed
statement, we get:
theiacov-gc-prepare.py clearlabs-fastqs/ clearlabs /opt/titan/data/artic-v3.primers.bed --tsv --fastq_pattern *.fastq --fastq_ext .fastq | \
sed -E 's=.BB6L11.2021-08-05.01.barcode[0-9]+\t=\t=' | \
head -n 5
sample theiacov_wf r1 r2 primers
21052354 clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052354.BB6L11.2021-08-05.01.barcode26.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052630 clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052630.BB6L11.2021-08-05.01.barcode25.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052685 clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052685.BB6L11.2021-08-05.01.barcode27.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
21052694 clearlabs /home/robert_petit/test/tmp/clearlabs-fastqs/21052694.BB6L11.2021-08-05.01.barcode31.fastq /home/robert_petit/.titan/EMPTY.fastq.gz /opt/titan/data/artic-v3.primers.bed
Check out those pretty sample names!
In the future, you would just replace the ${RUN_ID}
with the current one, for example:
# Generalized
sed -E 's=.${RUN_ID}.barcode[0-9]+\t=\t='
# RUN_ID = BB6L11.2021-11-11-01, then use
sed -E 's=.BB6L11.2021-11-11-01.barcode[0-9]+\t=\t='
# RUN_ID = TokyoOlympics2021, then use
sed -E 's=.TokyoOlympics2021.barcode[0-9]+\t=\t='
Nicely named samples for Clear Labs
If you made it this far, you are probably ok playing with pipes and sed
. Otherwise stick with the method above.
theiacov-gc-prepare.py FASTQ_DIR clearlabs PRIMER_BED --tsv --fastq_pattern *.fastq --fastq_ext .fastq | \
sed -E 's=.${RUN_ID}.barcode[0-9]+\t=\t=' > samples.txt
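Note that ${RUN_ID} in the command above is a placeholder to replace by hand: inside single quotes the shell does not expand it. If you would rather keep the run ID in a shell variable, one possible wrapper is sketched below. clean_clearlabs_names is a hypothetical helper, not part of theiacov-gc-prepare.py. It escapes the dots in the run ID and assumes the run ID contains no = character (the sed delimiter used here) and no other regex special characters.

```shell
# clean_clearlabs_names RUN_ID
# Hypothetical helper: reads a FOFN on stdin and strips ".RUN_ID.barcodeNN"
# where it is directly followed by a tab, i.e. in the sample column.
clean_clearlabs_names() {
    tab=$(printf '\t')
    # Escape dots so they match literally in the extended regex.
    run_id=$(printf '%s' "$1" | sed 's/\./\\./g')
    # Double quotes let the shell expand ${run_id} and ${tab}.
    sed -E "s=\.${run_id}\.barcode[0-9]+${tab}=${tab}="
}

# Example use with the run ID from this page:
#   theiacov-gc-prepare.py FASTQ_DIR clearlabs PRIMER_BED --tsv \
#       --fastq_pattern "*.fastq" --fastq_ext .fastq | \
#       clean_clearlabs_names "BB6L11.2021-08-05.01" > samples.txt
```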
Illumina Paired-End¶
Ok! You got yourself some Illumina paired-end sequences. Illumina has a naming convention, so hopefully yours look something like this:
ls pe-fastqs/ | head -n6
20117579-COV-210722-1_S27_L001_R1_001.fastq.gz
20117579-COV-210722-1_S27_L001_R2_001.fastq.gz
20118596-COV-210722-1_S26_L001_R1_001.fastq.gz
20118596-COV-210722-1_S26_L001_R2_001.fastq.gz
20119899-COV-210722-1_S12_L001_R1_001.fastq.gz
20119899-COV-210722-1_S12_L001_R2_001.fastq.gz
We can break these down to ${SAMPLE_NAME}-${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID}_R{1|2}_001.fastq.gz. Since they don't look like ${SAMPLE_NAME}_R1.fastq.gz, we are going to have to play with the parameters of theiacov-gc-prepare.py.
Good news though, this is an easy fix! All our FASTQs end in _001.fastq.gz
, so we can tell theiacov-gc-prepare.py
to use --fastq_ext "_001.fastq.gz"
. Let's see what happens:
theiacov-gc-prepare.py pe-fastqs/ illumina_pe /opt/titan/data/artic-v3.primers.bed --fastq_ext "_001.fastq.gz" --tsv
sample theiacov_wf r1 r2 primers
20117579-COV-210722-1_S27_L001 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20117579-COV-210722-1_S27_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20117579-COV-210722-1_S27_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
20118596-COV-210722-1_S26_L001 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20118596-COV-210722-1_S26_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20118596-COV-210722-1_S26_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
20119899-COV-210722-1_S12_L001 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20119899-COV-210722-1_S12_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20119899-COV-210722-1_S12_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
20122466-COV-210722-1_S48_L001 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20122466-COV-210722-1_S48_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20122466-COV-210722-1_S48_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
20123480-COV-210722-1_S40_L001 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20123480-COV-210722-1_S40_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20123480-COV-210722-1_S40_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
Very nice! Everything looks good, and if you wanted you could feed this to titan-nf
and get started processing.
theiacov-gc-prepare.py
with Illumina Paired-End data
If you have standard Illumina Paired-End FASTQs as demonstrated above, then this command should work for you.
theiacov-gc-prepare.py fastqs/ illumina_pe primers.bed --tsv --fastq_ext "_001.fastq.gz" > samples.txt
Clean Sample Names¶
The following might not work for you
I still don't like the sample names (20123480-COV-210722-1_S40_L001
), but fixing them is outside the scope of theiacov-gc-prepare.py
. In the next section, I'll demonstrate how we can use sed
to get some proper sample names!
Again, you've been warned... Let's go back to the observed Illumina Paired-End naming schema:
${SAMPLE_NAME}-${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID}_R{1|2}_001.fastq.gz
We removed the _R{1|2}_001.fastq.gz so we have ${SAMPLE_NAME}-${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID}. We can use sed to get rid of the -${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID} part!
In our example the ${RUN_ID}
is COV-210722-1
, the ${SAMPLE_NUMBER}
is S00
, and the ${LANE_ID}
is L000
.
For the ${RUN_ID}, it's a bit difficult to account for the infinite number of possible run names, since it's specified by the user. But for the others, we can generalize them to:
- ${SAMPLE_NUMBER} becomes S[0-9]+ which will match S00, S09, S42, etc.
- ${LANE_ID} becomes L[0-9]+ which will match L001, L004, L1, etc.
So, if we put this all together in sed
we get:
sed -E 's=-COV-210722-1_S[0-9]+_L[0-9]+\t=\t='
What does this do?
The sed
statement above looks for -COV-210722-1_S[0-9]+_L[0-9]+\t
and if found replaces it with a \t
(tab). The tabs are there to be explicit, because the FASTQ names also have -${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID}
in them, but they end with the .fastq.gz
, not a \t
. By default sed
would only replace the first occurrence of -COV-210722-1_S[0-9]+_L[0-9]+
which is the sample
column, but I'd rather play it safe and use the \t
at the end.
Now, when we pipe the output of theiacov-gc-prepare.py
to this sed
statement, we get:
theiacov-gc-prepare.py pe-fastqs/ illumina_pe /opt/titan/data/artic-v3.primers.bed --fastq_ext "_001.fastq.gz" --tsv | \
sed -E 's=-COV-210722-1_S[0-9]+_L[0-9]+\t=\t='
sample theiacov_wf r1 r2 primers
20117579 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20117579-COV-210722-1_S27_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20117579-COV-210722-1_S27_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
20118596 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20118596-COV-210722-1_S26_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20118596-COV-210722-1_S26_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
20119899 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20119899-COV-210722-1_S12_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20119899-COV-210722-1_S12_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
20122466 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20122466-COV-210722-1_S48_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20122466-COV-210722-1_S48_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
20123480 illumina_pe /home/robert_petit/test/tmp/pe-fastqs/20123480-COV-210722-1_S40_L001_R1_001.fastq.gz /home/robert_petit/test/tmp/pe-fastqs/20123480-COV-210722-1_S40_L001_R2_001.fastq.gz /opt/titan/data/artic-v3.primers.bed
Some very nice looking sample names!
In the future, you would just replace the ${RUN_ID}
with the current one, for example:
# Generalized
sed -E 's=-${RUN_ID}_S[0-9]+_L[0-9]+\t=\t='
# RUN_ID = COV-002200-4, then use
sed -E 's=-COV-002200-4_S[0-9]+_L[0-9]+\t=\t='
# RUN_ID = TokyoOlympics2021, then use
sed -E 's=-TokyoOlympics2021_S[0-9]+_L[0-9]+\t=\t='
Nicely named samples for Illumina Paired-End
If you made it this far, you are probably ok playing with pipes and sed
. Otherwise stick with the method above.
theiacov-gc-prepare.py pe-fastqs/ illumina_pe /opt/titan/data/artic-v3.primers.bed --fastq_ext "_001.fastq.gz" --tsv | \
sed -E 's=-${RUN_ID}_S[0-9]+_L[0-9]+\t=\t=' > samples.txt
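An alternative sketch, for anyone who would rather not rely on the trailing tab: awk can restrict the edit to the first column. clean_illumina_names is a hypothetical helper (not part of theiacov-gc-prepare.py). It looks for -RUN_ID_S<digits>_L<digits> at the end of the sample column and leaves the header and the FASTQ paths alone. The run ID is compared as plain text, so characters such as . need no escaping.

```shell
# clean_illumina_names RUN_ID
# Hypothetical helper: reads a FOFN on stdin, trims the sample column only.
clean_illumina_names() {
    awk -F'\t' -v OFS='\t' -v run_id="$1" '
        NR > 1 {
            prefix = "-" run_id "_S"
            start = index($1, prefix)
            if (start > 0) {
                rest = substr($1, start + length(prefix))
                # Only strip when what follows is <digits>_L<digits>.
                if (rest ~ /^[0-9]+_L[0-9]+$/) {
                    $1 = substr($1, 1, start - 1)
                }
            }
        }
        { print }
    '
}

# Example use with the run ID from this page:
#   theiacov-gc-prepare.py pe-fastqs/ illumina_pe /opt/titan/data/artic-v3.primers.bed \
#       --fastq_ext "_001.fastq.gz" --tsv | clean_illumina_names "COV-210722-1" > samples.txt
```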
Oxford Nanopore¶
Alright, ONT naming. This is a fun one, because when we merge the multiple FASTQs associated with a barcode we probably name them differently than you do. But for example purposes, let's assume your ONT FASTQs look like this (they will not though!):
ls ont-fastqs/ | head -n5
A01_20119534.fastq.gz
A02_20119750.fastq.gz
A03_20121547.fastq.gz
A04_20140893.fastq.gz
A05_20137749.fastq.gz
Here, there is no real naming convention, so we can't really break it down. But we do need to play with the parameters.
theiacov-gc-prepare.py ont-fastqs/ ont /opt/titan/data/artic-v3.primers.bed --tsv
ERROR: "A01" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
ERROR: "A02" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
ERROR: "A03" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
ERROR: "A04" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
ERROR: "A05" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
This is an easy fix though. Since we have single-end reads, theiacov-gc-prepare.py is expecting the FASTQs to look like ${SAMPLE_NAME}.fastq.gz. By default theiacov-gc-prepare.py splits read sets by the _ character, so it thinks we have paired-end reads due to the _ in our FASTQ names.
We can use --fastq_separator to tell theiacov-gc-prepare.py to split on the . character.
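To make the difference concrete, the sketch below imitates the splitting with plain shell parameter expansion. It is only an illustration of the idea, not how theiacov-gc-prepare.py is implemented.

```shell
fastq="A01_20119534.fastq.gz"

# Cutting at the first "_" leaves only the prefix, which is the
# sample name shown in the error messages above.
by_underscore=${fastq%%_*}

# Cutting at the first "." keeps the whole name before the extension.
by_dot=${fastq%%.*}

echo "split on _ : $by_underscore"   # split on _ : A01
echo "split on . : $by_dot"          # split on . : A01_20119534
```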