Preparing Inputs

File Of File Names

So, step 1: in order to use titan-nf we need to make a file of file names (FOFN). The FOFN is really just a tab-delimited file that we can feed to Nextflow to process all our genomes.

The FOFN for titan-nf should have five columns:

  1. sample: A unique sample name that will be used for naming output files
  2. theiacov_wf: Informs titan-nf which workflow to use (clearlabs, illumina_pe, illumina_se, ont)
  3. r1: If paired-end, the first pair of reads, else the single-end reads
  4. r2: If paired-end, the second pair of reads, else an empty FASTQ.
  5. primers: A BED formatted file of the primers used during sequencing

Here's an example of one:

sample  theiacov_wf r1  r2  primers
sample01    clearlabs   /home/robert_petit/test/fastqs/sample01.fastq.gz    /home/robert_petit/.titan/EMPTY.fastq.gz    /home/robert_petit/test/artic-v1.bed
sample02    illumina_pe /home/robert_petit/test/fastqs/sample02_R1.fastq.gz /home/robert_petit/test/fastqs/sample02_R2.fastq.gz /home/robert_petit/test/artic-v2.bed
sample03    illumina_se /home/robert_petit/test/fastqs/sample03.fastq.gz    /home/robert_petit/.titan/EMPTY.fastq.gz    /home/robert_petit/test/artic-v3.bed
sample04    ont /home/robert_petit/test/fastqs/sample04.fastq.gz    /home/robert_petit/.titan/EMPTY.fastq.gz    /home/robert_petit/test/midnight.bed

In the example above we have four samples. titan-nf would read in this FOFN and process these samples like so:

  1. sample01 uses the ClearLabs workflow (clearlabs) with Artic V1 primers (artic-v1.bed)
  2. sample02 uses the Illumina PE workflow (illumina_pe) with Artic V2 primers (artic-v2.bed)
  3. sample03 uses the Illumina SE workflow (illumina_se) with Artic V3 primers (artic-v3.bed)
  4. sample04 uses the ONT workflow (ont) with Midnight primers (midnight.bed)
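
Before handing a FOFN to titan-nf, a quick sanity check can catch a missing column or a mistyped workflow name. The sketch below is not part of titan-nf: check_fofn is a hypothetical helper, and it only checks the five-column layout and the four workflow names listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of titan-nf): checks that every data row of a
# FOFN has five tab-separated columns and a known theiacov_wf value.
check_fofn() {
    awk -F'\t' '
        NR == 1 { next }    # skip the header row
        NF != 5 {
            printf "line %d: expected 5 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $2 !~ /^(clearlabs|illumina_pe|illumina_se|ont)$/ {
            printf "line %d: unknown workflow \"%s\"\n", NR, $2
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Small demo with a throwaway two-line FOFN (paths are made up).
demo=$(mktemp)
printf 'sample\ttheiacov_wf\tr1\tr2\tprimers\n' > "$demo"
printf 'sample04\tont\t/tmp/s4.fastq.gz\t/tmp/EMPTY.fastq.gz\t/tmp/midnight.bed\n' >> "$demo"
check_fofn "$demo" && echo "FOFN looks OK"
rm -f "$demo"
```

On your own file this would be something like check_fofn samples.txt. It does not check that the FASTQ or BED files actually exist.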

At this point you might be thinking, "I seriously have to create this file every time I want to run titan-nf?", and the answer is yes! But you don't need to create these FOFNs manually every time! There's a command, theiacov-gc-prepare.py, that we can use to create the FOFNs.

Generating a FOFN

theiacov-gc-prepare.py has been included to help you (hopefully!) generate your FOFNs quickly and easily.

Examples

theiacov-gc-prepare.py by default expects nicely named FASTQs (e.g. sample01_R1.fastq.gz, sample01_R2.fastq.gz, sample02.fastq.gz). But as we all know, the FASTQs we receive are rarely named so nicely.

If you're like me, and also have some oddly named FASTQs, then you too are going to have to play around with some of the theiacov-gc-prepare.py parameters. In the next few sections I'll go through some examples, and the changes necessary to get our square block to fit in the triangle hole.

Cheat Sheet

If you have default names from these technologies, these parameters are likely to work for you.

Technology            Parameters to change
Clear Labs            --fastq_pattern "*.fastq" --fastq_ext .fastq
Illumina Paired-End   --fastq_ext "_001.fastq.gz"
ONT                   --fastq_separator .

You may need to tweak these a little

Clear Labs

Alright, I'm assuming Clear Labs follows a standard naming schema, so yours should look like mine (if not, let me know!). For now, let's assume yours also look like this:

ls clearlabs-fastqs/ | head -n 5
21052354.BB6L11.2021-08-05.01.barcode26.fastq
21052630.BB6L11.2021-08-05.01.barcode25.fastq
21052685.BB6L11.2021-08-05.01.barcode27.fastq
21052694.BB6L11.2021-08-05.01.barcode31.fastq
21052712.BB6L11.2021-08-05.01.barcode30.fastq

These break down to ${SAMPLE_NAME}.${RUN_ID}.${BARCODE}.fastq, which is not ${SAMPLE_NAME}.fastq, so we'll need to play around with the theiacov-gc-prepare.py parameters.

The first thing we need to fix is the extension. By default theiacov-gc-prepare.py expects the FASTQs to look like *.fastq.gz, but here we have *.fastq. This can be changed with the --fastq_pattern parameter.

theiacov-gc-prepare.py clearlabs-fastqs/ clearlabs /opt/titan/data/artic-v3.primers.bed --tsv --fastq_pattern "*.fastq" | head -n 5
sample  theiacov_wf        r1      r2      primers
21052354.BB6L11.2021-08-05.01.barcode26.fastq   clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052354.BB6L11.2021-08-05.01.barcode26.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz      /opt/titan/data/artic-v3.primers.bed
21052630.BB6L11.2021-08-05.01.barcode25.fastq   clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052630.BB6L11.2021-08-05.01.barcode25.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz      /opt/titan/data/artic-v3.primers.bed
21052685.BB6L11.2021-08-05.01.barcode27.fastq   clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052685.BB6L11.2021-08-05.01.barcode27.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz      /opt/titan/data/artic-v3.primers.bed
21052694.BB6L11.2021-08-05.01.barcode31.fastq   clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052694.BB6L11.2021-08-05.01.barcode31.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz      /opt/titan/data/artic-v3.primers.bed

Hey!!!! Progress! But the sample names (e.g. 21052354.BB6L11.2021-08-05.01.barcode26.fastq) have .fastq at the end. Let's get rid of it! To do this, we can use --fastq_ext, which also defaults to .fastq.gz and is used to remove the extension from the file name. Here we want to remove .fastq, so we'll go with --fastq_ext .fastq.

theiacov-gc-prepare.py clearlabs-fastqs/ clearlabs /opt/titan/data/artic-v3.primers.bed --tsv --fastq_pattern "*.fastq" --fastq_ext .fastq | head -n 5
sample  theiacov_wf        r1      r2      primers
21052354.BB6L11.2021-08-05.01.barcode26 clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052354.BB6L11.2021-08-05.01.barcode26.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz      /opt/titan/data/artic-v3.primers.bed
21052630.BB6L11.2021-08-05.01.barcode25 clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052630.BB6L11.2021-08-05.01.barcode25.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz      /opt/titan/data/artic-v3.primers.bed
21052685.BB6L11.2021-08-05.01.barcode27 clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052685.BB6L11.2021-08-05.01.barcode27.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz      /opt/titan/data/artic-v3.primers.bed
21052694.BB6L11.2021-08-05.01.barcode31 clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052694.BB6L11.2021-08-05.01.barcode31.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz      /opt/titan/data/artic-v3.primers.bed

Looking better! Sample names (21052354.BB6L11.2021-08-05.01.barcode26) no longer have the .fastq extension. At this point, you could take this and feed it to titan-nf.
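
To see why --fastq_ext .fastq yields the sample names above, it can help to think of it as plain suffix removal. The snippet below is only an illustration using shell parameter expansion. It is an assumption about the behavior described here, not the actual code inside theiacov-gc-prepare.py.

```shell
#!/usr/bin/env bash
# Illustration only: mimic "remove the extension" with ${var%suffix}.
fastq="21052354.BB6L11.2021-08-05.01.barcode26.fastq"

# The default extension (.fastq.gz) is not a suffix of this name, so nothing is removed.
default_name="${fastq%.fastq.gz}"

# With .fastq as the extension, the suffix is stripped.
trimmed_name="${fastq%.fastq}"

echo "$default_name"   # 21052354.BB6L11.2021-08-05.01.barcode26.fastq
echo "$trimmed_name"   # 21052354.BB6L11.2021-08-05.01.barcode26
```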

theiacov-gc-prepare.py with Clear Labs data

If you have standard Clear Labs FASTQs as demonstrated above, then this command should work for you:

theiacov-gc-prepare.py fastqs/ clearlabs primers.bed --tsv --fastq_pattern "*.fastq" --fastq_ext .fastq > samples.txt
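
A note on the --fastq_pattern value: if the pattern is left unquoted, the shell expands it itself whenever matching .fastq files sit in the directory you run the command from, so the script would receive file names instead of the pattern. Quoting it ("*.fastq") avoids this. The sketch below shows the difference in a throwaway directory; count_args is a hypothetical helper used only for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.fastq b.fastq

unquoted=$(count_args *.fastq)     # the shell expands the glob: 2 arguments
quoted=$(count_args "*.fastq")     # the pattern is passed through as-is: 1 argument

echo "unquoted: $unquoted, quoted: $quoted"
```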

Clean Sample Names

The following might not work for you

I still don't like the sample names, but fixing them is outside the scope of theiacov-gc-prepare.py. In the next section, I'll demonstrate how we can use sed to get some proper sample names!

Again, you've been warned... Let's go back to the Clear Labs observed naming schema:

${SAMPLE_NAME}.${RUN_ID}.${BARCODE}.fastq

We removed the .fastq so we have ${SAMPLE_NAME}.${RUN_ID}.${BARCODE}. We can use sed to get rid of the .${RUN_ID}.${BARCODE} part!

In our example our ${RUN_ID} is BB6L11.2021-08-05.01, and our ${BARCODE} has the form barcode00 (e.g. barcode26). Let's generalize this a bit...

`BB6L11.2021-08-05.01` is matched by `[A-Za-z0-9]+\.[0-9]+-[0-9]+-[0-9]+\.[0-9]+`

Haha, that's hard to remember! So instead we can just substitute the whole run ID.

barcode00 is an easy one though, it's just barcode[0-9]+ which will match barcode00, barcode01, ... barcode99.
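
A quick way to try these patterns against a real sample name is grep -E. In the sketch below the dots are escaped (\.) so they only match a literal dot, and is_clearlabs_name is a hypothetical helper name used just for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a name ends with .${RUN_ID}.${BARCODE}
# in the Clear Labs layout shown above.
run_id_re='[A-Za-z0-9]+\.[0-9]+-[0-9]+-[0-9]+\.[0-9]+'
barcode_re='barcode[0-9]+'

is_clearlabs_name() {
    printf '%s\n' "$1" | grep -Eq "\.${run_id_re}\.${barcode_re}\$"
}

is_clearlabs_name "21052354.BB6L11.2021-08-05.01.barcode26" && echo "matches"
is_clearlabs_name "sample01" || echo "does not match"
```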

Alright enough, you probably just want the sed part, here it is!

sed -E 's=.BB6L11.2021-08-05.01.barcode[0-9]+\t=\t='

What does this do?

The sed statement above looks for .BB6L11.2021-08-05.01.barcode[0-9]+\t and, if found, replaces it with a \t (tab). The tabs are there to be explicit, because the FASTQ names also have ${RUN_ID}.${BARCODE} in them, but there it is followed by .fastq, not a \t. By default sed would only replace the first occurrence of .BB6L11.2021-08-05.01.barcode[0-9]+, which is the sample column, but I'd rather play it safe and use the \t at the end.
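
To see the effect of the trailing \t on a single row, here is a small sketch that feeds one FOFN line through the same sed command. It assumes GNU sed, which understands \t in the pattern, and the shortened paths are made up for the example.

```shell
#!/usr/bin/env bash
# One FOFN row with made-up short paths; printf turns \t into real tabs.
row=$(printf '21052354.BB6L11.2021-08-05.01.barcode26\tclearlabs\t/fq/21052354.BB6L11.2021-08-05.01.barcode26.fastq\t/fq/EMPTY.fastq.gz\t/fq/primers.bed')

# Same substitution as above (GNU sed assumed for \t support).
cleaned=$(printf '%s\n' "$row" | sed -E 's=.BB6L11.2021-08-05.01.barcode[0-9]+\t=\t=')

# Only the sample column is shortened; the r1 path keeps its full name.
printf '%s\n' "$cleaned" | cut -f1   # 21052354
printf '%s\n' "$cleaned" | cut -f3   # /fq/21052354.BB6L11.2021-08-05.01.barcode26.fastq
```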

Now, when we pipe the output of theiacov-gc-prepare.py to this sed statement, we get:

theiacov-gc-prepare.py clearlabs-fastqs/ clearlabs /opt/titan/data/artic-v3.primers.bed --tsv --fastq_pattern "*.fastq" --fastq_ext .fastq | \
    sed -E 's=.BB6L11.2021-08-05.01.barcode[0-9]+\t=\t=' | \
    head -n 5
sample  theiacov_wf        r1      r2      primers
21052354        clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052354.BB6L11.2021-08-05.01.barcode26.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz        /opt/titan/data/artic-v3.primers.bed
21052630        clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052630.BB6L11.2021-08-05.01.barcode25.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz        /opt/titan/data/artic-v3.primers.bed
21052685        clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052685.BB6L11.2021-08-05.01.barcode27.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz        /opt/titan/data/artic-v3.primers.bed
21052694        clearlabs       /home/robert_petit/test/tmp/clearlabs-fastqs/21052694.BB6L11.2021-08-05.01.barcode31.fastq      /home/robert_petit/.titan/EMPTY.fastq.gz        /opt/titan/data/artic-v3.primers.bed

Check out those pretty sample names!

In the future, you would just replace the ${RUN_ID} with the current one, for example:

# Generalized
sed -E 's=.${RUN_ID}.barcode[0-9]+\t=\t='

# RUN_ID = BB6L11.2021-11-11-01, then use
sed -E 's=.BB6L11.2021-11-11-01.barcode[0-9]+\t=\t='

# RUN_ID = TokyoOlympics2021, then use
sed -E 's=.TokyoOlympics2021.barcode[0-9]+\t=\t='
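
If you'd rather not retype the sed expression for every run, the run ID can live in a shell variable. This is a sketch, not part of theiacov-gc-prepare.py: strip_clearlabs_suffix is a hypothetical helper, the expression is double-quoted so the shell expands the variables, and dots in the run ID are escaped so they only match literal dots. A real tab character is used instead of \t, so it does not depend on GNU sed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a FOFN on stdin and removes
# .${RUN_ID}.barcodeNN from the end of the sample column.
strip_clearlabs_suffix() {
    tab=$(printf '\t')
    # Escape dots so "." in the run ID matches a literal dot only.
    run_id_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    sed -E "s=\.${run_id_re}\.barcode[0-9]+${tab}=${tab}="
}

# Prints "21052354<TAB>clearlabs"
printf '21052354.BB6L11.2021-08-05.01.barcode26\tclearlabs\n' |
    strip_clearlabs_suffix "BB6L11.2021-08-05.01"
```

The = delimiter means this sketch would break on a run ID that itself contains an = character.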

Nicely named samples for Clear Labs

If you made it this far, you are probably ok playing with pipes and sed. Otherwise stick with the method above.

theiacov-gc-prepare.py FASTQ_DIR clearlabs PRIMER_BED --tsv --fastq_pattern "*.fastq" --fastq_ext .fastq | \
    sed -E 's=.${RUN_ID}.barcode[0-9]+\t=\t=' > samples.txt

Illumina Paired-End

Ok! You got yourself some Illumina paired-end sequences. Illumina has a naming convention, so hopefully yours look something like this:

ls pe-fastqs/ | head -n6
20117579-COV-210722-1_S27_L001_R1_001.fastq.gz
20117579-COV-210722-1_S27_L001_R2_001.fastq.gz
20118596-COV-210722-1_S26_L001_R1_001.fastq.gz
20118596-COV-210722-1_S26_L001_R2_001.fastq.gz
20119899-COV-210722-1_S12_L001_R1_001.fastq.gz
20119899-COV-210722-1_S12_L001_R2_001.fastq.gz

We can break these down to ${SAMPLE_NAME}-${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID}_R{1|2}_001.fastq.gz. Since they don't look like ${SAMPLE_NAME}_R1.fastq.gz, we are going to have to play with the parameters of theiacov-gc-prepare.py.

Good news though, this is an easy fix! All our FASTQs end in _001.fastq.gz, so we can tell theiacov-gc-prepare.py to use --fastq_ext "_001.fastq.gz". Let's see what happens:

theiacov-gc-prepare.py pe-fastqs/ illumina_pe /opt/titan/data/artic-v3.primers.bed --fastq_ext "_001.fastq.gz" --tsv
sample  theiacov_wf        r1      r2      primers
20117579-COV-210722-1_S27_L001  illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20117579-COV-210722-1_S27_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20117579-COV-210722-1_S27_L001_R2_001.fastq.gz     /opt/titan/data/artic-v3.primers.bed
20118596-COV-210722-1_S26_L001  illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20118596-COV-210722-1_S26_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20118596-COV-210722-1_S26_L001_R2_001.fastq.gz     /opt/titan/data/artic-v3.primers.bed
20119899-COV-210722-1_S12_L001  illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20119899-COV-210722-1_S12_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20119899-COV-210722-1_S12_L001_R2_001.fastq.gz     /opt/titan/data/artic-v3.primers.bed
20122466-COV-210722-1_S48_L001  illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20122466-COV-210722-1_S48_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20122466-COV-210722-1_S48_L001_R2_001.fastq.gz     /opt/titan/data/artic-v3.primers.bed
20123480-COV-210722-1_S40_L001  illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20123480-COV-210722-1_S40_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20123480-COV-210722-1_S40_L001_R2_001.fastq.gz     /opt/titan/data/artic-v3.primers.bed

Very nice! Everything looks good, and if you wanted you could feed this to titan-nf and get started processing.
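
A quick way to double-check that every R1 file in the directory has its R2 mate is a small loop like the one below. It is a sketch that assumes the _R1_001.fastq.gz / _R2_001.fastq.gz naming shown above; check_pairs is a hypothetical helper, not part of theiacov-gc-prepare.py.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports R1 files in a directory without a matching R2.
# Returns 0 when every R1 has a mate, 1 otherwise.
check_pairs() {
    missing=0
    for r1 in "$1"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue                       # glob matched nothing
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"   # expected mate
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1"
            missing=1
        fi
    done
    return "$missing"
}
```

For the directory used here, that would be check_pairs pe-fastqs && echo "all pairs present".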

theiacov-gc-prepare.py with Illumina Paired-End data

If you have standard Illumina Paired-End FASTQs as demonstrated above, then this command should work for you:

theiacov-gc-prepare.py fastqs/ illumina_pe primers.bed --tsv --fastq_ext "_001.fastq.gz" > samples.txt

Clean Sample Names

The following might not work for you

I still don't like the sample names (20123480-COV-210722-1_S40_L001), but fixing them is outside the scope of theiacov-gc-prepare.py. In the next section, I'll demonstrate how we can use sed to get some proper sample names!

Again, you've been warned... Let's go back to the Illumina Paired-End observed naming schema:

${SAMPLE_NAME}-${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID}_R{1|2}_001.fastq.gz

We removed the _R{1|2}_001.fastq.gz so we have ${SAMPLE_NAME}-${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID}. We can use sed to get rid of the -${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID} part!

In our example the ${RUN_ID} is COV-210722-1, the ${SAMPLE_NUMBER} has the form S00 (e.g. S27), and the ${LANE_ID} has the form L000 (e.g. L001).

For the ${RUN_ID}, it's a bit difficult to account for the infinite number of possible run names, since it's specified by the user. But for the others, we can generalize them to:

${SAMPLE_NUMBER} becomes S[0-9]+ which will match S00, S09, S42, etc...

${LANE_ID} becomes L[0-9]+ which will match L001, L004, L1, etc...

So, if we put this all together in sed we get:

sed -E 's=-COV-210722-1_S[0-9]+_L[0-9]+\t=\t='

What does this do?

The sed statement above looks for -COV-210722-1_S[0-9]+_L[0-9]+\t and, if found, replaces it with a \t (tab). The tabs are there to be explicit, because the FASTQ names also have -${RUN_ID}_${SAMPLE_NUMBER}_${LANE_ID} in them, but there it is followed by _R{1|2}_001.fastq.gz, not a \t. By default sed would only replace the first occurrence of -COV-210722-1_S[0-9]+_L[0-9]+, which is the sample column, but I'd rather play it safe and use the \t at the end.

Now, when we pipe the output of theiacov-gc-prepare.py to this sed statement, we get:

theiacov-gc-prepare.py pe-fastqs/ illumina_pe /opt/titan/data/artic-v3.primers.bed --fastq_ext "_001.fastq.gz" --tsv | \
    sed -E 's=-COV-210722-1_S[0-9]+_L[0-9]+\t=\t='
sample  theiacov_wf        r1      r2      primers
20117579        illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20117579-COV-210722-1_S27_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20117579-COV-210722-1_S27_L001_R2_001.fastq.gz    /opt/titan/data/artic-v3.primers.bed
20118596        illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20118596-COV-210722-1_S26_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20118596-COV-210722-1_S26_L001_R2_001.fastq.gz    /opt/titan/data/artic-v3.primers.bed
20119899        illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20119899-COV-210722-1_S12_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20119899-COV-210722-1_S12_L001_R2_001.fastq.gz    /opt/titan/data/artic-v3.primers.bed
20122466        illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20122466-COV-210722-1_S48_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20122466-COV-210722-1_S48_L001_R2_001.fastq.gz    /opt/titan/data/artic-v3.primers.bed
20123480        illumina_pe     /home/robert_petit/test/tmp/pe-fastqs/20123480-COV-210722-1_S40_L001_R1_001.fastq.gz    /home/robert_petit/test/tmp/pe-fastqs/20123480-COV-210722-1_S40_L001_R2_001.fastq.gz    /opt/titan/data/artic-v3.primers.bed

Some very nice looking sample names!
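
If your sed does not understand \t, or you'd like to be certain that only the sample column is touched, awk can apply the same pattern to the first field alone. This is an alternative sketch rather than the method used above; clean_illumina_names is a hypothetical helper, and the run ID COV-210722-1 is the one from this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a FOFN on stdin and trims
# -${RUN_ID}_S<number>_L<number> from the end of the first column only.
# Note: the run ID is used as part of a regular expression.
clean_illumina_names() {
    awk -F'\t' -v OFS='\t' -v run_id="$1" '
        NR > 1 { sub("-" run_id "_S[0-9]+_L[0-9]+$", "", $1) }   # header row is left alone
        { print }
    '
}

printf 'sample\ttheiacov_wf\n20117579-COV-210722-1_S27_L001\tillumina_pe\n' |
    clean_illumina_names "COV-210722-1"
```

Because the substitution is anchored to the end of the first field, the r1 and r2 paths are never candidates for a match.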

In the future, you would just replace the ${RUN_ID} with the current one, for example:

# Generalized
sed -E 's=-${RUN_ID}_S[0-9]+_L[0-9]+\t=\t='

# RUN_ID = COV-002200-4, then use
sed -E 's=-COV-002200-4_S[0-9]+_L[0-9]+\t=\t='

# RUN_ID = TokyoOlympics2021, then use
sed -E 's=-TokyoOlympics2021_S[0-9]+_L[0-9]+\t=\t='

Nicely named samples for Illumina Paired-End

If you made it this far, you are probably ok playing with pipes and sed. Otherwise stick with the method above.

theiacov-gc-prepare.py pe-fastqs/ illumina_pe /opt/titan/data/artic-v3.primers.bed --fastq_ext "_001.fastq.gz" --tsv | \
    sed -E 's=-${RUN_ID}_S[0-9]+_L[0-9]+\t=\t=' > samples.txt

Oxford Nanopore

Alright, ONT naming. This is a fun one, because when we merge the multiple FASTQs associated with a barcode we probably name them differently than you do. But for example purposes, let's assume your ONT FASTQs look like this (they will not though!):

ls ont-fastqs/ | head -n5
A01_20119534.fastq.gz
A02_20119750.fastq.gz
A03_20121547.fastq.gz
A04_20140893.fastq.gz
A05_20137749.fastq.gz

Here, there is no real naming convention, so we can't really break it down. But we do need to play with the parameters.

theiacov-gc-prepare.py ont-fastqs/ ont /opt/titan/data/artic-v3.primers.bed --tsv
ERROR: "A01" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
ERROR: "A02" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
ERROR: "A03" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
ERROR: "A04" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.
ERROR: "A05" must have equal paired-end read sets (R1 has 0 and R2 has 1), please check.

This is an easy fix though. Since we have single-end reads, theiacov-gc-prepare.py is expecting the FASTQs to look like ${SAMPLE_NAME}.fastq.gz. By default theiacov-gc-prepare.py splits read sets on the _ character, so it thinks we have paired-end reads because of the _ in our FASTQ names.

We can use --fastq_separator to tell theiacov-gc-prepare.py to split on the . character.
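
To picture what the separator changes, the snippet below splits one of the names above on _ and on . using shell parameter expansion. This only illustrates the idea described here. It is an assumption about the behavior, not the actual logic inside theiacov-gc-prepare.py.

```shell
#!/usr/bin/env bash
# Illustration only: keep everything before the first separator.
fastq="A01_20119534.fastq.gz"

split_on_underscore="${fastq%%_*}"   # A01 (the name that shows up in the errors above)
split_on_dot="${fastq%%.*}"          # A01_20119534

echo "$split_on_underscore"
echo "$split_on_dot"
```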