Nextflow DSL2
First, the Wikipedia definition of DSL:
A domain-specific language (DSL) is a computer language specialized to a particular application domain.
Nextflow DSL2 was announced on 24/07/2020 and is now well documented. It is defined as:
a major evolution of the Nextflow language and makes it possible to scale and modularise your data analysis pipeline while continuing to use the Dataflow programming paradigm that characterises the Nextflow processing model.
This means that we can now split our pipeline between different files, instead of having one huge unreadable file.
Enabling DSL2
DSL2 is supported by every version of nextflow >= 20.**.**. You can update your version of nextflow with the following command:
nextflow self-update
For now, DSL2 is not enabled by default: you need to add the following line to your main .nf script:
nextflow.enable.dsl=2
Nextflow modules
Nextflow modules are merely generic process definitions, without the input from and output into channel names specified.
samtools sort process definition:
Channel
.fromPath( params.bam )
.map { it -> [it.simpleName, it]}
.set { bam_files }
process sort_bam {
  tag "$file_id"

  input:
    set file_id, file(bam) from bam_files

  output:
    set file_id, "*_sorted.bam" into sorted_bam_files

  script:
"""
samtools sort -@ ${task.cpus} -O BAM -o ${file_id}_sorted.bam ${bam}
"""
}
samtools sort module definition:
process sort_bam {
  tag "$file_id"

  input:
    tuple val(file_id), path(bam)

  output:
    tuple val(file_id), path("*.bam*")

  script:
"""
samtools sort -@ ${task.cpus} -O BAM -o ${bam.simpleName}_sorted.bam ${bam}
"""
}
We save this module definition in src/nf_modules/samtools/main.nf
You can now include your module with the following code:
include { sort_bam } from './nf_modules/samtools/main.nf'
Mind the ./ at the start of the path.
Workflow
Modules carry no channel names, so they cannot chain one process after another by themselves. Nextflow DSL2 introduces the workflow block for this.
With a workflow you can write the RNA quantification pipeline from the nextflow practical for experimental biologists as follows:
log.info "fastq files : ${params.fastq}"
log.info "fasta file : ${params.fasta}"
log.info "bed file : ${params.bed}"

channel // same as Channel
  .fromPath( params.fasta )
  .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
  .set { fasta_files }
channel
  .fromPath( params.bed )
  .ifEmpty { error "Cannot find any bed files matching: ${params.bed}" }
  .set { bed_files }
channel
  .fromFilePairs( params.fastq )
  .ifEmpty { error "Cannot find any fastq files matching: ${params.fastq}" }
  .set { fastq_files }

include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
include { trimming_pairedend } from './nf_modules/urqt/main'
include { fasta_from_bed } from './nf_modules/bedtools/main'
include { index_fasta; mapping_fastq_pairedend } from './nf_modules/kallisto/main'

workflow {
  adaptor_removal_pairedend(fastq_files)
  trimming_pairedend(adaptor_removal_pairedend.out.fastq)
  fasta_from_bed(fasta_files, bed_files)
  index_fasta(fasta_from_bed.out.fasta)
  mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}
Modules outputs
By default, the outputs of a module are accessible through module_name.out. If the module declares several outputs, module_name.out is a list.
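As a minimal sketch of this positional access, assume a hypothetical process split_greeting (it is not part of the pipeline) that declares two unnamed outputs. Each output can then be reached by its index in .out:

```nextflow
nextflow.enable.dsl=2

// Hypothetical process, for illustration only: two unnamed outputs.
process split_greeting {
  input:
    val(word)

  output:
    val(word)            // first output  -> split_greeting.out[0]
    path("${word}.txt")  // second output -> split_greeting.out[1]

  script:
"""
echo ${word} > ${word}.txt
"""
}

workflow {
  split_greeting(channel.of("hello", "world"))
  // Outputs are addressed by their position in the output block.
  split_greeting.out[0].view { v -> "value: $v" }
  split_greeting.out[1].view { f -> "file: $f" }
}
```

Positional indexes break silently when the order of the outputs changes, which is one reason to prefer named outputs.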
You can also name outputs with the emit option. For example, in the RNA quantification pipeline, the adaptor_removal_pairedend module is defined as follows:
process adaptor_removal_pairedend {
  tag "$pair_id"
  publishDir "results/fastq/adaptor_removal/", mode: 'copy'

  input:
    tuple val(pair_id), path(reads)

  output:
    tuple val(pair_id), path("*_cut_R{1,2}.fastq.gz"), emit: fastq
    path "*_report.txt", emit: report

  script:
"""
cutadapt -a ${adapter_3_prim} -g ${adapter_5_prim} -A ${adapter_3_prim} -G ${adapter_5_prim} \
-o ${pair_id}_cut_R1.fastq.gz -p ${pair_id}_cut_R2.fastq.gz \
${reads[0]} ${reads[1]} > ${pair_id}_report.txt
"""
}
Here, adaptor_removal_pairedend emits two named items: fastq and report.
Modules variable scope
In src/nf_modules/cutadapt/main.nf we have the following variable definitions:
adapter_3_prim = "AGATCGGAAGAG"
adapter_5_prim = "CTCTTCCGATCT"
trim_quality = "20"
These variables are used in the adaptor_removal_pairedend module. When the module is included, they are initialized. However, we can overwrite their values by redefining them in the workflow file.
include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
include { trimming_pairedend } from './nf_modules/urqt/main'
include { fasta_from_bed } from './nf_modules/bedtools/main'
include { index_fasta; mapping_fastq_pairedend } from './nf_modules/kallisto/main'

adapter_3_prim = "other_adaptor"

workflow {
  adaptor_removal_pairedend(fastq_files)
  trimming_pairedend(adaptor_removal_pairedend.out.fastq)
  fasta_from_bed(fasta_files, bed_files)
  index_fasta(fasta_from_bed.out.fasta)
  mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}
Implicit channel forking
With DSL2 the operator into is no longer defined, because channels are duplicated automatically!
We can easily add FastQC steps to our pipeline:
include { fastqc_fastq_pairedend } from './nf_modules/fastqc/main'

workflow {
  adaptor_removal_pairedend(fastq_files)
  fastqc_fastq_pairedend(fastq_files) // doesn't cause an error !
  trimming_pairedend(adaptor_removal_pairedend.out.fastq)
  fasta_from_bed(fasta_files, bed_files)
  index_fasta(fasta_from_bed.out.fasta)
  mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}
Channels are implicitly forked, but modules are not: a given module name can be called only once in a workflow. We can use as in the include statement to rename a module and use the same module at different points of the workflow:
include {
  fastqc_fastq_pairedend as fastqc_raw; // mind the ";" !
  fastqc_fastq_pairedend as fastqc_clipped;
  fastqc_fastq_pairedend as fastqc_trimmed;
} from './nf_modules/fastqc/main'

workflow {
  fastqc_raw(fastq_files)
  adaptor_removal_pairedend(fastq_files)
  fastqc_clipped(adaptor_removal_pairedend.out.fastq)
  trimming_pairedend(adaptor_removal_pairedend.out.fastq)
  fastqc_trimmed(trimming_pairedend.out.fastq)
  fasta_from_bed(fasta_files, bed_files)
  index_fasta(fasta_from_bed.out.fasta)
  mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}
Sub-workflow
A sub-workflow can be seen as a workflow declared as a module. Sub-workflows are workflows that take inputs and emit outputs. We can split our RNASeq quantification pipeline in the following way:
workflow read_processing {
  take:
    fastq_files

  main:
    fastqc_raw(fastq_files)
    adaptor_removal_pairedend(fastq_files)
    fastqc_clipped(adaptor_removal_pairedend.out.fastq)
    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
    fastqc_trimmed(trimming_pairedend.out.fastq)

  emit:
    fastq = trimming_pairedend.out.fastq
    report = fastqc_raw.out.report
      .mix(fastqc_clipped.out.report)
      .mix(fastqc_trimmed.out.report)
}

workflow {
  read_processing(fastq_files)
  fasta_from_bed(fasta_files, bed_files)
  index_fasta(fasta_from_bed.out.fasta)
  // use the output emitted by the sub-workflow, not the inner process output
  mapping_fastq_pairedend(index_fasta.out.index.collect(), read_processing.out.fastq)
}
Nested workflow execution determines an implicit scope. Therefore the same process can be invoked in two different workflow scopes.
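The scoping rule can be sketched with a toy example. The process say and the sub-workflows first and second are hypothetical names made up for this illustration; the point is only that the same process is called once in each sub-workflow scope without being renamed:

```nextflow
nextflow.enable.dsl=2

// Hypothetical process, for illustration only.
process say {
  input:
    val(x)

  output:
    stdout

  script:
"""
echo ${x}
"""
}

// Each sub-workflow defines its own scope, so both can call say().
workflow first {
  take:
    data

  main:
    say(data)

  emit:
    say.out
}

workflow second {
  take:
    data

  main:
    say(data)

  emit:
    say.out
}

workflow {
  first(channel.of("a"))
  second(channel.of("b"))
  first.out.view { v -> "first: $v" }
  second.out.view { v -> "second: $v" }
}
```

Calling say twice inside a single workflow block would still require an include with as, as described above.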
DSL2 migration notes
Process inputs or outputs of type set have to be replaced with tuple.
Process output option mode flatten is not available any more.
Use path instead of file (path can interpret a string as a path).
The use of unqualified value and file elements into input tuples is not allowed anymore. Replace
input: tuple X, 'some-file.bam'
with
input: tuple val(X), path('some-file.bam')
Operator bind has been deprecated by DSL2 syntax.
Operator << has been deprecated by DSL2 syntax.
Operator choice has been deprecated by DSL2 syntax. Use branch instead.
Operator close has been deprecated by DSL2 syntax.
Operator create has been deprecated by DSL2 syntax.
Operator countBy has been deprecated by DSL2 syntax.
Operator into has been deprecated by DSL2 syntax since it’s not needed anymore.
Operator fork has been renamed to multiMap.
Operator groupBy has been deprecated by DSL2 syntax. Replace it with groupTuple.
Operators print and println have been deprecated by DSL2 syntax. Use view instead.
Operator merge has been deprecated by DSL2 syntax. Use join instead.
Operator separate has been deprecated by DSL2 syntax.
Operator spread has been deprecated with DSL2 syntax. Replace it with combine.
Operator route has been deprecated by DSL2 syntax.
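A few of these replacements can be sketched on toy channels. The values and the channel names by_size and forked are made up for this illustration; branch, multiMap and view are the operators named in the notes above:

```nextflow
nextflow.enable.dsl=2

workflow {
  // branch (replaces choice): route each item to one named output
  channel.of(1, 2, 3, 40, 50)
    .branch { v ->
      small: v < 10
      large: v >= 10
    }
    .set { by_size }

  // view (replaces print and println)
  by_size.small.view { v -> "small: $v" }
  by_size.large.view { v -> "large: $v" }

  // multiMap (new name of fork): send every item to several named outputs
  channel.of(1, 2, 3)
    .multiMap { v ->
      plain: v
      doubled: v * 2
    }
    .set { forked }

  forked.plain.view { v -> "plain: $v" }
  forked.doubled.view { v -> "doubled: $v" }
}
```

With branch each item goes to the first output whose condition matches, whereas with multiMap each item is emitted on every named output.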
To see all the changes you can read the DSL2 section of the documentation and re-read the full nextflow documentation…