Nextflow DSL2

changes introduced by the DSL2 of Nextflow
Club Bioinfo
Author

Laurent Modolo

Published

February 4, 2021

cc-by-sa

First, the Wikipedia definition of DSL:

A domain-specific language (DSL) is a computer language specialized to a particular application domain.

Nextflow DSL2 was announced on 24/07/2020 and is now well documented. It is defined as:

a major evolution of the Nextflow language and makes it possible to scale and modularise your data analysis pipeline while continuing to use the Dataflow programming paradigm that characterises the Nextflow processing model.

This means that we can now split our pipeline between different files, instead of having one huge unreadable file.

Enabling DSL2

DSL2 is supported by every version of Nextflow >= 20.**.**. You can update your version of Nextflow with the following command:

nextflow self-update

For now, DSL2 is not enabled by default: you need to add the following line to your main .nf script:

nextflow.enable.dsl=2

Nextflow modules

Nextflow modules are merely generic process definitions in which the names of the input (from) and output (into) channels are not specified.

samtools sort process definition

Channel
  .fromPath( params.bam )
  .map { it -> [it.simpleName, it]}
  .set { bam_files }

process sort_bam {
  tag "$file_id"

  input:
    set file_id, file(bam) from bam_files

  output:
    set file_id, "*_sorted.bam" into sorted_bam_files

  script:
"""
samtools sort -@ ${task.cpus} -O BAM -o ${file_id}_sorted.bam ${bam}
"""
}

samtools sort module definition

process sort_bam {
  tag "$file_id"

  input:
    tuple val(file_id), path(bam)

  output:
    tuple val(file_id), path("*.bam*")

  script:
"""
samtools sort -@ ${task.cpus} -O BAM -o ${bam.simpleName}_sorted.bam ${bam}
"""
}

We save this module definition in src/nf_modules/samtools/main.nf

You can now include your module with the following code:

include { sort_bam } from './nf_modules/samtools/main.nf'

Mind the ./ at the start of the path.
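
As a minimal sketch of how the included module could be used, the channel that the DSL1 version consumed with from is now passed to the process as an argument inside a workflow block. The sketch assumes that the main script sits next to the nf_modules folder and reuses params.bam from the DSL1 example. Viewing the output is added for illustration only.

```nextflow
nextflow.enable.dsl=2

include { sort_bam } from './nf_modules/samtools/main.nf'

// same channel construction as in the DSL1 example
channel
  .fromPath( params.bam )
  .map { it -> [it.simpleName, it] }
  .set { bam_files }

workflow {
  // the input channel is given as an argument instead of "from bam_files"
  sort_bam(bam_files)
  // the output channel is read from .out instead of "into sorted_bam_files"
  sort_bam.out.view()
}
```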

Workflow

With modules you don’t have the channel information to chain one process after another. Nextflow DSL2 introduces the workflow.

A workflow is a new kind of block. With a workflow, you can write the RNA quantification pipeline from the Nextflow practical for experimental biologists as follows:

log.info "fastq files : ${params.fastq}"
log.info "fasta file : ${params.fasta}"
log.info "bed file : ${params.bed}"

channel // same as Channel
  .fromPath( params.fasta )
  .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
  .set { fasta_files }
channel
  .fromPath( params.bed )
  .ifEmpty { error "Cannot find any bed files matching: ${params.bed}" }
  .set { bed_files }
channel
  .fromFilePairs( params.fastq )
  .ifEmpty { error "Cannot find any fastq files matching: ${params.fastq}" }
  .set { fastq_files }

include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
include { trimming_pairedend } from './nf_modules/urqt/main'
include { fasta_from_bed } from './nf_modules/bedtools/main'
include { index_fasta; mapping_fastq_pairedend } from './nf_modules/kallisto/main'

workflow {
    adaptor_removal_pairedend(fastq_files)
    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
    fasta_from_bed(fasta_files, bed_files)
    index_fasta(fasta_from_bed.out.fasta)
    mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}

Module outputs

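By default, the outputs of a module are accessible through module_name.out. If the module declares several outputs, module_name.out behaves like a list and each output is accessed by its index (module_name.out[0], module_name.out[1], and so on).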
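
As a minimal sketch of this positional access, consider a hypothetical process split_report (it is not part of the pipeline described here) that declares two unnamed outputs:

```nextflow
nextflow.enable.dsl=2

// hypothetical process with two unnamed outputs, for illustration only
process split_report {
  input:
    val sample_id

  output:
    path "${sample_id}.txt"
    path "${sample_id}.log"

  script:
  """
  echo "data for ${sample_id}" > ${sample_id}.txt
  echo "log for ${sample_id}" > ${sample_id}.log
  """
}

workflow {
  split_report(channel.of('a', 'b'))
  split_report.out[0].view { "txt: $it" }  // first declared output
  split_report.out[1].view { "log: $it" }  // second declared output
}
```

Positional access works, but it breaks silently when the order of the outputs changes, which is a good reason to prefer named outputs.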

You can also name outputs with the emit option. For example, in the RNA quantification pipeline, the adaptor_removal_pairedend module is defined as follows:

process adaptor_removal_pairedend {
  tag "$pair_id"
  publishDir "results/fastq/adaptor_removal/", mode: 'copy'

  input:
  tuple val(pair_id), path(reads)

  output:
  tuple val(pair_id), path("*_cut_R{1,2}.fastq.gz"), emit: fastq
  path "*_report.txt", emit: report

  script:
  """
  cutadapt -a ${adapter_3_prim} -g ${adapter_5_prim} -A ${adapter_3_prim} -G ${adapter_5_prim} \
  -o ${pair_id}_cut_R1.fastq.gz -p ${pair_id}_cut_R2.fastq.gz \
  ${reads[0]} ${reads[1]} > ${pair_id}_report.txt
  """
}

Here, adaptor_removal_pairedend emits two named items: fastq and report.
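
A short sketch of how these named outputs could be read in a workflow, assuming the fastq_files channel and the include statement of the pipeline shown earlier (the view calls are added for illustration only):

```nextflow
workflow {
  adaptor_removal_pairedend(fastq_files)

  // each named output is reached with .out.<name>
  adaptor_removal_pairedend.out.fastq
    .view { "clipped reads for ${it[0]}: ${it[1]}" }
  adaptor_removal_pairedend.out.report
    .view { "cutadapt report: ${it}" }
}
```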

Module variable scope

In the src/nf_modules/cutadapt/main.nf we have the following variable definition:

adapter_3_prim = "AGATCGGAAGAG"
adapter_5_prim = "CTCTTCCGATCT"
trim_quality = "20"

These variables are used in the adaptor_removal_pairedend module. When the module is included, they are initialized. However, we can overwrite their values by redefining them in the workflow file:

include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
include { trimming_pairedend } from './nf_modules/urqt/main'
include { fasta_from_bed } from './nf_modules/bedtools/main'
include { index_fasta; mapping_fastq_pairedend } from './nf_modules/kallisto/main'

adapter_3_prim = "other_adaptor"

workflow {
    adaptor_removal_pairedend(fastq_files)
    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
    fasta_from_bed(fasta_files, bed_files)
    index_fasta(fasta_from_bed.out.fasta)
    mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}

Implicit channel forking

With DSL2, the into operator is no longer available, and no longer needed: a channel can be consumed by several processes because it is forked automatically!

We can easily add FastQC steps to our pipeline:

include { fastqc_fastq_pairedend } from './nf_modules/fastqc/main'

workflow {
    adaptor_removal_pairedend(fastq_files)
    fastqc_fastq_pairedend(fastq_files) // does not cause an error!
    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
    fasta_from_bed(fasta_files, bed_files)
    index_fasta(fasta_from_bed.out.fasta)
    mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}

Channels are implicitly forked, but modules are not: a module can be called only once under a given name. We can use the as keyword in the include statement to rename a module and use the same module at different points of the workflow:

include { 
  fastqc_fastq_pairedend as fastqc_raw; // mind the ";" !
  fastqc_fastq_pairedend as fastqc_clipped;
  fastqc_fastq_pairedend as fastqc_trimmed;
} from './nf_modules/fastqc/main'

workflow {
    fastqc_raw(fastq_files)
    adaptor_removal_pairedend(fastq_files)
    fastqc_clipped(adaptor_removal_pairedend.out.fastq)
    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
    fastqc_trimmed(trimming_pairedend.out.fastq)
    fasta_from_bed(fasta_files, bed_files)
    index_fasta(fasta_from_bed.out.fasta)
    mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}

Sub-workflow

A sub-workflow can be seen as a workflow declared as a module. Sub-workflows are workflows that take inputs and emit outputs. We can split our RNASeq quantification pipeline in the following way:

workflow read_processing {
    take:
      fastq_files
    main:
      fastqc_raw(fastq_files)
      adaptor_removal_pairedend(fastq_files)
      fastqc_clipped(adaptor_removal_pairedend.out.fastq)
      trimming_pairedend(adaptor_removal_pairedend.out.fastq)
      fastqc_trimmed(trimming_pairedend.out.fastq)
    emit:
      fastq = trimming_pairedend.out.fastq
      report = fastqc_raw.out.report
                   .mix(fastqc_clipped.out.report)
                   .mix(fastqc_trimmed.out.report)
}
workflow {
    read_processing(fastq_files)
    fasta_from_bed(fasta_files, bed_files)
    index_fasta(fasta_from_bed.out.fasta)
    mapping_fastq_pairedend(index_fasta.out.index.collect(), read_processing.out.fastq)
}

Nested workflow execution determines an implicit scope. Therefore the same process can be invoked in two different workflow scopes.
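
The following sketch illustrates this scoping. It assumes the fastqc module and its report output used above, plus the fastq_files channel and the adaptor_removal_pairedend include of the main pipeline. The sub-workflow names qc_raw and qc_clipped are hypothetical, and the exact behaviour may depend on your Nextflow version.

```nextflow
include { fastqc_fastq_pairedend } from './nf_modules/fastqc/main'

// hypothetical sub-workflow: quality control of the raw reads
workflow qc_raw {
    take:
      reads
    main:
      fastqc_fastq_pairedend(reads)
    emit:
      report = fastqc_fastq_pairedend.out.report
}

// hypothetical second sub-workflow calling the same process in its own scope
workflow qc_clipped {
    take:
      reads
    main:
      fastqc_fastq_pairedend(reads)
    emit:
      report = fastqc_fastq_pairedend.out.report
}

workflow {
    qc_raw(fastq_files)
    adaptor_removal_pairedend(fastq_files)
    qc_clipped(adaptor_removal_pairedend.out.fastq)
}
```

Here the process is not renamed with as: each call lives in a different workflow scope, so the two invocations should not clash.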

DSL2 migration notes

  • Process inputs or outputs of type set have to be replaced with tuple.

  • Process output option mode flatten is not available any more.

  • Use path instead of file (path can interpret a string as a path).

  • The use of unqualified value and file elements in input tuples is not allowed anymore. Replace them with the corresponding val or path qualifier. Instead of:

    input:
      tuple X, 'some-file.bam'

    write:

    input:
      tuple val(X), path('some-file.bam')

  • Operator bind has been deprecated by DSL2 syntax

  • Operator << has been deprecated by DSL2 syntax.

  • Operator choice has been deprecated by DSL2 syntax. Use branch instead.

  • Operator close has been deprecated by DSL2 syntax.

  • Operator create has been deprecated by DSL2 syntax.

  • Operator countBy has been deprecated by DSL2 syntax.

  • Operator into has been deprecated by DSL2 syntax since it’s not needed anymore.

  • Operator fork has been renamed to multiMap.

  • Operator groupBy has been deprecated by DSL2 syntax. Replace it with groupTuple

  • Operator print and println have been deprecated by DSL2 syntax. Use view instead.

  • Operator merge has been deprecated by DSL2 syntax. Use join instead.

  • Operator separate has been deprecated by DSL2 syntax.

  • Operator spread has been deprecated with DSL2 syntax. Replace it with combine.

  • Operator route has been deprecated by DSL2 syntax.
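
A few of the replacements listed above can be sketched on toy channels. The values and the names sizes and versions are made up for illustration, and the suggested operators do not always have exactly the same semantics as the ones they replace, so check each case against the operator documentation.

```nextflow
nextflow.enable.dsl=2

workflow {
  // view is the suggested replacement for print / println
  channel.of(1, 2, 3).view { "value: $it" }

  // groupTuple is the suggested replacement for groupBy:
  // it groups tuples that share the same key (first element by default)
  channel.of(['s1', 'a.bam'], ['s2', 'b.bam'], ['s1', 'c.bam'])
    .groupTuple()
    .view()

  // combine is the suggested replacement for spread:
  // it emits every pairing of the items of the two channels
  channel.of('x', 'y')
    .combine(channel.of(1, 2))
    .view()

  // branch is the suggested replacement for choice:
  // each item goes to the first branch whose condition is true
  channel.of(1, 20, 3, 40)
    .branch {
      small: it < 10
      large: it >= 10
    }
    .set { sizes }
  sizes.small.view { "small: $it" }
  sizes.large.view { "large: $it" }

  // multiMap is the new name of fork:
  // each item is sent to every named output, possibly transformed
  channel.of(1, 2, 3)
    .multiMap { it ->
      plain: it
      squared: it * it
    }
    .set { versions }
  versions.plain.view { "plain: $it" }
  versions.squared.view { "squared: $it" }
}
```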

To see all the changes you can read the DSL2 section of the documentation and re-read the full nextflow documentation…