Chapter 2 Concepts

2.1 Workflow

Nextflow is used to create workflows. A workflow consists of multiple processes. Each process has input and output.

A process will not start until it has all the input it requires. This matters because, when the output of one process is the input of another, the second process waits until the first has produced that output.

In the Nextflow script, each process is defined in the main script body (or in a separate main.nf file).

These processes are then called in the workflow.

Example main.nf

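#!/usr/bin/env nextflow
/*
 * Params
 */
params.dir = "."
params.input1 = "input.txt"
/*
 * Processes
 */
process STEP1 {
  input:
    path(input_data)
  output:
    path("output1.txt")
  script:
  // Placeholder command; replace with the real tool for this step
  """
  cp $input_data output1.txt
  """
}
process STEP2 {
  input:
    path(input_data)
  output:
    path("output2.txt")
  script:
  // Placeholder command; replace with the real tool for this step
  """
  cp $input_data output2.txt
  """
}
/*
 * Workflow
 */
workflow {
  // Create a channel from the input param
  input1_ch = Channel.fromPath(params.input1)
  // Process 1
  STEP1(input1_ch)
  // Process 2: starts once STEP1 has produced its output
  STEP2(STEP1.out)
}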

2.2 Variables

Nextflow variables are used throughout a Nextflow script. Within the script section of a process block they are denoted by $ (e.g. $sample_id).

Variable names cannot start with numbers.

Bash variables within a script section must have their $ escaped with a backslash (e.g. \$var), otherwise Nextflow tries to interpret them as Nextflow variables.
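
As a minimal sketch of this distinction, the hypothetical process below mixes a Nextflow variable with a Bash variable in one script section. The process name GREET_SAMPLE, the variable names, and the sample ids are assumptions made for illustration only.

```nextflow
process GREET_SAMPLE {
  input:
    val(sample_id)
  output:
    path("${sample_id}.txt")
  script:
  // $sample_id is a Nextflow variable: Nextflow fills it in before Bash runs.
  // \$today is a Bash variable: the backslash leaves the $ for Bash to resolve.
  """
  today=\$(date +%F)
  echo "$sample_id processed on \$today" > ${sample_id}.txt
  """
}

workflow {
  // Hypothetical sample ids
  GREET_SAMPLE(Channel.of("S1", "S2"))
}
```

Without the backslash, Nextflow would look for a Nextflow variable called today and fail because none is defined.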

2.3 Channels, tuples, & lists

Channels, tuples, and lists are objects that contain multiple objects but work in different ways.

2.3.1 Channels

Channels are specified and used in the workflow section. A Channel contains a number of values. Each value passes through a process separately, and these runs are carried out in parallel.

Example:

  • A Channel called integers_ch contains the 9 values 1,2,3,4,5,6,7,8, & 9.
  • A process called multiple_by_10 multiplies its input by 10.
  • If the Channel integers_ch was used as the input for the process multiple_by_10, the output would be a Channel of 9 values containing the values 10,20,30,40,50,60,70,80, & 90.
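
The example above could be sketched roughly as follows. The names integers_ch and multiple_by_10 come from the example; the variable names x and result, and the choice of an exec block (native code rather than a shell script), are assumptions made to keep the sketch short.

```nextflow
process multiple_by_10 {
  input:
    val(x)
  output:
    val(result)
  exec:
  // Native Groovy code; no shell command is needed for simple arithmetic
  result = x * 10
}

workflow {
  // A Channel holding the values 1 to 9
  integers_ch = Channel.of(1..9)
  // One task is run per value
  multiple_by_10(integers_ch)
  // Print each value in the output Channel
  multiple_by_10.out.view()
}
```

Because each value is processed as a separate task, the nine results are not guaranteed to be printed in ascending order.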

2.3.2 Tuples & Lists

Tuples & Lists are used within process blocks. There are many ways to create and manipulate them, both inside and outside process blocks.

Confusingly Tuples & Lists are both structured as [value_1,value_2,value_3].

2.3.2.1 Tuples

Tuples contain multiple values, with each value assigned to a separate variable in a process. This allows you to input/output data which should be grouped together. The example below shows how to group sample ids with their paired fastq reads.

process step1 {
  input:
    tuple val(sample_id), path(r1), path(r2)
  output:
    tuple val(sample_id), path("${sample_id}_r1_trimmed.fastq"), path("${sample_id}_r2_trimmed.fastq")
  script:
  """
  trim -i1 $r1 -i2 $r2 \
  -o1 ${sample_id}_r1_trimmed.fastq -o2 ${sample_id}_r2_trimmed.fastq
  """
}

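Grouping values in a Tuple is important because separate Channels are not ordered relative to each other: if the sample ids and the reads travelled in separate Channels, a process could pair a sample id with another sample's reads.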
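
One possible way to build a Channel of such Tuples is the fromFilePairs channel factory with its flat option, sketched below. The data directory and the file naming pattern are hypothetical and would need to match your own files.

```nextflow
workflow {
  // Hypothetical glob: matches files such as data/S1_R1.fastq and data/S1_R2.fastq.
  // flat: true emits each item as [sample_id, r1, r2]
  reads_ch = Channel.fromFilePairs("data/*_R{1,2}.fastq", flat: true)
  // Print each Tuple to check the grouping
  reads_ch.view()
}
```

A Channel shaped like reads_ch matches the input declaration tuple val(sample_id), path(r1), path(r2), so each sample id stays attached to its own pair of reads.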

2.3.2.2 Lists

When a List is used within a script block, all of its values are used together, separated by a space.

Example:

The Channel r1_fastqs_ch contains the List ["S1_R1.fastq","S2_R1.fastq","S3_R1.fastq"]

The truncated Nextflow script below runs fastqc:

process r1_fastqc {
  input:
    path(r1s)
  output:
    .....
  script:
  """
  fastqc -o fastqc_output \
  $r1s
  """
}
workflow {
  r1_fastqc(r1_fastqs_ch)
}

The resulting command would be run as:

fastqc -o fastqc_output \
S1_R1.fastq S2_R1.fastq S3_R1.fastq
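
How r1_fastqs_ch is created is not shown above. One possible approach, sketched here under the assumption that the R1 files sit in a hypothetical data directory, is to gather every file emitted by a Channel into a single List with the collect operator.

```nextflow
workflow {
  // Hypothetical glob; fromPath emits each matching file as a separate value
  // and collect() gathers them all into one List
  r1_fastqs_ch = Channel.fromPath("data/*_R1.fastq").collect()
  // Prints a single List containing all the matching files
  r1_fastqs_ch.view()
}
```

Without collect(), each file would be a separate value in the Channel, so a process using it as input would run once per file rather than once for all files.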

2.3.2.3 Combinations

A Channel can contain multiple Tuples and/or Lists. Additionally, Tuples can contain Lists.
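
As a sketch of such a combination, the fromFilePairs channel factory by default emits Tuples whose second element is a List of files. The process name COUNT_READS, the wc command, and the file naming pattern below are assumptions made for illustration.

```nextflow
process COUNT_READS {
  input:
    // reads is a List holding both files of the pair
    tuple val(sample_id), path(reads)
  output:
    tuple val(sample_id), path("${sample_id}_lines.txt")
  script:
  // $reads expands to both file names separated by a space
  """
  wc -l $reads > ${sample_id}_lines.txt
  """
}

workflow {
  // Hypothetical glob; each item is emitted as [sample_id, [r1, r2]]
  pairs_ch = Channel.fromFilePairs("data/*_R{1,2}.fastq")
  COUNT_READS(pairs_ch)
}
```

Here the Channel contains one Tuple per sample, and each Tuple contains a value and a List.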