Chapter 2 Concepts
2.1 Workflow
Nextflow is used to create workflows. A workflow consists of multiple processes. Each process has input and output.
A process will not start until it has all the input it requires. This matters because a process whose input is the output of other processes will wait until those processes have finished.
In a Nextflow script each process is defined in the main script body (or in a separate main.nf file).
These processes are then called in the workflow.
Example main.nf
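#!/usr/bin/env nextflow

/*
 * Params
 */
params.dir = "."
params.input1 = "input.txt"

/*
 * Processes
 */
process STEP1 {
    input:
    path(input_data)

    output:
    path("output1.txt")

    script:
    // Placeholder command (an assumption, not a real analysis step) so the process can run
    """
    cp $input_data output1.txt
    """
}

process STEP2 {
    input:
    path(input_data)

    output:
    path("output2.txt")

    script:
    // Placeholder command (an assumption, not a real analysis step) so the process can run
    """
    cp $input_data output2.txt
    """
}

/*
 * Workflow
 */
workflow {
    // Set input param to channel
    input1_ch = Channel.fromPath(params.input1)
    // Process 1 (the call must match the process name exactly, including case)
    STEP1(input1_ch)
    // Process 2 starts only once the output of STEP1 is available
    STEP2(STEP1.out)
}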
2.2 Variables
Nextflow variables are used throughout a Nextflow script.
Within the script section of a process block they are denoted by $ (e.g. $sample_id).
Variable names cannot start with numbers.
Bash variables within a script section must be escaped with a \ (e.g. \$var) so that Nextflow leaves them for Bash to evaluate.
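The difference between the two kinds of variable can be seen in a minimal sketch. The process name SAY_HELLO, the sample ids, and the Bash variable n are made up for illustration and are not part of the tutorial's pipeline.

```nextflow
#!/usr/bin/env nextflow

// Hypothetical process used only to illustrate variable handling
process SAY_HELLO {
    input:
    val(sample_id)

    output:
    stdout

    script:
    // $sample_id is a Nextflow variable: it is filled in before the script is run.
    // \$n is a Bash variable: the backslash stops Nextflow from interpreting it,
    // so Bash sees $n and evaluates it when the script runs.
    """
    n=3
    echo "Sample $sample_id has n=\$n"
    """
}

workflow {
    // Illustrative sample ids
    SAY_HELLO(Channel.of("S1", "S2"))
    SAY_HELLO.out.view()
}
```

Without the backslash, Nextflow would try to resolve n as one of its own variables and would normally fail, because no such Nextflow variable is defined.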
2.3 Channels, tuples, & lists
Channels, tuples, and lists are objects that contain multiple objects but work in different ways.
2.3.1 Channels
Channels are specified and used in the workflow section. A Channel contains a number of values. Each value passes through a process separately, and these separate runs can be carried out in parallel.
Example:
- A Channel called integers_ch contains the 9 values 1,2,3,4,5,6,7,8, & 9.
- A process called multiple_by_10 multiplies its input by 10.
- If the Channel integers_ch was used as the input for the process multiple_by_10, the output would be a Channel of 9 values containing the values 10,20,30,40,50,60,70,80, & 90.
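The example above could be written roughly as follows. This is a sketch rather than a definitive implementation: the variable names x and result and the echo command are assumptions added for illustration.

```nextflow
#!/usr/bin/env nextflow

process multiple_by_10 {
    input:
    val(x)

    output:
    val(result)

    script:
    // Compute the output value in Nextflow before the script string
    result = x * 10
    // Illustrative command; the process only needs some script to run
    """
    echo "$x multiplied by 10 is $result"
    """
}

workflow {
    // Channel holding the values 1 to 9
    integers_ch = Channel.of(1, 2, 3, 4, 5, 6, 7, 8, 9)
    // The process runs once per value in the Channel
    multiple_by_10(integers_ch)
    // Print each value in the output Channel
    multiple_by_10.out.view()
}
```

Because each value is processed as a separate task, the values may be printed in a different order from the input.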
2.3.2 Tuples & Lists
Tuples & Lists are used within process blocks. There are many ways to create and manipulate them both inside and outside process blocks.
Confusingly, Tuples & Lists are both structured as [value_1,value_2,value_3].
2.3.2.1 Tuples
Tuples contain multiple values, with each value assigned to a separate variable in a process. This allows you to input/output data which should be grouped together. The example below shows how to group sample ids with their paired fastq reads.
process step1 {
    input:
    tuple val(sample_id), path(r1), path(r2)

    output:
    tuple val(sample_id), path("${sample_id}_r1_trimmed.fastq"), path("${sample_id}_r2_trimmed.fastq")

    script:
    """
    trim -i1 $r1 -i2 $r2 \
        -o1 ${sample_id}_r1_trimmed.fastq -o2 ${sample_id}_r2_trimmed.fastq
    """
}
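This is important because values in separate Channels are not ordered relative to each other. If sample ids and reads were passed in separate Channels, a sample id could be paired with the reads of a different sample.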
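One way to build a Channel of such Tuples is the fromFilePairs channel factory. The sketch below assumes paired fastq files named like S1_R1.fastq and S1_R2.fastq in a directory called data; both the directory and the naming pattern are hypothetical.

```nextflow
#!/usr/bin/env nextflow

workflow {
    // flat: true emits one Tuple per sample in the form [sample_id, r1, r2]
    // "data/*_R{1,2}.fastq" is an assumed location and naming pattern
    reads_ch = Channel.fromFilePairs("data/*_R{1,2}.fastq", flat: true)

    // Print each Tuple to check the grouping
    reads_ch.view()

    // With a process such as step1 defined in the same script,
    // the Channel could then be passed to it:
    // step1(reads_ch)
}
```

Each Tuple keeps a sample id together with its own two read files, so the grouping is preserved whatever order the tasks run in.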
2.3.2.2 Lists
When a List is used within a script block all the values will be used together, with a space between each value.
Example:
The Channel r1_fastqs_ch contains the List ["S1_R1.fastq","S2_R1.fastq","S3_R1.fastq"].
Consider the truncated Nextflow script below, which runs fastqc:
process r1_fastqc {
    input:
    path(r1s)

    output:
    .....

    script:
    """
    fastqc -o fastqc_output \
        $r1s
    """
}

workflow {
    r1_fastqc(r1_fastqs_ch)
}
This would be run as: