General Notes

  • Use a minimum of 3 samples when running a pipeline (even for minimal testing purposes). Some pipelines cannot complete successfully when running only a single sample.
  • Although not a requirement, we advise against using the iGenomes option suggested in a number of pipelines, due to the lack of versioning information linked to each assembly, which may compromise reproducibility to a small degree. Instead, pass in a reference (genome and annotation file(s)) from a source and version suitable to your needs.
  • Some of the tips we describe require the use of a custom config. Custom configs allow you to modify the values the pipeline uses, including passing new command flags to the underlying software. A custom config is passed on the command line using -c ${custom_config}, where -c is the Nextflow option and ${custom_config} is the path to the custom config file.
  • Results from the work directory are copied over to the results directory (typically called results). Deleting the work directory may be desirable to avoid duplication of files; however, deleting this directory also removes the log files (the .command.* files) and prevents you from resuming a pipeline execution (with -resume). If you do not need to resume a pipeline run and would like to remove large files from the work directory while keeping the logs, you may clean this directory with nextflow clean -k -f (we recommend performing a dry-run first with nextflow clean -k -n; see the example below).
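A typical sequence would be (both commands use the flags noted above):

# dry-run: list what would be removed from the work directory (logs are kept via -k)
nextflow clean -k -n

# once satisfied with the dry-run output, perform the actual clean
nextflow clean -k -f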

nf-core - asking for help

We recommend that users join the nf-core Slack group to search for questions across pipeline-specific channels (each pipeline has its own channel, e.g. #rnaseq, #atacseq). You may find the link to join their Slack at https://nf-co.re/join.

In addition, nf-core has other beginner-friendly channels, such as #nostupidquestions, which may be useful for more general questions spanning multiple pipelines. Please follow their guidelines when asking a question, along with the best practices nicely summarized on Stack Overflow: https://stackoverflow.com/help/how-to-ask (e.g., use Slack’s search function to check whether your problem has already been discussed, and provide a synopsis of the issue along with a reproducible example). Please refrain from DMing pipeline developers.

Executor mode for Cheaha profile

We recommend using this profile when processing multiple samples in the pipeline, as it aids parallel processing. If your run contains only a small number of samples, standard local mode (on a compute node) is usually sufficient. Feel free to let us know if additional customization would benefit the community.

Pipeline-specific Notes

nf-core contains a library of pipelines, which can be found at https://nf-co.re/pipelines. Below are notes and tips for the pipelines that U-BDS has tested in depth:

nf-core/rnaseq

Although this is a user choice, we generally recommend enabling Salmon’s --seqBias and --gcBias flags. For versions >= 3.10, users can enable these flags with the following pipeline parameter: extra_salmon_quant_args: "--seqBias --gcBias"
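For example, a minimal params file passed via -params-file might contain (a sketch; combine with your other pipeline parameters as needed):

# params.yml
extra_salmon_quant_args: "--seqBias --gcBias"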

Note for versions < 3.10: a custom config (e.g., custom.config) is required to enable these flags. An example is provided below for version 3.6 (this configuration adds the flags both to the Salmon run with STAR inputs and to the Salmon run where quasi-mapping is performed by Salmon):

// tool params not linked to direct pipeline params. 
process {
    withName: '.*:QUANTIFY_STAR_SALMON:SALMON_QUANT' {
        ext.args = "--seqBias --gcBias"
    }

    withName: '.*:QUANTIFY_SALMON:SALMON_QUANT' {
        ext.args = "--seqBias --gcBias"
    }
}

Note that the syntax for custom parameters in the custom.config may differ from pipeline to pipeline, or even between versions. We recommend following syntax similar to what is present in the modules.config of the pipeline of interest, such as this one for v3.6. If you have questions about this, please feel free to ask during our data science office hours.

Tips linked to reference

  • The transcript_fasta is not a required pipeline parameter. If it is not provided by the user, it will be generated by the process called RSEM_PREPAREREFERENCE_TRANSCRIPTS (which may be renamed in the near future, as this process is executed whenever the user does not provide a transcriptome, regardless of the choice of --aligner). For more information, reference here
  • If the user provides a file to transcript_fasta (which is used at the Salmon steps), it should be a transcriptome file generated with gffread, as opposed to one downloaded from GENCODE or Ensembl. Issues with Ensembl transcriptomes derive from the fact that the transcriptome files (even after concatenating coding and non-coding RNA) contain fewer transcripts than the GTF file. The issues with GENCODE files are linked to a pipeline issue that prevents it from correctly processing the off-the-shelf transcript fasta due to the sequence name format (despite enabling the --gencode parameter). For more information, reference here and here
    • It should be noted that the currently recommended way to use reference files is to provide --fasta and --gtf along with --save_reference the first time you run the pipeline, and then let the pipeline generate all of the downstream artifacts (such as the transcriptome and indices) to be recycled thereafter by including them as params in the yml (see the sketch below). This is the approach recommended by the authors per the nf-core/rnaseq Slack channel. For more information, reference here
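A sketch of this workflow is shown below. The --fasta, --gtf, --save_reference, and transcript_fasta parameters appear above; the saved reference paths in the comments are illustrative and vary by pipeline version, so check the pipeline documentation for your release:

# first run: provide the genome fasta and GTF, and save the generated references
nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --outdir results \
    --fasta Mus_musculus.GRCm38.dna.primary_assembly.fa \
    --gtf Mus_musculus.GRCm38.102.gtf \
    --save_reference \
    -profile singularity

# later runs: recycle the saved references by adding them to your params yml, e.g.:
#   transcript_fasta: "results/genome/genome.transcripts.fa"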

gffread example:

# index the genome fasta (required by gffread)
samtools faidx Mus_musculus.GRCm38.dna.primary_assembly.fa

# extract the transcript sequences defined in the GTF from the genome
gffread -w Mus_musculus.GRCm38_GTF_matched_transcripts.fa -g ./Mus_musculus.GRCm38.dna.primary_assembly.fa Mus_musculus.GRCm38.102.gtf

From the example above, the newly generated transcriptome Mus_musculus.GRCm38_GTF_matched_transcripts.fa would be passed on to the pipeline.

We recommend creating a conda environment with both dependencies (samtools and gffread) and recording their versions.
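A sketch of creating such an environment (both package names are available from the bioconda channel; the environment name is a placeholder, and you may wish to pin specific versions for your project):

conda create -n gffread_env -c conda-forge -c bioconda samtools gffread
conda activate gffread_env

# record the versions used
samtools --version
gffread --version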

gffread documentation can be found from: http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread

nf-core/nanoseq

  • The version of NanoPlot used by this pipeline (1.38) causes the pipeline to crash. Per the author, the issue is with a package used to convert html files into png files; nanoseq, however, expects the png files to exist, which causes the crash. To get around this issue, you can create a custom config and override the container that nanoseq would pull for NanoPlot. Values specified in configs passed via the command line take priority over the defaults specified in the pipeline proper.

container override example:

process {
    withName: NANOPLOT {
        container = 'https://depot.galaxyproject.org/singularity/nanoplot:1.32.1--py_0'
    }
}

Place the above lines in a file called custom.config. When the pipeline is run, add the command-line option -c path/to/custom.config to the nextflow command, where path/to/custom.config is the path to the custom config file.
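For example, the full command might resemble the following (the profile, input, and output options are placeholders for your own setup):

nextflow run nf-core/nanoseq \
    --input samplesheet.csv \
    --outdir results \
    -profile singularity \
    -c path/to/custom.config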

nf-core/fetchngs

  • The input file passed in via --input must end in .txt
  • It is possible that curl may exceed the default timeout set in the pipeline. To work around the curl time limit, add the following lines to a custom config:
process {
    withName: SRA_FASTQ_FTP {
        ext.args = '--retry 5  --continue-at -'
    }
}
  • If curl does not work, you can set the pipeline to use sra-toolkit to download the data instead. To do so, place the following lines into a params.yml file and pass that file to the pipeline by adding -params-file params.yml to the nextflow run command:
enable_conda: true
force_sra_downloads: true
  • Prefetch has a default file-size limit (20GB); however, the files you download may be larger than this default. To increase the limit, place the following lines into a custom.conf file and pass it to the pipeline by adding -c custom.conf to the nextflow run command. Note that this example sets the max size to 50GB; tweak this number to accommodate your downloads (a full example command follows this list):
process {
    withName: SRATOOLS_PREFETCH {
        ext.args = '--max-size 50G'
    }
}
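
A sketch of a full invocation combining the options above (the ids file name is illustrative, and -params-file and -c are only needed if you are using the corresponding files):

nextflow run nf-core/fetchngs \
    --input sra_ids.txt \
    --outdir results \
    -params-file params.yml \
    -c custom.conf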

nf-core/atacseq

  • Extra precaution should be taken when running versions 1.2.1 and below in a cluster environment. MACS2 uses the /tmp dir on the cluster, which is apt to run out of space while the process is running, causing the MACS2 output files to be empty (/tmp is, by design, a small file system; it would be preferable to use user scratch or user data). Since these versions of the pipeline are DSL1, there is no way to configure this process to store temporary files in a different directory without modifying the source code and executing it appropriately (this goes beyond the scope of this guide; if more information is needed, we recommend stopping by our data science office hours). As a result, we recommend running the aforementioned versions of this pipeline on a local/personal computer rather than on the cluster.
    • For reference, the pipeline will fail with the error below when this occurs. It should be noted that this error can have other causes; however, if you are running on a cluster, it is worth running in a non-cluster environment as well to confirm the cause.
    Error in read.table(PeakFiles[idx], sep = "\t", header = FALSE) : 
      no lines available in input
    Execution halted

nf-core/scrnaseq

  • We recommend only running versions from v2.0.0 onward of this pipeline. v2.0.0 is when the pipeline converted to DSL2 and resolved a number of issues we had noted in previous versions of the pipeline.
  • A more general note: this pipeline does not produce QC results such as FastQC results, mapping stats, or single-cell stats, nor do the generated MultiQC reports contain much useful information. Any QC information the user wishes to see will need to be generated manually, separately from the pipeline (see the example after this list).
    • An exception to this note is alevin_qc, which is a collection of various stats and graphs about your dataset. These QC results are only produced when using Alevin as the aligner.
  • Sequencing platform: Illumina
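As an example of running one of these QC steps manually, FastQC can be executed directly on the input fastq files (the file names are placeholders; this is a sketch rather than a pipeline-endorsed workflow):

# create an output directory and run FastQC on the raw reads
mkdir -p fastqc_results
fastqc -o fastqc_results sample_R1.fastq.gz sample_R2.fastq.gz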

nf-core/scnanoseq

  • scnanoseq is an nf-core pipeline for single-cell/nuclei analysis of data derived from Oxford Nanopore (long-read sequencing). This pipeline is currently in development and nearing its first release. We note that the original authors and maintainers of this pipeline are the same authors of this guide (Austyn Trull and Lara Ianov). Given that we are part of the UAB community, we are happy to discuss the pipeline during office hours or through a formal meeting. However, for consistency with other nf-core pipelines, please send questions about scnanoseq to the dedicated nf-core pipeline channel, which can be found here. Keeping all questions in a central location will minimize redundancy of effort on our part in aiding the community. We appreciate your understanding and feedback.