It's Monday morning, you've just arrived at work, and you're ready to crush the work day. You log onto your computer and navigate to your team's "central hub" where all code is stored and shared for collaboration on various projects (let's use GitHub as an example).
The project you're currently working on involves analyzing survey data collected from your company to identify how to increase productivity on data-analysis-based projects. The goal is to get projects completed sooner, but even though there are developed and maintained pipelines/workflows to help, your company still has not seen a significant improvement.
To assess why work performance is low despite having standardized pipelines, you and your team create a ranked-question survey (e.g., very good, okay, not good) that addresses any knowledge gaps the employees have in regard to using the pipelines.
You and your team are hoping to identify trainings the employees will need to undergo to improve their productivity when using these pipelines. All data has been collected, and you and your team are processing/cleaning the data before analyzing it. You notice some of the data has been deleted, but you're not sure who made that change or exactly what part of the data was deleted. Without version control, it will be a tedious, long, and frustrating process to fix everything, which is why Git is important.
Git is a version control software that allows you to track and manage changes to files, so let's go over how to get started using Git!
Go to https://git-scm.com/download/win and install Git for Windows.

# Verify the installation
git --version

# Tell Git who you are (this information is attached to your commits)
git config --global user.name "Your name"
git config --global user.email "you@example.com"

# Navigate to where you want your project to live
cd "C:/Users/YourName"

# Create a project folder and turn it into a Git repository
mkdir my-project && cd my-project
git init

# Stage a file and record your first snapshot of it
git add RData_file.R
git commit -m "Initial commit"
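Remember the scenario above: data was deleted and nobody knows who did it or what exactly changed. Once a project lives in a Git repository, those questions have direct answers. The sketch below sets up a scratch repository with two commits so there is history to inspect; the file contents and commit messages are made up for illustration.

```shell
# Set up a scratch repository with two commits so there is history to inspect
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Your name"
git config user.email "you@example.com"
echo "survey_score <- 5" > RData_file.R
git add RData_file.R && git commit -qm "Initial commit"
echo "survey_score <- NULL" > RData_file.R
git add RData_file.R && git commit -qm "Update analysis"

# Who changed what, and when?
git log --oneline

# Exactly which lines changed in the latest commit?
git diff HEAD~1 HEAD

# Who last touched each line of the file?
git blame RData_file.R

# Undo: restore the file as it was one commit ago
git checkout HEAD~1 -- RData_file.R
cat RData_file.R   # the original line is back
```

This is the payoff of version control: `git log` shows who committed, `git diff` shows exactly what changed, and `git checkout <commit> -- <file>` brings a file back without any tedious manual reconstruction.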
Docker is a platform that enables the creation of containers, which are lightweight, isolated environments bundling applications, their dependencies, and custom scripts. This isolation ensures consistent deployment and operation across various systems, including virtual machines, local servers, and cloud infrastructures. Docker Desktop is a GUI application that can be installed to manage and interact with Docker containers and images.
# Use an Ubuntu base image
FROM ubuntu:20.04

# Install R and bioinformatics tools (r-base is needed for the R commands below)
RUN apt-get update && apt-get install -y \
    r-base \
    bwa \
    samtools \
    fastqc \
    && apt-get clean

# Set working directory inside the container (created automatically if missing)
WORKDIR /home/scripts

# Copy custom scripts into the container
COPY ./scripts /home/scripts

# Install Bioconductor and other packages
RUN R -e "install.packages('BiocManager')" \
    && R -e "BiocManager::install('DESeq2')" \
    && R -e "BiocManager::install([other_packages])"

# Set default command to run when the container starts
CMD ["R"]
# Build Docker image
docker build -t my-r-container .
# Run Docker image
docker run -it my-r-container
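Two common follow-ups, shown as a sketch: the host `./data` folder and the container path `/home/data` are example choices here, not part of the Dockerfile above.

```shell
# Confirm the image was built
docker images

# Run the container with a host folder mounted inside it, so scripts in the
# container can read your data and write results back to the host
docker run -it -v "$(pwd)/data:/home/data" my-r-container
```

Mounting a volume this way is how the isolated container gets access to files on your machine without baking the data into the image itself.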
Variants (also known as mutations) are changes in a DNA sequence when compared to a reference sequence. Mutations can be pathogenic and lead to health complications ranging from mild issues to severe genetic disorders. Variant calling is essential in identifying pathogenic variants by using computational methods to detect differences in a sample DNA sequence compared to a reference genome.
After identifying the variants that are present, the next step is to interpret them, a process known as variant classification. Variant classification allows for the biological or clinical interpretation of identified variants to determine whether a variant is benign, pathogenic, or of unknown significance.
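To make the variant-calling step concrete, here is a minimal sketch of what it can look like on the command line, using bwa and samtools (the tools installed in the Dockerfile above) plus bcftools, which is an additional assumption here; all file names are placeholders.

```shell
# Index the reference genome, then align paired-end reads to it
bwa index reference.fa
bwa mem reference.fa sample_R1.fastq sample_R2.fastq | \
    samtools sort -o sample.sorted.bam
samtools index sample.sorted.bam

# Compare the aligned sample against the reference and emit variants as a VCF
bcftools mpileup -f reference.fa sample.sorted.bam | \
    bcftools call -mv -o sample.variants.vcf
```

The resulting VCF file lists the positions where the sample differs from the reference genome, which is the input for the classification step described above.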
There is a wide variety of tools that can be used for variant calling and classification. For this tutorial, I will be explaining how to perform variant calling and variant classification using Nextflow. Nextflow is a workflow management system designed for reproducible and scalable data analysis pipelines, and a collection of bioinformatics workflows has been built with it covering many common analyses.
# Nextflow installation: download the launcher script
curl -s https://get.nextflow.io | bash

# Make the downloaded file executable
chmod +x nextflow

# Create a folder named 'bin' inside your home directory
mkdir -p ~/bin/

# Move the Nextflow file into the 'bin' folder
mv nextflow ~/bin/
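To finish the installation, make sure `~/bin` is on your PATH and verify that Nextflow runs; `nextflow run hello` launches Nextflow's small built-in demo pipeline as an end-to-end check.

```shell
# Add ~/bin to your PATH for this session
# (append this line to ~/.bashrc to make it permanent)
export PATH="$HOME/bin:$PATH"

# Verify the installation
nextflow -version

# Optional: run the 'hello' demo pipeline as a quick end-to-end check
nextflow run hello
```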
// General structure of a nextflow.config file

// 1. Global workflow parameters
params {
    input = '/home/data/*.fastq'
    outdir = '/home/results'
    genome = 'hg38'
    threads = 4
    email = 'your.email@domain.com'
}

// 2. Executor settings
process.executor = 'local'

// 3. Resource defaults for all processes
process {
    cpus = 2
    memory = '4 GB'
    time = '2h'
    withLabel:big_mem {
        memory = '32 GB'
        cpus = 8
    }
}

// or, for variant calling workflows, use:
// 3.1 Container for variant calling processes
process {
    container = 'broadinstitute/gatk:4.2.0.0'
}

// 4. Docker or Singularity configuration
docker {
    enabled = true
    runOptions = '-u $(id -u):$(id -g)'
}

// 5. Custom config includes (optional)
profiles {
    profile_name_1 {
        // configuration settings
    }
    profile_name_2 {
        // configuration settings
    }
    // ...
}
// Each profile is like a named preset for running the pipeline in different environments or configurations

// 6. Defining environment variables for the pipeline to run with
env {
    PYTHONNOUSERSITE = 1
    R_PROFILE_USER = "/.Rprofile"
    R_ENVIRON_USER = "/.Renviron"
    JULIA_DEPOT_PATH = "/usr/local/share/julia"
}

// 7. Define the default shell and shell options used by all process blocks
process.shell = [
    "bash",
    "-C",            // refuse to overwrite existing files via redirection
    "-e",            // exit immediately on the first error
    "-u",            // treat unset variables as errors
    "-o", "pipefail" // fail a pipeline if any command in it fails
]
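With the config in place, a pipeline can be launched from the same directory. The script name `main.nf` is a placeholder; the parameter and profile names below match the example config above.

```shell
# Run the pipeline; nextflow.config in the current directory is loaded automatically
nextflow run main.nf

# Override a workflow parameter (double dash = params defined in the config)
nextflow run main.nf --outdir /home/other_results

# Select a named profile (single dash = options for Nextflow itself)
nextflow run main.nf -profile profile_name_1
```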