It's Monday morning, you've just arrived at work, and you're ready to crush the work day. You log onto your computer and navigate to your team's "central hub" where all code is stored and shared for collaboration on various projects (let's use GitHub as an example).
The project you're currently working on involves analyzing survey data collected from your company to identify how to increase productivity on data-analysis-based projects. The goal is to get projects completed sooner, but even though there are developed and maintained pipelines/workflows to help, your company still has not seen a significant improvement.
To assess why work performance is low despite having standardized pipelines, you and your team create a ranked-question survey (e.g., very good, okay, not good) that addresses any knowledge gaps the employees have in regard to using the pipelines.
You and your team are hoping to identify trainings the employees will need to undergo to improve their productivity when using these pipelines. All data has been collected, and you and your team are processing/cleaning the data before analyzing it. You notice some of the data has been deleted, but you're not sure who made that change or exactly what part of the data was deleted. Without version control, it will be a tedious, long, and frustrating process to fix everything, which is why Git is important.
Git is a version control software that allows you to track and manage changes to files, so let's go over how to get started using Git!
Go to https://git-scm.com/download/win and install Git for Windows.

# Verify the installation
git --version

# Tell Git who you are (this information is attached to your commits)
git config --global user.name "Your name"
git config --global user.email "you@example.com"

# Navigate to where you want your project to live
cd "C:/Users/YourName"

# Create a project folder and turn it into a Git repository
mkdir my-project && cd my-project
git init

# Stage a file and record your first snapshot of it
git add RData_file.R
git commit -m "Initial commit"
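Remember the scenario above: data was deleted and nobody knows who did it or what exactly changed. Once a project lives in a Git repository, those questions have direct answers. The sketch below sets up a scratch repository with two commits so there is history to inspect; the file contents and commit messages are made up for illustration.

```shell
# Set up a scratch repository with two commits so there is history to inspect
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Your name"
git config user.email "you@example.com"
echo "survey_score <- 5" > RData_file.R
git add RData_file.R && git commit -qm "Initial commit"
echo "survey_score <- NULL" > RData_file.R
git add RData_file.R && git commit -qm "Update analysis"

# Who changed what, and when?
git log --oneline

# Exactly which lines changed in the latest commit?
git diff HEAD~1 HEAD

# Who last touched each line of the file?
git blame RData_file.R

# Undo: restore the file as it was one commit ago
git checkout HEAD~1 -- RData_file.R
cat RData_file.R   # the original line is back
```

This is the payoff of version control: `git log` shows who committed, `git diff` shows exactly what changed, and `git checkout <commit> -- <file>` brings a file back without any tedious manual reconstruction.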
Docker is a platform that enables the creation of containers, which are lightweight, isolated environments bundling applications, their dependencies, and custom scripts. This isolation ensures consistent deployment and operation across various systems, including virtual machines, local servers, and cloud infrastructures. Docker Desktop is a GUI application that can be installed to manage and interact with Docker containers and images.
# Use an Ubuntu base image
FROM ubuntu:20.04

# Install R and bioinformatics tools (r-base is needed for the R commands below)
RUN apt-get update && apt-get install -y \
    r-base \
    bwa \
    samtools \
    fastqc \
    && apt-get clean

# Set working directory inside the container (created automatically if missing)
WORKDIR /home/scripts

# Copy custom scripts into the container
COPY ./scripts /home/scripts

# Install Bioconductor and other packages
RUN R -e "install.packages('BiocManager')" \
    && R -e "BiocManager::install('DESeq2')" \
    && R -e "BiocManager::install([other_packages])"

# Set default command to run when the container starts
CMD ["R"]
# Build Docker image
docker build -t my-r-container .
# Run Docker image
docker run -it my-r-container
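Two common follow-ups, shown as a sketch: the host `./data` folder and the container path `/home/data` are example choices here, not part of the Dockerfile above.

```shell
# Confirm the image was built
docker images

# Run the container with a host folder mounted inside it, so scripts in the
# container can read your data and write results back to the host
docker run -it -v "$(pwd)/data:/home/data" my-r-container
```

Mounting a volume this way is how the isolated container gets access to files on your machine without baking the data into the image itself.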
Variants (also known as mutations) are changes in a DNA sequence when compared to a reference sequence. Mutations can be pathogenic and lead to health complications ranging from mild issues to severe genetic disorders. Variant calling is essential in identifying pathogenic variants by using computational methods to detect differences in a sample DNA sequence compared to a reference genome.
After identifying the variants that are present, the next step is to interpret them, a process known as variant classification. Variant classification allows for the biological or clinical interpretation of identified variants to determine whether a variant is benign, pathogenic, or of unknown significance.
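To make the variant-calling step concrete, here is a minimal sketch of what it can look like on the command line, using bwa and samtools (the tools installed in the Dockerfile above) plus bcftools, which is an additional assumption here; all file names are placeholders.

```shell
# Index the reference genome, then align paired-end reads to it
bwa index reference.fa
bwa mem reference.fa sample_R1.fastq sample_R2.fastq | \
    samtools sort -o sample.sorted.bam
samtools index sample.sorted.bam

# Compare the aligned sample against the reference and emit variants as a VCF
bcftools mpileup -f reference.fa sample.sorted.bam | \
    bcftools call -mv -o sample.variants.vcf
```

The resulting VCF file lists the positions where the sample differs from the reference genome, which is the input for the classification step described above.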
There is a wide variety of tools that can be used for variant calling and classification. For this tutorial, I will be explaining how to perform variant calling and variant classification using Nextflow. Nextflow is a workflow management system designed for reproducible and scalable data analysis pipelines, and a collection of bioinformatics workflows has been built with it covering many common analyses.
# Nextflow installation: download the launcher script
curl -s https://get.nextflow.io | bash

# Make the downloaded file executable
chmod +x nextflow

# Create a folder named 'bin' inside your home directory
mkdir -p ~/bin/

# Move the Nextflow file into the 'bin' folder
mv nextflow ~/bin/
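To finish the installation, make sure `~/bin` is on your PATH and verify that Nextflow runs; `nextflow run hello` launches Nextflow's small built-in demo pipeline as an end-to-end check.

```shell
# Add ~/bin to your PATH for this session
# (append this line to ~/.bashrc to make it permanent)
export PATH="$HOME/bin:$PATH"

# Verify the installation
nextflow -version

# Optional: run the 'hello' demo pipeline as a quick end-to-end check
nextflow run hello
```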
// General structure of a nextflow.config file

// 1. Global workflow parameters
params {
    input = '/home/data/*.fastq'
    outdir = '/home/results'
    genome = 'hg38'
    threads = 4
    email = 'your.email@domain.com'
}

// 2. Executor settings
process.executor = 'local'

// 3. Resource defaults for all processes
process {
    cpus = 2
    memory = '4 GB'
    time = '2h'
    withLabel:big_mem {
        memory = '32 GB'
        cpus = 8
    }
}

// or, for variant calling workflows, use:
// 3.1 Container for variant calling processes
process {
    container = 'broadinstitute/gatk:4.2.0.0'
}

// 4. Docker or Singularity configuration
docker {
    enabled = true
    runOptions = '-u $(id -u):$(id -g)'
}

// 5. Custom config includes (optional)
profiles {
    profile_name_1 {
        // configuration settings
    }
    profile_name_2 {
        // configuration settings
    }
    // ...
}
// Each profile is like a named preset for running the pipeline in different environments or configurations

// 6. Defining environment variables for the pipeline to run with
env {
    PYTHONNOUSERSITE = 1
    R_PROFILE_USER = "/.Rprofile"
    R_ENVIRON_USER = "/.Renviron"
    JULIA_DEPOT_PATH = "/usr/local/share/julia"
}

// 7. Define the default shell and shell options used by all process blocks
process.shell = [
    "bash",
    "-C",            // refuse to overwrite existing files via redirection
    "-e",            // exit immediately on the first error
    "-u",            // treat unset variables as errors
    "-o", "pipefail" // fail a pipeline if any command in it fails
]
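With the config in place, a pipeline can be launched from the same directory. The script name `main.nf` is a placeholder; the parameter and profile names below match the example config above.

```shell
# Run the pipeline; nextflow.config in the current directory is loaded automatically
nextflow run main.nf

# Override a workflow parameter (double dash = params defined in the config)
nextflow run main.nf --outdir /home/other_results

# Select a named profile (single dash = options for Nextflow itself)
nextflow run main.nf -profile profile_name_1
```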