Using Git - version control

Introduction to Git

Last updated on Tue, June 10, 2025
Source: git-scm.com

It's Monday morning. You have just arrived at work, ready to crush the workday. You log on to your computer and navigate to your team's "central hub," where all code is stored and shared for collaboration on various projects (let's use GitHub as an example).

The project you're currently working on involves analyzing survey data collected from your company to identify how to increase productivity on data-analysis projects. The goal is to complete projects sooner. Although pipelines/workflows have been developed and maintained to help, your company still has not seen a significant improvement.

To assess why work performance is low despite having standardized pipelines, you and your team create a survey of ranked questions (e.g., very good, okay, not good) that probes the employees' knowledge gaps in using the pipelines.

You and your team hope to identify the training employees will need to improve their productivity when using these pipelines. All data has been collected, and your team is processing and cleaning it before analysis. You notice that some of the data has been deleted, but you're not sure who made the change or exactly which part of the data was removed. Without version control, fixing this would be a tedious, long, and frustrating process. This is why Git is important.

Git is version control software that lets you track and manage changes to files. Let's go over how to get started with Git!

Download the Installer

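Go to https://git-scm.com/download/win and download the installer.

  • Run the setup wizard (double-click the .exe).
  • Add Git to the PATH environment variable (the installer usually offers to do this, so the next three steps are only needed if it did not):
  • Open "Environment Variables" (System Properties > Advanced).
  • Under "System variables", select "Path" and click "Edit".
  • Click "New" and add the path to Git, for example:
    C:\Program Files\Git\cmd
  • Verify the installation:
    git --version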

Getting started with Git

  • First-time configuration:
    git config --global user.name "Your name"
    git config --global user.email "you@example.com"
    
  • Use cd to navigate to the desired directory:
    cd "C:/Program Files/User"
  • Create a project folder and move into it:
    mkdir my-project && cd my-project
  • Initialize the repository:
    git init
  • Stage a file:
    git add RData_file.R
  • Commit the staged file with a message:
    git commit -m "Initial commit"
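
Returning to the Monday-morning scenario, the sketch below shows one way Git can answer "who deleted the data, and what was deleted?". It is only an illustration: the file name survey.csv, its contents, and the two authors (Alice and Bob) are made up for this example, and everything runs in a throwaway directory.

```shell
# Work in a temporary directory so no real project is touched
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Helper (hypothetical): commit with an identity supplied per command,
# so the global Git configuration is left alone
commit_as() {  # usage: commit_as NAME EMAIL MESSAGE
  git -c user.name="$1" -c user.email="$2" commit --quiet -m "$3"
}

# Alice commits the full (made-up) survey data
printf 'id,rating\n1,very good\n2,okay\n3,not good\n' > survey.csv
git add survey.csv
commit_as "Alice" "alice@example.com" "Add survey responses"

# Bob removes a row while "cleaning" the data
printf 'id,rating\n1,very good\n3,not good\n' > survey.csv
git add survey.csv
commit_as "Bob" "bob@example.com" "Clean survey data"

# Who made the most recent change to the file?
last_author="$(git log -1 --format='%an' -- survey.csv)"

# Which rows did that change remove? (lines starting with a single '-')
deleted_rows="$(git diff HEAD~1 HEAD -- survey.csv | grep '^-[^-]' | cut -c2-)"

echo "Last change by: $last_author"   # Last change by: Bob
echo "Deleted rows: $deleted_rows"    # Deleted rows: 2,okay

# Bring back the file as it was before the deletion
git checkout HEAD~1 -- survey.csv
```

In a real project you would run only the last few commands (git log, git diff, git checkout) inside your existing repository, replacing survey.csv with the affected file.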

Docker - Containerization

Introduction to Docker

Last updated on Mon, June 16, 2025
Source: Docker Docs.com

What is Docker?

Docker is a platform that enables the creation of containers, which are lightweight, isolated environments bundling applications, their dependencies, and custom scripts. This isolation ensures consistent deployment and operation across various systems, including virtual machines, local servers, and cloud infrastructures. Docker Desktop is a GUI application that you can install to manage and interact with Docker containers and images.

Installation

  • Follow this link to install Docker Desktop: https://www.docker.com/products/docker-desktop/

Get Started with Docker

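  • Create a Dockerfile:
    # Use an Ubuntu base image (image names must be lowercase)
    FROM ubuntu:20.04
    
    # Install bioinformatics tools and R (r-base is needed for the R steps below).
    # DEBIAN_FRONTEND=noninteractive prevents prompts from pausing the build.
    RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
        bwa \
        samtools \
        fastqc \
        r-base \
        && apt-get clean
    
    # Set working directory inside the container (created if it does not exist)
    WORKDIR /home/scripts
    
    # Copy custom scripts into container
    COPY ./scripts /home/scripts
    
    # Install Bioconductor and other packages.
    # Some R packages compile from source and may need extra system libraries.
    RUN R -e "install.packages('BiocManager', repos = 'https://cloud.r-project.org')" \
        && R -e "BiocManager::install('DESeq2')"
    # Add further packages the same way: BiocManager::install([other_packages])
    
    # Set default command to run when the container starts
    CMD ["R"]
    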
    
  • Build Docker image (type in command-line):
    # Build Docker image
    docker build -t my-r-container .
    
  • Run Docker image (type in command-line):
    # Run Docker image
    docker run -it my-r-container
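
After building, it can be useful to confirm that the tools named in the Dockerfile are actually available inside the image. The following is a minimal sketch, assuming the image was tagged my-r-container as in the build step. The variable names are illustrative, and the checks are skipped if Docker or the image is not available.

```shell
image="my-r-container"   # image name used in the build step above

if ! command -v docker >/dev/null 2>&1; then
  check_status="skipped: docker is not installed"
elif ! docker image inspect "$image" >/dev/null 2>&1; then
  check_status="skipped: image $image not available (not built, or Docker is not running)"
else
  # --rm removes each temporary container as soon as its command finishes
  docker run --rm "$image" samtools --version
  docker run --rm "$image" fastqc --version
  docker run --rm "$image" R --version
  check_status="checked"
fi

echo "$check_status"
```

Each docker run call starts a fresh container, overrides the default command (R) with the given tool, and exits. If any tool is missing, its command prints an error instead of a version.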
    
Variant Calling and Classification using Nextflow

    Last updated on Tue, July 1, 2025
    Citation: ACMG Standards and Guidelines

    Variant Calling and Variant Classification

    Variant Calling

    Variants (also known as mutations) are changes in a DNA sequence when compared to a reference sequence. Mutations can be pathogenic and lead to health complications ranging from mild issues to severe genetic disorders. Variant calling is essential in identifying pathogenic variants by using computational methods to detect differences in a sample DNA sequence compared to a reference genome.

    Variant Classification

    After identifying the variants that are present, the next step is to interpret them, which is known as variant classification. Variant classification provides the biological or clinical interpretation of identified variants to determine whether a variant is benign, pathogenic, or of uncertain significance.

    Nextflow

    There is a wide variety of tools that can be used for variant calling and classification. In this tutorial, I will explain how to perform variant calling and variant classification using Nextflow. Nextflow is a workflow management system designed for reproducible and scalable data analysis pipelines. Using Nextflow, a collection of bioinformatics workflows has been built that includes analyses for:

    • Variant Calling and Classification
    • NGS Data Analysis
    • Metagenomics

    Getting Started with Nextflow

  • Run this in a Bash terminal (on Windows, use Windows Subsystem for Linux (WSL)). Nextflow also requires Java:
    # Nextflow installation:
    curl -s https://get.nextflow.io | bash
    
    # Make the downloaded launcher executable
    chmod +x nextflow
    
    # Create a folder named 'bin' inside your home directory (if it does not exist)
    mkdir -p ~/bin/
    
    # Move the Nextflow file into the folder named 'bin'
    mv nextflow ~/bin/
  • Create a nextflow.config file (this can be done in any plain-text editor):
    // General structure of a nextflow.config file
    // 1. Global workflow parameters
    params {
      input = 'home/data/*.fastq'
      outdir = 'home/results'
      genome = 'hg38'
      threads = 4
      email = 'your.email@domain.com'
    }
    
    // 2. Executor settings
    process.executor = 'local'
    
    // 3. Resource defaults for all processes
    process {
      cpus = 2
      memory = '4GB'
      time = '2h'
      withLabel:big_mem {
        memory = '32GB'
        cpus = 8
      }
    }
    
    // or for variant calling workflows use:
    
    // 3.1 Processes for variant calling
    process {
      container = 'broadinstitute/gatk:4.2.0.0'
    }
    
    // 4. Docker or Singularity configuration
    docker {
      enabled = true
      runOptions = '-u $(id -u):$(id -g)'
    }
    
    // 5. Configuration profiles (optional)
    profiles {
      profile_name_1 {
        // configuration settings
      }
      profile_name_2 {
        // configuration settings
      }
      // ...
    }
    
    // Each profile is a named preset for running the pipeline in a different environment or configuration
    
    // 6. Environment variables exported to the environment in which pipeline tasks run
    env {
      PYTHONNOUSERSITE = 1
      R_PROFILE_USER = "/.Rprofile"
      R_ENVIRON_USER = "/.Renviron"
      JULIA_DEPOT_PATH = "/usr/local/share/julia"
    }
    
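    // 7. Define the default shell and shell options used by all process blocks
    process.shell = [
      "bash",
      "-C",        // do not let output redirection overwrite existing files
      "-e",        // exit if a command returns a non-zero status
      "-u",        // treat unset variables as an error
      "-o",        // set the option named on the next line
      "pipefail"   // a pipeline fails if any command in it fails
    ]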
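
With the launcher in ~/bin and a nextflow.config in place, the remaining step is to make sure the shell can find Nextflow and that the configuration is picked up. The sketch below is one way to check this, assuming Nextflow was moved to ~/bin as in the installation step. The variable names are illustrative, and main.nf in the final comment is a hypothetical pipeline script.

```shell
# Put ~/bin on PATH for this session if it is not there already
bin_dir="$HOME/bin"
case ":$PATH:" in
  *":$bin_dir:"*) ;;                          # already present, nothing to do
  *) PATH="$bin_dir:$PATH"; export PATH ;;
esac

if command -v nextflow >/dev/null 2>&1; then
  nextflow -version   # confirm the launcher runs (it needs Java)
  nextflow config     # print the configuration resolved from ./nextflow.config
  nf_status="found"
else
  nf_status="missing"
  echo "nextflow not found on PATH; check that it was moved to $bin_dir"
fi

# To launch a pipeline using one of the profiles defined in the config:
#   nextflow run main.nf -profile profile_name_1
```

The PATH change above lasts only for the current terminal session. To make it permanent, the same export line can be added to a shell startup file such as ~/.bashrc.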