Bisulfite Treated Sequence Read Alignment

Disclaimer: This tutorial was originally written on May 30, 2019.

Introduction

In this tutorial, we show you how to download raw Bisulfite-seq DNA methylation sequence data from the European instance of the SRA, which can be accessed via https://www.ebi.ac.uk/ena. On the European Nucleotide Archive (ENA) website, the sequencing reads are directly available in FASTQ or SRA formats, which will be explained below.

For this tutorial, we need FastQC, multiQC, the SRA toolkit, a powerful suite of tools designed to interact with SAM and BAM files called samtools, and the Bismark aligner to align the Bisulfite-seq reads to the reference genome.

All the above mentioned tools need to be installed and referenced in the environment variable PATH.

We set our working directory to the tuto folder created in our first tutorial.

setwd('./tuto')

Let’s first check if these requirements are met:

fastqc --version

multiqc --version

fastq-dump --version

samtools --version

bismark --version

If at least one of the above commands produces an error, please, check your installation of the tool and try again.

Now let’s create a working directory for our bisulfite-seq DNA methylation project (let’s call it tuto).

mkdir -p tuto && cd tuto

1. Data Download

To download a set of SRA files:

Go to https://www.ebi.ac.uk/ena.
Search for the accession number of the project, e.g., SRP041828 (should be indicated in the published paper).
There are several ways to start the download, here we show you how to do it through the command line interface on GNU/Linux.
- copy the link’s address of the “SRA files” column (right mouse click), go to the command line, move to the target directory, type: wget < link copied from the ENA website >
- If there are many samples as it is the case for the project referenced here (accession number: SRP041828), you can download the summary of the sample information from ENA by right-clicking on “TEXT” or “TSV” and copying the link location.

Now let’s download the file from the link copied earlier.

wget -O all_samples.txt "https://www.ebi.ac.uk/ena/portal/api/filereport?\
accession=PRJNA274258&result=read_run&fields=study_accession,sample_accession,\
experiment_accession,run_accession,tax_id,scientific_name,fastq_ftp,\
submitted_ftp,sra_ftp&format=tsv&download=true&limit=0"

You may try to open the all_samples.txt file with LibreOffice or Excel to view it. For this project, we are only interested in the paired-end first 4 Bisulfite-seq samples (2 normal cells samples vs 2 breast cancer cells samples). Since the first line in all_samples.txt contains the header, we will generate another file containing only the first 4 lines of all_samples.txt with the following command:

sed '1d' all_samples.txt > all_samples2.txt
head -4 all_samples2.txt > samples.txt
rm all_samples2.txt

Now, let’s create a new folder for our SRA files.

mkdir -p sra_files

Since the decommission of the SRA Fuse/FTP site on December 1, 2019, the prefetch utility included in the SRA toolkit is recommended.

Notice that the accession number for the SRA files are located in the fourth column run_accession in all_samples.txt. We proceed to the download of the SRA files of the samples listed in samples.txt with the following code: (Attention: The download may take a long time!)

# The command below may take too long to download.
cut -f4 samples.txt | xargs -i sh -c \
        'run_accession={}; \
         prefetch $run_accession --output-directory sra_files'

2. Converting SRA files to FASTQ files

Now that the download is complete, let’s convert the SRA files into FASTQ files with the following command: (Attention: This may take a long time!)

cut -f4 samples.txt | xargs -i sh -c \
    'run_accession={}; \
     fastq-dump --outdir fastq/${run_accession} \
                --gzip \
                --skip-technical \
                --split-3 sra_files/${run_accession}/${run_accession}.sra'

3. Quality Control of the FASTQ files

Up to this point, we have all our Bisulfite-seq DNA methylation FASTQ files ready for Quality Control (QC) check. This is done with the fastqc tools developed by the Babraham Institute. Run the following command to perform QC check for all the samples: (This may take some time!)

cut -f4 samples.txt | xargs -i sh -c 
    'run_accession={}; \
     mkdir -p fastqc_reports/${run_accession}; \
     fastqc fastq/${run_accession}/*fastq.gz -o fastqc_reports/${run_accession}'

Next, let’s summarize the QC reports (for all the samples) into one unique report using multiqc:

multiqc fastqc_reports --dirs -o multiQC_report/

Let’s examine the summary multiqc report either by double-clicking on multiQC_report/multiqc_report.html or by executing the following code:

xdg-open multiQC_report/multiqc_report.html

4. Read Alignment

The assignment of sequencing reads to the most likely locus of origin is called read alignment or mapping and it is a crucial step in most types of high-throughput sequencing experiments.

The general challenge of short read alignment is to map millions of reads accurately and in a reasonable time, despite the presence of sequencing errors, genomic variation and repetitive elements. The different alignment programs employ various strategies that are meant to speed up the process (e.g., by indexing the reference genome) and find a balance between mapping fidelity and error tolerance.

4.1. Reference genomes and annotation

Genome sequences and annotation are often generated by consortia such as (mod)ENCODE, The Mouse Genome Project, The Berkeley Drosophila Genome Project, and many more. The results of these efforts can either be downloaded from individual websites set up by the respective consortia or from more comprehensive data bases such as the one hosted by the University of California, Santa Cruz (UCSC) or the European genome resource (Ensembl).

Reference sequences are usually stored in plain text FASTA files that can either be compressed with the generic gzip command.

The reference sequences file can be obtained from NCBI, ENSEMBL or UCSC Genome Browser.

For this DNA methylation (Bisulfite-seq) tutorial, we will align the reads against the genome (DNA) reference sequences. We obtain both the genome refernce sequences and our gene annotation files from UCSC. This is very important as we intend to perform all downstream DNA methylation analysis using the methylKit Bioconductor R package which works nominally with UCSC genome references.

# Download the latest human genome
wget -P reference http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

4.2. Aligning reads using `Bismark` aligner

4.2.1. Generate genome index

This step has to be performed only once per genome type (and alignment program).

bismark_genome_preparation --verbose ./reference

4.2.2. Alignment

This step has to be performed for each individual FASTQ file.

This step may take a long time! (may take several days to complete).

# execute Bismark aligner
cut -f4 samples.txt | xargs -i bash -c \
  'run_accession={}; 
   mkdir -p alignment_Bismark/${run_accession}; \
   bismark --parallel 8 
           --gzip \
           --fastq \
           --output_dir alignment_Bismark/${run_accession} \
           --genome ./reference -1 fastq/${run_accession}/${run_accession}_1.fastq.gz \
                                -2 fastq/${run_accession}/${run_accession}_2.fastq.gz'

4.2.3. Sorting BAM files and converting to SAM files

We sort sort the BAM files using the samtools sort command:

# Sorting the bam files and converting them to sam files
for i in alignment_Bismark/*/*; do
    if [ "${i}" != "${i%pe.bam}" ];then
        samtools sort -l 0 \
                      -T $(dirname ${i})/$(basename ${i} \
                                  _1_bismark_bt2_pe.bam)_temp \
                      -O sam -@ 8 \
                      -o $(dirname ${i})/$(basename ${i} .bam).sort.sam ${i}
    fi
done

Either SeqMonk or the the Integrative Genomics Viewer (IGV) can be used to visualize the resulting sorted SAM files.

We will later use the methylKit Bioconductor R package to import the methylation data into R from the sorted SAM files.

5. Methylation extraction using Bismark methylation extractor

With the bismark_methylation_extractor command, we extract the methylation call for every single Cytosine analyzed. This process takes as input the resulting BAM file from Bismark aligner. The bismark_methylation_extractor command writes the position of every single Cytosine to a new output file, depending on its context (CpG, CHG or CHH), whereby methylated Cytosines are labelled as forward reads (+), non-methylated Cytosines as reverse reads (-).

SeqMonk, a genome viewer, can be used to visualize the output files.

We store the output of the Bismark methylation extractor in the methylation_data folder.

# Extract methylation data
cut -f4 samples.txt | xargs -i bash -c \
        'run_accession={}; \
         mkdir -p bismark_methCalls/${run_accession}; \
         bismark_methylation_extractor \
              --parallel 8 \
              --gzip \
              --bedGraph \
              --buffer_size 40G \
              --merge_non_CpG \
              --comprehensive \
              --output bismark_methCalls/${run_accession} \
                       alignment_Bismark/${run_accession}/*_pe.bam'

In another tutorial, we will analyze DNA methylation data from the generated sorted SAM files from this tutorial using the MethylKit Bioconductor package in R.

Find the Original source code for this tutorial here.