Disclaimer: This tutorial was originally written on June 06, 2019.

Introduction

In a previous tutorial, we showed you how to download and process Bisulfite-treated sequence DNA methylation FASTQ files for read alignment on a reference sequence. In this tutorial, we show you how to run DNA methylation analysis using the Bioconductor bsseq and dmrseq R packages.

We set our working directory to the tuto folder created in our previous tutorial.

setwd('./tuto')

Now, let’s install all the required packages for this tutorial.

# Indicate package repositories to R...
repositories <- c("https://cloud.r-project.org", 
                   "https://bioconductor.org/packages/3.7/bioc",
                   "https://bioconductor.org/packages/3.7/data/annotation", 
                   "https://bioconductor.org/packages/3.7/data/experiment",
                   "https://www.stats.ox.ac.uk/pub/RWin", 
                   "http://www.omegahat.net/R", 
                   "https://R-Forge.R-project.org",
                   "https://www.rforge.net", 
                   "https://cloud.r-project.org", 
                   "http://www.bioconductor.org",
                   "http://www.stats.ox.ac.uk/pub/RWin")

# Package list to download
packages <- c("bsseq", "bsseqdata", "dmrseq", "DelayedMatrixStats")

# Install and load missing packages
new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]

if(length(new.packages)){
    install.packages(new.packages, repos = repositories)
}

lapply(packages, library, character.only = TRUE)

1. Obtaining methylation data from Bismark extraction methylation calls

We read in the methylation calls directly from the Bismark methylation extractor files obtained from the previous tutoral. The files are located within the bismark_methCalls folder (see previous tutorial). For that purpose, we use the read.bismark() function from the bsseq Bioconductor R package as described below:

# Get sample data from files
files_loc <- file.path(getwd(), 'bismark_methCalls')
samples <- list.dirs(files_loc, full.names = FALSE, recursive = FALSE)
samples
# Define conditions
conditions <- c(rep(c("normal", "cancer"), each = 2))
sampleData <- data.frame(condition = conditions)
rownames(sampleData) <- samples
sampleData
methyl_files <- list.files(files_loc, "\\cov.gz$", 
                           full.names = TRUE, recursive = TRUE)
methyl_files
# Will generate specifically for this set of data, 4 variables: 
# methyl_data1, methyl_data2, methyl_data3, methyl_data4
for(i in 1:length(methyl_files)){
    sampleTable <- data.frame(condition = conditions[i])
    rownames(sampleTable) <- samples[i]
    assign(
        paste0("methyl_data", i),
        read.bismark(methyl_files[i],
                     loci = NULL,
                     colData = sampleTable,
                     rmZeroCov = FALSE,
                     strandCollapse = TRUE,
                     BPPARAM = bpparam(),
                     BACKEND = "HDF5Array",
                     dir = tempfile("BSseq"),
                     replace = FALSE,
                     chunkdim = NULL,
                     level = NULL,
                     nThread = 8,
                     verbose = getOption("verbose"))
    )
}
methyl_data1
methyl_data2
methyl_data3
methyl_data4
pData(methyl_data1)
pData(methyl_data2)
pData(methyl_data3)
pData(methyl_data4)

We now combine all methylation data for all 4 samples:

combined_data <- combine(methyl_data1, methyl_data2, methyl_data3, methyl_data4)
combined_data
pData(combined_data)

2. Smoothing

The first step of the analysis is to smooth the data

# Smoothing the data
combined_data.fit <- BSmooth(
    BSseq = combined_data, 
    BPPARAM = MulticoreParam(workers = 7), 
    verbose = TRUE
)

Since the previous step is time consuming and computationally expensive, let’s save the smoothed data:

# Saving the smoothed data
save(combined_data.fit, file = "combined_data_fit.rda")

You may later load the combined_data.fit object by running the following code:

load("combined_data_fit.rda")
# Inspecting the combined smoothed data
combined_data.fit
# The average coverage of CpGs on the two chromosomes
round(colMeans(getCoverage(combined_data)), 1)
# Number of CpGs in two chromosomes
length(combined_data)
# Number of CpGs which are covered by at least 1 read in all 4 samples
sum(rowSums(getCoverage(combined_data) >= 1) == 4)
# Number of CpGs with 0 coverage in all samples
sum(rowSums(getCoverage(combined_data)) == 0)

3. Computing t-statistics

To avoid too many differentially methilated regions (DRMs), we remove CpGs with little or no coverage (which are likely false positives). We keep CpGs where at least 1 cancer samples and at least 1 normal samples have at least 2x in coverage.

# which loci and sample indices to keep
keep.index <- which(
                DelayedMatrixStats::rowSums2(
                  getCoverage(combined_data, type = "Cov") == 0) == 0
              )

sample.index <- which(pData(combined_data)$condition %in% c("normal", "cancer"))

combined_data.filtered <- combined_data[keep.index, sample.index]
combined_data.filtered

For t-statistics, we will only keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least 2x in coverage.

combined_data.cov <- getCoverage(combined_data.fit)

keep.index2 <- which(rowSums(combined_data.cov[,
                             combined_data$condition == "cancer"] >= 2) >= 2 &
                     rowSums(combined_data.cov[, 
                             combined_data$condition == "normal"] >= 2) >= 2
              )

length(keep.index2)
combined_data.fit2 <- combined_data.fit[keep.index2]

Let’s first arrange the two groups for the t-test:

# In grp1, we keep all the normal sample names, and
# in grp2, all the cancer sample names
grp1 <- rownames(sampleData)[sampleData$condition == 'normal']
grp2 <- rownames(sampleData)[sampleData$condition == 'cancer']
grp1
grp2

We now compute t-statistics with the BSmooth.tstat function provided by the BSseq Bioconductor R package.

combined_data.tstat <- BSmooth.tstat(
    combined_data.fit2,
    group1 = grp2,
    group2 = grp1,
    estimate.var = "group2",
    local.correct = TRUE,
    mc.cores = 8,
    verbose = TRUE
)
combined_data.tstat
stats <- as.data.frame(combined_data.tstat@stats)
head(stats)

Let’s check the marginal distribution of the t-statistic:

plot(density(as.numeric(as.vector(stats$tstat.corrected)), 
             na.rm = TRUE), xlim = c(-15, 15), col = "red", main = "")
             
lines(density(as.numeric(as.vector(stats$tstat)), 
              na.rm = TRUE), col = "blue")
              
legend("topright", legend = c("corrected", "uncorrected"), 
       col = c("red", "blue"), lty = 1)

The “blocks” of hypomethylation are clearly visible in the marginal distribution of the uncorrected t-statistics.

4. Finding Differentially Methylated Regions (DMRs)

We use the dmrseq() function of the dmrseq Bioconductor R package to compute the DMRs.

# run the results for a subset of 60,000 CpGs in the interest
# of computation time
# Run with a single core if it fails on multiple cores (workers = 1)
dmrs <- dmrseq(bs = combined_data.filtered[240001:300000, ],
               cutoff = 0.05,
               BPPARAM = MulticoreParam(workers = 1),
               testCovariate = "condition")
# Displaying DMRs
show(dmrs)

4.1. Exploring Significantly Methylated Regions

In this tutorial, the number of significantly (and Differentially) Methylated Regions is assessed at the FDR (q-value) cutoff of 0.05.

sum(dmrs$qval < 0.05)
# select just the regions below FDR 0.05 and place in a new data.frame
sigRegions <- dmrs[dmrs$qval < 0.05, ]

4.2. Proportion of regions with hyper-methylation

sum(sigRegions$stat > 0) / length(sigRegions)

To interpret the direction of effect, since dmrseq uses alphabetical order of the covariate of interest, the condition cancer would be the reference category.

4.3. Visualizing Differentially Methylated Regions (DMRs)

We first get the annotation for hg18, then use the plotDMRs() function provided by the Bioconductor dmrseq R package.

# get annotations for hg18
annotation <- getAnnot("hg18")
# Plot DMRs
plotDMRs(combined_data.filtered,
         regions = dmrs[1, ], 
         testCovariate = "condition",
         annoTrack = annotation)

5. Detecting large-scale methylation blocks

In some applications, such as cancer, it is of interest to effectively ‘zoom out’ in order to detect larger (lower-resolution) methylation blocks on the order of hundreds of thousands to millions of bases.

# Run the results for a subset of 300,000 CpGs in the interest
# of computation time
# Run with a single core if it fails on multiple cores
blocks <- dmrseq(bs = combined_data.filtered[120001:420000, ],
                 cutoff = 0.05,
                 testCovariate ='condition',
                 block = TRUE,
                 BPPARAM = MulticoreParam(workers = 1),
                 minInSpan = 500,
                 bpSpan = 5e4,
                 maxGapSmooth = 1e6,
                 maxGap = 5e3)
show(blocks)

We may visualize the top methylation blocks from the block analysis as shown below:

# Plot top methylation blocks 
plotDMRs(combined_data.filtered, 
         regions = blocks[1, ], 
         testCovariate = "condition",
         annoTrack = annotation)

This last DMR plot concludes this tutorial on DNA methylation analysis with the BSseq and dmrseq Bioconductor R packages. For more information, feel free to check the official BSseq and dmrseq tutorials.

Find the Original source code for this tutorial here.