SummarizedExperiments are R objects meant for organizing and manipulating rectangular matrices that are typically produced by arrays or high-throughput sequencing. If you are doing any kind of analysis that requires associating feature-level data (RNA-seq gene counts, methylation array loci, ATAC-seq regions, etc.) with the genomic coordinates of those features and the sample-level metadata with which those features were measured, then you should be using a SummarizedExperiment to organize, manipulate, and store your results.
Please take a moment to read through the first 2 sections (at least) of the SummarizedExperiment vignette in order to familiarize yourself with what SummarizedExperiments are and their structure. I will demonstrate how you can use SummarizedExperiments below.
The SummarizedExperiment object coordinates four main parts:
assay(), assays(): A matrix-like or list of matrix-like objects of identical dimension
matrix-like: implements dim(), dimnames(), and 2-dimensional [, [<- methods.
rows: genes, genomic coordinates, etc.
columns: samples, cells, etc.
colData(): Annotations on each column, as a DataFrame.
E.g., description of each sample
rowData() and / or rowRanges(): Annotations on each row.
rowRanges(): coordinates of gene / exons in transcripts / etc.
rowData(): P-values and log-fold change of each gene after differential expression analysis.
metadata(): List of unstructured metadata describing the overall content of the object.
In order to better understand how they work, let’s construct a SummarizedExperiment from scratch.
Constructing a SummarizedExperiment
Hopefully you’ll already be working with data that is in a SummarizedExperiment or some other class that derives from one. But just in case you don’t have data structured as a SummarizedExperiment it’s useful and instructive to understand how to create one from scratch.
To be most useful, a SummarizedExperiment should have at least:
A matrix of data with features in rows and samples in columns
A metadata data.frame with samples as rownames and columns describing their properties
Another really useful object to add to SummarizedExperiments is a GRanges object describing the genomic locations of each feature in the matrix. Adding this to the SummarizedExperiment creates what is called a RangedSummarizedExperiment that acts just like a regular SummarizedExperiment with some extra features.
To construct our basic SummarizedExperiment:
We’ll create a ‘counts’ matrix with gene IDs as rows and Samples in columns
We’ll add some metadata describing the Samples
We’ll add on GRanges() describing the genomic location of the genes
Construct the counts matrix
suppressPackageStartupMessages(library(SummarizedExperiment))counts <-matrix(data =rnbinom(n =200*6, mu =100, size =1/0.5),nrow =200,dimnames =list(paste0("Gene", 1:200), paste0("Sample", 1:6)))# Take a peek at what this looks likecounts[1:5, 1:5]
It is important that the sample metadata be either a data.frame or a DataFrame object because SummarizedExperiment requires the colData() to have rownames that match the colnames of the count matrix.
coldata <-data.frame(SampleName =colnames(counts),Treatment =gl(2, 3, labels =c("Control", "Treatment")),Age =sample.int(100, 6),row.names =colnames(counts))# Take a peek at what this looks likecoldata
SampleName Treatment Age
Sample1 Sample1 Control 42
Sample2 Sample2 Control 83
Sample3 Sample3 Control 6
Sample4 Sample4 Treatment 70
Sample5 Sample5 Treatment 95
Sample6 Sample6 Treatment 26
Notice that all of the rownames of the metadata are in the same order as the colnames of the counts matrix. This is necessary.
Construct gene range annotations
You will usually have gene annotations or GRanges objects loaded from a GTF file or you may even create GRanges yourself by specifying the chromosome, start, end, and strand, information manually.
rowranges <-GRanges(rep(c("chr1", "chr2"), c(50, 150)),IRanges(floor(runif(200, 1e5, 1e6)), width =100),strand =sample(c("+", "-"), 200, TRUE),feature_id =sprintf("ID%03d", 1:200),gene_type =sample(c("protein_coding", "lncRNA", "repeat_element"), 200, replace =TRUE))names(rowranges) <-rownames(counts)# Take a peek at what this looks likerowranges
With these pieces of information we’re ready to create a SummarizedExperiment object.
se <-SummarizedExperiment(assays =list(counts = counts),colData = coldata,rowRanges = rowranges)# Printing the object gives a summary of what's insidese
Accessing parts of the SummarizedExperiment object
Every part of the SummarizedExperiment object can be extracted with its accessor function. To extract a particular assay you can use the assay() function. To extract the column metadata you can use the colData() function. To extract the GRanges for the rows of the matrix you can use the rowRanges() function. The rowData() function also allows you to access row-level annotation information from data added to the rowData slot or by the mcols() of the rowRanges. This will be made more clear below.
Once you create a SummarizedExperiment you are not stuck with the information in that object. SummarizedExperiments allow you to add and modify the data within the object.
Adding assays
For example, we may wish to calculate counts per million values for our counts matrix and add a new assay back into our SummarizedExperiment object.
# Calculate counts per millioncounts <-assay(se, "counts")cpm <- counts /colSums(counts) *1e6# Add the CPM data as a new assay to our existing se objectassay(se, "cpm") <- cpm# And if we wish to log-scale these valuesassay(se, "logcounts") <-log2(cpm)# Now there are three assays availableassays(se)
List of length 3
names(3): counts cpm logcounts
Note: Instead of creating intermediate variables we could also directly use the assays like so:
SummarizedExperiment objects use the $ to get and set columns of the metadata contained in the colData slot. For example, to get all of the Ages we can use:
se$Age
[1] 42 83 6 70 95 26
If we want to add a new column we simply create the new column in the same way
se$Batch <-factor(rep(c("A", "B", "C"), 2))# Now you can se that a new 'Batch` column has been added to the colDatacolData(se)
DataFrame with 6 rows and 4 columns
SampleName Treatment Age Batch
<character> <factor> <integer> <factor>
Sample1 Sample1 Control 42 A
Sample2 Sample2 Control 83 B
Sample3 Sample3 Control 6 C
Sample4 Sample4 Treatment 70 A
Sample5 Sample5 Treatment 95 B
Sample6 Sample6 Treatment 26 C
Adding rowData
We can also modify the data which describes each feature in the matrix by adding columns to the rowData. For example, let’s create a new column called Keep if the gene is a protein_coding gene.
With a SummarizedExperiment “the rows you want” corresponds to the features in the rows of the matrix/rowData and “the columns you want” corresponds to the metadata in colData
Subsetting based on sample metadata
For example, if we want to select all of the data belonging only to samples in the Treatment group we can use the following:
Notice how the dim of the object changed from 6 to 3. This is because we have selected only the Samples from the original SummarizedExperiment object from the treatment group. The cool thing about SummarizedExperiments is that all of the assays have also been subsetted to reflect this selection!
Take a look at the “logcounts” assay. It only contains Samples 4, 5, and 6.
Notice how the dim changed from 200 to 50 reflecting the fact that we have only selected the first 50 rows.
This subsetting is very useful when combined with logical operators. Above we created a vector in rowData called “Keep” that contained TRUE if the corresponding row of the se object was a coding gene and FALSE otherwise. Let’s use this vector to subset our se object.
A closely related row-wise subsetting operation can be used if you have a RangedSummarizedExperiment (a SummarizedExperiment with a rowRanges slot) that allows you to perform operations on a SummarizedExperiment object like you would a GRanges object.
For example, let’s say we only wanted to extract the features on Chromosome 2 only. Then we can use the GenomicRanges function subsetByOverlaps directly on our SummarizedExperiment object like so:
# Region of interestroi <-GRanges(seqnames ="chr2", ranges =1:1e7)# Subset the SE object for only features on chr2(chr2 <-subsetByOverlaps(se, roi))
You can see again that the dim changed reflecting our selection. Again, all of the associated assays and rowData have also been subsetted reflecting this change as well.
There’s also a few shortcuts on range operations using GRanges/SummarizedExperiments. See the help pages for %over, %within%, and %outside%. For example:
Of course you don’t have to perform one subsetting operation at a time. Like base R you can combine multiple expressions to subset a SummarizedExperiment object.
For example, to select only features labeled as repeat_elements and the Sample from ‘Batch’ A in the ‘Control’ group
Since SummarizedExperiments keep basically all information about an experiment in one place, it is convenient to save the entire SummarizedExperiment object so that you can pick up an analysis where you left off or even to facilitate better sharing of data between collaborators.
You can save the entire SummarizedExperiment object with:
saveRDS(se, "/path/to/se.rds")
And when you want to read the same object back into R for your next analysis you can do so with:
se <- readRDS("/path/to/se.rds")
SummarizedExperiments in the real world
If you’re working with any Bioconductor packages it’s likely that the object you’re working with either is a SummarizedExperiment or is inherited from one. For example, the DESeqDataSet from the DESeq2 package and BSseq objects from the bsseq package both inherit from a SummarizedExperiment and thus retain all of the same functionality as above. If you go to the SummarizedExperiment landing page and click “See More” under details you can see all of the packages that depend on SummarizedExperiment.
Also, many common methods are also implemented for SummarizedExperiment objects. For example, to simplify calculating counts-per-million above I could have simply used the edgeR::cpm() directly on the SummarizedExperiment object. Many functions in bioconductor packages know how to deal directly with SummarizedExperiments so you don’t ever have to take the trouble extracting components and performing tedious calculations yourself.
assay(se, "cpm") <- edgeR::cpm(se)
I also left out any discussion of the metadata() slot of the SummarizedExperiment. The metadata slot is simply a list of any R object that contains information about the experiment. The metadata in the metadata slots are not subjected to the same subsetting rules as the other slots. In practice this assay contains additional information about the experiment as a whole. For example, I typically store bootstrap alignments for each sample here.
To add something to the SummarizedExperiment metadata slot you can do:
metadata(se)$additional_info <-"Experiment performed on 6 samples with three replicates each"
And to retrieve this:
metadata(se)$additional_info
[1] "Experiment performed on 6 samples with three replicates each"
Closing thoughts
Hopefully this was enough information to get you started using SummarizedExperiments. There’s many things I left out such as different backings for storing out of memory data, a tidyverse interface to SummarizedExperiment objects, TreeSummarizedExperiments for microbiome data, MultiAssayExperiments for dealing with experiments containing multiomics data, and much more.
Please let me know your thoughts and if anything needs more clarification.