Practical introduction to SummarizedExperiments

What are SummarizedExperiments

SummarizedExperiments are R objects meant for organizing and manipulating rectangular matrices that are typically produced by arrays or high-throughput sequencing. If you are doing any kind of analysis that requires associating feature-level data (RNA-seq gene counts, methylation array loci, ATAC-seq regions, etc.) with the genomic coordinates of those features and the sample-level metadata with which those features were measured, then you should be using a SummarizedExperiment to organize, manipulate, and store your results.

Please take a moment to read through the first 2 sections (at least) of the SummarizedExperiment vignette in order to familiarize yourself with what SummarizedExperiments are and their structure. I will demonstrate how you can use SummarizedExperiments below.

From the SummarizedExperiment vignette:

The SummarizedExperiment object coordinates four main parts:

  • assay(), assays(): A matrix-like or list of matrix-like objects of identical dimension
    • matrix-like: implements dim(), dimnames(), and 2-dimensional [, [<- methods.
    • rows: genes, genomic coordinates, etc.
    • columns: samples, cells, etc.
  • colData(): Annotations on each column, as a DataFrame.
    • E.g., description of each sample
  • rowData() and / or rowRanges(): Annotations on each row.
    • rowRanges(): coordinates of gene / exons in transcripts / etc.
    • rowData(): P-values and log-fold change of each gene after differential expression analysis.
  • metadata(): List of unstructured metadata describing the overall content of the object.

In order to better understand how they work, let’s construct a SummarizedExperiment from scratch.

Constructing a SummarizedExperiment

Hopefully you’ll already be working with data that is in a SummarizedExperiment or some other class that derives from one. But just in case you don’t have data structured as a SummarizedExperiment it’s useful and instructive to understand how to create one from scratch.

To be most useful, a SummarizedExperiment should have at least:

  • A matrix of data with features in rows and samples in columns
  • A metadata data.frame with samples as rownames and columns describing their properties

Another really useful object to add to SummarizedExperiments is a GRanges object describing the genomic locations of each feature in the matrix. Adding this to the SummarizedExperiment creates what is called a RangedSummarizedExperiment that acts just like a regular SummarizedExperiment with some extra features.

To construct our basic SummarizedExperiment:

  • We’ll create a ‘counts’ matrix with gene IDs as rows and Samples in columns
  • We’ll add some metadata describing the Samples
  • We’ll add on GRanges() describing the genomic location of the genes

Construct the counts matrix

suppressPackageStartupMessages(library(SummarizedExperiment))


counts <- matrix(
  data = rnbinom(n = 200 * 6, mu = 100, size = 1 / 0.5),
  nrow = 200,
  dimnames = list(paste0("Gene", 1:200), paste0("Sample", 1:6))
)

# Take a peek at what this looks like
counts[1:5, 1:5]
      Sample1 Sample2 Sample3 Sample4 Sample5
Gene1     191     162      55     102      22
Gene2      11       1     216      37      95
Gene3     126      60      91      63     132
Gene4      72     129      71     211     236
Gene5      36      42      31      55     202

Construct the sample metadata

It is important that the sample metadata be either a data.frame or a DataFrame object because SummarizedExperiment requires the colData() to have rownames that match the colnames of the count matrix.

coldata <- data.frame(
  SampleName = colnames(counts),
  Treatment = gl(2, 3, labels = c("Control", "Treatment")),
  Age = sample.int(100, 6),
  row.names = colnames(counts)
)

# Take a peek at what this looks like
coldata
        SampleName Treatment Age
Sample1    Sample1   Control  80
Sample2    Sample2   Control  37
Sample3    Sample3   Control  91
Sample4    Sample4 Treatment   5
Sample5    Sample5 Treatment  82
Sample6    Sample6 Treatment  43

Notice that all of the rownames of the metadata are in the same order as the colnames of the counts matrix. This is necessary.

Construct gene range annotations

You will usually have gene annotations or GRanges objects loaded from a GTF file or you may even create GRanges yourself by specifying the chromosome, start, end, and strand, information manually.

rowranges <- GRanges(
  rep(c("chr1", "chr2"), c(50, 150)),
  IRanges(floor(runif(200, 1e5, 1e6)), width = 100),
  strand = sample(c("+", "-"), 200, TRUE),
  feature_id = sprintf("ID%03d", 1:200),
  gene_type = sample(c("protein_coding", "lncRNA", "repeat_element"), 200, replace = TRUE)
)
Warning in S4Vectors:::anyMissing(runValue(x_seqnames)): 'S4Vectors:::anyMissing()' is deprecated.
Use 'anyNA()' instead.
See help("Deprecated")
Warning in S4Vectors:::anyMissing(runValue(strand(x))): 'S4Vectors:::anyMissing()' is deprecated.
Use 'anyNA()' instead.
See help("Deprecated")
names(rowranges) <- rownames(counts)

# Take a peek at what this looks like
rowranges
GRanges object with 200 ranges and 2 metadata columns:
          seqnames        ranges strand |  feature_id      gene_type
             <Rle>     <IRanges>  <Rle> | <character>    <character>
    Gene1     chr1 606835-606934      + |       ID001 repeat_element
    Gene2     chr1 451467-451566      - |       ID002         lncRNA
    Gene3     chr1 424905-425004      + |       ID003 protein_coding
    Gene4     chr1 805272-805371      - |       ID004 repeat_element
    Gene5     chr1 204209-204308      + |       ID005 protein_coding
      ...      ...           ...    ... .         ...            ...
  Gene196     chr2 105197-105296      - |       ID196 protein_coding
  Gene197     chr2 864490-864589      + |       ID197 protein_coding
  Gene198     chr2 482738-482837      + |       ID198 protein_coding
  Gene199     chr2 209038-209137      + |       ID199 protein_coding
  Gene200     chr2 285516-285615      + |       ID200 repeat_element
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Construct the SummarizedExperiment object

With these pieces of information we’re ready to create a SummarizedExperiment object.

se <- SummarizedExperiment(
  assays = list(counts = counts),
  colData = coldata,
  rowRanges = rowranges
)

# Printing the object gives a summary of what's inside
se
class: RangedSummarizedExperiment 
dim: 200 6 
metadata(0):
assays(1): counts
rownames(200): Gene1 Gene2 ... Gene199 Gene200
rowData names(2): feature_id gene_type
colnames(6): Sample1 Sample2 ... Sample5 Sample6
colData names(3): SampleName Treatment Age

Accessing parts of the SummarizedExperiment object

Every part of the SummarizedExperiment object can be extracted with its accessor function. To extract a particular assay you can use the assay() function. To extract the column metadata you can use the colData() function. To extract the GRanges for the rows of the matrix you can use the rowRanges() function. The rowData() function also allows you to access row-level annotation information from data added to the rowData slot or by the mcols() of the rowRanges. This will be made more clear below.

Getting the count matrix

assay(se, "counts") |> head()
      Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
Gene1     191     162      55     102      22     119
Gene2      11       1     216      37      95      75
Gene3     126      60      91      63     132      48
Gene4      72     129      71     211     236      71
Gene5      36      42      31      55     202      66
Gene6      20      95      62      90     115     105

To see what assays are available you can use the assays() function

assays(se)
List of length 1
names(1): counts

Getting the column metadata

colData(se)
DataFrame with 6 rows and 3 columns
         SampleName Treatment       Age
        <character>  <factor> <integer>
Sample1     Sample1 Control          80
Sample2     Sample2 Control          37
Sample3     Sample3 Control          91
Sample4     Sample4 Treatment         5
Sample5     Sample5 Treatment        82
Sample6     Sample6 Treatment        43

Getting the rowRanges

rowRanges(se)
GRanges object with 200 ranges and 2 metadata columns:
          seqnames        ranges strand |  feature_id      gene_type
             <Rle>     <IRanges>  <Rle> | <character>    <character>
    Gene1     chr1 606835-606934      + |       ID001 repeat_element
    Gene2     chr1 451467-451566      - |       ID002         lncRNA
    Gene3     chr1 424905-425004      + |       ID003 protein_coding
    Gene4     chr1 805272-805371      - |       ID004 repeat_element
    Gene5     chr1 204209-204308      + |       ID005 protein_coding
      ...      ...           ...    ... .         ...            ...
  Gene196     chr2 105197-105296      - |       ID196 protein_coding
  Gene197     chr2 864490-864589      + |       ID197 protein_coding
  Gene198     chr2 482738-482837      + |       ID198 protein_coding
  Gene199     chr2 209038-209137      + |       ID199 protein_coding
  Gene200     chr2 285516-285615      + |       ID200 repeat_element
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Getting the rowData

Note that rowData in this case is the same as mcols() of the rowRanges

rowData(se)
DataFrame with 200 rows and 2 columns
         feature_id      gene_type
        <character>    <character>
Gene1         ID001 repeat_element
Gene2         ID002         lncRNA
Gene3         ID003 protein_coding
Gene4         ID004 repeat_element
Gene5         ID005 protein_coding
...             ...            ...
Gene196       ID196 protein_coding
Gene197       ID197 protein_coding
Gene198       ID198 protein_coding
Gene199       ID199 protein_coding
Gene200       ID200 repeat_element

Modifying a SummarizedExperiment

Once you create a SummarizedExperiment you are not stuck with the information in that object. SummarizedExperiments allow you to add and modify the data within the object.

Adding assays

For example, we may wish to calculate counts per million values for our counts matrix and add a new assay back into our SummarizedExperiment object.

# Calculate counts per million
counts <- assay(se, "counts")
cpm <- counts / colSums(counts) * 1e6

# Add the CPM data as a new assay to our existing se object
assay(se, "cpm") <- cpm

# And if we wish to log-scale these values
assay(se, "logcounts") <- log2(cpm)

# Now there are three assays available
assays(se)
List of length 3
names(3): counts cpm logcounts

Note: Instead of creating intermediate variables we could also directly use the assays like so:

assay(se, "cpm") <- assay(se, "counts") / colSums(assay(se, "counts")) * 1e6

Adding metadata

SummarizedExperiment objects use the $ to get and set columns of the metadata contained in the colData slot. For example, to get all of the Ages we can use:

se$Age
[1] 80 37 91  5 82 43

If we want to add a new column we simply create the new column in the same way

se$Batch <- factor(rep(c("A", "B", "C"), 2))

# Now you can se that a new 'Batch` column has been added to the colData
colData(se)
DataFrame with 6 rows and 4 columns
         SampleName Treatment       Age    Batch
        <character>  <factor> <integer> <factor>
Sample1     Sample1 Control          80        A
Sample2     Sample2 Control          37        B
Sample3     Sample3 Control          91        C
Sample4     Sample4 Treatment         5        A
Sample5     Sample5 Treatment        82        B
Sample6     Sample6 Treatment        43        C

Adding rowData

We can also modify the data which describes each feature in the matrix by adding columns to the rowData. For example, let’s create a new column called Keep if the gene is a protein_coding gene.

rowData(se)$Keep <- rowData(se)$gene_type == "protein_coding"

rowData(se)
DataFrame with 200 rows and 3 columns
         feature_id      gene_type      Keep
        <character>    <character> <logical>
Gene1         ID001 repeat_element     FALSE
Gene2         ID002         lncRNA     FALSE
Gene3         ID003 protein_coding      TRUE
Gene4         ID004 repeat_element     FALSE
Gene5         ID005 protein_coding      TRUE
...             ...            ...       ...
Gene196       ID196 protein_coding      TRUE
Gene197       ID197 protein_coding      TRUE
Gene198       ID198 protein_coding      TRUE
Gene199       ID199 protein_coding      TRUE
Gene200       ID200 repeat_element     FALSE

Subsetting SummarizedExperiment objects

SummarizedExperiments follow the basic idea of

se[the rows you want, the columns you want]

With a SummarizedExperiment “the rows you want” corresponds to the features in the rows of the matrix/rowData and “the columns you want” corresponds to the metadata in colData

Subsetting based on sample metadata

For example, if we want to select all of the data belonging only to samples in the Treatment group we can use the following:

(trt <- se[, se$Treatment == "Treatment"])
class: RangedSummarizedExperiment 
dim: 200 3 
metadata(0):
assays(3): counts cpm logcounts
rownames(200): Gene1 Gene2 ... Gene199 Gene200
rowData names(3): feature_id gene_type Keep
colnames(3): Sample4 Sample5 Sample6
colData names(4): SampleName Treatment Age Batch

Notice how the dim of the object changed from 6 to 3. This is because we have selected only the Samples from the original SummarizedExperiment object from the treatment group. The cool thing about SummarizedExperiments is that all of the assays have also been subsetted to reflect this selection!

Take a look at the “logcounts” assay. It only contains Samples 4, 5, and 6.

assay(trt, "logcounts") |> head()
       Sample4  Sample5  Sample6
Gene1 12.22644 10.21101 12.64345
Gene2 10.82906 12.35025 11.77628
Gene3 11.72886 12.79303 11.13898
Gene4 13.50149 13.43010 11.76935
Gene5 11.52999 13.21223 11.79598
Gene6 12.03931 12.46510 12.49464

Of course you can combine multiple conditions as well

se[, se$Batch %in% c("B", "C") & se$Age > 10]
class: RangedSummarizedExperiment 
dim: 200 4 
metadata(0):
assays(3): counts cpm logcounts
rownames(200): Gene1 Gene2 ... Gene199 Gene200
rowData names(3): feature_id gene_type Keep
colnames(4): Sample2 Sample3 Sample5 Sample6
colData names(4): SampleName Treatment Age Batch

Subsetting based on rows

We can also select certain features that we want to keep using row subsetting. For example to select only the first 50 rows:

se[1:50, ]
class: RangedSummarizedExperiment 
dim: 50 6 
metadata(0):
assays(3): counts cpm logcounts
rownames(50): Gene1 Gene2 ... Gene49 Gene50
rowData names(3): feature_id gene_type Keep
colnames(6): Sample1 Sample2 ... Sample5 Sample6
colData names(4): SampleName Treatment Age Batch

Notice how the dim changed from 200 to 50 reflecting the fact that we have only selected the first 50 rows.

This subsetting is very useful when combined with logical operators. Above we created a vector in rowData called “Keep” that contained TRUE if the corresponding row of the se object was a coding gene and FALSE otherwise. Let’s use this vector to subset our se object.

(coding <- se[rowData(se)$Keep, ])
class: RangedSummarizedExperiment 
dim: 69 6 
metadata(0):
assays(3): counts cpm logcounts
rownames(69): Gene3 Gene5 ... Gene198 Gene199
rowData names(3): feature_id gene_type Keep
colnames(6): Sample1 Sample2 ... Sample5 Sample6
colData names(4): SampleName Treatment Age Batch

And if we look at the resulting rowData we can see that it only contains the protein_coding features

rowData(coding)
DataFrame with 69 rows and 3 columns
         feature_id      gene_type      Keep
        <character>    <character> <logical>
Gene3         ID003 protein_coding      TRUE
Gene5         ID005 protein_coding      TRUE
Gene7         ID007 protein_coding      TRUE
Gene8         ID008 protein_coding      TRUE
Gene10        ID010 protein_coding      TRUE
...             ...            ...       ...
Gene194       ID194 protein_coding      TRUE
Gene196       ID196 protein_coding      TRUE
Gene197       ID197 protein_coding      TRUE
Gene198       ID198 protein_coding      TRUE
Gene199       ID199 protein_coding      TRUE

Each assay also reflects this operation

assay(coding, "cpm") |> head()
         Sample1   Sample2   Sample3   Sample4    Sample5    Sample6
Gene3  6788.4273 3225.9799 4275.3113  3394.214  7097.1558  2255.1092
Gene5  1935.5879 1973.2206 1670.1686  2957.148  9490.2514  3555.8429
Gene7   704.7216  700.3933 1720.5226 11134.602  4094.6070   537.6633
Gene8  1474.9988 1044.3003  701.5246  3048.331   439.7054 10195.4915
Gene10 7694.8445 4630.0627 3048.3308  1429.043   561.2197  5703.3286
Gene12 1964.2690 5752.4952 2363.4165  3414.087 11849.1568 19017.2584

Subsetting based on rowRanges

A closely related row-wise subsetting operation can be used if you have a RangedSummarizedExperiment (a SummarizedExperiment with a rowRanges slot) that allows you to perform operations on a SummarizedExperiment object like you would a GRanges object.

For example, let’s say we only wanted to extract the features on Chromosome 2 only. Then we can use the GenomicRanges function subsetByOverlaps directly on our SummarizedExperiment object like so:

# Region of interest
roi <- GRanges(seqnames = "chr2", ranges = 1:1e7)
Warning in S4Vectors:::anyMissing(runValue(x_seqnames)): 'S4Vectors:::anyMissing()' is deprecated.
Use 'anyNA()' instead.
See help("Deprecated")
Warning in S4Vectors:::anyMissing(runValue(strand(x))): 'S4Vectors:::anyMissing()' is deprecated.
Use 'anyNA()' instead.
See help("Deprecated")
# Subset the SE object for only features on chr2
(chr2 <- subsetByOverlaps(se, roi))
class: RangedSummarizedExperiment 
dim: 150 6 
metadata(0):
assays(3): counts cpm logcounts
rownames(150): Gene51 Gene52 ... Gene199 Gene200
rowData names(3): feature_id gene_type Keep
colnames(6): Sample1 Sample2 ... Sample5 Sample6
colData names(4): SampleName Treatment Age Batch

You can see again that the dim changed reflecting our selection. Again, all of the associated assays and rowData have also been subsetted reflecting this change as well.

rowData(chr2)
DataFrame with 150 rows and 3 columns
         feature_id      gene_type      Keep
        <character>    <character> <logical>
Gene51        ID051         lncRNA     FALSE
Gene52        ID052 protein_coding      TRUE
Gene53        ID053 protein_coding      TRUE
Gene54        ID054 repeat_element     FALSE
Gene55        ID055 repeat_element     FALSE
...             ...            ...       ...
Gene196       ID196 protein_coding      TRUE
Gene197       ID197 protein_coding      TRUE
Gene198       ID198 protein_coding      TRUE
Gene199       ID199 protein_coding      TRUE
Gene200       ID200 repeat_element     FALSE
assay(chr2, "counts") |> head()
       Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
Gene51      87     130      18      72     140      50
Gene52     117      82      15      60     103      45
Gene53     154      95     327      93      57      32
Gene54      67     151     237      96      48     165
Gene55      54      47      77      69      21     232
Gene56     142     159     118      90      66      69
rowRanges(chr2)
GRanges object with 150 ranges and 3 metadata columns:
          seqnames        ranges strand |  feature_id      gene_type      Keep
             <Rle>     <IRanges>  <Rle> | <character>    <character> <logical>
   Gene51     chr2 453695-453794      - |       ID051         lncRNA     FALSE
   Gene52     chr2 764042-764141      + |       ID052 protein_coding      TRUE
   Gene53     chr2 161517-161616      + |       ID053 protein_coding      TRUE
   Gene54     chr2 547970-548069      + |       ID054 repeat_element     FALSE
   Gene55     chr2 187530-187629      - |       ID055 repeat_element     FALSE
      ...      ...           ...    ... .         ...            ...       ...
  Gene196     chr2 105197-105296      - |       ID196 protein_coding      TRUE
  Gene197     chr2 864490-864589      + |       ID197 protein_coding      TRUE
  Gene198     chr2 482738-482837      + |       ID198 protein_coding      TRUE
  Gene199     chr2 209038-209137      + |       ID199 protein_coding      TRUE
  Gene200     chr2 285516-285615      + |       ID200 repeat_element     FALSE
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

There’s also a few shortcuts on range operations using GRanges/SummarizedExperiments. See the help pages for %over, %within%, and %outside%. For example:

all.equal(se[se %over% roi, ], subsetByOverlaps(se, roi))
[1] TRUE

Combining subsetting operations

Of course you don’t have to perform one subsetting operation at a time. Like base R you can combine multiple expressions to subset a SummarizedExperiment object.

For example, to select only features labeled as repeat_elements and the Sample from ‘Batch’ A in the ‘Control’ group

(selected <- se[
  rowData(se)$gene_type == "repeat_element",
  se$Treatment == "Control" &
    se$Batch == "A"
])
class: RangedSummarizedExperiment 
dim: 77 1 
metadata(0):
assays(3): counts cpm logcounts
rownames(77): Gene1 Gene4 ... Gene195 Gene200
rowData names(3): feature_id gene_type Keep
colnames(1): Sample1
colData names(4): SampleName Treatment Age Batch

Saving a SummarizedExperiment

Since SummarizedExperiments keep basically all information about an experiment in one place, it is convenient to save the entire SummarizedExperiment object so that you can pick up an analysis where you left off or even to facilitate better sharing of data between collaborators.

You can save the entire SummarizedExperiment object with:

saveRDS(se, "/path/to/se.rds")

And when you want to read the same object back into R for your next analysis you can do so with:

se <- readRDS("/path/to/se.rds")

SummarizedExperiments in the real world

If you’re working with any Bioconductor packages it’s likely that the object you’re working with either is a SummarizedExperiment or is inherited from one. For example, the DESeqDataSet from the DESeq2 package and BSseq objects from the bsseq package both inherit from a SummarizedExperiment and thus retain all of the same functionality as above. If you go to the SummarizedExperiment landing page and click “See More” under details you can see all of the packages that depend on SummarizedExperiment.

Also, many common methods are also implemented for SummarizedExperiment objects. For example, to simplify calculating counts-per-million above I could have simply used the edgeR::cpm() directly on the SummarizedExperiment object. Many functions in bioconductor packages know how to deal directly with SummarizedExperiments so you don’t ever have to take the trouble extracting components and performing tedious calculations yourself.

assay(se, "cpm") <- edgeR::cpm(se)

I also left out any discussion of the metadata() slot of the SummarizedExperiment. The metadata slot is simply a list of any R object that contains information about the experiment. The metadata in the metadata slots are not subjected to the same subsetting rules as the other slots. In practice this assay contains additional information about the experiment as a whole. For example, I typically store bootstrap alignments for each sample here.

To add something to the SummarizedExperiment metadata slot you can do:

metadata(se)$additional_info <- "Experiment performed on 6 samples with three replicates each"

And to retrieve this:

metadata(se)$additional_info
[1] "Experiment performed on 6 samples with three replicates each"

Closing thoughts

Hopefully this was enough information to get you started using SummarizedExperiments. There’s many things I left out such as different backings for storing out of memory data, a tidyverse interface to SummarizedExperiment objects, TreeSummarizedExperiments for microbiome data, MultiAssayExperiments for dealing with experiments containing multiomics data, and much more.

Please let me know your thoughts and if anything needs more clarification.