matthew.taliaferro@cuanschutz.edu
rMATS
to analyze alternative splicing using RNAseq datarMATS
to analyze alternative splicing using RNAseq datarMATS
output to derive meaningful insightsrMATS
to analyze alternative splicing using RNAseq datarMATS
output to derive meaningful insightsAs with all computational experiments (yes, they are experiments, don't let your pipette-toting friends tell you otherwise), keeping track of what you did is key. In the old days, I kept a written notebook of commands that I ran. Sounds silly, but there were many times that I went back to that notebook to see exactly what the parameters were for a given run using a piece of software.
Today, there are better options. You are using one of the better ones right now. Notebooks, including RMarkdown (mainly for R) and Jupyter (mainly for Python), are a great way to keep track of what you did as well as give justification or explanation for your analyses using plain 'ol English.
Trust me, one day you will be glad you used them. The Methods section of your paper is never fun to write without them.
Today's problem set is composed of 2 problems.
Today's problem set is composed of 2 problems.
rMATS
output file. You will be tasked with filtering the events by read coverage and then plotting the difference in PSI values and their statistical significance using a volcano plot. This problem is worth 50% of the total points.Today's problem set is composed of 2 problems.
rMATS
output file. You will be tasked with filtering the events by read coverage and then plotting the difference in PSI values and their statistical significance using a volcano plot. This problem is worth 50% of the total points.rMATS
output and plot the PSI values statistically significant events using a heatmap. This will allow you to easily judge the consistency of PSI values across replicates in these events. This problem is worth 50% of the total points.If you are interested, here is a little more information about today's topic that you can read later:
rMATS
rMATS
Last time we talked about using STAR
to align RNAseq reads to the transcriptome. We talked about why this was important for alternative splicing analysis given that we need to know exactly where in transcripts these reads are mapping.
Today we will talk about taking the alignments that we produced (OK not those exact alignments, but still) and using them to determine how the inclusion of alternative exons in transcripts varies between conditions.
Today we will continue our focus on the analysis of alternative splicing.
As is the case for many computational biology questions, there are many software choices when it comes to alternative splicing. The one that we will use is called rMATS
.
rMATS
will take data about where reads align and where alternative exons are to determine how often an exon is included in a sample. It will quantify this using a metric called PSI (Percent Spliced In).
rMATS
will then compare PSI values between replicates and across experimental conditions to identify alternative splicing events whoe PSI values are significantly different across conditions.
As you may know, the vast majority of mammalian protein-coding genes contain introns. In fact most contain many introns, meaning they have many exons. Not every one of these exons, or pieces of an exon, has to be included in every transcript.
The inclusion of these exons is often regulated through the binding of RNA binding proteins to specific sequences in the exon and in the flanking introns.
The data we will examine today came from an experiment probing the function of one of these RNA binding proteins, RBFOX2. More on that later.
wormbook
In order tell which exons were included in a transcript, we need to know more than just which transcript a read came from. We need to know where in the transcript it came from.
The most informative reads with regards to alternative splicing are those that cross a splice junction. Those reads tell us unambiguously that two exons were connected together in the transcript. If we are considering the inclusion of an alternative exon, a junction read could link the alternative exon to one of its neighbors, indicating that the alternative exon was included. Alternatively š (was this funny this time?), a junction read could link the two neighbor exons together, indiciating that the alternative exon was not included.
We can use the relative number of reads that support inclusion or exclusion to learn something about how often an exon was included in a sample.
Consider the plots on the right. These are called sashimi š£ plots.
We are quantifying the inclusion of the middle exon. Each row is an RNAseq sample where the height of the color indicates read depth at that location. The arcs that connect exons are junction reads. In the red samples, there are many more reads that support exon inclusion (i.e. go from the middle exon to a flanking exon) than support exclusion (i.e. go from one flanking exon to the other). In the orange samples, this trend is reversed.
We can therefore say that the exon is more often included in the red samples than the orange samples.
MISO documentation
Today, we will use data produced as part of a study into the function of the RNA binding protein RBFOX2.
A common strategy in the study of RNA processing is to take an RBP you are interested in, perturb it, and ask how the transcriptome looks before and after this perturbation. This is part of what the authors did in today's paper.
A common strategy in the study of RNA processing is to take an RBP you are interested in, perturb it, and ask how the transcriptome looks before and after this perturbation. This is part of what the authors did in today's paper.
Imagine you were studying an RBP that promoted inclusion of an alternative exon by binding the intron downstream of exons that it controls.
A common strategy in the study of RNA processing is to take an RBP you are interested in, perturb it, and ask how the transcriptome looks before and after this perturbation. This is part of what the authors did in today's paper.
Imagine you were studying an RBP that promoted inclusion of an alternative exon by binding the intron downstream of exons that it controls.
Now imagine that you perturbed that RBP, say, by knocking its expression down. What would happen?
Now imagine that you perturbed that RBP, say, by knocking its expression down. What would happen?
Exons that depended on that RBP for efficient inclusion would now be included in transcripts much less frequently. In this example, their PSI values would decrease.
However, other exons don't really care about whether this RBP is around or not. Their PSI values will be generally unaffected.
Now imagine that you perturbed that RBP, say, by knocking its expression down. What would happen?
Exons that depended on that RBP for efficient inclusion would now be included in transcripts much less frequently. In this example, their PSI values would decrease.
However, other exons don't really care about whether this RBP is around or not. Their PSI values will be generally unaffected.
Identifying RBP-sensitive exons in this manner therefore tells us two things:
More on this in the next class. For now, let's focus on running rMATS
and figuring out its output.
rMATS
is going to want two things:
gtf
genome annotation file)rMATS
is going to want two things:
gtf
genome annotation file)bam
formatrMATS
is going to want two things:
gtf
genome annotation file)bam
formatSee the documentation here. The relevant options we will need to pay attention to when doing this are shown below:
rMATS
is going to want two things:
gtf
genome annotation file)bam
formatSee the documentation here. The relevant options we will need to pay attention to when doing this are shown below:
rMATS
is going to want two things:
gtf
genome annotation file)bam
formatSee the documentation here. The relevant options we will need to pay attention to when doing this are shown below:
rMATS
is going to want two things:
gtf
genome annotation file)bam
formatSee the documentation here. The relevant options we will need to pay attention to when doing this are shown below:
rMATS
is going to want two things:
gtf
genome annotation file)bam
formatSee the documentation here. The relevant options we will need to pay attention to when doing this are shown below:
rMATS
is going to want two things:
gtf
genome annotation file)bam
formatSee the documentation here. The relevant options we will need to pay attention to when doing this are shown below:
rMATS
is going to want two things:
gtf
genome annotation file)bam
formatSee the documentation here. The relevant options we will need to pay attention to when doing this are shown below:
rMATS produces several output files. The files we are interested in end in .JC.txt
. There is one for every "type" of alternative splicing. Splicing quantification in these files was done using only reads that cross splice junctions. Although this reduces the statistical power of the tests that rMATS
uses because it reduces the number of usable reads, junction reads are always completely unambiguous in relating how exons were connected together.
Let's look at one of these output files.
rmatsout <- read.table('data/rMATS/RBFOX2kd/rMATSouts/SE.MATS.JC.txt', header = T)as_tibble(rmatsout) %>% head(n = 5)
#> # A tibble: 5 x 23#> ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES#> <int> <fct> <fct> <fct> <fct> <int> <int> <int>#> 1 5 ENSMUā¦ Neil1 chr9 - 57143757 5.71e7 57143450#> 2 6 ENSMUā¦ Wrap53 chr11 - 69562121 6.96e7 69561757#> 3 9 ENSMUā¦ Zfp28 chr7 + 6392181 6.39e6 6390399#> 4 11 ENSMUā¦ Pik3cd chr4 - 149656603 1.50e8 149656380#> 5 17 ENSMUā¦ Cd81 chr7 + 143065942 1.43e8 143064673#> # ā¦ with 15 more variables: upstreamEE <int>, downstreamES <int>,#> # downstreamEE <int>, ID.1 <int>, IJC_SAMPLE_1 <fct>, SJC_SAMPLE_1 <fct>,#> # IJC_SAMPLE_2 <fct>, SJC_SAMPLE_2 <fct>, IncFormLen <int>,#> # SkipFormLen <int>, PValue <dbl>, FDR <dbl>, IncLevel1 <fct>,#> # IncLevel2 <fct>, IncLevelDifference <dbl>
We will get more into the meaning of each of these columns in the exercises.
matthew.taliaferro@cuanschutz.edu
Keyboard shortcuts
ā, ā, Pg Up, k | Go to previous slide |
ā, ā, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |