CleaveLand 3 has now been completed and is ready for use. It is substantially different from both CleaveLand 1 (described by Addo-Quaye et al. (2009) in Bioinformatics) and CleaveLand 2. It runs much faster, no longer uses extensive randomized queries, and has a sound statistical footing for assessing the likelihood of a hit occurring by random chance, taking into account both the quality of the small RNA / mRNA alignment and the noise in the degradome library being analyzed.
Comments / complaints / bugs / and especially praise welcomed. -- mike a
Another view of CleaveLand statistics -- an even simpler way?
In my last post, I introduced the idea that one can estimate the chances of getting a spurious degradome 'hit' by examining the fraction of all positions within a transcriptome occupied by degradome data in a given library. After a little reflection, the next step seems even simpler than I first mentioned.
As in the last post, let's use an existing degradome library as a test case: GSM512878, from Arabidopsis thaliana, published in Ma et al. (2010).
After mapping this library to the sense strand of the Arabidopsis transcriptome, we find the following distribution of peaks:
Category  Total  Fraction (of the 53,562,709 possible positions in the transcriptome)
0 18578 0.0003468458
1 9123 0.0001703237
2 364088 0.0067974157
3 97692 0.0018238809
4 182500 0.0034072212
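Each fraction above is just the peak count divided by the number of mappable positions; a quick sanity check (Python used here purely for illustration):

```python
# Fraction of all mappable transcriptome positions in each peak category
total_positions = 53_562_709  # possible degradome 5'-end positions (see above)
peak_counts = {0: 18_578, 1: 9_123, 2: 364_088, 3: 97_692, 4: 182_500}

fractions = {cat: count / total_positions for cat, count in peak_counts.items()}
for cat, frac in sorted(fractions.items()):
    print(f"category {cat}: {frac:.10f}")
```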
So, let's say we find miRNA / mRNA alignments for a typical miRNA, using the Allen et al. rules and find the following:
score n targets cumulative targets
0 0 0
0.5 0 0
1 0 0
1.5 0 0
2 3 3
2.5 1 4
3 0 4
3.5 0 4
4 2 6
4.5 3 9
5 10 19
5.5 5 24
6 58 82
6.5 32 114
7 175 289
So, just as an example, let's think about alignments with a score of 5 or less .. there are 19 of these in the table above. So, by random chance, if we consider alignments scoring 5 or less, we have 19 opportunities to get an alignment that corresponds with a degradome peak. As I stated in my last post, this can be considered a binomial process .. we have 19 'trials', each with a certain probability of 'success'. Success in this case is when position ten of an alignment corresponds with the position of a degradome peak. So, for a category 0 target, we have 19 trials, each with a probability of success of 0.0003468458. Using the binomial distribution, we can compute the chance of getting 1 or more random hits in this case as 0.0065695387. Using this logic, we can fill in the probabilities of random hits for each alignment score / category intersection. I have done so in the table below:
score n n_cum. cat0 cat1 cat2 cat3 cat4
0 0 0 0.00E+000 0.00E+000 0.00E+000 0.00E+000 0.00E+000
0.5 0 0 0.00E+000 0.00E+000 0.00E+000 0.00E+000 0.00E+000
1 0 0 0.00E+000 0.00E+000 0.00E+000 0.00E+000 0.00E+000
1.5 0 0 0.00E+000 0.00E+000 0.00E+000 0.00E+000 0.00E+000
2 3 3 1.04E-003 5.11E-004 2.03E-002 5.46E-003 1.02E-002
2.5 1 4 1.39E-003 6.81E-004 2.69E-002 7.28E-003 1.36E-002
3 0 4 1.39E-003 6.81E-004 2.69E-002 7.28E-003 1.36E-002
3.5 0 4 1.39E-003 6.81E-004 2.69E-002 7.28E-003 1.36E-002
4 2 6 2.08E-003 1.02E-003 4.01E-002 1.09E-002 2.03E-002
4.5 3 9 3.12E-003 1.53E-003 5.95E-002 1.63E-002 3.03E-002
5 10 19 6.57E-003 3.23E-003 1.22E-001 3.41E-002 6.28E-002
5.5 5 24 8.29E-003 4.08E-003 1.51E-001 4.29E-002 7.86E-002
6 58 82 2.80E-002 1.39E-002 4.28E-001 1.39E-001 2.44E-001
6.5 32 114 3.88E-002 1.92E-002 5.40E-001 1.88E-001 3.22E-001
7 175 289 9.54E-002 4.80E-002 8.61E-001 4.10E-001 6.27E-001
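Each entry in that table is simply the binomial chance of at least one success, 1 - (1 - p)^n, where n is the cumulative alignment count and p is the peak fraction for that category. A minimal sketch (Python for illustration; the fractions are taken from the table earlier in this post):

```python
# Chance of >= 1 spurious hit among n random alignments: 1 - (1 - p)^n
peak_fractions = {0: 0.0003468458, 1: 0.0001703237, 2: 0.0067974157,
                  3: 0.0018238809, 4: 0.0034072212}

def p_random_hit(n_alignments, peak_fraction):
    """Binomial P(X >= 1): at least one alignment lands on a degradome peak."""
    return 1.0 - (1.0 - peak_fraction) ** n_alignments

# 19 cumulative alignments scoring 5 or less, versus category 0 peaks:
print(p_random_hit(19, peak_fractions[0]))  # ~0.00657, i.e. 6.57E-003
```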
The only remaining step is to decide where to set a reasonable comfort level. Reasonable cutoffs might be 0.001, 0.01, or 0.05. As an example, let's say we will reject anything with a p >= 0.01 (i.e. >= 1E-2). Looking at the table above, this means we will reject cat.0 hits with alignment scores of 6 and above, cat.1 hits with alignment scores of 6 and above, ALL cat.2 hits, cat.3 hits with scores of 4 and above, and ALL cat.4 hits.
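Given the cumulative alignment counts and peak fractions, picking each category's score cutoff can be automated. A minimal sketch (Python for illustration; note that score rows with zero alignments pass trivially, so a category like cat. 2 that only 'passes' at scores with no alignments still has no acceptable hits in practice):

```python
# For each peak category, find the highest alignment score whose chance of a
# spurious hit, 1 - (1 - p)^n, stays below alpha (0.01 in the example above).
alpha = 0.01
peak_fractions = [0.0003468458, 0.0001703237, 0.0067974157,
                  0.0018238809, 0.0034072212]
scores = [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7]
n_cum  = [0, 0,   0, 0,   3, 4,   4, 4,   6, 9,  19, 24, 82, 114, 289]

cutoffs = {}
for cat, p in enumerate(peak_fractions):
    passing = [s for s, n in zip(scores, n_cum) if 1 - (1 - p) ** n < alpha]
    cutoffs[cat] = max(passing)
    print(f"cat. {cat}: accept alignment scores <= {cutoffs[cat]}")
```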
Note, this should not be interpreted as accepting all alignments below the p-value cutoff as well-supported predictions. The point here is NOT to predict targets based on alignments. Rather, the goal is to identify alignment score cutoffs, for each peak category, below which we have a reasonable belief that any observed degradome peaks, should they actually exist, are unlikely to be due to random chance.
The approach I sketch above has the advantage of NOT relying upon randomized/shuffled miRNA queries to assess degradome data. Shuffling is problematic for two reasons: computational expense, and the difficulty of generating shuffled queries that preserve the dinucleotide content of real miRNAs, which is likely to be non-random. In this new approach, no shuffles are required.
Next step is to actually get this into a new CleaveLand version.
OK, so, the new CleaveLand version is not done yet. But I think I have a handle on a more robust and computationally easier method for degradome statistics.
Here is the reasoning: In any given degradome library, a certain fraction of all possible nucleotides in the transcriptome have one or more degradome-derived 5' ends. For example, consider one of our more recent Arabidopsis thaliana degradome libraries (GSM512878 from Ma et al. 2010). The A. thaliana transcriptome has 54,305,815 non-ambiguous nucleotides within 35,386 transcripts (in this example I've been lazy and not accounted for plastid or mitochondrial transcripts, nor alternative splice forms). Because a tag must map along its full length, only 54,305,815 - (21 * 35,386) = 53,562,709 nts could theoretically have a degradome-derived 5' end mapped -- positions within the last 21 nts of each transcript are too close to the 3' end to allow a full-length tag to map. In the published degradome density file for GSM512878, I find 18,578 category 0 degradome peaks, 9,123 category 1 peaks, 364,088 category 2 peaks, 97,692 category 3 peaks, and 182,500 category 4 peaks. Put another way:
cat0 0.0342% of all possible transcript positions
cat1 0.0168% of all possible transcript positions
cat2 0.67% of all possible transcript positions
cat3 0.18% of all possible transcript positions
cat4 0.336% of all possible transcript positions
So, the critical question is: how likely is a given degradome 'hit' to occur by chance? This depends upon both the number of miRNA / target alignments we expect by chance, and the probability that any given position will have a degradome peak. This could be modeled as a binomial experiment: the number of alignments expected by chance is the number of trials, while the probability of success of each 'trial' is the fraction of all possible positions having a degradome peak. For instance, for the sake of argument, suppose that for any given random 21nt sequence you expect about 10 alignments in the transcriptome. What is the chance that exactly zero of these 10 random alignments intersects with a category 2 degradome peak (OK, more precisely, has a category 2 peak at the position opposite the 10th nucleotide of the alignment)? This is a binomial experiment with 10 trials, each with a probability of 0.0067, so the answer is given by the binomial density function. In R, this would be
> dbinom(0, size=10, prob=0.0067)
[1] 0.9349844
So, about 93.5% of the time, 10 random alignments wouldn't give you any category 2 hits in this dataset. Put another way, about 6.5% of the time you would see a hit by chance. If you think a > 5% chance of having an error is too high, you shouldn't trust such a hit. For category 0, however, the chances are much lower:
> dbinom(0, size=10, prob=0.000342)
[1] 0.9965853
So, only (1 - 0.9965853) = 0.34% of the time do you expect one or more spurious category 0 hits.
If you accept the above logic, then there seems to be a rational way to assess the likelihood of observing any particular degradome peak / miRNA alignment as convincing evidence for slicing. For any given alignment with a degradome peak, simply use the expected random number of alignments and the probability of any given transcriptome position having a peak to estimate the chance that your hit is random.
The problem then becomes: what number of alignments should we expect by chance? Obviously this depends upon the extent of complementarity -- perfectly complementary sites are much rarer than sites with seven mismatches. My next step is working out whether one can reliably use set values for this calculation.
- mike axtell