OK, so, the new CleaveLand version is not done yet. But I think I have a handle on a more robust and computationally easier method for degradome statistics.
Here is the reasoning: In any given degradome library, a certain fraction of all possible nts. in the transcriptome have one or more degradome-derived 5' ends. For example, consider one of our more recent Arabidopsis thaliana degradome libraries (GSM512878 from Ma et al. 2010). The A. thaliana transcriptome has 54,305,815 non-ambiguous nucleotides within 35,386 transcripts (in this example I've been lazy and not accounted for plastid or mitochondrial transcripts nor alternative splice forms). If we used 20nt reads to map to the transcriptome, only 54,305,815 - (20 * 35,386) = 53,562,709nts could theoretically have a degradome-derived 5' end mapped -- this is because the 20nts on the end of each transcript don't allow mapping of the most 3' of tags. In the published degradome density file for GSM512878, I find 18,578 category 0 degradome peaks, 9,123 category 1 degradome peaks, 364,088 category 2 degradome peaks, 97,692 category 3 degradome peaks, and 182,500 category 4 peaks. Put another way:
cat0 0.0342% of all possible transcript positions
cat1 0.0168% of all possible transcript positions
cat2 0.67% of all possible transcript positions
cat3 0.18% of all possible transcript positions
cat4 0.336% of all possible transcript positions
So, the critical question is how likely is a given degradome 'hit' to occur by chance? This depends upon both the number of miRNA / target alignments we expect by chance, and the probability that any given position will have a degradome peak. This could be modeled as a binomial experiment: The number of alignments expected by chance is the number of trials, while the probability of success of each 'trial' is the fraction of all possible positions having a degradome peak. For instance, for the sake of argument, suppose that you expect for any given random 21nt sequence that there are about 10 alignments in the transcriptome. What is the chance that exactly zero of these 10 random alignments intersects with a category 2 degradome peak (OK, more precisely, has the 10th nucleotide of the alignment)? This is a binomial experiment with 10 trials, each with a probability of 0.0067, so the answer is given by the binomial density function. In R, this would be
> dbinom(0, size=10, prob=0.0067)
[1] 0.9349844
So, about 93.5% of the time, 10 random alignments wouldn't give you any category 2 hits in this dataset. Put another way, about 6.5% of the time you would see a hit by chance. If you think a > 5% chance of having an error is too high, you shouldn't trust such a hit. For category 0, however, the chances are much lower:
> dbinom(0, size=10, prob=0.000342)
[1] 0.9965853
So, only (1 - 0.9965853) = 0.34% of the time do you expect one or more spurious category 0 hits.
If you accept the above logic, then there seems to be a rational way to assess the likelihood of observing any particular degradome peak / miRNA alignment as convincing evidence for slicing. For any given alignment with a degradome peak, simply use the expected random number of alignments and the probability of any given transcriptome position having a peak to estimate the chance that your hit is random.
The problem then becomes what are the expected numbers of alignments expected by chance? Obviously this depends upon the extent of complementarity -- perfectly complementary sites are much rarer than sites with seven mismatches. My next step is working out whether one can reliably use set values for this calculation.
- mike axtell
Here is the reasoning: In any given degradome library, a certain fraction of all possible nts. in the transcriptome have one or more degradome-derived 5' ends. For example, consider one of our more recent Arabidopsis thaliana degradome libraries (GSM512878 from Ma et al. 2010). The A. thaliana transcriptome has 54,305,815 non-ambiguous nucleotides within 35,386 transcripts (in this example I've been lazy and not accounted for plastid or mitochondrial transcripts nor alternative splice forms). If we used 20nt reads to map to the transcriptome, only 54,305,815 - (20 * 35,386) = 53,562,709nts could theoretically have a degradome-derived 5' end mapped -- this is because the 20nts on the end of each transcript don't allow mapping of the most 3' of tags. In the published degradome density file for GSM512878, I find 18,578 category 0 degradome peaks, 9,123 category 1 degradome peaks, 364,088 category 2 degradome peaks, 97,692 category 3 degradome peaks, and 182,500 category 4 peaks. Put another way:
cat0 0.0342% of all possible transcript positions
cat1 0.0168% of all possible transcript positions
cat2 0.67% of all possible transcript positions
cat3 0.18% of all possible transcript positions
cat4 0.336% of all possible transcript positions
So, the critical question is how likely is a given degradome 'hit' to occur by chance? This depends upon both the number of miRNA / target alignments we expect by chance, and the probability that any given position will have a degradome peak. This could be modeled as a binomial experiment: The number of alignments expected by chance is the number of trials, while the probability of success of each 'trial' is the fraction of all possible positions having a degradome peak. For instance, for the sake of argument, suppose that you expect for any given random 21nt sequence that there are about 10 alignments in the transcriptome. What is the chance that exactly zero of these 10 random alignments intersects with a category 2 degradome peak (OK, more precisely, has the 10th nucleotide of the alignment)? This is a binomial experiment with 10 trials, each with a probability of 0.0067, so the answer is given by the binomial density function. In R, this would be
> dbinom(0, size=10, prob=0.0067)
[1] 0.9349844
So, about 93.5% of the time, 10 random alignments wouldn't give you any category 2 hits in this dataset. Put another way, about 6.5% of the time you would see a hit by chance. If you think a > 5% chance of having an error is too high, you shouldn't trust such a hit. For category 0, however, the chances are much lower:
> dbinom(0, size=10, prob=0.000342)
[1] 0.9965853
So, only (1 - 0.9965853) = 0.34% of the time do you expect one or more spurious category 0 hits.
If you accept the above logic, then there seems to be a rational way to assess the likelihood of observing any particular degradome peak / miRNA alignment as convincing evidence for slicing. For any given alignment with a degradome peak, simply use the expected random number of alignments and the probability of any given transcriptome position having a peak to estimate the chance that your hit is random.
The problem then becomes what are the expected numbers of alignments expected by chance? Obviously this depends upon the extent of complementarity -- perfectly complementary sites are much rarer than sites with seven mismatches. My next step is working out whether one can reliably use set values for this calculation.
- mike axtell