Another view of CleaveLand statistics -- an even simpler way?
In my last post, I introduced the idea that one can estimate the chances of getting a spurious degradome 'hit' by examining the fraction of all positions within a transcriptome occupied by degradome data in a given library. After a little reflection, the next step seems even simpler than I first mentioned.
As in the last post, let's use an existing degradome library as a test case: GSM512878, from Arabidopsis thaliana, published in Ma et al. (2010).
After mapping this library to the sense strand of the Arabidopsis transcriptome, we find the following distribution of peaks:
Category Total Fraction (there are 53,562,709 possible positions in the transcriptome)
0 18578 0.0003468458
1 9123 0.0001703237
2 364088 0.0067974157
3 97692 0.0018238809
4 182500 0.0034072212
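The 'Fraction' column appears to be each category's peak count divided by the number of possible transcriptome positions; under that assumption, it can be reproduced with a few lines of Python:

```python
# Peak counts per category for this library (GSM512878), as in the table
# above; the transcriptome offers 53,562,709 possible positions.
TXOME_POSITIONS = 53_562_709
peak_counts = {0: 18578, 1: 9123, 2: 364088, 3: 97692, 4: 182500}

# Fraction = chance that one randomly chosen transcriptome position
# carries a degradome peak of the given category.
fractions = {cat: n / TXOME_POSITIONS for cat, n in peak_counts.items()}

for cat, frac in fractions.items():
    print(f"category {cat}: {frac:.10f}")
```

These fractions serve as the per-trial 'success' probabilities in the binomial calculation below.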
So, let's say we find miRNA / mRNA alignments for a typical miRNA using the Allen et al. scoring rules, with the following distribution:
score n targets cumulative targets
0 0 0
0.5 0 0
1 0 0
1.5 0 0
2 3 3
2.5 1 4
3 0 4
3.5 0 4
4 2 6
4.5 3 9
5 10 19
5.5 5 24
6 58 82
6.5 32 114
7 175 289
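The 'cumulative targets' column is simply a running sum of the 'n targets' column (scores step by 0.5 from 0 to 7), which is easy to verify:

```python
from itertools import accumulate

# Per-score target counts from the table above, for scores 0, 0.5, ..., 7.
n_targets = [0, 0, 0, 0, 3, 1, 0, 0, 2, 3, 10, 5, 58, 32, 175]

# Running sum reproduces the 'cumulative targets' column.
cumulative = list(accumulate(n_targets))
print(cumulative)  # [0, 0, 0, 0, 3, 4, 4, 4, 6, 9, 19, 24, 82, 114, 289]
```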
So, just as an example, let's think about alignments with a score of 5 or less: there are 19 of these in the table above. By random chance, if we consider alignments scoring 5 or less, we have 19 opportunities to get an alignment that corresponds with a degradome peak. As I stated in my last post, this can be treated as a binomial process: we have 19 'trials', each with a certain probability of 'success', where success means that position ten of an alignment corresponds with the position of a degradome peak. So, for a category 0 target, we have 19 trials, each with a probability of success of 0.0003468458. Using the binomial distribution, the chance of getting 1 or more random hits in this case (1 - (1 - p)^19) comes out to 0.0065695387. Using this logic, we can fill in the table with the probabilities of random hits for each alignment score / category intersection. I have done so in the table below:
score n n_cum. cat0 cat1 cat2 cat3 cat4
0 0 0 0.00E+000 0.00E+000 0.00E+000 0.00E+000 0.00E+000
0.5 0 0 0.00E+000 0.00E+000 0.00E+000 0.00E+000 0.00E+000
1 0 0 0.00E+000 0.00E+000 0.00E+000 0.00E+000 0.00E+000
1.5 0 0 0.00E+000 0.00E+000 0.00E+000 0.00E+000 0.00E+000
2 3 3 1.04E-003 5.11E-004 2.03E-002 5.46E-003 1.02E-002
2.5 1 4 1.39E-003 6.81E-004 2.69E-002 7.28E-003 1.36E-002
3 0 4 1.39E-003 6.81E-004 2.69E-002 7.28E-003 1.36E-002
3.5 0 4 1.39E-003 6.81E-004 2.69E-002 7.28E-003 1.36E-002
4 2 6 2.08E-003 1.02E-003 4.01E-002 1.09E-002 2.03E-002
4.5 3 9 3.12E-003 1.53E-003 5.95E-002 1.63E-002 3.03E-002
5 10 19 6.57E-003 3.23E-003 1.22E-001 3.41E-002 6.28E-002
5.5 5 24 8.29E-003 4.08E-003 1.51E-001 4.29E-002 7.86E-002
6 58 82 2.80E-002 1.39E-002 4.28E-001 1.39E-001 2.44E-001
6.5 32 114 3.88E-002 1.92E-002 5.40E-001 1.88E-001 3.22E-001
7 175 289 9.54E-002 4.80E-002 8.61E-001 4.10E-001 6.27E-001
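As a sketch, assuming the category fractions above are used directly as per-trial success probabilities, the whole table can be regenerated with the binomial shortcut P(at least 1 hit) = 1 - P(0 hits) = 1 - (1 - p)^n:

```python
# Reproduce the probability table above. For each cumulative alignment
# count n and each category's per-position peak fraction p, the chance
# of one or more random hits is 1 - (1 - p) ** n.
cat_fractions = [0.0003468458, 0.0001703237, 0.0067974157,
                 0.0018238809, 0.0034072212]          # cat0 .. cat4
scores = [s / 2 for s in range(15)]                   # 0, 0.5, ..., 7
n_cum = [0, 0, 0, 0, 3, 4, 4, 4, 6, 9, 19, 24, 82, 114, 289]

print("score n_cum " + " ".join(f"cat{i}" for i in range(5)))
for score, n in zip(scores, n_cum):
    probs = [1 - (1 - p) ** n for p in cat_fractions]
    print(f"{score:5} {n:5} " + " ".join(f"{p:.2E}" for p in probs))
```

For instance, the category 0 entry at score 5 comes out to 1 - (1 - 0.0003468458)^19, approximately 0.00657, matching the 6.57E-003 in the table.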
The only remaining step is to decide on a reasonable comfort level. Reasonable cutoffs might be 0.001, 0.01, or 0.05. As an example, let's say we will reject anything with p >= 0.01 (i.e., >= 1E-2). Looking at the table above, this means we will reject cat.0 hits with alignment scores of 6 and above, cat.1 hits with alignment scores of 6 and above, ALL cat.2 hits, cat.3 hits with scores of 4 and above, and ALL cat.4 hits.
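Those per-category score cutoffs can be read off programmatically; here is a sketch using a 0.01 cutoff and considering only score rows with at least one alignment:

```python
# For each category, find the largest alignment score whose chance of a
# random hit stays below the cutoff. Rows with zero alignments are
# skipped, since they contribute no hits to accept or reject.
CUTOFF = 0.01
cat_fractions = [0.0003468458, 0.0001703237, 0.0067974157,
                 0.0018238809, 0.0034072212]          # cat0 .. cat4
scores = [s / 2 for s in range(15)]                   # 0, 0.5, ..., 7
n_cum = [0, 0, 0, 0, 3, 4, 4, 4, 6, 9, 19, 24, 82, 114, 289]

for cat, p_cat in enumerate(cat_fractions):
    ok = [s for s, n in zip(scores, n_cum)
          if n > 0 and 1 - (1 - p_cat) ** n < CUTOFF]
    if ok:
        print(f"cat {cat}: accept alignment scores <= {max(ok)}")
    else:
        print(f"cat {cat}: reject all hits")
```

This recovers the conclusions above: scores up to 5.5 for categories 0 and 1, up to 3.5 for category 3, and nothing for categories 2 and 4.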
Note that this should not be interpreted as accepting all alignments below the p-value cutoff as well-supported predictions. The point here is NOT to predict targets based on alignments. Rather, the goal is to identify alignment score cutoffs, for each peak category, below which any degradome peaks we do observe are unlikely to be due to random chance.
The approach I sketch above has the advantage of NOT relying on randomized/shuffled miRNA queries to assess degradome data. Shuffling is problematic for two reasons: computational expense, and the difficulty of generating shuffled queries that match the dinucleotide content of real miRNAs, which is likely to be non-random. In this new approach, no shuffles are required.
The next step is to actually get this into a new CleaveLand version.