On colorspace to DNA-space translation

Consider a typical plant small RNA (in DNA form), a 24 nt RNA with a 5'-A:

ATGCGATCATGGTAAATGGGGTCT

Let's assume the 3' adapter for the SOLiD fragment libraries was the following (to the best of my current knowledge, this is the standard SOLiD small RNA 3' adapter):

CGCCTTGGCCGTACAGCAG

The "forward" read for SOLiD fragments begins with a T, which is part of the first di-base color that is read. So, the bases encoded in the read look as follows:

TATGCGATCATGGTAAATGGGGTCTCGCCTTGGCCGTACAGCAG
.                        ...................

Note that the dots above mark non-biological nucleotides that are encoded within the read.

The SOLiD system reads di-bases, with one of four possible colors representing a 2 nt chunk, moving down the template in 1 nt increments. Here is the table of color to di-base encodings:

0: AA, CC, GG, TT
1: AC, CA, GT, TG
2: AG, CT, GA, TC
3: AT, CG, GC, TA

From the above table, we can see that our example will have the following colors:

TA-AT-TG-GC-CG-GA-AT-TC-CA-AT-TG-GG-GT-TA-AA-AA-AT-TG-GG-GG-GG-GT-TC-CT-TC-CG-GC-CC-CT-TT-TG-GG-GC-CC-CG
3  3  1  3  3  2  3  2  1  3  1  0  1  3  0  0  3  1  0  0  0  1  2  2  2  3  3  0  2  0  1  0  3  0  3
.                                                                       .  .  .  .  .  .  .  .  .  .  .
h  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  h  a  a  a  a  a  a  a  a  a  a

. : colors resulting from a dibase containing at least one non-biological nucleotide
h : hybrid dibase, where one adapter/primer nucleotide and one biological nucleotide contribute to the dibase
b : biological dibase, where both nucleotides were from the small RNA insert
a : adapter dibase, where both contributing nucleotides were from the adapter

In this example, note that the first color is not entirely dictated by the insert. Similarly, the '2' at the end of the insert represents a hybrid color. This leads to a key property of adapter trimming of SOLiD reads: the last color left after proper trimming represents the hybrid di-base formed by the end of the insert small RNA and the beginning of the 3' adapter.
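
The encoding walked through above can be reproduced with a few lines of Python. This is only a minimal sketch: the names `encode_colors`, `label_colors` and `READ_LENGTH` are mine, and the 35-color read length is an assumption taken from the length of the example, not a property of any particular tool.

```python
# Color table from the text above: each color stands for four di-bases.
COLOR_GROUPS = {
    "0": ("AA", "CC", "GG", "TT"),
    "1": ("AC", "CA", "GT", "TG"),
    "2": ("AG", "CT", "GA", "TC"),
    "3": ("AT", "CG", "GC", "TA"),
}
DIBASE_TO_COLOR = {dibase: color
                   for color, dibases in COLOR_GROUPS.items()
                   for dibase in dibases}


def encode_colors(bases):
    """Return one color per overlapping di-base, i.e. len(bases) - 1 colors."""
    return "".join(DIBASE_TO_COLOR[bases[i:i + 2]]
                   for i in range(len(bases) - 1))


def label_colors(insert_len, n_colors):
    """Label each color h (hybrid), b (biological) or a (adapter).

    Assumes the read is long enough to run past the insert into the adapter.
    """
    n_adapter = n_colors - insert_len - 1
    return "h" + "b" * (insert_len - 1) + "h" + "a" * n_adapter


PRIMER_BASE = "T"
INSERT = "ATGCGATCATGGTAAATGGGGTCT"
ADAPTER = "CGCCTTGGCCGTACAGCAG"
READ_LENGTH = 35  # assumption: the example read is 35 colors long

colors = encode_colors(PRIMER_BASE + INSERT + ADAPTER)[:READ_LENGTH]
labels = label_colors(len(INSERT), READ_LENGTH)
print(colors)  # 33133232131013003100012223302010303
print(labels)
```

The two printed strings line up position by position, so the hybrid colors are the ones under each 'h'.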

In our example, the adapter-derived colors (including the first, hybrid color) are x3302010303, where 'x' represents 0, 1, 2, or 3. A reasonable adapter to search for is the first 8 nts: in our example, the user could input 'CGCCTTGG' as the adapter sequence. The first step would then be to translate the adapter into colorspace. From the table above, we find that only 7 dibases are known from the 8 nt input:

CG GC CC CT TT TG GG
3  3  0  2  0  1  0

Once this translation is accomplished, we search the read for an occurrence of the translated color string, within the number of allowed substitutions specified by the user. Thus, a match looks like (quality values omitted for clarity):

@[read_name]
T33133232131013003100012223302010303
                          3302010
 h<---------b----------->h<--a----->     # h, b, and a as above

The trimming takes place in colorspace, AND RETAINS THE 3' HYBRID COLOR! So, our above example, after trimming, becomes:

@[read_name]
T3313323213101300310001222
 h<---------b----------->h     # h, b, and a as above

If this read were to be mapped in colorspace, bowtie would (correctly) ignore the leading T3, and decode the colors "313323213101300310001222". Only 23 of the 24 nts of the insert will be represented by two flanking di-bases:

AT-TG-GC-CG-GA-AT-TC-CA-AT-TG-GG-GT-TA-AA-AA-AT-TG-GG-GG-GG-GT-TC-CT-TC
3  1  3  3  2  3  2  1  3  1  0  1  3  0  0  3  1  0  0  0  1  2  2  2

Note that the 5' A is only represented by one color, not two. Bowtie therefore cannot decode the first nucleotide correctly when mapping in colorspace. Therefore, when mapped in colorspace, only the last 23 nts will be decoded from a 24 nt insert, and the 5' position and nucleotide of the read will not be correctly noted. The result is that the sizes present in the mapping file (.sam or others) will NOT represent the actual sizes of the small RNAs themselves: they will be truncated by 1 nt at the 5' end.
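
The adapter translation, search, and trimming steps described above could be sketched roughly as follows. This is an illustration only, not the code of any actual trimmer: `find_adapter`, `trim_read` and the `max_mismatches` parameter are hypothetical names, and returning `None` when no adapter is found is my own assumption about how such reads would be handled.

```python
# Di-base -> color lookup, built from the color table in the text.
DIBASE_TO_COLOR = {}
for color, dibases in {"0": "AA CC GG TT", "1": "AC CA GT TG",
                       "2": "AG CT GA TC", "3": "AT CG GC TA"}.items():
    for dibase in dibases.split():
        DIBASE_TO_COLOR[dibase] = color


def encode_colors(bases):
    """Translate n bases into the n - 1 colors of their overlapping di-bases."""
    return "".join(DIBASE_TO_COLOR[bases[i:i + 2]]
                   for i in range(len(bases) - 1))


def find_adapter(read_colors, adapter_colors, max_mismatches=0):
    """Return the leftmost start index where adapter_colors matches with at
    most max_mismatches substitutions, or -1 if there is no such window."""
    width = len(adapter_colors)
    for start in range(len(read_colors) - width + 1):
        window = read_colors[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, adapter_colors))
        if mismatches <= max_mismatches:
            return start
    return -1


def trim_read(cs_read, adapter_bases, max_mismatches=0):
    """Trim a colorspace read (primer base + colors) at the adapter.

    The translated adapter starts one color AFTER the hybrid color, so cutting
    at the match position keeps the 3' hybrid color, as described in the text.
    Returns None if the adapter is not found or nothing precedes it
    (hypothetical behaviour chosen for this sketch).
    """
    primer_base, colors = cs_read[0], cs_read[1:]
    hit = find_adapter(colors, encode_colors(adapter_bases), max_mismatches)
    if hit <= 0:
        return None
    return primer_base + colors[:hit]


adapter_colors = encode_colors("CGCCTTGG")
trimmed = trim_read("T33133232131013003100012223302010303", "CGCCTTGG")
print(adapter_colors)  # 3302010
print(trimmed)         # T3313323213101300310001222
```

Taking the leftmost match is a design choice of the sketch; with a generous mismatch allowance, a short adapter string could match inside the insert by chance.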

In principle, this is not a problem, since one could perform all downstream analyses with this knowledge by simply adding the one nt at the 5' end of each mapping. In practice, however, this has the potential to lead to serious confusion.

The alternative to mapping in colorspace is to "translate" the colorspace read into DNA before mapping. Based on the dibase table shown above, and because we know the first di-base had a 'T' from the primer in the first position, we could translate a trimmed colorspace read to DNA as follows:

Trimmed colorspace read from above:

T3313323213101300310001222

Relationship between first di-base nt, color, and second di-base nt:

'A0' => 'A', 'A1' => 'C', 'A2' => 'G', 'A3' => 'T',
'C0' => 'C', 'C1' => 'A', 'C2' => 'T', 'C3' => 'G',
'G0' => 'G', 'G1' => 'T', 'G2' => 'A', 'G3' => 'C',
'T0' => 'T', 'T1' => 'G', 'T2' => 'C', 'T3' => 'A',

So, translating the read above, we begin with:

T3 : A
A3 : T
T1 : G
...

and so on, remembering to ignore the last color, to get:

ATGCGATCATGGTAAATGGGGTCT

This is fine except that if there is a single wrong color, the translation is "frameshifted", and all translated nts at and downstream of the color error are incorrect. This is the downside of translation to DNA before mapping.

Mapping in colorspace
Pros: Single color errors (or more, depending on the mismatch threshold during mapping) are not catastrophic.
Cons: First nucleotides cannot be recovered directly from bowtie output. Interpretation of the precise position of the sequenced insert, and its size, becomes tedious, and there is the potential for confusion.

Mapping after translation of trimmed read to DNA
Pros: Full-length alignments can be recovered.
Cons: Single color errors cause a catastrophic 'frameshift' error in translation, and the affected reads won't be mapped. There will be a reduction in mappable reads.
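
To make the translation step and its weakness concrete, here is a small sketch of the decoding described above, followed by the same read with one deliberately corrupted color. The function name `colors_to_dna` and the choice of which color to corrupt are my own, purely for illustration.

```python
# Transition table from the text: previous base + color -> next base.
NEXT_BASE = {
    "A0": "A", "A1": "C", "A2": "G", "A3": "T",
    "C0": "C", "C1": "A", "C2": "T", "C3": "G",
    "G0": "G", "G1": "T", "G2": "A", "G3": "C",
    "T0": "T", "T1": "G", "T2": "C", "T3": "A",
}


def colors_to_dna(trimmed_read):
    """Translate a trimmed colorspace read (primer base + colors) to DNA.

    The last color is the 3' hybrid color and is ignored, as in the text.
    """
    previous = trimmed_read[0]          # primer base, 'T' in the example
    bases = []
    for color in trimmed_read[1:-1]:    # skip primer base and hybrid color
        previous = NEXT_BASE[previous + color]
        bases.append(previous)
    return "".join(bases)


read = "T3313323213101300310001222"
good = colors_to_dna(read)
print(good)  # ATGCGATCATGGTAAATGGGGTCT

# Simulate one sequencing error: change the 10th color from 3 to 0.
bad_read = read[:10] + "0" + read[11:]
bad = colors_to_dna(bad_read)
print(bad)

# Bases before the error are unaffected; every base from the error on is wrong.
print(good[:9] == bad[:9])
print(all(g != b for g, b in zip(good[9:], bad[9:])))
```

Each base is derived from the one before it, so a single miscalled color changes every base that follows, which is why such a translated read will generally fail to map.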

In experiments where I have run the same mapping of trimmed small RNA reads in either colorspace or translated DNA-space, I have observed about an 8.5% loss of mappable reads when translating to DNA first. My opinion is that this is an acceptable price to pay for having map data that is easy to interpret, where the mapped information directly reflects the size of the small RNA itself.