Tuesday, November 21, 2006

Information in RNA codons

m-RNA codons (or triplets) consist of three nucleotide bases, usually identified by their first letter: U(racil), C(ytosine), A(denine), and G(uanine). In DNA, T(hymine) replaces U(racil).

The codons code for 20 amino acids, and three codons function as stop codes. The codon ('AUG') that codes for the amino acid methionine also functions as start code. In short, we can say that the codons carry 21 instructions: 20 amino acids (including methionine/start code) plus stop.

The total information content of the instructions, measured in bits, is log2(21) = 4.3923. The maximum information capacity of a single nucleotide is log2(4) = 2, since there are four different nucleotides, and the maximum information capacity of a codon is log2(4^3) = 3*log2(4) = 6. That is, there is close to 1.6 bits of 'unused' information capacity in the codons, almost an entire nucleotide's worth.
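The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation in the text; the variable names are my own, and the count of 21 instructions is taken as given.

```python
import math

# 20 amino acids plus stop, as counted in the text.
n_instructions = 21

bits_needed = math.log2(n_instructions)   # information in one instruction
bits_per_nucleotide = math.log2(4)        # four possible nucleotides
bits_per_codon = 3 * bits_per_nucleotide  # three nucleotides per codon
unused = bits_per_codon - bits_needed     # capacity the code does not use

print(f"needed: {bits_needed:.4f} bits")
print(f"codon capacity: {bits_per_codon:.0f} bits")
print(f"unused: {unused:.4f} bits")
```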

If we look at the first nucleotide in codons, we get this table:

U   7   2.8074
C   5   2.3219
A   7   2.8074
G   5   2.3219
    24

The first column indicates the nucleotide, the second column counts the number of instructions that have a codon with that nucleotide in first position, and the third column shows the residual information needed; that is, the information needed in the remaining two nucleotides to resolve which instruction is coded for. The fifth row gives the sum of the second column over the four preceding rows. Since some amino acids, such as leucine, have codons with different initial nucleotides, the number of instructions is 24 rather than 21.
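For anyone who wants to check the table, here is a minimal sketch that rebuilds it from the standard genetic code. The compact string encoding of the code (one-letter amino acid symbols, '*' for stop, nucleotides in UCAG order) and the helper name residual_table are my own assumptions for illustration, not something from the original calculation.

```python
import math
from itertools import product

BASES = "UCAG"
# Assumption: standard genetic code, one-letter amino acid symbols, '*' = stop,
# listed in UCAG order with the third nucleotide varying fastest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO)}

def residual_table(prefix_len):
    """Map each codon prefix to (instruction count, residual bits)."""
    instructions = {}
    for codon, aa in CODE.items():
        instructions.setdefault(codon[:prefix_len], set()).add(aa)
    return {p: (len(s), math.log2(len(s))) for p, s in instructions.items()}

first = residual_table(1)
for prefix, (count, bits) in first.items():
    print(f"{prefix}   {count}   {bits:.4f}")

# Sum of the second column; amino acids split across rows are counted twice.
total = sum(count for count, _ in first.values())
print(f"    {total}")
```

Calling residual_table(2) gives the corresponding counts for the first two nucleotides.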

Looking at the first two nucleotides, we get this table:

UU   2   1
UC   1   0
UA   2   1
UG   3   1.5850
CU   1   0
CC   1   0
CA   2   1
CG   1   0
AU   2   1
AC   1   0
AA   2   1
AG   2   1
GU   1   0
GC   1   0
GA   2   1
GG   1   0
     25

Explanations are analogous to the above.

As can be seen, in only one case, 'UG', is the residual information greater than 1, and in eight cases, that is in half of all cases, the residual information is 0. This means that the third nucleotide actually carries very little information; none at all, where there's a '1' in the second column above (or a '0' in the third column).
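The tally in this paragraph can be checked with a short sketch. As before, the string encoding of the standard genetic code and all variable names are assumptions of mine, chosen only to keep the example self-contained.

```python
import math
from collections import Counter
from itertools import product

BASES = "UCAG"
# Assumption: standard genetic code in UCAG order, '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO)}

# Distinct instructions still possible after reading the first two nucleotides.
options = {}
for codon, aa in CODE.items():
    options.setdefault(codon[:2], set()).add(aa)

# Residual information (bits) left for the third nucleotide to resolve.
residual = {prefix: math.log2(len(aas)) for prefix, aas in options.items()}

# How many prefixes leave 1, 2 or 3 instructions open.
tally = Counter(len(aas) for aas in options.values())

zero_residual = [p for p, bits in residual.items() if bits == 0]
above_one_bit = [p for p, bits in residual.items() if bits > 1]
print("no information in third nucleotide:", zero_residual)
print("more than one bit left:", above_one_bit)
print("prefixes by number of open instructions:", dict(tally))
```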

To make this somewhat clearer, let's rearrange the table to have it ordered by the second nucleotide:

UU   2
CU   1
AU   2
GU   1
UC   1
CC   1
AC   1
GC   1
UA   2
CA   2
AA   2
GA   2
UG   3
CG   1
AG   2
GG   1

As can be clearly seen here, the codons whose instruction is fully determined by the first two nucleotides (those with a '1' in the second column above) are those that have a 'C' as the second nucleotide, or a 'C' or 'G' as the first nucleotide and a 'U' or 'G' as the second nucleotide. Put differently, if a codon has a 'C' as the second nucleotide, or it has a 'C' or 'G' as the first nucleotide and the second nucleotide is not an 'A', the third nucleotide does not add any information.
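The rule can be tested mechanically against the standard genetic code. The sketch below is my own illustration: the function names rule and rule_alt are hypothetical, and the string encoding of the code is an assumption made to keep the example self-contained.

```python
from itertools import product

BASES = "UCAG"
# Assumption: standard genetic code in UCAG order, '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO)}

def rule(prefix):
    """First phrasing: second is C, or first is C/G and second is U/G."""
    first, second = prefix
    return second == "C" or (first in "CG" and second in "UG")

def rule_alt(prefix):
    """Second phrasing: second is C, or first is C/G and second is not A."""
    first, second = prefix
    return second == "C" or (first in "CG" and second != "A")

prefixes = ["".join(p) for p in product(BASES, repeat=2)]

# Prefixes for which all four third nucleotides give the same instruction.
determined = {p for p in prefixes if len({CODE[p + b] for b in BASES}) == 1}

predicted = {p for p in prefixes if rule(p)}
predicted_alt = {p for p in prefixes if rule_alt(p)}

print("determined by two nucleotides:", sorted(determined))
print("rule matches:", determined == predicted == predicted_alt)
```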

This suggests that the RNA/DNA codes have themselves evolved from an original two-nucleotide code based on 'C' (a pyrimidine) and 'G' (a purine), or from one that simply distinguished only between pyrimidines ('C' and 'U'/'T') and purines ('G' and 'A').


Acknowledgements: When I originally wrote this post, I had not noticed that there are two groups of serine, and I therefore counted 21 amino acids. I posted a part of this post at TheologyWeb, and user Roy made me aware of the error.

