Hm. So it appears that, two years ago, I wrote a post on calculating the average gene length in prokaryotes. I found a half-draft of the second part and decided to finish it off.

In part 1 we defined **mGenes** (“maybe-genes”) as the pieces you get after breaking the genome at each stop codon, and predicted that the probability of finding an mGene of length *L* is given by the following equation:

**Equation 1:** The probability of finding an mGene of length L.

By plotting this function it is clear that the probability of a set of codons being an mGene plummets quickly, so that there is nearly a 0 probility of finding an mGene of 100 codons (300 bases) in a random sequence (black line in Fig. 1). I confirmed this with a 1,000,000 base synthetic genome (all code is at the end of this entry), resulting in the red circles in Fig. 1 that perfectly overlap with the prediction line.

**Fig. 1:** Predicted frequency of mGenes. Red circles, results from 1Megabase simulated genome. Black line, function shown in Equation 1.

So now we know what to expect from a completely random genome: Nearly all mGenes will be less than 100 codons (300 bases) in length. This is much shorter than your typical gene, and so I would expect there to be a large number of mGenes larger than this size in a real genome. So let’s check it out, using a fully sequenced prokaryotic genome!

Continue reading →