Average gene length in prokaryotes (part 2)

Hm. So it appears that, two years ago, I wrote a post on calculating the average gene length in prokaryotes. I found a half-draft of the second part and decided to finish it off.

In part 1 we defined mGenes (“maybe-genes”) as the pieces you get after breaking the genome at each stop codon, and predicted that the probability of finding an mGene of length L is given by the following equation:

Equation 1: The probability of finding an mGene of length L.

By plotting this function it is clear that the probability of a set of codons being an mGene plummets quickly, so that there is nearly a 0 probility of finding an mGene of 100 codons (300 bases) in a random sequence (black line in Fig. 1). I confirmed this with a 1,000,000 base synthetic genome (all code is at the end of this entry), resulting in the red circles in Fig. 1 that perfectly overlap with the prediction line.

Fig. 1: Predicted frequency of mGenes. Red circles, results from 1Megabase simulated genome. Black line, function shown in Equation 1.

So now we know what to expect from a completely random genome: Nearly all mGenes will be less than 100 codons (300 bases) in length. This is much shorter than your typical gene, and so I would expect there to be a large number of mGenes larger than this size in a real genome. So let’s check it out, using a fully sequenced prokaryotic genome!

NCBI maintains a wonderful FTP server full of bacterial/archaeal genomes, but I just want the genome from a good old E.coli strain. There are over 50 to choose from, so I picked one more or less at random: a strain of E.coli 0157H7. From the 0157H7 link , download the file called BA000007.fna, where fna means Fasta Nucleic Acid (in other words, it’s the genome as DNA in FASTA format). I used the same code as for the synthetic data above (code is at the end of this post) and found that the real genomic DNA has a distribution that is remarkably similar to the random synthetic genome (Fig. 2), though it does depart in an important way. See how the blue (real) curve drops faster at first but then changes its final slope to decrease more slowly? This suggests that the distribution of sizes is weighted towards the right. In other words, there is an overall increase in the size of mGenes.

Fig. 2: Probability of mGene lengths. The black “synthetic” line falls on top of the red “predicted” line.

This is hard to see from Fig. 2, so I made a cumulative distribution plot (Fig. 3) in hopes that it would clarify. It shows the same thing, but might be even more difficult to see…

Fig. 3: Cumulative distribution function of codon lengths. The black “synthetic” curve lies on top of the red “predicted” curve.

So what is going on here? I expected to see a much bigger shift towards longer mGenes. I propose the following:

To calculate these probabilities, I used every reading frame (there are six). So let’s say that there was a long mGene in frame 1 that corresponded to a real gene. In all 5 remaining frames, it is likely that that same length of DNA will be composed of multiple smaller mGenes. This is because a gene is only coded in one reading frame, and so in all other frames the sequence is essentially random. In effect, the increase in long mGenes is diluted by considering that same mGene in all other frames!

How do we get around this to answer the original question: What is the average length of a prokaryotic gene? To get rid of the dilution problem we can make synthetic genomes of the same length as the real genome, and then simply compare the number of mGenes in each that are larger than 100 codons. I will leave this for a future post.


To make synthetic genomes and break them up into mGenes in Python [EDIT: The code originally did not take the reverse complement of the sequence to get all frames. This has been fixed, and had no effect on the results shown above.]:

import random
import re

def random_genome( length , savename ):
    """Write a random string of AGCT of size 'length' to a file 'savename'."""
    genome_file = open( savename, 'w' )
    genome_file.write('>This is a randomly-generated sequence of length '+str(length) + '\n')
    for i in range( int(length) ): genome_file.write(random.choice( 'AGCT' ))

def mgenes_find( fasta, mgene_file ):
    """ Given a fasta file name, write a file containing all mGenes."""
    fasta  = open(      fasta, 'r' )
    mgenes = open( mgene_file, 'w' )
    mgenes.write( 'MGENE,LENGTH\n' )
    genome = fasta.read().strip()
    # 6 reading frames!
    for sense in [1,-1]:
        if sense == -1:
        for frame in [0,1,2]:
            codons = re.findall( '([AGCT]{3})', genome[::sense][frame:] )
            for c,codon in enumerate(codons):
                if codon in ['TAA','TGA','TAG']: codons[c] = 'STOP'
            genes  = ''.join(codons)
            genes  = re.split('STOP',genes)
            [mgenes.write(gene+','+str(len(gene))+'\n') for gene in genes]

def main():
    random_genome( 1000000, 'fake_genome.fasta' )
    mgenes_find( 'fake_genome.fasta', 'fake_mgenes.csv' )
    mgenes_find( 'BA000007.fna', 'real_mgenes.csv' )


To plot the resulting data in R:

D.fake   = read.table( 'fake_mgenes.csv', sep=',', header = TRUE )
D.real   = read.table( 'real_mgenes.csv', sep=',', header = TRUE )
len.fake = 1000000
len.real = 5498450
T.fake   = table( D.fake$LENGTH )
T.real   = table( D.real$LENGTH )
Codons  = 0:150
library( Cairo )

CairoPNG('mGene_predicted_lengths.png', width = 500, height=300 )
plot(  as.numeric(names(T.fake))/3, T.fake/2000000, type='l', col='red', bty='n', lwd=3,
	   ylab='P(mGene of length L)', xlab='L(codons)', yaxt='n', xlim=c(0,Codons[length(Codons)]))
axis(  2, at=0:6/1000 )
lines( Codons, (3/64)^2 * (61/64)^Codons,  type='l', lwd=1, col='black')
lines( as.numeric(names(T.real))/3,T.real/(2*5498450), lwd=2, col='blue' )
legend( 'right', c('real','synthetic','predicted'), text.col=c('blue','black','red'),bty='n' )

CairoPNG('mGene_CDFs.png', width = 500, height=300 )
plot(  Codons,ecdf(D.real$LENGTH)(Codons*3), type='l', col='blue', ylim=c(0,1),
	   bty='n', lwd=2, ylab='Cumulative Probability of mGene', xlab='Length(codons)')
lines( Codons, ecdf(D.fake$LENGTH)(Codons*3), lwd=3, col='red' )
lines( Codons, cumsum((3/64)^2 * (61/64)^(Codons))/sum((3/64)^2*(61/64)^(Codons)),
	   type='l', lwd=1, col='black')
legend( 'right', c('real','synthetic','predicted'), text.col=c('blue','black','red'),bty='n' )

2 thoughts on “Average gene length in prokaryotes (part 2)

  1. How did you get “6” reading frames? Codons are triplets of nucleotides, not sextuplets. My understanding is that there are a total of 3 reading frames for each section of DNA. Am I wrong?

    1. You’re thinking about it exactly right, but forgot the easy-to-miss point that the reading frame can either be in the 5′->3′ or the 3′->5′ direction. As you said, there are 3 reading frames, but there are 3 in both directions.

Comments are closed.