# Average gene length in prokaryotes (part 2)

Hm. So it appears that, two years ago, I wrote a post on calculating the average gene length in prokaryotes. I found a half-draft of the second part and decided to finish it off.

In part 1 we defined mGenes (“maybe-genes”) as the pieces you get after breaking the genome at each stop codon, and predicted that the probability of finding an mGene of length L is given by the following equation:

By plotting this function it is clear that the probability of a set of codons being an mGene plummets quickly, so that there is nearly a 0 probility of finding an mGene of 100 codons (300 bases) in a random sequence (black line in Fig. 1). I confirmed this with a 1,000,000 base synthetic genome (all code is at the end of this entry), resulting in the red circles in Fig. 1 that perfectly overlap with the prediction line.

So now we know what to expect from a completely random genome: Nearly all mGenes will be less than 100 codons (300 bases) in length. This is much shorter than your typical gene, and so I would expect there to be a large number of mGenes larger than this size in a real genome. So let’s check it out, using a fully sequenced prokaryotic genome!

# The Birthday “Problem” in Python

A few days ago I found myself having a vague recollection of an interesting statistics problem. All I could remember was that it had to do with having a room full of people and the probability that any two people in that room would have the same birthday. I remembered the point, which was that it is much more likely than you might think, but I was fuzzy on the details.

After trying to define the problem and find an answer mathematically, I remembered that I suck at statistical reasoning about as much as the average person. So I decided to model the problem with a short Python script and find the answer that way.

Sure, I could’ve looked it up, but where’s the fun in that?

The problem: There are n people (say, at a party) drawn randomly from a population in which the chances of having a birthday on any day is equal to having a birthday on any other (which is not true of real populations (probably)). What is the probability of there being at least two people with the same birthday in the sample?

To put this thing together, I figure we need three things:

1. The ability to generate random numbers (provided by Python’s random module);
2. An object representing each person;
3. A party object full of those people.

Then we can add things like the ability to choose how many people we want at the party and how many parties to have, as well as some output for making plots!

First, the Person object. All each person needs is a birthday:

```import random
random.seed()

class Person:
def __init__( self ):
self.birthday = random.randint( 1, 365 )
```

# average gene length in prokaryotes (part 1)

One of my side research projects involves processing large numbers of genomes (specifically, all fully-sequenced prokaryotic genomes). Since I’m playing with the data anyway, sometimes I end up with random questions that can be answered with what I already have on hand. One such question is this: “What is the average length of a prokaryotic gene?” We could figure this out fairly directly, but it’s always best to have a prediction in hand first. After all, if we have no idea what kind of values to expect, how can we trust the accuracy of a more direct (and experimental) method?

So what do we know? There are 4 possible bases (A, G, C, and T) and three such bases make up a codon. This means that each position of the codon can be any of 4 bases, so there are 4*4*4 = 64 possible codons. Of these, 3 are stop codons (meaning that they mark the end of a gene). We generally think of there being only 1 start codon (ATG, coding for methionine), but it turns out that prokaryotes often use other codons instead. Plus, if there are multiple ATG’s in the same stretch of DNA, how do we know which is the actual start?

For example, take the sequence:

(Sequence 1)  ATG AGT TGA ATG GTA TTG TAA TTT AGA TAA

This sequence has two potential start sites (in bold) and two stop codons (in bold italics). We can unambiguously choose the first stop codon, but we have no way of knowing without more evidence which start codon is the real one.

To get around this, let’s take a conservative approach in calling sequences a “gene”. Instead of anything beginning with a start codon and ending with a stop, let’s take the entire genome and blast it to bits by cutting at every stop codon.

# Python: GEB and Wondrousness

While perusing a bookstore a couple years ago, I stumbled upon a fascinating book by Douglas Hofstadter called Godel, Escher, Bach. If you like math, biology, music, art, computer science, and philosophy, this is really an amazing read (though, admittedly, I’ve only gotten halfway through since I bought the thing).

In one of the book’s entertaining conversations between Achilles and the Tortoise (this conversation regarding number theory), the Tortoise tells Achilles about a number property that he calls Wondrousness. A number is found to be Wondrous if, when following a specific algorithm (below), you can turn that number into 1. That number is Unwonderous if you can’t reach 1. The point of the characters’ ensuing discussion is that there is no “terminating test” for the property of Wondrousness; you could never know for sure that a number is Unwonderous because you have no idea how long it would take to reach 1 if it was, in fact, Wondrous instead.

The algorithm in question is this: Take a number N. If that number is odd, take it times 3 and add 1. If that number is instead even, divide it by 2. Continue this process until N=1.

The point, for this post, is that Achilles and the Tortoise demonstrate the above algorithm on the number 15, finding that it takes 17 steps to get to a value of 1. The Tortoise then warns that trying the same with the number 27 will require a large sheet of paper, but otherwise no more examples are given. So, I thought it would be an interesting exercise to write a short Python script to run the algorithm on a large set of numbers, and then plot the number of steps taken to get to 1 for each number.