# Average gene length in prokaryotes (part 2)

Hm. So it appears that, two years ago, I wrote a post on calculating the average gene length in prokaryotes. I found a half-draft of the second part and decided to finish it off.

In part 1 we defined mGenes (“maybe-genes”) as the pieces you get after breaking the genome at each stop codon, and predicted that the probability of finding an mGene of length L is given by the following equation:
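Assuming all 64 codons are equally likely, so that 3 of every 64 are stop codons, and taking $L$ to be the number of non-stop codons that precede the terminating stop (both are assumptions for this sketch, not necessarily the exact definitions used in part 1), the distribution is geometric:

$$
P(L) = \left(\frac{61}{64}\right)^{L} \cdot \frac{3}{64}
$$

Under these assumptions $P(100) \approx 4 \times 10^{-4}$, which fits the description of the curve dropping to nearly zero by 100 codons.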

Plotting this function makes it clear that the probability of a run of codons being an mGene plummets quickly, so that there is nearly zero probability of finding an mGene of 100 codons (300 bases) in a random sequence (black line in Fig. 1). I confirmed this with a 1,000,000-base synthetic genome (all code is at the end of this entry), resulting in the red circles in Fig. 1 that perfectly overlap with the prediction line.
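For a rough idea of what that check could look like, here is a minimal sketch (not the code from the end of the entry). It assumes a uniform random genome, reads only the first forward reading frame, and compares against a geometric prediction with 3 stop codons out of 64. The function names and the choice of a single frame are assumptions made for illustration.

```python
import random
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def predicted_probability(length):
    """Assumed geometric prediction: `length` non-stop codons, then a stop."""
    return (61 / 64) ** length * (3 / 64)


def random_genome(n_bases, rng):
    """Uniform random sequence over A, C, G, T (assumption: equal base frequencies)."""
    return "".join(rng.choices("ACGT", k=n_bases))


def mgene_lengths(genome):
    """Lengths (in codons) of the pieces between stop codons, frame 1 only.

    A trailing piece with no terminating stop codon is not counted.
    """
    lengths = []
    current = 0
    for i in range(0, len(genome) - 2, 3):
        if genome[i:i + 3] in STOP_CODONS:
            lengths.append(current)
            current = 0
        else:
            current += 1
    return lengths


if __name__ == "__main__":
    rng = random.Random(0)
    lengths = mgene_lengths(random_genome(300_000, rng))
    counts = Counter(lengths)
    for length in (0, 10, 50, 100):
        observed = counts[length] / len(lengths)
        print(length, round(observed, 4), round(predicted_probability(length), 4))
```

A smaller genome than the 1,000,000 bases mentioned above is used here only to keep the demo quick. The observed and predicted columns should track each other closely, with more noise at the longer lengths.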

So now we know what to expect from a completely random genome: Nearly all mGenes will be less than 100 codons (300 bases) in length. This is much shorter than your typical gene, and so I would expect there to be a large number of mGenes larger than this size in a real genome. So let’s check it out, using a fully sequenced prokaryotic genome!

# Allele frequency problem in “Looper”

Time travel movies are always full of bad physics and contradictory logic, though certainly some do it better than others. I usually just try not to think about them too hard so that I can take in the entertainment value. Looper is no exception, but the most glaring error in the movie’s science was not in the physics; it was in the biology.

The beginning of the movie tells of a new mutation, the “TK mutation”, that has crept into the population to give people weak telekinetic powers. The idea of a gene, and more importantly a mutation in an existing gene, somehow allowing telekinesis is of course absurd, but that isn’t what I’m talking about.

I’m talking about the allele frequency. The movie takes place in the 2040’s. Only thirty years from now. And, at that time, the movie says that 10% of the human population has the TK mutation. This frequency is fantastically improbable.

Why? Well, right now 0% of the human population has this mutation. The thirty years between now and then have to bring that to 10%. That sounds impossible – let’s see if my suspicion is correct.
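One way to put a number on that suspicion is a back-of-the-envelope sketch under loud assumptions: a deterministic haploid selection model, a single carrier in a population of 7 billion as the starting point, and a generation time of about 25 years. Carriers leave (1 + s) times as many offspring as non-carriers, so the odds p/(1 - p) are multiplied by (1 + s) every generation. None of these numbers come from the movie, and the function name is hypothetical.

```python
import math


def generations_to_reach(p_start, p_target, s):
    """Generations for a frequency to rise from p_start to p_target.

    Assumes a deterministic haploid model in which carriers have relative
    fitness (1 + s), so the odds p / (1 - p) grow by a factor (1 + s)
    per generation.
    """
    odds_start = p_start / (1 - p_start)
    odds_target = p_target / (1 - p_target)
    return math.log(odds_target / odds_start) / math.log(1 + s)


if __name__ == "__main__":
    p_start = 1 / 7e9          # assumption: one carrier among 7 billion people
    years_per_generation = 25  # assumption: rough human generation time
    for s in (0.1, 1.0, 9.0):
        gens = generations_to_reach(p_start, 0.10, s)
        print(f"s = {s}: {gens:.1f} generations, about {gens * years_per_generation:.0f} years")
```

Even if carriers left twice as many offspring as everyone else (s = 1), this model needs roughly 30 generations, while thirty years is barely more than one.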

# The Birthday “Problem” in Python

A few days ago I found myself having a vague recollection of an interesting statistics problem. All I could remember was that it had to do with having a room full of people and the probability that any two people in that room would have the same birthday. I remembered the point, which was that it is much more likely than you might think, but I was fuzzy on the details.

After trying to define the problem and find an answer mathematically, I remembered that I suck at statistical reasoning about as much as the average person. So I decided to model the problem with a short Python script and find the answer that way.

Sure, I could’ve looked it up, but where’s the fun in that?

The problem: There are n people (say, at a party) drawn randomly from a population in which every day of the year is equally likely to be someone’s birthday (which is probably not true of real populations). What is the probability that at least two people in the sample share a birthday?

To put this thing together, I figure we need three things:

1. The ability to generate random numbers (provided by Python’s random module);
2. An object representing each person;
3. A party object full of those people.

Then we can add things like the ability to choose how many people we want at the party and how many parties to have, as well as some output for making plots!

First, the Person object. All each person needs is a birthday:

```python
import random
random.seed()

class Person:
    def __init__( self ):
        # Birthday as a day of the year, ignoring leap years
        self.birthday = random.randint( 1, 365 )
```
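The remaining pieces (a party full of people, and many parties) could be sketched as follows. This is a self-contained illustration rather than the script from the original post: the `Party` class, `shared_birthday_rate`, and `exact_probability` are hypothetical names, and the `Person` here takes a random number generator as an argument so that runs can be made repeatable with a seed.

```python
import random


class Person:
    def __init__(self, rng):
        # Birthday as a day of the year, ignoring leap years
        self.birthday = rng.randint(1, 365)


class Party:
    def __init__(self, n_people, rng):
        self.guests = [Person(rng) for _ in range(n_people)]

    def has_shared_birthday(self):
        """True if at least two guests have the same birthday."""
        birthdays = [guest.birthday for guest in self.guests]
        return len(set(birthdays)) < len(birthdays)


def shared_birthday_rate(n_people, n_parties, seed=None):
    """Fraction of simulated parties with at least one shared birthday."""
    rng = random.Random(seed)
    hits = sum(Party(n_people, rng).has_shared_birthday() for _ in range(n_parties))
    return hits / n_parties


def exact_probability(n_people):
    """Closed-form answer for comparison: 1 - P(all birthdays distinct)."""
    p_all_distinct = 1.0
    for k in range(n_people):
        p_all_distinct *= (365 - k) / 365
    return 1 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 40, 60):
        print(n, shared_birthday_rate(n, 5000, seed=0), round(exact_probability(n), 3))
```

Comparing the simulated rate with the closed-form value is a cheap sanity check on the model: with 23 guests, both should land near one half.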

# Cloning trick: ligation of multiple inserts

[2013.02.26 Edit: A number of people are finding this through Google searches. I don’t have an updated post on the topic, but if you’re trying to assemble multiple DNA fragments then I suggest looking into Gibson Assembly. NEB sells* a dead-simple mastermix, which is a bit pricey per reaction (I just make my reactions half the size) but comes out cheap when you take into account the cost of labor (so long as your PI values your time…).]

I’ve spent the last couple months building a plasmid library, and in the process I thought of a trick. Ligations, perhaps the worst part of cloning, are notoriously finicky reactions. The goal is to take several pieces of linear DNA, where the ends of the pieces can only connect in a certain way, and then use an enzyme (T4 Ligase) to sew them all together into one piece (in my case, a circular plasmid).

I needed to insert three fragments at once into a single backbone. In my ignorance (from my lack of experience) I thought ligating four fragments should work just as well as two, so I just threw them all together and ran the reaction. The result was a mess, and when I tested 40 different clones afterwards not a single one was correct. So I started adding them one piece at a time which, obviously, was going to take three times as long.

# Python: Monty Hall modeling

You’ve all heard this classic statistics problem, based on an old game show:

A contestant is shown 3 doors. Only one of those three doors hides something of value to the contestant (perhaps a new car), while the other two contain nothing. The contestant chooses one door, but that door remains closed. The host then opens up a 2nd door, and this door is always a losing door. At this point, the contestant may choose to now open the originally-chosen door, or switch to and open the last remaining door.

So why is this interesting? It turns out that the way to maximize your chances of winning is to always switch, and this maximized chance is 2/3 (about 67%). It also turns out that this is totally non-intuitive, and that most people think that, if the contestant always switches, the chances of winning are at best 50%. If you haven’t heard the solution to this problem before, you should think through it and see what you expect the chances of winning are under the two conditions: After the contestant chooses a door, and is subsequently shown that one of the other two is a losing door, [1] the contestant always switches to the remaining door, or [2] the contestant never switches. After the jump, I’ll explain this intuitively and then show a Python script to simulate this problem.
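As a quick way to check your guess, the game as described above can be simulated in a few lines. This is a minimal sketch, not the script referred to in the post; the function names are hypothetical, and it assumes the host picks at random when both unchosen doors are losers.

```python
import random


def play_round(switch, rng):
    """Play one game with 3 doors; return True if the contestant wins."""
    doors = range(3)
    prize = rng.choice(doors)
    choice = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != choice and d != prize])
    if switch:
        # Move to the one door that is neither picked nor opened.
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize


def win_rate(switch, n_rounds, seed=None):
    """Fraction of rounds won when always switching or never switching."""
    rng = random.Random(seed)
    wins = sum(play_round(switch, rng) for _ in range(n_rounds))
    return wins / n_rounds


if __name__ == "__main__":
    print("always switch:", win_rate(True, 100_000, seed=0))
    print("never switch: ", win_rate(False, 100_000, seed=0))
```

With enough rounds, the two printed rates settle near the theoretical values for each strategy.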