# Average gene length in prokaryotes (part 2)

Hm. So it appears that, two years ago, I wrote a post on calculating the average gene length in prokaryotes. I found a half-draft of the second part and decided to finish it off.

In part 1 we defined mGenes (“maybe-genes”) as the pieces you get after breaking the genome at each stop codon, and predicted that the probability of finding an mGene of length L is given by the following equation:

Plotting this function makes it clear that the probability of a set of codons being an mGene plummets quickly, so that there is nearly zero probability of finding an mGene of 100 codons (300 bases) in a random sequence (black line in Fig. 1). I confirmed this with a 1,000,000-base synthetic genome (all code is at the end of this entry), resulting in the red circles in Fig. 1 that perfectly overlap with the prediction line.
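As a rough illustration of that check (not the code from the end of this entry), here is a minimal Python sketch. It assumes the prediction takes the geometric form (61/64)^L × (3/64), which follows from 3 of the 64 codons being stops when all bases are equally likely, and it assumes L counts the non-stop codons before the stop. The names `p_mgene`, `p_at_least`, and `mgene_lengths` are made up for this example.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}
P_STOP = 3 / 64  # assumes all 64 codons are equally likely


def p_mgene(length):
    """Predicted probability of an mGene with exactly `length` non-stop codons."""
    return (1 - P_STOP) ** length * P_STOP


def p_at_least(length):
    """Predicted probability of an mGene with `length` or more non-stop codons."""
    return (1 - P_STOP) ** length


def mgene_lengths(genome):
    """Read the genome in a single frame and record the codon count between stops."""
    lengths = []
    current = 0
    for i in range(0, len(genome) - 2, 3):
        if genome[i:i + 3] in STOP_CODONS:
            lengths.append(current)  # zero-length pieces are counted too
            current = 0
        else:
            current += 1
    return lengths


random.seed(0)  # fixed seed so the run is repeatable
genome = "".join(random.choices("ACGT", k=1_000_000))
lengths = mgene_lengths(genome)
observed = sum(1 for n in lengths if n >= 100) / len(lengths)

print(f"predicted P(L >= 100): {p_at_least(100):.4f}")
print(f"observed  P(L >= 100): {observed:.4f}")
```

Comparing the two printed numbers is the same kind of check as overlaying the red circles on the black line, just for a single cutoff.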

So now we know what to expect from a completely random genome: nearly all mGenes will be shorter than 100 codons (300 bases). This is much shorter than your typical gene, and so I would expect a real genome to contain a large number of mGenes longer than this. So let’s check it out, using a fully sequenced prokaryotic genome!

# Allele frequency problem in “Looper”

Time travel movies are always full of bad physics and contradictory logic, though certainly some do it better than others. I usually just try not to think about them too hard so that I can take in the entertainment value. Looper is no exception, but the most glaring error in the movie’s science was not in the physics; it was in the biology.

The beginning of the movie tells of a new mutation, the “TK mutation”, that has crept into the population to give people weak telekinetic powers. The idea of a gene, and more importantly a mutation in an existing gene, somehow allowing telekinesis is of course absurd, but that isn’t what I’m talking about.

I’m talking about the allele frequency. The movie takes place in the 2040s. Only thirty years from now. And, at that time, the movie says that 10% of the human population has the TK mutation. This frequency is fantastically improbable.

Why? Well, right now 0% of the human population has this mutation. The thirty years between now and then have to bring that to 10%. That sounds impossible – let’s see if my suspicion is correct.
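To put a rough number on that suspicion, here is a back-of-the-envelope sketch in Python. Everything except the 10% and the thirty years is an assumption of mine: a world population of about 7 billion, a 25-year human generation time, and a single original carrier. The function `required_growth_per_generation` is a hypothetical helper that asks how many carriers each carrier would need to leave behind per generation.

```python
def required_growth_per_generation(initial_carriers, target_carriers, generations):
    """Per-generation multiplication factor needed to go from initial to target carriers."""
    return (target_carriers / initial_carriers) ** (1 / generations)


WORLD_POPULATION = 7e9   # assumption: round number for the human population
TARGET_FREQUENCY = 0.10  # from the movie: 10% of people carry the TK mutation
YEARS_AVAILABLE = 30     # from the movie's timeline
GENERATION_TIME = 25     # assumption: years per human generation

generations = YEARS_AVAILABLE / GENERATION_TIME
target_carriers = TARGET_FREQUENCY * WORLD_POPULATION
factor = required_growth_per_generation(1, target_carriers, generations)

print(f"generations available: {generations:.1f}")
print(f"carriers needed: {target_carriers:.2e}")
print(f"required growth factor per generation: {factor:.2e}")
```

Under these assumptions there is barely more than one generation to work with, so the required factor comes out in the tens of millions of carrier offspring per carrier.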

# Cloning trick: ligation of multiple inserts

[2013.02.26 Edit: A number of people are finding this through Google searches. I don’t have an updated post on the topic, but if you’re trying to assemble multiple DNA fragments then I suggest looking into Gibson Assembly. NEB sells* a dead-simple mastermix, which is a bit pricey per reaction (I just make my reactions half the size) but comes out cheap when you take into account the cost of labor (so long as your PI values your time…).]

I’ve spent the last couple months building a plasmid library, and in the process I thought of a trick. Ligations, perhaps the worst part of cloning, are notoriously finicky reactions. The goal is to take several pieces of linear DNA, where the ends of the pieces can only connect in a certain way, and then use an enzyme (T4 Ligase) to sew them all together into one piece (in my case, a circular plasmid).

I needed to insert three fragments at once into a single backbone. In my ignorance (from my lack of experience) I thought ligating four fragments should work just as well as two, so I just threw them all together and ran the reaction. The result was a mess, and when I tested 40 different clones afterwards not a single one was correct. So I started adding them one piece at a time which, obviously, was going to take three times as long.

# Average gene length in prokaryotes (part 1)

One of my side research projects involves processing large numbers of genomes (specifically, all fully-sequenced prokaryotic genomes). Since I’m playing with the data anyway, sometimes I end up with random questions that can be answered with what I already have on hand. One such question is this: “What is the average length of a prokaryotic gene?” We could figure this out fairly directly, but it’s always best to have a prediction in hand first. After all, if we have no idea what kind of values to expect, how can we trust the accuracy of a more direct (and experimental) method?

So what do we know? There are 4 possible bases (A, G, C, and T) and three such bases make up a codon. This means that each position of the codon can be any of 4 bases, so there are 4*4*4 = 64 possible codons. Of these, 3 are stop codons (TAA, TAG, and TGA), meaning that they mark the end of a gene. We generally think of there being only 1 start codon (ATG, coding for methionine), but it turns out that prokaryotes often use other codons instead. Plus, if there are multiple ATGs in the same stretch of DNA, how do we know which is the actual start?

For example, take the sequence:

(Sequence 1)  ATG AGT TGA ATG GTA TTG TAA TTT AGA TAA

This sequence has two potential start sites (in bold) and two stop codons (in bold italics). We can unambiguously choose the first stop codon, but we have no way of knowing without more evidence which start codon is the real one.

To get around this, let’s take a conservative approach in calling sequences a “gene”. Instead of anything beginning with a start codon and ending with a stop, let’s take the entire genome and blast it to bits by cutting at every stop codon.
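The cutting step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: it reads a single frame, takes the codons as already separated by spaces (as in Sequence 1), and leaves the stop codons themselves out of the pieces. The function name `split_at_stops` is made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_at_stops(sequence):
    """Cut a space-separated codon string into pieces at every stop codon."""
    pieces = []
    current = []
    for codon in sequence.split():
        if codon in STOP_CODONS:
            pieces.append(current)  # close the piece; the stop itself is dropped
            current = []
        else:
            current.append(codon)
    if current:  # keep any trailing codons that never hit a stop
        pieces.append(current)
    return pieces


sequence_1 = "ATG AGT TGA ATG GTA TTG TAA TTT AGA TAA"
for piece in split_at_stops(sequence_1):
    print(" ".join(piece), f"({len(piece)} codons)")
```

Note that nothing here looks for a start codon at all, which is the whole point of the conservative approach: the pieces are defined by stops alone.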

A few years ago I wrote a post on making Notepad++ your default text editor, wherein I expressed my extreme love for this little piece of open source software. To date, I use it for just about everything and, after extensive use in a computational biology lab rotation, I figured I should share one of its most amazing features: Regular Expressions.

I’m not going to take the time to define regular expressions (“REs” from now on) and will instead send you to Wikipedia. Just know that they are totally badass. In a nutshell, REs let you do really fancy search-and-replace in a text document. Perhaps that doesn’t excite you, but let me give some hypothetical situations in which you may find yourself:

• You have a file of contact information for everyone you know (say, 1000 people) and want to get just the email addresses so that you can spam everyone.
• You have a large FASTA file and want to pull out all of the organism names.
• You want to convert a file from one format to another.
• You want to combine multiple lines into a single line.
• You want to separate a line into multiple lines.
• Other pain-in-the-ass sounding stuff.

Sure, you could manually copy-paste all of those email addresses or organism names, and you could go through and hit the ENTER key to put things on separate lines. OR, you could write a few characters into NP++’s Find & Replace box. I think an example is the best way to make this work.
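To give a flavor of a few of the situations above, here is a minimal sketch using Python's `re` module rather than NP++'s Find & Replace box (the pattern syntax is similar, though not identical). The sample contact lines, the FASTA records, and the deliberately simple email pattern are all invented for illustration; real email addresses and FASTA headers can be messier.

```python
import re

# Made-up contact list, one person per line
contacts = (
    "Alice Smith, alice@example.com, 555-0100\n"
    "Bob Jones, bob.jones@example.org, 555-0101"
)

# A deliberately simple email pattern; it will miss unusual addresses
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
emails = email_pattern.findall(contacts)

# Made-up FASTA text: header lines start with ">"
fasta = (
    ">seq1 Escherichia coli\n"
    "ATGAGTTGA\n"
    ">seq2 Bacillus subtilis\n"
    "GGTATTGTA\n"
)

# Grab everything after ">" on each header line
headers = re.findall(r"^>(.*)$", fasta, flags=re.MULTILINE)

# Combine multiple lines into a single line
one_line = re.sub(r"\n", "; ", contacts)

# Separate one line into multiple lines, splitting at each comma
first_contact = contacts.split("\n")[0]
multi_line = re.sub(r",\s*", "\n", first_contact)

print(emails)
print(headers)
print(one_line)
print(multi_line)
```

Each of these is a single pattern plus a single replacement, which is exactly what would go into the two fields of a Find & Replace dialog.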

As a biologist, I’m rather fond of FASTA files. FASTA is simply a way to format DNA or protein sequence data so that people and programs can easily do stuff with that data. The format is: