Average gene length in prokaryotes (part 2)

Hm. So it appears that, two years ago, I wrote a post on calculating the average gene length in prokaryotes. I found a half-draft of the second part and decided to finish it off.

In part 1 we defined mGenes (“maybe-genes”) as the pieces you get after breaking the genome at each stop codon, and predicted that the probability of finding an mGene of length L is given by the following equation:

Equation 1: The probability of finding an mGene of length L.
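One form consistent with the definition of an mGene, offered as a sketch under stated assumptions: every codon is drawn independently and uniformly from the 64 possibilities, 3 of which are stops, and L counts the non-stop codons that precede the terminating stop. The convention for L is an assumption here; counting the stop codon itself would shift the exponent by one.

```latex
P(L) = \left(\frac{61}{64}\right)^{L} \cdot \frac{3}{64}, \qquad L = 0, 1, 2, \dots
```

Under these assumptions the lengths follow a geometric distribution with mean 61/3, or roughly 20 codons, and the chance of a stretch reaching at least 100 codons is (61/64)^100, a little under 1%.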

Plotting this function makes it clear that the probability of a run of codons being an mGene plummets quickly with length, so that there is nearly zero probability of finding an mGene of 100 codons (300 bases) in a random sequence (black line in Fig. 1). I confirmed this with a 1,000,000-base synthetic genome (all code is at the end of this entry), resulting in the red circles in Fig. 1 that perfectly overlap with the prediction line.

Fig. 1: Predicted frequency of mGenes. Red circles, results from a 1-megabase simulated genome. Black line, function shown in Equation 1.
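A minimal sketch of this kind of check, not the code from the end of the entry. It assumes uniformly random bases, a single reading frame, and a geometric prediction of the form (61/64)^L * (3/64), where L counts the non-stop codons before each stop. The names random_genome, mgene_lengths, and predicted are made up for illustration.

```python
import random
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def random_genome(n_bases, seed=0):
    """Uniformly random DNA string (an assumption: real genomes are not uniform)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=n_bases))

def mgene_lengths(genome):
    """Lengths, in codons, of the stretches that end at a stop codon (frame 0 only).

    A stop that immediately follows another stop gives a length of 0.
    The unterminated stretch at the very end is dropped.
    """
    lengths, current = [], 0
    for i in range(0, len(genome) - 2, 3):
        if genome[i:i + 3] in STOP_CODONS:
            lengths.append(current)
            current = 0
        else:
            current += 1
    return lengths

def predicted(L):
    """Assumed geometric prediction: L non-stop codons, then one stop."""
    return (61 / 64) ** L * (3 / 64)

lengths = mgene_lengths(random_genome(1_000_000))
counts = Counter(lengths)
for L in (0, 10, 50, 100):
    observed = counts[L] / len(lengths)
    print(f"L={L:3d}  observed={observed:.5f}  predicted={predicted(L):.5f}")
```

Plotting observed against predicted for every L would give something like the red circles on the black line in Fig. 1. With only about 15,000 mGenes in a genome of this size, the observed values at large L are noisy.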

So now we know what to expect from a completely random genome: Nearly all mGenes will be less than 100 codons (300 bases) in length. This is much shorter than your typical gene, and so I would expect there to be a large number of mGenes larger than this size in a real genome. So let’s check it out, using a fully sequenced prokaryotic genome!


Average gene length in prokaryotes (part 1)

One of my side research projects involves processing large numbers of genomes (specifically, all fully-sequenced prokaryotic genomes). Since I’m playing with the data anyway, sometimes I end up with random questions that can be answered with what I already have on hand. One such question is this: “What is the average length of a prokaryotic gene?” We could figure this out fairly directly, but it’s always best to have a prediction in hand first. After all, if we have no idea what kind of values to expect, how can we trust the accuracy of a more direct (and experimental) method?

So what do we know? There are 4 possible bases (A, G, C, and T), and three such bases make up a codon. Each position of the codon can be any of the 4 bases, so there are 4*4*4 = 64 possible codons. Of these, 3 are stop codons (meaning that they mark the end of a gene). We generally think of there being only 1 start codon (ATG, coding for methionine), but it turns out that prokaryotes often use other codons instead. Plus, if there are multiple ATGs in the same stretch of DNA, how do we know which is the actual start?
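The codon arithmetic is easy to check by brute force. This short sketch enumerates every codon and counts the stops, assuming the standard genetic code (TAA, TAG, and TGA as stop codons); the variable names are only illustrative.

```python
from itertools import product

BASES = "AGCT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

# Every ordered triple of bases is one codon.
all_codons = ["".join(triple) for triple in product(BASES, repeat=3)]
sense_codons = [c for c in all_codons if c not in STOP_CODONS]

# Chance that a uniformly random codon is a stop.
p_stop = len(STOP_CODONS) / len(all_codons)

print(len(all_codons), len(sense_codons), p_stop)  # 64 61 0.046875
```

So in a uniformly random sequence, roughly 1 codon in 21 would be a stop.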

For example, take the sequence:


This sequence has two potential start sites (in bold) and two stop codons (in bold italics). We can unambiguously choose the first stop codon, but we have no way of knowing without more evidence which start codon is the real one.

To get around this, let’s take a conservative approach in calling sequences a “gene”. Instead of anything beginning with a start codon and ending with a stop, let’s take the entire genome and blast it to bits by cutting at every stop codon.
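As a rough sketch of what cutting at every stop codon could look like in Python: the function below walks one reading frame three bases at a time and starts a new piece whenever it hits a stop. The function name, the choice to read a single frame, and the toy sequence are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def split_at_stops(seq, frame=0):
    """Cut seq at every stop codon in one reading frame.

    Returns the non-empty pieces between stops, without the stops themselves.
    The unterminated piece at the end of the sequence is kept, and any
    leftover bases that do not fill a whole codon are ignored.
    """
    pieces, current = [], []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            if current:
                pieces.append("".join(current))
            current = []
        else:
            current.append(codon)
    if current:
        pieces.append("".join(current))
    return pieces

# Made-up toy sequence: ATG AAA TAA ATG GGG TGA CCC
toy = "ATGAAATAAATGGGGTGACCC"
print(split_at_stops(toy))  # ['ATGAAA', 'ATGGGG', 'CCC']
```

A real genome would need this run over all three frames on both strands. The sketch also sidesteps the start-codon ambiguity entirely, which is the point of the conservative definition.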


Comp Bio is complicated

I finished up my first lab rotation two Fridays ago, here at UT Southwestern. It was a pleasant few months with an interesting project, consisting mostly of staring at a computer screen and writing Python scripts, running BLAST searches, and so on. To summarize, but leaving things vague (both for most-people-don’t-care reasons and the-data-is-unpublished reasons), the project was this:

There are currently a crap-ton (“crap-ton” is a standard scientific prefix) of bacterial and archaeal genomes published and available on NCBI’s servers. Archaea, like bacteria, are single-celled prokaryotic organisms. However, they differ from bacteria genomically (and therefore metabolically) in many ways. Some archaeal properties are like those in eukaryotes (like us!), while others are like those in bacteria. So one of the huge unanswered questions in evolution is: how are bacteria, archaea, and eukaryotes related to each other? Or, how would we make a tree of life relating these three domains?


Dear Immune System,

Your time is up, and I am not pleased.

I first got this disease on Friday, five days ago today. It started with frequent trips to the bathroom and angry bowels. Then it got better the next morning of its own accord, and I thought that my brief excursion into sickness was done with.

So on Saturday I had a nice big meal. Too big, in fact, since I ended up with my belly so full that it hurt. But it still felt kinda good, after having been sick the day previously. But then the bathroom trips started again. And didn’t stop.

Sunday was spent entirely either in the bathroom or trying to sleep, in the vain hope that sleep would transport me into a disease-free future. I was rewarded in the evening with a fever.

Yes, body, I know that a fever is supposed to kill off whatever is attacking my insides (or whatever you have perceived to be attacking my insides), so why was it mild, and why was it gone the following morning, after accomplishing absolutely nothing? And it hasn’t come back, even last night when I was having even more frequent trips to el baño. What was the point?

And today I woke up and thought, “ah, it’s all over.” Since all I had was a slight belly ache. But as the morning went on it began to feel like someone had punched me oh-so-gently in the gut. Then I had a delicious meal and, before I could say “Not again!” it started again. The bathroom trips, I mean.

So to my innate immune system: I know you’re doing the best that you can with what you’ve got. You’re static, so really what more can I ask for?

But to my adaptive immune system: WTF? Now, I know you are one of the most amazing achievements of evolution, but seriously, can’t you just be a little more amazing? Perhaps you could work twice as fast, or fight twice as hard. I’d happily eat more to give you the spare parts and chemical energy. Or sacrifice some of my body fat for the task. Or muscle, even.

I know that’s too much to ask, since you’re fighting blind and quite randomly, which is why I have forsaken you and moved on to good old man-made drugs (all the cool kids are doing it). I’ve already lost a 6th of my trip to your taking of your sweet ass-time (yes, that is hyphenated correctly), and I am now cheating on you with generic, OTC, Mexican-made Cipro.

Now, adaptive immune system, this doesn’t mean I don’t love you and appreciate what you do, but when I’m on a deadline you just aren’t good enough. I’m sorry. I welcome you back as soon as the Cipro is done working (especially since I won’t have any gut flora left for defense).

Your Frustrated but Loving Counterparts,

The Higher Lobes

Or, the Ones That have to Perceive the suffering you’re supposed to be alleviating.