average gene length in prokaryotes (part 1)

One of my side research projects involves processing large numbers of genomes (specifically, all fully sequenced prokaryotic genomes). Since I’m playing with the data anyway, I sometimes end up with random questions that can be answered with what I already have on hand. One such question is this: “What is the average length of a prokaryotic gene?” We could figure this out fairly directly, but it’s always best to have a prediction in hand first. After all, if we have no idea what kind of values to expect, how can we trust the accuracy of a more direct (and experimental) method?

So what do we know? There are 4 possible bases (A, G, C, and T), and 3 such bases make up a codon. Each position of the codon can be any of the 4 bases, so there are 4*4*4 = 64 possible codons. Of these, 3 are stop codons (meaning that they mark the end of a gene). We generally think of there being only 1 start codon (ATG, coding for methionine), but it turns out that prokaryotes often use other codons instead. Plus, if there are multiple ATGs in the same stretch of DNA, how do we know which is the actual start?
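As a quick sanity check on that arithmetic, here is a minimal sketch that enumerates every codon and sets aside the stops. It assumes the standard genetic code, in which the three stop codons are TAA, TAG, and TGA; the variable names are mine and purely illustrative.

```python
from itertools import product

BASES = "AGCT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

# Every ordered triple of bases is a codon: 4 * 4 * 4 of them.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

# Codons that do not end a gene.
sense_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons))        # 64
print(len(sense_codons))  # 61
```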

For example, take the sequence:


This sequence has two potential start sites (in bold) and two stop codons (in bold italics). We can unambiguously choose the first stop codon, but we have no way of knowing without more evidence which start codon is the real one.

To get around this, let’s take a conservative approach in calling sequences a “gene”. Instead of anything beginning with a start codon and ending with a stop, let’s take the entire genome and blast it to bits by cutting at every stop codon.
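The "cut at every stop codon" idea can be sketched roughly as follows. This is only an illustration under some loud assumptions: it reads a single reading frame of a single strand, it uses the standard stop codons (TAA, TAG, TGA), and the function name `split_at_stops` and the example sequence are made up for this sketch. A real genome has six reading frames to consider, and whether to keep a stretch that runs off the end of the sequence without hitting a stop is a judgment call (this version keeps it).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def split_at_stops(sequence, frame=0):
    """Return the stop-free stretches of `sequence`, read codon by codon.

    Only one reading frame of one strand is scanned; `frame` (0, 1, or 2)
    is the offset of the first codon. Trailing bases that do not fill a
    whole codon are ignored.
    """
    fragments = []
    current = []
    for i in range(frame, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if codon in STOP_CODONS:
            if current:  # skip empty stretches between back-to-back stops
                fragments.append("".join(current))
            current = []
        else:
            current.append(codon)
    if current:  # stretch that runs off the end without a stop
        fragments.append("".join(current))
    return fragments

example = "ATGAAATAGCCCGGGTGAATG"  # made-up sequence for illustration
print(split_at_stops(example))     # ['ATGAAA', 'CCCGGG', 'ATG']
```

Dividing each fragment's length by 3 gives its length in codons, which is the quantity an average would be taken over.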


Python: Clean up and translate nucleotide sequences

[If you are more familiar with biology than with Python or computer programming, I highly recommend this book.]

[Note: A lot of you are finding this post through Google searches. Let me know in the comments if you found it helpful and, if not, what it was you were looking for!]

Some simple, hopefully useful, and totally non-optimized functions for working with nucleotide sequence data (note that there are many more tools in the Biopython distribution, if you’re interested in learning the library):

First, for cleaning up a sequence (preferably in FASTA format):

def clean_sequence( sequence ):
    """Given a sequence string, return a crap-free, standardized DNA version."""
    s = sequence.strip()                            # drop leading/trailing blank lines
    s = s.replace( '\r', '' ).split( '\n' )         # separate each line
    if s[0].startswith( '>' ): s = s[ 1 :]          # remove defline (safe on empty input)
    s = ''.join( s )                                # make one long string
    s = s.replace( ' ', '' ).replace( '\t', '' )    # remove spaces and tabs
    return s.upper().replace( 'U', 'T' )            # uppercase; RNA (U) to DNA (T)
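To see what each step does, here is the same chain of string operations traced on a small, made-up FASTA record (Windows line endings, stray whitespace, mixed case, RNA letters). The record and the variable names are invented for illustration.

```python
# Made-up record: defline, then sequence lines with \r\n endings, a space, and a tab.
record = ">demo record (made up)\r\naug gcc\r\nAUU\tUAA\r\n"

lines = record.replace("\r", "").split("\n")   # one entry per line
if lines[0].startswith(">"):                   # drop the defline
    lines = lines[1:]
joined = "".join(lines)                        # 'aug gccAUU\tUAA'
stripped = joined.replace(" ", "").replace("\t", "")
cleaned = stripped.upper().replace("U", "T")   # uppercase, then U -> T

print(cleaned)  # ATGGCCATTTAA
```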

Then, a function to let you know if there are characters in your sequence that shouldn’t be:

def report_bad_chars( sequence ):
    """Given a string 'sequence', print and return a dictionary counting any non-AGCT characters."""
    bad_chars = {}
    for l in sequence:
        if l not in 'AGCT':
            bad_chars[ l ] = bad_chars.get( l, 0 ) + 1   # count each offending character
    if bad_chars: print( bad_chars )                     # stay quiet if the sequence is clean
    return bad_chars
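For comparison, the same tally can be written more compactly with `collections.Counter` from the standard library. This is just an alternative sketch, not a replacement for the function above; the name `count_bad_chars` and the `allowed` parameter are hypothetical. Note that some "bad" characters, such as N or R, are legitimate IUPAC ambiguity codes, so what counts as bad depends on what you plan to do with the sequence.

```python
from collections import Counter

def count_bad_chars(sequence, allowed="AGCT"):
    """Return a dict mapping each character not in `allowed` to its count."""
    return dict(Counter(ch for ch in sequence if ch not in allowed))

print(count_bad_chars("ATGNNCATR-"))  # {'N': 2, 'R': 1, '-': 1}
print(count_bad_chars("ATGC"))        # {}
```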

After the jump, functions for translation, calculating amino acid and nucleotide frequencies, and making random DNA sequences.


Evolution: More Sensical than Gravity

I was thinking recently about the common attempts to explain what a scientific theory is by pointing out that gravity is “just” a theory. This is, of course, usually in defense of people who say that evolution is “not a fact, just a theory.” I was thinking about how the concept of evolution, in general and abstracted from a biological context, is a very logical idea that doesn’t actually need evidence in order to make sense. This differs greatly from the theory of gravity, which only makes sense in light of observations. Thus, evolution makes more sense than gravity. Let me explain.

Both concepts were developed in light of observation. For gravity, people recognized that things always fall when dropped. Eventually it was found that all objects fall at the same rate, and later that there is a force between all objects, dependent on mass, that pulls them toward one another even across very great distances. The simple equation describing the gravitational force between two objects can then be used to predict what the interactions between those objects will be like. Gravity is a beautiful example of the scientific method, where we start from observation, come up with rules, and then test those rules against more observations. On top of that, the theory has been fine-tuned over a few centuries to describe more and more observations. The important point I want to make here is that, in the absence of observations of falling objects, gravitational theory makes no sense at all. The equations are useless unless you observe and take measurements of moving objects. The theory only applies to observation.

So how does this contrast with evolution? Like gravity, evolution was a theory developed by many people over many years. Darwin was not the first to come up with it; he was just the first to pile up so much evidence for its occurrence in nature. My claim is, however, that the evidence isn’t necessary. Why is this? I’ll first state that evolution is not specifically a biological concept. It has to do with pieces of information that copy themselves and are prone to some amount of error. Any self-replicating information can and will evolve.
