average gene length in prokaryotes (part 1)

One of my side research projects involves processing large numbers of genomes (specifically, all fully-sequenced prokaryotic genomes). Since I’m playing with the data anyway, sometimes I end up with random questions that can be answered with what I already have on hand. One such question is this: “What is the average length of a prokaryotic gene?” We could figure this out fairly directly, but it’s always best to have a prediction in hand first. After all, if we have no idea what kind of values to expect, how can we trust the accuracy of a more direct (and experimental) method?

So what do we know? There are 4 possible bases (A, G, C, and T), and three such bases make up a codon. Each position of the codon can be any of the 4 bases, so there are 4*4*4 = 64 possible codons. Of these, 3 are stop codons (meaning that they mark the end of a gene). We generally think of there being only 1 start codon (ATG, coding for methionine), but it turns out that prokaryotes often use other codons instead. Plus, if there are multiple ATGs in the same stretch of DNA, how do we know which is the actual start?

For example, take the sequence:

(Sequence 1)  ATG AGT TGA ATG GTA TTG TAA TTT AGA TAA

This sequence has two potential start sites (the two ATG codons) and in-frame stop codons (TGA and TAA). For a given start we can unambiguously choose the first stop codon that follows it, but without more evidence we have no way of knowing which start codon is the real one.

To get around this, let’s take a conservative approach in calling sequences a “gene”. Instead of anything beginning with a start codon and ending with a stop, let’s take the entire genome and blast it to bits by cutting at every stop codon.
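As a rough illustration of the "cut at every stop codon" idea, here is a minimal Python sketch applied to Sequence 1. It assumes the three standard stop codons (TAA, TAG, TGA) and looks at a single reading frame on a single strand; the function name split_at_stops and its frame parameter are made up for this example and are not necessarily how the full analysis does it.

```python
STOP_CODONS = {'TAA', 'TAG', 'TGA'}  # assumption: the three standard stop codons

def split_at_stops(sequence, frame=0):
    """Return the stop-free runs of codons found in one reading frame."""
    fragments, current = [], []
    # Walk the sequence three bases at a time, ignoring a trailing partial codon.
    for i in range(frame, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if codon in STOP_CODONS:
            if current:                 # close the fragment that just ended
                fragments.append(current)
            current = []
        else:
            current.append(codon)
    if current:                         # keep a final fragment with no stop after it
        fragments.append(current)
    return fragments

sequence_1 = 'ATGAGTTGAATGGTATTGTAATTTAGATAA'
fragments = split_at_stops(sequence_1)
lengths = [len(f) for f in fragments]   # fragment lengths, in codons
print(fragments)
print(lengths)
```

Averaging the fragment lengths over a whole genome (and over all six reading frames) would then give one estimate to compare against a prediction.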

Continue reading

Filed under computers/software, science

rights of CSS modification

For the low-low fee of $15/year, you can have your very own custom CSS document to make your WordPress blog exactly the way you want it!

A CSS file is just a text file, at MOST a few kilobytes.

In other news, you can get 20 gigabytes of extra space from Google for $5/year…

While it’s a total ripoff, I was tired of WordPress’ themes never being quite right, and I wanted the standard theme I use in presentations: maroon and shades of grey.

Still a better deal than text messaging…

Filed under life

Get Notepad++ NOW

What follows is from an old post I wrote on replacing Windows Notepad with Notepad++, but is just the part where I ramble about how amazing it is and why you should use it. To get to the post on Notepad replacement, go here.

If you ever use Notepad, Windows’ native text editor, you should upgrade to Notepad++. Immediately. Seriously, every moment that you don’t is one more moment of sheer misery you don’t even realize you’re going through! Even if you don’t normally use a text editor, you might as well get a nice one. Just in case.

So what’s the difference between a text editor and a word processor (like Microsoft Word)? Well, the name pretty much says it all. A text editor does nothing but manipulate text, just by adding or removing characters. A text editor can also be used to read any file type, though if you open something like a JPEG you’ll have what looks like complete gibberish. Editors do absolutely no formatting. No font changes, size changes, italics, bold, etc. A word processor adds all of this fancy formatting, and usually uses all kinds of added code for both proprietary and useful reasons. For a good example, open up a Word document with Notepad. You’ll see the text of your document, but there are all kinds of crap between words, letters, and paragraphs, and in the header and footer.

You can actually edit any kind of file you want with a text editor, but chances are you’ll just make the file stop working properly. It can be fun, however, to explore different files and see what kind of stuff is added by the software that runs them. You may even find that some files store and hide some information about you…

Continue reading

Filed under computers/software, HowTo

Python: GEB and Wondrousness

While perusing a bookstore a couple years ago, I stumbled upon a fascinating book by Douglas Hofstadter called Gödel, Escher, Bach. If you like math, biology, music, art, computer science, and philosophy, this is really an amazing read (though, admittedly, I’ve only gotten halfway through since I bought the thing).

In one of the book’s entertaining conversations between Achilles and the Tortoise (this conversation regarding number theory), the Tortoise tells Achilles about a number property that he calls Wondrousness. A number is Wondrous if, by following a specific algorithm (below), you can turn that number into 1. A number is Unwondrous if you can’t reach 1. The point of the characters’ ensuing discussion is that there is no “terminating test” for the property of Wondrousness; you could never know for sure that a number is Unwondrous, because you have no idea how long it would take to reach 1 if it were, in fact, Wondrous instead.

The algorithm in question is this: Take a number N. If N is odd, multiply it by 3 and add 1. If N is instead even, divide it by 2. Repeat this process until N=1.

The point, for this post, is that Achilles and the Tortoise demonstrate the above algorithm on the number 15, finding that it takes 17 steps to get to a value of 1. The Tortoise then warns that trying the same with the number 27 will require a large sheet of paper, but otherwise no more examples are given. So, I thought it would be an interesting exercise to write a short Python script to run the algorithm on a large set of numbers, and then plot the number of steps taken to get to 1 for each number.
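As a minimal sketch of what such a script might look like (not necessarily the one behind the link), the function below counts the steps for a single number and then tabulates the counts for 1 through 100. The name wondrous_steps and the range are arbitrary choices for illustration. It assumes positive integers, and it would loop forever on a truly Unwondrous number, which is exactly the "no terminating test" problem.

```python
def wondrous_steps(n):
    """Count how many steps the algorithm needs to turn n into 1."""
    steps = 0
    while n != 1:
        if n % 2:            # odd: multiply by 3 and add 1
            n = 3 * n + 1
        else:                # even: divide by 2
            n = n // 2
        steps += 1
    return steps

# Step counts for the first hundred numbers, ready to hand to a plotting library.
step_counts = {n: wondrous_steps(n) for n in range(1, 101)}
print(step_counts[15])  # 17, matching the example worked in the book
```

Plotting step_counts (number on the x-axis, steps on the y-axis) is then a one-liner in whatever plotting package you prefer.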

Continue reading

Filed under computers/software, science

Python: Clean up and translate nucleotide sequences

Some simple, hopefully useful, and totally non-optimized functions for working with nucleotide sequence data (note that there are many more tools in the Biopython distribution, if you’re interested in learning the library):

First, for cleaning up a sequence (preferably in FASTA format):

def clean_sequence( sequence ):
    """Given a sequence string, return a crap-free, standardized DNA version."""
    s = sequence.replace( '\r', '' ).split( '\n' )  # separate each line
    if s[0].startswith( '>' ): s = s[ 1 :]          # remove defline (safe on empty input)
    s = ''.join( s )                                # make one long string
    s = s.replace( ' ', '' ).replace( '\t', '' )    # remove spaces and tabs
    return s.upper().replace( 'U', 'T' )            # uppercase; convert RNA to DNA

Then, a function to let you know if there are characters in your sequence that shouldn’t be:

def report_bad_chars( sequence ):
    """Given a string 'sequence', print and return a dictionary counting any non-AGCT characters."""
    bad_chars = {}
    for l in sequence:
        if l not in 'AGCT':
            bad_chars[ l ] = bad_chars.get( l, 0 ) + 1  # tally each offending character
    if bad_chars: print( bad_chars )                    # report only when something is wrong
    return bad_chars

After the jump, functions for translation, calculating amino acid and nucleotide frequencies, and making random DNA sequences.

Continue reading

Filed under computers/software, science

Notepad++ and Regular Expressions

A few years ago I wrote a post on making Notepad++ your default text editor, wherein I expressed my extreme love for this little piece of open source software. To date, I use it for just about everything and, after extensive use in a computational biology lab rotation, I figured I should share one of its most amazing features: Regular Expressions.

I’m not going to take the time to define Regular expressions (“REs” from now on) and will instead send you to Wikipedia. Just know that they are totally badass. In a nutshell, REs let you do really fancy search-and-replace in a text document. Perhaps that doesn’t excite you, but let me give some hypothetical situations in which you may find yourself:

  • You have a file of contact information for everyone you know (say, 1000 people) and want to get just the email addresses so that you can spam everyone.
  • You have a large FASTA file and want to pull out all of the organism names.
  • You want to convert a file from one format to another.
  • You want to combine multiple lines into a single line.
  • You want to separate a line into multiple lines.
  • Other pain-in-the-ass sounding stuff.

Sure, you could manually copy-paste all of those email addresses or organism names, and you could go through and hit the ENTER key to put things on separate lines. OR, you could write a few characters into NP++’s Find & Replace box. I think an example is the best way to make this work.
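For a taste of what an RE looks like before getting to Notepad++ itself, here is a rough Python sketch of the email-address situation from the list above. The contact lines are made up, and the pattern is a deliberately simple assumption that will miss some valid addresses; it is only meant to show the idea of matching "something@something.something".

```python
import re

# Deliberately simple pattern: name characters, an @, then a dotted domain.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

# Made-up contact data standing in for a real file.
text = (
    "Alice Smith, alice@example.com, 555-1234\n"
    "Bob Jones, bob.jones@mail.example.org, 555-9876\n"
)

emails = EMAIL_RE.findall(text)  # every non-overlapping match, in order
print(emails)
```

The same sort of pattern, with minor syntax differences, is what goes into a Find & Replace box that supports regular expressions.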

As a biologist, I’m rather fond of FASTA files. FASTA is simply a way to format DNA or protein sequence data so that people and programs can easily do stuff with that data. The format is:

Continue reading

Filed under computers/software, HowTo, science

Connecting Python to the Web at UT Southwestern

While working on a little Python script to do some data mining for lab, I stumbled upon a serious problem:

When connecting to the Internet via the school’s network, whether hard-wired, wifi, or through VPN, I could not get Python to fetch webpages using urllib.request. Here’s the quick solution to anyone else with that problem (for the general solution, just change the proxy address and port to your own).

import urllib.request as web
proxy_url, proxy_port = 'http://proxy.swmed.edu', '3128'
proxy_handler = web.ProxyHandler({'http':proxy_url+':'+proxy_port})  # send http traffic through the proxy
new_opener = web.build_opener(proxy_handler)
page = new_opener.open( url )  # url: the address of the page you want to fetch

I actually just reassigned urllib.request.urlopen = new_opener.open so that the syntax would stay the same. I got the info from the Python 3.1 documentation page for urllib.request, using minor changes to one of their examples.
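An alternative to reassigning urlopen by hand is urllib.request.install_opener, which makes an opener the default for later urlopen calls. The sketch below is one way to wrap that up; the helper name make_proxy_opener is made up for this example, and it only covers plain http traffic, as in the snippet above.

```python
import urllib.request

def make_proxy_opener(proxy_url, proxy_port):
    """Build an opener that sends http requests through the given proxy."""
    proxy_handler = urllib.request.ProxyHandler(
        {'http': f'{proxy_url}:{proxy_port}'}
    )
    return urllib.request.build_opener(proxy_handler)

# Same proxy address and port as above; swap in your own network's values.
opener = make_proxy_opener('http://proxy.swmed.edu', '3128')

# After this call, urllib.request.urlopen(...) uses the proxy-aware opener.
urllib.request.install_opener(opener)
```

Nothing is fetched until you actually call urllib.request.urlopen with a page address, so the setup can sit at the top of a script.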

Filed under computers/software, HowTo