Python: Clean up and translate nucleotide sequences

[If you are more familiar with biology than with Python or computer programming, I highly recommend this book.]

[Note: A lot of you are finding this post through Google searches. Let me know in the comments if you found it helpful and, if not, what it was you were looking for!]

Here are some simple, hopefully useful, and totally non-optimized functions for working with nucleotide sequence data. (Many more tools come with the Biopython distribution, if you’re interested in learning that library.)

First, for cleaning up a sequence (preferably in FASTA format):

def clean_sequence( sequence ):
    """Given a sequence string, return a crap-free, standardized DNA version."""
    s = sequence.replace( '\r', '' ).split( '\n' )  # separate each line
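    if s[0].startswith( '>' ): s = s[ 1 :]          # remove defline (also safe if the first line is empty)
    s = ''.join( s )                                # make one long string
    s = s.replace( ' ', '' ).replace( '\t', '' )    # remove spaces and tabs
    return s.upper().replace( 'U', 'T' )            # uppercase, and convert RNA (U) to DNA (T)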

Then, a function to let you know if your sequence contains characters that shouldn’t be there:

def report_bad_chars( sequence ):
    """Given a string 'sequence', return a dictionary of any non-AGCT characters."""
    bad_chars = {}
    for l in sequence:
        if l not in 'AGCT':
            if l in bad_chars: bad_chars[ l ] += 1
            else: bad_chars[ l ] = 1
    if bad_chars != {}: print( bad_chars )
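For comparison, the same counting can be done more compactly with the standard library's collections.Counter. This is only a sketch of an alternative: the name count_bad_chars and the allowed parameter are made up for illustration and are not part of the functions above.

```python
from collections import Counter

def count_bad_chars(sequence, allowed="AGCT"):
    """Return a dict mapping each character not in `allowed` to its count.

    Hypothetical helper: an alternative to the loop-based version,
    not the original function from this post.
    """
    return dict(Counter(ch for ch in sequence if ch not in allowed))

# An empty dict means every character was one of A, G, C, or T.
print(count_bad_chars("ATGNNCXT"))
```

Passing a different allowed string (for example one that includes N) would let ambiguous bases through without reporting them.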

After the jump, functions for translation, calculating amino acid and nucleotide frequencies, and making random DNA sequences.
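Those functions are not reproduced in this excerpt. As a rough idea of what such helpers can look like, here is a minimal sketch using only the standard library. Every name in it (CODON_TABLE, translate, nucleotide_frequencies, random_dna) is an assumption made for illustration, not the code from the full post. It assumes the standard genetic code, an already-cleaned uppercase DNA string, and translation in the first reading frame only.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA in frame 1; unknown codons become 'X', leftover bases are dropped."""
    usable = len(dna) - len(dna) % 3
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def nucleotide_frequencies(dna):
    """Return a dict of character -> fraction of the sequence (empty dict for empty input)."""
    counts = Counter(dna)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()} if total else {}

def random_dna(length, seed=None):
    """Return a random DNA string with each base equally likely; `seed` makes it repeatable."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

print(translate("ATGGCCTAA"))
print(nucleotide_frequencies("AACG"))
print(random_dna(12, seed=1))
```

Amino acid frequencies can be computed the same way by passing the output of translate to nucleotide_frequencies, since that function simply counts characters.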


Comp Bio is complicated

I finished up my first lab rotation two Fridays ago, here at UT Southwestern. It was a pleasant few months with an interesting project, consisting mostly of staring at a computer screen and writing Python scripts, running BLAST searches, and so on. To summarize, but leaving things vague (both for most-people-don’t-care reasons and the-data-is-unpublished reasons), the project was this:

There are currently a crap-ton (“crap-ton” is a standard scientific prefix) of bacterial and archaeal genomes published and available on NCBI’s servers. Archaea, like bacteria, are single-celled prokaryotic organisms. However, they differ from bacteria genomically (and therefore metabolically) in many ways. Some archaeal properties are like those in eukaryotes (like us!), while others are like those in bacteria. So one of the huge unanswered questions in evolution is: how are bacteria, archaea, and eukaryotes related to each other? Or, how would we make a tree of life relating these three domains?
