Python: Clean up and translate nucleotide sequences

[If you are more familiar with biology than with Python or computer programming, I highly recommend this book .]

[Note: A lot of you are finding this post through Google searches. Let me know in the comments if you found it helpful and, if not, what it was you were looking for!]

Some simple, hopefully useful, and totally non-optimized functions for working with nucleotide sequence data (note that there are many more tools as part of the biopython distribution, if you’re interested in learning the library) :

First, for cleaning up a sequence (preferably in FASTA format):

def clean_sequence( sequence ):
    """Given a sequence string, return a crap-free, standardized DNA version."""
    s = sequence.replace( '\r', '' ).split( '\n' )  # separate each line
    if s[0][0] == '>': s = s[ 1 :]                  # remove defline
    s = ''.join( s )                                # make one long string
    s = s.replace( ' ', '' ).replace( '\t', '' )    # remove spaces
    return s.upper().replace( 'U', 'T' )

Then, a function to let you know if there are characters in your sequence that shouldn’t be:

def report_bad_chars( sequence ):
    """Given a string 'sequence', return a dictionary of any non-AGCT characters."""
    bad_chars = {}
    for l in sequence:
        if l not in 'AGCT':
            if l in bad_chars: bad_chars[ l ] += 1
            else: bad_chars[ l ] = 1
    if bad_chars != {}: print( bad_chars )

After the jump, functions for translation, calculating amino acid and nucleotide frequencies, and making random DNA sequences.

Continue reading