Python: Clean up and translate nucleotide sequences

[If you are more familiar with biology than with Python or computer programming, I highly recommend this book .]

Some simple, hopefully useful, and totally non-optimized functions for working with nucleotide sequence data (note that there are many more tools as part of the biopython distribution, if you’re interested in learning the library) :

First, for cleaning up a sequence (preferably in FASTA format):

def clean_sequence( sequence ):
    """Given a sequence string, return a crap-free, standardized DNA version."""
    s = sequence.replace( '\r', '' ).split( '\n' )  # separate each line
    if s[0][0] == '>': s = s[ 1 :]                  # remove defline
    s = ''.join( s )                                # make one long string
    s = s.replace( ' ', '' ).replace( '\t', '' )    # remove spaces
    return s.upper().replace( 'U', 'T' )

Then, a function to let you know if there are characters in your sequence that shouldn’t be:

def report_bad_chars( sequence ):
    """Given a string 'sequence', return a dictionary of any non-AGCT characters."""
    bad_chars = {}
    for l in sequence:
        if l not in 'AGCT':
            if l in bad_chars: bad_chars[ l ] += 1
            else: bad_chars[ l ] = 1
    if bad_chars != {}: print( bad_chars )

After the jump, functions for translation, calculating amino acid and nucleotide frequencies, and making random DNA sequences.

To make a random DNA sequence:

import random

def random_sequence( length ):
    """Return a random string of AGCT of size length."""
    return ''.join([ random.choice( 'AGCT' ) for i in range(int( length ))])

And to get the frequency of each nucleotide:

def nuc_frequencies( sequence ):
    """Return a dictionary of base:frequency pairs."""
    return { b: sequence.count(b)/len(sequence) for b in 'AGCT'}

A genetic code table (as a dictionary):

gencode = {
	'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
	'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
	'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
	'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
	'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
	'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
	'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
	'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
	'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
	'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
	'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
	'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
	'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
	'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
	'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
	'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

And the function for translation (using the above table):

def translate( sequence ):
	"""Return the translated protein from 'sequence' assuming +1 reading frame"""
	return ''.join([gencode.get(sequence[3*i:3*i+3],'X') for i in range(len(sequence)//3)])

Lastly, to get a codon distribution:

def codon_dict( filler=0 ):
    """Return a dictionary of codon:filler pairs for all 64 codons."""
    d = {}
    for one in 'AGCT':
        for two in 'AGCT':
            for three in 'AGCT':
                d.update({ one+two+three:filler })
    return d

def codon_frequencies( seq ):
    """Return a dictionary of codon:frequency pairs for all 64 codons."""
    table = codon_dict()
    for i in range(len(seq)):
        try: table[seq[i:i+3]] += 1
        except: pass
    seq = seq[::-1]
    for i in range(len(seq)):
        try: table[seq[i:i+3]] += 1
        except: pass
    return table

10 thoughts on “Python: Clean up and translate nucleotide sequences

  1. Hi Adam
    Can’t believe I just found your site (shot in the dark search for “codon pair python”), you have all the coding stuff I want to do but can’t yet, like R, Latex and Python. Ok, I can do a bit, but really not very much. I was hoping for a compass heading, does one just keep their heads down and code code code, did you go on a course (e.g. software-carpentry), and specific books or websites you found especially helpful (there are lots of unhelpful ones). You could also try for nice outputs of data, but thanks for the link you posted. I think nice looking data is an under appreciated skill, besides it hides the quality of the data somewhat as well.

    1. Mostly it’s just “code code code”! For R and TeX I just peruse tutorials online and troubleshoot with the trusty Google. Both are old and commonly-used enough that there is plenty available. I have not seen good books for either.

      I learned HTML/CSS and TeX first, which is a good way to start since they are markup instead of programming languages. At the same time I was playing around with Linux a lot on an old laptop, and had a friend in comp sci who would help me troubleshoot my way out of the various messes I got into. Having a computer with the sole purpose of trying out Linux operating systems was probably the best way to learn, since I could just reinstall when I broke it too much. Learning by failing and then debugging is the only way to do it!

      I got started with R in a statistics course in college, as it was part of the course. I didn’t use it as a programming language at all until I started picking up C++, but I had the basics I needed from that class.

      At the suggestion of that same comp sci friend, the first language I jumped into was C++. He suggested this because, “if I could figure it out, any other language would be easy.” Which turned out to be pretty accurate. Plus, it forces good coding style. I had no idea how to get started with C++, though, so I got a textbook through an online course at the Game Institute. I found their first textbook to be the most useful, and used there forums a bit at the beginning to get help. Otherwise I just used the book’s lessons as a starting point and tried to expand on their homework problems as much as possible, using Google to find my way out of corners.

      Then I finally picked up Python, starting mostly with tutorials online. Eventually I got a book for bioinformatics using Python, which was really an excellent guide. I would definitely suggest starting their if you are familiar with computational biology problems. It’s kept pretty simple throughout, and is a really good reference.

  2. Hi there,
    I am impressed with what you show here, but I don’t understand it completely (I am totally new to python & programming). First of all, did you forget a column of ‘ as the first one in your dictionary?
    And could you please explain the code that is supposed to translate the base sequence to me?
    I wrote a little script to calculate the tm of primers, now I want to write one to translate dna. But how do I print the function ‘translate’ which you defined. I always the output [1,2,3] ‘X’ when I try the script with three bases.

    Cheers, R

    1. Oooh, nice catch! I’ve added those quotes.

      Now, the translation code:

      return ''.join([gencode.get(sequence[3*i:3*i+3],'X') for i in range(len(sequence)//3)])

      Okay, so the sep.join(string_list) part just takes whatever strings are in the list string_list and sticks them together with the string sep in between each. That list, in this case, is of each of the translated codons.

      The codon list comes from the list comprehension, which loops through each of the calculated number of codons len(sequence)//3. At each iteration, it indexes sequence to pull out a 3-letter subsequence. That codon is then used as a key to the gencode dictionary to find the corresponding amino acid.

      I hope that clarified things a little. The trick is really the list comprehension, which is a pretty awesome thing that Python can do that isn’t available in other languages.

  3. I am very new to python, I am doing my honours in biochemistry with bioinformatics as a subject. We are doing the very basics in python.

    this is my dictionary that i will put in curl “{}” brackets



    How do I use this dictionary to translate my DNA sequence into an amino acid sequence?
    -A simple way, so that it gives me the following output:
    Sequence 1:

    Sequence 2:

    AND another question: how do I tell it to name each sequence consecutively “sequence 1, then sequence 2, then sequence three” as the count function doesn’t work for strings and lists.

    I would REALLY APPRECIATE a reply so much



    1. You can just use the translate() function defined in this post! Either change your dictionary name to “gencode” or change gencode to “d” in my translate() function, and you’re already there.

      I’m not totally sure what you are asking for in your second question. Are you referring to getting multiple reading frames? The easiest way to do that would be to sequentially pass in to translate() versions of your string that are shortened by 1 on the left side.

      For example:

      sequence = 'ATGGTCCTAGGA'
      three_frames = [ sequence[i:] for i in range(3) ]
      three_frames_translated = [translate(frame) for frame in three_frames]
      # and then to print your message:
      [print( 'Sequence ' + str(i+1) + ':\n' + s ) for i,s in enumerate(three_frames_translated)]

      I hope that helps! I made extensive use of “list comprehensions”, which are little confusing but are awesome and super-useful, so I recommend checking those out. They’re just a way of storing the output of a for loop in a list (so that last line will do all of the printing you want, but will return a list of Nones). The enumerate function should help you with getting numbers and strings at the same time. Check out the documentation for it.

  4. Thanks Adam for posting your dictionary of amino acids and walking through your code. The dictionary setup like that helped me avoid typing! I’m working through Rosalind’s exercises to improve my python skills and learn more/refresh about genetics and biology.


    ps: saw your Co’s games, and decided to get the latest: fun.

    1. Thanks for the link! I was unfamiliar with that site, and will be checking it out.

      My brothers have another game dropping on June 6. If you liked the others, you’ll like this one!

  5. Hey,
    I’m a chemist and I thought I was extremely proficient at analyzing data sets with spreadsheet programs (Excel, MeasureNet, etc.). I’m going into bio research and the data, nucleotide sequences are in a completely unrecognizable format. A programmer friend, who is an ‘expert’ in business intelligence(a senior IT auditor at a Fortune 100 company), recommended I learn Python for analyzing large amounts of data. So, Now I’m learning to code in it to help me break down the data into workable chunks. I know there are programs designed specifically to work with data from DNA but I figured learning to code in a new language shouldn’t hurt; I currently only know VBA for writing macros in Excel. This should help me out a lot in the future. I’ve ordered the book you recommended. And my friend also recommended the programming language “R” as something I might be interested in learning as well, any thoughts? Thanks a lot.


    1. I used R quite a bit while I was in the lab, and it has a ton of powerful Bioinformatics functions through the Bioconductor suite of packages. The language is super weird though, and significantly harder to learn than the other languages I’ve used. Totally worth it, if you can figure it out!

Comments are closed.