[If you are more familiar with biology than with Python or computer programming, I highly recommend this book.]
Here are some simple, hopefully useful, and totally non-optimized functions for working with nucleotide sequence data. (The Biopython distribution includes many more tools, if you're interested in learning the library.)
First, for cleaning up a sequence (preferably in FASTA format):
    def clean_sequence( sequence ):
        """Given a sequence string, return a crap-free, standardized DNA version."""
        s = sequence.replace( '\r', '' ).split( '\n' )  # separate each line
        if s[0].startswith( '>' ):  # startswith() is safe even if the first line is empty
            s = s[1:]  # remove defline
        s = ''.join( s )  # make one long string
        s = s.replace( ' ', '' ).replace( '\t', '' )  # remove spaces and tabs
        return s.upper().replace( 'U', 'T' )  # uppercase, and convert RNA (U) to DNA (T)
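To see what the cleaning steps do, here is a small worked example on a made-up FASTA record (the record name and sequence are invented purely for illustration). The function is restated in condensed form only so that the snippet runs on its own; treat it as a sketch, not a replacement for the version above.

```python
def clean_sequence(sequence):
    """Condensed restatement of the cleaning function, so this snippet runs alone."""
    lines = sequence.replace('\r', '').split('\n')
    if lines[0].startswith('>'):
        lines = lines[1:]  # drop the FASTA defline
    joined = ''.join(lines).replace(' ', '').replace('\t', '')
    return joined.upper().replace('U', 'T')

# Hypothetical input: an RNA record with Windows line endings, a space, and a tab.
raw = ">example_record made-up sequence\r\nacgu acgu\r\nGGCC\tuu\r\n"
cleaned = clean_sequence(raw)
print(cleaned)  # ACGTACGTGGCCTT
```

The defline is dropped, the remaining lines are joined, whitespace is stripped, and every U becomes a T, so RNA input comes out as its DNA equivalent.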
Then, a function to let you know if there are characters in your sequence that shouldn't be there:
    def report_bad_chars( sequence ):
        """Given a string 'sequence', print and return a dictionary of any non-AGCT characters."""
        bad_chars = {}
        for char in sequence:
            if char not in 'AGCT':
                # count each unexpected character
                bad_chars[ char ] = bad_chars.get( char, 0 ) + 1
        if bad_chars:
            print( bad_chars )
        return bad_chars
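If you would rather let the standard library do the counting, the same idea can be sketched with `collections.Counter`. The function name `count_bad_chars` and the `allowed` parameter are hypothetical names introduced here, not part of the original functions.

```python
from collections import Counter

def count_bad_chars(sequence, allowed="AGCT"):
    """Return a Counter of characters in 'sequence' that are not in 'allowed'.

    'count_bad_chars' and 'allowed' are illustrative names (assumptions).
    """
    return Counter(char for char in sequence if char not in allowed)

# Made-up sequence containing ambiguity codes (N, R) and a gap character (-).
bad = count_bad_chars("ACGTNNRACGT-")
print(dict(bad))  # {'N': 2, 'R': 1, '-': 1}
```

An empty result means the sequence contains only A, G, C, and T. Note that characters such as N and R are legitimate IUPAC ambiguity codes, so a non-empty result does not necessarily mean the data is corrupt.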
After the jump, functions for translation, calculating amino acid and nucleotide frequencies, and making random DNA sequences.