# Python: GEB and Wondrousness

While perusing a bookstore a couple of years ago, I stumbled upon a fascinating book by Douglas Hofstadter called Gödel, Escher, Bach. If you like math, biology, music, art, computer science, and philosophy, this is really an amazing read (though, admittedly, I’ve only gotten halfway through since I bought the thing).

In one of the book’s entertaining conversations between Achilles and the Tortoise (this conversation regarding number theory), the Tortoise tells Achilles about a number property that he calls Wondrousness. A number is Wondrous if, by following a specific algorithm (below), you can turn that number into 1. A number is Unwondrous if you can never reach 1. The point of the characters’ ensuing discussion is that there is no “terminating test” for the property of Wondrousness; you could never know for sure that a number is Unwondrous, because you have no idea how long it would take to reach 1 if it were, in fact, Wondrous instead.

The algorithm in question is this: Take a number N. If N is odd, multiply it by 3 and add 1. If N is instead even, divide it by 2. Repeat this process until N = 1.
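The rule above can be sketched in a few lines of Python. This is only a minimal illustration, and the function name `wondrous_sequence` is my own invention, not something from the book:

```python
def wondrous_sequence(n):
    """Return the list of values visited from n down to 1.

    Hypothetical helper: applies 3n + 1 to odd values and n / 2 to even ones.
    If n were Unwondrous, this loop would simply never finish.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    sequence = [n]
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        sequence.append(n)
    return sequence


path = wondrous_sequence(15)
print(path)
print(len(path) - 1)  # number of steps taken: 17
```

The step count is one less than the length of the list, since the starting number itself is not a step.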

The point, for this post, is that Achilles and the Tortoise demonstrate the above algorithm on the number 15, finding that it takes 17 steps to get to a value of 1. The Tortoise then warns that trying the same with the number 27 will require a large sheet of paper, but otherwise no more examples are given. So, I thought it would be an interesting exercise to write a short Python script to run the algorithm on a large set of numbers, and then plot the number of steps taken to get to 1 for each number.
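One way such a script might look is sketched below. It is an assumption on my part about a reasonable approach, not the script from this post: the name `step_counts` and the caching trick are hypothetical, and the plotting is left out so the block runs with only the standard library.

```python
def step_counts(limit):
    """Map every n in 1..limit to the number of steps it takes to reach 1.

    Step counts already seen are cached, so each new starting number only
    has to be followed until it hits a value that was already solved.
    """
    cache = {1: 0}
    for start in range(1, limit + 1):
        n = start
        path = []
        while n not in cache:
            path.append(n)
            n = 3 * n + 1 if n % 2 else n // 2
        steps = cache[n]
        # Walk back up the path, filling in the count for each value seen.
        for value in reversed(path):
            steps += 1
            cache[value] = steps
    return {n: cache[n] for n in range(1, limit + 1)}


counts = step_counts(1000)
print(counts[15], counts[27])
longest = max(counts, key=counts.get)
print(longest, counts[longest])
```

The resulting dictionary can be handed straight to a plotting library (matplotlib, for instance) with the keys on the x-axis and the values on the y-axis.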

# Python: Clean up and translate nucleotide sequences

[If you are more familiar with biology than with Python or computer programming, I highly recommend this book.]

Some simple, hopefully useful, and totally non-optimized functions for working with nucleotide sequence data (note that there are many more tools in the Biopython distribution, if you’re interested in learning the library):

First, for cleaning up a sequence (preferably in FASTA format):

```python
def clean_sequence(sequence):
    """Given a sequence string, return a crap-free, standardized DNA version."""
    s = sequence.replace('\r', '').split('\n')  # separate each line
    if s[0].startswith('>'):  # safe even when the first line is empty
        s = s[1:]  # remove defline
    s = ''.join(s)  # make one long string
    s = s.replace(' ', '').replace('\t', '')  # remove spaces and tabs
    return s.upper().replace('U', 'T')  # uppercase, and turn RNA (U) into DNA (T)
```

Then, a function to let you know if there are characters in your sequence that shouldn’t be:

```python
def report_bad_chars(sequence):
    """Given a string 'sequence', return a dictionary of any non-AGCT characters."""
    bad_chars = {}
    for l in sequence:
        if l not in 'AGCT':
            # count each unexpected character
            bad_chars[l] = bad_chars.get(l, 0) + 1
    if bad_chars:
        print(bad_chars)
    return bad_chars  # the docstring promises a return value, so give one
```
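For comparison, the same tally can be sketched with `collections.Counter` from the standard library. The name `count_bad_chars` and the `allowed` parameter are hypothetical additions for illustration:

```python
from collections import Counter


def count_bad_chars(sequence, allowed="AGCT"):
    """Return a dict mapping each character not in `allowed` to its count."""
    return dict(Counter(ch for ch in sequence if ch not in allowed))


print(count_bad_chars("AGCTNNX-"))
print(count_bad_chars("AGCT"))  # empty dict: nothing to complain about
```

Making the allowed alphabet a parameter means the same helper could check RNA (`"AGCU"`) without any other change.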

After the jump, functions for translation, calculating amino acid and nucleotide frequencies, and making random DNA sequences.
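The functions themselves are not part of this excerpt. As a rough idea of what such helpers could look like, here is a sketch that assumes the standard genetic code and an already-cleaned, uppercase DNA string. The names `translate`, `nucleotide_frequencies`, and `random_dna` are hypothetical and are not the post's own code:

```python
import random
from collections import Counter

# Standard genetic code, laid out in the usual T, C, A, G order
# for the first, second, and third base of each codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in frame 1; '*' marks a stop, 'X' an unrecognized codon."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def nucleotide_frequencies(dna):
    """Return the fraction of the sequence made up by each of A, G, C, T."""
    if not dna:
        return {}
    counts = Counter(dna)
    return {base: counts[base] / len(dna) for base in "AGCT"}


def random_dna(length, seed=None):
    """Return a random DNA string with each base equally likely."""
    rng = random.Random(seed)
    return "".join(rng.choice("AGCT") for _ in range(length))


print(translate("ATGGCCTAA"))  # MA*
print(nucleotide_frequencies("AACG"))
print(random_dna(12, seed=1))
```

For real work, Biopython's own translation tools handle alternative genetic codes and ambiguous bases, which this sketch does not.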

A few years ago I wrote a post on making Notepad++ your default text editor, wherein I expressed my extreme love for this little piece of open source software. To date, I use it for just about everything and, after extensive use in a computational biology lab rotation, I figured I should share one of its most amazing features: Regular Expressions.

I’m not going to take the time to define regular expressions (“REs” from now on) and will instead send you to Wikipedia. Just know that they are totally badass. In a nutshell, REs let you do really fancy search-and-replace in a text document. Perhaps that doesn’t excite you, but let me give some hypothetical situations in which you may find yourself:

• You have a file of contact information for everyone you know (say, 1000 people) and want to get just the email addresses so that you can spam everyone.
• You have a large FASTA file and want to pull out all of the organism names.
• You want to convert a file from one format to another.
• You want to combine multiple lines into a single line.
• You want to separate a line into multiple lines.
• Other pain-in-the-ass sounding stuff.

Sure, you could manually copy-paste all of those email addresses or organism names, and you could go through and hit the ENTER key to put things on separate lines. OR, you could write a few characters into NP++’s Find & Replace box. I think an example is the best way to make this work.
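To give a flavor of what those few characters can do, here is a sketch of three of the tasks above using Python's `re` module rather than NP++ itself. The sample text and the patterns are made up for illustration, the email pattern is deliberately simple rather than complete, and NP++'s RE syntax differs from Python's in some details:

```python
import re

contacts = (
    "Alice Example <alice@example.com>, 555-1234\n"
    "Bob Smith <bob.smith@mail.example.org>, 555-9876\n"
)

# Pull out just the email addresses (a simple pattern, not a full validator).
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", contacts)
print(emails)

# Separate one line into multiple lines: each comma becomes a line break.
print(re.sub(r",\s*", "\n", "first, second,third"))

# Grab every header line (the part after '>') from FASTA-style text.
fasta = ">seq one\nAGCT\nAGGA\n>seq two\nTTGA\n"
headers = re.findall(r"^>(.*)$", fasta, flags=re.MULTILINE)
print(headers)
```

In each case the pattern describes the shape of the text you want, and the tool does the copy-pasting for you.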

As a biologist, I’m rather fond of FASTA files. FASTA is simply a way to format DNA or protein sequence data so that people and programs can easily do stuff with that data. The format is:

# More Puzzling

In a previous post, I discussed my attempt to write a program to solve a puzzle. I never updated that post because, well, I ran the program all night and it didn’t find the solution!

I had made up a fake puzzle that I knew had a solution for testing, and the program could solve it in 15 minutes. But it couldn’t solve the one I had recorded for the real puzzle. I figured (and hoped) that I had simply recorded it wrong, so to check, I re-recorded the pieces and tried again. And it worked! Here’s how:

# Puzzling

My mother gave the fam a new game for Xmas called “The Impossible Puzzle.” Or maybe that was the company name. Either way, the label is certainly apt.

The puzzle is composed of nine 4-sided pieces with interlocking parts in the shapes of the 4 card suits (hearts, diamonds, etc.). There is only one way to lock the 9 pieces into a 3×3 square, and the game touts the fact that “there are over 300,000 combinations, only one of which is correct.”

My mother, her boyfriend, and I all fiddled with this puzzle for a while before becoming too frustrated. There seemed to be no way to logically sort through all possible arrangements and rotations of the pieces, and clearly a brute-force method of trying every combination would take way too long by hand. But maybe not for a computer. So, I figured this would be an excellent opportunity to try out my programming skills.
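To get a feel for how big that brute-force search is, the counting can be sketched quickly. This assumes each of the 9 pieces can sit in any of the 9 positions in any of 4 rotations; the variable names are my own, and my guess is that the box's "over 300,000" counts the placements alone:

```python
import math

pieces = 9
rotations_per_piece = 4

# Ways to assign the 9 pieces to the 9 positions, ignoring rotation.
arrangements = math.factorial(pieces)

# Ways to rotate the 9 pieces independently once they are placed.
rotation_choices = rotations_per_piece ** pieces

total = arrangements * rotation_choices

print(arrangements)      # 362880
print(rotation_choices)  # 262144
print(total)             # 95126814720
```

So once rotations are included, a naive search has tens of billions of layouts to consider, which is hopeless by hand but not obviously out of reach for a program that abandons a layout as soon as two neighboring edges fail to match.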