I previously showed you (without a screencast) how to make NP++ default in XP. Of course, people have successfully done this for W7/Vista as well, but the various tutorials I saw were all a huge pain in the ass. Except for one: a comment on a more-complicated method. I made the following short screencast to demonstrate that method. This one has the advantages of being easy to reverse, easy to do, and not in any way intimidating!

I also have a post on using regular expressions in NP++, if you’re into that sort of thing.

Python: Monty Hall modeling

You’ve all heard this classic statistics problem, based on an old game show:

A contestant is shown 3 doors. Only one of those three doors hides something of value to the contestant (perhaps a new car), while the other two contain nothing. The contestant chooses one door, but that door remains closed. The host then opens up a 2nd door, and this door is always a losing door. At this point, the contestant may choose to now open the originally-chosen door, or switch to and open the last remaining door.

So why is this interesting? It turns out that the way to maximize your chances of winning is to always switch, and this maximized chance is 67%. It also turns out that this is totally non-intuitive, and that most people think that, if the contestant always switches, the chances of winning are at best 50%. If you haven’t heard the solution to this problem before, you should think through it and see what you expect the chances of winning are under the two conditions: After the contestant chooses a door, and is subsequently shown that one of the other two is a losing door, [1] the contestant always switches to the remaining door, or [2] the contestant never switches. After the jump, I’ll explain this intuitively and then show a Python script to simulate this problem.

average gene length in prokaryotes (part 1)

One of my side research projects involves processing large numbers of genomes (specifically, all fully-sequenced prokaryotic genomes). Since I’m playing with the data anyway, sometimes I end up with random questions that can be answered with what I already have on hand. One such question is this: “What is the average length of a prokaryotic gene?” We could figure this out fairly directly, but it’s always best to have a prediction in hand first. After all, if we have no idea what kind of values to expect, how can we trust the accuracy of a more direct (and experimental) method?

So what do we know? There are 4 possible bases (A, G, C, and T) and three such bases make up a codon. This means that each position of the codon can be any of 4 bases, so there are 4*4*4 = 64 possible codons. Of these, 3 are stop codons (meaning that they mark the end of a gene). We generally think of there being only 1 start codon (ATG, coding for methionine), but it turns out that prokaryotes often use other codons instead. Plus, if there are multiple ATG’s in the same stretch of DNA, how do we know which is the actual start?

For example, take the sequence:

(Sequence 1)  ATG AGT TGA ATG GTA TTG TAA TTT AGA TAA

This sequence has two potential start sites (in bold) and two stop codons (in bold italics). We can unambiguously choose the first stop codon, but we have no way of knowing without more evidence which start codon is the real one.

To get around this, let’s take a conservative approach in calling sequences a “gene”. Instead of anything beginning with a start codon and ending with a stop, let’s take the entire genome and blast it to bits by cutting at every stop codon.

Python: GEB and Wondrousness

While perusing a bookstore a couple years ago, I stumbled upon a fascinating book by Douglas Hofstadter called Godel, Escher, Bach. If you like math, biology, music, art, computer science, and philosophy, this is really an amazing read (though, admittedly, I’ve only gotten halfway through since I bought the thing).

In one of the book’s entertaining conversations between Achilles and the Tortoise (this conversation regarding number theory), the Tortoise tells Achilles about a number property that he calls Wondrousness. A number is found to be Wondrous if, when following a specific algorithm (below), you can turn that number into 1. That number is Unwonderous if you can’t reach 1. The point of the characters’ ensuing discussion is that there is no “terminating test” for the property of Wondrousness; you could never know for sure that a number is Unwonderous because you have no idea how long it would take to reach 1 if it was, in fact, Wondrous instead.

The algorithm in question is this: Take a number N. If that number is odd, take it times 3 and add 1. If that number is instead even, divide it by 2. Continue this process until N=1.

The point, for this post, is that Achilles and the Tortoise demonstrate the above algorithm on the number 15, finding that it takes 17 steps to get to a value of 1. The Tortoise then warns that trying the same with the number 27 will require a large sheet of paper, but otherwise no more examples are given. So, I thought it would be an interesting exercise to write a short Python script to run the algorithm on a large set of numbers, and then plot the number of steps taken to get to 1 for each number.

Python: Clean up and translate nucleotide sequences

[If you are more familiar with biology than with Python or computer programming, I highly recommend this book .]

Some simple, hopefully useful, and totally non-optimized functions for working with nucleotide sequence data (note that there are many more tools as part of the biopython distribution, if you’re interested in learning the library) :

First, for cleaning up a sequence (preferably in FASTA format):

 ```def clean_sequence( sequence ): """Given a sequence string, return a crap-free, standardized DNA version.""" s = sequence.replace( '\r', '' ).split( '\n' ) # separate each line if s[0][0] == '>': s = s[ 1 :] # remove defline s = ''.join( s ) # make one long string s = s.replace( ' ', '' ).replace( '\t', '' ) # remove spaces return s.upper().replace( 'U', 'T' ) ```

Then, a function to let you know if there are characters in your sequence that shouldn’t be:

 ```def report_bad_chars( sequence ): """Given a string 'sequence', return a dictionary of any non-AGCT characters.""" bad_chars = {} for l in sequence: if l not in 'AGCT': if l in bad_chars: bad_chars[ l ] += 1 else: bad_chars[ l ] = 1 if bad_chars != {}: print( bad_chars ) ```

After the jump, functions for translation, calculating amino acid and nucleotide frequencies, and making random DNA sequences.