A few days ago I found myself having a vague recollection of an interesting statistics problem. All I could remember was that it had to do with having a room full of people and the probability that any two people in that room would have the same birthday. I remembered the point, which was that it is much more likely than you might think, but I was fuzzy on the details.
After trying to define the problem and find an answer mathematically, I remembered that I suck at statistical reasoning about as much as the average person. So I decided to model the problem with a short Python script and find the answer that way.
Sure, I could’ve looked it up, but where’s the fun in that?
The problem: there are n people (say, at a party) drawn randomly from a population in which the chance of having a birthday on any given day is equal to that of any other day (which probably isn't true of real populations). What is the probability that at least two people in the sample share a birthday?
To put this thing together, I figure we need three things:
- The ability to generate random numbers (provided by Python’s random module);
- An object representing each person;
- A party object full of those people.
Then we can add things like the ability to choose how many people we want at the party and how many parties to have, as well as some output for making plots!
First, the Person object. All each person needs is a birthday, which we can represent as a day of the year (ignoring leap years):

import random

class Person:
    def __init__(self):
        # Pick a uniformly random day of the year
        self.birthday = random.randint(1, 365)
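With the Person in hand, the rest of the simulation might look something like this (a minimal sketch; the Party class and function names here are my own, not necessarily those of the original script):

```python
import random

class Person:
    def __init__(self):
        # Pick a uniformly random day of the year (leap years ignored)
        self.birthday = random.randint(1, 365)

class Party:
    def __init__(self, n_people):
        self.people = [Person() for _ in range(n_people)]

    def has_shared_birthday(self):
        # True if any two guests share a birthday
        birthdays = [p.birthday for p in self.people]
        return len(set(birthdays)) < len(birthdays)

def estimate_probability(n_people, n_parties=10000):
    # Throw many parties and count how often a shared birthday occurs
    hits = sum(Party(n_people).has_shared_birthday() for _ in range(n_parties))
    return hits / n_parties

if __name__ == "__main__":
    # The classic result: with 23 people the probability is already about 0.5
    print(estimate_probability(23))
```

Varying `n_people` in a loop gives exactly the kind of output you'd want for a plot of probability versus party size.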
You’ve all heard this classic statistics problem, based on an old game show:
A contestant is shown three doors. Only one of the three hides something of value to the contestant (perhaps a new car), while the other two hide nothing. The contestant chooses one door, but that door remains closed. The host then opens one of the two remaining doors, and this door is always a losing door. At this point, the contestant may either open the originally-chosen door, or switch to and open the last remaining door.
So why is this interesting? It turns out that the way to maximize your chances of winning is to always switch, and doing so wins 2/3 of the time (about 67%). It also turns out that this is totally non-intuitive: most people think that, even if the contestant always switches, the chances of winning are at best 50%. If you haven't heard the solution to this problem before, think it through and see what chances of winning you expect under the two conditions: after the contestant chooses a door and is shown that one of the other two doors is a losing door, (1) the contestant always switches to the remaining door, or (2) the contestant never switches. After the jump, I'll explain this intuitively and then show a Python script to simulate the problem.
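Here's a minimal sketch of such a simulation (the function name and structure are my own, not necessarily what the script after the jump looks like):

```python
import random

def play(switch, n_trials=10000):
    """Simulate many games; return the fraction the contestant wins."""
    wins = 0
    for _ in range(n_trials):
        doors = [0, 1, 2]
        car = random.choice(doors)      # door hiding the prize
        choice = random.choice(doors)   # contestant's initial pick
        # Host opens a losing door that isn't the contestant's choice
        opened = random.choice([d for d in doors if d != choice and d != car])
        if switch:
            # Switch to the one door that is neither chosen nor opened
            choice = next(d for d in doors if d != choice and d != opened)
        wins += (choice == car)
    return wins / n_trials

if __name__ == "__main__":
    print("always switch:", play(switch=True))   # close to 2/3
    print("never switch: ", play(switch=False))  # close to 1/3
```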
One of my side research projects involves processing large numbers of genomes (specifically, all fully-sequenced prokaryotic genomes). Since I’m playing with the data anyway, sometimes I end up with random questions that can be answered with what I already have on hand. One such question is this: “What is the average length of a prokaryotic gene?” We could figure this out fairly directly, but it’s always best to have a prediction in hand first. After all, if we have no idea what kind of values to expect, how can we trust the accuracy of a more direct (and experimental) method?
So what do we know? There are 4 possible bases (A, G, C, and T), and three such bases make up a codon. This means that each position of the codon can be any of 4 bases, so there are 4*4*4 = 64 possible codons. Of these, 3 are stop codons (meaning that they mark the end of a gene). We generally think of there being only one start codon (ATG, coding for methionine), but it turns out that prokaryotes often use other codons instead. Plus, if there are multiple ATGs in the same stretch of DNA, how do we know which is the actual start?
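That codon count is easy to verify by enumeration; here's a quick sketch (the three standard stop codons are TAA, TAG, and TGA):

```python
from itertools import product

bases = "ACGT"
# Every ordered triple of bases is a possible codon
codons = ["".join(c) for c in product(bases, repeat=3)]
print(len(codons))  # 64 possible codons

STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three standard stop codons
print(len(codons) - len(STOP_CODONS))  # 61 codons remain for amino acids
```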
For example, take the sequence:
(Sequence 1) ATG AGT TGA ATG GTA TTG TAA TTT AGA TAA
This sequence has two potential start sites (the ATG codons) and two stop codons (TGA and TAA). We can unambiguously choose the first stop codon, but we have no way of knowing, without more evidence, which start codon is the real one.
To get around this, let’s take a conservative approach in calling sequences a “gene”. Instead of anything beginning with a start codon and ending with a stop, let’s take the entire genome and blast it to bits by cutting at every stop codon.
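That cutting step can be sketched in a few lines (my own reconstruction, handling a single reading frame; a full analysis would also check the other five frames):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three standard stop codons

def split_at_stops(seq):
    """Cut seq at every stop codon in one reading frame and return the
    stop-free fragments in between (empty fragments are dropped)."""
    fragments, current = [], []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            fragments.append("".join(current))
            current = []
        else:
            current.append(codon)
    fragments.append("".join(current))
    return [f for f in fragments if f]

if __name__ == "__main__":
    seq = "ATGAGTTGAATGGTATTGTAATTTAGATAA"  # Sequence 1 above, spaces removed
    print(split_at_stops(seq))
```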
This post discusses a computer program that you can download and try yourself (the source code is available if you want to make your own version).
At a family reunion earlier this summer, we were handed a wordfind that someone had generated somewhere on the Internets, containing the names of the family founders. While solving mine I noticed that, as anyone who has done one of these can attest, any given wordfind contains words that are not in the list. Presumably this is because the randomly-assorted filler letters, by chance, spell out unplanned words. Of course, the wordfind makers might also stick those in on purpose (for example, the family wordfind contained the website name multiple times) or deliberately filter some random words out (profanity). Regardless, I began to wonder how often a word might appear in a wordfind just by chance. So I used the margins to scratch out a formula for the chance of finding a word of a given length within a matrix of random letters.
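I won't reproduce the margin scribbles here, but a back-of-the-envelope version of that calculation goes like this (my own reconstruction, assuming a 26-letter alphabet and straight-line words in all 8 directions): any one fixed placement of a specific word of length L matches with probability (1/26)^L, so by linearity of expectation the expected number of accidental appearances is the number of placements times that probability.

```python
def expected_occurrences(word_len, n):
    """Expected accidental appearances of one specific word of length
    word_len in an n x n grid of uniformly random letters, counting
    straight-line placements in all 8 directions."""
    p = (1 / 26) ** word_len   # probability a fixed placement spells the word
    span = n - word_len + 1    # possible starting positions along one axis
    if span <= 0:
        return 0.0             # word is too long to fit in the grid
    horizontal = n * span * 2  # left-to-right and right-to-left, per row
    vertical = n * span * 2    # down and up, per column
    diagonal = span * span * 4 # the four diagonal directions
    return (horizontal + vertical + diagonal) * p

if __name__ == "__main__":
    # e.g. a 3-letter word in a 15x15 grid
    print(expected_occurrences(3, 15))
```

For small expected counts this is also roughly the probability of seeing the word at least once, which matches the everyday observation: short words show up by accident all the time, longer ones almost never.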