Comp Bio frustrations

I finished up my first lab rotation two Fridays ago, here at UT Southwestern. It was a pleasant few months with an interesting project, consisting mostly of staring at a computer screen and writing Python scripts, running BLAST searches, and so on. To summarize, while leaving things vague (both for most-people-don’t-care reasons and the-data-is-unpublished reasons), the project was this:

There are currently a crap-ton (“crap-ton” is a standard scientific prefix) of bacterial and archaeal genomes published and available on NCBI’s servers. Archaea, like bacteria, are single-celled prokaryotic organisms. However, they differ from bacteria genomically (and therefore metabolically) in many ways. Some archaeal properties are like those in eukaryotes (like us!), while others are like those in bacteria. So one of the huge unanswered questions in evolution is: how are bacteria, archaea, and eukaryotes related to each other? Or, how would we make a tree of life relating these three domains?

Phylogenetics is the study of these kinds of relationships, and visualizing these relationships as trees is common and fairly easy to understand. One of the most confusing points is that a tree can either be rooted or unrooted.

 

Figure 1: An unrooted tree (A), a rooted tree (B), and a rooted tree with an unknown root (C).

Figure 1 should help to explain this. In panel (A) you see an unrooted tree. A-D represent things that are related to each other, and the length of the lines represents divergence (a longer line between points indicates a more distant relationship). When we construct these things in biology, the letters are usually DNA or protein sequences, and the line length represents the number of sequence differences. So a short line indicates that the sequences are more similar, whereas a long line means they are less similar (more “diverged”).
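To make "number of sequence differences" concrete, here is a minimal Python sketch. The four sequences are made up for illustration (they are not data from the project), and the function names are my own. It simply counts mismatched positions between sequences that are assumed to be already aligned to the same length; real analyses typically start from a proper alignment and apply a substitution model rather than using raw counts.

```python
from itertools import combinations

def count_differences(seq_a, seq_b):
    """Count the positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

# Toy, invented sequences named after the letters in Figure 1.
sequences = {
    "A": "ACGTACGTAC",
    "B": "ACGTTCGAAG",
    "C": "TCGTTCGAAG",
    "D": "ACGTACGTAG",
}

def pairwise_differences(seqs):
    """Return {(name1, name2): number of differing positions} for every pair."""
    return {
        (x, y): count_differences(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }

distances = pairwise_differences(sequences)
for (x, y), d in distances.items():
    print(f"{x}-{y}: {d} differences")
```

With these toy sequences, A and D differ at only one position while A and B differ at three, so A and D would be joined by a shorter path in a tree built from them.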

In the unrooted case (Fig.1A) we know how diverged the sequences are from each other, meaning that we know how they are related to each other. However, we do not know anything about the relationship of A-D in terms of ancestry. What do I mean by this? See how the total line length between A and D in Fig.1A is really short, but the length between A and B is longer? This allows us to say, with confidence, that A is more closely related to D than it is to B. But is D the ancestor of A, or is B the ancestor of A, or A of D, etc? There is no way to know when looking only at the differences between closely related sequences! For all we know, the rooted tree could look like that in Fig.1B (where C is the ancestor of all the rest) or Fig.1C (where there is an unknown ancestor leading to the two related groups A&D and C&B). And it shouldn’t be difficult to imagine any other arrangement that this tree could take.

(Two important points are that (1) no matter how you rearrange the tree, the distance between any two points, measured along the lines connecting them, cannot change, and (2) when looking at a rooted tree a third dimension is implicitly added: time.)
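Point (1) can be checked with a small Python sketch. Everything here is hypothetical: the branch lengths are invented, the junction names "x" and "y" are my own labels for the unnamed branch points, and `place_root` is just an illustrative helper, not a function from any phylogenetics library. The tree is stored as a dictionary of neighbors, a root is dropped onto a chosen branch, and the leaf-to-leaf distances are compared before and after.

```python
from itertools import combinations
from math import isclose

# Hypothetical unrooted tree: leaves A-D plus two internal junctions "x" and "y".
# Branch lengths are invented for illustration.
unrooted = {
    "A": {"x": 1.0},
    "D": {"x": 1.0},
    "B": {"y": 1.0},
    "C": {"y": 2.0},
    "x": {"A": 1.0, "D": 1.0, "y": 3.0},
    "y": {"B": 1.0, "C": 2.0, "x": 3.0},
}
LEAVES = ["A", "B", "C", "D"]

def path_length(tree, start, goal):
    """Sum the branch lengths along the unique path between two nodes."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbor, length in tree[node].items():
            if neighbor != parent:  # never walk back the way we came
                stack.append((neighbor, node, dist + length))
    raise ValueError(f"no path from {start} to {goal}")

def place_root(tree, u, v, fraction=0.5):
    """Return a copy of the tree with a 'root' node inserted on branch u-v."""
    length = tree[u][v]
    rooted = {node: dict(neighbors) for node, neighbors in tree.items()}
    del rooted[u][v]
    del rooted[v][u]
    rooted["root"] = {u: length * fraction, v: length * (1 - fraction)}
    rooted[u]["root"] = length * fraction
    rooted[v]["root"] = length * (1 - fraction)
    return rooted

def leaf_distances(tree):
    """Return {(leaf1, leaf2): path length} for every pair of leaves."""
    return {(a, b): path_length(tree, a, b) for a, b in combinations(LEAVES, 2)}

before = leaf_distances(unrooted)
for u, v in [("x", "y"), ("y", "C")]:
    after = leaf_distances(place_root(unrooted, u, v))
    unchanged = all(isclose(before[pair], after[pair]) for pair in before)
    print(f"root on branch {u}-{v}: leaf distances unchanged = {unchanged}")
```

Placing the root on the x-y branch gives an unknown ancestor sitting between the A&D group and the C&B group, while placing it on the branch leading to C puts C closest to the root. Either way the pairwise distances come out the same, which is exactly why the distances alone cannot tell you where the root belongs.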

Anyway, that little intro to phylogenetics was only to serve the purpose of illustrating the general idea of what I was trying to do in my lab rotation, which was this: determine the root of the tree between bacteria and archaea.

As it stands, we have various ways of measuring how diverged eukarya are from bacteria, bacteria are from archaea, and eukarya are from archaea, but we don’t know which came first (or if something else entirely did)! Obviously, we’re confident that eukarya didn’t come first, but are they the evolutionary product of bacteria, archaea, or both? And so on. The first and simpler question to address is where the root is between archaea and bacteria (did one lead to the other, or are they both diverged from something else?).

That was my task. It was expected that I would spend ~4 weeks doing all of the analysis, and then another 4 coming up with conclusions and, perhaps, even writing a paper. What actually happened was that I spent all 8 weeks just trying to collect data! I’m still not done, and I need to present what I’ve accomplished to the lab on Wednesday. Every time I thought I had it, it turned out that something needed to be done a little differently, or that one of the datasets was not behaving properly, or that one of the computer programs was returning junk, etc. You get the idea.

I’ll follow up this long introductory post with others detailing what it was I actually did (with strategically removed details).