[Note: a lot of readers are finding this post through Google searches, but I have no idea if the contents of this post are fulfilling whatever need caused those searches. Please leave comments to let me know if this was helpful or not! And, if not, let me know what you were hoping to find!]
A few years ago I wrote a post on making Notepad++ your default text editor, wherein I expressed my extreme love for this little piece of open source software. To date, I use it for just about everything and, after extensive use in a computational biology lab rotation, I figured I should share one of its most amazing features: Regular Expressions.
I’m not going to take the time to define Regular expressions (“REs” from now on) and will instead send you to Wikipedia. Just know that they are totally badass. In a nutshell, REs let you do really fancy search-and-replace in a text document. Perhaps that doesn’t excite you, but let me give some hypothetical situations in which you may find yourself:
- You have a file of contact information for everyone you know (say, 1000 people) and want to get just the email addresses so that you can spam everyone.
- You have a large FASTA file and want to pull out all of the organism names.
- You want to convert a file from one format to another.
- You want to combine multiple lines into a single line.
- You want to separate a line into multiple lines.
- Other pain-in-the-ass sounding stuff.
Sure, you could manually copy-paste all of those email addresses or organism names, and you could go through and hit the ENTER key to put things on separate lines. OR, you could write a few characters into NP++’s Find & Replace box. I think an example is the best way to make this work.
As a biologist, I’m rather fond of FASTA files. FASTA is simply a way to format DNA or protein sequence data so that people and programs can easily do stuff with that data. The format is:
> Sequence one name goes here, after the greater-than sign
ATGAACAGCCTGGTAGTACCACGTGACACGTATAA
> Sequence two name goes here
ATGTGGAGTAGACCAGTCATACGTACACGTAA
Simple. The sequence data for each FASTA entry can all be on one line, as above, or on multiple lines:
> Multiline Sequence
ATGAACAGCCTGGTAGTACCACGTGACACGTA
ATGTGGAGTAGACCAGTCATACGTACACGTAA
Though some programs won’t let you use multi-line sequences (a problem addressable with REs!).
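To make the format concrete, here is a minimal Python sketch (not part of the Notepad++ workflow) that reads FASTA text and glues multi-line sequences back together. The function name `read_fasta` and the sample records are made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # defline: drop the leading '>'
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence lines are concatenated
    return records


# Hypothetical two-entry example; the first sequence spans two lines.
example = """> Sequence one
ATGAACAGCCTGGTAG
TACCACGTGACACGTATAA
> Sequence two
ATGTGGAGTAGACC
"""
records = read_fasta(example)
```

Each name ends up paired with one unbroken sequence string, which is the same result the regex steps later in this post produce inside the editor.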
Let’s run through an example of using REs in NP++ to solve a biological problem. I’m currently rotating through a fly lab, so let’s head over to flybase to grab a FASTA file. I’m going to ask you to download a text file that contains a set of 22 FASTA entries in standard multi-line format. These entries are the transcripts of ~20 isoforms of the gene shot, and each is thousands of bases long. We’re using this giant file to make a point: it would be a huge pain in the ass to manually manipulate this file, and REs make it easy.
Anyway, you can find it yourself by going to flybase, searching for shot, expanding the “Genomic Location” bar, selecting “Transcripts” in the 2nd pull-down menu, and then clicking “Get FASTA”. Or, you can just click here. Now, fire up your trusty copy of Notepad++.
I’ll walk through even the tiniest of details of this process, so feel free to skip ahead when you get bored.
Now that you’ve opened NP++, you should have a new tab open by default with nothing in it. You may also have some old stuff opened instead, in which case hit Ctrl+N to make a new tab. In your browser, go to the link I gave you above where you should see a really long FASTA set. Hit Ctrl+A to select all, then Ctrl+C to copy it. In your new NP++ tab, Ctrl+V to paste.
And now we’re ready for action. Saving is probably a good idea, that way you can start over to try out each RE. Note that I wasn’t lying about this being a long file (almost 8000 lines!).
Quick point: the lines containing the sequence name and info (following the ‘>’ sign) are called “deflines” (for “definition lines”).
First problem: The deflines are horrendously long. Let’s say you don’t care about the genomic location of all the exons making up this transcript, but just want the name by itself (i.e. “FBtr0301964” in the screenshot above). Time for the first RE!
In NP++, hit Ctrl+H to get to the Find/Replace dialog, which looks like this:
This is where all the magic happens. For our purposes, there are 6 parts of this dialog that are super-important: (1) The “Search Mode” box; (2,3) “Find What:”, “Replace With”; (4-6) “Find Next”, “Replace”, “Replace All”. Hopefully the purpose of each of these is intuitive, given their names. For now, make sure that the “Regular Expression” radio button is marked (section 1 in Fig.2).
Okay, so we need to build a regex that will search for all occurrences of >identifier blahblahblah...., and we want to replace this with >identifier. Here we go.
For regex (as in any computer language) there are characters that have special meanings. For instance, the character ^ means “the start of the line”, and parentheses mark off groups of characters that you can call upon later. Special characters are important for two reasons: (1) they do all the magic, and (2) if you want their non-special meaning, you have to “escape” them. What I mean is, say you want to search for an actual left parenthesis. The ( character DOES NOT MEAN PARENTHESIS! It means “hey, a group starts here”. To search for the actual ( character, you’ll have to precede it with a backslash. Backslashes are “escape characters”, meaning that you can put them in front of special characters to make them not so special anymore. So to signify “start of group” in regex you’ll use (, and to signify “left parenthesis” you’ll use \(.
And, in case the thought crossed your mind, you can also escape the escape character! So to signify “backslash” you’ll use \\.
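If you want to poke at escaping outside of NP++, here is a small sketch using Python’s built-in `re` module as a stand-in (an assumption on my part: it is not the engine NP++ uses, but escaping works the same way for these characters). The sample text is made up.

```python
import re

# Made-up sample text containing one "(" and one backslash.
text = r"f(x) lives in C:\temp"

# "\(" matches a literal left parenthesis; a bare "(" would open a group.
paren = re.findall(r"\(", text)

# "\\" matches a single literal backslash.
slashes = re.findall(r"\\", text)

# An unescaped "(" with no closing ")" is not even a valid pattern.
try:
    re.compile("(")
    bare_paren_ok = True
except re.error:
    bare_paren_ok = False
```

Here `paren` holds the one parenthesis, `slashes` holds the one backslash, and `bare_paren_ok` ends up `False` because the lone `(` is read as the start of a group that never closes.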
Let’s go over the really important special characters that I’ll be using here:
| Character | Meaning |
|---|---|
| \ | escape; used to make specials non-special |
| () | group; you can retrieve its contents |
| [] | any character inside is considered a match |
| . | matches any character |
| * | match the previous character 0 or more times |
| + | match the previous character 1 or more times |
| ^ | if the first character in the regex, means “beginning of line”; inside [] means “not” |
| \s | any space character (space, tab) |
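The table is easier to absorb with something to run. Below is a rough sketch, again using Python’s `re` module as a stand-in for the NP++ search box; the test strings and the defline are invented for illustration.

```python
import re

# Hypothetical defline, just for demonstration.
line = ">FBtr0301964 type=mRNA; loc=2R"

checks = {
    r"[AB]": re.findall(r"[AB]", "CABBAGE"),   # any character inside [] matches
    r".": re.findall(r".", "a b"),             # . matches any character, even a space
    r"ab*": re.findall(r"ab*", "a ab abbb"),   # * means 0 or more of the previous character
    r"ab+": re.findall(r"ab+", "a ab abbb"),   # + means 1 or more of the previous character
    r"^>": re.findall(r"^>", line),            # ^ anchors the match to the start of the line
    r"[^\s]": re.findall(r"[^\s]", "a b\tc"),  # inside [], ^ means "not"
    r"\s": re.findall(r"\s", "a b\tc"),        # the space and the tab
}

for pattern, found in checks.items():
    print(pattern, "->", found)
```

Compare the `ab*` and `ab+` rows in particular: the first also matches the lonely `a`, the second does not.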
Moving on. First, we need to be able to find these lines, and they are all different. So let’s take the one thing they all have in common: the starting ‘>’. We want to then keep every character until we hit the first space, after which comes the blahblahblah that we don’t want. The > is always the very first character of the line, when present.
Now we will build the first regex. I highly recommend doing it one step at a time, as I explain it, and then running the search in Notepad++ each time. This way you can actually see what is being matched. If you accidentally replace something in the fasta sequences, just undo (Ctrl+Z).
Translating into regex, we want to start looking at the beginning of the line (using ^), so long as the first character is > (now our regex is ^>), and we want any character afterwards, so long as it isn’t a space (using [^\s]) (now our regex is ^>[^\s]). But we want all characters after the > until we hit a space, so adding * will match the previous character, signified by [^\s] (which means “any character but a space”), as many times as it can.
This makes our final regex ^>[^\s]*, which says “starting at the beginning of the line, match if the first character is > and keep matching until you hit a space.”
Note that the character ^ is used twice, but has a different meaning each time (in the first case it means “beginning of line” and in the second it means “not”), since its meanings are position-dependent.
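The same one-step-at-a-time build can be sketched in Python (using the `re` module as an approximation of what NP++ highlights). The defline below is a shortened, hypothetical stand-in for the real ones in the file.

```python
import re

# Hypothetical, shortened defline.
defline = ">FBtr0301964 type=mRNA; loc=2R:join(1..10,20..30)"

# The regex at each stage of construction.
steps = [r"^>", r"^>[^\s]", r"^>[^\s]*"]

# What each stage actually matches on the defline.
matches = [re.search(pattern, defline).group(0) for pattern in steps]

# A sequence line does not start with '>', so the regex finds nothing there.
sequence_line = "ATGAACAGCCTGGTAG"
no_match = re.search(r"^>[^\s]*", sequence_line)

for pattern, matched in zip(steps, matches):
    print(pattern, "matches", repr(matched))
```

The match grows from `>` to `>F` to the full `>FBtr0301964`, stopping at the first space, while `no_match` is `None`.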
Okay, and now the last thing. We want to actually replace those entire lines with what our current regex finds. In order to do this, we need to make a group. A group is simply a part of your regex, wrapped in parentheses, whose contents you can retrieve after a search. We want to turn our entire search into a group:
^(>[^\s]*)
Note that the left parenthesis is not before the ^ symbol. Why? Remember that ^ only means “beginning of line” when it is the first character of your regex. If I put a ( first, that will no longer be the case.
I’m going to trick you now (though I gave you fair warning) for educational purposes. We’ve now built this cute little regex, and you should try it out! Put that regex into your “Find what:” box in Notepad++’s Search and Replace dialog, and hit find.
It works! Here’s the neat thing: put \1 into the “Replace with:” box. That \number notation lets you refer to the contents of your groups, in order, starting with group 1. We only have one group, so that’s easy. What you are now set up to do is to replace the matches to your regex with the contents of the group in your regex. Make sense? Now hit “Replace All” to see what I mean…
And there was the trick (sorry!). Do you understand what happened?
Basically we found every instance that matched your regex, and replaced that with the contents of the group. BUT the content of the group is the entire match! You should have seen a messagebox saying “22 instances replaced” or something similar, meaning that NP++ really did replace those things. It just replaced them with themselves.
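Here is the same trick reproduced in a short Python sketch, assuming the `re` module behaves closely enough to NP++ for this pattern. The two-record FASTA snippet is invented; `re.MULTILINE` is what makes ^ match at the start of every line rather than only at the start of the text.

```python
import re

# Tiny made-up FASTA snippet with two deflines.
fasta = ">FBtr0301964 type=mRNA\nATGAAC\n>FBtr0301965 type=mRNA\nATGTGG\n"

# Replace every match of the group with the group itself.
# subn returns the new text and the number of replacements made.
result, count = re.subn(r"^(>[^\s]*)", r"\1", fasta, flags=re.MULTILINE)

print(count, "instances replaced")
print("text changed?", result != fasta)
```

Two replacements are made, yet the text is identical afterwards, because the group covers the entire match.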
So back to actually solving this problem. Try to think of how you would do it before moving on.
Our regex matches >identifier as we wanted, and we can recall that match using the \1 special character. But we still need to get rid of the rest of each line. It’s simple, really. We want to replace the entire line with just the >identifier part, which means that our regex has to match the entire line. All we have to do is add .* to the end of our regex:
^(>[^\s]*).*
As shown in the table above, a period matches anything and a * lets the previous character match as much as it can. So .* matches everything from where it starts to the end of the line (regex in NP++ only works on one line at a time). I hope it is clear that this regex will match the entire line for each of the deflines, and that the group allows us to retrieve just the “>identifier” part.
Now if you update the regex in your NP++ searchbox and hit “Replace all” (so long as the “Replace with:” box still contains \1) you’ll see the magic happen.
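For comparison, a minimal Python sketch of the same replacement (the `re` module standing in for NP++; the FASTA snippet is made up). In Python, `.` does not cross line breaks by default, so `.*` stops at the end of each defline, just as described above.

```python
import re

# Tiny made-up FASTA snippet with long-ish deflines.
fasta = (
    ">FBtr0301964 type=mRNA; loc=2R:join(1..10,20..30)\n"
    "ATGAAC\n"
    ">FBtr0301965 type=mRNA; loc=2R:join(5..15)\n"
    "ATGTGG\n"
)

# Match the whole defline, but keep only the group (the ">identifier" part).
cleaned = re.sub(r"^(>[^\s]*).*", r"\1", fasta, flags=re.MULTILINE)

print(cleaned)
```

The sequence lines are untouched because they never start with `>`.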
What would have taken you minutes to do manually (search through 8000 lines and delete the majority of just 22 different lines) took a mere tens of minutes! Of course, it took longer this time since I was explaining everything ad nauseam. But next time, it will be way faster than doing it manually.
That felt horribly anti-climactic, so let’s do one more.
Second problem: At an average length of 360 lines, each transcript here is awfully difficult to select for copying/pasting into other programs. What if we put each sequence on a single line, so that the file was only 44 lines in length (one line for each defline, another for its corresponding sequence)? Of course, those lines would be really long, but at least you could more easily manage them.
Note: if you have word-wrap enabled (the View menu in NP++), then the end result will be less impressive…
What would you do? Basically, you want to replace the newline characters (\r\n in Windows, \n for everybody else) with nothing for every line except the deflines. Unfortunately, regex in NP++ (at least at the time of writing) doesn’t let you search for newline characters, so we’ll have to use a combination of a regex and the “Extended” radio button in the “Search mode” box. This is a little less straightforward than we’d like:
Since we want to replace newline characters (newlines, as distinct from “new lines”), we can’t use regex for that part. But, we only want to replace newlines for the non-defline parts, which seems to mean we’d need a regex. WHAT TO DO?
Well, we know that only AGCT show up in the sequence part of the FASTA data (note that this isn’t always true), and that the deflines now only contain >FBtr and some numbers. So why don’t we add some unique word to the end of our deflines that will not be found anywhere else, and use that as a marker? If it’s not obvious why this would help now, hopefully it will be in a moment. And, we already have the regex written to do this! Our regex (before adding “.*”) will find the deflines, and I just told you that it does not find newlines. Therefore, we can just adjust the “Replace with” box contents:
Put ^(>[^\s]*) into the “Find what” box, as before. Now in the “Replace with” box, put “\1BOOYAH!” and hit “Replace all.”
As you should expect, this will replace the “>identifier” line with the contents of the first group (\1) and will then append to that the text “BOOYAH!”.
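A quick way to see the marker step in isolation is this Python sketch (the `re` module as a stand-in for NP++, with an invented snippet whose deflines have already been trimmed to bare identifiers):

```python
import re

# Made-up snippet; deflines already reduced to ">identifier".
fasta = ">FBtr0301964\nATGAAC\nCCGTAA\n>FBtr0301965\nATGTGG\n"

# Re-insert the group, then tack the marker word onto the end of it.
marked = re.sub(r"^(>[^\s]*)", r"\1BOOYAH!", fasta, flags=re.MULTILINE)

print(marked)
```

Only the deflines gain the marker; every newline is still in place at this point.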
Now let’s just delete all of the newlines. Switch the radio button to “Extended” mode, and put \r\n (or just \n for non-Windows) in the search box and leave the replace box empty. Now when you hit “Replace all” NP++ will think for a while and then eventually delete ~8000 newlines. Careful here: you’ll end up with everything on one super-long line (400,000 characters!), and NP++ may get upset if you try to scroll!
Now, replace all “BOOYAH!” with newlines (you can use \r\n in the “Replace with” box while still in “Extended” mode).
WHAT?!?, you may ask yourself. Where’d the deflines go? Well, remember that we had previously removed all of the final newlines, and only added them back after the deflines. We still need them before the deflines. An easy fix: just repeat the previous step, but this time search for > instead of “BOOYAH!” and replace it with \r\n>.
Now just delete that extra newline at the top and WHAM, we’ve finished this tutorial. Just in time, too, or I was going to need Supplemental figures…
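For anyone who would rather script the whole second problem, here is a rough Python sketch of the marker trick from start to finish. It mirrors the manual steps under a few assumptions: the deflines are already trimmed to bare identifiers, “BOOYAH!” and “>” never occur inside sequence data, and inserting a newline before every “>” is one way to restore the break in front of each defline. The sample data is invented.

```python
import re

# Made-up multi-line FASTA with already-trimmed deflines.
fasta = ">FBtr0301964\nATGAAC\nCCGTAA\n>FBtr0301965\nATGTGG\nTTGA\n"

# Step 1: tag the end of every defline with a unique marker word.
marked = re.sub(r"^(>[^\s]*)", r"\1BOOYAH!", fasta, flags=re.MULTILINE)

# Step 2: delete every newline (Windows or otherwise); one giant line remains.
one_line = marked.replace("\r\n", "").replace("\n", "")

# Step 3: turn each marker into a newline, restoring the break after each defline.
after_deflines = one_line.replace("BOOYAH!", "\n")

# Step 4: put a newline in front of every defline as well.
before_deflines = after_deflines.replace(">", "\n>")

# Step 5: drop the extra newline at the top and end the file with one newline.
single_line_fasta = before_deflines.lstrip("\n") + "\n"

print(single_line_fasta)
```

The result has exactly two lines per record: the defline, then its full sequence on a single line.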