Search Your Files With Grep and Regex
Learn the ins and outs of searching through a file, including the most versatile method: using grep and regular expressions, or regex.
How do you search through a file? On the surface, this might seem like sort of a silly question. But somewhere between the common-sense answer for many ("double click it and start reading!") and the heavily technical ("command line text grep regex") lies an interesting set of questions.
- Where does this file reside?
- What kind of file is it?
- How big is the file?
- What, exactly, are you looking for in the file?
Today, we're going to look at one of the most versatile ways to search a file: using grep and regex (short for regular expression). Using this combination of tools, you can search files of any sort and size. You can also search with extremely limited access to your environment, and if you get creative, you can find just about anything.
But with that versatility comes a bit of a learning curve. So let's look at how to take the edge off of that and get you familiar with this file search technique. To do that, I'll walk through a hypothetical example of trying to extract some information. First, though, let's cover a bit of background.
What Is Grep?
What does it do? Simple enough. Grep helps you search through files, looking for patterns.
Here's a template of what that looks like.
grep [-options] pattern [filename]
So basically, at a command line prompt, you would type "grep ford cars.txt" if you wanted to search for the text "ford" in the file "cars.txt." The grep utility would print any matching lines right there in the console for you to review.
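The -options part is just that: it lets you supply some options. For instance, you can tell grep to ignore the case of the characters (-i) or to show the line number of each match (-n). To put the results into a new file, redirect grep's output with the shell's > operator.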
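Here is a minimal sketch of a few common options in action. The contents of cars.txt are made up for illustration, and the temporary directory is only there so the example doesn't touch your real files.

```shell
# Create a small sample file (hypothetical contents, for illustration only).
workdir=$(mktemp -d)
printf '%s\n' 'Ford Mustang' 'honda civic' 'ford focus' 'Toyota Camry' > "$workdir/cars.txt"

# Plain search: grep is case-sensitive by default, so only "ford focus" matches.
grep ford "$workdir/cars.txt"

# -i ignores case; -n prefixes each matching line with its line number.
grep -i -n ford "$workdir/cars.txt"

# -c prints the number of matching lines instead of the lines themselves.
grep -i -c ford "$workdir/cars.txt"

# Saving results is the shell's job: redirect grep's output with >.
grep -i ford "$workdir/cars.txt" > "$workdir/fords.txt"
```

All three options shown (-i, -n, -c) are standard grep options, so this should behave the same way on most Unix-like systems.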
And that's really all there is to grep. Its beauty lies in its simplicity and the power it gives you to do things.
What Is a Regex?
Speaking of power, let's talk about regex. The term is actually, as I mentioned earlier, "regular expressions," but it's such a ubiquitous term in the programmer world that it's earned a nickname. I don't think regex has quite made the English dictionary yet, but programmers know what you mean by this.
You'll find that programmers have a love-hate relationship with regex - as in, some programmers love them and others hate them. People love regex for the power they confer on their users. Others hate them for their incomprehensibility and the confusion they create.
So what are they? Like grep, it's simple enough. Regular expressions are sequences of characters that represent patterns, and they instruct regex parsers on ways to search text and match patterns. Think of a much simpler version of this concept: the wildcard. The wildcard lets you enter a search like, say, "d*g" in the dictionary and receive results that include "dog," "dig," and "dug."
Regex takes this to a whole new level. You can do simple matches and wildcard searches with them. For instance, the expression "d.*g" says the same thing as my wildcard example: match words that start with d and end with g and have stuff between them.
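To make that concrete, here is a small sketch against a made-up word list (words.txt and its contents are assumptions for illustration). It also shows that grep matches anywhere on a line unless you anchor the pattern.

```shell
# Hypothetical word list, one word per line.
workdir=$(mktemp -d)
printf '%s\n' dog dig dug cat drag dodge > "$workdir/words.txt"

# "." matches any single character and "*" repeats it zero or more times,
# so "d.*g" matches a "d" followed, somewhere later on the line, by a "g".
# This also matches "dodge", because "dodg" appears inside it.
grep 'd.*g' "$workdir/words.txt"

# ^ and $ anchor the match to the start and end of the line,
# so only lines that begin with "d" and end with "g" are printed.
grep '^d.*g$' "$workdir/words.txt"

# -x requires the whole line to match; "d.g" allows exactly one
# character between "d" and "g".
grep -x 'd.g' "$workdir/words.txt"
```

The single quotes keep the shell from trying to expand the "*" before grep ever sees it.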
But you can get more complicated, too. A lot more complicated.
^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$
Want to hazard a guess what that does? It matches a date in yyyy-mm-dd format from 1900 through 2099. I mean, obviously, right?
It's this complexity that drives the love-hate in the programming world. Expressing validation logic for a date in just 64 characters is powerful. But good luck understanding expressions like this one without significant study and memorization. They're hard to read.
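If you want to try a yyyy-mm-dd date pattern of this kind with grep, one caveat applies: "\d" is Perl-style shorthand and is not part of the extended regular expressions that grep -E understands. The sketch below swaps in "[0-9]" and runs the pattern against a made-up dates.txt (GNU grep also offers -P for Perl-style syntax where it is available, but that is not assumed here).

```shell
# Hypothetical sample data: a mix of valid-looking and invalid dates.
workdir=$(mktemp -d)
printf '%s\n' '1999-12-31' '2024-02-30' '2100-01-01' \
              '2024-13-01' '2024/07/04' 'on 2024-07-04' > "$workdir/dates.txt"

# Same structure as a \d-based date pattern, with [0-9] in place of \d:
#   (19|20)[0-9][0-9]          year 1900 through 2099
#   [- /.]                     separator: dash, space, slash, or dot
#   (0[1-9]|1[012])            month 01 through 12
#   (0[1-9]|[12][0-9]|3[01])   day 01 through 31
date_re='^(19|20)[0-9][0-9][- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$'

# -E enables extended regular expressions (grouping and alternation).
grep -E "$date_re" "$workdir/dates.txt"
```

Note that 2024-02-30 still matches: the pattern checks the shape of a date, not whether that day exists in that month.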
Grep: a Simple Example
By now, you've probably put together the grep-regex equation in your mind with its value proposition. Grep lets you search files from the command line, and regex lets you do some really formidable stuff. But let's walk before we run and take a look at an example using just grep.
A lot of hosted solutions using Apache feature something called the Webalizer. It offers a very specific sort of log aggregation, but that's not of interest for this example. Instead, we're interested in its configuration file, webalizer.conf. For this example, I want to log in to my hosted web server and figure some things out about my configuration.
A lot of people default to popping open a file in a text editor to take a look around and search. But recall the restrictions I mentioned at the beginning of the post.
- The file resides on a server where I have limited SSH access and can't use a graphical text editor.
- I don't know exactly what's in it, and it could be really big, for all I know.
- I might need to adjust my search as I go.
So because of these restrictions, I go with grep. I know that I can use it from the command line to search the file. Now, let's say that a Google search for some problem told me that I needed to check on a series of settings that start with "All" and I don't really know where in the file they are.
Grep to the rescue.
grep All webalizer.conf
That's pretty good: it prints every line containing "All." But I don't really care about the comments or "HideAllSites" that come back with it, so I revise it just a touch.
grep "#All" webalizer.conf
(The quotes are there because "#" is a special character: unquoted, the shell would treat it as the start of a comment rather than pass it to grep as part of the search.)
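To follow along without a real server, here is a sketch that runs both searches against a stand-in file. The contents below are a hypothetical excerpt written for this example, not a copy of any actual webalizer.conf.

```shell
# Build a small stand-in config file (hypothetical contents).
workdir=$(mktemp -d)
cat > "$workdir/webalizer.conf" <<'EOF'
# The All* keywords allow the display of all entries in a table.
#AllSites      no
#AllURLs       no
#AllReferrers  yes
#AllAgents     no
HideAllSites   no
PageType       htm*
EOF

# First attempt: every line containing "All", including the
# explanatory comment and the unrelated HideAllSites setting.
grep All "$workdir/webalizer.conf"

# Revised search: only lines containing "#All", which are the
# commented-out settings. The quotes stop the shell from treating
# "#" as the start of a comment.
grep "#All" "$workdir/webalizer.conf"
```

In this sample, the first command prints six lines and the second prints four.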
Grep Regex: a Simple Example
Alright, now we're making progress. I've got just the settings that I want. And I can also see that my problem may stem from the fact that the settings are not enabled, by virtue of the "#" commenting them out.
But let's say that for this hypothetical example, I wanted to drill in a little further. And let's also say that the file was a lot bigger with a lot more matches, so simply opening it and looking for these lines wasn't feasible.
What if I wanted to narrow this down to just the "yes" entries? Huh. I've pretty much reached the limit of what I can do with simple search. Not only is there indeterminate text between the #All and the yes/no, but there's also a variable number of spaces. Let's further say that I'm only interested in the existence of a "yes," not a "no" or a hypothetical blank. How would I do that?
Well, I'd get ready to start using regex. (And I'd also test my expressions with a regex testing tool, because regex is hard.) Let's see what happens with this one.
grep "#All.*yes" webalizer.conf
Success! Now we've narrowed it specifically to the items we want to act on: the ones set to "yes" but commented out.
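Here is a sketch of that search against a stand-in file, again with hypothetical contents. It also shows one way the loose pattern can over-match, along with a stricter variant; the stricter pattern is my own illustration, not something this particular file format requires.

```shell
# Stand-in config file (hypothetical contents, for illustration only).
workdir=$(mktemp -d)
cat > "$workdir/webalizer.conf" <<'EOF'
#AllSites      no
#AllURLs       yes
#AllReferrers  yes
#AllAgents     no
#AllUsers      no   # set to yes to list all users
AllSearchStr   yes
HideAllSites   yes
EOF

# "#All", then any run of characters (".*"), then "yes".
# This finds the two commented-out "yes" settings, but it also picks up
# the #AllUsers line, because "yes" appears in its trailing comment.
grep "#All.*yes" "$workdir/webalizer.conf"

# A stricter sketch: the line must start with "#All", continue with a
# setting name made of letters, then whitespace, then "yes" as the value
# at the end of the line.
grep -E '^#All[A-Za-z]+[[:space:]]+yes[[:space:]]*$' "$workdir/webalizer.conf"
```

Whether the extra strictness is worth it depends on how messy the file is; for a quick look, the loose pattern is often enough.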
As you can see, this is extremely powerful. I've just scratched the surface, and delving any more into grep (and especially into regex) would carry us well beyond the scope of this post. But even with access only to the command line, and without ever opening a file, you can perform remarkably sophisticated searches to zero in on issues.
Grep and Regex: Know When to Say When
I'll close with a bit of philosophical advice. As you learn your way around these tools, you'll find yourself able to do some truly cool stuff. They'll help you solve problems and be productive.
But don't let yourself lose sight of the forest for the trees. In a pinch (like my hypothetical one) where you need to log into a server, drill into some configuration file, and find stuff in it, this is great. If you find yourself doing similar things on a routine basis, you might start to ask yourself why. And if this is your life, grepping and regexing your way through countless, massive files (e.g., log files), then you probably have better options at your disposal.
Grep and regex are powerful, but so too are tools dedicated to automatic ingestion, parsing, and analysis of common types of files. And, since we're all in the business of automating, if you find yourself constantly slinging grep and regex at various files, you might ask yourself if there isn't a way to automate what you're doing instead.
Published at DZone with permission of Erik Dietrich, DZone MVB.
Opinions expressed by DZone contributors are their own.