Summary

Sooner or later, I'm always tempted to parse text the wrong way. Someone, somewhere, will have a text file, and I need a word, a phrase, or a line from somewhere in the middle. I start looking up the man pages to remind myself how to use strpbrk(), strtok(), strstr(), etc. And just as I'm typing "man" at the command prompt, I remember: I know regular expressions!

See also the follow-up post on std::regex and boost::regex from 2014-12-13.

Regex

Plenty of information is available on regular expressions. Personally, I keep Jeffrey Friedl's "Mastering Regular Expressions" close at hand on the bookcase next to my desk. The problem with using regular expressions is that while regex.h isn't overly complicated, it is more difficult to use than a simple API call. I will forever remember the order and format of parameters to fprintf(). But every time I need to call regcomp(), regexec() and regfree(), I need to look things up.

The solution I've used through the years is to write myself a helper function I can call. As soon as I rely on regular expressions once to do something, I know I'll end up needing them more often. The easiest way to deal with this is to have a function that compiles the expression, applies it, frees resources, and returns the results. The most complicated part in all this is extracting the results from pmatch[]. This is what it looks like:

    #define MAX_REGEX_MATCHES 10
    ...
    // compile the regular expression
    regex_t preg;
    int rc = regcomp( &preg, pattern.c_str(), REG_EXTENDED );
    ...
    // apply the regular expression to the input text
    size_t nmatch = MAX_REGEX_MATCHES;
    regmatch_t pmatch[MAX_REGEX_MATCHES];
    rc = regexec( &preg, text.c_str(), nmatch, pmatch, 0 );
    ...
    // extract all the groupings; pmatch[0] holds the complete match, so at
    // most MAX_REGEX_MATCHES - 1 groupings are available (stopping there
    // keeps pmatch[ i + 1 ] within the bounds of the array)
    for ( int i = 0; i < MAX_REGEX_MATCHES - 1; i++ )
    {
        // regex will store the complete string match in pmatch[0], but what
        // we want are the (...) groupings
        regoff_t so = pmatch[ i + 1 ].rm_so; // rm_so == start offset
        regoff_t eo = pmatch[ i + 1 ].rm_eo; // rm_eo == end offset

        // if matched, the offset will be >=0
        if ( so >= 0 && eo > so )
        {
            // the length of the matched string is eo - so
            result[i] = text.substr( so, eo - so );
        }
    }

    // free the compiled regex
    regfree( &preg );

The previous code example does not include error checking. Also note that pattern and text in this example are C++ std::string objects, and result is an indexable container of std::string.
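To show how those pieces might fit together as the helper function described above, here is a minimal sketch. The name regex_groups() and its signature are made up for illustration and are not the helper I actually use. It also sizes pmatch[] from the re_nsub field that regcomp() fills in, rather than using a fixed MAX_REGEX_MATCHES, and it returns an empty vector both when the pattern fails to compile and when nothing matches.

```cpp
#include <regex.h>

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (name and signature are illustrative only): returns the
// text of each (...) grouping, or an empty vector when the pattern does not
// compile or does not match.
std::vector<std::string> regex_groups( const std::string & pattern, const std::string & text )
{
    std::vector<std::string> result;

    // compile the regular expression
    regex_t preg;
    if ( regcomp( &preg, pattern.c_str(), REG_EXTENDED ) != 0 )
    {
        // the contents of preg are undefined on failure, so skip regfree()
        return result;
    }

    // re_nsub is the number of (...) groupings; add 1 for the complete match
    const size_t nmatch = preg.re_nsub + 1;
    std::vector<regmatch_t> pmatch( nmatch );

    // apply the regular expression to the input text
    if ( regexec( &preg, text.c_str(), nmatch, pmatch.data(), 0 ) == 0 )
    {
        // skip pmatch[0] (the complete match) and keep only the groupings
        for ( size_t i = 1; i < nmatch; i++ )
        {
            const regoff_t so = pmatch[i].rm_so;
            const regoff_t eo = pmatch[i].rm_eo;

            // a grouping that did not take part in the match has offsets of -1
            result.push_back( so >= 0 ? text.substr( so, eo - so ) : std::string() );
        }
    }

    // free the compiled regex
    regfree( &preg );

    return result;
}
```

With this sketch, regex_groups( "([a-z]+)=([0-9]+)", "width=42" ) returns two strings, "width" and "42". Recompiling the pattern on every call is wasteful if the same expression is applied many times, but it keeps the call site down to a single line.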

Error checking

Error checking is done using regerror(). Especially if you've hidden the complexity of regex.h in a helper function, you'll want to detect and log regex errors. Depending on how you use regular expressions, not finding a match may not actually be an error, so you'll need to exclude REG_NOMATCH when checking the return value:

    if ( rc && rc != REG_NOMATCH )
    {
        char errbuf[1000];
        regerror( rc, &preg, errbuf, sizeof(errbuf) );
        printf( "Error: %s\n", errbuf );
        ...
    }
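If you would rather not guess at a buffer size, regerror() can also report how many bytes the message needs when it is called with a zero-length buffer. The sketch below relies on that; regex_error_message() and check_pattern() are hypothetical names used only for this illustration.

```cpp
#include <regex.h>

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: converts a regcomp()/regexec() return code to a std::string.
std::string regex_error_message( int rc, const regex_t & preg )
{
    // with a zero-length buffer, regerror() only returns the size required
    // to hold the message, including the terminating NUL
    const size_t len = regerror( rc, &preg, nullptr, 0 );

    // one extra zeroed byte guarantees the buffer is always NUL-terminated
    std::vector<char> buf( len + 1, '\0' );
    regerror( rc, &preg, buf.data(), len );

    return std::string( buf.data() );
}

// Hypothetical helper: returns an empty string when the pattern compiles,
// otherwise the error message from regerror().
std::string check_pattern( const std::string & pattern )
{
    regex_t preg;
    const int rc = regcomp( &preg, pattern.c_str(), REG_EXTENDED );
    if ( rc == 0 )
    {
        regfree( &preg );
        return std::string();
    }

    return regex_error_message( rc, preg );
}
```

The exact wording of the message depends on the C library, so it is best used for logging rather than compared against a fixed string.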

Last modified: 2015-03-01
Stéphane Charette, stephanecharette@gmail.com