Using System.Text.RegularExpressions.Regex to do search/replace

By | 2009-04-14

Note: I have moved this blog from my old blog site, http://dcarapic.blogspot.com/, so I am reposting most of the posts from there.

Google returns ~21,000,000 results (at the time of this writing) if you search for 'regular expressions', so you must be asking yourself why anybody would write yet another post about them. Well, sometimes you just need regular expressions to do some handy text processing, such as search/replace. I wrote this post to give you some quick and dirty info on how to use the .NET Regex class to do substitutions.

First off, the simple search and replace. We search for a word and then replace it with another word:

// All examples in this post need: using System; using System.Text.RegularExpressions;
public static void ReplaceSimple()
{
    string example = "I am a man.";
    string replaced = Regex.Replace(example, "man", "woman");
    Console.WriteLine(replaced);
    // Output: I am a woman.
    Console.ReadKey();
}

If you are using Regex for this kind of search/replace, then do yourself a favour and take a look at the String.Replace method. Regular expressions become useful when you have constraints on how the search and replace may be done. Let's say that you wish to process some HTML (I am beating a dead horse, but who cares) and you would like to replace all <div> tags with <span> tags. One option is to use two replacements, one which replaces <div> with <span> and another which replaces </div> with </span> (of course we could not just replace "div" with "span", because "div" might appear as part of the HTML body text). But doing it that way is not so interesting (and also makes this post pretty useless), so let's do something more complex:

public static void ReplaceDiv()
{
    string example = "<div>Becomes span</div>";
    string replaced = Regex.Replace(example, @"<(/{0,1})div>", @"<$1span>");
    Console.WriteLine(replaced);
    // Output: <span>Becomes span</span>
    Console.ReadKey();
}

The secret to doing complex search/replace in .NET (and I guess in many other regular expression implementations) is to define regular expression 'capture' groups inside the search pattern and then use them inside the replacement pattern. In the code above we are saying: "give me all matches that start with <, have 0 or 1 / and end with div>; also capture the 0 or 1 / as a group; then replace each match with <, followed by the first group, followed by span>". The parentheses ( and ) serve as a grouping construct, which we use to 'capture' the 0 or 1 / character.
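To make the capture visible, here is a minimal sketch that lists every match together with its first group and then compares the regex replacement with the two-pass String.Replace approach mentioned earlier. The class name DivToSpanSketch and its method names are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivToSpanSketch
{
    // Two plain string replacements, one per tag form.
    public static string WithStringReplace(string html)
    {
        return html.Replace("<div>", "<span>").Replace("</div>", "</span>");
    }

    // One regex replacement; $1 is the optional slash captured by group 1.
    public static string WithRegex(string html)
    {
        return Regex.Replace(html, @"<(/{0,1})div>", "<$1span>");
    }

    public static void Main()
    {
        string example = "<div>Becomes span</div>";

        foreach (Match m in Regex.Matches(example, @"<(/{0,1})div>"))
        {
            // Groups[0] is the whole match, Groups[1] is the first parenthesized group.
            Console.WriteLine("match '{0}', group 1 '{1}'", m.Value, m.Groups[1].Value);
        }
        // Output:
        // match '<div>', group 1 ''
        // match '</div>', group 1 '/'

        Console.WriteLine(WithStringReplace(example));
        Console.WriteLine(WithRegex(example));
        // Output (both lines): <span>Becomes span</span>
    }
}
```

For the opening tag the group captures an empty string, so $1 contributes nothing to the replacement; for the closing tag it contributes the slash.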

Any time we use parentheses we create a capture group, which can be referenced in the replacement pattern with the dollar ($) character followed by the group number ($1 for the first group, $2 for the second, and so on). Using numbered capture groups is simple but may lead to regular expression search patterns which are not immediately obvious (as if any useful regular expression patterns are obvious :)). To make it easier you may use named grouping constructs:

public static void ReplaceDivNamedGroup()
{
    string example = "<div>Becomes span</div>";
    string replaced = Regex.Replace(example, @"<(?<slash>/{0,1})div>", @"<${slash}span>");
    Console.WriteLine(replaced);
    // Output: <span>Becomes span</span>
    Console.ReadKey();
}

Naming a group is simple: just add ?<name> right after the opening parenthesis. You may then refer to that group with the syntax ${name} inside the replacement pattern.
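Named groups pay off most when a pattern has several of them. The sketch below is an illustration only (the date-reordering helper ReorderDate and the class name are hypothetical, not part of the examples above): it rearranges three named groups in the replacement pattern and also reads a named group from a Match object.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupSketch
{
    // Hypothetical helper: rewrites yyyy-MM-dd dates as dd.MM.yyyy.
    public static string ReorderDate(string text)
    {
        return Regex.Replace(
            text,
            @"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})",
            "${day}.${month}.${year}");
    }

    public static void Main()
    {
        Console.WriteLine(ReorderDate("Posted on 2009-04-14"));
        // Output: Posted on 14.04.2009

        // Named groups can also be read from a Match by name.
        Match m = Regex.Match("</div>", @"<(?<slash>/{0,1})div>");
        Console.WriteLine(m.Groups["slash"].Value);
        // Output: /
    }
}
```

With numbered groups the same replacement would read "$3.$2.$1", which works but forces the reader to count parentheses.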

As a bizarre fact, you may also refer to captured groups inside the search pattern itself. Example:

public static void EhThatRegex()
{
    string example = "We are searching for Yoda speak, searching are we";
    var match = Regex.Match(example, @"(\w+)\s+(\w+)\s+.*\2\s+\1");
    Console.WriteLine(match.Success);
    // Output: True
    Console.ReadKey();
}

Here we go, I've posted a non-obvious regular expression. But it cannot be helped. A captured group can be referenced inside the search pattern too (this is called a backreference); we just may not use $1 or ${xxx} there but rather \1 and \k<xxx>. Off the top of my head I cannot think of where such a pattern would be useful, but you might.

Here is the explanation of the search pattern (\w+)\s+(\w+)\s+.*\2\s+\1: match a run of word characters followed by whitespace, (\w+)\s+ (basically a complete word), and capture the word in group 1; match another word and capture it in group 2; then skip anything until you reach a point where the second word appears, followed by whitespace and the first word. Note that matching is case-sensitive by default, so in the example above "We" and "we" do not count as the same word; the match that is actually found runs from "are searching" to "searching are". I am sure that some regular expression guru could use this in some way to check if a sentence is a palindrome, but that is beyond my abilities.
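To check which words the groups really capture, a small sketch like the following prints the matched text and both groups, and repeats the search with the \k<name> syntax. The class name and the group names first and second are my own choices for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceSketch
{
    // The pattern from the example above, with numbered backreferences.
    public const string Numbered = @"(\w+)\s+(\w+)\s+.*\2\s+\1";

    // The same pattern with named groups and \k<name> backreferences.
    public const string Named =
        @"(?<first>\w+)\s+(?<second>\w+)\s+.*\k<second>\s+\k<first>";

    public static void Main()
    {
        string example = "We are searching for Yoda speak, searching are we";

        Match m = Regex.Match(example, Numbered);
        Console.WriteLine(m.Value);
        // Output: are searching for Yoda speak, searching are
        Console.WriteLine(m.Groups[1].Value + " / " + m.Groups[2].Value);
        // Output: are / searching

        Match named = Regex.Match(example, Named);
        Console.WriteLine(named.Groups["first"].Value + " / " + named.Groups["second"].Value);
        // Output: are / searching

        // With RegexOptions.IgnoreCase the leading "We" becomes a candidate
        // for group 1 as well; print the result to see which words are captured.
        Match ci = Regex.Match(example, Numbered, RegexOptions.IgnoreCase);
        Console.WriteLine(ci.Groups[1].Value + " / " + ci.Groups[2].Value);
    }
}
```

The default search starts at "are" rather than "We" because a backreference has to match exactly the same characters, including their case.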
