C# Regular Expressions, Revisited

by Brad Merrill, coauthor of C# Essentials, 2nd Edition.
03/11/2002

It's been a year since my original article on the .NET Framework's Regex classes, and since O'Reilly is now releasing the 2nd edition of C# Essentials, it seems worthwhile to update this topic.

Updates since Beta1

The following is a brief list of the major changes to the RegEx class library since Beta1:

  • The RegularExpression assembly is now merged with the main Framework class library, so you no longer need to reference the assembly separately. You still need to name the namespace in a using statement in order to refer to the classes by their short names.
  • When matching groups, the Group method used to retrieve the group at a given index. Now a match exposes the Groups property, a GroupCollection that can be indexed directly.
  • In Beta1, the Regex modifiers were specified as character codes. Now there is an enum called RegexOptions that provides the modifier functionality.
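
The last two changes can be sketched in a few lines. This is only an illustration, not code from the original article: the class name GroupsAndOptionsDemo and the helper FirstPair are hypothetical, and the pattern is borrowed from the fish example later in this article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupsAndOptionsDemo
{
    // Hypothetical helper: returns the two captured groups of the first
    // match, or null when the text does not match at all.
    public static string[] FirstPair(string text)
    {
        // RegexOptions values are combined with | instead of passing
        // modifier characters as in Beta1.
        Regex re = new Regex(@"(\w+)\s+(fish)",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Match m = re.Match(text);
        if (!m.Success)
            return null;
        // Groups is a GroupCollection: index 0 is the whole match,
        // and 1..n are the parenthesized groups.
        GroupCollection groups = m.Groups;
        return new string[] { groups[1].Value, groups[2].Value };
    }

    public static void Main()
    {
        string[] pair = FirstPair("One FISH two fish");
        // Prints: Group1='One', Group2='FISH'
        Console.WriteLine("Group1='" + pair[0] + "', Group2='" + pair[1] + "'");
    }
}
```

Indexing Groups directly replaces the old per-index method call, and the enum makes the options visible to the compiler rather than hiding them in a string.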

Compiled RegEx

The biggest new feature in Beta2 was compiled regular expressions. A regular expression can now be compiled into its own assembly, and other assemblies can then be built that reference it. This is similar to the way Lex programs are often used to create separate parsers.

Let's look at a small (somewhat contrived) example. Let's say we have a pattern that stays fixed 90% of the time we use it. Such a pattern is a good candidate for prebuilding as a precompiled assembly.

Here's the first part of the solution, which expresses the matching pattern, and generates the assembly:


namespace MyApp
  {
  using System;
  using System.Reflection;
  using System.Text.RegularExpressions;
  class GenFishRegEx
    {
    static void Main()
      {
      // create the pattern
      string pat = @"(\w+)\s+(fish)";
      // create the compile info
      RegexCompilationInfo rci = new RegexCompilationInfo(
        pat, RegexOptions.IgnoreCase, "FishRegex", "MyApp", true);
      // setup to compile
      AssemblyName an = new AssemblyName();
      an.Name = "FishRegex";
      RegexCompilationInfo[] rciList = { rci };
      // compile the regular expression
      Regex.CompileToAssembly(rciList, an);
      }
    }
  }

In this sample, the compiled expression pat matches a word preceding the word fish.

We now compile and run as:


	csc GenFishRegEx.cs
	GenFishRegEx

We have now created FishRegex.dll and can use this assembly in a new program. The specifics worth noting are that RegexCompilationInfo names the new type (FishRegex) and the namespace to create it in (MyApp), while the AssemblyName gives the name of the newly created assembly.

You can now use this new assembly as:


// build as:
// csc /r:fishregex.dll UseFishRegEx.cs
namespace MyApp
  {
  using System;
  using System.Reflection;
  using System.Text.RegularExpressions;
  class UseFishRegEx
    {
    public static void Main()
      {
      string text = "One fish two fish red fish blue fish";
      int matchCount = 0;
      FishRegex f = new FishRegex();
      foreach (Match m in f.Matches(text))
        {
        Console.WriteLine("Match"+ (++matchCount));
        for (int i = 1; i <= 2; i++)
          {
          Group g = m.Groups[i];
          Console.WriteLine("Group"+i+"='" + g + "'");
          CaptureCollection cc = g.Captures;
          for (int j = 0; j < cc.Count; j++)
            {
            Capture c = cc[j];
            System.Console.WriteLine(
              "Capture"+j+"='" + c + "', Position="+c.Index);
            }
          }
        }
      }
    }
  }

When we need a new instance of the Regex, we simply instantiate it with new FishRegex() and then use it normally, since it inherits from Regex.

As a useful check of the above, build and run both programs, then examine their public methods with the ildasm tool.

Performance Considerations

If you do need to create compiled regex assemblies, note that you pay a cost for the initial assembly load. You can minimize this cost by using the NGEN tool to create a pre-JIT'ed assembly, which drastically reduces the initial load cost.

Updated Cookbook

I have updated the C# Cookbook samples for RTM. The updates amounted to applying the Beta1-to-Beta2 changes outlined above to each code fragment.


    // Roman Numbers
    string p1 = "^m*(d?c{0,3}|c[dm])"
      + "(l?x{0,3}|x[lc])(v?i{0,3}|i[vx])$";
    string t1 = "vii";
    Match m1 = Regex.Match(t1, p1);
    Console.WriteLine("Match=[" + m1 + "]");

    // Swap first two words
    string t2 = "the quick brown fox";
    string p2 = @"(\S+)(\s+)(\S+)";
    Regex x2 = new Regex(p2);
    string r2 = x2.Replace(t2, "$3$2$1", 1);
    Console.WriteLine("Result=[" + r2 + "]");

    // Keyword = Value
    string t3 = "myval = 3";
    string p3 = @"(\w+)\s*=\s*(.*)\s*$";
    Match m3 = Regex.Match(t3, p3);
    Console.WriteLine("Group1=[" + m3.Groups[1] + "]");
    Console.WriteLine("Group2=[" + m3.Groups[2] + "]");
    
    // Line of at least 80 chars
    string t4 = "********************"
      + "******************************"
      + "******************************";
    string p4 = ".{80,}";
    Match m4 = Regex.Match(t4, p4);
    Console.WriteLine("Line >= 80 chars=[" + m4.Success + "]");

    // MM/DD/YY HH:MM:SS
    string t5 = "01/01/01 16:10:01";
    string p5 = @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
    Match m5 = Regex.Match(t5, p5);
    Console.WriteLine("M5=" + m5);
    for (int i = 1; i <= 6; i++)
      Console.WriteLine("Group" + i + "=[" + m5.Groups[i] + "]");

    // Changing directories (for Windows)
    string t6 = @"C:\Documents and Settings\user1\Desktop\";
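    // backslashes are escaped in the pattern but are literal in the replacement
    string r6 = Regex.Replace(t6, @"\\user1\\", @"\user2\");
    Console.WriteLine("r6=" + r6);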

    // expanding (%nn) hex escapes
    string t7 = "%41";
    string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
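    // HexConvert (not shown) is a method matching the MatchEvaluator
    // delegate: it takes a Match and returns the replacement string
    string r7 = Regex.Replace(t7, p7, new MatchEvaluator(HexConvert));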
    Console.WriteLine("R7=" + r7);

    // deleting C comments (imperfectly)
    string t8 = @"
/*
 * this is an old cstyle comment block
 */
foo();
";
    string p8 = @"
  /\*  # match the opening delimiter
  .*?	 # match a minimal number of characters
  \*/	 # match the closing delimiter
";
    string r8 = Regex.Replace(t8, p8, "",
			      RegexOptions.IgnorePatternWhitespace
			      | RegexOptions.Singleline);
    Console.WriteLine("r8="+r8);

    // Removing leading and trailing whitespace
    string t9a = "     leading";
    string p9a = @"^\s+";
    string r9a = Regex.Replace(t9a, p9a, "");
    Console.WriteLine("r9a=" + r9a);

    string t9b = "trailing    ";
    string p9b = @"\s+$";
    string r9b = Regex.Replace(t9b, p9b, "");
    Console.WriteLine("r9b=" + r9b);

    // turning \ followed by n into a real newline
    string t10 = @"\ntest\n";
    string r10 = Regex.Replace(t10, @"\\n", "\n");
    Console.WriteLine("r10=" + r10);

    // IP address
    string t11 = "55.54.53.52";
    string p11 = "^" +
      @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
      @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
      @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
      @"([01]?\d\d?|2[0-4]\d|25[0-5])" +
      "$";
    Match m11 = Regex.Match(t11, p11);
    Console.WriteLine("M11=" + m11);
    Console.WriteLine("Group1=" + m11.Groups[1]);
    Console.WriteLine("Group2=" + m11.Groups[2]);
    Console.WriteLine("Group3=" + m11.Groups[3]);
    Console.WriteLine("Group4=" + m11.Groups[4]);

    // removing leading path from filename

    string t12 = @"c:\file.txt";
    string p12 = @"^.*\\";
    string r12 = Regex.Replace(t12, p12, "");
    Console.WriteLine("r12=" + r12);

    // joining lines in multiline strings
    string t13 = @"this is 
a split line";
    string p13 = @"\s*\r?\n\s*";
    string r13 = Regex.Replace(t13, p13, " ");
    Console.WriteLine("r13=" + r13);

    // extracting all numbers from a string
    string t14 = @"
test 1
test 2.3
test 47
";
    string p14 = @"(\d+\.?\d*|\.\d+)";
    MatchCollection mc14 = Regex.Matches(t14, p14);
    foreach (Match m in mc14)
      Console.WriteLine("Match=" + m);

    // finding all caps words
    string t15 = "This IS a Test OF ALL Caps";
    string p15 = @"(\b[^\Wa-z0-9_]+\b)";
    MatchCollection mc15 = Regex.Matches(t15, p15);
    foreach (Match m in mc15)
      Console.WriteLine("Match=" + m);

    // find all lowercase words
    string t16 = "This is A Test of lowercase";
    string p16 = @"(\b[^\WA-Z0-9_]+\b)";
    MatchCollection mc16 = Regex.Matches(t16, p16);
    foreach (Match m in mc16)
      Console.WriteLine("Match=" + m);
    
    // find all initial caps
    string t17 = "This is A Test of Initial Caps";
    string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
    MatchCollection mc17 = Regex.Matches(t17, p17);
    foreach (Match m in mc17)
      Console.WriteLine("Match=" + m);
    
    // find links in simple html
    string t18 = @"
<html>
<a href=""first.htm"">first tag text</a>
<a href=""next.htm"">next tag text</a>
</html>
";
    string p18 = @"<A[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>";
    MatchCollection mc18 = Regex.Matches(t18, p18,
					 RegexOptions.Singleline
					 | RegexOptions.IgnoreCase);
    foreach (Match m in mc18)
      {
      Console.WriteLine("Match=" + m);
      Console.WriteLine("Group1=" + m.Groups[1]);
      }

    // finding middle initial
    string t19 = "Hanley A. Strappman";
    string p19 = @"^\S+\s+(\S)\S*\s+\S";
    Match m19 = Regex.Match(t19, p19);
    Console.WriteLine("Initial=" + m19.Groups[1]);

    // changing inch marks to quotes
    string t20 = @"2' 2"" ";
    string p20 = "\"([^\"]*)";
    string r20 = Regex.Replace(t20, p20, "``$1''");
    Console.WriteLine("Result=" + r20);
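
The hex-escape sample above passes HexConvert to Regex.Replace without showing its body. Here is a minimal sketch of what such a method could look like, assuming it only needs to turn the two captured hex digits into the corresponding character. The class name HexEscapeDemo and the Expand wrapper are hypothetical, and this HexConvert body is an illustration rather than the original implementation.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class HexEscapeDemo
{
    // Assumed stand-in for HexConvert: parses the two hex digits captured
    // in group 1 and returns the character with that code.
    public static string HexConvert(Match m)
    {
        int code = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber);
        return ((char)code).ToString();
    }

    // Hypothetical wrapper: expands every %nn escape in the input.
    public static string Expand(string text)
    {
        string pattern = "%([0-9A-Fa-f][0-9A-Fa-f])";
        // The MatchEvaluator delegate is invoked once per match, and its
        // return value replaces the matched text.
        return Regex.Replace(text, pattern, new MatchEvaluator(HexConvert));
    }

    public static void Main()
    {
        // Prints: R7=A
        Console.WriteLine("R7=" + Expand("%41"));
    }
}
```

A delegate is used here because the replacement has to be computed from each match, which a plain replacement string with $n substitutions cannot do.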

Interesting Patterns?

If you come across any frequently used RegEx patterns, I encourage you to share them with your fellow pattern builders. In the future, I hope to collect a repository for these patterns, which we will be able to share among all of the languages used in the .NET Framework. After all, C# is but one of the many languages you can use, and the RegEx classes can be used from them all.