Regular Expressions With C# and .NET 7
This article takes you step-by-step through creating a console app to explore regular expressions via some cool new .NET 7 features.
This article is an adapted excerpt from the book C# 11 and .NET 7 – Modern Cross-Platform Development Fundamentals. It takes you step by step through creating a console app to explore regular expressions using two features that are new in .NET 7: the [StringSyntax] attribute and source-generated regular expressions.
Pattern Matching With Regular Expressions
Regular expressions are useful for validating input from the user. They are very powerful and can get very complicated. Almost all programming languages have support for regular expressions and use a common set of special characters to define them. Let's try out some example regular expressions.
Use your preferred code editor to add a new Console App/console project named WorkingWithRegularExpressions to your solution/workspace.
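In Program.cs, delete the existing statements, and then import the following namespace. The code in this article calls Write, WriteLine, and ReadLine without a Console. prefix, so also statically import the Console class if your project does not already do so:
using System.Text.RegularExpressions; // Regex
using static System.Console; // Write, WriteLine, ReadLine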
Checking for Digits Entered as Text
We will start by implementing the common example of validating number input.
In Program.cs, add statements to prompt the user to enter their age and then check that it is valid using a regular expression that looks for a digit character, as shown in the following code:
Write("Enter your age: ");
string input = ReadLine()!; // null-forgiving
Regex ageChecker = new(@"\d");
if (ageChecker.IsMatch(input))
{
    WriteLine("Thank you!");
}
else
{
    WriteLine($"This is not a valid age: {input}");
}
Note the following about the code:
- The @ character switches off the ability to use escape characters in the string. Escape characters are prefixed with a backslash. For example, \t means a tab and \n means a new line. When writing regular expressions, we need to disable this feature. To paraphrase the television show The West Wing, "Let backslash be backslash."
- Once escape characters are disabled with @, then they can be interpreted by a regular expression. For example, \d means digit.
Run the code, enter a whole number such as 34 for the age, and view the result, as shown in the following output:
Enter your age: 34
Thank you!
Run the code again, enter carrots, and view the result, as shown in the following output:
Enter your age: carrots
This is not a valid age: carrots
Run the code again, enter bob30smith, and view the result, as shown in the following output:
Enter your age: bob30smith
Thank you!
The regular expression we used is \d, which means one digit. However, it does not specify what can be entered before and after that one digit. This regular expression could be described in English as "Enter any characters you want as long as you enter at least one digit character."
In regular expressions, you indicate the start of some input with the caret ^ symbol and the end of some input with the dollar $ symbol. Let's use these symbols to indicate that we expect nothing else between the start and end of the input except for a digit.
Change the regular expression to ^\d$, as shown in the following code:
Regex ageChecker = new(@"^\d$");
Run the code again and note that it rejects any input except a single digit. We want to allow one or more digits. To do this, we add a + after the \d expression to modify the meaning to one or more.
Change the regular expression, as shown in the following code:
Regex ageChecker = new(@"^\d+$");
Run the code again and note the regular expression only allows zero or positive whole numbers of any length.
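To compare the three patterns side by side, the following sketch runs each one against the same sample inputs. The helper name Check and the sample inputs are assumptions made for illustration; they are not part of the project built in this article.

```csharp
using System;
using System.Text.RegularExpressions;

// The three patterns used so far, from least strict to most strict.
string[] patterns = { @"\d", @"^\d$", @"^\d+$" };
string[] samples = { "34", "7", "carrots", "bob30smith" };

// Hypothetical helper: returns true if the input matches the pattern.
static bool Check(string pattern, string input) => Regex.IsMatch(input, pattern);

foreach (string pattern in patterns)
{
    foreach (string sample in samples)
    {
        // One row per combination: pattern, input, and whether it matched.
        Console.WriteLine($"{pattern,-6} {sample,-10} {Check(pattern, sample)}");
    }
}
```

With these inputs, \d accepts everything except carrots, ^\d$ accepts only 7, and ^\d+$ accepts 34 and 7.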
Regular Expression Performance Improvements
The .NET types for working with regular expressions are used throughout the .NET platform and many of the apps built with it. As such, they have a significant impact on performance, but until now, they have not received much optimization attention from Microsoft.
With .NET 5 and later, the System.Text.RegularExpressions namespace has rewritten internals to squeeze out maximum performance. Common regular expression benchmarks using methods like IsMatch are now up to five times faster. And the best thing is, you do not have to change your code to get the benefits!
With .NET 7 and later, the IsMatch method of the Regex class now has an overload for a ReadOnlySpan<char> as its input, which gives even better performance.
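As a minimal sketch of the span overload, the following code validates a slice of a larger string without first copying that slice into a new string. The record format and variable names are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

Regex digitsOnly = new(@"^\d+$");

// A larger string from which we want to validate just one field.
string record = "age=34;name=bob";

// Slice out the two characters "34" as a span; the slice itself allocates no new string.
ReadOnlySpan<char> ageField = record.AsSpan(4, 2);

// .NET 7 or later: IsMatch accepts a ReadOnlySpan<char> directly.
bool valid = digitsOnly.IsMatch(ageField);
Console.WriteLine($"Valid age: {valid}"); // prints "Valid age: True"
```

The benefit is most likely to show up when you validate many small pieces of a large buffer, because each Substring call you avoid is one less allocation.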
Splitting a Complex Comma-Separated String
Let's consider how we would split a complex string, like the following example of film titles:
"Monsters, Inc.","I, Tonya","Lock, Stock and Two Smoking Barrels"
The string value uses double quotes around each film title. We can use these to identify whether we need to split on a comma (or not). The Split method is not powerful enough, so we can use a regular expression instead.
You can read a fuller explanation in this Stack Overflow article that inspired this task.
To include double quotes inside a string value, we prefix them with a backslash or, with C# 11 or later, we can use a raw string literal.
Add statements to store a complex comma-separated string variable, and then split it in a dumb way using the Split method, as shown in the following code:
// C# 1 to 10: Use escaped double-quote characters \"
// string films = "\"Monsters, Inc.\",\"I, Tonya\",\"Lock, Stock and Two Smoking Barrels\"";
// C# 11 or later: Use """ to start and end a raw string literal
string films = """
"Monsters, Inc.","I, Tonya","Lock, Stock and Two Smoking Barrels"
""";
WriteLine($"Films to split: {films}");
string[] filmsDumb = films.Split(',');
WriteLine("Splitting with string.Split method:");
foreach (string film in filmsDumb)
{
    WriteLine(film);
}
Add statements to define a regular expression to split and write the film titles in a smart way, as shown in the following code:
Regex csv = new(
    "(?:^|,)(?=[^\"]|(\")?)\"?((?(1)[^\"]*|[^,\"]*))\"?(?=,|$)");
MatchCollection filmsSmart = csv.Matches(films);
WriteLine("Splitting with regular expression:");
foreach (Match film in filmsSmart)
{
    WriteLine(film.Groups[2].Value);
}
In the last section, you will see how you can get a source generator to auto-generate XML comments for a regular expression to explain how it works. This is really useful for regular expressions that you might have copied from a website.
Run the code and view the result, as shown in the following output:
Splitting with string.Split method:
"Monsters
Inc."
"I
Tonya"
"Lock
Stock and Two Smoking Barrels"
Splitting with regular expression:
Monsters, Inc.
I, Tonya
Lock, Stock and Two Smoking Barrels
Activating Regular Expression Syntax Coloring
If you use Visual Studio 2022 as your code editor, then you probably noticed that when passing a string value to the Regex constructor, you see color syntax highlighting, as shown below:
Regular expression color syntax highlighting when using the Regex constructor
Why does this string get syntax coloring for regular expressions when most string values do not? Let's find out.
Right-click in the new constructor, select Go To Implementation, and note the string parameter named pattern is decorated with an attribute named StringSyntax that has the string constant Regex value passed to it, as shown in the following code:
public Regex([StringSyntax(StringSyntaxAttribute.Regex)] string pattern) :
    this(pattern, culture: null)
{
}
Right-click in the StringSyntax attribute, select Go To Implementation, and note there are 12 recognized string syntax formats that you can choose from, including Regex, as shown in the following partial code:
[AttributeUsage(AttributeTargets.Property | AttributeTargets.Field | AttributeTargets.Parameter, AllowMultiple = false, Inherited = false)]
public sealed class StringSyntaxAttribute : Attribute
{
    public const string CompositeFormat = "CompositeFormat";
    public const string DateOnlyFormat = "DateOnlyFormat";
    public const string DateTimeFormat = "DateTimeFormat";
    public const string EnumFormat = "EnumFormat";
    public const string GuidFormat = "GuidFormat";
    public const string Json = "Json";
    public const string NumericFormat = "NumericFormat";
    public const string Regex = "Regex";
    public const string TimeOnlyFormat = "TimeOnlyFormat";
    public const string TimeSpanFormat = "TimeSpanFormat";
    public const string Uri = "Uri";
    public const string Xml = "Xml";
    …
}
In the WorkingWithRegularExpressions project, add a new class file named Program.Strings.cs, and modify its content to define some string constants, as shown in the following code:
partial class Program
{
    const string digitsOnlyText = @"^\d+$";
    const string commaSeparatorText =
        "(?:^|,)(?=[^\"]|(\")?)\"?((?(1)[^\"]*|[^,\"]*))\"?(?=,|$)";
}
Note that the two string constants do not have any color syntax highlighting yet.
In Program.cs, replace the literal string with the string constant for the digits-only regular expression, as shown in the following code:
Regex ageChecker = new(digitsOnlyText);
In Program.cs, replace the literal string with the string constant for the comma separator regular expression, as shown in the following code:
Regex csv = new(commaSeparatorText);
Run the console app and confirm that the regular expression behavior is as before.
In Program.Strings.cs, import the namespace for the [StringSyntax] attribute and then decorate both string constants with it, as shown in the following code:
using System.Diagnostics.CodeAnalysis; // [StringSyntax]
partial class Program
{
    [StringSyntax(StringSyntaxAttribute.Regex)]
    const string digitsOnlyText = @"^\d+$";
    [StringSyntax(StringSyntaxAttribute.Regex)]
    const string commaSeparatorText =
        "(?:^|,)(?=[^\"]|(\")?)\"?((?(1)[^\"]*|[^,\"]*))\"?(?=,|$)";
}
In Program.Strings.cs, add another string constant for formatting a date, as shown in the following code:
[StringSyntax(StringSyntaxAttribute.DateTimeFormat)]
const string fullDateTime = "";
Click inside the empty string, type a letter d, and note the IntelliSense, as shown below:
Finish entering the date format, and as you type, note the IntelliSense: dddd, d MMMM yyyy.
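To see what that format string produces at run time, here is a small sketch that applies it to a fixed date. The sample date and the use of the invariant culture are assumptions chosen so that the output is repeatable on any machine.

```csharp
using System;
using System.Globalization;

// The same custom format string that is entered into fullDateTime above.
const string fullDateTime = "dddd, d MMMM yyyy";

// Assumed sample date; the invariant culture keeps day and month names in English.
DateTime sampleDate = new(2022, 11, 8);
string formatted = sampleDate.ToString(fullDateTime, CultureInfo.InvariantCulture);

Console.WriteLine(formatted); // prints "Tuesday, 8 November 2022"
```

In the format string, dddd is the full day name, d is the day of the month without a leading zero, MMMM is the full month name, and yyyy is the four-digit year.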
At the end of the digitsOnlyText string literal, add a \, and note the IntelliSense to help you write a valid regular expression, as shown below:
IntelliSense for writing a regular expression
The [StringSyntax] attribute is a new feature introduced in .NET 7. It is up to your code editor to recognize it. .NET 7 libraries have more than 350 parameters, properties, and fields that are now decorated with this attribute.
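You can decorate your own parameters in the same way. The following sketch assumes a hypothetical helper named IsValid; an editor that recognizes the attribute can then offer regular expression coloring and IntelliSense for string literals passed to its pattern parameter.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Text.RegularExpressions;

// Hypothetical helper: the attribute tells the editor that "pattern" holds a
// regular expression. It has no effect on how the code behaves at run time.
static bool IsValid(string input,
    [StringSyntax(StringSyntaxAttribute.Regex)] string pattern)
    => Regex.IsMatch(input, pattern);

Console.WriteLine(IsValid("2023", @"^\d+$")); // prints "True"
Console.WriteLine(IsValid("20x3", @"^\d+$")); // prints "False"
```

The attribute is only a hint to tooling, so the same code compiles and runs identically in an editor that ignores it.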
Improving Regular Expression Performance With Source Generators
When you pass a string literal or string constant to the constructor of Regex, the class parses the string and transforms it into an internal tree structure that represents the expression in an optimized way that can be executed efficiently by a regular expression interpreter.
You can also compile regular expressions by specifying the RegexOptions.Compiled option, as shown in the following code:
Regex ageChecker = new(digitsOnlyText, RegexOptions.Compiled);
Unfortunately, compiling has the negative effect of slowing down the initial creation of the regular expression. After creating the tree structure that would then be executed by the interpreter, the compiler then has to convert the tree into IL code, and then that IL code needs to be JIT compiled into native code. If you only run the regular expression a few times, it is not worth compiling it, which is why it is not the default behavior.
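One rough way to observe this trade-off is to time the construction of each kind of Regex, as in the sketch below. The variable names are assumptions, and the timings are only indicative: they vary by machine, and the first construction also pays one-time startup costs for the regular expression library itself.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

const string pattern = @"^\d+$";

// Time the construction of an interpreted regular expression.
Stopwatch timer = Stopwatch.StartNew();
Regex interpreted = new(pattern);
TimeSpan interpretedSetup = timer.Elapsed;

// Time the construction of a compiled one, which also emits IL.
timer.Restart();
Regex compiled = new(pattern, RegexOptions.Compiled);
TimeSpan compiledSetup = timer.Elapsed;

// Both objects give the same answers; only setup and per-match cost differ.
bool interpretedResult = interpreted.IsMatch("34");
bool compiledResult = compiled.IsMatch("34");

Console.WriteLine($"Interpreted: {interpretedResult}, setup {interpretedSetup.TotalMilliseconds} ms");
Console.WriteLine($"Compiled: {compiledResult}, setup {compiledSetup.TotalMilliseconds} ms");
```

For a trustworthy comparison you would want a proper benchmarking tool and many iterations; this sketch only shows where the extra setup cost is paid.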
.NET 7 introduces a source generator for regular expressions which recognizes when you decorate a partial method that returns Regex with the [GeneratedRegex] attribute. It generates an implementation of that method which implements the logic for the regular expression.
Let's see it in action.
In the WorkingWithRegularExpressions project, add a new class file named Program.Regexs.cs, and modify its content to define some partial methods, as shown in the following code:
using System.Text.RegularExpressions; // [GeneratedRegex]
partial class Program
{
    [GeneratedRegex(digitsOnlyText, RegexOptions.IgnoreCase)]
    private static partial Regex DigitsOnly();
    [GeneratedRegex(commaSeparatorText, RegexOptions.IgnoreCase)]
    private static partial Regex CommaSeparator();
}
In Program.cs, replace the new constructor with a call to the partial method that returns the digits-only regular expression, as shown in the following code:
Regex ageChecker = DigitsOnly();
In Program.cs, replace the new constructor with a call to the partial method that returns the comma separator regular expression, as shown in the following code:
Regex csv = CommaSeparator();
Hover your mouse pointer over the partial methods and note that the tooltip describes the behavior of the regular expression, as shown below:
Tooltip for a partial method shows a description of the regular expression
Right-click the DigitsOnly partial method, select Go To Definition, and note that you can review the implementation of the auto-generated partial methods, as shown below:
The auto-generated source code for the regular expression
Run the console app and confirm that the functionality is the same as before.
You can learn more about the improvements to regular expressions with .NET 7.
Published at DZone with permission of Mark Price. See the original article here.