I came across a particular problem that seem trivial at first, but ended up being trickier than I thought. What I wanted to do is split a string based on a delimiter value, but only when the delimiter value is not found inside a quoted string. In my particular case (simplified for discussion), I had two SQL queries concatenated by the ";" delimiter as follows:
SELECT "Jane;Sam;Julie;Homer's \"Salad\"" as f1;SELECT 'This is "a test". \\Part2\\' as f2;
I wanted to split the strings into:
SELECT "Jane;Sam;Julie;Homer's \"Salad\"" as f1
and
SELECT 'This is "a test". \\Part2\\' as f2
Note, that in each query, the quoted string is either escaped by single or double quotes, and that the string fields in the query themselves contain the delimiter within them. Using a simple String.Split() function will not yield the correct results. Writing a custom function to perform this split would probably work, but would probably be error prone, and ugly. So regular expressions seemed to be the answer. After spending some time doing some google searches, I couldn't find any example of a good way to perform my split using Regex, as much of the examples I found were about matching quoted strings (with escapes).
The trick to get this to work is to use a combination of Grouping Constructs, Back References, Zero Width Assertions, Conditionals, and Balanced Group Definitions in the regular expression pattern. Balanced Group definitions is a Microsoft innovation and is implemented in .NET Regex, so this is not a very common feature but is the key to get this to properly work.
Let's just layout the solution, then analyze it later. Here is the C# code that implements a SplitSql() function and splits the sample input above:
/////////////////////////////////////////////////////////////////////////////// // // Copyright 2008. Tom Kaminski. // // This program is free software; you can redistribute it and/or modify it under the terms // of the GNU General Public License as published by the Free Software Foundation; either // version 2 of the License, or (at your option) any later version. // These programs are distributed in the hope that it will be useful, but WITHOUT ANY // WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS // FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. using System; using System.Collections.Generic; using System.Text; using System.Text.RegularExpressions; namespace SplitSql { class Program { public static string[] SplitSql(string multi_sql) { Regex re = new Regex(@"(?>(?(IQ)(?(ESC).(?<-ESC>)| \\(?<ESC>))|(?!))|(?(IQ)\k<QUOTE>(?<-IQ>) |(?<QUOTE>[""'])(?<IQ>))|(?(IQ).|[^;]))+",RegexOptions.IgnorePatternWhitespace); List<string> sqls = new List<string>(); Match m = re.Match(multi_sql); while (m.Success) { string result = m.Groups[0].ToString().Trim(); if (result.Length > 0) sqls.Add(result); m = m.NextMatch(); } return sqls.ToArray(); } static void Main(string[] args) { string sql = @"SELECT ""Jane;Sam;Julie;Homer's \""Salad\""\"" as f1; SELECT 'This is ""a test"". \\Part2\\' as f2;"; string[] ss = SplitSql(sql); } } }
Most of the program is trivial except for the regular expression pattern, which is expanded, commented here:
Regex re = new Regex( @" (?> " + // Specify non-backtracking subexpression (only match sub-expression) // HANDLE ESCAPED CHARACTERS @" (?(IQ) " + // if (IQ {In Quote}) then @" (?(ESC) " + // if(ESCAPE MODE) then @" .(?<-ESC>) " + // Consume the next character(it is escaped!) @" | " + // else @" \\(?<ESC>))|(?!)) " + // Not in escape mode...but if \ is detected, enable ESCAPE MODE // otherwise, zero width assert: (?!) never matches - do nothing @" | " + // OR // DETECT IF IN QUOTE, OR NOT @" (?(IQ) " + // if (IQ {In Quote} ) then @" \k<QUOTE>(?<-IQ>) " + // If captured quote character matched, then Pop the IQ named group. // No longer in quote. We don't have to worry about preceeding \ as // escape characters are already handled in escape section!! @" | " + // else @" (?<QUOTE>[""'])(?<IQ>)) " + // Try to match the begin quote character. The begin quote character // is assigned to the named group <QUOTE> for referencing later. If // matched, then indicate that we in a quote. Push IQ. @" | " + // OR // CONSUME REGULAR CHARACTERS @" (?(IQ) " + // if (IQ {In Quote} ) then @" . " + // Any character is allowed when we are in the quote. @" | " + // else @" [^;]) " + // Any non-delimiting character is allowed. @" )+", // Allow the subexpression to be matched one more more times // Will terminate if end of string, or delimit character is found when // not in a quote. RegexOptions.IgnorePatternWhitespace);
The above regular expressions demonstrates how to change modes in the regular expression pattern and conditionally match based on what was detected before. This enables balanced groups (start tags/end tags etc), and escape handling for example. In theory, these concepts and techniques can be used to build a robust parser.
Leave me a comment if you find this code and explanation useful!!!
Comments
nice post
thanks for such an informative material and it is productive in nature
SIMRAN