NOTE: Though the solution below can likely be adapted for other regex engines, using it as-is will require that your regex engine treats multiple named capture groups using the same name as one single capture group. (.NET does this by default)
###About the pattern
When one or more lines/records of a CSV file/stream (matching RFC standard 4180) are passed to the regular expression below it will return a match for each non-empty line/record. Each match will contain a capture group named Value
that contains the captured values in that line/record (and potentially an OpenValue
capture group if there was an open quote at the end of the line/record).
Here's the commented pattern (test it on Regexstorm.net):
(?<=
|
|^)(?!
|
|$) // Records start at the beginning of line (line must not be empty)
(?: // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
(?: // Group for matching one of the value formats before a comma or EOL
"(?<Value>(?:[^"]|"")*)"| // Quoted value -or-
(?<Value>(?!")[^,
]+)| // Unquoted value -or-
"(?<OpenValue>(?:[^"]|"")*)(?=
|
|$)| // Open ended quoted value -or-
(?<Value>) // Empty value before comma (before EOL is excluded by "+?" quantifier later)
)
(?:,|(?=
|
|$)) // The value format matched must be followed by a comma or EOL
)+? // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))? // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:
|
|
|$) // Records end at EOL
Here's the raw pattern without all the comments or whitespace.
(?<=
|
|^)(?!
|
|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,
]+)|"(?<OpenValue>(?:[^"]|"")*)(?=
|
|$)|(?<Value>))(?:,|(?=
|
|$)))+?(?:(?<=,)(?<Value>))?(?:
|
|
|$)
[Here is a visualization from Debuggex.com][3] (capture groups named for clarity):
![Debuggex.com visualization][4]
###Usage examples:
Simple example for reading an entire CSV file/stream at once (test it on C# Pad):
(For better performance and less impact on system resources you should use the second example)
using System.Text.RegularExpressions;
Regex CSVParser = new Regex(
@"(?<=
|
|^)(?!
|
|$)" +
@"(?:" +
@"(?:" +
@"""(?<Value>(?:[^""]|"""")*)""|" +
@"(?<Value>(?!"")[^,
]+)|" +
@"""(?<OpenValue>(?:[^""]|"""")*)(?=
|
|$)|" +
@"(?<Value>)" +
@")" +
@"(?:,|(?=
|
|$))" +
@")+?" +
@"(?:(?<=,)(?<Value>))?" +
@"(?:
|
|
|$)",
RegexOptions.Compiled);
String CSVSample =
",record1 value2,val3,"value 4","testing ""embedded double quotes"""," +
""testing quoted "","" character", value 7,,value 9," +
""testing empty """" embedded quotes"," +
""testing a quoted value" + Environment.NewLine +
Environment.NewLine +
"that includes CR/LF patterns" + Environment.NewLine +
Environment.NewLine +
"(which we wish would never happen - but it does)", after CR/LF" + Environment.NewLine +
Environment.NewLine +
""testing an open ended quoted value" + Environment.NewLine +
Environment.NewLine +
",value 2 ,value 3," + Environment.NewLine +
""test"";
MatchCollection CSVRecords = CSVParser.Matches(CSVSample);
for (Int32 recordIndex = 0; recordIndex < CSVRecords.Count; recordIndex++)
{
Match Record = CSVRecords[recordIndex];
for (Int32 valueIndex = 0; valueIndex < Record.Groups["Value"].Captures.Count; valueIndex++)
{
Capture c = Record.Groups["Value"].Captures[valueIndex];
Console.Write("R" + (recordIndex + 1) + ":V" + (valueIndex + 1) + " = ");
if (c.Length == 0 || c.Index == Record.Index || Record.Value[c.Index - Record.Index - 1] != '"')
{
// No need to unescape/undouble quotes if the value is empty, the value starts
// at the beginning of the record, or the character before the value is not a
// quote (not a quoted value)
Console.WriteLine(c.Value);
}
else
{
// The character preceding this value is a quote
// so we need to unescape/undouble any embedded quotes
Console.WriteLine(c.Value.Replace("""", """));
}
}
foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
Console.WriteLine("ERROR - Open ended quoted value: " + OpenValue.Value);
}
Better example for reading a large CSV file/stream without reading the entire file/stream into a string (test it [on C# Pad][6]).
using System.IO;
using System.Text.RegularExpressions;
// Same regex from before shortened to one line for brevity
Regex CSVParser = new Regex(
@"(?<=
|
|^)(?!
|
|$)(?:(?:""(?<Value>(?:[^""]|"""")*)""|(?<Value>(?!"")[^,
]+)|""(?<OpenValue>(?:[^""]|"""")*)(?=
|
|$)|(?<Value>))(?:,|(?=
|
|$)))+?(?:(?<=,)(?<Value>))?(?:
|
|
|$)",
RegexOptions.Compiled);
String CSVSample = ",record1 value2,val3,"value 4","testing ""embedded double quotes""","testing quoted "","" character", value 7,,value 9,"testing empty """" embedded quotes","testing a quoted value," +
Environment.NewLine + Environment.NewLine + "that includes CR/LF patterns" + Environment.NewLine + Environment.NewLine + "(which we wish would never happen - but it does)", after CR/LF," + Environment.NewLine + Environment
.NewLine + ""testing an open ended quoted value" + Environment.NewLine + Environment.NewLine + ",value 2 ,value 3," + Environment.NewLine + ""test"";
using (StringReader CSVReader = new StringReader(CSVSample))
{
String CSVLine = CSVReader.ReadLine();
StringBuilder RecordText = new StringBuilder();
Int32 RecordNum = 0;
while (CSVLine != null)
{
RecordText.AppendLine(CSVLine);
MatchCollection RecordsRead = CSVParser.Matches(RecordText.ToString());
Match Record = null;
for (Int32 recordIndex = 0; recordIndex < RecordsRead.Count; recordIndex++)
{
Record = RecordsRead[recordIndex];
if (Record.Groups["OpenValue"].Success && recordIndex == RecordsRead.Count - 1)
{
// We're still trying to find the end of a muti-line value in this record
// and it's the last of the records from this segment of the CSV.
// If we're not still working with the initial record we started with then
// prep the record text for the next read and break out to the read loop.
if (recordIndex != 0)
RecordText.AppendLine(Record.Value);
break;
}
// Valid record found or new record started before the end could be found
RecordText.Clear();
RecordNum++;
for (Int32 valueIndex = 0; valueIndex < Record.Groups["Value"].Captures.Count; valueIndex++)
{
Capture c = Record.Groups["Value"].Captures[valueIndex];
Console.Write("R" + RecordNum + ":V" + (valueIndex + 1) + " = ");
if (c.Length == 0 || c.Index == Record.Index || Record.Value[c.Index - Record.Index - 1] != '"')
Console.WriteLine(c.Value);
else
Console.WriteLine(c.Value.Replace("""", """));
}
foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
Console.WriteLine("R" + RecordNum + ":ERROR - Open ended quoted value: " + OpenValue.Value);
}
CSVLine = CSVReader.ReadLine();
if (CSVLine == null && Record != null)
{
RecordNum++;
//End of file - still working on an open value?
foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
Console.WriteLine("R" + RecordNum + ":ERROR - Open ended quoted value: " + OpenValue.Value);
}
}
}
Both examples return the same result of:
R1:V1 =
R1:V2 = record1 value2
R1:V3 = val3
R1:V4 = value 4
R1:V5 = testing "embedded double quotes"
R1:V6 = testing quoted "," character
R1:V7 = value 7
R1:V8 =
R1:V9 = value 9
R1:V10 = testing empty "" embedded quotes
R1:V11 = testing a quoted value
<b