A colleague and I have recently argued over whether a pure regex is capable of fully encapsulating the csv format, such that it is capable of parsing all files with any given escape char, quote char, and separator char.
The regex need not be capable of changing these chars after creation, but it must not fail on any other edge case.
I have argued that this is impossible for just a tokenizer. The only regex that might be able to do this is a very complex PCRE style that moves beyond just tokenizing.
I am looking for something along the lines of:
… the csv format is a context free grammar and as such, it is
impossible to parse with regex alone …
Or am I wrong? Is it possible to parse csv with just a POSIX regex?
For example, if both the escape char and the quote char are "
, then these two lines are valid csv:
"""this is a test.""",""
"and he said,""What will be, will be."", to which I replied, ""Surely not!""","moving on to the next field here..."
6
Nice in theory, terrible in practice
By CSV I’m going to assume you mean the convention as described in RFC 4180.
While matching basic CSV data is trivial:
"data", "more data"
Note: BTW, it’s a lot more efficient to use a .split(‘/n’).split(‘”‘) function for very simple and well-structured data like this. Regular Expressions work as a NDFSM (Non-Deterministic Finite State Machine) that wastes a lot of time backtracking once you start adding edge cases like escape chars.
For example here’s the most comprehensive regular expression matching string I’ve found:
re_valid = r"""
# Validate a CSV string having single, double or un-quoted values.
^ # Anchor to start of string.
s* # Allow whitespace before value.
(?: # Group for value alternatives.
'[^'\]*(?:\[Ss][^'\]*)*' # Either Single quoted string,
| "[^"\]*(?:\[Ss][^"\]*)*" # or Double quoted string,
| [^,'"s\]*(?:s+[^,'"s\]+)* # or Non-comma, non-quote stuff.
) # End group of value alternatives.
s* # Allow whitespace after value.
(?: # Zero or more additional values
, # Values separated by a comma.
s* # Allow whitespace before value.
(?: # Group for value alternatives.
'[^'\]*(?:\[Ss][^'\]*)*' # Either Single quoted string,
| "[^"\]*(?:\[Ss][^"\]*)*" # or Double quoted string,
| [^,'"s\]*(?:s+[^,'"s\]+)* # or Non-comma, non-quote stuff.
) # End group of value alternatives.
s* # Allow whitespace after value.
)* # Zero or more additional values
$ # Anchor to end of string.
"""
It reasonably handles single and double quoted values, but not newlines in values, escaped quotes, etc.
Source: Stack Overflow – How can I parse a string with JavaScript
It’s becomes a nightmare once the common edge-cases are introduced like…
"such as ""escaped""","data"
"values that contain /n newline chars",""
"escaped, commas, like",",these"
"un-delimited data like", this
"","empty values"
"empty trailing values", // <- this is completely valid
// <- trailing newline, may or may not be included
The newline-as-value edge case alone is enough to break 99.9999% of the RegEx based parsers found in the wild. The only ‘reasonable’ alternative is to use RegEx matching for basic control/non-control character (ie terminal vs non-terminal) tokenization paired with a state machine used for higher level analysis.
Source: Experience otherwise known as extensive pain and suffering.
I am the author of jquery-CSV, the only javascript based, fully RFC-compliant, CSV parser in the world. I have spent months tackling this problem, speaking with many intelligent people, and trying a ton if different implementations including 3 full rewrites of the core parser engine.
tl;dr – Moral of the story, PCRE alone sucks for parsing anything but the most simple and strict regular (Ie Type-III) grammars. Albeit, it’s useful for tokenizing terminal and non-terminal strings.
2
Regex can parse any regular language, and cannot parse fancy things like recursive grammars. But CSV seems to be pretty regular, so parseable with a regex.
Let’s work from definition: allowed are sequence, choice form alternatives (|
), and repetition (Kleene star, the *
).
- An unquoted value is regular:
[^,]*
# any char but comma - A quoted value is regular:
"([^"]|\\|\")*"
# sequence of anything but quote"
or escaped quote"
or escaped escape\
- Some forms may include escaping quotes with quotes, which adds a variant
("")*"
to the expression above.
- Some forms may include escaping quotes with quotes, which adds a variant
- An allowed value is regular: <unquoted-value>
|
<quoted-value> - A single CSV line is regular: <value>
(,
<value>)*
- A sequence of lines separated by
n
is also obviously regular.
I did not meticulously test each of these expressions, and never defined catch groups. I also glossed over some technicalities, like the variants of characters which can be used instead of ,
, "
, or line separators: these do not break the regularity, you just get several slightly different languages.
If you can spot a problem in this proof, please comment! 🙂
But despite this, practical parsing of CSV files by pure regular expressions may be problematic. You need to know which of the variants is being fed to the parser, and there’s no standard for it. You can try several parsers against each line until one succeeds, or somehow divinate the format form comments. But this may require means other than regular expressions to do efficiently, or at all.
17
Simple answer – probably not.
The first problem is a lack of a standard. While one may describe their csv in a way that is strictly defined, one cannot expect to get strictly defined csv files. “Be conservative in what you do, be liberal in what you accept from others” -Jon Postal
Assuming that one does have a standardesque that is acceptable, there is the question of escape characters and if these need to be balanced.
A string in many csv formats is defined as string value 1,string value 2
. However, if that string contains a comma it is now "string, value 1",string value 2
. If it contains a quote it becomes "string, ""value 1""",string value 2
.
At this point I believe it is impossible. The problem being you need to determine how many quotes you have read and if a comma is inside or outside of the double quoted mode of the value. Balancing parentheses is an impossible regex problem. Some extended regular expression engines (PCRE) can deal with it, but it isn’t a regular expression then.
You might find https://stackoverflow.com/questions/8629763/csv-parsing-with-a-context-free-grammar useful.
Amended:
I have been looking at formats for escape characters and haven’t found any that need arbitrary counting – so that is probably not the issue.
However, there are issues of what is the escape character and record delimiter (to start with). http://www.csvreader.com/csv_format.php is a good read on the different formats in the wild.
- The rules for the quoted string (if it is a single quoted string or a double quoted string) differ.
'This, is a value'
vs"This, is a value"
- The rules for escape characters
"This ""is a value"""
vs"This "is a value""
- The handling of embedded record delimiter ({rd})
- (raw embeded)
"This {rd}is a value"
vs (escaped)"This {rd}is a value"
vs (translated)"This {0x1C}is a value"
- (raw embeded)
The key thing here is that it is possible to have a string that will always have multiple valid interpretations.
The related question (for edge cases) “is it possible to have a invalid string that is accepted?”
I still strongly doubt that there is a regular expression that can match every valid CSV that is created by some application and reject every csv that cannot be parsed.
2
This regexp can tokenize normal CSV, as described in the RFC:
/("(?:[^"]|"")*"|[^,"nr]*)(,|r?n|r)/
Explanation:
("(?:[^"]|"")*"|[^,"nr]*)
– a CSV field, quoted or not"(?:[^"]|"")*"
– a quoted field;[^"]|""
– each character is either not"
, or"
escaped as""
[^,"nr]*
– an unquoted field, which may not contain,
"
n
r
(,|r?n|r)
– the following separator, either,
or a newliner?n|r
– a newline, one ofrn
n
r
An entire CSV file can be matched and validated by using this regexp repeatedly. It is then necessary to fix the quoted fields, and split it into rows based on the separators.
Here is code for a CSV parser in Javascript, based on the regexp:
var csv_tokens_rx = /("(?:[^"]|"")*"|[^,"nr]*)(,|r?n|r)/y;
var csv_unescape_quote_rx = /""/g;
function csv_parse(s) {
if (s && s.slice(-1) != 'n')
s += 'n';
var ok;
var rows = [];
var row = [];
csv_tokens_rx.lastIndex = 0;
while (true) {
ok = csv_tokens_rx.lastIndex == s.length;
var m = s.match(csv_tokens_rx);
if (!m)
break;
var v = m[1], d = m[2];
if (v[0] == '"') {
v = v.slice(1, -1);
v = v.replace(csv_unescape_quote_rx, '"');
}
if (d == ',' || v)
row.push(v);
if (d != ',') {
rows.push(row)
row = [];
}
}
return ok ? rows : null;
}
Whether this answer helps to settle your argument is for you to decide; I am just happy to have a small, simple and correct CSV parser.
In my opinion a lex
program is more or a less a large regular expression, and those can tokenize much more complex formats, such as the C programming language.
With reference to the RFC 4180 definitions:
- line break (CRLF) – The regexp is more flexible, allowing CRLF, LF or CR.
- The last record in the file may or may not have an ending line break – The regexp as it is requires a final line break, but the parser adjusts for that.
- There maybe an optional header line – This does not impact the parser.
- Each line should contain the same number of fields throughout the file – not enforced
Spaces are considered part of a field and should not be ignored – okay
The last field in the record must not be followed by a comma – not enforced - Each field may or may not be enclosed in double quotes … – okay
- Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes – okay
- a double-quote appearing inside a field must be escaped by preceding it with another double quote – okay
The regexp itself satisfies most of the RFC 4180 requirements. I don’t agree with the others, but it is easy to adjust the parser to implement them.
3
First define the grammar for your CSV (are the field delimiters escaped or encoded somehow if they appear in text?) and then it can be determined if it is parsable with regex. Grammar first: parser second: http://www.boyet.com/articles/csvparser.html It should be noted that this method uses a tokenizer – but I can’t constuct a POSIX regex that would match all edge cases. If your usage of CSV formats is non-regular and context free … then your answer is in your question. Good overview here: http://nikic.github.com/2012/06/15/The-true-power-of-regular-expressions.html