Can the csv format be defined by a regex?

A colleague and I have recently argued over whether a pure regex is capable of fully encapsulating the csv format, such that it is capable of parsing all files with any given escape char, quote char, and separator char.

The regex need not be capable of changing these chars after creation, but it must not fail on any other edge case.

I have argued that this is impossible for just a tokenizer. The only regex that might be able to do this is a very complex PCRE style that moves beyond just tokenizing.

I am looking for something along the lines of:

… the csv format is a context free grammar and as such, it is
impossible to parse with regex alone …

Or am I wrong? Is it possible to parse csv with just a POSIX regex?

For example, if both the escape char and the quote char are ", then these two lines are valid csv:

"""this is a test.""",""
"and he said,""What will be, will be."", to which I replied, ""Surely not!""","moving on to the next field here..."

Nice in theory, terrible in practice

By CSV I’m going to assume you mean the convention as described in RFC 4180.

While matching basic CSV data is trivial:

"data", "more data"

Note: BTW, it’s a lot more efficient to use a .split(‘/n’).split(‘”‘) function for very simple and well-structured data like this. Regular Expressions work as a NDFSM (Non-Deterministic Finite State Machine) that wastes a lot of time backtracking once you start adding edge cases like escape chars.

For example here’s the most comprehensive regular expression matching string I’ve found:

re_valid = r"""
# Validate a CSV string having single, double or un-quoted values.
^                                   # Anchor to start of string.
s*                                 # Allow whitespace before value.
(?:                                 # Group for value alternatives.
  '[^'\]*(?:\[Ss][^'\]*)*'     # Either Single quoted string,
| "[^"\]*(?:\[Ss][^"\]*)*"     # or Double quoted string,
| [^,'"s\]*(?:s+[^,'"s\]+)*    # or Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
s*                                 # Allow whitespace after value.
(?:                                 # Zero or more additional values
  ,                                 # Values separated by a comma.
  s*                               # Allow whitespace before value.
  (?:                               # Group for value alternatives.
    '[^'\]*(?:\[Ss][^'\]*)*'   # Either Single quoted string,
  | "[^"\]*(?:\[Ss][^"\]*)*"   # or Double quoted string,
  | [^,'"s\]*(?:s+[^,'"s\]+)*  # or Non-comma, non-quote stuff.
  )                                 # End group of value alternatives.
  s*                               # Allow whitespace after value.
)*                                  # Zero or more additional values
$                                   # Anchor to end of string.
"""

It reasonably handles single and double quoted values, but not newlines in values, escaped quotes, etc.

Source: Stack Overflow – How can I parse a string with JavaScript

It’s becomes a nightmare once the common edge-cases are introduced like…

"such as ""escaped""","data"
"values that contain /n newline chars",""
"escaped, commas, like",",these"
"un-delimited data like", this
"","empty values"
"empty trailing values",        // <- this is completely valid
                                // <- trailing newline, may or may not be included

The newline-as-value edge case alone is enough to break 99.9999% of the RegEx based parsers found in the wild. The only ‘reasonable’ alternative is to use RegEx matching for basic control/non-control character (ie terminal vs non-terminal) tokenization paired with a state machine used for higher level analysis.

Source: Experience otherwise known as extensive pain and suffering.

I am the author of jquery-CSV, the only javascript based, fully RFC-compliant, CSV parser in the world. I have spent months tackling this problem, speaking with many intelligent people, and trying a ton if different implementations including 3 full rewrites of the core parser engine.

tl;dr – Moral of the story, PCRE alone sucks for parsing anything but the most simple and strict regular (Ie Type-III) grammars. Albeit, it’s useful for tokenizing terminal and non-terminal strings.

Regex can parse any regular language, and cannot parse fancy things like recursive grammars. But CSV seems to be pretty regular, so parseable with a regex.

Let’s work from definition: allowed are sequence, choice form alternatives (|), and repetition (Kleene star, the *).

An unquoted value is regular: [^,]* # any char but comma
A quoted value is regular: "([^"]|\\|\")*" # sequence of anything but quote " or escaped quote " or escaped escape \
- Some forms may include escaping quotes with quotes, which adds a variant ("")*" to the expression above.
An allowed value is regular: <unquoted-value>|<quoted-value>
A single CSV line is regular: <value>(,<value>)*
A sequence of lines separated by n is also obviously regular.

I did not meticulously test each of these expressions, and never defined catch groups. I also glossed over some technicalities, like the variants of characters which can be used instead of ,, ", or line separators: these do not break the regularity, you just get several slightly different languages.

If you can spot a problem in this proof, please comment! 🙂

But despite this, practical parsing of CSV files by pure regular expressions may be problematic. You need to know which of the variants is being fed to the parser, and there’s no standard for it. You can try several parsers against each line until one succeeds, or somehow divinate the format form comments. But this may require means other than regular expressions to do efficiently, or at all.

Simple answer – probably not.

The first problem is a lack of a standard. While one may describe their csv in a way that is strictly defined, one cannot expect to get strictly defined csv files. “Be conservative in what you do, be liberal in what you accept from others” -Jon Postal

Assuming that one does have a standardesque that is acceptable, there is the question of escape characters and if these need to be balanced.

A string in many csv formats is defined as string value 1,string value 2. However, if that string contains a comma it is now "string, value 1",string value 2. If it contains a quote it becomes "string, ""value 1""",string value 2.

At this point I believe it is impossible. The problem being you need to determine how many quotes you have read and if a comma is inside or outside of the double quoted mode of the value. Balancing parentheses is an impossible regex problem. Some extended regular expression engines (PCRE) can deal with it, but it isn’t a regular expression then.

You might find https://stackoverflow.com/questions/8629763/csv-parsing-with-a-context-free-grammar useful.

Amended:

I have been looking at formats for escape characters and haven’t found any that need arbitrary counting – so that is probably not the issue.

However, there are issues of what is the escape character and record delimiter (to start with). http://www.csvreader.com/csv_format.php is a good read on the different formats in the wild.

The rules for the quoted string (if it is a single quoted string or a double quoted string) differ.
- 'This, is a value' vs "This, is a value"
The rules for escape characters
- "This ""is a value""" vs "This "is a value""
The handling of embedded record delimiter ({rd})
- (raw embeded) "This {rd}is a value" vs (escaped) "This {rd}is a value" vs (translated)"This {0x1C}is a value"

The key thing here is that it is possible to have a string that will always have multiple valid interpretations.

The related question (for edge cases) “is it possible to have a invalid string that is accepted?”

I still strongly doubt that there is a regular expression that can match every valid CSV that is created by some application and reject every csv that cannot be parsed.

This regexp can tokenize normal CSV, as described in the RFC:

/("(?:[^"]|"")*"|[^,"nr]*)(,|r?n|r)/

Explanation:

("(?:[^"]|"")*"|[^,"nr]*) – a CSV field, quoted or not
- "(?:[^"]|"")*" – a quoted field;
  - [^"]|"" – each character is either not ", or " escaped as ""
- [^,"nr]* – an unquoted field, which may not contain , " n r
(,|r?n|r) – the following separator, either , or a newline
- r?n|r – a newline, one of rn n r

An entire CSV file can be matched and validated by using this regexp repeatedly. It is then necessary to fix the quoted fields, and split it into rows based on the separators.

Here is code for a CSV parser in Javascript, based on the regexp:

var csv_tokens_rx = /("(?:[^"]|"")*"|[^,"nr]*)(,|r?n|r)/y;
var csv_unescape_quote_rx = /""/g;
function csv_parse(s) {
    if (s && s.slice(-1) != 'n')
        s += 'n';
    var ok;
    var rows = [];
    var row = [];
    csv_tokens_rx.lastIndex = 0;
    while (true) {
        ok = csv_tokens_rx.lastIndex == s.length;
        var m = s.match(csv_tokens_rx);
        if (!m)
            break;
        var v = m[1], d = m[2];
        if (v[0] == '"') {
            v = v.slice(1, -1);
            v = v.replace(csv_unescape_quote_rx, '"');
        }
        if (d == ',' || v)
            row.push(v);
        if (d != ',') {
            rows.push(row)
            row = [];
        }
    }
    return ok ? rows : null;
}

Whether this answer helps to settle your argument is for you to decide; I am just happy to have a small, simple and correct CSV parser.

In my opinion a lex program is more or a less a large regular expression, and those can tokenize much more complex formats, such as the C programming language.

With reference to the RFC 4180 definitions:

line break (CRLF) – The regexp is more flexible, allowing CRLF, LF or CR.
The last record in the file may or may not have an ending line break – The regexp as it is requires a final line break, but the parser adjusts for that.
There maybe an optional header line – This does not impact the parser.
Each line should contain the same number of fields throughout the file – not enforced
Spaces are considered part of a field and should not be ignored – okay
The last field in the record must not be followed by a comma – not enforced
Each field may or may not be enclosed in double quotes … – okay
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes – okay
a double-quote appearing inside a field must be escaped by preceding it with another double quote – okay

The regexp itself satisfies most of the RFC 4180 requirements. I don’t agree with the others, but it is easy to adjust the parser to implement them.

First define the grammar for your CSV (are the field delimiters escaped or encoded somehow if they appear in text?) and then it can be determined if it is parsable with regex. Grammar first: parser second: http://www.boyet.com/articles/csvparser.html It should be noted that this method uses a tokenizer – but I can’t constuct a POSIX regex that would match all edge cases. If your usage of CSV formats is non-regular and context free … then your answer is in your question. Good overview here: http://nikic.github.com/2012/06/15/The-true-power-of-regular-expressions.html

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 03:57

Thẻ: csv, parsing, regular-expressions

Thiết kế website giá rẻ

Danh mục

Can the csv format be defined by a regex?