Disclaimer On Absence of Code:
I have no code to post because I haven’t started writing; was looking for more theoretical guidance as I doubt I’ll have trouble coding it but am pretty befuddled on what approach(es) would yield best results. I’m not seeking any code, either, though; just direction.
Dilemma
I’m toying with adding a “magic method”-style feature to a UI I’m building for a client, and it would require intelligently detecting whether a string was meant to be JSON as opposed to a simple string.
I had considered these general ideas:
1. Look for some arbitrarily-determined acceptable ratio of the frequency of JSON-like syntax (e.g. regexes to find strings separated by colons, colons between curly braces, etc.) to the number of quote-encapsulated strings plus `null`s, `bool`s and `int`s/`float`s. But the smaller the data set, the more fickle this would get.
2. Look for key identifiers like opening and closing curly braces. I’m not sure there even are more easy identifiers, and this doesn’t appeal anyway because it’s so prescriptive about the kinds of mistakes it could find.
3. Try incrementally parsing chunks, such as those between curly braces, and see what proportion of these fractional statements turn out to be valid JSON. This seems like it would suffer less than (1) from smaller datasets, but would probably be much more processing-intensive, and very susceptible to a missing or inverted brace.
Just curious if the computational folks or algorithm pros out there had any approaches in mind that my semantics-oriented brain might have missed.
PS: It occurs to me that natural language processing, about which I am totally ignorant, might be a cool approach; but even if NLP is a good strategy here, it sort of doesn’t matter, because I have zero experience with it and don’t have time to learn and then implement it, and this feature isn’t worth that to the client.
The real answer (“tell between invalid JSON and a string”)
For a lot of obvious reasons we’re not going to come up with a definite answer – we can at best hope to determine an index that will tell us whether a string might have been JSON.
The best approach I can think of is to use something like the Levenshtein algorithm, with this twist: we have no “B string”, but rather we start parsing the candidate string and modify it as soon as we hit an invalid character.
To do this, we can use a JSON streaming parser (state-based). Let’s say we parse
{ "key": value" }
then the parser will find the ‘v’ and choke. At that point we can try deleting the character, or adding one of several possible valid characters for that state and recurse.
Any branch can be abandoned as soon as its distance to valid JSON exceeds the minimum distance found so far (if one has been found), or according to some heuristics (e.g. no more than four consecutive `[`s, and so on), or as soon as it exceeds a maximum allowed distance (say, 5). This last condition is probably essential to avoid snowballing.
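A minimal sketch of this branching idea, using Python’s standard `json` module in place of a true streaming parser (each branch simply re-parses from the start). The candidate insertion characters, the budget of 5, and the omission of the best-so-far pruning are all simplifying assumptions:

```python
import json

# Characters we are willing to try inserting at the point of failure; an
# illustrative set, not derived from the JSON grammar states.
REPAIR_CHARS = '"{}[],:'

def distance_to_valid_json(s, budget=5):
    """Smallest number of single-character edits (at most `budget`) that turns
    `s` into valid JSON, or None if no repair fits within the budget."""
    try:
        json.loads(s)
        return 0
    except json.JSONDecodeError as err:
        if budget == 0:
            return None
        pos = err.pos  # index where the parser choked
        candidates = [s[:pos] + s[pos + 1:]]                           # delete the offending character
        candidates += [s[:pos] + c + s[pos:] for c in REPAIR_CHARS]    # or insert a plausible one
        best = None
        for candidate in candidates:
            d = distance_to_valid_json(candidate, budget - 1)
            if d is not None and (best is None or d + 1 < best):
                best = d + 1
        return best
```

For the example above, `distance_to_valid_json('{ "key": value" }')` would find a distance of 1 by inserting a `"` before the `v`.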
A drawback of this approach is that some solutions would require backtracking:
{ "key": "value }
would be accepted up to the end, then a missing `"` would be added, and finally a missing `}`, getting valid JSON:
{ "key": "value }"}
with a distance of 2 (two additions). Perhaps some fast heuristic for backtracking can be found (e.g. when in “wait-for-end-quote”, backtrack until the first non-“special” character, and ignore whitespaces).
An important element in assessing the viability of a solution is how this invalid JSON is generated: is it truncated JSON? Are some characters damaged? Does it have some basic structure which is always the same?
User-generated JSON
As per the comments, this JSON may be the result of an attempt at hand-building JSON. On one hand this is bad, since human beings are very close to a random source when we’re talking about errors :-). On the other hand it may be good, since the JSON is probably “almost good JSON”, with possibly only a few errors such as commas left out or added in:
{ "first_name": "John", "last_name": "Doe", "email": "[email protected]", }
Indeed, we might find it useful to at least preprocess the string to remove some common errors:
",\s*}  -->  "}                          # Extra comma due to copy-paste
(")\s*[:;.]\s*("[^"]*"\s*:)  -->  \1, \2  # Forgot or mistyped a comma between groups 1 and 2
and since (hopefully) square and curly braces won’t appear inside keys or values, we can try and balance them, and/or use them as guidance: e.g. if the curly braces are balanced, assume they’re okay and do not add or remove any of them. This has to be weighted against users’ error capacity – if they are liable to totally misplace a closing curly brace, then we can’t rely on balancing.
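A rough sketch of this preprocessing, assuming the two cleanup patterns above plus a naive brace-balance check; the patterns are illustrative, not an exhaustive catalogue of user errors:

```python
import re

def preprocess(s):
    # Extra comma before a closing brace, e.g. '"email": "x", }' -> '"email": "x" }'
    s = re.sub(r',\s*}', ' }', s)
    # Wrong or missing separator between a value and the next key,
    # e.g. '"Doe"; "email":' or '"Doe" "email":'  ->  '"Doe", "email":'
    s = re.sub(r'(")\s*[:;.]?\s*("[^"]*"\s*:)', r'\1, \2', s)
    return s

def braces_balanced(s):
    # Naive balance check that ignores braces inside quoted strings; only
    # useful as guidance if keys and values rarely contain braces.
    return s.count('{') == s.count('}') and s.count('[') == s.count(']')
```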
(Of course, we can’t do anything against a malicious user intentionally feeding malformed JSON designed to either trick the system, or even cause a denial of service through some expression requiring exponential time computation).
At that point we could instead, as you suggested, consider the frequency of JSON control characters, or try recognizing JSON subsequences. This has to be compared against whatever else might come in through that field.
The bad answer (“tell between valid JSON and a string”)
In the general case you cannot distinguish with 100% accuracy between strings and JSON because, while not all JSON documents are plausible plain strings, every quoted string is itself valid JSON: the JSON encoding of a bare string is that string.
So upon receiving `"Hello, world"`, any function would return that this is both a valid string and valid JSON. Which should it choose?
Otherwise, distinguishing is pretty straightforward. Plug in the JSON engine of your choice and try decoding the string as JSON. If you can, it is JSON. If you can’t, it’s a text string:
# Python sketch using the standard json module
import json

def classify(my_string):
    try:
        value = json.loads(my_string)
    except ValueError:
        # Error. Not valid JSON, so it must be a plain string.
        return "String"
    # It is valid JSON, but a quoted string is that, too.
    if isinstance(value, str):
        # Could return either, depending.
        return "Ambiguous"
    return "JSON"
This is surely not as efficient as a function that just checks the grammar constraints (you can find the schemas on the home page), but it’s a lot more maintainable. You can probably get a FSM by boiling dry the code of a streaming parser and removing all data emission, just returning “true” or “false” (a processing error meaning it’s not valid JSON).
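For illustration, a minimal sketch of that boiled-dry approach, assuming the third-party ijson streaming parser; both the library choice and the exact exception class to catch are assumptions worth verifying:

```python
import io
import ijson  # third-party event-based JSON parser (an assumed choice)

def is_valid_json(s):
    """Walk the parser's event stream, discard every event, and report only
    whether parsing completed without error."""
    try:
        for _ in ijson.parse(io.BytesIO(s.encode('utf-8'))):
            pass  # no data is built up; we only care that parsing succeeds
        return True
    except ijson.common.JSONError:  # assumed base class for ijson parse errors
        return False
```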
So what you want is a JSON sniffer. Something to look at a string and make a heuristic guess. So what to look for and how to look?
First off, a JSON parser is pretty useless here. You call that after you’ve guessed it’s probably JSON.
I can think of three approaches.
- Smoking gun. If the string has JSON or json-schema.org in it, that’s a match. Maybe there are some other common words you might know of.
- Regex it. Think up some regexes that would be highly likely to be matched. If over half the lines match one or more suitable regexes it’s a match.
- Tokenise it. JSON typically has lots of individual quoted strings, numbers, commas, brackets, braces and colons. Any string that is over 50% made up of those tokens is a match. (Make sure you don’t let an unmatched quote swallow the whole string.) A rough sketch of this idea follows this list.
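A sketch of that tokenising heuristic, assuming the 50% threshold above; the token patterns are illustrative rather than a complete JSON grammar:

```python
import re

# JSON-ish tokens: quoted strings (no raw newlines, so an unmatched quote
# cannot run across lines), numbers, literals and structural characters.
TOKEN = re.compile(r'''
    "(?:[^"\\\n]|\\.)*"                    # quoted string
  | -?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?       # number
  | \b(?:true|false|null)\b                # literals
  | [{}\[\]:,]                             # braces, brackets, colon, comma
''', re.VERBOSE)

def looks_like_json(s, threshold=0.5):
    """Guess that `s` is meant to be JSON if enough of its non-whitespace
    content is covered by JSON-like tokens."""
    stripped = re.sub(r'\s+', '', s)
    if not stripped:
        return False
    covered = sum(len(re.sub(r'\s+', '', tok)) for tok in TOKEN.findall(s))
    return covered / len(stripped) >= threshold
```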
Your suggestions could work too, but I think mine are simpler.
I’ve used sniffing code in the past and it’s surprisingly easy to get it right most of the time. The mistakes are usually when you get (say) a text message with a piece of JSON quoted in it. You can’t win them all.