Disclaimer On Absence of Code:
I have no code to post because I haven’t started writing; was looking for more theoretical guidance as I doubt I’ll have trouble coding it but am pretty befuddled on what approach(es) would yield best results. I’m not seeking any code, either, though; just direction.
Dilemma
I’m toying with adding a “magic method”-style feature to a UI I’m building for a client, and it would require intelligently detecting whether a string was meant to be JSON as opposed to a simple string.
I had considered these general ideas:
1. Look for some arbitrarily-determined acceptable ratio of the frequency of JSON-like syntax (e.g. regexes to find strings separated by colons, colons between curly braces, etc.) to the number of quote-encapsulated strings plus `null`s, `bool`s and `int`s/`float`s. But the smaller the data set, the more fickle this would get.
2. Look for key identifiers like opening and closing curly braces. I’m not sure there even are more easy identifiers, and this doesn’t appeal anyway because it’s so prescriptive about the kinds of mistakes it could find.
3. Try incrementally parsing chunks, such as those between curly braces, and see what proportion of these fractional statements turn out to be valid JSON. This seems like it would suffer less than (1) from smaller datasets, but would probably be much more processing-intensive, and very susceptible to a missing or inverted brace.
Just curious if the computational folks or algorithm pros out there had any approaches in mind that my semantics-oriented brain might have missed.
PS: It occurs to me that natural language processing, about which I am totally ignorant, might be a cool approach; but even if NLP is a good strategy here, it sort of doesn’t matter, because I have zero experience with it and don’t have time to learn and then implement it, and this feature isn’t worth that to the client.
The real answer (“tell between invalid JSON and a string”)
For a lot of obvious reasons we’re not going to come up with a definite answer – we can at best hope to determine an index that will tell us whether a string might have been JSON.
The best approach I can think of is to use something like the Levenshtein algorithm, with this twist: we have no “B string”, but rather we start parsing the candidate string and modify it as soon as we hit an invalid character.
To do this, we can use a JSON streaming parser (state-based). Let’s say we parse
{ "key": value" }
then the parser will find the ‘v’ and choke. At that point we can try deleting the character, or adding one of several possible valid characters for that state and recurse.
Any branch can be abandoned as soon as its distance to valid JSON exceeds the minimum distance found so far (if one has been found), or according to some heuristics (e.g. no more than four consecutive `[`s, and so on), or as soon as it exceeds a maximum allowed distance (say, 5). This last condition is probably essential to avoid snowballing.
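A minimal sketch of this branching idea, using Python’s standard `json` module in place of a true streaming parser (each branch simply re-parses from the start). The candidate insertion characters, the budget of 5, and the omission of the best-so-far pruning are all simplifying assumptions:

```python
import json

# Characters we are willing to try inserting at the point of failure; an
# illustrative set, not derived from the JSON grammar states.
REPAIR_CHARS = '"{}[],:'

def distance_to_valid_json(s, budget=5):
    """Smallest number of single-character edits (at most `budget`) that turns
    `s` into valid JSON, or None if no repair fits within the budget."""
    try:
        json.loads(s)
        return 0
    except json.JSONDecodeError as err:
        if budget == 0:
            return None
        pos = err.pos  # index where the parser choked
        candidates = [s[:pos] + s[pos + 1:]]                           # delete the offending character
        candidates += [s[:pos] + c + s[pos:] for c in REPAIR_CHARS]    # or insert a plausible one
        best = None
        for candidate in candidates:
            d = distance_to_valid_json(candidate, budget - 1)
            if d is not None and (best is None or d + 1 < best):
                best = d + 1
        return best
```

For the example above, `distance_to_valid_json('{ "key": value" }')` would find a distance of 1 by inserting a `"` before the `v`.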
A drawback of this approach is that some solutions would require backtracking:
{ "key": "value }
would be accepted up to the end, then a missing `"` would be added, and finally a missing `}`, getting valid JSON:
{ "key": "value }"}
with a distance of 2 (two additions). Perhaps some fast heuristic for backtracking can be found (e.g. when in “wait-for-end-quote”, backtrack until the first non-“special” character, and ignore whitespaces).
An important element in assessing the viability of a solution is how this invalid JSON is generated: is it truncated JSON? Are some characters damaged? Does it have some basic structure which is always the same?
User-generated JSON
As per the comments, this JSON may be the result of an attempt at hand-building JSON. On one hand this is bad, since human beings are very close to a random source when we’re talking about errors :-). On the other hand it may be good, since the JSON is probably “almost good JSON”, with possibly only a few errors such as commas left out or added in:
{ "first_name": "John", "last_name": "Doe", "email": "[email protected]", }
Indeed, we might find it useful to at least preprocess the string to remove some common errors:
",\s*}  -->  "}                          # Extra comma due to copy-paste
(")\s*[:;.]\s*("[^"]*"\s*:)  -->  \1, \2  # Forgot or mistyped a comma between groups 1 and 2
and since (hopefully) square and curly braces won’t appear inside keys or values, we can try and balance them, and/or use them as guidance: e.g. if the curly braces are balanced, assume they’re okay and do not add or remove any of them. This has to be weighted against users’ error capacity – if they are liable to totally misplace a closing curly brace, then we can’t rely on balancing.
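A rough sketch of this preprocessing, assuming the two cleanup patterns above plus a naive brace-balance check; the patterns are illustrative, not an exhaustive catalogue of user errors:

```python
import re

def preprocess(s):
    # Extra comma before a closing brace, e.g. '"email": "x", }' -> '"email": "x" }'
    s = re.sub(r',\s*}', ' }', s)
    # Wrong or missing separator between a value and the next key,
    # e.g. '"Doe"; "email":' or '"Doe" "email":'  ->  '"Doe", "email":'
    s = re.sub(r'(")\s*[:;.]?\s*("[^"]*"\s*:)', r'\1, \2', s)
    return s

def braces_balanced(s):
    # Naive balance check that ignores braces inside quoted strings; only
    # useful as guidance if keys and values rarely contain braces.
    return s.count('{') == s.count('}') and s.count('[') == s.count(']')
```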
(Of course, we can’t do anything against a malicious user intentionally feeding malformed JSON designed to either trick the system, or even cause a denial of service through some expression requiring exponential time computation).
At that point we could instead, as you suggested, consider the frequency of JSON control characters, or try recognizing JSON subsequences. This has to be compared against whatever else might come in through that field.
The bad answer (“tell between valid JSON and a string”)
In the general case you cannot distinguish with 100% accuracy between strings and JSON because, while not all JSON documents are plausible plain strings, every quoted string is itself valid JSON: the JSON encoding of a bare string is that string.
So upon receiving `"Hello, world"`, any function would return that this is both a valid string and valid JSON. Which should it choose?
Otherwise, distinguishing is pretty straightforward. Plug in the JSON engine of your choice and try decoding the string as JSON. If you can, it is JSON. If you can’t, it’s a text string:
# Python sketch using the standard json module
import json

def classify(my_string):
    try:
        value = json.loads(my_string)
    except ValueError:
        # Error. Not valid JSON, so it must be a plain string.
        return "String"
    # It is valid JSON, but a quoted string is that, too.
    if isinstance(value, str):
        # Could return either, depending.
        return "Ambiguous"
    return "JSON"
This is surely not as efficient as a function that just checks the grammar constraints (you can find the schemas on the home page), but it’s a lot more maintainable. You can probably get a FSM by boiling dry the code of a streaming parser and removing all data emission, just returning “true” or “false” (a processing error meaning it’s not valid JSON).
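For illustration, a minimal sketch of that boiled-dry approach, assuming the third-party ijson streaming parser; both the library choice and the exact exception class to catch are assumptions worth verifying:

```python
import io
import ijson  # third-party event-based JSON parser (an assumed choice)

def is_valid_json(s):
    """Walk the parser's event stream, discard every event, and report only
    whether parsing completed without error."""
    try:
        for _ in ijson.parse(io.BytesIO(s.encode('utf-8'))):
            pass  # no data is built up; we only care that parsing succeeds
        return True
    except ijson.common.JSONError:  # assumed base class for ijson parse errors
        return False
```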
So what you want is a JSON sniffer. Something to look at a string and make a heuristic guess. So what to look for and how to look?
First off, a JSON parser is pretty useless here. You call that after you’ve guessed it’s probably JSON.
I can think of three approaches.
- Smoking gun. If the string has JSON or json-schema.org in it, that’s a match. Maybe there are some other common words you might know of.
- Regex it. Think up some regexes that would be highly likely to be matched. If over half the lines match one or more suitable regexes it’s a match.
- Tokenise it. JSON typically has lots of individual quoted strings, numbers, commas, brackets, braces and colons. Any string that is over 50% made up of those tokens is a match. (Make sure you don’t let an unmatched quote swallow the whole string.) A rough sketch of this idea follows this list.
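A sketch of that tokenising heuristic, assuming the 50% threshold above; the token patterns are illustrative rather than a complete JSON grammar:

```python
import re

# JSON-ish tokens: quoted strings (no raw newlines, so an unmatched quote
# cannot run across lines), numbers, literals and structural characters.
TOKEN = re.compile(r'''
    "(?:[^"\\\n]|\\.)*"                    # quoted string
  | -?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?       # number
  | \b(?:true|false|null)\b                # literals
  | [{}\[\]:,]                             # braces, brackets, colon, comma
''', re.VERBOSE)

def looks_like_json(s, threshold=0.5):
    """Guess that `s` is meant to be JSON if enough of its non-whitespace
    content is covered by JSON-like tokens."""
    stripped = re.sub(r'\s+', '', s)
    if not stripped:
        return False
    covered = sum(len(re.sub(r'\s+', '', tok)) for tok in TOKEN.findall(s))
    return covered / len(stripped) >= threshold
```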
Your suggestions could work too, but I think mine are simpler.
I’ve used sniffing code in the past and it’s surprisingly easy to get it right most of the time. The mistakes are usually when you get (say) a text message with a piece of JSON quoted in it. You can’t win them all.