I have an app where I index data in OpenSearch. The data model is largely defined by the end users and not by myself. So, when a user says that they have a “string” field, I index it both as a text and a keyword field in OpenSearch, because I don’t know whether it’s a short, enum-style string or long-form text. So my field mappings look like:
"example_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
The problem arises when a user then supplies long-form text, and I get errors like:
Document contains at least one immense term in field="example_field.keyword" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.
I’ve tried setting ignore_above like so:
"example_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 20000
}
}
}
But it looks like this doesn’t actually prevent immense terms in keyword fields.
Ideally, my app would distinguish between short and long text fields so I didn’t have to index everything as both text and keyword. But as that isn’t the case now, is there a way I can limit the max length of the keyword field, but not of the text field?
The text and keyword field types already work the way you want. In other words, there is no hard limit for text, but there is one for keyword. So when you index a long string (by default, “long” means more than 256 characters, which is the ignore_above that dynamic mappings apply to keyword sub-fields), the value is dropped/ignored for the keyword sub-field, but you can continue to use the text field for search.
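For example, with an explicit ignore_above, indexing does not fail; the overlong value is just skipped for the keyword sub-field. A rough, untested sketch (the index name and placeholder strings are mine):

PUT my-index
{
  "mappings": {
    "properties": {
      "example_field": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

# Indexing succeeds even if the value is longer than 256 characters;
# it is simply not indexed as a keyword term.
PUT my-index/_doc/1
{
  "example_field": "... a string longer than 256 characters ..."
}

# Full-text search still works:
GET my-index/_search
{
  "query": { "match": { "example_field": "string" } }
}

# But an exact-match term query on the keyword sub-field finds nothing:
GET my-index/_search
{
  "query": { "term": { "example_field.keyword": "... the full string ..." } }
}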
If you would like to limit the length yourself, you can use an ingest pipeline. There is an example here.
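As a rough sketch of such a pipeline, a script processor can truncate the incoming value before it is indexed (untested; the pipeline name, field name, and the 8000-character cutoff are my own choices, and note that this truncates the stored value itself, so the text copy is shortened too):

PUT _ingest/pipeline/truncate-example-field
{
  "description": "Truncate overly long values before indexing",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.example_field instanceof String && ctx.example_field.length() > 8000) { ctx.example_field = ctx.example_field.substring(0, 8000); }"
      }
    }
  ]
}

# Apply it per request ...
PUT my-index/_doc/1?pipeline=truncate-example-field
{ "example_field": "..." }

# ... or make it the index default:
PUT my-index/_settings
{ "index.default_pipeline": "truncate-example-field" }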
The problem was that I had set ignore_above too high:
"example_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 20000
}
}
}
That will work for English text, but not for e.g. Chinese text. ignore_above configures a limit in characters, but Lucene’s internal limit is 32766 bytes, and a single character can take several bytes in UTF-8. So a safe limit to set is ignore_above: 8000: UTF-8 encodes a character in at most 4 bytes, and 8000 * 4 = 32000, which is still below the limit.
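So the corrected mapping is the same as above, just with the lower limit:

"example_field": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 8000
    }
  }
}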