I’m trying to tweak my Elasticsearch default tokeniser (for stemmed English, stemmed other languages and also for unstemmed analysis) since I noticed that dot (“.”) isn’t by default a token separator, i.e. with the standard analyser. Other questions have suggested answers to how to achieve that… but my problem is more fundamental: I don’t seem to be able to apply ANY change of tokeniser to my index, even to unstemmed fields.
NB I’m using Python but not the Elasticsearch “thin client”: I’m submitting requests using the Python requests package, and in fact have created a utility function for doing that.
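(For context, the helper is roughly along these lines — a simplified sketch, not the exact code:)

import json
import requests

def process_json_request(url, command='get', data=None, headers=None):
    # dispatch to the appropriate requests method and return
    # (success flag, parsed JSON body or raw text)
    response = requests.request(command.upper(), url, data=data, headers=headers)
    try:
        deliverable = response.json()
    except ValueError:
        deliverable = response.text
    return response.ok, deliverable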
So I create my index. My understanding is that I then have to close it in order to change the settings:
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_close', command='post')
Then I try to apply the settings:
settings_obj = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokeniser"
                }
            },
            "tokenizer": {
                "my_tokeniser": {
                    "type": "pattern",
                    "pattern": ","  # comma tokeniser: a first test to see if I can get things working
                    # "pattern": r"\W+^."  # what I want to achieve
                    # "pattern": r"\W+"  # what the standard analyzer is said to do
                }
            }
        }
    }
}
headers = {CONTENT_TYPE_KEY: APPLICATION_JSON_VALUE}
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_settings', command='put', data=json.dumps(settings_obj), headers=headers)
Then I open the index again:
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_open', command='post')
… and everything else runs without errors. The index is populated. I run a search… and the text is tokenised exactly as before! No “comma tokenising” has occurred.
I then go to Insomnia to examine my settings for this index:
"GET ... http://localhost:9200/dev_my_documents-744/_settings"
gives:
{
  "dev_my_documents-744": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "dev_my_documents-744",
        "creation_date": "1718297857215",
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokeniser"
            }
          },
          "tokenizer": {
            "my_tokeniser": {
              "pattern": ",",
              "type": "pattern"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "ZekazwMETLO5cDd9Oc_7Mg",
        "version": {
          "created": "8503000"
        }
      }
    }
  }
}
Obviously the stemmed fields have their own language-specific analysers, which I assume override the “default” analyser from the settings… but I also see no change at all for the two unstemmed fields (“normalised_content”, i.e. with accents stripped out, and “unnormalised_content”) when I search on them.
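(To illustrate what I mean by a field having its own analyser: an explicit binding in the mapping would look something like the following — I haven’t actually done this for the unstemmed fields, it’s just a sketch:)

mapping_obj = {
    "properties": {
        "normalised_content": {
            "type": "text",
            "analyzer": "my_analyzer"  # explicit per-field analyser
        }
    }
}
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_mapping', command='put', data=json.dumps(mapping_obj), headers=headers)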
I’ve also tried using the “_analyze” endpoint… again, everything is tokenised on word boundaries… never on commas.
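(A call roughly like this is what I mean — the text value is just an illustration:)

analyze_obj = {
    "analyzer": "my_analyzer",
    "text": "alpha,beta gamma.delta"
}
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_analyze', command='post', data=json.dumps(analyze_obj), headers=headers)
# with a comma tokeniser I'd expect the tokens ["alpha", "beta gamma.delta"] back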
This “comma tokeniser” simply isn’t being applied. What am I doing wrong?