I’m trying to tweak my Elasticsearch default tokeniser (for stemmed English, stemmed other languages and also for unstemmed analysis) since I noticed that dot (“.”) isn’t by default a token separator, i.e. with the standard analyser. Other questions have suggested answers to how to achieve that… but my problem is more fundamental: I don’t seem to be able to apply ANY change of tokeniser to my index, even to unstemmed fields.
NB I’m using Python but not the Elasticsearch “thin client”: I’m submitting requests using the Python requests package, and in fact have created a utility function for doing that.
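(For context, the helper is roughly along these lines — a simplified sketch, not the exact code:)

import json
import requests

def process_json_request(url, command='get', data=None, headers=None):
    # dispatch to the appropriate requests method and return
    # (success flag, parsed JSON body or raw text)
    response = requests.request(command.upper(), url, data=data, headers=headers)
    try:
        deliverable = response.json()
    except ValueError:
        deliverable = response.text
    return response.ok, deliverable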
So I create my index. My understanding is that I then have to close it in order to change the settings:
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_close', command='post')
Then I try to apply the settings:
settings_obj = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokeniser"
                }
            },
            "tokenizer": {
                "my_tokeniser": {
                    "type": "pattern",
                    "pattern": ","  # comma tokeniser: a first test to see if I can get things working
                    # "pattern": r"\W+^."  # what I want to achieve
                    # "pattern": r"\W+"  # what the standard analyzer is said to do
                }
            }
        }
    }
}
headers = {CONTENT_TYPE_KEY: APPLICATION_JSON_VALUE}
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_settings', command='put', data=json.dumps(settings_obj), headers=headers)
Then I open the index again:
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_open', command='post')
… and everything else runs without errors. The index is populated. I run a search… and the text is tokenised exactly as before! No “comma tokenising” has occurred.
I then go to Insomnia to examine my settings for this index:
"GET ... http://localhost:9200/dev_my_documents-744/_settings"
gives:
{
  "dev_my_documents-744": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "dev_my_documents-744",
        "creation_date": "1718297857215",
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokeniser"
            }
          },
          "tokenizer": {
            "my_tokeniser": {
              "pattern": ",",
              "type": "pattern"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "ZekazwMETLO5cDd9Oc_7Mg",
        "version": {
          "created": "8503000"
        }
      }
    }
  }
}
Obviously the stemmed fields have their own language-specific analysers, which I assume override the “default” analyser from the settings… but I also see no change at all for the two unstemmed fields (“normalised_content”, i.e. with accents stripped out, and “unnormalised_content”) when I search on them.
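(To illustrate what I mean by a field having its own analyser: an explicit binding in the mapping would look something like the following — I haven’t actually done this for the unstemmed fields, it’s just a sketch:)

mapping_obj = {
    "properties": {
        "normalised_content": {
            "type": "text",
            "analyzer": "my_analyzer"  # explicit per-field analyser
        }
    }
}
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_mapping', command='put', data=json.dumps(mapping_obj), headers=headers)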
I’ve also tried using the “_analyze” endpoint… again, everything is tokenised on word boundaries… never on commas.
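(A call roughly like this is what I mean — the text value is just an illustration:)

analyze_obj = {
    "analyzer": "my_analyzer",
    "text": "alpha,beta gamma.delta"
}
success, deliverable = process_json_request(f'{ES_URL}/{new_index}/_analyze', command='post', data=json.dumps(analyze_obj), headers=headers)
# with a comma tokeniser I'd expect the tokens ["alpha", "beta gamma.delta"] back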
This “comma tokeniser” simply isn’t being applied. What am I doing wrong?