Question

I am writing my own keyword-in-context (KWIC) function,
but I noticed an annoying, yet expected, behavior in stringi and stringr that quanteda is able to overcome.
Say I have the string "Ask a question" and my pattern exceeds the string, for example "\\w+\\s+\\w+\\s+\\w+\\s+\\w+": four words separated by one or more whitespace characters, while the string contains only three words.

What I would like:
Output of stri_extract_all_regex (or str_extract_all) to be the whole string, as the pattern fully covers the string (and exceeds it).
Unfortunately, the output is NA.
I am not sure how to overcome this behavior.
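
A minimal reproduction of the behavior described above, followed by one possible relaxation. This is only a sketch: the variable names `strict` and `relaxed` are illustrative, and the bounded quantifier `{0,3}` is an assumption about what "covering the string" should mean (one word, then up to three more).

```r
library(stringi)

x <- "Ask a question"

# The pattern requires four words, but the string has only three,
# so there is no match and stringi returns NA.
strict <- stri_extract_all_regex(x, "\\w+\\s+\\w+\\s+\\w+\\s+\\w+")[[1]]

# One word followed by up to three more words: the pattern now
# matches as many words as are actually available.
relaxed <- stri_extract_all_regex(x, "\\w+(?:\\s+\\w+){0,3}")[[1]]

print(strict)
print(relaxed)
```

Turning the fixed repetition into an upper bound keeps the "up to four words" intent without requiring all four to be present.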

I’ve added some examples to hopefully provide you with the context of my motivation to write my own function, instead of using quanteda, or to encourage developers of quanteda to add these functionalities:

Dummy example, car lease contract

<code>library(stringi)

part_of_contract <- "if you drive over 300000 miles a year you will pay a fine of 10000 usd."
</code>

I want to locate numbers with 4 to 6 digits, with a window of one word before and after each number. I expect typos, so I allow one or more spaces between words.

<code>search_term <- "\\d{4,6}"             # keyword: a number with 4 to 6 digits
context_before <- rep("\\w+\\s+", 1)  # one word before
context_after <- rep("\\s+\\w+", 1)   # one word after
</code>

So far it’s simple, and `quanteda` provides a great interface for such simple tasks:

<code>stri_extract_all_regex(part_of_contract, stri_c(context_before, search_term, context_after, collapse = ""))[[1]]
# yields the same-ish output as quanteda (a list-like result)
quanteda::tokens(part_of_contract) |> quanteda::kwic(search_term, valuetype = "regex", window = 1)
</code>

However, if I want to modify my window and search pattern, that’s where `quanteda` falls short.

I don’t necessarily want the same window before and after the keyword.
I don’t necessarily want a single word before the keyword; maybe a full sentence?

<code>context_before <- stri_c(rep("\\w+\\s+", 2), collapse = "") # two words before
context_after <- stri_c(rep("\\s+\\w+", 4), collapse = "") # four words after

stri_extract_all_regex(part_of_contract, stri_c(context_before, search_term, context_after, collapse = ""))[[1]]
</code>

<code>[1] "drive over 300000 miles a year you"

</code>

One result disappeared because it didn’t have four words (the `context_after` pattern) after it. `quanteda::kwic` accommodates this and returns the second match as well.
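
One way the shrinking window could be expressed in stringi is to treat the window sizes as upper bounds rather than exact counts. The helper below is a sketch under that assumption; `kwic_regex` and its arguments are hypothetical names, not part of stringi or quanteda.

```r
library(stringi)

# Hypothetical helper: extract `keyword` with UP TO `before` words in front
# of it and UP TO `after` words behind it (fewer if the text runs out).
kwic_regex <- function(text, keyword, before = 1, after = 1) {
  pattern <- stri_c(
    "(?:\\w+\\s+){0,", before, "}",  # zero to `before` preceding words
    "(?:", keyword, ")",             # the keyword itself
    "(?:\\s+\\w+){0,", after, "}"    # zero to `after` following words
  )
  # omit_no_match = TRUE yields character(0) instead of NA when nothing matches
  stri_extract_all_regex(text, pattern, omit_no_match = TRUE)[[1]]
}

part_of_contract <- "if you drive over 300000 miles a year you will pay a fine of 10000 usd."

hits <- kwic_regex(part_of_contract, "\\d{4,6}", before = 2, after = 4)
print(hits)
```

With these bounds the second number is kept even though only one word follows it, at the cost of windows that are no longer guaranteed to have the same length.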

I might also want a full sentence as context, another feature I can’t achieve with `quanteda`. If I want the prior sentence, defined as the text between two dots, my function fails when there is no preceding dot (for example, when the prior sentence is the first sentence of the text).

<code>part_of_contract_2 <- paste("This is the sentence before.", part_of_contract)
# same search term
context_before <- "\\..*?\\." # context before is the sentence before the keyword's sentence (between two dots)
context_after <- rep("\\s+\\w+", 1) # just the word after
</code>

<code>stri_extract_all_regex(part_of_contract_2, stri_c(context_before, search_term, context_after, collapse = ""))[[1]]
[1] NA
</code>

It returns NA because the dot that should open the prior sentence is missing, but I don’t always know whether there is a dot or not.

I found a workaround for this specific dummy example, but it doesn’t make sense to capture everything from the beginning of the text. Maybe I should start with `unnest_tokens()` to split the text into paragraphs? But then I lose the possibility of getting paragraphs as context.

<code>context_before <- "(^.*?)\\." # everything up to the first dot
stri_c(
  stri_extract_all_regex(part_of_contract_2, context_before)[[1]],
  stri_extract_all_regex(part_of_contract_2, stri_c(search_term, context_after, collapse = ""))[[1]],
  sep = " "
)
# I am pleased with the output, but it's too specific to the dummy example
[1] "This is the sentence before. 300000 miles" "This is the sentence before. 10000 usd"
</code>
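
A possible way to avoid depending on a preceding dot is to split the text into sentences first and then attach the previous sentence only when one exists. The sketch below assumes sentences end with a dot followed by whitespace; `kwic_prev_sentence` is a hypothetical name, and the splitting rule is a simplification rather than a general sentence tokenizer.

```r
library(stringi)

# Hypothetical helper: for every keyword hit, return the previous sentence
# (if there is one) followed by the keyword and one word after it.
kwic_prev_sentence <- function(text, pattern = "\\d{4,6}\\s+\\w+") {
  # Split on whitespace that follows a dot (assumed sentence boundary).
  sentences <- stri_split_regex(text, "(?<=\\.)\\s+")[[1]]
  # The first sentence has no predecessor, so pair it with an empty string.
  previous <- c("", head(sentences, -1))
  hits <- stri_extract_all_regex(sentences, pattern, omit_no_match = TRUE)
  out <- Map(function(prev, hit) {
    if (length(hit) == 0) character(0)
    else stri_trim_both(stri_c(prev, hit, sep = " "))
  }, previous, hits)
  unlist(out, use.names = FALSE)
}

part_of_contract <- "if you drive over 300000 miles a year you will pay a fine of 10000 usd."
part_of_contract_2 <- paste("This is the sentence before.", part_of_contract)

with_prev <- kwic_prev_sentence(part_of_contract_2)
without_prev <- kwic_prev_sentence(part_of_contract)
print(with_prev)
print(without_prev)
```

Because the previous sentence is looked up by position rather than matched by the regex, the same call works whether or not the keyword's sentence is the first one.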

stringr and stringi ‘extract_all_*’ return NA if the pattern exceeds the string (with detailed examples of desired use cases)
