I am writing my own key-word-in-context (KWIC) function, but I notice an annoying, yet expected, behavior of stringi and stringr that quanteda is able to overcome.
Say I have the string "Ask a question"
and my pattern exceeds the string, for example "\w+\s+\w+\s+\w+\s+\w+"
(four runs of word characters separated by one or more whitespace characters).
What I would like:
The output of stri_extract_all_regex (or str_extract_all) to be the whole string, as the pattern fully covers the string (and exceeds it). Unfortunately, the output is NA.
I am not sure how to overcome this behavior.
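One direction I have been considering, as a minimal sketch rather than a confirmed fix: require only the first word and make the remaining words optional with a bounded quantifier, so that a string shorter than the window still matches. The {0,3} bound is my own assumption for "up to four words in total".

```r
library(stringi)

# One mandatory word, then up to three more whitespace-separated words.
# The {0,3} upper bound is an assumption standing in for the window size.
pattern <- "\\w+(?:\\s+\\w+){0,3}"

# Shorter than the window: the whole string is returned instead of NA.
short <- stri_extract_all_regex("Ask a question", pattern)[[1]]
short
# [1] "Ask a question"

# Longer than the window: the text is cut into chunks of at most four words.
long <- stri_extract_all_regex("Ask a question right now please", pattern)[[1]]
long
```

Because the quantifier is greedy, it still takes as many words as are available, up to the bound.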
I’ve added some examples to hopefully provide you with the context of my motivation to write my own function instead of using quanteda, or to encourage the developers of quanteda to add these functionalities:
Dummy example, car lease contract
library(stringi)
part_of_contract <- "if you drive over 300000 miles a year you will pay a fine of 10000 usd."
I want to locate numbers with 4 to 6 digits, with a window of 1 word before and after each number. I might expect typos, so I allow one or more spaces between words.
search_term <- "\\d{4,6}" # backslashes must be doubled inside R strings
context_before <- rep("\\w+\\s+",1)
context_after <- rep("\\s+\\w+",1)
So far it’s simple, and quanteda provides a great interface for such simple tasks:
stri_extract_all_regex(part_of_contract, stri_c(context_before,search_term,context_after,collapse = ""))[[1]]
yields the same-ish output as quanteda (a list-like result)
quanteda::tokens(part_of_contract) |> quanteda::kwic(search_term,valuetype = "regex",window = 1)
However, if I want to modify my window and search pattern, that’s where quanteda lacks the possibilities.
- I don't necessarily want the same window before and after the keyword.
- I don't necessarily want a word before the keyword; maybe a full sentence?
context_before <- stri_c(rep("\\w+\\s+",2),collapse = "") # two words before
context_after <- stri_c(rep("\\s+\\w+",4),collapse = "") # four words after
stri_extract_all_regex(part_of_contract, stri_c(context_before,search_term,context_after,collapse = ""))[[1]]
[1] "drive over 300000 miles a year you"
One result disappeared because it didn't have 4 words (the context_after pattern) after it. quanteda::kwic is able to accommodate this and will provide the 2nd match as well.
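A sketch of how I imagine the window could be made tolerant, assuming bounded quantifiers ({0,n}) are acceptable in place of the fixed repetitions: each side takes up to n words but is allowed to take fewer. kwic_pattern() is a hypothetical helper name of my own, not a function from stringi or quanteda.

```r
library(stringi)

# Hypothetical helper: build a KWIC regex with "up to" windows on each side.
kwic_pattern <- function(keyword, before = 1, after = 1) {
  stri_c(
    "(?:\\w+\\s+){0,", as.integer(before), "}",  # up to `before` words in front
    keyword,
    "(?:\\s+\\w+){0,", as.integer(after), "}"    # up to `after` words behind
  )
}

part_of_contract <- "if you drive over 300000 miles a year you will pay a fine of 10000 usd."

windows <- stri_extract_all_regex(
  part_of_contract,
  kwic_pattern("\\d{4,6}", before = 2, after = 4)
)[[1]]
windows
# [1] "drive over 300000 miles a year you" "fine of 10000 usd"
```

With this the second number is kept even though only one word follows it. Matches still cannot overlap, so two keywords closer together than the window would share context differently than in quanteda::kwic.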
I might want a full sentence as context, another feature I can’t achieve with quanteda. If I want the prior sentence, defined by the words between two dots, my function will fail when there is no dot before it (maybe it was the first sentence):
part_of_contract_2 <- paste("This is the sentence before.",part_of_contract)
# same search terms
context_before <- "\\..*?\\." # context before is the sentence before the keyword's sentence (before the dot)
context_after <- rep("\\s+\\w+",1) # just the word after
stri_extract_all_regex(part_of_contract_2, stri_c(context_before,search_term,context_after,collapse = ""))[[1]]
[1] NA
It returns NA since there is no dot, but I don't always know whether there is or isn’t a dot.
I found a workaround for this specific dummy example, but it doesn't make sense to capture everything from the beginning of the text. Maybe I should start by using unnest_tokens() to split into paragraphs? But then I lose the possibility of getting paragraphs as context.
context_before <- "(^.*?)\\." # everything before a dot
stri_c(stri_extract_all_regex(part_of_contract_2,context_before)[[1]],
stri_extract_all_regex(part_of_contract_2, stri_c(search_term,context_after,collapse = ""))[[1]],sep = " "
)
# I am pleased with the output, but it's too specific to the dummy example
[1] "This is the sentence before. 300000 miles" "This is the sentence before. 10000 usd"