I have been using stringr since it’s supposed to be faster, but I found out today that it’s much slower when dealing with factors. I didn’t see any warning that this would be the case, nor any explanation of why.
For example:
library(stringr)
library(forcats)  # provides as_factor()
string_options = c("OneWord", "TwoWords", "ThreeWords")
sample_chars = sample(string_options, 1e6, replace = TRUE)
sample_facts = as_factor(sample_chars)
When working with character vectors, stringr is faster than base R, as expected. But when dealing with factors, base R is roughly 25 to 30x faster.
bench::mark(
base_chars = grepl("Two", sample_chars),
stringr_chars = str_detect(sample_chars, "Two"),
base_facts = grepl("Two", sample_facts),
stringr_facts = str_detect(sample_facts, "Two")
)
# A tibble: 4 × 13
#   expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result            memory             time             gc
#   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>            <list>             <list>           <list>
# 1 base_chars     116.1ms 116.38ms      8.58    3.81MB        0     5     0      583ms <lgl [1,000,000]> <Rprofmem [1 × 3]> <bench_tm [5]>   <tibble>
# 2 stringr_chars  86.04ms   88.2ms      11.3    3.81MB        0     6     0      532ms <lgl [1,000,000]> <Rprofmem [2 × 3]> <bench_tm [6]>   <tibble>
# 3 base_facts      3.59ms   3.65ms      271.   11.44MB        0   136     0      501ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [136]> <tibble>
# 4 stringr_facts  90.71ms  91.29ms      10.9   11.44MB        0     6     0      549ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [6]>   <tibble>
It looks like stringr isn’t doing anything different with factors, while base R is significantly optimizing for them. Is this expected behaviour? Should I report this as a stringr issue? Is there some stringr setting I’m completely missing? I’d like to not have to think about the format of the data when deciding whether to use stringr or base R.
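For what it’s worth, my guess is that the base R speedup comes from matching each distinct level once and then expanding that result through the factor’s integer codes, instead of matching all 1e6 elements. Below is a minimal sketch of that idea on top of stringr. `str_detect_levels` is a hypothetical helper name of mine, not a stringr function, and I’m assuming (not confirming) that this is roughly what `grepl` does internally.

```r
library(stringr)

# Hypothetical helper, NOT part of stringr: for factors, run the match once
# per level and expand the result via the integer codes. Anything else is
# passed straight through to str_detect().
str_detect_levels <- function(x, pattern) {
  if (!is.factor(x)) {
    return(str_detect(x, pattern))
  }
  level_hits <- str_detect(levels(x), pattern)  # one match per distinct level
  level_hits[as.integer(x)]                     # NA codes stay NA
}

string_options <- c("OneWord", "TwoWords", "ThreeWords")
sample_facts <- factor(sample(string_options, 1e4, replace = TRUE))

# The helper should agree with base R on the same factor input.
stopifnot(identical(str_detect_levels(sample_facts, "Two"),
                    grepl("Two", sample_facts)))
```

This only pays off when there are far fewer levels than elements, as in my example (3 levels, 1e6 values). With mostly unique values it would do about the same amount of matching as plain `str_detect`.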