I have a set of about 20,000 short job descriptions in English. For now, my goal is to determine the optimal number of topics for this corpus.
I am using an R script that worked reasonably well on a different corpus, but here I get an error that I cannot decipher. Please have a look at the reprex at the end of this post.
The data can be downloaded from
https://e.pcloud.link/publink/show?code=XZ2oeTZTA9RAUJnpaFbknbl6IL9KSuUq14k
and the frequency matrix from
https://e.pcloud.link/publink/show?code=XZkynTZfDN89V7AJNRGzvVwyua2qFkQop3V
(the script also computes it in any case).
Any suggestions are appreciated.
library(tidyverse)
library(quanteda)
#> Package version: 4.0.2
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
library(seededlda)
#> Loading required package: proxyC
#>
#> Attaching package: 'proxyC'
#> The following object is masked from 'package:stats':
#>
#> dist
#>
#> Attaching package: 'seededlda'
#> The following object is masked from 'package:stats':
#>
#> terms
library(ldatuning)
## Download the data from
## https://e.pcloud.link/publink/show?code=XZ2oeTZTA9RAUJnpaFbknbl6IL9KSuUq14k
jobs <- readRDS("jobs_in_english.RDS") ## read the data
corp <- corpus(jobs, docid_field = "id",
               text_field = "description") ## create a corpus
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
## generate the frequency matrix
## if you want, you can download it directly from
## https://e.pcloud.link/publink/show?code=XZkynTZfDN89V7AJNRGzvVwyua2qFkQop3V
dfmt <- dfm(toks) |>
  dfm_remove(stopwords("en")) |>
  dfm_remove("*@*") |>
  dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 19,989 documents, 91,979 features (99.90% sparse) and 1 docvar.
#>            features
#> docs        panel paint technician colchester essex 7.00am-4.30pm basic p.a
#>   872828466     5     5          4          2     1             1     2   1
#>   857077872     0     0          0          0     0             0     0   0
#>   801801567     0     0          0          0     0             0     0   0
#>   855162927     0     0          0          0     0             0     0   0
#>   767099713     0     0          0          0     0             0     0   0
#>   770142853     0     0          0          0     0             0     0   0
#>            features
#> docs        depending held
#>   872828466         1    1
#>   857077872         0    0
#>   801801567         0    0
#>   855162927         0    0
#>   767099713         0    0
#>   770142853         0    0
#> [ reached max_ndoc ... 19,983 more documents, reached max_nfeat ... 91,969 more features ]
## try to determine the optimal number of topics.
result <- FindTopicsNumber(
  dfmt,
  topics = seq(from = 2, to = 10, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)
#> fit models...
#> Error in checkForRemoteErrors(val): 2 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry
### Here the code fails...and I do not understand why
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Debian GNU/Linux 12 (bookworm)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ldatuning_1.0.2 seededlda_1.2.1 proxyC_0.4.1 quanteda_4.0.2
#> [5] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
#> [9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
#> [13] ggplot2_3.5.1 tidyverse_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] styler_1.10.3 utf8_1.2.4 generics_0.1.3 xml2_1.3.6
#> [5] slam_0.1-50 stringi_1.8.4 lattice_0.22-6 hms_1.1.3
#> [9] digest_0.6.35 magrittr_2.0.3 evaluate_0.23 grid_4.4.1
#> [13] timechange_0.3.0 fastmap_1.1.1 R.oo_1.26.0 R.cache_0.16.0
#> [17] Matrix_1.7-0 tm_0.7-13 R.utils_2.12.3 topicmodels_0.2-16
#> [21] stopwords_2.3 fansi_1.0.6 scales_1.3.0 modeltools_0.2-23
#> [25] cli_3.6.2 rlang_1.1.3 R.methodsS3_1.8.2 munsell_0.5.1
#> [29] reprex_2.1.0 withr_3.0.0 yaml_2.3.8 parallel_4.4.1
#> [33] NLP_0.2-1 tools_4.4.1 tzdb_0.4.0 colorspace_2.1-0
#> [37] fastmatch_1.1-4 vctrs_0.6.5 R6_2.5.1 stats4_4.4.1
#> [41] lifecycle_1.0.4 fs_1.6.4 pkgconfig_2.0.3 pillar_1.9.0
#> [45] gtable_0.3.5 glue_1.7.0 Rcpp_1.0.12 xfun_0.43
#> [49] tidyselect_1.2.1 knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26
#> [53] compiler_4.4.1
Created on 2024-06-18 with reprex v2.1.0
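In case it helps with the diagnosis: the error message mentions rows without any non-zero entry, so one thing that could be checked is whether some documents are left with no features after stopword removal and trimming. The sketch below illustrates such a check on a toy corpus. The three documents and the object names (`toy`, `toy_dfm`, `n_left`, `toy_dfm_nonempty`) are made up for illustration; `ntoken()` and `dfm_subset()` are quanteda functions. I am only assuming that this is what the error refers to; I have not confirmed that it is the cause on the real data.

```r
library(quanteda)

# Toy corpus (hypothetical data): the second document consists of a single
# stopword, so it has no features left once stopwords are removed.
toy <- corpus(c(d1 = "paint technician wanted",
                d2 = "the",
                d3 = "panel beater wanted"))

toy_dfm <- dfm(tokens(toy)) |>
  dfm_remove(stopwords("en"))

# Number of tokens remaining in each document (named by document id)
n_left <- ntoken(toy_dfm)

# Ids of the documents that became empty
names(n_left)[n_left == 0]

# Keep only the documents that still contain at least one feature
toy_dfm_nonempty <- dfm_subset(toy_dfm, ntoken(toy_dfm) > 0)
ndoc(toy_dfm_nonempty)
```

The same two calls could be applied to `dfmt` from the reprex, to see how many of the 19,989 documents are empty before the matrix is passed to `FindTopicsNumber()`.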