Question

I have a set of roughly 20,000 short job descriptions in English. For now, my goal is to determine the optimal number of topics for them.
I use an R script that worked decently on a different corpus, but here I get an error I cannot decipher. Please have a look at the reprex at the end of this post.
The data can be downloaded from

https://e.pcloud.link/publink/show?code=XZ2oeTZTA9RAUJnpaFbknbl6IL9KSuUq14k

and the frequency matrix from

https://e.pcloud.link/publink/show?code=XZkynTZfDN89V7AJNRGzvVwyua2qFkQop3V

(in any case, the script recomputes it).

Any suggestion is appreciated.

library(tidyverse)
library(quanteda)
#> Package version: 4.0.2
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
library(seededlda)
#> Loading required package: proxyC
#> 
#> Attaching package: 'proxyC'
#> The following object is masked from 'package:stats':
#> 
#>     dist
#> 
#> Attaching package: 'seededlda'
#> The following object is masked from 'package:stats':
#> 
#>     terms
library(ldatuning)


 
## Download the data from

## https://e.pcloud.link/publink/show?code=XZ2oeTZTA9RAUJnpaFbknbl6IL9KSuUq14k


jobs <- readRDS("jobs_in_english.RDS") ## read the data


corp <- corpus(jobs, docid_field = "id",
  text_field = "description") ## create a corpus

toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)


## generate the frequency matrix
## if you want, you can download it directly from
##  https://e.pcloud.link/publink/show?code=XZkynTZfDN89V7AJNRGzvVwyua2qFkQop3V

dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")

print(dfmt)
#> Document-feature matrix of: 19,989 documents, 91,979 features (99.90% sparse) and 1 docvar.
#>            features
#> docs        panel paint technician colchester essex 7.00am-4.30pm basic p.a
#>   872828466     5     5          4          2     1             1     2   1
#>   857077872     0     0          0          0     0             0     0   0
#>   801801567     0     0          0          0     0             0     0   0
#>   855162927     0     0          0          0     0             0     0   0
#>   767099713     0     0          0          0     0             0     0   0
#>   770142853     0     0          0          0     0             0     0   0
#>            features
#> docs        depending held
#>   872828466         1    1
#>   857077872         0    0
#>   801801567         0    0
#>   855162927         0    0
#>   767099713         0    0
#>   770142853         0    0
#> [ reached max_ndoc ... 19,983 more documents, reached max_nfeat ... 91,969 more features ]


## try to determine the optimal number of topics.

result <- FindTopicsNumber(
  dfmt,
  topics = seq(from = 2, to = 10, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)
#> fit models...
#> Error in checkForRemoteErrors(val): 2 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry

### Here the code fails...and I do not understand why

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] ldatuning_1.0.2 seededlda_1.2.1 proxyC_0.4.1    quanteda_4.0.2 
#>  [5] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
#>  [9] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
#> [13] ggplot2_3.5.1   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.10.3      utf8_1.2.4         generics_0.1.3     xml2_1.3.6        
#>  [5] slam_0.1-50        stringi_1.8.4      lattice_0.22-6     hms_1.1.3         
#>  [9] digest_0.6.35      magrittr_2.0.3     evaluate_0.23      grid_4.4.1        
#> [13] timechange_0.3.0   fastmap_1.1.1      R.oo_1.26.0        R.cache_0.16.0    
#> [17] Matrix_1.7-0       tm_0.7-13          R.utils_2.12.3     topicmodels_0.2-16
#> [21] stopwords_2.3      fansi_1.0.6        scales_1.3.0       modeltools_0.2-23 
#> [25] cli_3.6.2          rlang_1.1.3        R.methodsS3_1.8.2  munsell_0.5.1     
#> [29] reprex_2.1.0       withr_3.0.0        yaml_2.3.8         parallel_4.4.1    
#> [33] NLP_0.2-1          tools_4.4.1        tzdb_0.4.0         colorspace_2.1-0  
#> [37] fastmatch_1.1-4    vctrs_0.6.5        R6_2.5.1           stats4_4.4.1      
#> [41] lifecycle_1.0.4    fs_1.6.4           pkgconfig_2.0.3    pillar_1.9.0      
#> [45] gtable_0.3.5       glue_1.7.0         Rcpp_1.0.12        xfun_0.43         
#> [49] tidyselect_1.2.1   knitr_1.46         htmltools_0.5.8.1  rmarkdown_2.26    
#> [53] compiler_4.4.1

Created on 2024-06-18 with reprex v2.1.0
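
Going by the error message, one thing worth checking is whether some documents lose all of their features during `dfm_remove()` and `dfm_trim()`, which would leave all-zero rows in the matrix. Below is a minimal sketch of that check on a toy corpus. The texts and the names `toy`, `toy_dfm`, `n_left`, `empty_docs` and `toy_dfm_nonempty` are made up for illustration; `ntoken()` and `dfm_subset()` are quanteda functions.

```r
library(quanteda)

# Toy stand-in for the job descriptions (hypothetical data, not the real corpus).
# Document d2 consists only of English stopwords.
toy <- c(d1 = "paint technician wanted",
         d2 = "the and of",
         d3 = "panel paint job")

# Same kind of pipeline as in the reprex, reduced to stopword removal
toy_dfm <- dfm(tokens(toy)) |>
    dfm_remove(stopwords("en"))

# Count the features left in each document; zero means an all-zero row
n_left <- ntoken(toy_dfm)
empty_docs <- names(n_left)[n_left == 0]
print(empty_docs)

# Keep only the documents that still contain at least one feature
toy_dfm_nonempty <- dfm_subset(toy_dfm, ntoken(toy_dfm) > 0)
print(ndoc(toy_dfm_nonempty))
```

On the real data the same two calls would be applied to `dfmt` before passing it to `FindTopicsNumber()`. Whether empty documents are the actual cause here is an assumption I have not verified.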

R + quanteda + automatic detection of topics: error when running model