I am very new to working with APIs, so the solution may well be simple.
I have access to the Danish Erhvervsstyrelsen’s API for enterprise data (CVR information from virk.dk).
I know how to connect to the API, but I don’t know how to get from the deeply nested JSON data to an R data frame.
Here’s my code:
# Library
library(httr)
# Connecting (despite its name, `url` holds the response object)
url <- GET("http://distribution.virk.dk/cvr-permanent/virksomhed/_search",
           authenticate("user_name",
                        "password"))
# Checking whether the connection was successful (it is)
url$status_code
[1] 200
From here I have tried a few different things. Again, I know very little about APIs; I just need the data for my analysis.
# Trying the jsonlite package
library(jsonlite)
url_char <- rawToChar(url$content)
json_data <- fromJSON(url_char)
json_data
$took
[1] 14
$timed_out
[1] FALSE
$`_shards`
$`_shards`$total
[1] 6
$`_shards`$successful
[1] 6
$`_shards`$skipped
[1] 0
$`_shards`$failed
[1] 0
$hits
$hits$total
[1] 2115475
$hits$max_score
[1] 1
$hits$hits
# Trying to make it into a data.frame
data_raw <- do.call("rbind",
                    lapply(json_data, as.data.frame))
Error in rbind(deparse.level, ...) :
  numbers of columns of arguments do not match
I think it fails because my data is more deeply nested than this approach can handle.
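To illustrate the nesting problem: `lapply(json_data, as.data.frame)` tries to stack `took`, `timed_out`, `_shards` and `hits` on top of each other, and those pieces have different shapes. A minimal sketch of an alternative, assuming the records of interest sit under `hits$hits`. The JSON string is made up to mimic the outer shape of the output above, and the fields `cvrNummer` and `navn` are placeholders, not taken from the real response:

```r
library(jsonlite)

# Made-up response with the same outer shape as the real one.
# The fields inside Vrvirksomhed are placeholders for illustration only.
mock_json <- '{
  "took": 14,
  "timed_out": false,
  "hits": {
    "total": 2,
    "hits": [
      {"_id": "1", "_source": {"Vrvirksomhed": {"cvrNummer": 111, "navn": "A"}}},
      {"_id": "2", "_source": {"Vrvirksomhed": {"cvrNummer": 222, "navn": "B"}}}
    ]
  }
}'

# flatten = TRUE asks jsonlite to spread nested objects into their own columns
parsed <- fromJSON(mock_json, flatten = TRUE)

# Only the inner hits element holds one record per enterprise
hits_df <- parsed$hits$hits

str(hits_df)
```

Fields that are themselves arrays of objects may still come back as list columns and would need separate handling.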
Another method I’ve tried:
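# Packages for enframe(), pull(), str_split(), map_dbl() and separate()
library(tidyverse)
# Extracting the parsed content (a nested list)
cont_raw <- content(url)
# Unlisting
data_raw <- enframe(unlist(cont_raw))
# How many cols?
rgx_split <- "\\."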
n_cols_max <- data_raw %>%
  pull(name) %>%
  str_split(rgx_split) %>%
  map_dbl(~length(.)) %>%
  max()
n_cols_max
[1] 11
# Separating
nms_sep <- paste0("name", 1:n_cols_max)
data_sep <- data_raw %>%
  separate(name,
           into = nms_sep,
           sep = rgx_split,
           fill = "right")
This is when I noticed that I only get 10 enterprises (“CVR numbers”). I am expecting thousands, maybe even more than 100,000.
I noticed in my cont_raw that the list hits contains another list hits, and this inner list has 10 lists, each one a list of 5. Inside each of the 10 elements of cont_raw$hits$hits there is a list called _source. This is a list of 1, and it holds the list Vrvirksomhed (virksomhed means enterprise in Danish). The Vrvirksomhed list is a list of 44. Some of these elements are lists too, but not all.
It seems that cont_raw consists of the data for those 10 enterprises (the 10 lists, one for each enterprise).
This is where I am very confused since I have access to all of the data.
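The `took`, `_shards` and `hits` fields in the output look like an Elasticsearch search response. If that is what this endpoint is (an assumption, not confirmed anywhere above), a search without an explicit `size` returns only the first 10 hits by default, which would match the 10 entries seen here. A minimal sketch of asking for more per request; `build_search_body()` and `fetch_page()` are hypothetical helper names, and whether this API accepts these parameters is untested:

```r
library(httr)
library(jsonlite)

# Hypothetical helper: builds a JSON search body with explicit paging.
# Assumes the endpoint accepts Elasticsearch-style `from` and `size`.
build_search_body <- function(size = 100, from = 0) {
  as.character(toJSON(list(from = from, size = size), auto_unbox = TRUE))
}

# Hypothetical helper: sends one paged search request and returns the
# parsed response. Defined here but not called, since it needs credentials.
fetch_page <- function(user, password, size = 100, from = 0) {
  resp <- POST(
    "http://distribution.virk.dk/cvr-permanent/virksomhed/_search",
    authenticate(user, password),
    body = build_search_body(size, from),
    content_type_json()
  )
  stop_for_status(resp)  # raise an R error on a non-2xx status
  content(resp, as = "parsed", type = "application/json")
}

# Inspect the body that would be sent for the second page of 100 records
build_search_body(size = 100, from = 100)
```

Usage would be something like `page <- fetch_page("user_name", "password", size = 100)`. Elasticsearch limits `from + size` to 10,000 by default, so pulling all 2 million records would probably need its scroll mechanism rather than plain paging.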
I have also tried rrapply with melt, and it gives me the same as cont_raw, just not as long.
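The layout described above (hits, then hits, then _source, then Vrvirksomhed) can also be walked directly instead of unlisting everything. A minimal base R sketch, using a made-up stand-in for `cont_raw`; the fields `cvrNummer` and `navn` are placeholders and the real Vrvirksomhed list has 44 elements:

```r
# Made-up stand-in for cont_raw with the layout described above.
# Field names inside Vrvirksomhed are placeholders for illustration only.
cont_mock <- list(hits = list(hits = list(
  list(`_id` = "1",
       `_source` = list(Vrvirksomhed = list(cvrNummer = 111, navn = "A"))),
  list(`_id` = "2",
       `_source` = list(Vrvirksomhed = list(cvrNummer = 222, navn = "B")))
)))

# Pull the Vrvirksomhed list out of every hit
companies <- lapply(cont_mock$hits$hits,
                    function(hit) hit[["_source"]][["Vrvirksomhed"]])

# Build one row per enterprise from the scalar fields only
rows <- lapply(companies, function(v) {
  data.frame(cvrNummer = v$cvrNummer,
             navn = v$navn,
             stringsAsFactors = FALSE)
})

# All rows now have the same columns, so rbind no longer complains
companies_df <- do.call(rbind, rows)
companies_df
```

Here `do.call(rbind, ...)` works because every row has the same columns. Elements of Vrvirksomhed that are themselves lists would have to be picked apart separately or kept as list columns.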
I am sorry if I am unclear. Please let me know if you need more information to help me out.