I’m a communication scientist and a total newbie with TraMineR and sequence analysis. I have a (relatively large) dataset that records the app usage of study participants. My aim is to identify sequences of app categories used in succession.
The original dataset looks like this:
Participant ID | Session ID | Category of used Apps | Start Time (in unix-time) | End Time (in unix-time) |
---|---|---|---|---|
0001 | 0001_1 | Communication | 1614868224 | 1614868236 |
0001 | 0001_1 | Social Media | 1614868236 | 1614868265 |
0002 | 0002_1 | Games | 1614868265 | 1614868320 |
… | … | … | … | … |
Accordingly, I have two levels of analysis: (1) the participants and (2) the sessions.
In the first step, my aim is to identify sequences of app categories used in succession within a session. A session is a coherent usage sequence between switching the smartphone screen on and off. The dataset comprises just under 400 participants, each with around 2,000 to 5,000 sessions (about 1.4 million sessions in the whole dataset).
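To make the structure concrete, here is a toy version of the data containing only the three example rows from the table. The column names `session`, `begin`, `end` and `app_category` are the ones used in my `seqdef()` call; `participant` and the object name `toy` are just names I made up for this illustration.

```r
# Toy data mirroring the three example rows from the table.
# "participant" is an assumed column name; the others follow the seqdef() call.
toy <- data.frame(
  participant  = c("0001", "0001", "0002"),
  session      = c("0001_1", "0001_1", "0002_1"),
  app_category = c("Communication", "Social Media", "Games"),
  begin        = c(1614868224, 1614868236, 1614868265),  # unix time (seconds)
  end          = c(1614868236, 1614868265, 1614868320),  # unix time (seconds)
  stringsAsFactors = FALSE
)

# Duration of each app spell in seconds
toy$duration <- toy$end - toy$begin

# Number of app spells per session
spells_per_session <- table(toy$session)
```

Note that `begin` and `end` are absolute unix timestamps in seconds, not positions counted from the start of a session.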
This is the code I am using:

    library(TraMineR)

    # State labels taken from the observed app categories
    labels = seqstatl(sample$app_category)
    states = 1:length(labels)

    # Build the state sequence object from the spell data
    session_seq = seqdef(data = sample,
                         var = c("session", "begin", "end", "app_category"),
                         informat = "SPELL",
                         states = states,
                         labels = labels,
                         process = FALSE)
    print(session_seq[1:15, ], format = "SPS")

    # Substitution costs based on the transition rates between states observed in the sequence data
    cost = seqsubm(session_seq, method = "TRATE", with.missing = TRUE)

    # Compute the distances using this matrix and the default indel cost of 1
    session_seq_OM = seqdist(session_seq, method = "OM", sm = cost, with.missing = TRUE)
    # --> Function crashed due to lack of RAM
I have already made first attempts with subsamples, and my question concerns the computing resources required. Even for a subsample I need a relatively large amount of computing power. Is it possible to make the calculation more resource-efficient? Is it an option to split the dataset and merge the sequence distances afterwards (batching), or would this distort my results?
I have already created the sequence object in STS format (530 objects and 1222844 variables) for a subset of the data (the mobile sessions of a single participant, n ≈ 4000; the structure is as described above) and then tried to calculate the sequence distances with method “OM”. However, the calculation aborted due to a lack of RAM, even on a machine with 1 TB of RAM.
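For reference, this is my own rough arithmetic on where the memory might be going. I do not know how `seqdist()` allocates memory internally, so both formulas are assumptions: (a) a dense lower-triangle distance matrix of 8-byte doubles, and (b) a full dynamic-programming table of 8-byte doubles for comparing two sequences of a given length. The function names are made up for this sketch.

```r
# Back-of-the-envelope memory estimates (assumptions: 8-byte doubles, GiB = 2^30 bytes)
bytes_per_double <- 8
gib <- 2^30

# (a) Dense lower-triangle distance matrix for n sequences: n * (n - 1) / 2 entries
dist_matrix_gib <- function(n) {
  n * (n - 1) / 2 * bytes_per_double / gib
}

# (b) Full dynamic-programming table for one pair of sequences of length len,
#     assuming a (len + 1) x (len + 1) table of doubles is held in memory
dp_table_gib <- function(len) {
  (len + 1)^2 * bytes_per_double / gib
}

dist_matrix_gib(4000)     # one participant, ~4000 sessions
dist_matrix_gib(1.4e6)    # all ~1.4 million sessions
dp_table_gib(1222844)     # sequences with 1222844 positions
```

Under these assumptions, the distance matrix for about 4000 sessions is small (well below 1 GiB), the matrix for all 1.4 million sessions would need several thousand GiB, and a single comparison of two sequences with 1222844 positions would already need more than 10,000 GiB. If that is roughly right, the sequence length rather than the number of sessions would be the problem in my subsample.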
I would also be grateful for suggestions for further reading. The TraMineR User Guide has already helped me a lot.