I have a pair of dicts, each indexed by an organization ID, representing sets of users I want to match on multiple criteria:
{
    425525: [
        {"user_id": 52323, "first_name": "firstname1", "last_name": "lastname1"},
        {"user_id": 968675, "first_name": "firstname2", "last_name": "lastname2"},
        {"user_id": 216266, "first_name": "firstname3", "last_name": "lastname3"},
    ],
    542452: [
        {"user_id": 98754, "first_name": "firstname4", "last_name": "lastname4"},
        {"user_id": 23425, "first_name": "firstname5", "last_name": "lastname5"},
        {"user_id": 27364, "first_name": "firstname6", "last_name": "lastname6"},
    ],
}
{
    425525: [
        {"user_id": 60974, "first_name": "firstname7", "last_name": "lastname7"},
        {"user_id": 968675, "first_name": "firstname2", "last_name": "lastname2"},
        {"user_id": 43645, "first_name": "firstname8", "last_name": "lastname8"},
    ],
    542452: [
        {"user_id": 98754, "first_name": "firstname4", "last_name": "lastname4"},
        {"user_id": 23425, "first_name": "firstname9", "last_name": "lastname9"},
        {"user_id": 76554, "first_name": "firstname10", "last_name": "lastname10"},
    ],
}
I want to count how many users within each organization match certain criteria. Specifically, for each organization I'd like counts of how many users match on both ID and exact name, how many match on ID only, and how many have no match at all. For instance, from the two dicts above I want to generate:
{
    425525: {"exact_matches": 1, "partial_matches": 0, "no_matches": 2},
    542452: {"exact_matches": 1, "partial_matches": 1, "no_matches": 1},
}
I can of course do this naively by iterating over the lists, comparing values with conditionals, and counting the matches. However, the actual data contains over 1000 orgs with thousands of users each, for a total of several million users. In my testing it can take several seconds to do all of the comparisons within a single org, so the entire dataset might take hours to process. Does Python offer an optimized path for rapidly comparing these datasets?
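For reference, here is a minimal sketch of the naive approach described above (the function name and exact structure are illustrative, not my production code). The inner linear scan over the second org's user list is where the quadratic cost comes from:

```python
def compare_orgs(orgs_a, orgs_b):
    """For each org, classify users from orgs_a against orgs_b:
    exact (ID + both names match), partial (ID only), or no match."""
    results = {}
    for org_id, users_a in orgs_a.items():
        users_b = orgs_b.get(org_id, [])
        counts = {"exact_matches": 0, "partial_matches": 0, "no_matches": 0}
        for user in users_a:
            # Linear scan per user -- this O(n*m) lookup is the hot spot.
            match = next(
                (u for u in users_b if u["user_id"] == user["user_id"]), None
            )
            if match is None:
                counts["no_matches"] += 1
            elif (match["first_name"] == user["first_name"]
                  and match["last_name"] == user["last_name"]):
                counts["exact_matches"] += 1
            else:
                counts["partial_matches"] += 1
        results[org_id] = counts
    return results
```

On the sample data above this produces the expected output, but the per-user scan is what I suspect needs replacing with something faster.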