TL;DR: How can I write an aggregation in Python that guarantees it is impossible to identify an individual, retains as much data as possible, and avoids groups that are too large?
Example
Imagine a dataset of Foobar observations as shown below:
Observation | Species | Color | Food | PK |
---|---|---|---|---|
1 | bar | red | meat | bar;red;meat |
2 | bar | blue | meat | bar;blue;meat |
3 | foo | blue | meat | foo;blue;meat |
4 | foo | blue | meat | foo;blue;meat |
5 | qux | red | egg | qux;red;egg |
6 | foo | blue | egg | foo;blue;egg |
7 | bar | yellow | egg | bar;yellow;egg |
8 | qux | red | egg | qux;red;egg |
9 | baz | red | egg | baz;red;egg |
10 | bar | red | meat | bar;red;meat |
Due to legal restrictions, no Foobar that participates in the experiment can be identified. It is possible to check this by counting the occurrences of each PK.
PK | Count of Observation |
---|---|
bar;blue;meat | 1 |
foo;blue;meat | 2 |
qux;red;egg | 2 |
foo;blue;egg | 1 |
bar;yellow;egg | 1 |
baz;red;egg | 1 |
bar;red;meat | 2 |
Total | 10 |
As we can see, there are 4 Foobars that can be identified by their characteristics.
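For reference, the singleton check above can be sketched with pandas (column names taken from the example table):

```python
# Rebuild the example dataset and count how many observations share each PK.
import pandas as pd

df = pd.DataFrame({
    "Species": ["bar", "bar", "foo", "foo", "qux", "foo", "bar", "qux", "baz", "bar"],
    "Color":   ["red", "blue", "blue", "blue", "red", "blue", "yellow", "red", "red", "red"],
    "Food":    ["meat", "meat", "meat", "meat", "egg", "egg", "egg", "egg", "egg", "meat"],
})

df["PK"] = df["Species"] + ";" + df["Color"] + ";" + df["Food"]
counts = df["PK"].value_counts()
singletons = counts[counts == 1]  # PKs that identify a single Foobar
print(singletons)                 # the 4 identifiable observations
```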
Manually, this can be done by merging rare categories into unions, so the occurrences become:
PK | Count of Observation |
---|---|
bar∪baz;red∪yellow;egg | 2 |
qux;red;egg | 2 |
foo;blue;egg∪meat | 3 |
bar;blue∪red;meat | 3 |
Total | 10 |
But the dataset is too large to do this by hand.
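For what it's worth, one way to automate the kind of merging shown above is a greedy heuristic: repeatedly take a singleton PK and fold it into the group that shares the most attribute values. This is a hypothetical sketch, not guaranteed to reproduce the manual result:

```python
# Hypothetical greedy sketch: represent each attribute as a set of values,
# then merge every singleton PK into the group sharing the most attributes.
from collections import Counter

rows = [
    ("bar", "red", "meat"), ("bar", "blue", "meat"), ("foo", "blue", "meat"),
    ("foo", "blue", "meat"), ("qux", "red", "egg"), ("foo", "blue", "egg"),
    ("bar", "yellow", "egg"), ("qux", "red", "egg"), ("baz", "red", "egg"),
    ("bar", "red", "meat"),
]

counts = Counter(rows)
# each group key is a tuple of frozensets, one value set per attribute
groups = {tuple(frozenset([v]) for v in pk): n for pk, n in counts.items()}

def shared_attributes(a, b):
    # number of attributes whose value sets overlap
    return sum(1 for x, y in zip(a, b) if x & y)

while any(n == 1 for n in groups.values()):
    pk = next(k for k, n in groups.items() if n == 1)
    n = groups.pop(pk)
    # assumes at least one other group remains to merge into
    target = max(groups, key=lambda g: shared_attributes(pk, g))
    merged = tuple(a | b for a, b in zip(pk, target))
    groups[merged] = groups.pop(target) + n

for pk, n in groups.items():
    print(["∪".join(sorted(s)) for s in pk], n)
```

After the loop, every remaining group covers at least 2 observations, so no single Foobar is identifiable, while the total observation count is preserved.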
My initial solution was to iterate over all variables and count them, then iterate over all combinations of variables, but my dataset has more than 50 variables. That way the code would perform (2^50 - 1) = 1,125,899,906,842,623 iterations over a dataset that has more than 170,000 observations.
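The combinatorial blow-up can be checked directly with itertools (a toy version using the 3 example variables):

```python
# Enumerating every non-empty subset of variables yields 2**n - 1 subsets.
from itertools import combinations

variables = ["Species", "Color", "Food"]
subsets = [c for r in range(1, len(variables) + 1)
           for c in combinations(variables, r)]
print(len(subsets))  # 7 subsets for 3 variables
print(2 ** 50 - 1)   # 1125899906842623 subsets for 50 variables
```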
NOTE: The result does not need to be exactly the same as shown above; the most important thing is that no PK has a single observation.