I have a dataset with 20 million rows. It contains data on companies and users by month.
I have created first_app_company, which flags the first appearance of a company in the dataset. The code is as follows:
df$first_app_company <- as.numeric(!duplicated(df$Company_id))
Company_id Customer Month-Year first_app_company
11 X 201501 1
12 Y 201501 1
13 Z 201501 1
13 Q 201501 0
13 R 201501 0
14 E 201501 1
14 W 201501 0
15 X 201501 1
15 Z 201501 0
15 H 201501 0
15 K 201501 0
16 Q 201501 1
However, I have now realised that when a company first enters my dataset in month M, I would like to flag all rows for that company in month M as 1, not just the first one. So my desired output would look like this:
(Please note that 201501 is the first month in my dataset, so all entries are flagged as 1 here, but it shows the logic.)
Company_id Customer Month-Year first_app_company
11 X 201501 1
12 Y 201501 1
13 Z 201501 1
13 Q 201501 1
13 R 201501 1
14 E 201501 1
14 W 201501 1
15 X 201501 1
15 Z 201501 1
15 H 201501 1
15 K 201501 1
16 Q 201501 1
I am currently trying to figure it out with lead and lag functions, but it is getting a bit confusing. I was hoping someone could point me in the right direction.
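
To make the intended logic concrete, here is a minimal base R sketch of one possible direction. It assumes the month column is stored as a numeric YYYYMM value (so that the smallest value is the earliest month) and that it has been given the syntactic name Month_Year, which is my own renaming of the Month-Year header above. The toy data adds a few hypothetical 201502 rows that are not in my real example, only to show a row that should stay 0.

```r
# Toy data: the first rows mirror the example above; the 201502 rows are
# hypothetical and only there to show the flag being 0 for a later month.
# Month_Year is an assumed column name (the table header is Month-Year).
df <- data.frame(
  Company_id = c(11, 12, 13, 13, 13, 14, 14, 13, 17, 17),
  Customer   = c("X", "Y", "Z", "Q", "R", "E", "W", "Z", "A", "B"),
  Month_Year = c(201501, 201501, 201501, 201501, 201501,
                 201501, 201501, 201502, 201502, 201502)
)

# For every row, look up the earliest month in which its company appears.
first_month <- ave(df$Month_Year, df$Company_id, FUN = min)

# Flag every row that falls in its company's first month.
df$first_app_company <- as.numeric(df$Month_Year == first_month)

print(df)
```

In this sketch, company 13's 201502 row gets 0, while both rows of company 17 get 1 because 201502 is the first month it appears in. No lead or lag is needed, since the comparison is against a per-company minimum rather than a neighbouring row. With 20 million rows, a grouped operation in data.table, something like `dt[, first_app_company := as.numeric(Month_Year == min(Month_Year)), by = Company_id]`, may well be faster, but I have not benchmarked it.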