There is a dataframe with the columns district, crime_type, date, and month:
df = spark.createDataFrame(
    [('D1', 'ROBBERY', '2024-02-01', 2),
     ('D1', 'ROBBERY', '2024-02-01', 2),
     ('D1', 'DRUGS', '2024-03-05', 3),
     ('D1', 'FRAUD', '2024-03-05', 3),
     ('D1', 'AUTO THEFT', '2024-01-09', 1),
     ('D1', 'AUTO THEFT', '2024-01-03', 1),
     ('D2', 'MURDER', '2024-05-04', 5),
     ('D2', 'MURDER', '2024-06-01', 6),
     ('D2', 'RAPE', '2024-07-02', 7)],
    ['district', 'crime_type', 'date', 'month'])
For each district, the goal is to get the 3 most frequent crime_type values (as a comma-separated string) and the median (not the average!) of the per-month crime counts, where crimes are counted by the month column.
The result should be a new dataframe with three columns: district, top_3_crime_types, median_crimes_monthly:
district | top_3_crime_types | median_crimes_monthly |
---|---|---|
D1 | ROBBERY, AUTO THEFT, DRUGS | 2 |
D2 | MURDER, RAPE | 1 |