Say I have the Airbnb dataset with a bunch of columns. The ones of interest are ‘neighbourhood_cleansed’, ‘host_is_superhost’ and ‘price’. I want to find the neighbourhood in which the difference between the median prices of superhosts and non-superhosts is largest.
I want to know if this can be done entirely using pandas functions.
My logic is to group by ‘neighbourhood_cleansed’ first, then split each group into superhosts and non-superhosts, and then take the median price of each.
I have defined a function func:
def func(host_is_superhost, price):
    superhost_prices = price[host_is_superhost == 't']
    notsuperhost_prices = price[host_is_superhost == 'f']
    return superhost_prices.median() - notsuperhost_prices.median()
import pandas as pd

listings = pd.read_csv("https://storage.googleapis.com/public-data-337819/listings%202%20reduced.csv", low_memory=False)
neighbourhoods = listings.groupby('neighbourhood_cleansed')[['host_is_superhost', 'price']]
When I run the following:
neighbourhoods.apply(func)
The error thrown is:
TypeError: func() missing 1 required positional argument: 'price'
How do I solve this?
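For context on the error: groupby(...).apply calls the function with a single argument, the sub-DataFrame for each group, so a two-parameter func never receives 'price'. A minimal sketch of one possible fix follows. The toy frame and the name median_gap are made up for illustration, and the prices are assumed to be numeric already.

```python
import pandas as pd

# Toy stand-in for the listings data (values are invented for illustration).
listings = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "host_is_superhost": ["t", "t", "f", "f", "t", "t", "f", "f"],
    "price": [100.0, 120.0, 80.0, 90.0, 200.0, 220.0, 100.0, 110.0],
})

def median_gap(group):
    """Median superhost price minus median non-superhost price for one group."""
    # `group` is the sub-DataFrame that apply passes in for one neighbourhood.
    superhost_prices = group.loc[group["host_is_superhost"] == "t", "price"]
    notsuperhost_prices = group.loc[group["host_is_superhost"] == "f", "price"]
    return superhost_prices.median() - notsuperhost_prices.median()

# One scalar per neighbourhood, so the result is a Series indexed by neighbourhood.
gaps = (
    listings.groupby("neighbourhood_cleansed")[["host_is_superhost", "price"]]
    .apply(median_gap)
)

# Label of the neighbourhood with the largest gap.
best = gaps.idxmax()
print(gaps)
print(best)
```

The only real change from func is the signature: it takes the whole group and pulls the two columns out itself.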
Do y’all have better ways of solving the initial question?
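One candidate that uses only built-in pandas methods is to group by both columns at once, take the median, and unstack the superhost flag into columns. This is a sketch on made-up data, not a tested answer for the real file. It assumes 'price' may be stored as text such as "$1,200.00" and converts it first; if the column is already numeric, that step can be dropped.

```python
import pandas as pd

# Toy stand-in for the listings data (values are invented for illustration).
listings = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "host_is_superhost": ["t", "t", "f", "f", "t", "t", "f", "f"],
    "price": ["$100.00", "$120.00", "$80.00", "$90.00",
              "$200.00", "$220.00", "$100.00", "$1,110.00"],
})

# Assumption: prices are strings with "$" and thousands separators.
listings["price"] = (
    listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Median price per (neighbourhood, superhost flag), then one column per flag.
medians = (
    listings.groupby(["neighbourhood_cleansed", "host_is_superhost"])["price"]
    .median()
    .unstack("host_is_superhost")
)

# Superhost median minus non-superhost median, per neighbourhood.
gap = medians["t"] - medians["f"]

# Neighbourhood with the largest gap.
top_neighbourhood = gap.idxmax()
print(medians)
print(top_neighbourhood)
```

A neighbourhood that has only superhosts or only non-superhosts ends up with NaN in gap, and idxmax skips NaN values by default. If the size of the difference matters regardless of sign, gap.abs().idxmax() would be the variant to use.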