I’m studying dask using the book “Fast Python: High performance techniques for large datasets” (2023). I have been trying to run the example shown below (p. 241), which uses the taxes file available here (the correct file is the Annual Survey of State Government Tax Collections by Category Table). I corrected the column names (they differ between the file and the book for some reason), but I get this error when I try to run it:
KeyError: ‘Cannot index with non-boolean dask Series. Try passing computed values instead (e.g. ddf.loc[iindexer.compute()])’
I tried using the compute method suggested in the error, but then I get another error: TypeError: ‘Index…’ is an invalid key. Do you know how I can solve this problem?
import dask.dataframe as dd
import numpy as np
fname = "taxes.csv"
taxes = dd.read_csv(fname)
taxes["Amount"] = (
taxes["Amount in Thousands"].str.replace(",", "").replace("X", np.nan).astype(float)
)
taxes["Tax Type"] = taxes["Tax Type"].astype("category").cat.as_known()
pivot = taxes.pivot_table(index="State", columns="Tax Type", values="Amount")
has_property_info = pivot[~pivot["Property Taxes"].isna()].index
pivot_clean = pivot.loc[has_property_info]
frac_property = pivot_clean["Property Taxes"] / pivot_clean["Total Taxes"]
- You don’t need the index to select rows by a condition. Just use the boolean condition itself as the selector:
pivot_clean = pivot[~(pivot["Property Taxes"].isna())]
or, better:
pivot_clean = pivot[pivot["Property Taxes"].notna()]
- Even better, there is dropna:
pivot_clean = pivot.dropna(subset=["Property Taxes"])