I am using DataFrame.apply() to calculate a new “Metric” column by taking an existing integer categorical column and looking up the integer in a list, i.e., indexing into the list. It works fine when I use a list comprehension, but when I use DataFrame.apply(), I get an error saying that the value being used as an index is not an integer.
When I configure Spyder to enter the debugger after an error, it stops on the apply statement. I queried the intermediate results and confirmed that what should be an integer becomes a float.
Here is the test code:
import numpy as np
import pandas as pd
import seaborn as sns # Generates 3-tuples in original application
nObservations = 4 # Number of rows
nIntCategory = 4 # Number of categories in the categorical integer column
dfObservations = pd.DataFrame()
# Categorical data column
dfObservations['IntCategory'] = np.random.randint(nIntCategory,size=nObservations)
# IntCategory
# 1
# 0
# 3
# 2
# Table to look up "Metric" value based on "IntCategory"
Cat2metric = np.random.random(nIntCategory)
# array([0.02190873, 0.9570024 , 0.34785749, 0.13852149])
# Create the "Metric" column via table lookup in Cat2metric
# Whether to use apply instead of list comprehension
ifUseApplyNotComprehension = True
if ifUseApplyNotComprehension: # Creates error in the comment below
    dfObservations['Metric'] = dfObservations.apply(
        lambda row: Cat2metric[ row.IntCategory ] , axis=1 )
    # Index error [1]: list indices must be integers or slices,
    # not numpy.float64
else: # No errors
    dfObservations['Metric'] = [ Cat2metric[i] for i in dfObservations.IntCategory ]
The data is randomly generated, so it varies from trial to trial, but here is an example of what happens when I query the index row.IntCategory in the above apply loop. It returns a float!
row.IntCategory # Returns 3.0
The apply loop can be made to work by coercing the index into an integer, i.e., int(row.IntCategory).
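A minimal sketch of that workaround, under assumptions that are not part of the test code above: the frame `df`, the float column `Other`, and the lookup values in `cat2metric` are hypothetical, chosen so that the row passed to the lambda really does hold floats.

```python
import pandas as pd

# Hypothetical data: a float column sits next to the integer column, so each
# row Series handed to apply(axis=1) ends up with a single float64 dtype.
df = pd.DataFrame({"IntCategory": [1, 0, 3, 2],
                   "Other": [0.5, 1.5, 2.5, 3.5]})

# Plain Python list used as the lookup table (hypothetical values).
cat2metric = [10.0, 20.0, 30.0, 40.0]

# int() turns the upcast value (e.g. 3.0) back into a valid list index.
df["Metric"] = df.apply(lambda row: cat2metric[int(row.IntCategory)], axis=1)
```

The coercion only hides the symptom. It does not explain why the value stopped being an integer.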
In this situation, why does the numerical data type change in an apply loop? Tracking it down can be a significant detour.
The explanations I found here, here, and here all point to upcasting if the columns contain a mix of numerical types, but that’s not the situation here.
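For reference, the upcasting those explanations describe can be reproduced with a sketch like the following. The frames `ints_only` and `mixed` and the column `Other` are hypothetical and are not taken from the test code; the sketch only records which type the lambda sees for row.IntCategory in each case.

```python
import pandas as pd

# Hypothetical frames, not the ones from the test code.
ints_only = pd.DataFrame({"IntCategory": [1, 0, 3, 2]})
mixed = ints_only.assign(Other=[0.5, 1.5, 2.5, 3.5])

# Record the type name of row.IntCategory as seen inside apply(axis=1).
seen_ints_only = ints_only.apply(
    lambda row: type(row.IntCategory).__name__, axis=1)
seen_mixed = mixed.apply(
    lambda row: type(row.IntCategory).__name__, axis=1)

print(seen_ints_only.unique())  # an integer type when all columns are integers
print(seen_mixed.unique())      # a float type once a float column is present
```

With axis=1, each row is passed as a Series, and a Series holds one dtype, so integer and float columns are combined into a common float type.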
P.S. This is not a question of performance. I’ve read about avoiding apply if possible. I am seeking the most readable solution, as explained here. Also, the code above is just for troubleshooting. It doesn’t represent the actual code that led to this behaviour.