Consider this column from df:
PersonEntityID
0 324141
1
2 425544
3
4 434456
Can I reindex it so that the index only changes when there is a new value in the column, like this?
PersonEntityID
0 324141
0
1 425544
1
2 434456
You can mask the indices matching NaN positions and ffill to keep the previous existing index:
df.index = df.index.to_series().mask(df['PersonEntityID'].isna()).ffill()
Output:
PersonEntityID
0 324141.0
0 NaN
2 425544.0
2 NaN
4 434456.0
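As a self-contained sketch of the mask/ffill approach above, the snippet below rebuilds the sample frame (assuming the empty rows are NaN) and applies the same steps one at a time. The final cast to int is an added assumption, not part of the one-liner: masking introduces NaN, so the labels are floats until they are cast back.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question (NaN marks the empty rows).
df = pd.DataFrame({"PersonEntityID": [324141, np.nan, 425544, np.nan, 434456]})

# Blank out the index labels on NaN rows, then carry the last kept label forward.
masked = df.index.to_series().mask(df["PersonEntityID"].isna())
new_index = masked.ffill()

# Optional extra step (an assumption, not in the original one-liner):
# cast the float labels back to integers.
df.index = new_index.astype(int)
print(df)
```

If the very first row were NaN there would be no earlier label to fill from, so the cast would fail; in that case the float labels would have to be kept or the gap filled some other way.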
If you want a new value per unique ID:
df.index = pd.factorize(df['PersonEntityID'].ffill())[0]
If you want a unique value only when a new ID is found, you can combine np.maximum.accumulate and factorize:
df.index = np.maximum.accumulate(pd.factorize(df['PersonEntityID'])[0])
Output:
PersonEntityID
0 324141.0
0 NaN
1 425544.0
1 NaN
2 434456.0
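To see why this combination works, it may help to look at the intermediate values. The sketch below uses the same sample data as above (assumed, with NaN for the empty rows): factorize assigns the sentinel code -1 to NaN rows, and the running maximum then replaces each -1 with the highest code seen so far.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"PersonEntityID": [324141, np.nan, 425544, np.nan, 434456]})

# factorize returns (codes, uniques); NaN rows get the sentinel code -1.
codes = pd.factorize(df["PersonEntityID"])[0]
print(codes)

# The running maximum overwrites each -1 with the largest code seen so far.
running_max = np.maximum.accumulate(codes)
print(running_max)
```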
Here is a summary of the differences with a modified input:
   PersonEntityID  consecutive  per_unique  per_new_unique
0        324141.0            0           0               0
1             NaN            0           0               0
2        425544.0            2           1               1
3             NaN            2           1               1
4        324141.0            4           0               1
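The comparison above can be reproduced with a short sketch like the following. The column names are taken from the summary table; the int cast on the first column is an added assumption so that the labels print as integers.

```python
import numpy as np
import pandas as pd

# Modified input: the first ID appears again in the last row.
df = pd.DataFrame({"PersonEntityID": [324141, np.nan, 425544, np.nan, 324141]})
ids = df["PersonEntityID"]

summary = df.assign(
    # Keep the original label of the last non-NaN row.
    consecutive=df.index.to_series().mask(ids.isna()).ffill().astype(int),
    # One code per distinct ID, so a repeated ID reuses its earlier code.
    per_unique=pd.factorize(ids.ffill())[0],
    # The code only increases when a previously unseen ID appears.
    per_new_unique=np.maximum.accumulate(pd.factorize(ids)[0]),
)
print(summary)
```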
Use factorize with forward filling missing values:
df.index = pd.factorize(df['PersonEntityID'].ffill())[0]
Or, for a general solution that starts a new group every time the value changes (even if the same ID appeared earlier), use:
s = df['PersonEntityID'].ffill()
df.index = pd.factorize(s.ne(s.shift()).cumsum())[0]
Output with duplicated values:
df['unique'] = pd.factorize(df['PersonEntityID'].ffill())[0]
s = df['PersonEntityID'].ffill()
df['dupl'] = pd.factorize(s.ne(s.shift()).cumsum())[0]
print(df)
   PersonEntityID  unique  dupl
0        324141.0       0     0
1             NaN       0     0
2        425544.0       1     1
3             NaN       1     1
4        324141.0       0     2
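The general version can be unpacked step by step, as in this sketch using the same duplicated input (assumed, with NaN for the empty rows). The intermediate names starts, run_id and labels are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"PersonEntityID": [324141, np.nan, 425544, np.nan, 324141]})

s = df["PersonEntityID"].ffill()

# True wherever the forward-filled value differs from the previous row.
starts = s.ne(s.shift())

# The cumulative sum numbers each run of equal values: 1, 1, 2, 2, 3.
run_id = starts.cumsum()

# factorize rebases the run numbers so that they start at 0.
labels = pd.factorize(run_id)[0]
print(labels)
```

Because each change starts a new run, the repeated ID in the last row gets its own label instead of reusing 0.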