I have a log table which contains changes. The sign + means addition and the sign - means deletion.
<code>import pandas as pd
history = pd.DataFrame({
    "First": ["X", "Y", "Y", "X", "X", "Y", "Z"],
    "Last": ["X", "Y", "Y", "X", "X", "Y", "A"],
    "Change": ["+", "-", "+", "-", "+", "+", "-"],
    "Date": ["2024-05-01", "2024-06-01", "2024-06-01", "2024-05-03",
             "2024-05-02", "2024-06-02", "2024-06-01"],
})
history = history.sort_values(by=["Date", "Change"])
# sort needed to process the entries chronologically
</code>
This produces
<code> First Last Change Date
0 X X + 2024-05-01
4 X X + 2024-05-02
3 X X - 2024-05-03
2 Y Y + 2024-06-01
1 Y Y - 2024-06-01
6 Z A - 2024-06-01
5 Y Y + 2024-06-02
</code>
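One detail about the sort worth keeping in mind: Change is compared lexicographically, and "+" sorts before "-", so a same-day addition is processed before a same-day deletion (see the Y/Y rows on 2024-06-01 above). A minimal check:

```python
import pandas as pd

# on equal dates, lexicographic order puts "+" (0x2B) before "-" (0x2D)
df = pd.DataFrame({"Date": ["2024-06-01", "2024-06-01"], "Change": ["-", "+"]})
df = df.sort_values(by=["Date", "Change"])
print(df["Change"].tolist())  # ['+', '-']
```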
In the next step, I want to display only what is currently available.
- The last sign must be + for an item to be available.
- Available: +, ++, +-+, -+, --+, etc.
- Not available: -, ++-, +--, etc.
- An item is a combination of the columns First and Last.
- Sorting is done by Date and Change.
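The rule can be restated with a tiny hypothetical helper that takes an item's signs as a chronological string:

```python
def is_available(signs: str) -> bool:
    # an item is available iff its last recorded sign is "+"
    return signs.endswith("+")

print(is_available("+-+"))  # True
print(is_available("++-"))  # False
```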
I built this logic using iteration, which is very slow. Basically:
<code>latest = {}
item_columns = ["First", "Last"]
for _, row in history.iterrows():
    key = tuple(row[column] for column in item_columns)
    if row["Change"] == "+":
        latest[key] = row
    elif row["Change"] == "-" and key in latest:
        del latest[key]
today = pd.DataFrame(latest.keys(), columns=item_columns)
</code>
This produces the available items:
<code> First Last
0 X X
1 Y Y
</code>
The issue is that the loop is slow with big tables, e.g. 20 seconds for the example below:
<code>duplicated = pd.concat([history.iloc[[0]]] * 50000, ignore_index=True)
history = pd.concat([history, duplicated], ignore_index=True)
for _, row in history.iterrows():
    key = tuple(row[column] for column in item_columns)
    if row["Change"] == "+":
        latest[key] = row
    elif row["Change"] == "-" and key in latest:
        del latest[key]
today = pd.DataFrame(latest.keys(), columns=item_columns)
</code>
Any way to speed up?
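For reference, here is a vectorized sketch of the same rule, under the assumption that availability depends only on the final Change per (First, Last) pair in sorted order:

```python
import pandas as pd

history = pd.DataFrame({
    "First": ["X", "Y", "Y", "X", "X", "Y", "Z"],
    "Last": ["X", "Y", "Y", "X", "X", "Y", "A"],
    "Change": ["+", "-", "+", "-", "+", "+", "-"],
    "Date": ["2024-05-01", "2024-06-01", "2024-06-01", "2024-05-03",
             "2024-05-02", "2024-06-02", "2024-06-01"],
})
history = history.sort_values(by=["Date", "Change"])

# the last change per item decides availability: "+" means present
last_change = history.groupby(["First", "Last"], sort=False)["Change"].last()
today = last_change[last_change == "+"].index.to_frame(index=False)
print(today)
```

On the sample above this keeps only the Y/Y pair, since X/X ends with "-" (the ++- case) after sorting. If a Python loop must be kept, `itertuples` is also usually much faster than `iterrows`.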