Hi guys ‘Big problem here’,
I have a csv file as input file, which represent the structure of a “Car”.
There are 14 columns In my input file, at the end of the process I need to get back all the columns.
Basically there are 3 mains columns to deal with to create the architecture of the “Car” :
“CAR_REFERENCE”, “PARENT”, “CHILD”
The original input file doesn’t have any ID. (Lines can be duplicated)
For example in A CAR, there are lots of same bolts and Screws mounted on the same assembly.
CAR_REFERENCE | PARENT | CHILD |
---|---|---|
CAR_1 | WHEEL_1 | BOLT |
CAR_1 | WHEEL_1 | BOLT |
This example is just here to show you guys that rows of the input file are not unique.
WHEEL_1 can obviously be part of many many different CAR_REFERENCE.
My final GOAL is :
- First, to add a new “Level” column to visually see the hierarchy of the whole structure
(Ex : Top Item = Level 1, first child = Level 2, first child of first child = Level 3 …)
Expected for this point :
(I’ve deliberately reduced the structure so as not to include too much information, but in my input file I have absolutely all the details and levels.)
CAR_REFERENCE | PARENT | CHILD | LEVEL |
---|---|---|---|
CAR_1 | -1 | CHASSIS | 1 |
CAR_1 | CHASSIS | REAR AXLE | __2 |
CAR_1 | REAR AXLE | WHEEL HUB | ___3 |
CAR_1 | WHEEL HUB | WHEEL_1 | ____4 |
CAR_1 | WHEEL HUB | WHEEL_1 | ____4 |
CAR_1 | WHEEL HUB | WHEEL_1 | ____4 |
CAR_1 | WHEEL HUB | WHEEL_1 | ____4 |
CAR_1 | CHASSIS | FRONT AXLE | __2 |
- Second, create another column to SORT the hierarchy by keeping the order (PowerBi don’t know the exact structure of my cars so I need to create a column only for sorting purpose, to keep the structure as it is). New column SORTED_ORDER
CAR_REFERENCE | PARENT | CHILD | LEVEL | SORTED_ORDER |
---|---|---|---|---|
CAR_1 | -1 | CHASSIS | 1 | 1 |
CAR_1 | CHASSIS | REAR AXLE | __2 | 1.1 |
CAR_1 | REAR AXLE | WHEEL HUB | ___3 | 1.1.1 |
CAR_1 | WHEEL HUB | WHEEL_1 | ____4 | 1.1.1.1 |
CAR_1 | WHEEL HUB | WHEEL_1 | ____4 | 1.1.1.2 |
CAR_1 | WHEEL HUB | WHEEL_1 | ____4 | 1.1.1.3 |
CAR_1 | WHEEL HUB | WHEEL_1 | ____4 | 1.1.1.4 |
CAR_1 | CHASSIS | FRONT AXLE | __2 | 1.2 |
**One of the probleme here is if we got : **
1.1 1.2 ... 1.9 1.10
**It will be sorted like : **
1.1 1.10 1.2 1.3 ...
So first I separate each “values” between the “.” to create segments, I check the max value of each segment and the make a condition :
if segment.value < 10
Only one digit = 1.2.1.5
if segment.value < 100
Max value of segment 2 = 98
Only two digits = 1.2.02.5
Else
Max value of segment 3 = 102
Only Three digits = 1.2.02.006
So if a segment[x] value Hit 1.76.1
ALL the previous elements from the 1.1.1 to 1.76.1 will be on two digits like so :
1.01.1
1.02.1
1.03.1
Same thing when a segment hit 100 or more
Doing this I can keep the original structure Order while sorting.
Here is all my code, and What I tried to reach my two main objectives :
First I tried to create Unique ID’s by counting occurences of each child of the whole input file (1.8 millions lines) :
import pandas as pd
df = pd.read_csv('Original.csv', dtype=str)
child_counts = {}
child_unique_ids = {}
for index, row in df.iterrows():
child = row['CHILD']
if child not in child_counts:
child_counts[child] = 1
else:
child_counts[child] += 1
child_unique = f'Child_{child}_{child_counts[child]}'
df.loc[index, 'UNIQUE_CHILD'] = child_unique
if child not in child_unique_ids:
child_unique_ids[child] = child_unique
df['PARENT'] = df['PARENT'].map(lambda x: child_unique_ids[x] if x in child_unique_ids else x)
df.to_csv('Output.csv', index=False)
OUTPUT :
CAR_REFERENCE | PARENT | CHILD | UNIQUE_CHILD |
---|---|---|---|
CAR_1 | -1 | CHASSIS | CHASSIS_1 |
CAR_1 | CHASSIS | REAR AXLE | REAR AXLE_1 |
CAR_1 | REAR AXLE | WHEEL HUB | WHEEL HUB_1 |
CAR_1 | WHEEL HUB | WHEEL_1 | WHEEL_1_1 |
CAR_1 | WHEEL HUB | WHEEL_1 | WHEEL_1_2 |
CAR_1 | WHEEL HUB | WHEEL_1 | WHEEL_1_3 |
CAR_1 | WHEEL HUB | WHEEL_1 | WHEEL_1_4 |
CAR_1 | CHASSIS | FRONT AXLE | FRONT AXLE_1 |
By doing this I’m losing Parent / Unique child hierarchy, because the UNIQUE_CHILD is no longer a PARENT.
So The last part of the code is replacing ‘PARENT’ with the first ‘UNIQUE_CHILD’ occurence :
CAR_REFERENCE | PARENT | CHILD | UNIQUE_CHILD |
---|---|---|---|
CAR_1 | -1 | CHASSIS | CHASSIS_1 |
CAR_1 | CHASSIS_1 | REAR AXLE | REAR AXLE_1 |
CAR_1 | REAR AXLE_1 | WHEEL HUB | WHEEL HUB_1 |
CAR_1 | WHEEL HUB_1 | WHEEL_1 | WHEEL_1_1 |
CAR_1 | WHEEL HUB_1 | WHEEL_1 | WHEEL_1_2 |
CAR_1 | WHEEL HUB_1 | WHEEL_1 | WHEEL_1_3 |
CAR_1 | WHEEL HUB_1 | WHEEL_1 | WHEEL_1_4 |
CAR_1 | CHASSIS_1 | FRONT AXLE | FRONT AXLE_1 |
So the first STEP to create Unique ID is working.
Now, let’s create the ‘LEVEL’ and ‘Sorted_order’ columns :
Here is my code :
import pandas as pd
import networkx as nx
def make_hierarchy(g):
G = nx.from_pandas_edgelist(g, create_using=nx.DiGraph,
source='PARENT', target='UNIQUE_CHILD')
def dfs_with_level(node, level, order, parent_order, max_depth):
hierarchy_data = [(g.name, node, level, parent_order + f'{order:0{max_depth}d}')]
children = list(G.successors(node))
if children:
for i, child in enumerate(children, start=1):
hierarchy_data.extend(dfs_with_level(child, level + 1, i, parent_order + f'{order:0{max_depth}d}' + '.', max_depth))
return hierarchy_data
hierarchy_data = []
for node in G.nodes:
if not list(G.predecessors(node)):
max_depth = len(str(len(G)))
hierarchy_data.extend(dfs_with_level(node, 1, 1, '', max_depth))
df_hierarchy = pd.DataFrame(hierarchy_data, columns=['CAR_REFERENCE', 'UNIQUE_CHILD', 'Level', 'sorted_order'])
# Add underscores to represent hierarchy levels visually (reverse order)
max_level = df_hierarchy['Level'].max()
df_hierarchy['Level'] = df_hierarchy['Level'].apply(lambda x: '_' * (x - 1) + str(x))
return df_hierarchy
# Read data from CSV
df = pd.read_csv('Csvfrompreviouspart.csv', dtype=str)
out = df.groupby('CAR_REFERENCE', group_keys=False).apply(make_hierarchy)
# Export the DataFrame to a CSV file
out.to_csv('Output_step2.csv', index=False)
Here is the problem :
The Input file is : 1 841 724 lines
The output is : 1 981 260 lines
(Why this difference ?)
AND
The sorted_order column is “Bugged” I mostly got two “segments” but the structure can have 12 levels so 12 segments.
Here is a sample of the .csv file after this :
I only have “Two LEVELS” on this product, but I know that there are many more
(There are approx. 2500 lines of structure for each CAR_REFERENCE)
CAR_REFERENCE,UNIQUE_CHILD,Level,sorted_order
CAR_1,Child_G2000_4,_2,001.001
CAR_1,Child_G1000_3,_2,001.002
CAR_1,Child_G5200_3,_2,001.003
CAR_1,Child_G7200_2,_2,001.004
CAR_1,Child_G9200_2,_2,001.005
CAR_1,Child_G0200_3,_2,001.006
CAR_1,Child_G1010_2,_2,001.007
CAR_1,Child_G7201_2,_2,001.008
CAR_1,Child_G6200_3,_2,001.009
So for some products the ‘Level’ column seems correct, also the ‘Sorted_order’ too, but for the most part are incorrect.
If this script was correct the goal was to merge back this dataframe with the previous one to get back all the columns from the original file.
If you guys can take a Look of all of this, I tried to give as much details as I can, I’m stuck since 3 weeks on this, trying every day something new.
Thank’s a lot for your HELP.