Thiết kế website giá rẻ

Question

Hi guys ‘Big problem here’,

I have a csv file as input file, which represent the structure of a “Car”.
There are 14 columns In my input file, at the end of the process I need to get back all the columns.

Basically there are 3 mains columns to deal with to create the architecture of the “Car” :
“CAR_REFERENCE”, “PARENT”, “CHILD”

The original input file doesn’t have any ID. (Lines can be duplicated)

For example in A CAR, there are lots of same bolts and Screws mounted on the same assembly.

CAR_REFERENCE	PARENT	CHILD
CAR_1	WHEEL_1	BOLT
CAR_1	WHEEL_1	BOLT

This example is just here to show you guys that rows of the input file are not unique.

WHEEL_1 can obviously be part of many many different CAR_REFERENCE.

My final GOAL is :

First, to add a new “Level” column to visually see the hierarchy of the whole structure
(Ex : Top Item = Level 1, first child = Level 2, first child of first child = Level 3 …)

Expected for this point :
(I’ve deliberately reduced the structure so as not to include too much information, but in my input file I have absolutely all the details and levels.)

CAR_REFERENCE	PARENT	CHILD	LEVEL
CAR_1	-1	CHASSIS	1
CAR_1	CHASSIS	REAR AXLE	__2
CAR_1	REAR AXLE	WHEEL HUB	___3
CAR_1	WHEEL HUB	WHEEL_1	____4
CAR_1	WHEEL HUB	WHEEL_1	____4
CAR_1	WHEEL HUB	WHEEL_1	____4
CAR_1	WHEEL HUB	WHEEL_1	____4
CAR_1	CHASSIS	FRONT AXLE	__2

Second, create another column to SORT the hierarchy by keeping the order (PowerBi don’t know the exact structure of my cars so I need to create a column only for sorting purpose, to keep the structure as it is). New column SORTED_ORDER

CAR_REFERENCE	PARENT	CHILD	LEVEL	SORTED_ORDER
CAR_1	-1	CHASSIS	1	1
CAR_1	CHASSIS	REAR AXLE	__2	1.1
CAR_1	REAR AXLE	WHEEL HUB	___3	1.1.1
CAR_1	WHEEL HUB	WHEEL_1	____4	1.1.1.1
CAR_1	WHEEL HUB	WHEEL_1	____4	1.1.1.2
CAR_1	WHEEL HUB	WHEEL_1	____4	1.1.1.3
CAR_1	WHEEL HUB	WHEEL_1	____4	1.1.1.4
CAR_1	CHASSIS	FRONT AXLE	__2	1.2

**One of the probleme here is if we got : **

1.1 1.2 ... 1.9 1.10

**It will be sorted like : **

1.1 1.10 1.2 1.3 ...

So first I separate each “values” between the “.” to create segments, I check the max value of each segment and the make a condition :

<code>

if segment.value < 10

Only one digit = 1.2.1.5

if segment.value < 100

Max value of segment 2 = 98

Only two digits = 1.2.02.5

Else

Max value of segment 3 = 102

Only Three digits = 1.2.02.006

</code>

<code> if segment.value < 10 Only one digit = 1.2.1.5 if segment.value < 100 Max value of segment 2 = 98 Only two digits = 1.2.02.5 Else Max value of segment 3 = 102 Only Three digits = 1.2.02.006 </code>


if segment.value < 10

Only one digit = 1.2.1.5


if segment.value < 100

Max value of segment 2 = 98
Only two digits = 1.2.02.5

Else

Max value of segment 3 = 102
Only Three digits = 1.2.02.006

So if a segment[x] value Hit 1.76.1

ALL the previous elements from the 1.1.1 to 1.76.1 will be on two digits like so :
1.01.1
1.02.1
1.03.1

Same thing when a segment hit 100 or more

Doing this I can keep the original structure Order while sorting.

Here is all my code, and What I tried to reach my two main objectives :

First I tried to create Unique ID’s by counting occurences of each child of the whole input file (1.8 millions lines) :

<code>import pandas as pd

df = pd.read_csv('Original.csv', dtype=str)

child_counts = {}

child_unique_ids = {}

for index, row in df.iterrows():

child = row['CHILD']

if child not in child_counts:

child_counts[child] = 1

else:

child_counts[child] += 1

child_unique = f'Child_{child}_{child_counts[child]}'

df.loc[index, 'UNIQUE_CHILD'] = child_unique

if child not in child_unique_ids:

child_unique_ids[child] = child_unique

df['PARENT'] = df['PARENT'].map(lambda x: child_unique_ids[x] if x in child_unique_ids else x)

df.to_csv('Output.csv', index=False)

</code>

<code>import pandas as pd df = pd.read_csv('Original.csv', dtype=str) child_counts = {} child_unique_ids = {} for index, row in df.iterrows(): child = row['CHILD'] if child not in child_counts: child_counts[child] = 1 else: child_counts[child] += 1 child_unique = f'Child_{child}_{child_counts[child]}' df.loc[index, 'UNIQUE_CHILD'] = child_unique if child not in child_unique_ids: child_unique_ids[child] = child_unique df['PARENT'] = df['PARENT'].map(lambda x: child_unique_ids[x] if x in child_unique_ids else x) df.to_csv('Output.csv', index=False) </code>

import pandas as pd

df = pd.read_csv('Original.csv', dtype=str)

child_counts = {}
child_unique_ids = {}

for index, row in df.iterrows():
    child = row['CHILD']

    if child not in child_counts:
        child_counts[child] = 1
    else:
        child_counts[child] += 1

    child_unique = f'Child_{child}_{child_counts[child]}'
    df.loc[index, 'UNIQUE_CHILD'] = child_unique

    if child not in child_unique_ids:
        child_unique_ids[child] = child_unique

df['PARENT'] = df['PARENT'].map(lambda x: child_unique_ids[x] if x in child_unique_ids else x)

df.to_csv('Output.csv', index=False)

OUTPUT :

CAR_REFERENCE	PARENT	CHILD	UNIQUE_CHILD
CAR_1	-1	CHASSIS	CHASSIS_1
CAR_1	CHASSIS	REAR AXLE	REAR AXLE_1
CAR_1	REAR AXLE	WHEEL HUB	WHEEL HUB_1
CAR_1	WHEEL HUB	WHEEL_1	WHEEL_1_1
CAR_1	WHEEL HUB	WHEEL_1	WHEEL_1_2
CAR_1	WHEEL HUB	WHEEL_1	WHEEL_1_3
CAR_1	WHEEL HUB	WHEEL_1	WHEEL_1_4
CAR_1	CHASSIS	FRONT AXLE	FRONT AXLE_1

By doing this I’m losing Parent / Unique child hierarchy, because the UNIQUE_CHILD is no longer a PARENT.

So The last part of the code is replacing ‘PARENT’ with the first ‘UNIQUE_CHILD’ occurence :

CAR_REFERENCE	PARENT	CHILD	UNIQUE_CHILD
CAR_1	-1	CHASSIS	CHASSIS_1
CAR_1	CHASSIS_1	REAR AXLE	REAR AXLE_1
CAR_1	REAR AXLE_1	WHEEL HUB	WHEEL HUB_1
CAR_1	WHEEL HUB_1	WHEEL_1	WHEEL_1_1
CAR_1	WHEEL HUB_1	WHEEL_1	WHEEL_1_2
CAR_1	WHEEL HUB_1	WHEEL_1	WHEEL_1_3
CAR_1	WHEEL HUB_1	WHEEL_1	WHEEL_1_4
CAR_1	CHASSIS_1	FRONT AXLE	FRONT AXLE_1

So the first STEP to create Unique ID is working.

Now, let’s create the ‘LEVEL’ and ‘Sorted_order’ columns :

Here is my code :

<code>import pandas as pd

import networkx as nx

def make_hierarchy(g):

G = nx.from_pandas_edgelist(g, create_using=nx.DiGraph,

source='PARENT', target='UNIQUE_CHILD')

def dfs_with_level(node, level, order, parent_order, max_depth):

hierarchy_data = [(g.name, node, level, parent_order + f'{order:0{max_depth}d}')]

children = list(G.successors(node))

if children:

for i, child in enumerate(children, start=1):

hierarchy_data.extend(dfs_with_level(child, level + 1, i, parent_order + f'{order:0{max_depth}d}' + '.', max_depth))

return hierarchy_data

hierarchy_data = []

for node in G.nodes:

if not list(G.predecessors(node)):

max_depth = len(str(len(G)))

hierarchy_data.extend(dfs_with_level(node, 1, 1, '', max_depth))

df_hierarchy = pd.DataFrame(hierarchy_data, columns=['CAR_REFERENCE', 'UNIQUE_CHILD', 'Level', 'sorted_order'])

# Add underscores to represent hierarchy levels visually (reverse order)

max_level = df_hierarchy['Level'].max()

df_hierarchy['Level'] = df_hierarchy['Level'].apply(lambda x: '_' * (x - 1) + str(x))

return df_hierarchy

# Read data from CSV

df = pd.read_csv('Csvfrompreviouspart.csv', dtype=str)

out = df.groupby('CAR_REFERENCE', group_keys=False).apply(make_hierarchy)

# Export the DataFrame to a CSV file

out.to_csv('Output_step2.csv', index=False)

</code>

<code>import pandas as pd import networkx as nx def make_hierarchy(g): G = nx.from_pandas_edgelist(g, create_using=nx.DiGraph, source='PARENT', target='UNIQUE_CHILD') def dfs_with_level(node, level, order, parent_order, max_depth): hierarchy_data = [(g.name, node, level, parent_order + f'{order:0{max_depth}d}')] children = list(G.successors(node)) if children: for i, child in enumerate(children, start=1): hierarchy_data.extend(dfs_with_level(child, level + 1, i, parent_order + f'{order:0{max_depth}d}' + '.', max_depth)) return hierarchy_data hierarchy_data = [] for node in G.nodes: if not list(G.predecessors(node)): max_depth = len(str(len(G))) hierarchy_data.extend(dfs_with_level(node, 1, 1, '', max_depth)) df_hierarchy = pd.DataFrame(hierarchy_data, columns=['CAR_REFERENCE', 'UNIQUE_CHILD', 'Level', 'sorted_order']) # Add underscores to represent hierarchy levels visually (reverse order) max_level = df_hierarchy['Level'].max() df_hierarchy['Level'] = df_hierarchy['Level'].apply(lambda x: '_' * (x - 1) + str(x)) return df_hierarchy # Read data from CSV df = pd.read_csv('Csvfrompreviouspart.csv', dtype=str) out = df.groupby('CAR_REFERENCE', group_keys=False).apply(make_hierarchy) # Export the DataFrame to a CSV file out.to_csv('Output_step2.csv', index=False) </code>

import pandas as pd
import networkx as nx

def make_hierarchy(g):
    G = nx.from_pandas_edgelist(g, create_using=nx.DiGraph,
                                source='PARENT', target='UNIQUE_CHILD')
    
    def dfs_with_level(node, level, order, parent_order, max_depth):
        hierarchy_data = [(g.name, node, level, parent_order + f'{order:0{max_depth}d}')]
        children = list(G.successors(node))
        if children:
            for i, child in enumerate(children, start=1):
                hierarchy_data.extend(dfs_with_level(child, level + 1, i, parent_order + f'{order:0{max_depth}d}' + '.', max_depth))
        return hierarchy_data
    
    hierarchy_data = []
    for node in G.nodes:
        if not list(G.predecessors(node)):
            max_depth = len(str(len(G)))
            hierarchy_data.extend(dfs_with_level(node, 1, 1, '', max_depth))
    
    df_hierarchy = pd.DataFrame(hierarchy_data, columns=['CAR_REFERENCE', 'UNIQUE_CHILD', 'Level', 'sorted_order'])

    # Add underscores to represent hierarchy levels visually (reverse order)
    max_level = df_hierarchy['Level'].max()
    df_hierarchy['Level'] = df_hierarchy['Level'].apply(lambda x: '_' * (x - 1) + str(x))
    
    return df_hierarchy

# Read data from CSV
df = pd.read_csv('Csvfrompreviouspart.csv', dtype=str)

out = df.groupby('CAR_REFERENCE', group_keys=False).apply(make_hierarchy)

# Export the DataFrame to a CSV file
out.to_csv('Output_step2.csv', index=False)

Here is the problem :

The Input file is : 1 841 724 lines
The output is : 1 981 260 lines

(Why this difference ?)

AND

The sorted_order column is “Bugged” I mostly got two “segments” but the structure can have 12 levels so 12 segments.

Here is a sample of the .csv file after this :
I only have “Two LEVELS” on this product, but I know that there are many more
(There are approx. 2500 lines of structure for each CAR_REFERENCE)

<code>CAR_REFERENCE,UNIQUE_CHILD,Level,sorted_order

CAR_1,Child_G2000_4,_2,001.001

CAR_1,Child_G1000_3,_2,001.002

CAR_1,Child_G5200_3,_2,001.003

CAR_1,Child_G7200_2,_2,001.004

CAR_1,Child_G9200_2,_2,001.005

CAR_1,Child_G0200_3,_2,001.006

CAR_1,Child_G1010_2,_2,001.007

CAR_1,Child_G7201_2,_2,001.008

CAR_1,Child_G6200_3,_2,001.009

</code>

<code>CAR_REFERENCE,UNIQUE_CHILD,Level,sorted_order CAR_1,Child_G2000_4,_2,001.001 CAR_1,Child_G1000_3,_2,001.002 CAR_1,Child_G5200_3,_2,001.003 CAR_1,Child_G7200_2,_2,001.004 CAR_1,Child_G9200_2,_2,001.005 CAR_1,Child_G0200_3,_2,001.006 CAR_1,Child_G1010_2,_2,001.007 CAR_1,Child_G7201_2,_2,001.008 CAR_1,Child_G6200_3,_2,001.009 </code>

CAR_REFERENCE,UNIQUE_CHILD,Level,sorted_order
CAR_1,Child_G2000_4,_2,001.001
CAR_1,Child_G1000_3,_2,001.002
CAR_1,Child_G5200_3,_2,001.003
CAR_1,Child_G7200_2,_2,001.004
CAR_1,Child_G9200_2,_2,001.005
CAR_1,Child_G0200_3,_2,001.006
CAR_1,Child_G1010_2,_2,001.007
CAR_1,Child_G7201_2,_2,001.008
CAR_1,Child_G6200_3,_2,001.009

So for some products the ‘Level’ column seems correct, also the ‘Sorted_order’ too, but for the most part are incorrect.

If this script was correct the goal was to merge back this dataframe with the previous one to get back all the columns from the original file.

If you guys can take a Look of all of this, I tried to give as much details as I can, I’m stuck since 3 weeks on this, trying every day something new.

Thank’s a lot for your HELP.

Thiết kế website giá rẻ

Danh mục

How to process a Parent / Child hierarchy using Python Pandas?