Thiết kế website giá rẻ

Question

I have a MySQL 5.7 table that stores products. The node columns refer to categories and subcategories. Among these nodes, only node_0 and node_4 are guaranteed to be non-NULL, while the others may or may not be NULL. This table will be used to process products in batches based on unique combinations of their node values (unique_cat). Each unique combination can have between 1 and 50 products. The individual node names are not unique (i.e., node_i can have the same name as node_k), but the combinations of these nodes are unique.

One reproducible example is as follows:

<code> CREATE TABLE products (

product_id int(11) NOT NULL AUTO_INCREMENT,

node_0 varchar(400) DEFAULT NULL,

node_1 varchar(400) DEFAULT NULL,

node_2 varchar(400) DEFAULT NULL,

node_3 varchar(400) DEFAULT NULL,

node_4 varchar(255) DEFAULT NULL,

PRIMARY KEY (id)

);

INSERT INTO products (node_0, node_1, node_2, node_3, node_4) VALUES

('a_0', NULL, NULL, NULL, 'a_1'),

('a_2', 'a_1', NULL, NULL, 'a_1'),

('a_0', NULL, NULL, 'a_3', 'a_2'),

('a_3', NULL, NULL, 'a_0', 'a_2'),

('a_0', NULL, NULL, NULL, 'a_2'),

('a_2', 'a_1', NULL, NULL, 'a_1')

</code>

<code> CREATE TABLE products ( product_id int(11) NOT NULL AUTO_INCREMENT, node_0 varchar(400) DEFAULT NULL, node_1 varchar(400) DEFAULT NULL, node_2 varchar(400) DEFAULT NULL, node_3 varchar(400) DEFAULT NULL, node_4 varchar(255) DEFAULT NULL, PRIMARY KEY (id) ); INSERT INTO products (node_0, node_1, node_2, node_3, node_4) VALUES ('a_0', NULL, NULL, NULL, 'a_1'), ('a_0', NULL, NULL, NULL, 'a_1'), ('a_2', 'a_1', NULL, NULL, 'a_1'), ('a_0', NULL, NULL, 'a_3', 'a_2'), ('a_3', NULL, NULL, 'a_0', 'a_2'), ('a_0', NULL, NULL, NULL, 'a_2'), ('a_2', 'a_1', NULL, NULL, 'a_1') </code>

    CREATE TABLE products (
    product_id int(11) NOT NULL AUTO_INCREMENT,
    node_0 varchar(400) DEFAULT NULL,
    node_1 varchar(400) DEFAULT NULL,
    node_2 varchar(400) DEFAULT NULL,
    node_3 varchar(400) DEFAULT NULL,
    node_4 varchar(255) DEFAULT NULL,
    PRIMARY KEY (id)
);

INSERT INTO products (node_0, node_1, node_2, node_3, node_4) VALUES
('a_0', NULL, NULL, NULL, 'a_1'),
('a_0', NULL, NULL, NULL, 'a_1'),
('a_2', 'a_1', NULL, NULL, 'a_1'),
('a_0', NULL, NULL, 'a_3', 'a_2'),
('a_3', NULL, NULL, 'a_0', 'a_2'),
('a_0', NULL, NULL, NULL, 'a_2'),
('a_2', 'a_1', NULL, NULL, 'a_1')

I need to create a new column, unique_cat, which will be a number representing distinct values of the combination of node_0, node_1, node_2, node_3, and node_4. I’m unsure how to handle NULL values in this context.
Expected output:

|--------|--------|--------|--------|--------|------------|

| a_0 | NULL | NULL | NULL | a_1 | 0 |

| a_2 | a_1 | NULL | NULL | a_1 | 1 |

| a_0 | NULL | NULL | a_3 | a_2 | 2 |

| a_3 | NULL | NULL | a_0 | a_2 | 3 |

| a_0 | NULL | NULL | NULL | a_2 | 4 |

| a_2 | a_1 | NULL | NULL | a_1 | 1 |

</code>

<code>| node_0 | node_1 | node_2 | node_3 | node_4 | unique_cat | |--------|--------|--------|--------|--------|------------| | a_0 | NULL | NULL | NULL | a_1 | 0 | | a_0 | NULL | NULL | NULL | a_1 | 0 | | a_2 | a_1 | NULL | NULL | a_1 | 1 | | a_0 | NULL | NULL | a_3 | a_2 | 2 | | a_3 | NULL | NULL | a_0 | a_2 | 3 | | a_0 | NULL | NULL | NULL | a_2 | 4 | | a_2 | a_1 | NULL | NULL | a_1 | 1 | </code>

| node_0 | node_1 | node_2 | node_3 | node_4 | unique_cat |
|--------|--------|--------|--------|--------|------------|
| a_0    | NULL   | NULL   | NULL   | a_1    | 0          |
| a_0    | NULL   | NULL   | NULL   | a_1    | 0          |
| a_2    | a_1    | NULL   | NULL   | a_1    | 1          |
| a_0    | NULL   | NULL   | a_3    | a_2    | 2          |
| a_3    | NULL   | NULL   | a_0    | a_2    | 3          |
| a_0    | NULL   | NULL   | NULL   | a_2    | 4          |
| a_2    | a_1    | NULL   | NULL   | a_1    | 1          |

When dealing with only node_0 and node_4, the following approach works:

<code># Create a unique combination of node_0,node_4 with unique IDs

unique_cat_df = df

.select("node_0", "node_4")

.distinct()

.withColumn("unique_cat", monotonically_increasing_id())

# Join the unique combinations back to the original DataFrame

df_with_cat_ids = df.join(

unique_cat_df,

on=["node_0", "node_4"],

how="left"

)

</code>

<code># Create a unique combination of node_0,node_4 with unique IDs unique_cat_df = df .select("node_0", "node_4") .distinct() .withColumn("unique_cat", monotonically_increasing_id()) # Join the unique combinations back to the original DataFrame df_with_cat_ids = df.join( unique_cat_df, on=["node_0", "node_4"], how="left" ) </code>

# Create a unique combination of node_0,node_4 with unique IDs
unique_cat_df = df 
      .select("node_0", "node_4") 
      .distinct() 
      .withColumn("unique_cat", monotonically_increasing_id())

# Join the unique combinations back to the original DataFrame
df_with_cat_ids = df.join(
      unique_cat_df,
      on=["node_0", "node_4"],
      how="left"
)

However, when I try to include nodes that can be NULL, it doesn’t work as expected. Here’s what I have tried:

<code>placeholder = "___NULL___"

df = df_0

.withColumn("node_2", F.when(col("node_2").isNull(), placeholder).otherwise(col("node_2")))

.withColumn("node_3", F.when(col("node_3").isNull(), placeholder).otherwise(col("node_3")))

# Select columns and create distinct combinations with a unique ID

unique_combinations_df = df

.select("node_0", "node_1", "node_2", "node_3", "node_4")

.distinct()

.withColumn("unique_cat", monotonically_increasing_id())

# Join the unique combinations back to the original DataFrame

df_with_ids = lastest_data_df_2.join(

unique_combinations_df,

on=["current_node", "node_1", "node_2", "node_3", "root_node"],

how="left"

)

</code>

<code>placeholder = "___NULL___" df = df_0 .withColumn("node_2", F.when(col("node_2").isNull(), placeholder).otherwise(col("node_2"))) .withColumn("node_3", F.when(col("node_3").isNull(), placeholder).otherwise(col("node_3"))) # Select columns and create distinct combinations with a unique ID unique_combinations_df = df .select("node_0", "node_1", "node_2", "node_3", "node_4") .distinct() .withColumn("unique_cat", monotonically_increasing_id()) # Join the unique combinations back to the original DataFrame df_with_ids = lastest_data_df_2.join( unique_combinations_df, on=["current_node", "node_1", "node_2", "node_3", "root_node"], how="left" ) </code>

placeholder = "___NULL___"
df = df_0 
    .withColumn("node_2", F.when(col("node_2").isNull(), placeholder).otherwise(col("node_2"))) 
    .withColumn("node_3", F.when(col("node_3").isNull(), placeholder).otherwise(col("node_3")))

# Select columns and create distinct combinations with a unique ID
unique_combinations_df = df 
    .select("node_0", "node_1", "node_2", "node_3", "node_4") 
    .distinct() 
    .withColumn("unique_cat", monotonically_increasing_id())

# Join the unique combinations back to the original DataFrame
df_with_ids = lastest_data_df_2.join(
    unique_combinations_df,
    on=["current_node", "node_1", "node_2", "node_3", "root_node"],
    how="left"
)

This approach is not working. Any ideas or suggestions? Thanks!

Thiết kế website giá rẻ

Danh mục

Pyspark on Mysql HOW to create a new column to categorize products based on distinct combination of another columns with NULL values