I have a MySQL 5.7 table that stores products. The node columns refer to categories and subcategories. Among these nodes, only node_0 and node_4 are guaranteed to be non-NULL, while the others may or may not be NULL. This table will be used to process products in batches based on unique combinations of their node values (unique_cat). Each unique combination can have between 1 and 50 products. The individual node names are not unique (i.e., node_i can have the same name as node_k), but the combinations of these nodes are unique.
One reproducible example is as follows:
<code> CREATE TABLE products (
product_id int(11) NOT NULL AUTO_INCREMENT,
node_0 varchar(400) DEFAULT NULL,
node_1 varchar(400) DEFAULT NULL,
node_2 varchar(400) DEFAULT NULL,
node_3 varchar(400) DEFAULT NULL,
node_4 varchar(255) DEFAULT NULL,
INSERT INTO products (node_0, node_1, node_2, node_3, node_4) VALUES
('a_0', NULL, NULL, NULL, 'a_1'),
('a_0', NULL, NULL, NULL, 'a_1'),
('a_2', 'a_1', NULL, NULL, 'a_1'),
('a_0', NULL, NULL, 'a_3', 'a_2'),
('a_3', NULL, NULL, 'a_0', 'a_2'),
('a_0', NULL, NULL, NULL, 'a_2'),
('a_2', 'a_1', NULL, NULL, 'a_1')
<code> CREATE TABLE products (
product_id int(11) NOT NULL AUTO_INCREMENT,
node_0 varchar(400) DEFAULT NULL,
node_1 varchar(400) DEFAULT NULL,
node_2 varchar(400) DEFAULT NULL,
node_3 varchar(400) DEFAULT NULL,
node_4 varchar(255) DEFAULT NULL,
PRIMARY KEY (id)
);
INSERT INTO products (node_0, node_1, node_2, node_3, node_4) VALUES
('a_0', NULL, NULL, NULL, 'a_1'),
('a_0', NULL, NULL, NULL, 'a_1'),
('a_2', 'a_1', NULL, NULL, 'a_1'),
('a_0', NULL, NULL, 'a_3', 'a_2'),
('a_3', NULL, NULL, 'a_0', 'a_2'),
('a_0', NULL, NULL, NULL, 'a_2'),
('a_2', 'a_1', NULL, NULL, 'a_1')
</code>
CREATE TABLE products (
product_id int(11) NOT NULL AUTO_INCREMENT,
node_0 varchar(400) DEFAULT NULL,
node_1 varchar(400) DEFAULT NULL,
node_2 varchar(400) DEFAULT NULL,
node_3 varchar(400) DEFAULT NULL,
node_4 varchar(255) DEFAULT NULL,
PRIMARY KEY (id)
);
INSERT INTO products (node_0, node_1, node_2, node_3, node_4) VALUES
('a_0', NULL, NULL, NULL, 'a_1'),
('a_0', NULL, NULL, NULL, 'a_1'),
('a_2', 'a_1', NULL, NULL, 'a_1'),
('a_0', NULL, NULL, 'a_3', 'a_2'),
('a_3', NULL, NULL, 'a_0', 'a_2'),
('a_0', NULL, NULL, NULL, 'a_2'),
('a_2', 'a_1', NULL, NULL, 'a_1')
I need to create a new column, unique_cat, which will be a number representing distinct values of the combination of node_0, node_1, node_2, node_3, and node_4. I’m unsure how to handle NULL values in this context.
Expected output:
<code>| node_0 | node_1 | node_2 | node_3 | node_4 | unique_cat |
|--------|--------|--------|--------|--------|------------|
| a_0 | NULL | NULL | NULL | a_1 | 0 |
| a_0 | NULL | NULL | NULL | a_1 | 0 |
| a_2 | a_1 | NULL | NULL | a_1 | 1 |
| a_0 | NULL | NULL | a_3 | a_2 | 2 |
| a_3 | NULL | NULL | a_0 | a_2 | 3 |
| a_0 | NULL | NULL | NULL | a_2 | 4 |
| a_2 | a_1 | NULL | NULL | a_1 | 1 |
<code>| node_0 | node_1 | node_2 | node_3 | node_4 | unique_cat |
|--------|--------|--------|--------|--------|------------|
| a_0 | NULL | NULL | NULL | a_1 | 0 |
| a_0 | NULL | NULL | NULL | a_1 | 0 |
| a_2 | a_1 | NULL | NULL | a_1 | 1 |
| a_0 | NULL | NULL | a_3 | a_2 | 2 |
| a_3 | NULL | NULL | a_0 | a_2 | 3 |
| a_0 | NULL | NULL | NULL | a_2 | 4 |
| a_2 | a_1 | NULL | NULL | a_1 | 1 |
</code>
| node_0 | node_1 | node_2 | node_3 | node_4 | unique_cat |
|--------|--------|--------|--------|--------|------------|
| a_0 | NULL | NULL | NULL | a_1 | 0 |
| a_0 | NULL | NULL | NULL | a_1 | 0 |
| a_2 | a_1 | NULL | NULL | a_1 | 1 |
| a_0 | NULL | NULL | a_3 | a_2 | 2 |
| a_3 | NULL | NULL | a_0 | a_2 | 3 |
| a_0 | NULL | NULL | NULL | a_2 | 4 |
| a_2 | a_1 | NULL | NULL | a_1 | 1 |
When dealing with only node_0 and node_4, the following approach works:
<code># Create a unique combination of node_0,node_4 with unique IDs
.select("node_0", "node_4")
.withColumn("unique_cat", monotonically_increasing_id())
# Join the unique combinations back to the original DataFrame
df_with_cat_ids = df.join(
<code># Create a unique combination of node_0,node_4 with unique IDs
unique_cat_df = df
.select("node_0", "node_4")
.distinct()
.withColumn("unique_cat", monotonically_increasing_id())
# Join the unique combinations back to the original DataFrame
df_with_cat_ids = df.join(
unique_cat_df,
on=["node_0", "node_4"],
how="left"
)
</code>
# Create a unique combination of node_0,node_4 with unique IDs
unique_cat_df = df
.select("node_0", "node_4")
.distinct()
.withColumn("unique_cat", monotonically_increasing_id())
# Join the unique combinations back to the original DataFrame
df_with_cat_ids = df.join(
unique_cat_df,
on=["node_0", "node_4"],
how="left"
)
However, when I try to include nodes that can be NULL, it doesn’t work as expected. Here’s what I have tried:
<code>placeholder = "___NULL___"
.withColumn("node_2", F.when(col("node_2").isNull(), placeholder).otherwise(col("node_2")))
.withColumn("node_3", F.when(col("node_3").isNull(), placeholder).otherwise(col("node_3")))
# Select columns and create distinct combinations with a unique ID
unique_combinations_df = df
.select("node_0", "node_1", "node_2", "node_3", "node_4")
.withColumn("unique_cat", monotonically_increasing_id())
# Join the unique combinations back to the original DataFrame
df_with_ids = lastest_data_df_2.join(
on=["current_node", "node_1", "node_2", "node_3", "root_node"],
<code>placeholder = "___NULL___"
df = df_0
.withColumn("node_2", F.when(col("node_2").isNull(), placeholder).otherwise(col("node_2")))
.withColumn("node_3", F.when(col("node_3").isNull(), placeholder).otherwise(col("node_3")))
# Select columns and create distinct combinations with a unique ID
unique_combinations_df = df
.select("node_0", "node_1", "node_2", "node_3", "node_4")
.distinct()
.withColumn("unique_cat", monotonically_increasing_id())
# Join the unique combinations back to the original DataFrame
df_with_ids = lastest_data_df_2.join(
unique_combinations_df,
on=["current_node", "node_1", "node_2", "node_3", "root_node"],
how="left"
)
</code>
placeholder = "___NULL___"
df = df_0
.withColumn("node_2", F.when(col("node_2").isNull(), placeholder).otherwise(col("node_2")))
.withColumn("node_3", F.when(col("node_3").isNull(), placeholder).otherwise(col("node_3")))
# Select columns and create distinct combinations with a unique ID
unique_combinations_df = df
.select("node_0", "node_1", "node_2", "node_3", "node_4")
.distinct()
.withColumn("unique_cat", monotonically_increasing_id())
# Join the unique combinations back to the original DataFrame
df_with_ids = lastest_data_df_2.join(
unique_combinations_df,
on=["current_node", "node_1", "node_2", "node_3", "root_node"],
how="left"
)
This approach is not working. Any ideas or suggestions? Thanks!