Tag Archive for apache-spark, pyspark, apache-spark-sql

Spark: fill a specific value between flag values

I’m trying to figure out how to put a specific value between two flag values, for example:

How to get the name of the file which spark generates?

We are writing a DataFrame to a directory in PySpark code, like below.
It is written with a file name of the form “part-****”. Is there any way to get the name of the file in code after it is written?
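
One possible approach, sketched here under the assumption that the output directory is on the local filesystem, is to list the directory after the write finishes and keep only the entries whose names start with "part-". The helper name list_part_files is hypothetical and not part of Spark. For HDFS or object storage, the same idea would need that storage system's own listing API instead of pathlib.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return the sorted names of part files in a local output directory.

    Assumption: output_dir is a local path that a DataFrame write has
    already finished writing to. Marker files such as _SUCCESS and hidden
    checksum files such as .part-*.crc do not match the pattern.
    """
    return sorted(
        p.name for p in Path(output_dir).glob("part-*") if p.is_file()
    )
```

Calling list_part_files on the directory passed to the write would return one name per part file, so a single-partition write would yield a one-element list.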

Dynamic partition pruning between large and small table without extra filter conditions

I have a large partitioned fact table and a small dimension table.
The partition column of the large table is the key column of the dimension table.
I would like to use the small table to reduce the number of partitions I read from the large table.
There is no condition other than that there has to be a small table record corresponding to each large table partition.
Both ‘INNER JOIN’ and ‘LEFT SEMI JOIN’ are acceptable here.

Finding overlap in groups and sorting into new distinct groups

Initially I thought this was an easy problem, but I just can’t figure it out.
Here is a simplified example. I have 8 different people buying some items from a store. Afterwards I want to look at all the items and sort them into groups so that each overlapping initial shopping goes into the same new group.
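
The grouping described above can be read as finding connected components: two items belong to the same new group if some chain of overlapping purchases links them. The following is a minimal pure-Python sketch of that reading using a union-find structure, not a Spark solution. The function name merge_overlapping and the sample data are made up for illustration.

```python
def merge_overlapping(baskets):
    """Merge items into groups so that overlapping baskets share a group.

    baskets maps a person (any hashable key) to an iterable of items.
    Returns a sorted list of sorted item lists, one per merged group.
    """
    parent = {}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for items in baskets.values():
        items = list(items)
        for item in items:
            parent.setdefault(item, item)
        # Link every item in the basket to the basket's first item.
        for other in items[1:]:
            union(items[0], other)

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sample data: p1 and p2 overlap on "bread", p3 and p4 on "soap".
example = {
    "p1": ["apple", "bread"],
    "p2": ["bread", "milk"],
    "p3": ["soap"],
    "p4": ["soap", "towel"],
}
print(merge_overlapping(example))
```

On large data the same idea would need a distributed connected-components routine rather than an in-memory dictionary.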

Error message while using “pyspark” command in CMD

I’ve installed pyspark using the pip install pyspark command in CMD.

How to combine two Datasets to create list of nested JSON objects

I am new to Apache Spark (Java) and am trying to create a text file consisting of multiple JSON objects that represent a combination of these two datasets. The firstToSecondGeneration is very long, so I omitted some columns.

Pyspark apply regex pattern on array elements

I have the PySpark code below to validate a field in nested JSON:
