I have created a sample DataFrame in PySpark in which some values in the ID column have more than 8 digits. When I filter on ID > 100000, only the rows whose ID has 8 digits or fewer are returned; the rows with longer IDs are missing even though they satisfy the condition. Can anyone suggest how to write the filter so that it returns every row that matches the condition?
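from pyspark.sql import SparkSession

# reuse the active session, or create one when running as a standalone script
spark = SparkSession.builder.getOrCreate()

# sample rows; note that the ID values are strings, not numbers
data = [["2116722", "sravan", "company 1"],
        ["2716722", "ojaswi", "company 2"],
        ["2119722", "bobby", "company 3"],
        ["21156311722", "sravan", "company 1"],
        ["21422", "ojaswi", None],
        ["2216722", "rohith", "company 2"],
        ["3116722672", "gnanesh", "company 1"],
        ["2156722", None, "company 2"],
        ["4115666122", "bobby", "company 3"],
        ["21190745", "rohith", "company 2"]]
# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# keep rows whose ID is greater than 100000 (string column compared with an integer literal)
dataframe.where(dataframe["ID"] > 100000).show()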
Output:
+--------+-------------+------------+
|      ID|Employee NAME|Company Name|
+--------+-------------+------------+
| 2116722|       sravan|   company 1|
| 2716722|       ojaswi|   company 2|
| 2119722|        bobby|   company 3|
| 2216722|       rohith|   company 2|
| 2156722|         NULL|   company 2|
|21190745|       rohith|   company 2|
+--------+-------------+------------+
Expected output:
+-----------+-------------+------------+
|         ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|    2116722|       sravan|   company 1|
|    2716722|       ojaswi|   company 2|
|    2119722|        bobby|   company 3|
|    2216722|       rohith|   company 2|
|    2156722|         NULL|   company 2|
|   21190745|       rohith|   company 2|
|21156311722|       sravan|   company 1|
| 4115666122|        bobby|   company 3|
| 3116722672|      gnanesh|   company 1|
+-----------+-------------+------------+
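A possible cause worth checking (this is an assumption, not something confirmed here): ID is a string column and 100000 is an integer literal, so Spark may be casting each ID to a 32-bit integer before comparing. Any value above 2147483647 would then turn into NULL and fail the filter. The pure-Python sketch below does not use Spark at all; it only mimics that behaviour on the same IDs, and the helper names `cast_to_int32`, `cast_to_int64`, and `keep` are made up for illustration.

```python
# Pure-Python sketch (no Spark involved) of the suspected behaviour:
# a lenient cast that yields None when the number does not fit the target type.
# ASSUMPTION: Spark casts the string ID to a 32-bit int for this comparison.

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

ids = ["2116722", "2716722", "2119722", "21156311722", "21422",
       "2216722", "3116722672", "2156722", "4115666122", "21190745"]


def cast_to_int32(text):
    """Hypothetical helper: parse text, return None if outside 32-bit range."""
    value = int(text)
    return value if INT32_MIN <= value <= INT32_MAX else None


def cast_to_int64(text):
    """Hypothetical helper: parse text, return None if outside 64-bit range."""
    value = int(text)
    return value if INT64_MIN <= value <= INT64_MAX else None


def keep(values, cast, threshold=100000):
    """Keep the IDs whose cast value is not None and exceeds the threshold."""
    kept = []
    for text in values:
        number = cast(text)
        if number is not None and number > threshold:
            kept.append(text)
    return kept


as_int = keep(ids, cast_to_int32)    # the three longest IDs drop out
as_long = keep(ids, cast_to_int64)   # only "21422" drops out

print(as_int)
print(as_long)
```

With the 32-bit cast, the sketch keeps the same six IDs as the actual output above; with the 64-bit cast it keeps the nine IDs in the expected output. If this is what is happening, casting explicitly in the filter, for example `dataframe.where(dataframe["ID"].cast("long") > 100000).show()`, may be worth trying.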