I am trying to write a Spark program that reads airport data from an airports.text file, finds all the airports located in the United States, and writes each airport's name and city name to an out/airports_in_usa.text file, but I get an error when calling .saveAsTextFile.
I created a .py file containing a Utils module that defines COMMA_DELIMITER, and I added the folder containing this .py file to my PYTHONPATH so that I can import the module.
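For reference, since COMMA_DELIMITER is used via .split() below, Utils.py is roughly the following: a compiled regex that splits on commas only when they are outside double quotes (the exact pattern here is my reconstruction, so it may differ slightly from my actual file). The sample line is a made-up row in the airports.text format:

```python
import re

# CSV-aware delimiter: match a comma only if an even number of
# double quotes remains ahead of it, i.e. the comma is outside quotes.
COMMA_DELIMITER = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

# Hypothetical sample line in the airports.text format
line = '1,"Goroka Airport","Goroka, Eastern Highlands","Papua New Guinea"'
splits = COMMA_DELIMITER.split(line)
# The comma inside "Goroka, Eastern Highlands" is NOT split on:
# splits[1] -> '"Goroka Airport"', splits[3] -> '"Papua New Guinea"'
```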
My code and error are below:
<code>
from pyspark import SparkContext, SparkConf
import Utils  # this is the module in the .py file

# Create a function that keeps only the name and city fields
def splitComma(line: str):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[2])

conf = SparkConf().setAppName("airports").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Import the dataset, creating an RDD
airports = sc.textFile("airports.text")

# Filter the data to find all airports in the USA using a lambda function
airportsInUSA = airports.filter(lambda line: Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

airportsNameAndCityNames = airportsInUSA.map(splitComma)
airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")
</code>
The traceback:
<code>
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
c:\Users\user\Documents\Python Programming\PySpark_for_Big_Data_3.ipynb Cell 18 line 2
      1 # Save output in a new text file
----> 2 airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")
      3 # displays an error

File C:\spark\spark-3.4.2-bin-hadoop3\python\pyspark\rdd.py:3406, in RDD.saveAsTextFile(self, path, compressionCodecClass)
   3404     keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path, compressionCodec)
   3405 else:
-> 3406     keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)

File C:\spark\spark-3.4.2-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File C:\spark\spark-3.4.2-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
...
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:842)

Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
</code>
Help is urgently needed.
My software specs are:
Python 3.11.8
PySpark 3.4.2
Spark 3.4.2
Java 17
I use the VS Code IDE.