Tag Archive for apache-spark, pyspark

How to connect from my host system to a Spark Container?

I have a container running Spark on port 7077, and I got the container's IP address with the `docker inspect` command. From my host machine, where I run Jupyter, I'm trying to create a SparkSession that connects to the cluster in the container.
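A minimal sketch of the connection attempt described above, assuming the container runs a Spark standalone master. The addresses, the `master_url` and `connect` helpers, and the application name are hypothetical; `spark.driver.host` is set on the assumption that the executors in the container must be able to reach the driver on the host, which may or may not be the obstacle in this setup.

```python
def master_url(host: str, port: int = 7077) -> str:
    """Build the standalone-master URL that SparkSession.builder.master() expects."""
    return f"spark://{host}:{port}"


def connect(container_ip: str, host_ip: str, app_name: str = "jupyter-on-host"):
    """Create a SparkSession on the container's standalone cluster.

    pyspark is imported lazily so the URL helper works without it installed.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master_url(container_ip))
        .appName(app_name)
        # Executors call back to the driver, so advertise an address
        # of the host that the container can reach (assumption).
        .config("spark.driver.host", host_ip)
        .getOrCreate()
    )


# Hypothetical address: substitute the value reported by `docker inspect`.
print(master_url("172.17.0.2"))
```

Calling `connect("172.17.0.2", "172.17.0.1")` from the notebook would then attempt the connection. The PySpark version on the host generally needs to match the Spark version in the container.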

Free up local checkpoint memory

I am writing an iterative algorithm and use localCheckpoint both to break the DataFrame lineage and to persist the data for the next iteration. The problem is that unpersist doesn't seem to work on a locally checkpointed DataFrame, so the checkpointed data is never cleared (the Storage tab of the Spark UI confirms this).
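The pattern in question can be sketched roughly as follows. This only reproduces the loop described, not a fix: `run_iterations` and `step` are hypothetical names, and whether the `unpersist()` call actually releases the checkpointed blocks is exactly the open question.

```python
def run_iterations(df, step, n_iter):
    """Apply `step` n_iter times, local-checkpointing each result.

    `df` is expected to be a pyspark.sql.DataFrame; `step` is a hypothetical
    function DataFrame -> DataFrame holding one iteration of the algorithm.
    """
    # localCheckpoint() is eager by default: it materializes the data
    # on the executors and truncates the lineage.
    current = df.localCheckpoint()
    for _ in range(n_iter):
        nxt = step(current).localCheckpoint()
        # Ask Spark to drop the previous iteration's data. The question
        # reports that this has no visible effect in the Storage tab.
        current.unpersist()
        current = nxt
    return current
```

The previous DataFrame is unpersisted only after the next one has been checkpointed, so the new result never has to be recomputed from data that was already dropped.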

Join text file content to existing dataframe

I have a metadata DataFrame, df_meta, that holds the URIs of the text files whose content I need:
| id | s3_uri |
| --- | --- |
| 1 | s3://bucket1/prefix |
| 2 | s3://bucket2/another_prefix |
| 3 | s3://bucket3/a_different_prefix |