I have been learning Spark (3.5.0) and I tried out the following exercise:
- start a Spark session locally:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .appName("hello-spark")
         .getOrCreate())
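For context, here is how the session's warehouse and catalog settings can be inspected (a quick sketch; the printed values depend on the local setup):
# Where managed table data is written, and which catalog implementation
# backs spark.catalog (typically 'in-memory' on a plain local session).
print(spark.conf.get("spark.sql.warehouse.dir"))
print(spark.conf.get("spark.sql.catalogImplementation"))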
- use the spark.catalog API to list all databases:
spark.catalog.listDatabases()
I see this:
[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse')]
- I create a database my_db using
spark.sql('create database my_db')
and create a table using
spark.sql("""create table my_db.fire_service_calls_tbl (CallNumber integer, UnitID string, IncidentNumber integer)
using parquet""")
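As a sanity check, the new table shows up through the catalog API in the same session (listTables takes the database name):
# Confirm the table is registered under my_db in this session.
print(spark.catalog.listTables("my_db"))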
- Listing the databases again with
spark.catalog.listDatabases()
I now see two databases:
[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse'),
Database(name='my_db', catalog='spark_catalog', description='', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse/my_db.db')]
I can also see the new my_db.db folder, containing the table I created, inside the warehouse directory on disk.
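A quick way to confirm that from Python, using the warehouse path from the listing above:
import os

# The my_db.db folder sits inside the spark-warehouse directory.
print(os.listdir("/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse"))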
However, the next day, when I start a new Spark session and run spark.catalog.listDatabases(), Spark can NOT find my_db at all! It only shows the default database:
[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse')]
Running spark.catalog.databaseExists('my_db') also returns False.
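Putting the whole next-day check into one snippet (same builder settings as above; the comments reflect what I observe):
from pyspark.sql import SparkSession

# A fresh session, started from the same project directory the next day.
spark = (SparkSession.builder
         .master("local")
         .appName("hello-spark")
         .getOrCreate())

print(spark.catalog.listDatabases())         # only the default database
print(spark.catalog.databaseExists("my_db")) # False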
I am sure I am missing some fundamental point about how this works. I was expecting my_db to show up, but it did not.