I am working through a book chapter on PySpark, and the write.csv command creates a folder rather than a single .csv file.
I am running the code in a Jupyter notebook rather than the shell. This is the command that creates the folder:
results.coalesce(1).write.csv("./simple_count_single_partition.csv")
I’ve worked through the following example, taken from “Data Analysis with Python and PySpark” by Jonathan Rioux:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    explode,
    lower,
    regexp_extract,
    split,
)

spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()

book = spark.read.text("./data/gutenberg_books/1342-0.txt")
lines = book.select(split(book.value, " ").alias("line"))
words = lines.select(explode(col("line")).alias("word"))
words_lower = words.select(lower(col("word")).alias("word"))
words_clean = words_lower.select(
    regexp_extract(col("word"), "[a-z']*", 0).alias("word")
)
words_nonull = words_clean.where(col("word") != "")
results = words_nonull.groupby(col("word")).count()

results.orderBy("count", ascending=False).show(10)
results.coalesce(1).write.csv("./simple_count_single_partition.csv")
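This is expected behaviour: Spark writes its output as a folder of part files, typically one per partition, so even with coalesce(1) you get a folder that holds a single part-*.csv file. If the result is small enough to fit in the driver's memory, convert your DataFrame to pandas and then save it to CSV: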
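# toPandas() already collects every partition to the driver, so coalesce(1) is not needed.
# index=False keeps the pandas row index out of the file.
results.toPandas().to_csv("./simple_count_single_partition.csv", index=False)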
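
If you would rather keep the write in Spark, for example because the result is too large for pandas, another option is to let Spark create its folder and then move the single part file out of it. The sketch below is only an illustration: promote_single_part_file is a hypothetical helper name, not part of PySpark, and it assumes the folder was written with coalesce(1) to the local filesystem so that it contains exactly one part-*.csv file.

```python
import glob
import os
import shutil


def promote_single_part_file(spark_output_dir, destination_csv):
    """Move the lone part-*.csv file out of a Spark output folder.

    Hypothetical helper: assumes a local folder written with coalesce(1).
    """
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file, found {len(part_files)}"
        )
    shutil.move(part_files[0], destination_csv)
    # Remove what is left of the folder (for example the _SUCCESS marker).
    shutil.rmtree(spark_output_dir)
    return destination_csv


# Intended usage (paths are placeholders):
# results.coalesce(1).write.csv("./simple_count_tmp", header=True)
# promote_single_part_file("./simple_count_tmp", "./simple_count.csv")
```

The helper raises an error instead of guessing when it finds zero or several part files, which is what you would see if the coalesce(1) step were skipped.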