I have been trying to set up a validation rule for a few tables that would end up in BigQuery.
I am using Great Expectations 0.18.13.
However, I am running into false negatives in the validation results: expectations fail even though the underlying data is fine.
Here is a simple code snippet that replicates the issue for me:
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

# Start a local Spark session for the validation run
spark = (
    SparkSession.builder
    .appName("Data Validation with Great Expectations")
    .getOrCreate()
)

# Read the CSV and wrap it in the (legacy) SparkDFDataset API
df = spark.read.csv("Sample.csv", header=True, inferSchema=True)
dataset = SparkDFDataset(df)
# dataset.sampling_method = "none"

column_to_check = "sample"
compound_columns = ["SampleId", "SampleNumber"]

# result = dataset.expect_column_values_to_be_unique(column_to_check)
result = dataset.expect_compound_columns_to_be_unique(compound_columns)

if result.success:
    print(f"All items in columns {compound_columns} are unique: PASSED")
else:
    print(f"Some items in columns {compound_columns} are not unique: FAILED")
    print("Details:", result.result)

spark.stop()
On further checking, it appears that the validation picks up values from a different column, not the columns specified, compares those, and reports the result; this is evident from the partial_unexpected_list in the result. It happens both when checking a single column for unique values and when checking compound columns.
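For reference, this is roughly how I compared the flagged values against the actual contents of the two columns (a sketch; it assumes the default result format, which includes a partial_unexpected_list key):

# Sketch: inspect the values the expectation flagged, then look at the
# real contents of the compound columns for comparison
unexpected = result.result.get("partial_unexpected_list", [])
print("Flagged values:", unexpected)

df.select(compound_columns).show(20, truncate=False)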
However, the issue does NOT occur when I remove the other columns from the dataset and keep only the two columns; in that case the validation passes as expected.
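Here is roughly the workaround that passes for me (a sketch; it just selects the two columns before wrapping the DataFrame):

# Sketch of the workaround: keep only the two columns, then re-run the expectation
df_two_cols = df.select(compound_columns)
dataset_two_cols = SparkDFDataset(df_two_cols)
result_two_cols = dataset_two_cols.expect_compound_columns_to_be_unique(compound_columns)
print(result_two_cols.success)  # passes on my data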
Here are the details of the files:
Original file: <200 MB, 333477 rows, 84 columns. Result: failed despite the data being perfectly unique.
Sample file: 225 KB, 333477 rows, 2 columns. Result: passed.
What am I doing wrong?