I have the example schema and DataFrame below:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    'Invoice Type': pa.Column(str, pa.Check(lambda s: s.isin(['Invoice', 'Credit']))),
    'Quantity': pa.Column(
        int,
        checks=[
            # Credits must have negative quantities
            pa.Check(
                lambda df: df[True] < 0,
                groupby=lambda df: (
                    df.assign(is_credit=lambda d: d["Invoice Type"] == 'Credit').groupby(['is_credit'])
                ),
                name='negative credits quantities'),
            # Invoices must have non-negative quantities
            pa.Check(
                lambda df: df[True] >= 0,
                groupby=lambda df: (
                    df.assign(is_invoice=lambda d: d['Invoice Type'] == 'Invoice').groupby(['is_invoice'])
                ),
                name='positive invoices quantities'),
        ]),
    'Customer Number': pa.Column(str)
})

df = pd.DataFrame(
    {'Quantity': [1, 2, -3],
     'Invoice Type': ['Invoice', 'Invoice', 'Credit'],
     'Customer Number': ['ABC', 'ABC', 'DEF']}
)
This correctly checks that all credit quantities are negative and all invoice quantities are non-negative. Where I'm running into trouble is when the DataFrame has no rows for one of the invoice types. I commonly have data with no credits at all, for example:
df = pd.DataFrame(
    {'Quantity': [1, 2, 3],
     'Invoice Type': ['Invoice', 'Invoice', 'Invoice'],
     'Customer Number': ['ABC', 'ABC', 'DEF']}
)
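Validating this all-invoice DataFrame fails with a KeyError, presumably because the credits check looks up the True group (df[True]) and no such group exists when there are no credits. How can I ensure that validation passes when only one invoice type is present in the DataFrame?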
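
To make the goal concrete, here is a minimal sketch of the behaviour I'm after, written in plain pandas rather than pandera. It expresses each rule as a row-wise boolean mask instead of a group lookup, so a missing invoice type simply contributes no failing rows. The helper names credits_are_negative and invoices_are_non_negative are hypothetical, made up for this illustration, and I have not verified how best to wire them into the schema.

```python
import pandas as pd


def credits_are_negative(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: a row passes if it is not a credit, or its quantity is negative."""
    return (df["Invoice Type"] != "Credit") | (df["Quantity"] < 0)


def invoices_are_non_negative(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: a row passes if it is not an invoice, or its quantity is >= 0."""
    return (df["Invoice Type"] != "Invoice") | (df["Quantity"] >= 0)


# The two example DataFrames from the question
mixed = pd.DataFrame(
    {"Quantity": [1, 2, -3],
     "Invoice Type": ["Invoice", "Invoice", "Credit"],
     "Customer Number": ["ABC", "ABC", "DEF"]}
)
only_invoices = pd.DataFrame(
    {"Quantity": [1, 2, 3],
     "Invoice Type": ["Invoice", "Invoice", "Invoice"],
     "Customer Number": ["ABC", "ABC", "DEF"]}
)

# Neither rule depends on a particular invoice type being present,
# so no group lookup (and no KeyError) is involved.
for frame in (mixed, only_invoices):
    print(credits_are_negative(frame).all(), invoices_are_non_negative(frame).all())
```

If pandera accepts functions like these as DataFrame-level checks (passed through the checks argument of pa.DataFrameSchema rather than on the 'Quantity' column), that might be one way around the problem, but I'd also like to know whether the groupby-based checks can be made to tolerate a missing group.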