I’m currently trying to clean up a database and create unique ids for each parent. For each parent there’s been duplicate accounts that have been created – one for each of their students. For example, a parent enters in their name one year with an alias and the next without, those are seen as two different accounts. These ids would allow me to create one parent account for all of their kids.
Example table of dummy data given below. This data includes student ids, parent names, relationships, phone numbers, emails, and addresses and a couple of other things. My issue is trying to combine parents such as alice johnson into one identifier using python and pandas.
Dummy Parent Data
I’m having trouble accurately determining one identifier for each parent without overwriting parent data or missing parents that have aliases.
So far I’ve read the data into a dataframe. I’ve preprocessed the data to try to minimize duplicates by setting strings to lowercase and stripping spaces. Below I’ve also melted the data to combine phonenumber columns into one column for each student. I’ve removed duplicates from all columns.
data = pd.read_csv(
"archive/pall_cust.csv"
)
data["street"] = data["street"].str.lower()
data["email"] = data["email"].str.lower()
data["street"] = data["street"].astype("string")
data["parent_name"] = data["parent_name"].astype("string")
data["parent_name"] = data["parent_name"].str.strip()
data["parent_name"] = data["parent_name"].str.title()
phone_df = pd.melt(
data,
id_vars=["dcid", "priority"],
value_vars=["priphone", "workphone", "cell"],
var_name="phonetype",
value_name="phonenumberasentered",
)
phone_df.sort_values("dcid")
phone_df.dropna(subset=["phonenumberasentered"], inplace=True)
dup = phone_df.duplicated(
subset=["dcid", "priority", "phonenumberasentered"], keep="first"
)
phone_df.loc[dup, ["phonetype", "phonenumberasentered"]] = None
phone_df.to_csv("phone.csv")
merged_df = pd.merge(data, phone_df, on=["priority", "dcid"], how="left").drop(
columns=["priphone", "workphone", "cell", "entry_date", "school"]
)
merged_df.drop_duplicates(
subset=[
"parent_name",
"dcid",
"liveswith",
"legalguardian",
"rel",
"email",
"phonenumberasentered",
"street",
"city_state_zip",
"ln",
"fn",
"studentnumber",
],
inplace=True,
)
This is the trouble area. I can’t figure out how to set a unique parent id. Below is what I’ve tried but grouping by priority, rel, street, city_state_zip, and phonenumberasentered and then taking the max student id from that grouping to identify parents I’ve found overwrites some parents names in my data set. I can’t use email because some parents with different names, i.e. mother and father use the same email address.
max_student = merged_df.loc[
merged_df.groupby(
["priority", "rel", "street", "city_state_zip", "phonenumberasentered"]
)["dcid"].idxmax()
]
parent_names_from_max = max_student.set_index(
["priority", "rel", "street", "city_state_zip", "phonenumberasentered"]
)["parent_name"]
merged_df["parent_name"] = merged_df.set_index(
["priority", "rel", "street", "city_state_zip", "phonenumberasentered"]
).index.map(parent_names_from_max)
merged_df.reset_index(drop=True, inplace=True)
parent_to_id = {
value: idx + 1 for idx, value in enumerate(merged_df["parent_name"].unique())
}
merged_df.loc[:, "New Contact Identifier"] = merged_df["parent_name"].map(parent_to_id)
Finally, I’ve tried first using things like Spacy or Levenshtein distance to try to select similar parent names but I’ve found that doesn’t work efficiently enough. Thanks for your ideas and help.
Chase Brown is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.