I have a set of data that I need to perform some transformations on. The raw form of the data is as follows:
lVoterUniqueID | sElectionAbbr1 | sElectionAbbr2 | sElectionAbbr3 | sElectionAbbr4 | sElectionAbbr5 |
---|---|---|---|---|---|
371527 | 2024PR | 2022Gen | 2022PR | 2020GEN | 2020PR |
1843949 | 2024PR | null | null | null | null |
2813398 | 2024PR | 2022Gen | null | 2020GEN | 2020PR |
The output I’m looking for is:
-
lVoterUniqueID relabeled as ID
-
Each sElectionAbbr[X] column relabeled to Election_Code
ID | Election_Code |
---|---|
371527 | 2024PR |
1843949 | 2024PR |
I can accomplish this via sql:
select distinct
lVoterUniqueID,
sElectionAbbr1 as Election_Code
from data_set
where sElectionAbbr1 != ''
union
select distinct
lVoterUniqueID,
sElectionAbbr2
from data_set
where sElectionAbbr2 != ''
union ...
However, I wanted to do this within Python and I’ve accomplished that goal, but it feels like the process is a bit heavy handed and I’m wondering if there is a more efficient way to do this.
Working Python below:
import pandas as pd
df = pd.read_csv([path to data file])
#find number of election columns.
selected_columns = [column for column in df.columns if column.startswith("sElection")]
#add voter id column
selected_columns.insert(0,'lVoterUniqueID')
#create dataframe schema
rcdp_voters_prod = pd.DataFrame(columns=['ID','Election_Code'])
def access_elements(list_var, list_index):
# Use list comprehension to create a new list containing elements from 'list_var' at the specified indices in 'list_index'
result = [list_var[i] for i in list_index]
return result
#indices of list values that hold the election column
python_indices = [index for (index, item) in enumerate(selected_columns) if item != "lVoterUniqueID"]
union_dict = {}
#Create dictionary of lists containing unique voterid and each election abbreviation column
# ['lVoterUniqueID', 'sElectionAbbr1']
# ['lVoterUniqueID', 'sElectionAbbr2']
# ['lVoterUniqueID', 'sElectionAbbr3']
# ['lVoterUniqueID', 'sElectionAbbr4']
# ['lVoterUniqueID', 'sElectionAbbr5']
for index in python_indices:
union_dict.update({index:access_elements(selected_columns,[0,index])})
dataframe_list = []
#create list of dataframes containing only distinct columns from column list
#rename columns for concating
for key in union_dict:
loop_df = df.filter(items=union_dict[key]).drop_duplicates()
loop_df.rename(columns={'lVoterUniqueID':'ID',union_dict[key][1]:'Election_Code'}, inplace= True)
dataframe_list.append(loop_df)
final_frame_form = pd.concat(dataframe_list)
#drop rows where election code is null
final_frame_form=final_frame_form[final_frame_form['Election_Code'].notnull()]