I have a xlsx file that is structured in an unorthodox way. Simplified, it looks like this:
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | ||||||
2 | member_id | 101 | 102 | 102 | 103 | |
3 | first_name | paul | john | george | ringo | |
4 | last_name | mccartney | lennon | harrison | starr |
Note the member_id is NOT unique (it makes sense in the original data).
My goal is to have a table that looks like this:
member_id | first_name | last_name |
---|---|---|
101 | paul | mccartney |
102 | john | lennon |
102 | george | harrison |
103 | ringo | starr |
(note that member_id is still not unique)
So my approach is to read this table into Python, transpose it, and write it to a database.
This is what I have so far:
import pandas as pd
import openpyxl
df = pd.read_excel('ingest_pzt/sample_inv.xlsx', sheet_name='data', engine='openpyxl', skiprows=[0])
This is already off to a bad start, as it seems that Pandas wants my columns to be unique (note that the columns are called “102” and “102.1”):
Unnamed: 0 member_id 101 102 102.1 104
0 NaN first_name paul john george ringo
1 NaN last_name mccartney lennon harrison starr
If I got the issue with the member_id fixed, I would drop the empty column:
df = df.drop(df.columns[[0]], axis='columns')
Which would give me
member_id 101 102 102.1 104
0 first_name paul john george ringo
1 last_name mccartney lennon harrison starr
Then I would transpose the table like this:
df = df.transpose()
0 1
member_id first_name last_name
101 paul mccartney
102 john lennon
102.1 george harrison
104 ringo starr
Which has an odd first row of what used to be the index, which I can’t seem to drop.
Help is greatly appreciated!