We have TSV files which holds IOT data, want to convert to table like structure using pandas. I have worked on TSV data, similar to given below, were the logics goes like
- read the file
- Add new column names
- do transpose
- reindex
This is bit challenging as explained, col1 to col3 is dimension data and remaining is fact data
tsv file data looks as below
col1 qweqweq
col2 345435
col3 01/01/2024 35:08:09
col4 1
col5 0
col4 0
col5 0
col4 1
col5 1
col4 0
col5 1
Want to project as table like structure
col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|
qweqweq | 345435 | 01/01/2024 35:08:09 | 1 | 0 |
qweqweq | 345435 | 01/01/2024 35:08:09 | 0 | 0 |
qweqweq | 345435 | 01/01/2024 35:08:09 | 1 | 1 |
qweqweq | 345435 | 01/01/2024 35:08:09 | 0 | 1 |
col4 and col5 can differ in each IOT file. How to achieve with python, pandas?
2
Assuming you can rely on “col1” to define the groups, you can use a pivot
after de-duplicating the rows, and groupby.ffill
:
df = (pd.read_csv('input_file.tsv', sep='t', header=None)
.assign(index=lambda x: x[0].eq('col1').cumsum(),
n=lambda x: x.groupby(['index', 0]).cumcount())
.pivot(index=['index', 'n'], columns=0, values=1)
.groupby(level='index').ffill()
.reset_index(drop=True)
)
Output:
col1 col2 col3 col4 col5
0 qweqweq 345435 01/01/2024 35:08:09 1 0
1 qweqweq 345435 01/01/2024 35:08:09 0 0
2 qweqweq 345435 01/01/2024 35:08:09 1 1
3 qweqweq 345435 01/01/2024 35:08:09 0 1
Reproducible input:
import io
input_file = io.StringIO('''col1tqweqweq
col2t345435
col3t01/01/2024 35:08:09
col4t1
col5t0
col4t0
col5t0
col4t1
col5t1
col4t0
col5t1''')