U_d = pd.read_csv("Uzud.csv", decimal = ',')
print(U_d)
What comes out:

          f in kHz  U in mV  Amplitude in Vpp  entfernung in cm
43.7            52        1               5.0               9.0
43.7            43        5              11.0               NaN
43.7            30        5              13.0               NaN
43.7          27,5        5              15.0               NaN
43.7            24        5              16.0               NaN
(Note that the value 27.5 in ‘U in mV’ is read in as the string 27,5.)
All the data in all columns are actually floats, but unfortunately the file is stored like this:
f in kHz,U in mV,amplitude in Vpp,distance in cm
"43,7",52,1,5,9
"43,7",43,5,11
"43,7",30,5,13
"43,7", "27,5",5,15
"43,7",24,5,16
which causes some data in ‘U in mV’ to be interpreted as a string. I want everything to be stored as a float.
Use pd.read_csv with skipinitialspace=True. Add dtype=float if the integers need to become floats as well. For “bad lines” correction, see below.
import pandas as pd
from io import StringIO
s = """f in kHz,U in mV,amplitude in Vpp,distance in cm
"43,7",52,1,5,9
"43,7",43,5,11
"43,7",30,5,13
"43,7", "27,5",5,15
"43,7",24,5,16
"""
df = pd.read_csv(StringIO(s), decimal=',', skipinitialspace=True, dtype=float)
Output

          f in kHz  U in mV  amplitude in Vpp  distance in cm
43.7          52.0      1.0               5.0             9.0
43.7          43.0      5.0              11.0             NaN
43.7          30.0      5.0              13.0             NaN
43.7          27.5      5.0              15.0             NaN
43.7          24.0      5.0              16.0             NaN
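As an aside: if the decimal comma is limited to known columns, per-column converters can do the same normalisation (a sketch on a reduced two-column sample; the column names are taken from the question, and note that `converters` takes precedence over `decimal` and `dtype`):

```python
import pandas as pd
from io import StringIO

# hypothetical two-column sample in the same decimal-comma format
s = """f in kHz,U in mV
"43,7",52
"43,7","27,5"
"""

# replace the decimal comma per field, then cast to float
to_float = lambda x: float(x.replace(',', '.'))

df = pd.read_csv(StringIO(s), skipinitialspace=True,
                 converters={'f in kHz': to_float, 'U in mV': to_float})
print(df.dtypes)  # both columns are float64
```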
As noted in the answer by @AutumnKome, your first line appears to be flawed:
"43,7",52,1,5,9`
# should probably be
"43,7","52,1",5,9
Since pd 1.4.0 you can try and handle such cases by passing a callable to on_bad_lines.
It’s a bit annoying that the first error appears in the first non-header row, leading to the issue discussed in this SO post, with submitted ticket here.
As a workaround, use header=None, which will push the headers to row 1, and correct it afterwards.
def handle_bad_lines(row):
    # ['43.7', '52', '1', '5', '9'] ->
    # ['43.7', 52.1, '5', '9']
    new_row = row[:1] + [float('.'.join(row[1:3]))] + row[3:]
    return new_row
df = pd.read_csv(StringIO(s), decimal=',',
                 skipinitialspace=True,
                 header=None,
                 on_bad_lines=handle_bad_lines,
                 engine='python')  # with `pd 2.2.0`, use `engine='pyarrow'`
df.head(3)
          0        1                 2               3
0  f in kHz  U in mV  amplitude in Vpp  distance in cm   # <- header pushed to row 1
1      43.7     52.1                 5               9
2      43.7       43                 5              11
Correct the header and index, and adjust the dtype (using df.T, df.set_index, df.rename_axis, df.reset_index and df.astype):
df = (
    df.T
    .set_index(0)
    .T
    .rename_axis(columns=None)
    .reset_index(drop=True)
    .astype(float)
)
Output (bad lines corrected, header restored)

   f in kHz  U in mV  amplitude in Vpp  distance in cm
0      43.7     52.1               5.0             9.0
1      43.7     43.0               5.0            11.0
2      43.7     30.0               5.0            13.0
3      43.7     27.5               5.0            15.0
4      43.7     24.0               5.0            16.0
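For what it’s worth, the double transpose can also be avoided by promoting row 0 to the column labels directly; a sketch on a hand-built stand-in for the intermediate frame:

```python
import pandas as pd

# stand-in for the intermediate frame produced by read_csv(header=None, ...):
# row 0 holds the real header, the data rows are still strings
df = pd.DataFrame([
    ['f in kHz', 'U in mV', 'amplitude in Vpp', 'distance in cm'],
    ['43.7', '52.1', '5', '9'],
    ['43.7', '43', '5', '11'],
])

df.columns = df.iloc[0]          # promote row 0 to the header
df = (df.iloc[1:]
        .reset_index(drop=True)
        .rename_axis(columns=None)  # drop the leftover columns name
        .astype(float))
print(df)
```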
If your actual data has more complicated bad lines, it may be best to first collect them and investigate how they can be handled. E.g.,
bad_rows = []

def handle_bad_lines(row):
    bad_rows.append(row)
    return None  # returning `None` effectively skips the line

# after executing `df = pd.read_csv(...)`
print(bad_rows)
# [['43.7', '52', '1', '5', '9']]
In case you don’t have access to pandas (like me), you’ll have to write a few subroutines first: one to split each line using regular expressions (regex), and a second to convert values to floats.
import re
import string
from collections import OrderedDict
import sys

# Inputting the filepath via the 1st CL argument here.
# This is a placeholder for however you are inputting your user arguments.
# I would normally pass sys.argv[1:] to the parser module,
# but this is a quick-and-dirty script to verify that it works.
fpath = sys.argv[1]
def split_line(line):
    # r'' means the text is a "regular expression"
    return re.sub(r'"(\w+),(\w+)"', r'\1.\2', line).split(',')
def try_make_float(value):
    if isinstance(value, str):
        # printable = digits + ascii_letters + punctuation + whitespace
        # This is to catch any leading/trailing non-printable characters
        stripchars = value.strip(string.printable)
        value = value.strip(stripchars)
        # Catch NaN and hex values (anything containing letters):
        if value.upper() != value.lower():
            return None  # or can return a NaN value instead
        value = float(value)
    return value
# Initialize ordered dict:
read_data = OrderedDict()
keys = ["f in kHz", "U in mV", "amplitude in Vpp", "distance in cm"]
for key in keys:
    read_data[key] = []

with open(fpath, 'r') as file:
    file.readline()  # skip header
    while True:
        line = file.readline().strip()
        if line == '':
            break
        line = split_line(line)
        for value, key in zip(line, keys):
            read_data[key].append(try_make_float(value))
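To sanity-check the two subroutines, here is a quick run on the question’s (quote-corrected) first data line, with the helpers repeated so the snippet runs on its own:

```python
import re
import string

def split_line(line):
    # collapse a quoted "int,frac" field into int.frac, then split on commas
    return re.sub(r'"(\w+),(\w+)"', r'\1.\2', line).split(',')

def try_make_float(value):
    if isinstance(value, str):
        stripchars = value.strip(string.printable)  # non-printable edge chars, if any
        value = value.strip(stripchars)
        if value.upper() != value.lower():          # letters present: NaN, hex, ...
            return None
        value = float(value)
    return value

line = '"43,7","52,1",5,9'
print([try_make_float(v) for v in split_line(line)])
# -> [43.7, 52.1, 5.0, 9.0]
```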
I also think you may have an error in that first line of the data file, where quotes are missing; the other lines all have 4 columns each. Should the first line of data be:
"43,7","52,1",5,9