I am trying to read a single column from a huge CSV file in pandas, using the answer from another question:
import pandas as pd
test_df = pd.read_csv("test.csv", usecols=["id_str"], engine="pyarrow")
and I obtain this error:
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 4 columns, got 3
Using a much smaller file, I can read it using just pd.read_csv without any options.
From what I have read, this problem seems to be related to the CSV file having empty cells, which are filled with NaN when pd.read_csv is used without options but cause problems in the other case.
I haven't found a solution to this problem yet. Any suggestions?
I want to read just some columns, because the file is really huge and I need just those for the analysis I have to do.
Your CSV file is broken. Somewhere down the file, at least one row has the wrong number of commas, so it has fewer columns than the parser expects (the error reports 3 where 4 were expected). You cannot reproduce this error with a (different) smaller file because that smaller data (e.g., only the top 100 rows) is formatted correctly, so your code works. Somewhere further down in your original file, at least one row does not look like the rows above it, and that causes the error (only on the original file).
This is not about missing values (e.g., np.nan, represented in CSV as ",,"). These can be parsed. It's about an incorrect number of commas in a row.
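To make the distinction concrete, here is a minimal sketch using Python's standard csv module rather than pandas or pyarrow. The sample data and the helper name field_counts are made up for illustration. A row with empty cells still has the full number of fields, while a row with a missing comma does not:

```python
import csv
import io

# Hypothetical sample data: the header declares 4 columns.
sample = (
    "id_str,a,b,c\n"
    "1,,,4\n"   # empty cells: still 4 fields, so this row parses fine
    "2,x,y\n"   # one comma missing: only 3 fields, so this row is broken
)

def field_counts(text):
    """Return the number of fields found in each row of the CSV text."""
    return [len(row) for row in csv.reader(io.StringIO(text))]

print(field_counts(sample))  # [4, 4, 3]
```

The last row is the kind that produces "Expected 4 columns, got 3".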
Assuming you want to fix the file, you will need to find the broken row and either remove it or fix its content. Try reading just a certain number of rows (the top 100, 500, 1000, ...) until you hit the error; that will narrow down where the row is. Alternatively, make a copy of the file and delete the bottom 90%, bottom 80%, ... until the error pops up.
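As an alternative to trial and error, the search for the broken row can be sketched as a single streaming pass with the standard csv module, so the huge file is never loaded into memory. This is only a sketch: the helper name find_bad_rows and the limit parameter are made up for illustration, and it assumes the first row is a header with the correct number of columns.

```python
import csv

def find_bad_rows(lines, limit=10):
    """Return (line_number, field_count) for rows whose width differs from the header.

    `lines` is any iterable of text lines, e.g. an open file object.
    Assumes the first row is a correctly formatted header.
    """
    reader = csv.reader(lines)
    expected = len(next(reader))  # number of columns declared by the header
    bad = []
    for row in reader:
        if not row:
            continue  # skip completely blank lines
        if len(row) != expected:
            # line_num is the number of physical lines read so far; for a
            # record spanning several lines it points at the record's last line
            bad.append((reader.line_num, len(row)))
            if len(bad) >= limit:
                break  # stop early once enough examples are collected
    return bad

# Small in-memory demonstration with made-up data:
demo = ["id_str,a,b,c\n", "1,2,3,4\n", "5,6,7\n", "8,,,9\n"]
print(find_bad_rows(demo))  # [(3, 3)]

# For the file from the question, something like this should work:
# with open("test.csv", newline="", encoding="utf-8") as f:
#     print(find_bad_rows(f))
```

Each reported pair gives the line to inspect in a text editor and the number of fields actually found on it.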