I keep encountering the same error: KeyError: "None of ['site'] are in the columns"
I'm going to share my entire code, because I don't know whether there have been problems with my read_csv call all along that I have been unknowingly working around.
If any trained eyes out there have time to go through it, I would be eternally grateful.
For context: C, H, E, G, and D are species types, and all the sites are in Scotland. I use 'site' and 'county' pretty interchangeably, which will need amending, and the dataset was imported from https://opendata.nature.scot/datasets/snh::waxcap-sites/explore?location=55.056158%2C1.905751%2C5.00&showTable=true. I suck at coding, but I want to be an employable environmental scientist, so I'm trying my best 😀
import pandas as pd
import numpy as np
from collections import Counter
import spacy
import matplotlib.pyplot as plt
import collections, functools, operator
import itertools
import functools
from functools import reduce
import pyproj
nlp = spacy.load('en_core_web_sm')
dataframe = pd.read_csv('Grassland_Fungi.csv', low_memory=False)
countycolumn = dataframe.iloc[:,6]
indicatorscolumn = dataframe.iloc[:,33]
C = dataframe.iloc[:,11]
H = dataframe.iloc[:,12]
E = dataframe.iloc[:,13]
G = dataframe.iloc[:,14]
D = dataframe.iloc[:,15]
# Function to remove decimal error from 'East Ross.'
def EastRoss(countycolumn):
    for error in countycolumn:
        county = error.replace(".","")
        yield county
# Function to count site occurrences
def county_occurrence(countycolumn):
    print('County Occurrence')
    countylist = []
    for county in countycolumn:
        countylist.append(county)
    a = Counter(countylist).keys()
    b = Counter(countylist).values()
    alist = []
    blist = []
    for key, value in zip(a, b):
        alist.append(key)
        blist.append(value)
    df_SO = pd.DataFrame(list(zip(alist, blist)), columns = ['site', 'number'])
    # named sorted_df so the built-in sorted() is not shadowed
    sorted_df = df_SO.sort_values('site')
    return sorted_df
# Function to create dataframe presenting indicators by county
def indicatorcount(countylist, indicatorlist):
    print('Sum of indicators per county')
    df = pd.DataFrame(list(zip(countylist, indicatorlist)), columns = ['site', 'indicators'])
    df2 = df.groupby('site').sum()
    return df2
# Function to sum CHEGD instances per county
def CHEGD_funct(countylist, Clist, Hlist, Elist, Glist, Dlist):
    print('Sum of C H E G D frequency per county')
    Cdict = pd.DataFrame(list(zip(countylist, Clist)), columns = ['site', 'C'])
    Hdict = pd.DataFrame(list(zip(countylist, Hlist)), columns = ['site', 'H'])
    Edict = pd.DataFrame(list(zip(countylist, Elist)), columns = ['site', 'E'])
    Gdict = pd.DataFrame(list(zip(countylist, Glist)), columns = ['site', 'G'])
    Ddict = pd.DataFrame(list(zip(countylist, Dlist)), columns = ['site', 'D'])
    Cdf = Cdict.groupby('site').sum().reset_index()
    Hdf = Hdict.groupby('site').sum().reset_index()
    Edf = Edict.groupby('site').sum().reset_index()
    Gdf = Gdict.groupby('site').sum().reset_index()
    Ddf = Ddict.groupby('site').sum().reset_index()
    dataframes = [Cdf, Hdf, Edf, Gdf, Ddf]
    mergedf = reduce(lambda left,right: pd.merge(left,right,on=['site'], how='outer'), dataframes).fillna('void')
    return mergedf
# Function to calculate mean CHEGD instances per county
def mean_chegd(siteoccurence, CHEGD):
    print("mean CHEGD frequency per county")
    merge = siteoccurence.reset_index(drop=True).merge(
        CHEGD.reset_index(drop=True), how="right")
    cols = ['C', 'H', 'E', 'G', 'D']
    out = (merge[cols].div(merge['number'], axis=0)
           .combine_first(merge).reindex_like(merge)).set_index('site')
    return out
# EastRoss error correct
countylist = []
for item in EastRoss(countycolumn):
    countylist.append(item)
# Site occurrence
siteoccurence = county_occurrence(countylist)
print(siteoccurence)
# Indicators per county
indicatorslist = []
for u in indicatorscolumn:
    indicatorslist.append(u)
indicators = indicatorcount(countylist, indicatorslist)
print(indicators)
for item in indicatorcount(countylist, indicatorslist):
    print(item)
# CHEGD per county
Clist = []
Hlist = []
Elist = []
Glist = []
Dlist = []
for c in C:
    Clist.append(c)
for h in H:
    Hlist.append(h)
for e in E:
    Elist.append(e)
for g in G:
    Glist.append(g)
for d in D:
    Dlist.append(d)
CHEGD = CHEGD_funct(countylist, Clist, Hlist, Elist, Glist, Dlist)
print(CHEGD)
# Mean CHEGD per county
mean = mean_chegd(siteoccurence, CHEGD)
print(mean)
# Prepare average for visualisation
average = mean.drop('number', axis=1).T
print(average)
average.columns = average.columns.str.strip()
average.columns = [col.replace("-", "") for col in average.columns]
average.set_index('site',inplace=True)
This is where I get the KeyError: "None of ['site'] are in the columns".
For a reproducible example, this is what the dataframe looks like at this point:
Site | A | B | C | D |
---|---|---|---|---|
Blue | 4 | 13 | 9 | 11 |
Green | 1 | 12 | 30 | 20 |
Yellow | 12 | 2 | 3 | 3 |
Red | 20 | 14 | 4 | 0 |
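A minimal sketch of one way this KeyError can arise, assuming (as the code above suggests) that mean_chegd has already made 'site' the index and that the frame is then transposed with .T. The toy values are copied from the table; the lowercase 'site' label and the variable names are assumptions for illustration, not the real dataset.

```python
import pandas as pd

# Toy stand-in for `mean`: values from the example table, with 'site'
# already set as the index (mean_chegd ends with set_index('site')).
mean = pd.DataFrame(
    {
        "site": ["Blue", "Green", "Yellow", "Red"],
        "A": [4, 1, 12, 20],
        "B": [13, 12, 2, 14],
        "C": [9, 30, 3, 4],
        "D": [11, 20, 3, 0],
    }
).set_index("site")

# Transposing turns the site names into the column labels.
average = mean.T

has_site_column = "site" in average.columns   # is there a column called 'site'?
columns_name = average.columns.name           # name carried by the column labels

# set_index looks for a *column* called 'site'; here there is none.
try:
    average.set_index("site")
    raised = False
    message = ""
except KeyError as err:
    raised = True
    message = str(err)

# One possible alternative (an assumption about the intent):
# transpose back so that the sites are the row labels again.
by_site = average.T

print(has_site_column, columns_name, raised)
print(message)
```

If this is what is happening, there is no 'site' column left for set_index to find after the transpose, because the site names have become the column labels. Printing average.index.name and average.columns.name just before the failing line would confirm or rule this out.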
I have tried converting it into a dictionary to see whether 'site' is recognised as the index, which gives me this output (sorry, it's not from the same example; imagine C is consistently on the same line as the name, with H, E, G, and D each on their own new line):
{'Angus': C    0.606061
H    2.787879
E    0.757576
G    0.000000
D    0.000000
Name: Angus, dtype: float64, 'Angus / East Perthshire': C    1.0
H    8.0
E    3.0
G    0.0
D    0.0
Name: Angus / East Perthshire, dtype: float64, 'Argyll': C    0.582645
H    4.111570
E    0.367769
G    0.280992
D    0.028926
Name: Argyll, dtype: float64, 'Ayrshire': C    0.702970
H    3.326733
E    0.673267
G    0.168317
D    0.089109
Name: Ayrshire, dtype: float64, 'Banffshire': C    0.241379
H    2.965517
E    0.655172
G    0.000000
D    0.000000
Name: Banffshire, dtype: float64}
which looks very, very wrong, because the 'Name:' line seems to contain two column names.
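A possible reading of this output, assuming the dictionary was built by pairing each column label with its column (the exact call isn't shown above, so the comprehension below is a guess). In that case each value is a whole pandas Series, and the trailing "Name: ..., dtype: float64" is the footer pandas prints for any Series: the Series name and its data type, not two column names. The toy frame reuses two of the values printed above.

```python
import pandas as pd

# Toy frame shaped like the transposed data: species as rows, counties as columns.
average = pd.DataFrame(
    {"Angus": [0.606061, 2.787879], "Argyll": [0.582645, 4.111570]},
    index=["C", "H"],
)

# Assumed way the dictionary was made: column label -> column (a Series).
as_series = {name: column for name, column in average.items()}

# DataFrame.to_dict() gives plain nested dictionaries instead.
as_nested = average.to_dict()

print(as_series["Angus"])   # prints the values plus a "Name: ..., dtype: ..." footer
print(as_nested["Angus"])   # prints a plain dict of row label -> value
```

The nested form from to_dict() is usually easier to inspect by eye, because it prints without the Series footer.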
I have tried:
dataframe = pd.read_csv('Grassland_Fungi.csv', header=0, delim_whitespace=True, low_memory=False)
and a few other variations, but I get this same ugly error:
Traceback (most recent call last):
  File "/Users/macbook/Desktop/mushrooms/mushrooms.py", line 16, in <module>
    dataframe = pd.read_csv('Grassland_Fungi.csv', header=0, delim_whitespace=True, low_memory=False)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 617, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1748, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 239, in read
    data = self._reader.read(nrows)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 825, in pandas._libs.parsers.TextReader.read
  File "parsers.pyx", line 913, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 15 fields in line 6, saw 16
Hmmmm, line 6 of my script is an import, which I've tried commenting out. I get that removing the header removes a field, but I'm not sure where to adjust for that in my code.
Will lots of my code need amending? And if so, is there a way around it without changing the import line?
Is 'C error' referencing my C index?
Thanks again, and sorry if this is structured terribly.
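A small sketch of what the message may be pointing at, assuming the CSV is comma-separated and some values contain spaces (such as 'East Ross'). The in-memory text below is made up for illustration, and sep=r"\s+" is used as the whitespace-splitting equivalent of delim_whitespace=True. The line number in a ParserError counts lines of the data file rather than lines of the script, and "C error" refers to pandas's C-based parser, not to a column called C.

```python
import io

import pandas as pd

# Made-up comma-separated text; the last row has a space inside a value.
csv_text = "site,C,H\nAngus,1,2\nEast Ross,3,4\n"

# Splitting on whitespace: the first two lines are one "field" each,
# but 'East Ross,3,4' splits into two, so the field counts disagree.
try:
    pd.read_csv(io.StringIO(csv_text), sep=r"\s+")
    message = None
except pd.errors.ParserError as err:
    message = str(err)

# Splitting on commas (the read_csv default) reads the same text cleanly.
frame = pd.read_csv(io.StringIO(csv_text))

print(message)
print(frame)
```

Under these assumptions the original read_csv call without delim_whitespace is the one that matches the file, so the import line itself would not need changing.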