I want to learn Pandas framework, so I find free csv with Euro data from kaggle.com
https://www.kaggle.com/datasets/piterfm/football-soccer-uefa-euro-1960-2024/data
But there’re plenty of columns which looks like this
subset['goals'][1]
"[{'phase': 'FIRST_HALF', 'time': {'minute': 7, 'second': 41}, 'international_name': 'Xavi Simons', 'club_shirt_name': 'Xavi', 'country_code': 'NED', 'national_field_position': 'FORWARD', 'national_jersey_number': '7', 'goal_type': 'SCORED'}, {'phase': 'FIRST_HALF', 'time': {'minute': 18, 'second': 34}, 'international_name': 'Harry Kane', 'club_shirt_name': 'Kane', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '9', 'goal_type': 'PENALTY'}, {'phase': 'SECOND_HALF', 'time': {'injuryMinute': 1, 'minute': 90, 'second': 1}, 'international_name': 'Ollie Watkins', 'club_shirt_name': 'Watkins', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '19', 'goal_type': 'SCORED'}]"
So I’d like to extract this data and manipulate on it
example dataframe
I’ve traied to use this code
import json
stdf = subset['goals'].apply(json.loads)
# stlst = list(stdf)
# stjson = json.dumps(stlst)
# subset.join(pandas.read_json(stjson))
But for stdf = subset['goals'].apply(json.loads)
I’m getting the error message
the JSON object must be str, bytes or bytearray, not float
So, I don’t know how to solve this problem.
I guess I have to iterate over the goal column, I’ve tried something, but still the results were not what they should have been.
4
This
"[{'phase': 'FIRST_HALF', 'time': {'minute': 7, 'second': 41}, 'international_name': 'Xavi Simons', 'club_shirt_name': 'Xavi', 'country_code': 'NED', 'national_field_position': 'FORWARD', 'national_jersey_number': '7', 'goal_type': 'SCORED'}, {'phase': 'FIRST_HALF', 'time': {'minute': 18, 'second': 34}, 'international_name': 'Harry Kane', 'club_shirt_name': 'Kane', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '9', 'goal_type': 'PENALTY'}, {'phase': 'SECOND_HALF', 'time': {'injuryMinute': 1, 'minute': 90, 'second': 1}, 'international_name': 'Ollie Watkins', 'club_shirt_name': 'Watkins', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '19', 'goal_type': 'SCORED'}]"
is not JSON, but is valid python list and thus can be loaded using ast.literal_eval
that is
import ast
import pandas as pd
data_string = "[{'phase': 'FIRST_HALF', 'time': {'minute': 7, 'second': 41}, 'international_name': 'Xavi Simons', 'club_shirt_name': 'Xavi', 'country_code': 'NED', 'national_field_position': 'FORWARD', 'national_jersey_number': '7', 'goal_type': 'SCORED'}, {'phase': 'FIRST_HALF', 'time': {'minute': 18, 'second': 34}, 'international_name': 'Harry Kane', 'club_shirt_name': 'Kane', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '9', 'goal_type': 'PENALTY'}, {'phase': 'SECOND_HALF', 'time': {'injuryMinute': 1, 'minute': 90, 'second': 1}, 'international_name': 'Ollie Watkins', 'club_shirt_name': 'Watkins', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '19', 'goal_type': 'SCORED'}]"
data = ast.literal_eval(data_string)
df = pd.DataFrame(data)
print(df)
gives output
phase ... goal_type
0 FIRST_HALF ... SCORED
1 FIRST_HALF ... PENALTY
2 SECOND_HALF ... SCORED
[3 rows x 8 columns]
@Daweo – ok it’s helpful but It’s not working with NaN values. I’ve tried iterate over every row but it returned me ValueError.
I will show it in a different perspective.
this is my dataframe
data = [[2036211,'Spain','England','','',''], [2036210,'Netherlands','England',1.0,2.0,"[{'phase': 'FIRST_HALF', 'time': {'minute': 7, 'second': 41}, 'international_name': 'Xavi Simons', 'club_shirt_name': 'Xavi', 'country_code': 'NED', 'national_field_position': 'FORWARD', 'national_jersey_number': '7', 'goal_type': 'SCORED'}, {'phase': 'FIRST_HALF', 'time': {'minute': 18, 'second': 34}, 'international_name': 'Harry Kane', 'club_shirt_name': 'Kane', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '9', 'goal_type': 'PENALTY'}, {'phase': 'SECOND_HALF', 'time': {'injuryMinute': 1, 'minute': 90, 'second': 1}, 'international_name': 'Ollie Watkins', 'club_shirt_name': 'Watkins', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '19', 'goal_type': 'SCORED'}]"]]
df = pd.DataFrame(data, columns=['id_match' ,'home_team' ,'away_team' ,'home_score','away_score','goals'])
How to extract goals column to have dataframe with column id_match, home_team, away_team, home_score, away_score, goals_phase, goals_time etc.
If there are more than one goal, It returns many rows with the same id_match, home_team, away_team, home_score, away_score
columns and unique from goals
column.
1
So what you’ll need to do is first convert those string values into a dictionary. Then use explode
to convert each if those values of the keys into rows. Then ultimately use json_normalize
import pandas as pd
import ast
file = 'C:/Users/SchvaJ01/Downloads/2024.csv/2024.csv'
df = pd.read_csv(file)
# Convert string representations of lists to actual lists
df['goals'] = df['goals'].fillna('[]').apply(ast.literal_eval)
# Explode the list of dictionaries into separate rows
df_exploded = df.explode('goals')
# Normalize the nested dictionaries
df_flattened = pd.json_normalize(df_exploded['goals'])
print(df_flattened.head(5).to_string())
Output:
print(df_flattened.head(5).to_string())
phase international_name club_shirt_name country_code national_field_position national_jersey_number goal_type time.minute time.second time.injuryMinute
0 SECOND_HALF Nico Williams WILLIAMS ARTHUER ESP FORWARD 17 SCORED 47.0 10.0 NaN
1 SECOND_HALF Cole Palmer Palmer ENG MIDFIELDER 24 SCORED 73.0 9.0 NaN
2 SECOND_HALF Mikel Oyarzabal Oyarzabal ESP FORWARD 21 SCORED 86.0 56.0 NaN
3 FIRST_HALF Xavi Simons Xavi NED FORWARD 7 SCORED 7.0 41.0 NaN
4 FIRST_HALF Harry Kane Kane ENG FORWARD 9 PENALTY 18.0 34.0 NaN