I have a project I’m working on for a data analysis course, where we pick a dataset and go through the steps of cleaning and exploring the data with a question to answer in mind. I want to see how many rows fall in each year, but right now the Year column has dtype object. Its values range from whole years like 1998, to just the last 2 digits like 87, to ranges and approximations of presumed years (‘early 1990’s’, ‘’89 or 90’, ‘2011- 2012’, ‘approx 2001’).
I’m trying to determine the best way to convert all these variations to a single proper format. Or would it be better to drop the values that are not definitive? I worry that dropping them would lose too much data, because the dataset is already pretty small (about 5000 rows total).
I have looked into regex, and it seems like that is the path I should go down to keep and alter the values, but I still don’t understand it very well conceptually, and I worry about the efficiency of filtering for so many different variations. I’m still very new to Python and pandas.
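To make the question concrete, here is a minimal sketch of the regex approach as I currently understand it. The function name `parse_year` and the `pivot` cutoff are my own placeholders, not anything from the course or the dataset. It rests on two assumptions that may not be right for this data: a range or "or" value is reduced to its first year, and a two-digit value below the pivot is read as 20xx rather than 19xx.

```python
import re

# A four-digit year in 1900-2099, not embedded in a longer run of digits.
FOUR_DIGIT = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")
# A standalone two-digit number such as 87 or 90.
TWO_DIGIT = re.compile(r"(?<!\d)\d{2}(?!\d)")


def parse_year(value, pivot=30):
    """Return the first plausible year found in value, or None.

    Assumption: two-digit values below `pivot` mean 20xx, others 19xx.
    Assumption: for ranges like '2011- 2012', the first year is kept.
    """
    text = str(value)
    match = FOUR_DIGIT.search(text)
    if match:
        return int(match.group())
    match = TWO_DIGIT.search(text)
    if match:
        two = int(match.group())
        return 2000 + two if two < pivot else 1900 + two
    return None  # nothing that looks like a year


# Sample values taken from the column described above.
samples = ["1998", "87", "early 1990’s", "’89 or 90", "2011- 2012", "approx 2001"]
parsed = [parse_year(s) for s in samples]
```

With pandas, I assume this could be applied with something like `df["Year"].map(parse_year)`, and counting the rows that come back as `None` would show how much data would actually be lost by dropping the values that can't be parsed.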