I am working in R with some really messy address data and have been able to solve every issue except one. At the end of the address string, where the zip code is, there are often 1, 2, or 3 additional digits which I need to remove (when they are present). I cannot just substring the field because the number of trailing digits varies, so it appears I need a regex solution.
Here are some examples of how the addresses currently look:
“3660 Nogales St West Covina, CA, 9179266”
“6666 W Peoria Ave #106 Glendale, AZ, 85302174”
“10391 Friars Rd San Diego, CA, 9212051”
“7950 E Mississippi Ave Suite F Denver, CO, 8024766”
“1079 S Federal Blvd Denver, CO, 8021956”
“1420 Saratoga Ave San Jose, CA, 9512948”
I’ve tried several things which resulted in unexpected outputs. The most recent attempt, which I thought would work, is greedy matching on commas and extracting the 5 digits following the final comma group.
This is the code I used:
str_extract_all(df$address, "^\\w+(?:,\\w+)*,\\d{5}")
but this returns character(0), when I expected it to return the addresses without the trailing digits. So I suspect the issue is with my pattern rather than with the approach itself.
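To make this reproducible, here is a small sketch using three of the addresses above together with the output I am hoping for. The names addresses, expected, extra_digits and matched are placeholders for illustration only (my real values are in df$address), and the expected values assume the zip code is always the first five digits after the final comma. The grepl() call with perl = TRUE is a base R stand-in for my stringr pattern, which I am assuming behaves the same way on this input.

```r
# Illustrative sample; in my real data these values live in df$address
addresses <- c(
  "3660 Nogales St West Covina, CA, 9179266",
  "6666 W Peoria Ave #106 Glendale, AZ, 85302174",
  "10391 Friars Rd San Diego, CA, 9212051"
)

# Desired result: the same addresses with only the five-digit zip kept
expected <- c(
  "3660 Nogales St West Covina, CA, 91792",
  "6666 W Peoria Ave #106 Glendale, AZ, 85302",
  "10391 Friars Rd San Diego, CA, 92120"
)

# The number of extra digits differs per row, which is why a fixed
# substring does not work
extra_digits <- nchar(addresses) - nchar(expected)
print(extra_digits)

# Base R check of my pattern (assumed equivalent to the stringr call):
# TRUE would mean the pattern finds a match in that address
matched <- grepl("^\\w+(?:,\\w+)*,\\d{5}", addresses, perl = TRUE)
print(matched)
```

In this sample the pattern does not match any of the three addresses, which is consistent with the character(0) I am seeing.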
I am still learning regex and consider myself a beginner, so I may be making a trivial mistake or missing a feature that is critical to the code working. Any help is greatly appreciated!