So, I’m in the process of setting up error handling for AWS Glue Jobs written in PySpark. At present, I’m working on residential address validation.
Context: For this, I want to compare against a list of valid ZIP/Postal Code prefixes for all US states (except NJ and Hawaii, but including DC and the state/ISO codes for US territories that have assigned postal codes; I can already sort those out with the state error handling, so that shouldn’t be a problem). In addition, I want to set an error for the unique ZIP Code prefixes that can’t possibly be assigned to a residential address (e.g., 88888 belongs to the USPS Santa Claus delivery service, 20521 is the redirect code for diplomatic deliveries, and 72716 is the unique code for Walmart). I have a handy list of every postal code that marks out the ones assigned as “Unique” but not why (I can’t blanket-exclude university addresses, because some people employed by a university would use that postal code prefix, so those need to be sorted out by hand).
Anyway, is there a way to import a list of these codes from somewhere into the PySpark script and compare it against the input, so I don’t have to manually set up a check for each postal code prefix in the script?
And no, I have no faith in humans to correctly input anything anymore. Nor in the compliance team, who will be throwing these CSV files into Glue until we get the entire automation process complete.
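For what it’s worth, here is roughly the shape of what I’m hoping is possible: load the reference lists as their own DataFrames and left-join them against the input, instead of hard-coding a check per prefix. This is a completely untested sketch; the S3 paths, the column names (zip, prefix), and the 3-digit prefix length are placeholders I made up, not anything from our actual files.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("address_validation").getOrCreate()

# Reference list of valid ZIP prefixes, one per row in a column called "prefix" (placeholder name).
valid_prefixes = (
    spark.read.csv("s3://my-bucket/reference/valid_zip_prefixes.csv", header=True)
    .select(F.col("prefix").alias("zip_prefix"))
    .withColumn("prefix_valid", F.lit(True))
)

# Reference list of full "Unique" ZIPs that can never be residential (88888, 20521, 72716, ...).
non_residential = (
    spark.read.csv("s3://my-bucket/reference/non_residential_zips.csv", header=True)
    .select("zip")
    .withColumn("non_residential", F.lit(True))
)

# The compliance team's input file, assumed to have a "zip" column.
addresses = spark.read.csv("s3://my-bucket/input/addresses.csv", header=True)

flagged = (
    addresses
    # Assuming a 3-digit prefix; adjust the length to whatever the reference list uses.
    .withColumn("zip_prefix", F.substring("zip", 1, 3))
    # Left joins: no match on the prefix list, or a hit on the non-residential list,
    # becomes an error code instead of a hard-coded check per prefix.
    .join(valid_prefixes, on="zip_prefix", how="left")
    .join(non_residential, on="zip", how="left")
    .withColumn(
        "zip_error",
        F.when(F.col("non_residential").isNotNull(), F.lit("NON_RESIDENTIAL_ZIP"))
         .when(F.col("prefix_valid").isNull(), F.lit("INVALID_PREFIX")),
    )
)

# Rows with a non-null zip_error are the ones that failed validation.
flagged.filter(F.col("zip_error").isNotNull()).show()
```

If that basic join approach is sound, I assume the same pattern would extend to the other address error codes as well.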
I’m asking before I attempt it. Here are my current references, because I can’t find anyone who has done something like this before:
WebLink to Download an Excel Sheet of all Zip Codes by Area and District Codes
Address Error Codes that give me hope of making this possible
And yes, I am aware that this is probably overkill.