Thanks for looking!
Background
I have been tasked with writing a program to normalize and ingest data from various sources to a common database.
For the sake of simplicity, let’s say that the program is for a public library system, and that they want to maintain a database of all books currently lent out from their various branches. Let’s further assume that the branches are not linked to a common network or database (silly, I know, but please bear with me).
The task is to accept data submitted by the various branch managers, and then automate the process of normalizing that data and storing it in the common database.
Variable raw data formats
The raw data may be submitted as an MS Excel file, a .csv, a tab-delimited file, some other field-delimited file, a plain text file, possibly even just a simple email, etc.
Loosely related data contents
The raw data files will generally contain these fields:
- Book ID
- Book Title
- Author
- Is Checked Out?
- Days Overdue
- ISBN
- Due date
…and so on. The problem is that some of the submitted data files will have these fields in column headers and some will not (so I will need to infer each field from its data).
Further, the field names will not always be consistent. One library branch may call the boolean field for whether a book is checked out “OnLoan”, while another branch calls it “IsCheckedOut”.
Common data repository
All of this data will be ingested into a common database as normalized data that has been cleaned up during the ingest process.
So, hopefully, we have something like this in the final DB:
- BranchId
- BookId
- Title
- CheckedOut
- ISBN
- DueDate
- DaysOverdue
…and so on.
Automation of Ingest Process
Let’s assume that there are thousands of branches and that they must each issue this report to the Library HQ once monthly.
Obviously, my client can hire a bunch of data-entry people to do this job (in fact, that is how it is done today). Their request, however, is to automate as much of this as possible to cut data-entry costs.
So here is my plan, please suggest or criticize away:
- Standardize the file submission process. This will be handled by creating a web page with a file upload dialog. DONE!
- Determine the file type. I will be using C# (not that the language matters), and it has a pretty easy way of getting the file type, but sometimes I will simply get a .txt that turns out to be tab- or pipe-delimited, so I need an algorithm to detect this. I am thinking of using a Bayes classifier or an artificial neural network for this (see the delimiter sketch after this list).
- Attempt to parse the data into memory. Having hopefully determined whether I have an Excel file, a tab-delimited file, a .csv, etc., I will run the file through the correct parser to get it into memory, but I then need to determine whether the file has headers, or whether I can infer what the headers should be from the values. For this I hope to again use a Bayes classification system, and perhaps calculate a Levenshtein distance from each value to the items in an array of known/standardized header names (see the matcher sketch below). But what about inferring headers from the data itself? How would I identify one column as containing due dates and another as containing ISBN numbers?
- Glean, clean, and submit the values in each column. If I am lucky enough to have gotten this far (I know what the headers are), then I need to loop through the values in each column and clean/normalize them. For example, some library branches may enter an ISBN value as “ISBN12-345-67-89” whereas another branch enters “123456789”. I need to catch that difference and normalize the two (see the ISBN sketch below). Is this a case for just a plain ol’ expert system of if…then rules? Is there a better way?
- Submit normalized data to the database. This step is not as trivial as it sounds, because one library branch may report a book title as “Algorithms for Dummies” while another reports it as “Algorithms for Dummies, 1st Edition”. Let’s assume for a second that I don’t have an ISBN to tie the two books together (though they are the same): what method might be suitable for deducing that these books are the same and assigning them a common primary key in the related Books table?
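Here is the kind of delimiter check I have in mind for step two, before (or instead of) anything as heavy as a Bayes classifier. It is just a consistency heuristic, and the class and method names are mine for illustration: a candidate character that appears the same nonzero number of times on every sampled line is very likely the delimiter.

```csharp
using System.IO;
using System.Linq;

static class DelimiterSniffer
{
    // Candidate delimiters to test.
    static readonly char[] Candidates = { '\t', '|', ';', ',' };

    // Returns the first candidate that occurs a consistent, nonzero
    // number of times on every sampled line, or null if none does.
    public static char? Detect(string path, int sampleLines = 20)
    {
        var lines = File.ReadLines(path).Take(sampleLines).ToList();
        if (lines.Count == 0) return null;

        foreach (var d in Candidates)
        {
            var counts = lines.Select(l => l.Count(c => c == d)).ToList();
            if (counts[0] > 0 && counts.All(n => n == counts[0]))
                return d; // same count on every line => likely delimiter
        }
        return null; // no consistent delimiter: treat as plain text
    }
}
```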
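For the header matching in step three, this is roughly the Levenshtein approach I am picturing (the canonical names come from my target schema above; the distance threshold is a guess I would tune against real data):

```csharp
using System;
using System.Linq;

static class HeaderMatcher
{
    // Standardized column names from the target schema.
    static readonly string[] Canonical =
        { "BranchId", "BookId", "Title", "Author", "CheckedOut",
          "ISBN", "DueDate", "DaysOverdue" };

    // Classic dynamic-programming edit distance.
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Maps a raw header to the closest canonical name, or null when nothing
    // is close enough and the column needs inference or manual review.
    public static string Match(string rawHeader, int maxDistance = 3)
    {
        var cleaned = new string(rawHeader.Where(char.IsLetterOrDigit).ToArray())
            .ToLowerInvariant();
        var best = Canonical
            .Select(name => (name, dist: Levenshtein(cleaned, name.ToLowerInvariant())))
            .OrderBy(t => t.dist)
            .First();
        return best.dist <= maxDistance ? best.name : null;
    }
}
```

One thing I already notice: pure edit distance will map “IsCheckedOut” to “CheckedOut”, but it will never map “OnLoan” to it, so I expect to pair this with a synonym dictionary. The same matcher, with a length-relative threshold, might also be a starting point for the duplicate-title problem in step five.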
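And for the ISBN cleanup in step four, maybe nothing smarter than stripping everything except digits (and a possible trailing X check digit), then flagging implausible lengths for manual review:

```csharp
using System.Linq;

static class IsbnNormalizer
{
    // Strips labels, hyphens, and spaces, keeping digits plus a possible
    // trailing 'X' check digit, so "ISBN12-345-67-89" and "123456789"
    // both come out as "123456789".
    public static string Normalize(string raw) =>
        new string((raw ?? "")
            .Where(c => char.IsDigit(c) || c == 'X' || c == 'x')
            .ToArray())
        .ToUpperInvariant();

    // A real ISBN is 10 or 13 characters after normalization; anything
    // else gets flagged for manual review rather than stored silently.
    public static bool IsPlausible(string normalized) =>
        normalized.Length == 10 || normalized.Length == 13;
}
```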
Many, many thanks for your suggestions!!!
Regular batch or bulk processing of divergent data sources is a proven practice that follows a common strategy.
For the example given, there seems to be a strong intent to do a lot of dynamic data analysis that probably isn’t necessary as an ongoing activity for this kind of process, and that will add a great deal of unnecessary complexity to the example system.
Yet some of the techniques mentioned might be useful for initial format discovery, analysis, and definition where a layout cannot be provided by each source in either a timely or accurate manner. Each submission will have characteristics that distinguish it from other submissions and from other submitters. Those characteristics are markers. During the transfer or upload of a submission, we can capture some of that information. The rest of the information used as markers will come from particular inspection points in the content of the submission.
Once a format is determined (and periodically validated), we only need to locate the necessary markers in subsequent submissions to determine which format each submission follows. After we know which format a submission follows, we can choose an appropriate adaptor to apply during the batch run.
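To make that concrete, here is one minimal sketch of such a marker-to-adaptor mapping in C# (matching the asker’s language); every type and member name here is invented purely for illustration, not a prescribed design:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Markers captured during transfer (file name, declared type) and from
// inspection points in the content (delimiter, header row, column count).
record SubmissionMarkers(string BranchId, string Extension, char? Delimiter,
                         bool HasHeaderRow, int ColumnCount);

// A normalized row, shaped like the target schema.
record BookRecord(string BranchId, string BookId, string Title,
                  bool CheckedOut, string Isbn, DateTime? DueDate);

// Each known format gets an adaptor that turns its raw content into records.
interface ISubmissionAdaptor
{
    IEnumerable<BookRecord> Parse(Stream content);
}

class FormatRegistry
{
    private readonly List<(Func<SubmissionMarkers, bool> Matches,
                           ISubmissionAdaptor Adaptor)> _formats = new();

    public void Register(Func<SubmissionMarkers, bool> matches,
                         ISubmissionAdaptor adaptor) =>
        _formats.Add((matches, adaptor));

    // The first known format whose markers match wins; null routes the
    // submission to manual handling instead of a bad guess.
    public ISubmissionAdaptor Resolve(SubmissionMarkers markers) =>
        _formats.FirstOrDefault(f => f.Matches(markers)).Adaptor;
}
```

Registration then reads like recorded business knowledge, e.g. a (hypothetical) rule that one branch always sends a pipe-delimited file with a header row and seven columns.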
As for points four and five, that is slightly less than trivial, but not terribly complex. Just as with dates, the individual pieces of data may take several different forms, or formats. Each source will likely be consistent about which format it chooses in its submissions.
It then becomes a matter of matching against known formats and determining which details we must include, which we can discard, and which we may infer (based on other disconnected but related information). If we cannot infer the missing data for a format, then the data is simply in an incomplete state that may call for manual entry or rejection. There may even be well-known formats that automatically flag a record or series of records for manual entry or rejection.
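Dates illustrate the pattern. Assuming the candidate format list is built up during the initial analysis of each source, matching is just a walk down the known formats, with null meaning incomplete rather than a guessed value:

```csharp
using System;
using System.Globalization;

static class DueDateParser
{
    // Formats observed across sources during initial analysis; each source
    // tends to stick to one of them.
    static readonly string[] KnownFormats =
        { "yyyy-MM-dd", "MM/dd/yyyy", "dd-MMM-yyyy", "M/d/yy" };

    // Returns the parsed date, or null to leave the record in an incomplete
    // state for inference, manual entry, or rejection.
    public static DateTime? Parse(string raw)
    {
        foreach (var format in KnownFormats)
            if (DateTime.TryParseExact(raw?.Trim(), format,
                    CultureInfo.InvariantCulture, DateTimeStyles.None,
                    out var parsed))
                return parsed;
        return null;
    }
}
```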
Occasionally, submissions won’t follow the established format. Perhaps the entire submission, or perhaps just a few records, will not follow it. In either case, determining why is more important than determining how and what comes next. Why cannot be determined by any software except in trivial, predicted cases; why is a business question that a simple technical solution won’t answer. So it is a question that will be asked out of cycle. It may very well lead us back to a how and a what, but first things should be clear as to why the format changed and why it is broken.