Thanks for looking!
Background
I have been tasked with writing a program to normalize and ingest data from various sources to a common database.
For the sake of simplicity, let’s say that the program is for a public library system, and that they want to maintain a database of all books currently lent out from their various branches. Let’s further assume that the branches are not linked to a common network or database (silly, I know, but please bear with me).
The task is to accept data submitted by the various branch managers, and then automate the process of normalizing that data and storing it in the common database.
Variable raw data formats
The raw data may be submitted as an MS Excel file, a .csv, a tab-delimited file, some other field-delimited file, a plain text file, possibly even just a simple email, etc.
Loosely related data contents
The raw data files will generally contain these fields:
- Book ID
- Book Title
- Author
- Is Checked Out?
- Days Overdue
- ISBN
- Due date
…and so on. The problem is that some of the submitted data files will have these fields in column headers and some will not (so I will need to infer each field from its data).
Further, the field names will not always be consistent. One library branch may call the boolean field for whether a book is checked out “OnLoan”, while another branch calls it “IsCheckedOut”.
Common data repository
All of this data will be ingested into a common database as normalized data that has been cleaned up during the ingest process.
So, hopefully, we have something like this in the final DB:
- BranchId
- BookId
- Title
- CheckedOut
- ISBN
- DueDate
- DaysOverdue
…and so on.
Automation of Ingest Process
Let’s assume that there are thousands of branches and that they must each issue this report to the Library HQ once monthly.
Obviously, my client can hire a bunch of data-entry people to do this job (in fact, that is how it is done today). Their request, however, is to automate as much of this as possible to cut data-entry costs.
So here is my plan, please suggest or criticize away:
- Standardize the file submission process. This will be handled by creating a web page with a file upload dialog. DONE!
- Determine the file type. I will be using C# (not that the language matters), and it has a pretty easy way of getting the file type, but sometimes I will simply get a .txt that turns out to be tab- or pipe-delimited, so I need an algorithm to detect this. I am thinking of using a Bayes classifier or an artificial neural network for this (see the delimiter sketch after this list).
- Attempt to parse the data into memory. Having hopefully determined whether I have an Excel file, a tab-delimited file, a .csv, etc., I will run the file through the correct parser to get it into memory, but I then need to determine whether the file has headers, or whether I can infer what the headers should be from the values. For this I hope to again use a Bayes classification system, and perhaps calculate a Levenshtein distance from each value to the items in an array of known/standardized header names (see the matcher sketch below). But what about inferring headers from the data itself? How would I identify one column as containing due dates and another as containing ISBN numbers?
- Glean, clean, and submit the values in each column. If I am lucky enough to have gotten this far (I know what the headers are), then I need to loop through the values in each column and clean/normalize them. For example, some library branches may enter an ISBN value as “ISBN12-345-67-89” whereas another branch enters “123456789”. I need to catch that difference and normalize the two (see the ISBN sketch below). Is this a case for just a plain ol’ expert system of if…then rules? Is there a better way?
- Submit normalized data to the database. This step is not as trivial as it sounds, because one library branch may report a book title as “Algorithms for Dummies” while another reports it as “Algorithms for Dummies, 1st Edition”. Let’s assume for a second that I don’t have an ISBN to tie the two books together (though they are the same): what method might be suitable for deducing that these books are the same and assigning them a common primary key in the related Books table?
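Here is the kind of delimiter check I have in mind for step two, before (or instead of) anything as heavy as a Bayes classifier. It is just a consistency heuristic, and the class and method names are mine for illustration: a candidate character that appears the same nonzero number of times on every sampled line is very likely the delimiter.

```csharp
using System.IO;
using System.Linq;

static class DelimiterSniffer
{
    // Candidate delimiters to test.
    static readonly char[] Candidates = { '\t', '|', ';', ',' };

    // Returns the first candidate that occurs a consistent, nonzero
    // number of times on every sampled line, or null if none does.
    public static char? Detect(string path, int sampleLines = 20)
    {
        var lines = File.ReadLines(path).Take(sampleLines).ToList();
        if (lines.Count == 0) return null;

        foreach (var d in Candidates)
        {
            var counts = lines.Select(l => l.Count(c => c == d)).ToList();
            if (counts[0] > 0 && counts.All(n => n == counts[0]))
                return d; // same count on every line => likely delimiter
        }
        return null; // no consistent delimiter: treat as plain text
    }
}
```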
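For the header matching in step three, this is roughly the Levenshtein approach I am picturing (the canonical names come from my target schema above; the distance threshold is a guess I would tune against real data):

```csharp
using System;
using System.Linq;

static class HeaderMatcher
{
    // Standardized column names from the target schema.
    static readonly string[] Canonical =
        { "BranchId", "BookId", "Title", "Author", "CheckedOut",
          "ISBN", "DueDate", "DaysOverdue" };

    // Classic dynamic-programming edit distance.
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Maps a raw header to the closest canonical name, or null when nothing
    // is close enough and the column needs inference or manual review.
    public static string Match(string rawHeader, int maxDistance = 3)
    {
        var cleaned = new string(rawHeader.Where(char.IsLetterOrDigit).ToArray())
            .ToLowerInvariant();
        var best = Canonical
            .Select(name => (name, dist: Levenshtein(cleaned, name.ToLowerInvariant())))
            .OrderBy(t => t.dist)
            .First();
        return best.dist <= maxDistance ? best.name : null;
    }
}
```

One thing I already notice: pure edit distance will map “IsCheckedOut” to “CheckedOut”, but it will never map “OnLoan” to it, so I expect to pair this with a synonym dictionary. The same matcher, with a length-relative threshold, might also be a starting point for the duplicate-title problem in step five.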
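And for the ISBN cleanup in step four, maybe nothing smarter than stripping everything except digits (and a possible trailing X check digit), then flagging implausible lengths for manual review:

```csharp
using System.Linq;

static class IsbnNormalizer
{
    // Strips labels, hyphens, and spaces, keeping digits plus a possible
    // trailing 'X' check digit, so "ISBN12-345-67-89" and "123456789"
    // both come out as "123456789".
    public static string Normalize(string raw) =>
        new string((raw ?? "")
            .Where(c => char.IsDigit(c) || c == 'X' || c == 'x')
            .ToArray())
        .ToUpperInvariant();

    // A real ISBN is 10 or 13 characters after normalization; anything
    // else gets flagged for manual review rather than stored silently.
    public static bool IsPlausible(string normalized) =>
        normalized.Length == 10 || normalized.Length == 13;
}
```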
Many, many thanks for your suggestions!!!
Regular batch or bulk processing of divergent data sources is a proven practice that follows a common strategy.
For the example given, there seems to be a strong intent to do a lot of dynamic data analysis that probably isn’t necessary as an ongoing activity for this kind of process, and that will add a great deal of unnecessary complexity to the example system.
Yet some of the techniques mentioned might be useful for initial format discovery, analysis, and definition where a layout cannot be provided by each source in either a timely or accurate manner. Each submission will have characteristics that distinguish it from other submissions and from other submitters. Those characteristics are markers. During the transfer or upload of a submission, we can capture some of that information. The rest of the information used as markers will come from particular inspection points in the content of the submission.
Once a format is determined (and periodically validated), we only need to locate the necessary markers in subsequent submissions to determine which format each submission follows. After we know which format a submission follows, we can choose an appropriate adaptor to apply during the batch run.
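To make that concrete, here is one minimal sketch of such a marker-to-adaptor mapping in C# (matching the asker’s language); every type and member name here is invented purely for illustration, not a prescribed design:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Markers captured during transfer (file name, declared type) and from
// inspection points in the content (delimiter, header row, column count).
record SubmissionMarkers(string BranchId, string Extension, char? Delimiter,
                         bool HasHeaderRow, int ColumnCount);

// A normalized row, shaped like the target schema.
record BookRecord(string BranchId, string BookId, string Title,
                  bool CheckedOut, string Isbn, DateTime? DueDate);

// Each known format gets an adaptor that turns its raw content into records.
interface ISubmissionAdaptor
{
    IEnumerable<BookRecord> Parse(Stream content);
}

class FormatRegistry
{
    private readonly List<(Func<SubmissionMarkers, bool> Matches,
                           ISubmissionAdaptor Adaptor)> _formats = new();

    public void Register(Func<SubmissionMarkers, bool> matches,
                         ISubmissionAdaptor adaptor) =>
        _formats.Add((matches, adaptor));

    // The first known format whose markers match wins; null routes the
    // submission to manual handling instead of a bad guess.
    public ISubmissionAdaptor Resolve(SubmissionMarkers markers) =>
        _formats.FirstOrDefault(f => f.Matches(markers)).Adaptor;
}
```

Registration then reads like recorded business knowledge, e.g. a (hypothetical) rule that one branch always sends a pipe-delimited file with a header row and seven columns.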
As for points four and five, that is slightly less than trivial, but not terribly complex. Just as with dates, the individual pieces of data may take several different forms, or formats. Each source will likely be consistent about which format it chooses in its submissions.
It then becomes a matter of matching against known formats and determining which details we must include, which we can discard, and which we may infer (based on other disconnected but related information). If we cannot infer the missing data for a format, then the data is simply in an incomplete state that may call for manual entry or rejection. There may even be well-known formats that automatically flag a record or series of records for manual entry or rejection.
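Dates illustrate the pattern. Assuming the candidate format list is built up during the initial analysis of each source, matching is just a walk down the known formats, with null meaning incomplete rather than a guessed value:

```csharp
using System;
using System.Globalization;

static class DueDateParser
{
    // Formats observed across sources during initial analysis; each source
    // tends to stick to one of them.
    static readonly string[] KnownFormats =
        { "yyyy-MM-dd", "MM/dd/yyyy", "dd-MMM-yyyy", "M/d/yy" };

    // Returns the parsed date, or null to leave the record in an incomplete
    // state for inference, manual entry, or rejection.
    public static DateTime? Parse(string raw)
    {
        foreach (var format in KnownFormats)
            if (DateTime.TryParseExact(raw?.Trim(), format,
                    CultureInfo.InvariantCulture, DateTimeStyles.None,
                    out var parsed))
                return parsed;
        return null;
    }
}
```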
Occasionally, submissions won’t follow the established format. Perhaps the entire submission, or perhaps just a few records, will not follow it. In either case, determining why is more important than determining how and what comes next. Why cannot be determined by any software except in trivial, predicted cases; why is a business question that a simple technical solution won’t answer. So it is a question that will be asked out of cycle. It may very well lead us back to a how and a what, but first things should be clear as to why the format changed and why it is broken.