I have been in an ongoing conversation concerning a project we are about to undertake at my place of work. The project concerns data integration. Our customers want to be able to integrate our data with theirs, either by sending data from their systems into ours, or by getting data from our system into theirs. I don’t believe we are talking about ‘big data’ — our entire database is less than 20 GB.
The individual transfers from our customers will be very small as well – on the order of 10s of MBs, if not smaller.
A major goal for this project is to minimize the amount of work required from the customer, who tends not to be very technical. In the past, this problem has been “solved” by having our customers send us an XML file that is transformed by XSLT before being saved to the database, or through a WCF application for exporting data. These are not seen as good solutions.
My first thought was to create a REST interface with very good monitoring tools and libraries for our clients.
Another option being considered is big tools like IBM DataStage or Pentaho (or similar).
My personal problem is that when I look at these tools, I don’t see how they can help. They appear to be aimed at setting up large-scale data transfers that happen on a regular schedule in a fixed format. Adding a new import/export with those tools requires a lot of setup both for us and the customer, which is exactly what we are trying to avoid. They are also not really useful for real-time data updates. Finally, they seem like trying to kill an ant with a flamethrower: they are heavyweights for a lightweight problem.
In contrast, with a REST setup, data can be transferred whenever required, and setup only needs to happen in a very small area.
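To make the comparison concrete, here is a rough sketch of what the client side of such a REST transfer could look like, using only the Python standard library. The endpoint URL and the XML payload are made-up placeholders, not part of any existing system.

```python
# Hypothetical client-side upload for a REST-based import endpoint.
# The endpoint path and payload shape are illustrative assumptions.
import urllib.request


def upload_xml(endpoint: str, xml_payload: bytes) -> int:
    """POST an XML document to an import endpoint and return the HTTP status."""
    request = urllib.request.Request(
        endpoint,
        data=xml_payload,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example call (endpoint is hypothetical):
# upload_xml("https://example.com/api/imports",
#            b"<orders><order id='1'/></orders>")
```

From the customer's point of view, this is the entire integration: one POST per transfer, no scheduled jobs, no mapping tool to configure.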
Am I missing something when it comes to these ETL solutions? Am I not being fair to them?
Nope, you’re spot on here. ETL is the wrong choice for this problem: ETL is for information transfers from complex proprietary formats that are regular and/or large and must integrate into other complex proprietary schemas. ETL is also often the tool of choice for business types, as it can be more user-friendly for data imports/integrations and avoids the overhead of paying for a software engineer; but your company already has SEs on staff, who can build a simpler, tailor-made import that will be more maintainable. It’s like Access: business types can throw stuff together in it, but the maintainability will be garbage over time and get worse rapidly. The up-front cost of throwing an engineer at problems like this is steeper than the business guy + Access approach, but in the long run it is way cheaper, and the results are always better for the customers, which increases marketability.
You have a simple format available which you can somewhat control. All you need is a REST endpoint (or even just FTP, with a simple FTP client you teach them to use) where they can upload a block of XML data, which it sounds like you already have tools for parsing and importing. This is a far simpler system than ETL will give you, and for the problem you are describing, the correct approach.
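The endpoint described above can be sketched with nothing but the Python standard library. The path, status codes, and the idea of validating the XML before handing it to an import pipeline are my assumptions for illustration, not a description of the asker's system.

```python
# Minimal sketch of a REST import endpoint that accepts an XML upload,
# rejects malformed documents early, and hands the parsed tree on.
# Path and status codes are illustrative choices.
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler, HTTPServer


class ImportHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/imports":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            root = ET.fromstring(body)  # reject malformed XML up front
        except ET.ParseError:
            self.send_error(400, "malformed XML")
            return
        # A real handler would pass `root` to the existing import pipeline here.
        self.send_response(202)  # accepted for processing
        self.end_headers()
        self.wfile.write(b"accepted\n")

    def log_message(self, *args):
        pass  # keep the sketch quiet


def serve(port: int = 8080) -> None:
    """Run the import endpoint (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), ImportHandler).serve_forever()
```

The point is that the whole server-side surface area is one handler; adding a new customer means handing out a URL and credentials, not configuring a new ETL job.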
In addition to the connectivity provided by REST interfaces or FTP file transfers, you will need a certain degree of data security and fault tolerance. My recommendation is therefore AS2 for secure and reliable message transfer. Mail transfer of XML messages (secured via S/MIME with X.509 PKI certificates) might be a bit slow for you, but would be an option to consider.
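AS2 itself is usually handled by a dedicated gateway product, but the S/MIME signing step it relies on can be illustrated with plain `openssl`. This is only a sketch: the throwaway self-signed certificate and the file names are made up, and a real deployment would use certificates exchanged with the trading partner.

```shell
# Sketch: sign and verify an XML payload with S/MIME via openssl.
# Generate a throwaway self-signed certificate (illustration only):
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem \
    -days 1 -subj "/CN=example-partner"

echo '<order id="1"/>' > msg.xml

# Sign the payload:
openssl smime -sign -in msg.xml -signer cert.pem -inkey key.pem -out msg.p7m

# Verify against the same certificate (no CA chain in this sketch):
openssl smime -verify -in msg.p7m -CAfile cert.pem -out verified.xml
```

The signed `msg.p7m` is the kind of artifact an AS2 gateway would wrap in an HTTP POST, together with the receipt (MDN) handling that gives you the reliability guarantees.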
There are quite a few aspects to take into account when selecting the “best” business-to-business solution:
- Reliability (error avoidance and error handling)
- Flexibility (time/effort to establish new connections)
- Operating cost
- Monitoring (technical and on business-level)
- Investments for software and hardware
- Throughput (data volume per time period)
- Latency (end-to-end delay for messages)
- Privacy
- Fraud detection/prevention
- Adherence to standards and regulations
- Availability of knowledgeable consultants
- Number of messages and maximum message size
- Message operations required (signature validation, compression, routing, mapping, format/content checking)
- Management of partner data (addresses, ports, formats, etc.)
- Archiving (for audits and support)
The domain of ETL is data migration and “big data”. A typical use case for ETL is loading thousands of gigabytes into a data warehouse for business intelligence purposes.