TL;DR
Sort of brainstorming: why is integrating systems by exchanging loosely-structured files (as opposed to XML web services) a bad idea?
TL Part
We’re offering an online solution for enterprise customers (big ones) and in implementing this we often need to integrate with their internal systems (HR databases, CRM and ERP systems, you know the drill).
One of the most common integration points is that we need information about their employees in our internal databases. To that end, we're advocating information exchange via XML web services: either we call their WS endpoint on a regular basis and ask for changes since a timestamp, or they invoke our WS and supply all the information required. Enterprisey and buzzwordy enough, sure.
Now, what we're offered is: "every week we'll manually export our HR data into an Excel workbook, email it to you, and expect this data to be where it belongs." A slightly better option: an automatically exported Excel file is shared on a network drive accessible over VPN.
Naturally, this is error-prone and doesn't scale, but it's the only cost-effective option for our customers. We somehow need to persuade them to switch over to web services, but since the final decision on whether our system will be implemented is made by someone without a technical background, we need to either present non-technical arguments or present the technical ones in a way a mere mortal can understand.
The more loosely structured the input is, the harder it is to parse. In this sense, even XML is already loosely structured. When I get a <price/> tag, I expect it to contain a number, like <price>59.90</price>, but nothing in XML itself ensures that I will actually receive a number. What about <price>USD59.90</price>, or <price>59,90 €</price>, or <price>I don't know what to put here because I'm too lazy to read the API</price>?
The nice part is that you can validate XML with a DTD or, better, an XML Schema. The same is not true for Excel.
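For illustration, here is a minimal Java sketch of that kind of validation, using an XML Schema rather than a DTD (a DTD alone cannot say "this must be a decimal"); the one-element schema is invented for the example:

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.StringReader;

public class PriceValidation {
    public static void main(String[] args) throws Exception {
        // Hypothetical one-element schema declaring <price> as xs:decimal
        String xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='price' type='xs:decimal'/>"
          + "</xs:schema>";

        Validator validator = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(xsd)))
            .newValidator();

        // Passes: 59.90 is a valid xs:decimal
        validator.validate(new StreamSource(new StringReader("<price>59.90</price>")));

        // Throws SAXException: the garbage is rejected before it reaches your code
        validator.validate(new StreamSource(new StringReader("<price>USD59.90</price>")));
    }
}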
The closer the data format is to the user, the more random the data will be, simply because users appreciate being able to add sheets in Excel, reorder columns, and so on, and they still expect any parser consuming the Excel file to figure out where the data is and how it is stored.
I had to work on an application which parsed PDF documents. Once, the phone rang in the IT support department. The customer was screaming about how much the application sucks: when he sends his PDF, the application responds that the PDF contains no text. When the customer finally sent us the PDF, we immediately understood what was wrong: it was a scanned document with no OCR; every page contained just a bitmap of the scanned text. For this customer, it was still text, and our app was expected to read it like any other text document.
This randomness makes it very difficult to work with those files. In XML, you at least expect some structure. What if you had to parse:
<product>
<product <title>Product 1
price="59.90
in-stock >19<
<product> title=Product 2
price="Fourty nine USD ;
inStock = 62
</products>
To the user, this is still XML, and they will not understand why your stupid app can't parse something that, to them, is perfectly clear.
Now, let’s get back to the arguments you can give to your stakeholder with no technical background:
1. How would the integration service be notified?
With a web service, it’s easy. You invoke it, sending the HR data to it, and that’s all.
With an Excel file on a network drive, things become complicated:
- Either the integration service constantly checks for new files on this drive, in which case the polling has a real cost in terms of bandwidth (and if the network drive is unreliable, it may raise even more issues),
- Or the integration service must be called after the Excel file is saved, in which case instead of simply using a web service, you are using the network drive and then a web service.
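To make the first option concrete, here is roughly what such a poller looks like; the share path, the interval, and importFile are all invented for the sketch:

import java.nio.file.*;
import java.util.stream.Stream;

public class DropFolderPoller {
    // Hypothetical drop folder on the VPN-mounted network drive
    private static final Path DROP_DIR = Paths.get("/mnt/hr-share/incoming");

    public static void main(String[] args) throws Exception {
        while (true) {
            // Every scan touches the network drive, even when nothing has
            // changed; an unreliable mount surfaces here as IOExceptions.
            try (Stream<Path> files = Files.list(DROP_DIR)) {
                files.filter(p -> p.toString().endsWith(".xlsx"))
                     .forEach(DropFolderPoller::importFile);
            } catch (java.io.IOException e) {
                System.err.println("Drive unreachable, retrying later: " + e.getMessage());
            }
            Thread.sleep(60_000);  // poll once a minute
        }
    }

    private static void importFile(Path file) {
        System.out.println("Importing " + file);  // placeholder for the actual import
    }
}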
2. Loading data from Excel is expensive
2. a. In terms of immediate cost
Any decent programming language can parse XML out of the box. None, I believe, can read Excel files natively. In order to read them, you must either use Microsoft Excel through COM (which is limited to the desktop version and can't be done server-side) or some third-party product.
2. b. In terms of resources
I don't have profiler results to support this, but loading data from an Excel file will very probably cost much more CPU than parsing XML.
3. Loading data from Excel is error-prone
Excel files have one problem: they are modified by users, and users may make any change they want. What if they rename a column? What if they add a sheet before the one you need to parse? What if they export the data in a format you didn't expect? A defensive importer ends up checking all of this, as in the sketch below.
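If you do end up parsing Excel, the best you can do is fail fast when the layout drifts. Here is a hedged sketch using a third-party library (Apache POI; the expected header row is invented for the example):

import org.apache.poi.ss.usermodel.*;
import java.io.File;
import java.util.List;

public class HrSheetGuard {
    // The header row agreed upon with the customer (hypothetical)
    private static final List<String> EXPECTED =
        List.of("EmployeeId", "Name", "Department", "HireDate");

    public static void checkLayout(File xlsx) throws Exception {
        DataFormatter fmt = new DataFormatter();
        try (Workbook wb = WorkbookFactory.create(xlsx)) {
            Sheet sheet = wb.getSheetAt(0);  // breaks if a sheet was inserted before ours
            Row header = sheet.getRow(0);
            for (int i = 0; i < EXPECTED.size(); i++) {
                String actual = header == null ? "" : fmt.formatCellValue(header.getCell(i));
                if (!EXPECTED.get(i).equals(actual)) {
                    // Fail loudly instead of silently importing shifted columns
                    throw new IllegalStateException("Column " + i + ": expected '"
                        + EXPECTED.get(i) + "' but found '" + actual + "'");
                }
            }
        }
    }
}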
Here’s the conclusion for the stakeholder with no technical background:
Cost-effective, you say?
Let's see. With the Excel-on-a-network-drive approach, you need to develop two systems instead of one, one of which will be hugely error-prone (heavily increasing the maintenance cost) and will require buying licenses and more powerful servers. That means:
Higher infrastructure cost;
Higher development cost;
Higher maintenance cost.
Talk to them and find out who really looks after the technical side of the source system.
It's unlikely that the data is actually held in a spreadsheet. More likely their system has a fairly easy-to-use "export to spreadsheet" function, which is all that the people you are talking to know about.
If it's a large enterprise customer, they are probably using a mega-expensive package from one of the tier-one vendors. All of these support many interfaces and have a variety of facilities for exporting data to other systems.
If you do go the file-exchange route, try to get a database extract/unload file, and try to have it generated automatically by "cron" or some other scheduler. Any manual process will fail every six months or so.
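As a sketch of what "generated automatically" can look like, a single scheduler entry on the customer's side is enough (the script name and paths here are invented):

# customer-side crontab: unload employee data every Monday at 02:00
0 2 * * 1  /opt/hr/export_employees.sh > /mnt/shared/hr_extract.csv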
I’ve done several systems that interoperate with external systems using many different transport mechanisms – serial comms, web services, sockets, MQ, file drops. Of them all, the file drop one was the easiest to implement and understand and debug.
The trouble with it is really that it is a low-tech way of passing documents, and tech people don't like that, not when there's a transactional, enterprise, service-oriented queueing mechanism to pass the same document across a network!
One really big benefit of the file drop is that you can replay messages by simply dropping the file in manually. The other benefit is logging: if you want to archive all your messages, a plain copy will put them in a different directory. An admin can also see what happened when things go wrong by simply reading the file. You'd be surprised how many mission-critical systems use this approach, as it is simple, and simple things tend to work more reliably.
For the communications, chances are you'll still need some "give me all data" communication mechanism that drives the file-drop transfer, and if that gets implemented as a web service, then you might as well transfer the files using web services too. The implementation cost to your customers should not be underestimated, however.
If, on the other hand, you're transferring documents on a regular basis (nightly, weekly, or hourly), then you don't need a notification system: when the documents are ready, read the transfer location and take all the waiting documents. Simple. (Any that were not ready, e.g. because you read too early, will simply be there for the next transfer.)
And of course, you can implement a directory-change API to trigger a read if you want on-demand notifications.
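In Java, for instance, such a directory-change trigger is only a few lines with java.nio's WatchService (the drop directory here is made up):

import java.nio.file.*;

public class DropWatcher {
    public static void main(String[] args) throws Exception {
        Path dropDir = Paths.get("/data/drop");  // hypothetical transfer location
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dropDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();       // blocks until something lands
            for (WatchEvent<?> event : key.pollEvents()) {
                Path file = dropDir.resolve((Path) event.context());
                System.out.println("New drop: " + file);  // hand off to the import here
            }
            key.reset();
        }
    }
}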
The big benefit, of course, is that it is very decoupled. All you need to do is either poll periodically or write your own file-ready notifications. All your customer has to do is write a file. The cost of implementing this is truly minimal, and any shop that doesn't have developers can do it. That's probably the biggest factor in the design decision: are your customers tech-savvy enough to write code, or not?
(For critical systems, I used an SMS messaging system that only accepted file drops of the SMS messages to send. My system did all the fancy processing it usually does, but wrote the data to a file instead of a socket, WS, or MQ. Upon processing, they wrote a status file. From my PoV, it was just another transport layer. From their PoV, they didn't have to work with me to build a two-way connected comms channel; they just delivered the exe and described the config options.)
I think the difference between the alternatives matters most when you can process the records of the file/service in parallel.
In general terms, let’s say that the file represents a set of records. Suppose each record has some id (so you can report errors tied to this id) and suppose you can create a thread-safe method to import a record into the system.
So now you have the means to process records in parallel.
This implies that you can easily create a method to which you pass a collection of records – a sub-set of the file.
So now let’s compare:
File: when you get the file, you can read it sequentially and asynchronously call the importing service for each group of records, so processing is done in parallel.
Web service: the content of the file is sent by invoking a service passing a group of records – a sub-set of the file.
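A minimal sketch of that fan-out, applicable to both cases (batch size, pool size, and importBatch are invented):

import java.util.*;
import java.util.concurrent.*;

public class ParallelImport {
    // Hypothetical thread-safe import of one batch of records
    static void importBatch(List<String> records) { /* ... */ }

    public static void main(String[] args) throws Exception {
        List<String> allRecords = new ArrayList<>();  // records read sequentially from the file
        int batchSize = 500;
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (int i = 0; i < allRecords.size(); i += batchSize) {
            List<String> batch = allRecords.subList(i, Math.min(i + batchSize, allRecords.size()));
            pool.submit(() -> importBatch(batch));    // each sub-set is imported in parallel
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}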
These are the advantages in my mind of the web service over file transfer:
- Performance: with a file you are limited to a single process that reads the file and processes it. Suppose it takes 50 hours to process. With a service you can scale as necessary: just add servers to your cluster. With 10 servers, for example, the same job can be done in about 5 hours.
- Performance: also, with web services, processing may start as soon as the first record is ready (it would not be practical to send a single record per request; better to send a group of records per request), but with a file, processing cannot start until much, much later.
- Reporting back: during processing there might be some bad/conflicting records that you need to communicate back. With a file transfer you have to worry about how you will communicate the report/results of the import; you need to invent and invest in another mechanism. Perhaps you create a result file and let the customer download it, but how will they know when to start the download?
With a web service you simply return a response containing the report (see the sketch after this list). Note that the web service may have a life span: a start request, the records sent as groups over N requests, then an ending request whose response carries the report/results.
- Data contract communication: with a file you need to invent a way to communicate the syntax of the request and response to your customer. You don't need to do that with a service (tools like Swagger make it even easier).
- Early data contract validation: the customer can validate their data easily if you just create a stub service method.
- Privacy/Security: with file transfer there is more potential for privacy issues that are not there when you use a secure service, namely:
1) once you create the file at the origin, someone there can read/steal it;
2) the same goes for when the file arrives at its destination;
3) you need to deal with transferring the file securely (encryption?);
4) you may have security/privacy issues with the results report that you need to send back.
- Syntax: with a file you need to read the content and validate its syntax, and you also need to generate the response according to the contract. With a service, the container creates the Java request object from the payload before your service code is even entered, and it creates the response too, so neither you nor your customer has to deal with that.
- File detection: with a file you need a mechanism to detect when the file has completed its transfer and is ready to be processed; you don't need that with a service. Also, if this file-detection mechanism fails you are stuck, so you need a way to resurrect it in case of a crash. With a web service you have several servers in the cluster, and any of them can process the request.
- Potential for communication failure: the larger the file, the more likely the transfer will fail and have to be repeated.
- Massive redundant file I/O: with file transfer there is the cost of creating the file at the origin, writing the file upon receiving it, and reading the file to process it. All of this time is saved when the web service is called instead.
- Need for a file transfer mechanism: with file transfer you need to invest in and provide the means of transferring files (FTP?), whereas you probably already have a web server.
- Reputation: File transfer is low-tech – it makes a bad impression.
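Here is the promised sketch of the "report comes back in the same response" idea. It is only an illustration: the class and field names are invented, and JAX-RS with JAXB is just one possible container for it:

import javax.ws.rs.*;
import javax.ws.rs.core.MediaType;
import javax.xml.bind.annotation.XmlRootElement;
import java.util.*;

@Path("/employees")
public class EmployeeImportService {

    @XmlRootElement
    public static class EmployeeBatch {
        public List<EmployeeRecord> records = new ArrayList<>();
    }
    public static class EmployeeRecord {
        public String id, name, department;
    }
    @XmlRootElement
    public static class ImportReport {
        public int imported;
        public List<String> errors = new ArrayList<>();  // "id: problem" entries
    }

    @POST
    @Consumes(MediaType.APPLICATION_XML)
    @Produces(MediaType.APPLICATION_XML)
    public ImportReport importBatch(EmployeeBatch batch) {
        ImportReport report = new ImportReport();
        for (EmployeeRecord r : batch.records) {
            if (r.id == null || r.id.isEmpty()) {
                report.errors.add("(missing id): record rejected");
            } else {
                // ... thread-safe import into the database would go here ...
                report.imported++;
            }
        }
        return report;  // the caller gets the report in the same round trip
    }
}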
File based:
- If either system is down, unprocessed data stays put and is eventually brought back in sync.
- Dead simple to implement.
- High throughput.
- Use a special sourcefile.trigger file to signal that writing has finished for sourcefile (sketched below).
- Easy testing of the consumer service.
- Latest data available after a delay.
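A minimal sketch of that trigger-file convention (the directory layout and file names are illustrative):

import java.nio.file.*;

public class TriggerFileDrop {
    // Producer side: write the data file completely first, then create the
    // empty trigger file so the consumer knows the write has finished.
    static void publish(Path dir, byte[] data) throws Exception {
        Files.write(dir.resolve("sourcefile"), data);
        Files.createFile(dir.resolve("sourcefile.trigger"));
    }

    // Consumer side: only pick up sourcefile once its trigger exists.
    static boolean readyToRead(Path dir) {
        return Files.exists(dir.resolve("sourcefile.trigger"));
    }
}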
Web services based:
- Data loss unless timestamp-based fetching is used or the data is buffered.
- Latest data always available if the producer calls the consumer immediately.
Both:
- File/Data structure has to be strictly agreed upon.