Obviously, data cannot move itself: a processor somewhere must pick the data up and move it somewhere else. So, while in the previous blog article of our series Data Integration Best Practices we talked about HOW we can move data, this time we are going to look at WHAT moves data between systems and the design choices we need to consider for that.
When systems move data without intermediaries
Here we are basically talking about a system that either has some import / export / synchronization capabilities by design or allows the user to define or add such capabilities.
Some forms of this include SQL Server push/pull replication capabilities, cron jobs – i.e. jobs that “wake up” on a specified schedule – which push or pull data between systems, or cron jobs that run import or export tasks to write data to or read it from certain files (a minimal example of such a job is sketched below). Alternatively, an application might have already been designed by its manufacturer to interact with other specific systems, which is common in larger software suites such as Sage or Microsoft Dynamics 365. Last but not least, there may be custom plugins, extensions or customizations that allow one application to interact with another – something one would often find in the eCommerce space, for example, with Magento or Shopware.
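To make the cron-job variant a bit more concrete, here is a minimal sketch of a nightly export job. All names in it – the database file, the table and the output path – are illustrative assumptions, not taken from any particular product:

```python
# export_orders.py – a minimal sketch of a scheduled export job.
# Assumes a hypothetical SQLite database "shop.db" with an "orders" table;
# in practice this would point at your actual application database.
#
# A crontab entry such as the following could run it every night at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/jobs/export_orders.py

import csv
import sqlite3
from datetime import date

DB_PATH = "shop.db"                          # hypothetical source database
EXPORT_PATH = f"orders_{date.today()}.csv"   # file the target system will import

def export_orders() -> None:
    conn = sqlite3.connect(DB_PATH)
    try:
        cursor = conn.execute(
            "SELECT id, customer_id, total, created_at FROM orders"
        )
        with open(EXPORT_PATH, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "customer_id", "total", "created_at"])
            writer.writerows(cursor)
    finally:
        conn.close()

if __name__ == "__main__":
    export_orders()
```

The target system would then pick up the generated file with an import job of its own – which already hints at how quickly the number of such scripts multiplies.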
As with the other methods we have covered so far, this one, too, has its advantages and disadvantages.
Pros:
- This way of moving data can be more performant, especially if the system pair is optimized to talk to each other
- If this is a standard pairing, then setup of the integration can be quite fast
- You don’t need to add and manage a third system. And by definition, you don’t end up with vendor lock-in regarding this third system
- There is no single point of failure for all integrations
Cons:
- As the system architecture grows, it becomes more and more difficult to visualize and document it. Maintaining the configurations becomes more difficult as well.
- Building integrations is generally a long and expensive process
- You cannot replicate the integrations. Every time you need to reuse an integration, you have to build it from scratch
- A lack of standardisation in integrations means a know-how owner has to be on hand whenever changes are needed
- Logging and monitoring mechanisms are spread across systems. It is, therefore, difficult to tell which jobs are running on which system, and when
- Lifecycle management of integration logic becomes difficult
- The number of credential pairs you’d have to manage grows quadratically with the number of systems: with n systems, there can be up to n(n-1)/2 direct connections, each with its own credentials (see the calculation below this list)
- You would need to rediscover and re-implement the particular specifics of each system once per system pair, instead of once per system
- Integration errors become hard to manage
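The point about credentials and system-specific work is easy to quantify: with point-to-point integrations, every pair of systems potentially needs its own connection. A quick back-of-the-envelope calculation:

```python
# Each pair of directly integrated systems needs its own connection,
# its own credentials and its own system-specific implementation work,
# so the count grows quadratically with the number of systems.
def point_to_point_connections(n_systems: int) -> int:
    return n_systems * (n_systems - 1) // 2

for n in (3, 5, 10, 20):
    print(f"{n} systems -> up to {point_to_point_connections(n)} direct connections")
# 3 systems  -> up to 3
# 5 systems  -> up to 10
# 10 systems -> up to 45
# 20 systems -> up to 190
```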
Integration Layer / iPaaS Option
The principle here is that there is a system that is separate from the systems holding the data. This third system’s sole responsibility is to move data. There are several categories of software products and services designed for this. The Enterprise Service Bus is one of them; a more modern and lightweight solution is called integration Platform as a Service – or iPaaS. Full disclosure: elastic.io belongs to the second category and is also currently working with the Open Integration Hub foundation to produce an open source version of our iPaaS.
The premise behind these services/products is that all integrations must solve the following problems:
- Move data from inside the system to outside of the system
- Transform data from one system’s schema to another system’s schema
- Allow ID Linking
- Execute the tasks
- Monitoring, error collection, logging and other operational concerns
Many of these systems (including elastic.io) separate the tasks that transform data and the tasks that move data into separate modules; a rough sketch of this separation follows below. A dedicated piece of software is then responsible for running and coordinating these tasks. Some vendors sell the use of the software as a service, while others sell a license to software that you must run and host yourself.
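As an illustration of that separation of concerns – with purely hypothetical function and field names, not any vendor’s actual API – a single flow step could look like this:

```python
# A minimal sketch of separating "transform" from "move", plus ID linking.
# All names (the fetch/push functions, field names, ID_LINKS) are illustrative only.

from typing import Any, Callable, Dict

# "ID linking": remember which record in the target system corresponds
# to which record in the source system.
ID_LINKS: Dict[str, str] = {}  # shop order id -> ERP invoice id

def transform_order_to_invoice(order: Dict[str, Any]) -> Dict[str, Any]:
    """Transform step: map the shop's schema onto the ERP's schema."""
    return {
        "external_ref": order["id"],
        "customer": order["customer_id"],
        "amount_cents": int(round(order["total"] * 100)),
    }

def move_order(order: Dict[str, Any],
               push_invoice: Callable[[Dict[str, Any]], str]) -> None:
    """Move step: hand the transformed record to the target system and link IDs."""
    invoice = transform_order_to_invoice(order)
    invoice_id = push_invoice(invoice)   # in reality, e.g. an HTTP call to the ERP
    ID_LINKS[order["id"]] = invoice_id   # keep the link for later updates

# Example usage with a stubbed-out target system:
if __name__ == "__main__":
    move_order({"id": "A-17", "customer_id": "C-3", "total": 19.99},
               push_invoice=lambda invoice: "INV-1001")
    print(ID_LINKS)  # {'A-17': 'INV-1001'}
```

Keeping the transform step separate from the move step is exactly what makes it possible to monitor, log and reuse each part independently.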
Pros:
- Even if the number of integrations grows considerably, you still have one place to oversee them all
- Most if not all such systems are designed to ensure reusability of integrations
- Many integration layer solutions provide so-called connectors, which are responsible for connecting with an application without you having to deal with the actual code of this application (a rough sketch of the idea follows after this list)
- You have both logging and monitoring in one place. This allows you to, for instance, quickly find the source of errors and the reason for them
- You have a centralized place to control integration processes, and
- A centralized place to manage connections to other systems
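To illustrate the connector idea mentioned above – again as a sketch with assumed names rather than a real connector SDK – the point is that every system is wrapped behind the same small interface:

```python
# A minimal sketch of the "connector" idea: each system is hidden behind the same
# small interface, so integration flows never touch application-specific code.
# The class and method names are illustrative assumptions, not a real SDK.

from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable

class Connector(ABC):
    @abstractmethod
    def fetch(self, since: str) -> Iterable[Dict[str, Any]]:
        """Read records changed since the given timestamp."""

    @abstractmethod
    def push(self, record: Dict[str, Any]) -> str:
        """Write a record and return the ID assigned by the target system."""

class ShopConnector(Connector):
    """Example connector for a hypothetical shop system."""

    def fetch(self, since: str) -> Iterable[Dict[str, Any]]:
        # A real connector would call the shop's API here.
        return [{"id": "A-17", "customer_id": "C-3", "total": 19.99}]

    def push(self, record: Dict[str, Any]) -> str:
        raise NotImplementedError("the shop is read-only in this sketch")
```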
Cons:
- The added value is not immediately obvious: set against the sheer costs of integration development, such a system might at first seem overpriced
- There is a risk of a vendor lock-in
- If the integration layer solution is down, then all integrations fail by default
- Having a third system adds some performance overhead since it requires time to process and sort the information it receives and sends
So, when would you choose one approach over the other? Using the inherent integration capabilities of applications to integrate them directly makes sense when you have only a handful of them. As soon as your business or the number of automated business processes starts to grow, this kind of point-to-point integration will result in more of a spaghetti tangle than anything else. That doesn’t mean, though, that you have to buy expensive integration suites from the start. There are many services out there that fit various integration needs, from very small and simple ones to heavily complex ones.
In our next article, we’ll talk about the differences in what is being integrated. For instance, we’ll have a look at the specifics of integration with a shared authentication mechanism or the difference between event propagation and data synchronization. Stay tuned by following us on Twitter and LinkedIn!