Simplifying ETL


I have just got back from the Netezza user conference, where there was an interesting presentation on using IBM's Information Server in conjunction with Netezza. The major bone of contention during the presentation, which caused much discussion, was that while most customers and prospects for Netezza were drawn, at least in part, to its simplicity, the last thing they wanted was complex data integration software front-ending it.

You can see their point. On the other hand, if you have 15 data sources, or 50, then you are dealing with a complex problem and hiding that complexity isn't necessarily easy. However, it did get me thinking about how much easier it could be. So here are some thoughts (which are not specific to IBM, by the way).

The first thing I would address is semantics, though I don’t think that having a semantic layer (à la expressor) is actually necessary: at this level a thesaurus is all you really need. This would know that “account_no” is the same as “custID” and “clientnum” for example.

It would be shipped with common synonyms pre-built but should be extensible so that users can define their own equivalences where necessary. It should probably also include support for concatenations and decompositions. For example, you might have a product category field (“AB”) and a product number field (“12345”) that you want to join together in the warehouse or, conversely, you might have a combined field that you want to split.
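A thesaurus of this kind can be sketched very simply. The code below is a minimal illustration, not any vendor's implementation: the synonym table, the canonical name "account_id" and the two-character product category are all invented for the example, matching the "AB"/"12345" case above.

```python
# Hypothetical column-name thesaurus: many physical names, one canonical name.
SYNONYMS = {
    "account_no": "account_id",
    "custID": "account_id",
    "clientnum": "account_id",
}

def canonical(name: str) -> str:
    """Resolve a physical column name to its canonical equivalent."""
    return SYNONYMS.get(name, name)

# Concatenation and decomposition: join a product category ("AB") and a
# product number ("12345") for the warehouse, or split a combined field.
def compose_product(category: str, number: str) -> str:
    return category + number

def split_product(combined: str, category_len: int = 2):
    return combined[:category_len], combined[category_len:]

print(canonical("custID"))             # account_id
print(compose_product("AB", "12345"))  # AB12345
print(split_product("AB12345"))        # ('AB', '12345')
```

Shipping the synonym table pre-built but user-extensible is then just a matter of letting users add their own entries to it.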

Now, of course you can do this stuff using rules, but what I am really advocating here is that rules creation and management should be reserved for those transformations that are genuinely complex, and the rest should be automated. By separating the thesaurus from rules management you can automate the former in a way that is not possible across the board.

The second area where more automation could be provided is in datatype conversions. If "account_no" in your OLTP system is numeric but it is alphanumeric in your warehouse then a) the data integration software should recognise this fact and b) it should be able to perform the transformation automatically, without your having to tell it to do so. Tools like expressor that use semantic typing can do this but, again, you don't necessarily need semantics for this purpose.
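To make the point concrete, here is a toy sketch of that automatic conversion, using the "account_no" example. The source and target schemas are made up; a real tool would read them from the system catalogues rather than hard-coding them.

```python
# Hypothetical schemas: "account_no" is numeric in the OLTP source but
# alphanumeric (a string) in the warehouse target.
SOURCE_SCHEMA = {"account_no": int}
TARGET_SCHEMA = {"account_no": str}

def convert_row(row: dict) -> dict:
    """Coerce each value to the target schema's type, automatically."""
    out = {}
    for col, value in row.items():
        target_type = TARGET_SCHEMA.get(col)
        if target_type is not None and not isinstance(value, target_type):
            value = target_type(value)  # e.g. int 12345 -> str "12345"
        out[col] = value
    return out

print(convert_row({"account_no": 12345}))  # {'account_no': '12345'}
```

The user never writes a rule for this: the mismatch is detected from metadata and resolved by the tool.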

Reuse could also be made simpler. If you are going to perform some sort of complex transformation on data from tables X and Y, say, then it cannot be beyond the wit of man for the software to automatically display to you (in a pop-up window perhaps) all the existing transformations that have been defined across those two tables. The big problem with encouraging reuse is that it's a hassle finding what's been done previously: so make it really simple.
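The mechanics of that lookup are not hard, which is rather the point. A sketch, with invented transformation names and the tables X and Y from above: a registry keyed by the set of tables each transformation touches, queried when the user starts work on a new one.

```python
from collections import defaultdict

# Registry of existing transformations, keyed by the tables they touch.
registry = defaultdict(list)

def register(name: str, tables: set):
    registry[frozenset(tables)].append(name)

def existing_transforms(*tables):
    """Surface every transformation defined across the given tables."""
    key = frozenset(tables)
    return [name
            for tbls, names in registry.items() if key <= tbls
            for name in names]

register("join_X_to_Y", {"X", "Y"})   # names are hypothetical
register("dedupe_Y", {"Y"})

print(existing_transforms("X", "Y"))  # ['join_X_to_Y']
```

Popping that result up when the user selects tables X and Y is exactly the "make it really simple" step.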

There’s one other biggie that could make a big difference towards removing complexity but unfortunately IBM mentioned it to me rather than the other way around and as it is in their roadmap (not all of the above are) and they want it to be an unpleasant surprise for their competition, you’ll have to wait until that feature is announced.

Finally, one of the interesting things that IBM has done with Information Server to support Netezza is to allow jobs to be run in either ETL or ELT mode. So, you can push joins and processing to either sources or targets, or run on the Information Server platform, as suits you best. This is an excellent idea; I like it a lot.

However, the problem is that you have to define even more rules about when you do this: for example, “between 8.00am and 5.00pm use an ETL-based approach but otherwise use ELT”. Wouldn’t a much simpler option (for the user) be if the data integration product had its own optimiser, collected statistics and so forth, just like a database does, generated an “integration plan” (as opposed to a query plan) and generally automated the whole process of where you perform relevant tasks?
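The idea can be sketched in a few lines. Everything here is invented for illustration: the cost model, the load statistics and the multipliers are stand-ins for what a real optimiser would derive from collected statistics, but the shape is the same, costing each plan and picking the cheaper one rather than asking the user for a time-of-day rule.

```python
# Toy cost-based "integration plan" chooser: ETL vs ELT decided from
# statistics, not from hand-written rules. All numbers are made up.
def choose_strategy(row_count: int, target_load: float, server_load: float) -> str:
    # ELT pushes work to the target database, so its cost rises with the
    # target's current load; ETL runs on the integration server, with an
    # assumed overhead for moving the data there and back.
    elt_cost = row_count * (1 + target_load)
    etl_cost = row_count * (1 + server_load) * 1.2  # transfer overhead
    return "ELT" if elt_cost < etl_cost else "ETL"

# Busy integration server, quiet warehouse: push the work down.
print(choose_strategy(1_000_000, target_load=0.1, server_load=0.8))  # ELT
# Quiet server, busy warehouse: keep the work on the integration platform.
print(choose_strategy(10_000, target_load=0.9, server_load=0.1))     # ETL
```

A real optimiser would of course cost joins, filters and network transfer separately, but the user-facing simplification is the same: the decision moves from configuration into the engine.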

The Netezza users are right: data integration is complex but it behoves vendors of data integration solutions to make it as simple for users as possible.

Philip Howard is Research Director (Data Management) at Bloor Research. Data management refers to the management, movement, governance and storage of data and involves diverse technologies that include (but are not limited to) databases and data warehousing, data integration (including ETL, data migration and data federation), data quality, master data management, metadata management and log and event management. Philip also tracks spreadsheet management and complex event processing.