This is an update to my previous blog post on ELT vs. ETL, which was specific to SSIS. This remix speaks to any data integration product or tool.
More than a simple transposition of the letters ‘T’ and ‘L’, choosing a data integration approach based on ELT or ETL can make a huge difference in your Data Integration strategy. This is true whether you are building an Operational Data Store (ODS), an Enterprise Data Warehouse (EDW), or really any solution involving ‘data movement’ and/or ‘transformations’.
It’s well known that most of the time required to build a Data Warehouse (DW) is spent on ELT/ETL development efforts. There are many great tools available for general data integration, but without adherence to key development standards and predetermined integration strategies, these efforts can consume far too much time in the development cycle. A strategy based on “ELT patterns” can help you avoid many of these time-consuming pitfalls and place you in better alignment with client expectations.
An ELT-patterned approach breaks down into two main components, in order:
· ‘EL’ (the dataflows)
· ‘T’ (the transformations)
When populating a Data Warehouse, you generally ‘Extract’ data from operational systems, preferably with the least possible overhead on the source system. On the other end of the dataflow, raw tables are ‘Loaded’ into a staging schema in the DW. This simple data pipeline is something many products handle almost effortlessly, and it forms an easily reproduced ‘pattern’ that can be duplicated very quickly.
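The ‘EL’ pattern can be sketched in a few lines. This is a hypothetical illustration using Python with in-memory SQLite databases standing in for the operational source system and the DW staging schema; the table names (`orders`, `stg_orders`) are assumptions for the example, not any tool’s convention.

```python
import sqlite3

# In-memory databases stand in for the source system and the DW.
source = sqlite3.connect(":memory:")
staging = sqlite3.connect(":memory:")

# A table on the (simulated) operational source system.
source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "Acme", 150.0), (2, "Globex", 75.5)])

# 'Extract': a plain SELECT, keeping overhead on the source minimal.
rows = source.execute("SELECT order_id, customer, amount FROM orders").fetchall()

# 'Load': land the raw rows, untransformed, in a staging table.
staging.execute("CREATE TABLE stg_orders (order_id INTEGER, customer TEXT, amount REAL)")
staging.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)
staging.commit()

print(staging.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0])  # 2
```

Because no transformation logic lives in this step, the same “source to destination” shape can be stamped out for every table in scope.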
Next (and last) are the ‘Transformations’. For a Data Warehouse, this often means populating a “Kimball style” dimensional model, which mostly consists of two types of tables: Fact and Dimension. Although these are more difficult to create as ‘patterns’ for rapid development (for this you’ll want to leverage metadata along with ‘code generators’), the most important takeaway is that all the required transformation logic can be handled directly in the database. From the perspective of orchestration, these ‘Transformation’ steps are managed simply by using ‘Execute SQL’ tasks, which perform best when running directly on the DW.
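A ‘T’ step of this kind is just a single set-based SQL statement run against the DW, which is exactly what an ‘Execute SQL’ task wraps. Here is a hypothetical sketch, again using SQLite via Python: a staging table is upserted into a dimension. SQL Server would typically use MERGE for this; SQLite’s equivalent is INSERT … ON CONFLICT. The `stg_customer`/`dim_customer` names are illustrative assumptions.

```python
import sqlite3

# Staging table and dimension live in the same DW database, so the
# transformation is one SQL statement -- the body of an 'Execute SQL' task.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE stg_customer (customer_code TEXT, customer_name TEXT)")
dw.execute("""CREATE TABLE dim_customer (
                  customer_code TEXT PRIMARY KEY,
                  customer_name TEXT)""")
dw.executemany("INSERT INTO stg_customer VALUES (?, ?)",
               [("C1", "Acme"), ("C2", "Globex")])

# The 'T' step: set-based upsert from staging into the dimension.
# (The 'WHERE true' is a SQLite parser requirement when combining
# SELECT with ON CONFLICT.)
dw.execute("""
    INSERT INTO dim_customer (customer_code, customer_name)
    SELECT customer_code, customer_name FROM stg_customer WHERE true
    ON CONFLICT (customer_code)
    DO UPDATE SET customer_name = excluded.customer_name
""")
dw.commit()

print(dw.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0])  # 2
```

Rerunning the same statement after new rows land in staging keeps the dimension current, which is what makes the step safe to schedule repeatedly.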
So now you have a complete Data Warehouse ELT solution by only using two generic tasks:
· Dataflows – “source to destination” table movement for DW staging
· Transformations – native SQL statements to populate Fact and Dimension tables
To be honest, I often get challenged for ignoring the rich ETL functionality of products such as SSIS, Informatica, or similar, where you can place all transformations inside a dataflow pipeline. However, here are my reasons for performing transformations natively in the database:
· Nothing is going to work faster for set-based operations than code in the database layer
· Database logic can be encapsulated in stored procedures, with the added benefit of providing ‘where used’ (i.e., table dependency) capability
· Much of the transformation logic, especially for dimension tables, will be based on the SQL ‘MERGE’ statement. These statements can be auto-generated by leveraging your metadata (more on this topic soon)
· Code changes are easier in the database
· More developers have experience with ‘SQL’ than with ‘ETL’ tools
· ‘SQL’-centric solutions are more ‘cloud adaptable’
Finally, it seems that all major ‘ETL’ products encourage developers to place transformations inside their pipelines, which also happens to ensure vendor lock-in. If instead you design your dataflows (‘EL’) separately from your transformations (‘T’), you will have a solution centered on just ‘SQL’, which is the true “lingua franca” of today’s popular databases.
In summary, an ELT solution to populate your Data Warehouse should provide both a superior “time to market” and improved manageability of your data integration processes. By adding “metadata management”, along with some automated pattern capabilities, you can dramatically accelerate both the construction and the maintenance of your Data Warehouse projects.
For more information on a tool to support ‘ELT’, download a trial version of eltSnap.