Drive Data Warehouse Integration via Metadata
There is an acronym ‘RBAR’ that captures well the pain inflicted by poorly performing database logic. But what about the repetitive discomfort of ‘TBAT’ (Table by Agonizing Table) when creating an Enterprise Data Warehouse (EDW)?
Operational systems (ERP, CRM, etc.) often have hundreds, if not thousands of tables, and the movement of this data into a Data Warehouse platform can become a time-consuming, never-ending, and error-prone task. If you add to this the finesse required to handling full and incremental table refreshes, along with inevitable schema changes, you can quickly succumb to TBAT maintenance on the data road to building your EDW.
Most EDWs use an ‘intermediate layer’ to marshal data before formal ‘Fact’ and ‘Dimension’ modeling. This intermediate layer is often referred to as an Operational Data Store (ODS), which for purposes of this article, suggests this data exists in a relational database system (as opposed to a Data Lake that may also contain a variety of semi and/or unstructured data).
An ODS, housing data from multiple sources, provides a convenient platform to:
Construct and conform dimensions
Implement Data Governance
Establish Master Data
In addition, an ODS can be a practical platform to share raw (or ‘unmodeled’) data within an organization. This allows for faster time-to-market on granting access, while also providing official controls and better consistency to the common practice of informal, departmental data sharing.
To complete our stage (before digging into the metadata driven topic) there is a pre-ODS layer to consider, often referred to as a Landing Zone (LZ). From a practical perspective, a Landing Zone (LZ) may be used to:
Profile and examine data prior to the ODS
Buffer the ODS from breaking changes (i.e., new or missing columns)
Resolving subtle data variances such as encoding and collation structures
Populating the LZ is designed to be a basic Extract and Load (EL) operation. It is the first stop of data arriving from Operational Systems which may involve a variety of transport mechanisms. For large tables, it’s critical that this (EL) data movement supports incremental loads, where only changes are propagated from the originating systems.
Driving it all via Metadata
Now for the mechanics of populating the Landing Zone (LZ). Step one is to determine which tables should be replicated from your Operational System(s), as well as which ones can be supported with incremental loads. Here is the first place where TBAT can surface, so let’s consider a data-driven approach.
Most relational database systems provide metadata views as well as system tables which can facilitate automated integration. For example, the domain of tables available for data extraction, and columns that could be referenced for incremental data pulls. If a subset of tables is desired, metadata can be used to filter objects based on schema names, table names, or related pattern.
In the event the metadata itself is not enough to determine an official list, control tables can be introduced to qualify the list of tables needed to be replicated. These control tables are typically based on an include/exclude lists of schemas and/or table names. Exclude lists are more permissible when new tables are created in your source systems, whereas Include lists are by design more restrictive. In any event, the idea is to make your Extract and Load process as flexible and extensible as possible, and to not require new processes or coding when simply adjusting the tables in your replication scheme.
Metadata can also assist when table structures change, which introduces the concept of “self-healing”. This is surprisingly straight forward when using the common relational metadata view: “INFORMATION_SCHEMA.COLUMNS”, comparing the source and destination table definitions, and then automating corrective “ALTER TABLE…” commands. This is easiest between the Operational System and the LZ since almost no transformations (T) are required. Moving table structures (for self-healing) from the LZ to the ODS may be a little more difficult since tables may be altered and/or split in this movement to accommodate organizational security constraints. Finally, table structures in the EDW, frequently conforming to the Kimball Dimension Model, are probably not candidates for Self-healing, but the author would love to know if someone has figured this one out…
In summary, not leveraging your metadata to automate table replication on the road to your data warehouse, can lead to high maintenance TBAT solutions that will most likely bottleneck your best efforts. Worse, when you are managing hundreds, if not thousands of tables in your LZ/ODS/EDW platforms and considering these databases will probably live in multiple environments (e.g., Dev/Test/Prod), driving all this ELT/ETL via metadata is probably your only option.
For more information on a tool to support this and other ‘ELT’ solutions, see our flagship product: eltSnap.