ETL

RICH HANEY

ETL done as ELT (Extract, Load, and Transform)


Since 2001, we have loaded data into Oracle, SAS, and other data stores in the terabyte range using a family of methods usually known as "Extract, Load, and Transform", or ELT. In this approach, the loading step precedes almost all data quality (validation), transformation, and translation steps. In our experience, ETL done de facto as Extract, Load, and Transform (ELT) is relatively common.

General approaches to ETL done in practice as ELT (Extract, Load, and Transform)

In the method used most frequently, companies designate a database area, sometimes holding data in the terabyte range or above, as a data staging area. Data is initially loaded into this staging area. During the load, only trivial checks are made, such as confirming data types and ensuring that string lengths are reasonable. The reason is that when data is not free-form text, one wants to get at least a head start on data quality work. For the most part, it is only after data is loaded into staging tables that data quality ("validation", "edit check") and transformation or translation steps are executed.
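
The staging pattern described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: it uses Python's built-in sqlite3 module in place of Oracle, and the table names (stg_orders, orders), the column names, and the MAX_LEN limit are all hypothetical. Everything is loaded as text after a trivial shape and length check; type conversion and validation run afterwards, inside the database.

```python
import sqlite3

MAX_LEN = 50  # hypothetical limit for a "reasonable" string length


def load_to_staging(conn, rows):
    """Load raw rows as text; reject only trivially malformed ones."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_orders "
        "(order_id TEXT, amount TEXT, country TEXT)"
    )
    accepted, rejected = [], []
    for row in rows:
        ok = len(row) == 3 and all(
            isinstance(v, str) and len(v) <= MAX_LEN for v in row
        )
        (accepted if ok else rejected).append(row)
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", accepted)
    return rejected


def validate_and_transform(conn):
    """After the load: validate and transform with SQL, inside the database."""
    conn.execute(
        "CREATE TABLE orders AS "
        "SELECT CAST(order_id AS INTEGER) AS order_id, "
        "       CAST(amount AS REAL) AS amount, "
        "       UPPER(TRIM(country)) AS country "
        "FROM stg_orders "
        # In SQLite, text with no numeric prefix casts to 0.0, so this
        # filter drops non-numeric and non-positive amounts.
        "WHERE CAST(amount AS REAL) > 0"
    )
    return conn.execute(
        "SELECT order_id, amount, country FROM orders ORDER BY order_id"
    ).fetchall()
```

Because the staging table keeps every accepted row in raw form, the validation query can be revised and rerun several times without repeating the load.
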
A second approach arises in statistical programming with SAS. We start by noting that the argument that using Hadoop tools such as Sqoop constitutes ETL work is valid: ETL programming applies to data stores other than relational database systems. From both a practical perspective and the formal, algebraic perspective that we prefer, storing data in the Hadoop Distributed File System (HDFS) constitutes use of a database; one loads (or "imports") data into it using methods that are best described in ETL terms. Likewise, systems of SAS datasets constitute databases; one loads data into them using PROC IMPORT statements or other steps that are also best described in ETL terms.

In this context, the initial load step in most statistical programming work is typically quite simple. The main need is to program the system so that if the initial data changes, the rest of the SAS tables (that is, the database) does not necessarily have to change. In our experience, statistical programming in which nearly all data resides in SAS datasets also frequently follows an Extract, Load, and Transform (ELT) model in practice. Data from each of the many sources typically needed is validated, transformed, and otherwise worked on after the initial load into SAS datasets.

We currently find that ELT methods, techniques that allow schemas to evolve quickly, and the rich capabilities of languages such as PL/SQL and the SAS DATA step language all work together to let us import and quality-check new data sources roughly three times faster than when using Pig and Hive on Hadoop. Since data validation and other tasks on loaded data typically take several passes, and queries for those tasks are much faster once all data has been parsed, overall run times are also usually shorter. But as Hive and Pig evolve and Apache Spark becomes more widely available, those results may change.

Our own work in ETL done in practice as ELT (Extract, Load, and Transform)

Since 2001, in connection with the above approaches, we have used a fast approach that relies comprehensively on automated discovery of data types prior to the load. This step includes automated construction of database table definitions and database loader scripts. Loading itself is done with bulk methods that are parallelized as necessary. As noted above, the transformation or translation steps follow the load steps.
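
As a rough illustration of automated type discovery, the sketch below samples a delimited file, infers the narrowest fitting column type, and emits a CREATE TABLE statement. It is an assumption-laden example and not the authors' tooling: the function names (infer_type, build_ddl), the candidate type list, the single ISO date format, and the sample size are all hypothetical choices. A loader script could be generated from the same inferred types in the same way.

```python
import csv
import io
import re
from datetime import datetime

INT_RE = re.compile(r"[+-]?\d+")
NUM_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)")


def _is_date(value):
    """True if the value parses as an ISO date (assumed format YYYY-MM-DD)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def infer_type(values):
    """Return the narrowest Oracle-style type that fits all non-empty values."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "VARCHAR2(1)"
    if all(INT_RE.fullmatch(v) for v in non_empty):
        return "INTEGER"
    if all(NUM_RE.fullmatch(v) for v in non_empty):
        return "NUMBER"
    if all(_is_date(v) for v in non_empty):
        return "DATE"
    return f"VARCHAR2({max(len(v) for v in non_empty)})"


def build_ddl(table, csv_text, sample_rows=1000):
    """Infer column types from a sample of rows and build a table definition."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = [[] for _ in header]
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for column, value in zip(columns, row):
            column.append(value)
    types = [infer_type(column) for column in columns]
    body = ",\n".join(f"  {name} {typ}" for name, typ in zip(header, types))
    return f"CREATE TABLE {table} (\n{body}\n)"
```

Inferring from a sample keeps discovery fast on large files, at the cost that a later row may not fit the inferred type; in an ELT setting such rows would surface during the post-load data quality passes.
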
We use a "high-level" version of the same approach to load vocabularies, schemas provided by regulators, and systems of rules such as data quality or "edit check" rules. We apply the methods discussed above to this high-level data, or metadata, as well. We get a head start on data quality by at least checking the basic types. But even after they are loaded, the rules, constraints, transformations, and other high-level data are, like the big data itself, still in relatively raw form.
After rules are loaded, in the transformation (or translation) step, it is often possible to use rule compilers to convert the rules to precise form. We convert the rules to the form that, according to algebraic logic, is the simplest one possible. We take a similar approach to sets of constraints, data transforms, and schema definitions from regulators. All of these are curated with the help of the database itself. After we make them active for some time interval, we can apply them to the main (big) data tables.
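
The idea of loading rules in raw form, compiling them, and applying them only while they are active can be sketched as below. This is a deliberately narrow stand-in, not the algebraic simplification the text describes: it handles only ">=" and "<=" range rules and "simplifies" them by intersecting the bounds for each column. The RawRule structure and the function names are hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RawRule:
    """One rule row as loaded, before compilation (hypothetical layout)."""
    column: str
    op: str  # ">=" or "<="
    value: float
    active_from: date
    active_to: date


def compile_rules(raw_rules, on_date):
    """Merge the rules active on `on_date` into one (low, high) pair per column."""
    bounds = {}
    for rule in raw_rules:
        if not (rule.active_from <= on_date <= rule.active_to):
            continue  # rule is not active on this date
        low, high = bounds.get(rule.column, (-math.inf, math.inf))
        if rule.op == ">=":
            low = max(low, rule.value)
        elif rule.op == "<=":
            high = min(high, rule.value)
        else:
            raise ValueError(f"unsupported operator: {rule.op}")
        bounds[rule.column] = (low, high)
    return bounds


def violations(rows, bounds):
    """Return (row_index, column) pairs whose value falls outside its interval."""
    bad = []
    for i, row in enumerate(rows):
        for column, (low, high) in bounds.items():
            if column in row and not (low <= row[column] <= high):
                bad.append((i, column))
    return bad
```

Here redundant rules on the same column (for example, age >= 0 and age >= 18) collapse into a single interval, so each data row is checked once per column rather than once per raw rule.
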

ETL done in practice as ELT also lets us make use of formal algebraic methods more easily. We suggest that those methods can make the whole process cleaner and faster overall.
