
ETL: The Ultimate Guide

In this guide, we will explain everything about Extract, Transform, and Load (ETL), without requiring any technical skills or specific know-how.

You will learn what ETL is, where it comes from, and why it matters, and we’ll discuss some ETL examples. We will also cover the benefits and challenges, as well as the relation to SQL. After this crash course, you’ll understand the basic concepts of ETL and why it is still so important today.

Let’s get started!

1. A brief history of data

In the 1960s, databases became more and more common in businesses for storing documents, client records, and financial information. This created a need for data integration: in order to make efficient analyses, it was necessary to combine and link data from different databases within the organization.

A decade later, even more databases emerged within organizations, which meant that multiple data streams were processed in different ways and multiple data solutions were deployed. Instead of easy integration into data warehouses, the opposite occurred: a myriad of data sources spread all over the organization.

A solution was required.

2. ETL: Extract Transform and Load

ETL stands for Extract, Transform, and Load. Basically, it is a generic way of extracting (raw) data from multiple data sources into a single data environment, transforming it, and then loading it into a data warehouse or other target database. Many companies have deployed ETL tools as part of their data integration process to derive actionable insights. Furthermore, ETL provides a foundation for data analytics and machine learning workstreams.

In the next section, we elaborate on the Extract, Transform and Load parts of the ETL process.

2.1 Extract data

The first step of the ETL process is data extraction. The purpose is to extract data from multiple source systems. Data extracted from these data sources can be of different types, such as .csv formatted files, records from a relational database, or even images.

The data can be pushed to the ETL solution or the ETL solution can pull the data from the data source. In larger setups, there will be a mix of data pulled from and pushed by the various data sources. The extracted data can be both structured and unstructured data.

Examples of data sources and types:

  • data from relational databases
  • data from NoSQL servers (like images and sound files)
  • flat text files (json, yaml, markdown)
  • e-mail
  • (content scraped from) webpages
  • records from CRM systems
  • records from ERP systems
  • CSV files
  • data from internal or external data lakes
  • data from social media
  • paid 3rd party external data

As you can see, the extracted data can be internal data, but with the rise of the internet, its external data sources, and easy APIs (e.g. social media like Twitter and Facebook), external data can also be integrated into the data extraction process.
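
To make this a bit more concrete, here is a minimal extraction sketch in Python; the database file, table, and CSV file names are hypothetical. It pulls customer records from a relational source and reads records published as a CSV file.

    import csv
    import sqlite3  # stands in for any relational data source


    def extract_from_database(db_path: str) -> list[dict]:
        """Pull customer records from a relational data source."""
        connection = sqlite3.connect(db_path)
        connection.row_factory = sqlite3.Row
        rows = connection.execute("SELECT id, name, country FROM customers").fetchall()
        connection.close()
        return [dict(row) for row in rows]


    def extract_from_csv(csv_path: str) -> list[dict]:
        """Read records from a CSV file published by another source."""
        with open(csv_path, newline="") as handle:
            return list(csv.DictReader(handle))


    # the extracted (raw) data is handed over to the transform stage
    raw_data = extract_from_database("crm.db") + extract_from_csv("erp_export.csv")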

Data preparation in the extraction process

Data published by multiple sources for extraction usually needs to be prepared before it can be transformed. For example, source data may have to be reformatted in order to be ready for transformation.

An example is an XML parsing module that parses XML files (e.g. published by legacy systems) and stores the outcome in a relational database (in case the transformation is done with SQL).
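
A minimal sketch of such a parsing module, assuming a simple, hypothetical XML layout with <record> elements, could look like this in Python:

    import sqlite3
    import xml.etree.ElementTree as ET


    def load_xml_into_staging(xml_path: str, db_path: str) -> None:
        """Parse an XML export and store the records in a relational staging table."""
        connection = sqlite3.connect(db_path)
        connection.execute("CREATE TABLE IF NOT EXISTS staging_records (id TEXT, value TEXT)")
        tree = ET.parse(xml_path)
        for record in tree.getroot().iter("record"):
            connection.execute(
                "INSERT INTO staging_records (id, value) VALUES (?, ?)",
                (record.get("id"), record.findtext("value")),
            )
        connection.commit()
        connection.close()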

Other issues to deal with are empty values and incomplete source data. In order to process data properly, rules have to be implemented for these cases.

Duplicates can also be a challenge: when data is published multiple times, the overall performance of the ETL process may decrease. When lots of data are published more than once, a backlog can occur. This can be prevented by placing a file broker with a de-duplication feature between the data sources and the ETL process, preventing the same data from being published more than once. Cortex, part of the 1OPTIC product family, is such a file broker solution, capable of advanced data de-duplication.
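
Cortex’s internals are not shown here, but the idea behind content-based de-duplication can be sketched in a few lines of Python: compute a hash per published file and only forward files that have not been seen before.

    import hashlib
    from pathlib import Path

    seen_hashes: set[str] = set()


    def publish_if_new(file_path: Path) -> bool:
        """Forward a file to the ETL pipeline only if its content has not been seen before."""
        digest = hashlib.sha256(file_path.read_bytes()).hexdigest()
        if digest in seen_hashes:
            return False  # duplicate: drop it to protect throughput
        seen_hashes.add(digest)
        # hand the file over to the extraction stage here
        return True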

2.2 Transform data

Data extracted? The next step in the ETL process is to transform the extracted data into a combined (or transformed, if you like) format. This step is not always the same, since it depends on the required output and the characteristics of the source data (e.g. records, JSON, images).

The transform stage can cover a wide variety of processing steps, like the following (a short sketch follows the list):

  • data cleansing
  • combining data from multiple sources into the desired format
  • data validation
  • anonymization or pseudonymization of data
  • encryption of data
  • data quality audits
  • authentication processes for security
  • data mapping
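
As a small illustration, the sketch below combines a few of these steps (data cleansing, mapping, and validation) on raw records like the ones from the extraction example above; the field names are hypothetical.

    def transform(raw_records: list[dict]) -> list[dict]:
        """Cleanse, map, and validate raw records into the desired target format."""
        transformed = []
        for record in raw_records:
            # data cleansing: strip whitespace and skip records without a name
            name = (record.get("name") or "").strip()
            if not name:
                continue
            # data mapping: map source fields onto the target schema
            target = {
                "customer_name": name,
                "country_code": (record.get("country") or "").upper(),
            }
            # data validation: only keep records with a two-letter country code
            if len(target["country_code"]) == 2:
                transformed.append(target)
        return transformed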

2.3 Data lineage

As mentioned, data transformation consists of multiple steps, leading to a combined, processed outcome that can be used for analysis and/or reporting. But the outcome can also be the input for another transform process.

In order to keep track of all transformations, a map can be created, displaying all steps of the data transformation process. This is called data lineage. With data lineage in place, it is possible to select a specific transformed result and drill down via all steps of the underlying transform process to the original source data.

[Figure: data lineage]
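
A very simple way to record lineage is to attach to every result a reference to the step that produced it and the inputs that went into that step. The Python sketch below is only a minimal illustration of that idea, not how any particular tool implements it.

    from dataclasses import dataclass, field


    @dataclass
    class LineageNode:
        """One step in the transformation process, with links to its inputs."""
        name: str
        inputs: list["LineageNode"] = field(default_factory=list)

        def drill_down(self, depth: int = 0) -> None:
            """Print the full chain from this result back to the original source data."""
            print("  " * depth + self.name)
            for source in self.inputs:
                source.drill_down(depth + 1)


    # lineage of a weekly report: report <- aggregation <- two source extracts
    crm_extract = LineageNode("extract: CRM database")
    erp_extract = LineageNode("extract: ERP export (CSV)")
    aggregation = LineageNode("transform: combine and aggregate", [crm_extract, erp_extract])
    weekly_report = LineageNode("load: weekly report", [aggregation])
    weekly_report.drill_down()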

2.4 Load data

The final step of the ETL process is loading the transformed data into a target data source. This can be a database, a data lake, an enterprise data warehouse, or a business intelligence solution, but also simply a directory containing daily reports. The target can be local (on-premise), but it can also be a cloud location.

Subscribers can be engineers monitoring a network of systems, data scientists doing ML/AI projects, or management members requiring the weekly financial results.
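
Continuing the earlier sketches, a minimal load step could simply write the transformed records to a (hypothetical) reporting table in the target database:

    import sqlite3


    def load(transformed_records: list[dict], target_db: str) -> None:
        """Load transformed records into the target reporting table."""
        connection = sqlite3.connect(target_db)
        connection.execute(
            "CREATE TABLE IF NOT EXISTS customer_report (customer_name TEXT, country_code TEXT)"
        )
        connection.executemany(
            "INSERT INTO customer_report (customer_name, country_code) "
            "VALUES (:customer_name, :country_code)",
            transformed_records,
        )
        connection.commit()
        connection.close()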

2.5 Batch processing vs continuous data loading

In case near-real-time monitoring is required, ETL is a continuous process: streaming data is extracted, transformed, and loaded continuously.

When ETL is used for offline daily or weekly reporting, batch-wise ETL processes can be scheduled during specific off-hours (typically sometime between 1:00 and 4:00 AM).
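
Scheduling is usually left to a scheduler such as cron; purely as an illustration, the sketch below implements the same idea with nothing but the Python standard library, running a placeholder batch job every night at 02:00.

    import time
    from datetime import datetime, timedelta


    def run_batch_etl() -> None:
        print("running nightly batch ETL...")  # extract, transform, and load here


    while True:
        now = datetime.now()
        next_run = now.replace(hour=2, minute=0, second=0, microsecond=0)
        if next_run <= now:
            next_run += timedelta(days=1)  # 02:00 has already passed: run tomorrow
        time.sleep((next_run - now).total_seconds())
        run_batch_etl()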

Both use cases have their own characteristics.

Offline reporting is less complex: processes can be scheduled, performance and timing are less crucial.

Near-real-time monitoring with ETL is more complex: data may not be complete at the extraction stage and intelligent re-processing schemes must be in place. Also, good performance is important; processing delays and backlog must be prevented at any cost.

The ETL component of the 1OPTIC data science solution, Minerva ETL, is able to run many ETL pipelines in parallel, extracting (big) data from multiple data sources, transforming data (e.g. aggregations, peak hour, and KPI calculations), and loading the data to many targets. 1OPTIC supports both continuous and batch-wise data processing.

2.6 Monitoring data quality

Once the ETL pipelines, or data flows, have been deployed and transformed data is loaded to target systems, it’s important to monitor the quality of the processed data. Some useful metrics to monitor are:

  • the volume of data processed
  • processing delay
  • backlog
  • business rules

At Hendrikx ITC, Prometheus is used for implementing and monitoring these metrics. In addition, Minerva ETL has features for custom rule-based triggering.
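
As an illustration of how such metrics can be exposed, the sketch below uses the prometheus_client library for Python; the metric names are hypothetical. Prometheus scrapes the exposed endpoint, and alerting rules can then be defined on top of the metrics.

    import time

    from prometheus_client import Counter, Gauge, start_http_server

    RECORDS_PROCESSED = Counter("etl_records_processed_total", "Volume of data processed")
    PROCESSING_DELAY = Gauge("etl_processing_delay_seconds", "Delay of the last processed batch")
    BACKLOG = Gauge("etl_backlog_jobs", "Number of jobs waiting in the queue")

    start_http_server(8000)  # expose the metrics for Prometheus to scrape


    def report_batch(record_count: int, batch_timestamp: float, queue_length: int) -> None:
        """Update the monitoring metrics after a batch has been processed."""
        RECORDS_PROCESSED.inc(record_count)
        PROCESSING_DELAY.set(time.time() - batch_timestamp)
        BACKLOG.set(queue_length)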

Detecting data leakage

All extracted data should be processed. To make sure that no data is missing or leaked from the ETL pipelines, the data volumes at the beginning as well as at the end of the pipelines should be monitored. By applying some basic rule-based triggers, possible data outages can be detected at an early stage.

Examples of rule-based triggers for detecting data leakage:

if data volume = 0 for N seconds then send notification
if processed data volume decreases by 50% then send notification
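
A minimal Python sketch of such rule-based triggers (the notify function and the 50% threshold are placeholders); the same pattern applies to the delay and backlog triggers below:

    def notify(message: str) -> None:
        print("ALERT:", message)  # in practice: e-mail, chat message, or ticket


    def check_data_volumes(volume_in: int, volume_out: int, previous_volume_out: int) -> None:
        """Compare data volumes at the start and end of the pipeline."""
        if volume_out == 0:
            notify("no data processed")
        elif previous_volume_out and volume_out < 0.5 * previous_volume_out:
            notify("processed data volume dropped by more than 50%")
        if volume_out < volume_in:
            notify(f"possible data leakage: {volume_in - volume_out} records missing")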

Detecting processing delay

In online or near-real-time scenarios, data processing should be done swiftly, without any delay. Delay can be monitored with rule-based triggers like:

if data corresponding to timestamp N is not available after M seconds then send notification
if data processing step takes more than N seconds then send notification

Detecting backlog

Backlog occurs when the ETL platform is not able to process all extracted data in time and falls behind. Backlog can be detected with a rule-based trigger like:

if job queue > N jobs then send notification

Business rules

Another important item to monitor is the correctness of the published data. How thorough the data verification should be depends on the context, but in most cases basic business rules can be implemented, like:

if value X is empty then send notification
if percentage N > 100% then send notification
if percentage N < 0% then send notification
if item M > sum of sub-items N_i then send notification
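
A sketch of these basic business rules applied to a single published record (the field names are hypothetical):

    def check_business_rules(record: dict) -> list[str]:
        """Return a notification message for every violated business rule."""
        violations = []
        if record.get("value") in (None, ""):
            violations.append("value is empty")
        percentage = record.get("percentage", 0)
        if percentage > 100 or percentage < 0:
            violations.append(f"percentage out of range: {percentage}")
        if record.get("total", 0) > sum(record.get("sub_items", [])):
            violations.append("total is larger than the sum of its sub-items")
        return violations


    # example: this record triggers all three checks
    for message in check_business_rules({"value": "", "percentage": 120, "total": 10, "sub_items": [3, 4]}):
        print("ALERT:", message)  # or hand the notification to the ticketing system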

Of course more complex and context-specific business rules can be implemented.

Smart ticketing

TripAiku, part of the 1OPTIC product family, is a solution for collecting notifications generated by these rule-based triggers (monitoring data volumes, delay, and backlog) and creating smart, weighted tickets out of them.

3. Two ETL cases

ETL can be used in many contexts. We’ll discuss two cases:

  1. Business reporting in an SME
  2. Multi-vendor high-performance online KPI monitoring in a 5G mobile network

3.1 Business reporting in an SME

For SMEs, ETL can be used for daily or weekly business reporting. A typical deployment would look like this:

[Figure: business reporting scheme]

Data sources like ERP and CRM systems can be used for creating management reports on a weekly or monthly basis. External sources like social media and competitors’ websites may also be relevant.

Monitoring delay and backlog will be less important than monitoring data leakage and checking business rules.

3.2 Multi-vendor online KPI monitoring in a 5G mobile network

This case is about using ETL in big data, multi-vendor environments of mobile telecom operators:

[Figure: multi-vendor KPI monitoring scheme]

Hundreds of thousands of network elements publish Performance Management (PM) counter data every 15 minutes. This data has to be processed in order to obtain online Key Performance Indicator (KPI) values for monitoring the overall performance of the network. Some KPIs to monitor:

  • throughput in uplink and downlink
  • connection setup ratio
  • session completion ratio
  • handover success ratios

These KPI values are calculated at the antenna (cell) level and aggregated to the network level.
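
As an illustration, the sketch below computes a connection setup success ratio per cell from hypothetical PM counters and aggregates it to the network level; real counter names and formulas differ per vendor.

    def setup_success_ratio(attempts: int, successes: int) -> float:
        """Connection setup success ratio (%) for one cell and one 15-minute period."""
        return 100.0 * successes / attempts if attempts else 0.0


    # hypothetical PM counters per cell: (setup attempts, successful setups)
    cells = {
        "cell_0001": (12_000, 11_820),
        "cell_0002": (9_500, 9_310),
    }

    per_cell_kpi = {cell: setup_success_ratio(a, s) for cell, (a, s) in cells.items()}

    # network-level KPI: aggregate the raw counters, not the per-cell ratios
    total_attempts = sum(a for a, _ in cells.values())
    total_successes = sum(s for _, s in cells.values())
    network_kpi = setup_success_ratio(total_attempts, total_successes)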

Delay and backlog should be prevented as much as possible. To mitigate possible delays and backlog, data is processed in parallel and stored in separate databases.

4. Cloud versus on-premise

ETL can be implemented in a cloud environment or on-premise. Cost and performance may vary per scenario.

For data-intensive scenarios (like mobile telecom networks), on-premise is more cost-effective. Processing large volumes of data with high performance is expensive in the cloud. Organizations with big data ETL pipelines in most cases already have on-premise resources (including security policies and a CISO), making ETL deployments in the cloud less attractive.

When smaller data sets are involved and there is more room for scheduling data transformation processing, the cloud may be more cost-effective. Also, compliance and security (GDPR) are dealt with by the major cloud platforms like Azure, AWS, and GCP.

For prototyping and data exploration, the cloud is a good option since lots of features are available out of the box.

5. SQL based data pipelines

Data transformation can be done with custom scripting (e.g. Python) or modules written in high-performance languages like Rust. Another way is storing all extracted data in a transactional database and doing the transformation via SQL scripts.

All data pipelines in Minerva ETL are implemented with SQL scripts (on top of a PostgreSQL database). This has several benefits:

  • all transformations are transactional, meaning that in case of failure or an empty result, the whole action can be rolled back. This mechanism increases the overall data integrity (see the sketch after this list);
  • data types can be enforced, increasing data integrity;
  • data transformation steps can be defined in YAML- or JSON-based configuration files that can be used for automated SQL generation, increasing overall system integrity. These configuration files can also be managed via configuration management tools like SaltStack (an open-source tool);
  • change data capture: since all processed and stored data is known to the system, it is possible to do incremental loading and transforming.
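
Below is a minimal sketch of a transactional, SQL-based transformation step driven from Python; the table names are hypothetical and sqlite3 merely stands in for PostgreSQL (Minerva ETL generates its SQL from the configuration files mentioned above).

    import sqlite3  # with PostgreSQL, psycopg2 works the same way conceptually

    TRANSFORM_SQL = """
    INSERT INTO customer_report (customer_name, country_code)
    SELECT TRIM(name), UPPER(country)
    FROM staging_customers
    WHERE name IS NOT NULL AND name <> '';
    """


    def run_transform(db_path: str) -> None:
        """Run one transformation step as a single transaction."""
        connection = sqlite3.connect(db_path)
        try:
            with connection:  # one transaction: committed on success, rolled back on failure
                connection.execute(TRANSFORM_SQL)
        finally:
            connection.close()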

6. The difference between ETL and ELT

The main difference between ETL and ELT is the order of steps: ETL transforms data into structured data (via data pipelines) before loading it to subscribers. With ELT raw data is loaded directly to a target system. Data cleansing and preparation are done at a later stage and left to the subscribers.

The main advantage of ELT is that data ingestion can be less of an issue: while raw data has to be prepared and cleansed in traditional ETL processes before it is published (or loaded), ELT processes just pass the raw extracted data to a target system, skipping data pipelines. With ELT data integration is left to the subscribers.

6.1 Which one is better?

Which one is better depends on the use case.

In case of performance issues with data ingestion, an ELT strategy is preferable. Raw data can be stored as soon as possible, avoiding possible performance issues caused by data cleansing and/or data preparation.

The tradeoff is of course that the data subscribers have to do the data integration themselves.

Minerva ETL is a high-performance ETL solution designed for extracting large volumes of data, validating and preparing the data, and storing the extracted data into a transactional database.

7. We help you with your ETL strategy!

Hendrikx ITC is an ambitious team of data professionals that helps companies consolidate data, migrate data, and set up on-premise and cloud ETL solutions. We’ve also developed one of the best open-source ETL tools, called Minerva ETL.

We would love to share our ETL knowledge with you

Let us know how we can assist you and we’ll get back to you soon!