The Twelve Days of AWS: Data Pipelines


Extract, Transform & Load (ETL) is the name of the game when it comes to Data Pipelines. 

The Extract step acquires data from one or more sources; that data then passes through Transform, where any alterations it needs are applied, before being Loaded into another data store, such as Redshift or S3, to name a couple.

You might have a MySQL database of normalized data that powers a website, but need to take the sales records (Extract), flatten the table data into discrete records (Transform) and store them in your Redshift Data Warehouse (Load).
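As a rough sketch of how that scenario might map onto a pipeline definition: the source, staging area and warehouse become data nodes, and the Extract and Load steps become activities, with the flattening handled here by the select query on the source node. All of the names, tables and the bucket below are illustrative, the referenced database, EC2 resource and Redshift table objects are omitted for brevity, and the field names are from memory of the documented object types, so check them against the current Data Pipeline docs.

```json
{
  "objects": [
    {
      "id": "SalesSource",
      "name": "SalesSource",
      "type": "SqlDataNode",
      "table": "sales",
      "selectQuery": "SELECT s.id, s.sale_date, c.email, p.sku, s.amount FROM #{table} s JOIN customers c ON c.id = s.customer_id JOIN products p ON p.id = s.product_id",
      "database": { "ref": "WebsiteMySqlDatabase" }
    },
    {
      "id": "SalesStaging",
      "name": "SalesStaging",
      "type": "S3DataNode",
      "directoryPath": "s3://example-etl-bucket/sales/#{format(@scheduledStartTime, 'YYYY-MM-dd')}"
    },
    {
      "id": "ExtractSales",
      "name": "ExtractSales",
      "type": "CopyActivity",
      "input": { "ref": "SalesSource" },
      "output": { "ref": "SalesStaging" },
      "runsOn": { "ref": "EtlEc2Instance" }
    },
    {
      "id": "LoadSales",
      "name": "LoadSales",
      "type": "RedshiftCopyActivity",
      "insertMode": "KEEP_EXISTING",
      "input": { "ref": "SalesStaging" },
      "output": { "ref": "SalesWarehouseTable" },
      "runsOn": { "ref": "EtlEc2Instance" },
      "dependsOn": { "ref": "ExtractSales" }
    }
  ]
}
```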

Multiple Data Pipelines can be executed on either a scheduled or on-demand basis, depending on how and when the data is available to be processed.
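That choice shows up on the pipeline's Default object. A minimal sketch, assuming a Schedule object named DailySchedule and the default IAM roles, might look like the following; switching scheduleType to "ondemand" (and dropping the Schedule object) makes the pipeline run only when it is explicitly activated.

```json
{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "cron",
      "schedule": { "ref": "DailySchedule" },
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "DailySchedule",
      "name": "DailySchedule",
      "type": "Schedule",
      "period": "1 day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    }
  ]
}
```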

The power of Data Pipelines resides in their self-contained resilience: each one can chain together data validation, error notification and retry attempts, meaning that, to quote the AWS documentation, “you don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system”.
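Those retry and notification knobs are set per object. As an illustration only (the SNS topic ARN, alarm name and retry values here are placeholders), an activity might carry something like:

```json
{
  "objects": [
    {
      "id": "LoadSales",
      "type": "RedshiftCopyActivity",
      "maximumRetries": "3",
      "retryDelay": "10 Minutes",
      "onFail": { "ref": "EtlFailureAlarm" }
    },
    {
      "id": "EtlFailureAlarm",
      "name": "EtlFailureAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:123456789012:etl-failures",
      "subject": "Data Pipeline failure: #{node.name}",
      "message": "The run scheduled for #{node.@scheduledStartTime} failed.",
      "role": "DataPipelineDefaultRole"
    }
  ]
}
```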

For daily processes that just need to happen, ticking away happily in the background regardless of what goes on elsewhere, Data Pipelines are a very time- and cost-effective way of getting your Data Warehousing going.

Data Pipelines can be created via the console UI or the AWS CLI, with configuration in the form of JSON that defines each step of the Pipeline, the dependencies between them, what to do if something fails, and the credentials needed to access the data.
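On the CLI side, the flow is roughly as follows; the pipeline name, unique-id and the df-… pipeline id are placeholders, and daily-sales-etl.json is assumed to be a definition file like the sketch above.

```bash
# Create an empty pipeline and note the pipeline id it returns (df-...)
aws datapipeline create-pipeline --name daily-sales-etl --unique-id daily-sales-etl

# Upload the JSON definition; this also reports validation warnings and errors
aws datapipeline put-pipeline-definition \
    --pipeline-id df-EXAMPLE1234567 \
    --pipeline-definition file://daily-sales-etl.json

# Start it running (or, for on-demand pipelines, trigger a run)
aws datapipeline activate-pipeline --pipeline-id df-EXAMPLE1234567
```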

There are occasional oddities with the Data Pipeline UI, and Pipeline activation failures can be a bit obscure to diagnose, but once a Pipeline is working it is a very solid solution that you don’t have to babysit.
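When an activation does fail and the console message isn't illuminating, checking the pipeline state and per-component run status from the CLI can help narrow things down (again, the pipeline id is a placeholder):

```bash
# Show overall pipeline state and any errors recorded against it
aws datapipeline describe-pipelines --pipeline-ids df-EXAMPLE1234567

# List recent runs of each component with their status (e.g. WAITING_ON_DEPENDENCIES, FAILED)
aws datapipeline list-runs --pipeline-id df-EXAMPLE1234567
```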