analytics
Analytics is defined as the systematic computational analysis of data or statistics1.
In our context, analytics are what we will perform on data transformed in our data pipelines to produce reports and visualizations. There are several different sub-categories of analytics, including:
analytical system
A computing system or platform specifically dedicated to processing data for the purpose of
reporting on, analyzing, and visualizing that data. The heart of most analytical systems is a
data warehouse or data lake, but other architectures are also possible.
aws
Amazon Web Services2. Amazon’s cloud computing service, which is the most popular cloud platform.
Some of the more popular AWS services are:
There are many dozens of other services provided by AWS.
DAG
DAG stands for Directed Acyclic Graph3. A DAG is a graph consisting of vertices (nodes) and edges
(lines), where each edge connects two vertices and has a direction, and no path of edges leads from
a vertex back to itself (that is what makes it acyclic).
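A minimal sketch of a DAG in Python, using the standard-library graphlib module (the vertex names are illustrative):

```python
from graphlib import TopologicalSorter

# Each key is a vertex; each value is the set of vertices it depends on.
# This encodes the edges A -> B, B -> C, and A -> D.
dag = {
    "B": {"A"},
    "C": {"B"},
    "D": {"A"},
}

# Because the graph is acyclic, we can flatten it into a linear order
# in which every vertex appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)  # every vertex appears after its dependencies
```

If the graph contained a cycle (say an extra edge C -> A), `static_order()` would raise a `CycleError` instead of producing an order.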
data flow diagram
A data flow diagram17 (DFD) maps out the flow of information for any process or system.
data lake
A data lake is defined as a system or repository of data stored in its natural/raw format, usually
object blobs or files.4
data pipeline
A data pipeline5 is a series of programs or tasks that transforms a set of input datasets into a set
of output datasets, usually in the form of a DAG where each individual task is a vertex in the DAG
and the edges define the path of execution.
The path of execution in the data pipeline DAG implicitly defines dependencies between tasks. If we have a simple graph A -> B -> C, where A, B, and C are tasks and -> defines the flow of execution, then C is dependent on B, meaning that B must complete successfully before C can be executed.
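The A -> B -> C example can be sketched in Python; the task bodies below are hypothetical placeholders, but the ordering constraint is the point:

```python
# Three tasks in a pipeline; each consumes the output of the one before it.
def extract():            # task A
    return [1, 2, 3]

def transform(rows):      # task B: depends on A
    return [r * 10 for r in rows]

def load(rows):           # task C: depends on B
    return {"loaded": len(rows)}

# A -> B -> C: because of the dependencies, the only valid execution
# order is A, then B, then C.
data = extract()
data = transform(data)
result = load(data)
print(result)  # {'loaded': 3}
```

An orchestration tool (see below) automates exactly this: it reads the DAG and runs each task only after its upstream dependencies succeed.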
The phases of the data pipeline can be defined as follows:
data product
One definition of a data product is a reusable data asset, built to deliver a trusted dataset, for
a specific purpose.6
In this class, we narrow the definition of a data product to be a dataset we create in a data pipeline for the specific purpose of creating reports or analytics.
data warehouse
A data warehouse7 is a database system that is dedicated to reporting and analysis. It is
usually the core component of an organization’s analytical system. We will use a Postgres database
for our data warehouse, but many organizations use the new breed of cloud data warehouses such as
Snowflake, BigQuery, and Redshift.
dbt
dbt (Data Build Tool)8 is an open-source software tool that uses SQL to transform one or more
input database tables into an output table.
We will use dbt to build our data pipelines in this class. dbt allows us to define a pipeline as a DAG by defining individual transformations as tasks and specifying dependencies between them.
dimensional model
Also known as a Star Schema9. A dimensional model is a data model that is optimized for complex
analytical queries. It consists of two kinds of tables:
Facts are measures - each row in a fact table is a measurement of something.
We prefer facts to be numeric and additive - that is, we can perform aggregate functions on them like COUNT, SUM, AVG, MIN, MAX, etc.
Dimensions provide context for different kinds of measurement. They describe what we are measuring.
Dimensions consist of descriptive attributes - usually (but not always) text strings or date/times - that allow us to label, search, filter, group, and order the data in various ways.
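A tiny star schema sketched with Python's built-in sqlite3 module: one fact table holding numeric, additive measures and one dimension table providing descriptive context. The table and column names here are illustrative, not from the course:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- dimension: descriptive attributes we can filter and group by
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    -- fact: each row is a measurement; quantity is numeric and additive
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        quantity    INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'),
                                   (2, 'Gadget', 'Hardware');
    INSERT INTO fact_sales VALUES (1, 2), (2, 1), (1, 1);
""")

# The classic dimensional query shape: group by a dimension attribute,
# aggregate the facts.
total = con.execute("""
    SELECT d.category, SUM(f.quantity)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category
""").fetchone()
print(total)  # ('Hardware', 4)
```

This is the payoff of keeping facts additive: any slice of the dimension attributes can be rolled up with SUM, COUNT, AVG, and so on.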
ELT
Stands for Extract, Load, and Transform10.
A more modern alternative to ETL, in which data is extracted from source systems, loaded in its
original format to an analytical system, and then taken through a series of transformational steps
to produce data products and analytics.
ETL
Stands for Extract, Transform, and Load11.
An older data engineering concept where data is extracted from a source system, transformed in a
staging area, and then loaded into an analytical system.
Entity-Relationship Diagram (ERD)
An Entity-Relationship Diagram (ERD)18 visualizes the relationships between entities like people,
things, or concepts in a database. An ERD will also often visualize the attributes of these entities.
By defining the entities, their attributes, and showing the relationships between them, an ERD
can illustrate the logical structure of databases.
Foreign Key
A foreign key consists of a column that references the primary key of another table. The purpose of the foreign key is to ensure referential integrity of the data. In other words, only values that are primary key values in the referenced table are permitted in the foreign key column.
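Referential integrity can be demonstrated with sqlite3 (the table names are illustrative; note that SQLite only enforces foreign keys when the pragma is switched on):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE payment (
        payment_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        amount      INTEGER
    );
    INSERT INTO customer VALUES (1, 'Ada');
""")

# Allowed: customer 1 exists in the referenced table.
con.execute("INSERT INTO payment VALUES (1, 1, 100)")

# Rejected: there is no customer 99, so the FK constraint fires.
try:
    con.execute("INSERT INTO payment VALUES (2, 99, 50)")
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed
```

Postgres enforces foreign keys by default; no pragma is needed there.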
idempotent
An operation that produces the same result no matter how many times it is executed.
This is a term from mathematical theory, but in the context of data engineering it means that running a task in a data pipeline or even a series of tasks in a pipeline will produce the same result even when run multiple times over the same data.
A simple example of idempotence is writing a task that first deletes all the data in a table and then reloads the data from the source. A task like this is guaranteed to be idempotent.
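The delete-then-reload pattern can be sketched with sqlite3 (the table and source data are stand-ins): however many times the task runs against the same source, the target table ends up identical.

```python
import sqlite3

# Stand-in for rows extracted from an operational source system.
source_rows = [(1, "a"), (2, "b")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER, val TEXT)")

def reload_target():
    """Idempotent load: wipe the table, then reload from the source."""
    con.execute("DELETE FROM target")
    con.executemany("INSERT INTO target VALUES (?, ?)", source_rows)

reload_target()
reload_target()  # running it a second time changes nothing
print(con.execute("SELECT COUNT(*) FROM target").fetchone())  # (2,)
```

Contrast this with a task that only appends: run it twice and the table holds duplicates, so the append-only version is not idempotent.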
linux
Linux12 is an open-source operating system (OS) that is the most popular server OS on the internet.
There are many different distributions (versions) of linux.
We will build our data pipelines on an EC2 VM running the Ubuntu distribution of linux.
operational data source
Also referred to as a source system. An operational data source is a database or other data store
that is used by an organization’s operational systems - the systems that are used to run the
organization.
For example, the system that processes payments on your credit card is an operational system. It will store the record of those payment transactions in a database. That transactional data in a database will then be copied into an analytical system to be transformed and analyzed.
We often refer to the data from operational systems as “raw data” because it needs to be transformed to be useful for analytics.
orchestration tool
Data orchestration is the process of automating the execution and monitoring of data pipelines.13
A data orchestration tool automates the management and execution of data pipelines. It visualizes
data pipelines as DAGs and can show the execution of each pipeline as it occurs.
Popular orchestration tools are Airflow, Dagster, and Prefect.
Postgres
Postgres14 is an open-source Database Management System (DBMS) originally developed at UC-Berkeley.
Postgres and MySQL are the most popular open-source SQL databases.
We will use Postgres as our database system in this course.
Primary Key
A database primary key15 is a column or combination of columns with a value that uniquely identifies each row in a table.
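Uniqueness is what the database actually enforces; a quick sqlite3 sketch (illustrative table name) shows a duplicate key being rejected:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, owner TEXT)")
con.execute("INSERT INTO account VALUES (1, 'Ada')")

# A second row with the same primary key value is rejected outright.
try:
    con.execute("INSERT INTO account VALUES (1, 'Grace')")
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: account.account_id
```

This is the other half of the foreign key story above: a foreign key column may only hold values that exist as primary key values in the referenced table.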
Superset
Superset16 is an open-source business intelligence tool that can query analytic data stores to build
reports and visualizations.
We will use Superset to create reports and visualizations on the data products we build.
transformation
The process of taking operational data and turning it into data that is useful for analysis.
Transformation can involve any or all of the following steps:
Virtual Machine (VM)
A virtual machine or VM is defined as the virtualization or emulation of a computer system.
We will be doing our work on an EC2 system in AWS that is a virtual machine running Postgres,
JupyterHub notebooks, dbt, and Superset.
These are the most important and influential books on data engineering, databases, and SQL:
Data Mesh: Delivering Data-Driven Value at Scale - Zhamak Dehghani
Fundamentals of Data Engineering: Plan and Build Robust Data Systems - Joe Reis and Matt Housley
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - Martin Kleppmann
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling - Ralph Kimball and Margy Ross
Learning SQL: Generate, Manipulate, and Retrieve Data - Alan Beaulieu
Database Design and Relational Theory: Normal Forms and All That Jazz - C.J. Date
Storytelling with Data: A Data Visualization Guide for Business Professionals - Cole Nussbaumer Knaflic
The Informed Company: How to Build Modern Agile Data Stacks that Drive Winning Insights - Dave Fowler and Matthew C. David
These last two books are not books on data engineering or databases, but they are the best books available on software architecture and Python, and I highly recommend them.
Fundamentals of Software Architecture: An Engineering Approach - Mark Richards and Neal Ford
Fluent Python: Clear, Concise, and Effective Programming - Luciano Ramalho
1https://en.wikipedia.org/wiki/Analytics
2https://docs.aws.amazon.com/?nc2=h_ql_doc_do
3https://en.wikipedia.org/wiki/Directed_acyclic_graph
4https://en.wikipedia.org/wiki/Data_lake
5https://en.wikipedia.org/wiki/Pipeline_(computing)
6https://data.world/blog/what-is-a-data-product-and-why-does-it-matter-to-data-mesh/
7https://en.wikipedia.org/wiki/Data_warehouse
8https://en.wikipedia.org/wiki/Data_build_tool
9https://en.wikipedia.org/wiki/Star_schema
10https://www.astera.com/type/blog/elt-extract-load-and-transform/
11https://en.wikipedia.org/wiki/Extract,_transform,_load
12https://en.wikipedia.org/wiki/Linux
13https://medium.com/memory-leak/data-orchestration-a-primer-56f3ddbb1700
14https://www.postgresql.org/
15https://www.relationaldbdesign.com/database-analysis/module2/database-primary-key.php
16https://superset.apache.org/
17https://www.lucidchart.com/pages/data-flow-diagram
18https://www.smartdraw.com/entity-relationship-diagram/