analytics
Analytics is defined as the systematic computational analysis of data or statistics1.
In our context, analytics are what we will perform on data transformed in our data pipelines to produce reports and visualizations. There are several different sub-categories of analytics, including:
analytical system
A computing system or platform specifically dedicated to processing data for the purpose of
reporting on, analyzing, and visualizing that data. The heart of most analytical systems is a
data warehouse or data lake, but other architectures are also possible.
aws
Amazon Web Services2. Amazon’s cloud computing service, which is the most popular cloud platform.
Some of the more popular AWS services are:
There are many dozens of other services provided by AWS.
DAG
DAG stands for Directed Acyclic Graph3. A DAG is a graph consisting of vertices (nodes) and edges
(lines), where each edge connects two vertices and has a direction, and no path of edges leads from
a vertex back to itself (that is what makes it acyclic).
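A minimal sketch of a DAG in Python, using the standard-library graphlib module (the vertex names are illustrative):

```python
from graphlib import TopologicalSorter

# Each key is a vertex; each value is the set of vertices it depends on.
# This encodes the edges A -> B, B -> C, and A -> D.
dag = {
    "B": {"A"},
    "C": {"B"},
    "D": {"A"},
}

# Because the graph is acyclic, we can flatten it into a linear order
# in which every vertex appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)  # every vertex appears after its dependencies
```

If the graph contained a cycle (say an extra edge C -> A), `static_order()` would raise a `CycleError` instead of producing an order.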
data flow diagram
A data flow diagram17 (DFD) maps out the flow of information for any process or system.
data lake
A data lake is defined as a system or repository of data stored in its natural/raw format, usually
object blobs or files.4
data pipeline
A data pipeline5 is a series of programs or tasks that transforms a set of input datasets into a set
of output datasets, usually in the form of a DAG where each individual task is a vertex in the DAG
and the edges define the path of execution.
The path of execution in the data pipeline DAG implicitly defines dependencies between tasks. If we have a simple graph A -> B -> C, where A, B, and C are tasks and -> defines the flow of execution, then C is dependent on B, meaning that B must complete successfully before C can be executed.
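The A -> B -> C example can be sketched in Python; the task bodies below are hypothetical placeholders, but the ordering constraint is the point:

```python
# Three tasks in a pipeline; each consumes the output of the one before it.
def extract():            # task A
    return [1, 2, 3]

def transform(rows):      # task B: depends on A
    return [r * 10 for r in rows]

def load(rows):           # task C: depends on B
    return {"loaded": len(rows)}

# A -> B -> C: because of the dependencies, the only valid execution
# order is A, then B, then C.
data = extract()
data = transform(data)
result = load(data)
print(result)  # {'loaded': 3}
```

An orchestration tool (see below) automates exactly this: it reads the DAG and runs each task only after its upstream dependencies succeed.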
The phases of the data pipeline can be defined as follows:
data product
One definition of a data product is a reusable data asset, built to deliver a trusted dataset, for
a specific purpose.6
In this class, we narrow the definition of a data product to be a dataset we create in a data pipeline for the specific purpose of creating reports or analytics.
data warehouse
A data warehouse7 is a database system that is dedicated to reporting and analysis. It is
usually the core component of an organization’s analytical system. We will use a Postgres database
for our data warehouse, but many organizations use the new breed of cloud data warehouses such as
Snowflake, BigQuery, and Redshift.
dbt
dbt (Data Build Tool)8 is an open-source software tool that uses SQL to transform one or more
input database tables into an output table.
We will use dbt to build our data pipelines in this class. dbt allows us to define a pipeline as a DAG by defining individual transformations as tasks and specifying dependencies between them.
dimensional model
Also known as a Star Schema9. A dimensional model is a data model that is optimized for complex
analytical queries. It consists of two kinds of tables:
Facts are measures - each row in a fact table is a measurement of something.
We prefer facts to be numeric and additive - that is, we can perform aggregate functions on them like COUNT, SUM, AVG, MIN, MAX, etc.
Dimensions provide context for different kinds of measurement. They describe what we are measuring.
Dimensions consist of descriptive attributes - usually (but not always) text strings or date/times - that allow us to label, search, filter, group, and order the data in various ways.
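A tiny star schema sketched with Python's built-in sqlite3 module: one fact table holding numeric, additive measures and one dimension table providing descriptive context. The table and column names here are illustrative, not from the course:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- dimension: descriptive attributes we can filter and group by
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    -- fact: each row is a measurement; quantity is numeric and additive
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        quantity    INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'),
                                   (2, 'Gadget', 'Hardware');
    INSERT INTO fact_sales VALUES (1, 2), (2, 1), (1, 1);
""")

# The classic dimensional query shape: group by a dimension attribute,
# aggregate the facts.
total = con.execute("""
    SELECT d.category, SUM(f.quantity)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category
""").fetchone()
print(total)  # ('Hardware', 4)
```

This is the payoff of keeping facts additive: any slice of the dimension attributes can be rolled up with SUM, COUNT, AVG, and so on.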
ELT
Stands for Extract, Load, and Transform10.
A more modern alternative to ETL, in which data is extracted from source systems, loaded in its
original format to an analytical system, and then taken through a series of transformational steps
to produce data products and analytics.
ETL
Stands for Extract, Transform, and Load11.
An older data engineering concept where data is extracted from a source system, transformed in a
staging area, and then loaded into an analytical system.
Entity-Relationship Diagram (ERD)
An Entity-Relationship Diagram (ERD)18 visualizes the relationships between entities like people,
things, or concepts in a database. An ERD will also often visualize the attributes of these entities.
By defining the entities, their attributes, and showing the relationships between them, an ERD
can illustrate the logical structure of databases.
Foreign Key
A foreign key consists of a column that references the primary key of another table. The purpose of the foreign key is to ensure referential integrity of the data. In other words, only values that are primary key values in the referenced table are permitted in the foreign key column.
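Referential integrity can be demonstrated with sqlite3 (the table names are illustrative; note that SQLite only enforces foreign keys when the pragma is switched on):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE payment (
        payment_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        amount      INTEGER
    );
    INSERT INTO customer VALUES (1, 'Ada');
""")

# Allowed: customer 1 exists in the referenced table.
con.execute("INSERT INTO payment VALUES (1, 1, 100)")

# Rejected: there is no customer 99, so the FK constraint fires.
try:
    con.execute("INSERT INTO payment VALUES (2, 99, 50)")
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed
```

Postgres enforces foreign keys by default; no pragma is needed there.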
idempotent
An operation that produces the same result no matter how many times it is executed.
This is a term from mathematical theory, but in the context of data engineering it means that running a task in a data pipeline or even a series of tasks in a pipeline will produce the same result even when run multiple times over the same data.
A simple example of idempotence is writing a task that first deletes all the data in a table and then reloads the data from the source. A task like this is guaranteed to be idempotent.
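The delete-then-reload pattern can be sketched with sqlite3 (the table and source data are stand-ins): however many times the task runs against the same source, the target table ends up identical.

```python
import sqlite3

# Stand-in for rows extracted from an operational source system.
source_rows = [(1, "a"), (2, "b")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER, val TEXT)")

def reload_target():
    """Idempotent load: wipe the table, then reload from the source."""
    con.execute("DELETE FROM target")
    con.executemany("INSERT INTO target VALUES (?, ?)", source_rows)

reload_target()
reload_target()  # running it a second time changes nothing
print(con.execute("SELECT COUNT(*) FROM target").fetchone())  # (2,)
```

Contrast this with a task that only appends: run it twice and the table holds duplicates, so the append-only version is not idempotent.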
linux
Linux12 is an open-source operating system (OS) that is the most popular server OS on the internet.
There are many different distributions (versions) of linux.
We will build our data pipelines on an EC2 VM running the Ubuntu distribution of linux.
operational data source
Also referred to as a source system. An operational data source is a database or other data store
that is used by an organization’s operational systems - the systems that are used to run the
organization.
For example, the system that processes payments on your credit card is an operational system. It will store the record of those payment transactions in a database. That transactional data in a database will then be copied into an analytical system to be transformed and analyzed.
We often refer to the data from operational systems as “raw data” because it needs to be transformed to be useful for analytics.
orchestration tool
Data orchestration is the process of automating the execution and monitoring of data pipelines.13
A data orchestration tool automates the management and execution of data pipelines. It visualizes
data pipelines as DAGs and can show the execution of each pipeline as it occurs.
Popular orchestration tools are Airflow, Dagster, and Prefect.
Postgres
Postgres14 is an open-source Database Management System (DBMS) originally developed at UC-Berkeley.
Postgres and MySQL are the most popular open-source SQL databases.
We will use Postgres as our database system in this course.
Primary Key
A database primary key15 is a column or combination of columns with a value that uniquely identifies each row in a table.
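Uniqueness is what the database actually enforces; a quick sqlite3 sketch (illustrative table name) shows a duplicate key being rejected:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, owner TEXT)")
con.execute("INSERT INTO account VALUES (1, 'Ada')")

# A second row with the same primary key value is rejected outright.
try:
    con.execute("INSERT INTO account VALUES (1, 'Grace')")
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: account.account_id
```

This is the other half of the foreign key story above: a foreign key column may only hold values that exist as primary key values in the referenced table.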
Superset
Superset16 is an open-source business intelligence tool that can query analytic data stores to build
reports and visualizations.
We will use Superset to create reports and visualizations on the data products we build.
transformation
The process of taking operational data and turning it into data that is useful for analysis.
Transformation can involve any or all of the following steps:
Virtual Machine (VM)
A virtual machine or VM is defined as the virtualization or emulation of a computer system.
We will be doing our work on an EC2 system in AWS that is a virtual machine running Postgres,
JupyterHub notebooks, dbt, and Superset.
These are the most important and influential books on data engineering, databases, and SQL:
Data Mesh: Delivering Data-Driven Value at Scale - Zhamak Dehghani
Fundamentals of Data Engineering: Plan and Build Robust Data Systems - Joe Reis and Matt Housley
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - Martin Kleppmann
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling - Ralph Kimball and Margy Ross
Learning SQL: Generate, Manipulate, and Retrieve Data - Alan Beaulieu
Database Design and Relational Theory: Normal Forms and All That Jazz - C.J. Date
Storytelling with Data: A Data Visualization Guide for Business Professionals - Cole Nussbaumer Knaflic
The Informed Company: How to Build Modern Agile Data Stacks that Drive Winning Insights - Dave Fowler and Matthew C. David
These last two books are not books on data engineering or databases, but they are the best books available on software architecture and Python, and I highly recommend them.
Fundamentals of Software Architecture: An Engineering Approach - Mark Richards and Neal Ford
Fluent Python: Clear, Concise, and Effective Programming - Luciano Ramalho
1https://en.wikipedia.org/wiki/Analytics
2https://docs.aws.amazon.com/?nc2=h_ql_doc_do
3https://en.wikipedia.org/wiki/Directed_acyclic_graph
4https://en.wikipedia.org/wiki/Data_lake
5https://en.wikipedia.org/wiki/Pipeline_(computing)
6https://data.world/blog/what-is-a-data-product-and-why-does-it-matter-to-data-mesh/
7https://en.wikipedia.org/wiki/Data_warehouse
8https://en.wikipedia.org/wiki/Data_build_tool
9https://en.wikipedia.org/wiki/Star_schema
10https://www.astera.com/type/blog/elt-extract-load-and-transform/
11https://en.wikipedia.org/wiki/Extract,_transform,_load
12https://en.wikipedia.org/wiki/Linux
13https://medium.com/memory-leak/data-orchestration-a-primer-56f3ddbb1700
14https://www.postgresql.org/
15https://www.relationaldbdesign.com/database-analysis/module2/database-primary-key.php
16https://superset.apache.org/
17https://www.lucidchart.com/pages/data-flow-diagram
18https://www.smartdraw.com/entity-relationship-diagram/