Oregon State University Jacob Mastel Data Modeler Oregon State University Banner Institution One of only 2 universities with land sea space and sun grant designations Home of the CORE team ID: 931624
Download Presentation The PPT/PDF document "Modularizing Data with dbt" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
Modularizing Data with dbt
Oregon State University
Jacob Mastel – Data Modeler
Slide2Oregon State University
Banner Institution
One of only 2 universities with land, sea, space, and sun grant designations
Home of
the:CORE teamYInMn blue aka “Bluetiful”Maraschino cherriesThe computer mouse
Slide3Talk Roadmap
A high level overview of our AWS architecture
The data tools we evaluated
An example project
Final Q&A
Slide4High Level AWS Architecture
Extract
Most on-
prem
data is pulled into s3 using either DMS or Lambda
functions.
Load
An in-house application, Elsa, intelligently reads the data in s3 and imports it into
Snowflake.
Transform
dbt
is used to create and manage SQL models
.
Slide5Evaluation of Data Tools
Primary Criteria:
Cost
Developer time
ELT vs ELTVendor Lock In
Slide6Why we chose dbt
Mostly SQL and YAML, with some other text files mixed in
Fairly easy to read
Very simple to integrate into a
github workflowBrings DRY principles to SQLAllows us to “unit test” our incoming data as well as the developed modelsVery little vendor lock inSupports multiple RDMS outputs“Compiles” to SQLRelies on our existing service for the heavy lifting
Slide7SFR2410: A test project
~20,000 lines of SQL-
ish
ETL code written with
AppworkUsed to create official snapshot data used for legal reporting and long term statistical analysisDue to its important nature and complexity, nobody wants to touch it
Slide8SFR2410: Setting it up
dbt
init
dbt-SFR2410cd dbt-SFR2410git init
Analysis: models not pushed to database
Data: static data used in models
Macros: jinja2 macros
Models: What appears in the database
Tests: Unit tests
Slide9What’s a Jinja?
dbt
models are built using
jinja
enhanced SQLThink what old school, inline PHP is to HTML
Slide10SFR2410: Breaking the Monolith
Instead of one monolithic procedure, we have small, specialized components.
Package configuration allows for a significant control over model groups, or even individual models
From
dbt_project.yml:
Slide11SFR2410: Comparing Blocks
Citizenship in SFR2410
Citizenship in dbt-SFR2410
Slide12SFR2410: Combining Components
The smaller components are then combined into a larger, complete model
output/
student.sql
Slide13SFR2410: Unit Testing
Unit tests are just SQL statements looking for errors
Citizenship should return a single entry. This test looks for any GOBINTL_PIDM with two entries
If the test fails, we already know exactly which records caused the failure
tests/single_citizenship_entry.sql
Slide14SFR2410: But a component is too slow!
Each model can be configured within its individual SQL file, or can be adjusted within the project file
In this case, we’ve selected a single model within the
supporting
group to be materialized into a table.dbt-project.yml
Slide15Questions?