Ricard Illa 5ed10fb179 | ||
---|---|---|
airflow_img | ||
dags | ||
data | ||
dbt | ||
etl | ||
notebooks | ||
scripts | ||
terraform | ||
.gitignore | ||
README.md | ||
docker-compose.yml |
README.md
Sustainability Score
The task at hand is to compute a sustainability score for products available on Target's website. This score will be based on various product attributes such as materials, packaging, weight, dimensions, TCIN, origin, and other relevant information. The goal is to process and clean the provided product data, store it in an SQL database, and calculate the sustainability score for each product.
Architecture and stack
To tackle this issue I decided to use technologies that are easy to run locally for a small prove of concept but have an easy migration path to run on a cloud provider.
I am using docker-compose to define the services used.
Jupyter Notebooks
Jupyter Notebooks were used for initial data exploration and for some small analyses afterwards.
I am using the official jupyter/scipy-notebook
image, which already contains
everything that I needed.
The notebooks used are stored in the notebooks
directory.
PostgreSQL
PostgreSQL is used as the database where the data will be stored and as the SQL engine to run calculations.
Apache Airflow
Apache Airflow is used as the orchestrator and scheduler of the whole pipeline. Airflow is also where the credentials to the database are stored.
I am using an image that inherits from Google Composer's
composer-2.3.1-airflow-2-5-1
image. This creates an environment more similar
to Composer and makes a potential migration to Composer easier.
The directory airflow_img
contains the files for building that imnge . It
is essentially an image inheriting from Google's image but which also has DBT
and which overrides its entrypoint (because the entrypoint of the original
image sets things up for Cloud Composer, not for running locally).
The directory dags
directory contains one Airflow DAG: sustainability_score
.
This is the DAG in charge or orchestrating the pipeline
Apache Beam
Apache Beam is used as the ETL tool, to read rows as elements from the input CSV file, clean them up and upsert them into the PostgreSQL database.
In this proof of concept, I am using the DirectRunner, because it's the easiest to run an reproduce locally. This means that Beam will run within a python virtual environment inside the Airflow scheduler container.
The code for Apache Beam is stored in the etl
directory.
DBT
DBT is used as the transformation tool, to leverage PostgreSQL SQL engine to calculate the score incrementally. I see DBT as an analytics-specific orchestrator that is itself orchestrator by the more general-purpose orchestrator Airflow.
In this proof of concept, DBT is not running on its own container but is directly installed on the same image as Airflow. I made this decision because I didn't like any of the other options:
- If I had used DBT cloud, this whole setup could not have been run locally (which is a decision I made myself).
- If I had run DBT with an ad-hoc Docker invocation, I would have had to mount the docker socket inside the Airflow containers (to allow for orchestration), which it not ideal.
- dbt-labs' dbt-rpc is already deprecated and missing in newer DBT versions
- dbt-labs' dbt-server is already quite immature and not well documented.
The code for dbt is stored in the dbt
directory.
Terraform
I am using terraform to provision some state within the containers in this stack. Namely, Terraform is used to set up the user, database and schema in the Postgres database, and also to create the Airflow connection within Airflow to access the database.
I am not creating the tables with Terraform because I believe that job is more appropriate for DBT.
The code for terraform is stored in the terraform
directory.
Other directories
data
This contains the sample CSV file.
scripts
Contains misc scripts used inside containers. Currently, there is only
terraform-entrypoint.sh
, which servers as an entrypoint to terraform's
container.
state
This directory will be created at runtime and is used for storing state, so that the setup has persistence.
Migration to the cloud
While developing this setup, I kept in mind an eventual move to a cloud provider. The following adjustments could be made if this were to run on Google Cloud.
Apache Airflow
The Airflow code could be easily moved to run on Cloud Composer.
For any credentials needed, it would then be better to use the Google Secrets manager backend instead of storing the connections in Airflow's database.
BigQuery
The final datawarehouse used could be BigQuery.
However, due to BigQuery's analytical-first nature a few small changes should be changed when moving from a traditional transactional database like PostgreSQL.
Instead of inserting the elements into the database individually using upsert
statements, it would be better to first write all of them into a given table or
partition with the write disposition set to WRITE_TRUNCATE
and then use a
MERGE
statement to write or update the elements in the eventual final
products
table.
DBT
would likely be used for that MERGE
statement.
DBT
Wither DBT cloud or a self-hosted DBT server could be used instead of directly calling the dbt cli. Alternatively, if BigQuery is used as datawarehouse, Dataform would also be worth exploring, since it provides a much better integration with BigQuery specifically.
Apache Beam
The Beam ETL code would run using the DataflowRunner.
GCS
The input CSV file would probably be stored on GCS. Because the ETL pipeline
already reads the input file using apache_beam.io.filesystems.FileSystems
,
this change should be transparent.
Terraform
Terraform would still be used to declaratively define all the needed resources.
Missing pieces
Due to time constraints, there were a few things that I did not have time to implement.
Pipeline monitoring
For the local setup I may have used something like Grafana. For the cloud deployment I'd leverage Dataflow's monitoring interface and Google Cloud Monitor.
Data quality and integrity tests
DBT is already in place, but I would have liked to add more tests for the data itself.
Alerting
Airflow can provide alerts if the DAG fails via things like an STMP server, an instant messaging notification or even a Pub/Sub queue.
Dashboards
I am not experienced in this enough to have it done quick enough, but a dashboard visualisation of the data using Looker (for the Cloud deployment) or Grafana (for the local setup) would have been beneficial too.
Better analysis, statistics and plots
Again, due to time constraints, I could not add statistics, analyses and plots polished enough to the data analysis stage.