Go to file

Ricard Illa b28ddc350d doc: added readme to etl		2023-06-26 12:09:50 +02:00
airflow_img	fix: fix dbt writeable directories permissions	2023-06-26 10:04:53 +02:00
dags	fix: airflow's BashOperator's cwd cannot be templated	2023-06-26 10:04:28 +02:00
data	added sample data	2023-06-21 15:45:01 +02:00
dbt	docs: added dbt readme	2023-06-26 08:51:54 +02:00
etl	doc: added readme to etl	2023-06-26 12:09:50 +02:00
notebooks	feat: added some plots	2023-06-25 23:52:23 +02:00
scripts	feat: moved transformations to dbt	2023-06-25 20:53:51 +02:00
terraform	refactor: some refactoring of file placements	2023-06-24 19:07:02 +02:00
.gitignore	feat: ignore dbt/.user.yaml	2023-06-26 11:54:17 +02:00
README.md	feat: added reame	2023-06-26 08:42:26 +02:00
docker-compose.yml	feat: postgresql port doesn't need to be exposed	2023-06-26 11:57:50 +02:00

README.md

Sustainability Score

The task at hand is to compute a sustainability score for products available on Target's website. This score will be based on various product attributes such as materials, packaging, weight, dimensions, TCIN, origin, and other relevant information. The goal is to process and clean the provided product data, store it in an SQL database, and calculate the sustainability score for each product.

Architecture and stack

To tackle this issue I decided to use technologies that are easy to run locally for a small prove of concept but have an easy migration path to run on a cloud provider.

I am using docker-compose to define the services used.

Jupyter Notebooks

Jupyter Notebooks were used for initial data exploration and for some small analyses afterwards.

I am using the official jupyter/scipy-notebook image, which already contains everything that I needed.

The notebooks used are stored in the notebooks directory.

PostgreSQL

PostgreSQL is used as the database where the data will be stored and as the SQL engine to run calculations.

Apache Airflow

Apache Airflow is used as the orchestrator and scheduler of the whole pipeline. Airflow is also where the credentials to the database are stored.

I am using an image that inherits from Google Composer's composer-2.3.1-airflow-2-5-1 image. This creates an environment more similar to Composer and makes a potential migration to Composer easier.

The directory airflow_img contains the files for building that imnge . It is essentially an image inheriting from Google's image but which also has DBT and which overrides its entrypoint (because the entrypoint of the original image sets things up for Cloud Composer, not for running locally).

The directory dags directory contains one Airflow DAG: sustainability_score. This is the DAG in charge or orchestrating the pipeline

Apache Beam

Apache Beam is used as the ETL tool, to read rows as elements from the input CSV file, clean them up and upsert them into the PostgreSQL database.

In this proof of concept, I am using the DirectRunner, because it's the easiest to run an reproduce locally. This means that Beam will run within a python virtual environment inside the Airflow scheduler container.

The code for Apache Beam is stored in the etl directory.

DBT

DBT is used as the transformation tool, to leverage PostgreSQL SQL engine to calculate the score incrementally. I see DBT as an analytics-specific orchestrator that is itself orchestrator by the more general-purpose orchestrator Airflow.

In this proof of concept, DBT is not running on its own container but is directly installed on the same image as Airflow. I made this decision because I didn't like any of the other options:

If I had used DBT cloud, this whole setup could not have been run locally (which is a decision I made myself).
If I had run DBT with an ad-hoc Docker invocation, I would have had to mount the docker socket inside the Airflow containers (to allow for orchestration), which it not ideal.
dbt-labs' dbt-rpc is already deprecated and missing in newer DBT versions
dbt-labs' dbt-server is already quite immature and not well documented.

The code for dbt is stored in the dbt directory.

Terraform

I am using terraform to provision some state within the containers in this stack. Namely, Terraform is used to set up the user, database and schema in the Postgres database, and also to create the Airflow connection within Airflow to access the database.

I am not creating the tables with Terraform because I believe that job is more appropriate for DBT.

The code for terraform is stored in the terraform directory.

Other directories

`data`

This contains the sample CSV file.

`scripts`

Contains misc scripts used inside containers. Currently, there is only terraform-entrypoint.sh, which servers as an entrypoint to terraform's container.

`state`

This directory will be created at runtime and is used for storing state, so that the setup has persistence.

Migration to the cloud

While developing this setup, I kept in mind an eventual move to a cloud provider. The following adjustments could be made if this were to run on Google Cloud.

Apache Airflow

The Airflow code could be easily moved to run on Cloud Composer.

For any credentials needed, it would then be better to use the Google Secrets manager backend instead of storing the connections in Airflow's database.

BigQuery

The final datawarehouse used could be BigQuery.

However, due to BigQuery's analytical-first nature a few small changes should be changed when moving from a traditional transactional database like PostgreSQL.

Instead of inserting the elements into the database individually using upsert statements, it would be better to first write all of them into a given table or partition with the write disposition set to WRITE_TRUNCATE and then use a MERGE statement to write or update the elements in the eventual final products table. DBT would likely be used for that MERGE statement.

DBT

Wither DBT cloud or a self-hosted DBT server could be used instead of directly calling the dbt cli. Alternatively, if BigQuery is used as datawarehouse, Dataform would also be worth exploring, since it provides a much better integration with BigQuery specifically.

Apache Beam

The Beam ETL code would run using the DataflowRunner.

GCS

The input CSV file would probably be stored on GCS. Because the ETL pipeline already reads the input file using apache_beam.io.filesystems.FileSystems, this change should be transparent.

Terraform

Terraform would still be used to declaratively define all the needed resources.

Missing pieces

Due to time constraints, there were a few things that I did not have time to implement.

Pipeline monitoring

For the local setup I may have used something like Grafana. For the cloud deployment I'd leverage Dataflow's monitoring interface and Google Cloud Monitor.

Data quality and integrity tests

DBT is already in place, but I would have liked to add more tests for the data itself.

Alerting

Airflow can provide alerts if the DAG fails via things like an STMP server, an instant messaging notification or even a Pub/Sub queue.

Dashboards

I am not experienced in this enough to have it done quick enough, but a dashboard visualisation of the data using Looker (for the Cloud deployment) or Grafana (for the local setup) would have been beneficial too.

Better analysis, statistics and plots

Again, due to time constraints, I could not add statistics, analyses and plots polished enough to the data analysis stage.