# Sustainability Score

The task at hand is to compute a sustainability score for products available on
Target's website. This score will be based on various product attributes such
as materials, packaging, weight, dimensions, TCIN, origin, and other relevant
information. The goal is to process and clean the provided product data, store
it in an SQL database, and calculate the sustainability score for each product.

## Architecture and stack

To tackle this problem I decided to use technologies that are easy to run
locally for a small proof of concept but that have an easy migration path to a
cloud provider.

I am using docker-compose to define the services used.

### Jupyter Notebooks

Jupyter Notebooks were used for initial data exploration and for some small
analyses afterwards.

I am using the official `jupyter/scipy-notebook` image, which already contains
everything that I needed.

The notebooks used are stored in the `notebooks` directory.

### PostgreSQL

PostgreSQL is used as the database where the data will be stored and as the SQL
engine to run calculations.

### Apache Airflow

Apache Airflow is used as the orchestrator and scheduler of the whole pipeline.
Airflow is also where the credentials to the database are stored.

I am using an image that inherits from Google Composer's
`composer-2.3.1-airflow-2-5-1` image. This creates an environment more similar
to Composer and makes a potential migration to Composer easier.

The directory `airflow_img` contains the files for building that image. It is
essentially an image inheriting from Google's image but which also has DBT
installed and which overrides its entrypoint (because the entrypoint of the
original image sets things up for Cloud Composer, not for running locally).

The `dags` directory contains one Airflow DAG: `sustainability_score`. This is
the DAG in charge of orchestrating the pipeline.

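For orientation, a minimal sketch of what that DAG could look like is shown
below. The task IDs, paths and CLI arguments here are illustrative assumptions,
not the actual project code; the real DAG lives in `dags/`.

```python
# Hypothetical sketch of the sustainability_score DAG structure.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sustainability_score",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Run the Beam ETL with the DirectRunner inside the scheduler container.
    run_etl = BashOperator(
        task_id="beam_etl",
        bash_command="python -m etl --input /data/products.csv",
    )

    # Let DBT compute the sustainability score incrementally in PostgreSQL.
    run_dbt = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /dbt --profiles-dir /dbt",
    )

    run_etl >> run_dbt
```
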
### Apache Beam

Apache Beam is used as the ETL tool, to read rows as elements from the input
CSV file, clean them up and upsert them into the PostgreSQL database.

In this proof of concept, I am using the DirectRunner, because it's the easiest
to run and reproduce locally. This means that Beam will run within a Python
virtual environment inside the Airflow scheduler container.

The code for Apache Beam is stored in the `etl` directory.

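As a rough illustration of the shape of that pipeline (not the project's actual
code: the column names, the cleaning step and the connection settings are made
up), a DirectRunner job could look like this:

```python
# Illustrative Beam ETL sketch: read CSV rows, clean them, upsert into Postgres.
import csv

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


def read_csv_rows(path):
    """Yield each CSV row as a dict, using Beam's FileSystems abstraction."""
    with FileSystems.open(path) as handle:
        for row in csv.DictReader(line.decode("utf-8") for line in handle):
            yield row


class UpsertToPostgres(beam.DoFn):
    """Upsert each cleaned row into the products table (one connection per
    element for simplicity; real code would reuse a connection in setup())."""

    def process(self, row):
        import psycopg2  # imported here so the DoFn stays picklable

        with psycopg2.connect(
            host="postgres", dbname="sustainability", user="etl", password="etl"
        ) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    """
                    INSERT INTO products (tcin, weight)
                    VALUES (%(tcin)s, %(weight)s)
                    ON CONFLICT (tcin) DO UPDATE SET weight = EXCLUDED.weight
                    """,
                    row,
                )
        yield row


with beam.Pipeline() as pipeline:  # DirectRunner by default
    (
        pipeline
        # Loads the whole file in the driver; fine for a small sample CSV.
        | "Read" >> beam.Create(list(read_csv_rows("/data/products.csv")))
        | "Clean" >> beam.Map(lambda row: {k: v.strip() for k, v in row.items()})
        | "Upsert" >> beam.ParDo(UpsertToPostgres())
    )
```
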
### DBT

DBT is used as the transformation tool, to leverage PostgreSQL's SQL engine to
calculate the score incrementally.
I see DBT as an analytics-specific orchestrator that is itself orchestrated by
the more general-purpose orchestrator, Airflow.

In this proof of concept, DBT is not running in its own container but is
installed directly on the same image as Airflow. I made this decision because I
didn't like any of the other options:

* If I had used DBT Cloud, this whole setup could not have been run locally
  (and being able to run everything locally is a requirement I set for myself).
* If I had run DBT with an ad-hoc Docker invocation, I would have had to mount
  the Docker socket inside the Airflow containers (to allow for orchestration),
  which is not ideal.
* dbt-labs' dbt-rpc is already deprecated and missing in newer DBT versions.
* dbt-labs' dbt-server is still quite immature and not well documented.

The code for dbt is stored in the `dbt` directory.

### Terraform

I am using Terraform to provision some state within the containers in this
stack. Namely, Terraform is used to set up the user, database and schema in the
Postgres database, and also to create the Airflow connection within Airflow to
access the database.

I am not creating the tables with Terraform because I believe that job is more
appropriate for DBT.

The code for Terraform is stored in the `terraform` directory.

### Other directories

#### `data`

This contains the sample CSV file.

#### `scripts`

Contains misc scripts used inside containers. Currently, there is only
`terraform-entrypoint.sh`, which serves as the entrypoint of the Terraform
container.

#### `state`

This directory will be created at runtime and is used for storing state, so
that the setup has persistence.

## Migration to the cloud

While developing this setup, I kept in mind an eventual move to a cloud
provider. The following adjustments could be made if this were to run on Google
Cloud.

### Apache Airflow

The Airflow code could easily be moved to run on Cloud Composer.

For any credentials needed, it would then be better to use the
[Google Secret Manager backend](https://cloud.google.com/composer/docs/secret-manager)
instead of storing the connections in Airflow's database.

### BigQuery

The final data warehouse used could be BigQuery.

However, due to BigQuery's analytics-first nature, a few small changes would be
needed when moving away from a traditional transactional database like
PostgreSQL.

Instead of inserting the elements into the database individually using upsert
statements, it would be better to first write all of them into a staging table
or partition with the write disposition set to `WRITE_TRUNCATE` and then use a
`MERGE` statement to insert or update the elements in the eventual final
`products` table.
DBT would likely be used for that `MERGE` statement.

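The sketch below illustrates that load-then-`MERGE` idea with the BigQuery
Python client. The project, dataset, table and column names are placeholders,
and in practice the `MERGE` would likely live in a DBT model rather than be
issued directly.

```python
# Hypothetical load-then-MERGE flow for BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

cleaned_rows = [{"tcin": "12345678", "weight": 1.2}]  # stand-in for ETL output

# 1. Overwrite the staging table with the freshly cleaned rows.
load_job = client.load_table_from_json(
    cleaned_rows,
    "my_project.sustainability.products_staging",
    job_config=bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
    ),
)
load_job.result()

# 2. Merge the staging rows into the final products table.
client.query(
    """
    MERGE `my_project.sustainability.products` AS target
    USING `my_project.sustainability.products_staging` AS source
    ON target.tcin = source.tcin
    WHEN MATCHED THEN UPDATE SET target.weight = source.weight
    WHEN NOT MATCHED THEN INSERT (tcin, weight) VALUES (source.tcin, source.weight)
    """
).result()
```
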
### DBT

Either DBT Cloud or a self-hosted DBT server could be used instead of directly
calling the dbt CLI.
Alternatively, if BigQuery is used as the data warehouse, Dataform would also
be worth exploring, since it provides a much better integration with BigQuery
specifically.

### Apache Beam

The Beam ETL code would run using the DataflowRunner.

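Switching runners is mostly a matter of pipeline options. A minimal sketch,
assuming placeholder project, region and bucket names:

```python
# Pointing the same Beam pipeline at Dataflow instead of the DirectRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    # The same transforms as in the local pipeline would go here.
    _ = pipeline | beam.Create(["placeholder"]) | beam.Map(print)
```
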
### GCS

The input CSV file would probably be stored on GCS. Because the ETL pipeline
already reads the input file using `apache_beam.io.filesystems.FileSystems`,
this change should be transparent.

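`FileSystems` resolves the filesystem from the path's scheme, so moving from a
local file to a GCS object is only a configuration change. A small sketch with
placeholder paths:

```python
from apache_beam.io.filesystems import FileSystems

local_path = "/data/products.csv"
gcs_path = "gs://my-bucket/data/products.csv"

# Both paths are opened through the same API call; only the string changes.
with FileSystems.open(local_path) as handle:
    header = handle.read(1024)
```
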
### Terraform

Terraform would still be used to declaratively define all the needed resources.

## Missing pieces

Due to time constraints, there are a few things that I did not implement.

### Pipeline monitoring

For the local setup I would probably have used something like Grafana. For the
cloud deployment I'd leverage Dataflow's monitoring interface and Google Cloud
Monitoring.

### Data quality and integrity tests

DBT is already in place, but I would have liked to add more tests for the data
itself.

### Alerting

Airflow can provide alerts if the DAG fails, via things like an SMTP server, an
instant messaging notification or even a Pub/Sub queue.

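For example, email-on-failure alerts could be wired into the DAG's
`default_args` as sketched below; the address is a placeholder and an SMTP
server would still need to be configured for Airflow.

```python
# Sketch of failure alerting via default_args on the sustainability_score DAG.
from datetime import datetime

from airflow import DAG

default_args = {
    "email": ["data-alerts@example.com"],
    "email_on_failure": True,
    "email_on_retry": False,
}

with DAG(
    dag_id="sustainability_score",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    ...
```
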
### Dashboards

I am not experienced enough with this to have done it quickly, but a dashboard
visualisation of the data using Looker (for the cloud deployment) or Grafana
(for the local setup) would have been beneficial too.

### Better analysis, statistics and plots

Again, due to time constraints, I could not add sufficiently polished
statistics, analyses and plots to the data analysis stage.