feat: added readme

Ricard Illa 2023-06-26 08:42:26 +02:00
# Sustainability Score
The task at hand is to compute a sustainability score for products available on
Target's website. This score will be based on various product attributes such
as materials, packaging, weight, dimensions, TCIN, origin, and other relevant
information. The goal is to process and clean the provided product data, store
it in an SQL database, and calculate the sustainability score for each product.
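As a purely illustrative sketch of what such a score could look like (the attribute names and weights below are invented; the actual scoring lives in the SQL/DBT layer described later in this README), a weighted combination of product attributes might be:

```python
# Hypothetical illustration only: attribute names and weights are made up.

def sustainability_score(product: dict) -> float:
    """Combine a few product attributes into a 0-100 score."""
    material_scores = {"recycled": 1.0, "organic": 0.9, "plastic": 0.2}
    material = material_scores.get(product.get("material", ""), 0.5)
    # Lighter products ship more efficiently; cap the penalty at 1.0.
    weight_penalty = min(product.get("weight_kg", 0.0) / 10.0, 1.0)
    packaging = 1.0 if product.get("recyclable_packaging") else 0.4
    score = 100 * (0.5 * material + 0.3 * packaging + 0.2 * (1 - weight_penalty))
    return round(score, 1)
```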
## Architecture and stack
To tackle this problem, I decided to use technologies that are easy to run locally
for a small proof of concept but that have an easy migration path to a cloud
provider.
I am using docker-compose to define the services used.
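As an illustration, the compose file for this stack could look roughly like the following sketch; the service names, images, and mounts are assumptions, not the project's actual `docker-compose.yml`:

```yaml
# Hypothetical sketch of the stack; see the actual docker-compose.yml
# for the real service definitions.
services:
  postgres:
    image: postgres:15
    volumes:
      - ./state/postgres:/var/lib/postgresql/data
  airflow:
    build: ./airflow_img
    depends_on:
      - postgres
    volumes:
      - ./dags:/opt/airflow/dags
      - ./etl:/opt/etl
      - ./dbt:/opt/dbt
  jupyter:
    image: jupyter/scipy-notebook
    volumes:
      - ./notebooks:/home/jovyan/work
  terraform:
    image: hashicorp/terraform
    entrypoint: /scripts/terraform-entrypoint.sh
```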
### Jupyter Notebooks
Jupyter Notebooks were used for initial data exploration and for some small
analyses afterwards.
I am using the official `jupyter/scipy-notebook` image, which already contains
everything that I needed.
The notebooks used are stored in the `notebooks` directory.
### PostgreSQL
PostgreSQL is used as the database where the data will be stored and as the SQL
engine to run calculations.
### Apache Airflow
Apache Airflow is used as the orchestrator and scheduler of the whole pipeline.
Airflow is also where the credentials to the database are stored.
I am using an image that inherits from Google Composer's
`composer-2.3.1-airflow-2-5-1` image. This creates an environment more similar
to Composer and makes a potential migration to Composer easier.
The `airflow_img` directory contains the files for building that image. It
is essentially an image inheriting from Google's image, but which also has DBT
installed and which overrides the entrypoint (because the entrypoint of the
original image sets things up for Cloud Composer, not for running locally).
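A sketch of what that Dockerfile might look like (the base-image reference, DBT package, and entrypoint file name are assumptions):

```dockerfile
# Sketch only. The FROM line abbreviates the real registry path of
# Google's Composer image, which the actual Dockerfile pins exactly.
FROM google-composer-image:composer-2.3.1-airflow-2-5-1
# Add DBT so Airflow tasks can call the dbt CLI directly.
RUN pip install --no-cache-dir dbt-postgres
# Override Composer's entrypoint, which assumes a Cloud Composer
# environment rather than a local docker-compose stack.
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```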
The `dags` directory contains one Airflow DAG: `sustainability_score`.
This is the DAG in charge of orchestrating the pipeline.
### Apache Beam
Apache Beam is used as the ETL tool, to read rows as elements from the input
CSV file, clean them up and upsert them into the PostgreSQL database.
In this proof of concept, I am using the DirectRunner, because it is the easiest
to run and reproduce locally. This means that Beam runs within a Python
virtual environment inside the Airflow scheduler container.
The code for Apache Beam is stored in the `etl` directory.
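As a minimal sketch of the kind of row-level cleanup such a pipeline might apply (the column names here are hypothetical; the real transforms live in the `etl` directory), each CSV row could be normalized like this before the upsert:

```python
# Hypothetical sketch of a row-cleaning transform; inside Beam this
# would typically run as beam.Map(clean_row).

def clean_row(row: dict) -> dict:
    """Normalize a raw CSV row before upserting it into PostgreSQL."""
    # Normalize header casing and strip stray whitespace from values.
    cleaned = {k.strip().lower(): (v.strip() if isinstance(v, str) else v)
               for k, v in row.items()}
    # Empty strings become NULLs so the database stays consistent.
    cleaned = {k: (None if v == "" else v) for k, v in cleaned.items()}
    # TCIN is Target's product identifier; keep it as a plain string key.
    if cleaned.get("tcin") is not None:
        cleaned["tcin"] = str(cleaned["tcin"])
    return cleaned
```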
### DBT
DBT is used as the transformation tool, leveraging PostgreSQL's SQL
engine to calculate the score incrementally.
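An incremental scoring model in DBT could look roughly like this sketch (the model, column names, and score formula are hypothetical; the real models live in the `dbt` directory):

```sql
-- Hypothetical sketch of an incremental DBT model.
{{ config(materialized='incremental', unique_key='tcin') }}

select
    tcin,
    updated_at,
    0.5 * material_score + 0.3 * packaging_score + 0.2 * weight_score
        as sustainability_score
from {{ ref('products_clean') }}
{% if is_incremental() %}
  -- Only re-score rows that changed since the last run.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```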
I see DBT as an analytics-specific orchestrator that is itself orchestrated by
the more general-purpose orchestrator, Airflow.
In this proof of concept, DBT is not running on its own container but is
directly installed on the same image as Airflow. I made this decision because I
didn't like any of the other options:
* If I had used DBT Cloud, this whole setup could not have been run locally
  (a constraint I set for myself).
* If I had run DBT with an ad-hoc Docker invocation, I would have had to mount
  the Docker socket inside the Airflow containers (to allow for orchestration),
  which is not ideal.
* dbt-labs' dbt-rpc is already deprecated and missing from newer DBT versions.
* dbt-labs' dbt-server is still quite immature and not well documented.
The code for dbt is stored in the `dbt` directory.
### Terraform
I am using terraform to provision some state within the containers in this stack.
Namely, Terraform is used to set up the user, database and schema in the
Postgres database, and also to create the Airflow connection within Airflow to
access the database.
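As a rough sketch, the PostgreSQL side of that provisioning could look like the following (resource names are illustrative, assuming the `cyrilgdn/postgresql` provider; the Airflow connection would be created by a similar resource from an Airflow provider):

```hcl
# Hypothetical sketch; see the terraform/ directory for the real code.
resource "postgresql_role" "etl" {
  name     = "etl"
  login    = true
  password = var.etl_password
}

resource "postgresql_database" "sustainability" {
  name  = "sustainability"
  owner = postgresql_role.etl.name
}

resource "postgresql_schema" "products" {
  name     = "products"
  database = postgresql_database.sustainability.name
  owner    = postgresql_role.etl.name
}
```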
I am not creating the tables with Terraform because I believe that job is more
appropriate for DBT.
The code for terraform is stored in the `terraform` directory.
### Other directories
#### `data`
This contains the sample CSV file.
#### `scripts`
Contains miscellaneous scripts used inside containers. Currently, there is only
`terraform-entrypoint.sh`, which serves as an entrypoint to Terraform's
container.
#### `state`
This directory will be created at runtime and is used for storing state, so
that the setup has persistence.
## Migration to the cloud
While developing this setup, I kept in mind an eventual move to a cloud
provider. The following adjustments could be made if this were to run on Google
Cloud.
### Apache Airflow
The Airflow code could be easily moved to run on Cloud Composer.
For any credentials needed, it would then be better to use the
[Google Secrets manager backend](https://cloud.google.com/composer/docs/secret-manager)
instead of storing the connections in Airflow's database.
### BigQuery
The final data warehouse used could be BigQuery.
However, due to BigQuery's analytics-first nature, a few small changes would be
needed when moving from a traditional transactional database like
PostgreSQL.
Instead of inserting the elements into the database individually using upsert
statements, it would be better to first write all of them into a given table or
partition with the write disposition set to `WRITE_TRUNCATE` and then use a
`MERGE` statement to write or update the elements in the eventual final
`products` table.
`DBT` would likely be used for that `MERGE` statement.
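The staging-table-plus-`MERGE` pattern could look roughly like this sketch (table and column names are hypothetical):

```sql
-- Hypothetical sketch: upsert the freshly loaded staging partition
-- into the final products table.
MERGE `project.dataset.products` AS t
USING `project.dataset.products_staging` AS s
ON t.tcin = s.tcin
WHEN MATCHED THEN
  UPDATE SET t.sustainability_score = s.sustainability_score,
             t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (tcin, sustainability_score, updated_at)
  VALUES (s.tcin, s.sustainability_score, s.updated_at)
```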
### DBT
Either DBT Cloud or a self-hosted DBT server could be used instead of directly
calling the dbt CLI.
Alternatively, if BigQuery is used as the data warehouse, Dataform would also be
worth exploring, since it provides much better integration with BigQuery
specifically.
### Apache Beam
The Beam ETL code would run using the DataflowRunner.
### GCS
The input CSV file would probably be stored on GCS. Because the ETL pipeline
already reads the input file using `apache_beam.io.filesystems.FileSystems`,
this change should be transparent.
### Terraform
Terraform would still be used to declaratively define all the needed resources.
## Missing pieces
Due to time constraints, there were a few things that I did not have time to
implement.
### Pipeline monitoring
For the local setup, I might have used something like Grafana. For the cloud
deployment, I would leverage Dataflow's monitoring interface and Google Cloud Monitoring.
### Data quality and integrity tests
DBT is already in place, but I would have liked to add more tests for the data
itself.
### Alerting
Airflow can provide alerts when the DAG fails, via an SMTP server, an
instant-messaging notification, or even a Pub/Sub queue.
### Dashboards
I am not experienced enough in this area to build one quickly, but a
dashboard visualisation of the data using Looker (for the cloud deployment) or
Grafana (for the local setup) would have been beneficial too.
### Better analysis, statistics and plots
Again, due to time constraints, I could not polish the statistics, analyses,
and plots in the data analysis stage as much as I would have liked.