feat: added readme

Ricard Illa 2023-06-26 08:42:26 +02:00
# Sustainability Score
The task at hand is to compute a sustainability score for products available on
Target's website. This score will be based on various product attributes such
as materials, packaging, weight, dimensions, TCIN, origin, and other relevant
information. The goal is to process and clean the provided product data, store
it in an SQL database, and calculate the sustainability score for each product.
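As a purely illustrative sketch of what such a score could look like (the attribute names and weights below are invented; the actual scoring lives in the SQL/DBT layer described later in this README), a weighted combination of product attributes might be:

```python
# Hypothetical illustration only: attribute names and weights are made up.

def sustainability_score(product: dict) -> float:
    """Combine a few product attributes into a 0-100 score."""
    material_scores = {"recycled": 1.0, "organic": 0.9, "plastic": 0.2}
    material = material_scores.get(product.get("material", ""), 0.5)
    # Lighter products ship more efficiently; cap the penalty at 1.0.
    weight_penalty = min(product.get("weight_kg", 0.0) / 10.0, 1.0)
    packaging = 1.0 if product.get("recyclable_packaging") else 0.4
    score = 100 * (0.5 * material + 0.3 * packaging + 0.2 * (1 - weight_penalty))
    return round(score, 1)
```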
## Architecture and stack
To tackle this problem, I decided to use technologies that are easy to run locally
for a small proof of concept but that have an easy migration path to a cloud
provider.
I am using docker-compose to define the services used.
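As an illustration, the compose file for this stack could look roughly like the following sketch; the service names, images, and mounts are assumptions, not the project's actual `docker-compose.yml`:

```yaml
# Hypothetical sketch of the stack; see the actual docker-compose.yml
# for the real service definitions.
services:
  postgres:
    image: postgres:15
    volumes:
      - ./state/postgres:/var/lib/postgresql/data
  airflow:
    build: ./airflow_img
    depends_on:
      - postgres
    volumes:
      - ./dags:/opt/airflow/dags
      - ./etl:/opt/etl
      - ./dbt:/opt/dbt
  jupyter:
    image: jupyter/scipy-notebook
    volumes:
      - ./notebooks:/home/jovyan/work
  terraform:
    image: hashicorp/terraform
    entrypoint: /scripts/terraform-entrypoint.sh
```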
### Jupyter Notebooks
Jupyter Notebooks were used for initial data exploration and for some small
analyses afterwards.
I am using the official `jupyter/scipy-notebook` image, which already contains
everything that I needed.
The notebooks used are stored in the `notebooks` directory.
### PostgreSQL
PostgreSQL is used as the database where the data will be stored and as the SQL
engine to run calculations.
### Apache Airflow
Apache Airflow is used as the orchestrator and scheduler of the whole pipeline.
Airflow is also where the credentials to the database are stored.
I am using an image that inherits from Google Composer's
`composer-2.3.1-airflow-2-5-1` image. This creates an environment more similar
to Composer and makes a potential migration to Composer easier.
The `airflow_img` directory contains the files for building that image. It
is essentially an image inheriting from Google's image, but which also has DBT
installed and which overrides the entrypoint (because the entrypoint of the
original image sets things up for Cloud Composer, not for running locally).
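A sketch of what that Dockerfile might look like (the base-image reference, DBT package, and entrypoint file name are assumptions):

```dockerfile
# Sketch only. The FROM line abbreviates the real registry path of
# Google's Composer image, which the actual Dockerfile pins exactly.
FROM google-composer-image:composer-2.3.1-airflow-2-5-1
# Add DBT so Airflow tasks can call the dbt CLI directly.
RUN pip install --no-cache-dir dbt-postgres
# Override Composer's entrypoint, which assumes a Cloud Composer
# environment rather than a local docker-compose stack.
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```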
The `dags` directory contains one Airflow DAG: `sustainability_score`.
This is the DAG in charge of orchestrating the pipeline.
### Apache Beam
Apache Beam is used as the ETL tool, to read rows as elements from the input
CSV file, clean them up and upsert them into the PostgreSQL database.
In this proof of concept, I am using the DirectRunner, because it is the easiest
to run and reproduce locally. This means that Beam runs within a Python
virtual environment inside the Airflow scheduler container.
The code for Apache Beam is stored in the `etl` directory.
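As a minimal sketch of the kind of row-level cleanup such a pipeline might apply (the column names here are hypothetical; the real transforms live in the `etl` directory), each CSV row could be normalized like this before the upsert:

```python
# Hypothetical sketch of a row-cleaning transform; inside Beam this
# would typically run as beam.Map(clean_row).

def clean_row(row: dict) -> dict:
    """Normalize a raw CSV row before upserting it into PostgreSQL."""
    # Normalize header casing and strip stray whitespace from values.
    cleaned = {k.strip().lower(): (v.strip() if isinstance(v, str) else v)
               for k, v in row.items()}
    # Empty strings become NULLs so the database stays consistent.
    cleaned = {k: (None if v == "" else v) for k, v in cleaned.items()}
    # TCIN is Target's product identifier; keep it as a plain string key.
    if cleaned.get("tcin") is not None:
        cleaned["tcin"] = str(cleaned["tcin"])
    return cleaned
```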
### DBT
DBT is used as the transformation tool, leveraging PostgreSQL's SQL
engine to calculate the score incrementally.
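An incremental scoring model in DBT could look roughly like this sketch (the model, column names, and score formula are hypothetical; the real models live in the `dbt` directory):

```sql
-- Hypothetical sketch of an incremental DBT model.
{{ config(materialized='incremental', unique_key='tcin') }}

select
    tcin,
    updated_at,
    0.5 * material_score + 0.3 * packaging_score + 0.2 * weight_score
        as sustainability_score
from {{ ref('products_clean') }}
{% if is_incremental() %}
  -- Only re-score rows that changed since the last run.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```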
I see DBT as an analytics-specific orchestrator that is itself orchestrated by
the more general-purpose orchestrator, Airflow.
In this proof of concept, DBT is not running on its own container but is
directly installed on the same image as Airflow. I made this decision because I
didn't like any of the other options:
* If I had used DBT Cloud, this whole setup could not have been run locally
  (a constraint I set for myself).
* If I had run DBT with an ad-hoc Docker invocation, I would have had to mount
  the Docker socket inside the Airflow containers (to allow for orchestration),
  which is not ideal.
* dbt-labs' dbt-rpc is already deprecated and missing from newer DBT versions.
* dbt-labs' dbt-server is still quite immature and not well documented.
The code for dbt is stored in the `dbt` directory.
### Terraform
I am using terraform to provision some state within the containers in this stack.
Namely, Terraform is used to set up the user, database and schema in the
Postgres database, and also to create the Airflow connection within Airflow to
access the database.
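As a rough sketch, the PostgreSQL side of that provisioning could look like the following (resource names are illustrative, assuming the `cyrilgdn/postgresql` provider; the Airflow connection would be created by a similar resource from an Airflow provider):

```hcl
# Hypothetical sketch; see the terraform/ directory for the real code.
resource "postgresql_role" "etl" {
  name     = "etl"
  login    = true
  password = var.etl_password
}

resource "postgresql_database" "sustainability" {
  name  = "sustainability"
  owner = postgresql_role.etl.name
}

resource "postgresql_schema" "products" {
  name     = "products"
  database = postgresql_database.sustainability.name
  owner    = postgresql_role.etl.name
}
```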
I am not creating the tables with Terraform because I believe that job is more
appropriate for DBT.
The code for terraform is stored in the `terraform` directory.
### Other directories
#### `data`
This contains the sample CSV file.
#### `scripts`
Contains miscellaneous scripts used inside containers. Currently, there is only
`terraform-entrypoint.sh`, which serves as an entrypoint to Terraform's
container.
#### `state`
This directory will be created at runtime and is used for storing state, so
that the setup has persistence.
## Migration to the cloud
While developing this setup, I kept in mind an eventual move to a cloud
provider. The following adjustments could be made if this were to run on Google
Cloud.
### Apache Airflow
The Airflow code could be easily moved to run on Cloud Composer.
For any credentials needed, it would then be better to use the
[Google Secrets manager backend](https://cloud.google.com/composer/docs/secret-manager)
instead of storing the connections in Airflow's database.
### BigQuery
The final data warehouse used could be BigQuery.
However, due to BigQuery's analytics-first nature, a few small changes would be
needed when moving from a traditional transactional database like
PostgreSQL.
Instead of inserting the elements into the database individually using upsert
statements, it would be better to first write all of them into a given table or
partition with the write disposition set to `WRITE_TRUNCATE` and then use a
`MERGE` statement to write or update the elements in the eventual final
`products` table.
`DBT` would likely be used for that `MERGE` statement.
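The staging-table-plus-`MERGE` pattern could look roughly like this sketch (table and column names are hypothetical):

```sql
-- Hypothetical sketch: upsert the freshly loaded staging partition
-- into the final products table.
MERGE `project.dataset.products` AS t
USING `project.dataset.products_staging` AS s
ON t.tcin = s.tcin
WHEN MATCHED THEN
  UPDATE SET t.sustainability_score = s.sustainability_score,
             t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (tcin, sustainability_score, updated_at)
  VALUES (s.tcin, s.sustainability_score, s.updated_at)
```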
### DBT
Either DBT Cloud or a self-hosted DBT server could be used instead of directly
calling the dbt CLI.
Alternatively, if BigQuery is used as the data warehouse, Dataform would also be
worth exploring, since it provides much better integration with BigQuery
specifically.
### Apache Beam
The Beam ETL code would run using the DataflowRunner.
### GCS
The input CSV file would probably be stored on GCS. Because the ETL pipeline
already reads the input file using `apache_beam.io.filesystems.FileSystems`,
this change should be transparent.
### Terraform
Terraform would still be used to declaratively define all the needed resources.
## Missing pieces
Due to time constraints, there were a few things that I did not have time to
implement.
### Pipeline monitoring
For the local setup, I might have used something like Grafana. For the cloud
deployment, I would leverage Dataflow's monitoring interface and Google Cloud Monitoring.
### Data quality and integrity tests
DBT is already in place, but I would have liked to add more tests for the data
itself.
### Alerting
Airflow can provide alerts when the DAG fails, via an SMTP server, an
instant-messaging notification, or even a Pub/Sub queue.
### Dashboards
I am not experienced enough in this area to build one quickly, but a
dashboard visualisation of the data using Looker (for the cloud deployment) or
Grafana (for the local setup) would have been beneficial too.
### Better analysis, statistics and plots
Again, due to time constraints, I could not polish the statistics, analyses,
and plots in the data analysis stage as much as I would have liked.