# Sustainability Score

The task at hand is to compute a sustainability score for products available on Target's website. This score will be based on various product attributes such as materials, packaging, weight, dimensions, TCIN, origin, and other relevant information. The goal is to process and clean the provided product data, store it in an SQL database, and calculate the sustainability score for each product.

## Architecture and stack

To tackle this issue I decided to use technologies that are easy to run locally for a small proof of concept but have an easy migration path to a cloud provider. I am using docker-compose to define the services used.

### Jupyter Notebooks

Jupyter Notebooks were used for initial data exploration and for some small analyses afterwards. I am using the official `jupyter/scipy-notebook` image, which already contains everything I needed.

The notebooks used are stored in the `notebooks` directory.

### PostgreSQL

PostgreSQL is used as the database where the data will be stored and as the SQL engine to run calculations.

### Apache Airflow

Apache Airflow is used as the orchestrator and scheduler of the whole pipeline. Airflow is also where the credentials to the database are stored.

I am using an image that inherits from Cloud Composer's `composer-2.3.1-airflow-2-5-1` image. This creates an environment more similar to Composer and makes a potential migration to Composer easier.

The directory `airflow_img` contains the files for building that image. It is essentially an image that inherits from Google's image, adds DBT, and overrides the entrypoint (because the entrypoint of the original image sets things up for Cloud Composer, not for running locally).

The `dags` directory contains one Airflow DAG: `sustainability_score`.
This is the DAG in charge of orchestrating the pipeline.

### Apache Beam

Apache Beam is used as the ETL tool, to read rows as elements from the input CSV file, clean them up and upsert them into the PostgreSQL database.

In this proof of concept, I am using the DirectRunner, because it is the easiest to run and reproduce locally. This means that Beam will run within a Python virtual environment inside the Airflow scheduler container.

The code for Apache Beam is stored in the `etl` directory.

### DBT

DBT is used as the transformation tool, to leverage PostgreSQL's SQL engine to calculate the score incrementally. I see DBT as an analytics-specific orchestrator that is itself orchestrated by the more general-purpose orchestrator, Airflow.

In this proof of concept, DBT is not running in its own container but is installed directly on the same image as Airflow. I made this decision because I didn't like any of the other options:

* If I had used DBT Cloud, this whole setup could not have been run locally (running locally is a requirement I set for myself).
* If I had run DBT with an ad-hoc Docker invocation, I would have had to mount the Docker socket inside the Airflow containers (to allow for orchestration), which is not ideal.
* dbt-labs' dbt-rpc is already deprecated and missing in newer DBT versions.
* dbt-labs' dbt-server is still quite immature and not well documented.

The code for dbt is stored in the `dbt` directory.

### Terraform

I am using Terraform to provision some state within the containers in this stack. Namely, Terraform is used to set up the user, database and schema in the Postgres database, and also to create the Airflow connection within Airflow to access the database.

I am not creating the tables with Terraform because I believe that job is more appropriate for DBT.

The code for Terraform is stored in the `terraform` directory.

### Other directories

#### `data`

This contains the sample CSV file.

#### `scripts`

Contains miscellaneous scripts used inside containers.
Currently, there is only `terraform-entrypoint.sh`, which serves as an entrypoint to Terraform's container.

#### `state`

This directory will be created at runtime and is used for storing state, so that the setup has persistence.

## Migration to the cloud

While developing this setup, I kept in mind an eventual move to a cloud provider. The following adjustments could be made if this were to run on Google Cloud.

### Apache Airflow

The Airflow code could be easily moved to run on Cloud Composer. For any credentials needed, it would then be better to use the [Google Secret Manager backend](https://cloud.google.com/composer/docs/secret-manager) instead of storing the connections in Airflow's database.

### BigQuery

The final data warehouse used could be BigQuery. However, due to BigQuery's analytics-first nature, a few small changes should be made when moving from a traditional transactional database like PostgreSQL.

Instead of inserting the elements into the database individually using upsert statements, it would be better to first write all of them into a given table or partition with the write disposition set to `WRITE_TRUNCATE` and then use a `MERGE` statement to insert or update the elements in the final `products` table.

DBT would likely be used for that `MERGE` statement.

### DBT

Either DBT Cloud or a self-hosted DBT server could be used instead of directly calling the dbt CLI.

Alternatively, if BigQuery is used as the data warehouse, Dataform would also be worth exploring, since it provides a much better integration with BigQuery specifically.

### Apache Beam

The Beam ETL code would run using the DataflowRunner.

### GCS

The input CSV file would probably be stored on GCS. Because the ETL pipeline already reads the input file using `apache_beam.io.filesystems.FileSystems`, this change should be transparent.

### Terraform

Terraform would still be used to declaratively define all the needed resources.

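As a rough illustration of the staging-then-`MERGE` approach described in the BigQuery subsection of this section, the sketch below builds such a statement in plain Python. It is not taken from this repository: the function `build_merge_sql`, the table names `my_project.sustainability.products` and `my_project.sustainability.products_staging`, the choice of `tcin` as the merge key and the column list are all assumptions made for the example. In practice DBT would likely generate an equivalent statement.

```python
from typing import Sequence


def build_merge_sql(
    target: str,
    staging: str,
    key: str,
    columns: Sequence[str],
) -> str:
    """Return a BigQuery MERGE statement that upserts staging rows into target.

    `key` is the column used to match rows; `columns` are the remaining
    columns, which are updated on a match and inserted otherwise.
    """
    if not columns:
        raise ValueError("at least one non-key column is required")
    if key in columns:
        raise ValueError("key must not be listed among the updatable columns")

    # One "col = s.col" assignment per non-key column.
    updates = ",\n    ".join(f"{col} = s.{col}" for col in columns)
    all_cols = [key, *columns]
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{col}" for col in all_cols)

    return (
        f"MERGE `{target}` AS t\n"
        f"USING `{staging}` AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN\n"
        f"  UPDATE SET\n    {updates}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"  INSERT ({insert_cols})\n"
        f"  VALUES ({insert_vals})"
    )


# Hypothetical table and column names, for illustration only.
MERGE_SQL = build_merge_sql(
    target="my_project.sustainability.products",
    staging="my_project.sustainability.products_staging",
    key="tcin",
    columns=["materials", "packaging", "origin"],
)
print(MERGE_SQL)
```

The staging table would first be fully overwritten by the load step (write disposition `WRITE_TRUNCATE`), so the `MERGE` always sees exactly one batch of rows.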
## Missing pieces

Due to time constraints, there were a few things that I did not have time to implement.

### Pipeline monitoring

For the local setup I might have used something like Grafana. For the cloud deployment I would leverage Dataflow's monitoring interface and Google Cloud Monitoring.

### Data quality and integrity tests

DBT is already in place, but I would have liked to add more tests for the data itself.

### Alerting

Airflow can send alerts when the DAG fails, for example via an SMTP server, an instant-messaging notification or even a Pub/Sub topic.

### Dashboards

I am not experienced enough with dashboards to have built one quickly enough, but a dashboard visualisation of the data using Looker (for the cloud deployment) or Grafana (for the local setup) would have been beneficial too.

### Better analysis, statistics and plots

Again, due to time constraints, I could not add sufficiently polished statistics, analyses and plots to the data analysis stage.