diff --git a/README.md b/README.md
new file mode 100644
index 0000000..325c437
--- /dev/null
+++ b/README.md
@@ -0,0 +1,193 @@
+# Sustainability Score
+
+The task at hand is to compute a sustainability score for products available on
+Target's website. This score will be based on various product attributes such
+as materials, packaging, weight, dimensions, TCIN, origin, and other relevant
+information. The goal is to process and clean the provided product data, store
+it in an SQL database, and calculate the sustainability score for each product.
+
+
+## Architecture and stack
+
+To tackle this problem I decided to use technologies that are easy to run
+locally for a small proof of concept but that have an easy migration path to a
+cloud provider.
+
+I am using docker-compose to define the services used.
+
+### Jupyter Notebooks
+
+Jupyter Notebooks were used for the initial data exploration and for some small
+analyses afterwards.
+
+I am using the official `jupyter/scipy-notebook` image, which already contains
+everything that I needed.
+
+The notebooks used are stored in the `notebooks` directory.
+
+### PostgreSQL
+
+PostgreSQL is used as the database where the data will be stored and as the SQL
+engine to run calculations.
+
+### Apache Airflow
+
+Apache Airflow is used as the orchestrator and scheduler of the whole pipeline.
+Airflow is also where the credentials for the database are stored.
+
+I am using an image that inherits from Google Composer's
+`composer-2.3.1-airflow-2-5-1` image. This creates an environment more similar
+to Composer and makes a potential migration to Composer easier.
+
+The `airflow_img` directory contains the files for building that image. It is
+essentially an image that inherits from Google's image, additionally installs
+DBT and overrides the entrypoint (because the entrypoint of the original image
+sets things up for Cloud Composer, not for running locally).
+
+The `dags` directory contains one Airflow DAG: `sustainability_score`. This is
+the DAG in charge of orchestrating the pipeline.
+
+### Apache Beam
+
+Apache Beam is used as the ETL tool, to read rows as elements from the input
+CSV file, clean them up and upsert them into the PostgreSQL database.
+
+In this proof of concept, I am using the DirectRunner, because it is the
+easiest to run and reproduce locally. This means that Beam will run within a
+Python virtual environment inside the Airflow scheduler container.
+
+The code for Apache Beam is stored in the `etl` directory.
+
+### DBT
+
+DBT is used as the transformation tool, to leverage PostgreSQL's SQL engine to
+calculate the score incrementally.
+I see DBT as an analytics-specific orchestrator that is itself orchestrated by
+the more general-purpose orchestrator, Airflow.
+
+In this proof of concept, DBT does not run in its own container but is
+installed directly on the same image as Airflow. I made this decision because I
+didn't like any of the other options:
+
+* If I had used DBT Cloud, this whole setup could not have been run locally
+  (and being able to run everything locally is a constraint I set myself).
+* If I had run DBT with an ad-hoc Docker invocation, I would have had to mount
+  the Docker socket inside the Airflow containers (to allow for orchestration),
+  which is not ideal.
+* dbt-labs' dbt-rpc is already deprecated and missing in newer DBT versions.
+* dbt-labs' dbt-server is still quite immature and not well documented.
+
+The code for dbt is stored in the `dbt` directory.
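+
+To make the orchestration above more concrete, below is a minimal sketch of
+what a DAG like `sustainability_score` could look like, with the Beam ETL
+launched on the DirectRunner and dbt invoked through its CLI. The paths, task
+ids and CLI flags are illustrative assumptions, not the actual contents of the
+`dags` directory.
+
+```python
+# Sketch only: the entry points and directories below are hypothetical.
+import pendulum
+from airflow import DAG
+from airflow.operators.bash import BashOperator
+
+with DAG(
+    dag_id="sustainability_score",
+    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
+    schedule="@daily",
+    catchup=False,
+) as dag:
+    # Run the Beam ETL on the DirectRunner inside the scheduler container:
+    # read the CSV, clean the rows and upsert them into PostgreSQL.
+    beam_etl = BashOperator(
+        task_id="beam_etl",
+        bash_command=(
+            "python /opt/etl/pipeline.py "      # hypothetical entry point
+            "--runner DirectRunner "
+            "--input /opt/data/products.csv"    # hypothetical CSV location
+        ),
+    )
+
+    # Let dbt (installed on the same image as Airflow) compute the score
+    # incrementally on top of the freshly upserted rows.
+    dbt_run = BashOperator(
+        task_id="dbt_run",
+        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
+    )
+
+    beam_etl >> dbt_run
+```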
+
+### Terraform
+
+I am using Terraform to provision some state within the containers in this
+stack. Namely, Terraform is used to set up the user, database and schema in the
+PostgreSQL database, and also to create the Airflow connection used to access
+that database.
+
+I am not creating the tables with Terraform because I believe that job is more
+appropriate for DBT.
+
+The code for Terraform is stored in the `terraform` directory.
+
+### Other directories
+
+#### `data`
+
+This contains the sample CSV file.
+
+#### `scripts`
+
+Contains miscellaneous scripts used inside containers. Currently, there is only
+`terraform-entrypoint.sh`, which serves as the entrypoint of Terraform's
+container.
+
+#### `state`
+
+This directory will be created at runtime and is used for storing state, so
+that the setup has persistence.
+
+
+## Migration to the cloud
+
+While developing this setup, I kept in mind an eventual move to a cloud
+provider. The following adjustments could be made if this were to run on Google
+Cloud.
+
+### Apache Airflow
+
+The Airflow code could easily be moved to run on Cloud Composer.
+
+For any credentials needed, it would then be better to use the
+[Google Secret Manager backend](https://cloud.google.com/composer/docs/secret-manager)
+instead of storing the connections in Airflow's database.
+
+### BigQuery
+
+The final data warehouse could be BigQuery.
+
+However, due to BigQuery's analytics-first nature, a few small changes would be
+needed when moving from a traditional transactional database like PostgreSQL.
+
+Instead of inserting the elements into the database individually using upsert
+statements, it would be better to first write all of them into a staging table
+or partition with the write disposition set to `WRITE_TRUNCATE` and then use a
+`MERGE` statement to insert or update the elements in the eventual final
+`products` table.
+DBT would likely be used for that `MERGE` statement.
+
+### DBT
+
+Either DBT Cloud or a self-hosted DBT server could be used instead of directly
+calling the dbt CLI.
+Alternatively, if BigQuery is used as the data warehouse, Dataform would also
+be worth exploring, since it provides a much better integration with BigQuery
+specifically.
+
+### Apache Beam
+
+The Beam ETL code would run using the DataflowRunner.
+
+### GCS
+
+The input CSV file would probably be stored on GCS. Because the ETL pipeline
+already reads the input file using `apache_beam.io.filesystems.FileSystems`,
+this change should be transparent.
+
+### Terraform
+
+Terraform would still be used to declaratively define all the needed resources.
+
+
+## Missing pieces
+
+Due to time constraints, there were a few things that I did not have time to
+implement.
+
+### Pipeline monitoring
+
+For the local setup I would probably have used something like Grafana. For the
+cloud deployment I'd leverage Dataflow's monitoring interface and Google Cloud
+Monitoring.
+
+### Data quality and integrity tests
+
+DBT is already in place, but I would have liked to add more tests for the data
+itself.
+
+### Alerting
+
+Airflow can provide alerts if the DAG fails, via things like an SMTP server, an
+instant-messaging notification or even a Pub/Sub queue. A minimal sketch of a
+failure callback is included below, after the Dashboards subsection.
+
+### Dashboards
+
+I am not experienced enough with this to have done it quickly, but a dashboard
+visualisation of the data using Looker (for the cloud deployment) or Grafana
+(for the local setup) would also have been beneficial.
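+
+As mentioned in the Alerting subsection above, Airflow can notify on DAG
+failures. Below is a minimal sketch of what such a callback might look like;
+the webhook URL and e-mail address are hypothetical placeholders, not part of
+this repository.
+
+```python
+# Sketch only: the webhook URL and e-mail address below are made up.
+import json
+import urllib.request
+
+
+def notify_failure(context):
+    """Post a short message to an instant-messaging webhook when a task fails."""
+    message = {
+        "text": (
+            f"Task {context['task_instance'].task_id} of DAG "
+            f"{context['dag'].dag_id} failed for {context['ds']}."
+        )
+    }
+    request = urllib.request.Request(
+        "https://chat.example.com/hooks/data-alerts",  # hypothetical webhook
+        data=json.dumps(message).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+    )
+    urllib.request.urlopen(request)
+
+
+# These default_args could be passed to the `sustainability_score` DAG so that
+# every task alerts on failure, both by e-mail (via SMTP) and via the webhook.
+default_args = {
+    "email": ["data-team@example.com"],
+    "email_on_failure": True,
+    "on_failure_callback": notify_failure,
+}
+```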
+
+### Better analysis, statistics and plots
+
+Again, due to time constraints, I could not add sufficiently polished
+statistics, analyses and plots to the data analysis stage.
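+
+As an illustration of the kind of notebook analysis that is still missing, a
+rough sketch is shown below; the connection string, table name and column name
+are assumptions and would need to match the actual schema created by DBT.
+
+```python
+# Sketch only: credentials, table and column names below are hypothetical.
+import pandas as pd
+import sqlalchemy
+
+engine = sqlalchemy.create_engine(
+    "postgresql://user:password@postgres:5432/sustainability"
+)
+
+# Pull the scored products and look at how the score is distributed.
+products = pd.read_sql("SELECT * FROM products", engine)
+print(products["sustainability_score"].describe())
+
+# A simple histogram of the score, e.g. from inside a Jupyter notebook.
+ax = products["sustainability_score"].plot.hist(bins=20)
+ax.set_xlabel("sustainability score")
+ax.figure.savefig("score_distribution.png")
+```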