# Sustainability Score

The task at hand is to compute a sustainability score for products available on
Target's website. This score will be based on various product attributes such
as materials, packaging, weight, dimensions, TCIN, origin, and other relevant
information. The goal is to process and clean the provided product data, store
it in an SQL database, and calculate the sustainability score for each product.

## Architecture and stack

To tackle this issue I decided to use technologies that are easy to run locally
for a small proof of concept but have an easy migration path to run on a cloud
provider.

I am using docker-compose to define the services used.

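As a rough illustration of the shape of that file (service names, images, and
mounts below are assumptions for the sketch, not the actual `docker-compose.yml`):

```yaml
# Illustrative sketch only; the real compose file differs in details.
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example   # local-only credential
    volumes:
      - ./state/postgres:/var/lib/postgresql/data
  jupyter:
    image: jupyter/scipy-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
  airflow:
    build: ./airflow_img
    depends_on:
      - postgres
    volumes:
      - ./dags:/opt/airflow/dags
```
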
### Jupyter Notebooks

Jupyter Notebooks were used for initial data exploration and for some small
analyses afterwards.

I am using the official `jupyter/scipy-notebook` image, which already contains
everything that I needed.

The notebooks used are stored in the `notebooks` directory.

### PostgreSQL

PostgreSQL is used as the database where the data will be stored and as the SQL
engine to run calculations.

### Apache Airflow

Apache Airflow is used as the orchestrator and scheduler of the whole pipeline.
Airflow is also where the credentials for the database are stored.

I am using an image that inherits from Google Composer's
`composer-2.3.1-airflow-2-5-1` image. This creates an environment more similar
to Composer and makes a potential migration to Composer easier.

The `airflow_img` directory contains the files for building that image. It
is essentially an image inheriting from Google's image but which also has DBT
installed and which overrides the entrypoint (because the entrypoint of the
original image sets things up for Cloud Composer, not for running locally).

The `dags` directory contains one Airflow DAG: `sustainability_score`.
This is the DAG in charge of orchestrating the pipeline.

### Apache Beam

Apache Beam is used as the ETL tool, to read rows as elements from the input
CSV file, clean them up and upsert them into the PostgreSQL database.

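As an illustration of what that per-row cleanup can look like (the column
names and cleaning rules below are invented for the sketch, not the actual
code in `etl`):

```python
import csv
import io

# Hypothetical per-row cleanup; the real ETL's columns and rules differ.
def clean_row(row: dict) -> dict:
    """Normalize one CSV row before upserting it into PostgreSQL."""
    # Lowercase the column names and strip stray whitespace from values.
    cleaned = {k.strip().lower(): (v.strip() if isinstance(v, str) else v)
               for k, v in row.items()}
    # Coerce numeric fields, treating empty strings as missing values.
    for field in ("weight", "height", "width", "depth"):
        value = cleaned.get(field)
        cleaned[field] = float(value) if value not in (None, "") else None
    return cleaned

sample = io.StringIO("TCIN,Materials,Weight\n123, cotton ,2.5\n")
rows = [clean_row(r) for r in csv.DictReader(sample)]
```

Inside the pipeline, a function like this would typically be applied with a
`beam.Map(clean_row)` step before the upsert.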
In this proof of concept, I am using the DirectRunner, because it's the easiest
to run and reproduce locally. This means that Beam will run within a Python
virtual environment inside the Airflow scheduler container.

The code for Apache Beam is stored in the `etl` directory.

### DBT

DBT is used as the transformation tool, to leverage PostgreSQL's SQL
engine to calculate the score incrementally.
I see DBT as an analytics-specific orchestrator that is itself orchestrated by
the more general-purpose orchestrator, Airflow.

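As an illustration of how such an incremental model can be shaped (the real
models live in the `dbt` directory; the columns, the `stg_products` reference,
and the score formula below are invented for the example):

```sql
-- Hypothetical dbt model; columns, weights, and formula are illustrative.
{{ config(materialized='incremental', unique_key='tcin') }}

select
    tcin,
    -- toy formula: lighter, locally sourced products score higher
    (1.0 / nullif(weight, 0))
        + case when origin = 'local' then 1 else 0 end as sustainability_score,
    updated_at
from {{ ref('stg_products') }}
{% if is_incremental() %}
-- only reprocess rows that changed since the last run
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```
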
In this proof of concept, DBT is not running in its own container but is
installed directly on the same image as Airflow. I made this decision because I
didn't like any of the other options:

* If I had used DBT Cloud, this whole setup could not have been run locally
  (a constraint I imposed on myself).
* If I had run DBT with an ad-hoc Docker invocation, I would have had to mount
  the Docker socket inside the Airflow containers (to allow for orchestration),
  which is not ideal.
* dbt-labs' dbt-rpc is already deprecated and missing in newer DBT versions.
* dbt-labs' dbt-server is still quite immature and not well documented.

The code for dbt is stored in the `dbt` directory.

### Terraform

I am using Terraform to provision some state within the containers in this stack.
Namely, Terraform is used to set up the user, database, and schema in the
Postgres database, and also to create the Airflow connection used to
access the database.

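A sketch of what that provisioning can look like, assuming the community
PostgreSQL provider (resource and variable names here are illustrative; the
real code is in the `terraform` directory):

```hcl
# Illustrative only; names and provider details are assumptions.
resource "postgresql_role" "app" {
  name     = "sustainability"
  login    = true
  password = var.db_password
}

resource "postgresql_database" "app" {
  name  = "sustainability"
  owner = postgresql_role.app.name
}

resource "postgresql_schema" "app" {
  name     = "sustainability"
  database = postgresql_database.app.name
}

# The Airflow connection would be created analogously, e.g. through an
# Airflow provider or by invoking the Airflow CLI from Terraform.
```
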
I am not creating the tables with Terraform because I believe that job is more
appropriate for DBT.

The code for terraform is stored in the `terraform` directory.

### Other directories

#### `data`

This contains the sample CSV file.

#### `scripts`

Contains misc scripts used inside containers. Currently, there is only
`terraform-entrypoint.sh`, which serves as the entrypoint to Terraform's
container.

#### `state`

This directory will be created at runtime and is used for storing state, so
that the setup has persistence.

## Migration to the cloud

While developing this setup, I kept in mind an eventual move to a cloud
provider. The following adjustments could be made if this were to run on Google
Cloud.

### Apache Airflow

The Airflow code could be easily moved to run on Cloud Composer.

For any credentials needed, it would then be better to use the
[Google Secret Manager backend](https://cloud.google.com/composer/docs/secret-manager)
instead of storing the connections in Airflow's database.

### BigQuery

The final data warehouse used could be BigQuery.

However, due to BigQuery's analytics-first nature, a few small changes should
be made when moving from a traditional transactional database like
PostgreSQL.

Instead of inserting the elements into the database individually using upsert
statements, it would be better to first write all of them into a given table or
partition with the write disposition set to `WRITE_TRUNCATE` and then use a
`MERGE` statement to write or update the elements in the eventual final
`products` table.
DBT would likely be used for that `MERGE` statement.

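For illustration, such a `MERGE` could look roughly like the following (the
dataset, table, and column names are assumptions for the sketch):

```sql
-- Hypothetical MERGE from a freshly truncated staging table into products.
MERGE `dataset.products` AS p
USING `dataset.products_staging` AS s
ON p.tcin = s.tcin
WHEN MATCHED THEN
  UPDATE SET p.sustainability_score = s.sustainability_score
WHEN NOT MATCHED THEN
  INSERT (tcin, sustainability_score)
  VALUES (s.tcin, s.sustainability_score)
```
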
### DBT

Either DBT Cloud or a self-hosted DBT server could be used instead of directly
calling the dbt CLI.
Alternatively, if BigQuery is used as the data warehouse, Dataform would also be
worth exploring, since it provides a much better integration with BigQuery
specifically.

### Apache Beam

The Beam ETL code would run using the DataflowRunner.

### GCS

The input CSV file would probably be stored on GCS. Because the ETL pipeline
already reads the input file using `apache_beam.io.filesystems.FileSystems`,
this change should be transparent.

### Terraform

Terraform would still be used to declaratively define all the needed resources.

## Missing pieces

Due to time constraints, there are a few things that I did not get to
implement.

### Pipeline monitoring

For the local setup I might have used something like Grafana. For the cloud
deployment I'd leverage Dataflow's monitoring interface and Google Cloud Monitoring.

### Data quality and integrity tests

DBT is already in place, but I would have liked to add more tests for the data
itself.

### Alerting

Airflow can provide alerts if the DAG fails via things like an SMTP server, an
instant messaging notification, or even a Pub/Sub queue.

### Dashboards

I am not experienced enough in this area to have done it quickly, but a
dashboard visualisation of the data using Looker (for the cloud deployment) or
Grafana (for the local setup) would have been beneficial too.

### Better analysis, statistics and plots

Again, due to time constraints, I could not add sufficiently polished
statistics, analyses, and plots to the data analysis stage.