diff --git a/etl/README.md b/etl/README.md
new file mode 100644
index 0000000..302ba2f
--- /dev/null
+++ b/etl/README.md
@@ -0,0 +1,47 @@
+This is the ETL pipeline that reads elements from a CSV file, parses and cleans
+them up, and inserts them into a PostgreSQL database.
+It has been tested only with DirectRunner, but it could be moved to run on
+Dataflow easily.
+
+## Running
+
+This is intended to be scheduled by Airflow, but if the necessary packages are
+available it can also be run manually with:
+
+```sh
+python3 /etl/main.py \
+  --runner=DirectRunner \
+  --input="$CSV_INPUT_FILE" \
+  --pg_hostname="$PG_HOSTNAME" \
+  --pg_port="$PG_PORT" \
+  --pg_username="$PG_USERNAME" \
+  --pg_password="$PG_PASSWORD" \
+  --pg_database="$PG_DATABASE" \
+  --pg_table="$PG_TABLE"
+```
+
+## Testing and linting
+
+To help with development and testing, a `Dockerfile`, a `Makefile` and a
+`justfile` are also provided.
+
+The `Makefile` provides a mechanism to
+
+* automate the generation of `dev-requirements.txt` and `requirements.txt` from
+  `pyproject.toml`
+* automate the creation of a Python virtual environment which contains the
+  right Python version (installed by pyenv) and the packages defined in `pyproject.toml`
+* automate the building of an OCI image with the necessary dependencies
+
+The provided `Dockerfile` is used to build an image with the necessary packages
+to run `pytest` and `pylint`.
+
+The provided `justfile` defines the commands to run `pytest` and `pylint` from
+a container.
+
+If [`just`](https://github.com/casey/just) is installed, `pytest` and `pylint` can be run like so:
+
+```sh
+just test
+just lint
+```
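
The README's opening describes reading elements from a CSV file and parsing/cleaning them before they are inserted into PostgreSQL. The diff does not show how `main.py` does this, so the following is only a minimal sketch of what such a parse/clean step might look like, in plain standard-library Python rather than as a runner-based pipeline. The names `clean_row` and `parse_csv` and the cleaning rules (trim whitespace, drop incomplete or malformed rows) are assumptions made for illustration, not the project's actual code.

```python
import csv
import io
from typing import Dict, Iterator, Optional


def clean_row(row: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Hypothetical cleaning rule: trim every field, reject incomplete rows."""
    # csv.DictReader stores surplus columns under the key None; treat such
    # rows as malformed and drop them.
    if None in row:
        return None
    # Missing trailing columns arrive as None values, so normalise to "".
    cleaned = {key.strip(): (value or "").strip() for key, value in row.items()}
    if any(value == "" for value in cleaned.values()):
        return None
    return cleaned


def parse_csv(text: str) -> Iterator[Dict[str, str]]:
    """Yield cleaned rows from CSV text whose first line is a header."""
    for row in csv.DictReader(io.StringIO(text)):
        cleaned = clean_row(row)
        if cleaned is not None:
            yield cleaned


if __name__ == "__main__":
    sample = "name, city\n Ada , London \nBob,\n"
    for record in parse_csv(sample):
        print(record)
```

In a sketch like this, each yielded dictionary is the unit that a later step would write to the table named by `--pg_table`; keeping the cleaning logic in a small pure function also makes it straightforward to cover with `pytest`.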