# ETL pipeline
This is the ETL pipeline that reads records from a CSV file, parses and cleans
them, and inserts them into a PostgreSQL database.
It has been tested only with the DirectRunner, but it could easily be moved to
run on Dataflow.
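
Conceptually, the job follows the usual Apache Beam read → transform → write shape. The sketch below is illustrative only: the step labels and the `parse_row` helper are assumptions, and the actual PostgreSQL write step is elided.

```python
# A minimal sketch of the pipeline shape, not the repository's implementation.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_row(line: str) -> list[str]:
    # Hypothetical cleanup: split the CSV line and strip whitespace.
    return [field.strip() for field in line.split(",")]

def run() -> None:
    options = PipelineOptions(runner="DirectRunner")
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read CSV" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
            | "Parse/clean" >> beam.Map(parse_row)
            # The real pipeline would insert the cleaned rows into PostgreSQL here.
            | "Preview" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()
```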
## Running
This is intended to be scheduled by Airflow, but if the necessary packages are
available it can also be run manually with:
```sh
python3 /etl/main.py \
--runner=DirectRunner \
--input="$CSV_INPUT_FILE" \
--pg_hostname="$PG_HOSTNAME" \
--pg_port="$PG_PORT" \
--pg_username="$PG_USERNAME" \
--pg_password="$PG_PASSWORD" \
--pg_database="$PG_DATABASE" \
--pg_table="$PG_TABLE"
```
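When scheduled from Airflow, the same invocation can be wrapped in a task. The DAG below is a hypothetical sketch for Airflow 2.x; the DAG id, the schedule, and the assumption that `CSV_INPUT_FILE` and the `PG_*` variables are present in the task's environment are not part of this repository.

```python
# Hypothetical Airflow 2.x DAG; names and schedule are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_csv_to_postgres",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # `schedule_interval` on Airflow < 2.4
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        # Relies on CSV_INPUT_FILE and the PG_* variables being set in the
        # environment of the Airflow worker that executes this task.
        bash_command=(
            "python3 /etl/main.py "
            "--runner=DirectRunner "
            '--input="$CSV_INPUT_FILE" '
            '--pg_hostname="$PG_HOSTNAME" '
            '--pg_port="$PG_PORT" '
            '--pg_username="$PG_USERNAME" '
            '--pg_password="$PG_PASSWORD" '
            '--pg_database="$PG_DATABASE" '
            '--pg_table="$PG_TABLE"'
        ),
    )
```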
## Testing and linting
To help with development and testing, a `Dockerfile`, a `Makefile`, and a
`justfile` are also provided.
The `Makefile` provides targets to
* generate `dev-requirements.txt` and `requirements.txt` from `pyproject.toml`
* create a Python virtual environment with the right Python version (installed
by pyenv) and the packages defined in `pyproject.toml`
* build an OCI image with the necessary dependencies
The provided `Dockerfile` builds an image with the packages needed to run
`pytest` and `pylint`.
The `justfile` defines the commands to run `pytest` and `pylint` from a
container.
If [`just`](https://github.com/casey/just) is installed, `pytest` and `pylint` can be run like so:
```sh
just test
just lint
```