This is the ETL pipeline that reads records from a CSV file, parses and cleans
them, and inserts them into a PostgreSQL database.
It has been tested only with the DirectRunner, but it should be straightforward
to move it to Dataflow.

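As a rough illustration of the parse/clean step described above, the sketch below reads a small in-memory CSV and normalises each row. The cleaning rule (strip whitespace, turn empty strings into `None`), the `clean_row` name and the sample columns are all assumptions made for the example, not the rules implemented by this pipeline.

```python
import csv
import io


def clean_row(row):
    """Strip surrounding whitespace and map empty strings to None.

    Hypothetical cleaning rule, chosen only for illustration.
    """
    cleaned = {}
    for key, value in row.items():
        value = (value or "").strip()
        cleaned[key.strip()] = value if value else None
    return cleaned


# Made-up sample data; the real CSV schema is not documented here.
sample = io.StringIO("id, name\n1, Alice \n2,\n")
rows = [clean_row(row) for row in csv.DictReader(sample)]
```

Each cleaned dictionary would then be handed to whatever step writes to PostgreSQL.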
## Running

This is intended to be scheduled by Airflow, but if the necessary packages are
available it can also be run manually with:

```sh
python3 /etl/main.py \
    --runner=DirectRunner \
    --input="$CSV_INPUT_FILE" \
    --pg_hostname="$PG_HOSTNAME" \
    --pg_port="$PG_PORT" \
    --pg_username="$PG_USERNAME" \
    --pg_password="$PG_PASSWORD" \
    --pg_database="$PG_DATABASE" \
    --pg_table="$PG_TABLE"
```

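The flags in the command above could be parsed with `argparse` roughly as follows. This is only a sketch: the contents of `main.py` are not shown here, so the `parse_args` name, the default port and the choice of `parse_known_args` to leave runner options such as `--runner` for the pipeline framework are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Split the pipeline's own flags from options meant for the runner.

    Hypothetical helper: the real main.py may be organised differently.
    """
    parser = argparse.ArgumentParser(description="CSV to PostgreSQL ETL")
    parser.add_argument("--input", required=True, help="path to the CSV input file")
    parser.add_argument("--pg_hostname", required=True)
    parser.add_argument("--pg_port", type=int, default=5432)  # assumed default
    parser.add_argument("--pg_username", required=True)
    parser.add_argument("--pg_password", required=True)
    parser.add_argument("--pg_database", required=True)
    parser.add_argument("--pg_table", required=True)
    # parse_known_args returns (namespace, leftover); flags this parser does
    # not define, such as --runner, end up in the leftover list.
    return parser.parse_known_args(argv)


# Example values only; none of them come from a real deployment.
known, runner_args = parse_args([
    "--runner=DirectRunner",
    "--input=/tmp/example.csv",
    "--pg_hostname=localhost",
    "--pg_port=5432",
    "--pg_username=etl",
    "--pg_password=secret",
    "--pg_database=example_db",
    "--pg_table=example_table",
])
```

Here `known` holds the CSV and PostgreSQL settings, while `runner_args` keeps `--runner=DirectRunner` untouched so it can be passed on to the runner.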
## Testing and linting

To help with development and testing, a `Dockerfile`, a `Makefile` and a
`justfile` are also provided.

The `Makefile` provides a mechanism to

* automate the generation of `dev-requirements.txt` and `requirements.txt` from
  `pyproject.toml`
* automate the creation of a Python virtual environment which contains the
  right Python version (installed by pyenv) and the packages defined in `pyproject.toml`
* automate the building of an OCI image with the necessary dependencies

The provided `Dockerfile` is used to build an image with the necessary packages
to run `pytest` and `pylint`.

The `justfile` provides the commands to run `pytest` and `pylint` from
a container.

If [`just`](https://github.com/casey/just) is installed, `pytest` and `pylint` can be run like so:

```sh
just test
just lint
```