This is the ETL pipeline that reads elements from a CSV file, parses and cleans them, and inserts them into a PostgreSQL database. It has been tested only with DirectRunner, but it could easily be moved to run on Dataflow.
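As a rough sketch of the shape such a pipeline takes; the transform names, CSV schema, and the psycopg2-based sink below are assumptions for illustration, not the actual contents of /etl/main.py:

import csv

import apache_beam as beam
import psycopg2
from apache_beam.options.pipeline_options import PipelineOptions

def parse_row(line):
    # Hypothetical clean-up step: parse one CSV line and strip whitespace.
    name, value = next(csv.reader([line]))
    return (name.strip(), value.strip())

class WriteToPostgres(beam.DoFn):
    # Hypothetical sink: one connection per bundle, one INSERT per element.
    def __init__(self, dsn, table):
        self._dsn, self._table = dsn, table

    def start_bundle(self):
        self._conn = psycopg2.connect(self._dsn)

    def process(self, row):
        with self._conn.cursor() as cur:
            cur.execute(
                "INSERT INTO " + self._table + " (name, value) VALUES (%s, %s)",
                row)
        self._conn.commit()

    def finish_bundle(self):
        self._conn.close()

def run(argv=None):
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (p
         | "Read CSV" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
         | "Parse and clean" >> beam.Map(parse_row)
         | "Insert rows" >> beam.ParDo(
             WriteToPostgres("host=localhost dbname=etl user=etl", "items")))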
Running
This is intended to be scheduled by Airflow, but if the necessary packages are available it can also be run manually with:
python3 /etl/main.py \
--runner=DirectRunner \
--input="$CSV_INPUT_FILE" \
--pg_hostname="$PG_HOSTNAME" \
--pg_port="$PG_PORT" \
--pg_username="$PG_USERNAME" \
--pg_password="$PG_PASSWORD" \
--pg_database="$PG_DATABASE" \
--pg_table="$PG_TABLE"
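The --pg_* flags suggest the pipeline declares custom Beam options. A sketch of the usual way to do that follows; the option names match the command above, but the actual parsing code in /etl/main.py is an assumption:

from apache_beam.options.pipeline_options import PipelineOptions

class EtlOptions(PipelineOptions):
    # Custom flags; anything unrecognized here (e.g. --runner) is still
    # handled by Beam's standard options.
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--input", help="Path to the input CSV file")
        parser.add_argument("--pg_hostname", default="localhost")
        parser.add_argument("--pg_port", type=int, default=5432)
        parser.add_argument("--pg_username")
        parser.add_argument("--pg_password")
        parser.add_argument("--pg_database")
        parser.add_argument("--pg_table")

opts = EtlOptions(["--input=data.csv", "--pg_table=items"])
print(opts.pg_hostname, opts.pg_port)  # localhost 5432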
Testing and linting
To help with development and testing, a Dockerfile, a Makefile and a justfile are also provided.
The Makefile provides a mechanism to:
- automate the generation of dev-requirements.txt and requirements.txt out of pyproject.toml
- automate the creation of a Python virtual environment which contains the right Python version (installed by pyenv) and the packages defined in pyproject.toml
- automate the building of an OCI image with the necessary dependencies
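A condensed sketch of what such targets might look like; the target names, versions, and recipes here are guesses (assuming pip-tools for the requirements files) rather than the repository's actual Makefile:

PYTHON_VERSION ?= 3.11
VENV := .venv

# Regenerate the pinned requirements files from pyproject.toml (pip-tools).
requirements.txt dev-requirements.txt: pyproject.toml
	pip-compile --output-file=requirements.txt pyproject.toml
	pip-compile --extra=dev --output-file=dev-requirements.txt pyproject.toml

# Create the virtual environment with the pyenv-managed interpreter.
$(VENV): requirements.txt
	pyenv install --skip-existing $(PYTHON_VERSION)
	pyenv exec python -m venv $(VENV)
	$(VENV)/bin/pip install -r requirements.txt

# Build the OCI image used for testing and linting.
image:
	docker build -t etl-dev:latest .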
The provided Dockerfile is used to build an image with the necessary packages to run pytest and pylint.
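Such a Dockerfile can stay very small; a hypothetical version, where the base image and paths are assumptions:

FROM python:3.11-slim
WORKDIR /app
# dev-requirements.txt is expected to pin pytest and pylint.
COPY dev-requirements.txt .
RUN pip install --no-cache-dir -r dev-requirements.txt
COPY . .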
The justfile provides the commands to run pytest and pylint from a container.
If just is installed, pytest and pylint can be run like so:
just test
just lint
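For reference, the backing recipes could look roughly like this; the image tag and mounted paths are assumptions, not the repository's actual justfile:

image := "etl-dev:latest"

# Run the test suite inside the container.
test:
    docker run --rm -v $(pwd):/app {{image}} pytest

# Run the linter inside the container.
lint:
    docker run --rm -v $(pwd):/app {{image}} pylint etl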