Airflow: writing custom operators and publishing them as a package, Part 1
Welcome to Part 1 of this tutorial series on writing custom Airflow operators and publishing them as a package. In this part of the series we will set up our local environment before we start developing our custom Airflow operator. Airflow has many providers and there are already a lot of operators in the ecosystem, but at times you may need to build your own operator to talk to an external system for which no Airflow operator exists yet. Or maybe you don't want to write a whole new operator, but rather extend the functionality of an existing one and publish it. This article is here to help you on that journey. During my summer internship (2022 at Shipt), I took up the task of creating an operator that could talk to Kafka and Snowflake. This article was born out of that work. Here I summarize my learnings on how I set up my local environment for the task at hand and ultimately published the operator as a package.
Table of Contents
- Poetry to initialize our project.
- Installing Airflow with a docker-compose file.
- Test the Airflow instance.
1. Poetry to initialize our project:
We will start by initializing our project. We will be using Poetry to manage and package our dependencies for this project. You can install Poetry and learn more about the project from its docs. We will be using Python 3.9.13 in the rest of the tutorial. We chose this version of Python because we will work with the Airflow Docker image airflow:2.2.5-python3.9, which is built using Python 3.9. I use VS Code as my code editor; you can choose any editor you prefer to follow along with the rest of the tutorial.
Once you have Poetry installed on your system, you can initialize a new project using the following command.
>poetry new airflow-custom-operator
The above command will create the project scaffolding (some folders and files) for you; Poetry will create a Python virtual environment for the project once we start adding dependencies. The dependencies we have will be managed by the pyproject.toml file.
Here is what you should see once you install poetry and run the above command.
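Roughly, the generated layout looks like this (the exact files, for example README.rst versus README.md, vary slightly between Poetry versions):
airflow-custom-operator
├── pyproject.toml
├── README.rst
├── airflow_custom_operator
│   └── __init__.py
└── tests
    ├── __init__.py
    └── test_airflow_custom_operator.py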
Let's install the apache-airflow package in our virtual environment. To do that, change directory to airflow-custom-operator and run the following:
airflow-custom-operator> poetry add apache-airflow@2.2.5
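Running this pins the dependency in pyproject.toml. Afterwards, the dependencies section of that file should look roughly like the following; the python constraint shown here is an assumption based on the 3.9.13 interpreter we picked and may read differently on your machine:
[tool.poetry.dependencies]
python = "^3.9"
apache-airflow = "2.2.5"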
This will install apache-airflow into the virtual environment and help you develop your code. Make sure your code editor uses the virtual environment you created through Poetry as its interpreter. This will help us a lot while we develop our operator: it helps us avoid syntax errors and gives us the much-needed tab completion for productivity.
In VS Code I was able to update my interpreter path to this virtual environment with the help of the following command.
airflow-custom-operator> poetry show -v
The first line of output from the above command points to the virtual environment path. I set that same path as the interpreter in VS Code.
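For reference, on my machine that first line looked something like the line below; the path here is only a placeholder, and the actual location and hash will be different on your system:
Using virtualenv: /Users/<your-user>/Library/Caches/pypoetry/virtualenvs/airflow-custom-operator-XXXXXXXX-py3.9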
I found that having the above setup greatly reduced syntax errors during the time I was learning to work with the apache-airflow package.
2. Installing Airflow with a docker-compose file:
The Poetry part sets up the package structure for our custom operator; we got that out of the way upfront. Next we will need a working Airflow instance to develop and test the custom Airflow operator that we will build. To set up the Airflow instance we will use Docker. We use Docker since Airflow needs several services to run, and Airflow provides a solid docker-compose file to get started.
Now that we have settled on using Docker to set up our local Airflow environment, create a folder named dev in the airflow-custom-operator folder that Poetry created.
Here is what the folder looks like after the above step.
To learn more about the docker-compose file you can follow the official docs here. In this tutorial, however, we will use a slightly modified version of the docker-compose file. We won't be running the flower and redis services as we won't be using them, and we also disable the example Airflow DAGs.
This is what our docker-compose file looks like:
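The full file is too long to reproduce inline, so below is an abridged sketch of the sections we care about, assuming the official Airflow 2.2.5 docker-compose.yaml as the base; treat the keys shown here as an illustration of the changes described above rather than a complete, drop-in file:
# Abridged sketch based on the official Airflow 2.2.5 docker-compose.yaml.
# Only the parts discussed in this tutorial are shown; the service definitions
# (postgres, airflow-scheduler, airflow-webserver, airflow-init, ...) are omitted.
x-airflow-common:
  &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.2.5-python3.9}
  environment:
    &airflow-common-env
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'   # disable the example DAGs
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"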
You can install additional dependencies into the Airflow containers by updating line 36 (the _PIP_ADDITIONAL_REQUIREMENTS entry) in the docker-compose file mentioned above. The line below illustrates how you can install the tensorflow and scikit-learn packages in your Airflow containers on the fly. The best thing about this option is that you don't have to update the Airflow image.
_PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:- scikit-learn tensorflow}
We will mount the following volumes within the dev folder for easier development.
- ./dags:/opt/airflow/dags (This volume will hold our Airflow DAGS)
- ./logs:/opt/airflow/logs (This volume will hold our Airflow Logs)
- ./plugins:/opt/airflow/plugins (This volume will hold our Airflow plugins: hooks, operators, etc.)
Note:
1. Before running the docker-compose file, please ensure you have given Docker Desktop (or Rancher Desktop) at least 4.5 GB of RAM. This is required for the Airflow UI to start; if you don't allocate the required resources, the Airflow UI won't load. Following are the settings I have made to my setup. You don't need to go overboard like me by allocating Docker 6 GB of RAM.
2. Create a file named .env in the dev folder and include the following line:
AIRFLOW_UID=50000
This is how things should look once you are done.
3. Test the Airflow instance:
Once you have completed the setup, it is time to test things. Run the following command from the dev directory.
airflow-custom-operator/dev> docker-compose up
You should start seeing Docker pull down the required images. It will take around 5–10 minutes for the Airflow instance to come up. docker-compose will create a few folders in the dev folder (dags, logs and plugins). Here is a view of what it will look like:
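At this point, the contents of the dev folder should look roughly like this:
dev
├── .env
├── docker-compose.yaml
├── dags
├── logs
└── plugins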
Once the Airflow UI is up, you can navigate to http://localhost:8080 and check it out.
To log in to Airflow, the default username and password are both airflow:
username: airflow
password: airflow
Once you log in, you should be greeted with the following UI.
Next, let us test our Airflow setup by running a simple DAG. Create your DAG under the airflow-custom-operator/dev/dags folder.
You can copy-paste the following DAG or write your own to test your setup.
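If you need a starting point, here is a minimal example DAG of my own using the stock BashOperator that ships with Airflow 2.2.5; the file name, DAG id and task id (hello_test_dag, say_hello) are placeholders I made up, so feel free to rename them:
# dev/dags/hello_test_dag.py - a minimal DAG to verify the local Airflow setup.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_test_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,  # do not backfill runs between start_date and today
    tags=["test"],
) as dag:
    # A single task that echoes a message from inside the Airflow container.
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from my local Airflow setup!'",
    )
Save the file under dev/dags and the scheduler running in Docker will pick it up from the mounted volume.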
Your DAG should show up in the UI in a minute or two. Once it does, unpause the DAG and trigger it manually to complete your test. The DAG should run successfully, and you should have a working Airflow environment for development and testing.
Thank you all for reading; that's it for this part of the series. We did quite a few things in this article to set up our local environment to start developing Airflow custom operators. We started by installing Poetry and creating a package structure for our Airflow custom operator. Then we set up our local Airflow environment in Docker. Finally, we tested our setup.
In the next part of the series we will use this setup to develop a custom operator and test it within our Airflow environment. We will also talk about linting our code and unit testing our Airflow operator before we publish the operator as a package.
More content to follow. Please clap if the article was helpful and comment if you have any questions. If you want to connect with me, learn and grow with me, or collaborate, you can reach me at any of the following:
Linkedin:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir
:)