Using Airflow
If you already use Apache Airflow for scheduling, you may want to schedule your ingestion recipes with it as well. For Airflow-specific questions, refer to the Airflow docs.
We've provided a few examples of how to configure your DAG:
- mysql_sample_dag embeds the full MySQL ingestion configuration inside the DAG.
- snowflake_sample_dag avoids embedding credentials inside the recipe and instead fetches them from Airflow's Connections feature. You must configure your connections in Airflow to use this approach.
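As a rough illustration of the Connections approach, the sketch below maps an Airflow connection to a dictionary of Snowflake source settings. It is not a copy of snowflake_sample_dag: the helper names, the connection ID `snowflake_admin_default`, the config keys, and the `account`/`warehouse`/`role` extras are assumptions made for illustration, so check them against your DataHub version and your own connection setup.

```python
from types import SimpleNamespace


def snowflake_credentials(conn):
    """Map an Airflow Connection-like object to Snowflake source settings.

    The keys below and the extras they read are illustrative assumptions.
    """
    extras = conn.extra_dejson
    return {
        "username": conn.login,
        "password": conn.password,
        "account_id": extras["account"],
        "warehouse": extras.get("warehouse"),
        "role": extras.get("role"),
    }


def credentials_from_airflow(conn_id="snowflake_admin_default"):
    """Look up the connection in Airflow (hypothetical connection ID)."""
    # Imported lazily so this module can be read without Airflow installed.
    from airflow.hooks.base import BaseHook  # Airflow 2.x import path

    return snowflake_credentials(BaseHook.get_connection(conn_id))


# Stand-in for a real Connection, so the mapping can be tried outside Airflow.
example_conn = SimpleNamespace(
    login="ingest_user",
    password="not-a-real-password",
    extra_dejson={"account": "my_account", "warehouse": "COMPUTE_WH"},
)
example_credentials = snowflake_credentials(example_conn)
```

The resulting dictionary can be passed to the ingestion task as an argument, so the secrets stay in Airflow's connection store rather than in the DAG file.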
These example DAGs use the PythonVirtualenvOperator to run the ingestion. This is the recommended approach, since it installs DataHub into its own virtualenv and so avoids dependency conflicts between DataHub and the rest of your Airflow environment.
When configuring the task, it's important to list the DataHub package for your source in requirements and to set the system_site_packages option to False.
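    # Airflow 2.x import path; adjust if your Airflow version differs.
    from airflow.operators.python import PythonVirtualenvOperator

    ingestion_task = PythonVirtualenvOperator(
        task_id="ingestion_task",
        # Install DataHub with the plugin for your source into the virtualenv.
        requirements=[
            "acryl-datahub[<your-source>]",
        ],
        # Keep the virtualenv isolated from the Airflow environment's packages.
        system_site_packages=False,
        python_callable=your_callable,
    )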
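One way the callable and an embedded recipe might look is sketched below. The function names `mysql_recipe` and `run_ingestion` are hypothetical, and every host, username, and password is a placeholder. The DataHub import sits inside the callable because the function body runs in the virtualenv, where DataHub is installed, rather than in the interpreter that parses the DAG.

```python
def mysql_recipe(host_port, username, password, datahub_server):
    """Build a recipe dictionary in the DAG file; all arguments are placeholders."""
    return {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": host_port,
                "username": username,
                "password": password,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": datahub_server},
        },
    }


def run_ingestion(recipe):
    """Callable for the virtualenv task: run one ingestion pipeline."""
    # Imported here because DataHub only exists inside the task's virtualenv.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    # Fail the Airflow task if the ingestion reported errors.
    pipeline.raise_from_status()


# Placeholder values for illustration only.
example_recipe = mysql_recipe(
    host_port="localhost:3306",
    username="user",
    password="pass",
    datahub_server="http://localhost:8080",
)
```

With this shape, the task would set `python_callable=run_ingestion` and hand over the recipe through `op_kwargs={"recipe": example_recipe}`, keeping the callable free of references to names defined elsewhere in the DAG file.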
Advanced: loading a recipe file
In more advanced cases, you might want to store your ingestion recipe in a file and load it from your task.
- Ensure the recipe file is in a folder accessible to your Airflow workers. You can specify either an absolute path on the machines where Airflow is installed or a path relative to AIRFLOW_HOME.
- Ensure the DataHub CLI is installed in your Airflow environment.
- Create a DAG task to read your DataHub ingestion recipe file and run it. See the example below for reference.
- Deploy the DAG file into Airflow for scheduling. Typically this involves checking the DAG file into the dags folder that your Airflow instance reads from.
Example: generic_recipe_sample_dag
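The steps above can be sketched roughly as follows. This is not the text of generic_recipe_sample_dag: the helper names are hypothetical, and resolving relative paths against the AIRFLOW_HOME environment variable (falling back to Airflow's default of `~/airflow`) is an assumption made explicit here rather than behavior taken from the example DAG.

```python
import os


def resolve_recipe_path(recipe_path):
    """Return an absolute path, treating relative paths as relative to AIRFLOW_HOME."""
    if os.path.isabs(recipe_path):
        return recipe_path
    airflow_home = os.environ.get("AIRFLOW_HOME", os.path.expanduser("~/airflow"))
    return os.path.join(airflow_home, recipe_path)


def run_recipe_file(recipe_path):
    """Task callable: load a recipe file and run it as an ingestion pipeline."""
    # Deferred imports: DataHub is only needed when the task actually runs.
    from datahub.configuration.config_loader import load_config_file
    from datahub.ingestion.run.pipeline import Pipeline

    config = load_config_file(resolve_recipe_path(recipe_path))
    pipeline = Pipeline.create(config)
    pipeline.run()
    # Fail the Airflow task if the ingestion reported errors.
    pipeline.raise_from_status()
```

Because this variant relies on DataHub being installed in the Airflow environment itself, `run_recipe_file` could be wired into a regular PythonOperator with the recipe path passed through `op_kwargs`.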