Ingestion
Ingestion
Introduction
Starting in version 0.8.25
, DataHub supports creating, configuring, scheduling, & executing batch metadata ingestion using the DataHub user interface. This makes
getting metadata into DataHub easier by minimizing the overhead required to operate custom integration pipelines.
This document will describe the steps required to configure, schedule, and execute metadata ingestion inside the UI.
Running Metadata Ingestion
Prerequisites
To view & manage UI-based metadata ingestion, you must have the Manage Metadata Ingestion
& Manage Secrets
privileges assigned to your account. These can be granted by a Platform Policy.
Once you have these privileges, you can begin to manage ingestion by navigating to the 'Ingestion' tab in DataHub.
On this page, you'll see a list of active Ingestion Sources. An Ingestion Sources is a unique source of metadata ingested into DataHub from an external source like Snowflake, Redshift, or BigQuery.
If you're just getting started, you won't have any sources. In the following sections, we'll describe how to create your first Ingestion Source.
Creating an Ingestion Source
- UI
- CLI
- GraphQL
Before ingesting any metadata, you need to create a new Ingestion Source. Start by clicking + Create new source.
Step 1: Select a Platform Template
In the first step, select a Recipe Template corresponding to the source type that you'd like to extract metadata from. Choose among
a variety of natively supported integrations, from Snowflake to Postgres to Kafka.
Select Custom
to construct an ingestion recipe from scratch.
Next, you'll configure an ingestion Recipe, which defines how and what to extract from the source system.
Step 2: Configure a Recipe
Next, you'll define an ingestion Recipe in YAML. A Recipe is a set of configurations which is used by DataHub to extract metadata from a 3rd party system. It most often consists of the following parts:
- A source type: The type of system you'd like to extract metadata from (e.g. snowflake, mysql, postgres). If you've chosen a native template, this will already be populated for you. To view a full list of currently supported types, check out this list.
- A source config: A set of configurations specific to the source type. Most sources support the following types of configuration values:
- Coordinates: The location of the system you want to extract metadata from
- Credentials: Authorized credentials for accessing the system you want to extract metadata from
- Customizations: Customizations regarding the metadata that will be extracted, e.g. which databases or tables to scan in a relational DB
- A sink type: A type of sink to route the metadata extracted from the source type. The officially supported DataHub sink
types are
datahub-rest
anddatahub-kafka
.
- A sink config: Configuration required to send metadata to the provided sink type. For example, DataHub coordinates and credentials.
A sample of a full recipe configured to ingest metadata from MySQL can be found in the image below.
Detailed configuration examples & documentation for each source type can be found on the DataHub Docs website.
Creating a Secret
For production use cases, sensitive configuration values, such as database usernames and passwords, should be hidden from plain view within your ingestion recipe. To accomplish this, you can create & embed Secrets. Secrets are named values that are encrypted and stored within DataHub's storage layer.
To create a secret, first navigate to the 'Secrets' tab. Then click + Create new secret
.
Creating a Secret to store the username for a MySQL database
Inside the form, provide a unique name for the secret along with the value to be encrypted, and an optional description. Click Create when you are done. This will create a Secret which can be referenced inside your ingestion recipe using its name.
Referencing a Secret
Once a Secret has been created, it can be referenced from within your Recipe using variable substitution. For example, to substitute secrets for a MySQL username and password into a Recipe, your Recipe would be defined as follows:
source:
type: mysql
config:
host_port: 'localhost:3306'
database: my_db
username: ${MYSQL_USERNAME}
password: ${MYSQL_PASSWORD}
include_tables: true
include_views: true
profiling:
enabled: true
sink:
type: datahub-rest
config:
server: 'http://datahub-gms:8080'
Referencing DataHub Secrets from a Recipe definition
When the Ingestion Source with this Recipe executes, DataHub will attempt to 'resolve' Secrets found within the YAML. If a secret can be resolved, the reference is substituted for its decrypted value prior to execution. Secret values are not persisted to disk beyond execution time, and are never transmitted outside DataHub.
Attention: Any DataHub users who have been granted the
Manage Secrets
Platform Privilege will be able to retrieve plaintext secret values using the GraphQL API.
Step 3: Schedule Execution
Next, you can optionally configure a schedule on which to execute your new Ingestion Source. This enables to schedule metadata extraction on a monthly, weekly, daily, or hourly cadence depending on the needs of your organization. Schedules are defined using CRON format.
An Ingestion Source that is executed at 9:15am every day, Los Angeles time
To learn more about the CRON scheduling format, check out the Wikipedia overview.
If you plan to execute ingestion on an ad-hoc basis, you can click Skip to skip the scheduling step entirely. Don't worry - you can always come back and change this.
Step 4: Finishing Up
Finally, give your Ingestion Source a name.
Once you're happy with your configurations, click 'Done' to save your changes.
Advanced ingestion configs:
DataHub's Managed Ingestion UI comes pre-configured to use the latest version of the DataHub CLI (acryl-datahub) that is compatible with the server. However, you can override the default package version using the 'Advanced' source configurations.
To do so, simply click 'Advanced', then change the 'CLI Version' text box to contain the exact version of the DataHub CLI you'd like to use.
Pinning the CLI version to version 0.8.23.2
Other advanced options include specifying environment variables, DataHub plugins or python packages at runtime.
Once you're happy with your changes, simply click 'Done' to save.
You can upload and even update recipes using the cli as mentioned in the cli documentation for uploading ingestion recipes.
An example execution for a given recipe.yaml
file, would look something like:
datahub ingest deploy --name "My Test Ingestion Source" --schedule "5 * * * *" --time-zone "UTC" -c recipe.yaml
This would create a new recipe with the name My Test Ingestion Source
. Note that to update an existing recipe, it's urn
id must be passed as a parameter.
DataHub supports having multiple recipes with the same name so to distinguish them we use the urn for unique identification.
Create ingestion sources using DataHub's GraphQL API using the createIngestionSource mutation endpoint.
mutation {
createIngestionSource(input: {
name: "My Test Ingestion Source",
type: "mysql",
description: "My ingestion source description",
schedule: {interval: "*/5 * * * *", timezone: "UTC"},
config: {
recipe: "{\"source\":{\"type\":\"mysql\",\"config\":{\"include_tables\":true,\"database\":null,\"password\":\"${MYSQL_PASSWORD}\",\"profiling\":{\"enabled\":false},\"host_port\":null,\"include_views\":true,\"username\":\"${MYSQL_USERNAME}\"}},\"pipeline_name\":\"urn:li:dataHubIngestionSource:f38bd060-4ea8-459c-8f24-a773286a2927\"}",
version: "0.8.18",
executorId: "mytestexecutor",
}
})
}
To update sources, please use the updateIngestionSource
endpoint. It is almost identical to the create endpoint, only requiring the urn of the source to be updated in addition to the same input as the create endpoint.
Note: Recipe must be double quotes escaped
Running an Ingestion Source
Once you've created your Ingestion Source, you can run it by clicking 'Execute'. Shortly after,
you should see the 'Last Status' column of the ingestion source change from N/A
to Running
. This
means that the request to execute ingestion has been successfully picked up by the DataHub ingestion executor.
If ingestion has executed successfully, you should see it's state shown in green as Succeeded
.
Cancelling an Ingestion Run
If your ingestion run is hanging, there may a bug in the ingestion source, or another persistent issue like exponential timeouts. If these situations, you can cancel ingestion by clicking Cancel on the problematic run.
Once cancelled, you can view the output of the ingestion run by clicking Details.
Debugging a Failed Ingestion Run
A variety of things can cause an ingestion run to fail. Common reasons for failure include:
- Recipe Misconfiguration: A recipe has not provided the required or expected configurations for the ingestion source. You can refer to the Metadata Ingestion Framework source docs to learn more about the configurations required for your source type.
- Failure to resolve Secrets: If DataHub is unable to find secrets that were referenced by your Recipe configuration, the ingestion run will fail. Verify that the names of the secrets referenced in your recipe match those which have been created.
- Connectivity / Network Reachability: If DataHub is unable to reach a data source, for example due to DNS resolution failures, metadata ingestion will fail. Ensure that the network where DataHub is deployed has access to the data source which you are trying to reach.
- Authentication: If you've enabled Metadata Service Authentication, you'll need to provide a Personal Access Token in your Recipe Configuration. To so this, set the 'token' field of the sink configuration to contain a Personal Access Token:
The output of each run is captured and available to view in the UI for easier debugging. To view output logs, click DETAILS on the corresponding ingestion run.
FAQ
I tried to ingest metadata after running 'datahub docker quickstart', but ingestion is failing with 'Failed to Connect' errors. What do I do?
If not due to one of the reasons outlined above, this may be because the executor running ingestion is unable
to reach DataHub's backend using the default configurations. Try changing your ingestion recipe to make the sink.config.server
variable point to the Docker
DNS name for the datahub-gms
pod:
I see 'N/A' when I try to run ingestion. What do I do?
If you see 'N/A', and the ingestion run state never changes to 'Running', this may mean
that your executor (datahub-actions
) container is down.
This container is responsible for executing requests to run ingestion when they come in, either
on demand on a particular schedule. You can verify the health of the container using docker ps
. Moreover, you can inspect the container logs using by finding the container id
for the datahub-actions
container and running docker logs <container-id>
.
When should I NOT use UI Ingestion?
There are valid cases for ingesting metadata without the UI-based ingestion scheduler. For example,
- You have written a custom ingestion Source
- Your data sources are not reachable on the network where DataHub is deployed. DataHub Cloud users can use a remote executor for remote UI-based ingestion.
- Your ingestion source requires context from a local filesystem (e.g. input files)
- You want to distribute metadata ingestion among multiple producers / environments
How do I attach policies to the actions pod to give it permissions to pull metadata from various sources?
This varies across the underlying platform. For AWS, please refer to this guide.
Demo
Click here to see a full demo of the UI Ingestion feature.
Feedback / Questions / Concerns
We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on Slack!