Crawl Data Sources

Crawling is the process of discovering and cataloging data assets from a connected data source. Once a connection is established, you can trigger a crawl job to scan the data source and identify all available datasets, such as databases, schemas, tables, or files.

The discovered assets are displayed on the Discover Assets page in ADOC for further exploration and analysis.

For supported data sources, you can configure include and exclude metadata filters to control which databases, schemas, and tables the crawler ingests. Filters are defined during data source setup on the Observability Setup page and apply to all subsequent crawls for that data source.

Types of Crawling

There are two ways to crawl a data source:

Full Crawling: Scans all assets within a data source.
Selective Crawling: Lets you manually crawl specific assets within an existing data source.

Both crawl types respect any include and exclude metadata filters configured on the data source. If filters are configured, only the assets that pass the filter evaluation are crawled and ingested.

Full Crawling

A Full Crawl performs a comprehensive scan of all available assets in a connected data source. This is typically the first crawl performed after connecting the source. A full crawl collects metadata for:

All databases or containers
Schemas and folders
Tables, views, or files

This ensures that the entire data source is registered and visible in Discover Assets in ADOC.

Note During the initial connection, there is no option to perform selective crawling. The full crawl must be completed first.

Selective Crawling

After a full crawl is complete, you can choose to crawl specific assets within a data source whenever needed. This helps reduce unnecessary scans, save time, and keep metadata up to date for assets that change frequently.

Selective Crawling is supported for the following data sources:

ADLS (Azure Data Lake Storage)
Snowflake
Databricks
BigQuery
GCS (Google Cloud Storage)
Amazon S3

When to Use Selective Crawling

Use selective crawling when:

You only need to update metadata for a specific asset.
You have recently added or modified a single database, schema, or table.
You want to avoid re-crawling large sources with minimal changes.
You are troubleshooting data quality or lineage for a particular dataset.

Example

If you have already performed a full crawl of your Snowflake source, you can later choose to crawl only one specific database instead of the entire connection. This means:

You do not have to re-scan all schemas and tables.
Only the selected database's metadata is refreshed.
The updated information appears in the Discover Assets view.

Metadata Filters

Metadata filters let you define which assets the crawler includes or excludes during ingestion, using regex patterns. Instead of crawling all assets in a selected database, you can scope the crawl precisely to the schemas and tables your team actually needs.

Filters are configured on the Observability Setup page when registering or editing a data source. Once saved, the configured filters apply to all subsequent crawls — both full and selective — for that data source.

Supported Data Sources

Metadata filters are available for the following data source types:

Oracle
Snowflake
Databricks
MS SQL
Redshift
Trino
SAP HANA

Configure Metadata Filters

Metadata filters are configured during data source registration or when editing an existing data source.

Navigate to Control Center -> Integrations.
Locate the data source and open it for editing, or proceed through the registration wizard to the Observability Setup step.
Under Data Reliability, select one or more databases to crawl from the Databases field.
In the Include Assets Filter (Regex) field, enter one or more regex patterns for the assets you want the crawler to include. Select the + button to add additional patterns.
In the Exclude Assets Filter (Regex) field, enter one or more regex patterns for the assets you want the crawler to skip. Select the + button to add additional patterns.
Select Submit to save the configuration.

Note Filter configuration is saved as part of the data source settings. Filters take effect only when the next crawl is triggered. Updating filter settings without running a crawl does not change the currently cataloged assets.

Note Regex patterns are validated at crawl time, not when the data source is saved. If a pattern is incorrect or does not match any assets, the crawler returns an empty result set for that filter scope.
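Because patterns are not validated when the data source is saved, it can help to check them locally before submitting. The following is a minimal sketch in Python, not part of ADOC. It assumes Python's `re` dialect is close enough to the crawler's regex engine for a syntax check, which this page does not confirm. The helper name `find_invalid_patterns` and the sample patterns are hypothetical.

```python
import re

def find_invalid_patterns(patterns):
    """Return (pattern, error message) pairs for patterns that fail to compile."""
    problems = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append((pattern, str(exc)))
    return problems

# Hypothetical patterns to check before entering them in the filter fields.
candidate_patterns = [
    r"MY_DB\.hr\..*",       # well-formed
    r"MY_DB\.(hr|finance",  # missing closing parenthesis
]
invalid = find_invalid_patterns(candidate_patterns)

for pattern, message in invalid:
    print(f"Invalid pattern {pattern!r}: {message}")
```

A pattern that compiles here is only syntactically valid in Python. It may still match no assets, so the result of the next crawl remains the real check.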

Fully Qualified Asset Name (FQN) Format

Regex patterns are matched against the fully qualified asset name (FQN) of each asset. The FQN format is:

DATABASE.SCHEMA.TABLE_NAME

FQN matching is case-sensitive. Use the number of dot separators in your pattern to target the appropriate level of the hierarchy:

Dot separators in pattern	Level targeted	Example
None (no dot)	Database only	`MY_DB.*`
One (one dot)	Database and schema	`MY_DB\.sales.*`
Two (two dots)	Database, schema, and table	`MY_DB\.sales\.orders.*`
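The three levels in the table can be tried out against sample names. This is a minimal sketch under two assumptions that this page does not state: that a pattern must match the whole FQN (Python's `re.fullmatch`), and that Python's regex dialect behaves like the crawler's. The asset names and the `matching` helper are hypothetical.

```python
import re

# Hypothetical FQNs in DATABASE.SCHEMA.TABLE_NAME form.
FQNS = [
    "MY_DB.sales.orders",
    "MY_DB.sales.order_items",
    "MY_DB.salesforce.accounts",
    "MY_DB.hr.employees",
    "OTHER_DB.sales.orders",
]

def matching(pattern, fqns):
    """Return the FQNs that the pattern matches in full (assumed semantics)."""
    compiled = re.compile(pattern)
    return [fqn for fqn in fqns if compiled.fullmatch(fqn)]

database_level = matching(r"MY_DB.*", FQNS)               # every MY_DB asset
schema_level = matching(r"MY_DB\.sales.*", FQNS)          # schemas starting with "sales"
table_level = matching(r"MY_DB\.sales\.orders.*", FQNS)   # tables starting with "orders"

# Case-sensitive matching: a lowercase database name does not match.
lowercase_hit = matching(r"MY_DB.*", ["my_db.sales.orders"])
```

Under these assumptions, `MY_DB\.sales.*` also picks up `MY_DB.salesforce.accounts`, because nothing after `sales` requires a dot. Ending the schema segment with an escaped dot, as in `MY_DB\.sales\..*`, restricts the match to the `sales` schema alone.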

Include Assets Filter

The Include Assets Filter defines which assets the crawler retains after the exclude filter has been applied. If this field is left empty, the crawler includes all remaining assets in the selected databases (subject to any exclude patterns).

Multiple include patterns can be added. An asset is included if its FQN matches any one of the configured patterns (logical OR).

Exclude Assets Filter

The Exclude Assets Filter defines which assets the crawler removes from the crawl scope before the include filter is applied. If this field is left empty, no assets are excluded.

Multiple exclude patterns can be added. An asset is excluded if its FQN matches any one of the configured patterns (logical OR).

Evaluation Order

When both filters are configured, the crawler applies them in the following order:

The Exclude Assets Filter is applied first. Any asset whose FQN matches at least one exclude pattern is removed from the crawl scope.
The Include Assets Filter is applied to the remaining assets. Only assets whose FQN matches at least one include pattern are retained.

Important Exclude is always applied before Include. If an asset matches both an exclude and an include pattern, it is excluded.
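The evaluation order can be sketched as a small function. This is an illustration of the rules described above, not the crawler's actual implementation. It assumes full-string matching with Python's `re` module, and the function name `filter_assets` and the asset names are hypothetical.

```python
import re

def filter_assets(fqns, include_patterns=(), exclude_patterns=()):
    """Apply exclude patterns first, then include patterns, to a list of FQNs."""
    includes = [re.compile(p) for p in include_patterns]
    excludes = [re.compile(p) for p in exclude_patterns]

    # Step 1: drop any asset that matches at least one exclude pattern.
    remaining = [f for f in fqns if not any(p.fullmatch(f) for p in excludes)]

    # Step 2: with no include patterns, every remaining asset is kept.
    if not includes:
        return remaining

    # Otherwise keep only assets that match at least one include pattern.
    return [f for f in remaining if any(p.fullmatch(f) for p in includes)]

# Hypothetical assets in a selected database.
assets = [
    "MY_DB.hr.employees",
    "MY_DB.hr.tmp_import",
    "MY_DB.staging.orders",
    "MY_DB.finance.ledger",
]

kept = filter_assets(
    assets,
    include_patterns=[r"MY_DB\.hr\..*"],
    exclude_patterns=[r".*\..*\.tmp_.*"],
)
print(kept)
```

In this sketch, `MY_DB.hr.tmp_import` matches both the include and the exclude pattern and is dropped, which mirrors the rule that exclude takes precedence.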

Regex Pattern Rules

The following rules apply when writing regex patterns for metadata filters:

Rule	Description	Example
`\.`	Use \. to match a literal dot between database, schema, and table name segments.	`MY_DB\.hr\..*`
`.*`	Use .* to match any sequence of characters (zero or more).	`.*\.hr\..*`
`(a|b)`	Use | inside parentheses to match multiple options (logical OR within a single pattern).	`MY_DB\.(hr|finance)\..*`
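The effect of escaping the dot, and of alternation inside parentheses, can be checked with a short script. This is a sketch using Python's `re` module with full-string matching, both of which are assumptions about how the crawler evaluates patterns. The asset names are hypothetical.

```python
import re

# An unescaped dot matches any single character, not only a literal dot.
loose = re.compile(r"MY_DB.hr\..*")
strict = re.compile(r"MY_DB\.hr\..*")

# Hypothetical asset in a different database whose name resembles MY_DB.hr.
lookalike = "MY_DB_hr.public.events"
loose_matches_lookalike = bool(loose.fullmatch(lookalike))
strict_matches_lookalike = bool(strict.fullmatch(lookalike))

# Alternation inside parentheses selects either schema in a single pattern.
either = re.compile(r"MY_DB\.(hr|finance)\..*")
candidates = ["MY_DB.hr.employees", "MY_DB.finance.ledger", "MY_DB.sales.orders"]
alternation_hits = [name for name in candidates if either.fullmatch(name)]

print(loose_matches_lookalike, strict_matches_lookalike, alternation_hits)
```

The loose pattern matches the lookalike database because its first dot accepts the underscore, which is why escaping each separator is worth the extra characters.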

Include Filter Examples

Intent	Pattern
All tables in the hr schema of MY_DB	`MY_DB\.hr\..*`
Tables starting with emp in MY_DB.hr	`MY_DB\.hr\.emp.*`
A specific table: MY_DB.hr.employees	`MY_DB\.hr\.employees`
All tables in the hr schema across any database	`.*\.hr\..*`
Tables starting with report_ in any schema	`.*\..*\.report_.*`
Two schemas in MY_DB: hr and finance	`MY_DB\.(hr|finance)\..*`

Exclude Filter Examples

Intent	Pattern
All staging schemas	`.*\.staging\..*`
All tables prefixed with tmp_	`.*\..*\.tmp_.*`
A specific schema in MY_DB	`MY_DB\.test_data\..*`
Multiple schemas in MY_DB: staging, test, and dev	`MY_DB\.(staging|test|dev)\..*`

Combined Include and Exclude Examples

Intent	Include	Exclude
All of MY_DB except the staging schema	`MY_DB\..*`	`MY_DB\.staging\..*`
Only hr tables, but not temporary ones	`MY_DB\.hr\..*`	`.*\..*\.tmp_.*`

Filter Behavior Reference

Configuration	Result
Database selected; no include or exclude filters configured	The crawler ingests all assets in the selected database. This is the default behavior and is unchanged for existing data sources.
Include filter configured; exclude filter empty	Only assets whose FQN matches at least one include pattern are ingested. If no assets match, the crawl returns an empty result set.
Exclude filter configured; include filter empty	All assets are ingested except those whose FQN matches at least one exclude pattern. If all assets match the exclude filter, the crawl returns an empty result set.
Both include and exclude filters configured	Exclude is applied first. Include is applied to the remaining assets. An asset matching both filters is excluded.
A filter pattern is configured but matches no assets	The crawl returns an empty result set for that filter scope. No assets are ingested.

How to Start Crawling

Follow these steps to start crawling a data source:

Navigate to Control Center from the left navigation menu and select Integrations.
From the Connectors page, locate your data source card.
Click the vertical ellipsis icon on the data source card.
Select Start Crawler.
In the pop-up window, choose one of the following options:
1. Full Crawl: Perform a comprehensive scan of all assets.
2. Selective Crawl: (Available after the first full crawl) Crawl only selected assets. This option is available for ADLS, Snowflake, Databricks, BigQuery, GCS, and Amazon S3.
Click Start Crawl to begin.

Note You can perform selective crawling multiple times for different assets within the same data source, whenever needed.

Note During a selective crawl, any databases that have not been crawled before are automatically included to ensure all newly discovered databases are registered in the system.

Benefits of Selective Crawling

Efficiency: Refresh only what's needed instead of re-crawling everything.
Faster updates: Quickly reflect recent changes in key datasets.
Flexibility: Choose when and what to crawl based on business needs.
Accuracy: Keep critical asset metadata up to date without impacting performance.
Precision: Use metadata filters to scope crawls to exactly the schemas and tables your team requires, reducing noise from irrelevant or temporary assets.
