Crawl Data Sources

Crawling is the process of discovering and cataloging data assets from a connected data source. Once a connection is established, you can trigger a crawl job to scan the data source and identify all available datasets, such as databases, schemas, tables, or files.

The discovered assets are displayed in the Discover Assets page in ADOC for further exploration and analysis.

For supported data sources, you can configure include and exclude metadata filters to control which databases, schemas, and tables the crawler ingests. Filters are defined during data source setup on the Observability Setup page and apply to all subsequent crawls for that data source.

Types of Crawling

There are two ways to crawl a data source:

  • Full Crawling: Scans all assets within a data source.
  • Selective Crawling: Lets you manually crawl specific assets within an existing data source.

Both crawl types respect any include and exclude metadata filters configured on the data source. If filters are configured, only the assets that pass the filter evaluation are crawled and ingested.

Full Crawling

A Full Crawl performs a comprehensive scan of all available assets in a connected data source. This is typically the first crawl performed after connecting the source. A full crawl collects metadata for:

  • All databases or containers
  • Schemas and folders
  • Tables, views, or files

This ensures that the entire data source is registered and visible in Discover Assets in ADOC.

Note: During the initial connection, there is no option to perform selective crawling. The full crawl must be completed first.

Selective Crawling

After a full crawl is complete, you can choose to crawl specific assets within a data source whenever needed. This helps reduce unnecessary scans, save time, and keep metadata up to date for assets that change frequently.

Selective Crawling is supported for the following data sources:

  • ADLS (Azure Data Lake Storage)
  • Snowflake
  • Databricks
  • BigQuery
  • GCS (Google Cloud Storage)
  • Amazon S3

When to Use Selective Crawling

Use selective crawling when:

  • You only need to update metadata for a specific asset.
  • You have recently added or modified a single database, schema, or table.
  • You want to avoid re-crawling large sources with minimal changes.
  • You are troubleshooting data quality or lineage for a particular dataset.

Example

If you have already performed a full crawl of your Snowflake source, you can later choose to crawl only one specific database instead of the entire connection. This means:

  • You do not have to re-scan all schemas and tables.
  • Only the selected database's metadata is refreshed.
  • The updated information appears in the Discover Assets view.

Metadata Filters

Metadata filters let you define which assets the crawler includes or excludes during ingestion, using regex patterns. Instead of crawling all assets in a selected database, you can scope the crawl precisely to the schemas and tables your team actually needs.

Filters are configured on the Observability Setup page when registering or editing a data source. Once saved, the configured filters apply to all subsequent crawls — both full and selective — for that data source.

Supported Data Sources

Metadata filters are available for the following data source types:

  • Oracle
  • Snowflake
  • Databricks
  • MS SQL
  • Redshift
  • Trino
  • SAP HANA

Configure Metadata Filters

Metadata filters are configured during data source registration or when editing an existing data source.

  1. Navigate to Control Center -> Integrations.
  2. Locate the data source and open it for editing, or proceed through the registration wizard to the Observability Setup step.
  3. Under Data Reliability, select one or more databases to crawl from the Databases field.
  4. In the Include Assets Filter (Regex) field, enter one or more regex patterns for the assets you want the crawler to include. Select the + button to add additional patterns.
  5. In the Exclude Assets Filter (Regex) field, enter one or more regex patterns for the assets you want the crawler to skip. Select the + button to add additional patterns.
  6. Select Submit to save the configuration.

Note: Filter configuration is saved as part of the data source settings. Filters take effect only when the next crawl is triggered. Updating filter settings without running a crawl does not change the currently cataloged assets.

Note: Regex patterns are validated at crawl time, not when the data source is saved. If a pattern is incorrect or does not match any assets, the crawler returns an empty result set for that filter scope.
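
Because patterns are only validated at crawl time, it can help to check them locally before saving the data source. The following is a minimal sketch, not part of ADOC: the helper name find_invalid_patterns is hypothetical, and it assumes that Python's re syntax is close enough to the crawler's regex engine for a syntax check, which the source does not confirm.

```python
import re

def find_invalid_patterns(patterns):
    """Return (pattern, error message) pairs for patterns that fail to compile."""
    problems = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append((pattern, str(exc)))
    return problems

# The second pattern has an unclosed group and is reported as invalid.
candidate_patterns = [r"MY_DB\.hr\..*", r"MY_DB\.(hr|finance\..*"]
for pattern, message in find_invalid_patterns(candidate_patterns):
    print(f"invalid pattern {pattern!r}: {message}")
```

A check like this only catches syntax errors. A pattern that compiles but matches no assets still produces an empty result set at crawl time.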

Fully Qualified Asset Name (FQN) Format

Regex patterns are matched against the fully qualified asset name (FQN) of each asset. The FQN format is:

DATABASE.SCHEMA.TABLE_NAME

FQN matching is case-sensitive. Use the number of escaped dot separators (\.) in your pattern to target the appropriate level of the hierarchy:

Dot separators in pattern   Level targeted                Example
None (no dot)               Database only                 MY_DB.*
One (one dot)               Database and schema           MY_DB\.sales.*
Two (two dots)              Database, schema, and table   MY_DB\.sales\.orders.*
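
To avoid hand-escaping mistakes, patterns for each level can be generated from literal names. This is a sketch under stated assumptions: fqn_pattern is a hypothetical helper, not an ADOC function, and it assumes the crawler matches a pattern against the whole FQN (modeled here with re.fullmatch), which the source does not state. The generated patterns are stricter than the table's examples: MY_DB\..* only covers assets under the database named exactly MY_DB, while MY_DB.* would also match a database such as MY_DB2.

```python
import re

def fqn_pattern(database, schema=None, table=None):
    """Build a regex that targets a database, a schema, or one table.

    Literal segments are escaped with re.escape and joined with an escaped
    dot. For database and schema levels, a trailing .* covers everything
    below the targeted level.
    """
    segments = [database]
    if schema is not None:
        segments.append(schema)
    if table is not None:
        if schema is None:
            raise ValueError("a table requires a schema")
        segments.append(table)
    prefix = r"\.".join(re.escape(segment) for segment in segments)
    if table is not None:
        return prefix          # exact table, nothing below it
    return prefix + r"\..*"    # everything under the database or schema

print(fqn_pattern("MY_DB"))                     # MY_DB\..*
print(fqn_pattern("MY_DB", "sales"))            # MY_DB\.sales\..*
print(fqn_pattern("MY_DB", "sales", "orders"))  # MY_DB\.sales\.orders

# Matching is case-sensitive, so a lowercase FQN does not match.
print(re.fullmatch(fqn_pattern("MY_DB", "sales"), "my_db.sales.orders"))  # None
```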

Include Assets Filter

The Include Assets Filter defines which assets the crawler retains after the exclude filter has been applied. If this field is left empty, the crawler includes all remaining assets in the selected databases (subject to any exclude patterns).

Multiple include patterns can be added. An asset is included if its FQN matches any one of the configured patterns (logical OR).

Exclude Assets Filter

The Exclude Assets Filter defines which assets the crawler removes from the crawl scope before the include filter is applied. If this field is left empty, no assets are excluded.

Multiple exclude patterns can be added. An asset is excluded if its FQN matches any one of the configured patterns (logical OR).

Evaluation Order

When both filters are configured, the crawler applies them in the following order:

  1. The Exclude Assets Filter is applied first. Any asset whose FQN matches at least one exclude pattern is removed from the crawl scope.
  2. The Include Assets Filter is applied to the remaining assets. Only assets whose FQN matches at least one include pattern are retained.

Important: Exclude is always applied before Include. If an asset matches both an exclude and an include pattern, it is excluded.
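
The evaluation order can be modeled in a few lines. This is an illustrative sketch, not the crawler's implementation: apply_filters and the sample asset names are hypothetical, and whole-FQN matching (re.fullmatch) is an assumption, since the source does not say whether partial matches count.

```python
import re

def apply_filters(fqns, include_patterns=(), exclude_patterns=()):
    """Exclude first, then include. An empty include list keeps what is left."""
    include = [re.compile(p) for p in include_patterns]
    exclude = [re.compile(p) for p in exclude_patterns]

    # Step 1: drop any asset that matches at least one exclude pattern.
    remaining = [f for f in fqns if not any(p.fullmatch(f) for p in exclude)]
    # Step 2: with no include patterns, everything that remains is kept.
    if not include:
        return remaining
    # Otherwise keep only assets that match at least one include pattern.
    return [f for f in remaining if any(p.fullmatch(f) for p in include)]

assets = [
    "MY_DB.hr.employees",
    "MY_DB.hr.tmp_import",
    "MY_DB.staging.orders_raw",
]
kept = apply_filters(
    assets,
    include_patterns=[r"MY_DB\.hr\..*"],
    exclude_patterns=[r".*\..*\.tmp_.*"],
)
print(kept)  # ['MY_DB.hr.employees']
```

MY_DB.hr.tmp_import matches both filters and is dropped, because the exclude step runs first.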

Regex Pattern Rules

The following rules apply when writing regex patterns for metadata filters:

Rule    Description                                                                               Example
\.      Use \. to match a literal dot between database, schema, and table name segments.          MY_DB\.hr\..*
.*      Use .* to match any sequence of characters (zero or more).                                .*\.hr\..*
(a|b)   Use | inside parentheses to match multiple options (logical OR within a single pattern).  MY_DB\.(hr|finance)\..*
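
The difference between an escaped dot and a bare dot is easy to miss, so a short comparison may help. This sketch uses Python's re module with whole-FQN matching as a stand-in for the crawler's engine (an assumption), and the database name MY_DB_hr is invented for illustration.

```python
import re

escaped = re.compile(r"MY_DB\.hr\..*")    # \. matches only a literal dot
unescaped = re.compile(r"MY_DB.hr..*")    # . matches any single character

# Columns: FQN, escaped pattern matches, unescaped pattern matches.
for fqn in ["MY_DB.hr.employees", "MY_DB_hr.payroll.runs"]:
    print(fqn, bool(escaped.fullmatch(fqn)), bool(unescaped.fullmatch(fqn)))
# MY_DB.hr.employees True True
# MY_DB_hr.payroll.runs False True

# Alternation inside parentheses selects either schema in one pattern.
either_schema = re.compile(r"MY_DB\.(hr|finance)\..*")
print(bool(either_schema.fullmatch("MY_DB.finance.ledger")))  # True
print(bool(either_schema.fullmatch("MY_DB.staging.orders")))  # False
```

The unescaped pattern also accepts an asset in a differently named database, which is why literal separators should be written as \. in filter patterns.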

Include Filter Examples

Intent                                            Pattern
All tables in the hr schema of MY_DB              MY_DB\.hr\..*
Tables starting with emp in MY_DB.hr              MY_DB\.hr\.emp.*
A specific table: MY_DB.hr.employees              MY_DB\.hr\.employees
All tables in the hr schema across any database   .*\.hr\..*
Tables starting with report_ in any schema        .*\..*\.report_.*
Two schemas in MY_DB: hr and finance              MY_DB\.(hr|finance)\..*
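
The include patterns above can be dry-run against a handful of sample names before they are entered in the filter field. This is a sketch only: the sample FQNs and the matches helper are made up for illustration, and whole-FQN matching via re.fullmatch is an assumption about the crawler's behavior.

```python
import re

sample_fqns = [
    "MY_DB.hr.employees",
    "MY_DB.hr.emp_history",
    "MY_DB.finance.report_q1",
    "OTHER_DB.hr.contractors",
]

include_examples = {
    "all tables in MY_DB.hr": r"MY_DB\.hr\..*",
    "tables starting with emp in MY_DB.hr": r"MY_DB\.hr\.emp.*",
    "exactly MY_DB.hr.employees": r"MY_DB\.hr\.employees",
    "hr schema in any database": r".*\.hr\..*",
    "report_ tables in any schema": r".*\..*\.report_.*",
    "hr and finance in MY_DB": r"MY_DB\.(hr|finance)\..*",
}

def matches(pattern, fqns):
    """Return the FQNs that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [fqn for fqn in fqns if compiled.fullmatch(fqn)]

for intent, pattern in include_examples.items():
    print(f"{intent}: {matches(pattern, sample_fqns)}")
```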

Exclude Filter Examples

Intent                                      Pattern
All staging schemas                         .*\.staging\..*
All tables prefixed with tmp_               .*\..*\.tmp_.*
A specific schema in MY_DB                  MY_DB\.test_data\..*
Multiple schemas: staging, test, and dev    MY_DB\.(staging|test|dev)\..*

Combined Include and Exclude Examples

Intent                                    Include           Exclude
All of MY_DB except the staging schema    MY_DB\..*         MY_DB\.staging\..*
Only hr tables, but not temporary ones    MY_DB\.hr\..*     .*\..*\.tmp_.*

Filter Behavior Reference

Configuration: Database selected; no include or exclude filters configured
Result: The crawler ingests all assets in the selected database. This is the default behavior and is unchanged for existing data sources.

Configuration: Include filter configured; exclude filter empty
Result: Only assets whose FQN matches at least one include pattern are ingested. If no assets match, the crawl returns an empty result set.

Configuration: Exclude filter configured; include filter empty
Result: All assets are ingested except those whose FQN matches at least one exclude pattern. If all assets match the exclude filter, the crawl returns an empty result set.

Configuration: Both include and exclude filters configured
Result: Exclude is applied first. Include is applied to the remaining assets. An asset matching both filters is excluded.

Configuration: A filter pattern is configured but matches no assets
Result: The crawl returns an empty result set for that filter scope. No assets are ingested.
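
An unexpectedly empty crawl is often caused by a pattern that matches nothing, for example because of a case mismatch. The sketch below flags such patterns against a list of asset names you already know exist. The helper unmatched_patterns and the sample names are hypothetical, and whole-FQN matching (re.fullmatch) is assumed rather than documented.

```python
import re

def unmatched_patterns(patterns, known_fqns):
    """Return the patterns that fully match none of the known FQNs."""
    return [
        pattern for pattern in patterns
        if not any(re.fullmatch(pattern, fqn) for fqn in known_fqns)
    ]

known_fqns = ["MY_DB.hr.employees", "MY_DB.finance.ledger"]
patterns = [r"MY_DB\.hr\..*", r"my_db\.finance\..*"]

# The lowercase pattern matches nothing because FQN matching is case-sensitive.
print(unmatched_patterns(patterns, known_fqns))
```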

How to Start Crawling

Follow these steps to start crawling a data source:

  1. Navigate to Control Center from the left navigation menu and select Integrations.

  2. From the Connectors page, locate your data source card.

  3. Click the vertical ellipsis icon on the data source card.

  4. Select Start Crawler.

  5. In the pop-up window, choose one of the following options:

    1. Full Crawl: Perform a comprehensive scan of all assets.
    2. Selective Crawl: (Available after the first full crawl) Crawl only selected assets. This option is available for ADLS, Snowflake, Databricks, BigQuery, GCS, and Amazon S3.
  6. Click Start Crawl to begin.

Note: You can perform selective crawling multiple times for different assets within the same data source, whenever needed.

Note: During a selective crawl, any databases that have not been crawled before are automatically included to ensure all newly discovered databases are registered in the system.
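
The automatic inclusion of never-crawled databases can be pictured as a set union. This is a simplified model at the database level only, with hypothetical names (selective_crawl_scope and the sample databases); it is not how ADOC computes the scope internally.

```python
def selective_crawl_scope(selected, available, previously_crawled):
    """Return the databases a selective crawl would cover in this model."""
    # Databases that exist in the source but were never crawled before.
    never_crawled = set(available) - set(previously_crawled)
    # The effective scope is the user's selection plus those new databases.
    return sorted(set(selected) | never_crawled)

scope = selective_crawl_scope(
    selected=["SALES_DB"],
    available=["SALES_DB", "HR_DB", "NEW_DB"],
    previously_crawled=["SALES_DB", "HR_DB"],
)
print(scope)  # ['NEW_DB', 'SALES_DB']
```

In this model HR_DB is left alone because it was crawled before and was not selected, while NEW_DB is picked up even though it was not selected.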

Benefits of Selective Crawling

  • Efficiency: Refresh only what's needed instead of re-crawling everything.
  • Faster updates: Quickly reflect recent changes in key datasets.
  • Flexibility: Choose when and what to crawl based on business needs.
  • Accuracy: Keep critical asset metadata up to date without impacting performance.
  • Precision: Use metadata filters to scope crawls to exactly the schemas and tables your team requires, reducing noise from irrelevant or temporary assets.