Crawl Data Sources
Crawling is the process of discovering and cataloging data assets from a connected data source. Once a connection is established, you can trigger a crawl job to scan the data source and identify all available datasets, such as databases, schemas, tables, or files.
The discovered assets are displayed on the Discover Assets page in ADOC for further exploration and analysis.
For supported data sources, you can configure include and exclude metadata filters to control which databases, schemas, and tables the crawler ingests. Filters are defined during data source setup on the Observability Setup page and apply to all subsequent crawls for that data source.
Types of Crawling
There are two ways to crawl a data source:
- Full Crawling: Scans all assets within a data source.
- Selective Crawling: Lets you manually crawl specific assets within an existing data source.
Both crawl types respect any include and exclude metadata filters configured on the data source. If filters are configured, only the assets that pass the filter evaluation are crawled and ingested.
Full Crawling
A Full Crawl performs a comprehensive scan of all available assets in a connected data source. This is typically the first crawl performed after connecting the source. A full crawl collects metadata for:
- All databases or containers
- Schemas and folders
- Tables, views, or files
This ensures that the entire data source is registered and visible in Discover Assets in ADOC.
Selective Crawling
After a full crawl is complete, you can choose to crawl specific assets within a data source whenever needed. This helps reduce unnecessary scans, save time, and keep metadata up to date for assets that change frequently.
Selective Crawling is supported for the following data sources:
- ADLS (Azure Data Lake Storage)
- Snowflake
- Databricks
- BigQuery
- GCS (Google Cloud Storage)
- Amazon S3
When to Use Selective Crawling
Use selective crawling when:
- You only need to update metadata for a specific asset.
- You have recently added or modified a single database, schema, or table.
- You want to avoid re-crawling large sources with minimal changes.
- You are troubleshooting data quality or lineage for a particular dataset.
Example
If you have already performed a full crawl of your Snowflake source, you can later choose to crawl only one specific database instead of the entire connection. This means:
- You do not have to re-scan all schemas and tables.
- Only the selected database's metadata is refreshed.
- The updated information appears in the Discover Assets view.
Metadata Filters
Metadata filters let you define which assets the crawler includes or excludes during ingestion, using regex patterns. Instead of crawling all assets in a selected database, you can scope the crawl precisely to the schemas and tables your team actually needs.
Filters are configured on the Observability Setup page when registering or editing a data source. Once saved, the configured filters apply to all subsequent crawls for that data source, both full and selective.
Supported Data Sources
Metadata filters are available for the following data source types:
- Oracle
- Snowflake
- Databricks
- MS SQL
- Redshift
- Trino
- SAP HANA
Configure Metadata Filters
Metadata filters are configured during data source registration or when editing an existing data source.
- Navigate to Control Center -> Integrations.
- Locate the data source and open it for editing, or proceed through the registration wizard to the Observability Setup step.
- Under Data Reliability, select one or more databases to crawl from the Databases field.
- In the Include Assets Filter (Regex) field, enter one or more regex patterns for the assets you want the crawler to include. Select the + button to add additional patterns.
- In the Exclude Assets Filter (Regex) field, enter one or more regex patterns for the assets you want the crawler to skip. Select the + button to add additional patterns.
- Select Submit to save the configuration.
Fully Qualified Asset Name (FQN) Format
Regex patterns are matched against the fully qualified asset name (FQN) of each asset. The FQN format is:
DATABASE.SCHEMA.TABLE_NAME
FQN matching is case-sensitive. Use the number of dot separators in your pattern to target the appropriate level of the hierarchy:
| Dot separators in pattern | Level targeted | Example |
|---|---|---|
| None (no dot) | Database only | MY_DB.* |
| One (one dot) | Database and schema | MY_DB\.sales.* |
| Two (two dots) | Database, schema, and table | MY_DB\.sales\.orders.* |
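The effect of adding dot separators can be previewed outside ADOC with a short Python sketch. The sample FQNs below are hypothetical, and the use of full-string matching (`re.fullmatch`) is an assumption, since this page does not state which matching mode the crawler uses.

```python
import re

# Hypothetical FQNs in DATABASE.SCHEMA.TABLE_NAME form.
SAMPLE_FQNS = [
    "MY_DB.sales.orders",
    "MY_DB.sales.customers",
    "MY_DB.hr.employees",
    "OTHER_DB.sales.orders",
]

def matching_fqns(pattern, fqns=SAMPLE_FQNS):
    """Return the FQNs that the pattern matches in full (assumed semantics)."""
    compiled = re.compile(pattern)  # no IGNORECASE flag: matching is case-sensitive
    return [fqn for fqn in fqns if compiled.fullmatch(fqn)]

database_level = matching_fqns(r"MY_DB.*")               # no escaped dot
schema_level = matching_fqns(r"MY_DB\.sales.*")          # one escaped dot
table_level = matching_fqns(r"MY_DB\.sales\.orders.*")   # two escaped dots
lowercase_attempt = matching_fqns(r"my_db.*")            # wrong case, matches nothing

print(database_level)
print(schema_level)
print(table_level)
print(lowercase_attempt)
```

Under these assumptions, each additional escaped dot narrows the match by one level. Note that a pattern such as MY_DB.* would also match a database whose name merely begins with MY_DB (for example, a hypothetical MY_DB_ARCHIVE), whereas MY_DB\..* restricts the match to MY_DB itself.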
Include Assets Filter
The Include Assets Filter defines which assets the crawler retains after the exclude filter has been applied. If this field is left empty, the crawler includes all remaining assets in the selected databases (subject to any exclude patterns).
Multiple include patterns can be added. An asset is included if its FQN matches any one of the configured patterns (logical OR).
Exclude Assets Filter
The Exclude Assets Filter defines which assets the crawler removes from the crawl scope before the include filter is applied. If this field is left empty, no assets are excluded.
Multiple exclude patterns can be added. An asset is excluded if its FQN matches any one of the configured patterns (logical OR).
Evaluation Order
When both filters are configured, the crawler applies them in the following order:
- The Exclude Assets Filter is applied first. Any asset whose FQN matches at least one exclude pattern is removed from the crawl scope.
- The Include Assets Filter is applied to the remaining assets. Only assets whose FQN matches at least one include pattern are retained.
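The evaluation order described above can be sketched in Python as follows. This is an illustration, not the crawler's actual implementation: the function name `apply_metadata_filters`, the sample assets, and the use of full-string matching are all assumptions.

```python
import re

def apply_metadata_filters(fqns, include_patterns=(), exclude_patterns=()):
    """Apply exclude patterns first, then include patterns (hypothetical helper)."""
    def matches_any(fqn, patterns):
        # Logical OR: one matching pattern is enough.
        return any(re.fullmatch(pattern, fqn) for pattern in patterns)

    # Step 1: remove every asset that matches at least one exclude pattern.
    remaining = [fqn for fqn in fqns if not matches_any(fqn, exclude_patterns)]
    # Step 2: if include patterns exist, keep only the assets that match one.
    if include_patterns:
        remaining = [fqn for fqn in remaining if matches_any(fqn, include_patterns)]
    return remaining

# Hypothetical assets discovered in the selected database.
assets = [
    "MY_DB.hr.employees",
    "MY_DB.hr.tmp_import",
    "MY_DB.finance.ledger",
    "MY_DB.staging.raw_events",
]

crawl_scope = apply_metadata_filters(
    assets,
    include_patterns=[r"MY_DB\.hr\..*", r"MY_DB\.finance\..*"],
    exclude_patterns=[r".*\..*\.tmp_.*"],
)
print(crawl_scope)
```

In this sketch, MY_DB.hr.tmp_import matches both an include pattern and the exclude pattern, and it is dropped because the exclude step runs first. MY_DB.staging.raw_events survives the exclude step but matches no include pattern, so it is also left out.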
Regex Pattern Rules
The following rules apply when writing regex patterns for metadata filters:
| Rule | Description | Example |
|---|---|---|
| \. | Use \. to match a literal dot between database, schema, and table name segments. | MY_DB\.hr\..* |
| .* | Use .* to match any sequence of characters (zero or more). | .*\.hr\..* |
| (a\|b) | Use \| inside parentheses to match multiple options (logical OR within a single pattern). | MY_DB\.(hr\|finance)\..* |
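The following Python sketch shows why escaping the dot matters, along with the wildcard and alternation rules. The asset names are hypothetical, and full-string matching is assumed because the crawler's matching mode is not specified on this page.

```python
import re

def is_match(pattern, fqn):
    """True when the whole FQN matches the pattern (assumed matching mode)."""
    return re.fullmatch(pattern, fqn) is not None

# Hypothetical asset in a database whose name merely resembles "MY_DB.hr".
lookalike = "MY_DB_hrX.ops.jobs"

# An escaped dot only matches a literal ".", so the lookalike is rejected.
escaped_hits_lookalike = is_match(r"MY_DB\.hr\..*", lookalike)
# An unescaped dot matches any character, so the lookalike slips through.
unescaped_hits_lookalike = is_match(r"MY_DB.hr..*", lookalike)

# ".*" stands in for any database name.
any_database_hr = is_match(r".*\.hr\..*", "OTHER_DB.hr.employees")

# "(hr|finance)" accepts either schema, and nothing else.
alternation_finance = is_match(r"MY_DB\.(hr|finance)\..*", "MY_DB.finance.ledger")
alternation_sales = is_match(r"MY_DB\.(hr|finance)\..*", "MY_DB.sales.orders")

print(escaped_hits_lookalike, unescaped_hits_lookalike)
print(any_database_hr, alternation_finance, alternation_sales)
```

Forgetting the backslash does not cause an error. It silently widens the pattern, which is why it is worth checking a pattern against a few representative names before saving it.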
Include Filter Examples
| Intent | Pattern |
|---|---|
| All tables in the hr schema of MY_DB | MY_DB\.hr\..* |
| Tables starting with emp in MY_DB.hr | MY_DB\.hr\.emp.* |
| A specific table: MY_DB.hr.employees | MY_DB\.hr\.employees |
| All tables in the hr schema across any database | .*\.hr\..* |
| Tables starting with report_ in any schema | .*\..*\.report_.* |
| Two schemas in MY_DB: hr and finance | MY_DB\.(hr\|finance)\..* |
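To check what each include pattern in the table would select, the patterns can be run against a list of sample names. The sketch below is one way to do that in Python. The `preview` helper and the sample FQNs are hypothetical, and full-string matching is an assumption about how the crawler evaluates patterns.

```python
import re

# Patterns copied from the include examples table, keyed by intent.
INCLUDE_EXAMPLES = {
    "all hr tables in MY_DB": r"MY_DB\.hr\..*",
    "emp-prefixed tables in MY_DB.hr": r"MY_DB\.hr\.emp.*",
    "exactly MY_DB.hr.employees": r"MY_DB\.hr\.employees",
    "hr schema in any database": r".*\.hr\..*",
    "report_ tables in any schema": r".*\..*\.report_.*",
    "hr and finance schemas in MY_DB": r"MY_DB\.(hr|finance)\..*",
}

# Hypothetical assets used only for this preview.
SAMPLE_FQNS = [
    "MY_DB.hr.employees",
    "MY_DB.hr.employees_archive",
    "MY_DB.hr.payroll",
    "MY_DB.finance.report_q1",
    "OTHER_DB.hr.report_headcount",
]

def preview(patterns, fqns):
    """Map each intent to the sample FQNs its pattern matches in full."""
    return {
        intent: [fqn for fqn in fqns if re.fullmatch(pattern, fqn)]
        for intent, pattern in patterns.items()
    }

matches = preview(INCLUDE_EXAMPLES, SAMPLE_FQNS)
for intent, hits in matches.items():
    print(f"{intent}: {hits}")
```

With full-string matching, the pattern for a specific table selects MY_DB.hr.employees but not MY_DB.hr.employees_archive. If the crawler used substring matching instead, that pattern would need explicit anchors to behave the same way.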
Exclude Filter Examples
| Intent | Pattern |
|---|---|
| All staging schemas | .*\.staging\..* |
| All tables prefixed with tmp_ | .*\..*\.tmp_.* |
| A specific schema in MY_DB | MY_DB\.test_data\..* |
| Multiple schemas: staging, test, and dev | MY_DB\.(staging\|test\|dev)\..* |
Combined Include and Exclude Examples
| Intent | Include | Exclude |
|---|---|---|
| All of MY_DB except the staging schema | MY_DB\..* | MY_DB\.staging\..* |
| Only hr tables, but not temporary ones | MY_DB\.hr\..* | .*\..*\.tmp_.* |
Filter Behavior Reference
| Configuration | Result |
|---|---|
| Database selected; no include or exclude filters configured | The crawler ingests all assets in the selected database. This is the default behavior and is unchanged for existing data sources. |
| Include filter configured; exclude filter empty | Only assets whose FQN matches at least one include pattern are ingested. If no assets match, the crawl returns an empty result set. |
| Exclude filter configured; include filter empty | All assets are ingested except those whose FQN matches at least one exclude pattern. If all assets match the exclude filter, the crawl returns an empty result set. |
| Both include and exclude filters configured | Exclude is applied first. Include is applied to the remaining assets. An asset matching both filters is excluded. |
| An include filter pattern is configured but matches no assets | The crawl returns an empty result set for that filter scope. No assets are ingested. |
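The configurations in this table can be reproduced with a small Python model, which helps when predicting whether a given setup will produce an empty crawl. The `crawl_scope` function and the asset names are hypothetical, and full-string matching is assumed.

```python
import re

def crawl_scope(fqns, include=(), exclude=()):
    """Model of the documented behavior: exclude first, then include."""
    def matches_any(fqn, patterns):
        return any(re.fullmatch(pattern, fqn) for pattern in patterns)

    kept = [fqn for fqn in fqns if not matches_any(fqn, exclude)]
    # An empty include list means "no restriction", not "match nothing".
    return [fqn for fqn in kept if matches_any(fqn, include)] if include else kept

# Hypothetical assets in the selected database.
assets = ["MY_DB.hr.employees", "MY_DB.hr.tmp_load", "MY_DB.staging.events"]

no_filters = crawl_scope(assets)
include_only = crawl_scope(assets, include=[r"MY_DB\.hr\..*"])
exclude_only = crawl_scope(assets, exclude=[r"MY_DB\.staging\..*"])
both_filters = crawl_scope(assets, include=[r"MY_DB\.hr\..*"], exclude=[r".*\..*\.tmp_.*"])
include_matches_nothing = crawl_scope(assets, include=[r"MY_DB\.finance\..*"])
exclude_matches_everything = crawl_scope(assets, exclude=[r"MY_DB\..*"])

for name, scope in [
    ("no filters", no_filters),
    ("include only", include_only),
    ("exclude only", exclude_only),
    ("both filters", both_filters),
    ("include matches nothing", include_matches_nothing),
    ("exclude matches everything", exclude_matches_everything),
]:
    print(f"{name}: {scope}")
```

The last two cases are the ones that lead to an empty result set: an include pattern that matches no asset, and an exclude pattern that matches every asset.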
How to Start Crawling
Follow these steps to start crawling a data source:
- Navigate to Control Center from the left navigation menu and select Integrations.
- From the Connectors page, locate your data source card.
- Click the vertical ellipsis icon on the data source card.
- Select Start Crawler.
- In the pop-up window, choose one of the following options:
  - Full Crawl: Perform a comprehensive scan of all assets.
  - Selective Crawl: (Available after the first full crawl) Crawl only selected assets. This option is available for ADLS, Snowflake, Databricks, BigQuery, GCS, and Amazon S3.
- Click Start Crawl to begin.
Benefits of Selective Crawling
- Efficiency: Refresh only what's needed instead of re-crawling everything.
- Faster updates: Quickly reflect recent changes in key datasets.
- Flexibility: Choose when and what to crawl based on business needs.
- Accuracy: Keep critical asset metadata up to date without impacting performance.
- Precision: Use metadata filters to scope crawls to exactly the schemas and tables your team requires, reducing noise from irrelevant or temporary assets.
For additional help, visit www.acceldata.force.com or call our service desk at +1 844 9433282.
Copyright © 2025