Data Quality Policy
A Data Quality Policy in ADOC ensures that your datasets are accurate, complete, and reliable from a business perspective.
- A policy is a set of rules that define your expectations for a dataset.
- A rule is a specific condition or check that must be met for the data to be considered “good.” Each rule focuses on a particular aspect of data quality, such as null values, data type consistency, value ranges, uniqueness, or patterns.
Example: A Customer Table Data Quality Policy might include rules such as:
- “Customer email cannot be null” (Null Values rule)
- “Customer ID must be unique” (Duplicate Row Check rule)
- “Order total must be between 0 and 10,000” (Range Match rule)
Creating a Data Quality Policy
1. Start Policy Creation
You can create a Data Quality Policy in two main ways: via Manage Policies or via the Asset Details page.
Option 1: Through Manage Policies
- Navigate to Data Reliability > Manage Policies from the left menu.
- Click Add Policy (top-right).
- Select Data Quality as the policy type.
- Choose the dataset(s) (assets) you want to monitor.
- The Create Data Quality Policy page opens for configuration.
Option 2: Through the Asset Details Page
There are two ways to start a Data Quality Policy from the Asset Details page:
Overview Tab
- Open the dataset in the Asset Details page.
- In the Overview tab, click the Actions button and select Add Data Quality Policy.
Policies Tab
- Navigate to the Policies tab in the Asset Details page.
- Click Add Policy or use the Actions button, then select Data Quality.
In both cases, the Create Data Quality Policy page opens for configuration.
2. Configure Data Selection
Before defining rules, ADOC must know which part of the dataset needs to be evaluated. There are two ways to select data:
1. Column-Based Selection
- Select one or more columns directly from the dataset.
- ADOC will apply rules only to these selected columns.
- Best for simple checks or when you only need to monitor specific fields.
2. SQL-Based Selection
- Enter a custom SQL query to define the data subset.
- Use the placeholders:
  - {{{lower_bound}}}: replaced with the starting value for incremental or selective runs.
  - {{{upper_bound}}}: replaced with the ending value for incremental or selective runs.
- Click Validate to check that your query is correct before proceeding.
- If using Native SQL (checkbox unchecked), follow the syntax of your underlying data platform (Snowflake, BigQuery, etc.), which may differ from Spark SQL.
- Some incremental features require that the column being incremented is included in the query’s SELECT statement.
- Older Dataplane versions may not support placeholders in SQL; upgrade to ADOC v4.3.0 or later if needed.
3. Choose Execution Engine
- Spark (default): Runs jobs in ADOC’s Spark engine. Good for standard profiling and complex rule execution.
- Pushdown: Lets queries run directly on the data source, reducing data movement. Best for very large datasets.
- Availability of some incremental and advanced options (such as certain segment selections) depends on the engine you select.
- If Pushdown is selected, persistence of records and some Spark job settings are disabled.
Example:
- Column-Based: Monitor customer_email for null values.
- SQL-Based:
  SELECT customer_id, email, signup_date FROM customers WHERE signup_date >= {{{lower_bound}}} AND signup_date < {{{upper_bound}}}
4. Configure Rules
You can choose from many pre-defined or custom rules to enforce data quality. Each rule has options to include or exclude nulls or empty values:
| Rule | Description | Example |
|---|---|---|
| Null Values | Checks if columns contain null values. | Customer email should never be null. |
| Schema Match | Validates column data types. | Customer ID column must be integer. |
| Pattern Match (Regex) | Ensures values follow a pattern. | Emails must match ^[\w.%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$. |
| Enumerations | Checks if values belong to a defined list. | The status column must be “Active”, “Inactive”, or “Pending”. |
| Tags Match | Validates values against tags provided. | Product categories must match predefined tags. |
| Range Match | Checks if values fall within a range. | Order total between 0 and 10,000. |
| Duplicate Row Check | Ensures values are unique. | The customer ID column must be distinct. |
| Row Count Check | Validates total rows fall within expected limits. | The dataset should have 10,000–12,000 rows. |
| Metric Check | Validates aggregate metrics (SUM, AVG, MIN, MAX). | Average salary between 25,000 and 60,000. |
| SQL Metric Check | Advanced numeric checks using custom SQL expressions. | avg(emp_salary) - 10000 must be between 25,000 and 60,000. |
| Freshness Check | Checks whether the latest data arrived on time. Supports anomaly detection. | Sales data must refresh daily by 6 AM. |
| User-Defined / Lookup | Custom code or reference-based validation. | Validate IDs against reference table. |
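To make the SQL-flavored rules concrete, here is a minimal sketch of the kind of queries a Pattern Match and a SQL Metric Check boil down to, written as standalone Spark SQL (the customers and employees tables are illustrative; ADOC builds its own queries internally):

```sql
-- Illustrative Pattern Match (Regex) check: count rows whose email
-- fails the pattern, skipping nulls (null values have their own rule).
SELECT COUNT(*) AS invalid_emails
FROM customers
WHERE email IS NOT NULL
  AND NOT (email RLIKE '^[\\w.%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$');

-- Illustrative SQL Metric Check: the computed value must fall
-- between 25,000 and 60,000 for the rule to pass.
SELECT AVG(emp_salary) - 10000 AS metric_value
FROM employees;
```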
Policy Evaluation Strategy:
- Rule-Based: Policy passes only if all rules pass.
- Weightage-Based: Each rule carries a weight, so more important rules count more toward the overall policy result.
Thresholds:
- Success Threshold (0–100%): the minimum score for the run to be marked successful.
- Warning Threshold (0–100%): the score at or above which a run that misses the success threshold is marked as a warning rather than a failure.
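Example (illustrative numbers): a weightage-based policy with a null check weighted 3 and a range check weighted 1 scores 3 / (3 + 1) = 75% if only the null check passes. With a success threshold of 90% and a warning threshold of 70%, that run is flagged as a warning.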
Use a Data Policy Template for Faster Setup
Instead of adding rules one by one, you can apply a Data Policy Template. A template is a reusable collection of rule definitions that represents a common set of data quality checks. When you apply a template to your policy, all rules in the template are automatically included and evaluated.
- Example: A “Customer Data Template” might include rules for email null checks, customer ID uniqueness, and phone number pattern validation.
- Benefit: Templates save time, ensure consistent data quality standards across datasets, and reduce manual configuration.
5. Configure Alerts & Notifications (Optional)
Alerts let you know when a policy fails or needs attention, so you can act before issues impact reports or analytics.
Severity Levels: Define how serious the alert is:
- Critical: Immediate attention needed
- High: Important, but not urgent
- Medium: Monitor for trends
- Low: Informational
Notification Channels: Choose where to receive alerts:
- Email, Slack, Microsoft Teams, Webhook, ServiceNow, or Chat
Notify on Success / Warning: Optional toggles:
- Notify on Success: Get notified when a policy passes successfully
- Notify on Warning: Get notified if a policy raises a warning
Re-notification Options: Control how often you are notified:
- Never: Only receive the first alert
- After n failed runs: Reduce noise; notify after a set number of failures
- Every time: Receive an alert each time the policy fails
Example:
- A critical alert is sent via Slack if daily sales data hasn’t arrived by 6 AM.
- Warnings are emailed if less than 95% of customer emails are valid.
6. Execution Settings (Optional)
Execution settings control how and when your policy runs, and where results are stored.
Incremental Strategy: Check only new or changed data:
- Choose a column (ID, datetime, or partition) to track incremental updates.
- Reduces processing time for large datasets.
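For example, if signup_date is the incremental column, the placeholder query from the earlier example conceptually resolves to something like this on a given run (the bound values shown are hypothetical):

```sql
-- The earlier placeholder query after a hypothetical incremental
-- substitution: {{{lower_bound}}} takes the last processed value,
-- {{{upper_bound}}} the current high-water mark.
SELECT customer_id, email, signup_date
FROM customers
WHERE signup_date >= '2025-01-14 00:00:00'
  AND signup_date <  '2025-01-15 00:00:00';
```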
Scheduling: Decide when the policy runs:
- Hourly, daily, weekly, monthly, or yearly
- Choose your time zone
Resource Strategy: Determine computing resources for execution:
- Small, Medium, Large, Global, or Custom
Persistence: Decide how to store policy results:
- Good Records: Data that passed all rules
- Bad Records: Data that failed rules
- Standardized folder paths make results easy to find and query:
  - Bad Records: s3a://<bucket>/data-quality-<policy-name>/<date>/<execution-id>/errorrecords
  - Good Records & Summary: s3a://<bucket>/data-quality-<policy-name>/<date>/<execution-id>/summary
Timeouts: Limit execution time:
- Run timeout: Maximum time for the job itself
- Total timeout: Maximum time including retries
Example: Daily incremental check on customer_signups table, storing failed rows for review, scheduled at 2 AM using Medium resources.
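With hypothetical values plugged into the path template above, the failed rows from that run might land at s3a://my-bucket/data-quality-customer-signups-check/2025-01-15/8421/errorrecords, ready to query for review.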
Dynamic Column Mapping for SQL Rules and UDFs
Dynamic column mapping helps apply the same rule across multiple datasets even if column names differ.
- Use placeholders ${column} or ${table} in your SQL rules or UDFs.
- Tag columns or tables with descriptive tags (e.g., customer_id).
- When the policy runs, ADOC automatically maps the rule to all columns or tables with the matching tag.
Example: Tables with columns customer_id, cust_id, and client_id can all be checked using one dynamic rule tagged customer_id.
Benefit: Saves time and ensures consistent data quality checks across datasets with similar but differently named columns.
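As a sketch, assuming a simple null check expressed as a dynamic SQL rule (table and column names come from your tags):

```sql
-- Hypothetical dynamic SQL rule. At run time ADOC substitutes
-- ${table} and ${column} once per tagged match, so the same rule
-- runs against customers.customer_id, billing.cust_id, and
-- crm.client_id when all three columns carry the customer_id tag.
SELECT COUNT(*) AS null_ids
FROM ${table}
WHERE ${column} IS NULL;
```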
Executing a Data Quality Policy
1. Navigate to Data Reliability > Manage Policies.
2. Find your policy and click the Play icon.
3. Choose an execution mode:
   - All Your Data: Run on the entire dataset
   - Incremental: Run based on the incremental strategy
   - Selective: Run on a subset filtered by ID or datetime
4. Click Execute.
5. Review results in the Policy Execution Details page.
Data Quality Policy Execution Details
The execution page provides insights into the policy run:
| Panel / Tab | Description |
|---|---|
| Execution History | Shows details for each run (start/end time, engine, rules configured/passed). |
| Overall Quality Score | Summary of data quality performance. |
| Execution Details Panel | Key metrics like rows scanned, rules passed, and processing engine. |
| Filter Panel | Displays SQL filters used for execution. |
| Execution Summary Tab | Shows rules, status (pass/fail), success rate, weightage, thresholds. |
| Segmented Analysis Tab | Breaks down rules and rows by dataset segments. |
| Quality Summary Tab | Summarizes data quality dimensions for the asset. |