Data Quality Policy

A Data Quality Policy in ADOC ensures that your datasets are accurate, complete, and reliable from a business perspective.

  • A policy is a set of rules that define your expectations for a dataset.
  • A rule is a specific condition or check that must be met for the data to be considered “good.” Each rule focuses on a particular aspect of data quality, such as null values, data type consistency, value ranges, uniqueness, or patterns.

Example: A Customer Table Data Quality Policy might include rules such as:

  • “Customer email cannot be null” (Null Values rule)
  • “Customer ID must be unique” (Duplicate Row Check rule)
  • “Order total must be between 0 and 10,000” (Range Match rule)

Creating a Data Quality Policy

1. Start Policy Creation

You can create a Data Quality Policy in two main ways: via Manage Policies or via the Asset Details page.

Option 1: Through Manage Policies

  1. Navigate to Data Reliability > Manage Policies from the left menu.
  2. Click Add Policy (top-right).
  3. Select Data Quality as the policy type.
  4. Choose the dataset(s) (assets) you want to monitor.
  5. The Create Data Quality Policy page opens for configuration.

Option 2: Through the Asset Details Page

There are two ways to start a Data Quality Policy from the Asset Details page:

Overview Tab

  1. Open the dataset in the Asset Details page.
  2. In the Overview tab, click the Actions button and select Add Data Quality Policy.

Policies Tab

  1. Navigate to the Policies tab in the Asset Details page.
  2. Click Add Policy or use the Actions button, then select Data Quality.

In both cases, the Create Data Quality Policy page opens for configuration.

2. Configure Data Selection

Before defining rules, ADOC must know which part of the dataset needs to be evaluated. There are two ways to select data:

1. Column-Based Selection

  • Select one or more columns directly from the dataset.
  • ADOC will apply rules only to these selected columns.
  • Best for simple checks or when you only need to monitor specific fields.

2. SQL-Based Selection

  • Enter a custom SQL query to define the data subset.

  • Use the placeholders:

    • {{{lower_bound}}} : replaced with the starting value for incremental or selective runs.
    • {{{upper_bound}}} : replaced with the ending value for incremental or selective runs.
  • Click Validate to check that your query is correct before proceeding.

Note

  • If using Native SQL (checkbox unchecked), follow the syntax of your underlying data platform (Snowflake, BigQuery, etc.), which may differ from Spark SQL; see the example after this note.
  • Some incremental features require that the column being incremented is included in the query’s SELECT statement.
  • Older Dataplane versions may not support placeholders in SQL; upgrade to ADOC v4.3.0 or later if needed.
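
As a concrete illustration of the first note (the customers table and date logic here are assumed for the sketch): the same seven-day filter is written differently in Spark SQL and in, for example, Snowflake's native dialect.

```sql
-- Spark SQL (Native SQL not selected)
SELECT customer_id, email, signup_date
FROM customers
WHERE signup_date >= date_sub(current_date(), 7);

-- Equivalent filter in Snowflake Native SQL
SELECT customer_id, email, signup_date
FROM customers
WHERE signup_date >= DATEADD(day, -7, CURRENT_DATE);
```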

3. Choose Execution Engine

  • Spark (default): Runs jobs in ADOC’s Spark engine. Good for standard profiling and complex rule execution.
  • Pushdown: Lets queries run directly on the data source, reducing data movement. Best for very large datasets.

Important

  • Some incremental or advanced options (like certain segment selections) are only available depending on the data engine selected.
  • If Pushdown is selected, persistence of records and some Spark job settings are disabled.

Example

  • Column-Based: Monitor customer_email for null values.
  • SQL-Based: SELECT customer_id, email, signup_date FROM customers WHERE signup_date >= {{{lower_bound}}} AND signup_date < {{{upper_bound}}}

4. Configure Rules

You can choose from many pre-defined or custom rules to enforce data quality. Each rule has options to include or exclude nulls or empty values:

| Rule | Description | Example |
| --- | --- | --- |
| Null Values | Checks if columns contain null values. | Customer email should never be null. |
| Schema Match | Validates column data types. | The Customer ID column must be an integer. |
| Pattern Match (Regex) | Ensures values follow a pattern. | Emails must match ^[\w.%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$. |
| Enumerations | Checks if values belong to a defined list. | The status column must be “Active”, “Inactive”, or “Pending”. |
| Tags Match | Validates values against provided tags. | Product categories must match predefined tags. |
| Range Match | Checks if values fall within a range. | Order total between 0 and 10,000. |
| Duplicate Row Check | Ensures values are unique. | The customer ID column must be distinct. |
| Row Count Check | Validates that total rows fall within expected limits. | The dataset should have 10,000–12,000 rows. |
| Metric Check | Validates aggregate metrics (SUM, AVG, MIN, MAX). | Average salary between 25,000 and 60,000. |
| SQL Metric Check | Advanced numeric checks using custom SQL expressions. | avg(emp_salary) - 10000 must be between 25,000 and 60,000. |
| Freshness Check | Checks if the latest data arrived on time; supports anomaly detection. | Sales data must refresh daily by 6 AM. |
| User-Defined / Lookup | Custom code or reference-based validation. | Validate IDs against a reference table. |
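
To make the intent of these rules concrete, here is a sketch of what a Null Values check and a Duplicate Row Check conceptually compute, written as plain SQL against a hypothetical customers table (ADOC generates and runs the real checks internally; these queries are illustrative only):

```sql
-- Null Values: rows violating “customer email cannot be null”.
SELECT COUNT(*) AS null_emails
FROM customers
WHERE customer_email IS NULL;

-- Duplicate Row Check: customer IDs violating uniqueness.
SELECT customer_id, COUNT(*) AS occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```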

Policy Evaluation Strategy:

  • Rule-Based: Policy passes only if all rules pass.
  • Weightage-Based: Policy success can be weighted by rule importance.

Thresholds:

  • Success Threshold: 0–100%
  • Warning Threshold: 0–100%
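
Worked example of weightage-based scoring (the weights and thresholds here are assumed for illustration): a policy has three rules weighted 50, 30, and 20. If the first two pass and the third fails, the weighted score is (50 + 30) / (50 + 30 + 20) = 80%. A success threshold of 80% would mark this run as passed; a run scoring below the success threshold but at or above the warning threshold is typically surfaced as a warning rather than a failure.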

Use a Data Policy Template for Faster Setup

Instead of adding rules one by one, you can apply a Data Policy Template. A template is a reusable collection of rule definitions that represents a common set of data quality checks. When you apply a template to your policy, all rules in the template are automatically included and evaluated.

  • Example: A “Customer Data Template” might include rules for email null checks, customer ID uniqueness, and phone number pattern validation.
  • Benefit: Templates save time, ensure consistent data quality standards across datasets, and reduce manual configuration.

5. Configure Alerts & Notifications (Optional)

Alerts let you know when a policy fails or needs attention, so you can act before issues impact reports or analytics.

  1. Severity Levels: Define how serious the alert is:

    • Critical: Immediate attention needed
    • High: Important, but not urgent
    • Medium: Monitor for trends
    • Low: Informational
  2. Notification Channels: Choose where to receive alerts:

    • Email, Slack, Microsoft Teams, Webhook, ServiceNow, or Chat
  3. Notify on Success / Warning: Optional toggles:

    • Notify on Success: Get notified when a policy passes successfully
    • Notify on Warning: Get notified if a policy raises a warning
  4. Re-notification Options: Control how often you are notified:

    • Never: Only receive the first alert
    • After n failed runs: Reduce noise; notify after a set number of failures
    • Every time: Receive an alert each time the policy fails

Example:

  • A critical alert is sent via Slack if daily sales data hasn’t arrived by 6 AM.
  • Warnings are emailed if less than 95% of customer emails are valid.

6. Execution Settings (Optional)

Execution settings control how and when your policy runs, and where results are stored.

  1. Incremental Strategy: Check only new or changed data:

    • Choose a column (ID, datetime, or partition) to track incremental updates.
    • Reduces processing time for large datasets.
  2. Scheduling: Decide when the policy runs:

    • Hourly, daily, weekly, monthly, or yearly
    • Choose your time zone
  3. Resource Strategy: Determine computing resources for execution:

    • Small, Medium, Large, Global, or Custom
  4. Persistence: Decide how to store policy results:

    • Good Records: Data that passed all rules
    • Bad Records: Data that failed rules
    • Standardized folder paths make results easy to find and query; see the sketch after this list.
  5. Timeouts: Limit execution time:
    • Run timeout: Maximum time for the job itself
    • Total timeout: Maximum time including retries
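
A minimal sketch of how persisted bad records might be reviewed, assuming you have exposed the persisted location as a queryable table (dq_bad_records and its columns are hypothetical names, not ADOC's actual layout):

```sql
-- Hypothetical review query over persisted bad records.
-- dq_bad_records is an assumed external table over the
-- bad-records folder; ADOC's real paths and schema may differ.
SELECT *
FROM dq_bad_records
WHERE policy_name = 'customer_signups_dq'
ORDER BY execution_time DESC
LIMIT 100;
```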

Example: Daily incremental check on customer_signups table, storing failed rows for review, scheduled at 2 AM using Medium resources.

Dynamic Column Mapping for SQL Rules and UDFs

Dynamic column mapping lets you apply the same rule across multiple datasets even when column names differ.

  1. Use placeholders ${column} or ${table} in your SQL rules or UDFs.
  2. Tag columns or tables with descriptive tags (e.g., customer_id).
  3. When the policy runs, ADOC automatically maps the rule to all columns or tables with the matching tag.

Example: Tables with columns customer_id, cust_id, and client_id can all be checked using one dynamic rule tagged customer_id.

Benefit: Saves time and ensures consistent data quality checks across datasets with similar but differently named columns.
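
A minimal sketch of such a dynamic rule, assuming the check you want to standardize is a null check (the ${table} and ${column} placeholders come from the steps above; the rule body itself is illustrative):

```sql
-- Dynamic SQL rule: ${table} and ${column} are resolved at run time
-- for every asset carrying the customer_id tag, so this one rule
-- covers customer_id, cust_id, and client_id alike.
SELECT COUNT(*) AS violations
FROM ${table}
WHERE ${column} IS NULL;
```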

Executing a Data Quality Policy

  1. Navigate to Data Reliability > Manage Policies.

  2. Find your policy and click the Play icon.

  3. Choose an execution mode:

    • All Your Data: Run on the entire dataset
    • Incremental: Run based on incremental strategy
    • Selective: Run on a subset filtered by ID or datetime
  4. Click Execute.

  5. Review results in the Policy Execution Details page.
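
Example: if your incremental column is signup_date, a Selective run could re-validate only rows with signup_date in January 2024, while an Incremental run picks up automatically from the last processed value.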

Data Quality Policy Execution Details

The execution page provides insights into the policy run:

| Panel / Tab | Description |
| --- | --- |
| Execution History | Shows details for each run (start/end time, engine, rules configured/passed). |
| Overall Quality Score | Summary of data quality performance. |
| Execution Details Panel | Key metrics such as rows scanned, rules passed, and processing engine. |
| Filter Panel | Displays SQL filters used for execution. |
| Execution Summary Tab | Shows rules, status (pass/fail), success rate, weightage, and thresholds. |
| Segmented Analysis Tab | Breaks down rules and rows by dataset segments. |
| Quality Summary Tab | Summarizes data quality dimensions for the asset. |