Data Anomaly Policy

Data Anomaly policies, also called Profile Anomaly policies, detect unusual patterns in your data's statistical profile metrics. While Data Drift policies identify changes in distributions, Profile Anomaly policies use machine learning to flag unexpected values in metrics such as null percentages, distinct counts, min/max values, and other statistical measures.

When to use it?

Use Profile Anomaly policies when you need to:

  • Detect sudden spikes in null values that could indicate data quality issues
  • Identify unexpected changes in distinct count (potential duplicate or missing data)
  • Catch unusual min/max values that suggest data errors or outliers
  • Monitor row count changes that don't follow normal patterns
  • Detect gradual degradation in data quality before it impacts the business

Creating a Data Anomaly Policy

You can create a Data Anomaly Policy in either of two ways:

Option 1: Through Manage Policies

  1. Navigate to Data Reliability > Manage Policies.
  2. Click Add Policy (top-right).
  3. Select Data Anomaly as the policy type.
  4. Choose the dataset (asset) to monitor.
  5. The Create Data Anomaly Policy page opens for configuration.

Option 2: Through the Asset Details Page

Overview Tab

  1. Open the dataset in the Asset Details page.
  2. In the Overview tab, click Actions > Add Data Anomaly Policy.

Policies Tab

  1. Navigate to the Policies tab.
  2. Click Add Policy or use the Actions button and select Data Anomaly.

Policy Configuration Steps

Step 1: Policy Details

  1. Processing Engine: Choose either Spark or Pushdown as the processing engine.
  2. Click Show Columns to view the dataset schema.

Step 2: Select Data Columns

By default, all columns are included for profiling. To enable anomaly detection for specific columns:

  1. Check the box under the Data Anomaly column.
  2. Click Configure Anomaly to continue.

Step 3: Configure Anomaly Rules

Policy Scoring

  1. Policy Thresholds:

    • Success Threshold (0–100): Minimum score for passing.
    • Warning Threshold (0–100): Trigger a warning if the score drops below this value.
  2. Anomaly Strength Inclusion:

    • Select the minimum anomaly strength to include in scoring (High, Medium, Low).
    • Example: Only include anomalies with strength >= HIGH (see the sketch below).
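
To make the strength filter concrete, here is a minimal sketch assuming a simple ordering of strength labels; the `Anomaly` structure and field names are illustrative, not ADOC's API:

```python
# Minimal sketch (not ADOC's implementation): keep only anomalies at or
# above the configured minimum strength before they count against the score.
from dataclasses import dataclass

STRENGTH_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}

@dataclass
class Anomaly:
    column: str
    metric: str
    strength: str  # "LOW", "MEDIUM", or "HIGH"

def scorable_anomalies(anomalies, min_strength="HIGH"):
    """Filter anomalies by the configured minimum strength."""
    threshold = STRENGTH_ORDER[min_strength]
    return [a for a in anomalies if STRENGTH_ORDER[a.strength] >= threshold]

anomalies = [
    Anomaly("phone_number", "null_percentage", "HIGH"),
    Anomaly("region", "distinct_count", "LOW"),
]
print(scorable_anomalies(anomalies))  # only the HIGH-strength anomaly remains
```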

Alerts & Notifications

Define how ADOC notifies you when anomalies are detected.

  1. Severity Levels: Critical, High, Medium, Low.
  2. Notification Channels: Email, Slack, MS Teams, ServiceNow, Webhook, Chat.
  3. Notify on Success: Optionally enable success alerts.
  4. Re-notification Preferences:

    • Reduce Noise (Never): One-time alerts only.
    • Send Every [X] Failed Runs: Alerts after repeated violations.
    • Notify Every Time: Alerts triggered for every violation.

Anomaly Detection Settings

Fine-tune anomaly detection behavior:

  • Training Window Minimum: Minimum history required for training. Example: at least 3 executions; default: 7 days.
  • Model Sensitivity: Adjusts anomaly detection tolerance. Options: High (catch subtle anomalies), Medium (default), Low (only major deviations).

Advanced Performance Settings

Optimize profiling for large or wide datasets.

  1. Processing Batch Size:

    • Default: 30
    • Adjust only under support guidance.
  2. Optimize Data Load with Batches:

    • Loads data in smaller chunks instead of all at once (see the sketch after this list).
    • Useful for wide tables with frequent updates.
    • May introduce slight inconsistencies in results.
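
The general idea behind batched loading can be illustrated with a short sketch; the table name, database file, and chunk size below are hypothetical, and ADOC's internal implementation will differ:

```python
# Illustrative only: profile a table in chunks instead of loading it all at
# once, then combine the partial results. "orders" and the SQLite file are
# hypothetical stand-ins for a real warehouse table.
import pandas as pd
import sqlite3

conn = sqlite3.connect("warehouse.db")
null_counts, total_rows = {}, 0

# Each chunk is profiled independently; partial counts are merged afterwards.
for chunk in pd.read_sql_query("SELECT * FROM orders", conn, chunksize=30_000):
    total_rows += len(chunk)
    for col in chunk.columns:
        null_counts[col] = null_counts.get(col, 0) + chunk[col].isna().sum()

null_pct = {col: 100 * n / total_rows for col, n in null_counts.items()}
print(null_pct)
```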

Step 4: Summary

  1. Review policy details, thresholds, alerts, and performance settings.
  2. Click Save Policy to finalize.

Executing Data Anomaly Policies

Once created, anomaly policies are executed during asset profiling.

Method 1: From Manage Policies > Profiles (Beta)

  1. Go to Manage Policies > Profiles.
  2. Search for the asset with the policy applied.
  3. In the Profiles table, click Play.
  4. Profiling begins and the Data Anomaly Policy executes automatically.

Method 2: From Asset Details Page

  1. Open the Asset Details page.
  2. Click Actions.
  3. Select profiling type (Full or Selective).
  4. Profiling triggers the anomaly policy.

Real-world Use Case

Scenario: Your customer data table has been reliable for years, but recently, analytics reports show incomplete customer profiles. After investigation, you discover that the percentage of null values in the phone number column suddenly jumped from 5% to 35%. An upstream system change caused phone numbers to be lost during integration, but the issue went unnoticed for weeks because row counts stayed normal.

Solution: Create a Profile Anomaly policy that monitors profile metrics including null percentages, distinct counts, and row counts for the customer table. The ML-based anomaly detection learns normal patterns and immediately alerts when null percentages spike abnormally, catching the issue within hours instead of weeks.

Step-by-step Guide

Step 1: Understand how profile anomaly detection works

Profile Anomaly policies are event-triggered: they run automatically after ADOC profiles your asset. The policy (see the sketch after this list):

  1. Collects profile metrics during each profiling run
  2. Trains ML models to understand normal patterns
  3. Detects when current metrics deviate from expected patterns
  4. Generates alerts for significant anomalies
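
A minimal sketch of steps 2–4, assuming a simple mean/standard-deviation baseline; ADOC's actual models are more sophisticated, and all names here are illustrative:

```python
# Sketch of the detect step: compare the latest metric value against a
# baseline learned from previous profiling runs.
import statistics

def detect_anomaly(history, current, sensitivity=3.0, min_history=7):
    """Flag `current` if it deviates from the history baseline.

    history     -- metric values from previous profiling runs
    current     -- value from the latest run
    sensitivity -- allowed deviation in standard deviations
    min_history -- minimum runs before detection starts (training window)
    """
    if len(history) < min_history:
        return False  # still training; not enough history to judge
    mean = statistics.mean(history)
    std = statistics.stdev(history) or 1e-9  # avoid division by zero
    return abs(current - mean) / std > sensitivity

# Null % of a column stayed near 5% for 7 runs, then jumped to 35%:
history = [5.1, 4.8, 5.0, 5.3, 4.9, 5.2, 5.0]
print(detect_anomaly(history, 35.0))  # True
```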

Step 2: Enable profiling for your asset

  1. Navigate to your asset (table)

  2. Go to Profile Settings

  3. Enable Scheduled Profiling

    • Recommended frequency:
      • Critical tables: Daily or more frequent
      • Standard tables: Weekly
      • Archive tables: Monthly
  4. Select profiling type:

    • Full profile: Analyze entire table
    • Incremental profile: Analyze only new/changed data
    • Sampled profile: Analyze representative sample

Step 3: Select columns for anomaly detection

  1. Click on Profile Settings for the asset
  2. Enable Profile Anomaly Detection
  3. Select columns to monitor:

High-value columns to monitor:

  • Business-critical columns: Customer ID, Order ID, Transaction Amount
  • Frequently used in analytics: Status, Category, Region
  • Join keys: Foreign keys used to link tables
  • Nullable columns: Fields that should have low null rates

What gets monitored per column (see the sketch after this list):

  • Null value percentage
  • Distinct count
  • Minimum value
  • Maximum value
  • Mean (for numeric)
  • Standard deviation (for numeric)
  • Top value frequency (for categorical)
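
As a rough illustration, these per-column statistics can be computed with pandas; the DataFrame and column names are hypothetical:

```python
# Compute the profile metrics listed above for each column of a sample frame.
import pandas as pd

df = pd.DataFrame({
    "transaction_amount": [120.0, 80.5, None, 210.0, 95.0],
    "status": ["PAID", "PAID", "OPEN", "PAID", None],
})

for col in df.columns:
    s = df[col]
    print(f"--- {col} ---")
    print("null %:        ", 100 * s.isna().mean())
    print("distinct count:", s.nunique())
    if pd.api.types.is_numeric_dtype(s):
        print("min/max:       ", s.min(), s.max())
        print("mean/std:      ", s.mean(), s.std())
    else:
        # top value frequency (for categorical columns)
        print("top value freq:", s.value_counts(normalize=True).iloc[0])
```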

Step 4: Configure anomaly detection settings

1. Training window:

  • Minimum data points: How many profile runs before detecting anomalies
    • Recommended: 7-14 profile runs
    • Fast-changing data: 5-7 runs
    • Stable data: 14-30 runs

2. Sensitivity level:

  • High: Detects small deviations, more alerts

    • Use for: Critical data, regulatory compliance data
  • Medium: Balanced approach (recommended starting point)

    • Use for: Most business-critical tables
  • Low: Only major deviations, fewer alerts

    • Use for: Exploratory data, highly variable data

3. Metrics to monitor:

Select which metrics to monitor per column:

  • Null percentage
  • Distinct count
  • Min/Max values
  • Mean
  • Standard deviation
  • Row count (table level)

Step 5: Review automatic policy creation

When you enable Profile Anomaly Detection and select columns, ADOC automatically creates a Profile Anomaly Policy with:

  • Policy name: {asset_name}-profile-anomaly-policy
  • Monitored columns: All selected columns
  • Monitored metrics: All enabled profile metrics
  • Execution: Automatic after each profiling run

Step 6: Configure scoring thresholds

The Profile Anomaly policy score is calculated as:

Total Evaluations = Total columns selected × Profile metrics captured per column
Failed Evaluations = Number of profile metrics with an anomaly detected
Score = ((Total Evaluations - Failed Evaluations) / Total Evaluations) × 100

Example:

  • Monitoring 5 columns
  • Each column has 4 profile metrics (null %, distinct count, min, max)
  • Total Evaluations = 5 × 4 = 20
  • 2 metrics show anomalies (null % in column A, distinct count in column B)
  • Failed Evaluations = 2
  • Score = (20 - 2)/20 × 100 = 90%

Configure thresholds:

  • Success Threshold: Minimum score for policy to pass (e.g., 95%)
  • Warning Threshold: Score that triggers warning (e.g., 85%)
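
A short sketch of the scoring formula and threshold classification above, using the example thresholds of 95% and 85%:

```python
# Reproduce the worked example: 5 columns x 4 metrics, 2 anomalous metrics.
def policy_score(columns_selected, metrics_per_column, failed_evaluations):
    total = columns_selected * metrics_per_column
    return (total - failed_evaluations) / total * 100

def classify(score, success_threshold=95.0, warning_threshold=85.0):
    if score >= success_threshold:
        return "PASS"
    if score >= warning_threshold:
        return "WARNING"
    return "FAIL"

score = policy_score(columns_selected=5, metrics_per_column=4, failed_evaluations=2)
print(score, classify(score))  # 90.0 WARNING (below the 95% success threshold)
```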

Step 7: Set up alerts and notifications

  1. Alert severity:

    • Critical: Production databases feeding real-time systems
    • High: Customer data, financial data
    • Medium: Analytics tables
    • Low: Development/test data
  2. Notification channels:

    • Data engineering team (for data quality issues)
    • Data platform team (for system issues)
    • Business stakeholders (for critical table anomalies)
  3. Alert content should include:

    • Which column(s) showed anomalies
    • Which metric(s) were anomalous (null %, distinct count, etc.)
    • Current value vs expected range
    • Severity of deviation
    • Link to detailed profile comparison
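
A hypothetical payload showing how those fields might be structured in a webhook alert; this is an illustration of what to include, not ADOC's actual alert schema:

```python
# Hypothetical alert body covering the fields listed above. All keys, values,
# and the URL are illustrative assumptions.
alert = {
    "policy": "customers-profile-anomaly-policy",
    "severity": "HIGH",
    "anomalies": [
        {
            "column": "phone_number",
            "metric": "null_percentage",
            "current_value": 35.0,          # current value
            "expected_range": [3.5, 6.5],   # expected range from the baseline
            "deviation_std": 12.4,          # severity of deviation
        }
    ],
    "profile_comparison_url": "https://adoc.example.com/profiles/...",
}
```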

Best Practices

  1. Start with table-level metrics

    • Row count anomalies
    • Overall profile statistics
    • Expand to column-level gradually
  2. Profile at appropriate frequencies

    • Match profiling frequency to data update frequency
    • Daily updates → Daily profiling
    • Weekly loads → Weekly profiling
  3. Allow sufficient training period

    • Don't trust anomaly detection immediately
    • Let models learn for 2-4 weeks
    • Manually review initial anomalies
  4. Tune sensitivity based on false positives

    • Start with medium sensitivity
    • Increase if missing real issues
    • Decrease if getting too many false alarms
  5. Monitor complementary metrics together

    • Null percentage + Distinct count
    • Min + Max values
    • Row count + Data volume
  6. Use feedback mechanism

    • Mark false positives to improve model
    • Confirm true positives
    • Help system learn your data patterns

Common Pitfalls to Avoid

  • Monitoring too many columns initially (start with 5-10 most critical)
  • Setting high sensitivity before model is trained (wait for training period)
  • Not distinguishing seasonal patterns from anomalies
  • Ignoring minor anomalies (they can indicate early problems)
  • Not investigating anomaly root causes (defeats purpose)
  • Expecting perfect anomaly detection immediately (models need time to learn)

Understanding Anomaly Detection Results

When an anomaly is detected, examine:

1. Anomaly type:

  • Null percentage spike: Data collection issue, upstream failure
  • Distinct count drop: Duplicate data, data loss
  • Min/Max out of bounds: Data entry errors, schema changes
  • Mean shift: Population change, calculation errors

2. Anomaly severity (see the sketch below):

  • Minor (1-2 std dev): Monitor, may be natural variation
  • Moderate (2-3 std dev): Investigate cause
  • Severe (>3 std dev): Immediate action required

3. Anomaly pattern:

  • Sudden spike: System failure, data pipeline issue
  • Gradual drift: Process degradation, slow data quality decline
  • Cyclical: May be seasonal, check historical patterns
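
The severity bands from item 2 can be expressed as a small helper; a sketch assuming the deviation is already measured in standard deviations:

```python
# Map a deviation (in standard deviations) to the severity bands above.
def severity(deviation_std):
    d = abs(deviation_std)
    if d > 3:
        return "SEVERE: immediate action required"
    if d > 2:
        return "MODERATE: investigate cause"
    if d > 1:
        return "MINOR: monitor, may be natural variation"
    return "NORMAL: within expected variation"

print(severity(2.4))  # MODERATE: investigate cause
```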

Response Procedures

When an anomaly is detected:

  1. Confirm it's real

    • Check if metric calculation is correct
    • Verify profile ran successfully
    • Compare with raw data queries
  2. Investigate cause

    • Check data source for issues
    • Review recent pipeline changes
    • Look for upstream system changes
    • Check for schema modifications
  3. Assess impact

    • Identify downstream consumers
    • Check if reports are affected
    • Verify machine learning models
    • Contact business stakeholders
  4. Remediate

    • Fix data quality issues at source
    • Reprocess affected data if possible
    • Update documentation
    • Communicate to affected parties
  5. Prevent recurrence

    • Add Data Quality rules to prevent issue
    • Improve upstream validation
    • Update data pipeline monitoring
    • Document lessons learned

Troubleshooting Profile Anomaly Detection

Too many false positives:

  • Reduce sensitivity level (high → medium → low)
  • Increase training window (more data points for baseline)
  • Exclude highly variable columns from monitoring
  • Check if data has natural cyclical patterns
  • Provide feedback on false positives to improve model

Missing real anomalies:

  • Increase sensitivity level
  • Reduce training window if data patterns change frequently
  • Verify profiling is running at appropriate frequency
  • Check that affected columns are included in monitoring
  • Ensure metric calculations include full data range

Anomaly detected but cause unclear:

  • Run Data Quality policies to check for violations
  • Compare with Data Drift policy results
  • Check Schema Drift for structural changes
  • Profile with finer granularity (add segments)
  • Review data lineage for upstream changes

Anomalies on every profile run:

  • Data may be genuinely unstable (not anomalous)
  • Training data may have been during abnormal period
  • Consider resetting baseline or adjusting sensitivity
  • May need different monitoring approach (thresholds instead of anomaly detection)