Set Up Pipeline Monitoring and Alerting

Reactive monitoring isn't enough - you need proactive alerting. This guide shows you how to set up intelligent monitoring that catches problems before they become incidents and alerts the right people at the right time.

Why This Matters

Without proactive monitoring:

  • You learn about failures from angry users
  • Issues compound before you notice
  • You can't spot degrading performance trends
  • Your team wastes time on manual checks

With it:

  • Alerts arrive before users notice
  • Trends reveal problems early
  • Your team focuses on fixes, not checking dashboards
  • SLAs are protected

Real-World Scenarios

Scenario 1: The Silent Slowdown

"Pipeline execution time crept from 20 to 40 minutes over 2 weeks. Nobody noticed until monthly reports were late."

Prevention: Alert when execution time exceeds baseline by 50%.

Impact: Caught performance issues early, optimized queries, stayed within SLA.

Scenario 2: The Weekend Outage

"Pipeline failed Friday night. Team discovered Monday morning. Weekend data missing."

Prevention: Immediate Slack/PagerDuty alert on any failure.

Impact: On-call engineer fixed it within 30 minutes. Zero data loss.

Scenario 3: The Data Quality Drift

"Customer ages gradually became invalid. 1000 records corrupted before anyone noticed."

Prevention: Alert on warning event threshold (>5 warnings = investigation needed).

Impact: Caught validation issues immediately, fixed upstream source.

Scenario 4: The Capacity Crunch

"Pipeline hitting resource limits. Started failing intermittently."

Prevention: Alert when consecutive failures > 2.

Impact: Identified capacity issue, scaled resources proactively.

Prerequisites

  • Existing pipeline with execution history
  • API credentials
  • Alert destination (Slack, PagerDuty, email, or a custom dashboard)
  • Baseline performance metrics (from historical runs)
  • On-call rotation schedule

Monitoring Strategy

Use these 4 APIs to build proactive monitoring:

  1. PUT /pipelines - Configure monitoring settings
  2. GET /pipelines/:pipelineId/runs - Analyze history
  3. GET /pipelines/:pipelineId/latestRun - Monitor current state
  4. Investigation APIs - When alerts trigger

Overview

This workflow covers:

  • Setting up pipeline monitoring
  • Configuring baseline metrics
  • Querying execution history
  • Building alerting logic using API data

APIs Used: 6 endpoints (see the complete call sequence below)


Step 1: Establish Baseline Metrics

Configure baseline metrics for your pipeline.

API Call

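A minimal sketch of this call in JavaScript (Node 18+ with global fetch). The endpoint path is taken from the complete call sequence later in this guide; the base URL, auth header, pipeline UID, and exact request-body shape are placeholder assumptions, so check your API reference for the authoritative schema. The configuration fields mirror the Key Configuration table below.

```javascript
// Sketch: configure monitoring settings on an existing pipeline.
// BASE_URL, the auth header, and the body shape are assumptions for illustration.
const BASE_URL = 'https://torch.example.com';
const HEADERS = {
  'Content-Type': 'application/json',
  Authorization: `Bearer ${process.env.TORCH_API_TOKEN}`, // use your deployment's auth scheme
};

async function configureMonitoring() {
  const body = {
    uid: 'customer-etl-pipeline',                        // illustrative pipeline UID
    pipelineBaselineMetric: { avgDurationMinutes: 28 },  // define the performance baseline
    notificationChannels: 'slack-data-alerts',           // alert destination
    meta: {
      sla: '30m',              // expected completion time
      alertThreshold: '150%',  // trigger alerts at 150% of baseline execution time
    },
  };

  const res = await fetch(`${BASE_URL}/torch-pipeline/api/pipelines`, {
    method: 'PUT',
    headers: HEADERS,
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Pipeline configuration failed: ${res.status}`);
  return res.json();
}
```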

Key Configuration

Field | Type | Purpose
pipelineBaselineMetric | Object | Define performance baseline
notificationChannels | String | Alert destination
meta.sla | String | Expected completion time
meta.alertThreshold | String | When to trigger alerts

Step 2: Collect Historical Performance Data

Gather execution history to establish normal behavior.

API Call

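A sketch of the history call, reusing the same placeholder BASE_URL and auth header as in Step 1; only the endpoint path comes from this guide's call sequence.

```javascript
// Sketch: fetch execution history for a pipeline.
const BASE_URL = 'https://torch.example.com';                                // assumption
const HEADERS = { Authorization: `Bearer ${process.env.TORCH_API_TOKEN}` };  // assumption

async function getRunHistory(pipelineId) {
  const res = await fetch(`${BASE_URL}/torch-pipeline/api/pipelines/${pipelineId}/runs`, {
    headers: HEADERS,
  });
  if (!res.ok) throw new Error(`Run history request failed: ${res.status}`);
  return res.json(); // assumed to contain the pipeline's recent runs
}
```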

Response Analysis

Use the status and timing of the returned runs to characterize normal behavior.

Calculate Baselines

From 30 runs, calculate:

  • Average execution time: 28 minutes
  • Success rate: 93% (28 success / 30 total)
  • Typical start time: 2:00 AM
  • SLA: 30 minutes (worst case)
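One way to derive those numbers from the run history, sketched under the assumption that each run record exposes a status plus start and end timestamps; the field names (status, startedAt, finishedAt) are illustrative, so map them to the actual response schema.

```javascript
// Sketch: derive baseline metrics from the ~30 most recent runs.
// Field names on each run object are assumptions; adjust to the real response shape.
function calculateBaselines(runs) {
  const finished = runs.filter(r => r.startedAt && r.finishedAt);
  const durationsMin = finished.map(
    r => (new Date(r.finishedAt) - new Date(r.startedAt)) / 60000
  );

  const avgDuration = durationsMin.reduce((a, b) => a + b, 0) / durationsMin.length;
  const successRate = runs.filter(r => r.status === 'COMPLETED').length / runs.length;

  return {
    avgDurationMinutes: Math.round(avgDuration),        // e.g. 28
    successRatePercent: Math.round(successRate * 100),  // e.g. 93
    slaMinutes: Math.ceil(Math.max(...durationsMin)),   // worst case, e.g. 30
  };
}
```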

Step 3: Monitor Current Execution

Build real-time monitoring by polling the latest run.

API Call

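A sketch of the polling call; the path comes from the call sequence below, while BASE_URL and HEADERS are the same placeholders used in Step 2.

```javascript
// Sketch: fetch the latest run of a pipeline.
const BASE_URL = 'https://torch.example.com';                                // assumption
const HEADERS = { Authorization: `Bearer ${process.env.TORCH_API_TOKEN}` };  // assumption

async function getLatestRun(pipelineId) {
  const res = await fetch(`${BASE_URL}/torch-pipeline/api/pipelines/${pipelineId}/latestRun`, {
    headers: HEADERS,
  });
  if (!res.ok) throw new Error(`latestRun request failed: ${res.status}`);
  return res.json();
}
```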

Monitoring Logic

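A sketch of the per-cycle check that compares the latest run against the Step 2 baselines. getLatestRun is the helper above; sendAlert is a hypothetical notification helper, and the run fields (status, startedAt, runId) are assumptions.

```javascript
// Sketch: evaluate the latest run against the baseline on each polling cycle.
async function checkPipeline(pipelineId, baseline, sendAlert) {
  const run = await getLatestRun(pipelineId);                         // helper defined above
  const elapsedMin = (Date.now() - new Date(run.startedAt)) / 60000;  // field name assumed

  if (run.status === 'FAILED') {
    sendAlert('CRITICAL', `Pipeline ${pipelineId} failed (run ${run.runId})`);
  } else if (run.status === 'RUNNING' && elapsedMin > baseline.avgDurationMinutes * 1.5) {
    // Scenario 1: execution time exceeds the baseline by 50%
    sendAlert('HIGH', `Pipeline ${pipelineId} at ${Math.round(elapsedMin)} min vs ${baseline.avgDurationMinutes} min baseline`);
  }
}

// Example: poll every 5 minutes (interval is arbitrary).
// setInterval(() => checkPipeline('customer-etl-pipeline', baseline, notifySlack), 5 * 60 * 1000);
```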

Step 4: Deep Dive on Anomalies

When alerts trigger, gather detailed information.

Get Span Details

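A sketch of the span lookup; only the path is taken from the call sequence, and BASE_URL and HEADERS are the same placeholders as in the earlier steps.

```javascript
// Sketch: fetch the spans of a run for anomaly investigation.
async function getSpans(runId) {
  const res = await fetch(`${BASE_URL}/torch-pipeline/api/pipelines/runs/${runId}/spans`, {
    headers: HEADERS, // BASE_URL and HEADERS: placeholders defined in Step 2
  });
  if (!res.ok) throw new Error(`Spans request failed: ${res.status}`);
  return res.json();
}
```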

Look for:

  • Spans with excessive duration
  • Spans with high error/warning counts
  • Skipped spans indicating failures

Get Error Events

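The same pattern works for the events of a suspicious span; again a sketch using the documented path and the placeholder connection settings.

```javascript
// Sketch: fetch the events recorded against a span.
async function getSpanEvents(spanId) {
  const res = await fetch(`${BASE_URL}/torch-pipeline/api/pipelines/spans/${spanId}/events`, {
    headers: HEADERS, // placeholders, as above
  });
  if (!res.ok) throw new Error(`Events request failed: ${res.status}`);
  return res.json();
}
```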

Look for:

  • FAILED event types
  • Alert levels (ERROR, WARNING)
  • Context data with error details

Get Detailed Logs

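And a sketch for the log lookup on a specific event; whether the response is JSON or plain text depends on your deployment.

```javascript
// Sketch: pull the detailed log attached to an event.
async function getEventLog(eventId) {
  const res = await fetch(`${BASE_URL}/torch-pipeline/api/pipelines/spans/events/${eventId}/log`, {
    headers: HEADERS, // placeholders, as above
  });
  if (!res.ok) throw new Error(`Log request failed: ${res.status}`);
  return res.json(); // or res.text(), depending on the log format returned
}
```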

Extract:

  • Error messages
  • Stack traces
  • Affected data samples

Step 5: Build Alerting Rules

Alert Rule 1: Pipeline Failure

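A sketch of the rule, assuming the latest-run object from Step 3 and the hypothetical sendAlert helper.

```javascript
// Rule 1 (sketch): any failed run fires an immediate CRITICAL alert.
function checkFailure(run, sendAlert) {
  if (run.status === 'FAILED') {
    sendAlert('CRITICAL', `Pipeline run ${run.runId} failed; notify the on-call engineer`);
  }
}
```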

Alert Rule 2: SLA Breach

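A sketch using the SLA from the baseline configuration (for example, 30 minutes); the startedAt field is an assumption.

```javascript
// Rule 2 (sketch): alert when a running pipeline exceeds its SLA.
function checkSlaBreach(run, slaMinutes, sendAlert) {
  const elapsedMin = (Date.now() - new Date(run.startedAt)) / 60000; // field name assumed
  if (run.status === 'RUNNING' && elapsedMin > slaMinutes) {
    sendAlert('HIGH', `Run ${run.runId} has exceeded its ${slaMinutes}-minute SLA`);
  }
}
```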

Alert Rule 3: Increasing Error Rate

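A sketch built on Scenario 4's threshold (more than two consecutive failures), applied to the run history from Step 2; it assumes the runs are ordered newest first.

```javascript
// Rule 3 (sketch): alert when recent runs show more than 2 consecutive failures.
function checkConsecutiveFailures(recentRuns, sendAlert) {
  let streak = 0;
  for (const run of recentRuns) {  // assumed ordered newest first
    if (run.status !== 'FAILED') break;
    streak += 1;
  }
  if (streak > 2) {
    sendAlert('CRITICAL', `${streak} consecutive failures; check capacity and upstream sources`);
  }
}
```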

Alert Rule 4: Data Quality Warnings

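A sketch of Scenario 3's threshold (more than 5 warnings triggers an investigation); the warningCount field is an assumption, so wire it to wherever your run or span data exposes warning totals.

```javascript
// Rule 4 (sketch): alert when a run accumulates more than 5 warning events.
function checkWarningThreshold(run, sendAlert) {
  const warnings = run.warningCount ?? 0; // field name is an assumption
  if (warnings > 5) {
    sendAlert('MEDIUM', `Run ${run.runId} has ${warnings} warnings; investigate data quality`);
  }
}
```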

Step 6: Dashboard Metrics

Build a monitoring dashboard using these metrics.

Overall Health

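One possible way to populate these tiles using only the documented latestRun endpoint: keep the list of pipeline IDs you watch in configuration (or pull it from a pipeline-list endpoint if your deployment exposes one). getLatestRun is the Step 3 helper.

```javascript
// Sketch: summarize health across a configured set of pipelines.
async function overallHealth(pipelineIds) {
  const latest = await Promise.all(pipelineIds.map(getLatestRun)); // helper from Step 3
  const failing = latest.filter(r => r.status === 'FAILED').length;
  return {
    totalPipelines: pipelineIds.length,
    currentlyFailing: failing,
    healthyPercent: Math.round(((pipelineIds.length - failing) / pipelineIds.length) * 100),
  };
}
```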

Display:

  • Total pipelines
  • Active vs disabled
  • Success rate across all pipelines

Per-Pipeline Status

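A sketch of one status card, combining the latest run with the Step 2 baseline; warningCount and errorCount are assumed field names.

```javascript
// Sketch: build a per-pipeline status card from the latest run and its baseline.
async function pipelineStatusCard(pipelineId, baseline) {
  const run = await getLatestRun(pipelineId);                         // helper from Step 3
  const elapsedMin = (Date.now() - new Date(run.startedAt)) / 60000;  // field name assumed
  return {
    pipelineId,
    status: run.status,                                   // RUNNING, COMPLETED, FAILED
    executionVsBaseline: `${Math.round(elapsedMin)} / ${baseline.avgDurationMinutes} min`,
    warnings: run.warningCount ?? 0,                      // assumed field
    errors: run.errorCount ?? 0,                          // assumed field
  };
}
```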

Display:

  • Current status (RUNNING, COMPLETED, FAILED)
  • Execution time vs baseline
  • Error/warning counts
  • Last successful run time

Historical Trends
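A sketch that turns the Step 2 run history into chartable series; run ordering and field names are assumptions.

```javascript
// Sketch: build success-rate and duration series for the last 50 runs.
function trendSeries(runs) {
  return runs.slice(0, 50).map(r => ({   // assumed ordered newest first
    startedAt: r.startedAt,              // field names assumed
    succeeded: r.status === 'COMPLETED',
    durationMin: r.finishedAt
      ? (new Date(r.finishedAt) - new Date(r.startedAt)) / 60000
      : null,
  }));
}
```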

Display:

  • Success rate chart (last 50 runs)
  • Execution time trend
  • Failure patterns by time of day

Execution Timeline

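A sketch that ranks a run's spans by duration to highlight bottlenecks; getSpans is the Step 4 helper, and the span fields are assumptions.

```javascript
// Sketch: order spans by duration to surface bottlenecks on the timeline.
async function spanTimeline(runId) {
  const spans = await getSpans(runId);   // helper from Step 4; assumed to return an array
  return spans
    .map(s => ({
      name: s.name,                      // field names assumed
      durationMin: (new Date(s.finishedAt) - new Date(s.startedAt)) / 60000,
    }))
    .sort((a, b) => b.durationMin - a.durationMin); // longest first
}
```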

Display:

  • Span execution timeline
  • Bottleneck identification
  • Duration breakdown by job

Monitoring Workflow Summary

Setup Phase (Once)

  1. PUT /pipelines - Configure monitoring settings and baseline metrics
  2. GET /pipelines/:pipelineId/runs - Pull execution history and calculate baselines

Runtime Monitoring (Continuous)

  1. GET /pipelines/:pipelineId/latestRun - Poll current status on a schedule
  2. Evaluate alert rules against the latest run and recent history
  3. Investigate anomalies via spans, events, and logs (Step 4)

Complete Monitoring System Example

Monitoring Script (Pseudo-code)

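A pseudo-code-level sketch that ties the previous pieces together: poll each watched pipeline, apply the Step 5 alert rules, and drill into spans when a run fails. All helpers (getRunHistory, calculateBaselines, getLatestRun, getSpans, checkFailure, checkSlaBreach, checkConsecutiveFailures, checkWarningThreshold) are the sketches from earlier steps, sendAlert is a hypothetical notifier, and the response shapes and polling interval are assumptions.

```javascript
// Sketch: a minimal monitoring loop built from the helpers defined in Steps 1-5.
const WATCHED_PIPELINES = ['customer-etl-pipeline'];  // illustrative
const POLL_INTERVAL_MS = 5 * 60 * 1000;               // every 5 minutes (example)

async function monitorOnce(sendAlert) {
  for (const pipelineId of WATCHED_PIPELINES) {
    try {
      const runs = await getRunHistory(pipelineId);     // Step 2
      const baseline = calculateBaselines(runs);        // Step 2
      const latest = await getLatestRun(pipelineId);    // Step 3

      // Apply the Step 5 alert rules.
      checkFailure(latest, sendAlert);
      checkSlaBreach(latest, baseline.slaMinutes, sendAlert);
      checkConsecutiveFailures(runs, sendAlert);
      checkWarningThreshold(latest, sendAlert);

      // Step 4: deep dive when the latest run failed.
      if (latest.status === 'FAILED') {
        const spans = await getSpans(latest.runId);     // assumed to return an array
        console.log(`Run ${latest.runId}: collected ${spans.length} spans for investigation`);
      }
    } catch (err) {
      sendAlert('HIGH', `Monitoring check for ${pipelineId} failed: ${err.message}`);
    }
  }
}

// setInterval(() => monitorOnce(notifySlack), POLL_INTERVAL_MS);
```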

Complete API Call Sequence

  1. PUT /torch-pipeline/api/pipelines - Configure monitoring
  2. GET /torch-pipeline/api/pipelines/:pipelineId/runs - Historical analysis
  3. GET /torch-pipeline/api/pipelines/:pipelineId/latestRun - Current status
  4. GET /torch-pipeline/api/pipelines/runs/:runId/spans - Anomaly investigation
  5. GET /torch-pipeline/api/pipelines/spans/:spanId/events - Error details
  6. GET /torch-pipeline/api/pipelines/spans/events/:eventId/log - Deep logs

Alerting Best Practices

Alert Fatigue Prevention

  • Use tiered severity (CRITICAL, HIGH, MEDIUM, LOW)
  • Aggregate similar alerts
  • Set cooldown periods between alerts

Actionable Alerts

Include in every alert:

  • Direct link to run/span
  • Error message summary
  • Recommended next steps
  • Runbook link

Alert Escalation

  1. First failure: INFO alert to team Slack
  2. Second consecutive failure: HIGH alert to on-call
  3. Third consecutive failure: CRITICAL page to manager

Troubleshooting

Issue | Solution
Too many alerts | Adjust thresholds, add cooldown periods
Missing alerts | Lower polling interval, check alert logic
False positives | Refine baseline calculations
Alert fatigue | Implement tiered severity system