Debug a Failed Pipeline Run

This troubleshooting guide shows you how to investigate pipeline failures quickly and systematically. You'll learn to identify exactly what failed, why it failed, and where the problem originated.

Why This Matters

When a pipeline fails at 2 AM, you need answers fast:

  • What failed? (Which job/span?)
  • When did it fail? (Exact timestamp)
  • Why did it fail? (Error message and context)
  • Where in the code? (Stack traces and logs)

Without this workflow, you're guessing. With it, you have forensics.

Real-World Scenarios

Scenario 1: The Midnight Page

"Production pipeline failed. Data team needs it fixed in 1 hour for morning reports."

Pressure: High. Time: Limited. Need: Fast root cause.

Solution: Follow this workflow to find the exact error, the affected data, and the next steps in about 5 minutes.

Scenario 2: The Mystery Failure

"Pipeline ran fine for 3 months, then started failing every day this week."

Challenge: Something changed, but what?

Solution: Compare failed runs with successful runs. Look for patterns in error events, execution times, and data volumes.

Scenario 3: Data Quality Crisis

"Dashboard shows missing customer records. Pipeline says 'success' but data is wrong."

Problem: Silent failure - no error but wrong results.

Solution: Check span events for warnings, examine event logs for data quality metrics, and trace exactly which records were processed.

Scenario 4: The Cascading Failure

"One pipeline failed and now 5 downstream pipelines are broken."

Urgency: Fix root cause to unblock everything else.

Solution: Identify the first failure point, understand what data was missing, coordinate fixes.

Prerequisites

  • Pipeline ID with failed runs
  • API credentials
  • Basic understanding of your pipeline structure
  • (Optional) Access to your code repository for context

The Debug Workflow

Use these 5 APIs to investigate failures:

  1. GET /pipelines/:pipelineId/latestRun - Identify failure
  2. GET /pipelines/runs/:runId/spans - Find failed span
  3. GET /pipelines/spans/:spanId/events - Get error events
  4. GET /pipelines/spans/events/:eventId/log - Get error details
  5. GET /pipelines/runs/:runId/span-job-associations - Map to code

Overview

This workflow covers:

  • Identifying which run failed
  • Finding the failing span
  • Analyzing error events
  • Reviewing detailed error logs
  • Understanding span-job mappings for root cause

APIs Used: 5 endpoints

Step 1: Identify the Failed Run

Start by getting the latest run to see if it failed.

API Call

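A minimal sketch of the call using curl. The host and the bearer-token header are placeholders; substitute the base URL and credential headers your deployment actually uses.

  # Fetch the most recent run for the pipeline (host and auth header are placeholders)
  curl -s \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://$TORCH_HOST/torch-pipeline/api/pipelines/$PIPELINE_ID/latestRun"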

Response (Failed Run)

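An illustrative response, trimmed to the fields analyzed below; the exact field names and the full payload may differ in your deployment.

  {
    "status": "COMPLETED",
    "result": "FAILED",
    "errorEvents": 2
  }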

Key Indicators:

  • status: "COMPLETED" - Run finished
  • result: "FAILED" - Run failed
  • errorEvents: 2 - Two errors occurred

Alternative: List Recent Runs

If you need to see multiple failed runs:

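Same placeholder host and auth header as in Step 1:

  # List recent runs for the pipeline instead of only the latest one
  curl -s \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://$TORCH_HOST/torch-pipeline/api/pipelines/$PIPELINE_ID/runs"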

Filter the response for runs where result is "FAILED".

Step 2: Get All Spans for the Failed Run

Identify which span(s) failed.

API Call

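Same placeholder host and auth header as in Step 1; $RUN_ID is the id of the failed run identified in Step 1.

  # List every span recorded for the failed run (host and auth are placeholders)
  curl -s \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://$TORCH_HOST/torch-pipeline/api/pipelines/runs/$RUN_ID/spans"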

Response

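An illustrative response: the span ids, statuses, and error counts are the ones analyzed below, while the field names and the extract/load span names are approximations.

  [
    { "id": 5011, "uid": "span-extract",   "status": "COMPLETED", "errorEvents": 0 },
    { "id": 5012, "uid": "span-transform", "status": "FAILED",    "errorEvents": 2 },
    { "id": 5013, "uid": "span-load",      "status": "SKIPPED",   "errorEvents": 0 }
  ]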

Analysis:

  • Extract span (5011) - COMPLETED successfully
  • Transform span (5012) - FAILED with 2 errors
  • Load span (5013) - SKIPPED (didn't run due to previous failure)

The transform span is the culprit!

Step 3: Get Events for the Failed Span

Examine what happened during the failed span.

API Call

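Same placeholder host and auth header as in Step 1; $SPAN_ID is the id of the failed span (5012 in this example).

  # List the events recorded while the failed span was executing
  curl -s \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://$TORCH_HOST/torch-pipeline/api/pipelines/spans/$SPAN_ID/events"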

Response

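An illustrative response, trimmed to the error event discussed below; the field names are approximations and the event's id and date are omitted.

  [
    {
      "type": "FAILED",
      "alert": "ERROR",
      "time": "14:10:00",
      "message": "NullPointerException: column 'customer_age' has 150 null values; transformation aborted"
    }
  ]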

Root Cause Identified:

  • Error event at 14:10:00: NullPointerException
  • Problem: Column 'customer_age' has 150 null values
  • Result: Transformation aborted

Step 4: Get Detailed Error Log

Get full details for the specific error event.

API Call

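Same placeholder host and auth header as in Step 1; $EVENT_ID is the id of the error event found in Step 3.

  # Fetch the full log attached to the error event
  curl -s \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://$TORCH_HOST/torch-pipeline/api/pipelines/spans/events/$EVENT_ID/log"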

Response

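An illustrative response; the shape is approximate, and the sample records (customer IDs with null ages) that accompany the real log are omitted here.

  {
    "message": "NullPointerException: column 'customer_age' contains 150 null values (1.5% of data)",
    "stackTrace": "at AgeValidator.validate(...)",
    "affectedRows": 150
  }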

Complete Picture:

  • What: NullPointerException in customer_age column
  • Where: AgeValidator.validate() method
  • Impact: 150 rows (1.5% of data)
  • Sample Data: Includes customer IDs with null ages

Step 5: Map Span to Job

Identify which job in your pipeline corresponds to the failed span.

API Call

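Same placeholder host and auth header as in Step 1; $RUN_ID is the failed run from Step 1.

  # Map each span in the run to the pipeline job that produced it
  curl -s \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://$TORCH_HOST/torch-pipeline/api/pipelines/runs/$RUN_ID/span-job-associations"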

Response

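An illustrative response, trimmed to the failed span's entry; the field names are approximations.

  [
    { "spanId": 5012, "spanUid": "span-transform", "jobUid": "job-transform-customers" }
  ]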

Failed Job Identified:

  • Span 5012 (span-transform) → Job: job-transform-customers

Now you know exactly which job code to fix!

Step 6: Get Specific Span Details (Optional)

For additional context about the failed span.

API Call

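Same placeholder host and auth header as in Step 1; the failed span's id (5012 in this example) is supplied as the :identity path parameter.

  # Fetch a single span by its identity within the run
  curl -s \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://$TORCH_HOST/torch-pipeline/api/pipelines/runs/$RUN_ID/spans/$SPAN_ID"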

This uses the :identity parameter to get a specific span.

Response

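An illustrative response for the failed span; field names are approximate and timing fields are omitted.

  {
    "id": 5012,
    "uid": "span-transform",
    "status": "FAILED",
    "errorEvents": 2
  }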

Debugging Workflow Summary

Quick Debug (5 API calls)

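A sketch of the whole sequence as one script. The base URL, auth header, and the ids carried from one call to the next are placeholders; in practice you copy each id out of the previous response (or extract it with a JSON tool) before making the next call.

  #!/usr/bin/env bash
  # Placeholders - set these for your deployment
  BASE="https://$TORCH_HOST/torch-pipeline/api"
  AUTH="Authorization: Bearer $API_TOKEN"

  # 1. Did the latest run fail?
  curl -s -H "$AUTH" "$BASE/pipelines/$PIPELINE_ID/latestRun"

  # 2. Which span failed? ($RUN_ID comes from call 1)
  curl -s -H "$AUTH" "$BASE/pipelines/runs/$RUN_ID/spans"

  # 3. What error events did it record? ($SPAN_ID comes from call 2)
  curl -s -H "$AUTH" "$BASE/pipelines/spans/$SPAN_ID/events"

  # 4. Get the full error log ($EVENT_ID comes from call 3)
  curl -s -H "$AUTH" "$BASE/pipelines/spans/events/$EVENT_ID/log"

  # 5. Which job does the failed span map to?
  curl -s -H "$AUTH" "$BASE/pipelines/runs/$RUN_ID/span-job-associations"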

Common Failure Patterns

Pattern 1: Data Quality Issues

Symptoms:

  • FAILED events in transform/validation spans
  • Error messages about null values, data types, or constraints

Debug Steps:

  1. Check span events for specific error messages
  2. Review event logs for sample data
  3. Examine upstream extraction job

Pattern 2: Connection Failures

Symptoms:

  • FAILED events in extract or load spans
  • Timeout or connection error messages

Debug Steps:

  1. Check span events for connection errors
  2. Verify source/destination availability
  3. Review network/credential configuration

Pattern 3: Resource Exhaustion

Symptoms:

  • FAILED events after long execution times
  • Out-of-memory or timeout errors

Debug Steps:

  1. Compare span execution times across runs
  2. Check for unusual data volume
  3. Review resource allocation

Resolution Steps

Once you've identified the issue:

1. Fix the Code/Configuration

Based on the error identified:

  • Data Quality: Add null handling or validation
  • Connection: Fix credentials or endpoints
  • Resource: Optimize query or increase resources

2. Test the Fix

Use the "Create and Execute" workflow to test your changes.

3. Monitor the Next Run

Use the "Monitor" workflow to verify the fix worked.

Complete API Call Sequence

  1. GET /torch-pipeline/api/pipelines/:pipelineId/latestRun - Identify failure
  2. GET /torch-pipeline/api/pipelines/:pipelineId/runs - Alternative: list recent runs
  3. GET /torch-pipeline/api/pipelines/runs/:runId/spans - Find failed span
  4. GET /torch-pipeline/api/pipelines/spans/:spanId/events - Get error events
  5. GET /torch-pipeline/api/pipelines/spans/events/:spanEventId/log - Get error details
  6. GET /torch-pipeline/api/pipelines/runs/:runId/span-job-associations - Map to job
  7. GET /torch-pipeline/api/pipelines/runs/:runId/spans/:identity - Optional: specific span details

Troubleshooting

  • No failed runs found: Check that the pipeline ID is correct.
  • Can't find the failed span: Look for spans with errorEvents > 0 or status: "FAILED".
  • Events don't show the error: Check for a FAILED event type or alert: "ERROR".
  • No log details: Verify that the event ID is correct.
  • Can't map a span to a job: Check the span-job-associations response.