Getting Started

Introduction to Acceldata SDK

Acceldata SDK provides APIs for fine-grained, end-to-end tracking and visibility of Python data pipelines.

Installation

The following Python module must be installed in the environment:

Install acceldata-sdk

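The package is available on PyPI; for example, with pip:

```bash
pip install acceldata-sdk
```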

Prerequisites

To make calls to ADOC, API keys are required.

Creating an API Key

You can generate API keys in the ADOC UI's Admin Central by visiting the API Keys section.

Before using acceldata-sdk calls, make sure you have your API keys and ADOC URL handy.

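One way to keep them handy is to export them as environment variables. The variable names below are placeholders of our choosing, not names the SDK reads automatically; the values are passed explicitly to the SDK client in the Python examples that follow:

```bash
# Placeholder variable names; the values are passed to the SDK client
# as plain constructor arguments in the examples below.
export ADOC_URL="https://your-adoc-instance.acceldata.app"
export ADOC_ACCESS_KEY="<your-access-key>"
export ADOC_SECRET_KEY="<your-secret-key>"
```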

Features Provided by Acceldata SDK

  • Pipeline: Represents a data pipeline; each execution is tracked as a pipeline run
  • Span: Logical collection of various tasks
  • Job: Logical representation of a task
  • Event: Can carry arbitrary process or business data and is sent to the ADOC system for tracking against a pipeline execution

Minimum Instrumentation Required

Step 1. Create Pipeline and Pipeline Run

Create a pipeline and start a new pipeline run before the data processing code begins.

You must provide a pipeline_uid, which ADOC uses to identify the pipeline and track its executions.

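The following is a minimal sketch of this step using acceldata-sdk's TorchClient and CreatePipeline models. The pipeline uid, metadata values, and the ADOC_* environment variables (defined as placeholders in the Prerequisites section) are illustrative:

```python
import os

from acceldata_sdk.torch_client import TorchClient
from acceldata_sdk.models.pipeline import CreatePipeline, PipelineMetadata

# Build the client from the credentials exported in the Prerequisites step.
torch_client = TorchClient(
    url=os.environ["ADOC_URL"],
    access_key=os.environ["ADOC_ACCESS_KEY"],
    secret_key=os.environ["ADOC_SECRET_KEY"],
)

# Create (or update) the pipeline in ADOC; the uid identifies it across runs.
pipeline = torch_client.create_pipeline(
    pipeline=CreatePipeline(
        uid="monthly_reporting_pipeline",  # illustrative pipeline_uid
        name="Monthly Reporting Pipeline",
        description="Builds the monthly reporting tables",
        meta=PipelineMetadata("data-eng", "reporting", "https://example.com/repo"),
    )
)

# Start a new run of the pipeline before the data processing code begins.
pipeline_run = pipeline.create_pipeline_run()
```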

Tracking Each Task

You must add more instrumentation to the code to allow ADOC to provide a fine-grained view of the data pipeline, as described in the sections below.

Tracking Each Task Using Jobs

To make each task visible as a job in the ADOC pipeline, pass a job_uid, inputs, outputs, and metadata as arguments before each function is executed. The inputs list describes the assets the task reads, and the outputs list describes the assets it produces. In addition to the job, a corresponding span can be created to populate the timeline and to associate events with the task.

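A sketch of one instrumented task, reusing the pipeline and pipeline_run objects from Step 1; the job uid, asset UIDs, and metadata below are illustrative, and CreateJob, JobMetadata, and Node follow the acceldata-sdk job models:

```python
from acceldata_sdk.models.job import CreateJob, JobMetadata, Node

# Register the task as a job before it runs. inputs lists the assets the
# task reads; outputs lists the assets it produces.
job = CreateJob(
    uid="monthly_sales_aggregate",  # illustrative job_uid
    name="Monthly Sales Aggregate",
    version=pipeline_run.versionId,
    description="Aggregates monthly sales for the reporting table",
    inputs=[Node(asset_uid="ATHENA-DS.AwsDataCatalog.sampledb.elb_logs")],
    outputs=[Node(asset_uid="ATHENA-DS.AwsDataCatalog.sampledb.monthly_sales")],
    meta=JobMetadata("data-eng", "reporting", "https://example.com/repo"),
)
pipeline.create_job(job)

# A span with a matching uid puts the task on the run's timeline and lets
# events be attached to it.
span = pipeline_run.create_span(uid="monthly_sales_aggregate.span")

# ... the task's actual processing code runs here ...

span.end()
```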

Getting the UID of the Asset to be Used in the Input and Output List

To get the UID of an asset, open the asset in the ADOC UI. A path to the asset is shown under the asset name: the first part of the path is the data source name, and the remaining parts, joined with periods, form the asset name. In this example, the data source name is ATHENA-DS and the asset name is AwsDataCatalog.sampledb.elb_logs.request_processing_time.

This asset can then be used as an input with the following syntax: inputs=[Node(asset_uid='ATHENA-DS.AwsDataCatalog.sampledb.elb_logs.request_processing_time')]

Subdividing a Task into Multiple Spans

You can represent a single task that has multiple steps as multiple child spans using create_child_span, and send events for those child spans. To create a child span, first get the parent span context, which refers to the root span, and call create_child_span on it; the new span then appears as a child span in the ADOC pipelines view.

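A sketch, assuming the pipeline_run from Step 1 exposes the root span context via a get_root_span accessor (as in the acceldata-sdk span API) and using illustrative span and event UIDs:

```python
from datetime import datetime

from acceldata_sdk.events.generic_event import GenericEvent

# Fetch the parent (root) span context of the run and hang one child span
# off it per step of the task.
parent_span = pipeline_run.get_root_span()

extract_span = parent_span.create_child_span(uid="monthly_sales.extract")  # illustrative uid
# ... the extract step runs here ...
extract_span.send_event(
    GenericEvent(
        context_data={"client_time": str(datetime.now()), "row_count": 100},
        event_uid="monthly_sales.extract.result",
    )
)
extract_span.end()

transform_span = parent_span.create_child_span(uid="monthly_sales.transform")
# ... the transform step runs here ...
transform_span.end()
```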

Linking a Task with Another Task

In the previous examples, each pipeline job takes an asset as input and produces another asset as output, which the next job uses as its input. Acceldata SDK uses these input and output assets to connect jobs in the ADOC pipeline UI. However, a task may not always produce an asset as output. In such cases, you can provide the next job's uid as an output instead of an asset to link the two jobs.

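A sketch with illustrative UIDs: because this job produces no asset, its outputs entry names the next job's uid, and ADOC draws the edge between the two jobs from that reference:

```python
from acceldata_sdk.models.job import CreateJob, JobMetadata, Node

# This task writes no asset, so link it to the job that follows it by uid.
notify_job = CreateJob(
    uid="send_report_notification",  # illustrative job_uid
    name="Send Report Notification",
    version=pipeline_run.versionId,
    inputs=[Node(asset_uid="ATHENA-DS.AwsDataCatalog.sampledb.monthly_sales")],
    outputs=[Node(job_uid="archive_report")],  # next job's uid instead of an asset
    meta=JobMetadata("data-eng", "reporting", "https://example.com/repo"),
)
pipeline.create_job(notify_job)
```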

Conclusion

In this how-to guide, we explored using the Acceldata SDK APIs to provide visibility into Python data pipelines in the ADOC Pipelines UI.
