
Installation

The following Python module must be installed in your environment:

Install acceldata-sdk

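Assuming pip is available in the target environment, the SDK can be installed with:

```shell
# Install the Acceldata SDK from PyPI into the active environment.
pip install acceldata-sdk
```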

The Acceldata Python SDK currently supports Python versions 3.9 through 3.11.

Prerequisites

API keys are required to authenticate and make requests to ADOC using the Python SDK.

Creating an API Key

You can generate API keys in the ADOC User Interface (UI) by visiting the API Keys section.

Before making requests with acceldata-sdk, make sure your API keys and the ADOC URL are readily available.

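One common way to keep these values at hand is to export them as environment variables. The variable names below are illustrative only, not names the SDK reads automatically:

```shell
# Illustrative environment variables for the ADOC URL and API key pair.
# Replace the values with your own; the names are an assumption, not
# ones mandated by the SDK.
export ADOC_URL="https://your-adoc-instance.example.com"
export ADOC_ACCESS_KEY="<access-key>"
export ADOC_SECRET_KEY="<secret-key>"
```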

Features Provided by Acceldata SDK

  • Pipeline: Represents a data pipeline; each execution is tracked as a pipeline run
  • Span: Logical collection of various tasks
  • Job: Logical representation of a task
  • Event: An event can contain arbitrary process or business data and is transmitted to the ADOC system for future tracking against a Pipeline execution.

Minimum Instrumentation Required

Step 1. Create Pipeline and Pipeline Run

Create a pipeline and start a new pipeline run before the data processing code begins.

You must provide a pipeline_uid, which ADOC uses to track the data pipeline execution.

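This step can be sketched as follows. The client and model names (TorchClient, CreatePipeline, PipelineMetadata) and the create_pipeline/create_pipeline_run calls follow the SDK's published examples but should be verified against your installed SDK version; the URL, keys, and metadata values are placeholders.

```python
def start_pipeline_run(adoc_url, access_key, secret_key, pipeline_uid):
    """Create (or update) a pipeline in ADOC and start a new run for it.

    Class and method names below follow the SDK's published examples and
    should be checked against the installed SDK version.
    """
    from acceldata_sdk.torch_client import TorchClient
    from acceldata_sdk.models.pipeline import CreatePipeline, PipelineMetadata

    # Authenticate against ADOC with the API key pair.
    torch_client = TorchClient(url=adoc_url, access_key=access_key,
                               secret_key=secret_key)

    # pipeline_uid is the identifier ADOC uses to track this pipeline.
    pipeline = CreatePipeline(
        uid=pipeline_uid,
        name=pipeline_uid.replace('_', ' ').title(),
        description='Pipeline tracked via acceldata-sdk',
        meta=PipelineMetadata('data-team', 'data-platform', None),
    )
    pipeline_response = torch_client.create_pipeline(pipeline=pipeline)

    # Start a new run of this pipeline before the processing code begins.
    pipeline_run = pipeline_response.create_pipeline_run()
    return pipeline_run
```

The returned pipeline run object is then used by the job- and span-creation calls described in the following sections.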

Tracking Each Task

You must add more instrumentation to the code to allow ADOC to provide a fine-grained view of the data pipeline, as described in the sections below.

Tracking Each Task Using Jobs

Before each function is executed, a job_uid, inputs, outputs, and metadata should be passed as arguments to make each task visible as a job in the ADOC pipeline. The task's input assets should be described in the inputs list, and the task's output assets in the outputs list. In addition to a job, a corresponding span can be created to populate the timeline and allow events to be associated with the task.

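A sketch of wrapping one task as a job, plus a span for the same task, is shown below. The model names (CreateJob, JobMetadata, Node) and the create_job/create_span calls follow the SDK's published examples and should be verified against your SDK version; the pipeline run object is assumed to come from the previous step.

```python
def track_task_as_job(pipeline_run, job_uid, input_asset_uids, output_asset_uids):
    """Register one task as a job (and matching span) on a pipeline run.

    Model names follow the SDK's published examples; verify against the
    installed SDK version.
    """
    from acceldata_sdk.models.job import CreateJob, JobMetadata, Node

    job = CreateJob(
        uid=job_uid,
        name=job_uid,
        version=pipeline_run.versionId,
        # Assets this task reads from and writes to.
        inputs=[Node(asset_uid=uid) for uid in input_asset_uids],
        outputs=[Node(asset_uid=uid) for uid in output_asset_uids],
        meta=JobMetadata('data-team', 'data-platform', None),
    )
    pipeline_run.create_job(job)

    # A span created alongside the job populates the timeline and lets
    # events be associated with this task.
    span_context = pipeline_run.create_span(uid=f'{job_uid}.span')
    return span_context
```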

Getting the UID of the Asset to be Used in the Input and Output List

To get the UID of an asset, first open the asset in the ADOC UI. A path to the asset is shown under the asset name, as shown in the image above. The first part, highlighted in green, is the data source name; the remaining segments, joined with a period as the separator, form the asset name. The data source name in this example is ATHENA-DS, and the asset name is AwsDataCatalog.sampledb.elb_logs.request_processing_time.

This asset can be used as an input with the following syntax: inputs=[Node(asset_uid='ATHENA-DS.AwsDataCatalog.sampledb.elb_logs.request_processing_time')],
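The UID shown in the UI can also be assembled programmatically by joining the data source name and the path segments with periods; this is plain string handling, independent of the SDK, and the helper name below is illustrative.

```python
def asset_uid(datasource, *path_segments):
    """Build an asset UID: data source name plus path, period-separated."""
    return '.'.join((datasource, *path_segments))

# The example above: data source ATHENA-DS, asset path
# AwsDataCatalog -> sampledb -> elb_logs -> request_processing_time.
uid = asset_uid('ATHENA-DS', 'AwsDataCatalog', 'sampledb',
                'elb_logs', 'request_processing_time')
# uid == 'ATHENA-DS.AwsDataCatalog.sampledb.elb_logs.request_processing_time'
```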

Subdividing a Task into Multiple Spans

You can represent a single task with multiple steps as multiple child spans using create_child_span, and send events for those child spans. To create a child span, first get the parent span context, which returns the root span context. Call create_child_span on the parent span context, and the result appears as a child span in the ADOC pipelines view.

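The pattern can be sketched as below. create_child_span, send_event, and GenericEvent follow the SDK's published examples and should be verified against your SDK version; the step UIDs and event payload are placeholders.

```python
def run_task_steps(parent_span_context, step_uids):
    """Create one child span per step under a task's span and send an event.

    Method and class names follow the SDK's published examples; verify
    against the installed SDK version.
    """
    from acceldata_sdk.events.generic_event import GenericEvent

    for step_uid in step_uids:
        # Each step of the task becomes a child span of the task's span.
        child = parent_span_context.create_child_span(uid=step_uid)
        # Arbitrary process or business data can be attached to the step.
        child.send_event(GenericEvent(
            context_data={'step': step_uid},
            event_uid=f'{step_uid}.event',
        ))
        # Close the child span once the step finishes.
        child.end()
```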

Linking a Task with Another Task

In previous examples, each pipeline job takes an asset as input and produces another asset as output, which the next job will use as input. Acceldata-sdk uses these to connect jobs in the ADOC pipeline UI. However, there may be times when a task does not produce another asset as an output. In such cases, you can provide a job_uid as output instead of an asset to link the next job.

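A sketch of this case is shown below: the outputs list carries a Node built from a job_uid instead of an asset_uid, which links this job to the next one in the ADOC pipeline view. Model names follow the SDK's published examples and should be verified against your SDK version.

```python
def track_job_linked_to_next_job(pipeline_run, job_uid, input_asset_uid,
                                 next_job_uid):
    """Register a job whose output is another job rather than an asset.

    Model names follow the SDK's published examples; verify against the
    installed SDK version.
    """
    from acceldata_sdk.models.job import CreateJob, JobMetadata, Node

    job = CreateJob(
        uid=job_uid,
        name=job_uid,
        version=pipeline_run.versionId,
        inputs=[Node(asset_uid=input_asset_uid)],
        # No output asset: point at the downstream job instead, so the
        # two jobs are connected in the pipeline view.
        outputs=[Node(job_uid=next_job_uid)],
        meta=JobMetadata('data-team', 'data-platform', None),
    )
    return pipeline_run.create_job(job)
```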