Java SDK

ADOC offers a comprehensive data quality monitoring solution for your data lake and warehouse, enabling you to verify that your business decisions are based on high-quality data. ADOC ensures that critical data sources are not neglected by providing tools to assess data quality directly in the data catalog. It caters to all users, including analysts, data scientists, and developers, who can depend on it to monitor data flow in the warehouse or data lake and prevent data loss.

The ADOC APIs are invoked through the adoc-java-sdk.

Features

The following features are provided by adoc-java-sdk:

  1. Asset APIs
  2. Data Sources APIs
  3. Pipeline APIs
  4. Policy APIs

Prerequisites

API keys are essential for authentication when making calls to ADOC.

Creating an API Key

You can create API keys from the API Keys section of Admin Central in the ADOC UI.

Before making calls to the Java SDK, ensure you have the necessary API keys and the ADOC Server URL readily available. These credentials are essential for establishing the connection and accessing ADOC services.

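As a minimal sketch, you might export the credentials as environment variables and pass them to a client object. The environment variable names and the AdocClient constructor below are illustrative; consult the SDK for the actual entry point.

```java
// Read credentials from the environment (variable names are illustrative).
String serverUrl = System.getenv("ADOC_SERVER_URL");
String accessKey = System.getenv("ADOC_ACCESS_KEY");
String secretKey = System.getenv("ADOC_SECRET_KEY");

// Hypothetical constructor; the SDK's actual factory method may differ.
AdocClient adocClient = new AdocClient(serverUrl, accessKey, secretKey);
```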

Installation

Maven

Add the following repository URLs to the <repositories> section of your project's pom.xml file.

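For example (the URL below is a placeholder; use the repository URL supplied by Acceldata):

```xml
<repositories>
  <repository>
    <id>acceldata</id>
    <!-- Placeholder: substitute the repository URL from the Acceldata documentation. -->
    <url>https://REPOSITORY_URL_FROM_ACCELDATA</url>
  </repository>
</repositories>
```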

Add the following dependency to the <dependencies> section of your project's pom.xml file.

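For example (the groupId and version are placeholders, since only the artifactId is named in this guide; use the coordinates supplied by Acceldata):

```xml
<dependency>
  <!-- Placeholder groupId and version; substitute the published coordinates. -->
  <groupId>io.acceldata</groupId>
  <artifactId>adoc-java-sdk</artifactId>
  <version>x.y.z</version>
</dependency>
```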

Gradle

Add the following repository URLs to the repositories section of your project's build.gradle file.

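For example (placeholder URL, as above):

```groovy
repositories {
    maven {
        // Placeholder: substitute the repository URL from the Acceldata documentation.
        url 'https://REPOSITORY_URL_FROM_ACCELDATA'
    }
}
```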

Add the following dependency to the dependencies section of your project's build.gradle file.
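
For example (placeholder coordinates, as above):

```groovy
dependencies {
    // Placeholder groupId and version; only the artifactId is named in this guide.
    implementation 'io.acceldata:adoc-java-sdk:x.y.z'
}
```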

Asset APIs

The Acceldata Java SDK provides a variety of methods to fetch detailed asset information, ensuring the efficiency of your data management processes.

Obtaining Asset Details Using Asset ID or Asset UID

With the SDK, you can effortlessly obtain specific asset details by providing an asset identifier, be it the Asset ID or Asset UID.

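A hedged sketch; getAsset is an assumed accessor name, shown overloaded for both identifier types:

```java
// Fetch by numeric asset ID (assumed method name).
Asset asset = adocClient.getAsset(12345L);

// Fetch by asset UID (assumed method name).
Asset assetByUid = adocClient.getAsset("datasource.schema.table");
```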

Discovering all Asset Types within ADOC

One of the key functionalities of the ADOC (Acceldata Operations Console) is the ability to discover all asset types within your data environment.

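A sketch of what the call might look like (getAssetTypes is an assumed method name):

```java
// List every asset type known to ADOC.
List<AssetType> assetTypes = adocClient.getAssetTypes();
assetTypes.forEach(type -> System.out.println(type.getName()));
```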

Exploring Asset Attributes: Tags, Labels, Metadata, and Sample Data

You can retrieve asset metadata, tags, labels, and sample data for a comprehensive understanding of your data assets.

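Continuing with the asset obtained above, a sketch assuming the Asset object exposes simple accessors for each attribute (names assumed):

```java
List<Tag> tags = asset.getTags();
List<Label> labels = asset.getLabels();
Map<String, Object> metadata = asset.getMetadata();
List<Map<String, Object>> sampleData = asset.getSampleData();
```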

You can also add tags, labels, and custom metadata using the SDK, facilitating asset filtering and organization.

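A combined sketch of the three operations; the mutator names are assumptions:

```java
// Assumed mutator names on the Asset object.
asset.addTag("pii");
asset.addLabel("team", "finance");
asset.addCustomMetadata("owner", "data-platform");
```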

Initiating Asset Profiling and Sampling

Spark jobs executed on Livy enable the profiling and sampling of crawled assets.

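A sketch, assuming the Asset object exposes trigger methods for both jobs (names assumed):

```java
asset.startProfile(); // trigger the Spark-on-Livy profiling job
asset.startSample();  // trigger the sampling job
```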

In summary, the Acceldata Java SDK equips you with a diverse range of capabilities to streamline and enhance your data asset management. Whether you're retrieving asset details via IDs or UIDs, exploring different asset types, examining attributes such as tags, labels, and metadata, or initiating profiling and sampling tasks, the SDK simplifies your data management journey.

With the Acceldata Java SDK, you can seamlessly fetch specific asset information; add tags, labels, and custom metadata; and delve into asset profiling and sampling. This versatile toolkit helps you unlock the full potential of your data assets, ensuring that your data management tasks are optimized and your data-driven operations are enhanced.

Data Sources APIs

This section outlines the process of managing and crawling data sources within Acceldata. ADOC supports crawling across more than fifteen data sources. The following data source APIs integrate seamlessly with the SDK. This section explains how to acquire data sources by name or ID, retrieve all data sources, initiate crawlers for designated data sources or assembly names, check crawling status, and delete a data source using the Java SDK. The accompanying code snippets illustrate these functionalities through both DataSource objects and the ADOC client.

Obtaining a DataSource by its ID or Name

To get data sources by their name or ID, use the following code examples:

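A hedged sketch with an assumed getDataSource accessor, overloaded for both identifier types:

```java
DataSource dataSource = adocClient.getDataSource(42L);           // by ID
DataSource byName = adocClient.getDataSource("snowflake-prod");  // by name
```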

Retrieve All Data Sources

The following section explains how to retrieve all available data sources. The code snippet below obtains a list of all data sources without type filtering.

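For example (getDataSources is an assumed method name):

```java
// Retrieve every data source, with no type filter applied.
List<DataSource> dataSources = adocClient.getDataSources();
```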

Initiate Crawler for a Specified DataSource or Assembly Name

To initiate a crawler for a designated data source or assembly name, use the provided code snippet, which demonstrates the process using both the data source object and the ADOC client.

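A sketch of both variants; the method names are assumptions:

```java
// Via the DataSource object.
dataSource.startCrawler();

// Via the ADOC client, using the assembly (data source) name.
adocClient.startCrawler("snowflake-prod");
```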

Retrieve Crawler Status for a Specified DataSource or Assembly Name

To check the status of the crawling process for a specific data source or assembly name, use the following code snippet, which covers both the data source object and the ADOC client:

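A sketch of both variants, with assumed method and type names:

```java
CrawlerStatus viaDataSource = dataSource.getCrawlerStatus();
CrawlerStatus viaClient = adocClient.getCrawlerStatus("snowflake-prod");
```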

Deleting a DataSource

Use the following code snippet to delete a particular data source within the Acceldata environment:

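For example (deleteDataSource is an assumed method name):

```java
// Permanently remove the data source identified by its ID.
adocClient.deleteDataSource(dataSource.getId());
```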

By following this guide, you can efficiently manage data sources and optimize your data operations in the Acceldata platform.

Pipeline APIs

The Acceldata SDK provides robust support for a diverse array of pipeline APIs, covering everything from pipeline creation to job and span addition, as well as the initiation of pipeline runs and more. With capabilities extending throughout the entire span lifecycle, the SDK exercises complete control over the entire pipeline infrastructure, generating various events as needed.

Each pipeline encapsulates the ETL (Extract, Transform, Load) process, constituting Asset nodes and associated Jobs. The collective definition of the pipeline shapes the lineage graph, providing a comprehensive view of data assets across the system.

Within this framework, a Job Node, or Process Node, acts as a pivotal component, executing specific tasks within the ETL workflow. These jobs receive inputs from various assets or other jobs and produce outputs that seamlessly integrate into additional assets or jobs. ADOC harnesses this network of job definitions within the workflow to construct lineage and meticulously monitors version changes for the pipeline, ensuring a dynamic and efficient data processing environment.

Retrieve All Pipelines

All pipelines can be accessed through the Acceldata SDK.

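For example (getPipelines is an assumed method name):

```java
// Fetch every pipeline registered in ADOC.
List<Pipeline> pipelines = adocClient.getPipelines();
```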

Creating a Pipeline

This section pertains to the process of creating a pipeline using the Acceldata SDK. Creating a pipeline in the Acceldata SDK is a fundamental task in managing data workflows. To accomplish this, you need to construct a CreatePipeline object that encapsulates the essential information required for pipeline creation. Two mandatory parameters that must be defined when creating a pipeline are the Uid (Unique Identifier) and the Name.

By specifying the Uid and Name, you provide unique identification and a recognizable label to the new pipeline. Once the CreatePipeline object is properly configured, it can be used as input to initiate the pipeline creation process, allowing you to efficiently set up and manage your data workflows within Acceldata.

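A sketch, assuming CreatePipeline takes the two mandatory parameters in its constructor and createPipeline is the client method:

```java
CreatePipeline createPipeline = new CreatePipeline(
        "etl.orders.daily",   // uid: unique identifier (mandatory)
        "Daily Orders ETL");  // name: recognizable label (mandatory)
Pipeline pipeline = adocClient.createPipeline(createPipeline);
```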

Retrieve a Pipeline

To access a specific pipeline, you can provide either the pipeline UID (Unique Identifier) or the pipeline ID as a parameter for retrieval. This flexibility allows you to easily obtain the details of a desired pipeline.

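A sketch with an assumed getPipeline accessor, overloaded for UID and ID:

```java
Pipeline byUid = adocClient.getPipeline("etl.orders.daily"); // by pipeline UID
Pipeline byId = adocClient.getPipeline(1001L);               // by pipeline ID
```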

Deleting a Pipeline

You can initiate the deletion process by using the code snippet below:

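For example (deletePipeline is an assumed method name):

```java
adocClient.deletePipeline(pipeline.getId());
```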

Retrieve Pipeline Runs for a Pipeline

You can access all the pipeline runs associated with a specific pipeline. There are two ways to accomplish this:

  1. Using the pipeline object: pipeline.getRuns()
  2. Utilizing the AdocClient: adocClient.getPipelineRuns(pipeline.getId())

Both methods allow you to retrieve information about pipeline runs.

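Both approaches listed above, in one minimal sketch:

```java
// 1. Via the Pipeline object.
List<PipelineRun> runs = pipeline.getRuns();

// 2. Via the AdocClient.
List<PipelineRun> sameRuns = adocClient.getPipelineRuns(pipeline.getId());
```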

Initiating a Pipeline Run

A pipeline run signifies the execution of a pipeline. It's important to note that a single pipeline can be executed multiple times, with each execution, referred to as a "run," resulting in a new snapshot version. This process captures the evolving state of the data within the pipeline and allows for effective monitoring and tracking of changes over time.

To initiate a pipeline run, you can use the Pipeline object. When creating a run, there are two optional parameters that you can include:

  • optionalContextData: Additional context data that can be passed.
  • optionalContinuationId: A continuation identifier, which is also an optional parameter. It is an identifier used to link multiple ETL pipelines together as a single pipeline in the ADOC UI.

Here's an example of how to create a pipeline run using the Acceldata SDK:

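A sketch with an assumed createPipelineRun method; both arguments are the optional parameters described above:

```java
Map<String, String> contextData = Map.of("triggeredBy", "scheduler");

PipelineRun pipelineRun = pipeline.createPipelineRun(
        contextData,          // optionalContextData
        "orders-2024-01-01"); // optionalContinuationId
```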

Getting the Latest Run of the Pipeline

You can obtain the latest run of a pipeline using either the Pipeline object or the AdocClient from the adoc-java-sdk. Here are two ways to achieve this:

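A sketch of the two variants, with assumed method names:

```java
// 1. Via the Pipeline object.
PipelineRun latest = pipeline.getLatestPipelineRun();

// 2. Via the AdocClient.
PipelineRun sameLatest = adocClient.getLatestPipelineRun(pipeline.getId());
```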

Getting a Particular Run of the Pipeline

You can retrieve a particular run of a pipeline by passing the unique pipeline run ID as a parameter.

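For example, using the getRun accessor also referenced in the Spans section:

```java
long pipelineRunId = 555L; // illustrative run ID
PipelineRun pipelineRun = pipeline.getRun(pipelineRunId);
```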

Updating a Pipeline Run

The updatePipelineRun method is responsible for updating the attributes of a pipeline run, such as its result, status, and optional context data. It accepts the following parameters:

  • result: A PipelineRunResult indicating the updated result of the pipeline run.
  • status: A PipelineRunStatus indicating the updated status of the pipeline run.
  • optionalContextData: An optional map that can be used to pass additional context data for the update.
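
A sketch; the enum constant names shown are assumptions:

```java
pipelineRun.updatePipelineRun(
        PolicyRunResultSketch.SUCCESS != null
                ? PipelineRunResult.SUCCESS : PipelineRunResult.SUCCESS, // result (constant name assumed)
        PipelineRunStatus.COMPLETED,        // status (constant name assumed)
        Map.of("rowsProcessed", "10000"));  // optionalContextData
```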

Create Job Using Pipeline

When creating a job using a Pipeline object, there are mandatory parameters that must be provided, including uid, name, and pipelineRunId. The adoc-java-sdk library offers the CreateJob class, which serves as the parameter for job creation. The required parameters are as follows:

  • uid: A unique identifier for the job.
  • name: The name of the job.
  • pipelineRunId: The ID of the associated pipeline run.

Several other parameters are available to define and configure the job. Below is a description of these parameters:

  • uid: Unique ID of the job. It is a mandatory parameter.
  • name: Name of the job. It is a mandatory parameter.
  • pipelineRunId: The pipeline run's ID is required when you want to add a job. This is mandatory if the job is created using a Pipeline object, but optional if the job is created using a pipeline run.
  • description: Description of the job. It is an optional parameter.
  • inputs: Job input can take various forms. It might be the UID of an asset, specified using the asset_uid and source parameters of the Node object, or the UID of another job, specified using the jobUid parameter of the Node object.
  • outputs: Job output likewise comes in different forms. It may be the UID of an asset, specified through the asset_uid and source parameters of the Node object, or the UID of another job, specified through the jobUid parameter of the Node object.
  • meta: The metadata of the job. It is an optional parameter.
  • context: The context of the job. It is an optional parameter.
  • boundedBySpan: A boolean value. If the job needs to be bound by a span, this must be set to true. The default value is false. It is an optional parameter.
  • spanUid: The new span's UID, as a string. If boundedBySpan is set to true, this parameter is mandatory.
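
A sketch, assuming CreateJob takes the mandatory fields in its constructor with setters for the rest; the Node constructor shape is likewise assumed:

```java
CreateJob createJob = new CreateJob(
        "job.transform.orders", // uid (mandatory)
        "Transform Orders",     // name (mandatory)
        pipelineRun.getId());   // pipelineRunId (mandatory via a Pipeline object)

// Inputs/outputs reference asset UIDs (or other job UIDs) via Node objects.
createJob.setInputs(List.of(new Node("snowflake-prod.db.orders_raw", "snowflake-prod")));
createJob.setOutputs(List.of(new Node("snowflake-prod.db.orders_clean", "snowflake-prod")));

Job job = pipeline.createJob(createJob);
```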

Spans

Hierarchical spans are a way to organize and track tasks within pipeline runs for observability. Each pipeline run has its own set of spans. Spans are like containers for measuring how tasks are performed. They can be very detailed, allowing us to keep a close eye on specific jobs and processes in the pipeline.

Spans can also be connected to jobs, which help us monitor when jobs start, finish, or encounter issues. Spans keep track of when a task begins and ends, giving us insight into how long each task takes.

The Acceldata SDK makes it easy to create new pipeline runs and add spans. It also helps us generate custom events and metrics for observing the performance of pipeline runs. In simple terms, hierarchical spans and the Acceldata SDK give us the tools to closely monitor and analyze the execution of tasks in our pipelines.

Creating Span for a Pipeline Run

The createSpan function, used within the context of a pipeline run, offers the following parameters:

  • uid: The unique identifier (UID) for the newly created span. It is a mandatory field, and each span should have a distinct UID.
  • associatedJobUids: A list of job UIDs to associate with the span. When creating a span directly for a pipeline run, an empty list can be provided, since job associations are typically made when spans are bound to jobs.
  • optionalContextData: A map of custom key-value pairs containing context information relevant to the span. It can be used to add additional details or metadata associated with the span.
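
A sketch following the parameter order above; createSpan on the pipeline run is as named in this guide:

```java
SpanContext spanContext = pipelineRun.createSpan(
        "span.orders.load",       // uid (mandatory, distinct per span)
        Collections.emptyList(),  // associatedJobUids (empty for a run-level span)
        Map.of("stage", "load")); // optionalContextData
```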

Obtaining the Root Span for a Pipeline Run

In this code snippet, the getRootSpan method is used to retrieve the root span associated with a specific pipeline run. The root span represents the top-level span within the execution hierarchy of the pipeline run. It provides essential context and observability data for the entire run. The resulting SpanContext object, stored in the rootSpanContext variable, represents the root span and can be used for further analysis and observability purposes. This code simplifies the process of accessing critical information about the pipeline run's top-level span.

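For example:

```java
// Retrieve the top-level span of the run's execution hierarchy.
SpanContext rootSpanContext = pipelineRun.getRootSpan();
```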

Checking if the Current Span is a Root Span

A root span is the top-level span that encompasses the entire task or process hierarchy.

The isRoot() method is used to make this determination. If the span is a root span, the isRootSpan variable is set to true, indicating its top-level status within the span hierarchy. This check provides insights into the context of the current span, which can be valuable for decision-making or monitoring in complex workflows.

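For example:

```java
boolean isRootSpan = spanContext.isRoot(); // true only for the top-level span
```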

Creating Child Span

The createChildSpan function within SpanContext is used to create a hierarchical structure of spans by placing a span under another span. It offers the following parameters:

  • uid: The unique identifier (UID) for the newly created child span. It is a mandatory field, and each child span should have a distinct UID.
  • optionalContextData: A map of custom key-value pairs containing context information relevant to the child span. It can be used to add additional details or metadata associated with the span.
  • associatedJobUids: A list of job UIDs to associate with the child span. This association allows you to link the child span with specific jobs.
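
A sketch following the parameter order above:

```java
SpanContext childSpan = spanContext.createChildSpan(
        "span.orders.load.validate",      // uid (mandatory)
        Map.of("step", "validation"),     // optionalContextData
        List.of("job.transform.orders")); // associatedJobUids
```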

Retrieving Spans for a Given Pipeline Run

This code snippet demonstrates how to retrieve all the spans associated with a specific pipeline run. It begins by obtaining a reference to a PipelineRun object using the getRun method on a Pipeline object and providing the pipeline run's ID as a parameter. Subsequently, it calls the getSpans method on the pipelineRun object, which returns a list of Span objects related to the specified pipeline run. This functionality is valuable for accessing and analyzing span data to monitor the execution of the pipeline run.

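The two calls described above, as a sketch:

```java
long pipelineRunId = 555L; // illustrative run ID
PipelineRun pipelineRun = pipeline.getRun(pipelineRunId);
List<Span> spans = pipelineRun.getSpans();
```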

Sending Span Events

The SDK has the capability to send both custom and standard span events at different stages of a span's lifecycle. These events play a crucial role in gathering metrics related to pipeline run observability.

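A sketch of sending a standard start event followed by a custom event; the helper names and the GenericEvent shape are assumptions:

```java
spanContext.start(); // standard lifecycle event
spanContext.sendEvent(new GenericEvent("checkpoint", Map.of("rowsSoFar", "5000")));
```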

This code snippet demonstrates the usage of the abort and failed methods within a SpanContext to send events for failed or aborted spans. It begins by preparing context data with information such as client time and row count. Including context data in these events provides essential information for tracking and handling span failures and aborts, aiding in observability and diagnostic purposes.

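A sketch mirroring that description:

```java
// Prepare context data such as client time and row count.
Map<String, String> contextData = Map.of(
        "client_time", java.time.Instant.now().toString(),
        "row_count", "1200");

spanContext.failed(contextData);      // mark the span as failed
// or: spanContext.abort(contextData); // mark the span as aborted
```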

Ending the Span

You can end a span by calling the end method on the SpanContext object and passing the optionalContextData.

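For example:

```java
spanContext.end(Map.of("status", "done")); // optionalContextData
```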

In conclusion, the Acceldata SDK equips users with powerful tools to manage and monitor data pipelines effectively. It enables seamless pipeline creation, job management, and hierarchical span organization, enhancing observability and control over data workflows. With the ability to generate custom events, the SDK offers a comprehensive solution for optimizing pipeline operations and ensuring data integrity.

Policy APIs

This section provides comprehensive information on the APIs and methods available for managing policies in your data management workflow. From a data observability perspective, the Policy APIs serve as a foundation for maintaining data integrity, ensuring data quality, and meeting compliance requirements. By harnessing these APIs, organizations can proactively detect and resolve data issues, ensuring the reliable and accurate flow of data through their systems, ultimately contributing to improved data observability and trust in data-driven decision-making.

Retrieving a Policy Based on Its Identifier

This API allows you to find a specific policy using its identifier, which can be either the policy ID or policy name. Whether you have the ID or name, you can efficiently retrieve the desired policy. This simplicity streamlines policy access within your application.

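A hedged sketch with an assumed getPolicy accessor, overloaded for ID and name:

```java
Policy policy = adocClient.getPolicy(2001L);               // by policy ID
Policy byName = adocClient.getPolicy("orders-null-check"); // by policy name
```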

Retrieving a Policy Based on Type and Identifier

This API allows you to find a specific policy based on its type and identifier. The identifier can be either the policy ID or policy name, offering flexibility in searching for policies.

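A sketch reusing the PolicyType enum described below (the overload itself is assumed):

```java
Policy policy = adocClient.getPolicy(PolicyType.DATA_QUALITY, "orders-null-check");
```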

Executing a Policy

The Acceldata SDK provides a utility function called executePolicy that allows you to execute policies both synchronously and asynchronously. When you use this function, it returns an object on which you can call getResult and getStatus to obtain the execution's result and status.

The required parameters for using the executePolicy function typically include the following:

  • policyType: Specified using an enum; a required parameter. It accepts the constant values PolicyType.DATA_QUALITY or PolicyType.RECONCILIATION and determines the type of policy to be executed.
  • policyId: A required string parameter. It specifies the unique identifier (ID) of the policy to be executed.
  • optionalSync: An optional Boolean parameter. It determines whether the policy is executed synchronously or asynchronously. The default value is false. If set to true, the call returns only after the execution has completed; if set to false, it returns immediately after the execution begins.
  • optionalIncremental: An optional Boolean parameter that determines whether the policy is executed incrementally.
  • optionalFailureStrategy: An optional enum parameter used to determine the behavior in the event of a failure. The default value is FailureStrategy.DoNotFail. The FailureStrategy enum can have three values:
    ◦ DoNotFail: Never throws an exception. In the event of a failure, the error is logged.
    ◦ FailOnError: Throws an exception only if there is an error. In the event of a warning, it returns with no errors.
    ◦ FailOnWarning: Throws exceptions on both warnings and errors.
  • pipelineRunId: An optional parameter specified as a long. It represents the run ID of the pipeline in which the policy is executed and can be used to link the policy execution with a specific pipeline run, providing context for the execution.
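
A sketch passing every parameter from the list above; the executor type name and the exact overload are assumptions:

```java
PolicyExecutor executor = adocClient.executePolicy(
        PolicyType.DATA_QUALITY,   // policyType
        "2001",                    // policyId
        true,                      // optionalSync: wait for completion
        false,                     // optionalIncremental
        FailureStrategy.DoNotFail, // optionalFailureStrategy
        pipelineRun.getId());      // pipelineRunId (optional)
```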

Getting Policy Execution Result

To access the outcome of a policy execution, you have two options: call the getPolicyExecutionResult method directly on the adocClient, or use the getResult method provided by the executor object generated during the policy execution. If the policy execution is in a non-terminal state, the call is retried until the execution reaches a terminal state. The getPolicyExecutionResult parameters include:

  • policyType: A mandatory parameter specified using an enum. It tells the system what kind of policy to query and can take two values: PolicyType.DATA_QUALITY or PolicyType.RECONCILIATION.
  • executionId: A required string parameter. It represents the unique identifier of the execution you want to query for results.
  • optionalFailureStrategy: An optional enum parameter. It decides what to do if something goes wrong during the execution. The default behavior, if you don't specify it, is FailureStrategy.DoNotFail: problems are logged, but the execution continues.
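
Both options, as a sketch (the result type name is assumed):

```java
String executionId = "exec-123"; // illustrative execution ID

// 1. Directly on the client.
PolicyExecutionResult result = adocClient.getPolicyExecutionResult(
        PolicyType.DATA_QUALITY, executionId, FailureStrategy.DoNotFail);

// 2. On the executor returned by executePolicy.
PolicyExecutionResult sameResult = executor.getResult();
```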

Retrieving Current Policy Execution Status

To find out the current status of a policy execution, you have two options. You can either call the getPolicyStatus method on the adocClient, or you can call the getStatus method on the executor object. These methods will provide you with the current resultStatus of the execution, which gives information about where the execution stands at the moment.

The required parameters for getPolicyStatus include:

  • policyType: The policy type, specified using an enum parameter. It is required and accepts the constant values PolicyType.DATA_QUALITY or PolicyType.RECONCILIATION.
  • executionId: The execution ID to be queried for the result, specified as a string parameter. It is a required parameter.
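
Both options, as a sketch (the status type name is assumed):

```java
String executionId = "exec-123"; // illustrative execution ID

PolicyExecutionStatus status = adocClient.getPolicyStatus(PolicyType.DATA_QUALITY, executionId);
PolicyExecutionStatus sameStatus = executor.getStatus();
```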

Executing Policies Asynchronously

Here's an example that demonstrates how to initiate policy execution in an asynchronous manner.

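A sketch with optionalSync set to false, so the call returns as soon as the execution begins (overload assumed):

```java
PolicyExecutor executor = adocClient.executePolicy(PolicyType.DATA_QUALITY, "2001", false);
// ... do other work, then poll for the outcome:
PolicyExecutionResult result = executor.getResult();
```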

Executing Policies Synchronously

Here's an example that demonstrates how to initiate a synchronous policy execution, where the execution waits for completion before proceeding:

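A sketch with optionalSync set to true, so the call blocks until the execution completes (overload assumed):

```java
PolicyExecutor executor = adocClient.executePolicy(PolicyType.DATA_QUALITY, "2001", true);
PolicyExecutionResult result = executor.getResult(); // available immediately on return
```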

Enabling a Policy

The provided code demonstrates how to enable a rule or policy.

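For example (enablePolicy is an assumed method name):

```java
adocClient.enablePolicy(policy.getId());
```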

Disabling a Policy

The provided code demonstrates how to disable a rule or policy.

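For example (disablePolicy is an assumed method name):

```java
adocClient.disablePolicy(policy.getId());
```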

Canceling Policy Execution

The provided code demonstrates how to cancel a policy execution using both the executor object and the adocClient.

Using the executor object:

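A sketch (cancel is an assumed method name):

```java
executor.cancel();
```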

Using adocClient:

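A sketch (cancelPolicyExecution is an assumed method name):

```java
String executionId = "exec-123"; // illustrative execution ID
adocClient.cancelPolicyExecution(PolicyType.DATA_QUALITY, executionId);
```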

Listing All Policies

This API retrieves a list of all policies from your catalog server while applying the specified filters. It accepts the following parameters:

  • policyFilter: A PolicyFilter object, which can have the following parameters:
    ◦ period: Filters policies based on a specified time period. It is defined as an enum called ExecutionPeriod.
    ◦ tags: Filters by tags, allowing you to narrow down policies that have specific tags associated with them.
    ◦ lastExecutionResult: Filters based on the results of the last execution, helping you identify policies with particular execution outcomes.
    ◦ assets: Filters policies associated with specific assets. It expects a list of objects of type Asset.
    ◦ dataSources: Filters policies by particular data sources. It expects a list of objects of type DataSource.
    ◦ policyType: Filters policies by their type. It is defined as an enum of type PolicyType.
    ◦ enable: Set enable to true to filter for policies that are currently enabled.
    ◦ active: Set active to true to filter for policies that are currently active.
  • optionalPage: Specifies the page number of the query output, allowing you to navigate through paginated results.
  • optionalSize: Defines the size of each page, controlling how many policies are displayed on each page of the query output.
  • optionalWithLatestExecution: Includes the latest execution details of each policy in the query output. When enabled, it fetches information about the most recent execution of each policy.
  • optionalSortBy: Determines the sorting order of the policies in the query output. You can specify the order in which policies are presented, for example, by name or execution date.
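
A sketch passing each parameter from the list above; the filter setters and the getPolicies signature are assumptions:

```java
PolicyFilter filter = new PolicyFilter();
filter.setPolicyType(PolicyType.DATA_QUALITY);
filter.setEnable(true);

List<Policy> policies = adocClient.getPolicies(
        filter,  // policyFilter
        0,       // optionalPage
        25,      // optionalSize
        true,    // optionalWithLatestExecution
        "name"); // optionalSortBy
```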