ADOC Glossary
The terms used in relation to ADOC are listed here, along with a brief description of what each one means.
A
Term | Definition |
---|---|
ADOC CLI | A command-line interface for ADOC. It can generate binaries with dependencies, upload binaries, and create User Defined Template (UDT) definitions |
Access Control | Regulates access to computer or network resources based on user roles within an organization |
Absolute File Count | Monitors the absolute number of files. |
Absolute File Size | Monitors the absolute size of files |
Absolute Row Count | Related to Data Cadence; includes metrics for the absolute number of rows in a given asset. |
Alerts | An alert is a notification generated when an observability parameter fails or succeeds. For example, users receive an alert if a data quality check identifies anomalies in a dataset. |
Analysis Service | The Analysis Service performs data profiling, rule executions, and data sampling tasks using various configuration parameters, such as Data Retention Days, Historical Metrics Interval for Anomaly Detection, and Minimum Required Historical Metrics for Anomaly Detection. For example, the Analysis Service can identify unexpected items or events in a dataset using historical metrics. |
Anomaly | An anomaly refers to irregularities detected in a dataset's values using historical metrics. ADOC allows customization of Minimum Required Historical Metrics and Historical Metric Interval. For example, anomalies can include incorrect data values, unexpected data elements, or outlier records. |
Anomaly Detection Settings | ADOC allows customization of Minimum Required Historical Metrics and Historical Metric Interval. |
API (Application Programming Interface) keys | An API key is a unique identifier used to authenticate a user, developer, or calling program to an API. For example, the POST Start Profiling API method requires an API key to initiate the asset profiling process (see the sketch after this table). |
Asset | An asset is an entity composed of data, such as a warehouse or database containing schemas and tables, or files in storage services like S3, GCS, or ADLS. For example, a data asset may consist of data records organized into schemas, tables, and columns. |
Asset List View | Display of all the assets discovered in ADOC. |
Asset Similarity | Compares the degree of similarity between columns in multiple tables, calculating similarity percentages and producing a Table Similarity score. |
Audit Log | ADOC logs Data Reliability and Compute events, such as crawler activities and scheduling |
Auto Profile | Auto profiling is the automated processing of information to analyze data. This process displays data source assets that have auto profiling enabled. |
Avro File Format | A data serialization system used by Hadoop. |
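
As a minimal sketch of how an API key might be passed when calling an ADOC REST endpoint such as the Start Profiling method mentioned above: the base URL, endpoint path, and header name below are hypothetical placeholders rather than the documented ADOC API.

```python
import requests

# Hypothetical values, shown only to illustrate API-key authentication.
ADOC_BASE_URL = "https://your-adoc-instance.example.com"  # placeholder base URL
API_KEY = "your-api-key"                                   # issued from the ADOC admin console
ASSET_ID = "12345"                                         # placeholder asset identifier

# Hypothetical "Start Profiling" call; the real path and header name may differ.
response = requests.post(
    f"{ADOC_BASE_URL}/api/assets/{ASSET_ID}/profile",
    headers={"X-API-KEY": API_KEY},
    timeout=30,
)
response.raise_for_status()
print("Profiling request accepted:", response.status_code)
```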
B
Term | Definition |
---|---|
Big Data | Big Data refers to extremely large volumes of data, arriving at high velocity, and encompassing a wide variety of data types (structured, semi-structured, and unstructured) |
Business Glossary | The Business Glossary captures all information about specific assets, pipelines, or business processes for future reference. |
Bulk Policies | Feature that simplifies creating data quality rules, grouping them, and applying them to data sources. It automatically creates a Data Quality policy and applies the rules to assets matching a tag-based condition |
C
Term | Definition |
---|---|
Change in File Count | Tracks variations in file count. |
Change in File Size | Tracks variations in file size. |
Cloud Service | Stores warehouse names, table names, database names, usage and contract metadata, login details, and user details |
Compute | The Compute feature provides an estimate of resource utilization and the compute and storage costs of the underlying infrastructure. It offers recommendations to optimize resource allocation and reduce costs. For example, if instances are running for a specific time at a certain cost, the Compute feature helps understand the cost and provides recommendations on efficient instance usage. |
Context Switching | ADOC supports context switching in heterogeneous pipelines |
Contract | A contract is a summary that includes the organization's name, account consumption, capacity used as a percentage, and the contract end date. Contract costs can be predicted using previous cost consumption metrics. For example, viewing costs across accounts and services in an organization shows storage, cloud services, replication, data transfer, and compute costs. |
Crawl/Crawler | Crawling is the process of extracting metadata from data sources. After establishing a connection to the data source, you can crawl metadata from the remote data source into the ADOC database. For example, crawling retrieves metadata such as owner, table type, and row count. |
D
Term | Definition |
---|---|
Data Drift Policy | A Data Drift Policy determines the percentage change in certain metrics when the underlying data changes. Users can create data drift rules to validate data changes against tolerance thresholds for each metric type. For example, setting a data drift rule to alert if the average value of a column changes by more than 5% compared to the previous day (see the sketch after this table). |
Data Freshness Policy | Tracks whether data is updated within the expected timeframe. |
Data Governance | A composite term for the practices used to ensure data quality and governance. |
Data Lineage | Establishes data lineage by detecting external data sources and finding relationships between them, enhancing cross-system data visibility. It depicts how data was obtained from various sources, showing a graphical representation of data flow. |
Data Plane | The Data Plane is a client-managed layer within the ADOC architecture that directly interacts with and manages the client's data resources. It is required to add a data source in ADOC, facilitates the smooth transfer of data between various software systems, and is essential for leveraging Data Reliability for a data source. The Data Plane list view displays all the Data Planes created in ADOC. |
Data Protection | Data Protection enables non-admin users to have selected columns from a table be masked. It imposes restrictions on columns containing personally identifiable information (PII). For example, enabling PII protection on sensitive columns hides the data from unauthorized users. |
Data Quality Policy | A Data Quality Policy measures how healthy the data is within a data source from a consumer or business standpoint. Multiple policies can be executed to check data quality. For example, a data quality policy may enforce that no null values are present in critical columns. |
Data Reconciliation Policy | A Data Reconciliation Policy refers to comparing the target data to the original source data to ensure that data migration transfers the data correctly. It can be created in ADOC between two assets of similar type or between assets that can be profiled. For example, reconciling data between a source database and a destination data warehouse after migration. |
Data Reliability | ADOC's function for ensuring data quality and governance, providing tools to maintain the integrity, consistency, and reliability of data assets. |
Data Retention | Data Retention sets rules that specify how long data should be kept. |
Data Source | A data source is the origin location of the data being used. The database is located on a remote server and is accessible via database connections. To retrieve data, a server must establish a connection to the database. |
Data Synchronization | ADOC supports metadata synchronization. Data Synchronization within ADOC ensures data consistency between different systems or storage locations by periodically comparing data and metadata to identify and resolve discrepancies. This process involves updating metadata or flagging inconsistencies to maintain data integrity across the ADOC environment. |
Dependencies | The ADOC CLI build command generates fat binaries that bundle all required dependencies. |
Discover Page | The Discover Page provides a list of all the different assets present in your ecosystem while configuring ADOC to track them, with various filtering capabilities. |
Discrepancy Resolution | The process of identifying and resolving differences between source and target data; reconciliation policies support discrepancy resolution to build data trust and reliability. |
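
The Data Drift Policy entry above describes comparing a metric's percentage change against a tolerance threshold. A minimal, hedged sketch of that comparison in plain Python (independent of how ADOC actually evaluates drift rules):

```python
def drift_exceeds_tolerance(previous_value: float, current_value: float,
                            tolerance_pct: float = 5.0) -> bool:
    """Return True if the metric changed by more than tolerance_pct percent."""
    if previous_value == 0:
        # Avoid division by zero; treat any change from zero as drift.
        return current_value != 0
    change_pct = abs(current_value - previous_value) / abs(previous_value) * 100
    return change_pct > tolerance_pct

# Example: yesterday's average column value was 100.0, today's is 106.5 -> 6.5% change.
print(drift_exceeds_tolerance(100.0, 106.5, tolerance_pct=5.0))  # True -> raise an alert
```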
E
Term | Definition |
---|---|
Entity Relationship (ER) Diagram | An ER Diagram is a graphical representation of the relationships between entities in a database. It helps visualize data structures and connections, aiding in database design and data lineage understanding. |
Error Metric | Error metrics are measurements used to evaluate the accuracy of models or data processes. In ADOC, they can be used to track the number of errors detected during data profiling or reconciliation processes. For example, the Mean Absolute Error (MAE) can be used to compare predicted values against actual values to identify deviations (see the sketch after this table). |
Event Logs | Event logs capture and store occurrences of system activities, providing a traceable history for auditing and troubleshooting purposes. They contain information such as event types, timestamps, and source details. |
Export Policy | An export policy is a set of rules that dictate how data and reports can be exported from the ADOC platform. This includes configuring formats, destinations, and user permissions for exports. For example, exporting a data quality report as a CSV file to an external storage location. |
Execute Policy Operator | ADOC provides the Execute Policy Operator. This feature executes a specified data policy (Data Quality or Reconciliation) within a data pipeline, which can be triggered upon span completion, either fully or incrementally, and links the execution to a specific pipeline run. |
External Integrations | These are connections with third-party services and applications through the ADOC platform, with updated capabilities offering a flexible approach to load pipeline monitoring metadata independently of the platform's ongoing activity. They often utilize OAuth for secure, limited access without exposing login details |
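
The Error Metric entry mentions Mean Absolute Error (MAE). The short sketch below illustrates how MAE compares predicted values against actual values; it is generic Python, not ADOC-specific code.

```python
def mean_absolute_error(actual, predicted):
    """MAE = average of |actual - predicted| over all observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Example: small deviations between observed and expected row counts.
print(mean_absolute_error([100, 102, 98], [101, 100, 99]))  # 1.333...
```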
F
Term | Definition |
---|---|
Filters | Filters in ADOC help you narrow down and focus on the specific data you need, whether you're searching for assets, policies, or other information. They let you sift through the noise by setting criteria based on various attributes like data source, tags, or time ranges, so you can quickly find what's most relevant to you. |
Filter UDT (User Defined Template) | A Filter UDT allows users to define their own data quality rules in languages such as Java, Scala, Python, JavaScript, or Spark SQL to filter records in a data asset (table); see the sketch below. |
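
A minimal sketch of the kind of custom logic a Filter UDT might express, written in Python. The function name and dictionary-based record format are illustrative assumptions, not ADOC's required template signature.

```python
def keep_record(record: dict) -> bool:
    """Illustrative filter logic: keep rows whose 'email' field is present and non-blank."""
    email = record.get("email")
    return email is not None and email.strip() != ""

rows = [
    {"id": 1, "email": "user@example.com"},
    {"id": 2, "email": "   "},
    {"id": 3},
]
filtered = [row for row in rows if keep_record(row)]
print(filtered)  # [{'id': 1, 'email': 'user@example.com'}]
```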
G
Term | Definition |
---|---|
Gen AI Assisted Metadata Generation | It employs advanced AI algorithms to analyze data assets and generate descriptive metadata automatically to improve the discoverability and understandability of data assets, facilitating data governance and usage. |
Grouping Policy | Grouping policies allow users to aggregate multiple assets or rules under a common group to simplify management and execution. These policies can be applied to a set of data quality rules or assets that share similar characteristics. For example, grouping multiple customer-related data sources under a single policy for data quality monitoring. |
H
Term | Definition |
---|---|
Home Page | The ADOC landing page provides an overview of Acceldata's capabilities and allows navigation to specific dashboards, recommendations, or actions. |
Historical Metrics | Historical metrics track historical data points over time to analyze trends, detect anomalies, and set baselines for future data observations. For example, using historical metrics of data quality to determine acceptable thresholds for missing values or data type inconsistencies. |
I
Term | Definition |
---|---|
Integration Points | Integration Points are the specific interfaces and methods used within ADOC to connect to Hadoop Distributed File System (HDFS) and other related components, such as MapR, to facilitate data loading and management. These points ensure seamless compatibility and robust support for managing and monitoring data sources within the Hadoop ecosystem, thus improving data observability and reliability |
Ingestion Pipeline | An ingestion pipeline is a series of data operations that pull raw data into the ADOC platform for further processing and analysis. For example, setting up an ingestion pipeline to pull transactional data from AWS S3 into ADOC. |
J
Term | Definition |
---|---|
Jobs | Jobs are operations triggered when an action is performed in ADOC. Various jobs can be viewed and monitored in the jobs window, such as profile jobs, auto profile queues, data quality jobs, reconciliation jobs, and upcoming jobs. |
Job State | The final state of the job, providing a more detailed description of the process, such as metrics requests sent, partial analysis received, or the task being fully completed |
K
Term | Definition |
---|---|
KPI | KPIs are measurable values used to assess the performance and success of data operations or policies within ADOC. For example, tracking the percentage of successful data quality checks as a KPI for data reliability. |
L
Term | Definition |
---|---|
Label | Labels allow data assets to be categorized by purpose, owner, or business function. Labels use key-value pairs defined over an asset to facilitate advanced search methods. For example, assigning a label like "Confidentiality: High" to sensitive data assets. |
Lineage | Lineage depicts how data was obtained from various sources, showing a graphical representation of data flow. For example, lineage diagrams illustrate how data moves through different ETL processes from source to destination. |
Lookup Type | A toggle switch in a Validation UDF. When enabled, ADOC recognizes that the Validation UDF will be used in a Lookup rule in a Data Quality policy. |
Lookup Data Quality Policy | A Lookup Data Quality Policy enables the validation of values in a table against a reference dataset or predefined set of valid values. For example, ensuring that all state codes in a dataset match a predefined list of US state abbreviations (see the sketch below). |
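
To make the Lookup Data Quality Policy example concrete, here is a hedged Python sketch of checking state codes against a reference set; ADOC's own lookup rules run inside the platform rather than as standalone scripts.

```python
# Abbreviated reference set, for illustration only.
VALID_STATE_CODES = {"CA", "NY", "TX", "WA"}

records = [{"order_id": 1, "state": "CA"},
           {"order_id": 2, "state": "ZZ"}]

# Rows whose value is missing from the reference dataset fail the lookup check.
failures = [r for r in records if r["state"] not in VALID_STATE_CODES]
print(failures)  # [{'order_id': 2, 'state': 'ZZ'}]
```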
M
Term | Definition |
---|---|
Metadata | Metadata describes information about data, making it easier to locate, use, and reuse specific data instances. For example, metadata for a column "Name" in a table "Customer_Information" indicates that the column's data type is string. |
Metadata Synchronization | ADOC supports metadata synchronization. Metadata Synchronization is a process within ADOC that ensures consistency and alignment of metadata across various systems and clusters. This involves aligning table structures, formats, and other relevant metadata parameters between, for example, MapR HDFS/Hive and Apache HDFS/Hive. The goal is to accurately reflect the current state of data and facilitate efficient data discovery, understanding, and governance. |
Minimal Privilege | In the context of ADOC Data Plane installation, refers to the practice of granting the least amount of permissions necessary for the Data Plane to function correctly. This approach enhances security by restricting the Data Plane's access only to the specific resources and actions it requires, reducing the potential impact of security breaches or unauthorized activities. |
Monitoring Dashboard | The Monitoring Dashboard provides a consolidated view of the performance and status of all assets, jobs, and policies within ADOC. For example, monitoring the status of ongoing profiling jobs and any data quality issues detected. |
Monitors | Monitors are entities that continuously observe data sources and assets to track specific metrics or patterns, providing real-time observability. For example, setting up a monitor to observe data drift in a dataset and alerting if the change exceeds the tolerance threshold. |
Monthly Asset Profiling Schedule | ADOC allows users to schedule asset profiling on a monthly basis, enabling automated execution and ensuring timely evaluation of data. |
N
Term | Definition |
---|---|
Notification Channel | A Notification Channel is used to configure notifications via email, Slack, Hangouts, Jira, or a webhook URL. Multiple notification channels can be set up depending on user segregation (see the webhook sketch below). |
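
For the webhook option, an alert notification is typically delivered as an HTTP POST with a JSON payload. The URL and payload fields below are placeholders for illustration and do not reflect ADOC's documented payload schema.

```python
import requests

# Placeholder webhook URL and payload structure, for illustration only.
WEBHOOK_URL = "https://hooks.example.com/adoc-alerts"
payload = {
    "policy": "orders_null_check",
    "status": "FAILED",
    "message": "Null values detected in column 'customer_id'",
}

# Send the alert to the configured webhook endpoint.
resp = requests.post(WEBHOOK_URL, json=payload, timeout=10)
resp.raise_for_status()
```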
O
Term | Definition |
---|---|
OAuth Integration | OAuth Integration lets ADOC connect securely to other platforms and applications. Instead of sharing your username and password, you grant ADOC limited access to specific resources, such as your data, without exposing your login details. This lets you seamlessly connect ADOC to services like Snowflake or Azure Databricks, streamlining your workflow while keeping your account secure. |
Observability | Observability refers to the capability of the platform to measure the internal state of a data system by analyzing the output and logs, enabling users to diagnose and fix issues effectively. For example, observing a data pipeline to understand how changes in the source system impact data quality downstream. |
Operating System | The Data Plane uses operating system-level metrics. This refers to the Data Plane's capability to gather performance metrics directly from the operating system on which it runs. By collecting these metrics, ADOC gains insight into resource utilization, system health, and the overall performance of the Data Plane. These metrics help identify potential bottlenecks, optimize resource allocation, and ensure the Data Plane operates efficiently. |
P
Term | Definition |
---|---|
Permissions | Define user roles, access levels, and permissions within the ADOC platform based on your organization's requirements |
Persistence Path | The Persistence Path specifies the result location at the asset level. Data quality results will be stored in the specified persistence path of any storage, such as Amazon S3, HDFS, Google Cloud Storage, or Azure Data Lake. The persistence path can be set globally in the admin console but can be overridden if configured at the asset level. |
Pipeline | A Pipeline represents the complete ETL (Extract-Transform-Load) workflow and contains asset nodes and associated jobs. It facilitates observability of data movement from source repositories to target repositories. |
Policy | A policy is a rule mapped to an asset to perform specific actions. Several types of policies can be defined for an asset, such as Data Quality, Data Reconciliation, and Data Drift policies. |
Policy Execution | When you execute a policy, all of the rules stated in your policy are run, and you can view the results for each rule. |
Policy Template | A Policy Template contains predefined rules and configurations that can be applied to create new data quality or reconciliation policies, saving time and ensuring consistency. For example, using a policy template to standardize rules for missing values and unique constraints. |
Profile | Data profiling is the process of reviewing, analyzing, and summarizing data into meaningful information. It produces a high-level overview that assists in identifying data quality concerns. For example, profiling an asset provides statistical data such as minimum, maximum, and average values. |
Pushdown Data Engine | The Pushdown Data Engine is a data processing engine within ADOC that performs data operations directly on the source system, reducing the need to move data across the network. For example, performing a join operation between two tables directly in the data warehouse instead of pulling the data into ADOC. |
Q
Term | Definition |
---|---|
Queries | A query is a request for data results from the database or an action on the data. The Queries tab displays the top 50 successful or failed queries, along with their estimated cost, user, database, warehouse, and execution status. For example, viewing the estimated cost by query type (CRUD) and cost per warehouse in graphical form. |
R
Term | Definition |
---|---|
RBAC | The Role Based Access Control (RBAC) feature in ADOC, which provided authorization control across the entire application, has been deprecated. Starting with ADOC V4.0, RBAC has been superseded by Resource-Based Access Management (RBAM), offering more granular control by introducing domains and resource groups |
RBAM | Resource-Based Access Management is a method of regulating access to resources. RBAM is an enhanced access control system in ADOC that lets administrators define who can access specific resources (like assets, reports, and policies) by using domains and resource groups. This goes beyond traditional Role-Based Access Control (RBAC) by providing granular control over both actions and resource visibility, ensuring users can only see and interact with the resources relevant to their roles. RBAM improves security, aids in regulatory compliance, and supports delegated administration, making resource management more scalable and efficient for enterprises |
Reconciliation | Reconciliation refers to comparing the target data to the original source data to ensure that data migration transfers the data correctly. A reconciliation policy can be created in ADOC between two assets of a similar type or between assets that can be profiled. |
Regex Match | Regex Match is a type of data quality check in ADOC that validates data patterns within columns of a single asset, utilizing regular expressions for structured data such as email addresses. This ensures that column values adhere to a specified pattern. When creating a data quality policy in ADOC, a Pattern Match rule can be selected to check whether column values conform to a given regular expression (see the sketch after this table). |
Rule Set | A Rule Set is a group of data quality rules that exist outside of an asset-level policy. It can be used to automatically create policies by applying them over assets. |
Rules | Rules are defined functions used when configuring a policy to validate data during policy execution. A policy can contain multiple rules that check for null values, uniqueness, and other attributes on an asset. |
Reference Asset | A reference asset is a dataset or schema used as a baseline to compare with other datasets during reconciliation or validation operations. For example, using a "Customer Master" dataset as a reference asset to validate data consistency in transactional datasets |
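
As a simple illustration of the Pattern Match check described in the Regex Match entry above, the snippet below flags values that do not match a simplified email pattern; the exact regular expression is chosen by the policy author.

```python
import re

# Simplified email pattern for illustration; production patterns are usually stricter.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

values = ["alice@example.com", "not-an-email", "bob@data.io"]
non_matching = [v for v in values if not EMAIL_PATTERN.match(v)]
print(non_matching)  # ['not-an-email'] -> rows that fail the pattern check
```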
S
Term | Definition |
---|---|
Sample Data | Sample Data represents the content of attributes in a table, providing example values for understanding the data. |
Schema Drift Policy | A Schema Drift Policy detects changes to a schema or table between previously crawled and currently crawled data sources. In ADOC, schema drift policies are executed every time a data source is crawled. For example, detecting if a new column has been added or an existing column's data type has changed. |
Schema Registry | A central repository for Apache Avro schemas and includes a REST API for schema storage and retrieval. |
Security Measures | ADOC complies with privacy and security requirements, ensures data never leaves the environment, and supports configurations where PII is not removed |
Segments | The Data Reliability function in Acceldata Data Observability Cloud (ADOC) allows you to apply policies to assets to maintain data quality. |
Source Connection | A source connection is a configuration that establishes a link between ADOC and a data source, enabling data extraction, profiling, and monitoring. For example, configuring a source connection to an Azure Data Lake storage account. |
SQL Rule | SQL Rule is a feature within ADOC's Data Quality Policy that enables users to create custom data validation and transformation logic using SQL expressions. |
Storage | The Storage tab provides a summary of storage costs. Table, database, and high churn table storage costs can be viewed to take appropriate actions. |
Stock Monitor | Stock Monitors are pre-built monitors available in the ADOC platform to track common metrics and observability parameters. For example, using a stock monitor to track schema changes in a dataset. |
System Time | The Data Plane uses operating system-level metrics, such as CPU utilization, load, and system time. |
T
Term | Definition |
---|---|
Tags | Tags are metadata that help describe an asset and allow it to be found through browsing or searching. Tags aid in data discoverability and can be linked to assets and policies. Tags can be generated by the system or by users and can also be used to apply policies to assets. ADOC provides a Tags page where you can manage tags. |
Template | A Template contains multiple rule definitions. These rule definitions are applied when a Data Quality Policy is created. Instead of defining rules for each policy, you can use a policy template that contains predefined rules. For example, when a policy template is added, all rule definitions in the template are automatically evaluated. |
Transform UDT | A Transform User Defined Template allows you to extract or manipulate values from a record in a data asset (table), using custom logic defined in languages such as Java, Scala, Python, JavaScript, or Spark SQL. |
V
Term | Definition |
---|---|
Validation UDF | A Validation UDF can be used in a Lookup rule in a Data Quality policy. You can create a lookup rule that compares multiple target columns with multiple reference columns by using User Defined Templates (UDT). |
Virtual Data Source | A Virtual Data Source refers to a logical representation of a data source that enables the integration of multiple physical data sources into a single entity. For example, combining multiple Google Cloud Storage buckets into a single virtual data source for unified monitoring. |
W
Term | Definition |
---|---|
Webserver | A Web Server is software that handles requests to access the ADOC platform's interface. Think of it as the middleman that takes your instructions and displays the ADOC platform to you through a web-based interface, often using software like Apache or Nginx. |