Google Cloud Pub/Sub

Google Cloud Pub/Sub integration in Acceldata Data Observability Cloud (ADOC) enables comprehensive data reliability, observability, and profiling for your event-driven architecture. Introduced in ADOC v4.7.0, this connector allows you to crawl, profile, and reconcile Pub/Sub data streams using ADOC’s batch reading engine—without complex setup.

Google Cloud Pub/Sub is a fully managed real-time messaging service that allows applications to exchange event data at scale. By integrating it with ADOC, you can ensure continuous data reliability and visibility for your streaming workloads. The ADOC connector reads from Pub/Sub topics using ephemeral (temporary) subscriptions, ensuring isolated execution and automatic cleanup after each batch job.

Supported Authentication Methods

| Authentication Type | Description |
|---|---|
| Google Workload Identity (Default) | Uses GCP Workload Identity Federation for secure, identity-based authentication between ADOC and GCP. |
| Service Account Key File (JSON) | Authenticate using a service account JSON key uploaded directly in ADOC. |

Prerequisites and Permissions

Before adding Google Cloud Pub/Sub as a data source, ensure the following:

  1. You have an existing Data Plane configured in ADOC. Refer to the Data Plane Installation Guide for setup instructions.
  2. The following GCP IAM permissions must be granted to the service account used for connection:
| Permission | Resource Scope | Purpose |
|---|---|---|
| pubsub.subscriptions.create | Subscription Project ID | Create ephemeral subscriptions during job execution. |
| pubsub.topics.attachSubscription | Each topic of interest | Attach ephemeral subscriptions to Pub/Sub topics. |
| pubsub.subscriptions.delete | Subscription Project ID | Delete ephemeral subscriptions post-job to avoid residual resources. |
| pubsub.subscriptions.consume, pubsub.messages.pull, pubsub.messages.acknowledge | Ephemeral subscriptions | Enable the Spark Batch Reader to read and acknowledge messages. |

The Test Connection step validates the lifecycle permissions (create, attach, delete) that are critical to ADOC’s Pub/Sub batch processing model.
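In GCP, this kind of check is typically performed with the `testIamPermissions` API. The sketch below is illustrative only (it is not ADOC's actual Test Connection implementation); the permission names come from the table above, while the helper itself is an assumption:

```python
# Illustrative check of the lifecycle and read permissions listed above.
# The permission names are from the table; the helper is a sketch, not
# ADOC's actual Test Connection logic.

REQUIRED_LIFECYCLE_PERMS = {
    "pubsub.subscriptions.create",       # create ephemeral subscriptions
    "pubsub.topics.attachSubscription",  # attach them to topics of interest
    "pubsub.subscriptions.delete",       # clean up after each batch job
}
REQUIRED_READ_PERMS = {
    "pubsub.subscriptions.consume",
    "pubsub.messages.pull",
    "pubsub.messages.acknowledge",
}

def missing_permissions(granted: set) -> set:
    """Return the required permissions the service account lacks."""
    return (REQUIRED_LIFECYCLE_PERMS | REQUIRED_READ_PERMS) - granted
```

If `missing_permissions` returns a non-empty set, the corresponding grants must be added before the connection test can pass.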

  3. Ensure the following connection details are available:
    • Source and Subscription Project IDs
    • Google Cloud region (e.g., us-east1)
    • Authentication credentials (Workload Identity or JSON key)
    • Topics to be read by ADOC

Configuration Parameters

| Parameter | Description | Mandatory | Example |
|---|---|---|---|
| Data Source Name | Unique identifier for the Pub/Sub source. | Yes | GCP-PubSub-Prod |
| Description | Optional notes for the data source. | No | Production Pub/Sub data pipeline |
| Data Plane | ADOC Data Plane to use. | Yes | dp-gcp-us |
| Source Project ID | Project ID where Pub/Sub topics reside. | Yes | source-project-123 |
| Subscription Project ID | Project ID for temporary subscriptions. | Yes | subscription-project-xyz |
| Region | GCP region of Pub/Sub topics. | Yes | us-east1 |
| Authentication Method | Choose Workload Identity or Upload Service Account File. | Yes | Workload Identity |
| Service Account File | JSON key for Service Account authentication. | Required if the JSON file method is chosen | /path/to/service-account.json |
| Topics of Interest | Comma-separated list of Pub/Sub topics to monitor. | Yes | orders-topic, audit-topic |
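Assembled from the table above, a connection configuration might look like the following. The field names and the `validate` helper are illustrative assumptions, not ADOC's actual API schema:

```python
# Hypothetical representation of the parameters from the table above.
# Field names are illustrative; ADOC's real payload format may differ.
pubsub_source = {
    "data_source_name": "GCP-PubSub-Prod",
    "description": "Production Pub/Sub data pipeline",
    "data_plane": "dp-gcp-us",
    "source_project_id": "source-project-123",
    "subscription_project_id": "subscription-project-xyz",
    "region": "us-east1",
    "authentication_method": "workload_identity",  # or "service_account_file"
    "service_account_file": None,  # JSON key path, needed only for the file method
    "topics_of_interest": ["orders-topic", "audit-topic"],
}

def validate(cfg: dict) -> list:
    """Collect missing mandatory fields, mirroring the table's rules."""
    required = [
        "data_source_name", "data_plane", "source_project_id",
        "subscription_project_id", "region", "authentication_method",
        "topics_of_interest",
    ]
    errors = [f"missing: {k}" for k in required if not cfg.get(k)]
    if (cfg.get("authentication_method") == "service_account_file"
            and not cfg.get("service_account_file")):
        errors.append("service_account_file is required for the JSON-key method")
    return errors
```

Note the conditional rule: the Service Account File becomes mandatory only when the JSON file authentication method is selected.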

Adding Google Cloud Pub/Sub as a Data Source

  1. Navigate to Register > Data Sources tab in ADOC.

  2. Click Add Data Source.

  3. Select Google Cloud Pub/Sub from the list of data sources.

  4. Enter a Data Source Name and optional Description.

  5. Ensure the Data Reliability toggle is enabled.

  6. Choose an existing Data Plane or create a new one.

  7. Click Next to configure Connection Details.

  8. Provide the following:

    • Authentication method (choose between Workload Identity and JSON key file)
    • Credentials File
    • Source Project ID
    • Subscription Project ID
    • List of Topic Names
  9. Click Test Connection to validate access and permissions.

  10. Once successful, click Next to configure Topic details in the Observability Setup step.

Configuring Topic Details

After successfully connecting to your Google Cloud Pub/Sub data source, the Set Up Observability page allows you to configure topic-level settings for monitoring and data reliability.

| Field | Description | Example |
|---|---|---|
| Asset Name | Logical name assigned to the Pub/Sub topic within ADOC. Appears as the asset identifier in the Data Reliability dashboard. | orders_topic_asset |
| Topic Name | Exact name of the Pub/Sub topic in Google Cloud. | orders-topic |
| Message Format | Supported formats: JSON, Avro, Confluent Avro. | JSON |
| Subscriber Parallelism | Number of parallel subscribers used for data reading during job execution. Controls throughput. | 1 |
| Schema ID | Identifier of the schema (for Avro/Confluent Avro). Used for mapping to a Schema Registry entry. | orders-schema-v2 |
| Schema Naming Strategy | Naming convention used to resolve schema identity. Options: TOPIC_NAME, RECORD_NAME, TOPIC_RECORD_NAME. | TOPIC_RECORD_NAME |
| Key or Value | Specifies whether the schema applies to the message key or value. | Value |
| Record Name | Record name for Avro or Confluent Avro messages. | OrderRecord |
| Record Namespace | Avro namespace used to organize record schemas. | com.retail |
| Topic Schema | Full schema definition for the topic, if manually provided. | Inline JSON or Avro schema |
| Schema File Path | Path to an external schema file (e.g., .avsc). | /schemas/order.avsc |

These parameters allow ADOC’s Spark Batch Reader to interpret payloads correctly during data profiling and quality evaluation.
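The schema naming strategy determines how a Schema Registry subject is resolved from the topic and record fields. The sketch below follows Confluent's common subject-naming conventions (TopicNameStrategy, RecordNameStrategy, TopicRecordNameStrategy); ADOC's exact resolution rules may differ:

```python
def resolve_subject(strategy, topic, record_namespace=None,
                    record_name=None, key_or_value="value"):
    """Resolve a Schema Registry subject name (illustrative sketch).

    Mirrors Confluent's conventions: TOPIC_NAME appends a -key/-value
    suffix to the topic; RECORD_NAME uses the fully qualified record
    name; TOPIC_RECORD_NAME combines both.
    """
    suffix = "-key" if key_or_value.lower() == "key" else "-value"
    fqn = f"{record_namespace}.{record_name}" if record_namespace else record_name
    if strategy == "TOPIC_NAME":
        return topic + suffix
    if strategy == "RECORD_NAME":
        return fqn
    if strategy == "TOPIC_RECORD_NAME":
        return f"{topic}-{fqn}"
    raise ValueError(f"unknown schema naming strategy: {strategy}")
```

Using the table's example values, TOPIC_RECORD_NAME with topic `orders-topic`, namespace `com.retail`, and record `OrderRecord` resolves to `orders-topic-com.retail.OrderRecord`.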

Optional Settings

Enable Schema Drift Monitoring

Turn on this setting to track structural changes (schema drift) in your Pub/Sub topic data over time.

Note: Schema drift detection requires Enable Crawler Execution Schedule to be turned on.

Enable Crawler Execution Schedule

Set up scheduled crawlers to automatically scan and profile your Pub/Sub topics at regular intervals.

Options include:

  • Frequency: Choose how often the crawler runs (e.g., Daily, Weekly, Hourly).
  • Execution Time: Specify the start time for crawler execution.
  • Time Zone: Select the appropriate time zone (e.g., UTC, Asia/Calcutta).
  • Multiple Execution Windows: Add multiple time slots as needed.

Example:

Every Day at 12:00 AM UTC (Next Execution: 2025-10-28 05:30:00 Asia/Calcutta)
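The example above can be reproduced with Python's zoneinfo module. The helper below is a simplified, DST-naive sketch (not ADOC's scheduler) of how the next daily execution is derived and displayed in another time zone:

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

def next_daily_run(now: datetime, run_at: time, tz: str) -> datetime:
    """Next daily execution at run_at in tz, strictly after now.

    Simplified sketch: ignores DST edge cases around the transition hour.
    """
    local_now = now.astimezone(ZoneInfo(tz))
    candidate = local_now.replace(hour=run_at.hour, minute=run_at.minute,
                                  second=0, microsecond=0)
    if candidate <= local_now:
        candidate += timedelta(days=1)
    return candidate

# Matches the doc's example: daily at 12:00 AM UTC, shown in Asia/Calcutta.
now = datetime(2025, 10, 27, 12, 0, tzinfo=ZoneInfo("UTC"))
nxt = next_daily_run(now, time(0, 0), "UTC")
print(nxt.astimezone(ZoneInfo("Asia/Calcutta")))  # 2025-10-28 05:30:00+05:30
```

This also shows why the "Next Execution" timestamp differs from the schedule: the run fires at 00:00 UTC, which is 05:30 in Asia/Calcutta.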

Set Notifications

  • Notify on Crawler Failure: Select one or more configured notification channels (e.g., Slack, Email) to receive alerts if a crawler run fails.
  • Notify on Success: Toggle on to receive notifications when a crawler run completes successfully.

Finally, click Submit to save your topic configuration and begin monitoring Pub/Sub data through ADOC.

Data Reading Options (Batch Mode)

ADOC supports two data reading modes for Google Cloud Pub/Sub:

1. Full Read

  • Creates a temporary subscription.
  • Reads all messages from the earliest retained message to the job start time.
  • Deletes the subscription after processing.

Use Case: Initial ingestion or complete refresh of topic data.

2. Incremental Read

  • Creates a new ephemeral subscription for each job run.
  • Supports two strategies:
    • Timestamp-based: Reads messages newer than the previous job’s watermark.
    • Lookback-based: Reads messages within a user-defined time window (e.g., last 24 hours).

Use Case: Continuous, non-overlapping processing (timestamp-based) or overlapping windows for fault-tolerant recovery (lookback-based).
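The two incremental strategies differ only in how the start of the read window is chosen. A minimal sketch of the window logic, with illustrative names (this is not ADOC's internal code):

```python
from datetime import datetime, timedelta, timezone

def incremental_window(job_start, previous_watermark=None, lookback=None):
    """Return the (start, end) read window for one batch run.

    Timestamp-based: start at the previous run's watermark, giving
    non-overlapping windows. Lookback-based: start a fixed duration
    before job start; consecutive windows may overlap, which helps
    fault-tolerant recovery. With neither, fall back to a full read
    (start=None means "earliest retained message").
    """
    if lookback is not None:
        return job_start - lookback, job_start
    if previous_watermark is not None:
        return previous_watermark, job_start
    return None, job_start

start = datetime(2025, 10, 28, 0, 0, tzinfo=timezone.utc)
print(incremental_window(start, lookback=timedelta(hours=24)))
```

In both strategies the window's end is the job start time, matching the batch model described above: each run reads a bounded slice through an ephemeral subscription that is deleted afterward.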

Next Steps

  • View the newly added data source under Data Reliability → Data Sources.
  • Schedule crawler runs for continuous profiling and data quality checks.
  • Monitor data health, schema drift, and freshness metrics through ADOC dashboards.