Google | Cloud Storage

Google Cloud Storage (GCS) is a Representational State Transfer (REST)-based file storage web service. You can use GCS to store and access data on the Google Cloud Platform (GCP). In ADOC, you can add GCS as a data source to monitor vital parameters of your data.

Google Cloud Storage in ADOC

ADOC provides data reliability capabilities for data stored in your Google Cloud Storage data source. To add Google Cloud Storage as a data source in ADOC, you must create a data plane or use an existing one. Once you add Google Cloud Storage as a data source, you can view the details of your Google Cloud Storage assets in the Data Reliability tab in ADOC.

Prerequisites

Ensure that you add the following permissions to your service account before adding GCS as a data source in the ADOC platform:

Steps to Add Google Cloud Storage as a Data Source

To add Google Cloud Storage as a Data source:

  1. Click Register from the left pane.
  2. Click Add Data Source.
  3. Select the Google Cloud Storage Data Source. The Google Cloud Storage Data Source Basic Details page is displayed.
  4. Enter a name for the data source in the Data Source name field.
  5. (Optional) Enter a description for the data source in the Description field.
  6. Select a Data Plane from the Select Data Plane drop-down menu.

To create a new Data Plane, click Setup Dataplane.

You must either create a Data Plane or use an existing Data Plane to enable the Data Reliability capability.

  7. Click Next. The GCS Connection Details page is displayed.
  8. Upload the GCS credentials file and enter the connection details of your GCS account:
    • Credential File: Upload the JSON file that contains your GCP service account credentials. You can download the credentials file from the GCP project linked with the GCS bucket to which you want to connect.
    • Project Name: Specify the name of the GCP project that contains the GCS bucket.
    • Bucket Name: Enter the name of the GCS bucket in your GCP project that you want to connect to.
    • Monitoring Channel Type (Optional): Select the channel used to monitor events related to the GCS connection.

A monitoring channel type in ADOC is the mechanism used to detect data changes or updates in a dataset. This is especially important when integrating external storage or data systems with ADOC, because you must keep track of data changes, such as file updates or new data inputs, in real time or near real time.

For a Google Cloud Storage (GCS) data source, the monitoring channel type can be one of the following:

  • None: No monitoring channel is used. ADOC does not automatically monitor changes to the data source.
  • Google Pub/Sub: Uses Google Cloud Pub/Sub to send and receive notifications about changes or events in the data source. When a new file is uploaded or an existing file is modified, a Pub/Sub message can trigger an event in ADOC.

Choosing the appropriate monitoring channel type is critical for ensuring that data observability meets your organization's objectives, particularly where real-time data accuracy and responsiveness are essential.
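
To make the Google Pub/Sub option more concrete, the following is a minimal sketch of how a Cloud Storage notification could be classified from its Pub/Sub message attributes. The attribute names (eventType, bucketId, objectId) and event type values are the ones Cloud Storage publishes. The function name and the way the events are grouped are assumptions for illustration only and do not describe how ADOC processes these messages internally.

```python
"""Sketch: classify a Cloud Storage Pub/Sub notification by its attributes."""

# Event types that Cloud Storage publishes to Pub/Sub.
CHANGE_EVENTS = {"OBJECT_FINALIZE", "OBJECT_METADATA_UPDATE"}
REMOVAL_EVENTS = {"OBJECT_DELETE", "OBJECT_ARCHIVE"}


def describe_notification(attributes: dict) -> str:
    """Return a short description of the change that one notification reports.

    `attributes` is the attribute map of a single Pub/Sub message.
    This helper is hypothetical and is not part of ADOC.
    """
    event_type = attributes.get("eventType", "")
    bucket = attributes.get("bucketId", "?")
    name = attributes.get("objectId", "?")
    location = f"gs://{bucket}/{name}"
    if event_type in CHANGE_EVENTS:
        return f"created or updated: {location}"
    if event_type in REMOVAL_EVENTS:
        return f"removed: {location}"
    return f"ignored event {event_type!r}: {location}"


# Example attribute map (values are made up for illustration).
sample = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "acceldatabucket",
    "objectId": "observability.csv",
}
print(describe_notification(sample))
# prints: created or updated: gs://acceldatabucket/observability.csv
```

Cloud Storage sends OBJECT_FINALIZE both when a new object is created and when an existing object is overwritten, which is why the sketch treats the two cases as one.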

  9. Click Test Connection.

If your credentials are valid, you receive a Connected message. If you get an error message, validate the GCS credentials file and Google Client Email ID that you entered.
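
If the connection test fails, a quick local check of the credentials file can rule out a malformed or incomplete key before you upload it again. The sketch below assumes a standard GCP service account key file, which is a JSON object containing fields such as type, project_id, private_key, and client_email. The function name and the choice of fields to check are illustrative assumptions, not ADOC's own validation logic.

```python
import json

# Fields found in a standard GCP service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_credentials(text: str) -> list:
    """Return a list of problems found in the key file contents (empty if none).

    Hypothetical helper for a local sanity check; it does not contact GCP.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not data.get(name)]
    if data.get("type") not in (None, "", "service_account"):
        problems.append(f"unexpected type: {data['type']!r}")
    return problems


# Example with made-up values; the private key is deliberately left out.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "client_email": "reader@my-project.iam.gserviceaccount.com",
})
print(check_credentials(sample))
# prints: ['missing field: private_key']
```

To check a real file, read its contents with open(path).read() and pass the text to check_credentials. An empty list only means the file is well formed; it does not confirm that the service account has the permissions ADOC needs.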

  10. Click Next. The Observability Setup page is displayed.
  11. Enter the name of the project in which your data exists in the Project Name field.
  12. Provide the Asset Name, Path Expression, and File Type of your GCS assets.
  • Based on the file type selected, you must enter values for the required parameters:

| File Type | Parameter | Description |
| --- | --- | --- |
| CSV | Delimiter | The character that separates fields in a CSV file. Common delimiters include commas (,), tabs (\t), or semicolons (;). |
| ORC | | No additional parameters are required for ORC files. |
| PARQUET | File Processing Strategy | Options include: Evolving Schema (no additional parameters required), Random Files, or Date Partitioned. |
| | Base Path (Random Files) | The root directory or location in the storage system where the Parquet files are stored. This is used to locate the data for random file processing. |
| | Base Path (Date Partitioned) | The root directory or location where the date-partitioned Parquet files are stored. |
| | Pattern (Date Partitioned) | A file pattern that includes a date (e.g., "file-<yyyy-MM-dd>.parquet") to identify the specific files for processing. |
| | LookBack Days (Date Partitioned) | The number of days to look back when crawling and processing date-partitioned Parquet files. |
| | TimeZone (Date Partitioned) | The time zone in which the partitioned data is recorded. |
| JSON | Flattening Level | Defines how deeply nested JSON structures will be flattened. Nested JSON fields will be expanded based on the level specified. |
| | MultiLine JSON | When enabled, this toggle allows for the processing of JSON data that spans multiple lines. |
| AVRO | Schema Store Type | Specifies where the AVRO schema is stored. Options could include local files, a schema registry, or other storage systems. |
| Delta | | No additional parameters are required for Delta files. |
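
To illustrate how the Pattern, LookBack Days, and TimeZone parameters of the Date Partitioned strategy relate to each other, the following sketch expands a pattern into the file names it would cover. It is only an approximation under stated assumptions: that the date token is written exactly as <yyyy-MM-dd>, and that the lookback window includes the current day. Neither assumption is confirmed by the ADOC documentation, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone, tzinfo

TOKEN = "<yyyy-MM-dd>"  # date placeholder as written in the example pattern


def expected_files(pattern: str, lookback_days: int, tz: tzinfo, now: datetime) -> list:
    """List the file names the pattern yields for today and the previous lookback_days days.

    `now` must be a timezone-aware datetime; "today" is evaluated in `tz`.
    """
    today = now.astimezone(tz).date()
    days = [today - timedelta(days=offset) for offset in range(lookback_days + 1)]
    return [pattern.replace(TOKEN, day.strftime("%Y-%m-%d")) for day in days]


# A fixed instant keeps the example reproducible: 00:30 UTC on 1 March 2024.
now = datetime(2024, 3, 1, 0, 30, tzinfo=timezone.utc)

print(expected_files("file-<yyyy-MM-dd>.parquet", 2, timezone.utc, now))
# prints: ['file-2024-03-01.parquet', 'file-2024-02-29.parquet', 'file-2024-02-28.parquet']

# The same instant is still 29 February in a UTC-5 time zone.
utc_minus_5 = timezone(timedelta(hours=-5))
print(expected_files("file-<yyyy-MM-dd>.parquet", 0, utc_minus_5, now))
# prints: ['file-2024-02-29.parquet']
```

The second call shows why the TimeZone parameter matters: near midnight, the same instant maps to different partition dates depending on the time zone in which the data is recorded.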

Important

  • The asset path expression must be a full path address. The syntax is gs://bucket-name/file-name. For example, gs://acceldatabucket/observability.csv.
  • To include all the files in the bucket, use the syntax gs://bucket-name. This includes all the files stored in the Google Cloud Storage bucket, irrespective of the file type.
  • To include all the files of a specific file type, use the syntax gs://bucket-name/*.file-extension. For example, to include all the CSV files in the bucket, use gs://bucket-name/*.csv. Similarly, to include all the JSON files in the bucket, use gs://bucket-name/*.json.
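
The three forms of path expression above can be sketched as follows. This example uses Python's fnmatch module as a stand-in for wildcard matching, which is an assumption: ADOC's actual matching rules (for example, how * treats / in object names) are not described here. The function names and the object list are made up for illustration.

```python
from fnmatch import fnmatchcase


def split_expression(expression: str) -> tuple:
    """Split 'gs://bucket/pattern' into (bucket, pattern); pattern is '*' if omitted."""
    prefix = "gs://"
    if not expression.startswith(prefix):
        raise ValueError("path expression must start with gs://")
    bucket, _, pattern = expression[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("bucket name is missing")
    return bucket, pattern or "*"


def select(expression: str, object_names: list) -> list:
    """Return the object names that the path expression would include."""
    _bucket, pattern = split_expression(expression)
    return [name for name in object_names if fnmatchcase(name, pattern)]


# Hypothetical bucket contents.
objects = ["observability.csv", "events.json", "sales.csv"]

print(select("gs://acceldatabucket/observability.csv", objects))
# prints: ['observability.csv']
print(select("gs://acceldatabucket", objects))
# prints: ['observability.csv', 'events.json', 'sales.csv']
print(select("gs://acceldatabucket/*.csv", objects))
# prints: ['observability.csv', 'sales.csv']
```
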
  13. (Optional) To add more assets, click +. The selected assets are monitored by the Data Reliability capability of ADOC.
  14. Enable Schema Drift Monitoring: For newly crawled assets, Acceldata automatically enables schema drift monitoring. You can further configure these monitoring policies per asset in the application.

      Warning: Schema changes are not automatically detected unless the crawler is executed. Acceldata recommends scheduling the crawler.
  15. Enable Crawler Execution Schedule: Turn on this toggle switch to select a time tag and time zone to schedule the execution of crawlers for Data Reliability.

  16. Click Submit.

GCS is now added as a data source. You can choose to crawl your GCS Data Source now or later.

A new card is created for GCS on the Data Sources page. This card displays the crawler status and other details of your GCS data source.

For more information on cross-account implementation, see Cross Account Implementation.
