Data Catalog lets you ingest and edit business metadata through an interactive interface. It also includes programmatic interfaces that you can use to automate common tasks. Many enterprises need to define and collect a set of metadata using Data Catalog, so we'll offer some best practices here on how to declare, create, and maintain this metadata over the long run.
In our previous post, we looked at how tag templates can facilitate data discovery, governance, and quality control by describing a vocabulary for categorizing data assets. In this post, we'll explore how to tag data using tag templates. Tagging refers to creating an instance of a tag template and assigning values to the fields of the template in order to classify a specific data asset. As of this writing, Data Catalog supports three storage back ends: BigQuery, Cloud Storage, and Pub/Sub. We'll focus here on tagging assets that are stored on those back ends, such as tables, columns, files, and message topics.
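To make the relationship between a template and a tag concrete, here is a minimal sketch in plain Python. It does not use the Data Catalog client library: the TagTemplate and Tag classes, the create_tag helper, the asset name, and the field values are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class TagTemplate:
    """Hypothetical stand-in for a tag template: field name -> expected type."""
    name: str
    fields: Dict[str, type]


@dataclass
class Tag:
    """An instance of a template attached to one data asset."""
    template: TagTemplate
    asset: str
    values: Dict[str, Any]


def create_tag(template: TagTemplate, asset: str, values: Dict[str, Any]) -> Tag:
    """Create a tag, rejecting fields the template does not declare."""
    unknown = set(values) - set(template.fields)
    if unknown:
        raise ValueError(f"fields not in template {template.name}: {sorted(unknown)}")
    for field_name, value in values.items():
        expected = template.fields[field_name]
        if not isinstance(value, expected):
            raise TypeError(f"{field_name} must be of type {expected.__name__}")
    return Tag(template=template, asset=asset, values=dict(values))


# Illustrative template and values; none of these come from a real catalog.
governance = TagTemplate(
    name="data_governance",
    fields={"data_domain": str, "data_confidentiality": str, "data_retention": str},
)
tag = create_tag(
    governance,
    asset="example_dataset.example_table",
    values={"data_domain": "HR", "data_confidentiality": "SENSITIVE"},
)
```

The template acts as the vocabulary and the tag as one classification of one asset, so a typo in a field name fails at tagging time instead of producing an unsearchable tag.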
We'll describe three usage models that are suitable for tagging data within a data lake and data warehouse environment: provisioning a new data source, processing derived data, and updating tags and templates. For each scenario, you'll see our suggested approach for tagging data at scale.
1. Provisioning data sources
Provisioning a data source typically entails several activities: creating tables or files depending on the storage back end, populating them with some initial data, and setting access permissions on those resources. We add one more activity to this list: tagging the newly created resources in Data Catalog. Here's what that step entails.
Tagging a data source requires a domain expert who understands both the meaning of the tag templates to be used and the semantics of the data in the data source. Based on that knowledge, the domain expert chooses which templates to attach as well as what type of tag to create from those templates. It is important for a human to be in the loop, given that many decisions rely on the accuracy of the tags.
We've observed two types of tags based on our work with clients. One type is referred to as static because the field values are known ahead of time and are expected to change only infrequently. The other type is referred to as dynamic because the field values change on a regular basis based on the contents of the underlying data. An example of a static tag is the collection of data governance fields that include data_domain, data_confidentiality, and data_retention. The values of those fields are determined by an organization's data usage policies. They are typically known by the time the data source is created, and they do not change frequently. An example of a dynamic tag is the collection of data quality fields, such as number_values, unique_values, min_value, and max_value. Those field values are expected to change frequently, whenever a new load runs or modifications are made to the data source.
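The difference can be illustrated with a small Python sketch. It assumes that number_values means the count of values in a column and that the data is already available as an in-memory list; the compute_quality_fields helper and the sample values are hypothetical and not part of Data Catalog.

```python
from typing import Dict, Sequence

# Static tag: values come from policy and are set once (example values are made up).
STATIC_GOVERNANCE_FIELDS = {
    "data_domain": "HR",
    "data_confidentiality": "SENSITIVE",
    "data_retention": "2_YEARS",
}


def compute_quality_fields(column_values: Sequence[float]) -> Dict[str, float]:
    """Recompute dynamic data quality fields from the current column contents."""
    if not column_values:
        raise ValueError("cannot compute quality fields for an empty column")
    return {
        "number_values": len(column_values),       # assumed to mean row count
        "unique_values": len(set(column_values)),  # distinct values
        "min_value": min(column_values),
        "max_value": max(column_values),
    }


# A new load appends one row, so the dynamic fields change while the static ones do not.
before_load = compute_quality_fields([3, 7, 7, 10])
after_load = compute_quality_fields([3, 7, 7, 10, 42])
```

Here before_load and after_load differ in number_values, unique_values, and max_value, whereas STATIC_GOVERNANCE_FIELDS would only change if the organization's policy changed.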
In addition to these differences, static tags also have a cascade property that indicates how their fields should be propagated from source to derivative data. (We'll expand on this concept in a later section.) By contrast, dynamic tags have a query expression and a refresh property to indicate the query that should be used to calculate the field values and the frequency at which they should be recalculated. An example of a config for a static tag is shown in the first code snippet, and one for a dynamic tag is shown in the second.
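Separately from the post's own config snippets, the extra properties can be modeled roughly in Python as follows. This is a sketch under stated assumptions: the class names, the boolean form of cascade, the representation of refresh as a time interval, the query text, and the is_refresh_due helper are all illustrative and do not reflect the actual config format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class StaticTagConfig:
    template: str
    fields: Dict[str, str]  # field name -> fixed value
    cascade: bool           # assumption: True means propagate to derivative data


@dataclass
class DynamicTagConfig:
    template: str
    query_expressions: Dict[str, str]  # field name -> query that computes it
    refresh: timedelta                 # assumption: how often to recalculate


def is_refresh_due(config: DynamicTagConfig, last_run: datetime, now: datetime) -> bool:
    """Return True once at least one refresh interval has passed since last_run."""
    return now - last_run >= config.refresh


static_cfg = StaticTagConfig(
    template="data_governance",
    fields={"data_domain": "HR", "data_retention": "2_YEARS"},
    cascade=True,
)
dynamic_cfg = DynamicTagConfig(
    template="data_quality",
    # Hypothetical query against a made-up table and column.
    query_expressions={"max_value": "SELECT MAX(salary) FROM example_dataset.example_table"},
    refresh=timedelta(hours=1),
)

last_run = datetime(2024, 1, 1, 0, 0)
due_after_30_min = is_refresh_due(dynamic_cfg, last_run, datetime(2024, 1, 1, 0, 30))
due_after_1_hour = is_refresh_due(dynamic_cfg, last_run, datetime(2024, 1, 1, 1, 0))
```

Keeping the refresh interval next to the query lets a scheduler decide which dynamic tags to recalculate without inspecting the underlying data.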