The first article in this series provided an overview of a data lake solution architecture using Data Fusion for data integration and Cloud Composer for orchestration.
In this article, I will provide an overview of the detailed solution design based on that architecture. This article assumes you have some basic understanding of GCP Data Fusion and Composer. If you are new to GCP, you can start by reading the previous article in this series to get an understanding of the different services used in the architecture before proceeding here.
The solution design described here provides a framework to ingest a large number of source objects through the use of simple configurations. Once the framework is developed, adding new sources or objects to the data lake ingestion only requires adding new configurations for the new source.
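To make the configuration-driven idea concrete, here is a minimal sketch of what one such configuration could look like and how it might be loaded. The JSON keys, the `SourceObjectConfig` class, and the pipeline and table names are all hypothetical, chosen only for illustration; they are not the framework's actual schema.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical configuration: every key and value here is an illustrative
# assumption, not the framework's real schema.
SOURCES_JSON = """
[
  {"source": "sales_db", "object_name": "orders",
   "pipeline": "rdbms_to_bq", "target_table": "lake.orders"},
  {"source": "sales_db", "object_name": "customers",
   "pipeline": "rdbms_to_bq", "target_table": "lake.customers"}
]
"""


@dataclass(frozen=True)
class SourceObjectConfig:
    """One source object to ingest, described purely by configuration."""
    source: str        # logical name of the source system
    object_name: str   # table or file to ingest from that source
    pipeline: str      # reusable Data Fusion pipeline that moves the data
    target_table: str  # destination in the data lake


def load_configs(raw: str) -> List[SourceObjectConfig]:
    """Parse the JSON text into one typed config per source object."""
    return [SourceObjectConfig(**entry) for entry in json.loads(raw)]


configs = load_configs(SOURCES_JSON)
for cfg in configs:
    print(f"{cfg.source}.{cfg.object_name} -> {cfg.target_table} via {cfg.pipeline}")
```

In this sketch, onboarding a new object means appending one more entry to the JSON; both existing entries reuse the same pipeline, and no code changes.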
I will publish the code for this framework in the near future. Look out for an update to this blog.
The solution design comprises four broad components.
- Data Fusion pipelines for data movement
- Custom pre-ingestion and post-ingestion tasks
- Configurations to provide inputs to reusable components and tasks
- Composer DAGs to execute the custom tasks and to call Data Fusion pipelines based on configurations
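As a rough illustration of how these four components could fit together, the sketch below turns one configuration entry into an ordered list of steps: custom pre-ingestion tasks, then the Data Fusion pipeline call, then custom post-ingestion tasks. It is plain Python rather than an Airflow DAG, and the function name, config keys, and task names are assumptions made for the example.

```python
from typing import Any, Dict, List


def build_task_plan(config: Dict[str, Any]) -> List[str]:
    """Return the ordered step names for one configured source object.

    Hypothetical helper: in a real Composer DAG each step would be an
    Airflow task, and the order would be expressed as task dependencies.
    """
    name = f"{config['source']}.{config['object_name']}"
    # Custom pre-ingestion tasks run first, in the configured order.
    plan = [f"pre:{task}:{name}" for task in config.get("pre_tasks", [])]
    # The reusable Data Fusion pipeline moves the data.
    plan.append(f"datafusion:{config['pipeline']}:{name}")
    # Custom post-ingestion tasks run after the data has landed.
    plan += [f"post:{task}:{name}" for task in config.get("post_tasks", [])]
    return plan


# Illustrative configuration entry; all values are made up.
example_config = {
    "source": "sales_db",
    "object_name": "orders",
    "pipeline": "rdbms_to_bq",
    "pre_tasks": ["check_source_available"],
    "post_tasks": ["validate_row_count", "archive_input"],
}

plan = build_task_plan(example_config)
for step in plan:
    print(step)
```

Because the plan is derived entirely from the configuration, the same function serves every source object; only the configuration entry changes.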
Let me start with a high-level view of the Composer DAG that orchestrates all the parts of the solution, and then provide insight into the different pieces of the solution in the following sections.
Composer DAG structure
The Composer DAG is the workflow orchestrator. In this framework, it broadly comprises the components shown in the image below.