Data Mesh: Reference Architecture

Sash Barige
Aug 13, 2022

In my previous post I covered when to consider Data Mesh. Data federation will occur to support your enterprise-wide corporate systems such as CRM (customer relationship management) and HCM (human capital management), as well as your operational systems such as supply chain and inventory management. You'll notice that the reference architecture below is similar to what you'd consider with or without Data Mesh. However, once data is productized within each domain, you must also consider tools and approaches such as data lineage, Data Mesh governance, and version control for data.

A reference architecture for Data Mesh lays out the components and principles that guide the design and implementation of a Data Mesh within an organization.

Here is a simplified reference architecture that can serve as a starting point:

Data Domains:

Data Products: Break down your organization's data into discrete, self-contained "data products" based on business domains. Each data product should be owned by the respective domain teams and treated as a standalone entity.
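
To make "standalone entity" a little more concrete, here is a minimal sketch of how a domain team might describe one data product in code. The class name, field names, and the example values (supply_chain, orders_daily, and so on) are all assumptions made for illustration, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for one data product; every field name is illustrative."""
    name: str
    domain: str
    owner_team: str
    version: str
    schema: dict = field(default_factory=dict)  # column name -> type name

    @property
    def qualified_name(self) -> str:
        # Prefixing with the domain keeps products from different domains distinct.
        return f"{self.domain}.{self.name}"


orders = DataProduct(
    name="orders_daily",
    domain="supply_chain",
    owner_team="supply-chain-data",
    version="1.0.0",
    schema={"order_id": "string", "quantity": "int", "shipped_at": "timestamp"},
)
print(orders.qualified_name)
```

Keeping the owner and version on the descriptor itself means ownership travels with the product rather than living in a separate spreadsheet.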

Domain Teams:

Data Product Teams: Each domain should have its own dedicated Data Product Team responsible for end-to-end ownership of the domain's data products. This team typically includes data engineers, data scientists, and domain experts.

Data Ingestion:

Ingestion Layer: Data from various sources is ingested into the Data Mesh through an ingestion layer. This layer should provide connectors to various data sources and support batch and streaming data.
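
As a rough sketch of what "batch and streaming" can mean at the connector level, the example below uses only the Python standard library. The function names, the `_source` tag, and the sample records are hypothetical; a real ingestion layer would sit on top of whatever connectors your platform provides.

```python
import csv
import io
from typing import Iterable, Iterator


def read_batch(csv_text: str) -> list:
    """Batch connector: load a whole CSV extract at once."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def read_stream(events: Iterable[dict]) -> Iterator[dict]:
    """Streaming connector: yield events one at a time as they arrive."""
    for event in events:
        yield event


def ingest(records: Iterable[dict], source: str) -> list:
    """Tag each record with its source so later stages can trace its origin."""
    return [{**record, "_source": source} for record in records]


batch = ingest(read_batch("order_id,quantity\nA1,3\nA2,5\n"), source="erp_export")
stream = ingest(read_stream([{"order_id": "A3", "quantity": "1"}]), source="order_events")
landed = batch + stream
print(len(landed), "records landed")
```

Both paths end in the same `ingest` step, so downstream code does not need to know whether a record arrived in a batch or on a stream.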

Data Processing and Transformation:

Compute Layer: A compute layer processes and transforms raw data into refined data products. It can utilize technologies like Apache Spark, Apache Flink, or custom data processing tools.
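
The raw-to-refined step can be illustrated without any cluster at all. The sketch below is plain Python standing in for what would normally be a Spark or Flink job; the function name, column names, and cleaning rules are assumptions chosen for the example.

```python
from collections import defaultdict


def refine_orders(raw_rows):
    """Drop malformed rows, cast quantity to int, and total it per SKU."""
    totals = defaultdict(int)
    for row in raw_rows:
        try:
            sku = row["sku"]
            quantity = int(row["quantity"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing field or a non-numeric quantity
        if quantity > 0:
            totals[sku] += quantity
    return [{"sku": sku, "total_quantity": qty} for sku, qty in sorted(totals.items())]


raw = [
    {"sku": "widget", "quantity": "3"},
    {"sku": "widget", "quantity": "2"},
    {"sku": "gadget", "quantity": "oops"},
    {"sku": "gadget", "quantity": "4"},
]
refined = refine_orders(raw)
print(refined)
```

The same shape (filter, cast, aggregate) carries over to a distributed engine; only the execution layer changes.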

Data Storage:

Data Lake: Raw and refined data is stored in a centralized data lake or a collection of data lakes, making it accessible to domain teams for further processing and analysis.

Data Catalog and Discovery:

Data Catalog: Implement a centralized data catalog or metadata repository (e.g., DataHub) to index and describe available data products, making them discoverable by other teams.
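
To show the register-and-discover idea in miniature, here is a toy in-memory catalog. It is not the DataHub API; the class, its methods, and the sample entries are hypothetical and only illustrate what a catalog has to do.

```python
class DataCatalog:
    """Toy in-memory catalog; a stand-in for a real metadata service."""

    def __init__(self):
        self._entries = {}

    def register(self, name, domain, description, tags):
        self._entries[name] = {
            "name": name,
            "domain": domain,
            "description": description,
            "tags": list(tags),
        }

    def search(self, keyword):
        """Return names of products whose description or tags mention the keyword."""
        needle = keyword.lower()
        return sorted(
            name
            for name, entry in self._entries.items()
            if needle in entry["description"].lower()
            or needle in [tag.lower() for tag in entry["tags"]]
        )


catalog = DataCatalog()
catalog.register("orders_daily", "supply_chain", "Daily order totals per SKU", ["orders"])
catalog.register("stock_levels", "inventory", "Current stock per warehouse", ["inventory", "stock"])
print(catalog.search("stock"))
```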

Data Quality and Governance:

Data Quality Framework: Utilize tools like Great Expectations for data quality validation, ensuring that data products meet defined expectations and standards.
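
The idea of validating data against declared expectations can be sketched in a few lines of plain Python. This deliberately does not use the Great Expectations API; the function names, the checks, and the sample rows are assumptions for illustration only.

```python
def expect_not_null(rows, column):
    """Every row must carry a value for the column."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "passed": not failed, "failed_rows": failed}


def expect_between(rows, column, low, high):
    """Non-null values must fall inside the inclusive range [low, high]."""
    failed = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": f"{column} in [{low}, {high}]", "passed": not failed, "failed_rows": failed}


def validate_orders(rows):
    """Run every expectation and report overall success plus per-check detail."""
    results = [
        expect_not_null(rows, "order_id"),
        expect_between(rows, "quantity", 1, 1000),
    ]
    return all(result["passed"] for result in results), results


rows = [
    {"order_id": "A1", "quantity": 3},
    {"order_id": None, "quantity": 5},
    {"order_id": "A3", "quantity": 0},
]
ok, report = validate_orders(rows)
print(ok, report)
```

Returning the failing row indexes, rather than just a pass/fail flag, gives the owning team something to act on.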

Data Access and Consumption:

Data APIs: Each data product should expose APIs for consumption, ensuring that other domain teams can access the data in a self-serve manner.
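
A self-serve API does not have to start as a web service. The sketch below models a read-only interface with a declared column list and a version; the class name `OutputPort`, its `read` method, and its parameters are hypothetical.

```python
class OutputPort:
    """Hypothetical read-only interface that a data product exposes to consumers."""

    def __init__(self, columns, rows, version):
        self.columns = list(columns)  # the published contract
        self.version = version
        self._rows = rows

    def read(self, columns=None, limit=None):
        """Return rows, optionally projected to some columns and capped at limit."""
        wanted = self.columns if columns is None else list(columns)
        unknown = [c for c in wanted if c not in self.columns]
        if unknown:
            raise KeyError(f"unknown columns: {unknown}")
        rows = self._rows if limit is None else self._rows[:limit]
        return [{c: row[c] for c in wanted} for row in rows]


port = OutputPort(
    columns=["sku", "total_quantity"],
    rows=[
        {"sku": "gadget", "total_quantity": 4},
        {"sku": "widget", "total_quantity": 5},
    ],
    version="1.0.0",
)
print(port.read(columns=["sku"], limit=1))
```

Rejecting unknown columns up front makes the published column list behave like a contract that consumers can rely on.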

Security and Access Control:

Access Control: Implement fine-grained access control and security measures to protect sensitive data, ensuring data privacy and compliance with regulatory requirements.
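
Fine-grained access control often comes down to deciding, per role, which columns a consumer may see. Here is a minimal sketch of column-level masking; the roles, the policy table, and the mask value are assumptions, and a production setup would delegate this to your platform's access management.

```python
# Hypothetical policy: role -> columns that role may see in clear text.
POLICY = {
    "analyst": {"order_id", "quantity"},
    "support": {"order_id", "quantity", "customer_email"},
}


def apply_policy(rows, role):
    """Mask every column the role is not allowed to see; reject unknown roles."""
    allowed = POLICY.get(role)
    if allowed is None:
        raise PermissionError(f"role {role!r} has no access to this data product")
    return [
        {column: (value if column in allowed else "***") for column, value in row.items()}
        for row in rows
    ]


rows = [{"order_id": "A1", "quantity": 3, "customer_email": "a@example.com"}]
print(apply_policy(rows, "analyst"))
```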

Monitoring and Observability:

Monitoring Tools: Utilize monitoring and observability tools to track the health, performance, and data quality of the Data Mesh.
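
One of the simplest health signals for a data product is freshness: how long ago was it last updated, and is that within the agreed limit? The sketch below assumes a hypothetical `check_freshness` helper and a six-hour limit chosen purely for the example.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_updated, max_age, now=None):
    """Report the age of a data product and whether it is within max_age."""
    now = now or datetime.now(timezone.utc)
    age = now - last_updated
    return {"age_seconds": age.total_seconds(), "fresh": age <= max_age}


# A fixed "now" keeps the example deterministic.
now = datetime(2022, 8, 13, 12, 0, tzinfo=timezone.utc)
status = check_freshness(
    last_updated=datetime(2022, 8, 13, 9, 0, tzinfo=timezone.utc),
    max_age=timedelta(hours=6),
    now=now,
)
print(status)
```

A monitoring tool would run a check like this on a schedule and alert the owning domain team when `fresh` turns false.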

Data Collaboration and Culture:

Cross-Functional Collaboration: Foster a culture of collaboration and knowledge sharing among domain teams, data product teams, and the central data infrastructure team.

DevOps and CI/CD:

Continuous Integration and Deployment (CI/CD): Implement CI/CD pipelines to manage changes to the data products and ensure data pipeline reliability and efficiency.
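
A useful check to run in a CI/CD pipeline for a data product is whether a proposed schema change would break existing consumers. The sketch below treats added columns as safe and removed or retyped columns as breaking; that rule, the function name, and the schemas are assumptions for illustration.

```python
def breaking_changes(old_schema, new_schema):
    """List changes that would break consumers relying on the old schema."""
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new_schema[column]}")
    return problems


old = {"order_id": "string", "quantity": "int"}
additive = {"order_id": "string", "quantity": "int", "currency": "string"}
retyped = {"order_id": "string", "quantity": "string"}

print(breaking_changes(old, additive))
print(breaking_changes(old, retyped))
```

In a pipeline, a non-empty result would fail the build or require a major version bump before the change is deployed.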

Data Mesh Governance:

Governance Framework: Establish clear governance policies and procedures that ensure data quality, data lineage, data ownership, and data stewardship.

Metadata Management:

Metadata Management Tools: Use metadata management tools to capture and maintain metadata about data products, schemas, lineage, and transformations.

Data Mesh Tools:

Data Lineage Tools: Implement tools that help in tracking data lineage to understand how data flows through the Data Mesh.
Collaboration Platforms: Use collaboration platforms to facilitate communication and knowledge sharing among domain teams.
Version Control: Implement version control tools for data pipelines to track changes and manage data quality.
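
To illustrate what a lineage tool answers, the sketch below stores lineage as a simple mapping from each dataset to its direct inputs and walks it to find everything a data product depends on. The dataset names and the `upstream` helper are hypothetical.

```python
# Hypothetical lineage graph: dataset -> the datasets it is built from.
LINEAGE = {
    "sales_dashboard": ["orders_daily", "customers_clean"],
    "orders_daily": ["raw_orders"],
    "customers_clean": ["raw_customers"],
}


def upstream(dataset, edges):
    """Return every dataset that feeds into the given one, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream("sales_dashboard", LINEAGE)))
```

The same traversal, run in the other direction, tells you which downstream products are affected when a source changes.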

The specific tools and technologies you choose for each component of the Data Mesh will vary depending on your organization's needs and existing infrastructure. Treat the reference architecture as a guide for designing and implementing a Data Mesh that suits your organization's unique requirements and challenges.

Photos: unsplash.com

References:

Dehghani, Z. (2019). Data Mesh: A Paradigm Shift in Data Platform Architecture. ThoughtWorks.

LinkedIn Engineering. (2022). DataHub: LinkedIn's Metadata Search and Discovery Platform. LinkedIn Engineering Blog.

Intuit. (2022). Embracing Data Mesh at Intuit. Intuit Engineering Blog.

Spotify Engineering. (2021). The Spotify Data Mesh. Spotify Engineering Blog.

LakeFS. (2022). Home. LakeFS.

Databand. (2022). Home. Databand.

Great Expectations. (2022). Home. Great Expectations.

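Dehghani, Z. (2019). How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. martinfowler.com. https://martinfowler.com/articles/data-monolith-to-mesh.html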