Ingesting data reliably and securely is a major governance challenge, and we often see clients struggle with it. The old ways of doing data governance no longer work, and programs built on them rarely have a chance to succeed. Some big names, such as Gartner, have even declared existing data governance dead.
Data governance is frequently done through manual, static, custom-built processes. It is often tied to ITIL-based change management systems, with database administrators driving most of the hands-on work.
Data sets keep growing, and the technology world has been moving to the cloud for several years. Without automation, it is nearly impossible to maintain accurate control.
With big data, data needs to be ingested much faster than it used to be. Data privacy and security needs add to the complexity. A manual approach is no longer adequate: complexity has grown beyond what manual or ad hoc techniques in GUI-based tools can manage.
At Infostrux, we believe the path to improving data governance is through automation and tools that support dynamic data governance.
What is Data Governance?
Data governance is a set of principles and practices that keep data and data usage reliable and trustworthy.
It is also a way to ensure the data is not misused. Data governance usually involves three groups:
- A steering committee that provides overall executive sponsorship, holds everyone accountable, and champions data across the enterprise.
- A governance team that sets high-level priorities and general standards of operation.
- A data team that champions best practices and frameworks so that teams across the organization can adopt data quickly while maintaining governance. This team typically creates the frameworks and empowers other teams to use the data.
Reasons why we need to govern data:
- It helps avoid potential disasters for your company, such as data being misused or leaked, which matters most when the data contains sensitive information.
- Without governance, data ends up siloed with no clear owners or catalog, making it unusable to most of the organization.
- High data quality increases confidence in the data, while quality issues erode it.
Data governance controls the following aspects of your data:
- The data catalog – a tool that provides a standard metadata inventory for the organization. It helps people discover data and improves access to it while strengthening governance.
- The glossary – a business-focused dictionary of standard definitions maintained across the organization. It helps avoid misunderstandings about what things mean.
- Data Lineage – an understanding of where the data came from and which systems touched and moved it.
- Data Quality – keeping data quality high by applying the testing pyramid: unit tests, integration tests, and some basic UI tests.
- Data Security – auditing and security, including roles and responsibilities.
Challenges of Data-Driven Organizations
Organizations face many challenges in gaining reliable data sets. As noted above, data programs have a high rate of failure.
Here are some of the most common challenges:
- A non-centralized platform leads to inconsistency, often due to political issues
- Talent with the high-demand skill set is scarce
- Poor governance produces poor-quality data
- Processes for data ingestion frequently fail
- Data models are inconsistent
- Data is incomplete
- Change is hard, and teams often resist change
- Current technologies cannot keep up with business demands. This ties into IT teams wanting to keep existing supporting systems that use non-standard formats.
- The architecture is too complex to manage big data sets with good quality
- Security and compliance are not well understood
What does good data governance look like?
Cloud-based data architecture allows organizations to focus on data intelligence, not system management.
- Manage most aspects of operating your data through automation and systems thinking, including infrastructure, data quality, catalog management, security, and compliance.
- Balance control with data access so that data gives your organization a competitive edge.
- Take an agile approach that weighs quick results against high impact.
Let us look at how automation built on DataOps principles can help. In particular, an ‘everything as code’ approach makes the data pipelines and models testable.
In general, governance becomes stronger with everything as code, because you design and enforce policy-based automation in the code base and in the CI/CD pipeline. Much like DevSecOps, DataOps will evolve into a policy-based mechanism.
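As a minimal sketch of policy-based automation, the check below could run in a CI job and fail the build when a model definition breaks a governance policy. The model dictionaries, the field names (`owner`, `description`, `contains_pii`, `masking_policy`), and the two policies themselves are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical governance policies: every model needs an owner and a
# description, and any model flagged as containing PII needs a masking policy.
REQUIRED_FIELDS = ("owner", "description")


def check_model_policies(models: List[Dict]) -> List[str]:
    """Return a list of human-readable policy violations (empty if compliant)."""
    violations = []
    for model in models:
        name = model.get("name", "<unnamed>")
        for field in REQUIRED_FIELDS:
            if not model.get(field):
                violations.append(f"{name}: missing required field '{field}'")
        if model.get("contains_pii") and not model.get("masking_policy"):
            violations.append(f"{name}: PII model has no masking policy")
    return violations


models = [
    {"name": "dim_customer", "owner": "sales-data",
     "description": "Customer dimension",
     "contains_pii": True, "masking_policy": "mask_email"},
    {"name": "fct_orders", "owner": "", "description": "Order facts"},
]

violations = check_model_policies(models)
for violation in violations:
    print(violation)
# In CI, a non-empty list would typically fail the job (for example via sys.exit(1)).
```

Because the policies live in code, changing them goes through the same pull request and review process as any other change.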
A data catalog helps users find and manage data sets across many systems. With everything as code, you would update the data catalog automatically with every new model, dimension, and fact so the data catalog would never miss any information.
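One way to picture this: the catalog is rebuilt from the model definitions on every pipeline run, so a model cannot exist in code without appearing in the catalog. The `Model` class and `build_catalog` function below are hypothetical names for illustration, not the API of any particular catalog tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Model:
    """A data model as declared in code (hypothetical structure)."""
    name: str
    kind: str  # e.g. "dimension" or "fact"
    columns: List[str] = field(default_factory=list)


def build_catalog(models: List[Model]) -> Dict[str, dict]:
    """Derive catalog entries directly from the model definitions."""
    return {
        m.name: {"kind": m.kind, "columns": sorted(m.columns)}
        for m in models
    }


models = [
    Model("dim_country", "dimension", ["country_code", "country_name"]),
    Model("fct_sales", "fact", ["order_id", "country_code", "amount"]),
]

catalog = build_catalog(models)
for name, entry in catalog.items():
    print(name, entry)
```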
Automated documentation, run as part of the data pipelines, profiles and verifies the schemas, generates documentation, and saves the results into whatever system you want. The results can be tracked through pull requests in your Git system, through a documentation system you support so that changes are tracked, or through one of the many SaaS tools available on the market.
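A rough sketch of the profiling and documentation step, assuming rows are available as Python dictionaries. The function names and the Markdown output format are assumptions made for this example; a real pipeline would more likely rely on the documentation features of its transformation tool.

```python
from typing import Dict, List


def profile_rows(rows: List[Dict]) -> Dict[str, dict]:
    """Infer a simple profile: type names seen and null count per column."""
    profile: Dict[str, dict] = {}
    for row in rows:
        for column, value in row.items():
            stats = profile.setdefault(column, {"types": set(), "nulls": 0})
            if value is None:
                stats["nulls"] += 1
            else:
                stats["types"].add(type(value).__name__)
    return profile


def render_markdown(table: str, profile: Dict[str, dict]) -> str:
    """Render the profile as a Markdown table that can be committed with the code."""
    lines = [f"## {table}", "", "| column | types | nulls |", "| --- | --- | --- |"]
    for column in sorted(profile):
        stats = profile[column]
        types = ", ".join(sorted(stats["types"])) or "unknown"
        lines.append(f"| {column} | {types} | {stats['nulls']} |")
    return "\n".join(lines)


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
]
doc = render_markdown("customers", profile_rows(rows))
print(doc)
```

Writing `doc` to a file in the repository would make every schema change visible in the pull request diff.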
Data Lineage and Glossary
Like the catalog, data lineage and a glossary help users find and understand what is in their data. Automation removes human mistakes and traces data across the entire data pipeline through logging and a tagging system that flows to your data catalog.
For example, suppose we are using Fivetran to feed data from Salesforce to Snowflake. On every run, the pipelines update a data glossary and a data lineage table that identify where specific dimensions come from and which system put them there. You can then trace lineage: an insert from Salesforce was made at a given date and time, and you can see which dimensions and tables were updated.
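The lineage table in this example could be sketched as follows. The `LineageRecord` fields and the `record_lineage` helper are hypothetical, and the table and column names are made up; the sketch only shows the kind of information each pipeline run would append.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class LineageRecord:
    """One row of a hypothetical lineage table."""
    source_system: str   # where the data came from, e.g. "salesforce"
    loaded_by: str       # which tool moved it, e.g. "fivetran"
    target_table: str
    columns: tuple
    operation: str       # e.g. "insert" or "update"
    loaded_at: datetime


def record_lineage(log: List[LineageRecord], source_system: str, loaded_by: str,
                   target_table: str, columns: List[str],
                   operation: str) -> LineageRecord:
    """Append a lineage record for one pipeline step and return it."""
    record = LineageRecord(source_system, loaded_by, target_table,
                           tuple(columns), operation,
                           datetime.now(timezone.utc))
    log.append(record)
    return record


lineage_log: List[LineageRecord] = []
record_lineage(lineage_log, "salesforce", "fivetran", "raw.account",
               ["account_id", "billing_country"], "insert")

# Answer "where did billing_country come from?" from the log.
sources = {r.source_system for r in lineage_log if "billing_country" in r.columns}
print(sources)
```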
Data quality can be automated in different ways. One of them is detecting and fixing issues in the data automatically rather than by hand.
For example, suppose multiple data sources populate a country dimension differently: CA, Canada, CAN, or CANADA. Depending on your data storage system, a function in the ETL or ELT step could map all of these to one value, making the country code consistent across sources.
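A minimal sketch of such a function, written in plain Python rather than in any particular warehouse's SQL dialect. The alias table and the choice of `CA` as the canonical value are assumptions for this example.

```python
from typing import Optional

# Hypothetical mapping of known variants to one canonical code.
COUNTRY_ALIASES = {
    "CA": "CA",
    "CAN": "CA",
    "CANADA": "CA",
}


def normalize_country(value: Optional[str]) -> Optional[str]:
    """Return the canonical country code, or None if the value is unknown."""
    if value is None:
        return None
    return COUNTRY_ALIASES.get(value.strip().upper())


raw_values = ["CA", "Canada", "CAN", "CANADA", " canada ", "Kanada"]
cleaned = [normalize_country(v) for v in raw_values]
print(cleaned)
```

Returning `None` for unknown values, instead of guessing, lets a later quality check surface them for review.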
Here is an example of checking data quality.
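A minimal sketch of such a check, assuming rows arrive as Python dictionaries with `id` and `country` fields. The three checks and the set of valid country codes are hypothetical and only illustrate the pattern.

```python
from typing import Dict, List

VALID_COUNTRY_CODES = {"CA", "US"}  # hypothetical allowed values


def check_quality(rows: List[Dict]) -> Dict[str, int]:
    """Count rows failing each check: null ids, duplicate ids, invalid countries."""
    failures = {"null_id": 0, "duplicate_id": 0, "invalid_country": 0}
    seen_ids = set()
    for row in rows:
        row_id = row.get("id")
        if row_id is None:
            failures["null_id"] += 1
        elif row_id in seen_ids:
            failures["duplicate_id"] += 1
        else:
            seen_ids.add(row_id)
        if row.get("country") not in VALID_COUNTRY_CODES:
            failures["invalid_country"] += 1
    return failures


rows = [
    {"id": 1, "country": "CA"},
    {"id": 1, "country": "US"},      # duplicate id
    {"id": None, "country": "CA"},   # missing id
    {"id": 2, "country": "Canada"},  # country not normalized
]
report = check_quality(rows)
print(report)
```

A pipeline could refuse to publish a table when any count in the report is above an agreed threshold.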
Security is probably the easiest area to explain in terms of automation, as many of us are already used to automated security in the DevOps world. The benefit of automating your security is that change controls can be verified through established practices such as pull requests with peer reviews and approvals, so fewer mistakes are made.
Other benefits of automating include policy-based de-identification of the data linked to RBAC roles. Systems like Snowflake allow row- and column-level security tied to RBAC roles.
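To illustrate the idea of de-identification tied to roles, here is a plain Python sketch. It is not Snowflake's masking policy syntax; the role names, the `MASKING_POLICY` table, and the hashing scheme are all assumptions for this example.

```python
import hashlib
from typing import Dict

# Hypothetical policy: which roles may see each sensitive column in clear text.
MASKING_POLICY = {
    "email": {"PII_READER"},
    "phone": {"PII_READER", "SUPPORT"},
}


def mask(value: str) -> str:
    """Replace a value with a short, stable SHA-256 token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_policy(row: Dict[str, str], role: str) -> Dict[str, str]:
    """Return a copy of the row with columns masked for roles not allowed to see them."""
    result = {}
    for column, value in row.items():
        allowed_roles = MASKING_POLICY.get(column)
        if allowed_roles is not None and role not in allowed_roles:
            result[column] = mask(value)
        else:
            result[column] = value
    return result


row = {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"}
analyst_view = apply_policy(row, "ANALYST")
support_view = apply_policy(row, "SUPPORT")
print(analyst_view)
print(support_view)
```

An unsalted hash is used here only to keep the sketch short; a real policy would likely use a keyed or salted scheme, or the warehouse's native masking features.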
Data engineering (aka DataOps), like the DevOps movement, is an excellent answer to data governance in the era of large data sets as it helps manage complexity.
Many of the pains of data governance can be addressed through automation and by fostering collaborative environments, shared responsibility, and continuous improvement.
Through those, governance becomes a shared responsibility model with everyone involved and working towards it.