Ingesting data reliably and securely is one of the biggest governance challenges today, and we often see clients struggle with it. The old ways of doing data governance no longer work; many initiatives lack a realistic chance to succeed from the get-go. Industry analysts such as Gartner have even declared traditional data governance dead.
Data governance is frequently handled through manual, static, custom-built processes, often tied to ITIL-based change management systems, with database administrators driving most of the hands-on work.
Data sets keep growing, and the world of technology has been moving to the cloud for several years; it is nearly impossible to maintain accurate control without automation nowadays.
With big data, data must be ingested far faster than before, and data privacy and security requirements add to the complexity. As that complexity grows beyond what manual or ad hoc techniques in GUI-based tools can handle, a manual approach is no longer adequate.
At Infostrux, we believe the path to improving data governance is through automation and tools that support dynamic data governance.
What is Data Governance?
Data governance is a set of principles and practices that keep data and data usage reliable and trustworthy.
It is also a way to ensure data is not misused. Data governance usually involves three groups:
- A steering committee in charge of overall executive sponsorship, holding everyone accountable, and championing data across the enterprise.
- A governance team that sets high-level priorities and general standards of operation.
- A data team that champions best practices and builds frameworks, empowering other teams to adopt data quickly while maintaining governance.
Reasons why we need to govern data:
- It helps avoid potential disasters for your company, such as misuse or leaks of data, especially when the data contains sensitive information.
- Data siloed without clear owners or a catalogue is unusable to most of the organization.
- High data quality leads to high confidence in the data; quality issues erode that confidence.
Data governance controls the following aspects of your data:
- The data catalogue – a common metadata inventory for the organization. It enables organizations to discover data and improve access while improving governance.
- The glossary – a common dictionary of business definitions maintained throughout the organization. It helps avoid misunderstandings about what terms mean.
- Data Lineage – understanding where the data came from and which systems touched and moved it.
- Data Quality – ensuring your data has high quality through the testing pyramid: unit tests, integration tests, and some basic UI tests.
- Data Security – auditing and security, including roles and responsibilities
Challenges of Data-Driven Organizations
Organizations face many challenges in building reliable data sets, and as noted above, data programs fail at a high rate. Here are many of the challenges:
- A non-centralized platform leads to inconsistency, often due to political issues
- Talent is missing with the high-demand skill set
- Poor governance generates poor quality of the data
- Processes for data ingestion frequently fail
- Data models are inconsistent
- Data is incomplete
- Change is hard and teams often resist change
- Current technologies cannot keep up with business demands, often because IT teams want to keep existing supporting systems with non-standard formats
- The architecture is too complex to manage big data sets with good quality
- Security and compliance are not well understood
What does good data governance look like?
- Cloud-based data architecture that allows organizations to focus on the data intelligence, not the management of the system
- Most aspects of operating your data managed through automation and systems thinking, including infrastructure, data quality, catalogue management, security, and compliance
- A balance in providing data access, so data gives your organization a competitive edge
- An agile approach that balances quick wins against high-impact work
Let us now look at actual automation and how it can help through DataOps principles. Namely, an 'everything as code' approach makes data pipelines and models testable.
In general, governance becomes stronger with everything as code because you design and enforce policy-based automation at the code and CI/CD level. Much like DevSecOps, DataOps will evolve toward policy-based mechanisms.
A data catalogue helps users find and manage data sets across many different systems. With everything as code, the data catalogue is updated automatically with every new model, dimension, and fact, so it never misses any information.
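As an illustration, here is a minimal Python sketch of catalogue registration driven by model definitions kept in code; the `CatalogEntry` structure and `register_models` function are hypothetical, not a real catalogue API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    name: str
    kind: str      # "model", "dimension", or "fact"
    columns: list
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def register_models(models):
    """Build catalogue entries from model definitions kept in code."""
    catalog = {}
    for model in models:
        catalog[model["name"]] = CatalogEntry(
            name=model["name"],
            kind=model["kind"],
            columns=list(model["columns"]),
        )
    return catalog

# Model definitions live in the repository, so every merge refreshes the catalogue.
models = [
    {"name": "dim_customer", "kind": "dimension", "columns": ["id", "name", "country"]},
    {"name": "fct_orders", "kind": "fact", "columns": ["order_id", "customer_id", "amount"]},
]
catalog = register_models(models)
print(sorted(catalog))
```

Because the definitions and the registration step both live in version control, a CI job can rebuild the catalogue on every merge instead of relying on someone to update it by hand.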
Automated documentation, built into the data pipelines, profiles and verifies schemas, generates documentation, and saves the results into whatever system you want. The generated documentation can be tracked as part of pull requests in your Git system, in a documentation system you support so changes are tracked, or in one of the many SaaS tools available in the market.
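A minimal sketch of what such automated documentation might do, assuming hypothetical `profile_schema` and `render_markdown` helpers that infer column types from sample rows and emit Markdown for the docs:

```python
def profile_schema(rows):
    """Infer column names and observed value types from sample rows."""
    schema = {}
    for row in rows:
        for col, value in row.items():
            schema.setdefault(col, set()).add(type(value).__name__)
    return {col: sorted(types) for col, types in schema.items()}

def render_markdown(table_name, schema):
    """Render the profiled schema as a Markdown table for the docs."""
    lines = [f"## {table_name}", "", "| column | types |", "| --- | --- |"]
    for col, types in sorted(schema.items()):
        lines.append(f"| {col} | {', '.join(types)} |")
    return "\n".join(lines)

rows = [
    {"id": 1, "country": "CA"},
    {"id": 2, "country": None},  # nulls show up as an extra observed type
]
doc = render_markdown("dim_customer", profile_schema(rows))
print(doc)
```

In a real pipeline the rendered Markdown would be committed alongside the code change, so the documentation diff is reviewed in the same pull request as the model change.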
Data Lineage and Glossary
Similar to the catalogue, data lineage and the glossary help users find and understand what is in their data. Automation removes human mistakes and traces data across the entire pipeline through logging and a tagging system that flows into your data catalogue.
For example, suppose we are using Fivetran to feed data from Salesforce into Snowflake. On every run, a data glossary and a data lineage table are updated as the pipelines execute, identifying where each dimension came from and which system put it there. You can then track lineage by knowing when an insert from Salesforce was made, which dimensions were updated, and in which tables.
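The lineage record itself can be as simple as an append-only log written on every pipeline run. The sketch below is illustrative only; the `record_lineage` function and its schema are assumptions, not a Fivetran or Snowflake API:

```python
from datetime import datetime, timezone

lineage_log = []

def record_lineage(source_system, target_table, dimensions, run_id):
    """Append one lineage row per pipeline run (hypothetical schema)."""
    entry = {
        "run_id": run_id,
        "source": source_system,
        "target": target_table,
        "dimensions": list(dimensions),
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    }
    lineage_log.append(entry)
    return entry

# A Fivetran-style sync from Salesforce into Snowflake might log:
entry = record_lineage(
    source_system="salesforce",
    target_table="snowflake.analytics.dim_account",
    dimensions=["account_id", "account_name", "country"],
    run_id="run-0001",
)
print(entry["source"], "->", entry["target"])
```

Because every run writes its own row with a timestamp and run id, answering "which system updated this dimension, and when?" becomes a simple query against the lineage table.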
Data Quality
Data quality can be automated in different ways; one is fixing issues in the data through automation.
For example, suppose you have multiple data sources that use a dimension for the country; some say CA, others Canada, CAN, or CANADA. Depending on your data storage system, the ETL or ELT can apply a function that fixes all of those, making the country code consistent across sources.
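A minimal sketch of such a normalization function in Python, with a hypothetical alias table (a real pipeline would likely apply the same logic in SQL inside the ETL/ELT step):

```python
# Hypothetical alias table mapping known spellings to one canonical code.
COUNTRY_ALIASES = {
    "CA": "CA", "CAN": "CA", "CANADA": "CA",
    "US": "US", "USA": "US", "UNITED STATES": "US",
}

def normalize_country(raw):
    """Map the many spellings of a country to one canonical code."""
    if raw is None:
        return None
    cleaned = raw.strip().upper()
    # Unknown values pass through unchanged so they can be flagged later.
    return COUNTRY_ALIASES.get(cleaned, cleaned)

values = ["CA", "Canada", "CAN", "CANADA", "usa"]
print([normalize_country(v) for v in values])
```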
Here is an example of checking data quality.
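One way to sketch such a check: rather than fixing rows, it reports failures (missing required values, duplicate keys) so the pipeline can fail fast or alert. The `check_quality` function below is a hypothetical illustration, not a specific testing framework:

```python
def check_quality(rows, required_columns, unique_key):
    """Return a list of data-quality failures instead of silently fixing them."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        # Not-null check on required columns.
        for col in required_columns:
            if row.get(col) in (None, ""):
                failures.append(f"row {i}: missing {col}")
        # Uniqueness check on the key column.
        key = row.get(unique_key)
        if key in seen:
            failures.append(f"row {i}: duplicate {unique_key}={key}")
        seen.add(key)
    return failures

rows = [
    {"id": 1, "country": "CA"},
    {"id": 1, "country": ""},  # duplicate id and missing country
]
failures = check_quality(rows, required_columns=["country"], unique_key="id")
print(failures)
```

In practice these assertions would run as a pipeline step, failing the build or paging the on-call team when the list is non-empty.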
Data Security
Security is probably the easiest to explain in terms of automation, as many of us are already used to automated security in the DevOps world. The benefit of automating your security is that changes are controlled and verified through practices such as pull requests with peer reviews and approvals, so fewer mistakes are made.
Other benefits of automation include policy-based de-identification of data linked to RBAC roles. Systems like Snowflake allow row- and column-based security tied to RBAC roles.
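Outside the database, the same policy-based idea can be sketched in Python: a policy maps sensitive columns to the roles allowed to see them in clear text, and everything else is masked. The policy table and `apply_masking` function below are illustrative assumptions, not Snowflake's masking-policy syntax:

```python
# Hypothetical policy: which RBAC roles may see each sensitive column unmasked.
UNMASKED_ROLES = {
    "email": {"PII_ADMIN"},
    "salary": {"PII_ADMIN", "FINANCE"},
}

def apply_masking(row, role):
    """Return a copy of the row with sensitive columns masked for this role."""
    masked = {}
    for col, value in row.items():
        allowed = UNMASKED_ROLES.get(col)
        if allowed is not None and role not in allowed:
            masked[col] = "***MASKED***"
        else:
            masked[col] = value
    return masked

row = {"id": 7, "email": "a@example.com", "salary": 90000}
print(apply_masking(row, role="ANALYST"))
print(apply_masking(row, role="PII_ADMIN"))
```

Keeping the policy table in code means access changes go through the same pull-request review as any other change, rather than ad hoc grants.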
Data engineering (aka DataOps), like the DevOps movement, is an excellent answer to data governance in the era of large data sets as it helps manage complexity.
Many of the pains of data governance revolve around problems that automation can fix, combined with fostering collaborative environments, shared responsibility, and continuous improvement.
Through these, governance becomes a shared responsibility model, with everyone involved and working toward it.