Data Sharing and the Future of Data Warehousing in the Cloud

In this article, our CEO, Goran (Kima) Kimovski, writes about the future of the data cloud when it comes to data warehousing and what sharing data in the data cloud looks like.

Recently, I had a conversation with a product leader of a SaaS organization who shared how many of their customers don't use the built-in analytics they offer as part of their product and instead ask them for direct access to their data. 

Their customers prefer to load the data into their own data warehouse or data analytics platform for direct processing and analysis. I've heard this same scenario over a dozen times in recent months, and I am used to it by now. 

Such a request would not be unusual if only their largest enterprise customers were making it. That is a pattern I have observed for over a decade.

What is unusual is how commonplace it has become for SaaS vendors to export their data or give access to it to customers of every size. Further, many of the SaaS vendors I’ve talked to in the past six months tell me that their customers ask whether they can share the data directly using Snowflake. Those customers have started using Snowflake and would prefer its data-sharing capabilities to importing the data using ETL tools.

Let me preface this post by saying I love Snowflake's private data exchange capabilities and the growing public data marketplace. It was one of the reasons why I got excited to focus Infostrux exclusively on Snowflake as the main data platform for our services.  

Recently, I had a fireside chat with one of Snowflake’s early investors and board members, S. Somasegar (transparently, Soma is an investor and advisor to Infostrux, too), who said in our interview that he thinks data sharing in the cloud will be bigger than data warehousing. We’re both biased and love the data cloud Kool-Aid, so I always look for evidence that would challenge my biases. So far, I haven’t been able to find any.

It feels validating that the organizations I am talking to see data-sharing requirements rising in priority and are looking for ways to meet them safely and efficiently.

What does data sharing in the cloud look like?

The old approaches to data sharing typically involve one of two methods:

  • The most common by far involves some kind of export or backup mechanism on the provider side. This puts the data to be shared into one or multiple files and stores the data on a secure FTP or cloud storage. The data needs to be imported or restored on the consumer side to bring the data into a data lake or a data warehouse technology for processing and integration with the rest of the organization’s data.
  • Larger enterprises may leverage their existing enterprise service bus or develop web services to facilitate direct data exchange with their customers, suppliers, or third-party partners without needing export and import processes.

Advanced architectures involving streaming services, big data technologies, microservices, etc., have also been applied to the problem, but they’re typically not suitable for organizations that don't have sizable R&D capacity and know-how on both sides of the exchange. Snowflake showed that it is possible to simplify data sharing by giving the consumer direct SQL access to the provider’s data without copying the data while maintaining a fully governed, secure, and highly controlled exchange between the provider and the consumer.  

This novel approach makes it possible for anyone with a Snowflake account to connect to the data shared by their SaaS vendors or third-party partners. As long as both parties already use Snowflake, the consumer can access the data and start gaining insights from it by connecting a BI or analytics tool, without creating a data pipeline or importing the data into a data warehouse. Of course, they quickly discover that they can enrich that data with additional data from the public marketplace, and they may start bringing in other internal data from their own systems. Over time, they develop a data warehouse or an analytics solution of their own, in which the original share that started the whole initiative is just one of multiple shares and sources feeding that solution.
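To make the provider and consumer roles concrete, here is a minimal sketch in Python that only assembles the SQL statements each side would run. The share, database, schema, table, and account names are hypothetical placeholders, and the statements follow my understanding of Snowflake's share commands (`CREATE SHARE`, `GRANT ... TO SHARE`, `ALTER SHARE ... ADD ACCOUNTS`, `CREATE DATABASE ... FROM SHARE`); check the exact syntax against Snowflake's documentation before running anything.

```python
from typing import List


def provider_share_statements(
    share: str,
    database: str,
    schema: str,
    tables: List[str],
    consumer_account: str,
) -> List[str]:
    """Build the SQL a provider could run to expose tables through a share.

    All object names are illustrative; no data is copied by these statements.
    """
    statements = [
        f"CREATE SHARE {share};",
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO SHARE {share};",
    ]
    # Grant read-only access table by table.
    for table in tables:
        statements.append(
            f"GRANT SELECT ON TABLE {database}.{schema}.{table} TO SHARE {share};"
        )
    # Finally, make the share visible to the consumer's account.
    statements.append(f"ALTER SHARE {share} ADD ACCOUNTS = {consumer_account};")
    return statements


def consumer_mount_statement(local_db: str, provider_account: str, share: str) -> str:
    """Build the SQL a consumer could run to mount the share as a read-only database."""
    return f"CREATE DATABASE {local_db} FROM SHARE {provider_account}.{share};"


if __name__ == "__main__":
    for stmt in provider_share_statements(
        "usage_share", "saas_db", "analytics", ["events", "invoices"], "org1.consumer1"
    ):
        print(stmt)
    print(consumer_mount_statement("vendor_data", "org1.provider1", "usage_share"))
```

Once the consumer has mounted the share, a BI or analytics tool can query the shared tables like any other database, with no export, transfer, or import step in between.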

 

Architect your data solution with data sharing as a core approach

I do not intend to write about Snowflake’s data-sharing capabilities, as those are well documented on Snowflake's website. Instead, I would like to focus on the value of designing your architecture for data sharing internally within your organization, even if you don’t yet have plans to give anyone outside your organization access to your data.

If you're not new to the public cloud, you may already be familiar with multi-account cloud architecture as a way to implement separation of concerns: each cloud account maps to your organizational structure, to your security and compliance concerns, or to some other design pattern that best meets your organization's needs. The same approach applies to your data architecture in Snowflake's data cloud: use multiple Snowflake accounts and make data sharing the interface between them.

Note: even if you're not using Snowflake today and you’re building on another cloud platform, you should consider making your architecture support data sharing, as many of the other platforms are adding data-sharing capabilities to their functionality. 

The simplest example of how data sharing can help you practice this approach is creating a second sandbox account for data science. Using data sharing, you can give direct read-only access to your raw (“data lake”) schema as well as your processed (“data warehouse”) and modeled (“data mart”) schemas to your data science team. They can import whatever other data they want and create any datasets they need for their models or ML training directly inside their sandbox.

This approach simplifies the security controls required for the two groups to work together on the same platform. It empowers each to continually work on enhancing their data and solutions without spending much time writing ETL to move the data across different technology stacks. 
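The sandbox arrangement described above can be pictured with a small toy model. This is not Snowflake's API; the `Account` class, the `share` function, and the schema names are all assumptions made for illustration. It captures only the access rule that matters here: a consuming account can read what is shared with it but can write only to schemas it owns.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Account:
    """Toy stand-in for a data platform account (hypothetical, not a real API)."""

    name: str
    own_schemas: Set[str] = field(default_factory=set)
    # Maps a shared schema name to the account that provides it (read-only).
    shared_in: Dict[str, str] = field(default_factory=dict)

    def can_read(self, schema: str) -> bool:
        return schema in self.own_schemas or schema in self.shared_in

    def can_write(self, schema: str) -> bool:
        # Shared schemas are never writable by the consumer.
        return schema in self.own_schemas


def share(provider: Account, consumer: Account, schemas: Iterable[str]) -> None:
    """Expose provider-owned schemas to the consumer as read-only."""
    for schema in schemas:
        if schema not in provider.own_schemas:
            raise ValueError(f"{provider.name} does not own schema {schema!r}")
        consumer.shared_in[schema] = provider.name


if __name__ == "__main__":
    core = Account("core", own_schemas={"raw", "warehouse", "mart"})
    sandbox = Account("ds_sandbox", own_schemas={"experiments"})
    share(core, sandbox, ["raw", "warehouse", "mart"])
    print(sandbox.can_read("raw"), sandbox.can_write("raw"))
    print(sandbox.can_write("experiments"))
```

The data science team gets everything it needs to read, and anything it creates stays inside its own account, so neither group has to coordinate write permissions with the other.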

Next, you can enable additional accounts to support your development and testing efforts. This approach takes advantage of the ability to define how to handle sensitive data at the point of exchange by effectively classifying, masking, anonymizing, tokenizing, or filtering protected data. As a result, users in the development and testing accounts can use functionality like zero-copy cloning to work safely and directly with production data, without first spending effort on cleaning it.
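As a rough illustration of protecting data at the point of exchange, the sketch below applies a per-column policy to a row before it is exposed. Everything in it is an assumption for illustration: the policy format, the `protect_row` helper, the column names, and the demo key. In Snowflake itself you would more likely express the same rules with secure views or masking policies rather than application code.

```python
import hashlib
import hmac
from typing import Any, Dict

# Demo-only key (assumption); a real key would come from a secrets manager.
SECRET_KEY = b"demo-only-key"


def tokenize(value: str) -> str:
    """Replace a value with a stable keyed hash so joins still work on the token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters; short values are fully hidden."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def protect_row(row: Dict[str, Any], policy: Dict[str, str]) -> Dict[str, Any]:
    """Apply 'mask', 'tokenize', 'drop', or 'pass' to each column of a row.

    Columns missing from the policy pass through unchanged.
    """
    protected: Dict[str, Any] = {}
    for column, value in row.items():
        action = policy.get(column, "pass")
        if action == "drop":
            continue  # filtered out of the shared view entirely
        if action == "mask":
            protected[column] = mask(str(value))
        elif action == "tokenize":
            protected[column] = tokenize(str(value))
        elif action == "pass":
            protected[column] = value
        else:
            raise ValueError(f"unknown action {action!r} for column {column!r}")
    return protected


if __name__ == "__main__":
    policy = {"email": "tokenize", "card_number": "mask", "ssn": "drop"}
    row = {
        "email": "ana@example.com",
        "card_number": "4111111111111234",
        "ssn": "000-00-0000",
        "plan": "pro",
    }
    print(protect_row(row, policy))
```

Because the token is deterministic for a given key, developers and testers can still join and group on the tokenized column without ever seeing the original value.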

Once you gain enough experience empowering different functional groups within your data teams to work with your data, you can further extend the same approach by creating specific analytics accounts for different organizational units (OUs) like marketing, finance, sales, etc.

Too often, data teams respond to requests for new reports and dashboards in a centralized BI or analytics tool, or they have to add data sources to a central data warehouse that only some OUs may need or use. Instead, you can maintain a core centralized data repository for the entire organization and enable each separate group to work off of that repository. Once this is in place, each group can run its own reporting or enhance the data with additional datasets imported into its own account to bring that data into its reports.

From Data Sharing to Data Networks

Once you have been down the path of using data sharing to architect a flexible, scalable, and extensible data platform for your organization, it is a very small step to extend that to your external partners, customers, etc.  

This has the side effect of inviting those organizations to invest in their own data platforms and in data sharing, which can create a two-way sharing opportunity for you too.

With this approach, data warehousing becomes a component in a much wider data network solution where each “node” in that network can use a different data architecture for integrating, modeling, and preparing the data for the specific uses it is meant to serve. It is not a big stretch to suggest that organizations can soon run their entire supply chain on a data cloud solution like Snowflake's. It is certainly a compelling opportunity for anyone who has had to enable data sharing and exchanges in the past and understands how difficult, complex, and often brittle the traditional solutions are.

With cloud-based data sharing, Snowflake is enabling new data architectures that are faster to evolve and easier to maintain. I am certainly excited about this opportunity and can't agree more with Soma that data sharing will be much bigger than data warehousing!