Is the so-called Big Data revolution more accurately a Big Data delusion? In many ways, it feels like digital transformation and the rise of Big Data have failed to deliver on their promise of understanding our customers better, optimizing our marketing spend, reducing our operational inefficiencies, and so on. The reality is that many companies are struggling to become data-driven. They soon discover that data is complex, cumbersome, and costly. The complexities range from integrating many data sources with varying formats, building ‘as code’ automated data pipelines, sharing data, and eliminating data silos to ensuring data security and governance and having the proper team and tools to turn data into business decisions. With so much complexity around Big Data, organizations can easily make mistakes when working with their data. In this post, we look at the top 5 mistakes companies commonly make with data and offer potential solutions for avoiding them.
1. Overinvesting in data analytics, while underinvesting in data engineering
Most businesses collect and have access to large volumes of data about their customers and operations, but using that data in meaningful ways is difficult and costly. Even with the right tools, extracting insight from data can be overwhelming if you don’t hire the right staff. Companies tend to invest in hiring data analysts and data scientists to build reports and develop analytic dashboards, but don’t invest in the proper data platforms and data engineers required to make that work successful. Sure, analysts and data scientists may be able to figure out how to create efficient ‘as code’ ETL pipelines, but that doesn’t mean they should. Since they often lack the specialized expertise, they will spend an exorbitant amount of time getting lost in The Dataland Zoo of technologies and approaches with no proper support, guidance, or knowledge of best practices. What usually ends up happening is that only a small fraction of the overall data a business generates is properly structured and organized in relational data models, stored in database tables, and made available for analysis with SQL queries or a BI tool.
At Infostrux, we strongly believe in data engineering as a valuable practice that deserves investment. The term “data engineer” is used loosely to refer to people who can handle anything from low-level infrastructure work to deploying data technologies. Using automated pipelines, data engineers connect data sources to data warehouses and data lakes. From there, they write the code inside those pipelines and platforms that cleans, transforms, integrates, and models the data. These steps make data reliable and easy to use. Data engineering enables business analysts and data scientists to start developing their reports and dashboards to glean insights and make data-driven decisions. We also empower teams to use the data in other meaningful ways, such as uncovering business insights, training machine learning models, and building innovative data products that differentiate them in their markets and create new revenue opportunities.
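The cleaning, transformation, and integration steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an actual production pipeline: the two sources (`crm_rows`, `billing_rows`), their field names, and the helper functions are all hypothetical.

```python
def clean_email(raw: str) -> str:
    """Normalize an email address so the same customer matches across sources."""
    return raw.strip().lower()


def clean_crm(rows: list[dict]) -> dict[str, dict]:
    """Clean CRM rows and deduplicate them by email (the last row wins)."""
    customers: dict[str, dict] = {}
    for row in rows:
        email = clean_email(row["Email"])
        if not email:
            continue  # drop rows that cannot be matched to a customer
        customers[email] = {"email": email, "name": row["Name"].strip().title()}
    return customers


def integrate(crm_rows: list[dict], billing_rows: list[dict]) -> list[dict]:
    """Join cleaned CRM customers with the total amount billed per customer."""
    customers = clean_crm(crm_rows)
    totals: dict[str, float] = {}
    for row in billing_rows:
        email = clean_email(row["customer_email"])
        totals[email] = totals.get(email, 0.0) + float(row["amount"])
    return [
        {**customer, "total_billed": totals.get(email, 0.0)}
        for email, customer in sorted(customers.items())
    ]


# Hypothetical sample data with the kinds of inconsistencies a pipeline must handle.
crm_rows = [
    {"Email": " Ana@Example.com ", "Name": "ana silva"},
    {"Email": "ana@example.com", "Name": "Ana Silva"},  # duplicate customer
    {"Email": "", "Name": "Unknown"},                    # unusable row
    {"Email": "bo@example.com", "Name": "BO CHEN"},
]
billing_rows = [
    {"customer_email": "ANA@example.com", "amount": "120.50"},
    {"customer_email": "ana@example.com", "amount": "30"},
]

model = integrate(crm_rows, billing_rows)
for record in model:
    print(record)
```

The result is one row per customer with consistent keys, which is the kind of modelled table an analyst can query directly instead of reconciling the raw sources by hand.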
2. Not migrating to the cloud
By now, it’s difficult to ignore the benefits of the cloud, but some companies have cultural or legacy reasons for sticking with an on-premises data platform. A lot can be said about this topic, but we’ll focus on just three areas (we cover the challenges with data sharing in the next section):
- High upfront costs: with on-premises hosting, the hardware must be purchased upfront. The company may also be responsible for all ongoing maintenance and replacement of this hardware, including reserving dedicated office space, hiring trained staff, and paying high energy costs to keep the equipment cool and running smoothly.
- Overpaying for storage: since it is difficult to predict how much storage you will need, companies end up paying for storage that they never use. Over time, this can add up to significant capital expenditures.
- Limited concurrency: on-premises data platforms have fixed compute and storage resources, which limits concurrency. Concurrency is the ability to perform many tasks simultaneously and/or to allow many users to access the same data and resources.
The Solution
Among the many benefits of moving to the cloud, one significant factor is the reduced cost, especially in the long term.
- No upfront costs and zero maintenance: with a cloud data platform, there are no upfront hardware costs. Companies also don’t need to guess how much storage they will need. Snowflake’s fully scalable, on-demand, pay-for-what-you-need data platform can save companies a lot of money. There are also no maintenance costs for servicing and upgrading the hardware (since there is no hardware), you don’t have to reserve office space and hire staff to monitor it, and you don’t have to pay high energy bills; the cloud provider handles all these things.
- Increased performance: we looked at one drawback of on-premises platforms, which is the concurrency limitation that comes with fixed compute and storage. With a multi-cluster, shared data architecture in the cloud, you can achieve near-instant and near-infinite scalability. Concurrency is no longer an issue since compute and storage resources are separate, enabling them to be scaled independently. Now multiple users can query the same data without degrading performance.
Perhaps the most important factor is that a cloud data platform is much more than data storage. It is a single platform that supports many different workloads:
- Data warehouse for easy storage and analytics
- Data lakes for storing all your data for data exploration
- Data engineering for ingesting and transforming your data
- Data science for creating machine learning models
- Data application development and operation
- Data exchanges for sharing data with high security and governance
At Infostrux, we help customers of all sizes and industries get up and running with Snowflake. We also offer data engineering as a service, which involves building and managing automated data pipelines so that your data is centralized, cleaned, and modelled in alignment with your specific business requirements. In essence, we help organizations obtain reliable data they can trust for all their data use cases.
3. Ignoring data silos and using outdated data sharing methods
Data silos exist when groups within an organization collect and maintain their own data inside separate data warehouses or data marts (smaller, stand-alone versions of a data warehouse that hold a segment or subset of the data, e.g. marketing, sales, or finance). Data that is siloed becomes invisible and inaccessible to others, resulting in an incomplete picture of the company, which hinders business intelligence, data-driven decision making, and collaboration. Even if a company is aware of this problem, it may not be equipped to do anything about it efficiently. Traditional approaches to data sharing are inefficient and ineffective. Sharing vast amounts of data is impractical, if not impossible. It requires making a copy of the data and manually moving it somewhere (usually via email), which brings a host of challenges:
- The process is cumbersome, time-consuming, and complex, and must be repeated any time the data changes
- There are extensive delays as the process does not happen immediately
- It creates multiple versions of static (and stale) data so there is no single source of truth. This leads to the ‘high-confidence, bad-answers trap’. Running analytics on incomplete data and drawing faulty conclusions can result in costly business decisions
- There are limitations to how much data an organization can share via email and other methods
- Data security and governance risks are high as multiple copies of data would need to exist and could end up in the wrong hands
An ideal solution would be to share limitless amounts of live data easily and instantly, with proper security and governance protocols. Cloud data platforms such as Snowflake enable organizations to do just this. Data sharing with Snowflake’s Data Exchange breaks down these silos by providing access to live data between internal business units and external partners. Instead of going through a complex and cumbersome process, users can share and receive live, read-only access to data within minutes. This is done without having to move the data, so the data sharer is always in control, which greatly reduces the security and governance risks. For further reading, check out: Data Sharing and the Future of Data Warehousing in the Cloud
4. Not addressing data quality issues
For some business units, marketing especially, getting access to good quality data is a challenge. To illustrate our point, we’ll use marketing attribution as an example since it is a notoriously hard problem to solve. Attribution is like giving credit to a source. For example, suppose a restaurant owner gives a $10 referral fee to the source of every guest that comes in. A patron walks into the restaurant, and the owner asks, “How did you hear about us?” The patron replies, “My friend Carl told me.” In this scenario, Carl is the source and would receive credit for the referral, and with it, $10.
But suppose Carl told Sally, and Sally told her other friend. When that friend enters the restaurant, she makes no mention of Carl. But shouldn’t Carl receive partial credit as well? It seems fair that he would, so we would need to refine our referral structure. So far, we could set it up only as “first-touch” attribution (all credit to Carl) or “last-touch” attribution (all credit to Sally). To take the example further, suppose the friend also heard a radio ad for the restaurant and received a flyer in the mail. Surely those had some influence on her (even if only slight), but without her mentioning them, the restaurant owner has no idea whether the money spent on those marketing channels produced a worthwhile ROI.
It gets even more complex when we look into attribution windows. What if Sally heard a radio ad for the restaurant 2 months before Carl told her about it? Does the radio ad still count as a source in our attribution model, or is it too far back? Perhaps we need a further refinement that includes only those sources that occurred within a given time period, say 7 days. This scenario actually plays out quite routinely in marketing. By some estimates, it takes an average of 9 to 14 touchpoints before a consumer makes a purchase from your brand.
With so many touchpoints factoring into a consumer’s decision, it would seem right to implement a weighted attribution model, which distributes some portion of the credit among the different touchpoints, perhaps giving more credit to the sources closer to the conversion since those were more likely to be key drivers in the consumer’s decision. Ad platforms such as Google, Facebook, LinkedIn, and others often like to take full credit for the conversion even though there were most likely multiple touchpoints or factors in the consumer’s decision to finally make the purchase. As a result, this credit often gets double counted. The other challenge with these ad platforms is that they don’t make it easy to integrate data. For many of these platforms, obtaining and retaining a lot of data about their users and suppliers is integral to their business models. Unfortunately, sharing that data with their users is not part of their business model.
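To make the difference between these models concrete, here is a minimal Python sketch that loosely follows the restaurant example. It compares first-touch, last-touch, and a weighted (time-decay) model inside a 7-day attribution window. The `attribute` function, the 2-day half-life, and the sample dates are assumptions chosen for illustration; real attribution tools use their own weighting rules.

```python
from datetime import datetime, timedelta


def attribute(touchpoints, conversion_time, model="time_decay",
              window_days=7, half_life_days=2.0):
    """Return {source: share of credit} for a single conversion.

    touchpoints: list of (source, datetime) pairs.
    Only touchpoints inside the attribution window before the conversion count.
    """
    window_start = conversion_time - timedelta(days=window_days)
    eligible = sorted(
        (tp for tp in touchpoints if window_start <= tp[1] <= conversion_time),
        key=lambda tp: tp[1],
    )
    if not eligible:
        return {}

    if model == "first_touch":
        weights = [1.0] + [0.0] * (len(eligible) - 1)
    elif model == "last_touch":
        weights = [0.0] * (len(eligible) - 1) + [1.0]
    elif model == "time_decay":
        # The weight halves for every `half_life_days` between touch and conversion.
        weights = [
            0.5 ** ((conversion_time - when).total_seconds() / 86400 / half_life_days)
            for _, when in eligible
        ]
    else:
        raise ValueError(f"unknown model: {model}")

    total = sum(weights)
    credit = {}
    for (source, _), weight in zip(eligible, weights):
        credit[source] = credit.get(source, 0.0) + weight / total
    return credit


conversion = datetime(2024, 3, 10, 19, 0)
touchpoints = [
    ("radio_ad", conversion - timedelta(days=60)),  # outside the 7-day window
    ("carl", conversion - timedelta(days=4)),
    ("flyer", conversion - timedelta(days=2)),
    ("sally", conversion),
]

first = attribute(touchpoints, conversion, model="first_touch")
last = attribute(touchpoints, conversion, model="last_touch")
decay = attribute(touchpoints, conversion, model="time_decay")
print(first, last, decay, sep="\n")
```

With these sample values, first-touch gives all the credit to Carl, last-touch gives it all to Sally, and the time-decay model splits it so that Sally receives the most, the flyer less, and Carl the least, while the two-month-old radio ad is excluded by the window.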
The best ways to overcome some of these challenges are:
- Apply tracking codes to all marketing touchpoints (where applicable). This way, you at least have a way to distinguish the source and medium that brought the customer through the door, so to speak. Digital marketing makes this a little easier with UTM tags and other tracking codes. With print, radio, and TV commercials, you can use QR codes, coupons, and referral messages, but it admittedly isn’t a foolproof system.
- Enable conversion events on the ad platforms. That way you can have some indication of which ads resulted in a conversion and therefore allocate more of the budget toward those specific ads, while reducing or eliminating the budget from underperforming ads.
- Set up goals in Google Analytics. This will allow you to better track on-site consumer behaviour associated with predefined goals such as filling out a form, downloading an asset, or making a purchase.
- Invest in software tools (CRM, ERP, etc.) that give you full visibility into the buyer’s journey.
- Integrate all your data sources into one unified platform to create a single source of truth. This creates a broader picture of your attribution models. Through data engineering, we can use code to automate these data pipelines into a cloud platform such as Snowflake. Once the data is integrated, wrangled, and modelled properly, it is ready for analysis.
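As a small illustration of the tracking-code step, the sketch below reads UTM tags from landing-page URLs and counts visits per source and medium using only the Python standard library. The URLs, the `(untagged)` label, and the function names are hypothetical; in practice this logic usually lives inside an analytics tool or a data pipeline rather than a standalone script.

```python
from urllib.parse import urlparse, parse_qs

UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign")


def extract_utm(url: str) -> dict[str, str]:
    """Return the UTM parameters found in a landing-page URL."""
    params = parse_qs(urlparse(url).query)
    return {key: params[key][0] for key in UTM_KEYS if key in params}


def sessions_by_source(urls: list[str]) -> dict[tuple[str, str], int]:
    """Count visits per (source, medium); untagged visits are grouped together."""
    counts: dict[tuple[str, str], int] = {}
    for url in urls:
        utm = extract_utm(url)
        key = (utm.get("utm_source", "(untagged)"), utm.get("utm_medium", "(untagged)"))
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical landing-page hits, including a QR code on a printed flyer.
landing_urls = [
    "https://example.com/menu?utm_source=newsletter&utm_medium=email&utm_campaign=spring",
    "https://example.com/menu?utm_source=flyer&utm_medium=qr_code",
    "https://example.com/menu?utm_source=newsletter&utm_medium=email",
    "https://example.com/menu",
]

counts = sessions_by_source(landing_urls)
for (source, medium), visits in counts.items():
    print(source, medium, visits)
```

Keeping untagged visits as their own group makes the gaps in tracking visible instead of silently dropping them.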
5. Not using data properly
There are a lot of ways companies can misuse data. Either they don’t know what they’re looking for or they’re not asking the right questions, they have no defined metrics or don’t use the data to make decisions, or they become overwhelmed and succumb to analysis paralysis. Companies obsess over KPIs and OKRs, but often don’t have reliable ways to measure the metrics they identify as key. Without a reliable way to measure these goals and the effectiveness of certain initiatives, companies are left to rely on intuition, anecdotes, or past experiences.
Companies that implement data-driven decision-making cultures and processes are generally more successful. The first step to using data properly is to start with a question or set of questions related to your business. Next, you want to determine what you are going to do with that information.
It may look something like this:
|Question|Answer|Action|
|---|---|---|
|What was our gross margin last quarter?|Lower than expected|We need to switch manufacturers to cut our COGS and improve our gross margins.|
|Are sales trending up or down?|Sales are trending up|Our marketing push worked and the ROI was worthwhile. Let’s do it again.|
|Which channels resulted in the most sales?|Events|We need to invest more in our events.|
|Which product is our top seller?|XYZ product|We need to increase production and focus in order to fulfil customer demand.|
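Once the data is reliable, questions like these reduce to short calculations. The following sketch answers three of them from a handful of order records; the records, field names, and helper functions are made up for illustration, and in practice the same logic would typically run as SQL queries or in a BI tool.

```python
from collections import defaultdict

# Hypothetical order records; the field names are illustrative only.
orders = [
    {"channel": "events", "product": "XYZ", "revenue": 500.0, "cogs": 350.0},
    {"channel": "email", "product": "ABC", "revenue": 200.0, "cogs": 120.0},
    {"channel": "events", "product": "XYZ", "revenue": 300.0, "cogs": 210.0},
]


def gross_margin(rows: list[dict]) -> float:
    """Gross margin = (revenue - COGS) / revenue, as a fraction."""
    revenue = sum(row["revenue"] for row in rows)
    cogs = sum(row["cogs"] for row in rows)
    if revenue == 0:
        raise ValueError("gross margin is undefined when revenue is zero")
    return (revenue - cogs) / revenue


def top_by_revenue(rows: list[dict], field: str) -> str:
    """Return the value of `field` (e.g. channel or product) with the most revenue."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[field]] += row["revenue"]
    return max(totals, key=totals.get)


margin = gross_margin(orders)
top_channel = top_by_revenue(orders, "channel")
top_product = top_by_revenue(orders, "product")

print(f"Gross margin: {margin:.0%}")
print(f"Top channel: {top_channel}")
print(f"Top product: {top_product}")
```

The calculations are simple; the hard part, as the earlier sections argue, is getting trustworthy revenue, cost, and channel data into one place to begin with.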
The next step would be to look at your data to see if any patterns emerge. Some insight may stand out that you didn’t think to inquire about. For example, perhaps a previously under-represented user group is showing strong signs of growth. That may indicate that you need to focus more effort on marketing to that group and offering more products that cater to them. Data is a valuable business asset, but that value can only be realized if you have the proper tools and expertise to manage and explore the data. As your data sources grow in complexity, so too must your tools and training, so that staff can draw meaningful insights from the data.
Companies generating large volumes of data face the inevitable challenge of gaining control of and access to all their data for reporting, analysis, decision making, and other data use cases. To overcome these challenges, organizations must adopt new technology, software, and trained staff, which adds complexity to their business operations. Building solutions that don’t work well due to missing or unreliable data results in organizations falling into the ‘high-confidence, bad-answers’ trap. Bringing data together and creating useful insights has given rise to a whole new industry and a set of technologies and approaches that we know as Big Data, data analytics, data science, etc. To stay ahead of the curve, companies need new tools for integrating data, new models for analyzing data, and data engineering efforts from businesses like ours. At Infostrux, we help organizations of all sizes deal with the undifferentiated heavy lifting required to improve data reliability and break down data silos.