Is the so-called Big Data revolution more accurately a Big Data delusion? In many ways, it feels like the digital transformation and the rise of Big Data have failed to deliver on their promise of understanding our customers better, optimizing our marketing spend, reducing our operational inefficiencies, and so on.
The reality for many companies is that they are struggling to become data-driven. Companies soon discovered that data is complex, cumbersome, and costly. The complexities include integrating many data sources with varying formats, building ‘as code’ automated data pipelines, sharing data and eliminating data silos, ensuring data security and governance, and having the right team and tools to use data to drive business decisions.
With so much complexity around Big Data, organizations can easily make mistakes when working with their data. In this post, we look at the top 5 mistakes companies commonly make with data and offer potential solutions to avoid these mistakes.
1. Overinvesting in data analytics while underinvesting in data engineering
Most businesses collect and have access to large volumes of data about their customers and operations, but using that data in meaningful ways is difficult and costly. Even with the right tools, extracting insight from data can be overwhelming if you don’t hire the right staff. Companies tend to invest in hiring data analysts and data scientists to build reports and develop analytic dashboards but don’t invest in proper data platforms and data engineers, which are required to make the work successful. Sure, they may be able to figure out how to create efficient ‘as code’ ETL pipelines, but it doesn’t mean they should.
Since they often lack specialized expertise, they will spend an exorbitant amount of time getting lost in The Dataland Zoo of technologies and approaches without proper support, guidance, or knowledge of best practices. What usually happens is that only a tiny fraction of the overall data a business generates is adequately structured and organized in relational data models, stored in database tables, and can be analyzed with SQL queries or a BI tool.
The Solution
At Infostrux, we strongly believe that data engineering is a valuable practice worth investing in. Data engineering loosely refers to the work of people who can handle anything from low-level infrastructure to deploying data technologies. Data engineers connect data sources to data warehouses and lakes using automated pipelines. From there, they write the code inside those pipelines and platforms that cleans, transforms, integrates, and models the data. These steps make data reliable and easy to use.
Data engineering enables business analysts and data scientists to develop reports and dashboards, glean insights, and make data-driven decisions. We also empower teams to use the data in other meaningful ways, such as uncovering business insights, training machine learning models, and building innovative data products that differentiate them in their markets and create new revenue opportunities.
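The cleaning and integrating steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two source systems, the field names, and the `build_customer_model` helper are hypothetical, and a real pipeline would run inside a data platform rather than on in-memory lists.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical raw exports from two source systems with inconsistent formats.
crm_rows = [
    {"Email": " Ana@Example.com ", "signup": "2023-01-15"},
    {"Email": "bob@example.com", "signup": "2023-02-01"},
]
shop_rows = [
    {"email": "ana@example.com", "order_total": "19.99"},
    {"email": "ANA@example.com", "order_total": "5.01"},
    {"email": "", "order_total": "7.00"},  # unusable: no customer key
]


def clean_email(value: str) -> str:
    """Normalize an email so the same customer matches across sources."""
    return value.strip().lower()


def build_customer_model(crm, shop):
    """Clean both sources and integrate them into one record per customer."""
    customers = {}
    for row in crm:
        email = clean_email(row["Email"])
        customers[email] = {
            "email": email,
            "signup_date": datetime.strptime(row["signup"], "%Y-%m-%d").date(),
            "lifetime_value": Decimal("0"),
        }
    for row in shop:
        email = clean_email(row["email"])
        if email in customers:  # skip orders that cannot be tied to a customer
            customers[email]["lifetime_value"] += Decimal(row["order_total"])
    return customers


model = build_customer_model(crm_rows, shop_rows)
for record in model.values():
    print(record)
```

Normalizing the join key before integrating is what lets the two differently formatted "Ana" records collapse into a single customer, which is the kind of reliability an analyst would otherwise have to rebuild in every report.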
2. Not migrating to the cloud
By now, it's difficult to ignore the benefits of the cloud, but some companies have cultural or legacy reasons for sticking with an on-premises data platform. A lot can be said about this topic, but we'll focus on just three areas (we cover the challenges with data sharing in the next section):
High upfront costs – with on-premises hosting, the hardware must be purchased upfront. The company may also be responsible for ongoing maintenance and replacement of this hardware, including reserving dedicated office space, hiring trained staff, and paying high energy costs to keep the equipment cool and running smoothly.
Overpaying for storage – since it is difficult to predict how much storage you will need, companies end up paying for storage they never use. Over time, this can add up to significant capital expenditures.
Limited concurrency – on-premises data platforms have fixed computing and storage resources, which limits concurrency. Concurrency is the ability to perform many tasks simultaneously and to allow many users to access the same data and resources.
The Solution
Among the many benefits of moving to the cloud, one significant factor is the reduced cost, especially in the long term.
No upfront fees and zero maintenance – with a cloud data platform, there are no upfront costs. Companies also don't need to guess how much storage they will need. Snowflake's fully scalable, on-demand, pay-for-what-you-need data platform can save companies money. There are also no maintenance costs for servicing and upgrading the hardware (since there is no hardware), you don't have to reserve office space and hire staff to monitor it, and you don't have to pay high energy bills; the cloud provider handles all these things.
Increased performance – we looked at one drawback of on-premises platforms: the concurrency limitations of fixed compute and storage. With a multi-cluster, shared data architecture in the cloud, you can achieve near-instant and near-infinite performance and scalability. Concurrency is no longer an issue since computing and storage resources are separate, enabling them to be scaled independently. Multiple users can now query the same data without degrading performance.
More than data storage – perhaps the most critical factor is that a cloud data platform is much more than data storage; it's a single platform that supports many different workloads:
- Data warehouse for easy storage and analytics
- Data lakes for storing all your data for data exploration
- Data engineering for ingesting and transforming your data
- Data science for creating machine learning models
- Data application development and operation
- Data exchanges for sharing data with high security and governance
3. Ignoring data silos and using outdated data-sharing methods
Data silos exist when groups within an organization collect and maintain their data inside separate data warehouses or data marts (smaller, stand-alone versions of a data warehouse that include a segment or subset of the data, e.g., marketing, sales, finance, etc.). Siloed data becomes invisible and inaccessible to others, resulting in an incomplete picture of the company, which hinders business intelligence, data-driven decision-making, and collaboration.
Even if a company is aware of this problem, it may not be equipped to do something about it efficiently. Traditional approaches to data sharing are inefficient and ineffective. Sharing vast amounts of data is impractical, if not impossible. It requires making a copy of the data and manually moving it somewhere (usually via email), which has a host of challenges:
- The process is cumbersome, time-consuming, and complex and must be repeated whenever the data changes.
- There are extensive delays, as the process does not happen immediately.
- It creates multiple versions of static (and stale) data, so there is no single source of truth. This leads to the 'high-confidence, bad-answers trap.' Running analytics on incomplete data and drawing faulty conclusions can result in costly business decisions.
- There are limitations to how much data an organization can share via email and other methods.
- Data security and governance risks are high, as multiple copies of the data would need to exist and could end up in the wrong hands.
The Solution
An ideal solution would be to share limitless amounts of live data easily and instantly, with proper security and governance protocols. Cloud data platforms such as Snowflake enable organizations to do just this. Data sharing with Snowflake's Data Exchange breaks down these silos by providing access to live data between internal business units and external partners. Instead of a complex and cumbersome process, users can share and receive live, read-only access to data within minutes. This is done without moving the data, so the data sharer is always in control, significantly reducing security and governance risks. For further reading, you can check out: Data Sharing and the Future of Data Warehousing in the Cloud.
4. Not addressing data quality issues
For some business units, marketing especially, accessing good quality data takes time and effort. To illustrate our point, we'll use marketing attribution as an example since it is a notorious problem to solve. Attribution is like giving credit to a source.
For example, suppose a restaurant owner gives a $10 referral fee to the source of every guest that comes in. A patron walks into the restaurant, and the owner asks, "How did you hear about us?" The patron replies, "My friend Carl told me." In this scenario, Carl is the source and would receive credit for the referral and, with it, $10. But suppose Carl told Sally, and Sally told another friend. When that friend enters the restaurant, she makes no mention of Carl. But shouldn't Carl receive partial credit as well? It seems fair that he would, so we must refine our referral structure.
As it stands, the scheme can credit only one source: the first one under "first-touch" attribution (Carl) or the most recent one under "last-touch" attribution (Sally). To take the example further, suppose the friend also heard a radio ad for the restaurant and received a flyer in the mail. Surely those had some influence on her (even if only slightly), but without her mentioning them, the restaurant owner has no idea if the money spent on those marketing channels produced a worthwhile ROI. It gets even more complex when we look into attribution windows. What if Sally heard a radio ad for the restaurant two months before Carl told her about it? Does the radio ad still count as a source in our attribution model, or is it too far back?
So perhaps we need to refine the model further to include only those sources that occurred within a given period, say seven days. This scenario plays out quite routinely in marketing. It takes an average of 9-14 touchpoints before a consumer purchases from your brand. With so many touchpoints factoring into a consumer's decision, it would seem fitting to implement a weighted attribution model that distributes some portion of the credit among the different touchpoints.
Such a model may give more credit to the sources closer to the conversion, since those were more likely to be critical drivers in the consumer's decision. Ad platforms such as Google, Facebook, LinkedIn, and others often like to take full credit for the conversion even though there were most likely multiple touchpoints or factors in the consumer's decision to finally make the purchase. As a result, the same conversion often gets double counted.
The other challenge with these ad platforms is that they don't make it easy to integrate data. For many of these platforms, obtaining and retaining much data about their users and suppliers is integral to their business models. Unfortunately, sharing that data with their users is not part of their business model.
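The attribution ideas discussed above (first-touch, last-touch, an attribution window, and a weighted model that favors recent touchpoints) can be compared side by side in a small sketch. The `attribute` function, the channel names, the dates, and the two-day half-life used for the weighting are all assumptions chosen for illustration, not a standard or a recommended configuration.

```python
from datetime import date, timedelta


def attribute(touchpoints, conversion_date, model="last_touch",
              window_days=7, half_life_days=2.0):
    """Split one conversion's credit across touchpoints.

    touchpoints: list of (channel, date) tuples.
    Returns {channel: share}; shares sum to 1.0, or the dict is empty
    when no touchpoint falls inside the attribution window.
    """
    window_start = conversion_date - timedelta(days=window_days)
    eligible = sorted(
        (tp for tp in touchpoints if window_start <= tp[1] <= conversion_date),
        key=lambda tp: tp[1],
    )
    if not eligible:
        return {}

    if model == "first_touch":
        weights = [1.0] + [0.0] * (len(eligible) - 1)
    elif model == "last_touch":
        weights = [0.0] * (len(eligible) - 1) + [1.0]
    elif model == "time_decay":
        # Weight halves for every `half_life_days` between touch and conversion.
        weights = [0.5 ** ((conversion_date - d).days / half_life_days)
                   for _, d in eligible]
    else:
        raise ValueError(f"unknown model: {model}")

    total = sum(weights)
    credit = {}
    for (channel, _), weight in zip(eligible, weights):
        credit[channel] = credit.get(channel, 0.0) + weight / total
    return credit


# Hypothetical customer journey ending in a conversion on 10 March.
conversion = date(2023, 3, 10)
journey = [
    ("radio_ad", date(2023, 1, 10)),  # two months earlier: outside the window
    ("flyer", date(2023, 3, 6)),
    ("referral", date(2023, 3, 8)),
]

for name in ("first_touch", "last_touch", "time_decay"):
    print(name, attribute(journey, conversion, model=name))
```

With a seven-day window the radio ad drops out entirely, and the time-decay model gives the referral twice the flyer's share because it happened one half-life closer to the conversion. Changing `window_days` or `half_life_days` changes the answer, which is why these parameters should be agreed on before comparing channels.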
The Solution
The best ways to overcome some of these challenges are:
- Apply tracking codes to all marketing touchpoints (where applicable). This way, you can at least distinguish the source and medium that brought the customer through the door, so to speak. Digital marketing makes this easier with UTM tags and other tracking codes. With print, radio, and TV commercials, you can use QR codes, coupons, and referral messages, but it isn't a foolproof system.
- Enable conversion events on the ad platforms. That way, you can have some indication of which ads resulted in conversion and therefore allocate more of the budget toward those specific ads while reducing or eliminating the funding from underperforming ads.
- Set up goals in Google Analytics. This will allow you to better track on-site consumer behavior associated with predefined objectives, such as filling out a form, downloading an asset, or making a purchase.
- Invest in software tools (CRM, ERP, etc.) that give you complete visibility into the buyer's journey.
- Integrate all your data sources into one unified platform to create a single source of truth. This gives you a broader picture for your attribution models. Through data engineering, we can use code to automate these data pipelines into a cloud platform such as Snowflake. Once the data is integrated, wrangled, and modeled correctly, it is ready for analysis.
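As a small illustration of the first item in the list above, UTM tags can be read from a landing-page URL with Python's standard library. This is a minimal sketch: the `extract_utm` helper and the example URL are hypothetical, and only three of the common UTM parameters are handled.

```python
from urllib.parse import parse_qs, urlparse

# The UTM parameters this sketch looks for (an assumed subset).
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign")


def extract_utm(url: str) -> dict:
    """Return the UTM parameters found in a landing-page URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; keep the first value of each UTM tag.
    return {key: query[key][0] for key in UTM_KEYS if key in query}


# Hypothetical landing-page URL from an email campaign.
landing_url = (
    "https://example.com/menu"
    "?utm_source=newsletter&utm_medium=email&utm_campaign=spring_promo"
)
tags = extract_utm(landing_url)
print(tags)
```

Storing these values alongside each visit or conversion is what later lets the different sources be joined in one platform and compared under a single attribution model.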
5. Not using data properly
There are many ways companies can misuse data. They may not know what they're looking for or may not be asking the right questions; they may have no defined metrics or fail to use them to make decisions; or they may become overwhelmed and succumb to analysis paralysis. Companies obsess over KPIs and OKRs but often don't have reliable ways to measure the metrics they identify as key. Without a reliable way to measure these goals and the effectiveness of specific initiatives, companies are left to rely on intuition, anecdotes, or past experiences.
The Solution
Companies that implement data-driven decision cultures and processes are generally more successful. The first step to using data properly is to start with a question or set of questions related to your business. Next, you want to determine what you will do with that information. It may look something like this.
| Question | Answer | Action |
|---|---|---|
| What was our gross margin last quarter? | Lower than expected. | We must switch manufacturers to cut our COGS and improve our gross margins. |
| Are sales trending up or down? | Sales are trending up. | Our marketing push worked, and the ROI was worthwhile. Let's do it again. |
| Which channels resulted in the most sales? | | We need to invest more in our events. |
| What product(s) is our top seller? | | We need to increase productivity and focus on fulfilling customer demand. |
Data is a valuable business asset, but that value can only be realized if you have the proper tools and expertise to manage and explore the data. As your data sources grow in complexity, so must your tools and training, so that staff can draw meaningful insights from the data.
Conclusion
Companies generating large volumes of data face the inevitable challenge of gaining control and access to all their data for reporting, analysis, decision-making, and other data use cases.
Organizations must adopt the right resources (technology, software, and trained staff) to overcome these challenges, which adds complexity to their business operations. Building solutions that don't work well due to missing or unreliable data results in organizations falling into the 'high confidence, bad answers' trap. The need to bring data together and create valuable insights has given rise to a whole new industry, along with the technologies and approaches that we know as Big Data, data analytics, data science, etc.
To stay ahead of the curve, companies need new tools for integrating data, new models for analyzing data, and data engineering efforts from businesses like ours. At Infostrux, we help organizations of all sizes deal with the undifferentiated heavy lifting required to improve data reliability and break data silos.