The Data Refinery Problem: Why Data-First Architectures Are Failing on Their Own Terms
Part 2 of a three-part response to Scott Brinker’s “The New Martech Stack for the AI Age”
Scott Brinker’s recent report for #Databricks on the future of the martech stack is one of the most coherent and useful articulations I’ve seen in years.
The shift he describes—from rigid, vertically layered stacks to a composable, data-centric canvas—is not only directionally correct, it is already underway. The industry has been straining toward this model for some time, and his framing brings much-needed clarity to that transition.
In Part 1 of my response to Scott’s report, I argued that his composable canvas — and the broader data-first paradigm it represents — is built on an inverted architecture. It begins with what is available rather than what is required. It starts with the roads rather than the destination.
Part 2 takes that argument further, into empirical territory. Part 3 will set the capstone in place.
There is a question that businesses have been carefully avoiding. Not out of bad faith, but because it is uncomfortable in proportion to how much has already been invested in not asking it.
The question is this: What exactly are we collecting data for?
Not in the abstract sense of “better decisions” — that answer is always available and never sufficient. In the specific, governing sense. What decisions, precisely? What causal questions are those decisions trying to answer? What data does answering those questions actually require?
A CFO once expressed his frustration with this issue to me this way: “Our functional teams, our data science team, our IT organization, have been building a vision of master data management that is nothing but roads to nowhere.”
Whether or not you consider that fair criticism, the focus we’ve placed on data has not succeeded in #Reality.
If we cannot answer that CFO’s question before we architect and build a new stack, we need to acknowledge what we are doing: we are not building a decision-making system. We are building a new system of record, a describer of events that have already passed. We are building a very expensive, very sophisticated pantry, and hoping that dinner will eventually suggest itself.
This is because of an inarguable, double-headed “law of gravity” for data:
Data is always and only about the Past. Past is not Prologue, especially these days.
The Refinery That Forgot Its Output
There is a useful way to think about what a data architecture actually is.
A refinery is not built to collect crude oil. It is built to produce specific outputs — jet fuel, diesel, gasoline — and every element of its design is organized around those outputs. The collection and processing of crude oil is upstream infrastructure in service of a downstream specification. You do not build the refinery and then discover what it makes. The output governs the architecture, not the other way around.
Enterprise data architecture has largely inverted this logic.
Over the past fifteen to twenty years, organizations have built increasingly sophisticated collection, storage, and processing infrastructure — data warehouses, data lakes, CDPs, and now composable canvases — premised on the idea that if enough data is collected and organized with sufficient sophistication, valuable insight will emerge from it.
The outputs were supposed to follow from the inputs. Insight was supposed to emerge from accumulation.
It has not happened at the scale, consistency, or reliability the investment promised. And the reason is not architectural immaturity, insufficient tooling, or inadequate data governance. The reason is that the refinery was built without an output specification. The architecture was designed to process whatever arrived, rather than to produce whatever was required.
Scott Brinker’s composable canvas is, in many ways, the most coherent and capable expression of this architecture to date. It solves real problems — particularly around integration, accessibility, and composability — that have constrained the system for years. That is precisely why it illustrates the limitation so clearly. It advances the efficiency of the architecture. It does not resolve the prior question of what the data is for.
Fifteen to Twenty Years of Evidence
This is not a theoretical concern. It is an empirical one.
Data-first architectures have had fifteen to twenty years of sustained, large-scale investment to demonstrate their core value proposition: that better data infrastructure produces better decisions and better commercial outcomes.
The business leaders who commissioned and funded that investment have delivered their verdict, consistently and across industries. The #1 critique of IT and data science investments, stated repeatedly in boardrooms and C-suites across the Fortune 500, is that the managed data layer has not improved company decision-making.
Not meaningfully. Not at the level that justified the investment.
This is a remarkable finding. It is not the complaint of organizations that failed to implement the technology. These are organizations that implemented it extensively, expensively, and in many cases with genuine organizational commitment. They built the data lakes. They hired the data scientists. They deployed the dashboards. They did what the paradigm required.
And they are still making major strategic decisions the same way they always did — through experience, intuition, political consensus, and retrospective rationalization dressed up as analysis.
Meanwhile, the data has accumulated. By almost any measure, enterprise data volumes have grown exponentially over the same period. More data, more tools, more sophistication, more cost — and no corresponding improvement in the quality of the decisions that the entire investment was supposed to enable.
That points to a structural limitation in the paradigm itself.
The composable canvas does not change this equation. It makes the data more fluid, more accessible, and more composable. It reduces integration friction significantly. These are genuine improvements. But they are improvements in the efficiency of an architecture that has not demonstrated it can deliver its core promise.
Moving faster in the wrong direction is not meaningful progress.
The Utilization Signal
The empirical record contains a specific and well-documented finding that deserves far more attention than it has received — not just for what it shows, but for what it has consistently shown across more than five years of continued infrastructure investment.
Multiple independent research programs have converged on the same conclusion: the substantial majority of collected enterprise data is never used for analytics or decision-making.
Seagate’s Rethink Data Report — based on a survey of 1,500 global enterprise leaders conducted in partnership with IDC — found that only 32% of data available to enterprises is ever put to work, leaving 68% unleveraged.¹ Splunk’s State of Dark Data research, surveying more than 1,300 business and IT decision makers across seven economies, found that 55% of enterprise data is dark — untapped, unanalyzed, and in many cases unknown — with 60% of respondents reporting that more than half their organization’s data is dark, and one-third reporting that figure at 75% or more.²
IBM’s published research on dark data, citing the Splunk global survey, confirms these findings and adds that roughly 90% of data generated by sensors and analog-to-digital conversions is never used.³ IDC’s StorageSphere Forecast documents that unstructured data — the fastest-growing category — already accounts for 78% of all data stored, and is projected to grow from 5.5 zettabytes in 2024 to 10.5 zettabytes by 2028 at a 16% compound annual rate, the overwhelming majority of which remains unanalyzed.⁴
These are not studies from critics of the data industry. Seagate manufactures storage. Splunk sells data infrastructure. IDC serves the technology market. IBM builds data platforms. These findings come from inside the paradigm, from organizations with every commercial incentive to tell a more optimistic story.
But the most important finding is not any individual statistic. It is the stability of the failure rate over time.
The Seagate and Splunk research dates from 2019 and 2020. Updated Splunk figures and Gartner analysis from 2023 and 2024 put the current range at 55-60% of enterprise data never analyzed or leveraged.⁵
Gartner analysts Richard Bartley, Dennis Xiu, and Anthony Carpino, writing in Gartner’s 2023 Planning Guide for Security, stated that between 55% and over 80% of the data that a business stores is dark.⁶
Across the entire period from 2019 to 2024 — five years during which the industry invested hundreds of billions of dollars in cloud migration, modern data platforms, CDPs, and data lake infrastructure — the utilization rate has barely moved. It was 55-68% unused in 2019. It remains 55-60% unused in 2024.
That stability is the indictment.
If the utilization failure were a maturation problem — if organizations simply needed better tools, more sophisticated governance, or more time — we would expect to see improvement as those tools arrived. The tools arrived. The composable canvas is in many respects their culmination. The utilization rate did not improve.
This is not a problem that better architecture can solve, because it is not an architectural problem. It is a question problem. Data that is never queried is not primarily data that is hard to find. It is data that was collected without a governing question. It exists because the architecture was designed to collect comprehensively and figure out what matters later. And what has happened, predictably, is that most of what was collected does not matter — or at least, no one has ever been able to specify clearly enough what it matters for to make querying it worthwhile.
The utilization gap is the physical manifestation, measured in zettabytes, of an architecture that began with collection rather than with inquiry.
Making that data more composable and semantically coherent will not change this. The semantic layer makes data more findable. It does not make ungoverned collection purposive. If the governing questions were never specified, improving access to the answers does not help.
You cannot navigate to a destination you have never defined.
The Financial Dimension
There is a cost reality that compounds the utilization argument and deserves direct attention from anyone with fiduciary responsibility for technology investment.
Veritas Technologies’ Global Databerg Report — a survey of 2,550 senior IT decision makers across 22 countries conducted by independent research firm Vanson Bourne, replicated in a follow-on Veritas study in 2019 — found consistently that 52% of all stored enterprise data is dark, with only 15% classified as business critical.⁷ Gartner analysts, writing in the 2023 Planning Guide for Security, put the dark data range at 55% to over 80% of business storage.⁶ These findings have held remarkably stable across nearly a decade of measurement, across multiple independent research programs, across different methodologies and geographies.
The financial implication is direct. If more than half of everything an enterprise stores is dark — collected, governed, secured, and maintained at full operational cost while generating no analytical value — then more than half of the data infrastructure budget is being spent on assets that are not performing the function that justified the investment. Industry analysis projects global dark data storage costs surpassing $500 billion annually by 2028 if current accumulation trends continue.⁸
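To see what that means at the level of a single budget, here is a minimal back-of-envelope sketch. The dark-data and business-critical percentages come from the Veritas research cited above; the budget figure is entirely hypothetical.

```python
# Back-of-envelope carrying cost of dark data.
# Percentages from the Veritas Databerg research cited above;
# the budget figure is hypothetical.

annual_data_budget = 40_000_000  # hypothetical: $40M/yr on storage, governance, security
dark_fraction = 0.52             # Veritas: share of stored data that is dark
critical_fraction = 0.15         # Veritas: share classified as business critical

# If carrying cost scales roughly with volume, the dark share of the
# budget is spent governing assets that produce no analytical value.
dark_cost = annual_data_budget * dark_fraction
critical_cost = annual_data_budget * critical_fraction

print(f"Spent carrying dark data:     ${dark_cost:,.0f}")      # $20,800,000
print(f"Spent on business-critical:   ${critical_cost:,.0f}")  # $6,000,000
print(f"Dark-to-critical spend ratio: {dark_cost / critical_cost:.1f}x")  # 3.5x
```

On those assumptions, every dollar spent carrying business-critical data is accompanied by roughly three and a half dollars spent carrying data that nobody queries.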
Perhaps most consequentially for organizations now investing in AI-era architectures: a Gartner survey of 1,203 data management leaders published in February 2025 found that 63% of organizations either do not have or are unsure whether they have the right data management practices for AI. Gartner predicts that through 2026, organizations will abandon 60% of AI projects for lack of AI-ready data.⁹
The composable canvas is being proposed as the architecture for the AI era. Gartner is simultaneously documenting that the majority of organizations lack the data readiness that AI requires. These are not separate problems. They are the same problem — the absence of a governing question architecture — expressing itself at two levels simultaneously. The composable canvas makes data more fluid and accessible. It does not make it AI-ready in the sense that actually matters, which is purposively collected, causally governed, and organized around specified decision requirements rather than comprehensive accumulation.
The composable canvas, by reducing the friction and cost of data collection and integration, accelerates the accumulation dynamic rather than correcting it. The asset base grows, the utilization rate stays flat, and the absolute cost of maintaining the asset rises even as its per-unit value falls.
The Expiry of Data
There is a second empirical problem that compounds the utilization failure, and it strikes more directly at the asset value of data-first architecture itself.
Historical data has a predictive lifespan. That lifespan is shorter than the industry has assumed, it varies significantly by application, and in the current environment it is compressing.
The utilization problem is about data that was never valuable enough to query. The expiry problem is about data that was valuable — and then stopped being.
This is not a theoretical observation from outside the paradigm. It is an operational reality that the intent data industry — one of the most data-intensive segments of the martech ecosystem — has quietly built into its products.
Autobound’s 2026 analysis of intent data infrastructure states explicitly that intent data has a half-life measured in days, not weeks, recommending that practitioners apply decay functions to signals older than 30 days and warning that stale intent data creates false confidence that is worse than no data at all.¹⁰ ZoomInfo’s intent data research confirms that signals decay quickly, with stale data meaning outreach arrives after accounts have already moved past their research phase or chosen a competitor.¹¹ Forrester’s predictive marketing research, cited by Saber’s recency signal analysis, documents that engagement value decreases exponentially over time, with signals more than 30 days old generating 3-5x lower conversion likelihood than recent ones — which is why standard practice now includes 24-hour monitoring windows for high-priority accounts.¹² Vector’s intent signal research finds a typical actionable window of just 2-3 weeks before buyer interest moves elsewhere entirely.¹³
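The decay functions those vendors describe are simple to express. Here is a minimal sketch of exponential signal decay, assuming a seven-day half-life and a 30-day staleness cutoff; both parameters are illustrative choices drawn from the ranges cited above, not any vendor’s actual implementation.

```python
from datetime import date

def signal_weight(signal_date: date, today: date,
                  half_life_days: float = 7.0,
                  stale_after_days: int = 30) -> float:
    """Exponential decay weight for an intent signal.

    half_life_days=7 reflects the 'half-life measured in days, not weeks'
    guidance; signals past the 30-day staleness window are zeroed out
    rather than down-weighted, since stale intent data is described as
    worse than no data at all.
    """
    age_days = (today - signal_date).days
    if age_days < 0 or age_days > stale_after_days:
        return 0.0
    return 0.5 ** (age_days / half_life_days)

# A fresh signal carries full weight; a two-week-old one, a quarter of it.
print(signal_weight(date(2024, 6, 15), date(2024, 6, 15)))  # 1.0
print(signal_weight(date(2024, 6, 1),  date(2024, 6, 15)))  # 0.25
```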
The industry has responded to this reality not by questioning the data-first premise, but by engineering faster pipelines to exploit signals before they expire. That response is revealing. It accepts signal decay as an immutable constraint and optimizes around it. What it does not — cannot — do within the data-first frame is ask why the signals decay so fast, what that decay reveals about the limits of correlative architecture, or whether a different kind of reasoning would be more durable.
This creates a difficult arithmetic problem for enterprise data investment at every level.
Organizations are spending hundreds of millions of dollars building infrastructure premised on the idea that accumulated historical data has durable predictive value. But the practitioners who work most intensively with behavioral data have already accepted, operationally if not rhetorically, that the predictive signal is measured in days. The asset is not a lake. It is a river. Value flows through it briefly and disappears. What remains is an expensive, meticulously governed record of patterns that no longer apply — and a set of products designed to help you act fast enough to catch the value before it expires.
The composable canvas accelerates this dynamic. By making data collection cheaper, easier, and more fluid, it reduces the friction that might otherwise force the discipline of asking whether data is worth collecting in the first place. When collection is nearly costless, accumulation becomes the default.
The river gets wider. The half-life of any individual signal stays the same — or shortens. The asset base grows while the per-unit value of the asset declines.
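The arithmetic of that decline is unforgiving. Using the illustrative seven-day half-life from the sketch above, here is how much of a signal’s original weight survives at common decision horizons:

```python
# Surviving signal weight at common decision horizons,
# assuming the illustrative 7-day half-life from the sketch above.
horizons = [("next-day alert", 1), ("two-week sprint", 14),
            ("quarterly plan", 90), ("annual strategy", 365)]
for label, days in horizons:
    print(f"{label}: {0.5 ** (days / 7):.6f}")
# next-day alert: 0.905724
# two-week sprint: 0.250000
# quarterly plan: 0.000135
# annual strategy: 0.000000
```

A signal architecture whose unit of value evaporates in days cannot, by accumulation alone, underwrite decisions made on quarterly or annual horizons.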
When the Stability Assumption Breaks
The data expiry problem becomes an epistemological crisis when the underlying stability assumption begins to fail.
Correlative models work when the relationship between past patterns and future outcomes is relatively stable. When enough of the causal structure persists from the period of data collection into the period of prediction, correlation can function as a practical proxy. The model does not need to understand why something happened. It just needs the pattern to repeat.
That condition held well enough, for long enough, in enough domains, to make correlative architecture appear broadly reliable.
It was not a general solution. It was a conditional one.
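A toy simulation makes the conditionality concrete. In the sketch below, which is entirely synthetic and built only to illustrate the point, a model is fit on data from one causal regime and then scored against a world where the structure has shifted; note that nothing about data volume or quality changes, only the stability assumption.

```python
import random

random.seed(0)

def regime(n: int, slope: float) -> list[tuple[float, float]]:
    """Generate (x, y) pairs where y depends causally on x with the given slope."""
    return [(x, slope * x + random.gauss(0, 1.0))
            for x in (random.uniform(0, 10) for _ in range(n))]

def fit_slope(data) -> float:
    """Least squares through the origin: slope = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

def mse(data, slope: float) -> float:
    return sum((y - slope * x) ** 2 for x, y in data) / len(data)

history = regime(5_000, slope=2.0)          # the past: one stable causal structure
model = fit_slope(history)                  # the pattern learned from history

stable_future = regime(1_000, slope=2.0)    # the future resembles the past
shifted_future = regime(1_000, slope=-1.0)  # the causal structure has changed

print(f"Error while the world holds still: {mse(stable_future, model):.2f}")   # ~1.0
print(f"Error after the structure shifts:  {mse(shifted_future, model):.2f}")  # ~300
# Collecting more history sharpens the first number and does nothing
# for the second: the failure lives in the stability assumption,
# not in the volume or quality of the data.
```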
We are increasingly operating outside those conditions.
Consider two of the most data-rich, mathematically rigorous applications of correlative modeling in existence: macroeconomic forecasting and actuarial science. Both represent decades of refinement, enormous investment, and genuine mathematical sophistication. Both are now showing visible and serious strain.
I recently had breakfast with an economist working for the Fed. The Federal Reserve’s econometric models — built on decades of historical data and refined through multiple economic cycles — are, in their words, “failing to resolve.” The Fed models were not wrong because the data was inadequate. They were wrong because the underlying causal structure of the economy had shifted in ways the historical patterns did not capture and could not signal. The past was no longer predicting the future. No additional data or architectural sophistication would have changed that, because the problem was not in the data. It was in the stability assumption the models were built on.
The insurance industry is experiencing the same failure at even more visible scale. Actuarial models — the accumulated product of generations of loss data, probability distributions, and risk pricing refinement — are breaking down under the combined pressure of climate volatility, geopolitical instability, and correlated tail risks that historical loss data cannot adequately represent. The response from major insurers is revealing: they are moving from probabilistic pricing to binary coverage decisions. Not “we will charge more for this risk,” but “we will not cover this risk at all.”
That binary shift is not a business strategy. It is an admission by more and more insurance companies that the “Big Data” correlative approach to insights has lost resolution. When you can no longer produce a confident probability distribution, you cannot price on a continuum. You can only decide whether to participate.
These are not peripheral applications or immature implementations. These are the strongest possible proof cases for correlative modeling — the domains where it has been most rigorously developed, most extensively validated, and most consequentially applied.
If the stability assumption is breaking here, it is breaking everywhere that operates on the same premise. Enterprise martech stacks are built on the same epistemological foundation, with less data, less mathematical sophistication, and less institutional capacity to recognize or respond to the failure when it arrives.
The difference is that the Fed’s forecast errors make headlines. Enterprise decision quality failures are absorbed quietly into missed targets, attribution disputes, and strategic pivots that nobody quite explains.
The Depreciating Asset
This leads to a conclusion with direct implications for anyone with fiduciary responsibility for technology investment.
The data lakes of major enterprises represent hundreds of billions of dollars of accumulated capital investment. They were built on a thesis: that historical data has durable predictive value that compounds over time as more data is added. That thesis is the asset’s carrying value.
When the stability assumption breaks — when the past becomes a progressively less reliable guide to the future — the thesis breaks with it. The data does not become less accurate as a record of what happened. It becomes less reliable, and in some conditions actively misleading, as a guide to what will happen next.
That distinction matters enormously.
In stable, short-horizon applications — fraud detection, session-level optimization, intra-day pricing — the correlative model still functions because the action happens fast enough to stay inside the validity window of the signal. These are real successes and they are worth acknowledging.
But in the domain where enterprise data investment was expected to deliver its greatest strategic value — multi-quarter planning, capital allocation, market positioning, GTM strategy — the validity window is collapsing. The decisions operate on time horizons that the correlative signal cannot reliably span. The investment thesis was built on long-horizon strategic value. The demonstrated value is concentrated in short-horizon tactical applications.
That gap is an asset impairment. Not in the accounting sense yet — but in the analytical sense that the thesis supporting the carrying value is not being demonstrated at the level or in the domain that justified the investment.
The composable canvas does not address this. It cannot, from within its own framing. It makes historical data more accessible, more fluid, and more composable. It has no mechanism for assessing whether the historical patterns that data encodes still reflect the causal structure of the world the organization is operating in today.
It makes the asset more composable.
It does not address the impairment.
And the impairment — the confounding environment — is the problem.
What This Demands
The argument across Parts 1 and 2 builds to a specific conclusion.
Data infrastructure is not the system. It is upstream infrastructure in service of a downstream purpose. That purpose is better decisions. And better decisions require not just better data, but better reasoning about what the data means — reasoning that can distinguish correlation from causation, pattern from mechanism, coincidence from driver.
The data-first paradigm has never adequately answered the question of what the data is for. It has assumed that if the infrastructure is sophisticated enough, the answer will auto-magically emerge from the data itself.
Fifteen to twenty years of investment, a utilization rate that has not improved despite continuous architectural advancement, a financial waste profile measured in hundreds of billions of dollars annually, the operational reality of signal decay that the intent data industry has quietly encoded into its products, the consistent complaint of the executives responsible for outcomes, and the documented failure of the paradigm’s most rigorous applications in economics and insurance — together these constitute the empirical verdict.
The system has become very good at collecting and managing data.
It has not become proportionally better at deciding.
Companies should be collecting the data specified by the models they have created to answer inherently causal questions. Not the data that is available. Not the data that is easiest to collect. The data that the model requires.
That inversion — model first, data second — is not a minor adjustment. It is a reorientation of the entire system. It changes what gets built, what gets collected, what gets governed, and what gets asked.
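To make the inversion concrete, here is a minimal sketch of what model-first collection looks like, with every model, variable, and field name hypothetical. The causal model is specified first, and the collection backlog is derived from what the model requires, rather than the analysis being shaped by whatever the warehouse happens to hold.

```python
# Hypothetical sketch: the model specifies the data, not the reverse.

# Step 1: a governing causal question, stated as a model specification.
# Every variable is required, either as a driver or as a confounder to control for.
pricing_model = {
    "question": "Does a price increase cause churn in the enterprise segment?",
    "treatment": "price_change_pct",
    "outcome": "churn_within_90d",
    "confounders": ["contract_length", "support_ticket_volume", "competitor_entry"],
}

# Step 2: what the warehouse happens to contain today (hypothetical fields).
available = {"price_change_pct", "churn_within_90d", "page_views",
             "email_opens", "contract_length"}

# Step 3: the collection spec is the gap between required and available,
# and anything the model does not require never enters the pipeline.
required = ({pricing_model["treatment"], pricing_model["outcome"]}
            | set(pricing_model["confounders"]))

print(f"Collect next:  {sorted(required - available)}")
# ['competitor_entry', 'support_ticket_volume']
print(f"Do not govern: {sorted(available - required)}")
# ['email_opens', 'page_views']
```

The toy code is not the point; the direction of derivation is. The collection backlog falls out of the model’s requirements, and the majority of data that is never put to work is never collected in the first place.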
Scott’s composable canvas is a genuine and meaningful improvement on what came before. But it is an improvement in service of an architecture that still has the question backwards.
Making the wrong question easier to answer faster does not bring us closer to the right answer.
Part 3 in this series will address what the right architecture looks like — and why the inversion from data-first to model-first is not a theoretical preference but a practical and fiduciary necessity.
Notes
¹ Seagate Technology and IDC. Rethink Data: Put More of Your Business Data to Work — From Edge to Cloud. Seagate Technology, 2020. Survey of 1,500 global enterprise leaders across North America, Europe, and Asia Pacific. Available at seagate.com/our-story/rethink-data.
² Splunk Inc. The State of Dark Data. Research conducted by TRUE Global Intelligence. Splunk, 2019. Survey of 1,300+ business and IT decision makers across seven countries. Available at splunk.com. Key finding reported in Splunk press release: “Dark Data Research Reveals Widespread Complacency in Driving Business Results,” April 30, 2019.
³ IBM. “What Is Dark Data?” IBM Think Topics, 2024. Citing Splunk State of Dark Data survey. IBM estimate on sensor data utilization sourced from IBM research on industrial data generation. Available at ibm.com/think/topics/dark-data.
⁴ Alation. “Dark Data: What It Is, Risks & How to Unlock Its Value.” December 2025. Citing IDC StorageSphere Forecast. Available at alation.com/blog/dark-data. IDC StorageSphere is IDC’s primary global data tracking model.
⁵ Lex Data Labs. “Unlocking the Power of Dark Data.” 2025. Citing Splunk 2024 State of Dark Data industry survey report and Gartner 2023 analysis. Available at lexdatalabs.com. Cites Splunk (2024) State of Dark Data (Industry Survey Report), January 2024, and Gartner (2023) analysis placing the range at 55-60% of enterprise data never analyzed or leveraged.
⁶ BigID. “What Is Dark Data? Uncovering Vulnerable Data.” 2023. Quoting Gartner analysts Richard Bartley, Dennis Xiu, and Anthony Carpino, Gartner 2023 Planning Guide for Security: “55% to over 80% of the data that a business stores [is] dark.” Available at bigid.com/blog/what-is-dark-data.
⁷ Veritas Technologies. “Veritas Global Databerg Report Finds 85% of Stored Data Is Either Dark, or Redundant, Obsolete, or Trivial.” Press release, March 15, 2016. Survey of 2,550 senior IT decision makers across 22 countries, conducted by Vanson Bourne. Available at veritas.com/news-releases. Finding replicated in: Veritas Technologies. “Dark Data Exceeds 50%, Creating Major Security Blind Spot for Most Companies.” Press release, June 4, 2019. Survey of 1,500 IT decision makers across 15 countries, conducted by Vanson Bourne.
⁸ DataStackHub. “Dark Data Statistics For 2025–2026.” October 2025. Compilation of 50+ verified dark data statistics from enterprise studies, market reports, and industry research (2024–2026). Available at datastackhub.com/insights/dark-data-statistics. Projects global dark data storage costs surpassing $500 billion annually by 2028.
⁹ Gartner, Inc. “Lack of AI-Ready Data Puts AI Projects at Risk.” Press release, February 26, 2025. Based on survey of 1,203 data management leaders, July 2024. Available at gartner.com/en/newsroom. Analyst quoted: Roxane Edjlali, Senior Director Analyst, Gartner.
¹⁰ Autobound. “Top 15 Intent Data Providers (2026) With Pricing.” February 2026. Available at autobound.ai/blog. Specific guidance: “Intent data has a half-life measured in days, not weeks. Set aggressive decay windows. Weight signals from the last 7-14 days most heavily. Apply decay functions to anything older than 30 days. Stale intent data creates false confidence that is worse than no data at all.”
¹¹ ZoomInfo. “What is Intent Data? How to Turn Signals into Action.” ZoomInfo Pipeline, 2025. Available at pipeline.zoominfo.com/sales/what-is-intent-data. Specific finding: “Intent signals decay quickly. Stale data means you’re reaching out to accounts that have already moved past their research phase or chosen a competitor.”
¹² Saber. “Recency Signals: Definition, Examples & Use Cases.” 2025. Available at saber.app/glossary/recency-signals. Citing Forrester research on predictive marketing: “Engagement value decreases exponentially over time, with recent activities indicating 3-5x higher conversion likelihood than 30+ day old signals.” Standard industry practice cited: 24-hour monitoring windows for immediate sales alerts.
¹³ Vector. “How to Use Intent Signals to Act on Buyer Interest.” 2025. Available at vector.co/blog. Specific finding: “Intent signals have a short shelf life. You typically have a 2–3-week window before buyer interest moves elsewhere.”