Musings of a Data Enthusiast – Will We Really Need Data Lakes in the Future?

Over our 25+ years in data and AI, data lakes and data warehouses – whether on-premises or in the cloud – haven’t really become the “system of record” or even “system of SVT (single version of truth)” that enterprises envisioned. For too long we’ve been talking about data consolidation, integration, and platform modernization to make these systems live up to their promise.

We think we’re on the verge of an entirely different conversation: when agentic AI can talk directly to source systems, do we still need data lakes?

Wherefore Art Thou, Data Lakes?

We’ve always built data lakes to answer three questions: where did our business come from? How are we performing today? And where are we going?

Those questions require a) integrating complex and large data volumes from across the business, b) applying historical context, and c) enabling trustworthy, consistent access to that data at scale.

That’s why we all created ETL pipelines, governance frameworks, Medallion architectures, and virtualization layers. And remember, that’s only for structured data. It doesn’t even cover the majority of what we store, which sits in formats like PDFs, emails, and chats.

But with agentic AI, the business is going to start asking whether we can’t just go to the source directly and skip the entire data layer.

The Shrinking (But Still Valuable) Data Lake

There’s real potential for intelligently designed agent-based systems that leverage the right APIs and source systems to bypass traditional data stacks on the way to insights, but only if the data is clean, integrated, and trustworthy at the source. Every data pro knows that’s rarely, if ever, the case.
  • Clean source data remains elusive. That’s why lakes were initially created – to standardize, cleanse, and contextualize disparate data sources.
  • Even with advanced APIs, it’s still challenging to achieve real-time integration of massive (and often aging) enterprise data sets across systems like SAP, Salesforce, and Oracle.
  • And Agentic AI lacks historical context – unless you give it one. A well-designed data lake can provide the long-term memory that agents need to improve over time.

However, we do believe the days of data lakes being the one-stop data shop are over. Data lakes must quickly evolve to be a foundation layer that provides context to enable insight discovery by a broader ecosystem of agents.

Avoiding the Agent “Data Silo” Trap

Agents can go straight to the source, but if we’re not careful they will create even more data silos than the ones data lakes were designed to eliminate.

Imagine agents deployed across various functions – finance, procurement, customer service, supply chain, wealth management, inventory management – all gathering incredibly rich data from user interactions, workflows, and decisions. Where does that data go?

Today that incredibly valuable information will likely stay right at home with the agent. This is where data lakes can take on a new role. Agents that push these signals back into the enterprise data lake or analytics platform avoid repeating the mistakes of robotic process automation (RPA) and intelligent process automation (IPA), where valuable operational data remained in isolated, disconnected systems. We like to call this approach “leave no data behind.”

With closed feedback loops, where agents send context-rich signals to data lakes, we greatly increase data’s value for insights and real enterprise learning. These signals help answer strategic questions like “what in execution or operations is keeping us from meeting goals?” or “where can process changes make the biggest impact?” with more clarity and confidence.
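To make the feedback loop concrete, here is a minimal sketch of what one such signal might look like before it is landed. The `AgentSignal` schema, its field names, and the `to_landing_record` helper are illustrative assumptions, not a standard format; a real platform would define its own schema and write path.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class AgentSignal:
    """One context-rich event an agent emits after acting (hypothetical schema)."""
    agent: str      # which agent produced the signal, e.g. "procurement"
    action: str     # what the agent did
    outcome: str    # what happened as a result
    context: dict = field(default_factory=dict)  # free-form business context
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_landing_record(signal: AgentSignal) -> str:
    """Serialize a signal as one JSON line, ready to append to a landing zone."""
    return json.dumps(asdict(signal), sort_keys=True)


# Example values are made up for illustration.
signal = AgentSignal(
    agent="procurement",
    action="reorder_requested",
    outcome="approved",
    context={"supplier": "ACME", "lead_time_days": 12},
)
record = to_landing_record(signal)
```

Keeping each signal as a self-describing JSON line is one simple option: it can be appended to cheap storage first and modeled later, which fits the “leave no data behind” idea.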

Two Paths, One Brain

Agents can both act and think. Your architecture should make the two paths explicit:
  • Live-Action Path (do things now). The agent plans, calls operational APIs, and writes results back to source systems – lowest latency, minimal duplication.
  • Shared-Memory Path (know things reliably). The agent retrieves governed context – documents, features, historical facts – from the lake/lakehouse + vector index, providing history, policy, lineage, and auditability.

In practice, most enterprise tasks touch both paths: act via APIs to change state, and ground the decision in governed memory for safety and reuse.
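The two paths can be sketched as two explicit dependencies of a single task handler. Everything here is a hypothetical stand-in: `retrieve_context` represents a governed lookup against the lake/lakehouse and vector index, `call_api` represents an operational API, and the `Decision` record is just one way to keep the link between an action and the context that justified it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    action: str
    grounded_in: list[str]  # ids of the governed context used for this action
    result: str


def handle_task(
    task: str,
    retrieve_context: Callable[[str], list[dict]],  # shared-memory path
    call_api: Callable[[str], str],                 # live-action path
) -> Decision:
    # Shared-memory path: fetch governed context before acting.
    context = retrieve_context(task)
    # Live-action path: change state directly at the source system.
    result = call_api(task)
    # Record which context grounded the action, for audit and reuse.
    return Decision(task, [c["id"] for c in context], result)


# Stand-ins for a lakehouse lookup and an operational API (both made up).
def fake_retrieve(task: str) -> list[dict]:
    return [{"id": "policy-001", "text": "Refunds over 500 need approval."}]


def fake_api(task: str) -> str:
    return f"done:{task}"


decision = handle_task("issue_refund", fake_retrieve, fake_api)
```

Passing the two paths in as separate functions keeps them visible in the design: either one can be swapped, mocked, or governed independently.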

Do We Still Land Data? Yes – Selectively

The lakehouse moves from “land everything” to selective landing. Access most operational data in place; materialize only what’s needed for governance, scale, and reproducibility.
  • Always land: Agent telemetry (events, prompts, tools, outcomes), training/validation datasets (immutable), and regulatory snapshots.
  • Usually land (lightweight): Features, embeddings + chunk metadata (with pointers to sources), CDC logs when you need time travel.
  • Reference, don’t land: Most operational tables and much unstructured content – use connectors/virtual tables; keep catalog/policy/lineage as metadata pointing to sources.
  • Full copies (only when the 5 Rs demand it): Regulation, Reproducibility, Reliable joins at scale, Runtime cost/stability, Richer history.
Decision rule: If it’s about action, read in place; if it’s about memory, land the artifacts.
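As a rough illustration, the tiers above can be written down as a small policy function. The category names (`agent_telemetry`, `cdc_log`, and so on) and the `landing_policy` function are assumptions made for this sketch; real catalogs will use their own taxonomy, and the order of the checks is a simplification.

```python
from enum import Enum


class Landing(Enum):
    ALWAYS = "always land"
    LIGHTWEIGHT = "usually land (lightweight)"
    REFERENCE = "reference, don't land"
    FULL_COPY = "full copy"


# Hypothetical labels for the artifact kinds named in the list above.
ALWAYS_LAND = {"agent_telemetry", "training_dataset", "regulatory_snapshot"}
USUALLY_LAND = {"features", "embeddings", "cdc_log"}
# The 5 Rs that justify a full copy.
FIVE_RS = {
    "regulation",
    "reproducibility",
    "reliable_joins",
    "runtime_cost",
    "richer_history",
}


def landing_policy(kind: str, needs: frozenset = frozenset()) -> Landing:
    """Pick a landing tier for a data asset of the given kind."""
    if kind in ALWAYS_LAND:
        return Landing.ALWAYS
    if kind in USUALLY_LAND:
        return Landing.LIGHTWEIGHT
    # Anything else is read in place unless one of the 5 Rs applies.
    if needs & FIVE_RS:
        return Landing.FULL_COPY
    return Landing.REFERENCE
```

For example, `landing_policy("operational_table")` returns `Landing.REFERENCE`, while the same table with `needs=frozenset({"regulation"})` returns `Landing.FULL_COPY`.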

Agents should bypass the lakehouse for many operational reads/writes – that’s the point of acting at the source. But the lakehouse doesn’t die; it thins and becomes the shared memory and governance fabric for history, policy, lineage, telemetry, and training. Land artifacts, not everything; run actions at the source; and close the loop by logging agent behavior into governed storage.

Data Lakes 2.0 – Reusability and Impact

As AI continues to evolve, so must our data strategies. Agentic AI should force us to be smarter, more agile, and more intentional about where and how we manage data and how we think about the role of data lakes.

The future isn’t data lakes vs. agents. It’s about building dynamic, adaptive architectures where the data lake becomes a trusted foundation for context and insights, helping to make agents both intelligent and reusable.

With this strategic mindset and approach, we can go beyond customer 360, partner 360 or employee 360 and truly get to Business 360.
