25, Sep 2025

Musings of a Data Enthusiast – Will We Really Need Data Lakes in the Future?

Over our 25+ years in data and AI, data lakes and data warehouses – whether on-premises or in the cloud – haven’t really become the “system of record” or even the “single version of truth” (SVT) that enterprises envisioned. For too long we’ve been talking about data consolidation, integration, and platform modernization to make these systems live up to their promise.

We think we’re on the verge of an entirely different conversation: when Agentic AI can talk directly to source systems, do we need data lakes anymore?

Wherefore Art Thou, Data Lakes?

We’ve always built data lakes to answer three questions: Where did our business come from? How are we performing today? And where are we going?

Those questions require a) integrating complex and large data volumes from across the business, b) applying historical context, and c) enabling trustworthy, consistent access to that data at scale.

That’s why we all created ETL pipelines, governance frameworks, Medallion architectures, and virtualization layers. And remember, that’s only for structured data. It doesn’t even cover the majority of what we store, which sits in unstructured formats like PDFs, emails, and chats.

But with Agentic AI, the business is going to start asking whether we can just go to the source directly and skip the entire data layer.

The Shrinking (But Still Valuable) Data Lake

There’s real potential for intelligently designed agent-based systems that leverage the right APIs and source systems to bypass traditional data stacks on the way to insights, but only if the data is clean, integrated, and trustworthy at the source. Every data pro knows that’s rarely, if ever, the case.

Clean source data remains elusive. That’s why lakes were initially created – to standardize, cleanse, and contextualize disparate data sources.
Even with advanced APIs, it’s still challenging to achieve real-time integration of massive (and often aging) enterprise data sets across systems like SAP, Salesforce, and Oracle.
And Agentic AI lacks historical context – unless you give it some. A well-designed data lake can provide the long-term memory that agents need to improve over time.

However, we do believe the days of data lakes being the one-stop data shop are over. Data lakes must quickly evolve to be a foundation layer that provides context to enable insight discovery by a broader ecosystem of agents.

Avoiding the Agent “Data Silo” Trap

While agents can go straight to the source, if we’re not careful they will create even more data silos than the ones data lakes were designed to eliminate.

Imagine agents deployed across various functions – finance, procurement, customer service, supply chain, wealth management, inventory management – all gathering incredibly rich data from user interactions, workflows, and decisions. Where does that data go?

Today that incredibly valuable information will likely stay right at home with the agent. This is where data lakes can take on a new role. Agents that push these signals back into the enterprise data lake or analytics platform avoid repeating the mistakes of robotic process automation (RPA) and intelligent process automation (IPA), where valuable operational data remained in isolated, disconnected systems. We like to call this approach “leave no data behind.”

With closed feedback loops, where agents send context-rich signals to data lakes, we compound data’s value for insights and real enterprise learning. These signals help answer strategic questions like “what in execution or operations is keeping us from meeting goals?” or “where can process changes make the biggest impact?” with more clarity and confidence.
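As a rough illustration of closing the loop, the sketch below shows an agent appending a context-rich signal to an append-only file that stands in for a landing zone in the lake. The `AgentSignal` fields, the `emit_signal` and `read_signals` helpers, and the JSON Lines file are all assumptions made for illustration; a real deployment would write to whatever governed storage the enterprise platform provides.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class AgentSignal:
    """One context-rich event an agent sends back (hypothetical schema)."""
    agent: str          # which agent produced the signal, e.g. "procurement"
    task: str           # what the agent was asked to do
    tools: list[str]    # operational APIs or tools it called
    outcome: str        # what happened, e.g. "approved" or "escalated"
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def emit_signal(signal: AgentSignal, landing_file: Path) -> None:
    """Append the signal as one JSON line; existing lines are never rewritten."""
    with landing_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(signal)) + "\n")


def read_signals(landing_file: Path) -> list[dict]:
    """Read every landed signal back, e.g. for analysis across agents."""
    with landing_file.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because every agent writes to the same place in the same shape, signals from finance, procurement, or customer service can be analyzed together rather than staying with the agent that produced them.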

Two Paths, One Brain

Agents can both act and think. Your architecture should make the two paths explicit:

Live-Action Path (do things now). The agent plans, calls operational APIs, and writes results back to source systems – lowest latency, minimal duplication.
Shared-Memory Path (know things reliably). The agent retrieves governed context – documents, features, historical facts – from the lake/lakehouse + vector index, providing history, policy, lineage, and auditability.

Situation	Live-Action (direct APIs)	Shared-Memory (lake/KB)
Fresh operational state (inventory, case status)	Primary	Secondary (policy/history)
Long-horizon history or cross-domain joins	Secondary	Primary
Strict audit/traceability	Secondary (log results)	Primary (lineage/policy)
Low-complexity, narrow task	Primary	Optional / Not ideal
Privacy/consent checks	Secondary (enforcement)	Primary (policy store/redaction)
Model evaluation & improvement	Optional / Not ideal	Primary (telemetry/labels)

In practice, most enterprise tasks touch both paths: act via APIs to change state, and ground the decision in governed memory for safety and reuse.
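The table above can be encoded as a simple lookup, which is one way to make the two paths explicit in code. This is a minimal sketch: the `AccessPath` enum, the situation keys, and `plan_paths` are hypothetical names, and the “Optional / Not ideal” cells are modeled as having no secondary path.

```python
from enum import Enum
from typing import NamedTuple, Optional


class AccessPath(Enum):
    LIVE_ACTION = "live-action (direct APIs)"
    SHARED_MEMORY = "shared-memory (lake/KB)"


class Route(NamedTuple):
    primary: AccessPath
    secondary: Optional[AccessPath]  # None where the table says "Optional / Not ideal"


LIVE, MEMORY = AccessPath.LIVE_ACTION, AccessPath.SHARED_MEMORY

# Keys are illustrative labels for the table rows, not a standard taxonomy.
ROUTES = {
    "fresh_operational_state": Route(LIVE, MEMORY),
    "long_horizon_history": Route(MEMORY, LIVE),
    "strict_audit": Route(MEMORY, LIVE),
    "narrow_task": Route(LIVE, None),
    "privacy_consent_check": Route(MEMORY, LIVE),
    "model_evaluation": Route(MEMORY, None),
}


def plan_paths(situation: str) -> list[AccessPath]:
    """Return the paths to consult for a situation, primary first."""
    route = ROUTES[situation]  # raises KeyError for an unknown situation
    return [path for path in route if path is not None]
```

For example, `plan_paths("fresh_operational_state")` lists the live-action path first and the shared-memory path second, while `plan_paths("narrow_task")` returns only the live-action path.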

Do We Still Land Data? Yes – Selectively

The lakehouse moves from “land everything” to selective landing. Access most operational data in place; materialize only what’s needed for governance, scale, and reproducibility.

Always land: Agent telemetry (events, prompts, tools, outcomes), training/validation datasets (immutable), and regulatory snapshots.
Usually land (lightweight): Features, embeddings + chunk metadata (with pointers to sources), CDC logs when you need time travel.
Reference, don’t land: Most operational tables and much unstructured content – use connectors/virtual tables; keep catalog/policy/lineage as metadata pointing to sources.
Full copies (only when the 5 Rs demand it): Regulation, Reproducibility, Reliable joins at scale, Runtime cost/stability, Richer history.

Decision rule: If it’s about action, read in place; if it’s about memory, land the artifacts.
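One way to picture the selective-landing rules is as a small classifier over asset kinds. The sketch below is illustrative only: the `kind` labels, the `Landing` enum, and `landing_policy` are hypothetical, and a real catalog would carry far more metadata than a single string.

```python
from enum import Enum


class Landing(Enum):
    ALWAYS = "always land"
    LIGHTWEIGHT = "usually land (lightweight)"
    REFERENCE = "reference, don't land"
    FULL_COPY = "full copy"


# Illustrative asset kinds drawn from the lists above.
ALWAYS_LAND = {
    "agent_telemetry",
    "training_dataset",
    "validation_dataset",
    "regulatory_snapshot",
}
USUALLY_LAND = {"features", "embeddings", "cdc_log"}

# The "5 Rs" that can justify a full copy.
FIVE_RS = {
    "regulation",
    "reproducibility",
    "reliable_joins",
    "runtime_cost",
    "richer_history",
}


def landing_policy(kind: str, reasons: frozenset[str] = frozenset()) -> Landing:
    """Classify an asset; anything unlisted defaults to being read in place."""
    if kind in ALWAYS_LAND:
        return Landing.ALWAYS
    if kind in USUALLY_LAND:
        return Landing.LIGHTWEIGHT
    # Operational tables and unstructured content are copied
    # only when at least one of the 5 Rs applies.
    if reasons & FIVE_RS:
        return Landing.FULL_COPY
    return Landing.REFERENCE
```

Under these assumptions, an operational table with no stated reason is referenced in place, and the same table is fully copied only when a reason such as `"regulation"` is supplied.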

Agents should bypass the lakehouse for many operational reads/writes – that’s the point of acting at the source. But the lakehouse doesn’t die; it thins and becomes the shared memory and governance fabric for history, policy, lineage, telemetry, and training. Land artifacts, not everything; run actions at the source; and close the loop by logging agent behavior into governed storage.

Data Lakes 2.0 – Reusability and Impact

As AI continues to evolve, so must our data strategies. Agentic AI should force us to be smarter, more agile, and more intentional about where and how we manage data and how we think about the role of data lakes.

The future isn’t data lakes vs. agents. It’s about building dynamic, adaptive architectures where the data lake becomes a trusted foundation for context and insights, helping to make agents both intelligent and reusable.

With this strategic mindset and approach, we can go beyond customer 360, partner 360, or employee 360 and truly get to Business 360.

Kishore Jasti

Head of Technology and Delivery - Kishore is a seasoned leader with 24+ years of global experience in digital transformation, enterprise data, and AI architecture. At PivotX, he leads technology architecture and delivery engagements, helping clients implement modern data strategies, including Data Mesh, data product operating models, and agentic AI solutions.
