When I was evaluating companies I wanted to join, I kept coming back to one simple question:
What are the biggest problems customers are facing right now?
In meeting after meeting with analysts, operators, and engineering leaders, one theme surfaced again and again:
Observability costs are out of control
So I started asking every team the same question:
Is your observability spend growing faster than your cloud infrastructure spend?
The overwhelming answer was yes.
That response stopped me in my tracks. Observability was supposed to help teams control complexity in the cloud era. Instead, for many organizations, it has become one of the fastest-growing line items in the budget.
So I followed up with a second question:
Why is observability spend growing so quickly?
Over time, two hypotheses began to crystallize.
1. The Diminishing Returns of Redundant Visibility
In the early days of this industry, we heard phrases like:
- “You can’t manage what you can’t see.”
- “Moving to cloud and microservices without observability is like driving on the Autobahn blind.”
Those messages were right for their time. When organizations first embraced cloud and microservices, visibility gaps were real and painful, and it was natural to collect as much data as possible so any issue could be debugged quickly.
Today, it’s not uncommon to see:
- Agents everywhere
- Multiple agents on the same workload
- Logs, metrics, traces, profiles - duplicated across vendors
Several years ago at KubeCon, I struck up a conversation at a bar with the person sitting next to me. Unsurprisingly, he was an SRE for one of the leading gaming companies in the world. When I asked about their observability stack, he casually mentioned that a single application could have four or five APM agents deployed, often collecting nearly identical data.
Four or five.
At that point, are we increasing visibility - or just multiplying cost?
The industry optimized for completeness. But the marginal value of each additional telemetry stream is shrinking, while the marginal cost continues to compound.
We are now firmly in the era of diminishing returns.
2. AI Is Accelerating Observability Consumption
Yes, AI is accelerating innovation. It’s also accelerating spend.
I’ve been asking customers a related question:
Are you deploying more or less code now with AI-assisted development and “vibe coding”?
You can probably guess the answer.
It’s more.
More code means:
- More services
- More containers
- More ephemeral workloads
- More events
- More telemetry
Every new service emits logs. Every container exports metrics. Every request generates traces.
And what happens when infrastructure scales dynamically?
Telemetry scales with it.
In many environments, observability data growth is now outpacing infrastructure growth itself. Not because vendors are malicious. Not because teams are careless.
But because the system is compounding:
More code → more infrastructure → more telemetry → higher ingestion → higher retention → higher bill.
And we haven’t fundamentally changed the economic model.
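The compounding chain above can be made concrete with a toy cost model. Everything here is an illustrative assumption (service counts, growth rate, per-GB price, retention factor), not any vendor's actual pricing:

```python
# Toy model of compounding telemetry spend. All numbers are illustrative
# assumptions, not real vendor pricing.

def telemetry_cost(months, services=50, service_growth=0.05,
                   gb_per_service=30, price_per_gb=0.10,
                   retention_multiplier=1.5):
    """Project monthly observability spend as the service count compounds.

    service_growth: assumed monthly growth in service count
                    (AI-assisted teams shipping more code, more services)
    retention_multiplier: factor turning raw ingestion cost into
                          ingestion-plus-retention cost
    """
    costs = []
    for m in range(months):
        n_services = services * (1 + service_growth) ** m
        ingested_gb = n_services * gb_per_service   # telemetry scales with infra
        costs.append(ingested_gb * price_per_gb * retention_multiplier)
    return costs

costs = telemetry_cost(24)
# With 5% monthly service growth, spend roughly triples over two years,
# even though per-service emission never changed.
print(f"month 1: ${costs[0]:,.0f}   month 24: ${costs[-1]:,.0f}")
```

The point of the sketch: nothing in the model is malicious or careless, yet the bill compounds anyway, because cost is tied to raw volume rather than to value.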
The Rise of Cost-Focused Observability Innovation
This is why we’re starting to see companies whose sole focus is cost control within observability.
Companies like Grepr are focused on reducing noise - and therefore reducing bills - while allowing organizations to preserve their existing dashboards, alerts, and runbooks.
That’s important.
Because ripping and replacing tooling is rarely the answer. Teams don’t want to lose the workflows they rely on. They want:
- Less duplication
- Smarter filtering
- Better signal-to-noise
- Predictable costs
The market is maturing. It’s time we move from “collect everything” to “collect what matters.”
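What "collect what matters" can look like in practice: a minimal ingestion-filter sketch. The rules below (health-check paths, a 1-in-10 sample rate for INFO logs) are hypothetical examples, not recommendations from any specific tool:

```python
import random

# Minimal "collect what matters" sketch. The noise patterns and sample
# rate are hypothetical assumptions for illustration.

NOISE_PATTERNS = ("GET /healthz", "GET /readyz")  # assumed health-check paths
INFO_SAMPLE_RATE = 0.1                            # keep 1 in 10 INFO logs

def should_ingest(level: str, message: str, rng=random.random) -> bool:
    if any(p in message for p in NOISE_PATTERNS):
        return False              # pure noise: never worth paying to retain
    if level in ("ERROR", "WARN"):
        return True               # always keep the signals that page people
    if level == "INFO":
        return rng() < INFO_SAMPLE_RATE   # sample repetitive info lines
    return False                  # DEBUG and below stay out of the bill

# A thousand health checks and one real error:
logs = [("INFO", "GET /healthz 200")] * 1000 + [("ERROR", "db timeout")]
kept = [entry for entry in logs if should_ingest(*entry)]
```

The design choice is the important part: errors and warnings pass unconditionally, so the alerts and runbooks built on them keep working, while volume is cut where it carries no signal.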
Where This Leaves Us
Observability is not optional.
But unlimited ingestion is not sustainable.
We are entering the next phase of this industry:
- From visibility → efficiency
- From volume → value
- From “more data” → “better signal”
I’d love to hear what others are seeing.
Is your observability spend growing faster than your cloud infrastructure spend?
Are you seeing agent sprawl or telemetry duplication inside your environments?
And how are you thinking about balancing visibility with economics in the age of AI?
Frequently Asked Questions
Q: Why is observability spend growing so fast even when infrastructure costs stay flat?
Observability data compounds across every layer of the stack. More code means more services, more containers, and more ephemeral workloads, each emitting logs, metrics, and traces. When infrastructure scales dynamically, telemetry scales with it, often faster. The ingestion and retention costs follow automatically, and most teams are still operating under an economic model built for a much smaller data volume.
Q: What is agent sprawl and why does it matter?
Agent sprawl happens when multiple monitoring or APM agents are deployed on the same workload, often collecting nearly identical data. It's more common than most teams realize. Each additional agent adds cost without proportionally adding signal, which means teams pay more for visibility they already have.
Q: Does reducing observability data mean losing visibility into production issues?
Not if you filter intelligently. The goal is preserving the signals that matter, such as the dashboards, alerts, and runbooks teams actually rely on, while eliminating the redundant telemetry that drives up the bill without improving incident response. A better signal-to-noise ratio typically improves response time rather than hurting it.
Q: How does AI-assisted development affect observability costs?
AI accelerates code output, which accelerates the services, containers, and events that generate telemetry. Teams shipping more code faster are also generating observability data faster. Without a corresponding change to how that data is filtered and retained, the cost curve steepens alongside the development velocity curve.
Q: What does a cost-focused observability strategy actually look like in practice?
It starts with auditing duplication across your existing stack, identifying redundant agents, overlapping vendors, and data being ingested but never queried. From there, smarter filtering and sampling reduce volume without creating blind spots. The goal is predictable costs tied to business value, not to raw data volume.
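The first step, auditing duplication, can start with something very simple. A hypothetical sketch, assuming you can export a list of (workload, agent) pairs from your cluster or CMDB (the workload and agent names below are made up):

```python
from collections import Counter

# Hypothetical duplication audit: flag workloads running more than one
# monitoring/APM agent. Inventory entries are invented for illustration.

inventory = [
    ("checkout-svc", "vendor-a-apm"),
    ("checkout-svc", "vendor-b-apm"),
    ("checkout-svc", "otel-collector"),
    ("search-svc",   "otel-collector"),
]

def find_agent_sprawl(pairs):
    """Return workloads with more than one agent, with the agent count."""
    counts = Counter(workload for workload, _ in pairs)
    return {workload: n for workload, n in counts.items() if n > 1}

sprawl = find_agent_sprawl(inventory)   # {"checkout-svc": 3}
```

Even a crude inventory like this tends to surface the four-or-five-agents-per-application pattern described earlier, and it gives the cost conversation a concrete starting point.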