A conversation with Evan Robinson, CTO of Jitsu
Jitsu runs last-mile delivery across 200+ services serving 122 million people. A single package generates roughly 400 log events, traces not included. At millions of daily shipments, that's a significant data volume problem.
The real cost isn't the bill
Jitsu CTO Evan Robinson's core argument: the provider invoice is a distraction. The cost that matters is decision latency during incidents. When the relevant signal is buried under millions of informational log events, the gap between "something is broken" and "here's why" stretches from minutes into hours. Incidents become outages, and outages become customer impact.
Jitsu's early approach was to instrument everything and sort relevance later. As they expanded metros and moved toward autoscaling, log volume grew faster than the business. Evan calls the accumulated low-value events "comfort logs." His team kept them because dropping them felt risky. Comfort and clarity turned out to be different things.
A real example
A third-party integration in Jitsu's highest-traffic driver API had no timeout configured. Hanging connections accumulated in the pool, degrading the API over hours. The evidence was there, but buried in routine traffic. Once engineers identified the cause, diagnosis took 30 seconds. Getting there took far longer.
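The failure mode here is a classic one: a client call with no timeout holds its connection until the remote side responds, which may be never, and the pool slowly fills with stuck connections. The sketch below is not Jitsu's code; it's a minimal, hypothetical illustration of the fix, using a worker thread to bound how long a slow third-party call can block.

```python
import concurrent.futures
import time

def call_with_timeout(fn, timeout_s):
    """Run fn on a worker thread; raise TimeoutError instead of hanging forever."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    finally:
        # Don't block waiting for the stuck call to finish.
        pool.shutdown(wait=False)

def slow_third_party_call():
    # Stand-in for an integration that never responds in time.
    time.sleep(1)
    return "response"

try:
    call_with_timeout(slow_third_party_call, timeout_s=0.2)
    outcome = "completed"
except concurrent.futures.TimeoutError:
    outcome = "timed out"
```

Most HTTP clients expose this directly as a per-request timeout parameter; the point is that the bound must be explicit, because the default in many libraries is to wait indefinitely.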
What actually fixed it
Jitsu addressed observability discipline through shared on-call accountability, logging treated as a first-class concern in code review, and hygiene checks before code ships. Evan is direct that none of this runs itself. It requires consistent reinforcement from engineering leadership.
Jitsu adopted Grepr in May 2025. Log volume dropped significantly and stayed flat even as operations expanded. December, their highest-volume month, produced an uptick, but nothing like previous years. Novel events moved through naturally, while high-volume informational events following known patterns stopped cluttering the view. Grepr reinforced the logging discipline the team had already been building, without requiring extensive custom rules to get there.
"If you can move engineers from thinking at the log or metric level to thinking about business flow health, you have changed what they are capable of managing." - Evan Robinson, CTO, Jitsu
Evan's closing advice
Stop optimizing primarily for uptime percentages. The metric that maps to customer impact is reaction time and remediation time when something goes wrong. Shift from individual service health to business-critical flow health. Know which flows matter most, instrument accordingly, and make sure SREs and developers share that understanding. That clarity is what separates a 30-minute incident from a four-hour one.
Watch the recording here.
FAQs
What is observability and why does it matter for engineering teams?
Observability is the practice of monitoring your software systems so you can understand what's happening inside them, especially when something goes wrong. Engineering teams rely on it to diagnose incidents quickly. The faster you can find the source of a problem, the less customer impact you take.
What's the difference between logging everything and logging what matters?
Most teams start by capturing as much data as possible. The problem is that high log volume buries the signals that actually matter during an incident. Engineers end up sifting through millions of routine events to find the one that explains the failure. Selective, disciplined logging keeps the relevant signal visible.
Why is incident response time more important than uptime percentage?
Uptime is a baseline expectation, not a differentiator. What actually determines customer impact is how quickly your team can identify and fix a problem once it starts. A team with great uptime but slow diagnosis still produces long outages. Reaction time and remediation time are the numbers worth improving.
What does observability cost management actually mean?
It's not just about reducing your monitoring bill. The bigger cost is the engineering time lost during incidents when noise outweighs signal. Effective cost management means keeping log volume proportional to actual diagnostic value, so your team isn't paying in hours to find what should take seconds.
How does a tool like Grepr help without requiring teams to rewrite their instrumentation?
Grepr sits in the log pipeline and distinguishes novel events from high-volume routine ones. Familiar patterns that carry no diagnostic value get compressed or staged rather than streamed in full. New or anomalous events pass through unfiltered. Teams get cleaner signal without having to manually audit and rewrite their logging across every service.
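The novel-versus-routine distinction can be illustrated with a toy pattern-based filter. This is not Grepr's actual algorithm, just a hypothetical sketch: mask the variable parts of each line to form a template, pass the first occurrence of each template through, and count the repeats instead of streaming them.

```python
import re
from collections import Counter

def template(line):
    # Mask variable tokens (decimal and hex numbers) so repeats share one pattern.
    return re.sub(r"\b\d+\b|\b0x[0-9a-f]+\b", "<*>", line)

seen = Counter()

def filter_line(line):
    """Pass a line through the first time its pattern appears; count later repeats."""
    key = template(line)
    seen[key] += 1
    return line if seen[key] == 1 else None

logs = [
    "connection pool exhausted after 120 waits",
    "request 1001 served in 12 ms",
    "request 1002 served in 9 ms",
    "request 1003 served in 15 ms",
]
kept = [line for line in logs if filter_line(line) is not None]
# The anomaly and one representative of the routine pattern survive;
# the other "request served" lines collapse into a repeat count.
```

A production system would add pattern aging, sampling, and a way to replay the suppressed raw lines on demand; the sketch only shows why familiar patterns can be compressed without hiding anything new.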