Announcing Grepr: Observability for the modern complex world

Jad Naous
January 22, 2025
“I love to configure Datadog to drop metrics and logs so we can stay within budget.” – no engineer ever

No engineer is ever excited about figuring out what metrics or logs are important and making decisions about what they might or might not use in the future. Engineers want all the data all the time, because, when there’s trouble, they dread not having the one log message or the one metric that would have told them what’s happening. I started Grepr to tackle this problem and eliminate the need for engineers to set up rules to ensure they have the right data in the right place at the right time. With AI, we can now do it with less effort and more effectively than with manual labor, without losing critical information that might be needed for troubleshooting later on.

Back in 2016, when I was at AppDynamics, microservices were new and our largest customers’ metrics fit into a single MySQL database (!). Today, the rise of microservices and the exponential growth of the scale of software deployments have caused a massive explosion in complexity. Companies dedicate entire teams to individual components, each with a full software and hardware stack, deployed on hundreds or thousands of containers. To manage this complexity and be able to understand what’s happening, teams turned to observability tools, deployed ubiquitously.

With this rise in complexity, observability costs have skyrocketed, often becoming the second-largest expense after cloud infrastructure. My own experience and dozens of conversations with engineering leaders confirm that observability costs now make up 10–15% of infrastructure expenses.

Despite this massive spending, many organizations still struggle: either they lack the data needed to resolve incidents or they’re overwhelmed with the sheer volume of data and the complexity of the applications, making it hard to pinpoint root causes. The issue is that current observability tools require engineers to piece together fragmented metrics, logs, and traces—like solving a puzzle with no picture to guide them. Engineers must also correlate inconsistent tags and labels to connect observations to their mental model of the system. This complexity often leads to confusion during troubleshooting, as teams fail to link observations or miss critical issues.

And we’re on the cusp of the next wave of complexity growth: AI. AI inference workloads demand finer-grained data, tracking not only the performance of each individual inference but also its impact on how users or systems behave in response, so we’ll see another exponential growth in complexity.

What was already a less-than-ideal situation is becoming untenable. The architecture and user experience of today’s observability tools will not be able to deliver on the improvement in reliability for which companies are paying all that money.

At Grepr, we want to elevate observability beyond individual metrics and logs. Instead, we focus on the behaviors of objects like hosts, containers, and processes, making these behaviors and relationships first-class concepts in troubleshooting. Our AI will leverage this rich representation to guide engineers quickly to root causes, thus reducing complexity and downtime. This ensures significant productivity gains, giving engineers clarity without requiring deep knowledge of every system component.

Most observability tools today use a static representation of applications and infrastructure. For example, Kubernetes pods run on Kubernetes nodes, Kubernetes nodes run on hosts, hosts belong to VPCs, etc. In today’s massively complex, dynamic world where users want to monitor all sorts of objects — like users, sessions, AI models, inferences, Spark jobs, Kafka consumers, and user conversions, in addition to traditional infrastructure elements like hosts and containers — these static representations are insufficient. Unlike other observability tools that layer AI on top of this static representation, we’re building Grepr’s foundation on the right representation first—a flexible, customizable representation, enabling users to reason about the health of these objects and the relationships between them.
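
To make the idea of a flexible representation concrete, here is a minimal Python sketch. It is not Grepr's actual data model; the `Entity` and `EntityGraph` classes, the relation names, and the example objects are all hypothetical. It only illustrates how arbitrary object kinds and relationships could be declared at runtime instead of being fixed to a pod, node, host hierarchy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Any monitored object. `kind` is free-form, not a fixed enum."""
    kind: str  # e.g. "k8s_pod", "user", "ai_model" (hypothetical kinds)
    name: str


@dataclass
class EntityGraph:
    """Hypothetical store of user-defined relationships and health states."""
    edges: dict = field(default_factory=dict)   # (source, relation) -> set of targets
    health: dict = field(default_factory=dict)  # entity -> True if healthy

    def relate(self, source: Entity, relation: str, target: Entity) -> None:
        self.edges.setdefault((source, relation), set()).add(target)

    def set_health(self, entity: Entity, healthy: bool) -> None:
        self.health[entity] = healthy

    def unhealthy_related(self, source: Entity, relation: str) -> set:
        """Targets of `relation` from `source` that are marked unhealthy.

        Entities with no recorded health are assumed healthy.
        """
        targets = self.edges.get((source, relation), set())
        return {t for t in targets if not self.health.get(t, True)}


# Illustrative usage with made-up objects.
graph = EntityGraph()
node = Entity("k8s_node", "node-1")
pod = Entity("k8s_pod", "checkout-7f9")
model = Entity("ai_model", "ranker-v2")

graph.relate(pod, "runs_on", node)  # a traditional infrastructure relationship
graph.relate(pod, "calls", model)   # a relationship a static hierarchy would not capture
graph.set_health(model, False)

suspects = graph.unhealthy_related(pod, "calls")
print(suspects)
```

Because relations are plain strings attached to arbitrary entities, adding a new kind of object (a Spark job, a Kafka consumer) requires no schema change in this sketch.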

That said, we know that switching tools is hard, so we designed Grepr to integrate seamlessly into existing environments—and to pay for itself. Our approach focuses on cutting observability costs while maintaining the data that engineers need. Here's how we do it:

  1. Real-Time Noise Reduction: Using machine learning to distinguish noise from signal, we cut noisy data by 95%, forwarding only relevant information to observability tools.

  2. Health-Based Aggregation: Grepr can aggregate the data for healthy objects (like hosts, containers, users, etc.), and send unhealthy objects’ data unaggregated, so users can focus on what’s important at the lowest cost.

  3. Low-Cost Data Storage: A highly optimized data lakehouse stores all raw data, making it easy to troubleshoot, analyze, or backfill into tools when needed.

  4. Incident-Based Reactivity: During incidents, Grepr stops aggregating relevant data and backfills critical details into existing tools, ensuring engineers have the information they need, when they need it.
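
The flow described in the list above can be sketched roughly in Python. This is an illustration under stated assumptions, not Grepr's implementation: the `route` function, the record fields, and the `unhealthy` set are hypothetical, the health of each object is simply given as input rather than inferred by machine learning, and a plain list stands in for the data lakehouse.

```python
from collections import Counter


def route(records, unhealthy, incident=False):
    """Return (forwarded, archive) for a batch of log records.

    - Every raw record goes to `archive` (stand-in for low-cost storage).
    - Records from unhealthy objects are forwarded as-is.
    - Records from healthy objects are collapsed into one summary per
      (object, message) pair, unless an incident is in progress.
    """
    archive = list(records)  # raw copy of everything
    forwarded = []
    counts = Counter()

    for rec in records:
        if incident or rec["object"] in unhealthy:
            forwarded.append(rec)  # unaggregated
        else:
            counts[(rec["object"], rec["message"])] += 1

    # One summary record per repeated healthy message.
    for (obj, msg), n in counts.items():
        forwarded.append({"object": obj, "message": msg, "count": n})

    return forwarded, archive


# Made-up batch: one healthy host repeating itself, one unhealthy host.
records = [
    {"object": "host-a", "message": "heartbeat ok"},
    {"object": "host-a", "message": "heartbeat ok"},
    {"object": "host-a", "message": "heartbeat ok"},
    {"object": "host-b", "message": "disk latency high"},
    {"object": "host-b", "message": "write failed"},
]

forwarded, archive = route(records, unhealthy={"host-b"})
during_incident, _ = route(records, unhealthy={"host-b"}, incident=True)
```

In this toy batch, the three identical messages from the healthy host collapse into a single summary record, while both messages from the unhealthy host pass through untouched. Setting `incident=True` disables aggregation entirely, and the archive always keeps every raw record so it could be backfilled later.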

This is just the start. Systems are growing more complex, especially as they incorporate AI workflows, and traditional observability architectures are becoming unsustainable. Grepr aims to transform how observability data is used, evolving along with modern systems to help engineers keep systems efficient and reliable.

As part of our launch, we’re excited to announce our $9M Seed round, led by Martin Casado from Andreessen Horowitz and Ed Sim from boldstart ventures, which will help us lead the next wave of reliability.

Getting started with Grepr takes 20 minutes from start to finish, and it can reduce your observability spending by 90%. Sign up for a free trial to find out for yourself!
