How Grepr reduces observability data volume by 90% or more, without changing a single engineering workflow.
When it comes to observability, the idea that more data is always better is now causing serious problems. In simpler times, organisations ran a handful of services on a few hosts, and collecting every last bit of observability data presented no problems. That situation has changed dramatically: the number of hosts has soared alongside the explosion in microservices and the Kubernetes infrastructure that supports them. The volume of observability data has grown exponentially, yet MTTR has remained flat. If all that extra data is not improving MTTR, why continue to bear the cost of ingesting all those trace spans and log events into your Datadog account?
How Much Data?
Datadog provides a generous allowance for APM span ingestion and indexing: each licensed host includes 150 GB of ingestion and 1 million indexed spans per month. The ingestion allowance covers sending spans to Datadog, where they are available in the live view for 15 minutes. Indexed spans are a filtered subset (for example, slow requests, errors, or spans from critical services) that remains available for 15 days. Span ingestion and indexing beyond these allowances incurs additional cost: extra ingestion is $0.10 per GB, and on-demand indexing is $2.55 per additional 1 million spans.
While this initially appears to provide plenty of capacity without the fear of overage charges, an enterprise microservices application can generate a large number of spans for each request served. Typically such an application generates 100 - 500 spans per request; a surprisingly good rule of thumb is:
Spans per request ≈ services × 4.5
Certain system architectures can increase this figure: object-relational mapping (ORM) frameworks can be very noisy, and some message queues create many spans for each message transaction.
A typical span is around 2 kB, and 100 spans per request is a healthy level of instrumentation. At a load of 50 requests per second, that produces roughly 26 TB of trace data per month, way beyond the baseline ingestion volume provided by Datadog. The total span count is around 13 billion, which dwarfs the 1 million indexed-span allowance even if only a small fraction are indexed. Overage charges are inevitable.
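The arithmetic above can be checked with a quick back-of-the-envelope script. The figures of 100 spans per request, 2 kB per span, and 50 requests per second come from the scenario above; the 30-day month is an assumption:

```python
# Back-of-the-envelope span volume estimate for the scenario above.
SPANS_PER_REQUEST = 100
SPAN_SIZE_BYTES = 2 * 1024               # ~2 kB per span
REQUESTS_PER_SECOND = 50
SECONDS_PER_MONTH = 60 * 60 * 24 * 30    # assuming a 30-day month

requests_per_month = REQUESTS_PER_SECOND * SECONDS_PER_MONTH
spans_per_month = requests_per_month * SPANS_PER_REQUEST
bytes_per_month = spans_per_month * SPAN_SIZE_BYTES

print(f"requests/month: {requests_per_month:,}")           # 129,600,000
print(f"spans/month:    {spans_per_month:,}")              # 12,960,000,000 (~13 billion)
print(f"trace data:     {bytes_per_month / 1e12:.1f} TB")  # ~26.5 TB
```

The monthly totals land in the terabytes, not petabytes, but still orders of magnitude beyond the per-host allowances.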
Datadog uses a similar pricing model for logs, charging $0.10 per GB for ingestion and $1.70 per 1 million indexed log events each month. One million sounds like a big number; we would all like an extra $1 million in our bank accounts. When it comes to log events, however, it is pitifully small. A good rule of thumb for log events per request is:
Log events per request ≈ services × 8
A typical medium-sized microservices application at 50 requests per second will produce 2 - 6 billion log events per month. For larger enterprise applications, that increases to 40 - 150 billion log events per month. With structured JSON logging, each log event is approximately 512 bytes in size, giving data volumes of roughly 1 - 3 TB and 20 - 77 TB respectively.
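Translating the rules of thumb into a monthly log bill is equally mechanical. This sketch assumes a hypothetical 25-service application and a 5% indexing rate (both illustrative assumptions, as is the 30-day month); the $0.10 per GB and $1.70 per million prices are Datadog's list prices quoted above:

```python
# Hypothetical example: 25 services, 5% of log events indexed.
SERVICES = 25
EVENTS_PER_REQUEST = SERVICES * 8        # rule of thumb from the text
EVENT_SIZE_BYTES = 512                   # structured JSON log event
REQUESTS_PER_SECOND = 50
SECONDS_PER_MONTH = 60 * 60 * 24 * 30    # assuming a 30-day month
INGEST_PRICE_PER_GB = 0.10
INDEX_PRICE_PER_MILLION = 1.70
INDEXED_FRACTION = 0.05                  # assumed share of events indexed

events_per_month = REQUESTS_PER_SECOND * SECONDS_PER_MONTH * EVENTS_PER_REQUEST
ingest_gb = events_per_month * EVENT_SIZE_BYTES / 1e9
ingest_cost = ingest_gb * INGEST_PRICE_PER_GB
index_cost = events_per_month * INDEXED_FRACTION / 1e6 * INDEX_PRICE_PER_MILLION

print(f"log events/month: {events_per_month:,}")   # 25,920,000,000
print(f"ingestion: {ingest_gb:,.0f} GB -> ${ingest_cost:,.0f}/month")
print(f"indexing: ${index_cost:,.0f}/month")
```

Even with these modest assumptions, log ingestion and indexing each add thousands of dollars per month on top of the base licence.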
The copious telemetry produced by modern microservices applications results in significant cost to ingest and index it on the Datadog observability platform.
Signal To Noise Ratio
Only a small percentage of observability data is useful all the time; most of it is only useful at certain points in time. What if it were possible to retain all observability data in low-cost queryable storage and send only the pertinent data through to Datadog? That would significantly increase the signal-to-noise ratio while simultaneously reducing costs. Sounds like a pipe dream? Not at all: this is exactly what Grepr does.
Grepr fits in as a shim between the Datadog agents and the Datadog backend. With a small configuration change, the agents stream their observability data to Grepr for preprocessing before it is passed through to Datadog. Nothing changes for engineers: they continue to use the workflows they are familiar with, only now on refined data. Grepr interrogates Datadog to find metrics and alerts driven by log data, then automatically creates exception rules so those log entries bypass Grepr's processing, preserving data integrity in Datadog.
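To give a sense of how small that configuration change is, the Datadog agent's `datadog.yaml` already supports endpoint overrides that redirect logs and traces through a proxy. The `grepr.example.com` hostnames below are placeholders, not real Grepr endpoints:

```yaml
# datadog.yaml: endpoint overrides (hostnames below are placeholders)
logs_config:
  logs_dd_url: "intake.grepr.example.com:443"    # logs stream via Grepr first
apm_config:
  apm_dd_url: "https://intake.grepr.example.com" # trace spans via Grepr first
```

Everything else about the agent deployment stays untouched.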
Grepr uses machine learning to continuously identify frequently occurring log message patterns, which are summarised before being passed on. Unique error messages pass straight through, ensuring existing health rules still trigger. Data volumes are reduced by 90% or more. All data sent to Grepr is retained in low-cost storage, where it can be queried and/or backfilled to Datadog to aid incident investigation. Typically a backfill is triggered automatically by a Datadog alert notification that calls Grepr with context; a targeted backfill job then provides the full data for the specified time period, service, host, and so on.
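A deliberately simplified sketch conveys the pattern-summarisation idea. Grepr's actual machine-learning engine is far more sophisticated; the single regex masking rule and the frequency threshold below are illustrative assumptions only:

```python
import re
from collections import Counter

def template(msg: str) -> str:
    """Mask variable parts of a log message to recover its pattern."""
    return re.sub(r"\d+", "<num>", msg)  # numbers -> placeholder

def summarize(messages, threshold=3):
    """Forward rare messages verbatim; emit frequent patterns as one summary line."""
    counts = Counter(template(m) for m in messages)
    out, emitted = [], set()
    for m in messages:
        t = template(m)
        if counts[t] < threshold:
            out.append(m)                      # unique/rare: forward verbatim
        elif t not in emitted:
            out.append(f"[{counts[t]}x] {t}")  # frequent: one summary line
            emitted.add(t)
    return out

logs = [
    "GET /orders/123 200 in 12ms",
    "GET /orders/456 200 in 9ms",
    "GET /orders/789 200 in 15ms",
    "OutOfMemoryError in payment worker",
]
for line in summarize(logs):
    print(line)
```

The three routine access logs collapse into a single `[3x]` summary line, while the one-off error passes through untouched, so an alert keyed on it still fires.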
For trace span data, Grepr buffers the spans to build a complete end-to-end trace. Each trace is then fingerprinted by examining its pattern of flow across services, and each unique trace fingerprint has its baseline performance calculated. Grepr can then apply sophisticated sampling rules, for example: 100% of errors, 50% of very slow traces, 25% of slow traces, 1% by default. Because spans are buffered into complete traces before analysis, when a sampling rule matches, the entire trace is sent to Datadog. When engineers inspect problem traces in Datadog, every one is a complete end-to-end trace with zero missing spans: full fidelity. Sending through only the problematic traces reduces data volume by 90% or more. For customers on volume-only APM plans with Datadog, this corresponds to a significant cost reduction; for APM host-based plans, we are currently testing an alternative solution.
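The tiered sampling described above can be sketched as follows. The percentage tiers come from the example in the text; the fingerprinting scheme and the 2x/4x-of-baseline latency thresholds for "slow" and "very slow" are assumptions for illustration:

```python
import hashlib
import random

def fingerprint(trace):
    """Fingerprint a complete trace by its pattern of flow across services."""
    path = "->".join(f"{s['service']}:{s['operation']}" for s in trace["spans"])
    return hashlib.sha256(path.encode()).hexdigest()[:12]

def sample_rate(trace, baseline_ms):
    """Tiered sampling: 100% errors, 50% very slow, 25% slow, 1% default."""
    if any(s.get("error") for s in trace["spans"]):
        return 1.00
    if trace["duration_ms"] > 4 * baseline_ms:   # "very slow" threshold (assumed)
        return 0.50
    if trace["duration_ms"] > 2 * baseline_ms:   # "slow" threshold (assumed)
        return 0.25
    return 0.01

def forward(trace, baseline_ms, rng=random.random):
    """Decide whether to send the *entire* buffered trace to Datadog."""
    return rng() < sample_rate(trace, baseline_ms)

trace = {
    "duration_ms": 900,
    "spans": [
        {"service": "api", "operation": "GET /checkout", "error": False},
        {"service": "payments", "operation": "charge", "error": True},
    ],
}
print(fingerprint(trace))
print(sample_rate(trace, baseline_ms=200))  # 1.0: traces with errors always kept
```

Because the decision is made per trace rather than per span, a matched trace is forwarded whole, which is what guarantees zero missing spans in Datadog.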
Win Win Win
Grepr's machine learning is fully automatic, continuously self-tuning the live set of active log patterns and trace fingerprints (approximately 200k patterns for large systems) and freeing engineers from the laborious task of manually maintaining log and span filtering rules.
That live set of active patterns and fingerprints reduces observability data volume by 90% or more, with corresponding savings in Datadog ingestion and indexing charges. Grepr provides multiple returns on investment: big savings on the Datadog bill, more engineering time for site reliability, and improved MTTR from higher-fidelity data.
Ready To See The Difference?
For teams that want Datadog cost control without overhauling existing workflows, Grepr delivers results from day one. There's no complex installation, no manual pipeline maintenance, and no second set of dashboards competing for engineers' attention.
The existing agents, dashboards, and alerting rules stay exactly where they are. Grepr works alongside them, automatically optimizing data volume while retaining everything in low-cost storage for when it is needed.
Schedule a demo to see how Grepr's Intelligent Observability Data Engine can reduce your Datadog costs by 90% or more, with zero disruption to your current workflows.
FAQ
1. Why is Datadog spending getting out of control for microservices teams?
Modern microservices applications generate massive telemetry volumes. A medium-sized app running at 50 requests per second can produce 2 to 6 billion log events per month, far exceeding Datadog's baseline allowances and triggering significant overage charges for ingestion and indexing.
2. What does Grepr actually do with our observability data?
Grepr sits between your Datadog agents and the Datadog backend. It preprocesses your telemetry data using machine learning, filters out repetitive noise, and passes only high-signal data through to Datadog, while retaining everything in low-cost queryable storage.
3. Will Grepr require changes to how our engineers work?
No. Your existing Datadog agents, dashboards, and alerting rules stay in place. Engineers continue using the same workflows they know. Grepr operates as a transparent layer that optimizes what gets sent to Datadog without introducing a second set of dashboards or manual pipeline rules.
4. How does Grepr handle trace data, and what happens to spans that aren't forwarded?
Grepr buffers spans to build complete end-to-end traces, then applies sampling rules based on performance and error status. When a trace qualifies, the full trace is sent to Datadog with zero missing spans. Spans that aren't forwarded are retained in low-cost storage and can be backfilled to Datadog automatically when an alert fires.
5. What kind of cost reduction can teams realistically expect?
Most teams see observability data volume reduced by 90% or more, which translates directly to lower Datadog ingestion and indexing charges. For teams on volume-only APM plans, the savings are especially significant, since every span forwarded to Datadog carries a direct cost.