How Grepr Handles Trace Logs

Jad Naous
May 14, 2025

Introduction

Grepr reduces companies’ observability costs without disrupting engineers’ existing workflows. One way Grepr achieves this cost reduction is by minimizing the number of logs sent to vendors using summarization and intelligent sampling (https://www.grepr.ai/blog/reducing-logging-costs-part-1). However, when working with Application Performance Monitoring (APM) tools that provide full execution traces for requests, engineers expect comprehensive logs associated with those traces. Additionally, users may need trace-like data outside the scope of APM. For example, they might want to see all logs related to a request hitting each endpoint or all logs associated with a Spark job or SQL session. So, how does Grepr balance the need for log reduction with the requirement of providing complete logs for traces?

Users told us that they need full logs for at least a subset of traces directly in the vendor tool, without having to backfill logs or search the Grepr Data Lake. The problem is that sampling individual log messages at random leaves most traces with incomplete log sets. Therefore, we need to ensure that a portion of traces have their full set of associated logs passed through to the vendor tool without aggregation.

For the rest of this blog, “trace ID” refers not only to IDs inserted into log messages by APM tools but also to any identifier like a Spark job ID, user ID, or request ID that users consider a “trace” for which they require complete logs.

Feature Overview

To enable users to have a full set of logs for a sample of traces, Grepr needs to:

  1. Find the trace ID in the message
  2. Sample unique trace IDs using a sampling ratio specified by the user. 100% means all logs with trace IDs are kept, while 10% means that roughly 1 out of 10 traces will have full logs.
  3. For the selected trace IDs, forward all logs with those trace IDs to the vendor.

To find trace IDs in the message, users can specify the path to a tag or attribute that contains the IDs, and Grepr checks that path for a value in every incoming message. If the trace ID is not already present as a field in a structured log message, users can write Grok rules to extract it. Users can also specify a query predicate to filter logs considered for sampling, enabling them, for instance, to only sample traces from specific services or applications.
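As a rough illustration of the lookup step, the sketch below walks a dotted path through a structured message and applies an optional predicate. The function names, the dotted-path syntax, and the message layout are assumptions made for this example; they are not Grepr's actual configuration format or API.

```python
from typing import Any, Callable, Optional


def get_trace_id(message: dict, path: str) -> Optional[str]:
    """Follow a dotted path (hypothetical syntax) through nested dicts.

    Returns the value as a string, or None if any step of the path is missing.
    """
    node: Any = message
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return None if node is None else str(node)


def is_candidate(message: dict, path: str,
                 predicate: Callable[[dict], bool]) -> bool:
    """A message is considered for trace sampling only if it matches the
    user's predicate and actually carries a trace ID at the given path."""
    return predicate(message) and get_trace_id(message, path) is not None


# Example: only sample traces from a hypothetical "checkout" service.
msg = {"service": "checkout", "attributes": {"trace_id": "abc123"}}
only_checkout = lambda m: m.get("service") == "checkout"
print(get_trace_id(msg, "attributes.trace_id"))
print(is_candidate(msg, "attributes.trace_id", only_checkout))
```

Messages that fail either check would simply continue through the normal reduction path.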

Sampling Trace IDs

How do we select trace IDs and make sure that all messages with a selected trace ID are passed through without aggregation? To do this, we need to identify the first message for each trace ID, since that is when the sampling decision must be made. Consider the following:

  1. Grepr is a distributed system: Messages with different trace IDs may be processed by different nodes.
  2. We need to know if a trace ID is new at scale: Upon encountering a new trace ID, we can determine that the message is the first in that trace, enabling us to apply sampling. However, tracking previously seen trace IDs without excessive memory consumption in a streaming system is not easy.

Problem 1: Distributed processing of trace IDs

How do we ensure that all messages belonging to the same trace are processed by the same node in a distributed system? Apache Flink, our underlying stream processing engine, solves this efficiently. We filter for messages that match our criteria and contain a trace ID, then partition them by trace ID. Consequently, all messages belonging to the same trace are processed by the same node.
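The idea behind partitioning by key can be shown with a few lines of Python. This is only a conceptual sketch: Flink performs its own key hashing and routing internally, and the `node_for` function and the choice of SHA-256 here are assumptions for illustration.

```python
import hashlib


def node_for(trace_id: str, num_nodes: int) -> int:
    """Map a trace ID to a node index with a hash that is stable across
    processes (Python's built-in hash() is randomized per process for strings)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Every message carrying the same trace ID lands on the same node.
for tid in ["trace-a", "trace-b", "trace-a"]:
    print(tid, "->", node_for(tid, num_nodes=4))
```

Because the mapping depends only on the trace ID, each node can keep its sampling state locally without coordinating with the others.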

Problem 2: Tracking trace IDs

How do we prevent memory exhaustion while tracking previously seen and active trace IDs for sampling decisions? Extended trace durations (e.g., for asynchronous jobs) require maintaining a large number of trace IDs for longer periods. When there could be thousands of trace IDs per second per individual node, this could mean millions of traces in memory along with their associated state.

Using Hash Sets

One approach uses hash sets of trace IDs: a “seen” set and an “active” set. Upon encountering a trace ID, we check whether it is in the seen set. If it is absent, the trace is new: we add it to the seen set and probabilistically decide whether to sample it. If the trace is sampled, its ID is also added to the active set, ensuring that subsequent messages with that ID are passed through. This lets us track each trace ID and its state (active or seen).

Each seen or active trace ID would have an associated timer for expiration. On expiration, the ID is removed from each set.
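The bookkeeping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the production code: the class name `SetSampler` is hypothetical, per-ID timers are replaced by stored expiry timestamps that are swept on each call, and the expiry is fixed from the first time an ID is seen.

```python
import random


class SetSampler:
    """Track 'seen' and 'active' trace IDs, each with an expiry timestamp."""

    def __init__(self, ratio: float, ttl_seconds: float, rng=None):
        self.ratio = ratio                  # fraction of traces to keep in full
        self.ttl = ttl_seconds              # how long an ID is remembered
        self.rng = rng or random.Random()
        self.seen = {}                      # trace_id -> expiry timestamp
        self.active = {}                    # sampled trace_id -> expiry timestamp

    def should_forward(self, trace_id: str, now: float) -> bool:
        self._expire(now)
        if trace_id in self.active:         # trace was sampled earlier
            return True
        if trace_id in self.seen:           # trace was seen but not sampled
            return False
        # First message of a new trace: record it and make the sampling decision.
        self.seen[trace_id] = now + self.ttl
        if self.rng.random() < self.ratio:
            self.active[trace_id] = now + self.ttl
            return True
        return False

    def _expire(self, now: float) -> None:
        # Stand-in for per-ID timers; a linear sweep is fine for a sketch only.
        for table in (self.seen, self.active):
            for tid in [t for t, exp in table.items() if exp <= now]:
                del table[tid]


sampler = SetSampler(ratio=0.1, ttl_seconds=300)
print(sampler.should_forward("trace-a", now=0.0))
```

Note that every trace ID costs at least one dictionary entry and one expiry record for its whole lifetime, whether or not it is sampled.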

However, this approach is resource-intensive, involving the overhead of hash sets and timers per trace ID. Is there a more efficient method?

Using Bloom Filters

While a set lets us quickly determine whether a trace ID has been seen, it consumes significant memory. Employing a Bloom filter drastically reduces memory usage at the cost of a small (configurable) probability of false positives. A Bloom filter can tell us definitively that an element is not in the set, but it may falsely report an element as present.

The memory savings are significant: a set of 100,000 UUIDs (16 bytes each) uses around 2.8MB, whereas a Bloom filter for 100,000 elements at a 1% false positive rate uses only about 120KB (a reduction of roughly 95%).
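The Bloom filter figure can be reproduced from the standard sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The helper names below are made up for this example; the hash set figure depends on the runtime and is not derived here.

```python
import math


def bloom_bits(n: int, p: float) -> int:
    """Bits needed for n elements at false positive rate p (optimal sizing)."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_hashes(m_bits: int, n: int) -> int:
    """Number of hash functions that minimizes false positives for m bits."""
    return max(1, round(m_bits / n * math.log(2)))


bits = bloom_bits(100_000, 0.01)
print(f"{bits / 8 / 1000:.1f} KB, {optimal_hashes(bits, 100_000)} hash functions")
```

For 100,000 elements at 1%, this works out to a little under 10 bits per element, regardless of how large each trace ID is.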

However, two challenges arise with Bloom filters: 1) they do not support removals, so expired trace IDs cannot be deleted individually, and 2) keeping a timer for each trace ID would still carry significant overhead.

To address these challenges, we use multiple Bloom filters arranged in a circular buffer. The filter at the head of the buffer is the “active” filter into which we add new trace IDs. Existence checks consult every filter in the buffer. At the end of each time window, we rotate the buffer: the oldest filter is cleared and becomes the new “active” filter. This creates a “forgetful” Bloom filter that resolves both challenges at once: old trace IDs are removed by clearing an entire filter, and a single timer per window replaces the per-ID timers.
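A minimal sketch of such a rotating structure is shown below. It is an illustration under assumptions, not Grepr's implementation: the class names are hypothetical, the bit positions come from double hashing over a SHA-256 digest, and the caller is expected to invoke `rotate()` once per time window.

```python
import hashlib
from collections import deque


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class ForgetfulBloomFilter:
    """Ring of Bloom filters; index 0 is the active (newest) filter."""

    def __init__(self, num_filters: int, num_bits: int, num_hashes: int):
        self._new_filter = lambda: BloomFilter(num_bits, num_hashes)
        self.filters = deque(self._new_filter() for _ in range(num_filters))

    def add(self, item: str) -> None:
        self.filters[0].add(item)

    def __contains__(self, item: str) -> bool:
        return any(item in f for f in self.filters)

    def rotate(self) -> None:
        # Drop the oldest filter and start a fresh active one at the head.
        self.filters.pop()
        self.filters.appendleft(self._new_filter())


fbf = ForgetfulBloomFilter(num_filters=3, num_bits=958_506, num_hashes=7)
fbf.add("trace-a")
print("trace-a" in fbf)
```

With this layout, an ID is remembered for between (number of filters - 1) and (number of filters) windows, so the window length and filter count together set the effective retention period.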

Conclusion

Possible future enhancements include dynamically adjusting filter size based on incoming message rates or lowering the false positive rate in later filters for greater efficiency. This capability is already live in Grepr’s production release. Interested in trying it out? Sign up for a free trial at grepr.ai!
