Introduction
Grepr reduces companies’ observability costs without impacting engineers’ existing workflows. You can learn more about Grepr’s mission here. One aspect of Grepr’s strategy for cost reduction is minimizing unnecessary log emissions through techniques such as intelligent sampling and aggregation with automatic backfilling.
Application Performance Monitoring (APM) tools let users, during troubleshooting, see traces with complete data that tell them exactly what happened during execution. Sometimes, users want to see trace-like data that is not APM-related. For example, they might want to see all the logs associated with a sample of requests hitting each endpoint, or all the logs associated with a sample of Spark jobs. In these cases, a request ID, thread ID, session ID, query ID, or job ID can act like a trace ID. For the rest of this post, "trace ID" can refer to any of these identifiers.
If logs are sampled randomly, traces will likely not have a full set of associated logs. We heard from users that it is important to be able to see full logs for at least a subset of traces. This blog post is about how we enable users to specify a fraction of end-to-end traces that will have a full set of logs in their observability tool, despite the aggregation and sampling that Grepr does.
Feature Overview
To enable users to have a full set of logs for a sample of traces, Grepr needs to:
- Find the trace ID in the message
- Sample unique trace IDs using a sampling ratio specified by the user. 100% means all logs with trace IDs are kept, while 10% means that 1 out of 10 traces will have full logs.
- For the selected trace IDs, backfill all the associated logs to the user’s observability vendor so they’re available for querying.
To find trace IDs in the message, users can specify the path to a tag or attribute that contains the IDs, and Grepr checks that path for a value in every incoming message. Users can also specify a query predicate to select which logs to consider for sampling, giving them, for example, the option to only sample traces for logs coming from a specific job or container.
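To make this concrete, here is a minimal sketch of path-based trace ID extraction with an optional filtering predicate. It assumes messages are nested maps; the class, path, and field names are illustrative, not Grepr's actual API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

public final class TraceIdExtractor {
    // Dot-separated path to the trace ID, e.g. "attributes.request_id" (hypothetical).
    private final String[] pathSegments;
    // Optional predicate deciding whether a log is considered for sampling at all.
    private final Predicate<Map<String, Object>> filter;

    public TraceIdExtractor(String path, Predicate<Map<String, Object>> filter) {
        this.pathSegments = path.split("\\.");
        this.filter = filter;
    }

    /** Walks the nested message map along the configured path and returns the trace ID, if any. */
    public Optional<String> extract(Map<String, Object> message) {
        if (!filter.test(message)) {
            return Optional.empty();
        }
        Object current = message;
        for (String segment : pathSegments) {
            if (!(current instanceof Map)) {
                return Optional.empty();
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current == null ? Optional.empty() : Optional.of(current.toString());
    }
}
```

For example, `new TraceIdExtractor("attributes.request_id", msg -> "checkout".equals(msg.get("service")))` would only consider checkout-service logs for sampling (again, hypothetical field names).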
To backfill the collected trace IDs, Grepr executes a query to search for unprocessed log messages containing any of the collected IDs at the specified path. It then sends all of those log messages to the user’s observability vendor.
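Conceptually, that backfill selection is a filter over the stored raw messages: keep every message whose value at the configured path is one of the collected IDs. A rough sketch under the same assumptions as above:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

final class BackfillSelector {
    // Illustrative only: keeps the raw messages whose value at the trace ID path
    // matches one of the sampled IDs, so they can be forwarded to the vendor.
    static List<Map<String, Object>> selectForBackfill(
            List<Map<String, Object>> rawMessages,
            Function<Map<String, Object>, Optional<String>> idAtPath,
            Set<String> sampledTraceIds) {
        return rawMessages.stream()
            .filter(msg -> idAtPath.apply(msg).map(sampledTraceIds::contains).orElse(false))
            .collect(Collectors.toList());
    }
}
```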
To randomly sample trace IDs, Grepr tracks incoming traces in sub-groups of 10. At the start of each sub-group, Grepr generates a set of distinct random positions between 1 and 10, with the set's size equal to the sampling ratio multiplied by 10, and those positions determine which of the next 10 traces to sample. For example, with a sampling ratio of 0.2 (20%), Grepr might generate [2, 7] and sample the 2nd and 7th traces in the group. After 10 traces have passed, Grepr generates another 2 positions for the next sub-group, and so on.
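This amounts to a simple form of systematic random sampling over groups of 10 traces. A minimal sketch of the idea (class and method names are ours, not Grepr's):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public final class GroupedTraceSampler {
    private static final int GROUP_SIZE = 10;

    private final double samplingRatio;   // e.g. 0.2 means 2 of every 10 traces
    private final Random random = new Random();

    private Set<Integer> chosenPositions = new HashSet<>();
    private int positionInGroup = 0;

    public GroupedTraceSampler(double samplingRatio) {
        this.samplingRatio = samplingRatio;
    }

    /** Returns true if the next trace should be sampled. Call once per incoming trace. */
    public boolean sampleNextTrace() {
        if (positionInGroup == 0) {
            // New sub-group: pick which of the next 10 traces to keep.
            chosenPositions = pickPositions((int) Math.round(samplingRatio * GROUP_SIZE));
        }
        boolean sampled = chosenPositions.contains(positionInGroup);
        positionInGroup = (positionInGroup + 1) % GROUP_SIZE;
        return sampled;
    }

    /** Picks `count` distinct positions in [0, GROUP_SIZE) for the upcoming sub-group. */
    private Set<Integer> pickPositions(int count) {
        Set<Integer> positions = new HashSet<>();
        while (positions.size() < count) {
            positions.add(random.nextInt(GROUP_SIZE));
        }
        return positions;
    }
}
```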
Together, these three capabilities give users confidence that Grepr correctly identifies the logs containing their traces, samples them randomly at the specified rate, and fully backfills the logs containing the sampled trace IDs going back the specified duration.
High Level Approach: The Grepr Rule Engine
This feature integrates seamlessly with Grepr's existing rule engine: a custom Flink operator that applies rules to incoming messages. Matching rules then trigger actions, such as stopping aggregation for a container or backfilling logs for a host. The objects that fire these actions are called Triggers.
A new trigger collects samples of traces as messages are processed. The trigger fires once a set number of traces has been collected (limiting the size of an individual backfill) or a time period has expired (guaranteeing timely backfills).
First Implementation: Single Operator + HashSet

A naïve implementation of this feature would have the following logic (a minimal sketch follows the list):
- If the message lacks a trace ID, do nothing.
- Otherwise, sample the trace according to a sampling ratio set by the user, and add it into a Set for that window. This deduplicates sampled IDs.
- When this set reaches capacity or the time window ends, trigger a backfill for all the trace IDs in the set, clear the set and reset the window.
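Here is that bare-bones logic, ignoring Flink wiring and checkpointed state. The backfill callback is a stand-in for the real trigger, and it reuses the hypothetical sampler sketched earlier:

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.function.Consumer;

public final class NaiveTraceCollector {
    private final int capacity;                       // max trace IDs per window
    private final GroupedTraceSampler sampler;        // sampling logic from the earlier sketch
    private final Consumer<Set<String>> backfill;     // stand-in for the backfill trigger
    private final Set<String> sampledTraceIds = new HashSet<>();

    public NaiveTraceCollector(int capacity, GroupedTraceSampler sampler,
                               Consumer<Set<String>> backfill) {
        this.capacity = capacity;
        this.sampler = sampler;
        this.backfill = backfill;
    }

    /** Called for every incoming message's trace ID (empty if the message has none). */
    public void onMessage(Optional<String> traceId) {
        if (traceId.isEmpty()) {
            return;                                   // no trace ID: do nothing
        }
        String id = traceId.get();
        if (sampledTraceIds.contains(id)) {
            return;                                   // the Set deduplicates sampled IDs
        }
        if (sampler.sampleNextTrace()) {
            sampledTraceIds.add(id);
        }
        if (sampledTraceIds.size() >= capacity) {
            flush();                                  // capacity reached: trigger a backfill
        }
    }

    /** Also called when the time window ends. */
    public void flush() {
        if (!sampledTraceIds.isEmpty()) {
            backfill.accept(new HashSet<>(sampledTraceIds));
            sampledTraceIds.clear();
        }
    }
}
```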

This naïve implementation has some shortcomings, specifically with regard to Flink operator parallelism. Splitting the input into partitions across parallel operators means that messages for a single trace can be divided across multiple message-processing nodes.
Since each parallel operator has its own set of collected traces, operators will collect duplicate trace samples and each trigger its own backfills. This results in redundant backfills and extra memory overhead from holding duplicate trace IDs.
Improvement #1: Global Trace Collection

To address the parallelism issue, we introduced a global trace collector. The parallel collectors from the previous section are now referred to as "local" collectors.
The local collectors now send batches of traces to this global collector, which deduplicates them into a single unified set, before triggering a backfill. Like the local collectors, the global collector has both a maximum collection size and a time-based flush.
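Stripped of the Flink plumbing, the global collector boils down to merging batches from the local collectors into one deduplicated set and flushing on size or time. A rough sketch with hypothetical names:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Runs at parallelism 1: receives batches of trace IDs from the "local" collectors.
public final class GlobalTraceCollector {
    private final int maxCollectionSize;
    private final Consumer<Set<String>> backfill;     // stand-in for the backfill trigger
    private final Set<String> traceIds = new HashSet<>();

    public GlobalTraceCollector(int maxCollectionSize, Consumer<Set<String>> backfill) {
        this.maxCollectionSize = maxCollectionSize;
        this.backfill = backfill;
    }

    /** Merges a batch from a local collector; duplicates across operators collapse here. */
    public void onBatch(Set<String> batch) {
        traceIds.addAll(batch);
        if (traceIds.size() >= maxCollectionSize) {
            flush();
        }
    }

    /** Also invoked by a periodic timer for the time-based flush. */
    public void flush() {
        if (!traceIds.isEmpty()) {
            backfill.accept(new HashSet<>(traceIds));
            traceIds.clear();
        }
    }
}
```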
While this design solved the issue of redundant backfills, using a hash set to do exact deduplication of trace IDs is inefficient at scale.
Improvement #2: Bloom Filters
A set lets us quickly determine whether a trace ID has been seen, but takes up a lot of memory. Moving to a Bloom filter drastically reduces memory usage at the cost of a small (configurable) probability of false positives. A Bloom filter can tell with certainty that an element is not in the existing data set, but there's a small chance it reports that an element is in the set when it actually isn't.
So a Bloom filter can tell us with 100% certainty that a trace ID is new (i.e., the element is NOT in the existing set), with a small chance that it incorrectly says it has seen the element before, "dropping" the trace unnecessarily. This is a good trade-off: nothing is hurt if we skip a particular trace ID and pick a different one instead.
The most straightforward path to implementation was to use a standard library such as Google's Guava, which lets you specify the expected number of elements per filter and the approximate false positive rate.
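For reference, constructing such a filter with Guava looks roughly like this (the capacity and false positive rate below are example values, not Grepr's production settings):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterExample {
    public static void main(String[] args) {
        // Expecting ~100,000 trace IDs per filter with a ~1% false positive rate (example values).
        BloomFilter<CharSequence> seenTraces = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000, 0.01);

        seenTraces.put("7f9c2ba4-e88f-4d9a-9f3b-1c2d3e4f5a6b");        // record a sampled trace ID
        System.out.println(seenTraces.mightContain("some-other-id"));  // false: definitely not seen
    }
}
```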
The memory savings are significant: a set of 100,000 UUIDs (16 bytes each) uses around 2.8 MB, whereas a Bloom filter sized for 100,000 elements at a 1% false positive rate uses only about 120 KB (a ~95% reduction).
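That figure lines up with the standard Bloom filter sizing formula, where $n$ is the expected number of elements and $p$ the target false positive rate:

$$
m = -\frac{n \ln p}{(\ln 2)^2} \approx \frac{100{,}000 \times 4.61}{0.48} \approx 9.6 \times 10^{5}\ \text{bits} \approx 120\ \text{KB}
$$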
However, one challenge remained: once a time window closes, the Bloom filters reset, immediately forgetting which traces were sampled in previous windows. This can cause traces spanning multiple windows to be re-sampled unnecessarily.
One solution would be to increase window duration and capacity, but this inflates memory usage, which is wasteful when log volume is low and the Bloom filters stay largely unfilled. Another would be to build a custom data structure tailored to retaining partial history without unbounded growth.

Improvement #3: Resizing Bloom Filter
To handle dynamic load while still remembering traces across windows, we built a custom resizing Bloom filter. It starts with a single Bloom filter and adds additional ones as each fills up, up to a set memory limit. A periodic refresh timer drops the oldest Bloom filter, providing a regular "forgetting" mechanism and ensuring only a limited amount of memory is used.
Insertion: New elements are added to the last Bloom filter in the chain, just like any standard Bloom filter. We measured ~5700 insert operations per millisecond in benchmarks, comfortably meeting our throughput needs.

Existence Checks: To see if an element exists, we check each Bloom filter in sequence, returning immediately on a match. The costliest scenario, searching all filters without finding a match (or matching only in the last filter), performed at ~2600 operations per millisecond (~156 million ops/minute) in benchmarks. Since existence checks only apply to messages with trace identifiers (which is a subset of the total log inflows), the actual throughput impact on the system is acceptable.
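Putting the chain, insertion, existence checks, and periodic refresh together, a minimal sketch of such a resizing Bloom filter (built on Guava, with names of our own choosing rather than Grepr's internals) might look like this:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: a chain of Bloom filters that grows under load and forgets the oldest slice on refresh. */
public final class ResizingBloomFilter {
    private final Deque<BloomFilter<CharSequence>> chain = new ArrayDeque<>();
    private final int perFilterCapacity;
    private final double falsePositiveRate;
    private final int maxFilters;            // caps total memory use
    private int insertsInNewest = 0;

    public ResizingBloomFilter(int perFilterCapacity, double falsePositiveRate, int maxFilters) {
        this.perFilterCapacity = perFilterCapacity;
        this.falsePositiveRate = falsePositiveRate;
        this.maxFilters = maxFilters;
        chain.addLast(newFilter());
    }

    private BloomFilter<CharSequence> newFilter() {
        return BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), perFilterCapacity, falsePositiveRate);
    }

    /** Insertion: add to the newest filter, opening a new one when it fills (until the cap). */
    public void add(String traceId) {
        if (insertsInNewest >= perFilterCapacity && chain.size() < maxFilters) {
            chain.addLast(newFilter());
            insertsInNewest = 0;
        }
        chain.getLast().put(traceId);
        insertsInNewest++;
    }

    /** Existence check: scan the filters in order, returning on the first match. */
    public boolean mightContain(String traceId) {
        for (BloomFilter<CharSequence> filter : chain) {
            if (filter.mightContain(traceId)) {
                return true;
            }
        }
        return false;
    }

    /** Called by the periodic refresh timer: drop the oldest filter so old traces are forgotten. */
    public void refresh() {
        if (chain.size() > 1) {
            chain.removeFirst();
        } else {
            chain.clear();
            chain.addLast(newFilter());
            insertsInNewest = 0;
        }
    }
}
```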
Conclusion
Possible future enhancements include dynamically adjusting filter size based on incoming message rates or lowering the false positive rate in later filters for greater efficiency. However, the current implementation is already live in Grepr’s production release.
Interested in trying it out? Sign up for a free trial at grepr.ai!