Three Advanced Techniques to Reduce Logging Costs - Part II

Jad Naous
February 11, 2025

In the first blog post in the series, I went through four basic ways to reduce logging volumes: increasing the severity threshold, converting logs to metrics, uniform sampling, and drop rules. These techniques work well for smaller, simpler environments, but they leave gaps in the data that might matter when troubleshooting, and some of them require significant effort to scale to the enterprise. In this blog post, I'll go through three advanced techniques to reduce log volumes: automatic sampling by pattern, logarithmic sampling by pattern, and sampling with automatic backfilling.

Technique 1: Automatic sampling by pattern

If we can automatically identify, in real time, the patterns in log messages and track how many messages we're seeing for each pattern, we can then automatically decide how much data to send for each pattern. Here's how it would work (a rough code sketch follows the steps below):

  1. As messages come in, build a database of log patterns and pass the messages through.
  2. Keep track of the incoming rate for each pattern.
  3. When a particular pattern crosses some threshold (meaning we've already sent some number of messages for that pattern), start randomly sampling messages for some period of time (say 2 minutes). This sampling passes through a fraction (which could be zero) of additional messages for that pattern.
  4. At the end of that time period, send a summary message for each aggregated pattern with a count of the messages that were skipped for that pattern.
  5. Repeat.
Two samples from each of the two log patterns (GET and POST) are passed through unaggregated, and the rest are aggregated into summary messages.
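To make the steps concrete, here's a minimal sketch in Python. It's illustrative only: the threshold, sample fraction, and window length are arbitrary placeholders, it uses a single global window rather than a per-pattern one, and it assumes pattern extraction and the downstream emit function are provided by the surrounding pipeline.

```python
import random
import time
from collections import defaultdict

THRESHOLD = 100         # messages per pattern before sampling kicks in (placeholder)
SAMPLE_FRACTION = 0.01  # fraction passed through once the threshold is crossed
WINDOW_SECONDS = 120    # length of each sampling window

window_start = time.monotonic()
counts = defaultdict(int)   # messages seen per pattern in the current window
skipped = defaultdict(int)  # messages aggregated away per pattern in the current window

def handle(message, pattern, emit):
    """`pattern` is the template extracted from the message, e.g. "GET /api/<id> <ms>ms".
    `emit` sends a message downstream to the log aggregator."""
    global window_start
    counts[pattern] += 1

    if counts[pattern] <= THRESHOLD or random.random() < SAMPLE_FRACTION:
        emit(message)          # below the threshold, or sampled in: pass through
    else:
        skipped[pattern] += 1  # aggregated into this window's summary

    if time.monotonic() - window_start >= WINDOW_SECONDS:
        for p, n in skipped.items():  # end of window: one summary per sampled pattern
            emit(f"[summary] pattern={p!r} skipped={n} window={WINDOW_SECONDS}s")
        counts.clear()
        skipped.clear()
        window_start = time.monotonic()
```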
Pros
  1. Minimal configuration: with very little effort, this technique can significantly reduce log volumes. Unlike drop rules, no manual pattern configuration or maintenance is required, and it works for all patterns, not just the highest-volume ones.
  2. Catches low-volume, important messages: infrequent error messages or code paths are passed through unsampled, so they're available for troubleshooting.
  3. Prevents spurious spikes: new high-volume patterns are automatically identified and sampled.
  4. Maintains relative volumes of patterns: by uniformly sampling data beyond the initial threshold, heavy patterns remain heavier than light patterns. Tools that group by pattern, or anomaly detection that relies on statistical profiles of message volumes, still work.
  5. Absolute counts for patterns are still available: Dashboards and alerts that rely on absolute values can be updated to include counts from summary messages.
  6. Avoids modifying existing dashboards and alerts: by configuring exceptions for what shouldn't be aggregated, you can avoid making changes to your existing setup and minimize impacts to workflows.
  7. Immediate: since pattern detection is real-time, this starts working immediately.
Cons
  1. Loses some data: since high-volume patterns are sampled, those patterns will not have all their data available at the sink.
  2. Reconfiguration: Some dashboards and alerts may need reconfiguration to take summaries into account if exceptions are not configured.
  3. Complexity: this capability doesn't exist in standard log aggregation tools, so an additional tool may be needed.

Technique 2: Logarithmic sampling by pattern

Logarithmic sampling increases volume reduction exponentially over uniform sampling

The previous technique does two things: 1) it guarantees a basic minimum of log messages for every pattern, and 2) it reduces the rest of the data by a set fraction. However, that's not optimal. Your heaviest patterns, which may be 10,000x noisier than your lightest ones, get sampled at the same rate, so you only get a linear decrease in log volumes. Ideally, you want to sample heavier patterns more aggressively than lighter ones. For example, if you're seeing 100,000 messages per second for pattern A and 100 messages per second for pattern B, you probably want to pass through only 1% of pattern A but 10% of pattern B. This is what logarithmic sampling does.
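One way to express this, purely as an illustration (the constants and formula here are assumptions, not any particular tool's implementation), is to let the number of messages kept per second grow only logarithmically with a pattern's incoming rate:

```python
import math

def pass_fraction(rate_per_sec: float, budget: float = 10.0) -> float:
    """Keep roughly `budget` messages per second for light patterns, and let the
    kept count grow only logarithmically as the incoming rate grows."""
    if rate_per_sec <= budget:
        return 1.0  # light patterns pass through in full
    kept_per_sec = budget * (1 + math.log10(rate_per_sec / budget))
    return min(1.0, kept_per_sec / rate_per_sec)

# A pattern at 100 msg/s keeps ~20% of its messages, while a pattern at
# 100,000 msg/s keeps only ~0.05% -- yet the heavier pattern still stays heavier.
print(pass_fraction(100), pass_fraction(100_000))
```

With a formula like this, a 1,000x difference in incoming rate turns into only a small difference in the number of messages that actually reach the aggregator.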

This technique is an extension of sampling by pattern, so it has the same pros and cons. The only difference is that it's exponentially (pun intended) more effective. It's also exponentially more effective than the basic uniform sampling technique.

Technique 3: Sampling with automatic backfilling

This last technique reduces the downsides of sampling by storing all the original logs in low-cost, queryable storage and automatically reloading data into the log aggregator when there's an anomaly. The anomaly detection could be built into the processing pipeline, or it could be external (such as a callback from the observability tool itself). This way, when an engineer goes to troubleshoot the anomaly, the data is already in the log aggregator.

When there's an anomaly, pass through additional data and backfill history for the anomalous objects.
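Here's a rough sketch of what the backfill hook could look like. Everything in it is hypothetical: lake, aggregator, and the alert object stand in for whatever raw-log store, log aggregation tool, and anomaly notification your pipeline actually uses.

```python
from datetime import timedelta

def on_anomaly(alert, lake, aggregator, lookback=timedelta(minutes=30)):
    """Callback fired by the pipeline's anomaly detector, or by the observability
    tool itself (e.g. via webhook). It reloads the raw, unsampled logs around the
    anomaly so they're already in the aggregator when an engineer starts digging."""
    end = alert.triggered_at
    start = end - lookback
    raw_logs = lake.query(        # the low-cost, queryable store holding all raw logs
        pattern=alert.pattern,    # e.g. the service or log pattern that alerted
        start=start,
        end=end,
    )
    aggregator.ingest(raw_logs)   # backfill the history into the log aggregator
    # A real implementation might also raise the pass-through rate for this
    # pattern for the duration of the incident.
```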
Pros
  1. Reduces data loss: by automatically backfilling missing data, it's much more likely an engineer will find what they need. Assuming manual backfilling is also possible and quick, data loss is effectively eliminated. At the same time, if the data store is queryable by the developer, all the raw data is still available, albeit with reduced query performance.
  2. Simplifies deployment: because of the lower risk of data loss, deployment can be faster and more aggressive.
  3. Iterative improvement: the rules for automated backfilling can be iterated on and improved over time.
Cons
  1. Requires additional queryable storage: something like a data lakehouse, where storage is cheap but the data is still queryable with fast-enough performance.
  2. Complexity: handling backfills requires batch processing on top of streaming ingestion, and may require adding another component.
  3. Not perfect: at some point an engineer might still need to manually search the raw data store or trigger a manual backfill, which requires going to another tool.

Availability

I went through the available public documentation for various tools to check what capabilities are present in each. Where it was clear a technique was possible, I marked the cell with ✔️. Where it was missing, I marked it with ❌.

Grepr is the only solution that implements all of these advanced techniques. Further, Grepr automatically parses configured dashboards and alerts for patterns to exclude from sampling, mitigating impacts to existing workflows. In customer deployments, we've seen log volume reductions of more than 90%! Sign up here for free and see the impact Grepr can make in 20 minutes, or reach out to us for a demo here.
