4 Basic Techniques to Reduce Log Volume and Cut Observability Costs (Part One)

Jad Naous
February 5, 2025
Animated GIF showing a massive flood of logs cascading down a river, illustrating the overwhelming volume of log data generated by modern applications.

Over the past five years, the rise of Kubernetes and the explosion of microservices have led to an unprecedented increase in the volume of logs generated by applications. A single application today can consist of tens or hundreds of microservices, each of which may have tens of instances running in containers emitting hundreds of log messages per second. Many engineering leaders are driving their teams to reduce observability costs, starting with logs, but this is often an enormous undertaking that must be repeated every few years.

In this blog, I'll highlight four basic techniques for reducing log volumes, comparing their pros and cons. These basic techniques should be available in most log aggregation systems as well as observability pipelines, and while they're simple and unlikely to work sufficiently for an enterprise, they're a good starting point. In the next blog post, I’ll talk about three advanced techniques, and discuss their availability across common tools.

Basic Technique 1: Increase severity threshold

One of the simplest techniques to implement is raising the severity threshold, either for logs emitted at the source or for logs accepted at the log aggregation system. For example, you can raise the threshold from INFO to WARN and stop collecting INFO logs.

Changing the log severity emitted at the source can require code changes and can be harder to revert quickly. On the other hand, since the logs are never emitted at the source, you also avoid the cost of transmitting and collecting them.

Changing the log severity at the log aggregator still accrues the cost of delivering the logs. However, you can then change the level of logs indexed or stored at the aggregator itself, without code changes, and that change is easier to revert if needed, such as during an incident.
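As a minimal sketch of the source-side approach, here's how raising the threshold looks with Python's standard `logging` module (the `checkout-service` logger name is a hypothetical example):

```python
import logging

# Raise the emitted level from INFO to WARNING at the source. INFO and
# DEBUG records are never produced, so they cost nothing to transmit or
# collect -- but reverting requires a config change and redeploy.
logging.basicConfig(level=logging.WARNING)

log = logging.getLogger("checkout-service")
log.info("cart updated")        # suppressed: below the threshold
log.warning("payment retried")  # emitted
```

The aggregator-side equivalent is usually a filter or index rule in the vendor's UI rather than code, which is what makes it faster to revert during an incident.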

Pros:

  1. Easy to implement: significant reduction with minimal changes.
  2. Customizable: can be configured at a fine granularity at the source.

Cons:

  1. Requires discipline: log messages need to be labeled with the correct level.
  2. Data gaps: can miss relevant logs necessary for troubleshooting or alerting.
  3. Migration impacts: some alerts and dashboards may stop working, and may not be fixable since the data is lost.

Basic Technique 2: Logs to metrics

Often, counts of various log messages are used by developers to measure throughput, delay, errors, or other metrics. This is an antipattern since logs are more expensive to process than metrics. Some log aggregation systems and observability pipeline tools allow users to convert log messages to metrics.
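To make the idea concrete, here's a minimal sketch of a log-to-metrics step, assuming a hypothetical "request completed" log pattern and metric naming scheme:

```python
import re
from collections import Counter

# Hypothetical pattern: match lines like "request completed status=200 user=42"
REQUEST_DONE = re.compile(r"request completed status=(\d{3})")

def logs_to_metrics(lines):
    """Count matching lines as metrics; pass everything else through."""
    counts = Counter()
    passed_through = []
    for line in lines:
        m = REQUEST_DONE.search(line)
        if m:
            # The line becomes a counter increment; note that other fields
            # on the line (like the user id) are discarded at this point.
            counts[f"http.requests.{m.group(1)}"] += 1
        else:
            passed_through.append(line)
    return counts, passed_through

counts, rest = logs_to_metrics([
    "request completed status=200 user=42",
    "request completed status=200 user=7",
    "request completed status=500 user=9",
    "cache miss for key=abc",
])
```

Note how the per-request detail (`user=42`) disappears once the line is folded into a counter, which is exactly the tradeoff listed in the cons below.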

Illustration of logs converted to metrics

Pros:

  1. Targeted: since specific messages are selected for conversion to metrics, this technique limits undesired side effects.
  2. Certain details preserved: the aggregated counts and values survive as metrics.

Cons:

  1. Other details lost: while the aggregate metrics are preserved, other information (such as a user ID) is lost.
  2. Implementation effort: it's hard to find candidate log messages and then rewrite dashboards and alerts.
  3. Lower impact: most log messages are not used as metrics, so this technique has limited scope.

Basic Technique 3: Uniform Sampling

A powerful technique that has recently been gaining wider adoption is sampling. With sampling, you collect a randomly sampled percentage of messages, say 1 out of every 100. You can apply a different sampling ratio at each message severity so that higher-severity messages are sampled less heavily or not at all. In Part 2 of this series, we'll talk about logarithmic sampling, an advanced technique.
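A per-severity sampler can be sketched in a few lines. The sampling rates below are hypothetical examples, not recommendations:

```python
import random

# Hypothetical per-severity keep rates: 1 in 100 DEBUG lines,
# 1 in 10 INFO lines, and every WARN/ERROR line.
SAMPLE_RATES = {"DEBUG": 0.01, "INFO": 0.1, "WARN": 1.0, "ERROR": 1.0}

def sample(record, rates=SAMPLE_RATES, rng=random.random):
    """Return True if this log record should be kept."""
    # Unknown severities default to 1.0 so new levels are never
    # silently dropped.
    keep_probability = rates.get(record["level"], 1.0)
    return rng() < keep_probability
```

Because each record is an independent coin flip, high-volume patterns shrink proportionately while their relative frequencies are preserved, which is the "statistical profile" benefit listed below.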

Coin being flipped
Sampling randomly selects messages to allow to pass through, a very simple and low-cost technique.

Pros:

  1. Powerful reduction: can reduce log volumes very quickly.
  2. Maintains statistical profiles: since log messages are sampled uniformly, all message patterns are reduced proportionately, preserving their statistical shape.
  3. Customizable: depending on how sampling is implemented, you can exempt certain services’ messages from sampling and sample others more heavily.

Cons:

  1. Data gaps: can drop rare but important messages that may be necessary for troubleshooting.
  2. Requires a sampler: most log aggregators have sampling capabilities, but this is an extra feature to check for. You might need an observability pipeline to implement it.
  3. Alerting and dashboarding impacts: sampling changes the absolute counts of any messages the log aggregator receives. Alerts or dashboards that depend on those exact values may need to be updated.
  4. Implementation effort: setting up all the exclusions so as not to miss important low-volume messages can be a significant effort.

Basic Technique 4: Drop rules

Most modern log aggregators allow a user to count messages by pattern. This can then be used to configure patterns to drop, either at the agent collecting the logs at the source or at the log aggregation system when the log messages arrive.
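A drop-rule filter is essentially a list of patterns checked against each line. Here's a minimal sketch, where the two rules target hypothetical high-volume messages (health checks and pool-stats chatter):

```python
import re

# Hypothetical drop rules: regexes for the top patterns by volume that
# were manually judged safe to discard. Any matching line is dropped
# before it is indexed or stored.
DROP_RULES = [
    re.compile(r"GET /healthz"),
    re.compile(r"connection pool stats:"),
]

def keep(line, rules=DROP_RULES):
    """Return True if the line matches no drop rule."""
    return not any(rule.search(line) for rule in rules)

kept = [line for line in [
    "GET /healthz 200 0.3ms",
    "user 42 logged in",
    "connection pool stats: active=3 idle=7",
] if keep(line)]
```

Each rule is written by hand after inspecting pattern counts, which is why this approach is precise but scales poorly when thousands of patterns churn over time.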

Bar chart eliminating first few bars
Configured drop rules eliminate the top log patterns by volume.

Pros:

  1. Works well in smaller environments: the number of patterns is relatively small and doesn’t change often.
  2. Targeted: since the patterns to exclude are picked manually, undesired side effects can be minimal.

Cons:

  1. Implementation effort: in larger environments with thousands of high-volume patterns, understanding which are safe to remove and manually configuring a rule for each is too much effort.
  2. Requires continuous maintenance: in dynamic environments, the log patterns may change all the time, requiring constant upkeep of the rules.
  3. Loses data: dropped log messages may hold information that’s needed for troubleshooting, even though they’re high volume.
  4. Allows spurious spikes: a bad deployment or a simple bug may trigger a slew of new messages that wouldn’t be stopped.

Grepr, like many other tools, supports all of these basic techniques. In the next blog post, I'll cover some more advanced techniques that allow Grepr to push volume reduction to 90% or more, with minimal impact on existing workflows. If you'd like to see what Grepr can do for you, schedule a demo here or sign up for a free trial here!

Ready for Part Two? Head here!


Frequently Asked Questions

What is the easiest way to reduce log volume?

Increasing the severity threshold is the simplest starting point. By raising the threshold from INFO to WARN, you can eliminate a large portion of log data with minimal configuration changes, either at the source or at the log aggregation system.

What is the difference between log sampling and drop rules?

Uniform sampling randomly selects a percentage of all log messages to keep, maintaining statistical patterns across your data. Drop rules target specific log patterns by volume and eliminate them entirely. Sampling preserves proportional representation while drop rules remove selected patterns completely.

Will reducing log volume break my existing dashboards and alerts?

It can. Most of these techniques change the absolute count of messages reaching your log aggregator, which means dashboards and alerts that rely on exact values may need updating. This is a key tradeoff to evaluate before implementing any reduction technique.

Can I combine multiple log volume reduction techniques?

Yes, and most organizations do. For example, you might increase severity thresholds for low-priority services, convert specific log messages to metrics, and apply sampling to everything else. The right combination depends on your environment's size, complexity, and troubleshooting requirements.
