Automating Log Management

Steve Waterworth
May 9, 2025

Grepr slips in like a shim between the log shippers and the aggregation backend. With a small configuration change, log shippers forward their logs to Grepr instead of the usual aggregation backend. Grepr automatically analyses each log entry and identifies similarity across all messages in real time. Noisy messages are summarised while unique messages are passed straight through. Nothing is discarded; all messages received by Grepr are persisted in low-cost storage.

Semantic Pipeline

As log messages arrive in Grepr, they are processed by a pipeline that parses them into Grepr's internal structure. Each log message has the following structure:

  • ID: Globally unique identifier
  • Received Timestamp: When Grepr received the message
  • Event Timestamp: The timestamp from the log message
  • Tags: A set of key-value pairs used to filter and route messages, e.g. host, service, environment, etc.
  • Attributes: Structured data and fields extracted from the message.
  • Message: The text of the message.
  • Severity: The OpenTelemetry standard for message severity: 1-4 TRACE, 5-8 DEBUG, 9-12 INFO, 13-16 WARN, 17-20 ERROR and 21-24 FATAL. Either derived from a severity field in the message or parsed out of the message text.
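
In code, that structure could be sketched roughly as follows. The field names mirror the list above, but the dataclass itself is illustrative rather than Grepr's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class LogRecord:
    """Illustrative sketch of the internal structure; not Grepr's real schema."""
    id: str                       # globally unique identifier
    received_timestamp: datetime  # when Grepr received the message
    event_timestamp: datetime     # timestamp carried by the log message itself
    tags: dict[str, str] = field(default_factory=dict)        # host, service, environment, ...
    attributes: dict[str, Any] = field(default_factory=dict)  # structured fields extracted from the message
    message: str = ""             # the text of the message
    severity: int = 9             # OpenTelemetry severity number (9-12 = INFO)
```
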
Masking

Once the message is in a standard form the real work can begin. Masking automatically identifies and masks out frequently changing values such as numbers, UUIDs, timestamps, IP addresses, etc. This significantly improves the efficiency of our machine learning by normalising variable data into consistent patterns.
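
As a minimal sketch of the idea (the patterns Grepr actually applies are not spelled out in this post, so these regexes and placeholder names are illustrative):

```python
import re

# Illustrative masking rules: replace frequently changing values with
# stable placeholders so similar messages normalise to the same pattern.
# Order matters: UUIDs, IPs and timestamps contain digits, so they run
# before the bare-number rule.
MASKS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?\b"), "<TIMESTAMP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def mask(message: str) -> str:
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message

print(mask("user 42 logged in from 10.1.2.3 at 2025-05-09T10:15:00Z"))
# -> "user <NUM> logged in from <IP> at <TIMESTAMP>"
```

Two messages that differ only in user ID, address and timestamp now reduce to the same normalised string, which is what makes the next stage effective.
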

Clustering

Grepr uses sophisticated similarity metrics to group messages into patterns. The similarity threshold determines how closely messages must match to be considered part of the same pattern.
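
Grepr's actual metrics aren't detailed here, but the mechanics can be sketched with a deliberately simple stand-in: greedy clustering on token overlap (Jaccard similarity) between masked messages.

```python
def similarity(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (an illustrative metric only)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def cluster(messages: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedy single-pass clustering: a message joins the first pattern whose
    representative is at least `threshold` similar, otherwise it starts a new one."""
    patterns: list[list[str]] = []
    for msg in messages:
        for group in patterns:
            if similarity(msg, group[0]) >= threshold:
                group.append(msg)
                break
        else:
            patterns.append([msg])
    return patterns

logs = [
    "user <NUM> logged in from <IP>",
    "user <NUM> logged in from <IP>",
    "disk <NUM>% full on <IP>",
]
print(len(cluster(logs)))  # -> 2 patterns
```

Raising the threshold produces more, tighter patterns; lowering it merges more messages into fewer, looser ones.
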

Sampling

Once a pattern reaches a threshold, Grepr will either stop forwarding those messages matching the pattern or only forward a sampled subset of the matching messages. If the pattern has been configured to be sampled, then Grepr uses a logarithmic sampling algorithm. With the base set to 2 and the deduplication threshold set to 4, the first 4 matching messages are forwarded as normal; since 2^4 = 16, an additional sample message is sent once the count reaches 16, and again at each subsequent doubling: 32, 64, 128, 256, you get the idea.
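
One way to read that schedule in code (an illustrative reconstruction of the arithmetic above, not Grepr's implementation):

```python
def should_forward(n: int, base: int = 2, threshold: int = 4) -> bool:
    """Decide whether the n-th message matching a pattern is forwarded.

    The first `threshold` messages pass through untouched; after that, one
    sample is sent each time the count reaches base**threshold,
    base**(threshold + 1), and so on.
    """
    if n <= threshold:
        return True
    point = base ** threshold          # first sample point: 2**4 = 16
    while point <= n:
        if point == n:
            return True
        point *= base                  # next sample point: double it
    return False

print([n for n in range(1, 300) if should_forward(n)])
# -> [1, 2, 3, 4, 16, 32, 64, 128, 256]
```
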

Summarising

At the end of each time slot, Grepr generates a concise summary for each clustered pattern, including the following extra attributes:

  • grepr.patternId: Unique identifier for the pattern
  • grepr.rawLogsUrl: Direct link to view all raw messages for this pattern
  • grepr.repeatCount: Count of the number of messages aggregated
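
A summary event for a noisy pattern might then look something like this; every value below is made up for illustration, including the URL:

```python
summary = {
    "message": "Connection timed out after <NUM> ms, retrying",  # masked pattern text
    "grepr.patternId": "p-7f3a2c",                               # hypothetical pattern id
    "grepr.rawLogsUrl": "https://app.grepr.ai/raw-logs?pattern=p-7f3a2c",  # hypothetical link
    "grepr.repeatCount": 1873,    # messages aggregated in this time slot
}
```
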
Exceptions

The machine learning in the semantic pipeline does a very good job of significantly reducing the volume of log data sent through to the aggregation backend without filtering out any essential data. However, there are always exceptions. Fortunately, there is a rules engine that works alongside the machine learning, allowing you to configure and fine-tune which messages are filtered and which are allowed to pass straight through.
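
The rule syntax itself isn't shown in this post, but conceptually an exception rule is a predicate that exempts matching messages from reduction. A hypothetical sketch, reusing the LogRecord structure from earlier:

```python
# Illustrative exception rules: any record matching a rule bypasses
# deduplication and is forwarded unmodified. The rule shapes below are
# hypothetical, not Grepr's actual rules engine syntax.
ERROR = 17  # OpenTelemetry severity numbers 17-20 = ERROR

RULES = [
    lambda r: r.severity >= ERROR,                  # always pass errors and fatals
    lambda r: r.tags.get("service") == "payments",  # never reduce a critical service
    lambda r: "audit" in r.message.lower(),         # keep compliance-relevant lines
]

def always_pass(record) -> bool:
    """True if the record should skip reduction and go straight through."""
    return any(rule(record) for rule in RULES)
```
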

All The Data When You Need It

Using Grepr to automatically manage log data is like having an engineer look at each message and decide which ones are useful and which are not. Most log messages are only useful when investigating an issue; only a small subset is needed to verify that everything is working as it should. Why pay to have every message indexed and stored by your log aggregation backend? With Grepr you can keep all messages in low-cost storage, where they can be queried for reporting or to feed AI analysis. When an incident occurs, the relevant log messages can be quickly backfilled into the log aggregation backend to aid in the restoration of service. With this strategy, you get the benefits of automated log reduction without changing any of the configurations, analytics or dashboards your team has built on its logging tools.
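
The backfill API itself isn't covered in this post; purely as a sketch of the workflow, triggering a backfill during an incident could look like this. The endpoint, query syntax and fields are all hypothetical:

```python
import json
import urllib.request

# Hypothetical backfill request: pull matching raw logs out of low-cost
# storage and replay them into the aggregation backend for the incident
# window. None of these names are Grepr's real API.
payload = {
    "query": "service:checkout AND severity>=ERROR",
    "from": "2025-05-09T09:00:00Z",
    "to": "2025-05-09T10:00:00Z",
    "destination": "datadog",
}
req = urllib.request.Request(
    "https://api.grepr.ai/v1/backfills",  # hypothetical endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <token>"},
    method="POST",
)
# urllib.request.urlopen(req)  # would submit the backfill job
```
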
