Why We Call Grepr A “Data Engine”

Steve Waterworth
August 7, 2025

The complexity of the Grepr Intelligent Observability Data Engine is hidden behind an easy-to-use web dashboard and simple-to-implement integrations with common log shippers. Here we take a peek inside its inner workings.

The name Grepr may imply to some that its inner workings are just a collection of regular expressions; in fact it is a lot more complicated than that, and it does not involve grep at all.

Pipelines

Pipelines are the beating heart of Grepr: they read data from one or more sources, store it in one or more datasets, process it, and finally write it out to one or more sinks.

Sources

These are endpoints for the log shippers to send data to instead of the usual destinations such as Splunk, Datadog, or New Relic. The Grepr endpoint is indistinguishable from the regular endpoint that the log shipper is usually configured to send data to.

Datasets

All data sent to Grepr is retained in low-cost storage, in either Grepr-hosted S3 buckets or your own. Each S3 bucket, aka a data lake, may contain one or more datasets; for example, a bucket could contain separate datasets for production and staging. One or more pipelines can utilize a dataset.

Sinks

After Grepr has processed the data, the summary entries for the verbose data and the unique entries are written out to the regular log aggregation and storage backend: Splunk, Datadog, New Relic, etc.

Processing

Grepr supports both structured and unstructured log data. Each log message is parsed into an internal record:

  • id - A globally unique identifier for this event
  • receivedTimestamp - The point in time that Grepr received the event
  • eventTimestamp - The point in time that the event occurred
  • tags - A set of key-value pairs extracted from the event, e.g. host, service
  • attributes - A set of key-value pairs extracted from the message
  • message - The message itself
  • severity - Extracted from the event: 1-4 TRACE, 5-8 DEBUG, 9-12 INFO, 13-16 WARN, 17-20 ERROR, 21-24 FATAL

For example, a parsed record looks like this:
{
  "type": "log",
  "id": "0mdj5j8wz06yx",
  "receivedTimestamp": 1753277608167,
  "eventTimestamp": 1753277604321,
  "tags": {
    "image_name": "steveww/rs-cart",
    "container_name": "robot-shop-cart-1",
    "service": "rs-cart",
    "short_image": "rs-cart",
    "host": "957bdcf3dcf4",
    "source": "rs-cart",
    "image_tag": "1.0.13",
    "image_id": "sha256:6695ca1ed437db0548356b179026341a632e52f569ebb413b33697b27fcd208b",
    "container_id": "957bdcf3dcf4c5b98e980b17a9d4a7a13c98ee7bdc4ce0b45c9ab76c4b3eac90",
    "docker_image": "steveww/rs-cart:1.0.13"
  },
  "attributes": {
    "hostname": "957bdcf3dcf4",
    "req": {
      "headers": {
        "host": "localhost:8080",
        "user-agent": "curl/7.64.0",
        "accept": "*/*"
      },
      "method": "GET",
      "query": {},
      "remotePort": 56128,
      "id": 3,
      "params": {},
      "url": "/health",
      "remoteAddress": "::1"
    },
    "time": 1753277546571,
    "responseTime": 2,
    "meta.grepr.messageAttributePath": "msg",
    "pid": 1,
    "level": "info",
    "res": {
      "headers": {
        "access-control-allow-origin": "*",
        "content-length": "25",
        "x-powered-by": "Express",
        "content-type": "application/json; charset=utf-8",
        "etag": "W/\"19-PNgogu5NtxY46N+WhTPw15zgIXM\"",
        "timing-allow-origin": "*"
      },
      "statusCode": 200
    },
    "status": "info"
  },
  "message": "request completed",
  "severity": 9
}
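The numeric severity in the record above follows the ranges listed earlier. As a minimal sketch (not Grepr's actual implementation), the mapping comes down to a simple integer division:

```python
def severity_name(severity: int) -> str:
    """Map a numeric severity to its level name.

    Mirrors the ranges listed above: 1-4 TRACE, 5-8 DEBUG, 9-12 INFO,
    13-16 WARN, 17-20 ERROR, 21-24 FATAL.
    """
    names = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
    if not 1 <= severity <= 24:
        raise ValueError(f"severity out of range: {severity}")
    return names[(severity - 1) // 4]
```

The record above, with severity 9, maps to INFO.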

Now that all the messages are represented internally in a common format, they can be analysed and reduced by the following key steps:

  1. Masking - Automatically identifies and masks frequently changing values such as numbers, UUIDs, timestamps, and IP addresses, normalising the data into consistent patterns.
  2. Tokenizing - Breaks log messages down into semantic tokens based on punctuation characters (lexical analysis).
  3. Clustering - Uses machine learning to group messages based on similarity patterns. A high-volume data stream can have hundreds of thousands of similarity patterns under analysis.
  4. Sampling - Once a pattern under analysis reaches the sampling threshold, no more matching messages are passed through.
  5. Summarising - At the end of the time window, a summary for each pattern is forwarded with extra tags added: grepr.patternId, grepr.rawLogsUrl, grepr.repeatCount.
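The first two steps can be illustrated with a small sketch. The regular expressions and placeholder tokens here are illustrative assumptions, not Grepr's actual masking rules:

```python
import re

# Order matters: mask composite values (UUIDs, IPs) before bare numbers.
MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]

def mask(message: str) -> str:
    """Replace frequently changing values with stable placeholders."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message

def tokenize(masked: str) -> list[str]:
    """Break a masked message into tokens on punctuation and whitespace."""
    return [t for t in re.split(r"[\s.,;:=()\[\]{}\"']+", masked) if t]
```

Two log lines that differ only in their changing values, such as request IDs or client IPs, mask and tokenize to the same sequence, which is what makes the clustering step effective.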

Just like a real engine, you can tune the performance of the Grepr Data Engine.

  • Aggregation Time Window - The time interval over which messages are aggregated before a summary is produced.
  • Exception Rules - Specify which messages should bypass reduction and be passed straight through.
  • Similarity Threshold - How closely a message has to match a pattern before it is considered similar.
  • Deduplication & Sampling Strategy
  • Attribute Aggregation
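As a toy illustration of how a similarity threshold drives clustering, consider matching tokenized messages against known patterns. The metric below is a simple token-position match ratio, an assumption for illustration only; Grepr's actual clustering model is more sophisticated:

```python
def similarity(a: list[str], b: list[str]) -> float:
    """Fraction of token positions on which two messages agree."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def assign_pattern(tokens: list[str], patterns: dict[int, list[str]],
                   threshold: float = 0.8) -> int:
    """Match a message to an existing pattern, or register a new one."""
    for pattern_id, pattern in patterns.items():
        if similarity(tokens, pattern) >= threshold:
            return pattern_id
    new_id = len(patterns)
    patterns[new_id] = tokens
    return new_id
```

Raising the threshold produces more, tighter patterns; lowering it merges more messages into fewer patterns.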

Deduplication And Sampling

At the start of every aggregation window, Grepr passes through a configurable number of sample messages unaggregated for each pattern. Once that threshold is crossed for a specific pattern, Grepr by default stops sending messages for that pattern until the end of the aggregation window, and then the life cycle repeats. This ensures that a base minimum number of raw messages always passes through unaggregated, so low-frequency messages, which usually contain important troubleshooting information, pass through untouched.

While this behavior maximizes the reduction, log spikes for any messages beyond the deduplication threshold disappear. Features in the external log aggregator that depend on counts of messages by pattern (such as Datadog's "group by pattern" capability) would no longer work well.

Instead, Grepr allows users to sample messages beyond the deduplication threshold. Grepr implements "Logarithmic Sampling", which samples noisier patterns more heavily than quieter patterns within the aggregation window. To enable this capability, you configure the logarithm base for the sampler. If the base is set to 2 and the deduplication threshold is set to 4, Grepr sends one additional sample message once the message count for a pattern hits 32 within the aggregation window (the four messages already sent before the threshold account for the powers up to 2^4 = 16, so the next sampled power is 32), another at 64, then 128, and so on.
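That schedule can be sketched with a hypothetical helper, using the base-2, threshold-4 example above:

```python
def should_emit(count: int, threshold: int = 4, base: int = 2) -> bool:
    """Decide whether the count-th message of a pattern is forwarded.

    The first `threshold` messages always pass. Beyond that, a message
    is sampled only when the running count is a power of the base
    higher than those notionally used by the initial samples
    (2^1..2^4 for base 2 and threshold 4), i.e. at 32, 64, 128, ...
    """
    if count <= threshold:
        return True
    power = base ** (threshold + 1)  # first sampled power: 32 here
    while power <= count:
        if power == count:
            return True
        power *= base
    return False

emitted = [n for n in range(1, 201) if should_emit(n)]
# With the defaults this yields [1, 2, 3, 4, 32, 64, 128]
```

Because samples thin out logarithmically, a pattern that spikes 10x louder only produces a handful of extra raw messages, while the spike itself remains visible downstream.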

Attribute Aggregation

This provides fine-grained control over how the attributes section of a message is combined when messages are aggregated into a single pattern. The message text may be similar enough for aggregation, but the attributes can vary. There are two ways to configure how attributes are handled.

Specific Attribute Path Strategies - Using a selector for the attribute path, a merge strategy can be specified: exact match, preserve all, or sample.

Default Attribute Merge Strategy - For any attribute keys not covered by a specific strategy, the selected default is used: exact match, preserve all, or sample.
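A sketch of the three strategies applied to a single attribute; the behaviour shown is an illustrative assumption rather than Grepr's exact semantics:

```python
def merge_attribute(values: list, strategy: str = "exact-match"):
    """Combine one attribute's values across messages aggregated
    into a single pattern summary."""
    distinct = list(dict.fromkeys(values))  # dedupe, keep first-seen order
    if strategy == "exact-match":
        # Keep the value only when every message agrees on it.
        return distinct[0] if len(distinct) == 1 else None
    if strategy == "preserve-all":
        # Keep every distinct value that was seen.
        return distinct
    if strategy == "sample":
        # Keep one representative value.
        return distinct[0]
    raise ValueError(f"unknown strategy: {strategy}")
```

For a stable attribute like `hostname`, exact match keeps it on the summary; for a varying one like `responseTime`, preserve all or sample avoids losing it entirely.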

For the complete reference see the documentation.

Dynamic Rule Engine

Even with a low-volume data stream of 300 messages per second, Grepr reduces log volume by 90%. Internally it manages a live pattern rule set in the region of 200,000 patterns, continually adapting it to the incoming stream. It is simply not possible to achieve this level of sophistication and data reduction with a manually managed set of pattern matches. The Grepr Data Engine does all the work for you, resulting in significant savings on log aggregation and storage costs.
