Monitored Objects: How Grepr reduces Datadog metrics and host costs

Jad Naous

February 12, 2025

If you’re managing observability with Datadog, you’ve likely noticed two big cost drivers: custom metrics and host counts. Datadog charges based on unique metric time series (i.e. metric cardinality) and active hosts, both of which can get large. At Grepr, we’ve developed a smarter way to manage observability—cutting those costs significantly without compromising developer workflows.

Why is metrics observability so expensive?

Metrics storage systems (time series databases, or TSDBs) are expensive because they offer:

Fast low-latency queries for immediate answers
Fresh data within seconds of arriving
Historical access for comparisons

But do all metrics need this level of service? We studied how metrics are used and found four core use cases:

Monitoring and alerting: Detect issues in real-time
Troubleshooting: Pinpoint the source of an issue
Dashboarding: Display aggregated performance views
Optimization & capacity planning: Compare changes, fine-tune performance, and understand utilization

Not every use case requires all TSDB features or all the data all the time, and by aligning metrics service to actual needs, we can dramatically cut costs.

The Grepr Approach: Smart Tiering with Integrated Anomaly Detection

The TSDB (Datadog) is your expensive hot storage, so Grepr stores only the most useful data there, and only when it is needed. Grepr pairs the TSDB with an observability-optimized data lake that serves as low-cost cold storage, and it automatically decides which data goes to which tier based on the current situation.

How It Works:

Monitored Objects: Metrics are grouped by objects like hosts, containers, jobs, AI models, or HTTP requests. Grepr applies built-in anomaly detection to identify abnormal behavior in monitored objects.
Aggregating Normal Data: “Normal” objects are aggregated into virtual entities to reduce noise and save costs. The original raw data is stored in the data lake, while the aggregated data is sent to the TSDB.
Fine-Grained Anomaly Data: When an object becomes anomalous, Grepr sends its detailed metrics to the TSDB along with historical data from the data lake for full visibility.
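To make the three steps above concrete, here is a minimal sketch of the routing decision. It is an illustration under stated assumptions, not Grepr's implementation: the names Sample, route, and "aggregate-normal" are hypothetical, the set of anomalous objects is assumed to come from a separate detector, and averaging is just one possible way to aggregate. The historical backfill from the data lake is not modeled.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    object_id: str  # the monitored object, e.g. a host name
    metric: str
    value: float


def route(samples, anomalous_ids, aggregate_id="aggregate-normal"):
    """Return (to_tsdb, to_data_lake) for one batch of samples.

    Hypothetical helper: raw data always goes to the data lake, anomalous
    objects pass through in full detail, and normal objects are collapsed
    into one virtual object per metric.
    """
    to_data_lake = list(samples)  # every raw sample is kept in cold storage
    to_tsdb = []
    normal_values = {}  # metric name -> values from normal objects

    for s in samples:
        if s.object_id in anomalous_ids:
            to_tsdb.append(s)  # fine-grained data for anomalous objects
        else:
            normal_values.setdefault(s.metric, []).append(s.value)

    # One aggregated sample per metric stands in for all normal objects.
    for metric, values in sorted(normal_values.items()):
        to_tsdb.append(Sample(aggregate_id, metric, mean(values)))

    return to_tsdb, to_data_lake


batch = [
    Sample("host-a", "cpu", 10.0),
    Sample("host-b", "cpu", 20.0),
    Sample("host-c", "cpu", 90.0),
]
hot, cold = route(batch, anomalous_ids={"host-c"})
```

In this batch the TSDB receives two series (host-c and the aggregate) instead of three, while the data lake keeps all three raw samples.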

Result: If you’re billed by objects, such as hosts, or by metrics cardinality or number of datapoints sent, you’ll see a major drop in your billable items.

Example: Host Reduction

Grepr has customizable built-in anomaly detection for host metrics to identify hosts that behave differently from others. Grepr reduces the number of hosts, and all of their metrics, that are visible to Datadog by aggregating all normal hosts together. Anomalous hosts are sent individually along with historical data. You are billed only for anomalous hosts and the virtual “aggregate” hosts.
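As a rough illustration of "hosts that behave differently from others", the sketch below flags outliers with a median-based score and then counts how many hosts would remain visible. The technique (a modified z-score using the median absolute deviation), the 3.5 threshold, the function names, and the sample numbers are all assumptions chosen for the example. The post does not say which detection method Grepr uses.

```python
from statistics import median


def outlier_hosts(value_by_host, threshold=3.5):
    """Return hosts whose value is far from the peer median.

    Uses the modified z-score 0.6745 * |x - median| / MAD, a common
    robust outlier test. This is a stand-in, not Grepr's detector.
    """
    values = list(value_by_host.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the peers are identical: flag anything that differs.
        return {h for h, v in value_by_host.items() if v != med}
    return {
        h for h, v in value_by_host.items()
        if 0.6745 * abs(v - med) / mad > threshold
    }


def visible_host_count(anomalous, aggregate_hosts=1):
    """Hosts the TSDB sees: anomalous hosts plus the virtual aggregates."""
    return len(anomalous) + aggregate_hosts


# Made-up CPU utilization percentages for five hosts.
cpu = {"h1": 40, "h2": 42, "h3": 41, "h4": 39, "h5": 95}
anomalous = outlier_hosts(cpu)
visible = visible_host_count(anomalous)
```

With these made-up values only h5 stands out, so two hosts are visible (h5 plus one aggregate) instead of five.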

Benefits for Metrics Use Cases

Monitoring and alerting: Instead of periodically issuing TSDB queries against metrics, Grepr uses stream processing to detect anomalies, eliminating the need for a full TSDB. At the same time, Grepr enables more powerful anomaly detection using unsupervised machine learning.
Troubleshooting: Grepr automatically loads fine-granularity relevant data for anomalies, significantly reducing the amount of data that needs to be in hot storage all the time. Engineers don’t have to compromise on the granularity of metrics for troubleshooting.
Dashboarding: Aggregated data powering dashboards is sent in real time, and Grepr can be configured to pass through fine-granularity metrics for other dashboards.
Optimization & capacity planning: Full-granularity metrics are available for as long as needed since they’re stored in low-cost storage.
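The difference between stream processing and periodic TSDB queries can be sketched with a detector that updates its statistics as each point arrives, so nothing has to be stored and queried first. The example below uses Welford's online mean and variance with a simple z-score test. That choice, the class name, and the thresholds are assumptions for illustration, and it is much simpler than the unsupervised machine learning mentioned above.

```python
import math


class StreamingDetector:
    """Flags a point that is far from the running mean of earlier points."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # z-score above which a point is flagged
        self.warmup = warmup        # points to see before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, x):
        """Score x against earlier points, then fold it into the stats."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))  # sample std deviation
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True

        # Welford's update: constant memory and constant work per point.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous


detector = StreamingDetector()
flags = [detector.observe(v) for v in [10, 11, 9, 10, 11, 10, 50]]
```

Only the final value is flagged here. The detector keeps three numbers per series, which is why this style of alerting does not need every raw point in hot storage.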

How Grepr Reduces its Impact

Grepr aims to reduce the impact on any existing workflows with a slew of capabilities:

Automatic configuration: Grepr can automatically allow metrics used in alerts or dashboards to pass through unmodified.
Default settings: Grepr comes with pre-configured sane settings to get going immediately.
Compatible Query Languages: Grepr is multilingual! We support a Datadog-like syntax so users don’t have to learn a new language.
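One way to picture automatic configuration: collect the metric names referenced by existing alert and dashboard queries, then let those metrics pass through untouched. The sketch below is hypothetical and is not Grepr's API. It assumes queries written in the common Datadog form aggregator:metric.name{scope}, and the function names and regular expression are invented for the example.

```python
import re

# Matches the metric name in queries shaped like "avg:system.cpu.user{...}".
# Assumption: only these four aggregators are handled in this sketch.
METRIC_IN_QUERY = re.compile(r"\b(?:avg|sum|min|max):([A-Za-z][\w.]*)\{")


def metrics_in_use(queries):
    """Return the set of metric names referenced by the given queries."""
    names = set()
    for query in queries:
        names.update(METRIC_IN_QUERY.findall(query))
    return names


def should_pass_through(metric, allowlist):
    """True if the metric feeds an alert or dashboard and must stay intact."""
    return metric in allowlist


queries = [
    "avg:system.cpu.user{env:prod} by {host}",
    "sum:app.requests.count{service:web}.as_count()",
]
allowlist = metrics_in_use(queries)
```

Any metric in the allowlist would skip aggregation, so existing alerts and dashboards keep seeing the series they were built on.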

There are many other exciting details to complete this picture, such as clustering of monitored objects by cohort and built-in anomaly detection. Try Grepr free and see what it can do in 10 minutes.

FAQs

1. What are Monitored Objects in Grepr?

Monitored Objects are logical groupings of metrics by entity type, such as hosts, containers, jobs, AI models, or HTTP requests. Grepr applies built-in anomaly detection to these objects to identify abnormal behavior, then intelligently routes data based on whether the object is behaving normally or anomalously.

2. How does Grepr reduce Datadog host costs?

Grepr aggregates all normally behaving hosts into virtual "aggregate" entities, sending only summarized data to Datadog. When a host becomes anomalous, Grepr sends its individual detailed metrics along with historical context. You pay only for anomalous hosts plus the aggregate virtual hosts, rather than for every host in your infrastructure.

3. Will I lose access to historical metrics data with Grepr?

No. Grepr stores full-granularity raw data in a low-cost observability-optimized data lake. When an anomaly occurs, Grepr automatically pulls relevant historical data from cold storage and sends it to Datadog for complete visibility. For capacity planning and optimization work, full metrics remain accessible in the data lake for as long as you need them.

4. Does Grepr work with my existing Datadog dashboards and alerts?

Yes. Grepr can automatically detect metrics used in your existing alerts and dashboards and allow them to pass through unmodified. Grepr also supports Datadog-compatible query syntax, so your team does not need to learn a new query language.

5. How does Grepr handle anomaly detection differently than Datadog?

Instead of running periodic queries against the TSDB, Grepr uses stream processing to detect anomalies in real time before data reaches Datadog. This approach enables more powerful unsupervised machine learning detection while eliminating the need to store all metrics in expensive hot storage just for alerting purposes.
