Privacy and Data Ownership in Observability Pipelines

Utkarsh Vashishtha

January 28, 2026

When teams evaluate an observability tool, the conversation often revolves around features and query speed. Platform and security engineers also care about questions few vendors like to answer: Where does the data actually live? Who controls the retention policy? What happens if we switch vendors?

At Grepr, we built our architecture on the principle that log data belongs to the customer, not the vendor. We designed the platform to decouple the processing engine from the storage layer, giving you total control over privacy, retention, and access.

Here is how Grepr handles data ownership and compliance.

Where Your Data Lives and How

Most observability vendors operate as a "black hole." You stream logs to their cloud, they store them in a proprietary format, and they charge you rent to make them available.

Grepr takes a different approach. We store raw log data in blob stores (like Amazon S3) using open standards: Apache Iceberg tables backed by Apache Parquet files. As a customer, you have two choices for where this data lives:

Grepr-Hosted: We manage the secure storage for you.
Customer-Hosted (BYOB): You bring your own bucket.

This "Bring Your Own Bucket" model allows you to maintain physical custody of your raw assets while Grepr handles the intelligence and querying layer.

True Customer Control: Retention and Deletion

Because the data can reside in your S3 bucket, you are not bound by a vendor's arbitrary retention tiers. You control the lifecycle policies directly in your cloud provider.

In AWS, for example, you configure an S3 Lifecycle rule on the bucket to manage the data. If compliance requires 7 years' worth of data, you can transition older objects to Glacier Deep Archive after a threshold you choose. If you need to purge data to meet GDPR "Right to Be Forgotten" requests, you can do so using standard S3 tools without opening a support ticket. Grepr processes the data, but we don't have to be the custodian.
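As a rough illustration of the lifecycle setup described above, the following Python sketch builds an S3 Lifecycle configuration that moves objects to Glacier Deep Archive and expires them after seven years. The bucket name, the `logs/` prefix, the rule ID, and the 90-day archive threshold are assumptions made for illustration, not Grepr defaults. The `boto3` library is imported only when the rule is actually applied.

```python
from typing import Any, Dict


def build_retention_config(prefix: str, archive_after_days: int,
                           expire_after_days: int) -> Dict[str, Any]:
    """Build a lifecycle configuration: archive first, then expire."""
    if not 0 < archive_after_days < expire_after_days:
        raise ValueError("archive threshold must be positive and precede expiration")
    return {
        "Rules": [
            {
                "ID": "log-retention",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "DEEP_ARCHIVE"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_retention_config(bucket: str, config: Dict[str, Any]) -> None:
    """Apply the configuration to a bucket (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    # Assumed values: archive after 90 days, delete after 7 years (7 * 365 days).
    config = build_retention_config("logs/", 90, 7 * 365)
    print(config)
    # apply_retention_config("my-log-bucket", config)  # hypothetical bucket name
```

Note that `put_bucket_lifecycle_configuration` replaces the bucket's existing lifecycle configuration, so any rules already in place should be merged into the new configuration before applying it.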

Open Formats: No Vendor Lock-In

Proprietary data formats are the biggest barrier to data ownership. Grepr stores all log data as Apache Iceberg tables (backed by Apache Parquet files), so you can run any workload on top of that data, inside or outside of Grepr. Inside Grepr, we strive to provide easy-to-use yet powerful operators that cover most of what you would want to do with logs.

Because these are open standards, your data is not locked inside Grepr. You can point other tools, like Amazon Athena, Spark, or your own AI models, directly at your storage buckets to process your logs for business intelligence or security auditing.
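To make the "point other tools at your bucket" idea concrete, here is a minimal sketch of reading an Iceberg table from Python with the open-source PyIceberg library. Everything specific in it is an assumption: the catalog name, the table name, and the `service` and `severity` columns are hypothetical, and how your catalog is configured depends on your own environment rather than on anything stated in this post.

```python
import re
from typing import Any

# Allow only simple identifiers so values can be embedded in a filter string safely.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_.-]+$")


def build_row_filter(service: str, severity: str) -> str:
    """Build a filter expression over two hypothetical columns."""
    for value in (service, severity):
        if not _SAFE_VALUE.match(value):
            raise ValueError(f"unsupported characters in filter value: {value!r}")
    return f"service = '{service}' AND severity = '{severity}'"


def read_logs(catalog_name: str, table_name: str, row_filter: str,
              limit: int = 100) -> Any:
    """Scan an Iceberg table and return matching rows as an Arrow table.

    Requires the third-party packages pyiceberg and pyarrow, plus a catalog
    configured for your environment.
    """
    from pyiceberg.catalog import load_catalog  # imported lazily

    table = load_catalog(catalog_name).load_table(table_name)
    return table.scan(row_filter=row_filter, limit=limit).to_arrow()


if __name__ == "__main__":
    row_filter = build_row_filter("checkout", "ERROR")
    print(row_filter)
    # rows = read_logs("default", "logs.events", row_filter)  # hypothetical names
```

The same table could be queried from Athena or Spark instead; the point is only that the files and metadata are readable without going through the vendor.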

You own the data, and you own the format.

Automated Privacy-Driven Workflows

Grepr lets you build workflows on top of your log data, including privacy-driven ones such as automated PII detection and masking.

Privacy often fails because it relies on manual configuration. If an engineer forgets to write a redaction rule for a new service, PII (Personally Identifiable Information) leaks into logs.

We use semantic machine learning to spot patterns in your log data as it streams through. From there, large language models identify new types of PII and mask them automatically before anything hits storage. Sensitive data gets handled correctly without anyone having to maintain sprawling regex lists.

For example, take the following log message:

2023-10-27 14:22:01 DEBUG [PaymentController] Transaction failed, context dump {"orderId": "99812", "user_raw": {"attributes": {"misc_data": "{\"nickname\":\"John\", \"secret_q\":\"Mothers maiden name: Smith\"}\", "notes": "calls from +1-555-010-9988 often"}}}

The Grepr pipeline automatically infers the PII from context and masks it:

2023-10-27 14:22:01 DEBUG [PaymentController] Transaction failed, context dump {"orderId": "99812", "user_raw": {"attributes": {"misc_data": "{\"nickname\":\"<REDACTED_NAME>\", \"secret_q\":\"Mothers maiden name: <REDACTED_NAME>\"}\", "notes": "calls from <REDACTED_PHONE> often"}}}

LLMs have seen enough natural language to pick up on PII buried deep in nested JSON, even when the field names are vague or inconsistent. That contextual awareness makes Grepr's semantic ML layer far more scalable than pattern-matching alone.
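The structural half of this problem, reaching strings buried in nested JSON (including JSON serialized inside a string field), can be sketched in a few lines of Python. This is not Grepr's implementation: the detector below is a plain regex for international-format phone numbers, used only as a stand-in for the ML and LLM detection described above. It would miss the names in the example, which is exactly the gap contextual detection is meant to close.

```python
import json
import re
from typing import Any, Callable

# Stand-in detector: matches numbers such as +1-555-010-9988 only.
PHONE = re.compile(r"\+\d{1,3}(?:[-\s]\d{2,4}){2,4}")


def mask_phones(text: str) -> str:
    """Replace phone-like substrings with a redaction token."""
    return PHONE.sub("<REDACTED_PHONE>", text)


def mask_value(value: Any, detect: Callable[[str], str]) -> Any:
    """Apply `detect` to every leaf string, descending into embedded JSON."""
    if isinstance(value, dict):
        return {key: mask_value(item, detect) for key, item in value.items()}
    if isinstance(value, list):
        return [mask_value(item, detect) for item in value]
    if isinstance(value, str):
        try:
            inner = json.loads(value)
        except ValueError:
            return detect(value)  # ordinary text
        if isinstance(inner, (dict, list)):
            # The string held serialized JSON: mask inside it, then re-serialize.
            return json.dumps(mask_value(inner, detect))
        return detect(value)  # e.g. "99812" parses as a number; keep it a string
    return value


if __name__ == "__main__":
    record = {
        "orderId": "99812",
        "user_raw": {
            "attributes": {
                "misc_data": json.dumps({"contact": "+1-555-010-9988"}),
                "notes": "calls from +1-555-010-9988 often",
            }
        },
    }
    print(json.dumps(mask_value(record, mask_phones), indent=2))
```

Because the detector is passed in as a function, a model-backed detector could replace `mask_phones` without changing the traversal.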

Compliance: SOC2, HIPAA, and DORA

Proving you're compliant matters just as much as actually being compliant. Grepr is SOC2 Type 2 and HIPAA certified, and you can dig into our security controls over at our Trust Center.

If you’re in financial services, the Digital Operational Resilience Act (DORA) probably keeps you up at night. The regulation demands tight oversight of ICT risk and data recovery, and auditors want to see that you're not putting all your eggs in one vendor's basket. Storing logs in your own infrastructure helps check that box. We wrote a longer piece on this called How DORA Redefines Logging and Observability if you want the full rundown.

SSO and Access Control

Owning your data means nothing if you can't control who touches it. Grepr plugs into SSO providers like Okta, Google Workspace, and Azure AD, so you can require MFA across the board.

On the platform side, we built Role-Based Access Control (RBAC) to keep people in their lanes. Engineers see the logs for their services and nothing else. If something sensitive leaks, the blast radius stays small.
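The scoping idea can be illustrated with a small, deny-by-default filter. This is a conceptual sketch only, not Grepr's RBAC model: the `Role` class, the `service` field on each log entry, and the role names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Role:
    """A role that grants read access to a fixed set of services."""
    name: str
    services: FrozenSet[str]


def visible_logs(role: Role, logs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only entries whose service the role may read.

    Entries without a "service" field are hidden (deny by default).
    """
    return [entry for entry in logs if entry.get("service") in role.services]


if __name__ == "__main__":
    payments_engineer = Role("payments-engineer", frozenset({"payments"}))
    logs = [
        {"service": "payments", "message": "Transaction failed"},
        {"service": "auth", "message": "Login succeeded"},
        {"message": "entry with no service tag"},
    ]
    for entry in visible_logs(payments_engineer, logs):
        print(entry["message"])
```

Denying access to untagged entries keeps a mislabeled or unlabeled log from becoming visible to everyone by accident.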

The Security Review Checklist

We built Grepr to survive the toughest security reviews. Here's how we answer the questions that usually trip vendors up:

"Does the vendor have our data?" Only if you want us to. Otherwise, it stays in your S3 bucket.
"Is the data encrypted?" Yes. At rest in S3, in transit via TLS.
"What about compliance?" SOC2 Type 2, HIPAA, and DORA. All covered.
"Can we host it ourselves?" Yep. Grepr runs as SaaS or on-prem, so you can meet strict requirements like FedRAMP without compromising on functionality.

The Bottom Line on Data Ownership

The architecture boils down to one idea: separate the smart stuff from the storage. You get AI-driven observability and all the query power you need, but your logs stay exactly where you put them.

Ready to take ownership of your log data? Get started with Grepr today.

Frequently Asked Questions

1. What is BYOB (Bring Your Own Bucket) in observability?

BYOB lets you store your raw log data in your own cloud storage (like AWS S3) instead of the vendor's infrastructure. The observability platform processes and queries your data, but you maintain physical custody of the files. This gives you control over retention policies, deletion requests, and long-term storage costs.

2. How do I delete log data for GDPR compliance?

When your logs live in your own S3 bucket, you can delete data directly using standard AWS tools without opening a support ticket with your observability vendor. This makes responding to GDPR "Right to Be Forgotten" requests faster and keeps you in control of the deletion process.

3. What is Apache Iceberg and why does it matter for log storage?

Apache Iceberg is an open table format that organizes data files, here Apache Parquet files, into queryable tables on object storage. Using open formats instead of proprietary ones means you can query your logs with any compatible tool (Athena, Spark, Trino) and you're not locked into a single vendor. If you switch observability platforms, your data stays readable.

4. How does automated PII detection work in log data?

Machine learning models analyze your log stream in real time, using context to identify personally identifiable information like names, phone numbers, and addresses, even when they're deeply nested in JSON. The PII gets masked before the data hits storage, so you don't have to maintain manual redaction rules for every new service.

5. What compliance certifications should an observability platform have?

Look for SOC 2 Type 2 (security controls), HIPAA (if you handle health data), and DORA compliance (for financial services in the EU). Platforms that offer BYOB storage and on-prem deployment options make it easier to meet strict requirements like FedRAMP.
