How to Retain Raw Telemetry Data for HIPAA Compliance Without Breaking Your Budget

Jonathan Seidman
April 23, 2026

HIPAA requires covered entities and business associates to retain compliance-related documentation, including audit logs, for a minimum of six years. For engineering teams running production healthcare systems, that means six years of access logs, system activity records, and any telemetry that touches electronic protected health information (ePHI).

The regulation is clear on the timeline. The challenge is the cost. Retaining six years of raw telemetry in a commercial observability platform is prohibitively expensive for most organizations. This post covers the technical implementation for retaining everything you need at a price that makes sense.

The Retention Requirement

HIPAA's documentation requirements (45 CFR 164.530(j) in the Privacy Rule, with a parallel requirement at 45 CFR 164.316(b)(2) in the Security Rule) mandate that all policies, procedures, and records of actions, activities, and assessments be retained for six years from their creation date or last effective date. The HHS Office for Civil Rights treats missing audit logs as a presumed violation of monitoring requirements during investigations.

State laws can extend the retention period further. Several states require seven to ten years for medical records. Your compliance team should confirm the applicable standard. Your systems need to meet the stricter one.

For a detailed breakdown of what observability data falls under HIPAA retention, see HIPAA Requirements for Observability Data Retention.

Why Commercial Platforms Fail at Long-Term Retention

The math does not work. Consider a healthcare platform generating 500 GB of logs per day. That is roughly 180 TB per year. Six years of retention means over a petabyte of data.

At commercial observability platform pricing, indexed log storage runs anywhere from $0.10 to $3.00+ per GB per month. Even at the low end of $0.10/GB/month, storing a petabyte costs $100,000 per month. At higher tiers, the number is far worse.
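The arithmetic is simple enough to sanity-check in a few lines, assuming the 500 GB/day volume and the low-end $0.10/GB/month rate quoted above:

```python
# Back-of-the-envelope: six years of raw logs at commercial indexed-log pricing.
DAILY_GB = 500
YEARS = 6
PRICE_PER_GB_MONTH = 0.10  # low end of the commercial pricing range

total_gb = DAILY_GB * 365 * YEARS  # 1,095,000 GB -- just over a petabyte
monthly_bill = total_gb * PRICE_PER_GB_MONTH
```

Even at the cheapest rate, the steady-state bill lands around $109,500 per month before indexing and query fees are counted.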

Most teams respond by reducing retention windows to 15 or 30 days. This satisfies operational needs but creates a compliance gap. When an OCR investigation requests audit logs from 18 months ago, a 30-day retention window leaves you with nothing to produce.

Aggressive sampling is another common workaround. It reduces cost but also reduces the completeness of your audit trail. An auditor asking for access records on a specific patient at a specific time expects complete data. A sampled dataset has gaps.

The Two-Tier Architecture

The solution separates hot operational data from cold compliance data.

Hot tier: Your observability platform receives a reduced, deduplicated log stream for real-time alerting, dashboards, and active troubleshooting. Retain this for 15 to 30 days. This is the data your on-call engineers query during incidents.

Cold tier: Every raw event, unmodified and unsampled, writes to S3-compatible object storage in Apache Parquet format with Apache Iceberg table management. Retain this for six years (or longer if state law requires). This is the data you produce during audits and investigations.

The hot tier is expensive per GB but holds a small volume. The cold tier is cheap per GB and holds everything.

Implementation

Storage Layer

Write raw telemetry to your S3-compatible bucket (AWS S3, GCS, MinIO). Use Parquet as the file format for compression and columnar query efficiency. Use Iceberg for table management, partition pruning, and schema evolution.
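A common layout partitions events by time so Iceberg can prune files by date range before reading anything. As a minimal sketch, this derives the S3 prefix a Parquet file would land in, assuming a day/hour partition scheme and a hypothetical `telemetry-archive` bucket:

```python
from datetime import datetime, timezone

def partition_key(event_ts: datetime, bucket: str = "telemetry-archive") -> str:
    """Map an event timestamp to the S3 prefix its Parquet file lands in."""
    ts = event_ts.astimezone(timezone.utc)  # partition on UTC, not local time
    return f"s3://{bucket}/logs/date={ts:%Y-%m-%d}/hour={ts:%H}/"

# An access-log event from April 23, 2026 lands in that day's partition:
key = partition_key(datetime(2026, 4, 23, 14, 30, tzinfo=timezone.utc))
```

With this layout, a query scoped to one week touches only seven `date=` prefixes out of six years of data.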

S3 Standard costs roughly $0.023/GB/month. S3 Infrequent Access drops to around $0.0125/GB/month. For data older than a year, S3 Glacier Instant Retrieval costs about $0.004/GB/month.

At these rates, storing a petabyte of compressed log data for six years costs a small fraction of what a single year would cost in a SaaS observability platform.

Lifecycle Policies

Configure S3 lifecycle policies to automatically transition data between storage tiers based on age:

  • 0 to 90 days: S3 Standard (frequent queries during active investigation windows)
  • 90 days to 1 year: S3 Infrequent Access (occasional queries for recent historical context)
  • 1 to 6 years: S3 Glacier Instant Retrieval (rare queries for compliance and audit)
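The tiering above maps directly onto one S3 lifecycle rule. A sketch of the configuration document, in the shape `boto3`'s `put_bucket_lifecycle_configuration` accepts (the `logs/` prefix is a placeholder for your archive path):

```python
# Lifecycle rule: Standard -> Standard-IA at 90 days -> Glacier IR at 1 year.
lifecycle_config = {
    "Rules": [
        {
            "ID": "telemetry-compliance-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER_IR"},
            ],
            # Deliberately no Expiration action: deletion is governed by
            # Object Lock retention, not by the lifecycle policy.
        }
    ]
}
```

Applied once via `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_config)`, the transitions then run automatically as objects age.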

Iceberg's partition pruning ensures queries against any tier resolve efficiently. The query engine reads metadata to identify the exact files needed and skips everything else.

Immutability and Integrity

HIPAA auditors expect evidence that logs have not been tampered with after they are created. Configure your S3 bucket with:

  • Object Lock in compliance mode to prevent deletion or modification of log files for the required retention period.
  • Versioning, so that any overwrite or delete preserves the prior version of each object instead of destroying it.
  • Access logging on the bucket itself to create an audit trail of who accessed the compliance archive.
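As a sketch, the Object Lock portion of that setup looks like the following, in the shape `boto3`'s `put_object_lock_configuration` accepts (the bucket must have been created with Object Lock enabled; adjust the retention if state law extends it past six years):

```python
# Default retention: every new object is locked in COMPLIANCE mode,
# so no principal -- including the account root -- can delete or
# overwrite it before the retention period expires.
object_lock_config = {
    "ObjectLockEnabled": "Enabled",
    "Rule": {
        "DefaultRetention": {
            "Mode": "COMPLIANCE",
            "Years": 6,
        }
    },
}
```

Compliance mode (rather than governance mode) is the right choice here precisely because it removes the "privileged user deletes the evidence" failure mode an auditor would probe for.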

These controls satisfy the integrity requirements that auditors look for when evaluating your retention infrastructure.

Query Access

Retained data is only useful if you can search it during an investigation. Multiple query engines read Iceberg tables natively:

  • Amazon Athena for serverless, pay-per-query SQL.
  • DuckDB for local ad-hoc investigation.
  • Trino for distributed queries across large datasets.
  • Grepr's query layer for sub-10-second responses across the full archive using massively parallel processing.

When an auditor requests access logs for a specific patient over a specific date range, the query runs against the Iceberg table, prunes to the relevant partitions, reads only the necessary columns from the Parquet files, and returns results. The data size and age do not materially affect response time because the metadata layer eliminates unnecessary I/O.
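An audit request like that reduces to one scoped query. Here is a sketch in Athena-style SQL, held as a string so it could be submitted programmatically; the `access_logs` table, its columns, and the patient ID are illustrative placeholders for your own schema:

```python
# Audit query: every access to one patient's records over one month.
# The date predicate lets Iceberg prune to the matching partitions
# before any Parquet file is read.
AUDIT_QUERY = """
SELECT event_time, user_id, action, source_ip
FROM access_logs
WHERE patient_id = 'P-12345'
  AND date BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
ORDER BY event_time
"""
```

Whether the target partitions are three weeks or five years old, the engine reads the same small set of files, which is why response time stays flat as the archive grows.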

Using Grepr for Automated Retention

Grepr handles this architecture as a managed pipeline. Point your log agents at Grepr and it simultaneously:

  • Reduces log volume and forwards the signal to your observability platform for real-time operations.
  • Writes every raw event, including the noise and duplicates, to your S3 bucket in Parquet/Iceberg format.
  • Manages compaction, partitioning, schema evolution, and table maintenance automatically.
  • Provides a query interface across the full archive using your existing query language (SPL, Lucene, Datadog syntax, NRQL).

Your compliance team gets the complete, unmodified audit trail retained for as long as regulations require. Your engineering team gets a manageable observability bill and sub-10-second queries against years of historical data. No data leaves storage you control, and the open formats mean no vendor lock-in.

Checklist for HIPAA Telemetry Retention

  • Confirm your retention period meets both federal (six-year minimum) and applicable state requirements.
  • Identify all log sources that touch ePHI or serve as compliance documentation.
  • Implement the two-tier architecture: reduced data to your observability platform, full raw data to S3.
  • Configure immutability controls (Object Lock, versioning, access logging) on your compliance storage.
  • Set lifecycle policies to transition data to cheaper storage tiers over time.
  • Test your query capability against historical data. Run a mock audit query against data from one year ago and verify it returns complete results.
  • Document your retention policies and technical controls. HIPAA requires the documentation, not just the data.

For the technical details of the S3, Parquet, and Iceberg storage layer, see How to Store Logs in S3 Using Parquet and Apache Iceberg for Cost Savings.

Frequently Asked Questions

How long do you need to retain telemetry data for HIPAA compliance?

HIPAA requires covered entities and business associates to retain compliance-related documentation, including audit logs and system access records, for a minimum of six years. Some states require longer retention periods for medical records, up to seven to ten years. Your systems need to meet whichever standard is stricter between federal and state requirements.

Can you use your observability platform for HIPAA-required long-term log retention?

You can, but the cost is typically prohibitive. Commercial observability platforms charge $0.10 to $3.00+ per GB per month for stored data. A healthcare platform generating 500 GB of logs per day would face six-figure monthly storage bills to retain six years of data. Most teams use a two-tier approach: short-term retention (15 to 30 days) in the observability platform for real-time operations, and long-term retention in object storage like S3 for compliance.

How do you make sure retained telemetry data has not been tampered with?

Configure your S3 bucket with Object Lock in compliance mode to prevent deletion or modification during the retention period. Enable versioning to maintain a full history of object changes, so overwrites preserve prior versions. Turn on access logging for the bucket itself to create an audit trail of who accessed the compliance archive. These controls satisfy the integrity and immutability expectations that HIPAA auditors look for during investigations.

What is a two-tier retention architecture for HIPAA telemetry?

A two-tier architecture separates hot operational data from cold compliance data. The hot tier sends reduced, deduplicated logs to your observability platform for real-time alerting and dashboards, retained for 15 to 30 days. The cold tier writes every raw event, unmodified and unsampled, to S3-compatible storage in compressed Parquet format for six or more years. This satisfies both operational and compliance needs at a fraction of the cost of retaining everything in a SaaS platform.

How do you query retained telemetry data years after it was stored?

Use query engines that read Apache Iceberg tables natively. Amazon Athena runs serverless SQL queries on a pay-per-query basis. DuckDB works locally for ad-hoc investigation. Trino handles distributed queries across larger datasets. Iceberg's partition pruning ensures the query engine reads only the relevant files, so response times remain fast regardless of how old the data is or how much total data exists in the archive.
