Storage in observability platforms is priced for query speed, not for keeping data. A team generating 1 TB of logs per day sits on 30 TB at the end of the month. After a year, that's 365 TB, and retention requirements in regulated industries often run to six years. The ingestion bill is visible. The storage bill is the one that compounds. Storing logs in S3 using Apache Parquet and Apache Iceberg cuts that second cost without sacrificing queryability.
Why S3 Changes the Economics of Log Retention
Object storage is roughly an order of magnitude cheaper than what observability vendors charge for the same bytes. Commercial platforms run $0.10 to $3.00+ per GB per month for indexed log storage. S3 Standard is about $0.023. S3 Infrequent Access is around $0.0125. Glacier Instant Retrieval is under $0.004.
Put that against real volume: 10 TB per month at $1.00/GB is $10,000. At $0.023/GB, it's $230. That gap exists every month, and if you're in a regulated industry that requires six years of audit log retention, it stops being a line-item conversation fairly quickly. S3-compatible stores like GCS, MinIO, and Cloudflare R2 follow the same pricing logic, and unlike a SaaS platform, it's storage you own, with no extraction fees to retrieve your data.
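The arithmetic above is simple enough to sketch directly. The rates below are the illustrative list prices cited in this post and will drift over time; the point is the ratio, not the exact figures.

```python
# Rough monthly storage cost at the rates cited above (USD per GB-month).
# Rates are illustrative list prices and change over time.
RATES = {
    "saas_indexed": 1.00,      # mid-range commercial platform
    "s3_standard": 0.023,
    "s3_infrequent": 0.0125,
    "glacier_instant": 0.004,
}

def monthly_cost(tb_stored: float, rate_per_gb: float) -> float:
    """Cost of holding `tb_stored` terabytes for one month (1 TB = 1000 GB)."""
    return tb_stored * 1000 * rate_per_gb

saas = monthly_cost(10, RATES["saas_indexed"])   # 10 TB on a $1/GB platform
s3 = monthly_cost(10, RATES["s3_standard"])      # same 10 TB on S3 Standard
print(f"SaaS: ${saas:,.0f}/mo  S3: ${s3:,.0f}/mo  ratio: {saas / s3:.0f}x")
```

At six years of retention the multiplier applies to an ever-growing base, which is why the gap compounds rather than staying a fixed line item.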
Why Parquet Is the Right Format for Log Storage
Parquet is a columnar file format that came out of the Hadoop ecosystem and became the default for data lakes because of how it handles reads compared to row-based formats. For log data specifically, three things matter.
Compression
Log data has low-cardinality columns: level, service, environment, event_type. The same values repeat thousands of times. Parquet's columnar layout, combined with dictionary and run-length encoding and general-purpose codecs such as Zstandard, typically produces files 50 to 80 percent smaller than equivalent raw JSON.
Selective column reads
When you query for errors from one service, you need three columns, not twenty. Parquet lets query engines read only the columns a query touches and skip the rest of each file. Over a few weeks of logs, that difference barely registers. Over a year, it separates a query that takes seconds from one that scans terabytes of irrelevant data.
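The savings are easy to put numbers on. A rough sketch, with made-up column sizes: if each column is stored and addressable separately, a query touching three of twenty columns only pays for those three.

```python
# Hypothetical per-column storage footprint for one large Parquet file.
# Seventeen wide payload columns plus three the query actually needs.
column_sizes = {f"payload_{i}": 50_000_000 for i in range(17)}  # ~50 MB each
column_sizes.update({
    "level": 1_000_000,
    "service": 1_000_000,
    "message": 200_000_000,
})

needed = ["level", "service", "message"]
bytes_read = sum(column_sizes[c] for c in needed)
total = sum(column_sizes.values())
print(f"read {bytes_read / total:.0%} of the file")
```

A row-oriented format would force the full `total` through the scanner for the same query, because rows can't be split apart on disk.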
Schema enforcement
Every Parquet file carries its own schema declaration. Log schemas change constantly: someone adds a field for a new feature, another team renames a column mid-sprint, someone else drops one without flagging it. In raw JSON, that drift accumulates invisibly until a query returns garbage and you spend an afternoon figuring out when it broke. Parquet surfaces the mismatch at write time.
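Parquet and Iceberg enforce this natively; the sketch below only illustrates the behavior of failing loudly at write time, with a hypothetical schema and validator.

```python
# Sketch of write-time schema enforcement: a declared schema, checked on
# every record before it is written, so drift fails loudly instead of
# silently accumulating in the archive.
SCHEMA = {"ts": int, "level": str, "service": str, "duration_ms": float}

def validate(record: dict) -> None:
    for field, expected in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise TypeError(f"{field}: expected {expected.__name__}, "
                            f"got {type(record[field]).__name__}")

validate({"ts": 1700000000, "level": "INFO", "service": "auth",
          "duration_ms": 12.5})                      # passes

try:
    # Someone changed duration_ms to a string mid-sprint.
    validate({"ts": 1700000001, "level": "INFO", "service": "auth",
              "duration_ms": "12.5"})
except TypeError as e:
    print("rejected at write time:", e)
```

The alternative failure mode, with raw JSON, is that the string `"12.5"` lands in the archive and surfaces months later as a type error in someone's query.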
Why Apache Iceberg Is the Table Management Layer You Need
Without a metadata layer, querying Parquet files in S3 means scanning all of them. There's no index, no range information, nothing to tell the query engine which files are relevant before it opens them. Apache Iceberg solves this by maintaining a table format on top of your Parquet files that tracks what data each file actually contains.
Partition pruning is the part that changes query performance the most. Iceberg records the min and max values stored in each data file. When a query arrives for a specific time range or service, the engine first checks Iceberg's metadata and skips any files outside those bounds. A full year of logs doesn't mean a full year of scanning.

The rest of what Iceberg provides tends to matter more over time: schema changes don't require rewriting old data, so new services can add fields without breaking queries against existing files. The snapshot history means you can query the table as it existed at any prior point, which comes up in compliance audits when you need to demonstrate data wasn't altered after ingestion. And because multiple writers can append concurrently without corrupting the table, your ingestion pipeline doesn't need to pause while queries are running against the same dataset.
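The pruning logic itself is simple, which is part of why it's so effective. A sketch, with made-up file names and stats standing in for Iceberg's manifest metadata:

```python
# Sketch of Iceberg-style file pruning: table metadata records min/max
# timestamps per data file, so a query's time range eliminates files
# without opening them. Paths and stats here are illustrative.
files = [
    {"path": "logs/00001.parquet", "min_ts": 1000, "max_ts": 1999},
    {"path": "logs/00002.parquet", "min_ts": 2000, "max_ts": 2999},
    {"path": "logs/00003.parquet", "min_ts": 3000, "max_ts": 3999},
]

def prune(files: list[dict], start: int, end: int) -> list[dict]:
    """Keep only files whose [min_ts, max_ts] range overlaps [start, end]."""
    return [f for f in files if f["max_ts"] >= start and f["min_ts"] <= end]

to_scan = prune(files, 2500, 2600)
print([f["path"] for f in to_scan])   # only the one overlapping file survives
```

Scale the list to a year of hourly files and the same overlap check turns a full-archive scan into a handful of file opens.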
The Architecture: Dual-Path Log Routing
The pipeline routes logs to two destinations simultaneously. Collection agents (Fluent Bit, OTel Collector, Vector) pull from all sources. A processing layer reduces volume for the hot path and writes the complete, unmodified dataset to S3 in Parquet format under Iceberg table management.
Log Sources → Collection Agent → Processing Pipeline
↓ ↓
Observability Platform S3 Bucket
(reduced, hot data) (full, Parquet/Iceberg)
The hot path carries signals: errors, anomalies, unique events, and sampled representatives of repetitive patterns. Your dashboards and alerts run against this reduced volume. The cold path carries everything: every log line, every trace, every event, compressed and indexed for efficient querying when you actually need it.
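The routing decision can be sketched as a single function. The 1% sample rate and the destination names below are illustrative knobs, not a recommendation:

```python
import random

# Sketch of the dual-path decision: everything goes to the cold path (S3),
# while only signal — errors plus a sample of repetitive lines — reaches
# the hot path (the observability platform).
SAMPLE_RATE = 0.01

def route(record: dict) -> list[str]:
    destinations = ["s3_cold"]               # cold path always gets the raw event
    if record.get("level") in ("ERROR", "FATAL"):
        destinations.append("platform_hot")  # errors always reach dashboards
    elif random.random() < SAMPLE_RATE:
        destinations.append("platform_hot")  # sampled representative of the noise
    return destinations

print(route({"level": "ERROR", "msg": "db timeout"}))
print(route({"level": "INFO", "msg": "health check ok"}))
```

In practice this classification is more sophisticated (pattern clustering rather than a flat sample rate), but the invariant is the same: the cold path is unconditional, so nothing is ever lost to reduction.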
Implementation: Build vs. Managed
Building It Yourself
Spark, Flink, or a custom writer can convert incoming logs to Parquet and commit them to Iceberg tables in S3. The operational work is real, though. Small Parquet files from frequent writes degrade query performance and require a compaction process that periodically merges them. Partitioning strategy matters: timestamp-based partitioning (hourly or daily) with optional service name works well, but too many small files kill pruning effectiveness, and too-large files that span broad ranges don't prune well either. Schema management at write time requires a conversion layer that handles new fields as services evolve, and retention requires configuring Iceberg's snapshot expiration and data file cleanup.
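The partitioning decision above can be made concrete with a small sketch. Hourly granularity, the path shape, and the field names are all illustrative; in a real deployment Iceberg manages the physical layout through its partition spec rather than hand-built paths.

```python
from datetime import datetime, timezone

# Sketch of a timestamp-plus-service partition layout for the table's
# underlying data files. Hourly buckets keep time-range pruning tight
# without producing one tiny file per minute.
def partition_path(ts_epoch: int, service: str) -> str:
    dt = datetime.fromtimestamp(ts_epoch, tz=timezone.utc)
    return f"dt={dt:%Y-%m-%d}/hour={dt:%H}/service={service}"

print(partition_path(1700000000, "checkout"))
```

The tension the text describes lives in the granularity choice: partition by minute and you drown in small files; partition by week and every query scans seven days of data it doesn't want.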
This approach is reasonable for teams that already run Spark or Flink with data engineering capacity in place. For teams without that infrastructure, the ongoing operational overhead is not trivial.
Using Grepr
Grepr manages the entire pipeline as a managed service. Point your log agents at Grepr, and it reduces log volume for your observability platform using semantic ML while writing every raw event to an Amazon S3 data lake in Parquet format with Iceberg table management. You can search the raw data in your data lake with a query interface that supports Datadog- and New Relic-style Lucene-like syntaxes. Compaction, partitioning, schema evolution, and retention run automatically, and the data lives in your bucket in open formats, queryable by Grepr or any Iceberg-compatible engine.
Querying Your S3 Log Archive
Several engines read Iceberg tables on S3 natively. Athena is the lowest-friction option if you're already on AWS: serverless SQL, no cluster to manage, pay per query. DuckDB handles ad-hoc investigation directly from a laptop without any infrastructure overhead. Trino covers larger distributed queries. Spark SQL integrates natively with Iceberg for both batch and interactive workloads. Grepr's query layer uses massively parallel processing against Iceberg's file-level filtering to return results across the full archive, typically in under 10 seconds, regardless of time range.
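A query against such a table is ordinary SQL. The sketch below builds an Athena-style query string; the table name `logs.events` and its columns are hypothetical, and the point is that filtering on the partition and sort columns (`ts`, `service`) is what lets the engine prune files before scanning.

```python
# Sketch only: in production, use the engine's parameter binding rather
# than string interpolation, which is shown here purely for readability.
def error_query(service: str, start: str, end: str) -> str:
    return (
        "SELECT ts, level, message "
        "FROM logs.events "
        f"WHERE service = '{service}' "
        f"AND ts BETWEEN TIMESTAMP '{start}' AND TIMESTAMP '{end}' "
        "AND level = 'ERROR' "
        "ORDER BY ts"
    )

sql = error_query("checkout", "2024-01-15 00:00:00", "2024-01-15 23:59:59")
print(sql)
```

The same statement runs unchanged on Athena, Trino, or Spark SQL once the Iceberg catalog is configured, which is the practical meaning of "open format": the query outlives whichever engine you started with.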
When This Architecture Makes Sense
Long-term retention is where the cost difference becomes most concrete. In regulated industries requiring six years of audit logs, the gap between SaaS storage rates and S3 object storage prices tends to escalate from a line-item concern to a budget problem somewhere around year two. High log volume accelerates that timeline. And if vendor lock-in on your own telemetry data is a concern, open formats on infrastructure you control is the direct answer.
The architecture itself isn't complicated. The formats are open, and the cost math is straightforward. What takes time is the operational layer: compaction, partitioning, schema handling, and retention policies. Teams that build it themselves spend meaningful engineering cycles on infrastructure that isn't their product. Teams that don't have that capacity tend to stay on platforms that charge them for the privilege of keeping their own data. Either way, the logs exist, and the retention requirement doesn't go away. The question is just what you pay to hold them.
FAQ: Storing Logs in S3 With Parquet and Apache Iceberg
What is the difference between storing logs in Parquet versus raw JSON in S3?
The cost shows up at query time. Raw JSON in S3 is cheap to write, but querying it means scanning every file to find what you need, reading every field even when you only care about two or three. Parquet stores data by column rather than by row, so a query that only needs level and service_name skips the rest of the file entirely. On top of that, log data stored as Parquet typically comes in 50 to 80 percent smaller than equivalent JSON because of how well columnar compression handles low-cardinality fields.
How does Apache Iceberg make S3 log queries faster?
Iceberg maintains metadata about the min and max values stored in each Parquet file. That means a query for logs from Tuesday afternoon can skip every file that only contains data from last month, without opening any of them. Without Iceberg, there's no way to know what's in a file without reading it, so query engines read all of them. Once you're past a few weeks of retention, that becomes the bottleneck.
What does log compaction mean and why does it matter for Parquet on S3?
Continuous log pipelines write small files, and a few days in you might have hundreds of thousands of them. Query engines handle a smaller number of larger files much better than a huge number of tiny ones, because there's per-file overhead before a single byte is read. Compaction merges those small files into larger ones. Iceberg supports it natively, but something has to actually run the process, whether that's your team or a managed layer handling it automatically.
Is Apache Iceberg the only open table format that works with Parquet on S3?
Delta Lake and Apache Hudi both solve the same core problem. Delta Lake is tightly associated with the Databricks ecosystem. Hudi came out of Uber and tends to show up in streaming ingestion architectures. Iceberg has the broadest query engine support across vendors and is the format most referenced in cloud-native observability work. For a log archive where the goal is queryability with whatever tool makes sense years from now, Iceberg is the safer long-term choice.
What query languages can I use to search logs stored in Iceberg tables?
Standard SQL covers Athena, Trino, DuckDB, and Spark SQL. Grepr also supports SPL, Lucene, Datadog query syntax, and NRQL against the same archive. The Iceberg format itself is query-language agnostic; the language depends on whichever engine you point at it.