From 900,000 to 3: Drastically Reducing S3 Costs with Incremental Caching

2026-05-28 660 words 4 minutes

/posts/2026-05-28-analytics-cost-optimization/featured-image.jpg

Contents

$4.64 in two days. For a blog with modest traffic, that’s an unexpected surprise in the AWS Cost Explorer. The culprit was easy to find — but the fix took three iterations worth writing down.

The Problem

My analytics system is built on Kinesis Firehose → S3 → Athena. A scheduled Lambda runs every hour, queries Athena, and writes the result as JSON to S3 where the dashboard reads it.

The original Lambda looked conceptually like this:

# 4 queries, each scanning 30 days
SELECT page, COUNT(*) FROM page_views
WHERE timestamp >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY page

Four queries. Every hour. Sounds harmless.

Why Athena Generates So Many S3 Requests

Athena is not a database server with an index — it’s a query engine that works directly on S3 files. For each query, Athena internally does:

LIST — enumerate all files in the relevant partitions
GET — read and decompress each file individually
PUT — write results to athena-results/

With Hive partitions by year/month/day and 30 days of history, that means ~30 partition prefixes per query, each with several GZIP files. Four queries, 24 times a day:

4 queries × ~8,000 S3 requests/query × 24 runs/day ≈ 768,000 Tier-1 requests/day

Tier-1 (PUT/LIST) costs $0.005 per 1,000 requests in eu-central-1 — that’s nearly $4/day.

Step 1: Let Athena Scan Only the Current Hour

The core insight: the dashboard needs 30 days of data, but Athena doesn’t have to scan 30 days on every run. If hourly snapshots are accumulated, it’s enough to ask Athena about the current hour only.

# One day as partition filter, one hour as timestamp filter
where = (
    f"year = '{year}' AND month = '{month}' AND day = '{day}'"
    f" AND from_iso8601_timestamp(\"timestamp\") >= TIMESTAMP '{ts_start}'"
    f" AND from_iso8601_timestamp(\"timestamp\") <  TIMESTAMP '{ts_end}'"
)

The partition filter limits Athena to one day (instead of 30), the timestamp filter to one hour. The result is saved as cache/hourly/YYYY-MM-DD-HH.json. The Lambda then merges all existing hourly files in Python into a dashboard.

Savings: ~97% of Athena scan costs.

Step 2: Optimizing the Merge Step

Look closely and you’ll spot the next problem: reading up to 720 hourly files every hour (30 days × 24h) means 720 S3 GET requests per run.

The fix: instead of re-reading all files, maintain an accumulated merged_state.json — a dict with all hours as keys:

{
  "2026-05-27-14": {
    "top_articles": [["page", "views"], ...],
    "daily_traffic": [...],
    ...
  }
}

Each Lambda run then only needs to:

GET merged_state.json
Query Athena for the last hour
Insert new hour, remove hours > 30 days old
PUT merged_state.json
PUT dashboard.json

3 S3 requests per run instead of 720.

Step 3: Querying the Right Hour

One more subtle bug: the original code queried the current hour:

hour_start = now.replace(minute=0, second=0, microsecond=0)  # e.g. 14:00
hour_end   = hour_start + timedelta(hours=1)                  # 15:00

If the Lambda fires at 14:05, the snapshot for hour 14 only contains 5 minutes of data. The remaining 55 minutes are never captured.

Fix: always query the previous completed hour:

hour_end   = now.replace(minute=0, second=0, microsecond=0)  # 14:00
hour_start = hour_end - timedelta(hours=1)                    # 13:00

Run at 14:05 → complete hour 13:00–14:00. No data gaps, one hour of delay in the dashboard.

The Result

	Before	After
Athena scans	30 days / run	1 hour / run
S3 requests / run	~8,000	3
S3 requests / day	~768,000	~72
Estimated cost / day	up to ~$3	<$0.01

There’s also a one-time backfill Lambda that prepares the last 30 days as historical hourly snapshots — so the dashboard isn’t empty on first run.

Takeaway

The pattern generalizes: whenever a scheduled query recomputes historical data from scratch, it’s worth asking — does it really need to start over every time? Often a small accumulated state object is enough to limit the scan to what’s new. Like an incremental backup: only the delta, not everything.

The full code is available on GitHub.