How I Replaced Google Analytics with 3 AWS Services

2026-05-18

Google Analytics is the obvious choice for blog analytics: free, ready to go, works out of the box. But it also means cookies, data sharing with Google, and a consent banner you have to make GDPR-compliant somehow. For a blog about AWS, it seemed natural to solve this differently.

The result: a custom analytics system built from three AWS services that sets no cookies, stores no IP addresses, and respects DNT — with a live dashboard at /stats.

The Architecture

Browser
  │  POST /track  (keepalive fetch)
  ▼
CloudFront Function  ──  strip cookies
  │
  ▼
API Gateway → Lambda  ──  bot filter, session hash, Firehose
  │
  ▼
Kinesis Firehose  ──  GZIP, Hive prefix
  │
  ▼
S3  ──  events/year=.../month=.../day=.../
  │
  ▼
Athena + Glue  ──  SQL over raw data
  │
  ▼
Lambda (hourly)  ──  4 parallel queries → S3 cache
  │
  ▼
GET /dashboard  →  ECharts frontend

Each layer has exactly one responsibility. No monolithic service, no database to maintain.

Privacy by Design

This was the first and most important decision. The rules:

No IP address is stored
No cookies, no persistent IDs
DNT is respected — users with Do Not Track enabled are not tracked
Session ID is a daily-rotating SHA256 hash: SHA256(date + User-Agent + screen width) — resets daily, not reversible, but still enables a unique visitor approximation

The tracking script embedded in the blog is intentionally minimal:

(function () {
  // Respect Do Not Track: send nothing at all.
  if (navigator.doNotTrack === '1') return;

  fetch('/track', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      page: location.pathname,
      lang: document.documentElement.lang,
      // Only the referrer's hostname is sent, never the full URL.
      referrer: document.referrer ? new URL(document.referrer).hostname : '',
      screen_width: screen.width
    }),
    keepalive: true
  }).catch(() => {}); // analytics must never break the page
})();

keepalive: true lets the request outlive the page, so it can still complete if the user navigates away immediately after a page load. The .catch(() => {}) silently swallows errors if analytics is temporarily unavailable.

CloudFront Function: Stripping Cookies

CloudFront forwards /track requests to API Gateway. To ensure no browser cookies reach the backend, a CloudFront Function sits in front:

function handler(event) {
  var request = event.request;
  request.cookies = {};
  return request;
}

Three lines. The function runs at the edge before the request ever reaches the origin.

Lambda: Bot Filter and Session Hash

The Lambda function does three things: filter bots, compute the session ID, write the event to Firehose.

Bot filtering by User-Agent:

import re

BOT_PATTERNS = re.compile(
    r'bot|crawl|spider|slurp|facebookexternalhit|'
    r'python-requests|curl|wget|go-http-client',
    re.IGNORECASE
)

# Inside the handler: ua is the request's User-Agent header.
if BOT_PATTERNS.search(ua):
    return {"statusCode": 200, "body": "ok"}

Bots get a silent 200 — no error, no retry storm.
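To make the filter's behavior easy to check, here is a minimal, self-contained sketch that wraps the same pattern in a function. The name `is_bot` and the decision to treat an empty User-Agent as a bot are assumptions of this sketch, not something the Lambda above is known to do.

```python
import re

# Same pattern as in the excerpt above.
BOT_PATTERNS = re.compile(
    r'bot|crawl|spider|slurp|facebookexternalhit|'
    r'python-requests|curl|wget|go-http-client',
    re.IGNORECASE,
)


def is_bot(user_agent: str) -> bool:
    """Return True for an empty User-Agent or one matching a bot pattern.

    `is_bot` is a hypothetical helper name; counting an empty
    User-Agent as a bot is an assumption of this sketch.
    """
    if not user_agent:
        return True
    return BOT_PATTERNS.search(user_agent) is not None


if __name__ == "__main__":
    print(is_bot("curl/8.4.0"))
    print(is_bot("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```

Because this is a plain substring match, any User-Agent that merely contains "bot" is dropped as well, so the filter trades a few false positives for simplicity.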

Session ID without persistent data:

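import hashlib
from datetime import datetime, timezone

date_str = datetime.now(timezone.utc).strftime("%Y-%m-%d")
raw = f"{date_str}|{user_agent}|{screen_width}"
# Keep only the first 16 hex characters (64 bits) of the digest.
session_id = hashlib.sha256(raw.encode()).hexdigest()[:16]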

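The hash rotates automatically every day. Two visits from the same user on the same day count as one session; the next day it’s a new one. Because the only inputs are date, User-Agent, and screen width, different visitors with identical browsers and screens also share a hash, so the visitor count is an approximation.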
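The daily rotation is easier to see when the date is passed in rather than read from the clock. The following is a small sketch under that assumption; the function name `session_id_for` and its signature are hypothetical, and the hashing itself mirrors the excerpt above.

```python
import hashlib
from datetime import date


def session_id_for(day: date, user_agent: str, screen_width: int) -> str:
    """Daily-rotating session hash: first 16 hex chars of SHA-256.

    Hypothetical helper; takes the date as a parameter so the
    rotation can be demonstrated without waiting for midnight UTC.
    """
    raw = f"{day.isoformat()}|{user_agent}|{screen_width}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


if __name__ == "__main__":
    ua = "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"
    today = session_id_for(date(2026, 5, 18), ua, 1920)
    tomorrow = session_id_for(date(2026, 5, 19), ua, 1920)
    print(today != tomorrow)  # the same visitor gets a new ID the next day
```

Nothing is stored between requests: the ID is recomputed from the request itself every time, which is what makes the scheme work without cookies.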

The event is then written as compact JSON to Kinesis Firehose:

record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "event": "pageview",
    "page": body.get("page", ""),
    "lang": body.get("lang", ""),
    "referrer_domain": body.get("referrer", ""),
    "browser": browser,
    "screen_width": screen_width,
    "session_id": session_id,
}
firehose.put_record(
    DeliveryStreamName=STREAM_NAME,
    Record={"Data": (json.dumps(record) + "\n").encode()}
)
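The write path can be exercised without AWS by injecting the client. This is a sketch only: `FakeFirehose`, `parse_body`, and `send_pageview` are hypothetical names, and the body parsing assumes an API Gateway proxy event with a plain (not base64-encoded) JSON string in `event["body"]`.

```python
import json
from datetime import datetime, timezone


class FakeFirehose:
    """Stand-in for boto3.client("firehose"); records calls locally."""

    def __init__(self):
        self.calls = []

    def put_record(self, DeliveryStreamName, Record):
        self.calls.append((DeliveryStreamName, Record))
        return {"RecordId": "fake"}


def parse_body(event: dict) -> dict:
    """Return the JSON body as a dict, or {} if it is missing or invalid."""
    try:
        body = json.loads(event.get("body") or "{}")
    except (TypeError, ValueError):
        return {}
    return body if isinstance(body, dict) else {}


def send_pageview(client, stream_name, body, browser, screen_width, session_id):
    """Build the event and write it as one newline-terminated JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": "pageview",
        "page": body.get("page", ""),
        "lang": body.get("lang", ""),
        "referrer_domain": body.get("referrer", ""),
        "browser": browser,
        "screen_width": screen_width,
        "session_id": session_id,
    }
    client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": (json.dumps(record) + "\n").encode()},
    )
    return record


if __name__ == "__main__":
    fake = FakeFirehose()
    body = parse_body({"body": '{"page": "/posts/x/", "lang": "en"}'})
    send_pageview(fake, "demo-stream", body, "Firefox", 1920, "abc123")
    print(fake.calls[0][1]["Data"])
```

The trailing newline matters: Firehose concatenates records as they arrive, so without it the JSON objects in S3 would run together on one line.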

Kinesis Firehose: Data into S3

Firehose buffers events (60 seconds or 5 MB, whichever comes first), compresses with GZIP, and writes to S3 with a Hive-compatible prefix:

events/year=2026/month=05/day=18/analytics-1-2026-05-18-...gz

This format lets Athena read only the relevant partitions rather than scanning all data.
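To make the object format concrete: each delivered file is, in effect, GZIP-compressed newline-delimited JSON. The sketch below reproduces that shape locally with the standard library. `pack_events` and `unpack_events` are hypothetical helpers for illustration, not part of Firehose or the stack above.

```python
import gzip
import json


def pack_events(events: list) -> bytes:
    """Concatenate events as JSON lines and GZIP the result."""
    lines = "".join(json.dumps(event) + "\n" for event in events)
    return gzip.compress(lines.encode())


def unpack_events(blob: bytes) -> list:
    """Inverse of pack_events: decompress and parse one JSON object per line."""
    text = gzip.decompress(blob).decode()
    return [json.loads(line) for line in text.splitlines() if line]


if __name__ == "__main__":
    events = [
        {"event": "pageview", "page": "/"},
        {"event": "pageview", "page": "/stats"},
    ]
    blob = pack_events(events)
    print(unpack_events(blob) == events)
```

The same two-step read (decompress, then split on newlines) is handy for spot-checking a downloaded object by hand.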

Glue Table with Partition Projection

Instead of a Glue Crawler that needs to run regularly, I use Partition Projection. The partition structure is declared directly in the Glue schema:

Parameters:
  projection.enabled: "true"
  projection.year.type: "integer"
  projection.year.range: "2026,2030"
  projection.month.type: "integer"
  projection.month.range: "1,12"
  projection.month.digits: "2"
  projection.day.type: "integer"
  projection.day.range: "1,31"
  projection.day.digits: "2"
  storage.location.template: "s3://BUCKET/events/year=${year}/month=${month}/day=${day}/"

No crawler, no daily costs for metadata updates. Athena computes the partition paths itself.
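As a rough illustration of what the projection settings describe, the sketch below expands the `storage.location.template` for a given date, zero-padding month and day to two digits as `digits: "2"` requests. It is a conceptual model only: Athena does this internally, and the function names here are hypothetical.

```python
from datetime import date, timedelta

# Template copied from the Glue parameters above; BUCKET is a placeholder.
TEMPLATE = "s3://BUCKET/events/year=${year}/month=${month}/day=${day}/"


def partition_location(day: date, template: str = TEMPLATE) -> str:
    """Fill the template for one day, padding month and day to two digits."""
    return (
        template
        .replace("${year}", str(day.year))
        .replace("${month}", f"{day.month:02d}")
        .replace("${day}", f"{day.day:02d}")
    )


def locations_for_last_days(end: date, days: int) -> list:
    """Locations for `days` consecutive days ending at `end`, newest first."""
    return [partition_location(end - timedelta(days=i)) for i in range(days)]


if __name__ == "__main__":
    for location in locations_for_last_days(date(2026, 5, 2), 3):
        print(location)
```

A query that filters on year, month, and day only needs to touch the locations those values resolve to, which is where the scan savings come from.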

The Dashboard

An EventBridge schedule triggers a Lambda every hour that runs four Athena queries in parallel:

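from concurrent.futures import ThreadPoolExecutor

# Submit all queries at once; map each future back to its query name.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(_run_query, name, sql): name
        for name, sql in _QUERIES.items()
    }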

The queries:

| Query | What it returns |
| --- | --- |
| `daily_traffic` | Views + unique visitors per day, last 30 days |
| `top_articles` | Top 10 posts by views, last 30 days |
| `referrers` | Top 10 traffic sources, last 30 days |
| `devices` | Mobile / Tablet / Desktop by screen width |
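The `devices` query buckets visitors by screen width. The post does not state the thresholds, so the breakpoints in this sketch (768 and 1024 pixels) are purely assumed, as is the function name `classify_device`.

```python
from collections import Counter

# Assumed breakpoints; the real query may use different ones.
MOBILE_MAX = 768
TABLET_MAX = 1024


def classify_device(screen_width: int) -> str:
    """Map a screen width in pixels to Mobile, Tablet, or Desktop."""
    if screen_width < MOBILE_MAX:
        return "Mobile"
    if screen_width < TABLET_MAX:
        return "Tablet"
    return "Desktop"


if __name__ == "__main__":
    widths = [390, 412, 820, 1440, 1920]
    print(Counter(classify_device(w) for w in widths))
```

In Athena the same bucketing would be a CASE expression over `screen_width`; the Python version just makes the boundaries easy to test.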

The combined query results are stored as cache/dashboard.json in S3. GET /dashboard reads this file, so Athena is not queried on every dashboard request.
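The fan-out and the cache file can be sketched together. This is a minimal version with the query runner injected, so it runs without Athena. `run_all`, `build_cache`, the `generated_at` field, and the per-query error handling are assumptions for illustration, not the actual Lambda code.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(queries: dict, run_query, max_workers: int = 4) -> dict:
    """Run every query in parallel and return {query name: result}.

    A failing query is recorded as {"error": ...} instead of aborting
    the whole refresh; that policy is an assumption of this sketch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(run_query, name, sql): name
            for name, sql in queries.items()
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = {"error": str(exc)}
    return results


def build_cache(results: dict, generated_at: str) -> str:
    """Serialize the results into the JSON document the dashboard fetches."""
    return json.dumps({"generated_at": generated_at, **results}, sort_keys=True)


if __name__ == "__main__":
    def fake_run_query(name, sql):
        # Stand-in for the real Athena call.
        return [{"query": name, "rows": 0}]

    results = run_all({"daily_traffic": "SELECT 1"}, fake_run_query)
    print(build_cache(results, "2026-05-18T00:00:00+00:00"))
```

In the Lambda, the returned string would then be written to S3 (for example with boto3's `put_object`) under the cache/dashboard.json key.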

The frontend at /stats renders the data with ECharts, is dark mode aware, and loads everything client-side:

fetch('/dashboard')
  .then(r => r.json())
  .then(d => render(d));

Custom Domain: analytics.aws-sensei.cloud

One important architectural decision: the analytics API gets its own subdomain instead of being wired into the infrastructure stack through an SSM parameter dependency.

This solves the chicken-and-egg problem: CloudFront can have analytics.aws-sensei.cloud configured as an origin even before the analytics API exists. Requests to /track will silently fail (.catch(() => {})), but the rest of the site works normally. Once the analytics stack is deployed, everything works — without needing to redeploy the infra stack.

Costs

For the traffic of a personal blog: under one dollar per month.

Firehose: $0.029 per GB — with a few KB of events per day, essentially zero
S3: $0.023 per GB storage + minimal PUT/GET costs
Athena: $5 per TB scanned — with GZIP + Partition Projection, a few cents per month
Lambda: free under the free tier
EventBridge: free for standard schedules (first million events/month)
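The list prices above can be turned into a quick estimate. The monthly volumes in this sketch are made-up placeholders, not measurements from the blog, and it deliberately covers only the three per-GB and per-TB rates quoted above. S3 request charges are left out.

```python
# Rates as quoted in the list above (USD).
FIREHOSE_PER_GB = 0.029
S3_PER_GB_MONTH = 0.023
ATHENA_PER_TB = 5.0


def monthly_cost(ingested_gb: float, stored_gb: float, scanned_tb: float) -> dict:
    """Volume-based cost per service for one month, in USD."""
    return {
        "firehose": ingested_gb * FIREHOSE_PER_GB,
        "s3_storage": stored_gb * S3_PER_GB_MONTH,
        "athena": scanned_tb * ATHENA_PER_TB,
    }


if __name__ == "__main__":
    # Hypothetical volumes: 10 MB ingested, 100 MB stored, 5 GB scanned.
    costs = monthly_cost(ingested_gb=0.01, stored_gb=0.1, scanned_tb=0.005)
    for service, usd in costs.items():
        print(f"{service}: ${usd:.5f}")
    print(f"total: ${sum(costs.values()):.5f}")
```

With volumes in that range, Athena scanning dominates and the total stays well under a dollar. Request counts are a separate line item and can behave very differently from data volume.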

Google Analytics is free — but the price is data control and privacy. This stack costs cents, and all the data is mine.

Result

The dashboard is live at /stats. Daily traffic, top articles, referrers, and device breakdown — all from my own data, no third parties, no cookies.

Update: The first day in production revealed that hourly Athena queries scanning 30 days of data generate far more S3 requests than expected. How I reduced that from ~768,000 to ~72 requests per day using incremental caching is covered in this follow-up post.