Security Data Lakes: Cut SIEM Spend, Improve Detection
Security operations want two things that often feel at odds: comprehensive visibility and manageable cost. Traditional SIEM platforms are great for real-time alerting and correlation, but their ingest-based pricing and proprietary data stores create painful trade-offs. Teams either dial back telemetry to stay under budget or accept shallow retention windows that cripple investigations. Security data lakes break that compromise. By decoupling storage from compute on low-cost object storage and open table formats, you can keep far more data for far less money while expanding analytics capabilities that improve detection quality. This post explains what a security data lake is, why it reduces spend, and how to design one that strengthens detection engineering and incident response.
Why SIEM Costs Balloon
SIEM cost models are usually driven by data volume or events per second. As your footprint grows—more cloud accounts, SaaS apps, endpoints, and identities—telemetry explodes. Two compounding factors make budgets fragile:
- Ingest-based licensing: Every line of log data increases monthly cost. High-volume sources like CloudTrail, EDR telemetry, and DNS logs become prohibitively expensive to retain in the SIEM.
- Proprietary storage: You are tied to the SIEM’s internal store, so you pay the vendor for both the software and the storage/compute used for search.
These economics lead to operational compromises:
- Data minimization and sampling that weaken detection efficacy.
- Short retention that undermines threat hunting and compliance.
- Slow, expensive backfills when incidents require historical reconstruction.
- Engineering time spent writing brittle parsing rules to trim data instead of improving detections.
Security data lakes shift the heavy, high-volume data to inexpensive storage, allowing the SIEM to focus on what it does best: real-time correlation, alert triage, and analyst workflow. The result is lower total cost and better outcomes.
What Is a Security Data Lake?
A security data lake is a centralized, analytics-ready repository for security telemetry built on cloud object storage and open table formats. Instead of ingesting every log into a SIEM, you land data into the lake, standardize schemas, and run detection and hunting queries with scalable engines. You keep rich history on affordable storage while decoupling compute from storage so you only pay for processing when needed. The lake becomes your “source of truth” for investigations, enrichment, and model training, while the SIEM handles time-sensitive alerts and case management with a curated subset of priority signals.
Core Architectural Components
Object Storage as the Bedrock
Store raw and curated data in cloud object storage (e.g., Amazon S3, Google Cloud Storage, Azure Data Lake Storage). It is cheap, durable, and massively scalable. Organize data in partitioned directories by tenant, dataset, and time to support pruning and lifecycle policies.
Open Table Formats and Columnar Files
Use columnar formats such as Parquet for storage efficiency and fast column scans. Layer a table format like Apache Iceberg, Delta Lake, or Apache Hudi for ACID guarantees, schema evolution, time travel, and data compaction. This enables reliable updates, deletes (for privacy requests), and consistent reads even during batch writes.
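As a minimal sketch, the DDL below creates a partitioned Iceberg table for normalized authentication events using Spark SQL syntax; the catalog, table, and column names are illustrative, and Delta Lake or Hudi offer equivalent constructs.
-- Sketch: partitioned Iceberg table for normalized authentication events (names are illustrative)
CREATE TABLE lake.silver.authentication (
    event_time  TIMESTAMP,
    tenant_id   STRING,
    user_name   STRING,
    source_ip   STRING,
    device_id   STRING,
    status      STRING,
    raw_ref     STRING   -- pointer back to the bronze record for audit
)
USING iceberg
PARTITIONED BY (days(event_time), tenant_id);

-- Schema evolution: add a column later without rewriting existing data files
ALTER TABLE lake.silver.authentication ADD COLUMN mfa_method STRING;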
Ingestion and Stream Processing
Reliable ingestion moves logs from sources to the lake. Common patterns include:
- Streaming pipelines using Kafka/Kinesis for high-throughput telemetry from EDR, network, and authentication sources.
- Agent-based collection (Fluent Bit, Vector, Beats) to capture syslog, Windows Event logs, container logs, and application traces.
- SaaS and cloud connectors for APIs like Okta, Google Workspace, Microsoft 365, and CloudTrail.
You can normalize and enrich data in-flight with stream processors or write raw to “bronze” storage and transform later into “silver/gold” layers.
Query Engines and Notebooks
Interactive SQL engines like Trino/Presto, Athena, BigQuery, Snowflake, or Spark SQL enable ad hoc hunting and scheduled detection jobs. Notebooks (Jupyter) and data science platforms add exploratory analysis and machine learning workflows. The decoupled model lets you scale compute for heavy hunts or model training without permanently paying for a large SIEM cluster.
Metadata, Catalog, and Governance
A centralized catalog (e.g., Glue, Hive Metastore, or vendor catalog) tracks table schemas, partitions, and lineage. Governance layers provide row/column-level security, dynamic masking, and audit trails. These controls are essential to protect sensitive attributes and support least privilege access for analysts and automated jobs.
Data Modeling and Normalization
One of the biggest challenges in security analytics is the heterogeneity of log formats. A security data lake thrives on consistent schemas. Adopting an open schema like the Open Cybersecurity Schema Framework (OCSF) reduces parsing overhead and simplifies cross-source correlation. Benefits include:
- Standardized field names and event categories for identities, endpoints, networks, and cloud.
- Easier detection logic: a single query can work across multiple EDR or identity providers.
- Smoother onboarding of new data sources with predictable transformations.
Model the lake in layers:
- Bronze: raw, landed exactly as received, with minimal transformation and immutable storage for audit.
- Silver: cleaned and normalized to OCSF or a similar canonical schema. Add basic enrichments (e.g., geolocation, threat intel tags, asset ownership).
- Gold: curated datasets optimized for specific use cases, such as identity access analytics, network flow summaries, and endpoint process trees.
Schema evolution matters. Choose table formats that can handle new columns, nullability changes, and soft deprecations without breaking queries.
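A bronze-to-silver transform can be a scheduled SQL job. The sketch below normalizes raw CloudTrail-style records into a canonical table; the table names, the ingest_date column, and the exact field mapping are assumptions to adapt to whichever OCSF-style schema you adopt.
-- Sketch: normalize raw cloud audit events (bronze) into a canonical silver table
-- Table and column names are hypothetical; map fields per your canonical schema
INSERT INTO lake.silver.cloud_api_activity
SELECT
    CAST(src.eventTime AS TIMESTAMP)   AS event_time,
    src.userIdentity.userName          AS user_name,   -- nested field access varies by engine
    src.sourceIPAddress                AS source_ip,
    src.eventSource                    AS service,
    src.eventName                      AS action,
    src.awsRegion                      AS cloud_region,
    'aws_cloudtrail'                   AS source_product
FROM lake.bronze.cloudtrail AS src
WHERE src.ingest_date = current_date;   -- hypothetical partition column for incremental runs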
Ingestion Patterns That Keep Costs Predictable
Cost control begins with intelligent routing. Not every event should reach the SIEM. A common pattern is:
- Stream or batch ingest to object storage first (bronze). Retain full fidelity cheaply.
- Transform to silver and selectively forward high-signal events to the SIEM for real-time alerting (e.g., authentication failures crossing thresholds, malware detections, policy violations).
- Keep verbose telemetry like DNS queries, CloudTrail data events, and EDR process lineage primarily in the lake for hunting and retrospective correlation.
Enrichment can happen in-flight or post-ingest. If your license model penalizes every additional byte, do enrichment in the lake and only forward compacted, enriched alerts to the SIEM.
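As a sketch of selective forwarding, a scheduled job could materialize only threshold-crossing authentication failures into a small queue table that a lightweight shipper reads and sends to the SIEM; the table names and the threshold below are illustrative.
-- Sketch: select only high-signal events for SIEM forwarding (names and threshold are illustrative)
INSERT INTO lake.gold.siem_forward_queue
SELECT
    user_name,
    source_ip,
    count(*)             AS failure_count,
    min(event_time)      AS first_seen,
    max(event_time)      AS last_seen,
    'auth_failure_burst' AS signal_type
FROM lake.silver.authentication
WHERE status = 'failure'
  AND event_time >= current_timestamp - INTERVAL '15' minute
GROUP BY user_name, source_ip
HAVING count(*) >= 10;   -- threshold tuned per environment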
Storage and Tiering Strategies
Design for access patterns:
- Hot tier: last 7–30 days with more frequent queries; compacted files and clustering for rapid hunts.
- Warm tier: 1–12 months for investigations and threat hunting; lifecycle policies that keep data in standard storage but with larger file sizes for efficient scans.
- Cold/Archive tier: 12–84 months for compliance and rare investigations; object storage archive tiers with retrieval SLAs that match your use cases.
Partition by event_time at hourly or daily granularity plus optional secondary partition keys (e.g., tenant_id, event_type). Use clustering or sorting on high-cardinality fields like user_id or ip_address when supported by your table format to speed up predicate pushdown. Schedule compaction to merge small files into larger Parquet files (128–512 MB) to reduce scanning overhead.
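For compaction, Iceberg exposes a Spark procedure that rewrites small files toward a target size; the catalog and table names below are placeholders, and Delta Lake's OPTIMIZE serves the same purpose.
-- Sketch: merge small files into ~512 MB Parquet files (Iceberg Spark procedure; names are placeholders)
CALL lake.system.rewrite_data_files(
    table   => 'silver.authentication',
    options => map('target-file-size-bytes', '536870912')
);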
Query and Detection Engines
Choose engines based on use case and skill set:
- Trino/Presto or managed services (Athena) for SQL-centric analysts and large-scale interactive hunts across many tables.
- Spark for heavy transformations, graph construction (process trees), and machine learning pipelines.
- Serverless warehouses (BigQuery, Snowflake) for simplified operations and elastic performance, especially if you already have enterprise agreements.
Scheduling options range from workflow orchestrators (Airflow, Dagster) to cloud-native serverless schedulers to native task schedulers in your data warehouse. Use materialized views or incremental merge jobs for commonly queried aggregations (e.g., per-user login baselines) to reduce query cost and latency.
Example detection query correlating identity and cloud activity to spot potential session hijacking:
-- Pseudo-SQL for an engine that supports ANSI SQL
-- user_name avoids the reserved word USER
WITH okta_logins AS (
    SELECT user_name, source_ip, device_id, event_time
    FROM identity.okta
    WHERE event_type = 'user_login_success'
      AND event_time >= current_date - INTERVAL '1' day
),
cloud_admin_actions AS (
    SELECT user_name, source_ip, action, event_time
    FROM cloud.cloudtrail
    WHERE action IN ('CreateUser', 'AttachUserPolicy', 'PutRolePolicy')
      AND event_time >= current_date - INTERVAL '1' day
)
SELECT c.user_name, c.action, c.event_time, c.source_ip, o.device_id
FROM cloud_admin_actions c
LEFT JOIN okta_logins o
    ON c.user_name = o.user_name
    AND c.event_time BETWEEN o.event_time AND o.event_time + INTERVAL '2' hour
    AND c.source_ip = o.source_ip
WHERE o.user_name IS NULL; -- Admin actions with no preceding matching login
Detection Engineering as Code
Security teams gain leverage by treating detection logic like software. A data lake makes this easier because detections are just SQL or code against open tables. Key practices:
- Version control: Store detections, parsers, and enrichments in Git. Track changes, approvals, and rollbacks.
- Unit tests: Build synthetic datasets in CI to validate that rules catch known malicious patterns and do not fire on common benign activity.
- Staging environments: Run new rules against historical data to estimate alert volume and false positive rates before promotion.
- Reusable libraries: Encapsulate common joins (e.g., identity + asset owner + GeoIP) and time-window logic.
- Observability: Emit metrics per detection (execution time, scanned bytes, matches per day, precision estimates) to keep costs and quality in check.
Use infrastructure-as-code to provision tables, permissions, and scheduled jobs. When a detection meets your real-time criteria, forward only the alert to the SIEM for case management, keeping raw data in the lake for context.
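In practice, a detection can live in Git as a plain SQL file, with its metadata carried in header comments that CI parses for scheduling and routing; the header convention, table names, and the mfa_authenticated enrichment column below are illustrative, not a standard.
-- detection: cloud_iam_policy_change_without_mfa   (illustrative header convention)
-- severity: high
-- schedule: every 15 minutes
-- forward_to_siem: true
SELECT
    user_name,
    action,
    event_time,
    source_ip
FROM lake.silver.cloud_api_activity
WHERE action IN ('AttachUserPolicy', 'PutRolePolicy', 'CreatePolicyVersion')
  AND mfa_authenticated = false          -- hypothetical enrichment column
  AND event_time >= current_timestamp - INTERVAL '15' minute;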
Analytics and ML That Actually Help
Machine learning in security has a reputation for noise. The lake provides the data density to do it right without exploding cost:
- Baselining: Build per-entity baselines for logins, process launches, network egress, and cloud API usage. Anomaly scores complement rule-based detection rather than replacing it.
- Feature stores: Persist derived features (e.g., failed_login_rate_1h, geo_hops_24h, new_process_ratio) for consistent training and inference.
- Weak supervision: Label datasets using heuristics, blocklists, and analyst feedback to produce training sets without expensive manual labeling.
- Explainability: Prefer models that provide interpretable reasons for anomalies, or wrap them so they do; analysts need actionable context.
Run ML jobs on an elastic compute layer and persist only compact model outputs and alert candidates. Keep models simple where possible—robust baselines and well-chosen thresholds often outperform black boxes in production.
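A simple per-entity baseline can be maintained entirely in SQL. The sketch below compares today's login count per user against a trailing 30-day mean and standard deviation; the table name and the anomaly threshold are assumptions.
-- Sketch: flag users whose daily login volume deviates sharply from their 30-day baseline
WITH daily_counts AS (
    SELECT user_name, date(event_time) AS day, count(*) AS logins
    FROM lake.silver.authentication
    WHERE event_time >= current_date - INTERVAL '31' day
    GROUP BY user_name, date(event_time)
),
baseline AS (
    SELECT user_name, avg(logins) AS mean_logins, stddev(logins) AS sd_logins
    FROM daily_counts
    WHERE day < current_date
    GROUP BY user_name
)
SELECT d.user_name, d.logins, b.mean_logins,
       (d.logins - b.mean_logins) / nullif(b.sd_logins, 0) AS z_score
FROM daily_counts d
JOIN baseline b ON d.user_name = b.user_name
WHERE d.day = current_date
  AND (d.logins - b.mean_logins) / nullif(b.sd_logins, 0) > 3;   -- threshold is illustrative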
Cost Optimization Techniques
Security data lakes cut spend primarily by shifting storage to low-cost object storage, but day-to-day efficiency still matters. Practical tactics include:
- Right-size partitions and compact files to reduce scanned data.
- Predicate pushdown and column pruning by selecting only the fields you need.
- Materialize high-value aggregations for common hunts (e.g., top talkers, new services listening, newly seen binaries).
- Use lifecycle policies and archive tiers for older data, with retrieval workflows pre-tested.
- Cache threat intelligence locally and join via fingerprints (domain hash, ASN) to minimize data bloat.
- Selective SIEM forwarding: send high-signal alerts and key audit trails; leave verbose telemetry in the lake.
Track cost per detection and cost per investigation hour saved. These metrics inform where to invest optimization effort and which datasets to prioritize.
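Building on the list above, here is a sketch of materializing one high-value aggregation, a "first seen" table of binaries per host, maintained incrementally so hunts for newly seen binaries never rescan raw telemetry; all table and column names are illustrative.
-- Sketch: incrementally maintain a "first seen" table of binaries per host (names are illustrative)
MERGE INTO lake.gold.first_seen_binaries AS t
USING (
    SELECT host_id, process_hash, min(event_time) AS first_seen
    FROM lake.silver.endpoint_process_events
    WHERE event_time >= current_date - INTERVAL '1' day
    GROUP BY host_id, process_hash
) AS s
ON t.host_id = s.host_id AND t.process_hash = s.process_hash
WHEN NOT MATCHED THEN
    INSERT (host_id, process_hash, first_seen)
    VALUES (s.host_id, s.process_hash, s.first_seen);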
Governance, Privacy, and Security Controls
A security data lake inevitably houses sensitive information: user identifiers, device fingerprints, IP addresses, sometimes payloads. Apply strict guardrails:
- Encryption: Server-side and client-side encryption with key management and rotation. Consider customer-managed keys for sensitive datasets.
- Access controls: Implement role-based or attribute-based access at the table, column, and row level. Analysts should not see secrets or unnecessary PII.
- Data minimization: Drop or hash unnecessary fields early. Tokenize high-risk fields where possible, keeping reversible mapping in a secure enclave.
- Lineage and audit: Maintain logs for who queried what, when, and how much data was accessed.
- Data residency: Partition datasets by region and keep processing local if required by regulation.
Build privacy by design into schemas. For example, separate tables for sensitive enrichments with controlled joins rather than scattering sensitive columns across all datasets.
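One way to implement column-level control is to give analysts a view that pseudonymizes direct identifiers while the underlying table stays restricted to a smaller group. The sketch below uses Trino-style hashing functions; the view name, the grant syntax, and the lack of salting are simplifications to adapt to your engine and key management.
-- Sketch: analyst-facing view that masks direct identifiers (function names vary by engine, e.g. sha2() in Spark SQL)
CREATE VIEW lake.silver.authentication_analyst AS
SELECT
    event_time,
    tenant_id,
    to_hex(sha256(to_utf8(user_name))) AS user_token,   -- pseudonymized identifier
    source_ip,
    device_id,
    status
FROM lake.silver.authentication;

-- Grant analysts access to the view only, not the underlying table (syntax depends on catalog/governance layer)
GRANT SELECT ON lake.silver.authentication_analyst TO ROLE security_analyst;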
Performance Tuning for Fast Hunts
Speed matters to analysts. Common optimizations include:
- Sort or cluster tables by event_time and high-selectivity keys (e.g., user_id) to maximize data skipping.
- Bloom filters and zone maps (supported by some table formats) to avoid reading irrelevant files.
- Warm caches by precomputing daily aggregations and sessionization tables for identity and cloud activities.
- Use vectorized readers and ensure compression codecs (Snappy, ZSTD) match your engine’s strengths.
Set SLAs for typical hunts (e.g., “Find all admin-role creations across 90 days in under 30 seconds”) and tune until you meet them. Analysts will naturally prefer the lake if it feels faster than the SIEM for exploratory work.
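As one example of the clustering advice above, Iceberg's Spark SQL extensions let you declare a write sort order so newly written files are clustered for data skipping; the table name is a placeholder, and other formats expose similar OPTIMIZE, ZORDER, or clustering commands.
-- Sketch: cluster newly written files by time and a high-selectivity key (Iceberg Spark SQL extension)
ALTER TABLE lake.silver.authentication WRITE ORDERED BY event_time, user_name;
-- Existing files can be rewritten to match via the rewrite_data_files procedure shown earlier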
A Migration Blueprint That Works
Moving from an all-in SIEM approach to a hybrid SIEM + data lake can be done incrementally. A pragmatic plan:
- Inventory telemetry: classify sources by volume, value, and compliance requirements. Identify high-volume/low-signal candidates (e.g., DNS, flow logs, CloudTrail data events) for lake-first routing.
- Stand up the foundation: object storage, table format, catalog, baseline governance, and a SQL engine.
- Land raw data (bronze): start with append-only, immutable ingestion to validate throughput and reliability.
- Normalize to silver: adopt OCSF or a compatible schema for the first few datasets. Add enrichments for identity and asset context.
- Shadow detections: port a handful of SIEM rules to the lake and run them in parallel to measure precision, recall, and performance.
- Selective offload: reduce SIEM ingest by forwarding only curated alerts and prioritized audit logs. Keep the rest in the lake.
- Expand coverage: iterate dataset by dataset, growing hunting playbooks, materialized views, and detection-as-code pipelines.
- Decommission or downgrade licenses: after confidence grows, renegotiate SIEM licensing based on lower ingest volume.
Throughout the migration, create analyst-friendly views and document common joins so daily workflows are smooth. Change management is as important as the technical work.
Real-World Examples
Fintech: 60% SIEM Cost Reduction and Better MFA Attack Detection
A global fintech faced monthly SIEM bills dominated by CloudTrail and Okta logs. They redirected raw logs to an object store, normalized them to OCSF, and ran scheduled detections on a serverless SQL engine. Only alert-worthy events (e.g., impossible travel, privilege escalations, policy changes) flowed into the SIEM. With richer retention, they built a baseline of per-user login velocity and device consistency. Within weeks, they caught an MFA fatigue attack in which an attacker bombarded a user with push prompts and then performed admin actions from a new ASN. The correlation required 90 days of identity and cloud API data that would have been unaffordable in the SIEM. Their SIEM spend dropped by 60%, and detection coverage increased.
Manufacturer: Network Beaconing Exposed via Long-Term DNS Retention
A manufacturing enterprise stored 12 months of DNS and NetFlow-like telemetry in the lake, keeping only IDS alerts in the SIEM. Using Trino, analysts built daily aggregations of domain-to-IP stability and periodicity features. They identified low-and-slow beaconing to dynamic DNS domains from a single plant segment—previously invisible due to 30-day SIEM retention. The lake’s cost allowed them to maintain long windows for statistical confidence without trimming data.
Healthcare Provider: Faster Investigations with Process Trees
Endpoint telemetry was landing raw into the lake, where Spark jobs constructed process trees and persisted compressed graph representations. When suspicious PowerShell activity surfaced, analysts pivoted across six months of process lineage to identify the initial access vector—a macro in a specific document template. The SIEM had a summary alert, but the investigation moved quickly because the lake held rich historical context.
Key Performance Indicators and ROI Signals
To demonstrate value and guide optimization, track:
- SIEM ingest reduction: bytes per day and cost delta relative to baseline.
- Retention window growth: months of coverage for key datasets vs. before.
- Mean time to investigate: time from alert to root cause with lake-assisted context.
- Detection quality: precision and recall for migrated rules; volume of actionable alerts.
- Cost per successful detection: lake compute + storage + SIEM alerting divided by confirmed incidents.
- Query performance: median and p95 latency for core hunting workloads and scheduled detections.
Showcase specific hunts that would not have been possible without the lake’s retention or cross-source joins. These narratives resonate with leadership as much as raw numbers.
Pitfalls and Anti-Patterns to Avoid
- Data swamp syndrome: Landing everything without a normalization plan leads to unreadable, untrustworthy data. Invest early in schema and catalog hygiene.
- Unbounded costs from sloppy queries: Educate analysts on partition pruning and column selection; set guardrails and budgets on serverless engines.
- Recreating a SIEM monolith: The goal is decoupling. Do not force all detections through one giant job. Use modular pipelines and small, composable rules.
- Neglecting governance: Without granular access controls and audit, you risk exposure of sensitive data and stall adoption by compliance stakeholders.
- Overfitting ML: Complex models that require handholding are brittle. Start with robust baselines, features, and thresholds backed by domain insight.
- One-way migrations: Keep feedback loops to push high-signal, real-time detections into the SIEM. Analysts need streamlined triage even as data gravity shifts to the lake.
Finally, plan for on-call realities. When an incident happens, responders need fast, predictable queries and well-documented playbooks. Build “golden paths” in the lake for common pivots: by user, device, process, IP, and cloud principal. Precompute or index what you can so responders do not fight slow scans when minutes matter.
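A golden-path pivot can be as simple as a saved, parameterized query that unions an entity's activity across the normalized tables; everything below, from the table names to the :user_name placeholder parameter, is illustrative.
-- Sketch: "pivot by user" golden path across normalized datasets (:user_name is a bound parameter)
SELECT event_time, 'authentication' AS dataset, source_ip, status AS detail
FROM lake.silver.authentication
WHERE user_name = :user_name
  AND event_time >= current_date - INTERVAL '30' day
UNION ALL
SELECT event_time, 'cloud_api' AS dataset, source_ip, action AS detail
FROM lake.silver.cloud_api_activity
WHERE user_name = :user_name
  AND event_time >= current_date - INTERVAL '30' day
ORDER BY event_time;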
The Path Forward
A security data lake lets you decouple storage and analytics to curb SIEM costs while expanding retention, context, and detection coverage. The payoff is tangible in KPIs: reduced ingest, longer lookbacks, faster investigations, and more actionable alerts. Steer clear of swamps and surprise bills with strong schemas, governance, modular pipelines, and cost guardrails—and keep a feedback loop to the SIEM for crisp triage. Start with a focused pilot: land your noisiest sources, formalize schemas, migrate a few high-value rules, and measure before-and-after. Take that step now, and you’ll be positioned to catch more threats at lower cost in the quarters ahead.
