Custom Source Integration: A Step-by-Step Guide to Onboarding Third-Party Data into Security Lake
Amazon Security Lake has transformed how organisations centralise and analyse security data from AWS services. With native integrations for CloudTrail, VPC Flow Logs, Route 53, and other AWS sources, getting started with Security Lake is straightforward. However, the real value emerges when you extend beyond AWS’s native ecosystem to include third-party security tools, on-premises infrastructure, and custom applications.
This guide provides a practical walkthrough for security teams looking to onboard custom sources into Security Lake, addressing the technical challenges, architectural decisions, and operational considerations that matter in production environments.
Understanding the Custom Source Integration Challenge
Most organisations operate heterogeneous security stacks that generate valuable telemetry from sources Amazon Security Lake doesn’t natively support. These might include:
- Third-party security platforms such as endpoint detection and response (EDR) solutions, network detection and response (NDR) tools, or cloud access security brokers (CASBs).
- On-premises infrastructure including legacy SIEM platforms, firewalls, intrusion detection systems, or proxy servers.
- Custom applications and services that generate security-relevant events but lack standard integration paths.
- SaaS applications outside the AWS ecosystem such as Microsoft 365, Google Workspace, or Salesforce.
Without these sources, your Security Lake provides incomplete visibility, undermining the core value proposition of centralised security data analysis. The challenge is bringing these disparate sources into Security Lake whilst maintaining data quality, ensuring proper normalisation to OCSF, and avoiding the operational overhead that plagued traditional SIEM implementations.
The OCSF Normalisation Imperative
Before diving into integration mechanics, it’s crucial to understand why normalisation matters. The Open Cybersecurity Schema Framework (OCSF) provides a vendor-agnostic taxonomy for security events. When you normalise third-party data to OCSF at the point of ingestion, you enable:
- Cross-source correlation: Query across AWS CloudTrail and your EDR platform using unified field names and data structures.
- Tool interoperability: Any OCSF-aware security tool can consume your Security Lake data without custom parsers.
- Future-proof architecture: Adding new sources or switching vendors becomes dramatically simpler when everything speaks the same language.
- Query efficiency: Normalised schemas enable query optimisations that reduce costs and improve performance.
The investment in proper normalisation at ingestion pays dividends throughout the data lifecycle. Conversely, storing raw logs and normalising at query time creates ongoing operational friction and inflated costs.
Integration Architecture Patterns
There are three primary architectural patterns for custom source integration, each suited to different scenarios:
Pattern 1: Direct API Push
In this pattern, the source system pushes data directly to Security Lake using AWS APIs. This works well when:
- The source system supports webhook or API-based event forwarding.
- Event volumes are moderate and predictable.
- You require near real-time data ingestion.
- The source provides structured data that can be transformed to OCSF with reasonable effort.
Implementation involves configuring the source to send events to an AWS Lambda function or API Gateway endpoint, which transforms the data to OCSF format and writes to the Security Lake S3 bucket with appropriate partitioning.
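The shape of that Lambda backend can be sketched as follows. This is a minimal illustration, not a production implementation: the source prefix, region, account ID, and the EDR payload shape are all assumptions, the transform is a placeholder, and the actual S3 write (and Parquet encoding, covered under the partitioning step) is indicated by a comment.

```python
import gzip
import json
from datetime import datetime, timezone

# Hypothetical custom-source location inside the Security Lake bucket.
SOURCE_PREFIX = "ext/example-edr"

def build_object_key(event_time: datetime, region: str, account_id: str) -> str:
    """Compute the partitioned S3 key for a batch of OCSF events."""
    return (
        f"{SOURCE_PREFIX}/region={region}/accountId={account_id}/"
        f"eventDay={event_time:%Y%m%d}/events-{event_time:%H%M%S}.json.gz"
    )

def transform_to_ocsf(raw: dict) -> dict:
    """Placeholder mapping; the transformation step below shows more detail."""
    return {"class_uid": 1007, "time": raw.get("timestamp"), "raw_data": raw}

def lambda_handler(event, context):
    """API Gateway backend: transform a batch of raw EDR events and stage it."""
    records = json.loads(event["body"])  # assumed: a JSON list of raw events
    ocsf_events = [transform_to_ocsf(r) for r in records]
    now = datetime.now(timezone.utc)
    key = build_object_key(now, "eu-west-2", "123456789012")
    body = gzip.compress("\n".join(json.dumps(e) for e in ocsf_events).encode())
    # A boto3 s3.put_object(Bucket=..., Key=key, Body=body) call would go here.
    return {"statusCode": 200, "body": json.dumps({"key": key})}
```

The gzipped JSON body is for brevity only; a production write would emit Parquet, as the partitioning step describes.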
Pattern 2: Agent-Based Collection
For sources that don’t support push mechanisms, agent-based collection provides a reliable alternative. Deploy collection agents that:
- Poll source systems on configurable intervals.
- Buffer events locally to handle network interruptions.
- Transform data to OCSF before transmission.
- Support backpressure and rate limiting to prevent overwhelming sources.
This pattern suits on-premises infrastructure, legacy systems, or sources with unreliable connectivity. The trade-off is additional infrastructure to manage and potential for higher latency.
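The buffering behaviour an agent needs can be sketched with a simple bounded queue. This is an illustrative stand-in, assuming a `ship` callable that represents delivery to the ingestion endpoint; a real agent would also persist the buffer to disk.

```python
from collections import deque

class BufferedCollector:
    """Agent-side buffer: hold polled events locally and ship in batches,
    surviving network interruptions by keeping failed batches queued."""

    def __init__(self, max_buffer=10_000, batch_size=500):
        # Oldest events are dropped first if the link is down too long.
        self.buffer = deque(maxlen=max_buffer)
        self.batch_size = batch_size

    def collect(self, events):
        """Buffer newly polled events locally."""
        self.buffer.extend(events)

    def drain(self, ship):
        """Ship buffered events in batches; stop on the first failure so
        remaining events survive until the next attempt."""
        shipped = 0
        while self.buffer:
            n = min(self.batch_size, len(self.buffer))
            batch = [self.buffer.popleft() for _ in range(n)]
            if not ship(batch):
                self.buffer.extendleft(reversed(batch))  # requeue failed batch
                break
            shipped += len(batch)
        return shipped
```

Stopping on the first failed batch is a crude form of backpressure: the agent naturally slows down when the endpoint is struggling.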
Pattern 3: Streaming Pipeline
For high-volume sources or scenarios requiring complex enrichment, a streaming pipeline architecture provides the necessary throughput and flexibility. This typically involves:
- Initial ingestion to a Kinesis Data Stream or Kafka cluster.
- Stream processing using Kinesis Data Analytics, AWS Lambda, or Apache Flink for transformation and enrichment.
- Writing normalised events to Security Lake with partitioning optimised for query patterns.
This pattern handles the highest event volumes and supports sophisticated processing logic, but requires more architectural complexity and operational expertise.
Step-by-Step Implementation Guide
Let’s walk through integrating a third-party EDR platform into Security Lake using the direct API push pattern, as it represents the most common integration scenario.
Step 1: Define Your Data Requirements
Before writing any code, clearly define what data you need and how it maps to OCSF. For an EDR platform, you might prioritise:
- Process execution events (OCSF Process Activity, class 1007).
- Network connection events (OCSF Network Activity, class 4001).
- File activity events (OCSF File System Activity, class 1001).
- Authentication events (OCSF Authentication, class 3002).
Document the source data schema and create explicit mappings to OCSF fields. This mapping document becomes your integration contract and ensures consistency across your team.
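One way to make that contract enforceable is to express the mapping as data, so it can be reviewed and versioned alongside the transformation code. In this sketch the source field names on the left are hypothetical EDR fields; the dotted targets are OCSF attribute paths.

```python
# Mapping "contract" for one event type: hypothetical EDR fields -> OCSF.
PROCESS_ACTIVITY_MAP = {
    "class_uid": 1007,  # OCSF Process Activity
    "fields": {
        "event_timestamp": "time",
        "host_name": "device.hostname",
        "process_name": "process.name",
        "process_cmdline": "process.cmd_line",
        "user_name": "actor.user.name",
    },
}

def apply_mapping(raw: dict, mapping: dict) -> dict:
    """Build a nested OCSF event from a flat source event using the map."""
    event = {"class_uid": mapping["class_uid"]}
    for src, dst in mapping["fields"].items():
        if src not in raw:
            continue  # optional source fields may be absent
        parts = dst.split(".")
        cursor = event
        for part in parts[:-1]:
            cursor = cursor.setdefault(part, {})  # create nested objects
        cursor[parts[-1]] = raw[src]
    return event
```

Because the mapping is plain data, a reviewer can diff it against the mapping document without reading transformation logic.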
Step 2: Set Up the Ingestion Infrastructure
Create an API Gateway endpoint to receive events from your EDR platform. Configure a Lambda function as the backend processor. Ensure the Lambda execution role has permissions to write to your Security Lake S3 bucket under the custom-source prefix: ext/SOURCE_NAME/.
Implement proper error handling, logging to CloudWatch, and dead letter queues for events that fail processing. Production implementations should include retry logic and circuit breakers to handle transient failures gracefully.
Step 3: Implement OCSF Transformation
The transformation logic is where integration quality is determined. For each event type, implement a transformation function that:
- Extracts relevant fields from the source event.
- Maps them to the appropriate OCSF class and attributes.
- Enriches with contextual information such as asset criticality or threat intelligence.
- Validates the resulting OCSF event against the schema.
- Handles missing or malformed data gracefully with appropriate defaults.
Avoid the temptation to include every field from the source. Focus on fields that provide investigative value or support specific detection use cases. Unnecessary fields inflate storage costs and query complexity without adding value.
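A transformation function embodying those points might look like the following sketch. The required-field set here is a deliberately minimal subset for illustration, not the full OCSF requirement list, and the defaults chosen (Informational severity, current time) are assumptions you would tune per event class.

```python
from datetime import datetime, timezone

# Minimal subset of required attributes for this sketch; the real OCSF
# class definition specifies the authoritative list.
REQUIRED_FIELDS = {"class_uid", "time", "severity_id", "activity_id"}

def transform_process_event(raw: dict) -> dict:
    """Transform a hypothetical EDR process event into an OCSF Process
    Activity event, with graceful defaults for missing data."""
    event = {
        "class_uid": 1007,                      # Process Activity
        "category_uid": 1,                      # System Activity
        "activity_id": 1,                       # Launch
        "severity_id": raw.get("severity", 1),  # default: Informational
        "time": raw.get("timestamp")
                or int(datetime.now(timezone.utc).timestamp() * 1000),
        "process": {"name": raw.get("process_name", "unknown")},
        "metadata": {"product": {"name": "example-edr"}},
    }
    validate(event)
    return event

def validate(event: dict):
    """Fail fast on schema violations rather than writing bad data."""
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"OCSF event missing required fields: {sorted(missing)}")
```

Failing validation at this point, and routing the raw event to a dead letter queue, is far cheaper than discovering malformed events in Athena later.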
Step 4: Implement Partitioning Strategy
Security Lake uses Hive-style partitioning with the structure: region=REGION/accountId=ACCOUNTID/eventDay=YYYYMMDD/. When writing events, ensure your Lambda function:
- Calculates the correct partition based on the event timestamp (not ingestion time).
- Creates partition directories if they don't exist.
- Writes events in Parquet format with appropriate compression.
- Batches events when possible to reduce S3 API costs and improve write efficiency.
Proper partitioning is critical for query performance and cost optimisation. Queries that filter by date can skip entire partitions, dramatically reducing the data scanned and associated costs.
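The partition calculation and batching above can be sketched as pure functions, assuming OCSF's epoch-millisecond `time` attribute. The region and account ID are caller-supplied here; the actual Parquet write (for example via pyarrow) is left out of the sketch.

```python
from collections import defaultdict
from datetime import datetime, timezone

def partition_key(event: dict, region: str, account_id: str) -> str:
    """Derive the Hive-style partition from the event's own timestamp
    (epoch milliseconds in OCSF 'time'), never the ingestion time."""
    day = datetime.fromtimestamp(event["time"] / 1000, tz=timezone.utc)
    return f"region={region}/accountId={account_id}/eventDay={day:%Y%m%d}/"

def group_by_partition(events, region, account_id):
    """Batch events per partition so each S3 PUT lands in one directory,
    minimising API calls and keeping files query-friendly."""
    groups = defaultdict(list)
    for e in events:
        groups[partition_key(e, region, account_id)].append(e)
    return dict(groups)
```

Grouping before writing means a batch that straddles midnight UTC still lands in the correct eventDay partitions rather than being split arbitrarily.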
Step 5: Configure the Source System
With the ingestion infrastructure in place, configure your EDR platform to forward events to your API Gateway endpoint. Key considerations include:
- Event filtering: Only forward events you've decided to ingest to avoid processing overhead.
- Batch size: Larger batches improve efficiency but increase retry complexity.
- Authentication: Implement API key validation or mutual TLS to prevent unauthorised submissions.
- Rate limiting: Protect your ingestion endpoint from overwhelming your processing capacity.
Start with a subset of endpoints or event types to validate the integration before expanding to full production scale.
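For the authentication point, one lightweight option (where mutual TLS is impractical) is an HMAC signature over the request body with a shared secret. This is a sketch of the verification side only; the header name and secret distribution mechanism are assumptions specific to your EDR's webhook capabilities.

```python
import hashlib
import hmac

def verify_signature(shared_secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Verify an HMAC-SHA256 signature the source computed over the raw
    request body, using constant-time comparison to avoid timing leaks."""
    expected = hmac.new(shared_secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

The Lambda would reject any batch whose signature fails before doing any parsing, keeping unauthenticated traffic cheap to discard.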
Step 6: Validate Data Quality
Before declaring the integration complete, thoroughly validate data quality:
- Schema compliance: Verify that all required OCSF fields are populated correctly.
- Field mapping accuracy: Spot-check that source fields map to the intended OCSF attributes.
- Timestamp handling: Ensure event times reflect when events occurred, not when they were ingested.
- Completeness: Confirm that event volumes match expectations and no data is being silently dropped.
Query the integrated data in Security Lake using Athena to verify that the partitioning is working correctly and that queries return expected results. This validation step prevents discovering data quality issues weeks later when they’ve undermined detection logic.
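Several of these checks can be automated over a sample of events pulled back out of Security Lake. The following sketch covers only the basics (class and timestamp presence, future-dated events, minimum volume); field names follow OCSF, and the thresholds are placeholders.

```python
import time

def quality_report(events, expected_min=1):
    """Summarise basic data-quality checks over a sample of OCSF events
    queried back out of Security Lake."""
    now_ms = int(time.time() * 1000)
    report = {"total": len(events), "missing_class": 0,
              "missing_time": 0, "future_time": 0}
    for e in events:
        if "class_uid" not in e:
            report["missing_class"] += 1
        t = e.get("time")
        if t is None:
            report["missing_time"] += 1
        elif t > now_ms:
            # Often a symptom of ingestion time written as event time.
            report["future_time"] += 1
    report["volume_ok"] = report["total"] >= expected_min
    return report
```

Running this on a scheduled sample, and alerting when any counter is non-zero, catches the "silently dropped or mis-mapped" failure modes before they undermine detections.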
Step 7: Implement Monitoring and Alerting
Production integrations require observability to detect and resolve issues quickly:
- Monitor Lambda execution metrics including invocation count, error rate, duration, and throttles.
- Track S3 write operations to detect ingestion interruptions.
- Alert on data freshness to identify when events stop flowing.
- Monitor transformation error rates and review failed events regularly.
Include integration health in your operational dashboards alongside native AWS source metrics. Custom integrations shouldn’t be second-class citizens in your observability strategy.
Advanced Considerations
Data Enrichment at Ingestion
Consider enriching events during transformation with additional context:
- Asset criticality: Tag events from critical systems to prioritise investigations.
- Threat intelligence: Enrich with indicators of compromise to accelerate detection.
- User context: Add department, role, or other attributes to support access analytics.
- Geolocation: Enrich IP addresses with geographic information for anomaly detection.
Enrichment at ingestion is more efficient than enriching at query time, particularly when the same enrichment would be repeated across many queries.
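As a sketch, asset-criticality enrichment can be a lookup applied during transformation. The in-memory table stands in for a real CMDB export; the hostnames and criticality values are illustrative, and the enrichment record follows OCSF's generic enrichments array.

```python
# Illustrative CMDB snapshot: hostname -> criticality. In production this
# would be refreshed from your asset inventory, not hard-coded.
ASSET_CRITICALITY = {"web-01": "high", "build-07": "low"}

def enrich(event: dict) -> dict:
    """Attach asset criticality to an OCSF event when the host is known."""
    hostname = event.get("device", {}).get("hostname")
    if hostname in ASSET_CRITICALITY:
        event.setdefault("enrichments", []).append({
            "name": "asset_criticality",
            "value": ASSET_CRITICALITY[hostname],
            "type": "cmdb",
        })
    return event
```

Doing this once at ingestion means every downstream query can filter on criticality directly instead of re-joining against the CMDB each time.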
Handling Schema Evolution
Source systems evolve, adding new fields or changing data formats. Plan for schema evolution by:
- Versioning your transformation logic to support multiple source schema versions.
- Implementing schema validation to detect unexpected changes.
- Designing transformations to degrade gracefully when optional fields are missing.
- Maintaining backwards compatibility when updating OCSF mappings.
A robust integration anticipates change rather than assuming static schemas.
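Versioned transformation logic can be as simple as a dispatch table keyed on the source's schema version, so old agents can keep shipping events during a rollout. The version strings and field renames below are invented for illustration.

```python
def transform_v1(raw: dict) -> dict:
    """Original source schema: process name in 'proc'."""
    return {"class_uid": 1007, "process": {"name": raw["proc"]}}

def transform_v2(raw: dict) -> dict:
    """v2 renamed 'proc' to 'process_name' and added a command line."""
    return {"class_uid": 1007,
            "process": {"name": raw["process_name"],
                        "cmd_line": raw.get("cmdline")}}

TRANSFORMS = {"1": transform_v1, "2": transform_v2}

def transform(raw: dict) -> dict:
    """Dispatch on the declared schema version, defaulting to v1."""
    version = str(raw.get("schema_version", "1"))
    fn = TRANSFORMS.get(version)
    if fn is None:
        # Unknown versions go to a dead letter queue for review.
        raise ValueError(f"unsupported source schema version: {version}")
    return fn(raw)
```

Keeping each version's transform separate makes backwards compatibility testable: the v1 test suite keeps passing untouched when v2 lands.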
Multi-Region Considerations
For global deployments, decide whether to:
- Replicate all data to a primary region for centralised analysis.
- Maintain regional Security Lake instances with federated querying.
- Adopt a hybrid approach with regional storage and selective replication based on data sensitivity or compliance requirements.
Data sovereignty requirements often dictate architecture, particularly in regulated industries or when operating in jurisdictions with strict data residency rules.
Cost Optimisation
Custom source integration introduces costs that require active management:
- Lambda invocation costs: Batch events when possible to reduce function executions.
- S3 storage costs: Implement lifecycle policies to transition older data to cheaper storage classes.
- S3 API costs: Minimise PUT operations through batching and efficient partitioning.
- Data transfer costs: For on-premises sources, compression reduces network transfer costs.
Monitor integration costs alongside security value to ensure positive ROI. Not every event type justifies the cost of ingestion, storage, and analysis.
Common Integration Pitfalls to Avoid
Having implemented numerous custom integrations, we’ve observed recurring mistakes:
- Incomplete OCSF mapping: Failing to populate required fields creates downstream query problems.
- Ignoring timezone handling: Inconsistent timezone treatment causes temporal correlation failures.
- Over-collecting data: Ingesting high-volume, low-value events inflates costs without improving security posture.
- Inadequate error handling: Silent failures allow data gaps to persist undetected.
- Neglecting partition pruning: Poor partitioning design makes queries expensive and slow.
Learning from others’ mistakes is cheaper than making them yourself.
Real-World Example: EDR Integration
Consider a practical example integrating CrowdStrike Falcon with Security Lake. Falcon provides rich endpoint telemetry but uses its own event schema. The integration approach:
- Leverage Falcon's Event Streams API to receive events via HTTP push.
- Deploy a Lambda function that transforms Falcon's DetectionSummaryEvent to OCSF Detection Finding (class 2004).
- Enrich with host criticality from your CMDB during transformation.
- Partition by event date and write to Security Lake in OCSF-compliant Parquet format.
This integration enables querying Falcon detections alongside AWS GuardDuty findings and CloudTrail events using unified OCSF fields, dramatically simplifying cross-platform investigations.
Moving Beyond Native Sources
Custom source integration transforms Security Lake from an AWS-centric data repository into a comprehensive security data platform. The investment in proper integration architecture, OCSF normalisation, and operational excellence pays dividends through:
- Comprehensive visibility across your entire security stack, not just AWS services.
- Simplified vendor evaluation and switching when data normalisation eliminates lock-in.
- Improved detection accuracy through cross-platform correlation.
- Reduced operational overhead compared to maintaining multiple data silos.
The organisations gaining maximum value from Security Lake are those that view it as a unifying data layer for all security telemetry, regardless of source. With proper integration practices, Security Lake becomes the foundation for truly data-centric security operations.
Ready to Extend Your Security Lake?
At HOOP Cyber, we specialise in custom source integrations that maintain data quality whilst optimising costs. Our team of ex-Splunk engineers and AWS security specialists has integrated hundreds of third-party sources into Security Lake for clients across industries.
Whether you’re integrating a single critical source or building a comprehensive multi-source architecture, we can help design, implement, and operate integrations that deliver long-term value.
Contact HOOP Cyber to discuss your custom integration requirements and learn how we can accelerate your Security Lake adoption.