Architecting Real-Time Data Platforms at Multi-Billion Row Scale

Building data platforms that process billions of rows in real-time isn't just about choosing the right technologies—it's about understanding distributed systems fundamentals, making informed architectural tradeoffs, and designing for inevitable failure scenarios.

Sai Kiran Malikireddy

Distinguished Engineer @ Walmart Global Tech | 10+ years in distributed systems

The Real-Time Data Challenge

In today's digital economy, the ability to process and act on data in real-time has become a critical competitive advantage. Whether it's powering recommendation engines, detecting fraud, optimizing supply chains, or personalizing customer experiences, organizations need to handle massive data volumes with millisecond latencies.

At Walmart, we've built real-time data platforms that process billions of events daily, supporting everything from inventory management to customer analytics. Here's what we've learned from architecting these systems at scale.

Core Architectural Principles

1. Design for Horizontal Scalability from Day One

The most critical decision in building scalable data platforms is ensuring every component can scale horizontally: stateless services, partitioned data, and no single node that every request must pass through.

We learned this lesson the hard way. Our initial architecture had stateful components that became bottlenecks at scale. Redesigning for horizontal scalability allowed us to grow from processing millions to billions of records daily.

💡 Pro Tip: The Scalability Test

Ask yourself: "If traffic doubles overnight, can I handle it by adding more machines?" If the answer involves code changes or architectural redesign, you have a scalability problem.

2. Embrace Event-Driven Architecture

Event-driven architectures are the backbone of real-time systems. They decouple producers from consumers, buffer load spikes naturally, and leave a replayable record of everything that happened.

Apache Kafka has been our workhorse for event streaming, but the principles apply to any message broker. The key is treating events as first-class citizens in your architecture.

// Event Schema Example
{
  "eventId": "uuid-v4",
  "eventType": "user.action.checkout",
  "timestamp": "2024-12-20T10:30:00Z",
  "userId": "user-12345",
  "payload": {
    "cartValue": 149.99,
    "items": [...],
    "location": "store-5678"
  },
  "metadata": {
    "source": "mobile-app",
    "version": "2.5.1"
  }
}
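A helper for producing events in this shape can be sketched in a few lines. The function name and its arguments are invented for illustration; only the field names come from the schema above.

```python
import json
import uuid
from datetime import datetime, timezone

def build_checkout_event(user_id, cart_value, items, location, source, app_version):
    # Build an event envelope matching the schema above. The helper is
    # illustrative, not part of any real codebase.
    return {
        "eventId": str(uuid.uuid4()),
        "eventType": "user.action.checkout",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "userId": user_id,
        "payload": {
            "cartValue": cart_value,
            "items": items,
            "location": location,
        },
        "metadata": {"source": source, "version": app_version},
    }

event = build_checkout_event("user-12345", 149.99, [], "store-5678", "mobile-app", "2.5.1")
print(json.dumps(event, indent=2))
```

Generating the `eventId` and `timestamp` at the producer, rather than the broker, keeps events self-describing even when they are replayed later.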

3. Optimize for the 99th Percentile, Not the Average

Average latency metrics are misleading. What matters is tail latency, the experience of your slowest requests. In distributed systems, tail latencies compound: a request that fans out to many services is only as fast as its slowest dependency, so rare per-service slowdowns become common at the request level.

We obsessively monitor p95, p99, and p99.9 latencies, and optimize specifically for these percentiles.
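The compounding effect is easy to quantify. If each downstream call is "fast" 99% of the time (its p99), the probability that a fan-out request avoids every slow call is 0.99 raised to the number of calls:

```python
# If a request touches N services and each one is fast 99% of the time,
# the chance the whole request sees no slow call shrinks quickly with N.
def prob_all_fast(n_services, per_call_fast=0.99):
    return per_call_fast ** n_services

for n in (1, 10, 100):
    print(f"{n:>3} services: {prob_all_fast(n):.1%} of requests avoid every slow call")
```

At 100 downstream calls, roughly a third of requests escape the tail entirely; the p99 of a single service becomes the typical experience of the whole request.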

Technology Stack Deep Dive

Data Ingestion Layer

The ingestion layer is where data enters your platform. It must be highly available, horizontally scalable, and able to absorb traffic bursts without losing data.

Our stack leverages Apache Kafka for ingestion, tuned for throughput and durability.
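As a sketch, the producer settings most commonly tuned for a high-throughput, durable ingestion path look like this. The property names are standard Kafka producer configs; the values are generic starting points, not any particular production tuning:

```python
# Illustrative Kafka producer settings for durable, high-throughput
# ingestion. Property names are standard Kafka producer configs; the
# values shown are generic starting points, not a real deployment's tuning.
producer_config = {
    "acks": "all",                # wait for all in-sync replicas (durability)
    "enable.idempotence": True,   # avoid duplicates when retries happen
    "compression.type": "lz4",    # trade a little CPU for network/disk savings
    "linger.ms": 10,              # wait briefly to fill larger batches
    "batch.size": 131072,         # 128 KiB batches amortize per-request cost
}

for key, value in producer_config.items():
    print(f"{key} = {value}")
```

The tension to manage is between `linger.ms`/`batch.size` (throughput) and end-to-end latency: larger batches are cheaper per record but sit longer before being sent.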

Stream Processing Layer

Real-time transformations, aggregations, and enrichment happen here. We use Apache Flink for stateful stream processing because it provides exactly-once state consistency, event-time processing with watermarks, and fault tolerance through checkpointing.
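The core idea of stateful stream processing can be illustrated without the Flink API. A toy tumbling-window counter shows the kind of keyed, windowed state a stream processor maintains (and checkpoints) for you; real jobs also need watermarks to decide when a window is complete:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    # Count events per key per fixed-size (tumbling) event-time window.
    # An in-memory illustration of stateful aggregation, not the Flink API:
    # real stream processors checkpoint this state and use watermarks to
    # decide when each window can be emitted.
    state = defaultdict(int)  # (key, window_start) -> count
    for key, event_time_ms in events:
        window_start = (event_time_ms // window_ms) * window_ms
        state[(key, window_start)] += 1
    return dict(state)

events = [
    ("store-5678", 1000),
    ("store-5678", 1500),
    ("store-5678", 2100),
    ("store-9", 2500),
]
counts = tumbling_window_counts(events, window_ms=1000)
print(counts)
```

Grouping by event time rather than arrival time is what makes results correct when events arrive late or out of order.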

🎯 Key Architectural Decisions

  • Choose horizontal scalability over vertical from day one
  • Implement event-driven patterns for decoupling and resilience
  • Optimize for tail latencies (p99), not averages
  • Design for failure—it's not if systems fail, but when
  • Monitor everything, but alert on what matters

Storage Layer Considerations

Storage is where many real-time systems hit walls. Key considerations include write throughput, query access patterns, data retention costs, and consistency requirements.

The key is using the right tool for each access pattern. Polyglot persistence isn't complexity for its own sake—it's optimization for real-world requirements.

Operational Excellence

Observability: The Three Pillars

You can't operate what you can't observe. Our observability strategy focuses on the three pillars: metrics, logs, and distributed traces.

The power comes from correlating these three. When latency spikes, you need to quickly trace the issue to specific components, view relevant logs, and understand the broader system metrics.
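Correlation hinges on a shared identifier: every log line (and ideally every metric dimension) emitted while handling a request carries the trace ID, so a latency spike found in a trace can be joined directly to its logs. A minimal structured-log sketch, with invented field names:

```python
import json
import time
import uuid

def log_line(trace_id, message, **fields):
    # Emit one structured log record as JSON. Carrying the trace id in
    # every record is what lets logs be joined to traces later; the
    # field names here are illustrative, not a real logging schema.
    record = {"ts": time.time(), "traceId": trace_id, "msg": message, **fields}
    return json.dumps(record)

trace_id = str(uuid.uuid4())
line = log_line(trace_id, "cache miss", component="storage", latency_ms=42)
print(line)
```

Because each record is machine-parseable JSON, a log pipeline can index on `traceId` and answer "show me every log line for this slow request" in one query.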

Chaos Engineering in Production

We regularly inject failures into production systems to validate resilience.

This isn't recklessness—it's controlled testing that reveals weaknesses before they cause real outages. Every chaos experiment teaches us something about our system's resilience.
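In its simplest form, fault injection can live in application code. The decorator below, a toy assumption rather than real chaos tooling (which typically injects faults at the network or host level), delays a configurable fraction of calls so you can observe how callers cope:

```python
import functools
import random
import time

def inject_latency(probability=0.01, delay_s=0.2, rng=random.random):
    # Decorator that adds artificial delay to a fraction of calls.
    # A minimal chaos-style fault injector for illustration; production
    # chaos tooling operates at the infrastructure level, not in-process.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                time.sleep(delay_s)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=1.0, delay_s=0.01)  # always inject, for the demo
def lookup(key):
    return key.upper()

print(lookup("store-5678"))
```

Keeping the injection probability and delay configurable lets an experiment start tiny (say, 0.1% of calls) and ramp up only while dashboards stay healthy.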

Performance Optimization Strategies

1. Batch Where Possible, Stream Where Necessary

Not everything needs real-time processing. We use the Lambda Architecture pattern: a batch layer recomputes complete, accurate views on a schedule, a speed layer maintains low-latency approximations for data the batch hasn't covered yet, and the two are merged at query time.

This hybrid approach balances cost, latency, and accuracy requirements.
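The query-time merge is the essence of the pattern. A sketch, with view names and shapes invented for illustration:

```python
# Lambda-style serving: combine an accurate-but-stale batch view with a
# fresh-but-approximate speed view at read time. The view contents and
# keys here are invented for illustration.
def merged_count(key, batch_view, speed_view):
    # batch_view covers everything up to the last batch run;
    # speed_view covers only what has streamed in since then.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"store-5678": 10_000}   # recomputed nightly
speed_view = {"store-5678": 37}       # streamed since the last batch run
print(merged_count("store-5678", batch_view, speed_view))  # -> 10037
```

Each nightly batch run resets the speed layer's contribution, so any drift in the streaming approximation is bounded to at most one batch interval.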

2. Smart Data Partitioning

Partitioning strategy directly impacts performance: a good partition key spreads load evenly while keeping related records together.

We partition by customer ID for user-specific queries and by timestamp for analytics workloads, creating separate data flows optimized for each use case.
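Hash-based partitioning on customer ID is the standard way to get both properties at once: each customer's events land on one partition (preserving per-customer ordering) while customers as a whole spread evenly. A sketch, using MD5 purely as a stable hash (Kafka's default partitioner uses murmur2, but the principle is the same):

```python
import hashlib

def partition_for(customer_id, num_partitions):
    # Map a customer id to a stable partition. MD5 is used here only as a
    # deterministic hash, not for security; the same id always lands on
    # the same partition, which preserves per-customer event ordering.
    digest = hashlib.md5(customer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p = partition_for("user-12345", num_partitions=32)
print(f"user-12345 -> partition {p}")
```

One caveat: changing `num_partitions` remaps almost every key, so partition counts should be chosen with growth headroom or migrated deliberately.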

3. Caching Strategy: Multi-Level Defense

Our caching strategy operates at multiple levels, from in-process caches through a shared distributed cache tier out toward the edge.

Each layer has different TTLs and invalidation strategies based on data freshness requirements and update patterns.
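The lookup path through such a hierarchy is a read-through: check the fastest, shortest-TTL level first, and on a total miss load from the source of record and populate every level on the way back. A minimal sketch, with the class and TTL values invented for illustration:

```python
import time

class TtlCache:
    # Tiny TTL cache: one "level" of a multi-level caching hierarchy.
    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._data = {}  # key -> (value, inserted_at)

    def get(self, key):
        hit = self._data.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl_s:
            return hit[0]
        return None

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())

def read_through(key, levels, load):
    # Check each level fastest-first; on a total miss, load from the
    # source of record and populate every level on the way back.
    for cache in levels:
        value = cache.get(key)
        if value is not None:
            return value
    value = load(key)
    for cache in levels:
        cache.put(key, value)
    return value

local = TtlCache(ttl_s=1)     # e.g. in-process, short TTL
shared = TtlCache(ttl_s=60)   # e.g. a distributed tier, longer TTL
value = read_through("sku-1", [local, shared], load=lambda k: {"price": 9.99})
print(value)
```

Giving the inner level a much shorter TTL than the shared tier bounds how stale an in-process copy can be without any explicit invalidation traffic.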

Lessons from Production

What Worked

What We'd Do Differently

The Road Ahead: Emerging Trends

The real-time data landscape continues to evolve, and the architectures described here will have to evolve with it.

Conclusion

Building real-time data platforms at scale is a journey, not a destination. Technologies will change, but the fundamental principles remain constant: design for horizontal scalability, embrace failure as inevitable, optimize for tail latencies, and make systems observable.

Success comes from balancing technical excellence with pragmatic tradeoffs. Not every decision needs to be perfect—it needs to be good enough today while enabling evolution tomorrow.

The platforms we build today will power the AI-driven, real-time experiences of tomorrow. By grounding our architecture in distributed systems fundamentals while remaining flexible to new technologies, we position ourselves to meet whatever challenges come next.
