Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)
by Jerome Kehrli
Posted on Friday May 04, 2018 at 12:32PM in Big Data
The Lambda Architecture, first proposed by Nathan Marz, attempts to provide a combination of technologies that together provide the characteristics of a web-scale system that satisfies requirements for availability, maintainability, fault-tolerance and low-latency.
Quoting Wikipedia: "Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation.
The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce."
In my current company - NetGuardians - we detect banking fraud using several techniques, among which real-time scoring of transactions to compute a risk score.
The deployment of Lambda Architecture has been a key evolution to help us evolve towards real-time scoring on the large scale.
In this article, I intend to present how we do Lambda Architecture in my company using Apache Kafka, ElasticSearch and Apache Spark with its extension Spark-Streaming, and what it brings to us.
- 1. Introduction
- 2. Lambda Architecture
- 3. Real-time computation with Lambda Architecture
- 4. Conclusion
1.1 NetGuardians' key big data software components
NG|Screener, NetGuardians' flagship product, is a Big Data Analytics Platform aimed at preventing fraud on the large scale within Financial Institutions.
Our platform manages and operates Big Data Analytics Use Cases detecting fraud attempts by analyzing user behaviours and financial transactions. Working in real time, it can block suspicious business events, e.g. financial transactions, to prevent fraud effectively.
Our platform is built internally on four key Big Data Open Source Software components:
Apache Kafka: Kafka is an open-source stream processing software aimed at providing a unified, high-throughput, low-latency platform for handling real-time data feeds.
ElasticSearch: ElasticSearch is a distributed, real-time, RESTful search and analytics document-oriented storage engine. It lets one perform and combine many types of searches - structured, unstructured, geo, metric - in real time.
Apache Mesos: Mesos is a distributed systems kernel that runs on every machine and provides applications with API's for resource management and scheduling across entire datacenter and cloud environments.
Apache Spark: Spark is a fast and general engine for large-scale data processing. It provides programmers with an API functioning as a working set for distributed programs that offers a versatile form of distributed shared memory.
1.2 One ring to rule them all
The choice of these specific components under the hood is not anecdotal. Running Apache Spark on Apache Mesos is still cutting edge nowadays, and the choice of Apache Kafka and ElasticSearch, in addition to the good fit with our use cases, answers a very important need we have.
We deploy our platform as much in tier 1 banks and big financial services providers as in small private banks in Switzerland or even small Credit Institutions in Africa. Some of our customers have a few thousand transactions daily while others have tens of millions of transactions per day.
Considering that some of our Analytics use cases require a depth of analysis of several years, when we have billions of events to consider, we deploy our analytics platform on multi-node clusters, sometimes up to a few dozen computation and storage nodes. On the other hand, when we work for small institutions with very low data volumes, we deploy it on a single small machine.
This need is at the very root of our technology choices: we needed technologies able to run efficiently on single small machines while still being able to scale out to hundreds of nodes should we require it.
ElasticSearch, Apache Spark, Apache Mesos and Apache Kafka have been designed from the ground up with this horizontal scalability in mind. Yet they have been implemented in such a way that they also run very well on a single small machine.
This is pretty uncommon in the Big Data Technology / NoSQL family of products. For instance, Apache Hadoop performs most of the time very poorly on single machines.
These products under the hood are key to sustain our "one ring to rule them all" approach. We develop one single platform that we can deploy everywhere, regardless of the volume of data of our customers.
1.3 Real-time readiness
In addition to their unique genes regarding scalability described above, ElasticSearch, Apache Kafka and Apache Spark provide our platform with another key feature.
With ElasticSearch, real-time updating (fast indexing) is achievable through various functionalities and search / read response time can be astonishingly deterministic.
Apache Kafka comes with the Kafka Stream extension. The Streams API, available as a Java library that is part of the official Kafka project, is the easiest way to write mission-critical real-time applications and microservices with all the benefits of Kafka's server-side cluster technology.
Despite being a humble library, Kafka Streams directly addresses the two hardest problems in stream processing:
- event-at-a-time processing with millisecond latency and
- stateful processing including distributed joins and aggregations.
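As a rough illustration of this event-at-a-time, stateful style, here is a minimal sketch in plain Python. The account and amount fields and the 3x-running-average flagging rule are purely illustrative assumptions, not our actual scoring logic; a real implementation would use the Kafka Streams API with a proper state store.

```python
from collections import defaultdict

def process_events(events):
    """Consume events one at a time, maintaining per-account state
    (the kind of stateful aggregation Kafka Streams keeps in a state store)."""
    state = defaultdict(lambda: {"count": 0, "total": 0.0})
    flagged = []
    for event in events:
        # event-at-a-time: each record updates state immediately, no batching
        acct = state[event["account"]]
        acct["count"] += 1
        acct["total"] += event["amount"]
        # hypothetical rule: flag an amount above 3x the account's prior average
        if acct["count"] > 1:
            prior_avg = (acct["total"] - event["amount"]) / (acct["count"] - 1)
            if event["amount"] > 3 * prior_avg:
                flagged.append(event)
    return state, flagged
```

The point of the sketch is that state is updated and consulted per record, with latency bounded by one state lookup rather than by a batch interval.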
Kafka enables us to implement fast processing of business events (most often financial transactions) in real time and in event-at-a-time mode, while dispatching micro-batches further to Spark Streaming. The more complicated processing required by our Analytics use cases then occurs within Spark through the Spark Streaming extension.
Spark Streaming is able to process hundreds of thousands of records per node per second. When using Kafka as a source, it is able to consume nearly half a million records per node per second, which is striking. It also offers near-linear scalability, another great perk.
In contrast to Kafka, Spark Streaming works using a micro-batch approach. It works as follows: received input streams are divided into small batches, which are processed by the Spark engine, and a processed stream of batches is returned.
The micro-batches can be as small as a few milliseconds, thus enabling sub-second latency while still ensuring a very high throughput, with access to the whole power and versatility of Spark to implement high-level analytics use cases.
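The discretization step can be sketched in plain Python, as a toy stand-in for what Spark Streaming does internally when slicing a stream into a sequence of small batches; the timestamps and the 500 ms interval in the test below are illustrative.

```python
def micro_batches(records, batch_interval_ms):
    """Divide a timestamped input stream into small batches by arrival time,
    the way Spark Streaming discretizes a stream into a sequence of batches."""
    batches = {}
    for ts_ms, record in records:
        # records whose timestamps fall in the same interval share a batch
        batch_key = ts_ms // batch_interval_ms
        batches.setdefault(batch_key, []).append(record)
    # each resulting batch would then be handed to the engine as one small job
    return [batches[k] for k in sorted(batches)]
```

Throughput comes from processing each batch as one parallel job; latency is bounded below by the batch interval, which is why this approach trades a little latency for a lot of throughput.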
This real-time readiness of these components of our technology stack is key to deploying Lambda Architecture within our platform.
2. Lambda Architecture
When it comes to processing transactions in real-time, our platform provides a state-of-the-art implementation of a Lambda Architecture.
Lambda architecture is a Big Data Architecture that enables us to reunite our real-time and batch analytics layers.
2.1 Lambda Architecture principles
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data.
The rise of lambda architecture is correlated with the growth of big data and real-time analytics.
As new data is introduced to the system, it is processed simultaneously by both the batch layer and the speed layer. The batch layer is an append-only repository containing unprocessed raw data. The batch layer periodically or continuously runs jobs that create views of the batch data: aggregations or representations of the most up-to-date versions. These batch views are sent to the serving layer, where they are available for analytic queries.
At the same time that data is being appended to the batch layer, it is simultaneously streaming into the speed layer. The speed layer is designed to allow queries to reflect the most up-to-date information, which is necessary because the serving layer's views can only be created by relatively long-running batch jobs. The speed layer computes only the data needed to bring the serving layer's views up to real time, for instance calculating totals for the past few minutes that are missing in the serving layer's view.
By merging data from the speed and serving layers, low latency queries can include data that is based on computationally expensive batch processing, and yet include real-time data. In the Lambda Architecture, the raw source data is always available, so redefinition and re-computation of the batch and speed views can be performed on demand. The batch layer provides a big data repository for machine learning and advanced analytics, while the speed and serving layers provide a platform for real-time analytics.
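The query-time merge of speed and serving layer views described above can be sketched as follows. This is a simplified illustration that assumes additive per-key aggregates (say, per-account transaction counts); real views are rarely this simple.

```python
def merge_views(batch_view, speed_view):
    """Serve a query by combining the complete-but-stale batch view with the
    fresh-but-partial speed-layer increments computed since the last batch run."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        # the speed layer only covers data the batch view has not yet absorbed
        merged[key] = merged.get(key, 0) + delta
    return merged
```

Once a new batch run completes and its view absorbs the recent data, the corresponding speed-layer increments can be discarded.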
The Lambda Architecture provides a useful pattern for combining multiple big data technologies to achieve multiple enterprise objectives.
2.2 Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)
Lambda defines a big data architecture that allows pre-defined and arbitrary queries and computations on both fast-moving data and historical data.
Using Kafka, ElasticSearch, Spark and Spark Streaming, this is achieved using the following layout:
Lambda Architecture enables us to score transactions or other business events in real-time and still consider the most recent events as well as the whole transaction history in its scoring model.
By using Kafka at the beginning of the pipeline to accept inputs, it can be guaranteed that messages will be delivered once they enter the system, regardless of hardware or network failures.
The batch layer is largely built on Apache Spark / Mesos, coupled with ElasticSearch as the large-scale storage component underneath. The reasons why we run on Spark, Mesos and ElasticSearch have been covered above; interestingly, these components turn out to behave extremely well together when it comes to addressing batch-processing concerns, thanks to Spark's ability to work largely in memory and to proper optimization of data co-locality across ElasticSearch and Spark nodes.
In the streaming layer, Kafka messages are consumed in real time using Spark Streaming. In terms of the core component supporting the speed layer, the usual choice is between Apache Storm and Apache Spark Streaming.
The main selection criteria between the two depends on whether one is interested in ultra low latency (Apache Storm) or high throughput (Apache Spark Streaming). There are other factors, but these are some of the main drivers.
In my company, for our use cases, we can afford slightly higher latency as long as we score a business event (e.g. a financial transaction) in under a second. On the other hand, we face situations where bursts of thousands of transactions to be scored per second are common. High throughput is therefore not optional for us; it is a key requirement, and it is the rationale behind our usage of Apache Spark Streaming.
Here, in the speed layer, ElasticSearch is key to reducing the latency of the speed layer's integration concerns, since it is a real-time querying engine in addition to a very powerful storage engine.
The serving layer, which consolidates the batch-layer and speed-layer partial results, is largely home-made in our case and relies on ElasticSearch's ability to fetch both partial sets in real time.
2.3 Drawbacks and difficulties of Lambda Architecture
There is a natural tendency to duplicate logic between batch layer and speed layer which needs to be addressed through strict design and re-usable logic. Using Spark in batch mode on the batch layer and Spark Streaming on the speed layer in our case really helps us reuse business logic as much as possible between both worlds.
In addition, there is the operational complexity of the many systems involved in implementing the lambda architecture. The implementation of Lambda Architecture is thus inherently difficult.
3. Real-time computation with Lambda Architecture
The demand for real-time analytics has led to demand for workflows that can effectively balance latency, throughput, scaling and fault tolerance.
In order to accommodate the demand for real-time analytics, we need to design a system that can provide balance between the concept of "single version of truth" and "real-time analytics". Lambda Architecture is one such method.
In my company, some of our analytics use cases require considering very extended contextual information about trade and transaction activities, for instance to build user and customer profiles or to analyze their past behaviours.
Building such contextual information typically requires analyzing, over and over again, billions of business events and petabytes of data.
Rebuilding these profiles or re-creating the aggregated statistical metrics would take several dozen minutes, even on a large cluster, with a typical batch-processing approach.
Fortunately, all this information can be built incrementally, and as such we benefit from the Lambda Architecture: the batch layer rebuilds the historical part while the latest data is taken into account by the speed layer to provide an up-to-date (near real-time) view of the reality. The serving layer consolidates both results to provide always up-to-date and accurate views of these profiles and other aggregated statistical metrics.
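This incremental-building property can be sketched in Python for a toy profile holding a transaction count, total and mean; the profile fields are illustrative assumptions, and our real profiles are far richer.

```python
def build_profile(history):
    """Batch layer: compute a user profile from the full transaction history."""
    total = float(sum(history))
    return {"count": len(history), "total": total,
            "mean": total / len(history) if history else 0.0}

def update_profile(profile, amount):
    """Speed layer: fold one new transaction into an existing profile
    incrementally, without re-scanning the history."""
    count = profile["count"] + 1
    total = profile["total"] + amount
    return {"count": count, "total": total, "mean": total / count}
```

The key property is that applying `update_profile` to the batch-built profile yields the same result as rebuilding from scratch, which is what lets the speed layer keep profiles current between batch runs.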
These real-time metrics are thus made available to our real-time scoring and classification systems.
The same technologies and approaches deployed in the speed layer to provide up-to-date views of the reality are used to score and classify business events, e.g. financial transactions in real-time.
Here as well, we have no requirements for strong real-time with millisecond-order latency. As long as we can provide a risk score or a classification for an event under a second, this is sufficient for our use cases.
On the other hand, it often happens that we have to compute bursts of several hundred events per second. As such, a system with acceptable latency but very high throughput, such as Apache Spark Streaming, is a key component of our processing platform.
In addition, within the NG|Screener UI we provide our customers with a full-blown data discovery application (forensic application). Lambda Architecture is key in enabling us to provide our users with real-time updates and a second-close, up-to-date view of the reality.
Deploying Lambda architecture on our use cases has proven to be the simplest way to reach our objectives:
- Up-to-date and second-close view of the reality in contextual information, user / customer profiles and other key periodic statistical metrics
- Classification and scoring of business events with an under-a-second latency and a very high throughput
- Resilience and fault tolerance of our business processes on large clusters, both on technical failures and human failures
- Simplicity and maintenance, especially in our approach, since we can share significant portions of code between the batch layer and the speed layer, both being built on Apache Spark
- Resolution of operational complexity of big computation on historical data by dividing the work to do in an incremental fashion.
Now of course, Lambda Architecture being the simplest way for us to reach our mission-critical objectives doesn't make it simple per se, quite the contrary. Lambda Architecture is inherently difficult to deploy and maintain and requires sound design and implementation.
At NetGuardians, we have been able to benefit from our mastery of cutting-edge technologies as well as our in-depth experience with both batch and real-time computing systems to turn this into an advantage of our approach.