<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="https://www.niceideas.ch/roller2/roller-ui/styles/rss.xsl" media="screen"?><rss version="2.0" 
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:atom="http://www.w3.org/2005/Atom" >
<channel>
  <title>niceideas.ch</title>
  <link>https://www.niceideas.ch/roller2/badtrash/</link>
      <atom:link rel="self" type="application/rss+xml" href="https://www.niceideas.ch/roller2/badtrash/feed/entries/rss?cat=Big+Data" />
    <description>Technological Thoughts by Jerome Kehrli</description>
  <language>en-us</language>
  <copyright>Copyright 2025</copyright>
  <lastBuildDate>Tue, 23 Dec 2025 06:13:50 -0500</lastBuildDate>
  <generator>Apache Roller 5.1.2</generator>
        <item>
    <guid isPermaLink="true">https://www.niceideas.ch/roller2/badtrash/entry/modern-information-system-architectures</guid>
    <title>Modern Information System Architectures</title>
    <dc:creator>Jerome Kehrli</dc:creator>
    <link>https://www.niceideas.ch/roller2/badtrash/entry/modern-information-system-architectures</link>
        <pubDate>Mon, 13 Dec 2021 06:04:45 -0500</pubDate>
    <category>Big Data</category>
    <category>architecture</category>
    <category>big-data</category>
    <category>hadoop</category>
    <category>kubernetes</category>
    <category>microservices</category>
    <category>newsql</category>
    <category>nosql</category>
    <atom:summary type="html">&lt;p&gt;
For forty years we have been building &lt;i&gt;Information Systems&lt;/i&gt; in corporations in the same way, with the same architecture, and with very little innovation or change in paradigm:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On one side the &lt;b&gt;Operational Information System&lt;/b&gt; which sustains day-to-day operations and business activities. On the Operational Information Systems, the &lt;i&gt;3-tiers&lt;/i&gt; architecture and the relational database model (RDBMS - Relational Database Management System / SQL) have ruled for nearly 40 years.
&lt;/li&gt;
&lt;li&gt;On the other side the &lt;b&gt;Decision Support Information System&lt;/b&gt; - or &lt;i&gt;Business Intelligence&lt;/i&gt; or &lt;b&gt;Analytical Information System&lt;/b&gt; - where the &lt;i&gt;Data Warehouse&lt;/i&gt; architecture pattern has ruled for 30 years.
&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;legacy Information System Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Legacy Information Systems Architecture for 40 years&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Of course the technologies involved in building these systems have evolved over the decades: in the 80s, COBOL on IBM hosts ruled the Information Systems world, whereas Java quickly emerged as a standard in the 2000s, etc.
&lt;br&gt;
But while the technologies used to build these information systems evolved fast, their architecture, on the other hand - the way we design and build them - didn&apos;t change at all. The relational model ruled for 40 years along with the 3-tier model in the Operational world, while in the analytical world the Data Warehouse pattern was the only way to go for decades.
&lt;/p&gt;

&lt;p&gt;
The relational model is interesting and has been helpful for many decades. Its fundamental objective is to optimize storage space by ensuring each entity is stored only once (3rd normal form / normalization). It comes from a time when storage was very expensive.
&lt;br&gt;
But by imposing normalization and ACID transactions, it prevents horizontal scalability by design. An Oracle database, for instance, is designed to run on a single machine; it simply can&apos;t implement relational references and ACID transactions across a cluster of nodes.
&lt;br&gt;
Today storage is anything but expensive, yet Information Systems still have to deal with RDBMS limitations, mostly because... that was the only way we knew.
&lt;/p&gt;

&lt;p&gt;
On the Decision Support Information System (BI / Analytical System), the situation is even worse. In Data Warehouses, data is &lt;i&gt;pushed&lt;/i&gt; along the way and transformed one step at a time: first into a staging database, then into the Data Warehouse database and finally into Data Marts, highly specialized towards specific use cases.
&lt;br&gt;
For a long time we didn&apos;t have much of a choice, since implementing such analytics in a &lt;i&gt;pull&lt;/i&gt; way (the data lake pattern) was impossible; we simply didn&apos;t have the proper technology. The only way to support high volumes of data was to &lt;i&gt;push&lt;/i&gt; daily increments through these complex transformation steps every night, when the workload on the system is lower.
&lt;br&gt;
The problem with this &lt;i&gt;push&lt;/i&gt; approach is that it&apos;s utterly inflexible. One can&apos;t change one&apos;s mind along the way and quickly come up with a new type of data: working with daily increments means waiting 6 months to build a 6-month history. Not to mention that the whole process is amazingly costly to develop, maintain and operate.
&lt;/p&gt;

&lt;p&gt;
So for a long time, RDBMSes and Data Warehouses were all we had.
&lt;/p&gt;

&lt;p&gt;
It took the Internet revolution, and the web giants hitting the limits of these traditional architectures, for something different to finally be considered. The &lt;b&gt;Big Data revolution&lt;/b&gt; has been the cornerstone of all the evolutions in Information System architecture we have witnessed over the last 15 years.
&lt;/p&gt;

&lt;p&gt;
The latest step in this software architecture evolution (or revolution) is micro-services, where the benefits originally specific to the analytical information system finally end up overflowing to the operational information system.
&lt;br&gt;
Where Big Data was originally a lot about scaling the computation along with the data topology - bringing the code to where the data is (a data-tier revolution) - today we are scaling everything, from individual components requiring heavy processing to message queues, etc.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/f74719c9-f6f5-4f48-9bf2-ae584c02525f&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;Microservices Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/f74719c9-f6f5-4f48-9bf2-ae584c02525f&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Example of modern IS architecture: Microservices&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
In this article, I want to present and discuss how Information System architectures evolved from the universal 3-tier (operational) / Data Warehouse (analytical) approach to the micro-services architecture, covering Hadoop, NoSQL, Data Lakes, the Lambda architecture, etc., and introducing all the fundamental concepts along the way.
&lt;/p&gt;

</atom:summary>        <description>&lt;p&gt;
For forty years we have been building &lt;i&gt;Information Systems&lt;/i&gt; in corporations in the same way, with the same architecture, and with very little innovation or change in paradigm:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On one side the &lt;b&gt;Operational Information System&lt;/b&gt; which sustains day-to-day operations and business activities. On the Operational Information Systems, the &lt;i&gt;3-tiers&lt;/i&gt; architecture and the relational database model (RDBMS - Relational Database Management System / SQL) have ruled for nearly 40 years.
&lt;/li&gt;
&lt;li&gt;On the other side the &lt;b&gt;Decision Support Information System&lt;/b&gt; - or &lt;i&gt;Business Intelligence&lt;/i&gt; or &lt;b&gt;Analytical Information System&lt;/b&gt; - where the &lt;i&gt;Data Warehouse&lt;/i&gt; architecture pattern has ruled for 30 years.
&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;legacy Information System Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Legacy Information Systems Architecture for 40 years&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Of course the technologies involved in building these systems have evolved over the decades: in the 80s, COBOL on IBM hosts ruled the Information Systems world, whereas Java quickly emerged as a standard in the 2000s, etc.
&lt;br&gt;
But while the technologies used to build these information systems evolved fast, their architecture, on the other hand - the way we design and build them - didn&apos;t change at all. The relational model ruled for 40 years along with the 3-tier model in the Operational world, while in the analytical world the Data Warehouse pattern was the only way to go for decades.
&lt;/p&gt;

&lt;p&gt;
The relational model is interesting and has been helpful for many decades. Its fundamental objective is to optimize storage space by ensuring each entity is stored only once (3rd normal form / normalization). It comes from a time when storage was very expensive.
&lt;br&gt;
But by imposing normalization and ACID transactions, it prevents horizontal scalability by design. An Oracle database, for instance, is designed to run on a single machine; it simply can&apos;t implement relational references and ACID transactions across a cluster of nodes.
&lt;br&gt;
Today storage is anything but expensive, yet Information Systems still have to deal with RDBMS limitations, mostly because... that was the only way we knew.
&lt;/p&gt;
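&lt;p&gt;
To make this trade-off concrete, here is a minimal sketch - Python with SQLite, the schema and names being purely illustrative, not taken from any system discussed here - of what normalization and ACID transactions buy you: each entity is stored only once, and a failed write can never leave a dangling reference.
&lt;/p&gt;

```python
import sqlite3

# Illustrative schema: the customer entity is stored once (normalization),
# and orders merely reference it.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    amount REAL)""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (1, 1, 99.90)")
conn.commit()

# ACID at work: the failing statement is rolled back as a whole,
# so an order can never reference a missing customer.
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (2, 42, 10.0)")  # no customer 42
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # -> 1
```

&lt;p&gt;
Guaranteeing exactly this kind of consistency is what becomes prohibitively hard once the tables are spread across a cluster of machines.
&lt;/p&gt;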

&lt;p&gt;
On the Decision Support Information System (BI / Analytical System), the situation is even worse. In Data Warehouses, data is &lt;i&gt;pushed&lt;/i&gt; along the way and transformed one step at a time: first into a staging database, then into the Data Warehouse database and finally into Data Marts, highly specialized towards specific use cases.
&lt;br&gt;
For a long time we didn&apos;t have much of a choice, since implementing such analytics in a &lt;i&gt;pull&lt;/i&gt; way (the data lake pattern) was impossible; we simply didn&apos;t have the proper technology. The only way to support high volumes of data was to &lt;i&gt;push&lt;/i&gt; daily increments through these complex transformation steps every night, when the workload on the system is lower.
&lt;br&gt;
The problem with this &lt;i&gt;push&lt;/i&gt; approach is that it&apos;s utterly inflexible. One can&apos;t change one&apos;s mind along the way and quickly come up with a new type of data: working with daily increments means waiting 6 months to build a 6-month history. Not to mention that the whole process is amazingly costly to develop, maintain and operate.
&lt;/p&gt;
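&lt;p&gt;
The nightly push pipeline can be sketched as follows - a toy Python example with hypothetical stage names and fields. Note how the final Data Mart is hard-wired to answer a single question (revenue per day): this specialization is precisely where the inflexibility comes from.
&lt;/p&gt;

```python
# Toy "push" ETL: each nightly step transforms the daily increment and
# pushes it one stage further (staging -> warehouse -> data mart).
raw_increment = [
    {"ts": "2021-12-13T10:02:00", "customer": "alice", "amount": "12.50"},
    {"ts": "2021-12-13T11:47:00", "customer": "bob", "amount": "7.25"},
]

def to_staging(rows):
    # Staging: parse and clean the raw extract.
    return [{"day": r["ts"][:10], "customer": r["customer"],
             "amount": float(r["amount"])} for r in rows]

def to_warehouse(warehouse, staged):
    # Warehouse: append the daily increment to the full history.
    warehouse.extend(staged)
    return warehouse

def to_datamart(warehouse):
    # Data mart: an aggregate specialized for one use case (revenue per day).
    mart = {}
    for r in warehouse:
        mart[r["day"]] = mart.get(r["day"], 0.0) + r["amount"]
    return mart

warehouse = []
to_warehouse(warehouse, to_staging(raw_increment))
print(to_datamart(warehouse))  # -> {'2021-12-13': 19.75}
```

&lt;p&gt;
Answering a new question (say, revenue per customer) means writing, deploying and back-filling a new mart - and with daily increments only, the history builds up one day at a time.
&lt;/p&gt;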

&lt;p&gt;
So for a long time, RDBMSes and Data Warehouses were all we had.
&lt;/p&gt;

&lt;p&gt;
It took the Internet revolution, and the web giants hitting the limits of these traditional architectures, for something different to finally be considered. The &lt;b&gt;Big Data revolution&lt;/b&gt; has been the cornerstone of all the evolutions in Information System architecture we have witnessed over the last 15 years.
&lt;/p&gt;

&lt;p&gt;
The latest step in this software architecture evolution (or revolution) is micro-services, where the benefits originally specific to the analytical information system finally end up overflowing to the operational information system.
&lt;br&gt;
Where Big Data was originally a lot about scaling the computation along with the data topology - bringing the code to where the data is (a data-tier revolution) - today we are scaling everything, from individual components requiring heavy processing to message queues, etc.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/f74719c9-f6f5-4f48-9bf2-ae584c02525f&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;Microservices Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/f74719c9-f6f5-4f48-9bf2-ae584c02525f&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Example of modern IS architecture: Microservices&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
In this article, I want to present and discuss how Information System architectures evolved from the universal 3-tier (operational) / Data Warehouse (analytical) approach to the micro-services architecture, covering Hadoop, NoSQL, Data Lakes, the Lambda architecture, etc., and introducing all the fundamental concepts along the way.
&lt;/p&gt;



&lt;p&gt;
&lt;b&gt;Summary&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;

&lt;li&gt;&lt;a href=&quot;#sec1&quot;&gt;1. Introduction&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec2&quot;&gt;2. The Web giants and Big Data&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec21&quot;&gt;2.1 The Era of Power&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec22&quot;&gt;2.2 The Web Giants&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec23&quot;&gt;2.3 Data Deluge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec24&quot;&gt;2.4 The Moore Law&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec25&quot;&gt;2.5 The Death of the Moore Law&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec26&quot;&gt;2.6 Fundamentals of Big Data - the Web giants new paradigms&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec3&quot;&gt;3. The CAP Theorem&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec31&quot;&gt;3.1 The origins of NoSQL&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec311&quot;&gt;3.1.1 Flat files as data store&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec312&quot;&gt;3.1.2 RDBMS and the relational model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec313&quot;&gt;3.1.3 Criticism of the relational model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec32&quot;&gt;3.2 Horizontal scalability&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec321&quot;&gt;3.2.1 Scaling up&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec322&quot;&gt;3.2.2 Scaling out&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec33&quot;&gt;3.3 Data Distribution&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec34&quot;&gt;3.4 Properties of a distributed system&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec341&quot;&gt;3.4.1 Eventual consistency&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec4&quot;&gt;4. NoSQL / NewSQL&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec41&quot;&gt;4.1 NoSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec42&quot;&gt;4.2 NewSQL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec5&quot;&gt;5. Hadoop and Data Lakes&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec51&quot;&gt;5.1 What is Hadoop?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec52&quot;&gt;5.2 Hadoop Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec53&quot;&gt;5.3 Hadoop Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec54&quot;&gt;5.4 The DataLake Architecture pattern&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec6&quot;&gt;6. Streaming Architectures&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec61&quot;&gt;6.1 Complex Event Processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec62&quot;&gt;6.2 Lambda Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec63&quot;&gt;6.3 Kappa Architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec7&quot;&gt;7. Big Data 2.0&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec71&quot;&gt;7.1 Alternatives to Hadoop &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec72&quot;&gt;7.2 Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec8&quot;&gt;8. Micro-services&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec81&quot;&gt;8.1. Micro-services discussion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec9&quot;&gt;9. Conclusion&lt;/a&gt;&lt;/li&gt;


&lt;/ul&gt;


&lt;a name=&quot;sec1&quot;&gt;&lt;/a&gt;
&lt;h2&gt;1. Introduction &lt;/h2&gt;

&lt;p&gt;
As stated in the summary above, the way we build information systems hasn&apos;t really evolved in decades. The technologies used underneath have evolved, of course - a long way from COBOL to Java and Angular - but the architectures in use - the &lt;i&gt;3-tier model&lt;/i&gt; on the operational information system and the &lt;i&gt;data warehouse pattern&lt;/i&gt; on the decision support system (a.k.a. the analytical system) - haven&apos;t evolved in more than 30 years.
&lt;br&gt;
The &lt;i&gt;Software Architecture&lt;/i&gt; is defined as the set of &lt;b&gt;principal design decisions&lt;/b&gt; about a system. Software architecture is, in a sense, the blueprint for the system&apos;s construction and evolution. Design decisions encompass the following aspects of the system under development: structure, behaviour, interactions and non-functional properties. (Taylor 2010)
&lt;br&gt;
And then again, the technologies under the hood - from the operating systems to the user interfaces through the programming languages - have evolved drastically. We all remember 3270 green-on-black terminal screens and can only marvel at the evolution to the fancy HTML5/Bootstrap screens we see today.
&lt;br&gt;
But the design of the Information system components, their interactions and the technical processes in between didn&apos;t evolve at all!
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;legacy Information System Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Information Systems Architecture for 40 years&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
I find it amazing to consider that if you put COBOL, 3270 and a few terms like these on this schema instead of the web elements, you literally get what the high-level architecture schema would have been 40 years ago.
&lt;br&gt;
As stated above, RDBMS - Relational Database Management Systems - have a lot of limits and some benefits, namely the standardized querying language - SQL - and the optimization of storage space. But in today&apos;s digital world, the benefits no longer outweigh the drawbacks, chief among them the inability to scale.
&lt;br&gt;
The Data Warehouse pattern, in use for 30 years on the analytical Information System, is also a poor fit for today&apos;s pace of development of digital services. It is much too inflexible, not to mention the cost of developing and maintaining it.
&lt;/p&gt;

&lt;p&gt;
It took the web giants facing the limits of these architecture paradigms, and inventing new ways of building information systems, for the way we build them in corporations to finally evolve. The first evolutions came on the analytical system side with Big Data technologies and later overflowed to the operational IS side with NoSQL, streaming architectures and eventually micro-services.
&lt;/p&gt;

&lt;p&gt;
In this article I want to present these evolutions from a historical perspective. We&apos;ll start with the Web giants and the Big Data revolution, cover NoSQL and Hadoop, run through the Lambda and Kappa architectures, and end up discussing Kubernetes and micro-services.
&lt;/p&gt;


&lt;a name=&quot;sec2&quot;&gt;&lt;/a&gt;
&lt;h2&gt;2. The Web giants and Big Data&lt;/h2&gt;

&lt;p&gt;
The web giants have been the first to face the limits of traditional architectures in an unacceptable way. Can you imagine Google running their search engine on an IBM mainframe? Can you imagine what kind of machine that would require, and how much money (licensing fees) they would have to hand over to IBM every year to run such a host?
&lt;br&gt;
Can you imagine Amazon running their online retail business on an Oracle database, with hundreds of millions of users connected and querying the DB at any time? Can you imagine the price of a computer able to support such volumes of data and concurrent requests?
&lt;/p&gt;

&lt;p&gt;
The Web giants had to invent both &lt;i&gt;new data storage technologies&lt;/i&gt; and &lt;i&gt;programming paradigms&lt;/i&gt; to run their business and support their volume of activities.
&lt;br&gt;
But let&apos;s start with the beginning.
&lt;/p&gt;


&lt;a name=&quot;sec21&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.1 The Era of Power&lt;/h3&gt;

&lt;p&gt;
As a prelude to introducing Big Data, let&apos;s have a look at the two situations below:
&lt;/p&gt;
&lt;br&gt;


&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/3acf2fe9-36ae-4b3b-a26c-81ba64780676&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;The Era of Power&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/3acf2fe9-36ae-4b3b-a26c-81ba64780676&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;The Era of Power&lt;/i&gt;
&lt;br&gt;
Source: &lt;a href=&quot;https://pages.experts-exchange.com/processing-power-compared&quot;&gt;https://pages.experts-exchange.com/processing-power-compared&lt;/a&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
These two computers are separated by only 30 years of technological evolution.
&lt;br&gt;
The computer on the left is a Cray 2. When it came out in 1985, it was a revolution: the fastest machine in the world, the first multi-processor computer from Seymour Cray, packed with unique technological innovations.
&lt;br&gt;
The computer on the right is a Samsung S6 smartphone, 30 years younger than the Cray 2.
&lt;/p&gt;

&lt;p&gt;
Only 30 years younger, it is around 15 times more powerful than the Cray 2. While the latter was far bigger than a human being, the Samsung S6 fits in the palm of a hand. The Cray 2 had 4 processors; the S6 packs 8.
&lt;br&gt;
Considering how far hardware technology progressed in a single generation is mind-blowing.
&lt;/p&gt;

&lt;p&gt;
Another comparison is even more impressive: 50 years before the Samsung S6, a computer was used to send people to the moon. The S6 is a million times more powerful, in terms of raw computing power, than that very computer.
&lt;br&gt;
We have today a device so small that it fits in our palm, incredibly powerful, which enables us to be &lt;i&gt;interconnected everywhere, all the time and for every possible need&lt;/i&gt;. This is the very definition of &lt;i&gt;digitization&lt;/i&gt;.
&lt;br&gt;
Smartphones are an amazing piece of technology, but much more impressive are the apps behind them and the services they enable us to use. This leads us to the Web Giants.
&lt;/p&gt;


&lt;a name=&quot;sec22&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.2 The Web Giants&lt;/h3&gt;

&lt;p&gt;
The Web giants have been the first to face the limits of traditional architectures and the usual way information systems were built.
&lt;/p&gt;
&lt;br&gt;


&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/ea67caee-b68f-48cc-9db3-8ca352e2f9cd&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 700px;&quot; alt=&quot;The Web Giants&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/ea67caee-b68f-48cc-9db3-8ca352e2f9cd&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;The Web Giants&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
And the revolution came from them. They had to find new technical solutions to business-critical challenges such as:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Google:&lt;/b&gt; How to index the whole web and keep the response time of any request below one second - while keeping search free for the user?&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Facebook:&lt;/b&gt; How to interconnect billions of users, display their feeds in near-real-time and understand how they use the product to optimize ads?&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Amazon:&lt;/b&gt; How to build a product recommendation engine for tens of millions of customers, on millions of products?&lt;/li&gt;
&lt;li&gt;&lt;b&gt;eBay:&lt;/b&gt; How to search eBay auctions, even with misspellings?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
These are just oversimplified examples, of course, and the challenges faced by the web giants go far beyond such simple cases.
&lt;br&gt;
These business challenges are backed by technical challenges such as:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to invert, in a reasonable time, a square matrix that doesn&apos;t fit in memory?&lt;/li&gt;
&lt;li&gt;How to query a database containing trillions of documents in real-time?&lt;/li&gt;
&lt;li&gt;How to read billions of multi-megabyte files in a reasonable time?&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
At the end of the day, it all boiled down to finding ways to manage volumes of data bigger by several orders of magnitude than the volumes of data that IT systems were used to manipulate so far.
&lt;/p&gt;


&lt;a name=&quot;sec23&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.3 Data Deluge&lt;/h3&gt;

&lt;p&gt;
So the most striking problem they had to solve is getting prepared and ready for the &lt;b&gt;Data Deluge!&lt;/b&gt;
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/5666cd1f-0852-4db7-8d56-910d659d8820&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 700px;&quot; alt=&quot;The Data Deluge&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/5666cd1f-0852-4db7-8d56-910d659d8820&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Data Deluge!&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Not only do we generate more and more data, but we have today the means and the technology to analyze, exploit and mine it and extract meaningful business insights.
&lt;br&gt;
The data generated by the company’s own systems can be a very interesting source of information regarding customer behaviours, profiles, trends, desires, etc. - but so can external data: Facebook, Twitter logs, etc.
&lt;/p&gt;


&lt;a name=&quot;sec24&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.4 The Moore Law&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;The Moore Law: &lt;/b&gt;&lt;i&gt;&quot;The number of transistors and resistors on a chip doubles every 24 months&quot;&lt;/i&gt; (Gordon Moore, 1965)
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/e5e9bb30-1ca4-47f6-bd06-f3403dd38fd2&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;The Moore Law&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/e5e9bb30-1ca4-47f6-bd06-f3403dd38fd2&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;The Moore Law - &lt;b&gt;Click to enlarge&lt;/b&gt;&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
For a long time, the increasing volume of data to be handled by a corporation&apos;s Information System was not an issue at all.
&lt;br&gt;
The volume of data increases, the number of users increases, etc., but processing capacity increases as well, sometimes even faster.
&lt;br&gt;
The Moore Law was there to cover our ass. The corporate CTO just had to buy a new machine every few years to host the Information System.
&lt;/p&gt;
&lt;br&gt;


&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/c568d819-5337-424c-8d3f-f202abe75079&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 650px;&quot; alt=&quot;IT computing abilities exponential growth&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/c568d819-5337-424c-8d3f-f202abe75079&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;For 40 years, IT component capabilities grew exponentially&lt;/i&gt;
&lt;br&gt;
Source: &lt;a href=&quot;http://radar.oreilly.com/2011/08/building-data-startups.html&quot;&gt;http://radar.oreilly.com/2011/08/building-data-startups.html&lt;/a&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
This model held for a very long time. Costs were going down, computing capacities were rising; one simply needed to buy a new machine to absorb the load increase.
&lt;br&gt;
This is especially true in the mainframe world: there wasn’t even any need to make the architecture of the systems (COBOL, etc.) evolve for 30 years.
&lt;br&gt;
And even outside the mainframe world, the architecture patterns and styles we use in the operational IS world haven’t really evolved in the last 30 years - despite new technologies such as the Web, Web 2.0, Java, etc., of course. I’m speaking here only about architecture patterns and styles.
&lt;/p&gt;


&lt;a name=&quot;sec25&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.5 The Death of the Moore Law&lt;/h3&gt;

&lt;p&gt;
But everything has an end.
&lt;br&gt;
Let&apos;s consider a fifth dimension, too often left aside when considering the evolution of computer technologies and hardware architectures: the throughput of the connection between the data on disk (long-term storage) and the memory - i.e. mostly hard drive controllers, but also buses, etc.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/d37e83aa-67e1-4a8d-a0a0-d737c2964e4a&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;The death of the moore law&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/d37e83aa-67e1-4a8d-a0a0-d737c2964e4a&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;The death of the Moore Law&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
The issue: throughput has always evolved more slowly than capacity.
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;b&gt;How to read and write more and more data through a comparatively ever-narrower pipe?&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Throughput has become the biggest concern when &lt;b&gt;scaling&lt;/b&gt; computer / platform hardware &lt;b&gt;up&lt;/b&gt;: it did not progress in a way comparable to the four other dimensions.
&lt;br&gt;
We are able to store &lt;b&gt;more and more data&lt;/b&gt;, of course, but we are &lt;b&gt;less and less able to manipulate this data&lt;/b&gt; efficiently.
&lt;br&gt;
In practice, fetching all the data from disk into the RAM of a single computation machine to process it is becoming more and more difficult.
&lt;/p&gt;
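&lt;p&gt;
To make this concrete, here is a quick back-of-the-envelope computation (the capacity and throughput figures are illustrative assumptions, not measurements): streaming an entire multi-terabyte disk into memory takes hours.
&lt;/p&gt;

```python
# Back-of-the-envelope: how long does it take to read a full disk
# sequentially? (Illustrative figures, not measurements.)

def full_read_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Time in hours to stream an entire disk into memory."""
    capacity_mb = capacity_tb * 1_000_000  # 1 TB = 10^6 MB (decimal units)
    return capacity_mb / throughput_mb_s / 3600

# A hypothetical 10 TB spinning disk at ~200 MB/s sequential throughput:
hours = full_read_hours(10, 200)
print(f"{hours:.1f} hours")  # roughly 14 hours
```

&lt;p&gt;
Capacity keeps growing faster than throughput, so this figure only gets worse over time.
&lt;/p&gt;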



&lt;a name=&quot;sec26&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.6 Fundamentals of Big Data - the Web giants new paradigms&lt;/h3&gt;

&lt;p&gt;
In order to work around the limits of traditional architectures, the web giants invented new architecture paradigms and new ways of building information systems, leveraging three fundamental ideas:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/3fbee390-21ba-4be0-816b-5d01d8bae46f&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 750px;&quot; alt=&quot;Fundamentals of Big Data&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/3fbee390-21ba-4be0-816b-5d01d8bae46f&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Fundamentals of Big Data - the Web giants new paradigms&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
In details:
&lt;/p&gt;

&lt;ol&gt;

&lt;li&gt;
&lt;b&gt;Key idea 1 : distribution&lt;/b&gt; - Since it&apos;s impossible to fit the data in the RAM of one single machine, split it and distribute it on as many different machines as required. 
&lt;br&gt;
Distribution means &lt;i&gt;partitioning&lt;/i&gt; the dataset - sometimes also called &lt;i&gt;sharding&lt;/i&gt; it - but it also always means &lt;i&gt;replicating&lt;/i&gt; the partitions or shards. We&apos;ll see exactly why and how later.
&lt;/li&gt;

&lt;li&gt;
&lt;b&gt;Key idea 2 : Horizontal scalability&lt;/b&gt; - Just as we split the data, let&apos;s split the computation and distribute it on as many nodes as are required to support the workload, even if it means multiple datacenters.
&lt;/li&gt;

&lt;li&gt;
&lt;b&gt;Key idea 3 : Data tier revolution&lt;/b&gt; - So we distribute both the data on a cluster of computers - or nodes - and the processing as well. We end up using the data nodes as processing nodes. This is the data tier revolution, which is in complete opposition to what was usually done so far in traditional architectures: fetching the required data to the place where the computation occurs.
&lt;br&gt;
But it goes further than that.
&lt;br&gt;
Most of the time we end up distributing different types or categories of data. Every time a specific business process needs to compute something out of a specific piece of all this data, it&apos;s crucial to ensure the processing will happen on the very nodes where this specific piece of data is located. This is called &lt;b&gt;co-local processing&lt;/b&gt; or &lt;i&gt;data locality optimization&lt;/i&gt;.
&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;
As a summary, the web giants have designed new architectures and programming paradigms where distributing the data and the processing (ideally in a co-local way) on a cluster of nodes was the most fundamental principle.
&lt;/p&gt;
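&lt;p&gt;
The partitioning idea can be sketched in a few lines. This is a minimal, single-process illustration (the shard count, replica count and node names are arbitrary assumptions): a hash of the record key picks a shard, and each shard is placed on several distinct nodes.
&lt;/p&gt;

```python
import hashlib

NUM_SHARDS = 4
NUM_REPLICAS = 3
NODES = ["node1", "node2", "node3", "node4", "node5"]

def shard_for(key: str) -> int:
    """Deterministically map a record key to one of the shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replica_nodes(shard: int) -> list:
    """Place each shard's replicas on consecutive distinct nodes."""
    return [NODES[(shard + i) % len(NODES)] for i in range(NUM_REPLICAS)]

# Every key lands on exactly one shard, and that shard lives on 3 nodes:
s = shard_for("customer:42")
print(s, replica_nodes(s))
```

&lt;p&gt;
Real systems use more elaborate schemes (consistent hashing, range partitioning), but the principle is the same: the key alone tells every client which nodes hold the data.
&lt;/p&gt;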


&lt;a name=&quot;sec3&quot;&gt;&lt;/a&gt;
&lt;h2&gt;3. The CAP Theorem&lt;/h2&gt;

&lt;p&gt;
Moving from a mainframe world - where everything is on the same computer and the data to compute always fits in the memory of that computer - to a distributed-system world most definitely has benefits, but it also has consequences. That&apos;s the topic of this chapter.
&lt;/p&gt;


&lt;a name=&quot;sec31&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.1 The origins of NoSQL&lt;/h3&gt;

&lt;p&gt;
Let&apos;s start with a bit of history.
&lt;/p&gt;

&lt;a name=&quot;sec311&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.1.1 Flat files as data store&lt;/h4&gt;

&lt;p&gt;
In the early days of digital data, before 1960, the data within a Computer Information System was mostly stored in flat files (sometimes indexed), manipulated directly by higher-level software systems.
&lt;br&gt;
The primitives provided by the operating system were really very low level: basically just the possibility to read or write files or file increments.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a8eb454f-c28a-4cb9-a21b-c148ab9b1fac&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 350px;&quot; alt=&quot;An indexed flat file example&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a8eb454f-c28a-4cb9-a21b-c148ab9b1fac&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Indexed flat file&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Directly using flat files was cumbersome and painful. Different uncovered needs emerged at the time:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data isolation&lt;/li&gt;
&lt;li&gt;Access efficiency&lt;/li&gt;
&lt;li&gt;Data integrity&lt;/li&gt;
&lt;li&gt;Reducing the time required to develop brand new applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Addressing such needs by relying on indexed flat files required solutions to be implemented artificially by the applications using such files. 
&lt;br&gt;
It was highly difficult, inefficient and time consuming. And the wheel had to be re-invented all over again, every single time.
&lt;br&gt;
So something else was required.
&lt;/p&gt;

&lt;a name=&quot;sec312&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.1.2 RDBMS and the relational model&lt;/h4&gt;

&lt;p&gt;
So in 1969, Edgar F. Codd, a British computer scientist working for IBM, invented the relational model. In the relational model, business entities are modeled as &lt;i&gt;Tables&lt;/i&gt; and &lt;i&gt;Associations&lt;/i&gt; (relations).
&lt;br&gt;
The relational model is at the root of &lt;b&gt;RDBMS&lt;/b&gt; - Relational DataBase Management Systems - that ruled the Data Storage world for 30 years.
&lt;/p&gt;

&lt;p&gt;
The relational model was conceived to reduce redundancy in order to optimize disk space usage. At the time of its creation, disk storage was very expensive and limited, and the volume of data in Information Systems was rather small.
&lt;br&gt;
The relational model avoids redundancy to optimize disk space usage by guaranteeing:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Structure:&lt;/b&gt; using normal design forms and modeling techniques&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Coherence:&lt;/b&gt;  using transaction principles and mechanisms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
An example relational model, illustrating an Exam Grade Management application, would be as follows:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/288402d5-a103-4e0b-8c67-a211087c93bc&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 350px;&quot; alt=&quot;a relational model example&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/288402d5-a103-4e0b-8c67-a211087c93bc&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Relational Model Example&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
In this example, if we want to show the subject assigned to a student on his profile screen, we would need to:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract the personal data from the &quot;student&quot; table&lt;/li&gt;
&lt;li&gt;Fetch its subject id from the &quot;relation&quot; table&lt;/li&gt;
&lt;li&gt;Read the subject title from the &quot;subject&quot; table.&lt;/li&gt;
&lt;/ol&gt;
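&lt;p&gt;
The three lookups above are exactly what a SQL join does. A self-contained sketch with SQLite (the column names and sample values are invented for the illustration; the table names are the ones from the example model):
&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE student  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subject  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE relation (student_id INTEGER, subject_id INTEGER);
    INSERT INTO student  VALUES (1, 'Alice');
    INSERT INTO subject  VALUES (10, 'Distributed Systems');
    INSERT INTO relation VALUES (1, 10);
""")

# One query, three tables: the 3 steps listed above.
row = db.execute("""
    SELECT st.name, su.title
    FROM student st
    JOIN relation r  ON r.student_id = st.id
    JOIN subject  su ON su.id = r.subject_id
    WHERE st.id = 1
""").fetchone()
print(row)  # ('Alice', 'Distributed Systems')
```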

&lt;p&gt;
&lt;span style=&quot;color: red;&quot;&gt;
Why, oh why, separate all this information in different tables when in practice 99% of the time we want to fetch all of it together?
&lt;/span&gt;
&lt;/p&gt;


&lt;a name=&quot;sec313&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.1.3 Criticism of the relational model&lt;/h4&gt;

&lt;p&gt;
The relational model comes from a time where storage was expensive. The fundamental idea behind its design is rationalizing storage space by ensuring every piece of information is stored only once.
&lt;br&gt;
But nowadays, long-term storage space is not expensive at all anymore: a Terabyte of SSD storage costs no more than a few dozen dollars. Optimizing the storage space at all costs makes little sense today.
&lt;br&gt;
In addition, the relational model is not the best way to represent some information. Let&apos;s look at some examples.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/111e754b-ac70-44fd-a4ea-260d6c76c68c&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;other models&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/111e754b-ac70-44fd-a4ea-260d6c76c68c&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Other models&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;ul&gt;
&lt;li&gt;Tabular information naturally fits the relational model, and not only tabular information: every time we can naturally divide a business problem into well-defined and predefined entities and relations among them, the relational model is usually a good fit.&lt;/li&gt;
&lt;li&gt;But then think of other types of information, such as registration forms, product descriptions, etc. Such semi-structured data fits very poorly in the relational model.&lt;/li&gt;
&lt;li&gt;Molecular data or graph data, likewise, would be far better stored in very different types of databases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The web giants had to get away from the mainframe pattern. And once you challenge that - the very fundamental architecture pattern on which all information systems were built - why wouldn&apos;t you challenge all the rest, including the relational model?
&lt;br&gt;
We&apos;ll get back to this.
&lt;/p&gt;


&lt;a name=&quot;sec32&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.2 Horizontal scalability&lt;/h3&gt;

&lt;p&gt;
The mid and late 2000’s were times of major changes in the IT landscape. Hardware capabilities significantly increased and eCommerce and internet trade, in general, exploded.
&lt;br&gt;
Some internet companies - the &quot;Web giants&quot; (Yahoo!, Facebook, Google, Amazon, eBay, Twitter, ...) - pushed traditional databases to their limits. Those databases are by design hard to scale.
&lt;br&gt;
Traditional RDBMS and traditional architecture can only &lt;b&gt;scale up&lt;/b&gt;. And scaling up is tricky.
&lt;/p&gt;


&lt;a name=&quot;sec321&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.2.1 Scaling up&lt;/h4&gt;

&lt;p&gt;
With traditional RDBMSes, the only way to improve performance is to scale up, i.e. getting bigger servers (more CPU, more RAM, more disk, etc.). There&apos;s simply nothing else that can be done.
&lt;br&gt;
But one eventually hits a hard limit imposed by the current technology. 
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/5a4c7dd0-9d31-48f5-9960-d1513616ea7d&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;scaling up&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/5a4c7dd0-9d31-48f5-9960-d1513616ea7d&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
With traditional architectures and RDBMSes, all the workload happens on one single machine. And while running a few thousand operations or transactions on one single machine is perhaps possible, going much beyond that just doesn&apos;t work. The programming paradigms we use - mostly built around thread synchronization and context switches - make it impossible to run, for instance, a million threads effectively on one single machine.
&lt;/p&gt;

&lt;p&gt;
But it gets worse.
&lt;br&gt;
Imagine that a machine A with 4 CPUs, 64 GB RAM and a 1 TB hard drive costs 10&apos;000 USD.
&lt;br&gt;
Do you think that a machine B with twice the power - 8 CPUs, 128 GB RAM and a 2 TB hard drive - would cost double, hence 20&apos;000 USD?
&lt;br&gt;
&lt;b&gt;No!&lt;/b&gt; It would cost much more than that, perhaps four or five times the price, so more than 40&apos;000 USD.
&lt;/p&gt;

&lt;p&gt;
The price of an individual machine doesn&apos;t scale linearly with its processing power: it grows much faster than linearly!
&lt;/p&gt;
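&lt;p&gt;
Using the hypothetical prices from the example above (illustrative figures, not vendor quotes), the gap compounds quickly: each doubling of a single machine&apos;s specs multiplies its price by roughly four, while buying twice as many base machines only doubles the bill.
&lt;/p&gt;

```python
# Hypothetical prices from the example above (illustrative only).
BASE_PRICE = 10_000       # machine A: 4 CPUs, 64 GB RAM, 1 TB
SCALE_UP_FACTOR = 4       # doubling a single machine's specs ~4x its price

def scale_up_cost(doublings: int) -> int:
    """One ever-bigger machine."""
    return BASE_PRICE * SCALE_UP_FACTOR ** doublings

def scale_out_cost(doublings: int) -> int:
    """2^n commodity machines at the base price (ignoring volume discounts)."""
    return BASE_PRICE * 2 ** doublings

for n in range(4):
    print(f"{2**n}x capacity: up={scale_up_cost(n):>9,} USD  "
          f"out={scale_out_cost(n):>9,} USD")
```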


&lt;a name=&quot;sec322&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.2.2 Scaling out&lt;/h4&gt;

&lt;p&gt;
By rethinking the architecture of databases, the web giants have been able to make them scale at will, by adding more servers to clusters instead of upgrading the servers.
&lt;br&gt;
When scaling out, instead of buying bigger machines, one buys more machines and adds them to a processing cluster, working together on distributed data and processing.
&lt;br&gt;
The servers are not made of expensive, high-end hardware; they are qualified as &lt;i&gt;commodity hardware&lt;/i&gt; (or commodity servers).
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/9f763a29-a1f7-40cf-b6f6-c47e039eb5a0&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px;&quot; alt=&quot;scaling out&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/9f763a29-a1f7-40cf-b6f6-c47e039eb5a0&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
When scaling out, these limits vanish: one can add as many nodes as needed to a processing cluster.
&lt;br&gt;
And there&apos;s a cherry on the cake: recall the example of machine A above - buying 10 machines A costs even less than 10 times the price of a single machine A, since one can get volume discounts.
&lt;/p&gt;

&lt;p&gt;
The only drawback is that the application leveraging scaling out - or the information system as a whole - needs to be designed from the ground up for distribution. And there are constraints to this; we&apos;ll see them further in this article.
&lt;/p&gt;

&lt;p&gt;
Scaling out is also called horizontal scalability, while scaling up is called vertical scalability.
&lt;/p&gt;


&lt;a name=&quot;sec33&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.3 Data Distribution&lt;/h3&gt;

&lt;p&gt;
With most NoSQL databases, the data is not stored in one place (i.e. on one server); it is distributed among the nodes of the cluster. When created, an object is split into a number of shards - for instance 4 shards A, B, C, D - and each shard is assigned to a node in the cluster. 
&lt;br&gt;
This is called &lt;b&gt;sharding&lt;/b&gt; - or &lt;b&gt;partitioning&lt;/b&gt; - and the portion of data assigned to a node is called a shard - or a partition.
&lt;/p&gt;

&lt;p&gt;
Having more cluster nodes implies a higher risk of having some nodes crash, or of a network outage splitting the cluster in two. For this reason, and to avoid data loss, objects are also replicated across the cluster. The number of copies, called replicas, can be tuned; 3 replicas is a common figure.
&lt;br&gt;
Imagine that the specifications of a given computer indicate a 10% chance for the computer to experience some kind of hardware failure in its first year of operation. Now imagine you have 10 nodes like that one in a cluster: what is the probability that at least one of these nodes experiences a hardware failure? About 65% - and with 100 such nodes, a failure becomes a near certainty.
&lt;/p&gt;
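&lt;p&gt;
The arithmetic behind that intuition is simple (10% annual failure probability per node is the figure assumed above, and the nodes are assumed to fail independently):
&lt;/p&gt;

```python
def p_at_least_one_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1 - (1 - p_node) ** n_nodes

# 10 nodes, each with a 10% yearly failure probability:
print(f"{p_at_least_one_failure(0.10, 10):.0%}")   # 65%
# With 100 such nodes, a failure is a near certainty:
print(f"{p_at_least_one_failure(0.10, 100):.3%}")
```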

&lt;p&gt;
For this reason, when we start to distribute data on a cluster of multiple machines, we have to design for failures.
&lt;br&gt;
In data management, this means creating multiple copies of every shard in such a way that we maximize the chances of one of them always being available.
&lt;br&gt;
This is called &lt;b&gt;replication&lt;/b&gt;.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1d8730dc-6e30-48ad-a7fd-73a8d718f92a&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 250px;&quot; alt=&quot;data distribution&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1d8730dc-6e30-48ad-a7fd-73a8d718f92a&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;
We can see here that the object has been split into 4 shards A, B, C, D and that every shard has three replicas.
&lt;/p&gt;

&lt;p&gt;
The objects may move as nodes crash or as new nodes join the cluster, ready to take charge of some of the objects. Such events are usually handled automatically by the cluster; the operation of shuffling objects around to keep the data fairly distributed is called &lt;b&gt;rebalancing&lt;/b&gt;.
&lt;/p&gt;


&lt;a name=&quot;sec34&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.4 Properties of a distributed system&lt;/h3&gt;

&lt;p&gt;
In RDBMSes, we expect DB transactions to respect some fundamental properties, identified by &lt;b&gt;ACID&lt;/b&gt;: &lt;i&gt;Atomicity, Consistency, Isolation and Durability&lt;/i&gt;.
&lt;br&gt;
In distributed systems, we consider things a little differently and focus on the following properties:
&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Availability&lt;/b&gt;
&lt;br&gt;
Availability (or lack thereof) is a property of the database cluster. The cluster is available if a request made by a client is always acknowledged by the system, i.e. it is guaranteed to be taken into account.
&lt;br&gt;
That doesn’t mean that the request is processed immediately. It may be put on hold. But an &lt;i&gt;available&lt;/i&gt; system should at a minimum always acknowledge it immediately.
&lt;br&gt;
Practically speaking, availability is usually measured in percents. For instance, 99.99% availability means that the system is unavailable at most 0.01% of the time, that is, at most 53 min per year.
&lt;br&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;b&gt;Partition tolerance&lt;/b&gt;
&lt;br&gt;
Partition Tolerance is verified if a system made of several interconnected nodes can stand a partition of the cluster; if it continues to operate when one or several nodes disappear. This happens when nodes crash or when a network equipment is shut down, taking a whole portion of the cluster away.
&lt;br&gt;
Partition tolerance is related to availability and consistency, but it is still different. It states that the system continues to function internally (e.g. ensuring data distribution and replication), whatever its interactions with a client.
&lt;/li&gt;

&lt;li&gt;
&lt;b&gt;Consistency&lt;/b&gt;
&lt;br&gt;
When talking about distributed databases, like NoSQL, consistency has a meaning that is somewhat different than in the relational context.
&lt;br&gt;
It refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node being queried.
&lt;br&gt;
With many NoSQL databases, updates take a little time to propagate across the cluster. When an entity&apos;s value has just been created or modified, there is a short time span during which the entity is not consistent. However, the cluster guarantees that it eventually will be, once replication has occurred. This is called eventual consistency.
&lt;/li&gt;
&lt;/ul&gt;
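&lt;p&gt;
The downtime figures quoted under availability follow directly from the percentage; a quick check (pure arithmetic, ignoring leap years):
&lt;/p&gt;

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def max_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget allowed by a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

for level in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{level:.3%} -> {max_downtime_minutes(level):,.1f} min/year")
```

&lt;p&gt;
99.99% indeed allows roughly 53 minutes of downtime per year, as stated above.
&lt;/p&gt;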


&lt;p&gt;
These 3 properties, Consistency, Availability and Partition tolerance, are not independent. 
&lt;br&gt;
The CAP theorem - or Brewer’s theorem -  states that &lt;b&gt;a distributed system cannot guarantee all 3 properties at the same time&lt;/b&gt;.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/7e1bf188-d074-409d-a70e-26fff0d5e80c&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;The CAP theorem&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/7e1bf188-d074-409d-a70e-26fff0d5e80c&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
This is a theorem. That means it is formally true, but in practice it is less severe than it seems.
&lt;br&gt;
The system or a client can often choose CA, AP or CP according to the context, and &quot;walk&quot; along the chosen edge by appropriate tuning.
&lt;br&gt;
Partition splits happen, but they are rare events (hopefully).
&lt;/p&gt;

&lt;p&gt;
Traditional Relational DBMSes would be seen as CA - consistency is a must.
&lt;br&gt;
Many NoSQL DBMSes are AP - availability is a must. On big clusters, failures happen all the time, so these systems had better live with them. Consistency is only eventual.
&lt;/p&gt;


&lt;a name=&quot;sec341&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.4.1 Eventual consistency&lt;/h4&gt;

&lt;p&gt;
Consistency refers to the fact that all replicas of an entity, identified by a key in the database, have the same value at any given time, whatever the node being queried.
&lt;/p&gt;

&lt;p&gt;
With many NoSQL databases, the preferred working mode is AP and &lt;i&gt;all-the-time consistency&lt;/i&gt; is sacrificed. 
&lt;br&gt;
Favoring performance, updates take a little time to propagate across the cluster. When an entity’s value has just been created or modified, there is a short time span during which the entity is not consistent. One node being queried at that moment could show the previous value while another node at the same time would already show the new value.
&lt;br&gt;
However, the cluster guarantees that it will eventually be consistent, once replication has occurred. This is called &lt;b&gt;eventual consistency&lt;/b&gt; and it is an essential notion.
&lt;/p&gt;

&lt;p&gt;
While all-the-time consistency is sacrificed, &lt;b&gt;eventual consistency is a must and is guaranteed by most if not all NoSQL databases&lt;/b&gt;.
&lt;/p&gt;
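&lt;p&gt;
The behaviour described above can be simulated in a toy, single-process sketch (a deliberate simplification: one write lands on a single replica, and a separate propagation step copies it to the others):
&lt;/p&gt;

```python
class Replica:
    """One copy of an entity's value on one cluster node."""
    def __init__(self):
        self.value = None

# Three replicas of the same entity, as in the example cluster above.
replicas = [Replica() for _ in range(3)]

def write(value):
    """Apply the write to one replica; propagation to the others lags."""
    replicas[0].value = value

def propagate():
    """Anti-entropy step: copy the newest value to the remaining replicas."""
    for r in replicas[1:]:
        r.value = replicas[0].value

write("v2")
# Inconsistency window: different replicas answer differently.
print([r.value for r in replicas])   # ['v2', None, None]
propagate()
# Eventually consistent: every replica now agrees.
print([r.value for r in replicas])   # ['v2', 'v2', 'v2']
```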


&lt;a name=&quot;sec4&quot;&gt;&lt;/a&gt;
&lt;h2&gt;4. NoSQL / NewSQL&lt;/h2&gt;

&lt;p&gt;
NoSQL databases are a new type of database, emerging mostly from the web giants&apos; technologies, scaling out natively and renouncing the usual behaviours and features of RDBMS - Relational Database Management Systems.
&lt;/p&gt;


&lt;a name=&quot;sec41&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.1 NoSQL&lt;/h3&gt;

&lt;p&gt;
A NoSQL - originally referring to &quot;&lt;b&gt;not-SQL&lt;/b&gt;&quot; for &quot;non-relational&quot; - database provides a mechanism for storage and retrieval of data that is modeled in &lt;b&gt;means other than the tabular relations&lt;/b&gt; used in relational databases. 
&lt;br&gt;
Such databases have existed since the late 1960s, but the name &quot;NoSQL&quot; was only coined in the early 21st century, triggered by the needs of Web 2.0 companies.
&lt;br&gt;
NoSQL databases are increasingly used in Big Data and Real-Time Web applications.
&lt;br&gt;
NoSQL systems are also sometimes called &quot;Not only SQL&quot; to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.
&lt;br&gt;
(Wikipedia - &lt;a href=&quot;https://en.wikipedia.org/wiki/NoSQL&quot;&gt;https://en.wikipedia.org/wiki/NoSQL&lt;/a&gt;)
&lt;/p&gt;

&lt;p&gt;
The fundamental idea behind NoSQL is as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Because of the need to distribute data (Big Data), the Web giants have abandoned the whole idea of ACID transactions (only eventual consistency is possible).&lt;/li&gt;
&lt;li&gt;So if we drop ACID Transactions - which we always deemed to be so fundamental - why wouldn&apos;t we challenge all the rest - the relational model and table structure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
There are 4 common types of NoSQL databases:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Document-oriented&lt;/b&gt;, e.g. MongoDB, ElasticSearch&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Column-family&lt;/b&gt; (aka BigTable), e.g. Cassandra&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Key/Value pairs&lt;/b&gt;, e.g. Redis&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Graph&lt;/b&gt;, e.g. Neo4J&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/c5f80a99-798d-4be2-969a-67f55d95700f&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Types of NoSQL database&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/c5f80a99-798d-4be2-969a-67f55d95700f&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Document-oriented databases are really the most widespread, with market leaders such as MongoDB, ElasticSearch, CouchDB, etc.
&lt;br&gt;
Column-oriented databases are also widespread, with multiple good open-source solutions.
&lt;br&gt;
Key/Value pair databases are really &lt;b&gt;distributed caching&lt;/b&gt; products in the end. Multiple good solutions are available on the market, but most of them are proprietary software, sometimes with a limited open-source version (unfortunately).
&lt;br&gt;
In terms of graph-oriented databases, the lead player is Neo4J.
&lt;/p&gt;
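&lt;p&gt;
To illustrate how the same entity maps onto the first two families (the record shape and keys below are invented for the example): a document store keeps the entity as one self-contained, nested record, while a key/value store only sees opaque blobs behind keys.
&lt;/p&gt;

```python
import json

# Document-oriented (MongoDB / ElasticSearch style): one self-contained
# record; the subject is nested rather than normalized into its own table.
student_doc = {
    "_id": "student:1",
    "name": "Alice",
    "subject": {"id": 10, "title": "Distributed Systems"},
}

# Key/Value (Redis style): the store understands keys only; values are blobs.
kv_store = {"student:1": json.dumps(student_doc)}

fetched = json.loads(kv_store["student:1"])
print(fetched["subject"]["title"])  # Distributed Systems
```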

&lt;p&gt;
The following schema provides an illustration of the way data is structured and stored in these Databases:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/81727cc6-f37c-4295-b54d-d3adccb4a673&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;NoSQL data storage examples&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/81727cc6-f37c-4295-b54d-d3adccb4a673&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
The NoSQL landscape is a very rich ecosystem with hundreds of different products and solutions, and it keeps growing, with a new product appearing nearly every week.
&lt;/p&gt;


&lt;a name=&quot;sec42&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.2 NewSQL&lt;/h3&gt;

&lt;p&gt;
What is NewSQL ?
&lt;/p&gt;

&lt;p&gt;
NewSQL refers to relational databases that have adopted some of the NoSQL genes, thus exposing a relational data model and SQL interfaces on top of distributed, high-volume databases.
&lt;/p&gt;

&lt;p&gt;
NewSQL, contrary to NoSQL, enables an application to keep:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The relational view on the data&lt;/li&gt;
&lt;li&gt;The SQL query language&lt;/li&gt;
&lt;li&gt;Response times suited to transactional processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Some were built from scratch (e.g. VoltDB); others are built on top of a NoSQL data store (e.g. SQLFire, backed by GemFire, a key/value store).
&lt;/p&gt;

&lt;p&gt;
The current trend is for some proven NoSQL databases, like Cassandra, to offer a thin SQL interface, achieving the same purpose.
&lt;br&gt;
Generally speaking, the frontier between NoSQL and NewSQL is a bit blurry... SQL compliance is often sought as the key to integrating legacy SQL software (ETL, reporting) with modern No/NewSQL databases.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/d1973df4-a625-4bdd-a98b-e17c7348172e&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;NewSQL examples&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/d1973df4-a625-4bdd-a98b-e17c7348172e&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;a name=&quot;sec5&quot;&gt;&lt;/a&gt;
&lt;h2&gt;5. Hadoop and Data Lakes&lt;/h2&gt;

&lt;p&gt;
In 2003 and 2004, Google published two papers: &quot;The Google File System&quot;, where they explained how they designed and implemented a distributed filesystem, and &quot;MapReduce&quot;, where they presented the distributed programming paradigm they used to process the data stored on GFS.
&lt;br&gt;
A few years later, Google published &quot;Bigtable&quot;, a paper presenting how they designed and implemented a &lt;i&gt;Column-oriented database&lt;/i&gt; on top of GFS and MapReduce.
&lt;/p&gt;

&lt;p&gt;
Doug Cutting, the leader of the Apache Lucene project at the time, discovered these papers and decided to work on an open-source implementation of these concepts.
&lt;br&gt;
Hadoop was born.
&lt;/p&gt;


&lt;a name=&quot;sec51&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.1 What is Hadoop ?&lt;/h3&gt;

&lt;p&gt;
Hadoop is an &lt;b&gt;Open Source Platform&lt;/b&gt; providing:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A distributed, scalable and fault tolerant storage system as a grid&lt;/li&gt;
&lt;li&gt;Initially, a single parallelism paradigm: MapReduce, reusing the storage nodes as processing nodes&lt;/li&gt;
&lt;li&gt;Since Hadoop V2 and YARN, the possibility to implement other parallelization paradigms on Hadoop&lt;/li&gt;
&lt;li&gt;Schemaless storage, optimized for sequential write-once / read-many-times access&lt;/li&gt;
&lt;li&gt;Querying and processing DSLs (Hive, Pig)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Hadoop was initially intended primarily for Big Data Analytics. It is the middleware underneath the Data Lake architecture pattern and intends to revolutionize the architecture of analytical information systems / decision-support systems.
&lt;br&gt;
Nowadays Hadoop can be an infrastructure for much more, such as Micro-services architecture (Hadoop V3) or Real-time Architectures.
&lt;/p&gt;

&lt;p&gt;
Hadoop comes in different distributions: the Apache Foundation, Cloudera, HortonWorks, MapR, IBM, etc.
&lt;/p&gt;


&lt;a name=&quot;sec52&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.2 Hadoop Overview&lt;/h3&gt;

&lt;p&gt;
Hadoop is designed as a layered software where technologies in every layer can be interchanged at will:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/690fe110-474d-411b-96a5-2757af8b3906&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Hadoop Overview&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/690fe110-474d-411b-96a5-2757af8b3906&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Distributed storage&lt;/b&gt;: Hadoop packages HDFS as the default underlying distributed filesystem. But the MapR Hadoop distribution, for instance, uses the MapR filesystem instead.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Parallel Computing Framework / MapReduce Processing Engine&lt;/b&gt;: In Hadoop V1, MapReduce was the only parallel computing paradigm available on top of Hadoop, taking care of the processing distribution as well as the resources negotiation on the Hadoop cluster. 
&lt;br&gt;
With Hadoop 2.0, the MapReduce paradigm has been split from the resource negotiation part, which became YARN - Yet Another Resource Negotiator. With this split, it has become possible to use Hadoop with different parallel processing constructs and paradigms, MapReduce becoming just one possibility among others.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Machine Learning / Processing&lt;/b&gt;: This is in the end the most essential layer on top of Hadoop core. Hadoop is designed first and foremost for Big Data Analytics. There are numerous solutions that were initially either implemented on top of MapReduce or ported to MapReduce. 
&lt;br&gt;
Nowadays, with YARN, software no longer needs to be ported to MapReduce to run on Hadoop; it just needs to integrate with YARN.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Orchestration&lt;/b&gt;: Numerous different solutions can be used to operate Hadoop and orchestrate processes.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Querying&lt;/b&gt;: A lot of NoSQL / NewSQL databases have been implemented as Hadoop querying frameworks, such as HBase or Hive. Some more advanced tools, such as Pig, go beyond querying.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Reporting&lt;/b&gt;: Users have multiple choices of software specialized in building reports on the data in Hadoop. 
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;IS Integration&lt;/b&gt;: Integrating Hadoop in the Information System, specifically building data import / Export between Hadoop and the operational information system components is a key concern. Numerous different solutions have been developed for this and are packaged with most Hadoop distributions.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Supervision and Management&lt;/b&gt;: Most Hadoop distributions provide their own management tool. Some tools are available as Apache projects.
&lt;/li&gt;
&lt;/ul&gt;
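To make the MapReduce paradigm concrete, here is a minimal word-count sketch in plain Python (no Hadoop dependency; in a real cluster the framework distributes the map and reduce phases across nodes and handles the shuffle between them, but the programming model is the same):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then reduce: sum the counts per word."""
    counts = {}
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[word] = sum(count for _, count in group)
    return counts

docs = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = reduce_phase(map_phase(docs))
```

The sort between the two phases stands in for the distributed shuffle that the framework performs between mappers and reducers.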

&lt;p&gt;
Hadoop is a very large ecosystem of hundreds of different software components across all these different layers.
&lt;br&gt;
The most common ones would be as follows:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/29b960d3-03ef-4d5f-bce2-bb677989dc27&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Hadoop Components&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/29b960d3-03ef-4d5f-bce2-bb677989dc27&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
But then again there are really many more components than that in a typical Hadoop distribution.
&lt;br&gt;
Most Hadoop distributions are behemoth software stacks that would be very difficult to integrate and configure manually, which is the very reason behind their success.
&lt;br&gt;
Hadoop core on its own is fairly complex to install, configure and fine-tune, so when one needs only Hadoop core for a specific software stack (e.g. to run Spark), it is sometimes more appropriate to look for a lighter cluster management system such as &lt;i&gt;Apache Mesos&lt;/i&gt;; more on that later in this article.
&lt;/p&gt;


&lt;a name=&quot;sec53&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.3 Hadoop Architecture&lt;/h3&gt;

&lt;p&gt;
A simplified view of Hadoop core components deployment architecture would be as follows:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/217b1b5e-1555-46e2-8033-7e86f8de2f2a&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Hadoop Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/217b1b5e-1555-46e2-8033-7e86f8de2f2a&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Hadoop Architecture&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Since Hadoop 2, having two master nodes for high-availability and avoiding &lt;i&gt;single points of failure&lt;/i&gt; on the master components is highly advised.
&lt;br&gt;
The components from Hadoop core are deployed as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
The &lt;b&gt;HDFS Name node&lt;/b&gt; (and secondary name node) is the center piece of the HDFS File System. It acts as the HDFS Master and keeps the directory tree and tracks where on the cluster the file data is kept. The &lt;b&gt;HDFS Data Nodes&lt;/b&gt; act as slave processes, run on individual cluster nodes and take care of data storage.
&lt;/li&gt;
&lt;li&gt;
The &lt;b&gt;YARN Resource Manager&lt;/b&gt; (and secondary resource manager) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node &lt;b&gt;NodeManagers&lt;/b&gt; and the per-application &lt;b&gt;ApplicationMaster&lt;/b&gt;.
&lt;/li&gt;
&lt;li&gt;
The &lt;b&gt;MapReduce JobTracker&lt;/b&gt; is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data for &lt;b&gt;co-local processing optimization&lt;/b&gt;, or at least are in the same rack. Client applications submit jobs to the Job tracker. &lt;b&gt;MapReduce TaskTrackers&lt;/b&gt; run on individual cluster nodes, execute the tasks and report the status of tasks to the JobTracker.
&lt;/li&gt;
&lt;/ul&gt;


&lt;a name=&quot;sec54&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.4 The DataLake Architecture pattern&lt;/h3&gt;

&lt;p&gt;
From &lt;a href=&quot;https://en.wikipedia.org/wiki/Data_lake&quot;&gt;Wikipedia&lt;/a&gt;:
&lt;br&gt;
A data lake is a system or repository of data stored in its natural/raw format.
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
It is usually a single store of data, including raw copies of source system data, sensor data, social data, etc., as well as transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. 
&lt;/li&gt;
&lt;li&gt;
It can include structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
With the continued growth in scope and scale of analytics applications using Hadoop and other data sources, the vision of an enterprise data lake can become a reality.
&lt;br&gt;
In a practical sense, a data lake is characterized by three key attributes:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Collect everything&lt;/b&gt;. A data lake contains all data, both raw sources over extended periods of time as well as any processed data.
&lt;br&gt;
A data lake is characterized by a &lt;b&gt;Big Volume&lt;/b&gt; of data.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Dive in anywhere&lt;/b&gt;. A data lake enables users across multiple business units to refine, explore and enrich data on their terms.
&lt;br&gt;
In a Data Lake, one doesn&apos;t know &lt;i&gt;a priori&lt;/i&gt; the analytical structures.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Flexible access&lt;/b&gt;. A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
&lt;br&gt;
As a result, a data lake delivers maximum scale and insight with the lowest possible friction and cost.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The fundamental approach of Data Lakes is to &lt;b&gt;pull&lt;/b&gt; the required data from the raw data storage, transforming it and processing it dynamically, as required per the use case being executed. It&apos;s entirely dynamic, queries and processes are designed on the fly. 
&lt;br&gt;
The storage principle is to store everything, all the raw data from the operational Information System as well as all the data produced by the IS, log files, usage metrics, etc. (Collect everything pattern).
&lt;br&gt;
Hadoop is kind of the &lt;i&gt;Operating System&lt;/i&gt; underneath the Data Lake pattern, and with Hadoop&apos;s power, there is nearly no analytics use case that can&apos;t be implemented in a dynamic fashion, requiring at worst a few hours of runtime processing before providing the expected results.
&lt;/p&gt;

&lt;p&gt;
This is in complete opposition to the Data Warehouse pattern, where the data was &lt;b&gt;pushed&lt;/b&gt; into statically predefined transformation pipelines. The most critical drawback of this approach is the impossibility of implementing a new use case quickly. Most of the time, when a corporation required a new KPI to be computed by the analytical system and the required data was not already collected for another use case, it was impossible to provide it quickly, requiring for instance a 6-month wait before being able to provide the KPI over a 6-month period.
&lt;br&gt;
Hadoop finally made it possible, at a cheap cost, to get away from this &lt;i&gt;push&lt;/i&gt; pattern. 
&lt;/p&gt;
&lt;br&gt;


&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/eb3fe0db-418c-48df-a404-146963fb779e&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;DataLake Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/eb3fe0db-418c-48df-a404-146963fb779e&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;DataLake Architecture&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
The &lt;i&gt;Data Lake&lt;/i&gt; architecture pattern and its Hadoop engine form a tremendous opportunity to finally get away from the &lt;i&gt;Data Warehouse pattern&lt;/i&gt;.
&lt;br&gt;
But there are pitfalls of course, and many corporations experienced them the hard way.
&lt;br&gt;
It has been stated so much everywhere that data can be incorporated &quot;as is&quot; into data lakes that way too many corporations took it too literally, forgetting one essential aspect that applies even to Data Lakes.
&lt;br&gt;
A minimum of data cleaning, cleansing and preparation is always required. The most crucial aspect that can never be neglected is the need to always have proper &lt;b&gt;correlation ID&lt;/b&gt;s in every single piece of data ingested into a data lake. 
&lt;br&gt;
Without correlation IDs, data is unusable. And your Data Lake turns into a &lt;i&gt;Data Swamp&lt;/i&gt;.
&lt;/p&gt;
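To illustrate why correlation IDs matter, here is a minimal sketch of correlating two raw feeds ingested in a data lake (all dataset and field names are hypothetical); records lacking the shared ID simply cannot be joined to anything:

```python
# Two hypothetical raw feeds landed in the data lake "as is".
web_clicks = [
    {"correlation_id": "u42", "page": "/pricing"},
    {"correlation_id": "u17", "page": "/home"},
]
crm_records = [
    {"correlation_id": "u42", "segment": "enterprise"},
]

def join_on_correlation_id(left, right):
    """Correlate two raw datasets on their shared correlation ID."""
    index = {r["correlation_id"]: r for r in right}
    joined = []
    for row in left:
        match = index.get(row["correlation_id"])
        if match:  # without the shared ID, no correlation is possible
            joined.append({**row, **match})
    return joined

enriched = join_on_correlation_id(web_clicks, crm_records)
```

Drop the `correlation_id` field from either feed and the join returns nothing: that is, in miniature, how a Data Lake turns into a Data Swamp.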


&lt;a name=&quot;sec6&quot;&gt;&lt;/a&gt;
&lt;h2&gt;6. Streaming Architectures&lt;/h2&gt;

&lt;p&gt;
Streaming data refers to data that is continuously generated, usually in high volumes and at high velocity. A streaming data source would typically consist of a stream of logs that record events as they happen - such as a user clicking on a link in a web page, or a sensor reporting the current temperature.
&lt;/p&gt;

&lt;p&gt;
A &lt;b&gt;streaming data architecture&lt;/b&gt; is a framework of software components built to ingest and process large volumes of streaming data from multiple sources. While traditional data solutions focused on writing and reading data in batches, a streaming data architecture consumes data immediately as it is generated, persists it to storage, and may include various additional components per use case - such as tools for real-time processing, data manipulation and analytics.
&lt;br&gt;
A &lt;b&gt;real-time&lt;/b&gt; system is an event-driven system that is available, scalable and stable, able to take decisions (actions) with a latency defined as &lt;i&gt;below the frequency of events&lt;/i&gt;.
&lt;/p&gt;

&lt;p&gt;
Streaming Architectures are not strictly related to the web giants and the Big Data revolution: CEP (Complex Event Processing) engines have existed since the early 2000s.
&lt;br&gt;
However, streaming architectures evolved significantly with products emerging from the needs of the web giants, first with the Lambda Architecture and then with the Kappa Architecture.
&lt;/p&gt;

&lt;!--
6.1 CEP RT architectures
6.2 Lambda Architecture
6.3 Kappa architecture
--&gt;

&lt;a name=&quot;sec61&quot;&gt;&lt;/a&gt;
&lt;h3&gt;6.1 Complex Event Processing&lt;/h3&gt;

&lt;p&gt;
From &lt;a href=&quot;https://en.wikipedia.org/wiki/Complex_event_processing&quot;&gt;Wikipedia&lt;/a&gt;:
&lt;br&gt;
Complex event processing, or CEP, consists of a set of concepts and techniques developed in the early 1990s for processing real-time events and extracting information from event streams as they arrive. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) in real-time situations and respond to them as quickly as possible.
&lt;/p&gt;

&lt;p&gt;
In a &lt;i&gt;Complex Event Processing Architecture&lt;/i&gt;:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Historical data is regularly and consistently updated with live data.&lt;/li&gt;
&lt;li&gt;Live data is available to the end user.&lt;/li&gt;
&lt;li&gt;
Both types of data (historical and live) are not necessarily presented consistently to the end user.
&lt;ul&gt;
&lt;li&gt;Both sets of data can have their own screens or even applications.&lt;/li&gt;
&lt;li&gt;&lt;i&gt;A consistent view on both sets of data would be proposed by Lambda Architecture (next topic in this presentation)&lt;/i&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/3f822b48-b325-4d2b-a36e-7245dceb80f2&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Complex Event Processing&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/3f822b48-b325-4d2b-a36e-7245dceb80f2&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Complex Event Processing&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
A few notes on typical CEP deployments, in a raw fashion:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;The rules GUI&lt;/b&gt; is often a user friendly editor supporting &lt;i&gt;hot&lt;/i&gt; updates of rules and made available to business users.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;The capture middleware&lt;/b&gt; should support a very high throughput of thousands of events per second, just as the whole processing line, with negligible latency.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;The CEP engine&lt;/b&gt; needs to support a very high throughput as well, and usually a maximum latency of a few dozen to a few hundred milliseconds. Fault tolerance and state coherence are common concerns.&lt;/li&gt;
&lt;/ul&gt;
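At heart, a CEP engine evaluates rules over an event stream. Here is a toy sketch of a velocity rule of the kind used in payment monitoring (the thresholds and names are illustrative, not any particular CEP product's API):

```python
from collections import defaultdict, deque

class VelocityRule:
    """Toy CEP-style rule: flag an account with more than `max_events`
    events inside a sliding time window (thresholds are illustrative)."""

    def __init__(self, max_events=2, window_s=60):
        self.max_events = max_events
        self.window_s = window_s
        self.history = defaultdict(deque)  # account -> recent event timestamps

    def on_event(self, account, ts):
        """Return True (alert) if the rule fires for this event."""
        q = self.history[account]
        q.append(ts)
        while q and ts - q[0] > self.window_s:  # evict events outside the window
            q.popleft()
        return len(q) > self.max_events

rule = VelocityRule()
alerts = [rule.on_event("acc-1", t) for t in (0, 10, 20, 200)]
```

The third event falls inside the 60-second window with two predecessors and fires the rule; the fourth arrives after the window has emptied and does not.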

&lt;p&gt;
Complex Event Processing engines and architecture are heavily used in the industry in the world of real-time computing systems, such as trading systems, payment monitoring systems, etc.
&lt;br&gt;
Such engines are however quite a legacy technology and have limits in terms of analytics. Most if not all CEP engines on the market, even nowadays, are really some sort of evolved &lt;i&gt;rules-engines&lt;/i&gt;.
&lt;br&gt;
And that is the most common limit of CEP engines: they are really only about rules. Machine learning and AI use cases are limited on CEP engines by the difficulty these systems have in deriving features that require correlation with large historical datasets. 
&lt;/p&gt;

&lt;p&gt;
The rise of Big Data analytics technologies has opened the door to much more advanced analytics use cases in real time. The Lambda and Kappa Architectures are much more recent approaches to real-time analytics.
&lt;/p&gt;


&lt;a name=&quot;sec62&quot;&gt;&lt;/a&gt;
&lt;h3&gt;6.2 Lambda Architecture&lt;/h3&gt;

&lt;p&gt;
The Lambda Architecture, first proposed by Nathan Marz, attempts to provide a combination of technologies that together provide the characteristics of a web-scale system that satisfies requirements for availability, maintainability, fault-tolerance and low-latency.
&lt;/p&gt;

&lt;p&gt;
Quoting &lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_architecture&quot;&gt;Wikipedia&lt;/a&gt;: &quot;&lt;i&gt;Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. 
&lt;br&gt;
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. 
&lt;br&gt;
The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 50px; &quot; alt=&quot;Lambda Symbol&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data. 
&lt;br&gt;
The efficiency of this architecture shows in increased throughput, reduced latency and negligible errors. When we mention data processing here, we use the term to mean high-throughput, low-latency, near-real-time processing.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/0b993a55-6c71-4eab-811d-426e084385b6&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Lambda Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/0b993a55-6c71-4eab-811d-426e084385b6&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
As new data is introduced to the system, it is processed simultaneously by both the batch layer and the speed layer. The batch layer is an append-only repository containing unprocessed raw data. It periodically or continuously runs jobs that create views of the batch data: aggregations or representations of the most up-to-date versions. These batch views are sent to the serving layer, where they are available for analytic queries.
&lt;br&gt;
At the same time that data is being appended to the batch layer, it is simultaneously streaming into the speed layer. The speed layer is designed to allow queries to reflect the most up-to-date information, which is necessary because the serving layer&apos;s views can only be created by relatively long-running batch jobs. The speed layer computes only the data needed to bring the serving layer&apos;s views up to real time, for instance calculating totals for the past few minutes that are missing in the serving layer&apos;s view.
&lt;br&gt;
By merging data from the speed and serving layers, low-latency queries can include data based on computationally expensive batch processing and yet include real-time data. In the Lambda Architecture, the raw source data is always available, so redefinition and re-computation of the batch and speed views can be performed on demand. The batch layer provides a big data repository for machine learning and advanced analytics, while the speed and serving layers provide a platform for real-time analytics.
&lt;br&gt;
The Lambda Architecture provides a useful pattern for combining multiple big data technologies to achieve multiple enterprise objectives.
&lt;/p&gt;
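The query-time merge of batch and speed views can be sketched in a few lines (all keys and numbers are illustrative): the serving layer holds totals computed by the last batch run, and the speed layer holds increments for events that arrived since that run.

```python
# Precomputed by the last (long-running) batch job, held in the serving layer.
batch_view = {"page_views:/home": 1000, "page_views:/pricing": 250}

# Computed incrementally by the speed layer for events since that batch run.
speed_view = {"page_views:/home": 7, "page_views:/about": 3}

def query(key):
    """Merge the batch view with the real-time speed view at query time."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

total_home = query("page_views:/home")
```

The query thus reflects both the computationally expensive batch result and the most recent events, which is exactly the promise of the Lambda pattern.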

&lt;p&gt;
There are numerous solutions nowadays to build every layer of a Lambda Architecture:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/56ffadc9-952a-4e2c-9867-6e8799ffcc26&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Lambda Architecture components&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/56ffadc9-952a-4e2c-9867-6e8799ffcc26&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
The takeaway here is that we have gone a long way since &lt;i&gt;Complex Event Processing&lt;/i&gt; architectures and we have now numerous solutions to build new generations of streaming architectures able to extend real-time streaming to much more advanced analytics use cases, embracing &lt;i&gt;Real-time Artificial Intelligence&lt;/i&gt; use cases.
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Pros and Cons of Lambda Architecture.&lt;/b&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Pros&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The batch layer of the Lambda architecture manages historical data with fault-tolerant distributed storage, which ensures a low possibility of errors even if the system crashes.&lt;/li&gt;
&lt;li&gt;It is a good balance of speed and reliability.&lt;/li&gt;
&lt;li&gt;Fault tolerant and scalable architecture for data processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Cons&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;It can result in coding overhead due to the need to implement the same analytics logic twice:&lt;/b&gt; once in the speed layer and once again in the batch layer.&lt;/li&gt;
&lt;li&gt;It re-processes the data every batch cycle, which is not beneficial in certain scenarios.&lt;/li&gt;
&lt;li&gt;Data modeled with the Lambda architecture is difficult to migrate or reorganize.&lt;/li&gt;
&lt;/ul&gt;


&lt;a name=&quot;sec63&quot;&gt;&lt;/a&gt;
&lt;h3&gt;6.3 Kappa Architecture&lt;/h3&gt;


&lt;p&gt;
In 2014, Jay Kreps started a discussion in which he pointed out some shortcomings of the Lambda architecture, which led the big data world to an alternative architecture that uses fewer code resources and performs well in certain enterprise scenarios where a multi-layered Lambda architecture seems like extravagance.
&lt;br&gt;
The Kappa Architecture should not be taken as a substitute for the Lambda architecture; on the contrary, it should be seen as an alternative to be used in circumstances where the batch layer is not necessary to meet the standard quality of service.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/fd9e852b-718f-48b3-8c89-c0ab9c596488&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 250px; &quot; alt=&quot;Lambda vs Kappa Symbol&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/fd9e852b-718f-48b3-8c89-c0ab9c596488&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Kappa architecture is a streaming-first architecture deployment pattern. With the most recent stream processing technologies (Kafka Streams, Flink, etc.), the interest and relevance of the batch layer tend to diminish. The streaming layer matches the computation abilities of the batch layer (ML, statistics, etc.) and stores data as it processes it. 
&lt;br&gt;
A batch layer would only be needed to kick-start the system on historical data, but Apache Flink, for instance, can very well do that.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1ad9d53b-e708-4f20-8fab-2f8b3988f0af&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Lambda Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/1ad9d53b-e708-4f20-8fab-2f8b3988f0af&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Kappa architecture can be deployed for those data processing enterprise models where:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple data events or queries are logged in a queue to be processed against a distributed file system storage or history.&lt;/li&gt;
&lt;li&gt;The order of the events and queries is not predetermined. Stream processing platforms can interact with the database at any time.&lt;/li&gt;
&lt;li&gt;It is resilient and highly available, as handling terabytes of storage is required for each node of the system to support replication.&lt;/li&gt;
&lt;/ul&gt;
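The streaming-first idea can be sketched minimally: a single code path processes an immutable, append-only log, and "reprocessing" simply means replaying that log through a new version of the code (no Kafka or Flink dependency here; the log contents and transformation are illustrative).

```python
# An append-only event log of (user, amount) events, as a Kafka topic
# would retain them.
event_log = [("u1", 5), ("u2", 3), ("u1", 2)]

def process(log, transform):
    """Replay the whole log through a stateful per-user aggregation."""
    state = {}
    for user, amount in log:
        state[user] = state.get(user, 0) + transform(amount)
    return state

v1 = process(event_log, lambda a: a)      # current version of the code
v2 = process(event_log, lambda a: a * 2)  # code changed: just replay the log
```

There is no separate batch code path: a code change is handled by replaying the retained log through the new processor, which is the Kappa answer to the Lambda architecture's dual implementation.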

&lt;p&gt;
&lt;b&gt;Pros and Cons of Kappa architecture&lt;/b&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Pros&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kappa architecture can be used to develop data systems that are online learners and therefore don’t need the batch layer.&lt;/li&gt;
&lt;li&gt;Re-processing is required only when the code changes.&lt;/li&gt;
&lt;li&gt;It can be deployed with fixed memory.&lt;/li&gt;
&lt;li&gt;It can be used for horizontally scalable systems.&lt;/li&gt;
&lt;li&gt;Fewer resources are required as the machine learning is done on a real-time basis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Cons&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The absence of a batch layer might result in errors during data processing or while updating the database, which requires an exception manager to reprocess the data or perform reconciliation.&lt;/li&gt;
&lt;/ul&gt;


&lt;a name=&quot;sec7&quot;&gt;&lt;/a&gt;
&lt;h2&gt;7. Big Data 2.0&lt;/h2&gt;


&lt;p&gt;
When Google published their papers in the early 2000s, it was quite a tsunami in the computer engineering world. Doug Cutting and the people behind Hadoop started working on Hadoop, but a lot of other initiatives kicked off as well.
&lt;br&gt;
With this approach of scaling information systems on commodity hardware, it turned out that massive computational systems suddenly became affordable, which sparked a whole new level of interest in distributed systems and distributed computing.
&lt;/p&gt;

&lt;p&gt;
There is now an entire range of engines that transcend the Hadoop framework and are dedicated to specific verticals (e.g. structured data, graph data, streaming data, etc.).
&lt;br&gt;
Nowadays, the NoSQL ecosystem provides incredibly efficient alternatives to HDFS in the storage layer. In the processing layer, there is a plethora of solutions available, from Kafka Streams to Apache Flink through Spark, etc.
&lt;br&gt;
On the resource negotiation side as well, multiple initiatives provide lightweight and interesting alternatives to Hadoop&apos;s YARN.
&lt;/p&gt;


&lt;a name=&quot;sec71&quot;&gt;&lt;/a&gt;
&lt;h3&gt;7.1 Alternatives to Hadoop &lt;/h3&gt;


&lt;p&gt;
A specific project kicked off by the University of California attracted quite a bit of attention at the time: the &lt;i&gt;Nexus&lt;/i&gt; project, later renamed &lt;i&gt;Mesos&lt;/i&gt; and donated to the Apache Software Foundation.
&lt;/p&gt;
&lt;p&gt;
Apache Mesos intended to be kind of the &lt;i&gt;Operating System&lt;/i&gt; of a computer cluster, in much the same way the &lt;i&gt;Linux Kernel&lt;/i&gt;, for instance, operates a single machine. Mesos intended to provide the same kind of primitives for resource management at the scale of a whole cluster.
&lt;br&gt;
Pretty early in the Mesos development story, support for Docker containers was added to enable users to deploy and scale applications in the form of Docker containers.
&lt;/p&gt;

&lt;p&gt;
A few years later, engineers inspired by &lt;i&gt;Google Borg&lt;/i&gt; created in their turn a &lt;i&gt;cloud container orchestration system&lt;/i&gt; for automating computer application deployment, scaling, and management. They named it Kubernetes.
&lt;br&gt;
With Mesos and Kubernetes gaining a lot of traction, since scaling applications in the form of Docker containers is extremely convenient, the Hadoop developers added support for deploying applications as Docker containers to YARN in Hadoop 3.0 as well.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/3cb0a91f-9959-4f4f-b55b-79b4c2d1c57a&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 650px;&quot; alt=&quot;Big Data 2.0&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/3cb0a91f-9959-4f4f-b55b-79b4c2d1c57a&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;
Nowadays in 2021, with Hadoop 3, these three technologies tend to converge towards the same possibilities. Hadoop 3 supports deploying jobs as Docker containers, just as Mesos and Kubernetes do.
&lt;br&gt;
Mesos and Kubernetes can use alternatives to HDFS such as Ceph, GlusterFS, Minio, (of course Amazon, Azure, ...) etc.
&lt;/p&gt;

&lt;p&gt;
So while Kubernetes was initially really oriented towards scaling applications in the &lt;b&gt;Operational Information System&lt;/b&gt; space, it now tends to overflow into analytics use cases as well.
&lt;br&gt;
And the other way around: while Hadoop is still first and foremost oriented towards deploying applications in the &lt;b&gt;Analytical Information System&lt;/b&gt; space, Hadoop 3 tends to be deployed increasingly in the operational space as well.
&lt;br&gt;
Apache Mesos can well be used on both sides and formed an interesting alternative to Hadoop YARN in both worlds for quite some time. Today, Apache Mesos, even though from my perspective an amazing piece of software, is not heavily maintained anymore and support for Mesos tends to vanish in the latest versions of software stacks.
&lt;/p&gt;

&lt;p&gt;
Kubernetes (and/or technologies based on Kubernetes) is today a market standard for the Operational IS just as Hadoop remains a market standard for the Analytical IS.
&lt;/p&gt;


&lt;a name=&quot;sec72&quot;&gt;&lt;/a&gt;
&lt;h3&gt;7.2 Kubernetes&lt;/h3&gt;

&lt;p&gt;
Kubernetes is an Open Source Platform providing:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated software application deployment, scaling, failover and management across a cluster of nodes.&lt;/li&gt;
&lt;li&gt;Management of application runtime components as Docker containers and application units as Pods.&lt;/li&gt;
&lt;li&gt;Multiple common services required for service location, distributed volume management, etc. (pretty much everything one requires to deploy applications on a Big Data cluster).&lt;/li&gt;
&lt;/ul&gt;
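As a small illustration of this declarative deployment model, here is a minimal Deployment manifest expressed as a Python dict (the field names follow the `apps/v1` Deployment schema; the application name, image and labels are illustrative). Kubernetes continuously reconciles the cluster towards this desired state.

```python
# Desired state: 3 replicas of an nginx container, selected by label.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,  # Kubernetes keeps 3 pods running at all times
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "nginx:1.25",
                        "ports": [{"containerPort": 80}],
                    }
                ]
            },
        },
    },
}
replica_count = deployment["spec"]["replicas"]
```

If a node fails, the scheduler restarts the missing pods elsewhere until the observed state matches the declared `replicas` again, which is what "automated failover and management" means in practice.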

&lt;p&gt;
Kubernetes is originally largely inspired by and even based on Google Borg, (one of) Google’s initial cluster management system(s). It was released as an open-source component by Google in 2014 and the first official release came in 2015.
&lt;/p&gt;

&lt;p&gt;
Kubernetes is emerging as a standard &lt;b&gt;Cloud Operating System&lt;/b&gt;.
&lt;br&gt;
It comes in the flavour of many distributions. The main ones are:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PKS (Pivotal Container Service)&lt;/li&gt;
&lt;li&gt;Red-Hat OpenShift&lt;/li&gt;
&lt;li&gt;Canonical Kubernetes&lt;/li&gt;
&lt;li&gt;Google / AWS / Azure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Kubernetes deployment architecture would be as follows:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/5593b148-4030-42a8-a170-314f8f5492ca&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Kubernetes Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/5593b148-4030-42a8-a170-314f8f5492ca&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Kubernetes Architecture&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
With the ever-growing popularity of containerized cloud-native applications, Kubernetes has become the leading orchestration platform to manage any containerized application.
&lt;br&gt;
Again, nowadays &lt;b&gt;Kubernetes&lt;/b&gt; is emerging as a market standard to scale the &lt;b&gt;Operational Information System&lt;/b&gt;, while &lt;b&gt;Hadoop&lt;/b&gt; largely remains a market standard to scale the &lt;b&gt;Analytical Information System&lt;/b&gt;.
&lt;/p&gt;



&lt;a name=&quot;sec8&quot;&gt;&lt;/a&gt;
&lt;h2&gt;8. Micro-services&lt;/h2&gt;

&lt;p&gt;
From &lt;a href=&quot;https://en.wikipedia.org/wiki/Microservices&quot;&gt;Wikipedia&lt;/a&gt;:
&lt;br&gt;
&lt;b&gt;Microservice architecture&lt;/b&gt; - a variant of the &lt;b&gt;Service-Oriented Architecture (SOA)&lt;/b&gt; structural style - arranges an application as a collection of loosely-coupled services. In a microservices architecture, services are fine-grained and the protocols are lightweight. Its characteristics are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services in a microservices architecture (MSA) are small in size, messaging-enabled, bounded by contexts, autonomously developed, independently deployable, decentralized and built and released with automated processes.&lt;/li&gt;
&lt;li&gt;Services are often processes that communicate over a network to fulfill a goal using technology-agnostic protocols such as HTTP.&lt;/li&gt;
&lt;li&gt;Services are organized around business capabilities.&lt;/li&gt;
&lt;li&gt;Services can be implemented using different programming languages, databases, hardware and software environments, depending on what fits best (Note JKE: this is not a strict requirement, e.g. Spring Boot).&lt;/li&gt;
&lt;/ul&gt;
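A sketch of such a small, independently deployable service exposing one business capability over HTTP, using only the Python standard library (the endpoint and payload are hypothetical):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class OrderService(BaseHTTPRequestHandler):
    """A deliberately tiny service exposing a single business capability:
    looking up an order. A hypothetical example, not a production server."""

    def do_GET(self):
        if self.path == "/orders/42":
            body = json.dumps({"id": 42, "status": "shipped"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

def serve(port=8000):
    """Run the service standalone on the given port."""
    HTTPServer(("127.0.0.1", port), OrderService).serve_forever()
```

In a real microservices architecture, each such service would be built, deployed and scaled independently, typically packaged as a container, and consumers would only depend on its HTTP contract, never on its implementation.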

&lt;p&gt;
From &lt;i&gt;Martin Fowler&lt;/i&gt;:
&lt;br&gt;
A Microservices-based architecture has the following properties:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Independent service lifecycles lead to a continuous delivery software development process: a change to a small part of the application requires rebuilding and redeploying only one or a few services.&lt;/li&gt;
&lt;li&gt;Adheres to principles such as fine-grained interfaces to independently deployable services, business-driven development (e.g. domain-driven design).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
As early as 2005, Peter Rodgers introduced the term &quot;Micro-Web-Services&quot; during a presentation at the Web Services Edge conference. The name of the architectural style itself was adopted in 2012.
&lt;br&gt;
Kubernetes has democratized the architectural approach; the two big players in this field are Spring Cloud and Kubernetes.
&lt;/p&gt;

&lt;p&gt;
A typical micro-services infrastructure architecture would be as follows:
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/9ef528e8-40a3-42a8-8d73-db7ea47ff4d6&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 850px;&quot; alt=&quot;Micro-services infrastructure Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/9ef528e8-40a3-42a8-8d73-db7ea47ff4d6&quot; /&gt;
&lt;/a&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;Micro-services Architecture&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;a name=&quot;sec81&quot;&gt;&lt;/a&gt;
&lt;h3&gt;8.1. Micro-services discussion&lt;/h3&gt;

&lt;p&gt;
Ask yourself: do you need microservices?
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Microservices are NOT Big Data!&lt;/b&gt; In Big Data Analytics, one needs to scale the processing linearly with the storage. Hadoop - or, for instance, Spark with Mesos on ElasticSearch - is designed to respect this key aspect: co-local processing optimization. Micro-services are not designed for this. The scaling approach in micro-services operates at the component / service level: heavy, resource-consuming services are scaled widely, while light services typically run on a few nodes, mostly for high-availability concerns.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;You don’t need microservices or Kubernetes to benefit from Docker&lt;/b&gt;. Docker is a tremendous way to package and deploy applications as a whole or individual application components. Unless you need horizontal scalability and high-availability, you might not need Kubernetes or a micro-services infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;You’re not scaling anything with synchronous calls.&lt;/b&gt; This is essential. A fundamental element in the design of a micro-services architecture is the use of asynchronous calls as the communication paradigm. Think about it: if services call each other synchronously, then scaling them is useless, since they will all synchronize with the slowest of them.&lt;/li&gt;
&lt;/ul&gt;
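&lt;p&gt;
The synchronous-calls point can be made concrete with a toy capacity model (a hedged sketch: the worker counts and latencies below are illustrative, not measurements of any real system):
&lt;/p&gt;

```python
# Toy capacity model illustrating why synchronous call chains defeat scaling.

def sync_chain_throughput(caller_threads, latencies_ms):
    """With synchronous calls, a caller thread stays blocked for the sum of
    all downstream latencies, so the chain behaves like one slow service."""
    return caller_threads * 1000.0 / sum(latencies_ms)

def async_pipeline_throughput(workers_per_stage, latencies_ms):
    """With asynchronous messaging, each stage drains its queue
    independently; the end-to-end rate is that of the slowest stage."""
    return min(w * 1000.0 / l for w, l in zip(workers_per_stage, latencies_ms))

latencies = [10, 10, 80]  # two light services, one heavy service (ms)
print(sync_chain_throughput(10, latencies))             # 100.0 req/s
print(async_pipeline_throughput([1, 1, 1], latencies))  # 12.5 req/s
# Scaling only the heavy stage to 8 workers lifts the whole pipeline:
print(async_pipeline_throughput([1, 1, 8], latencies))  # 100.0 req/s
```

&lt;p&gt;
In the synchronous chain, adding replicas of the fast services changes nothing: every request still waits on the slowest hop. In the asynchronous pipeline, scaling only the heavy stage raises the end-to-end rate.
&lt;/p&gt;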

&lt;p&gt;
As a consequence, don’t do microservices unless:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You &lt;b&gt;need&lt;/b&gt; independent service-level scalability (vs. storage / processing scalability - Big Data).&lt;/li&gt;
&lt;li&gt;You &lt;b&gt;need&lt;/b&gt; a strong SOA - Service-Oriented Architecture.&lt;/li&gt;
&lt;li&gt;You &lt;b&gt;need&lt;/b&gt; independent services lifecycle management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
There are various challenges to be accounted for when implementing micro-services:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Distributed caching vs reloading the world all over again.&lt;/b&gt; If every service is a fully independent application, then all the reference and master data need to be reloaded all over again by every service. This needs to be accounted for, and distributed caching needs to be considered.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Not all applications are fit for asynchronous communications.&lt;/b&gt; Some applications require fundamentally synchronous calls.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Identifying the proper granularity for services.&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;The enterprise architecture view is too coarse&lt;/li&gt;
&lt;li&gt;The application architecture view is too fine&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Data consistency without distributed transactions&lt;/b&gt;. Applications need to be designed with this in mind.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Weighing the overall memory and performance waste.&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;A Spring Boot stack + JVM + Linux Docker base for every single service?&lt;/li&gt;
&lt;li&gt;HTTP calls between layers?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
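&lt;p&gt;
The first challenge above - avoiding &quot;reloading the world&quot; in every service - is often mitigated with a read-through cache. A minimal sketch follows, where a plain in-process dict stands in for a shared cache such as Redis (the class name, loader and TTL are illustrative assumptions):
&lt;/p&gt;

```python
import time

class ReadThroughCache:
    """Minimal read-through cache: each service instance fetches reference
    data once, then serves it locally until the TTL expires. A shared store
    (e.g. Redis) would play this role in a real deployment; here a plain
    dict keeps the sketch self-contained."""

    def __init__(self, loader, ttl_seconds=300.0):
        self._loader = loader      # function key -> value (the "world")
        self._ttl = ttl_seconds
        self._entries = {}         # key -> (value, loaded_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self._entries.get(key)
        if hit is not None and (now - hit[1]) < self._ttl:
            return hit[0]
        value = self._loader(key)  # cache miss: load from the source of truth
        self._entries[key] = (value, now)
        return value

loads = []
def load_reference_data(key):
    loads.append(key)              # count trips to the master-data store
    return {"key": key, "version": 1}

cache = ReadThroughCache(load_reference_data, ttl_seconds=300.0)
cache.get("currencies", now=0.0)
cache.get("currencies", now=10.0)   # served locally, no reload
cache.get("currencies", now=400.0)  # TTL expired, reloaded
print(len(loads))                   # 2
```

&lt;p&gt;
Only two trips to the master-data store happen despite three reads; scaled across dozens of service instances, this is the difference between a quiet reference-data service and one that is hammered at every request.
&lt;/p&gt;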


&lt;a name=&quot;sec9&quot;&gt;&lt;/a&gt;
&lt;h2&gt;9. Conclusion&lt;/h2&gt;

&lt;p&gt;
We have come a long way in this article, from the web giants and their need to scale their information systems horizontally, the reasons behind it and the challenges it implies, down to micro-services and the scaling of individual Information System components.
&lt;br&gt;
The Web giants’ needs were initially driven by the massive amounts of data they manipulate and the necessity to scale processing linearly with storage distribution. Nowadays, cloud computing and SaaS - Software as a Service - on the cloud form a somewhat different set of needs.
&lt;br&gt;
Initial Big Data technologies were really oriented towards Data Analytics use cases and the Analytical Information System space. Later technologies, namely NoSQL / NewSQL and now Kubernetes and micro-services, are much more oriented towards scaling Operational Information System components or deploying them on the cloud.
&lt;/p&gt;

&lt;p&gt;
The strong frontier between the Operational IS and the Analytical IS will tend to vanish in the future.
&lt;/p&gt;
&lt;br&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/e177c115-cc09-4659-ae72-dfa906030a62&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 500px;&quot; alt=&quot;Operational / Analytical Frontier&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/e177c115-cc09-4659-ae72-dfa906030a62&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
With Hadoop 3, where YARN is able to manage and deploy Docker containers, Hadoop is increasingly no longer strictly limited to the Analytical IS.
&lt;br&gt;
On the other side, Kubernetes makes it increasingly feasible to scale heavy data analytics applications as well.
&lt;br&gt;
Even today, NoSQL, Streaming, Lambda and Kappa architectures are increasingly spilling over into the Operational IS and as such provide common ground for operational processes and analytical processes.
&lt;/p&gt;

</description>          </item>
    <item>
    <guid isPermaLink="true">https://www.niceideas.ch/roller2/badtrash/entry/powerful-big-data-analytics-platform</guid>
    <title>Powerful Big Data analytics platform fights financial crime in real time</title>
    <dc:creator>Jerome Kehrli</dc:creator>
    <link>https://www.niceideas.ch/roller2/badtrash/entry/powerful-big-data-analytics-platform</link>
        <pubDate>Fri, 3 Sep 2021 05:17:04 -0400</pubDate>
    <category>Big Data</category>
    <category>architecture</category>
    <category>bank</category>
    <category>banking</category>
    <category>banking-fraud</category>
    <category>big-data</category>
    <category>netguardians</category>
    <atom:summary type="html">&lt;p&gt;
&lt;i&gt;(Article initially published on &lt;a href=&quot;https://blog.netguardians.ch/powerful-big-data-analytics-platform-fights-financial-crime-in-real-time&quot;&gt;NetGuardians&apos; blog&lt;/a&gt;)&lt;/i&gt;
&lt;/p&gt;

&lt;p&gt;
NetGuardians overcomes the problems of analyzing billions of pieces of data in real time with a unique combination of technologies to offer unbeatable fraud detection and efficient transaction monitoring without undermining the customer experience or the operational efficiency and security in an enterprise-ready solution.
&lt;/p&gt;

&lt;p&gt;
When it comes to data analytics, the more data the better, right? Not so fast. That’s only true if you can crunch that data in a timely and cost-effective way.
&lt;/p&gt;

&lt;p&gt;
This is the problem facing banks looking to Big Data technology to help them spot and stop fraudulent and/or non-compliant transactions. With a window of no more than a hundredth of a millisecond to assess a transaction and assign a risk score, banks need accurate and robust real-time analytics delivered at an affordable price. Furthermore, they need a scalable system that can score not one but many thousands of transactions within a few seconds and grow with the bank as the industry moves to real-time processing.
&lt;/p&gt;

&lt;p&gt;
AML transaction monitoring might be simple on paper but making it effective and ensuring it doesn’t become a drag on operations has been a big ask. Using artificial intelligence to post-process and analyze alerts as they are thrown up is a game-changing paradigm, delivering a significant reduction in the operational cost of analyzing those alerts. But accurate fraud risk scoring is a much harder game. Some fraud mitigation solutions based on rules engines focus on what the fraudsters do, which entails an endless game of cat and mouse, staying up to date with their latest scams. By definition, this leaves the bank at least one step behind.
&lt;/p&gt;

&lt;p&gt;
At NetGuardians, rather than try to keep up with the fraudsters, we focus on what we know and what changes very little – customers’ behavior and that of bank staff. By learning “normal” behavior, such as typical time of transaction, size, beneficiary, location, device, trades, etc., for each customer and internal user, and comparing each new transaction or activity against those of the past, we can give every transaction a risk score.
&lt;/p&gt;
 
</atom:summary>        <description>&lt;p&gt;
&lt;i&gt;(Article initially published on &lt;a href=&quot;https://blog.netguardians.ch/powerful-big-data-analytics-platform-fights-financial-crime-in-real-time&quot;&gt;NetGuardians&apos; blog&lt;/a&gt;)&lt;/i&gt;
&lt;/p&gt;

&lt;p&gt;
NetGuardians overcomes the problems of analyzing billions of pieces of data in real time with a unique combination of technologies to offer unbeatable fraud detection and efficient transaction monitoring without undermining the customer experience or the operational efficiency and security in an enterprise-ready solution.
&lt;/p&gt;

&lt;p&gt;
When it comes to data analytics, the more data the better, right? Not so fast. That’s only true if you can crunch that data in a timely and cost-effective way.
&lt;/p&gt;

&lt;p&gt;
This is the problem facing banks looking to Big Data technology to help them spot and stop fraudulent and/or non-compliant transactions. With a window of no more than a hundredth of a millisecond to assess a transaction and assign a risk score, banks need accurate and robust real-time analytics delivered at an affordable price. Furthermore, they need a scalable system that can score not one but many thousands of transactions within a few seconds and grow with the bank as the industry moves to real-time processing.
&lt;/p&gt;

&lt;p&gt;
AML transaction monitoring might be simple on paper but making it effective and ensuring it doesn’t become a drag on operations has been a big ask. Using artificial intelligence to post-process and analyze alerts as they are thrown up is a game-changing paradigm, delivering a significant reduction in the operational cost of analyzing those alerts. But accurate fraud risk scoring is a much harder game. Some fraud mitigation solutions based on rules engines focus on what the fraudsters do, which entails an endless game of cat and mouse, staying up to date with their latest scams. By definition, this leaves the bank at least one step behind.
&lt;/p&gt;

&lt;p&gt;
At NetGuardians, rather than try to keep up with the fraudsters, we focus on what we know and what changes very little – customers’ behavior and that of bank staff. By learning “normal” behavior, such as typical time of transaction, size, beneficiary, location, device, trades, etc., for each customer and internal user, and comparing each new transaction or activity against those of the past, we can give every transaction a risk score.
&lt;/p&gt;
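&lt;p&gt;
The behavioral comparison described above can be sketched with a toy, single-feature score (a hedged illustration: the statistic, the 4-sigma cap and the use of the amount alone are my simplifications, not NetGuardians’ actual model, which combines many features such as time, device, beneficiary and location):
&lt;/p&gt;

```python
import statistics

def risk_score(amount, past_amounts):
    """Toy behavioral score: how far does this transaction's amount sit
    from the customer's own history, in standard deviations, squashed
    into [0, 1]?"""
    mean = statistics.fmean(past_amounts)
    stdev = statistics.pstdev(past_amounts)
    if stdev == 0:
        return 0.0 if amount == mean else 1.0
    z = abs(amount - mean) / stdev
    return min(z / 4.0, 1.0)  # 4 sigma or more -> maximal risk

history = [120, 80, 150, 95, 110, 105, 140, 90]  # customer's usual amounts
print(risk_score(100, history))   # in-pattern transaction: low score
print(risk_score(5000, history))  # wildly out of pattern: 1.0
```

&lt;p&gt;
The key property is that &quot;normal&quot; is defined per customer: the same 5000 amount would score low for a customer who routinely moves similar sums.
&lt;/p&gt;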
 

&lt;h2&gt;Billions of pieces of data&lt;/h2&gt;

&lt;p&gt;
To do this effectively means taking into account thousands of pieces of information every time a customer makes a transaction. Multiply that by the number of customers a bank has on its books, and it quickly gets to billions.
&lt;/p&gt;

&lt;p&gt;
Such high volumes would overwhelm most platforms, slowing the analytics to an unacceptable speed for the demands of real-time banking. At NetGuardians, we have solved this by using a combination of technologies that allows us to regularly batch process all the data for super-accurate models and supplement these batch models in real time by checking and adding smaller data sets as they arrive. This allows our software to accurately assess huge volumes of transactions in real time.
&lt;/p&gt;

&lt;p&gt;
The technologies we use are:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Kafka&lt;/li&gt;
&lt;li&gt;Elasticsearch&lt;/li&gt;
&lt;li&gt;Apache Mesos&lt;/li&gt;
&lt;li&gt;Apache Spark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
All are open-source and run on our proprietary &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/lambda-architecture-with-kafka-elasticsearch&quot;&gt;Lambda architecture&lt;/a&gt;-driven platform. Together, they make up a powerful and affordable solution for analyzing every transaction accurately in real time. In fact, our platform catches up to 99 percent of fraud, with 85 percent fewer false alerts, cutting investigation time by 95 percent compared with alternative rule-based solutions.
&lt;/p&gt;

&lt;p&gt;
While this is key, it’s not the best bit.
&lt;/p&gt;

&lt;p&gt;
At NetGuardians, we help our customers reap the benefits of cutting-edge and state-of-the-art open-source technologies without them suffering any of the drawbacks. We integrate these technologies, fine-tune and secure them and, critically, we implement enterprise-grade requirements on top. This means banks can use our solution out of the box.
&lt;/p&gt;
 

&lt;h2&gt;Enterprise-Ready Big Data Platform&lt;/h2&gt;

&lt;p&gt;
NetGuardians combines all the appropriate technologies in a way to make them work together 100 percent of the time, perfectly fine-tuned and secure, providing a bank everything it requires for an enterprise environment. This includes high availability, data and communication encryption, disaster recovery processes, state-of-the-art authorization, identification and authentication frameworks, single sign on, backup and restore procedures and much more. In this way, banks using our software enjoy the benefits of open source – easy integration and further development/fine-tuning – with the security and resilience of proprietary software.
&lt;/p&gt;

&lt;p&gt;
With NetGuardians, banking institutions get the best of both worlds. But the cherry on the cake is that our banks don’t have to do anything. The NetGuardians platform takes care of everything and operates itself automatically, benefitting from strong NoSQL and DevOps genes. And that is unique to us.
&lt;/p&gt;

&lt;p&gt;
Should it want to, though, a bank can create its own analytics on top of the open-source components on which the NetGuardians’ platform is built for its own use cases. A bank may want to use our version of Kafka for its own data-streaming use cases, for example, or it can open our 360 vision of the customer and user activities in ElasticSearch and expose that data through a secured API to in-house or third-party software. This allows it to use the data for whatever it wants or needs to do - perhaps AML use cases or enriching the CRM application with NetGuardians’ data about customers.
&lt;/p&gt;
 

&lt;h2&gt;The future of finance is real-time payments&lt;/h2&gt;

&lt;p&gt;
Typically, many banks access the anonymized data we collect and store on our platform as a financial crime data lake to enrich their own customer 360 views in front office applications with risk indicators and a consolidated view of customers’ activities on their account. This is important because real-time payments are growing fast. In 2020, 54 percent of consumers had used the real-time payment app &lt;a href=&quot;https://www.paymentsjournal.com/real-time-payments-everything-you-need-to-know/&quot;&gt;PayPal&lt;/a&gt;. Similar apps such as Venmo and Zelle are also growing fast – with the latter claiming 13 percent of consumers using its app in 2020, up from 1 percent in 2017.
&lt;/p&gt;

&lt;p&gt;
While retail payments are important, it’s in business that the big volumes lie, and in one survey 80 percent of businesses said they wanted real-time banking. Already this is translating into action - in the US, 2020 saw a fivefold increase year on year in financial institutions implementing real-time payments.
&lt;/p&gt;

&lt;p&gt;
Such huge growth means banks, big and small, will need affordable fraud detection in real time that can cope with these volumes. For the big banks, the solution will need to scale fast; for the smaller ones, they need a platform that can deliver accurate real-time risk scoring with smaller data sets. NetGuardians, with its unique combination of proprietary and open-source technologies, satisfies both. That is why banks worldwide – from Tier 1 to credit unions and co-ops – are turning to NetGuardians fraud-mitigation software to keep their customers’ cash safe.
&lt;/p&gt;

&lt;p&gt;
&lt;i&gt;(Article initially published on &lt;a href=&quot;https://blog.netguardians.ch/powerful-big-data-analytics-platform-fights-financial-crime-in-real-time&quot;&gt;NetGuardians&apos; blog&lt;/a&gt;)&lt;/i&gt;
&lt;/p&gt;
</description>          </item>
    <item>
    <guid isPermaLink="true">https://www.niceideas.ch/roller2/badtrash/entry/lambda-architecture-with-kafka-elasticsearch</guid>
    <title>Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)</title>
    <dc:creator>Jerome Kehrli</dc:creator>
    <link>https://www.niceideas.ch/roller2/badtrash/entry/lambda-architecture-with-kafka-elasticsearch</link>
        <pubDate>Fri, 4 May 2018 06:32:20 -0400</pubDate>
    <category>Big Data</category>
    <category>architecture</category>
    <category>big-data</category>
    <category>elasticsearch</category>
    <category>kafka</category>
    <category>lambda-architecture</category>
    <category>mesos</category>
    <category>spark</category>
    <atom:summary type="html">&lt;!-- Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)--&gt;

&lt;p&gt;
The Lambda Architecture, first proposed by Nathan Marz, attempts to provide a combination of technologies that together provide the characteristics of a web-scale system that satisfies requirements for availability, maintainability, fault-tolerance and low-latency.
&lt;/p&gt;

&lt;p&gt;
Quoting &lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_architecture&quot;&gt;Wikipedia&lt;/a&gt;: &quot;&lt;i&gt;Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. 
&lt;br&gt;
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. 
&lt;br&gt;
The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.&lt;/i&gt;&quot;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 60px; &quot; alt=&quot;Lambda Symbol&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
 
&lt;p&gt;
In my current company - NetGuardians - we detect banking fraud using several techniques, among which real-time scoring of transactions to compute a risk score.
&lt;br /&gt;
The deployment of Lambda Architecture has been a key evolution to help us evolve towards real-time scoring on the large scale. 
&lt;/p&gt;

&lt;p&gt;
In this article, I intend to present how we do Lambda Architecture in my company using Apache Kafka, ElasticSearch and Apache Spark with its extension Spark-Streaming, and what it brings to us.
&lt;/p&gt;
</atom:summary>        <description>&lt;!-- Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)--&gt;

&lt;p&gt;
The Lambda Architecture, first proposed by Nathan Marz, attempts to provide a combination of technologies that together provide the characteristics of a web-scale system that satisfies requirements for availability, maintainability, fault-tolerance and low-latency.
&lt;/p&gt;

&lt;p&gt;
Quoting &lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_architecture&quot;&gt;Wikipedia&lt;/a&gt;: &quot;&lt;i&gt;Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. 
&lt;br&gt;
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. 
&lt;br&gt;
The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.&lt;/i&gt;&quot;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 60px; &quot; alt=&quot;Lambda Symbol&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
 
&lt;p&gt;
In my current company - NetGuardians - we detect banking fraud using several techniques, among which real-time scoring of transactions to compute a risk score.
&lt;br /&gt;
The deployment of Lambda Architecture has been a key evolution to help us evolve towards real-time scoring on the large scale. 
&lt;/p&gt;

&lt;p&gt;
In this article, I intend to present how we do Lambda Architecture in my company using Apache Kafka, ElasticSearch and Apache Spark with its extension Spark-Streaming, and what it brings to us.
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Summary&lt;/b&gt;
&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec1&quot;&gt;1. Introduction&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec11&quot;&gt;1.1 NetGuardians&apos; key big data software components&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec12&quot;&gt;1.2 One ring to rule them all&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec13&quot;&gt;1.3 Real-time readiness&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec2&quot;&gt;2. Lambda Architecture&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec21&quot;&gt;2.1 Lambda Architecture principles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec22&quot;&gt;2.2 Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec23&quot;&gt;2.3 Drawbacks and difficulties of Lambda Architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec3&quot;&gt;3. Real-time computation with Lambda Architecture&lt;/a&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec4&quot;&gt;4. Conclusion&lt;/a&gt;
&lt;/li&gt;

&lt;/ul&gt;


&lt;a name=&quot;sec1&quot;&gt;&lt;/a&gt;
&lt;h2&gt;1. Introduction&lt;/h2&gt;


&lt;a name=&quot;sec11&quot;&gt;&lt;/a&gt;
&lt;h3&gt;1.1 NetGuardians&apos; key big data software components&lt;/h3&gt;

&lt;p&gt;
NG|Screener, NetGuardians&apos; flagship product, is a Big Data Analytics Platform aimed at preventing fraud on the large scale within Financial Institutions. 
&lt;br&gt;
Our platform manages and operates Big Data Analytics Use Cases detecting fraud attempts by analyzing user behaviours and financial transactions. Working in real time, it can block suspicious business events, e.g. financial transactions, to prevent fraud effectively.
&lt;/p&gt;

&lt;p&gt;
Our platform is built internally on four key Big Data Open Source Software components:
&lt;/p&gt;


&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Kafka Logo&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/203ed798-13b0-4d43-9f3b-4d6047ecd959&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
&lt;b&gt;Apache Kafka&lt;/b&gt;: Kafka is an open-source stream processing software aimed at providing a unified, high-throughput, low-latency platform for handling real-time data feeds.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;

&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;ElasticSearch Logo&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/26059d4e-069e-49d1-857d-50c428591557&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
&lt;b&gt;ElasticSearch&lt;/b&gt;: ElasticSearch is a distributed, real-time, RESTful search and analytics document-oriented storage engine. It lets one perform and combine many types of searches - structured, unstructured, geo, metric - in real time.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;

&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Mesos Logo&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/b38d48c0-7b52-40cd-ad86-203ae0735d6d&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
&lt;b&gt;Apache Mesos&lt;/b&gt;: Mesos is a distributed systems kernel that runs on every machine and provides applications with APIs for resource management and scheduling across entire datacenter and cloud environments.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;

&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Spark Logo&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/4c1d16c3-c660-42a3-9440-067587918a63&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
&lt;b&gt;Apache Spark&lt;/b&gt;: Spark is a fast and general engine for large-scale data processing. It provides programmers with an API functioning as a working set for distributed programs that offers a versatile form of distributed shared memory.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;a name=&quot;sec12&quot;&gt;&lt;/a&gt;
&lt;h3&gt;1.2 One ring to rule them all&lt;/h3&gt;

&lt;p&gt;
The choice of these specific components under the hood is not anecdotal. Running Apache Spark on Apache Mesos is really still cutting edge nowadays and the choice of Apache Kafka and ElasticSearch, in addition to the good fit with our use case, answers a very important need we have.
&lt;/p&gt;

&lt;p&gt;
We deploy our platform as much in tier 1 banks and big financial services providers as in small private banks in Switzerland or even small credit institutions in Africa. Some of our customers have a few thousand transactions daily, while others have dozens of millions of transactions per day. 
&lt;br&gt;
Considering that some of our Analytics use cases require a depth of analysis of several years, when we have billions of events to consider, we deploy our analytics platform on multi-node clusters, sometimes up to a few dozen computation and storage nodes within the cluster. On the other hand, when we work for small institutions with very low data volumes, we deploy it on a single small machine.
&lt;br&gt;
This need is at the very root of our technology choices: we needed technologies able to run efficiently on a single small machine while still being able to scale out to hundreds of nodes should we require that.
&lt;/p&gt;

&lt;p&gt;
ElasticSearch, Apache Spark, Apache Mesos and Apache Kafka have been designed from the ground up with this horizontal scalability in mind. But they have been implemented in such a way that they also run very well on a single small machine. 
&lt;br&gt;
This is pretty uncommon in the Big Data Technology / NoSQL family of products. For instance, Apache Hadoop most of the time performs very poorly on single machines.
&lt;/p&gt;

&lt;p&gt;
These products under the hood are key to sustaining our &quot;&lt;i&gt;one ring to rule them all&lt;/i&gt;&quot; approach. We develop one single platform that we can deploy everywhere, regardless of the volume of data of our customers.
&lt;/p&gt;

&lt;a name=&quot;sec13&quot;&gt;&lt;/a&gt;
&lt;h3&gt;1.3 Real-time readiness&lt;/h3&gt;

&lt;p&gt;
In addition to the unique scalability genes described above, ElasticSearch, Apache Kafka and Apache Spark provide our platform with another key feature.
&lt;/p&gt;

&lt;p&gt;
With ElasticSearch, real-time updating (fast indexing) is achievable through various functionalities and search / read response time can be astonishingly deterministic.
&lt;/p&gt;

&lt;p&gt;
Apache Kafka comes with the Kafka Stream extension. The Streams API, available as a Java library that is part of the official Kafka project, is the easiest way to write mission-critical real-time applications and microservices with all the benefits of Kafka&apos;s server-side cluster technology. 
&lt;br&gt;
Despite being a humble library, Kafka Streams directly addresses the two hardest problems in stream processing: 
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;event-at-a-time processing with millisecond latency and &lt;/li&gt;
&lt;li&gt;stateful processing including distributed joins and aggregations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Kafka enables us to implement fast processing of business events - most often financial transactions - in real time and in event-at-a-time mode, while dispatching micro-batches further to Spark Streaming.
The more complicated processing required by our Analytics use cases then occurs within Spark through the Spark Streaming extension.
&lt;/p&gt;
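&lt;p&gt;
Conceptually, event-at-a-time stateful processing looks as follows (a toy Python sketch mimicking the idea behind Kafka Streams’ keyed aggregations, not its API; the account names and amounts are illustrative):
&lt;/p&gt;

```python
def process_stream(events, handler, state=None):
    """Event-at-a-time processing: each record is handled the moment it
    arrives, against a keyed state store (here a plain dict)."""
    state = {} if state is None else state
    outputs = []
    for key, value in events:
        state[key] = handler(state.get(key), value)
        outputs.append((key, state[key]))  # one output per input event
    return outputs, state

# Running per-account transaction totals, updated per event:
events = [("acc1", 100), ("acc2", 50), ("acc1", 25)]
add = lambda acc, v: v if acc is None else acc + v
outputs, state = process_stream(events, add)
print(outputs)  # [('acc1', 100), ('acc2', 50), ('acc1', 125)]
print(state)    # {'acc1': 125, 'acc2': 50}
```

&lt;p&gt;
Every incoming event immediately produces an updated result, which is what gives the millisecond latency mentioned above.
&lt;/p&gt;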

&lt;p&gt;
Spark Streaming is able to process hundreds of thousands of records per node per second. When using Kafka as a source, it is able to consume nearly half a million records per node per second, which is striking. It also offers near-linear scaling ability, another great perk. 
&lt;br&gt;
Contrary to Kafka, Spark Streaming works using a micro-batch approach: received input streams are divided into small batches, which are processed by the Spark engine to return a processed stream of batches. 
&lt;br&gt;
The micro-batches can be as small as a few milliseconds, thus enabling sub-second latency while still ensuring a very high throughput and access to the whole power and versatility of Spark to implement high-level analytics use cases.
&lt;/p&gt;
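&lt;p&gt;
As a rough illustration of the micro-batch model, in plain Python (this is a toy with hypothetical names, not actual Spark Streaming, which distributes this work across a cluster):
&lt;/p&gt;

```python
# Illustrative sketch only: simulates the Spark Streaming micro-batch
# model in plain Python. The real engine slices the stream by time
# interval and processes each batch in parallel across the cluster.

def micro_batches(stream, batch_size):
    """Divide a received input stream into small batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the last, possibly smaller, batch

def process_batch(batch):
    # Stand-in for the Spark engine processing one micro-batch.
    return sum(batch)

stream = range(1, 11)  # pretend this is the incoming record stream
results = [process_batch(b) for b in micro_batches(stream, 3)]
```

&lt;p&gt;
Keeping the batches small bounds the latency, while processing whole batches at once is what sustains the very high throughput.
&lt;/p&gt;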

&lt;p&gt;
This real-time readiness of these components of our technology stack is key to deploying a Lambda Architecture within our platform.
&lt;/p&gt;

&lt;a name=&quot;sec2&quot;&gt;&lt;/a&gt;
&lt;h2&gt;2. Lambda Architecture&lt;/h2&gt;

&lt;p&gt;
When it comes to processing transactions in real-time, our platform provides a state-of-the-art implementation of a Lambda Architecture.
 &lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 60px; &quot; alt=&quot;Lambda Symbol&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
The Lambda architecture is a Big Data architecture that enables us to unify our real-time and batch analytics layers.
&lt;/p&gt;


&lt;a name=&quot;sec21&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.1 Lambda Architecture principles&lt;/h3&gt;

&lt;p&gt;
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
&lt;/p&gt;

&lt;p&gt;
At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data. 
&lt;br&gt;
The rise of lambda architecture is correlated with the growth of big data and real-time analytics.
&lt;/p&gt;


&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px; &quot; alt=&quot;Lambda Architecture&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/f36e9948-0053-4f07-a0ff-a1ebca0b5762&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
As new data is introduced to the system, it is processed simultaneously by both the batch layer and the speed layer. The batch layer is an append-only repository containing unprocessed raw data. The batch layer periodically or continuously runs jobs that create views of the batch data: aggregations or representations of the most up-to-date versions. These batch views are sent to the serving layer, where they are available for analytic queries.
&lt;br&gt;
At the same time that data is being appended to the batch layer, it is simultaneously streaming into the speed layer. The speed layer is designed to allow queries to reflect the most up-to-date information, which is necessary because the serving layer&apos;s views can only be created by relatively long-running batch jobs. The speed layer computes only the data needed to bring the serving layer&apos;s views up to real time, for instance calculating totals for the past few minutes that are missing from the serving layer&apos;s view.
&lt;br&gt;
By merging data from the speed and serving layers, low-latency queries can include data that is based on computationally expensive batch processing and yet include real-time data.
In the Lambda Architecture, the raw source data is always available, so redefinition and re-computation of the batch and speed views can be performed on demand. The batch layer provides a big data repository for machine learning and advanced analytics, while the speed and serving layers provide a platform for real-time analytics.
&lt;br&gt;
The Lambda Architecture provides a useful pattern for combining multiple big data technologies to achieve multiple enterprise objectives.
&lt;/p&gt;
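&lt;p&gt;
The consolidation performed at query time can be sketched as follows (an illustrative toy in plain Python with hypothetical names and data, not an actual serving-layer implementation):
&lt;/p&gt;

```python
# Illustrative sketch only: a serving layer merging a (slightly stale)
# batch view with the speed-layer view covering the most recent events.
# All names and figures are hypothetical.

batch_view = {"acct_A": 1000.0, "acct_B": 400.0}  # totals up to the last batch run
speed_view = {"acct_A": 25.0, "acct_C": 7.0}      # totals since the last batch run

def query(key):
    """Serving layer: consolidate batch and speed partial results."""
    return batch_view.get(key, 0.0) + speed_view.get(key, 0.0)
```

&lt;p&gt;
A query thus sees both the computationally expensive historical aggregate and the last few minutes of data, without waiting for the next batch run.
&lt;/p&gt;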

&lt;a name=&quot;sec22&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.2 Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)&lt;/h3&gt;

&lt;p&gt;
Lambda defines a big data architecture that allows pre-defined and arbitrary queries and computations on both fast-moving data and historical data.
&lt;br&gt;
Using Kafka, ElasticSearch, Spark and Spark Streaming, this is achieved with the following layout:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 800px; &quot; alt=&quot;Lambda Architecture - Software Components&quot; src=&quot;https://www.niceideas.ch/roller2/badtrash/mediaresource/45ee2683-eb9f-4698-aea4-de5c18620c40&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Lambda Architecture enables us to score transactions or other business events in real-time and still consider the most recent events as well as the whole transaction history in its scoring model.
&lt;/p&gt;

&lt;p&gt;
By using Kafka at the beginning of the pipeline to accept inputs, it can be guaranteed that messages will be delivered once they enter the system, regardless of hardware or network failures.
&lt;/p&gt;

&lt;p&gt;
The batch layer is largely built on Apache Spark and Mesos, coupled with ElasticSearch as the large-scale storage component underneath. The reasons why we run on Spark, Mesos and ElasticSearch have been covered earlier in this document but, interestingly, these components behave extremely well together when it comes to addressing batch-processing concerns, thanks to Spark&apos;s ability to work largely in memory and to proper optimization of data co-locality across ElasticSearch and Spark nodes.
&lt;/p&gt;

&lt;p&gt;
In the streaming layer, Kafka messages are consumed in real time using Spark Streaming. In terms of the core component supporting the speed layer, the usual choice is between Apache Storm and Apache Spark Streaming.
The main selection criterion between the two is whether one is interested in ultra-low latency (Apache Storm) or high throughput (Apache Spark Streaming). There are other factors, but these are the main drivers.
&lt;br&gt;
In my company, for our use cases, we can afford a slightly higher latency as long as we stay under a second to score a business event (e.g. a financial transaction). On the other hand, we face situations where bursts of thousands of transactions to be scored per second are common. As such, high throughput is not optional for us; it is a key requirement, hence the rationale behind the usage of Apache Spark Streaming.
&lt;br&gt;
Here, in the speed layer, ElasticSearch is key to reducing the integration latency of the speed layer, since it is a real-time querying database in addition to a very powerful database engine.
&lt;/p&gt;

&lt;p&gt;
The Serving Layer, consolidating the batch-layer and speed-layer partial results, is largely home-made in our case and relies on ElasticSearch&apos;s ability to fetch both partial result sets in real time.
&lt;/p&gt;

&lt;a name=&quot;sec23&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.3 Drawbacks and difficulties of Lambda Architecture&lt;/h3&gt;

&lt;p&gt;
There is a natural tendency to duplicate logic between the batch layer and the speed layer, which needs to be addressed through strict design and reusable logic. Using Spark in batch mode on the batch layer and Spark Streaming on the speed layer in our case really helps us reuse business logic as much as possible between both worlds.
&lt;/p&gt;
&lt;p&gt;
In addition, there is the operational complexity of the systems involved in implementing the Lambda architecture. Thus the implementation of a Lambda architecture is inherently difficult.
&lt;/p&gt;


&lt;a name=&quot;sec3&quot;&gt;&lt;/a&gt;
&lt;h2&gt;3. Real-time computation with Lambda Architecture&lt;/h2&gt;

&lt;p&gt;
The demand for real-time analytics has led to demand for workflows that can effectively balance latency, throughput, scaling and fault tolerance.
&lt;br&gt;
In order to accommodate the demand for real-time analytics, we need to design a system that can provide balance between the concept of &quot;single version of truth&quot; and &quot;real-time analytics&quot;. Lambda Architecture is one such method.
&lt;/p&gt;

&lt;p&gt;
In my company, some of our analytics use cases require considering very extensive contextual information about trade and transaction activities, for instance to build user and customer profiles or to analyze their past behaviours.
&lt;br&gt;
Building such contextual information typically requires analyzing, over and over again, billions of business events and petabytes of data.
&lt;br&gt;
Rebuilding these profiles or re-creating the aggregated statistical metrics would take several dozen minutes, even on a large cluster, in a typical batch-processing approach.
&lt;br&gt;
Fortunately, all this information can be built incrementally, and as such we can benefit from the Lambda architecture: the batch layer rebuilds the historical part while the latest data is taken into account by the speed layer to provide an up-to-date (as close to real time as possible) view of reality. The serving layer consolidates both results to provide always up-to-date and accurate views of these profiles and other aggregated statistical metrics.
&lt;br&gt;
These real-time metrics are thus made available to our real-time scoring and classification systems.
&lt;/p&gt;
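&lt;p&gt;
The incremental-building property mentioned above can be illustrated with a toy profile metric (a plain Python sketch with hypothetical names, not our actual profile model):
&lt;/p&gt;

```python
# Illustrative sketch only: a profile metric that supports incremental
# building, so the speed layer can fold in new events without a full
# recomputation over the whole history. Names are hypothetical.

def new_profile():
    return {"count": 0, "total": 0.0}

def update(profile, amount):
    """Incremental update: fold one new event into the profile."""
    profile["count"] += 1
    profile["total"] += amount
    return profile

def mean(profile):
    """Derived metric, always computable from the incremental state."""
    return profile["total"] / profile["count"] if profile["count"] else 0.0

p = new_profile()
for amount in [100.0, 50.0]:  # historical events (batch-layer rebuild)
    update(p, amount)
update(p, 10.0)               # latest event (speed layer)
```

&lt;p&gt;
Because the metric only needs a small running state (count and total), the speed layer can keep it current event by event while the batch layer periodically rebuilds it from the raw history.
&lt;/p&gt;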

&lt;p&gt;
The same technologies and approaches deployed in the speed layer to provide up-to-date views of the reality are used to score and classify business events, e.g. financial transactions in real-time.
&lt;br&gt;
Here as well, we have no requirement for hard real-time with millisecond-order latency. As long as we can provide a risk score or a classification for an event under a second, this is sufficient for our use cases.
&lt;br&gt;
On the other hand, it often happens that we have to process bursts of several hundred events per second. As such, a system offering an acceptable latency but a very high throughput, such as Apache Spark Streaming, is a key component of our processing platform.
&lt;/p&gt;

&lt;p&gt;
In addition, within the NG|Screener UI we provide our customers with a full-blown data discovery (forensic) application. The Lambda Architecture is key in enabling us to provide our users with real-time updates and a second-close, up-to-date view of reality.
&lt;/p&gt;

&lt;a name=&quot;sec4&quot;&gt;&lt;/a&gt;
&lt;h2&gt;4. Conclusion&lt;/h2&gt;


&lt;p&gt;
Deploying Lambda architecture on our use cases has proven to be the simplest way to reach our objectives:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An up-to-date, second-close view of reality for contextual information, user / customer profiles and other key periodic statistical metrics&lt;/li&gt;
&lt;li&gt;Classification and scoring of business events with an under-a-second latency and a very high throughput&lt;/li&gt;
&lt;li&gt;Resilience and fault tolerance of our business processes on large clusters, against both technical failures and human errors&lt;/li&gt;
&lt;li&gt;Simplicity and maintainability, especially in our approach, since we can share significant portions of code between the batch layer and the speed layer, both being built on Apache Spark&lt;/li&gt;
&lt;li&gt;Resolution of the operational complexity of big computations on historical data by dividing the work in an incremental fashion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Now of course, the Lambda Architecture being the simplest way for us to reach our mission-critical objectives doesn&apos;t make it simple per se, on the contrary. The Lambda Architecture is inherently difficult to deploy and maintain, and requires sound design and implementation.
&lt;/p&gt;
&lt;p&gt;
At NetGuardians, we were able to leverage our mastery of cutting-edge technologies, as well as our in-depth experience with batch and real-time computing systems, to turn it into an advantage of our approach.
&lt;/p&gt;


</description>          </item>
    <item>
    <guid isPermaLink="true">https://www.niceideas.ch/roller2/badtrash/entry/video-of-jerome-kehrli-presenting</guid>
    <title>Presenting NetGuardians&apos; Big Data technology (video)</title>
    <dc:creator>Jerome Kehrli</dc:creator>
    <link>https://www.niceideas.ch/roller2/badtrash/entry/video-of-jerome-kehrli-presenting</link>
        <pubDate>Fri, 5 Jan 2018 13:00:00 -0500</pubDate>
    <category>Big Data</category>
    <category>bank</category>
    <category>banking</category>
    <category>big-data</category>
    <category>netguardians</category>
    <category>video</category>
    <atom:summary type="html">&lt;p&gt;
I am presenting in this video NetGuardians&apos; Big Data approach, technologies and its advantages for the banking institutions willing to deploy big data technologies for Fraud Prevention.
&lt;/p&gt;

    &lt;!-- 1. The &lt;iframe&gt; (and video player) will replace this &lt;div&gt; tag. --&gt;
    &lt;div id=&quot;player&quot; style=&quot;max-width: 100%;&quot;&gt;&lt;/div&gt;

    &lt;script&gt;
      // 2. This code loads the IFrame Player API code asynchronously.
      var tag = document.createElement(&apos;script&apos;);

      tag.src = &quot;https://www.youtube.com/iframe_api&quot;;
      var firstScriptTag = document.getElementsByTagName(&apos;script&apos;)[0];
      firstScriptTag.parentNode.insertBefore(tag, firstScriptTag);

      // 3. This function creates an &lt;iframe&gt; (and YouTube player)
      //    after the API code downloads.
      var player;
      function onYouTubeIframeAPIReady() {
        player = new YT.Player(&apos;player&apos;, {
          height: &apos;360&apos;,
          width: &apos;640&apos;,
          videoId: &apos;wxESC2PCwp4&apos;,
          events: {
            &apos;onReady&apos;: onPlayerReady,
            &apos;onStateChange&apos;: onPlayerStateChange
          }
        });
      }

      // 4. The API will call this function when the video player is ready.
      function onPlayerReady(event) {
        //event.target.playVideo();
      }

      // 5. The API calls this function when the player&apos;s state changes.
      //var done = false;
      function onPlayerStateChange(event) {
        /*
        if (event.data == YT.PlayerState.PLAYING &amp;&amp; !done) {
          setTimeout(stopVideo, 6000);
          done = true;
        }
        */
      }
/*
      function stopVideo() {
        player.stopVideo();
      }
*/
    &lt;/script&gt;

&lt;p&gt;
The speech is reported in textual form hereafter.
&lt;/p&gt;
</atom:summary>        <description>&lt;p&gt;
I am presenting in this video NetGuardians&apos; Big Data approach, technologies and its advantages for the banking institutions willing to deploy big data technologies for Fraud Prevention.
&lt;/p&gt;

    &lt;!-- 1. The &lt;iframe&gt; (and video player) will replace this &lt;div&gt; tag. --&gt;
    &lt;div id=&quot;player_1231346&quot; style=&quot;max-width: 100%;&quot;&gt;&lt;/div&gt;

    &lt;script&gt;

        player = new YT.Player(&apos;player_1231346&apos;, {
          height: &apos;360&apos;,
          width: &apos;640&apos;,
          videoId: &apos;wxESC2PCwp4&apos;
        });

    &lt;/script&gt;

&lt;p&gt;
The speech is reported in textual form hereafter.
&lt;/p&gt;

&lt;p&gt;
It keeps puzzling me to see how deploying Big Data Technologies in banking institutions for fraud prevention and other use cases seems to be so difficult.
&lt;br&gt;
A large number of such projects have simply failed over the past.
&lt;br&gt;
By failure I mean projects that led to poor results, exceeded the budget significantly, or were even simply cancelled.
&lt;/p&gt;

&lt;p&gt;
When looking at why these projects failed, it always boils down to the two same major issues.
 &lt;/p&gt;

&lt;p&gt;
The first major issue is that extracting the required data to build the analytics use cases is a challenge on its own. 
Even if the bank manages to extract the required data, which is only a technical problem, cleaning, enriching, normalizing and re-modeling it for banking fraud use cases is a whole new project.
 &lt;/p&gt;

&lt;p&gt;
The second major issue is that technological mastery alone is not sufficient for Big Data projects to succeed. 
&lt;br&gt;
Implementing data analytics use cases requires a strong involvement from business experts. 
&lt;br&gt;
It always amazes me to see how many projects operated under the illusion that putting a dozen gifted Data Scientists in a room for a few years would be sufficient. Without a clear business understanding, Data Scientists are blind and can go nowhere.
 &lt;/p&gt;

&lt;p&gt;
And then, even with a clear understanding of both of these challenges, deploying big data technologies for fraud prevention is a 10-month to 2-year project.
At NetGuardians, we typically deploy our technology at new customers within a few weeks.
&lt;/p&gt;

&lt;p&gt;
So how do we do that?
&lt;/p&gt;

&lt;p&gt;
First, we are using technology on the bleeding edge of the state of the art, not today&apos;s state of the art but tomorrow&apos;s, benefiting from the right data extraction approach and the right use cases.
&lt;/p&gt;

&lt;p&gt;
In terms of technology, our NG|Screener platform is using key big data components underneath: ElasticSearch, Mesos and Spark.
&lt;/p&gt;

&lt;p&gt;
Regarding the Data Ingestion System, we have developed at NetGuardians our Data Collection Framework that is simple, efficient and configurable.
&lt;br&gt;
Typical data extraction tools are either simple, or efficient, or configurable. Our framework is all of that at once, without any compromise.
 &lt;/p&gt;

&lt;p&gt;
Then, working with numerous financial institutions worldwide over the years made us understand the indispensable role of not only technology but also business expertise when it comes to developing Big Data analytics use cases.
 &lt;/p&gt;

&lt;p&gt;
Business experts in banking institutions are hardly available, right?
&lt;br&gt;
Not a problem for us, we have hired our own.
&lt;br&gt;
Today, we have our own business and risk experts with an impressive track record in risk and other banking business departments.
 &lt;/p&gt;

&lt;p&gt;
At NetGuardians, we have the multi-competency team that so many projects struggle to build and, together, we have designed and implemented the right use cases to make Big Data deployment projects happen smoothly at our customers and bring them actual added value.
 &lt;/p&gt;

&lt;p&gt;
As a result, our customers are able to make sense of their available Big Data, save enormous amounts of time, and leverage Big Data technology to proactively address growing fraud challenges.
 &lt;/p&gt;

&lt;p&gt;
From a personal perspective, I am extremely proud of what we have built, both in terms of technology and approach, as well as of the privilege I have to work in a team with such brilliant minds and wonderful people.
&lt;/p&gt;</description>          </item>
    <item>
    <guid isPermaLink="true">https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2</guid>
    <title>ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part III : so why is it cool ?</title>
    <dc:creator>Jerome Kehrli</dc:creator>
    <link>https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2</link>
        <pubDate>Wed, 30 Aug 2017 16:43:44 -0400</pubDate>
    <category>Big Data</category>
    <category>big-data</category>
    <category>elasticsearch</category>
    <category>elk</category>
    <category>elk-ms</category>
    <category>kibana</category>
    <category>logstash</category>
    <category>mesos</category>
    <category>spark</category>
    <atom:summary type="html">&lt;p&gt;
So, finally, the conclusion of this series of three articles, the big conclusion, where I intend to present why this ELK-MS (ElasticSearch/LogStash/Kibana - Mesos/Spark) stack is cool.
&lt;br&gt;
Without any further waiting, let&apos;s give the big conclusion right away: using ElasticSearch, Mesos and Spark, one can really distribute and scale the processing the way we want and very easily &lt;b&gt;scale the processing linearly with the amount of data to process&lt;/b&gt;.
&lt;br&gt;
And this, exactly this and nothing else, is very precisely what we want from a Big Data Processing cluster.
&lt;/p&gt;
&lt;p&gt;
At the end of the day, we want a system that books a lot of the cluster&apos;s resources for a job that processes a lot of data, and only a small subset of these resources for a job that works on a small subset of data.
&lt;br&gt;
And this is precisely what one can achieve pretty easily with the ELK-MS stack, in an almost natural and straightforward way.
&lt;br&gt;
I will present why and how in this article.
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 400px; &quot; alt=&quot;ELK-MS Data Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;The first article - ELK-MS - part I : setup the cluster&lt;/a&gt; in this series presents the ELK-MS stack and how to set up a test cluster using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;niceideas ELK-MS package&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;The second article - ELK-MS - part II : assessing behaviour&lt;/a&gt; presents a few concerns, assesses the expected behaviour using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;niceideas ELK-MS TEST package&lt;/a&gt; and discusses challenges and constraints in this ELK-MS environment.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2&quot;&gt;This third and last article - ELK-MS - part III : so why is it cool?&lt;/a&gt; presents, as indicated, why this ELK-MS stack is really really cool and works great.
&lt;/p&gt;
&lt;p&gt;
This article assumes a basic understanding of Big Data / NoSQL technologies in general by the reader.
&lt;/p&gt;</atom:summary>        <description>&lt;!-- ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part III : so why is it cool ? --&gt;


&lt;p&gt;&lt;i&gt;
&lt;b&gt;Edited 2017-10-30&lt;/b&gt;: I was using ES 5.0.0 with Spark 2.2.0 at the time of writing the initial version of this article. 
&lt;br&gt;
With ElasticSearch 6.x, and ES-Hadoop 6.x, the game changes a little. The Spark 2.2.0 Dynamic allocation system is now perfectly compatible with the way ES-Hadoop 6.x enforces data locality optimization and everything works just as expected.
&lt;/i&gt;
&lt;/p&gt;

&lt;p&gt;
So, finally, the conclusion of this series of three articles, the big conclusion, where I intend to present why this ELK-MS (ElasticSearch/LogStash/Kibana - Mesos/Spark) stack is really really cool.
&lt;br&gt;
Without any further waiting, let&apos;s give the big conclusion right away: using ElasticSearch, Mesos and Spark, one can really distribute and scale the processing the way we want and, out of the box (using Dynamic Allocation), &lt;b&gt;scale the processing linearly with the amount of data to process&lt;/b&gt;.
&lt;br&gt;
And this, exactly this and nothing else, is very precisely what we want from a Big Data Processing cluster.
&lt;/p&gt;
&lt;p&gt;
At the end of the day, we want a system that books a lot of the cluster&apos;s resources for a job that processes a lot of data, and only a small subset of these resources for a job that works on a small subset of data, with a strong enforcement of data locality optimization.
&lt;br&gt;
And this is precisely what one can achieve pretty easily with the ELK-MS stack, in an almost natural and straightforward way.
&lt;br&gt;
I will present why and how in this article.
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 400px; &quot; alt=&quot;ELK-MS Data Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;The first article - ELK-MS - part I : setup the cluster&lt;/a&gt; in this series presents the ELK-MS stack and how to set up a test cluster using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;niceideas ELK-MS package&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;The second article - ELK-MS - part II : assessing behaviour&lt;/a&gt; presents a few concerns, assesses the expected behaviour using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;niceideas ELK-MS TEST package&lt;/a&gt; and discusses challenges and constraints in this ELK-MS environment.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2&quot;&gt;This third and last article - ELK-MS - part III : so why is it cool?&lt;/a&gt; presents, as indicated, why this ELK-MS stack is really really cool and works great.
&lt;/p&gt;
&lt;p&gt;
This article assumes a basic understanding of Big Data / NoSQL technologies in general by the reader.
&lt;/p&gt;
&lt;p&gt;


&lt;b&gt;Summary&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec1&quot;&gt;1. Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec2&quot;&gt;2. Data locality and workload distribution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec3&quot;&gt;3. Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec4&quot;&gt;4. Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;



&lt;a name=&quot;sec1&quot;&gt;&lt;/a&gt;
&lt;h2&gt;1. Introduction &lt;/h2&gt;

&lt;p&gt;
The reader might want to refer to the &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana#sec1&quot;&gt;Introduction of the first article in the serie&lt;/a&gt; as well as &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1#sec1&quot;&gt;the introduction of the second article&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
Summarizing them, this series of articles presents and assesses the ELK-MS stack, reports the tests done using the test cluster, and presents the conclusions in terms of constraints as well as key lessons.
&lt;br&gt;
The second article presented the technical constraints coming from integrating Spark with ElasticSearch through the ES-Hadoop connector when running Spark on Mesos.
&lt;br&gt;
In that second article I focused a lot on what was not working and on the constraints. A reader might have had the impression that these constraints could prevent a wide range of use cases on the ELK-MS stack. I want to address this fear in this third article since it is anything but true: Spark on Mesos using data from ElasticSearch is really a pretty versatile environment and can address most if not all data analysis requirements.
&lt;/p&gt;
&lt;p&gt;
In this last article, I will present how one can use a sound approach regarding data distribution in ElasticSearch to drive the distribution of the workload on the Spark cluster.
&lt;br&gt;
And it turns out that it&apos;s pretty straightforward to come up with a simple, efficient and natural approach to control the workload distribution using ElasticSearch, Spark and Mesos.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;ES index layout strategies&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
The parameters that architects and developers need to tune to control the data distribution in ElasticSearch, which, in turn, controls the workload distribution in Spark, are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;b&gt;index splitting&lt;/b&gt; strategy&lt;/li&gt;
&lt;li&gt;The &lt;b&gt;index sharding&lt;/b&gt; strategy&lt;/li&gt;
&lt;li&gt;The &lt;b&gt;replication&lt;/b&gt; strategy (factor)&lt;/li&gt;
&lt;li&gt;The &lt;b&gt;sharding key&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
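&lt;p&gt;
As a rough sketch of how the index splitting strategy and the sharding key play together (plain Python with hypothetical helper names; ElasticSearch&apos;s actual routing hash differs, crc32 merely stands in for illustration):
&lt;/p&gt;

```python
# Illustrative sketch only: how a monthly index-splitting strategy and a
# hash-based sharding key determine where a document lands. Helper names
# are hypothetical; ES uses its own routing hash internally.
import datetime
import zlib

def index_name(prefix, event_date):
    """Index splitting: one index per month, e.g. dataset-2017.08."""
    return "%s-%04d.%02d" % (prefix, event_date.year, event_date.month)

def shard_for(sharding_key, num_shards):
    """Index sharding: route a document to a shard by hashing its key.
    crc32 stands in for ElasticSearch's actual routing hash."""
    return zlib.crc32(sharding_key.encode("utf-8")) % num_shards

doc_index = index_name("dataset", datetime.date(2017, 8, 30))
doc_shard = shard_for("account-42", 5)
```

&lt;p&gt;
The key point is that all documents sharing the same sharding key land on the same shard, and hence on the same node, which is what later drives workload co-location in Spark.
&lt;/p&gt;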

&lt;p&gt;
&lt;b&gt;Spark aspects&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Then, on the Spark side, the only important aspect is to use a version of ES-Hadoop that supports the Dynamic Allocation system without compromising data locality optimization (i.e. ES-Hadoop &gt;= 6.x for Spark 2.2).
&lt;/p&gt;

&lt;p&gt;
But before digging into this, and if it is not already done, I can only strongly recommend reading &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;the first article&lt;/a&gt; in this series, presenting the ELK-MS stack, and &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;the second article&lt;/a&gt;, which presents the conclusions required to understand what follows.
&lt;/p&gt;


&lt;a name=&quot;sec2&quot;&gt;&lt;/a&gt;
&lt;h2&gt;2. Data locality and workload distribution&lt;/h2&gt;

&lt;p&gt;
What has been presented in &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1#sec3&quot;&gt;the conclusion section of the ELK-MS part II article&lt;/a&gt; is summarized hereunder:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Fine-Grained scheduling of Spark jobs by Mesos degrades performance to an unacceptable level. ELK-MS needs to stick to the &lt;b&gt;Coarse-Grained&lt;/b&gt; scheduling mode.
&lt;/li&gt;
&lt;li&gt;
ES-Hadoop is able to enforce data-locality optimization under nominal situations. Under a heavily loaded cluster, data-locality optimization can be compromised for two reasons:
&lt;ul&gt;
&lt;li&gt;
If the Mesos / Spark node local to a specific ES node is not available after the configured waiting time, the processing is moved to another free Mesos / Spark node.
&lt;/li&gt;
&lt;li&gt;
ElasticSearch may well decide to serve the request from another node should the local ES node be busy at the time it is requested by the local Spark node.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
With ES-Hadoop 5.x, Dynamic Allocation was messing up data locality optimization between ES and Spark. As such, only Static Allocation was usable and it was required to artificially limit the number of nodes for a given job in correspondence with the number of shards in ES (using the &lt;code&gt;spark.cores.max&lt;/code&gt; property to limit the number of Spark executors and the &lt;code&gt;search_shards&lt;/code&gt; API in ES to find out the number of shards to be processed).
&lt;br&gt;
&lt;b&gt;But now with ES-Hadoop 6.x, Dynamic allocation doesn&apos;t interfere with data locality optimization  and everything works well out of the box.&lt;/b&gt;
&lt;/li&gt;
&lt;li&gt;
Re-distributing the data on the cluster after the initial partitioning decision is only done by spark under specific circumstances.
&lt;/li&gt;
&lt;/ul&gt;
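&lt;p&gt;
To make the ES-Hadoop 5.x workaround concrete, the shard-counting logic can be sketched as follows. This is a minimal illustration in plain python: the JSON sample is hand-written for the example, not an actual response of the ES &lt;code&gt;search_shards&lt;/code&gt; API, and the helper name is made up.
&lt;/p&gt;

```python
import json

def max_cores_for_index(search_shards_response):
    """Derive a spark.cores.max value from a search_shards-style response:
    one Spark executor core per shard group to be read."""
    payload = json.loads(search_shards_response)
    # "shards" is a list of shard groups (each group holds a primary
    # and its replicas); one Spark partition is created per group.
    return len(payload["shards"])

# Illustrative, hand-written response for an index with 3 shards:
sample = json.dumps({
    "nodes": {"n1": {}, "n2": {}, "n3": {}},
    "shards": [
        [{"index": "dataset-2017.06", "shard": 0, "primary": True}],
        [{"index": "dataset-2017.06", "shard": 1, "primary": True}],
        [{"index": "dataset-2017.06", "shard": 2, "primary": True}],
    ],
})

cores = max_cores_for_index(sample)
print("spark.cores.max set to", cores)
```

&lt;p&gt;
The resulting number would then be passed to the Spark job as &lt;code&gt;spark.cores.max&lt;/code&gt;.
&lt;/p&gt;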

&lt;p&gt;
&lt;b&gt;ES-Hadoop drives spark partitioning strategy&lt;/b&gt;
&lt;/p&gt;

&lt;p&gt;
So what happens with ES-Hadoop 6.x and dynamic allocation is that the ElasticSearch sharding strategy drives the partitioning strategy of the corresponding data frames in Spark. With Data-Locality Optimization kicking in, even with Dynamic Allocation enabled, the Spark / Mesos cluster will do its best to create the Spark partitions on the nodes where the ES shards are located.
&lt;br&gt;
And this really works just out of the box.
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Eventually, there will be just as many executors booked by Mesos / Spark on the cluster as are required to handle every ES shard in a dedicated, co-located partition within Spark.&lt;/b&gt;
&lt;/p&gt;


&lt;a name=&quot;sec3&quot;&gt;&lt;/a&gt;
&lt;h2&gt;3. Examples&lt;/h2&gt;

&lt;p&gt;
In order to illustrate why I believe that the way ELK-MS distributes the workload following the distribution of the data is &lt;i&gt;efficient and natural&lt;/i&gt;, we&apos;ll use the examples below.
&lt;/p&gt;
&lt;p&gt;
Imagine the following situation: the ELK-MS test cluster contains 6 nodes with similar configurations. The dataset to be stored is called &lt;code&gt;dataset&lt;/code&gt; and contains 2 months of data.
&lt;br&gt;
In ElasticSearch the indexing settings are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
The &lt;b&gt;Index splitting strategy&lt;/b&gt; is &lt;b&gt;by month&lt;/b&gt;. This is not strictly an ElasticSearch setting, this is configured in Logstash or any other data ingestion tool.
&lt;br&gt;
As a matter of fact, whenever one wants to store temporal data in ElasticSearch (timeseries), one naturally considers splitting the index by year, month or even day depending on the size of the dataset.
&lt;/li&gt;
&lt;li&gt;
The &lt;b&gt;sharding strategy&lt;/b&gt; consists of creating 3 shards per index.
&lt;/li&gt;
&lt;li&gt;
The &lt;b&gt;replication strategy&lt;/b&gt; consists of creating 2 replicas (meaning 1 primary shard and 2 replicas).
&lt;/li&gt;
&lt;li&gt;
We do not care about configuring the &lt;i&gt;sharding key&lt;/i&gt; any differently than the default for now (a few words on the sharding key configuration are given in the conclusion).
&lt;/li&gt;
&lt;/ul&gt;
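&lt;p&gt;
The settings above can be summed up in a short sketch. The monthly index naming helper below is hypothetical (it mirrors the Logstash-style &lt;code&gt;dataset-YYYY.MM&lt;/code&gt; convention), while &lt;code&gt;number_of_shards&lt;/code&gt; and &lt;code&gt;number_of_replicas&lt;/code&gt; are the actual ElasticSearch index settings involved:
&lt;/p&gt;

```python
from datetime import date

# Hypothetical helper mirroring the "split by month" ingestion convention,
# producing Logstash-style index names such as dataset-2017.06.
def monthly_index_name(day):
    return "dataset-" + day.strftime("%Y.%m")

# Index settings matching the sharding / replication strategy above.
index_settings = {
    "settings": {
        "number_of_shards": 3,      # sharding strategy
        "number_of_replicas": 2,    # replication strategy
    }
}

print(monthly_index_name(date(2017, 6, 15)))
```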

&lt;p&gt;
&lt;b&gt;Initial situation&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
We can imagine that the above situation ends up in the following data layout on the cluster. (One should note though that this is not very realistic since ES would likely not split both months this way when it comes to storing replicas):
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data_empty.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 500px; &quot; alt=&quot;Data Architecture nominal situation&quot; src=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data_empty.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Working on a small subset of data for one month&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Now let&apos;s imagine that we write a processing script in spark that fetches a small subset of the data of one month, June 2017, so [A] here.
&lt;/p&gt;
&lt;p&gt;
In addition, imagine that the filter ends up identifying precisely the data from a single shard of the index. Spark / Mesos would create in this case a single spark partition on the node co-located to the ES shard.
&lt;/p&gt;
&lt;p&gt;
The processing happens this way in this case:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data_subset.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 500px; &quot; alt=&quot;Data Architecture working on subset&quot; src=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data_subset.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Since only one shard needs to be read from ElasticSearch, ES-Hadoop will drive the creation of a single partition in the resulting DataFrame (or RDD), which in turn will cause Spark to request a single task in one executor, the one local to the ES shard.
&lt;/p&gt;
&lt;p&gt;
So what actually happens is that working on a single shard located on a single ES node drives Spark to work on one single node as well.
&lt;br&gt;
Using replicas has the benefit of giving the Mesos / Spark cluster some choice as to which node this should be. This is especially important if the cluster is somewhat loaded.
&lt;/p&gt;
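&lt;p&gt;
The choice that replicas give to the scheduler can be modeled with a toy sketch: all node names and load figures below are made up, and the selection logic is a deliberate simplification of what Mesos / Spark and ES actually do.
&lt;/p&gt;

```python
# Toy model of the placement choice replicas give the scheduler: for the
# single shard to read, pick the least-loaded node among those holding a
# copy (node names and load figures are made up for the example).
shard_copies = {"dataset-2017.06/shard0": ["node1", "node3", "node5"]}
node_load = {"node1": 0.9, "node3": 0.2, "node5": 0.6}

def pick_executor_node(shard, copies, load):
    # All listed nodes hold a replica, so any of them preserves
    # data locality; prefer the one with the most free capacity.
    return min(copies[shard], key=lambda n: load[n])

chosen = pick_executor_node("dataset-2017.06/shard0", shard_copies, node_load)
print("single Spark partition placed on", chosen)
```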
&lt;p&gt;


&lt;b&gt;Working on a single month of data&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;

In this second example, the processing script works on a single month of data, the full month of June 2017, so all shards of [A] here.
&lt;/p&gt;
&lt;p&gt;
This will drive Spark to create 3 corresponding partitions on the Mesos / Spark cluster.
&lt;br&gt;
The processing works as follows in this case:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data_1_month.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 500px; &quot; alt=&quot;Data Architecture working on full month&quot; src=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data_1_month.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
Three shards need to be fetched from ES into Spark. ES-Hadoop will create 3 partitions, which leads to 3 tasks being dispatched in the Spark processing stage. These 3 tasks will be executed on the 3 nodes owning the ES shards.
&lt;/p&gt;
&lt;p&gt;
Again, distributing the input data on one third of the ES cluster on one side, and limiting Spark&apos;s resources to the actual number of nodes required on the other side, leads to one third of the Spark cluster being used for the Spark processing. 
&lt;br&gt;
In this case, the ElasticSearch data distribution strategy drives the workload distribution on spark.
&lt;br&gt;
Again replication is useful to ensure a successful distribution even under a loaded cluster.
&lt;/p&gt;


&lt;p&gt;
&lt;b&gt;Working on the whole period&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
In this last example, the processing script works on the whole two-month period, so all the shards of both months. This will drive Spark to create partitions on all nodes of the cluster.
&lt;br&gt;
The processing happens this way:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 500px; &quot; alt=&quot;Data Architecture working on both months&quot; src=&quot;https://www.niceideas.ch/es_spark/images/big_archi_data.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
When working on the whole period, it fortunately happens in this example that shards are fetched from the whole ES cluster. The whole Spark cluster will hence be used to distribute the processing workload, since each and every local Spark node needs to work on its local ES shard.
&lt;/p&gt;
&lt;p&gt;
Again, one last time, &lt;b&gt;the ElasticSearch data distribution strategy drives the workload distribution in accordance with the data distribution, enforcing data-locality optimization.&lt;/b&gt;
&lt;/p&gt;



&lt;a name=&quot;sec4&quot;&gt;&lt;/a&gt;
&lt;h2&gt;4. Conclusion&lt;/h2&gt;

&lt;p&gt;
In conclusion, having the ElasticSearch data distribution strategy drive the processing distribution on the Mesos / Spark cluster, thanks to the requirements the ES-Hadoop connector gives to Spark, makes a lot of sense if one thinks about it.
&lt;/p&gt;
&lt;p&gt;
First, it&apos;s &lt;b&gt;simple and consistent&lt;/b&gt;. One can understand how the first stages of processing will occur within Spark by simply looking at the data distribution, using for instance Cerebro. Everything is predictable and straightforward to assess.
&lt;/p&gt;
&lt;p&gt;
But more importantly, it&apos;s &lt;b&gt;efficient&lt;/b&gt; since, whenever we store data in ElasticSearch, we think of the distribution strategy, in terms of index splitting, sharding and replication, precisely for the single purpose of performance. 
&lt;br&gt;
Creating too many indexes and shards, more than the number of nodes, would be pretty stupid since having more than X shards to read per node, where X is the number of CPUs available to ES on a node, leads to poor performance. As such, the upper limit is the number of CPUs in the cluster. Isn&apos;t it fortunate that this is also the limit we want in such a case for our Spark processing cluster?
&lt;br&gt;
On the other hand, when one wants to store a tiny dataset, a single index and a single shard is sufficient. In this case, a processing on this dataset would also use a single node in the spark cluster. Again that is precisely what we want.
&lt;/p&gt;
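&lt;p&gt;
The back-of-the-envelope bound discussed above can be sketched in plain python. The per-node CPU count used here is a made-up assumption for the 6-node example cluster:
&lt;/p&gt;

```python
# With X CPUs available to ES per node, reading more than X shards per
# node degrades performance, so the useful ceiling on the number of
# shards is simply nodes * cpus_per_node.
def max_useful_shards(nodes, cpus_per_node):
    return nodes * cpus_per_node

# The 6-node example cluster, assuming (hypothetically) 4 CPUs per node:
print(max_useful_shards(6, 4))
```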
&lt;p&gt;

In the end, one &lt;i&gt;&quot;simply&quot;&lt;/i&gt; needs to optimize one&apos;s ElasticSearch cluster and the Spark processing will be optimized accordingly. 
&lt;br&gt;
Eventually, the processing distribution will scale linearly with the data distribution. As such, it&apos;s &lt;b&gt;a very natural approach&lt;/b&gt; in addition to being simple and efficient.
&lt;/p&gt;
&lt;p&gt;
Summing things up, since the Spark processing workload distribution is driven by the ElasticSearch data distribution, both are impacted by the following parameters of an ES index:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;b&gt;index splitting&lt;/b&gt; strategy&lt;/li&gt;
&lt;li&gt;The &lt;b&gt;index sharding&lt;/b&gt; strategy&lt;/li&gt;
&lt;li&gt;The &lt;b&gt;replication&lt;/b&gt; strategy (factor)&lt;/li&gt;
&lt;li&gt;The &lt;b&gt;sharding key&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;
The sharding key is not very important unless one has to implement a lot of joins in one&apos;s processing scripts. In this case, one should carefully look at the various situations of these joins and find out which property is used most often as join key. 
&lt;br&gt;
The sharding key should be this very same join key, thus enabling Spark to implement the joins with best data locality, most of the time on the local node, since all documents with the same sharding key end up in the same shard.
&lt;br&gt;
This may be the topic of another article on the subject, but likely not soon ... since, after so much writing, I need to focus on something else than Spark and ElasticSearch for a little while ...
&lt;/p&gt;
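&lt;p&gt;
As a toy illustration of this co-location property: ElasticSearch routes a document to a shard based on a hash of the routing key modulo the number of shards. The sketch below uses python&apos;s &lt;code&gt;zlib.crc32&lt;/code&gt; as a stand-in for the real murmur3 hash, purely for illustration:
&lt;/p&gt;

```python
import zlib

NUM_SHARDS = 3

def shard_for(routing_key):
    # Simplified routing rule: hash(routing) modulo number_of_shards.
    # Real ES uses murmur3; crc32 stands in here for illustration only.
    return zlib.crc32(routing_key.encode("utf-8")) % NUM_SHARDS

# Documents from two indices sharing the same routing key get the same
# shard number, so a join on that key can stay node-local.
offers_shard = shard_for("customer-42")
accounts_shard = shard_for("customer-42")
print(offers_shard == accounts_shard)
```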
&lt;p&gt;
As a last word on this topic for now, I would like to emphasize that not only does this ELK-MS stack work in a simple, natural, efficient and performant way, but in addition all the UI consoles (Cerebro, Kibana, Mesos Console, Spark History Server) are state of the art, the Spark APIs are brilliantly designed and implemented, ElasticSearch itself answers a whole range of use cases on its own, etc.
&lt;br&gt;
This stack is simply so amazingly cool.
&lt;/p&gt;


</description>          </item>
    <item>
    <guid isPermaLink="true">https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1</guid>
    <title>ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part II : assessing behaviour</title>
    <dc:creator>Jerome Kehrli</dc:creator>
    <link>https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1</link>
        <pubDate>Wed, 23 Aug 2017 17:30:37 -0400</pubDate>
    <category>Big Data</category>
    <category>big-data</category>
    <category>elasticsearch</category>
    <category>elk</category>
    <category>elk-ms</category>
    <category>kibana</category>
    <category>logstash</category>
    <category>mesos</category>
    <category>spark</category>
    <atom:summary type="html">&lt;p&gt;
This article is the second in my series of articles presenting the ELK-MS Stack and test cluster. 
&lt;/p&gt;
&lt;p&gt;
ELK-MS stands for ElasticSearch/LogStash/Kibana - Mesos/Spark. The ELK-MS stack is a simple, lightweight, efficient, low-latency and performant alternative to the Hadoop stack providing state-of-the-art Data Analytics features.
&lt;/p&gt;
&lt;p&gt;
ELK-MS is especially interesting for people that don&apos;t want to settle for anything but the best regarding Big Data Analytics functionalities, yet don&apos;t want to deploy a full-blown Hadoop distribution, for instance from Cloudera or HortonWorks.
&lt;br&gt;
Again, I am not saying that Cloudera and HortonWorks&apos; Hadoop distributions are not good. &lt;i&gt;Au contraire&lt;/i&gt;, they are awesome and really simplify the overwhelming burden of configuring and maintaining the set of software components they provide.
&lt;br&gt;
But there is definitely room for something lighter and simpler in terms of deployment and complexity.
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/application_architecture.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;ELK-MS Application Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/application_architecture.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;The first article - entitled - ELK-MS - part I : setup the cluster&lt;/a&gt; in this series presents the ELK-MS stack and how to set up a test cluster using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;niceideas ELK-MS package&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;This second article - ELK-MS - part II : assessing behaviour&lt;/a&gt; presents a few concerns, assesses the expected behaviour using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;niceideas ELK-MS TEST package&lt;/a&gt; and discusses the challenges and constraints on this ELK-MS environment.
&lt;/p&gt;
&lt;p&gt;
The conclusions of this series of articles are presented in &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2&quot;&gt;the third and last article - ELK-MS - part III : so why is it cool?&lt;/a&gt; which presents, as the name suggests, why this ELK-MS stack is really really cool and works great.
&lt;/p&gt;
&lt;p&gt;
This article assumes a basic understanding of Big Data / NoSQL technologies in general by the reader.
&lt;/p&gt;</atom:summary>        <description>&lt;!-- ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part II : assessing behaviour --&gt;

&lt;style&gt;
.container-wrapper {
  width: 100%;
  float: left;
}
.container {
  width: 100%;
  margin-left: auto;
  margin-right: auto;
}
.container figure {
  float: left;
  max-width: 23%;
  min-width: 160px;
  margin: 0 0 10px 2%;
}
/*.container figure:nth-child(1),
.container figure:nth-child(3) {
  margin-left: 0;
}
*/
.container_inner {
  height: 100px;
  max-height: 100px;
  overflow-y: hidden;
}
.container figure a img, .container figure img {
  max-width: 100%;
  float: left;
  clip: rect(0,100px,100px,0);
  position: relative;
  width: 100%;
}
&lt;/style&gt;

&lt;p&gt;&lt;i&gt;
&lt;b&gt;Edited 2017-10-30&lt;/b&gt;: I was using ES 5.0.0 with Spark 2.2.0 at the time of writing the initial version of this article. 
&lt;br&gt;
With ElasticSearch 6.x, and ES-Hadoop 6.x, the game changes a little. The Spark 2.2.0 Dynamic allocation system is now perfectly compatible with the way ES-Hadoop 6.x enforces data locality optimization and everything works just as expected.
&lt;/i&gt;
&lt;/p&gt;

&lt;p&gt;
This article is the second in my series of articles presenting the ELK-MS Stack and test cluster. 
&lt;/p&gt;
&lt;p&gt;
ELK-MS stands for ElasticSearch/LogStash/Kibana - Mesos/Spark. The ELK-MS stack is a simple, lightweight, efficient, low-latency and performant alternative to the Hadoop stack providing state-of-the-art Data Analytics features.
&lt;/p&gt;
&lt;p&gt;
ELK-MS is especially interesting for people that don&apos;t want to settle for anything but the best regarding Big Data Analytics functionalities, yet don&apos;t want to deploy a full-blown Hadoop distribution, for instance from Cloudera or HortonWorks.
&lt;br&gt;
Again, I am not saying that Cloudera and HortonWorks&apos; Hadoop distributions are not good. &lt;i&gt;Au contraire&lt;/i&gt;, they are awesome and really simplify the overwhelming burden of configuring and maintaining the set of software components they provide.
&lt;br&gt;
But there is definitely room for something lighter and simpler in terms of deployment and complexity.
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/application_architecture.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;ELK-MS Application Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/application_architecture.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;The first article - entitled - ELK-MS - part I : setup the cluster&lt;/a&gt; in this series presents the ELK-MS stack and how to set up a test cluster using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;niceideas ELK-MS package&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;This second article - ELK-MS - part II : assessing behaviour&lt;/a&gt; presents a few concerns, assesses the expected behaviour using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;niceideas ELK-MS TEST package&lt;/a&gt; and discusses the challenges and constraints on this ELK-MS environment.
&lt;/p&gt;
&lt;p&gt;
The conclusions of this series of articles are presented in &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2&quot;&gt;the third and last article - ELK-MS - part III : so why is it cool?&lt;/a&gt; which presents, as the name suggests, why this ELK-MS stack is really really cool and works great.
&lt;/p&gt;
&lt;p&gt;
This article assumes a basic understanding of Big Data / NoSQL technologies in general by the reader.
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Summary&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec1&quot;&gt;1. Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec2&quot;&gt;2. Testing framework&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec21&quot;&gt;2.1 niceideas ELK-MS TEST&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec22&quot;&gt;2.2 Test Scenario Script&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec23&quot;&gt;2.3 Used Dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec24&quot;&gt;2.4 Test purposes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec3&quot;&gt;3. Conclusions from assessment tests&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec31&quot;&gt;3.1 ES-Hadoop and Data-locality enforcement &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec32&quot;&gt;3.2 Spark coarse grained scheduling by Mesos vs. Fine Grained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec33&quot;&gt;3.3 Spark Static Resource Allocation vs. Dynamic Allocation&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec331&quot;&gt;3.3.1 ES-Hadoop 5.x&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec332&quot;&gt;3.3.2 ES-Hadoop 6.x&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec34&quot;&gt;3.4 Latency regarding Python instantiation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec35&quot;&gt;3.5 Other ES-Hadoop concerns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec36&quot;&gt;3.6 Other concerns&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec4&quot;&gt;4. Further work&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec5&quot;&gt;5. Details of Tests&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec51&quot;&gt;5.1 Nominal Tests&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec511&quot;&gt;5.1.1 Legacy RDD API on bank dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec512&quot;&gt;5.1.2 Legacy DataFrame API on bank dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec513&quot;&gt;5.1.3 DataFrame API on bank dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec514&quot;&gt;5.1.4 DataFrame API on Apache-logs dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec515&quot;&gt;5.1.5 DataFrame API on Shakespeare dataset&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec52&quot;&gt;5.2 Data-Locality tests&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec521&quot;&gt;5.2.1 Bank dataset with 1 shard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec522&quot;&gt;5.2.2 Bank dataset with 2 shards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec523&quot;&gt;5.2.3 Bank dataset with 3 shards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec524&quot;&gt;5.2.4 Bank dataset with 1 shard and replicas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec525&quot;&gt;5.2.5 Testing repartitioning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec53&quot;&gt;5.3 Aggregation tests&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec531&quot;&gt;5.3.1 ES-side Aggregations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec532&quot;&gt;5.3.2 Spark-side Aggregations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec54&quot;&gt;5.4 Join test&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec55&quot;&gt;5.5 Concurrency test&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec6&quot;&gt;6. References&lt;/a&gt;&lt;/li&gt;



&lt;/ul&gt;
&lt;br&gt;



&lt;a name=&quot;sec1&quot;&gt;&lt;/a&gt;
&lt;h2&gt;1. Introduction &lt;/h2&gt;

&lt;p&gt;
The reader might want to refer to the &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana#sec1&quot;&gt;Introduction of the first article in the serie&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
Summarizing it, this article is about assessing the behaviour of the ELK-MS Stack using the test cluster introduced in the first article.
&lt;br&gt;
Especially two questions need to be answered:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
First, how does data-locality optimization work when using ES-Hadoop to read data from ElasticSearch into Spark? On a large cluster, achieving Data Locality is &lt;i&gt;the sinews of war&lt;/i&gt;. Before considering the ELK-MS stack as an actual alternative to a more standard Hadoop stack, assessing the sound behaviour of the software stack and its good respect of data-locality is not optional.
&lt;/li&gt;
&lt;li&gt;
Second, how does Mesos schedule Spark executors and how does it impact data-locality? Mesos needs to be an effective alternative to YARN when it comes to dispatching Spark executors while still taking data-locality into account.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
These are the two main objectives of the tests for which I am reporting the conclusions hereafter, as well as a few other points.
&lt;br&gt;
This article is not about testing Spark or Mesos themselves, it&apos;s really about testing how ElasticSearch / Mesos / Spark behave all together to support the application architecture from the schema above.
&lt;/p&gt;
&lt;p&gt;
In addition, contrary to the state of the art on Spark, in my current company we are not going to use Java or Scala to implement the Spark processing logic, we are going to use python. 
&lt;br&gt;
The reason for this is simple: our Data Scientists know python, period. They do not know Java and they are not willing to learn Scala. Our Data Scientists know R and python, and as such, as Head of R&amp;D, I have made python our standard language for our Data Analytics algorithms (not that I don&apos;t like R, &lt;i&gt;au contraire&lt;/i&gt;, but I believe python is at the right intersection between Data Science and Engineering).
&lt;br&gt;
Choosing python as processing language has an impact when it comes to programming Spark: the support of python is, as a matter of fact, a little behind that of Scala and Java.
&lt;/p&gt;
&lt;p&gt;
All of the above gives this article its rationale: there is really little documentation available on programming an ElasticSearch / Mesos / Spark task with python.
&lt;br&gt;
In the previous article I presented how to set things up and shared my setup tools; in this article I want to present how to use the stack, assess its behaviour, and share some short sample programs in the form of my tests package.
&lt;/p&gt;


&lt;a name=&quot;sec2&quot;&gt;&lt;/a&gt;
&lt;h2&gt;2. Testing Framework &lt;/h2&gt;

&lt;p&gt;
I would summarize the specificities of the usage of Spark in my current company as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data analytics use cases are implemented in pyspark and python scripts and not native Scala or Java APIs&lt;/li&gt;
&lt;li&gt;The input data and results are stored in ElasticSearch, not in HDFS&lt;/li&gt;
&lt;li&gt;Spark runs on Mesos and not the more standard YARN on Hadoop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
So I needed a way to test and assess that all of this is working as expected and that the behaviour of the Mesos/Spark stack, both from the perspective of concurrency and respect of data-locality in between ES nodes and Spark nodes, is sound.
&lt;br&gt;
This is the objective of the &lt;i&gt;niceideas_ELK-MS-TEST&lt;/i&gt; framework.
&lt;/p&gt;
&lt;p&gt;
I am presenting this framework, the approach and the tests it contains hereunder.
&lt;br&gt;
The test framework is &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;available for download here&lt;/a&gt;.
&lt;/p&gt;

&lt;a name=&quot;sec21&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.1 niceideas ELK-MS TEST&lt;/h3&gt;

&lt;p&gt;
The &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;niceideas ELK-MS TEST package&lt;/a&gt; structure, after being properly extracted in a local folder, is as follows:
&lt;/p&gt;


&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/testing.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 450px;&quot; alt=&quot;ELK-MS TEST Structure&quot; src=&quot;https://www.niceideas.ch/es_spark/images/testing.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;./vm_execute.sh&lt;/code&gt;: this is the script one calls on the host machine to launch a test. The test to be executed should be given as argument.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;./tests/*&lt;/code&gt;: the test scenario scripts. 
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Executing a test on the ELK-MS Test cluster is simply done, for instance, by the following command:
&lt;/p&gt;

&lt;pre&gt;
badtrash@badbook:/data/niceideas_ELK-MS-TEST$ ./vm_execute.sh scenarii/5_concurrency_1_swissdata_df.sh
&lt;/pre&gt;

&lt;p&gt;
This would execute the test 5_1 and show the spark driver logs on the console.
&lt;/p&gt;

&lt;a name=&quot;sec22&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.2 Test Scenario Script&lt;/h3&gt;

&lt;p&gt;
Each and every test scenario script has the very same structure:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an ad-hoc shell script taking care of downloading a dataset and loading it into ElasticSearch&lt;/li&gt;
&lt;li&gt;Execute that Data Loading shell script&lt;/li&gt;
&lt;li&gt;Create an ad-hoc python script implementing the Spark processing&lt;/li&gt;
&lt;li&gt;Execute the Data Processing python script&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
For instance, a test scenario &lt;code&gt;X_test_Y_variant.sh&lt;/code&gt; would have following structure:
&lt;/p&gt;

&lt;pre&gt;
#!/bin/bash

# 1. Create Data Ingestion script
# -----------------------------------------------------------------------------
cat &gt; X_test_Y_variant_do.sh &amp;lt;&amp;lt;- &quot;EOF&quot;
#!/bin/bash

# echo commands
set -x

&lt;span style=&quot;color: red;&quot;&gt;# ...
# Various shell commands to proceed with loading the data in ES
# ...&lt;/span&gt;

# turn off command echoing
set +x


EOF

# 2. Execute Data Ingestion Script
# -----------------------------------------------------------------------------

bash X_test_Y_variant_do.sh
if [[ $? != 0 ]]; then
    echo &quot;Script execution failed. See previous logs&quot;
    exit -1
fi


# 3. Create pyspark script
# -----------------------------------------------------------------------------
cat &gt; X_test_Y_variant.py &amp;lt;&amp;lt;- &quot;EOF&quot;

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_X_Y&quot;)

# SparkContext and SQLContext
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

&lt;span style=&quot;color: red;&quot;&gt;# ...
# Various spark processing commands
# ...&lt;/span&gt;

EOF

# 4. Execute pyspark script
# -----------------------------------------------------------------------------

spark-submit X_test_Y_variant.py
if [[ $? != 0 ]]; then
    echo &quot;Script execution failed. See previous logs&quot;
    exit -1
fi
&lt;/pre&gt;

&lt;p&gt;
The key point of these scripts is that they are self-contained and idempotent. They make no assumption about the state of the ELK-MS cluster beforehand, and they always start by cleaning all the data before reloading the data required for the tests.
&lt;/p&gt;

&lt;a name=&quot;sec23&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.3 Used Dataset&lt;/h3&gt;

&lt;p&gt;
All the test scenarii from the &lt;i&gt;niceideas ELK-MS TEST&lt;/i&gt; package use one of the following datasets:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Bank Dataset&lt;/b&gt;: from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt; - a dataset of financial accounts with owner and balance information.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Shakespeare Dataset&lt;/b&gt;: from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json&lt;/a&gt; - the complete work of Shakespeare, every line of every speech.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Apache logs Dataset&lt;/b&gt;: from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz&lt;/a&gt; - a set of Apache web server log files.
&lt;/li&gt;
&lt;li&gt;	
&lt;b&gt;Swiss AirBnB&lt;/b&gt;: two datasets in fact:
&lt;ul&gt;
&lt;li&gt;
from &lt;a href=&quot;http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv&quot;&gt;http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv&lt;/a&gt; : the list of AirBnB offers in Switzerland as of July 2017.
&lt;/li&gt;
&lt;li&gt;
from &lt;a href=&quot;http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt&quot;&gt;http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt&lt;/a&gt; : the list of Swiss cities with population and geolocation information.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The last dataset, related to Swiss AirBnB offers and cities information, is required to test the behaviour of joins on Spark.
&lt;br&gt;
The other datasets represent different volumes and enable us to test various aspects.
&lt;/p&gt;

&lt;a name=&quot;sec24&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.4 Test purposes&lt;/h3&gt;

&lt;p&gt;
The tests from the &lt;i&gt;niceideas ELK-MS TEST&lt;/i&gt; package are presented in detail with all results in section &lt;a href=&quot;#sec5&quot;&gt;5. Details of Tests&lt;/a&gt;.
&lt;br&gt;
Before presenting the conclusions inferred from these tests in the next section, here is a short summary of the purpose of every family of tests:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;b&gt;Nominal tests&lt;/b&gt; - assess how the various kinds of APIs of Spark are used to read data from ES: the RDD API, the legacy DataFrame API (SQLContext) and the new DataFrame API (SparkSession).
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Data-locality tests&lt;/b&gt; - assess how data-locality optimization between ES and Spark works and to what extent.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Aggregation tests&lt;/b&gt; - how aggregation on ES data works.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Join tests&lt;/b&gt; - how joining two data frames coming from ES works.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Concurrency tests&lt;/b&gt; - how Mesos / Spark behaves when running several jobs at a time.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
Again, the section &lt;a href=&quot;#sec5&quot;&gt;5. Details of Tests&lt;/a&gt; presents each and every test in detail, along with the screenshots of Cerebro and the Spark History Server, the logs of the spark driver, etc.
&lt;/p&gt;


&lt;a name=&quot;sec3&quot;&gt;&lt;/a&gt;
&lt;h2&gt;3. Conclusions from assessment tests&lt;/h2&gt;

&lt;p&gt;
This section already reports the conclusions that can be drawn from the tests executed in the scope of this work.
&lt;br&gt;
The conclusions and important information are presented in this early section so that the reader is spared the need to read through all the individual tests presented in detail in the next section.
&lt;/p&gt;



&lt;a name=&quot;sec31&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.1 ES-Hadoop and Data-locality enforcement &lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;Data locality&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;&lt;code&gt;PROCESS_LOCAL&lt;/code&gt;&lt;/b&gt; data is in the same JVM as the running code. This is the best locality possible
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;&lt;code&gt;NODE_LOCAL&lt;/code&gt;&lt;/b&gt; data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;&lt;code&gt;NO_PREF&lt;/code&gt;&lt;/b&gt; data is accessed equally quickly from anywhere and has no locality preference
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;&lt;code&gt;RACK_LOCAL&lt;/code&gt;&lt;/b&gt; data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;&lt;code&gt;ANY&lt;/code&gt;&lt;/b&gt; data is elsewhere on the network and not in the same rack
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same server, or b) immediately start a new task in a farther away place that requires moving data there.
&lt;/p&gt;

&lt;p&gt;
What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU.
&lt;br&gt;
The setting indicating how long it should wait before moving the processing elsewhere is &lt;code&gt;spark.locality.wait=10s&lt;/code&gt;.
&lt;/p&gt;
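&lt;p&gt;
This wait-then-fallback behaviour can be modeled with a toy scheduler - plain Python, not actual Spark code; the node names are illustrative and the wait is scaled down from the 10s setting:
&lt;/p&gt;

```python
import time

# Toy model of Spark's locality fallback: prefer an executor on the node
# holding the data, but fall back to any free executor once the locality
# wait (spark.locality.wait, scaled down here) has expired.
def schedule_task(preferred_node, free_nodes, locality_wait=0.1, poll=0.01):
    """free_nodes: callable returning the set of currently free nodes."""
    deadline = time.monotonic() + locality_wait
    while time.monotonic() < deadline:
        if preferred_node in free_nodes():
            return (preferred_node, "NODE_LOCAL")
        time.sleep(poll)
    # Timeout expired: run at a worse locality level on any free node
    free = free_nodes()
    return (sorted(free)[0], "ANY") if free else (None, None)

if __name__ == "__main__":
    # The preferred node never frees up, so the task is moved elsewhere
    print(schedule_task("node1", lambda: {"node2", "node3"}))  # ('node2', 'ANY')
```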
&lt;p&gt;

&lt;b&gt;ES-Hadoop&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;

Data-locality enforcement works remarkably well under nominal conditions. ES-Hadoop makes the Spark scheduler understand the topology of the shards on the ES cluster and Spark dispatches the processing accordingly. Mesos doesn&apos;t interfere in this regard.
&lt;/p&gt;
&lt;p&gt;

But again, this works only under nominal conditions.
&lt;br&gt;
As indicated above, several factors can compromise data-locality:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
First, imagine that at resource allocation time the Mesos cluster is heavily loaded. Spark will wait for &lt;code&gt;spark.locality.wait=10s&lt;/code&gt;, trying to get the processing executed on the node where ES stores the target data shard.
&lt;br&gt;
But if the node doesn&apos;t become free within this period, Spark will move the processing elsewhere.
&lt;/li&gt;
&lt;li&gt;
The second case is no longer related to Spark, but to ElasticSearch. Imagine that at the very moment the spark executor submits the request to ES (through the ES-Hadoop connector), the co-located ES node is busy doing something else (answering another request, indexing some data, etc.).
&lt;br&gt;
In this case, ES will delegate the answering of the request to another node and local data-locality is broken.
&lt;/li&gt;
&lt;/ol&gt;


&lt;a name=&quot;sec32&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.2 Spark coarse grained scheduling by Mesos vs. Fine Grained &lt;/h3&gt;

&lt;p&gt;
In &lt;i&gt;Coarse Grained&lt;/i&gt; scheduling mode, the default, Mesos considers Spark only at the granularity of the required &lt;i&gt;spark executor&lt;/i&gt; processes. All Mesos knows about Spark is which executor processes run on which nodes. Mesos knows nothing of Spark&apos;s job internals such as stages and tasks.
&lt;br&gt;
In addition, static allocation makes Mesos&apos; job pretty easy: try to allocate as many of the available cluster resources as possible to spark executors for pending jobs. This has the following consequences:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
First, if a job is submitted to the cluster at a moment when the cluster is completely free, the job will be allocated the whole cluster. If another job comes even only just a few seconds after, it will still need to wait for the cluster to be freed by the first job, and that will happen only when the first job completes.
&lt;/li&gt;
&lt;li&gt;
Second, if several jobs are waiting to be executed when the cluster is freed, Mesos will allocate the cluster resources evenly to each and every job. Now imagine that all of these are short-lived jobs except one long-lived job. At allocation time (static allocation), that long-lived job got only a small portion of the cluster. Even if the cluster becomes free very soon after, that job will still need to complete its execution on its small portion, leaving most of the cluster unused.
&lt;/li&gt;
&lt;/ol&gt;
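&lt;p&gt;
The even, static split described in the second point can be sketched as a toy model (illustrative Python, not the actual Mesos allocation logic):
&lt;/p&gt;

```python
# Toy model of static allocation: N waiting jobs each get cluster/N nodes
# and keep that share until they complete, even if the cluster frees up.
def static_allocate(total_nodes, jobs):
    """jobs: list of (name, duration_seconds). Returns {name: nodes}."""
    share = total_nodes // len(jobs)
    return {name: share for name, _ in jobs}

if __name__ == "__main__":
    jobs = [("short-1", 10), ("short-2", 12), ("long", 3600)]
    # The long-lived job is stuck with 3 of 9 nodes for its whole life,
    # even though 6 nodes become free again within seconds.
    print(static_allocate(9, jobs))  # {'short-1': 3, 'short-2': 3, 'long': 3}
```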

&lt;p&gt;
Historically, Spark on Mesos could benefit from a &lt;i&gt;Fine Grained&lt;/i&gt; scheduling mode instead, where Mesos schedules not only spark executors on nodes in a rough fashion but really each and every individual spark task.
&lt;br&gt;
In regards to data-locality optimization, this doesn&apos;t seem to have any impact.
&lt;br&gt;
In regards to performance on the other hand, &lt;i&gt;Fine Grained&lt;/i&gt; scheduling mode completely ruins performance.
&lt;/p&gt;
&lt;p&gt;
The thing is that Mesos requires quite some time to negotiate with the resource providers. If that negotiation happens for every individual spark task, a huge amount of time is lost and eventually the impact on performance is not acceptable.
&lt;/p&gt;
&lt;p&gt;
For this reason (and others), the &lt;i&gt;Fine Grained&lt;/i&gt; scheduling mode is deprecated: &lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-11857&quot;&gt;https://issues.apache.org/jira/browse/SPARK-11857&lt;/a&gt;
&lt;/p&gt;

&lt;a name=&quot;sec33&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.3 Spark Static Resource Allocation vs. Dynamic Allocation &lt;/h3&gt;

&lt;p&gt;
By default, Spark&apos;s scheduler uses a &lt;b&gt;Static Resource Allocation&lt;/b&gt; system. This means that at job (or driver) initialization time, Spark, with the help of Mesos in this case, decides what resources from the Mesos cluster can be allocated to the job. This decision is static: once decided, the set of resources allocated to the job will never change during its whole life, regardless of what happens on the cluster (other / additional nodes becoming free, etc.)
&lt;br&gt;
This has the consequences listed in the previous section - the whole cluster allocated to a single job, further jobs needing to wait, etc. - and as such it&apos;s far from optimal.
&lt;/p&gt;

&lt;p&gt;
Now of course Spark provides a solution to this, the &lt;b&gt;Dynamic Allocation&lt;/b&gt; System.
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;And this is where Spark gets really cool.&lt;/b&gt; With Dynamic Allocation, the Spark / Mesos cluster is evenly shared between the multiple jobs requesting execution on the cluster, regardless of when the jobs appear.
&lt;br&gt;
And with ES-Hadoop 6.x, the Dynamic Allocation system respects the locality requirements communicated by the elasticsearch-spark connector as much as possible.
&lt;/p&gt;
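&lt;p&gt;
For reference, Dynamic Allocation is typically switched on through a few &lt;code&gt;spark-defaults.conf&lt;/code&gt; properties such as the following (the values shown are deployment-specific examples; on Mesos the external shuffle service must additionally be running on the agents):
&lt;/p&gt;

&lt;pre&gt;
# Enable Spark Dynamic Allocation (example values)
spark.dynamicAllocation.enabled    true
# Dynamic allocation requires the external shuffle service
spark.shuffle.service.enabled      true
&lt;/pre&gt;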

&lt;a name=&quot;sec331&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.3.1 ES-Hadoop 5.x and Spark 2.2&lt;/h4&gt;


&lt;p&gt;
With ES-Hadoop version 5.x, the way the elasticsearch-spark connector was enforcing data locality was incompatible with Spark 2.2.0. When using Dynamic Allocation, Spark simply didn&apos;t take ES-Hadoop&apos;s requirements regarding data locality optimization into consideration anymore.
&lt;/p&gt;
&lt;p&gt;
Without going into details, the problem comes from the fact that ES-Hadoop makes spark request as many executors as shards and indicates as &lt;i&gt;preferred location&lt;/i&gt; the nodes owning the ES shards.
&lt;br&gt;
But Dynamic Allocation breaks all of this by allocating executors only one after the other (more or less), and only after monitoring the evolution of the job&apos;s processing needs and the amount of tasks created. In no way does the dynamic allocation system give any consideration to ES-Hadoop&apos;s requirements.
&lt;/p&gt;


&lt;a name=&quot;sec332&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.3.2 ES-Hadoop 6.x&lt;/h4&gt;

&lt;p&gt;
As indicated in the release notes of the ElasticSearch-Hadoop connector 6.0.0, the Elastic team has added support for Spark 2.2.0. This support fixes the Dynamic Allocation problem that ES-Hadoop 5.x suffered from.
&lt;/p&gt;
&lt;p&gt;
Now even with Dynamic Allocation properly enabled, which is a requirement for us in order to optimize the Mesos cluster resources consumption, data locality is optimized and properly enforced whenever possible.
&lt;/p&gt;



&lt;a name=&quot;sec34&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.4 Latency regarding Python instantiation&lt;/h3&gt;

&lt;p&gt;
Executing some tasks in Python takes time in comparison to executing tasks natively in Java or Scala. The problem is that spark tasks in Python require launching the individual task processing in a separate process from the Spark JVM. Only Java and Scala Spark processing runs natively in the Spark JVM.
&lt;/p&gt;
&lt;p&gt;
This problem is not necessarily a big deal since the DataFrame and RDD APIs exposed to pyspark scripts are actually implemented by Scala code underneath: they resolve to native Scala code.
&lt;br&gt;
There is one noticeable exception in this regard: UDFs (User Defined Functions) implemented in Python. While this is very possible, it should be avoided at all costs.
&lt;br&gt;
One can very well still use pyspark but write UDFs in Scala.
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An explanation of this problem : &lt;a href=&quot;https://fr.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark&quot;&gt;https://fr.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;UDF in Scala : &lt;a href=&quot;http://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html&quot;&gt;http://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
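&lt;p&gt;
The cost of crossing the process boundary can be felt even outside Spark. The following stdlib-only illustration (not Spark code) compares calling a function in-process with spawning a separate process per call - the kind of round-trip overhead at stake when work leaves the JVM for Python worker processes:
&lt;/p&gt;

```python
import subprocess
import sys
import time

def in_process(n):
    """Run n (empty) units of work in the current process."""
    start = time.monotonic()
    for _ in range(n):
        pass
    return time.monotonic() - start

def per_call_process(n):
    """Run the same n units of work, but pay a full Python process
    launch for each one, loosely mimicking work shipped out of the
    current process to separate worker processes."""
    start = time.monotonic()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    n = 5
    print("in-process: %.6fs / one process per call: %.3fs"
          % (in_process(n), per_call_process(n)))
```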



&lt;a name=&quot;sec35&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.5 Other ES-Hadoop concerns&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;Repartitioning&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
I couldn&apos;t find a way to make repartitioning work the way I want, meaning re-distributing the data on the cluster in order to scale out the further workload.
&lt;br&gt;
I am not saying there is no way, just that I haven&apos;t found one so far.
&lt;/p&gt;
&lt;p&gt;
As such, a sound approach regarding initial sharding in ES should be adopted. One should take into consideration that &lt;i&gt;a priori&lt;/i&gt;, initial sharding may well drive the way Spark will be able to scale the processing out on the cluster. 
&lt;br&gt;
While creating by default one shard per node in the cluster would definitely be overkill, the general idea should tend in this direction.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;ES level aggregations&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
It&apos;s simply impossible to forge a query from Spark to ElasticSearch through ES-Hadoop that would make ElasticSearch compute aggregations and return them instead of the raw data.
&lt;br&gt;
Such advanced querying features are not available from spark.
&lt;/p&gt;
&lt;p&gt;
The need is well identified but it remains &lt;i&gt;Work in Progress&lt;/i&gt; at the moment: &lt;a href=&quot;https://github.com/elastic/elasticsearch-hadoop/issues/276&quot;&gt;https://github.com/elastic/elasticsearch-hadoop/issues/276&lt;/a&gt;.
&lt;/p&gt;


&lt;a name=&quot;sec36&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.6 Other concerns&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;Spark History Server&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Running Spark on Mesos, there is no long-lived Spark process. Spark executors are created when required by Mesos, and the Mesos master and slave processes are the only long-lived processes on the cluster in this regard.
&lt;/p&gt;
&lt;p&gt;
As such, the Spark Application UI (on ports 4040, 4041, etc.) only lives for the duration of the Spark processing. When the job is finished, the Spark UI application vanishes.
&lt;/p&gt;
&lt;p&gt;
For this reason, Spark provides a History Server. The installation and operation of the History Server is presented in &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;the first article of this series : ELK-MS - part I : setup the cluster&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
Interestingly, the History Server supports the same JSON / REST API as the usual Spark Console, with only a few limitations.
&lt;br&gt;
For instance, one can use the REST API to discover the Application ID of running jobs in order to kill them (whenever required). For this, simply list the applications and find those that have &lt;code&gt;&quot;endTimeEpoch&quot; : -1&lt;/code&gt;, meaning the application is still alive:
&lt;/p&gt;

&lt;pre&gt;
curl -XGET http://192.168.10.10:18080/api/v1/applications
&lt;/pre&gt;
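&lt;p&gt;
Assuming the usual response shape of the &lt;code&gt;/api/v1/applications&lt;/code&gt; endpoint (a list of applications, each with an &lt;code&gt;attempts&lt;/code&gt; array carrying &lt;code&gt;endTimeEpoch&lt;/code&gt;), the live applications can then be filtered out with a few lines of Python (the sample data below is made up):
&lt;/p&gt;

```python
import json

# Sample of what /api/v1/applications may return (made-up IDs and names);
# an attempt with endTimeEpoch == -1 means the application is still alive.
sample_response = json.loads("""
[
  {"id": "app-001", "name": "ESTest_1_1",
   "attempts": [{"endTimeEpoch": 1504000000000}]},
  {"id": "app-002", "name": "ESTest_5_2",
   "attempts": [{"endTimeEpoch": -1}]}
]
""")

def live_application_ids(applications):
    """Return the IDs of applications with a still-running attempt."""
    return [app["id"] for app in applications
            if any(attempt.get("endTimeEpoch") == -1
                   for attempt in app.get("attempts", []))]

print(live_application_ids(sample_response))  # ['app-002']
```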

&lt;p&gt;
&lt;b&gt;Limitations of the ELK-MS stack&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
As stated in the previous article, ElasticSearch is not a distributed filesystem, it&apos;s a document-oriented NoSQL database.
&lt;/p&gt;
&lt;p&gt;
There are situations where a distributed filesystem provides interesting possibilities. Those are not provided by the ELK-MS stack as is. It would be interesting to test Ceph on Mesos for this. See &lt;a href=&quot;http://tracker.ceph.com/projects/ceph/wiki/Ceph-mesos&quot;&gt; http://tracker.ceph.com/projects/ceph/wiki/Ceph-mesos&lt;/a&gt;.
&lt;/p&gt;

&lt;a name=&quot;sec4&quot;&gt;&lt;/a&gt;
&lt;h2&gt;4. Further work&lt;/h2&gt;

&lt;p&gt;
I am still considering some next steps on the topic of ELK-MS stack testing since there are still a few things I would like to test or assess:
&lt;/p&gt;
&lt;p&gt;
In a raw fashion:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Find out how to set the maximum number of nodes booked by Mesos for a single spark job in order to avoid fully booking the cluster.
&lt;/li&gt;

&lt;li&gt;
ElasticSearch on Mesos
&lt;ul&gt;
&lt;li&gt;
This seems quite obvious. I expect the overall cluster performance to be way better if Mesos and ES don&apos;t compete with each other for hardware resources on nodes.
&lt;/li&gt;
&lt;li&gt;
There are workarounds of course, such as configuring Mesos to avoid using all the CPUs of a node. But that will never be as efficient as letting Mesos distribute the global workload.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;
Find a way for repartitioning to work the way I intend it: data should get redistributed across the cluster!
&lt;/li&gt;

&lt;li&gt;
Give Spark Streaming a try to reduce latency.
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-streaming&quot;&gt;https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-streaming&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;
Try FAIR Spark scheduler and play with it.
&lt;ul&gt;
&lt;li&gt; 
I got satisfying results using spark FIFO scheduler in terms of concurrency and haven&apos;t seen the need to change to FAIR.
&lt;/li&gt;
&lt;li&gt;
It really seems Mesos takes care of everything and I really do not see what the FAIR scheduler can change, but I want to be sure.
&lt;/li&gt;
&lt;li&gt;
There are some chances that this makes me rewrite this whole article ... in another article.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;


&lt;li&gt;
Ceph integration on Mesos for binary files processing. 
&lt;ul&gt;
&lt;li&gt;
How to integrate Ceph and spark ? Here as well very little documentation seems to be available.
&lt;/li&gt;
&lt;li&gt;
I found pretty much only this : &lt;a href=&quot;https://indico.cern.ch/event/524549/contributions/2185930/attachments/1290231/1921189/2016.06.13_-_Spark_on_Ceph.pdf&quot;&gt;https://indico.cern.ch/event/524549/contributions/2185930/attachments/1290231/1921189/2016.06.13_-_Spark_on_Ceph.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;
What about HDFS on Mesos ?
&lt;ul&gt;
&lt;li&gt; 
I would want to give it a try even though I am really rather considering Ceph for the use cases ElasticSearch forbids me to address.
&lt;/li&gt;
&lt;li&gt;
The thing is that Ceph integrates much better in the UNIX unified filesystem than HDFS does.
&lt;/li&gt;
&lt;li&gt;
There is an approach to reach the same level of integration with HDFS, based on FUSE: &lt;a href=&quot;https://wiki.apache.org/hadoop/MountableHDFS&quot;&gt;https://wiki.apache.org/hadoop/MountableHDFS&lt;/a&gt;. But that is still limited (it doesn&apos;t support ownership information for now).
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;/ul&gt;


&lt;a name=&quot;sec5&quot;&gt;&lt;/a&gt;
&lt;h2&gt;5. Details of Tests&lt;/h2&gt;

&lt;p&gt;
This very big section now presents each and every test in detail, along with the results in the form of the logs of the scripts (data feeding and spark driver logs) and the screenshots of the UI applications (Cerebro, Mesos Console, Spark History Server).
&lt;/p&gt;
&lt;p&gt;

The conclusions from the individual tests have been reported in the global &lt;a href=&quot;#sec3&quot;&gt;3. Conclusions from assessment tests&lt;/a&gt; section above.
&lt;/p&gt;

&lt;a name=&quot;sec51&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.1 Nominal Tests&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;Nominal tests&lt;/b&gt; - assess how the various kinds of APIs of Spark are used to read data from ES: the RDD API, the legacy DataFrame API (SQLContext) and the new DataFrame API (SparkSession).
&lt;/p&gt;

&lt;a name=&quot;sec511&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.1.1 Legacy RDD API on bank dataset&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_1_test_bank_rdd_legacy.sh&quot;&gt;&lt;code&gt;1_nominal_1_test_bank_rdd_legacy.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Bank Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how Spark&apos;s RDD API can be used to fetch data from ElasticSearch and how sharding in ES impacts executor layout on the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of spark Script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_1_1&quot;)

# SparkContext and SQLContext
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Simplest possible query
q = &quot;?q=*&quot;

es_read_conf = {
    &quot;es.resource&quot; : &quot;bank&quot;,
    &quot;es.query&quot; : q
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass=&quot;org.elasticsearch.hadoop.mr.EsInputFormat&quot;,
    keyClass=&quot;org.apache.hadoop.io.NullWritable&quot;, 
    valueClass=&quot;org.elasticsearch.hadoop.mr.LinkedMapWritable&quot;, 
    conf=es_read_conf)

es_df = sqlContext.createDataFrame(es_rdd)

# I need to collect the result to show them on the console
data_list = es_df.collect()

print (&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print x

# Print count : THIS IS FUNNY : 
# it relaunches the whole Distributed Data Frame Processing
print (&quot;Fetched %s accounts (re-computed)&quot;) % es_df.count()

# Print count 
print (&quot;Fetched %s accounts (from collected list)&quot;) % len (data_list)
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_1_test_bank_rdd_legacy.log&quot;&gt;1_nominal_1_test_bank_rdd_legacy.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-1 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-1 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-1 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_1_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_1_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-1 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Spark can read data from ElasticSearch using the ES-Hadoop connector and the RDD API really out of the box. 
&lt;br&gt;
One just needs to configure a few settings to the &lt;code&gt;newAPIHadoopRDD&lt;/code&gt; API:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;inputFormatClass=&quot;org.elasticsearch.hadoop.mr.EsInputFormat&quot;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;keyClass=&quot;org.apache.hadoop.io.NullWritable&quot;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;valueClass=&quot;org.elasticsearch.hadoop.mr.LinkedMapWritable&quot;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Mesos spreads the workload on the cluster efficiently.
&lt;ul&gt;
&lt;li&gt;This test was run alone on the cluster&lt;/li&gt;
&lt;li&gt;2 nodes are sufficient to run the job since, thanks to replicas, two nodes actually hold all the shards&lt;/li&gt;
&lt;li&gt;Mesos creates a dedicated spark executor on each of the 2 nodes&lt;/li&gt;
&lt;li&gt;Spark then successfully distributes the RDD on the 2 executors&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Data-locality optimization works out of the box.
&lt;ul&gt;
&lt;li&gt;There are 5 shards in ElasticSearch, which, with replicas, are well spread on the cluster&lt;/li&gt;
&lt;li&gt;Mesos / Spark dispatches the workload efficiently since it creates 5 RDD partitions for the 5 shards, each of them respecting data locality (&lt;code&gt;NODE_LOCAL&lt;/code&gt;) and as such respecting the requirements given by the ES-Hadoop connector.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;



&lt;a name=&quot;sec512&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.1.2 Legacy DataFrame API on bank dataset&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_2_test_bank_df_legacy.sh&quot;&gt;&lt;code&gt;1_nominal_2_test_bank_df_legacy.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Bank Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how Spark&apos;s legacy DataFrame API can be used to fetch data from ElasticSearch and how sharding in ES impacts executor layout on the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of spark Script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_1_2&quot;)

## !!! Caution : this is pre 2.0 API !!! 

# SparkContext and SQLContext
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

es_df = sqlContext.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(pushdown=True) \
            .load(&quot;bank&quot;) \
            .where(&quot;gender=&apos;F&apos;&quot;)

# I need to collect the result to show them on the console
data_list = es_df.collect()

print (&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print x

# Print count : THIS IS FUNNY : 
# it relaunches the whole Distributed Data Frame Processing
print (&quot;Fetched %s women accounts (re-computed)&quot;) % es_df.count()

# Print count 
print (&quot;Fetched %s women accounts (from collected list)&quot;) % len (data_list)
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_2_test_bank_df_legacy.log&quot;&gt;1_nominal_2_test_bank_df_legacy.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-2 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-2 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-2 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-2 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Spark can read data from ElasticSearch using the ES-Hadoop connector and the Legacy DataFrame API (SQLContext) really out of the box. 
&lt;br&gt;
The single configuration required is &lt;code&gt;format(&quot;org.elasticsearch.spark.sql&quot;)&lt;/code&gt; on the &lt;code&gt;SQLContext&lt;/code&gt; API.
&lt;/li&gt;
&lt;li&gt;
Here as well, the dynamic allocation system allocates nodes to the job one after the other. 
&lt;br&gt;
After two nodes are allocated to the job, all shards (thanks to replicas) become available locally and the data-locality optimization can be satisfied without requiring any additional node. The job executes on these 2 nodes.
&lt;/li&gt;
&lt;li&gt;
In this case, as seen on &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2_job_0_stage_0.png&quot;&gt;result_1_2_job_0_stage_0.png&lt;/a&gt;, Spark successfully respects data locality as well.
&lt;/li&gt;
&lt;/ul&gt;
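&lt;p&gt;
To make the replica point above concrete, here is a plain-Python toy (an illustration of the reasoning, not the actual ES-Hadoop logic): with one replica per shard, every shard has a copy on two nodes, so a simple greedy cover shows why two nodes are enough to serve all shards locally.
&lt;/p&gt;

```python
# Toy illustration: which nodes are needed so that every shard has a
# local copy? (Not the actual ES-Hadoop allocation logic.)
def nodes_covering_all_shards(copies):
    """copies: dict mapping shard id -> set of nodes holding a copy.
    Greedy set cover: repeatedly pick the node holding the most
    still-uncovered shards."""
    uncovered = set(copies)
    chosen = []
    while uncovered:
        counts = {}
        for shard in uncovered:
            for node in copies[shard]:
                counts[node] = counts.get(node, 0) + 1
        best = max(counts, key=counts.get)
        chosen.append(best)
        uncovered -= {s for s in uncovered if best in copies[s]}
    return chosen

# 3 primary shards spread over 3 nodes, 1 replica each (hypothetical layout):
copies = {
    0: {"192.168.10.10", "192.168.10.11"},
    1: {"192.168.10.11", "192.168.10.12"},
    2: {"192.168.10.12", "192.168.10.10"},
}
print(len(nodes_covering_all_shards(copies)))  # → 2
```

Whichever two nodes end up chosen, every shard has a primary or replica there, which is why the job above never needs a third executor.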


&lt;a name=&quot;sec513&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.1.3 DataFrame API on bank dataset&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_3_test_bank_df.sh&quot;&gt;&lt;code&gt;1_nominal_3_test_bank_df.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Bank Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how Spark&apos;s New (&gt;= 2.0) DataFrame API can be used to fetch data from ElasticSearch and how sharding in ES impacts the executors&apos; layout on the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of spark Script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_1_3&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(pushdown=True) \
            .load(&quot;bank&quot;) \
            .where(&quot;gender=&apos;F&apos;&quot;)

# I need to collect the result to show them on the console
data_list = es_df.collect()

print(&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print(x)

# Print count : THIS IS FUNNY : 
# it relaunches the whole Distributed Data Frame Processing
print(&quot;Fetched %s women accounts (re-computed)&quot; % es_df.count())

# Print count 
print(&quot;Fetched %s women accounts (from collected list)&quot; % len(data_list))
&lt;/pre&gt;
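&lt;p&gt;
The &quot;THIS IS FUNNY&quot; remark in the script above is a consequence of lazy evaluation: a DataFrame is a recipe, not a materialized result, and &lt;code&gt;collect()&lt;/code&gt; does not cache anything, so the subsequent &lt;code&gt;count()&lt;/code&gt; replays the whole distributed read. The toy class below (plain Python, no Spark) illustrates the mechanism; in Spark itself, calling &lt;code&gt;cache()&lt;/code&gt; or &lt;code&gt;persist()&lt;/code&gt; on the DataFrame before the first action avoids the second pass.
&lt;/p&gt;

```python
# Toy analogy of Spark's lazy evaluation (plain Python, no Spark):
# each action replays the recipe unless the result is cached.
class LazyFrame:
    def __init__(self, source):
        self.source = source   # the "recipe": a callable producing rows
        self.replays = 0       # how many times the source was re-read
    def collect(self):
        self.replays += 1
        return list(self.source())
    def count(self):
        self.replays += 1
        return sum(1 for _ in self.source())

df = LazyFrame(lambda: (row for row in range(1000) if row % 2 == 0))
rows = df.collect()   # first full pass over the source
n = df.count()        # second full pass: the recipe is replayed
print(df.replays, n, len(rows))  # → 2 500 500
```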

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_3_test_bank_df.log&quot;&gt;1_nominal_3_test_bank_df.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_3.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_3.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-3 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_3.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_3.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-3 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_3.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_3.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-3 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_3_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_3_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-3 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Spark can read data from ElasticSearch using the ES-Hadoop connector and the New DataFrame API (&lt;code&gt;SparkSession&lt;/code&gt;) really out of the box. 
&lt;br&gt;
Here as well, the only configuration required is &lt;code&gt;format(&quot;org.elasticsearch.spark.sql&quot;)&lt;/code&gt; on the &lt;code&gt;SparkSession&lt;/code&gt; API.
&lt;/li&gt;
&lt;li&gt;
Nothing specific to report regarding the other aspects: Mesos and Spark&apos;s dynamic allocation system distribute the workload as expected, still creating dedicated Spark executors on only 2 of the cluster&apos;s nodes, which is sufficient (thanks to replicas), Spark strictly respects data locality, etc.
&lt;/li&gt;
&lt;/ul&gt;
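&lt;p&gt;
As a side note on the &lt;code&gt;pushdown&lt;/code&gt; option used in these scripts, the idea is easy to sketch (plain Python, not the connector&apos;s actual translation layer): the &lt;code&gt;where()&lt;/code&gt; predicate is turned into an ElasticSearch query, so only matching documents ever leave the ES nodes.
&lt;/p&gt;

```python
# Minimal sketch of predicate pushdown (not ES-Hadoop's actual code):
# translate a trivial SQL-like equality predicate into an ES query body.
def to_es_query(column, op, value):
    if op == "=":
        return {"query": {"term": {column: value}}}
    raise NotImplementedError("operator not handled in this sketch: %s" % op)

# where("gender='F'") conceptually becomes:
print(to_es_query("gender", "=", "F"))
# → {'query': {'term': {'gender': 'F'}}}
```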



&lt;a name=&quot;sec514&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.1.4 DataFrame API on Apache-logs dataset&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_4_test_apache-logs_df.sh&quot;&gt;&lt;code&gt;1_nominal_4_test_apache-logs_df.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Apache Logs Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how Spark&apos;s New (&gt;= 2.0) DataFrame API can be used to fetch data from ElasticSearch with another dataset and how sharding in ES impacts the executors&apos; layout on the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of spark Script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_1_4&quot;)

# es.read.field.exclude (default empty) : 
#  Fields/properties that are discarded when reading the documents 
#  from Elasticsearch
conf.set (&quot;es.read.field.exclude&quot;, &quot;relatedContent&quot;)

# es.read.field.as.array.include (default empty) : 
#  Fields/properties that should be considered as arrays/lists
conf.set (&quot;es.read.field.as.array.include&quot;, &quot;@tags,headings,links&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf= { 
    &quot;pushdown&quot;: True
}

es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(**es_query_conf) \
            .load(&quot;apache-logs-*&quot;)


# I need to collect the result to show them on the console
data_list = es_df.collect()

print(&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print(x)

# Print count : THIS IS FUNNY : 
# it relaunches the whole Distributed Data Frame Processing
print(&quot;Fetched %s logs (re-computed)&quot; % es_df.count())

# Print count 
print(&quot;Fetched %s logs (from collected list)&quot; % len(data_list))
&lt;/pre&gt;
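&lt;p&gt;
The two &lt;code&gt;es.read.*&lt;/code&gt; options configured above are easy to picture with a toy document mapper (plain Python, not the connector itself): excluded fields are dropped, and &quot;array&quot; fields are wrapped into lists, which is needed because an ES mapping does not distinguish a scalar from an array of one element.
&lt;/p&gt;

```python
# Toy illustration of the two read options above (not ES-Hadoop itself).
def read_doc(doc, exclude=(), as_array=()):
    out = {}
    for field, value in doc.items():
        if field in exclude:
            continue                      # es.read.field.exclude
        if field in as_array and not isinstance(value, list):
            value = [value]               # es.read.field.as.array.include
        out[field] = value
    return out

doc = {"@tags": "warning", "title": "hello", "relatedContent": {"big": "blob"}}
print(read_doc(doc, exclude=("relatedContent",), as_array=("@tags",)))
# → {'@tags': ['warning'], 'title': 'hello'}
```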

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_4_test_apache-logs_df.log&quot;&gt;1_nominal_4_test_apache-logs_df.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_4.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_4.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-4 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_4.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_4.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-4 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_4.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_4.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-4 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_4_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_4_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-4 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
N.D. (nothing to declare). Everything works as expected (see the previous test results from the &lt;a href=&quot;#sec51&quot;&gt;1 Nominal Tests&lt;/a&gt; family). 
&lt;br&gt;
Interestingly, the workload here justifies booking the three nodes of the cluster, which is successfully achieved since the job runs alone on the cluster.
&lt;/p&gt;

&lt;a name=&quot;sec515&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.1.5 DataFrame API on Shakespeare dataset&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_5_test_shakespeare.sh&quot;&gt;&lt;code&gt;1_nominal_5_test_shakespeare.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Shakespeare&apos;s Works Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how Spark&apos;s New (&gt;= 2.0) DataFrame API can be used to fetch data from ElasticSearch with another dataset and how sharding in ES impacts the executors&apos; layout on the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of spark Script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_1_5&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf= { 
    &quot;pushdown&quot;: True
}

es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(**es_query_conf) \
            .load(&quot;shakespeare*&quot;)

# Collect result to the driver
data_list = es_df.collect()

print(&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print(x)

# Print count : THIS IS FUNNY : 
# it relaunches the whole Distributed Data Frame Processing
print(&quot;Fetched %s logs (re-computed)&quot; % es_df.count())

# Print count 
print(&quot;Fetched %s logs (from collected list)&quot; % len(data_list))
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_5_test_shakespeare.log&quot;&gt;1_nominal_5_test_shakespeare.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_5.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_5.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-5 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_5.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_5.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-5 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_5.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_5.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-5 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_5_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_5_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 1-5 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
N.D. (nothing to declare). Everything works as expected (see the previous test results from the &lt;a href=&quot;#sec51&quot;&gt;1 Nominal Tests&lt;/a&gt; family).
&lt;br&gt;
This time, however, due to the lack of replicas, the three nodes are actually required to satisfy the data-locality optimization. The allocation of the 3 nodes to the job again happens successfully since the job runs alone.
&lt;/p&gt;


&lt;a name=&quot;sec52&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.2 Data-locality tests&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;Data-locality tests&lt;/b&gt; - assess how data-locality optimization between ES and Spark works and to what extent.
&lt;/p&gt;

&lt;a name=&quot;sec521&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.2.1 Bank dataset with 1 shard&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_1_bank_one_shard.sh&quot;&gt;&lt;code&gt;2_collocation_1_bank_one_shard.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Bank Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: Assess how data-locality works when using a dataset with a single shard on a single node of the cluster, and see what decisions Mesos / Spark will take from ES-Hadoop&apos;s requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Expected Behaviour&lt;/b&gt;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_1.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;Data Architecture Test 2_1&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_1.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of spark Script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_2_1&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# (1)
es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(pushdown=True) \
            .load(&quot;bank&quot;) \
            .where(&quot;gender=&apos;F&apos;&quot;)

# (2) Print size of every partition on nodes
def f(iterable):
    # need to convert iterable to list to have len()
    print(&quot;Fetched %s rows on node&quot; % len(list(iterable)))
es_df.foreachPartition(f)

# (3) I need to collect the result to show them on the console
data_list = es_df.collect()

print(&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print(x)

# Print count 
print(&quot;Fetched %s women accounts (from collected list)&quot; % len(data_list))
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_1_bank_one_shard.log&quot;&gt;2_collocation_1_bank_one_shard.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-1 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-1 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-1 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_1_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_1_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-1 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;In terms of workload distribution, this is where Dynamic Allocation is really cool.&lt;/b&gt; Since the single shard is on a single node, the Spark Dynamic Allocation system, with the help of Mesos, takes care of booking that single node, and that node only, for the Spark processing job.
&lt;/li&gt;
&lt;li&gt;
As a sidenote, using static allocation here, Spark on Mesos would have booked the whole cluster for the job, which would have been far from optimal in terms of workload distribution. Since the cluster is fully available, Mesos would have booked it all for the job to come, but eventually 2 of the 3 Spark executors wouldn&apos;t have been used at all.
&lt;br&gt;
That wouldn&apos;t have been a big deal since this test runs alone. But if further jobs were submitted to the cluster and requested an executor, they would have to wait for that first job to finish before they could share the cluster among them. 
&lt;/li&gt;
&lt;li&gt;Data-locality works as expected. The single shard is located on &lt;code&gt;192.168.10.12&lt;/code&gt; and both the driver logs and the Spark Console for Job 0 / Stage 0 confirm that the co-located Spark executor has been the only one processing the data.
&lt;/li&gt;
&lt;/ul&gt;



&lt;a name=&quot;sec522&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.2.2 Bank dataset with 2 shards&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_2_bank_two_shards.sh&quot;&gt;&lt;code&gt;2_collocation_2_bank_two_shards.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Bank Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: Assess how data-locality optimization works when using a dataset with two shards on two nodes of the cluster, and see what decisions Mesos / Spark will take from ES-Hadoop&apos;s requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Expected Behaviour&lt;/b&gt;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_2.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;Data Architecture Test 2_2&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_2.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of spark Script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_2_2&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# (1)
es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(pushdown=True) \
            .load(&quot;bank&quot;) \
            .where(&quot;gender=&apos;F&apos;&quot;)

# (2) Print size of every partition on nodes
def f(iterable):
    # need to convert iterable to list to have len()
    print(&quot;Fetched %s rows on node&quot; % len(list(iterable)))
es_df.foreachPartition(f)

# (3) I need to collect the result to show them on the console
data_list = es_df.collect()

print(&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print(x)

# Print count 
print(&quot;Fetched %s women accounts (from collected list)&quot; % len(data_list))
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_2_bank_two_shards.log&quot;&gt;2_collocation_2_bank_two_shards.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-2 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-2 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-2 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_2_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_2_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-2 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Same remark as above regarding workload distribution. 
&lt;br&gt;
&lt;b&gt;Dynamic Allocation is really cool.&lt;/b&gt; Since the two shards are on two nodes, the Spark Dynamic Allocation System, with the help of Mesos, takes care of booking the two corresponding nodes as well for the Spark processing job.
&lt;/li&gt;
&lt;li&gt;Data locality works as expected. The two shards are on &lt;code&gt;192.168.10.10&lt;/code&gt; and &lt;code&gt;192.168.10.12&lt;/code&gt; and both the driver logs and the Spark Console for Job 0 / Stage 0 confirm that both co-located Spark executors have been used to process the 2 shards.
&lt;br&gt;
The tasks have been executed with &lt;code&gt;NODE_LOCAL&lt;/code&gt; locality level.
&lt;/li&gt;
&lt;/ul&gt;
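&lt;p&gt;
The &lt;code&gt;NODE_LOCAL&lt;/code&gt; level mentioned above can be pictured with a toy assignment function (plain Python, not Spark&apos;s actual scheduler): a task runs NODE_LOCAL when an executor lives on a node holding the shard, and degrades to ANY, with the data travelling over the network, when no co-located executor exists.
&lt;/p&gt;

```python
# Toy sketch of locality-aware task assignment (not Spark's scheduler).
def assign(tasks, executors):
    """tasks: dict shard id -> node holding the shard;
    executors: set of nodes running a Spark executor."""
    plan = {}
    for shard, node in tasks.items():
        if node in executors:
            plan[shard] = (node, "NODE_LOCAL")
        else:
            # no co-located executor: any executor will do, but the
            # shard's data has to travel over the network
            plan[shard] = (sorted(executors)[0], "ANY")
    return plan

tasks = {0: "192.168.10.10", 1: "192.168.10.12"}   # hypothetical layout
plan = assign(tasks, {"192.168.10.10", "192.168.10.12"})
print(plan[0][1], plan[1][1])  # → NODE_LOCAL NODE_LOCAL
```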



&lt;a name=&quot;sec523&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.2.3 Bank dataset with 3 shards&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_3_bank_three_shards.sh&quot;&gt;&lt;code&gt;2_collocation_3_bank_three_shards.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Bank Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: Assess how data-locality optimization works when using a dataset with three shards on three nodes of the cluster, and see what decisions Mesos / Spark will take from ES-Hadoop&apos;s requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Expected Behaviour&lt;/b&gt;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_3.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px; &quot; alt=&quot;Data Architecture Test 2_3&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_3.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of spark Script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_2_3&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# (1)
es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(pushdown=True) \
            .load(&quot;bank&quot;) \
            .where(&quot;gender=&apos;F&apos;&quot;)

# (2) Print size of every partition on nodes
def f(iterable):
    # need to convert iterable to list to have len()
    print(&quot;Fetched %s rows on node&quot; % len(list(iterable)))
es_df.foreachPartition(f)

# (3) I need to collect the result to show them on the console
data_list = es_df.collect()

print(&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print(x)

# Print count 
print(&quot;Fetched %s women accounts (from collected list)&quot; % len(data_list))
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_3_bank_three_shards.log&quot;&gt;2_collocation_3_bank_three_shards.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_3.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_3.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-3 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_3.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_3.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-3 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_3.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_3.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-3 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_3_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_3_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-3 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Three shards on three nodes means three nodes booked for processing; everything works as expected.
&lt;/li&gt;
&lt;li&gt;Data locality works as expected: the 3 Spark executors &lt;i&gt;consume&lt;/i&gt; data from their co-located shards, as confirmed by the driver logs and the Spark Application UI.
&lt;/li&gt;
&lt;/ul&gt;
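&lt;p&gt;
The &lt;code&gt;foreachPartition&lt;/code&gt; callback used in the scripts above receives each partition as a plain iterator, which is why the code converts it to a list before calling &lt;code&gt;len()&lt;/code&gt;. A minimal plain-Python sketch of that mechanism (no Spark required, partition contents made up):
&lt;/p&gt;

```python
# Simulated partitions: each one is an iterator, as handed to the
# foreachPartition callback (in a real cluster the print below would land
# in the executor logs, not on the driver console).
partitions = [iter(["row%d" % i for i in range(n)]) for n in (3, 5, 2)]

def f(iterable):
    # iterators have no len(): materialise into a list first
    rows = list(iterable)
    print("Fetched %s rows on node" % len(rows))
    return len(rows)

sizes = [f(p) for p in partitions]   # one call per (simulated) partition
```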


&lt;a name=&quot;sec524&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.2.4 Bank dataset with 1 shard and replicas&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_4_bank_one_shard_with_replicas.sh&quot;&gt;&lt;code&gt;2_collocation_4_bank_one_shard_with_replicas.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Bank Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: Assess how data-locality works when using a dataset with one shard and two replicas on three nodes of the cluster, and see what decisions Mesos / Spark take based on ES-Hadoop&apos;s requirements.
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Expected Behaviour&lt;/b&gt;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_4.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;Data Architecture Test 2_4&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_4.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of the Spark script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_2_4&quot;)

# FIXME GET RID OF THESE TESTS
# Trying some ES settings
# conf.set (&quot;spark.es.input.max.docs.per.partition&quot;, 100)
# That doesn&apos;t really help =&gt; it split the dataframe on several nodes indeed 
# but it doesn&apos;t impact the fetching

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# (1)
es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(pushdown=True) \
            .load(&quot;bank&quot;) \
            .where(&quot;gender=&apos;F&apos;&quot;)

# (2) Print size of every partition on nodes
def f(iterable):
    # need to convert the iterator to a list to be able to call len()
    print(&quot;Fetched %s rows on node&quot; % len(list(iterable)))
es_df.foreachPartition(f)

# (3) I need to collect the result to show them on the console
data_list = es_df.collect()

print (&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print x

# Print count 
print(&quot;Fetched %s women accounts (from collected list)&quot; % len(data_list))
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_4_bank_one_shard_with_replicas.log&quot;&gt;2_collocation_4_bank_one_shard_with_replicas.log&lt;/a&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_4.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_4.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-4 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_4.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_4.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-4 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_4.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_4.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-4 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_4_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_4_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-4 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
In this case, it&apos;s really as if every ES node of the cluster had a copy of the data: ElasticSearch makes no distinction between primary shards and secondary shards (replicas) when it comes to serving requests. 
&lt;/li&gt;
&lt;li&gt;
Which node ends up executing the processing is essentially random: across several executions, I always ended up with a different Spark node executing the whole processing. Data co-locality still kicks in, though, and a single node still does the whole processing every time.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Important note: all of the above holds under normal conditions. On a heavily loaded cluster, the results can be significantly different.&lt;/b&gt; 
&lt;br&gt;
By running different scenarii under different conditions, I have been able to identify two situations in addition to the nominal one (the one on an idle cluster):
&lt;ul&gt;
&lt;li&gt;
First, it can happen that Mesos tries to distribute a specific Spark processing part to the Spark executor co-located with the ES shard. 
&lt;br&gt;
But when the Spark processing finally queries that local node to fetch the shard, it may well happen that this ES node is busy answering a different request from a different client application.
&lt;br&gt;
In this case, that local ES node will report itself as busy and will ask another node of the ES cluster to serve the request.
&lt;br&gt;
So even though Mesos / Spark initially distributed the workload to the node local to the shard in ES, &lt;b&gt;eventually the request will be served by another, distant node of the cluster&lt;/b&gt;.
&lt;/li&gt;
&lt;li&gt;
Second, it can also happen that all the Spark executors co-located with the ES nodes owning a shard (primary or replica) are busy.
&lt;br&gt;
In this case, Mesos / Spark will only wait a few seconds for one of these nodes to become free, and if that fails to happen, &lt;b&gt;eventually a different Mesos node will run the processing, regardless of data locality&lt;/b&gt;.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
The difference here is that the existence of replicas &lt;b&gt;gives ElasticSearch a choice&lt;/b&gt;:
&lt;br&gt;
ElasticSearch can answer and serve the data from a node other than the local one whenever the local node is busy!
&lt;/li&gt;
&lt;li&gt;
In addition, Mesos / Spark will only wait &lt;code&gt;spark.locality.wait=10s&lt;/code&gt; when trying to schedule a specific processing part on the node local to the ES node owning a shard (or a replica). If none of these nodes (owning the primary shard or one of the replicas) becomes free and available within that amount of time, then Mesos will distribute the workload to another available node of the Mesos cluster.
&lt;/li&gt;
&lt;/ul&gt;
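&lt;p&gt;
As a side note, the locality wait discussed above is tunable. A minimal sketch of passing it to &lt;code&gt;spark-submit&lt;/code&gt; (the script name and the other flags are illustrative, not taken from the test scripts):
&lt;/p&gt;

```shell
# spark.locality.wait: how long Spark waits for a data-local executor slot
# before falling back to a less local one (Spark default: 3s; 10s here).
spark-submit \
  --conf spark.locality.wait=10s \
  my_es_spark_job.py
```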



&lt;a name=&quot;sec525&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.2.5 Testing repartitioning&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_5_bank_one_shard_repartition_NOT_WORKING.sh&quot;&gt;&lt;code&gt;2_collocation_5_bank_one_shard_repartition_NOT_WORKING.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Bank Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: Assess how one can redistribute data on the cluster after loading it from a subset of the cluster nodes (for instance a single node), i.e. see how Spark can redistribute the data evenly across the cluster after loading an unbalanced dataset.
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Expected Behaviour&lt;/b&gt;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_5.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;Data Architecture Test 2_5&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_2_5.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of the Spark script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_2_5&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# (1)
es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(pushdown=True) \
            .load(&quot;bank&quot;) \
            .where(&quot;gender=&apos;F&apos;&quot;) 

# Print size of every partition on nodes
def f(iterable):
    # need to convert the iterator to a list to be able to call len()
    print(&quot;A - %s rows stored on node&quot; % len(list(iterable)))
es_df.foreachPartition(f)

# Doesn&apos;t help
#es_df2 = es_df.coalesce(1) 
## Print size of every partition on nodes
#es_df2.foreachPartition(f)

# (2)
es_df3 = es_df.repartition(4 * 3) 
# Print size of every partition on nodes
es_df3.foreachPartition(f)

# (3) I need to collect the result to show them on the console
data_list = es_df3.collect()
print (&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print x

# Print count 
print(&quot;Fetched %s women accounts (from collected list)&quot; % len(data_list))

# Print
print (ss._jsc.sc().getExecutorMemoryStatus().size())
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_5_bank_one_shard_repartition_NOT_WORKING.log&quot;&gt;2_collocation_5_bank_one_shard_repartition_NOT_WORKING.log&lt;/a&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:
&lt;/ul&gt;


&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_5.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_5.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-5 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_5.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_5.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-5 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_5.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_5.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-5 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_5_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_5_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 2-5 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;I haven&apos;t been able to make repartitioning work the way I intended&lt;/b&gt;.
&lt;ul&gt;
&lt;li&gt;All of my tests eventually led to the underlying RDD being repartitioned, but all the partitions remained local to the node that initially owned the data&lt;/li&gt;
&lt;li&gt;I never managed to find a way to make the Spark cluster redistribute the different partitions to the various Spark executors of the cluster&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;
I don&apos;t know whether that comes from Spark somehow &lt;i&gt;knowing&lt;/i&gt; that it doesn&apos;t need to redistribute the partitions for the post-processing to be performed efficiently.
&lt;/li&gt;
&lt;li&gt;
Long story short, I have no real conclusion in this regard, which is why the schema above is crossed out with an X.
&lt;/li&gt;
&lt;/ul&gt;
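&lt;p&gt;
For reference, what &lt;code&gt;repartition(n)&lt;/code&gt; is supposed to do can be sketched without Spark: it deals the rows out to &lt;code&gt;n&lt;/code&gt; new partitions, but it says nothing about which executor ends up hosting each partition, which is precisely the part I could not influence. A minimal sketch with made-up rows:
&lt;/p&gt;

```python
# Sketch of repartition(n): rows are dealt out round-robin to n new
# partitions; where those partitions physically end up is a separate,
# scheduler-level decision that this call does not control.
def repartition(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)   # round-robin assignment
    return parts

rows = list(range(100))            # stand-in for the rows held by one node
parts = repartition(rows, 12)      # mirrors es_df.repartition(4 * 3)
sizes = [len(p) for p in parts]    # partition sizes end up nearly even
```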



&lt;a name=&quot;sec53&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.3 Aggregation tests&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;Aggregation tests&lt;/b&gt; - assess how aggregation on ES data works.
&lt;/p&gt;

&lt;a name=&quot;sec531&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.3.1 ES-side Aggregations&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/3_aggregation_1_es_shakespeare_rdd_legacy_NOT_WORKING.sh&quot;&gt;&lt;code&gt;3_aggregation_1_es_shakespeare_rdd_legacy_NOT_WORKING.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Shakespeare&apos;s Works Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how Spark can exploit native ElasticSearch features such as ES-side aggregations instead of performing aggregations on its own.
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Expected Behaviour&lt;/b&gt;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_3_1.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;Data Architecture Test 3_1&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_3_1.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of the Spark script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_3_1&quot;)

# SparkContext and SQLContext
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# -&gt; query dsl
es_aggregations_query = &apos;&apos;&apos;
{ 
    &quot;query&quot; : { &quot;match_all&quot;: {} },
    &quot;size&quot; : 0,
    &quot;aggregations&quot; : { 
        &quot;play_name&quot;: {
            &quot;terms&quot;: {
                &quot;field&quot; : &quot;play_name&quot;
            }
        }
    }
}
&apos;&apos;&apos;

es_read_conf = {
    &quot;es.resource&quot; : &quot;shakespeare&quot;,
    &quot;es.endpoint&quot; : &quot;_search&quot;,
    &quot;es.query&quot; : es_aggregations_query
}

# (1)
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass=&quot;org.elasticsearch.hadoop.mr.EsInputFormat&quot;,
    keyClass=&quot;org.apache.hadoop.io.NullWritable&quot;, 
    valueClass=&quot;org.elasticsearch.hadoop.mr.LinkedMapWritable&quot;, 
    conf=es_read_conf)

es_df = sqlContext.createDataFrame(es_rdd)

# I need to collect the result 
data_list = es_df.collect()

print (&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print x

# Print count 
print(&quot;Fetched %s rows (from collected list)&quot; % len(data_list))

&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/3_aggregation_1_es_shakespeare_rdd_legacy_NOT_WORKING.log&quot;&gt;3_aggregation_1_es_shakespeare_rdd_legacy_NOT_WORKING.log&lt;/a&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_3_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_3_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 3-1 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_3_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_3_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 3-1 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 3-1 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_1_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_1_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 3-1 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;There is simply no way at the moment to submit specific requests, such as aggregation requests, from Spark to ElasticSearch using the ES-Hadoop connector.&lt;/b&gt;
&lt;/li&gt;
&lt;li&gt;
The need is well identified but the solution is still a work in progress: &lt;a href=&quot;https://github.com/elastic/elasticsearch-hadoop/issues/276&quot;&gt;https://github.com/elastic/elasticsearch-hadoop/issues/276&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
Since it&apos;s impossible to make this work as expected, the schematic above is crossed out with an X.
&lt;/li&gt;
&lt;/ul&gt;
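&lt;p&gt;
A possible workaround (my suggestion, not covered by the tests above) is to bypass the connector and call the &lt;code&gt;_search&lt;/code&gt; endpoint directly over HTTP, then parse the aggregation buckets from the JSON response. The response below is canned to mimic the shape ElasticSearch returns for the terms aggregation defined in the script; the figures are made up:
&lt;/p&gt;

```python
# Canned example of the JSON ElasticSearch returns for the play_name terms
# aggregation above (doc_count values are made up for illustration).
response = {
    "hits": {"total": 111396, "hits": []},
    "aggregations": {
        "play_name": {
            "buckets": [
                {"key": "Hamlet", "doc_count": 4244},
                {"key": "Coriolanus", "doc_count": 3992},
            ],
        },
    },
}

# Map each bucket to play_name: doc_count
counts = {bucket["key"]: bucket["doc_count"]
          for bucket in response["aggregations"]["play_name"]["buckets"]}
```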



&lt;a name=&quot;sec532&quot;&gt;&lt;/a&gt;
&lt;h4&gt;5.3.2 Spark-side Aggregations&lt;/h4&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/3_aggregation_2_spark_shakespeare.sh&quot;&gt;&lt;code&gt;3_aggregation_2_spark_shakespeare.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: Shakespeare&apos;s Works Dataset from &lt;a href=&quot;https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json&quot;&gt;https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how Spark performs aggregations on its own.
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Expected Behaviour&lt;/b&gt;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_3_2.png&quot;&gt;
&lt;img style=&quot;width: 600px;&quot; alt=&quot;Data Architecture Test 3_2&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_3_2.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of the Spark script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration 
conf = SparkConf().setAppName(&quot;ESTest_3_2&quot;)

# Every time there is a shuffle, Spark needs to decide how many partitions
# the shuffled RDD will have.
# Twice the number of CPUs on the cluster is a good value (the default is 200)
conf.set(&quot;spark.sql.shuffle.partitions&quot;, &quot;12&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf= { 
    &quot;pushdown&quot;: True
}

# (1)
es_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(**es_query_conf) \
            .load(&quot;shakespeare*&quot;)

# (2) Compute aggregates : I want the count of lines per book
agg_df = es_df.groupBy(es_df.play_name).count()

# (3) Collect result to the driver
data_list = agg_df.collect()

print (&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print x

# Print count 
print(&quot;Fetched %s rows (from collected list)&quot; % len(data_list))
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/3_aggregation_2_spark_shakespeare.log&quot;&gt;3_aggregation_2_spark_shakespeare.log&lt;/a&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:
&lt;/ul&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_3_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_3_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 3-2 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_3_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_3_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 3-2 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 3-2 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_2_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_2_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 3-2 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There isn&apos;t a lot to conclude here: everything works as expected, and the reader is referred to the Data Flow schematic above.&lt;/li&gt;
&lt;li&gt;Data locality kicks in as expected.
&lt;/ul&gt;
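&lt;p&gt;
The &lt;code&gt;groupBy(play_name).count()&lt;/code&gt; above boils down to a per-key row count, which a plain-Python analogue with made-up rows makes explicit:
&lt;/p&gt;

```python
# Plain-Python analogue of es_df.groupBy(es_df.play_name).count():
# count the rows sharing each play_name key (rows are made up).
from collections import Counter

rows = [
    {"play_name": "Hamlet", "speaker": "HAMLET"},
    {"play_name": "Macbeth", "speaker": "MACBETH"},
    {"play_name": "Hamlet", "speaker": "HORATIO"},
]
agg = Counter(r["play_name"] for r in rows)
```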



&lt;a name=&quot;sec54&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.4 Join test&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/4_join_1_swissdata_df.sh&quot;&gt;&lt;code&gt;4_join_1_swissdata_df.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: two datasets in fact:
&lt;ul&gt;
&lt;li&gt;
from &lt;a href=&quot;http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv&quot;&gt;http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv&lt;/a&gt; : the list of AirBnB offers in Switzerland as of July 2017.
&lt;/li&gt;
&lt;li&gt;
from &lt;a href=&quot;http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt&quot;&gt;http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt&lt;/a&gt; : the list of swiss cities with population and geoloc information.
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how the ELK-MS stack behaves when it has several datasets to load from ES into Spark and then join.
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Expected Behaviour&lt;/b&gt;
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_4_1.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;Data Architecture Test 4_1&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi_data_4_1.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Relevant portion of the Spark script&lt;/b&gt;
&lt;/p&gt;

&lt;pre&gt;
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
import pyspark.sql.functions as F

# Spark configuration 
# all these options can be given to the command line to spark-submit
# (they would need to be prefixed by &quot;spark.&quot;)
conf = SparkConf().setAppName(&quot;ESTest_4_1&quot;)

# Every time there is a shuffle, Spark needs to decide how many partitions
# the shuffled RDD will have.
# Twice the number of CPUs on the cluster is a good value (the default is 200)
conf.set(&quot;spark.sql.shuffle.partitions&quot;, &quot;12&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()


# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf= { 
    &quot;pushdown&quot;: True
}

# (1).1 Read city and population
citypop_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(**es_query_conf) \
            .load(&quot;swiss-citypop&quot;) \
            .alias(&quot;citypop_df&quot;)

# (1).2. Read airbnb offers
airbnb_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(**es_query_conf) \
            .load(&quot;swiss-airbnb&quot;) \
            .alias(&quot;airbnb_df&quot;)

# (2) Join on city
joint_df = airbnb_df \
            .join( \
                  citypop_df, \
                  (F.lower(airbnb_df.city) == F.lower(citypop_df.accent_city)), \
                  &quot;left_outer&quot; \
                 ) \
            .select( \
                    &apos;room_id&apos;, &apos;airbnb_df.country&apos;, &apos;airbnb_df.city&apos;, \
                    &apos;room_type&apos;, &apos;bedrooms&apos;, &apos;bathrooms&apos;, &apos;price&apos;, &apos;reviews&apos;, \
                    &apos;overall_satisfaction&apos;, \
                    &apos;airbnb_df.latitude&apos;, &apos;airbnb_df.longitude&apos;, \
                    &apos;citypop_df.latitude&apos;, &apos;citypop_df.longitude&apos;, &apos;population&apos;, \
                    &apos;region&apos; \
                   )

# (3) Collect result to the driver
data_list = joint_df.collect()

print (&quot;Printing 10 first results&quot;)
for x in data_list[0:10]:
    print x

# Print count 
print(&quot;Computed %s positions (from collected list)&quot; % len(data_list))
&lt;/pre&gt;
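&lt;p&gt;
The left-outer join above boils down to matching each AirBnB offer to a city record on the lowercased city name, with unmatched offers kept and given empty city data. A plain-Python sketch with made-up rows:
&lt;/p&gt;

```python
# Left outer join of offers against city records on lowercased city name:
# offers without a matching city keep population=None, as with "left_outer".
# All rows below are illustrative, not taken from the real datasets.
citypop = {
    "zurich": {"population": 341730},
    "geneva": {"population": 183981},
}
offers = [
    {"room_id": 1, "city": "Zurich"},
    {"room_id": 2, "city": "Nowhereville"},
]

joint = [
    dict(offer,
         population=citypop.get(offer["city"].lower(), {}).get("population"))
    for offer in offers
]
```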

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Logs of the Spark Driver&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/4_join_1_swissdata_df.log&quot;&gt;4_join_1_swissdata_df.log&lt;/a&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:
&lt;/ul&gt;


&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_4_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_4_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 4-1 / Dataset in ES&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_4_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_4_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 4-1 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 4-1 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 4-1 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;      
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1_job_0_stage_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1_job_0_stage_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 4-1 / Job 0 / Stage 1&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;       
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Here as well, there isn&apos;t much to conclude. Everything works just as expected.
&lt;/li&gt;
&lt;li&gt;
Data locality kicks in both on the ES data-fetching side and on the Spark side for the join.
&lt;/li&gt;
&lt;/ul&gt;


&lt;a name=&quot;sec55&quot;&gt;&lt;/a&gt;
&lt;h3&gt;5.5 Concurrency test&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;Test details&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Test Script&lt;/b&gt;: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/5_concurrency_1_swissdata_df.sh&quot;&gt;&lt;code&gt;5_concurrency_1_swissdata_df.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Input Dataset&lt;/b&gt;: two datasets in fact:
&lt;ul&gt;
&lt;li&gt;
from &lt;a href=&quot;http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv&quot;&gt;http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv&lt;/a&gt; : the list of AirBnB offers in Switzerland as of July 2017.
&lt;/li&gt;
&lt;li&gt;
from &lt;a href=&quot;http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt&quot;&gt;http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt&lt;/a&gt; : the list of Swiss cities with population and geolocation information.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Purpose&lt;/b&gt;: see how the ELK-MS stack behaves when submitting several jobs at the same time to the cluster and what happens in terms of concurrency.
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;The Spark Script&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
The concurrency test simply executes, four times in parallel, the scenario inspired by &lt;a href=&quot;#sec532&quot;&gt;5.3.2 Spark-side Aggregations&lt;/a&gt;, as follows:
&lt;/p&gt;

&lt;pre&gt;
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
import pyspark.sql.functions as F

# Spark configuration 
conf = SparkConf()

# Every time there is a shuffle, Spark needs to decide how many partitions
# the shuffled RDD will have.
# 2 times the number of CPUs on the cluster is a good value (default is 200)
conf.set(&quot;spark.sql.shuffle.partitions&quot;, &quot;12&quot;)

# Spark SQL Session 
ss = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf= { 
    &quot;pushdown&quot;: True
}

# 1. Read city and population
citypop_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(conf=es_query_conf) \
            .load(&quot;swiss-citypop&quot;) \
            .alias(&quot;citypop_df&quot;)

# 2. Read airbnb offers
airbnb_df = ss.read \
            .format(&quot;org.elasticsearch.spark.sql&quot;) \
            .options(conf=es_query_conf) \
            .load(&quot;swiss-airbnb&quot;) \
            .alias(&quot;airbnb_df&quot;)

# 3. Join on city
joint_df = airbnb_df \
            .join( \
                  citypop_df, \
                  (F.lower(airbnb_df.city) == F.lower(citypop_df.accent_city)), \
                  &quot;left_outer&quot; \
                 ) \
            .select( \
                    &apos;room_id&apos;, &apos;airbnb_df.country&apos;, &apos;airbnb_df.city&apos;, \
                    &apos;room_type&apos;, &apos;bedrooms&apos;, &apos;bathrooms&apos;, &apos;price&apos;, &apos;reviews&apos;, \
                    &apos;overall_satisfaction&apos;, \
                    &apos;airbnb_df.latitude&apos;, &apos;airbnb_df.longitude&apos;, \
                    &apos;citypop_df.latitude&apos;, &apos;citypop_df.longitude&apos;, &apos;population&apos;, \
                    &apos;region&apos; \
                   )

# Collect result to the driver
data_list = joint_df.collect()

print(&quot;Printing first 10 results&quot;)
for x in data_list[0:10]:
    print(x)

# Print count
print(&quot;Computed %s positions (from collected list)&quot; % len(data_list))
&lt;/pre&gt;
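&lt;p&gt;
The parallel launch itself (performed by the test script) can be sketched in Python as follows. This is only an illustration of the pattern: the &lt;code&gt;spark-submit&lt;/code&gt; invocation shown in the comment is an assumption, the real command line lives in the test script linked above, and stand-in commands are used so the sketch runs without a cluster.
&lt;/p&gt;

```python
import subprocess
import sys

def run_parallel(commands):
    # Launch every command concurrently, then wait for all of them.
    # Returns the exit codes in the same order as the commands.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]

# In the actual test, each command would be a spark-submit invocation of the
# scenario script, e.g. ["spark-submit", "5_concurrency_1_swissdata_df.py"]
# (hypothetical name and flags). Stand-in commands keep this runnable here:
commands = [[sys.executable, "-c", "print('job done')"] for _ in range(4)]
exit_codes = run_parallel(commands)
print(exit_codes)
```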

&lt;p&gt;
&lt;b&gt;Results&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;The various logs&lt;/b&gt;: 
&lt;ul&gt;
&lt;li&gt;Logs of the script : &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/5_concurrency_1_swissdata_df.log&quot;&gt;5_concurrency_1_swissdata_df.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Process P1 logs : &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/log_5_concurrency_1_swissdata_1.log&quot;&gt;log_5_concurrency_1_swissdata_1.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Process P2 logs : &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/log_5_concurrency_1_swissdata_2.log&quot;&gt;log_5_concurrency_1_swissdata_2.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Process P3 logs : &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/log_5_concurrency_1_swissdata_3.log&quot;&gt;log_5_concurrency_1_swissdata_3.log&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Process P4 logs : &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_logs/log_5_concurrency_1_swissdata_4.log&quot;&gt;log_5_concurrency_1_swissdata_4.log&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;&lt;b&gt;Screenshots from the various admin consoles after the test execution&lt;/b&gt;:
&lt;/ul&gt;

&lt;p&gt;
First, the Mesos console showing the completion of the 4 jobs:
&lt;/p&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P1 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P2 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_3.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_3.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P3 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_4.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_4.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P4 / Job Completion in mesos&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
Overview of the 4 processes in the Spark console, with the specific view of each process:
&lt;/p&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P1 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P2 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P3 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P4 / Process Overview on Spark&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
Focusing on Job 1 (P1), each and every relevant view from the Spark Application UI:
&lt;/p&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P1 / Job 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P1 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P1 / Job 0 / Stage 1&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P1 / Job 0 / Stage 2&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
Focusing on Job 2 (P2), each and every relevant view from the Spark Application UI:
&lt;/p&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P2 / Job 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P2 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P2 / Job 0 / Stage 1&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P2 / Job 0 / Stage 2&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
Focusing on Job 3 (P3), each and every relevant view from the Spark Application UI:
&lt;/p&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P3 / Job 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P3 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P3 / Job 0 / Stage 1&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P3 / Job 0 / Stage 2&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
Focusing on Job 4 (P4), each and every relevant view from the Spark Application UI:
&lt;/p&gt;

&lt;div class=&quot;container-wrapper&quot;&gt;
&lt;div class=&quot;container&quot;&gt;
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P4 / Job 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_0.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_0.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P4 / Job 0 / Stage 0&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_1.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_1.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P4 / Job 0 / Stage 1&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt; 
  &lt;figure&gt;
    &lt;a style=&quot;border: 0px none;&quot; href=&quot;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_2.png&quot;&gt;
        &lt;div class=&quot;container_inner&quot;&gt;
            &lt;img src=&apos;https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_2.png&apos; /&gt;
        &lt;/div&gt;
    &lt;/a&gt;
    &lt;figcaption&gt;&lt;b&gt;Test 5-1 / P4 / Job 0 / Stage 2&lt;/b&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;   
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;b&gt;Conclusions&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Before everything else, let&apos;s mention that this test has been executed, first, using the FIFO scheduler (&lt;code&gt;spark.scheduler.mode=FIFO&lt;/code&gt;) and, second, using the &lt;b&gt;Dynamic allocation system&lt;/b&gt;.
&lt;/li&gt;
&lt;li&gt;
Dynamic allocation seems to work a little slower than static allocation in this case.
&lt;/li&gt;
&lt;li&gt;
With static allocation (the only option with ES-Hadoop 5.x), what happens is that the first job, prepared by the driver a tiny bit before the 3 others, gets the whole cluster. Only when that first job is done do the three next ones get an even share of the cluster, i.e. one node each, and they complete almost at the same time.
&lt;/li&gt;
&lt;li&gt;
With dynamic allocation, the cluster is well shared among jobs. Once in a while a job may get an additional executor and another job will need to wait, but all in all the 4 jobs really run together on the three nodes.
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;
In terms of concurrency, looking at the CPU consumption on the host machine in the following image, we can see that the cluster is used quite effectively:
&lt;/p&gt;&lt;br&gt;
&lt;div class=&quot;centering&quot;&gt;
    &lt;a href=&quot;https://www.niceideas.ch/es_spark/images/concurrency_performance.png&quot;&gt;
        &lt;img class=&quot;centered&quot; style=&quot;width: 800px; &quot;  alt=&quot;Concurrency test - view of CPU&quot; src=&apos;https://www.niceideas.ch/es_spark/images/concurrency_performance.png&apos; /&gt;
    &lt;/a&gt;
    &lt;span class=&quot;centered&quot;&gt;
    (Note : each and every of the 3 VMs can use up to 2 CPUS of the host which has 4 CPUs in total)
    &lt;/span&gt;
    &lt;br&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
Also, all my tests, including this one, have been executed using &lt;b&gt;Coarse Grained Scheduling Mode&lt;/b&gt; (&lt;code&gt;spark.mesos.coarse=true&lt;/code&gt;).
&lt;ul&gt;
&lt;li&gt;
One might think that using Fine Grained Mode, things would be more efficient since each and every task would be distributed on the cluster at will and we wouldn&apos;t end up in the &lt;i&gt;static topology&lt;/i&gt; described above.
&lt;/li&gt;
&lt;li&gt;
But unfortunately, the latency of Mesos when it comes to negotiating resources really messes up performance. The dynamic dispatching of tasks works well, but the overall process performance suffers from the time Mesos requires for negotiation.
&lt;br&gt;
In the end, Fine Grained Scheduling mode kills the performance of the whole cluster.
&lt;/li&gt;
&lt;li&gt;
I have executed this very same test using &lt;code&gt;spark.mesos.coarse=false&lt;/code&gt;; the drop in cluster usage efficiency can be seen by looking at the &lt;a href=&quot;https://www.niceideas.ch/es_spark/images/concurrency_performance_FINE-GRAINED.png&quot;&gt;CPU consumption on the host machine for test 5-1 using Fine Grained Mode&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Regarding data locality, since the 3 last processes each get a single node of the cluster, only one third of the tasks execute with locality level &lt;code&gt;NODE_LOCAL&lt;/code&gt;. The other two thirds need to fetch data over the network.
&lt;/li&gt;
&lt;/ul&gt;
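&lt;p&gt;
For reference, the scheduling-related settings discussed above can be set on the &lt;code&gt;SparkConf&lt;/code&gt; as sketched below. This is a minimal illustration using the standard Spark configuration keys; the exact values used by the test scripts may differ, and dynamic allocation on Mesos additionally requires the external shuffle service to run on every agent node.
&lt;/p&gt;

```python
from pyspark import SparkConf

conf = SparkConf()

# Coarse-Grained Mesos scheduling mode (the mode used throughout these tests)
conf.set("spark.mesos.coarse", "true")

# FIFO scheduling of jobs within the application (the Spark default)
conf.set("spark.scheduler.mode", "FIFO")

# Dynamic allocation lets Spark grow and shrink the set of executors at
# runtime; it relies on the external shuffle service running on the agents
conf.set("spark.dynamicAllocation.enabled", "true")
conf.set("spark.shuffle.service.enabled", "true")
```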

&lt;a name=&quot;sec6&quot;&gt;&lt;/a&gt;
&lt;h2&gt;6. References&lt;/h2&gt;

&lt;p&gt;
&lt;b&gt;Spark and mesos&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href=&quot;https://spark.apache.org/docs/latest/running-on-mesos.html&quot;&gt;https://spark.apache.org/docs/latest/running-on-mesos.html (specific spark mesos configuration)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;ES Hadoop doc&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Spark : &lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html&quot;&gt;https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
Configuration : &lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html#_querying&quot;&gt;https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html#_querying&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Pyspark.sql doc&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
2.2 : &lt;a href=&quot;https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html&quot;&gt;https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
1.6 : &lt;a href=&quot;https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html&quot;&gt;https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Spark Doc&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Configuration : &lt;a href=&quot;https://spark.apache.org/docs/latest/configuration.html&quot;&gt;https://spark.apache.org/docs/latest/configuration.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
Spark history server : &lt;a href=&quot;https://spark.apache.org/docs/latest/monitoring.html&quot;&gt;https://spark.apache.org/docs/latest/monitoring.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
Dynamic resource allocation : &lt;a href=&quot;https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation&quot;&gt;https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Other Pyspark specificities&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Pyspark RDD API : &lt;a href=&quot;https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD&quot;&gt;https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
Pyspark performance : &lt;a href=&quot;https://fr.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark&quot;&gt;https://fr.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


</description>          </item>
    <item>
    <guid isPermaLink="true">https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana</guid>
    <title>ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part I : setup the cluster</title>
    <dc:creator>Jerome Kehrli</dc:creator>
    <link>https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana</link>
        <pubDate>Wed, 23 Aug 2017 17:29:12 -0400</pubDate>
    <category>Big Data</category>
    <category>big-data</category>
    <category>elasticsearch</category>
    <category>elk</category>
    <category>elk-ms</category>
    <category>kibana</category>
    <category>logstash</category>
    <category>mesos</category>
    <category>spark</category>
    <atom:summary type="html">&lt;p&gt;
In my current company, we implement heavy Data Analytics algorithms and use cases for our customers. Historically, these heavy computations took a whole lot of different forms, mostly custom computation scripts in Python using RDBMS databases to store data and results.
&lt;br&gt;
A few years ago, we started to hit the limits of what we were able to achieve using traditional architectures and had to move both our storage and processing layers to NoSQL / Big Data technologies.
&lt;/p&gt;
&lt;p&gt;
We considered a whole lot of different approaches, but eventually, and contrary to what I expected first, we didn&apos;t settle for a standard Hadoop stack. We are using ElasticSearch as key storage backend and Apache Spark as processing backend.
&lt;br&gt;
Now of course we were initially still considering a Hadoop stack for the single purpose of using YARN as resource management layer for Spark ... until we discovered Apache Mesos.
&lt;/p&gt;
&lt;p&gt;
Today this state-of-the-art ELK-MS - for ElasticSearch/Logstash/Kibana - Mesos/Spark stack performs amazingly and I believe it to be a really &lt;b&gt;lightweight, efficient, low-latency and high-performing&lt;/b&gt; alternative to a plain old Hadoop Stack.
&lt;br&gt;
I am writing a series of three articles to present this stack and why it&apos;s cool.
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/technical_archi_simple.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 500px;&quot; alt=&quot;ELK-MS Simple Technical Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/technical_archi_simple.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;This first article - ELK-MS - part I : setup the cluster&lt;/a&gt; in this series presents the ELK-MS stack and how to set up a test cluster using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;niceideas ELK-MS package&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;The second article - ELK-MS - part II : assessing behaviour&lt;/a&gt; presents a few concerns, assesses the expected behaviour using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;niceideas ELK-MS TEST package&lt;/a&gt; and discusses challenges and constraints in this ELK-MS environment.
&lt;/p&gt;
&lt;p&gt;
The conclusions of this series of articles are presented in &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2&quot;&gt;the third and last article - ELK-MS - part III : so why is it cool?&lt;/a&gt; which presents, as the name suggests, why this ELK-MS stack is really cool and works great.
&lt;/p&gt;
&lt;p&gt;
This article assumes a basic understanding of Hadoop and Big Data / NoSQL technologies in general by the reader.
&lt;/p&gt;</atom:summary>        <description>&lt;!-- ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part I : setup the cluster --&gt;

&lt;p&gt;&lt;i&gt;
&lt;b&gt;Edited 2017-10-30&lt;/b&gt;: I was using ES 5.0.0 with Spark 2.2.0 at the time of writing the initial version of this article. 
&lt;br&gt;
With ElasticSearch 6.x, and ES-Hadoop 6.x, the game changes a little. The Spark 2.2.0 Dynamic allocation system is now perfectly compatible with the way ES-Hadoop 6.x enforces data locality optimization and everything works just as expected.
&lt;/i&gt;
&lt;/p&gt;

&lt;p&gt;
In my current company, we implement heavy Data Analytics algorithms and use cases for our customers. Historically, these heavy computations took a whole lot of different forms, mostly custom computation scripts in Python or other languages, using RDBMS databases to store data and results.
&lt;br&gt;
A few years ago, we started to hit the limits of what we were able to achieve using traditional architectures and had to move both our storage and processing layers to NoSQL / Big Data technologies.
&lt;/p&gt;
&lt;p&gt;
We considered a whole lot of different approaches but eventually, and contrary to what I first expected, we didn&apos;t settle for a standard Hadoop stack. We are using ElasticSearch as our key storage backend and Apache Spark as our processing backend.
&lt;br&gt;
Now of course, we were initially still considering a Hadoop stack for the single purpose of using YARN as the resource management layer for Spark ... until we discovered Apache Mesos.
&lt;/p&gt;
&lt;p&gt;
Today this state-of-the-art ELK-MS - for ElasticSearch/Logstash/Kibana - Mesos/Spark - stack performs amazingly, and I believe it to be a really &lt;b&gt;lightweight, efficient, low-latency and performing&lt;/b&gt; alternative to a plain old Hadoop stack.
&lt;br&gt;
I am writing a series of articles to present this stack and why it&apos;s cool.
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/technical_archi_simple.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 500px;&quot; alt=&quot;ELK-MS Simple Technical Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/technical_archi_simple.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;This first article - ELK-MS - part I : setup the cluster&lt;/a&gt; in this series presents the ELK-MS stack and how to set up a test cluster using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;niceideas ELK-MS package&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;The second article - ELK-MS - part II : assessing behaviour&lt;/a&gt; presents a few concerns, assesses the expected behaviour using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;niceideas ELK-MS TEST package&lt;/a&gt; and discusses challenges and constraints in this ELK-MS environment.
&lt;/p&gt;
&lt;p&gt;
The conclusions of this series of articles are presented in &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2&quot;&gt;the third and last article - ELK-MS - part III : so why is it cool?&lt;/a&gt; which presents, as the name suggests, why this ELK-MS stack is really, really cool and works great.
&lt;/p&gt;
&lt;p&gt;
This article assumes the reader has a basic understanding of Hadoop and Big Data / NoSQL technologies in general.
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Summary&lt;/b&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec1&quot;&gt;1. Introduction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec11&quot;&gt;1.1 Rationale&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec12&quot;&gt;1.2 Purpose of this series of articles&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec2&quot;&gt;2. Target Architecture&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec21&quot;&gt;2.1 Technical Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec22&quot;&gt;2.2 Components&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec221&quot;&gt;2.2.1 ElasticSearch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec222&quot;&gt;2.2.2 Logstash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec223&quot;&gt;2.2.3 Kibana&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec224&quot;&gt;2.2.4 Cerebro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec225&quot;&gt;2.2.5 Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec226&quot;&gt;2.2.6 Mesos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec227&quot;&gt;2.2.7 Spark on Mesos specificities&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec23&quot;&gt;2.3 Making it work together : ES-Hadoop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec24&quot;&gt;2.4 Application Architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec3&quot;&gt;3. niceideas ELK-MS&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec31&quot;&gt;3.1 System and principles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec32&quot;&gt;3.2 The build system&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec321&quot;&gt;3.2.1 Required Tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec322&quot;&gt;3.2.2 Build System Project Layout&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec33&quot;&gt;3.3 Calling the build system and results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec34&quot;&gt;3.4 Testing the System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec35&quot;&gt;3.5 Tips &amp; Tricks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec4&quot;&gt;4. Noteworthy configuration elements&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#sec41&quot;&gt;4.1 NTP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec42&quot;&gt;4.2 Zookeeper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec43&quot;&gt;4.3 Elasticsearch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec44&quot;&gt;4.4 Logstash, Kibana, Cerebro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec45&quot;&gt;4.5 Mesos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec46&quot;&gt;4.6 Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sec47&quot;&gt;4.7 ES-Hadoop&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;&lt;a href=&quot;#sec5&quot;&gt;5. Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;a name=&quot;sec1&quot;&gt;&lt;/a&gt;
&lt;h2&gt;1. Introduction &lt;/h2&gt;

&lt;p&gt;
Actually deploying a whole Hadoop stack is, let&apos;s say, at least &lt;i&gt;heavy&lt;/i&gt;. Having HDFS, YARN, the Map Reduce framework and maybe Tez up and running is one thing, and it&apos;s maybe not that complicated, sure. 
&lt;/p&gt;
&lt;p&gt;
But with such a &lt;i&gt;vanilla&lt;/i&gt; stack you&apos;re not going very far. You&apos;ll at least add the following minimal set of software components: &lt;a href=&quot;http://sqoop.apache.org/&quot;&gt;Apache Sqoop&lt;/a&gt; for importing data into your HDFS cluster, &lt;a href=&quot;https://pig.apache.org/&quot;&gt;Apache Pig&lt;/a&gt; for processing this data and &lt;a href=&quot;https://hive.apache.org/&quot;&gt;Apache Hive&lt;/a&gt; for querying it. But then, since Hive is so slow for small queries returning small datasets, you&apos;ll likely add Stinger ... and then a whole lot of other components.
&lt;br&gt;
Now setting all of these software components up and running, and tuning them well, is a real hassle, so one might consider a HortonWorks or Cloudera distribution instead, and this is where it gets really heavy.
&lt;br&gt;
Don&apos;t get me wrong, both HortonWorks and Cloudera are doing an amazing job and their distributions are awesome.
&lt;br&gt;
But I am working in a context where we want something lighter, something more efficient, something easier to set up, master and monitor.
&lt;/p&gt;
&lt;p&gt;
In addition, HDFS is great. But it&apos;s really only about distributed storage of data. Vanilla Hadoop doesn&apos;t really provide anything on top of this data aside from MapReduce. On the other hand, the NoSQL landscape is filled with plenty of solutions achieving the same resilience and performance as HDFS while providing advanced data querying features on top of this data.
&lt;br&gt;
Among all these solutions, &lt;a href=&quot;https://www.elastic.co/products/elasticsearch&quot;&gt;ElasticSearch&lt;/a&gt; is the &lt;i&gt;one stop shop&lt;/i&gt; for our use cases. It fulfills 100% of our requirements and provides us out of the box with all the querying features we require (and some striking advanced features).
&lt;br&gt;
Using ElasticSearch for our data storage needs, we have no use whatsoever for HDFS. 
&lt;br&gt;
In addition, ElasticSearch comes out of the box with a pretty awesome replacement of Sqoop: &lt;a href=&quot;https://www.elastic.co/products/logstash&quot;&gt;Logstash&lt;/a&gt; and a brilliant Data Visualization tool that has no free alternative in the Hadoop world: &lt;a href=&quot;https://www.elastic.co/products/kibana&quot;&gt;Kibana&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
Now regarding Data Processing, here as well we found our &lt;i&gt;one stop shop&lt;/i&gt; in the form of &lt;a href=&quot;https://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt;. Spark is a (very) fast and general engine for large-scale data processing. Among all our processing needs, there is not one single use case we cannot map easily and naturally to Spark&apos;s API, either using low-level RDDs or using the DataFrame API (SparkSQL).
&lt;/p&gt;
&lt;p&gt;
Now Spark requires an external scheduler and resource manager. It can of course run without one, but then fails to achieve proper concurrency. 
&lt;br&gt;
We were seriously considering deploying Hadoop and YARN for this until we discovered &lt;a href=&quot;http://mesos.apache.org/&quot;&gt;Apache Mesos&lt;/a&gt;. Mesos is a distributed systems kernel built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides Spark with API’s for resource management and scheduling across entire Data-center and cloud environments.
&lt;/p&gt;


&lt;a name=&quot;sec11&quot;&gt;&lt;/a&gt;
&lt;h3&gt;1.1 Rationale&lt;/h3&gt;

&lt;p&gt;
I call the software stack formed by the above components the ELK-MS stack, for &lt;i&gt;ElasticSearch/LogStash/Kibana - Mesos/Spark&lt;/i&gt;. 
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;The ELK-MS stack is a simple, lightweight, efficient, low-latency and performing alternative to the Hadoop stack providing state of the art Data Analytics features&lt;/b&gt;:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;lightweight&lt;/b&gt; : ELK-MS is lightweight both in terms of &lt;i&gt;setup&lt;/i&gt; and &lt;i&gt;runtime&lt;/i&gt;. 
&lt;br&gt;
In terms of setup, the distributed storage engine, ElasticSearch, the resource manager, Mesos, and the distributed processing engine, Spark, are amazingly easy to set up and configure. They really work almost out of the box, and only very few configuration properties have to be set. When it comes to configuring resources in Mesos, honestly, trying to optimize anything beyond the default values really tends to worsen things. 
&lt;br&gt;
In terms of runtime, ElasticSearch, Mesos and some components of Spark (the only long-running daemons) have a very low memory footprint under low workload. Now of course, both ElasticSearch and Spark have pretty heavy memory needs when actually working.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;efficient&lt;/b&gt; : ElasticSearch, contrary to HDFS, is not just a simple distributed storage engine. ElasticSearch is also a real-time querying engine. It provides pretty advanced features such as aggregations and, up to a certain level, even distributed processing (scripted fields and the like). With ELK-MS, the storage layer itself provides basic data analytics features.
&lt;br&gt;
In addition, Spark supports through the RDD API most if not all of what we can achieve using low-level MapReduce. It obviously also supports plain old MapReduce. But the really striking feature of Spark is the DataFrame API and SparkSQL.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;low-latency&lt;/b&gt; : Spark is by design much faster than Hadoop. In addition, jobs on Spark can be implemented in such a way that processing time and job initialization time are much shorter than on Hadoop MapReduce (Tez makes things more even on Hadoop, though). &lt;br&gt;
But there again Spark has a joker: the Spark Streaming extension.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;performing&lt;/b&gt; : in addition to the above, both ElasticSearch and Spark share a common gene, not necessarily widely spread in the NoSQL landscape: the capacity to benefit as much from a big cluster with thousands of nodes as from a big machine with hundreds of processors. 
&lt;br&gt;
Spark and ElasticSearch are very good on a large cluster of small machines (and to be honest, scaling out is really the preferred way to achieve optimal performance with both).
&lt;br&gt;
&lt;b&gt;But contrary to Hadoop, both Spark and ElasticSearch also work pretty well on a single fat machine with hundreds of processors&lt;/b&gt;, benefitting fully from the multi-processor architecture of one single machine.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1#sec3&quot;&gt;conclusions of the behaviour assessment tests&lt;/a&gt;, at the end of the second article, as well as the &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2&quot;&gt;conclusion of this series of articles&lt;/a&gt;, give some more leads on why the ELK-MS stack is cool.
&lt;/p&gt;
&lt;p&gt;
For these reasons, we are extensively using the ELK-MS stack for our Data Analytics needs in my current company.
&lt;/p&gt;


&lt;a name=&quot;sec12&quot;&gt;&lt;/a&gt;
&lt;h3&gt;1.2 Purpose of this series of articles&lt;/h3&gt;

&lt;p&gt;
Setting up the ELK-MS stack in a nominal working mode is easy, but still requires a few steps. In addition, when assessing the stack and for testing purposes, I needed a way to set up a cluster and test key features such as the optimization of data-locality between ElasticSearch and Spark.
&lt;/p&gt;
&lt;p&gt;
I have written a set of scripts taking care of the nominal setup and a test framework based on &lt;a href=&quot;https://www.vagrantup.com/&quot;&gt;Vagrant&lt;/a&gt; and &lt;a href=&quot;https://www.virtualbox.org/&quot;&gt;VirtualBox&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana&quot;&gt;This first article - ELK-MS - part I : setup the cluster&lt;/a&gt; in this series presents the ELK-MS stack and how to set up a test cluster using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;niceideas ELK-MS package&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;The second article - ELK-MS - part II : assessing behaviour&lt;/a&gt; presents a few concerns, assesses the expected behaviour using the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz&quot;&gt;niceideas ELK-MS TEST package&lt;/a&gt; and discusses challenges and constraints in this ELK-MS environment.
&lt;/p&gt;
&lt;p&gt;
The conclusions of this series of articles are presented in &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2&quot;&gt;the third and last article - ELK-MS - part III : so why is it cool?&lt;/a&gt; which presents, as the name suggests, why this ELK-MS stack is really, really cool and works great.
&lt;/p&gt;



&lt;a name=&quot;sec2&quot;&gt;&lt;/a&gt;
&lt;h2&gt;2. Target Architecture &lt;/h2&gt;

&lt;p&gt;
Before presenting the components and some noteworthy configuration aspects, let&apos;s dig into the architecture of the ELK-MS stack.
&lt;/p&gt;

&lt;a name=&quot;sec21&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.1 Technical Architecture&lt;/h3&gt;

&lt;p&gt;
The &lt;b&gt;technical architecture&lt;/b&gt; of the ELK-MS stack is as follows:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/technical_archi_master.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px; &quot; alt=&quot;ELK-MS Technical Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/technical_archi_master.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
The components in grey are provided out of the box at OS level by the &lt;b&gt;Debian Stretch&lt;/b&gt; distribution. 
&lt;br&gt;
The components in yellow are provided by Elastic in the ELK Stack.
&lt;br&gt;
Mesos is in light red.
&lt;br&gt;
The components in blue are from the Spark Framework.
&lt;/p&gt;
&lt;p&gt;
Let&apos;s present all these components.
&lt;/p&gt;

&lt;a name=&quot;sec22&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.2 Components&lt;/h3&gt;

&lt;p&gt;
This section presents the most essential components of the ELK-MS stack.
&lt;/p&gt;


&lt;a name=&quot;sec221&quot;&gt;&lt;/a&gt;
&lt;h4&gt;2.2.1 ElasticSearch&lt;/h4&gt;


&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;ElasticSearch Logo&quot; src=&quot;https://www.niceideas.ch/es_spark/images/elasticsearch_logo.png&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
From &lt;a href=&quot;https://www.elastic.co/products/elasticsearch&quot;&gt;ElasticSearch&apos;s web site&lt;/a&gt; : &quot;&lt;i&gt;ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;p&gt;
ElasticSearch is a NoSQL document-oriented database benefitting from the NoSQL genes: data distribution by sharding (partitioning) and replication. It can run on all kinds of hardware, from a big fat hundred-CPU machine to a multi-datacenter cluster of commodity hardware.
&lt;br&gt;
The native document storage format is JSON.
&lt;/p&gt;
&lt;p&gt;
ElasticSearch supports real-time querying of data and advanced analytics features such as aggregations, scripted fields, advanced memory management models and even some support for MapReduce directly in ElasticSearch&apos;s engine.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
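&lt;p&gt;
To illustrate these querying features, here is a minimal sketch of an ElasticSearch aggregation request body built in Python. The field names (&lt;i&gt;level&lt;/i&gt;, &lt;i&gt;@timestamp&lt;/i&gt;) and the bucketing are hypothetical examples, not tied to any specific mapping; the resulting JSON would be POSTed to an index&apos;s &lt;i&gt;_search&lt;/i&gt; endpoint:
&lt;/p&gt;

```python
import json

def build_daily_error_aggregation(field="level", value="ERROR"):
    # Aggregation request body: count documents matching a term filter,
    # bucketed per day on a timestamp field. Field names here are
    # illustrative placeholders to adapt to your own index mapping.
    return {
        "size": 0,  # we only want the aggregation buckets, no hits
        "query": {"term": {field: value}},
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "@timestamp", "interval": "day"}
            }
        },
    }

print(json.dumps(build_daily_error_aggregation(), indent=2))
```

&lt;p&gt;
The same kind of request body is what Kibana generates under the hood when rendering a date histogram.
&lt;/p&gt;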

&lt;a name=&quot;sec222&quot;&gt;&lt;/a&gt;
&lt;h4&gt;2.2.2 Logstash&lt;/h4&gt;


&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Logstash Logo&quot; src=&quot;https://www.niceideas.ch/es_spark/images/logstash_logo.png&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
From &lt;a href=&quot;https://www.elastic.co/products/logstash&quot;&gt;Logstash&apos;s web site&lt;/a&gt; : &quot;&lt;i&gt;Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite &quot;stash.&quot; (Ours is Elasticsearch, naturally.).&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;p&gt;
Logstash is really the equivalent of Sqoop in the Elastic world. It&apos;s a largely configurable data processing engine whose primary intent is to feed ElasticSearch with data that can come from pretty much every imaginable data source and format. Of course Logstash can also output data to a very extensive set of sinks in addition to ElasticSearch.
&lt;br&gt;
It&apos;s easily extensible through plugins, which are straightforward to build should the 200 provided plugins not be sufficient.
&lt;/p&gt;
&lt;p&gt;
Logstash can also be distributed just like ElasticSearch, making it possible not only to scale out the data ingestion processing but also to apply smart co-location strategies with ElasticSearch.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
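&lt;p&gt;
To give an idea of what such a pipeline looks like, here is a minimal, hypothetical Logstash configuration; the file path, host and index name are placeholders to adapt to one&apos;s own setup. It tails a web server access log, parses each line with a standard grok pattern and feeds the result to ElasticSearch in daily indices:
&lt;/p&gt;

```conf
input {
  file {
    path => "/var/log/myapp/access.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    # Parse standard combined Apache/nginx access log lines
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myapp-%{+YYYY.MM.dd}"
  }
}
```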

&lt;a name=&quot;sec223&quot;&gt;&lt;/a&gt;
&lt;h4&gt;2.2.3 Kibana&lt;/h4&gt;

&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Kibana Logo&quot; src=&quot;https://www.niceideas.ch/es_spark/images/kibana_logo.png&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
From &lt;a href=&quot;https://www.elastic.co/products/kibana&quot;&gt;Kibana&apos;s web site&lt;/a&gt; : &quot;&lt;i&gt;Kibana lets you visualize your ElasticSearch data and navigate the Elastic Stack, so you can do anything from learning why you&apos;re getting paged at 2:00 a.m. to understanding the impact rain might have on your quarterly numbers.&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;p&gt;
Kibana core ships with the classics: histograms, line graphs, pie charts, sunbursts, and more. They leverage the full aggregation capabilities of ElasticSearch.
&lt;br&gt;
Kibana, too, is easily extensible, and integrating any kind of native D3.js visualization is usually a matter of a few hours of coding.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;In the context of ELK-MS&lt;/b&gt;, Kibana is an amazing addition to ElasticSearch, since we can write Spark programs that work with data from ES but also store their results in ES. As such, Kibana can be used out of the box to visualize not only the input data but also the results of the Spark scripts.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;a name=&quot;sec224&quot;&gt;&lt;/a&gt;
&lt;h4&gt;2.2.4 Cerebro&lt;/h4&gt;

&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Cerebro Logo&quot; src=&quot;https://www.niceideas.ch/es_spark/images/cerebro_logo.png&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
From &lt;a href=&quot;https://github.com/lmenezes/cerebro&quot;&gt;Cerebro&apos;s web site&lt;/a&gt; : &quot;&lt;i&gt;Cerebro is an open source (MIT License) ElasticSearch web admin tool built using Scala, Play Framework, AngularJS and Bootstrap.&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;p&gt;
Cerebro is the one-stop shop for monitoring and administering ElasticSearch: small and simple, yet efficient.
&lt;/p&gt;
&lt;p&gt;
Cerebro is a must-have with ElasticSearch, since working only with the REST API to understand ElasticSearch&apos;s topology and perform the most trivial administration tasks (such as defining mapping templates, etc.) is a real hassle.
&lt;br&gt;
Cerebro is far from perfect but really does the job.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;a name=&quot;sec225&quot;&gt;&lt;/a&gt;
&lt;h4&gt;2.2.5 Spark&lt;/h4&gt;

&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Spark Logo&quot; src=&quot;https://www.niceideas.ch/es_spark/images/spark_logo.png&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
From &lt;a href=&quot;https://spark.apache.org/&quot;&gt;Spark&apos;s web site&lt;/a&gt; : &quot;&lt;i&gt;Apache Spark is a fast and general engine for large-scale data processing.&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;p&gt;
From &lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Spark&quot;&gt;Wikipedia&apos;s Spark article&lt;/a&gt;: &quot;&lt;i&gt;Apache Spark provides programmers with an application programming 
interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, 
that is maintained in a fault-tolerant way.&lt;/i&gt;
&lt;br&gt;
&lt;i&gt;It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark&apos;s RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;The availability of RDDs facilitates the implementation of both iterative algorithms, that visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications (compared to a MapReduce implementation, as was common in Apache Hadoop stacks) may be reduced by several orders of magnitude.&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;a name=&quot;sec226&quot;&gt;&lt;/a&gt;
&lt;h4&gt;2.2.6 Mesos&lt;/h4&gt;

&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Mesos Logo&quot; src=&quot;https://www.niceideas.ch/es_spark/images/mesos_logo.png&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
From &lt;a href=&quot;http://mesos.apache.org/&quot;&gt;Mesos&apos; web site&lt;/a&gt; : &quot;&lt;i&gt;Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.&lt;/i&gt;
&lt;br&gt;
&lt;i&gt;Mesos is a distributed systems kernel, built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos
kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with API’s for
resource management and scheduling across entire Datacenter and cloud environments.&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;In the context of ELK-MS&lt;/b&gt;, and in a general way when considering running Spark in a production environment, Mesos is the way to go if one doesn&apos;t want to deploy a full Hadoop stack to support Spark. In the end, it appears that Mesos performs amazingly, both by having only a very small memory footprint on the cluster and by being incredibly easy to set up and administer.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;a name=&quot;sec227&quot;&gt;&lt;/a&gt;
&lt;h4&gt;2.2.7 Spark on Mesos specificities&lt;/h4&gt;

&lt;p&gt;
Happily, Spark and Mesos, both projects of the Apache Foundation, know about each other and are designed to work together.
&lt;br&gt;
There are some specificities though when it comes to running Spark on Mesos as opposed to the more usual YARN, as explained below.
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Spark Mesos Dispatcher&lt;/b&gt;
&lt;/p&gt;


&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;Spark Mesos Dispatcher Logo&quot; src=&quot;https://www.niceideas.ch/es_spark/images/spark-mesos-dispatcher_logo.png&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
Interestingly, Mesos handles Spark workers in a pretty amazing way. Not only does Mesos consider &lt;i&gt;node locality&lt;/i&gt; requirements between Spark and ElasticSearch, it also provides the required retry policies and the like.
&lt;/p&gt;
&lt;p&gt;
When launching a Spark job, there is nevertheless one &lt;i&gt;Single Point of Failure&lt;/i&gt; that remains: the Spark driver, which lives outside of the Mesos/Spark cluster, on the machine from which it is launched by the user or the driving process.
&lt;/p&gt;
&lt;p&gt;
For this reason, Spark provides the &lt;b&gt;Spark Mesos Dispatcher&lt;/b&gt;, which can be used to &lt;b&gt;dispatch the Spark Driver itself on the Mesos/Spark cluster&lt;/b&gt;.
&lt;br&gt;
Using the &lt;i&gt;Spark Mesos Dispatcher&lt;/i&gt;, the driver itself, just like the Spark processing, is balanced on the cluster to an available node and can be supervised (retried, monitored in terms of memory consumption, etc.).
&lt;/p&gt;
&lt;p&gt;
The &lt;i&gt;Spark Mesos Dispatcher&lt;/i&gt; addresses the single weakness of a Spark job, the driver, which can crash or exhaust resources, and handles it just like any other bit of Spark processing.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
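&lt;p&gt;
For the sake of illustration, submitting a job through the &lt;i&gt;Spark Mesos Dispatcher&lt;/i&gt; looks roughly as follows. The host name, class name and jar location are hypothetical placeholders; in cluster mode the jar must be reachable from the cluster nodes, for instance over HTTP:
&lt;/p&gt;

```shell
# Ask the Mesos dispatcher (default port 7077) to run the driver itself
# somewhere on the cluster, and to restart it on failure (--supervise).
spark-submit \
  --master mesos://mesos-master.example.com:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.MyAnalyticsJob \
  http://fileserver.example.com/jobs/my-analytics-job.jar
```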

&lt;p&gt;
&lt;b&gt;Spark History Server&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Contrary to Spark running in standalone mode, Spark running on Mesos has no long-lived backend that the user can interact with when Spark is not actually executing a job. 
&lt;br&gt;
Mesos takes care of creating and dispatching Spark workers when required. When no Spark job is being executed, there is no Spark process anywhere one can interact with to query, for instance, the results of a previous job.
&lt;/p&gt;
&lt;p&gt;
Happily, Spark provides an out-of-the-box solution for this: the &lt;b&gt;Spark History Server&lt;/b&gt;.
&lt;br&gt;
The Spark History Server is a lightweight process that presents the results stored in the Spark &lt;i&gt;Event Log folder&lt;/i&gt;, the folder on the filesystem where Spark stores consolidated results from the various workers. 
&lt;br&gt;
The Spark documentation is very unclear about this, but since only the Spark driver stores consolidated results in the &lt;i&gt;event log folder&lt;/i&gt;, if all drivers are launched on the same machine (for instance the Mesos master machine), there is no specific need for HDFS.
&lt;/p&gt;
&lt;p&gt;
One should note that running the Spark History Server without HDFS to store the event log can be a problem if one uses the &lt;i&gt;Spark Mesos Dispatcher&lt;/i&gt; to distribute the driver program itself on the Mesos cluster. In this case, using a common NFS share, for instance, solves the problem.
&lt;/p&gt;
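&lt;p&gt;
The corresponding setup boils down to a few properties in &lt;i&gt;spark-defaults.conf&lt;/i&gt;; the directory below is an example path (a local folder if all drivers run on one machine, or an NFS share mounted on all nodes in the dispatcher case):
&lt;/p&gt;

```conf
# Make drivers write their event logs to a shared location ...
spark.eventLog.enabled            true
spark.eventLog.dir                file:///var/lib/spark/eventlog

# ... and point the History Server at that same location.
spark.history.fs.logDirectory     file:///var/lib/spark/eventlog
```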

&lt;a name=&quot;sec23&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.3 Making it work together : ES-Hadoop&lt;/h3&gt;


&lt;table style=&quot;border: 0px none;&quot;&gt;
&lt;tr&gt;
&lt;td style=&quot;width: 100px; border: 0px none;&quot;&gt;
&lt;img style=&quot;width: 100px; min-width: 100px; border: 0px none;&quot; alt=&quot;ES-Hadoop Logo&quot; src=&quot;https://www.niceideas.ch/es_spark/images/es-hadoop_logo.png&quot; /&gt;
&lt;/td&gt;
&lt;td style=&quot;width: 100%; border: 0px none;&quot;&gt;
&lt;p&gt;
From &lt;a href=&quot;https://www.elastic.co/products/hadoop&quot;&gt;ES-Hadoop&apos;s web site&lt;/a&gt; : &quot;&lt;i&gt;Connect the massive data storage and deep processing power of Hadoop with the real-time search and analytics of Elasticsearch. The Elasticsearch-Hadoop (ES-Hadoop) connector lets you get quick insight from your big data and makes working in the Hadoop ecosystem even better.&lt;/i&gt;&quot;
&lt;/p&gt;
&lt;p&gt;
Initially, ES-Hadoop contains the set of classes implementing connectors for pretty much all &lt;i&gt;de facto &quot;standard&quot;&lt;/i&gt; components of a full Hadoop stack, such as Hive, Pig, Spark, etc.
&lt;br&gt;
Interestingly, as far as Spark is concerned, Spark can perfectly use ES-Hadoop to load data from or store data to ElasticSearch &lt;b&gt;outside of a Hadoop stack&lt;/b&gt;. In fact, the Spark connector from the ES-Hadoop library has no dependency on a Hadoop stack whatsoever.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;In the context of ELK-MS, the ES-Hadoop connector is one of the most important components.&lt;/b&gt; When one considers building a large collocated ES / Mesos / Spark cluster and executing tasks that fetch large datasets from ES to Spark, the &lt;b&gt;data-locality awareness&lt;/b&gt; supported by ES-Hadoop is of utmost importance. The &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;second article of this series&lt;/a&gt; is largely devoted to assessing how, and how far, the optimization of data-locality works.
&lt;/p&gt;
&lt;p&gt;
When launching a job using Spark, the connector determines the locations of the ElasticSearch shards it will be targeting and creates a partition per shard (or even more partitions, to allow for greater parallelism). Each of these partition definitions carries with it the index name, the shard id, the slice id and the addresses of the machines where this data can be found locally. It then relies on Spark&apos;s task scheduling to achieve data locality.
&lt;br&gt;
Spark stands up a task for each of the input partitions, and each reading task is pinned to a node hosting the corresponding shard. This simply means that the task will always try to read from that node first, but will target other nodes if that node fails the processing or fails to become available before the timeout.
&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
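&lt;p&gt;
To make the partition-per-shard mechanism described above concrete, here is a small, simplified Python sketch. This is &lt;b&gt;my own illustration, not the actual ES-Hadoop implementation&lt;/b&gt;: the data structures, function names and the index name &quot;logs&quot; are all made up for the example.
&lt;/p&gt;

```python
# Illustrative sketch (NOT the actual ES-Hadoop code): derive one input
# partition per Elasticsearch shard, carrying the hosts where the shard
# lives, then pin each reading task to a hosting node when possible.

def build_partitions(shard_table):
    """shard_table maps (index, shard_id) -> list of hosts holding a copy."""
    partitions = []
    for (index, shard_id), hosts in sorted(shard_table.items()):
        partitions.append({
            "index": index,
            "shard": shard_id,
            # preferred locations: the task scheduler tries these first
            "preferred_hosts": list(hosts),
        })
    return partitions

def schedule(partitions, available_hosts):
    """Pin each partition to a preferred host if available, else fall back."""
    assignment = {}
    for p in partitions:
        local = [h for h in p["preferred_hosts"] if h in available_hosts]
        assignment[(p["index"], p["shard"])] = local[0] if local else available_hosts[0]
    return assignment

shards = {
    ("logs", 0): ["mes_master"],
    ("logs", 1): ["mes_node1"],
    ("logs", 2): ["mes_node2"],
}
parts = build_partitions(shards)

# every shard is read on the node that hosts it
print(schedule(parts, ["mes_master", "mes_node1", "mes_node2"]))
# shard 2's host is unavailable: the task falls back to a non-local node
print(schedule(parts, ["mes_master", "mes_node1"]))
```

&lt;p&gt;
The real connector does the same kind of pinning through Spark&apos;s preferred-locations mechanism, which is what the data-locality tests of the second article exercise.
&lt;/p&gt;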


&lt;a name=&quot;sec24&quot;&gt;&lt;/a&gt;
&lt;h3&gt;2.4 Application Architecture&lt;/h3&gt;

&lt;p&gt;
Typical data flows on the ELK-MS platform are illustrated by the following &lt;b&gt;Application Architecture&lt;/b&gt; schema:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/application_architecture.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 600px;&quot; alt=&quot;ELK-MS Application Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/application_architecture.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
The tests presented in &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;&quot;the second article in this series: ELK-MS - part II : assessing behaviour&quot;&lt;/a&gt; are intended to assess the proper behaviour of this application architecture.
&lt;/p&gt;


&lt;a name=&quot;sec3&quot;&gt;&lt;/a&gt;
&lt;h2&gt;3. niceideas ELK-MS&lt;/h2&gt;

&lt;p&gt;
Again, as stated in the introduction, when playing with ES / Mesos / Spark, I quickly ran into two urgent needs:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
First, I needed a &lt;b&gt;reference for configuring the various software&lt;/b&gt; so that they work well together. Instead of writing pages of documentation indicating the settings to tune, I ended up putting all of that in setup scripts aimed at helping me re-apply the configuration at will.
&lt;/li&gt;
&lt;li&gt;
Second, I needed a &lt;b&gt;test cluster&lt;/b&gt; allowing me to assess how various key features were working, among which ensuring optimization of data-locality was one of the most important.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
In the end I wrote a set of scripts using Vagrant and VirtualBox making it possible to rebuild the test cluster and reapply the configuration at will. I packaged all these scripts together and call this package the &lt;b&gt;niceideas_ELK-MS&lt;/b&gt; package.
&lt;/p&gt;
&lt;p&gt;
This package is &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;available for download here&lt;/a&gt;.
&lt;/p&gt;


&lt;a name=&quot;sec31&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.1 System and principles&lt;/h3&gt;

&lt;p&gt;
The &lt;b&gt;System Architecture&lt;/b&gt; of the ELK-MS platform as built by the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package is as follows:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/system_archi.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 400px;&quot; alt=&quot;ELK-MS System Architecture&quot; src=&quot;https://www.niceideas.ch/es_spark/images/system_archi.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;ul&gt;
&lt;li&gt;
The master node, &lt;b&gt;MES Master&lt;/b&gt; (for Mesos/Elasticsearch/Spark) is called &lt;code&gt;mes_master&lt;/code&gt;. It contains the full stack of software including the management UIs. The Master node is also a data node.
&lt;/li&gt;
&lt;li&gt;
The two data nodes, &lt;b&gt;MES Slave X&lt;/b&gt; are called &lt;code&gt;mes_node1&lt;/code&gt; and &lt;code&gt;mes_node2&lt;/code&gt;. They only provide an ElasticSearch instance and a Mesos worker instance to drive Spark Executors.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Having several &lt;i&gt;Mesos Masters&lt;/i&gt; is not considered for now, but the technical stack is deployed with this possibility wide open, since Zookeeper is used to manage the Mesos masters.
&lt;/p&gt;


&lt;a name=&quot;sec32&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.2 The build system&lt;/h3&gt;

&lt;p&gt;
The remainder of this section is a description of the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package build system and a presentation of the expected results.
&lt;/p&gt;

&lt;a name=&quot;sec321&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.2.1 Required Tools&lt;/h4&gt;

&lt;p&gt;
The build system is really intended to work on Linux, but it would work on Windows as well, except that the vagrant commands need to be called directly.
&lt;/p&gt;
&lt;p&gt;
But before digging into this, the following tools need to be installed and properly working on the host machine where the ELK-MS test cluster has to be built:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;VirtualBox&lt;/b&gt;: is an x86 and AMD64/Intel64 virtualization solution.
&lt;br&gt;
The &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package will build a cluster of nodes taking the form of VMs running on the host machine (the user computer).
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Vagrant&lt;/b&gt;: is a tool for building and managing virtual machine environments in a single workflow.
&lt;br&gt;
The &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package uses Vagrant to build and manage the VMs without any user interaction required and to drive the provisioning scripts execution.
&lt;/li&gt;
&lt;li&gt;
The &lt;b&gt;vagrant-reload&lt;/b&gt; Vagrant plugin is required to reload the machines after the provisioning scripts apply changes requiring a VM reboot.
&lt;br&gt;
See &lt;a href=&quot;https://github.com/aidanns/vagrant-reload/blob/master/README.md&quot;&gt;https://github.com/aidanns/vagrant-reload/blob/master/README.md&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;


&lt;a name=&quot;sec322&quot;&gt;&lt;/a&gt;
&lt;h4&gt;3.2.2 Build System Project Layout&lt;/h4&gt;

&lt;p&gt;
The &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package structure, after being properly extracted in a local folder, is as follows:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/provisionning.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 300px;&quot; alt=&quot;ELK-MS Provisionning&quot; src=&quot;https://www.niceideas.ch/es_spark/images/provisionning.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;./setup.sh&lt;/code&gt;: basically takes care of everything by calling vagrant to build the 3 VMs
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;./vagrant/VagrantFile&lt;/code&gt;: vagrant definition file to define the 3 VMs and the provisioning scripts
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;./provisionning/*&lt;/code&gt;: the provisioning scripts. The entry point is &lt;code&gt;setup.sh&lt;/code&gt;, which calls each and every other script.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Rationale&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
In a DevOps world, there are better tools than shell scripts to proceed with VM or machine provisioning, such as Ansible, Chef, Puppet, etc.
&lt;br&gt;
But in my case, I want to be able to go on any VM or any machine and re-apply my configuration to Spark, Mesos, ElasticSearch or anything else by simply calling a shell script with a few arguments.
&lt;br&gt;
So even though there are more efficient alternatives, I kept shell scripts here for the sake of simplicity.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;Building the ELK-MS test cluster on Windows&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
With VirtualBox and Vagrant properly installed on Windows, nothing should prevent someone from building the cluster on Windows.
&lt;br&gt;
But in this case, of course, the root scripts &lt;code&gt;setup.sh&lt;/code&gt;, &lt;code&gt;start_cluster.sh&lt;/code&gt; and &lt;code&gt;stop_cluster.sh&lt;/code&gt; are not usable (unless resorting to Cygwin or MinGW).
&lt;/p&gt;
&lt;p&gt;
In this case, the user should call vagrant manually to build the 3 VMs &lt;code&gt;mes_master&lt;/code&gt;, &lt;code&gt;mes_node1&lt;/code&gt; and &lt;code&gt;mes_node2&lt;/code&gt; as follows:
&lt;/p&gt;

&lt;pre&gt;
c:\niceideas_ELK-MS\vagrant&gt; vagrant up mes_master
...
c:\niceideas_ELK-MS\vagrant&gt; vagrant up mes_node1
...
c:\niceideas_ELK-MS\vagrant&gt; vagrant up mes_node2
...
&lt;/pre&gt;


&lt;a name=&quot;sec33&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.3 Calling the build system and results&lt;/h3&gt;

&lt;p&gt;
Again, calling the build system to fully build the cluster, on Linux, is as simple as:
&lt;/p&gt;

&lt;pre&gt;
badtrash@badbook:/data/niceideas_ELK-MS/setup$ ./setup.sh 
&lt;/pre&gt;

&lt;p&gt;
A full dump of the result of the &lt;code&gt;setup.sh&lt;/code&gt; script is &lt;a href=&quot;http://www.niceideas.ch/es_spark/setuplog.html&quot;&gt;available here&lt;/a&gt;.
&lt;/p&gt;

&lt;a name=&quot;sec34&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.4 Testing the System&lt;/h3&gt;

&lt;p&gt;
After calling the &lt;code&gt;setup.sh&lt;/code&gt; script above, the 3 VMs are properly created, as one can check in VirtualBox:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/virtualbox_manager.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 500px;&quot; alt=&quot;The three VMs in VirtualBox Manager&quot; src=&quot;https://www.niceideas.ch/es_spark/images/virtualbox_manager.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
In addition, the 4 UI applications should be available at the following addresses (caution: the links below point to your own cluster, not to niceideas.ch):
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href=&quot;http://192.168.10.10:9000/#/overview?host=http:%2F%2Flocalhost:9200&quot;&gt;Cerebro on http://192.168.10.10:9000/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&quot;http://192.168.10.10:5050/&quot;&gt;Mesos Console on http://192.168.10.10:5050/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&quot;http://192.168.10.10:18080/&quot;&gt;Spark History Server on http://192.168.10.10:18080/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&quot;http://192.168.10.10:5601/&quot;&gt;Kibana on http://192.168.10.10:5601/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;b&gt;Cerebro:&lt;/b&gt; (&lt;a href=&quot;http://192.168.10.10:9000/#/overview?host=http:%2F%2Flocalhost:9200&quot;&gt;http://192.168.10.10:9000/&lt;/a&gt;)
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/cerebro_empty.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 750px; &quot; alt=&quot;Cerebro - ES management&quot; src=&quot;https://www.niceideas.ch/es_spark/images/cerebro_empty.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
(One can see the 3 nodes available)
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Mesos:&lt;/b&gt; (&lt;a href=&quot;http://192.168.10.10:5050/&quot;&gt;http://192.168.10.10:5050/&lt;/a&gt;)
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/mesos_agents.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 750px; &quot; alt=&quot;Mesos - the 3 agents&quot; src=&quot;https://www.niceideas.ch/es_spark/images/mesos_agents.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
(One can see the 3 nodes available)
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Spark History Server:&lt;/b&gt; (&lt;a href=&quot;http://192.168.10.10:18080/&quot;&gt;http://192.168.10.10:18080/&lt;/a&gt;)
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/history_server.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 750px; &quot; alt=&quot;Spark History Server&quot; src=&quot;https://www.niceideas.ch/es_spark/images/history_server.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;b&gt;Kibana:&lt;/b&gt; (&lt;a href=&quot;http://192.168.10.10:5601/&quot;&gt;http://192.168.10.10:5601/&lt;/a&gt;)
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;a href=&quot;https://www.niceideas.ch/es_spark/images/kibana_empty.png&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 750px; &quot; alt=&quot;Kibana&quot; src=&quot;https://www.niceideas.ch/es_spark/images/kibana_empty.png&quot; /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;a name=&quot;sec36&quot;&gt;&lt;/a&gt;
&lt;h3&gt;3.5 Tips &amp; Tricks&lt;/h3&gt;

&lt;p&gt;
&lt;b&gt;This closes the presentation of the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package. The remainder of this article gives some hints regarding the configuration of the different software components&lt;/b&gt;.
&lt;br&gt;
Readers interested in understanding what the build system of &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; presented above does, without the hassle of analyzing the setup scripts, can continue reading below.
&lt;br&gt;
Readers interested only in understanding the cluster layout and the concerns of the ES / Spark integration can move on to &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;the second article in this series: ELK-MS - part II : assessing behaviour&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Killing a stuck job&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Once in a while, for various reasons, a job gets stuck. In this case the easiest way to kill it is using the Spark Web console.
&lt;br&gt;
But wait, hold on, didn&apos;t I just say above that such a console is not available when running through Mesos?
&lt;br&gt;
Well actually, the Spark console is available as long as the Spark job is alive ... which happily is the case when a Spark job is stuck.
&lt;/p&gt;
&lt;p&gt;
So one can follow the link provided by Mesos to the Spark console and use the usual &lt;i&gt;kill&lt;/i&gt; link from there.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;Spark fine grained scheduling by Mesos&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
When reading about Mesos &lt;i&gt;fine grained&lt;/i&gt; scheduling of Spark jobs, one might think it makes sense to give it a try ... don&apos;t!
&lt;/p&gt;
&lt;p&gt;
Spark &lt;i&gt;fine grained&lt;/i&gt; scheduling by Mesos is really messed up.
&lt;br&gt;
One might believe that it helps concurrency and resource allocation, but it really doesn&apos;t. In practice, an amazing proportion of time is lost scheduling all the individual Spark tasks, and it often compromises co-location of data between ES and Spark.
&lt;/p&gt;
&lt;p&gt;
It is even deprecated in the latest Spark versions.
&lt;br&gt;
More information in this regard is available here: &lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-11857&quot;&gt;https://issues.apache.org/jira/browse/SPARK-11857&lt;/a&gt;.
&lt;/p&gt;

&lt;a name=&quot;sec4&quot;&gt;&lt;/a&gt;
&lt;h2&gt;4. Noteworthy configuration elements&lt;/h2&gt;

&lt;p&gt;
The sections below present the important configuration aspects taken care of by the provisioning scripts.
&lt;/p&gt;

&lt;a name=&quot;sec41&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.1 NTP&lt;/h3&gt;

&lt;p&gt;
Related scripts from the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupNTP.sh&quot;&gt;&lt;code&gt;setupNTP.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Just as with every big data or NoSQL cluster, having a shared common understanding of time in the cluster is key. So NTP needs to be properly set up.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;On master &lt;code&gt;mes_master&lt;/code&gt;&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Sample portion from &lt;code&gt;/etc/ntp.conf&lt;/code&gt;:
&lt;/p&gt;

&lt;pre&gt;
pool ntp1.hetzner.de iburst
pool ntp2.hetzner.com iburst
pool ntp3.hetzner.net iburst
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;On slaves &lt;code&gt;mes_node1&lt;/code&gt; and &lt;code&gt;mes_node2&lt;/code&gt;&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Sample portion from &lt;code&gt;/etc/ntp.conf&lt;/code&gt;:
&lt;/p&gt;

&lt;pre&gt;
server 192.168.10.10

#enabling mes_master to set time
restrict 192.168.10.10 mask 255.255.255.255 nomodify notrap nopeer noquery

#disable maximum offset of 1000 seconds
tinker panic 0
&lt;/pre&gt;
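&lt;p&gt;
The master/slave split above can be summarized in a few lines. The Python helper below is &lt;b&gt;my own illustration&lt;/b&gt; (not part of the package) rendering the two &lt;code&gt;ntp.conf&lt;/code&gt; variants from the node&apos;s role: the master syncs on external pools, while the slaves sync on the master only.
&lt;/p&gt;

```python
# Illustration only: produce the ntp.conf fragments shown above from a
# node role. Pool hostnames and the master IP come from the article.

MASTER_POOLS = ["ntp1.hetzner.de", "ntp2.hetzner.com", "ntp3.hetzner.net"]

def ntp_conf(role, master_ip="192.168.10.10"):
    if role == "master":
        # the master syncs against external NTP pools
        return "\n".join("pool %s iburst" % p for p in MASTER_POOLS)
    # slaves sync on the master only, allow it to set time,
    # and disable the default 1000-second panic offset
    return ("server %s\n"
            "restrict %s mask 255.255.255.255 nomodify notrap nopeer noquery\n"
            "tinker panic 0") % (master_ip, master_ip)

print(ntp_conf("master").splitlines()[0])  # prints: pool ntp1.hetzner.de iburst
print(ntp_conf("slave").splitlines()[0])   # prints: server 192.168.10.10
```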


&lt;a name=&quot;sec42&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.2 Zookeeper&lt;/h3&gt;

&lt;p&gt;
Related scripts from the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupZookeeper.sh&quot;&gt;&lt;code&gt;setupZookeeper.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Zookeeper is really only required when considering several Mesos masters, since in this case we need the &lt;i&gt;quorum&lt;/i&gt; feature of Zookeeper to proceed with proper election of the leading master and to track the masters&apos; state.
&lt;br&gt;
At the moment ELK-MS has only one Mesos master, but we make it production- and HA-ready by setting up and using Zookeeper from the start.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;On master &lt;code&gt;mes_master&lt;/code&gt;&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Sample portion from &lt;code&gt;/etc/zookeeper/conf/zoo.cfg&lt;/code&gt;:
&lt;/p&gt;

&lt;pre&gt;
server.1=192.168.10.10:2888:3888
&lt;/pre&gt;

&lt;p&gt;
In addition, we need to set the zookeeper node id in &lt;code&gt;/etc/zookeeper/conf/myid&lt;/code&gt;.
&lt;br&gt;
Let&apos;s just put the single character &quot;1&quot; in it for now.
&lt;/p&gt;
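&lt;p&gt;
To make the HA extension path concrete, here is a small Python sketch (&lt;b&gt;my own illustration&lt;/b&gt;, not part of the package) generating the &lt;code&gt;zoo.cfg&lt;/code&gt; server lines and the per-node &lt;code&gt;myid&lt;/code&gt; content for an arbitrary list of masters. With a single master it reproduces exactly the configuration above.
&lt;/p&gt;

```python
# Illustration: zoo.cfg server entries and myid values for an ensemble of
# N Zookeeper nodes. Ids are 1-based, ports 2888/3888 are the standard
# peer/election ports used in the zoo.cfg sample above.

def zoo_cfg_lines(master_ips):
    # one "server.<id>=<ip>:2888:3888" line per ensemble member
    return ["server.%d=%s:2888:3888" % (i, ip)
            for i, ip in enumerate(master_ips, start=1)]

def myid_for(node_ip, master_ips):
    # content of /etc/zookeeper/conf/myid on the given node
    return str(master_ips.index(node_ip) + 1)

masters = ["192.168.10.10"]
print(zoo_cfg_lines(masters))              # ['server.1=192.168.10.10:2888:3888']
print(myid_for("192.168.10.10", masters))  # 1
```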


&lt;a name=&quot;sec43&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.3 Elasticsearch&lt;/h3&gt;

&lt;p&gt;
Related scripts from the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installation script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installElasticSearch.sh&quot;&gt;&lt;code&gt;installElasticSearch.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupElasticSearch.sh&quot;&gt;&lt;code&gt;setupElasticSearch.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Systemd service file: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/elasticsearch.service&quot;&gt;&lt;code&gt;elasticsearch.service&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
There&apos;s not a whole lot of things to configure in ES. The installation and setup scripts really just create a dedicated user, a whole bunch of folders and symlinks, etc.
&lt;br&gt;
The only important configuration elements are as follows:
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;On master &lt;code&gt;mes_master&lt;/code&gt;&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Sample portion from &lt;code&gt;/usr/local/lib/elasticsearch-6.0.0/config/elasticsearch.yml &lt;/code&gt;:
&lt;/p&gt;

&lt;pre&gt;
# name of the cluster (has to be common)
cluster.name: mes-es-cluster

# name of the node (has to be unique)
node.name: mes_master

# Bind on all interfaces (internal and external)
network.host: 0.0.0.0

# We&apos;re good with one node
discovery.zen.minimum_master_nodes: 1

#If you set a network.host that results in multiple bind addresses 
#yet rely on a specific address for node-to-node communication, you 
#should explicitly set network.publish_host
network.publish_host: 192.168.10.10
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;On slaves &lt;code&gt;mes_node1&lt;/code&gt; and &lt;code&gt;mes_node2&lt;/code&gt;&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
Sample portion from &lt;code&gt;/usr/local/lib/elasticsearch-6.0.0/config/elasticsearch.yml &lt;/code&gt;:
&lt;/p&gt;

&lt;pre&gt;
# name of the cluster (has to be common)
cluster.name: mes-es-cluster

# name of the node (has to be unique, this is for node1)
node.name: mes_node1

# Bind on all interfaces (internal and external)
network.host: 0.0.0.0

# We&apos;re good with one node
discovery.zen.minimum_master_nodes: 1

# enabling discovery of master
discovery.zen.ping.unicast.hosts: [&quot;192.168.10.10&quot;]

#If you set a network.host that results in multiple bind addresses 
#yet rely on a specific address for node-to-node communication, you 
#should explicitly set network.publish_host
# (this is for node1)
network.publish_host: 192.168.10.11
&lt;/pre&gt;
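&lt;p&gt;
Only three settings differ between nodes: the node name, the publish host, and (on slaves) the unicast discovery of the master. The Python helper below is a &lt;b&gt;hypothetical illustration of mine&lt;/b&gt; (not from the package) rendering the node-specific &lt;code&gt;elasticsearch.yml&lt;/code&gt; shown above.
&lt;/p&gt;

```python
# Illustration only: render the per-node elasticsearch.yml fragments
# shown above. Values mirror the article's configuration samples.

ES_TEMPLATE = """cluster.name: mes-es-cluster
node.name: {name}
network.host: 0.0.0.0
discovery.zen.minimum_master_nodes: 1
network.publish_host: {ip}
"""

# slaves additionally need to be able to discover the master
SLAVE_EXTRA = 'discovery.zen.ping.unicast.hosts: ["192.168.10.10"]\n'

def es_yaml(name, ip, is_master):
    conf = ES_TEMPLATE.format(name=name, ip=ip)
    if not is_master:
        conf += SLAVE_EXTRA
    return conf

print(es_yaml("mes_node1", "192.168.10.11", is_master=False))
```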


&lt;a name=&quot;sec44&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.4 Logstash, Kibana, Cerebro&lt;/h3&gt;

&lt;p&gt;
Related scripts from the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logstash Installation script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installLogstash.sh&quot;&gt;&lt;code&gt;installLogstash.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Logstash Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupLogstash.sh&quot;&gt;&lt;code&gt;setupLogstash.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
&lt;li&gt;Cerebro Installation script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installCerebro.sh&quot;&gt;&lt;code&gt;installCerebro.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Cerebro Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupCerebro.sh&quot;&gt;&lt;code&gt;setupCerebro.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Cerebro Systemd service file: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/cerebro.service&quot;&gt;&lt;code&gt;cerebro.service&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
&lt;li&gt;Kibana Installation script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installKibana.sh&quot;&gt;&lt;code&gt;installKibana.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kibana Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupKibana.sh&quot;&gt;&lt;code&gt;setupKibana.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kibana Systemd service file: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/kibana.service&quot;&gt;&lt;code&gt;kibana.service&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
There is really nothing specific to report in terms of configuration for these 3 tools.
&lt;/p&gt;


&lt;a name=&quot;sec45&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.5 Mesos&lt;/h3&gt;

&lt;p&gt;
Related scripts from the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installation script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installMesos.sh&quot;&gt;&lt;code&gt;installMesos.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupMesos.sh&quot;&gt;&lt;code&gt;setupMesos.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Mesos startup script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/mesos-init-wrapper.sh&quot;&gt;&lt;code&gt;mesos-init-wrapper.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Mesos Master Systemd startup file: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/mesos-master.service&quot;&gt;&lt;code&gt;mesos-master.service&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Mesos Slave Systemd startup file: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/mesos-slave.service&quot;&gt;&lt;code&gt;mesos-slave.service&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The noteworthy configuration aspects are as follows.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;On both master and slaves&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
The file &lt;code&gt;/usr/local/etc/mesos/mesos-env.sh&lt;/code&gt; contains common configuration for both mesos-master and mesos-slave.
&lt;br&gt;
So we should create this file on every node of the cluster.
&lt;/p&gt;

&lt;pre&gt;
#Working configuration
export MESOS_log_dir=/var/log/mesos

#Specify a human readable name for the cluster
export MESOS_cluster=mes_cluster

#Avoid issues with systems that have multiple ethernet interfaces when the Master 
#or Slave registers with a loopback or otherwise undesirable interface.
# (This is for master, put IP of the node)
export MESOS_ip=192.168.10.10

#By default, the Master will use the system hostname which can result in issues 
#in the event the system name isn&apos;t resolvable via your DNS server.
# (This is for master, put IP of the node)
export MESOS_hostname=192.168.10.10
&lt;/pre&gt;

&lt;p&gt;
Then, the file &lt;code&gt;/usr/local/etc/mesos/mesos-slave-env.sh&lt;/code&gt; configures mesos-slave. 
&lt;br&gt;
Since we run a mesos-slave process on the mes_master machine as well, we define this file on every node of the cluster.
&lt;/p&gt;

&lt;pre&gt;
#Path of the slave work directory.
#This is where executor sandboxes will be placed, as well as the agent&apos;s checkpointed state.
export MESOS_work_dir=/var/lib/mesos/slave

#we need the Slave to discover the Master.
#This is accomplished by updating the master argument to the master Zookeeper URL
export MESOS_master=zk://$MASTER_IP:2181/mesos
&lt;/pre&gt;

&lt;p&gt;
&lt;b&gt;On master &lt;code&gt;mes_master&lt;/code&gt; only:&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
The mesos-master process is configured by &lt;code&gt;/usr/local/etc/mesos/mesos-master-env.sh&lt;/code&gt;:
&lt;/p&gt;

&lt;pre&gt;
#Path of the master work directory.
#This is where the persistent information of the cluster will be stored
export MESOS_work_dir=/var/lib/mesos/master

#Specify the master Zookeeper URL which the Mesos Master will register with
export MESOS_zk=zk://192.168.10.10:2181/mesos

# Change quorum for a greater value if one has more than one master 
# (only 1 in our case)
export MESOS_quorum=1
&lt;/pre&gt;
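&lt;p&gt;
The &lt;code&gt;MESOS_quorum&lt;/code&gt; value must be a strict majority of the number of masters registered in Zookeeper, so that only one leader can ever be elected. The one-liner below (&lt;b&gt;my own helper, for illustration&lt;/b&gt;) expresses that rule:
&lt;/p&gt;

```python
# Quorum rule for Mesos masters: a strict majority of the ensemble.

def mesos_quorum(n_masters):
    return n_masters // 2 + 1

print(mesos_quorum(1))  # 1 -> matches MESOS_quorum=1 above
print(mesos_quorum(3))  # 2
print(mesos_quorum(5))  # 3
```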

&lt;a name=&quot;sec46&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.6 Spark&lt;/h3&gt;

&lt;p&gt;
Related scripts from the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installation script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installSpark.sh&quot;&gt;&lt;code&gt;installSpark.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupSpark.sh&quot;&gt;&lt;code&gt;setupSpark.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dynamic Allocation Configuration script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupSparkDynamicAllocation.sh&quot;&gt;&lt;code&gt;setupSparkDynamicAllocation.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Spark History Server start wrapper: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/start-spark-history-server-wrapper.sh&quot;&gt;&lt;code&gt;start-spark-history-server-wrapper.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Spark History Server Systemd startup file: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/spark-history-server.service&quot;&gt;&lt;code&gt;spark-history-server.service&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Spark Mesos Dispatcher start wrapper: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/start-spark-mesos-dispatcher-wrapper.sh&quot;&gt;&lt;code&gt;start-spark-mesos-dispatcher-wrapper.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Spark Mesos Dispatcher Systemd startup file: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/spark-mesos-dispatcher.service&quot;&gt;&lt;code&gt;spark-mesos-dispatcher.service&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Aside from the specific startup wrappers and systemd service configuration files required for the &lt;i&gt;Spark History Server&lt;/i&gt; and the &lt;i&gt;Spark Mesos Dispatcher&lt;/i&gt;, the noteworthy configuration elements are as follows.
&lt;/p&gt;
&lt;p&gt;
&lt;b&gt;On both master and slaves&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
The file &lt;code&gt;/usr/local/lib/spark-2.2.0/conf/spark-env.sh&lt;/code&gt; defines common environment variables required by spark workers and drivers.
&lt;br&gt;
So we should create this file on every node of the cluster (the master also executes Spark workers).
&lt;/p&gt;

&lt;pre&gt;
#point to your libmesos.so if you use Mesos
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/mesos-1.3.0/lib/libmesos.so

#Important configuration directories
export SPARK_CONF_DIR=/usr/local/lib/spark-2.2.0/conf
export SPARK_LOG_DIR=/usr/local/lib/spark-2.2.0/logs
&lt;/pre&gt;

&lt;p&gt;
The file &lt;code&gt;/usr/local/lib/spark-2.2.0/conf/spark-defaults.conf&lt;/code&gt; defines common configuration properties required by spark workers and drivers.
&lt;br&gt;
So we should create this file on every node of the cluster (since the master also executes Spark workers).
&lt;/p&gt;

&lt;pre&gt;
#Finding the mesos master through zookeeper
spark.master=mesos://zk://$MASTER_IP:2181/mesos

#Activating EventLog stuff (required by history server)
spark.eventLog.enabled=true
spark.eventLog.dir=/var/lib/spark/eventlog

#Default serializer
spark.serializer=org.apache.spark.serializer.KryoSerializer

#Limiting the driver (client) memory
spark.driver.memory=800m

#Settings required for Spark driver distribution over mesos cluster 
#(Cluster Mode through Mesos Dispatcher)
spark.mesos.executor.home=/usr/local/lib/spark-2.2.0/

#If set to true, runs over Mesos clusters in coarse-grained sharing mode,
#where Spark acquires one long-lived Mesos task on each machine.
#If set to false, runs over Mesos cluster in fine-grained sharing mode,
#where one Mesos task is created per Spark task.
#(Fine grained mode is deprecated and one should consider dynamic allocation 
#instead)
spark.mesos.coarse=true

#ElasticSearch setting (first node to be reached =&gt; can use localhost everywhere)
spark.es.nodes=localhost
spark.es.port=9200
spark.es.nodes.data.only=false

#The scheduling mode between jobs submitted to the same SparkContext.
#Can be FIFO or FAIR. FAIR seems not to work well with Mesos
#(FIFO is the default anyway ...)
spark.scheduler.mode=FIFO

#How long to wait to launch a data-local task before giving up 
#and launching it on a less-local node.
spark.locality.wait=20s

# Configuring dynamic allocation
# (See Spark configuration page online for more information)
spark.dynamicAllocation.enabled=true
#(Caution here: small values cause issues. I have seen executors killed with 10s, for instance)
spark.dynamicAllocation.executorIdleTimeout=120s
spark.dynamicAllocation.cachedExecutorIdleTimeout=300s

# Configuring spark shuffle service (required for dynamic allocation)
spark.shuffle.service.enabled=true

&lt;/pre&gt;
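&lt;p&gt;
As a quick sanity check before deploying this file on every node, the &lt;code&gt;key=value&lt;/code&gt; properties above can be parsed and validated programmatically. The small Python helper below is purely illustrative (it is not part of Spark, which reads the file as a Java properties file accepting additional separators):
&lt;/p&gt;

```python
def parse_spark_defaults(text):
    """Parse 'key=value' lines (as in the spark-defaults.conf above)
    into a dict, skipping blank lines and '#' comments.

    Illustrative helper only: Spark itself loads this file as a
    Java properties file, which accepts additional separators.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # keep only well-formed key=value lines
            props[key.strip()] = value.strip()
    return props


# A few properties taken from the configuration above
sample = """
#Default serializer
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.mesos.coarse=true
spark.driver.memory=800m
"""

props = parse_spark_defaults(sample)
```

&lt;p&gt;
Such a check makes it easy to assert, before restarting the cluster, that mandatory properties like &lt;code&gt;spark.eventLog.dir&lt;/code&gt; are present and identical on all nodes.
&lt;/p&gt;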

&lt;p&gt;
&lt;b&gt;On master &lt;code&gt;mes_master&lt;/code&gt; only&lt;/b&gt;
&lt;/p&gt;
&lt;p&gt;
In the very same file &lt;code&gt;/usr/local/lib/spark-2.2.0/conf/spark-defaults.conf&lt;/code&gt;, we add what is required for the Spark History Server:
&lt;/p&gt;

&lt;pre&gt;
#For the filesystem history provider, 
#the directory containing application event logs to load.
spark.history.fs.logDirectory=file:///var/lib/spark/eventlog

#The period at which to check for new or updated logs in the log directory.
spark.history.fs.update.interval=5s
&lt;/pre&gt;

&lt;a name=&quot;sec47&quot;&gt;&lt;/a&gt;
&lt;h3&gt;4.7 ES-Hadoop&lt;/h3&gt;

&lt;p&gt;
Related scripts from the &lt;i&gt;niceideas_ELK-MS&lt;/i&gt; package are as follows:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installation script: &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installESHadoop.sh&quot;&gt;&lt;code&gt;installESHadoop.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Nothing specific to report here: essentially, the only thing to be done to install ES-Hadoop is to copy the Spark driver &lt;code&gt;elasticsearch-spark-20_2.11-6.0.0.jar&lt;/code&gt; into the Spark jars folder &lt;code&gt;/usr/local/lib/spark-2.2.0/jars/&lt;/code&gt;.
&lt;/p&gt;


&lt;a name=&quot;sec5&quot;&gt;&lt;/a&gt;
&lt;h2&gt;5. Conclusion&lt;/h2&gt;

&lt;p&gt;
With all of the information above, you should be able to set up your own ElasticSearch / Mesos / Spark Cluster in no time.
&lt;br&gt;
Or simply use the &lt;a href=&quot;https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz&quot;&gt;niceideas_ELK-MS&lt;/a&gt; package to build a test cluster using one single command.
&lt;/p&gt;
&lt;p&gt;
Now the next article in this series, &lt;a href=&quot;https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1&quot;&gt;ELK-MS - part II : assessing behaviour&lt;/a&gt;, will present the tests I ran on this test cluster and the conclusions in terms of behaviour assessment.
&lt;/p&gt;
&lt;p&gt;
Let me already give away the big conclusion: using ElasticSearch / Mesos / Spark for your Big Data Analytics needs is mind-boggling. It works amazingly well and supports a striking range of use cases, while being a hundred times lighter than a plain old Hadoop stack, both to set up and to operate.
&lt;/p&gt;
&lt;p&gt;
Kudos to the folks at Apache and at Elastic for making this possible.
&lt;/p&gt;

</description>          </item>
    <item>
    <guid isPermaLink="true">https://www.niceideas.ch/roller2/badtrash/entry/big-data-and-private-banking</guid>
    <title>Big Data and private banking, what for ?</title>
    <dc:creator>Jerome Kehrli</dc:creator>
    <link>https://www.niceideas.ch/roller2/badtrash/entry/big-data-and-private-banking</link>
        <pubDate>Wed, 5 Oct 2016 04:50:22 -0400</pubDate>
    <category>Big Data</category>
    <category>banking</category>
    <category>big-data</category>
    <category>finance</category>
    <category>private-bank</category>
    <atom:summary type="html">&lt;p&gt;
Big Data technologies are increasingly used in retail banking institutions for customer profiling or other marketing activities. In private banking institutions, however, applications are less obvious and there are only very few initiatives. &lt;br&gt;
Yet, as a matter of fact, there are opportunities in such institutions and they can be quite surprising.
&lt;/p&gt;
&lt;p&gt;
Big Data technologies, initiated by Web giants such as Google or Amazon, make it possible to analyze massive amounts of data (ranging from terabytes to petabytes). &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;Apache Hadoop&lt;/a&gt; is nowadays the de-facto standard when it comes to Open Source Big Data technologies, but it is increasingly challenged by alternatives such as &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; or others providing less constraining programming paradigms than Map-Reduce.
&lt;/p&gt;
&lt;p&gt;
These Big Data processing platforms benefit from the NoSQL genes: the CAP-Theorem trade-offs when it comes to storing data, the usage of commodity hardware, the capacity to scale out (almost) linearly (instead of scaling up your Oracle DB) and a much lower TCO (Total Cost of Ownership) than standard architectures.
&lt;/p&gt;
&lt;p&gt;
The most essential applications of such technologies in retail banking institutions consist in gathering knowledge and insights about the customer base, customers&apos; profiles and their tendencies, by applying cutting-edge Machine Learning techniques to this data.
&lt;/p&gt;
&lt;p&gt;
Contrary to retail banking institutions, which have been exploiting such technologies for many years, private banking institutions, with their very low volume of transactions and their limited customer base, consider these technologies with a lot of skepticism and condescension.
&lt;/p&gt;
&lt;p&gt;
However, contrary to preconceived ideas, use cases exist and present surprising opportunities, mostly around three topics:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Enhance proximity with customers &lt;/li&gt;
&lt;li&gt; Improve investment advisory services &lt;/li&gt;
&lt;li&gt; Reduce computation costs &lt;/li&gt;
&lt;/ul&gt;
</atom:summary>        <description>&lt;!-- Big Data and private banking, what for ? --&gt;

&lt;p&gt;
Big Data technologies are increasingly used in retail banking institutions for customer profiling or other marketing activities. In private banking institutions, however, applications are less obvious and there are only very few initiatives. &lt;br&gt;
Yet, as a matter of fact, there are opportunities in such institutions and they can be quite surprising.
&lt;/p&gt;
&lt;p&gt;
Big Data technologies, initiated by Web giants such as Google or Amazon, make it possible to analyze massive amounts of data (ranging from terabytes to petabytes). &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;Apache Hadoop&lt;/a&gt; is nowadays the de-facto standard when it comes to Open Source Big Data technologies, but it is increasingly challenged by alternatives such as &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; or others providing less constraining programming paradigms than Map-Reduce.
&lt;/p&gt;
&lt;p&gt;
These Big Data processing platforms benefit from the NoSQL genes: the CAP-Theorem trade-offs when it comes to storing data, the usage of commodity hardware, the capacity to scale out (almost) linearly (instead of scaling up your Oracle DB) and a much lower TCO (Total Cost of Ownership) than standard architectures.
&lt;/p&gt;
&lt;p&gt;
The most essential applications of such technologies in retail banking institutions consist in gathering knowledge and insights about the customer base, customers&apos; profiles and their tendencies, by applying cutting-edge Machine Learning techniques to this data.
&lt;/p&gt;
&lt;p&gt;
Contrary to retail banking institutions, which have been exploiting such technologies for many years, private banking institutions, with their very low volume of transactions and their limited customer base, consider these technologies with a lot of skepticism and condescension.
&lt;/p&gt;
&lt;p&gt;
However, contrary to preconceived ideas, use cases exist and present surprising opportunities, mostly around three topics:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Enhance proximity with customers &lt;/li&gt;
&lt;li&gt; Improve investment advisory services &lt;/li&gt;
&lt;li&gt; Reduce computation costs &lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt; Enhance proximity with customers&lt;/h2&gt;

&lt;p&gt;
Private banking institutions are increasingly challenged by new external asset management business models such as multi-family offices. This new kind of financial services firm provides its customers with a very high level of personalized service and very close relationship management, up to a certain form of intimacy, able to seduce UHNWIs (Ultra-High Net Worth Individuals).
&lt;/p&gt;
&lt;p&gt;
The progressive denormalization of customer data, their transactions and all the other kinds of related data, even very indirectly related, inside a Hadoop cluster, followed by their massive exploitation with cutting-edge machine learning techniques, enables traditional institutions to sharpen and refine their knowledge of their customers.  &lt;br&gt;
This consists in importing all the different data into Hadoop in an incremental way; after every new stage, one needs to study carefully the new analysis opportunities.
&lt;/p&gt;
&lt;p&gt;
The new knowledge and insights gained this way enable private banking institutions to reach a level of proximity, customer understanding and customized investment advisory service close to that of family offices. This makes it easier to retain the top customers seduced by such asset management models.
&lt;/p&gt;


&lt;h2&gt; Improve investment advisory services&lt;/h2&gt;

&lt;p&gt;
Customer profiling to find out about investment tendencies within peer groups is of definite interest. Profiles can be examined from various perspectives and angles by combining customer characteristics such as age, origins, wealth level, activity sector or even family situation.
&lt;/p&gt;
&lt;p&gt;
Banking institutions would typically use the same Hadoop cluster deployed for the previous use case, since it already holds all the information required for such analysis.
&lt;/p&gt;
&lt;p&gt;
There are several objectives here. For instance, one might want to adapt investment advice by comparing a specific customer&apos;s situation with the profile of her peer group, or simply with the general tendencies of the market. &lt;br&gt;
Another example relates to the advanced possibilities offered by such knowledge when it comes to optimizing investigations into unusual investments or customers, most essentially to detect fraud.
&lt;/p&gt;

&lt;h2&gt; Reduce computation costs&lt;/h2&gt;

&lt;p&gt;
For instance, recent NoSQL technologies such as Cassandra are very efficient when it comes to storing and manipulating time series. Infinispan makes it possible to structure an impressive amount of information in memory for massive and complex computations. One could also mention Storm, very efficient when it comes to analyzing market events in real time. &lt;br&gt;
These new technologies from the Big Data / NoSQL landscape offer unprecedented opportunities for quantitative research on very massive amounts of market data.
&lt;/p&gt;
&lt;p&gt;
They have the potential to bring about a little revolution in the world of private banking institutions, which finally get very cheap ways to perform massive, real-time computations of key risk and performance metrics. Even better, these same platforms can be used for optimization, rebalancing or even simulation of financial portfolios at a very large scale and in near real time. They provide interesting alternatives to more traditional approaches such as Bloomberg&apos;s analytical platform or other very expensive home-made developments on Teradata.
&lt;/p&gt;
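&lt;p&gt;
To make this concrete, here is a minimal Python sketch (purely illustrative and independent of the platforms mentioned above) of one such key risk metric, the rolling volatility of returns over a price time series; at scale, this is the kind of computation these platforms parallelize across thousands of instruments:
&lt;/p&gt;

```python
import math

def returns(prices):
    """Simple periodic returns computed from a price series."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

def rolling_volatility(prices, window):
    """Standard deviation of returns over a sliding window,
    a basic per-instrument risk metric."""
    rs = returns(prices)
    vols = []
    for i in range(len(rs) - window + 1):
        chunk = rs[i:i + window]
        mean = sum(chunk) / window
        variance = sum((r - mean) ** 2 for r in chunk) / window
        vols.append(math.sqrt(variance))
    return vols

# Hypothetical daily prices for a single instrument
prices = [100.0, 101.0, 99.5, 100.5, 102.0, 101.0]
vols = rolling_volatility(prices, window=3)
```

&lt;p&gt;
The point is not the metric itself but the scale: evaluated continuously over the whole market data universe, even such a trivial computation becomes a massive parallel workload, which is exactly where these platforms shine.
&lt;/p&gt;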
&lt;p&gt;
Deploying such ambitious technologies inside a private banking institution is, however, not trivial. Adopting an iterative approach, building the software bricks, importing the data and implementing analyses step by step, is a key success factor.
&lt;/p&gt;

&lt;h2&gt;Typical Architecture&lt;/h2&gt;

&lt;p&gt;
A typical architecture for such a system would be as follows:
&lt;/p&gt;

&lt;div class=&quot;centering&quot;&gt;
&lt;img class=&quot;centered&quot; style=&quot;width: 750px;&quot; alt=&quot;big data in private banking architecture&quot; src=&quot;https://niceideas.ch/roller2/badtrash/mediaresource/9267e781-b4db-476d-bdf3-e9d110db44c5&quot; /&gt;
&lt;br&gt;
&lt;div class=&quot;centered&quot;&gt;
&lt;i&gt;(Image is copyrighted OCTO Technology SA / 2014)&lt;/i&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;p&gt;
&lt;i&gt;(Paper originally published by myself in ICT Journal / March 2014)&lt;/i&gt;
&lt;/p&gt;

</description>          </item>
  </channel>
</rss>