Modern Information System Architectures
by Jerome Kehrli
Posted on Monday Dec 13, 2021 at 12:04PM in Big Data
For forty years we have been building Information Systems in corporations in the same way, with the same architecture, with very little innovations and changes in paradigms:
- On one side the Operational Information System which sustains day-to-day operations and business activities. On the Operational Information Systems, the 3-tiers architecture and the relational database model (RDBMS - Relational Database Management System / SQL) have ruled for nearly 40 years.
- On the other side the Decision Support Information System - or Business Intelligence or Analytical Information System - where the Data Warehouse architecture pattern has ruled for 30 years.
Of course the technologies involved in building these systems have evolved in all these decades, in the 80s COBOL on IBM hosts used to rule the Information Systems world whereas Java emerged quickly as a standard in the 2000s, etc.
But while the technologies used in building these information systems evolved fast, their architecture in the other hand, the way we design and build them, didn't change at all. The relational model ruled for 40 years along the 3-tiers model in the Operational world and in the analytical world, the Data Warehouse pattern was the only way to go for decades.
The relational model is interesting and has been helpful for many decades. its fundamental objective is to optimize storage space by ensuring an entity is stored only once (3rd normal form / normalization). It comes from a time when storage was very expensive.
But then, by imposing normalization and ACID transactions, it prevents horizontal scalability by design. An Oracle database for instance is designed to run on a single machine, it simply can't implement relational references and ACID transactions on a cluster of nodes.
Today storage is everything but expensive but Information Systems still have to deal with RDBMS limitations mostly because ... that's the only way we used to know.
On the Decision Support Information System (BI / Analytical System), the situation is even worst. in Data warehouses, data is pushed along the way and transformed, one step at a time, first in a staging database, then in the Data Warehouse Database and finally in Data Marts, highly specialized towards specific use cases.
For a long time we didn't have much of a choice since implementing such analytics in a pull way (data lake pattern) was impossible, we simply didn't have the proper technology. The only way to support high volumes of data was to push daily increments through these complex transformation steps every night, when the workload on the system is lower.
The problem with this push approach is that it's utmost inflexible. One can't change his mind along the way and quickly come up with a new type of data. Working with daily increments would require waiting 6 months to have a 6 months history. Not to mention that the whole process is amazingly costly to develop, maintain and operate.
So for a long time, RDBMSes and Data Warehouses were all we had.
It took the Internet revolution and the web giants facing limits of these traditional architectures for finally something different to be considered. The Big Data revolution has been the cornerstone of all the evolutions in Information System architecture we have been witnessing over the last 15 years.
The latest evolution in this software architecture evolution (or revolution) would be micro-services, where finally all the benefits that were originally really fit to the analytical information system evolution finally end up overflowing to the operational information system.
Where Big Data was originally a lot about scaling the computing along with the data topology - bringing the code to where the data is (data tier revolution) - we're today scaling everything, from individual components requiring heavy processing to message queues, etc.
In this article, I would want to present and discuss how Information System architectures evolved from the universal 3 tiers (operational) / Data Warehouse (analytical) approach to the Micro-services architecture, covering Hadoop, NoSQL, Data Lakes, Lambda architecture, etc. and introducing all the fundamental concepts along the way.
- 1. Introduction
- 2. The Web giants and Big Data
- 3. The CAP Theorem
- 4. NoSQL / NewSQL
- 5. Hadoop and Data Lakes
- 6. Streaming Architectures
- 7. Big Data 2.0
- 8. Micro-services
- 9. Conclusion
As stated in the summary above, the way we build information systems really didn't evolve in so many decades. The technologies used underneath have evolved of course, a long way from COBOL to Java and Angular, but the architectures in use - the 3-tiers model on the operational information system and the data warehouse pattern on the decision support system (a.k.a analytics system) - haven't evolve in more than 30 years.
The Software Architecture is defined as the set of principal design decision about the system. Software architecture is kind of the blueprint for the system's construction and evolution. Design decisions encompass the following aspects of the system under development: Structure, Behaviour, Interactions, Non-functional properties. (Taylor 2010)
And then again, the technologies under the hood, from the operating systems to the User Interfaces through the programming languages, have evolved drastically. We all remember 3270 green-on-black terminal screens and can only consider the terrific evolution to the fancy HTML5/bootstrap screens we see today.
But the design of the Information system components, their interactions and the technical processes in between didn't evolve at all!
I find it amazing to consider that if you put COBOL, 3270 and a few terms like that on this schema instead of the web elements you literally get what would have been the high-level architecture schema 40 years ago.
As stated above, RDBMS - Relational Database Management Systems - have a lot of limits and some benefits, namely the standardized querying language - SQL - and the optimization of storage space. But in today's digital world, the benefits don't stand up to the drawbacks, mostly the impossibility to scale.
The Data Warehouse pattern in use for 30 years on the analytical Information System is also a poor fit for today's pace of development of digital services. It is much too inflexible not to mention the cost in developing and maintaining it.
It took the web giants to face the limits of these current architecture paradigms and invent new ways of building information systems to finally see some evolutions in the way we are building them in corporations. The first evolutions came to the analytics system side with Big Data technologies and overflowed later to the operationnal IS side with NoSQL, streaming architectures and eventually micro-services.
In this article I want to present these evolutions from an historical perspective. We'll start with the Web giants and the Big Data revolution, cover NoSQL and Hadoop, run through Lambda and Kappa architectures, and end up discussing Kubernetes and Micro-services.
2. The Web giants and Big Data
The web giants have been the first to face the limits of traditional architectures in an unacceptable way. Can you imagine google running their Search Engine on an IBM mainframe ? Can you imagine what that would be for a machine and how much money (licensing fees) they would need to leave to IBM every year for running such a host ?
Can you imagine Amazon running their online retail business on an Oracle database with hundreds of millions of users connected and querying the DB at any time ? Can you imagine the price of a computer that would be able to support such volumes of data and concurrent requests ?
The Web giants had to invent both new data storage technologies and programming paradigms to run their business and support their volume of activities.
But let's start with the beginning.
2.1 The Era of Power
As an prequel to introducing Big Data, let's have a look at these both situations:
These two computers are separated by only 30 years of technological evolutions.
The computer on the left is a Cray II. When it came out in 1985, it was a revolution since it was the fastest machine in the world, the first multi-processor computer from Seymour Cray and included many unique technological evolutions.
The computer on the right is a Samsung S6 Smartphone. It's 30 years younger that the Cray 2.
It's 30 years younger only and around 15 times more powerful that the Cray 2. While the later was by far bigger than a human being, the Samsung S6 fits in the palm. The Cray 2 has 4 processors while the S6 packages 8 processors.
Considering how the hardware technology progressed over one generation is mind-blowing.
Another comparison is even more impressive, 50 years before the Samsung S6, a computer has been used to send people to the moon. The S6 is a million times more powerful in terms of raw computing power than that very computer.
We have today a device so small that it fits in our palm, incredibly powerful, which enables us to be interconnected everywhere, all the time and for every possible need. This is the definition of the digitization
The smartphones are really an amazing piece of technology, but much more impressive are the apps behind and the services they enable us to use. This leads us to the Web Giants.
2.2 The Web Giants
The Web giants have been the first to face the limits of traditional architectures and the usual way information systems were built.
And the revolution came from them. They had to find new technical solutions to business-critical challenges such as :
- Google: Index the whole web, and keep a response time to any request below one second - or how to keep the search free for the user ?
- Facebook: Interconnect billions of users, display their feeds in near-real-time and understand how they use their product to optimize ads ?
- Amazon: How to build a product recommendation engine for dozens of millions of customers, on millions of products ?
- EBAY : How to do a search in ebay auctions, even with misspelling ?
These are just oversimple examples of course and the challenges faced by the web giants go much beyond such simple cases.
These business challenges are backed by technical challenges such as:
- How to invert a square matrix that doesn't fit in memory in a reasonable time ?
- How to query a database containing trillions of documents in real-time ?
- How to read billions of files of multiple megabytes in a reasonable time ?
At the end of the day, it all boiled down to finding ways to manage volumes of data bigger by several orders of magnitude than the volumes of data that IT systems were used to manipulate so far.
2.3 Data Deluge
So the most striking problem they had to solve is getting prepared and ready for the Data Deluge!
Not only do we generate more and more data, but we have today the means and the technology to analyze, exploit and mine it and extract meaningful business insights.
The data generated by the company’s own systems can be a very interesting source of information regarding customer behaviours, profiles, trends, desires, etc. But also external data, Facebook, twitter logs, etc.
2.4 The Moore Law
The Moore Law:The number of transistors and resistors on a chip doubles every 24 months" (Gordon Moore, 1965)
For a long time, the increasing volume of data to be handled by any given corporation in its Information System was not an issue at all.
The volume of data increases, the number of user increases, etc. but the processing abilities increases as well, sometimes even more.
The Moore law was there to cover our ass. The corporation CTO just had to buy a new machine to host the Information System every few years.
This model has hold for a very long time. The cost are going down, the computing capacities are rising, one simply needs to buy a new machine to absorb the load increase.
This is especially true in the mainframe world. There wasn’t even any need to make the architecture of the systems (COBOL, etc.) evolve for 30 years.
Even outside the mainframe world. The architecture patterns and styles we are using in the operational IS world haven’t really evolve for the last 30 years. Despite new technologies such as Web, Web 2.0, Java, etc. of course. I’m just speaking about architecture plans and styles.
2.5 The Death of the Moore Law
But everything has an end.
Let's consider a fifth dimension, too often left aside when considering the evolutions of computer technologies and hardware architectures: the throughput of the connection between the data on the disk (long term storage) and the memory (i.e. hard drive controllers mostly, but also buses, etc.)
Issue: the throughput evolution is always lower than the capacity evolution.
The throughput has become the biggest concern in scaling computer / platform hardware up. It did not progress in terms of efficiency in a way comparable to the four other dimensions.
We are able to store more and more data, of course, but we are less and less able to manipulate this data efficiently.
In practice, fetching all the disk data on a computation machine to fit it in RAM to process it is becoming more and more difficult.
2.6 Fundamentals of Big Data - the Web giants new paradigms
In order to workaround the limits of traditional architectures, the web giants invented new architecture paradigms and new ways of building information systems by leveraging on three fundamental ideas:
Key idea 1 : distribution - Since its impossible to fit the data in the RAM of one single machine, split it and distribute it on as many different machines as are required.
Distribution means partitioning the dataset - also called sharding it sometimes - but also always replicate the partitions or shards. We'll see exactly why and how later.
- Key idea 2 : Horizontal scalability - Just as we split the data, let's split the computation and distribute it on as many nodes as are required to support the workload, even if it means multiple datacenters.
Key idea 3 : Data tier revolution - So we distribute both the data on a cluster of computers - or nodes - and the processing as well. We end up using the data nodes as processing nodes. This is the data tier revolution, which is in complete opposition to what was usually done so far in traditional architectures: fetching the required data to the place where the computation occurs.
But it goes further than that.
Most of the time we end up distributing different types or categories of data. Every time a specific business process needs to compute something out of a specific piece of all this data, it's crucial to ensure the processing will happen on the very nodes where this specific piece of data is located. This is called co-local processing or data locality optimization.
As a summary, the web giants have designed new architectures and programming paradigms where distributing the data and the processing (ideally in a co-local way) on a cluster of nodes was the most fundamental principle.
3. The CAP Theorem
But moving from a mainframe world - where everything is on the same computer and the data to compute always fits in the memory of that computer - to a distributed system world most definitely has benefits, but it also has some consequences. And that's the topic of this chapter.
3.1 The origins of NoSQL
Let's start with a bit of history.
3.1.1 Flat files as data store
In the early days of digital data, before 1960, the data within a Computer Information System was mostly stored in rather flat files (sometimes indexed) manipulated by top-level software systems.
The primitives provided by the operating system were really very low level, basically just the possibility to read or write file or file increments.
Directly using flat files was cumbersome and painful. Different uncovered needs emerged at the time:
- Data isolation
- Access efficiency
- Data integrity
- Reducing the time required to develop brand new applications
Addressing such needs by relying on indexed flat files required solutions to be implemented artificially by the applications using such files.
It was highly difficult, inefficient, time consuming, etc. And the wheel had to be re-invented all the time all over again.
So something else was required.
3.1.2 RDBMS and the relational model
So in 1969, Edgar F. Cood, a British engineer, invented the relational model. In the relational model, business entities are modeled as Tables and Associations (relations).
The relational model is at the root of RDBMS - Relational DataBase Management Systems - that ruled the Data Storage world for 30 years.
The relational model is conceived to reduce redundancy in order to optimize disk space usage. At the time of its creation Disk storage was very expensive and limited. And the volume of data in Information Systems was rather small.
The relational model avoids redundancy to optimize disk space usage by guaranteeing:
- Structure: using normal design forms and modeling techniques
- Coherence: using transaction principles and mechanisms
An example relational model would be as follows, illustrating an Exam Grade Management application
In this example, if we want to know the subject assigned to a student on his profile screen, we would need to
- Extract the personal data from the "student" table
- Fetch its subject if from the "relation" table
- Read the subject title from the "subject" table.
Why, oh why, separate all this information in different tables when in practice 99% of the time we want to fetch all of this together ?
3.1.3 criticism of the relational model
The relational model comes from a time where storage was expensive. The fundamental idea behind its design is rationalizing storage space by ensuring every piece of information is stored only once.
But nowadays long-term storage space is not expensive at all anymore. A Terabyte of SSD storage is not more that a few dozens of dollars. Optimizing the storage space at all cost makes little sense today.
In addition, the relational model is not the best way to represent some information. Let's see some examples
- Tabular information fits naturally well in the relational model but not only. Every time we can naturally divide a business problem into well-defined and predefined entities and relations among them, the relational model is usually a good fit.
- But then think of other type of information, such as registration forms, product descriptions, etc. Such types of semi-structured data fit very poorly in the relational model.
- Also molecular data or graph data would be way better stored in very different types of databases.
The web giants had to get away from the mainframe pattern, and if you challenge that, the very fundamental architecture pattern on which all information systems were built, why wouldn't you challenge all the rest, including the relational model ?
We'll get back to this.
3.2 Horizontal scalability
The mid and late 2000’s were times of major changes in the IT landscape. Hardware capabilities significantly increased and eCommerce and internet trade, in general, exploded.
Some internet companies- the "Web giants" (Yahoo!, Facebook, Google, Amazon, Ebay, Twitter, ...), pushed traditional databases to their limits. Those databases are by design hard to scale.
Traditional RDBMS and traditional architecture can only scale up. And scaling up is tricky.
3.2.1 Scaling up
With relational RDBMSes, the only way to improve performance is by scaling up, i.e. getting bigger servers (more CPU, more RAM, more disk, etc.). There's simply nothing else that can be done.
But one eventually hits a hard limit imposed by the current technology.
With traditional architectures and RDMBS, all the workload happens on one single machine. And while running a few thousands operations or transactions on one single machine is perhaps possible, going much beyond it just doesn't work. The programming paradigm we use - mostly around thread synchronizations and context switches - make it impossible to run effectively a million threads on one single machine for instance.
But there's worst than that.
Imagine that a machine A with 4 CPUs, 64 GB RAM and 1 TB hard drive costs 10'000 USD.
Do you think that a machine B with twice the power so 8 CPUs, 128 GB RAM and a 2 TB hard drive would cost the double, hence 20'000 USD ?
No! It would be much more than that, perhaps four or five times the price, so more than 40k USD.
The price of individual machines doesn't scale linearly with the processing power of the machine, it's exponential !
3.2.2 Scaling out
By rethinking the architecture of databases, the web giants have been able to make them scale at will, by adding more servers to clusters instead of upgrading the servers.
When scaling out, instead of buying bigger machines, one buys more machines and add them in a processing cluster, working together on distributed data and processing.
The servers are not made of expensive, high-end hardware; they are qualified as commodity hardware (or commodity servers).
When scaling out, the limits vanish, one can add as many nodes as one wants in a processing cluster.
And there's a cherry on the cake, recall the example of machine A above, buying 10 machine A is not even 10 times the price of a single machine A, since one can get discounts from the number being bought.
The only drawback is that the application leveraging on scaling out, or the information system as a whole, needs to be designed from the grounds up for distribution. And there are constraints for this, we'll see that further in this article.
3.3 Data Distribution
With most NoSQL databases, the data is not stored in one place (i.e. on one server). It is distributed among the nodes of the cluster. When created, an object is split in a number of shards, for instance 4 shards, A, B, C, D and these shards are assigned to a node in the cluster.
This is called sharding - or partitioning - the portion of data assigned to a node is called a shard - or a partition.
Having more cluster nodes implies a higher risk of having some nodes crash, or a network outage splitting the cluster in two. For this reason, and to avoid data loss, objects are also replicated across the clusters. The number of copies, called replicas, can be tuned. 3 replicas is a common figure.
Imaging that the specifications of a given computer indicates that there is a 10% chance for the computer to experience any kind of hardware failure in its first year of exploitation. Then imagine you have 10 nodes like that one in a cluster, what is the probability that at least one of these nodes experiences an hardware failure ? Yes, you can be nearly sure at least one of them will fail.
For this reason, when we start to distribute data on a cluster of multiple machines, we have to design for failures.
In data management, this means creating multiple copies of every shard in such a way that we maximize the chances of one of them always being available.
This is called replication.
We can see here that the objects has been split in 4 shards A, B, C, D and that every shard has three replicas.
The objects may move, as nodes crash or new nodes join the cluster, ready to take charge of some of the objects. Such events are usually handled automatically by the cluster; the operation of shuffling objects around to keep a fair repartition of data is called rebalancing.
3.4 Properties of a distributed system
In RDBMSes, we expect DB transactions to respect some fundamental properties, identified by ACID: Atomicity, Consistency, Isolation and Durability.
In distributed systems, we consider things a little differently and consider the following properties:
Availability (or lack thereof) is a property of the database cluster. The cluster is available if a request made by a client is always acknowledged by the system, i.e. it is guaranteed to be taken into account.
That doesn’t mean that the request is processed immediately. It may be put on hold. But an available system should at a minimum always acknowledge it immediately.
Practically speaking, availability is usually measured in percents. For instance, 99.99% availability means that the system is unavailable at most 0.01% of the time, that is, at most 53 min per year.
Partition Tolerance is verified if a system made of several interconnected nodes can stand a partition of the cluster; if it continues to operate when one or several nodes disappear. This happens when nodes crash or when a network equipment is shut down, taking a whole portion of the cluster away.
Partition tolerance is related to availability and consistency, but it is still different. It states that the system continues to function internally (e.g. ensuring data distribution and replication), whatever its interactions with a client.
When talking about distributed databases, like NoSQL, consistency has a meaning that is somewhat different than in the relational context.
It refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node being queried.
With many NoSQL databases, updates take a little time to propagate across the cluster. When an entity’s value has just been created or modified, there is a short time span during which the entity is not consistent. However the cluster guarantees that it will eventually be, when replication has occurred. This is called eventual consistency
These 3 properties, Consistency, Availability and Partition tolerance, are not independent.
The CAP theorem - or Brewer’s theorem - states that a distributed system cannot guarantee all 3 properties at the same time.
This is a theorem. That means it is formally true, but in practice it is less severe than it seems.
The system or a client can often choose CA, AP or CP according to the context, and "walk" along the chosen edge by appropriate tuning.
Partition splits happen, but they are rare events (hopefully).
Traditional Relational DBMSes would be seen as CA - consistency is a must
Many NoSQL DBMSes are AP - availability is a must. Big clusters failures happen all the time so they better live with it. Consistency is only eventual.
3.4.1 Eventual consistency
Consistency refers to the fact that all replicas of an entity, identified by a key in the database, have the same value at any give time whatever the node being queried.
With many NoSQL databases, the preferred working mode is AP and all-the-time consistency is sacrificed.
Favoring performance, updates take a little time to propagate across the cluster. When an entity’s value has just been created or modified, there is a short time span during which the entity is not consistent. One node being queried at that moment could show the previous value while another node at the same time would already show the new value.
However the cluster guarantees that it will eventually be, when replication has occurred. This is called eventual consistency and this is an essential notion.
While all-the-time consistency is sacrificed, eventual consistency is a must and is guaranteed by most-if-not-all NoSQL Database.
4. NoSQL / NewSQL
NoSQL databases are a new type of databases emerging from the web giants technologies mostly, scaling out natively and renouncing to the usual behaviours and features of RDBMS - Relational Database Management Systems.
A NoSQL - originally referring to "not-SQL" for "non-relational" - database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
Such databases have existed since the late 1960s, but the name "NoSQL" was only coined in the early 21st century, triggered by the needs of Web 2.0 companies.
NoSQL databases are increasingly used in Big Data and Real-Time Web applications.
NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.
(Wikipedia - https://en.wikipedia.org/wiki/NoSQL)
The fundamental idea behind NoSQL is as follows:
- Because of the need to distribute data (Big Data), the Web giants have abandoned the whole idea of ACID transactions (only eventual consistency is possible).
- So if we drop ACID Transactions - which we always deemed to be so fundamental - why wouldn't we challenge all the rest - the relational model and table structure?
There are 4 common types of NoSQL databases:
- Document-oriented , e.g. MongoDB, ElasticSearch
- Column-family (aka BigTable), e.g. Cassandra
- Key/Value pairs, e.g. Redis
- Graph, e.g. Neo4J
Document-oriented databases are really the most wide spread with market leaders such as MongoDB, ElasticSearch, CouchDB, etc.
Column-oriented databases are also wide spread with multiple good open source solutions.
Key/Value pairs databases are really distributed caching products in in the end. Multiple good solutions are available on the market but most of them are proprietary software with sometimes a limited open-source version (unfortunately).
In terms of graph oriented databases, the lead player is Neo4J.
The following schema provides an illustration of the way data is structured and stored in these Databases:
The NoSQL landscape is y very rich ecosystems with hundreds of different products and solution and growing continuously, with nearly every week a new product appearing.
What is NewSQL ?
NewSQL refers to relational databases that have adopted upon some of the NoSQL genes, thus exposing a relational data model and SQL interfaces to distributed, high volume databases.
NewSQL, contrary to NoSQL, enables an application to keep
- The relational view on the data
- The SQL query language
- Response times suited to transactional processing
Some were built from scratch (e.g. VoltDB), others are built on top of a NoSQL data store (e.g. SQLFire, backed by GemFire, a key/value store)
The current trend is for some proven NoSQL databases, like Cassandra, to offer a thin SQL interface, achieving the same purpose
Generally speaking, the frontier between NoSQL and NewSQL is a bit blurry… SQL compliance is often sought for, as the key to integrating legacy SQL software (ETL, reporting) with modern No/NewSQL databases
5. Hadoop and Data Lakes
Around 2006, Google published two papers, "GFS - The Google FileSystem" where they explained how they designed an implemented a distributed filesystem and "Map Reduce" where they presented the distributed programming paradigm they used to process data stored on GFS.
A few years after, google published "Big Table", a paper presenting how they designed and implemented a Column-oriented database on top of HDFS and MapReduce.
Doug Cutting, the leader of the Apache Lucene Project at the time discovered these papers and decided to work on an Open-Source implementation of these concepts.
Hadoop was born.
5.1 What is Hadoop ?
Hadoop is an Open Source Platform providing:
- A distributed, scalable and fault tolerant storage system as a grid
- Initially, a single parallelism paradigm : MapReduce to reuse the storage nodes as processing nodes
- Since Hadoop V2 and YARN, other parallelization paradigms can be implemented on Hadoop
- Schemaless and optimized sequential write once and read many times
- Querying and processing DSL (Hive, Pig)
Hadoop was initially primarily intended for Big Data Analytics. It is the middleware under the Data Lake Architecture pattern and intents to revolution the architecture of analytical information systems / decision support systems.
Nowadays Hadoop can be an infrastructure for much more, such as Micro-services architecture (Hadoop V3) or Real-time Architectures.
Hadoop is declined in different distributions: Fundation Apache, Cloudera, HortonWorks, MapR, IBM, etc.
5.2 Hadoop Overview
Hadoop is designed as a layered software where technologies in every layer can be interchanged at will:
- Distributed storage: Hadoop packages HDFS as the default underlying distributed filesystem. But for instance the MapR Hadoop distribution uses the MAPR filesystem instead.
Parallel Computing Framework / MapReduce Processing Engine: In Hadoop V1, MapReduce was the only parallel computing paradigm available on top of Hadoop, taking care of the processing distribution as well as the resources negotiation on the Hadoop cluster.
With Hadoop 2.0, The MapReduce paradigm has been split from the Resource negotiation which became YARN - Yet Another Resource Negotiator. With this split, it has become possible to use Hadoop with different parallel processing constructs and paradigms, MapReduce becoming just one possibility among others.
Machine Learning / Processing: This is in the end the most essential layer on top of Hadoop core. Hadoop is designed first and foremost for Big Data Analytics. There are numerous solutions that were initially either implemented on top of MapReduce or ported to MapReduce.
Nowadays, with YARN, software doesn't need anymore to be ported to MapReduce to run on Hadoop, it just needs to integrate with YARN.
- Orchestration: Numerous different solution can be used to operate Hadoop and orchestrate processes.
- Querying: A lot of NoSQL / NewSQL databases have been implemented as an Hadoop querying framework, such as HBase or Hive. Some more advanced tools goes beyond querying with as Pig.
- Reporting: User have multiple choices of software specialized in building reports on the data in Hadoop.
- IS Integration: Integrating Hadoop in the Information System, specifically building data import / Export between Hadoop and the operational information system components is a key concern. Numerous different solutions have been developed for this and are packaged with most Hadoop distributions.
- Supervision and Management: Most Hadoop distributions provide their own management tool. Some tools are available as Apache projects.
Hadoop is a very large ecosystems of hundreds of different software in all these different layers.
The most common ones would be as follows:
But then again there are really many more components than that in a typical Hadoop distribution.
Most Hadoop distributions are quite behemoth software stacks that would be very difficult to integrate and configure manually, which is the very reason behind the success of these distributions.
Hadoop core on its own is fairly complex to install, configure and fine tune so whenever one needs Hadoop core only for his specific software (e.g to run spark), it's sometimes a more appropriate approach to search for a lighter cluster management system such as Apache Mesos for instance; more on that later in this article.
5.3 Hadoop Architecture
A simplified view of Hadoop core components deployment architecture would be as follows:
Since Hadoop 2, having two master nodes for high-availability and avoiding single points of failure on the master components is highly advised.
The components from Hadoop core are deployed as follows:
- The HDFS Name node (and secondary name node) is the center piece of the HDFS File System. It acts as the HDFS Master and keeps the directory tree and tracks where on the cluster the file data is kept. The HDFS Data Nodes acts as slave processes, run on individual cluster nodes and take care of data storage.
- The YARN Resource Manager (and secondary resource manager) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers and the per-application ApplicationMaster.
- The MapReduce JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data for co-local processing optimization, or at least are in the same rack. Client applications submit jobs to the Job tracker. MapReduce TaskTrackers run on individual cluster nodes, execute the tasks and report the status of tasks to the JobTracker.
5.4 The DataLake Architecture pattern
A data lake is a system or repository of data stored in its natural/raw format.
- It's is usually a single store of data including raw copies of source system data, sensor data, social data etc. and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
- It can include structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
With the continued growth in scope and scale of analytics applications using Hadoop and other data sources, then the vision of an enterprise data lake can become a reality.
In a practical sense, a data lake is characterized by three key attributes:
Collect everything. A data lake contains all data, both raw sources over extended periods of time as well as any processed data.
A data lake is characterized by a Big Volume of data.
Dive in anywhere. A data lake enables users across multiple business units to refine, explore and enrich data on their terms.
In a Data Lake, one doesn't know a priori the analytical structures.
Flexible access. A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
As a result, a data lake delivers maximum scale and insight with the lowest possible friction and cost.
The fundamental approach of Data Lakes is to pull the required data from the raw data storage, transforming it and processing it dynamically, as required per the use case being executed. It's entirely dynamic, queries and processes are designed on the fly.
The storage principle is to store everything, all the raw data from the operational Information System as well as all the data produced by the IS, log files, usage metrics, etc. (Collect everything pattern).
Hadoop is kind of the Operating System underneath the Data Lake pattern and with Hadoop's power, there is nearly no analytics use case that can't be implemented in a dynamic fashion, requiring at worst a few hours of runtime processing before providing the expected results
This is in complete opposition with the Data Warehouse pattern where the data was pushed in statically predefined transformation pipelines. The most critical drawback of this approach is the impossibility to come up with a new use case in a quick time. Most of the time, when a corporation requires a new KPI to be computed by the analytical system, if the required data was not already collected for another use case, it was impossible to provide quickly, requiring for instance to wait 6 months before providing the KPI on a 6 months period.
Hadoop finally made it possible at a cheap cost to get away fro this push pattern.
The Data Lake architecture pattern and its hadoop engine form a tremendous opportunity to finally get away from the Data Warehouse pattern.
But there are pitfalls of course and many corporations experienced it the hard way.
It has been stated to much everywhere that data can be incorporated "as is" in data lakes that way too many corporations took it too literaly, forgetting about one essential aspect, even in Data Lakes.
A minimum of data cleaning, cleansing and preparation is always required. The most crucial aspect than can nevfer be neglected is the need to alway have proper correlation IDs in every single piece of data that is being ingested in a data lake.
Without correlation IDs, data is unusable. And your Data Lake turns into a Data Swamp.
6. Streaming Architectures
Streaming data refers to data that is continuously generated, usually in high volumes and at high velocity. A streaming data source would typically consist of a stream of logs that record events as they happen - such as a user clicking on a link in a web page, or a sensor reporting the current temperature.
A streaming data architecture is a framework of software components built to ingest and process large volumes of streaming data from multiple sources. While traditional data solutions focused on writing and reading data in batches, a streaming data architecture consumes data immediately as it is generated, persists it to storage, and may include various additional components per use case - such as tools for real-time processing, data manipulation and analytics.
A real-time system is an event-driven system that is available, scalable and stable, able to take decisions (actions) with a latency defined as below the frequency of events.
Streaming Architectures are not strictly related to the web giants and the Big Data revolution and CEP - Complex Events Processing - Engines exists since the early 2000s.
However, streaming architectures evolved significantly with products emerging from the needs of the web giants in Lambda Architecture first and then Kappa Architecture.
6.1 Complex Event Processing
Complex event processing, or CEP, consists of a set of concepts and techniques developed in the early 1990s for processing real-time events and extracting information from event streams as they arrive. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) in real-time situations and respond to them as quickly as possible.
In a Complex Event Processing Architecture :
- Historical data is regularly and consistently updated with live data. Live data is available to the end user
Both types or data (historical and live) are not necessarily presented consistently to the end user.
- Both sets of data can have their own screens or even application
- A consistent view on both sets of data would be proposed by Lambda Architecture (next topic in this presentation)
A few notes on typical CEP deployments, in a raw fashion:
- The rules GUI is often a user friendly editor supporting hot updates of rules and made available to business users.
- The capture middleware should support very high throughput of thousands of events per second, just as the whole processing line and negligible latency.
- The CEP engine needs to support very high throughput as well and usually a maximum latency of a few dozen milliseconds to hundreds milliseconds. Fault tolerance and state coherence are common concerns.
Complex Event Processing engines and architecture are heavily used in the industry in the world of real-time computing systems, such as trading systems, payment monitoring systems, etc.
Such engines form however a quite legacy technology and have limits in terms of analytics. Most if not all CEP engines on the market even nowadays are really some sort of evolved rules-engines.
And that would be the most common limit of CEP engines, the fact that its really only about rules. Machine learning and AI use cases are limited on CEP engines by the difficulty of these systems to derive features requiring correlation with large historical datasets.
The rise of Big Data analytics technologies have opened the door for much more advanced analytics use cases in real-time. Lambda Architecture and Kappa Architectures are much more recent approaches to real-time analytics.
6.2 Lambda Architecture
The Lambda Architecture, first proposed by Nathan Marz, attempts to provide a combination of technologies that together provide the characteristics of a web-scale system that satisfies requirements for availability, maintainability, fault-tolerance and low-latency.
Quoting Wikipedia: "Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation.
The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce."
At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data.
The efficiency of this architecture becomes evident in the form of increased throughput, reduced latency and negligible errors. While we mention data processing we basically use this term to represent high throughput, low latency and aiming for near-real-time applications.
As new data is introduced to the system, it is processed simultaneously by both the batch layer, and the speed layer. The batch layer is an append-only repository containing unprocessed raw data. The batch layer periodically or continuously runs jobs that create views of the batch data-aggregations or representations of the most up-to-date versions. These batch views are sent to the serving layer, where they are available for analytic queries.
At the same time that data is being appended to the batch layer, it is simultaneously streaming into the speed layer. The speed layer is designed to allow queries to reflect the most up-to-date information-necessary because the serving layer's views can only be created by relatively long-running batch jobs. The speed layer computes only the data needed to bring the serving layer's views to real time-for instance, calculating totals for the past few minutes that are missing in the serving layer's view.
By merging data from the speed and serving layers, low latency queries can include data that is based on computationally expensive batch processing, and yet include real-time data. In the Lambda Architecture, the raw source data is always available, so redefinition and re-computation of the batch and speed views can be performed on demand. The batch layer provides a big data repository for machine learning and advanced analytics, while the speed and serving layers provide a platform for real-time analytics.
The Lambda Architecture provides a useful pattern for combining multiple big data technologies to achieve multiple enterprise objectives.
There are numerous solutions nowadays to build every layer of a Lambda Architecture:
The takeaway here is that we have gone a long way since Complex Event Processing architectures and we have now numerous solutions to build new generations of streaming architectures able to extend real-time streaming to much more advanced analytics use cases, embracing Real-time Artificial Intelligence use cases.
Pros and Cons of Lambda Architecture.
- Batch layer of Lambda architecture manages historical data with the fault tolerant distributed storage which ensures low possibility of errors even if the system crashes.
- It is a good balance of speed and reliability.
- Fault tolerant and scalable architecture for data processing.
- It can result in coding overhead due to the need to implement the same analytics logic twice: one time in the speed layer and one time all over again in the batch layer.
- Re-processes every batch cycle which is not beneficial in certain scenarios.
- A data modeled with Lambda architecture is difficult to migrate or reorganize.
6.3 Kappa Architecture
In 2014 Jay Kreps started a discussion where he pointed out some discrepancies of Lambda architecture that further led the big data world to another alternate architecture that used less code resource and was capable of performing well in certain enterprise scenarios where using multi layered Lambda architecture seemed like extravagance.
Kappa Architecture cannot be taken as a substitute of Lambda architecture on the contrary it should be seen as an alternative to be used in those circumstances where active performance of batch layer is not necessary for meeting the standard quality of service.
Kappa architecture is a streaming-first architecture deployment pattern. With most recent Stream Processing technologies (Kafka Streams, Flink, etc.) the interest and relevance of the batch layer tend to diminish. The streaming layer matches computation abilities of the batch layer (ML, statistics, etc.) and stored data as it processes it.
A batch layer would only be needed to kick-start the system on historical data but then Apache Flink can very well do that
Kappa architecture can be deployed for those data processing enterprise models where:
- Multiple data events or queries are logged in a queue to be catered against a distributed file system storage or history.
- The order of the events and queries is not predetermined. Stream processing platforms can interact with database at any time.
- It is resilient and highly available as handling Terabytes of storage is required for each node of the system to support replication.
Pros and Cons of Kappa architecture
- Kappa architecture can be used to develop data systems that are online learners and therefore don’t need the batch layer.
- Re-processing is required only when the code changes.
- It can be deployed with fixed memory.
- It can be used for horizontally scalable systems.
- Fewer resources are required as the machine learning is being done on the real time basis.
- Absence of batch layer might result in errors during data processing or while updating the database that requires having an exception manager to reprocess the data or reconciliation.
7. Big Data 2.0
When google published their papers in the early 2000s, it was quite a tsunami in the Computer Engineering World. Doug Cutting and the guys behind Hadoop started working on Hadoop but a lot of other initiatives kicked off as well.
With their approach - scaling information systems on commodity hardware, it turned out that massive computational systems suddenly became affordable and it gave a whole new level of interest in distributed systems and distributed computing.
There are now an entire range of engines that transcend the Hadoop framework and are dedicated to specific verticals (e.g. structured data, graph data, streaming data, etc.)
Nowadays, The NoSQL ecosystem provides incredibly efficient alternatives to HDFS storage in the storage layer. In the processing layer, there is a plethora of solutions available from Kafka Streams to Apache Flink through Spark, etc.
On the resource negotiation side as well, multiple initiatives provide lightweight and interesting alternatives to Hadoop's YARN.
7.1 Alternatives to Hadoop
A specific project kicked off by the University of California retained quite a bit of attention at the time, the Nexus project renamed later Mesos and given to the Apache Software Foundation.
Apache Mesos intended to be kind of the Operating System of a computer cluster, somehow in the same way the Linux Kernel for instance is operating a single machine. Mesos intended to provide the same kind of primitives for resources management at the scale of a whole cluster.
Pretty early in the Mesos development story, support of docker containers has been added to enable users to deploy and scale applications in the form of docker containers.
A few years Later, some folks inspired from Google Borg and created in their turn a cloud container orchestration system for automating computer application deployment, scaling, and management. They named it Kubernetes.
With Mesos and Kubernetes gaining a lot of traction since scaling applications in the form of docker containers is extremely convenient, the folks at Hadoop added support to deploying applications in the form of docker containers as well in YARN in Hadoop 3.0.
Nowadays in 2021, with Hadoop 3, these 3 technologies converge tend to converge to the same possibilities. Hadoop 3 supports deploying jobs as docker containers just as Mesos and Kubernetes.
Mesos and Kubernetes can use alternatives to HDFS such as Ceph, GlusterFS, Minio, (of course Amazon, Azure, ...) etc.
So while Kubernetes was really oriented to scale application in the Operational Information System space initially, it tends now to overflow to analytics use case as well.
And the other way around, while Hadoop is still first and foremost oriented to deploy applications in the Analytical Information System space, Hadoop 3 tends to be deployed increasingly in the operational space as well.
Apache Mesos can well be used on both sides and was forming an interesting alternatives to Hadoop YARN in both worlds for quite some time. Today, Apache Mesos, even though from my perspective an amazing software, is not heavily maintained anymore and support for Mesos tends to vanish in latest versions of software stacks.
Kubernetes (and/or technologies based on Kubernetes) is today a market standard for the Operational IS just as Hadoop remains a market standard for the Analytical IS.
Kubernetes is an Open Source Platform providing:
- Automated software applications deployment, scaling, failover and management across cluster of nodes.
- Management of application runtime components as Docker containers and application units as Pods.
- Multiple common services required for service location, distributed volume management, etc. (pretty much everything one requires to deploy application on a Big Data cluster).
Kubernetes is originally largely inspired and even based on Google Borg, (one of) Google’s initial cluster management system(s). It has been released as Open-Source component in Google in 2014 and the first official release was in 2015.
Kubernetes is emerging as a standard as a Cloud Operating System.
In comes in the flavour of many distributions. The mains ones are:
- PKS (Pivotal Container Service)/li>
- Red-Hat OpenShift/li>
- Canonical Kubernetes/li>
- Google / AWS / Azure
Kubernetes deployment architecture would be as follows:
With the ever-growing popularity of containerized cloud-native applications, Kubernetes has become the leading orchestration platform to manage any containerized application.
Again, nowadays Kubernetes is emerging as a market standard to scale the Operational Information System, while Hadoop largely remains a market standard to scale the Analytical Information System.
Microservice architecture - a variant of the Service-Oriented Architecture (SOA) structural style - arranges an application as a collection of loosely-coupled services. In a microservices architecture, services are fine-grained and the protocols are lightweight. Its characteristics are as follows:
- Services in a microservices architecture (MSA) are small in size, messaging-enabled, bounded by contexts, autonomously developed, independently deployable, decentralized and built and released with automated processes.
- Services are often processes that communicate over a network to fulfill a goal using technology-agnostic protocols such as HTTP.
- Services are organized around business capabilities.
- Services can be implemented using different programming languages, databases, hardware and software environment, depending on what fits best (Note JKE : this is not a strict requirement, e.g. Spring boot)
From Martin Fowler:
A Microservices-based architecture has the following properties:
- Independent services lifecycles leads to a continuous delivery software development process. A change to a small part of the application only requires rebuilding and redeploying only one or a small number of services.
- Adheres to principles such as fine-grained interfaces to independently deployable services, business-driven development (e.g. domain-driven design).
As early as 2005, Peter Rodgers introduced the term "Micro-Web-Services" during a presentation at the Web Services Edge conference. The architectural style name was really adopted in 2012.
Kubernetes democratized the architectural approach. The two big players in this field are Spring Cloud and Kubernetes
A typical micro-services infrastructure architecture would be as follows:
8.1. Micro-services discussion
Ask yourself : do you need microservices ?
- Microservices are NOT Big Data !. In Big Data Analytics, one needs to scale the processing linearly with the storage. Hadoop and for instance Spark with Mesos on ElasticSearch are designed for that very key aspect to be respected: co-local processing optimization. Micro-services are not designed for this. The scaling approach in micro-services is at the component / service level. Heavy resources consuming services are scaled widely while light services run typically on a few nodes mostly for high-availability concerns.
- You don’t need microservices or Kubernetes to benefit from Docker. Docker is a tremendous way to package and deploy applications as a whole or individual application components. Unless you need horizontal scalability and high-availability, you might not need Kubernetes or a micro-services infrastructure.
- You’re not scaling anything with synchronous calls. This is essential. A fundamental element in the design of a micro-services architecture resides in the usage of asynchronous calls as the communication paradigm. Think of it. If services call each others using synchronous calls, then scaling them is useless since they will all synchronize with the slowest of them.
As a consequence, don’t do microservices unless:
- You need independent service-level scalability (vs. storage / processing scalability - Big Data).
- You need a strong SOA - Service-Oriented Architecture.
- You need independent services lifecycle management.
There are various challenges to be accounted when implementing micro-services:
- Distributed caching vs reloading the world all over again. If every service is a fully independent application, then all the reference and master data need to be reloaded all over again by all services. This needs to be accounted and distributed caching needs to be considered.
- Not all applications are fit for asynchronous communications. Some applications require fundamentally synchronous calls.
Identifying the proper granularity for services.
- Enterprise architecture view is too big
- Application architecture view is too fine
- Data consistency without distributed transactions. Applications need to be designed with this in mind.
Weighting the overall memory and performance waste.
- A Spring boot stack + JVM + Linux Docker base for every single service ?
- HTTP calls in between layers ?
We went a long way in this article, from the web giants and their needs to scale their information systems horizontally, the reasons behind it and the challenges this implies, down to Micro-services and the scaling of individual Information Systems components.
The Web giants needs were really related initially to their massive amount of data to be manipulated and the need to scale the processing linearly with the storage distribution. Nowadays, cloud computing and SaaS - Software As A Service on the cloud form somehow a different needs.
Initial Big Data technologies were really oriented towards Data Analytics use cases and the Analytical Information System space. Later technologies, namely NoSQL / NewSQL and now Kubernetes and micro-services are much more oriented towards scaling or deploying on the cloud Operational Information System components
The Strong frontier between Operational IS and Analytical IS will tend to vanish in the future.
Increasingly, in Hadoop 3 with YARN able to manage and deploy docker containers, Hadoop is not so strictly limited to the Analytical IS.
On the other side, Kubernetes make it increasingly feasible to scale heavy data analytics applications as well.
Even today NoSQL, Streaming, Lambda and Kappa architectures are increasingly overflowing to the Operational IS and as such provide a common ground for operational processes and analytical processes.