niceideas.ch
Technological Thoughts by Jerome Kehrli
https://www.niceideas.ch/roller2/badtrash/feed/entries/atom
2024-03-13T10:05:04-04:00
Apache Roller
https://www.niceideas.ch/roller2/badtrash/entry/artificial-intelligence-and-fraud-prevention
Artificial Intelligence and fraud prevention with Netguardians' CTO, Jérôme Kehrli
Jerome Kehrli
2022-01-18T06:11:45-05:00
2022-01-18T06:11:45-05:00
<p>
I spoke to <a href="https://www.linkedin.com/in/rudolf-falat/">Rudolf Falat</a>, founder of the <a href="https://www.voiceoffintech.com/">Voice of FinTech</a> podcast, about leveraging AI in fraud prevention and cybersecurity.
</p>
<p>
The podcast is <a href="https://www.voiceoffintech.com/episodes/netguardians">available here</a> and can be listened to directly below:
</p>
<iframe width="100%" height="180" frameborder="no" scrolling="no" seamless="" src="https://share.transistor.fm/e/41d2933f/dark"></iframe>
<p>
Happy listening!
</p>
<p>
<b>The full transcript is available below.</b>
</p>
<p>
<b><i>Jerome, can you introduce yourself? How did you get to do what you do today?</i></b>
</p>
<p>
I'm Swiss, 43 years old, and the proud father of 3 boys.
<br>
I guess I'm first and foremost a passionate software engineer and computer scientist. I remember putting my hands on my first computer - a Commodore 128 - when I was 12 years old, and knowing very well at that very moment that this would be my career.
<br>
I've been passionate about technology, artificial intelligence, programming, etc. for nearly 30 years now. I've spent my whole career in financial institutions and fintechs, and I wouldn't see myself working in another business nowadays. Financial institutions and financial markets form a very interesting domain of application for computer science, due to the complexity of these systems and the wide range of concerns to be addressed, from real-time computing to highly mathematical applications.
<br>
I guess that my role today as CTO of NetGuardians is kind of a natural evolution in my career.
</p>
<p>
NetGuardians is a Swiss software vendor developing a Big Data analytics platform that we package and deploy in financial institutions, today mostly to detect fraudulent activities and prevent fraudulent transactions.
</p>
<p>
<b><i>Why have you decided to join a start-up (or a scale-up) like NetGuardians?</i></b>
</p>
<p>
Before the NetGuardians co-founders reached out to me, I was a consultant for a few years, working mostly for major European banking institutions.
<br>
I really liked the job at the time, mostly for the possibility to jump quickly from one topic to another, one customer to another.
</p>
<p>
But I did miss the product culture very much. As a consultant I was guiding other teams or leaders in adopting technologies, designing information systems, driving innovation projects, etc. But I was missing the deep involvement and engagement that you get when you create a product from A to Z and sell it. Developing a software product is the closest you can get in a day job to having an actual child ;-)
</p>
<p>
So when NetGuardians pivoted to fighting banking fraud 7 years ago, the co-founders were looking for someone to lead the product research and development department - someone with a strong technical background and extensive experience in finance.
<br>
Switzerland is a small country, so they were directed to my profile through a mutual connection on their advisory board. We met, they told me their story and shared their vision with me, and I decided that I wanted to be part of it.
<br>
And today, 7 years after this first encounter, I guess this company and this product are just as much my children as they are theirs.
</p>
<p>
<b><i>When will we finally have truly intelligent AI working on fraud prevention in banking?</i></b>
</p>
<p>
Now that's a good question.
<br>
It's hard to answer, since "intelligent AI" would need to be defined more precisely. So first I want to distinguish strong AI from weak AI, and then share my perspective on what a truly intelligent AI would be.
</p>
<p>
If we define a strong artificial intelligence as a software program able to contextualize, to show sensitivity, to show creativity or to exceed its programming scope, then we don't have today the slightest trace of proof that we'd ever be able to create such a program. This is downright science fiction. There is nothing in the real world anywhere close to even the beginning of it.
<br>
The thing is that Artificial Intelligence generates a lot of fantasy in the public's mind, and I guess that the fact that we have given some of these algorithms names such as "neural network" is not helping in this regard. If we had given neural networks the technical names they deserve, such as "largely convoluted and iterative statistical matrix model", I'm sure they wouldn't generate the same level of fantasy in people's minds.
<br>
Anyways.
</p>
<p>
Then if we define a weak AI as a software program able to optimize a mathematical function, solve a classification problem, or take a decision based on input data, then the progress today is tremendous, and new applications and solutions pop up nearly every week.
<br>
This technology evolves at a very fast pace, and today's AI programs are collections of sometimes hundreds of different algorithms working together to solve an analytical problem - such as driving a car autonomously, for instance - which is amazing.
</p>
<p>
Now when it comes to true intelligence, I strongly believe that the only true, actual intelligence is in the minds of the people developing these systems - not the machine, never the machine.
<br>
And then again, the progress today is tremendous, essentially along 2 dimensions: the complexity of the individual machine learning algorithms, and the number of these algorithms deployed together and working in conjunction in artificial intelligence systems.
</p>
<p>
And what we do at NetGuardians is a good illustration of all this evolution.
<br>
When we started in 2016, we were using one or two different methods to infer good features on the events we were monitoring - mostly e-banking activities and financial transactions - as well as a single supervised learning algorithm. Today we use a combination of several dozen different unsupervised and supervised techniques all working together, each one of them focusing on a specific perspective - such as the timing of events, their frequency, their location, the destination of the funds, etc. - or on a specific step in the risk scoring process.
</p>
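<p>
To make this more concrete, here is a minimal, hypothetical sketch of how scores computed from independent perspectives - timing, destination, amount - might be combined into a single transaction risk score. This is not NetGuardians' actual code; all names, thresholds and weights are illustrative assumptions.
</p>
<pre>
# Hypothetical sketch: combining per-perspective scores into one risk score.
# NOT NetGuardians' actual implementation; names and weights are illustrative.

from dataclasses import dataclass

@dataclass
class Transaction:
    hour: int             # hour of day the transaction was issued
    amount: float         # transaction amount
    country: str          # destination country
    usual_hours: range    # hours this customer usually transacts in
    usual_countries: set  # countries this customer usually sends funds to

def timing_score(tx: Transaction) -> float:
    """1.0 if the transaction happens outside the customer's usual hours."""
    return 0.0 if tx.hour in tx.usual_hours else 1.0

def destination_score(tx: Transaction) -> float:
    """1.0 if the funds go to a country never seen for this customer."""
    return 0.0 if tx.country in tx.usual_countries else 1.0

def amount_score(tx: Transaction, usual_max: float) -> float:
    """Grows towards 1.0 as the amount exceeds the customer's usual maximum."""
    return min(1.0, max(0.0, (tx.amount - usual_max) / usual_max))

def risk_score(tx: Transaction, usual_max: float) -> float:
    # Each perspective is scored independently, then combined.
    # A weighted sum is the simplest possible combination; real systems
    # feed such scores into further supervised models instead.
    weights = {"timing": 0.3, "destination": 0.4, "amount": 0.3}
    return (weights["timing"] * timing_score(tx)
            + weights["destination"] * destination_score(tx)
            + weights["amount"] * amount_score(tx, usual_max))

tx = Transaction(hour=3, amount=9000.0, country="XX",
                 usual_hours=range(8, 19), usual_countries={"CH", "DE"})
print(round(risk_score(tx, usual_max=2000.0), 2))  # -> 1.0: all perspectives alarmed
</pre>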
<p>
So yeah, again the true intelligence is in the mind of the guys developing these systems, not in the software.
</p>
<p>
<b><i>How good is anti-fraud AI today? What kind of AI are we talking about?</i></b>
</p>
<p>
Anti-fraud systems today form a very peculiar and fascinating domain of application for artificial intelligence. The nature of the problem to be solved makes it very specific.
<br>
Think of it: while some payment channels, such as credit cards, experience a plethora of frauds, other channels, such as digital banking payments, typically see only a few frauds for a million transactions a day.
</p>
<p>
The most sophisticated classification machine learning algorithms we have today perform very poorly on such datasets. They work well when the data is well balanced between the positive and negative populations.
<br>
As an example, every engineer knows today how to train a neural network to recognize pictures of cats, by feeding it thousands of pictures of cats and thousands of pictures of other animals and objects. Now if you try to train a neural network to recognize cats with only 6 pictures of cats and millions of random pictures of other animals and objects, the next picture of a cat you present to the network will be classified as anything - an elephant, say. There's no way an algorithm trained this way learns how to recognize cats.
</p>
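<p>
A back-of-the-envelope illustration of why this matters - and why raw accuracy is a meaningless metric here. This toy sketch assumes 6 frauds in a million transactions, matching the orders of magnitude mentioned above:
</p>
<pre>
# Toy illustration of the class-imbalance problem described above.
# With 6 frauds in 1,000,000 transactions, a "classifier" that always
# answers "legitimate" is 99.9994% accurate -- and catches zero fraud.

n_transactions = 1_000_000
n_frauds = 6

# The lazy classifier: predict "legitimate" for everything.
correct = n_transactions - n_frauds
accuracy = correct / n_transactions
recall = 0 / n_frauds  # not a single fraud detected

print(f"accuracy: {accuracy:.4%}")    # accuracy: 99.9994%
print(f"fraud recall: {recall:.0%}")  # fraud recall: 0%
</pre>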
<p>
And we're in the same situation: the very unbalanced nature of the data we're playing with makes all simple approaches irrelevant. So we have to do fairly complex stuff.
</p>
<p>
Our state-of-the-art approach today at NetGuardians is a combination of multiple fairly evolved techniques and approaches working together.
<br>
I don't want to go too much into technical detail, but I would mention three categories of techniques we're using.
</p>
<p>
First, unsupervised learning techniques for anomaly detection, with a wide range of different algorithms, from simple statistical or Poisson scoring down to clustering and peer-group analysis (a minimal sketch of such scoring follows this paragraph). At the end of the day, fraudulent activities and transactions are always part of the set of anomalies.
<br>
Then, supervised learning techniques, with a lot of different models being required, from classification algorithms to risk scoring techniques, to distinguish between legitimate anomalies and likely fraudulent transactions.
<br>
Last but not least, active learning and other supervision techniques to monitor the feedback we get from the banking business users reviewing the hits - the activities or transactions being blocked in real-time by the system, etc.
<br>
And that is absolutely key, because at the end of the day our algorithms learn a lot from the feedback of these business people, and they can only ever be as good as that feedback. So supervising its quality is an essential concern.
</p>
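<p>
As an illustration of the first category, here is a minimal sketch of the kind of simple Poisson scoring mentioned above: scoring how surprising today's transaction count is for a customer, given their historical daily rate. This is a generic textbook construction, not NetGuardians' actual algorithm.
</p>
<pre>
# Minimal sketch of Poisson-based anomaly scoring (a generic textbook
# construction, not NetGuardians' actual algorithm).

import math

def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam): probability of seeing k or more events."""
    # P(X >= k) = 1 - sum_{i=0..k-1} e^-lam * lam^i / i!
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

def anomaly_score(todays_count: int, history: list) -> float:
    """Higher score = more anomalous; 0 means perfectly ordinary."""
    lam = sum(history) / len(history)    # the customer's average daily count
    p = poisson_tail(todays_count, lam)  # how surprising is today?
    return -math.log(max(p, 1e-12))      # turn a small p into a large score

history = [2, 3, 1, 2, 4, 2, 3]  # a customer's usual daily transaction counts
print(round(anomaly_score(3, history), 2))   # ordinary day -> low score (~0.8)
print(round(anomaly_score(25, history), 2))  # sudden burst -> high score (~27.6)
</pre>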
<p>
So yeah, again, our approach at NetGuardians today is a combination of dozens of such techniques and algorithms deployed together to detect and block suspicious activities and transactions in real-time. And it works pretty well!
</p>
<p>
Another thing to consider: every transaction we block is investigated by a business expert within the bank, who takes the final decision.
<br>
In a sense, we're not replacing the human decision process, we're enhancing it: we give bankers a chance to review potentially or likely fraudulent activities before the funds leave the bank. This is called Augmented Intelligence.
</p>
<p>
<b><i>Who are your target customers?</i></b>
</p>
<p>
At NetGuardians, we work only with financial institutions. Our typical customers are Tier 1 and Tier 2 banks - big to medium-sized banks - where we detect fraudulent activities in a holistic fashion: fraudulent transactions and activities on digital channels just as much as internal fraud or scams.
<br>
In terms of types of financial institutions, we work just as much with massive retail banking institutions in Asia as with private banking institutions in Switzerland.
<br>
Our key markets are Europe - our home market - Africa and Asia-Pacific.
<br>
We support on-premise deployments for Tier 1 banks with a strong will to keep everything in house, and onboard smaller institutions on one of our SaaS - Software as a Service - platforms in the cloud.
</p>
<p>
<b><i>How do you make money?</i></b>
</p>
<p>
Our customers pay an annual recurring licensing fee calculated from two metrics: their assets under management and their volume of financial transactions.
<br>
We bill delivery and integration costs when we integrate the solution ourselves, but we intend to step away from this activity as much as possible and rely increasingly on local partners for integration.
<br>
We are not very interested in selling services, and would want to focus in the future on selling licenses only; but that would require us to reach a critical mass, and we're not there yet. It remains an ultimate objective for us: moving away from being both a product and services company today, and turning into a product-only company.
</p>
<p>
<b><i>Where are you based?</i></b>
</p>
<p>
We're a company founded in Switzerland, and we are still headquartered today in Yverdon-les-Bains, a small town north of Lausanne. We have offices in Nairobi, where we manage our operations in Africa, and in Singapore, where we handle our Asian activities. We also have a commercial office in London and a near-shore development center in Warsaw.
</p>
<p>
<b><i>Where are you on your journey in terms of product development, geographic reach, funding, hiring? Any numbers you can share?</i></b>
</p>
<p>
We have today a very solid technology and product for the banking fraud detection and prevention problem. And in the short term, we intend to leverage this technology to extend our product to other financial crime use cases.
</p>
<p>
There are many different concerns in financial crime fighting in banking institutions. Fraud detection is an essential one of course, but then there's also AML - Anti-Money-Laundering - transaction monitoring, KYC - Know Your Customer - and of course customer and transaction screening.
<br>
KYC and screening require very different technologies from the ones we've built, so they're not in our short-term focus. But AML transaction monitoring is very close to what we do on fraud; just the perspective of the analytics is somewhat different.
<br>
Finding fraud is a lot about understanding where the money goes, while AML is a lot about understanding where the money comes from; but from a technical standpoint it's really similar.
</p>
<p>
So long story short, we intend by the end of next year to extend our solution to state-of-the-art AML transaction monitoring, leveraging our technology. Eventually, over the next years, we intend to build a complete financial crime package by integrating third-party solutions for KYC and screening.
</p>
<p>
In terms of geographic reach, we are today strong in Europe - our home market - and Africa. But we're really only starting to build Asia. This is where we are investing our efforts today and in the coming year: building a strong sales team, identifying and leveraging the right partners, scaling the delivery team, and eventually, hopefully, becoming a major player in Asia as well.
<br>
Interestingly, we have no intent to actively address the US market today, aside from a few opportunistic leads through some of our partners.
</p>
<p>
To give you a few figures: we're today a 100-FTE company, and we have a little under 80 financial institutions as customers.
</p>
<p>
Now regarding funding, I can't tell you much, actually. We have raised roughly 30 million USD so far, and we are in the process of defining and building the next investment round. Building the proper structure in APAC to emerge as a major player there is not something within our reach today; we need support from investors to build it, and that's what we're working on.
</p>
<p>
<b><i>What are the next steps for you next year and beyond? Customers, incumbents as partners, investors?</i></b>
</p>
<p>
We intend to develop significantly in our three key markets: Europe, Africa and Asia. We are in the process of finalizing the recruitment of the key people - regional sales directors, etc. - who will be instrumental in driving our growth in these regions.
<br>
And as I said before, we need support from investors to build the proper structure in APAC, based in Singapore.
</p>
<p>
In terms of partnerships, we have today very good partners among core banking systems and banking package providers, where our strategy is to bundle our fraud detection engine with their core banking package offering.
<br>
We are now in the process of looking for integration partners in the different regions to support our scaling and incrementally disengage our own people from delivery.
</p>
<p>
In terms of investment, we would also expect the next round to support the extension of our offering to AML and, more generally, financial crime fighting, as well as to complete our transition to the cloud as the lead deployment channel. We still have quite a path ahead of us to provide Tier 1 banking institutions with a state-of-the-art hybrid cloud approach.
<br>
A lot of Tier 1 banking institutions would sign up for a cloud deployment of NetGuardians if and only if we can provide them with the means to guarantee that confidential data remains within the bank's information system boundaries. The technology for that is called hybrid cloud, and it would be quite an evolution from what we do today.
</p>
<p>
<b><i>Where can interested parties reach you?</i></b>
</p>
<p>
Well, I guess the best way to get in touch with us is through our website, <a href="https://netguardians.ch/">www.netguardians.ch</a>.
<br>
And the best way to contact me is on LinkedIn: <a href="https://www.linkedin.com/in/jeromekehrli/">Jérôme Kehrli</a>.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/modern-information-system-architectures
Modern Information System Architectures
Jerome Kehrli
2021-12-13T06:04:45-05:00
2021-12-13T06:18:06-05:00
<p>
For forty years we have been building <i>Information Systems</i> in corporations in the same way, with the same architecture, with very few innovations or paradigm changes:
</p>
<ul>
<li>On one side, the <b>Operational Information System</b>, which sustains day-to-day operations and business activities. There, the <i>3-tier</i> architecture and the relational database model (RDBMS - Relational Database Management System / SQL) have ruled for nearly 40 years.
</li>
<li>On the other side, the <b>Decision Support Information System</b> - or <i>Business Intelligence</i>, or <b>Analytical Information System</b> - where the <i>Data Warehouse</i> architecture pattern has ruled for 30 years.
</li>
</ul>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b">
<img class="centered" style="width: 800px;" alt="legacy Information System Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b" />
</a>
<div class="centered">
<i>Legacy Information Systems architecture for 40 years</i>
</div>
</div>
<br>
<p>
Of course the technologies involved in building these systems have evolved over all these decades: in the 80s, COBOL on IBM hosts ruled the Information Systems world, whereas Java quickly emerged as a standard in the 2000s, etc.
<br>
But while the technologies used in building these information systems evolved fast, their architecture, on the other hand - the way we design and build them - didn't change at all. The relational model ruled for 40 years, along with the 3-tier model, in the operational world; and in the analytical world, the Data Warehouse pattern was the only way to go for decades.
</p>
<p>
The relational model is interesting and has been helpful for many decades. Its fundamental objective is to optimize storage space by ensuring an entity is stored only once (3rd normal form / normalization). It comes from a time when storage was very expensive.
<br>
But by imposing normalization and ACID transactions, it prevents horizontal scalability by design. An Oracle database, for instance, is designed to run on a single machine; it simply can't implement relational references and ACID transactions on a cluster of nodes.
<br>
Today storage is anything but expensive, yet Information Systems still have to deal with RDBMS limitations, mostly because... that's the only way we knew.
</p>
<p>
On the Decision Support Information System (BI / analytical system) side, the situation is even worse. In Data Warehouses, data is <i>pushed</i> along the way and transformed one step at a time: first into a staging database, then into the Data Warehouse database, and finally into Data Marts, highly specialized towards specific use cases.
<br>
For a long time we didn't have much of a choice, since implementing such analytics in a <i>pull</i> way (the data lake pattern) was impossible; we simply didn't have the proper technology. The only way to support high volumes of data was to <i>push</i> daily increments through these complex transformation steps every night, when the workload on the system is lower.
<br>
The problem with this <i>push</i> approach is that it's utterly inflexible. One can't change one's mind along the way and quickly come up with a new type of data: working with daily increments means waiting 6 months to get a 6-month history. Not to mention that the whole process is amazingly costly to develop, maintain and operate.
</p>
<p>
So for a long time, RDBMSes and Data Warehouses were all we had.
</p>
<p>
It took the Internet revolution, and the web giants facing the limits of these traditional architectures, for something different to finally be considered. The <b>Big Data revolution</b> has been the cornerstone of all the evolutions in Information System architecture we have been witnessing over the last 15 years.
</p>
<p>
The latest step in this architecture evolution (or revolution) is micro-services, where all the benefits originally fitted to the analytical information system finally end up overflowing to the operational information system.
<br>
Where Big Data was originally a lot about scaling the computing along with the data topology - bringing the code to where the data is (the data tier revolution) - we're today scaling everything, from individual components requiring heavy processing to message queues, etc.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f74719c9-f6f5-4f48-9bf2-ae584c02525f">
<img class="centered" style="width: 800px;" alt="Microservices Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f74719c9-f6f5-4f48-9bf2-ae584c02525f" />
</a>
<div class="centered">
<i>Example of modern IS architecture: Microservices</i>
</div>
</div>
<br>
<p>
In this article, I present and discuss how Information System architectures evolved from the universal 3-tier (operational) / Data Warehouse (analytical) approach to the micro-services architecture, covering Hadoop, NoSQL, Data Lakes, the Lambda architecture, etc., and introducing all the fundamental concepts along the way.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a></li>
<li><a href="#sec2">2. The Web giants and Big Data</a>
<ul>
<li><a href="#sec21">2.1 The Era of Power</a></li>
<li><a href="#sec22">2.2 The Web Giants</a></li>
<li><a href="#sec23">2.3 Data Deluge</a></li>
<li><a href="#sec24">2.4 The Moore Law</a></li>
<li><a href="#sec25">2.5 The Death of the Moore Law</a></li>
<li><a href="#sec26">2.6 Fundamentals of Big Data - the Web giants new paradigms</a></li>
</ul>
</li>
<li><a href="#sec3">3. The CAP Theorem</a>
<ul>
<li><a href="#sec31">3.1 The origins of NoSQL</a>
<ul>
<li><a href="#sec311">3.1.1 Flat files as data store</a></li>
<li><a href="#sec312">3.1.2 RDBMS and the relational model</a></li>
<li><a href="#sec313">3.1.3 criticism of the relational model</a></li>
</ul>
</li>
<li><a href="#sec32">3.2 Horizontal scalability</a>
<ul>
<li><a href="#sec321">3.2.1 Scaling up</a></li>
<li><a href="#sec322">3.2.2 Scaling out</a></li>
</ul>
</li>
<li><a href="#sec33">3.3 Data Distribution</a></li>
<li><a href="#sec34">3.4 Properties of a distributed system</a>
<ul>
<li><a href="#sec341">3.4.1 Eventual consistency</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#sec4">4. NoSQL / NewSQL</a>
<ul>
<li><a href="#sec41">4.1 NoSQL</a></li>
<li><a href="#sec42">4.2 NewSQL</a></li>
</ul>
</li>
<li><a href="#sec5">5. Hadoop and Data Lakes</a>
<ul>
<li><a href="#sec51">5.1 What is Hadoop ?</a></li>
<li><a href="#sec52">5.2 Hadoop Overview</a></li>
<li><a href="#sec53">5.3 Hadoop Architecture</a></li>
<li><a href="#sec54">5.4 The DataLake Architecture pattern</a></li>
</ul>
</li>
<li><a href="#sec6">6. Streaming Architectures</a>
<ul>
<li><a href="#sec61">6.1 Complex Event Processing</a></li>
<li><a href="#sec62">6.2 Lambda Architecture</a></li>
<li><a href="#sec63">6.3 Kappa Architecture</a></li>
</ul>
</li>
<li><a href="#sec7">7. Big Data 2.0</a>
<ul>
<li><a href="#sec71">7.1 Alternatives to Hadoop </a></li>
<li><a href="#sec72">7.2 Kubernetes</a></li>
</ul>
</li>
<li><a href="#sec8">8. Micro-services</a>
<ul>
<li><a href="#sec81">8.1. Micro-services discussion</a></li>
</ul>
</li>
<li><a href="#sec9">9. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction </h2>
<p>
As stated in the summary above, the way we build information systems really didn't evolve for decades. The technologies used underneath have evolved of course - a long way from COBOL to Java and Angular - but the architectures in use - the <i>3-tier model</i> on the operational information system and the <i>data warehouse pattern</i> on the decision support system (a.k.a. the analytics system) - haven't evolved in more than 30 years.
<br>
<i>Software Architecture</i> is defined as the set of <b>principal design decisions</b> about a system; it is kind of a blueprint for the system's construction and evolution. Design decisions encompass the following aspects of the system under development: structure, behaviour, interactions, and non-functional properties (Taylor 2010).
<br>
And then again, the technologies under the hood, from the operating systems to the user interfaces through the programming languages, have evolved drastically. We all remember 3270 green-on-black terminal screens, and can only marvel at the evolution to the fancy HTML5/Bootstrap screens we see today.
<br>
But the design of the Information System components, their interactions and the technical processes in between didn't evolve at all!
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b">
<img class="centered" style="width: 800px;" alt="legacy Information System Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/1d146989-ed7b-4480-bd8c-651f4085375b" />
</a>
<div class="centered">
<i>Information Systems Architecture for 40 years</i>
</div>
</div>
<br>
<p>
I find it amazing to consider that if you put COBOL, 3270 and a few terms like that on this schema, instead of the web elements, you literally get what would have been the high-level architecture schema 40 years ago.
<br>
As stated above, RDBMS - Relational Database Management Systems - have a lot of limits and some benefits, namely the standardized querying language - SQL - and the optimization of storage space. But in today's digital world, the benefits don't stand up to the drawbacks, chiefly the impossibility to scale.
<br>
The Data Warehouse pattern, in use for 30 years on the analytical Information System, is also a poor fit for today's pace of development of digital services. It is much too inflexible, not to mention the cost of developing and maintaining it.
</p>
<p>
It took the web giants facing the limits of these architecture paradigms, and inventing new ways of building information systems, for us to finally see some evolution in the way we build them in corporations. The first evolutions came on the analytics system side with Big Data technologies, and overflowed later to the operational IS side with NoSQL, streaming architectures and eventually micro-services.
</p>
<p>
In this article I want to present these evolutions from a historical perspective. We'll start with the web giants and the Big Data revolution, cover NoSQL and Hadoop, run through the Lambda and Kappa architectures, and end up discussing Kubernetes and micro-services.
</p>
<a name="sec2"></a>
<h2>2. The Web giants and Big Data</h2>
<p>
The web giants have been the first to face the limits of traditional architectures in an unacceptable way. Can you imagine Google running their search engine on an IBM mainframe? Can you imagine what kind of machine that would take, and how much money (licensing fees) they would need to pay IBM every year to run such a host?
<br>
Can you imagine Amazon running their online retail business on an Oracle database, with hundreds of millions of users connected and querying the DB at any time? Can you imagine the price of a computer able to support such volumes of data and concurrent requests?
</p>
<p>
The web giants had to invent both <i>new data storage technologies</i> and <i>new programming paradigms</i> to run their businesses and support their volume of activity.
<br>
But let's start at the beginning.
</p>
<a name="sec21"></a>
<h3>2.1 The Era of Power</h3>
<p>
As a prequel to introducing Big Data, let's have a look at these two machines:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3acf2fe9-36ae-4b3b-a26c-81ba64780676">
<img class="centered" style="width: 800px;" alt="The Era of Power" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3acf2fe9-36ae-4b3b-a26c-81ba64780676" />
</a>
<div class="centered">
<i>The Era of Power</i>
<br>
Source: <a href="https://pages.experts-exchange.com/processing-power-compared ">https://pages.experts-exchange.com/processing-power-compared </a>
</div>
</div>
<br>
<p>
These two computers are separated by only 30 years of technological evolution.
<br>
The computer on the left is a Cray-2. When it came out in 1985, it was a revolution: the fastest machine in the world, the first multi-processor computer from Seymour Cray, and the carrier of many unique technological innovations.
<br>
The computer on the right is a Samsung Galaxy S6 smartphone. It's 30 years younger than the Cray-2.
</p>
<p>
It's only 30 years younger, and around 15 times more powerful than the Cray-2. While the latter was far bigger than a human being, the Samsung S6 fits in the palm of your hand. The Cray-2 had 4 processors; the S6 packs 8 processor cores.
<br>
Considering how far hardware technology progressed over a single generation is mind-blowing.
</p>
<p>
Another comparison is even more impressive: 50 years before the Samsung S6, a computer was used to send people to the moon. The S6 is a million times more powerful, in terms of raw computing power, than that very computer.
<br>
We have today a device so small that it fits in our palm, incredibly powerful, which enables us to be <i>interconnected everywhere, all the time and for every possible need</i>. This is the very definition of <i>digitization</i>.
<br>
Smartphones are an amazing piece of technology, but even more impressive are the apps behind them and the services they enable us to use. Which leads us to the web giants.
</p>
<a name="sec22"></a>
<h3>2.2 The Web Giants</h3>
<p>
The Web giants have been the first to face the limits of traditional architectures and the usual way information systems were built.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ea67caee-b68f-48cc-9db3-8ca352e2f9cd">
<img class="centered" style="width: 700px;" alt="The Web Giants" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ea67caee-b68f-48cc-9db3-8ca352e2f9cd" />
</a>
<div class="centered">
<i>The Web Giants</i>
</div>
</div>
<br>
<p>
And the revolution came from them. They had to find new technical solutions to business-critical challenges such as:
</p>
<ul>
<li><b>Google:</b> how to index the whole web and keep the response time to any request below one second - while keeping search free for the user?</li>
<li><b>Facebook:</b> how to interconnect billions of users, display their feeds in near-real-time, and understand how they use the product to optimize ads?</li>
<li><b>Amazon:</b> how to build a product recommendation engine for tens of millions of customers, on millions of products?</li>
<li><b>eBay:</b> how to search eBay auctions, even with misspellings?</li>
</ul>
<p>
These are just oversimplified examples of course, and the challenges faced by the web giants go far beyond such simple cases.
<br>
These business challenges are backed by technical challenges such as:
</p>
<ul>
<li>How to invert a square matrix that doesn't fit in memory, in a reasonable time?</li>
<li>How to query a database containing trillions of documents in real-time?</li>
<li>How to read billions of files of multiple megabytes each in a reasonable time?</li>
<li>etc.</li>
</ul>
<p>
At the end of the day, it all boiled down to finding ways to manage volumes of data bigger by several orders of magnitude than the volumes of data that IT systems were used to manipulate so far.
</p>
<a name="sec23"></a>
<h3>2.3 Data Deluge</h3>
<p>
So the most striking problem they had to solve was getting prepared and ready for the <b>Data Deluge!</b>
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/5666cd1f-0852-4db7-8d56-910d659d8820">
<img class="centered" style="width: 700px;" alt="The Data Deluge" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/5666cd1f-0852-4db7-8d56-910d659d8820" />
</a>
<div class="centered">
<i>Data Deluge!</i>
</div>
</div>
<br>
<p>
Not only do we generate more and more data, but we have today the means and the technology to analyze, exploit and mine it, and to extract meaningful business insights from it.
<br>
The data generated by a company's own systems can be a very interesting source of information regarding customer behaviours, profiles, trends, desires, etc. But so can external data: Facebook, Twitter logs, etc.
</p>
<a name="sec24"></a>
<h3>2.4 Moore's Law</h3>
<p>
<b>Moore's Law:</b> <i>"The number of transistors and resistors on a chip doubles every 24 months"</i> (Gordon Moore, 1965)
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e5e9bb30-1ca4-47f6-bd06-f3403dd38fd2">
<img class="centered" style="width: 850px;" alt="The Moore Law" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e5e9bb30-1ca4-47f6-bd06-f3403dd38fd2" />
</a>
<div class="centered">
<i>Moore's Law - <b>click to enlarge</b></i>
</div>
</div>
<br>
<p>
For a long time, the increasing volume of data to be handled by any given corporation in its Information System was not an issue at all.
<br>
The volume of data increases, the number of users increases, etc., but processing abilities increase as well, sometimes even more.
<br>
Moore's Law was there to cover our ass: the corporation's CTO just had to buy a new machine to host the Information System every few years.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c568d819-5337-424c-8d3f-f202abe75079">
<img class="centered" style="width: 650px;" alt="IT computing abilities exponential growth" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c568d819-5337-424c-8d3f-f202abe75079" />
</a>
<div class="centered">
<i>For 40 years, IT component capabilities grew exponentially</i>
<br>
Source: <a href="http://radar.oreilly.com/2011/08/building-data-startups.html">http://radar.oreilly.com/2011/08/building-data-startups.html</a>
</div>
</div>
<br>
<p>
This model held for a very long time: costs go down, computing capacities rise, and one simply needs to buy a new machine to absorb the load increase.
<br>
This is especially true in the mainframe world: there wasn't even any need to make the architecture of the systems (COBOL, etc.) evolve for 30 years.
<br>
And even outside the mainframe world, the architecture patterns and styles we use in the operational IS world haven't really evolved for the last 30 years - despite new technologies such as the Web, Web 2.0, Java, etc. of course; I'm speaking only about architecture patterns and styles.
</p>
<a name="sec25"></a>
<h3>2.5 The Death of Moore's Law</h3>
<p>
But everything has an end.
<br>
Let's consider a fifth dimension, too often left aside when considering the evolution of computer technologies and hardware architectures: the throughput of the connection between the data on disk (long-term storage) and the memory (i.e. mostly hard drive controllers, but also buses, etc.).
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d37e83aa-67e1-4a8d-a0a0-d737c2964e4a">
<img class="centered" style="width: 850px;" alt="The death of the moore law" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d37e83aa-67e1-4a8d-a0a0-d737c2964e4a" />
</a>
<div class="centered">
<i>The death of the Moore Law</i>
</div>
</div>
<br>
<p>
Issue: throughput evolution always lags behind capacity evolution.
</p>
<div class="centering">
<div class="centered">
<b>How to read and write ever more data through a pipe that doesn't widen nearly as fast?</b>
</div>
</div>
<br>
<p>
Throughput has become the biggest concern in <b>scaling</b> computer / platform hardware <b>up</b>: it did not progress in efficiency in a way comparable to the four other dimensions.
<br>
We are able to store <b>more and more data</b>, of course, but we are <b>less and less able to manipulate this data</b> efficiently.
<br>
In practice, fetching all of a machine's disk data into RAM to process it is becoming more and more difficult.
</p>
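<p>
An order-of-magnitude illustration of this wall, assuming a hypothetical 4 TB drive read sequentially at a sustained 150 MB/s (both figures are illustrative, not measurements):
</p>
<pre>
# How long does it take just to READ a full disk into memory,
# before any computation even starts? (Illustrative figures only.)

capacity_tb = 4        # hypothetical drive capacity
throughput_mb_s = 150  # hypothetical sustained read throughput

capacity_mb = capacity_tb * 1_000_000
seconds = capacity_mb / throughput_mb_s
print(f"{seconds / 3600:.1f} hours to scan the disk once")  # ~7.4 hours
</pre>
<p>
Capacity grows much faster than the pipe, so a single node needs ever longer just to scan its own data. Distributing the data across many nodes, each scanning its own share in parallel, is the way out - which is exactly where the web giants went.
</p>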
<a name="sec26"></a>
<h3>2.6 Fundamentals of Big Data - the Web giants new paradigms</h3>
<p>
In order to work around the limits of traditional architectures, the web giants invented new architecture paradigms and new ways of building information systems by leveraging three fundamental ideas:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3fbee390-21ba-4be0-816b-5d01d8bae46f">
<img class="centered" style="width: 750px;" alt="Fundamentals of Big Data" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3fbee390-21ba-4be0-816b-5d01d8bae46f" />
</a>
<div class="centered">
<i>Fundamentals of Big Data - the Web giants new paradigms</i>
</div>
</div>
<br>
<p>
In detail:
</p>
<ol>
<li>
<b>Key idea 1: distribution</b> - Since it's impossible to fit the data in the RAM of one single machine, split it and distribute it over as many machines as required.
<br>
Distribution means <i>partitioning</i> the dataset - sometimes also called <i>sharding</i> it - but also always <i>replicating</i> the partitions or shards. We'll see exactly why and how later.
</li>
<li>
<b>Key idea 2: horizontal scalability</b> - Just as we split the data, let's split the computation and distribute it over as many nodes as required to support the workload, even if it means multiple datacenters.
</li>
<li>
<b>Key idea 3: data tier revolution</b> - So we distribute both the data and the processing on a cluster of computers - or nodes - and end up using the data nodes as processing nodes. This is the data tier revolution, in complete opposition to what was usually done in traditional architectures: fetching the required data to the place where the computation occurs.
<br>
But it goes further than that.
<br>
Most of the time we end up distributing different types or categories of data. Every time a specific business process needs to compute something out of a specific piece of all this data, it's crucial to ensure the processing happens on the very nodes where that piece of data is located. This is called <b>co-local processing</b> or <i>data locality optimization</i>.
</li>
</ol>
<p>
In summary, the web giants designed new architectures and programming paradigms where distributing the data and the processing (ideally in a co-local way) on a cluster of nodes is the most fundamental principle; a minimal sketch of these ideas follows.
</p>
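<p>
Here is that sketch, under obvious simplifications: the dataset is partitioned into shards (key idea 1), each shard could live on a different machine (key idea 2), each node computes a partial result on its local shard (key idea 3), and only the tiny partial results travel over the network to be merged:
</p>
<pre>
# Minimal sketch of the three key ideas: partition the data, compute on each
# node's local shard (co-local processing), merge only small partial results.

transactions = [120.0, 40.0, 310.0, 55.0, 990.0, 12.0, 77.0, 600.0]

# Key idea 1: distribution -- split the dataset into shards, one per node.
# (Key idea 2: in a real cluster, each shard lives on a different machine.)
n_nodes = 4
shards = [transactions[i::n_nodes] for i in range(n_nodes)]

# Key idea 3: data tier revolution -- each "node" computes on its own shard.
# Only (count, total) pairs cross the network, never the raw data.
def local_aggregate(shard: list) -> tuple:
    return len(shard), sum(shard)

partials = [local_aggregate(shard) for shard in shards]  # one per node

# Merge the tiny partial results centrally.
total_count = sum(c for c, _ in partials)
total_sum = sum(s for _, s in partials)
print(f"average amount: {total_sum / total_count:.2f}")  # same as a global pass
</pre>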
<a name="sec3"></a>
<h2>3. The CAP Theorem</h2>
<p>
But moving from a mainframe world - where everything is on the same computer and the data to compute always fits in the memory of that computer - to a distributed system world most definitely has benefits, but it also has some consequences. And that's the topic of this chapter.
</p>
<a name="sec31"></a>
<h3>3.1 The origins of NoSQL</h3>
<p>
Let's start with a bit of history.
</p>
<a name="sec311"></a>
<h4>3.1.1 Flat files as data store</h4>
<p>
In the early days of digital data, before 1960, the data within a computer information system was mostly stored in flat files (sometimes indexed) manipulated by higher-level software systems.
<br>
The primitives provided by the operating system were really very low level: basically just the ability to read or write files or file increments.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a8eb454f-c28a-4cb9-a21b-c148ab9b1fac">
<img class="centered" style="width: 350px;" alt="An indexed flat file example" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a8eb454f-c28a-4cb9-a21b-c148ab9b1fac" />
</a>
<div class="centered">
<i>Indexed flat file</i>
</div>
</div>
<br>
<p>
Directly using flat files was cumbersome and painful. Several unmet needs emerged at the time:
</p>
<ul>
<li>Data isolation</li>
<li>Access efficiency</li>
<li>Data integrity</li>
<li>Reducing the time required to develop brand new applications</li>
</ul>
<p>
Addressing such needs while relying on indexed flat files required solutions to be implemented by hand in every application using such files.
<br>
It was highly difficult, inefficient, time consuming, etc., and the wheel had to be re-invented all over again every single time.
<br>
So something else was required.
</p>
<a name="sec312"></a>
<h4>3.1.2 RDBMS and the relational model</h4>
<p>
So in 1969, Edgar F. Codd, a British engineer, invented the relational model. In the relational model, business entities are modeled as <i>tables</i> and <i>associations</i> (relations).
<br>
The relational model is at the root of <b>RDBMS</b> - Relational DataBase Management Systems - which ruled the data storage world for 30 years.
</p>
<p>
The relational model was conceived to reduce redundancy in order to optimize disk space usage. At the time of its creation, disk storage was very expensive and limited, and the volume of data in Information Systems was rather small.
<br>
The relational model avoids redundancy, and thus optimizes disk space usage, by guaranteeing:
</p>
<ul>
<li><b>Structure:</b> using normal design forms and modeling techniques</li>
<li><b>Coherence:</b> using transaction principles and mechanisms</li>
</ul>
<p>
An example relational model, illustrating an exam grade management application, would be as follows:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/288402d5-a103-4e0b-8c67-a211087c93bc">
<img class="centered" style="width: 350px;" alt="a relational model example" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/288402d5-a103-4e0b-8c67-a211087c93bc" />
</a>
<div class="centered">
<i>Relational Model Example</i>
</div>
</div>
<br>
<p>
In this example, if we want to know the subject assigned to a student on their profile screen, we would need to:
</p>
<ol>
<li>Extract the personal data from the "student" table</li>
<li>Fetch its subject id from the "relation" table</li>
<li>Read the subject title from the "subject" table.</li>
</ol>
<p>
<span style="color: red;">
Why, oh why, separate all this information into different tables when, in practice, 99% of the time we want to fetch all of it together?
</span>
</p>
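<p>
To make the point concrete, compare the three lookups above with a document-style representation, where a student, their grades and their subject are stored - and fetched - together. A sketch, with purely hypothetical field names:
</p>
<pre>
# The same exam-grade data, stored document-style: everything the profile
# screen needs is fetched in one read, at the cost of duplicating the
# subject title in every student document. (Field names are hypothetical.)

student_doc = {
    "id": 42,
    "name": "Ada",
    "grades": [5.5, 4.0, 6.0],
    "subject": {"id": 7, "title": "Distributed Systems"},  # embedded, not joined
}

# One lookup instead of three table accesses:
print(student_doc["name"], "->", student_doc["subject"]["title"])
</pre>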
<a name="sec313"></a>
<h4>3.1.3 Criticism of the relational model</h4>
<p>
The relational model comes from a time when storage was expensive. The fundamental idea behind its design is rationalizing storage space by ensuring every piece of information is stored only once.
<br>
But nowadays, long-term storage space is not expensive at all anymore: a terabyte of SSD storage costs no more than a few dozen dollars. Optimizing storage space at all costs makes little sense today.
<br>
In addition, the relational model is not the best way to represent every kind of information. Let's see some examples:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/111e754b-ac70-44fd-a4ea-260d6c76c68c">
<img class="centered" style="width: 800px;" alt="other models" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/111e754b-ac70-44fd-a4ea-260d6c76c68c" />
</a>
<div class="centered">
<i>Other models</i>
</div>
</div>
<br>
<ul>
<li>Tabular information naturally fits the relational model - and more generally, every time we can divide a business problem into well-defined, predefined entities and relations among them, the relational model is usually a good fit.</li>
<li>But then think of other types of information, such as registration forms, product descriptions, etc. Such semi-structured data fits very poorly in the relational model.</li>
<li>Molecular data or graph data would likewise be far better stored in very different types of databases.</li>
</ul>
<p>
The web giants had to get away from the mainframe pattern - and if you challenge that, the very fundamental architecture pattern on which all information systems were built, why wouldn't you challenge all the rest, including the relational model?
<br>
We'll get back to this.
</p>
<a name="sec32"></a>
<h3>3.2 Horizontal scalability</h3>
<p>
The mid and late 2000s were times of major change in the IT landscape. Hardware capabilities increased significantly, and eCommerce and internet trade in general exploded.
<br>
Some internet companies - the "web giants" (Yahoo!, Facebook, Google, Amazon, eBay, Twitter, ...) - pushed traditional databases to their limits. Those databases are by design hard to scale.
<br>
Traditional RDBMS and traditional architectures can only <b>scale up</b>. And scaling up is tricky.
</p>
<a name="sec321"></a>
<h4>3.2.1 Scaling up</h4>
<p>
With RDBMSes, the only way to improve performance is to scale up, i.e. get bigger servers (more CPU, more RAM, more disk, etc.). There's simply nothing else that can be done.
<br>
But one eventually hits a hard limit imposed by the technology of the day.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/5a4c7dd0-9d31-48f5-9960-d1513616ea7d">
<img class="centered" style="width: 800px;" alt="scaling up" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/5a4c7dd0-9d31-48f5-9960-d1513616ea7d" />
</a>
</div>
<br>
<p>
With traditional architectures and RDBMS, all the workload happens on one single machine. And while running a few thousand operations or transactions on one single machine is perhaps possible, going much beyond that just doesn't work: the programming paradigms we use - mostly around thread synchronization and context switches - make it impossible to run, say, a million threads effectively on a single machine.
</p>
<p>
But it gets worse than that.
<br>
Imagine that a machine A with 4 CPUs, 64 GB of RAM and a 1 TB hard drive costs 10,000 USD.
<br>
Do you think that a machine B with twice the power - so 8 CPUs, 128 GB of RAM and a 2 TB hard drive - would cost double, hence 20,000 USD?
<br>
<b>No!</b> It would be much more than that, perhaps four or five times the price, so more than 40k USD.
</p>
<p>
The price of an individual machine doesn't scale linearly with its processing power; it grows much faster than that!
</p>
<a name="sec322"></a>
<h4>3.2.2 Scaling out</h4>
<p>
By rethinking the architecture of databases, the web giants have been able to make them scale at will, by adding more servers to clusters instead of upgrading the servers.
<br>
When scaling out, instead of buying bigger machines, one buys more machines and adds them to a processing cluster, working together on distributed data and processing.
<br>
The servers are not made of expensive, high-end hardware; they are what is called <i>commodity hardware</i> (or commodity servers).
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9f763a29-a1f7-40cf-b6f6-c47e039eb5a0">
<img class="centered" style="width: 800px;" alt="scaling out" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9f763a29-a1f7-40cf-b6f6-c47e039eb5a0" />
</a>
</div>
<br>
<p>
When scaling out, the limits vanish: one can add as many nodes as one wants to a processing cluster.
<br>
And there's icing on the cake: recall the example of machine A above - buying 10 of machine A is not even 10 times the price of a single machine A, since one can get volume discounts.
</p>
<p>
The only drawback is that the application leveraging scaling out - or the information system as a whole - needs to be designed from the ground up for distribution. And there are constraints to this; we'll see them further in this article.
</p>
<p>
Scaling out is also called <b>horizontal scalability</b>, while scaling up is called <b>vertical scalability</b>.
</p>
<a name="sec33"></a>
<h3>3.3 Data Distribution</h3>
<p>
With most NoSQL databases, the data is not stored in one place (i.e. on one server); it is distributed among the nodes of the cluster. When created, an object is split into a number of shards - for instance 4 shards A, B, C, D - and each shard is assigned to a node in the cluster.
<br>
This is called <b>sharding</b> - or <b>partitioning</b>; the portion of data assigned to a node is called a shard - or a partition.
</p>
<p>
Having more cluster nodes implies a higher risk of having some nodes crash, or of a network outage splitting the cluster in two. For this reason, and to avoid data loss, objects are also replicated across the cluster. The number of copies, called replicas, can be tuned; 3 replicas is a common figure.
<br>
Imagine that the specifications of a given computer indicate a 10% chance of some kind of hardware failure in its first year of operation. Now imagine you have 10 such nodes in a cluster: what is the probability that at least one of them experiences a hardware failure? It is 1 - 0.9<sup>10</sup> ≈ 65%, so you'd better plan for it.
</p>
<p>
For this reason, when we start to distribute data on a cluster of multiple machines, we have to design for failures.
<br>
In data management, this means creating multiple copies of every shard in such a way that we maximize the chances of one of them always being available.
<br>
This is called <b>replication</b>.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1d8730dc-6e30-48ad-a7fd-73a8d718f92a">
<img class="centered" style="width: 250px;" alt="data distribution" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/1d8730dc-6e30-48ad-a7fd-73a8d718f92a" />
</a>
</div>
<br>
<p>
We can see here that the object has been split into 4 shards A, B, C, D, and that every shard has three replicas.
</p>
<p>
The objects may move, as nodes crash or new nodes join the cluster, ready to take charge of some of the objects. Such events are usually handled automatically by the cluster; the operation of shuffling objects around to keep a fair repartition of data is called <b>rebalancing</b>.
</p>
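<p>
A minimal sketch of the placement logic described above: hash each shard to a primary node, then place its replicas on the next nodes along the ring. Node names and counts are illustrative; real systems use consistent hashing and rebalance automatically when nodes join or leave.
</p>
<pre>
# Minimal sketch of shard placement with replication, as described above.
# (Illustrative only: real systems use consistent hashing and rebalancing.)

import hashlib

nodes = ["node1", "node2", "node3", "node4", "node5"]
n_replicas = 3

def place(shard_key: str) -> list:
    """Pick a primary node by hashing the key, then the next nodes as replicas."""
    h = int(hashlib.md5(shard_key.encode()).hexdigest(), 16)
    primary = h % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(n_replicas)]

for shard in ["A", "B", "C", "D"]:
    print(shard, "->", place(shard))
# Each of the 4 shards gets 3 replicas on distinct nodes, so losing any
# single node never makes a shard unavailable.
</pre>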
<a name="sec34"></a>
<h3>3.4 Properties of a distributed system</h3>
<p>
In RDBMSes, we expect DB transactions to respect some fundamental properties, identified by <b>ACID</b>: <i>Atomicity, Consistency, Isolation and Durability</i>.
<br>
In distributed systems, we consider things a little differently and look at the following properties:
</p>
<ul>
<li>
<b>Availability</b>
<br>
Availability (or lack thereof) is a property of the database cluster. The cluster is available if a request made by a client is always acknowledged by the system, i.e. it is guaranteed to be taken into account.
<br>
That doesn’t mean that the request is processed immediately. It may be put on hold. But an <i>available</i> system should at a minimum always acknowledge it immediately.
<br>
Practically speaking, availability is usually measured as a percentage. For instance, 99.99% availability means that the system is unavailable at most 0.01% of the time, that is, at most 53 minutes per year - see the short computation sketch after this list.
<br>
</li>
<li>
<b>Partition tolerance</b>
<br>
Partition Tolerance is verified if a system made of several interconnected nodes can stand a partition of the cluster, i.e. if it continues to operate when one or several nodes disappear. This happens when nodes crash or when a piece of network equipment is shut down, taking a whole portion of the cluster away.
<br>
Partition tolerance is related to availability and consistency, but it is still different. It states that the system continues to function internally (e.g. ensuring data distribution and replication), whatever its interactions with a client.
</li>
<li>
<b>Consistency</b>
<br>
When talking about distributed databases, like NoSQL, consistency has a meaning that is somewhat different than in the relational context.
<br>
It refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node being queried.
<br>
With many NoSQL databases, updates take a little time to propagate across the cluster. When an entity’s value has just been created or modified, there is a short time span during which the entity is not consistent. However, the cluster guarantees that it will eventually be, once replication has occurred. This is called eventual consistency.
</li>
</ul>
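<p>
As a side note, converting an availability target into an allowed downtime budget is a simple computation - a minimal sketch:
</p>
<pre>
# Yearly downtime allowed by a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (0.999, 0.9999, 0.99999):
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} availability: about {downtime_min:.0f} min of downtime per year")
# 99.900%: ~526 min  -  99.990%: ~53 min  -  99.999%: ~5 min
</pre>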
<p>
These 3 properties, Consistency, Availability and Partition tolerance, are not independent.
<br>
The CAP theorem - or Brewer’s theorem - states that <b>a distributed system cannot guarantee all 3 properties at the same time</b>.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/7e1bf188-d074-409d-a70e-26fff0d5e80c">
<img class="centered" style="width: 850px;" alt="The CAP theorem" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/7e1bf188-d074-409d-a70e-26fff0d5e80c" />
</a>
</div>
<br>
<p>
This is a theorem: it is formally true. In practice, though, it is less severe than it seems.
<br>
The system or a client can often choose CA, AP or CP according to the context, and "walk" along the chosen edge with appropriate tuning.
<br>
Partition splits do happen, but they are (hopefully) rare events.
</p>
<p>
Traditional relational DBMSes would be seen as CA - consistency is a must.
<br>
Many NoSQL DBMSes are AP - availability is a must. In big clusters, failures happen all the time, so they'd better live with them. Consistency is only eventual.
</p>
<a name="sec341"></a>
<h4>3.4.1 Eventual consistency</h4>
<p>
Consistency refers to the fact that all replicas of an entity, identified by a key in the database, have the same value at any given time, whatever the node being queried.
</p>
<p>
With many NoSQL databases, the preferred working mode is AP and <i>all-the-time consistency</i> is sacrificed.
<br>
Favoring performance, updates take a little time to propagate across the cluster. When an entity’s value has just been created or modified, there is a short time span during which the entity is not consistent. One node being queried at that moment could show the previous value while another node at the same time would already show the new value.
<br>
However, the cluster guarantees that it will eventually be consistent, once replication has occurred. This is called <b>eventual consistency</b> and this is an essential notion.
</p>
<p>
While all-the-time consistency is sacrificed, <b>eventual consistency is a must and is guaranteed by most if not all NoSQL databases</b>.
</p>
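<p>
The following toy simulation - purely illustrative, with in-memory dictionaries standing in for replicas - shows the short inconsistency window followed by convergence:
</p>
<pre>
# Toy illustration of eventual consistency: a write lands on one replica
# and propagates to the others asynchronously, after a delay.
import time, threading

replicas = [{"balance": 100}, {"balance": 100}, {"balance": 100}]

def write(key, value):
    replicas[0][key] = value              # immediate on the contacted replica
    def propagate():
        time.sleep(0.1)                   # replication delay
        for r in replicas[1:]:
            r[key] = value
    threading.Thread(target=propagate).start()

write("balance", 250)
print([r["balance"] for r in replicas])   # e.g. [250, 100, 100] - inconsistent window
time.sleep(0.2)
print([r["balance"] for r in replicas])   # [250, 250, 250] - eventually consistent
</pre>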
<a name="sec4"></a>
<h2>4. NoSQL / NewSQL</h2>
<p>
NoSQL databases are a new type of database, emerging mostly from the web giants' technologies, scaling out natively and renouncing the usual behaviours and features of RDBMSes - Relational Database Management Systems.
</p>
<a name="sec41"></a>
<h3>4.1 NoSQL</h3>
<p>
A NoSQL - originally referring to "<b>not-SQL</b>" for "non-relational" - database provides a mechanism for storage and retrieval of data that is modeled in <b>means other than the tabular relations</b> used in relational databases.
<br>
Such databases have existed since the late 1960s, but the name "NoSQL" was only coined in the early 21st century, triggered by the needs of Web 2.0 companies.
<br>
NoSQL databases are increasingly used in Big Data and Real-Time Web applications.
<br>
NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.
<br>
(Wikipedia - <a href="https://en.wikipedia.org/wiki/NoSQL">https://en.wikipedia.org/wiki/NoSQL</a>)
</p>
<p>
The fundamental idea behind NoSQL is as follows:
</p>
<ul>
<li>Because of the need to distribute data (Big Data), the Web giants have abandoned the whole idea of ACID transactions (only eventual consistency is possible).</li>
<li>So if we drop ACID Transactions - which we always deemed to be so fundamental - why wouldn't we challenge all the rest - the relational model and table structure?</li>
</ul>
<p>
There are 4 common types of NoSQL databases:
</p>
<ul>
<li><b>Document-oriented</b>, e.g. MongoDB, ElasticSearch</li>
<li><b>Column-family</b> (aka BigTable), e.g. Cassandra</li>
<li><b>Key/Value pairs</b>, e.g. Redis</li>
<li><b>Graph</b>, e.g. Neo4J</li>
</ul>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c5f80a99-798d-4be2-969a-67f55d95700f">
<img class="centered" style="width: 850px;" alt="Types of NoSQL database" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c5f80a99-798d-4be2-969a-67f55d95700f" />
</a>
</div>
<br>
<p>
Document-oriented databases are really the most widespread, with market leaders such as MongoDB, ElasticSearch, CouchDB, etc.
<br>
Column-oriented databases are also widespread, with multiple good open-source solutions.
<br>
Key/value pairs databases are really <b>distributed caching</b> products in the end. Multiple good solutions are available on the market, but most of them are proprietary software, sometimes with a limited open-source version (unfortunately).
<br>
In terms of graph-oriented databases, the lead player is Neo4J.
</p>
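<p>
As a small illustration - the formats below are sketches inspired by each family, not any product's actual storage format - the very same customer could be represented as follows in the four models:
</p>
<pre>
# The same customer sketched in the four NoSQL data models (illustrative only).

# Document-oriented (MongoDB / ElasticSearch style): one self-contained document.
document = {"_id": "42", "name": "Alice",
            "accounts": [{"iban": "CH9300000000000000000", "ccy": "CHF"}]}

# Key/value (Redis style): opaque values behind flat keys.
key_value = {"customer:42:name": "Alice",
             "customer:42:account:1": "CH9300000000000000000|CHF"}

# Column-family (Cassandra / BigTable style): rows of column families.
column_family = {"42": {"profile": {"name": "Alice"},
                        "accounts": {"1:iban": "CH9300000000000000000", "1:ccy": "CHF"}}}

# Graph (Neo4J style): nodes and relationships.
graph = {"nodes": [("c42", "Customer"), ("a1", "Account")],
         "edges": [("c42", "OWNS", "a1")]}
</pre>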
<p>
The following schema provides an illustration of the way data is structured and stored in these Databases:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/81727cc6-f37c-4295-b54d-d3adccb4a673">
<img class="centered" style="width: 850px;" alt="NoSQL data storage examples" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/81727cc6-f37c-4295-b54d-d3adccb4a673" />
</a>
</div>
<br>
<p>
The NoSQL landscape is a very rich ecosystem with hundreds of different products and solutions, growing continuously - a new product appears nearly every week.
</p>
<a name="sec42"></a>
<h3>4.2 NewSQL</h3>
<p>
What is NewSQL ?
</p>
<p>
NewSQL refers to relational databases that have adopted some of the NoSQL genes, thus exposing a relational data model and SQL interfaces on top of distributed, high-volume databases.
</p>
<p>
NewSQL, contrary to NoSQL, enables an application to keep
</p>
<ul>
<li>The relational view on the data</li>
<li>The SQL query language</li>
<li>Response times suited to transactional processing</li>
</ul>
<p>
Some were built from scratch (e.g. VoltDB), others are built on top of a NoSQL data store (e.g. SQLFire, backed by GemFire, a key/value store).
</p>
<p>
The current trend is for some proven NoSQL databases, like Cassandra, to offer a thin SQL interface, achieving the same purpose.
<br>
Generally speaking, the frontier between NoSQL and NewSQL is a bit blurry... SQL compliance is often sought after, as the key to integrating legacy SQL software (ETL, reporting) with modern No/NewSQL databases.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d1973df4-a625-4bdd-a98b-e17c7348172e">
<img class="centered" style="width: 600px;" alt="NewSQL examples" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d1973df4-a625-4bdd-a98b-e17c7348172e" />
</a>
</div>
<br>
<a name="sec5"></a>
<h2>5. Hadoop and Data Lakes</h2>
<p>
In the early 2000s, Google published two papers: "GFS - The Google File System", where they explained how they designed and implemented a distributed filesystem, and "MapReduce", where they presented the distributed programming paradigm they used to process the data stored on GFS.
<br>
A few years later, Google published "BigTable", a paper presenting how they designed and implemented a <i>column-oriented database</i> on top of GFS and MapReduce.
</p>
<p>
Doug Cutting, the leader of the Apache Lucene project at the time, discovered these papers and decided to work on an open-source implementation of these concepts.
<br>
Hadoop was born.
</p>
<a name="sec51"></a>
<h3>5.1 What is Hadoop ?</h3>
<p>
Hadoop is an <b>Open Source Platform</b> providing:
</p>
<ul>
<li>A distributed, scalable and fault-tolerant storage system as a grid</li>
<li>Initially, a single parallelism paradigm: MapReduce, reusing the storage nodes as processing nodes</li>
<li>Since Hadoop V2 and YARN, other parallelization paradigms that can be implemented on Hadoop</li>
<li>Schemaless storage, optimized for sequential write-once / read-many access</li>
<li>Querying and processing DSLs (Hive, Pig)</li>
</ul>
<p>
Hadoop was initially primarily intended for Big Data analytics. It is the middleware underneath the Data Lake architecture pattern and intends to revolutionize the architecture of analytical information systems / decision support systems.
<br>
Nowadays Hadoop can be an infrastructure for much more, such as micro-services architectures (Hadoop V3) or real-time architectures.
</p>
<p>
Hadoop comes in different distributions: the Apache Foundation's, Cloudera, HortonWorks, MapR, IBM, etc.
</p>
<a name="sec52"></a>
<h3>5.2 Hadoop Overview</h3>
<p>
Hadoop is designed as layered software where the technologies in every layer can be interchanged at will:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/690fe110-474d-411b-96a5-2757af8b3906">
<img class="centered" style="width: 850px;" alt="Hadoop Overview" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/690fe110-474d-411b-96a5-2757af8b3906" />
</a>
</div>
<br>
<ul>
<li>
<b>Distributed storage</b>: Hadoop packages HDFS as the default underlying distributed filesystem. But, for instance, the MapR Hadoop distribution uses the MapR filesystem instead.
</li>
<li>
<b>Parallel Computing Framework / MapReduce Processing Engine</b>: In Hadoop V1, MapReduce was the only parallel computing paradigm available on top of Hadoop, taking care of the processing distribution as well as the resource negotiation on the Hadoop cluster.
<br>
With Hadoop 2.0, the MapReduce paradigm has been split from the resource negotiation part, which became YARN - Yet Another Resource Negotiator. With this split, it has become possible to use Hadoop with different parallel processing constructs and paradigms, MapReduce becoming just one possibility among others (a minimal word-count sketch after this list illustrates the MapReduce paradigm itself).
</li>
<li>
<b>Machine Learning / Processing</b>: This is in the end the most essential layer on top of Hadoop core. Hadoop is designed first and foremost for Big Data Analytics. There are numerous solutions that were initially either implemented on top of MapReduce or ported to MapReduce.
<br>
Nowadays, with YARN, software no longer needs to be ported to MapReduce to run on Hadoop; it just needs to integrate with YARN.
</li>
<li>
<b>Orchestration</b>: Numerous different solutions can be used to operate Hadoop and orchestrate processes.
</li>
<li>
<b>Querying</b>: A lot of NoSQL / NewSQL databases have been implemented as Hadoop querying frameworks, such as HBase or Hive. Some more advanced tools, such as Pig, go beyond querying.
</li>
<li>
<b>Reporting</b>: Users have multiple choices of software specialized in building reports on the data in Hadoop.
</li>
<li>
<b>IS Integration</b>: Integrating Hadoop in the Information System - specifically building data import / export between Hadoop and the operational information system components - is a key concern. Numerous different solutions have been developed for this and are packaged with most Hadoop distributions.
</li>
<li>
<b>Supervision and Management</b>: Most Hadoop distributions provide their own management tool. Some tools are available as Apache projects.
</li>
</ul>
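<p>
To give a feel for the MapReduce paradigm mentioned above, here is a minimal word-count sketch - in plain single-process Python for readability; Hadoop distributes these same map / shuffle / reduce steps across the cluster:
</p>
<pre>
# Minimal word-count sketch of the MapReduce paradigm (single process).
from itertools import groupby

def map_phase(line):
    # Emit a (word, 1) pair for every word of an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Sum the counts emitted for a given word.
    return (word, sum(counts))

lines = ["big data is big", "data lakes hold big data"]
pairs = [kv for line in lines for kv in map_phase(line)]      # map
pairs.sort()                                                  # shuffle / sort
result = [reduce_phase(word, [c for _, c in group])           # reduce
          for word, group in groupby(pairs, key=lambda kv: kv[0])]
print(result)  # [('big', 3), ('data', 3), ('hold', 1), ('is', 1), ('lakes', 1)]
</pre>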
<p>
Hadoop is a very large ecosystem of hundreds of different software components across all these different layers.
<br>
The most common ones would be as follows:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/29b960d3-03ef-4d5f-bce2-bb677989dc27">
<img class="centered" style="width: 850px;" alt="Hadoop Components" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/29b960d3-03ef-4d5f-bce2-bb677989dc27" />
</a>
</div>
<br>
<p>
But then again, there are really many more components than that in a typical Hadoop distribution.
<br>
Most Hadoop distributions are behemoth software stacks that would be very difficult to integrate and configure manually - which is the very reason behind the success of these distributions.
<br>
Hadoop core on its own is fairly complex to install, configure and fine-tune, so whenever one needs only Hadoop core for a specific piece of software (e.g. to run Spark), it's sometimes more appropriate to look for a lighter cluster management system such as <i>Apache Mesos</i>; more on that later in this article.
</p>
<a name="sec53"></a>
<h3>5.3 Hadoop Architecture</h3>
<p>
A simplified view of Hadoop core components deployment architecture would be as follows:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/217b1b5e-1555-46e2-8033-7e86f8de2f2a">
<img class="centered" style="width: 850px;" alt="Hadoop Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/217b1b5e-1555-46e2-8033-7e86f8de2f2a" />
</a>
<div class="centered">
<i>Hadoop Architecture</i>
</div>
</div>
<br>
<p>
Since Hadoop 2, having two master nodes for high-availability and avoiding <i>single points of failure</i> on the master components is highly advised.
<br>
The components from Hadoop core are deployed as follows:
</p>
<ul>
<li>
The <b>HDFS Name Node</b> (and secondary name node) is the centerpiece of the HDFS file system. It acts as the HDFS master, keeps the directory tree and tracks where on the cluster the file data is kept. The <b>HDFS Data Nodes</b> act as slave processes, run on individual cluster nodes and take care of data storage.
</li>
<li>
The <b>YARN Resource Manager</b> (and secondary resource manager) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node <b>NodeManagers</b> and the per-application <b>ApplicationMaster</b>.
</li>
<li>
The <b>MapReduce JobTracker</b> is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data for <b>co-local processing optimization</b>, or at least are in the same rack. Client applications submit jobs to the Job tracker. <b>MapReduce TaskTrackers</b> run on individual cluster nodes, execute the tasks and report the status of tasks to the JobTracker.
</li>
</ul>
<a name="sec54"></a>
<h3>5.4 The DataLake Architecture pattern</h3>
<p>
From <a href="https://en.wikipedia.org/wiki/Data_lake">Wikipedia</a>:
<br>
A data lake is a system or repository of data stored in its natural/raw format.
</p>
<ul>
<li>
It is usually a single store of data, including raw copies of source system data, sensor data, social data, etc., as well as transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
</li>
<li>
It can include structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
</li>
</ul>
<p>
With the continued growth in scope and scale of analytics applications using Hadoop and other data sources, the vision of an enterprise data lake can become a reality.
<br>
In a practical sense, a data lake is characterized by three key attributes:
</p>
<ul>
<li>
<b>Collect everything</b>. A data lake contains all data, both raw sources over extended periods of time as well as any processed data.
<br>
A data lake is characterized by a <b>Big Volume</b> of data.
</li>
<li>
<b>Dive in anywhere</b>. A data lake enables users across multiple business units to refine, explore and enrich data on their terms.
<br>
In a Data Lake, one doesn't know <i>a priori</i> the analytical structures.
</li>
<li>
<b>Flexible access</b>. A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
<br>
As a result, a data lake delivers maximum scale and insight with the lowest possible friction and cost.
</li>
</ul>
<p>
The fundamental approach of Data Lakes is to <b>pull</b> the required data from the raw data storage, transforming and processing it dynamically, as required by the use case being executed. It's entirely dynamic: queries and processes are designed on the fly.
<br>
The storage principle is to store everything: all the raw data from the operational Information System as well as all the data produced by the IS - log files, usage metrics, etc. (the "collect everything" pattern).
<br>
Hadoop is kind of the <i>Operating System</i> underneath the Data Lake pattern, and with Hadoop's power there is nearly no analytics use case that can't be implemented in a dynamic fashion, requiring at worst a few hours of runtime processing before providing the expected results.
</p>
<p>
This is in complete opposition to the Data Warehouse pattern, where the data was <b>pushed</b> into statically predefined transformation pipelines. The most critical drawback of this approach is the impossibility of coming up with a new use case quickly. Most of the time, when a corporation required a new KPI to be computed by the analytical system, if the required data was not already collected for another use case, it was impossible to provide quickly - requiring for instance a 6-month wait before the KPI could be provided over a 6-month period.
<br>
Hadoop finally made it possible, at a cheap cost, to get away from this <i>push</i> pattern.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/eb3fe0db-418c-48df-a404-146963fb779e">
<img class="centered" style="width: 850px;" alt="DataLake Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/eb3fe0db-418c-48df-a404-146963fb779e" />
</a>
<div class="centered">
<i>DataLake Architecture</i>
</div>
</div>
<br>
<p>
The <i>Data Lake</i> architecture pattern and its Hadoop engine form a tremendous opportunity to finally get away from the <i>Data Warehouse</i> pattern.
<br>
But there are pitfalls of course, and many corporations experienced them the hard way.
<br>
It has been stated so much, everywhere, that data can be incorporated "as is" in data lakes, that way too many corporations took it too literally, forgetting one essential aspect that holds even in Data Lakes.
<br>
A minimum of data cleaning, cleansing and preparation is always required. The most crucial aspect, one that can never be neglected, is the need to always have proper <b>correlation ID</b>s in every single piece of data being ingested in a data lake.
<br>
Without correlation IDs, data is unusable. And your Data Lake turns into a <i>Data Swamp</i>.
</p>
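<p>
As an illustration of this last point - field names below are hypothetical - every record ingested in the lake should carry a correlation ID shared by all records relating to the same business event, so they remain joinable later:
</p>
<pre>
# Minimal sketch: stamping every ingested record with a correlation ID.
import uuid, json, datetime

def ingest(source, payload, correlation_id):
    record = {
        "ingestion_id": str(uuid.uuid4()),           # unique per ingested record
        "correlation_id": correlation_id,            # shared across related records
        "source": source,
        "ingested_at": datetime.datetime.utcnow().isoformat(),
        "payload": payload,
    }
    return json.dumps(record)

# The same business transaction seen by two source systems,
# joinable later in the lake via its correlation_id:
print(ingest("core-banking", {"amount": 120.0}, correlation_id="txn-2021-000042"))
print(ingest("e-banking-logs", {"session": "abc"}, correlation_id="txn-2021-000042"))
</pre>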
<a name="sec6"></a>
<h2>6. Streaming Architectures</h2>
<p>
Streaming data refers to data that is continuously generated, usually in high volumes and at high velocity. A streaming data source would typically consist of a stream of logs that record events as they happen - such as a user clicking on a link in a web page, or a sensor reporting the current temperature.
</p>
<p>
A <b>streaming data architecture</b> is a framework of software components built to ingest and process large volumes of streaming data from multiple sources. While traditional data solutions focused on writing and reading data in batches, a streaming data architecture consumes data immediately as it is generated, persists it to storage, and may include various additional components per use case - such as tools for real-time processing, data manipulation and analytics.
<br>
A <b>real-time</b> system is an event-driven system that is available, scalable and stable, able to take decisions (actions) with a latency defined as <i>below the frequency of events</i>.
</p>
<p>
Streaming architectures are not strictly related to the web giants and the Big Data revolution: CEP - Complex Event Processing - engines have existed since the early 2000s.
<br>
However, streaming architectures evolved significantly with products emerging from the needs of the web giants, first with the Lambda Architecture and then the Kappa Architecture.
</p>
<!--
6.1 CEP RT architectures
6.2 Lambda Architecture
6.3 Kappa architecture
-->
<a name="sec61"></a>
<h3>6.1 Complex Event Processing</h3>
<p>
From <a href="https://en.wikipedia.org/wiki/Complex_event_processing">Wikipedia</a>
<br>
Complex event processing, or CEP, consists of a set of concepts and techniques developed in the early 1990s for processing real-time events and extracting information from event streams as they arrive. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) in real-time situations and respond to them as quickly as possible.
</p>
<p>
In a <i>Complex Event Processing Architecture</i>:
</p>
<ul>
<li>Historical data is regularly and consistently updated with live data.</li>
<li>Live data is available to the end user.</li>
<li>
Both types of data (historical and live) are not necessarily presented consistently to the end user.
<ul>
<li>Both sets of data can have their own screens or even application</li>
<li><i>A consistent view on both sets of data would be proposed by Lambda Architecture (next topic in this presentation)</i></li>
</ul>
</li>
</ul>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3f822b48-b325-4d2b-a36e-7245dceb80f2">
<img class="centered" style="width: 850px;" alt="Complex Event Processing" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3f822b48-b325-4d2b-a36e-7245dceb80f2" />
</a>
<div class="centered">
<i>Complex Event Processing</i>
</div>
</div>
<br>
<p>
A few notes on typical CEP deployments, in a raw fashion:
</p>
<ul>
<li><b>The rules GUI</b> is often a user-friendly editor supporting <i>hot</i> updates of rules, made available to business users.</li>
<li><b>The capture middleware</b> should support a very high throughput of thousands of events per second - just as the whole processing line - with negligible latency.</li>
<li><b>The CEP engine</b> needs to support very high throughput as well, and usually a maximum latency of a few dozen to a few hundred milliseconds. Fault tolerance and state coherence are common concerns.</li>
</ul>
<p>
Complex Event Processing engines and architectures are heavily used in the industry, in the world of real-time computing systems such as trading systems, payment monitoring systems, etc.
<br>
Such engines are however quite a legacy technology and have limits in terms of analytics: most if not all CEP engines on the market, even nowadays, are really some sort of evolved <i>rules engine</i>.
<br>
And that would be the most common limit of CEP engines - the fact that it's really only about rules. Machine learning and AI use cases are limited on CEP engines by the difficulty these systems have deriving features that require correlation with large historical datasets.
</p>
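<p>
To illustrate what such a rule looks like, here is a minimal sketch of a CEP-style sliding-window rule - the rule itself is hypothetical and the implementation deliberately simplistic:
</p>
<pre>
# CEP-style rule sketch: flag an account when more than 2 payments
# occur within a 60-second sliding window.
from collections import defaultdict, deque

windows = defaultdict(deque)   # account -> timestamps of recent payments

def on_event(account, timestamp, window_s=60, threshold=2):
    w = windows[account]
    w.append(timestamp)
    # Evict events that fell out of the sliding window.
    while w and timestamp - w[0] > window_s:
        w.popleft()
    if len(w) > threshold:
        print(f"ALERT: {len(w)} payments from {account} within {window_s}s")

for t, account in [(0, "A"), (10, "A"), (15, "B"), (30, "A"), (200, "A")]:
    on_event(account, t)
# ALERT: 3 payments from A within 60s   (raised at t=30)
</pre>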
<p>
The rise of Big Data analytics technologies has opened the door to much more advanced analytics use cases in real time. The Lambda and Kappa Architectures are much more recent approaches to real-time analytics.
</p>
<a name="sec62"></a>
<h3>6.2 Lambda Architecture</h3>
<p>
The Lambda Architecture, first proposed by Nathan Marz, attempts to provide a combination of technologies that together provide the characteristics of a web-scale system that satisfies requirements for availability, maintainability, fault-tolerance and low-latency.
</p>
<p>
Quoting <a href="https://en.wikipedia.org/wiki/Lambda_architecture">Wikipedia</a>: "<i>Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
<br>
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation.
<br>
The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.</i>"
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00">
<img class="centered" style="width: 50px; " alt="Lambda Symbol" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00" />
</a>
</div>
<br>
<p>
At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data.
<br>
The efficiency of this architecture shows in the form of increased throughput, reduced latency and negligible errors. When we mention data processing here, we basically use the term to mean high-throughput, low-latency, near-real-time applications.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/0b993a55-6c71-4eab-811d-426e084385b6">
<img class="centered" style="width: 850px;" alt="Lambda Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/0b993a55-6c71-4eab-811d-426e084385b6" />
</a>
</div>
<br>
<p>
As new data is introduced to the system, it is processed simultaneously by both the batch layer and the speed layer. The batch layer is an append-only repository containing unprocessed raw data. The batch layer periodically or continuously runs jobs that create views of the batch data - aggregations or representations of the most up-to-date versions. These batch views are sent to the serving layer, where they are available for analytic queries.
<br>
At the same time that data is being appended to the batch layer, it is simultaneously streaming into the speed layer. The speed layer is designed to allow queries to reflect the most up-to-date information - necessary because the serving layer's views can only be created by relatively long-running batch jobs. The speed layer computes only the data needed to bring the serving layer's views to real time - for instance, calculating totals for the past few minutes that are missing in the serving layer's view.
<br>
By merging data from the speed and serving layers, low-latency queries can include data that is based on computationally expensive batch processing, and yet include real-time data. In the Lambda Architecture, the raw source data is always available, so redefinition and re-computation of the batch and speed views can be performed on demand. The batch layer provides a big data repository for machine learning and advanced analytics, while the speed and serving layers provide a platform for real-time analytics.
<br>
The Lambda Architecture provides a useful pattern for combining multiple big data technologies to achieve multiple enterprise objectives.
</p>
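<p>
The query-time merge of batch and speed views can be sketched as follows - the data structures are illustrative only; real implementations would sit on top of the serving layer's stores:
</p>
<pre>
# Minimal sketch of the Lambda query-time merge.
batch_view = {"customer:42": 1250.0}   # total computed by the last batch run (up to t = 1000)
batch_cutoff = 1000

speed_events = [                        # raw events seen since the batch cutoff
    {"customer": "customer:42", "t": 1010, "amount": 50.0},
    {"customer": "customer:42", "t": 1042, "amount": 25.0},
]

def query_total(customer):
    # Merge the (accurate, heavy) batch view with the (fresh, light) speed view.
    realtime = sum(e["amount"] for e in speed_events
                   if e["customer"] == customer and e["t"] > batch_cutoff)
    return batch_view.get(customer, 0.0) + realtime

print(query_total("customer:42"))  # 1325.0
</pre>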
<p>
There are numerous solutions nowadays to build every layer of a Lambda Architecture:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/56ffadc9-952a-4e2c-9867-6e8799ffcc26">
<img class="centered" style="width: 850px;" alt="Lambda Architecture components" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/56ffadc9-952a-4e2c-9867-6e8799ffcc26" />
</a>
</div>
<br>
<p>
The takeaway here is that we have gone a long way since <i>Complex Event Processing</i> architectures and we have now numerous solutions to build new generations of streaming architectures able to extend real-time streaming to much more advanced analytics use cases, embracing <i>Real-time Artificial Intelligence</i> use cases.
</p>
<p>
<b>Pros and Cons of Lambda Architecture.</b>
</p>
<p>
<b>Pros</b>
</p>
<ul>
<li>The batch layer of the Lambda architecture manages historical data with fault-tolerant distributed storage, which ensures a low possibility of errors even if the system crashes.</li>
<li>It is a good balance of speed and reliability.</li>
<li>Fault tolerant and scalable architecture for data processing.</li>
</ul>
<p>
<b>Cons</b>
</p>
<ul>
<li><b>It can result in coding overhead due to the need to implement the same analytics logic twice:</b> once in the speed layer and once all over again in the batch layer.</li>
<li>It re-processes everything at every batch cycle, which is not beneficial in certain scenarios.</li>
<li>Data modeled with the Lambda architecture is difficult to migrate or reorganize.</li>
</ul>
<a name="sec63"></a>
<h3>6.3 Kappa Architecture</h3>
<p>
In 2014, Jay Kreps started a discussion where he pointed out some drawbacks of the Lambda architecture, which led the big data world to an alternative architecture that requires less code and performs well in certain enterprise scenarios where the multi-layered Lambda architecture seems like extravagance.
<br>
The Kappa Architecture cannot be taken as a substitute for the Lambda architecture; on the contrary, it should be seen as an alternative to be used in those circumstances where the active performance of a batch layer is not necessary to meet the standard quality of service.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/fd9e852b-718f-48b3-8c89-c0ab9c596488">
<img class="centered" style="width: 250px; " alt="Lambda vs Kappa Symbol" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/fd9e852b-718f-48b3-8c89-c0ab9c596488" />
</a>
</div>
<br>
<p>
Kappa architecture is a streaming-first architecture deployment pattern. With the most recent stream processing technologies (Kafka Streams, Flink, etc.), the interest in and relevance of the batch layer tend to diminish: the streaming layer matches the computation abilities of the batch layer (ML, statistics, etc.) and stores data as it processes it.
<br>
A batch layer would only be needed to kick-start the system on historical data - but then Apache Flink can very well do that too.
</p>
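<p>
The key idea - a single streaming code path, with re-processing done by replaying the retained log - can be sketched as follows (illustrative only; in practice the log would live in Kafka and the processor in Kafka Streams or Flink):
</p>
<pre>
# Minimal sketch of the Kappa approach: one streaming path;
# "re-processing" is simply replaying the retained log through new code.
log = [{"customer": "42", "amount": a} for a in (100.0, 50.0, 25.0)]

def process_v1(event, state):
    state[event["customer"]] = state.get(event["customer"], 0.0) + event["amount"]

def process_v2(event, state):
    # New business logic: only count amounts above 30.
    if event["amount"] > 30:
        state[event["customer"]] = state.get(event["customer"], 0.0) + event["amount"]

for version in (process_v1, process_v2):
    state = {}
    for event in log:        # replaying the log IS the re-processing
        version(event, state)
    print(version.__name__, state)
# process_v1 {'42': 175.0}
# process_v2 {'42': 150.0}
</pre>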
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1ad9d53b-e708-4f20-8fab-2f8b3988f0af">
<img class="centered" style="width: 850px;" alt="Kappa Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/1ad9d53b-e708-4f20-8fab-2f8b3988f0af" />
</a>
</div>
<br>
<p>
Kappa architecture can be deployed for those data processing enterprise models where:
</p>
<ul>
<li>Multiple data events or queries are logged in a queue, to be served against distributed file system storage or history.</li>
<li>The order of the events and queries is not predetermined. Stream processing platforms can interact with the database at any time.</li>
<li>Resilience and high availability are required, as each node of the system must handle terabytes of storage to support replication.</li>
</ul>
<p>
<b>Pros and Cons of Kappa architecture</b>
</p>
<p>
<b>Pros</b>
</p>
<ul>
<li>Kappa architecture can be used to develop data systems that are online learners and therefore don’t need the batch layer.</li>
<li>Re-processing is required only when the code changes.</li>
<li>It can be deployed with fixed memory.</li>
<li>It can be used for horizontally scalable systems.</li>
<li>Fewer resources are required, as the machine learning is done in real time.</li>
</ul>
<p>
<b>Cons</b>
</p>
<ul>
<li>The absence of a batch layer might result in errors during data processing or while updating the database, requiring an exception manager to reprocess the data or perform reconciliation.</li>
</ul>
<a name="sec7"></a>
<h2>7. Big Data 2.0</h2>
<p>
When Google published their papers in the early 2000s, it was quite a tsunami in the computer engineering world. Doug Cutting and the guys behind Hadoop started working on Hadoop, but a lot of other initiatives kicked off as well.
<br>
With their approach - scaling information systems on commodity hardware - massive computational systems suddenly became affordable, which sparked a whole new level of interest in distributed systems and distributed computing.
</p>
<p>
There is now an entire range of engines that transcend the Hadoop framework and are dedicated to specific verticals (e.g. structured data, graph data, streaming data, etc.).
<br>
Nowadays, the NoSQL ecosystem provides incredibly efficient alternatives to HDFS in the storage layer. In the processing layer, there is a plethora of solutions available, from Kafka Streams to Apache Flink through Spark, etc.
<br>
On the resource negotiation side as well, multiple initiatives provide lightweight and interesting alternatives to Hadoop's YARN.
</p>
<a name="sec71"></a>
<h3>7.1 Alternatives to Hadoop </h3>
<p>
A specific project kicked off by the University of California attracted quite a bit of attention at the time: the <i>Nexus</i> project, later renamed <i>Mesos</i> and donated to the Apache Software Foundation.
</p>
<p>
Apache Mesos intended to be kind of the <i>Operating System</i> of a computer cluster, somewhat in the same way the <i>Linux kernel</i>, for instance, operates a single machine. Mesos intended to provide the same kind of primitives for resource management at the scale of a whole cluster.
<br>
Pretty early in the Mesos development story, support for Docker containers was added to enable users to deploy and scale applications in the form of Docker containers.
</p>
<p>
A few years later, some folks, inspired by <i>Google Borg</i>, created in their turn a <i>cloud container orchestration system</i> for automating computer application deployment, scaling, and management. They named it Kubernetes.
<br>
With Mesos and Kubernetes gaining a lot of traction - since scaling applications in the form of Docker containers is extremely convenient - the folks at Hadoop added support for deploying applications as Docker containers to YARN as well, in Hadoop 3.0.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3cb0a91f-9959-4f4f-b55b-79b4c2d1c57a">
<img class="centered" style="width: 650px;" alt="Big Data 2.0" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3cb0a91f-9959-4f4f-b55b-79b4c2d1c57a" />
</a>
</div>
<br>
<p>
Nowadays in 2021, with Hadoop 3, these 3 technologies tend to converge toward the same possibilities. Hadoop 3 supports deploying jobs as Docker containers, just as Mesos and Kubernetes do.
<br>
Mesos and Kubernetes can use alternatives to HDFS such as Ceph, GlusterFS, MinIO (and of course Amazon, Azure, ...), etc.
</p>
<p>
So while Kubernetes was initially really oriented towards scaling applications in the <b>Operational Information System</b> space, it now tends to overflow into analytics use cases as well.
<br>
And the other way around: while Hadoop is still first and foremost oriented towards deploying applications in the <b>Analytical Information System</b> space, Hadoop 3 tends to be deployed increasingly in the operational space as well.
<br>
Apache Mesos can well be used on both sides and formed an interesting alternative to Hadoop YARN in both worlds for quite some time. Today however, Apache Mesos - even though, from my perspective, an amazing piece of software - is not heavily maintained anymore, and support for Mesos tends to vanish from the latest versions of software stacks.
</p>
<p>
Kubernetes (and/or technologies based on Kubernetes) is today a market standard for the Operational IS just as Hadoop remains a market standard for the Analytical IS.
</p>
<a name="sec72"></a>
<h3>7.2 Kubernetes</h3>
<p>
Kubernetes is an Open Source Platform providing:
</p>
<ul>
<li>Automated software application deployment, scaling, failover and management across a cluster of nodes.</li>
<li>Management of application runtime components as Docker containers and application units as Pods.</li>
<li>Multiple common services required for service location, distributed volume management, etc. (pretty much everything one requires to deploy applications on a Big Data cluster).</li>
</ul>
<p>
Kubernetes is originally largely inspired by, and even based on, Google Borg, (one of) Google’s initial cluster management system(s). It was released as an open-source component by Google in 2014, and the first official release was in 2015.
</p>
<p>
Kubernetes is emerging as a standard <b>Cloud Operating System</b>.
<br>
It comes in the flavour of many distributions. The main ones are:
</p>
<ul>
<li>PKS (Pivotal Container Service)</li>
<li>Red-Hat OpenShift</li>
<li>Canonical Kubernetes</li>
<li>Google / AWS / Azure</li>
</ul>
<p>
Kubernetes deployment architecture would be as follows:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/5593b148-4030-42a8-a170-314f8f5492ca">
<img class="centered" style="width: 850px;" alt="Kubernetes Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/5593b148-4030-42a8-a170-314f8f5492ca" />
</a>
<div class="centered">
<i>Kubernetes Architecture</i>
</div>
</div>
<br>
<p>
With the ever-growing popularity of containerized cloud-native applications, Kubernetes has become the leading orchestration platform to manage any containerized application.
<br>
Again, nowadays <b>Kubernetes</b> is emerging as a market standard to scale the <b>Operational Information System</b>, while <b>Hadoop</b> largely remains a market standard to scale the <b>Analytical Information System</b>.
</p>
<a name="sec8"></a>
<h2>8. Micro-services</h2>
<p>
From <a href="https://en.wikipedia.org/wiki/Microservices">Wikipedia</a>:
<br>
<b>Microservice architecture</b> - a variant of the <b>Service-Oriented Architecture (SOA)</b> structural style - arranges an application as a collection of loosely-coupled services. In a microservices architecture, services are fine-grained and the protocols are lightweight. Its characteristics are as follows:
</p>
<ul>
<li>Services in a microservices architecture (MSA) are small in size, messaging-enabled, bounded by contexts, autonomously developed, independently deployable, decentralized and built and released with automated processes.</li>
<li>Services are often processes that communicate over a network to fulfill a goal using technology-agnostic protocols such as HTTP.</li>
<li>Services are organized around business capabilities.</li>
<li>Services can be implemented using different programming languages, databases, hardware and software environments, depending on what fits best (note from JKE: this is not a strict requirement, e.g. with Spring Boot).</li>
</ul>
<p>
From <i>Martin Fowler</i>:
<br>
A Microservices-based architecture has the following properties:
</p>
<ul>
<li>Independent service lifecycles lead to a continuous delivery software development process: a change to a small part of the application only requires rebuilding and redeploying one or a small number of services.</li>
<li>Adheres to principles such as fine-grained interfaces to independently deployable services, business-driven development (e.g. domain-driven design).</li>
</ul>
<p>
As early as 2005, Peter Rodgers introduced the term "Micro-Web-Services" during a presentation at the Web Services Edge conference. The architectural style's name was really adopted in 2012.
<br>
Kubernetes democratized the architectural approach. The two big players in this field are Spring Cloud and Kubernetes.
</p>
<p>
A typical micro-services infrastructure architecture would be as follows:
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9ef528e8-40a3-42a8-8d73-db7ea47ff4d6">
<img class="centered" style="width: 850px;" alt="Micro-services infrastructure Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9ef528e8-40a3-42a8-8d73-db7ea47ff4d6" />
</a>
<div class="centered">
<i>Micro-services Architecture</i>
</div>
</div>
<br>
<a name="sec81"></a>
<h3>8.1. Micro-services discussion</h3>
<p>
Ask yourself: do you need microservices ?
</p>
<ul>
<li><b>Microservices are NOT Big Data!</b> In Big Data analytics, one needs to scale the processing linearly with the storage. Hadoop - or, for instance, Spark with Mesos on ElasticSearch - is designed for that very key aspect to be respected: co-local processing optimization. Micro-services are not designed for this. The scaling approach in micro-services is at the component / service level: heavy resource-consuming services are scaled widely while light services typically run on a few nodes, mostly for high-availability concerns.</li>
<li><b>You don’t need microservices or Kubernetes to benefit from Docker</b>. Docker is a tremendous way to package and deploy applications as a whole or as individual application components. Unless you need horizontal scalability and high-availability, you might not need Kubernetes or a micro-services infrastructure.</li>
<li><b>You’re not scaling anything with synchronous calls.</b> This is essential. A fundamental element in the design of a micro-services architecture resides in the usage of asynchronous calls as the communication paradigm. Think of it: if services call each other using synchronous calls, then scaling them is useless since they will all synchronize on the slowest of them - see the small sketch after this list.</li>
</ul>
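<p>
The asynchronous-communication point deserves a small sketch - illustrative only, using a plain in-process queue where a real system would use a message broker: the producer proceeds at its own pace while the slow consumer drains the backlog, and consumers can be scaled out independently.
</p>
<pre>
# Why asynchronous messaging decouples services: the producer never blocks
# on the slow consumer. With synchronous calls, every request would run
# at the pace of the slowest service in the chain.
import asyncio

async def fast_producer(queue):
    for i in range(5):
        await queue.put(f"event-{i}")   # returns immediately
        print("produced", i)

async def slow_consumer(queue):
    while True:
        event = await queue.get()
        await asyncio.sleep(0.1)        # simulate a slow downstream service
        print("consumed", event)
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    consumer = asyncio.create_task(slow_consumer(queue))
    await fast_producer(queue)          # finishes long before the consumer
    await queue.join()                  # wait for the backlog to drain
    consumer.cancel()

asyncio.run(main())
</pre>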
<p>
As a consequence, don’t do microservices unless:
</p>
<ul>
<li>You <b>need</b> independent service-level scalability (vs. storage / processing scalability - Big Data).</li>
<li>You <b>need</b> a strong SOA - Service-Oriented Architecture.</li>
<li>You <b>need</b> independent services lifecycle management.</li>
</ul>
<p>
There are various challenges to be accounted for when implementing micro-services:
</p>
<ul>
<li><b>Distributed caching vs reloading the world all over again.</b> If every service is a fully independent application, then all the reference and master data needs to be reloaded all over again by every service. This needs to be accounted for, and distributed caching needs to be considered.</li>
<li><b>Not all applications are fit for asynchronous communications.</b> Some applications require fundamentally synchronous calls.</li>
<li>
<b>Identifying the proper granularity for services.</b>
<ul>
<li>Enterprise architecture view is too big</li>
<li>Application architecture view is too fine</li>
</ul>
</li>
<li><b>Data consistency without distributed transactions</b>. Applications need to be designed with this in mind.</li>
<li>
<b>Weighting the overall memory and performance waste.</b>
<ul>
<li>A Spring boot stack + JVM + Linux Docker base for every single service ?</li>
<li>HTTP calls in between layers ?</li>
</ul>
</li>
</ul>
<a name="sec9"></a>
<h2>9. Conclusion</h2>
<p>
We went a long way in this article: from the web giants and their need to scale their information systems horizontally - the reasons behind it and the challenges this implies - down to micro-services and the scaling of individual Information System components.
<br>
The web giants' needs were initially really related to the massive amounts of data to be manipulated and the need to scale the processing linearly with the storage distribution. Nowadays, cloud computing and SaaS - Software as a Service on the cloud - form somewhat different needs.
<br>
Initial Big Data technologies were really oriented towards Data Analytics use cases and the Analytical Information System space. Later technologies - namely NoSQL / NewSQL and now Kubernetes and micro-services - are much more oriented towards scaling Operational Information System components or deploying them on the cloud.
</p>
<p>
The strong frontier between the Operational IS and the Analytical IS will tend to vanish in the future.
</p>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e177c115-cc09-4659-ae72-dfa906030a62">
<img class="centered" style="width: 500px;" alt="Operational / Analytical Frontier" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e177c115-cc09-4659-ae72-dfa906030a62" />
</a>
</div>
<br>
<p>
Increasingly, with Hadoop 3 and YARN able to manage and deploy Docker containers, Hadoop is no longer so strictly limited to the Analytical IS.
<br>
On the other side, Kubernetes makes it increasingly feasible to scale heavy data analytics applications as well.
<br>
Even today, NoSQL, streaming, Lambda and Kappa architectures are increasingly overflowing into the Operational IS and, as such, provide a common ground for operational and analytical processes.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/powerful-big-data-analytics-platform
Powerful Big Data analytics platform fights financial crime in real time
Jerome Kehrli
2021-09-03T05:17:04-04:00
2021-09-03T05:17:04-04:00
<p>
<i>(Article initially published on <a href="https://blog.netguardians.ch/powerful-big-data-analytics-platform-fights-financial-crime-in-real-time">NetGuardians' blog</a>)</i>
</p>
<p>
NetGuardians overcomes the problems of analyzing billions of pieces of data in real time with a unique combination of technologies to offer unbeatable fraud detection and efficient transaction monitoring without undermining the customer experience or the operational efficiency and security in an enterprise-ready solution.
</p>
<p>
When it comes to data analytics, the more data the better, right? Not so fast. That’s only true if you can crunch that data in a timely and cost-effective way.
</p>
<p>
This is the problem facing banks looking to Big Data technology to help them spot and stop fraudulent and/or non-compliant transactions. With a window of no more than a hundredth of a millisecond to assess a transaction and assign a risk score, banks need accurate and robust real-time analytics delivered at an affordable price. Furthermore, they need a scalable system that can score not one but many thousands of transactions within a few seconds and grow with the bank as the industry moves to real-time processing.
</p>
<p>
AML transaction monitoring might be simple on paper but making it effective and ensuring it doesn’t become a drag on operations has been a big ask. Using artificial intelligence to post-process and analyze alerts as they are thrown up is a game-changing paradigm, delivering a significant reduction in the operational cost of analyzing those alerts. But accurate fraud risk scoring is a much harder game. Some fraud mitigation solutions based on rules engines focus on what the fraudsters do, which entails an endless game of cat and mouse, staying up to date with their latest scams. By definition, this leaves the bank at least one step behind.
</p>
<p>
At NetGuardians, rather than try to keep up with the fraudsters, we focus on what we know and what changes very little – customers’ behavior and that of bank staff. By learning “normal” behavior, such as typical time of transaction, size, beneficiary, location, device, trades, etc., for each customer and internal user, and comparing each new transaction or activity against those of the past, we can give every transaction a risk score.
</p>
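<p>
For illustration only - this is a deliberately naive sketch, not NetGuardians' actual models - the intuition of scoring a transaction against a customer's own history can be expressed in a few lines:
</p>
<pre>
# Naive behavior-based scoring sketch (illustrative, NOT the actual models):
# compare a new amount against the customer's history with a simple z-score.
import statistics

history = [120.0, 95.0, 140.0, 110.0, 130.0]   # customer's past transaction amounts

def risk_score(amount, past):
    mean = statistics.mean(past)
    stdev = statistics.stdev(past) or 1.0
    z = abs(amount - mean) / stdev               # how unusual is this amount?
    return min(z / 5.0, 1.0)                     # squash into a [0, 1] score

print(round(risk_score(125.0, history), 2))      # usual amount   -> low score (0.07)
print(round(risk_score(9500.0, history), 2))     # unusual amount -> 1.0
</pre>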
<h2>Billions of pieces of data</h2>
<p>
To do this effectively means taking into account thousands of pieces of information every time a customer makes a transaction. Multiply that by the number of customers a bank has on its books, and it quickly gets to billions.
</p>
<p>
Such high volumes would overwhelm most platforms, slowing the analytics to an unacceptable speed for the demands of real-time banking. At NetGuardians, we have solved this by using a combination of technologies that allows us to regularly batch process all the data for super-accurate models and supplement these batch models in real time by checking and adding smaller data sets as they arrive. This allows our software to accurately assess huge volumes of transactions in real-time.
</p>
<p>
The technologies we use are:
</p>
<ul>
<li>Apache Kafka</li>
<li>Elasticsearch</li>
<li>Apache Mesos</li>
<li>Apache Spark</li>
</ul>
<p>
All are open-source and run on our proprietary <a href="https://www.niceideas.ch/roller2/badtrash/entry/lambda-architecture-with-kafka-elasticsearch">Lambda architecture</a>-driven platform. Together, they make up a powerful and affordable solution for analyzing every transaction accurately in real time. In fact, our platform catches up to 99% of fraud, with 85 percent fewer false alerts, cutting investigation time by 95 percent compared with alternative rule-based solutions.
</p>
<p>
While this is key, it’s not the best bit.
</p>
<p>
At NetGuardians, we help our customers reap the benefits of cutting-edge and state-of-the-art open-source technologies without them suffering any of the drawbacks. We integrate these technologies, fine-tune and secure them and, critically, we implement enterprise-grade requirements on top. This means banks can use our solution out of the box.
</p>
<h2>Enterprise-Ready Big Data Platform</h2>
<p>
NetGuardians combines all the appropriate technologies in a way to make them work together 100 percent of the time, perfectly fine-tuned and secure, providing a bank everything it requires for an enterprise environment. This includes high availability, data and communication encryption, disaster recovery processes, state-of-the-art authorization, identification and authentication frameworks, single sign on, backup and restore procedures and much more. In this way, banks using our software enjoy the benefits of open source – easy integration and further development/fine-tuning – with the security and resilience of proprietary software.
</p>
<p>
With NetGuardians, banking institutions get the best of both worlds. But the cherry on the cake is that our banks don’t have to do anything. The NetGuardians platform takes care of everything and operates itself automatically, benefitting from strong NoSQL and DevOps genes. And that is unique to us.
</p>
<p>
Should it want to, though, a bank can create its own analytics on top of the open-source components on which the NetGuardians’ platform is built for its own use cases. A bank may want to use our version of Kafka for its own data-streaming use cases, for example, or it can open our 360 vision of the customer and user activities in ElasticSearch and expose that data through a secured API to in-house, third party software. This allows it to use the data for whatever it wants or needs to do - perhaps AML use cases or enriching the CRM application with NetGuardians’ data about customers.
</p>
<h2>The future of finance is real-time payments</h2>
<p>
Typically, many banks access the anonymized data we collect and store on our platform as a financial crime data lake to enrich their own customer 360 views in front-office applications with risk indicators and a consolidated view of customers’ activities on their accounts. This is important because real-time payments are growing fast. In 2020, 54 percent of consumers had used the real-time payment app PayPal (<a href="https://www.paymentsjournal.com/real-time-payments-everything-you-need-to-know/">source</a>). Similar apps such as Venmo and Zelle are also growing fast - with the latter claiming 13 percent of consumers using its app in 2020, up from 1 percent in 2017.
</p>
<p>
While retail payments are important, it’s in business that the big volumes lie, and in one survey 80 percent of businesses said they wanted real-time banking. Already this is translating into action - in the US, 2020 saw a fivefold increase year on year in financial institutions implementing real-time payments.
</p>
<p>
Such huge growth means banks, big and small, will need affordable fraud detection in real time that can cope with these volumes. For the big banks, the solution will need to scale fast; for the smaller ones, they need a platform that can deliver accurate real-time risk scoring with smaller data sets. NetGuardians, with its unique combination of proprietary and open-source technologies, satisfies both. That is why banks worldwide – from Tier 1 to credit unions and co-ops – are turning to NetGuardians fraud-mitigation software to keep their customers’ cash safe.
</p>
<p>
<i>(Article initially published on <a href="https://blog.netguardians.ch/powerful-big-data-analytics-platform-fights-financial-crime-in-real-time">NetGuardians' blog</a>)</i>
</p>
https://www.niceideas.ch/roller2/badtrash/entry/a-proposed-framework-for-agile
A proposed framework for Agile Roadmap Design and Maintenance
Jerome Kehrli
2021-06-11T04:25:42-04:00
2021-06-11T04:25:42-04:00
<!--<h1>A proposed framework for Agile Roadmap Design and Maintenance</h1>-->
<p>
In my current company, we embrace agility down the line, from the <i>Product Management</i> Processes and approaches down to the Software Development culture.
<br>
However, from the early days and due to the nature of our activities, we understood that we had two quite opposed objectives: on one side the need to be very flexible and quickly change priorities as we refine our understanding of our market, and on the other side the need to respect commitments made to our customers regarding functional-gap delivery due dates.
<br>
In terms of road-mapping and forecasting, these two challenges are really entirely opposed:
</p>
<ul>
<li>
<b>Strong delivery due dates on project gaps with hard commitment on planning</b>. Our sales processes and customer delivery projects are all but Agile. We know when in the future we will start any given delivery project and we know precisely when the production rollout is scheduled, sometimes up to 12 months in advance. We have most of the tine a small set of <i>Project Gaps</i> required for these projects. Since we need to provide the delivery team with these functional gaps a few weeks prior to the production rollout, it turns out that we have actually <b>strong delivery due dates</b> for them, sometimes 12 months in advance.
</li>
<li>
<b>Priorities changing all the time as our sales processes and market understanding progress</b>. We are an agile company and mid-term and even sometimes short-term focus changes very frequently as we sign deals and refine our understanding of our market, not to mention that the market itself evolves very fast
</li>
</ul>
<p>
These two opposed challenges are pretty common in companies that are refining their understanding of their <a href="https://www.niceideas.ch/roller2/badtrash/entry/the-search-of-product-market">Product-Market Fit</a>. Having to commit strongly on sometimes heavy developments up to more than a year in advance, while at the same time changing the mid-term and short-term priorities very often is fairly common.
</p>
<p>
In this article, I would like to propose a framework for managing such a common situation by leveraging on <b>a roadmap as a communication, synchronization and management tool</b> by inspiring from what we do in my current company (leveraging on some elements brought by Mr. Roy Belchamber - whom I take the opportunity to salute here).
</p>
<p>
There are three fundamental cultural principles and practices that are absolutely crucial in our context to handle such opposed objectives. These three elements are as follows:
</p>
<ol>
<li>
<b>Multiple interchangeable development teams</b>: multiple teams that have to be interchangeable are required to be able to <b>spread the development effort</b> among flexible evolutions - that can be reprioritized at will - and hard commitments - that need to be considered frozen and with a fixed delivery due date.
</li>
<li>
<b>Independent and autonomous development teams</b>: these development team need to be able to work entirely independently and without and friction from any other team. This is essential to have <b>reliable estimations and forecasts</b>. A lot of the corollary principles and practices I will be presenting in this article are required towards this very objective.
</li>
<li>
<b>An up to date and outcome-based Roadmap</b>. Having a roadmap that crystallizes the path and the foreseen development activities in the next 2 years is absolutely key. Such a roadmap is all of an internal communication tool, an external communication support, a management and planning tool.</b>.
</li>
</ol>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/cc46c114-ab4d-4433-b61c-9ad3d3833ea1">
<img class="centered" style="width: 850px;" alt="An Agile roadmap sample" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/cc46c114-ab4d-4433-b61c-9ad3d3833ea1" />
</a>
<div class="centered">
Agile Roadmap Example
</div>
</div>
<br>
<p>
In this article, I intend to present the fundamental principles behind the design and maintenance of such a roadmap that are required to make it a powerful and reliable tool - and not yet another good looking but useless drawing - along with everything that is required in terms of Agile principles practices.
</p>
<!--<h1>A proposed framework for Agile Roadmap Design and Maintenance</h1>-->
<p>
In my current company, we embrace agility down the line, from the <i>Product Management</i> Processes and approaches down to the Software Development culture.
<br>
However, from the early days and due to the nature of our activities, we understood that we had two quite opposed objectives: on one side, the need to be very flexible and change priorities quickly as we refine our understanding of our market; on the other side, the need to respect commitments made to our customers regarding delivery due dates for functional gaps.
<br>
In terms of road-mapping and forecasting, these two challenges are really entirely opposed:
</p>
<ul>
<li>
<b>Strong delivery due dates on project gaps with hard commitments on planning</b>. Our sales processes and customer delivery projects are anything but Agile. We know when in the future we will start any given delivery project and we know precisely when the production rollout is scheduled, sometimes up to 12 months in advance. We have most of the time a small set of <i>Project Gaps</i> required for these projects. Since we need to provide the delivery team with these functional gaps a few weeks prior to the production rollout, it turns out that we actually have <b>strong delivery due dates</b> for them, sometimes 12 months in advance.
</li>
<li>
<b>Priorities changing all the time as our sales processes and market understanding progress</b>. We are an agile company, and mid-term and sometimes even short-term focus changes very frequently as we sign deals and refine our understanding of our market, not to mention that the market itself evolves very fast.
</li>
</ul>
<p>
These two opposed challenges are pretty common in companies that are refining their understanding of their <a href="https://www.niceideas.ch/roller2/badtrash/entry/the-search-of-product-market">Product-Market Fit</a>. Having to commit strongly on sometimes heavy developments up to more than a year in advance, while at the same time changing the mid-term and short-term priorities very often is fairly common.
</p>
<p>
In this article, I would like to propose a framework for managing such a common situation by leveraging <b>a roadmap as a communication, synchronization and management tool</b>, drawing inspiration from what we do in my current company (and building on some elements brought by Mr. Roy Belchamber - whom I take the opportunity to salute here).
</p>
<p>
There are three fundamental cultural principles and practices that are absolutely crucial in our context to handle such opposed objectives. These three elements are as follows:
</p>
<ol>
<li>
<b>Multiple interchangeable development teams</b>: multiple teams that have to be interchangeable are required to be able to <b>spread the development effort</b> among flexible evolutions - that can be reprioritized at will - and hard commitments - that need to be considered frozen and with a fixed delivery due date.
</li>
<li>
<b>Independent and autonomous development teams</b>: these development teams need to be able to work entirely independently and without any friction with any other team. This is essential to have <b>reliable estimations and forecasts</b>. A lot of the corollary principles and practices I will be presenting in this article are required towards this very objective.
</li>
<li>
<b>An up-to-date and outcome-based Roadmap</b>. Having a roadmap that crystallizes the path and the foreseen development activities over the next 2 years is absolutely key. Such a roadmap is at once an internal communication tool, an external communication medium, and a management and planning tool.
</li>
</ol>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/cc46c114-ab4d-4433-b61c-9ad3d3833ea1">
<img class="centered" style="width: 850px;" alt="An Agile roadmap sample" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/cc46c114-ab4d-4433-b61c-9ad3d3833ea1" />
</a>
<div class="centered">
Agile Roadmap Example
</div>
</div>
<br>
<p>
In this article, I intend to present the fundamental principles behind the design and maintenance of such a roadmap that are required to make it a powerful and reliable tool - and not yet another good-looking but useless drawing - along with everything that is required in terms of Agile principles and practices.
</p>
<p>
This article is available as a <a href="https://www.slideshare.net/JrmeKehrli/a-proposed-framework-for-agile-roadmap-design-and-maintenance">Slideshare presentation</a>
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. A Software Development Roadmap </a>
<ul>
<li><a href="#sec12">1.2 Roadmap elements</a></li>
<li><a href="#sec13">1.3 Hierarchy</a></li>
<li><a href="#sec14">1.4 A realistic Roadmap</a></li>
<li><a href="#sec15">1.5 In the next chapters</a></li>
</ul>
</li>
<li><a href="#sec2">2. Prerequisites </a>
<ul>
<li><a href="#sec21">2.1 A. Agile Software Development Teams </a>
<ul>
<li><a href="#sec211">2.1.1 eXtreme Programming </a>
<ul>
<li><a href="#sec2111">2.1.1.1 Small-releases</a></li>
<li><a href="#sec2112">2.1.1.2 Testing, testing and more testing</a></li>
<li><a href="#sec2113">2.1.1.3 On-site customer</a></li>
<li><a href="#sec2115">2.1.1.5 The Planning Game</a></li>
<li><a href="#sec2116">2.1.1.6 XP Takeaways</a></li>
</ul>
</li>
<li><a href="#sec212">2.1.2 Lean Startup</a>
<ul>
<li><a href="#sec2121">2.1.2.1 Pizza Teams</a></li>
<li><a href="#sec2122">2.1.2.2 Feature Teams</a></li>
<li><a href="#sec2123">2.1.2.3 Lean Startup Takeaways</a></li>
</ul>
</li>
<li><a href="#sec213">2.1.3 DevOps</a>
<ul>
<li><a href="#sec2131">2.1.3.1 Infrastructure as Code </a></li>
<li><a href="#sec2132">2.1.3.2 Continuous Delivery</a></li>
<li><a href="#sec2133">2.1.3.3 DevOps Takeaways</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#sec22">2.2 B. The 3 Horizons Framework</a>
<ul>
<li><a href="#sec221">2.2.1 Business economics </a></li>
<li><a href="#sec222">2.2.2 The three Horizon framework from McKinsey</a></li>
<li><a href="#sec223">2.2.3 Three Horizons framework - takeaways</a></li>
</ul>
</li>
<li><a href="#sec23">2.3 C. The Estimation process</a>
<ul>
<li><a href="#sec231">2.3.1 The roles and rituals involves in the estimation process</a></li>
<li><a href="#sec232">2.3.2 Rituals are scheduled</a></li>
<li><a href="#sec233">2.3.3 Now the Estimation Process</a></li>
<li><a href="#sec234">2.3.4 Team Sprint velocity</a></li>
<li><a href="#sec235">2.3.5 Forecasting</a></li>
<li><a href="#sec236">2.3.6 The Estimation process - Takeaways</a></li>
</ul>
</li>
<li><a href="#sec24">2.4 D. Roadmap timeline</a>
<ul>
<li><a href="#sec241">2.4.1 Monthly roadmap update</a></li>
<li><a href="#sec242">2.4.2 Back on Continuous Delivery</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#sec3">3. Conclusions and final notes</a></li>
</ul>
<a name="sec1"></a>
<h2>1. A Software Development Roadmap </h2>
<p>
A <i>good</i> roadmap implies some important characteristics:
</p>
<ul>
<li>Aligns with company strategy, rallies all around this and steers us towards delivering on this strategy</li>
<li>Focuses on delivering customer value & articulating benefits</li>
<li>Excites our customers about our company & product direction</li>
<li>Reflects what we have learned, over time, as an organisation</li>
<li>Does not pretend to have all the answers, nor does it need to</li>
</ul>
<p>
If it is done well, <b>a roadmap is a strategic communication tool, a statement of intent and direction</b>.
<br>
A good roadmap has to set a clear direction and simultaneously embrace the uncertainty inherent in product development.
</p>
<p>
Some pitfalls have to be avoided when designing a good roadmap:
</p>
<ul>
<li>Excessive granularity – too much focus on detail and dates means it will inevitably soon be inaccurate or even obsolete!</li>
<li>Mistakenly thinking every item in the roadmap demands upfront design and estimation - this is impossible and wasteful. Only the short-to-mid-term elements have to be estimated accurately.</li>
<li>Believing each stakeholder must personally value every item</li>
<li>Conflating the roadmap with the product release plan!</li>
</ul>
<a name="sec12"></a>
<h3>1.2 Roadmap elements</h3>
<p>
The essential elements of a roadmap are as follows:
</p>
<ul>
<li>A good roadmap starts with a <b>vision</b> of where we are going, guides us there and explains the stops along the way. The vision is the guiding principle.</li>
<li><b>Broad timeframes</b> avoid overcommitment - it's the sequence that matters (now, next, future). As we move along the sequence, accurate estimations become less important, down to the point where we don't really care how long something might take if we are not going to work on it for several years.</li>
<li>Focus on <b>outcomes</b> not outputs. Themes are not granular features and functions.</li>
<li>What <b>goals</b> will our product accomplish? What outcomes?</li>
<li>Protects against claims of broken promises by explaining that changes can happen.</li>
</ul>
<p>
These elements would be illustrated as follows for instance for a <i> RIAO - Rich Internet Application Organizer</i> (as introduced in my <a href="https://www.slideshare.net/JrmeKehrli/from-product-vision-to-story-map-lean-agile-product-shaping">"Lean / Agile Product shaping" slideshare</a> and detailed in my <a href="https://www.slideshare.net/JrmeKehrli/introduction-to-modern-software-architecture-249211724">"Introduction to Modern Software Architecture" slideshare.</a>):
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/475e949d-312d-4296-81f0-13e241879d3d">
<img class="centered" style="width: 850px;" alt="Roadmap elements" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/475e949d-312d-4296-81f0-13e241879d3d" />
</a>
<div class="centered">
Roadmap Elements<br>
(inspired from Roy Belchamber @ NetGuardians)
</div>
</div>
<br>
<a name="sec13"></a>
<h3>1.3 Hierarchy</h3>
<p>
In terms of hierarchy, we have to wonder why we are doing this, why it supports our strategy, what customer problems it will solve, and finally how to solve them. <br>
The <b>Roadmap</b> is about <i>what</i> customer problems we are about to solve. The <b>Product Backlog</b> is about <i>how</i> we will solve them and what solutions we need to implement. <br>
The roadmap is a product management concern, while the backlog is an R&D concern.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d63adcea-c7c7-44e1-b7e7-2e85a84d2928">
<img class="centered" style="width: 850px;" alt="The why / what / how to to roadmapping" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d63adcea-c7c7-44e1-b7e7-2e85a84d2928" />
</a>
<div class="centered">
Roadmap hierarchy<br>
(© Roy Belchamber @ NetGuardians)
</div>
</div>
<br>
<a name="sec14"></a>
<h3>1.4 A realistic Roadmap</h3>
<p>
Let's look at something more realistic.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/cc46c114-ab4d-4433-b61c-9ad3d3833ea1">
<img class="centered" style="width: 850px;" alt="An Agile roadmap sample" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/cc46c114-ab4d-4433-b61c-9ad3d3833ea1" />
</a>
<div class="centered">
Realistic Agile Roadmap
</div>
</div>
<br>
<p>
This example is a slightly reworked (anonymized and genericized) version of the roadmap we use today (April 2021) in my current company. It is really an instantiation of all the fundamental principles expressed above, and we will be using it throughout the remainder of this article to illustrate all the key practices required to make it a living and yet truly useful tool.
</p>
<p>
A first important thing should be noted.
<br>
As we move along the timeline of the roadmap, confidence and certainty diminish, down to the point where, 2 years from now, the roadmap is more a collection of long-term development ideas and in no way any kind of actual commitment to work on any of them.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/78f69efe-5730-490e-a5dd-5927f64ce804">
<img class="centered" style="width: 850px;" alt="Uncertainty on the roadmap" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/78f69efe-5730-490e-a5dd-5927f64ce804" />
</a>
</div>
<br>
<a name="sec15"></a>
<h3>1.5 In the next chapters</h3>
<p>
In the next chapters of this article, we will be covering all the fundamental concepts, principles and practices that need to be understood and adopted to make this roadmap a useful communication and planning tool.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c06fba7b-d0ed-412b-b785-ae96832f6e9a">
<img class="centered" style="width: 850px;" alt="Next chapters" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c06fba7b-d0ed-412b-b785-ae96832f6e9a" />
</a>
</div>
<br>
<a name="sec2"></a>
<h2>2. Prerequisites </h2>
<p>
In order to be able to design and maintain such a roadmap, some organizational principles as well as some product management principles are required.
<br>
Then, if one intends such a <i>Product Roadmap</i> to be more than a fancy marketing tool, really a strong communication, management and planning tool, it's essential that the forecasts are as reliable and realistic as possible. And this is fairly complicated, since it relies on the ability of the individual development teams working on a specific topic to <b>work autonomously, independently and without any friction</b> with other teams. And this in turn requires an in-depth adoption of <i>Agile principles and practices</i>, not only in the development teams but at Product Management level as well.
</p>
<p>
In this chapter, we will be looking at each and every one of these prerequisites.
</p>
<a name="sec21"></a>
<h3>2.1 A. Agile Software Development Teams </h3>
<p>
First, if we want to be in a situation where we can respect the roadmap timeline and have reliable forecasts, the multiple development teams working in parallel on the multiple streams need a state-of-the-art agile culture, principles and practices.<br>
The whole problem is: <i>How to build a team with a culture, an organization and a set of principles and practices that make estimations possible and forecasting reliable and accurate?</i>
<br>
The answer is: <i>by adopting state-of-the-art Agile engineering methodologies</i>.
</p>
<p>
The schema below is a slide that I designed when I was a consultant carrying out digital transformation projects in big corporations. Very often I was meeting IT teams that were not agile at all. I needed a slide to explain to them what the prerequisites are if one wants to embrace digital transformation.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/80f83d69-2302-40f6-a088-f862c09aba59">
<img class="centered" style="width: 800px;" alt="Agile prerequisites" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/80f83d69-2302-40f6-a088-f862c09aba59" />
</a>
</div>
<br>
<p>My message here was as follows: <br>
<i>"Look, if you want to go digital, if you want to come up with digital products and be able to develop and adapt them at the pace that is required in the digital world, it's going to be fairly difficult if you're not an agile Corporation, if you haven't <b>scaled agile principles and practices</b> at the level of the whole organization.
<br>
And then scaling agile, if you haven't embraced the <b>Lean Startup</b> methodology, and if you don't have company wide monitoring and improvement approaches such as <b>Kanban and Kaizen</b>, will be difficult.
<br>
Then doing Lean Startup, Kanban, and Kaizen if you don't have a <b>state of the art Agile software development methodology</b>, and if you haven't embraced <b>DevOps</b> principles, will be difficult.
<br>
And finally doing Agile and DevOps, if you're not state of the art regarding <b>eXtreme Programming</b> principles and practices will be difficult."
</i>
</p>
<p>
Now, if you want to be in a situation where you have independent and autonomous teams which can benefit from reliable estimations, and eventually come up with reliable forecasting abilities, then you have to have raised agility adoption to this level in your company.
</p>
<p>
Another way to illustrate these agile methodologies, such as DevOps, Lean Startup and others, would be <a href="https://www.niceideas.ch/roller2/badtrash/entry/periodic-table-of-agile-principles">The Periodic Table of Agile principles and practices</a>.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/eaa33988-0916-4c98-93d8-efce52c72bea">
<img class="centered" style="width: 850px;" alt="Periodic Table of Agile Principles and practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/eaa33988-0916-4c98-93d8-efce52c72bea" />
</a>
<div class="centered">
Periodic Table of Agile Principles and Practices
</div>
</div>
<br>
<p>
All the principles and practices with a red border are effectively needed for accurate planning and road-mapping. All these practices are required to have independent and autonomous teams, giving you the ability to parallelize development epics and be in a position where you can accurately estimate the velocity of these teams, in such a way that eventually you're able to do reliable estimations and forecasting on the items for which you need an accurate due date.
<br>
Now in reality the situation is more complex since, at the end of the day, all the practices from the table heavily depend on one another, so in the end you need to have embraced most of them.
</p>
<p>
We will now review the most essential practices, with direct impact on planning and forecasting abilities, from these various methodologies.
<br>
We can't review all of the required practices, of course, but at least those that I believe are the most essential ones from eXtreme Programming, Lean Startup and DevOps that one needs to adopt if one wants to reach the ultimate goal: having autonomous and independent teams, giving one the ability to parallelize development items, while still being able, within teams, to have accurate planning capacity and estimations, and eventually reliable forecasting.
</p>
<a name="sec211"></a>
<h4>2.1.1 eXtreme Programming </h4>
<p>
The eXtreme Programming practices are as follows. This representation is interesting since it shows that, even within XP alone, all these practices depend on each other.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/be3fc61c-68dc-4d23-b356-3560430dda38">
<img class="centered" style="width: 600px;" alt="eXtreme Programming Practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/be3fc61c-68dc-4d23-b356-3560430dda38" />
</a>
<div class="centered">
XP Practices
</div>
</div>
<br>
<p>
Let's discuss in more detail four essential practices when it comes to planning: Small Releases, Testing, On-Site Customer and the Planning Game.
</p>
<a name="sec2111"></a>
<h5>2.1.1.1 Small-releases</h5>
<p>
The first practice I want to mention is XP's <i>Small Releases</i> principle, which DevOps streamlines as <i>Continuous Delivery</i>. The fundamental idea behind it is that if you want to master your release process, you have to release as often as possible, for multiple reasons. First, because if you release as often as possible, the releases are small. And as a result, your chances of mastering the process are much higher than if you have rare and then very large releases.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d591f1d9-6449-4e57-9826-dbf4d50ede6f">
<img class="centered" style="width: 800px;" alt="XP Small Releases" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d591f1d9-6449-4e57-9826-dbf4d50ede6f" />
</a>
<div class="centered">
XP Small Releases
</div>
</div>
<br>
<p>
Having smaller gaps and smaller changes in your releases significantly reduces the risk inherent to the release.
<br>
But there's another reason behind it. If you ask an engineer to do something very often, he will automate it. This is not necessarily the case if an engineer does only a few releases a year: why would one bother automating those?
<br>
But if you ask your development team, your engineering team, to release at the end of every single sprint a <i>production-ready</i> and <i>shippable version</i> of the product, then your development team will automate the release process. That's what engineers do.
</p>
<a name="sec2112"></a>
<h5>2.1.1.2 Testing, testing and more testing</h5>
<p>
The next topic is about testing. XP practitioners always say that one has to invest the 20% of time required to reach 80% coverage of the cyclomatic complexity of the code (or branch / line coverage), and not more, since the remaining 20% would require 80% of the investment.
<br>
But if one wants to go to continuous delivery, if one wants to be in a situation where one is able to entirely automate the release process, then the 80% target is not sufficient anymore: one has to target 100% coverage.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/499760ff-0cea-4723-aef0-ab346dbd6422">
<img class="centered" style="width: 800px;" alt="TDD" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/499760ff-0cea-4723-aef0-ab346dbd6422" />
</a>
</div>
<br>
<p>
And this is doable with different types of tests. Covering more than 80% of the code complexity with unit tests only is impossible. But integration tests enable one to go beyond, typically to 90%. Then, if one is able to have the product entirely built and deployed automatically on a production-like environment, one is able to implement <i>end-to-end</i> / functional tests using, for instance, Protractor or Selenium. And with end-to-end / functional tests, you are able to cover the remaining 10% and reach nearly 100% coverage with your whole automated test suite.
<br>
And this is essential if one wants reliable forecasting abilities, because stop-the-world releases kill them. It's fundamentally unpredictable how much time it takes to complete acceptance tests, fix the remaining issues and eventually release the software if one waits for periodic releases.
<br>
One needs to entirely automate non-regression tests <b>and acceptance tests</b> using functional, integration and unit tests. And this is absolutely key to reaching continuous delivery.
</p>
<p>
But there's something else that needs to be accounted for. In this next chart, we're looking at three typical situations of a feature development. The first situation is without any automated tests, the second is with some tests implemented after the development is done, and the last situation is when embracing TDD.
<br>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/tdd-test-driven-development-is#sec45">This chart is explained in detail here in a previous article.</a>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c625635e-a891-4769-9a58-73b67523dd38">
<img class="centered" style="width: 800px;" alt="3 scenarii" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c625635e-a891-4769-9a58-73b67523dd38" />
</a>
</div>
<br>
<p>This is interesting. We have the illusion that skipping the development of tests makes us gain time. And if you look at the blue box, it does indeed make us gain time in terms of pure development time. But the time lost after the development phase greatly exceeds the time gained by skipping tests.
<br>
Writing some tests after implementing the code really helps: we can see that the development time takes longer but is really compensated by the time gained on debugging and manual testing. Not to mention the formidable documentation that unit and integration tests form.
<br>
But the striking case is when we put TDD in place - enabling us to reach close to 100% coverage of the code with our automated tests - where we spend most of the time doing pure development, which takes longer of course, but the return on investment is absolutely brilliant.
</p>
<p>
And the interesting thing when it comes to planning and forecasting is that the blue box can be estimated. This is typically what we estimate when we do the planning game with Story Points. It's possible to estimate the time it takes to develop a feature, including the tests. On the contrary, the time it would take to debug it, manually test it or re-understand it is fundamentally unpredictable.
<br>
<b>TDD brings most of the development activities back to something that can be estimated</b>, which is the pure development time. And this is not optional: if one wants reliable forecasts and reliable estimations, one needs to bring most of the development activities back to something that can be estimated. TDD is absolutely precious for this.
</p>
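<p>
To make this last point concrete, here is a minimal sketch of the TDD cycle in Python with pytest. The <i>pricing</i> example and its rules are invented purely for illustration: the tests are written first and fail, then just enough implementation is written to make them pass, and the tests then serve as the automated, estimable safety net discussed above.
</p>
<pre>
# Step 1 - write the tests FIRST; they fail while apply_discount
# does not exist yet (the "red" phase).
import pytest

def test_discount_is_applied():
    assert apply_discount(200.0, percent=10) == 180.0

def test_discount_cannot_exceed_100_percent():
    with pytest.raises(ValueError):
        apply_discount(200.0, percent=150)

# Step 2 - write just enough implementation to turn the tests
# green, then refactor with the tests as a safety net.
def apply_discount(amount: float, percent: float) -> float:
    if percent > 100:
        raise ValueError("discount cannot exceed 100%")
    return amount * (1 - percent / 100)
</pre>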
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/tdd-test-driven-development-is">For more information on TDD, refer to my previous article.</a>
</p>
<a name="sec2113"></a>
<h5>2.1.1.3 On-site customer</h5>
<p>
XP insists on the need to have an <i>on-site customer</i> to streamline the development pace, by being able to answer questions, provide refined specifications and, importantly, do acceptance testing continuously as development tasks are being completed.<br>
The notion of on-site customer - having someone with true business expertise - is replaced in Scrum with the notion of <b>Product Owner</b>.
<br>
If one wants the development teams to be independent and autonomous, then one needs to make sure that each team can work on a feature peacefully and in isolation, down to production readiness, without any interruption and without any dependency on an external actor.
<br>
The product owner enables <i>continuous acceptance testing</i>, thus avoiding stop-the-world events, and streamlines acceptance testing as part of the development activities.
<br>
For this reason mostly, the on-site customer - or the Scrum product owner - is crucial.
</p>
<a name="sec2115"></a>
<h5>2.1.1.5 The Planning Game</h5>
<p>
And finally, the planning game. You need to think of it this way: imagine you're looking at a big stone somewhere in a field. Estimating <i>out of the blue</i> the weight of the stone is very difficult. Estimating absolute figures is a very difficult game.
<br>
But answering another question, "is the stone heavier or lighter than this other stone?", is a completely different story. It's a comparison game. And that's surprisingly easier. Finding out whether a stone is somewhat bigger than another one, but also somewhat smaller than a third one, is a much easier game.
<br>
Story points enable us to transform a very difficult estimation game into a much easier comparison game.
<br>
And that's what planning poker with Story Points is all about.
<br>
It's about finding ways to transform a difficult problem into an easier problem. And after a while, the team becomes quite good at it.
</p>
<p>
And if you're able to accurately estimate a task in story points using the comparison game, then you're able to compute how many story points a team is able to complete in a sprint. And if you know how many story points a feature costs and what the <i>sprint velocity</i> of a team is, then you can compute how many sprints the team needs to implement this very feature.
<br>
And with this, you can do forecasting and planning ... if and only if the team is able to work independently, autonomously, without any synchronization point, without stop-the-world events and without any friction coming from the need to collaborate with another team.
<br>
And we will see below what these teams must look like to enable such independence.
</p>
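<p>
As a minimal illustration of this arithmetic (the story-point and velocity figures below are invented for the example), the forecast really is a one-line computation once the team's velocity is known and stable:
</p>
<pre>
import math

def sprints_needed(feature_story_points: int, team_velocity: int) -> int:
    """Full sprints needed for a feature, given the team's measured
    velocity (story points completed per sprint)."""
    return math.ceil(feature_story_points / team_velocity)

# A feature estimated at 55 story points, a team with a measured
# velocity of 21 points per sprint:
print(sprints_needed(55, 21))  # 3 sprints, i.e. 6 weeks with 2-week sprints
# The forecast is only reliable if the team works without friction
# or stop-the-world events, as stressed above.
</pre>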
<p>
More information on the Planning game in <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-planning-tools-and-processes">a previous article</a> and further in this very article.
</p>
<a name="sec2116"></a>
<h5>2.1.1.6 XP Takeaways</h5>
<p>
Summing things up:
</p>
<ul>
<li>
First, the product owner (i.e. <i>on-site customer</i>) enables the team to be autonomous and work on its own without interruption, by answering questions as they pop up and, more importantly, running acceptance tests continuously, so that a developer can consider a task finished as soon as possible before moving on to the next one.
</li>
<li>
Small releases - which we will see as <i>Continuous Delivery</i> in the DevOps methodology - make it possible to avoid periodic releases as <i>stop-the-world</i> events, which would break both the autonomy and the pace of the team by forcing synchronization points everywhere on the roadmap timeline.
</li>
<li>
TDD enables reliable forecasting by bringing most of the development team's activities back to something that can actually be estimated and forecast: code implementation time. Debugging sessions, testing and re-understanding the code are activities that are fundamentally impossible to estimate.
</li>
<li>
And finally, the planning poker estimation game enables reliable estimation and forecasting abilities - surprisingly and counter-intuitively, much more so than traditional waterfall approaches.
</li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e1b9f1dd-afc9-4166-9abe-c557a90dc41e">
<img class="centered" style="width: 850px;" alt="XP Takeaways" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e1b9f1dd-afc9-4166-9abe-c557a90dc41e" />
</a>
<div class="centered">
XP takeaways
</div>
</div>
<br>
<a name="sec212"></a>
<h4>2.1.2 Lean Startup</h4>
<p>
Eric Ries presents Lean Startup as the <i>Build - Measure - Learn</i> loop.
<br>
Steve Blank presents it as the <i>Four Steps to the Epiphany</i> process, as follows. We won't be discussing Lean Startup in depth today, but I do want to discuss <i>two Lean Startup practices</i> that are crucial to shaping autonomous and independent, yet interchangeable and equivalent, teams: <b>Pizza Teams</b> and <b>Feature Teams</b>.
<br>
One might refer to <a href="https://www.niceideas.ch/roller2/badtrash/entry/lean-startup-a-focus-on">my previous article on Lean Startup</a> to get an overview of all Lean Startup principles and practices.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/81a7edce-4619-452d-8364-5cbb5a0c83a4">
<img class="centered" style="width: 850px;" alt="Lean Startup" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/81a7edce-4619-452d-8364-5cbb5a0c83a4" />
</a>
<div class="centered">
Lean startup - The four steps to the Epiphany.
</div>
</div>
<br>
<p>
Let's discuss shortly these two important principles related to Agile Teams.
</p>
<a name="sec2121"></a>
<h5>2.1.2.1 Pizza Teams</h5>
<p>
I discussed Pizza Teams at length in <a href="https://www.niceideas.ch/roller2/badtrash/entry/lean-startup-a-focus-on#sec341">my previous article about Lean Startup</a> so I will only recap a few things here.
</p>
<p>
The reason for small teams is that the more people you have in your team, the more the number of <i>one-to-one</i> communication channels explodes. So one needs to keep the team sufficiently small that everyone is able to understand what everyone else is working on.
<br>
One needs to keep the team sufficiently small to have an efficient organization within the team, and so the ability to have people collaborating together, for instance the UI/UX expert with the backend developer, the DevOps engineer, etc.
<br>
But the team needs to be sufficiently large to enable efficient brainstorming, the ability to generate new ideas, interchangeability of essential resources, etc.
<br>
In my current company, we believe that the ideal size for our <i>Feature Teams</i> is between 5 and 8 engineers.
</p>
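<p>
The combinatorial reason is easy to check: a team of <i>n</i> people has n(n-1)/2 one-to-one communication channels, so the count grows quadratically with team size. A two-line illustration:
</p>
<pre>
def channels(n: int) -> int:
    # Number of one-to-one communication channels in a team of n people
    return n * (n - 1) // 2

print([(n, channels(n)) for n in (5, 8, 12, 20)])
# [(5, 10), (8, 28), (12, 66), (20, 190)]
</pre>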
<a name="sec2122"></a>
<h5>2.1.2.2 Feature Teams</h5>
<p>
Just as for Pizza Teams, I discussed <i>Feature Teams</i> at length in <a href="https://www.niceideas.ch/roller2/badtrash/entry/lean-startup-a-focus-on#sec342">my previous article about Lean Startup</a> so I will only recap a few things here.
</p>
<p>
The key point with Feature Teams is to have teams that are as independent and autonomous as possible, without any synchronization needs or friction with any other team. This is essential to guarantee that the team can deliver a feature from A to Z on its own, where Z is the production rollout or release. This in turn is key to having reliable estimations and eventually forecasting abilities.
<br>
Organizing an R&D department with Feature Teams is striking in this regard.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4ba88a93-48b5-4953-99cf-c7f3baa3d3dd">
<img class="centered" style="width: 850px;" alt="Feature Teams" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4ba88a93-48b5-4953-99cf-c7f3baa3d3dd" />
</a>
<div class="centered">
Feature Teams<br>
(Source : Large Scale Scrum : <a href="http://less.works/">http://less.works/</a>)
</div>
</div>
<br>
<p>
To understand this, consider the following.
<br>
Most of the time, software engineering projects or IT departments in companies are organized with component teams. In the component teams model, one team takes care of the UI, another team takes care of developing the back-end, there is perhaps a data science team, another team takes care of data management and database infrastructure, and so on.
<br>
These multiple teams have to collaborate to develop a feature or to implement an evolution on the set of products that they develop or maintain. And that's where the problem lies: no team can go faster than the slowest of all the teams on which they all have a dependency.
<br>
Imagine that each and every feature requires some changes in the database model, and these can only be done by the database team. Then no team can go faster than the database team, because they all have to wait for the database team to finally implement the changes to the database model that they require.
<br>
Component teams kill the performance of the IT development department or project as a whole.
</p>
<p>
A feature team, on the other hand, is able to implement a feature from A to Z, from refined specifications down to production deployment, entirely independently and autonomously.
<br>
This can only work if you have within the feature team all the competencies that are required for that: developers, UI/UX experts, data scientists, QA engineers, DevOps engineers - not necessarily to operate the software in production, but to <b>automate</b> the deployment and operation of the software in production - and so forth.
<br>
And since these feature teams are able to work on a feature entirely independently and autonomously, <b>if you know the velocity of your feature team, you know how long it will take to implement a feature</b> that you have been able to estimate.
<br>
Feature teams multiply the performance of the whole software development organization by several orders of magnitude. And the side benefit is that they are autonomous and independent by design.
</p>
<p>
Just a little note about component teams: people sometimes confuse component teams and product teams. While it is not acceptable to have component teams (one team working on the UI, another team working on the back-end, another team working on research, yet another working on the database, and so on), it is on the contrary crucial to have product teams.
<br>
Component teams kill performance by introducing strong dependencies between teams.
<br>
But having a Feature Team linked not to a component but to a product makes a lot of sense. We want to leverage and develop the <b>business expertise of the feature team</b>. And this can't happen if a feature team is moving all the time from one product to another.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a9f2679f-e8a0-47ce-91bf-0eef1db8e081">
<img class="centered" style="width: 850px;" alt="Feature Teams" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a9f2679f-e8a0-47ce-91bf-0eef1db8e081" />
</a>
<div class="centered">
Product Teams
</div>
</div>
<br>
<p>
A feature team is fundamentally linked to a product, or perhaps a consistent set of products (or product line), because that's the product it will become familiar with in terms of business understanding.
<br>
And that is essential to avoid wasting time. With experience on a product, the team can become even more autonomous and slowly reduce its dependency on the product owner being there and working with the team all the time for it to understand what it has to do in terms of business requirements.
</p>
<a name="sec2123"></a>
<h5>2.1.2.3 Lean Startup Takeaways</h5>
<p>
Summing things up:
</p>
<ul>
<li>
First, a feature team is an independent and autonomous team that's able to implement a feature from A to Z. And the fundamental idea behind that is that as soon as a team has any dependency on another team, its estimations and forecasts are simply not reliable anymore. Multi-team consolidated planning and forecasting is a much more difficult problem.
<br>
<i>From A to Z</i> means that the team needs to have everything it takes to carry out its mission: 3rd-level support, documentation, maintenance, automated test development, code reviews, IT testing, acceptance testing with the product owner, releasing, continuous delivery, continuous deployment, deployment and delivery automation, everything... And the <i>DoD - Definition of Done</i> on a task or epic has to account for all these elements.
<br>
And the takeaway here is that because the team is autonomous and independent, its planning and forecasts are accurate and reliable.
</li>
<li>
The reason you want to have many of these independent teams is that you want to be able to work on different things in parallel. You want to have a fair spread between the project gaps and the short-term developments, while retaining the ability to work on long-term topics.
<br>
And how would you do this if you don't have different teams able to work on different things at the same time?
<br>
So you want to have multiple feature teams in your development organization to be able to have different core focuses at the same time.
</li>
<li>
Finally, because a feature is given to one autonomous team, and that team is able to work without any friction, without any interruption and, more importantly, without any stop-the-world event, the team's velocity calculation enables reliable forecasting.
<br>
As a side note, something that's important to understand as well is that the development team should never, ever be exposed to customers. If the team has to answer customer requests all the time, its performance will suffer critically. First and second levels of support need to be external to R&D.
</li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/bd19e59e-8e9c-4194-8663-901be7cd97ac">
<img class="centered" style="width: 850px;" alt="Lean Startup Takeaways" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/bd19e59e-8e9c-4194-8663-901be7cd97ac" />
</a>
<div class="centered">
Lean Startup Takeaways
</div>
</div>
<br>
<a name="sec213"></a>
<h4>2.1.3 DevOps</h4>
<p>
At the very root of DevOps is the wall of confusion between developers and operators. And the wall of confusion is a crystallization of the fact that developers and operators have fundamentally different objectives, and fundamentally different cultures.
<br>
A developer is challenged to deliver new functionalities to production as fast as possible. An operator, on the other hand, has the fundamental mission of maintaining production stability, which is precisely what a developer pushing changes all the time compromises. These two roles in an organization have entirely different and completely opposed objectives.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/7fc925f4-e338-4fc4-87b8-a53c6fd1723d">
<img class="centered" style="width: 850px;" alt="DevOps" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/7fc925f4-e338-4fc4-87b8-a53c6fd1723d" />
</a>
<div class="centered">
DevOps
</div>
</div>
<br>
<p>
Interestingly, the web giants have built organizations that make this wall of confusion nearly vanish. DevOps is a lot about understanding how traditional industries can take inspiration from the web giants to streamline the interactions and smooth the relationship between developers and operators.
<br>
And DevOps relies on three pillars:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b8141259-13e3-48ab-87a7-0df32674b057">
<img class="centered" style="width: 500px;" alt="DevOps pillars" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b8141259-13e3-48ab-87a7-0df32674b057" />
</a>
<div class="centered">
DevOps pillars
</div>
</div>
<br>
<p>
I described these principles and practices at length in <a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained">my previous article on DevOps</a> so I won't be repeating much more here.
<br>
I will just emphasize two aspects that are utterly essential for planning and estimations: <i>Infrastructure as Code</i> and <i>Continuous Delivery</i>.
</p>
<a name="sec2131"></a>
<h5>2.1.3.1 Infrastructure as Code </h5>
<p>
Again, this is very much detailed in my previous article so have a look at the section on <a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained#sec2">Infrastructure as Code.</a>
</p>
<p>
The reason why <i>Infrastructure as Code</i> and <i>Continuous Delivery</i> are essential to planning and forecasting is that both enable one to entirely automate the deployment and release process - the deployment process in the case of a cloud or SaaS application, and the release process in the case of an application deployed on premise.
<br>
Automating release and/or deployment is crucial since we don't want <i>stop-the-world</i> events. We want a feature team to be able to implement changes on the software, release them and push them to a customer, without any friction and without any contention coming from other teams that have different speeds or timelines. And this can only work if the whole release or deployment process is entirely automated. At the end of the day, this is what DevOps is about.
<br>
We have all the tools today to automate machine provisioning, system configuration, application deployment, etc.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a58fd379-8035-450b-9ea3-60132c5cef02">
<img class="centered" style="width: 850px;" alt="IaaS" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a58fd379-8035-450b-9ea3-60132c5cef02" />
</a>
<div class="centered">
Infrastructure as Code
</div>
</div>
<br>
<p>
Automating all of this is difficult, I'm not saying it's easy. It's anything but easy, when you implement features that involve significant technology changes, to keep the automated deployment process, the automated release process, etc. up to date. At the end of the day, there is a reason why <i>courage</i> is the most essential value in eXtreme Programming.
<br>
But again, you don't have much of a choice. <b>That's the only way you can make teams independent from each other and avoid stop-the-world events</b>. And 20 years ago, that would have been crazy. But with the tools we have today - virtual machines, Docker containers, Ansible, Chef, Puppet, and everything you can build around these tools to make it so that you click a button and the release is entirely automated with all the tests being executed - this can definitely be done. And it has to be done.
<br>
The <i>Return on Investment</i> in building and maintaining such an automation infrastructure is absolutely striking. It enables a team to deploy to production (or create a release) independently from every other team. And this is absolutely key when planning and forecasting matter.
</p>
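<p>
At the heart of Infrastructure as Code lies idempotency: describe the desired state, and only act when the actual state differs, so the same automation can safely run any number of times. Here is a deliberately tiny Python sketch of that idea; real tools such as Ansible, Chef or Puppet generalize exactly this pattern, and the file and setting below are hypothetical:
</p>
<pre>
from pathlib import Path

def ensure_line(path: str, line: str) -> None:
    """Make sure a config file contains a given line;
    do nothing if it already does (idempotency)."""
    f = Path(path)
    content = f.read_text() if f.exists() else ""
    if line not in content:
        f.write_text(content + line + "\n")

# Running this twice leaves the system in exactly the same state
# as running it once - the key property of Infrastructure as Code.
ensure_line("/tmp/app.conf", "max_connections=100")
ensure_line("/tmp/app.conf", "max_connections=100")
</pre>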
<a name="sec2132"></a>
<h5>2.1.3.2 Continuous Delivery</h5>
<p>
The fundamental idea behind continuous delivery is that doing deployments is difficult. And because it's difficult, one has to do it as often as one can.
<br>
There are two reasons for this:
</p>
<ol>
<li>
The first reason is that the more often you release, the smaller the changeset will be. And the smaller the changeset is, the better you master it, the more you can control it and the smaller the inherent risks are.
</li>
<li>
But there's another reason: if you ask an engineering team to do something all the time, they will automate it, that's the way engineers work. They will invest the time to make it so that the next time they need to do it, it's as easy as pushing a button. And that's absolutely key, because once you automate your release and production rollout processes, you can do them as often as you want.
</li>
</ol>
<b>A little note about <i>on premise</i> deployment.</b>
<p>
If you want to release an application, for instance at the end of every sprint or at the end of every feature, and push it to a customer, then internally you have to organize the team in such a way that it is possible to release at every end of sprint a shippable, production-ready version of the product.
<br>
Releasing the product and deploying it to any given customer has to be a product management decision. As far as the development team or R&D is concerned, every single sprint has to finish with a production-ready, shippable version. Period.
<br>
For this to be possible, you have to be in a situation where, if one day you push, for instance, version 7.3.4 to a customer, and then a few months or years later you release a 9.46.5, you are able to push that upgrade to that very customer automatically. For this you have to build a framework that enables you to apply these upgrades entirely automatically. It should be possible to push upgrades automatically at all times. If you don't do that, if applying an upgrade at a customer involves manual steps, manual data migration, manual configuration migration, etc., then you're screwed, because it would force you to potentially maintain all the individual versions you have in production at your customers. And that would kill the performance of your team and its ability to focus on the latest development version only.
</p>
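<p>
A minimal sketch of what such an upgrade framework boils down to: an ordered chain of automated migration steps that can bring any installed version up to the latest one, with no manual intervention. The version numbers reuse the example above; the steps themselves are placeholders (real-world tools such as Flyway or Liquibase apply the same principle to database schemas):
</p>
<pre>
# Versioned migration steps maintained along with the software.
# Each step upgrades an installed system to the next version.
MIGRATIONS = {
    "7.3.4": ("9.0.0", lambda: print("migrating data model 7.3.4 to 9.0.0")),
    "9.0.0": ("9.46.5", lambda: print("migrating data model 9.0.0 to 9.46.5")),
}

def upgrade(installed: str, target: str) -> None:
    """Apply every migration step between the installed and the target
    version, automatically and in order - no manual steps allowed."""
    current = installed
    while current != target:
        if current not in MIGRATIONS:
            raise RuntimeError("no migration path from " + current)
        current, step = MIGRATIONS[current]
        step()

upgrade("7.3.4", "9.46.5")
</pre>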
<p>
In short, Continuous Delivery is key for planning!
</p>
<ul>
<li>You don't want to lose time on deployment</li>
<li>You want deployment to be automated</li>
<li>You want deployment and production release to happen without you worrying about it or even noticing</li>
<li>You want to be able to develop on one single branch, the latest development branch, and push any version to any customer thanks to the data migration framework maintained along with the software</li>
<li>It's key to avoid multiple-team synchronization</li>
</ul>
<p>
I won't detail continuous delivery any further here and would refer to <a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained">my article on DevOps</a>, specifically the sections about:
</p>
<ul>
<li><a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained#sec34">Continuous Delivery deployments</a></li>
<li><a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained#sec35">Zero Downtime Deployments</a></li>
</ul>
<a name="sec2133"></a>
<h5>2.1.3.3 DevOps Takeaways</h5>
<p>
Summing things up:
</p>
<ul>
<li>
You have to be in a situation where releasing a new version of the product or pushing it to the SaaS or cloud environment is a <i>push-a-button</i>, anecdotal event. You don't want to lose time on releasing or deploying in production, because that would mean you have <i>stop-the-world</i> events in your entire development department, killing the ability of your teams to work independently on their own features.
<br>
If every three months, every two months or even every month a team has to stop everything because a release process is going on, it's impossible to respect the estimations and forecasts.
</li>
<li>
Automated tests are essential. You want to be in a situation where you push a button to deploy the software to production. This only works if you have a complete suite of automated tests and end-to-end tests that ensure the software is working 100% before it's automatically pushed to a production environment or automatically released (see the sketch after this list).
</li>
</ul>
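<p>
Put together, the push-a-button release boils down to something like the following sketch, where the full automated test suite gates the deployment. The <i>make</i> targets here are hypothetical stand-ins for your real test and deployment tooling:
</p>
<pre>
import subprocess, sys

def release() -> None:
    """Push-a-button release: the complete automated test suite gates
    the deployment; if anything fails, nothing ships."""
    # Unit + integration + end-to-end tests, all fully automated.
    tests = subprocess.run(["make", "test-all"])
    if tests.returncode != 0:
        sys.exit("test suite failed - release aborted")
    # Only a fully green suite reaches this point.
    subprocess.run(["make", "deploy-production"], check=True)

if __name__ == "__main__":
    release()
</pre>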
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6bac9d55-d610-4110-a16d-cccc69592c38">
<img class="centered" style="width: 850px;" alt="DevOps Takeaways" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6bac9d55-d610-4110-a16d-cccc69592c38" />
</a>
<div class="centered">
DevOps Takeaways
</div>
</div>
<br>
<a name="sec22"></a>
<h3>2.2 B. The 3 Horizons Framework</h3>
<p>
You might have noticed these H1, H2 and H3 notions on the roadmap. It's now time to explain them. They come from McKinsey's <b>3 Horizons framework</b>.
<br>
In order to introduce the Three Horizons framework by McKinsey, I need to start by explaining a few concepts of <i>Business Economics</i>.
</p>
<a name="sec221"></a>
<h4>2.2.1 Business economics </h4>
<p>
So let's start by looking at the following charts:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f2ed9fee-290e-4e53-b547-d39a9cbbac0f">
<img class="centered" style="width: 850px;" alt="Business Economics" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f2ed9fee-290e-4e53-b547-d39a9cbbac0f" />
</a>
</div>
<br>
<p>
On the left here, we have what we call the <i>Technology S-Curve</i>, which is quite typical in product development stories with a strong innovation dimension.
<br>
The <i>technology S-curve</i> illustrates that when you invent a new product, a new technology, or come up with a new offering for a specific market, the progress you make at first from a technology standpoint is huge. But after a while, continuing to invest more and more in your product or your technology doesn't make a whole lot of sense, because the amount of investment required to develop your technology further explodes. After a while, you have reached the maximum you can do with your technology or your innovative product. And it doesn't make a whole lot of sense at that point to keep investing in it; you are better off moving your idea to the next level, looking at adjacent segments, or even at entirely new products.
</p>
<p>
And this is shown pretty well by the <i>technology profit curve</i> on the right as well, which says basically the same thing, but from a different perspective.
<br>
When you have a new idea, when you put a new product on the market, in a real <a href="https://www.niceideas.ch/roller2/badtrash/entry/the-search-of-product-market">Product-Market Fit</a> situation, the money you make out of your product gets very high quite early: you very quickly make a lot of profit. But after a while, this technology is disrupted in its turn - you have competitors offering the same idea, alternative offerings coming up - and your profit goes down. And when this happens, you have to find the next-level investment, the next-level innovation in your idea, or develop something completely different. Again, this is quite typical of companies developing products with a strong innovation dimension.
</p>
<a name="sec222"></a>
<h4>2.2.2 The three Horizon framework from McKinsey</h4>
<p>
This is at the root of the Three Horizons framework from McKinsey.
<br>
McKinsey is basically saying that, in order to keep developing your company, and perhaps your solution or your technology, you have to be aware of, monitor, foresee and manage these situations. If you're working on your current technology, you have to be aware that it has a deadline: after a while it doesn't make a whole lot of sense to keep investing in what you're doing today. You have to identify your next idea, the next-level evolution of your idea or technology that will make you reach the next level.
<br>
And also, you have to keep in mind that eventually you might need to come up with something completely new to keep developing your company further.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/20384e68-12e2-41ed-963a-9766d291cb59">
<img class="centered" style="width: 850px;" alt="Three Horizons Framework" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/20384e68-12e2-41ed-963a-9766d291cb59" />
</a>
<div class="centered">
Three Horizons Framework
</div>
</div>
<br>
<p>
This reads as follows:
</p>
<ul>
<li><b>Horizon 1</b> is about maintaining and strengthening your core business.</li>
<li><b>Horizon 2</b> is about expanding your offering further for new opportunities or emerging markets. </li>
<li><b>Horizon 3</b> is about genuinely new businesses, competencies and possibilities, perhaps on top of your current technology, but likely on something completely new, yet related to your <i>product vision</i>, the next level idea. </li>
</ul>
<p>
The best example of that is Uber. They came up with their initial application very fast. After a while, developing the Uber application itself didn't make much sense anymore: it was working perfectly, and it started to be challenged by competitors' offerings. So Uber came up with Uber Eats, leveraging their technology to provide a completely new product on a new market.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/623ba280-d810-4215-9a99-aa73322ec9d6">
<img class="centered" style="width: 850px;" alt="Three Horizons description" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/623ba280-d810-4215-9a99-aa73322ec9d6" />
</a>
</div>
<br>
<p>
Again:
</p>
<ul>
<li>
The first horizon involves implementing innovations that improve the current operations or the UX of the product, or that cover functional aspects not addressed so far.
</li>
<li>
The second horizon is about innovations that extend the human competencies or technology abilities into new related markets. It's about looking at new verticals or adjacent segments.
</li>
<li>
The third is about disruption or high-end innovations that will change the nature of your industry and generate entirely new possibilities and competencies.
</li>
</ul>
<p>
McKinsey describes Horizon 1 as one to two years, Horizon 2 as two to four years and Horizon 3 as three to five years.
<br>
In my current company, this doesn't speak to us since, well, we have absolutely no idea what we're going to be doing in five years - that is like forever to us - and it doesn't make much sense to think about it so much today aside from some high-level orientations. Our interpretation in terms of timeline is as follows: Horizon 1 is now to 12 months, Horizon 2 is six to 24 months, Horizon 3 is three years from now.
</p>
<p>
An important aspect of the 3 horizons model is that at every single moment, you should have a fair share of investment on the three horizons, meaning elements from the 3 horizons in your current backlog.
<br>
If you have not reached the flattening of the profit curve, then you have mostly H1 innovations or elements, but you should also have a fair share of Horizon 2 and 3 elements.
<br>
If you have reached the flattening of your profit curve, then the ratio of Horizon 2 and 3 elements in your backlog should be higher.
</p>
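<p>
To make this rule a bit more operational, here is a minimal sketch, in Python, of how the horizon mix of a backlog could be monitored. The item structure and the threshold are invented for the illustration, not a recommendation:
</p>
<pre>
from collections import Counter

def horizon_mix(backlog):
    """Compute the share of backlog items per horizon (H1, H2, H3).

    `backlog` is assumed to be a list of items carrying a 'horizon'
    key in {"H1", "H2", "H3"} - a hypothetical structure, purely
    for illustration.
    """
    counts = Counter(item["horizon"] for item in backlog)
    total = sum(counts.values()) or 1
    return {h: counts.get(h, 0) / total for h in ("H1", "H2", "H3")}

# Example: a backlog dominated by H1 work, with a share of H2/H3 elements.
backlog = [{"horizon": "H1"}] * 12 + [{"horizon": "H2"}] * 4 + [{"horizon": "H3"}] * 2
mix = horizon_mix(backlog)
print(mix)  # H1: ~0.67, H2: ~0.22, H3: ~0.11

# A simple guard: past the flattening of the profit curve, one would
# require a higher H2/H3 ratio (the 20% threshold is arbitrary).
if mix["H2"] + mix["H3"] < 0.2:
    print("Warning: not enough Horizon 2/3 investment in the backlog")
</pre>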
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/cbfa9c34-8bf6-4644-888f-da0a19c46e69">
<img class="centered" style="width: 850px;" alt="Three Horizons - fair share of focuses" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/cbfa9c34-8bf6-4644-888f-da0a19c46e69" />
</a>
</div>
<br>
<a name="sec223"></a>
<h4>2.2.3 Three Horizons framework - takeaways</h4>
<p>
Summing things up:
</p>
<ul>
<li>
You should have a fair share between the different Horizons in your backlog and your sprint development tasks. While most of the time the majority of your investment should be on Horizon 1, there should still be a fair share of Horizon 2 and 3 elements. Whenever you reach the flattening of your curves, the ratio should change in favor of Horizons 2 and 3. But you can't avoid investing in Horizons 2 and 3 already today. If you don't, you will die.
</li>
<li>
A little note about the stories related to individual customer integration or delivery projects: for some of them, you might not know, before actually starting the project or the specific implementation, that very customer's needs or the precise scope, and hence you can't estimate accurately. In this case, you have to take some reserves. And that's fine: you can absolutely be in a situation where you don't know exactly what the team will be working on. There are always backlog filler tasks - tiny tasks like fixing a typo or another small change somewhere - literally filling up your backlog. These tasks are fillers, and this is what your teams work on when they need to fill in the blanks.
</li>
<li>
In a normal situation though, these fillers - all these tiny tasks - shouldn't be prioritized and scheduled independently; it's impossible. So as far as roadmapping is concerned, all these small tasks have to be grouped together in consistent batches - by theme, scope or value. Such evolution batches are prioritized and scheduled as one big block. And this kind of makes sense: otherwise you're almost forced to implement all these tasks in a continuous way all the time, whereas with batches you can decide that, in the next six months, you will implement this consistent batch of evolutions together on the platform.
</li>
<li>
Regarding what these stories or epics on the roadmap are: most of them come from the topics identified by the PMC or from Project Gaps. There are also some long and large functional evolutions coming from business analysts or product owners, just as technology evolutions and maintenance come from the CTO. The project gaps come from delivery, of course. The takeaway here is that the granularity has to be large: big, important topics. And if you want to schedule smaller things, group them together in consistent evolution batches.
</li>
<li>
If you have a team that's working on H1 topics, such as project gaps or evolution batches, for six months, make sure that for the following six months they work on H2 or H3 topics. This is crucial to avoid frustration. You can tell a team that in the coming four months they will be working on tiny and not-so-exciting improvements here and there, but only as long as they know that the next big technology evolution is coming up and is for them. If tasks are fairly shared among H1, H2 and H3 within the teams, everybody stays motivated.
<br>
If you don't do that, you're somehow back to having component teams, which doesn't make a whole lot of sense.
</li>
<li>
Finally, you want independent and autonomous teams for estimations and reliable forecasting. And you want many of them, to preserve the possibility of a fair share of investment between H1, H2 and H3 at any given time in your product development timeline.
</li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3288679a-cdfe-4689-8031-104f1da1e452">
<img class="centered" style="width: 850px;" alt="3 Horizons Takeaways" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3288679a-cdfe-4689-8031-104f1da1e452" />
</a>
<div class="centered">
3 Horizons Takeaways
</div>
</div>
<br>
<a name="sec23"></a>
<h3>2.3 C. The Estimation process</h3>
<p>
I won't spend much time on the estimation process since I detailed a few years ago all of this in <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-planning-tools-and-processes">a dedicated article on this very blog</a>.
<br>
So I will just be repeating some essential information regarding how epics or stories on the roadmap are estimated and how <i>team velocity</i> is estimated.
</p>
<a name="sec231"></a>
<h4>2.3.1 The roles and rituals involved in the estimation process</h4>
<p>
Let's start by defining all the roles involved in the estimation process. In this regard, the most important roles are the Product Manager, the CTO, the Tech Leads and Architects, the Team Leaders and of course the Product Owners.
<br>
Their core responsibilities are as follows, at least in my current company.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/efbcda67-4675-4076-8af1-ff34191e0788">
<img class="centered" style="width: 850px;" alt="Roles involved in estimation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/efbcda67-4675-4076-8af1-ff34191e0788" />
</a>
<div class="centered">
Roles involved in the estimation process
</div>
</div>
<br>
<p>
For the estimation process in our context, the important rituals are the <i>Product Management Committee</i>, and the <i>Architecture committee</i>. The first is central to identifying evolutions and prioritizing them while the second is central to designing and estimating them.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/40354c47-0d2d-4081-b58f-2f7d30ae695c">
<img class="centered" style="width: 850px;" alt="Rituals involved in estimation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/40354c47-0d2d-4081-b58f-2f7d30ae695c" />
</a>
<div class="centered">
Rituals involved in estimation
</div>
</div>
<br>
<p>
The Product Management Committee identifies the opportunities and evolutions and prioritizes them.
<br>
The Architecture committee is responsible for everything required for accurate estimations: the detailed specification of these epics and stories, their breakdown into tasks and their estimation.
<br>
It's key to estimate the short- to mid-term elements accurately, but we don't care so much about estimating something we will work on in two years. Why would you care how much time it takes to develop something we are not even sure we will be doing? We want at all costs accurate estimations of the things we know we will be working on, to prioritize and plan them efficiently. But for long-term ideas, a <i>T-Shirt sizing</i> approach is sufficient.
</p>
<a name="sec232"></a>
<h4>2.3.2 Rituals are scheduled</h4>
<p>
All precautions should be taken to avoid interrupting the development sprint all the time. For this reason, rituals are clearly scheduled and take place at a predefined time and pace. There should be no unforeseen interruption of the sprint course.
<br>
In my company, we stick to the following monthly organization, with 2 sprints per month:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/7728d1d1-77c8-4dd7-a9b0-07bc9690e353">
<img class="centered" style="width: 850px;" alt="Rituals are scheduled" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/7728d1d1-77c8-4dd7-a9b0-07bc9690e353" />
</a>
</div>
<br>
<p>
This reads as follows:
</p>
<ul>
<li>
A sprint ends with a <b>Sprint Retrospective</b> (Kaizen) and starts with a <b>Sprint Planning</b> where the sprint backlog is filled up to the <i>team sprint velocity</i> by the Product Owner and the Team Leader. Tasks are discussed with all the team at the same time and some estimations may be reviewed.
<br>
In my current company, we do both on Friday since we want everything to be ready for Monday morning regardless of the timezone of the teams.
</li>
<li>
Of course every single day there is a <b>Daily Scrum</b> in the morning where everyone presents where they are and where problems are discussed and escalated to the Team Leader and Tech Lead who might schedule dedicated meetings to discuss them further if required.
<br>
Also, every day finishes with an automated deployment of the whole product in a production-like setup on the <b>Integration Environment</b>.
</li>
<li>
At the end of the sprint, a production-ready, shippable version of the product is automatically deployed on the <i>Test Environment</i>. This internal release is considered as a <i>customer release</i> as far as the development teams are concerned.
</li>
<li>
The PMC - <b>Product Management Committee</b> occurs once a month. It's sometimes a very short meeting, just reviewing the updated roadmap and fine tuning priorities and sometimes a very long meeting finishing late at night, when multiple opportunities have to be discussed and prioritized.
</li>
<li>
The ARCHCOM - <b>Architecture Committee</b> - is a fairly central ritual in my current company. This is where we do all of:
<ul>
<li>Dispatch stories among Architects, Tech Leads, Product Owners and even the CTO for design, refined specifications and tasks breakdown.</li>
<li>Discuss open points, comment and amend designs and breakdowns.</li>
<li>Proceed with task estimations</li>
<li>Identify and Challenge technical evolutions and maintenance </li>
</ul>
As far as the estimation process is concerned, this ritual is really central. It can last a few dozen minutes in some cases, and sometimes long hours until late in the evening. We run it every week at the same time.
</li>
</ul>
<a name="sec233"></a>
<h4>2.3.3 Now the Estimation Process</h4>
<p>
Shortly put, the estimation process consists in identifying topics and evolutions in the PMC, specifying them with details, having the ARCHCOM design them and then proceeding with the breakdown in tasks for estimations, having the tasks estimated and eventually computing the overall epic or story estimations. With everything in the short-to-mid term properly estimated, the Product Manager can proceed with updating and maintaining the roadmap.
<br>
The process looks as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/483fb6a6-9c28-4cc8-b6ef-32bbc1993e81">
<img class="centered" style="width: 850px;" alt="Estimation Process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/483fb6a6-9c28-4cc8-b6ef-32bbc1993e81" />
</a>
</div>
<br>
<p>
Which reads as follows:
</p>
<ul>
<li>
The PMC decides to prioritize a new feature, a new element or an evolution, as a <b>new story</b> and we shall proceed with its estimation so that it can be positioned accurately on the roadmap.
</li>
<li>
So the product manager or a product owner will specify the story in a detailed way, perhaps with the help of the CTO, focusing on identifying its functional elements as precisely as possible. The result in our context is called a <b>detailed story</b>.
<br>
This closes the specification phase.
</li>
<li>
Now comes the design phase.
<br>
Then the CTO, a PO, an architect or sometimes a tech lead will transform the <i>detailed story</i> - which is a marketing / product formalism - into a <b>development epic</b> - which is a technical formalism covering functional design, solution design, application design, and perhaps data identification, data research needs and so on.
<br>
It can happen that the CTO himself creates some development epics for mandatory technology evolutions, refactorings, etc.
</li>
<li>
The ARCHCOM will challenge the design and other technical elements and discuss / refine open points before an ARCHCOM member is assigned the task of proceeding with the <b>breakdown into tasks</b> of the development epic. <br>
The breakdown into tasks is challenged and discussed at the ARCHCOM before being validated.
<br>
It can happen that tasks are created directly by architects or tech leads, for refactorings for instance, or transformed from delivery wishes. In such cases, they are bound to an existing epic, or a container epic is created for them (for instance for a specific Project Gap).
</li>
<li>
Then all individual tasks can be <b>estimated</b>.
</li>
<li>
With the individual task estimations finalized, the CTO or a PO will compute the total amount of <b>Story Points at Epic level</b> (a minimal sketch of this rollup follows the list below).
</li>
<li>
The CTO or a PO will then liaise with the PM to communicate the <b>estimations on the story</b>.
</li>
<li>
Finally the PM will <b>update the roadmap</b> accordingly and send the updated roadmap to the PMC. The next PMC can challenge the view, re-prioritize or end up taking a different decision regarding the evolution or new feature now that it knows how much it costs.
</li>
</ul>
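<p>
To illustrate the rollup at the end of this process, here is a minimal sketch, in Python, of how individual task estimates could be summed into the epic-level story point total that is communicated back to the PM. The data structure is invented for the illustration, not the format of any actual tool:
</p>
<pre>
def epic_story_points(tasks):
    """Sum the story point estimates of an epic's tasks.

    `tasks` is assumed to be a list of dicts with a 'name' and an
    'estimate' key (story points), as produced by the ARCHCOM
    breakdown - a hypothetical structure, purely for illustration.
    """
    unestimated = [t["name"] for t in tasks if t.get("estimate") is None]
    if unestimated:
        # The rollup is only meaningful once the breakdown is fully estimated.
        raise ValueError(f"Tasks not estimated yet: {unestimated}")
    return sum(t["estimate"] for t in tasks)

tasks = [
    {"name": "functional design", "estimate": 5},
    {"name": "backend implementation", "estimate": 13},
    {"name": "UI integration", "estimate": 8},
    {"name": "automated tests", "estimate": 5},
]
print(epic_story_points(tasks))  # 31 story points at epic level
</pre>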
<a name="sec234"></a>
<h4>2.3.4 Team Sprint velocity</h4>
<p>
The approach is fairly usual: one monitors the number of story points a team has been able to implement in each of the past five sprints. The extreme values are then eliminated because they are not meaningful (people go on holiday, people get sick, some tasks get postponed to another sprint, etc.).
<br>
We keep the second lowest and the second highest values, which give us a lower and an upper bound. This range addresses the uncertainty inherent to software development.
</p>
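<p>
As a minimal sketch, assuming exactly five monitored sprints, the computation described above could look as follows in Python:
</p>
<pre>
def sprint_velocity_bounds(completed_points):
    """Compute lower/upper velocity bounds from the past five sprints:
    sort the story point counts, eliminate the extreme values, and keep
    the second lowest and second highest values as the bounds."""
    assert len(completed_points) == 5, "monitor the past five sprints"
    ordered = sorted(completed_points)
    return ordered[1], ordered[-2]

# Example: story points completed over the past five sprints.
lower, upper = sprint_velocity_bounds([21, 34, 27, 8, 30])
print(lower, upper)  # 21 30 - the defensive lower bound (21) is used for planning
</pre>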
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6bb888cc-6c23-4042-bab1-e5917dc74a68">
<img class="centered" style="width: 700px;" alt="Team Sprint capacity" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6bb888cc-6c23-4042-bab1-e5917dc74a68" />
</a>
<div class="centered">
Computing Team Sprint capacity
</div>
</div>
<br>
<p>
As far as estimations and planning are concerned, we will take the lower bound to compute the <b>team sprint velocity</b> since we have to be defensive.
</p>
<a name="sec235"></a>
<h4>2.3.5 Forecasting</h4>
<p>
Now that we know how many Story Points a team can do in a sprint, we know how many it can do in a month of 2 sprints (the additional days are ignored, they form a reserve).
<br>
And if we know how many Story Points a team can do in a month, then we know how many months a team needs to implement any given story that has been properly estimated.
<br>
Boom. Done.
</p>
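<p>
A minimal sketch of this forecasting arithmetic, in Python, using the defensive lower bound and two sprints per month as described above (rounding up to whole months as a basic reserve):
</p>
<pre>
import math

def months_to_deliver(epic_points, velocity_lower_bound, sprints_per_month=2):
    """Forecast the number of months needed to implement an estimated epic,
    using the defensive (lower bound) team sprint velocity."""
    points_per_month = velocity_lower_bound * sprints_per_month
    return math.ceil(epic_points / points_per_month)

# Example: a 170-point epic with a defensive velocity of 21 points per sprint.
print(months_to_deliver(170, 21))  # 5 months (42 points per month)
</pre>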
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/0fda5e39-7a75-4b88-ad8e-6cb7c738a027">
<img class="centered" style="width: 650px;" alt="Forecasting delivery dates" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/0fda5e39-7a75-4b88-ad8e-6cb7c738a027" />
</a>
<div class="centered">
Forecasting delivery dates
</div>
</div>
<br>
<p>
As far as delivery date estimations are concerned, we will take the pessimistic estimation coming from the lower bound of the Team Sprint capacity.
</p>
<a name="sec236"></a>
<h4>2.3.6 The Estimation process - Takeaways</h4>
<p>
Summing things up:
</p>
<ul>
<li>
First, a note about something specific to our context in my current company. In our model, the project gaps are the only items on which we decide to have hard commitments.
<br>
In our model, we know that we have to start any given customer delivery project at a precise moment in time and that it has to be in production at another precise moment in time. So we need to make sure we deliver the missing functionalities sufficiently in advance for our integration or delivery teams to be able to integrate them at the customer in time.
<br>
So we don't have much of a choice, we need to have at least some elements of our roadmap that are strong commitments. And in our model, these are the project gaps.
<br>
To handle this, we make sure the project gaps are well balanced between the different teams. They become the only items from the roadmap that can't be moved, they are frozen items; we want to ensure that we will respect their delivery dates.
<br>
The way we do that is that we consider them as frozen once they are scheduled. Once they are planned, they can't move anymore.
</li>
<li>
The estimation game tells us the total story points for each and every story or epic that we want to track on the roadmap. Combining these amounts with the team sprint velocity, we know how many sprints and how many months will be required to implement a story or an epic, so we can put it on the roadmap. The key word here is <i>reserve!</i> Take reserves, more reserves and even more reserves.
<br>
In our model, sufficient reserve is taken by the fact that we schedule these elements for the end of a period: the end of a month in the three-to-six-months timeline, then the end of a quarter in the next periods, or even the end of the year in the following period. By doing so, most of the time, we find that we have taken sufficient reserve when laying these elements down on the roadmap.
</li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/538bd676-bf44-4c16-aeb7-3f07d573febf">
<img class="centered" style="width: 850px;" alt="Estimation Process Takeaways" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/538bd676-bf44-4c16-aeb7-3f07d573febf" />
</a>
<div class="centered">
Estimation Process Takeaways
</div>
</div>
<br>
<a name="sec24"></a>
<h3>2.4 D. Roadmap timeline</h3>
<p>
The roadmap timeline is also an important aspect. The whole principle is to avoid having a timeline with too many different deadlines, since they would be impossible to follow and only bring noise; but we need a sufficient number of them, since they form our scheduling units.
<br>
A good number of scheduling deadlines is eight. Eight deadlines are easy to follow and, with a reducing-granularity approach, they provide enough scheduling and synchronization points to successfully draw a 24- to 36-month roadmap. A sketch of this bucketing follows the list below.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a2f03749-a2e2-4647-b171-8b9a9d2429f2">
<img class="centered" style="width: 800px;" alt="Roadmap timeline" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a2f03749-a2e2-4647-b171-8b9a9d2429f2" />
</a>
<div class="centered">
Roadmap timeline
</div>
</div>
<br>
<p>
Which reads as follows:
</p>
<ul>
<li>
The <b>Next 3 months</b> are followed on three elements. This means more than 1/3 of the roadmap is dedicated to the shortest term, reflecting the fact that we follow and monitor things carefully on this short-term roadmap. In addition, we have strong commitments there, and any change to the activity plan of the next three months should be considered and weighed very carefully since we commit on most of these elements.
</li>
<li>
The next 2 elements are related to the <b>following 2 quarters</b>. This means 1/4 of the roadmap timeline is dedicated to a 6-month period, an indication that we want fine control over the next priorities past the shortest-term 3-month period. This is where changes can happen frequently as we discover our market or sign new deals. Priorities can change there, but it has to be backed by strong business objectives.
</li>
<li>
The following 2 elements are related to the <b>following 2 semesters</b>. This is the mid-term roadmap, basically related to <i>what we will do in a year or so</i>. This is rather indicative and <b>estimations there do not necessarily need to be precise</b>; a <i>T-shirt sizing</i> approach can be sufficient.
<br>
The elements on the mid-term roadmap still need to be realistic and correspond to the vision we have today, even though that vision can entirely change.
</li>
<li>
The last element, the <b>following year</b>, is really the long-term roadmap: a basket of ideas.
</li>
</ul>
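<p>
To make the reducing-granularity timeline concrete, here is a minimal sketch, in Python, of how the eight rolling buckets could be computed. The boundary logic is my own simplified illustration of the scheme described above, not an actual tool. Note that 3 months + 2 quarters + 2 semesters + 1 year adds up to 33 months, within the 24- to 36-month range:
</p>
<pre>
from datetime import date

def roadmap_buckets(today: date):
    """Build the eight rolling roadmap buckets: 3 months, then 2 quarters,
    then 2 semesters, then 1 year (simplified boundary handling)."""
    def shift(d: date, months: int) -> date:
        # First day of the month `months` months after `d`.
        m = d.month - 1 + months
        return date(d.year + m // 12, m % 12 + 1, 1)

    cursor = date(today.year, today.month, 1)
    periods = ([(1, "month")] * 3 + [(3, "quarter")] * 2
               + [(6, "semester")] * 2 + [(12, "year")])
    buckets = []
    for span, label in periods:
        nxt = shift(cursor, span)
        buckets.append((label, cursor, nxt))
        cursor = nxt
    return buckets

for label, begin, end in roadmap_buckets(date(2021, 10, 15)):
    print(f"{label:8s} {begin} -> {end}")
</pre>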
<a name="sec241"></a>
<h4>2.4.1 Monthly roadmap update</h4>
<p>
The roadmap is updated every month, and a trick is needed to ensure that whatever happens and wherever we are within the year, we keep tracking 8 buckets - not more and, importantly, not less.
<br>
This is done by tricking the vision and adapting some periods to fit the specific point of the year at which we are.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2edd8777-8816-4953-b667-e35a6b80f263">
<img class="centered" style="width: 850px;" alt="Timeline Update" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2edd8777-8816-4953-b667-e35a6b80f263" />
</a>
<div class="centered">
Roadmap timeline update
</div>
</div>
<br>
<p>
By tricking the periods this way every month, we make sure that we only follow eight different buckets in the roadmap, since anything beyond eight wouldn't make a whole lot of sense, at least in our model.
</p>
<p>
Again, one aspect that is absolutely critical to being in a position to produce reliable forecasts is to avoid at all costs <i>stop-the-world</i> events (refactorings, releases, etc.)
<br>
<i>Stop-the-world</i> releases, refactorings, or big technology evolutions that require synchronizing the teams kill the ability of a team to work independently and autonomously - and eventually, its ability to respect the forecast, schedule and planning. These have to be avoided at all costs.
<br>
And interestingly, even though implementing a big refactoring or technology evolution without <i>stopping the world</i> is much more difficult, from a purely technical standpoint, most of the time, it's <b>absolutely doable</b>. It will be more expensive. It will indeed take longer. But it's absolutely doable.
<br>
And it's absolutely worth it. Because avoiding <i>stop-the-world</i> events enables the teams to respect and fulfill expectations in terms of forecasts and planning.
</p>
<a name="sec242"></a>
<h4>2.4.2 Back on Continuous Delivery</h4>
<p>
Continuous delivery is seen as very hard to reach by software development teams working on SaaS platforms.
<br>
But think of the web giants. Think of what they are doing ...
<br>
Let's see some examples. Amazon deploys code to production every 11 seconds on average. Netflix pushes code to production 1000 times a day. They have even developed <i>Chaos Monkey</i>, a piece of software that literally kills VMs, services and containers all the time in production, as a way to force developers to <i>design for failure</i>. Facebook, before 2016, cherry-picked features into a few release trains a day. They don't do that anymore: they push the master branch directly, in a dozen release trains a day.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/16cff642-0b83-4b41-a082-cbca59af25ce">
<img class="centered" style="width: 850px;" alt="Web Giants continuous delivery" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/16cff642-0b83-4b41-a082-cbca59af25ce" />
</a>
</div>
<br>
<p>
Now that's what the Web Giants manage to do. If you embrace the same techniques - they use state-of-the-art XP, agile, Scrum, DevOps and lean startup principles and practices - then you can do this as well.
</p>
<p>
And even if you do <i>on-premise</i> deployments, it still makes sense to do continuous delivery since it forces you to automate your release process.
<br>
As far as the development teams are concerned, every sprint ends with a shippable and production-ready version of the product. And because database migration scripts are maintained alongside the software, you end up in a situation where it's perfectly feasible, from a technology standpoint, to push this version to a customer. It's perfectly doable because you have designed and developed the product this way.
<br>
If the development teams stick to this, then releasing the product as an actual end-customer version becomes purely a Product Management decision, no longer a technical concern. At the end of every feature development, Product Management has the possibility to transform the internal release into a customer release and give it a proper version number.
</p>
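<p>
As an aside on maintaining database migration scripts alongside the software: a minimal sketch of a versioned migration runner could look as follows, in Python with SQLite. The file layout and table name are my own assumptions for the illustration; real products typically rely on dedicated tools such as Flyway or Liquibase:
</p>
<pre>
import sqlite3
from pathlib import Path

def apply_migrations(db_path: str, migrations_dir: str):
    """Apply versioned SQL migration scripts exactly once, in order.

    Assumes scripts named like 001_create_users.sql, 002_add_index.sql
    living in `migrations_dir` - a hypothetical layout for illustration.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_version")}
    for script in sorted(Path(migrations_dir).glob("*.sql")):
        if script.name not in applied:
            conn.executescript(script.read_text())
            conn.execute("INSERT INTO schema_version VALUES (?)", (script.name,))
            conn.commit()
    conn.close()
</pre>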
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4a1c4998-6939-4add-b745-94e129fba4c6">
<img class="centered" style="width: 850px;" alt="Releases" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4a1c4998-6939-4add-b745-94e129fba4c6" />
</a>
<div class="centered">
Releases
</div>
</div>
<br>
<a name="sec3"></a>
<h2>3. Conclusions and final notes</h2>
<p>
In terms of conclusion, I would say that embracing the core Agile practices from XP, Scrum, DevOps and Lean Startup enables a company to have reliable forecasting and planning abilities, which eventually lead to designing a relevant and useful roadmap.
<br>
These principles and practices enable the company to have autonomous and independent teams, which are then able to work on a development topic or technology evolution without interruptions and without friction with other teams - key to keeping their real-life development pace aligned with estimations and forecasts.
<br>
In addition, multiple autonomous teams can work in parallel on multiple development topics, which enables the company to have a fair share of development across the multiple Horizons.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/82a6044e-22b4-4d9b-8cf6-61780600a0ac">
<img class="centered" style="width: 850px;" alt="Conclusion - Agile roadmap requirements" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/82a6044e-22b4-4d9b-8cf6-61780600a0ac" />
</a>
</div>
<br>
<p>
Some tools are required to give life to all of this and I will leave the reader to discover them hereunder:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4dd1a1b4-d127-4265-bf4d-ec57c4019914">
<img class="centered" style="width: 850px;" alt="Tools" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4dd1a1b4-d127-4265-bf4d-ec57c4019914" />
</a>
</div>
<br>
<p>
My final words would be that such an Agile roadmap has the potential to be much more than a communication tool: it's really an internal alignment tool and provides the high-level management view to align everyone in the company towards the common product development objectives.
</p>
<p>
This article is available as a <a href="https://www.slideshare.net/JrmeKehrli/a-proposed-framework-for-agile-roadmap-design-and-maintenance">Slideshare presentation</a>
</p>
https://www.niceideas.ch/roller2/badtrash/entry/netguardians-3d-ai-technology
NetGuardians' 3D AI Technology
Jerome Kehrli
2021-06-08T03:18:54-04:00
2021-09-03T05:12:10-04:00
<p>
<i>(Article initially published on <a href="https://blog.netguardians.ch/netguardians-3d-ai-delivers-global-analytics-supremacy-in-fraud-detection">NetGuardians' blog</a>)</i>
</p>
<p>
Whenever our software is run head-to-head in a pitch situation against that of our rivals, we always come out top. We always find more fraud with a lower number of alerts. For some, this is a surprise – after all, we are one of the youngest companies in our field and one of the smallest. To us, it is no surprise. It is testament to our superior analytics.
</p>
<h2>A focus on customer behavior</h2>
<p>
We began working in fraud prevention in 2013 and quickly realized the futility of rules engines in this endless game of cat-and-mouse with the fraudsters. The criminals will always tweak and reinvent their scams; those trying to stop the fraud with rules engines will always be left desperately working as fast as possible to identify and incorporate the latest scams into their surveillance. Far better to focus on what we know changes very little – customer behavior.
</p>
<p>
If a bank knows how a customer spends money, it can spot when something is awry by looking for anomalies in transaction data. However meticulous the fraudster is at trying to hide, every fraudulent transaction will have anomalous characteristics. People’s lives are constantly changing – they buy from new suppliers, they move house, go on holiday and their children grow up – all of which will affect their spending and transaction data. Every change will throw up false alerts that will undermine the customer experience unless you train your models correctly.
</p>
<h2>The three pillars of 3D AI</h2>
<p>
We train our models using what we call our 3D AI approach. This enables them to assess the risk associated with any transaction with extraordinary accuracy, even if it involves new behavior by the customer. This also keeps false alerts to the minimum.
</p>
<div class="centering">
<a href=" https://www.niceideas.ch/roller2/badtrash/mediaresource/f0469115-795b-45d9-be64-e5059c14745f">
<img class="centered" style="width: 400px;" alt="3D AI" src=" https://www.niceideas.ch/roller2/badtrash/mediaresource/f0469115-795b-45d9-be64-e5059c14745f" />
</a>
</div>
<br>
<p>
Developed by us at NetGuardians, this approach has three pillars, each of which uses artificial intelligence (AI) to constantly update and hone the models.
</p>
<p>
The pillars are: <b>anomaly detection</b>, <b>fraud-recognition</b> training analytics and <b>adaptive feedback</b>. Together, they give our software a very real advantage by not only spotting fraud and helping banks stop fraudulent payments before any money has left the account, but also by minimizing friction and giving the best possible customer experience. This is what differentiates our software in head-to-head pitches.
</p>
<p>
The first pillar is <b>anomaly detection</b>, which is mostly <b>unsupervised learning</b>. At this stage, we are looking for anomalies and working out the level of risk associated with them. This involves examining a set of parameters such as time of transaction, counterparty, location, amount and currency. As seen above, on their own, these parameters aren’t enough to prevent an unacceptable level of false alerts. By including peer-group behavior, we begin to reduce the number of false alerts. For example, people don’t buy a car often, and when they do, they don’t want their bank to block and query the transaction because of its rarity. But if you place the customer among peers, the size of the transaction can trigger associations that bring its risk level down. This, along with other techniques including Poisson law, helps us understand the timing and regularity of the transaction, resulting in a highly nuanced picture of risk.
</p>
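<p>
As a rough illustration of the kind of timing and regularity analysis mentioned above - my own simplified sketch, not NetGuardians' actual model - a Poisson model gives the probability of observing an unusually high transaction count in a window where a customer historically averages a much lower rate:
</p>
<pre>
import math

def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam): probability of seeing k or more
    events in a window where lam events are expected on average."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# A customer usually makes ~2 international transfers a month; this month: 9.
surprise = poisson_tail(9, 2.0)
print(f"P(>= 9 transfers | mean 2) = {surprise:.5f}")  # ~0.0002 -> highly anomalous
</pre>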
<p>
The second pillar is <b>fraud-recognition</b> training analytics, mostly relying on <b>supervised learning</b> techniques. Typically, a T2 or T3 bank might see 10 frauds a year out of 100 million transactions. This level of fraud is far too low to train a complex algorithm. So we have developed models using anonymized bank data that can be overlaid on a bank’s own data. The models cover different situations, regions, sizes of bank and types of customer, which allows us to create analytics that look at the context of the data. Our software is even capable of deciding which model is best for the analysis.
</p>
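<p>
By way of illustration only - a generic supervised-learning sketch with invented features and data, not the actual NetGuardians models - training a fraud classifier on labeled transaction features could look like this (requires scikit-learn):
</p>
<pre>
from sklearn.ensemble import RandomForestClassifier

# Invented features: amount, hour of day, is_new_beneficiary, country risk.
X = [
    [120.0, 14, 0, 0.1],   # ordinary daytime payment, known beneficiary
    [80.0, 10, 0, 0.1],
    [9500.0, 3, 1, 0.9],   # large night-time payment, new risky beneficiary
    [50.0, 18, 0, 0.2],
    [7200.0, 2, 1, 0.8],
]
y = [0, 0, 1, 0, 1]        # 1 = known fraud case

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict_proba([[8800.0, 4, 1, 0.85]])[0][1])  # fraud probability
</pre>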
<p>
For a T1 bank, it takes just a few hours to train the algorithms, with perhaps some manual intervention to confirm the final choice of models. Smaller banks take one to two hours and the process is fully automated.
</p>
<p>
The final element is <b>adaptive feedback</b> using <b>active learning</b>. This is absolutely crucial to reduce false alerts to the lowest possible level while minimizing the risk of missing a fraud. Our adaptive feedback technology monitors, controls, challenges and supervises feedback from the alert investigators – the bank’s back- and middle-office employees who review alerts and decide when to call the customer – to ensure that it is of sufficient quality before re-injecting it into the machine learning models. As this is unique to the NetGuardians approach, it’s worth going into a little more detail.
</p>
<h2>Fine-tuning and machine learning</h2>
<p>
All alerts raised by the NetGuardians software come up on a dashboard and have to be reviewed manually. If the alert turns out not to be a fraud, we ask the bank to classify the risk of the transaction.
</p>
<ul>
<li><b>High</b> = not a fraud but confirmed as high-risk, so continue to block similar transactions</li>
<li><b>Medium</b> = not a fraud but can see why NetGuardians software thought it would be. Proceed with care</li>
<li><b>Low</b> = not a fraud and unclear why the software thought it was. Never block similar again</li>
</ul>
<p>
This feedback retrains unsupervised and supervised learning and is key to the precision of our solution. Further tuning comes from the ability of our software to query feedback to ensure it is high quality.
</p>
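<p>
A hedged sketch of how such graded feedback could be folded back into the training data - the labels and weights below are my own illustration of the idea, not the actual implementation:
</p>
<pre>
# Illustrative mapping of investigator feedback to training labels and
# sample weights. The high/medium/low semantics follow the list above;
# the numeric weights are invented for the sketch.
FEEDBACK_WEIGHTS = {
    "fraud": (1, 1.0),   # confirmed fraud: strong positive example
    "high": (1, 0.8),    # not fraud, but keep blocking similar transactions
    "medium": (0, 0.4),  # not fraud, plausible alert: weak negative example
    "low": (0, 1.0),     # clearly wrong alert: strong negative example
}

def to_training_example(features, feedback):
    """Turn an investigated alert into a (features, label, weight) triple
    that can be re-injected into the supervised models."""
    label, weight = FEEDBACK_WEIGHTS[feedback]
    return features, label, weight

print(to_training_example([8800.0, 4, 1, 0.85], "medium"))
</pre>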
<p>
For example, someone has been working in fraud investigation for many years. They have “learnt” that usually just two parameters reveal suspicious behavior – let’s say the amount and the beneficiary – and use only those two to decide the level of risk. When applied over and over again, a model would learn to look at just those two features, undermining its ability to see and learn from the whole picture – a form of “overfitting”. This reduces its accuracy. But our software does something very clever. If it believes something is anomalous, it will ask the investigator a direct question about another parameter, forcing them to re-examine the transaction. It then uses this additional feedback to learn more about the customer, the bank and the transactions.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/56510626-5ea6-4a69-bc0b-a0e8c35dfe06">
<img class="centered" style="width: 850px;" alt="3D AI Cycle" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/56510626-5ea6-4a69-bc0b-a0e8c35dfe06" />
</a>
</div>
<br>
<p>
Taken together, these three pillars of AI are responsible for our market-beating performance. Only NetGuardians uses these three together. And only NetGuardians is able to find all the fraud a bank knows about in historic data as well as up to nearly one-fifth more. Our software doesn’t need to be taught new fraud types because it isn’t looking for them. And it doesn’t add customer friction – in fact it reduces false alerts by as much as 83 percent. This is because it is always learning and refining its models across the broadest possible perspective, resulting in our analytics supremacy.
</p>
<p>
<i>(Article initially published on <a href="https://blog.netguardians.ch/netguardians-3d-ai-delivers-global-analytics-supremacy-in-fraud-detection">NetGuardians' blog</a>)</i>
</p>
https://www.niceideas.ch/roller2/badtrash/entry/covid-19-just-get-the
COVID-19 - Just get the vaccine as soon as possible!
Jerome Kehrli
2021-05-11T10:57:10-04:00
2021-05-11T10:57:10-04:00
<p>
Well. For once, this will be an article far away from the kind of stuff I usually post on this blog. This is a post about COVID-19 and its vaccines. But it's perhaps the most important thing I will ever have written.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/74aa55fc-6df1-4a81-abc7-410f541dade5">
<img class="centered" style="width: 400px;" alt="sars-COV-2" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/74aa55fc-6df1-4a81-abc7-410f541dade5" />
</a>
</div>
<br>
<p>
Since the start of the week, various events have cruelly reminded us of how dangerous COVID is.
<br>
Particularly the situation in India, where infections are increasing at an absolutely terrifying rate, is very worrying. India this week set a world record for new daily COVID cases with more than 400,000 daily infections in recent days (and this is probably underestimated) and more than 2,000 daily deaths.
</p>
<p>
It is time for everyone to realize that our only way out of this long-term humanitarian disaster is clearly through herd immunity.
<br>
We have finally had an incredible chance for a few months: in Europe and the US we have wide access to many vaccines. The speed of vaccination has increased.
And that is great.
</p>
<p>
The reason that prompts me to write today is that the #AntiVax are now emerging as the main pitfall on the way out of this disaster and that we - scientists or simply educated people - have the responsibility to react. All over Western countries, the same signs are being seen: some vaccination centers have already gone from insufficient supply a few weeks ago to insufficient demand today.
And that is a disaster.
</p>
<ul>
<li>
For example, in the US, according to the CDC, the average daily immunization rate has dropped by 20% since the beginning of April. A lot of centers find themselves with excess doses and appointments remaining available, for example in South Carolina (see US media). Another example: in Palm Beach County, where a large number of vaccination centers have opened in recent weeks, the health department announces that 10,000 available appointments across the various sites remain vacant.
</li>
<li>
In Europe, the situation is unfortunately no better. In France, for instance, several vaccination centers report that people are not registering for the vaccine as quickly as they had expected and that others are not showing up. The bottom line is that multiple doses are left unused at the end of the day and some must be thrown away. This is madness.
</li>
</ul>
<p>
A worrying and growing number of people are reluctant to get the free COVID vaccine.
<br>
However, these vaccines could save not only your life, but also the life of the people around you.
</p>
<p>
The bottom line is that for the Coronavirus, the herd immunity threshold would be between 70 and 90% of the population. It is this threshold that we have to reach as quickly as possible. This must be our most sacred goal.
</p>
<p>
In the US, for example, a survey found that while 60% of American adults have either received a first dose of the vaccine or wish to be vaccinated, 18% respond "maybe" and 22% categorically refuse.
<br>
Let us imagine that in the long term the undecided ones convince themselves to get vaccinated: that only places us at 78%. And unfortunately, this poll does not take into account children, who are not currently eligible for vaccination and who represent about 22% of the population. From this perspective, achieving herd immunity today seems illusory.
</p>
<p>
We must therefore ensure that the vaccine is given to as many adults as possible as quickly as possible. And that means we really all need to get vaccinated.
</p>
<p>
I can understand that young, athletic and healthy people do not see the benefits of vaccination. We might not get seriously ill from COVID or even get sick at all, right?
<br>
But it could always be inadvertently passed on to someone who could then die.
<br>
And only vaccinating people at risk is not satisfactory, since the circulation of the virus would not be stopped. And this is the real problem: the more the virus circulates, the more likely it is that we will see mutations appear that make it more dangerous, and perhaps even strains capable of fully resisting current vaccines, thus bringing us back to the starting point.
</p>
<p>
The point is, the current hesitation over vaccination is terribly problematic.
</p>
<p>
So in the rest of this article, I would like to review and discuss most of the "arguments" of the #Antivax groups.
</p>
<p>
<b>VACCINES DO NOT WORK. THEY DO NOT PREVENT YOU FROM GETTING SICK OR INFECTING OTHERS</b>
</p>
<p>
We hear that often and I kind of understand it. Especially when you have to receive two injections, you really want the thing to be foolproof (nothing is foolproof).
</p>
<p>
The argument most often made by conspirators is the fact that public health departments in Western countries recommend (or even force) vaccinated people to continue wearing masks when in contact with vulnerable people.
<br>
However, this in no way demonstrates that vaccines are useless.
</p>
<p>
Clinical trials have shown that vaccines are amazingly successful in preventing people from contracting a severe form of the disease.
<br>
As to whether they protect you from the spread of the virus, the tests were not designed to assess that aspect. But the evidence so far indicates that they have significantly reduced transmission. The best proof of this is Israel, which with 80% of its population vaccinated has virtually eliminated the circulation of the virus.
</p>
<p>
The reason masks and distancing are still recommended is because public health organizations are as cautious as possible and apply the precautionary principle. In my opinion, that makes sense during a global pandemic, right?
</p>
<p>
What is striking in all of this is the fact that the #AntiVax groups were impatiently awaiting an event like this. They have a unique window of opportunity to propagate vaccine doubt, and they have profited greatly.
<br>
And what's really scary is that they don't even really need to convince people that they're right. They just need to convince people that nobody really knows anything for sure about the situation and the vaccines to get people to buy into myths, with lots of conspiracy videos spreading untruths and studies and claims taken out of context.
<br>
So in the US, for example, only 4% of the population believes that the COVID vaccine is more dangerous than the disease. But 25% say they don't know. And this is a tragedy, because if the #AntiVax groups manage to spread enough disinformation to convince people to "not be vaccinated for lack of information", then they ruin any chance of getting out of this crisis for everyone.
</p>
<p>
To be clear, I'm sure most of the hesitant aren't fanatics or conspirators, most are just people trying to make the best decisions for themselves and their families. We all know a bunch of people like that. But by swallowing the myths of the conspirators, they are complicit in the current situation.
</p>
<p>
<b>BILL GATES USES THE VACCINE TO PUT A MICROCHIP IN EACH OF US</b>
</p>
<p>
This myth fascinates me specifically because it touches a topic with which I am more than very familiar.
<br>
And the answer is really very simple. The technology for this simply does not exist, neither today nor in the near future.
</p>
<p>
Make no mistake, we do have the technology today to miniaturize some very limited-feature devices at this scale.
<br>
But miniaturizing a worthwhile chip (with remote connection capabilities, a reliable power supply, and useful analytical capabilities or the ability to influence the human body) is simply not possible today. It's science fiction. Believe me, this is literally my job.
</p>
<p>
And also - I can't stress enough how essential this is - monitoring you, stalking you in every way possible, and observing you all day long is precisely what your phone does, with your full consent. Why would anyone bother to design a chip for this when everyone is so greedily buying smartphones with such excellent analytics and tracking capabilities?
</p>
<p>
So to put it once and for all: there is no microchip in the COVID vaccine.
<br>
This rumor is based on the fact that the Gates Foundation funded research years ago that has often been taken out of context. In this study, researchers looked at creating an invisible ink that could potentially be injected with a vaccine so that populations like refugee children could keep immunization records without paperwork. Over time, the original context has been skewed by conspirators and turned into a story about a microchip that Bill Gates wants to inject with vaccines. Which, if you think about it for a second, doesn't make any sense.
</p>
<p>
Again, Bill Gates, just like Facebook, Google, etc., already has all the means possible to track you through your smartphone.
</p>
<p>
<b>VACCINES ARE TESTED ON US, THE POPULATION</b>
</p>
<p>
Many people think that pharmaceutical companies use the general population as guinea pigs. And that's hardly encouraging, eh? No one wants to be used as a guinea pig.
</p>
<p>
It's worth understanding exactly how the vaccine got to the market so quickly. The very first thing to understand is that researchers have been working on vaccines against other coronaviruses for years. So when COVID-19 emerged, they got a big head start. "Operation Warp Speed", as it was called, was not about rushing scientific research. It was simply a matter of drastically reducing the bureaucracy that might otherwise have slowed it down.
</p>
<p>
The vaccine researchers have explained in detail (and in full, by the way, if you get the information in the right places, not on YouTube) that they were able to develop vaccines in 12 months because they were able to compress the vaccine calendar. While the various steps necessary for putting a vaccine into circulation are normally done in a linear fashion - A, then B, then C and so on up to Z - they were able, this time around, to start steps in parallel - doing F and E, for example, at the same time as A and B.
<br>
They took steps that usually happen sequentially and saved time by performing them simultaneously.
</p>
<p>
It is this aspect above all, but also the massive investments of the private sector and Western states in research, that made it possible to be so rapid - to the benefit of the population, in no way to its detriment.
</p>
<p>
<b>mRNA-BASED VACCINES MAY DAMAGE DNA AND CAUSE MUTATIONS</b>
</p>
<p>
Many have heard or read online that the Pfizer and Moderna vaccines were the first to be allowed to use messenger RNA, which is true, but it has given rise to speculation about what messenger RNA is capable of.
<br>
The most common fear is that anyone who takes these vaccines will invariably have neurological problems within a year, and most of them will die within 10 years.
<br>
Huh ... okay. I think it is important to clarify certain things.
</p>
<p>
First, other than a few conspiratorial videos that claim to have evidence that people have never seen, there is absolutely no credible evidence or study to support any of this.
<br>
As for the claim that mRNA vaccines can modify DNA, it is very important to know that messenger RNA from vaccines does not enter the genome. It's impossible. It does its job far away from the nucleus of the cell (where DNA is located).
<br>
Fear of what the vaccine contains or what it might do, however, seems to be common. For example, some evangelicals in the US fear that it contains cells from aborted fetuses, which is not the case either.
<br>
As for most of the people who let the #AntiVax crowd insinuate doubt into their minds, what they actually fear is that the vaccine (supposedly based on DNA modification) could alter how their body works.
<br>
We have heard some individuals on Facebook claim that these vaccines cause infertility; these rumors were fueled by a blog post which falsely claimed that Pfizer's vaccine causes the female body to attack a protein that plays a crucial role in the development of the placenta. It is a myth. This is absolutely wrong. Whatever these so-called experts say, there is absolutely no evidence to support this claim.
<br>
There is, however, already quite satisfactory evidence that Pfizer's vaccine does not cause infertility: during trials carried out in 2020, several women became pregnant ... and the only one to have suffered a miscarriage had received the placebo. So this claim is simply wrong.
</p>
<p>
If you have a biologist or a doctor among your friends, ask them to explain what messenger RNA is, how polymerase enzymes transcribe DNA into RNA, and how the latter is used by ribosomes to synthesize the proteins that drive our body. It is really quite fascinating.
<br>
The bottom line is: Messenger RNA is not designed to enter a cell nucleus, let alone alter its DNA. It just doesn't work that way.
</p>
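<p>
For the programmatically inclined, here is a toy sketch of that machinery in Python. It is purely illustrative: the DNA fragment is made up and the codon table is only a tiny excerpt of the real genetic code.
</p>
<pre>
# Toy illustration of the central dogma: DNA -> mRNA -> protein.
# The DNA fragment and the (partial) codon table are illustrative only.

CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UGA": "STOP"}

def transcribe(dna):
    """Transcription: polymerase produces mRNA; T becomes U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translation: the ribosome reads codons (triplets) until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("ATGTTTGGCTGA")   # hypothetical DNA fragment
print(mrna)                          # AUGUUUGGCUGA
print(translate(mrna))               # ['Met', 'Phe', 'Gly']
</pre>
<p>
Note the one-way flow: nothing in this process writes anything back into the DNA, which is precisely why the vaccine's mRNA cannot alter your genome.
</p>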
<p>
<b>THE RISKS OF VACCINES ARE GREATER THAN THE RISKS OF COVID</b>
</p>
<p>
This is a perception fueled by the constant circulation of misleading headlines about people falling ill or dying after receiving their dose of vaccine.
</p>
<p>
For example, you may have seen the widely shared story about 23 people in Norway who died within a week of receiving their first dose of the vaccine, which does sound scary. But that headline obscures an absolutely essential piece of context: at that time, in Norway, the vaccine was being given to the oldest and sickest people first, and a certain percentage of them would statistically have died that week anyway, vaccine or no vaccine. About 400 people die every week in nursing homes in Norway.
<br>
You see what I mean?
<br>
If someone who has just received their first dose dies in a car crash on the way out of the vaccination center, we should be able to rule out the vaccine as the cause of death, no?
</p>
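<p>
To make the base-rate argument concrete, here is a back-of-the-envelope calculation. All figures are rounded assumptions for illustration, not official statistics:
</p>
<pre>
# Base-rate sanity check (all numbers are rounded assumptions).
residents = 40_000             # assumed nursing-home population in Norway
weekly_deaths = 400            # deaths per week in that population, vaccine or not
vaccinated_that_week = 25_000  # assumed number of residents dosed that week

# Deaths expected within a week of vaccination by pure coincidence:
expected = weekly_deaths * vaccinated_that_week / residents
print(f"~{expected:.0f} coincidental deaths expected that week")  # ~250
</pre>
<p>
Against a coincidental background of a couple of hundred deaths in that population, 23 is not a signal - it is noise.
</p>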
<p>
When the WHO looked into these incidents, it found no unexpected or untoward increase in the number of deaths, which makes sense. Correlation is not causation. The vaccine protects against COVID, not against the concept of mortality, damn it.
<br>
It's strange to have to spell this out, but you have to know that we are all going to die one day, all of us, vaccine or no vaccine.
</p>
<p>
This also applies to the stories popping up in the US based on frightening exaggerations of data from VAERS (Vaccine Adverse Event Reporting System) - the system for collecting vaccine-related adverse events. VAERS is a database that collects all medical events following vaccination for any US citizen. The problem with VAERS is that reports can be entered by anyone and are not verified. The goal is to collect everything, and since the intended users of the system are doctors and medical researchers, the system relies on their skills to separate the wheat from the chaff themselves.
<br>
So any layman must treat the resulting data with extreme caution, which journalists obviously do not do.
<br>
A self-styled doctor once claimed that the flu shot turned him into the Hulk. And this report was duly accepted and entered into the database. Again, the goal is to collect everything.
<br>
See where this takes us, eh?
</p>
<p>
The reason the US CDC collects this data is that if a pattern emerges, action can be taken. This is exactly what happened with Johnson &amp; Johnson's vaccine: the CDC found a potential pattern of rare blood clots and, with due caution, halted the large-scale deployment of this vaccine pending further testing and updated protocols.
<br>
Skeptics pointed to this as proof that they were right and that vaccines were dangerous. But in reality, this obviously proves the opposite: the public-safety risk of vaccines is analyzed and rigorously monitored, and in any case not buried in secret, waiting for genius conspiracy theorists to reveal it to the public.
</p>
<p>
Obviously I am not claiming that there are no side effects to vaccines, of course not. But serious side effects like anaphylaxis are incredibly rare: 4.7 cases per million for Pfizer, 2.5 cases per million for Moderna. Also, it is essential to understand that these side effects occur mainly in people with a history of severe allergies.
<br>
The point is, the vast majority of people can expect at most common cold or flu symptoms in the first few days after their injection, or maybe just arm pain, or, most of the time, nothing at all.
</p>
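<p>
To get a feel for what "per million" means at the scale of a country, here is a one-liner of arithmetic; the population figure is an assumption, roughly Switzerland's:
</p>
<pre>
# Expected anaphylaxis cases from a per-million rate (illustrative figures).
rate_per_million = 4.7      # reported rate for Pfizer
population = 8_600_000      # assumed population, roughly Switzerland's

print(f"~{rate_per_million * population / 1_000_000:.0f} expected cases")  # ~40
</pre>
<p>
A few dozen cases nationwide, compared to a disease that has killed millions.
</p>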
<p>
But most importantly - the bottom line, the key thing to remember - is that no vaccine side effect is worse than the alternative: COVID, a disease that has killed more than 3 million people worldwide, while once again, to this day, there is no conclusive evidence that the vaccine has killed anyone (which is not to say that no one who received the vaccine has died, of course).
</p>
<p>
<b>IN CONCLUSION</b>
</p>
<p>
It is of course more than natural to have questions. But there are reassuring answers. You just have to look for them in the right places - so not on YouTube or Facebook.
</p>
<p>
The goal for everyone must be to achieve herd immunity, as quickly as possible. This is the only way out of this human, social and economic crisis in the long term.
<br>
And to get there, we must at all costs convince everyone who can take these vaccines to do so.
</p>
<p>
Talk to your loved ones and bring these arguments to those who hesitate, in the most benevolent way possible.
<br>
We need to get out of this crisis, and vaccines are the only solution.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/the-search-of-product-market
The Search for Product-Market Fit
Jerome Kehrli
2020-08-17T04:31:43-04:00
2023-03-30T10:06:32-04:00
<p>
The Search for Product-Market Fit is the sinews of war in a startup company. While the concept and its implications are well known by most founders, the importance of this event in the company- and product-building process - what it means to be Before-Product-Market-Fit and After-Product-Market-Fit, and the fundamental differences in terms of objectives, processes, culture, activities, etc. between these two very distinct states - is almost always underestimated or misunderstood.
</p>
<p>
Product-Market Fit is this sweet spot that startups reach when they feel that the market is really pulling the product out of them. It's what happens when they find that they can barely deliver as fast as customers buy their product, or can barely add new servers in the SaaS cloud as fast as required to sustain the rise in workload.
<br>
Product-Market Fit is so important because it has to be a turning point in the life of a young company.
</p>
<ul>
<li>Pre-Product-Market Fit, the startup needs to focus on the leanest possible ways to solve Problem-Solution Fit, define and verify its business model and eventually reach Product-Market Fit.
</li>
<li>
Post-Product-Market Fit, the company becomes a scale-up and needs to ramp up its marketing roadmap and effort, build and scale its sales team, build mission-centric departments, hire new roles and recruit new competencies, etc.
</li>
</ul>
<p>
Dan Olsen designed the following pyramid to help figure out what Product-Market Fit means (we'll be discussing it at length in this article):
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1098b6c5-6e4e-4846-84d6-0de114355ae5">
<img class="centered" style="width: 500px; " alt="Dan Olsen's Product Market Fit Pyramid" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2546b5e4-b772-4a08-a2d5-9b5d58b6f07c" />
</a>
<div class="centered">
</div>
</div>
<br>
<p>
Understanding Product-Market Fit, and being able to measure and understand whether it's been reached, is crucial. Reaching PMF should be the core focus of a startup in its search phase, and understanding whether it's been reached is key before scaling up.
<br>
This article is an in-depth overview of what Product-Market Fit means and the various perspectives on how to get there. We will present the Lean Startup fundamentals required to understand the process and the tools to reach Product-Market Fit, along with the Design Thinking fundamentals, the metrics required to measure it, etc.
</p>
<p>
This article is available as a <a href="https://www.slideshare.net/JrmeKehrli/the-search-for-productmarket-fit-237078155">Slideshare presentation</a> and a <a href="https://www.youtube.com/watch?v=k41qAv3NwqM">Youtube video</a>.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction - Product Market Fit</a>
<ul>
<li><a href="#sec11">1.1 Startups ... </a></li>
<li><a href="#sec12">1.2 From Product-Market Fit to "Lean Startup"</a></li>
<li><a href="#sec13">1.3 Defining "Product-Market Fit"</a></li>
<li><a href="#sec14">1.4 Four myths about Product-Market Fit</a></li>
<li><a href="#sec15">1.5 Feeling Product-Market Fit</a></li>
<li><a href="#sec16">1.6 Dan Olsen's PMF Pyramid</a></li>
<li><a href="#sec17">1.7 A first high-level process</a></li>
</ul></li>
<li><a href="#sec2">2. Lean Startup Fundamentals</a>
<ul>
<li><a href="#sec21">2.1 Lean Startup </a></li>
<li><a href="#sec22">2.2 Key principles </a></li>
<li><a href="#sec23">2.3 The Feedback loop </a></li>
<li><a href="#sec24">2.4 The Four steps to the Epiphany </a></li>
<li><a href="#sec25">2.5 Customer development - the practices </a></li>
<li><a href="#sec26">2.6 Get out of the building</a></li>
<li><a href="#sec27">2.7 Problem interview</a></li>
<li><a href="#sec28">2.8 Solution interview</a></li>
<li><a href="#sec29">2.9 MVP</a></li>
<li><a href="#sec210">2.10 Fail Fast</a></li>
<li><a href="#sec211">2.11 Metrics Obsession</a></li>
<li><a href="#sec212">2.12 Pivot</a></li>
<li><a href="#sec213">2.13 The Lean Canvas</a></li>
<li><a href="#sec214">2.14 The Value Proposition Canvas</a></li>
</ul></li>
<li><a href="#sec3">3. Design Thinking Fundamentals</a>
<ul>
<li><a href="#sec31">3.1 Design Thinking </a></li>
<li><a href="#sec32">3.2 The Design Thinking Process </a></li>
<li><a href="#sec33">3.3 The Design Thinking Framework </a></li>
<li><a href="#sec34">3.4 Thinking Outside of the Box </a></li>
<li><a href="#sec35">3.5 Sum-up</a></li>
</ul></li>
<li><a href="#sec4">4. Reaching Product Market fit - Different perspectives</a>
<ul>
<li><a href="#sec41">4.1 The Lean Startup Perspective </a></li>
<li><a href="#sec42">4.2 The "MVP-Centric" Perspective </a></li>
<li><a href="#sec43">4.3 The "Lean Canvas-Centric" Perspective </a></li>
<li><a href="#sec44">4.4 The "Design Thinking-Centric" Perspective </a></li>
</ul></li>
<li><a href="#sec5">5. Measure Obsession</a></li>
<ul>
<li><a href="#sec51">5.1 Net Promoter Score</a></li>
<li><a href="#sec52">5.2 CLV to CAC Ratio</a></li>
<li><a href="#sec53">5.3 Retention Ratio / Curve </a></li>
<li><a href="#sec54">5.4 Growth Rate </a></li>
<li><a href="#sec55">5.5 Further readings: pirate metrics </a></li>
</ul></li>
<li><a href="#sec6">6. Conclusion </a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction - Product Market Fit</h2>
<p>
<i>"The term product/market fit describes "the moment when a startup finally finds a widespread set of customers that resonate with its product'."</i>
<br>by Eric Ries
</p>
<p>
In this chapter, we will detail what Product-Market Fit is, what it means and how it's defined.
</p>
<a name="sec11"></a>
<h3>1.1 Startups ... </h3>
<p>
<b>A startup</b>
</p>
<p>
We can find multiple definitions of a <i>Startup</i> online:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/dc7f3495-3c31-4f89-a3a8-941756a9aa89">
<img class="centered" style="width: 600px;" alt="various definitions of a startup" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/dc7f3495-3c31-4f89-a3a8-941756a9aa89" />
</a>
</div>
<br>
<p>The <i>Most of the time ...</i> definition is, from my perspective, downright wrong. A startup is not a scaled-down version of a company. There are important inherent differences between a running company and a startup (we'll come back to this later).
</p>
<p>
The <i>Wikipedia</i> definition is better. It contains the important notion of search - a startup is indeed still searching for its Product-Market Fit - the very important notion of validating - the need to make assumptions, test them, and confirm or contradict them - and the notion of scaling.
</p>
<p>
But Eric Ries' definition is the best from my perspective since it emphasizes three important aspects of a startup:
</p>
<ol>
<li>The notion of <b>human institution</b> is better than company or project - a startup can be many things; a guy working in his garage on his idea is in some ways a startup</li>
<li>The notion of <b>new</b> product or service</li>
<li>Most importantly, the notion of <b>Extreme uncertainty</b> - that's the root of the problem</li>
</ol>
<p>
<b>Startups often fail!</b>
</p>
<p>
Most frequently cited reasons for startup failures:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2d0ca791-67b5-4f5b-8c08-e887878a1a3a">
<img class="centered" style="width: 800px; " alt="Top reasons why startup fail" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2d0ca791-67b5-4f5b-8c08-e887878a1a3a" />
</a><br>
<div class="centered">
(Source: <a href="https://www.statista.com/chart/11690/the-top-reasons-startups-fail/">CB Insights - https://www.statista.com/chart/11690/the-top-reasons-startups-fail/
</a>)
</div>
</div>
<br>
<p>
One of the main reasons why products fail is because they don't meet customer needs in a way that is better than other alternatives. That is the essence of <b>product-market fit</b>.
</p>
<p>
Most startups fail because the founders have an idea, work in a tunnel for multiple months or even years to build it, and only eventually raise their heads looking for customers and face the ugly truth: the idea may well be brilliant indeed, but there's no market. I've seen this so many times, so many times.
<br>
Whenever one has an idea for a product or a new technology, before writing a single line of code, before investing a single dollar in it, one needs to answer the single and only question that matters: <i>Is there a market for it and what is that market?</i>
</p>
<p>
Among the reasons why startups fail mentioned above, there are a few other interesting things to note:
</p>
<ul>
<li>Ran out of cash - why spend so much cash before Product-Market Fit? If you reach Product-Market Fit before heavily investing in your product / company, you can't run out of cash, because investors will be falling over themselves to put money into your product / company. When you have reached Product-Market Fit, you WILL have the data points to raise investments, BIG investments.
</li>
<li>
Not the right team - when you reach Product-Market Fit, you can raise investments and then hire the right team - founders do not necessarily need to remain CEO and COO of their company.
</li>
</ul>
<p>
Interestingly, the Lean Startup methodology, with its processes, principles and practices, gives a solution to most of the top 10 problems listed above. This is why I intend to focus a lot on Lean Startup in this article.
</p>
<a name="sec12"></a>
<h3>1.2 From Product-Market Fit to "Lean Startup"</h3>
<p>
The "<b>Product-market fit</b>" term and concept is widely misattributed to Marc Andreessen by bloggers and writers, but Andy Rachleff coined the term. In a 2007 article, "The only thing that matters," Andreessen credits Rachleff for the term and synthesizes much of Rachleff's thinking
</b>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/159d63a7-0203-4507-a1a8-a7d0c91bfb0c">
<img class="centered" style="width: 800px;" alt="History of Lean startup" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/159d63a7-0203-4507-a1a8-a7d0c91bfb0c" />
</a>
</div>
<br>
<p>
Alex Osterwalder is a Swiss guy living in Lausanne. He wrote "<i>Business Model Generation</i>", where he presents the Business Model Canvas and the lean approach to it, along with a lot of hints on how to efficiently (and cheaply) relieve uncertainties in startups with concepts such as prototyping, getting feedback from the market, challenging the status quo, etc.
<br>
Eric Ries is a Silicon Valley serial entrepreneur with failures and successes. His failures made him reflect on them and come up with the <b>Lean Startup</b> way, putting the customer at the center of the process.
<br>
Ash Maurya is an entrepreneur from Austin who, on his end, also understood most of the Lean Startup principles. In his "<i>Running Lean</i>" book he details a lot of the Lean Startup principles and practices, and he came up with the Lean Canvas, a version of Osterwalder's Business Model Canvas adapted for startups.
<br>
Steve Blank is the grandfather of the Lean Startup. He is a professor at Stanford University and a Silicon Valley serial entrepreneur (search for him on LinkedIn). Steve Blank designed the <b>Customer Development</b> methodology that he presents in his "<i>Four Steps to the Epiphany</i>" book. <i>Four Steps to the Epiphany</i> is a process for finding Product-Market Fit and eventually scaling the company.
</p>
<p>
With Lean Startup, Product-Market Fit is the step separating a startup from a scale-up, and the Customer Development methodology is the path to Product-Market Fit.
</p>
<a name="sec13"></a>
<h3>1.3 Defining "Product-Market Fit"</h3>
<p>
The product/market fit (PMF) concept was developed and named by Andy Rachleff.
</p>
<p>
It answers a question many wonder about: what correlates most with success - team, product, or market?
<br>
Or, more bluntly, what causes success? And, for those of us who are students of startup failure, what's most dangerous: a bad team, a weak product, or a poor market?
<br>
If you ask entrepreneurs or VCs which of team, product, or market is most important, many will say team.
<br>
On the other hand, if you ask engineers, many will say product.
<br>
But actually market is the most important factor in a startup's success or failure.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/de1a7c40-4a0a-48c0-a580-06e9b51bd455">
<img class="centered" style="width: 700px; " alt="Rachleff's laws of Startup" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/de1a7c40-4a0a-48c0-a580-06e9b51bd455" />
</a><br>
<div class="centered">
(Source: June 25, 2007 / Marc Andreessen - The only thing that matters (blog post))
</div>
</div>
<br>
<p>
In a great market - a market with lots of real potential customers - the <b>market pulls product out</b> of the startup.
<br>
The market needs to be fulfilled and the market will be fulfilled, by the first viable product that comes along.
<br>
The product doesn't need to be great; it just has to basically work. And, the market doesn't care how good the team is, as long as the team can produce that viable product.
<br>
Conversely, in a terrible market, you can have the best product in the world and an absolutely killer team, and it doesn't matter - you're going to fail.
</p>
<p>
You can obviously screw up a great market - and that has been done, and not infrequently - but assuming the team is baseline competent and the product is fundamentally acceptable, a great market will tend to equal success and a poor market will tend to equal failure. Market matters most.
<br>
To quote Tim Shephard: <i>"A great team is a team that will always beat a mediocre team, given the same market and product."</i>
</p>
<p>
Second question: Can't great products sometimes create huge new markets?
<br>
And the answer is yes, this is possible. But it's really exceptional.
<br>
VMWare is a good example of this since it was so profoundly transformative out of the box that it catalyzed a whole new movement towards operating system virtualization, which turned out to be a monster market. But then again this is an exception.
</p>
<p>
Rachleff's corollary of startup success gives us the first and most crucial definition of Product-Market Fit:
</p>
<div class="centering">
<div class="centered">
<b>
"Product market fit means being in a good market with a product that can satisfy that market"
</b>
</div>
</div>
<br>
<p>
<i>A good market is essential</i>, there needs to be a market and the product needs to <i>satisfy that market</i>, give a solution to the market's problem.
</p>
<p>
<b>So what is Product-Market Fit?</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/315715d1-8236-46cb-a418-55eccf47307b">
<img class="centered" style="width: 550px; " alt="Human centric View of Product Market Fit" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/315715d1-8236-46cb-a418-55eccf47307b" />
</a><br>
<div class="centered">
(Source: <a href="https://medium.com/@briantod/about-product-market-fit-what-ive-learned-about-the-goal-the-process-and-the-nuance-e7b317740f43
">https://medium.com/@briantod/about-product-market-fit-what-ive-learned-about-the-goal-the-process-and-the-nuance-e7b317740f43
</a>)
</div>
</div>
<br>
<p>
Looking at things from a Human-Centered Design perspective, Product-Market Fit is the intersection between:
</p>
<ul>
<li>A problem that a sizeable group of people really need solved = i.e.: Desirability</li>
<li>A product that can actually be built well to fully solve that problem = i.e.: Feasibility</li>
<li>A business model that can be executed to be profitable at some point in time = i.e.: Viability</li>
</ul>
<p>
If all three of these elements are not well identified, assessed, controlled and balanced, you can't reach PMF.
</p>
<ul>
<li>If desirability is weak, not many people want what you can make and believe you can sell => waste</li>
<li>If feasibility is weak, you are not able to build the product => failure</li>
<li>If viability is weak, you're not making money => failure</li>
<li>Etc.</li>
</ul>
<p>
<b>Another view</b>
</p>
<p>
Product-Market Fit occurs when your product or service solves a problem that directly affects your target customers/audience. But it can't be just a few people; it's got to fill a gap and fix a problem for a large market. So when approximately 40% of your customers say that they can't imagine living or working without it, then you know you have Product-Market Fit.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/993c07ab-aca2-49f6-b506-665737cff7ac">
<img class="centered" style="width: 450px; " alt="Human centric View of Product Market Fit" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/993c07ab-aca2-49f6-b506-665737cff7ac" />
</a><br>
<div class="centered">
(Source: <a href="https://startupdevkit.com/guide-to-product-market-fit-with-everything-you-need-to-know/">https://startupdevkit.com/guide-to-product-market-fit-with-everything-you-need-to-know/
</a>)
</div>
</div>
<br>
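<p>
That 40% threshold is often operationalized as the "Sean Ellis test": survey your users on how disappointed they would be if they could no longer use your product. Below is a minimal sketch of scoring such a survey; the response data is made up for illustration:
</p>
<pre>
# Sean Ellis test: share of users who would be "very disappointed"
# without the product. Response data below is made up for illustration.
from collections import Counter

responses = (["very disappointed"] * 45
             + ["somewhat disappointed"] * 35
             + ["not disappointed"] * 20)

share = Counter(responses)["very disappointed"] / len(responses)
print(f"{share:.0%} very disappointed")                      # 45%
print("PMF signal!" if share >= 0.40 else "Keep searching.")
</pre>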
<p>
However, there are times when there's a gap in the market but most of the target audience doesn't know about it. That is, they don't know about the problem until you show them how your solution improves their lives. By doing this, you create the need in the market and you directly solve an existing problem in their niche or market.
</p>
<a name="sec14"></a>
<h3>1.4 Four myths about Product-Market Fit</h3>
<p>
(Source : <a href="https://blog.pmarca.com/2010/03/20/the-revenge-of-the-fat-guy/">https://blog.pmarca.com/2010/03/20/the-revenge-of-the-fat-guy/</a>)
</p>
<p>
<b>MYTH 1 : Product market fit is always a discrete, big bang event</b>
</p>
<p>
Some companies achieve primary product-market fit in one big bang. But most don't, instead getting there through partial fits, a few false alarms, and a big pile of perseverance.
Most of the time it's a lot of trial and error, circling around it until finally all the indicators are green.
</p>
<p>
<b>MYTH 2: It's patently obvious when you have product market fit</b>
</p>
<p>
I am sure that Twitter knew when it achieved product market fit, but it's far blurrier for most startups. Twitter is a good example because most of these myths were actually true for Twitter. But Twitter is an exception.
<br>
Determining whether you have reached product-market fit requires a very good understanding of your market and your product, and a lot of thought put into coming up with the proper metrics and indicators and the recipe to interpret them.
<br>
Pretty much every product and market will require a different set of indicators and a specific way to interpret them.
</p>
<p>
<b>MYTH 3: Once you achieve product market fit, you can't lose it.</b>
<br>
And
<br>
<b>MYTH 4: Once you have product-market fit, you don't have to sweat the competition.</b>
</p>
<p>
These two myths really go together and are obviously wrong.
<br>
It's fine to stay lean if you are not quite sure that you have product market fit and there are no competitors in your face every day. But usually there are. In fact, the best markets are usually the ones in which competition is fierce because the opportunity is big. The number and quality of competitors is actually a fairly good indicator of the market.
<br>
The big principle here is that post-PMF, monitoring and watching competitors closely should be an everyday concern, just like staying very close to the customer and continuing to run lean.
</p>
<a name="sec15"></a>
<h3>1.5 Feeling Product-Market Fit</h3>
<p>
These two quotes from Marc Andreessen are spot on and give a good perspective on how Product Market Fit can be felt.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a6571420-ca42-4ac2-ac23-1cb9a2c47874">
<img class="centered" style="width: 650px; " alt="Feeling Product Market Fit" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a6571420-ca42-4ac2-ac23-1cb9a2c47874" />
</a><br>
<div class="centered">
(Source: <a href="https://hackernoon.com/product-market-fit-heres-why-youre-probably-confused-about-it-1dr73zgd
">https://hackernoon.com/product-market-fit-heres-why-youre-probably-confused-about-it-1dr73zgd
</a>)
</div>
</div>
<br>
<a name="sec16"></a>
<h3>1.6 Dan Olsen's PMF Pyramid</h3>
<p>
Dan Olsen represents Product-Market Fit using the following pyramid:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1098b6c5-6e4e-4846-84d6-0de114355ae5">
<img class="centered" style="width: 500px; " alt="Product Market Fit Pyramid" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2546b5e4-b772-4a08-a2d5-9b5d58b6f07c" />
</a><br>
<div class="centered">
(Source: <a href="https://www.mindtheproduct.com/the-playbook-for-achieving-product-market-fit/
">https://www.mindtheproduct.com/the-playbook-for-achieving-product-market-fit/
</a>)
</div>
</div>
<br>
<p>
<b>The Problem Space</b>
</p>
<p>
A market is a set of related customer needs, which rests squarely in the problem space - you could say that "problems" define a market, not "solutions". A market is not tied to any specific solution that meets market needs; it is a broader space. No product or design exists in the problem space. Instead, the problem space is where all the customer needs that you'd like your product to deliver live. You shouldn't interpret the word "needs" too narrowly: whether it's a customer pain point, a desire, a job to be done, or a user story, it lives in the problem space.
</p>
<p>
<b>The Solution Space</b>
</p>
<p>
Any product, or any product design - such as mock-ups, wireframes, or prototypes - depends on and is built upon the problem space, but lives in the solution space. So we can say the problem space is at the base of the solution space. The solution space includes any product, or representation of a product, that is used by or intended for use by a customer. When you build a product, you have chosen a specific implementation: whether you've done so explicitly or not, you've determined how the product looks, what it does, and how it works.
</p>
<p>
<b>The What and How Approach</b>
</p>
<p>
"What" the product needed to accomplish for customers is Problem space. The "what" describes the benefits product should give to the target customer.
Whereas, "how" the product would accomplish it, is solution space. The "how" is the way in which the product delivers the "what" to target customer. The "how" is the design of the product and the specific technology used to implement the product.
</p>
<p>
The best problem space learning often comes from feedback you receive from customers on the solution space.
</p>
<a name="sec17"></a>
<h3>1.7 A first high-level process</h3>
<p>
Based on Dan Olsen's pyramid, we can introduce here already a first idea of a process to reach Product-Market Fit:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/697409ef-e06a-467d-ac37-13b1c84b08d3">
<img class="centered" style="width: 700px;" alt="A first idea of a process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/697409ef-e06a-467d-ac37-13b1c84b08d3" />
</a>
</div>
<br>
<p>
This is a first idea only; we will see different processes and different perspectives in chapter <a href="#sec4">Different Perspectives</a>.
</p>
<a name="sec2"></a>
<h2>2. Lean Startup Fundamentals</h2>
<p>
In this chapter we will cover the most fundamental aspects of Lean Startup required to understand the different perspectives on the Search for Product-Market Fit presented in chapter <a href="#sec4">4. Different Perspectives</a>.
</p>
<p>
<b>The Lean Startup</b> is a movement, initiated and supported by some key people that I presented in the previous section.
<br>
But it's also a framework, an inspiration, an approach, a methodology with a set of fundamental principles and practices for helping entrepreneurs increase their odds of building a successful startup.
<br>
Lean Startup cannot be thought of as a set of tactics or steps. Don't expect any checklist (well, at least not only checklists) or any recipe to be applied blindly; rather, it gives you a set of processes, principles and practices to reach Product-Market Fit and eventually scale the company.
</p>
<a name="sec21"></a>
<h3>2.1 Lean Startup </h3>
<p>
<b>Lean Movement (1990)</b>
</p>
<p>
<b>Lean thinking</b> is a <b>business methodology</b> that aims to provide a new way to think about how to organize human activities to deliver more benefits to society and value to individuals while <b>eliminating waste</b>.
<br>
The aim of lean thinking is to create a <b>lean enterprise</b>, one that <b>sustains growth</b> by aligning customer satisfaction with employee satisfaction, and that <b>offers innovative products</b> or services profitably while <b>minimizing unnecessary over-costs</b> to customers, suppliers and the environment.
<br>
The Lean Movement finds its roots in Toyotism and values <b>performance, visual management</b> (Kanban) and <b>continuous improvement</b> (Kaizen).
</p>
<p>
<b>Lean Startup (2010)</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/223c22b4-f903-4441-92ce-51d1e227ceb3">
<img class="centered" style="width: 800px;" alt="Lean Startup origins" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/223c22b4-f903-4441-92ce-51d1e227ceb3" />
</a>
</div>
<br>
<p>
Blank, Ries, Osterwalder and Maurya are the founders and initiators of the <i>Lean Startup Movement</i>. Eric Ries is considered the leader of the movement, while Steve Blank considers himself its godfather.
<br>
Osterwalder and Maurya's work on business models is considered to fill a gap in Ries and Blank's work on processes, principles and practices. In Steve Blank's "<i>The Four Steps to the Epiphany</i>", the business model section is a vague single page.
<br>
Moreover, Maurya's "<i>Running Lean</i>" magnificently completes Blank's work on <i>Customer Development</i>. We'll get to that.
</p>
<a name="sec22"></a>
<h3>2.2 Key principles </h3>
<p>
Before digging any further into Lean Startup, here are the essential principles that characterize <i>The Lean Startup</i> approach, as reported in Eric Ries' book.
</p>
<p>
<b>Entrepreneurs are everywhere</b>
</p>
<p>
You don't have to work in a garage to be in a startup. The concept of entrepreneurship includes anyone who works within Eric Ries' definition of a startup, which I repeat here:
</p>
<div class="centering">
<div class="centered">
<b>A startup is a human institution designed to create new products and services under conditions of extreme uncertainty.</b>
</div>
</div>
<br>
<p>
That means entrepreneurs are everywhere and the Lean Startup approach can work in any size company, even a very large enterprise, in any sector or industry.
</p>
<p>
<b>Entrepreneurship is management</b>
</p>
<p>
A startup is an institution, not just a product, and so it requires a new kind of management specifically geared to its context of extreme uncertainty.
<br>
In fact, Ries believes "<i>entrepreneur</i>" should be considered a job title in all modern companies that depend on innovation for their future growth.
</p>
<p>
<b>Validated learnings</b>
</p>
<p>
Startups exist not just to make stuff, make money, or even serve customers. They exist to learn how to build a sustainable business. This learning can be validated scientifically by running frequent experiments that allow entrepreneurs to test each element of their vision.
</p>
<p>
<b>Innovation accounting</b>
</p>
<p>
To improve entrepreneurial outcomes and hold innovators accountable, we need to focus on the boring stuff: how to measure progress, how to set up milestones, and how to prioritize work. This requires a new kind of accounting designed for startups - and for the people who hold them accountable.
</p>
<p>
<b>Build-Measure-Learn</b>
</p>
<p>
The fundamental activity of a startup is to turn ideas into products, measure how customers respond, and then learn whether to <b>pivot or persevere</b>. All successful startup processes should be geared to accelerate that <b>feedback loop</b>.
</p>
<a name="sec23"></a>
<h3>2.3 The Feedback loop </h3>
<p>
The feedback loop is represented as below.
<br>
The five-part version of the <i>Build-Measure-Learn</i> schema helps us see that the real intent of building is to test "<i>ideas</i>" - not just to build blindly without an objective.
<br>
The need for "<i>data</i>" indicates that after we measure our experiments we'll use the data to further refine our learning. And the new learning will influence our next ideas. So we can see that the goal of Build-Measure-Learn isn't just to build things, the goal is to build things to validate or invalidate the initial idea.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6da0f813-d05e-420f-92e8-178f31b2122b">
<img class="centered" style="width: 800px;" alt="The feedback loop" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6da0f813-d05e-420f-92e8-178f31b2122b" />
</a>
</div>
<br>
<p>
Again, the goal of <i>Build-Measure-Learn</i> is not to build a final product to ship or even to build a prototype of a product, but to <b>maximize learning</b> through incremental and iterative engineering.
<br>
In this case, learning can be about product features, customer needs, distribution channels, the right pricing strategy, etc.
<br>
The "<i>build</i>" step may refer to building an <b>MVP</b> - Minimal Viable Product - or simply a prototype, mock-up or even simply a hand sketch, whatever works for collecting market / customer feedback.
</p>
<p>
In the end, the <i>Build-Measure-Learn</i> framework lets startups be fast, agile and efficient by validating every single assumption of the Problem-Solution Fit and the Business Model before consenting to any heavy investment.
</p>
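<p>
For the engineers, the loop can also be sketched as code. This is a purely illustrative skeleton - the hypothesis, the stubbed measurements and the threshold are all made up:
</p>
<pre>
# Schematic Build-Measure-Learn loop (illustrative skeleton, not a framework).

def build(hypothesis):
    """Build the cheapest artifact (sketch, mock-up, MVP) able to test the idea."""
    return f"experiment for: {hypothesis}"

def measure(experiment):
    """Expose the artifact to real customers and collect data (stubbed here)."""
    return {"signups": 12}

def learn(data, threshold=10):
    """Pivot or persevere, decided on data rather than on opinion."""
    return "persevere" if data["signups"] >= threshold else "pivot"

idea = "SMBs will pay for automated invoice reconciliation"  # made-up hypothesis
print(learn(measure(build(idea))))  # persevere
</pre>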
<a name="sec24"></a>
<h3>2.4 The Four steps to the Epiphany </h3>
<p>
Most startups lack a process for discovering their markets, locating their first customers, validating their assumptions, and growing their business.
<br>
The <b>Customer Development Model</b> creates the process for these goals.
</p>
<p>
The life of any startup can be divided into two parts: before product/market fit (call this "BPMF") and after product/market fit ("APMF").
<br>
When you are BPMF, focus obsessively on getting to product/market fit.
<br>
Do whatever is required to get to product/market fit. Including changing out people, rewriting your product, moving into a different market, telling customers no when you don't want to, telling customers yes when you don't want to, raising that fourth round of highly dilutive venture capital, whatever is required! When you get right down to it, you can ignore almost everything else.
</p>
<p>
Whenever you see a successful startup, you see one that has reached product/market fit, and usually along the way screwed up all kinds of other things, from channel model to pipeline development strategy to marketing plan. And the startup is still successful.
</p>
<p>
PMF means it's safe to scale!
<br>
If you decide to scale up a SaaS company without proven Product-Market Fit, you're taking a huge risk. There's no guarantee that a market for your product exists. Even if it does, it might not be able to sustain your business.<br>
Without PMF, major investments into marketing, sales and customer success are premature.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a4aff1ad-cca8-416b-b499-550b9173a41e">
<img class="centered" style="width: 800px;" alt="The four steps to the Epiphany" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a4aff1ad-cca8-416b-b499-550b9173a41e" />
</a>
</div>
<br>
<ol>
<li><b>Customer discovery</b>: determining who your customers are, and whether the problem you're solving is important to them. In this phase, you may spend a lot of time conducting primary research, with surveys and interviews, or looking through secondary research. For example, in the case of Uber, Travis Kalanick decided to build the business model as a private black-cab service for himself. Gradually, as the service was shared with friends, they began to realize there was demand from others.
</li>
<li><b>Customer validation</b>: building a sales process that can be repeated by a sales and marketing team. This process is validated by selling the product to early customers for money. In the case of Uber, customers were paying for the ride from the get-go, hence the business model was validated. And for Facebook, in its early days, Mark Zuckerberg was selling banners to local college businesses as proof that the freemium monetization model would work.
</li>
<li><b>Customer creation / Get new Customers</b>: seeks to increase demand for a product and scale the business. In the case of Uber, the referral bonus program with ride subsidies was the key to its rapid growth, or customer creation.
</li>
<li><b>Company building / Company Creation</b>: when the company transitions into a more formalized structure, with dedicated departments created to specialize functions such as sales, marketing, and business development.
</li>
</ol>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f8101263-5b88-43c1-9013-6639f43fdd81">
<img class="centered" style="width: 800px;" alt="The four steps to the Epiphany - Details" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f8101263-5b88-43c1-9013-6639f43fdd81" />
</a>
</div>
<br>
<p>
Shortly put, Steve Blank proposes that companies need a <b>Customer Development process</b> that complements, or even in large portions replaces, their <i>Product Development Process</i>. The <i>Customer Development process</i> goes directly to the theory of <a href="https://en.wikipedia.org/wiki/Product/market_fit">Product/Market Fit</a>.
<br>
In "<i>The four steps to the Epiphany</i>", Steve Blank provides a roadmap for how to get to Product/Market Fit.
</p>
<a name="sec25"></a>
<h3>2.5 Customer development - the practices </h3>
<p>
So I want to present the most essential principles and practices introduced and discussed by <i>the Lean Startup</i> approach.
<br>
These principles and practices are presented in the following schema, attached to the stages of the <i>Customer Development</i> process where I think they make the most sense:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/29ff320d-3e48-4261-b2ad-a165165fd1e4">
<img class="centered" style="width: 800px; " alt="Lean Startup Practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/29ff320d-3e48-4261-b2ad-a165165fd1e4" />
</a>
</div>
<br>
<p>
We will now focus on the practices relevant to reaching Product-Market Fit.
</p>
<a name="sec26"></a>
<h3>2.6 Get out of the building</h3>
<p>
<b>If you're not Getting out of the Building, you're not doing Customer Development and Lean Startup.</b>
<br>
There are no facts inside the building, only opinions.
</p>
<p>
If you aren't actually talking to your customers, you aren't doing Customer Development. And talking here really means speaking, with your mouth - preferably in person, but if not, a video call works as well; messaging or emailing doesn't.
</p>
<p>
As Steve Blank said "<i>One good customer development interview is better for learning about your customers / product / problem / solution / market than five surveys with 10'000 statistically significant responses.</i>"
</p>
<p>
The problem here is that tech people, especially software engineers, try to avoid going out of the building as much as possible. But this is so important. Engineers need to fight against their nature, get out of the building and talk to customers as much as possible: find out who they are, how they work, what they need, and what your startup needs to do to build and then sell its solution.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2d7ea9a3-3537-47f3-8a28-abdf0a4220e6">
<img class="centered" style="width: 800px; " alt="Get Out of the Building" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2d7ea9a3-3537-47f3-8a28-abdf0a4220e6" />
</a>
</div>
<br>
<p>
Again, getting out of the building doesn't mean getting to the parking lot; it's about actually getting in front of the customer.
<br>
At the end of the day, it's about <i>Customer Discovery</i>. And <i>Customer Discovery</i> is not sales: it's a lot of listening, a lot of understanding, and not a lot of talking.
</p>
<p>
A difficulty that people always imagine: young entrepreneurs with an idea believe they don't know anybody, so how can they figure out who to talk to?
<br>
But in the age of LinkedIn, Facebook and Twitter, it's hard to believe one cannot find a hundred people to have a conversation with.
</p>
<p>
And when having a conversation with one of them, whatever else one's asking (<a href="#sec27">Problem interview</a>, <a href="#sec28">Solution interview</a>), one should ask two very important final questions:
</p>
<ol>
<li>
"<i>Who else should I be talking to ?</i>"
<br>
And because you're a pushy entrepreneur, when they give you those names, you should ask "<i>Do you mind if I sit here while you email them introducing me?</i>"
</li>
<li>
"<i>What should I have really asked you ?</i>"
<br>
And sometimes that gets into another half hour related to what the customer is <i>really</i> worried about, what's really the customer's problem.
</li>
</ol>
<p>
Customer Discovery becomes really easy once you realize you don't need the world's best first interview.
<br>
In fact, it's the sum of these data points over time that matters; you're not going to do just one interview, and you don't want to start by calling on the highest level of the organization.
<br>
In fact, you never actually want to call on the highest level of the organization at this stage, because you're not selling yet - you don't know enough.
<br>
What one actually wants is to understand enough about the customers, their problems and how they're solving them today, and whether one's solution is something they would want to consider.
</p>
<a name="sec27"></a>
<h3>2.7 Problem interview</h3>
<p>
The Problem Interview is Ash Maurya's term for the interview you conduct to validate whether or not the problem you are targeting is a real one that your target audience actually has.
</p>
<p>
In the Problem Interview, you want to find out 3 things:
</p>
<ol>
<li><b>Problem</b> - What are you solving? - How do customers rank the top 3 problems?</li>
<li><b>Existing Alternatives</b> - Who is your competition? - How do customers solve these problems today?</li>
<li><b>Customer Segments</b> - Who has the pain? - Is this a viable customer segment?</li>
</ol>
<p>
Talking to people is hard, and talking to people in person is even harder. The best way to do this is to build a script and stick to it. Also, don't tweak your script until you've done enough interviews for your responses to be consistent.
<br>
The main point is to collect the information you will need to validate your problem, and to do it face-to-face, either in person or by video call. It's actually important to see people and be able to study their body language as well.
</p>
<p>
The interview script - at least the initial one you should follow until you have enough experience to build your own - is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b54ec66a-fda7-4938-94b4-10456335746d">
<img class="centered" style="width: 500px; " alt="Problem Interview" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b54ec66a-fda7-4938-94b4-10456335746d" />
</a>
</div>
<br>
<p>
If you have to remember just three rules for problem interviews, here they are:
</p>
<ol>
<li>Do not talk about your business idea or product. You are here to understand a problem, not imagine or sell a solution yet.</li>
<li>Ask about past events and behaviours</li>
<li>No leading questions - learn from the customer</li>
</ol>
<p>
After every interview, take a step back, analyze the answers, make sure you understood everything correctly, and synthesize the results.
<br>
After a few dozen interviews, you should be able to form a clear understanding of the problem and initiate a few ideas regarding the solution to it.
<br>
Finding and validating your solution brings us to the next topic: the <i>Solution Interview</i>.
</p>
<p>
And what if a customer tells you that the issues you thought were important really aren't? Then be glad: you have gained important data.
</p>
<a name="sec28"></a>
<h3>2.8 Solution interview</h3>
<p>
In the Solution Interview, you want to find out three things:
</p>
<ol>
<li><b>Early Adopters</b> - Who has this problem? - How do we identify an early adopter?</li>
<li><b>Solution</b> - How will you solve the problems? - What features do you need to build?</li>
<li><b>Pricing/Revenue</b> - What is the pricing model? - Will customers pay for it?</li>
</ol>
<p>
The key point here is to understand how to come up with a solution fitting the problem, step by step getting on the right track with your prototype, and also understanding what the pricing model could be.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9ef49888-50eb-4778-bc52-3f4584eeabf7">
<img class="centered" style="width: 500px; " alt="Solution Interview" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9ef49888-50eb-4778-bc52-3f4584eeabf7" />
</a>
</div>
<br>
<p>
A <i>demo</i> is actually important. Many products are too hard to understand without some kind of demo. If a picture is worth a thousand words, a demonstration is probably worth a million.
</p>
<p>
Identifying early adopters is also key.
<br>
Think of it this way: if one of the people you meet tells you that you're definitely onto something, ask him if he would want to buy it. If he says he would definitely buy it when it's ready and available, ask him if he would commit to this. If he commits, ask him if he would be ready to pay half of it now and get it when it's ready, thus becoming a partner or an investor.
<br>
If you find ten people committing to pay up front for the solution you sketch, you may not even need to search for investors - you already have them. And that is the very best proof you can get that your solution is actually something.
<br>
And customers or partners are actually the best possible type of investors.
</p>
<a name="sec29"></a>
<h3>2.9 MVP</h3>
<p>
The <b>Minimum Viable Product</b> is an engineered product with just the set of features required to gather <i>validated learnings</i> about it - or some of its features - and to steer its continuous development.
<br>
This notion of <i>Minimum Feature Set</i> is key in the MVP approach.
</p>
<p>
The key idea is that it makes really no sense to develop a full and finalized product without knowing what the market reception will be and whether all of it is actually worth the development costs.
<br>
Gathering insights and directions from an MVP avoids investing too much in a product based on wrong assumptions. Even further, the <i>Lean Startup</i> methodology seeks to avoid unverified assumptions at all costs - see <a href="#sec23">The Feedback Loop</a> and <a href="#sec211">Metrics Obsession</a>.
</p>
<p>
The <i>Minimum Viable Product</i> should have just the set of initial features strictly required to have a valid product, usable for its very initial intent, and nothing more. In addition, these features should be as minimalist as possible, but without compromising the overall <i>User Experience</i>. A car should move, a balloon should be round and bounce, etc.
<br>
When adopting an MVP approach, the MVP is typically made available at first only to <i>early adopters</i> - those customers who may be somewhat forgiving of the "naked" aspect of the product and, more importantly, who are willing to give feedback and help steer the product development further.
</p>
<p>
Eric Ries defines the MVP as:
</p>
<div class="centering">
<div class="centered">
<b>
"The minimum viable product is that version of a new product a team uses to collect the maximum amount of validated learning about customers with the least effort."
</b>
</div>
</div>
<br>
<p>
The definition's use of the words <i>maximum</i> and <i>minimum</i> means it is really not formulaic. In practice, it requires a lot of judgement and experience to figure out, for any given context, what MVP makes sense.
</p>
<p>
The following chart is pretty helpful in understanding why both terms <i>minimum</i> and <i>viable</i> are equally important and why designing an MVP is actually difficult:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/abd771cd-b5a4-4a74-91c4-03951407517f">
<img class="centered" style="width: 300px; " alt="Minimum and Viable" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/abd771cd-b5a4-4a74-91c4-03951407517f" />
</a>
</div>
<br>
<p>
When applied to a new feature of an existing product rather than a brand new product, the MVP approach is, in my opinion, somewhat different. It consists of not implementing the feature itself completely; rather, a mock-up or even some animation simulating the new feature should be provided.
<br>
The mock-up or links should be properly instrumented so that all user reactions are recorded and measured, in order to get insights on the actual demand for the feature and the best form it should take (see <a href="#sec211">Metrics Obsession</a>).
<br>
This is called a <b>deploy first, code later</b> method.
</p>
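<p>
As an illustration, a "deploy first, code later" mock can be as simple as the following sketch. The event names and the in-memory log are hypothetical stand-ins for a real analytics stack:
</p>
<pre>
# "Deploy first, code later": ship a mocked feature, record every reaction.
# Event names and storage are hypothetical; plug in your analytics backend.
import time

EVENT_LOG = []  # stand-in for a real analytics backend

def track(event, **props):
    """Record a user interaction with a timestamp."""
    EVENT_LOG.append({"event": event, "ts": time.time(), **props})

def on_export_button_clicked(user_id):
    """The feature doesn't exist yet: show a teaser and measure the demand."""
    track("export_feature_clicked", user=user_id)
    return "Coming soon! Want early access?"

print(on_export_button_clicked("user-42"))
print(len(EVENT_LOG), "interactions recorded")
</pre>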
<p>
<a href="http://www.expressiveproductdesign.com/minimal-viable-product-mvp/">Fred Voorhorst' work</a> does a pretty good job in explaining what an MVP is:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/756e983e-5fd7-4dc1-a395-f5f1a69747f8">
<img class="centered" style="width: 700px; " alt="MVP - How-To" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/756e983e-5fd7-4dc1-a395-f5f1a69747f8" />
</a><br>
<div class="centered">
(Fred Voorhorst - Expressive Product Design - <a href="http://www.expressiveproductdesign.com/minimal-viable-product-mvp/">http://www.expressiveproductdesign.com/minimal-viable-product-mvp/</a>)
</div>
</div>
<br>
<p>
Developing an MVP is most definitely not the same as developing a sequence of elements which maybe, eventually, combine into a product. A single wheel is not of much interest to a user wanting a personal transporter like a car, as illustrated by the first line.
<br>
Instead, developing an MVP is about developing the vision. This is not the same as developing a sequence of intermediate visions, especially not if these are valuable products by themselves. A skateboard, for example, will likely not interest someone in search of a car either, as illustrated by the second line.
</p>
<p>
Developing an MVP means developing a sequence of prototypes through which you explore what is key for your product idea and what can be omitted.
</p>
<p>
<b>Sidenote on Product Design Artifacts</b>
</p>
<p>
<b>This is important</b>: the MVP is not the only way to capture user feedback; there are multiple different tools.
<br>
For instance, during the first customer solution interviews, something as stupid as a few hand sketches that you move around with your hands may well be sufficient to capture feedback.
<br>
You should always settle on the simplest form of demonstration required to capture the feedback you need from your customer to verify your assumption, your hypothesis or your idea.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/019896bb-51a3-415f-900c-1b184e6b70a3">
<img class="centered" style="width: 700px; " alt="Product Design Artifcats" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/019896bb-51a3-415f-900c-1b184e6b70a3" />
</a><br>
<div class="centered">
(Source: <a href="https://www.slideshare.net/LeanStartupConf/a-playbook-for-achieving-productmarket-fit">https://www.slideshare.net/LeanStartupConf/a-playbook-for-achieving-productmarket-fit
</a>)
</div>
</div>
<br>
<a name="sec210"></a>
<h3>2.10 Fail Fast</h3>
<p>
The key point of the "<b>fail fast</b>" principle is to quickly abandon ideas that aren't working. And the big difficulty, of course, is not giving up too soon on an idea that could potentially work, should one find the right channel or the right approach.
<br>
Fail fast means getting out of planning mode and into testing mode - potentially for every component, every single feature, every idea around your product or model of change. <i>Customer development</i> is the process that embodies this principle and helps you determine which hypotheses to start with and which are the most critical for your new idea.
</p>
<p>
It really is OK to fail if one knows the reason for the failure, and that is where most people go wrong. Once a site or a product fails, one needs to analyse why it bombed. Only then can one learn from it.
<br>
The key aspect here is really learning. And learning comes from experimenting, <b>trying things, <a href="#sec5">measuring</a> their success and <a href="#sec212">adapting</a></b>.
<br>
An entrepreneur should really act like a pathologist investigating a death to find the cause of the failure. Understanding the cause of a failure can only work if the appropriate measures and metrics around the experiment are in place.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/559f0efe-95eb-4e4e-bdf1-3f6bb998932a">
<img class="centered" style="width: 350px; " alt="Success - what it really looks like" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/559f0efe-95eb-4e4e-bdf1-3f6bb998932a" />
</a>
</div>
<br>
<p>
Now failing is OK as long as we learn from it and as long as we <b>fail as fast as possible</b>. Again, the whole <i>lean</i> idea is to avoid waste as much as possible, and there is no greater waste than continuing to invest in something that ultimately cannot work. Failing as fast as possible, adapting the product and <a href="#sec212">pivoting</a> the startup towards its next approach as soon as possible is key.
<br>
But then again, the big difficulty is not giving up too soon on something that could possibly work.
</p>
<div class="centering">
<div class="centered">
<b>
Fail fast, <br>
Learn faster, <br>
Succeed sooner !
</b>
</div>
</div>
<br>
<p>
So how do you know when to turn, when to drop an approach and adapt your solution ? How can you know it's not too soon?
</p>
<p>
<a href="#sec5">Measure, measure, measure</a> of course!
</p>
<p>
The testing of new concepts, failing, and building on failures are necessary when creating a great product.
<br>
The adage, "<i>If you can't measure it, you can't manage it</i>" is often used in management and is very important in <i>The Lean Startup</i> approach. <br>
Lean Startup is about verifying all your assumptions and hypotheses, and the only way to verify them is to take measures, compute metrics, infer insights and adapt.
</p>
<a name="sec211"></a>
<h3>2.11 Metrics Obsession</h3>
<p>
In the <i>build-measure-learn</i> loop, there is measure ... <i>The Lean Startup</i> turns measuring everything into an actual obsession. And I believe that this is a damn good thing.
<br>
Think of it: what if you have an idea for a new feature or an evolution of your product and you don't already have the metrics that can help you make a sound and enlightened decision? You'll need to introduce the new measure and wait until you get the data. Waiting is not good for startups.
</p>
<p>
This is why I like thinking of it as a <b>Metrics Obsession</b>. Measure everything, everything you can think of!
<br>
And repeat a hundred times:
</p>
<div class="centering">
<div class="centered">
<b>
I will never ever again think that ... <br>
Instead I will <i>measure</i> that ...
</b>
</div>
</div>
<br>
<p>
Or as W. Edwards Deming said:
</p>
<div class="centering">
<div class="centered">
<b>
"In god we trust, all others must bring data"
</b>
</div>
</div>
<br>
<a href="#sec5">We'll come back to this</a>
<a name="sec212"></a>
<h3>2.12 Pivot</h3>
<p>
In the process of learning by iterations, a startup can discover, through field feedback from real customers, that its product is not adapted to the identified need, i.e. that it does not meet that need.
<br>
However, during this learning process, the startup may have identified another need (often related to the first product) or another way to answer the original need.
<br>
When the startup changes its product to meet either this new need or the former need in a different way, it is said to have performed a <b>Pivot</b>.
<br>
A startup can <i>pivot</i> several times during its existence.
</p>
<p>
A <i>pivot</i> is ultimately a <b>change in strategy</b> without <i>a change in vision</i>.
<br>
It is defined as a structured course correction designed <b>to test a new fundamental hypothesis</b> about the product, business model and engine of growth.
</p>
<p>
The vision is important. A startup is created because the founder has a vision, and the startup is really built and organized around this vision. If the feedback from the field compromises the vision, the startup doesn't need to pivot; it needs to stop and cease its activities, and another startup, another organization aligned with the new vision, should perhaps be created.
</p>
<p>
There are various kinds of pivots:
</p>
<ul>
<li><b>Zoom-In :</b> a single feature becomes the whole product </li>
<li><b>Zoom-Out :</b> the whole initial product becomes a feature of a new product </li>
<li><b>Customer segment :</b> Good product, bad customer segment </li>
<li><b>Customer need :</b> Repositioning, designing a completely new product (still sticking to the vision)</li>
<li><b>Platform : </b> Change from an application to a platform, or vice versa</li>
<li>Many others ...</li>
</ul>
<p>
<b>Pivot or Persevere</b>
</p>
<p>
Since entrepreneurs are typically emotionally attached to their product ideas, there is a tendency to hang in there too long. This wastes time and money. The pivot or persevere process forces a non-emotional review of the hypothesis.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/0ea0e3ba-a1ce-42cb-a805-c14c5a757537">
<img class="centered" style="width: 360px;" alt="Pivot or Persevere" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/0ea0e3ba-a1ce-42cb-a805-c14c5a757537" />
</a>
</div>
<br>
<p>
Unsurprisingly, knowing when to pivot is an art, not a science. It needs to be well thought through and can be pretty complicated to manage.
<br>
At the end of the day, knowing when to pivot or persevere requires experience and, more importantly, metrics: proper performance indicators giving the entrepreneur clear insights about the market reception of the product and the fitting of customer needs.
</p>
<p>
One thing seems pretty certain though: if it becomes clear to everyone in the company that another approach would better suit the customer needs, the startup needs to pivot, and fast.
</p>
<a name="sec213"></a>
<h3>2.13 The Lean Canvas</h3>
<p>
Business Models and the related processes were surprisingly missing or poorly addressed in Ries' and Blank's initial work.
<br>
Fortunately, Osterwalder and Maurya caught up and filled the gap.
</p>
<p>
<b>Business Model Canvas</b>
</p>
<p>
The <a href="https://en.wikipedia.org/wiki/Business_Model_Canvas">Business Model Canvas</a> is a strategic management template invented by Alexander Osterwalder and Yves Pigneur for developing new business models or documenting existing ones.
<br>
It is a visual chart with elements describing a company's value proposition, infrastructure, customers, and finances. It assists companies in aligning their activities by illustrating potential trade-offs.
</p>
<p>
<b>Lean Canvas</b>
</p>
<p>
The Lean Canvas is a version of the Business Model Canvas adapted by Ash Maurya specifically for startups. The Lean Canvas focuses on addressing broad customer problems and solutions and delivering them to customer segments through a unique value proposition.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ef01cc87-68ed-4ec9-8fd4-ad9694fda417">
<img class="centered" style="width: 800px; " alt="Lean Canvas" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ef01cc87-68ed-4ec9-8fd4-ad9694fda417" />
</a>
</div>
<br>
<p>
So how should one use the Lean Canvas?
</p>
<ol>
<li><b>Customer Segment and Problem</b><br>
Both Customer Segment and Problem sections should be filled in together.
<br>
Fill in the list of potential <i>customers</i> and <i>users</i> of your product, distinguishing customers (willing to pay) clearly from users, then refine each and every identified customer segment. Be careful not to focus on too broad a segment at first; think of Facebook, whose first segment was only Harvard students.
<br>
Fill in carefully the problem encountered by your identified customers.
<br>
Identify carefully your early adopters, since they will help you test and refine your business model.
</li>
<li><b>UVP - Unique Value Proposition</b><br>
For new products, the initial battle is about getting noticed: how will you get the customer's attention?
<br>
The UVP is the unique characteristic of your product or service that makes it different from what is already available on the market and that makes it worth your customers' consideration. Focus on the main problem you are solving and on what makes your solution different.
</li>
<li><b>Solution</b><br>
Filling this in is initially tricky, since knowing the solution for real requires trial and error, the build-measure-learn loop, etc. At the initial stage one shouldn't try to be too precise here and should keep things pretty open.
</li>
<li><b>Channels</b><br>
This consists in answering: how should you get in touch with your users and customers ? How do you get them to know about your product ? Indicate clearly your communication channels.
<br>
It's one of the riskiest items on your canvas! Start testing from day 1! (Social networks, newsletter, ads, friends, events, SEO, etc.)
</li>
<li><b>Revenue Stream and Cost Structure</b><br>
Both these sections should also be filled in together.
<br>
At first, during the initial stage of the startup, this should really be focused on the costs and revenues related to launching the MVP (how to interview 50 customers? What's the initial burn rate? etc.)
<br>
Later this should evolve towards an initial startup structure and focus on identifying the <i>break-even</i> point by answering the question: how many customers are required to cover my costs? (a short break-even sketch follows this list)
</li>
<li><b>Key Metrics</b><br>
Ash Maurya refers to Dave McClure's Pirate Metrics to identify the relevant KPIs to follow: <br>
Acquisition - How do users find you?<br>
Activation - Do users have a great first experience?<br>
Retention - Do users come back?<br>
Revenue - How do you make money?<br>
Referral - Do users tell others?
</li>
<li><b>Unfair Advantage</b><br>
This consists in indicating the adoption barriers as well as the competitive advantages of your solution. An <i>unfair advantage</i> is defined as something that cannot easily be copied or bought.
<br>
Examples: Insider Information, Personal authority, A dream team, Existing customers, "Right" celebrity endorsement, Large network effect, Community, SEO ranking, Patents, Core values, etc.
</li>
</ol>
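<p>
As announced above, here is a small sketch of the break-even computation behind the Revenue Stream and Cost Structure sections; all figures are hypothetical examples:
</p>
<pre>
# Break-even sketch: how many customers are required to cover my costs?
import math

def break_even_customers(fixed_costs, revenue_per_customer, variable_cost_per_customer):
    # Margin contributed by each customer; assumed positive here.
    margin = revenue_per_customer - variable_cost_per_customer
    return math.ceil(fixed_costs / margin)

# Example: 50'000 of fixed monthly costs, 120 of revenue and 40 of variable
# costs per customer per month: 50'000 / 80 = 625 customers to break even.
print(break_even_customers(50_000, 120, 40))  # 625
</pre>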
<p>
<b>Lean Startup : test your plan !</b>
</p>
<p>
Using the new "Build - Measure - Learn" diagram, the question then becomes, "What hypotheses should I test?". This is precisely the purpose of the initial Lean Canvas,
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9400cf47-0a75-4f96-87ed-662a381ae070">
<img class="centered" style="width: 750px; " alt="Canvas Principle" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9400cf47-0a75-4f96-87ed-662a381ae070" />
</a>
</div>
<br>
<p>
<b>Product Market Fit on the Lean Canvas</b>
</p>
<p>
The Lean Canvas is a formidable tool to capture the assumptions and hypotheses leading to Product-Market Fit.
<br>
Product Market Fit happens here on the Lean Canvas:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6324b197-c8b1-4f92-8ebc-90819519dd39">
<img class="centered" style="width: 800px; " alt="Product Market Fit on the Lean Canvas" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6324b197-c8b1-4f92-8ebc-90819519dd39" />
</a>
</div>
<br>
<p>
Filling in these two parts can be challenging, so another tool comes into play to help identify the assumptions leading to Product-Market Fit.
</p>
<a name="sec214"></a>
<h3>2.14 The Value Proposition Canvas</h3>
<p>
The Value Proposition Canvas is a tool which can help ensure that a product or service is positioned around what the customer values and needs.
<br>
The Value Proposition Canvas was initially developed by Dr Alexander Osterwalder as a framework to ensure that there is a fit between the product and the market. It is a detailed look at the relationship between two parts of Osterwalder's broader Business Model Canvas: customer segments and value propositions.
<br>
The Value Proposition Canvas can be used when there is a need to refine an existing product or service offering, or when a new offering is being developed from scratch.
</p>
<p>
<b>Customer Profile</b>
</p>
<ol>
<li><b>Customer jobs</b> - the functional, social and emotional tasks customers are trying to perform, problems they are trying to solve and needs they wish to satisfy. A customer profile should be created for each customer segment, as each segment has distinct gains, pains and jobs.</li>
<li><b>Gains</b> - the benefits which the customer expects and needs, what would delight customers and the things which may increase likelihood of adopting a value proposition.</li>
<li><b>Pains</b> - the negative experiences, emotions and risks that the customer experiences in the process of getting the job done.</li>
</ol>
<p>
<b>Value Map</b>
</p>
<ol>
<li><b>Gain creators</b> - how the product or service creates customer gains and how it offers added value to the customer.</li>
<li><b>Pain relievers</b> - a description of exactly how the product or service alleviates customer pains.</li>
<li><b>Products and services</b> - the products and services which create gain and relieve pain, and which underpin the creation of value for the customer.</li>
</ol>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2bc8b37e-f920-4b92-991e-85579c1d2c0b">
<img class="centered" style="width: 800px; " alt="Value Proposition Canvas" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2bc8b37e-f920-4b92-991e-85579c1d2c0b" />
</a>
</div>
<br>
<p>
<b>Achieving fit between the value proposition and customer profile</b>
</p>
<p>
After listing gain creators, pain relievers and products and services, each point identified can be ranked from nice-to-have to essential in terms of value to the customer. A fit is achieved when the products and services offered as part of the value proposition address the most significant pains and gains from the customer profile.
</p>
<p>
Identifying the value proposition on paper is only the first stage. It is then necessary to validate what is important to customers and get their feedback on the value proposition. These insights can then be used to go back and continually refine the proposition.
</p>
<a name="sec3"></a>
<h2>3. Design Thinking Fundamentals</h2>
<p>
Just as the previous chapter intended to cover the Lean Startup fundamentals required to present the different perspectives in the search for Product-Market Fit presented in the next chapter, this one covers the most essential Design Thinking fundamentals.
</p>
<a name="sec31"></a>
<h3>3.1 Design Thinking </h3>
<p>
Design Thinking is <b>an iterative process</b> in which we seek to <b>understand the user, challenge assumptions, and redefine problems</b> in an attempt to <b>identify alternative and innovative strategies and solutions</b> that might not be instantly apparent with our initial level of understanding (Creative thinking, Outside-the-box thinking, ...).
<br>
Design thinking is a way of thinking and working as well as a collection of hands-on methods.
</p>
<p>
The <i>Design Thinking Process</i> involves five phases: Empathize, Define, Ideate, Prototype and Test. It is most useful for tackling problems that are ill-defined or unknown.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ce4f907e-cc10-4174-b533-214242dcafe6">
<img class="centered" style="width: 400px; " alt="Design Thinking Big Picture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ce4f907e-cc10-4174-b533-214242dcafe6" />
</a>
</div>
<br>
<p>
Design Thinking revolves around a deep interest in developing an understanding of the people for whom we're designing the products or services. Experience says customers are not likely to communicate their needs clearly. It's not how the human brain works. We have a natural tendency to think in terms of solutions.
<br>
Design Thinking is based on the assumption that designers' work processes can help us systematically extract, teach, learn, and apply these human-centered techniques to solve problems in a creative and innovative way. It is kind of a capture of the best practices used by designers for ages, formalized and collected into a process and a set of tools. It's an attempt to bring designers' ways of working and thinking to other business fields where brainstorming is required to converge on a solution to a given problem.
</p>
<a name="sec32"></a>
<h3>3.2 The Design Thinking Process </h3>
<p>
Design thinking starts with Empathy and uses collaborative and participatory methods, repeating all 5 steps as many times as needed to achieve a complete solution.
<br>
The process helps avoid jumping to solution thinking before the problem is crystal-clearly understood and formulated!
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1418f19e-700b-4868-97b6-fcf12a168c3e">
<img class="centered" style="width: 700px; " alt="Design Thinking Process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/1418f19e-700b-4868-97b6-fcf12a168c3e" />
</a>
</div>
<br>
<p>
Design Thinking is an iterative and non-linear process. This simply means that the design team continuously uses its results to review, question and improve its initial assumptions, understandings and results. Results from the final stage of the initial work process inform our understanding of the problem, help us determine its parameters, enable us to redefine it and, perhaps most importantly, provide us with new insights so we can see alternative solutions that might not have been available with our previous level of understanding.
</p>
<ul>
<li>Get back to the customer for further refinement of the problem expression</li>
<li>Working on the prototype gives new ideas : challenge them and reprioritize!</li>
<li>Tests give new ideas : challenge them and reprioritize!</li>
<li>Tests reveal insights that redefine the problem </li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/927f8e98-a7f9-4d1b-b216-f64a579e92e6">
<img class="centered" style="width: 650px; " alt="Design thinking - Different perspectrives" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/927f8e98-a7f9-4d1b-b216-f64a579e92e6" />
</a><br>
<div class="centered">
(Source : <a href="https://www.slideshare.net/ChrisJackson43/i-design-think-therefore-i-am-a-uxer">https://www.slideshare.net/ChrisJackson43/i-design-think-therefore-i-am-a-uxer</a>)
</div>
</div>
<br>
<p>
Designers don't become designers from day 1: there are design schools, it requires experience, etc.
<br>
Design thinking is just the same. It requires a lot of practice and familiarity with the process and the tools to become good at it.
</p>
<a name="sec33"></a>
<h3>3.3 The Design Thinking Framework </h3>
<p>
This is intended as a map of the different tools and practices in use in the different stages of the design thinking process.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f8d76eab-062b-4d82-966c-be39c28a2398">
<img class="centered" style="width: 800px; " alt="Design Thinking Framework" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f8d76eab-062b-4d82-966c-be39c28a2398" />
</a>
</div>
<br>
<p>
I won't go any deeper today into detailing these practices and tools; I'll let the reader google them.
<br>
I will likely dedicate a full article to design thinking on this very blog in the short term.
</p>
<p>
At the end of the day, Design thinking is a lot about bringing Agility and Lean practices to the design and problem solving process.
<br>
In this perspective, it differs from the <i>traditional</i> thinking process in many ways, just as Agile Development differs from Waterfall Development.
</p>
<div class="centering">
<table class="centered" style="border: solid 1px #999999; border-collapse: collapse; padding: 4px;"><tr>
<td class="story" style="text-align: center; border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b></b></td>
<td class="story" style="text-align: center; border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b>Traditional thinking</b></td>
<td class="story" style="text-align: center; border-bottom: solid 1px #CCCCCC; padding: 4px;"><b>Design thinking</b></td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b>Style</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Directed</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Emergent</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b>Process</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Planning of flawless intellect</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Enlightened trial and error</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b>Path to success</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Avoid failure, secure</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Fail fast</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;" rowspan=4><b>Factor of success</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Expert Advantage</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Ignorance advantage</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Right answers</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Right questions</td>
</tr><tr>
<td class="story" style="padding: 3px; border-right: solid 1px #CCCCCC;">Rigorous Analysis</td>
<td class="story" style="padding: 3px;">Rigorous Testing</td>
</tr><tr>
<td class="story" style="padding: 3px; border-right: solid 1px #CCCCCC;">Subject experts</td>
<td class="story" style="padding: 3px;">Process experts</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b>Rituals</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Presentations and meetings</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Experiments and experiences</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b>Communication</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Telling</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Showing</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b>Base</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Headquarters</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">In the field</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;" rowspan=2><b>Customer involvement</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Arm's length customer research</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Deep customer immersion</td>
</tr><tr>
<td class="story" style="padding: 3px; border-right: solid 1px #CCCCCC;">Periodic</td>
<td class="story" style="padding: 3px;">Continuous</td>
</tr><tr>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;"><b>Activities</b></td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px; border-right: solid 1px #CCCCCC;">Thinking and planning</td>
<td class="story" style="border-bottom: solid 1px #CCCCCC; padding: 3px;">Doing</td>
</tr></table>
</div>
<br>
<a name="sec34"></a>
<h3>3.4 Thinking Outside of the Box </h3>
<p>
The best way to illustrate this key aspect of <i>Design Thinking</i> is with the following quote from Henry Ford:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/8acb4773-60a5-4d8f-9b5f-fa87afcdbe46">
<img class="centered" style="width: 700px; " alt="Henri Ford - faster horsed" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/8acb4773-60a5-4d8f-9b5f-fa87afcdbe46" />
</a>
</div>
<br>
<p>
I'd like to illustrate this quote with the following process as an example, to show what an over-simplified design thinking process applied to the problem above could look like:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2d02c29c-9884-4e23-a382-b27fa0791916">
<img class="centered" style="width: 800px;" alt="Design thinking Example process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2d02c29c-9884-4e23-a382-b27fa0791916" />
</a>
</div>
<br>
<a name="sec35"></a>
<h3>3.5 Sum-up</h3>
<p>
Design Thinking is essentially
</p>
<ul>
<li>a problem-solving approach specific to design, </li>
<li>which involves assessing known aspects of a problem and </li>
<li>identifying the more ambiguous or peripheral factors that contribute to the conditions of a problem. </li>
</ul>
<p>
It contrasts with a more scientific approach where the concrete and known aspects are tested in order to arrive at a solution.
</p>
<p>
Design Thinking is
</p>
<ul>
<li>an iterative process to identify alternative strategies and solutions that might not be instantly apparent with our initial level of understanding. </li>
<li>often referred to as "outside-the-box thinking", as designers attempt to develop new ways of thinking that do not abide by the dominant or more common problem-solving methods - just like artists do. </li>
</ul>
<p>
At the heart of Design Thinking is the intention to improve products by analyzing how users interact with them and investigating the conditions in which they operate.
<br>
Design Thinking offers a means of digging that bit deeper to uncover ways of improving User eXperiences.
</p>
<a name="sec4"></a>
<h2>4. Reaching Product Market fit - Different perspectives</h2>
<p>
Now, with all we have covered above - lean startup and design thinking fundamentals - we can come back to this very article's topic, the <b>search for Product-Market Fit</b>.
<br>
When searching online for articles or posts about Product-Market Fit, you will most likely fall into one of the following four perspectives:
</p>
<ol>
<li>
<b>The <i>Lean-Startup</i> perspective</b>: with actually two sub-cases that converge to the same thing:
<ul>
<li>
<b>The <i>Feedback-loop</i> perspective</b>: Searching Product-market fit is applying the <i>Build-Measure-Learn</i> feedback loop comprehensively throughout the product identification and design lifecycle and the business plan definition to shape a product fulfilling perfectly the customer needs.
</li>
<li>
<b>The <i>Four-Steps-to-the-Epiphany</i> perspective</b>: Product-market fit is the result of the search phase, when the solution to the customer problem is clearly identified along with its feature set, market potential, business plan, foreseen evolutions, etc.
</li>
</ul>
Again, these two perspectives actually converge to what I will describe hereunder as the <i>Lean Startup</i> Perspective.
</li>
<li>
<b>The <i>MVP-Centric</i> perspective</b>: For many, searching for Product-Market fit means iterating around a <i>Minimum Viable Product</i>. It is the result of a process centered around the MVP design iterations, when the MVP and what we learned from it enabled us to identify the product fulfilling the market needs.
</li>
<li>
<b>The <i>Lean-Canvas-Centric</i> perspective</b>: For others, Product-Market fit happens when you succeed in designing great value propositions that match your customer needs and jobs-to-be-done and help solve their problems.
</li>
<li>
<b>The <i>Design Thinking</i> perspective</b>: Product-Market fit is what happens when you successfully apply Lean-Startup principles to the last design-thinking process stages to reach maturity and the growth stage.
</li>
</ol>
<p>
These different perspectives are all, well, perspectives ... Different visions of the same thing: putting the customer needs at the center of the Solution Search and Design process and being <i>lean-by-the-book</i> as long as the Problem, the Solution, the Market and the Product along with its minimum features are not well identified.
<br>
We should now detail these different perspectives.
</p>
<a name="sec41"></a>
<h3>4.1 The Lean Startup Perspective </h3>
<p>
Again, the <i>Feedback loop</i> and the <i>Four Steps to the Epiphany</i> - both described and often referenced in the literature - actually converge to the very same thing: the Lean Startup way, which can be represented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d3a297a2-0dba-4c91-903e-81589bd5ab57">
<img class="centered" style="width: 800px;" alt="The Lean Startup Perspective to Product Market Fit" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d3a297a2-0dba-4c91-903e-81589bd5ab57" />
</a>
</div>
<br>
<p>
The Lean Startup perspective to product market fit consists, well, in being Lean Startup by the book:
</p>
<ul>
<li>"<i>4-steps to the Epiphany</i>" as a high level process</li>
<li>Lower-level process is represented by the Lean-Startup Feedback Loop
<ul>
<li>First Problem-Solution Fit</li>
<li>Then Business-Model validation</li>
<li>Eventually Product-Market Fit</li>
</ul>
</li>
<li>MVP happens late in the process; most assumptions are verified during interviews with mock-ups and prototypes (lean-by-the-book)</li>
</ul>
<p>
At the end of the day Lean Startup <b>is</b> about reaching Product-Market Fit.
</p>
<p>
<b>Key Aspects</b>
</p>
<p>
The <i>Lean Startup</i> way to product market fit is fundamentally Customer-centric: <i>Get out of the building</i>, work with your customers:
</p>
<ul>
<li>Understand your customer's problem</li>
<li>Understand if your solution works for your customer!</li>
<li>Understand your market, capture its constraints, abilities, means.</li>
</ul>
<p>
It's lean-by-the-book: only very little investment should be made upfront. Focus on Problem-Solution Fit first, then design your business model; all of this can be done almost for free.
<br>
Only then should one develop an MVP - which requires some investment - at the latest stage, in any case after Problem-Solution Fit.
<br>
The whole process is fundamentally data-driven: make a hypothesis, test it, measure, learn, adapt or persevere, move to the next assumption, etc.
<br>
Product-Market Fit is reached when the metrics measured from the MVP confirm it.
</p>
<div class="centering">
<table class="centered" style="border: 0 none; border-collapse: collapse; margin: 2px; padding: 4px;"><tr>
<td class="story" style="text-align: center; border: 0 none; border-right: solid 1px #CCCCCC; padding: 3px;"><b>Advantages</b></td>
<td class="story" style="text-align: center; border: 0 none; padding: 4px;"><b>Drawbacks</b></td>
</tr><tr>
<td class="story" style="text-align: left; border: 0 none; border-right: solid 1px #CCCCCC; padding: 3px;">
<ul>
<li>
Little investment on MVP, everything remains theoretical before Problem-Solution Fit and Business Model validation - Lean by the book!
</li>
<li>
Only when most assumptions are verified does one move to developing the MVP (which requires money!)
</li>
</ul>
</td>
<td class="story" style="text-align: left; border: 0 none; padding: 4px;">
<ul>
<li>
While this approach - working a lot upfront before starting to work on an MVP - is seductive (less investment required), it is also challenging:
<ul>
<li>
The first feedback we get from the MVP often strongly challenges the initial assumptions, even though they were validated with customers.
</li>
<li>
This is because users and customers have a strong tendency to be unable to clearly state what they need and want before something concrete is put in front of them.
</li>
</ul>
</li>
</ul>
</td>
</tr></table>
</div>
<br>
<a name="sec42"></a>
<h3>4.2 The "MVP-Centric" Perspective </h3>
<p>
The <i>MVP-centric</i> perspective is very similar to the <i>Lean-Startup</i> perspective, with only one fundamental difference.
<br>
Instead of remaining lean for too long and working on Problem-Solution Fit and the Business Model design from a mostly theoretical perspective (through interviews, design sessions with customers, etc.), for some it makes more sense to rush to the MVP and capture better feedback based on something concrete.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/70f2eef4-cddc-42a1-bee1-46ba1ac2e5b3">
<img class="centered" style="width: 600px;" alt="The MVP-Centric Perspective to Product Market Fit" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/70f2eef4-cddc-42a1-bee1-46ba1ac2e5b3" />
</a>
</div>
<br>
<p>
The MVP-Centric approach puts the iterations on MVP at the center of the process.
<br>
It is a very relevant approach for online and wide-audience services, such as SaaS platforms, online services, etc.
</p>
<p>
<b>Fundamental Idea </b>
</p>
<ul>
<li>
Jump to the MVP development stage as fast as possible - without neglecting Problem-Solution Fit though - and iterate on the MVP as long as required to reach PMF
</li>
<li>
Get IRL - In Real Life - and go live as fast as possible to optimize feedback
</li>
</ul>
<p>
<b>Note about MVP</b>
</p>
<p>
As discussed in the <a href="#sec29">MVP</a> section, the first version of the MVP in this context can very well be a prototype, a simple mock-up or even a sketch.
<br>
What the MVP-centric approach says is that at every stage (Problem-Solution Fit, etc.) you should have something concrete to present to the customer, to have them react to something real, not just open questions or theoretical solutions.
<br>
In this perspective, any feedback that is not measured on something real (in the sense of existing, as real as a simple sketch can be) is simply useless. This is opposed to the previous <i>lean-by-the-book</i> perspective, where the first stages can be made through simple interviews.
</p>
<p>
<b>Key aspects</b>
</p>
<p>
Some believe that reasoning on a concrete MVP is the best way (again, the definition of MVP in this context is wide).
<br>
The principles behind the underlying process are:
</p>
<ul>
<li>Build MVP, put it live in real-life and start getting feedback</li>
<li>Feedback and metrics collection automation is key</li>
<li>A/B Testing / UX Metrics / etc.</li>
</ul>
<p>
Contrary to the Lean-Startup perspective, this approach requires some more significant up-front investment: one needs to develop the MVP.
<br>
The Problem-Solution Fit search phase is not neglected, but shortened as much as possible to reach the more concrete MVP stage faster.
<br>
Product-Market Fit is reached when the metrics measured from the MVP confirm it (as usual).
</p>
<div class="centering">
<table class="centered" style="border: 0 none; border-collapse: collapse; margin: 2px; padding: 4px;"><tr>
<td class="story" style="text-align: center; border: 0 none; border-right: solid 1px #CCCCCC; padding: 3px;"><b>Advantages</b></td>
<td class="story" style="text-align: center; border: 0 none; padding: 4px;"><b>Drawbacks</b></td>
</tr><tr>
<td class="story" style="text-align: left; border: 0 none; border-right: solid 1px #CCCCCC; padding: 3px;">
<ul>
<li>
People have difficulty reasoning about abstractions, and customers have trouble expressing clearly what they need as long as they don't see anything concrete (the infamous "that's not what I wanted")
</li>
<li>
Moving as fast as possible to the MVP addresses this and enables getting feedback on the real thing as soon as possible
</li>
<li>
Very well suited for online and wide audience services (e.g. Netflix, Google, Facebook, etc.)
</li>
</ul>
</td>
<td class="story" style="text-align: left; border: 0 none; padding: 4px;">
<ul>
<li>
More up-front investment, need to develop the MVP and put it live
</li>
<li>
Does not apply to all businesses / products - how to develop an MVP in Pharma, Bio-tech or heavy industry?
</li>
</ul>
</td>
</tr></table>
</div>
<br>
<a name="sec43"></a>
<h3>4.3 The "Lean Canvas-Centric" Perspective </h3>
<p>
The <i>Lean Canvas-Centric</i> perspective is kind of the symmetric counterpart of the <i>MVP-Centric</i> perspective, with the <i>Lean Startup</i> perspective as the pivot point.
<br>
Its fundamental idea is the exact opposite of the MVP-centric approach: postponing the MVP, and the investment it requires, as much as possible, and staying lean and theoretical as long as possible - even as long as required to reach "<i>theoretical Product-Market fit</i>".
<br>
The Lean-Canvas-centric approach puts a strong emphasis on theoretical work and on interviews with customer representatives and market experts instead of the MVP.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/96fd46a9-5ba8-4123-9bf5-f449c649cd39">
<img class="centered" style="width: 800px;" alt="The Lean Canvas-Centric Perspective to Product Market Fit" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/96fd46a9-5ba8-4123-9bf5-f449c649cd39" />
</a>
</div>
<br>
<p>
<i>Theoretical Product-Market fit</i> would be defined as "<i>designing a product and a business model that have the potential to achieve Product-Market Fit</i>", as measured by whatever means are available to confirm most assumptions without having the actual product or even an MVP.
<br>
The Lean Canvas and the Value Proposition Canvas form a formidable tool to drive research with customers towards Product-Market Fit and discussions with investors. They sum up the findings and the assumptions validated through theoretical work (such as prototypes, academic research, scientific findings, etc.) that lead to theoretical Product-Market Fit before actually building anything of the product.
<br>
Developing, evolving and maintaining these canvases is:
</p>
<ul>
<li>Useful during the search phase for every startup</li>
<li>Essential when an MVP is not possible or very expensive. These canvases provide a guiding line for the theoretical search phase.</li>
</ul>
<p>
<b>Key aspects</b>
</p>
<p>
Designing a Lean Canvas, maintaining it and evolving it, along with a Value Proposition Canvas, always makes sense and should always be done to drive the initial assumptions on Problem-Solution Fit, the Business Model and Product-Market Fit, and their evolutions.
<br>
But most of the time, coming back over and over again to the Lean Canvas is dropped in favor of iterations around the MVP and its predominant usage to capture customer feedback and converge to Product-Market Fit.
<br>
When working with and around an MVP is not possible - heavy industry, bio-tech, pharma, etc. - the Lean Canvas and its maintenance remain the principal guideline when searching for Product-Market Fit.
<br>
Every customer interview, expert consultation or piece of scientific research should lead to evolving the Lean Canvas and the Value Proposition Canvas. The Lean Canvas is the big picture of the business plan, leading discussions with investors.
<br>
The Lean canvas is the map to the data points that need to be collected before talking to investors.
</p>
<div class="centering">
<table class="centered" style="border: 0 none; border-collapse: collapse; margin: 2px; padding: 4px;"><tr>
<td class="story" style="text-align: center; border: 0 none; border-right: solid 1px #CCCCCC; padding: 3px;"><b>Advantages</b></td>
<td class="story" style="text-align: center; border: 0 none; padding: 4px;"><b>Drawbacks</b></td>
</tr><tr>
<td class="story" style="text-align: left; border: 0 none; border-right: solid 1px #CCCCCC; padding: 3px;">
<ul>
<li>
The Lean-Canvas helps create a quick visualization of an idea, share it and get feedback.
</li>
<li>
The Value Proposition canvas helps capture how Product Market fit will be reached.
</li>
<li>
Sometimes - when working with an MVP is not possible - every single word on these canvases captures an essential assumption that has been verified and that is key to building the eventual product and the company.
</li>
</ul>
</td>
<td class="story" style="text-align: left; border: 0 none; padding: 4px;">
<ul>
<li>
Again, while working with these canvases is important in the initial stage, and perhaps at a later stage when discussing with potential customers and investors, iterating on and evolving them takes a lesser importance in favor of working with an MVP as we move forward in the search phase.
</li>
</ul>
</td>
</tr></table>
</div>
<br>
<a name="sec44"></a>
<h3>4.4 The "Design Thinking-Centric" Perspective </h3>
<p>
The <i>Design Thinking-Centric</i> perspective is actually not a variation of the previous ones, but rather a complementary approach.
<br>
Lean Startup doesn't say much about how to conduct brainstorming and the thinking process towards solutions. This is where <i>Design Thinking</i> kicks in.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/26986622-a945-4c40-9f39-ccc923552b7d">
<img class="centered" style="width: 550px;" alt="The Design Thinking-Centric Perspective to Product Market Fit" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/26986622-a945-4c40-9f39-ccc923552b7d" />
</a>
</div>
<br>
<p>
The Design-Thinking Perspective is actually a way to structure and formalize the Problem-solving approach in the search phase.
<br>
There's some overlap between Lean Startup and Design Thinking:
</p>
<ul>
<li>Design Thinking emphasizes Problem-Solution Fit, MVP design and UX as the ultimate results from a design perspective, </li>
<li>while Lean Startup focuses on reaching Product Market Fit before scaling.</li>
</ul>
<p>
Both are very complementary!
</p>
<ul>
<li>The design thinking process is a very good fit for the Lean-Startup Search Phase</li>
<li>Lean Startup doesn't provide many processes and tools for how to get to Problem-Solution Fit (aside from some general principles: get out of the building, problem / solution interviews, etc.), how to design the MVP, etc.</li>
<li>This is where Design thinking kicks in.</li>
</ul>
<p>
Applying the design-thinking process to the brainstorming required in the Lean-Startup search phase is a striking fit.
</p>
<p>
<b>Key aspects</b>
</p>
<p>
Lean Startup insists on the need to reach Problem-Solution Fit, a working Business Model and eventually Product-Market Fit, and gives principles and practices for it (get out of the building, converge to an MVP, Lean Canvas, etc.).
<br>
But Lean Startup doesn't give much of a recipe for how to conduct brainstorming, how to search for a solution, how to design the MVP, etc.
</p>
<p>
Here comes Design thinking!
</p>
<ul>
<li>
The Design thinking process can be applied every time a solution to a problem, a design job or simply a brainstorming exercise has to be performed in the search phase … or after.
</li>
<li>
Design Thinking and Lean Startup share some genes (getting feedback, iterating, etc.), but they are very much complementary.
</li>
</ul>
<div class="centering">
<table class="centered" style="border: 0 none; border-collapse: collapse; margin: 2px; padding: 4px;"><tr>
<td class="story" style="text-align: center; border: 0 none; border-right: solid 1px #CCCCCC; padding: 3px;"><b>Advantages</b></td>
<td class="story" style="text-align: center; border: 0 none; padding: 4px;"><b>Drawbacks</b></td>
</tr><tr>
<td class="story" style="text-align: left; border: 0 none; border-right: solid 1px #CCCCCC; padding: 3px;">
<ul>
<li>
Helps structure the search for solutions to various problems and aspects of the search phase
<ul>
<li>
Problem-Solution fit (striking application for the design thinking process)
</li>
<li>
MVP Design
</li>
<li>
Commercial and marketing issues
</li>
</ul>
</li>
<li>
Mostly relevant when a lot of thought needs to be put into the design of the product or the search for a solution to the customer problem.
</li>
</ul>
</td>
<td class="story" style="text-align: left; border: 0 none; padding: 4px;">
<ul>
<li>
When the solution is clear after the first sets of interviews with key customers, or on the contrary when searching for a solution requires a lot of scientific research, Design Thinking is out of scope.
<ul>
<li>
If the solution is crystal clear, a structured brainstorming process such as design thinking is not required.
</li>
<li>
If the solution requires a lot of scientific research, design thinking is not of great help
</li>
</ul>
</li>
</ul>
</td>
</tr></table>
</div>
<br>
<a name="sec5"></a>
<h2>5. Measure Obsession</h2>
<p>
There is one fairly important topic that I haven't covered in this paper and that would require a dedicated blog post of its own.
<br>
And that is "<i>How do you know when you have reached Product-Market Fit</i>" ?
<br>
I said in the introduction that a lot of it is about feeling, when you really feel the market <i>pulling</i> your product out of your hands.
<br>
But fortunately, knowing whether your startup has reached Product-Market Fit, or simply whether you are going in the right direction, is a lot more than feelings. It's all about <b>metrics</b>!
</p>
<p>
Or, as W. Edwards Deming said:
</p>
<div class="centering">
<div class="centered">
<b>
"In God we trust, All others must bring data."
</b>
</div>
</div>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/8786f243-6a78-44f9-8090-bdb9707cdfb1">
<img class="centered" style="width: 650px; " alt="Measure Obsession" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/8786f243-6a78-44f9-8090-bdb9707cdfb1" />
</a><br>
<div class="centered">
(Source : <a href="http://fr.slideshare.net/sperin/les-pratiques-des-geants-du-web">Les pratiques des géants du web / Stephen Perrin - OCTO Technology</a>)
</div>
</div>
<br>
<p>
Now the question of course is: which metrics make sense for measuring whether one is going in the right direction (towards Product-Market Fit)?
</p>
<p>
Honestly there is no magic silver bullet, and it can in fact be pretty difficult to pick the right metric to validate a certain hypothesis. However, metrics should at all costs respect the three A's.
<br>
Good metrics:
</p>
<ul>
<li>are actionable,</li>
<li>can be audited and</li>
<li>are accessible</li>
</ul>
<p>
An actionable metric is one that ties specific and repeatable actions to observed results. The actionable property of the chosen metrics is important since it prevents the entrepreneur from distorting reality to fit his own vision.
<br>
We speak of <i>Actionable</i> vs. <i>Vanity</i> Metrics.
<br>
Metrics such as "<i>How many visitors?</i>" or "<i>How many followers?</i>" are vanity metrics and are useless.
</p>
<p>
Ultimately, your metrics should be useful to measure progress against your own questions.
</p>
<p>
Now, giving you a list of metrics and the proper way to interpret them is a topic of its own, and I might write another article on this blog in the near future to define and present such metrics.
<br>
Since this article is already long enough as it is, I'll just mention four metrics that I believe should be among the minimum set that any startup retains:
</p>
<ol>
<li><b>NPS</b> - Net Promoter Score</li>
<li><b>CLV (or LTV) to CAC Ratio</b> Customer LifeTime Value to Customer Acquisition Cost Ratio</li>
<li><b>Retention Ratio</b></li>
<li><b>Growth Rate</b></li>
</ol>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5634003-834c-40c6-aae6-ddb94161f1d2">
<img class="centered" style="width: 500px;" alt="PMF Metrics" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5634003-834c-40c6-aae6-ddb94161f1d2" />
</a>
</div>
<br>
<a name="sec51"></a>
<h3>5.1 Net Promoter Score</h3>
<p>
The <b>Net Promoter Score</b> - or NPS - is perhaps the simplest metric to gather and compute, as well as one of the most meaningful.
<br>
It consists in understanding how great your product is by capturing how enthusiastic your users are about it - enthusiastic enough to recommend it to others. In other words, it's really about how likely your product is to generate a <i>Wow</i> effect.
</p>
<p>
The <b>Net Promoter Score</b> is a metric that has become a standard for measuring customer loyalty and satisfaction by many companies.
<br>
It is built on the power of one simple question: "<i>how likely is it that you would recommend this product to a friend or colleague?</i>"
<br>
It's now used by companies of all sizes in virtually every industry all over the world.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d51343f3-f4e6-4fce-9874-9c8f4d8373aa">
<img class="centered" style="width: 500px;" alt="NPS Intro" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d51343f3-f4e6-4fce-9874-9c8f4d8373aa" />
</a>
</div>
<br>
<p>
While NPS is a good leading indicator of business growth, it can also be a vanity metric if used alone without looking at the context of why your customers would recommend (or not recommend) your product.
<br>
If your company is committed to measuring NPS, here's a tip that you can use to understand the "why" behind your NPS score and potentially increase it. Follow up the standard NPS question with one additional question, "<i>What would it take for you to recommend my product to someone you know?</i>", and target the people who are not your promoters, which means they rated their likelihood to recommend your product an 8 or lower. You can then analyze those open-ended responses to identify key trends in the data.
</p>
<p>
<b>Computing the Net Promoter Score</b>
</p>
<div class="centering">
<div class="centered">
<b>Net Promoter Score</b>
=
<span style="color: green;">% Promoters</span>
-
<span style="color: red;">% Detractors</span>
</div>
</div>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/123905db-2c96-4a3c-b18c-1fc63721d4ff">
<img class="centered" style="width: 600px; " alt="NPS calculation elements" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/123905db-2c96-4a3c-b18c-1fc63721d4ff" />
</a><br>
<div class="centered">
(Source : <a href="https://www.netigate.net/articles/customer-satisfaction/nps-ultimate-guide-to-net-promoter-score">https://www.netigate.net/articles/customer-satisfaction/nps-ultimate-guide-to-net-promoter-score</a>)
</div>
</div>
<br>
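<p>
As a small illustration, the computation above fits in a few lines of Python; the survey scores below are made-up examples. Promoters answer 9 or 10, detractors answer 0 to 6, and the NPS is the difference between their shares:
</p>
<pre>
# NPS computation sketch (the score list is a made-up example).
def net_promoter_score(scores):
    promoters = sum(1 for s in scores if s in (9, 10))    # answered 9-10
    detractors = sum(1 for s in scores if s in range(7))  # answered 0-6
    return 100.0 * (promoters - detractors) / len(scores)

# 5 promoters, 3 passives (7-8), 2 detractors: NPS = 50% - 20% = 30
print(net_promoter_score([10, 9, 9, 10, 9, 7, 8, 7, 3, 6]))  # 30.0
</pre>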
<p>
<b>Understanding the Net Promoter Score</b>
</p>
<p>
There's a simple rule of thumb:
</p>
<ul>
<li>A positive value is OK</li>
<li>A value above 20% is what you want to reach</li>
<li>A value above 50% is extremely good</li>
</ul>
<a name="sec52"></a>
<h3>5.2 CLV to CAC Ratio</h3>
<p>
The <b>CLV to CAC Ratio</b> is an expression of how much money you can make, built on two key figures:
</p>
<ul>
<li>
<b>CAC - Customer Acquisition Costs</b> - is the figure representing what it costs your company, on average, to acquire a new customer.
<br>
It's the total cost of converting a prospect, or convincing a potential customer to become an actual customer: the total cost devoted to your sales and marketing effort - across departments, domains and worldwide - divided by the number of new customers over the period.
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/eff44437-8801-4061-8ba7-d6e914e0b568">
<img class="centered" style="width: 800px;" alt="CAC" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/eff44437-8801-4061-8ba7-d6e914e0b568" />
</a>
</div>
<br>
</li>
<li>
<b>CLV - Customer Lifetime Value</b> (or often <b>LTV</b> in the literature) - is the figure representing how much money you make, on average, with one of your customers. CLV is more difficult to compute, and no out-of-the-box formula can be expressed easily, since one needs to account for upselling, sales models such as subscriptions vs. one-time licenses, etc.
<br>
One can however simplify it by considering only the income from new customers:
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f5a37c09-190d-4ede-a65b-8ccf7e62320d">
<img class="centered" style="width: 700px;" alt="CLV" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f5a37c09-190d-4ede-a65b-8ccf7e62320d" />
</a>
</div>
<br>
</li>
</ul>
<p>
The <b>CLV to CAC ratio</b> gives you an indication of how profitable your business is.
<br>
The metric is computed by dividing CLV (LTV) by CAC. It is a signal of customer profitability, and of sales and marketing efficiency.
</p>
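<p>
Here is a short sketch of these computations, using the simplification above for CLV; the figures are entirely hypothetical, just to make the ratio concrete:
</p>
<pre>
# CAC, simplified CLV and CLV/CAC ratio sketch (hypothetical figures).
def cac(sales_and_marketing_costs, new_customers):
    # Total sales and marketing spend over the period / new customers won.
    return sales_and_marketing_costs / new_customers

def clv_simplified(revenue_from_new_customers, gross_margin, new_customers):
    # Simplified CLV: only the income from new customers is considered.
    return revenue_from_new_customers * gross_margin / new_customers

# Example: 500'000 spent to win 100 customers bringing 2'000'000 at 60% margin.
acquisition_cost = cac(500_000, 100)                  # 5'000 per customer
lifetime_value = clv_simplified(2_000_000, 0.6, 100)  # 12'000 per customer
print(lifetime_value / acquisition_cost)              # CLV/CAC ratio = 2.4
</pre>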
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ab4b3a13-1c74-42a6-a8dd-a6f4ce81e60d">
<img class="centered" style="width: 800px; " alt="CAC to CLV Ratio" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ab4b3a13-1c74-42a6-a8dd-a6f4ce81e60d" />
</a><br>
<div class="centered">
(Sources :
<br>
<a href="https://www.klipfolio.com/resources/kpi-examples/saas/customer-lifetime-value-to-customer-acquisition-cost">https://www.klipfolio.com/resources/kpi-examples/saas/customer-lifetime-value-to-customer-acquisition-cost</a>
<br>
<a href="https://www.forentrepreneurs.com/startup-killer/">https://www.forentrepreneurs.com/startup-killer/ </a>)
</div>
</div>
<br>
<a name="sec53"></a>
<h3>5.3 Retention Ratio / Curve </h3>
<p>
Acquisition isn't the whole answer. Retention is even more important!
</p>
<p>
<b>Definitions:</b>
</p>
<ul>
<li><b>N-Day Retention</b>: The proportion of users who come back on the 'Nth' day after first use.</li>
<li><b>Retention Curve</b>: A line graph depicting the average percentage of active users for each day within a specified timeframe.</li>
</ul>
<p>
At a high level, retention is a measure of how many users return to your product over time.
<br>
It is the mathematical complement of customer churn (which can be another metric to track).
</p>
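<p>
These definitions translate directly into a few lines of Python. This is a sketch under the assumption that we have, for each user, the date of first use, plus the set of (user, date) pairs of active days:
</p>
<pre>
# N-day retention and retention curve sketch (input shapes are assumptions).
from datetime import date, timedelta

def n_day_retention(n, first_use, activity):
    # Share of users active exactly N days after their first use.
    returned = sum(1 for u in first_use
                   if (u, first_use[u] + timedelta(days=n)) in activity)
    return returned / len(first_use)

def retention_curve(days, first_use, activity):
    # One retention point per day: the curve should stabilize, not hit zero.
    return [n_day_retention(n, first_use, activity) for n in days]

# Tiny example: user "a" comes back on day 1, user "b" never does.
first_use = {"a": date(2022, 1, 1), "b": date(2022, 1, 1)}
activity = {("a", date(2022, 1, 2))}
print(retention_curve([1, 2], first_use, activity))  # [0.5, 0.0]
</pre>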
<p>
The point is, every improvement that you make to retention also improves all of these other things: virality, LTV, payback period. It is literally the foundation of all growth, and that's really why retention is king.
</p>
<p>
A good way of visualizing the retention rate is by plotting a retention curve.
<br>
Retention can actually indicate whether you have a product-market fit problem: if you plot your retention numbers as a percentage of active users over time and you get a line that falls to zero instead of a curve that stabilizes, you need to solve a product-market fit problem, not a retention problem.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/29851f5f-2609-4134-bdb1-6ad23c391868">
<img class="centered" style="width: 600px; " alt="Retention Curve" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/29851f5f-2609-4134-bdb1-6ad23c391868" />
</a><br>
<div class="centered">
(Source : <a href="https://amplitude.com/mastering-retention/why-care-about-user-retention">https://amplitude.com/mastering-retention/why-care-about-user-retention </a>)
</div>
</div>
<br>
<a name="sec54"></a>
<h3>5.4 Growth Rate </h3>
<p>
The <b>Growth Rate</b> is an expression of the speed at which your business is growing.
<br>
The Growth Rate is unfortunately a metric that is simple to understand, yet fairly difficult to compute, since a lot of different elements need to be accounted for.
</p>
<p>
Imagine a situation where the growth would be 10% monthly, composed of 40% new customers every month and 30% of customers leaving or stopping to use the product.
<br>
In such a situation, even though the monthly growth seems interesting, the company is actually churning through most of its customer base within months: with 30% monthly churn, only 0.7<sup>3</sup> &asymp; 34% of the original customers are still there after 3 months!
<br>
Under such conditions, the survival of the company is almost impossible after the hype effect passes.
<br>
For this reason, the growth rate metric needs to account for the ability of the company to retain its customers, the churn rate, etc.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2432fd5a-c280-4361-8263-2a77d5171d43">
<img class="centered" style="width: 800px; " alt="Growth Rate example" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2432fd5a-c280-4361-8263-2a77d5171d43" />
</a><br>
<div class="centered">
(Sources : <br>
<a href="https://tribecap.co/a-quantitative-approach-to-product-market-fit/">https://tribecap.co/a-quantitative-approach-to-product-market-fit/</a>
<br>
<a href="https://www.lightercapital.com/blog/how-to-establish-product-market-fit/">https://www.lightercapital.com/blog/how-to-establish-product-market-fit/</a>)
</div>
</div>
<br>
<p>
<b>Growth accounting framework</b>
</p>
<p>
The <i>Growth Accounting Framework</i> proposed on tribecap.co at <a href="https://tribecap.co/a-quantitative-approach-to-product-market-fit/">https://tribecap.co/a-quantitative-approach-to-product-market-fit/</a> presents a fairly relevant approach to computing the Growth Rate.
</p>
<p>
Shortly put, it consists of working with the <b>Compound Monthly Growth Rate</b> over the past X months, as illustrated here:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/428e095f-9f33-42ba-8b35-181dd96ab687">
<img class="centered" style="width: 700px; " alt="Compund Monthly Growth Rate" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/428e095f-9f33-42ba-8b35-181dd96ab687" />
</a><br>
<div class="centered">
(Source : <a href="https://tribecap.co/a-quantitative-approach-to-product-market-fit/">https://tribecap.co/a-quantitative-approach-to-product-market-fit/</a>)
</div>
</div>
<br>
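<p>
In code, the Compound Monthly Growth Rate over the past k months boils down to the following minimal sketch (using monthly active users as input, one of the possible measures per the article above):
</p>
<pre>
// cmgr_k = (mau_now / mau_k_months_ago)^(1/k) - 1
public class Cmgr {

    static double cmgr(double mauNow, double mauKMonthsAgo, int k) {
        return Math.pow(mauNow / mauKMonthsAgo, 1.0 / k) - 1.0;
    }

    public static void main(String[] args) {
        // e.g. 12'100 MAU today vs 10'000 MAU three months ago -> ~6.6% monthly
        System.out.printf("cmgr3 = %.3f%n", cmgr(12_100, 10_000, 3));
    }
}
</pre>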
<p>
Product-Market Fit is confirmed when these indicators - cmgr3, cmgr6 and cmgr12 - go up consistently.
<br>
If you intend to use the <i>Growth Rate</i> as a key metric, you should definitely read the article above very carefully.
</p>
<a name="sec55"></a>
<h3>5.5 Further readings: pirate metrics </h3>
<p>
In this article, I have presented four example metrics, the ones that are used most of the time. But there are many more.
</p>
<p>
The reader should get familiar with the <b>Pirate Metrics</b> framework proposed by Dave McClure.
<br>
Pirate metrics is a helpful customer-lifecycle framework invented by Dave McClure from 500 Startups that you can use to determine where you should focus when optimizing your marketing funnel, to make the most of your scarcest resource - your time.
<br>
Pirate metrics is essentially a way of categorizing different metrics and KPIs, and is made up of the metric “categories” Awareness, Acquisition, Activation, Revenue, Retention, Referral - or AAARRR for short (like a pirate. Pirate metrics, get it?).
</p>
<p>
The reader might want to head on over to Slideshare to read the <a href="https://www.slideshare.net/dmc500hats/startup-metrics-for-pirates-long-version">original slide deck outlining Pirate Metrics</a>.
</p>
<a name="sec6"></a>
<h2>6. Conclusion </h2>
<p>
<b>So there is a recipe for success. </b>
</p>
<p>
Entrepreneurship has a magic power: it triggers positive energies and leaves people with an irresistible willingness to start doing things.
<br>
However, all these positive energies can very easily become negative when they are not channeled in the correct direction. And by negative we mean: having quit a day job, having spent most of our savings, having re-mortgaged the house and ultimately having trouble explaining to our life partner, family and friends why we have done all of that and still haven't been able to succeed. That's awful.
<br>
<b>Luckily there is a process that we can follow, developed after having worked with hundreds of entrepreneurs.</b>
</p>
<p>
<b>Instead of starting to develop a product and hiring people immediately, these are the questions that we need to answer</b> in order to build and launch a product that customers need and for which there is market demand:
</p>
<ul>
<li>Which problem are we going to solve?</li>
<li>Who has the problem in our market?</li>
<li>Who are the early adopters?</li>
<li>What is the value proposition able to satisfy their needs?</li>
<li>How much are they willing to pay for it?</li>
<li>What is the minimum set of features required for launch?</li>
</ul>
<p>
The way to answer most of these questions is to <b>engage with customers from the very early stage of a new business idea</b>: get to know them profoundly, create a value proposition - based on the insights captured - that relies on the company's key strengths to create a competitive advantage customers care about, and test and iterate that proposition on the market until we reach <b>Product-Market Fit</b>.
</p>
<p>
Most of this can be done before investing any substantial resource into the business, and that's really the best thing about the Lean Startup methodology. The most difficult thing is to resist the instinct to jump into “build mode”. Instead, we invest some time to de-risk the idea before investing heavily in it. Everyone is in love with their idea, and the last thing we want to hear is that it's not a good one. But the sooner we realize an idea is flawed - i.e. there is no market need for it - the better.
<br>
In terms of practical steps, this is a possible process to validate a new business idea and achieve product-market fit:
</p>
<ol>
<li>Compile a Lean Canvas to get clarity on the business idea (together with a Value Proposition Canvas)</li>
<li>Identify the riskiest assumptions for the business idea.
<br>
Hint: usually these are the ones around target segment, problems and market size.</li>
<li>(Problem Interview) Conduct a round of qualitative interviews with target customers, to understand if they have the problem, how big it is, and what they currently do to solve it.</li>
<li>Run a collaborative workshop with the entire team to refine the value proposition based on customer insights collected so far. This is how disruptive ideas are generated!</li>
<li>Prepare a cheap and quick form of a prototype.</li>
<li>(Solution Interview) Conduct another round of qualitative interviews with target customers, to understand if the solution prepared solves the problem and if they are willing to pay for it. At this stage it is mandatory to attempt to get a commitment from them.</li>
<li>Iterate the solution, and conduct new interviews if they didn't commit already.</li>
<li>(MVP) When we get enough commitment, we define an MVP (Minimum Viable Product) and start proper development (when an MVP is not possible - heavy industry, pharma, etc. - we conduct additional confirmation research and engage with more potential customers)</li>
<li>Put the MVP on the market, collect feedback on it and evolve / adapt / pivot as required until reaching Product-Market Fit</li>
</ol>
<p>
<b>An on-going health check</b>
</p>
<p>
<b>The Innovator's Dilemma</b>
<br>
A new solution achieves PMF and manages to capture the lion's share of the market.
<br>
Its maker becomes the dominant player, and stays that way until a new technology appears and supersedes the solution.
<br>
By failing to stay ahead of changing technology, the incumbent company loses market share to a smaller, disruptive business, which in turn goes on to dominate the market.
<br>
Rinse and repeat.
</p>
<p>
In this narrative, the market for a product evolves over time, and the definition of Product/Market Fit changes with it. As a result of evolving technology, a solution that fits the market in the here-and-now might not fit the same market in the future.
</p>
<p>
PMF is like an ongoing health check for your business, allowing you to periodically test the key assumptions that underpin your business:
</p>
<ul>
<li>Does the problem we solve still exist?</li>
<li>Is the problem important enough?</li>
<li>Is the market for our product still a 'good' market?</li>
</ul>
https://www.niceideas.ch/roller2/badtrash/entry/tdd-test-driven-development-is
TDD - Test Driven Development - is first and foremost a way to reduce the TCO of Software Development
Jerome Kehrli
2020-01-18T17:23:56-05:00
2020-01-19T05:44:53-05:00
<p>
<b>Test Driven Development</b> is a development practice from <i>e<b>X</b>treme <b>P</b>rogramming</i> which combines <i>test-first development</i> - where you write a test before writing just enough production code to fulfill that test - and <i>refactoring</i>.
<br>
TDD aims to improve the <b>productivity</b> and <b>quality</b> of software development. It consists in jointly building the software and its suite of non-regression tests.
</p>
<p>
The principle of TDD is as follows:
</p>
<ol>
<li>write a failing test,</li>
<li>write code for the test to work, </li>
<li>refactor the written code, </li>
</ol>
<p>
and start all over again.
</p>
<p>
Instead of writing functional code first and then the testing code afterwards (if one writes it at all), one instead <b>writes the test code before the functional code</b>.
<br>
In addition, one does so in tiny steps - writing one single test and a small bit of corresponding functional code at a time. A programmer taking a TDD approach shall refuse to write a new function until there is first a test that fails - or even doesn't compile - because that function isn't present. In fact, one shall refuse to add even a single line of code until a test exists for it. Once the test is in place, one then does the work required to ensure that the test suite passes (the new code may break several existing tests as well as the new one).
<br>
This sounds simple in principle, but when one is first learning to take a TDD approach, it does definitely require great discipline because it's easy to "<i>slip</i>" and write functional code without first writing or extending a new test.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d99df247-37ed-471e-83b4-70df826121b7">
<img class="centered" style="width: 500px; " alt="TDD Principle" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d99df247-37ed-471e-83b4-70df826121b7" />
</a>
</div>
<br>
<p>
In theory, the method requires the involvement of two different developers, one writing the tests, the other one writing the code. This avoids subjectivity issues. Kent Beck has plenty of examples of why and how TDD and <i>pair programming</i> fit eXtremely well together.
<br>
Now in practice, most of the time one single developer writes both the tests and the corresponding code all alone, which still enforces the integrity of new functionalities in a largely collaborative project.
</p>
<p>
There are multiple perspectives on what TDD actually is.
<br>
For some it's about specification and not validation. In other words, it's one way to think through the requirements or design before one writes the functional code (implying that TDD is both an important agile requirements technique and an agile design technique). These consider that TDD is first and foremost a design technique.
<br>
Another view is that TDD is a programming technique streamlining the development process.
<br>
TDD is sometimes perceived as a way to improve the quality of software deliverables, sometimes as a way to achieve better design, and sometimes many other things.
</p>
<p>
I myself believe that TDD is <b>all of this</b> but most importantly a way to significantly <b>reduce the "Total Cost of Ownership (TCO)" of software development projects</b>, especially when long-term maintenance and evolution is to be considered.
<br>
The <i>Total Cost of Ownership (TCO)</i> of enterprise software is the sum of all direct and indirect costs incurred by that software, where the development, for in-house developed software, is obviously the biggest contributor. Understanding and forecasting the TCO is a critical part of the <i>Return on Investment (ROI)</i> calculation.
</p>
<p>
This article is an in-depth presentation of my views on TDD and an attempt to illustrate my perspective on why TDD is first and foremost a way to get control back on large Software Development Projects and <b>significantly reduce their TCO</b>.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. So what is TDD exactly ?</a>
<ul>
<li><a href="#sec11">1.1 Principle of TDD</a></li>
<li><a href="#sec12">1.2 Advantages of TDD over tests after or even no tests</a></li>
<li><a href="#sec13">1.3 Different types of tests</a></li>
<li><a href="#sec14">1.4 Styles of TDD</a>
<ul>
<li><a href="#sec141">1.4.1 Inside-Out TDD (Bottom Up / Detroit School / Classic Approach)</a></li>
<li><a href="#sec142">1.4.2 Outside-In TDD (Top Down / London School / Mockist Approach)</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#sec2">2. Improving Design</a>
<ul>
<li><a href="#sec21">2.1 Design by testing and initial design</a></li>
<li><a href="#sec22">2.2 Emergent Design</a></li>
<li><a href="#sec23">2.3 Design principles to identify refactoring opportunities</a></li>
</ul>
</li>
<li><a href="#sec3">3. Reducing TCO </a>
<ul>
<li><a href="#sec31">3.1 Implementing Automated tests</a></li>
<li><a href="#sec32">3.2 Embracing TDD</a></li>
</ul>
</li>
<li><a href="#sec4">4. An example to illustrate the TCO reduction</a>
<ul>
<li><a href="#sec41">4.1 Illustration Example</a></li>
<li><a href="#sec42">4.2 No automated tests whatsoever (A)</a></li>
<li><a href="#sec44">4.4 Strict TDD following Bottom-Up approach (C)</a></li>
<li><a href="#sec45">4.5 How do these methods compare with each others in regards to TCO?</a></li>
</ul>
</li>
<li><a href="#sec5">5. Conclusion / Take Aways</a></li>
</ul>
<a name="sec1"></a>
<h2>1. So what is TDD exactly ?</h2>
<a name="sec11"></a>
<h3>1.1 Principle of TDD</h3>
<p>
The principle of TDD is simple: when one wants to develop a new feature, one starts by writing the test that assesses how it shall work. In the next step, the functional code is developed so that the test is validated. And nothing more!
<br>
Focusing on the functionality this way avoids writing code that doesn't meet a requirement assessed by a validated test.
</p>
<p>
The principle then consists in working in small iterative cycles consisting of:
</p>
<ul>
<li>writing the minimum possible code to pass the test;</li>
<li>enriching the test base with a new test;</li>
<li>rewriting the minimum code to pass the test;</li>
<li>and so on...</li>
</ul>
<p>
This practice mostly comes from Kent Beck, one of the signatories of the Agile Manifesto. It encourages a simple, clean and sound design of software products and makes the developer more confident in the ability of his code to correctly do what he wants, without a few bugs hiding in it.
</p>
<p>
Let's take a closer look at the different stages of the TDD cycle.
</p>
<ol>
<li>
<b>Write a test.</b> The first thing to do when one wants to implement a new feature is to write a test. It involves understanding the functionality that one has to develop beforehand, which is a very good thing.
</li>
<li>
<b>Execute the test(s).</b> Then one has to run the test one just wrote. In practice, the new test is executed along with all those already existing. This implies that they must be very quick to execute, otherwise too much time is wasted waiting for feedback. Some IDEs even push it to the extent of running the tests continuously during the development, in order to provide an even faster response.
<br>
The test must fail, since no code has been written to make it pass. In general, it doesn't even compile, because the method / class doesn't even exist.
</li>
<li>
<b>Write the code.</b> Then, one writes the strictly minimal functional code required to make the test pass and nothing more. If the written code is not perfect yet, or passes the test inelegantly, it doesn't really matter for now.
</li>
<li>
<b>Execute the test(s) again.</b> The developer then re-runs all the tests and makes sure that they run successfully and that everything is working fine.
</li>
<li>
<b>Refactoring.</b> In this phase, one shall improve the code one has written. This helps to see if it may be simplified, written better, made generic, factorized, etc. One shall get rid of duplications and rename variables, methods, classes, etc. whenever their names are not meaningful enough, so that the code is clean, simple and clearly expresses its intentions. One shall separate responsibilities, maybe extract some design patterns, etc.
</li>
</ol>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/16bf09a8-bc9a-4b8a-8071-471f4df6875d">
<img class="centered" style="width: 600px; " alt="TDD Principle details" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/16bf09a8-bc9a-4b8a-8071-471f4df6875d" />
</a>
</div>
<br>
Following this virtuous development approach enforces <i>single responsibilities</i>, <i>separation of concerns</i>, significant <i>code coverage</i>, etc. and comes with multiple benefits described in <a href="#sec12">the next section</a>.
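<p>
To make the cycle concrete, here is a minimal sketch of one red-green iteration in Java with JUnit. The class and method names are hypothetical and chosen for the illustration only.
</p>
<pre>
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Step 1 - RED: write the test first. At this point it doesn't even
// compile, since the Discounter class doesn't exist yet.
public class DiscounterTest {

    @Test
    public void tenPercentDiscountIsAppliedToTheBasePrice() {
        Discounter discounter = new Discounter();
        assertEquals(90.0, discounter.apply(100.0, 0.10), 0.0001);
    }
}

// Step 3 - GREEN: write the strictly minimal code making the test pass,
// and nothing more. Refactoring then improves it if needed.
class Discounter {
    double apply(double basePrice, double rate) {
        return basePrice * (1.0 - rate);
    }
}
</pre>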
<p>
<b>What to do when a bug manages to make it through to production?</b>
</p>
<p>
When TDD is properly applied, it makes it simply impossible for the vast majority of bugs to make it through to production.
<br>
However, some very tricky corner case situations may be difficult to assess with automated tests and as a consequence, even when TDD is applied by the book, it may happen that a bug slips through the cracks and is discovered late in production, sometimes after months, when the specific situation triggering the bug occurs.
</p>
<p>
This is an interesting situation and is worth discussing: what shall happen with TDD whenever a bug still manages to make it through to production?
<br>
Long story short: whenever a bug is spotted in production, the resolution of the bug shall follow TDD as well:
</p>
<ul>
<li>
First, implement a new unit test or integration test that reproduces the bug. The test shall fail at first, since the bug exists.
</li>
<li>
Then do whatever it takes to have the test passing.
</li>
</ul>
<p>
This method shall simply <b>always</b> be respected whenever a bug that slipped through the cracks is encountered. Eventually, these bug resolution tests will form some of the most important assets in the non-regression test suite.
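<p>
Here is a minimal sketch of such a bug resolution test, reusing the hypothetical <i>Discounter</i> class from the sketch above (the bug itself is invented for the illustration):
</p>
<pre>
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Regression test reproducing a hypothetical production bug: a 100%
// discount used to return a slightly negative price. The test failed
// when first written; once fixed, it pins the behaviour down forever.
public class DiscounterRegressionTest {

    @Test
    public void fullDiscountYieldsExactlyZero() {
        assertEquals(0.0, new Discounter().apply(49.99, 1.0), 0.0);
    }
}
</pre>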
</p>
<p>
<b>A note about 100% functional code coverage</b>
</p>
<p>
When following the TDD methodology, one shall target 100% coverage of the <b>functional code</b> with automated tests, both in terms of <i>Lines of Code</i> and <i>Condition Branches</i>.
<br>
This does not mean that sonar with its default configuration - or other code coverage measuring tools - shall necessarily report 100% coverage.
<br>
In practice, some boilerplate code doesn't need to be tested. It's not considered <i>functional code</i>.
<br>
For instance in Java, some exception catch blocks - that may be mandatory for the code to compile but that simply can't be reached in practice because the functional code cannot enter the specific branch triggering the exception - shall not be tested. Testing those would be a waste of time.
<br>
On the other hand, if the exception catch block corresponds to a specific exceptional business situation that can happen in practice, it has to have a proper unit test assessing it and it shall be covered.
</p>
<p>
Most of the time, the code coverage computation tool or the quality assessment tool - such as sonar - shall be properly configured to exclude from the coverage computation the blocks of code that would be a waste of time to test.
<br>
In practice that is rarely the case and the reported coverage never reaches 100%. This is not a problem as long as the <i>functional code</i> coverage reaches nearly 100%.
</p>
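<p>
With SonarQube for instance, such exclusions are typically declared through coverage exclusion patterns. The patterns below are purely illustrative; the exact property names and syntax should be checked against the Sonar version in use:
</p>
<pre>
# sonar-project.properties - illustrative coverage exclusions
sonar.coverage.exclusions=**/generated/**,**/*Config.java
</pre>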
<p>
For this reason, there is an important distinction between the <i>functional code</i> and the <i>whole code</i>.
<br>
The essential point is that the functional code - the business meaningful conditions - is covered 100%. The technical boilerplate code doesn't need to be covered 100%.
</p>
<a name="sec12"></a>
<h3>1.2 Advantages of TDD over tests after or even no tests</h3>
<p>
Implementing automated tests - mostly unit tests and some integration tests - is a formidable development tool.
</p>
<p>
<b>Even without TDD, automated tests provide significant benefits:</b>
</p>
<ul>
<li>
In software development, the written functional code has to be tested continuously as it is written. Without automated tests, one needs to test the code on a live running application to assess its behaviour. In addition, when some misbehaviour is happening, one is left with the debugger to figure out what is going wrong. There is no less efficient means to figure out what some code is doing than using a debugger (even though sometimes it's the only way). Implementing unit tests that assess conditions capturing what the code under test is doing is a much simpler way. Automated tests make it possible to understand the code behaviour in a simple and unitary way. Again, going through the debugger once in a while to understand why an assessment fails is still required, but to a much lesser extent.
</li>
<li>
Automated tests form a formidable non-regression test suite. Instead of testing the live running application over and over again to search for regressions, one simply re-runs the whole automated test suite and, if it succeeds, one can be confident that no regressions have been introduced by some maintenance or evolution.
<br>
This non-regression benefit also helps during the development process. Running the existing tests indicates whether the last change breaks something in the existing code base.
</li>
<li>
Bugs passing through the cracks and making it to production are significantly reduced. Unit testing and integration testing provide a much larger coverage of the code, both in terms of lines of code and condition branches, than manual testing. On large software products, there is simply no way manual testing can compete with automated tests, regardless of the size, the complexity and the maturity of the test plan.
</li>
<li>
Developers get more productive since they become confident in changing and evolving the code. When programming, the bigger the codebase gets, the harder it becomes to move further or to change the code, because it's easier to mess up. When one has automated tests, they become the safety net, allowing one to see what the mistakes are, where they are, and how they affect the system. They help identify errors in a really short period of time. These tests give developers very fast feedback when something breaks.
</li>
<li>
Automated tests, mostly unit tests, form a very good kind of detailed specification and documentation. This not only streamlines the development process but also helps re-understanding the code quickly when it has to be maintained, sometimes several months or even years after its initial development.
</li>
</ul>
<p>
<b>TDD, or <i>driving the software development with the tests</i>, brings additional benefits over <i>writing the tests after</i>:</b>
</p>
<ul>
<li>
TDD is about getting feedback. Some define TDD as being a mental model (discipline) which relies on a very short feedback loop at the code level. Getting short and frequent feedback about what the code is doing streamlines the whole development process. Quick and small feedback is much easier to understand than late and large feedback. TDD gives the ability to think more about simplicity, focusing on writing only the code strictly necessary to pass an assumption. In that sense, with TDD one challenges the wrong assumptions as early as possible and one identifies errors and problems very quickly.
</li>
<li>
The number of bugs passing through the cracks is reduced even further with TDD.
<br>
TDD enables one to reach an almost exhaustive coverage of the code, both in terms of <i>Lines of Code</i> and <i>Condition Branches</i>.
</li>
<li>
The increase in code coverage by tests also makes it much more straightforward to proceed with large refactorings. And without the ability to refactor, ensuring the best possible design is hopeless since, unless one is a genius, achieving the best possible design relies mostly on one's ability to refactor (improve the design).
</li>
<li>
TDD is ultimately about design. One is forced to write small classes focused on one concern and small methods dedicated to one responsibility. TDD enforces the SOLID design rules (see below). TDD enforces clean, simple and sound design since it simply makes it difficult, not to say impossible, to write convoluted code with TDD. One of the principal reasons behind this is that writing the tests first requires one to really consider what one wants from the code at the very beginning.
</li>
<li>
All of this leads to a significant reduction of the TCO (Total Cost of Ownership) of software development.
</li>
</ul>
<a name="sec13"></a>
<h3>1.3 Different types of tests</h3>
<p>
There is a distinction between automated tests and unit tests. Not all automated tests are necessarily unit tests and while TDD is mostly about unit testing, other types of tests also make a lot of sense and bring value when embracing TDD.
</p>
<p>
There are basically three types of automated tests:
</p>
<ol>
<li>
<b>Unit tests</b>: are meant to test individual modules of an application in isolation (without any interaction with dependencies) to confirm that an individual piece of code - typically a method - is doing things right.
</li>
<li>
<b>Integration tests</b>: are meant to check if different modules are working fine when combined together as a group and that their interactions produce the expected results.
</li>
<li>
<b>Functional tests</b>: are meant to test a slice of functionality in the system (may interact with dependencies) to confirm that the code is doing the right things.
</li>
</ol>
<p>
Functional tests are comparable to integration tests in some way, with the difference that they are intended to ensure the sound behaviour of the entire application's functionality, with all the code running together and deployed in a realistic way - somehow a <i>super integration test</i>.
<br>
Unit tests consider checking a single component of the system, whereas functional tests are intended to assess the conformity of a whole feature with its specifications (such as the user story and its acceptance criteria).
<br>
Functional tests are intended as a safety net and form a good way to automate acceptance tests. Working with the Product Owner, the Product Managers or Business Experts, developers can formalize acceptance criteria and automate acceptance tests in the form of functional tests. This is actually the best possible way to involve business experts in assessing the product quality.
</p>
<p>
Unit tests alone are not sufficient since it's nearly impossible to achieve close to 100% coverage of the functional code with unit tests alone. For instance, assessing behaviour related to interactions between modules is by design impossible with unit tests.
<br>
This is where integration tests kick in. Integration tests are meant to cover and assess behaviour that unit tests are not targeting, by assessing the sound behaviour of the different units when working with each other.
</p>
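<p>
As an illustration, here is a minimal sketch of an integration-style test in Java, exercising two hypothetical units together without any mock (the <i>Discounter</i> class comes from the earlier sketch):
</p>
<pre>
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Integration-style test: two real units collaborate, and the test
// asserts the result of their interaction rather than a single unit.
public class DiscountedCheckoutIT {

    @Test
    public void checkoutAppliesTheDiscountToTheCartTotal() {
        Cart cart = new Cart();
        cart.add(60.0);
        cart.add(40.0);
        Checkout checkout = new Checkout(new Discounter());
        assertEquals(90.0, checkout.total(cart, 0.10), 0.0001);
    }
}

class Cart {
    private double total = 0.0;
    void add(double price) { total += price; }
    double total() { return total; }
}

class Checkout {
    private final Discounter discounter;
    Checkout(Discounter discounter) { this.discounter = discounter; }
    double total(Cart cart, double discountRate) {
        return discounter.apply(cart.total(), discountRate);
    }
}
</pre>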
<p>
These different kinds of tests don't have the same level of complexity nor the same target in terms of coverage. Yet they have a large overlap of scope, which can be represented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9cab545d-60f0-48f9-a4ae-e4aa3a904c78">
<img class="centered" style="width: 700px; " alt="Different types of tests" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9cab545d-60f0-48f9-a4ae-e4aa3a904c78" />
</a>
</div>
<br>
<p>
All these tests shall be operated in a consistent way, using maven for instance if maven is the chosen tool to build and package the software.
<br>
In terms of technology though, while unit tests and integration tests usually share a common base, this is not necessarily the case for functional tests, which often rely on a completely different technical stack.
</p>
<p>
The technologies quite widely used for these different kinds of tests can be represented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/817ef8b6-ba7c-400f-bb10-1d4e056abcc0">
<img class="centered" style="width: 700px; " alt="Different technologies for tests" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/817ef8b6-ba7c-400f-bb10-1d4e056abcc0" />
</a>
</div>
<br>
<p>
Again, while TDD is mostly about unit testing, and depending on the approach as well as the stage of the development, some other types of tests are fully part of the TDD scope.
<br>
Eventually, <b>all these tests together form the <i>non-regression test suite</i>.</b>
</p>
<p>
For instance, while unit tests are intended to test a method (or another form of unit) specifically, perhaps under multiple conditions, integration tests are intended to test how units behave together, while functional tests are expected to test end-to-end features and mimic user behaviour on the software.
</p>
<a name="sec14"></a>
<h3>1.4 Styles of TDD</h3>
<p>
There are actually two quite opposed approaches when applying TDD on a large software development project, the Inside-Out approach and the Outside-In approach:
</p>
<a name="sec141"></a>
<h4>1.4.1 Inside-Out TDD (Bottom Up / Detroit School / Classic Approach)</h4>
<p>
The first approach is <b>Inside-Out</b> TDD, which is sometimes called the "<i>Detroit School</i>" of TDD, or <i>bottom-up</i>, or even <i>classical</i> approach. With Inside-Out TDD, one starts by writing tests and implementation for small aspects of the system. The aim is to grow the design through a process of refactoring and generalizing the codebase as tests become increasingly specific.
</p>
<p>
Although all developers should be mindful of the bigger picture, Inside-Out TDD enables developers to focus on one thing at a time. Every component (i.e. an individual module or single class) is created one after the other, and they pile up until the whole application is built.
<br>
On one hand, individual components written this way could be deemed somewhat worthless until they are connected together by higher level components and working together; also, wiring the system together at a late stage may constitute a higher risk in terms of overall design consistency. On the other hand, focusing on one component at a time helps parallelize development work efficiently within a team, and refactoring is here to ensure the overall design when the components start to pile up.
</p>
<p>
The main characteristics of the Inside-Out approach are as follows:
</p>
<ul>
<li>Emergent Design happens during the refactoring phase.</li>
<li>Very often tests are state-based tests.</li>
<li>During the refactoring phase, the unit under test may grow to multiple classes.</li>
<li>Mocks are rarely used, unless when isolating external systems.</li>
<li>No or little up-front design considerations are made, except for breaking the work down in small features. The overall design emerges from the code and is improved with refactoring.</li>
<li>Inside-Out TDD is often used in conjunction with the 4 Rules of Simple Design.</li>
</ul>
<p>
Its advantages are often considered as being the following:
</p>
<ul>
<li>It's a great way to avoid over-engineering.</li>
<li>Easier to understand and adopt due to state-based tests and little design up-front.</li>
<li>Good for exploration, when one knows what the input and desired output are but one doesn't really know what the implementation will look like at the early stage.</li>
<li>Great for cases where one can't rely on a domain expert or domain language (data transformation, algorithms, etc.)</li>
</ul>
<p>
Of course, it suffers from some commonly accepted drawbacks:
</p>
<ul>
<li>Exposing state for tests purpose only.</li>
<li>Refactoring phases are normally bigger and more complex when compared to the Outside-In approach (more on that below).</li>
<li>The unit under test becomes bigger than a class when classes emerge during the refactoring phase. This is fine when we look at that test in isolation, but as classes emerge, they evolve and are extended significantly as they are being reused by other parts of the application. As these other classes evolve, they often break completely unrelated tests, since the tests use their real implementation instead of <i>mocks</i>.</li>
<li>The refactoring step (improvement of the design) is often skipped or not done properly by inexperienced practitioners, leading to a cycle that looks more like RED-GREEN-RED-GREEN-...-RED-GREEN-MASSIVE AND YET NOT EFFICIENT REFACTORING.</li>
<li>Due to its exploratory nature, some classes under test are created according to the "<i>I think I'll need this class with this interface (public methods)</i>", making them not fit well when connected to the rest of the system and requiring further more complex refactorings.</li>
<li>Can be slow and wasteful since quite often one already knows that one cannot have so many responsibilities in the class under test. The classicist advice is to wait for the refactoring phase to fix the design, only relying on concrete evidence to extract other classes. Although this is good for novices, this is often somewhat a waste of time for more experienced developers.</li>
</ul>
<p>
An illustration of the inside-out method could be as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9fce8a45-9fc7-477a-b728-cd21dd0cc474">
<img class="centered" style="width: 850px; " alt="TDD Inside-Out process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9fce8a45-9fc7-477a-b728-cd21dd0cc474" />
</a>
</div>
<br>
<a name="sec142"></a>
<h4>1.4.2 Outside-In TDD (Top Down / London School / Mockist Approach)</h4>
<p>
The second approach is <b>Outside-In</b> TDD, which is sometimes called the "<i>London School</i>" of TDD, or <i>Top-Down</i>, or even the <i>Mockist Approach</i>. Using this approach, development begins at the very top of the system's architecture and grows downwards. The aim is to progressively implement increasing functionality of lower levels, one layer of the system at a time.
<br>
As a result, a reliance on mocks is required to simulate the functionality of lower level components.
</p>
<p>
Outside-In TDD lends itself well to having a definable route through the system from the very start, even if some (most if not all at first) parts are initially mocked.
<br>
The tests are based upon user-requested scenarios (or user stories with proper acceptance criteria well defined), and components are wired together from the beginning. This allows a fluent API to emerge and integration is performed from the very start of development.
</p>
<p>
By focusing on a complete flow through the system from the start, knowledge of how different parts of the system interact with each other is required from the very beginning. A little time to come up with a proper architecture of the system and even a rough design of every layer and functional blocks is required.
<br>
As required components emerge, they are <i>mocked</i> or <i>stubbed</i> out, which allows their detail to be deferred until later, when their time comes. This approach means that the developer needs to know how to test interactions up front, either through a mocking framework or by writing their own test doubles. The developer will then loop back, providing the real implementation of the mocked or stubbed components through new unit tests as the development moves forward.
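<p>
A minimal sketch of this style in Java, using JUnit and Mockito: the entry point is test-driven first, while its collaborator only exists as an interface and is mocked until its turn comes. All names are hypothetical.
</p>
<pre>
import org.junit.Test;
import static org.mockito.Mockito.*;

// Outside-In: start from the class receiving the external request and
// verify its collaboration with a mocked, not-yet-implemented collaborator.
public class OrderControllerTest {

    @Test
    public void placingAnOrderNotifiesTheBillingCollaborator() {
        BillingService billing = mock(BillingService.class);
        OrderController controller = new OrderController(billing);

        controller.placeOrder("order-42");

        verify(billing).bill("order-42");
    }
}

// The collaborator emerges from the domain language; its real
// implementation will be test-driven later, in its own cycle.
interface BillingService {
    void bill(String orderId);
}

class OrderController {
    private final BillingService billing;
    OrderController(BillingService billing) { this.billing = billing; }
    void placeOrder(String orderId) { billing.bill(orderId); }
}
</pre>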
</p>
<p>
The main characteristics of the Outside-In approach are as follows:
</p>
<ul>
<li>Different from the classicist, Outside-In TDD prescribes a direction in which we start test-driving our code: from outside (first class to receive an external request) to the inside (classes that will contain single pieces of behaviour that satisfy the feature being implemented).</li>
<li>One normally starts with an acceptance test which verifies if the feature as a whole works. The acceptance test also serves as a guide for the implementation as it progresses to lower layers.</li>
<li>With a failing acceptance test informing why the feature is not yet complete (no data returned, no message sent to a queue, no data stored in a database, etc.), one starts writing unit tests. The first class to be tested is the class handling an external request (a controller, queue listener, event handler, the entry point for a component, etc.)</li>
<li>As one already knows that the entire application won't be built in a single class, one makes some assumptions of which type of collaborators the class under test will need. One then writes tests that verify the collaboration between the class under test and its collaborators.</li>
<li>Collaborators are identified according to all the things the class under test needs to do when its public method is invoked. Collaborators names and methods should come from the domain language (nouns and verbs).</li>
<li>Once a class is tested, one picks the first collaborator (which was created with no implementation) and test-drive its behaviour, following the same approach one used for the previous class. This is why Outside-In is called this way: one starts from classes that are closer to the input of the system (outside) and move towards the inside of the application as more collaborators are identified.</li>
<li>Design starts in the red phase, while writing the tests.</li>
<li>Tests are rather about collaboration and behaviour, and only little about state.</li>
<li>Design is refined during the refactoring phase.</li>
<li>Each collaborator and its public methods are always created to serve an existing client class, making the code read very well.</li>
</ul>
<p>
Its advantages are often considered as being the following:
</p>
<ul>
<li>Since most classes are designed to serve the client calling code, the design is <i>client-centric</i>. This is not only better conceptually but also helps enforce the domain language when naming methods.</li>
<li>Refactoring phases are much smaller, when compared to the classicist approach.</li>
<li>Promotes better encapsulation since usually less state is exposed for test purposes only.</li>
<li>More aligned to the original ideas of <i>Object Oriented Programming</i>: tests are about objects sending messages to other objects instead of checking their state.</li>
<li>More suitable for business applications, where names and verbs can be extracted from user stories and acceptance criteria (domain model, domain language).</li>
</ul>
<p>
Of course, it also suffers from some commonly accepted drawbacks:
</p>
<ul>
<li>The architecture needs to be defined up-front, and significant design work needs to be done as well before starting to work on the first feature.</li>
<li>Much harder for novices to adopt since a higher level of design skill is necessary.</li>
<li>Developers don't get feedback from code in order to create collaborators. They need to visualize collaborators while writing the test.</li>
<li>May lead to over-engineering due to premature type (collaborators) creation.</li>
<li>Less suitable for exploratory work or behaviour that is not specified in a user story (data transformation, algorithms, etc).</li>
<li>Bad design skills may lead to an explosion of mocks.</li>
<li>Behavioural tests are harder to write than state tests.</li>
<li>Knowledge of <i>Domain Driven Design</i> and other design techniques, including the 4 Rules of Simple Design and SOLID (see below), is required while writing tests.</li>
<li>Doesn't enforce simple and clean design as much as the classical approach (the emergent design is weakened).</li>
</ul>
<p>
An illustration of the Outside-In method could be as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/fd80cb2a-d018-4688-8db8-d3d6b2b75e0c">
<img class="centered" style="width: 850px; " alt="TDD Outside-In process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/fd80cb2a-d018-4688-8db8-d3d6b2b75e0c" />
</a>
</div>
<br>
<p>
At the end of the day, every development team shall ask itself what it is more comfortable with, what makes more sense considering the product management and development organization around it, the maturity of the software architects and their ability to come up with a design first or with a good breakdown of functionality.
<br>
One method is not better than the other, even though, again, Inside-Out fits better exploratory work and technical software while Outside-In works better for business applications and large software development projects.
<br>
But at the end of the day it's more related to the culture of the software development team, and different cultures might prefer one or the other.
</p>
<a name="sec2"></a>
<h2>2. Improving Design</h2>
<a name="sec21"></a>
<h3>2.1 Design by testing and initial design</h3>
<p>
TDD is eventually a tool to help us design faster, first because of the necessity to write testable code, and second by easing refactoring.
<br>
The fact that one needs to write code that fulfills a unit test magically forces one to write code with a simple, clear and sound design.
<br>
There really is some kind of magic in this which is interesting to explain a little.
</p>
<p>
Whenever one starts by writing some code and perhaps only then writes a few unit tests (most of the time only for the methods that are easy to test this way), the testability of the code is not a key concern and it's likely that significant portions of it won't be testable using unit tests.
<br>
For the sole sake of testing and test coverage, integration tests can help a little of course, but as far as design is concerned, they don't.
</p>
<p>
Even when one tries to write code with testability in mind, ending up with code that is really only a well-thought collection of single-responsibility classes and methods each doing one and only one clearly identified functional action is really hard, not to say impossible.
</p>
<p>
With TDD, one writes the unit test first.
<br>
Writing a unit test that tests and assesses a single and unique clearly identified behaviour (or responsibility) is natural to everyone, even junior developers. When code is implemented by strictly following a logic of "<i>making this unit test pass</i>", it naturally and logically ends up being a collection of <i>single concern</i> methods and classes.
<br>
This really happens magically because it simply becomes natural to write the code this way. Implementing a unit test that would require a very convoluted code to make it pass is nearly impossible.
</p>
<p>
This is the magic part in TDD.
<br>
Interestingly, following TDD, even very junior developers end up with an initial design that is way simpler, cleaner and more sound than what experienced developers could do without TDD.
</p>
<p>
Specifically, TDD is especially good at ensuring <i>Low Coupling</i> between different modules, different classes, etc. by forcing to think and design interactions and dependencies very carefully.
</p>
<p>
But that is not all there is in TDD related to design; the next section is even more important.
</p>
<a name="sec22"></a>
<h3>2.2 Emergent Design</h3>
<p>
The next level of clarity and simplicity is then achieved with <i>refactoring</i>, which TDD makes easy and natural, thanks to the way it promotes nearly 100% functional code coverage both in terms of lines of code and condition branches.
<br>
TDD is a design help tool. The quality of the design one gets out of TDD depends largely on the capacity of the developer to refactor towards <i>Design Patterns</i>, or towards the <i>SOLID principles</i>.
<br>
We say that the developer makes the design <i>emerge</i> using <i>continuous refactoring</i>. Applying TDD without doing constant refactoring is missing half of the job and will often lead to systems not being designed as well as they could / should be.
</p>
<p>
TDD is always associated with this important notion of "<b>emergent design</b>". In agile, one often builds the software incrementally, feature by feature. So one can't know right from the start what <i>fine design</i> will be required; it will evolve / emerge as the development moves forward. So any time one adds a new piece of functionality, one does some refactoring to improve the design of the application. It's continuous / incremental design. That's why TDD is key in agile development processes.
</p>
<p>
Doing a lot of design upfront (BDUF = Big Design Up Front) is not incompatible with TDD though, on the contrary. There is nothing wrong with starting a piece of software while having the design already in mind. TDD will then enable one to put that design in place quickly. And in case the design one thought about was wrong, TDD will allow one to refactor it nicely and safely. Again, it's just a tool; it's there to help one develop his ideas faster and design stuff safely and faster.
<br>
Now RDUF - Rough Design Up Front - probably makes more sense when embracing TDD.
<br>
When using Outside-In TDD, the RDUF is a strong requirement along with a proper pre-identification of the architecture of the software product.
</p>
<p>
In every case, one should never try to do emergent design without being willing to do some constant refactoring; they both go together and it does really require a lot of discipline.
</p>
<a name="sec23"></a>
<h3>2.3 Design principles to identify refactoring opportunities</h3>
<p>
In the world of Agile projects and Agile design, several principles shall be respected to help the code keep a clean, simple and sound design.
</p>
<p>
First, the SOLID principles:
</p>
<ul>
<li>
<b>Single responsibility principle (SRP)</b> : A class should have one and only one reason to change, meaning that a class should only have one job, one single responsibility. Note that the same should apply to a method or a package - one could even consider a whole application - with different levels of abstraction of course.
</li>
<li>
<b>Open-closed Principle (OCP)</b> : Objects or components should be open for extension, but closed for modification.
Open for extension means that we should be able to add new features or components to the application without breaking existing code.
Closed for modification means that adding those new features should not introduce breaking changes to existing functionality, because that would force one to refactor a lot of existing code.
</li>
<li>
<b>Liskov Substitution Principle (LSP)</b> : Every subclass / derived class should be substitutable for its base / parent class. In other words, a subclass should override the parent class methods in a way that does not break functionality from a client's point of view. <br>
The LSP states that whenever one is tempted to introduce some inheritance between classes but would break this principle in doing so, then one should consider composition instead of inheritance.
<br>
At the end of the day, it's about answering the question "<i>Is that X really a Y?</i>" and if the answer is positive, then inheritance between X and Y can be used. For instance, the answer to "<i>Is a cat really an animal?</i>" is clearly yes. But the answer to the question "<i>Is an apartment really a room?</i>" is clearly negative, even though they share some common properties - such as size, volume, number of light switches, etc. - and some common methods - such as Enter, Leave, etc. This is a good indication that <i>Apartment</i> should not inherit from <i>Room</i>; an apartment should rather own a collection of rooms. The Composite pattern would then help factorize the common parts.
</li>
<li>
<b>Interface Segregation Principle (ISP)</b> : A client should never be forced to depend on an interface it doesn't use. Several small, client-specific interfaces are better than one large, general-purpose one.
</li>
<li>
<b>Dependency Inversion Principle (DIP)</b> : Entities must depend on abstractions, not on concretions. It states that the high level module must not depend on the low level module; they should both depend on abstractions (see the sketch after this list).
</li>
</ul>
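<p>
A minimal sketch of the Dependency Inversion Principle in Java; the class and interface names are purely illustrative:
</p>
<pre>
// The high-level ReportService depends on the Storage abstraction,
// never on a concrete implementation such as FileStorage.
interface Storage {
    void save(String report);
}

class FileStorage implements Storage {
    public void save(String report) {
        // write the report to disk (omitted)
    }
}

class ReportService {
    private final Storage storage;               // abstraction only

    ReportService(Storage storage) {             // concretion injected
        this.storage = storage;
    }

    void publish(String report) {
        storage.save(report);
    }
}
</pre>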
<p>
Then, some common sense principles:
</p>
<ul>
<li>
<b>YAGNI - You ain't gonna need it</b> : Don't implement today something that is not strictly required today. When one thinks of some cool feature and is tempted to implement it because it may one day be required, one shall simply not implement it today - rather implement it on that day, when the need is confirmed. This way, if it eventually turns out not to be required, one doesn't lose the time needed to develop it.
</li>
<li>
<b>DRY - Don't repeat yourself</b> : Use sonar or other tools to identify every piece of duplicated code or feature and factorize it to eliminate the duplication. Take the opportunity to identify the duplicated code's responsibility and introduce a proper new abstraction.
</li>
<li>
<b>KISS - Keep It Simple and Stupid</b> : Keep the design as simple as possible. This sounds easy ... but it's not. Actually coming up with the simplest possible design is much harder and requires a lot more thought than settling for the first idea that comes to mind.
</li>
<li>
<b>Design Patterns</b> : Introduce Design Patterns when identifying an opportunity for it.
</li>
</ul>
<a name="sec3"></a>
<h2>3. Reducing TCO </h2>
<a name="sec31"></a>
<h3>3.1 Implementing Automated tests</h3>
<p>
Automated tests reduce maintenance costs significantly since they:
</p>
<ul>
<li>form a formidable kind of documentation, helping to understand the code much faster when it needs to be maintained and/or evolved, sometimes months or years after the initial development, </li>
<li>spare long sessions of manual tests assessing the behaviour of the application on edge cases, </li>
<li>avoid deploying the code over and over again in a live running application to test it as the development is ongoing,</li>
<li>avoid relying exclusively on the debugger to understand misbehaviour. Using the debugger is highly inefficient. Unit and integration tests don't entirely annihilate the need to use the debugger once in a while, but they significantly reduce it,</li>
<li>finally, unit tests (as well as integration tests) prevent a lot of bugs from slipping through the cracks and making it to production, to be discovered weeks or months later when a specific condition occurs, making the business users as well as the developers lose a lot of time figuring them out and fixing them.</li>
</ul>
<p>
All these benefits that reduce the TCO come out of the box when one reaches a good coverage of the lines of code and the condition branches with automated tests.
<br>
A good coverage of the code with unit tests means reaching 80% of lines of code and condition branches covered.
<br>
The 80/20 rule states the following: <i>"If 100 days were required to cover 100% of the functional code in lines and condition branches, then it's likely that only 20 days are required to cover 80% of them, while the remaining 20% would require the additional 80 days. This overwhelming investment is not worth it and one is better off limiting the invested development effort to the 20 days covering 80% of the code."</i>
</p>
<a name="sec32"></a>
<h3>3.2 Embracing TDD</h3>
<p>
Implementing unit tests after the code suffers from two important drawbacks:
</p>
<ul>
<li>It doesn't enforce simple, clean and sound design. The emergent design approach - relying on testability of the code and refactoring - is not enforced whenever one writes tests after the code.</li>
<li>Because of the previous problem, covering (nearly) 100% of the functional code with tests written afterwards is overkill, so one settles for the 80/20 rule. </li>
<li>Since the design is not as simple, clean and sound as it shall be, long term maintenance and evolution of the code will be more expensive than when the design is as good as it can be with TDD.</li>
</ul>
<p>
These drawbacks have direct consequences on the TCO.
</p>
<p>
<b>Hence the reason for Test Driven Development.</b>
</p>
<p>
With TDD:
</p>
<ul>
<li>Because a unit test is written first and the code written afterwards is limited to each and every line required for the unit test to run successfully, the functional code coverage, both in terms of lines and condition branches, reaches (almost) 100%.</li>
<li>The code is forced to be simple and clean because it has to fulfill the specification of a single unit test. When following TDD, it's impossible to write convoluted code since it would have been impossible to implement a unit test for this convoluted code first.</li>
<li>Thanks to the nearly 100% functional code coverage by automated tests and to the simple design, refactoring is not only always possible but also simple and straightforward.</li>
</ul>
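<p>
To make the cycle concrete, hereunder is a minimal sketch of a single TDD iteration in Java with JUnit. The <i>Account</i> class is hypothetical, purely for the sake of the example: the test is written first and fails, then the simplest code fulfilling it is written, and refactoring follows with the test as a safety net.
</p>
<pre>
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Step 1 (red): write the test first. It does not even compile
// until the Account class and its withdraw method exist.
public class AccountTest {

    @Test
    public void withdrawReducesBalance() {
        Account account = new Account(100.0);
        account.withdraw(30.0);
        assertEquals(70.0, account.getBalance(), 0.0001);
    }
}

// Step 2 (green): write the simplest code making the test pass.
class Account {
    private double balance;
    Account(double initialBalance) { this.balance = initialBalance; }
    void withdraw(double amount) { this.balance -= amount; }
    double getBalance() { return balance; }
}

// Step 3 (refactor): clean up the design and rerun the test to assess
// that the behaviour is preserved.
</pre>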
<p>
These advantages have direct benefits on the maintenance cost since:
</p>
<ul>
<li>the need to spend long hours re-understanding the code over and over again every time it needs to be maintained is reduced significantly further, benefiting from the simple, clean and sound design enforced by TDD,</li>
<li>the need to use a debugger to figure out and understand misbehaviours of the code is almost eliminated, thanks to both the documentation that the tests form and the simple and clean design,</li>
<li>the need to deploy the code in a live running application to test it is reduced significantly further.</li>
</ul>
<p>
TDD is really about taking unit testing and other forms of automated testing to the next level, which benefits first and foremost the TCO.
<br>
The next section will illustrate this statement with an example.
</p>
<a name="sec4"></a>
<h2>4. An example to illustrate the TCO reduction</h2>
<a name="sec41"></a>
<h3>4.1 Illustration Example</h3>
<p>
Let's take an example as an illustration of the gain in TCO when coding with TDD.
<br>
This example assumes that some code must be developed, representing some 10 days of work.
</p>
<p>
This code will be developed following 3 methods:
</p>
<ul>
<li><b>A. No automated tests whatsoever</b></li>
<li><b>B. Automated tests implemented after the code</b></li>
<li><b>C. Strict TDD following Bottom-Up approach</b></li>
</ul>
<p>
For the sake of illustrating the potential TCO gain with TDD, the code will experience some maintenance after a few weeks (next maintenance) and then a major evolution (further evolution) after a few months.
</p>
<p>
In details, the development and maintenance tasks in the illustration scenario are as follows:
</p>
<ul>
<li>
<b>Initial development</b> : initial development of the feature down to production rollout
<ul>
<li><b>Development Time</b> : this is the initial development time of the first version of the feature</li>
<li><b>Debugging Time</b> : this is the debugging at development time, on the live running application to polish the behaviour</li>
<li><b>Manual Testing time</b> : this is the manual testing of the application at development time on the live running application</li>
<li><b>Pre and Post-Production Debugging Time</b> : this is the additional debugging just before and after the feature enters production, mostly required when the test coverage is not good.</li>
</ul>
</li>
<li>
<b>After a few weeks maintenance</b> : a few weeks later, a small set of changes is required
<ul>
<li><b>Next maintenance re-understanding time</b> : time lost on re-understanding the code and doing the small changes</li>
<li><b>Next maintenance Debugging time</b> : time lost again in debugging the code to figure out and assess its behaviour</li>
</ul>
</li>
<li>
<b>After a few months evolution</b> : a few months later, an important evolution is required.
<ul>
<li><b>Further evolution re-understanding time</b> : time lost on re-understanding the code</li>
<li><b>Further evolution implementation time</b> : time required to implement the evolution</li>
<li><b>Further evolution manual testing</b> : time required to test the evolution manually</li>
<li><b>Further evolution non-regression testing</b> : time required to re-test the feature as a whole and assess that the evolution didn't break anything.</li>
</ul>
</li>
</ul>
<p>
This is of course a simplification of a real development and maintenance scenario, since it leaves out all aspects that are not relevant to comparing the different methods, such as documentation, acceptance testing by business users or the product owner, etc.
</p>
<p>
Each and every approach listed above is discussed hereunder in terms of advantages and drawbacks related to costs.
</p>
<a name="sec42"></a>
<h3>4.2 No automated tests whatsoever (A)</h3>
<p>
The costs for the different steps above of the software component development and maintenance lifecycle in the case of the "<i>no tests</i>" method are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e57cc240-512c-4a99-9875-e138f9bd2080">
<img class="centered" style="width: 850px; " alt="No tests TCO" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e57cc240-512c-4a99-9875-e138f9bd2080" />
</a>
</div>
<br>
<p>
We can see that the actual coding cost ("<i>Development Time</i>" and "<i>Further evolution implementation time</i>") is only a tiny part of the whole development and maintenance lifecycle.
<br>
Debugging, manual testing and struggling to understand the software again after a few months represent a significant portion of the whole TCO.
</p>
<a name="sec43"></a>
<h3>4.3 Automated tests implemented after the code (B)</h3>
<p>
The costs for the different steps above of the software component development and maintenance lifecycle in the case of the "<i>test after</i>" method are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6969812a-3bcc-4fdd-a7b1-d5352b2303c9">
<img class="centered" style="width: 850px; " alt="Tests after TCO" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6969812a-3bcc-4fdd-a7b1-d5352b2303c9" />
</a>
</div>
<br>
<p>
The actual coding time becomes a lot more significant compared to the other activities (debugging, manual testing, etc.), which are significantly reduced thanks to the introduction of automated tests. Their cumulative cost remains significant though.
</p>
<a name="sec44"></a>
<h3>4.4 Strict TDD following Bottom-Up approach (C)</h3>
<p>
The costs for the different steps above of the software component development and maintenance lifecycle in the case of the "<i>test before</i>" (TDD) method are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/710ac473-0f74-4666-8e95-612fd61dc513">
<img class="centered" style="width: 850px; " alt="TDD TCO" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/710ac473-0f74-4666-8e95-612fd61dc513" />
</a>
</div>
<br>
<p>
We can see that most of the TCO is related to coding activity, either writing tests or the functional code.
<br>
Other activities are reduced to marginal levels.
</p>
<a name="sec45"></a>
<h3>4.5 How do these methods compare with each other in terms of TCO?</h3>
<p>
We can now compare these three methods in terms of TCO.
</p>
<p>
Let's first explain how these different methods diverge from each other on every task of our scenario :
</p>
<table class="allborder">
<tr style="background: #CCCCCC; font-weight: bold;">
<td>
</td>
<td>
No automated tests
</td>
<td>
Tests implemented after code
</td>
<td>
TDD
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Development Time
</td>
<td>
Not implementing any automated tests indeed makes the "pure" development part of the process quicker. There is much less code to be written. However this illusory gain will need to be paid for later on. In addition, the need to deploy the application to run live tests after every block of code implemented reduces the gain. One is better off assessing the code with unit tests instead of with a running application.
</td>
<td>
Implementing tests after writing the business logic code is better than nothing: it avoids the need to test the code within the live running application. It comes with a small additional cost, the need to write these unit tests. The problem here is mostly that implementing tests after the code doesn't benefit from the first advantage of TDD, which is ensuring a clear, simple and sound design. In addition, when writing the tests after the code, one struggles to achieve a good code coverage. In most if not all cases the code coverage is way below what is achieved with TDD. This will be paid for later on.
</td>
<td>
With TDD, tests are implemented first. This forces the code to have a clear, simple and sound design and to be highly testable. This comes with an additional cost over tests implemented after the code. However, the testing, the maintenance and the future evolution of the code will benefit tremendously from the almost exhaustive coverage of the code by automated tests and from the simple design.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Debugging Time
</td>
<td>
Without unit tests, one is left with a debugger to spot the misbehaviours of the code. Debugging a running application to figure out what the code is doing and where the problems are is the worst possible way to develop software.
</td>
<td>
The unit tests prevent losing too much time with a debugger here. However, the poor coverage means one still needs to rely quite a lot on the debugger to figure out the interactions between the different parts of the code and understand misbehaviours.
</td>
<td>
The almost exhaustive coverage by unit tests makes it almost entirely unnecessary to debug the running application to figure out misbehaviours and side effects. Everything, including edge cases, is properly covered by unit tests and the debugger is only required very rarely, to understand tricky parts of the code.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Manual Testing time
</td>
<td>
Debugging is one thing, but the worst aspect without unit tests is that one needs to test the whole behaviour of the code manually. And this is where it can get quite tricky: sometimes several minutes of manipulations are required on the UI of the running application to put in place the conditions required to test a specific edge case of the business logic.
</td>
<td>
When they are not exhaustive, tests prevent manual testing only to some extent. Most of the time when tests are implemented after the code, edge cases are not covered and as such manual testing is still required quite extensively to assess the behaviour of the code on edge cases.
</td>
<td>
This is one of the most striking advantages of TDD. The almost exhaustive coverage of condition branches and code by automated tests significantly reduces the need for manual testing. Only the tricky integration aspects remain to be tested manually.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Pre and Post-Production Debugging Time
</td>
<td>
Without integration tests, most of the time when the application is prepared for production and/or integrated in a realistic production environment for the first time, a whole new range of corner cases appears and requires a new set of very lengthy debugging sessions, not to mention the need to reproduce the production environment on the developer's computer first. In addition, after the production roll-out, specific conditions triggering new bugs will most certainly occur and make business users as well as developers lose a lot of time to figure them out and fix them.
</td>
<td>
Unfortunately, the poor coverage of condition branches as well as the lack of good integration tests reproducing different production situations most of the time prevents one from fully benefiting from the tests. When tests are implemented after the code, an important amount of debugging under the specific conditions of the production environment is usually still required. Nevertheless, automated tests prevent the majority of bugs from slipping through the cracks and surfacing after the production rollout.
</td>
<td>
Unit and integration tests can easily reproduce the whole range of possible conditions around the code being tested. Especially with integration tests, developers can reproduce different production conditions and assess the correct behaviour of the code under these specific conditions. This prevents most if not all of the production debugging nightmare.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Next maintenance re-understanding time
</td>
<td>
Unit and integration tests form a formidable kind of documentation for the code. Without any of these, whenever a developer needs to apply maintenance to some piece of code after a few months, he first needs to dedicate the required amount of time to understand the code all over again.
</td>
<td>
With unit and integration tests, the developer benefits from a surprisingly good form of documentation to understand the code very quickly and be rapidly in a position to apply the maintenance changes. However, writing the tests after the code doesn't enforce the clean, simple and sound design that TDD brings. As such, without TDD, some time is still lost due to the need to understand sometimes quite convoluted code.
</td>
<td>
With TDD, the developer doesn't only benefit from the exhaustive automated tests forming a good documentation, he also benefits from the fact that TDD enforces a clean, simple and sound design, and can understand the code produced this way much faster.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Next maintenance Debugging Time
</td>
<td>
Without tests, the need to debug the code over and over again at every maintenance kicks in. Deploying the code in a live application is the only way to figure out and understand its misbehaviours.
</td>
<td>
The unit tests prevent losing too much time with a debugger here. However, the poor coverage doesn't entirely remove the need to use it to understand some tricky parts of the code or some complex interactions and side effects.
</td>
<td>
The almost exhaustive coverage by automated tests makes it almost entirely unnecessary to debug the running application to figure out misbehaviours and side effects. This is especially important when maintaining or evolving the code months or years after it was initially written. Finally, with TDD the proper reaction whenever a bug is encountered is to implement a unit or integration test that reproduces the bug and asserts the wrong behaviour, and then to fix the failing test (see the sketch after this table). This is a much more efficient way of fixing a bug than debugging.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Further evolution re-understanding time
</td>
<td>
Same as above. Without unit tests documenting the behaviour, one is left with reading the code itself to figure out what it does.
</td>
<td>
Same as above: the developer benefits from unit tests to understand and assess the expected behaviour of the code, which saves a significant amount of time when the code needs to be maintained or evolved, sometimes several months after the initial development.
</td>
<td>
TDD comes with better and more tests, making the whole process even more efficient. In addition, the enforcement of a simple, clean and sound design makes the code itself much more readable, which further increases the TCO gains.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Further evolution implementation time
</td>
<td>
Once all the time required to understand the code all over again is invested, the developer can proceed with implementing the evolution. Not writing any tests is again quicker of course. But that gain is an illusion and without tests a lot of time will be lost further down the process.
</td>
<td>
Writing the tests after the development does take some additional development time, of course. But for all the reasons already presented, this time will be regained with huge benefits further down the process. Then again, writing the tests after the code leads neither to an optimal code coverage nor to the best possible design, which will have consequences later.
</td>
<td>
Here as well, the development of the tests will require some additional coding time, but the other activities will be significantly reduced thanks to the almost exhaustive automated test suite, not to mention the simple design which makes the whole evolution process easier.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Further evolution manual testing
</td>
<td>
Same as above. Without unit tests, one is left with manually testing every aspect of the feature and all corner cases on the live running application. This takes a lot of time and, more importantly, has to be done over and over again every time the feature evolves.
</td>
<td>
Again, writing unit tests after the code is better than nothing of course. But doing so, one struggles to come up with a sufficient coverage of the code with automated tests. And in this case at least some level of manual testing of the feature and edge cases is required.
</td>
<td>
Again, with an almost exhaustive coverage of the code both in terms of lines of code and in terms of condition branches, the need for manual testing is significantly reduced. Only specific integration concerns and very rare border cases need to be assessed on the live running application. And most of the time when a glitch is discovered, it comes from a lack of anticipation of some corner case, almost never from a bug slipping through.
</td>
</tr>
<tr>
<td style="font-weight: bold;">
Further evolution non-regression testing
</td>
<td>
This is perhaps the biggest problem from which a software development project not leveraging automated tests will suffer. Without a proper suite of automated tests to assess the non-regression of the software, one is left with manually testing almost the whole application each and every time some code is changed. This comes with an enormous hidden cost and is the price to pay for not investing in automated tests at development time.
</td>
<td>
Automated tests, even when written after the code, form a formidable protection against regressions. Most of the manual testing needed against regressions is prevented by the suite of automated tests. When the coverage of the functional code is not 100% - which is the case most of the time when tests are written after the code - some level of manual testing is still required.
</td>
<td>
This is another one of the most striking benefits of TDD: the test suite forms a formidable non-regression testing harness. With an almost exhaustive coverage of the code by automated tests, non-regression testing boils down to simply running these tests.
</td>
</tr>
</table>
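<p>
As an illustration of the bug-fixing workflow mentioned in the table above, hereunder is a minimal sketch in Java with JUnit. The <i>PaymentSplitter</i> class and the bug it once contained are hypothetical, purely for the sake of the example: the test reproduces the bug first and fails, and once the fix makes it pass it remains in the suite as a non-regression guard.
</p>
<pre>
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical class under test, kept here so the example is self-contained.
class PaymentSplitter {
    // Splits an amount in cents evenly, giving the remainder to the first part.
    int[] split(int amountInCents, int parts) {
        int[] result = new int[parts];
        java.util.Arrays.fill(result, amountInCents / parts);
        result[0] += amountInCents % parts; // the forgotten remainder was the bug
        return result;
    }
}

public class PaymentSplitterTest {

    // This test reproduces the reported bug: 100 cents split in 3 parts
    // used to lose 1 cent. It failed before the fix and now guards
    // against any regression.
    @Test
    public void splitMustNotLoseCents() {
        int[] parts = new PaymentSplitter().split(100, 3);
        assertEquals(100, parts[0] + parts[1] + parts[2]);
    }
}
</pre>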
<p>
As a consequence, the TCO in terms of required man/days diverges quite a lot between the three approaches:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c625635e-a891-4769-9a58-73b67523dd38">
<img class="centered" style="width: 850px; " alt="Approaches TCO Comparison" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c625635e-a891-4769-9a58-73b67523dd38" />
</a>
</div>
<br>
<p>
The scenario above gives us the following figures in terms of man/days required for every approach:
</p>
<ul>
<li>37 M/D for the approach without any test</li>
<li>30 M/D for the approach with tests written after the code</li>
<li>26 M/D for the TDD approach</li>
</ul>
<p>
Which represents the following differences:
</p>
<ul>
<li>20% TCO gain when working with a consistent suite of automated tests over the no-tests approach ((37 - 30) / 37 ≈ 20%)</li>
<li>10% additional gain with TDD over the <i>tests written after</i> approach ((30 - 26) / 37 ≈ 10%, still measured against the no-tests baseline).</li>
</ul>
<p>
<b>This represents a 30% reduction of TCO when embracing TDD over an approach without a comprehensive suite of automated tests ((37 - 26) / 37 ≈ 30%).</b>
<br>
On software development projects requiring millions of dollars of investment, this represents a more than significant gain.
</p>
<a name="sec5"></a>
<h2>5. Conclusion / Take Aways</h2>
<p>
I believe that the most important takeaway when reading an article about TDD is that TDD is eventually the only way to recover some level of mastery over software development processes.
<br>
Please allow me to develop this statement.
</p>
<p>
Software Engineering forms a very specific and peculiar domain in the engineering business. Let's compare its situation with <i>Civil Engineering</i> for instance. We have been building bridges for literally several thousand years. Today, even a 10-year-old child can have a basic understanding of how a bridge should be built: some pillars should be anchored in the ground, a deck should be laid on top of them, etc.
<br>
Everyone is able to figure out the trivial steps involved in building a bridge.
</p>
<p>
A software product is something completely different. Due to its very abstract nature, building large software products is very hazardous. Contrary to other engineering domains, it's nearly impossible to estimate the effort required to develop a large software component, and reality simply always drifts away from the plan.
<br>
And this is not even accounting for debugging, maintenance and evolutions.
</p>
<p>
Without TDD, the moment a development team believes it's quite close to completing the project is most of the time also the very moment it starts figuring out the tons of bugs that will need to be solved and the tremendous amount of work that actually still remains to be done.
</p>
<p>
TDD is a way to get the control back.
</p>
<p>
TDD enables a significant reduction of maintenance and evolution costs and, at the same time, gives back mastery of the software development process. With TDD, the implemented code is most of the time almost production-ready from a functional perspective and pre-production debugging sessions are largely reduced.
<br>
But more importantly, TDD smooths the future evolutions of the software product by significantly improving its design and providing an exhaustive set of non-regression tests out of the box.
</p>
<p>
At the end of the day, this significant reduction of the TCO is the most important aspect of TDD. The pressure to deliver should never dictate whether one uses TDD or not. The time gained at development time when skipping automated tests is an illusion. Eventually much more time will be lost without tests. And then again TDD is not only about tests...
</p>
https://www.niceideas.ch/roller2/badtrash/entry/ai-opportunities-and-challenges-for
AI - opportunities and challenges for Swiss banks
Jerome Kehrli
2019-12-06T11:00:58-05:00
2019-12-06T11:30:29-05:00
<p>
Yesterday we were amazed by the first smartphones. Today they have almost become an extension of ourselves.<br>
People are now used to being connected all the time, with highly efficient devices and highly responsive services, everywhere and for every possible need.
</p>
<p>
This is a new industrial revolution - the digitization - and it forces corporations to transform their business models to meet customers on these new channels.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/cbe06ed8-664a-49e5-a91e-f829ab3e011d">
<img class="centered" style="width: 650px; " alt="AI and banking" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/cbe06ed8-664a-49e5-a91e-f829ab3e011d" />
</a>
</div>
<br>
<p>
Banks worldwide are on the front line in this regard, and for many years now they have well understood the urgency of proclaiming digitization as a key objective.<br>
From a user perspective, digitization confers enormous benefits in the form of ease, speed and multiple means of access, and a paradigm shift in engagement. Since banking as a whole benefits from going digital, it is only a matter of time before operations turn completely digital.
</p>
<p>
The journey to digital transformation requires both strategic investments and tactical adjustments in orienting operations for the digital road ahead.<br>
Fortunately, if technology can be perceived as a challenge, it is also a formidable opportunity.<br>
And in this regard, Artificial Intelligence is a category of its own.
</p>
<h3>Artificial Intelligence and its potential in the banking business</h3>
<p>
AI provides an unprecedented opportunity to make banks smarter. Deploying AI solutions in banking leads to better customer intelligence and better customer experience. <br>
Both are key to increasing benefits and reducing operational costs.
</p>
<p>
There are multiple applications for AI solutions in the banking business around three major axes:
</p>
<ol>
<li> Customer Experience revolution when putting technology in direct contact with the customer</li>
<li> AI analytics improving operational efficiency in various domains (e.g. investment research, credit scoring, etc.) or providing personalized advisory to customers</li>
<li> Risk mitigation with better fraud detection, more efficient AML, more efficient compliance controls, etc.</li>
</ol>
<p>
One of the most impressive opportunities on the customer experience revolution axis is formed by chatbots and voice-assisted banking. The need for physical presence is definitely fading and technology empowers customers to use banking services through voice commands and touch screens.
</p>
<p>
In regards to improving operational efficiency within the bank, the most promising evolution comes from the conjunction of real-time Big Data processing with Machine Learning. The technology can provide personalized, value-added products to customers as it learns about spending habits or investment profiles, but it can also automate most analytics duties within the bank.<br>
Data-driven AI applications are intended in the future to cover the whole range of financial decisions: advisory, calculations, scoring and forecasts, for the bank as well as for its customers. For instance, if approving a commercial real estate loan was traditionally a several-day process within a bank, using AI will reduce it to a few dozen minutes.
</p>
<p>
Last but not least, embracing AI has been at the root of significant improvements in fraud detection and AML. Companies like MasterCard and Visa have been using AI to detect fraudulent transaction patterns for several years now. At NetGuardians, we have been deploying AI solutions for digital banking fraud prevention and internal fraud detection for several years as well.
</p>
<p>
AI solutions are key to reacting proactively and informing the customer before the funds leave the bank. AI enables the implementation of transaction analytics but also of behaviour analysis aimed at catching more complex fraud patterns.
</p>
<h3>What about Swiss banks ?</h3>
<p>
Interestingly, while most would describe Switzerland as less innovative than other countries such as the UK, especially in the retail banking space, the reality is a rather more nuanced picture.
</p>
<p>
The digital solutions of the major Swiss banks are among the best in the world and Machine Learning algorithms are used down the line on the three axes described above. Due to their conservative nature, the major Swiss banks tend to follow market best practices and the state of the art in terms of customer experience evolution, but on the backend side - the technology running under the hood - they are actually well ahead.
</p>
<p>
The situation of smaller Swiss retail banking institutions is somewhat similar. Their strong footprint in their regions, their attractive conditions as well as their generally good digital banking solutions relieve the pressure. The biggest difference with major banks is that smaller institutions don't necessarily have the ability to research AI or Machine Learning technology on their own, so they rely on third-party providers such as NetGuardians for fraud prevention or other fintechs for other use cases.<br>
In this sense, keeping a close proximity with engineering schools and universities is a tremendous opportunity to stay on top of the game and get in touch with the numerous fintechs flourishing in Switzerland. One could only advise them to be less timorous when it comes to supporting these startups, since investing in them is eventually their only way to support the development of the technology that will be available to them in the future.
</p>
<p>
Private banking institutions on the other hand are more vulnerable today, at least the smaller ones. Their margins are shrinking and their wealth management business is increasingly cannibalized by other actors such as External Asset Managers, fintechs or bigger institutions - even retail institutions, where AI has been instrumental in making them reach a level of proximity in advisory that was so far the exclusive privilege of private banks.<br>
Private banking institutions need to understand the urgency of revolutionizing the private banking customer experience and recovering the lead in this regard from the other actors. Here as well, the opportunities for Artificial Intelligence applications are striking: operational efficiency, advisory, etc.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/dissecting-swift-message-types-involved
Dissecting SWIFT Message Types involved in payments
Jerome Kehrli
2019-04-05T05:40:51-04:00
2019-04-06T15:24:06-04:00
<p>
In my current company, we implement a state-of-the-art banking Fraud Detection system using Artificial Intelligence running on a Big Data Analytics platform. When working on preventing banking fraud, looking at SWIFT messages is extremely interesting: 98% of all cross-border (international) funds transfers are indeed transferred using the SWIFT Network.
<br>
The SWIFT network enables financial institutions worldwide to send and receive information about financial transactions in a secure, standardized and reliable environment. Many different kinds of information can be transferred between banking institutions using the SWIFT network.
</p>
<p>
In this article, I intend to dissect the key SWIFT Message Types involved in funds transfers, present examples of such messages along with use cases and detail the most essential attributes of these payments.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5614a1d-2b9e-4be6-b582-470d92744b3b">
<img class="centered" style="width: 850px; " alt="SWIFT messaging scheme" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5614a1d-2b9e-4be6-b582-470d92744b3b" />
</a>
</div>
<br>
<p>
These key messages are as follows:
</p>
<ul>
<li>MT 101 - Request for Transfer</li>
<li>MT 103 - Single Customer Credit Transfer</li>
<li>MT 202 - General Financial Institution Transfer</li>
<li>MT 202 COV - General Financial Institution Transfer for Cover payments</li>
</ul>
<p>
This article presents each of these messages, discusses their typical use cases and details the key SWIFT fields involved.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction to SWIFT</a>
<ul>
<li><a href="#sec11">1.1 Key SWIFT aspects</a></li>
<li><a href="#sec12">1.2 Correspondent banking</a>
<ul>
<li><a href="#sec121">1.2.1 Correspondent Bank</a></li>
<li><a href="#sec122">1.2.2 Transferring Money Using a Correspondent Bank</a></li>
<li><a href="#sec123">1.2.3 VOSTRO and NOSTRO accounts</a></li>
</ul>
</li>
<li><a href="#sec13">1.3 Key SWIFT Message Types</a>
<ul>
<li><a href="#sec131">1.3.1 Detailed presentation of key SWIFT messages</a></li>
<li><a href="#sec132">1.3.2 Typical situation and use cases in banking institutions</a></li>
</ul>
</li>
<li><a href="#sec14">1.4 Serial and cover payments</a>
<ul>
<li><a href="#sec141">1.4.1 Cover method details</a></li>
<li><a href="#sec142">1.4.2 Serial method details</a></li>
</ul>
</li>
<li><a href="#sec15">1.5 SWIFT Message Structure</a></li>
<li><a href="#sec16">1.6 SWIFT BIC Code</a>
<ul>
<li><a href="#sec161">1.6.1 Structure of the SWIFT BIC Code</a></li>
</ul>
</li>
<li><a href="#sec17">1.7 Other specific details</a></li>
</ul>
</li>
<li><a href="#sec2">2. Dissecting key SWIFT Mesages involved in payments (Funds Transfers)</a>
<ul>
<li><a href="#sec21">2.1 SWIFT MT101 Detailed Analysis</a>
<ul>
<li><a href="#sec211">2.1.1 MT101 Introductory examples</a>
<ul>
<li><a href="#sec2111">2.1.1.1 MT101 Example 1: Simplest case</a></li>
<li><a href="#sec2112">2.1.1.2 MT101 Example 2: beneficiary with another institution</a></li>
<li><a href="#sec2113">2.1.1.3 MT101 Example 3: Multiple payments in MT101</a></li>
<li><a href="#sec2114">2.1.1.4 MT101 Example 4: Payment from a subsidiary account</a></li>
<li><a href="#sec2115">2.1.1.5 MT101 Example 5: Fund repatriation</a></li>
</ul>
</li>
<li><a href="#sec212">2.1.2 MT101 Parsing and Data Mapping </a></li>
<li><a href="#sec213">2.1.3 Additional notes on MT101</a></li>
</ul>
</li>
<li><a href="#sec22">2.2 SWIFT MT103 Detailed Analysis</a>
<ul>
<li><a href="#sec221">2.2.1 MT103 Introductory examples</a>
<ul>
<li><a href="#sec2211">2.2.1.1 MT103 Example 1: Simplest case</a></li>
<li><a href="#sec2212">2.2.1.2 MT103 Example 2: more realistic example</a></li>
<li><a href="#sec2213">2.2.1.3 MT103 Example3: forwarded serial message</a></li>
<li><a href="#sec2214">2.2.1.4 MT103 Example 4: Announce message (cover method)</a></li>
</ul>
</li>
<li><a href="#sec222">2.2.2 MT103 Parsing and Data Mapping </a></li>
<li><a href="#sec223">2.2.3 Additional notes on MT103</a></li>
</ul>
</li>
<li><a href="#sec23">2.3 SWIFT MT202 Detailed Analysis</a>
<ul>
<li><a href="#sec231">2.3.1 SWIFT MT202 Introductory Exampkes</a>
<ul>
<li><a href="#sec2311">2.3.1.1 MT202 Example 1: simplest case</a></li>
<li><a href="#sec2312">2.3.1.2 MT202 Example 2: other bank case</a></li>
<li><a href="#sec2313">2.3.1.3 MT202 Example 3: routed message</a></li>
</ul>
</li>
<li><a href="#sec232">2.3.2 MT202 Parsing and Data Mapping </a></li>
<li><a href="#sec233">2.3.3 Additional notes on MT202</a></li>
</ul>
</li>
<li><a href="#sec24">2.4 SWIFT MT202 COV Detailed Analysis</a>
<ul>
<li><a href="#sec241">2.4.1 MT202 COV Introductory examples</a>
<ul>
<li><a href="#sec2411">2.4.1.1 MT202 COV Example: Cover payment</a></li>
</ul>
</li>
<li><a href="#sec242">2.4.2 MT202 COV Parsing and Data Mapping </a></li>
<li><a href="#sec243">2.4.3 Additional notes on MT202 COV</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#sec3">3. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction to SWIFT</h2>
<p>
SWIFT - Society for Worldwide Inter-bank Financial Telecommunication - is a Belgian company operating a trusted and closed network used for communication between banks around the world. It is overseen by a committee composed of the US Federal Reserve, the Bank of England, the European Central Bank, the Bank of Japan and other major central banks.
<br>
SWIFT is used by around 11,000 institutions in more than 200 countries and supports around 25 million communications a day, most of them being money transfer transactions, the rest being various other types of messages.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a887fd53-655c-48a9-9397-6899371af97d">
<img class="centered" style="width: 150px; " alt="SWIFT logo" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a887fd53-655c-48a9-9397-6899371af97d" />
</a>
</div>
<br>
<p>
The majority of international inter-bank messages use the SWIFT network.
<br>
SWIFT does not facilitate funds transfer: rather, it sends payment orders, which must be settled by correspondent accounts that the institutions have with each other. For two financial institutions to exchange banking transactions, they must have a banking relationship beforehand.
</p>
<a name="sec11"></a>
<h3>1.1 Key SWIFT aspects</h3>
<p>
Internationally standardized messaging means that every transaction between every financial institution is recorded in exactly the same way, providing all the details in a clear and transparent manner.
<br>
Every financial institution has its own unique code that provides information on the name and location of the bank. Each transaction contains a unique reference number, bank operation code and details of charges incurred during the transaction.
</p>
<p>
Because SWIFT uses internationally standardized messages, it is a transparent way for institutions to communicate between each other and securely relay the details of any transaction. There are a number of known benefits to using SWIFT:
</p>
<ul>
<li>
Transparency. SWIFT payments clearly detail the amounts involved in the transaction, the route it takes between banks, the details of all charges and the nature of the payment (along with many other details). This information allows all parties involved to track the transaction and to understand the costs and time period involved.
</li>
<li>
Traceability. Because SWIFT details the route of the transaction between banks and the amount of money involved, it provides clear and recognized proof of payment.
</li>
<li>
Consistency. Due to the consistency of how messages are structured, payment information is easy to decipher regardless of country or language barriers.
</li>
</ul>
<a name="sec12"></a>
<h3>1.2 Correspondent banking</h3>
<p>
Correspondent banking is an important aspect of international banking and a key concept underneath the SWIFT network.
</p>
<a name="sec121"></a>
<h4>1.2.1 Correspondent Bank</h4>
<p>
A correspondent bank is a financial institution that provides services on behalf of another financial institution. It can facilitate wire transfers, conduct business transactions, accept deposits and gather documents on behalf of another financial institution. Correspondent banks are most likely to be used by domestic banks to service transactions that either originate or are completed in foreign countries, acting as a domestic bank's agent abroad.
</p>
<p>
Generally speaking, the reasons domestic banks employ correspondent banks include:
</p>
<ul>
<li>their limited access to foreign financial markets and their inability to service client accounts without opening branches abroad,</li>
<li>the need for an intermediary between banks in different countries or for an agent to process local transactions for customers abroad,</li>
<li>the need to accept deposits, process documentation and serve as a transfer agent for funds.</li>
</ul>
<p>
The ability to execute these services relieves domestic banks of the need to establish a physical presence in foreign countries.
</p>
<a name="sec122"></a>
<h4>1.2.2 Transferring Money Using a Correspondent Bank</h4>
<p>
International wire transfers often occur between banks that do not have an established financial relationship. When agreements are not in place between the bank sending a wire and the one receiving it, a correspondent bank must act as an intermediary. For example, a bank in Geneva that has received instructions to wire funds to a bank in Japan cannot wire funds directly without a working relationship with the receiving bank.
<br>
Most if not all international wire transfers are executed through SWIFT. Knowing there is not a working relationship with the destination bank, the originating bank can search the SWIFT network for a correspondent bank that has arrangements with both banks.
</p>
<p>
Interestingly, when a bank wants to send some funds to another bank on the other side of the world, it often happens that the sending bank has no banking relationship with any bank that itself has a relationship with the target bank. In this case, the order needs to be transferred through several, sometimes many, distinct banking institutions over the SWIFT network.
<br>
These intermediate banks are called routing banks.
</p>
<a name="sec123"></a>
<h4>1.2.3 VOSTRO and NOSTRO accounts</h4>
<p>
General usage of NOSTRO/VOSTRO (they refer to the same account but from different bank perspectives):
<br>
A NOSTRO account is a reference used by Bank A to refer to "our" account held by Bank B. NOSTRO is a shorthand way of talking about "our money that is on deposit at your bank".
<br>
VOSTRO is the term used by Bank B, where Bank A's money is on deposit. VOSTRO is a reference to "yours" and refers to "your money that is on deposit at our bank".
</p>
<a name="sec13"></a>
<h3>1.3 Key SWIFT Message Types</h3>
<p>
When it comes to fund transfers, only a subset of the SWIFT messages are relevant:
</p>
<table class="allborder">
<tr style="background: #CCCCCC; font-weight: bold;">
<td>Message Identification and Name</td>
<td>Space</td>
<td>Reference document</td>
</tr>
<tr>
<td>MT 101 Request for Transfer</td>
<td>Customer-to-Bank and Interbank</td>
<td rowspan="5">
<a href="https://www2.swift.com/uhbonline/books/public/en_uk/us2m_20180720/index.htm">Standards MT November 2018</a>
</td>
</tr>
<tr>
<td>
MT 103 Single Customer Credit Transfer
</td>
<td>
Interbank
</td>
</tr>
<tr>
<td>
MT 202 General Financial Institution Transfer
</td>
<td>
Interbank
</td>
</tr>
<tr>
<td>
MT 202 COV General Financial Institution Transfer
</td>
<td>
Interbank
</td>
</tr>
<tr>
<td>
MT 9XX - Confirmations and statements
</td>
<td>
Customer-to-Bank and Interbank
</td>
</tr>
</table>
<a name="sec131"></a>
<h4>1.3.1 Detailed presentation of key SWIFT messages</h4>
<p>
These SWIFT messages are described as follows:
</p>
<ul>
<li>
The <b>SWIFT MT101</b> message is a request for transfer, enabling the electronic transfer of funds from one account to another. Funds are transferred from the ordering customer's account to a receiving financial institution or account servicing financial institution. For us right now, the important thing to note is that the message format that enables this transfer is the SWIFT MT101 format.
</li>
<li>
The <b>MT103</b> is a SWIFT message format used for making payments. MT103 SWIFT payments are known as international wire transfers, telegraphic transfers, standard EU payments (SEPA payments), LVTS in Canada, etc.
</li>
<li>
The SWIFT <b>MT202</b> requests the movement of funds between financial institutions, except when the transfer is related to an underlying customer credit transfer that was sent with the cover method, in which case the MT202 COV must be used.
</li>
<li>
<b>MT202 COV</b> is a SWIFT message format for financial institution (FI) funds transfer between financial institutions. MT202s are used primarily for two purposes: bank-to-bank payments (i.e. interest payments and settlement of FX trades) and cover payments.
<br>
MT202 COV was introduced in 2009 to create traceability of the origination of funds (institution and account) through to the destination of funds (institution and account). This was in response to anti-money laundering and associated banking requirements.
<br>
Prior to MT202 COV, the MT202 message format did not include origination/destination financial institution information. Particularly for cover payments, where a combination of MT103 and MT202 is used to direct funds transfers to a beneficiary account, the intermediate banks in the MT202 had no ability to understand and perform risk analysis/AML/compliance checks on the funds transfer based on the origin and destination of the funds. Thus, intermediate banks could be unwittingly involved in illegal transactions under new regulations.
</li>
</ul>
<a name="sec132"></a>
<h4>1.3.2 Typical situation and use cases in banking institutions</h4>
<p>
In regards to this set of messages, the various situations that a banking institution is confronted with can be represented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5614a1d-2b9e-4be6-b582-470d92744b3b">
<img class="centered" style="width: 850px; " alt="SWIFT messaging scheme" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5614a1d-2b9e-4be6-b582-470d92744b3b" />
</a>
</div>
<br>
<p>
With the following details:
</p>
<ul>
<li>
A customer can request a fund transfer on his behalf using either a bank native channel (email, EBanking, branch, etc.) or by sending an MT101 to the bank. Big corporate customers can indeed be connected to the SWIFT network themselves and send requests to the bank using the MT101 Message Type, which is intended for this purpose.
</li>
<li>
Most of the time, the reception of an MT101 by the bank will make it issue an MT103 to proceed further with the fund transfer. But if the account to be debited is not held by the bank, it can as well transfer the MT101 further to the institution holding the account to be debited.
</li>
<li>
MT103s are sent following a customer request either to transfer the funds further (serial method) or to announce to the receiving institution that the funds will be coming through an MT202 COV (cover method).
</li>
<li>
An MT202 is sent on behalf of the bank itself and not following a customer request, contrary to the MT103.
</li>
<li>
If the bank is a routing bank in the middle of a routing chain, MT103s and MT202s are sent further. Actually, new messages are sent: the emitted MT103 or MT202 can be different from the received one.
</li>
</ul>
<a name="sec14"></a>
<h3>1.4 Serial and cover payments</h3>
<p>
SWIFT Serial and Cover payments originate from the two methods that are used to settle transactions in the SWIFT network and specifically in the field of correspondent banking: Serial method and Cover method.
<br>
When sender and receiver are located in different currency zones, they send or receive funds through their correspondents. In this case, either of the two methods can be used.
</p>
<p>
With the cover method, two different messages are initiated by the sender to settle the funds: an MT103 and an MT202 COV. The MT103 message is used to inform the creditor bank that funds are coming; it is an announcement. The MT202 COV, called the cover message, moves the funds between correspondent accounts.
<br>
With the serial method, one single message is initiated by the sender to settle the funds: an MT103. That MT103 is in this case not an announcement, but the fund transfer itself, which moves from one party to the next in the payment chain until it reaches the beneficiary bank.
</p>
<a name="sec141"></a>
<h4>1.4.1 Cover method details</h4>
<p>
The MT103 announcement is sent to the beneficiary bank to announce that funds are coming for a specific beneficiary. It does not carry the funds but rather just informs the beneficiary bank that funds are coming, for which beneficiary customer, and which correspondent (of the beneficiary bank) will receive the funds.
</p>
<p>
The cover payment (MT202 COV) is sent by the sender to its correspondent. This is the message that really moves the funds. The MT202 COV enables the sender to ask its correspondent to debit its account (of the sender, with the correspondent) and credit the beneficiary bank's account with its correspondent.
</p>
<p>
Most of the time, the announcement is created and sent before the cover, but that is not a requirement, so receiving the cover payment before the announcement is a situation that needs to be taken into account by the receiver.
</p>
<a name="sec142"></a>
<h4>1.4.2 Serial method details</h4>
<p>
Using an MT103, the funds move from one party to another until they reach the final beneficiary. Following a customer request, the sender sends a serial MT103 to its correspondent, which transfers the funds to the intermediary institution. The intermediary institution is most of the time the correspondent of the beneficiary. The intermediary institution in turn credits the account of the creditor bank and eventually the beneficiary account.
</p>
<p>
In the SWIFT MT103 Serial Message, the fields 56a and 57a are used, while the fields 53a and 54a are used in the MT103 Announcement Message (cover method).
<br>
Intermediary institution and receiver's correspondent are usually two names designating the same thing. The account with institution is the bank that holds the beneficiary account, so just another name for the creditor bank.
</p>
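<p>
As a small illustration, hereunder is a simplistic classification sketch in Java following the rule of thumb above: fields 56a/57a for serial messages, fields 53a/54a for announcements. It is a deliberate simplification, purely for the sake of the example - real messages require a finer analysis since some of these fields may coexist.
</p>
<pre>
import java.util.Set;

public class Mt103Classifier {

    // Simplistic classification of an MT103 based on the field tags present
    // in its text block, following the rule of thumb described above.
    public static String classify(Set&lt;String&gt; fieldTags) {
        if (fieldTags.contains("56") || fieldTags.contains("57")) {
            return "MT103 serial";
        }
        if (fieldTags.contains("53") || fieldTags.contains("54")) {
            return "MT103 announcement (cover method)";
        }
        return "undetermined";
    }

    public static void main(String[] args) {
        System.out.println(classify(Set.of("20", "32A", "50", "57"))); // serial
        System.out.println(classify(Set.of("20", "32A", "50", "53"))); // announcement
    }
}
</pre>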
<a name="sec15"></a>
<h3>1.5 SWIFT Message Structure</h3>
<p>
This section presents what an ISO 15022 message looks like and decomposes it into its constituent parts. The description of the structure is intended as guidance for building a SWIFT message parser.
</p>
<p>
A message consists of blocks enclosed in curly braces. The first colon separates the block name and content. The block content can consist of sub-blocks.
</p>
<table class="allborder">
<tr style="background: #CCCCCC; font-weight: bold;">
<td>
Block Ident.
</td>
<td>
Block Name
</td>
<td>
Mandatory or Optional
</td>
<td>
Description
</td>
<td>
Comments
</td>
</tr>
<tr>
<td>
1
</td>
<td>
Basic Header
</td>
<td class="center">
M
</td>
<td>
The only mandatory block is the basic header. The basic header contains the general information that identifies the message, and some additional control information. The FIN interface automatically builds the basic header.
</td>
<td>
Common to all SWIFT messages. It contains five fields that are all mandatory.
</td>
</tr>
<tr>
<td>
2
</td>
<td>
Application Header
</td>
<td class="center">
O
</td>
<td>
The application header contains information that is specific to the application. The application header is required for messages that users, or the system and users, exchange. Exceptions are session establishment and session closure.
</td>
<td>
Common to all SWIFT messages. There are two variations: One block for input messages which may contain up to six fields and one for output messages which may have up to seven fields.
</td>
</tr>
<tr>
<td>
3
</td>
<td>
User Header
</td>
<td class="center">
O
</td>
<td>
The user header is an optional header.
</td>
<td>
Common to all SWIFT messages. All fields of the user header (except the tag 103 for FINCopy Service) are optional. Fields are populated in specific situations.
</td>
</tr>
<tr>
<td>
4
</td>
<td>
Text
</td>
<td class="center">
O
</td>
<td>
The text is the actual data to transfer.
</td>
<td>
This is the block found in the Message Reference Guide.
</td>
</tr>
<tr>
<td>
5
</td>
<td>
Trailers
</td>
<td class="center">
O
</td>
<td>
The trailer either indicates special circumstances that relate to message handling or contains security information.
</td>
<td>
Common to all SWIFT messages. Like the block 3, this block consists of only non-mandatory fields except the checksum.
</td>
</tr>
</table>
<p>
The various header blocks contain different kinds of information, but not all of them are interesting for our use cases in my current company. The next chapters will present which fields we extract and for what usage.
</p>
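<p>
As a guidance, hereunder is a minimal parsing sketch in Java for the outer block structure described above (blocks enclosed in curly braces, the first colon separating the block identifier from its content). It is deliberately simplified and the message in the example is shortened and illustrative only: a real parser must handle sub-blocks within blocks 3 and 5 and the field structure of block 4.
</p>
<pre>
public class SwiftBlockParser {

    // Prints the top-level blocks of a raw SWIFT message:
    // the block identifier ("1", "2", ...) and the block content.
    public static void parseBlocks(String rawMessage) {
        int pos = rawMessage.indexOf('{');
        while (pos &gt;= 0) {
            int colon = rawMessage.indexOf(':', pos);
            int close = findMatchingBrace(rawMessage, pos);
            String blockId = rawMessage.substring(pos + 1, colon);
            String content = rawMessage.substring(colon + 1, close);
            System.out.println("Block " + blockId + " = " + content);
            pos = rawMessage.indexOf('{', close);
        }
    }

    // Finds the closing brace matching the opening brace at position open,
    // accounting for nested sub-blocks.
    private static int findMatchingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; ; i++) {
            if (s.charAt(i) == '{') depth++;
            if (s.charAt(i) == '}') depth--;
            if (depth == 0) return i;
        }
    }

    public static void main(String[] args) {
        // Shortened, illustrative message (not a complete real-world example).
        parseBlocks("{1:F01BANKBEBBAXXX0000000000}{2:I103BANKDEFFXXXXN}{4:\n:20:REF-1234\n-}");
    }
}
</pre>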
<a name="sec16"></a>
<h3>1.6 SWIFT BIC Code</h3>
<p>
The SWIFT BIC code is much more than an entity identifier. It is used to route the financial messages from the issuing institution to the receiving institution. The SWIFT BIC Code therefore plays a crucial role in payment messaging. Without it, a message cannot be transported to the receiving entity over SWIFTNet.
</p>
<p>
The BIC code contains the identity and the location of the participants, which are used to determine and reach the message destination.
</p>
<p>
The SWIFT BIC code is composed of exactly 8 or 11 alphanumeric characters, structured as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e06c3953-d548-4614-9f78-01f4d54048ba">
<img class="centered" style="width: 350px; " alt="BIC Code Structure" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e06c3953-d548-4614-9f78-01f4d54048ba" />
</a>
</div>
<br>
<a name="sec161"></a>
<h4>1.6.1 Structure of the SWIFT BIC Code</h4>
<p>
The structure of the BIC code is as follows:
</p>
<ul>
<li>
4 alphabetical characters that identify the institution (bank or corporate)
</li>
<li>
2 alphabetical characters for the ISO code of the country in which the institution is located
</li>
<li>
2 alphabetic or numeric characters used to locate the institution head office in the country or the head office in a particular region in the country.
<ul>
<li>
When the second character takes the value "0", it is typically a test BIC used in test systems as opposed to a BIC used on the live network (also production).
</li>
<li>
When the second character takes the value "1", it denotes a passive participant in the SWIFT network. Passive participants cannot be contacted directly over the SWIFT Network. These BICs are sometimes referred to as ‘BIC1', ‘non-SWIFT BIC' and ‘non-connected BIC'. A non-connected BIC is not allowed in the header of a SWIFT message, otherwise the message is rejected by the SWIFT system.
</li>
<li>
When the second character takes the value "2", it indicates a reverse billing BIC, where the recipient pays for the message as opposed to the more usual mode where the sender pays for the message.
</li>
</ul>
</li>
<li>
3 alphabetical or numeric characters to indicate a branch or agency of the institution. Unlike the first 8 characters, these last 3 are not mandatory. They are mainly used by banks and less by corporations.
</li>
</ul>
<p>
A few examples to illustrate the above explanations:
</p>
<ul>
<li>
DEUTDEFF is the BIC of Deutsche Bank (DEUT) / in Germany (DE) / Main office of Frankfurt (FF)
</li>
<li>
DEUTDESS is the BIC of Deutsche Bank (DEUT) / in Germany (DE) / Main office of Stuttgart (SS)
</li>
<li>
DEUTDESS648 is the BIC of Deutsche Bank (DEUT) / in Germany (DE) / Main office of Stuttgart (SS). 648 is the branch located in Vaihingen-Enz in the same region.
</li>
<li>
DEUTDES0 and DEUTDES0648 are test BIC for DEUTDESS and DEUTDESS648
</li>
<li>
LAFAFRPP is the BIC of Lafarge (LAFA) / in France (FR) / Main office of Paris (PP)
</li>
</ul>
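<p>
Hereunder is a minimal sketch in Java decomposing a BIC according to the structure above. It is a simple illustration, not a full validation of the BIC standard (it does not check character classes, country codes, etc.).
</p>
<pre>
public class BicCode {

    // Decomposes an 8 or 11 character BIC into its components.
    public static void decompose(String bic) {
        if (bic.length() != 8 &amp;&amp; bic.length() != 11) {
            throw new IllegalArgumentException("A BIC has 8 or 11 characters: " + bic);
        }
        System.out.println("Institution : " + bic.substring(0, 4));
        System.out.println("Country     : " + bic.substring(4, 6));
        System.out.println("Location    : " + bic.substring(6, 8));
        // The second location character flags test ('0'), non-connected ('1')
        // or reverse billing ('2') BICs, as described above.
        if (bic.charAt(7) == '0') {
            System.out.println("              (test BIC)");
        }
        // The optional last 3 characters designate a branch of the institution.
        if (bic.length() == 11) {
            System.out.println("Branch      : " + bic.substring(8));
        }
    }

    public static void main(String[] args) {
        decompose("DEUTDESS648"); // Deutsche Bank / Germany / Stuttgart / Vaihingen-Enz branch
        decompose("LAFAFRPP");    // Lafarge / France / Paris
    }
}
</pre>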
<a name="sec17"></a>
<h3>1.7 Other specific details</h3>
<p>
Below are some specific noteworthy details about SWIFT messages in a raw fashion:
</p>
<ul>
<li>
<b>Caution: about the little "a" in field names</b>
<br>
Very often in documents and papers about SWIFT, fields are indicated with a little "a" as suffix, such as 54a, 56a, etc.
<br>
This lower case "a" must not be confused with the upper case "A", which indicates the variant of the field.
<br>
For instance, field 50a exists in 3 variants: 50A, 50F, 50K. The little suffix "a" is just a convention to refer to field 50; it is not the indication of the variant "A".
</li>
<li>
<b>About SWIFT Input and Output messages</b>
<br>
In SWIFT, the notions of output and input relate to the SWIFT network, not to the bank.
<br>
A SWIFT input message is an outbound message: a message emitted by the bank and sent to another bank over the SWIFT network.
<br>
A SWIFT output message is an inbound message: a message received by a bank from the SWIFT network.
<br>
Qualifying messages as Input or Output is thus relative to the SWIFT network, i.e. the inverse of the bank's perspective (Input messages are sent by the bank, Output messages are received by the bank)
</li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/68fa3f37-3bc8-4069-8860-2adbee333ff1">
<img class="centered" style="width: 650px; " alt="SWIFT Input / Output Messages" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/68fa3f37-3bc8-4069-8860-2adbee333ff1" />
</a>
</div>
<br>
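<p>
As an illustration, assuming the header blocks have already been parsed into a dictionary (key names are mine, following the Block2 / Inoutind convention used in the parsing tables below), resolving the sender and the receiver from the direction flag could look as follows:
</p>
<pre>
# Sketch: resolve sender and receiver BICs from the direction flag.
# 'inoutind' is the Block 2 Input/Output indicator ('I' or 'O');
# 'ltaddr_blk1' / 'ltaddr_blk2' are the addresses from Blocks 1 and 2.
def sender_and_receiver(headers):
    if headers["inoutind"] == "I":
        # Input message: emitted by the bank, Block 1 carries the sender
        return headers["ltaddr_blk1"], headers["ltaddr_blk2"]
    # Output message: received by the bank, Block 1 carries the receiver
    return headers["ltaddr_blk2"], headers["ltaddr_blk1"]

print(sender_and_receiver({"inoutind": "I",
                           "ltaddr_blk1": "SGOBFRPP",
                           "ltaddr_blk2": "RBOSGB2L"}))
# ('SGOBFRPP', 'RBOSGB2L')
</pre>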
<a name="sec2"></a>
<h2>2. Dissecting key SWIFT Messages involved in payments (Funds Transfers)</h2>
<p>
This chapter presents key SWIFT messages as well as some examples aimed at understanding the meaning and usage of most important SWIFT fields.
</p>
<a name="sec21"></a>
<h3>2.1 SWIFT MT101 Detailed Analysis</h3>
<p>
The SWIFT MT101 Request for Transfer is a payment initiation message used by customers (mostly corporates) to send domestic and/or international payment instructions to their banks.
</p>
<p>
The situation of SWIFT MT101 in a banking institution is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/20a4957e-1153-4690-b17f-0519e717b347">
<img class="centered" style="width: 750px; " alt="SWIFT MT101 situation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/20a4957e-1153-4690-b17f-0519e717b347" />
</a>
</div>
<br>
<p>
Most of the time, when the bank receives an MT101, it will either perform the operation itself - if both the debited and credited accounts are held with it - or issue an MT103 to proceed with crediting the destination account at another banking institution.
</p>
<p>
But there are also situations where the account to be debited is not held by the bank itself, in which case the MT101 can be routed further.
</p>
<a name="sec211"></a>
<h4>2.1.1 MT101 Introductory examples</h4>
<p>
This section presents various examples of SWIFT MT101 corresponding to different situations
</p>
<a name="sec2111"></a>
<h5>2.1.1.1 MT101 Example 1: Simplest case</h5>
<p>
A corporate customer - Robert Corporation - of bank XYZ located in Switzerland wants to send 50k CHF to Vinino SARL in Geneva, another customer of bank XYZ. Robert Corporation being a big corporation, it is connected to the SWIFT network and operates its account with the bank using SWIFT MT101 messages.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/72868344-56dd-4b66-8b41-809192084027">
<img class="centered" style="width: 620px; " alt="MT101 Simplest example" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/72868344-56dd-4b66-8b41-809192084027" />
</a>
</div>
<br>
<p>
In this situation, no further SWIFT message whatsoever is emitted, since both parties are customers of bank XYZ.
</p>
<p>
Details of the MT101:
</p>
<ul>
<li>
There is one single transaction in this MT101.
</li>
<li>
Both accounts are with bank XYZ, so no <i>Account with Institution</i> or <i>Account Servicing Institution</i> needs to be indicated.
</li>
</ul>
<a name="sec2112"></a>
<h5>2.1.1.2 MT101 Example 2: beneficiary with another institution</h5>
<p>
A corporate customer - Robert Corporation - of bank XYZ located in Switzerland wants to send 500k EUR to Dupont SARL in Paris.
<br>
The corporation Dupont SARL is not a customer of bank XYZ; it is a customer of bank BNP Paribas in Paris.
<br>
Bank XYZ and BNP Paribas have a direct banking relationship.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5c5cf6a-fbf1-4141-ad82-33cc16b08d64">
<img class="centered" style="width: 780px; " alt="MT101 Example 2" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5c5cf6a-fbf1-4141-ad82-33cc16b08d64" />
</a>
</div>
<br>
<p>
Since Bank XYZ and BNP Paribas have a direct banking relationship, Bank XYZ can send an MT103 to BNP Paribas to have it credit the customer account from Bank XYZ's VOSTRO account with BNP Paribas.
</p>
<p>
Details of the MT101:
</p>
<ul>
<li>
There is one single transaction in this MT101.
</li>
<li>
The beneficiary account to be credited is not held by bank XYZ, so the eventual banking institution the funds need to be sent to is indicated in <i>Account With Institution</i> - 57A.
</li>
</ul>
<a name="sec2113"></a>
<h5>2.1.1.3 MT101 Example 3: Multiple payments in MT101</h5>
<p>
In this third example, corporation Robert Corp, customer of bank XYZ located in Switzerland, wants to send a first payment of 100k EUR to Dupont Sarl in Paris and a second payment of 200k EUR to Jacob SA in France.
<br>
Dupont Sarl is a customer of bank BNP Paribas in Paris and Jacob SA is a customer of bank Credit Agricole in France.
</p>
<p>
Bank XYZ in Switzerland and BNP Paribas Paris have a direct banking relationship, and BNP Paribas and Credit Agricole have a direct banking relationship as well.
</p>
<p>
Robert Corp wants to use different accounts for the two transactions and submits both transactions in the very same MT101.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/bb13b9d0-9d25-4ba2-822f-4c397de74392">
<img class="centered" style="width: 850px; " alt="MT101 Example 3" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/bb13b9d0-9d25-4ba2-822f-4c397de74392" />
</a>
</div>
<br>
<p>
The MT101 makes it possible to include as many transactions as needed in a single message; at least one occurrence of the repetitive sequence is mandatory.
</p>
<p>
Since Credit Agricole and Bank XYZ have no direct banking relationship, the second transaction, crediting Jacob SA has to go through BNP Paribas as well.
</p>
<p>
Details of the MT101:
</p>
<ul>
<li>
This MT101 contains two distinct transactions
</li>
<li>
In both transactions, <i>Account with Institution</i> - field 57A - identifies the eventual destination of the funds
</li>
<li>
The <i>ordering customer</i> - field 50H - is the same from a customer perspective, but the accounts used for the two fund transfers are different. For this reason the ordering customer is indicated in the repetitive sequence, not in sequence A.
</li>
</ul>
<a name="sec2114"></a>
<h5>2.1.1.4 MT101 Example 4: Payment from a subsidiary account</h5>
<p>
A parent company can also use the SWIFT MT101 to initiate payments on behalf of its subsidiaries.
</p>
<p>
In this example, the parent company Robert Corp located in Switzerland has received an invoice from Jacob SA for various services that Jacob SA provided to the local company Robert Corp France SARL, a subsidiary of Robert Corp located in France.
</p>
<p>
The parent company Robert Corp decides to use the account of the subsidiary company in France to pay the invoice (there can be various reasons for that).
<br>
The subsidiary has granted the parent company permission to operate its account with BNP Paris.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/bd536ad7-a1a4-4b87-a8c8-e0e6fcd77943">
<img class="centered" style="width: 850px; " alt="MT101 Example 4" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/bd536ad7-a1a4-4b87-a8c8-e0e6fcd77943" />
</a>
</div>
<br>
<p>
The trick here is in the field 50a Instructing Party. In the SWIFT standard we read the following under the usage rules of that field: "This field must only be used when the instructing customer is not also the account owner."
<br>
The field 50a Instructing Party specifies the subsidiary company - FR12983459182931 / Robert Corp France Sarl - on behalf of which the parent company, RBCPCHAA02A, sends the payment instruction.
</p>
<p>
Details of the MT101:
</p>
<ul>
<li>
<i>Account Servicing Institution</i> - field 52A - indicates that the MT101 needs to be transferred further to the institution holding the account of the subsidiary from which the funds need to be debited: BNP Paris / BNPSFRZA93B.
</li>
<li>
The <i>Account with institution</i> - field 57A - identifies the eventual recipient of the funds, the institution holding the account to be credited: bank Credit Agricole / CACIFRXA12C.
</li>
<li>
<i>Instructing Party</i> - field 50L - identifies that the parent corporation is actually the one at the origin of this MT101, acting on behalf of the <i>ordering customer</i> - field 50H - the subsidiary in France.
</li>
</ul>
<a name="sec2115"></a>
<h5>2.1.1.5 MT101 Example 5: Fund repatriation</h5>
<p>
There is another possibility: funds repatriation.
<br>
Funds repatriation simply means moving the funds available on one account to another account held either with the same financial institution or with another financial institution. Funds repatriation is typically performed for cash pooling inside a company or a group of companies. Corporations resort to cash pooling to optimize liquidity usage.
</p>
<p>
In this example, Robert Corp (a corporate customer) has an account with bank XYZ in Switzerland and an account with bank BNP Paribas Paris which it uses for its operations in EUR.
<br>
Robert Corp wants to repatriate all the funds it has on its BNP Paris account back to its bank XYZ account.
<br>
The bank XYZ account is called the centralized account, master account or leader account. The BNP Paris account is called the secondary account or slave account.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/cad13f64-be4d-4d5f-9484-11afb7c27ace">
<img class="centered" style="width: 630px; " alt="MT101 Example 5" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/cad13f64-be4d-4d5f-9484-11afb7c27ace" />
</a>
</div>
<br>
<p>
In this case the MT101 is transmitted further to the bank holding the debited account and the funds are repatriated with an MT103.
</p>
<p>
Details of the MT101:
</p>
<ul>
<li>
<i>Account Servicing Institution</i> - field 52A - indicates that the MT101 needs to be transferred further to the institution holding the account to be debited: BNP Paris / BNPSFRZA93B.
</li>
<li>
The <i>Account with institution</i> - field 57A - identifies the eventual recipient of the funds, the institution holding the account to be credited: Bank XYZ / BXYZCHZZ80A.
</li>
<li>
Thanks to the <i>instruction code</i> - field 23 - CMZB - Code to Zero Balance, it is not necessary to specify any amount: the account will be balanced down to zero and all the available funds repatriated.
</li>
</ul>
<a name="sec212"></a>
<h4>2.1.2 MT101 Parsing and Data Mapping </h4>
<p>
The MT101 parsing details are presented in the table below. Only the most essential fields are discussed.
</p>
<table class="allborder">
<tr style="background: #CCCCCC; font-weight: bold;">
<td rowspan="2">
Meaning
</td>
<td colspan="2">
SWIFT
</td>
<td rowspan="2">
Example
</td>
<td rowspan="2" colspan="2">
Comment
</td>
</tr>
<tr style="background: #CCCCCC; font-weight: bold;">
<td>Field</td>
<td>Variant</td>
</tr>
<tr>
<td>
AppID
</td>
<td colspan="2">
Block1/ApplId
</td>
<td>
F
</td>
<td colspan="2">
The Application Identifier identifies the application within which the message is being sent or received. The available options are: F = FIN , A = GPA, etc.
<br>
These values are automatically assigned by the SWIFT system and the user's CBT
</td>
</tr>
<tr>
<td>
ServiceID
</td>
<td colspan="2">
Block1/Servid
</td>
<td>
01
</td>
<td colspan="2">
The Service Identifier consists of two numeric characters. It identifies the type of data that is being sent or received and, in doing so, the type of the following message
</td>
</tr>
<tr>
<td>
Sender
(Sending bank / BIC)
</td>
<td colspan="2">
Block1/
LTaddrBlk1 (I)
<br>
Or
<br>
Block2/
LTaddrBlk2 (O)
</td>
<td>
SGOBFRPP
</td>
<td colspan="2">
Sender BIC appears in header block (Block 1) in the MT101 Input and in the application block (Block 2) in the MT101 Output
<br>
(Input and output related to the SWIFT network, not the bank). Need to use field Block2 / Inoutind to find out
</td>
</tr>
<tr>
<td>
Message Type
</td>
<td colspan="2">
Block2/
Msgtype
</td>
<td>
101
</td>
<td colspan="2">
SWIFT Message Type = MT 101
</td>
</tr>
<tr>
<td>
Receiver
(Receiving Bank / BIC)
</td>
<td colspan="2">
Block2/
LTaddrBlk2 (I)
<br>
Or
<br>
Block1/
LTaddrBlk1 (O)
</td>
<td>
RBOSGB2L
</td>
<td colspan="2">
The Receiver BIC appears in header block (Block1) in the MT101 Output and in the application block (Block 2) in the MT101 Input.
<br>
(Input and output related to the SWIFT network, not the bank). Need to use field Block2 / Inoutind to find out.
<br>
The receiver of the message is the eventual beneficiary only if no field 57 says otherwise.
</td>
</tr>
<tr>
<td>
Sender's Reference
</td>
<td colspan="2" class="center">
20
</td>
<td>
ORDERREF1234
</td>
<td colspan="2">
This field is mandatory and of format 16x. It is a reference assigned by the Sender to unambiguously identify the message.
</td>
</tr>
<tr>
<td>
Customer Specified Reference
</td>
<td class="center">
21
</td>
<td class="center">
R
</td>
<td>
GBSUPPLIERS5668
</td>
<td colspan="2">
The field is optional and of format 16x. It is a reference to the entire message assigned by either the instructing party, when present or by the ordering customer, when the instructing party is not present.
</td>
</tr>
<tr>
<td>
Message Index/Total
</td>
<td class="center">
28
</td>
<td class="center">
D
</td>
<td>
1/1
</td>
<td colspan="2">
It is mandatory and of format 5n/5n for (Message Index)/(Total). If you send 5 messages for the same order, this field will take the value 1/5 in the first message, 2/5 in the second message, and so on.
<br>
1/1 means there is only one message sent for this order.
</td>
</tr>
<tr>
<td>
Sender Msg. Sending Timestamp
</td>
<td colspan="2">
(O) Block2 / Intime + Indate
<br>
(I) System.time()
</td>
<td>
1538070522
</td>
<td colspan="2">
(O) = Output only : SWIFT timestamp for an Output message (HHMMYYMMDD) or local date/time for an Input Message.
</td>
</tr>
<tr>
<td>
In / Out-put flag
</td>
<td colspan="2">
Block2 / Inoutind
</td>
<td>
I
</td>
<td colspan="2">
Single letter ‘I' or ‘O'
</td>
</tr>
<tr>
<td rowspan="3">
Ordering Customer
</td>
<td rowspan="3" class="center">
50
</td>
<td class="center">
F
</td>
<td>
/DE20700800000...<br>
1/Essilor International<br>
2/147 Rue de Paris<br>
3/FR/Charenton 94220
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
1!n/33x (Number)(Details)
</td>
<td rowspan="3" class="lx">
The field ordering customer is mandatory. It can be given either in the message details (here) or per transaction in the repeating sequence
</td>
</tr>
<tr>
<td class="center">
G
</td>
<td>
/31926819<br>
UBSCH123FNX
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (account)<br>
Line 2 (subfield bank)<br>
BIC (8 or 11 characters)
</td>
</tr>
<tr>
<td class="center">
H
</td>
<td>
/31926819<br>
Compagnie de Saint Gobain<br>
118 Rue Lauriston<br>
75016 Paris
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td>
Authorization
</td>
<td colspan="2" class="center">
25
</td>
<td>
12DF64BG345A
</td>
<td colspan="2">
Optional, 35x (authorization code)
</td>
</tr>
<tr>
<td>
Requested Execution Date
</td>
<td colspan="2" class="center">
30
</td>
<td>
181017
</td>
<td colspan="2">
The value date, mandatory and of format 6!n (YYMMDD). It is the date on which all subsequent transactions should be initiated by the executing bank.
</td>
</tr>
<tr>
<td colspan="6" style="background-color: #AACCFF; font-weight: bold;">
Repeating Sequence
</td>
</tr>
<tr>
<td>
Transaction Reference
</td>
<td colspan="2" class="center">
21
</td>
<td>
35863REFOFTRX1
</td>
<td colspan="2">
This field is mandatory and of format 16x. It is a reference assigned by the Sender to unambiguously identify a unique transaction.
</td>
</tr>
<tr>
<td>
F/X Deal Reference
</td>
<td class="center">
21
</td>
<td class="center">
F
</td>
<td>
FXDEALID78685
</td>
<td colspan="2">
Optional, if there is an underlying foreign exchange deal to this transaction, then this field specifies the FX deal reference
</td>
</tr>
<tr>
<td>
Instruction Code
</td>
<td class="center">
23
</td>
<td class="center">
E
</td>
<td>
CMZB (= Code to Zero Balance the account. This transaction contains a cash management instruction, requesting to zero balance the account of the ordering customer)
<br>
INTC (= Code for Intra-Company Payment)
</td>
<td colspan="2">
It is optional and of format 4!c[/30x] (Instruction Code)(Additional Info.)
<br>
These optional instruction codes identify the operation types.
<br>
<b>Caution: there can be several instruction codes (field 23E can be repeated) in the same SWIFT message.</b>
</td>
</tr>
<tr>
<td>
Charges Account
</td>
<td class="center">
25
</td>
<td class="center">
A
</td>
<td>
/FR763000402837...
</td>
<td colspan="2">
This is the ordering customer's account number to which the applicable transaction charges should be separately applied by the debtor's bank.
</td>
</tr>
<tr>
<td>
Currency,<br>
Transaction Amount
</td>
<td class="center">
32
</td>
<td class="center">
B
</td>
<td>
GBP50000
</td>
<td colspan="2">
Mandatory and of Format 3!a15d (Currency)(Amount)
</td>
</tr>
<tr>
<td>
Currency,
Original Ordered Amount
</td>
<td class="center">
33
</td>
<td class="center">
B
</td>
<td>
EUR200000
</td>
<td colspan="2">
This optional field is provided in format 3!a15d. It specifies the original currency and amount as specified by the ordering customer.
</td>
</tr>
<tr>
<td>
Exchange Rate
</td>
<td colspan="2" class="center">
36
</td>
<td>
1,1382
</td>
<td colspan="2">
Mandatory since field 33B is present and the amount in field 32B is not equal to zero. Provided in format 12d. The integer part of the rate must contain at least one digit. A decimal comma is mandatory and is included in the maximum length.
</td>
</tr>
<tr>
<td>
Instructing Party
</td>
<td class="center">
50
</td>
<td class="center">
L
</td>
<td>
Compagnie de Saint Gobain<br>
118 Rue Lauriston<br>
75016 Paris
</td>
<td colspan="2">
This field is optional and is to be provided when the sending customer (instructing party) is different from the owner of the account (but has an authorization from the owner to make payments on the account).
<br>
Compagnie de Saint Gobain instructs the payment but does not own the account to be debited. It is authorized by the owner (e.g. a subsidiary) to pay from the ordering customer account provided below.
</td>
</tr>
<tr>
<td rowspan="3">
Ordering Customer
</td>
<td rowspan="3" class="center">
50
</td>
<td class="center">
F
</td>
<td>
/DE207008000...<br>
1/Essilor International<br>
2/147 Rue de Paris<br>
3/FR/Charenton 94220
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
1!n/33x (Number)(Details)
</td>
<td rowspan="3" class="lx">
The field ordering customer is mandatory. It can be given either in the message details or per transaction in the repeating sequence (here)
</td>
</tr>
<tr>
<td class="center">
G
</td>
<td>
/31926819<br>
UBSCH123FNX
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (account)<br>
Line 2 (subfield bank)<br>
BIC (8 or 11 characters)
</td>
</tr>
<tr>
<td class="center">
H
</td>
<td>
/31926819<br>
Compagnie de Saint Gobain<br>
118 Rue Lauriston<br>
75016 Paris
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td>
Account Servicing Institution
</td>
<td class="center">
52
</td>
<td class="center">
A
</td>
<td>
//083098 (BSB)
<br>
NATAAU33 (bic)
</td>
<td colspan="2">
This notifies the receiving bank that the institution holding the account to debit is another institution, so the MT101 needs to be transferred to that other institution, identified here.
<br>
The BSB (Bank State Branch) code and the BIC are provided.
<br>
Even though the SWIFT standard does not mandate prefixing the BSB code with a double slash, many banks impose it.
<br>
We need to parse the BIC out of it.
<br>
Format :<br>
&nbsp;&nbsp;&nbsp;&nbsp;[/1!a][/34x] (Party Identifier)<br>
&nbsp;&nbsp;&nbsp;&nbsp;4!a2!a2!c[3!c] (Identifier Code)
</td>
</tr>
<tr>
<td>
Intermediary
Institution
</td>
<td class="center">
56
</td>
<td class="center">
A
</td>
<td>PNBPUS3N (bic)
</td>
<td colspan="2">
This is the Correspondent of the Creditor Bank. It holds the account in currency of the creditor bank.
<br>
It is optional and can be provided in option A, C or D.
<br>
<i>Formats C or D are rarely used and most of the time not supported by banks.</i>
<br>
Option A is formatted [/1!a][/34x] (Party Identifier)
<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td rowspan="2">
Account With Institution
</td>
<td rowspan="2" class="center">
57
</td>
<td class="center">
A
</td>
<td>
BARCGB22XXX (bic)<br>
or e.g.<br>
//939400 (BSB)<br>
AMPBAU2SXXX (bic)
</td>
<td>
We need to parse the BIC out of it<br>
Format :<br>
[/1!a][/34x] (Party Identifier )<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="2" class="lx">
Optional. The Account with Institution is provided because the beneficiary does not have an account with the debtor bank, but with another bank (i.e. an MT103 will be sent further).
<br>
It can be provided in option A, B, C or D.
<br>
Formats B, C are rarely used and most of the time not supported by banks.
<br>
For instance, in case the Creditor bank is not the same as the debtor bank (most of the time), it is identified in field 57a. The BSB (Bank State Branch) code and the BIC of AMP Bank are provided.
<br>
We need to parse the BIC out of it or hash the address
<br>
(priority over header)
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
Hong Kong Banking Assoc.<br>
Avenue du Léman<br>
1204 Genève - CH<br>
Switzerland
</td>
<td>
If no BIC is available to identify the target institution, option D is used.
<br>
In principle minimum 3 lines with name and address should be provided
<br>
Format:<br>
[/1!a][/34x] (Party Identifier)<br>
4*35x (Name and Address)<br>
In this case, take country from receiver BIC
</td>
</tr>
<tr>
<td rowspan="3">
Beneficiary
</td>
<td rowspan="3" class="center">
59
</td>
<td class="center">
(no letter)
</td>
<td>
/26351-38947<br>
Company One<br>
CITY STREET 50<br>
LONDON, UK
</td>
<td>
Line 1 : <br>
[/34x] (Account) IBAN format or else<br>
Line 2-5:<br>
4*35x (Name and Address)
</td>
<td rowspan="3" class="lx">
Beneficiary customer information is Mandatory
</td>
</tr>
<tr>
<td class="center">
A
</td>
<td>
NBPUS3N<br>
or e.g.<br>
/12345678901<br>
PNBPUS3N
</td>
<td>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td class="center">
F
</td>
<td>
/10078074<br>
1/Company One<br>
2/CITY STREET 50<br>
3/GB/LONDON
</td>
<td>
Line 1 (subfield Party Identifier)<br>
[/34x] (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
4*(1!n/33x) (name and address)
</td>
</tr>
<tr>
<td>
Remittance Information
</td>
<td colspan="2" class="center">
70
</td>
<td>
Payment from<br>
Compagnie de Saint Gobain<br>
/INV/7828728292
</td>
<td colspan="2">
Remittance information is optional and provided in format 4*35x if available: up to 4 lines of up to 35 characters each. The name of the parent company is provided here, so that the beneficiary can see it is the parent company that is paying.
</td>
</tr>
<tr>
<td>
Details of Charges
</td>
<td class="center">
71
</td>
<td class="center">
A
</td>
<td>
OUR
</td>
<td colspan="2">
It is mandatory and of format 3!a. It can take 3 values: BEN, OUR and SHA.
<br>
<ul>
<li>
OUR means charges are to be borne by the ordering customer.
</li>
<li>
SHA means charges are shared between Ordering and beneficiary customers.
</li>
<li>
BEN means charges are to be borne by the beneficiary
</li>
</ul>
</td>
</tr>
</table>
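<p>
To illustrate the format notations used in this table (3!a15d for currency and amount, 6!n for dates), hereunder is a hedged Python sketch of how such values could be parsed (the regex and function names are mine, not part of any SWIFT library):
</p>
<pre>
import re
from datetime import datetime
from decimal import Decimal

# Sketch: parse a 3!a15d value such as field 32B ("GBP50000" or "EUR2325,5").
# The decimal separator is a comma and the decimal part may be empty.
def parse_currency_amount(value):
    m = re.match(r"^([A-Z]{3})(\d{1,15}(?:,\d*)?)$", value)
    if m is None:
        raise ValueError("not a valid 3!a15d value: " + value)
    amount = m.group(2).replace(",", ".").rstrip(".")
    return m.group(1), Decimal(amount)

# Sketch: parse a 6!n date such as field 30 (YYMMDD).
def parse_date_yymmdd(value):
    return datetime.strptime(value, "%y%m%d").date()

print(parse_currency_amount("GBP50000"))   # ('GBP', Decimal('50000'))
print(parse_currency_amount("EUR2325,5"))  # ('EUR', Decimal('2325.5'))
print(parse_date_yymmdd("181017"))         # 2018-10-17
</pre>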
<a name="sec213"></a>
<h4>2.1.3 Additional notes on MT101</h4>
<p>
Some complementary notes:
</p>
<ul>
<li>
MT101 messages have a repeating sequence (several transactions in the same message). In my current company, we consider every occurrence of the repeating sequence B as a distinct transaction. The common sequence (A) is duplicated for each of them.
</li>
<li>
We still need a way to uniquely identify a transaction. Unfortunately SWIFT doesn't have such a thing as a unique transaction identifier. We usually use "business_reference" = [Transaction Reference / 21 / transaction_business_reference] concatenated with a UUID as suffix (see the sketch after this list).
</li>
<li>
What to do in case of 57D if we cannot find a country code? One way is to use the country of the SWIFT message Receiver BIC - sometimes wrong but better than nothing, and only sometimes wrong since when 57D is used both institutions are likely in the same country.
</li>
</ul>
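<p>
Hereunder is a minimal sketch of the two points above - splitting the repeating sequence and building a unique business reference - assuming the sequences have already been parsed into dictionaries (the representation is mine):
</p>
<pre>
import uuid

# Sketch: turn one MT101 into one record per transaction, duplicating the
# common sequence A for each occurrence of the repeating sequence B, and
# attach a unique business reference built from field 21 plus a UUID suffix.
def split_mt101(sequence_a, sequence_b_list):
    records = []
    for seq_b in sequence_b_list:
        record = dict(sequence_a)   # duplicate the common sequence A
        record.update(seq_b)
        record["business_reference"] = "%s-%s" % (
            seq_b["21"],            # Transaction Reference (field 21)
            uuid.uuid4())           # UUID suffix for uniqueness
        records.append(record)
    return records

print(split_mt101({"20": "ORDERREF1234"},
                  [{"21": "35863REFOFTRX1", "32B": "GBP50000"}]))
</pre>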
<a name="sec22"></a>
<h3>2.2 SWIFT MT103 Detailed Analysis</h3>
<p>
The MT103 is a SWIFT message format used for making payments. MT103 SWIFT payments are known as international wire transfers, telegraphic transfers, standard EU payments (SEPA payments), LVTS in Canada, etc.
</p>
<p>
The situation of SWIFT MT103 in a banking institution is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b1e34a62-0eae-4551-9d8a-ebe8108d684f">
<img class="centered" style="width: 750px; " alt="SWIFT MT103 situation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b1e34a62-0eae-4551-9d8a-ebe8108d684f" />
</a>
</div>
<br>
<p>
There can be many different situations:
</p>
<ul>
<li>
An MT103 can be sent after a request from a customer of the bank to send funds cross-border to another institution. The customer request can come from any channel including the reception of an MT101.
</li>
<li>
The MT103 can be a routed message in case the bank is simply a routing bank in the chain or an explicit intermediary institution.
</li>
<li>
Finally, an MT103 can be a simple announcement - in the cover method - or an actual funds transfer - in the serial method.
</li>
</ul>
<a name="sec221"></a>
<h4>2.2.1 MT103 Introductory examples</h4>
<p>
This section presents various examples of SWIFT MT103 corresponding to different situations
</p>
<a name="sec2211"></a>
<h5>2.2.1.1 MT103 Example 1: Simplest case</h5>
<p>
In this example, the customer "John Robert" of bank XYZ in Switzerland wants to send 500k EUR to Dupont SARL, a corporation located in Paris which is a BNP Paribas customer.
</p>
<p>
Bank XYZ and BNP Paribas have a direct banking relationship.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/05f42c09-f59f-4093-87d4-d8bba7e1ed83">
<img class="centered" style="width: 740px; " alt="SWIFT MT103 Example 1" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/05f42c09-f59f-4093-87d4-d8bba7e1ed83" />
</a>
</div>
<br>
<p>
A single MT103 is sufficient to implement the fund transfer.
</p>
<a name="sec2212"></a>
<h5>2.2.1.2 MT103 Example 2: more realistic example</h5>
<p>
This example is more realistic. The previous one is entirely valid from a pure SWIFT specification perspective, but in practice a few more fields are almost always present in a SWIFT message.
</p>
<p>
This example is more or less the same as the previous one: the customer "John Robert" of bank XYZ in Switzerland wants to send 30k EUR to Dupont SARL, a corporation located in Paris which is a BNP Paribas customer.
<br>
Bank XYZ and BNP Paribas have a direct banking relationship.
</p>
<p>
The first thing is that Bank XYZ obviously has several VOSTRO accounts with BNP Paribas, and it must choose which of these accounts to use for the reimbursement. Here it wants to use account 12345678901.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/405bf70f-f869-448f-a667-892b602a2367">
<img class="centered" style="width: 740px; " alt="SWIFT MT103 Example 2" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/405bf70f-f869-448f-a667-892b602a2367" />
</a>
</div>
<br>
<p>
The field 53 normally indicates the Sender's correspondent (i.e. a banking institution), but it is common practice to use the variant 53B to indicate the account at the receiving institution to be used for the reimbursement.
</p>
<p>
In addition, in this more realistic example, the charges details are indicated, as well as remittance information.
</p>
<a name="sec2213"></a>
<h5>2.2.1.3 MT103 Example 3: forwarded serial message</h5>
<p>
In case the bank we are operating within is neither the <i>ordering institution</i> (initial sender) nor the eventual beneficiary institution, it is just a routing bank. In this case, specific fields are used to identify the initial sending institution as well as the eventual beneficiary institution. These fields are illustrated in this example.
</p>
<p>
In this example, customer Max Prank of bank BCVs in Switzerland wants to send 30k EUR to Alfred SARL in Samoens (France), a customer of "Banque Populaire" in France.
<br>
BCVs in Switzerland and Banque Populaire in France have no direct relationship, so the fund transfer has to go through several routing banks, one of them being bank XYZ.
</p>
<p>
We are here interested in the MT103 sent by bank XYZ, a routing bank, to the next bank in the routing chain: BNP Paribas.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/98b2922b-79aa-4587-b775-7ee972bcfc1f">
<img class="centered" style="width: 850px; " alt="SWIFT MT103 Example 3" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/98b2922b-79aa-4587-b775-7ee972bcfc1f" />
</a>
</div>
<br>
<p>
Since bank XYZ is not the initial sending institution, and the receiver of the SWIFT message, BNP Paribas, is not the eventual beneficiary institution, these two other institutions (initial sender and eventual beneficiary) need to be indicated in the SWIFT message.
</p>
<p>
Details of the SWIFT MT103:
</p>
<ul>
<li>
The field <i>Ordering institution</i> - field 52A - clearly identifies the initial sending institution, bank BCVs
</li>
<li>
The field <i>Account with institution</i> - field 57A - clearly identifies the eventual beneficiary institution of the funds
</li>
</ul>
<p>
It is relevant to look at the two other SWIFT messages from the chain to be able to compare them:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/faf8fcf2-f238-44d8-b578-0099c0099910">
<img class="centered" style="width: 850px; " alt="SWIFT MT103 Example 3 - B" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/faf8fcf2-f238-44d8-b578-0099c0099910" />
</a>
</div>
<br>
<p>
Another noteworthy field is <i>Intermediary institution</i> - field 56A - which indicates that the message has to go through BNP Paribas after bank XYZ.
</p>
<a name="sec2214"></a>
<h5>2.2.1.4 MT103 Example 4: Announce message (cover method)</h5>
<p>
We'll now go through a case where the MT103 is just an announcement, as part of the cover payment method. This is the case when the initial sending institution and the eventual beneficiary institution have no relationship together and decide to go through their correspondents.
</p>
<p>
When the MT103 is just an announcement, it can be sent directly from the initial sending institution to the eventual beneficiary institution, regardless of the fact that they have no banking relationship with each other.
</p>
<p>
In this example, the customer John Trump of bank XYZ in Switzerland wants to send 1'000'000 USD to Cowboy Corp. in Kansas City, a customer of "Kansas Credit".
<br>
Due to the nature of the transaction and the absence of a direct banking relationship between the two banks, the transfer goes through correspondent banks:
</p>
<ul>
<li>
An MT103 is sent directly to the beneficiary bank, regardless of the fact that they have no banking relationship together
</li>
<li>
An MT202 COV will be routed through correspondent(s) and routing banks
</li>
</ul>
<p>
We are here interested in the MT103 sent by Bank XYZ: the announcement, which can be sent directly to the beneficiary bank.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c99c7a41-eb49-4637-ba93-6e83cb15b354">
<img class="centered" style="width: 850px; " alt="SWIFT MT103 Example 4" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c99c7a41-eb49-4637-ba93-6e83cb15b354" />
</a>
</div>
<br>
<p>
The MT103 announcement clearly indicates that the fund transfer will go through correspondents, with the usage of the fields <i>Sender's correspondent</i> - field 53A - and <i>Receiver's Correspondent</i> - field 54A.
<br>
Aside from this, there is no specific difference between this announcement and a serial MT103 fund transfer.
</p>
<a name="sec222"></a>
<h4>2.2.2 MT103 Parsing and Data Mapping </h4>
<p>
The MT103 parsing details are presented in the table below. Only the most essential fields are discussed.
</p>
<table class="allborder">
<tr style="background: #CCCCCC; font-weight: bold;">
<td rowspan="2">
Meaning
</td>
<td colspan="2">
SWIFT
</td>
<td rowspan="2">
Example
</td>
<td rowspan="2" colspan="2">
Comment
</td>
</tr>
<tr style="background: #CCCCCC; font-weight: bold;">
<td>Field</td>
<td>Variant</td>
</tr>
<tr>
<td>
AppID
</td>
<td colspan="2">
Block1/ApplId
</td>
<td>
F
</td>
<td colspan="2">
The Application Identifier identifies the application within which the message is being sent or received. The available options are: F = FIN , A = GPA, etc.
<br>
These values are automatically assigned by the SWIFT system and the user's CBT.
</td>
</tr>
<tr>
<td>
ServiceID
</td>
<td colspan="2">
Block1/Servid
</td>
<td>
01
</td>
<td colspan="2">
The Service Identifier consists of two numeric characters. It identifies the type of data that is being sent or received and, in doing so, the type of the following message
</td>
</tr>
<tr>
<td>
Sender
(Sending bank / BIC)
</td>
<td colspan="2">
Block1/
LTaddrBlk1 (I)
<br>
Or
<br>
Block2/
LTaddrBlk2 (O)
</td>
<td>
SGOBFRPP
</td>
<td colspan="2">
Sender BIC appears in header block (Block 1) in the MT103 Input and in the application block (Block 2) in the MT103 Output
<br>
(Input and output related to the SWIFT network, not the bank). Need to use field Block2 / Inoutind to find out
</td>
</tr>
<tr>
<td>
Message Type
</td>
<td colspan="2">
Block2/
Msgtype
</td>
<td>
103
</td>
<td colspan="2">
SWIFT Message Type = MT 103
</td>
</tr>
<tr>
<td>
Cover or Serial Transfer Type
</td>
<td colspan="2">
</td>
<td>
If field 56a or 57a is present, then transfer type = "serial" <br>
If field 53a (except 53B) or 54a is present, then transfer type = "announce (cover)"
<br>
(Default : serial)
</td>
<td colspan="2">
Knowing whether the MT103 is a serial transfer or the announcement preceding a cover transfer is important (see the sketch after this table). For a given use case, we actually expect a banking institution to always use the very same method.
</td>
</tr>
<tr>
<td>
Receiver
(Receiving Bank / BIC)
</td>
<td colspan="2">
Block2/
LTaddrBlk2(I)
<br>
Or
<br>
Block1/
LTaddrBlk1 (O)
</td>
<td>
RBOSGB2L
</td>
<td colspan="2">
The Receiver BIC appears in header block (Block1) in the MT103 Output and in the application block (Block 2) in the MT103 Input.<br>
(Input and output related to the SWIFT network, not the bank). Need to use field Block2 / Inoutind to find out.<br>
The receiver of the message is the eventual beneficiary only if no field 57 says otherwise.
</td>
</tr>
<tr>
<td>
Unique End-to-end Transaction Reference
</td>
<td colspan="2">
Block3/ Tag 121
</td>
<td>
b03c6901-bbed-4aa9-afdh-A5bc26d19257
</td>
<td colspan="2">
This reference is provided in the user block (Block 3) and transported end-to-end. It is mandatory in the MT103 but can still be missing, or have duplicates.
</td>
</tr>
<tr>
<td>
Sender's Reference
</td>
<td colspan="2" class="center">
20
</td>
<td>
ORDERREF1234
</td>
<td colspan="2">
This field is mandatory and of format 16x. It is a reference assigned by the Sender to unambiguously identify the message.
</td>
</tr>
<tr>
<td>
Sender Msg. Sending Timestamp
</td>
<td colspan="2">
(O) Block2 / Intime + Indate<br>
(I) Sys.time()
</td>
<td>
1538070522
</td>
<td colspan="2">
(O) = Output only : SWIFT timestamp for an Output message (HHMMYYMMDD)
<br>
or local date/time for an Input Message.
</td>
</tr>
<tr>
<td>
In / Out-put flag
</td>
<td colspan="2">
Block2 / Inoutind
</td>
<td>
I
</td>
<td colspan="2">
Single letter ‘I' or ‘O'
</td>
</tr>
<tr>
<td>
Bank operation code
</td>
<td class="center">
23
</td>
<td class="center">
B
</td>
<td>
CRED
</td>
<td colspan="2">
It is mandatory and of format 4!c.
</td>
</tr>
<tr>
<td>
Instruction code
</td>
<td class="center">
23
</td>
<td class="center">
E
</td>
<td>
PHOB/+34.91.397.6789
</td>
<td colspan="2">
It is optional and of format 4!c[/30x] (Instruction Code)(Additional Info.)
<br>
Code PHOB means Phone Beneficiary: the sender requests the beneficiary's bank to contact the beneficiary by phone when the funds are received.
</td>
</tr>
<tr>
<td>
Value Date / currency / interbank settled amount
</td>
<td class="center">
32
</td>
<td class="center">
A
</td>
<td>
180816USD2325,
</td>
<td colspan="2">
It is mandatory and of format 6!n3!a15d (Date)(Currency)(Amount).<br>
Note the trailing comma (i.e. the decimal part is not mandatory if 0)
</td>
</tr>
<tr>
<td>
Currency / Instructed Amount
</td>
<td class="center">
33
</td>
<td class="center">
B
</td>
<td>
USD2350,
</td>
<td colspan="2">
Normally optional in the standard. It may be provided, for instance, because the sender has deducted fees. See field 71F below.<br>
Format 3!a15d (Currency)(Amount)
</td>
</tr>
<tr>
<td rowspan="3">
Ordering Customer
</td>
<td rowspan="3" class="center">
50
</td>
<td class="center">
A
</td>
<td>
/DE3750070010...<br>
DEUTDEFF
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (account)<br>
Line 2 (subfield bank)<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="3" class="lx">
The field ordering customer is mandatory.<br>
The ordering customer is a customer of the sender only if there is no field 52. The ordering customer remains constant in the message chain.
</td>
</tr>
<tr>
<td class="center">
F
</td>
<td>
/DE207008000...<br>
1/Essilor International<br>
2/147 Rue de Paris<br>
3/FR/Charenton-le-Pont, 94220
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
1!n/33x (Number)(Details)
</td>
</tr>
<tr>
<td class="center">
K
</td>
<td>
/CH570483509...<br>
GALLMAN COMPANY GMBH<br>
RAEMISTRASSE, 71<br>
8006 ZURICH<br>
SWITZERLAND
</td>
<td>
Line 1 : (subfield party identified)<br>
/34x (Account)<br>
Line 2-5 (subfield Address)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td rowspan="2">
Ordering institution
</td>
<td rowspan="2" class="center">
52
</td>
<td class="center">
A
</td>
<td>
BNPAFRPP<br>
or e.g.<br>
/FR123509321...<br>
BNPAFRPP
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="2" class="lx">
Ordering institution is optional and can be provided in two options A (usual) and D (less common).<br>
The sender populates this field to indicate that the initial instruction comes from another institution (ordering institution).<br>
The ordering institution remains constant in the message chain.<br>
(priority over header)
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
BANQUE DELUBAC ET CIE<br>
16 PL SALEON<br>
TERRAS<br>
07160 LE CHEYLARD
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td rowspan="2">
Sender's correspondent
</td>
<td rowspan="2" class="center">
53
</td>
<td class="center">
A
</td>
<td>
PNBPUS3N<br>
or e.g.<br>
/12345678901<br>
PNBPUS3N
</td>
<td colspan="2">
<b>Cover payments only</b><br>
Correspondent of sender. Sender has an account in Currency with this banking institution.<br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td class="center">
B
</td>
<td>
/12345678901
</td>
<td colspan="2">
<b>Caution : Serial and cover payments.</b><br>
Field 53B indicates the account number of the Sender, serviced by the Receiver, which is to be used for reimbursement (debit) in the transfer. This is the account of the sender held by the receiver (VOSTRO).<br>
Option B Format is:<br>
[/1!a][/34x] (Party Identifier)<br>
[35x] (Location)<br>
The field is optional but in practice the account number is almost always provided.
</td>
</tr>
<tr>
<td>
Receiver's correspondent
</td>
<td class="center">
54
</td>
<td class="center">
A
</td>
<td>
IRVTUS3N<br>
or e.g.<br>
/9876412-1234/123<br>
IRVTUS3N
</td>
<td colspan="2">
<b>Cover payments only.</b><br>
Correspondent of receiver. Receiver has an account in currency with this banking institution. <br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td>
Intermediary Institution
</td>
<td class="center">
56
</td>
<td class="center">
A
</td>
<td>
IRVTUS3N (bic)<br>
or e.g.<br>
/939400 (BSB or account)<br>
AMPBAU2SXXX (bic)
</td>
<td colspan="2">
<b>Serial payments only.</b><br>
This is the Correspondent of the Creditor Bank. It holds the account in currency of the creditor bank. It is used instead of field 54a (Receiver's Correspondent) in the case of a serial payment transfer.<br>
It is optional and can be provided in option A, C or D. <br>
<i>Formats C or D are rarely used and most of the time not supported by banks.</i><br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td rowspan="2">
Account With Institution
</td>
<td rowspan="2" class="center">
57
</td>
<td class="center">
A
</td>
<td>
BARCGB22XXX (bic)<br>
or e.g.<br>
//939400 (BSB)<br>
AMPBAU2SXXX (bic)
</td>
<td>
We need to parse the BIC out of it <br>
Format :<br>
[/1!a][/34x] (Party Identifier )<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="2" class="lx">
<b>Serial payments only.</b><br>
Account with institution is optional and can be provided in option A, B, C or D. <br>
Formats B, C are rarely used and most of the time not supported by banks.<br>
Field 57 is used when the receiver of the SWIFT MT103 doesn't hold the beneficiary account and needs to send the message further. In the final MT103 of the chain, the holder of the account will be the receiver and field 57 will not be required anymore. <br>
We need to parse the BIC out of it or hash the address <br>
(priority over header)
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
Hong Kong Banking Assoc.<br>
Avenue du Léman<br>
1204 Genève - CH<br>
Switzerland
</td>
<td>
If no BIC is available to identify the target institution, option D is used.<br>
In principle minimum 3 lines with name and address should be provided<br>
Format :<br>
[/1!a][/34x] (Party Identifier )<br>
4*35x (Name and Address)<br>
In this case, take country from receiver BIC.
</td>
</tr>
<tr>
<td rowspan="3">
Beneficiary
</td>
<td rowspan="3" class="center">
59
</td>
<td class="center">
(no letter)
</td>
<td>
/26351-38947<br>
Company One<br>
CITY STREET 50<br>
LONDON, UK
</td>
<td>
Line 1 : <br>
[/34x] (Account) (IBAN format or else)<br>
Line 2-5:<br>
4*35x (Name and Address)
</td>
<td rowspan="3" class="lx">
Beneficiary customer information is Mandatory. <br>
The beneficiary remains constant in the message chain.
</td>
</tr>
<tr>
<td class="center">
A
</td>
<td>
PNBPUS3N<br>
or e.g.<br>
/12345678901<br>
PNBPUS3N
</td>
<td>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td class="center">
F
</td>
<td>
/10078074<br>
1/Company One<br>
2/CITY STREET 50<br>
3/GB/LONDON
</td>
<td>
Line 1 (subfield Party Identifier)<br>
[/34x] (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
4*(1!n/33x) (name and address)
</td>
</tr>
<tr>
<td>
Remittance Information
</td>
<td colspan="2" class="center">
70
</td>
<td>
/INV/18042-090715
</td>
<td colspan="2">
Remittance information is optional and provided in format 4*35x if available: up to 4 lines of up to 35 characters each.<br>
Usually the remittance information is generated by the beneficiary and sent to the ordering customer (or debtor). The beneficiary requests the debtor to provide it in the payment message, so that the beneficiary can easily reconcile the payment with an invoice, for instance.
</td>
</tr>
<tr>
<td>
Details of Charges
</td>
<td class="center">
71
</td>
<td class="center">
A
</td>
<td>
OUR
</td>
<td colspan="2">
It is mandatory and of format 3!a. It can take 3 values: BEN, OUR and SHA.
<br>
<ul>
<li>
OUR means charges are to be borne by the ordering customer.
</li>
<li>
SHA means charges are shared between Ordering and beneficiary customers.
</li>
<li>
BEN means charges are to be borne by the beneficiary
</li>
</ul>
</td>
</tr>
<tr>
<td>
Sender's charges
</td>
<td class="center">
71
</td>
<td class="center">
F
</td>
<td>
EUR2,50
</td>
<td colspan="2">
Optional. When 71A is BEN (or SHA), 71F contains the amount of the charges due, which have been deducted from the interbank settlement amount.<br>
Interbank settled amount = Instructed amount - Sender's charges. For instance, with an instructed amount of USD 2350 (33B) and sender's charges totalling USD 25, the interbank settled amount is USD 2325 (32A).<br>
Format 3!a15d <br>
<b>Caution: there can be several different 71F in a same MT103.</b>
</td>
</tr>
<tr>
<td>
Receiver's charges
</td>
<td class="center">
71
</td>
<td class="center">
G
</td>
<td>
EUR2,50
</td>
<td colspan="2">
Optional. When 71A is OUR (or SHA), 71G contains the amount of the charges due, which have been prepaid and included in the interbank settlement amount.<br>
Format 3!a15d<br>
<b>Caution : there can be several different 71G in a same MT103. </b>
</td>
</tr>
<tr>
<td>
Sender to Receiver Information
</td>
<td colspan="2" class="center">
72
</td>
<td>
/INS/BNPAFRPP
</td>
<td colspan="2">
This is an optional field. It takes the Format 6*35x.<br>
There can be many codes indicating additional information. <br>
INS is a code indicating that BNPAFRPP is the instructing institution. Without field 72, the receiver might not know it, since that information is not provided anywhere else in the message when the sender is the next bank in the routing chain and the ordering institution is another bank before the instructing one.
</td>
</tr>
</table>
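<p>
The serial-versus-cover decision rule from the table above translates into a short Python sketch; the dictionary representation of the parsed fields is an assumption of mine:
</p>
<pre>
# Sketch: classify a parsed MT103 as "serial" or "announce (cover)".
# 'fields' maps field tags such as '53B', '56A' or '57A' to raw values.
def transfer_type(fields):
    if any(tag.startswith(("56", "57")) for tag in fields):
        return "serial"
    # 53B is excluded: it is used in both serial and cover payments
    if any(tag.startswith("54") or (tag.startswith("53") and tag != "53B")
           for tag in fields):
        return "announce (cover)"
    return "serial"  # default

print(transfer_type({"53A": "PNBPUS3N", "54A": "IRVTUS3N"}))  # announce (cover)
print(transfer_type({"57A": "BARCGB22XXX"}))                  # serial
</pre>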
<a name="sec223"></a>
<h4>2.2.3 Additional notes on MT103</h4>
<p>
Some complementary notes:
</p>
<ul>
<li>
The optional field 53B (only variant B) is really only used to indicate which account at the correspondent (receiver) should be debited.
</li>
<li>
Field 51A seems not to be supported by most banks (at least all those I checked, such as UBS, etc.)
</li>
<li>
When no correspondent is used, neither on the sender side (Tag 53A) nor on the receiver side (Tag 54A), and no reimbursement party (Tags 56a and 57a) is indicated in the SWIFT MT103 message, it means:
<ul>
<li>there is a direct account relationship, in the currency of the transfer, between the Sender and the Receiver. Money will be taken from that account and credited to the beneficiary.</li>
<li>the beneficiary customer account (:59) is held by the receiver.</li>
</ul>
</li>
<li>
The fields 56a and 57a are used for serial transfers while the fields 53a (except 53B) and 54a are used in the MT103 announcement message (cover method).
</li>
<li>
What to do in case of 57D if we cannot find a country code? We use the country of the SWIFT message Receiver BIC - sometimes wrong but better than nothing, and only sometimes wrong since when 57D is used both institutions are likely in the same country (see the sketch after this list).
</li>
<li>
Routing
<ul>
<li>When there is no ordering institution (Tag 52) in the SWIFT MT103 message, that implicitly means the ordering customer is a customer of the Sender.</li>
<li>When the ordering institution (Tag 52D) is provided in the MT103 SWIFT message, this means the ordering customer is not a customer of the Sender.</li>
<ul>
<li>Either the sending institution sends the MT103 on behalf of the ordering institution in 52D. This happens when the ordering institution is a small bank that has an agreement with a major bank (the sending bank) for the processing and settlement of currency transactions. The small bank can use the correspondent network of the sending institution.</li>
<li>Or the Sender is a routing bank on the chain</li>
</ul>
</li>
</ul>
</li>
<li>
Field 57 is used when the receiver of the SWIFT MT103 doesn't hold the beneficiary account and needs to send the message further.
<br>
In the final MT103 of the chain, the holder of the account will be the receiver and field 57 will not be required anymore.
<ul>
<li>This indicates that the Sender and the Beneficiary customer's bank do not have a direct account relationship in the currency of the transaction. Otherwise the sender would send the message directly to the beneficiary customer's bank.</li>
</ul>
</li>
<li>
Sender's reference is new for every message in the routing chain, but end-to-end reference remains constant.
</li>
</ul>
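<p>
The 57D fallback mentioned in the notes above boils down to reading the ISO country code out of the Receiver BIC (characters 5 and 6 of any BIC), as in this small sketch:
</p>
<pre>
# Sketch: when field 57D yields no usable country code, fall back to the
# country of the SWIFT message Receiver BIC (characters 5-6 of any BIC).
def beneficiary_country(country_from_57d, receiver_bic):
    if country_from_57d:
        return country_from_57d
    return receiver_bic[4:6]

print(beneficiary_country(None, "RBOSGB2L"))  # 'GB'
</pre>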
<a name="sec23"></a>
<h3>2.3 SWIFT MT202 Detailed Analysis</h3>
<p>
SWIFT MT202 messages are used for interbank funds transfers. There is no customer involved when issuing an MT202: funds are not sent on behalf of a customer but on behalf of the initial sending banking institution itself.
</p>
<p>
The situation of SWIFT MT202 in a banking institution is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d5018e99-e484-4633-b558-a738539a9dff">
<img class="centered" style="width: 750px; " alt="SWIFT MT202 situation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d5018e99-e484-4633-b558-a738539a9dff" />
</a>
</div>
<br>
<p>
There can be two different situations:
</p>
<ul>
<li>
Either the bank is the initial sending institution, in which case it will be identified both as the sender of the SWIFT message and possibly as the Ordering Institution (52a)
</li>
<li>
Or the bank is just a routing bank in the routing chain, in which case it is the sender but will be different from the Ordering Institution (52a). A sketch of this check follows the list.
</li>
</ul>
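<p>
A minimal sketch of that distinction (comparing BICs on their first 8 characters is my own simplification):
</p>
<pre>
# Sketch: decide whether the bank sending this MT202 is the initial
# sending institution or just a routing bank in the chain.
def is_routing_bank(sender_bic, ordering_institution_bic):
    if ordering_institution_bic is None:
        # No field 52a: the sender is the initial sending institution
        return False
    return sender_bic[:8] != ordering_institution_bic[:8]

print(is_routing_bank("BXYZCHZZ", None))        # False
print(is_routing_bank("BXYZCHZZ", "BCVSCH2L"))  # True
</pre>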
<a name="sec231"></a>
<h4>2.3.1 SWIFT MT202 Introductory Examples</h4>
<p>
This section presents various examples of SWIFT MT202 corresponding to different situations
</p>
<a name="sec2311"></a>
<h5>2.3.1.1 MT202 Example 1: simplest case</h5>
<p>
In this first example, bank XYZ wants to send 1 million euros from its general VOSTRO account 1234-5678 with BNP Paribas to another of its own accounts, FR982381827331, also with BNP Paribas.
<br>
The field "<i>Sender's Correspondent</i>" - 53B - is diverted to identify the source account to be debited.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/adf80790-2744-4778-82d3-44f7dd25f774">
<img class="centered" style="width: 550px; " alt="SWIFT MT202 Example 1" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/adf80790-2744-4778-82d3-44f7dd25f774" />
</a>
</div>
<br>
<p>
Details of the SWIFT MT202:
</p>
<ul>
<li>
The <i>beneficiary institution</i> - field 58A - is a field that doesn't exist in the MT103 and is introduced by the MT202. It does not identify the institution holding the account to be credited, but the banking institution owning that account - i.e. the bank that is the customer of the institution holding its correspondent account.
</li>
</ul>
<a name="sec2312"></a>
<h5>2.3.1.2 MT202 Example 2: other bank case</h5>
<p>
In this second example, bank XYZ wants to send 1 million euros from its general VOSTRO account 1234-5678 with BNP Paribas to an account belonging to another financial institution: Banque Populaire in France. This other banking institution also holds an account with BNP Paribas.
<br>
The account of Banque Populaire with BNP Paribas is FR98238182733.
</p>
<p>
The field "<i>Sender's Correspondent</i>" - 53B - is diverted to identify the source account to be debited.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/02aec4a2-49d6-498d-8301-3d5997e37727">
<img class="centered" style="width: 580px; " alt="SWIFT MT202 Example 2" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/02aec4a2-49d6-498d-8301-3d5997e37727" />
</a>
</div>
<br>
<p>
The details of this SWIFT MT202 are fundamentally similar to those of the previous example:
</p>
<ul>
<li>The field <i>Beneficiary Institution</i> - 58A - clearly identifies the eventual beneficiary institution holding the account with the Account with Institution - field 57A - in this case Banque Populaire.
</li>
</ul>
<a name="sec2313"></a>
<h5>2.3.1.3 MT202 Example 3: routed message</h5>
<p>
In this example we'll see what a routed MT202 looks like. Bank BCVs wants to send money to Banque Populaire. They are not using correspondents: the initial account debited is with BCVs and the eventual beneficiary account is with Banque Populaire.
</p>
<p>
Since BCVs and Banque Populaire have no relationship together, the MT202 needs to be routed through banks having a banking relationship. In this example we look at the MT202 forwarded by bank XYZ, one of the banks in the routing chain.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9a25620b-6fc9-4493-93f2-dda68ac42ade">
<img class="centered" style="width: 680px; " alt="SWIFT MT202 Example 3" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9a25620b-6fc9-4493-93f2-dda68ac42ade" />
</a>
</div>
<br>
<p>
Details of this SWIFT MT202 message:
</p>
<ul>
<li>The field <i>Account with Institution</i> - 57A - has the same value as the field <i>Beneficiary Institution</i> - 58A - since the funds' final destination is Banque Populaire itself.</li>
</ul>
<a name="sec232"></a>
<h4>2.3.2 MT202 Parsing and Data Mapping</h4>
<p>
The MT202 parsing details are presented in the table below. Only the most essential fields are discussed.
</p>
<table class="allborder">
<tr style="background: #CCCCCC; font-weight: bold;">
<td rowspan="2">
Meaning
</td>
<td colspan="2">
SWIFT
</td>
<td rowspan="2">
Example
</td>
<td rowspan="2" colspan="2">
Comment
</td>
</tr>
<tr style="background: #CCCCCC; font-weight: bold;">
<td>Field</td>
<td>Variant</td>
</tr>
<tr>
<td>
AppID
</td>
<td colspan="2">
Block1/ApplId
</td>
<td>
F
</td>
<td colspan="2">
The Application Identifier identifies the application within which the message is being sent or received. The available options are: F = FIN , A = GPA, etc.
<br>
These values are automatically assigned by the SWIFT system and the user's CBT.
</td>
</tr>
<tr>
<td>
ServiceID
</td>
<td colspan="2">
Block1/Servid
</td>
<td>
01
</td>
<td colspan="2">
The Service Identifier consists of two numeric characters. It identifies the type of data that is being sent or received and, in doing so, the type of the following message
</td>
</tr>
<tr>
<td>
Sender
(Sending bank / BIC)
</td>
<td colspan="2">
Block1/
LTaddrBlk1 (I)
<br>
Or
<br>
Block2/
LTaddrBlk2 (O)
</td>
<td>
SGOBFRPP
</td>
<td colspan="2">
Sender BIC appears in header block (Block 1) in the MT202 Input and in the application block (Block 2) in the MT202 Output
<br>
(Input and output related to the SWIFT network, not the bank). Need to use field Block2 / Inoutind to find out
</td>
</tr>
<tr>
<td>
Message Type
</td>
<td colspan="2">
Block2/
Msgtype
</td>
<td>
202
</td>
<td colspan="2">
SWIFT Message Type = MT 202
</td>
</tr>
<tr>
<td>
Validation Flag
</td>
<td colspan="2">
Block3/Tag119
</td>
<td>
COV (if MT202 is a cover payment)
</td>
<td colspan="2">
This validation flag is provided in the user block (Block 3) and transported end-to-end. It indicates that the message is an MT202 COV.
</td>
</tr>
<tr>
<td>
Receiver
(Receiving Bank / BIC)
</td>
<td colspan="2">
Block2/
LTaddrBlk2(I)
<br>
Or
<br>
Block1/
LTaddrBlk1 (O)
</td>
<td>
RBOSGB2L
</td>
<td colspan="2">
The Receiver BIC appears in the header block (Block 1) in the MT202 Output and in the application block (Block 2) in the MT202 Input.
<br>
(Input and output related to the SWIFT network, not the bank). <b>Need to use field Block2 / Inoutind to find out.</b>
<br>
The receiver of the message is the eventual beneficiary only if no field 57 says otherwise.
</td>
</tr>
<tr>
<td>
Unique End-to-end Transaction Reference
</td>
<td colspan="2">
Block3/ Tag 121
</td>
<td>
b03c6901-bbed-4aa9-afdh-A5bc26d19257
</td>
<td colspan="2">
This reference is provided in the user block (Block 3) and transported end-to-end. It is mandatory in the MT202 but can still be missing, or have duplicates.
</td>
</tr>
<tr>
<td colspan="6" style="background-color: #AACCFF; font-weight: bold;">
Sequence A - General Information (Matching MT 202 format)
</td>
</tr>
<tr>
<td>
Sender's Reference
</td>
<td colspan="2" class="center">
20
</td>
<td>
ORDERREF1234
</td>
<td colspan="2">
This field is mandatory and of format 16x. It is a reference assigned by the Sender to unambiguously identify the message.
</td>
</tr>
<tr>
<td>
Related reference
</td>
<td colspan="2" class="center">
21
</td>
<td>
123456789ABCDEF
</td>
<td colspan="2">
This field is mandatory and of format 16x.
</td>
</tr>
<tr>
<td>
Sender Msg. Sending Timestamp
</td>
<td colspan="2">
(O) Block2 /
Intime + Indate
<br>
(I) Sys.time()
</td>
<td>
1538070522
</td>
<td colspan="2">
(O) = Output only : SWIFT timestamp for an Output message (HHMMYYMMDD)
<br> or local date/time for an Input Message.
</td>
</tr>
<tr>
<td>
In / Out-put flag
</td>
<td colspan="2">
Block2 / Inoutind
</td>
<td>
I
</td>
<td colspan="2">
Single letter ‘I’ or ‘O’
</td>
</tr>
<tr>
<td>
Value Date / currency / interbank settled amount
</td>
<td class="center">
32
</td>
<td class="center">
A
</td>
<td>
180816USD2325,
</td>
<td colspan="2">
It is mandatory and of format 6!n3!a15d (Date)(Currency)(Amount). <br>
Note the trailing comma (i.e. the decimal part is not mandatory if 0)
</td>
</tr>
<tr>
<td rowspan="2">
Ordering institution
</td>
<td rowspan="2" class="center">
52
</td>
<td class="center">
A
</td>
<td>
BNPAFRPP<br>
or e.g.<br>
/FR123509321...<br>
BNPAFRPP
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code)<br>
</td>
<td rowspan="2" class="lx">
Ordering institution is optional and can be provided in two options: A (usual) and D (less common). <br>
The sender populates this field to indicate that the initial instruction comes from another institution (the ordering institution). <br>
The ordering institution remains constant in the message chain. <br>
(priority over header)
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
BANQUE DELUBAC ET CIE<br>
16 PL SALEON<br>
TERRAS <br>
07160 LE CHEYLARD
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td rowspan="2">
Sender's correspondent
</td>
<td rowspan="2" class="center">
53
</td>
<td class="center">
A
</td>
<td>
PNBPUS3N<br>
or e.g.<br>
/12345678901<br>
PNBPUS3N
</td>
<td colspan="2">
<b>Cover payments only</b><br>
Correspondent of sender. Sender has an account in Currency with this banking institution. <br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td class="center">
B
</td>
<td>
/12345678901
</td>
<td colspan="2">
<b>Caution : Serial and cover payments.</b><br>
Field 53B indicates the account number of the Sender, serviced by the Receiver, which is to be used for reimbursement (debit) in the transfer. This is the account of the sender held by the receiver (vostro account).<br>
Option B Format is:<br>
[/1!a][/34x] (Party Identifier) <br>
[35x] (Location)<br>
<i>The field is optional, but in practice the account number is almost always provided.</i><br>
</td>
</tr>
<tr>
<td>
Receiver’s correspondent
</td>
<td class="center">
54
</td>
<td class="center">
A
</td>
<td>
IRVTUS3N<br>
or e.g.<br>
/9876412-1234/123<br>
IRVTUS3N
</td>
<td colspan="2">
<b>Cover payments only.</b><br>
Correspondent of receiver. Receiver has an account in currency with this banking institution. <br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td>
Intermediary Institution
</td>
<td class="center">
56
</td>
<td class="center">
A
</td>
<td>
IRVTUS3N (bic)<br>
or e.g.<br>
/939400 (BSB or account)<br>
AMPBAU2SXXX (bic)
</td>
<td colspan="2">
<b>Serial payments only.</b><br>
This is the correspondent of the creditor bank: it holds the account, in the currency of the transfer, of the creditor bank. It is used instead of 54a (Receiver's Correspondent) in the case of a serial payment transfer.<br>
It is optional and can be provided in option A, C or D. <br>
<i>Formats C or D are rarely used and most of the time not supported by banks.</i><br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td rowspan="2">
Account With Institution
</td>
<td rowspan="2" class="center">
57
</td>
<td class="center">
A
</td>
<td>
BARCGB22XXX (bic)<br>
or e.g.<br>
//939400 (BSB)<br>
AMPBAU2SXXX (bic)
</td>
<td>
We need to parse the BIC out of it<br>
Format :<br>
[/1!a][/34x] (Party Identifier )<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="2" class="lx">
<b>Serial payments only.</b><br>
Account with institution is optional and can be provided in option A, B, C or D. <br>
<i>Formats B, C are rarely used and most of the time not supported by banks.</i><br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)<br>
Field 57 is used when the receiver of the SWIFT MT202 doesn’t own the beneficiary account and needs to send the message further. In the final MT202 on the chain, the holder of the account will be the receiver and no field 57 will be required anymore.<br>
We need to parse the BIC out of it or hash the address<br>
(priority over header)
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
Hong Kong Banking Assoc.<br>
Avenue du Léman<br>
1204 Genève - CH<br>
Switzerland
</td>
<td>
If no BIC is available to identify the target institution, option D is used.<br>
In principle minimum 3 lines with name and address should be provided<br>
Format :<br>
[/1!a][/34x] (Party Identifier )<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td rowspan="2">
Beneficiary Institution
</td>
<td rowspan="2" class="center">
58
</td>
<td class="center">
A
</td>
<td>
BNPAFRPP<br>
or e.g.<br>
/FR123509321...<br>
BNPAFRPP
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="2" class="lx">
Beneficiary Institution is a mandatory field.<br>
In the case of MT 202 and MT 202 COV, the beneficiary institution is the most reliable and straightforward way to identify the eventual beneficiary institution of the funds.
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
BANQUE DELUBAC ET CIE<br>
16 PL SALEON<br>
TERRAS<br>
07160 LE CHEYLARD
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td>
Sender to Receiver Information
</td>
<td colspan="2" class="center">
72
</td>
<td>
/INS/BNPAFRPP
</td>
<td colspan="2">
This is an optional field. It takes the Format 6*35x.<br>
There can be many codes indicating additional information.<br>
INS is a code indicating that BNPAFRPP is the instructing institution. Without field 72, the receiver may not know it, since that information is provided nowhere else in the message when the sender is the next bank on the routing chain and the ordering institution is another bank before the instructing one.
</td>
</tr>
</table>
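<p>
To make the mapping above a bit more concrete, here is a minimal Python sketch - an illustration only, not a production parser - of how the block structure of a FIN message could be split apart and a few of the Block 4 fields above extracted. The sample message and the regular expressions are simplified assumptions; a real parser must implement the full FIN grammar.
</p>
<pre>
import re

def split_blocks(msg):
    # FIN messages are organized in blocks {1:...}{2:...}{3:...}{4:...-}
    return dict(re.findall(r"\{(\d):(.*?)\}", msg, re.DOTALL))

def split_fields(block4):
    # Block 4 fields look like :20:..., :32A:..., :58A:... (one per line)
    return re.findall(r":(\d{2}[A-Z]?):(.*?)(?=\n:|\n-|$)", block4, re.DOTALL)

# Simplified, hypothetical MT202 reusing the example values of the table above
sample = ("{1:F01SGOBFRPPAXXX0000000000}"
          "{2:I202RBOSGB2LXXXXN}"
          "{4:\n:20:ORDERREF1234\n:21:123456789ABCDEF\n"
          ":32A:180816USD2325,\n:58A:BNPAFRPP\n-}")

blocks = split_blocks(sample)
print("sender LT address:", blocks["1"][3:15])   # Block1/LTaddrBlk1 (Input message)
for tag, value in split_fields(blocks["4"]):
    if tag == "32A":
        # 6!n3!a15d : (Date)(Currency)(Amount), decimal comma, e.g. "2325,"
        date, ccy, amount = value[:6], value[6:9], value[9:]
        print(date, ccy, float(amount.replace(",", ".")))
</pre>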
<a name="sec233"></a>
<h4>2.3.3 Additional notes on MT202</h4>
<p>
Some complementary notes (the routing logic they describe is sketched in code right after the list):
</p>
<ul>
<li>
When no correspondent is used on either the sender side (Tag 53A) or the receiver side (Tag 54A), and no reimbursement party (Tags 56a and 57a) is indicated in the SWIFT MT202 message, it means:
<ul>
<li>There’s a direct account relationship, in the currency of the transfer, between the Sender and the Receiver. Money will be taken from that account and credited to the beneficiary.</li>
<li>The beneficiary customer account (:59:/00012367493) is held by the receiver.</li>
</ul>
</li>
<li>
What to do in the case of 57D if we cannot find a country code? We use the SWIFT message Receiver BIC country (sometimes wrong, but better than nothing; and only sometimes wrong, since when 57D is used both parties are likely in the same country).
</li>
<li>
Routing
<ul>
<li>When there is no ordering institution (Tag 52) in the SWIFT MT202 message, it implicitly means that the ordering customer is a customer of the Sender.</li>
<li>When the ordering institution (Tag 52D) is provided in the MT202 SWIFT message, the ordering customer is not a customer of the Sender:
<ul>
<li>Either the sending institution sends the MT202 on behalf of the ordering institution in 52D. This happens when the ordering institution is a small bank that has an agreement with a major bank (the sending bank) for the processing and settlement of currency transactions. The small bank can use the correspondent network of the sending institution.</li>
<li>Or the Sender is a routing bank on the chain</li>
</ul>
</li>
</ul>
</li>
<li>
Field 57 is used when the receiver of the SWIFT MT202 doesn’t own the beneficiary account and needs to send the message further.
</li>
<li>
In the final MT202 on the chain, the holder of the account will be the receiver and no field 57 will be required anymore.
<ul>
<li>This indicates that the Sender and the Beneficiary customer's Bank do not have a direct account relationship in the currency of the transaction. Otherwise the sender would send the message directly to the beneficiary customer's Bank.</li>
</ul>
</li>
<li>
The field 58a identifies in a unique way the beneficiary institution. To be perfectly honest, it does not have the same meaning as the 57a in the 103 for instance; it rather has the same meaning as the 59, identifying the beneficiary when the beneficiary is an institution and not a customer. We shall use the 58 for 202 and 202 COV to identify the eventual beneficiary institution.
</li>
</ul>
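<p>
Assuming the fields have already been parsed into a simple dictionary (as in the sketch of section 2.3.2 above), the routing logic of these notes could be summarized as follows. This is only an illustration of the notes, not an authoritative implementation:
</p>
<pre>
def ordering_institution(fields, sender_bic):
    # Field 52 has priority over the header. Without a field 52, the
    # ordering customer is implicitly a customer of the sender.
    return fields.get("52A") or fields.get("52D") or sender_bic

def beneficiary_institution(fields, receiver_bic):
    # For MT202 / MT202 COV, field 58 is the most reliable way to identify
    # the eventual beneficiary institution; the receiver of the message is
    # the eventual beneficiary only if no field 57 says otherwise.
    if fields.get("58A") or fields.get("58D"):
        return fields.get("58A") or fields.get("58D")
    if fields.get("57A") or fields.get("57D"):
        return fields.get("57A") or fields.get("57D")
    return receiver_bic

fields = {"58A": "BNPAFRPP"}                        # illustrative values only
print(ordering_institution(fields, "SGOBFRPP"))     # SGOBFRPP (no field 52)
print(beneficiary_institution(fields, "RBOSGB2L"))  # BNPAFRPP (field 58)
</pre>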
<a name="sec24"></a>
<h3>2.4 SWIFT MT202 COV Detailed Analysis</h3>
<p>
MT202 COV is a SWIFT message format for financial institution (FI) funds transfers between financial institutions. MT202s are used primarily for two purposes: bank-to-bank payments (e.g. interest payments and settlement of FX trades) and cover payments.
<br>
MT202 COV was implemented in 2009 to create traceability of the origin of funds (institution and account) through to the destination of funds (institution and account). This was in response to anti-money-laundering and associated banking requirements.
</p>
<p>
Prior to MT202 COV, the message format, MT202, did not include origination/destination financial institution information. Particularly for cover payments, where a combination of MT103 and MT202 is used to direct funds transfers to a beneficiary account, the intermediate banks in the MT202 had no ability to understand and perform risk analysis/AML/compliance checks on the funds transfer based on the origin and destination of the funds. Thus, intermediate banks could be unwittingly involved in illegal transactions under new regulations.
</p>
<p>
The situation of SWIFT MT202 COV in a banking institution is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2bd7f8a6-2534-472b-ad7c-b4534354aceb">
<img class="centered" style="width: 750px; " alt="SWIFT MT202 COV situation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2bd7f8a6-2534-472b-ad7c-b4534354aceb" />
</a>
</div>
<br>
<p>
There can be two different situations:
</p>
<ul>
<li>
Either the bank is the initial sending institution, in which case it will be identified both as the sender of the SWIFT message and perhaps as the <i>Ordering Institution</i> (52a)
</li>
<li>
Or the bank is just a routing bank in the routing chain, in which case it is the sender but shall be different from the <i>Ordering Institution</i> (52a)
</li>
</ul>
<a name="sec241"></a>
<h4>2.4.1 MT202 COV Introductory examples</h4>
<p>
This section presents various examples of SWIFT MT202 COV corresponding to different situations.
</p>
<a name="sec2411"></a>
<h5>2.4.1.1 MT202 COV Example: Cover payment</h5>
<p>
We’ll now go through a case where the MT202 COV follows the MT103 sent as an announcement as part of the cover payment method, as introduced in section [6.2.2.4 Example 4: Announce message (cover method)]. This is the case when the initial sending institution and the eventual beneficiary institution have no relationship together and decide to go through their correspondents.
<br>
The MT 202 COV in this case is the message actually carrying the funds.
</p>
<p>
In this example, the customer John Trump of bank XYZ in Switzerland wants to send 1’000’000 USD to Cowboy Corp. in Kensas City, a customer of “Kensas Credit”.
<br>
Due to the nature of the transaction and the absence of a banking relationship between the two banks, they decide to go through correspondent banks:
</p>
<ul>
<li>
An MT103 is sent directly to the beneficiary bank, even though the two banks have no banking relationship together
</li>
<li>
An MT202.COV will be routed through correspondent(s) and routing banks
</li>
</ul>
<p>
We are here first having a look at the initial MT202 COV sent by bank XYZ, the sending institution:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d5be5c72-30f1-425e-810f-dafac5c4d10b">
<img class="centered" style="width: 850px; " alt="SWIFT MT202 COV Example 1" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d5be5c72-30f1-425e-810f-dafac5c4d10b" />
</a>
</div>
<br>
<p>
Details of the SWIFT MT 202 COV message:
</p>
<ul>
<li>
The field Account with Institution - 57a - ends the chain at Wells Fargo. The MT202 chain doesn’t carry the funds any further.
</li>
<li>
But the account at Wells Fargo is the correspondent account of Kensas Credit at Wells Fargo: the field Beneficiary Institution - 58a - identifies the institution owning that account at Wells Fargo, hence Kensas Credit.
</li>
<li>
The field 59 identifies the beneficiary customer for which the funds are transferred, Cowboy Corp., a customer of Kensas Credit.
</li>
</ul>
<p>
We shall have a look at all the MT202 COV of the chain to see how they differ from each other:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1c731b22-d64d-410e-9fc9-a685e880fe5e">
<img class="centered" style="width: 850px; " alt="SWIFT MT202 COV Example 2" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/1c731b22-d64d-410e-9fc9-a685e880fe5e" />
</a>
</div>
<br>
<a name="sec242"></a>
<h4>2.4.2 MT202 COV Parsing and Data Mapping</h4>
<p>
The MT202 COV parsing details are presented in the table below. Only the most essential fields are discussed.
</p>
<table class="allborder">
<tr style="background: #CCCCCC; font-weight: bold;">
<td rowspan="2">
Meaning
</td>
<td colspan="2">
SWIFT
</td>
<td rowspan="2">
Example
</td>
<td rowspan="2" colspan="2">
Comment
</td>
</tr>
<tr style="background: #CCCCCC; font-weight: bold;">
<td>Field</td>
<td>Variant</td>
</tr>
<tr>
<td>
AppID
</td>
<td colspan="2">
Block1/ApplId
</td>
<td>
F
</td>
<td colspan="2">
The Application Identifier identifies the application within which the message is being sent or received. The available options are: F = FIN , A = GPA, etc.
<br>
These values are automatically assigned by the SWIFT system and the user's CBT.
</td>
</tr>
<tr>
<td>
ServiceID
</td>
<td colspan="2">
Block1/Servid
</td>
<td>
01
</td>
<td colspan="2">
The Service Identifier consists of two numeric characters. It identifies the type of data that is being sent or received and, in doing so, the type of the following message
</td>
</tr>
<tr>
<td>
Sender
(Sending bank / BIC)
</td>
<td colspan="2">
Block1/
LTaddrBlk1 (I)
<br>
Or
<br>
Block2/
LTaddrBlk2 (O)
</td>
<td>
SGOBFRPP
</td>
<td colspan="2">
Sender BIC appears in header block (Block 1) in the MT202 Input and in the application block (Block 2) in the MT202 Output
<br>
(Input and output related to the SWIFT network, not the bank). Need to use field Block2 / Inoutind to find out
</td>
</tr>
<tr>
<td>
Message Type
</td>
<td colspan="2">
Block2/
Msgtype
</td>
<td>
202
</td>
<td colspan="2">
SWIFT Message Type = MT 202
</td>
</tr>
<tr>
<td>
Validation Flag
</td>
<td colspan="2">
Block3/Tag119
</td>
<td>
COV (if MT202 is a cover payment)
</td>
<td colspan="2">
This validation flag is provided in the user block (Block 3) and transported end-to-end. It indicates that the message is an MT202 COV.
</td>
</tr>
<tr>
<td>
Receiver
(Receiving Bank / BIC)
</td>
<td colspan="2">
Block2/
LTaddrBlk2(I)
<br>
Or
<br>
Block1/
LTaddrBlk1 (O)
</td>
<td>
RBOSGB2L
</td>
<td colspan="2">
The Receiver BIC appears in the header block (Block 1) in the MT202 Output and in the application block (Block 2) in the MT202 Input.
<br>
(Input and output related to the SWIFT network, not the bank). <b>Need to use field Block2 / Inoutind to find out.</b>
<br>
The receiver of the message is the eventual beneficiary only if no field 57 says otherwise.
</td>
</tr>
<tr>
<td>
Unique End-to-end Transaction Reference
</td>
<td colspan="2">
Block3/ Tag 121
</td>
<td>
b03c6901-bbed-4aa9-afdh-A5bc26d19257
</td>
<td colspan="2">
This reference is provided in the user block (Block 3) and transported end-to-end. It is mandatory in MT103 but may nonetheless be missing, or even duplicated, in practice.
</td>
</tr>
<tr>
<td colspan="6" style="background-color: #AACCFF; font-weight: bold;">
Sequence A - General Information (Matching MT 202 format)
</td>
</tr>
<tr>
<td>
Sender's Reference
</td>
<td colspan="2" class="center">
20
</td>
<td>
ORDERREF1234
</td>
<td colspan="2">
This field is mandatory and of format 16x. It is a reference assigned by the Sender to unambiguously identify the message.
</td>
</tr>
<tr>
<td>
Related reference
</td>
<td colspan="2" class="center">
21
</td>
<td>
123456789ABCDEF
</td>
<td colspan="2">
This field is mandatory and of format 16x.
</td>
</tr>
<tr>
<td>
Sender Msg. Sending Timestamp
</td>
<td colspan="2">
(O) Block2 /
Intime + Indate
<br>
(I) Sys.time()
</td>
<td>
1538070522
</td>
<td colspan="2">
(O) = Output only : SWIFT timestamp for an Output message (HHMMYYMMDD)
<br> or local date/time for an Input Message.
</td>
</tr>
<tr>
<td>
In / Out-put flag
</td>
<td colspan="2">
Block2 / Inoutind
</td>
<td>
I
</td>
<td colspan="2">
Single letter ‘I’ or ‘O’
</td>
</tr>
<tr>
<td>
Value Date / currency / interbank settled amount
</td>
<td class="center">
32
</td>
<td class="center">
A
</td>
<td>
180816USD2325,
</td>
<td colspan="2">
It is mandatory and of format 6!n3!a15d (Date)(Currency)(Amount). <br>
Note the trailing comma (i.e. the decimal part is not mandatory if 0)
</td>
</tr>
<tr>
<td rowspan="2">
Ordering institution
</td>
<td rowspan="2" class="center">
52
</td>
<td class="center">
A
</td>
<td>
BNPAFRPP<br>
or e.g.<br>
/FR1235093212...<br>
BNPAFRPP
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code)<br>
</td>
<td rowspan="2" class="lx">
Ordering institution is optional and can be provided in two options: A (usual) and D (less common). <br>
The sender populates this field to indicate that the initial instruction comes from another institution (the ordering institution). <br>
The ordering institution remains constant in the message chain. <br>
(priority over header)
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
BANQUE DELUBAC ET CIE<br>
16 PL SALEON<br>
TERRAS <br>
07160 LE CHEYLARD
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td rowspan="2">
Sender's correspondent
</td>
<td rowspan="2" class="center">
53
</td>
<td class="center">
A
</td>
<td>
PNBPUS3N<br>
or e.g.<br>
/12345678901<br>
PNBPUS3N
</td>
<td colspan="2">
<b>Cover payments only</b><br>
Correspondent of sender. Sender has an account in Currency with this banking institution. <br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td class="center">
B
</td>
<td>
/12345678901
</td>
<td colspan="2">
<b>Caution : Serial and cover payments.</b><br>
Field 53B indicates the account number of the Sender, serviced by the Receiver, which is to be used for reimbursement (debit) in the transfer. This is the account of the sender held by the receiver (vostro account).<br>
Option B Format is:<br>
[/1!a][/34x] (Party Identifier) <br>
[35x] (Location)<br>
<i>The field is optional, but in practice the account number is almost always provided.</i><br>
</td>
</tr>
<tr>
<td>
Receiver’s correspondent
</td>
<td class="center">
54
</td>
<td class="center">
A
</td>
<td>
IRVTUS3N<br>
or e.g.<br>
/9876412-1234/123<br>
IRVTUS3N
</td>
<td colspan="2">
<b>Cover payments only.</b><br>
Correspondent of receiver. Receiver has an account in currency with this banking institution. <br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td>
Intermediary Institution
</td>
<td class="center">
56
</td>
<td class="center">
A
</td>
<td>
IRVTUS3N (bic)<br>
or e.g.<br>
/939400 (BSB or account)<br>
AMPBAU2SXXX (bic)
</td>
<td colspan="2">
<b>Serial payments only.</b><br>
This is the correspondent of the creditor bank: it holds the account, in the currency of the transfer, of the creditor bank. It is used instead of 54a (Receiver's Correspondent) in the case of a serial payment transfer.<br>
It is optional and can be provided in option A, C or D. <br>
<i>Formats C or D are rarely used and most of the time not supported by banks.</i><br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td rowspan="2">
Account With Institution
</td>
<td rowspan="2" class="center">
57
</td>
<td class="center">
A
</td>
<td>
BARCGB22XXX (bic)<br>
or e.g.<br>
//939400 (BSB)<br>
AMPBAU2SXXX (bic)
</td>
<td>
We need to parse the BIC out of it<br>
Format :<br>
[/1!a][/34x] (Party Identifier )<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="2" class="lx">
<b>Serial payments only.</b><br>
Account with institution is optional and can be provided in option A, B, C or D. <br>
<i>Formats B, C are rarely used and most of the time not supported by banks.</i><br>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)<br>
Field 57 is used when the receiver of the SWIFT MT202 doesn’t own the beneficiary account and needs to send the message further. In the final MT202 on the chain, the holder of the account will be the receiver and no field 57 will be required anymore.<br>
We need to parse the BIC out of it or hash the address<br>
(priority over header)
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
Hong Kong Banking Assoc.<br>
Avenue du Léman<br>
1204 Genève - CH<br>
Switzerland
</td>
<td>
If no BIC is available to identify the target institution, option D is used.<br>
In principle minimum 3 lines with name and address should be provided<br>
Format :<br>
[/1!a][/34x] (Party Identifier )<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td rowspan="2">
Beneficiary Institution
</td>
<td rowspan="2" class="center">
58
</td>
<td class="center">
A
</td>
<td>
BNPAFRPP<br>
or e.g.<br>
/FR123509321...<br>
BNPAFRPP
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="2" class="lx">
Beneficiary Institution is a mandatory field.<br>
In the case of MT 202 and MT 202 COV, the beneficiary institution is the most reliable and straightforward way to identify the eventual beneficiary institution of the funds.
</td>
</tr>
<tr>
<td class="center">
D
</td>
<td>
BANQUE DELUBAC ET CIE<br>
16 PL SALEON<br>
TERRAS<br>
07160 LE CHEYLARD
</td>
<td>
Format<br>
[/1!a][/34x] (Party Identifier)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td>
Sender to Receiver Information
</td>
<td colspan="2" class="center">
72
</td>
<td>
/INS/BNPAFRPP
</td>
<td colspan="2">
This is an optional field. It takes the Format 6*35x.<br>
There can be many codes indicating additional information.<br>
INS is a code indicating that BNPAFRPP is the instructing institution. Without field 72, the receiver may not know it, since that information is provided nowhere else in the message when the sender is the next bank on the routing chain and the ordering institution is another bank before the instructing one.
</td>
</tr>
<tr>
<td colspan="6" style="background-color: #AACCFF; font-weight: bold;">
Sequence B - Underlying Customer Credit Transfer Detail
</td>
</tr>
<tr>
<td>
Currency / Instructed Amount
</td>
<td class="center">
33
</td>
<td class="center">
B
</td>
<td>
USD2350,
</td>
<td colspan="2">
Normally optional in the standard. It may be provided for instance because the sender has taken fees. See field 71F below.<br>
Format 3!a15d (Currency)(Amount)
</td>
</tr>
<tr>
<td rowspan="3">
Ordering Customer
</td>
<td rowspan="3" class="center">
50
</td>
<td class="center">
A
</td>
<td>
/DE3750070010...<br>
DEUTDEFF
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (account)<br>
Line 2 (subfield bank)<br>
4!a2!a2!c[3!c] (Identifier Code)
</td>
<td rowspan="3" class="lx">
The field ordering customer is mandatory. It can be given either in the message details (here) or per transaction in the repeating sequence.<br>
The ordering customer is a customer of the sender only if there is no field 52. The ordering customer remains constant in the message chain.
</td>
</tr>
<tr>
<td class="center">
F
</td>
<td>
/DE207008000...<br>
1/Essilor International<br>
2/147 Rue de Paris<br>
3/FR/Charenton-le-Pont, 94220
</td>
<td>
Line 1 (subfield Party Identifier)<br>
/34x (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
1!n/33x (Number)(Details)
</td>
</tr>
<tr>
<td class="center">
K
</td>
<td>
/CH5704835098...<br>
GALLMAN COMPANY GMBH<br>
RAEMISTRASSE, 71<br>
8006 ZURICH<br>
SWITZERLAND
</td>
<td>
Line 1 : (subfield party identified)<br>
/34x (Account)<br>
Line 2-5 (subfield Address)<br>
4*35x (Name and Address)
</td>
</tr>
<tr>
<td rowspan="3">
Beneficiary
</td>
<td rowspan="3" class="center">
59
</td>
<td class="center">
(no letter)
</td>
<td>
/26351-38947<br>
Company One<br>
CITY STREET 50<br>
LONDON, UK
</td>
<td>
Line 1 : <br>
[/34x] (Account) (IBAN format or else)<br>
Line 2-5:<br>
4*35x (Name and Address)
</td>
<td rowspan="3" class="lx">
Beneficiary customer information is mandatory. <br>
The beneficiary remains constant in the message chain.
</td>
</tr>
<tr>
<td class="center">
A
</td>
<td>
PNBPUS3N<br>
or e.g.<br>
/12345678901<br>
PNBPUS3N
</td>
<td>
Option A is formatted <br>
[/1!a][/34x] (Party Identifier)<br>
4!a2!a2!c[3!c] (Identifier Code / BIC)
</td>
</tr>
<tr>
<td class="center">
F
</td>
<td>
/10078074<br>
1/Company One<br>
2/CITY STREET 50<br>
3/GB/LONDON
</td>
<td>
Line 1 (subfield Party Identifier)<br>
[/34x] (Account)<br>
Lines 2-5 : (Number/Name and Address)<br>
4*(1!n/33x) (name and address)
</td>
</tr>
<tr>
<td>
Remittance Information
</td>
<td colspan="2" class="center">
70
</td>
<td>
/INV/18042-090715
</td>
<td colspan="2">
Remittance information is optional and provided in format 4*35x if available: up to 4 lines of up to 35 characters each.<br>
Usually the remittance information is generated by the beneficiary and sent to the ordering customer (or debtor). The beneficiary requests the debtor to provide it in the payment message, so that the beneficiary can easily reconcile the payment with an invoice for instance.
</td>
</tr>
<tr>
<td>
Details of Charges
</td>
<td class="center">
71
</td>
<td class="center">
A
</td>
<td>
OUR
</td>
<td colspan="2">
It is mandatory and of format 3!a. It can take 3 values: BEN, OUR and SHA.
<br>
<ul>
<li>
OUR means charges are to be borne by the ordering customer.
</li>
<li>
SHA means charges are shared between Ordering and beneficiary customers.
</li>
<li>
BEN means charges are to be borne by the beneficiary
</li>
</ul>
</td>
</tr>
<tr>
<td>
Sender's charges
</td>
<td class="center">
71
</td>
<td class="center">
F
</td>
<td>
EUR2,50
</td>
<td colspan="2">
Optional. When 71A is BEN (or SHA), 71F contains the amount of the charges due, which has been deducted from the interbank settlement amount.<br>
Interbank settled amount = Instructed amount - Sender's charges.<br>
Format 3!a15d <br>
<b>Caution: there can be several different 71F in the same MT202 COV.</b>
</td>
</tr>
<tr>
<td>
Receiver's charges
</td>
<td class="center">
71
</td>
<td class="center">
G
</td>
<td>
EUR2,50
</td>
<td colspan="2">
Optional. When 71A is OUR (or SHA), 71G contains the amount of the charges due, which has been prepaid and included in the interbank settlement amount.<br>
Format 3!a15d<br>
<b>Caution: there can be several different 71G in the same MT202 COV.</b>
</td>
</tr>
<tr>
<td>
Sender to Receiver Information
</td>
<td colspan="2" class="center">
72
</td>
<td>
/INS/BNPAFRPP
</td>
<td colspan="2">
This is an optional field. It takes the Format 6*35x.<br>
There can be many codes indicating additional information. <br>
INS is a code indicating that BNPAFRPP is the instructing institution. Without field 72, the receiver may not know it, since that information is provided nowhere else in the message when the sender is the next bank on the routing chain and the ordering institution is another bank before the instructing one.
</td>
</tr>
</table>
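<p>
The party fields of Sequence B change layout with the option letter. As a hedged illustration of the formats listed above - the helper below is mine, not part of any SWIFT library - here is how the account line and the name-and-address lines of a field 50A/50F/50K value could be separated:
</p>
<pre>
def parse_party_field(option, value):
    # Applies to 50a; the same layout logic works for 59a (options A/F/no letter)
    lines = value.splitlines()
    party = {"account": None, "bic": None, "name_address": []}
    if lines and lines[0].startswith("/"):
        # Line 1 (subfield Party Identifier): /34x account
        party["account"] = lines[0].lstrip("/")
        lines = lines[1:]
    if option == "A":
        party["bic"] = lines[0] if lines else None
    elif option == "F":
        # Lines 2-5 are numbered: 1!n/33x (Number)(Details)
        party["name_address"] = [line.split("/", 1)[1] for line in lines]
    else:
        # Option K (or field 59 without letter): 4*35x name and address
        party["name_address"] = lines
    return party

# Using the 50F example of the table above
print(parse_party_field("F", "/DE2070080000...\n1/Essilor International\n"
                             "2/147 Rue de Paris\n3/FR/Charenton-le-Pont, 94220"))
</pre>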
<a name="sec243"></a>
<h4>2.4.3 Additional notes on MT202 COV</h4>
<p>
Some complementary notes:
</p>
<ul>
<li>
The optional field 53B (variant B only) is really only used to indicate which account at the correspondent (the receiver) should be debited.
</li>
<li>
Field 51A seems not to be supported by most banks (at least all those I checked, such as UBS, etc.)
</li>
<li>
When no correspondent is used on either the sender side (Tag 53A) or the receiver side (Tag 54A), and no reimbursement party (Tags 56a and 57a) is indicated in the SWIFT MT202 message, it means:
<ul>
<li>There is a direct account relationship, in the currency of the transfer, between the Sender and the Receiver. Money will be taken from that account and credited to the beneficiary.</li>
<li>The beneficiary customer account is held by the receiver.</li>
</ul>
</li>
<li>
What to do in the case of 57D if we cannot find a country code? We use the SWIFT message Receiver BIC country (sometimes wrong, but better than nothing; and only sometimes wrong, since when 57D is used both parties are likely in the same country).
</li>
<li>
Routing
<ul>
<li>When there is no ordering institution (Tag 52) in the SWIFT MT202 message, it implicitly means that the ordering customer is a customer of the Sender.</li>
<li>
When the ordering institution (Tag 52D) is provided in the MT202 SWIFT message, the ordering customer is not a customer of the Sender:
<ul>
<li>Either the sending institution sends the MT202 COV on behalf of the ordering institution in 52D. This happens when the ordering institution is a small bank that has an agreement with a major bank (the sending bank) for the processing and settlement of currency transactions. The small bank can use the correspondent network of the sending institution.</li>
<li>Or the Sender is a routing bank on the chain</li>
</ul>
</li>
</ul>
</li>
<li>
Field 57 is used when the receiver of the SWIFT MT202 doesn’t own the beneficiary account and needs to send the message further.
<br>
In the final MT202 on the chain, the holder of the account will be the receiver and no field 57 will be required anymore.
<ul>
<li>This indicates that the Sender and the Beneficiary customer's Bank do not have a direct account relationship in the currency of the transaction. Otherwise the sender would send the message directly to the beneficiary customer's Bank.
</li>
</ul>
</li>
<li>
The Sender’s reference is new for every message in the routing chain, but the end-to-end reference remains constant.
</li>
<li>
The field 58a identifies in a unique way the beneficiary institution. To be perfectly honest, it does not have the same meaning as the 57a in the 103 for instance; it rather has the same meaning as the 59, identifying the beneficiary when the beneficiary is an institution and not a customer.
<br>
We shall use the 58 for 202 and 202 COV to identify the eventual beneficiary institution.
</li>
</ul>
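<p>
One practical consistency check falls out of the charges fields (33B, 32A, 71A/71F) discussed in the table above: when 71A is BEN or SHA, the sum of the 71F occurrences should explain the difference between the instructed amount (33B) and the interbank settled amount (32A). Below is a small sketch of that check, with purely illustrative amounts and the decimal-comma handling the format requires:
</p>
<pre>
from decimal import Decimal

def swift_amount(raw):
    # SWIFT amounts use a decimal comma; decimals may be omitted: "2325,"
    return Decimal(raw.replace(",", ".").rstrip("."))

instructed = swift_amount("2350,")        # 33B
senders_charges = [swift_amount("25,")]   # one entry per 71F occurrence
settled = swift_amount("2325,")           # amount subfield of 32A

# Interbank settled amount = Instructed amount - Sender's charges
assert settled == instructed - sum(senders_charges)
print("charges reconcile:", settled)
</pre>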
<a name="sec3"></a>
<h2>3. Conclusion</h2>
<p>
The above is a collection of the most essential information we gathered at NetGuardians when implementing SWIFT parsing to monitor payments, specifically cross-border payments. There are more Message Types relevant when it comes to monitoring payments than those mentioned in this article (for instance MT 205, MT 102, MT 203, etc.), but their usage is rarer.
<br>
In addition, the listing of fields provided in this article is far from complete and other fields of the SWIFT messages, especially from the various headers, may be interesting for your own business. I refer the reader to the SWIFT MT specifications.
<br>
Last but not least, with the coming mandatory transition to XML format (SWIFT MX), I may well need to write a new version of this article pretty soon :-)
</p>
<p>
For the sake of completeness, I wanted to close this article by giving a very simplified view of how SWIFT fits into a Banking Information System:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f035bd9d-e1de-49f9-8d79-0cd3825b869c">
<img class="centered" style="width: 750px; " alt="SWIFT Architecture in Bank" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f035bd9d-e1de-49f9-8d79-0cd3825b869c" />
</a>
</div>
<br>
<p>
Most importantly, the SWIFT interface is connected to the payment hub of the banking institution. But interestingly, it is also used as an input channel for customers, since the latter may send <i>Transfer Requests</i> such as MT101 to their banking institution using SWIFT.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/ai-what-do-we-do
AI - what do we do differently at NetGuardians ?
Jerome Kehrli
2019-02-18T02:42:15-05:00
2019-02-18T02:44:03-05:00
<p>
The world of fraud prevention in banking institutions has always been largely based on rules.
<br>
Bankers and their engineers were integrating rules engines on the banking information system to prevent or detect most common fraud patterns.
<br>
And for quite a long time, this was sufficient.
</p>
<p>
But today we are experiencing a change of society, a new industrial revolution.
<br>
Today, following the first iPhone and the later mobile internet explosion, people are interconnected all the time, everywhere and for all kinds of uses.
<br>
This is the digital era and the digitization of means and behaviours forces corporations to transform their business model.
</p>
<p>
As a consequence, banking institutions are going massively online and digital first. Both the bank users and customers have evolved their behaviours with the new means offered by the digital era.
<br>
And the problem is:
<br>
How do you want to protect your customers' assets with rules at a time when, for instance, people connect to their Swiss ebanking platform from New York to pay for a holiday house rental in Morocco? How would you want to define rules to detect frauds when there are almost as many different behaviours as there are customers?
</p>
<div id="player_2" style="max-width: 100%;"></div>
<script>
player = new YT.Player('player_2', {
height: '360',
width: '640',
videoId: 'B9dY63KAG3I'
});
</script>
<p> </p>
<p>
At NetGuardians, we prevent fraud using a completely different approach. We use Artificial Intelligence to monitor financial transactions and user behaviour in real time and detect suspicious transactions or activities.
<br>
With our Big Data Analytics Platform - NG|Screener - the machine analyzes the past transactions of the customers to understand their transactional behaviour, as well as past activities of users on the banking information system.
<br>
The machine is able to analyze a very large depth of history in real time, capturing what customers and users usually do in so-called Dynamic Profiles.
<br>
Then, whenever a transaction is input on the system, the Artificial Intelligence is able to compare that specific transaction against the customer profile and compute a risk score for it. If the risk score is sufficiently high, the machine will decide to block the transaction and qualify it for further investigation by the bank.
</p>
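<p>
As a toy illustration of the general idea only - emphatically not of NetGuardians' actual models - a behavioural profile can be thought of as summary statistics over a customer's past transactions, against which each new transaction is scored:
</p>
<pre>
from statistics import mean, stdev

history = [120.0, 80.0, 150.0, 95.0, 110.0]  # past transaction amounts (toy data)

def risk_score(amount, history):
    # Distance from the customer's usual behaviour, in standard deviations
    mu, sigma = mean(history), stdev(history)
    return abs(amount - mu) / sigma

if risk_score(5000.0, history) > 3.0:        # threshold is arbitrary here
    print("block the transaction and qualify it for investigation")
</pre>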
<p>
Using advanced machine learning techniques, we are able to have a very broad spectrum of detection, while minimizing wrong alerts, these famous false positives, to an unprecedentedly low ratio.
Instead of focusing on identifying fraudsters, at NetGuardians we focus on understanding user and customer behaviours and habits. In order to find frauds, we don't look only for fraud. Our approach is to detect and block every transaction that is simply too unusual and risky for the banking institution to afford letting it go out without further investigation. And it turns out that frauds are simply always part of this set of risky transactions.
<br>
Our unique combination of unsupervised and supervised approaches makes it possible to minimize false positives, while still being able to detect fraud patterns never encountered before.
In the world of Fraud Prevention Solutions, this is the Holy Grail, being able to use models that can detect what has never been encountered before while at the same time keeping the amount of alerts very low. This is hardly seen in other AI solutions on the market today.
</p>
<p>
Over time, we have been able to develop our Artificial Intelligence further to make it smarter and smarter. For instance, pretty soon we started to compare transactions against different background sets: the past customer or user activities of course, but also the banking institution's past activities as a whole, or even the customer's or user's peer group.
<br>
The peer groups are also built using Machine Learning to analyze customers and users past activities, this time to group them together, and achieve better and more accurate scoring of risky transactions or other activities.
</p>
<p>
Today, our unique Artificial Intelligence platform is able to use a dozen different analytics and machine learning approaches, going much beyond transaction scoring alone. For instance, we are able to qualify the legitimacy of a specific interaction on the ebanking platform, to monitor PSD2 providers' activity, or even to perform frequency and timing anomaly detection on card transactions.
</p>
<p>
At NetGuardians, we use AI to detect anomalies and prevent fraud by providing the experts and investigators within the bank with the tools aimed at making them more efficient than has ever been possible before.
<br>
We enhance the human investigation and analytics process, but we don't supplant it.
<br>
This is called Augmented Intelligence, the core of what we do at NetGuardians.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/interview-about-netguardians-and-fighting
Interview about NetGuardians and fighting fraud in the digital era
Jerome Kehrli
2019-02-04T06:10:18-05:00
2019-02-05T04:38:50-05:00
<!-- Interview about NetGuardians and fighting fraud in the digital era. -->
<p>
<b><i>The below is an extract from an interview I ran in February 2019 during the <a href="https://forward-sme.epfl.ch/">EPFL Forward event</a>.</i></b>
</p>
<p>
NetGuardians is a Swiss software publisher based in Yverdon-les-bains that edits a Big Data Analytics solution deployed in financial institutions for one key use case: fighting financial crime and preventing banking fraud.
<br>
Banking fraud is meant in the broad sense here: both internally and externally.
<br>
Internal fraud is when employees misappropriate funds under management and external fraud is when cyber-criminals compromise ebanking applications, mobile devices used for payment or credit cards.
</p>
<p>
In the digital age, the means of fraudsters and cyber-criminals have drastically increased.
</p>
<p>
Cyber-criminals have become industrialized, professionalized and organized. The same technology they use against banks is also what gives us the means to protect banks.
</p>
<p>
At NetGuardians we deploy an Artificial Intelligence that monitors on a large scale, in depth and in real time all activities of users, employees of the bank, but also those of its customers, to detect anomalies.
<br>
We prevent bank fraud and fight financial crime by detecting and blocking all suspicious activity in real time.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e870fa77-83dd-40fb-a0f8-ee508d2b653a">
<img class="centered" style="width: 600px; " alt="Digital Banking" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e870fa77-83dd-40fb-a0f8-ee508d2b653a" />
</a>
</div>
<br>
<p>
<b><i>Jérôme Kehrli, how did you manage to convince a sector that is, in essence, very traditional, to trust you with your digital tools to fight against fraud?
<br>
Two different worlds, two languages, two visions?
</i></b>
</p>
<p>
The situation of the banks is a bit peculiar: the digitization, and with it the evolution of the means and behaviours of the customers in the digital age, was at the same time both a trauma and a formidable solution.
</p>
<p>
The digital revolution was traumatic because the banks, which by their very nature are very conservative, especially in Switzerland with our very strong private banking culture, were not prepared for the need to profoundly transform the customer experience of the banking world: to meet the customer where he is, on his channels, with mobile banking; this culture of everything immediately, with instant payments; the opening of the information system, with the explosion of the External Asset Managers model and external service providers with the PSD2 European standard; etc.
</p>
<p>
The digital revolution has imposed these changes, sometimes brutally, on banks, and it is the source of a tremendous increase of the attack surface of banks.
</p>
<p>
But this same technology that spawned the digital revolution has proved to be the solution too.
<br>
Technology has made it possible to build digital banking applications that provide all of the bank's services on a mobile device.
<br>
Technology has made it possible to implement innovative solutions that secure the information system and protect client funds.
</p>
<p>
And in this perspective, Artificial Intelligence is really a sort of panacea: robot advisory, chatbots, personalization of financial advice and, above all, the fight against financial crime: banking fraud and money laundering.
</p>
<p>
In the end, if five years ago our solutions seemed somewhat avant-garde, not to say futuristic and sometimes aroused a bit of skepticism, today the banks are aware of the digital urgency and it is the bankers themselves who eagerly seek our solutions.
</p>
<p>
<b><i>
You support the digital shift of the banking sector.
<br>
Do banks sometimes have to change their way of operating, their habits, to be able to use your technologies?
<br>
(Do you have to prepare them to work with you?)
</i></b>
</p>
<p>
So of course the digital revolution profoundly transforms not only the business model but also the corporate culture, its tools, and so on.
</p>
<p>
At NetGuardians we have a very concrete example.
</p>
<p>
Before the use of Artificial Intelligence, banks protected themselves with rules engines. Hundreds of rules were deployed on the information system to enforce security policies or detect the most obvious violations.
<br>
The advantage with rules was that a violation was very easy to understand. A violation of a compliance rule reported in a clear and accurate audit report was easy to understand and so was the response.
<br>
The disadvantage, however, was that the rules were a poor protection against financial crime and that's why fraud has exploded over the decade.
</p>
<p>
Today with artificial intelligence, the level of protection is excellent and without comparison with the era of the rules.
<br>
But the disadvantage of artificial intelligence is that accurately understanding a decision of the machine is much more difficult.
</p>
<p>
At NetGuardians, we develop with our algorithms a Forensic analysis application that allows bankers to understand the operation of the machine by presenting the context of the decision.
<br>
This forensic analysis application, which presents the results of our algorithms, is essential and almost as important as our algorithms themselves.
</p>
<p>
This is a powerful application, but it requires some getting used to.
</p>
<p>
Tom Cruise in Minority Report, handling a data discovery application like an orchestra conductor: it's easy in Hollywood, but it's not in reality.
<br>
In reality, we provide initial training to our users and then regular updates.
</p>
<p>
In the end, a data analysis and forensic application is not Microsoft Word. Our success is to make such an application accessible to everyone, but not without a little help.
<br>
In conclusion I would say that the culture transformation and the evolution of the tools do require some training and special care.
</p>
<p>
<b><i>
In general, what should a company prepare for, before making a digital shift?
</i></b>
</p>
<p>
In the digital age, many companies must transform their business model or disappear. Some services become obsolete, some new necessities appear.
<br>
We can mention Uber of course but also NetFlix, Booking, eBookers, etc.
</p>
<p>
For the majority of the industrial base, the digitalization of products and services is an absolute necessity, a question of survival.
</p>
<p>
Successful process and business model transformation often requires a transformation of the very culture of the company, down to its identity:
<br>
Among other things one could mention the following requirements:
</p>
<ul>
<li>scaling agility from product development to the whole company level</li>
<li>involving digital natives to identify and design digital services</li>
<li>realizing the urgency or, if necessary, creating a sense of urgency</li>
<li>understanding the scale of the challenge and the necessary transformation. Some say <i>"if it does not hurt, it is not digital transformation"</i></li>
</ul>
<p>
In summary I would say that a company is "mature" for digitalization if it is inspired by the digitalization of our daily life to adapt its products and services AND if it has the ability to execute its ideas.
<br>
Ideas without the ability to execute lead to a mess; the ability to execute without the ideas leads to the status quo.
</p>
<p>
From there I would say that a company must prepare itself on these two dimensions, bring itself the conditions and resources required to identify and to design its digital products and those required to realize them.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/artificial-intelligence-for-banking-fraud
Artificial intelligence for banking fraud prevention in the digital era
Jerome Kehrli
2018-07-04T15:34:04-04:00
2018-07-04T15:34:04-04:00
<p>
The digitalization, with its changes of means and behaviours and the induced societal and industrial evolution, is putting increasingly more pressure on banks.
<br>
Just as if regulatory pressure and financial crisis weren't enough, banking institutions have realized that they need to transform the way they run their business to attract new customers and retain their existing ones.
<br>
I detailed already this very topic in a former article on this blog: <a href="https://www.niceideas.ch/roller2/badtrash/entry/the-digitalization-challenge-and-opportunities">The Digitalization - Challenge and opportunities for financial institutions</a>.
</p>
<p>
In this regard, <b>Artificial Intelligence</b> provides tremendous opportunities and very interesting initiatives start to emerge in the big banking institutions.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/10fc4c4f-ca5d-44a0-9711-8c90c91a1ba4">
<img class="centered" style="width: 600px; " alt="Artificial Intelligence in the Financial Industry" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/10fc4c4f-ca5d-44a0-9711-8c90c91a1ba4" />
</a>
</div>
<br>
<p>
In this article I intend to present these three ways along with a few examples and detail what we do at NetGuardians in this regard.
</p>
<!-- Artificial intelligence for banking fraud prevention in the digital era -->
<h2>Digital urgency</h2>
<p>
For 10 years now, since the first iPhone, the banking business has been under intensive transformation, like the whole society. The iPhone, then all the next generations of smartphones, has allowed the always-and-everywhere interconnection of everyone: 4 billion people today, more tomorrow. But even more than this interconnection, the real revolution was the user experience offered by these new devices, providing access to vital services one finger touch away.
<br>
The current generations, millennials, and the rising Z generation, these young people born with a smartphone, do not share the same values as their ancestors. These new generations are characterized by their need of the absolute, of immediacy, everything and immediately, of individualization, <i>all about me, myself and I</i>, and of universal service, <i>where I want, when I want and especially how I want</i>.
</p>
<p>
As a result, banks, if they want to retain their customers, or seduce these young active people who are arriving on the market, must adapt, transform, and digitalize their businesses. This topic is far from new, and the amazing slap given to the banks by the emergence of the fintechs and their cannibalization of the banking business was more than enough to create the sense of urgency required to trigger the transformation of banks.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6171b9a1-5ed9-4617-b960-ad0340ce0bb1">
<img class="centered" style="width: 900px; " alt="Digital Transformation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6171b9a1-5ed9-4617-b960-ad0340ce0bb1" />
</a>
</div>
<br>
<p>
Today all banks are adapting, developing more personalized and massively online services, meeting the new active force on its privileged channels: mobile first, but also social networks, YouTube, etc.
<br>
One might want to read my previous article on this very topic: <a href="https://www.niceideas.ch/roller2/badtrash/entry/the-digitalization-challenge-and-opportunities">The Digitalization - Challenge and opportunities for financial institutions</a>
</p>
<p>
In this race for digital, made vital by the crumbling margins, new technologies are both the source of the challenges and their solution. In this perspective, Artificial Intelligence and advanced algorithms are the next step, the way to scale, and have the potential to make banks more innovative and smarter.
To meet the challenges of digital and the demands of this population with new behaviors and uses, but also to take advantage of the opportunities related to digitalization, AI proves to be a panacea.
</p>
<p>
At every level, from financial research to fraud prevention, AI can do better, faster, farther, stronger.
<br>
Today a lot of highly innovative initiatives, taking advantage of the latest advances in artificial intelligence, emerge in the big banks, those most able to consent to the investments needed for the transformation.
</p>
<p>
These initiatives are mainly around 3 axes:
</p>
<h2>1. The customer experience</h2>
<p>
The physical presence vanishes, the contact with a human also and is replaced advantageously, at least for these digital generations, by a computer system: chatbot, personal banking assistant, voice assistant, etc.
</p>
<p>
Bank of America, for instance, has deployed a virtual assistant, Erica, that interacts with customers by voice or text chat and is responsible for answering customers' questions, but is also able to provide financial advice, investment recommendations, or to carry out the simplest banking operations such as payments, fund transfers, etc.
These new banking channels are available constantly, everywhere and in various forms: "<i>When I want, where I want and especially as I want.</i>"
</p>
<h2>2. Advanced analytics, operational efficiency and service customization</h2>
<p>
New technologies and machine learning techniques are able to substitute for humans on many analytical tasks in an advantageous way, be it financial research, investment optimization or customer profiling for the personalization of advisory.
</p>
<p>
UBS, for instance, has developed virtual research agents able to perform investment research tasks, from analyzing market data to company valuation, at a level comparable to human analysts but much faster.
<br>
RBS has implemented a robot for the loan evaluation process able to approve a credit in 45 minutes, instead of several days as previously required.
</p>
<h2>3. Prevention of banking fraud and anti-money-laundering</h2>
<p>
The new channels to which banks must subscribe, the new usages of customers and the acceleration of business cause an increase of the attack surface of banks. These usages especially, and the behaviors of these digital generations, <i>share everything with everyone</i>, <i>all about myself</i>, my life online, social networks where everything is shared, facilitate all forms of attacks, from social engineering to the theft of eBanking sessions.
</p>
<p>
Finally, the new technologies, on one side, since they offer new means to cyber-criminals, and the economic context on the other side, which makes the criminal enterprise attractive to a number of engineers and other qualified computer scientists, cause an explosion of fraud cases, increasingly external.
</p>
<h2>AI against banking fraud, a bit of history</h2>
<p>
In the early 2000s, the detection of banking fraud relies mostly on internal control and auditing. The effectiveness of these approaches is pretty low because of their inherent limitations: working by sampling, internal control and audit let a lot of frauds slip through the cracks. Some additional securities are implemented within the bank's operational information system, but here too their efficiency is quite relative.
<br>
At that time, the subprime crisis and the southern european countries sovereign debt crisis have not yet occurred, the margins are wide, people trust the banks and overall, they feel safe. The fight against fraud is not perceived as a priority.
</p>
<p>
In the second half of the 2000s, the maturity of cyber-criminals, their organization and the complexity of their attacks explode, multiplying the losses associated with fraud.
<br>
Banks react by massively deploying specific analytical systems aimed at detecting fraud. At that time, these systems are rule engines seeking behavioral patterns or pre-established and well-determined conditions or patterns in audit trails of the information system.
</p>
<p>
Today, the complexity of attacks and the means of cyber-criminals are such that these rule engines are defeated. We can mention the cyber-heist at the central bank of Bangladesh, where cyber-criminals, safe and untraceable, managed to steal $81 million, or the Retefe worm, which despite the means deployed still manages to divert about fifty ebanking sessions every day, today, in Switzerland.
<br>
The rule engines are outdated for various reasons, including changes in usage, their multiplication and the complexity of bank customer behavior. How can the same set of rules effectively protect customers with different uses and behaviors, for example a simple saver on one side and an institutional account used to pay the suppliers of a company on another side?
</p>
<p>
Today, banks no longer have any choice and deploy large-scale AI techniques to protect their Assets Under Management and their customers.
</p>
<h2>Lessons learned, our approach at NetGuardians</h2>
<p>
In the first half of the current decade, we started at NetGuardians to develop our first AI approaches, leveraging the analytics capabilities of our Big Data platform. These approaches consist in analyzing in real time all the bank transactions, with a depth of analysis of several years, to let the machine learn the transactional behavior of both the bank's customers (external fraud) and its employees (internal fraud). With this in-depth understanding of the habits, behaviors and practices of these two populations, the machine can qualify each and every transaction as legitimate or potentially fraudulent, and, if necessary, block it before the funds have left the bank.
</p>
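<p>
To make this concrete, here is a deliberately simplified sketch of the idea: learn a per-customer profile from the transaction history, then score an incoming transaction by how far it deviates from that profile. The features, weights and thresholds are illustrative assumptions, not our actual models.
</p>
<pre>
# Minimal behavioural-profiling sketch (illustrative only).
# Assumed transaction shape: (customer_id, amount, beneficiary_country).
from collections import defaultdict
from statistics import mean, stdev

class CustomerProfile:
    """Captures a customer's habitual amounts and destination countries."""
    def __init__(self):
        self.amounts = []
        self.countries = set()

    def learn(self, amount, country):
        self.amounts.append(amount)
        self.countries.add(country)

    def score(self, amount, country):
        """Risk score in [0, 1]: deviation from the learned habits."""
        if len(self.amounts) >= 2:
            mu, sigma = mean(self.amounts), stdev(self.amounts)
            z = abs(amount - mu) / (sigma or 1.0)
            amount_risk = min(z / 4.0, 1.0)  # 4 sigmas -> maximal risk
            country_risk = 0.0 if country in self.countries else 1.0
            return 0.7 * amount_risk + 0.3 * country_risk  # illustrative weights
        return 0.5  # not enough history: neutral score

profiles = defaultdict(CustomerProfile)
for cust, amount, country in [("alice", 120, "CH"), ("alice", 95, "CH"), ("alice", 140, "CH")]:
    profiles[cust].learn(amount, country)

# A 20 kCHF payment to an unseen country scores close to 1 -> block it.
print(profiles["alice"].score(20000, "NG"))
</pre>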
<p>
We then introduced other machine learning algorithms to dynamically build customer peer-groups with similar behavior, allowing us to compare an individual transaction not only with the profile of a specific customer but also with that of its peer group, thus reducing the irrelevant alerts, those false positives which must nonetheless be analyzed.
Later, we focused on broadening the vision of the AI by trying to make it understand all patterns of interaction between humans - employees or customers - and the information system of the bank, by analyzing not only transactions but also all other types of interaction.
</p>
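<p>
As a toy illustration of the peer-grouping idea, the sketch below clusters customers by a few behavioural features using k-means. The features and the scikit-learn usage are illustrative assumptions; the actual algorithms and features are, of course, more elaborate.
</p>
<pre>
# Toy peer-grouping sketch (illustrative only), using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed behavioural features per customer:
# [average payment amount, payments per month, share of foreign payments]
customers = ["alice", "bob", "carol", "dave"]
features = np.array([
    [150.0,    8, 0.02],  # simple saver
    [180.0,   10, 0.05],  # simple saver
    [52000.0, 45, 0.80],  # corporate account paying suppliers worldwide
    [48000.0, 40, 0.75],  # corporate account
])

# Standardize so no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(features)
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

for name, group in zip(customers, groups):
    print(name, "-> peer-group", group)
# A transaction can then be scored against the group profile as well.
</pre>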
<p>
Today we are able to effectively block each suspicious transaction or activity, while drastically reducing the number of cases to be analyzed - those infamous false positives - as well as the time required for the investigation of a case by the bank's teams.
</p>
<p>
All of this is explained in detail in yet another article on this blog: <a href="https://www.niceideas.ch/roller2/badtrash/entry/artificial-intelligence-for-fraud-prevention">Artificial Intelligence for Banking Fraud Prevention</a>.
</p>
<h2>Customer experience, the machine in contact with customers</h2>
<p>
The next step consists in putting the AI directly in touch with the bank's customers. When a suspicious transaction is detected, instead of mandating a bank anti-fraud employee to analyze the situation - an investigation that usually ends with a call to the customer for re-confirmation - the future is to let the machine contact the customer itself, by means of an application installed on the customer's mobile phone, or by means of a voice chatbot able to call him and speak to him to obtain directly the confirmation required for the validation of the transaction.
</p>
<p>
The benefits are numerous. For the banks' customers, it is a question of bringing this "callback" as close as possible to the input of the transaction, from a few hours today to a few seconds in the future. For the banks, it is a question of reducing intervention costs while eliminating fraud, by systematically delegating the re-confirmation of suspicious transactions to the customer.
</p>
<p>
In the end, the bank protects its reputation, its Assets Under Management and the data of its customers, but also meets their expectations - meeting the challenges of the digital age - while reducing its operational costs - benefiting from its opportunities.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/interview-on-artificial-intelligence
Interview on Artificial Intelligence
Jerome Kehrli
2018-06-29T10:29:52-04:00
2018-06-29T10:29:52-04:00
<p>
This is a collection of three videos I recorded for the <a href="https://www.empowerment.foundation/">Empowerment Foundation</a> as part of their file on <a href="https://bee2.ch/intelligence-artificielle-2/">Artificial Intelligence</a>.
</p>
<p>
In parallel and in addition to <a href="https://www.empowerment.foundation/becurious">BeCurious</a>, the <i>Empowerment Foundation</i> launches in 2018 a project of thematic <i>curation files</i> through the bee² program.
</p>
<p>
Taking up the practice of curating video content, bee² aims at exploring the issues that build our world, expanding the perspectives of analysis and stimulating awareness, to enable everyone to act in a more enlightened and responsible way in the face of tomorrow's challenges.
<br>
It's about bringing out specific issues and allowing everyone to easily discover the most relevant videos, validated by experts, on a given topic without having to browse many sources of information.
</p>
<p>
The three videos I contributed to are (in French, sorry):
</p>
<ul>
<li>
<a href="https://www.youtube.com/watch?v=6g87mOszUWQ">AI and Cybersecurity: Preventing Bank Fraud</a>
</li>
<li>
<a href="https://www.youtube.com/watch?v=jOH0akpSmos">How does the Google self driving car work ?</a>
</li>
<li>
<a href="https://www.youtube.com/watch?v=lQY5_aHt42A">What are the limits of AI?</a>
</li>
</ul>
<p>
The three videos can be viewed directly on this very page below.
</p>
<div id="player_2" style="max-width: 100%;"></div>
<script>
player = new YT.Player('player_2', {
height: '360',
width: '640',
videoId: '6g87mOszUWQ'
});
</script>
<p> </p>
<div id="player_3" style="max-width: 100%;"></div>
<script>
player = new YT.Player('player_3', {
height: '360',
width: '640',
videoId: 'jOH0akpSmos'
});
</script>
<p> </p>
<div id="player_4" style="max-width: 100%;"></div>
<script>
player = new YT.Player('player_4', {
height: '360',
width: '640',
videoId: 'lQY5_aHt42A'
});
</script>
<p> </p>
<p>
Enjoy :-)
</p>
https://www.niceideas.ch/roller2/badtrash/entry/lambda-architecture-with-kafka-elasticsearch
Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)
Jerome Kehrli
2018-05-04T06:32:20-04:00
2018-05-05T08:29:19-04:00
<!-- Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)-->
<p>
The Lambda Architecture, first proposed by Nathan Marz, attempts to provide a combination of technologies that together provide the characteristics of a web-scale system that satisfies requirements for availability, maintainability, fault-tolerance and low-latency.
</p>
<p>
Quoting <a href="https://en.wikipedia.org/wiki/Lambda_architecture">Wikipedia</a>: "<i>Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
<br>
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation.
<br>
The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.</i>"
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00">
<img class="centered" style="width: 60px; " alt="Lambda Symbol" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00" />
</a>
</div>
<br>
<p>
In my current company - NetGuardians - we detect banking fraud using several techniques, among which real-time scoring of transactions to compute a risk score.
<br />
The deployment of Lambda Architecture has been a key evolution to help us evolve towards real-time scoring on the large scale.
</p>
<p>
In this article, I intend to present how we do Lambda Architecture in my company using Apache Kafka, ElasticSearch and Apache Spark with its extension Spark-Streaming, and what it brings to us.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a>
<ul>
<li><a href="#sec11">1.1 NetGuardians' key big data software components</a></li>
<li><a href="#sec12">1.2 One ring to rule them all</a></li>
<li><a href="#sec13">1.3 Real-time readiness</a></li>
</ul>
</li>
<li><a href="#sec2">2. Lambda Architecture</a>
<ul>
<li><a href="#sec21">2.1 Lambda Architecture principles</a></li>
<li><a href="#sec22">2.2 Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)</a></li>
<li><a href="#sec23">2.3 Drawbacks and difficulties of Lambda Architecture</a></li>
</ul>
</li>
<li><a href="#sec3">3. Real-time computation with Lambda Architecture</a>
</li>
<li><a href="#sec4">4. Conclusion</a>
</li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction</h2>
<a name="sec11"></a>
<h3>1.1 NetGuardians' key big data software components</h3>
<p>
NG|Screener, NetGuardians' flagship product, is a Big Data Analytics Platform aimed at preventing fraud on the large scale within financial institutions.
<br>
Our platform manages and operates Big Data Analytics use cases detecting fraud attempts by analyzing user behaviours and financial transactions. Working in real-time, it can block suspicious business events, e.g. financial transactions, to prevent fraud effectively.
</p>
<p>
Our platform is built internally on four key Big Data Open Source Software components:
</p>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Kafka Logo" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/203ed798-13b0-4d43-9f3b-4d6047ecd959" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
<b>Apache Kafka</b>: Kafka is an open-source stream processing software aimed at providing a unified, high-throughput, low-latency platform for handling real-time data feeds.
</p>
</td>
</tr>
</table>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="ElasticSearch Logo" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/26059d4e-069e-49d1-857d-50c428591557" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
<b>ElasticSearch</b>: ElasticSearch is a distributed, real-time, RESTful search and analytics document-oriented storage engine. It lets one perform and combine many types of searches - structured, unstructured, geo, metric - in real time.
</p>
</td>
</tr>
</table>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Mesos Logo" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b38d48c0-7b52-40cd-ad86-203ae0735d6d" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
<b>Apache Mesos</b>: Mesos is a distributed systems kernel that runs on every machine and provides applications with API's for resource management and scheduling across entire datacenter and cloud environments.
</p>
</td>
</tr>
</table>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Spark Logo" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4c1d16c3-c660-42a3-9440-067587918a63" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
<b>Apache Spark</b>: Spark is a fast and general engine for large-scale data processing. It provides programmers with an API functioning as a working set for distributed programs that offers a versatile form of distributed shared memory.
</p>
</td>
</tr>
</table>
<a name="sec12"></a>
<h3>1.2 One ring to rule them all</h3>
<p>
The choice of these specific components under the hood is not anecdotal. Running Apache Spark on Apache Mesos is really still cutting edge nowadays and the choice of Apache Kafka and ElasticSearch, in addition to the good fit with our use case, answers a very important need we have.
</p>
<p>
We deploy our platform as much in tier 1 banks and big financial services providers as in small private banks in Switzerland or even small credit institutions in Africa. Some of our customers have a few thousand transactions daily while some others have dozens of millions of transactions per day.
<br>
Considering that some of our analytics use cases require a depth of analysis of several years, when we have billions of events to consider, we deploy our analytics platform on multiple-node clusters, sometimes up to a few dozen computation and storage nodes within the cluster. On the other hand, when we work for small institutions with very low data volumes, we deploy it on a single small machine.
<br>
This need is at the very root of our technology choices: we needed technologies able to run efficiently on single small machines while still being able to scale out on hundreds of nodes should we require that.
</p>
<p>
ElasticSearch, Apache Spark, Apache Mesos and Apache Kafka have been designed from the ground up with this horizontal scalability in mind. But they have been implemented in such a way that they also run very well on a single small machine.
<br>
This is pretty uncommon in the Big Data / NoSQL family of products. For instance, Apache Hadoop performs most of the time very poorly on single machines.
</p>
<p>
These products under the hood are key to sustain our "<i>one ring to rule them all</i>" approach. We develop one single platform that we can deploy everywhere, regardless of the volume of data of our customers.
</p>
<a name="sec13"></a>
<h3>1.3 Real-time readiness</h3>
<p>
In addition to their unique genes regarding scalability described above, ElasticSearch, Apache Kafka and Apache Spark provide our platform with another key feature.
</p>
<p>
With ElasticSearch, real-time updating (fast indexing) is achievable through various functionalities and search / read response time can be astonishingly deterministic.
</p>
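<p>
As a tiny illustration of this behaviour, assuming the elasticsearch-py 7.x client and a local node (the index name and document shape are illustrative):
</p>
<pre>
# Index a transaction and query it back almost immediately (illustrative).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.index(index="transactions",
         body={"customer": "alice", "amount": 120.0, "country": "CH"},
         refresh="wait_for")  # make the document searchable before returning

hits = es.search(index="transactions",
                 body={"query": {"term": {"customer": "alice"}}})
print(hits["hits"]["total"])
</pre>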
<p>
Apache Kafka comes with the Kafka Streams extension. The Streams API, available as a Java library that is part of the official Kafka project, is the easiest way to write mission-critical real-time applications and microservices with all the benefits of Kafka's server-side cluster technology.
<br>
Despite being a humble library, Kafka Streams directly addresses both of the hardest problems in stream processing:
</p>
<ul>
<li>event-at-a-time processing with millisecond latency and </li>
<li>stateful processing including distributed joins and aggregations.</li>
</ul>
<p>
Kafka enables us to implement fast processing of business events, e.g. most often financial transactions, in real-time and in event-at-a-time mode, while dispatching micro-batches further to Spark Streaming.
The more complicated processing required by our analytics use cases then occurs within Spark, through the Spark Streaming extension.
</p>
<p>
Spark Streaming is able to process hundreds of thousands of records per node per second. When using Kafka as a source, it is able to consume nearly half a million records per node per second, which is striking. It also offers near-linear scaling ability, another great perk.
<br>
Contrary to Kafka, Spark Streaming uses a micro-batch approach: received input streams are divided into small batches, which are processed by the Spark engine, and a processed stream of batches is returned.
<br>
The micro-batches can be as small as a few milliseconds, thus enabling sub-second latency while still ensuring a very high throughput and access to the whole Spark power and versatility to implement high-level analytics use cases.
</p>
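<p>
To illustrate this pipeline, here is a minimal PySpark sketch consuming transactions from Kafka in micro-batches with Spark Streaming. It assumes a Spark 2.x installation with the spark-streaming-kafka-0-8 integration package; the topic name, broker address and scoring function are illustrative placeholders.
</p>
<pre>
# Minimal Kafka -> Spark Streaming sketch (Spark 2.x era, illustrative only).
# spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 this_script.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def score(transaction):
    """Placeholder for the real scoring model applied to each event."""
    return (transaction, 0.0)

sc = SparkContext(appName="TransactionScoring")
ssc = StreamingContext(sc, batchDuration=0.5)  # 500 ms micro-batches

# Direct stream: one (key, value) pair per Kafka message.
stream = KafkaUtils.createDirectStream(
    ssc, ["transactions"], {"metadata.broker.list": "localhost:9092"})

# Score every transaction of every micro-batch and print a sample.
stream.map(lambda kv: score(kv[1])).pprint()

ssc.start()
ssc.awaitTermination()
</pre>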
<p>
This real-time readiness of these components of our technology stack is key to deploying Lambda Architecture within our platform.
</p>
<a name="sec2"></a>
<h2>2. Lambda Architecture</h2>
<p>
When it comes to processing transactions in real-time, our platform provides a state-of-the-art implementation of a Lambda Architecture.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00">
<img class="centered" style="width: 60px; " alt="Lambda Symbol" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00" />
</a>
</div>
<br>
<p>
Lambda architecture is a Big Data Architecture that enables us to reunite our real-time and batch analytics layers.
</p>
<a name="sec21"></a>
<h3>2.1 Lambda Architecture principles</h3>
<p>
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
</p>
<p>
At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data.
<br>
The rise of lambda architecture is correlated with the growth of big data and real-time analytics.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00">
<img class="centered" style="width: 800px; " alt="Lambda Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f36e9948-0053-4f07-a0ff-a1ebca0b5762" />
</a>
</div>
<br>
<p>
As new data is introduced to the system, it is processed simultaneously by both the batch layer and the speed layer. The batch layer is an append-only repository containing unprocessed raw data. The batch layer periodically or continuously runs jobs that create views of the batch data: aggregations or representations of the most up-to-date versions. These batch views are sent to the serving layer, where they are available for analytic queries.
<br>
At the same time that data is being appended to the batch layer, it is simultaneously streaming into the speed layer. The speed layer is designed to allow queries to reflect the most up-to-date information, necessary because the serving layer's views can only be created by relatively long-running batch jobs. The speed layer computes only the data needed to bring the serving layer's views to real time, for instance calculating totals for the past few minutes that are missing in the serving layer's view.
<br>
By merging data from the speed and serving layers, low-latency queries can include data that is based on computationally expensive batch processing and yet include real-time data.
In the Lambda Architecture, the raw source data is always available, so redefinition and re-computation of the batch and speed views can be performed on demand. The batch layer provides a big data repository for machine learning and advanced analytics, while the speed and serving layers provide a platform for real-time analytics.
<br>
The Lambda Architecture provides a useful pattern for combining multiple big data technologies to achieve multiple enterprise objectives.
</p>
<a name="sec22"></a>
<h3>2.2 Lambda Architecture with Kafka, ElasticSearch and Spark (Streaming)</h3>
<p>
Lambda defines a big data architecture that allows pre-defined and arbitrary queries and computations on both fast-moving data and historical data.
<br>
Using Kafka, ElasticSearch, Spark and Spark Streaming, it is achieved using the following layout:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a63d7668-8d45-47ec-a2f5-57eb64eb6b00">
<img class="centered" style="width: 800px; " alt="Lambda Architecture - Software Components" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/45ee2683-eb9f-4698-aea4-de5c18620c40" />
</a>
</div>
<br>
<p>
Lambda Architecture enables us to score transactions or other business events in real-time and still consider the most recent events as well as the whole transaction history in its scoring model.
</p>
<p>
By using Kafka at the beginning of the pipeline to accept inputs, it can be guaranteed that messages will be delivered as long as they enter the system, regardless of hardware or network failure.
</p>
<p>
The batch layer is largely built on the Apache Spark / Mesos couple, with ElasticSearch as the large-scale storage component underneath. The reasons why we run on Spark, Mesos and ElasticSearch have been covered before in this document, but interestingly, these components appear to behave extremely well together when it comes to addressing batch processing concerns, thanks to Spark's ability to work largely in memory and to proper optimization of data co-locality on ElasticSearch and Spark nodes.
</p>
<p>
In the streaming layer, Kafka messages are consumed in real time using Spark Streaming. In terms of the core component supporting the speed layer, the usual choice is between Apache Storm and Apache Spark Streaming.
The main selection criterion between the two is whether one is interested in ultra-low latency (Apache Storm) or high throughput (Apache Spark Streaming). There are other factors, but these are some of the main drivers.
<br>
In my company, for our use cases, we can afford a somewhat higher latency as long as we stay under a second to score a business event (e.g. a financial transaction). On the other hand, we face situations where bursts of thousands of transactions to be scored per second are common. As such, high throughput is not optional for us, it's a key requirement: hence the rationale behind the usage of Apache Spark Streaming.
<br>
Here, in the speed layer, ElasticSearch is key to reducing the latency of the speed layer's integration concerns, since it is a real-time querying database in addition to a very powerful database engine.
</p>
<p>
The Serving Layer, consolidating the batch layer and speed layer partial results, is largely home made in our case and relies on ElasticSearch's ability to fetch both partial sets in real-time.
</p>
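<p>
In its very simplest form, the serving-layer principle can be sketched as follows: a batch view of per-customer aggregates, computed up to some cut-off time, is merged with a speed view covering only the events received since that cut-off. The data shapes are illustrative assumptions.
</p>
<pre>
# Serving-layer merge sketch (illustrative only).
# Batch view: aggregates computed by the (slow) batch layer up to a cut-off.
batch_view = {"alice": {"count": 1200, "total": 145000.0}}

# Speed view: aggregates over the events received since the cut-off.
speed_view = {"alice": {"count": 3, "total": 410.0}}

def serve(customer):
    """Merge both partial views into one up-to-date aggregate."""
    batch = batch_view.get(customer, {"count": 0, "total": 0.0})
    speed = speed_view.get(customer, {"count": 0, "total": 0.0})
    return {"count": batch["count"] + speed["count"],
            "total": batch["total"] + speed["total"]}

print(serve("alice"))  # {'count': 1203, 'total': 145410.0}
</pre>
<p>
This merge is trivial here because the aggregates are additive; non-additive metrics require keeping enough intermediate state in both views to consolidate them correctly.
</p>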
<a name="sec23"></a>
<h3>2.3 Drawbacks and difficulties of Lambda Architecture</h3>
<p>
There is a natural tendency to duplicate logic between batch layer and speed layer which needs to be addressed through strict design and re-usable logic. Using Spark in batch mode on the batch layer and Spark Streaming on the speed layer in our case really helps us reuse business logic as much as possible between both worlds.
</p>
<p>
In addition, there is the operational complexity of all the systems involved in implementing the Lambda architecture. The implementation of a Lambda architecture is thus inherently difficult.
</p>
<a name="sec3"></a>
<h2>3. Real-time computation with Lambda Architecture</h2>
<p>
The demand for real-time analytics has led to demand for workflows that can effectively balance latency, throughput, scaling and fault tolerance.
<br>
In order to accommodate the demand for real-time analytics, we need to design a system that can provide balance between the concept of "single version of truth" and "real-time analytics". Lambda Architecture is one such method.
</p>
<p>
In my company, some of our analytics use cases require considering very extended contextual information about trade and transaction activities, for instance to build user and customer profiles or analyze their past behaviours.
<br>
Building such contextual information typically requires analyzing, over and over again, billions of business events and petabytes of data.
<br>
Rebuilding these profiles or re-creating the aggregated statistical metrics would require several dozens of minutes, even on a large cluster, in a typical batch processing approach.
<br>
Fortunately, all this information can be built incrementally, and as such we can benefit from the Lambda architecture to rebuild the historical part while the latest data is taken into account by the speed layer to provide an up-to-date (as close to real-time as possible) view of reality. The serving layer consolidates both results to provide always up-to-date and accurate views of these profiles or other aggregated statistical metrics.
<br>
These real-time metrics are thus made available to our real-time scoring and classification systems.
</p>
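<p>
A hedged sketch of what this incremental building means in practice: instead of recomputing a profile from billions of events, the running statistics are updated in place as each new event arrives, here with Welford's online algorithm for mean and variance (the profile content itself is an illustrative assumption).
</p>
<pre>
# Incremental (online) profile update sketch, using Welford's algorithm.
class RunningStats:
    """Mean and variance updated one event at a time, in O(1) per event."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for amount in [120.0, 95.0, 140.0, 130.0]:
    stats.update(amount)  # no need to replay the whole history
print(stats.mean, stats.variance)
</pre>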
<p>
The same technologies and approaches deployed in the speed layer to provide up-to-date views of the reality are used to score and classify business events, e.g. financial transactions in real-time.
<br>
Here as well, we have no requirement for strong real-time with millisecond-order latency. As long as we can provide a risk score or a classification for an event in under a second, this is sufficient for our use cases.
<br>
On the other hand, it often happens that we have to process bursts of events of several hundred entries per second. As such, a system offering acceptable latency but very high throughput, such as Apache Spark Streaming, is a key component of our processing platform.
</p>
<p>
In addition, within the NG|Screener UI we provide our customers with a full-blown data discovery application (forensic application). Lambda Architecture is key in enabling us to provide our users with real-time updates and a second-close, up-to-date view of reality.
</p>
<a name="sec4"></a>
<h2>4. Conclusion</h2>
<p>
Deploying Lambda architecture on our use cases has proven to be the simplest way to reach our objectives:
</p>
<ul>
<li>An up-to-date and second-close view of reality in terms of contextual information, user / customer profiles and other key periodic statistical metrics</li>
<li>Classification and scoring of business events with an under-a-second latency and a very high throughput</li>
<li>Resilience and fault tolerance of our business processes on large clusters, both on technical failures and human failures</li>
<li>Simplicity and maintenance, especially in our approach, since we can share significant portions of code between the batch layer and the speed layer, both being built on Apache Spark</li>
<li>Resolution of operational complexity of big computation on historical data by dividing the work to do in an incremental fashion.</li>
</ul>
<p>
Now of course, Lambda Architecture being the simplest way for us to reach our mission-critical objectives doesn't make it simple per se, on the contrary. Lambda Architecture is inherently difficult to deploy and maintain and requires sound design and implementation.
</p>
<p>
At NetGuardians, we could benefit from our mastery of cutting-edge technologies as well as our in-depth experience of batch computing systems and real-time computing systems to make it an advantage of our approach.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/artificial-intelligence-for-fraud-prevention
Artificial Intelligence for Banking Fraud Prevention
Jerome Kehrli
2018-04-30T08:57:44-04:00
2018-07-04T15:30:11-04:00
<p>
In this article, I intend to present my company's - NetGuardians - approach when it comes to deploying Artificial Intelligence techniques towards better fraud detection and prevention.
<br>
This article is inspired by various presentations I gave on the topic on various occasions. It synthesizes our experience of how these technologies initially triggered a lot of skepticism and condescension, and how it turns out that they are now truly mandatory to efficiently prevent fraud in financial institutions, due to the rise of fraud costs, the maturity of cybercriminals and the complexity of attacks.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/123503c0-bec6-43d3-95e5-5ba6eea322f6">
<img class="centered" style="width: 250px;" alt="Artificial Intelligence for Fraud Prevention" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/123503c0-bec6-43d3-95e5-5ba6eea322f6" />
</a>
</div>
<br>
<p>
Here financial fraud is considered at the broad scale, both internal fraud, when employees divert funds from their employer and external fraud in all its forms, from sophisticated network penetration schemes to credit card theft.
<br>
I don't have the pretension to present an absolute or global overview. Instead, I want to present things from the perspective of NetGuardians, from our own experience of the problems encountered by our customers and <b>how Artificial Intelligence helped us solve these problems</b>.
</p>
<p>
This article is available as a slideshare presentation here
<a href="https://www.slideshare.net/JrmeKehrli/artificial-intelligence-for-banking-fraud-prevention-95475760">https://www.slideshare.net/JrmeKehrli/artificial-intelligence-for-banking-fraud-prevention-95475760</a>
</p>
<p>
A video of the speech is available on youtube:
</p>
<!-- 1. The <iframe> (and video player) will replace this <div> tag. -->
<div id="player_aiii" style="max-width: 100%;"></div>
<script>
player = new YT.Player('player_aiii', {
height: '360',
width: '640',
videoId: 'QZbgBvZgFkA'
});
</script>
<h2>1. Early times, the 2000s</h2>
<p>
Before 2000, banking institutions are only poorly equipped when it comes to fighting financial fraud.
</p>
<p>
For most of it, detecting fraud cases relies on manual verifications and tests performed by:
</p>
<ul>
<li>Internal Control</li>
<li>Internal Audit or</li>
<li>External Audits</li>
</ul>
<p>
And unfortunately, this implies a lot of issues:
</p>
<ul>
<li>
By working with samples only, Internal Control and Audit let a lot of fraud cases slip through the cracks; these are found only very late, or never.
</li>
<li>
Analyses are cumbersome and, most often, finding fraud cases is not the first and foremost objective of the auditors.
</li>
</ul>
<p>
Now of course, the most essential security rules and checks are implemented within the Operational Information System or in the form of procedures to be respected and audited.
<br>
Also, some banking institutions already have an Analytics System - or Business Intelligence - at the time, and some ad hoc reports targeting fraud detection are implemented on top of it.
</p>
<p>
In these early times, neither the subprime crisis nor the southern European countries' debt crisis has happened yet. Margins are important, people trust banks and, all in all, bankers are happy people.
<br>
Fraud cases, mostly internal, exist of course, but financial institutions feel rather safe.
</p>
<h2>2. The late 2000s - fraud costs rise</h2>
<p>
In the second half of the 2000s, however, the costs linked to fraud - increasingly external - as well as the complexity of attacks and the maturity of attackers rise.
<br>
Banking institutions react by deploying quite massively and for the first time specific analytics systems aimed at detecting banking fraud, both external and internal.
</p>
<p>
At this time, these systems are rule engines that work by checking or searching for pre-defined, well-determined conditions within the data extracted from the information system.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3094a30f-a2f6-4171-acc9-3b658993edf6">
<img class="centered" style="width: 250px;" alt="Rules Engines" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3094a30f-a2f6-4171-acc9-3b658993edf6" />
</a>
</div>
<br>
<p>
In a way, these systems can be considered simple extensions of the security checks and rules implemented directly within the operational information system. These solutions come most of the time from the AML - Anti-Money Laundering - world, their editors having understood that banking fraud was an interesting opportunity to extend their sales.
</p>
<p>
A very simple rule example would be as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f8d2142b-5976-4a17-a905-c1d5b89089b7">
<img class="centered" style="width: 450px;" alt="Rule Example" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f8d2142b-5976-4a17-a905-c1d5b89089b7" />
</a>
</div>
<br>
<p>
At this time, a first set of papers has already been published on the success - still somewhat relative in these early days - of some Machine Learning approaches applied to banking fraud detection.
<br>
But Machine Learning and Artificial Intelligence are considered with a lot of condescension and skepticism.
<br>
Bankers and their engineers are not willing to consider an approach whose interpretation of results is deemed fuzzy.
</p>
<p>
NetGuardians was founded in those times, and the NetGuardians platform could then be seen as a gigantic rule engine.
</p>
<h2>3. The reality of fraud changes dramatically</h2>
<p>
Unfortunately, the reality of fraud and financial cybercrime evolved fast and dramatically.
</p>
<p>
Let me give you two examples.
</p>
<h3>3.1 The Bangladesh Bank heist</h3>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/efa6f623-52fb-4999-bea1-263235c588ac">
<img class="centered" style="width: 500px;" alt="The Bangladesh Bank heist - initial" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/efa6f623-52fb-4999-bea1-263235c588ac" />
</a><br>
<div class="centered">
(Source : <a href="https://www.bankinfosecurity.com/bangladeshi-bank-hackers-steal-100m-a-8958">https://www.bankinfosecurity.com/bangladeshi-bank-hackers-steal-100m-a-8958</a>)
</div>
</div>
<br>
<p>
In February 2016, a group that we estimate at around 20 persons, composed of financial experts, software engineers and hackers, attacked the information system of the Bangladesh Central Bank.
<br>
They managed to compromise the bank's internal gateway to the SWIFT network. The SWIFT network is the international banking messaging network used by banks to communicate and transfer money through electronic wire. The pirates used the SWIFT network to withdraw money from the Bangladesh Central Bank's VOSTRO account at the US Federal Reserve.
<br>
They managed to transfer 81 million USD to the Philippines and used Filipino casinos to launder the stolen funds.
</p>
<p>
As a sidenote, the fact that they stole "only" 81 million USD is amazing luck for the bank, or rather amazing bad luck for the cybercriminals.
<br>
A rule-based Anti-Money-Laundering system deployed at the US Federal Reserve blocked the 6th transaction because the beneficiary name contained the word "Jupiter". Jupiter was on a sanction screening list in the US because a cargo ship navigating under the Iranian flag was named "Jupiter"-something. The 6th transaction being blocked, all the further ones, a little less than thirty, were blocked as well.
<br>
But 5 transactions passed through before the 6th was blocked by the Fed and went further through the correspondent banking network.
<br>
Another transaction was blocked by Deutsche Bank, a routing bank, because of a typo: "Shalika Fandation" instead of "Shalika Foundation".
<br>
So only 4 transactions out of 35 successfully arrived in the Philippines, and the total loss was thus reduced from the 951 million USD initially intended to "only" 81 million USD.
</p>
<p>
As a fun note, a few weeks after the heist, the officials of all the financial institutions involved - the US Federal Reserve, the Bangladesh Central Bank, even the finance minister of the Philippines - were all convinced that the money, or at least a significant part of it, would be recovered and that the cybercriminals would be caught.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/fd3ece3c-1f56-4f0a-8e77-39984a17b3ff">
<img class="centered" style="width: 500px;" alt="The Bangladesh Bank heist - 2 years after" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/fd3ece3c-1f56-4f0a-8e77-39984a17b3ff" />
</a>
</div>
<br>
<p>
Two years later, today, we know that these funds will never be recovered.
<br>
The attackers are safe, untraceable and will never be found. We believe this was a group of about 20 persons who worked on the preparation of the heist for about 18 months. 81 million USD is a pretty sum.
</p>
<p>
Now you think ... But this is Bangladesh ... right ?
<br>
Here we are in Europe ... Even better, here we are in Switzerland ... right ?
<br>
And in Switzerland we don't really feel concerned by the numerous security holes in the Bangladesh Central Bank's information system.
So let me give you another example...
</p>
<h3>3.2 The Retefe Worm</h3>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4d566e47-ae5f-4cab-b540-801a0d685742">
<img class="centered" style="width: 600px;" alt="The Retefe Saga" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4d566e47-ae5f-4cab-b540-801a0d685742" />
</a>
</div>
<br>
<p>
Quoting <a href="https://www.govcert.admin.ch/blog/33/the-retefe-saga">https://www.govcert.admin.ch/blog/33/the-retefe-saga</a>:
</p>
<div class="centering">
<div class="centered">
<p>
"This threat actor has already been around for more than four years...
<br>
Their goal remains the same: committing e-banking fraud in Switzerland and Austria.
</p>
<p>
In August 2017, Retefe still redirects between 10 and 90 e-banking sessions every day."
</p>
</div>
</div>
<p>
The Retefe worm is a worm developed by a team of cybercriminals specifically targeting the ebanking platforms of small and mid-size Austrian and Swiss banking institutions.
<br>
The worm is used by the thieves to take control of the victim's ebanking sessions and to submit fraudulent transactions to the system.
</p>
<p>
This worm is 4 years old.
<br>
For 4 years, fraudsters have kept updating it, modifying it and extending it to counter anti-virus software and the specific protections put in place by the banks.
<br>
This worm is 4 years old and nevertheless, as pointed out by the Computer Security Section of the Federal Finance Department, it still makes between 10 and 90 victims every day in Switzerland and Austria.
</p>
<p>
Today, in the Swiss banks ...
</p>
<p>
My conclusion from these examples is as follows:
<br>
Today, fraudsters and cybercriminals are professionals. The time when fraud mostly came from little hackers working in their garage, or back-office employees disappointed by their bonus, is over. Today, attackers are professionals who have industrialized their methods.
</p>
<h2>4. Facts and Projections</h2>
<p>
Some facts and projections to understand the reality banking institutions are facing nowadays...
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/0c3a4d41-4d11-4c18-b031-0f3f9d5580d8">
<img class="centered" style="width: 700px;" alt="Facts and projections" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/0c3a4d41-4d11-4c18-b031-0f3f9d5580d8" />
</a>
</div>
<br>
<p>
In February 2016, a group of cybercriminals managed to steal 81 million USD from the VOSTRO account of the Bangladesh Central Bank at the US Federal Reserve.
<br>
This is one of the biggest bank heists in history and the most impressive cybercrime ever.
</p>
<p>
In a report called "<i>Report to the Nations</i>", the Association of Certified Fraud Examiners estimated that in 2017, the total cost of fraud was 3,000 billion USD.
<br>
In banking fraud, a big part of this amount is related to internal fraud, when bank employees divert funds from their employer.
<br>
In Switzerland, of course, thanks to the maturity of the banking business as well as the security checks and practices put in place in banking institutions, internal fraud is marginal compared to external fraud. But external fraud is a cruel reality: think of the Retefe worm.
</p>
<p>
Finally, Cybersecurity Ventures estimates that by 2021 the total cost of cybercrime will reach 6,000 billion USD.
</p>
<p>
This is the reality banks are confronted with nowadays.
</p>
<h2>5. Historical systems are beaten</h2>
<p>
The principal implication of this reality - the problem banking institutions are confronted with nowadays - is that the historical systems deployed to counter fraud, rule engines, are beaten.
</p>
<p>
Let's assume that a banking institution wants to define a set of rules aimed at detecting when an attacker diverts a customer account to issue fraudulent transactions.
</p>
<ul>
<li>
Imagine the situation of a first customer, someone such as myself, using his ebanking account to pay his loan at the end of the month, his mortgage, his taxes, telephone bills, etc.
<br>
In my case, a big transaction withdrawing 20 kCHF from my account for a beneficiary located in Nigeria should raise an alert. It is clearly an anomaly, completely outside of my usual habits and behaviour.
</li>
<li>
Imagine now the situation of another customer, a head of acquisitions for a big corporation, a frequent traveler, spending most of his time abroad and using the corporate account to pay big amounts to providers all over the world.
<br>
In the case of this second customer, it is on the contrary a small payment to a counterparty in Switzerland that would be the anomaly and should raise an alert.
</li>
</ul>
<p>
If one wants to detect anomalies in these two different situations, one ends up implementing a completely different set of rules for each of the two customers.
<br>
And this is impossible.
</p>
<p>
Every bank customer, and even every user up to a certain level, is different.
<br>
Representing everyone's own private situation with rules would require implementing and managing hundreds of thousands of rules on the system, which, obviously, is impossible.
<br>
Only the most common set of rules can be implemented, which means that:
</p>
<ol>
<li>A lot of frauds pass through the cracks.</li>
<li>In addition, in order to catch as many frauds as possible, the limits enforced by the rules have to be set very low, which has the consequence of flagging a lot of cases to be analyzed - the so-called false positives - requiring an army of analysts to review and discard them.
</li>
</ol>
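<p>
A deliberately naive sketch of such a one-size-fits-all rule makes the limitation tangible (the threshold and transaction shape are illustrative assumptions):
</p>
<pre>
# A one-size-fits-all rule (illustrative only).
THRESHOLD_CHF = 10000  # set low enough to protect the simple saver

def rule_flags(amount_chf, beneficiary_country):
    return amount_chf > THRESHOLD_CHF or beneficiary_country != "CH"

# The simple saver: a 20 kCHF payment to Nigeria is correctly flagged.
print(rule_flags(20000, "NG"))  # True

# The corporate account: every routine supplier payment abroad is
# flagged too, drowning the analysts in false positives.
print(rule_flags(50000, "DE"))  # True (false positive)
</pre>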
<p>
The direct consequences for our customers are as follows:
</p>
<ul>
<li>
Financial impacts: frauds must be reimbursed. And in addition these analysts spending their days discarding <i>false positives</i> must be paid.
</li>
<li>
Reputation impacts: a fraud case being communicated in the newspapers is a nightmare for banking institutions. Even without large-scale communication, customers impacted by fraud will lose faith in their bank.
<br>
Then I do not need to explain the consequences that the thousands of papers published on the Bangladesh Bank heist had on the Bangladesh central bank.
</li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/eef79897-85b6-453d-825f-e6a3b40983aa">
<img class="centered" style="width: 600px;" alt="Consequence : rules engines are beaten" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/eef79897-85b6-453d-825f-e6a3b40983aa" />
</a>
</div>
<br>
<p>
Rule-based systems are beaten today.
<br>
Something else is required to efficiently protect banking institutions from banking fraud.
</p>
<h2>6. Artificial Intelligence comes in help</h2>
<p>
Artificial Intelligence provides the solution to this problem.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/39bccb86-968e-4c78-af5e-d1016ede0ef1">
<img class="centered" style="width: 300px;" alt="AI helps" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/39bccb86-968e-4c78-af5e-d1016ede0ef1" />
</a>
</div>
<br>
<p>
In 2016, we started at NetGuardians to integrate the first advanced algorithms, so-called Machine Learning algorithms, into our systems.
</p>
<p>
We let an Artificial Intelligence continuously analyze the history of billions of transactions in the system and learn about individuals' habits and behaviours.
<br>
With big data technologies, the AI can analyze a very extended depth of history and build, for each and every individual, a dynamic profile capturing his transactional behaviour.
<br>
Individuals can be both Customers and Users (internal employees):
</p>
<ul>
<li>Profiling customers is required for both Internal and External Fraud.</li>
<li>Profiling users is required for Internal Fraud. </li>
</ul>
<p>
Big Data technologies are key to keeping these profiles up-to-date in real time by tracking each and every interaction between the user and the bank's systems.
<br>
In addition to a financial transaction's direct characteristics, such as the beneficiary, the target bank country, the amount of the transaction, its currency, etc., the machine can correlate a lot of indirect characteristics, such as where in the world the ATM the user withdrew money from was located, from where he connected to his ebanking session, etc.
</p>
<p>
For each and every individual, a dynamic and up-to-date profile captures his behaviour and his habits.
<br>
Then, each and every financial transaction, regardless of its type - be it a securities trade order, an ATM withdrawal or an ebanking payment - is compared against the user profile and a risk score is computed.
<br>
Based on this risk score, the machine eventually decides whether the transaction is genuine or not and whether it requires further investigation by a human analyst within the bank.
</p>
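<p>
Sketched as a decision policy, this could look as follows (the thresholds are illustrative assumptions; the real decision logic is considerably richer):
</p>
<pre>
# Mapping a risk score to a decision (illustrative thresholds).
def decide(risk_score):
    if risk_score >= 0.9:
        return "block"        # hold the transaction before funds leave
    if risk_score >= 0.6:
        return "investigate"  # route the case to a human analyst
    return "pass"             # genuine: execute normally

for score in (0.97, 0.72, 0.10):
    print(score, "->", decide(score))
</pre>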
<p>
The gains of this new approach, based on customer profiling done by AI, are striking for our customers.
<br>
It has been a game-changing shift of paradigm.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/11b20bad-1ee4-4606-b564-66dc59e27852">
<img class="centered" style="width: 600px;" alt="Consequences and benefits of AI" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/11b20bad-1ee4-4606-b564-66dc59e27852" />
</a>
</div>
<br>
<p>
In the banking institutions where we can deploy this new-generation approach, we almost eliminate the fraud cases passing through the cracks.
<br>
And that, while reducing to a third of what it was before the number of cases flagged by the system to be reviewed by an analyst or fraud investigator (most of them being the so-called false positives).
<br>
Not only the number of cases, but also the time required to investigate a case could be reduced, by 80%, by having the machine present the profile of the customer, and how the individual transaction deviates from it, with relevant and meaningful visualization techniques.
<br>
Finally, the number of re-confirmations asked of customers could be reduced to a quarter.
</p>
<p>
Reducing the time required to investigate a case, in addition to the number of cases to be investigated, has a direct financial impact: analysts spend less time investigating such cases and can focus on tasks with more added value. Drastically reducing the fraud cases passing through also has obvious financial impacts.
<br>
And all of this, especially reducing the number of times a re-confirmation is asked of customers, has positive impacts on reputation.
</p>
<p>
Now, working on a per-customer basis is sometimes still sub-optimal. Sometimes a genuine transaction is simply very unusual on a per-customer basis, and it is necessary to broaden the view of the Artificial Intelligence.
<br>
Let me give you an example.
<br>
Let's imagine that tomorrow I buy a new Audi. That would be a transaction of 60 kCHF leaving my account for a beneficiary - Amag Audi Switzerland - that I have never paid before. Such a transaction, with a new beneficiary and a huge amount, is completely outside of my profile.
<br>
Based on this, the AI will decide to block the transaction, requiring a further validation from my end, which will annoy me.
<br>
So how can we avoid that ?
<br>
If we look more carefully and globally at transactions of this kind, big amounts benefiting Amag Audi Switzerland are quite usual among the customers with the same profile as myself.
</p>
<p>
The machine needs a broader view to understand that this transaction is not unusual.
</p>
<h2>7. The Machine can do better</h2>
<p>
The machine can look at the big picture and analyze transactions at a broader scale.
<br>
Recall the Audi example. When such a transaction is very unusual for a specific customer, looking at other customers with similar conditions, habits and behaviour is required.
</p>
<p>
And here again AI comes in help.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/123503c0-bec6-43d3-95e5-5ba6eea322f6">
<img class="centered" style="width: 250px;" alt="Artificial Intelligence for Fraud Prevention" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/123503c0-bec6-43d3-95e5-5ba6eea322f6" />
</a>
</div>
<br>
<p>
AI can analyze the behaviours and habits of customers and group together the people with the same patterns. People of the same age, the same wealth level, the same origins, living in the same region, etc. will have a strong tendency to behave the same way: for instance, drive the same kind of car, such as an Audi, live in apartments of the same size, pay the same amount of telephone bills at the end of the month, etc.
<br>
The machine can analyze customer activities and transactions on a very large scale and <i>cluster</i> together customers with similar behaviour.
<br>
Then, these groups can be profiled just like individuals.
<br>
And finally, a transaction can be scored against the <b>customer group profile</b> in addition to the <b>customer profile</b>.
</p>
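<p>
A sketch of this two-level scoring: the final score can, for instance, take the minimum of the individual score and the peer-group score, so that a payment unusual for me but usual in my peer group (the new Audi) is not blocked. The combination rule is an illustrative assumption.
</p>
<pre>
# Two-level scoring sketch (illustrative only).
def combined_score(customer_score, group_score):
    # A transaction is suspicious only if it is unusual for the customer
    # AND unusual within his peer group.
    return min(customer_score, group_score)

# The Audi example: very unusual for me (0.95), usual in my group (0.10).
print(combined_score(0.95, 0.10))  # 0.10 -> genuine, not blocked

# Unusual for the customer AND for his whole peer group: flagged.
print(combined_score(0.92, 0.88))  # 0.88 -> suspicious
</pre>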
<p>
Recalling the Audi example: when scoring this specific payment against the individual profile, the transaction will be flagged as suspicious.
<br>
Scoring it against the group profile will clearly indicate that it's a genuine transaction. People buy new cars every day, especially in Switzerland.
</p>
<p>
With this new approach, looking at the broader scale and comparing customers with each other instead of only scoring transactions in the individual context of a customer, we could improve our fraud detection system further.
</p>
<p>
The number of cases to be analyzed (false positives) could be reduced further.
<br>
In addition, the groups and their profiles happen to be an invaluable source of information for other use cases and concerns within the bank, such as marketing, trend analysis, etc.
<br>
Of course, reducing the number of cases to be handled by the investigation team has a direct impact on operational efficiency and induces further financial gains.
</p>
<p>
Now all of this - transaction scoring and customer clustering - works amazingly well, but it works after the fact. Once the transaction has been input in the system, if we are not fast enough, depending on how we integrate within the bank's information system, we can be too late, doing only fraud detection and not fraud prevention.
<br>
Our idea from here was:
</p>
<ul>
<li>
What if we could analyze the user or customer activities even before the transaction is input on the system and detect fraud before it happens ?
</li>
<li>
What if we could interpret weak signals coming from the analysis of how the Customer interacts with the banking information system to qualify him as legitimate or potentially fraudulent ?
</li>
</ul>
<p>
All of this requires completely different analysis techniques.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/10b2587b-3006-451a-9237-fb614aafdc96">
<img class="centered" style="width: 600px;" alt="Consequences and benefits of AI - further improvements" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/10b2587b-3006-451a-9237-fb614aafdc96" />
</a>
</div>
<br>
<h2>8. Even further</h2>
<p>
Let me give you a simple example of what I mean by analyzing a customer's interaction with the banking Information system.
<br>
A customer's interactions with the ebanking application are the simplest example I can come up with.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/dec86cd1-4705-4b1c-bc41-0b86f1658adf">
<img class="centered" style="width: 600px;" alt="Different ebanking session" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/dec86cd1-4705-4b1c-bc41-0b86f1658adf" />
</a>
</div>
<br>
<p>
Imagine the situation of a genuine user of the ebanking platform whose behaviour when inputting his payments is always the same:
</p>
<ul>
<li>He logs in the ebanking platform.</li>
<li>He looks at his account balance.</li>
<li>He performs all his payments, from input to validation, many of them.</li>
<li>He checks his pending orders, making sure he missed none of them.</li>
<li>He logs out of the platform.</li>
</ul>
<p>
Now if a worm hijacks the ebanking session, the worm will do none of that:
</p>
<ul>
<li>The worm will likely go directly from login to payment input, validation, then logout.</li>
</ul>
<p>
Here I am only showing transitions, but one can also consider user think time, keyboard stroke speed, etc.
</p>
<p>
AI can analyze all the behaviour and activity trails a user or customer leaves on the banking information system and build a probabilistic model capturing this behaviour as a succession of interactions.
<br>
Then, when an individual action is performed, the machine can compute the likelihood that this action is performed by a legitimate user or by an attacker, based on the <i>path-to-action</i>.
<br>
And here as well, AI can build profiles of these activities and their likelihood both at individual level and group level through clustering techniques.
</p>
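<p>
As an illustration, one can sketch such a <i>path-to-action</i> model as a naive first-order Markov chain over session actions. This is a deliberately simplified sketch with made-up action names; real models also weigh think time, keystroke speed, etc.:
</p>
<pre><code>
from collections import defaultdict

# Historical sessions of a genuine user (sequences of actions)
history = [
    ["login", "balance", "payment", "pending_orders", "logout"],
    ["login", "balance", "payment", "payment", "pending_orders", "logout"],
    ["login", "balance", "pending_orders", "logout"],
]

# Estimate first-order transition probabilities from the history
counts = defaultdict(lambda: defaultdict(int))
for session in history:
    for a, b in zip(session, session[1:]):
        counts[a][b] += 1

def transition_prob(a, b, smoothing=0.01):
    total = sum(counts[a].values())
    return (counts[a][b] + smoothing) / (total + smoothing * 10)

def session_likelihood(session):
    """Product of the transition probabilities along the observed path."""
    p = 1.0
    for a, b in zip(session, session[1:]):
        p *= transition_prob(a, b)
    return p

# The genuine user's usual path scores high...
print(session_likelihood(["login", "balance", "payment", "pending_orders", "logout"]))
# ...while a worm jumping straight from login to payment scores very low.
print(session_likelihood(["login", "payment", "logout"]))
</code></pre>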
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/fa9fc0a1-f96e-4b7a-b105-62417ccb10b6">
<img class="centered" style="width: 250px;" alt="Artificial Intelligence for interactions monitoring" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/fa9fc0a1-f96e-4b7a-b105-62417ccb10b6" />
</a>
</div>
<br>
<p>
With this kind of analysis, looking at all the interactions of users or customers with the banking information systems, AI can examine every individual event and qualify these interactions as legitimate or suspicious, regardless of whether financial transactions are input on the system or not.
</p>
<p>
AI can detect a fraud, or the intention to commit a fraud, even before a transaction is input on the system, by analyzing the user or customer activity, in the form of his interactions with the Banking Information System, before the transaction is even input.
<br>
In addition, by analyzing the behaviour of the customer as a whole, AI can qualify his interaction session (ebanking, mobile banking, PSD2, etc.) as legitimate or suspicious and kill the session in case of doubt, thus protecting the information he sees and his privacy in addition to his assets.
<br>
Finally, all this understanding of the user or customer habits and behaviour can be used to design even more advanced transaction scoring models.
</p>
<p>
This ability to detect fraud cases before they happen leads to further improvements of the operational efficiency and operational security of the banking institution.
<br>
Protecting the customers' privacy in addition to their assets is important to protect the reputation of a financial institution. This is especially true for private banking institutions.
</p>
<p>
With «<i>AI vs AI</i>», I wanted to illustrate the current research topics we are working on today at NetGuardians to improve our algorithms further.
<br>
In a few words, we see today that cybercriminals are increasingly using advanced algorithms on their end to study banks' attack surfaces and discover means to attack the banks and their customers.
<br>
We are in a <i>"cat and mouse"</i> game where attackers attempt to counter the security systems put in place by banking institutions, which in turn deploy new forms of algorithms and intelligence to protect themselves further.
<br>
I am very much looking forward to telling you more on this matter in the near future...
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/34b2d625-d85d-4dc7-82f5-d9174ca254ee">
<img class="centered" style="width: 600px;" alt="Consequences and benefits of AI - final improvements" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/34b2d625-d85d-4dc7-82f5-d9174ca254ee" />
</a>
</div>
<br>
<h2>9. Conclusion</h2>
<p>
Our own experience and conclusion with AI technologies and their concrete application to our use cases is striking.
</p>
<p>
Introducing advanced algorithms, machine learning and advanced analytics techniques in our use cases has been key to help us improve the way we secure financial institutions and their customers.
<br>
We could:
</p>
<ul>
<li>
Reduce the fraud cases passing through and almost eliminate them.
</li>
<li>
Reduce the number of cases to be analyzed and make the detection system a lot more relevant.
</li>
<li>
Drastically reduce the amount of time required to investigate a case.
</li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/40057d11-9308-464e-b008-b3e3b9f332c5">
<img class="centered" style="width: 800px;" alt="Conclusion - Reminder on benefits" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/40057d11-9308-464e-b008-b3e3b9f332c5" />
</a>
</div>
<br>
<p>
Today, at our customers, Artificial Intelligence monitors every single interaction between individuals, both customers and employees, and the information system, to qualify their actions as legitimate or fraudulent, in addition to analyzing, with highly sophisticated models, the financial transactions input in the system.
<br>
Today our reality is as follows: <b>Artificial intelligence monitors human behavior on a large scale to secure banks and their customers.</b>
</p>
<p>
But Science Fiction advances much faster than reality. Regarding artificial intelligence, the collective imagination, fed by Musk and Hollywood, is way ahead of reality.
<br>
In the public collective imagination, artificial intelligence today generates quite a lot of fantasies.
</p>
<p>
So let's agree on something if you do not mind.
<br>
Let's call <b>weak artificial intelligence</b> a computer solution able to solve a problem in a strict context, to optimize a solution or a mathematical function, or to look for an answer to a question in a strict context; and let's call <b>strong artificial intelligence</b> an intelligence able to argue, contextualize, or show sensitivity or initiative.
<br>
While progress in weak artificial intelligence is today very fast and very impressive, we do not have the slightest trace of a proof that would allow us to believe in the emergence, one day, of a strong artificial intelligence.
Strong artificial intelligence is science fiction.
</p>
<p>
The problem is that approach names like <i>Neural Network</i> generate a lot of fantasy in the public imagination, which takes the name literally.
<br>
With neural networks, the public imagines a digital brain, whereas the reality is that of <i>"convolution matrices"</i>, intensive iterative calculations carried out on gigantic numerical matrices. On the other hand, powerful technologies with less evocative names, <i>genetic algorithms</i>, <i>random forests</i> or <i>gradient boosting</i>, raise fewer fantasies.
</p>
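<p>
To demystify the vocabulary a little, here is what the core operation of a convolutional network actually boils down to: plain arithmetic on matrices, with nothing brain-like about it. A toy 2D convolution in numpy:
</p>
<pre><code>
import numpy as np

def convolve2d(image, kernel):
    """Naive 2D convolution: slide the kernel over the image and
    accumulate element-wise products. Just loops and multiplications."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, -1.0]])   # a crude horizontal edge detector
print(convolve2d(image, edge_kernel))
</code></pre>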
<h2>10. Artificial Intelligence vs. Augmented Intelligence</h2>
<p>
Today, these Artificial Intelligence techniques give the most impressive results when they support humans, not when they supplant them.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/554bf15d-f4d0-4014-a9e2-71af725d6a92">
<img class="centered" style="width: 600px;" alt="Artificial Intelligence vs. Augmented Intelligence" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/554bf15d-f4d0-4014-a9e2-71af725d6a92" />
</a>
</div>
<br>
<p>
Chess is one of the first areas in which computers started to beat humans.
<br>
The examples of algorithms that manage to defeat the great masters of chess in a regular if not systematic way are legion.
<br>
But it is the so-called <i>"centaurs"</i>, most of the time amateur players helped by Artificial Intelligence, half-human, half-machine, who now win all the "freestyle" games.
</p>
<p>
I would like to mention a second example, with an experiment performed last year.
<br>
Melanoma specialists were asked to identify cancerous lesions based on photos of skin lesions.
<br>
These experts had a precision, a success rate, of the order of 95%.
<br>
An AI based on a <i>Neural Network</i> deployed towards the same objective reached a pretty impressive 93% accuracy, yet failed to beat the experts.
<br>
But a group of interns, really students rather than actual doctors, accompanied and helped by an artificial intelligence, reached 97% accuracy, beating both the Artificial Intelligence alone and the experts.
</p>
<p>
Today, the most impressive results of these technologies come from what is called <b>Augmented Intelligence</b>, when Artificial Intelligence intervenes in support of the human decision process and not to replace it.
<br>
And <i>Augmented intelligence</i> is exactly what we do at NetGuardians by providing bankers with the means to prevent fraud cases much more effectively.
</p>
<h2>11. AI Pillars at NetGuardians</h2>
<p>
The <b>key pillars</b> which enable us to deploy Artificial Intelligence technologies are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/be5c5e61-8817-412a-8a97-faa964751874">
<img class="centered" style="width: 600px;" alt="Ai Pillars at NetGuardians" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/be5c5e61-8817-412a-8a97-faa964751874" />
</a>
</div>
<br>
<p>
All of this is pretty straightforward to understand.
<br>
I would just insist on two key notions:
</p>
<ul>
<li>
The ability to run these analyses in real time. Being able to analyze the activity of bank customers and users in real time is at the root of the difference between preventing fraud and detecting fraud. It must be possible to work with very low processing times to characterize a transaction before it is executed.
</li>
<li>
The user experience. The deployed algorithms can be as intelligent as one can imagine; if one is not able to provide investigators and analysts with clear, concise and precise information, allowing them to understand the context of a transaction and the reasons why the system blocked it, none of it works. Users reject the solution. Providing analysts with extremely intuitive and visual means to understand the machine's decisions is essential.
</li>
</ul>
<p>
(This article is available as a slideshare presentation here
<a href="https://www.slideshare.net/JrmeKehrli/artificial-intelligence-for-banking-fraud-prevention-95475760">https://www.slideshare.net/JrmeKehrli/artificial-intelligence-for-banking-fraud-prevention-95475760</a>)
</p>
https://www.niceideas.ch/roller2/badtrash/entry/video-of-jerome-kehrli-presenting
Presenting NetGuardians' Big Data technology (video)
Jerome Kehrli
2018-01-05T13:00:00-05:00
2018-06-29T10:57:33-04:00
<p>
I am presenting in this video NetGuardians' Big Data approach, technologies and its advantages for the banking institutions willing to deploy big data technologies for Fraud Prevention.
</p>
<!-- 1. The <iframe> (and video player) will replace this <div> tag. -->
<div id="player" style="max-width: 100%;"></div>
<script>
// 2. This code loads the IFrame Player API code asynchronously.
var tag = document.createElement('script');
tag.src = "https://www.youtube.com/iframe_api";
var firstScriptTag = document.getElementsByTagName('script')[0];
firstScriptTag.parentNode.insertBefore(tag, firstScriptTag);
// 3. This function creates an <iframe> (and YouTube player)
// after the API code downloads.
var player;
function onYouTubeIframeAPIReady() {
player = new YT.Player('player', {
height: '360',
width: '640',
videoId: 'wxESC2PCwp4',
events: {
'onReady': onPlayerReady,
'onStateChange': onPlayerStateChange
}
});
}
// 4. The API will call this function when the video player is ready.
function onPlayerReady(event) {
//event.target.playVideo();
}
// 5. The API calls this function when the player's state changes.
//var done = false;
function onPlayerStateChange(event) {
/*
if (event.data == YT.PlayerState.PLAYING && !done) {
setTimeout(stopVideo, 6000);
done = true;
}
*/
}
/*
function stopVideo() {
player.stopVideo();
}
*/
</script>
<p>
The speech is reported in textual form hereafter.
</p>
<p>
It keeps puzzling me to see how deploying Big Data Technologies in banking institutions for fraud prevention and other use cases seems to be so difficult.
<br>
A large number of such projects have simply failed in the past.
<br>
By failure I mean projects that led to poor results, exceeded the budget significantly, or were even simply cancelled.
</p>
<p>
When looking at why these projects failed, it always boils down to the same two major issues.
</p>
<p>
The first major issue is that extracting the required data to build the analytics use cases is a challenge of its own.
Let's say the bank managed to extract the required data, which is only a technical problem; cleaning, enriching, normalizing and re-modeling it for banking fraud use cases is a whole new project.
</p>
<p>
The second major issue is that technological mastery alone is not sufficient for Big Data projects to succeed.
<br>
Implementing data analytics use cases requires a strong involvement from business experts.
<br>
It always amazes me to see how many projects had the illusion that putting a dozen gifted Data Scientists in a room for a few years would be sufficient. Without a clear business understanding, Data Scientists are blind and can go nowhere.
</p>
<p>
And then, even with a clear understanding of both these challenges, deploying big data technologies for fraud prevention is a 10-month to 2-year project.
At NetGuardians we typically deploy our technology at our new customers within a few weeks.
</p>
<p>
So how do we do that ?
</p>
<p>
First, we are using technology on the bleeding edge of the state of the art, not today's state of the art but tomorrow's, benefiting from the right data extraction approach and the right use cases.
</p>
<p>
In terms of technology, our NG|Screener platform is using key big data components underneath: ElasticSearch, Mesos and Spark.
</p>
<p>
Regarding the Data Ingestion System, we have developed at NetGuardians our own Data Collection Framework, which is simple, efficient and configurable.
<br>
Typical data extraction tools are either simple, or efficient, or configurable. Our framework is all of that together, without any compromise.
</p>
<p>
Then, working with numerous financial institutions worldwide over the years made us understand the indispensable role of not only technology but also business expertise when it comes to developing Big Data analytics use cases.
</p>
<p>
Business experts in banking institutions are hardly ever available, right ?
<br>
Not a problem for us, we have hired our own.
<br>
Today, we have our own business and risk experts with an impressive track record in risk and other banking business departments.
</p>
<p>
At NetGuardians, we have this multi-competency team that so many projects struggle to build, and together we have designed and implemented the right use cases to make Big Data deployment projects happen smoothly at our customers and bring them actual added value.
</p>
<p>
As a result, our customers are able to make sense of their available Big Data, save enormous amounts of time, and use Big Data technology to proactively prevent growing fraud challenges.
</p>
<p>
From a personal perspective, I am extremely proud of what we have built, both in terms of technology and approach, as well as of the privilege I have to work in a team with such brilliant minds and wonderful persons.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/the-agile-collection
The Agile Collection Book
Jerome Kehrli
2017-12-12T17:57:20-05:00
2017-12-13T06:50:10-05:00
<!-- The Agile Collection -->
<!-- agile-collection agile-methods agile agility -->
<p>
Agility in Software Development is a lot of things, a collection of so many different methods. <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-landscape">In a recent article</a> I presented the <i>Agile Landscape V3</i> from Christopher Webb which does a great job in listing these methods and underlying how much Agility is much more than some scrum practices on top of some XP principles.
<br>
I really like this infographic since I can recover <i>most-if-not-all</i> of the principles and practices from the methods I am following.
</p>
<p>
Recently I figured that I have written on this very blog quite a number of articles related to these very Agile Methods, and after so much writing I thought I should assemble these articles into a book.
<br>
So here it is, <a href="https://www.niceideas.ch/the_agile_methods_collection.pdf">The Agile Methods Collection</a> book.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/5ad812bb-d038-4f3a-924d-b13af76941ca">
<img class="centered" style="width: 550px;" alt="The Agile Methods Tree" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/5ad812bb-d038-4f3a-924d-b13af76941ca" />
</a>
</div>
<br>
<p>
<a href="https://www.niceideas.ch/the_agile_methods_collection.pdf">The Agile Methods Collection</a> book is simply a somewhat reformatted version of all the following articles:
</p>
<ul>
<li><a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-landscape">Agile Landscape from Deloitte</a></li>
<li><a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-software-development-lessons-learned">Agile Software Development, lessons learned</a></li>
<li><a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-planning-tools-and-processes">Agile Planning : tools and processes</a></li>
<li><a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained">DevOps explained</a></li>
<li><a href="https://www.niceideas.ch/roller2/badtrash/entry/lean-startup-a-focus-on">The Lean Startup - A focus on Practices</a></li>
<li><a href="https://www.niceideas.ch/roller2/badtrash/entry/periodic-table-of-agile-principles">Periodic Table of Agile Principles and Practices</a></li>
</ul>
<p>
So if you already read all these articles, don't download this book.
<br>
If you didn't so far or want to have a kind of reference on all the methods from the collection illustrated above, you might find this book useful.
<br>
I hope you'll have as much pleasure reading it as I had writing all these articles.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/deciphering-the-bengladesh-bank-heist
Deciphering the Bangladesh bank heist
Jerome Kehrli
2017-11-15T17:03:49-05:00
2020-09-03T10:15:31-04:00
<!-- Deciphering the Bengladesh bank heist -->
<p>
<b>The Bangladesh bank heist - or SWIFT attack - is one of the biggest bank robberies ever, and the most impressive cyber-crime in history</b>.
</p>
<p>
This is the story of a group of less than 20 cyber-criminals, composed of high-profile hackers, engineers, financial experts and banking experts, who gathered together to hack the worldwide financial system by attacking an account of the central bank of Bangladesh, a lower-middle-income nation and one of the world's most densely populated countries, and successfully steal around 81 million US dollars, after attempting to steal almost a billion US dollars.
</p>
<p>
In early February 2016, authorities of Bangladesh Bank were informed that about 81 million USD was illegally taken out of its account with the Federal Reserve Bank of New York using an inter-bank messaging system known as SWIFT.
The money was moved via SWIFT transfer requests, ending up in bank accounts in the Philippines and laundered in the Philippines' casinos during the Chinese New Year holidays.
</p>
<p>
Fortunately, the major part of the billion US dollars they intended to steal could be saved, but 81 million US dollars were successfully stolen and are gone for good.
</p>
<p>
The thieves have stolen this money without any gun, without physically breaking into the bank, without any form of physical violence. (There are victims though, there are always victims in such cases, but they haven't suffered any form of physical violence.)
<br>
These 81 million US dollars disappeared and haven't been recovered yet. The thieves are unknown, untroubled and safe.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ac851db2-485d-496e-ab31-d040612d9a2f">
<img class="centered" style="width: 600px;" alt="Bengladesh bank hacker" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ac851db2-485d-496e-ab31-d040612d9a2f" />
</a>
</div>
<br>
<p>
The Bangladesh bank heist consisted in hacking the Bangladesh central bank information system to issue fraudulent SWIFT orders to withdraw money from the banking institution. SWIFT is a trusted and closed network that banks use to communicate between themselves around the world. SWIFT is owned by the major banking institutions.
</p>
<p>
In terms of technological and technical mastery, business understanding, financial systems knowledge and timing, this heist was a perfect crime. The execution was brilliant, way beyond any Hollywood scenario. And the bank was actually pretty lucky that the hackers didn't successfully loot the billion US dollars they had planned, but instead only 81 million.
<br>
As such, from a purely engineering perspective, studying this case is very exciting. First, I cannot help but admire the skills of the team of thieves as well as the shape of the attack; and second, it's my job in my current company to design controls and systems preventing such attacks from happening against our customers in the future.
</p>
<p>
In this article, I intend to present, explain and decipher as many of the aspects of the Bangladesh bank heist as I know.
</p>
<p>
(This article is available as a slideshare presentation here
<a href="https://www.slideshare.net/JrmeKehrli/deciphering-the-bengladesh-bank-heist">https://www.slideshare.net/JrmeKehrli/deciphering-the-bengladesh-bank-heist</a>)
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a></li>
<li><a href="#sec2">2. The SWIFT network - Key Concepts</a>
<ul>
<li><a href="#sec21">2.1 Correspondant Banking</a></li>
<li><a href="#sec21">2.2 Transfering Money</a></li>
<li><a href="#sec23">2.3 Application Architecture (Bengladesh case)</a></li>
<li><a href="#sec24">2.4 Introducing key SWIFT messages</a></li>
</ul>
</li>
<li><a href="#sec3">3. The attack</a>
<ul>
<li><a href="#sec31">3.1 Behaviour of the malware</a></li>
<li><a href="#sec32">3.2 Complete overview of the attack</a></li>
<li><a href="#sec33">3.3 Timeline of the attack</a></li>
<li><a href="#sec34">3.4 Laundering of the money</a></li>
</ul>
</li>
<li><a href="#sec4">4. Aftermath</a></li>
<li><a href="#sec5">5. Conclusion</a></li>
</ul>
<br>
<a name="sec1"></a>
<h2>1. Introduction</h2>
<p>
The Bangladesh Bank hack is one of the biggest bank heists in global financial history. There have been larger scams and scandals, but among cyber heists from a single bank, this one takes the cake.
</p>
<p>
The heist of over 80 million US dollars sent shock-waves through the global financial system and security experts scrambled to find out how it had happened. Political and administrative authorities played the blame game, as was expected of them. Resignations were offered and statements were issued. It was complete chaos.
</p>
<p>
The hackers managed to break into the bank's security system and transferred more than 80 million USD from the New York Federal Reserve account to multiple bank accounts located in Sri Lanka and the Philippines.
<br>
A significant number of transfer requests, 30 out of 35, were blocked by the Federal Reserve, saving the bank a loss of an additional 850 million US dollars.
<br>
But the five requests that managed to pass through, amounting to more than 80 million US dollars, were devastating enough in their consequences.
</p>
<p>
Perhaps the most troubling aspect of the whole episode was that the hackers managed to hack into the SWIFT software. SWIFT lies at the heart of the global financial system; it is a network which connects the majority of the world's financial institutions and enables them to send and receive information about financial transactions.
</p>
<p>
However, it was the bank's own systems and controls that were compromised, not the SWIFT network connection software. The SWIFT software behaved as it was intended to, but was not operated by the intended person or process. This was really a bank problem, not a SWIFT problem.
</p>
<p>
In the next chapter, I will present the key concepts required to understand the attack before presenting the shape and timeline of the attack.
</p>
<a name="sec2"></a>
<h2>2. The SWIFT network - Key Concepts</h2>
<p>
Some key concepts about correspondent banking and the principles of the SWIFT network are required to grasp a basic understanding of the Bangladesh SWIFT attack. These are presented in this chapter.
</p>
<p>
SWIFT - Society for Worldwide Interbank Financial Telecommunication - is a Belgian company operating a trusted and closed network used for communication between banks around the world. It is overseen by a committee composed of the US Federal Reserve, the Bank of England, the European Central Bank, the Bank of Japan and other major banks.
<br>
SWIFT is used by around 11,000 institutions in more than 200 countries and supports around 25 million communications a day, most of them being money transfer transactions; the rest are various other types of messages.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a887fd53-655c-48a9-9397-6899371af97d">
<img class="centered" style="width: 150px;" alt="SWIFT logo" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a887fd53-655c-48a9-9397-6899371af97d" />
</a>
</div>
<br>
<p>
The majority of international inter-bank messages use the SWIFT network.
<br>
SWIFT does not facilitate funds transfer: rather, it sends payment orders, which must be settled by correspondent accounts that the institutions have with each other. For two financial institutions to exchange banking transactions, they must have a banking relationship beforehand.
</p>
<p>
A cool introduction to SWIFT is available on the Fin website : <a href="https://fin.plaid.com/articles/what-is-swift">https://fin.plaid.com/articles/what-is-swift</a>.
</p>
<a name="sec21"></a>
<h3>2.1 Correspondent Banking</h3>
<p>
<b>Correspondent Bank</b>
</p>
<p>
A correspondent bank is a financial institution that provides services on behalf of another financial institution.
<br>
It can facilitate wire transfers, conduct business transactions, accept deposits and gather documents on behalf of another financial institution.
<br>
Correspondent banks are most likely to be used by domestic banks to service transactions that either originate or are completed in foreign countries, acting as a domestic bank's agent abroad.
</p>
<p>
Generally speaking, the reasons domestic banks employ correspondent banks include:
</p>
<ul>
<li>limited access to foreign financial markets and the inability to service client accounts without opening branches abroad,</li>
<li>acting as intermediaries between banks in different countries or as agents to process local transactions for customers abroad,</li>
<li>accepting deposits, processing documentation and <b>serving as transfer agents for funds</b>.</li>
</ul>
<p>
The ability to execute these services relieves domestic banks of the need to establish a physical presence in foreign countries.
</p>
<p>
<b>NOSTRO / VOSTRO Account</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/45ed631c-105f-4292-b5bc-bb9b839affd4">
<img class="centered" style="width: 750px;" alt="NOSTRO VOSTRO accounts" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/45ed631c-105f-4292-b5bc-bb9b839affd4" />
</a>
</div>
<br>
<p>
The accounts held between correspondent banks and the banks to which they are providing services are referred to as <i>NOSTRO</i> and <i>VOSTRO</i> accounts (latin words for <i>ours</i> and <i>yours</i>).
<br>
An account held by one bank for another is referred to by the holding bank as a <b>VOSTRO</b> account.
<br>
The same account is referred to as a <b>NOSTRO</b> account by the owning bank (the customer). Generally speaking, both banks in a correspondent relationship hold accounts for one another for the purpose of tracking debits and credits between the parties.
</p>
<p>
NOSTRO and VOSTRO accounts are really the same thing seen from different perspectives. For example, Bank X has an account with Bank Y in Bank Y's home currency. To Bank X, that is a NOSTRO, meaning "our account on your books," while to Bank Y, it is a VOSTRO, meaning "your account on our books." These accounts are used to facilitate international transactions.
</p>
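<p>
A tiny sketch may help fix the idea; the class and names below are purely illustrative:
</p>
<pre><code>
# One physical account, two labels depending on whose books you look at.
class CorrespondentAccount:
    def __init__(self, owner, holder, currency):
        self.owner = owner      # the bank whose money it is (NOSTRO side)
        self.holder = holder    # the bank keeping the books (VOSTRO side)
        self.currency = currency
        self.balance = 0.0

    def label_for(self, bank):
        if bank == self.owner:
            return "NOSTRO"     # "our account on your books"
        if bank == self.holder:
            return "VOSTRO"     # "your account on our books"
        raise ValueError("not a party to this account")

acct = CorrespondentAccount(owner="Bangladesh Bank", holder="NY Fed", currency="USD")
print(acct.label_for("Bangladesh Bank"))  # NOSTRO
print(acct.label_for("NY Fed"))           # VOSTRO
</code></pre>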
<p>
<b>Transferring Money Using a Correspondent Bank</b>
</p>
<p>
International wire transfers often occur between banks that do not have an established financial relationship. When agreements are not in place between the bank sending a wire and the one receiving it, a correspondent bank must act as an intermediary. For example, a bank in Geneva that has received instructions to wire funds to a bank in Japan cannot wire funds directly without a working relationship with the receiving bank.
</p>
<p>
Most if not all international wire transfers are executed through SWIFT. Knowing there is not a working relationship with the destination bank, the originating bank can search the SWIFT network for a correspondent bank that has arrangements with both banks.
</p>
<p>
Interestingly, when a bank wants to send some funds to another bank on the other side of the world, it often happens that the sending bank has no banking relationship with any bank having itself a relationship with the target bank. In this case, the order needs to be transferred through several, sometimes many, distinct banking institutions through the SWIFT network.
<br>
These intermediate banks are called <b>routing banks</b>.
</p>
<a name="sec22"></a>
<h3>2.2 Transferring Money</h3>
<p>
In the scope of funds transfers, correspondent banking relationships often happen between commercial banks and central banks. This is especially useful when a bank has to process massive funds transfers in different currencies.
</p>
<p>
Imagine that a non-US commercial banking institution has to transfer a massive amount of US Dollars for one of its big customers to some other account in another financial institution abroad. It would be very inconvenient in this case to have to build up enough reserves of US dollars for this kind of transfers.
<br>
Instead of building such reserves, big commercial banks worldwide have the tendency to open a correspondent banking relationship with the US Federal Reserve in New York - called the Fed - and use their VOSTRO accounts at the Fed to process such big transfers.
<br>
Such VOSTRO accounts have typically no limits and do not necessarily need to be credited beforehand. The settlement can happen afterwards, on a regular basis.
</p>
<p>
Let's illustrate this situation with 2 imaginary customers and the Bangladesh central bank:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/22d1bc9c-05a4-4482-acf8-574726869c54">
<img class="centered" style="width: 610px;" alt="Correspondant bank money transfer" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/22d1bc9c-05a4-4482-acf8-574726869c54" />
</a>
</div>
<br>
<p>
In case the money transfer requested by a customer in a foreign currency (here USD for the Bangladesh bank) exceeds some limits or even the reserves of the Bangladesh bank in USD, the bank decides to go through the VOSTRO account at the correspondent central bank related to the foreign currency (here the US Fed for USD) and instructs it to proceed with the transfer.
<br>
Again, that VOSTRO account doesn't even necessarily need to be credited beforehand; the settlement can happen way later or even never.
</p>
<a name="sec23"></a>
<h3>2.3 Application Architecture (Bangladesh case)</h3>
<p>
SWIFT provides a centralized store-and-forward mechanism, with some transaction management. For bank A to send a message to bank B with a copy or authorization with institution C, it formats the message according to standard and securely sends it to SWIFT. SWIFT guarantees its secure and reliable delivery to B after the appropriate action by C. SWIFT guarantees are based primarily on high redundancy of hardware, software, and people.
</p>
<p>
<b>Principles</b>
</p>
<p>
SWIFT moved to its current IP network infrastructure, known as SWIFTNet, from 2001 to 2005, providing a total replacement of the previous X.25 infrastructure. During 2007 and 2008, the entire SWIFT Network migrated its infrastructure to a new protocol called SWIFTNet Phase 2.
<br>
<b>Today the SWIFT network can be seen as a highly secured private network over Internet</b>.
</p>
<p>
In order to have access to this network, a financial institution needs to obtain a SWIFT gateway running the SWIFT NetLink software. This is most of the time proprietary hardware running Linux and requiring a physical security dongle storing cryptographic keys to access the network.
</p>
<p>
SWIFT also provides a whole range of other software, such as SWIFT Alliance Access, that can be used by a financial institution to access the SWIFT network (always through the gateway) in a more convenient way and with higher-level or simpler APIs.
</p>
<p>
<b>SWIFT Network Security</b>
</p>
<p>
SWIFT's security stems from two major sources. First, it's a private network, and most banks set up their accounts such that only certain transactions between particular parties are permitted. The network privacy means that it should be hard for someone outside a bank to attack the network; but if a hacker breaks into a bank, as was the case here, then that protection evaporates.
<br>
The Bangladesh central bank has all the necessary SWIFT software and authorized access to the SWIFT network. Any hacker running code within the Bangladesh bank also has access to the software and network.
</p>
<p>
<b>The Bangladesh bank architecture</b>
</p>
<p>
The Bangladesh central bank, at the time of the heist, was handling SWIFT connectivity from the <b>Banking Information System</b> to the <b>SWIFT network</b> using the specific <b>SWIFT Alliance Access</b> software running on a bridge server.
<br>
Alliance Access, integrated the way it was at the Bangladesh Bank, was set up to <b>read/write SWIFT messages from/to files on the filesystem</b> and record transaction information in an Oracle database. In addition, confirmation and reconciliation messages were handled through a manual process after being sent to <b>a printer</b>.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/28997370-2fe2-4d74-be10-37be9e8bf8a4">
<img class="centered" style="width: 610px;" alt="SWIFT Application Architecture" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/28997370-2fe2-4d74-be10-37be9e8bf8a4" />
</a>
</div>
<br>
<p>
Again, the order reconciliation process in the Bangladesh central bank was a largely manual process.
<br>
In addition, passing through the filesystem to integrate the Banking Information System and the SWIFT network is a huge security weakness. We'll discuss this later.
</p>
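<p>
To give an idea of why this integration style is so fragile: any process able to write to that directory can alter a message before Alliance Access picks it up. A classic, if partial, mitigation is to attach an integrity signature to each file and verify it before ingestion. The sketch below merely illustrates the principle (the helpers and the message payload are made up for the example; the actual Alliance integration has its own mechanisms):
</p>
<pre><code>
import hmac, hashlib, pathlib

SECRET = b"shared-key-between-producer-and-consumer"  # illustrative only

def sign(message_bytes):
    return hmac.new(SECRET, message_bytes, hashlib.sha256).hexdigest()

def write_message(path, payload):
    """Producer side: write the SWIFT message plus a detached signature."""
    path.write_bytes(payload)
    path.with_suffix(".sig").write_text(sign(payload))

def read_message(path):
    """Consumer side: refuse any file whose signature does not verify,
    so a worm rewriting amounts on disk would be detected."""
    payload = path.read_bytes()
    expected = path.with_suffix(".sig").read_text()
    if not hmac.compare_digest(sign(payload), expected):
        raise ValueError("tampered message: " + str(path))
    return payload

msg = pathlib.Path("mt202_0001.txt")
write_message(msg, b":20:REF0001\n:32A:160204USD81000000,\n")
print(read_message(msg))
</code></pre>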
<a name="sec24"></a>
<h3>2.4 Introducing key SWIFT messages</h3>
<p>
As we will see in the next chapter, the hackers created a malware that generated and manipulated key SWIFT messages to withdraw the money from the Bangladesh Central Bank's VOSTRO account at the Fed.
</p>
<p>
SWIFT messages consist of five blocks of data, including three headers, the message content, and a trailer. Message types are crucial to identifying content.
<br>
All SWIFT messages include the literal "MT" (Message Type), followed by a three-digit number that denotes the message category, group and type.
</p>
<p>
The key SWIFT messages in question here were of the following types:
</p>
<ul>
<li><b>MT103</b> is used for cash transfers, specifically for cross-border/international wire transfers.</li>
<li><b>MT202</b> is the general Financial Institution transfer order, used to order the movement of funds to the beneficiary institution.</li>
<li><b>MT950</b> is the final statement report on all settlement operations on a specified account within the current business day. It can be seen as the confirmation for an MT202.</li>
</ul>
<p>
The workflow between these messages can be seen as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d1c08dbe-c7e9-453e-8fcb-4c49bd74f832">
<img class="centered" style="width: 850px;" alt="SWIFT Messages" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d1c08dbe-c7e9-453e-8fcb-4c49bd74f832" />
</a>
</div>
<br>
<p>
A lot of stuff goes through SWIFT; here we focus on a really small subset of the supported message types, those related to transferring money.
<br>
First, the MT103 is an information message, basically announcing that a target counterparty account will receive money from an internal (or correspondent) account to be debited.
<br>
Then the MT202 is the inter-bank transfer order; it applies globally to transfer money from one banking institution to another, covering all individual account-level transfer announcements relating to the same target institution. There is a relationship between MT103 and MT202: they cross-reference each other (field 21 in the MT103 references the MT202 and field 20 in the MT202 references all related MT103s).
<br>
Finally the MT950 is the extract that confirms all executed orders. It is the bible and references all positions confirmed and executed by the correspondent bank. It is often an end-of-day extract used by banking institutions for reconciliation.
</p>
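<p>
To give a feel for what the content of these messages looks like, here is an illustrative sketch extracting a few well-known fields (20, 21, 32A) from a simplified MT103-style text block; real messages carry additional header and trailer blocks and much stricter formats:
</p>
<pre><code>
import re

# A simplified MT103-style text block (block 4); real messages also carry
# header blocks {1:}, {2:}, {3:} and a trailer block {5:}.
mt103_text = """:20:TRXREF-0001
:21:RELATED-REF-0007
:32A:160204USD81000000,
:50K:ORDERING CUSTOMER
:59:/PH123456789
BENEFICIARY NAME
"""

def parse_fields(text):
    """Map field tags (e.g. '20', '32A') to their raw content.
    Continuation lines are ignored in this sketch."""
    fields = {}
    for match in re.finditer(r"^:(\w+):(.*)$", text, re.MULTILINE):
        fields[match.group(1)] = match.group(2)
    return fields

fields = parse_fields(mt103_text)
print(fields["20"])    # transaction reference, cross-referenced by the MT202
print(fields["32A"])   # value date + currency + amount: 160204 / USD / 81 million
</code></pre>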
<p>
Again, when it comes to SWIFT message processing from and to the Banking Information System in the Bangladesh case, the important aspects here are:
</p>
<ul>
<li>SWIFT messages originating from the Banking Information System simply needed to be put on the filesystem somewhere to be <i>"slurped"</i> by the SWIFT Alliance Access software, integrated the way it was at the Bangladesh Central Bank. In terms of security of the process, this is more than questionable.</li>
<li>Confirmation messages back from the SWIFT network were stored and printed. Reconciliation was a manual process based on these printed messages. This is quite unusual (we'll get back to that).</li>
</ul>
<a name="sec3"></a>
<h2>3. The Attack</h2>
<p>
<b>Summary of the attack</b>
</p>
<p>
Capitalizing on weaknesses in the security of the Bangladesh Central Bank, hackers attempted to steal around a billion US dollars from the Bangladesh central bank's VOSTRO account with the US Federal Reserve Bank between February 4 and 5 when Bangladesh Bank's offices were closed.
</p>
<p>
The perpetrators managed to compromise Bangladesh Bank's computer network, observed how transfers were done, and gained access to the bank's credentials for payment transfers.
<br>
They used these credentials to authorize about three dozen requests to the Federal Reserve Bank of New York to transfer funds from the Bangladesh Bank VOSTRO to accounts in Sri Lanka and the Philippines.
</p>
<p>
Thirty transactions worth 851 million USD were flagged by the banking system for staff review, but five requests were granted; 20 million USD to Sri Lanka (later recovered), and 81 million USD lost to the Philippines, entering the Southeast Asian country's banking system on February 5, 2016.
<br>
This money was laundered through casinos and a little later transferred to Hong Kong.
</p>
<p>
The attack is impressive and stands out on various levels, in terms of technical means, maturity and complexity:
</p>
<ol>
<li><b>Technical Mastery</b>: the usage of a custom worm (filename evtdiag.exe) to hack the bank's SWIFT bridge, and likely other malware to capture the administration credentials</li>
<li><b>In-depth understanding of the worldwide financial market</b>: both the attack shape and the money laundering scheme prove in-depth understanding of financial markets</li>
<li><b>SWIFT knowledge</b>: in-depth knowledge of the SWIFT messaging details is not that widespread among software engineers</li>
</ol>
<p>
Let's see how this is proven by analyzing the details of the attack.
</p>
<a name="sec31"></a>
<h3>3.1 Behaviour of the malware</h3>
<p>
The hackers used a custom version of malware to hack software called SWIFT Alliance Access to both make the transactions and hide the evidence. The hackers used a version of the malware that removed integrity checks within the Alliance software and then monitored the transaction files sent through the system, searching the payment orders and confirmations for specific terms. These terms and the responses to them were specified by a <i>Command and Control</i> server in Egypt.
</p>
<p>
When a message with one of the search terms was found, the malware would do different things depending on the kind of message. Payment orders were modified to increase the amounts being moved, updating the Alliance database with new values. Confirmation messages from the SWIFT network were also modified. Confirmations are printed and stored in the database. Before being printed, the malware would alter the confirmations to show the original, correct transaction value; it also deleted confirmations from the Alliance database entirely.
</p>
<p>
It's still not clear how the initial transactions were entered into the system to trigger the malware in the first place.
</p>
<p>
Again, the SWIFT network's key components haven't been compromised; the malware was targeting the Bangladesh Central Bank's own bridge to the SWIFT infrastructure running the SWIFT Alliance Access software:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4a70014d-d9be-418d-a500-be0b0a7246b9">
<img class="centered" style="width: 610px;" alt="Worm infection on the Bank bridge to SWIFT" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4a70014d-d9be-418d-a500-be0b0a7246b9" />
</a>
</div>
<br>
<p>
If an organization can't keep its endpoints secure, it leaves itself very vulnerable to being electronically robbed. That was the case here.
<br>
The bank lacked any firewalls and was using second-hand $10 switches on its network. These switches did not allow for the regular LAN to be segmented or otherwise isolated from the SWIFT systems. The lack of network security infrastructure has hindered the investigation.
<br>
It's still not known how the hackers penetrated the network, but it looks like the bank didn't make it difficult for them to do so.
</p>
<p>
How the attackers obtained administration credentials is still unclear. They might have obtained these credentials by using another malware, by exploiting a remotely accessible vulnerability (not impossible considering the weak security practices in place at the Bangladesh Central Bank), or it might also have been an inside job.
<br>
So far there are only speculations in this regard.
</p>
<p>
<b>Forging fraudulent SWIFT messages</b>
</p>
<p>
Simplifying reality a bit, we can picture the malware as forging fraudulent SWIFT messages as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/63d62d1c-e899-4b82-a2ce-3aaa9e39afbc">
<img class="centered" style="width: 850px;" alt="Worm forging SWIFT messages" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/63d62d1c-e899-4b82-a2ce-3aaa9e39afbc" />
</a>
</div>
<br>
<p>
The view above is a simplification of reality. Actually, the worm was brilliantly implemented, since forging consistent SWIFT announcements (MT103) and Money Transfer Orders (MT202) from scratch would have been more difficult.
<br>
Instead, the worm was tampering with genuine messages issued by the Banking Information System, changing the amounts and recipients. This is a lot easier than forging from scratch.
<br>
It is still unclear for now whether the initial untampered messages were simply authentic and relevant messages, perhaps duplicated by the worm, or forged through other malware on different systems. I couldn't find clear information in this regard in all that has been published (if a reader has additional information in this regard, I would be happy to learn about it).
</p>
<p>
Just as a sidenote, whenever an institution such as the Bangladesh central bank sends a SWIFT funds transfer order, it's always on behalf of one of its customers. The SWIFT message(s) indicates the customer for which the bank requests a funds transfer.
<br>
Now of course the target correspondent bank cannot know whether such a customer exists; it doesn't have access to the list of customers of the sending bank. The SWIFT messages tampered with by the worm could have been related to any random customer of the Bangladesh bank; this doesn't matter.
<br>
The only important aspect was that the beneficiary account and banking institution were the ones intended by the attackers.
</p>
<p>
<b>Intercepting SWIFT confirmations</b>
</p>
<p>
Here as well, by simplifying reality a little, we can picture the malware as intercepting SWIFT confirmations as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/27dfdbe0-56d6-4f47-a1ba-74085f6c23f6">
<img class="centered" style="width: 850px;" alt="Worm forging SWIFT messages" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/27dfdbe0-56d6-4f47-a1ba-74085f6c23f6" />
</a>
</div>
<br>
<p>
The malware was also developed in such a way that it intercepted confirmation messages (MT950) coming back from the Fed (from the SWIFT network in fact). Confirmations of genuine orders were supposed to pass through untampered, while confirmations of fraudulent messages were supposed to be intercepted and hidden.
<br>
But the worm was buggy, and while tampering with confirmations sent to the printer, it corrupted them somehow, which caused the printer to crash. We'll get back to that later.
</p>
<p>
Interestingly, going as far as trying to tamper with confirmations was pure genius (even though it didn't work as expected). Had it worked, the bank might well have noticed the attack only weeks after the fact, since on both sides of the world (the Fed view and the Bangladesh Central Bank view) positions would have been very different yet each internally consistent: the Fed knowing about the orders and taking them as genuine, while the Bangladesh bank would have known nothing about them.
<br>
Also, one should note that transfer orders (MT202) are executed immediately. So trying to tamper with confirmations was not intended to give the transfers a better chance to succeed; it was really intended as a way to hide the theft until, hopefully, after the money had been laundered.
</p>
<p>
<b>The Malware</b>
</p>
<p>
The malware, codenamed Dridex (Addendum 2020-Sep-03 - see note at the end of the article), filenamed <code>evtdiag.exe</code>, was designed to hide the hackers' tracks by changing information on a SWIFT database within Alliance Access, and contained the IP address of a server in Egypt that the attackers used to monitor the use of the SWIFT system by Bangladesh Bank staff.
<br>
It was likely part of a broader attack toolkit that was installed after the attackers obtained administrator credentials.
</p>
<p>
The malware was compiled close to the date of the heist, contained detailed information about the bank's operations and was uploaded from Bangladesh.
<br>
While that malware was specifically written to attack the Bangladesh Bank, the general tools, techniques and procedures used in the attack may allow the gang to strike again; as a matter of fact, there have been attempts discussed by Reuters.
</p>
<p>
The malware was designed to make a slight change to the code of the Access Alliance software installed at the Bangladesh Central Bank, giving attackers the ability to modify a database that logged the bank's activity over the SWIFT network.
<br>
Once it had established a foothold, the malware could delete records of outgoing transfer requests altogether from the database and also intercept incoming messages confirming transfers ordered by the hackers.
<br>
It was also able to manipulate account balances on logs to prevent the heist from being discovered until after the funds had been laundered. Additionally, it manipulated the stream of confirmations sent to a printer that produced hard copies of transfer requests, so that the bank would not identify the attack through those printouts.
<br>
This part went wrong and led the printer to crash.
</p>
<p>
More information on the malware is available <a href="https://d3pakblog.wordpress.com/2017/01/31/bangladesh-bank-heist-2016/">in this article</a> and <a href="https://asamborski.github.io/cs558_s17_blog/2017/03/23/bangladesh.html">this one</a>.
</p>
<a name="sec32"></a>
<h3>3.2 Complete overview of the attack</h3>
<p>
On February 4, likely after months of preparation, organizationally and technically (gaining access to the systems, developing the custom worm, obtaining credentials, infecting the systems, etc.), unknown hackers sent more than three dozen fraudulent money transfer requests to the Federal Reserve Bank of New York, asking the bank to transfer millions out of the Bangladesh Bank's VOSTRO account to bank accounts in the Philippines, Sri Lanka and other parts of Asia.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/56d7365b-e516-4b9e-953a-4a42e96088bd">
<img class="centered" style="width: 850px;" alt="Overview of the attack" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/56d7365b-e516-4b9e-953a-4a42e96088bd" />
</a>
</div>
<br>
<p>
The hackers managed to get 81 million USD sent to Rizal Commercial Banking Corporation (RCBC) in the Philippines via four different transfer requests and an additional 20 million USD sent to Pan Asia Banking in a single request.
</p>
<p>
Fortunately, 850 million USD in other transactions were saved initially (thanks to the Fed).
</p>
<p>
The 81 million USD was deposited into three accounts at a Rizal branch in Manila on February 4. These accounts had all been opened a year earlier, in May 2015, but had been inactive, with just 500 USD sitting in them, until the stolen funds arrived in February 2016.
</p>
<p>
Another 20 million USD intended for Pan Asia Banking were blocked later in the correspondent bank funds transfer routing process (thanks to Deutsche Bank).
</p>
<p>
But the 81 million USD that went to Rizal Bank in the Philippines was gone. It had already been credited to multiple accounts, reportedly belonging to casinos in the Philippines, and all but 68 thousand USD of it was withdrawn on February 5 and 9, before further withdrawals were halted.
<br>
The stolen funds from Bangladesh Bank were transferred to money transfer company Philrem Services Corporation. Philrem converted some of the 81 million USD into pesos and delivered the money in cash tranches to a registered casino junket operator named Weikang Xu, Eastern Hawaii Leisure Company, and Bloomberry Hotels Incorporated (Solaire Resort & Casino).
</p>
<p>
The hackers might have stolen much more, but most transfers were fortunately stopped by the Fed, and one transfer was stopped by Deutsche Bank.
</p>
<p>
<b>Deutsche Bank saved 20 million USD</b>
</p>
<p>
Four requests to transfer a total of about 81 million USD to the Philippines went through, but the fifth one to transfer 20 million USD to a Sri Lankan non-profit organization was held up because the hackers misspelled the name of a non-existent NGO, Shalika Foundation, by writing "fandation" instead of "foundation".
</p>
<p>
This prompted Deutsche Bank, a routing bank in the process, to seek clarification from the Bangladesh central bank, thereby stopping the transaction.
</p>
<p>
<b>The Fed saved 850 million USD</b>
</p>
<p>
The Federal Reserve Bank did not execute 30 of the 35 transfers, worth around 851 million USD, officially due to "lack of details." These thirty transactions were flagged by the banking system for staff review.
</p>
<p>
The Fed was still tricked into paying out 101 million USD. But the losses could have been much higher, had the name <b>Jupiter</b> not formed part of the address of the Philippine bank branch to which the hackers sought to send hundreds of millions of dollars more.
<br>
By chance, <i>Jupiter</i> was also the name of an oil tanker and of a shipping company under United States sanctions against Iran. That sanctions listing triggered concerns at the New York Fed and spurred it to scrutinize the fake payment orders more closely.
</p>
<p>
It was a "total fluke" that the New York Fed did not pay out the 951 million USD requested by the hackers. There is no suggestion the oil tanker or shipping company was involved in the heist.
<br>
The Reuters examination has also found that the payment orders sent by the hackers were exceptional in several ways. They were incorrectly formatted at first; they were mainly to individuals; and they were very different from the usual run of payment requests from Bangladesh Bank.
<br>
Yet it was the word Jupiter that set the loudest alarm bells ringing at the New York Fed.
</p>
<p>
<b>The printer error</b>
</p>
<p>
A printer "error" helped Bangladesh Bank discover the heist. The bank's SWIFT bridge (running Alliance Access) was configured to automatically print out confirmations back from correspondent banks.
<br>
The printer works 24 hours so that when workers arrive each morning, they check the tray for transfers that got confirmed overnight.
<br>
But on the morning of Friday February 5, the director of the bank found the printer tray empty. When bank workers tried to print the reports manually, they couldn't. The software on the terminal that connects to the SWIFT network indicated that a critical system file was missing or had been altered.
<br>
The problem is deemed to be an unwanted bug with the worm, a failure in the attack if one likes, since the worm was programmed to remove confirmation of fraudulent payments from the confirmation stream being sent to the printer.
<br>
Fortunately, in this case, the Fed clarification requests and the Deutsche Bank request would have anyway alerted the bank, so even if the worm had functioned correctly, the bank would have been made aware of the attack.
</p>
<p>
When they finally got the software working the next day and were able to restart the printer, dozens of suspicious transactions spat out. The Fed bank in New York had apparently sent queries to Bangladesh Bank questioning dozens of the transfer orders, but no one in Bangladesh had responded.
<br>
Panic ensued as workers in Bangladesh scrambled to determine whether any of the money transfers had gone through - their own records showed that nothing had been debited from their account yet - and to halt any orders that were still pending.
<br>
They contacted SWIFT and the New York Fed, but the attackers had timed their heist well: because it was the weekend in New York, no one there responded. It wasn't until Monday that bank workers in Bangladesh finally learned that five of the transactions had gone through, amounting to 101 million USD.
</p>
<p>
This article on the Fin website is magnificent and gives a lot of additional cool information on the attack: <a href="https://fin.plaid.com/articles/anatomy-of-a-bank-heist">https://fin.plaid.com/articles/anatomy-of-a-bank-heist</a>.
</p>
<a name="sec33"></a>
<h3>3.3 Timeline of the attack</h3>
<p>
In every movie about bank robberies, timing is presented as critical. Here as well, timing was an essential concern, brilliantly mastered by the attackers.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f241e7dc-6d4c-4f7e-932a-e8c90bb3d6a4">
<img class="centered" style="width: 850px;" alt="Timeline of the attack" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f241e7dc-6d4c-4f7e-932a-e8c90bb3d6a4" />
</a>
</div>
<br>
<p>
In detail:
</p>
<ul>
<li>
<b>May 15, 2015:</b> four dollar bank accounts were opened in the Jupiter, Makati branch of the Rizal Commercial Banking Corporation (RCBC) under the names of Enrico Teodoro Vasquez, Alfred Santos Vergara, Michael Francisco Cruz and Jessie Christopher Lagrosas, with an initial deposit of 500 USD each. These accounts, which were later found to be fake, remained idle until February 4, 2016.
</li>
<li>
<b>January 2016:</b> The hackers installed the malware on the bank's systems some time in January, not long before they initiated the bogus money transfers on February 4. This was brilliant as well: installing it too soon might have gotten it detected before the heist, while installing it too late might not have left them enough time to assess its behaviour.
</li>
<li>
<b>February 4, 2016: </b> Under the control of the hackers, the malware broke into Bangladesh Bank's VOSTRO account at the Federal Reserve Bank of New York, ordering 35 transfers worth 951 million USD, the bulk of which was to be transferred to the RCBC Jupiter branch.
<br>
The Fed managed to detect and block 30 fraudulent transactions, but 5 transfers worth 101 million USD were not blocked.
</li>
<li>
<b>February 5, 2016: </b> The Fed tried to contact Bangladesh Bank to get an explanation about these transfers, including the 5 that were not blocked.
<br>
<b>But February 5 was a bank holiday in Bangladesh</b> and nobody could answer.
</li>
<li>
<b>February 5 to February 8, 2016: </b> The 5 transfers were executed by the correspondent banks and the routing banks.
<br>
One transaction of 20 million USD was salvaged, after an instruction to a fake Sri Lankan foundation was put on hold by Deutsche Bank, one of the routing banks, because of a typographical mistake.
<br>
But the remaining 81 million USD of stolen funds found their way to 4 fake bank accounts at RCBC.
</li>
<li>
<b>February 8, 2016:</b> Bangladesh Bank sent a "stop payment" order to RCBC, asking it to refund the stolen funds, or to freeze them if they had not been transferred yet.
<br>
<b>But February 8 was a Chinese New Year non-working holiday in the Philippines.</b>
</li>
<li>
<b>February 9:</b> RCBC received a SWIFT message from Bangladesh Bank requesting a refund, or that the funds be put on hold or frozen for proper investigation if they had already been transferred.
<br>
Despite the "stop payment" order, the RCBC Jupiter branch still allowed withdrawals from the accounts.
<br>
The money was then consolidated and deposited in a dollar account of William So Go, doing business as Centurytex Trading, which was opened on that same day.
<br>
In the following days the money was laundered in the casinos.
</li>
</ul>
<p>
The timing was perfect: a weekend preceded by a bank holiday in Bangladesh and followed by the Chinese New Year holiday in the Philippines - an ideal situation.
<br>
The Fed couldn't get the required clarification from Bangladesh Bank the next day, and as such could not immediately attempt to recover the 5 orders that went through.
<br>
On Monday, the stop orders sent by Bangladesh Bank couldn't be processed by RCBC to freeze the funds, since it was a bank holiday in the Philippines.
<br>
In addition, the Chinese New Year and the volume of money exchanged in casinos on such an occasion made laundering the money straightforward, not to mention the Philippines' weak AML laws and practices.
</p>
<a name="sec34"></a>
<h3>3.4 Laundering of the money</h3>
<p>
Getting the money out is difficult too. Here, the laundering scheme was both easy and magnificent: the money was laundered through the Philippines.
</p>
<p>
The 81 million USD that was successfully stolen was sent to the Philippines, to accounts at the Rizal Commercial Banking Corporation (RCBC) held by two Chinese nationals who organize gambling junkets in Macau and the Philippines. The money was moved to several Philippine casinos and subsequently to international bank accounts.
<br>
Laundering money in a casino is fairly straightforward. One just needs to convert the money into chips at any counter, lose a little at a few random games, then pretend one has lost enough and cash the rest back into another account. Boom, laundered. The bank behind the account the money was withdrawn from assumes it was lost gambling, while the bank behind the account the money is put back into assumes it was won in the casino.
</p>
<p>
Philippine casinos are exempt from the anti-money-laundering law that requires reporting suspicious transactions, making them an attractive target for this kind of crime.
<br>
Plus, can you imagine the amount of money transferred, spent, won and lost in Philippine casinos during Chinese New Year?
<br>
The volumes and the kinds of operations involved make everything absolutely untraceable.
</p>
<p>
Bam! Done. Money laundered.
</p>
<a name="sec4"></a>
<h2>4. Aftermath</h2>
<p>
<b>What Does the Heist Mean?</b>
</p>
<p>
Even though the hackers didn't compromise the SWIFT network itself - which would have made all SWIFT banks vulnerable - it's still bad news for the global banking process. By targeting the methods that financial institutions use to conduct transactions over the SWIFT network, the hackers undermined a system that until now had been viewed as stalwart.
</p>
<p>
<b>Who's to Blame?</b>
</p>
<p>
Honestly, only the attackers are really to blame.
<br>
But still, without such glaring security weaknesses at the Bangladesh Central Bank, and with better controls and stricter procedures in place at the Fed, the attack would not have been possible.
<br>
Of course, the Bangladesh Bank blames the Fed for allowing the money transfers to go through instead of waiting for confirmation from Bangladesh. The Fed counters that it contacted the bank to question and verify dozens of suspicious transfers and never got a response. Authorities at the Fed said that workers followed the correct procedures in approving the five money transfers that went through and blocking 30 others. Bangladesh Bank says the Fed bank should have blocked all money transfers until it got a response on the ones it deemed suspicious. And so on...
</p>
<p>
<b>The Bangladesh Bank</b>
</p>
<p>
Aside from the loss of money, the central bank's governor, Atiur Rahman, resigned over the incident. The bank promised to improve its cyber-security and to ensure this kind of bank heist is prevented in the future.
</p>
<p>
<b>The Fed</b>
</p>
<p>
The immediate result of the breach for the New York Fed is a claim from the Bangladesh Bank for payment of lost funds and a potential lawsuit.
</p>
<p>
The Fed had focused its security resources on other priorities, such as preventing money-laundering and enforcing U.S. economic sanctions, officials with knowledge of the bank's security operations told Reuters. Fed officials took some comfort in the fact that SWIFT's security software had never been cracked.
</p>
<p>
The Bangladesh heist forced the Fed to invest massively in fraud-prevention solutions and better transaction-monitoring systems.
</p>
<p>
<b>Philrem services</b>
</p>
<p>
The Philippine central bank revoked the license of the remittance company that anti-money-laundering investigators said was used to transfer some of the 81 million USD hackers looted from the Bangladesh central bank.
<br>
The Anti-Money Laundering Council (AMLC) issued a complaint against Philrem Service Corporation on April 28, accusing it of creating a fog around the transactions and washing the stolen funds via a web of transfers and currency conversions through Philippine bank accounts, before moving the cash through casinos in Manila and junket operators.
</p>
<p>
<b>The Philippines</b>
</p>
<p>
The Philippines' involvement in the 100 million USD Bangladesh Bank heist, which risked putting it back on the FATF gray list, showed the urgency of putting more teeth into the Anti-Money Laundering Act (AMLA).
</p>
<p>
The law, first introduced in 2001, left casinos out of the list of entities required to report suspicious transactions to the AMLC. There were efforts in the Senate to include this provision in the amended AMLA in 2013, but this was blocked by some lawmakers and casino lobbies.
</p>
<a name="sec5"></a>
<h2>5. Conclusion</h2>
<p>
If the hackers had indeed managed to get away with the terrifyingly large amount of 1 billion USD, this would easily have been the biggest bank heist in history, not to mention the biggest cyber heist.
</p>
<p>
Interestingly, these kinds of attacks will be increasingly common, and if banks don't update their security processes and maintain their network infrastructures, the success rate of these attacks will only go up. Worse still, if hackers have access to banks and can manipulate funds, any business that partners with those banks is also at risk.
</p>
<p>
Imagine the following: if the worm had functioned correctly and not blocked the printer, if Deutsche Bank hadn't found the typo, and if the Fed hadn't become suspicious because of the Jupiter keyword, the attack might have been a complete success. Not only would the attackers have successfully withdrawn almost one billion US dollars from the Bangladesh Bank VOSTRO account at the Fed, but the attack might have been noticed only weeks or months after the fact.
</p>
<p>
Finally, imagine that the same attack succeeds against an American or a European bank. In the US and in Europe, the SWIFT interfaces are integrated in an STP (Straight Through Processing) way. There is no such thing as manual reconciliation from papers coming out of a printer: the handling of confirmations and position reconciliation is mostly completely automated.
<br>
As such, the same attack succeeding in Europe, for instance, might take months to be discovered and uncovered - only at the moment the big position reconciliations between NOSTRO and VOSTRO accounts at correspondent banks are triggered.
</p>
<p>
And this is where it gets really funny. Everybody always had the illusion that SWIFT was so secure, so safe. It gave banking institutions worldwide the illusion that everything related to SWIFT is just as secure. But while the network itself is indeed pretty secure, the specific bridges and interfaces linking the banking information systems to SWIFT can be very weak, as shown by the Bangladesh heist.
<br>
Today, European and US banking institutions and central banks are very worried and are investigating transaction-monitoring and security solutions to prevent such a misadventure from happening to them.
<br>
Again, the same attack in Europe would be a much bigger disaster.
</p>
<p>
Now another funny story to conclude this article: imagine a similar hack between two banks in Europe, and imagine that one of them suspects something... They would use SWIFT again to reconcile their views of the truth (MT109 and MT999).
<br>
These messages can be hacked just as well, in which case the theft might remain undiscovered for months.
<br>
This is really hilarious.
</p>
<p>
(This article is available as a slideshare presentation here
<a href="https://www.slideshare.net/JrmeKehrli/deciphering-the-bengladesh-bank-heist">https://www.slideshare.net/JrmeKehrli/deciphering-the-bengladesh-bank-heist</a>)
</p>
<div class="centering">
<div class="centered" style="border: 1px solid #AAAAAA; width: 700px; text-align: left;" >
<p>
<b>Edit Sep 3rd, 2020</b>
</p>
<p>
At the time of writing of this article, it was a common belief in the group of people I was discussing this with that the main worm used for the heist (EvtDiag filename) was a custom version of the Dridex worm.
<br>
It has been brought to my attention that nothing confirms this hypothesis; on the contrary, multiple experts and studies show EvtDiag had nothing to do with Dridex.
<br>
So even though some other sources clearly point to Dridex as (one of) the worm(s) used in the attack (e.g. <a href="http://www.straitstimes.com/business/dridex-malware-linked-to-bangladesh-heist">http://www.straitstimes.com/business/dridex-malware-linked-to-bangladesh-heist</a>), this should be taken cautiously.
</p>
</div>
</div>
https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2
ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part III : so why is it cool ?
Jerome Kehrli
2017-08-30T16:43:44-04:00
2017-10-02T12:15:12-04:00
<p><i>
<b>Edited 2017-10-30</b>: I was using ES 5.0.0 with Spark 2.2.0 at the time of writing the initial version of this article.
<br>
With ElasticSearch 6.x, and ES-Hadoop 6.x, the game changes a little. The Spark 2.2.0 Dynamic allocation system is now perfectly compatible with the way ES-Hadoop 6.x enforces data locality optimization and everything works just as expected.
</i>
</p>
<p>
So, finally, the conclusion of this series of three articles - the big conclusion - where I intend to present why this ELK-MS (ElasticSearch/LogStash/Kibana - Mesos/Spark) stack is really really cool.
<br>
Without any more waiting, let's give the big conclusion right away: using ElasticSearch, Mesos and Spark, one can really distribute and scale the processing the way we want and, out of the box (using Dynamic Allocation), <b>scale the processing linearly with the amount of data to process</b>.
<br>
And this, exactly this and nothing else, is very precisely what we want from a Big Data Processing cluster.
</p>
<p>
At the end of the day, we want a system that books a lot of the resources of the cluster for a job that should process a lot of data and only a small subset of these resources for a job that works on a small subset of data, with a strong enforcement of data locality optimization.
<br>
And this is precisely what one can achieve pretty easily with the ELK-MS stack, in an almost natural and straightforward way.
<br>
I will present why and how in this article.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/big_archi_data.png">
<img class="centered" style="width: 400px; " alt="ELK-MS Data Architecture" src="https://www.niceideas.ch/es_spark/images/big_archi_data.png" />
</a>
</div>
<br>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana">The first article - ELK-MS - part I : setup the cluster</a> in this serie presents the ELK-MS stack and how to set up a test cluster using the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz">niceideas ELK-MS package</a>.
</p>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">The second article - ELK-MS - part II : assessing behaviour</a> presents a few concerns, assesses the expected behaviour using the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz">niceideas ELK-MS TEST package</a> and discusses challenges and constraints in this ELK-MS environment.
</p>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2">This third and last article - ELK-MS - part III : so why is it cool?</a> presents, as indicated, why this ELK-MS stack is really really cool and works great.
</p>
<p>
This article assumes a basic understanding of Big Data / NoSQL technologies in general by the reader.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a></li>
<li><a href="#sec2">2. Data locality and workload distribution</a></li>
<li><a href="#sec3">3. Examples</a></li>
<li><a href="#sec4">4. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction </h2>
<p>
The reader might want to refer to the <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana#sec1">Introduction of the first article in the series</a> as well as <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1#sec1">the introduction of the second article</a>.
</p>
<p>
Summarizing them, this series of articles is about presenting and assessing the ELK-MS stack, reporting the tests done using the test cluster, and presenting the conclusions, in terms of constraints as well as key lessons.
<br>
The second article presented the technical constraints coming from integrating Spark with ElasticSearch through the ES-Hadoop connector when running Spark on Mesos.
<br>
In that second article, I focused a lot on what was not working and what the constraints were. A reader might have had the impression that these constraints could prevent a wide range of use cases on the ELK-MS stack. I want to address this fear in this third article, since it is all but true: Spark on Mesos using data from ElasticSearch is really a pretty versatile environment and can address most if not all data-analysis requirements.
</p>
<p>
In this last article, I will present how one can use a sound approach to data distribution in ElasticSearch to drive the distribution of the workload on the Spark cluster.
<br>
And it turns out that it's pretty straightforward to come up with a simple, efficient and natural approach to controlling the workload distribution using ElasticSearch, Spark and Mesos.
</p>
<p>
<b>ES index layout strategies</b>
</p>
<p>
The parameters that architects and developers need to tune to control the data distribution on ElasticSearch, which, in turn, controls the workload distribution on Spark, are as follows:
</p>
<ul>
<li>The <b>index splitting</b> strategy</li>
<li>The <b>index sharding</b> strategy</li>
<li>The <b>replication</b> strategy (factor)</li>
<li>The <b>sharding key</b></li>
</ul>
<p>
<b>Spark aspects</b>
</p>
<p>
Then, on the Spark side, the only important aspect is to use a version of ES-Hadoop that supports the Dynamic Allocation system without compromising data-locality optimization (i.e. ES-Hadoop >= 6.x for Spark 2.2).
</p>
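<p>
To make this concrete, here is a minimal pyspark sketch of such a job configuration. The application name and ES node address are hypothetical placeholders; the configuration keys are the standard Spark ones for coarse-grained Mesos scheduling and Dynamic Allocation:
</p>
<pre>
from pyspark import SparkContext, SparkConf

# Coarse-grained Mesos scheduling with Dynamic Allocation enabled.
# Dynamic Allocation requires the external shuffle service to be
# running on the Mesos agents.
conf = SparkConf().setAppName("ESTest_dyn_alloc") \
    .set("spark.mesos.coarse", "true") \
    .set("spark.dynamicAllocation.enabled", "true") \
    .set("spark.shuffle.service.enabled", "true") \
    .set("es.nodes", "192.168.10.11")   # hypothetical ES node address

sc = SparkContext(conf=conf)
</pre>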
<p>
But before digging into this, and if it is not already done, I can only strongly recommend reading <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana">the first article</a> in this series, presenting the ELK-MS stack, and <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">the second article</a>, which presents the conclusions required to understand what follows.
</p>
<a name="sec2"></a>
<h2>2. Data locality and workload distribution</h2>
<p>
What has been presented in <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1#sec3">the conclusion section of the ELK-MS part II article</a> is summarized hereunder:
</p>
<ul>
<li>
The Fine-Grained scheduling mode of spark jobs by Mesos screws performance up to an unacceptable level. ELK-MS needs to stick to the <b>Coarse-Grained</b> scheduling mode.
</li>
<li>
ES-Hadoop is able to enforce data-locality optimization under nominal conditions. On a heavily loaded cluster, data-locality optimization can be compromised for two reasons:
<ul>
<li>
If the Mesos / Spark node local to a specific ES node is not available after the configured waiting time, the processing will be moved to another free Mesos / Spark node.
</li>
<li>
ElasticSearch may well decide to serve the request from another node, should the local ES node be busy at the time it is requested by the local spark node.
</li>
</ul>
</li>
<li>
With ES-Hadoop 5.x, Dynamic Allocation was messing up data-locality optimization between ES and Spark. As such, only Static Allocation was usable, and it was required to artificially limit the number of nodes for a given job in good correspondence with the number of shards in ES (using the <code>spark.cores.max</code> property to limit the number of spark executors and the <code>search_shards</code> API in ES to find out the number of shards to be processed; a sketch of this workaround is given after this list).
<br>
<b>But now, with ES-Hadoop 6.x, Dynamic Allocation doesn't interfere with data-locality optimization and everything works well out of the box.</b>
</li>
<li>
Re-distributing the data on the cluster after the initial partitioning decision is only done by spark under specific circumstances.
</li>
</ul>
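<p>
For reference, here is a minimal sketch of that ES-Hadoop 5.x era workaround, under stated assumptions: the ES node address and index name are hypothetical, and I simply equate <code>spark.cores.max</code> with the number of shards reported by the <code>_search_shards</code> API, as described above:
</p>
<pre>
import json, urllib2
from pyspark import SparkConf

# Ask ES how many shards back the target index: the _search_shards API
# returns one group per shard (each group lists the primary and replicas).
resp = json.load(urllib2.urlopen(
    "http://192.168.10.11:9200/dataset-2017.06/_search_shards"))
num_shards = len(resp["shards"])

# With ES-Hadoop 5.x and Static Allocation, cap the job's resources
# accordingly (assuming one core per executor here).
conf = SparkConf().setAppName("ESTest_static_alloc") \
    .set("spark.cores.max", str(num_shards))
</pre>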
<p>
<b>ES-Hadoop drives spark partitioning strategy</b>
</p>
<p>
So what happens with ES-Hadoop 6.x and Dynamic Allocation is that the ElasticSearch sharding strategy drives the partitioning strategy of the corresponding data frames in Spark. With data-locality optimization kicking in, even with Dynamic Allocation enabled, the Spark / Mesos cluster will do its best to create the Spark partitions on the nodes where the ES shards are located.
<br>
And this really works just out of the box.
</p>
<p>
<b>Eventually, there will be just as many executors booked by Mesos / Spark on the cluster as are required to handle every ES shard in a dedicated, co-located partition within Spark.</b>
</p>
<a name="sec3"></a>
<h2>3. Examples</h2>
<p>
In order to illustrate why I believe that the way ELK-MS distributes the workload following the distribution of the data is in fact <i>efficient and natural</i>, we'll use the examples below.
</p>
<p>
Imagine the following situation: the ELK-MS test cluster contains 6 nodes with similar configurations. The dataset to be stored is called <code>dataset</code> and contains 2 months of data.
<br>
In ElasticSearch, the indexing settings are as follows (a sketch of a matching index template is given after the list):
</p>
<ul>
<li>
The <b>Index splitting strategy</b> is <b>by month</b>. This is not strictly an ElasticSearch setting; it is configured in Logstash or any other data-ingestion tool.
<br>
As a matter of fact, whenever one wants to store temporal data (time series) in ElasticSearch, one naturally considers splitting the index by year, month or even day, depending on the size of the dataset.
</li>
<li>
The <b>sharding strategy</b> consists in creating 3 shards.
</li>
<li>
The <b>replication strategy</b> consists in creating 2 replicas (meaning 1 primary shard and 2 replicas).
</li>
<li>
We do not care about configuring the <i>sharding key</i> any differently than the default for now (a few words on the sharding key configuration are given in the conclusion).
</li>
</ul>
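<p>
As an illustration, here is a minimal sketch of a matching index template, assuming monthly indexes named <code>dataset-YYYY.MM</code>; the template name and index pattern are hypothetical, and the monthly splitting itself is done by the ingestion tool (for instance Logstash writing to an index such as <code>dataset-%{+YYYY.MM}</code>):
</p>
<pre>
import json, urllib2

# Hypothetical index template applying the example layout to all monthly
# indexes: 3 shards, each with 2 replicas.
# ("template" is the ES 5.x field; ES 6.x uses "index_patterns")
template = {
    "template": "dataset-*",
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2
    }
}
request = urllib2.Request("http://192.168.10.11:9200/_template/dataset",
                          json.dumps(template),
                          {"Content-Type": "application/json"})
request.get_method = lambda: "PUT"
urllib2.urlopen(request)
</pre>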
<p>
<b>Initial situation</b>
</p>
<p>
We can imagine that the above situation ends up in the following data layout on the cluster. (One should note though that this is not very realistic, since ES would likely not split both months this way when it comes to storing replicas):
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/big_archi_data_empty.png">
<img class="centered" style="width: 500px; " alt="Data Architecture nominal situation" src="https://www.niceideas.ch/es_spark/images/big_archi_data_empty.png" />
</a>
</div>
<br>
<p>
<b>Working on a small subset of data for one month</b>
</p>
<p>
Now let's imagine that we write a processing script in spark that fetches a small subset of the data of one month, June 2017, so [A] here.
</p>
<p>
In addition, imagine that the filter ends up matching precisely the data from a single shard of the index. In this case, Spark / Mesos would create a single spark partition on the node co-located with the ES shard.
</p>
<p>
The processing happens this way in this case:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/big_archi_data_subset.png">
<img class="centered" style="width: 500px; " alt="Data Architecture working on subset" src="https://www.niceideas.ch/es_spark/images/big_archi_data_subset.png" />
</a>
</div>
<br>
<p>
Since only one shard needs to be read from ElasticSearch, ES-Hadoop will drive the creation of a single partition in the resulting DataFrame (or RDD), which in turn will cause Spark to schedule a single task in one executor - the one local to the ES shard.
</p>
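<p>
A minimal pyspark sketch of such a filtered read is given hereunder, under stated assumptions: the index name (<code>dataset-2017.06/data</code>), the ES node address and the query field are all hypothetical, and the <code>es.query</code> option is used to push the filter down to ElasticSearch:
</p>
<pre>
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("ESTest_subset")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read a small subset of June 2017 from ES. The query is pushed down to
# ElasticSearch so that only matching documents are fetched.
df = sqlContext.read.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "192.168.10.11") \
    .option("es.query", '{"query": {"term": {"city": "lausanne"}}}') \
    .load("dataset-2017.06/data")

print df.count()
</pre>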
<p>
So what actually happens is that working on a single shard located on a single ES node drives spark to work on one single node as well.
<br>
Using replicas has the benefit of giving the Mesos / Spark cluster some choice as to which node this should be. This is especially important if the cluster is somewhat loaded.
</p>
<p>
<b>Working on a single month of data</b>
</p>
<p>
In this second example, the processing script works on a single month of data, the full month of June 2017, so all shards of [A] here.
</p>
<p>
This will drive Spark to create 3 corresponding partitions on the Mesos / Spark cluster.
<br>
The processing works as follows in this case:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/big_archi_data_1_month.png">
<img class="centered" style="width: 500px; " alt="Data Architecture working on full month" src="https://www.niceideas.ch/es_spark/images/big_archi_data_1_month.png" />
</a>
</div>
<br>
<p>
Three shards need to be fetched from ES to Spark. ES-Hadoop will create 3 partitions, which leads to 3 tasks being dispatched in the Spark processing stage. These 3 tasks will be executed on the 3 nodes hosting the ES shards.
</p>
<p>
Again, distributing the input data on one third of the ES cluster on one side, and limiting Spark's resources to the actual number of nodes required on the other side, leads to one third of the Spark cluster being used for the spark processing.
<br>
In this case, the ElasticSearch data distribution strategy drives the workload distribution on spark.
<br>
Again, replication is useful to ensure a successful distribution even on a loaded cluster.
</p>
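<p>
One simple way to verify this behaviour, sticking to the hypothetical index and SQLContext from the previous sketch, is to read the full month without any query and look at the number of partitions ES-Hadoop created for the resulting DataFrame:
</p>
<pre>
# Reading the full month (no query): ES-Hadoop creates one Spark partition
# per ES shard, so 3 partitions are expected for our 3-shard index.
df = sqlContext.read.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "192.168.10.11") \
    .load("dataset-2017.06/data")

print df.rdd.getNumPartitions()   # expected: 3, one per shard
</pre>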
<p>
<b>Working on the whole period</b>
</p>
<p>
This will drive spark to create partitions on all nodes of the cluster.
<br>
The processing happens this way:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/big_archi_data.png">
<img class="centered" style="width: 500px; " alt="Data Architecture working on both months" src="https://www.niceideas.ch/es_spark/images/big_archi_data.png" />
</a>
</div>
<br>
<p>
When working on the whole period, we fortunately end up, in this case, fetching shards from the whole ES cluster; the whole spark cluster will thus be used to distribute the processing workload, since each and every local spark node will need to work on its local ES shard.
</p>
<p>
Again, one last time: <b>the ElasticSearch data distribution strategy drives the workload distribution in line with the data distribution, enforcing data-locality optimization.</b>
</p>
<a name="sec4"></a>
<h2>4. Conclusion</h2>
<p>
In conclusion, having the ElasticSearch data distribution strategy drive the processing distribution on the Mesos / Spark cluster, thanks to the requirements the ES-Hadoop connector gives to spark, makes a lot of sense if you think of it.
</p>
<p>
First, it's <b>simple and consistent</b>. One can understand how the first stages of processing will occur within spark by simply looking at the data distribution, using for instance Cerebro. Everything is well predictable and straightforward to assess.
</p>
<p>
But more importantly, it's <b>efficient</b> since, well, whenever we store data in ElasticSearch, we think of the distribution strategy - in terms of index splitting, sharding and replication - precisely for the single purpose of performance.
<br>
Creating too many indexes and shards - more than the number of nodes - would be pretty stupid, since having more than X shards to read per node, where X is the number of CPUs available to ES on a node, leads to poor performance. As such, the upper limit is the number of CPUs in the cluster. Isn't it fortunate that this is also the limit we want in such a case for our spark processing cluster?
<br>
On the other hand, when one wants to store a tiny dataset, a single index with a single shard is sufficient. In this case, a processing job on this dataset would also use a single node of the spark cluster. Again, that is precisely what we want.
</p>
<p>
In the end, one <i>"simply"</i> needs to optimize one's ElasticSearch cluster, and the spark processing will be optimized accordingly.
<br>
Eventually, the processing distribution will scale linearly with the data distribution. As such, it's <b>a very natural approach</b>, in addition to being simple and efficient.
</p>
<p>
Summing things up: with the spark processing workload distribution being driven by the ElasticSearch data distribution, both are impacted by the following parameters of an ES index:
</p>
<ul>
<li>The <b>index splitting</b> strategy</li>
<li>The <b>index sharding</b> strategy</li>
<li>The <b>replication</b> strategy (factor)</li>
<li>The <b>sharding key</b></li>
</ul>
<p>
The sharding key is not very important unless one has to implement a lot of joins in one's processing scripts. In that case, one should carefully look at the various situations of these joins and find out which property is used most often as the join key.
<br>
The sharding key should be this very same join key, thus enabling spark to implement the joins with the best data locality - most of the time on the local node - since all documents with the same sharding key end up in the same shard.
<br>
This may be the topic of another article on the subject, but likely not soon... since, after so much writing, I need to focus on something other than Spark and ElasticSearch for a little while...
</p>
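<p>
For what it's worth, a minimal sketch of setting such a sharding key when writing from Spark to ES is given hereunder. The <code>es.mapping.routing</code> setting points ES-Hadoop at the document field to route by; the field name (<code>account_id</code>) and index are hypothetical:
</p>
<pre>
# Write a DataFrame to ES, routing documents by a hypothetical join key:
# all documents sharing the same "account_id" land in the same shard,
# which later lets Spark perform the corresponding joins node-locally.
df.write.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "192.168.10.11") \
    .option("es.mapping.routing", "account_id") \
    .mode("append") \
    .save("dataset-2017.06/data")
</pre>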
<p>
As a last word on this topic for now, I would like to emphasize that not only does this ELK-MS stack work in a simple, natural, efficient and performant way, but in addition all the UI consoles (Cerebro, Kibana, the Mesos Console, the Spark History Server) are state of the art, the Spark APIs are brilliantly designed and implemented, ElasticSearch itself answers a whole range of use cases on its own, etc.
<br>
This stack is simply so amazingly cool.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1
ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part II : assessing behaviour
Jerome Kehrli
2017-08-23T17:30:37-04:00
2020-07-21T18:21:58-04:00
<style>
.container-wrapper {
width: 100%;
float: left;
}
.container {
width: 100%;
margin-left: auto;
margin-right: auto;
}
.container figure {
float: left;
max-width: 23%;
min-width: 160px;
margin: 0 0 10px 2%;
}
/*.container figure:nth-child(1),
.container figure:nth-child(3) {
margin-left: 0;
}
*/
.container_inner {
height: 100px;
max-height: 100px;
overflow-y: hidden;
}
.container figure a img, .container figure img {
max-width: 100%;
float: left;
clip: rect(0,100px,100px,0);
position: relative;
width: 100%;
}
</style>
<p><i>
<b>Edited 2017-10-30</b>: I was using ES 5.0.0 with Spark 2.2.0 at the time of writing the initial version of this article.
<br>
With ElasticSearch 6.x, and ES-Hadoop 6.x, the game changes a little. The Spark 2.2.0 Dynamic allocation system is now perfectly compatible with the way ES-Hadoop 6.x enforces data locality optimization and everything works just as expected.
</i>
</p>
<p>
This article is the second in my series of three articles presenting the ELK-MS stack and test cluster.
</p>
<p>
ELK-MS stands for ElasticSearch/LogStash/Kibana - Mesos/Spark. The ELK-MS stack is a simple, lightweight, efficient, low-latency and performant alternative to the Hadoop stack, providing state-of-the-art Data Analytics features.
</p>
<p>
ELK-MS is especially interesting for people who don't want to settle for anything but the best regarding Big Data Analytics functionality, yet don't want to deploy a full-blown Hadoop distribution, for instance from Cloudera or HortonWorks.
<br>
Again, I am not saying that Cloudera's and HortonWorks' Hadoop distributions are not good. <i>Au contraire</i>, they are awesome and really simplify the overwhelming burden of configuring and maintaining the set of software components they provide.
<br>
But there is definitely room for something lighter and simpler in terms of deployment and complexity.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/application_architecture.png">
<img class="centered" style="width: 600px;" alt="ELK-MS Application Architecture" src="https://www.niceideas.ch/es_spark/images/application_architecture.png" />
</a>
</div>
<br>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana">The first article - entitled - ELK-MS - part I : setup the cluster</a> in this serie presents the ELK-MS stack and how to set up a test cluster using the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz">niceideas ELK-MS package</a>.
</p>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">This second article - ELK-MS - part II : assessing behaviour</a> presents a few concerns, assesses the expected behaviour using the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz">niceideas ELK-MS TEST package</a> and discusses the challenges and constraints on this ELK-MS environment.
</p>
<p>
The conclusions of this series of articles are presented in <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2">the third and last article - ELK-MS - part III : so why is it cool?</a>, which presents, as the name suggests, why this ELK-MS stack is really really cool and works great.
</p>
<p>
This article assumes a basic understanding of Big Data / NoSQL technologies in general by the reader.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a></li>
<li><a href="#sec2">2. Testing framework</a>
<ul>
<li><a href="#sec21">2.1 niceideas ELK-MS TEST</a></li>
<li><a href="#sec21">2.2 Test Scenario Script</a></li>
<li><a href="#sec23">2.3 Used Dataset</a></li>
<li><a href="#sec24">2.4 Test purposes</a></li>
</ul>
</li>
<li><a href="#sec3">3. Conclusions from assessment tests</a>
<ul>
<li><a href="#sec31">3.1 ES-Hadoop and Data-locality enforcement </a></li>
<li><a href="#sec32">3.2 Spark coarse grained scheduling by Mesos vs. Fine Grained</a></li>
<li><a href="#sec33">3.3 Spark Static Resource Allocation vs. Dynamic Allocation</a>
<ul>
<li><a href="#sec331">3.3.1 ES-Hadoop 5.x</a></li>
<li><a href="#sec332">3.3.2 ES-Hadoop 6.x</a></li>
</ul>
</li>
<li><a href="#sec34">3.4 Latency regarding Python instantiation</a></li>
<li><a href="#sec35">3.5 Other ES-Hadoop concerns</a></li>
<li><a href="#sec36">3.6 Other concerns</a></li>
</ul>
</li>
<li><a href="#sec4">4. Further work</a></li>
<li><a href="#sec5">5. Details of Tests</a>
<ul>
<li><a href="#sec51">5.1 Nominal Tests</a>
<ul>
<li><a href="#sec511">5.1.1 Legacy RDD API on bank dataset</a></li>
<li><a href="#sec512">5.1.2 Legacy DataFrame API on bank dataset</a></li>
<li><a href="#sec513">5.1.3 DataFrame API on bank dataset</a></li>
<li><a href="#sec514">5.1.4 DataFrame API on Apache-logs dataset</a></li>
<li><a href="#sec515">5.1.5 DataFrame API on Shakespeare dataset</a></li>
</ul>
</li>
<li><a href="#sec52">5.2 Data-Locality tests</a>
<ul>
<li><a href="#sec521">5.2.1 Bank dataset with 1 shard</a></li>
<li><a href="#sec522">5.2.2 Bank dataset with 2 shards</a></li>
<li><a href="#sec523">5.2.3 Bank dataset with 3 shards</a></li>
<li><a href="#sec524">5.2.4 Bank dataset with 1 shard and replicas</a></li>
<li><a href="#sec525">5.2.5 Testing repartitioning</a></li>
</ul>
</li>
<li><a href="#sec53">5.3 Aggregation tests</a>
<ul>
<li><a href="#sec531">5.3.1 ES-side Aggregations</a></li>
<li><a href="#sec532">5.3.2 Spark-side Aggregations</a></li>
</ul>
</li>
<li><a href="#sec54">5.4 Join test</a></li>
<li><a href="#sec55">5.5 Concurrency test</a></li>
</ul>
</li>
<li><a href="#sec6">6. References</a></li>
</ul>
<br>
<a name="sec1"></a>
<h2>1. Introduction </h2>
<p>
The reader might want to refer to the <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana#sec1">Introduction of the first article in the series</a>.
</p>
<p>
Summarizing it, this article is about assessing the behaviour of the ELK-MS stack using the test cluster introduced in the first article.
<br>
In particular, two questions need to be answered:
</p>
<ol>
<li>
First, how does data-locality optimization work when using ES-Hadoop to read data from ElasticSearch into Spark? On a large cluster, achieving data locality is <i>the sinews of war</i>. Before considering the ELK-MS stack as an actual alternative to a more standard Hadoop stack, assessing the sound behaviour of the software stack and its good respect of data locality is not optional.
</li>
<li>
Second, how does Mesos schedule spark executors, and how does it impact data locality? Mesos needs to be an effective alternative to YARN when it comes to dispatching spark executors while still taking data locality into account.
</li>
</ol>
<p>
These are the two main objectives of the tests, for which I report the conclusions hereafter, along with a few other points.
<br>
This article is not about testing Spark or Mesos themselves; it's really about testing how ElasticSearch / Mesos / Spark behave together to support the application architecture from the schema above.
</p>
<p>
In addition, contrary to the state of the art on Spark, in my current company we are not going to use Java or Scala to implement the spark processing logic; we are going to use python.
<br>
The reason for this is simple: our Data Scientists know python, period. They do not know Java and they are not willing to learn Scala. Our Data Scientists know R and python, and as such, as Head of R&amp;D, I have made python our standard language for our Data Analytics algorithms (not that I don't like R - <i>au contraire</i> - but I believe python sits at the right intersection between Data Science and Engineering).
<br>
Choosing python as the processing language has an impact when it comes to programming Spark: the support for python is, as a matter of fact, a little behind the support for Scala and Java.
</p>
<p>
Now all of the above gives this article its rationale: programming an ElasticSearch / Mesos / Spark task with python is something for which really little documentation is available.
<br>
In the previous article I wanted to present how to set things up, as well as share my setup tools; in this article I want to present how to use it and how it behaves, and share some short sample programs in the form of my tests package.
</p>
<a name="sec2"></a>
<h2>2. Testing Framework </h2>
<p>
I would summarize the specificities of the usage of Spark in my current company as follows:
</p>
<ul>
<li>Data analytics use cases are implemented in pyspark and python scripts and not native Scala or Java APIs</li>
<li>The input data and results are stored in ElasticSearch, not in HDFS</li>
<li>Spark runs on Mesos and not the more standard YARN on Hadoop.</li>
</ul>
<p>
So I needed a way to test and assess that all of this works as expected, and that the behaviour of the Mesos/Spark stack - both from the perspective of concurrency and of data-locality between ES nodes and Spark nodes - is sound.
<br>
This is the objective of the <i>niceideas_ELK-MS-TEST</i> framework.
</p>
<p>
I am presenting this framework, the approach and the tests it contains hereunder.
<br>
The test framework is <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz">available for download here</a>.
</p>
<a name="sec21"></a>
<h3>2.1 niceideas ELK-MS TEST</h3>
<p>
The <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz">niceideas ELK-MS TEST package</a> structure, after being properly extracted in a local folder, is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/testing.png">
<img class="centered" style="width: 450px;" alt="ELK-MS TEST Structure" src="https://www.niceideas.ch/es_spark/images/testing.png" />
</a>
</div>
<br>
<ul>
<li>
<code>./vm_execute.sh</code>: this is the script one calls on the host machine to launch a test. The test to be executed should be given as argument.
</li>
<li>
<code>./tests/*</code>: the test scenario scripts.
</li>
</ul>
<p>
Executing a test on the ELK-MS test cluster is simply done, for instance, with the following command:
</p>
<pre>
badtrash@badbook:/data/niceideas_ELK-MS-TEST$ ./vm_execute.sh scenarii/5_concurrency_1_swissdata_df.sh
</pre>
<p>
This would execute test 5_1 and show the spark driver logs on the console.
</p>
<a name="sec2"></a>
<h3>2.2 Test Scenario Script</h3>
<p>
Each and every test scenario script has the very same structure :
</p>
<ol>
<li>Create an ad'hoc shell script taking care of downloading a dataset and loading it into ElasticSearch</li>
<li>Execute that Data Loading Shell script</li>
<li>Create an ad'hoc python script taking care of implementing the Spark processing</li>
<li>Execute the Data Processing Python script</li>
</ol>
<p>
For instance, a test scenario <code>X_test_Y_variant.sh</code> would have following structure:
</p>
<pre>
#!/bin/bash
# 1. Create Data Ingestion script
# -----------------------------------------------------------------------------
cat > X_test_Y_variant_do.sh <<- "EOF"
#!/bin/bash
# echo commands
set -x
<span style="color: red;"># ...
# Various shell commands to proceed with loading the data in ES
# ...</span>
# turn off command echoing
set +x
EOF
# 2. Execute Data Ingestion Script
# -----------------------------------------------------------------------------
bash X_test_Y_variant_do.sh
if [[ $? != 0 ]]; then
echo "Script execution failed. See previous logs"
exit -1
fi
# 3. Create pyspark script
# -----------------------------------------------------------------------------
cat > X_test_Y_variant.py <<- "EOF"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
# Spark configuration
conf = SparkConf().setAppName("ESTest_X_Y")
# SparkContext and SQLContext
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
<span style="color: red;"># ...
# Various spark processing commands
# ...</span>
EOF
# 4. Execute pyspark script
# -----------------------------------------------------------------------------
spark-submit X_test_Y_variant.py
if [[ $? != 0 ]]; then
echo "Script execution failed. See previous logs"
exit -1
fi
</pre>
<p>
The key point of these scripts is that they are self-contained and idempotent. They make no assumption about the state of the ELK-MS cluster beforehand, and they always start by cleaning all the data before reloading the data required for the tests.
</p>
<a name="sec23"></a>
<h3>2.3 Used Dataset</h3>
<p>
All the test scenarii from the <i>niceideas ELK-MS TEST</i> package use one of the following datasets:
</p>
<ul>
<li>
<b>Bank Dataset</b>: from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a> - a dataset of financial accounts with owner and balance information.
</li>
<li>
<b>Shakespeare Dataset</b>: from <a href="https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json">https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json</a> - the complete work of Shakespeare, every line of every speech.
</li>
<li>
<b>Apache logs Dataset</b>: from <a href="https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz">https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz</a> - a set of Apache web server log files.
</li>
<li>
<b>Swiss AirBnB</b>: two datasets in fact:
<ul>
<li>
from <a href="http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv">http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv</a> : the list of AirBnB offers in Switzerland as of July 2017.
</li>
<li>
from <a href="http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt">http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt</a> : the list of swiss cities with population and geoloc information.
</ul>
</li>
</ul>
<p>
The last dataset, relating to swiss AirBnB offers and city information, is required to test the behaviour of joins on Spark.
<br>
The other datasets represent different volumes and enable us to test various aspects.
</p>
<a name="sec24"></a>
<h3>2.4 Test purposes</h3>
<p>
The tests from the <i>niceideas ELK-MS TEST</i> package are presented in detail, with all results, in section <a href="#sec5">5. Details of Tests</a>.
<br>
Before presenting the conclusions inferred from these tests in the next section, here is a short summary of the purpose of each family of tests:
</p>
<ol>
<li>
<b>Nominal tests</b> - assess how the various kinds of Spark APIs are used to read data from ES: the RDD API, the legacy DataFrame API (SQLContext) and the new DataFrame API (SparkSession).
</li>
<li>
<b>Data-locality tests</b> - assess how data-locality optimization between ES and Spark works and to what extent.
</li>
<li>
<b>Aggregation tests</b> - how aggregation on ES data works.
</li>
<li>
<b>Join tests</b> - how joining two data frames coming from ES works.
</li>
<li>
<b>Concurrency tests</b> - how Mesos / Spark behaves when running several jobs at a time.
</li>
</ol>
<p>
Again, section <a href="#sec5">5. Details of Tests</a> presents each and every test in detail, along with the screenshots of Cerebro and the Spark History Server, the logs of the spark driver, etc.
</p>
<a name="sec3"></a>
<h2>3. Conclusions from assessment tests</h2>
<p>
I am reporting in this section the conclusions that can be drawn from the tests executed in the scope of this work.
<br>
The conclusions and important information are presented in this early section to spare the reader from having to read all the individual tests presented in detail in the next section.
</p>
<a name="sec31"></a>
<h3>3.1 ES-Hadoop and Data-locality enforcement </h3>
<p>
<b>Data locality</b>
</p>
<p>
Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:
</p>
<ul>
<li>
<b><code>PROCESS_LOCAL</code></b> data is in the same JVM as the running code. This is the best locality possible
</li>
<li>
<b><code>NODE_LOCAL</code></b> data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
</li>
<li>
<b><code>NO_PREF</code></b> data is accessed equally quickly from anywhere and has no locality preference
</li>
<li>
<b><code>RACK_LOCAL</code></b> data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
</li>
<li>
<b><code>ANY</code></b> data is elsewhere on the network and not in the same rack
</li>
</ul>
<p>
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same server, or b) immediately start a new task in a farther away place that requires moving data there.
</p>
<p>
What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU.
<br>
The setting indicating how long it should wait before moving the processing elsewhere is <code>spark.locality.wait=10s</code>.
</p>
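<p>
A minimal pyspark sketch of tuning this setting is given hereunder; the 10s value matches the one above, and the application name is a hypothetical placeholder:
</p>
<pre>
from pyspark import SparkContext, SparkConf

# Give Spark up to 10 seconds to obtain an executor local to the data
# before it falls back to a less local level and moves the data instead.
conf = SparkConf().setAppName("ESTest_locality") \
    .set("spark.locality.wait", "10s")
sc = SparkContext(conf=conf)
</pre>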
<p>
<b>ES-Hadoop</b>
</p>
<p>
Data-locality enforcement works amazingly well under nominal conditions. ES-Hadoop makes the Spark scheduler understand the topology of the shards on the ES cluster, and Spark dispatches the processing accordingly. Mesos doesn't interfere in this regard.
</p>
<p>
But again, it works only under nominal conditions.
<br>
As indicated above, several factors can compromise data locality:
</p>
<ol>
<li>
First, imagine that at resource-allocation time the Mesos cluster is heavily loaded. Spark will wait for <code>spark.locality.wait=10s</code>, trying to get the processing executed on the node where ES stores the target data shard.
<br>
But if the node doesn't become free within this period, spark will move the processing elsewhere.
</li>
<li>
The second case is no longer related to Spark, but to ElasticSearch. Imagine that at the very moment the spark executor submits the request to ES (through the ES-Hadoop connector), the co-located ES node is busy doing something else (answering another request, indexing some data, etc.).
<br>
In this case, ES will delegate the answering of the request to another node and local data-locality is broken.
</li>
</ol>
<a name="sec32"></a>
<h3>3.2 Spark coarse grained scheduling by Mesos vs. Fine Grained </h3>
<p>
In <i>Coarse Grained</i> scheduling mode, the default, Mesos considers Spark only at the scale of the required <i>spark executor</i> processes. All Mesos knows about Spark is the executor processes on the nodes they are running on. Mesos knows nothing of Spark's job internals such as stages and tasks.
<br>
In addition, static allocation makes Mesos' job pretty easy: try to allocate as many resources from the cluster to spark executors for pending jobs as are available. This has the following consequences:
</p>
<ol>
<li>
First, if a job is submitted to the cluster at a moment when the cluster is completely free, the job will be allocated the whole cluster. If another job comes even only just a few seconds after, it will still need to wait for the cluster to be freed by the first job, and that will happen only when the first job completes.
</li>
<li>
Second, if several jobs are waiting to be executed, when the cluster is freed, Mesos will allocate the cluster resources evenly to each and every job. Now imagine that all these jobs are short-lived jobs and only one of them is a long-lived job. At allocation time (static allocation), that long-lived job got only a small portion of the cluster. Even if the cluster becomes free again very soon, that job will still need to complete its execution on its small portion, leaving most of the cluster unused.
</li>
</ol>
<p>
Historically, Spark on Mesos could benefit from a <i>Fine Grained</i> scheduling mode instead, where Mesos schedules not just the spark executors on nodes in a rough fashion, but really each and every individual spark task.
<br>
In regards to data-locality optimization, this doesn't seem to have any impact.
<br>
In regards to performance, on the other hand, <i>Fine Grained</i> scheduling mode ruins performance completely.
</p>
<p>
The thing is that Mesos requires quite some time to negotiate with the resource providers. If that negotiation happens for every individual spark task, a huge amount of time is lost and eventually the impact on performance is not acceptable.
</p>
<p>
For this reason (and others), the <i>Fine Grained</i> scheduling mode is deprecated: <a href="https://issues.apache.org/jira/browse/SPARK-11857">https://issues.apache.org/jira/browse/SPARK-11857</a>
</p>
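<p>
For reference, the scheduling mode is selected through a single setting; since coarse-grained is the default, the flag below is only shown for completeness (a minimal sketch):
</p>
<pre>
from pyspark.conf import SparkConf

conf = SparkConf().setAppName("SchedulingModeExample")
# Coarse-grained mode is the default; setting this to "false" would
# select the (now deprecated) fine-grained mode instead
conf.set("spark.mesos.coarse", "true")
</pre>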
<a name="sec33"></a>
<h3>3.3 Spark Static Resource Allocation vs. Dynamic Allocation </h3>
<p>
By default, Spark's scheduler uses a <b>Static Resource Allocation</b> system. This means that, at job (or driver) initialization time, Spark, with the help of Mesos in this case, decides what resources from the Mesos cluster can be allocated to the job. This decision is static, meaning that once taken, the set of resources allocated to the job will never change during its whole life, regardless of what happens on the cluster (other / additional nodes becoming free, etc.)
<br>
This has the consequences listed in the previous section (the whole cluster allocated to a single job, further jobs needing to wait, etc.) and as such it's not very optimal.
</p>
<p>
Now of course, Spark provides a solution to this: the <b>Dynamic Allocation</b> system.
</p>
<p>
<b>And this is where Spark gets really cool.</b> With Dynamic Allocation, the Spark / Mesos cluster is evenly shared between the multiple jobs requesting execution on the cluster, regardless of when those jobs appear.
<br>
And with ES-Hadoop 6.x, the Dynamic Allocation system is perfectly able to honour the locality requirements communicated by the elasticsearch-spark connector, and respects them as much as possible.
</p>
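<p>
As a minimal sketch, enabling Dynamic Allocation boils down to the two settings below; on Mesos, it additionally requires the external shuffle service to be running on every agent node:
</p>
<pre>
from pyspark.conf import SparkConf

conf = SparkConf().setAppName("DynamicAllocationExample")
conf.set("spark.dynamicAllocation.enabled", "true")
# Required for dynamic allocation: executors can only be torn down if
# an external shuffle service keeps serving their shuffle files
conf.set("spark.shuffle.service.enabled", "true")
</pre>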
<a name="sec331"></a>
<h4>3.3.1 ES-Hadoop 5.x and Spark 2.2</h4>
<p>
With ES-Hadoop version 5.x, the way the elasticsearch-spark connector enforced data locality was incompatible with Spark 2.2.0.
<br>
Unfortunately, when using Dynamic Allocation, Spark simply didn't take ES-Hadoop's requirements regarding data-locality optimization into consideration anymore.
</p>
<p>
Without going into details, the problem comes from the fact that ES-Hadoop makes Spark request as many executors as there are shards and indicates as <i>preferred locations</i> the nodes owning the ES shards.
<br>
But Dynamic Allocation screws all of this up by allocating executors only one after the other (more or less), and only after monitoring the evolution of the job's processing needs and the number of tasks created. In no way does the dynamic allocation system give any consideration to ES-Hadoop's requirements.
</p>
<a name="sec332"></a>
<h4>3.3.2 ES-Hadoop 6.x</h4>
<p>
As indicated in the release notes of the ElasticSearch-Hadoop connector 6.0.0, the Elastic team added support for Spark 2.2.0. This fixed the problem with Dynamic Allocation that ES-Hadoop 5.x was suffering from.
</p>
<p>
Now even with Dynamic Allocation properly enabled, which is a requirement for us in order to optimize the Mesos cluster resource consumption, data locality is optimized and properly enforced whenever possible.
</p>
<a name="sec34"></a>
<h3>3.4 Latency regarding Python instantiation</h3>
<p>
Executing tasks in Python takes time in comparison to executing tasks natively in Java or Scala. The problem is that spark tasks in Python require launching the individual task processing in a separate process from the Spark JVM. Only Java and Scala Spark processings run natively in the Spark JVM.
</p>
<p>
This problem is not necessarily a big deal since the DataFrame and RDD APIs exposed to python pyspark scripts are actually implemented by Scala code underneath: they resolve to native Scala code.
<br>
There is one notable exception in this regard: UDFs (User Defined Functions) implemented in Python. While this is perfectly possible, it should be avoided at all costs.
<br>
One can very well still use pyspark but write the UDFs in Scala or Java, as sketched below.
</p>
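<p>
As an illustration of the difference, consider the two variants below. This is a minimal sketch reusing the <code>sqlContext</code> and <code>es_df</code> names from the test scripts further down; the <code>com.example.Squared</code> class is hypothetical and is assumed to implement <code>org.apache.spark.sql.api.java.UDF1</code> and to be on the classpath (Spark 2.1+):
</p>
<pre>
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Slow path: a Python UDF; every row is shipped to a Python worker process
squared_py = udf(lambda x: x * x, LongType())
es_df.select(squared_py("balance")).show()

# Fast path: register a JVM-side (Scala / Java) UDF and call it from
# pyspark; the processing never leaves the Spark JVM
sqlContext.registerJavaFunction("squared", "com.example.Squared")
es_df.selectExpr("squared(balance)").show()
</pre>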
<ul>
<li>An explanation of this problem : <a href="https://fr.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark">https://fr.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark</a>
</li>
<li>UDF in Scala : <a href="http://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html">http://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html</a>
</li>
</ul>
<a name="sec35"></a>
<h3>3.5 Other ES-Hadoop concerns</h3>
<p>
<b>Repartitioning</b>
</p>
<p>
I couldn't find a way to make repartitioning work the way I want, meaning redistributing the data on the cluster in order to scale out the subsequent workload.
<br>
I am not saying there is no way, just that I haven't found one so far.
</p>
<p>
As such, a sound approach regarding initial sharding in ES should be adopted. One should take into consideration that, <i>a priori</i>, initial sharding may well drive the way Spark will be able to scale the processing out on the cluster.
<br>
While creating one shard per cluster node by default would definitely be overkill, the general idea should tend in this direction.
</p>
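<p>
For instance, the shard count is fixed at index-creation time; a minimal sketch of creating an index sized for the 3-node cluster used here (the index name <code>my-index</code> is hypothetical):
</p>
<pre>
import requests

# One shard per data node plus a replica, so reads can be
# parallelized and served locally on every node
settings = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}
requests.put("http://192.168.10.10:9200/my-index", json=settings)
</pre>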
<p>
<b>ES level aggregations</b>
</p>
<p>
It's simply impossible to forge a query from Spark to ElasticSearch through ES-Hadoop that would make ElasticSearch compute aggregations and return them instead of the raw data.
<br>
Such advanced querying features are not available from Spark.
</p>
<p>
The need is well identified but it remains <i>Work in Progress</i> at the moment: <a href="https://github.com/elastic/elasticsearch-hadoop/issues/276">https://github.com/elastic/elasticsearch-hadoop/issues/276</a>.
</p>
<a name="sec36"></a>
<h3>3.6 Other concerns</h3>
<p>
<b>Spark History Server</b>
</p>
<p>
Running Spark on Mesos, there is no long-lived Spark process. Spark executors are created when required by Mesos, and the Mesos master and slave processes are the only long-lived processes on the cluster in this regard.
</p>
<p>
As such, the Spark Application UI (on ports 4040, 4041, etc.) only lives for the duration of the Spark processing. When the job is finished, the Spark UI application vanishes.
</p>
<p>
For this reason, Spark provides a History Server. The installation and operation of the History Server is presented in <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana">the first article of this series: ELK-MS - part I : setup the cluster</a>.
</p>
<p>
Interestingly, the History Server supports the same JSON / REST API as the usual Spark Console, with only very few limitations.
<br>
For instance, one can use the REST API to discover the Application-IDs of running jobs in order to kill them (whenever required). For this, simply list the jobs and find those that have <code>"endTimeEpoch" : -1</code>, meaning the application is still alive:
</p>
<pre>
curl -XGET http://192.168.10.10:18080/api/v1/applications
</pre>
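<p>
The same lookup can be scripted; a minimal sketch (assuming the History Server runs on <code>192.168.10.10:18080</code> as above):
</p>
<pre>
import requests

apps = requests.get("http://192.168.10.10:18080/api/v1/applications").json()
# An application is still alive if one of its attempts has no end time yet
running = [app["id"] for app in apps
           if any(att.get("endTimeEpoch", 0) == -1 for att in app["attempts"])]
print(running)
</pre>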
<p>
<b>Limitations of the ELK-MS stack</b>
</p>
<p>
As stated in the previous article, ElasticSearch is not a distributed filesystem; it's a document-oriented NoSQL database.
</p>
<p>
There are situations where a distributed filesystem provides interesting possibilities. Those are not provided by the ELK-MS stack as is. It would be interesting to test Ceph on Mesos for this. See <a href="http://tracker.ceph.com/projects/ceph/wiki/Ceph-mesos"> http://tracker.ceph.com/projects/ceph/wiki/Ceph-mesos</a>.
</p>
<a name="sec4"></a>
<h2>4. Further work</h2>
<p>
I am still considering some next steps on the topic of the ELK-MS stack testing since there are still a few things I would like to test or assess:
</p>
<p>
As a raw list:
</p>
<ul>
<li>
Find out how to cap the number of nodes booked by Mesos for a single spark job in order to avoid fully booking the cluster (see the sketch after this list).
</li>
<li>
ElasticSearch on Mesos
<ul>
<li>
This seems quite obvious. I expect the overall cluster performance to be way better if Mesos and ES don't compete with each other for hardware resources on nodes.
</li>
<li>
There are workarounds of course, such as configuring Mesos to avoid using all the CPUs of a node. But that will never be as efficient as letting Mesos distribute the global workload.
</li>
</ul>
</li>
<li>
Find a way for repartitioning to work the way I intend it: data should get redistributed across the cluster!
</li>
<li>
Give Spark Streaming a try to reduce latency.
<ul>
<li>
<a href="https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-streaming">https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-streaming</a>
</li>
</ul>
</li>
<li>
Try FAIR Spark scheduler and play with it.
<ul>
<li>
I got satisfying results using Spark's FIFO scheduler in terms of concurrency and haven't seen the need to change to FAIR.
</li>
<li>
It really seems Mesos takes care of everything and I really do not see what the FAIR scheduler could change, but I want to be sure.
</li>
<li>
There are some chances that this makes me rewrite this whole article ... in another article.
</li>
</ul>
</li>
<li>
Ceph integration on Mesos for binary files processing.
<ul>
<li>
How to integrate Ceph and Spark? Here as well, very little documentation seems to be available.
</li>
<li>
I found pretty much only this : <a href="https://indico.cern.ch/event/524549/contributions/2185930/attachments/1290231/1921189/2016.06.13_-_Spark_on_Ceph.pdf">https://indico.cern.ch/event/524549/contributions/2185930/attachments/1290231/1921189/2016.06.13_-_Spark_on_Ceph.pdf</a>
</li>
</ul>
</li>
<li>
What about HDFS on Mesos?
<ul>
<li>
I would want to give it a try even though I am really rather considering Ceph for the use cases ElasticSearch prevents me from addressing.
</li>
<li>
The thing is that Ceph integrates much better into the UNIX unified filesystem than HDFS.
</li>
<li>
There is an approach to reach the same level of integration with HDFS based on FUSE: <a href="https://wiki.apache.org/hadoop/MountableHDFS">https://wiki.apache.org/hadoop/MountableHDFS</a>. But that is still limited (it doesn't support ownership information for now).
</li>
</ul>
</li>
</ul>
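<p>
Regarding the first point of the list above, a plausible starting point (not tested in the scope of this work) is Spark's <code>spark.cores.max</code> setting, which in coarse-grained mode caps the total number of cores a single job may book on the cluster:
</p>
<pre>
from pyspark.conf import SparkConf

conf = SparkConf().setAppName("CappedJob")
# Cap the total cores this job may acquire across the whole cluster
# (the value 8 is purely illustrative)
conf.set("spark.cores.max", "8")
</pre>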
<a name="sec5"></a>
<h2>5. Details of Tests</h2>
<p>
This very big section now presents each and every test in detail, along with the results in the form of the logs of the scripts (data feeding and spark driver logs) and the screenshots of the UI applications (Cerebro, Mesos Console, Spark History Server).
</p>
<p>
The conclusions from the individual tests have been reported in the global <a href="#sec3">3. Conclusions from assessment tests</a> section above.
</p>
<a name="sec51"></a>
<h3>5.1 Nominal Tests</h3>
<p>
<b>Nominal tests</b> - assess how the various kinds of APIs of Spark are used to read data from ES: the RDD API, the legacy DataFrame API (SQLContext) and the new DataFrame API (SQLSession).
</p>
<a name="sec511"></a>
<h4>5.1.1 Legacy RDD API on bank dataset</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_1_test_bank_rdd_legacy.sh"><code>1_nominal_1_test_bank_rdd_legacy.sh</code></a></li>
<li><b>Input Dataset</b>: Bank Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a></li>
<li><b>Purpose</b>: see how Spark's RDD API can be used to fetch data from ElasticSearch and how sharding in ES impacts the executor layout on the cluster
</ul>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
# Spark configuration
conf = SparkConf().setAppName("ESTest_1_1")
# SparkContext and SQLContext
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Simplest possible query
q = "?q=*"
es_read_conf = {
    "es.resource" : "bank",
    "es.query" : q
}
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
es_df = sqlContext.createDataFrame(es_rdd)
# I need to collect the result to show them on the console
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count : THIS IS FUNNY :
# it relaunches the whole Distributed Data Frame Processing
print ("Fetched %s accounts (re-computed)") % es_df.count()
# Print count
print ("Fetched %s accounts (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_1_test_bank_rdd_legacy.log">1_nominal_1_test_bank_rdd_legacy.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_1.png' />
</div>
</a>
<figcaption><b>Test 1-1 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_1.png' />
</div>
</a>
<figcaption><b>Test 1-1 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_1.png' />
</div>
</a>
<figcaption><b>Test 1-1 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_1_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_1_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 1-1 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
Spark can read data from ElasticSearch using the ES-Hadoop connector and the RDD API really out of the box.
<br>
One just needs to configure a few settings to the <code>newAPIHadoopRDD</code> API:
<ul>
<li><code>inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat"</code></li>
<li><code>keyClass="org.apache.hadoop.io.NullWritable"</code></li>
<li><code>valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable"</code></li>
</ul>
</li>
<li>
Mesos spreads the workload on the cluster efficiently.
<ul>
<li>This test was run alone on the cluster</li>
<li>2 nodes are sufficient to run the job since, thanks to replicas, two nodes actually hold all the shards</li>
<li>Mesos creates a dedicated spark executor on each of the 2 nodes</li>
<li>Spark then successfully distributes the RDD across the 2 executors</li>
</ul>
</li>
<li>
Data-locality optimization works out of the box.
<ul>
<li>There are 5 shards in ElasticSearch, which, with replicas, are well spread on the cluster</li>
<li>Mesos / Spark dispatches the workload efficiently since it creates 5 RDD partitions for the 5 shards, each and every one of them respecting data locality (<code>NODE_LOCAL</code>) and as such the requirements given by the ES-Hadoop connector.
</li>
</ul>
</li>
</ul>
<a name="sec512"></a>
<h4>5.1.2 Legacy DataFrame API on bank dataset</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_2_test_bank_df_legacy.sh"><code>1_nominal_2_test_bank_df_legacy.sh</code></a></li>
<li><b>Input Dataset</b>: Bank Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a></li>
<li><b>Purpose</b>: see how Spark's legacy DataFrame API can be used to fetch data from ElasticSearch and how sharding in ES impacts the executor layout on the cluster
</ul>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
# Spark configuration
conf = SparkConf().setAppName("ESTest_1_2")
## !!! Caution : this is pre 2.0 API !!!
# SparkContext and SQLContext
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
es_df = sqlContext.read \
    .format("org.elasticsearch.spark.sql") \
    .options(pushdown=True) \
    .load("bank") \
    .where("gender='F'")
# I need to collect the result to show them on the console
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count : THIS IS FUNNY :
# it relaunches the whole Distributed Data Frame Processing
print ("Fetched %s women accounts (re-computed)") % es_df.count()
# Print count
print ("Fetched %s women accounts (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_2_test_bank_df_legacy.log">1_nominal_2_test_bank_df_legacy.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_2.png' />
</div>
</a>
<figcaption><b>Test 1-2 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_2.png' />
</div>
</a>
<figcaption><b>Test 1-2 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2.png' />
</div>
</a>
<figcaption><b>Test 1-2 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 1-2 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
Spark can read data from ElasticSearch using the ES-Hadoop connector and the Legacy DataFrame API (SQLContext) really out of the box.
<br>
The single configuration required is <code>format("org.elasticsearch.spark.sql")</code> on the <code>SQLContext</code> API.
</li>
<li>
Here as well, the Dynamic Allocation system allocates nodes to the job one after the other.
<br>
After two nodes are allocated to the job, all shards (thanks to replicas) become available locally and data-locality optimization can be satisfied without any other node required. The job executes on these 2 nodes.
</li>
<li>
In this case, as seen on <a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_2_job_0_stage_0.png">result_1_2_job_0_stage_0.png</a>, spark successfully respects data locality as well.
</li>
</ul>
<a name="sec513"></a>
<h4>5.1.3 DataFrame API on bank dataset</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_3_test_bank_df.sh"><code>1_nominal_3_test_bank_df.sh</code></a></li>
<li><b>Input Dataset</b>: Bank Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a></li>
<li><b>Purpose</b>: see how Spark's New (>= 2.0) DataFrame API can be used to fetch data from ElasticSearch and how sharding in ES impacts the executor layout on the cluster
</ul>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession
# Spark configuration
conf = SparkConf().setAppName("ESTest_1_3")
# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(pushdown=True) \
    .load("bank") \
    .where("gender='F'")
# I need to collect the result to show them on the console
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count : THIS IS FUNNY :
# it relaunches the whole Distributed Data Frame Processing
print ("Fetched %s women accounts (re-computed)") % es_df.count()
# Print count
print ("Fetched %s women accounts (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_3_test_bank_df.log">1_nominal_3_test_bank_df.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_3.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_3.png' />
</div>
</a>
<figcaption><b>Test 1-3 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_3.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_3.png' />
</div>
</a>
<figcaption><b>Test 1-3 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_3.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_3.png' />
</div>
</a>
<figcaption><b>Test 1-3 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_3_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_3_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 1-3 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
Spark can read data from ElasticSearch using the ES-Hadoop connector and the New DataFrame API (SQLSession) really out of the box.
<br>
The single configuration required here as well is <code>format("org.elasticsearch.spark.sql")</code> on the <code>SQLSession</code> API.
</li>
<li>
Nothing specific to report regarding the other aspects: Mesos and Spark's dynamic allocation system distribute the workload as expected, still creating a dedicated Spark Executor on 2 nodes of the cluster, which is sufficient (thanks to replicas); Spark respects data locality strictly, etc.
</li>
</ul>
<a name="sec514"></a>
<h4>5.1.4 DataFrame API on Apache-logs dataset</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_4_test_apache-logs_df.sh"><code>1_nominal_4_test_apache-logs_df.sh</code></a></li>
<li><b>Input Dataset</b>: Apache Logs Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz">https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz</a></li>
<li><b>Purpose</b>: see how Spark's New (>= 2.0) DataFrame API can be used to fetch another dataset from ElasticSearch and how sharding in ES impacts the executor layout on the cluster
</ul>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession
# Spark configuration
conf = SparkConf().setAppName("ESTest_1_4")
# es.read.field.exclude (default empty) :
# Fields/properties that are discarded when reading the documents
# from Elasticsearch
conf.set ("es.read.field.exclude", "relatedContent")
# es.read.field.as.array.include (default empty) :
# Fields/properties that should be considered as arrays/lists
conf.set ("es.read.field.as.array.include", "@tags,headings,links")
# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf = {
    "pushdown": True
}
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(conf=es_query_conf) \
    .load("apache-logs-*")
# I need to collect the result to show them on the console
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count : THIS IS FUNNY :
# it relaunches the whole Distributed Data Frame Processing
print ("Fetched %s logs (re-computed)") % es_df.count()
# Print count
print ("Fetched %s logs (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_4_test_apache-logs_df.log">1_nominal_4_test_apache-logs_df.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_4.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_4.png' />
</div>
</a>
<figcaption><b>Test 1-4 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_4.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_4.png' />
</div>
</a>
<figcaption><b>Test 1-4 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_4.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_4.png' />
</div>
</a>
<figcaption><b>Test 1-4 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_4_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_4_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 1-4 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<p>
N.D. (nothing to declare). Everything works as expected (see the previous test results from the <a href="#sec51">5.1 Nominal Tests</a> family).
<br>
Interestingly here, the workload justifies booking the three nodes of the cluster, which is successfully achieved since the job runs alone on the cluster.
</p>
<a name="sec515"></a>
<h4>5.1.5 DataFrame API on Shakespeare dataset</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/1_nominal_5_test_shakespeare.sh"><code>1_nominal_5_test_shakespeare.sh</code></a></li>
<li><b>Input Dataset</b>: Shakespeare's Works Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json">https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json</a></li>
<li><b>Purpose</b>: see how Spark's New (>= 2.0) DataFrame API can be used to fetch another dataset from ElasticSearch and how sharding in ES impacts the executor layout on the cluster
</ul>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession
# Spark configuration
conf = SparkConf().setAppName("ESTest_1_5")
# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf = {
    "pushdown": True
}
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(conf=es_query_conf) \
    .load("shakespeare*")
# Collect result to the driver
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count : THIS IS FUNNY :
# it relaunches the whole Distributed Data Frame Processing
print ("Fetched %s logs (re-computed)") % es_df.count()
# Print count
print ("Fetched %s logs (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/1_nominal_5_test_shakespeare.log">1_nominal_5_test_shakespeare.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_5.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_1_5.png' />
</div>
</a>
<figcaption><b>Test 1-5 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_5.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_1_5.png' />
</div>
</a>
<figcaption><b>Test 1-5 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_5.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_5.png' />
</div>
</a>
<figcaption><b>Test 1-5 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_5_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_1_5_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 1-5 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<p>
N.D. (nothing to declare). Everything works as expected (see the previous test results from the <a href="#sec51">5.1 Nominal Tests</a> family).
<br>
This time, however, due to the lack of replicas, the three nodes are actually required to satisfy data-locality optimization. The allocation of the three nodes to the job happens successfully again since the job runs alone.
</p>
<a name="sec52"></a>
<h3>5.2 Data-locality tests</h3>
<p>
<b>Data-locality tests</b> - assess how data-locality optimization between ES and Spark works and to what extent.
</p>
<a name="sec521"></a>
<h4>5.2.1 Bank dataset with 1 shard</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_1_bank_one_shard.sh"><code>2_collocation_1_bank_one_shard.sh</code></a></li>
<li><b>Input Dataset</b>: Bank Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a></li>
<li><b>Purpose</b>: Assess how data-locality works when using a dataset with a single shard on a single node of the cluster, and see what decisions Mesos / Spark will take based on ES-Hadoop's requirements.
</ul>
<p>
<b>Expected Behaviour</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi_data_2_1.png">
<img class="centered" style="width: 600px;" alt="Data Architecture Test 2_1" src="https://www.niceideas.ch/es_spark/images/system_archi_data_2_1.png" />
</a>
</div>
<br>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession
# Spark configuration
conf = SparkConf().setAppName("ESTest_2_1")
# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
# (1)
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(pushdown=True) \
    .load("bank") \
    .where("gender='F'")
# (2) Print size of every partition on nodes
def f(iterable):
    # need to convert iterable to list to have len()
    print ("Fetched %s rows on node") % len(list(iterable))
es_df.foreachPartition(f)
# (3) I need to collect the result to show them on the console
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count
print ("Fetched %s women accounts (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_1_bank_one_shard.log">2_collocation_1_bank_one_shard.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_1.png' />
</div>
</a>
<figcaption><b>Test 2-1 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_1.png' />
</div>
</a>
<figcaption><b>Test 2-1 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_1.png' />
</div>
</a>
<figcaption><b>Test 2-1 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_1_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_1_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 2-1 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
<b>In terms of workload distribution, this is where Dynamic Allocation is really cool.</b> Since the single shard is on a single node, the Spark Dynamic Allocation System, with the help of Mesos, takes care of booking that single node as well for the Spark processing job.
</li>
<li>
As a sidenote, using static allocation here, Mesos / Spark would have booked the whole cluster for the job, which would have been far from optimal in terms of workload distribution. Since the cluster is fully available, Mesos would have booked it all for the job to come, but eventually 2 of the 3 spark executors wouldn't have been used at all.
<br>
That wouldn't have been a big deal since this test runs alone. But if more jobs were added to the cluster and requested executors, they would have to wait for that first job to finish before they could share the cluster among them.
</li>
<li>Data-locality works as expected. The single shard is located on <code>192.168.10.12</code> and both the driver logs and the Spark Console for Job 0 / Stage 0 confirm that the co-located Spark executor has been the only one processing the data.
</li>
</ul>
<a name="sec522"></a>
<h4>5.2.2 Bank dataset with 2 shards</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_2_bank_two_shards.sh"><code>2_collocation_2_bank_two_shards.sh</code></a></li>
<li><b>Input Dataset</b>: Bank Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a></li>
<li><b>Purpose</b>: Assess how data-locality optimization works when using a dataset with two shards on two nodes of the cluster, and see what decisions Mesos / Spark will take based on ES-Hadoop's requirements.
</ul>
<p>
<b>Expected Behaviour</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi_data_2_2.png">
<img class="centered" style="width: 600px;" alt="Data Architecture Test 2_2" src="https://www.niceideas.ch/es_spark/images/system_archi_data_2_2.png" />
</a>
</div>
<br>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession
# Spark configuration
conf = SparkConf().setAppName("ESTest_2_2")
# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
# (1)
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(pushdown=True) \
    .load("bank") \
    .where("gender='F'")
# (2) Print size of every partition on nodes
def f(iterable):
    # need to convert iterable to list to have len()
    print ("Fetched %s rows on node") % len(list(iterable))
es_df.foreachPartition(f)
# (3) I need to collect the result to show them on the console
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count
print ("Fetched %s women accounts (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_2_bank_two_shards.log">2_collocation_2_bank_two_shards.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_2.png' />
</div>
</a>
<figcaption><b>Test 2-2 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_2.png' />
</div>
</a>
<figcaption><b>Test 2-2 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_2.png' />
</div>
</a>
<figcaption><b>Test 2-2 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_2_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_2_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 2-2 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
Same remark as above regarding workload distribution.
<br>
<b>Dynamic Allocation is really cool.</b> Since the two shards are on two nodes, the Spark Dynamic Allocation System, with the help of Mesos, takes care of booking the two corresponding nodes as well for the Spark processing job.
</li>
<li>Data locality works as expected. The two shards are on <code>192.168.10.10</code> and <code>192.168.10.12</code>, and both the driver logs and the Spark Console for Job 0 / Stage 0 confirm that both co-located Spark executors have been used to process the 2 shards.
<br>
The tasks have been executed with <code>NODE_LOCAL</code> locality level.
</li>
</ul>
<a name="sec523"></a>
<h4>5.2.3 Bank dataset with 3 shards</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_3_bank_three_shards.sh"><code>2_collocation_3_bank_three_shards.sh</code></a></li>
<li><b>Input Dataset</b>: Bank Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a></li>
<li><b>Purpose</b>: Assess how data-locality optimization works when using a dataset with three shards on three nodes of the cluster, and see what decisions Mesos / Spark will take based on ES-Hadoop's requirements.
</ul>
<p>
<b>Expected Behaviour</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi_data_2_3.png">
<img class="centered" style="width: 600px; " alt="Data Architecture Test 2_3" src="https://www.niceideas.ch/es_spark/images/system_archi_data_2_3.png" />
</a>
</div>
<br>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession
# Spark configuration
conf = SparkConf().setAppName("ESTest_2_3")
# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
# (1)
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(pushdown=True) \
    .load("bank") \
    .where("gender='F'")
# (2) Print size of every partition on nodes
def f(iterable):
    # need to convert iterable to list to have len()
    print ("Fetched %s rows on node") % len(list(iterable))
es_df.foreachPartition(f)
# (3) I need to collect the result to show them on the console
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count
print ("Fetched %s women accounts (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_3_bank_three_shards.log">2_collocation_3_bank_three_shards.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_3.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_3.png' />
</div>
</a>
<figcaption><b>Test 2-3 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_3.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_3.png' />
</div>
</a>
<figcaption><b>Test 2-3 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_3.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_3.png' />
</div>
</a>
<figcaption><b>Test 2-3 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_3_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_3_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 2-3 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
Three shards on three nodes so three nodes booked for processing, everything works as expected.
</li>
<li>Data locality works as expected: the 3 spark executors <i>consume</i> data from their co-located shards, as can be seen in the driver logs and in the Spark Application UI.
</li>
</ul>
<a name="sec524"></a>
<h4>5.2.4 Bank dataset with 1 shard and replicas</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_4_bank_one_shard_with_replicas.sh"><code>2_collocation_4_bank_one_shard_with_replicas.sh</code></a></li>
<li><b>Input Dataset</b>: Bank Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a></li>
<li><b>Purpose</b>: Assess how data-locality works when using a dataset with one shard and two replicas on three nodes of the cluster, and see what decisions Mesos / Spark will take based on ES-Hadoop's requirements.
</ul>
<p>
<b>Expected Behaviour</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi_data_2_4.png">
<img class="centered" style="width: 600px;" alt="Data Architecture Test 2_4" src="https://www.niceideas.ch/es_spark/images/system_archi_data_2_4.png" />
</a>
</div>
<br>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession
# Spark configuration
conf = SparkConf().setAppName("ESTest_2_4")
# Note : setting "spark.es.input.max.docs.per.partition" doesn't really
# help here => it does split the dataframe on several nodes indeed,
# but it doesn't impact the fetching
# conf.set ("spark.es.input.max.docs.per.partition", 100)
# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
# (1)
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(pushdown=True) \
    .load("bank") \
    .where("gender='F'")
# (2) Print size of every partition on nodes
def f(iterable):
    # need to convert iterable to list to have len()
    print ("Fetched %s rows on node") % len(list(iterable))
es_df.foreachPartition(f)
# (3) I need to collect the result to show them on the console
data_list = es_df.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count
print ("Fetched %s women accounts (from collected list)") % len (data_list)
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_4_bank_one_shard_with_replicas.log">2_collocation_4_bank_one_shard_with_replicas.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_4.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_4.png' />
</div>
</a>
<figcaption><b>Test 2-4 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_4.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_4.png' />
</div>
</a>
<figcaption><b>Test 2-4 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_4.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_4.png' />
</div>
</a>
<figcaption><b>Test 2-4 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_4_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_4_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 2-4 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
In this case, it's really as if each and every ES node of the cluster had a copy of the data. ElasticSearch makes no distinction between primary shards and secondary shards (replicas) when it comes to serving requests.
</li>
<li>
Which node finally executes the processing is really random. Over several executions, I always ended up having a different Spark node executing the whole processing. Data co-locality still kicks in, and a single node still does the whole processing every time.
</li>
<li>
<b>Important note: All of the above works under normal behaviour. Under a heavily loaded cluster, the results can be significantly different. </b>
<br>
By running different scenarii under different conditions, I have been able to determine 2 different situations in addition to the nominal one (the one on a free cluster):
<ul>
<li>
First, it can happen that Mesos tries to distribute a specific spark processing part to the Spark executor co-located with the ES shard.
<br>
But then, when the Spark processing finally queries that local node to get the shard, it can well happen that this ES node is busy answering a different request from a different client application.
<br>
In this case, that local ES node will report itself as busy and will ask another node from the ES cluster to serve the request.
<br>
So even though Mesos / Spark initially distributed the workload to the node local to the shard in ES, <b>eventually the request will be served by another, distant node from the cluster</b>.
</li>
<li>
Second, it can also happen that all the nodes co-located with the ES nodes owning the shard (primary and replicas) are busy.
<br>
In this case, Mesos / Spark will only wait a few seconds for one of these nodes to become free, and if that fails to happen, <b>eventually a different Mesos node will run the processing, regardless of data locality</b>.
</li>
</ul>
</li>
<li>
The difference here is that the existence of replicas suddenly <b>gives ElasticSearch the choice</b>.
<br>
ElasticSearch can answer and serve the data from a node other than the local node if the local node is suddenly busy!
</li>
<li>
In addition, Mesos / Spark will only wait <code>spark.locality.wait=10s</code> to try to make the specific processing part local to the ES node owning a shard (or a replica, BTW). If none of the nodes owning one of the primary shards or replicas becomes free and available within that amount of time, then Mesos will distribute the workload to another available node of the Mesos cluster.
</li>
</ul>
<a name="sec525"></a>
<h4>5.2.5 Testing repartitioning</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/2_collocation_5_bank_one_shard_repartition_NOT_WORKING.sh"><code>2_collocation_5_bank_one_shard_repartition_NOT_WORKING.sh</code></a></li>
<li><b>Input Dataset</b>: Bank Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip">https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip</a></li>
<li><b>Purpose</b>: Assess how one can redistribute the data on the cluster after loading it from a sub-set of the cluster nodes (for instance a single node), i.e. see how Spark can redistribute unbalanced data evenly on the cluster after loading it.
</ul>
<p>
<b>Expected Behaviour</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi_data_2_5.png">
<img class="centered" style="width: 600px;" alt="Data Architecture Test 2_5" src="https://www.niceideas.ch/es_spark/images/system_archi_data_2_5.png" />
</a>
</div>
<br>
<p>
<b>Relevant portion of spark Script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession
# Spark configuration
conf = SparkConf().setAppName("ESTest_2_5")
# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
# (1)
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(pushdown=True) \
    .load("bank") \
    .where("gender='F'")
# Print size of every partition on nodes
def f(iterable):
    # need to convert iterable to list to have len()
    print ("A - %s rows stored on node") % len(list(iterable))
es_df.foreachPartition(f)
# Doesn't help
#es_df2 = es_df.coalesce(1)
## Print size of every partition on nodes
#es_df2.foreachPartition(f)
# (2)
es_df3 = es_df.repartition(4 * 3)
# Print size of every partition on nodes
es_df3.foreachPartition(f)
# (3) I need to collect the result to show them on the console
data_list = es_df3.collect()
print ("Printing 10 first results")
for x in data_list[0:10]:
    print x
# Print count
print ("Fetched %s women accounts (from collected list)") % len (data_list)
# Print
print (ss._jsc.sc().getExecutorMemoryStatus().size())
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/2_collocation_5_bank_one_shard_repartition_NOT_WORKING.log">2_collocation_5_bank_one_shard_repartition.log</a>
<li><b>Screenshots from the various admin console after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_5.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_2_5.png' />
</div>
</a>
<figcaption><b>Test 2-5 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_5.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_2_5.png' />
</div>
</a>
<figcaption><b>Test 2-5 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_5.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_5.png' />
</div>
</a>
<figcaption><b>Test 2-5 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_5_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_2_5_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 2-5 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li><b>I haven't been able to make repartitioning work the way I intended it to work</b>.
<ul>
<li>Eventually all of my tests led to the underlying RDD being repartitioned, but all the partitions remain local to the initially owning node</li>
<li>I never managed to find a way to make the Spark cluster redistribute the different partitions to the various Spark executors from the cluster</li>
</ul>
<li>
I don't know whether that comes from Spark somehow <i>knowing</i> that it doesn't need to redistribute anything for the post-processing to be done efficiently.
</li>
<li>
Long story short, I have no real conclusion in this regard, which is why the above schema is crossed out with an X. A simple way to inspect the resulting partition layout from the driver is sketched below.
</li>
</ul>
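<p>
For completeness, here is a minimal sketch (not part of the test scripts above) of how the partition layout can be inspected from the driver using <code>glom()</code>, rather than by printing from within the executors as the script does:
</p>
<pre>
# Sketch: inspect partition sizes from the driver.
# es_df is assumed to be the DataFrame loaded from ES above.

# glom() turns each partition into a single list, so mapping len()
# over it yields one row count per partition.
partition_sizes = es_df.rdd.glom().map(len).collect()
for i, size in enumerate(partition_sizes):
    print("partition %s holds %s rows" % (i, size))

# After repartitioning, the same check shows the new layout (more,
# smaller partitions - though, as observed above, they may well stay
# local to the node that initially owned the data).
print(es_df.repartition(12).rdd.glom().map(len).collect())
</pre>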
<a name="sec53"></a>
<h3>5.3 Aggregation tests</h3>
<p>
<b>Aggregation tests</b> - assess how aggregation on ES data works.
</p>
<a name="sec531"></a>
<h4>5.3.1 ES-side Aggregations</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/3_aggregation_1_es_shakespeare_rdd_legacy_NOT_WORKING.sh"><code>3_aggregation_1_es_shakespeare_rdd_legacy_NOT_WORKING.sh</code></a></li>
<li><b>Input Dataset</b>: Shakespeare's Works Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json">https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json</a></li>
<li><b>Purpose</b>: see how Spark can exploit native ElasticSearch features such as ES-side aggregations instead of performing aggregations on its own.
</ul>
<p>
<b>Expected Behaviour</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi_data_3_1.png">
<img class="centered" style="width: 600px;" alt="Data Architecture Test 3_1" src="https://www.niceideas.ch/es_spark/images/system_archi_data_3_1.png" />
</a>
</div>
<br>
<p>
<b>Relevant portion of the Spark script</b>
</p>
<pre>
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Spark configuration
conf = SparkConf().setAppName("ESTest_3_1")

# SparkContext and SQLContext
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# -> query dsl
es_aggregations_query = '''
{
    "query" : { "match_all": {} },
    "size" : 0,
    "aggregations" : {
        "play_name": {
            "terms": {
                "field" : "play_name"
            }
        }
    }
}
'''

es_read_conf = {
    "es.resource" : "shakespeare",
    "es.endpoint" : "_search",
    "es.query" : es_aggregations_query
}

# (1)
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
es_df = sqlContext.createDataFrame(es_rdd)

# I need to collect the results
data_list = es_df.collect()
print("Printing 10 first results")
for x in data_list[0:10]:
    print x

# Print count
print("Fetched %s rows (from collected list)" % len(data_list))
</pre>
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/3_aggregation_1_es_shakespeare_rdd_legacy_NOT_WORKING.log">3_aggregation_1_es_shakespeare_rdd_legacy.log</a>
<li><b>Screenshots from the various admin consoles after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_3_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_3_1.png' />
</div>
</a>
<figcaption><b>Test 3-1 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_3_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_3_1.png' />
</div>
</a>
<figcaption><b>Test 3-1 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_1.png' />
</div>
</a>
<figcaption><b>Test 3-1 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_1_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_1_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 3-1 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
<b>There is simply no way at the moment to submit specific requests, such as aggregation requests, from Spark to ElasticSearch using the ES-Hadoop connector.</b>
</li>
<li>
The need is well identified but the solution is still a work in progress: <a href="https://github.com/elastic/elasticsearch-hadoop/issues/276">https://github.com/elastic/elasticsearch-hadoop/issues/276</a> (a driver-side workaround is sketched below).
</li>
<li>
Since it's impossible to make this work as expected, the schematic above is crossed out with an X.
</li>
</ul>
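<p>
Pending that feature, a pragmatic driver-side workaround is to bypass the connector for the aggregation itself and query the ElasticSearch REST API directly, so that ES does the heavy lifting and only the buckets travel back to the driver. A minimal sketch (it assumes the <code>requests</code> package and an ES node reachable on <code>localhost:9200</code>, neither of which is part of the test setup above):
</p>
<pre>
# Sketch: run the aggregation in ES itself through the REST API
# (host and port are assumptions for this sandbox).
import json
import requests

es_aggregations_query = {
    "query": {"match_all": {}},
    "size": 0,  # we only want the buckets, not the documents
    "aggregations": {
        "play_name": {
            "terms": {"field": "play_name"}
        }
    }
}

response = requests.post(
    "http://localhost:9200/shakespeare/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(es_aggregations_query))

# Buckets come back as {"key": play_name, "doc_count": count}
for bucket in response.json()["aggregations"]["play_name"]["buckets"]:
    print("%s : %s lines" % (bucket["key"], bucket["doc_count"]))
</pre>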
<a name="sec532"></a>
<h4>5.3.2 Spark-side Aggregations</h4>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/3_aggregation_2_spark_shakespeare.sh"><code>3_aggregation_2_spark_shakespeare.sh</code></a></li>
<li><b>Input Dataset</b>: Shakespeare's Works Dataset from <a href="https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json">https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json</a></li>
<li><b>Purpose</b>: see how Spark performs aggregations on its own.
</ul>
<p>
<b>Expected Behaviour</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi_data_3_2.png">
<img style="width: 600px;" alt="Data Architecture Test 3_2" src="https://www.niceideas.ch/es_spark/images/system_archi_data_3_2.png" />
</a>
</div>
<br>
<p>
<b>Relevant portion of the Spark script</b>
</p>
<pre>
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext, SparkSession

# Spark configuration
conf = SparkConf().setAppName("ESTest_3_2")

# Every time there is a shuffle, Spark needs to decide how many partitions
# the shuffle RDD will have.
# 2 times the number of CPUs on the cluster is a good value (default is 200)
conf.set("spark.sql.shuffle.partitions", "12")

# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()

# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf = {
    "pushdown": True
}

# (1)
es_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(conf=es_query_conf) \
    .load("shakespeare*")

# (2) Compute aggregates : I want the count of lines per play
agg_df = es_df.groupBy(es_df.play_name).count()

# (3) Collect result to the driver
data_list = agg_df.collect()
print("Printing 10 first results")
for x in data_list[0:10]:
    print x

# Print count
print("Fetched %s rows (from collected list)" % len(data_list))
</pre>
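<p>
As a side note, the very same aggregation can be expressed in plain SQL through a temporary view. This is an equivalent sketch, not what the test script above actually runs:
</p>
<pre>
# Sketch: equivalent formulation using Spark SQL instead of the
# DataFrame API (same aggregation, different syntax).
es_df.createOrReplaceTempView("shakespeare_lines")
agg_df = ss.sql("""
    SELECT play_name, COUNT(*) AS count
    FROM shakespeare_lines
    GROUP BY play_name
""")
agg_df.show(10)
</pre>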
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/3_aggregation_2_spark_shakespeare.log">3_aggregation_2_spark_shakespeare.log</a>
<li><b>Screenshots from the various admin consoles after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_3_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_3_2.png' />
</div>
</a>
<figcaption><b>Test 3-2 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_3_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_3_2.png' />
</div>
</a>
<figcaption><b>Test 3-2 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_2.png' />
</div>
</a>
<figcaption><b>Test 3-2 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_2_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_3_2_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 3-2 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>There isn't a lot to conclude here. Everything works as expected; we can simply refer the reader to the Data Flow schematic above.</li>
<li>Data locality kicks in, etc.
</ul>
<a name="sec54"></a>
<h3>5.4 Join test</h3>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/4_join_1_swissdata_df.sh"><code>4_join_1_swissdata_df.sh</code></a></li>
<li><b>Input Dataset</b>: two datasets in fact:
<ul>
<li>
from <a href="http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv">http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv</a> : the list of AirBnB offers in Switzerland as of July 2017.
</li>
<li>
from <a href="http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt">http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt</a> : the list of swiss cities with population and geoloc information.
</ul>
</li>
<li><b>Purpose</b>: see how the ELK-MS stack behaves when it has several datasets to load from ES into Spark and then join.
</ul>
<p>
<b>Expected Behaviour</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi_data_4_1.png">
<img class="centered" style="width: 600px;" alt="Data Architecture Test 4_1" src="https://www.niceideas.ch/es_spark/images/system_archi_data_4_1.png" />
</a>
</div>
<br>
<p>
<b>Relevant portion of the Spark script</b>
</p>
<pre>
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
import pyspark.sql.functions as F

# Spark configuration
# all these options can be given on the command line to spark-submit
# (they would need to be prefixed by "spark.")
conf = SparkConf().setAppName("ESTest_4_1")

# Every time there is a shuffle, Spark needs to decide how many partitions
# the shuffle RDD will have.
# 2 times the number of CPUs on the cluster is a good value (default is 200)
conf.set("spark.sql.shuffle.partitions", "12")

# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()

# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf = {
    "pushdown": True
}

# (1).1 Read city and population
citypop_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(conf=es_query_conf) \
    .load("swiss-citypop") \
    .alias("citypop_df")

# (1).2 Read airbnb offers
airbnb_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(conf=es_query_conf) \
    .load("swiss-airbnb") \
    .alias("airbnb_df")

# (2) Join on city
joint_df = airbnb_df \
    .join(
        citypop_df,
        (F.lower(airbnb_df.city) == F.lower(citypop_df.accent_city)),
        "left_outer"
    ) \
    .select(
        'room_id', 'airbnb_df.country', 'airbnb_df.city',
        'room_type', 'bedrooms', 'bathrooms', 'price', 'reviews',
        'overall_satisfaction',
        'airbnb_df.latitude', 'airbnb_df.longitude',
        'citypop_df.latitude', 'citypop_df.longitude', 'population',
        'region'
    )

# (3) Collect result to the driver
data_list = joint_df.collect()
print("Printing 10 first results")
for x in data_list[0:10]:
    print x

# Print count
print("Computed %s positions (from collected list)" % len(data_list))
</pre>
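<p>
A handy way to see what Spark intends to do with such a join, before collecting anything, is to print the physical plan. The sketch below is illustrative and not part of the test script; whether the small city/population dataset gets broadcast depends on size estimates and the <code>spark.sql.autoBroadcastJoinThreshold</code> setting:
</p>
<pre>
# Sketch: print the physical plan of the join before executing it.
# A "BroadcastHashJoin" in the output means the small citypop dataset
# is shipped to every executor; a "SortMergeJoin" means both sides
# are shuffled on the join key.
joint_df.explain()

# The broadcast can also be requested explicitly (a hint, not a command):
hinted_df = airbnb_df.join(
    F.broadcast(citypop_df),
    F.lower(airbnb_df.city) == F.lower(citypop_df.accent_city),
    "left_outer")
hinted_df.explain()
</pre>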
<p>
<b>Results</b>
</p>
<ul>
<li><b>Logs of the Spark Driver</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/4_join_1_swissdata_df.log">4_join_1_swissdata_df.log</a>
<li><b>Screenshots from the various admin consoles after the test execution</b>:
</ul>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_4_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_cerebro/result_4_1.png' />
</div>
</a>
<figcaption><b>Test 4-1 / Dataset in ES</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_4_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_4_1.png' />
</div>
</a>
<figcaption><b>Test 4-1 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1.png' />
</div>
</a>
<figcaption><b>Test 4-1 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 4-1 / Job 0 / Stage 0</b></figcaption>
</figure>
</div>
</div>
<br>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1_job_0_stage_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_4_1_job_0_stage_1.png' />
</div>
</a>
<figcaption><b>Test 4-1 / Job 0 / Stage 1</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
Here as well there isn't a lot to conclude. Everything works just as expected.
</li>
<li>
Data locality kicks in both on the ES data-fetching side and on the private Spark side for the join.
</li>
</ul>
<a name="sec55"></a>
<h3>5.5 Concurrency test</h3>
<p>
<b>Test details</b>
</p>
<ul>
<li><b>Test Script</b>: <a href="https://www.niceideas.ch/es_spark/sandbox/tests/scenarii/5_concurrency_1_swissdata_df.sh"><code>5_concurrency_1_swissdata_df.sh</code></a></li>
<li><b>Input Dataset</b>: two datasets in fact:
<ul>
<li>
from <a href="http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv">http://niceideas.ch/mes/swissairbnb/tomslee_airbnb_switzerland_1451_2017-07-11.csv</a> : the list of AirBnB offers in Switzerland as of July 2017.
</li>
<li>
from <a href="http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt">http://niceideas.ch/mes/swissairbnb/swisscitiespop.txt</a> : the list of swiss cities with population and geoloc information.
</ul>
</li>
<li><b>Purpose</b>: see how the ELK-MS stack behaves when submitting several jobs at the same time to the cluster and what happens in terms of concurrency.
</ul>
<p>
<b>The Spark Script</b>
</p>
<p>
The concurrency test simply executes four times in parallel the join scenario from <a href="#sec54">5.4 Join test</a>, as follows:
</p>
<pre>
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
import pyspark.sql.functions as F

# Spark configuration
conf = SparkConf()

# Every time there is a shuffle, Spark needs to decide how many partitions
# the shuffle RDD will have.
# 2 times the number of CPUs on the cluster is a good value (default is 200)
conf.set("spark.sql.shuffle.partitions", "12")

# Spark SQL Session
ss = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()

# Query configuration only (cannot pass any ES conf here :-( )
es_query_conf = {
    "pushdown": True
}

# 1. Read city and population
citypop_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(conf=es_query_conf) \
    .load("swiss-citypop") \
    .alias("citypop_df")

# 2. Read airbnb offers
airbnb_df = ss.read \
    .format("org.elasticsearch.spark.sql") \
    .options(conf=es_query_conf) \
    .load("swiss-airbnb") \
    .alias("airbnb_df")

# 3. Join on city
joint_df = airbnb_df \
    .join(
        citypop_df,
        (F.lower(airbnb_df.city) == F.lower(citypop_df.accent_city)),
        "left_outer"
    ) \
    .select(
        'room_id', 'airbnb_df.country', 'airbnb_df.city',
        'room_type', 'bedrooms', 'bathrooms', 'price', 'reviews',
        'overall_satisfaction',
        'airbnb_df.latitude', 'airbnb_df.longitude',
        'citypop_df.latitude', 'citypop_df.longitude', 'population',
        'region'
    )

# Collect result to the driver
data_list = joint_df.collect()
print("Printing 10 first results")
for x in data_list[0:10]:
    print x

# Print count
print("Computed %s positions (from collected list)" % len(data_list))
</pre>
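<p>
The actual parallel launch is performed by the shell test script linked above; purely as an illustration, the same idea can be sketched in Python (the script name, master URL and spark-submit options are assumptions, not the real harness):
</p>
<pre>
# Sketch: submit the same job four times in parallel and wait for all
# four drivers to complete (paths and options are illustrative).
import subprocess

procs = []
for i in range(1, 5):
    log = open("log_5_concurrency_1_swissdata_%s.log" % i, "w")
    procs.append((subprocess.Popen(
        ["spark-submit", "--master", "mesos://master:5050",
         "5_concurrency_1_swissdata.py"],
        stdout=log, stderr=subprocess.STDOUT), log))

# Wait for the four drivers to complete, then release the log files
for proc, log in procs:
    proc.wait()
    log.close()
</pre>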
<p>
<b>Results</b>
</p>
<ul>
<li><b>The various logs</b>:
<ul>
<li>Logs of the script : <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/5_concurrency_1_swissdata_df.log">5_concurrency_1_swissdata_df.log</a></li>
<li>Process P1 logs : <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/log_5_concurrency_1_swissdata_1.log">log_5_concurrency_1_swissdata_1.log</a></li>
<li>Process P2 logs : <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/log_5_concurrency_1_swissdata_2.log">log_5_concurrency_1_swissdata_2.log</a></li>
<li>Process P3 logs : <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/log_5_concurrency_1_swissdata_3.log">log_5_concurrency_1_swissdata_3.log</a></li>
<li>Process P4 logs : <a href="https://www.niceideas.ch/es_spark/sandbox/result_logs/log_5_concurrency_1_swissdata_4.log">log_5_concurrency_1_swissdata_4.log</a></li>
</ul>
<li><b>Screenshots from the various admin consoles after the test execution</b>:
</ul>
<p>
First, the Mesos console showing the completion of the 4 jobs:
</p>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_1.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P1 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_2.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P2 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_3.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_3.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P3 / Job Completion in mesos</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_4.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_mesos/mesos_fwk_5_1_4.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P4 / Job Completion in mesos</b></figcaption>
</figure>
</div>
</div>
<p>
Overview of the 4 processes in the Spark console, with the specific view of each process:
</p>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P1 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P2 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P3 / Process Overview on Spark</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P4 / Process Overview on Spark</b></figcaption>
</figure>
</div>
</div>
<p>
Focusing on Job 1 (P1), here are all the relevant views from the Spark Application UI:
</p>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P1 / Job 0</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P1 / Job 0 / Stage 0</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_1.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P1 / Job 0 / Stage 1</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_1_job_0_stage_2.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P1 / Job 0 / Stage 2</b></figcaption>
</figure>
</div>
</div>
<p>
Focusing on Job 2 (P2), here are all the relevant views from the Spark Application UI:
</p>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P2 / Job 0</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P2 / Job 0 / Stage 0</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_1.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P2 / Job 0 / Stage 1</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_2_job_0_stage_2.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P2 / Job 0 / Stage 2</b></figcaption>
</figure>
</div>
</div>
<p>
Focusing on Job 3 (P3), here are all the relevant views from the Spark Application UI:
</p>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P3 / Job 0</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P3 / Job 0 / Stage 0</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_1.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P3 / Job 0 / Stage 1</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_3_job_0_stage_2.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P3 / Job 0 / Stage 2</b></figcaption>
</figure>
</div>
</div>
<p>
Focusing on Job 4 (P4), here are all the relevant views from the Spark Application UI:
</p>
<div class="container-wrapper">
<div class="container">
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P4 / Job 0</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_0.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_0.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P4 / Job 0 / Stage 0</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_1.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_1.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P4 / Job 0 / Stage 1</b></figcaption>
</figure>
<figure>
<a style="border: 0px none;" href="https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_2.png">
<div class="container_inner">
<img src='https://www.niceideas.ch/es_spark/sandbox/result_spark/result_5_1_4_job_0_stage_2.png' />
</div>
</a>
<figcaption><b>Test 5-1 / P4 / Job 0 / Stage 2</b></figcaption>
</figure>
</div>
</div>
<p>
<b>Conclusions</b>
</p>
<ul>
<li>
Before everything else, let's mention that this test has been executed, first, using the FIFO scheduler (<code>spark.scheduler.mode=FIFO</code>) and second, using the <b>Dynamic allocation system</b>.
</li>
<li>
Dynamic allocation seems to work a little slower than static allocation in this case.
</li>
<li>
With static allocation (the only actual way on ES-Hadoop 5.x), what happens is that the first job, prepared by the driver a tiny little bit before the 3 others, gets the whole cluster, and only when that first job is done do the three next ones get an even share of the cluster, i.e. one node each, and complete almost at the same time.
</li>
<li>
With dynamic allocation, the cluster is well shared among jobs. Once in a while a job may get an additional executor and another job will need to wait, but all in all the 4 jobs really run together on the three nodes (the configuration properties involved are sketched after this list).
</li>
<li>
<p>
In terms of concurrency, we can see on the following image, showing the CPU consumption on the host machine, that the cluster is used quite effectively:
</p><br>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/concurrency_performance.png">
<img class="centered" style="width: 800px; " alt="Concurrency test - view of CPU" src='https://www.niceideas.ch/es_spark/images/concurrency_performance.png' />
</a>
<span class="centered">
(Note : each of the 3 VMs can use up to 2 CPUs of the host, which has 4 CPUs in total)
</span>
<br>
</div>
</li>
<li>
Also, all my tests, including this one, have been executed using <b>Coarse Grained Scheduling Mode</b> (<code>spark.mesos.coarse=true</code>).
<ul>
<li>
One might think that using Fine Grained Mode things would be more efficient, since each and every task would be distributed on the cluster at will and we wouldn't end up in the <i>static topology</i> described above.
</li>
<li>
But unfortunately, Mesos latency when it comes to negotiating resources really messes up performance. The dynamic dispatching of tasks works well, but the overall process performance is ruined by the time Mesos requires for negotiation.
<br>
In the end, Fine Grained Scheduling Mode drags the performance of the whole cluster down.
</li>
<li>
I have executed this very same test using <code>spark.mesos.coarse=false</code>; the drop in cluster usage efficiency can be seen by looking at the <a href="https://www.niceideas.ch/es_spark/images/concurrency_performance_FINE-GRAINED.png">CPU consumption on the host machine for test 5-1 using Fine Grained Mode</a>.
</li>
</ul>
</li>
<li>
Regarding data locality, since the last 3 processes each get a single node of the cluster, only one third of the tasks execute with locality level <code>NODE_LOCAL</code>; two thirds of them need to fetch data over the network.
</li>
</ul>
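<p>
For reference, here is a minimal sketch of the Spark properties involved in enabling dynamic allocation (property names are from the Spark documentation, the values are illustrative, and on Mesos the external shuffle service must additionally be running on every agent):
</p>
<pre>
# Sketch: properties involved in enabling dynamic allocation
# (values are illustrative for this 3-node sandbox).
from pyspark.conf import SparkConf

conf = SparkConf().setAppName("ESTest_dynamic_allocation")
conf.set("spark.dynamicAllocation.enabled", "true")
# Dynamic allocation relies on the external shuffle service so that
# shuffle files survive executors being released.
conf.set("spark.shuffle.service.enabled", "true")
# Optional bounds on how far Spark may scale the executor count.
conf.set("spark.dynamicAllocation.minExecutors", "1")
conf.set("spark.dynamicAllocation.maxExecutors", "6")
</pre>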
<a name="sec6"></a>
<h2>6. References</h2>
<p>
<b>Spark and Mesos</b>
</p>
<ul>
<li>
<a href="https://spark.apache.org/docs/latest/running-on-mesos.html">https://spark.apache.org/docs/latest/running-on-mesos.html (specific spark mesos configuration)</a>
</li>
</ul>
<p>
<b>ES Hadoop doc</b>
</p>
<ul>
<li>
Spark : <a href="https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html">https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html</a>
</li>
<li>
Configuration : <a href="https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html#_querying">https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html#_querying</a>
</li>
</ul>
<p>
<b>Pyspark.sql doc</b>
</p>
<ul>
<li>
2.2 : <a href="https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html">https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html</a>
</li>
<li>
1.6 : <a href="https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html">https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html</a>
</li>
</ul>
<p>
<b>Spark Doc</b>
</p>
<ul>
<li>
Configuration : <a href="https://spark.apache.org/docs/latest/configuration.html">https://spark.apache.org/docs/latest/configuration.html</a>
</li>
<li>
Spark history server : <a href="https://spark.apache.org/docs/latest/monitoring.html">https://spark.apache.org/docs/latest/monitoring.html</a>
</li>
<li>
Dynamic resource allocation : <a href="https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation">https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation</a>
</li>
</ul>
<p>
<b>Other Pyspark specificities</b>
</p>
<ul>
<li>
Pyspark RDD API : <a href="https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD">https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD</a>
</li>
<li>
Pyspark performance : <a href="https://fr.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark">https://fr.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark</a>
</li>
</ul>
https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana
ELK-MS - ElasticSearch/LogStash/Kibana - Mesos/Spark : a lightweight and efficient alternative to the Hadoop Stack - part I : setup the cluster
Jerome Kehrli
2017-08-23T17:29:12-04:00
2020-07-22T03:11:59-04:00
<p><i>
<b>Edited 2017-10-30</b>: I was using ES 5.0.0 with Spark 2.2.0 at the time of writing the initial version of this article.
<br>
With ElasticSearch 6.x, and ES-Hadoop 6.x, the game changes a little. The Spark 2.2.0 Dynamic allocation system is now perfectly compatible with the way ES-Hadoop 6.x enforces data locality optimization and everything works just as expected.
</i>
</p>
<p>
In my current company, we implement heavy Data Analytics algorithms and use cases for our customers. Historically, these heavy computations took a whole lot of different forms, mostly custom computation scripts in Python, using RDBMS databases to store data and results.
<br>
A few years ago, we started to hit the limits of what we were able to achieve using traditional architectures and had to move both our storage and processing layers to NoSQL / Big Data technologies.
</p>
<p>
We considered a whole lot of different approaches, but eventually, and contrary to what I expected first, we didn't settle for a standard Hadoop stack. We are using ElasticSearch as key storage backend and Apache Spark as processing backend.
<br>
Now of course we were initially still considering a Hadoop stack for the single purpose of using YARN as resource management layer for Spark ... until we discovered Apache Mesos.
</p>
<p>
Today this state-of-the-art ELK-MS - for ElasticSearch/Logstash/Kibana - Mesos/Spark - stack performs amazingly and I believe it to be a really <b>lightweight, efficient, low-latency and performant</b> alternative to a plain old Hadoop stack.
<br>
I am writing a series of three articles to present this stack and why it's cool.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/technical_archi_simple.png">
<img class="centered" style="width: 500px;" alt="ELK-MS Simple Technical Architecture" src="https://www.niceideas.ch/es_spark/images/technical_archi_simple.png" />
</a>
</div>
<br>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana">This first article - ELK-MS - part I : setup the cluster</a> in this serie presents the ELK-MS stack and how to set up a test cluster using the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz">niceideas ELK-MS package</a>.
</p>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">The second article - ELK-MS - part II : assessing behaviour</a> presents a few concerns, assesses the expected behaviour using the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz">niceideas ELK-MS TEST package</a> and discusses challenges and constraints in this ELK-MS environment.
</p>
<p>
The conclusions of this series of articles are presented in <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2">the third and last article - ELK-MS - part III : so why is it cool?</a> which presents, as the name suggests, why this ELK-MS stack is really really cool and works great.
</p>
<p>
This article assumes a basic understanding of Hadoop and Big Data / NoSQL technologies in general by the reader.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a>
<ul>
<li><a href="#sec11">1.1 Rationality</a></li>
<li><a href="#sec12">1.2 Purpose of this serie of articles</a></li>
</ul>
</li>
<li><a href="#sec2">2. Target Architecture</a>
<ul>
<li><a href="#sec21">2.1 Technical Architecture</a></li>
<li><a href="#sec22">2.2 Components</a>
<ul>
<li><a href="#sec221">2.2.1 ElasticSearch</a></li>
<li><a href="#sec222">2.2.2 Logstash</a></li>
<li><a href="#sec223">2.2.3 Kibana</a></li>
<li><a href="#sec224">2.2.4 Cerebro</a></li>
<li><a href="#sec225">2.2.5 Spark</a></li>
<li><a href="#sec226">2.2.6 Mesos</a></li>
<li><a href="#sec227">2.2.7 Spark on Mesos specificities</a></li>
</ul>
</li>
<li><a href="#sec23">2.3 Making it work together : ES-Hadoop</a></li>
<li><a href="#sec24">2.4 Application Architecture</a></li>
</ul>
</li>
<li><a href="#sec3">3. niceideas ELK-MS</a>
<ul>
<li><a href="#sec31">3.1 System and principles</a></li>
<li><a href="#sec32">3.2 The build system</a>
<ul>
<li><a href="#sec321">3.2.1 Required Tools</a></li>
<li><a href="#sec322">3.2.2 Build System Project Layout</a></li>
</ul>
</li>
<li><a href="#sec33">3.3 Calling the build system and results</a></li>
<li><a href="#sec34">3.4 Testing the System</a></li>
<li><a href="#sec35">3.5 Tips & Tricks</a></li>
</ul>
</li>
<li><a href="#sec4">4. Noteworthy configuration elements</a>
<ul>
<li><a href="#sec41">4.1 NTP</a></li>
<li><a href="#sec42">4.2 Zookeeper</a></li>
<li><a href="#sec43">4.3 Elasticsearch</a></li>
<li><a href="#sec44">4.4 Logstash, Kibana, Cerebro</a></li>
<li><a href="#sec45">4.5 Mesos</a></li>
<li><a href="#sec46">4.6 Spark</a></li>
<li><a href="#sec47">4.7 ES-Hadoop</a></li>
</ul>
</li>
<li><a href="#sec4">5. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction </h2>
<p>
Actually deploying a whole Hadoop stack is, let's say, at least <i>heavy</i>. Having HDFS, YARN, the Map Reduce framework and maybe Tez up and running is one thing, and it's maybe not that complicated, sure.
</p>
<p>
But with such a <i>vanilla</i> stack you're not going very far. You'll at least add the following minimal set of software components: <a href="http://sqoop.apache.org/">Apache Sqoop</a> for importing data in your HDFS cluster, <a href="https://pig.apache.org/">Apache Pig</a> for processing this data, <a href="https://hive.apache.org/">Apache Hive</a> for querying it. But yeah, then, Hive is so slow for small queries returning small datasets, you'll likely add Stinger ... and then a whole lot of other components.
<br>
Now setting all of these software components up and tuning them well is a real hassle, so one might consider a HortonWorks or Cloudera distribution instead, and this is where it gets really heavy.
<br>
Don't get me wrong, both HortonWorks and Cloudera are doing an amazing job and their distributions are awesome.
<br>
But I am working in a context where we want something lighter, something more efficient, something easier to set up, master and monitor.
</p>
<p>
In addition, HDFS is great. But it's really only about distributed storage of data. Vanilla Hadoop doesn't really provide anything on top of this data aside from MapReduce. On the other hand, the NoSQL landscape is filled with plenty of solutions achieving the same resilience and performance as HDFS while providing advanced querying features on top of the data.
<br>
Among all these solutions, <a href="https://www.elastic.co/products/elasticsearch">ElasticSearch</a> is the <i>one stop shop</i> for our use cases. It fulfills 100% of our requirements and provides us out of the box with all the querying features we require (and some striking advanced features).
<br>
Using ElasticSearch for our data storage needs, we have no usage whatsoever for HDFS.
<br>
In addition, ElasticSearch comes out of the box with a pretty awesome replacement of Sqoop: <a href="https://www.elastic.co/products/logstash">Logstash</a> and a brilliant Data Visualization tool that has no free alternative in the Hadoop world: <a href="https://www.elastic.co/products/kibana">Kibana</a>.
</p>
<p>
Now regarding Data Processing, here as well we found our <i>one stop shop</i> in the form of <a href="https://spark.apache.org/">Apache Spark</a>. Spark is a (very) fast and general engine for large-scale data processing. Among all of our processing needs, there is not one single use case we cannot map easily and naturally to Spark's API, either using low-level RDDs or using the DataFrame API (SparkSQL).
</p>
<p>
Now Spark requires some external scheduler and resource manager. It can of course run without one, but then fails to achieve proper concurrency.
<br>
We were seriously considering deploying Hadoop and YARN for this until we discovered <a href="http://mesos.apache.org/">Apache Mesos</a>. Mesos is a distributed systems kernel built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides Spark with APIs for resource management and scheduling across entire datacenter and cloud environments.
</p>
<a name="sec11"></a>
<h3>1.1 Rationale</h3>
<p>
I call the software stack formed by the above components the ELK-MS stack, for <i>ElasticSearch/LogStash/Kibana - Mesos/Spark</i>.
</p>
<p>
<b>The ELK-MS stack is a simple, lightweight, efficient, low-latency and performant alternative to the Hadoop stack providing state-of-the-art Data Analytics features</b>:
</p>
<ul>
<li>
<b>lightweight</b> : ELK-MS is lightweight both in terms of <i>setup</i> and <i>runtime</i>.
<br>
In terms of setup, the distributed storage engine, ElasticSearch, the resource manager, Mesos, and the distributed processing engine, Spark, are amazingly easy to set up and configure. They really work almost out of the box, and only very few configuration properties have to be set; when it comes to configuring resources in Mesos, honestly, trying to optimize anything other than the default values really tends to worsen things.
<br>
In terms of runtime, ElasticSearch, Mesos and some components of Spark - the only long-running daemons - have a very low memory footprint under low workload. Now of course, both ElasticSearch and Spark have pretty heavy memory needs when working.
</li>
<li>
<b>efficient</b> : ElasticSearch, contrary to HDFS, is not just a wide and simple distributed storage engine. ElasticSearch is in addition a real-time querying engine. It provides pretty advanced features such as aggregations and, up to a certain level, even distributed processing (scripted fields or else). With ELK-MS, the storage layer itself provides basic data analytics features.
<br>
In addition, Spark supports through the RDD API most if not all of what we can achieve using low-level MapReduce. It obviously also supports plain old MapReduce. But the really striking feature of Spark is the DataFrame API and SparkSQL.
</li>
<li>
<b>low-latency</b> : Spark is by design much faster than Hadoop. In addition, jobs on Spark can be implemented in such a way that the processing and job initialization times are much shorter than on Hadoop MapReduce (Tez makes things more even on Hadoop though). <br>
But there again Spark has a joker: the Spark Streaming extension.
</li>
<li>
<b>performant</b> : in addition to the above, both ElasticSearch and Spark share a common gene, not necessarily widely spread in the NoSQL landscape: the capacity to benefit as much from a big cluster with thousands of nodes as from a big machine with hundreds of processors.
<br>
Spark and ElasticSearch are very good on a large cluster of small machines (and to be honest, scaling out is really the preferred way to achieve optimal performance with both).
<br>
<b>But contrary to Hadoop, both Spark and ElasticSearch also work pretty well on a single fat machine with hundreds of processors</b>, able to benefit from the multi-processor architecture of one single machine.
</li>
</ul>
<p>
The <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1#sec3">conclusions of the behaviour assessment tests</a>, at the end of the second article, as well as the <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2">conclusion of this series of articles</a>, give some more leads on why the ELK-MS stack is cool.
</p>
<p>
For these reasons, we are extensively using the ELK-MS stack for our Data Analytics needs in my current company.
</p>
<a name="sec12"></a>
<h3>1.2 Purpose of this serie of articles</h3>
<p>
Setting up the ELK-MS stack in a nominal working mode is easy, but still requires a few steps. In addition, when assessing the stack and for testing purposes, I needed a way to set up a cluster and test key features such as optimization of data locality between ElasticSearch and Spark.
</p>
<p>
I have written a set of scripts taking care of the nominal setup and a test framework based on <a href="https://www.vagrantup.com/">Vagrant</a> and <a href="https://www.virtualbox.org/">VirtualBox</a>.
</p>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana">This first article - ELK-MS - part I : setup the cluster</a> in this serie presents the ELK-MS stack and how to set up a test cluster using the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz">niceideas ELK-MS package</a>.
</p>
<p>
<a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">The second article - ELK-MS - part II : assessing behaviour</a> presents a few concerns, assesses the expected behaviour using the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS_TEST.tar.gz">niceideas ELK-MS TEST package</a> and discusses challenges and constraints in this ELK-MS environment.
</p>
<p>
The conclusions of this series of articles are presented in <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana2">the third and last article - ELK-MS - part III : so why is it cool?</a> which presents, as the name suggests, why this ELK-MS stack is really really cool and works great.
</p>
<a name="sec2"></a>
<h2>2. Target Architecture </h2>
<p>
Before presenting the components and some noteworthy configuration aspects, let's dig into the architecture of the ELK-MS stack.
</p>
<a name="sec21"></a>
<h3>2.1 Technical Architecture</h3>
<p>
The <b>technical architecture</b> of the ELK-MS stack is as follows
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/technical_archi_master.png">
<img class="centered" style="width: 600px; " alt="ELK-MS Technical Architecture" src="https://www.niceideas.ch/es_spark/images/technical_archi_master.png" />
</a>
</div>
<br>
<p>
The components in grey are provided out of the box at OS level by the <b>Debian Stretch</b> distribution.
<br>
The components in yellow are provided by Elastic in the ELK Stack.
<br>
Mesos is in light red.
<br>
The components in blue are from the Spark Framework.
</p>
<p>
Let's present all these components.
</p>
<a name="sec22"></a>
<h3>2.2 Components</h3>
<p>
This section presents the most essential components of the ELK-MS stack.
</p>
<a name="sec221"></a>
<h4>2.2.1 ElasticSearch</h4>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="ElasticSearch Logo" src="https://www.niceideas.ch/es_spark/images/elasticsearch_logo.png" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
From <a href="https://www.elastic.co/products/elasticsearch">ElasticSearch's web site</a> : "<i>ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.</i>"
</p>
<p>
ElasticSearch is a NoSQL document-oriented database benefiting from the NoSQL genes: data distribution by sharding (partitioning) and replication. It can run on all kinds of hardware, from a big fat hundred-CPU machine to a multi-datacenter cluster of commodity hardware.
<br>
The native document storage format is JSON.
</p>
<p>
ElasticSearch supports real-time querying of data and advanced analytics features such as aggregations, scripted fields, advanced memory management models and even some support for MapReduce directly in ElasticSearch's engine.
</p>
</td>
</tr>
</table>
<a name="sec222"></a>
<h4>2.2.2 Logstash</h4>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Logstash Logo" src="https://www.niceideas.ch/es_spark/images/logstash_logo.png" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
From <a href="https://www.elastic.co/products/logstash">Logstash's web site</a> : "<i>Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite "stash." (Ours is Elasticsearch, naturally.).</i>"
</p>
<p>
Logstash is really the equivalent of Sqoop in the Elastic world. It's a largely configurable data processing engine whose primary intent is to feed ElasticSearch with data that can come from pretty much all imaginable data sources and formats. Of course Logstash can also output data to a very extended set of sinks in addition to ElasticSearch.
<br>
It's easily extendable through plugins which are straightforward to build, should the 200 provided plugins not be sufficient.
</p>
<p>
Logstash can also be distributed just like ElasticSearch, making it possible not only to scale out the data ingestion processing but also to apply smart co-location strategies with ElasticSearch.
</p>
</td>
</tr>
</table>
<a name="sec223"></a>
<h4>2.2.3 Kibana</h4>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Kibana Logo" src="https://www.niceideas.ch/es_spark/images/kibana_logo.png" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
From <a href="https://www.elastic.co/products/kibana">Kibana's web site</a> : "<i>Kibana lets you visualize your ElasticSearch data and navigate the Elastic Stack, so you can do anything from learning why you're getting paged at 2:00 a.m. to understanding the impact rain might have on your quarterly numbers.</i>"
</p>
<p>
Kibana core ships with the classics: histograms, line graphs, pie charts, sunbursts, and more. They leverage the full aggregation capabilities of ElasticSearch.
<br>
Kibana as well is easily extendable and integrating any kind of native D3.js visualization is usually done in a few hours of coding.
</p>
<p>
<b>In the context of ELK-MS</b>, Kibana is an amazing addition to ElasticSearch, since we can write Spark programs that work with data from ES but also store their results in ES. As such, Kibana can be used out of the box to visualize not only the input data but also the results of the Spark scripts.
</p>
</td>
</tr>
</table>
<a name="sec224"></a>
<h4>2.2.4 Cerebro</h4>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Cerebro Logo" src="https://www.niceideas.ch/es_spark/images/cerebro_logo.png" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
From <a href="https://github.com/lmenezes/cerebro">Cerebro's web site</a> : "<i>Cerebro is an open source(MIT License) ElasticSearch web admin tool built using Scala, Play Framework, AngularJS and Bootstrap..</i>"
</p>
<p>
Cerebro is the one-stop-shop monitoring and administration tool for ElasticSearch: little and simple, but efficient.
</p>
<p>
Cerebro is a must have with ElasticSearch since working only with the REST API to understand ElasticSearch's topology and perform most trivial administration tasks (such as defining mapping templates, etc.) is a real hassle.
<br>
Cerebro is far from perfect but really does the job.
</p>
</td>
</tr>
</table>
<a name="sec225"></a>
<h4>2.2.5 Spark</h4>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Spark Logo" src="https://www.niceideas.ch/es_spark/images/spark_logo.png" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
From <a href="https://spark.apache.org/">Spark's web site</a> : "<i>Apache Spark is a fast and general engine for large-scale data processing.</i>"
</p>
<p>
From <a href="https://en.wikipedia.org/wiki/Apache_Spark">Wikipedia's Spark article</a>: "<i>Apache Spark provides programmers with an application programming
interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines,
that is maintained in a fault-tolerant way.</i>
<br>
<i>It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.</i>
</p>
<p>
<i>The availability of RDDs facilitates the implementation of both iterative algorithms, that visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications (compared to a MapReduce implementation, as was common in Apache Hadoop stacks) may be reduced by several orders of magnitude.</i>"
</p>
</td>
</tr>
</table>
<a name="sec226"></a>
<h4>2.2.6 Mesos</h4>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Mesos Logo" src="https://www.niceideas.ch/es_spark/images/mesos_logo.png" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
From <a href="http://mesos.apache.org/">Mesos' web site</a> : "<i>Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.</i>
<br>
<i>Mesos is a distributed systems kernel, built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos
kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with API’s for
resource management and scheduling across entire Datacenter and cloud environments.</i>"
</p>
<p>
<b>In the context of ELK-MS</b>, and in a general way when considering running Spark in a production environment, Mesos is the way to go if one doesn't want to deploy a full Hadoop stack to support Spark. In the end, it appears that Mesos performs amazingly, having only a very small memory footprint on the cluster and being incredibly easy to set up and administer.
</p>
</td>
</tr>
</table>
<a name="sec227"></a>
<h4>2.2.7 Spark on Mesos specificities</h4>
<p>
Happily, Spark and Mesos, both products from the Apache Foundation, know about each other and are designed to work together.
<br>
There are some specificities though when it comes to running Spark on Mesos, as opposed to running Spark on the more usual YARN, as explained below.
</p>
<b>Spark Mesos Dispatcher</b>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="Spark Mesos Dispatcher Logo" src="https://www.niceideas.ch/es_spark/images/spark-mesos-dispatcher_logo.png" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
Interestingly, Mesos handles Spark workers in a pretty amazing way. Not only does Mesos consider <i>node locality</i> requirements between Spark and ElasticSearch, but Mesos also provides the required retry policies and more.
</p>
<p>
When launching a Spark job, there is nevertheless one <i>Single Point of Failure</i> that remains: the Spark driver, which lives outside of the Mesos/Spark cluster, on the machine where it is launched by the user or the driving process.
</p>
<p>
For this reason, Spark provides the <b>Spark Mesos Dispatcher</b> that can be used to <b>dispatch the Spark Driver itself on the Mesos/Spark cluster</b>.
<br>
Using the <i>Spark Mesos Dispatcher</i>, the driver itself, just like the Spark processing, is balanced on the cluster to an available node and can be supervised (retried, monitored in terms of memory consumption, etc.).
</p>
<p>
The <i>Spark Mesos Dispatcher</i> addresses the single remaining weakness of a Spark job: the driver, which can crash or exhaust resources, is handled just like any other bit of Spark processing, as illustrated by the sketch right after this table.
</p>
</td>
</tr>
</table>
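<p>
As a hedged illustration, submitting a job in cluster mode through the dispatcher could look as follows (the dispatcher's default port 7077, the class name and the jar path are assumptions of mine, not taken from the <i>niceideas_ELK-MS</i> package):
</p>
<pre>
# Submit through the Spark Mesos Dispatcher, assumed to run on mes_master
# --deploy-mode cluster dispatches the driver itself on the Mesos cluster
# --supervise asks for the driver to be restarted if it fails
/usr/local/lib/spark-2.2.0/bin/spark-submit \
    --master mesos://192.168.10.10:7077 \
    --deploy-mode cluster \
    --supervise \
    --class ch.niceideas.SampleJob \
    /var/lib/spark/jobs/sample-job.jar   # hypothetical jar, must be reachable from every node
</pre>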
<p>
<b>Spark History Server</b>
</p>
<p>
Contrary to Spark running in standalone mode, when Spark runs on Mesos, there is no long-lived backend that the user can interact with when Spark is not actually executing a job.
<br>
Mesos takes care of creating and dispatching Spark workers when required. When no Spark job is being executed, there is no Spark process anywhere one can interact with to query, for instance, the results of a previous job.
</p>
<p>
Happily, Spark provides a solution out of the box for this: the <b>Spark History Server</b>.
<br>
The Spark History Server is a lightweight process that presents the results stored in the Spark <i>Event Log folder</i>, the folder on the filesystem where Spark stores consolidated results from the various workers.
<br>
The Spark documentation is very unclear about this, but since only the Spark driver stores consolidated results in the <i>event log folder</i>, if all drivers are launched on the same machine (for instance the Mesos master machine), there is no specific need for HDFS.
</p>
<p>
One should note that running the Spark History Server without HDFS to store the event log can be a problem if one uses the <i>Spark Mesos Dispatcher</i> to distribute the driver program itself on the Mesos cluster. In this case, using a common NFS share, for instance, would solve the problem.
</p>
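<p>
For reference, once the <i>event log folder</i> is configured (see section 4.6 below), starting the History Server boils down to something like the following; a minimal sketch assuming the install layout used throughout this article, not the package's actual systemd setup:
</p>
<pre>
# Start the Spark History Server; it reads the configured event log folder
# and serves its UI on port 18080 by default
/usr/local/lib/spark-2.2.0/sbin/start-history-server.sh
</pre>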
<a name="sec23"></a>
<h3>2.3. Making it work together : ES-Hadoop</h3>
<table style="border: 0px none;">
<tr>
<td style="width: 100px; border: 0px none;">
<img style="width: 100px; min-width: 100px; border: 0px none;" alt="ES-Hadoop Logo" src="https://www.niceideas.ch/es_spark/images/es-hadoop_logo.png" />
</td>
<td style="width: 100%; border: 0px none;">
<p>
From <a href="https://www.elastic.co/products/hadoop">ES-Hadoop's web site</a> : "<i>Connect the massive data storage and deep processing power of Hadoop with the real-time search and analytics of Elasticsearch. The Elasticsearch-Hadoop (ES-Hadoop) connector lets you get quick insight from your big data and makes working in the Hadoop ecosystem even better.</i>"
</p>
<p>
Essentially, ES-Hadoop contains the set of classes implementing connectors for pretty much all <i>de facto</i> "standard" components of a full Hadoop stack, such as Hive, Pig, Spark, etc.
<br>
Interestingly, as far as Spark is concerned, Spark can perfectly use ES-Hadoop to load data from or store data to ElasticSearch <b>outside of a Hadoop stack</b>. In fact, the Spark connector from the ES-Hadoop library has no dependency on a Hadoop stack whatsoever.
</p>
<p>
<b>In the context of ELK-MS, the ES-Hadoop connector is one of the most important components.</b> When one considers building a large collocated ES / Mesos / Spark cluster and executing tasks that require fetching large datasets from ES to Spark, the <b>data-locality knowledge</b> supported by ES-Hadoop is of utmost importance. The <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">second article of this series</a> is largely devoted to assessing how, and how far, the optimization of data-locality works.
</p>
<p>
When launching a job using Spark, the connector determines the locations of the ElasticSearch shards that it will be targeting and creates a partition per shard (or splits even further to allow for greater parallelism). Each of these partition definitions carries with it the index name, the shard id, the slice id and the addresses of the machines where it can find this data locally. It then relies on Spark's task scheduling to achieve data locality.
<br>
Spark will stand up a task for each of the input partitions, and each reading task is pinned to a node that is hosting the shard. This just means that the task will always try to read from that node first, but will target other nodes if that node fails the processing or fails to become available before the timeout. (A quick way to visualize this shard layout is shown right after this table.)
</p>
</td>
</tr>
</table>
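<p>
To get a feeling for the shard layout that the connector turns into partitions, one can ask ElasticSearch directly. A minimal check, assuming the cluster built in section 3 and a placeholder index name:
</p>
<pre>
# List the shards of an index along with the node hosting each of them
# (replace logstash-test with an actual index name)
curl -XGET 'http://192.168.10.10:9200/_cat/shards/logstash-test?v'
</pre>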
<a name="sec24"></a>
<h3>2.4 Application Architecture</h3>
<p>
The typical data flows on the ELK-MS platform are illustrated by the following <b>Application Architecture</b> schema:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/application_architecture.png">
<img class="centered" style="width: 600px;" alt="ELK-MS Application Architecture" src="https://www.niceideas.ch/es_spark/images/application_architecture.png" />
</a>
</div>
<br>
<p>
The tests presented in <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">the second article in this series: ELK-MS - part II : assessing behaviour</a> are intended to assess the proper behaviour of this application architecture.
</p>
<a name="sec3"></a>
<h2>3. niceideas ELK-MS</h2>
<p>
So, as stated in the introduction, when playing with ES / Mesos / Spark, two urgent needs appeared quite fast:
</p>
<ol>
<li>
First, I needed a <b>reference for configuring the various software</b> so that they work well together. Instead of writing pages of documentation indicating the settings to tune, I ended up putting all of that in setup scripts aimed at helping me re-apply the configuration at will.
</li>
<li>
Second, I needed a <b>test cluster</b> allowing me to assess how various key features were working, among which ensuring optimization of data-locality was one of the most important.
</li>
</ol>
<p>
In the end I wrote a set of scripts using Vagrant and VirtualBox aimed at making it possible to rebuild the test cluster and reapply the configuration at will. I packaged all these scripts together and call this package the <b>niceideas_ELK-MS</b> package.
</p>
<p>
This package is <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz">available for download here</a>.
</p>
<a name="sec31"></a>
<h3>3.1 System and principles</h3>
<p>
The <b>System Architecture</b> of the ELK-MS platform as built by the <i>niceideas_ELK-MS</i> package is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/system_archi.png">
<img class="centered" style="width: 400px;" alt="ELK-MS System Architecture" src="https://www.niceideas.ch/es_spark/images/system_archi.png" />
</a>
</div>
<br>
<ul>
<li>
The master node, <b>MES Master</b> (for Mesos/Elasticsearch/Spark) is called <code>mes_master</code>. It contains the full stack of software including the management UIs. The Master node is also a data node.
</li>
<li>
The two data nodes, <b>MES Slave X</b> are called <code>mes_node1</code> and <code>mes_node2</code>. They only provide an ElasticSearch instance and a Mesos worker instance to drive Spark Executors.
</li>
</ul>
<p>
Having several <i>Mesos Masters</i> is not considered for now, but the technical stack is deployed with this possibility wide open by using Zookeeper to manage Mesos masters.
</p>
<a name="sec32"></a>
<h3>3.2 The build system</h3>
<p>
The remainder of this section is a description of the <i>niceideas_ELK-MS</i> package build system and a presentation of the expected results.
</p>
<a name="sec321"></a>
<h4>3.2.1 Required Tools</h4>
<p>
First of all, the build system is really intended to work on Linux; it would work on Windows as well, except that the vagrant commands need to be called directly.
</p>
<p>
But before digging into this, the following tools need to be installed and properly working on the host machine where the ELK-MS test cluster has to be built:
</p>
<ul>
<li>
<b>VirtualBox</b>: is an x86 and AMD64/Intel64 virtualization solution.
<br>
The <i>niceideas_ELK-MS</i> package will build a cluster of nodes taking the form of VMs running on the host machine (the user computer).
</li>
<li>
<b>Vagrant</b>: is a tool for building and managing virtual machine environments in a single workflow.
<br>
The <i>niceideas_ELK-MS</i> package uses Vagrant to build and manage the VMs without any user interaction required and to drive the provisioning scripts execution.
</li>
<li>
The <b>vagrant-reload</b> vagrant plugin is required to reload the machines after changes applied by the provisioning scripts that require a VM reboot.
<br>
See <a href="https://github.com/aidanns/vagrant-reload/blob/master/README.md">https://github.com/aidanns/vagrant-reload/blob/master/README.md</a>.
</li>
</ul>
<a name="sec322"></a>
<h4>3.2.2 Build System Project Layout</h4>
<p>
The <i>niceideas_ELK-MS</i> package structure, after being properly extracted in a local folder, is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/provisionning.png">
<img class="centered" style="width: 300px;" alt="ELK-MS Provisionning" src="https://www.niceideas.ch/es_spark/images/provisionning.png" />
</a>
</div>
<br>
<ul>
<li>
<code>./setup.sh</code>: basically takes care of everything by calling vagrant to build the 3 VMs
</li>
<li>
<code>./vagrant/VagrantFile</code>: vagrant definition file to define the 3 VMs and the provisioning scripts
</li>
<li>
<code>./provisionning/*</code>: the provisioning scripts. The entry point is <code>setup.sh</code> which calls each and every other script.
</li>
</ul>
<p>
<b>Rationale</b>
</p>
<p>
In a DevOps world, there are better tools than shell scripts to proceed with VM or machine provisioning, such as Ansible, Chef, Puppet, etc.
<br>
But in my case, I want to be able to go on any VM, any machine, and re-apply my configuration to Spark, Mesos, ElasticSearch or anything else by simply calling a shell script with a few arguments.
<br>
So even though there are more efficient alternatives, I kept shell scripts here for the sake of simplicity.
</p>
<p>
<b>Building the ELK-MS test cluster on Windows</b>
</p>
<p>
With VirtualBox and Vagrant properly installed on Windows, nothing should prevent someone from building the cluster on Windows.
<br>
But in this case, of course, the root scripts <code>setup.sh</code>, <code>start_cluster.sh</code>, <code>stop_cluster.sh</code> are not usable (unless using Cygwin or MinGW?).
</p>
<p>
In this case, the user should call vagrant manually to build the 3 VMs <code>mes_master</code>, <code>mes_node1</code> and <code>mes_node2</code> as follows:
</p>
<pre>
c:\niceideas_ELK-MS\vagrant> vagrant up mes_master
...
c:\niceideas_ELK-MS\vagrant> vagrant up mes_node1
...
c:\niceideas_ELK-MS\vagrant> vagrant up mes_node2
...
</pre>
<a name="sec33"></a>
<h3>3.3 Calling the build system and results</h3>
<p>
Again, calling the build system to fully build the cluster, on Linux, is as simple as:
</p>
<pre>
badtrash@badbook:/data/niceideas_ELK-MS/setup$ ./setup.sh
</pre>
<p>
A full dump of the result of the <code>setup.sh</code> script is <a href="http://www.niceideas.ch/es_spark/setuplog.html">available here</a>.
</p>
<a name="sec34"></a>
<h3>3.4 Testing the System</h3>
<p>
After calling the <code>setup.sh</code> script above, the 3 VMs are properly created, as one can check in VirtualBox:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/virtualbox_manager.png">
<img class="centered" style="width: 500px;" alt="The three VMs in VirtualBox Manager" src="https://www.niceideas.ch/es_spark/images/virtualbox_manager.png" />
</a>
</div>
<br>
<p>
In addition, the 4 UI applications should be available at the following addresses (caution: the links below point to your cluster, not niceideas.ch):
</p>
<ul>
<li>
<a href="http://192.168.10.10:9000/#/overview?host=http:%2F%2Flocalhost:9200">Cerebro on http://192.168.10.10:9000/</a>
</li>
<li>
<a href="http://192.168.10.10:5050/">Mesos Console on http://192.168.10.10:5050/</a>
</li>
<li>
<a href="http://192.168.10.10:18080/">Spark History Server on http://192.168.10.10:18080/</a>
</li>
<li>
<a href="http://192.168.10.10:5601/">Kibana on http://192.168.10.10:5601/</a>
</li>
</ul>
<p>
<b>Cerebro:</b> (<a href="http://192.168.10.10:9000/#/overview?host=http:%2F%2Flocalhost:9200">http://192.168.10.10:9000/</a>)
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/cerebro_empty.png">
<img class="centered" style="width: 750px; " alt="Cerebro - ES management" src="https://www.niceideas.ch/es_spark/images/cerebro_empty.png" />
</a>
</div>
<br>
<p>
(One can see the 3 nodes available)
</p>
<p>
<b>Mesos:</b> (<a href="http://192.168.10.10:5050/">http://192.168.10.10:5050/</a>)
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/mesos_agents.png">
<img class="centered" style="width: 750px; " alt="Mesos - the 3 agents" src="https://www.niceideas.ch/es_spark/images/mesos_agents.png" />
</a>
</div>
<br>
<p>
(One can see the 3 nodes available)
</p>
<p>
<b>Spark History Server:</b> (<a href="http://192.168.10.10:18080/">http://192.168.10.10:18080/</a>)
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/history_server.png">
<img class="centered" style="width: 750px; " alt="Spark History Server" src="https://www.niceideas.ch/es_spark/images/history_server.png" />
</a>
</div>
<br>
<p>
<b>Kibana:</b> (<a href="http://192.168.10.10:5601/">http://192.168.10.10:5601/</a>)
</p>
<div class="centering">
<a href="https://www.niceideas.ch/es_spark/images/kibana_empty.png">
<img class="centered" style="width: 750px; " alt="Spark History Server" src="https://www.niceideas.ch/es_spark/images/kibana_empty.png" />
</a>
</div>
<br>
<a name="sec36"></a>
<h3>3.5 Tips & Tricks</h3>
<p>
<b>This closes the presentation of the <i>niceideas_ELK-MS</i> package. The remainder of this article gives some hints regarding the configuration of the different software components</b>.
<br>
Readers interested in understanding what the build system of <i>niceideas_ELK-MS</i> presented above does, without the hassle of analyzing the setup scripts, can continue reading hereunder.
<br>
Readers interested only in understanding the cluster layout and the concerns of the ES / Spark integration can move to <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">the second article in this series: ELK-MS - part II : assessing behaviour</a>.
</p>
<p>
<b>Killing a stuck job</b>
</p>
<p>
Once in a while, for various reasons, a job gets stuck. In this case the easiest way to kill it is using the Spark Web console.
<br>
But wait, hold on, didn't I just say above that such a console is not available when running through Mesos?
<br>
Well actually, the Spark console is available as long as the Spark job is alive ... which is the case, happily, when a Spark job is stuck.
</p>
<p>
So one can follow the link to the Spark console provided by Mesos and use the usual <i>kill</i> link from there.
</p>
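<p>
Should that console link be dead as well, a last-resort alternative (an assumption on my side, not something the package automates) is to tear the whole Spark framework down through the Mesos master API:
</p>
<pre>
# Kill the whole Spark framework; the framework ID is shown in the Mesos console
curl -XPOST 'http://192.168.10.10:5050/master/teardown' -d 'frameworkId=&lt;framework-id&gt;'
</pre>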
<p>
<b>Spark fine grained scheduling by Mesos</b>
</p>
<p>
When reading about Mesos <i>fine grained</i> scheduling of Spark jobs, one might think it makes sense to give it a try ... don't!
</p>
<p>
Spark <i>fine grained</i> scheduling by Mesos is really, really messed up.
<br>
One might believe that it helps concurrency and better resource allocation, but it really doesn't. In practice, what happens is that an amazing proportion of time is lost scheduling all the individual Spark tasks, plus it often compromises co-location of data between ES and Spark.
</p>
<p>
It's even deprecated in the latest Spark versions.
<br>
More information in this regard is available here: <a href="https://issues.apache.org/jira/browse/SPARK-11857">https://issues.apache.org/jira/browse/SPARK-11857</a>.
</p>
<a name="sec4"></a>
<h2>4. Noteworthy configuration elements</h2>
<p>
This section presents the important configuration aspects taken care of by the provisioning scripts.
</p>
<a name="sec41"></a>
<h3>4.1 NTP</h3>
<p>
Related scripts from the <i>niceideas_ELK-MS</i> package are as follows:
</p>
<ul>
<li>Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupNTP.sh"><code>setupNTP.sh</code></a></li>
</ul>
<p>
Just as with every big data or NoSQL cluster, having a shared common understanding of time in the cluster is key. So NTP needs to be properly set up.
</p>
<p>
<b>On master <code>mes_master</code></b>
</p>
<p>
Sample portion from <code>/etc/ntp.conf</code>:
</p>
<pre>
pool ntp1.hetzner.de iburst
pool ntp2.hetzner.com iburst
pool ntp3.hetzner.net iburst
</pre>
<p>
<b>On slaves <code>mes_node1</code> and <code>mes_node2</code></b>
</p>
<p>
Sample portion from <code>/etc/ntp.conf</code>:
</p>
<pre>
server 192.168.10.10
#enabling mes_master to set time
restrict 192.168.10.10 mask 255.255.255.255 nomodify notrap nopeer noquery
#disable maximum offset of 1000 seconds
tinker panic 0
</pre>
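<p>
A quick way to verify that the slaves actually synchronize against the master, for instance:
</p>
<pre>
# On mes_node1 / mes_node2: the master (192.168.10.10) should show up as a peer
ntpq -p
</pre>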
<a name="sec42"></a>
<h3>4.2 Zookeeper</h3>
<p>
Related scripts from the <i>niceideas_ELK-MS</i> package are as follows:
</p>
<ul>
<li>Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupZookeeper.sh"><code>setupZookeeper.sh</code></a></li>
</ul>
<p>
Zookeeper is really only required when considering several Mesos masters, since in this case we need the <i>quorum</i> feature of Zookeeper to proceed with proper election of the master and to track their state.
<br>
At the moment ELK-MS has only one Mesos master, but we'll make it production- and HA-ready by setting up and using Zookeeper already.
</p>
<p>
<b>On master <code>mes_master</code></b>
</p>
<p>
Sample portion from <code>/etc/zookeeper/conf/zoo.cfg</code>:
</p>
<pre>
server.1=192.168.10.10:2888:3888
</pre>
<p>
In addition, we need to set the Zookeeper server id in <code>/etc/zookeeper/conf/myid</code>.
<br>
Let's just put the single character "1" in it for now.
</p>
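<p>
Which, in the spirit of the provisioning scripts, is a simple one-liner:
</p>
<pre>
# Write the server id matching the server.1 entry in zoo.cfg
echo "1" > /etc/zookeeper/conf/myid
</pre>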
<a name="sec43"></a>
<h3>4.3 Elasticsearch</h3>
<p>
Related scripts from the <i>niceideas_ELK-MS</i> package are as follows:
</p>
<ul>
<li>Installation script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installElasticSearch.sh"><code>installElasticSearch.sh</code></a></li>
<li>Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupElasticSearch.sh"><code>setupElasticSearch.sh</code></a></li>
<li>Systemd service file: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/elasticsearch.service"><code>elasticsearch.service</code></a></li>
</ul>
<p>
There's not a whole lot of things to configure in ES. The installation and setup scripts really just create a dedicated user, a whole bunch of folders and symlinks, etc.
<br>
The only important configuration elements are as follows:
</p>
<p>
<b>On master <code>mes_master</code></b>
</p>
<p>
Sample portion from <code>/usr/local/lib/elasticsearch-6.0.0/config/elasticsearch.yml</code>:
</p>
<pre>
# name of the cluster (has to be common)
cluster.name: mes-es-cluster
# name of the node (has to be unique)
node.name: mes_master
# Bind on all interfaces (internal and external)
network.host: 0.0.0.0
# We're good with one node
discovery.zen.minimum_master_nodes: 1
#If you set a network.host that results in multiple bind addresses
#yet rely on a specific address for node-to-node communication, you
#should explicitly set network.publish_host
network.publish_host: 192.168.10.10
</pre>
<p>
<b>On slaves <code>mes_node1</code> and <code>mes_node2</code></b>
</p>
<p>
Sample portion from <code>/usr/local/lib/elasticsearch-6.0.0/config/elasticsearch.yml</code>:
</p>
<pre>
# name of the cluster (has to be common)
cluster.name: mes-es-cluster
# name of the node (has to be unique, this is for node1)
node.name: mes_node1
# Bind on all interfaces (internal and external)
network.host: 0.0.0.0
# We're good with one node
discovery.zen.minimum_master_nodes: 1
# enabling discovery of master
discovery.zen.ping.unicast.hosts: ["192.168.10.10"]
#If you set a network.host that results in multiple bind addresses
#yet rely on a specific address for node-to-node communication, you
#should explicitly set network.publish_host
# (this is for node1)
network.publish_host: 192.168.10.11
</pre>
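<p>
Once all three nodes are up, one way to check that they actually joined the same cluster (a verification step I find handy, not part of the setup scripts):
</p>
<pre>
# Should list mes_master, mes_node1 and mes_node2
curl -XGET 'http://192.168.10.10:9200/_cat/nodes?v'
# Cluster status should be green (or yellow while replicas are being allocated)
curl -XGET 'http://192.168.10.10:9200/_cluster/health?pretty'
</pre>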
<a name="sec44"></a>
<h3>4.4 Logstash, Kibana, Cerebro</h3>
<p>
Related scripts from the <i>niceideas_ELK-MS</i> package are as follows:
</p>
<ul>
<li>Logstash Installation script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installLogstash.sh"><code>installLogstash.sh</code></a></li>
<li>Logstash Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupLogstash.sh"><code>setupLogstash.sh</code></a></li>
</ul>
<ul>
<li>Cerebro Installation script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installCerebro.sh"><code>installCerebro.sh</code></a></li>
<li>Cerebro Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupCerebro.sh"><code>setupCerebro.sh</code></a></li>
<li>Cerebro Systemd service file: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/cerebro.service"><code>cerebro.service</code></a></li>
</ul>
<ul>
<li>Kibana Installation script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installKibana.sh"><code>installKibana.sh</code></a></li>
<li>Kibana Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupKibana.sh"><code>setupKibana.sh</code></a></li>
<li>Kibana Systemd service file: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/kibana.service"><code>kibana.service</code></a></li>
</ul>
<p>
There is really nothing specific to report in terms of configuration for these 3 tools.
</p>
<a name="sec45"></a>
<h3>4.5 Mesos</h3>
<p>
Related scripts from the <i>niceideas_ELK-MS</i> package are as follows:
</p>
<ul>
<li>Installation script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installMesos.sh"><code>installMesos.sh</code></a></li>
<li>Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupMesos.sh"><code>setupMesos.sh</code></a></li>
<li>Mesos startup script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/mesos-init-wrapper.sh"><code>mesos-init-wrapper.sh</code></a></li>
<li>Mesos Master Systemd startup file: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/mesos-master.service"><code>mesos-master.service</code></a></li>
<li>Mesos Slave Systemd startup file: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/mesos-slave.service"><code>mesos-slave.service</code></a></li>
</ul>
<p>
The noteworthy configuration aspects are as follows.
</p>
<p>
<b>On both master and slaves</b>
</p>
<p>
The file <code>/usr/local/etc/mesos/mesos-env.sh</code> contains common configuration for both mesos-master and mesos-slave.
<br>
So we should create this file on every node of the cluster.
</p>
<pre>
#Working configuration
export MESOS_log_dir=/var/log/mesos
#Specify a human readable name for the cluster
export MESOS_cluster=mes_cluster
#Avoid issues with systems that have multiple ethernet interfaces when the Master
#or Slave registers with a loopback or otherwise undesirable interface.
# (This is for master, put IP of the node)
export MESOS_ip=192.168.10.10
#By default, the Master will use the system hostname which can result in issues
#in the event the system name isn't resolvable via your DNS server.
# (This is for master, put IP of the node)
export MESOS_hostname=192.168.10.10
</pre>
<p>
Then, the file <code>/usr/local/etc/mesos/mesos-slave-env.sh</code> configures mesos-slave.
<br>
Since we run a mesos-slave process on the mes_master machine as well, we define this file on every node of the cluster as well.
</p>
<pre>
#Path of the slave work directory.
#This is where executor sandboxes will be placed, as well as the agent's checkpointed state.
export MESOS_work_dir=/var/lib/mesos/slave
#we need the Slave to discover the Master.
#This is accomplished by updating the master argument to the master Zookeeper URL
export MESOS_master=zk://$MASTER_IP:2181/mesos
</pre>
<p>
<b>On master <code>mes_master</code> only:</b>
</p>
<p>
The mesos-master process is configured by <code>/usr/local/etc/mesos/mesos-master-env.sh</code>:
</p>
<pre>
#Path of the master work directory.
#This is where the persistent information of the cluster will be stored
export MESOS_work_dir=/var/lib/mesos/master
#Specify the master Zookeeper URL which the Mesos Master will register with
export MESOS_zk=zk://192.168.10.10:2181/mesos
# Change quorum for a greater value if one has more than one master
# (only 1 in our case)
export MESOS_quorum=1
</pre>
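<p>
A simple sanity check that the master is up and the agents registered properly (again a handy verification, not something the scripts perform):
</p>
<pre>
# The JSON returned by the master state endpoint lists the registered agents
curl -s 'http://192.168.10.10:5050/master/state' | python -m json.tool | grep '"hostname"'
</pre>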
<a name="sec46"></a>
<h3>4.6 Spark</h3>
<p>
Related scripts from the <i>niceideas_ELK-MS</i> package are as follows:
</p>
<ul>
<li>Installation script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installSpark.sh"><code>installSpark.sh</code></a></li>
<li>Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupSpark.sh"><code>setupSpark.sh</code></a></li>
<li>Dynamic Allocation Configuration script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/setupSparkDynamicAllocation.sh"><code>setupSparkDynamicAllocation.sh</code></a></li>
<li>Spark History Server start wrapper: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/start-spark-history-server-wrapper.sh"><code>start-spark-history-server-wrapper.sh</code></a></li>
<li>Spark History Server Systemd startup file: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/spark-history-server.service"><code>spark-history-server.service</code></a></li>
<li>Spark Mesos Dispatcher start wrapper: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/start-spark-mesos-dispatcher-wrapper.sh"><code>start-spark-mesos-dispatcher-wrapper.sh</code></a></li>
<li>Spark Mesos Dispatcher Systemd startup file: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/utils/spark-mesos-dispatcher.service"><code>spark-mesos-dispatcher.service</code></a></li>
</ul>
<p>
Aside from the specific startup wrappers and systemd service configuration files required for the <i>Spark History Server</i> and the <i>Spark Mesos Dispatcher</i>, the noteworthy configuration elements are as follows.
</p>
<p>
<b>On both master and slaves</b>
</p>
<p>
The file <code>/usr/local/lib/spark-2.2.0/conf/spark-env.sh</code> defines common environment variables required by spark workers and drivers.
<br>
So we should create this file on every node of the cluster (in addition the master also executes spark workers).
</p>
<pre>
#point to your libmesos.so if you use Mesos
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/mesos-1.3.0/lib/libmesos.so
#Important configuration directories
export SPARK_CONF_DIR=/usr/local/lib/spark-2.2.0/conf
export SPARK_LOG_DIR=/usr/local/lib/spark-2.2.0/logs
</pre>
<p>
The file <code>/usr/local/lib/spark-2.2.0/conf/spark-defaults.conf</code> defines common configuration properties required by spark workers and drivers.
<br>
So we should create this file on every node of the cluster (since the master executes Spark workers as well).
</p>
<pre>
#Finding the mesos master through zookeeper
spark.master=mesos://zk://$MASTER_IP:2181/mesos
#Activating EventLog stuff (required by history server)
spark.eventLog.enabled=true
spark.eventLog.dir=/var/lib/spark/eventlog
#Default serializer
spark.serializer=org.apache.spark.serializer.KryoSerializer
#Limiting the driver (client) memory
spark.driver.memory=800m
#Settings required for Spark driver distribution over mesos cluster
#(Cluster Mode through Mesos Dispatcher)
spark.mesos.executor.home=/usr/local/lib/spark-2.2.0/
#If set to true, runs over Mesos clusters in coarse-grained sharing mode,
#where Spark acquires one long-lived Mesos task on each machine.
#If set to false, runs over Mesos cluster in fine-grained sharing mode,
#where one Mesos task is created per Spark task.
#(Fine grained mode is deprecated and one should consider dynamic allocation
#instead)
spark.mesos.coarse=true
#ElasticSearch setting (first node to be reached => can use localhost everywhere)
spark.es.nodes=localhost
spark.es.port=9200
spark.es.nodes.data.only=false
#The scheduling mode between jobs submitted to the same SparkContext.
#Can be FIFO or FAIR. FAIR seems not to work well with Mesos
#(FIFO is the default BTW ...)
spark.scheduler.mode=FIFO
#How long to wait to launch a data-local task before giving up
#and launching it on a less-local node.
spark.locality.wait=20s
# Configuring dynamic allocation
# (See Spark configuration page online for more information)
spark.dynamicAllocation.enabled=true
#(Caution here : small values cause issues. I have executors killed with 10s for instance)
spark.dynamicAllocation.executorIdleTimeout=120s
spark.dynamicAllocation.cachedExecutorIdleTimeout=300s
# Configuring spark shuffle service (required for dynamic allocation)
spark.shuffle.service.enabled=true
</pre>
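<p>
One caveat with <code>spark.shuffle.service.enabled=true</code>: dynamic allocation requires the external shuffle service to actually run on every agent. Spark ships a dedicated launcher for Mesos; the path below assumes the install layout used throughout this article:
</p>
<pre>
# To be started on every node running a Mesos agent
/usr/local/lib/spark-2.2.0/sbin/start-mesos-shuffle-service.sh
</pre>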
<p>
<b>On master <code>mes_master</code> only</b>
</p>
<p>
In the very same file <code>/usr/local/lib/spark-2.2.0/conf/spark-defaults.conf</code>, we add what is required for Spark History Server:
</p>
<pre>
#For the filesystem history provider,
#the directory containing application event logs to load.
spark.history.fs.logDirectory=file:///var/lib/spark/eventlog
#The period at which to check for new or updated logs in the log directory.
spark.history.fs.update.interval=5s
</pre>
<a name="sec47"></a>
<h3>4.7 ES-Hadoop</h3>
<p>
Related scripts from the <i>niceideas_ELK-MS</i> package are as follows:
</p>
<ul>
<li>Installation script: <a href="https://www.niceideas.ch/es_spark/sandbox/setup/provisionning/installESHadoop.sh"><code>installESHadoop.sh</code></a></li>
</ul>
<p>
Nothing specific to report: basically, the only thing to be done to install ES-Hadoop is to copy the Spark connector jar <code>elasticsearch-spark-20_2.11-6.0.0.jar</code> to the Spark jars folder <code>/usr/local/lib/spark-2.2.0/jars/</code>.
</p>
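<p>
In other words, something along these lines; the extracted folder name is an assumption based on the ES-Hadoop 6.0.0 distribution archive:
</p>
<pre>
# Copy the Spark connector jar where Spark picks up its dependencies
cp elasticsearch-hadoop-6.0.0/dist/elasticsearch-spark-20_2.11-6.0.0.jar \
   /usr/local/lib/spark-2.2.0/jars/
</pre>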
<a name="sec5"></a>
<h2>5. Conclusion</h2>
<p>
With all of the information above, you should be able to set up your own ElasticSearch / Mesos / Spark cluster in no time.
<br>
Or simply use the <a href="https://www.niceideas.ch/es_spark/sandbox/niceideas_ELK-MS.tar.gz">niceideas_ELK-MS</a> package to build a test cluster using one single command.
</p>
<p>
Now the next article in this series, <a href="https://www.niceideas.ch/roller2/badtrash/entry/elk-ms-elasticsearch-logstash-kibana1">ELK-MS - part II : assessing behaviour</a>, will present the tests I ran on this test cluster and the conclusions in terms of behaviour assessment.
</p>
<p>
I'll already give you the big conclusion: using ElasticSearch / Mesos / Spark for your Big Data Analytics needs is mind-boggling. It works amazingly well and supports a striking range of use cases while being a hundred times lighter than a plain old Hadoop stack, both to set up and to operate.
</p>
<p>
Kudos to the folks at Apache and at Elastic for making this possible.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/periodic-table-of-agile-principles
Periodic Table of Agile Principles and Practices
Jerome Kehrli
2017-06-29T17:19:29-04:00
2017-09-14T03:37:38-04:00
<p>
After writing <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-planning-tools-and-processes">my previous article</a>, I wondered how I could represent on a single schematic all the <i>Agile Principles and Practices</i> from the methods I am following, XP, Scrum, Lean Startup, DevOps and others.
<br>
I found that the approach I used in <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-planning-tools-and-processes#sec5"> in a former schematic</a> - a graph of relationship between practices - is not optimal. It already looks ugly with only a few practices and using the same approach for the whole set of them would make it nothing but a mess.
</p>
<p>
So I had to come up with something else, something better.
<br>
Recently I fell by chance on the <a href="https://en.wikipedia.org/wiki/Periodic_table">Periodic Table of the Elements</a>... Long time no see... Remembering my physics lessons in University, I always loved that table. I remembered spending hours understanding the layout and admiring the beauty of its natural simplicity.
<br>
So I had the idea of trying the same layout, not the same approach since both are not comparable, really only the same layout for <i>Agile Principles and Practices</i>.
</p>
<p>
The result is hereunder: <b>The Periodic Table of Agile Principles and Practices</b>:
</p>
<p>
(This article is available as a PDF document here <a href="https://www.niceideas.ch/Agile_table.pdf">https://www.niceideas.ch/Agile_table.pdf</a> and a slideshare presentation there <a href="https://www.slideshare.net/JrmeKehrli/periodic-table-of-agile-principles-and-practices">https://www.slideshare.net/JrmeKehrli/periodic-table-of-agile-principles-and-practices</a>)
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/78e8967b-a55b-4639-9ae1-bad60b0befda">
<img class="centered" style="width: 750px; " alt="Periodic Table of Agile Principles and Practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e1e282ca-a48e-456d-b876-4734563573a6" />
</a>
</div>
<br>
<p>
The layout principle and the description of the principles and practices are explained hereafter.
</p>
<h2>Layout Principle</h2>
<ul>
<li>
The <b>Origin Method</b>, such as XP, Scrum, DevOps, etc., is indicated by the color, as well as by the name of the method in the top-right corner.
</li>
<li>
The <b>Category</b>, <i>Principle</i> or <i>Practice</i>, is indicated by the shape: rectangle or rounded corners.
</li>
<li>
The number represents the <b>Complexity</b> expressed as the number of dependencies.
</li>
<li>
The <b>team or committee</b> concerned with the principle or practice is indicated as a note in the bottom-right corner.
</li>
<li>
The <b>horizontal dimension</b> is related to the complexity. The further to the right an element is, the higher its complexity.
</li>
<li>
The <b>vertical dimension</b> classifies principles and practices as more organizational or more engineering-related, in specific layers corresponding to the category or team they apply to.
</li>
</ul>
<p>
This is best presented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/7a36d6d6-446b-4834-b087-57dbfe1c91a6">
<img class="centered" style="width: 900px;" alt="Periodic Table of Agile Principles and Practices - Explanations" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/230510f9-15fd-46ef-9625-f504ecd249f7" />
</a>
</div>
<br>
<h2>Remarks</h2>
<ul>
<li>
Interestingly, but not surprisingly, Scrum is really in the middle of the schematic, underlining the fact that it impacts development principles as well as the development team organization.
</li>
<li>
XP is really everywhere down the line.
</li>
<li>
Product Development is really about Product Management in the Agile world.
</li>
<li>
DevOps is more related to development practices than anything else.
</li>
</ul>
<p>
The next part of this article describes each and every principle and practice.
</p>
<h2>XP</h2>
<style>
table.pte {
border: 0px none;
border-collapse: collapse;
}
div.entryContent table.pte tr td,
table.pte tr td {
padding-top: 10px;
border: 0px none;
border-bottom: 1px solid #888888;
}
table.pte tr td img {
display: inline;
}
td.desc_right {
padding-left: 10px;
}
.pte-image-big {
width: 90px;
min-width: 90px;
max-width: 90px;
}
.pte-image-small {
width: 30px;
min-width: 30px;
max-width: 30px;
}
</style>
<a name="Sn"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sn.png"
alt="Sn : Simple Design"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sn : Simple Design
</b><br>
A simple design always takes less time to finish than a complex one. So always do the simplest thing that could possibly work next. If you find something that is complex replace it with something simple. It's always faster and cheaper to replace complex code now, before you waste a lot more time on it.
</p>
<p>
Depends on
<a href="#Td"><img src="https://www.niceideas.ch/agile_table/Td.png" alt="Td" class="pte-image-small" /></a>,
<a href="#Rf"><img src="https://www.niceideas.ch/agile_table/Rf.png" alt="Rf" class="pte-image-small" /></a>,
<a href="#Su"><img src="https://www.niceideas.ch/agile_table/Su.png" alt="Su" class="pte-image-small" /></a>,
<a href="#Mt"><img src="https://www.niceideas.ch/agile_table/Mt.png" alt="Mt" class="pte-image-small" /></a>,
<a href="#Si"><img src="https://www.niceideas.ch/agile_table/Si.png" alt="Si" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Mt"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Mt.png"
alt="Mt : Metaphor"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Mt : Metaphor
</b><br>
System Metaphor is itself a metaphor for a simple design with certain qualities. The most important quality is being able to explain the system design to new people without resorting to dumping huge documents on them. A design should have a structure that helps new people begin contributing quickly. The second quality is a design that makes naming classes and methods consistent.
</p>
<p>
Depends on
<a href="#Sn"><img src="https://www.niceideas.ch/agile_table/Sn.png" alt="Sn" class="pte-image-small" /></a>,
<a href="#Rf"><img src="https://www.niceideas.ch/agile_table/Rf.png" alt="Rf" class="pte-image-small" /></a>,
<a href="#Oc"><img src="https://www.niceideas.ch/agile_table/Oc.png" alt="Oc" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Td"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Td.png"
alt="Td : TDD = Test Driven Development "
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Td : TDD = Test Driven Development
</b><br>
Test-driven development is a software development process that relies on the repetition of a very short development cycle: requirements are turned into very specific test cases, then the software is improved to pass the new tests, only. This is opposed to software development that allows software to be added that is not proven to meet requirements.
</p>
<p>
Depends on
<a href="#Wt"><img src="https://www.niceideas.ch/agile_table/Wt.png" alt="Wt" class="pte-image-small" /></a>,
<a href="#Ci"><img src="https://www.niceideas.ch/agile_table/Ci.png" alt="Ci" class="pte-image-small" /></a>,
<a href="#Cs"><img src="https://www.niceideas.ch/agile_table/Cs.png" alt="Cs" class="pte-image-small" /></a>,
<a href="#Sc"><img src="https://www.niceideas.ch/agile_table/Sc.png" alt="Sc" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Oc"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Oc.png"
alt="Oc : Onsite Customer"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Oc : Onsite Customer
</b><br>
One of the few requirements of extreme programming (XP) is to have the customer available. Not only to help the development team, but to be a part of it as well. All phases of an XP project require communication with the customer, preferably face to face, on site. It's best to simply assign one or more customers to the development team.
</p>
</td>
</tr>
</table>
<a name="Rf"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Rf.png"
alt="Rf : Refactoring"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Rf : Refactoring
</b><br>
We computer programmers hold onto our software designs long after they have become unwieldy. We continue to use and reuse code that is no longer maintainable because it still works in some way and we are afraid to modify it. But is it really cost effective to do so? Extreme Programming (XP) takes the stance that it is not. When we remove redundancy, eliminate unused functionality, and rejuvenate obsolete designs we are refactoring. Refactoring throughout the entire project life cycle saves time and increases quality.
<br>
Refactor mercilessly to keep the design simple as you go and to avoid needless clutter and complexity. Keep your code clean and concise so it is easier to understand, modify, and extend.
</p>
<p>
Depends on
<a href="#Td"><img src="https://www.niceideas.ch/agile_table/Td.png" alt="Td" class="pte-image-small" /></a>,
<a href="#Sn"><img src="https://www.niceideas.ch/agile_table/Sn.png" alt="Sn" class="pte-image-small" /></a>,
<a href="#Mt"><img src="https://www.niceideas.ch/agile_table/Mt.png" alt="Mt" class="pte-image-small" /></a>,
<a href="#Cs"><img src="https://www.niceideas.ch/agile_table/Cs.png" alt="Cs" class="pte-image-small" /></a>,
<a href="#Ci"><img src="https://www.niceideas.ch/agile_table/Ci.png" alt="Ci" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Cs"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Cs.png"
alt="Cs : Coding Standards"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Cs : Coding Standards
</b><br>
Code must be formatted to agreed coding standards. Coding standards keep the code consistent and easy for the entire team to read and refactor. Code that looks the same encourages collective ownership.
</p>
</td>
</tr>
</table>
<a name="Su"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Su.png"
alt="Su : Sustainable Pace"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Su : Sustainable Pace
</b><br>
To set your pace you need to take your iteration ends seriously. You want the most completed, tested, integrated, production ready software you can get each iteration. Incomplete or buggy software represents an unknown amount of future effort, so you can't measure it. If it looks like you will not be able to get everything finished by iteration end have an iteration planning meeting and re-scope the iteration to maximize your project velocity. Even if there is only one day left in the iteration it is better to get the entire team re-focused on a single completed task than many incomplete ones.
</p>
</td>
</tr>
</table>
<a name="Wt"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Wt.png"
alt="Wt : Whole Team"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Wt : Whole Team
</b><br>
All the contributors to an XP project sit together, members of a whole team. The team shares the project goals and the responsibility for achieving them. This team must include a business representative, the "Customer", who provides the requirements, sets the priorities, and steers the project.
</p>
</td>
</tr>
</table>
<a name="Ci"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ci.png"
alt="Ci : Continuous Integration"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ci : Continuous Integration
</b><br>
Developers should be integrating and committing code into the code repository every few hours, whenever possible. In any case, never hold onto changes for more than a day. Continuous integration often avoids diverging or fragmented development efforts, where developers are not communicating with each other about what can be re-used, or what could be shared. Everyone needs to work with the latest version. Changes should not be made to obsolete code, causing integration headaches.
</p>
<p>
Depends on
<a href="#Td"><img src="https://www.niceideas.ch/agile_table/Td.png" alt="Td" class="pte-image-small" /></a>,
<a href="#Cs"><img src="https://www.niceideas.ch/agile_table/Cs.png" alt="Cs" class="pte-image-small" /></a>,
<a href="#Co"><img src="https://www.niceideas.ch/agile_table/Co.png" alt="Co" class="pte-image-small" /></a>,
<a href="#Sc"><img src="https://www.niceideas.ch/agile_table/Sc.png" alt="Sc" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Co"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Co.png"
alt="Co : Collective Ownership"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Co : Collective Ownership
</b><br>
Collective Ownership encourages everyone to contribute new ideas to all segments of the project. Any developer can change any line of code to add functionality, fix bugs, improve designs or refactor. No one person becomes a bottleneck for changes.
</p>
</td>
</tr>
</table>
<a name="Cr"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Cr.png"
alt="Cr : Code Review"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Cr : Code Review
</b><br>
Code review is increasingly favored over strict Pair Programming as initially required by the XP method. The problem with Pair Programming is that it cannot fit everybody.
<br>
Code reviews are considered important by many large-process gurus. They are intended to ensure conformance to standards and, more importantly, to ensure that the code is clear, efficient, works, and has QWAN. They are also intended to help disseminate knowledge about the code to the rest of the team.
</p>
<p>
Depends on
<a href="#Sn"><img src="https://www.niceideas.ch/agile_table/Sn.png" alt="Sn" class="pte-image-small" /></a>,
<a href="#Cs"><img src="https://www.niceideas.ch/agile_table/Cs.png" alt="Cs" class="pte-image-small" /></a>,
<a href="#Sc"><img src="https://www.niceideas.ch/agile_table/Sc.png" alt="Sc" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Pg"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Pg.png"
alt="Pg : Planning Game"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Pg : Planning Game
</b><br>
The main planning process within extreme programming is called the Planning Game. The game is a meeting that occurs once per iteration, typically once a week. The planning process is divided into two parts: Release Planning and Sprint Planning.
</p>
<p>
Depends on
<a href="#Sr"><img src="https://www.niceideas.ch/agile_table/Sr.png" alt="Sr" class="pte-image-small" /></a>,
<a href="#Oc"><img src="https://www.niceideas.ch/agile_table/Oc.png" alt="Oc" class="pte-image-small" /></a>,
<a href="#Po"><img src="https://www.niceideas.ch/agile_table/Po.png" alt="Po" class="pte-image-small" /></a>,
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Sr"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sr.png"
alt="Sr : Small Releases"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sr : Small Releases
</b><br>
The development team needs to release iterative versions of the system to the customers often. Some teams deploy new software into production every day. At the very least you will want to get new software into production every week or two. At the end of every iteration you will have tested, working, production ready software to demonstrate to your customers. The decision to put it into production is theirs.
</p>
</td>
</tr>
</table>
<a name="Sc"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sc.png"
alt="Sc : Source Code Management"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sc : Source Code Management
</b><br>
A component of software configuration management, version control, also known as revision control or source control, is the management of changes to documents, computer programs, large web sites, and other collections of information. Changes are usually identified by a number or letter code, termed the "revision number", "revision level", or simply "revision". For example, an initial set of files is "revision 1". When the first change is made, the resulting set is "revision 2", and so on. Each revision is associated with a timestamp and the person making the change. Revisions can be compared, restored, and with some types of files, merged.
</p>
</td>
</tr>
</table>
<a name="Bs"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Bs.png"
alt="Bs : Boyscout Rule"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Bs : Boyscout Rule
</b><br>
The Boy Scouts have a rule: "Always leave the campground cleaner than you found it." If you find a mess on the ground, you clean it up regardless of who might have made the mess. You intentionally improve the environment for the next group of campers. Actually the original form of that rule, written by Robert Stephenson Smyth Baden-Powell, the father of scouting, was "Try and leave this world a little better than you found it."
<br>
What if we followed a similar rule in our code: "Always check a module in cleaner than when you checked it out." No matter who the original author was, what if we always made some effort, no matter how small, to improve the module. What would be the result?
</p>
<p>
Depends on
<a href="#Rf"><img src="https://www.niceideas.ch/agile_table/Rf.png" alt="Rf" class="pte-image-small" /></a>,
<a href="#Td"><img src="https://www.niceideas.ch/agile_table/Td.png" alt="Td" class="pte-image-small" /></a>,
<a href="#Sn"><img src="https://www.niceideas.ch/agile_table/Sn.png" alt="Sn" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="No"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/No.png"
alt="No : No premature optimization"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
No : No premature optimization
</b><br>
In Donald Knuth's paper "Structured Programming With GoTo Statements", he wrote: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
</p>
</td>
</tr>
</table>
<a name="At"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/At.png"
alt="At : Acceptance testing"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
At : Acceptance testing
</b><br>
Acceptance tests are created from user stories. During an iteration, the user stories selected during the iteration planning meeting will be translated into acceptance tests. The customer specifies scenarios to test when a user story has been correctly implemented. A story can have one or many acceptance tests, whatever it takes to ensure the functionality works.
</p>
<p>
Depends on
<a href="#Oc"><img src="https://www.niceideas.ch/agile_table/Oc.png" alt="Oc" class="pte-image-small" /></a>,
<a href="#Sr"><img src="https://www.niceideas.ch/agile_table/Sr.png" alt="Sr" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Ac"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ac.png"
alt="Ac : Automated Tests Coverage"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ac : Automated Tests Coverage
</b><br>
Code Coverage is a measurement of how many lines/blocks/arcs of your code are executed while the automated tests are running.
<br>
Code coverage on every dimension should be above 80% where possible (the famous 80/20 rule) and as close as possible to 100% (TDD).
</p>
<p>
Depends on
<a href="#Ci"><img src="https://www.niceideas.ch/agile_table/Ci.png" alt="Ci" class="pte-image-small" /></a>,
<a href="#Td"><img src="https://www.niceideas.ch/agile_table/Td.png" alt="Td" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<h2>Scrum</h2>
<a name="Sp"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sp.png"
alt="Sp : Sprint"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sp : Sprint
</b><br>
A Sprint is a time-box of one month or less during which a "Done", useable, and potentially releasable product Increment is created. Sprints best have consistent durations throughout a development effort. A new Sprint starts immediately after the conclusion of the previous Sprint.
</p>
<p>
Depends on
<a href="#Sr"><img src="https://www.niceideas.ch/agile_table/Sr.png" alt="Sr" class="pte-image-small" /></a>,
<a href="#Sl"><img src="https://www.niceideas.ch/agile_table/Sl.png" alt="Sl" class="pte-image-small" /></a>,
<a href="#So"><img src="https://www.niceideas.ch/agile_table/So.png" alt="So" class="pte-image-small" /></a>,
<a href="#Sb"><img src="https://www.niceideas.ch/agile_table/Sb.png" alt="Sb" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="In"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/In.png"
alt="In : Product Increment"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
In : Product Increment (Shippable Product)
</b><br>
In Scrum, the Development Team delivers a Product Increment each Sprint.
<br>
The increment must consist of thoroughly tested code that has been built into an executable, and the user operation of the functionality is documented either in Help files or user documentation. These requirements are documented in the Definition of Done.
<br>
If everything goes well and the Development Team has estimated well, the Product Increment includes all the items that were planned in the Sprint Backlog, tested and documented.
</p>
<p>
Depends on
<a href="#Sr"><img src="https://www.niceideas.ch/agile_table/Sr.png" alt="Sr" class="pte-image-small" /></a>,
<a href="#Sp"><img src="https://www.niceideas.ch/agile_table/Sp.png" alt="Sp" class="pte-image-small" /></a>,
<a href="#Pb"><img src="https://www.niceideas.ch/agile_table/Pb.png" alt="Pb" class="pte-image-small" /></a>,
<a href="#Ci"><img src="https://www.niceideas.ch/agile_table/Ci.png" alt="Ci" class="pte-image-small" /></a>,
<a href="#Cd"><img src="https://www.niceideas.ch/agile_table/Cd.png" alt="Cd" class="pte-image-small" /></a>,
<a href="#Ft"><img src="https://www.niceideas.ch/agile_table/Ft.png" alt="Ft" class="pte-image-small" /></a>,
<a href="#Pt"><img src="https://www.niceideas.ch/agile_table/Pt.png" alt="Pt" class="pte-image-small" /></a>,
<a href="#Ff"><img src="https://www.niceideas.ch/agile_table/Ff.png" alt="Ff" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Sl"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sl.png"
alt="Sl : Sprint Planning"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sl : Sprint Planning
</b><br>
In Scrum, the sprint planning meeting is attended by the product owner, ScrumMaster and the entire Scrum team. Outside stakeholders may attend by invitation of the team, although this is rare in most companies.
<br>
During the sprint planning meeting, the product owner describes the highest priority features to the team. The team asks enough questions that they can turn a high-level user story of the product backlog into the more detailed tasks of the sprint backlog.
</p>
<p>
Depends on
<a href="#Pg"><img src="https://www.niceideas.ch/agile_table/Pg.png" alt="Pg" class="pte-image-small" /></a>,
<a href="#Po"><img src="https://www.niceideas.ch/agile_table/Po.png" alt="Po" class="pte-image-small" /></a>,
<a href="#Tv"><img src="https://www.niceideas.ch/agile_table/Tv.png" alt="Tv" class="pte-image-small" /></a>,
<a href="#Us"><img src="https://www.niceideas.ch/agile_table/Us.png" alt="Us" class="pte-image-small" /></a>,
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="So"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/So.png"
alt="So : Sprint Retrospective"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
So : Sprint Retrospective
</b><br>
No matter how good a Scrum team is, there is always opportunity to improve. Although a good Scrum team will be constantly looking for improvement opportunities, the team should set aside a brief, dedicated period at the end of each sprint to deliberately reflect on how they are doing and to find ways to improve. This occurs during the sprint retrospective.
<br>
The sprint retrospective is usually the last thing done in a sprint. Many teams will do it immediately after the sprint review. The entire team, including both the ScrumMaster and the product owner should participate. You can schedule a scrum retrospective for up to an hour, which is usually quite sufficient. However, occasionally a hot topic will arise or a team conflict will escalate and the retrospective could take significantly longer.
</p>
<p>
Depends on
<a href="#Kb"><img src="https://www.niceideas.ch/agile_table/Kb.png" alt="Kb" class="pte-image-small" /></a>,
<a href="#Wh"><img src="https://www.niceideas.ch/agile_table/Wh.png" alt="Wh" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Sb"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sb.png"
alt="Sb : Sprint Backlog"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sb : Sprint Backlog
</b><br>
The sprint backlog is a list of tasks identified by the Scrum team to be completed during the Scrum sprint. During the sprint planning meeting, the team selects some number of product backlog items, usually in the form of user stories, and identifies the tasks necessary to complete each user story. Most teams also estimate how many hours each task will take someone on the team to complete.
</p>
<p>
Depends on
<a href="#Pb"><img src="https://www.niceideas.ch/agile_table/Pb.png" alt="Pb" class="pte-image-small" /></a>,
<a href="#Sp"><img src="https://www.niceideas.ch/agile_table/Sp.png" alt="Sb" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Pb"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Pb.png"
alt="Pb : Product Backlog "
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Pb : Product Backlog
</b><br>
The agile product backlog in Scrum is a prioritized features list, containing short descriptions of all functionality desired in the product. When applying Scrum, it's not necessary to start a project with a lengthy, upfront effort to document all requirements. Typically, a Scrum team and its product owner begin by writing down everything they can think of for agile backlog prioritization. This agile product backlog is almost always more than enough for a first sprint. The Scrum product backlog is then allowed to grow and change as more is learned about the product and its customers.
</p>
<p>
Depends on
<a href="#Sg"><img src="https://www.niceideas.ch/agile_table/Sg.png" alt="Sg" class="pte-image-small" /></a>,
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Sd"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sd.png"
alt="Sd : Sprint Demo"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sd : Sprint Demo
</b><br>
In Scrum, each sprint is required to deliver a potentially shippable product increment. This means that at the end of each sprint, the team has produced a coded, tested and usable piece of software.
<br>
So at the end of each sprint, a sprint review meeting is held. During this meeting, the Scrum team shows what they accomplished during the sprint. Typically this takes the form of a demo of the new features.
</p>
<p>
Depends on
<a href="#Sp"><img src="https://www.niceideas.ch/agile_table/Sp.png" alt="Sp" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Po"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Po.png"
alt="Po : Product Owner"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Po : Product Owner
</b><br>
The Scrum product owner is typically a project's key stakeholder. Part of the product owner's responsibilities is to have a vision of what he or she wishes to build, and to convey that vision to the scrum team. This is key to successfully starting any agile software development project. The agile product owner does this in part through the product backlog, which is a prioritized features list for the product.
<br>
The product owner is commonly a lead user of the system, someone from marketing or product management, or anyone with a solid understanding of the users, the marketplace, the competition and the future trends for the domain or type of system being developed.
</p>
<p>
Depends on
<a href="#Oc"><img src="https://www.niceideas.ch/agile_table/Oc.png" alt="Oc" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Ds"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ds.png"
alt="Ds : Daily Scrum"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ds : Daily Scrum
</b><br>
In Scrum, on each day of a sprint, the team holds a short meeting called the "daily scrum". Meetings are typically held in the same location and at the same time each day. Ideally, a daily scrum meeting is held in the morning, as it helps set the context for the coming day's work. These scrum meetings are strictly time-boxed to 15 minutes. This keeps the discussion brisk but relevant.
</p>
</td>
</tr>
</table>
<a name="Sm"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sm.png"
alt="Sm : Scrum Master"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sm : Scrum Master
</b><br>
The ScrumMaster is responsible for making sure a Scrum team lives by the values and practices of Scrum. The ScrumMaster is often considered a coach for the team, helping the team do the best work it possibly can. The ScrumMaster can also be thought of as a process owner for the team, creating a balance with the project's key stakeholder, who is referred to as the product owner.
<br>
The ScrumMaster does anything possible to help the team perform at their highest level. This involves removing any impediments to progress, facilitating meetings, and doing things like working with the product owner to make sure the product backlog is in good shape and ready for the next sprint. The ScrumMaster role is commonly filled by a former project manager or a technical team leader but can be anyone.
</p>
</td>
</tr>
</table>
<a name="Do"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Do.png"
alt="Do: Definition of Done"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Do: Definition of Done
</b><br>
Definition of Done is a simple list of activities (writing code, coding comments, unit testing, integration testing, release notes, design documents, etc.) that add verifiable/demonstrable value to the product. Focusing on value-added steps allows the team to focus on what must be completed in order to build software while eliminating wasteful activities that only complicate software development efforts.
</p>
<p>
Depends on
<a href="#Cs"><img src="https://www.niceideas.ch/agile_table/Cs.png" alt="Cs" class="pte-image-small" /></a>,
<a href="#Cr"><img src="https://www.niceideas.ch/agile_table/Cr.png" alt="Cr" class="pte-image-small" /></a>,
<a href="#Td"><img src="https://www.niceideas.ch/agile_table/Td.png" alt="Td" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Pp"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Pp.png"
alt="Pp : Planning Poker"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Pp : Planning Poker
</b><br>
Planning Poker is an agile estimating and planning technique that is consensus based. To start a poker planning session, the product owner or customer reads an agile user story or describes a feature to the estimators.
<br>
Each estimator holds a deck of Planning Poker cards with values like 0, 1, 2, 3, 5, 8, 13, 20, 40 and 100, which is the sequence we recommend. The values represent the number of story points, ideal days, or other units in which the team estimates.
<br>
The estimators discuss the feature, asking questions of the product owner as needed. When the feature has been fully discussed, each estimator privately selects one card to represent his or her estimate. All cards are then revealed at the same time.
<br>
If all estimators selected the same value, that becomes the estimate. If not, the estimators discuss their estimates. The high and low estimators should especially share their reasons. After further discussion, each estimator reselects an estimate card, and all cards are again revealed at the same time.
<br>
The poker planning process is repeated until consensus is achieved or until the estimators decide that agile estimating and planning of a particular item needs to be deferred until additional information can be acquired.
</p>
<p>
Depends on
<a href="#Es"><img src="https://www.niceideas.ch/agile_table/Es.png" alt="Es" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Es"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Es.png"
alt="Es : Estimations in Story Points"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Es : Estimations in Story Points
</b><br>
Story points are a unit of measure for expressing an estimate of the overall effort that will be required to fully implement a product backlog item or any other piece of work.
<br>
When we estimate with story points, we assign a point value to each item. The raw values we assign are unimportant. What matters are the relative values. A story that is assigned a 2 should require twice as much effort as a story that is assigned a 1. It should also require two-thirds of the effort of a story that is estimated at 3 story points.
<br>
Instead of assigning 1, 2 and 3, the team could have assigned 100, 200 and 300. Or 1 million, 2 million and 3 million. It is the ratios that matter, not the actual numbers.
</p>
</td>
</tr>
</table>
<a name="Tv"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Tv.png"
alt="Tv : Team Velocity"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Tv : Team Velocity
</b><br>
Velocity is simply a metric based on the completed items in a sprint by a single team. The metric is completely subjective to that specific team, and should never be extrapolated for any other comparison.
<br>
Velocity is a reflective metric gathered from the sprint throughput of a stable team. Usually, a velocity metric is not considered valid until several sprints have been completed.
</p>
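<p>
A minimal sketch of how velocity is typically computed and used for a naive forecast; all numbers below are invented for illustration (Python used purely as a notation):
</p>
<pre>
# Velocity as the average of completed story points over recent
# sprints, used for a naive release forecast. All numbers are invented.

completed_points = [21, 18, 25, 22]   # last four sprints of one team

velocity = sum(completed_points) / len(completed_points)   # 21.5

remaining_backlog_points = 130
sprints_needed = remaining_backlog_points / velocity

print(f"velocity: {velocity:.1f} points per sprint")
print(f"forecast: about {sprints_needed:.1f} sprints remaining")
</pre>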
<p>
Depends on
<a href="#Es"><img src="https://www.niceideas.ch/agile_table/Es.png" alt="Es" class="pte-image-small" /></a>,
<a href="#Sp"><img src="https://www.niceideas.ch/agile_table/Sp.png" alt="Sp" class="pte-image-small" /></a>,
<a href="#So"><img src="https://www.niceideas.ch/agile_table/So.png" alt="So" class="pte-image-small" /></a>,
<a href="#Su"><img src="https://www.niceideas.ch/agile_table/Su.png" alt="Su" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<h2>Product Development </h2>
<a name="Us"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Us.png"
alt="Us : User Stories"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Us : User Stories
</b><br>
In software development and product management, a user story is an informal, natural language description of one or more features of a software system. User stories are often written from the perspective of an end user or user of a system. They are often recorded on index cards, on Post-it notes, or in project management software. Depending on the project, user stories may be written by various stakeholders including clients, users, managers or development team members.
</p>
<p>
Depends on
<a href="#Oc"><img src="https://www.niceideas.ch/agile_table/Oc.png" alt="Oc" class="pte-image-small" /></a>,
<a href="#Po"><img src="https://www.niceideas.ch/agile_table/Po.png" alt="Po" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Sg"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Sg.png"
alt="Sg : Story Mapping"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Sg : Story Mapping
</b><br>
Story mapping consists of ordering user stories along two independent dimensions. The "map" arranges user activities along the horizontal axis in rough order of priority (or "the order in which you would describe activities to explain the behaviour of the system"). Down the vertical axis, it represents increasing sophistication of the implementation.
<br>
Given a story map so arranged, the first horizontal row represents a "walking skeleton", a bare-bones but usable version of the product. Working through successive rows fleshes out the product with additional functionality.
</p>
<p>
Depends on
<a href="#Oc"><img src="https://www.niceideas.ch/agile_table/Oc.png" alt="Oc" class="pte-image-small" /></a>,
<a href="#Po"><img src="https://www.niceideas.ch/agile_table/Po.png" alt="Po" class="pte-image-small" /></a>,
<a href="#Us"><img src="https://www.niceideas.ch/agile_table/Us.png" alt="Us" class="pte-image-small" /></a>,
<a href="#Pv"><img src="https://www.niceideas.ch/agile_table/Pv.png" alt="Pv" class="pte-image-small" /></a>,
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Cc"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Cc.png"
alt="Cc : 3 C's - Card, conversation, confirmation"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Cc : 3 C's - Card, conversation, confirmation
</b><br>
"Card, Conversation, Confirmation"; this formula (from Ron Jeffries) captures the components of a User Story:
<br>
a <b>"Card"</b> (or often a Post-It note), a physical token giving tangible and durable form to what would otherwise only be an abstraction;
<br>
a <b>"conversation"</b> taking place at different time and places during a project between the various people concerned by a given feature of a software product: customers, users, developers, testers; this conversation is largely verbal but most often supplemented by documentation;
<br>
the <b>"confirmation"</b>, finally, the more formal the better, that the objectives the conversation revolved around have been reached.
</p>
<p>
Depends on
<a href="#Us"><img src="https://www.niceideas.ch/agile_table/Us.png" alt="Us" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Pv"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Pv.png"
alt="Pv : Product Vision"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Pv : Product Vision (elevator Pitch)
</b><br>
Every Scrum project needs a product vision that acts as the project's true north, sets the direction and guides the Scrum team. It is the overarching goal everyone must share – Product Owner, ScrumMaster, team, management, customers and other stakeholders. As Ken Schwaber puts it: "The minimum plan necessary to start a Scrum project consists of a vision and a Product Backlog. The vision describes why the project is being undertaken and what the desired end state is."
</p>
<p>
Depends on
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Iv"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Iv.png"
alt="Iv : INVEST"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Iv : INVEST
</b><br>
The INVEST mnemonic for agile software projects was created by Bill Wake as a reminder of the characteristics of a good quality User Story:
<br>
<b>Independent</b>: The User Story should be self-contained, in a way that there is no inherent dependency on another PBI (Product Backlog Item);
<br>
<b>Negotiable</b>: User Stories are not explicit contracts and should leave space for discussion;
<br>
<b>Valuable</b>: A User Story must deliver value to the stakeholders;
<br>
<b>Estimatable</b>: You must always be able to estimate the size of a User Story;
<br>
<b>Small</b>: User Stories should not be so big as to become impossible to plan/task/prioritize with a certain level of accuracy;
<br>
<b>Testable</b>: The User Story or its related description must provide the necessary information to make test development possible.
</p>
<p>
Depends on
<a href="#Us"><img src="https://www.niceideas.ch/agile_table/Us.png" alt="Us" class="pte-image-small" /></a>,
<a href="#Oc"><img src="https://www.niceideas.ch/agile_table/Oc.png" alt="Oc" class="pte-image-small" /></a>,
<a href="#Po"><img src="https://www.niceideas.ch/agile_table/Po.png" alt="Po" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<h2>DevOps</h2>
<a name="Ff"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ff.png"
alt="Ff : Feature Flipping"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ff : Feature Flipping
</b><br>
Feature flipping is a technique in software development that attempts to provide an alternative to maintaining multiple source-code branches (known as feature branches), such that a feature can be tested even before it is completed and ready for release. Feature flipping is used to hide, enable or disable features at run time. For example, during the development process, a developer can enable the feature for testing and disable it for the remaining users.
</p>
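<p>
A minimal sketch of what a feature flag can look like in code; the flag store and function names below are hypothetical, and real systems would read flags from configuration or a dedicated service:
</p>
<pre>
# A minimal feature-flag sketch; the flag store and function names are
# hypothetical. Real systems read flags from configuration or a service.

FEATURE_FLAGS = {
    "new_checkout_flow": False,   # merged to trunk but hidden from users
}

def is_enabled(flag, user=None):
    # A real implementation could also enable flags per user or cohort,
    # e.g. for internal testers, ahead of the general release.
    return FEATURE_FLAGS.get(flag, False)

def legacy_checkout(cart):
    return f"legacy flow: {len(cart)} items"

def new_checkout(cart):
    return f"new flow: {len(cart)} items"

def checkout(cart, user):
    if is_enabled("new_checkout_flow", user):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout(["book", "pen"], user="alice"))   # "legacy flow: 2 items"
</pre>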
<p>
Depends on
<a href="#Cm"><img src="https://www.niceideas.ch/agile_table/Cm.png" alt="Cm" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Cd"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Cd.png"
alt="Cd : Continuous Delivery"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Cd : Continuous Delivery
</b><br>
Continuous delivery (CD) is a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time. It aims at building, testing, and releasing software faster and more frequently. The approach helps reduce the cost, time, and risk of delivering changes by allowing for more incremental updates to applications in production. A straightforward and repeatable deployment process is important for continuous delivery.
</p>
<p>
Depends on
<a href="#Ci"><img src="https://www.niceideas.ch/agile_table/Ci.png" alt="Ci" class="pte-image-small" /></a>,
<a href="#Td"><img src="https://www.niceideas.ch/agile_table/Td.png" alt="Td" class="pte-image-small" /></a>,
<a href="#Ap"><img src="https://www.niceideas.ch/agile_table/Ap.png" alt="Ap" class="pte-image-small" /></a>,
<a href="#In"><img src="https://www.niceideas.ch/agile_table/In.png" alt="In" class="pte-image-small" /></a>,
<a href="#Sr"><img src="https://www.niceideas.ch/agile_table/Sr.png" alt="Sr" class="pte-image-small" /></a>,
<a href="#Ic"><img src="https://www.niceideas.ch/agile_table/Ic.png" alt="Ic" class="pte-image-small" /></a>,
<a href="#Zd"><img src="https://www.niceideas.ch/agile_table/Zd.png" alt="Zd" class="pte-image-small" /></a>,
<a href="#Vc"><img src="https://www.niceideas.ch/agile_table/Vc.png" alt="Vc" class="pte-image-small" /></a>,
<a href="#Bp"><img src="https://www.niceideas.ch/agile_table/Bp.png" alt="Bp" class="pte-image-small" /></a>,
<a href="#Ar"><img src="https://www.niceideas.ch/agile_table/Ar.png" alt="Ar" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Ap"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ap.png"
alt="Ap : Automated Provisioning"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ap : Automated Provisioning
</b><br>
(Infrastructure as Code) Server provisioning is a set of actions to prepare a server with appropriate systems, data and software, and make it ready for network operation. Typical tasks when provisioning a server are: selecting a server from a pool of available servers, loading the appropriate software (operating system, device drivers, middleware, and applications), customizing and configuring the system and the software to create or change a boot image for this server, changing its parameters such as IP address and IP gateway, finding the associated network and storage resources (sometimes separated as resource provisioning), and finally auditing the system.
<br>
With DevOps and Automated Provisioning, this whole configuration pipeline should be completely automated and executable in one-click, either automatically or on-demand.
</p>
<p>
Depends on
<a href="#Ic"><img src="https://www.niceideas.ch/agile_table/Ic.png" alt="Ic" class="pte-image-small" /></a>,
<a href="#Cm"><img src="https://www.niceideas.ch/agile_table/Cm.png" alt="Cm" class="pte-image-small" /></a>,
<a href="#Vc"><img src="https://www.niceideas.ch/agile_table/Vc.png" alt="Vc" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Ic"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ic.png"
alt="Ic : Infrastructure Continuous Integration"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ic : Infrastructure Continuous Integration
</b><br>
(Infrastructure as Code) Infrastructure Continuous Integration consists in applying Continuous Integration techniques to infrastructure components.
<br>
The continuous integration system is necessarily complex, spanning the development, test and staging environments. The continuous integration build should continuously build and test the provisioning, configuration and maintenance of the various infrastructure components.
</p>
<p>
Depends on
<a href="#Ci"><img src="https://www.niceideas.ch/agile_table/Ci.png" alt="Ci" class="pte-image-small" /></a>,
<a href="#Cm"><img src="https://www.niceideas.ch/agile_table/Cm.png" alt="Cm" class="pte-image-small" /></a>,
<a href="#Ap"><img src="https://www.niceideas.ch/agile_table/Ap.png" alt="Ap" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Zd"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Zd.png"
alt="Zd : Zero Downtime Deployments"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Zd : Zero Downtime Deployments
</b><br>
A Zero Downtime Deployment consists in redeploying (typically for a software upgrade) a production system without any downtime appearing to end users. To achieve such lofty goals, redundancy becomes a critical requirement at every level of your infrastructure. There are various techniques involved, such as canary releases or blue-green deployments.
</p>
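<p>
A minimal sketch of the blue-green idea, where the cut-over is a single atomic change; the pools and health checks below are hypothetical stand-ins for a real load balancer or router:
</p>
<pre>
# Two pools exist side by side; traffic points at one while the other
# is upgraded, and the cut-over is a single atomic change. Pools and
# health checks are hypothetical stand-ins for a real load balancer.

pools = {
    "blue":  {"version": "1.4", "healthy": True},   # currently serving
    "green": {"version": "1.5", "healthy": True},   # freshly deployed
}
live = "blue"

def switch_if_healthy(target):
    global live
    if pools[target]["healthy"]:
        live = target   # the only user-visible change, and it is atomic
        return f"now serving {target} (v{pools[target]['version']})"
    return f"kept {live}: {target} failed its health check"

print(switch_if_healthy("green"))
# Rolling back is the same one-line flip in the other direction.
</pre>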
<p>
Depends on
<a href="#Cm"><img src="https://www.niceideas.ch/agile_table/Cm.png" alt="Cm" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Cm"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Cm.png"
alt="Cm : Configuration Management"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Cm : Configuration Management
</b><br>
Configuration management is a class of tools supporting the automation of the configuration of a system, platform or software. It typically consists in <i>defining with code</i> the various configuration elements that prepare a provisioned compute resource (like a server or an AWS EC2 instance) for service: installing software, setting up users, configuring services, placing files with template-defined variables, and defining external configuration resources such as DNS records in a relevant zone.
</p>
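<p>
A minimal sketch of the idempotent, declarative idea behind such tools, reduced to a single <i>ensure</i> primitive; the file path and template below are invented for the example:
</p>
<pre>
# The idempotent "declare desired state" idea behind configuration-
# management tools (Ansible, Puppet, ...), reduced to one primitive.
# The file path and template are invented for the example.

import os

def ensure_file(path, content):
    """Make sure path holds exactly content; report whether we changed it."""
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == content:
                return "unchanged"   # already in the desired state
    with open(path, "w") as f:
        f.write(content)
    return "changed"

template = "Welcome to {host} - managed, do not edit by hand\n"
print(ensure_file("/tmp/motd.demo", template.format(host="web01")))
# Prints "changed" on the first run and "unchanged" on every rerun.
</pre>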
<p>
Depends on
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Vc"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Vc.png"
alt="Vc : Virtualization and Containers"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Vc : Virtualization and Containers
</b><br>
Hardware virtualization or platform virtualization refers to the creation of a virtual machine that acts like a real computer with an operating system. Software executed on these virtual machines is separated from the underlying hardware resources.
<br>
Containerization - also called container-based virtualization and application containerization - is an OS-level virtualization method for deploying and running distributed applications without launching an entire VM for each application. Instead, multiple isolated systems, called containers, are run on a single control host and access a single kernel.
</p>
</td>
</tr>
</table>
<a name="Bp"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Bp.png"
alt="Bp : Build Pipelines"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Bp : Build Pipelines
</b><br>
Build pipelines are integrated views of downstream and upstream build jobs on a build server. Build pipelines are required to automate all the various tasks towards continuous delivery, such as: provisioning of the environment, building the various software components (compilation, tests, packaging, etc.), deploying the software components, applying configuration and testing the deployed platform.
</p>
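<p>
A minimal sketch of a pipeline as ordered, gating stages; real build servers express the same idea declaratively, and the stage bodies below are placeholders:
</p>
<pre>
# A pipeline as ordered stages, each gating the next; build servers
# (Jenkins and friends) express the same idea declaratively. The stage
# bodies are placeholders.

def provision():  print("environment provisioned")
def build():      print("compiled, tested, packaged")
def deploy():     print("components deployed")
def configure():  print("configuration applied")
def verify():     print("platform tests passed")

PIPELINE = [provision, build, deploy, configure, verify]

for stage in PIPELINE:
    try:
        stage()   # a failing stage stops everything downstream
    except Exception as exc:
        print(f"pipeline stopped at {stage.__name__}: {exc}")
        break
</pre>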
<p>
Depends on
<a href="#Ic"><img src="https://www.niceideas.ch/agile_table/Ic.png" alt="Ic" class="pte-image-small" /></a>,
<a href="#Ci"><img src="https://www.niceideas.ch/agile_table/Ci.png" alt="Ci" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Ar"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ar.png"
alt="Ar : Automated Releases"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ar : Automated Releases
</b><br>
Release Automation consists in automating all the various steps required to release a new version of a software product: building, testing, tagging, branching and deploying the binaries to a binary management tool.
</p>
<p>
Depends on
<a href="#Bp"><img src="https://www.niceideas.ch/agile_table/Bp.png" alt="Bp" class="pte-image-small" /></a>,
<a href="#Bm"><img src="https://www.niceideas.ch/agile_table/Bm.png" alt="Bm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="St"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/St.png"
alt="St : Share the tools"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
St : Share the tools
</b><br>
Share the tools is a DevOps principle aimed at bringing both Dev and Ops tools and practices to the other side of the wall. Developers should apply their automation and build tools to infrastructure automation, provisioning and testing. Ops should share production monitoring concerns with developers.
</p>
<p>
Depends on
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Os"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Os.png"
alt="Os : Operators are stakeholders"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Os : Operators are stakeholders
</b><br>
Operators as stakeholders is a DevOps principle stating that operators should be considered the other users of the platform. They should be fully integrated in the software development process.
<br>
At specification time, operators should give their non-functional requirements just as business users give their functional requirements. Such non-functional requirements should be handled with the same importance and priority by the development team.
<br>
At implementation time, operators should provide feedback and non-functional test specifications continuously, just as business users provide feedback on functional features.
</p>
<p>
Depends on
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Or"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Or.png"
alt="Or : Operators in Rituals"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Or : Operators in Rituals
</b><br>
Operators in Rituals is a DevOps principle stating that operators should be integrated in the Development Team Rituals such as the Sprint Planning and Sprint Retrospective and represent non-functional constraints during these rituals just as the Product Owner represents the functional interests.
</p>
<p>
Depends on
<a href="#Sl"><img src="https://www.niceideas.ch/agile_table/Sl.png" alt="Sl" class="pte-image-small" /></a>,
<a href="#So"><img src="https://www.niceideas.ch/agile_table/So.png" alt="So" class="pte-image-small" /></a>,
<a href="#Ds"><img src="https://www.niceideas.ch/agile_table/Ds.png" alt="Ds" class="pte-image-small" /></a>,
<a href="#Cd"><img src="https://www.niceideas.ch/agile_table/Cd.png" alt="Cd" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Bm"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Bm.png"
alt="Bm : Binaries Management"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Bm : Binaries Management
</b><br>
A binary repository manager is a software tool designed to optimize the download and storage of binary files used and produced in software development. It centralizes the management of all the binary artifacts generated and used by the organization to overcome the complexity arising from the diversity of binary artifact types, their position in the overall workflow and the dependencies between them.
<br>
A binary repository is a software repository for packages, artifacts and their corresponding metadata. It can be used to store binary files produced by an organization itself, such as product releases and nightly product builds, or for third party binaries which must be treated differently for both technical and legal reasons.
</p>
</td>
</tr>
</table>
<h2>Lean Startup</h2>
<a name="Fl"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Fl.png"
alt="Fl : Feedback Loop"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Fl : Feedback Loop
</b><br>
The <i>Build-Measure-Learn</i> feedback loop is one of the central principles of Lean Startup Method.
<br>
A startup's goal is to find a successful revenue model that can be developed with further investment. Build-Measure-Learn is a framework for establishing – and continuously improving – the effectiveness of new products, services and ideas quickly and cost-effectively.
<br>
In practice, the model involves a cycle of creating and testing hypotheses by building something small for potential customers to try, measuring their reactions, and learning from the results.
</p>
<p>
Depends on
<a href="#Sd"><img src="https://www.niceideas.ch/agile_table/Sd.png" alt="Sd" class="pte-image-small" /></a>,
<a href="#Cd"><img src="https://www.niceideas.ch/agile_table/Cd.png" alt="Cd" class="pte-image-small" /></a>,
<a href="#Oc"><img src="https://www.niceideas.ch/agile_table/Oc.png" alt="Oc" class="pte-image-small" /></a>,
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Ft"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ft.png"
alt="Ft : feature Teams"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ft : feature Teams
</b><br>
A <i>feature team</i> is a long-lived, cross-functional, cross-component team that completes many end-to-end customer features, one by one. It is opposed to the traditional approach of the <i>Component Team</i>, where a team is specialized on an individual software component and maintains it over several projects at the same time.
<br>
The Feature team approach seeks to avoid the bottlenecks usually appearing with Component Teams.
</p>
</td>
</tr>
</table>
<a name="Fa"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Fa.png"
alt="Fa : Fail Fast"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Fa : Fail Fast
</b><br>
Fail fast means getting out of planning mode and into testing mode, eventually for every critical component of your model of change. Customer development is the process that embodies this principle and helps you determine which hypotheses to start with and which are the most critical for your new idea.
<br>
An important goal of the philosophy is to cut losses when testing reveals something isn't working and quickly try something else, a concept known as pivoting.
</p>
<p>
Depends on
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Mv"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Mv.png"
alt="Mv : MVP"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Mv : MVP
</b><br>
In product development, the minimum viable product (MVP) is a product with just enough features to satisfy early customers, and to provide feedback for future development.
</p>
<p>
Depends on
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Gb"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Gb.png"
alt="Gb : Get Out of the building"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Gb : Get Out of the building
</b><br>
If you are pre-Product/Market Fit and you aren't actually "Getting out of the Building" (actually talking to your customers), you aren't doing Customer Development, and your startup isn't a Lean Startup.
<br>
Again: If you aren't actually talking to your customers, you aren't doing Customer Development.
</p>
</td>
</tr>
</table>
<a name="Pt"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Pt.png"
alt="Pt : Pizza Teams"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Pt : Pizza Teams
</b><br>
The idea of a "two pizza team" was coined by Jeff Bezo, founder of Amazon.com. If you can't feed a team with two pizzas, it's too large. That limits a task force to five to seven people, depending on their appetites."
<br>
The underlying idea is that as a team's size grows, the amount of one-on-one communication channels tend to explode.
<br>
Beyond ten, communication loses efficiency, cohesion diminishes, parasitism behaviors and power struggles appear, and the performance of the team decreases very rapidly with the number of members.
</p>
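<p>
To make the arithmetic concrete: a team of n members has n(n-1)/2 one-on-one communication channels, so a team of 5 has 10 channels, a team of 10 already has 45, and a team of 15 has 105.
</p>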
</td>
</tr>
</table>
<a name="As"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/As.png"
alt="As : Actionable Metrics"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
As : Actionable Metrics
</b><br>
The only metrics that entrepreneurs should invest energy in collecting are those that help them make decisions. Actionable Metrics are opposed to Vanity Metrics.
<br>
This refines another fundamental Lean Startup practice, the "Obsession of Measure", which states that everything should be measured and that no decision should be taken in the company unless it is supported by a Key Process Indicator or a Key Risk Indicator.
</p>
<p>
Depends on
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Bb"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Bb.png"
alt="Bb : Build vs. Buy"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Bb : Build vs. Buy
</b><br>
This is a fundamental principle of the Lean Startup and the web giants: favor as much as possible building your own software and your own features instead of buying third-party software or libraries.
<br>
When initiating a startup, having to pay fees to third-party corporations before reaching sustainable growth is suicidal.
</p>
<p>
Depends on
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Ab"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ab.png"
alt="Ab : A/B Testing"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ab : A/B Testing
</b><br>
In marketing and business intelligence, A/B testing is a term for a controlled experiment with two variants, A and B. It can be considered as a form of statistical hypothesis testing with two variants leading to the technical term, two-sample hypothesis testing, used in the field of statistics.
</p>
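<p>
A minimal sketch of evaluating such an experiment with a two-sample test; the conversion counts below are invented, and scipy is assumed to be available (Python used purely for illustration):
</p>
<pre>
# Compare conversion counts of two variants with a chi-squared test.
# The counts are invented; scipy is assumed to be available.

from scipy.stats import chi2_contingency

#            converted, not converted
observed = [[120, 880],    # variant A: 1000 visitors, 12.0% conversion
            [152, 848]]    # variant B: 1000 visitors, 15.2% conversion

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value:.4f}")

alpha = 0.05
if p_value > alpha:
    print("no statistically significant difference yet: keep collecting")
else:
    print("difference unlikely to be chance: prefer variant B")
</pre>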
<p>
Depends on
<a href="#Pm"><img src="https://www.niceideas.ch/agile_table/Pm.png" alt="Pm" class="pte-image-small" /></a>,
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<h2>Kanban</h2>
<a name="Ko"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Ko.png"
alt="Ko : Kanban Board"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Ko : Kanban Board
</b><br>
A Kanban board is a work and workflow visualization tool that enables you to optimize the flow of your work. Physical Kanban boards typically use sticky notes on a whiteboard to communicate status, progress, and issues.
<br>
An agile corporation should use a Kanban board to monitor all its processes.
<br>
A development team will typically use a Kanban board to monitor the Sprint backlog completion during a sprint.
</p>
</td>
</tr>
</table>
<h2>Kaizen </h2>
<a name="Kb"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Kb.png"
alt="Kb : Kaizen Burst"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Kb : Kaizen Burst
</b><br>
The Kaizen burst is a specific Kaizen process integrated into the development rituals. In Agile Software Development, it is typically integrated in the Sprint Retrospective. The idea is to identify in a visual way (with a post-it on a board for instance) the weaknesses or problems in the development practices or processes. These boxes are called Kaizen bursts.
<br>
These boxes are annotated as actions are taken towards improvement and eventually removed when the weakness has been addressed or the problem solved.
</p>
<p>
Depends on
<a href="#So"><img src="https://www.niceideas.ch/agile_table/So.png" alt="So" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<a name="Wh"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Wh.png"
alt="Wh : 5 Why"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Wh : 5 Why
</b><br>
5 Whys is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem.
<br>
The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question "Why?" Each answer forms the basis of the next question. The "5" in the name derives from an anecdotal observation on the number of iterations needed to resolve the problem.
</p>
<p>
Depends on
<a href="#So"><img src="https://www.niceideas.ch/agile_table/So.png" alt="So" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<h2>FDD (Feature Driven Development)</h2>
<a name="Si"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Si.png"
alt="Si : SOLID principles"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Si : SOLID principles
</b><br>
In computer programming, the term SOLID is a mnemonic acronym for five design principles intended to make software designs more understandable, flexible and maintainable. The principles are a subset of many principles promoted by Robert C. Martin.
<br>
Though they apply to any object-oriented design, the SOLID principles can also form a core philosophy for methodologies such as agile development or Adaptive Software Development.
<br>
The 5 principles are as follows:
<br>
<b>SRP</b> : Single responsibility principle - a class should have only a single responsibility (i.e. only one potential change in the software's specification should be able to affect the specification of the class)
<br>
<b>OCP</b> : Open/closed principle - "software entities ... should be open for extension, but closed for modification."
<br>
<b>LSP</b> : Liskov substitution principle - "objects in a program should be replaceable with instances of their subtypes without altering the correctness of that program."
<br>
<b>ISP</b> : Interface segregation principle - "many client-specific interfaces are better than one general-purpose interface."
<br>
<b>DIP</b> : Dependency inversion principle - one should "depend upon abstractions, not concretions."
</p>
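<p>
As a minimal, hypothetical sketch of one of these principles, the snippet below illustrates dependency inversion in Python (names invented for the example):
</p>
<pre>
# Dependency inversion: the high-level report code depends on an
# abstraction, never on a concrete storage class. All names are
# illustrative.

from abc import ABC, abstractmethod

class Storage(ABC):               # the abstraction both sides depend on
    @abstractmethod
    def save(self, name, data): ...

class FileStorage(Storage):       # low-level detail, swappable
    def save(self, name, data):
        with open(name, "w") as f:
            f.write(data)

class InMemoryStorage(Storage):   # another detail, handy for tests
    def __init__(self):
        self.items = {}
    def save(self, name, data):
        self.items[name] = data

def publish_report(text, storage):
    # High-level policy: knows nothing about files or dictionaries.
    storage.save("report.txt", text)

store = InMemoryStorage()
publish_report("quarterly numbers", store)
print(store.items)   # {'report.txt': 'quarterly numbers'}
</pre>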
<p>
Depends on
<a href="#Am"><img src="https://www.niceideas.ch/agile_table/Am.png" alt="Am" class="pte-image-small" /></a>
</p>
</td>
</tr>
</table>
<h2>DAD</h2>
<a name="Pm"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Pm.png"
alt="Pm : Product Management Committee"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Pm : Product Management Committee
</b><br>
The Product Management Committee is both a team and a ritual that enforces a smart approach to product management.
<br>
Product Management consists in identifying and evolving your organization’s business vision; identifying and prioritizing potential products/solutions to support that vision; identifying, prioritizing, and allocating features to products under development; managing functional dependencies between products; and marketing those products to their potential customers.
<br>
The Product Management Committee is the weekly (or bi-weekly) ritual enforcing and supporting this process, with the required roles attending the committee. It is led by the Product Owner, who acts more as a facilitator and arbitrator than as a formal decision-maker. The Product Owner represents the PMC to the development team.
</p>
</td>
</tr>
</table>
<a name="Am"></a>
<table class="pte noborder">
<tr>
<td>
<img src="https://www.niceideas.ch/agile_table/Am.png"
alt="Am : Architecture Committee"
class="pte-image-big" />
</td>
<td class="desc_right">
<p>
<b>
Am : Architecture Committee
</b><br>
The Architecture Committee is responsible for analyzing user stories and defining Development Tasks. Every story should be specified, designed and discussed. Screen mockups, if applicable, should be drawn, acceptance criteria agreed, etc.
<br>
Since the Architecture Committee is also responsible for estimating Stories, it's important that representatives of the Development Team, not only the Tech Leads and the Architects but also regular developers, take part in it.
</p>
</td>
</tr>
</table>
<br>
<p>
(This article is available as a PDF document here <a href="https://www.niceideas.ch/Agile_table.pdf">https://www.niceideas.ch/Agile_table.pdf</a> and a slideshare presentation there <a href="https://www.slideshare.net/JrmeKehrli/periodic-table-of-agile-principles-and-practices">https://www.slideshare.net/JrmeKehrli/periodic-table-of-agile-principles-and-practices</a>)
</p>
https://www.niceideas.ch/roller2/badtrash/entry/agile-planning-tools-and-processes
Agile Planning : tools and processes
Jerome Kehrli
2017-06-14T14:42:48-04:00
2017-09-13T16:48:48-04:00
<!-- Agile Planning : tools and processes -->
<p>
All the work on Agility in the Software Engineering Business in the past 20 years, initiated by Kent Beck, Ward Cunningham and Ron Jeffries, comes from the finding that traditional engineering methodologies apply poorly to the Software Engineering business.
</p>
<p>
If you think about it, we have been building bridges since the early stages of the Roman Empire, more than two thousand years ago. We have been building heavy mechanical machinery for almost three hundred years. But we have really been writing software for only fifty years.
<br>
In addition, designing a bridge or a mechanical machine is a lot more concrete than designing a Software. When an engineering team has to work on the very initial stage of the design of a bridge or mechanical machine, everyone in the team can picture the result in his mind in a few minutes and breaking it down to a set of single Components can be done almost visually in one's mind.
</p>
<p>
A piece of software, on the other hand, is a lot more abstract. This has the consequence that a software product is <a href="https://www.niceideas.ch/roller2/badtrash/entry/funny-developer-tale">much harder to describe than any other engineering product, which leads to many levels of misunderstanding</a>.
</p>
<p>
The <a href="https://en.wikipedia.org/wiki/Waterfall_model">waterfall model</a> of Project Management in Software Engineering really originates in the manufacturing and construction industries.
<br>
Unfortunately, for the reasons mentioned above, and despite being so widely used in the industry, it applies pretty poorly to the Software Engineering business. The <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-software-development-lessons-learned#sec11">most important problems</a> it suffers from are as follows:
</p>
<ul>
<li>
<b>Incomplete or moving specification:</b> due to the abstract nature of software, it's impossible for business experts and business analysts to get it right the first time.
</li>
<li>
<b>The tunnel effect:</b> we live in a very fast evolving world and businesses need to adapt all the time. The software delivered after 2 years of heavy development will fulfill (hardly, but let's admit it) the requirements that were true two years ago, not those of today.
</li>
<li>
<b>Drop of Quality to meet deadlines:</b> An engineering project is always late, always. Things are just a lot worse with software.
</li>
<li>
<b>Heightened tensions between teams:</b> The misunderstanding between teams leads to tensions, and most of the time it turns pretty ugly pretty quickly.
</li>
</ul>
<p>
So again, some 20 years ago, Beck, Cunningham and Jeffries started to formalize some of the practices they were successfully using to address the uncertainties, the overwhelming abstraction and the misunderstandings inherent to software development. They formalized it as the <a href="https://en.wikipedia.org/wiki/Extreme_programming">eXtreme Programming</a> methodology.
</p>
<p>
A few years later, the same guys, with some other pretty well known Software Engineers, such as Alistair Cockburn and Martin Fowler, gathered together in a resort in Utah and wrote the <a href="https://en.wikipedia.org/wiki/Agile_software_development#The_Agile_Manifesto">Manifesto for Agile Software Development</a> in which they shared the essential principles and practices they were successfully using to address problems with more traditional and heavyweight software development methodologies.
</p>
<p>
Today, <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-landscape">Agility is a lot of things</a> and the set of principles and practices in the whole Agile family is very large. Unfortunately, most of them require a lot of experience to be understood and then applied successfully within an organization.
</p>
<p>
Unfortunately, the complexity of embracing a sound Agile Software Development Methodology, and the level of maturity a team needs to benefit from its advantages, are completely underestimated.
<br>
I cannot remember the number of times I heard a team claiming it was an Agile team because it was doing a stand-up in the morning and deployed Jenkins to run the unit tests at every commit. But honestly, I cannot blame them. It is actually difficult to understand Agile Principles and Practices when one has never suffered from the very drawbacks and problems they are addressing.
</p>
<p>
I myself am not an agilist. Agility is not a passion, neither something that thrills me nor something that I love studying in my free time. Agility is to me simply a necessity. I discovered and applied Agile Principles and practices out of necessity and urgency, to address specific issues and problems I was facing with the way my teams were developing software.
</p>
<p>
The latest problem I focused on was Planning. Waterfall and RUP focus a lot on planning and are often said to be superior to Agile methods when it comes to forecasting and planning.
<br>
I believe that this is true when Agility is embraced only incompletely. As a matter of fact, I believe that Agility leads to much better and much more reliable forecasts than traditional methods mostly because:
</p>
<ul>
<li>With Agility, it becomes easy to update and adapt Planning and forecasts to always match the evolving reality and the changes in direction and priority.</li>
<li>When embracing agility as a whole, the tools put in the hands of Managers and Executives are first much simpler and second more accurate than traditional planning tools.</li>
</ul>
<p>
In this article, I intend to present the fundamentals, the roles, the processes, the rituals and the values that I believe a team would need to embrace to achieve success down the line in Agile Software Development Management - Product Management, Team Management and Project Management - with the ultimate goal of making planning and forecasting as simple and efficient as it can be.
<br>
All of this is a reflection of the tools, principles and practices we have embraced or are introducing in my current company.
</p>
<!-- Agile Planning : tools and processes -->
<p>
All the work on Agility in the Software Engineering Business in the past 20 years, initiated by Kent Beck, Ward Cunningham and Ron Jeffries, comes from the finding that traditional engineering methodologies apply only poorly to the Software Engineering business.
</p>
<p>
If you think about it, we are building bridges from the early stages of the Roman Empire, three thousand years ago. We are building heavy mechanical machinery for almost three hundred years. But we are really writing software for only fifty years.
<br>
In addition, designing a bridge or a mechanical machine is a lot more concrete than designing a Software. When an engineering team has to work on the very initial stage of the design of a bridge or mechanical machine, everyone in the team can picture the result in his mind in a few minutes and breaking it down to a set of single Components can be done almost visually in one's mind.
</p>
<p>
A software, on the other hand, is a lot more abstract. This has the consequence that a software is <a href="https://www.niceideas.ch/roller2/badtrash/entry/funny-developer-tale">much harder to describe than any other engineering product which leads to many levels of misunderstanding</a>.
</p>
<p>
The <a href="https://en.wikipedia.org/wiki/Waterfall_model">waterfall model</a> of Project Management in Software Engineering really originates in the manufacturing and construction industries.
<br>
Unfortunately, for the reasons mentionned above, despite being so widely used in the industry, it applies only pretty poorly to the Software Engineering business. <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-software-development-lessons-learned#sec11">Most important problems</a> it suffers from are as follows:
</p>
<ul>
<li>
<b>Incomplete or moving specification:</b> due to the abstract nature of software, it's impossible for business experts and business analysts to get it right the first time.
</li>
<li>
<b>The tunnel effect:</b> we live in a very fast evolving world and businesses need to adapt all the time. The software delivered after 2 years of heavy development will fulfill (hardly, but let's admit it) the requirements that were true two years ago, not anymore today.
</li>
<li>
<b>Drop of Quality to meet deadlines:</b> An engineering project is always late, always. Things are just a lot worst with software.
</li>
<li>
<b>Heightened tensions between teams:</b> The misunderstanding between teams leads to tensions, and it most of the time turns pretty ugly pretty quick.
</li>
</ul>
<p>
So again, some 20 years ago, Beck, Cunningham and Jeffries started to formalize some of the practices they were successfully using to address the uncertainties, the overwhelming abstraction and the misunderstandings inherent to software development. They formalized it as the <a href="https://en.wikipedia.org/wiki/Extreme_programming">eXtreme Programming</a> methodology.
</p>
<p>
A few years later, the same guys, with some other pretty well known Software Engineers, such as Alistair Cockburn and Martin Fowler, gathered together in a resort in Utah and wrote the <a href="https://en.wikipedia.org/wiki/Agile_software_development#The_Agile_Manifesto">Manifesto for Agile Software Development</a> in which they shared the essential principles and practices they were successfully using to address problems with more traditional and heavyweight software development methodologies.
</p>
<p>
Today, <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-landscape">Agility is a lot of things</a> and the set of principles and practices in the whole Agile family is very large. Unfortunately, most of them require a lot of experience to be understood and then applied successfully within an organization.
</p>
<p>
Unfortunately, the complexity of embracing a sound Agile Software Development Methodology, and the level of maturity a team needs in order to benefit from its advantages, are completely underestimated.
<br>
I cannot remember the number of times I heard a team claiming it was an Agile team because it was doing a stand-up in the morning and had deployed Jenkins to run the unit tests at every commit. But honestly, I cannot blame them. It is actually difficult to understand Agile principles and practices when one has never suffered from the very drawbacks and problems they address.
</p>
<p>
I am not an agilist myself. Agility is not a passion; it is neither something that thrills me nor something that I love studying in my free time. Agility is, to me, simply a necessity. I discovered and applied Agile principles and practices out of necessity and urgency, to address specific issues and problems I was facing with the way my teams were developing software.
</p>
<p>
The latest problem I focused on was Planning. Waterfall and RUP focus a lot on planning and are often claimed to be superior to Agile methods when it comes to forecasting and planning.
<br>
I believe that this is true only when Agility is embraced incompletely. As a matter of fact, I believe that Agility leads to much better and much more reliable forecasts than traditional methods, mostly because:
</p>
<ul>
<li>With Agility, it becomes easy to update and adapt planning and forecasts to always match the evolving reality and the changes in direction and priority.</li>
<li>When embracing agility as a whole, the tools put in the hands of Managers and Executives are first much simpler and second much more accurate than traditional planning tools.</li>
</ul>
<p>
In this article, I intend to present the fundamentals, the roles, the processes, the rituals and the values that I believe a team needs to embrace to achieve success down the line in Agile Software Development Management - Product Management, Team Management and Project Management - with the ultimate goal of making planning and forecasting as simple and efficient as they can be.
<br>
All of this is a reflection of the tools, principles and practices we have embraced or are introducing in my current company.
</p>
<p>
This article is available as a slideshare presentation here : <a href="https://www.slideshare.net/JrmeKehrli/agility-and-planning-tools-and-processes">https://www.slideshare.net/JrmeKehrli/agility-and-planning-tools-and-processes</a>.
</p>
<p>
Also, you can read a PDF version of this article here : <a href="https://www.niceideas.ch/Agile_Planning.pdf">https://www.niceideas.ch/Agile_Planning.pdf</a>.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a></li>
<li><a href="#sec2">2. The Fundamentals</a>
<ul>
<li><a href="#sec21">2.1 eXtreme Programming</a></li>
<li><a href="#sec22">2.2 Scrum</a></li>
<li><a href="#sec23">2.3 DevOps</a></li>
<li><a href="#sec24">2.4 Lean Startup</a></li>
<li><a href="#sec25">2.5 Visual Management and Kanban</a>
<ul>
<li><a href="#sec251">2.5.1 Story Map</a></li>
<li><a href="#sec252">2.5.2 Product Backlog</a></li>
<li><a href="#sec253">2.5.3 Kanban Board</a></li>
<li><a href="#sec254">2.5.4 User Stories</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#sec3">3. Principles</a>
<ul>
<li><a href="#sec31">3.1 The Tools</a></li>
<li><a href="#sec32">3.2 The Organization</a>
<ul>
<li><a href="#sec321">3.2.1 Requires roles</a></li>
<li><a href="#sec322">3.2.2 Required committees and teams</a></li>
</ul>
</li>
<li><a href="#sec33">3.3 The Processes</a>
<ul>
<li><a href="#sec331">3.3.1 Design Process</a></li>
<li><a href="#sec332">3.3.2 Estimation Process</a></li>
<li><a href="#sec333">3.3.3 Product Kanban Board Maintenance Process</a></li>
<li><a href="#sec334">3.3.4 Story Map and Backlog synchronization Process</a></li>
<li><a href="#sec335">3.3.5 Forecasting</a></li>
<li><a href="#sec336">3.3.6 Development Process: Scrum</a></li>
</ul>
</li>
<li><a href="#sec34">3.4 The Rituals</a>
<ul>
<li><a href="#sec341">3.4.1 Product Management Committee</a></li>
<li><a href="#sec342">3.4.2 Architecture Committee</a></li>
<li><a href="#sec343">3.4.3 Sprint Management Committee</a></li>
<li><a href="#sec344">3.4.4 Development Team - Daily Scrum</a></li>
</ul>
</li>
<li><a href="#sec35">3.5 The Values</a></li>
</ul>
</li>
<li><a href="#sec4">4. Overview of the whole process</a></li>
<li><a href="#sec5">5. Return on Practices</a></li>
<li><a href="#sec5">6. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction </h2>
<p>
As stated in my abstract above, embracing sound Agile principles and applying relevant Agile practices is anything but easy.
<br>
First, out of all the Agile methods available and described, and the overwhelming set of practices and principles, an organization needs to understand which ones make sense for it. Adopting a method, a set of principles or practices blindly, because a paper said it was good, or because the Scrum Master believes it is <i>state of the art</i>, makes little sense.
<br>
The set of methods described nowadays is pretty huge and, unfortunately, each and every one of these practices only makes sense when a team, an organization or a whole corporation suffers from a drawback or an issue it addresses, or can genuinely benefit from its advantages.
</p>
<p>
The whole set of Agile methods along with their principles and practices is brilliantly represented by Chris Webb in the following infographic:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d028b3c0-5332-4ae7-a91a-dc21b7551299">
<img class="centered" style="width: 600px;" alt="The Agile Lansdcape from Deloitte" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d028b3c0-5332-4ae7-a91a-dc21b7551299" />
</a>
<div class="centered">
[Click to enlarge]<br>
(Source : Christopher Webb - LAST Conference 2016 Agile Landscape - <a href="https://www.slideshare.net/ChrisWebb6/last-conference-2016-agile-landscape-presentation-v1">https://www.slideshare.net/ChrisWebb6/last-conference-2016-agile-landscape-presentation-v1</a>)
</div>
</div>
<br>
<p>
Junior teams should go with a <i>base method</i> that makes sense to them, such as <a href="https://en.wikipedia.org/wiki/Scrum_(software_development)">Scrum</a> or <a href="https://en.wikipedia.org/wiki/Kanban_(development)">Kanban</a>, <b>while remembering that none of it makes sense without strict respect for the whole set of <a href="https://en.wikipedia.org/wiki/Extreme_programming">XP principles and practices</a></b>.
</p>
<p>
More experienced teams will likely come up with their own methodology, cleverly built from the principles and practices of several underlying methods.
</p>
<p>
Again, in my opinion <b>XP is the most fundamental building block on which all the rest is built</b>, not a method among others.
<br>
I often read papers online presenting XP as one Agile Software Development method among others. My point of view is very different. I strongly believe - and experience every day - that XP provides the fundamental principles and practices on which all the other methods are built.
<br>
Without a thorough adoption of XP principles and practices, one cannot benefit from the full advantages of Agility. In addition, some principles and practices proposed by other methods, such as DevOps, build on XP principles and practices but never void them.
</p>
<p>
When explaining this, I like to bring up this schema I drew a few years ago when I was doing consulting missions around Agility and Digital Transformation:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/61280dad-cbae-4bfe-bcaa-a75389e3b1e5">
<img class="centered" style="width: 550px; " alt="Pyramid of Agile Practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/61280dad-cbae-4bfe-bcaa-a75389e3b1e5" />
</a>
</div>
<br>
<p>
This reads as follows:
</p>
<ul>
<li>Without a proper understanding and adoption of eXtreme Programming values, principles and practices, moving towards <i>Agile Software Development</i> will be difficult.</li>
<li>Without Agility throughout the IT processes, both on the development side (Agile) and on the Production side (DevOps), embracing Lean Startup practices and raising Agility above the IT Department will be difficult.</li>
<li>Without a sound understanding of the Lean Startup Philosophy and practices and a company-wide Agile process (such as a company wide Kanban), transforming the company to an Agile Corporation will be difficult. </li>
<li>Finally, only Agile Corporations can really imagine successfully achieving a Digital Transformation.</li>
</ul>
<p>
But then again, referring to Chris Webb's Agile Landscape, picking the practices that make sense and add value in its own context is the choice of every organization. Every mature agile organization will use a slightly different set of practices from every other.
</p>
<p>
I will now be presenting the fundamental set of practices I deem important when it comes to successfully embracing Agile Planning and Agile Software Development.
</p>
<a name="sec2"></a>
<h2>2. The Fundamentals </h2>
<p>
The set of practices I deem essential to embrace <i>Agile Planning</i> comes from the following methods: XP, Scrum, Kanban, DevOps, Lean Startup and a lot of Visual Management tricks.
</p>
<a name="sec21"></a>
<h3>2.1 eXtreme Programming </h3>
<p>
e<b>X</b>treme <b>P</b>rogramming (XP) is the most fundamental software development method of the Agile tree. It focuses on the implementation of an application without neglecting the project management aspect. XP is suitable for small teams with changing needs, and it pushes simple principles to extreme levels.
</p>
<p>
The <i>eXtreme Programming</i> method was invented by Kent Beck, Ward Cunningham, Ron Jeffries and Palleja Xavier during their work on the <i>C3</i> project, the compensation calculation project at Chrysler.
<br>
Kent Beck, project manager from March 1996, began to refine the development method used on the project. The method was officially born in October 1999 with Kent Beck's <i>Extreme Programming Explained</i> book.
</p>
<p>
In the book Extreme Programming Explained, the method is defined as:
</p>
<ul>
<li>An attempt to reconcile the human with productivity;</li>
<li>A mechanism to facilitate social change;</li>
<li>A way of improvement;</li>
<li>A style of development;</li>
<li>A discipline in the development of computer applications.</li>
</ul>
<p>
Its main goal is to reduce the costs of change. In traditional methods, needs are defined and often fixed at the start of the IT project, which increases the subsequent costs of modifications. XP is committed to making the project more flexible and open to change by introducing core <b>values</b>, <b>principles</b> and <b>practices</b>:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/660a5518-bd24-4f68-b50d-2efc2790f248">
<img class="centered" style="width: 700px; " alt="extreme Programming in a Nutshell" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/660a5518-bd24-4f68-b50d-2efc2790f248" />
</a>
</div>
<br>
<p>
The principles of this method are not new: they have existed in the software industry for decades and in management methods for even longer. The originality of the method is to push them to the extreme:
</p>
<ul>
<li>Since code review is a good practice, it will be done permanently (by a pair of developers: pair programming);</li>
<li>Since the tests are useful, they will be done systematically before each implementation;</li>
<li>Since the design is important, it will be done throughout the project (refactoring);</li>
<li>Since simplicity makes it possible to advance faster, we will always choose the simplest solution;</li>
<li>Since understanding is important, we will define and evolve metaphors together;</li>
<li>Since the integration of the modifications is crucial, we will do it several times a day;</li>
<li>Since needs evolve rapidly, we will make development cycles very short to adapt to change.</li>
</ul>
<p>
The practices listed by the eXtreme Programming method form the fundamental Software Engineering practices of Agility.
<br>
Interestingly, one cannot pick a subset of these practices and expect it to work. Kent Beck uses the following schematic to illustrate how these practices work together and depend on each other:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ffb1d3a9-7fef-42f5-9a73-f51d6f97d2f8">
<img class="centered" style="width: 600px; " alt="XP Practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ffb1d3a9-7fef-42f5-9a73-f51d6f97d2f8" />
</a>
</div>
<br>
<p>
All of this makes a lot of sense if you think about it: refactoring without TDD would be suicidal, and so would Continuous Integration without TDD; testing without Simple Design is complicated; Simple Design is enforced by TDD; etc.
</p>
<a name="sec22"></a>
<h3>2.2 Scrum </h3>
<p>
Scrum is a schematic organization of complex product development. It is defined by its creators as <i>"a framework within which people can address complex adaptive problems, while productively and creatively delivering products of the highest possible value"</i>.
</p>
<p>
This organizational scheme is based on the division of a project into time boxes, called <i>"sprints"</i>. A sprint can last between a few days and a month (with a preference for two weeks).
<br>
Each sprint starts with estimation followed by operational planning. The sprint ends with a demonstration of what has been completed.
<br>
Before starting a new sprint, the team holds a retrospective: it analyzes how the completed sprint went, in order to improve its practices (Continuous Improvement / Kaizen).
<br>
The workflow of the development team is facilitated by its self-organization: there should be no formal Project Manager but instead a Team Leader, with a coaching role more than a management role.
</p>
<p>
The Scrum process can be represented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/0d0237c0-20c6-47d2-8114-b85066c6e789">
<img class="centered" style="width: 700px;" alt="Overview of Scrum process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/0d0237c0-20c6-47d2-8114-b85066c6e789" />
</a>
</div>
<br>
<p>
Some more information about scrum is available in <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-software-development-lessons-learned#sec13">a previous article here</a>.
</p>
<p>
<b>Working with Story Points</b>
</p>
<p>
In waterfall, managers determine a team member's workload capacity in terms of time. Managers ask selected developers to estimate how long they anticipate certain tasks will take and then assign work based on that team member's total available time. In waterfall, tests are done after coding, by dedicated roles, rather than written in conjunction with the code.
<br>
The downsides of waterfall are well known: work is always late, there are always quality problems, some people are always waiting for other people, and there's always a last-minute crunch to meet the deadline.
</p>
<p>
Scrum teams take a radically different approach.
</p>
<ul>
<li>First of all, entire Scrum teams, rather than individuals, take on the work. The whole team is responsible for each Product Backlog Item. The whole team is responsible for a tested product. There's no "my work" vs. "your work." So we focus on collective effort per Product Backlog Item rather than individual effort per task. </li>
<li>Second, Scrum teams prefer to compare items to each other, or estimate them in relative units rather than absolute time units. <b>Ultimately this produces better forecasts.</b> </li>
<li>Thirdly, Scrum teams break customer-visible requirements into the smallest possible stories, reducing risk dramatically. When there's too much work for 7 people, we organize into feature teams to eliminate dependencies.</li>
</ul>
<p>
<b>Planning poker</b>, also called Scrum poker, is a consensus-based, gamified technique for estimating, mostly used to estimate effort or relative size of development goals in software development.
<br>
In planning poker, members of the group make estimates by playing numbered cards face-down to the table, instead of speaking them aloud. The cards are revealed, and the estimates are then discussed. By hiding the figures in this way, the group can avoid the cognitive bias of anchoring, where the first number spoken aloud sets a precedent for subsequent estimates.
</p>
<p>
The cards in the deck have numbers on them. A typical deck has cards showing the Fibonacci sequence including a zero: 0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89; other decks use similar progressions.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/54d9699f-11cf-40ee-bc89-f7ceeab60db2">
<img class="centered" style="width: 200px; " alt="Planning Poker" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/54d9699f-11cf-40ee-bc89-f7ceeab60db2" />
</a>
</div>
<br>
<p>
The reason to use planning poker is to avoid the influence of the other participants. If a number is spoken aloud, it can sound like a suggestion and influence the other participants' sizing. Planning poker forces people to think independently and propose their numbers simultaneously: all team members disclose their cards at once, inspiring the term "<i>planning poker</i>".
</p>
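<p>
To make the mechanics concrete, here is a minimal sketch of one reveal step, in Python, with hypothetical names and a common house rule for near-consensus: estimates are committed independently, revealed at once, and a wide spread sends the group back to discussion.
</p>
<pre>
DECK = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]   # Fibonacci-like deck

def reveal(votes):
    """votes: participant name -> card from DECK, chosen independently.
    Returns the agreed estimate, or None when the group must discuss
    (typically the lowest and highest estimators explain their view)
    and then play another round."""
    cards = set(votes.values())
    if len(cards) == 1:                        # unanimous: done
        return cards.pop()
    lo, hi = min(cards), max(cards)
    if DECK.index(hi) - DECK.index(lo) == 1:   # adjacent cards: house rule,
        return hi                              # keep the more pessimistic one
    return None                                # too much spread: discuss, replay

print(reveal({"Ana": 5, "Ben": 8, "Carl": 5}))   # -> 8
print(reveal({"Ana": 3, "Ben": 13, "Carl": 5}))  # -> None: discuss and re-vote
</pre>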
<p>
In Scrum these numbers are called <b>Story Points</b> - or <b>SP</b>.
</p>
<a name="sec23"></a>
<h3>2.3 DevOps </h3>
<p>
DevOps is a methodology capturing the practices adopted from the very start by the web giants, who had a unique opportunity as well as a strong requirement to invent new ways of working, due to the very nature of their business: the need to evolve their systems at an unprecedented pace and to extend them - and their business - sometimes on a daily basis.
</p>
<p>
DevOps is not a question of tools, nor of mastering Chef or Docker. DevOps is a methodology, a set of principles and practices that help both developers and operators reach their goals while maximizing value delivery to the customers or the users as well as the quality of these deliverables.
</p>
<p>
The problem comes from the fact that developers and operators - while both required by corporations with large IT departments - have very different objectives.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/472d552f-71ea-4640-a815-026e18cd865e">
<img class="centered" style="width: 450px; " alt="Wall of Confusion" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/472d552f-71ea-4640-a815-026e18cd865e" />
</a>
</div>
<br>
<p>
DevOps consists mostly in extending agile development practices by further streamlining the movement of software changes through the build, validate, deploy and delivery stages, while empowering cross-functional teams with full ownership of software applications - from design through production support.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/77393bca-f284-443d-a48d-b1fadbc97789">
<img class="centered" style="width: 500px; " alt="DevOps" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/77393bca-f284-443d-a48d-b1fadbc97789" />
</a>
</div>
<br>
<p>
DevOps encourages <b>communication</b>, <b>collaboration</b>, <b>integration</b> and <b>automation</b> among software developers and IT operators in order to improve both the speed and quality of delivering software.
</p>
<p>
DevOps teams focus on standardizing development environments and automating delivery processes to improve delivery predictability, efficiency, security and maintainability. The DevOps ideals provide developers with more control over the production environment and a better understanding of the production infrastructure.
<br>
DevOps encourages empowering teams with the autonomy to build, validate, deliver and support their own applications.
</p>
<p>
<b>So what are the core principles ?</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f37a1f42-86dd-4fef-a13e-e6b043c9c478">
<img class="centered" style="width: 450px; " alt="DevOps principles and practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f37a1f42-86dd-4fef-a13e-e6b043c9c478" />
</a>
</div>
<br>
<p>
I presented these principles and practices in detail in <a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained">my previous article dedicated to DevOps</a>.
</p>
<p>
DevOps is a revolution that aims at addressing the wall of confusion between development teams and operation teams in big corporations with large IT departments, where these roles are traditionally well separated and isolated.
</p>
<a name="sec24"></a>
<h3>2.4 Lean Startup </h3>
<p>
Some years ago, Eric Ries, Steve Blank and others initiated <i>The Lean Startup</i> movement. The Lean Startup is a movement, an inspiration, a set of principles and practices that any entrepreneur initiating a startup would be well advised to follow.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/8b002746-dd21-4668-970e-dafcaa864567">
<img class="centered" style="width: 500px; " alt="Lean Startup Movement" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/8b002746-dd21-4668-970e-dafcaa864567" />
</a>
</div>
<br>
<p>
In my opinion, the most fundamental aspect of Lean Startup is the <i>Build-Measure-Learn</i> loop.
<br>
The fundamental activity of a startup is to turn ideas into products, measure how customers respond, and then learn whether to pivot or persevere. All successful startup processes should be geared to accelerate that feedback loop.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4b4e8fe7-e841-46cc-996b-6b00df12b175">
<img class="centered" style="width: 650px; " alt="Build-Measure-Learn" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4b4e8fe7-e841-46cc-996b-6b00df12b175" />
</a>
</div>
<br>
<p>
The five-part version of the Build-Measure-Learn schema helps us see that the real intent of building is to test <i>"ideas"</i> - not just to build blindly without an objective.
<br>
The need for "data" indicates that after we measure our experiments we'll use the data to further refine our learning. And the new learning will influence our next ideas. So we can see that the goal of Build-Measure-Learn isn't just to build things, the goal is to build things to validate or invalidate the initial idea.
</p>
<p>
<b>The four steps to the Epiphany</b>
</p>
<p>
In short, Steve Blank proposes that companies need a <b>Customer Development process</b> that complements, or even in large part replaces, their <i>Product Development process</i>. The <i>Customer Development process</i> ties directly to the theory of <a href="https://en.wikipedia.org/wiki/Product/market_fit">Product/Market Fit</a>.
<br>
In "<i>The Four Steps to the Epiphany</i>", Steve Blank provides a roadmap for how to get to Product/Market Fit.
</p>
<p>
The four stages of the <i>Customer Development Model</i> are: customer discovery, customer validation, customer creation, and company creation.
</p>
<ol>
<li><b>Customer discovery</b>: understanding customer problems and needs</li>
<li><b>Customer validation</b>: developing a sales model that can be replicated</li>
<li><b>Customer creation / Get new Customers</b>: creating and driving end user demand</li>
<li><b>Customer building / Company Creation</b>: transitioning from learning to executing</li>
</ol>
<p>
We can represent them as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2f0ac2c3-13b4-4a24-a2fd-61ef32a66941">
<img class="centered" style="width: 750px; " alt="Lean Startup Practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2f0ac2c3-13b4-4a24-a2fd-61ef32a66941" />
</a>
</div>
<br>
<p>
I added to the schema above the most essential principles and practices introduced and discussed by <i>the Lean Startup</i> approach.
</p>
<p>
I discussed these principles and practices in length in <a href="https://www.niceideas.ch/roller2/badtrash/entry/lean-startup-a-focus-on">a previous article on this blog</a>.
</p>
<a name="sec25"></a>
<h3>2.5 Visual Management and Kanban </h3>
<p>
Visual Management is an English term that covers several <i>Lean Management</i> concepts centered on visual perception. The aim is to present information in its context so as to make the work and the decision-making obvious.
</p>
<p>
Visual Management is an answer to the well-known credo <i>"You can't manage what you can't see"</i>. It finds its roots in <i>Obeya War Rooms</i>:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/01215572-e424-4fdf-8715-5cd4af1997bf">
<img class="centered" style="width: 700px; " alt="Toyota Obeya war Room" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/01215572-e424-4fdf-8715-5cd4af1997bf" />
</a><br>
<div class="centered">
(Source : <a href="http://alexsibaja.blogspot.ch/2014/08/obeya-war-room-powerful-visual.html">http://alexsibaja.blogspot.ch/2014/08/obeya-war-room-powerful-visual.html</a>)
</div>
</div>
<br>
<p>
Obeya (Japanese for <i>"large room"</i> or <i>"war room"</i>) refers to a form of project management used in many Asian companies, starting with Toyota, and is a component of <i>lean manufacturing</i>, in particular the Toyota Production System.
<br>
During product and process development, all individuals involved in managerial planning meet in a <i>"great room"</i> to speed up communication and decision-making. This is intended to reduce <i>"departmental thinking"</i> and improve on methods like email and social networking. The Obeya can be understood as a team-spirit improvement tool at an administrative level.
</p>
<p>
Nowadays, Visual Management is very much linked to <i>Lean Management</i> and Lean Startup, but IMHO it's really a field of its own. In the field of <b>Agile Planning</b>, I believe that Visual Management with sound tools and approaches is not optional.
<br>
At the end of the day, as we will see, a good Project Management tool is a tool that enables anyone in the company to understand, within a few minutes, what is achievable in a given time or what time is required to deliver a given scope. And <b>nothing competes with visual tools</b> in this regard.
</p>
<p>
I will introduce here the fundamental tools I believe an Agile team should consider when it comes to Visual Management:
</p>
<a name="sec251"></a>
<h4>2.5.1 Story Map</h4>
<p>
The Story Map starts from the observation that arranging user stories into a helpful shape - a map - is usually the most appropriate way to organize them.
<br>
A Story Map is a visual management tool aimed at presenting the situation of the Software, or the features to be implemented, in a clear and graphical way. A Story Map is composed of user stories (see below).
</p>
<p>
A small story map may look something like this:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d68d069a-2c6b-4771-b0d1-1f26fc3ec528">
<img class="centered" style="width: 900px; " alt="Story map Principle" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d68d069a-2c6b-4771-b0d1-1f26fc3ec528" />
</a>
</div>
<br>
<p>
At the top of the map are "big stories." We call them <b>themes</b>. A theme is sort of a big thing that people do - something that has lots of steps, and doesn't always have a precise workflow. A theme is a big category containing actual user stories grouped in <b>Epics</b>.
</p>
<p>
Epics are big user stories such as the ones mentioned in the example above. They usually involve a lot of development and cannot be considered as-is in an actual product backlog. For this reason, Epics are split into a sub-set of <b>stories</b>, more precise and concrete, that are candidates to be put in an actual product backlog.
</p>
<p>
I presented more information on Story Maps in <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-software-development-lessons-learned#sec32">a previous article here</a>.
<br>
For the moment, let's just remember that there is an important notion of <b>priority</b> on the vertical scale: the lower a story, the lower its priority.
<br>
There is also a less obvious notion of priority horizontally: stories on the left should be implemented first since they have a greater value than the stories on the right - all of that, of course, while respecting the more important vertical priority.
<br>
Long story short: the development team needs to implement all the stories of a row, from left to right, before it can consider the stories of the next row.
</p>
<p>
A pretty good and straightforward example of a Story Map, related to an <i>email client application</i>:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d68d069a-2c6b-4771-b0d1-1f26fc3ec528">
<img class="centered" style="width: 750px; " alt="Story map Demo" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d68d069a-2c6b-4771-b0d1-1f26fc3ec528" />
</a>
</div>
<br>
<p>
And a real-world example built during an actual workshop:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/378db356-dcaf-44fb-b495-73571ecd8865">
<img class="centered" style="width: 750px; " alt="Story map Real World" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/378db356-dcaf-44fb-b495-73571ecd8865" />
</a><br>
<div class="centered">
(Copyright OCTO Technology / <i>Unfortunately I haven't been able to recover the source</i>)
</div>
</div>
<br>
<p>
A Story Map is usually a physical, visual tool, laid out on the wall of a shared meeting room or even of the development team's open space. Distributed teams may consider digital tools, but a physical, real and visual map on a wall is way better.
</p>
<a name="sec252"></a>
<h4>2.5.2 Product Backlog</h4>
<p>
The product backlog is the tool used by the Development team to track the tasks to be implemented. These development tasks should be linked to a User Story on the Story Map.
<br>
As such, the product backlog should be seen as a much more detailed and technical version of the Story Map.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/7ad71564-b3e2-4d28-b507-d2e2f8c624e8">
<img class="centered" style="width: 600px; " alt="Product Backlog" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/7ad71564-b3e2-4d28-b507-d2e2f8c624e8" />
</a>
</div>
<br>
<p>
The product backlog shows the same releases as the Story Map. The development tasks in the current sprints should have a more detailed form than the development tasks not yet analyzed during <i>Sprint Planning</i>.
<br>
In a general way, the product backlog should be kept synchronized with the Story Map, and the reverse is true as well. Every User Story on the Map is broken down into development tasks in the Product Backlog, and every task in the backlog should be attached to a User Story on the Map.
</p>
<p>
Their difference is as follows:
</p>
<ul>
<li><b>Story Map</b> : The Story Map is a management tool. It is a visual tool used by the <i>Product Management Team</i> to drive the high-level development of the product and to define releases and priorities. </li>
<li><b>Product Backlog</b> : The Product Backlog is a technical project management tool, not a visual management tool. It is usually supported by a digital tool (such as Jira or Redmine) and aims at organizing the development team's activities at a fine-grained level.</li>
</ul>
<p>
Some important constraints should be noted right away:
</p>
<ul>
<li>Each and every developer activity, no matter how quick and small, should be well identified by a development task in the product backlog.</li>
<li>Each and every development task should be linked to a User Story on the Story Map. I cannot stress enough how important this is (a naive consistency check is sketched right after this list).</li>
</ul>
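<p>
As a minimal illustration of these two constraints, here is a sketch in Python, with hypothetical data shapes (a real team would query Jira or Redmine instead), of the consistency check one could run over the backlog:
</p>
<pre>
# Hypothetical data shapes: a real team would pull these from its tool.
stories = [{"id": "US-12", "title": "Send an email"},
           {"id": "US-13", "title": "Search the inbox"}]
tasks = [{"id": "T-101", "story": "US-12", "title": "SMTP integration"},
         {"id": "T-102", "story": "US-12", "title": "Compose screen"},
         {"id": "T-103", "story": None,    "title": "Refactor logging"}]

story_ids = {s["id"] for s in stories}

# Constraint: every development task must be linked to a User Story.
orphan_tasks = [t["id"] for t in tasks if t["story"] not in story_ids]

# Reverse check: an analyzed story should be backed by at least one task.
stories_without_tasks = story_ids - {t["story"] for t in tasks}

print("Orphan tasks:", orphan_tasks)                    # ['T-103']
print("Stories without tasks:", stories_without_tasks)  # {'US-13'}
</pre>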
<a name="sec253"></a>
<h4>2.5.3 Kanban Board</h4>
<p>
Kanban is a model for introducing change through incremental improvements. One can apply Kanban principles to any process one is already running.
</p>
<p>
In Kanban, one organizes the work on a Kanban board. The board has states as columns, which every work item passes through - from left to right. One pulls work items along through the [<i>in progress</i>], [<i>testing</i>], [<i>ready for release</i>], and [<i>released</i>] columns (as examples). And one may have various swim lanes - horizontal "<i>pipelines</i>" - for different types of work.
<br>
The only management criterion introduced by Kanban is the so-called "<i>Work In Progress</i>" or WIP. By managing WIP, one can optimize the flow of work items, as sketched right after the board illustration below. Besides visualizing work on a Kanban board and monitoring WIP, nothing else needs to be changed to get started with Kanban.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b42428b2-749a-4efa-850f-b5736da52171">
<img class="centered" style="width: 600px;" alt="Kanban board" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b42428b2-749a-4efa-850f-b5736da52171" />
</a>
</div>
<br>
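<p>
A minimal sketch, in Python with hypothetical column names and limits, of what managing WIP means in practice: pulling an item into a column is refused when that column is already at its limit.
</p>
<pre>
# Hypothetical WIP limits per column; tune them by observing the flow.
WIP_LIMITS = {"in progress": 3, "testing": 2, "ready for release": 5}

class KanbanBoard:
    def __init__(self, columns):
        self.columns = {name: [] for name in columns}

    def pull(self, item, target):
        """Pull a work item into the target column, enforcing its WIP limit."""
        limit = WIP_LIMITS.get(target)
        if limit is not None and len(self.columns[target]) >= limit:
            raise RuntimeError(f"WIP limit hit on '{target}': "
                               "finish something before starting new work")
        for items in self.columns.values():   # remove from its current column
            if item in items:
                items.remove(item)
        self.columns[target].append(item)

board = KanbanBoard(["in progress", "testing", "ready for release", "released"])
board.pull("T-101", "in progress")
board.pull("T-102", "in progress")
board.pull("T-101", "testing")   # T-101 moves on, freeing a WIP slot
</pre>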
<p>
Kanban boards can be mixed with Story Maps to follow the tasks scheduled for the next releases all the way to their delivery in the current development version of the product.
<br>
In this case, the left-most column of the Kanban board becomes the Story Map containing the Stories to be developed, while the right-most column of the Kanban board contains the User Stories identifying features already provided by the product.
<br>
I myself call such a mix of Story Map and Kanban a <b>Product Kanban Board</b>.
</p>
<p>
A real-world example of such a mix of Story Maps and Kanban boards could be as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3948a797-359c-45c7-ba70-5494574772bc">
<img class="centered" style="width: 800px; " alt="Kanban and Story Maps - Visual Management" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3948a797-359c-45c7-ba70-5494574772bc" />
</a>
</div>
<br>
<a name="sec254"></a>
<h4>2.5.4 User Stories</h4>
<p>
User stories are short, simple descriptions of a feature told from the perspective of the person who desires the new capability, usually a user or customer of the system.
</p>
<p>
They typically follow a simple template:
</p>
<div class="centering">
<div class="centered">
<i>As a <type of user>, I want <some goal> so that <some reason>.</i>
</div>
</div>
<br>
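<p>
Since the template is purely mechanical, a trivial sketch in Python makes it explicit - the three blanks are the only variable parts:
</p>
<pre>
# The user story template as a function of its three blanks.
def user_story(user_type, goal, reason):
    return f"As a {user_type}, I want {goal} so that {reason}."

print(user_story("frequent flyer",
                 "to book a trip using my miles",
                 "I can save money"))
</pre>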
<p>
User stories are often written on sticky notes and arranged on walls or tables to facilitate planning and discussion.
<br>
As such, they strongly shift the focus from writing about features to discussing them. In fact, these discussions are more important than whatever text is written.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/35e0a8d6-521d-4775-818a-6df27e653f20">
<img class="centered" style="width: 200px; " alt="User Story" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/35e0a8d6-521d-4775-818a-6df27e653f20" />
</a>
</div>
<br>
<p>
It's the product owner's responsibility to make sure a product backlog of agile user stories exists, but that doesn't mean that the product owner is the one who writes them. Over the course of a good agile project, you should expect to have user story examples written by each team member.
<br>
Also, note that who writes a user story is far less important than who is involved in the discussions of it.
</p>
<p>
Some example stories for different application contexts:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d40939a0-bf16-46eb-9c78-b0e922d62fd0">
<img class="centered" style="width: 500px; " alt="Story Examples" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d40939a0-bf16-46eb-9c78-b0e922d62fd0" />
</a>
</div>
<br>
<p>
User Stories are used to track existing features as well as features to be developed on a mix of Story Map and Kanban, the <b>Product Kanban Board</b>.
</p>
<a name="sec3"></a>
<h2>3. Principles </h2>
<p>
Having covered the fundamentals, we will now go through the principles required for <b>Agile Planning</b> and see how the principles and practices introduced in the previous section should be used to achieve <i>reliable forecasts and planning</i> with Agile methodologies.
</p>
<p>
We will discover:
</p>
<ul>
<li><b>The tools</b>, mostly visual management tools that the organization should adopt.</li>
<li><b>The Organization</b> to be put in place with required roles and committees.</li>
<li><b>The processes</b> that should be respected and that will lead to accurate estimations and forecasts.</li>
<li><b>The Rituals</b> supporting the processes.</li>
<li><b>The Values</b> the team has to embrace to successfully run the processes and deploy the required practices.</li>
</ul>
<a name="sec31"></a>
<h3>3.1 The tools </h3>
<p>
The tools that the organization should adopt are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/7cc8dcbc-ca33-4f38-bc26-2a3100a9b5c2">
<img class="centered" style="width: 900px; " alt="Agile Planning Tools" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/7cc8dcbc-ca33-4f38-bc26-2a3100a9b5c2" />
</a>
</div>
<br>
<p>
I introduced these tools at length in section <a href="#sec25">2.5 Visual Management and Kanban</a> above, so I won't add much here. We will see in the next section, dedicated to processes, how these tools are used and how they complement each other by addressing different needs.
</p>
<a name="sec32"></a>
<h3>3.2 The Organization </h3>
<p>
The organization to put in place consists in identifying <b>roles</b> as well as <b>committees and teams</b>.
</p>
<a name="sec321"></a>
<h4>3.2.1 Required roles</h4>
<p>
The required roles are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/8554fd26-4b6d-46fc-956b-7fba19b6135f">
<img class="centered" style="width: 500px; " alt="Required roles" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/8554fd26-4b6d-46fc-956b-7fba19b6135f" />
</a>
</div>
<br>
<ul>
<li>
<b>Team Leader</b> : The Team Leader animates the team rituals (such as Sprint Planning, Sprint Retrospective, Daily Scrum) and acts as a coach and a mentor to the development team. He is not a manager, he is a leader (Lead by Example, Management 3.0, etc.). He also represents the development team in other rituals (PMC).
<br>
At the end of the day, the Team Leader should be held responsible for neither the team's successes nor its failures; the whole team should be accountable for these.
<br>
If the team leader were solely responsible for the team's performance, the team would be tempted to cut corners on quality or rush some rituals to speed up the pace and meet some artificial deadline. When that is the case, the team requires a <i>Scrum Master</i> who guarantees that the Scrum rituals and processes are well respected.
<br>
In my opinion it makes a lot more sense to avoid such a situation altogether by making sure everyone in the team is accountable for the team's performance and also responsible for the proper respect of the defined Agile processes and rituals. In that case, the Team Leader becomes an arbitrator, a facilitator, a coach and a support, not a manager. At the end of the day, management is too important to be left to managers ;-)
</li>
<li>
<b>Architect</b> : The Architect (or architects) should be the most experienced developer(s), the one(s) with the deepest technical and functional knowledge. There can be several architects - a lead architect, a technical architect, etc. - this doesn't really matter.
<br>
The important thing is that the architect should be entitled to take architecture decisions while still referring to the whole team as much as possible. The architect leads the Architecture Committee where architecture decisions are taken.
<br>
The architect, with the help of the tech leads, provides guidance and support to developers. He is also responsible for checking code quality, leading the code reviews, ensuring that code conventions are respected, etc.
</li>
<li>
<b>Tech Leads and Developers</b> : The tech leads and developers form the core of the development team; they are the ones who actually develop the software.
<br>
Tech Leads coach and support developers and represent them in the Architecture Committee.
</li>
<li>
<b>Product Owner </b>: The Product Owner represents the stakeholders and drives priorities in accordance with the market and customer needs. He is not a leader, he is an arbitrator, and acts as the bridge between the business requirements and the development team.
<br>
I can only recommend the reader to watch the magnificent video <a href="https://www.youtube.com/watch?v=vkYEqz_MA5Y">"Agile Product Ownership in a Nutshell" from Henrik Kniberg</a>.
</li>
<li>
<b>Business representatives </b>: Business representatives (sales, customers, etc.) have to be involved in the Product Management Committee by the product Owner whenever required.
</li>
</ul>
<p>
<b>Why bother ?</b>
</p>
<p>
Roles are required mostly for two reasons : <b>efficiency</b> and <b>focus</b>:
</p>
<ul>
<li>
<b>Efficiency</b>: roles are required to avoid having the whole organization attend every meeting for every possible concern.
</li>
<li>
<b>Focus</b>: every role owner should act as required by his role and put himself in the right mindset for every ritual. Rituals are eventually a role-playing game.
<br>
Roles are not functions ! We are not speaking hierarchy here, it's more a question of role play : when someone is assigned a role, his objective is to act and to challenge the matters being discussed in accordance with his role !
</li>
</ul>
<p>
As an important note, roles can well be shared. The same co-worker can hold multiple roles if required, even though it is better to avoid this.
</p>
<a name="sec322"></a>
<h4>3.2.2 Required Committees and teams</h4>
<p>
Required committees and teams are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/71b8bcdd-d97a-46bd-a1c9-4942874f6481">
<img class="centered" style="width: 720px; " alt="Required Committees and teams" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/71b8bcdd-d97a-46bd-a1c9-4942874f6481" />
</a>
</div>
<br>
<ul>
<li>
<b>Development team</b> : The development team is responsible for developing the software. It is composed of Developers, Tech Leads, Architects and the Team Leader. At the end of the day, they're all developers, and even the Team Leader should be able to spend a portion of his time developing the software (Lead by Example). Its essential ritual is the daily scrum, every day.
</li>
<li>
<b>Product Management Committee</b> : The Product Management Committee - or PMC - is composed of the Development Team Leader, the Architect(s) and the Product Owner. The Product Owner should invite business representatives as required. The PMC is responsible for identifying the new features to be added to the product and prioritizing them. It should take place every week, or every two weeks at most.
<br>
The PMC identifies new features as User Stories and uses the Story Map to track and prioritize them. Priorities are redefined and adapted as story estimations (in Story Points) are refined. This process is explained later. Priorities should be set with respect to <b>the value</b> and <b>the cost</b> (in SP) of each and every story.
</li>
<li>
<b>Architecture Committee</b> : The Architecture Committee is composed of the Team Leader, the Architect(s), the Product Owner, the Tech Leads and representatives of the Development team.
<br>
The Architecture Committee is responsible for analyzing user stories and defining Development Tasks. Every story should be specified, designed and discussed. Screen mockups should be drawn if applicable, acceptance criteria agreed upon, etc.
<br>
Since the Architecture Committee is also responsible for estimating Stories, it's important that representatives of the Development Team - not only the Tech Leads and the Architects, but regular developers as well - take part in it. Ideally, there should be a rotation, with a different pair of developers invited to every meeting. This is required to have everyone agreeing on the estimations.
<br>
The Architecture Committee should also take place every week, or every two weeks at most, and ideally not long after the PMC.
</li>
<li>
<b>Sprint Management Committee</b> : The Sprint Management Committee is basically composed of the Development Team plus the Product Owner.
<br>
During Sprint Planning, the Sprint Management Committee discusses the implementation concerns of the tasks specified by the Architecture Committee and challenges the estimations if required. The Development Tasks defined by the Architecture Committee are detailed as much as possible.
<br>
During the Sprint Retrospective, the Sprint Management Committee discusses the issues and drawbacks encountered during the former sprint and agrees on an action plan to address them.
</li>
</ul>
<a name="sec33"></a>
<h3>3.3 The Processes </h3>
<p>
I will now present the various processes that are required to achieve the ultimate goal of Agile Planning : reliable forecasts and planning.
</p>
<a name="sec331"></a>
<h4>3.3.1 Design Process</h4>
<p>
The <i>Design process</i> consists in breaking a <i>User Story</i> identified by the PMC into <i>Development tasks</i> that developers can understand and work on.
<br>
It can be illustrated as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/623b344c-56ab-4842-9706-413f07534944">
<img class="centered" style="width: 900px; " alt="Design Process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/623b344c-56ab-4842-9706-413f07534944" />
</a>
</div>
<br>
<p>
<b>A. Identification of User Stories</b>
</p>
<p>
The <b>PMC</b> produces a <b>User Story</b> laid down on the Story Map.
</p>
<p>
<b>B. From User Stories to Development Stories</b>
</p>
<p>
The <b>Architecture Committee</b> analyzes every new story and, for each of them, creates a <b>Development Story</b> in the Product Backlog.
</p>
<p>
Such a <i>Development Story</i> is no longer a simple post-it on a Story Map; it is a <i>digital User Story</i> created in a backlog management tool such as Jira or Redmine. The Development Story is specified and designed. It should contain:
</p>
<ul>
<li>The initial user story from the Story Map as it was expressed at that time.</li>
<li>A complete description of the purpose and intents of the feature.</li>
<li>A complete description of the expected behaviour from all perspectives: user, system, etc.</li>
<li>Mock-ups of screens and front-end behaviours as well as validations to be performed on the front-end.</li>
<li>A precise and exhaustive list and description of all business rules.</li>
<li>A list and description of the data to be manipulated.</li>
<li>Several examples of source data or actions and expected results.</li>
<li>Acceptance criteria (functional and non-functional) and a complete testing procedure.</li>
<li>The list of documents - technical and functional - that will need to be updated or adapted and how.</li>
</ul>
<p>
<b>C. From Development Stories to Development Tasks</b>
</p>
<p>
The <b>Architecture Committee</b> also breaks the <b>Development Story</b> down into several <b>Development Tasks</b>.
<br>
Development tasks should be split by logical or functional units or layers. For instance, one task could be related to the GUI while another one could be related to the database changes, etc. But whenever possible, it is always better not to split them by layer but rather vertically, by sub-feature.
<br>
What should never be done is splitting a Story into tasks by type of job - for instance development, unit tests, integration tests. That should never, ever be done. A developer, or a pair of developers, should always implement a sub-feature entirely, with all the required tests, functional tests, migration scripts, documentation updates, etc.
</p>
<p>
<b>D. From Development Tasks to Detailed Tasks</b>
</p>
<p>
The <b>Sprint Management Committee</b>, during <i>Sprint Planning</i>, picks up all these <b>Development Tasks</b> and analyzes them further.
</p>
<p>
The questions to be answered at this time are:
</p>
<ul>
<li>Is all the information provided by the Architecture Committee clear enough, or are some clarifications required?</li>
<li>Is there any unforeseen impact on other parts of the software?</li>
<li>Is there any tool or specific environment setup or configuration required to implement these tasks?</li>
<li>etc.</li>
</ul>
<p>
Specifically, the developers who were not present at the Architecture Committee when a task was designed should challenge it and make sure they understand not only what needs to be done but also, precisely, how to do it.
<br>
At this stage, the new findings should lead to a refinement of the initial estimations agreed upon by the Architecture Committee.
</p>
<p>
In addition, at this stage the Development team discovers the secondary aspects identified by the Architecture Committee, such as documentation to be updated or adapted, automated tests to be implemented or adapted, etc., and details each and every step to be done in the Detailed Tasks.
</p>
<a name="sec332"></a>
<h4>3.3.2 Estimation Process</h4>
<p>
What we eventually want is a Story Map containing estimations for all the Stories that have been analyzed by the Architecture Committee.
<br>
The result we want to achieve here can be represented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c1d1253c-b607-48c6-8940-d8a9197f42c4">
<img class="centered" style="width: 600px; " alt="Story Map with estimations" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c1d1253c-b607-48c6-8940-d8a9197f42c4" />
</a>
</div>
<br>
<p>
Each and every story that has been broken down by the Architecture Committee and created in the Product Backlog is clearly identified: it has an estimation expressed as a total number of Story Points.
<br>
That number corresponds to the total of the estimations in SP of the individual <i>Development Tasks</i> underneath.
</p>
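<p>
The roll-up rule fits in two lines of code. Here is a sketch, in Python with hypothetical data shapes, of how a User Story's total is derived from its Development Tasks:
</p>
<pre>
# Hypothetical shape of an analyzed story: a list of estimated tasks.
story = {"id": "US-12",
         "tasks": [{"id": "T-101", "sp": 5},    # SMTP integration
                   {"id": "T-102", "sp": 3},    # compose screen
                   {"id": "T-103", "sp": 2}]}   # documentation update

# The estimation reported on both the backlog story and the Story Map.
story_sp = sum(task["sp"] for task in story["tasks"])
print(f"{story['id']}: {story_sp} SP")   # US-12: 10 SP
</pre>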
<p>
<b>A. Initial Estimations</b>
</p>
<p>
At this stage, the <b>Architecture Committee</b> is in charge of the Initial Estimations.
<br>
After a Story has been broken down into tasks, each and every one of these tasks is estimated by the Committee using the <i>Planning Poker</i> approach.
<br>
The sum of the estimations of all individual tasks is reported on both the Development Story (Product Backlog) and the User Story (Story Map):
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3d966d2e-c07e-4f8a-96c0-d7215235cff0">
<img class="centered" style="width: 900px; " alt="Initial Estimations" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3d966d2e-c07e-4f8a-96c0-d7215235cff0" />
</a>
</div>
<br>
<p>
<b>B. Refined Estimations</b>
</p>
<p>
When the <b>Sprint Management Committee</b> picks up the <b>Development Tasks</b> to refine them, new impacts might be discovered, new unforeseen refactorings might be required, etc.
</p>
<p>
The Sprint Management Committee should challenge the initial estimations with their new findings and adapt the estimations accordingly.
<br>
Again, these new <i>Refined Estimations</i> should be reported on both the Development Story (Product Backlog) and the User Story (Story Map):
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/dc66f36c-1f9c-4975-8b08-37720b496b3d">
<img class="centered" style="width: 900px; " alt="Refined Estimations" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/dc66f36c-1f9c-4975-8b08-37720b496b3d" />
</a>
</div>
<br>
<p>
<b>C. Final Estimations</b>
</p>
<p>
Eventually, during the sprint, it can happen that a developer discovers that a task will take much more time than expected or, on the contrary, much less.
<br>
Reporting such changes in estimations at this very late stage may not matter much for Scrum, since the sprint is already filled, but it's important for both the Sprint Management Committee and the Architecture Committee to be notified about them in order to improve the way they do estimations.
<br>
As part of Continuous Improvement (Kaizen), the Architecture Committee needs to identify where the gap comes from and try to produce more accurate estimations next time.
</p>
<p>
So even at this stage, when a developer discovers gaps or shortcuts, it's important that any impact in terms of estimation is reported all the way up to the Story Map:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6159e36f-dcd8-4d03-af65-419e562badfe">
<img class="centered" style="width: 900px; " alt="Final Estimations" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6159e36f-dcd8-4d03-af65-419e562badfe" />
</a>
</div>
<br>
<p>
<b>Why bother ?</b>
</p>
<p>
The management tool is the story map, not the product backlog. The product backlog is a technical tool to organize the development activities. It's not a management tool.
</p>
<p>
The Product Management Committee should be able to decide about priorities using solely the Story Map. In addition, it should be possible to forecast a delivery date using solely the Story Map.
<br>
For this reason, the Story Map should contain estimations that are as up to date as possible.
</p>
<p>
Everyone in the company should be able to take his little calculator, go in front of the Story Map and know precisely when a task will be delivered.
<br>
We'll see how soon !
</p>
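<p>
Anticipating a bit on the forecasting process detailed later, here is what that little-calculator computation could look like, assuming a velocity-based forecast (all figures hereunder are hypothetical):
</p>
<pre>
import math
from datetime import date, timedelta

VELOCITY = 30        # measured team throughput, in SP per sprint (example)
SPRINT_DAYS = 14     # two-week sprints

# SP estimations of the stories on the Map, in priority order, up to
# and including the story we want a delivery date for.
remaining_sp = [8, 13, 5, 21, 8, 13]

sprints = math.ceil(sum(remaining_sp) / VELOCITY)   # -> 3 sprints here
delivery = date.today() + timedelta(days=sprints * SPRINT_DAYS)
print(f"~{sprints} sprints, around {delivery.isoformat()}")
</pre>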
<p>
<b>What about updating estimations after the task has been completed and we know how much time we spent on it ?</b>
</p>
<p>
One needs to understand what we're trying to achieve here.
</p>
<p>
We're trying to continuously improve our ability to come up with accurate and reliable estimations based on the information we have. When we estimate tasks at the <i>ARCHCOM</i> or at <i>Sprint Planning</i>, we only have analysis information at our disposal; we have no clue about any post-implementation information such as the actual time that will be spent on the task.
<br>
As such, while it is very important to improve our ability to estimate using analysis information (as done at the ARCHCOM), it makes no sense to update estimations after implementation, since actual implementation time is information we will never have before implementing a task.
</p>
<p>
Again, we want to improve our ability to estimate using the information we have. Actual implementation time is information we don't have up front, so it's useless with regard to improving the estimation process and, as such, doesn't trigger any estimation update.
</p>
<p>
In addition, the estimation process is a comparison game, not an evaluation game. An estimation in SP should have no direct relationship with actual implementation time, for many reasons - among them the fact that different developers have different capacities. A 10 SP task is always a 10 SP task, for every developer. But it may well represent 4 days of work for a junior developer and 2 days of work for a senior developer.
<br>
This aspect is a very important part of the rationale for thinking in terms of SP instead of man-days. And of course SP should be a measure of the whole team's capacity, not of individuals.
</p>
<p>
This is why we don't bother updating estimations after actual implementation. We should still use that knowledge to improve our estimation practice, but updating the estimation in SP after the fact makes no sense.
</p>
<a name="sec333"></a>
<h4>3.3.3 Product Kanban Board Maintenance Process</h4>
<p>
Maintaining the <i>Product Kanban Board</i> (a mix of Story Map and Kanban board) as up to date as possible with the latest activities of the development team as well as the latest estimations is important.
<br>
Again, the <i>Product Kanban Board</i> is the only tool the Product Management Committee should need to come up with estimations and forecasts.
</p>
<p>
We will now see how this <i>Product Kanban Board</i> should be maintained throughout the sprints and how it is used.
</p>
<p>
<b>A. Initial Stage: before the first sprint of the next release</b>
</p>
<p>
We start with a Board of the following shape:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a4abc1b6-d804-4a30-95ab-be9ac45ae7b2">
<img class="centered" style="width: 900px; " alt="Product Kanban - Initial Stage" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a4abc1b6-d804-4a30-95ab-be9ac45ae7b2" />
</a>
</div>
<br>
<p>
The boxes in blue indicate how a <i>User Story</i> is moved across the board when it advances in the analysis and development process:
</p>
<ul>
<li>First, when the Architecture Committee has finished analyzing and breaking down the Story, the estimation it came up with is reported on the User Story in the violet badge.</li>
<li>Then, a Story is moved to <b>Implementation / Doing</b> when the first of its development tasks is being implemented in the current sprint.</li>
<li>It is moved to <b>Implementation / Done</b> when the last of its development tasks is done being implemented (meaning completely done: with automated tests, integration tests, etc.). At this stage it's simply waiting for the next <i>Continuous Delivery</i> build to be available on the test environment for acceptance tests.</li>
<li>When the <i>Continuous Delivery</i> build has been executed, the Story is moved to <b>Testing</b>.</li>
<li>When the Product Owner has tested the Story (or delegated such tests) and accepts the results, the Story is moved to <b>Done</b>.</li>
</ul>
<p>
The Story Map on the left is a pretty standard Story Map, where releases are identified.
</p>
<p>
The Story Map on the right, on the other hand, drops the notion of releases completely. It identifies the features as they are available as a whole in the current development version of the product, regardless of both past releases and releases to come.
<br>
A story identifying a new feature is simply added to it to capture the fact that the feature is now available on the development version.
<br>
On the other hand, a story identifying a modification of an existing feature should be <b>merged with the original story</b>, potentially leading to a new story, corresponding to the new way of expressing the feature.
</p>
<p>
<b>B. During the first sprint</b>
</p>
<p>
During the first sprint after this initial stage, the Kanban board in the middle identifies the Stories that are being worked on and their status:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4e2f2347-c769-4552-a72f-eaca4f9000c4">
<img class="centered" style="width: 900px; " alt="Product Kanban - First Sprint" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4e2f2347-c769-4552-a72f-eaca4f9000c4" />
</a>
</div>
<br>
<p>
<b>C. During the second sprint</b>
</p>
<p>
After the first sprint, developed stories are put on the Story Map on the right and the next set of Stories is developed:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/315440d4-522c-49ca-b048-224d13efd5fc">
<img class="centered" style="width: 900px; " alt="Product Kanban - Next Sprint" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/315440d4-522c-49ca-b048-224d13efd5fc" />
</a>
</div>
<br>
<p>
<b>D. After the first release</b>
</p>
<p>
After the first release, we can see that all the tasks from the first release of the Story Map on the left have been moved to the Story Map on the right.
<br>
The Story Map on the left is adapted and the next releases are shifted up.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/655b8fc2-cb7d-49df-8097-cd8894671257">
<img class="centered" style="width: 900px; " alt="Product Kanban - After first release" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/655b8fc2-cb7d-49df-8097-cd8894671257" />
</a>
</div>
<br>
<p>
Notes:
</p>
<ul>
<li>Actual releases <b>will</b> differ: we can potentially release at every end of Sprint. Releases identified on the Story Map on the left will likely be broken down into smaller releases.</li>
<li>Again, one should embrace <b>Continuous Delivery</b>: the development team releases at every end of sprint. Making it a customer release is a Product Management decision.</li>
<li>One should consider <a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained#sec35">feature flipping</a> in order not to compromise a potential release with a story that could not be completely implemented in one sprint.</li>
</ul>
<p>
<b>E. No notion of release in <i>Done</i></b> (Right Story Map)
</p>
<p>
The Story Map on the right shouldn't have any notion of releases. It represents the Product as it is in the current development version, and it makes no sense there to remember which story was developed in which release.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e521062a-cf09-4d17-b44f-70fcf75957ff">
<img class="centered" style="width: 900px; " alt="Product Kanban - No releases in done" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e521062a-cf09-4d17-b44f-70fcf75957ff" />
</a>
</div>
<br>
<p>
Also, User stories on the right may need to be merged whenever they relate to the same feature.
</p>
<a name="sec334"></a>
<h4>3.3.4 Story Map and Backlog synchronization Process</h4>
<p>
The priorities of the <i>Development Tasks</i> on the <i>Product Backlog</i> should match and follow the priorities of the <i>User Stories</i> on the <i>Story Map</i>.
</p>
<p>
The Story Map drives the priorities. The Product Management Committee uses estimations and updates provided by the Architecture Committee and the Development Team to adapt the priorities of the stories on the Story Map and move them accordingly.
<br>
When a story priority changes, the priorities of the corresponding Development Tasks on the Product Backlog should be changed in order to reflect the new priority of the User Story.
</p>
<p>
The principle is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c340b3b7-042e-4dc6-a5d9-fe232b3ffbc2">
<img class="centered" style="width: 900px; " alt="Backlog and Story Map synchronization principle" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c340b3b7-042e-4dc6-a5d9-fe232b3ffbc2" />
</a>
</div>
<br>
<p>
In terms of process, things occur this way:
</p>
<ol>
<li>The <i>Architecture Committee</i> takes Stories created by the Product Management Committee, designs them and estimates them.</li>
<li>The <i>Product Management Committee</i> learns about the Stories' estimations and re-prioritizes the Story Map accordingly</li>
<li>The <i>Architecture Committee</i> synchronizes the priorities of the corresponding Development Tasks.</li>
</ol>
<p>
This can be represented this way:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/f4d13c78-7e20-44e0-87f5-8f2c186c7b75">
<img class="centered" style="width: 900px; " alt="Backlog and Story Map synchronization process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/f4d13c78-7e20-44e0-87f5-8f2c186c7b75" />
</a>
</div>
<br>
<p>
Let's now see how all of this is used to achieve the ultimate objective: reliable planning and forecasting.
</p>
<a name="sec335"></a>
<h4>3.3.5 Forecasting</h4>
<p>
So ... forecasting, finally.
<br>
At the end of the day, pretty much everything I have presented above, all these tools, charts and processes, is deployed towards this ultimate objective: planning and producing accurate forecasts.
</p>
<p>
If one follows the processes presented above and uses the tools the right way, one ends up with the Story Map presented in <a href="#sec332">3.3.2 Estimation Process</a>, i.e. Stories that carry a pretty accurate estimation in Story Points.
</p>
<p>
In addition, a story map holds an important notion of priority: the development team needs to implement all the stories of a row, from left to right, before it can consider the stories of the next row.
</p>
<p>
So how does one know when a story will be implemented by the development team? The answer is simple: when all stories of the previous rows as well as all stories to the left on the same row are implemented.
<br>
From there, calculating the amount of Story Points to be developed before a specific story can be implemented is straightforward:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/fa67f626-9e00-4454-86d8-ea938003c1d0">
<img class="centered" style="width: 600px; " alt="Forecasting" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/fa67f626-9e00-4454-86d8-ea938003c1d0" />
</a>
</div>
<br>
<p>
Coming back to the example introduced in <a href="#sec332">3.3.2 Estimation Process</a>, if we want to know when the Story with the blue box around it will be delivered, we first have to know how many story points have to be implemented before it: 1750 SP in this example.
</p>
<p>
Based on this, we know that this story will be <b>delivered once all the stories before it have been implemented, plus this story itself</b>, hence 1750 + 100 = 1850 SP.
</p>
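<p>
For illustration, here is a minimal sketch, in Java, of the little computation anyone can do in front of the board. The story map representation (a list of rows, each row being a list of story estimations in SP, ordered by priority) is a hypothetical structure of my own, purely for illustration:
</p>
<pre>
import java.util.List;

public class StoryMapForecast {

    // Sums the SP of all stories in previous rows, of all stories to the left
    // of the target on its own row, and of the target story itself.
    public static int storyPointsToDeliver(List<List<Integer>> rows, int targetRow, int targetCol) {
        int total = 0;
        for (int row = 0; row < targetRow; row++) {
            for (int sp : rows.get(row)) {
                total += sp;
            }
        }
        for (int col = 0; col <= targetCol; col++) {
            total += rows.get(targetRow).get(col);
        }
        return total;
    }
}
</pre>
<p>
With the figures of the example above, this computation returns 1750 + 100 = 1850 SP for the story with the blue box.
</p>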
<p>
<b>Estimating a delivery date</b>
</p>
<p>
In order to estimate a delivery date for that story, we need to know how much time is required to deliver these 1850 SP.
<br>
Here comes the notion of Sprint capacity, or rather Sprint velocity. Strictly speaking, Agilists speak of capacity when reasoning in man-days and of Sprint velocity when reasoning in Story Points.
<br>
I myself use Sprint capacity for both cases.
</p>
<p>
Computing Sprint velocity requires having all the practices described in the introduction in place for several Sprints. I will come back to practices in the next chapters, so I'm leaving them aside for now.
<br>
If the Agile team is mature with regards to its practices, it can compute the Sprint Capacity by looking at the range of Story Points achieved during the last 5 sprints:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/91f84cf5-d6b9-4769-ae04-f4c739c3f3da">
<img class="centered" style="width: 800px;" alt="Uncertainty addressed by range of SP" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/91f84cf5-d6b9-4769-ae04-f4c739c3f3da" />
</a>
</div>
<br>
<p>
We don't use the most extreme values, the minimum and the maximum. Extreme values are most of the time explained by external factors: people get sick, leave on holiday, tasks are sometimes finished in the next sprint, etc.
<br>
Instead, out of five sprints, we use the second-highest and the second-lowest values.
</p>
<p>
We use this range, rather than a single average or median value, to address a fundamental aspect of software engineering: uncertainty.
<br>
The range gives us a lower value and an upper value which we will use as follows.
</p>
<ul>
<li>
<b>In case of <i>fixed time</i></b>, when we have a fixed delivery date, the lower and upper values give us the minimum or maximum set of features we can have implemented at that date.
</li>
<li>
<b>In case of <i>fixed scope</i></b>, when we have to release a version of the software with a given set of features, the lower and upper values will give us the earliest date and the latest date at which we can release.
</li>
</ul>
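<p>
The selection rule above is trivial to express in code. Here is a minimal sketch; the method name and the sprint figures are mine, chosen to match the example that follows:
</p>
<pre>
import java.util.Arrays;

public class SprintVelocity {

    // Returns {lower, upper}: the second-lowest and second-highest SP counts
    // achieved over the last five sprints, dropping the two extreme values.
    public static int[] velocityRange(int[] lastFiveSprints) {
        int[] sorted = lastFiveSprints.clone();
        Arrays.sort(sorted);                       // ascending order
        return new int[] { sorted[1], sorted[3] }; // drop min (index 0) and max (index 4)
    }

    public static void main(String[] args) {
        // e.g. five sprints that achieved 118, 128, 132, 138 and 150 SP
        System.out.println(Arrays.toString(velocityRange(new int[] {118, 128, 132, 138, 150})));
        // prints [128, 138]: the range used in the example below
    }
}
</pre>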
<p>
As a sidenote, when we count Story Points implemented in a sprint, we should focus on developer tasks, not User Stories, since User Stories are too coarse-grained.
<br>
A User Story may well take several sprints to be completed. A developer task within one of these stories should not: tasks should be designed in such a way that they are small enough to always fit in a sprint.
</p>
<p>
Coming back to the example above, let's imagine we want to achieve a fixed scope: we want to know, using the Story Map as it is, how much time will be required to implement these 1850 Story Points (a quick sketch of the arithmetic follows the list below).
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/433558bd-c825-492f-9cd2-c83c2fba059d">
<img class="centered" style="width: 700px; " alt="Burndown Forecast" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/433558bd-c825-492f-9cd2-c83c2fba059d" />
</a>
</div>
<br>
<ul>
<li>Using the lower limit of 128 SP per sprint, it would take us 15 sprints to complete the scope, hence 30 weeks or 6.7 months</li>
<li>Using the upper limit of 138 SP per sprint, it would take us 14 sprints to complete the scope, hence 28 weeks or 6.2 months</li>
</ul>
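<p>
The two bullet points above boil down to a ceiling division, sketched here with the figures of the example (2-week sprints assumed, as elsewhere in this article):
</p>
<pre>
int scope = 1850;  // SP to deliver

int sprintsWorstCase = (int) Math.ceil(scope / 128.0); // 15 sprints -> 30 weeks
int sprintsBestCase  = (int) Math.ceil(scope / 138.0); // 14 sprints -> 28 weeks
</pre>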
<p>
Based on this, the PMC or the Product Owner can communicate to the stakeholders that the feature will be released no earlier than in 6 months, but within 7 months.
</p>
<a name="sec336"></a>
<h4>3.3.6 Development process: Scrum</h4>
<p>
I said enough about Scrum in both <a href="#sec22">this article</a> and <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-software-development-lessons-learned#sec13">my previous article</a>.
<br>
Let me just introduce this chart, which does a great job of presenting the notion of <b>Product Increment</b> as a shippable version of the product, since we adopt Continuous Delivery:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b67a03fb-8033-4b6a-9f9f-648a6b4edd96">
<img class="centered" style="width: 700px;" alt="Scrum Process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b67a03fb-8033-4b6a-9f9f-648a6b4edd96" />
</a>
</div>
<br>
<p>
This allows me to present the last tool I mentioned in the introduction of this article, which is the <b>Sprint Kanban Board</b>:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/66b435d4-08e1-4391-84cb-c4f7dd0b49f5">
<img class="centered" style="width: 700px; " alt="Sprint Kanban" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/66b435d4-08e1-4391-84cb-c4f7dd0b49f5" />
</a>
</div>
<br>
<p>
The Sprint Kanban board is used to track the progress of tasks within the sprint and helps organize developer activities.
</p>
<p>
Some people make extensive use of <i>burndown charts</i> to track the progress of a sprint, or of the product backlog as a whole towards a specific release. I myself never found them that useful. I really get all I want to know about how a release or a specific sprint is doing by using the Product Backlog, the Product Kanban Board or the Sprint Kanban.
</p>
<p>
<b>Commitment</b>
</p>
<p>
A very important aspect of the Scrum sprint, which is also an important value the team has to embrace, is the commitment of the team to close the Sprint scope by the end of the Sprint, whatever it takes.
</p>
<p>
Postponing tasks from sprint to sprint is a nightmare to manage and ruins forecasting. The development team has to make progress in estimating tasks, trying to be as accurate as possible, and be realistic when planning the sprint and feeding the backlog.
<br>
To be honest, it takes quite a few sprints to find the right way of both estimating tasks and feeding the backlog. But after these first initiation sprints, the development team <b>has to commit to the sprint scope at all cost, whatever it takes</b>.
</p>
<p>
The only answer when discovering at the end of the sprint that the scope won't be completed without overtime is first to work as much as required to complete it, and second to identify how to improve the estimation process and the sprint feeding process to avoid this situation in the future (Kaizen burst).
<br>
It is never an acceptable answer to simply postpone some tasks to the next sprint.
</p>
<a name="sec34"></a>
<h3>3.4 The Rituals </h3>
<p>
The rituals of the various teams are as follows.
<br>
Committees are rituals by themselves; the difference between a team and a committee is that a committee gathers solely for a specific ritual.
</p>
<a name="sec341"></a>
<h4>3.4.1 Product Management Committee</h4>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/cbd5a9a0-6786-43d7-9731-066f834c4770">
<img class="centered" style="width: 200px; " alt="Product Management Committee" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/cbd5a9a0-6786-43d7-9731-066f834c4770" />
</a>
</div>
<br>
<p>
The <i>Product Management Committee</i> gathers every X weeks, where X really depends on the organization, the size of the team and the rate at which new functional requirements appear. Every 2 weeks should be sufficient in general; otherwise the frequency can be increased up to every week.
</p>
<p>
The duties of the <i>Product Management Committee</i> are as follows:
</p>
<p>
<b>Story Mapping</b>
</p>
<ul>
<li>Identification of new needs and requirements (also technical and technological !)</li>
<li>Breakdown of these requirements in User Stories</li>
<li><i>"Guessing"</i> of an Initial Priority of a User Story based on Value (and foreseen size)</li>
</ul>
<p>
<b>Maintenance (update) of Priorities</b>
</p>
<ul>
<li>Setting of Actual Priorities based on Estimations from Architecture Committee</li>
<li>Review of priorities of Whole Story Map after update of estimations
<ul>
<li>From Sprint Management Committee</li>
<li>From Development Team</li>
</ul>
</li>
</ul>
<a name="sec342"></a>
<h4>3.4.2 Architecture Committee</h4>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/be8600bc-85ae-48d9-b6d3-a81a2d35554d">
<img class="centered" style="width: 200px; " alt="Architecture Committee" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/be8600bc-85ae-48d9-b6d3-a81a2d35554d" />
</a>
</div>
<br>
<p>
The <i>Architecture Committee</i> also gathers every X weeks. It should meet shortly after the <i>Product Management Committee</i>: at least a few minutes later (a coffee break in between) but no more than one or two days later.
<br>
The Architecture Committee picks up the latest User Stories created at the PMC and synchronizes the Product Backlog with the Story Map. Stories are specified, designed and broken down into Development Tasks.
</p>
<p>
The duties of the <i>Architecture Committee</i> are as follows:
</p>
<p>
<b>Specification and Design of User Stories</b>
</p>
<ul>
<li>Specification of functional and non-functional requirements</li>
<li>Identification of business rules</li>
<li>Identification of Acceptance criteria</li>
<li>Design of GUI </li>
<li>Architecture and Design of Software</li>
<li>Identification of documents and procedures to be updated / adapted</li>
<li>Identification of automated tests to be implemented</li>
</ul>
<p>
<b>Estimation of User Stories</b>
</p>
<ul>
<li>Breakdown in individual Development Tasks
<ul>
<li>This needs to be done sufficiently in advance</li>
</ul>
</li>
<li>Estimation of Development Tasks</li>
<li>Computing of total Estimation and reporting on User Story</li>
<li>Continuous Improvement: understanding of gaps in estimation after notification of Sprint Committee and how to improve</li>
</ul>
<p>
<b>Software Architecture</b>
</p>
<ul>
<li>Identification and maintenance of Coding Standards and Architecture Standards</li>
<li>Review of ad'hoc architecture topics</li>
</ul>
<a name="sec343"></a>
<h4>3.4.3 Sprint Management Committee</h4>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/df862c29-1b79-4c8a-b253-7a900ffbda37">
<img class="centered" style="width: 200px; " alt="Sprint Management Committee" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/df862c29-1b79-4c8a-b253-7a900ffbda37" />
</a>
</div>
<br>
<p>
The <i>Sprint Management Committee</i> gathers at the beginning and at the end of every sprint.
<br>
A sprint starts with the <i>Sprint Planning</i> and ends with <i>Sprint Demo</i> and <i>Sprint Retrospective</i>:
</p>
<p>
<b>Sprint Planning</b>
</p>
<ul>
<li>Discuss Development Tasks to ensure whole team has a clear view of what needs to be done → Detailed Tasks</li>
<li><b>Definition of done</b>: list exhaustively the form of automated tests to be implemented as well as the documentation to be updated and the scope of these changes.</li>
<li>Review and challenge estimations of Detailed Tasks. Update estimation of User Story accordingly</li>
<li>Feed the Sprint Backlog with such Detailed Tasks until Sprint Capacity is reached</li>
</ul>
<p>
<b>Sprint Retro</b>
</p>
<ul>
<li>Review Tasks not completed and create task identifying GAP for next Sprint. Update estimations.</li>
<li>Review SP achieved during sprint and review Sprint Capacity</li>
<li>Discuss issues encountered during Sprint and identify action points. Update processes and rituals accordingly</li>
<li>Continuous Improvement: understanding of gaps in tasks and estimations and how to improve</li>
</ul>
<p>
<b>Sprint Demo</b>
</p>
<ul>
<li>End of Sprint / really optional with Continuous Delivery and Continuous Acceptance Tests</li>
<li>Present sprint developments and integrate feedback. Create new tasks and update estimations.</li>
</ul>
<a name="sec344"></a>
<h4>3.4.4 Development Team - Daily Scrum</h4>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2895b34e-0cd6-4e6c-9a85-4b94723b8f60">
<img class="centered" style="width: 160px; " alt="Development Team" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2895b34e-0cd6-4e6c-9a85-4b94723b8f60" />
</a>
</div>
<br>
<p>
The daily scrum happens every day, ideally early in the morning, at a time when the whole team is in the office.
<br>
The scope of the daily scrum is as follows:
</p>
<p>
<b>Round table - every team member presents:</b>
</p>
<ul>
<li>Past or current development task</li>
<li>Status on that task and precise progress</li>
<li>Next steps</li>
<li>Next task if former is completed</li>
<li>Identification of unforeseen GAPS and adaptation of estimations</li>
</ul>
<p>
<b>Identification of challenges, issues and support needs</b>
</p>
<ul>
<li>Scheduling of ad'hoc meeting and required attendees to discuss specific issues</li>
</ul>
<a name="sec35"></a>
<h3>3.5 The Values </h3>
<p>
<b>Sticking to rituals, respecting principles and enforcing practices is difficult.</b>
</p>
<ul>
<li>It's difficult to behave in such a way that breaking the build (failing tests) remains an exception.</li>
<li>It's difficult to respect the boy scout rule.</li>
<li>It's a lot more difficult to design things carefully and stick to the KISS principle.</li>
<li>It's difficult, and a lot of work, to keep the Story Map and Product Backlog in sync and up to date with reality.</li>
<li>It's difficult to stick to the TDD approach.</li>
<li>It's difficult not to squeeze out the Kaizen phase at the end of every meeting, and to remain objective when analyzing strengths and weaknesses.</li>
</ul>
<p>
All of this makes two Agile values especially important: <b>discipline</b> and <b>courage</b>.
<br>
Both are of utmost importance and essential to address these difficulties.
</p>
<p>
Sticking to the Scrum rituals and enforcing TDD and other XP principles and practices require courage and discipline. It also takes a lot of discipline to maintain and synchronize the Product Backlog and the Story Map.
<br>
Updating the estimations of the User Stories continuously as the understanding of the work to be done progresses also takes a lot of discipline.
</p>
<p>
Finally, discipline and courage are reinforced by a strict definition of the processes and rituals, and by a proper maintenance of this definition as the culture and practices evolve.
<br>
At the end of the day, defining these committees and rituals is all about that. Why are all these committees / teams / rituals required if a single person can end up holding several roles? Because they enforce discipline: they are scheduled and have precise agendas.
</p>
<a name="sec4"></a>
<h2>4. Overview of the whole process </h2>
<p>
The whole process looks as follows:
</p>
<ul>
<li>
Product Management Committee (X-Weekly)
<ul>
<li><b>1</b> Identification of a new User Story</li>
<li><b>2</b> Initial foreseen priority (i.e. release) depending on value and initial estimation (oral)</li>
</ul>
</li>
<li>
Architecture Committee (X-Weekly)
<ul>
<li><b>3</b> Design and specification by architecture committee : Story → Development Story → Task</li>
<li><b>4</b> Estimation of individual tasks</li>
<li><b>5</b> Computation of total SP and setting of size of Development Story and User Story</li>
<li><b>6</b> Re-prioritization (based on new estimation)</li>
</ul>
</li>
<li>
Sprint Planning + Sprint retrospective (Sprintly)
<ul>
<li><b>7</b> Review of Tasks and discussion : Task → Detailed Task</li>
<li><b>8</b> Adaptation of Estimation on Tasks</li>
<li><b>9</b> Update of Total Size of Development Story and User Story</li>
<li><b>10</b> Notification to Architecture Committee (Kaizen / Sprint retrospective)</li>
</ul>
</li>
<li>
Daily Scrum
<ul>
<li><b>11</b> Identification of Gap on Task</li>
<li><b>12</b> Adaptation of Estimation on Task</li>
<li><b>13</b> Update of Total Size of Development Story and User Story</li>
<li><b>14</b> Notification to Architecture Committee (Kaizen / Sprint retrospective)</li>
</ul>
</li>
</ul>
<p>
In a graphical way:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6e3fc346-7277-45ce-af68-bea91cba88d5">
<img class="centered" style="width: 900px;" alt="Overview of the whole process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6e3fc346-7277-45ce-af68-bea91cba88d5" />
</a>
</div>
<br>
<a name="sec5"></a>
<h2>5. Return on Practices</h2>
<p>
As stated many times in this article, all of this, reliable planning and true agility, requires a strong commitment of the team to Agile practices and principles.
</p>
<p>
One cannot apply only a small subset of the Agile practices and expect to achieve true agility and reliable Agile planning.
<br>
The Agile practices I listed in introduction form a package with strong dependencies between each other.
</p>
<p>
IMHO the dependencies are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/509a7199-f9b7-41d4-8ee3-e1970fb11ece">
<img class="centered" style="width: 800px; " alt="Agile Planning Practices dependencies" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/509a7199-f9b7-41d4-8ee3-e1970fb11ece" />
</a>
</div>
<br>
<p>
An arrow denotes a dependency between two practices.
</p>
<p>
Explanations of a few of these dependencies:
</p>
<ul>
<li>
You cannot imagine <b>reliable planning and forecasting</b> if you don't provide the management with appropriate tools : <b>Story Map</b> and <b>Kanban boards</b>.
<br>
Also, it's going to be difficult without a proper technical tool for the development team: <b>The Product Backlog</b>.
<br>
Finally, it obviously requires <b>Reliable Estimations</b>.
</li>
<li>
<b>Reliable estimations</b> require manageable and well <b>planned sprints</b>. One-week sprints are too short: a lot can happen in one week. Three-week sprints are too long in my opinion: the fluctuations are too important. I strongly believe that 2-week sprints are the right size when it comes to having an accurate and reliable Sprint Capacity (or Velocity) in <b>SP</b>.
<br>
With 2-week sprints only, the development team cannot afford to spend time on releasing the <b>Shippable Product</b>: releasing should be a completely automated procedure, and in this regard <b>Continuous Delivery</b> is not optional.
</li>
<li>
Then achieving <b>Continuous Delivery</b> requires a lot of things and a good mastery of common XP and DevOps Practices.
</li>
</ul>
<a name="sec6"></a>
<h2>6. Conclusion</h2>
<p>
Management needs a management tool to take enlightened decisions. The Product Backlog should not be a management tool; it's really rather the development team's internal business. The Story Map, on the other hand, is a simple, visual and effective management tool.
<br>
All the rituals and processes introduced in this article are deployed towards the same ultimate goal: <b>enabling the management to use the Story Map as a management tool for planning and forecasting.</b> In addition, the specific form of Story Map introduced here, the <b>Product Kanban Board</b>, also becomes a Project Management tool aimed at tracking the progress of the development team.
</p>
<p>
The difficulty, the reason why it requires a strict enforcement of processes and rituals, is to synchronize the Story Map and the Product Backlog.
<br>
Since the development team works mostly with the Product Backlog, the latter eventually holds the accurate and realistic information about size and delivery time, through the notion of Story Points.
<br>
But this is of no help for the management, hence the need to feed the estimations put in the Product Backlog back to the Story Map.
</p>
<p>
Eventually, if these processes and rituals are respected and well applied, anyone in the company can come in front of the <i>Product Kanban Board</i> with a little calculator and compute the delivery date (or rather the range) for any given story.
<br>
Anyone can use the Story Map to compute how much work can be done for any given date, or what time is required to deliver a specific scope.
</p>
<p>
All of this with a simple calculator and in a few seconds, without Excel, without any Internet connection, without any complicated tool or pile of paper, just a calculator ... or a brilliant mind.
</p>
<p>
Now having said that, I would like to conclude this article by mentioning that the processes and tools I am presenting here work for us. They may not work as-is for another organization. It's up to every organization to discover the practices and principles that best fit its needs and individuals.
<br>
As an example, the association of two Story Maps, the <i>"to do"</i> on the left and the <i>"done"</i> on the right of a Kanban board, for the needs of both Product and Project Management, is a really personal recipe. While I myself got the idea from another organization, I haven't seen it often elsewhere.
<br>
This illustrates, in my opinion, the very best qualities of an agilist: the curiosity to discover new ways of working and the courage to try them or invent them.
</p>
<p>
This article is available as a slideshare presentation here : <a href="https://www.slideshare.net/JrmeKehrli/agility-and-planning-tools-and-processes">https://www.slideshare.net/JrmeKehrli/agility-and-planning-tools-and-processes</a>.
</p>
<p>
Also, you can read a PDF version of this article here : <a href="https://www.niceideas.ch/Agile_Planning.pdf">https://www.niceideas.ch/Agile_Planning.pdf</a>.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for1
Bytecode manipulation with Javassist for fun and profit part II: Generating toString and getter/setters using bytecode manipulation
Jerome Kehrli
2017-04-24T16:38:42-04:00
2017-09-13T16:41:56-04:00
<p>
Following my first article on Bytecode manipulation with Javassist, presented here: <a href="https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for">Bytecode manipulation with Javassist for fun and profit part I: Implementing a lightweight IoC container in 300 lines of code</a>, I am presenting here another example of bytecode manipulation with Javassist: generating a <code>toString</code> method as well as property getters and setters with Javassist.
</p>
<p>
While the former example was oriented towards understanding how Javassist and bytecode manipulation come in handy for implementing IoC concerns, such as what is done by the Spring framework or the pico IoC container, this new example is oriented towards generating <i>boilerplate code</i>, in a similar way to what <a href="https://projectlombok.org/">Project Lombok</a> is doing.
<br>
As a matter of fact, generating boilerplate code is another very sound use case for bytecode manipulation.
</p>
<p>
Boilerplate code refers to portions of code that have to be included or written in the same way in many places with little or no alteration.
<br>
The term is often used when referring to languages that are considered verbose, i.e. where the programmer must write a lot of code to do a minimal job. And Java is unfortunately a clear winner in this regard.
<br>
Avoiding boilerplate code is one of the main reasons (but by far not the only one of course !) why developers are moving away from Java in favor of other JVM languages such as Scala.
</p>
<p>
In addition, as a reminder, a sound understanding of Java bytecode and of the ways to manipulate it is a strong prerequisite for building software analysis tools, mocking libraries, profilers, etc. Bytecode manipulation is a key enabler in this regard, thanks to the JVM and the fact that bytecode is interpreted.
<br>
Traditionally, bytecode manipulation libraries suffer from complicated approaches and techniques. Javassist, however, proposes a natural, simple and efficient approach bringing bytecode manipulation within everyone's reach.
</p>
<p>
So in this second example about Javassist we'll see how to implement typical <i>Lombok</i> features using Javassist, in a few dozen lines of code.
</p>
<p>
Part of this article is available as a slideshare presentation here: <a href="https://www.slideshare.net/JrmeKehrli/bytecode-manipulation-with-javassist-for-fun-and-profit">https://www.slideshare.net/JrmeKehrli/bytecode-manipulation-with-javassist-for-fun-and-profit</a>.
</p>
<p>
You might want to have a look at the first article in this series, available here: <a href="https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for">Bytecode manipulation with Javassist for fun and profit part I: Implementing a lightweight IoC container in 300 lines of code</a>.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction / Purpose </a></li>
<li><a href="#sec2">2. Javassist</a></li>
<li><a href="#sec3">3. Java Instrumentation framework and applications</a>
<ul>
<li><a href="#sec31">3.1 Lombok</a>
<ul>
<li><a href="#sec311">3.1.1 Example class</a></li>
<li><a href="#sec312">3.1.2 Without Lombok</a></li>
<li><a href="#sec313">3.1.3 With Lombok</a></li>
<li><a href="#sec314">3.1.4 Lombok Approach (AST transformation) vs. Bytecode Manipulation vs. Code Generation</a></li>
<li><a href="#sec315">3.1.5 Just a note on concerns</a></li>
</ul>
</li>
<li><a href="#sec32">3.2 Java agents and the linkage problem</a>
<ul>
<li><a href="#sec321">3.2.1 Overcoming the Linkage problem with Java Agents</a></li>
<li><a href="#sec322">3.2.2 ClassFileTransformer</a></li>
<li><a href="#sec323">3.2.3 Caution</a></li>
<li><a href="#sec324">3.2.4 Simple Example</a></li>
<li><a href="#sec325">3.2.5 Invoking the Agent</a></li>
<li><a href="#sec326">3.2.6 Workaround</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#sec4">4. BCG: a simple approach for generating boilerplate code using Javassist</a>
<ul>
<li><a href="#sec41">4.1 Principle</a></li>
<li><a href="#sec42">4.2 Design</a></li>
<li><a href="#sec43">4.3 Implementation</a>
<ul>
<li><a href="#sec431">4.3.1 The code of the Agent</a></li>
<li><a href="#sec432">4.3.2 Interface ClassTransformer</a></li>
<li><a href="#sec433">4.3.3 Common Abstraction</a></li>
<li><a href="#sec434">4.3.4 The set of Class Transformers</a></li>
<li><a href="#sec435">4.3.5 Test Class Example</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#sec5">5. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction / Purpose </h2>
<p>
<i>I am giving only a few lines here and return the user to <a href="https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for#sec1">my first article on this topic</a> for the complete introduction to bytecode manipulation.</i>
</p>
<p>
Bytecode manipulation consists in modifying the classes - represented by bytecode - compiled by the Java compiler, at runtime. It is used extensively for instance by frameworks such as Spring (IoC) and Hibernate (ORM) to inject dynamic behaviour to Java objects at runtime.
<br>
Bytecode manipulation is traditionally difficult. Out of all the various libraries and tools available to achieve it, Javassist stands out due to its natural, simple yet efficient approach.
</p>
<a name="sec2"></a>
<h2>2. Javassist</h2>
<p>
<i>I am giving only a few lines here and return the user to <a href="https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for#sec2">my first article on this topic</a> for the complete introduction to Javassist.</i>
</p>
<p>
Quoting <a href="https://en.wikipedia.org/wiki/Javassist">Wikipedia</a>:
</p>
<p>
Javassist (Java programming assistant) is a Java library providing a means to manipulate the Java bytecode of an application.
</p>
<p>
Javassist enables Java programs to define a new class at runtime and to modify a class file when the JVM loads it. Unlike other similar bytecode editors, Javassist provides two levels of API: source level and bytecode level. Using the source-level API, programmers can edit a class file without knowledge of the specifications of the Java bytecode; the whole API is designed with only the vocabulary of the Java language. Programmers can even specify inserted bytecode in the form of Java source text; Javassist compiles it on the fly. On the other hand, the bytecode-level API allows the users to directly edit a class file as other editors.
</p>
<a name="sec3"></a>
<h2>3. Java Instrumentation framework and applications</h2>
<p>
Java 5 was the first version seeing the proper implementation of JSR-163 (Java Platform Profiling Architecture API) support including a bytecode instrumentation mechanism through the introduction of the Java Programming Language Instrumentation Services (JPLIS). At first that JSR only mentioned native (C) interfaces but evolved fast towards a pretty convenient Java API.
<br>
This was an interesting breakthrough since it allowed one, with the help of an agent, to modify the bytecode of a class's methods in such a way as to change its behavior at runtime.
</p>
<p>
The key point of JSR-163 is JVMTI. JVMTI - the Java Virtual Machine Tool Interface - allows a program to inspect the state and to control the execution of applications running in the Java Virtual Machine. JVMTI is designed to provide an <i>Application Programming Interface</i> (API) for the development of tools that need access to the state of the JVM. Examples of such tools are debuggers, profilers or runtime boilerplate code generators.
</p>
<p>
The scope of this article is to present how a boilerplate code generator can benefit from JVMTI and bytecode manipulation to achieve a level of programming comfort and ease previously unseen on the JVM, short of considering a different language than Java, such as Scala or Clojure.
</p>
<p>
As a way to illustrate boilerplate code concerns, we will focus on <i>Project Lombok</i> and its features, and try to see how to reproduce some of them using Javassist.
</p>
<a name="sec31"></a>
<h3>3.1 Lombok</h3>
<p>
Project Lombok is a <i>boilerplate code</i> generator that addresses one of the most frequent criticisms against the Java programming language: the volume of this type of code that is found in most projects.
<br>
<i>Boilerplate code</i> is a term used to describe code that is repeated in many parts of an application with only slight contextual changes and with little added value.
<br>
Project Lombok reduces the need for some of the worst offenders by replacing each of them with a simple annotation. It then takes care of generating the boilerplate code at compile time.
</p>
<p>
Project Lombok also integrates well in IDEs making it possible to use the usual IDE features such as refactoring and usage analysis pretty transparently.
</p>
<p>
Importantly in our context, since we will focus below on reproducing Lombok features using runtime bytecode generation, I have to mention right away that Lombok doesn't just generate Java sources or bytecode: it transforms the <i>Abstract Syntax Tree</i> (AST), by modifying its structure at compile-time.
<br>
The AST is a tree representation of the parsed source code, created by the compiler, similar to the DOM tree model of an XML file. By modifying (or transforming) the AST, Lombok keeps the source code trim and free of bloat, unlike plain-text code-generation.
<br>
Lombok's generated code is also visible to classes within the same compilation unit, unlike direct bytecode manipulation.
<br>
<i>Our approach below, using runtime bytecode manipulation, does not benefit from the same comfort and is hence less convenient than what Lombok is doing.</i>
</p>
<a name="sec311"></a>
<h4>3.1.1 Example class</h4>
<p>
Let's see an example. Imagine the following Java POJO:
</p>
<pre>
<b>public class</b> DataExample {

    <b>private final</b> String name;
    <b>private int</b> age;
    <b>private double</b> score;
    <b>private</b> String[] tags;
}
</pre>
<p>
Typical boilerplate code involved when considering such a POJO are:
</p>
<ul>
<li>Getters and Setters for all private fields, making them <i>JavaBean properties</i></li>
<li>A nice <code>toString</code> method giving the values of its properties when an object is output on the console</li>
<li>Consistent <code>hashCode</code> and <code>equals</code> methods enabling comparison of two different objects carrying the same values</li>
<li>A default constructor without any argument (Javabean standard)</li>
<li>An <i>all args</i> constructor taking all values as argument to build the instance</li>
</ul>
<a name="sec312"></a>
<h4>3.1.2 Without Lombok</h4>
<p>
Without Lombok, writing all this code is a nightmare, and the simple class above becomes as follows:
</p>
<pre>
<b>import</b> java.util.Arrays;

<b>public class</b> DataExample {

    <b>private final</b> String name;
    <b>private int</b> age;
    <b>private double</b> score;
    <b>private</b> String[] tags;

    <b>public</b> DataExample(String name) {
        <b>this</b>.name = name;
    }

    <b>public</b> DataExample(String name, <b>int</b> age, <b>double</b> score, String[] tags) {
        <b>this</b>.name = name;
        <b>this</b>.age = age;
        <b>this</b>.score = score;
        <b>this</b>.tags = tags;
    }

    <b>public</b> String getName() {
        <b>return this</b>.name;
    }

    <b>void</b> setAge(<b>int</b> age) {
        <b>this</b>.age = age;
    }

    <b>public int</b> getAge() {
        <b>return this</b>.age;
    }

    <b>public void</b> setScore(<b>double</b> score) {
        <b>this</b>.score = score;
    }

    <b>public double</b> getScore() {
        <b>return this</b>.score;
    }

    <b>public</b> String[] getTags() {
        <b>return this</b>.tags;
    }

    <b>public void</b> setTags(String[] tags) {
        <b>this</b>.tags = tags;
    }

    @Override
    <b>public</b> String toString() {
        <b>return</b> <span style="color: blue;">"DataExample("</span> + <b>this</b>.getName() +
                <span style="color: blue;">", "</span> + <b>this</b>.getAge() +
                <span style="color: blue;">", "</span> + <b>this</b>.getScore() +
                <span style="color: blue;">", "</span> + Arrays.deepToString(<b>this</b>.getTags()) + <span style="color: blue;">")"</span>;
    }

    @Override
    <b>public boolean</b> equals(Object o) {
        <b>if</b> (o == <b>this</b>) <b>return true</b>;
        <b>if</b> (!(o <b>instanceof</b> DataExample)) <b>return false</b>;
        DataExample other = (DataExample) o;
        <b>if</b> (<b>this</b>.getName() == <b>null</b> ?
                other.getName() != <b>null</b> :
                !<b>this</b>.getName().equals(other.getName()))
            <b>return false</b>;
        <b>if</b> (<b>this</b>.getAge() != other.getAge()) <b>return false</b>;
        <b>if</b> (Double.compare(<b>this</b>.getScore(), other.getScore()) != 0) <b>return false</b>;
        <b>if</b> (!Arrays.deepEquals(<b>this</b>.getTags(), other.getTags())) <b>return false</b>;
        <b>return true</b>;
    }

    @Override
    <b>public int</b> hashCode() {
        <b>final int</b> PRIME = 59;
        <b>int</b> result = 1;
        <b>final long</b> temp1 = Double.doubleToLongBits(<b>this</b>.getScore());
        result = (result*PRIME) + (<b>this</b>.getName() == <b>null</b> ? 43 : <b>this</b>.getName().hashCode());
        result = (result*PRIME) + <b>this</b>.getAge();
        result = (result*PRIME) + (<b>int</b>)(temp1 ^ (temp1 >>> 32));
        result = (result*PRIME) + Arrays.deepHashCode(<b>this</b>.getTags());
        <b>return</b> result;
    }
}
</pre>
<p>
One needs to understand that while the features implemented by the code above are pretty useful and very important, the code itself has no added value whatsoever. Most IDEs generate this code for you with a few right-clicks here and there. A machine can write this code for you, can you imagine this? ... And yet with Java it HAS to be written. This makes no sense.
</p>
<p>
So without Lombok, for a 4-property, 5-line class ... one has to write more than 60 lines of boilerplate code. That's a [boilerplate code / useful code] ratio of more than 1600% !!!
<br>
This is the main rationale behind Project Lombok.
</p>
<a name="sec313"></a>
<h4>3.1.3 With Lombok</h4>
<p>
With Lombok we can use the following set of annotations on top of the class to generate the very same boilerplate code that we had to write on our own above:
</p>
<pre>
@Getter
@Setter
@ToString
@EqualsAndHashCode
@RequiredArgsConstructor
@AllArgsConstructor
<b>public class</b> DataExample {

    <b>private final</b> String name;
    <b>private int</b> age;
    <b>private double</b> score;
    <b>private</b> String[] tags;
}
</pre>
<p>
All these annotations are straightforward to understand so I won't be describing them any further.
</p>
<p>
You might want to look at the <a href="https://projectlombok.org/features/index.html">Lombok documentation</a> to learn more about these annotations.
</p>
<p>
With Lombok, we get a much better ratio of boilerplate code concerns vs. useful code of 100% instead of more than 1600% without Lombok.
</p>
<a name="sec314"></a>
<h4>3.1.4 Lombok Approach (AST transformation) vs. Bytecode Manipulation vs. Code Generation</h4>
<p>
<b>Code generation</b>
</p>
<p>
Of course there are alternatives to Lombok. For instance, most if not all IDEs enable the developer to generate these <i>boilerplate</i> methods in the class source file, just as Lombok does.
<br>
But generating this code in the class source file is IMHO not the right approach. At the end of the day, I believe that <b>generated code</b> - i.e. code that can be written by a machine - <b>should simply not be written</b>! If a machine can write this code then this code has no added value at all! I do not want to see such code in my class files; I do not want to be aware of it.
<br>
Lombok generates this code transparently and seamlessly. By modifying (or transforming) the AST, Lombok keeps the source code trim and free of bloat, unlike plain-text code generation. With Lombok, I do not need to be aware of this code, it doesn't pollute my <i>code coverage</i> computation (in Sonar for instance), and everything behaves <i>as if</i> this code were actually written, except that one doesn't see it and doesn't need to care about it.
</p>
<p>
<b>Bytecode Manipulation</b>
</p>
<p>
Bytecode manipulation is IMHO preferable to code generation when it comes to generating (at least some of the) boilerplate code, for the same reason as above: bytecode manipulation makes the boilerplate code transparent and avoids it polluting the source code.
<br>
It is however not as effective as the AST transformation Lombok performs. Lombok's generated code is visible to classes within the same compilation unit, unlike direct bytecode manipulation.
</p>
<p>
For instance, when it comes to generating getters and setters, Lombok makes them visible at compile time, making it possible for client code to use them without them appearing in the Java Bean class.
<br>
With bytecode manipulation, generated getters and setters remain invisible to the compiler and cannot be used by another class except through runtime reflection, which may be fine for some use cases (Hibernate, etc.) but not for most of them.
</p>
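<p>
To give a feeling for what such runtime generation looks like, here is a minimal sketch using Javassist's <code>CtNewMethod.getter</code> and <code>CtNewMethod.setter</code> helpers, assuming the target class has not been loaded yet. It targets the example POJO above; the surrounding class and method names are mine, purely for illustration:
</p>
<pre>
import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtField;
import javassist.CtNewMethod;
import javassist.Modifier;

public class GetterSetterGenerator {

    public static void main(String[] args) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        // Fetch the compiled class (it must not have been loaded by the JVM yet)
        CtClass ct = pool.get("DataExample");

        // Generate a getter (and, for non-final fields, a setter) for every declared field
        for (CtField field : ct.getDeclaredFields()) {
            String suffix = Character.toUpperCase(field.getName().charAt(0))
                    + field.getName().substring(1);
            ct.addMethod(CtNewMethod.getter("get" + suffix, field));
            if (!Modifier.isFinal(field.getModifiers())) {
                ct.addMethod(CtNewMethod.setter("set" + suffix, field));
            }
        }

        // Freeze the class and load it into the current classloader
        Class<?> enhanced = ct.toClass();
        System.out.println(enhanced.getDeclaredMethods().length + " methods generated");
    }
}
</pre>
<p>
The catch, as discussed in the next section, is that these methods only exist at runtime: the compiler never sees them, so regular client code cannot call them directly.
</p>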
<p>
<b>AST Transformation</b>
</p>
<p>
More information on Lombok's internals is available here: <a href="https://www.ibm.com/developerworks/library/j-lombok/">Custom AST transformations with Project Lombok</a>.
</p>
<a name="sec315"></a>
<h4>3.1.5 Just a note on concerns</h4>
<p>
I sometimes hear (or rather read) arguments online against the use of Lombok. Most of the time, people against Lombok complain about the <i>magic</i> added by Lombok.
</p>
<p>
It is true that Lombok adds a lot of <i>magic</i> to the application. But that magic is limited to boilerplate code with no added value, and it is easy to debug and understand in case of doubt. In addition, it is very well supported by the most common IDEs such as Eclipse and IntelliJ IDEA through well-known or even official plugins.
</p>
<a name="sec32"></a>
<h3>3.2 Java agents and the linkage problem</h3>
<p>
Javassist cannot modify a class after it has been loaded by a classloader ... as far as this classloader is concerned.
<br>
Whenever one tries to modify a class already loaded by the referenced classloader, the attempt to call <code>pool.makeClass( ... )</code> will fail and complain that the class is <i>frozen</i> (i.e. already created via <code>toClass()</code>).
<br>
Being able to do that would require unloading the class first from the referencing classloader.
</p>
<p>
The problem here is that one cannot unload a single class from a classloader. A class may be unloaded if both it and its classloader become unreachable, but since every class refers to its loader, this implies that all classes loaded by this loader have to become unreachable too.
<br>
Of course one can (re-)create the class using a different classloader, but that would require making the whole program use that new classloader, which becomes fairly complicated.
<br>
At the end of the day, it would pretty much require reloading the whole application and initializing everything all over again. This makes no sense.
</p>
<p>
Let's just accept here that a class cannot be changed by Javassist once it has already been loaded by a classloader.
</p>
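<p>
A minimal sketch of what this limitation looks like in practice (the class and method names are mine, and the exact exception message may vary between Javassist versions):
</p>
<pre>
import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtNewMethod;

public class FrozenClassDemo {

    public static void main(String[] args) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass ct = pool.get("DataExample");

        ct.toClass(); // loads the class in the current classloader and freezes the CtClass

        // Any further modification attempt now fails with something like:
        // RuntimeException: DataExample class is frozen
        ct.addMethod(CtNewMethod.make("public int answer() { return 42; }", ct));
    }
}
</pre>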
<a name="sec321"></a>
<h4>3.2.1 Overcoming the Linkage problem with Java Agents</h4>
<p>
The only (easy) way to overcome this problem is to change the class implementation using bytecode manipulation before the class is loaded by any Classloader. And happily, as often in the Java World, the JVM provides a mechanism for this, the JPLIS - <i>Java Programming Language Instrumentation Services</i> - and the concept of <i>Java Agent</i>.
</p>
<p>
In its essence, a Java agent is a regular Java class which follows a set of strict conventions. The agent class must implement a <code>public static void premain(String agentArgs, Instrumentation inst)</code> method which becomes an agent entry point (similar to the main method for regular Java applications).
</p>
<p>
Once the Java Virtual Machine (JVM) has initialized, each such <code>premain(String agentArgs, Instrumentation inst)</code> method of every agent will be called in the order the agents were specified on JVM start. When this initialization step is done, the real Java application main method will be called.
</p>
<p>
The instrumentation capabilities of Java agents are truly unlimited. Most noticeable one is the ability to redefine classes at run-time. The redefinition may change method bodies, the constant pool and attributes. The redefinition must not add, remove or rename fields or methods, change the signatures of methods, or change inheritance.
</p>
<p>
Please note that re-transformed or redefined class bytecode is not checked or verified at the moment the transformations or redefinitions are applied. If the resulting bytecode is erroneous, an exception will be thrown later, which may crash the JVM completely.
</p>
<a name="sec322"></a>
<h4>3.2.2 ClassFileTransformer</h4>
<p>
A Java agent <code>premain</code> method takes the Instrumentation entry point - class <code>java.lang.instrument.Instrumentation</code> - as argument.
</p>
<p>
The Instrumentation entry point provides several convenience methods to check the capabilities of the JVM, but the most important API of the <code>java.lang.instrument.Instrumentation</code> class is the method <code>void addTransformer(ClassFileTransformer transformer);</code>, which enables the developer to register several <code>java.lang.instrument.ClassFileTransformer</code> instances.
</p>
<p>
The <code>java.lang.instrument.ClassFileTransformer</code> interface defines one single method, <code>byte[] transform(...)</code>, which is responsible for applying transformations (up to complete rewriting if required) to the Java classes being loaded by the JVM.
<br>
The <code>transform(...)</code> method is called for each and every class being loaded by a classloader. Both the class being loaded and the classloader actually loading it, as well as other information, are given as arguments.
</p>
<p>
The <code>transform(...)</code> method is the ideal place where bytecode manipulation libraries can be used to modify classes just before they are loaded by the classloader.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/62b1c423-2028-4ef3-ab25-37ac8bbaf7ab">
<img class="centered" style="width: 450px;" alt="Java Agent Behaviour" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/62b1c423-2028-4ef3-ab25-37ac8bbaf7ab" />
</a><br>
<div class="centered">
(Source : <a href="http://www.barcelonajug.org/2015/04/java-agents.html">http://www.barcelonajug.org/2015/04/java-agents.html</a>)
</div>
</div>
<br>
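<p>
As a foretaste of the approach detailed later in this article, here is a minimal sketch of an agent whose <code>transform(...)</code> method actually uses Javassist to inject a trivial <code>toString()</code> into one specific class before it is loaded. The target class name is a placeholder of mine (the plain 5-line POJO from section 3.1.1, which doesn't yet define <code>toString()</code>), not part of the original example:
</p>
<pre>
package ch.niceideas.common.agent;

import java.io.ByteArrayInputStream;
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtNewMethod;

public class ToStringInjectorAgent {

    public static void premain(String agentArgs, Instrumentation instrumentation) {
        instrumentation.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                    Class<?> classBeingRedefined, ProtectionDomain protectionDomain,
                    byte[] classfileBuffer) {
                // Class names are given in internal form (slashes, not dots)
                if (!"DataExample".equals(className)) {
                    return classfileBuffer; // leave every other class untouched
                }
                try {
                    // Build a CtClass from the bytecode being loaded
                    CtClass ct = ClassPool.getDefault()
                            .makeClass(new ByteArrayInputStream(classfileBuffer));
                    // Inject a trivial toString() before the class gets loaded
                    ct.addMethod(CtNewMethod.make(
                            "public String toString() { return \"DataExample(\" + name + \")\"; }",
                            ct));
                    return ct.toBytecode();
                } catch (Exception e) {
                    e.printStackTrace();
                    return classfileBuffer; // on error, fall back to the original bytecode
                }
            }
        });
    }
}
</pre>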
<a name="sec323"></a>
<h4>3.2.3 Caution</h4>
<p>
As a sidenote, the Java agent class may also implement a <code>public static void agentmain(String agentArgs, Instrumentation inst)</code> method, which is used when the agent is started after JVM startup.
</p>
<p>
A common practice when developing agents is to implement both the <code>agentmain</code> and <code>premain</code> methods and delegate one to the other. See <a href="#sec431">implementation of Java Agent</a> of BCG below.
</p>
<a name="sec324"></a>
<h4>3.2.4 Simple Example</h4>
<pre>
<b>package</b> ch.niceideas.common.agent;

<b>import</b> java.lang.instrument.ClassFileTransformer;
<b>import</b> java.lang.instrument.IllegalClassFormatException;
<b>import</b> java.lang.instrument.Instrumentation;
<b>import</b> java.security.ProtectionDomain;

<b>public class</b> ClassLoadingLoggingAgent {

    <b>public static void</b> premain(String agentArgument, Instrumentation instrumentation){
        System.out.println(<span style="color: blue;">"Hello, Agent [ "</span> + agentArgument + <span style="color: blue;">" ]"</span>);
        instrumentation.addTransformer (<b>new</b> ClassFileTransformer() {
            @Override
            <b>public byte</b>[] transform(
                    ClassLoader loader,
                    String className,
                    Class<?> classBeingRedefined,
                    ProtectionDomain protectionDomain,
                    <b>byte</b>[] classfileBuffer) <b>throws</b> IllegalClassFormatException {
                <span style="color: green;">// Transform is called just before class loading occurs :-)</span>
                System.out.println(<span style="color: blue;">"Class being loaded : "</span> + className);
                <span style="color: green;">// No transformation ...</span>
                <b>return</b> classfileBuffer;
            }
        });
    }
}
</pre>
<p>
Let's now see how to invoke a simple program using this simple agent.
</p>
<a name="sec325"></a>
<h4>3.2.5 Invoking the Agent</h4>
<p>
When running from the command line, the Java agent can be passed to the JVM instance using the <code>-javaagent</code> argument, which has the following syntax: <code>-javaagent:<path-to-jar>[=options]</code>.
</p>
<p>
A Java agent needs to be packaged in a jar file, and that jar file needs to contain a specific and proper <code>MANIFEST.MF</code> file indicating the class containing the <code>premain</code> method.
</p>
<p>
A proper manifest file for the agent above should be packaged within the jar archive containing the agent classes, under <code>META-INF/MANIFEST.MF</code>, and would be as follows:
</p>
<pre>
Manifest-Version: 1.0
Premain-Class: ch.niceideas.common.agent.ClassLoadingLoggingAgent
</pre>
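<p>
As an illustration, such an agent jar can be assembled with the JDK <code>jar</code> tool. The sketch below assumes the classes were compiled to <code>target/classes</code> - the paths are illustrative only - and includes the whole package directory so that the anonymous inner class generated for the <code>ClassFileTransformer</code> is packaged as well:
</p>
<pre>
badtrash@badbook:~$ <span style="color: DarkBlue;">jar</span> cfm ClassLoadingLoggingAgent.jar MANIFEST.MF \
    -C target/classes ch/niceideas/common/agent
</pre>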
<p>
Now let's imagine we invoke our agent on a simple program defined as follows:
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer;

<b>public class</b> TestMain {

    <b>public static void</b> main (String args[]) {
        System.out.println (<span style="color: blue;">"Program Main"</span>);
    }
}
</pre>
<p>
Sample result on this simple program is as follows:
</p>
<pre>
badtrash@badbook:/data/work/niceideas-commons/target/test-classes$ <span style="color: DarkBlue;">java</span> \
<span style="color: DarkBlue;">-javaagent:</span>/home/badtrash/ClassLoadingLoggingAgent.jar<span style="color: DarkRed;">=007</span> \
ch.niceideas.common.enhancer.TestMain
Hello, Agent [ 007 ]
Class being loaded : java/lang/invoke/MethodHandleImpl
Class being loaded : java/lang/invoke/MethodHandleImpl$1
Class being loaded : java/lang/invoke/MethodHandleImpl$2
Class being loaded : java/util/function/Function
Class being loaded : java/lang/invoke/MethodHandleImpl$3
Class being loaded : java/lang/invoke/MethodHandleImpl$4
Class being loaded : java/lang/ClassValue
Class being loaded : java/lang/ClassValue$Entry
Class being loaded : java/lang/ClassValue$Identity
Class being loaded : java/lang/ClassValue$Version
Class being loaded : java/lang/invoke/MemberName$Factory
Class being loaded : java/lang/invoke/MethodHandleStatics
Class being loaded : java/lang/invoke/MethodHandleStatics$1
Class being loaded : sun/launcher/LauncherHelper
Class being loaded : java/util/concurrent/ConcurrentHashMap$ForwardingNode
Class being loaded : sun/misc/URLClassPath$FileLoader$1
Class being loaded : java/lang/Package
Class being loaded : java/io/FileInputStream$1
Class being loaded : ch/niceideas/common/enhancer/TestMain
Class being loaded : sun/launcher/LauncherHelper$FXHelper
Class being loaded : java/lang/Class$MethodArray
Class being loaded : java/lang/Void
Program Main
Class being loaded : java/lang/Shutdown
Class being loaded : java/lang/Shutdown$Lock
</pre>
<a name="sec326"></a>
<h4>3.2.6 Workaround</h4>
<p>
As a sidenote, and to conclude this section, let's just mention that using a Java agent to inject behaviour at runtime using bytecode manipulation is not always a requirement; it ultimately depends on the use case.
<br>
A pretty common approach favored over Java agents is the subclassing approach. It consists of defining a new class as a subclass of the class to be enhanced and injecting the new behaviour into that subclass instead.
<br>
This is a pretty straightforward approach and it removes the need for a Java agent, since we don't care whether or not the initial class has already been loaded. Since we define a new class - the subclass - we're good to go no matter what happens with the initial class.
</p>
<p>
I have given an example of this approach in my previous article as <a href="https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for#sec41">described here</a>.
</p>
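<p>
To recall the idea, here is a minimal sketch of the subclassing approach using Javassist - all class and method names below are illustrative, not taken from BCG or the previous article:
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer;

<b>import</b> javassist.ClassPool;
<b>import</b> javassist.CtClass;
<b>import</b> javassist.CtNewMethod;

<b>public class</b> SubclassingSketch {

    <b>public static</b> Object enhance() <b>throws</b> Exception {
        ClassPool pool = ClassPool.getDefault();

        <span style="color: green;">// Define a new class as a subclass of the (illustrative) class to be enhanced</span>
        CtClass sub = pool.makeClass(<span style="color: blue;">"ch.niceideas.common.enhancer.EnhancedService"</span>,
                pool.get(<span style="color: blue;">"ch.niceideas.common.enhancer.Service"</span>));

        <span style="color: green;">// Inject the new behaviour by overriding the original method</span>
        sub.addMethod(CtNewMethod.make(
                <span style="color: blue;">"public void process() { System.out.println(\"enhanced\"); super.process(); }"</span>, sub));

        <span style="color: green;">// Instantiate the subclass instead of the original class</span>
        <b>return</b> sub.toClass().getDeclaredConstructor().newInstance();
    }
}
</pre>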
<p>
In the case of boilerplate code generation such as done by Lombok, using an agent is pretty much the only way. The Value Objects or Java Beans enhanced this way can well be used later by the running program for ORM or IoC concerns. These other frameworks, such as Hibernate or Spring, very often use the subclassing approach to inject their own behaviour.
<br>
If a programmer attempts to use the subclass trick to inject his own behaviour and then uses his class with Hibernate for instance, his changes will likely be ignored by Hibernate or Spring, which will generate their own subclass (using CGLIB or Javassist) that can conflict with the developer's own subclass. In this case, enhancing the class itself is a way simpler approach.
</p>
<p>
Finally, using a Java agent is a convenient way to avoid situations where the developer attempts to enhance a class that has already been loaded by the classloader. But that is not necessarily always required, and the developer can well choose to implement his own control over the application lifecycle to ensure he has a chance to modify classes before the application attempts to load them.
<br>
But that is impossible when using annotations of course, and hence frameworks such as Lombok that use annotations extensively have no other choice than using a Java agent.
</p>
<a name="sec4"></a>
<h2>4. BCG: a simple approach for generating boilerplate code using Javassist</h2>
<p>
Now back on the Javassist topic.
</p>
<p>
The purpose of this article is to give a second example of a sound Javassist use case: the generation of boilerplate code using bytecode manipulation, just as project Lombok does.
<br>
In fact, I will present here the BCG tool, which mimics Lombok and re-implements two features of the Lombok feature set.
</p>
<p>
I am presenting here the few dozen lines of code of the BCG tool - BCG for <i>Boilerplate Code Generator</i>.
<br>
BCG is a simple tool that uses Javassist and implements a Java agent to provide two key Lombok features:
</p>
<ol>
<li><code>toString()</code> method generation</li>
<li>property getters and setters generation</li>
</ol>
<p>
Note that BCG is not a production tool or anything like it; it is really just a Javassist example, intended to demonstrate how straightforward, simple and efficient it would be to re-implement Lombok features using Javassist ... should one want to do that, which is not likely since Lombok works so well and is so easily extendable.
</p>
<p>
As mentioned above, we will only be mimicking project Lombok here using bytecode manipulation. We are not implementing these features the same way Lombok does: Lombok works at compile-time using AST transformation, while we will be working at runtime using bytecode manipulation.
</p>
<a name="sec41"></a>
<h3>4.1 Principle </h3>
<p>
We want to be able to implement <i>transformers</i> that take care of performing one specific modification to target classes and that are activated by the presence of one specific annotation on these classes.
</p>
<p>
The key idea is to implement a <i>Java Agent</i> that analyzes each and every class just before it is loaded by the classloader and verifies whether this class needs to be transformed.
<br>
We want to implement <i>Transformers</i> that recognize classes declaring a specific annotation and proceed with the transformation of these classes.
<br>
We want the system to be easily extendable with new transformers.
</p>
<a name="sec42"></a>
<h3>4.2 Design </h3>
<p>
The design of BCG is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/928b7d05-c842-400b-a285-297b795c5e4b">
<img class="centered" style="width: 600px;" alt="Boilerplate code generator design" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/928b7d05-c842-400b-a285-297b795c5e4b" />
</a>
</div>
<br>
<p>
Principal components are as follows:
</p>
<ul>
<li>
<b><code>EnhancerAgent</code></b> : This is a JVM Agent that implements class transformation based on recognized annotations.
<br>
The EnhancerAgent is called before the application starts and registers a <code>java.lang.instrument.ClassFileTransformer</code> that enhances classes declaring specific annotations before they are loaded by classloader(s).
<br>
The <code>java.lang.instrument.ClassFileTransformer</code> here is a simple anonymous adapter.
</li>
<li>
<b><code>ClassTransformer</code></b> : This is an interface implemented by actual Class Transformers. A Class Transformer transforms Java classes declaring the recognized
annotation(s) using Javassist.
</li>
<li>
<b><code>AbstractTransformer</code></b> : This is the base class for all Transformers. It provides commodity routines for ClassTransformers and simplifies registration API.
</li>
</ul>
<p>
Then all actual transformers extend <code>AbstractTransformer</code> and simply declare the annotation they recognize.
</p>
<a name="sec43"></a>
<h3>4.3 Implementation</h3>
<p>
The source code of all classes and interfaces from the design above is given below.
</p>
<p>
(In all snippets of code from now on, <code><span style="color: DarkRed">I will be coloring relevant Javassist API calls in dark red</span></code>)
</p>
<a name="sec431"></a>
<h4>4.3.1 The code of the Agent</h4>
<p>
<b>[Class <code>EnhancerAgent</code>]</b>
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer;

<b>import</b> ch.niceideas.common.enhancer.impls.CountInstanceTransformer;
<b>import</b> ch.niceideas.common.enhancer.impls.DataTransformer;
<b>import</b> ch.niceideas.common.enhancer.impls.ToStringTransformer;
<b>import</b> <span style="color: DarkRed">javassist.CannotCompileException</span>;
<b>import</b> <span style="color: DarkRed">javassist.ClassPool</span>;
<b>import</b> <span style="color: DarkRed">javassist.CtClass</span>;
<b>import</b> java.io.IOException;
<b>import</b> java.lang.instrument.ClassFileTransformer;
<b>import</b> java.lang.instrument.IllegalClassFormatException;
<b>import</b> java.lang.instrument.Instrumentation;
<b>import</b> java.security.ProtectionDomain;

<span style="color: green;">/**
 * This is a JVM Agent that implements class transformation from recognized
 * annotations.
 * <p />
 *
 * The EnhancerAgent is called before the application starts and registers a
 * <code>java.lang.instrument.ClassFileTransformer</code> that enhances
 * classes declaring specific annotations before they are loaded
 * by classloader(s).
 * <p />
 *
 * The <code>java.lang.instrument.ClassFileTransformer</code> here is a simple
 * anonymous adapter.
 */</span>
<b>public class</b> EnhancerAgent {

    <b>private static</b> ClassTransformer[] transformers = <b>null</b>;

    <span style="color: green;">// for now I don't have any better way than declaring all transformers here</span>
    <b>static</b> {
        transformers = <b>new</b> ClassTransformer[] {
            <b>new</b> CountInstanceTransformer(),
            <b>new</b> ToStringTransformer(),
            <b>new</b> DataTransformer()
        };
    }

    <span style="color: green;">// Java Agent API</span>
    <b>public static void</b> premain(String agentArgs, Instrumentation inst) {
        agentmain (agentArgs, inst);
    }

    <span style="color: green;">// API used when agent invoked after JVM Startup</span>
    <b>public static void</b> agentmain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(<b>new</b> ClassFileTransformer() {
            @Override
            <b>public byte</b>[] transform (
                    ClassLoader loader,
                    String className,
                    Class<?> classBeingRedefined,
                    ProtectionDomain protectionDomain,
                    <b>byte</b>[] classfileBuffer)
                    <b>throws</b> IllegalClassFormatException {
                <span style="color: green;">// Can return null if no transformation is performed</span>
                <b>byte</b>[] transformedClass = <b>null</b>;
                <span style="color: DarkRed">CtClass</span> cl = <b>null</b>;
                <span style="color: DarkRed">ClassPool</span> pool = <span style="color: DarkRed">ClassPool.getDefault</span>();
                <b>try</b> {
                    cl = pool.<span style="color: DarkRed">makeClass</span>(<b>new</b> java.io.ByteArrayInputStream(classfileBuffer));
                    <b>for</b> (ClassTransformer transformer : transformers) {
                        <b>if</b> (transformer.accepts (cl)) {
                            transformer.transform (cl);
                            System.out.println (<span style="color: blue;">"Transformed class "</span> + cl.getName()
                                    + <span style="color: blue;">" with "</span> + transformer.getClass().getSimpleName());
                        }
                    }
                    <span style="color: green;">// Generate changed bytecode</span>
                    transformedClass = cl.<span style="color: DarkRed">toBytecode</span>();
                } <b>catch</b> (IOException | <span style="color: DarkRed">CannotCompileException</span> e) {
                    e.printStackTrace();
                } <b>finally</b> {
                    <b>if</b> (cl != <b>null</b>) {
                        cl.<span style="color: DarkRed">detach</span>();
                    }
                }
                <b>return</b> transformedClass;
            }
        });
    }
}
</pre>
<a name="sec432"></a>
<h4>4.3.2 Interface ClassTransformer</h4>
<p>
Transformers implement this interface:
</p>
<p>
<b>[Class <code>ClassTransformer</code>]</b>
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer;

<b>import</b> <span style="color: DarkRed">javassist.CtClass</span>;

<span style="color: green;">/**
 * A class Transformer transforms Java classes declaring the recognized
 * annotation(s) using Javassist.
 */</span>
<b>public interface</b> ClassTransformer {

    <span style="color: green;">/**
     * Used by the EnhancerAgent to know whether the given class declares the supported annotation
     *
     * @param cl the class to test
     * @return true if the passed class is accepted
     */</span>
    <b>boolean</b> accepts(<span style="color: DarkRed">CtClass</span> cl);

    <span style="color: green;">/**
     * Proceed with the transformation of the javassist loaded class given as argument
     *
     * @param cl the javassist loaded class to be transformed
     */</span>
    <b>void</b> transform(<span style="color: DarkRed">CtClass</span> cl);
}
</pre>
<a name="sec433"></a>
<h4>4.3.3 Common Abstraction</h4>
<p>
And an abstract class provides some commodity routines to transformers:
</p>
<p>
<b>[Class <code>AbstractTransformer</code>]</b>
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.impls;

<b>import</b> ch.niceideas.common.enhancer.ClassTransformer;
<b>import</b> <span style="color: DarkRed">javassist.CtClass</span>;
<b>import</b> java.lang.annotation.Annotation;

<span style="color: green;">/**
 * Base class for ClassTransformers.
 * <br />
 * Provides commodity routines for ClassTransformers and simplifies the registration API.
 */</span>
<b>public abstract class</b> AbstractTransformer <b>implements</b> ClassTransformer {

    @Override
    <b>public final boolean</b> accepts(<span style="color: DarkRed">CtClass</span> cl) {
        <b>return</b> cl.<span style="color: DarkRed">hasAnnotation</span>(getAnnotationClass());
    }

    <span style="color: green;">/**
     * Classes that want to be transformed by this transformer need to declare
     * this annotation.
     *
     * @return the type of the annotation accepted by this transformer.
     */</span>
    <b>protected abstract</b> Class<? <b>extends</b> Annotation> getAnnotationClass();

    @Override
    <b>public abstract void</b> transform(<span style="color: DarkRed">CtClass</span> cl);
}
</pre>
<a name="sec434"></a>
<h4>4.3.4 The set of Class Transformers</h4>
<p>
A first Class Transformer: it outputs the count of instances of classes declaring the <code>@CountInstance</code> annotation.
</p>
<p>
<b>[Class <code>CountInstanceTransformer</code>]</b>
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.impls;

<b>import</b> ch.niceideas.common.enhancer.ClassTransformer;
<b>import</b> ch.niceideas.common.enhancer.annotations.CountInstance;
<b>import</b> <span style="color: DarkRed">javassist.CannotCompileException</span>;
<b>import</b> <span style="color: DarkRed">javassist.CtBehavior</span>;
<b>import</b> <span style="color: DarkRed">javassist.CtClass</span>;
<b>import</b> <span style="color: DarkRed">javassist.CtField</span>;
<b>import</b> java.lang.annotation.Annotation;

<span style="color: green;">/**
 * This transformer accepts classes declaring the "@CountInstance" annotation.
 * <br>
 * It enhances the class with an instance counter and outputs the value of this
 * counter every time an instance is built.
 */</span>
<b>public class</b> CountInstanceTransformer <b>extends</b> AbstractTransformer
        <b>implements</b> ClassTransformer {

    @Override
    <b>protected</b> Class<? <b>extends</b> Annotation> getAnnotationClass() {
        <b>return</b> CountInstance.<b>class</b>;
    }

    @Override
    <b>public void</b> transform(<span style="color: DarkRed">CtClass</span> cl) {
        <b>try</b> {
            <b>if</b> (!cl.<span style="color: DarkRed">isInterface</span>()) {

                <span style="color: green;">// Add a static field in the class</span>
                <span style="color: DarkRed">CtField</span> field = <span style="color: DarkRed">CtField.make</span>(<span style="color: blue;">"private static long _instanceCount;"</span>, cl);
                cl.<span style="color: DarkRed">addField</span>(field);

                <span style="color: DarkRed">CtBehavior</span>[] constructors = cl.<span style="color: DarkRed">getDeclaredConstructors</span>();
                <b>for</b> (<b>int</b> i = 0; i < constructors.length; i++) {
                    <span style="color: green;">// Increment counter and output it</span>
                    constructors[i].<span style="color: DarkRed">insertAfter</span>(<span style="color: blue;">"_instanceCount++;"</span>);
                    constructors[i].<span style="color: DarkRed">insertAfter</span>(<span style="color: blue;">"System.out.println(\""</span>
                            + cl.<span style="color: DarkRed">getName</span>() + <span style="color: blue;">" : \" + _instanceCount);"</span>);
                }
            }
        } <b>catch</b> (<span style="color: DarkRed">CannotCompileException</span> e) {
            e.printStackTrace();
            <b>throw new</b> RuntimeException (e);
        }
    }
}
</pre>
<p>
The <code>CountInstanceTransformer</code> accepts classes declaring the <code>CountInstance</code> annotation:
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.annotations;

<span style="color: green;">/**
 * Classes declaring this annotation will have an instance counter whose value is
 * output every time an instance is constructed
 */</span>
<b>public @interface</b> CountInstance {
}
</pre>
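<p>
For illustration, a class declaring this annotation - the <code>Customer</code> class below is a made-up example, not part of BCG - would report its instance count on the console every time it is constructed:
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.testData;

<b>import</b> ch.niceideas.common.enhancer.annotations.CountInstance;

@CountInstance
<b>public class</b> Customer {

    <b>public static void</b> main (String[] args) {
        <span style="color: green;">// With the EnhancerAgent active, every construction outputs the counter:</span>
        <b>new</b> Customer(); <span style="color: green;">// ch.niceideas.common.enhancer.testData.Customer : 1</span>
        <b>new</b> Customer(); <span style="color: green;">// ch.niceideas.common.enhancer.testData.Customer : 2</span>
    }
}
</pre>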
<p>
Second Class Transformer: generates getters and setters for classes declaring the <code>@Data</code> annotation.
</p>
<p>
<b>[Class <code>DataTransformer</code>]</b>
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.impls;

<b>import</b> ch.niceideas.common.enhancer.ClassTransformer;
<b>import</b> ch.niceideas.common.enhancer.annotations.Data;
<b>import</b> <span style="color: DarkRed">javassist.*</span>;
<b>import</b> java.lang.annotation.Annotation;

<span style="color: green;">/**
 * This transformer accepts classes declaring the "@Data" annotation.
 * <br />
 * It generates getters and setters dynamically for every field of the class
 * if they do not already exist.
 */</span>
<b>public class</b> DataTransformer <b>extends</b> AbstractTransformer <b>implements</b> ClassTransformer {

    @Override
    <b>protected</b> Class<? <b>extends</b> Annotation> getAnnotationClass() {
        <b>return</b> Data.<b>class</b>;
    }

    @Override
    <b>public void</b> transform(<span style="color: DarkRed">CtClass</span> cl) {
        <b>try</b> {
            <b>if</b> (!cl.<span style="color: DarkRed">isInterface</span>()) {
                <b>for</b> (<span style="color: DarkRed">CtField</span> field : cl.<span style="color: DarkRed">getDeclaredFields</span>()) {

                    String camelCaseField = field.getName().substring(0, 1).toUpperCase()
                            + field.getName().substring(1);

                    <b>if</b> (!hasMethod(<span style="color: blue;">"get"</span> + camelCaseField, cl)) {
                        cl.<span style="color: DarkRed">addMethod</span>(<span style="color: DarkRed">CtNewMethod.getter</span>(<span style="color: blue;">"get"</span> + camelCaseField, field));
                    }
                    <b>if</b> (!hasMethod(<span style="color: blue;">"set"</span> + camelCaseField, cl)) {
                        cl.<span style="color: DarkRed">addMethod</span>(<span style="color: DarkRed">CtNewMethod.setter</span>(<span style="color: blue;">"set"</span> + camelCaseField, field));
                    }
                }
            }
        } <b>catch</b> (<span style="color: DarkRed">CannotCompileException</span> e) {
            e.printStackTrace();
            <b>throw new</b> RuntimeException(e);
        }
    }

    <span style="color: green;">/** javassist has unfortunately no hasMethod API */</span>
    <b>private static boolean</b> hasMethod (String methodName, <span style="color: DarkRed">CtClass</span> cl) {
        <b>try</b> {
            cl.<span style="color: DarkRed">getDeclaredMethod</span>(methodName);
            <b>return true</b>;
        } <b>catch</b> (<span style="color: DarkRed">NotFoundException</span> e) {
            <b>return false</b>;
        }
    }
}
</pre>
<p>
The <code>DataTransformer</code> accepts classes declaring the <code>Data</code> annotation:
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.annotations;

<span style="color: green;">/**
 * Classes declaring this annotation will have getters and setters automatically
 * generated
 */</span>
<b>public @interface</b> Data {
}
</pre>
<p>
Third Class Transformer : generates the <code>toString</code> method for classes declaring the <code>@ToString</code> annotation.
</p>
<p>
<b>[Class <code>ToStringTransformer</code>]</b>
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.impls;

<b>import</b> ch.niceideas.common.enhancer.ClassTransformer;
<b>import</b> ch.niceideas.common.enhancer.annotations.ToString;
<b>import</b> <span style="color: DarkRed">javassist.*</span>;
<b>import</b> java.lang.annotation.Annotation;

<span style="color: green;">/**
 * This transformer accepts classes declaring the "@ToString" annotation.
 * <br />
 * It generates a toString method dynamically. The toString method is generated using
 * bytecode manipulation and avoids reflection.
 */</span>
<b>public class</b> ToStringTransformer <b>extends</b> AbstractTransformer <b>implements</b> ClassTransformer {

    @Override
    <b>protected</b> Class<? <b>extends</b> Annotation> getAnnotationClass() {
        <b>return</b> ToString.<b>class</b>;
    }

    @Override
    <b>public void</b> transform(<span style="color: DarkRed">CtClass</span> cl) {
        <b>try</b> {
            <b>if</b> (!cl.<span style="color: DarkRed">isInterface</span>()) {

                StringBuilder bb = <b>new</b> StringBuilder(<span style="color: blue;">"{\n"</span>);
                bb.append(<span style="color: blue;">"    StringBuilder sb = new StringBuilder(\""</span>
                        + cl.getName() + <span style="color: blue;">"\");\n"</span>);
                bb.append(<span style="color: blue;">"    sb.append(\"[\");\n"</span>);
                <b>for</b> (<span style="color: DarkRed">CtField</span> field : cl.<span style="color: DarkRed">getDeclaredFields</span>()) {
                    field.<span style="color: DarkRed">setModifiers</span>(Modifier.PUBLIC); <span style="color: green;">// hacky hack</span>
                    bb.append(<span style="color: blue;">"    sb.append(\""</span> + field.<span style="color: DarkRed">getName</span>() + <span style="color: blue;">"\");\n"</span>);
                    bb.append(<span style="color: blue;">"    sb.append(\"=\");\n"</span>);
                    bb.append(<span style="color: blue;">"    sb.append(this."</span> + field.<span style="color: DarkRed">getName</span>() + <span style="color: blue;">");\n"</span>);
                    bb.append(<span style="color: blue;">"    sb.append(\" \");\n"</span>);
                }
                bb.append(<span style="color: blue;">"    sb.append(\"]\");\n"</span>);
                bb.append(<span style="color: blue;">"    return sb.toString();\n"</span>);
                bb.append(<span style="color: blue;">"}"</span>);

                <b>try</b> {
                    <span style="color: DarkRed">CtMethod</span> toStringMethod = cl.<span style="color: DarkRed">getDeclaredMethod</span>(<span style="color: blue;">"toString"</span>);
                    toStringMethod.<span style="color: DarkRed">setBody</span>(bb.toString());
                } <b>catch</b> (<span style="color: DarkRed">NotFoundException</span> e) {
                    <span style="color: DarkRed">CtMethod</span> newMethod = <span style="color: DarkRed">CtNewMethod.make</span>(<span style="color: blue;">"public String toString() \n"</span>
                            + bb.toString(), cl);
                    cl.<span style="color: DarkRed">addMethod</span>(newMethod);
                }
            }
        } <b>catch</b> (<span style="color: DarkRed">CannotCompileException</span> e) {
            e.printStackTrace();
            <b>throw new</b> RuntimeException (e);
        }
    }
}
</pre>
<p>
The <code>ToStringTransformer</code> accepts classes declaring the <code>ToString</code> annotation:
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.annotations;

<span style="color: green;">/**
 * Classes declaring this annotation will have a toString method generated
 * automagically
 */</span>
<b>public @interface</b> ToString {
}
</pre>
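<p>
For illustration again, on a made-up class such as the following (the <code>Point</code> class is mine, not part of BCG), the generated <code>toString()</code> concatenates the class name and every field, one after the other:
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.testData;

<b>import</b> ch.niceideas.common.enhancer.annotations.ToString;

@ToString
<b>public class</b> Point {

    <b>private int</b> x = 3;
    <b>private int</b> y = 5;

    <b>public static void</b> main (String[] args) {
        <span style="color: green;">// With the EnhancerAgent active, this outputs something like:</span>
        <span style="color: green;">// ch.niceideas.common.enhancer.testData.Point[x=3 y=5 ]</span>
        System.out.println(<b>new</b> Point());
    }
}
</pre>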
<a name="sec435"></a>
<h4>4.3.5 Test Class Example</h4>
<p>
Here is a test class for the <code>DataTransformer</code>, for instance, using a nice library to test agents: Electronic Arts' AgentLoader.
</p>
<p>
EA Agent Loader is a collection of utilities for Java agent developers. It allows programmers to write and test their Java agents using dynamic agent loading (without using the <code>-javaagent</code> JVM parameter).
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer;

<b>import</b> ch.niceideas.common.enhancer.testData.TestData;
<b>import</b> com.ea.agentloader.AgentLoader;
<b>import</b> junit.framework.TestCase;
<b>import</b> org.apache.log4j.Logger;
<b>import</b> org.junit.Before;
<b>import</b> org.junit.Test;
<b>import</b> java.lang.reflect.Method;

<b>public class</b> DataTest <b>extends</b> TestCase {

    <b>private static final</b> Logger logger = Logger.getLogger(DataTest.<b>class</b>);

    @Before
    <b>public void</b> setUp() <b>throws</b> Exception {
        AgentLoader.loadAgentClass(EnhancerAgent.<b>class</b>.getName(), <span style="color: blue;">""</span>);
    }

    @Test
    <b>public void</b> testDataTransformer() <b>throws</b> Exception {

        TestData testData = <b>new</b> TestData();

        Method getI = TestData.<b>class</b>.getDeclaredMethod(<span style="color: blue;">"getI"</span>);
        assertEquals (0, getI.invoke(testData));

        Method getMyString = TestData.<b>class</b>.getDeclaredMethod(<span style="color: blue;">"getMyString"</span>);
        assertEquals (<span style="color: blue;">"abc"</span>, getMyString.invoke(testData));

        Method getValue = TestData.<b>class</b>.getDeclaredMethod(<span style="color: blue;">"getValue"</span>);
        assertEquals (-1L, getValue.invoke(testData));

        Method setI = TestData.<b>class</b>.getDeclaredMethod(<span style="color: blue;">"setI"</span>, <b>int</b>.<b>class</b>);
        setI.invoke(testData, 9);
        assertEquals (9, getI.invoke(testData));

        Method setMyString = TestData.<b>class</b>.getDeclaredMethod(<span style="color: blue;">"setMyString"</span>, String.<b>class</b>);
        setMyString.invoke(testData, <span style="color: blue;">"xyz"</span>);
        assertEquals (<span style="color: blue;">"xyz"</span>, getMyString.invoke(testData));

        Method setValue = TestData.<b>class</b>.getDeclaredMethod(<span style="color: blue;">"setValue"</span>, <b>long</b>.<b>class</b>);
        setValue.invoke(testData, 999L);
        assertEquals (999L, getValue.invoke(testData));
    }
}
</pre>
<p>
This test case uses the following test data:
</p>
<pre>
<b>package</b> ch.niceideas.common.enhancer.testData;

<b>import</b> ch.niceideas.common.enhancer.annotations.Data;

<span style="color: green;">/**
 * A test data class for the DataTest test case
 */</span>
@Data
<b>public class</b> TestData {

    <b>private int</b> i = 0;
    <b>private</b> String myString = <span style="color: blue;">"abc"</span>;
    <b>private long</b> value = -1;
}
</pre>
<a name="sec5"></a>
<h2>5. Conclusion </h2>
<p>
Again: bytecode manipulation opens a whole lot of new possibilities on the JVM and is key to addressing one of the biggest weaknesses of the JVM platform: the overwhelming verbosity of the Java language. The approaches and techniques presented above are extensively used by the many libraries and frameworks that have become de facto standards nowadays: AspectJ, Spring, Hibernate, JProfiler, etc.
<br>
For that reason, and because it's really a lot of fun, one might find bytecode manipulation a pretty valuable mechanism to master.
</p>
<p>
In addition, even though Lombok uses for very good reasons a different technique (AST Transformation), I find it astonishing to see how bytecode manipulation enables a developer to mimic its features in so few lines of code.
</p>
<p>
One can use bytecode manipulation to perform many tasks that would be difficult or impossible to do otherwise, and once one learns it, the sky is the limit.
</p>
<p>
Javassist puts this power in the hands of every Java developer in a simple, intuitive and efficient way.
</p>
<p>
You might want to have a look at the first article in this series, available here: <a href="https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for">Bytecode manipulation with Javassist for fun and profit part I: Implementing a lightweight IoC container in 300 lines of code</a>.
</p>
<p>
Part of this article is available as a slideshare presentation here: <a href="https://www.slideshare.net/JrmeKehrli/bytecode-manipulation-with-javassist-for-fun-and-profit">https://www.slideshare.net/JrmeKehrli/bytecode-manipulation-with-javassist-for-fun-and-profit</a>.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/the-digitalization-challenge-and-opportunities
The Digitalization - Challenge and opportunities for financial institutions
Jerome Kehrli
2017-03-21T16:52:55-04:00
2017-09-13T16:42:30-04:00
<p>
A few weeks ago, I gave a speech about <i>the Digitalization</i> and its impact on financial institutions, in terms of both <i>challenges</i> and <i>opportunities</i>, in the context of my role as Head of R&D in my current company.
<br>
I am reporting my speech here as an article.
</p>
<p>
Even though the Digitalization and its impacts are so widely discussed and studied nowadays, even in banking institutions, I still find it puzzling that so many of them struggle to follow the pace.
<br>
Having said that, many others on the other hand have understood very well how much technology is about to disrupt the banking business, just as Uber has disrupted the transportation business and AirBnB the lodging business, and many good and enlightening initiatives are starting to flourish in the news.
</p>
<p>
But still, it seems to me that most innovations in banking really come from small players or even startups - think of fintechs - instead of coming from the big players of the banking industry. For instance, paying everything with a cellphone has been a thing for a few years now in many African countries while it's not at all in Europe, even in Switzerland, THE country of banking.
<br>
Especially in Switzerland, financial institutions struggle to keep up with the evolution of their business, driven by the digitalization on one side and the regulatory pressure as well as shrinking margins on the other side.
<br>
Discussing this very matter further exceeds the scope of this article of course, but I want to report below my speech notes and present what I see as the most important challenges and opportunities for the banking industry coming from the digitalization.
</p>
<p>
Part of this article is available as a slideshare presentation here: <a href="https://www.slideshare.net/JrmeKehrli/digitalization-a-challenge-and-an-opportunity-for-banks">https://www.slideshare.net/JrmeKehrli/digitalization-a-challenge-and-an-opportunity-for-banks</a>.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a>
<ul>
<li><a href="#sec11">1.1 An evolving society</a></li>
<li><a href="#sec12">1.2 Some Facts</a></li>
<li><a href="#sec13">1.3 More Facts</a></li>
<li><a href="#sec14">1.4 Today</a></li>
<li><a href="#sec15">1.5 Tomorrow</a></li>
<li><a href="#sec16">1.6 Big Data</a></li>
<li><a href="#sec17">1.7 Digitalization</a></li>
</ul>
</li>
<li><a href="#sec2">2. Financial Services ARE impacted !</a>
<ul>
<li><a href="#sec21">2.1 The rise of Mobile Banking</a></li>
<li><a href="#sec22">2.2 The rise of Alternative Financing</a></li>
<li><a href="#sec23">2.3 A whole new world for banking</a></li>
<li><a href="#sec24">2.4 Adapt ... or Vanish</a></li>
</ul>
</li>
<li><a href="#sec3">3. Some definitions</a>
<ul>
<li><a href="#sec31">3.1 Digitalization</a></li>
<li><a href="#sec32">3.2 Digital Transformation</a></li>
</ul>
</li>
<li><a href="#sec4">4. Challenges and Opportunities for Banking Institutions</a>
<ul>
<li><a href="#sec41">4.1 Challenges</a></li>
<li><a href="#sec42">4.2 Opportunities</a></li>
</ul>
</li>
<li><a href="#sec5">5. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction</h2>
<a name="sec11"></a>
<h3>1.1 An evolving society</h3>
<p>
Our society is evolving.
</p>
<p>
Yesterday - in 2008 - we were amazed by the first smartphones. Today they have almost become a part of ourselves. We cannot go without them anymore.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/fd28f608-8d61-4a72-a2e9-55ce5e6e2501">
<img class="centered" style="width: 720px; " alt="Kanban States" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/fd28f608-8d61-4a72-a2e9-55ce5e6e2501" />
</a>
</div>
<br>
<p>
Today everybody has in his pocket a computer that is more powerful by several orders of magnitude than the computer that sent the first people to the moon.
<br>
Only 30 years ago, we had computers way less powerful than an iPhone that filled an entire room; today an iPhone fits in our pocket.
</p>
<p>
Is it the biggest invention of the decade? Likely, but the previous decade, not the current one. I'll get back to that.
</p>
<p>
Nowadays, new technologies emerge first in the consumer market and then spread into business. New solutions appear every month and corporations cannot keep up with the pace.
<br>
This new reality has a name: it's the consumerization.
</p>
<p>
The consumerization has a consequence: increasingly, the trend is for employees to come with their own devices and applications. This is the BYOD - "Bring Your Own Device" - trend. It comes from the fact that employees are more comfortable and more efficient with their own devices.
</p>
<p>
The direct consequence of the consumerization is the use by employees of a mix of professional and personal tools (Office Suite, Gmail, Google+, Twitter, Facebook, Dropbox, Evernote, ...)
<br>
Nowadays many companies - mostly financial institutions, I have to say - are still blocking their employees' access to these personal tools.
<br>
Tomorrow, that won't be possible anymore.
</p>
<p>
People are used to being connected all the time, with highly efficient devices and highly responsive services, everywhere and for every possible need.
</p>
<a name="sec12"></a>
<h3>1.2 Some Facts</h3>
<p>
Global sales of PCs never really exploded.
<br>
On the other hand, global sales of smartphones and tablets have exploded.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b82a3de1-3b4d-4111-87f5-218d543ad6e5">
<img class="centered" style="width: 600px;" alt="Global Unit Shipments" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b82a3de1-3b4d-4111-87f5-218d543ad6e5" />
</a>
</div>
<br>
<p>
Global mobile traffic went from 1% in 2009 to 4% in 2010 and 12% in 2012. Today it reaches 40%. In 2020, global mobile traffic will exceed fixed PC Internet traffic.
</p>
<p>
This may be hard to grasp for people in Europe or the US. But look at India for instance: the wired telecommunication infrastructure there could never be developed as it was in Europe or in the US. There, mobile traffic already exceeded desktop traffic in 2012.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a2c7a97c-5f3f-4599-b96f-7d5bcae45b57">
<img class="centered" style="width: 650px; " alt="Global Internet Traffic Flow Ratio" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a2c7a97c-5f3f-4599-b96f-7d5bcae45b57" />
</a>
</div>
<br>
<p>
Global sales of smartphones and tablets have exploded!
</p>
<p>
In 2017, over 3 billion people will be connected all the time, everywhere and for every kind of needs.
</p>
<a name="sec13"></a>
<h3>1.3 More Facts</h3>
<p>
We look at our smartphones 150 times a day.
<br>
We are using our smartphones all the time, even when watching other media.
</p>
<p>
Even when watching TV, we cannot refrain from using a connected device at the same time, either a smartphone or a tablet.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/bd785924-4aaa-40b0-b8ec-76a8b84a4a39">
<img class="centered" style="width: 550px; " alt="Tablets and Smartphone while watching TV" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/bd785924-4aaa-40b0-b8ec-76a8b84a4a39" />
</a>
</div>
<br>
<p>
As a funny note, men and women use their smartphone or tablet while watching TV for largely similar reasons.
<br>
But there are 2 exceptions:
</p>
<ul>
<li>Looking at sport results on an iPad while watching TV seems to be rather a man thing.</li>
<li>On the other hand, looking at Facebook feeds while watching TV seems to be rather a woman thing.</li>
</ul>
<a name="sec14"></a>
<h3>1.4 Today</h3>
<p>
I cannot stress enough how important what we have been experiencing for a few years is, and what it means in terms of societal change.
</p>
<div class="centering">
<a style="border: 0px none;" href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2698bfab-bf8c-4887-9f0b-8d57f8a57c9f">
<img class="centered" style="width: 550px; border: 0px none;" alt="Today" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2698bfab-bf8c-4887-9f0b-8d57f8a57c9f" />
</a>
</div>
<br>
<p>
Today, <b>we are inter-connected on different kinds of media, continuously and for every possible need</b>. This has become a part of human behaviour. In a few years (OK, maybe a little more) the majority of the workforce will be composed of millennials, of people almost born with an iPhone.
<br>
At that moment, exchanging data on the Internet all the time and for every possible need will seem as natural to people as breathing.
</p>
<p>
But this is today ...
</p>
<a name="sec15"></a>
<h3>1.5 Tomorrow</h3>
<p>
Tomorrow there will be dozens of billions of additional sources, in the form of smart devices connected on internet and exchanging data.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/51fd431e-f4a4-4212-9957-546776d0e3f5">
<img class="centered" style="width: 550px;" alt="Tomorrow" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/51fd431e-f4a4-4212-9957-546776d0e3f5" />
</a>
</div>
<br>
<p>
The <i>Internet of Things</i> - or IoT - refers to "<i>uniquely identifiable objects and their interconnection on internet, as well as their automatic exchange of information with third party services</i>".
</p>
<p>
Dr. Henrik Christensen, Professor of Computer Science and the Chair of Robotics at the Georgia Institute of Technology, said, not long ago: <i>"My current prediction is that kids born today will never have to drive a car"</i>.
</p>
<p>
There are 3 billion people connected in 2017 and exchanging data on Internet.
<br>
Gartner thinks there will be 26 billion devices on the Internet of Things by 2020.
<br>
ABI Research is even more optimistic and claims that 30 billion devices will be wirelessly connected to the Internet of Things by 2020.
</p>
<div class="centering">
<div class="centered">
<b>
The internet of things is the coming big thing!
</b>
</div>
</div>
<br>
<p>
The "<i>internet of people</i>" and the "<i>Internet of Things</i>" form the "<i>Internet of Everything</i>".
</p>
<p>
Cisco defines the Internet of Everything (IoE) as follows: <i>"The Internet of Everything brings together people, process, data, and things to make networked connections more relevant and valuable than ever before - turning information into actions that create new capabilities, richer experiences, and unprecedented economic opportunity."</i>
</p>
<p>
The Internet of Everything is the coming evolution from the interconnection of people and objects, always, all the time, everywhere and for every possible need.
</p>
<a name="sec16"></a>
<h3>1.6 Big Data</h3>
<p>
From the time we started estimating and measuring the amount of data produced until 2003, 5 exabytes (5 billion gigabytes) of data had been produced.
<br>
In 2011, this quantity was generated in 2 days (think of Facebook, Twitter, Google search logs, financial transaction logs, etc.).
<br>
In 2014, this quantity was generated in 10 minutes.
<br>
Today it is likely generated in a few minutes.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/882990de-6894-43d4-9f0e-ef2f97ff3742">
<img class="centered" style="width: 700px;" alt="In 60 seconds" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/882990de-6894-43d4-9f0e-ef2f97ff3742" />
</a>
</div>
<br>
<p>
Not only do we generate more and more data but today, thanks to <b>Big Data technologies</b>, we have the means and the technology to analyze, exploit and mine this data to extract meaningful business insights.
</p>
<p>
The data generated by the company's own systems, such as logs, user and customer activity trails, etc. can be a very interesting source of information regarding customer behaviours, profiles, trends, desires, etc.
<br>
But <i>Big Data</i> becomes really relevant only when one also considers the <i>data external to the corporation</i>, such as Facebook status feeds, Twitter logs, LinkedIn news, etc.
</p>
<p>
Today, a whole lot of additional sources are available to corporations to gather business insights related to market trends.
</p>
<a name="sec17"></a>
<h3>1.7 Digitalization</h3>
<p>
To summarize, what we are experiencing today:
</p>
<ul>
<li>the digitalization of the masses,</li>
<li>the era of power,</li>
<li>the availability of massive amount of data,</li>
<li>more importantly, the ability to analyze and use this data,</li>
<li>the always and everywhere interconnection of people,</li>
<li>the internet of things and</li>
<li>the coming internet of everything,</li>
</ul>
<p>
must lead corporations to adapt.
</p>
<div class="centering">
<div class="centered" style="width: 350px; border: 1px solid #555555;"><b>
The digitalization urges corporations to
<br>
search for new business models and
<br>
rethink their operating model
</b>
</div>
</div>
<br>
<a name="sec2"></a>
<h2>2. Financial Services ARE impacted !</h2>
<p>
Now when I run this speech in financial institutions, it happens sometimes that I hear comments such as <i>"Yeah, well, all of this makes surely a lot of sense for fancy internet companies. But we're a bank here. We're doing serious business, we're not Facebook."</i>
</p>
<p>
I'm always puzzled by this kind of reaction because in my opinion, serious businesses such as banking institutions are, on the contrary, on the front line when it comes to meeting the digitalization challenges.
<br>
It's no wonder fintechs have become such a thing and are increasingly eating the banking business.
</p>
<p>
Think of millennials, think of these people that are almost born with a tablet or a smartphone in their hands.
<br>
My father used to go to the banking institution that was closest to where he lived and where he worked. That's how he made his choice.
<br>
I myself chose my banking institutions when I was a student. My choice was driven by the conditions that banks were giving to students, such as a free credit card, no additional costs, etc. Then I simply remained loyal to my first choice.
<br>
Millennials' choice will be different. They will choose the bank that provides them with the best online experience. That will be their main driver. For these people born with Twitter, Facebook and all these fancy online services, it will simply seem impossible to have to physically go to a branch to perform whatever operation they need related to their banking account, including its initial opening.
</p>
<p>
Let me give you some examples ...
</p>
<a name="sec21"></a>
<h3>2.1 The rise of Mobile Banking</h3>
<p>
Only twenty years ago, we had to move to a physical location - a branch - to perform any kind of financial transaction, such as simply paying a phone bill.
<br>
What is the situation today?
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2c1ba975-0ad1-4c84-b634-8d50bfc12314">
<img class="centered" style="width: 500px;" alt="Wells Fargo is closing branches" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2c1ba975-0ad1-4c84-b634-8d50bfc12314" />
</a><br>
<div class="centered">
(Source : <a href="http://money.cnn.com/2017/01/13/investing/wells-fargo-branch-closures/">http://money.cnn.com/2017/01/13/investing/wells-fargo-branch-closures/</a>)
</div>
</div>
<br>
<p>
Wells Fargo announced plans earlier this year to close over 400 branches in the US. The tendency of big financial institutions is to close physical branches at an unprecedented pace, as a reflection of people's preferences for online and mobile banking.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/5ae3da06-709c-437d-8e10-738430033a81">
<img class="centered" style="width: 500px;" alt="Mobile Banking is killing the branch" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/5ae3da06-709c-437d-8e10-738430033a81" />
</a><br>
<div class="centered">
(Source : <a href="https://thefinanser.com/2013/06/more-debate-about-the-shift-to-mobile.html/">https://thefinanser.com/2013/06/more-debate-about-the-shift-to-mobile.html/</a>)
</div>
</div>
<br>
<p>
In online banking there is also a clear tendency from 2012 onward: mobile banking usage skyrockets while fixed Internet banking is stagnating or even slightly shrinking.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/95814438-6f2f-449a-a890-025e6d49a3b5">
<img class="centered" style="width: 650px;" alt="Goldman Sachs and End of Traditional Banking" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/95814438-6f2f-449a-a890-025e6d49a3b5" />
</a><br>
<div class="centered">
(Source : <a href="https://www.forbes.com/sites/oliviergarret/2017/02/22/goldman-sachs-recent-move-marks-the-end-of-traditional-banking/#32be5e5069aa">https://www.forbes.com/sites/oliviergarret/2017/02/22/goldman-sachs-recent-move-marks-the-end-of-traditional-banking/#32be5e5069aa</a>)
</div>
</div>
<br>
<p>
On one hand, rising compliance costs and restrictive regulations are the new normal. This forces banks to increase operational efficiency at all costs, which is pretty difficult when regulation-related costs tend to explode.
<br>On the other hand, reduced margins and increased costs are forcing banking institutions to adapt. Growing the investment management business line is a relevant approach of course, but diversifying earnings with new retail banking initiatives aimed at ensuring a first place on the online banking market is mandatory.
</p>
<div class="centering">
<div class="centered">
<b>The digitalization is shaking the fundamentals of the banking business.</b>
</div>
</div>
<br>
<a name="sec22"></a>
<h3>2.2 The rise of Alternative Financing</h3>
<p>
Here as well, twenty years ago there was no fintech, no crowdfunding.
<br>
Today the banking business is increasingly eaten by <i>fancy technology companies</i> (fintechs).
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/0eaabe23-cf77-4390-a961-87c4ae61fbcd">
<img class="centered" style="width: 650px;" alt="Crowdfunding disrupts Traditional Banking" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/0eaabe23-cf77-4390-a961-87c4ae61fbcd" />
</a><br>
<div class="centered">
(Source : <a href="http://www.mergersandinquisitions.com/future-of-investment-banking-2015/">http://www.mergersandinquisitions.com/future-of-investment-banking-2015/</a>)
</div>
</div>
<br>
<p>
Just as technology has disrupted the transportation business with Uber, the lodging business with AirBnB, and the consumer lending business with so many peer-to-peer lending platforms available, technology is about to disrupt investment banking.
<br>
This is happening.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/87f8d9e9-6959-41b2-90e3-3e211fb8894b">
<img class="centered" style="width: 600px;" alt="Alternatives Financing Models Volumes" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/87f8d9e9-6959-41b2-90e3-3e211fb8894b" />
</a><br>
<div class="centered">
(Source : <a href="https://www.crowdfundinsider.com/2015/02/63013-moving-mainstream-centre-for-alternative-finance-of-cambridge-judge-business-school-publishes-european-alternative-finance-benchmarking-report/">https://www.crowdfundinsider.com/2015/02/63013-moving-mainstream-centre-for-alternative-finance-of-cambridge-judge-business-school-publishes-european-alternative-finance-benchmarking-report/</a>)
</div>
</div>
<br>
<p>
Alternative financing models are progressing throughout the business lines. Peer-2-Peer consumer lending, crowdfunding, Peer-2-Peer business lending: all of them have been exploding over the past years.
<br>
This chart shows the situation in Europe but the worldwide situation is pretty similar.
<br>
The important information here is that the volume of peer-to-peer and crowd financing has been doubling every year since 2012.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/0fffce6c-b045-4018-9322-76d1f8d6393a">
<img class="centered" style="width: 750px;" alt="Alternative Financing Landscape" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/0fffce6c-b045-4018-9322-76d1f8d6393a" />
</a><br>
<div class="centered">
(Source : <a href="http://www.kaplancollectionagency.com/resource-center/alternative-financing-landscape/">http://www.kaplancollectionagency.com/resource-center/alternative-financing-landscape/</a>)
</div>
</div>
<br>
<p>
This is the landscape of alternative financing firms and startups. More companies are appearing every month, almost every week.
<br>
If you look at the global fintech landscape, you can multiply the count of companies here by 20.
</p>
<p>
Fintechs are eating the banking business.
<br>
Banking institutions need to adapt, or eventually a lot of them will disappear.
</p>
<a name="sec23"></a>
<h3>2.3 A whole new world for banking</h3>
<p>
Back in 2010, almost nobody had ever heard the word "bitcoin". Only very high-tech or finance specialists were aware of any crypto-currency.
<br>
Today everybody knows bitcoin and most people have heard about blockchain.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/be08346a-0205-42d8-b466-80cd7fde8e0f">
<img class="centered" style="width: 750px;" alt="Bitcoin Transactions" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/be08346a-0205-42d8-b466-80cd7fde8e0f" />
</a><br>
<div class="centered">
(Source : <a href="https://blog.bitpay.com/understanding-bitcoins-growth-in-2015/">https://blog.bitpay.com/understanding-bitcoins-growth-in-2015/</a>)
</div>
</div>
<br>
<p>
While the blockchain technology is not yet ready to completely replace the trusted third party, it has the potential to disrupt the very root of the worldwide financial system.
<br>
Happily, here, financial institutions have understood this from the beginning and a lot of blockchain initiatives nowadays are led by big financial institutions.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/7cb67999-0d3f-4d45-be4f-9f799748d7fb">
<img class="centered" style="width: 500px;" alt="Blockchain disrupt banking" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/7cb67999-0d3f-4d45-be4f-9f799748d7fb" />
</a><br>
<div class="centered">
(Source : <a href="http://www.cxotoday.com/story/blockchain-to-disrupt-banking-industry-by-2020-infosys/">http://www.cxotoday.com/story/blockchain-to-disrupt-banking-industry-by-2020-infosys/</a>)
</div>
</div>
<br>
<p>
I focused a lot on how the digitalization is challenging the banking business.
<br>
But the blockchain technology is a good example that there are opportunities as well.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/483cf503-86d0-46aa-a26c-34096fc734cf">
<img class="centered" style="width: 600px;" alt="Royal Bank of Scotland engages IBM Watson" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/483cf503-86d0-46aa-a26c-34096fc734cf" />
</a><br>
<div class="centered">
(Source : <a href="http://www.prnewswire.com/news-releases/royal-bank-of-scotland-engages-ibm-watson-for-cognitive-insights-to-better-serve-customers-300340461.html">http://www.prnewswire.com/news-releases/royal-bank-of-scotland-engages-ibm-watson-for-cognitive-insights-to-better-serve-customers-300340461.html</a>)
</div>
</div>
<br>
<p>
Another interesting example. Royal Bank of Scotland engaged IBM Watson to take care of the simplest customer requests coming from online channels, as a way to enhance operational efficiency.
<br>
I myself am not necessarily a big fan of IBM or Watson but this is actually a pretty sound use case for Watson.
<br>
And in any case it's a brilliant example of how technology can help banking institutions increase operational efficiency.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d39deab7-6173-4305-898a-5f4b1934ea76">
<img class="centered" style="width: 500px;" alt="Twitter to predict Stock Market" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d39deab7-6173-4305-898a-5f4b1934ea76" />
</a><br>
<div class="centered">
(Source : <a href="http://www.dailymail.co.uk/sciencetech/article-2120416/Twitter-predicts-stock-prices-accurately-investment-tactic-say-scientists.html">http://www.dailymail.co.uk/sciencetech/article-2120416/Twitter-predicts-stock-prices-accurately-investment-tactic-say-scientists.html</a>)
</div>
</div>
<br>
<p>
A team at the University of California has designed a model aimed at predicting stock price evolution by using statistics of tweets, sentiment analysis and relationship discovery algorithms.
<br>
Not only are they able to predict stock price evolution one day in advance, but they came up with a model much more accurate than pretty much every other initiative so far.
</p>
<p>
Technology, here Big Data, also offers unprecedented opportunities for financial institutions.
</p>
<a name="sec24"></a>
<h3>2.4 Adapt ... or Vanish</h3>
<p>
Again, both the evolution of means and the evolution of behaviours induced by the new technologies such as
</p>
<ul>
<li>the digitalization of the masses,</li>
<li>the Big Data technologies,</li>
<li>the internet of everything,</li>
<li>etc.</li>
</ul>
<p>
have strong consequences for corporations and for society in general.
</p>
<div class="centering">
<div class="centered" style="width: 350px; border: 1px solid #555555;"><b>
The digitalization
<br>
shakes the fundamentals of our society,
<br>
challenges our reference points and
<br>
revolutionizes our business models
</b>
</div>
</div>
<p>
Corporations need to adapt. The time is now or never.
</p>
<a name="sec3"></a>
<h2>3. Some definitions</h2>
<p>
I think the reader should now have a grasp of what I mean by the term <i>"digitalization"</i>, so it's a good time to give a few formal definitions.
</p>
<a name="sec31"></a>
<h3>3.1 Digitalization</h3>
<p>
A first definition that I think is good, from Business Dictionary:
</p>
<div class="centering">
<div class="centered" style="width: 350px; border: 1px solid #555555;">
"The digitalization is
<br>
the <b>integration of digital technologies</b>
<br>
into <b>everyday life</b>
<br>
by the <b>digitization of everything</b>
<br>
that can be digitized"
</div>
<br>
<div class="centered">
(Source : BusinessDictionary - <a href="http://www.businessdictionary.com/definition/digitalization.html">http://www.businessdictionary.com/definition/digitalization.html</a>)
</div>
</div>
<br>
<p>
This is happening today: everything that can be digitized either is digitized or is getting digitized.
<br>
I realized recently that I myself haven't written anything down on paper for a pretty long time.
<br>
I use the Internet to make my payments, book my holidays or business trips, and search for phone numbers.
<br>
I take notes on my laptop or my smartphone.
<br>
I even make my medical appointments by email.
</p>
<p>
The digitalization is the increasing integration of digital technologies into everyday life.
<br>
Corporations need to adapt their <i>business models</i> and <i>operating models</i> to follow this trend. They need to transform their business and make it suited to the digital era.
</p>
<p>
The following is the definition of the digitalization from OCTO Technology. I think it's brilliant.
</p>
<div class="centering">
<div class="centered" style="width: 350px; border: 1px solid #555555;">
"The digitalization is
<br>
the <b>impact</b> on corporations and organizations
<br>
of the fact that
<br>
<b>people</b> and <b>things</b>
<br>
are <b>always</b> and <b>everywhere inter-connected
<br>
for every possible need</b>."
</div>
<div class="centered">
<br>
(Source : OCTO Technology - <a href="http://blog.octo.com/digitalisation-une-definition/">http://blog.octo.com/digitalisation-une-definition/</a>)
</div>
</div>
<br>
<p>
What interests us today is the impact of the digitalization on corporations and in this context the definition from OCTO Technology is crystal clear, accurate and most relevant.
</p>
<p>
Another way to put it would be:
<br>
The digitalization is the impact on enterprises and organizations of the Internet of Everything.
</p>
<a name="sec32"></a>
<h3>3.2 Digital Transformation</h3>
<p>
One definition remains: what is the <i>digital transformation</i>?
<br>
I could not find any easy way to present the notion of digital transformation in a single sentence.
</p>
<p>
Instead, I find the following schematic most relevant in presenting what the digital transformation is.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c31a6fbd-391e-4992-9f1c-dcbd5a1f62ff">
<img class="centered" style="width: 800px;" alt="Digital Transformation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c31a6fbd-391e-4992-9f1c-dcbd5a1f62ff" />
</a>
</div>
<br>
<p>
This square is a corporation. The orange form represents its organization, its processes and its culture.
</p>
<p>
The most recent technological evolutions influence society and cause an evolution of both <b>means</b> and <b>behaviours</b>.
</p>
<ul>
<li>In terms of means, think of the always-and-everywhere interconnection of people and things, consumerization, new businesses such as crowdfunding and crowdsourcing, the availability of massive amounts of data and the ability to analyze them, etc.
</li>
<li>
In terms of behaviour, think of the increasing digital literacy of people, the digital natives, and all the behaviour changes brought by social networks, the increasingly connected world and real-time communication means. People want everything, now, and tailor-made.
</li>
</ul>
<p>
This evolution of our society as a whole forces the corporation to transform its operating model, adapt its organization and its culture.
</p>
<div class="centering">
<div class="centered">
<b>
This is the 4th industrial revolution.
</b>
</div>
</div>
<br>
<p>
Corporations need to adapt three essential aspects:
</p>
<ul>
<li>First, the internal organization of corporations needs to be adapted to match the responsiveness and dynamism required to design products in the digital era.
<br>
Key practices here are Agility and DevOps at every level in the company, around the IT organization designing the digital products.
<br>
Management and hierarchy should also be adapted to enable short response times to market events and customer feedback.
<br>
Finally, every action and decision within the corporation should be taken with customer centricity in mind.
</li>
<li>
Entering the digital era requires a significant evolution of the culture of the company.
<br>
Lean Startup principles and practices should be embraced, and a thorough Customer Development Process should accompany the Product Development processes in place.
<br>
Also, in the digital era, with shrinking margins and the increasing complexity of products and regulations, operational efficiency should be a constant focus.
</li>
<li>
Finally, the marketing approach should evolve to meet customer expectations in a digital world. Customers expect corporations to meet them where they are, in a mostly online world. Corporations started the digital marketing move many years ago already, but that is not sufficient.
<br>
In this ever more selfish world, people are looking for tailor-made. It's all about me, myself and I. Corporations need to take this into account, and thanks to a sound adoption of Lean Startup principles such as multivariate tests, they have the possibility to provide customers with very customized solutions. Even further, it is nowadays common to involve customers in the design of the product itself and even in the process of searching for new products to develop. Think co-creation and co-innovation.
</li>
</ul>
<p>
Interestingly, just as technology is <i>the driver</i> behind the evolutions forcing corporations to adapt, technology is also <i>the solution</i> to the transformation of corporations.
</p>
<p>
Corporations need to understand that they have no choice. They need to digitalize significant portions of their business and understand the central place that technology has to take.
<br>
Corporations that still believe today that IT and technology are a cost center instead of the key vector for innovation will eventually disappear. Whatever the industry, IT and technology must be considered a key investment and the most important vector of innovation.
</p>
<a name="sec4"></a>
<h2>4. Challenges and Opportunities for Banking Institutions</h2>
<p>
So far, this definition part has been pretty generic.
<br>
I now want to focus a little bit more on the digital transformation of financial institutions.
</p>
<p>
I would like to consider the digital transformation from two perspectives:
</p>
<ul>
<li>The first perspective concerns the challenges that the digital era's evolution of means and behaviours poses to financial institutions.</li>
<li>The second perspective concerns the opportunities that the digital era offers financial institutions.</li>
</ul>
<a name="sec41"></a>
<h3>4.1 Challenges</h3>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a26eac9f-2809-4d72-a84a-0a95f30561f5">
<img class="centered" style="width: 750px;" alt="Challenges" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a26eac9f-2809-4d72-a84a-0a95f30561f5" />
</a>
</div>
<br>
<p>
<b>Competitiveness</b>
</p>
<p>
In the digital era, comparison and advisory web sites flourish and new comparison services appear almost every month.
<br>
As an example, my father still buys his consumer electronics from the physical store closest to where he lives.
<br>
I myself use an aggregation and comparison web site to buy my devices from the online shop offering the cheapest price.
<br>
The same evolution that impacted consumer goods will eventually apply to absolutely every business, including the banking business.
<br>
Banking institutions need to adopt a fair-price policy and emphasize clarity and simplicity when designing their products.
</p>
<p>
<b>Customer satisfaction</b>
</p>
<p>
In a digital world, people want everything immediately. In addition, reputation is very important and can be harmed in no time. People suffering from stolen credit cards will express themselves on social networks and can quickly harm the reputation of an institution that dealt badly with such a situation.
</p>
<p>
<b>Customer centricity</b>
</p>
<p>
Today more than ever, complementing a thorough product development process with a customer development process, meeting customers where they are, and focusing on their needs and demands should be the core focus of financial institutions.
</p>
<p>
<b>Marketing and branding</b>
</p>
<p>
It's all about innovation and reputation. Design the best and most innovative products, implement striking online and mobile services, communicate about them on the right channels, and the chances that they become viral are high.
<br>
At the same time, an anecdotal fact, discussed widely on the internet and becoming viral, can cause a lot more harm to a company than a bad balance sheet.
<br>
Again, making a difference in the digital era is all about innovation and reputation.
</p>
<p>
<b>Operational Efficiency</b>
</p>
<p>
With shrinking margins and increasing product development costs, corporations need to rethink the way they work. Tracking and eliminating waste should be part of the business processes used to run the company, not a side activity performed once in a while when shareholders complain.
<br>
The key levers here are process automation and the reduction of intervention delays.
</p>
<p>
<b>Risk Management and Mitigation</b>
</p>
<p>
In a digital world, the attack surface for cyber-criminals or simply thieves is much larger than in the traditional world. New channels, especially digital channels, to access banking products come with higher risks.
<br>
While a sampling approach to control and audit may have been sufficient before, this is no longer the case with the digitalization. Banking institutions need to move towards continuous, automated, comprehensive and real-time control and audit approaches.
<br>
State of the art fraud detection systems are not optional anymore.
</p>
<a name="sec42"></a>
<h3>4.2 Opportunities</h3>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/bb8db4df-9b84-4b2f-8e1d-b84254d61ce0">
<img class="centered" style="width: 750px;" alt="Opportunities" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/bb8db4df-9b84-4b2f-8e1d-b84254d61ce0" />
</a>
</div>
<br>
<p>
<b>Competitiveness</b>
</p>
<p>
In a digital world, technology can help put in place platforms for <i>co-creation</i> and <i>co-innovation</i>, involving customers in both the identification of new products and their definition, which is ultimately the best way to handle innovation and the requirement for tailor-made products. In addition, digital products can be made more customized than ever before.
</p>
<p>
<b>Customer satisfaction</b>
</p>
<p>
When processes and products are digitalized, achieving 24/7 availability is straightforward. A computer doesn't sleep. In addition, communication about fair price strategies is easier when the catalog of products is available online. Finally, even customer follow-up processes can be automated.
</p>
<p>
<b>Customer centricity</b>
</p>
<p>
The digitalization requires financial institutions to meet customers on their preferred channels, but it also provides them with the means to do so. Getting online and digital, from a purely technological perspective, is not difficult. Changing the organization of the corporation to achieve the digitalization is the difficult part.
</p>
<p>
<b>Marketing and branding</b>
</p>
<p>
It's all about innovation and reputation.
<br>
The digital world offers unprecedented opportunities to convince a customer to buy a product. Think of trial systems, demo systems, sandboxes...
<br>
In addition, digital marketing is now a field on its own.
</p>
<p>
<b>Operational Efficiency</b>
</p>
<p>
Technology is also key to achieve <i>Operational Efficiency</i>. New products or technologies aimed at moving to a paperless corporation and de-materializing processes appear every month, not to say every week. Think of digital signatures, responsive interfaces, etc.
</p>
<p>
<b>Risk Management and Mitigation</b>
</p>
<p>
Here as well, most recent technologies such as Big Data Analytics, Machine Learning and real-time processing systems offer unprecedented opportunities to move towards continuous, automated, comprehensive and real-time control and audit approaches.
<br>
In addition, web technologies have significantly progressed over the past 10 years making it possible to build responsive dashboards to monitor <i>Key Performance Indicators</i> and <i>Key Risk Indicators</i> as well as state of the art Data Discovery Platforms.
</p>
<a name="sec5"></a>
<h2>5. Conclusion</h2>
<p>
I have presented above the <i>challenges</i> and <i>opportunities</i> for banking institutions coming from the <i>Digitalization</i>. But again, that is really only one side of the coin; on the other side there are increasing regulatory pressure and shrinking margins.
</p>
<p>
With the digitalization, new opportunities for growth and innovation are emerging. Many banking institutions have started to move towards accomplishing the digital transformation in many aspects of their business. But the target still seems far away.
<br>
Others are struggling to identify the path and are put in danger by smaller actors or fintechs that are increasingly eating into the banking business. They need to understand that it is now or never.
</p>
<p>
Part of this article is available as a slideshare presentation here: <a href="https://www.slideshare.net/JrmeKehrli/digitalization-a-challenge-and-an-opportunity-for-banks">https://www.slideshare.net/JrmeKehrli/digitalization-a-challenge-and-an-opportunity-for-banks</a>.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/agile-landscape
Agile Landscape from Deloitte
Jerome Kehrli
2017-03-02T17:51:12-05:00
2017-09-13T16:40:41-04:00
<!-- Agile Landscape from Deloitte -->
<p>
I recently came across this infographic from Christopher Webb, at Deloitte at the time.
<br>
This is the most brilliant infographic I've seen in years.
</p>
<p>
Christopher Webb presents here a pretty extensive set of <i>Agile Practices</i> associated with their respective <i>frameworks</i>. The practices presented are a collection of all Agile practices down the line, related to engineering but also to management, product identification, design, operations, etc.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d028b3c0-5332-4ae7-a91a-dc21b7551299">
<img class="centered" style="width: 600px;" alt="The Agile Lansdcape from Deloitte" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d028b3c0-5332-4ae7-a91a-dc21b7551299" />
</a><br>
<div class="centered">
[Click to enlarge]<br>
(Source : Christopher Webb - LAST Conference 2016 Agile Landscape - <a href="https://www.slideshare.net/ChrisWebb6/last-conference-2016-agile-landscape-presentation-v1">https://www.slideshare.net/ChrisWebb6/last-conference-2016-agile-landscape-presentation-v1</a>)
</div>
</div>
<br>
<p>
I find this infographic brilliant since it's the first time I've seen a "<i>one ring to rule them all</i>" view of what I consider should be the practices for scaling Agility to the level of the whole IT organization.
</p>
<p>
Very often, when we think of Agility, we limit our consideration solely to the Software Build Process.
<br>
But Agility is more than that. And I believe an Agile corporation should also embrace <b>Agile Design</b>, <b>Agile Operations</b> and <b>Agile Management</b>.
<br>
This infographic does a great job of presenting how these frameworks enrich and complement each other towards scaling Agility to the level of the whole IT organization.
</p>
<p>
To be honest, there are even more frameworks than those indicated on this infographic, and Chris Webb presents some additional ones - reaching 43 in total - in <a href="https://www.slideshare.net/ChrisWebb6/last-conference-2016-agile-landscape-presentation-v1">his presentation</a>.
<br>
But I believe he did a great job of presenting the most essential ones and showing how these practices, principles and frameworks work together to achieve the ultimate goal of every corporation: skyrocketing employee productivity and happiness, maximizing customer satisfaction and boosting operational efficiency.
</p>
<p>
Now I want to present why I think it is important to consider Agility down the line, in each and every aspect around the engineering team, and how these frameworks complete each other.
</p>
<h3>1. Agile Design</h3>
<p>
Normally I am a little sensitive about the formal meaning of the word <i>design</i> in software engineering.
</p>
<p>
But for once I'll make an exception.
<br>
So for once, by <i>design</i> here, I mean the broadest possible definition of the term, encompassing both the discovery of the key features and the architecture of the system to be implemented.
</p>
<p>
Agility in identifying beforehand the product to be implemented and its key features is a must.
<br>
Later, when the rough form of the product is identified, the process consists in holding a Vision workshop to align the stakeholders on the product vision, then Story Mapping workshops, all of these emphasizing Agility, adaptation and lightweight processes in comparison to the tons of documents produced by more traditional methods.
</p>
<p>
This is pretty well covered in the infographic above, and Design Thinking covers all the practices that seem key to me, from the lightweight <i>Business Model Canvas</i> to <i>Product Vision</i> definition workshops and <i>Story Mapping</i> workshops.
</p>
<p>
At the end of the day, Agility is mostly about the capacity to <i>adapt</i> and <i>react</i> to changing requirements and changing priorities. Enforcing thorough product identification and feature design phases before actually initiating the development of an <b>MVP</b> aimed at validating (or contradicting) the hypotheses makes little sense in my opinion.
<br>
One important framework to consider here is the <i>Lean</i> approach and the <a href="https://www.niceideas.ch/roller2/badtrash/entry/lean-startup-a-focus-on">Lean Startup Practices</a>.
</p>
<p>
At the end of the day, <i>Agile Software Development</i> methodologies cannot deploy their full potential if the company itself is not Agile.
</p>
<h3>2. Agile Development</h3>
<p>
At the root of everything there is XP. <i>eXtreme Programming</i> was mostly initiated by Kent Beck, drawing on his experience on the C3 project. Kent Beck hardly invented a lot of things; rather, he took practices more or less used previously in the industry and pushed them to <i>extreme</i> levels.
</p>
<p>
Agile Software Development is really built on top of XP genes. Today XP is considered just another Agile Software Development framework, but I don't share that view. To me, XP and the related practices form the most fundamental core of Agile Software Development methodologies.
<br>
XP practices take one form or another in the various Agile frameworks such as RDD, Scrum, Kanban, Scrumban, etc. In some of them, some core XP practices are not mentioned; not because they should not be applied, but because they're nowadays considered so natural that they're assumed. Think for instance of TDD (Unit Tests first), Continuous Integration, Simple Metaphor (Meaningful Naming, Domain Driven Design, Design Patterns), etc.
</p>
<p>
I discussed in <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-software-development-lessons-learned">a previous article on this blog</a> the software development methodology we are using in my current company, and interestingly all of our practices are pretty well identified on the infographic above.
</p>
<h3>3. Agile Operation</h3>
<p>
Agile operation is really about <b>DevOps</b>.
</p>
<p>
I explained at length in a dedicated article on this very blog what DevOps is and why it's important, so I'll let the reader refer to <a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained">this article</a>.
</p>
<p>
Let's just mention that here as well, it is hard for the development team to leverage its Agile practices if the other departments of the corporation - and among those, operations is crucial - have not embraced Agility.
</p>
<h3>4. Agile Management</h3>
<p>
Agile Management is about Leadership, and Leadership pursues the goal of growing and transforming organizations into great places to work, where people are engaged, the work is improved and customers are simply delighted.
</p>
<p>
Agile Management is a lot about Management 3.0.
<br>
Management 1.0 was about doing the wrong thing, treating people like cogs in a system. Management 2.0 was about doing the right thing wrong: understanding the goals and having good intentions, but using old-fashioned top-down initiatives.
<br>
Management 3.0 is about doing the right thing for the team, involving everyone in improving the system and fostering innovation.
</p>
<p>
Agile Management is about making the components of the Agile corporation collaborate towards anticipating changes and adapting smoothly and flawlessly.
<br>
There are three essential vectors:
</p>
<ul>
<li><b>Collective Intelligence</b>: key to addressing and controlling the increasing complexity of organizations and businesses, and based on having everyone in the company take part in the continuous improvement processes</li>
<li><b>Optimal use of Technology</b>: technology is an amazing vector of efficiency with regard to the tools supporting the organization</li>
<li><b>A sound adoption of Continuous Improvement Processes</b>: making the organization identify and build on its strengths while continuously addressing its weaknesses, so as to adapt itself continuously.</li>
</ul>
<p>
Agile Management values individuals and interactions over formal processes and hierarchy. It really consists in empowering people and making the organization a place where they can develop themselves with passion and energy, leveraging their capacity for both action and innovation.
<br>
Now of course this needs to be driven, and Agile Management encourages continuous feedback in the form, for instance, of regular O3s - <i>One-on-Ones</i> - where the employee can provide feedback on the organization and the manager on the performance of the employee.
</p>
<p>
Managing performance in this sense means identifying the strengths of the employee, which we should leverage, and the weaknesses, which we should address and improve.
</p>
<p>
Empowering people is a key practice since, at the end of the day, Management is too important to be left to Managers ;-)
<br>
Agility is about adaptation but also about efficiency and quality (think XP practices here) and Agile Management is about putting practices in place aimed at making engineers give the best they can and participate at every level in the success of the company.
</p>
<p>
I would conclude this section by giving my favorite definition of management:
</p>
<div class="centering">
<div class="centered">
<i>"Hire great people, and then get the hell out of their way."</i>
</div>
</div>
<br>
<h3>5. Conclusion</h3>
<p>
This infographic is an awesome view of what we have achieved over the last 10 to 15 years in terms of understanding how to design, engineer, build and manage better.
<br>
I believe finding better ways of working should be an everyday concern for organizations, from startups to international corporations.
</p>
<p>
Quoting <a href="http://www.ge.com/annual00/download/images/GEannual00.pdf">Jack Welch</a>:
</p>
<div class="centering">
<div class="centered">
<i>"If the rate of change on the outside exceeds the rate of change on the inside, the end is near."</i>
</div>
</div>
<br>
<p>
My personal picks are:
</p>
<ul>
<li>Lean (Startup) - See <a href="https://www.niceideas.ch/roller2/badtrash/entry/lean-startup-a-focus-on">my article on Lean Startup</a></li>
<li>DevOps - See <a href="https://www.niceideas.ch/roller2/badtrash/entry/devops-explained">my article on DevOps</a></li>
<li>XP, Scrum and Kanban (Agile Development) - See <a href="https://www.niceideas.ch/roller2/badtrash/entry/agile-software-development-lessons-learned">my article on agility</a></li>
<li>Management 3.0 - Empowering and energizing people, Developing competences, Aligning teams, Continuous Improvement</li>
<li>Kaizen (of course)</li>
</ul>
<p>
I have no experience with <i>Scaling Agile</i> frameworks for now. It's becoming a pretty hot topic in my current company though, and I'll come back with an article on this blog when I have some.
<br>
My preference would go to LeSS, I think, since it seems more natural to me. But that is just an initial opinion, and it may change ...
</p>
https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for
Bytecode manipulation with Javassist for fun and profit part I: Implementing a lightweight IoC container in 300 lines of code
Jerome Kehrli
2017-02-13T15:30:33-05:00
2017-09-13T16:42:14-04:00
<p>
Java bytecode is the form of instructions that the JVM executes.
<br>
A Java programmer, normally, does not need to be aware of how Java bytecode works.
</p>
<p>
Understanding the bytecode, however, is essential to the areas of tooling and program analysis, where applications can modify the bytecode to adjust behavior according to the application's domain. Profilers, mocking tools, AOP, ORM frameworks, IoC containers, boilerplate code generators, etc. all require a thorough understanding of Java bytecode and a means of manipulating it at runtime.
<br>
Each and every one of these advanced features - nowadays standard approaches when programming in Java - requires a sound understanding of Java bytecode, not to mention the completely new languages running on the JVM such as Scala or Clojure.
</p>
<p>
Bytecode manipulation is not easy though ... except with Javassist.
<br>
Of all the libraries and tools providing advanced bytecode manipulation features, Javassist is the easiest to use and the quickest to master. It takes any initiated Java developer only a few minutes to understand Javassist and be able to use it efficiently. And mastering bytecode manipulation opens a whole new world of approaches and possibilities.
</p>
<p>
The goal of this article is to present Javassist in the light of a concrete use case: the implementation in a little more than 300 lines of code of a lightweight, simple but cute IoC Container: SCIF - Simple and Cute IoC Framework.
</p>
<p>
A new version of <a href="https://www.niceideas.ch/roller2/badtrash/entry/comet-having-fun-with-the#sec31">comet-tennis demo app</a> with the SCIF framework integrated is available <a download="comet_tennis_src_0.4.tar.gz" type="application/tar+gzip" href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2eeba15c-dc3c-4546-adf8-a41cd4b31932">here</a>.
</p>
<p>
Part of this article is available as a slideshare presentation here: <a href="https://www.slideshare.net/JrmeKehrli/bytecode-manipulation-with-javassist-for-fun-and-profit">https://www.slideshare.net/JrmeKehrli/bytecode-manipulation-with-javassist-for-fun-and-profit</a>.
</p>
<p>
You might also want to have a look at the second article in this series, available here: <a href="https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for1">Bytecode manipulation with Javassist for fun and profit part II: Generating toString and getter/setters using bytecode manipulation</a>.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction / Purpose </a>
<ul>
<li><a href="#sec11">1.1 Runtime Reflection</a></li>
<li><a href="#sec12">1.2 Bytcode manipulation</a></li>
</ul>
</li>
<li><a href="#sec2">2. Javassist</a>
<ul>
<li><a href="#sec21">2.1 Javassist prupose and behaviour</a></li>
<li><a href="#sec22">2.2 A gentle example with Javassist</a></li>
</ul>
</li>
<li><a href="#sec3">3. IoC</a>
<ul>
<li><a href="#sec31">3.1 IoC history</a></li>
<li><a href="#sec32">3.2 IoC Principle</a></li>
<li><a href="#sec33">3.2 IoC frameworks (Spring / Pico / Google)</a></li>
</ul>
</li>
<li><a href="#sec4">4.0 SCIF : Simple and Cute IoC Framework</a>
<ul>
<li><a href="#sec41">4.1 Principle</a></li>
<li><a href="#sec42">4.2 Design</a></li>
<li><a href="#sec43">4.3 Some focus on code</a></li>
<li><a href="#sec44">4.4 DemoApp : Comet Tennis</a></li>
</ul>
</li>
<li><a href="#sec5">5. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction / Purpose </h2>
<p>
Bytecode manipulation consists in modifying, at runtime, the classes - represented by bytecode - compiled by the Java compiler. It is used extensively, for instance, by frameworks such as Spring (IoC) and Hibernate (ORM) to inject dynamic behaviour into Java objects at runtime.
<br>
But first, let's look at a very summarized view of the Java toolchain to recall a few concepts:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/583331b9-53d5-45a5-bb37-0277e7b58b95">
<img class="centered" style="width: 500px; " alt="Java Toolchain" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/583331b9-53d5-45a5-bb37-0277e7b58b95" />
</a>
</div>
<br>
<p>
Java source files are compiled to Java class files by the Java compiler. These Java classes take the form of <i>bytecode</i>. This bytecode is loaded by the JVM to execute the Java program.
<br>
In principle the bytecode is read-only and cannot be changed once loaded. That is true, but:
</p>
<ul>
<li>The Java class bytecode can be modified before being loaded by the classloader, through the use of an <i>agent</i> (see the sketch right after this list)</li>
<li>A class's bytecode can be modified at runtime without an agent, as long as the class has not yet been loaded by a classloader.</li>
<li>Classes can be generated entirely dynamically at runtime using bytecode manipulation techniques</li>
</ul>
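<p>
To make the first option concrete, here is a minimal sketch of a Java agent. The class name <code>DemoAgent</code> is purely hypothetical; such an agent would be registered through the <code>Premain-Class</code> attribute of the agent jar's manifest and attached with the <code>-javaagent</code> JVM flag:
</p>
<pre>
<b>import</b> java.lang.instrument.Instrumentation;

<b>public class</b> DemoAgent {

    <span style="color: green;">// Called by the JVM before main() when the agent jar declares</span>
    <span style="color: green;">// "Premain-Class: DemoAgent" in its manifest</span>
    <b>public static void</b> premain (String agentArgs, Instrumentation inst) {
        inst.addTransformer((loader, className, classBeingRedefined,
                protectionDomain, classfileBuffer) -> {
            <span style="color: green;">// Return modified bytecode here, or null to leave the class untouched</span>
            <b>return null</b>;
        });
    }
}
</pre>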
<p>
In this article we'll dig into the <a href="http://jboss-javassist.github.io/javassist/">Javassist</a> library, a bytecode manipulation framework that can help achieve all of the above mechanisms.
<br>
But before that, let's describe three distinct but complementary techniques: <i>Introspection</i>, <i>Reflection</i> and <i>Bytecode Manipulation</i>.
</p>
<a name="sec11"></a>
<h3>1.1 Runtime Reflection </h3>
<p>
<i>Runtime reflection</i> is the ability of a computer program to examine, introspect, and modify its own structure and behavior at runtime.
</p>
<p>
Reflection is commonly used by programs which require the ability to examine or modify the runtime behavior of applications running in the Java virtual machine.
<br>
Reflection is a powerful technique and can enable applications to perform operations which would otherwise be impossible.
</p>
<p>
The ability to examine and manipulate a Java class from within itself may not sound like very much, but in other programming languages this feature simply doesn't exist. For example, there is no way in a Pascal, C, or C++ program to obtain detailed information about the functions defined within that program.
</p>
<p>
In Java there is no specific <b>introspection</b> API available natively; <i>introspection</i> is also performed using the <b>Java Reflection API</b>.
</p>
<p>
Still, conceptually, introspection and reflection are different things:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/dc4da8de-0885-4470-b0ae-0f392de9c4f3">
<img class="centered" style="width: 600px;" alt="Introspection and Reflection" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/dc4da8de-0885-4470-b0ae-0f392de9c4f3" />
</a>
</div>
<br>
<p>
<b>Type introspection</b> is the ability of a program to examine the type or properties of an object at runtime.
<br>
This is a short example of type introspection in Java, where we discover the fields of an object and show their values dynamically. Again, introspection in Java is really done using the <i>Reflection API</i>:
</p>
<pre>
<b>import</b> java.lang.reflect.Field;

<b>public class</b> TestIntrospection {

    <b>public static class</b> TestData {
        <b>private int</b> i = <span style="color: blue;">0</span>;
        <b>private</b> String myString = <span style="color: blue;">"abc"</span>;
        <b>private long</b> value = <span style="color: blue;">-1</span>;
    }

    <span style="color: green;">// test Introspection</span>
    <b>public static void</b> main (String[] args) {
        <b>try</b> {
            <span style="color: green;">// Using Introspection, we really don't care about the actual type</span>
            Object td = <b>new</b> TestData();

            <span style="color: green;">// List the fields of TestData and get their values</span>
            <b>for</b> (Field field : td.getClass().getDeclaredFields()) {
                field.setAccessible(<b>true</b>); <span style="color: green;">// just make private fields accessible</span>
                System.out.println (field.getName() + <span style="color: blue;">"="</span> + field.get(td));
            }
        } <b>catch</b> (IllegalAccessException e) {
            e.printStackTrace();
        }
    }
}
</pre>
<p>
<b>Runtime Reflection</b> is a native feature in the Java programming language. It allows an executing Java program to examine or "<i>introspect</i>" upon itself, and manipulate internal properties of the program.
<br>
A short example could be as follows, where we change the value of a field using reflection:
</p>
<pre>
<b>import</b> java.lang.reflect.Field;

<b>public class</b> TestReflection {

    <b>public static class</b> TestData {
        <b>private int</b> i = <span style="color: blue;">0</span>;
        <b>private</b> String myString = <span style="color: blue;">"abc"</span>;
        <b>private long</b> value = <span style="color: blue;">-1</span>;
    }

    <span style="color: green;">// test Reflection</span>
    <b>public static void</b> main (String[] args) {
        <b>try</b> {
            <span style="color: green;">// Using Reflection, we really don't care about the actual type</span>
            Object td = <b>new</b> TestData();

            <span style="color: green;">// Change the value of the field myString</span>
            Field myStringField = td.getClass().getDeclaredField(<span style="color: blue;">"myString"</span>);
            myStringField.setAccessible(<b>true</b>); <span style="color: green;">// just make private fields accessible</span>
            myStringField.set(td, <span style="color: blue;">"xyz"</span>);
            System.out.println (myStringField.getName() + <span style="color: blue;">"="</span> + myStringField.get(td));
        } <b>catch</b> (NoSuchFieldException | IllegalAccessException e) {
            e.printStackTrace();
        }
    }
}
</pre>
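<p>
Reflection is not limited to fields: methods discovered at runtime can be invoked the same way. Here is a minimal sketch along the same lines, where the <code>Greeter</code> class and its <code>greet()</code> method are purely hypothetical:
</p>
<pre>
<b>import</b> java.lang.reflect.Method;

<b>public class</b> TestMethodReflection {

    <b>public static class</b> Greeter {
        <b>private</b> String greet (String name) {
            <b>return</b> <span style="color: blue;">"Hello "</span> + name;
        }
    }

    <b>public static void</b> main (String[] args) <b>throws</b> Exception {
        Object g = <b>new</b> Greeter();

        <span style="color: green;">// Look the method up by name and parameter types, then invoke it</span>
        Method greet = g.getClass().getDeclaredMethod(<span style="color: blue;">"greet"</span>, String.class);
        greet.setAccessible(<b>true</b>); <span style="color: green;">// make the private method callable</span>
        System.out.println (greet.invoke(g, <span style="color: blue;">"world"</span>));
    }
}
</pre>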
<p>
<i>Runtime Reflection</i> is a very powerful feature of the JVM.
<br>
I wrote a <a href="https://www.niceideas.ch/roller2/badtrash/entry/java-create-enum-instances-dynamically">previous article</a> on this very blog showing how to dynamically add values to an <code>Enum Type</code> in Java using Runtime Reflection.
</p>
<p>
<b>Why are these important in the scope of bytecode manipulation and Javassist?</b>
</p>
<p>
<i>Runtime Reflection</i> is important in our context for two reasons:
</p>
<ol>
<li>First, because Javassist attempts to keep an API as close as possible to the Java Runtime Reflection API as a way to appear as natural as possible to Java developers.</li>
<li>Second, and this is maybe more important, because behaviour injected into Java classes using bytecode manipulation is not known by the compiler. Thus, it is sometimes only available through runtime reflection.</li>
</ol>
<a name="sec12"></a>
<h3>1.2 Bytecode manipulation</h3>
<p>
Bytecode manipulation allows the developer to express instructions in a format that is directly understood by the Java Virtual Machine, without going from source code to bytecode through the compiler.
<br>
Bytecode is somewhat similar to assembly code directly interpretable by the CPU. But with Java the bytecode is, first, interpreted by a virtual machine, the JVM, and second, much more understandable than assembly code.
</p>
<p>
One might wonder why one would want to get interested in bytecode manipulation and generation. As a matter of fact, every Java developer has likely already been using bytecode manipulation all over the place without knowing it.
<br>
Since the JVM can modify bytecode and use new bytecode while it is running, this generates a whole new universe of languages and tools that by far surpasses the initial intent of the Java language.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/18bc7da9-9a4d-40ab-8a0a-6430fafa6e3a">
<img class="centered" style="width: 600px;" alt="Bytecode Manipulation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/18bc7da9-9a4d-40ab-8a0a-6430fafa6e3a" />
</a>
</div>
<br>
<p>
<b>Bytecode manipulation use cases</b>
</p>
<p>
Some examples are:
</p>
<ul>
<li>ORM frameworks such as <a href="http://hibernate.org/">Hibernate</a> use bytecode manipulation to inject, for instance, relationship management code (lazy loading, etc.) into mapped entities.</li>
<li><a href="http://findbugs.sourceforge.net/">FindBugs</a> inspects bytecode for static code analysis</li>
<li>Languages like <a href="http://www.groovy-lang.org/">Groovy</a>, <a href="https://www.scala-lang.org/">Scala</a> and <a href="https://clojure.org/">Clojure</a> generate bytecode from their own source languages.</li>
<li>IoC frameworks such as <a href="https://spring.io/">Spring</a> use it to seamlessly weave your application lifecycle together</li>
<li>Language extensions like <a href="http://www.eclipse.org/aspectj/">AspectJ</a> can augment the capabilities of Java by modifying the classes that the Java compiler generated</li>
<li>etc.</li>
</ul>
<p>
The Java platform provides you with many ways to work with bytecode, for instance:
</p>
<ul>
<li>One can write one's own compiler for any kind of new and crazy language</li>
<li>One can generate on the fly sub-classes of already loaded classes and use them instead of the original classes to get additional behaviour</li>
<li>One can write an instrumentation agent that plugs right into the JVM and modifies the behaviour of classes before they are loaded by the classloader</li>
<li>etc.</li>
</ul>
<p>
With so many options available, one of them will certainly fit any experiment one wants to play around with. With bytecode manipulation, one really gets the whole power of the JVM for free and the capacity to slot in any idea exactly where it's needed, while reusing the rest of the Java platform.
</p>
<p>
From my perspective, this is what excites me the most: as a developer, I can really focus on my crazy idea that is not supported by the Java language, and I don't have to write an entire platform to make it come to life.
<br>
Certainly this has been one of the key reasons why the Java community has constantly been experimenting with new ways to push the programming toolset further.
</p>
<p>
This article won't present the details of the Java bytecode any further. We'll focus instead on high-level libraries aimed at manipulating the Java bytecode.
<br>
Should you be interested in the low-level details, I can only recommend that you read <a href="https://niceideas.ch//mastering-java-bytecode.pdf">this excellent paper from ZeroTurnaround</a>, the guys behind JRebel.
</p>
<p>
<b>Most common bytecode manipulation libraries</b>
</p>
<p>
As a matter of fact, while Runtime Reflection is supported natively by the JVM, <b>bytecode manipulation</b>, on the other hand, is fairly difficult to achieve without the use of <i>a specific library</i>.
</p>
<p>
The most common bytecode manipulation libraries in Java are as follows:
</p>
<ul>
<li><a href="http://asm.ow2.org/">ASM</a> s a project of the OW2 Consortium. It provides a simple API for decomposing, modifying, and recomposing binary Java classes. ASM exposes the internal aggregate components of a given Java class through its visitor oriented API. ASM also provides, on top of this visitor API, a tree API that represents classes as object constructs. Both APIs can be used for modifying the binary bytecode, as well as generating new bytecode</li>
<li><a href="https://commons.apache.org/proper/commons-bcel/">BCEL</a> provides a simple library that exposes the internal aggregate components of a given Java class through its API as object constructs (as opposed to the disassembly of the lower-level opcodes). These objects also expose operations for modifying the binary bytecode, as well as generating new bytecode (via injection of new code into the existing code, or through generation of new classes altogether).</li>
<li><a href="https://github.com/cglib/cglib/wiki">CGLIB</a> is a powerful, high performance and quality Code Generation Library, it is used to extend JAVA classes and implements interfaces at runtime. CGLIB is really oriented towards implementing new classes at runtime, as opposed to modifying existing bytecode such as other libraries.</li>
<li><a href="http://jboss-javassist.github.io/javassist/">Javassist</a> is a Java library providing a means to manipulate the Java bytecode of an application. In this sense Javassist provides the support for structural reflection, i.e. the ability to change the implementation of a class at run time.</li>
</ul>
<p>
Javassist is much easier to use than lower-level libraries such as BCEL or ASM. It is also less limited and more powerful than CGLIB.
</p>
<p>
The chart below also shows the AspectJ framework, to give the reader an understanding of the level of abstraction provided by these tools:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/08231a71-0633-4a02-934f-a968d20cbe69">
<img class="centered" style="width: 500px;" alt="Bytecode Manipulation frameworks" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/08231a71-0633-4a02-934f-a968d20cbe69" />
</a>
</div>
<br>
<p>
Now, as you might have guessed from its title, the rest of this article will focus on Javassist.
</p>
<a name="sec2"></a>
<h2>2. Javassist</h2>
<p>
From the <a href="http://jboss-javassist.github.io/javassist/">Javassist web site:</a>
</p>
<p>
"Javassist (Java Programming Assistant) makes Java bytecode manipulation simple. It is a class library for editing bytecode in Java; it enables Java programs to define a new class at runtime and to modify a class file when the JVM loads it.
<br>
Unlike other similar bytecode editors, Javassist provides two levels of API: source level and bytecode level. If the users use the source-level API, they can edit a class file without knowledge of the specifications of the Java bytecode.
<br>
The whole API is designed with only the vocabulary of the Java language. You can even specify inserted bytecode in the form of source text; Javassist compiles it on the fly. On the other hand, the bytecode-level API allows the users to directly edit a class file as other editors."
</p>
<p>
The fact that Javassist is presented above as being able to modify classes at loading time is not a limitation of the Javassist framework itself, but rather a consequence of the linking system of the JVM. Once a class has already been loaded, changing it would result in a linkage error (unless the JVM is launched with JPDA [Java Platform Debugger Architecture] enabled, which makes classes dynamically reloadable, but that is another story).
<br>
Interestingly, Javassist is perfectly able to modify a class long after the application has started, as long as that specific class has not been loaded.
<br>
This is just to emphasize that Javassist can perfectly well be used to modify classes at runtime, and not only at "<i>pre-main</i>" time through the use of a JVM agent - an approach which suffers from significant constraints.
</p>
<p>
In my opinion, the great strength of Javassist over its competitors is that it enables the user to generate bytecode on the fly from actual Java code given to it in the form of a string, by compiling such strings on the fly.
<br>
And that is freaking awesome.
</p>
<a name="sec21"></a>
<h3>2.1 Javassist purpose and behaviour</h3>
<p>
Javassist provides the developer with a high level API around classes, methods, fields, etc. aimed at making it as easy as possible to change the implementation of existing classes or even implement completely new classes, dynamically, at runtime, using bytecode manipulation.
</p>
<p>
(The following is explained in more details in the <a href="http://jboss-javassist.github.io/javassist/tutorial/tutorial.html">official Javassist tutorial</a>.)
</p>
<p>
The most important elements of the Javassist API are presented on the schema below:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/10735f16-d4b1-4bc8-8a4e-e70761f85b4b">
<img class="centered" style="width: 500px;" alt="Javassist API" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/10735f16-d4b1-4bc8-8a4e-e70761f85b4b" />
</a>
</div>
<br>
<p>
The class <code>javassist.CtClass</code> is an abstract representation of a class file. A <code>CtClass</code> (compile-time class) object is a handle for dealing with a class file. The following program is a very simple example:
</p>
<p>
(In all examples of code from now on, <code><span style="color: DarkRed">I will be coloring relevant Javassist API calls in dark red</span></code>)
</p>
<pre>
<span style="color: DarkRed">ClassPool</span> pool = <span style="color: DarkRed">ClassPool.getDefault()</span>;
<span style="color: DarkRed">CtClass</span> cc = pool.<span style="color: DarkRed">get</span>(<span style="color: blue;">"test.Rectangle"</span>);
cc.<span style="color: DarkRed">setSuperclass</span>(pool.<span style="color: DarkRed">get</span>(<span style="color: blue;">"test.Point"</span>));
cc.<span style="color: DarkRed">writeFile()</span>;
</pre>
<p>
<b><code>ClassPool</code></b>
</p>
<p>
This program first obtains a <code>ClassPool</code> object, which controls bytecode modification with Javassist. The <code>ClassPool</code> object is a container of <code>CtClass</code> objects representing class files. It reads a class file on demand to construct a <code>CtClass</code> object and records the constructed object to serve later accesses.
</p>
<p>
To modify the definition of a class, the user must first obtain from a <code>ClassPool</code> object a reference to a <code>CtClass</code> object representing that class. <code>get()</code> in <code>ClassPool</code> is used for this purpose.
<br>
In the case of the program shown above, the <code>CtClass</code> object representing the class <code>test.Rectangle</code> is obtained from the <code>ClassPool</code> object and assigned to a variable <code>cc</code>. The <code>ClassPool</code> object returned by <code>getDefault()</code> searches the default system search path.
</p>
<p>
<b><code>CtClass</code></b>
</p>
<p>
The <code>CtClass</code> object obtained from a <code>ClassPool</code> object can be modified.
<br>
In the example above, it is modified so that the superclass of <code>test.Rectangle</code> is changed into the class <code>test.Point</code>. This change is reflected in the original class file when <code>writeFile()</code> in <code>CtClass</code> is finally called.
</p>
<p>
<code>writeFile()</code> translates the <code>CtClass</code> object into a class file and writes it on a local disk. Javassist also provides a method for directly obtaining the modified bytecode. To obtain the bytecode, call <code>toBytecode()</code>:
</p>
<pre>
<b>byte</b>[] b = cc.<span style="color: DarkRed">toBytecode</span>();
</pre>
<p>
(Bear in mind that this is especially useful when implementing a Java agent)
</p>
<p>
You can directly load the <code>CtClass</code> as well:
</p>
<pre>
Class clazz = cc.<span style="color: DarkRed">toClass</span>();
</pre>
<p>
A class can be returned to the pool, making it available to the classloader and hence the whole application:
</p>
<pre>
pool.<span style="color: DarkRed">toClass</span>(cc, Thread.currentThread().getContextClassLoader(), <b>null</b>);
</pre>
<p>
<b>Defining a new class</b>
</p>
<p>
To define a new class from scratch, <code>makeClass()</code> must be called on a <code>ClassPool</code>.
</p>
<pre>
<span style="color: DarkRed">ClassPool</span> pool = <span style="color: DarkRed">ClassPool.getDefault</span>();
<span style="color: DarkRed">CtClass</span> cc = pool.<span style="color: DarkRed">makeClass</span>(<span style="color: blue;">"Circle"</span>);
cc.<span style="color: DarkRed">setSuperclass</span>(pool.<span style="color: DarkRed">get</span>(<span style="color: blue;">"test.Point"</span>));
</pre>
<p>
This program defines a class <code>Circle</code> including no members except those inherited from the parent class <code>Point</code>. Member methods of <code>Circle</code> can afterwards be created with factory methods declared in <code>CtNewMethod</code> and appended to <code>Circle</code> with <code>addMethod()</code> in <code>CtClass</code>.
<br>
<code>makeClass()</code> cannot create a new interface; <code>makeInterface()</code> in <code>ClassPool</code> can. Member methods of an interface can be created with <code>abstractMethod()</code> in <code>CtNewMethod</code>. Note that an interface method is an abstract method.
</p>
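<p>
To illustrate, here is a minimal sketch of creating such an interface; the <code>Shape</code> interface and its <code>area()</code> method are purely hypothetical:
</p>
<pre>
<span style="color: DarkRed">ClassPool</span> pool = <span style="color: DarkRed">ClassPool.getDefault</span>();
<span style="color: DarkRed">CtClass</span> shape = pool.<span style="color: DarkRed">makeInterface</span>(<span style="color: blue;">"Shape"</span>);

<span style="color: green;">// double area() - no parameters, no declared exceptions</span>
shape.<span style="color: DarkRed">addMethod</span>(<span style="color: DarkRed">CtNewMethod.abstractMethod</span>(
        <span style="color: DarkRed">CtClass.doubleType</span>, <span style="color: blue;">"area"</span>, <b>null</b>, <b>null</b>, shape));
</pre>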
<p>
<b>Implementing / Modifying a class</b>
</p>
<p>
Methods are represented by <code>CtMethod</code> objects. <code>CtMethod</code> provides several methods for modifying the definition of a method. Note that if a method is inherited from a super class, then the same <code>CtMethod</code> object that represents the inherited method represents the method declared in that super class. A <code>CtMethod</code> object corresponds to every method declaration.
<br>
Constructors are represented by their very own type in Javassist: <code>CtConstructor</code>. Both <code>CtMethod</code> and <code>CtConstructor</code> extend the same base class and have a lot of their API in common.
</p>
<p>
Javassist does not allow removing a method or field, but it allows changing its name. So if a method is not necessary any more, it should be renamed and changed into a private method by calling <code>setName()</code> and <code>setModifiers()</code>, declared in <code>CtMethod</code>, for instance to hide it. But beware of linkage errors at runtime if you mess with a method used by another class.
</p>
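<p>
As a minimal sketch, assuming a hypothetical <code>oldArea()</code> method on the <code>test.Rectangle</code> class from the earlier examples:
</p>
<pre>
<span style="color: DarkRed">CtClass</span> cc = <span style="color: DarkRed">ClassPool.getDefault().get</span>(<span style="color: blue;">"test.Rectangle"</span>);
<span style="color: DarkRed">CtMethod</span> old = cc.<span style="color: DarkRed">getDeclaredMethod</span>(<span style="color: blue;">"oldArea"</span>);

<span style="color: green;">// Rename the obsolete method and make it private to hide it</span>
old.<span style="color: DarkRed">setName</span>(<span style="color: blue;">"oldAreaRenamed"</span>);
old.<span style="color: DarkRed">setModifiers</span>(<span style="color: DarkRed">Modifier.PRIVATE</span>);
</pre>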
<p>
<code>CtMethod</code> and <code>CtConstructor</code> can be used to completely implement / rewrite a constructor or a method from scratch. They also provide methods <code>insertBefore()</code>, <code>insertAfter()</code>, and <code>addCatch()</code>. They are used for inserting a code fragment into the body of an existing method.
</p>
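<p>
For example, here is a minimal sketch injecting trace statements around a hypothetical <code>move()</code> method, continuing with the <code>cc</code> class handle from the previous sketch:
</p>
<pre>
<span style="color: DarkRed">CtMethod</span> m = cc.<span style="color: DarkRed">getDeclaredMethod</span>(<span style="color: blue;">"move"</span>);

<span style="color: green;">// The code fragments are given as plain Java source and compiled on the fly</span>
m.<span style="color: DarkRed">insertBefore</span>(<span style="color: blue;">"{ System.out.println(\"entering move\"); }"</span>);
m.<span style="color: DarkRed">insertAfter</span>(<span style="color: blue;">"{ System.out.println(\"leaving move\"); }"</span>);
</pre>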
<p>
When completely implementing or rewriting a method from scratch, using <code>CtNewMethod.make()</code> is in my opinion the most convenient approach. It enables the developer to <b>implement a method by providing Java source code</b> syntax in a simple string.
<br>
For instance:
</p>
<pre>
<span style="color: DarkRed">CtClass</span> point = <span style="color: DarkRed">ClassPool.getDefault().get</span>(<span style="color: blue;">"Point"</span>);
<span style="color: DarkRed">CtMethod</span> m = <span style="color: DarkRed">CtNewMethod.make</span>(
        <span style="color: blue;">"public void xmove(int dx) { x += dx; }"</span>,
        point);
point.<span style="color: DarkRed">addMethod</span>(m);
</pre>
<p>
<code>CtNewMethod</code> provides a lot of high level methods for implementing getters, setters and other commodity methods directly, sometimes even without having to bother providing an implementation on your own.
</p>
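<p>
For instance, a getter and a setter for a field can be generated in one line each; a minimal sketch (still assuming the hypothetical <code>test.Point</code> class with a field <code>x</code>):
</p>
<pre>
<span style="color: DarkRed">CtClass</span> cc = <span style="color: DarkRed">ClassPool.getDefault().get</span>(<span style="color: blue;">"test.Point"</span>);
<span style="color: DarkRed">CtField</span> f = cc.<span style="color: DarkRed">getDeclaredField</span>(<span style="color: blue;">"x"</span>);

<span style="color: green;">// both getter and setter bodies are generated by Javassist itself</span>
cc.<span style="color: DarkRed">addMethod</span>(<span style="color: DarkRed">CtNewMethod.getter</span>(<span style="color: blue;">"getX"</span>, f));
cc.<span style="color: DarkRed">addMethod</span>(<span style="color: DarkRed">CtNewMethod.setter</span>(<span style="color: blue;">"setX"</span>, f));
</pre>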
<p>
Some pretty complete information in this regard is available in the <a href="http://jboss-javassist.github.io/javassist/tutorial/tutorial2.html">second official Javassist tutorial</a>.
</p>
<a name="sec22"></a>
<h3>2.2 A gentle example with Javassist</h3>
<p>
We'll now see a simple yet complete example using Javassist: we'll implement the getters and setters for the fields of the class <code>TestData</code> introduced <a href="#sec11">here</a>, using bytecode manipulation.
<br>
Then we'll test the getter and setter for the field <code>myString</code>. Since these getters and setters are injected using bytecode manipulation at runtime, we'll have to use runtime reflection to call them:
</p>
<p>
(Reminder, <code><span style="color: DarkRed">I am coloring relevant Javassist API calls in dark red</span></code>)
</p>
<pre>
<b>package</b> ch.niceideas.common.utils;

<b>import</b> javassist.*;

<b>import</b> java.lang.reflect.InvocationTargetException;
<b>import</b> java.lang.reflect.Method;

<b>public class</b> TestJavassist {

    <b>public static class</b> TestData {
        <b>private int</b> i = <span style="color: blue;">0</span>;
        <b>private</b> String myString = <span style="color: blue;">"abc"</span>;
        <b>private long</b> value = <span style="color: blue;">-1</span>;
    }

    <span style="color: green;">// test Javassist</span>
    <b>public static void</b> main (String[] args) {
        <b>try</b> {
            <span style="color: DarkRed">ClassPool</span> cp = <span style="color: DarkRed">ClassPool.getDefault</span>();
            <span style="color: DarkRed">CtClass</span> clazz = cp.<span style="color: DarkRed">get</span>(<span style="color: blue;">"ch.niceideas.common.utils.TestJavassist$TestData"</span>);

            <b>for</b> (<span style="color: DarkRed">CtField</span> field : clazz.<span style="color: DarkRed">getDeclaredFields</span>()) {

                String camelCaseField = field.getName().substring(<span style="color: blue;">0</span>, <span style="color: blue;">1</span>).toUpperCase()
                        + field.getName().substring(<span style="color: blue;">1</span>);

                <span style="color: green;">// We don't need to mess with the implementation here. CtNewMethod has a</span>
                <span style="color: green;">// convenience method to implement a getter directly</span>
                <span style="color: DarkRed">CtMethod</span> fieldGetter = <span style="color: DarkRed">CtNewMethod.getter</span>(<span style="color: blue;">"get"</span> + camelCaseField, field);
                clazz.<span style="color: DarkRed">addMethod</span>(fieldGetter);

                <span style="color: green;">// Just for the sake of an example, we'll define the setter by actually</span>
                <span style="color: green;">// providing the implementation, not using the convenience method offered</span>
                <span style="color: green;">// by CtNewMethod</span>
                <span style="color: DarkRed">CtMethod</span> fieldSetter = <span style="color: DarkRed">CtNewMethod.make</span>(
                        <span style="color: blue;">"public void set"</span> + camelCaseField + <span style="color: blue;">" \n"</span> +
                        <span style="color: blue;">"        ("</span> + field.getType().getName() + <span style="color: blue;">" param) { \n"</span> +
                        <span style="color: blue;">"    this."</span> + field.getName() + <span style="color: blue;">" = param; \n"</span> +
                        <span style="color: blue;">"}"</span>,
                        clazz);
                clazz.<span style="color: DarkRed">addMethod</span>(fieldSetter);
            }

            <span style="color: green;">// Save class and make it available</span>
            cp.<span style="color: DarkRed">toClass</span>(clazz, Thread.currentThread().getContextClassLoader(), <b>null</b>);

            <span style="color: green;">// Now instantiate a new TestData</span>
            TestData td = <b>new</b> TestData();

            <span style="color: green;">// Get the value of the field 'myString' using the newly defined getter</span>
            Method getter = td.getClass().getDeclaredMethod(<span style="color: blue;">"getMyString"</span>);
            System.out.println (getter.invoke(td));

            <span style="color: green;">// Change the value of field 'myString' using the newly defined setter</span>
            Method setter = td.getClass().getDeclaredMethod(<span style="color: blue;">"setMyString"</span>, String.<b>class</b>);
            setter.invoke(td, <span style="color: blue;">"xyz"</span>);

            <span style="color: green;">// Get the value again</span>
            System.out.println (getter.invoke(td));

        } <b>catch</b> ( NotFoundException | CannotCompileException | NoSuchMethodException
                | IllegalAccessException | InvocationTargetException e) {
            e.printStackTrace();
        }
    }
}
</pre>
<a name="sec3"></a>
<h2>3. IoC</h2>
<p>
OK. Since the example code I want to implement below with Javassist is a simple IoC container, I guess I should present what IoC actually is beforehand.
</p>
<p>
<b>Inversion of Control</b> is a design pattern related to lifecycle management of components in an application benefiting from a services architecture.
<br>
In such an application, business components are usually implemented in the form of various services, such as business services, business managers, DAOs, etc. The main class delegates specific business concerns to business services, which delegate finer aspects in their turn to managers, which further delegate various business or technical aspects to smaller managers, DAOs, adapters, etc.
</p>
<p>
These various services need to know about each other to be able to call each other. Managing the construction and instantiation of these services is called <i>component lifecycle management</i>.
</p>
<p>
Very often, business services are stateless components, not keeping any state in instance variables or elsewhere. Traditionally, for a very long time, these stateless services have been implemented as <i>singletons</i>. For a very long time this was a very convenient approach, since the main singleton simply needed to <i>get</i> the other singletons it was using, which in turn simply needed to get the other singletons they were using, and so on.
<br>
By separating the instantiation of singletons from their initialization, in two stages, cycles could be handled easily and everyone was happy.
<br>
But with the rise of XP and unit testing, singleton-based applications turned out to suffer from a very critical drawback:
</p>
<p>
Singletons enforce strict dependencies on other service implementations at compile time, making it pretty much impossible to replace the dependencies with mock objects or stubs, as required for efficiently unit testing a specific service.
<br>
With singletons, testing a specific service often meant having to completely build and initialize the whole application, which can well turn into a nightmare.
</p>
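<p>
To make this drawback concrete, here is a minimal sketch of the kind of hard-wired dependency singletons create (the <code>OrderService</code> / <code>PaymentService</code> names are hypothetical):
</p>
<pre>
<b>public class</b> OrderService {

    <b>private static final</b> OrderService INSTANCE = <b>new</b> OrderService();
    <b>public static</b> OrderService getInstance() { <b>return</b> INSTANCE; }

    <span style="color: green;">// the dependency is hard-wired at compile time:</span>
    <span style="color: green;">// there is no way to substitute a stub or a mock in a unit test</span>
    <b>private final</b> PaymentService paymentService = PaymentService.getInstance();
}
</pre>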
<p>
Java EE was of course no answer to this problem, since it requires a Java EE container, which is even more of a nightmare (well, let's not get started on Java EE, shall we ?).
</p>
<p>
<b>Inversion of Control</b>
</p>
<p>
Inversion of control is initially mostly an answer to this problem: a way to increase the modularity of the application, make it more extensible and, more importantly, make it easier to test by removing the strict dependencies between components.
<br>
The key idea is to delegate the management of the lifecycle of the components and the injection of each component's dependencies to a framework, or rather a container, borrowing the term from Java EE, called here a <b>lightweight container</b>.
</p>
<p>
Instead of every component getting references to the singletons or specific instances it needs by building them itself, the container takes care of <b>instantiating the components</b>, managing their lifecycle in the required scope, and <b>injecting their dependencies</b> at runtime.
<br>
Injecting the dependencies at runtime, with a configurable approach (using a configuration file, annotations or even a dedicated API), opens the possibility of injecting a different implementation of a service depending on the context, as long as it respects the required interface.
<br>
For instance, injecting a <i>mock object</i> instead of the real deal for unit testing becomes straightforward.
</p>
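<p>
A minimal sketch of the idea (the <code>MailService</code> / <code>SignupService</code> names are hypothetical):
</p>
<pre>
<b>public interface</b> MailService { <b>void</b> send(String to, String body); }

<b>public class</b> SignupService {

    <b>private final</b> MailService mailService;

    <span style="color: green;">// the dependency is injected, not looked up as a singleton</span>
    <b>public</b> SignupService(MailService mailService) { <b>this</b>.mailService = mailService; }

    <b>public void</b> signup(String email) { mailService.send(email, <span style="color: blue;">"Welcome!"</span>); }
}

<span style="color: green;">// in a unit test, a mock implementation can be injected instead of the real one</span>
MailService mock = (to, body) -> System.out.println(<span style="color: blue;">"mock mail to "</span> + to);
<b>new</b> SignupService(mock).signup(<span style="color: blue;">"user@example.com"</span>);
</pre>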
<p>
<i>Inversion of Control</i> and <i>Dependency Injection</i> are two different things - and yet strongly related to each other - often confused in some documentation:
</p>
<ul>
<li><b>Inversion of Control</b> - IoC : is the name of the approach. It is often considered a design pattern, which, in my opinion, is wrong! IoC is an <i>architecture pattern</i>. But yeah, that is really no big deal.</li>
<li><b>Dependency Injection</b> - DI : is the name of a technique, a mechanism on which IoC often relies to take place. It consists in injecting the components required by a specific component at runtime, based on some configuration rules. DI is really just one aspect of IoC.</li>
</ul>
<a name="sec31"></a>
<h3>3.1 IoC history</h3>
<p>
Inversion of Control, as a term, was popularized in 1998 by the Apache Avalon team trying to engineer a "<i>Java Apache Server Framework</i>" for the growing set of server-side Java components and tools at Apache.<br>
To the Avalon team, it was clear that components receiving the various aspects of component assembly, configuration and lifecycle from outside was a superior design to components going and getting these themselves.
</p>
<p>
Later, the authors of the "<i>Java Open Source Programming</i>" book wrote XWork and WebWork2 to support their forthcoming book. Their concepts were very much like those from IoC/Avalon, but dependencies were passed into the component via setters. The need for those dependencies was declared in some accompanying XML.
<br>
Those were actually the first IoC frameworks close to the form such frameworks have today.
</p>
<p>
In 2002, Rod Johnson, leader of the Spring Framework, wrote the book "<i>Expert One-on-One : J2EE Design and Development</i>", which also discussed the concepts of setter injection, and introduced the codeline that ultimately became the Spring Framework at SourceForge in February 2003.
<br>
I myself discovered the concept in 2003 when reading that book, following the advice of Mr. Patrick Gras, whom I take the opportunity to thank a lot for this here.
</p>
<p>
This whole history is presented in detail on <a href="http://picocontainer.com/inversion-of-control-history.html">PicoContainer / Inversion of Control History</a> and can be represented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/859f9a17-9503-486d-b5ca-310cf6c3e4ca">
<img class="centered" style="width: 691px;" alt="IoC history timeline" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/859f9a17-9503-486d-b5ca-310cf6c3e4ca" />
</a><br>
<div class="centered">
(Source : <a href="http://picocontainer.com/inversion-of-control-history.html">PicoContainer / Inversion of Control History</a>)
</div>
</div>
<br>
<a name="sec32"></a>
<h3>3.2 IoC Principle</h3>
<p>
As stated above, in a usual application, the lifecycle of components starts with a main component (or class) that either creates the other services it requires or gets their singletons.
<br>
These other components, in their turn, create or get references to their own dependencies, and so on.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a8688128-c6be-4993-9328-1e8506ca2bc7">
<img class="centered" style="width: 300px;" alt="Usual Singleton Lookup" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a8688128-c6be-4993-9328-1e8506ca2bc7" />
</a>
</div>
<br>
<p>
With IoC, a container, called a lightweight container (as opposed to the Java EE crap, which makes for very heavy, and very bad, containers), takes care of instantiating and managing the lifecycle of the components as well as, more importantly, injecting the dependencies into every component.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/585582be-b4c9-4577-a905-1f8029321f95">
<img class="centered" style="width: 405px;" alt="IoC injection" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/585582be-b4c9-4577-a905-1f8029321f95" />
</a>
</div>
<br>
<a name="sec33"></a>
<h3>3.3 Various frameworks</h3>
<p>
The most important IoC containers today are the following:
</p>
<ul>
<li>The <a href="https://spring.io/">Spring Framework</a> is an application framework and inversion of control container for the Java platform. The core of spring is really about IoC and components management but nowadays there is a complete ecosystem of tools and side frameworks around spring core aimed at developing web application, ORM concerns, etc.</li>
<li>The <a href="http://picocontainer.com/introduction.html">Pico Container</a> is a very lightweight IoC Container and only that. Unlike spring, it is designed to remain small and simple and targets only IoC concerns, nothing else. It is not heavily maintained.</li>
<li><a href="http://tapestry.apache.org/">Apache Tapestry</a> is an open-source component-oriented Java web application framework conceptually similar to JavaServer Faces and Apache Wicket. It provides IoC concerns in addition to the web application framework.</li>
<li><a href="https://github.com/google/guice/wiki/GettingStarted">Google Guice</a> is an open source software framework for the Java platform released by Google. It provides support for dependency injection using annotations to configure Java objects.</li>
</ul>
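<p>
As a small taste of the latter, here is a minimal sketch of Guice-style configuration (the <code>ServiceA</code> / <code>ServiceB</code> / <code>ServiceBImpl</code> names are hypothetical):
</p>
<pre>
<b>import</b> com.google.inject.AbstractModule;
<b>import</b> com.google.inject.Guice;
<b>import</b> com.google.inject.Injector;

Injector injector = Guice.createInjector(<b>new</b> AbstractModule() {
    @Override
    <b>protected void</b> configure() {
        <span style="color: green;">// whenever a ServiceB is required, inject a ServiceBImpl</span>
        bind(ServiceB.<b>class</b>).to(ServiceBImpl.<b>class</b>);
    }
});
ServiceA serviceA = injector.getInstance(ServiceA.<b>class</b>);
</pre>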
<a name="sec4"></a>
<h2>4. SCIF : Simple and Cute IoC Framework </h2>
<p>
The rest of this article is dedicated to presenting the implementation of a very simple IoC framework using Javassist, more as a way to illustrate how easy and straightforward that is with Javassist than for any other reason :-)
<br>
Implementing Dependency Injection is actually a textbook use case for Javassist and a nice way to present the possibilities and ins and outs of bytecode manipulation.
<br>
We'll now see how to use Javassist in the light of a concrete use case: the implementation, in a little more than 300 lines of code, of a lightweight, simple but cute IoC container: SCIF - Simple and Cute IoC Framework.
</p>
<a name="sec41"></a>
<h3>4.1 Principle </h3>
<p>
SCIF - the system we want to build - is an MVP (Minimum Viable Product). We want it to implement <i>Dependency Injection</i> in its simplest form:
</p>
<ul>
<li>Services are managed by the framework and stored in a Service <i>Registry</i> </li>
<li>Services should declare the annotation <code>@Service</code> to be discovered by the framework. The framework searches the classpath for services declaring this annotation.</li>
<li>Dependencies are identified in services using the annotation <code>@Resource</code>. The framework analyzes services at runtime to discover their dependencies.</li>
<li>If <code>@Resource</code> is declared on a field, the framework injects the dependency directly, using runtime reflection.</li>
<li>If <code>@Resource</code> is declared on a getter, the framework uses bytecode manipulation to override the getter in a subclass and implement <i>lazy loading</i> of the dependency.</li>
</ul>
<p>
In case of getter (property) injection instead of field injection, SCIF is forced to generate a sub-class of the initial class and override the getter in that sub-class to implement lazy-loading.
<br>
This is a consequence of using an annotation to identify the services to be enhanced: in order for the framework to be able to query the annotation on a class, that class unfortunately needs to be loaded by the classloader first. And Javassist is not able to change a class once that class has been loaded (well, at least not easily).
<br>
Changing a class that is already loaded usually leads to a linkage error; it is forbidden without the use of very advanced techniques, too involved for this simple framework.
</p>
<p>
<b>Example</b>
</p>
<p>
The code below presents a <code>ServiceA</code> having two dependencies: <code>ServiceB</code> and <code>ServiceC</code>.
<br>
The first dependency, <code>ServiceB</code> is declared by <code>ServiceA</code> on the field itself, using the annotation <code>@Resource</code>.
<br>
The second dependency, <code>ServiceC</code>, is declared on the getter, indicating that the developer wants to benefit from lazy loading.
</p>
<p>
The example below illustrates in <code><span style="color: #CB0000;">red</span></code> the code or behaviour that should be injected at runtime by SCIF.
</p>
<pre>
<b>import</b> ch.niceideas.common.service.Registry;
<b>import</b> ch.niceideas.common.service.Service;

<b>import</b> javax.annotation.Resource;

<span style="color: green;">/** A Business Service */</span>
@Service
<b>public class</b> ServiceA {

    @Resource <span style="color: green;">// here we inject the dependency on the field</span>
    <b>private</b> ServiceB serviceB = <span style="color: #CB0000;">service_injected_by_reflection;</span>

    <b>private</b> ServiceC serviceC;

    <b>public</b> ServiceB getServiceB() { <b>return</b> serviceB; }
    <b>public void</b> setServiceB(ServiceB serviceB) { <b>this</b>.serviceB = serviceB; }

    @Resource <span style="color: green;">// here we inject the dependency on the property (getter)</span>
    <b>public</b> ServiceC getServiceC() { <b>return</b> serviceC; }
    <b>public void</b> setServiceC(ServiceC serviceC) { <b>this</b>.serviceC = serviceC; }
<span style="color: #CB0000;">
    // we want to use javassist to generate a sub-class on the fly, at runtime, to handle
    // the lazy loading of ServiceC at runtime
    <b>public static class</b> javassist_sub <b>extends</b> ServiceA {

        <b>private</b> Registry registry = injected_by_reflection;

        <b>public</b> ServiceC getServiceC() {
            ServiceC retObject = <b>super</b>.getServiceC();
            if (retObject == <b>null</b>) {
                retObject = (ServiceC) registry.getService(
                        ServiceC.<b>class</b>.getCanonicalName());
                <b>super</b>.setServiceC(retObject);
            }
            <b>return</b> retObject;
        }
    }</span>
}
</pre>
<p>
Now let's look at the design of the SCIF framework that enables this behaviour.
</p>
<a name="sec42"></a>
<h3>4.2 Design </h3>
<p>
SCIF is implemented by the following fundamental classes:
</p>
<ol>
<li>
<code><b>Registry</b></code> : a Registry is a service manager. It stores services following a specific scope passed to the <code>storeService</code> method.
<br>
Stored services can be retrieved regardless of the scope: they are searched for in the smallest scope first and then in larger scopes (a minimal sketch of this contract is given right after this list).
</li>
<li>
<code><b>StaticRegistry</b></code> : a static registry stores services in a static map.
<br>
Only the <code>APPLICATION</code> scope is supported. Attempting to store a service in another scope results in an exception.
</li>
<li>
<code><b>RegistryInitializer</b></code> : the RegistryInitializer is the most important component of SCIF.
<br>
It is responsible for:
<br>
<ul>
<li>Searching the classpath for classes declaring the @Service annotation</li>
<li>Injecting dependencies in the various forms supported by the IoC Framework:
<ul>
<li><b>Field injection</b>. This is done simply using runtime reflection.</li>
<li><b>Method (getter) injection</b>. This is done by dynamically generating a subclass that takes care of the
<i>Lazy Loading</i> of the dependency.
</li>
</ul>
</li>
</ul>
</li>
<li><code><b>@Service</b></code> : this annotation identifies services to be searched for in the classpath.</li>
<li><code><b>@Resource</b></code> : this annotation identifies dependencies to be injected either at field level or getter (property) level.</li>
</ol>
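<p>
As an illustration, here is a minimal sketch of what the <code>Registry</code> contract could look like (the exact signatures are assumptions, inferred from the calls shown later in this article):
</p>
<pre>
<b>public interface</b> Registry {

    <span style="color: green;">// store a service under the given name, in the given scope</span>
    <span style="color: green;">// (Scope is assumed to be an enum with at least an APPLICATION value)</span>
    <b>void</b> storeService(String serviceName, Object service, Scope scope);

    <span style="color: green;">// retrieve a service by name, searching the smallest scope first</span>
    Object getService(String serviceName);
}
</pre>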
<p>
The design is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/07d50408-4f22-42ae-aed8-f33c15f76b69">
<img class="centered" style="width: 750px;" alt="SCIF Design" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/07d50408-4f22-42ae-aed8-f33c15f76b69" />
</a>
</div>
<br>
<p>
<b>Rules regarding <code>@Resource</code> handling</b>
</p>
<p>
The <code>RegistryInitializer</code> analyzes the classes annotated with <code>@Service</code> and handles <code>@Resource</code> annotations in the following way:
</p>
<ul>
<li>If <code>@Resource</code> is declared on a field: in this case there is actually no need for bytecode manipulation. When the <code>RegistryInitializer</code> analyzes the services, it simply injects the reference into the field.</li>
<li>If <code>@Resource</code> is declared on a getter and a corresponding setter is found: in this case, the system assumes it can use the setter to implement a cache for lazy loading. A subclass is created which overrides the getter to implement lazy loading. When the overridden getter is called, it first calls the original getter to see if the service is already available. If it is, that service is returned. If it is not, the system gets the target service from the registry and uses the corresponding setter to store it before returning it.</li>
<li>If <code>@Resource</code> is declared on an arbitrary method or on a getter without a corresponding setter: in this case the system cannot assume it can use the underlying field (through the setter) as a cache, and it simply returns the target service from the registry at every call.</li>
</ul>
<p>
The <code>RegistryInitializer</code> uses the following tools / libraries:
</p>
<ul>
<li><code>ReflectionUtils</code> : a little helper class simplifying some operations related to Runtime Reflection.</li>
<li><code>Reflections</code> (<code>org.reflections.Reflections</code>) : an indispensable library when it comes to analyzing the classpath looking for classes declaring a given annotation.</li>
<li><code>Javassist</code> : the bytecode manipulation library.</li>
</ul>
<a name="sec43"></a>
<h3>4.3 Some focus on code</h3>
<p>
We'll see below the most important pieces of code of the SCIF Framework.
</p>
<p>
(Reminder, <code><span style="color: DarkRed">I am coloring relevant Javassist API calls in dark red</span></code>)
</p>
<p>
It all starts with the method <code>RegistryInitializer.init()</code> that takes care of the whole shebang:
</p>
<pre>
<span style="color: green;">/**
* Initialize a registry from the root package (prefix) given as argument.
* <br />
* The RegistryInitializer will search for classes annotated with @Service and add them
* to the returned registry.
*
* @param rootPackage
* @return A Registry containing all discovered services
* @throws RegistryInitializationException in case of any error
*/</span>
<b>public static</b> Registry init(String rootPackage) <b>throws</b> RegistryInitializationException {
<b>try</b> {
Reflections reflections = <b>new</b> Reflections(rootPackage);
Set<Class<?>> annotated = reflections.getTypesAnnotatedWith(Service.class);
<span style="color: green;">// Create service wrappers</span>
Map<Class<?>, ServiceWrapper> wrappers = initServiceWrappers(annotated);
<span style="color: green;">// Build set of methods and fields to be analyzed</span>
analyzeWrappers(wrappers);
<span style="color: green;">// Now the complicated part, overwrite getters / methods using Javassist,</span>
<span style="color: green;">// dynamically creating a subclass with bytecode manipulation</span>
enhancePropertyGetters(wrappers);
StaticRegistry registry = <b>new</b> StaticRegistry();
<span style="color: green;">// Instantiate all the services and store them in Registry
// with class name as service name</span>
initializeRegistry(wrappers, registry);
<span style="color: green;">// Then do the easy thing : inject service on dependencies expressed on fields</span>
proceedFieldInjection(wrappers, registry);
<b>return</b> registry;
} <b>catch</b> (ReflectiveOperationException | NotFoundException | CannotCompileException e) {
logger.error (e, e);
<b>throw</b> new RegistryInitializationException (e.getMessage(), e);
}
}
</pre>
<p>
The really interesting call here is <code>enhancePropertyGetters(wrappers);</code>. This is where we use bytecode manipulation to dynamically generate the subclass and override the getters declaring the <code>@Resource</code> annotation.
<br>
We won't present the other methods, but let's look at the listing of this <code>enhancePropertyGetters()</code> method:
</p>
<pre>
<b>private static void</b> enhancePropertyGetters(Map<Class<?>, ServiceWrapper> wrappers)
        <b>throws</b> NotFoundException, CannotCompileException, ClassNotFoundException {

    <span style="color: DarkRed">ClassPool.doPruning</span> = <b>true</b>;
    <span style="color: DarkRed">ClassPool</span> pool = <span style="color: DarkRed">ClassPool.getDefault</span>();
    pool.<span style="color: DarkRed">appendClassPath</span>(<b>new</b> <span style="color: DarkRed">LoaderClassPath</span>(
            Thread.currentThread().getContextClassLoader()));

    <b>for</b> (ServiceWrapper wrapper : wrappers.values()) {
        <b>if</b> (wrapper.getMethodsToEnhance().size() > <span style="color: blue;">0</span>) {

            <span style="color: DarkRed">CtClass</span> superClazz = pool.<span style="color: DarkRed">get</span>(wrapper.getServiceName());

            <span style="color: green;">// Unfortunately I need to go with the sub-class approach</span>
            <span style="color: green;">// I cannot change the original class since it has already been loaded and</span>
            <span style="color: green;">// javassist cannot change a class that is already loaded (that would require</span>
            <span style="color: green;">// changing linking and javassist cannot do that)</span>
            <span style="color: DarkRed">CtClass</span> clazz = pool.<span style="color: DarkRed">makeClass</span>(wrapper.getServiceName() + <span style="color: blue;">"$javassist_sub"</span>);
            clazz.<span style="color: DarkRed">stopPruning</span>(<b>true</b>);
            clazz.<span style="color: DarkRed">setSuperclass</span>(superClazz);
            clazz.<span style="color: DarkRed">setModifiers</span>(Modifier.PUBLIC);

            ...

            <span style="color: green;">// Add registry on class if not already one. The field might already have been</span>
            <span style="color: green;">// added on a parent class. If this is the case, don't add it again</span>
            injectRegistryField(pool, clazz);

            <span style="color: green;">// Proceed with method modification</span>
            <b>for</b> (Method method : wrapper.getMethodsToEnhance()) {

                <span style="color: green;">// Various cases :</span>

                <span style="color: green;">// 1. Method doesn't have the form of a getter</span>
                <b>if</b> (!method.getName().startsWith(<span style="color: blue;">"get"</span>)) {
                    <span style="color: green;">// => Just override method so it returns the service from registry</span>
                    overrideMethod(clazz, method);
                }

                <span style="color: green;">// 2. Method is a getter</span>
                <b>else</b> {
                    <b>try</b> {
                        Method setter = ReflectionUtils.getSetter(wrapper.getServiceClass(),
                                ReflectionUtils.getPropertyName(method),
                                method.getReturnType());

                        <span style="color: green;">// 2.2 setter is found</span>
                        <span style="color: green;">// => Lazy loading : use underlying field as cache.</span>
                        <span style="color: green;">//    If it is set, do nothing.</span>
                        <span style="color: green;">//    If it is null, look for the service and attach it.</span>
                        <span style="color: green;">//    Needs to override the getter for doing my business and</span>
                        <span style="color: green;">//    delegate to the setter for setting the field</span>
                        overrideGetter(clazz, method, setter);

                    } <b>catch</b> (NoSuchMethodException e) {
                        logger.debug (e, e);

                        <span style="color: green;">// 2.1 No setter could be found</span>
                        <span style="color: green;">// => Just rewrite method so it returns the service</span>
                        <span style="color: green;">//    from the registry</span>
                        overrideMethod(clazz, method);
                    }
                }
            }

            ...

            <span style="color: green;">// make new subclass available to classloader</span>
            pool.<span style="color: DarkRed">toClass</span>(clazz, Thread.currentThread().getContextClassLoader(), <b>null</b>);
            clazz.<span style="color: DarkRed">stopPruning</span>(<b>false</b>);

            <span style="color: green;">// use the new subclass instead of the original class from now on</span>
            Class<?> subClazz = Class.forName(clazz.getName());
            wrapper.overrideClass(subClazz);
        }
    }
}
</pre>
<p>
In the code above, the interesting calls are <code>overrideGetter()</code>, <code>overrideMethod()</code> and <code>injectRegistryField()</code> since these are the methods where bytecode manipulation occurs.
<br>
Let's look at these methods:
</p>
<pre>
<b>private static void</b> overrideGetter(<span style="color: DarkRed">CtClass</span> clazz, Method getter, Method setter)
        <b>throws</b> CannotCompileException {
    String targetService = getter.getReturnType().getCanonicalName();
    <span style="color: DarkRed">CtMethod</span> newMethod = <span style="color: DarkRed">CtNewMethod.make</span>(
            <span style="color: blue;">"public "</span> + targetService + <span style="color: blue;">" "</span> + getter.getName() + <span style="color: blue;">"() { \n"</span> +
            <span style="color: blue;">""</span> +
            <span style="color: blue;">"    "</span> + targetService + <span style="color: blue;">" retObject = super."</span> + getter.getName() + <span style="color: blue;">"(); "</span> +
            <span style="color: blue;">"    if (retObject == null) {"</span> +
            <span style="color: blue;">"        retObject = ("</span> + targetService + <span style="color: blue;">") \n"</span> +
            <span style="color: blue;">"            getRegistry().getService(\""</span> + targetService + <span style="color: blue;">"\"); "</span> +
            <span style="color: blue;">"        super."</span> + setter.getName() + <span style="color: blue;">"(retObject);"</span> +
            <span style="color: blue;">"    }"</span> +
            <span style="color: blue;">"    return retObject;"</span> +
            <span style="color: blue;">""</span> +
            <span style="color: blue;">"}"</span>,
            clazz);
    clazz.<span style="color: DarkRed">addMethod</span>(newMethod);
}

<b>private static void</b> overrideMethod(<span style="color: DarkRed">CtClass</span> clazz, Method method)
        <b>throws</b> CannotCompileException {
    String targetService = method.getReturnType().getCanonicalName();
    <span style="color: DarkRed">CtMethod</span> newMethod = <span style="color: DarkRed">CtNewMethod.make</span>(
            <span style="color: blue;">"public "</span> + targetService + <span style="color: blue;">" "</span> + method.getName() + <span style="color: blue;">"() { \n"</span> +
            <span style="color: blue;">""</span> +
            <span style="color: blue;">"    "</span> + targetService + <span style="color: blue;">" retObject = \n"</span> +
            <span style="color: blue;">"        ("</span> + targetService + <span style="color: blue;">")\n"</span> +
            <span style="color: blue;">"        getRegistry().getService(\""</span> + targetService + <span style="color: blue;">"\"); "</span> +
            <span style="color: blue;">"    return retObject;"</span> +
            <span style="color: blue;">""</span> +
            <span style="color: blue;">"}"</span>,
            clazz);
    clazz.<span style="color: DarkRed">addMethod</span>(newMethod);
}

<span style="color: green;">/**
 * Inject a field to store the registry in the target clazz as well as a getter
 * to retrieve that registry.
 *
 * @param pool the Javassist ClassPool to be used
 * @param clazz the class to be modified this way
 * @throws NotFoundException
 * @throws CannotCompileException
 */</span>
<b>public static void</b> injectRegistryField(<span style="color: DarkRed">ClassPool</span> pool, <span style="color: DarkRed">CtClass</span> clazz)
        <b>throws</b> NotFoundException, CannotCompileException {

    <span style="color: DarkRed">CtField</span> registryField = <b>null</b>;
    <b>try</b> {
        registryField = clazz.<span style="color: DarkRed">getField</span>(<span style="color: blue;">"registry"</span>);
    } <b>catch</b> (<span style="color: DarkRed">NotFoundException</span> e) {
        <span style="color: green;">// ignored</span>
    }

    <b>if</b> (registryField == <b>null</b>) {

        <span style="color: DarkRed">CtClass</span> registryClass = pool.<span style="color: DarkRed">get</span>(Registry.<b>class</b>.getName());
        registryField = <b>new</b> <span style="color: DarkRed">CtField</span>(registryClass, <span style="color: blue;">"registry"</span>, clazz);
        registryField.<span style="color: DarkRed">setModifiers</span>(Modifier.setPrivate(Modifier.STATIC));
        clazz.<span style="color: DarkRed">addField</span>(registryField, <span style="color: blue;">"null"</span>);

        <span style="color: DarkRed">CtMethod</span> registryGetter = <span style="color: DarkRed">CtNewMethod.getter</span>(<span style="color: blue;">"getRegistry"</span>, registryField);
        registryGetter.<span style="color: DarkRed">setModifiers</span>(Modifier.PUBLIC);
        clazz.<span style="color: DarkRed">addMethod</span>(registryGetter);

        <span style="color: DarkRed">CtMethod</span> registrySetter = <span style="color: DarkRed">CtNewMethod.make</span>(
                <span style="color: blue;">"public static void setRegistry ("</span> + Registry.<b>class</b>.getName() +
                <span style="color: blue;">" holder) { "</span> +
                <span style="color: blue;">"    registry = holder; "</span> +
                <span style="color: blue;">"} "</span>,
                clazz);
        clazz.<span style="color: DarkRed">addMethod</span>(registrySetter);
    }
}
</pre>
<p>
We've seen the most important pieces of code of the SCIF Framework above.
<br>
The framework itself is available for download in the next section.
</p>
<a name="sec44"></a>
<h3>4.4 DemoApp : Comet Tennis</h3>
<p>
I integrated the SCIF framework into the comet-tennis demo application, a small application I initially wrote here : <a href="https://www.niceideas.ch/roller2/badtrash/entry/comet-having-fun-with-the#sec31">comet-tennis</a>.
<br>
This application uses a few services and the idea here is to use the SCIF framework to manage these services and inject their dependencies.
</p>
<p>
The new package with the SCIF framework integrated is available <a download="comet_tennis_src_0.4.tar.gz" type="application/tar+gzip" href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2eeba15c-dc3c-4546-adf8-a41cd4b31932">here</a>.
</p>
<a name="sec5"></a>
<h2>5. Conclusion </h2>
<p>
Bytecode manipulation is a lot of fun and opens a whole new world of possibilities on the JVM. It's the only way to implement advanced tooling such as IoC Containers, ORM frameworks, boilerplate code generators, etc.
<br>
Normally, bytecode manipulation is something rather difficult to achieve ... except with Javassist.
<br>
Javassist makes bytecode manipulation easy and straightforward. The ability to write actual Java source code dynamically in simple strings and add it on the fly as bytecode to the classes being manipulated is striking. Javassist is in my opinion the simplest way to perform bytecode manipulation in Java.
</p>
<p>
I covered <a href="#sec12">above</a> some use cases for bytecode manipulation, there are many others, for instance tampering with licence checking systems of non-free software (Hush. I said nothing)
</p>
<p>
In my career, I have encountered many situations where I wished afterwards that I had known Javassist, since it would have been pretty helpful. Let me mention two:
</p>
<ol>
<li>
Some 15 years ago, I was working on a pretty big J2EE WebSphere application with a lot of EJBs. Tracking the user flow in the distributed system was a nightmare due to the complexity of the business processes and the business rules, so we ended up adding logging statements each and every time a business method was entered and left, such as <code>log.debug ("ENTER - business method")</code> and <code>log.debug ("LEAVE - business method")</code>.
<br>
In regards to troubleshooting, this may sound stupid, but it ended up being not only pretty convenient but really our single and only way to figure out what was going on in some situations in such an enormous piece of software.
<br>
Adding these two lines of code (plus a few <code>try { ... } finally { ... }</code> statements to make sure the leaving trace was always output) made us add thousands of lines of code to the application ... which could have been replaced by a Java agent of a few lines of code and some Javassist magic.
</li>
<li>
Some 10 years ago, I was working for a banking institution on a big Java application making extensive use of Hibernate. The problem there was that we were trying to map a nice and meaningful business model to a legacy data model. With a lot of Hibernate tricks we pretty much succeeded in achieving the mapping, using a lot of custom and pretty tricky code in Hibernate session listeners to handle the relationships that Hibernate was not able to handle natively.
<br>
There as well, we ended up writing thousands of lines of specific glue code in Hibernate listeners, which we could have replaced by a pretty simple Javassist-based framework complementing the missing features of Hibernate.
</li>
</ol>
<p>
You might want to have a look at the second article in this series, available here : <a href="https://www.niceideas.ch/roller2/badtrash/entry/bytecode-manipulation-with-javassist-for1">Bytecode manipulation with Javassist for fun and profit part II: Generating toString and getter/setters using bytecode manipulation</a>.
</p>
<p>
Part of this article is available as a slideshare presentation here: <a href="https://www.slideshare.net/JrmeKehrli/bytecode-manipulation-with-javassist-for-fun-and-profit">https://www.slideshare.net/JrmeKehrli/bytecode-manipulation-with-javassist-for-fun-and-profit</a>.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/lean-startup-a-focus-on
The Lean Startup - A focus on Practices
Jerome Kehrli
2017-01-28T05:05:07-05:00
2020-07-22T03:14:48-04:00
<!-- The Lean Startup - A focus on Practices -->
<p>
A few years ago, I worked intensively on a pet project: AirXCell (long gone ...)
<br>
What was at first a framework and tool I had to write for my Master Thesis, dedicated to quantitative research in finance, became after a few months pretty much my most essential focus in life.
<br>
Initially it was really intended to be only a tool providing me with a Graphical User Interface on top of all these smart calculations I was doing in R. After my master thesis, I surprised myself continuing to work on it, improving it a little here and a little there. I kept on doing that until the moment I figured I was dedicating several hours to it every day after my day job.
<br>
Pretty soon, I figured I was really holding an interesting piece of software and I became convinced I could make something out of it and eventually, why not, start a company.
</p>
<p>
And of course I did it all wrong.
</p>
<p>
Instead of <b>finding out first if there was a need and a market for it</b>, and then <i>what I should really build to answer this need</i>, I spent hours every day and most of my week-ends developing it further towards what I was convinced was the minimum set of features it should hold before I actually tried to meet some potential customers to tell them about it.
<br>
So I did that for more than a year and a half, until I came close to burn-out and sent it all to hell.
</p>
<p>
Now the project hasn't evolved for three years. The thing is that I just don't want to hear about it anymore. I burnt myself out and I am just disgusted about it. Honestly, it is pretty likely that at the time you read this article, the link above is not even reachable anymore.
<br>
When I think of the amount of time I <strike>invested</strike> wasted in it, and the fact that even now, three years later, I still just don't want to hear anything about this project anymore, I feel so ashamed. Ashamed that I didn't take a step back, read a few books about startup creation, and maybe, who knows, discover <i>The Lean Startup</i> movement earlier.
<br>
Even now, I still have never met any potential customer, any market representative. Even worse: I'm still pretty convinced that there is a need and a market for such a tool. But I'll never know for sure.
</p>
<p>
Such stories, and even worse, stories of startups burning millions of dollars for nothing in the end, happen every day, still today.
</p>
<p>
Some years ago, Eric Ries, Steve Blank and others initiated <b><i>The Lean Startup</i></b> movement. <i>The Lean Startup</i> is a movement, an inspiration, a set of principles and practices that any entrepreneur initiating a startup would be well advised to follow.
<br>
Projecting myself into it, I think that if I had read Ries' book before, or even better Blank's book, I would maybe own my own company today, around AirXCell or another product, instead of being disgusted and honestly not considering it for the near future.
<br>
In addition to giving a pretty important set of principles when it comes to creating and running a startup, <i>The Lean Startup</i> also implies an extended set of Engineering practices, especially software engineering practices.
</p>
<p>
This article focuses on presenting and detailing these <b>Software Engineering Practices from the Lean Startup Movement</b> since, in the end, I believe they can benefit any kind of company, from budding startups to well-established companies with software development activities.
<br>
By software engineering practices, I mean software development practices of course, but not only: engineering is also about analyzing the features to be implemented, understanding the customer need and building a successful product, not just writing code.
</p>
<p>
Part of this article is available as a slideshare presentation here :
<a href="http://www.slideshare.net/JrmeKehrli/lean-startup-72100971">http://www.slideshare.net/JrmeKehrli/lean-startup-72100971</a> as well as a PDF document here : <a href="https://www.niceideas.ch/lean-startup.pdf">https://www.niceideas.ch/lean-startup.pdf</a>.
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. The Lean Startup</a>
<ul>
<li><a href="#sec11">1.1 Origins </a></li>
<li><a href="#sec12">1.2 The movement</a></li>
<li><a href="#sec13">1.3 Principles </a></li>
<li><a href="#sec14">1.4 The Feedback Loop </a></li>
<li><a href="#sec15">1.5 Business Model Canvas and Lean Canvas </a></li>
<li><a href="#sec16">1.6 Customer Development</a></li>
</ul>
</li>
<li><a href="#sec2">2. The four steps to the Epiphany</a>
<ul>
<li><a href="#sec21">2.1 Overview</a></li>
<li><a href="#sec22">2.2 A 4 steps process</a></li>
</ul>
</li>
<li><a href="#sec3">3. Lean startup practices</a>
<ul>
<li><a href="#sec31">3.1 Customer Discovery</a>
<ul>
<li><a href="#sec311">3.1.1 Get out of the building</a></li>
<li><a href="#sec312">3.1.2 Problem interview</a></li>
<li><a href="#sec313">3.1.3 Solution interview</a></li>
</ul>
</li>
<li><a href="#sec32">3.2 Customer Validation</a>
<ul>
<li><a href="#sec321">3.2.1 MVP</a></li>
<li><a href="#sec322">3.2.2 Fail Fast</a></li>
</ul>
</li>
<li><a href="#sec33">3.3 Re-adapt the product</a>
<ul>
<li><a href="#sec331">3.3.1 Metrics Obsession</a></li>
<li><a href="#sec332">3.3.2 Pivot</a></li>
</ul>
</li>
<li><a href="#sec34">3.4 Get new customers</a>
<ul>
<li><a href="#sec341">3.4.1 Pizza Teams</a></li>
<li><a href="#sec342">3.4.2 Feature Teams</a></li>
<li><a href="#sec343">3.4.3 Build vs. Buy</a></li>
<li><a href="#sec344">3.4.4 A/B Testing</a></li>
<li><a href="#sec345">3.4.5 Scaling Agile</a></li>
</ul>
</li>
<li><a href="#sec35">3.5 Company creation</a></li>
</ul>
</li>
<li><a href="#sec4">4. Conclusions</a></li>
</ul>
<a name="sec1"></a>
<h2>1. The Lean Startup</h2>
<p>
<b>The Lean Startup</b> is today a movement, initiated and supported by some key people that I'll present below.
<br>
But it's also a framework, an inspiration, an approach, a methodology with a set of fundamental principles and practices for helping entrepreneurs increase their odds of building a successful startup.
<br>
Lean Startup cannot be thought of as a set of tactics or steps. Don't expect any checklist (well, at least not only checklists) or any recipe to be applied blindly.
</p>
<p>
The approach is built around two main objectives:
</p>
<ol>
<li>Teaching entrepreneurs how to drive a startup through the process of steering (<i>Build-Measure-Learn</i> feedback loop). </li>
<li>Enabling entrepreneurs to scale and grow the business with maximum acceleration</li>
</ol>
<p>
<b>Lean Startup Practices</b>
</p>
<p>
The Lean Startup methodology can be divided in two sets of practices:
</p>
<ol>
<li>The <b>steering practices</b> : designed to minimize the total time through the Build-Measure-Learn feedback loop and </li>
<li>The <b>acceleration practices</b> : which allow Lean Startups to grow without sacrificing the startup's speed and agility</li>
</ol>
<p>
This is developed further in <a href="#sec2">2. The four steps to the Epiphany</a>.
</p>
<a name="sec11"></a>
<h3>1.1 Origins </h3>
<p>
<b>The Lean Movement</b>
</p>
<p>
<b><a href="https://en.wikipedia.org/wiki/Lean_thinking">Lean thinking</a></b> is a <b>business methodology</b> that aims to provide a new way to think about how to organize human activities to deliver more benefits to society and value to individuals while <b>eliminating waste</b>.
<br>
Lean thinking is a <b>new way of thinking about any activity</b>, seeing the waste inadvertently generated by the way the process is organized.
</p>
<p>
The aim of lean thinking is to create a <b>lean enterprise</b>, one that <b>sustains growth</b> by aligning customer satisfaction with employee satisfaction, and that <b>offers innovative products</b> or services profitably while <b>minimizing unnecessary over-costs</b> to customers, suppliers and the environment.
</p>
<p>
The Lean Movement finds its roots in Toyotism and values <b>performance</b> and <b>continuous improvement</b>. It really rose in the early 90's, and the lean tradition has adopted a number of practices from Toyota's own learning curve.
<br> Some are worth mentioning:
</p>
<ul>
<li><b><a href="https://en.wikipedia.org/wiki/Kaizen">Kaizen</a></b> (Continuous Improvement) : is a strategy where employees at all levels of a company work together pro-actively to achieve regular, incremental improvements to the manufacturing process. The point of Kaizen is that improvement is a normal part of the job, not something to be done "when there is time left after having done everything else", that should involve the company as a whole, from the CEO to the assembly line workers.</li>
<li><b><a href="https://en.wikipedia.org/wiki/Kanban">Kanban</a></b> (Visual Billboard) : is a scheduling system and visual management tool used in <i>Lean Manufacturing</i> to signal steps in their manufacturing process. The system's highly visual nature allows teams to communicate more easily on what work needed to be done and when. It also standardizes cues and refines processes, which helps to reduce waste and maximize value.</li>
</ul>
<p>
Plus a strong emphasis on <i>Autonomation</i>, <i>Visualization</i>, etc.
</p>
<p>
<b>The Lean Startup</b>
</p>
<p>
The author, I should say <i>initial author</i>, of the Lean Startup methodology, Eric Ries, explains in his book "<i>The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses</i>", that traditional management practices and ideas are not adequate to tackle the entrepreneurial challenges of startups.
</p>
<p>
By exploring and studying new and existing approaches, Ries found that adapting <i>Lean thinking</i> to the context of entrepreneurship would allow to discern between <i>value-creating activities</i> versus <i>waste</i>.
</p>
<p>
Thus, Ries decided to apply lean thinking to the process of innovation. After its initial development and some refinement, as he states, <b>the Lean Startup</b> represents a new approach to creating continuous innovation, building on many previous management and product development ideas, including lean manufacturing, design thinking, customer development, and agile development.
</p>
<a name="sec12"></a>
<h3>1.2 The movement </h3>
<p>
I would highly recommend this enlightening article - <a href="http://www.salimvirani.com/the-history-of-leanstartup-and-how-to-make-sense-of-it-all/">The History Of Lean Startup</a> - that does a pretty great job explaining how and why the following guys got together and initiated the <i>Lean Startup Movement</i> (aside from a few things I do not agree with).
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/8b002746-dd21-4668-970e-dafcaa864567">
<img class="centered" style="width: 500px; " alt="Lean Startup Movement" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/8b002746-dd21-4668-970e-dafcaa864567" />
</a>
</div>
<br>
<p>
Blank, Ries, Osterwalder and Maurya are the founders or initiators of the <i>Lean Startup Movement</i>. Eric Ries is considered as the leader of the movement, while Steve Blank considers himself as its godfather.
<br>
Osterwalder and Maurya's work on business models is considered to fill a gap in Ries and Blank's work on processes, principles and practices. In Steve Blank's "<i>The Four Steps to the Epiphany</i>", the business model section is a vague single page.
<br>
Furthermore, Maurya's "<i>Running Lean</i>" magnificently completes Blank's work on <i>Customer Development</i>. We'll get to that.
</p>
<a name="sec13"></a>
<h3>1.3 Principles </h3>
<p>
In my opinion, the most fundamental aspect of Lean Startup is the <i>Build-Measure-Learn</i> loop, or, in the context of the <i>Customer Development Process</i>, the <i>Customer Discovery - Customer Validation - Re-adapt the product</i> loop. <br>
The idea is to be able to loop in laboratory mode, mostly with prototypes and interviews, in an iterative research process, learning about the product to be developed at as little cost as possible. A startup should invest as little as possible in product development as long as it has no certainty regarding the customer needs, the right product to be developed, the potential market, etc.
<br>
This is really key: before hiring employees and starting to develop a product, the entrepreneur should have certainty about the product to be developed and its market.
<br>
Premature scaling is the immediate cause of the <a href="https://steveblank.com/2009/09/07/the-customer-development-manifesto-the-death-spiral-part-3/">Death Spiral</a>.
</p>
<p>
Before digging any further into this, below are the essential principles that characterize <i>The Lean Startup</i> approach, as reported by Eric Ries' book.
</p>
<p>
<b>Entrepreneurs are everywhere</b>
</p>
<p>
You don't have to work in a garage to be in a startup. The concept of entrepreneurship includes anyone who works within Eric Ries' definition of a startup, which I like very much BTW.
<br>
His definition is as follows :
</p>
<div class="centering">
<div class="centered">
<b>A startup is a human institution designed to create new products and services under conditions of extreme uncertainty.</b>
</div>
</div>
<br>
<p>
That means entrepreneurs are everywhere and the Lean Startup approach can work in any size company, even a very large enterprise, in any sector or industry.
</p>
<p>
<b>Entrepreneurship is management</b>
</p>
<p>
A startup is an institution, not just a product, and so it requires a new kind of management specifically geared to its context of extreme uncertainty.
<br>
In fact, Ries believes "<i>entrepreneur</i>" should be considered a job title in all modern companies that depend on innovation for their future growth.
</p>
<p>
<b>Validated learnings</b>
</p>
<p>
Startups exist not just to make stuff, make money, or even serve customers. They exist to learn how to build a sustainable business. This learning can be validated scientifically by running frequent experiments that allow entrepreneurs to test each element of their vision.
</p>
<p>
<b>Innovation accounting</b>
</p>
<p>
To improve entrepreneurial outcomes and hold innovators accountable, we need to focus on the boring stuff: how to measure progress, how to set up milestones, and how to prioritize work. This requires a new kind of accounting designed for startups, and for the people who hold them accountable.
</p>
<p>
<b>Build-Measure-Learn</b>
</p>
<p>
The fundamental activity of a startup is to turn ideas into products, measure how customers respond, and then learn whether to <b>pivot or persevere</b>. All successful startup processes should be geared to accelerate that <b>feedback loop</b>.
</p>
<a name="sec14"></a>
<h3>1.4 The Feedback Loop</h3>
<p>
The feedback loop is represented as below.
<br>
The five-part version of the <i>Build-Measure-Learn</i> schema helps us see that the real intent of building is to test "<i>ideas</i>" - not just to build blindly without an objective.
<br>
The need for "<i>data</i>" indicates that after we measure our experiments we'll use the data to further refine our learning. And the new learning will influence our next ideas. So we can see that the goal of Build-Measure-Learn isn't just to build things, the goal is to build things to validate or invalidate the initial idea.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4b4e8fe7-e841-46cc-996b-6b00df12b175">
<img class="centered" style="width: 650px;" alt="Build-Measure-Learn" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4b4e8fe7-e841-46cc-996b-6b00df12b175" />
</a>
</div>
<br>
<p>
Again, the goal of <i>Build-Measure-Learn</i> is not to build a final product to ship or even to build a prototype of a product, but to <b>maximize learning</b> through incremental and iterative engineering.
<br>
In this case, learning can be about product features, customer needs, distribution channels, the right pricing strategy, etc.
<br>
The "<i>build</i>" step refers to building an <a href="#sec321"><b>MVP</b></a> (Minimal Viable Product).
<br>
It's critical here to understand that an MVP does not mean <i>the product with fewer features</i>. Instead, an MVP should be seen as the simplest thing that you can show to customers to get the most learning at that point in time. Early on in a startup, an MVP could well simply be a set of Powerpoint slides with some fancy animations, or whatever is sufficient to demonstrate a set of features to customers and get feedback from it. Each time one builds an MVP one should also define precisely what one wants to test/measure.
<br>
Later, as more is learned, the MVP goes from low-fidelity to higher fidelity, but the goal continues to be to maximize learning not to build a beta/fully featured prototype of a product or a feature.
</p>
<p>
In the end, the <i>Build-Measure-Learn</i> framework lets startups be fast, agile and efficient.
</p>
<a name="sec15"></a>
<h3>1.5 Business Model Canvas and Lean Canvas </h3>
<p>
Business model evolution and the related processes were surprisingly missing from, or poorly addressed in, Ries' and Blank's initial work.
<br>
Fortunately, Osterwalder and Maurya caught up and filled the gap.
</p>
<p>
<b>Business Model Canvas</b>
</p>
<p>
The <a href="https://en.wikipedia.org/wiki/Business_Model_Canvas">Business Model Canvas</a> is a strategic management template invented by Alexander Osterwalder and Yves Pigneur for developing new business models or documenting existing ones.
<br>
It is a visual chart with elements describing a company's value proposition, infrastructure, customers, and finances. It assists companies in aligning their activities by illustrating potential trade-offs.
</p>
<p>
<b>Lean Canvas</b>
</p>
<p>
The Lean Canvas is a version of the Business Model Canvas adapted by Ash Maurya specifically for startups. The Lean Canvas focuses on addressing broad customer problems and solutions and delivering them to customer segments through a unique value proposition.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4446245e-b296-4c9f-84d0-c6a5fb1bd747">
<img class="centered" style="width: 500px; " alt="Lean Canvas" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4446245e-b296-4c9f-84d0-c6a5fb1bd747" />
</a>
</div>
<br>
<p>
So how should one use the Lean Canvas?
</p>
<ol>
<li><b>Customer Segment and Problem</b><br>
Both Customer Segment and Problem sections should be filled in together.
<br>
Fill in the list of potential <i>customers</i> and <i>users</i> of your product, distinguishing customers (willing to pay) clearly from users, then refine each and every identified customer segment. Be careful not to target too broad a segment at first; think of Facebook, whose first segment was only Harvard students.
<br>
Carefully fill in the problems encountered by your identified customers.
</li>
<li><b>UVP - Unique Value Proposition</b><br>
The UVP is the unique characteristic of your product or service that makes it different from what is already available on the market and worth your customers' consideration. Focus on the main problem you are solving and what makes your solution different.
</li>
<li><b>Solution</b><br>
Filling this in is initially tricky, since knowing the real solution requires trial and error, the build-measure-learn loop, etc. At this early stage one shouldn't try to be too precise; keep things fairly open.
</li>
<li><b>Channels</b><br>
This consists in answering: how will you get in touch with your users and customers? How do you get them to know about your product? Indicate your communication channels clearly.
</li>
<li><b>Revenue Stream and Cost Structure</b><br>
Both these sections should also be filled in together.
<br>
At first, at the initial stage of the startup, this should really be focused on the costs and revenues related to launching the MVP (how to interview 50 customers? What's the initial burn rate? etc.)
<br>
Later this should evolve towards an initial startup structure and focus on identifying the <i>break-even</i> point by answering the question: how many customers are required to cover my costs?
</li>
<li><b>Key Metrics</b><br>
Ash Maurya refers to Dave McClure's Pirate Metrics to identify the relevant KPIs to follow (a minimal sketch of this funnel computation follows the list below): <br>
Acquisition - How do users find you?<br>
Activation - Do users have a great first experience?<br>
Retention - Do users come back?<br>
Revenue - How do you make money?<br>
Referral - Do users tell others?
</li>
<li><b>Unfair Advantage</b><br>
This consists in indicating the adoption barriers as well as the competitive advantages of your solution. An <i>unfair advantage</i> is defined as something that cannot easily be copied or bought.
</li>
</ol>
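<p>
To make the Pirate Metrics concrete, here is a minimal sketch - in Python, with an invented event log and invented field names - of how one could compute the AARRR funnel from raw analytics events. It is an illustration under these assumptions, not a prescribed implementation:
</p>
<pre><code>
# Minimal sketch of Dave McClure's AARRR "Pirate Metrics" funnel.
# The event log and its format are hypothetical; plug in your own analytics.

STAGES = ["acquisition", "activation", "retention", "revenue", "referral"]

# Each event: (user_id, stage), stages following the AARRR funnel order.
events = [
    ("u1", "acquisition"), ("u2", "acquisition"), ("u3", "acquisition"),
    ("u1", "activation"),  ("u2", "activation"),
    ("u1", "retention"),
    ("u1", "revenue"),
    ("u1", "referral"),
]

def funnel(events):
    """Count distinct users per stage and conversion from the previous stage."""
    users_per_stage = {stage: set() for stage in STAGES}
    for user, stage in events:
        users_per_stage[stage].add(user)
    report, previous = [], None
    for stage in STAGES:
        count = len(users_per_stage[stage])
        rate = 1.0 if previous is None else count / max(previous, 1)
        report.append((stage, count, rate))
        previous = count
    return report

for stage, count, rate in funnel(events):
    print(f"{stage:12s} {count:3d} users  ({rate:.0%} of previous stage)")
</code></pre>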
<p>
<b>Lean Startup : test your plan !</b>
</p>
<p>
Using the new "Build - Measure - Learn" diagram, the question then becomes: "What hypotheses should I test?". This is precisely the purpose of the initial Lean Canvas.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9400cf47-0a75-4f96-87ed-662a381ae070">
<img class="centered" style="width: 750px; " alt="Canvas Principle" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9400cf47-0a75-4f96-87ed-662a381ae070" />
</a>
</div>
<br>
<p>
And it brings us to another definition of a startup:
</p>
<div class="centering">
<div class="centered"><b>
A startup is a temporary organization designed to <i>search</i> for a repeatable and scalable business model.
</b></div>
</div>
<br>
<p>
And once these hypotheses fill the Lean Canvas (Or Business Model Canvas), the key approach is to <b>run experiments</b>. This leads us to the next section.
</p>
<a name="sec16"></a>
<h3>1.6 Customer Development</h3>
<p>
The Customer Development process is a simple methodology for taking new venture hypotheses and getting out of the building to test them. <i>Customer discovery</i> (see below) captures the founders' vision and turns it into a series of business model hypotheses. Then it develops a series of experiments to test customer reactions to those hypotheses and turn them into facts. The experiments can be a series of questions you ask customers; most often, though, the questions are accompanied by an MVP that helps potential customers understand your solution.
</p>
<p>
Startups are building an MVP to learn the most they can, not to get a prototype!
</p>
<p>
The goal of designing these experiments and minimal viable products is not to get data. The data is not the endpoint. Anyone can collect data. <b>The goal is to get insight</b>. The entire point of getting out of the building is to inform the founder's vision.
<br>
The insight may come from analyzing customer answers, but it also may come from interpreting the data in a new way, or from completely ignoring it upon realizing that the idea relates to a completely new and disruptive market that doesn't even exist yet.
</p>
<p>
<b>Customer Development instead of Product Development</b>
</p>
<p>
More startups fail from a <i>lack of customers</i> than from a failure of Product Development.
</p>
<p>
The Customer Development model delineates all the customer-related activities in the early stage of a company into their own processes and groups them into four easy-to-understand steps: <i>Customer Discovery</i>, <i>Customer Validation</i>, <i>Customer Creation</i>, and <i>Company Building</i>.
<br>
These steps mesh seamlessly and support a startup's ongoing product development activities. Each step results in specific deliverables and involves specific practices.
</p>
<p>
As its name should communicate, the Customer Development model focuses on developing customers for the product or service your startup is building. Customer Development is really about finding a market for your product. It is built upon the idea that the founder has an idea but doesn't know whether the customers he imagines will actually buy. He needs to check this point, and the sooner the better.
</p>
<a name="sec2"></a>
<h2>2. The four steps to the Epiphany</h2>
<p>
Shortly put, Steve Blank proposes that companies need a <b>Customer Development process</b> that complements, or even in large portions replaces, their <i>Product Development Process</i>. The <i>Customer Development process</i> goes directly to the theory of <a href="https://en.wikipedia.org/wiki/Product/market_fit">Product/Market Fit</a>.
<br>
In "<i>The four steps to the Epiphany</i>", Steve Blank provides a roadmap for how to get to Product/Market Fit.
</p>
<a name="sec21"></a>
<h3>2.1 Overview</h3>
<p>
<b>The Path to Disaster: The Product Development Model</b>
</p>
<p>
The traditional product development model has four stages:
</p>
<ol>
<li>concept/seed, </li>
<li>product development, </li>
<li>beta test, </li>
<li>and launch.</li>
</ol>
<p>
That product development model, when applied to startups, suffers from a lot of flaws. They basically boil down to:
</p>
<ul>
<li>Customers were nowhere in that flow chart</li>
<li>The flow chart was strictly linear</li>
<li>Emphasis on execution over learning</li>
<li>Lack of meaningful milestones for sales/marketing</li>
<li>Treating all startups alike</li>
</ul>
<p>
What's the alternative? Before we get to that, one final topic is the <i>technology life cycle adoption curve</i>, i.e. adoption happens in phases of early adopters (tech enthusiasts, visionaries), mainstream (pragmatists, conservatives), and skeptics.
<br>
Between each category is a <i>chasm</i>, the largest is between the early adopters and the mainstream.
<br>
Crossing the chasm is a success problem. But you're not there yet; "customer development" lives in the realm of the early adopters.
</p>
<p>
<b>The Path to Epiphany: The Customer Development Model</b>
</p>
<p>
Most startups lack a process for discovering their markets, locating their first customers, validating their assumptions, and growing their business.
<br>
The <b>Customer Development Model</b> creates the process for these goals.
</p>
<a name="sec22"></a>
<h3>2.2 A 4-step process</h3>
<p>
The four stages of the <i>Customer Development Model</i> are: customer discovery, customer validation, customer creation, and company building.
</p>
<ol>
<li><b>Customer discovery</b>: understanding customer problems and needs</li>
<li><b>Customer validation</b>: developing a sales model that can be replicated</li>
<li><b>Customer creation / Get new Customers</b>: creating and driving end user demand</li>
<li><b>Company building / Company Creation</b>: transitioning from learning to executing</li>
</ol>
<p>
We can represent them as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9f0c263d-9308-4431-929d-7c7111eee8bf">
<img class="centered" style="width: 750px; " alt="Customer Development" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9f0c263d-9308-4431-929d-7c7111eee8bf" />
</a>
</div>
<br>
<p>
I won't go any further in this article in describing these steps, their purpose and rationale.
<br>
To be honest, Blank's book is pretty heavy and not very accessible. Happily, Blank did a lot of presentations around his book that one can find on YouTube or elsewhere. In addition, there are a lot of excellent summaries and explanations of Blank's book available online, and I let the reader refer to this material should he want more information.
</p>
<p>
Instead, I want to focus in this article on the <b>Software Engineering Practices</b> inferred from the Lean Startup approach, since, again, I believe they are very important for any kind of corporation with a significant software development activity.
<br>
And yet again, Software Engineering practices go beyond solely Software Development practices; they cover every activity in the company aimed at identifying and developing the product.
</p>
<a name="sec3"></a>
<h2>3. Lean startup practices</h2>
<p>
So I want to present the most essential principles and practices introduced and discussed by <i>the Lean Startup</i> approach.
<br>
These principles and practices are presented on the following schema, attached to the stages of the <i>Customer Development</i> process where I think they make the most sense:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/2f0ac2c3-13b4-4a24-a2fd-61ef32a66941">
<img class="centered" style="width: 750px; " alt="Lean Startup Practices" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/2f0ac2c3-13b4-4a24-a2fd-61ef32a66941" />
</a>
</div>
<br>
<p>
<b>Important notes</b>
</p>
<ul>
<li>I attached the practices to the step where I think they make the most sense, where I think they bring the most added value or should be introduced. But bear in mind that such a <i>categorization</i> is highly subjective and questionable. If you yourself believe some practices should be attached to another step, well, just leave a comment and move on.</li>
<li>Also, there are of course other practices. I mention here and will be discussing below the ones that seem the most appealing to me, myself and I. Again my selection is highly subjective and personal. If you think I am missing something important, just leave a comment and move on.</li>
</ul>
<p>
The rest of this paper intends to describe all these engineering - mostly software engineering - practices since, again, at the end of the day, I strongly believe that they form the most essential legacy of the Lean Startup movement and that they can benefit any kind of company, not only startups.
</p>
<a name="sec31"></a>
<h3>3.1 Customer Discovery</h3>
<p>
Customer Discovery focuses on understanding customer problems and needs. It's really about searching for the <i>Problem-Solution Fit</i>, turning the founders' initial hypotheses about their market and customers into facts.
<br>
The Problem-Solution Fit occurs when entrepreneurs identify relevant insights that can be addressed with a suggested solution. As Osterwalder describes it, this fit happens when there is evidence that customers care about certain problems or needs, and a value proposition has been designed that addresses those needs.
</p>
<p>
In Customer Discovery the startup aims at understanding customer problems and needs and, also, at ideating potential solutions that could be valuable based on the findings. Osterwalder calls these problems and needs <i>jobs, pains and gains</i>.
</p>
<p>
The three practices I want to emphasize at this stage are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6438fafe-c2e2-42f0-8d82-31183b686c4b">
<img class="centered" style="width: 650px; " alt="Customer Discovery" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6438fafe-c2e2-42f0-8d82-31183b686c4b" />
</a>
</div>
<br>
<a name="sec311"></a>
<h4>3.1.1 Get out of the building</h4>
<p>
<b>If you're not Getting out of the Building, you're not doing Customer Development and Lean Startup.</b>
<br>
There are no facts inside the building, only opinions.
</p>
<p>
If you aren't actually talking to your customers, you aren't doing Customer Development. And talking here really means speaking, with your mouth. Preferably in person; if not, a video call works as well. Messaging or emailing doesn't.
</p>
<p>
As Steve Blank said "<i>One good customer development interview is better for learning about your customers / product / problem / solution / market than five surveys with 10'000 statistically significant responses.</i>"
</p>
<p>
The problem here is that tech people, especially software engineers, try to avoid going out of the building as much as possible. But this is so important. Engineers need to fight against their nature and get out of the building and talk to customers as much as possible; find out who they are, how they work, what they need and what your startup needs to do, to build and then sell its solution.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/81e8236c-b6f1-4eae-b7c2-97ecf5225245">
<img class="centered" style="width: 250px; " alt="Keep Calm and get out of the building" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/81e8236c-b6f1-4eae-b7c2-97ecf5225245" />
</a>
</div>
<br>
<p>
In fact, so many engineers, myself included, have spent months working on a prototype or even a complete solution, sometimes several years, before actually meeting a first potential customer and discovering the hard way that all this work was for nothing.
<br>
As hard as it is, engineers should not write one single line of code, not even one single PowerPoint slide, before having met at least twenty potential customers or representatives and conducted formal <a href="#sec312">Problem interviews</a>.
<br>
After that, it's still not a question of writing lines of code; it's a question of investing a few hours - not more! - in designing a demonstrable prototype for the next set of interviews, the <a href="#sec313">Solution interviews</a>. That prototype doesn't need to actually work, it only needs to be demonstrable. A PowerPoint presentation with clickable animations works perfectly!
</p>
<p>
Again, getting out of the building is not getting in the parking lot, it's really about getting in front of the customer.
<br>
At the end of the day, it's about <i>Customer Discovery</i>. And <i>Customer Discovery</i> is not sales, it's a lot of listening, a lot of understanding, not a lot of talking.
</p>
<p>
A difficulty people always imagine: young entrepreneurs with an idea believe they don't know anybody, so how do they figure out who to talk to?
<br>
But in the age of LinkedIn, Facebook and Twitter, it's hard to believe one cannot find a hundred people to have a conversation with.
</p>
<p>
And when having a conversation with one of them, whatever else one's asking (<a href="#sec312">3.1.2 Problem interview</a>, <a href="#sec313">3.1.3 Solution interview</a>), one should ask two very important final questions:
</p>
<ol>
<li>
"<i>Who else should I be talking to ?</i>"
<br>
And because you're a pushy entrepreneur, when they give you those names, you should ask "<i>Do you mind if I sit here while you email them introducing me ?</i>"
</li>
<li>
"<i>What should I have really asked you ?</i>"
<br>
And sometimes that gets into another half hour related to what the customer is <i>really</i> worried about, what's really the customer's problem.
</li>
</ol>
<p>
Customer Discovery becomes really easy once you realize you don't need to get the world's best first interview.
<br>
In fact it's the sum of these data points over time that matters; you're not going to do just one interview, and you shouldn't aim to call on the highest level of the organization.
<br>
In fact you actually never want to call on the highest level of the organization, because you're not selling yet; you don't know enough.
<br>
What one actually wants is to understand enough about the customers, their problems and how they're solving them today, and whether one's solution is something they would want to consider.
</p>
<p>
A few hints in regards to how to get out of the building:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/4c4c8421-b5ee-4b59-abb3-1198bdf3f9ac">
<img class="centered" style="width: 400px;" alt="Get out of the building" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/4c4c8421-b5ee-4b59-abb3-1198bdf3f9ac" />
</a>
</div>
<br>
<a name="sec312"></a>
<h4>3.1.2 Problem interview</h4>
<p>
The Problem Interview is Ash Maurya's term for the interview you conduct to validate whether the problem you have identified is a real problem that your target audience actually has.
</p>
<p>
In the Problem Interview, you want to find out 3 things:
</p>
<ol>
<li><b>Problem</b> - What are you solving? - How do customers rank the top 3 problems?</li>
<li><b>Existing Alternatives</b> - Who is your competition? - How do customers solve these problems today?</li>
<li><b>Customer Segments</b> - Who has the pain? - Is this a viable customer segment?</li>
</ol>
<p>
Talking to people is hard, and talking to people in person is even harder. The best way to do this is to build a script and stick to it. Also, don't tweak your script until you've done enough interviews, so that your responses stay comparable.
<br>
The main point is to collect the information that you will need to validate your problem, and to do it face-to-face, either in person or by video call. It's actually important to see people and be able to study their body language as well.
</p>
<p>
The interview script - at least the initial one you should follow until you have enough experience to build your own - is as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b54ec66a-fda7-4938-94b4-10456335746d">
<img class="centered" style="width: 500px; " alt="Problem Interview" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b54ec66a-fda7-4938-94b4-10456335746d" />
</a>
</div>
<br>
<p>
If you have to remember just three rules for problem interviews here they are:
</p>
<ol>
<li>Do not talk about your business idea or product. You are here to understand a problem, not to imagine or sell a solution yet.</li>
<li>Ask about past events and behaviours.</li>
<li>No leading questions; learn from the customer.</li>
</ol>
<p>
After every interview, take a step back, analyze the answers, make sure you understood everything correctly and synthesize the results.
<br>
After a few dozen interviews, you should be able to form a clear understanding of the problem and come up with a few ideas regarding its solution.
<br>
Finding and validating your solution brings us to the next topic: the <i>Solution Interview</i>.
</p>
<p>
And what if a customer tells you that the issues you thought were important really aren't? Be glad: you have gained important data.
</p>
<a name="sec313"></a>
<h4>3.1.3 Solution interview</h4>
<p>
In the Solution Interview, you want to find out three things:
</p>
<ol>
<li><b>Early Adopters</b> - Who has this problem? - How do we identify an early adopter?</li>
<li><b>Solution</b> - How will you solve the problems? - What features do you need to build?</li>
<li><b>Pricing/Revenue</b> - What is the pricing model? - Will customers pay for it?</li>
</ol>
<p>
The key point here is to understand how to come up with a solution fitting the problem, getting step by step onto the right track with your prototype, and also understanding what the pricing model could be.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/9ef49888-50eb-4778-bc52-3f4584eeabf7">
<img class="centered" style="width: 500px; " alt="Solution Interview" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/9ef49888-50eb-4778-bc52-3f4584eeabf7" />
</a>
</div>
<br>
<p>
A <i>demo</i> is actually important. Many products are too hard to understand without some kind of demo. If a picture is worth a thousand words, a demonstration is probably worth a million.
</p>
<p>
Identifying early adopters is also key.
<br>
Think of something: if one of the people you meet tells you that you definitely hold something, ask him if he would want to buy it. If he says he would definitely buy it when it's ready and available, ask him if he would commit to this. If he says he commits to this, ask him if he would be ready to pay half of it now and have it when it's ready, thus becoming a partner or an investor.
<br>
If you find ten people committing to pay up front for the solution you sketch, you may not even need to search for investors: you already have them. And that is the very best proof you can find that your solution is actually something.
<br>
And customers or partners are actually the best possible type of investors.
</p>
<a name="sec32"></a>
<h3>3.2 Customer Validation</h3>
<p>
The second step of the Customer Development model, <i>Customer Validation</i>, focuses on developing a sales model that can be replicated. The sales model is validated by running experiments to test if customers value how the startup's products and services are responding to the customer problems and needs identified during the previous step.
<br>
If customers show no interest, then the startup can <a href="#sec332">pivot</a> to search for a better business model.
</p>
<p>
Customer Validation needs to happen to validate whether customers really care about the products and services that could be valuable to them. This second step is hence really about <i>Product-Market Fit</i>, which occurs when there is a sales model that works, when customers think the proposed solution is valuable to them. This should be proven by evidence that customers care about the products and services that make up the value proposition.
</p>
<p>
Blank believes that <i>product-market fit</i> needs to happen before moving from Customer Validation to Customer Creation (or the <i>Search Phase</i> to the <i>Execution Phase</i>).
</p>
<p>
The two practices I want to emphasize at this stage are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/87f50709-5984-4ac4-adf1-afb6935c4b9d">
<img class="centered" style="width: 650px;" alt="Customer Validation" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/87f50709-5984-4ac4-adf1-afb6935c4b9d" />
</a>
</div>
<br>
<a name="sec321"></a>
<h4>3.2.1 MVP</h4>
<p>
The <b>Minimum Viable Product</b> is an engineering product with just the set of features required to gather <i>validated learning</i> about it - or some of its features - and steer its continued development.
<br>
This notion of <i>Minimum Feature Set</i> is key in the MVP approach.
</p>
<p>
The key idea is that it really makes no sense to develop a full, finalized product without knowing how the market will receive it and whether all of it is actually worth the development costs.
<br>
Gathering insights and directions from an MVP avoids investing too much in a product based on wrong assumptions. Even further, the <i>Lean Startup</i> methodology seeks to avoid assumptions at all costs, see <a href="#sec14">1.4 The Feedback Loop</a> and <a href="#sec331">3.3.1 Metrics Obsession</a>.
</p>
<p>
The <i>Minimum Viable Product</i> should have just that set of initial features strictly required to have a valid product, usable for its very initial intent, and nothing more. In addition these features should be as minimalist as possible but without compromising the overall <i>User Experience</i>. A car should move, a balloon should be round and bounce, etc.
<br>
When adopting an MVP approach, the MVP is typically made available at first only to <i>early adopters</i>, those customers who may be somewhat forgiving of the "naked" aspect of the product and, more importantly, willing to give feedback and help steer the product development further.
</p>
<p>
Eric Ries defines the MVP as:
</p>
<div class="centering">
<div class="centered">
<b>
"The minimum viable product is that version of a new product a team uses to collect the maximum amount of validated learning about customers with the least effort."
</b>
</div>
</div>
<br>
<p>
The definition's use of the words <i>maximum</i> and <i>minimum</i> means it is really not formulaic. In practice, it requires a lot of judgement and experience to figure out, for any given context, what MVP makes sense.
</p>
<p>
The following chart is pretty helpful in understanding why both terms <i>minimum</i> and <i>viable</i> are equally important and why designing an MVP is actually difficult:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/abd771cd-b5a4-4a74-91c4-03951407517f">
<img class="centered" style="width: 320px; " alt="MVP" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/abd771cd-b5a4-4a74-91c4-03951407517f" />
</a>
</div>
<br>
<p>
When applied to a new feature of an existing product instead of a brand new product, the MVP approach is in my opinion somewhat different. It consists of not implementing the feature itself completely; rather, a mock-up or even some animation simulating the new feature should be provided.
<br>
The mock-up or links should be properly instrumented so that all user reactions are recorded and measured, in order to get insights on the actual demand for the feature and the best form it should take (<a href="#sec331">Measure Obsession</a>).
<br>
This is called a <b>deploy first, code later</b> method.
</p>
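<p>
A minimal sketch of the <i>deploy first, code later</i> idea, assuming a Python backend using Flask (the route names, the in-memory store and the "coming soon" page are all invented for the illustration): the feature's entry point is deployed and instrumented before any of the real feature is written.
</p>
<pre><code>
# Sketch only: the /new-feature route is a mock that records each click
# instead of implementing the feature, so demand can be measured first.
import time
from flask import Flask, request

app = Flask(__name__)
clicks = []  # in production this would go to your analytics store

@app.route("/new-feature")
def new_feature_mock():
    # Record who clicked and when; the real feature doesn't exist yet.
    clicks.append({"user": request.args.get("user", "anonymous"),
                   "ts": time.time()})
    return "This feature is coming soon - thanks for your interest!"

@app.route("/new-feature/stats")
def new_feature_stats():
    # Quick view of the measured demand before writing the real feature.
    return {"clicks": len(clicks),
            "distinct_users": len({c["user"] for c in clicks})}

if __name__ == "__main__":
    app.run()
</code></pre>
<p>
If the click counts stay flat, the feature can be dropped before a single line of its actual implementation is written; if they spike, the data justifies building it for real.
</p>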
<p>
<a href="http://www.expressiveproductdesign.com/minimal-viable-product-mvp/">Fred Voorhorst' work</a> does a pretty good job in explaining what an MVP is:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/756e983e-5fd7-4dc1-a395-f5f1a69747f8">
<img class="centered" style="width: 700px; " alt="MVP - How-To" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/756e983e-5fd7-4dc1-a395-f5f1a69747f8" />
</a><br>
<div class="centered">
(Fred Voorhorst - Expressive Product Design - <a href="http://www.expressiveproductdesign.com/minimal-viable-product-mvp/">http://www.expressiveproductdesign.com/minimal-viable-product-mvp/</a>)
</div>
</div>
<br>
<p>
Developing an MVP is most definitely not the same as developing a sequence of elements which may eventually combine into a product. A single wheel is of little interest to a user wanting a personal transporter like a car, as illustrated by the first row.
<br>
Instead, developing an MVP is about developing the vision. This is not the same as developing a sequence of intermediate visions, especially not if these are valuable products by themselves. For example, a skateboard will likely not interest someone in search of a car either, as illustrated by the second row.
</p>
<p>
Developing an MVP means developing a sequence of prototypes through which you explore what is key for your product idea and what can be omitted.
</p>
<a name="sec322"></a>
<h4>3.2.2 Fail Fast</h4>
<p>
The key point of the "<b>fail fast</b>" principle is to quickly abandon ideas that aren't working. And the big difficulty, of course, is not giving up too soon on an idea that could potentially work, should one find the right channel, the right approach.
<br>
Fail fast means getting out of planning mode and into testing mode, possibly for every component, every single feature, every idea around your product or model of change. <i>Customer development</i> is the process that embodies this principle and helps you determine which hypotheses to start with and which are the most critical for your new idea.
</p>
<p>
It really is OK to fail if one knows the reason for the failure, and that is where most people go wrong. Once a site or a product fails, one needs to analyse why it bombed. Only then can one learn from it.
<br>
The key aspect here is really learning. And learning comes from experimenting, <b>trying things, <a href="#sec331">measuring</a> their success and <a href="#sec33">adapting</a></b>.
<br>
An entrepreneur should really be a pathologist investigating a death and finding the cause of the failure. Understanding the cause of a failure can only work if the appropriate measures and metrics around the experiment are in place.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/559f0efe-95eb-4e4e-bdf1-3f6bb998932a">
<img class="centered" style="width: 350px; " alt="Success - what it really looks like" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/559f0efe-95eb-4e4e-bdf1-3f6bb998932a" />
</a>
</div>
<br>
<p>
Now failing is OK as long as we learn from it and as long as we <b>fail as fast as possible</b>. Again, the whole <i>lean</i> idea is to avoid waste as much as possible, and there's no greater waste than continuing to invest in something that ultimately cannot work. Failing as fast as possible, adapting the product, <a href="#sec332">pivoting</a> the startup towards its next approach as soon as possible is key.
<br>
But then again, the big difficulty is not to give up too soon on something that could possibly work.
</p>
<div class="centering">
<div class="centered">
<b>
Fail fast, <br>
Learn faster, <br>
Succeed sooner !
</b>
</div>
</div>
<br>
<p>
So how do you know when to turn, when to drop an approach and adapt your solution? How can you know it's not too soon?
</p>
<p>
<a href="#sec331">Measure, measure, measure</a> of course!
</p>
<p>
The testing of new concepts, failing, and building on failures are necessary when creating a great product.
<br>
The adage, "<i>If you can't measure it, you can't manage it</i>" is often used in management and is very important in <i>The Lean Startup</i> approach. By analyzing data, results can be measured, key lessons learned, and better initiatives employed.
</p>
<a name="sec33"></a>
<h3>3.3 Re-adapt the product</h3>
<p>
Customer development isn't predictable; you don't know what you're going to learn until you start. You'll need the ability to think on your feet and adapt as you uncover new information.
<br>
Adapting, in my opinion, is really re-adapting the product to the new situation, to the new knowledge you gained from the previous steps. And re-adapting the product, your solution, your approach is pivoting.
</p>
<p>
But I want to emphasize here that pivoting, or re-adapting the product, should only happen with the right data, the precise insights that give a clear new direction. Metrics and insight are essential.
</p>
<p>
The key practices here are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e39e7b10-58dc-412f-ac22-a2580de1ec96">
<img class="centered" style="width: 650px;" alt="Readapt Product" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e39e7b10-58dc-412f-ac22-a2580de1ec96" />
</a>
</div>
<br>
<a name="sec331"></a>
<h4>3.3.1 Metrics Obsession</h4>
<p>
In the <i>build-measure-learn</i> loop, there is measure... <i>The Lean Startup</i> makes an actual obsession of measuring everything. And I believe that this is a damn good thing.
<br>
Think of it: what if you have an idea regarding a new feature or an evolution of your product and you don't already have the metrics that can help you make a sound and enlightened decision? You'll need to introduce the new measure and wait until you get the data. Waiting is not good for startups.
</p>
<p>
This is why I like thinking of it as a <b>Metrics Obsession</b>. Measure everything, everything you can think of!
<br>
And repeat a hundred times:
</p>
<div class="centering">
<div class="centered">
<b>
I will never ever again think that <br>
Instead I will <i>measure</i> that ...
</b>
</div>
</div>
<br>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/83f8b089-da6c-4f7f-b3ee-64ca3cdd46b3">
<img class="centered" style="width: 500px; " alt="Measure Obsession" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/83f8b089-da6c-4f7f-b3ee-64ca3cdd46b3" />
</a>
</div>
<br>
<p>
Or as W. Edwards Deming said:
</p>
<div class="centering">
<div class="centered">
<b>
"In god we trust, all others must bring data"
</b>
</div>
</div>
<br>
<p>
Imagine you work on a website. You should enhance your backend to measure, at least: the number of times a page has been displayed; the count of users and distinct users displaying the pages; the number of times a link or button has been clicked, by whom, and how long after the containing page was displayed; the user think time between two actions; the navigation path of each and every user (actually build the graph and the counts along the branches); etc.
<br>
Measure everything! Don't hesitate to measure something you see no use for now. Sooner or later you will find a use for that metric, and that day, you'd better have it.
</p>
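<p>
As an illustration of this <i>measure everything</i> idea, here is a minimal sketch - with an invented event format - that builds the navigation graph (transition counts along the branches) and the average think time between two actions from a raw event log:
</p>
<pre><code>
# Sketch: derive the navigation graph and think times from raw page views.
# Event format is hypothetical: (timestamp_seconds, user_id, page).
from collections import defaultdict

events = [
    (0.0, "alice", "/home"), (4.2, "alice", "/pricing"), (9.8, "alice", "/signup"),
    (1.0, "bob",   "/home"), (3.5, "bob",   "/pricing"), (7.0, "bob",   "/home"),
]

edges = defaultdict(int)         # (from_page, to_page) -> transition count
think_times = defaultdict(list)  # (from_page, to_page) -> think times in seconds
last_seen = {}                   # user -> (timestamp, page) of last action

for ts, user, page in sorted(events):
    if user in last_seen:
        prev_ts, prev_page = last_seen[user]
        edges[(prev_page, page)] += 1
        think_times[(prev_page, page)].append(ts - prev_ts)
    last_seen[user] = (ts, page)

for (src, dst), count in sorted(edges.items()):
    avg = sum(think_times[(src, dst)]) / count
    print(f"{src} -> {dst}: {count} transitions, avg think time {avg:.1f}s")
</code></pre>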
<p>
<b>How to choose good metrics ?</b>
</p>
<p>
Honestly there is no magic silver bullet, and it can in fact be pretty difficult to pick the metric that would be most helpful to validate a given hypothesis.
<br>
However, metrics should at all costs respect the three A's. Good metrics
</p>
<ul>
<li>are <b>actionable</b>,</li>
<li>can be <b>audited</b></li>
<li> are <b>accessible</b></li>
</ul>
<p>
An <b>actionable metric</b> is one that ties specific and repeatable actions to observed results. The <i>actionable</i> property of the chosen metrics is important since it prevents the entrepreneur from distorting reality to fit his own vision. We speak of <i>Actionable vs. Vanity</i> Metrics.
<br>
Meaningless metrics such as "How many visitors?" or "How many followers?" are vanity metrics and are useless.
</p>
<p>
Ultimately, your metrics should be useful to <b>measure progress against your own questions</b>.
</p>
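<p>
A tiny sketch of the difference, on invented signup data: the total visitor count is a vanity metric, while the per-cohort activation rate is actionable because it ties a specific change (here, a hypothetical new onboarding flow shipped in week 2) to an observed result:
</p>
<pre><code>
# Invented numbers, for illustration only.
signups = {  # week -> (visitors, activated_users)
    "week1": (1000, 80),
    "week2": (1100, 170),  # hypothetical new onboarding flow shipped here
    "week3": (1050, 160),
}

total_visitors = sum(visitors for visitors, _ in signups.values())
print(f"Vanity: {total_visitors} total visitors (nothing to act on)")

for week, (visitors, activated) in signups.items():
    # Actionable: the jump in week 2 traces back to the change shipped then.
    print(f"Actionable: {week} activation rate {activated / visitors:.0%}")
</code></pre>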
<a name="sec332"></a>
<h4>3.3.2 Pivot</h4>
<p>
In the process of learning by iterations, a startup can discover through field feedback from real customers that its product is not adapted to the identified need, or that it does not meet that need at all.
<br>
However, during this learning process, the startup may have identified another need (often related to the first product) or another way to answer the original need.
<br>
When the startup changes its product to meet either this new need or the former need in a different way, it is said to have performed a <b>Pivot</b>.
<br>
A startup can <i>pivot</i> several times during its existence.
</p>
<p>
A <i>pivot</i> is ultimately a <b>change in strategy</b> without <i>a change in vision</i>.
<br>
It is defined as a structured course correction designed <b>to test a new fundamental hypothesis</b> about the product, business model and engine of growth.
</p>
<p>
The vision is important. A startup is created because the founder has a vision, and the startup is really built and organized around this vision. If the feedback from the field compromises the vision, the startup doesn't need to pivot; it needs to fold, cease its activities, and perhaps another startup, another organization aligned with the new vision, should be created.
</p>
<p>
There are various kinds of pivots:
</p>
<ul>
<li><b>Zoom-In :</b> a single feature becomes the whole product </li>
<li><b>Zoom-Out :</b> the whole initial product becomes a feature of a new product </li>
<li><b>Customer segment :</b> Good product, bad customer segment </li>
<li><b>Customer need :</b> Repositioning, designing a completely new product (still sticking to the vision)</li>
<li><b>Platform : </b> Change from an application to a platform, or vice versa</li>
<li>Many others ...</li>
</ul>
<p>
<b>Pivot or Persevere</b>
</p>
<p>
Since entrepreneurs are typically emotionally attached to their product ideas, there is a tendency to hang in there too long. This wastes time and money. The pivot or persevere process forces a non-emotional review of the hypothesis.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/0ea0e3ba-a1ce-42cb-a805-c14c5a757537">
<img class="centered" style="width: 360px;" alt="Pivot or Persevere" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/0ea0e3ba-a1ce-42cb-a805-c14c5a757537" />
</a>
</div>
<br>
<p>
Unsurprisingly, knowing when to pivot is an art, not a science. It needs to be well thought through and can be pretty complicated to manage.
<br>
At the end of the day, knowing when to pivot or persevere requires experience and, more importantly, metrics: proper performance indicators giving the entrepreneur clear insights about the market reception of the product and the fitting of customer needs.
</p>
<p>
One thing seems certain though: if it becomes clear to everyone in the company that another approach would better suit customer needs, the startup needs to pivot, and fast.
</p>
<a name="sec34"></a>
<h3>3.4 Get new customers</h3>
<p>
The third step, the Customer Creation step, whose goal is to "<i>start building end user demand to scale the business</i>", is the precursor to achieving <i>Business Model Fit</i>. The Business Model Fit stage can be understood as validating the value for the company, whereas product-market fit focuses on validating the value for the customer.
</p>
<p>
The set of practices I deem important here are as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/d4a47435-9f33-4b76-a62f-b813e4349577">
<img class="centered" style="width: 650px; " alt="Get New Customers" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/d4a47435-9f33-4b76-a62f-b813e4349577" />
</a>
</div>
<br>
<p>
Again, attaching some of these practices here or in the next and last step can be subjective. In my opinion, the startup needs to embrace these Lean and Agile principles and practices before it attempts to scale its organization, which is why I consider these practices at this stage.
</p>
<a name="sec341"></a>
<h4>3.4.1 Pizza Teams</h4>
<p>
Jeff Bezos, Amazon's founder and CEO, always said that a team shouldn't be larger than what two pizzas can feed - two American pizzas, not Italian, needless to say.
<br>
This makes it 7 +/- 2 co-workers inside an Agile Team.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c7fdfc67-0145-4cea-b6a6-3af9f714e19a">
<img class="centered" style="width: 500px;" alt="2 Pizzas Team" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c7fdfc67-0145-4cea-b6a6-3af9f714e19a" />
</a>
</div>
<br>
<p>
More communication isn't necessarily the solution to communication problems - it's how it is carried out. Compare the interactions at a small dinner - or pizza - party with a larger gathering like a wedding. As group size grows, you simply can't have as meaningful of a conversation with every person, which is why people start clumping off into smaller clusters to chat.
<br>
For Bezos, small teams make it easier to communicate more effectively rather than more, to stay decentralized and moving fast, and encourage high autonomy and innovation. Here's the science behind why the two-pizza team rule works.
</p>
<p>
As team size grows, <b>the number of one-on-one communication channels tends to explode</b>, following the formula for the number of links between <code>n</code> people: <code> n ( n - 1) / 2 </code>.
<br>
This is <code>O(n<sup>2</sup>)</code> (hello, Engineers) and really is a <i>combinatorial explosion</i>.
<br>
Take a basic two-pizza team size of, say, 6: that's 15 links between everyone. Double that group to a team of 12 and it shoots up to 66 links.
<br>
The cost of coordinating, communicating, and relating with each other explodes to such a degree that it lowers individual and team productivity.
</p>
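<p>
The formula is easy to check with a few lines of Python (a sketch, just to make the combinatorial explosion visible):
</p>
<pre><code>
# Number of one-on-one communication channels in a team of n people.
def links(n: int) -> int:
    return n * (n - 1) // 2

for size in (5, 6, 9, 12, 20):
    print(f"team of {size:2d}: {links(size):3d} communication channels")
# A team of 6 has 15 channels; doubling it to 12 yields 66 - more than
# four times the coordination cost for twice the people.
</code></pre>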
<p>
Under five co-workers, the team becomes fragile to external events and lacks creativity.
<br>
Beyond ten, communication loses efficiency, cohesion diminishes, free-riding behaviors and power struggles appear, and the performance of the team decreases very rapidly with the number of members.
</p>
<p>
The right size for an Agile Team is 7 +/- 2 persons.
</p>
<a name="sec342"></a>
<h4>3.4.2 Feature Teams</h4>
<p>
Let's first have a look at the other model: <i>Component Teams</i>.
</p>
<p>
<b>Component Teams</b>
</p>
<p>
<i>Component Teams</i> are the usual, legacy model. In large IT organizations, there is usually a development team dedicated to the front-end, the Graphical User Interface, another team dedicated to developing the Java (or Cobol :-) backend, a team responsible for designing and maintaining the database, etc.
<br>
A Component Team is defined as a development team whose primary area of concern is restricted to a specific component, or a set of components from a specific layer or tier, of the system.
<br>
Prior to Agile, most large-scale systems were developed following the component team approach and the development teams were organized around components and subsystems.
</p>
<p>
The most essential drawback of <i>Component Teams</i> is obvious: most new features are spread among several components, creating dependencies that require cooperation between these teams. This is a continuing drag on velocity, as the individual teams spend much of their time discussing dependencies, testing, assessing and fixing behaviour across components rather than delivering end user value as efficiently as possible.
<br>
An important direct consequence of this dependency is that any given feature can only be delivered as fast as the slowest (or most overloaded) component team can deliver its component changes.
</p>
<p>
<b>Feature Teams</b>
</p>
<p>
As such, in an Agile organization, where the whole company is organized around feature backlogs or Kanban, it makes a lot more sense to organize the various development teams into <b>Feature Teams</b>.
<br>
<i>Feature teams</i> are organized around user-centered functionality. Each and every team is capable of delivering end-to-end user value throughout the software stack. Feature teams operate primarily with user stories, refactors and spikes; however, technical stories may also occasionally occur in their backlog.
<br>
A feature team is defined as a long-lived, cross-functional, cross-component team that completes many end-to-end customer features, one by one.
</p>
<p>
More Information on Feature Teams:
</p>
<ul>
<li><a href="http://www.scaledagileframework.com/features-and-components/"> From SAFe - Scaled Agile Framework</a></li>
<li><a href="https://less.works/less/structure/feature-teams.html">From LeSS - Large Scale Scrum framework</a></li>
</ul>
<p>
The difference between the two models is well illustrated this way:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/cdee6c64-8084-4f23-86bd-79e30d784f0a">
<img class="centered" style="width: 700px; " alt="Component Team vs. Feature Teams" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/cdee6c64-8084-4f23-86bd-79e30d784f0a" />
</a><br>
<div class="centered">
(Source : <a href="https://less.works/less/structure/feature-teams.html">https://less.works/less/structure/feature-teams.html</a>)
</div>
</div>
<br>
<p>
A pretty good summary of the most essential differences between both models is available on the LeSS web site:
</p>
<div class="centering">
<div class="centered">
<table class="nicewithborder">
<thead>
<tr>
<th style="text-align: center; background-color: #DDDDDD;">component team</th>
<th style="text-align: center; background-color: #DDDDDD;">feature team</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">optimized for delivering the <b>maximum number of lines of code</b></td>
<td style="text-align: center">optimized for delivering the <b>maximum customer value</b></td>
</tr>
<tr>
<td style="text-align: center">focus on increased individual productivity by implementing 'easy' lower-value features</td>
<td style="text-align: center">focus on high-value features and system productivity (value throughput)</td>
</tr>
<tr>
<td style="text-align: center">responsible for only part of a customer-centric feature</td>
<td style="text-align: center">responsible for complete customer-centric feature</td>
</tr>
<tr>
<td style="text-align: center">traditional way of organizing teams - follows Conway's law</td>
<td style="text-align: center">'modern' way of organizing teams - avoids Conway's law</td>
</tr>
<tr>
<td style="text-align: center">leads to 'invented' work and a forever-growing organization</td>
<td style="text-align: center">leads to customer focus, visibility, and smaller organizations</td>
</tr>
<tr>
<td style="text-align: center">dependencies between teams leads to additional planning</td>
<td style="text-align: center"><b>minimizes dependencies between teams to increase flexibility</b></td>
</tr>
<tr>
<td style="text-align: center">focus on single specialization</td>
<td style="text-align: center">focus on multiple specializations</td>
</tr>
<tr>
<td style="text-align: center">individual/team code ownership</td>
<td style="text-align: center"><b>shared product code ownership</b></td>
</tr>
<tr>
<td style="text-align: center">clear individual responsibilities</td>
<td style="text-align: center"><b>shared team responsibilities</b></td>
</tr>
<tr>
<td style="text-align: center">results in 'waterfall' development</td>
<td style="text-align: center"><b>supports iterative development</b></td>
</tr>
<tr>
<td style="text-align: center">exploits existing expertise; lower level of learning new skills</td>
<td style="text-align: center">exploits flexibility; continuous and broad learning</td>
</tr>
<tr>
<td style="text-align: center">works with sloppy engineering practices-effects are localized</td>
<td style="text-align: center">requires skilled engineering practices-effects are broadly visible</td>
</tr>
<tr>
<td style="text-align: center">contrary to belief, often leads to low-quality code in component</td>
<td style="text-align: center"><b>provides a motivation to make code easy to maintain and test</b></td>
</tr>
<tr>
<td style="text-align: center">seemingly easy to implement</td>
<td style="text-align: center">seemingly difficult to implement</td>
</tr>
</tbody>
</table>
<br>
(Source : <a href="https://less.works/less/structure/feature-teams.html">https://less.works/less/structure/feature-teams.html</a>)
</div>
</div>
<br>
<p>
The analogy with a Star Trek crew makes a surprising, and amusing, amount of sense.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c91988a2-6aa1-4beb-8bdb-f62277d0decf">
<img class="centered" style="width: 500px; " alt="Star Trek" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c91988a2-6aa1-4beb-8bdb-f62277d0decf" />
</a>
</div>
<br>
<p>
Think of a Star Trek spaceship. The crew consists of Commanding Officers, Medical Officers, Medical Staff, Engineering Officers, Engineering Staff, Science Officers, Scientists, etc.
<br>
These different functions, competencies and responsibilities are grouped together to work towards a common objective, its continuing mission: <i>to explore strange new worlds, to seek out new life and new civilizations, to boldly go where no one has gone before</i>.
</p>
<p>
Now imagine if Starfleet had instead put all the Commanding Officers in one ship, all medical staff in another ship, and so on. It would have been pretty difficult to make those ships actually do anything significant, don't you think ?
<br>
This is precisely the situation of <i>Component Teams</i>.
<br>
Just as with a Star Trek Ship, it makes a lot more sense to put all the required competencies together in a team (or ship) and assign them a clear objective, implementing that feature throughout the technology and software stack.
</p>
<a name="sec343"></a>
<h4>3.4.3 Build vs. Buy</h4>
<p>
This dilemma is as old as the world of computers: is it better to invest in developing software that is best suited to your needs, or should you rely on a software package or third party product that embeds the capitalization and R&D of <b>another</b> software editor in order to - <b>apparently</b> - speed up your time to market?
</p>
<p>
In order to be as efficient as possible on the build-measure-learn loop, it is essential to master your development process. For this reason, <i>tailor made</i> solutions are better, because adopting a third party software package often requires investing a lot of resources not in the development of your product, but in the development of workarounds, hacks and patches to correct all the points on which the package is poorly adapted to the specific and precise behavior required by your own product features.
</p>
<p>
In the case of a startup, this aspect is catastrophic. Investing in the development of hacks and workarounds around a third party product - a product that one additionally has to pay for, sometimes per machine or per user - instead of developing the startup's core business, should just not happen.
</p>
<p>
This cost aspect is particularly critical of course when scaling the solution. When one multiplies the processors and the servers, the invoice climbs very quickly and not necessarily linearly, and the costs become very visible, no matter whether it is a business software package or an infrastructure brick.
</p>
<p>
This is precisely one of the arguments that led LinkedIn to gradually replace Oracle with a home-grown solution: Voldemort.
</p>
<p>
Most technologies that make the buzz today in the world of high performance architectures are the result of developments made by the Web Giants and released as Open Source: Cassandra, developed at Facebook; Hadoop and HBase, inspired by Google's papers and developed at Yahoo; Voldemort, by LinkedIn; etc.
</p>
<p>
<b>Open-Source software is cool</b>
</p>
<p>
Of course the cost problem doesn't apply to Open-Source and free-to-use software. In addition, instead of developing workarounds and patches around Open-Source software, you can change its source, fork it and maintain your own baseline, while still benefiting from the developments made on the official baseline by merging it frequently.
</p>
<p>
At the end of the day, integrating an Open-Source software, in contrast to editor / closed-source software, is pretty close to developing it on your own, as long as you have the competencies to maintain it yourself should you need to.
<br>
Open-Source software is cool, go for it!
</p>
<a name="sec344"></a>
<h4>3.4.4 A/B Testing</h4>
<p>
A/B testing is a marketing technique that consists in proposing several variants of the same object, differing by a single criterion (for example, the color of a package), in order to determine the version that leads to the best appreciation and acceptance from consumers.
<br>
The term A/B testing is loosely used for all kinds of multivariate tests.
</p>
<p>
An A/B test evaluates the respective performance of one or more partially or totally different versions of the same product or functionality by comparing them to the original version. The test consists in creating modified versions of the functionality, changing as many elements as desired.
<br>
The idea is to split the visitors into two groups (hence the name A/B) and to present each group with a different version of the functionality or product. Then we follow the path of the two groups, measuring their appreciation of the functionality by means of ad hoc metrics, and determine which of the two variants gives the best result with respect to a given objective.
</p>
<p>
For instance, in order to test whether a <i>trial first</i> approach is more appealing and eventually leads to more sales than a mandatory purchase up front:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/6899c72d-0165-4704-84ae-6ffad63f0d84">
<img class="centered" style="width: 450px;" alt="A/B Testing" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/6899c72d-0165-4704-84ae-6ffad63f0d84" />
</a>
</div>
<br>
<p>
The A/B test makes it possible to validate very quickly the idea of introducing a trial period for a feature or a product.
</p>
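<p>
As a minimal sketch of how the outcome of such a test can be evaluated - with invented counts for the "buy now" (A) and "trial first" (B) groups, using a standard two-proportion z-test rather than any specific A/B testing tool:
</p>
<pre><code>
# Compare conversion rates of control (A) and variant (B); counts are invented.
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (rate_a, rate_b, z) for a two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return p_a, p_b, (p_b - p_a) / se

rate_a, rate_b, z = two_proportion_z(conv_a=120, n_a=2000, conv_b=161, n_b=2000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}")
# |z| above roughly 1.96 means the difference is significant at the 95% level;
# below that, keep the test running before declaring a winner.
</code></pre>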
<a name="sec345"></a>
<h4>3.4.5 Scaling Agile</h4>
<p>
Transforming a startup into a company, changing and scaling its organization is a unique, and yet challenging, opportunity to make it an agile organization keeping the <i>lean</i> genes on which it has been built.
<br>
The <i>agile</i> aspect here is essential and the approach here actually has a name: <b>Scaling Agile</b>.
</p>
<p>
<i>Scrum</i> and <i>Kanban</i> are two agile frameworks often used at the team level. Over the past decade, as they gained popularity, the industry has begun to adapt and use Agile in larger companies. Two methods (amongst others) emerged to facilitate this process: <a href="http://less.works/"><b>LeSS</b></a> (Large Scale Scrum) and <a href="http://www.scaledagileframework.com/"><b>SAFe</b></a> (Scaled Agile Framework). Both are excellent starting points for using Agile on a large scale within a company.
</p>
<p>
Both approaches differ a little but also have a lot in common: they consist of scaling agility first among multiple agile teams within the R&D or Engineering department, and then around it, by having the whole company organize its activities in an agile way, centered on the engineering team, the product development team.
<br>
I won't be describing both approaches any further here and I let the reader refer to the links above.
</p>
<p>
I just want to emphasize how important I believe that is. Scaling Agile is key in aligning business and IT engagement models.
</p>
<a name="sec35"></a>
<h3>3.5 Company creation</h3>
<p>
Company creation is the end phase, when all assumptions have been confirmed or adapted, when the product is built in an acceptable form, when the break-even point is reached, and the startup should evolve into a corporation. When that moment is reached, startups must begin the transition from a temporary organization designed to search for a business model to a structure focused on executing a validated model.
</p>
<p>
Company creation happens at the moment the company can transition from its informal, learning and discovery-oriented Customer Development team (startup, temporary organization) into formal departments with VPs of Sales, Marketing and Business Development.
<br>
At that moment, these executives should focus on building mission-oriented departments that can exploit the company's early market success.
</p>
<p>
This is a shift into a higher gear. We speak of <i>Company Creation</i> since it is really a question of creating a company out of what was "only" a startup. The temporary organization should evolve into a sustainable and viable organization.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c2fe6011-91cb-417e-88de-2fd047fc5e3a">
<img class="centered" style="width: 500px; " alt="Scale-Up" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c2fe6011-91cb-417e-88de-2fd047fc5e3a" />
</a>
</div>
<br>
<p>
Describing anything further regarding <i>Company Creation</i> exceeds the scope of this article, which focuses on <i>Lean Startup practices</i>.
<br>
I can only recommend reading Steve Blank's articles on the subject (or the big chapter in <i>"The Four Steps to the Epiphany"</i>):
</p>
<ul>
<li><a href="https://steveblank.com/2010/01/14/a-startup-is-not-a-smaller-version-of-a-large-company/">A Startup is Not a Smaller Version of a Large Company</a></li>
<li><a href="https://steveblank.com/2009/12/21/the-elves-leave-middle-earth-%E2%80%93-soda%E2%80%99s-are-no-longer-free/">The Elves Leave Middle Earth - Sodas Are No Longer Free</a></li>
<li><a href="https://steveblank.com/2010/09/20/the-peter-pan-syndrome-%E2%80%93-the-startup-to-company-transition/">The Peter Pan Syndrome - The Startup to Company Transition</a></li>
<li><a href="https://steveblank.com/2015/02/12/what-do-i-do-now/">What Do I Do Now? The Startup Lifecycle</a></li>
</ul>
<a name="sec4"></a>
<h2>4. Conclusions</h2>
<p>
The Lean Startup is not dogmatic. It is first and foremost a question of being aware that the market and the customer are not in the architecture meetings, marketing plans, sales projections or key feature discussions.
</p>
<p>
Bearing this in mind, you will see assumptions everywhere. The key approach then consists in putting in place a discipline of hypothesis validation, while keeping as a guiding principle to validate the minimum set of functionalities at any given time.
</p>
<p>
Before writing any line of code, the main questions to ask revolve around the triplet <i>Client / Problem / Solution</i>.
<br>
Do you really have a problem that is worth solving? Is your solution the right one for your customer? Are they likely to buy it? For how much? Any means is fair game to validate these hypotheses: interviews, market studies, mock-ups, whatever you can think of.
</p>
<p>
The next step is to find out whether the model you came up with, and have been able to test on a smaller scale, is really repeatable and scalable.
<br>
How do you put a product customers have never heard of into their hands? Will they understand its use and benefits just as well?
</p>
<p>
The Lean Startup is not an approach reserved for mainstream websites or fancy internet products. Innovating by validating hypotheses as quickly as possible while limiting financial investment is obviously a logic that can be transposed to any type of engineering project, even an internal one.
<br>
I am convinced that the practices and principles of <b>the Lean Startup</b> approach should be more widely used, to avoid so many projects burning so much money and effort before being simply dropped.
</p>
<p>
Part of this article is available as a slideshare presentation here:
<a href="http://www.slideshare.net/JrmeKehrli/lean-startup-72100971">http://www.slideshare.net/JrmeKehrli/lean-startup-72100971</a>, as well as a PDF document here: <a href="https://www.niceideas.ch/lean-startup.pdf">https://www.niceideas.ch/lean-startup.pdf</a>.
</p>
https://www.niceideas.ch/roller2/badtrash/entry/devops-explained
DevOps explained
Jerome Kehrli
2017-01-04T15:56:41-05:00
2017-11-19T14:34:21-05:00
<!-- DevOps explained -->
<p>
So ... I've read a lot of things recently on DevOps, a lot of very interesting things ... and, unfortunately, some pretty stupid ones as well. It seems a lot of people increasingly consider that DevOps boils down to mastering <code>chef</code>, <code>puppet</code> or docker containers. This really bothers me. DevOps is so much more than any tool such as puppet or docker.
</p>
<p>
This could even make me angry. DevOps seems so important to me. I've spent 15 years working in the engineering business for very big institutions, mostly big financial institutions. DevOps is a key methodology bringing principles and practices that address precisely the biggest problem, the saddest factor of failure of software development projects in such institutions: the <i>wall of confusion</i> between developers and operators.
</p>
<p>
Don't get me wrong: most of these big institutions are still far from a broad and sound adoption of an Agile development methodology beyond some XP practices, and there are many other reasons explaining the failure or slippage of software development projects.
<br>
But the <i>wall of confusion</i> is by far, in my opinion, the most frustrating, time consuming, and, well, quite stupid, problem they are facing.
</p>
<p>
So yeah... instead of getting angry, I figured I'd rather present here, in as concrete and precise an article as possible, what DevOps is and what it brings. Long story short, DevOps is not a set of tools. <b>DevOps is a methodology</b> proposing a set of <b>principles and practices</b>, period. The tools, or rather the toolchain - since the collection of tools supporting these practices can be quite extensive - are only intended to support the practices.
<br>
In the end, these tools don't matter. Today's DevOps toolchains are very different from those of two years ago and will be very different again in two years. This doesn't matter. What matters is a sound understanding of the principles and practices.
</p>
<p>
Presenting a specific toolchain is not within the scope of this article, and I won't mention any. There are many articles out there focusing on DevOps toolchains. Here I want to take a step back and present the principles and practices and their fundamental purpose, since this is what seems most important to me in the end.
</p>
<p>
DevOps is a methodology capturing the practices adopted from the very start by the web giants, who had a unique opportunity as well as a strong requirement to invent new ways of working, due to the very nature of their business: the need to evolve their systems at an unprecedented pace and to extend them, and their business, sometimes on a daily basis.
<br>
While DevOps obviously makes critical sense for startups, I believe that the big corporations with large and old-fashioned IT departments are actually the ones that can benefit the most from adopting these principles and practices. I will try to explain why and how in this article.
</p>
<p>
(This article is available as a PDF document here <a href="https://www.niceideas.ch/devops.pdf">https://www.niceideas.ch/devops.pdf</a> and as a slideshare presentation here
<a href="https://www.slideshare.net/JrmeKehrli/devops-explained-72664261">https://www.slideshare.net/JrmeKehrli/devops-explained-72664261</a>)
</p>
<p>
<b>Summary</b>
</p>
<ul>
<li><a href="#sec1">1. Introduction</a>
<ul>
<li><a href="#sec11">1.1 The management credo </a></li>
<li><a href="#sec12">1.2 a typical IT organization </a></li>
<li><a href="#sec13">1.3 Ops frustration </a></li>
<li><a href="#sec14">1.4 Infrastructure automation </a></li>
<li><a href="#sec15">1.5 DevOps : For once, a magic silver bullet </a></li>
</ul>
</li>
<li><a href="#sec2">2. Infrastructure as Code</a>
<ul>
<li><a href="#sec21">2.1 Overview</a></li>
<li><a href="#sec22">2.2 DevOps Toolchains</a></li>
<li><a href="#sec22">2.3 Benefits</a></li>
</ul>
</li>
<li><a href="#sec3">3. Continuous Delivery</a>
<ul>
<li><a href="#sec31">3.1 Learn from the field</a></li>
<li><a href="#sec32">3.2 Automation</a></li>
<li><a href="#sec33">3.3 Deploy more often</a></li>
<li><a href="#sec34">3.4 Continuous Delivery requirements</a></li>
<li><a href="#sec35">3.5 Zero Downtime Deployments</a></li>
</ul>
</li>
<li><a href="#sec4">4. Collaboration</a>
<ul>
<li><a href="#sec41">4.1 The wall of confusion</a></li>
<li><a href="#sec42">4.2 Software Development Process</a></li>
<li><a href="#sec43">4.3 Share the Tools</a></li>
<li><a href="#sec44">4.4 Work Together</a></li>
</ul>
</li>
<li><a href="#sec5">5. Conclusion</a></li>
</ul>
<a name="sec1"></a>
<h2>1. Introduction</h2>
<p>
DevOps is not a question of tools, or mastering chef or docker. DevOps is a methodology, a set of principles and practices that help both developers and operators reach their goals while maximizing value delivery to the customers or the users as well as the quality of these deliverables.
</p>
<p>
The problem comes from the fact that developers and operators - while both required by corporations with large IT departments - have very different objectives.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5394bf3-7943-4b74-843f-bbdc56fd3cd1">
<img class="centered" style="width: 500px; " alt="Developers and Operators" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b5394bf3-7943-4b74-843f-bbdc56fd3cd1" />
</a>
</div>
<br>
<p>
This difference of objectives between developers and operators is called the <b>wall of confusion</b>. We'll see later precisely what that means and why I consider it such a big and bad thing.
</p>
<p>
DevOps is a methodology presenting a set of principles and practices (the tools are derived from these practices) aimed at having both these personas work towards a unified and common objective: <b>deliver as much value as possible for the company</b>.
</p>
<p>
And surprisingly, for once, there is a magic silver bullet for this. Very simply, the secret is to <b>bring agility to the production side</b>!
<br>
And that, precisely that and only that, is what DevOps is about !
</p>
<p>
But there are quite a few things I need to present before we can discuss this any further.
</p>
<a name="sec11"></a>
<h3>1.1 The management credo </h3>
<p>
What is the sinews of war in IT management? In other words, when it comes to software development projects, what does management want first and foremost?
</p>
<p>
Any idea ?
</p>
<p>
Let me put you on the right track: what matters most when building a startup?
</p>
<p>
<b>Improve Time To Market (TTM)</b> of course !
</p>
<p>
The <b>Time To Market</b>, or TTM, is the length of time it takes from a product being conceived to its being available to users or for sale to customers. TTM is important in industries where products become outmoded quickly.
<br>
In software engineering, where approaches, business and technologies change almost yearly, the TTM is a very important KPI (Key Performance Indicator).
<br>
The TTM is also very often called <b>Lead Time</b>.
</p>
<p>
A first problem lies in the widespread belief that TTM and product quality are opposing attributes of a development process. As we will see below, improving quality (and hence stability) is the objective of operators, while reducing lead time (and hence improving TTM) is the objective of developers.
<br>
Let me explain this.
</p>
<p>
An IT organization or department is often judged on two key KPIs: the quality of the software, where the target is to have as few defects as possible, and the TTM, where the target is to go from business idea (often given by business users) to production - making the feature available to users or customers - as quickly as possible.
<br>
The problem is that these two distinct objectives are most often supported by two different teams: the <i>developers</i>, building the software, and the <i>operators</i>, running the software.
</p>
<a name="sec12"></a>
<h3>1.2 A typical IT organization</h3>
<p>
A typical IT organization, in a corporation with a large IT department, looks as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/3f858c20-5a21-4130-abe5-4af5b93b07db">
<img class="centered" style="width: 900px; " alt="Typical IT Organization" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/3f858c20-5a21-4130-abe5-4af5b93b07db" />
</a>
</div>
<br>
<p>
Mostly for historical reasons (operators most often come from the hardware and telco business), operators are not attached to the same branch as developers. Developers belong to the R&D department, while operators most of the time belong to the Infrastructure department (or a dedicated operations department).
</p>
<p>
Again, they have different objectives:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ce967df5-4dca-4f37-a480-b683bd742259">
<img class="centered" style="width: 500px; " alt="Developers and Operators" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ce967df5-4dca-4f37-a480-b683bd742259" />
</a>
</div>
<br>
<p>
In addition, and as a side note, these two teams sometimes even run on different budgets: the development team uses the <i>build</i> budget while the operations team uses the <i>run</i> budget. These separate budgets, and the increasing need to control and shrink IT costs in corporations, tend to reinforce the opposition between the objectives of the two teams.
<br>
(As an aside: nowadays, with the always-and-everywhere interconnection of people and objects pushing the digitalization of businesses and society in general, the old Plan / Build / Run framework for IT budgeting makes, IMHO, really no sense anymore - but that is another story.)
</p>
<a name="sec13"></a>
<h3>1.3 Ops frustration </h3>
<p>
Now let's focus on operators a little and see how, on average, a typical <i>operations team</i> spends its time:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/60b9fcce-83ef-4639-b03a-34b367088181">
<img class="centered" style="width: 480px; " alt="Operator team time stats" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/60b9fcce-83ef-4639-b03a-34b367088181" />
</a><br>
<div class="centered">
(Source : Study from Deepak Patil [Microsoft Global Foundation Services] in 2006, via James Hamilton [Amazon Web Services] <a href="http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_POA20090226.pdf">http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_POA20090226.pdf</a>)
</div>
</div>
<br>
<p>
So almost 50% (47%) of the total time of production teams is dedicated to deployment-related topics:
</p>
<ul>
<li>Actually doing deployment or</li>
<li>Fixing problems related to deployments</li>
</ul>
<p>
This is actually a pretty crazy KPI, one that should have been tracked much sooner. The truth is that ever since their inception in the early age of computer engineering - 40 years ago, when computers were massively introduced in the industry - operator teams have been this kind of hackers, running tons of commands manually to perform their tasks. They are used to long checklists of commands and manual processes to perform their duties.
<br>
Somehow, they suffer from the "<i>We always did it like this</i>" syndrome and have challenged their ways of working very little over these 40 years.
<br>
But if you think about it, this is really crazy: on average, operators spend almost 50% of their time on deployment-related tasks!
</p>
<p>
This underlines two critical needs for evolving these processes:
</p>
<ol>
<li>Automate the deployments, to reduce the 31% of time dedicated to these currently manual tasks.</li>
<li>Industrialize them (just as software development has been industrialized, thanks to XP and Agile), to reduce the 16% spent fixing deployment-related issues.</li>
</ol>
<a name="sec14"></a>
<h3>1.4 Infrastructure automation </h3>
<p>
In this regards, another statistic is pretty enlightening:
</p>
<div class="centering">
<div class="centered">
Probability of succeeding an installation expressed as a function of the number of manual operations
</div>
<br>
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c4b31869-528a-498a-a86d-61ecb2fce5c2">
<img class="centered" style="width: 700px; " alt="Operator team time stats" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c4b31869-528a-498a-a86d-61ecb2fce5c2" />
</a>
</div>
<br>
<p>
This reads as follows:
</p>
<ul>
<li>With only 5 manual commands, the probability of a successful installation already drops to 86%.</li>
<li>With 55 manual commands, the probability of a successful installation drops to 22%.</li>
<li>With 100 manual commands, the probability of a successful installation is close to 0 (2%)!</li>
</ul>
<p>
<i>Succeeding the installation</i> means that the software behaves in production as intended. Failing it means something goes wrong, some analysis is required to understand what happened with the installation, and some patches need to be applied or some configuration corrected.
</p>
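<p>
The curve above is empirical, but a simple back-of-the-envelope model (an assumption of mine, not part of the cited study) reproduces the same steep drop-off: if each manual command independently succeeds with probability p, then n commands in a row succeed with probability p<sup>n</sup>:
</p>
<pre>
# Back-of-the-envelope model: if each manual command independently
# succeeds with probability p, n commands succeed with probability p**n.
p = 0.97  # a seemingly excellent 97% per-command success rate

for n in (5, 55, 100):
    print(f"{n} manual commands: {p ** n:.0%} chance of a clean install")

# 5 commands: ~86%    55 commands: ~19%    100 commands: ~5%
</pre>
<p>
Even with a seemingly excellent 97% per-command success rate, a 100-command checklist almost never runs cleanly.
</p>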
<p>
<b>So automating all of this and avoiding manual commands at all costs seems to be a rather good idea, doesn't it?</b>
</p>
<p>
So what's the status in the industry in this regard:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/8b821d6f-2f2d-4df7-b8d7-248958db8701">
<img class="centered" style="width: 500px; " alt="Operator team time stats" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/8b821d6f-2f2d-4df7-b8d7-248958db8701" />
</a><br>
<div class="centered">
(Source : IT Ops & DevOps Productivity Report 2013 - Rebellabs - <a href="http://pages.zeroturnaround.com/rs/zeroturnaround/images/it-ops-devops-productivity-report-2013%20copy.pdf">http://pages.zeroturnaround.com/rs/zeroturnaround/images/it-ops-devops-productivity-report-2013%20copy.pdf</a>)
</div>
</div>
<br>
<p>
(To be perfectly honest, this statistic is pretty old - 2013 - and I would expect somewhat different numbers nowadays.)
</p>
<p>
Nonetheless, this gives a pretty good idea of how much remains to be accomplished in terms of infrastructure automation, and of how important DevOps principles and practices are.
</p>
<p>
Again, the web giants had to come up with a new approach and new practices to address their need for responsiveness. The practices they put in place when they started their engineering business in the early days are at the root of what DevOps is today.
</p>
<p>
Let's look at where the web giants stand now in this regards. A few examples:
</p>
<ul>
<li>Facebook has thousands of devs and ops, and hundreds of thousands of servers. On average, an operator takes care of 500 servers (still think automation is optional?). They do two deployments a day (using the concept of deployment rings).</li>
<li>Flickr does 10 deployments a day</li>
<li>Netflix designs for failure! The software is designed from the ground up to tolerate system failures. They test this all the time in production: 65'000 failure tests daily, killing random virtual machines ... and measuring that everything still behaves OK.</li>
</ul>
<p>
So what is their secret ?
</p>
<a name="sec15"></a>
<h3>1.5 DevOps : For once, a magic silver bullet </h3>
<p>
The secret is simply to <b>Extend Agility to Production</b>:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/77393bca-f284-443d-a48d-b1fadbc97789">
<img class="centered" style="width: 500px; " alt="DevOps" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/77393bca-f284-443d-a48d-b1fadbc97789" />
</a>
</div>
<br>
<p>
DevOps consists mostly in extending agile development practices by further streamlining the movement of software changes through the build, validate, deploy and delivery stages, while empowering cross-functional teams with full ownership of software applications - from design through production support.
</p>
<p>
DevOps encourages <b>communication</b>, <b>collaboration</b>, <b>integration</b> and <b>automation</b> among software developers and IT operators in order to improve both the speed and quality of delivering software.
</p>
<p>
DevOps teams focus on standardizing development environments and automating delivery processes to improve delivery predictability, efficiency, security and maintainability. The DevOps ideals provide developers more control of the production environment and a better understanding of the production infrastructure.
<br>
DevOps encourages empowering teams with the autonomy to build, validate, deliver and support their own applications.
</p>
<p>
<b>So what are the core principles ?</b>
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1524e087-8bff-44c2-a596-7f4597162a6a">
<img class="centered" style="width: 420px; " alt="DevOps Principle" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/1524e087-8bff-44c2-a596-7f4597162a6a" />
</a>
</div>
<br>
<p>
We'll now dig into these 3 essential principles.
</p>
<a name="sec2"></a>
<h2>2. Infrastructure as Code</h2>
<p>
Because humans make mistakes, because the human brain is terribly bad at repetitive tasks, because humans are slow compared to a shell script, and because we are humans after all, we should consider and handle infrastructure concerns just as we handle coding concerns!
</p>
<p>
Infrastructure as code (IaC) is the prerequisite for common DevOps practices such as version control, code review, continuous integration and automated testing. It consists in <b>managing</b> and <b>provisioning</b> computing infrastructure (containers, virtual machines, physical machines, software installation, etc.) and their configuration <b>through machine-processable definition</b> files or scripts, rather than the use of interactive configuration tools and manual commands.
</p>
<p>
I cannot stress enough how much this is a key principle of DevOps. It is really applying software development practices to servers and infrastructure.
<br>
Cloud computing enables complex IT deployments modeled after traditional physical topologies. We can automate the build of complex virtual networks, storage and servers with relative ease. Every aspect of server environments, from the infrastructure down to the operating system settings, can be codified and stored in a version control repository.
</p>
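<p>
To make the idea concrete, here is a deliberately minimal sketch of the IaC spirit in Python, assuming a Debian-like system: the desired state is declared as data (and versioned), and an idempotent script converges the machine towards it. Real tools such as puppet, chef or ansible do exactly this, at scale and far more robustly.
</p>
<pre>
import subprocess

# Desired state, declared as data and kept under version control
DESIRED_PACKAGES = ["nginx", "postgresql"]

def is_installed(package):
    """Return True if the Debian package is already installed."""
    return subprocess.run(["dpkg", "-s", package],
                          capture_output=True).returncode == 0

def converge():
    """Idempotent: running this twice leaves the machine unchanged."""
    for package in DESIRED_PACKAGES:
        if not is_installed(package):
            subprocess.run(["apt-get", "install", "-y", package], check=True)

if __name__ == "__main__":
    converge()
</pre>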
<a name="sec21"></a>
<h3>2.1 Overview</h3>
<p>
In a very summarized way, the levels of infrastructure and operations concerns at which automation should occur are represented in the following schema:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/11e52f2c-2287-4304-b14f-67b82b5860fc">
<img class="centered" style="width: 700px; " alt="IaC Overview" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/11e52f2c-2287-4304-b14f-67b82b5860fc" />
</a>
</div>
<br>
<p>
The tools proposed as examples in the schema above are very much oriented towards <i>building</i> the different layers. But a DevOps toolchain does much more than that.
<br>
I think it's time to say a little more about the notion of DevOps toolchains.
</p>
<a name="sec22"></a>
<h3>2.2 DevOps Toolchains</h3>
<p>
Because DevOps is a cultural shift fostering collaboration between development, operations and testing, there is no single DevOps tool but rather, again, a set of them - a <i>DevOps toolchain</i> consisting of multiple tools. Such tools fit into one or more of the following categories, which reflect the software development and delivery process:
</p>
<ul>
<li><b>Code</b> : Code development and review, version control tools, code merging</li>
<li><b>Build</b> : Continuous integration tools, build status</li>
<li><b>Test</b> : Test and results determine performance</li>
<li><b>Package</b> : Artifact repository, application pre-deployment staging</li>
<li><b>Release</b> : Change management, release approvals, release automation</li>
<li><b>Configure</b> : Infrastructure configuration and management, Infrastructure as Code tools</li>
<li><b>Monitor</b> : Applications performance monitoring, end user experience</li>
</ul>
<p>
Though there are many tools available, certain categories of them are essential in the DevOps toolchain setup for use in an organization.
</p>
<p>
Tools such as Docker (containerization), Jenkins (continuous integration), Puppet (infrastructure building) and Vagrant (virtualization platform), among many others, are often used and were frequently referenced in DevOps tooling discussions as of 2016.
</p>
<p>
<b>Versioning, Continuous Integration and Automated testing of infrastructure components</b>
</p>
<p>
The ability to <b>version</b> the infrastructure - or rather the infrastructure building scripts and configuration files - as well as the ability to <b>automatically test</b> it are very important.
<br>
DevOps consists in finally bringing to the production side the same practices that XP brought to software engineering two decades ago.
<br>
Even further, infrastructure elements should be <b>continuously integrated</b>, just as software deliverables are.
</p>
<a name="sec23"></a>
<h3>2.3 Benefits</h3>
<p>
There are so many benefits to DevOps. A non-exhaustive list could be as follows:
</p>
<ul>
<li><b>Repeatability and Reliability</b> : building the production machine is now simply running that script or that puppet command. With proper usage of docker containers or vagrant virtual machines, a production machine with the Operating System layer and, of course, all the software properly installed and configured can be set up by typing one single command - <b>One Single Command</b>. And of course this building script or mechanism is continuously integrated upon changes or when being developed, continuously and automatically tested, etc. <br>
Finally we can benefit on the operation side from the same practices we use with success on the software development side, thanks to XP or Agile.</li>
<li><b>Productivity</b> : one click deployment, one click provisioning, one click new environment creation, etc. Again, the whole production environment is set-up using one single command or one click. Now of course that command can well run for hours, but during that time the operator can focus on more interesting things, instead of waiting for a single individual command to complete before typing the next one, and that sometimes for several days...</li>
<li><b>Time to recovery !</b> : one click recovery of the production environment, period.</li>
<li><b>Guarantee that infrastructure is homogeneous</b> : completely eliminating the possibility for an operator to build an environment or install a software slightly differently every time is the only way to guarantee that the infrastructure is perfectly homogeneous and reproducible. Even further, with version control of scripts or puppet configuration files, one can rebuild the production environment precisely as it was last week, last month, or for that particular release of the software.</li>
<li><b>Make sure standards are respected</b> : infrastructure standards are not even required anymore. The standard is the code.</li>
<li><b>Allow developers to do lots of tasks themselves</b> : if developers suddenly become able to re-create the production environment on their own infrastructure with one single click, they also become able to do a lot of production-related tasks by themselves, such as understanding production failures, providing proper configuration, implementing deployment scripts, etc.</li>
</ul>
<p>
These are the few benefits of IaC I can think of by myself. I bet there are many more (suggestions in the comments are welcome).
</p>
<a name="sec3"></a>
<h2>3. Continuous Delivery</h2>
<p>
Continuous delivery is an approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time. It aims at building, testing, and releasing software faster and more frequently.
<br>
The approach helps reduce the cost, time, and risk of delivering changes by allowing for more incremental updates to applications in production. A straightforward and repeatable deployment process is important for continuous delivery.
</p>
<p>
<b>Important note : Continuous Delivery ≠ Continuous Deployment</b> - continuous delivery is sometimes confused with continuous deployment. Continuous deployment means that every change is automatically deployed to production. Continuous delivery means that the team ensures every change can be deployed to production, but may choose not to deploy it, usually for business reasons. In order to do continuous deployment, one must first be doing continuous delivery.
</p>
<p>
The key ideas behind continuous delivery are:
</p>
<ul>
<li><b>The more often you deploy, the better you master the deployment process and the better you automate it</b>. If you have to do something 3 times a day, you <b>will</b> make it bulletproof and reliable soon enough, once you are fed up with fixing the same issues over and over again.</li>
<li><b>The more often you deploy, the smaller the changesets you deploy</b>, and hence the smaller the risk of something going wrong or of losing control over the changesets.</li>
<li><b>The more often you deploy, the better your TTR (Time To Repair / Resolution)</b>, and hence the sooner the feedback from your business users regarding a feature, and the easier it is to change a few things here and there to make it perfectly fit their needs (TTR is very similar to TTM in this regard).</li>
</ul>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/e95d84be-e1de-4991-b4ef-5042df77a096">
<img class="centered" style="width: 520px; " alt="Small changes / More often" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/e95d84be-e1de-4991-b4ef-5042df77a096" />
</a>
<br>
<div class="centered">
(Source : Ops Meta-Metrics: The Currency You Pay For Change - <a href="http://fr.slideshare.net/jallspaw/ops-metametrics-the-currency-you-pay-for-change-4608108">http://fr.slideshare.net/jallspaw/ops-metametrics-the-currency-you-pay-for-change-4608108</a>)
</div>
</div>
<br>
<p>
But continuous delivery is more than building a shippable, production-ready version of the product as often as possible. Continuous delivery refers to 3 key practices:
</p>
<ul>
<li>Learn from the field</li>
<li>Automation</li>
<li>Deploy more often</li>
</ul>
<a name="sec31"></a>
<h3>3.1 Learn from the field</h3>
<p>
Continuous delivery is key to being able to <b>learn from the field</b>. There is no truth within the development team; the truth lies in the heads of the business users. Unfortunately, no one is able to really clearly express his mind, his will, in a specification document, no matter how much time he dedicates to the task. This is why agility attempts to put the feature in the hands of the users to get their feedback as soon as possible, at all costs.
<br>
Doing continuous delivery, or even continuous deployment, and hence reducing lead time to its minimum possible value, is key to learning the truth from the users as soon as possible.
</p>
<p>
But the truth doesn't come out in the form of formal user feedback. One should never blindly trust one's users or rely on formal feedback to learn from them. One should trust one's own measures.
<br>
<b>Measure obsession</b> is a very important notion from the <i>Lean Startup</i> movement, but it's also very important in DevOps. One should measure everything! Finding the right metrics - those that let the team learn about the success or failure of an approach, about what would work better and what succeeds most - can sometimes be tricky. One should rather take too many measures than miss the one that would have enabled the team to make an enlightened decision.
</p>
<p>
Don't think, know! And the only way to know is to measure, measure everything: response times, user think times, count of displays, count of API calls, click rate, etc. but not only. Find out about all the metrics that can give you additional insights about the user perception of a feature and measure them, all of them!
</p>
<p>
This can be represented as follows:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ddfcb1af-1554-42b1-a7a4-1801d41e3822">
<img class="centered" style="width: 350px; " alt="Move Fast" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ddfcb1af-1554-42b1-a7a4-1801d41e3822" />
</a>
</div>
<br>
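<p>
A minimal sketch of this "measure everything" reflex, with a toy in-memory metrics store (a real system would ship these values to a time-series database and dashboards instead; all names here are illustrative):
</p>
<pre>
import time
from collections import Counter, defaultdict

# Toy in-memory metrics store
counters = Counter()
timings = defaultdict(list)

def measured(name):
    """Decorator recording call counts and response times of a feature."""
    def wrap(func):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                counters[name + ".calls"] += 1
                timings[name + ".seconds"].append(time.perf_counter() - start)
        return inner
    return wrap

@measured("search.autocomplete")
def autocomplete(prefix):
    return [w for w in ("devops", "deploy", "deliver") if w.startswith(prefix)]

autocomplete("de")
print(counters, dict(timings))
</pre>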
<a name="sec32"></a>
<h3>3.2 Automation</h3>
<p>
Automation has already been discussed above in section <a href="#sec2">2. Infrastructure as Code</a>.
</p>
<p>
I just want to emphasize here that continuous delivery is impossible without proper, 100% automation of all infrastructure provisioning and deployment-related tasks.
<br>
This is very important, so let me repeat it once more: setting up an environment and deploying a production-ready version of the software should take one click, one command; it should be entirely automated. Without this, it's impossible to imagine deploying the software several times a day.
</p>
<p>
In section <a href="#sec35">3.5 Zero Downtime Deployments</a> below we will mention additional important techniques helping Continuous Delivery as well.
</p>
<a name="sec33"></a>
<h3>3.3 Deploy more often</h3>
<p>
The DevOps credo is:
</p>
<div class="centering">
<div class="centered">
<span style="font-size: large;"><b><i>"If it hurts, do it more often !"</i></b></span>
</div>
</div>
<br>
<p>
This idea of doing painful things more frequently is very important in agile thinking.
<br>
Automated Testing, refactoring, database migration, specification with customers, planning, releasing - all sorts of activities are done as frequently as possible.
</p>
<p>
There are three good reasons for that:
</p>
<ol>
<li>
Firstly most of these tasks become much more difficult as the amount of work to be done increases, but when broken up into smaller chunks they compose easily.
<br>
Take Database migration for instance: specifying a large database migration involving multiple tables is hard and error prone. But if you take it one small change at a time, it becomes much easier to get each one correct. Furthermore you can string small migrations together easily into a sequence. Thus when one decomposes a large migration into a sequence of little ones, it all becomes much easier to handle. (As a sidenote, this is the essence of database refactoring)
</li>
<li>
The second reason is <i>Feedback</i>. Much of agile thinking is about setting up feedback loops so that we can learn more quickly. Feedback was already an important and explicit value of Extreme Programming. In a complex process, like software development, one has to frequently check where one stands and make course corrections. To do this, one must look for every opportunity to add feedback loops and increase the frequency with which one gets feedback so one can adjust more quickly.
</li>
<li>
The third reason is <i>practice</i>. With any activity, we improve as we do it more often. Practice helps to iron out the kinks in the process, and makes one more familiar with signs of something going wrong. If you reflect on what you are doing, you also come up with ways to improve your practice.
<br>
With software development, there's also in addition the potential for automation. Once one has done something a few times, it's easier to see how to automate it, and more importantly one becomes more motivated to automate it. Automation is especially helpful because it can increase speed and reduce the chance for error.
</li>
</ol>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/aefb0bb2-5f62-423c-a9b3-2de395e04221">
<img class="centered" style="width: 450px; " alt="Master the process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/aefb0bb2-5f62-423c-a9b3-2de395e04221" />
</a>
</div>
<br>
<p>
Now one question remains: <b>how often should one deliver with DevOps?</b>
</p>
<p>
There is no straight answer to that. It really depends on the product, the team, the market, the company, the users, the operational needs, etc.
<br>
My best answer would be as follows: if you don't deliver at least every 2 weeks - or at the end of your sprint duration period - you are not even doing Agile, not to speak of DevOps.
<br>
DevOps encourages delivering as frequently as possible. In my understanding (please challenge this in the comments if you like), you should train your team to be able to deliver as frequently as possible. A sound approach, the one I'm using with my team, is to deliver twice a day to a QA environment. The delivery process is fully automated: twice a day, at noon and at midnight, the machinery starts, builds the software components, runs integration tests, builds the virtual machines, starts them, deploys the software components, configures them, runs functional tests, etc.
</p>
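<p>
For illustration only, such a machinery boils down to an ordered list of fully automated stages that aborts on the first failure. The commands below are placeholders, not my actual tooling:
</p>
<pre>
import subprocess

# Placeholder stage commands; the real machinery would invoke the actual
# build, provisioning and test tooling here.
STAGES = [
    ("build",             ["make", "build"]),
    ("integration-tests", ["make", "integration-tests"]),
    ("provision-vms",     ["make", "provision"]),
    ("deploy",            ["make", "deploy"]),
    ("functional-tests",  ["make", "functional-tests"]),
]

def run_pipeline():
    for name, command in STAGES:
        print(f"--- stage: {name}")
        subprocess.run(command, check=True)  # abort on the first failure

if __name__ == "__main__":
    run_pipeline()  # triggered at noon and midnight by a scheduler (cron)
</pre>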
<a name="sec34"></a>
<h3>3.4 Continuous Delivery requirements</h3>
<p>
What does one need <b>before</b> being able to move to Continuous Delivery?
<br>
My checklist, in raw form:
</p>
<ul>
<li>Continuous integration of both the software components development as well as the platform provisioning and setup.</li>
<li>TDD - Test Driven Development. This is questionable... but in the end, let's face it: TDD is really the one and only way to get acceptable coverage of code and branches with unit tests (and unit tests make it so much easier to fix issues than integration or functional tests).</li>
<li>Code reviews! At least code reviews... pair programming would be better, of course.</li>
<li>Continuous auditing software - such as Sonar.</li>
<li>Functional testing automation on production-level environment</li>
<li>Strong non-functional testing automation (performance, availability, etc.)</li>
<li>Automated packaging and deployment, independent of target environment</li>
</ul>
<p>
Plus sound software development practices when it comes to managing big features and evolutions, such as <i>Zero Downtime Deployments</i> techniques.
</p>
<a name="sec35"></a>
<h3>3.5 Zero Downtime Deployments</h3>
<div class="centering">
<div class="centered">
<span style="font-size: large;"><b><i>"Zero Downtime Deployment (ZDD) consists in deploying a new version of a system without any interruption of service."</i></b></span>
</div>
</div>
<br>
<p>
ZDD consists in deploying an application in such a way that a new version can be introduced to production without users ever seeing the application go down in the meantime. From the user's and the company's point of view, it's the best possible deployment scenario, since new features can be introduced and bugs eliminated without any outage.
</p>
<p>
I'll mention 4 techniques:
</p>
<ol>
<li>Feature Flipping</li>
<li>Dark launch</li>
<li>Blue/Green Deployments</li>
<li>Canary release</li>
</ol>
<p>
<b>Feature flipping</b>
</p>
<p>
Feature flipping allows one to enable or disable features while the software is running. It's really straightforward to understand and put in place: simply use a configuration property to entirely disable a feature in production, and only activate it once it is completely polished and working well.
</p>
<p>
For instance to disable or activate a feature globally for a whole application:
</p>
<pre>
<span style="color: blue;"><b>if</b></span> Feature.isEnabled('new_awesome_feature')
<span style="color: green;"> # Do something new, cool and awesome</span>
<span style="color: blue;"><b>else</b></span>
<span style="color: green;"> # Do old, same as always stuff</span>
<span style="color: blue;"><b>end</b></span>
</pre>
<p>
Or if one wants to do it on a per-user basis:
</p>
<pre>
<span style="color: blue;"><b>if</b></span> Feature.isEnabled('new_awesome_feature', current_user)
<span style="color: green;"> # Do something new, cool and awesome</span>
<span style="color: blue;"><b>else</b></span>
<span style="color: green;"> # Do old, same as always stuff</span>
<span style="color: blue;"><b>end</b></span>
</pre>
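<p>
The <code>Feature</code> helper above is pseudocode. A minimal Python sketch of such a helper could look as follows, with an in-memory flag registry and a deterministic per-user percentage rollout; the names and the storage are illustrative assumptions, real systems use dedicated feature-flag services:
</p>
<pre>
import hashlib

# Illustrative in-memory registry: feature name -> rollout percentage
FLAGS = {"new_awesome_feature": 10}  # enabled for roughly 10% of users

class Feature:
    @staticmethod
    def is_enabled(name, user_id=None):
        percentage = FLAGS.get(name, 0)
        if user_id is None:
            return percentage >= 100  # global flag: all-or-nothing
        # Deterministic bucket: the same user always gets the same answer
        bucket = int(hashlib.md5(f"{name}:{user_id}".encode())
                     .hexdigest(), 16) % 100
        return percentage > bucket

if Feature.is_enabled("new_awesome_feature", "user-42"):
    print("new, cool and awesome behaviour")
else:
    print("old, same as always behaviour")
</pre>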
<p>
<b>Dark Launch</b>
</p>
<p>
The idea of <i>Dark Launch</i> is to use production to simulate load!
</p>
<p>
It's difficult to simulate, in a testing environment, the load generated by a software system used by hundreds of millions of people.
<br>
Without realistic load tests, it's impossible to know whether the infrastructure will stand up to the pressure.
</p>
<p>
Instead of simulating load, why not just deploy the feature to see what happens without disrupting usability?
<br>
Facebook calls this a <i>dark launch</i> of the feature.
</p>
<p>
Let's say you want to turn a static search field used by 500 million people into an autocomplete field, so your users don't have to wait as long for their search results. You have built a web service for it and want to simulate all those people typing words at once, generating multiple requests to the web service.
<br>
The dark launch strategy consists in augmenting the existing form with a hidden background process that sends the entered search keyword to the new autocomplete service multiple times.
<br>
If the web service explodes unexpectedly, no harm is done; the server errors are simply ignored on the web page. And if it does explode, great: you can tune and refine the service until it holds up.
</p>
<p>
There you have it, a real world load test.
</p>
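<p>
A sketch of this pattern: the existing search handler additionally fires the keyword at the new autocomplete service in a background thread and swallows any error, so users never notice the experiment. The URL and function names are hypothetical:
</p>
<pre>
import threading
import urllib.request

NEW_SERVICE_URL = "https://internal.example.com/autocomplete"  # hypothetical

def shadow_autocomplete(keyword):
    """Fire the keyword at the new service in the background, ignore errors."""
    def fire():
        try:
            urllib.request.urlopen(f"{NEW_SERVICE_URL}?q={keyword}", timeout=1)
        except Exception:
            pass  # failures stay invisible to the user; measure them instead

    threading.Thread(target=fire, daemon=True).start()

def legacy_search(keyword):
    return [f"result for {keyword}"]

def handle_search(keyword):
    shadow_autocomplete(keyword)   # dark launch: real traffic, no user impact
    return legacy_search(keyword)  # users still get the old behaviour
</pre>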
<p>
<b>Blue/Green Deployments</b>
</p>
<p>
<i>Blue/Green Deployments</i> consists in building a second complete line of production for version N + 1. Both development and operation teams can peacefully build up version N + 1 on this second production line.
<br>
Whenever the version N + 1 is ready to be used, the configuration is changed on the load balancer and users are automatically and transparently redirected to the new version N + 1.
<br>
At this moment, the production line for version N is recovered and used to peacefully build version N + 2.
<br>
And so on.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/1d54c978-4e81-4746-9c3b-9f44459cb98f">
<img class="centered" style="width: 550px; " alt="Blue/Green Deployments" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/1d54c978-4e81-4746-9c3b-9f44459cb98f" />
</a>
<br>
<div class="centered">
(Source : Les Patterns des Géants du Web – Zero Downtime Deployment - <a href="http://blog.octo.com/zero-downtime-deployment/">http://blog.octo.com/zero-downtime-deployment/</a>)
</div>
</div>
<br>
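<p>
In its simplest form, the switch is nothing more than changing which pool of servers the load balancer considers active. A purely illustrative sketch (addresses and names are made up):
</p>
<pre>
# Two complete production lines behind one load balancer (illustrative)
POOLS = {
    "blue":  ["10.0.1.1", "10.0.1.2"],  # currently serving version N
    "green": ["10.0.2.1", "10.0.2.2"],  # being prepared with version N+1
}
active = "blue"

def switch():
    """Atomically redirect all users to the other production line."""
    global active
    active = "green" if active == "blue" else "blue"

def backends():
    """Servers the load balancer currently routes traffic to."""
    return POOLS[active]

switch()           # version N+1 goes live
print(backends())  # traffic now hits the green line; blue will host N+2
</pre>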
<p>
This is quite effective and easy, but the problem is that it requires doubling the infrastructure, the number of servers, etc.
<br>
Imagine if Facebook had to maintain a complete second set of its hundreds of thousands of servers.
</p>
<p>
So there is some room for something better.
</p>
<p>
<b>Canary release</b>
</p>
<p>
<i>Canary release</i> is very similar in nature to <i>Blue/Green Deployments</i>, but it addresses the problem of having to maintain multiple complete production lines.
<br>
The idea is to switch users to the new version incrementally: as more servers are migrated from the version N line to the version N + 1 line, an equivalent proportion of users is migrated as well.
<br>
This way, the load on each production line matches its number of servers.
</p>
<p>
At first, only a few servers are migrated to version N + 1, along with a small subset of the users. This also allows testing the new release without risking an impact on all users.
<br>
When all servers have eventually been migrated from line N to line N + 1, the release is finished and everything can start all over again for release N + 2.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/c858f9cd-0e9a-4f9a-bbdf-6eed30713033">
<img class="centered" style="width: 550px; " alt="Canari Deployments" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/c858f9cd-0e9a-4f9a-bbdf-6eed30713033" />
</a>
<br>
<div class="centered">
(Source : Les Patterns des Géants du Web – Zero Downtime Deployment - <a href="http://blog.octo.com/zero-downtime-deployment/">http://blog.octo.com/zero-downtime-deployment/</a>)
</div>
</div>
<br>
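<p>
The routing rule that goes with this pattern can be sketched as follows, assuming deterministic per-user hashing so that a given user always lands on the same version; the server counts are made up:
</p>
<pre>
import hashlib

MIGRATED_SERVERS, TOTAL_SERVERS = 2, 10  # 20% of the fleet runs N+1

def routed_to_new_version(user_id):
    """Route users to N+1 in the same proportion as migrated servers."""
    share = MIGRATED_SERVERS / TOTAL_SERVERS
    # Deterministic hash keeps a given user on the same version
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return share * 100 > bucket

print(routed_to_new_version("user-42"))
</pre>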
<a name="sec4"></a>
<h2>4. Collaboration</h2>
<p>
Agile software development has broken down some of the silos between requirements analysis, testing and development. Deployment, operations and maintenance are other activities which have suffered a similar separation from the rest of the software development process. The DevOps movement is aimed at removing these silos and encouraging collaboration between development and operations.
<br>
Even with the best tools, DevOps is just another buzzword if you don't have the right culture.
</p>
<p>
The primary characteristic of DevOps culture is increased collaboration between the roles of development and operations. There are some important cultural shifts, within teams and at an organizational level, that support this collaboration.
</p>
<p>
This addresses a very important problem that is best illustrated with the following meme:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/83cafdba-bf4a-4476-8d4a-8fd3489351c7">
<img class="centered" style="width: 500px; " alt="Worked in Dev / Ops problem now" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/83cafdba-bf4a-4476-8d4a-8fd3489351c7" />
</a>
<br>
<div class="centered">
(Source : DevOps Memes @ EMCworld 2015 - <a href="http://fr.slideshare.net/bgracely/devops-memes-emcworld-2015">http://fr.slideshare.net/bgracely/devops-memes-emcworld-2015</a>)
</div>
</div>
<br>
<p>
Team play is so important to DevOps that one could really sum up most of the methodology's goals for improvement with two C's: collaboration and communication. While it takes more than that to truly become a DevOps workplace, any company that has committed to those two concepts is well on its way.
</p>
<p>
But why is it so difficult ?
</p>
<a name="sec41"></a>
<h3>4.1 The wall of confusion</h3>
<p>
Because of the wall of confusion :
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/472d552f-71ea-4640-a815-026e18cd865e">
<img class="centered" style="width: 450px; " alt="Wall of Confusion" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/472d552f-71ea-4640-a815-026e18cd865e" />
</a>
</div>
<br>
<p>
In a traditional development cycle, the development team kicks things off by "throwing" a software release "over the wall" to Operations.
<br>
Operations picks up the release artifacts and begins preparing for their deployment. Operations manually hacks the deployment scripts provided by the developers or, most of the time, maintains their own scripts.
<br>
They also manually edit configuration files to reflect the production environment, which is significantly different from the Development or QA environments.
<br>
At best they are duplicating work that was already done in previous environments, at worst they are about to introduce or uncover new bugs.
</p>
<p>
The IT Operations team then embarks on what they understand to be the currently correct deployment process, which at this point is essentially being performed for the first time due to the script, configuration, process, and environment differences between Development and Operations.
<br>
Of course, somewhere along the way a problem occurs and the developers are called in to help troubleshoot. Operations claims that Development gave them faulty code. Developers respond by pointing out that it worked just fine in their environments, so it must be the case that Operations did something wrong.
<br>
Developers have a difficult time even diagnosing the problem, because the configuration, file locations, and procedures used to get into this state are different from what they expect. Time is running out on the change window and, of course, there isn't a reliable way to roll the environment back to a previously known good state.
</p>
<p>
So what should have been an uneventful deployment ends up being an all-hands-on-deck fire drill, where a lot of trial and error finally hacks the production environment into a usable state.
<br>
It <b>always</b> happens this way. Always.
</p>
<p>
<b>Here comes DevOps</b>
</p>
<p>
DevOps helps to enable IT alignment by aligning development and operations roles and processes in the context of shared business objectives. Both development and operations need to understand that they are part of a unified business process. DevOps thinking ensures that individual decisions and actions strive to support and improve that unified business process, regardless of organizational structure.
</p>
<p>
Even further, as Werner Vogels, CTO of Amazon, famously said back in 2006:
</p>
<div class="centering">
<div class="centered">
<span style="font-size: large;"><b><i>"You build it, you run it."</i></b></span>
</div>
</div>
<br>
<a name="sec42"></a>
<h3>4.2 Software Development Process</h3>
<p>
Below is a simplified view of what the agile software development process usually looks like. <br>
Initially, the business representatives work with the Product Owner and the architecture team to define the software, either through story mapping with user stories or with a more complete specification.
<br>
Then the development team develops the software in short development sprints, shipping a production-ready version of the software to the business users at the end of every sprint, in order to capture feedback and directions as often and as much as possible.<br>
Finally, after every new milestone, the software is deployed for wide usage to all business lines.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/36d0151b-4431-48e6-b10b-27a6ee1bc0b3">
<img class="centered" style="width: 750px; " alt="Software Development Process" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/36d0151b-4431-48e6-b10b-27a6ee1bc0b3" />
</a>
</div>
<br>
<p>
The big change introduced by DevOps is the understanding that <b>operators are the other users of the software!</b> As such, they should be fully integrated into the software development process.
<br>
At specification time, operators should state their non-functional requirements just as business users state their functional requirements. Such non-functional requirements should be handled with the same importance and priority by the development team.
<br>
At implementation time, operators should provide feedback and non-functional test specifications continuously, just as business users provide feedback on functional features.
<br>
Finally, operators become users of the software, just as business users are.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/b92539db-d476-45e4-be04-40ed46fc87dd">
<img class="centered" style="width: 750px;" alt="Software Development Process - with Ops" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/b92539db-d476-45e4-be04-40ed46fc87dd" />
</a>
</div>
<br>
<p>
With DevOps, operators become fully integrated in the Software Development Process.
</p>
<a name="sec43"></a>
<h3>4.3 Share the Tools</h3>
<p>
In traditional corporations, teams of operators and teams of developers use specific, dedicated and well-separated sets of tools.
<br>
Operators usually don't want to know anything about the dev team's SCM system or continuous integration environment. They perceive these as additional work and fear being overwhelmed by developer requests if they put their hands on these systems as well. After all, they have plenty to do taking care of production systems.
<br>
Developers, on their side, usually have no access to production system logs and monitoring tools - sometimes due to a lack of will on their side, sometimes for regulation or security reasons.
</p>
<p>
This needs to change! DevOps is here for that.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/a0429129-eba8-4a06-8ff4-bc8412dce4f8">
<img class="centered" style="width: 550px;" alt="Share the tools" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/a0429129-eba8-4a06-8ff4-bc8412dce4f8" />
</a><br>
<div class="centered">
(Source : Mathieu Despriee - OCTO Technology - <a href="http://www.slideshare.net/OCTOTechnology/introduction-to-devops-28779951">Introduction to DevOps</a>)
</div>
</div>
<br>
<p>
One should note that this can be difficult to achieve. For instance, for regulation or security reasons, logs may need to be anonymized on the fly, and supervision tools need to be secured to prevent an untrained or unauthorized developer from actually changing something in production. This may take time and cost resources. But the gain in efficiency is way greater than the required investment, and the ROI of this approach for the whole company is striking.
</p>
<a name="sec44"></a>
<h3>4.4 Work Together</h3>
<p>
A fundamental philosophy of DevOps is that developers and operations staff must work closely together on a regular basis.
<br>
An implication is that they must see one another as important stakeholders and actively seek to work together.
</p>
<p>
Inspired from the XP practice "<i>onsite customer</i>", which motivates agile developers to work closely with the business, disciplined agilists take this one step further with the practice of active stakeholder participation, which says that developers should work closely with all of their stakeholders, <b>including operations and support staff</b>.
<br>
This is a two-way street: operations and support staff must also be willing to work closely with developers.
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/ca919943-79b6-49b2-84fe-e76181de9e82">
<img class="centered" style="width: 600px;" alt="Align Development and Operation Teams" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/ca919943-79b6-49b2-84fe-e76181de9e82" />
</a>
</div>
<br>
<p>
In addition, here are a few other collaboration levers:
</p>
<ul>
<li>Have operators take part in Agile rituals (daily scrum, sprint planning, sprint retrospective, etc.)</li>
<li>Have devs take part in production rollouts</li>
<li>Share continuous improvement objectives between Dev and Ops</li>
</ul>
<a name="sec5"></a>
<h2>5. Conclusion</h2>
<p>
DevOps is a revolution that aims at addressing the <i>wall of confusion</i> between development teams and operation teams in big corporations having large IT departments where these roles are traditionally well separated and isolated.
</p>
<p>
Again, I've spent two thirds of my fifteen-year career working for such big institutions, mostly financial institutions, and I have been able to witness this wall of confusion on a daily basis. Here are some sample things I got to hear:
</p>
<ul>
<li>"<i>It worked fine on my Tomcat. Sorry but I know nothing about your Websphere thing. I really can't help you.</i>" (a dev)</li>
<li>"<i>No we cannot provide you with an extract of this table from the production database. It contains confidential customer-related data.</i>" (an ops)</li>
</ul>
<p>
And many more such examples, every day... every day!
</p>
<p>
Happily, DevOps is now several years old, and increasingly even these very traditional corporations are moving in the right direction by adopting DevOps principles and practices. But a lot remains to be done.
</p>
<p>
Now what about smaller corporations that don't necessarily have split functions between developers and operators?
<br>
Adopting DevOps principles and practices, such as deployment automation, continuous delivery and feature flipping still brings a lot.
</p>
<p>
I would summarize DevOps principles this way:
</p>
<div class="centering">
<a href="https://www.niceideas.ch/roller2/badtrash/mediaresource/00e6d23e-d1e7-4dfb-9abd-52d8f4c5673f">
<img class="centered" style="width: 420px; " alt="DevOps Wrap Up" src="https://www.niceideas.ch/roller2/badtrash/mediaresource/00e6d23e-d1e7-4dfb-9abd-52d8f4c5673f" />
</a>
</div>
<br>
<p>
DevOps is simply a step further towards Scaling Agility!
</p>
<p>
(This article is available as a PDF document here <a href="https://www.niceideas.ch/devops.pdf">https://www.niceideas.ch/devops.pdf</a> and as a slideshare presentation here
<a href="https://www.slideshare.net/JrmeKehrli/devops-explained-72664261">https://www.slideshare.net/JrmeKehrli/devops-explained-72664261</a>)
</p>