For most of us, the change of the year is an occasion for thinking about what we missed doing last year and where we want to improve. I decided there are a couple of things where I would like to do better in 2017 than in 2016. One of the thing is that I would like to do more blogging and writing in general.

As a first blog post in 2017, I would like to write about How to design and build reliable, scalable and maintainable applications.


As you are all aware that many applications today are data-intensive rather than compute-intensive. The bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.

For many application we need to :

  • Store data – databases.
  • To speed up reads – caches.
  • For search data by keyword or filter it in various ways – search indexes.
  • Send a message to another process, to be handled asynchronously (stream processing, realtime, etc).
  • Periodically crunch a large amount of accumulated data (batch processing, jobs, etc).

How can we figure out which tools and which approaches to use based on the requirements ?

For example, if you have an application-managed caching layer (using Hazelcast or similar), or a search server separate from your main database (such as Solr), it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database.

When you combine several tools in order to provide a service, the service’s interface or API usually hides those implementation details from clients.  Thus system may provide certain guarantees, e.g. that the cache will be correctly invalidated or updated on writes, so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.

If you are designing a data system or service, a lot of tricky questions arise.

  • How do you ensure that the data remains correct and complete, even when things go wrong internally ?
  • How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load ?
  • What does a good API for the service look like ?

There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the time‐scale for delivery, organization’s tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation. But there are three concerns which are important in any data systems.


The system should continue to work correctly even in the face of adversity (hardware or software faults, and even human error).

Let’s dig more about various aspects of reliability

Hardware faults:

When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty,  someone unplugs the wrong network cable. These things happen all the time when you have a lot of machines/cluster. Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.

However, as data volumes and applications computing demands increase, more
applications are using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services it is fairly common for virtual machine instances to become unavailable without warning as the platform is designed to prioritize flexibility and elasticity over single-machine reliability.

Software errors:

  • A software bug that causes every instance of an application server to crash when  given a particular bad input. For example, consider the leap second on June 30, 2012 that caused many applications to hang simultaneously, due to a bug in the Linux kernel.
  • A runaway process uses up some shared resource — CPU time, memory, disk
    space or network bandwidth.
  • A service that the system depends on slows down, becomes unresponsive or
    starts returning corrupted responses.

The bugs that cause these kinds of software fault often lie dormant for a long time until they are triggered by an unusual set of circumstances. There is no quick solution to the problem of systematic faults in software. Lots of small things can help, carefully thinking about assumptions and interactions in the system, thorough testing, process isolation, allowing processes to crash and restart, measuring, monitoring and analyzing system behavior in production.

Human errors:

Humans design and build software systems, and the operators who keep the system running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, The configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages based on the survey.

How do we make our system reliable, in spite of unreliable humans ?

  • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs and admin interfaces make it easy to do “the right thing”, and discourage “the wrong thing”. However, if the interfaces are too restrictive, people will work around them, negating their benefit, so this is a tricky balance to get right.
  • Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully-featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
  • Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
  • Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
  • Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures) Monitoring can show us early warning signals, and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
  • Good management practices.

How important is reliability ?

Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of platform can have huge costs in terms of lost revenue and reputation. Even in non-critical applications we have a responsibility to our users.

Consider  a user who stores all his important files, pictures and videos in the cloud application. How would they feel if that database was suddenly corrupted ? Would they know how to restore it from a backup ?

There are situations in which we may choose to sacrifice reliability in order to reduce development cost or operational cost but we should be very conscious of when we are cutting corners.


Over time, many different people will work on the system maintaining current behavior and adapting the system to new use cases and they should all be able to work on it productively.

It is well-known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance — fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.

Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems — perhaps it involves fixing other people’s mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them. However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves.

Operability: making life easy for operations

“Good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations”.

While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place, and to make sure it’s working correctly.
Operations teams are vital to keeping a software system running smoothly. A good 0perations team typically does the following, and more

  • Monitoring the health of the system, and quickly restoring service if it goes into a bad state.
  • Tracking down the cause of problems, such as system failures or degraded performance.
  • Keeping software and platforms up-to-date, including security patches.
  • Aanticipating future problems and solving them before they occur, e.g. capacity planning.
  • Establishing good practices and tools for deployment, configuration management and more.
  • Performing complex maintenance tasks, such as moving an application from one platform to another.
  • Provide visibility into the runtime behavior and internals of the system, with good monitoring.
  • Avoid dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted).
  • Good default behavior, but also giving administrators the freedom to override defaults when needed.
  • Self-healing where appropriate, but also giving administrators manual control over the system state when needed.
  • Predictable behavior, minimizing surprises.

Simplicity: managing complexity

Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud.

There are many possible symptoms of complexity – explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more.

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change. when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.

Making a system simpler does not necessarily mean reducing its functionality. It can also mean removing accidental complexity. One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand facade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than re-implementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.

However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level.

Evolvability: making change easy

It’s extremely unlikely that your system’s requirements will remain unchanged forever. Much more likely, it is in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc.

In terms of organizational processes, agile working patterns provide a framework for adapting to change. The agile community has also developed technical tools and patterns that are helpful when developing software in a frequently-changing environment, such as test-driven development (TDD) and refactoring.

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions – simple and easy-to understand systems are usually easier to modify than complex ones.


As the system grows (in data volume, traffic volume or complexity), there should be reasonable ways of dealing with that growth.

Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in future. One common reason for degradation is increased load, perhaps it has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.

Scalability is the term we use to describe a system’s ability to cope with increased

Describing load:

First, we need to succinctly describe the current load on the system, only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system, perhaps it’s requests per second to a webserver, ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.

Describing performance:

Once you have described the load on our system, you can investigate what happens when the load increases. You can look at it in two ways:

  • When you increase a load parameter, and keep the system resources (CPU,
    memory, network bandwidth, etc.) unchanged, how is performance of your system affected?
  • When you increase a load parameter, how much do you need to increase the
    resources if you want to keep performance unchanged?

In a batch-processing system such as Hadoop, we usually care about throughput the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. In online systems, the response time of a service is usually more important — that is, the time between a client sending a request and receiving a response.

Latency and Response Time:

  • The response time is what the client sees, besides the actual time to process the request (the service time), it includes network delays and queueing delays.
  • Latency is the duration that a request is waiting to be handled — during which it is latent, awaiting service.

Approaches for coping with load:

How do we maintain good performance, even when our load parameters increase by some amount ?

An architecture that is appropriate for one level of load is unlikely to cope with ten times that load. If you are working on a fast-growing service, it is therefore likely that you will need to re-think your architecture on every order of magnitude load increase, perhaps even more often than that.

People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared nothing architecture.

A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can’t avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches – for example, several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.

Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually. An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.

While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high availability requirements forced you to make it distributed.

As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of application. It is conceivable that distributed data systems will become the default in future, even for use cases that don’t handle large volumes of data or traffic. Over the course of the rest of this book we will cover many kinds of distributed data system, and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.

Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns.


An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, e.g. allow data to be stored, retrieved, searched and processed in various ways), and some non-functional requirements (general properties like security, reliability, compliance, scalability, compatibility and maintainability).