If you have never been exposed to software system design challenges, you might be totally lost on where to even begin. I believe in finding the limits to a certain extent first and then getting your hands dirty. You can start by finding some interesting products or services (ideally ones you are a fan of) and learning about their implementations. You will be surprised that, however simple they may look, they most probably involve a great deal of complexity. Don’t forget:
simple is usually complex and that’s OK™.
I believe the biggest suggestion I can give you when approaching system design challenges is this: do not assume anything! You should pin down the facts and expectations of the system first. Here are some good questions to ask which will help you start this process:
- What is the problem you are trying to solve?
- What is the peak volume of users that will interact with your system?
- What are the data write and read patterns going to be?
- What are the expected failure cases, and how do you plan to mitigate them?
- What are the availability and consistency expectations?
- Do you need to worry about any auditing or regulatory aspects?
- What type of sensitive data are you going to be storing?
These are just a few questions that have worked for me and the teams that I worked with over the years. Once you have answers to these questions (or any others which are relevant to the context you are in), you should start diving into the technical side of the problem.
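To show how the answers to the volume and write-pattern questions turn into numbers you can design against, here is a minimal back-of-envelope sketch. Every figure in it (user counts, requests per user, the peak factor, the payload size) is a made-up assumption for illustration, and the function names are hypothetical.

```python
# Back-of-envelope sizing. All inputs below are illustrative assumptions,
# not measurements from any real system.
SECONDS_PER_DAY = 24 * 60 * 60


def average_rps(daily_active_users: int, requests_per_user_per_day: int) -> float:
    """Average requests per second if traffic were spread evenly over a day."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def peak_rps(avg_rps: float, peak_factor: float) -> float:
    """Traffic is never even, so scale the average by an assumed peak factor."""
    return avg_rps * peak_factor


def daily_storage_gb(writes_per_day: int, bytes_per_write: int) -> float:
    """Raw data written per day in gigabytes (ignores replication and indexes)."""
    return writes_per_day * bytes_per_write / 1e9


avg = average_rps(daily_active_users=1_000_000, requests_per_user_per_day=20)
peak = peak_rps(avg, peak_factor=3.0)
storage = daily_storage_gb(writes_per_day=2_000_000, bytes_per_write=1_000)

print(f"average: {avg:.0f} req/s, peak: {peak:.0f} req/s, storage: {storage:.1f} GB/day")
```

The point is not the precision of the numbers but that they force you to state your assumptions out loud before you pick any technology.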
Setting Your Baseline
What do I mean by the baseline here? Well, in this era of software development, most problems "can" be solved by already existing techniques and technologies. Knowing these to a certain extent will give you a head start when you are faced with similar problems. Remember, we are writing software to solve our business' and our users' problems, and the desire is to do that in the most straightforward and simple way from a user experience point of view. Why do you need to remember this? You may feel that you should solve problems in unique ways, thinking "what's the point of me writing software if I am here to follow a pattern?". The craft here is in the decision-making process that defines where to do what. Sure, we may face challenging, unique problems at times. However, if our baseline is solid, we will know whether we should direct our efforts into finding ways to solve the problem or into further understanding its depth.
I believe I have convinced you by now that solid knowledge of how some of the exciting systems are architecturally shaped is critical for developing an appreciation of the craft and a solid baseline.
However, before jumping into this, you might want some insight into what matters the most in architectural challenges. This is important because there are A LOT of aspects involved in disambiguating a gnarly, ambiguous problem and solving it within the guidelines of a defined system.
Jackson Gabbard, an ex-Facebook employee, has
a 50-minute video on system design interviews, based on his experience interviewing hundreds of candidates at Facebook. Even though it focuses on the system design interview and what success looks like there, it's still a very comprehensive resource on what matters the most when it comes to system design. There is also
a write-up of this video.
Start Building up Your Data Storage and Retrieval Knowledge
Most of the time, how you decide to persist and serve data will play a crucial role in the performance of your system. Therefore, you should first understand the expectations around your system's data writes and reads. Then, you should be able to assess these and convert that assessment into a choice. However, you can only do this effectively if you know the existing storage patterns. This essentially means having good knowledge of the database choices.
Databases are really scalable and durable data structures. So, all your knowledge of data structures should be really beneficial for understanding the various database choices. For example,
Redis is a data structure server that supports different kinds of values. It lets you work with data structures such as sets and lists, and applies commonly known algorithms such as
LRU (as a cache eviction policy) in a durable and highly available fashion.
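If LRU is new to you, it helps to see it as a plain data structure first. The following is a minimal in-process sketch of a least-recently-used cache built on Python's `OrderedDict`. The `LRUCache` class is hypothetical and only illustrates the idea; it is not how Redis implements eviction internally.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Keeps at most `capacity` items and evicts the least recently used one."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        # Reading a key makes it the most recently used entry.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            # The first entry in the ordering is the least recently used.
            self._items.popitem(last=False)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now more recently used than "b"
cache.put("c", 3)    # over capacity, so "b" is evicted
```

Once the in-memory version makes sense, the interesting part of a database is everything it adds on top: persistence, replication and access over the network.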
Once you have a good enough grip on the various data storage patterns, it's time to get into data consistency and availability land. The CAP theorem is the first thing you should try to get a good grip on, and you can then round it off by looking deeper into established consistency and availability patterns. These will give you a wide perspective and help you understand that data writes and reads are really separate concerns with separate challenges. By embracing several consistency and availability patterns, you can gain a lot of performance while serving data to your applications.
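As one example of such a pattern (my pick for illustration, not the only option), quorum reads and writes make the trade-off tangible: with N replicas, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap only when R + W > N. The sketch below is a deliberately simplified, single-process model that always plays out the worst case; the `QuorumStore` class and its behaviour are assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Versioned:
    version: int
    value: Optional[str]


class QuorumStore:
    """Toy model: N replicas, writes reach W of them, reads consult R of them."""

    def __init__(self, n: int, w: int, r: int) -> None:
        self.n, self.w, self.r = n, w, r
        self.replicas: List[Versioned] = [Versioned(0, None) for _ in range(n)]
        self._next_version = 0

    def write(self, value: str) -> None:
        self._next_version += 1
        # Worst case for readers: only the first W replicas receive the write.
        for i in range(self.w):
            self.replicas[i] = Versioned(self._next_version, value)

    def read(self) -> Optional[str]:
        # Worst case for readers: they consult the last R replicas.
        answers = self.replicas[self.n - self.r:]
        # The highest version among the answers wins.
        return max(answers, key=lambda v: v.version).value


overlapping = QuorumStore(n=3, w=2, r=2)   # R + W > N: read and write sets overlap
overlapping.write("v1")
fresh = overlapping.read()

non_overlapping = QuorumStore(n=3, w=1, r=1)   # R + W <= N: a read can miss the write
non_overlapping.write("v1")
stale = non_overlapping.read()
```

The second configuration is cheaper and faster per operation, which is exactly the kind of performance you buy by accepting weaker consistency.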
Finally, on the data storage side, you should also be aware of caching. Should it be on both the client and the server? What data will you cache? And why? How will you invalidate the cache? (Will it be based on time? If so, how long?)
This section of system-design-primer should be a good starting point on this topic.
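To make the invalidation questions above more concrete, here is a minimal sketch of a read-through cache with time-based expiry plus explicit invalidation. The `TTLCache` class, the `load_user` loader and the 30-second lifetime are all hypothetical choices for illustration. The clock is injectable so the expiry behaviour can be shown without actually waiting.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class TTLCache:
    """Read-through cache: serve from memory until the entry expires."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit, still fresh
        value = loader(key)                      # miss or expired: go to the source
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Explicit invalidation, e.g. right after the underlying data changes."""
        self._entries.pop(key, None)


loader_calls: List[str] = []


def load_user(key: str) -> str:
    # Stand-in for a slow database or service call.
    loader_calls.append(key)
    return f"profile for {key}"


fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
cache.get_or_load("alice", load_user)   # miss: calls the loader
cache.get_or_load("alice", load_user)   # hit: served from the cache
fake_now[0] = 31.0
cache.get_or_load("alice", load_user)   # expired: calls the loader again
```

How long the lifetime should be depends entirely on how stale your consumers can tolerate the data being, which is why that question belongs in the requirements and not in the code.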
Communication Patterns
Systems are composed of various components, which can be different processes living inside the same physical node or different machines sitting in separate parts of your network. Some of these resources might be private within your network, but some need to be accessed publicly by your consumers.
These resources need to be able to communicate with each other and with the outside world. In the context of system design, this introduces another set of unique challenges. Understanding how
asynchronous workflows can help you and what
various communication protocols are available, such as TCP, UDP, HTTP (which, up to HTTP/2, sits on top of TCP), etc., will help you understand the breadth of the problem space and the solutions currently available.
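As a small illustration of an asynchronous workflow, the sketch below accepts work by putting it on a queue and lets background workers process it later, so the part that accepts the work never waits for it to finish. It uses Python's `asyncio` in a single process. The order IDs and the `worker` function are hypothetical, and in a real system the queue would typically be a separate message broker.

```python
import asyncio
from typing import List


async def worker(queue: "asyncio.Queue[int]", results: List[str]) -> None:
    while True:
        order_id = await queue.get()
        try:
            await asyncio.sleep(0)   # stand-in for slow work (email, payment, ...)
            results.append(f"processed order {order_id}")
        finally:
            queue.task_done()


async def main() -> List[str]:
    queue: "asyncio.Queue[int]" = asyncio.Queue()
    results: List[str] = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(2)]

    # The "request handler": enqueue and return immediately.
    for order_id in range(5):
        queue.put_nowait(order_id)

    await queue.join()               # wait until every queued item is processed
    for task in workers:
        task.cancel()                # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(main())
```

The cost of this decoupling is that the caller no longer knows when, or whether, the work succeeded, which is one of the unique challenges mentioned above.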
When dealing with communication to the outside world,
security is always another concern that you need to be aware of and actively deal with.
Connection Distribution
I am not sure if this logical grouping makes sense here. I will go with it anyway since it’s the closest term that reflects what I want to cover here.
Systems are formed by gluing multiple components together, and how they communicate with each other is often designed around well-established protocols such as TCP and UDP. However, these protocols are often not enough on their own to cover the needs of today’s systems, which can face high load and demands from consumers. We often need ways to distribute connections in order to handle that load.
Domain Name System (DNS) sits at the core of this distribution. DNS translates a domain name such as
www.example.com to an IP address. Besides this, some DNS services can route traffic through various methods such as weighted round robin and latency-based routing to help distribute the load.
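To give a feel for weighted round robin, here is a minimal sketch that hands out addresses in proportion to their weights. The addresses come from the documentation-only 192.0.2.0/24 range and the weights are made up. Real DNS services implement this on their side and may interleave answers differently, so treat this only as an illustration of the idea.

```python
import itertools
from collections import Counter
from typing import Dict, Iterator


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Cycle through addresses, repeating each one `weight` times per round."""
    pool = [address for address, weight in weights.items() for _ in range(weight)]
    if not pool:
        raise ValueError("at least one address needs a positive weight")
    return itertools.cycle(pool)


# Hypothetical records: send three quarters of the lookups to the first address.
records = {"192.0.2.10": 3, "192.0.2.20": 1}
resolver = weighted_round_robin(records)

answers = [next(resolver) for _ in range(8)]
share = Counter(answers)
```

A typical use is shifting a small share of traffic to a new deployment or region before moving everyone over.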
Load balancing is vital, and nearly every major system on the Web we interact with today sits behind one or more load balancers. Load balancers help us distribute incoming client requests across multiple instances of a resource. There are both hardware and software load balancers, but you will more often see software-based ones such as HAProxy and ELB. Reverse proxies are very similar in concept to load balancers, with some distinctive differences. These differences will affect your choice based on your needs.
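One common way a load balancer can pick an instance is "least connections": send the next request to whichever backend currently has the fewest requests in flight. The sketch below is a minimal in-memory version of that strategy. The `LeastConnectionsBalancer` class and the backend names are hypothetical, and real load balancers also handle health checks, timeouts and much more.

```python
from typing import Dict, Iterable


class LeastConnectionsBalancer:
    """Routes each request to the backend with the fewest active connections."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._active: Dict[str, int] = {backend: 0 for backend in backends}
        if not self._active:
            raise ValueError("at least one backend is required")

    def acquire(self) -> str:
        # Ties are broken by the order in which backends were registered.
        backend = min(self._active, key=lambda name: self._active[name])
        self._active[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        if self._active[backend] == 0:
            raise ValueError(f"{backend} has no active connections")
        self._active[backend] -= 1


balancer = LeastConnectionsBalancer(["app-1", "app-2", "app-3"])
first = balancer.acquire()    # all idle, so the first backend is chosen
second = balancer.acquire()
third = balancer.acquire()
balancer.release(second)      # the request on the second backend finishes
fourth = balancer.acquire()   # the second backend is now the least loaded
```

Compared with plain round robin, this adapts better when some requests take much longer than others.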
Content Delivery Networks (CDNs) are also something you should be aware of. A CDN is a globally distributed network of proxy servers, serving content from locations closer to the user. CDNs are usually preferred when you are serving static files such as JavaScript, CSS and HTML. It’s also common to see cloud services offer traffic managers (such as
Azure Traffic Manager) which give you global distribution and reduced latency benefits for your dynamic content. However, these services are mostly beneficial if you have stateless web services.
What About My Business Logic? Structuring Business Logic, Workflows and Components
Thus far, we have talked about the infrastructure-related aspects of a system. These are the parts of your system which your users probably have no idea about and, to be frank, they don't give a damn about them. What they care about is how they interact with your system, what they can achieve by doing so, and how the system acts on their behalf to make certain decisions and process their data.
As you might guess from this post’s title, I intended this blog post to be about software architecture and system design. Therefore, I wasn’t going to cover the software design patterns which are concerned with how the components are built. However, thinking about this more and more, it’s clear to me that the line between them is very blurred and the two sides are usually interconnected. Take
Event Sourcing for example. Once you adopt this software architecture pattern, it pretty much affects most parts of your system: how you persist data, what level of consistency you choose for your system’s clients to deal with, how you shape the components within your system, and so on. Therefore, I decided to touch on some of the design and architectural patterns which directly concern your business logic. Even if it only scratches the surface, it should be useful for giving you some ideas. Here are a few of them:
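To make the Event Sourcing example a little more concrete: instead of storing the current state, you store the sequence of events that led to it and derive the state by replaying them. The following is a minimal in-memory sketch. The bank account domain, the event names and the `Account` class are all hypothetical, and a real implementation would persist the events in a durable, append-only store.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Union


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


Event = Union[Deposited, Withdrawn]


def apply(balance: int, event: Event) -> int:
    """Pure function: fold one event into the current balance."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    raise TypeError(f"unknown event: {event!r}")


def replay(events: List[Event]) -> int:
    """State is never stored; it is rebuilt from the events."""
    return reduce(apply, events, 0)


class Account:
    def __init__(self) -> None:
        self.events: List[Event] = []   # append-only log, the source of truth

    def deposit(self, amount: int) -> None:
        self.events.append(Deposited(amount))

    def withdraw(self, amount: int) -> None:
        # Business rules are checked against state derived from past events.
        if amount > replay(self.events):
            raise ValueError("insufficient funds")
        self.events.append(Withdrawn(amount))


account = Account()
account.deposit(100)
account.withdraw(30)
current_balance = replay(account.events)
balance_after_first_event = replay(account.events[:1])
```

Even in this tiny form you can see the ripple effects: persistence becomes an append-only log, and any read model you build from the events may lag behind the latest write.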
Collaboration Approaches
It's highly unlikely that you are going to be the only one involved in a project where you need to be part of a system design process. Therefore, you need to be able to collaborate with other folks in your team, both inside and outside of your job function. This surface area also has both breadth and depth, and as the technical leader, you should be able to address the concerns at each level by going into them at the required depth. The activities here may involve evaluating technology choices together, or pinning down the business needs and understanding how the work needs to be parallelised.
First and foremost, you need to have an accurate and shared understanding of what you are trying to achieve as a business goal and which moving parts are involved. Group modeling techniques such as
event storming are powerful methods to accelerate this process and increase your chances of success. You may get into this process before or after you define your
service boundaries, depending on your product/service maturity stage. Based on the level of alignment you see here, you may want to facilitate a separate activity to define the
Ubiquitous Language for the bounded context you are operating in. When it comes to communicating the architecture of your system, you may find
the C4 model for software architecture from
Simon Brown useful, especially when it comes to understanding what level of depth you should go into while visualising what you are trying to convey.
There are most probably other mature techniques available in this space. However, all of them will tie back to your understanding of the domain, and your experience with and knowledge of Domain-driven Design will come in handy.
Some Other Resources
Here are some resources which may help you. These are not in any particular order.