Service Mesh - works great if you can accept its trade-offs

17/02/2021 16/02/2021

Service mesh is a networking solution for organizations dealing with microservice proliferation. Simply put, as soon as you go beyond a few services, inter-service communication becomes challenging. A service mesh attempts to make that communication trivial to implement from the app developer’s side.

Service mesh works by running a lightweight proxy next to each service instance (commonly using the sidecar pattern), and that proxy handles all of the instance’s network traffic. This approach can bring quite a lot of flexibility to the system, but like everything in life, it comes with a few costs and tradeoffs of its own.

Assumptions bundled with Service meshes

Let’s start with a few key assumptions that a service mesh commonly makes, and then look at its positive sides.

Containers and Container orchestration platforms

The main operating mode of service mesh deployments is on top of or alongside container orchestration platforms. Container platforms such as Docker have brought a whole new set of opportunities and challenges to software development. At the time of writing, the de facto way to orchestrate many containers is Kubernetes. Most service meshes have been built to solve networking issues that come from running many microservices on top of such platforms.

It’s quite common to run hundreds of containers on top of Kubernetes clusters. You could say it’s as common as mistyping catainers or dokcer into Google 😀

Ability to use sidecars

The magic of a service mesh comes from the non-intrusive way it integrates with your applications. Just running a small, efficient proxy next to your microservice is a very elegant way of separating networking concerns from the app itself. But it assumes that you can run such a sidecar in the first place. With containers, this is almost trivial, but your app might not be containerized. We already covered cases where you want to avoid using Docker containers. While it may still be possible to integrate such apps into your service mesh, you lose the “just works” part that is so crucial to a service mesh.

Layer 7 network control

You can look at a service mesh as an OSI Layer 7 network. It only exposes and operates on application-level abstractions. This is perfectly adequate for most web applications, as they mostly run on top of HTTP web servers. But if your application needs to live closer to the networking hardware or requires some lower-layer networking, a service mesh might prove considerably more difficult to implement.

A clear boundary between the service mesh and the other systems

The place where service mesh shines is cloud computing. You rent a bunch of instances, provision Kubernetes with a service mesh on top, and deploy all your microservices there. On one side there is a clear boundary with the internet, and on the other a clear internal network. In this case, all the benefits below will work like a charm. But if you try to implement a service mesh and colocate it with a bunch of legacy services, servers, and gear, you might end up needing quite a bit of unexpected troubleshooting or funky hacks.

Benefits of Service Mesh

It’s not all doom and gloom 😀 Service mesh has clear use cases and clear places where it shines. So let’s look over some of its positive sides:

Just works – if your use case is “make this bunch of k8s microservices accessible online”, the service mesh is perfect for you.
Fine-grained networking – if you need a more detailed way of describing which application can talk to which application, service mesh will shine.
Load balancing – distributing load between different instances of the same app means that you can scale the whole architecture horizontally more easily, especially if said apps are stateless.
Service discovery – you don’t have to care which app runs where, because the mesh takes care of discovering where your apps are running. It also ensures that apps stay reachable in the same way even if they move to another host in the cluster.
Encryption – most major service mesh providers allow for both encrypted and unencrypted networks. If you don’t trust a network with plaintext communication, you can turn on encryption for that part of the network. Just bear in mind that encryption has a performance, computing, and latency cost.
Authentication and authorization – you can offload most app-to-app authentication and authorization to a service mesh. This has the added benefit that you get said functionality without having to add it to the app codebase.
Observability and metrics – automagical metrics and insight into the network are part of the service mesh package. Whether you need them and can afford their cost is another matter.
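
To make the service discovery and load balancing points a bit more concrete, here is a toy sketch in Python. It is not how any particular mesh implements these features: the ToyMesh class, its method names, and the addresses are all made up for illustration, and a real mesh does this work in its proxies and control plane rather than in your app code.

```python
from collections import defaultdict


class ToyMesh:
    """Hypothetical in-memory stand-in for mesh discovery and load balancing."""

    def __init__(self):
        self._instances = defaultdict(list)  # service name -> known addresses
        self._cursors = {}                   # service name -> round-robin counter

    def register(self, service, address):
        # Called when an instance starts (or moves to another host).
        if address not in self._instances[service]:
            self._instances[service].append(address)

    def deregister(self, service, address):
        # Called when an instance stops or becomes unhealthy.
        self._instances[service].remove(address)

    def resolve(self, service):
        # Callers only know the service name; the mesh picks an instance
        # in round-robin order.
        instances = self._instances.get(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        n = self._cursors.get(service, 0)
        self._cursors[service] = n + 1
        return instances[n % len(instances)]


mesh = ToyMesh()
mesh.register("cart", "10.0.0.1:8080")
mesh.register("cart", "10.0.0.2:8080")
for _ in range(4):
    print(mesh.resolve("cart"))  # alternates between the two instances
```

The calling app only ever asks for "cart". If an instance is rescheduled, the registry entry changes and the caller does not, which is the “accessibility stays the same” part.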

Tradeoffs that might decide if service mesh is right for you

A lot of the decision-making comes down to a few key tradeoffs. Not being able to live with some of the drawbacks or limitations of service mesh might mean that it’s not the right solution for your company. That being said, it can still be part of your infrastructure. Just keep in mind that operational complexity grows with every new thing you add and might make it hard to focus on what your team is supposed to be doing.

Fixed costs of running Service Mesh

Service mesh typically comes with a centralized controller that you interact with and a bunch of small proxies that run as sidecars next to your applications. On their own these components are quite negligible, but when you have many containers they can add up to a major cost. For instance, Kubernetes has documented limits on how many pods and containers a cluster should run, and its control plane needs capacity in proportion to the number of containers it manages. Implementing a service mesh means that you effectively cut such container limits in half, since for each application container you need to start 2 containers.
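
As a rough back-of-the-envelope sketch of that doubling, assuming one sidecar per pod: the function names and the 50 MiB per-sidecar figure below are placeholders for illustration, not measurements of any real mesh.

```python
def containers_with_sidecars(app_pods, app_containers_per_pod=1, sidecars_per_pod=1):
    """Return (containers without mesh, containers with mesh)."""
    without_mesh = app_pods * app_containers_per_pod
    with_mesh = app_pods * (app_containers_per_pod + sidecars_per_pod)
    return without_mesh, with_mesh


def sidecar_memory_overhead_mib(app_pods, sidecar_memory_mib):
    """Total extra memory if every pod carries one sidecar of the given size."""
    return app_pods * sidecar_memory_mib


before, after = containers_with_sidecars(app_pods=500)
print(before, after)                          # 500 containers become 1000
print(sidecar_memory_overhead_mib(500, 50))   # 50 MiB is an assumed figure
```

With single-container pods the container count doubles exactly. Pods that already run several containers see a smaller relative increase, so it is worth plugging in your own numbers.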

Latency and Throughput when using service mesh network

Service mesh will add 2 network hops, as the origin app talks to its local proxy, and that proxy talks to another proxy near the destination app. In most cases the added latency is negligible, but some report considerably worse edge cases. In the worst cases, latency can be 5-8 times higher than with a more traditional network.
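
Here is a deliberately simplified model of those extra hops. It assumes each proxy adds a fixed delay and handles both the request and the response; the function name and the numbers are made up for illustration, and real proxy overhead varies with load, payload size, and features such as encryption.

```python
def mesh_round_trip_ms(direct_round_trip_ms, proxy_traversal_ms, proxies_on_path=2):
    """Round-trip time once sidecar proxies sit on the path.

    Each proxy is traversed twice: once by the request, once by the response.
    """
    return direct_round_trip_ms + 2 * proxies_on_path * proxy_traversal_ms


direct = 2.0                               # assumed direct round trip, in ms
meshed = mesh_round_trip_ms(direct, 0.5)   # assumed 0.5 ms per proxy traversal
print(f"{direct} ms direct vs {meshed} ms via mesh ({meshed / direct:.1f}x)")
```

The absolute overhead is small, but the ratio depends heavily on how fast the direct call was to begin with, which is one way very fast internal calls can end up with a large multiplier.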

Sadly, added latency has a clear cost: a load-time delay of only a few seconds can lead to a significant loss of interest or sales. And most of this networking logic runs in containerized software, which is considerably less efficient than bare-metal or hardware-accelerated networking.

You might be tempted to disregard such cases as not relevant at your scale. But you might do a docker pull from a laptop with an SSD against an in-mesh Docker registry with SSD storage over a 100Gig network, and the server CPU might end up being what slows you down because it can’t sustain the load. The Linux networking stack has a clear limit on what it can do, and the closer you get to that limit, the less efficient networking gets. Not being able to offload parts to specialized hardware might put a hard cap on your scalability no matter how much money you throw at the problem.

Don’t get me wrong, software-based routing and load balancing can still operate under high load. HAProxy can achieve 2 million SSL connections on what counts as commodity hardware in the server world. But if you’re trying to scale something globally to millions of users, those limits might end up being much closer than you expect.

Vendor Lock-in

Like pretty much everything else in networking, migrating to another vendor might be extremely painful. Changing a vendor might involve changing gear, reworking applications, and completely reimplementing everything in another vendor’s system. A nice way to look at every production-grade networking decision is as a marriage with a costly divorce contract. Also, keep in mind that vendors can go out of business. That might leave you stuck with vulnerabilities and bugs for years to come.

Most service meshes on the market are open source (or OSS+premium), which makes this risk slightly smaller. But OSS projects can get abandoned by major backers or become unpopular among developers. They can even end up oriented towards features that hurt your use case.

Service mesh limits your networking options

As previously mentioned, service mesh operates as an OSI L7 network, meaning that you can specify that this app can access this other app on this URL. If you need more detailed networking, you might have to use something else. Stuff like creating a VXLAN, opening a UDP port, creating a vSwitch, or connecting a container to a VPN is simply out of the scope of service mesh platforms. Some of these features might be supported, but often they are second-class citizens compared to the main selling features. Performance comparisons of such features might also be difficult to find, or even to run yourself.
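
To show what “this app on this URL can access this other app” looks like as a rule, here is a small sketch. The AllowRule type, the is_allowed function, and the service names are hypothetical and do not correspond to any specific mesh’s policy format; real meshes express this in their own configuration resources.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class AllowRule:
    source: str                 # identity of the calling service
    destination: str            # service being called
    path_pattern: str           # glob over the HTTP path, e.g. "/api/orders/*"
    methods: tuple = ("GET",)   # allowed HTTP methods


def is_allowed(rules, source, destination, method, path):
    """Default deny: a request passes only if one rule matches every field."""
    return any(
        rule.source == source
        and rule.destination == destination
        and method in rule.methods
        and fnmatchcase(path, rule.path_pattern)
        for rule in rules
    )


RULES = [AllowRule("frontend", "orders", "/api/orders/*", ("GET", "POST"))]

print(is_allowed(RULES, "frontend", "orders", "GET", "/api/orders/42"))
print(is_allowed(RULES, "billing", "orders", "GET", "/api/orders/42"))
```

Notice that the whole vocabulary here is services, HTTP methods, and paths. There is nowhere to express a UDP port or a VXLAN, which is the limitation in a nutshell.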

Management overhead of many services

The last concern is not exactly a trade-off or problem per se. Just like Kubernetes, Docker, and many other tools, a service mesh makes it easier to run microservices in production environments. Easy running is good unless taken to the extreme. This can easily happen while breaking up a monolith, or when many teams keep adding microservices to a mesh over many years.

If your business function (e.g. the stuff that your customers actually use/do) has to go through 50 microservices to “complete”, your app will end up being too slow for anyone to use. This is sadly not something any particular tool can help solve. It’s an application architecture design problem. The solution won’t be in a mesh or any other current buzzword du jour. Such hard edge cases have to be handled by stepping back and having several hard discussions.
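
A quick sketch of how that adds up, assuming the 50 calls happen one after another. The function name, the 20 ms per service, and the 4 ms of per-call overhead are invented for illustration only.

```python
def sequential_chain_latency_ms(service_times_ms, per_call_overhead_ms=0):
    """Total latency when each service in the chain is called in sequence."""
    return sum(t + per_call_overhead_ms for t in service_times_ms)


chain = [20] * 50  # 50 services, each assumed to take 20 ms

print(sequential_chain_latency_ms(chain))                          # no overhead
print(sequential_chain_latency_ms(chain, per_call_overhead_ms=4))  # with overhead
```

Even with zero networking overhead the chain already takes a full second, so shaving the per-call overhead only helps at the margins. The number of sequential calls is what has to change.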


buzzwrd.me