If you’re working with Kubernetes, you know that it can be complex. The complexities only increase when you start looking at things like dynamic storage provisioning, different container runtimes, security and GitOps — and networking.
When we talk about Kubernetes networking, we’re typically talking about pod and service networking and tools like Calico, Flannel, Cilium, Canal, etc.
While tools like Calico are great at providing communication between Kubernetes components, services and applications, they aren’t always enough to address all the challenges of distributed or microservices architectures.
This is where a service mesh comes in.
In its simplest form, all a service mesh on Kubernetes does is manage the communication between Services. A service mesh allows the cluster administrator to apply routing rules based on business logic, the desired performance and availability of services, and the underlying network infrastructure. The three generally accepted functional areas of a service mesh are traffic management, security and observability.
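To make that concrete, here’s a minimal sketch of what such a routing rule can look like, using Istio’s VirtualService resource as an example. The `reviews` service and its `v1`/`v2` subsets are hypothetical, and the subsets would be defined in a separate DestinationRule:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews              # the Kubernetes Service this rule applies to
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90        # keep 90% of traffic on the stable version
        - destination:
            host: reviews
            subset: v2
          weight: 10        # send 10% to the new version
```

Rules like this live entirely in the mesh layer, so the applications themselves don’t need to know anything about them.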
Choosing a service mesh: where to start?
Just as with network plugins, there are a bunch of service mesh solutions to choose from.
A great place to start looking for Kubernetes-native tools is the CNCF Landscape, and as you can see there, you have quite a few to choose from. “So, how do I pick the right one?” you might be asking yourself.
Well, I wouldn’t recommend just picking the one with the coolest logo, and the one with the most features won’t always be the right choice either, because more features mean more complexity.
And there are a lot of possible features, as the table below shows.
Power comes at a cost
In life everything comes at a cost, and service meshes are no exception. When evaluating whether to implement one, and which one to choose, you’ll need to consider:
- Resource consumption. A service mesh adds another resource-hungry component to your precious Kubernetes clusters. These extra components, doing extra processing on all of your traffic (depending on which features you use), create more overhead and require more CPU and perhaps RAM to function efficiently.
- Latency. Most service meshes use sidecar proxies, meaning that each service call has to run through them first, which adds extra hops and thus extra latency.
- Architectural complexity. While networking is abstracted away, the mesh adds another layer of complexity and configuration to manage. Service meshes with sidecars but without automatic sidecar injection are an extra pain to manage (see the sketch after this list for what auto-injection looks like).
The more awesome traffic shaping, security and observability features you use, the greater the cost.
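As a side note on that last point: with meshes that do support automatic injection, opting a workload into the mesh is usually a one-line change. A minimal sketch, assuming Linkerd and a hypothetical namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                   # hypothetical namespace
  annotations:
    linkerd.io/inject: enabled   # Linkerd’s webhook injects a proxy into every new pod here
```

Istio does the same thing with a namespace label, `istio-injection: enabled`.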
What do you actually need?
It’s always best to think about your requirements and why you need a service mesh in the first place.
Valid requirements might include specific metrics, service discovery, traffic monitoring, tracing, failure recovery, load balancing and circuit breaking, or more complex capabilities like encryption, access control, A/B testing, canary deployments, traffic flow control and end-to-end authentication. There are many reasons you might want to implement a service mesh, but they all come at a cost. The only way to keep that cost to a minimum is to think about what you actually need.
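To make one of those requirements concrete: circuit breaking in a mesh is typically a few lines of configuration rather than application code. A minimal sketch using Istio’s DestinationRule, where the `reviews` service name and the thresholds are hypothetical:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews                        # hypothetical Service name
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # cap the queue of pending requests
    outlierDetection:
      consecutive5xxErrors: 5          # eject a pod after 5 consecutive 5xx responses
      interval: 30s                    # how often the ejection analysis runs
      baseEjectionTime: 60s            # how long an ejected pod stays out of the pool
```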
So, what is it that organizations and users actually need? I’m sorry to do this to you, but it depends ¯\_(ツ)\_/¯. As with most things in tech, this really depends on your own situation.
What type of organization are you working for? Is it big or small? How many people will be managing this environment? Is the environment mission-critical? Is it internal or external, and are there any SLAs on the environment and services? And how deep is your (Kubernetes) networking knowledge?
All of these questions might impact your decision and change your requirements. At a minimum, they might move things around on your ‘must-have or nice-to-have’ list.
From what I’ve seen in the field, most organizations are looking for a way to monitor and control the traffic flow in their environment, provide mutual TLS (mTLS) between pods and/or services, remove certain hardcoded traffic management and security components from their microservices, and diagnose communication problems and track performance and health with more ease.
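That mTLS requirement is a good example of how little configuration some of this can take. A minimal sketch using Istio’s PeerAuthentication resource (Linkerd, by contrast, enables mTLS by default with no configuration at all):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # the root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plain-text traffic between workloads
```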
So, in most cases a lightweight service mesh with basic functionality and low management, configuration and resource overhead is more than enough.
Open source options
The list of available service meshes is pretty long. If you take a look at https://landscape.cncf.io/card-mode?category=service-mesh&grouping=category, you’ll see the service meshes that have been donated to the CNCF or that come from CNCF members. To keep this blog from getting too long, I’ll pick the five open source options with the most GitHub stars.
In the table above, we can see that the most popular, by far, is Istio. Does this mean that you should just pick Istio? Well, no.
While Istio is feature-rich, it’s also one of the most complex service meshes. A lot of effort has been put into making the installation simpler, but you’ll still have to cut through a forest of configuration options to get what you need.
There are a couple of newer service meshes available that I’ve yet to test, but my recommendation today is Linkerd. It was one of the first with a simple mTLS setup and automatic proxy injection, although those options are becoming pretty standard across the board. If you want to give it a try, Linkerd’s Getting Started guide is the place to begin.
What is Envoy?
Let’s take a little side-step and talk about Envoy, because you will run into this name a lot while researching service meshes.
For those of you who have never heard of it: “The Envoy Proxy is an open source edge and service proxy, designed for cloud-native applications”, and it’s the underlying technology for most service mesh sidecar proxies. Of the top five service meshes, Istio, Consul, Kuma and OSM all leverage the Envoy proxy.
The only one up there not using Envoy is Linkerd, which, last I checked, is also the one with the highest performance; it runs on a micro-proxy written in Rust. Istio is the most popular because it leverages most of Envoy’s rich feature set and can be extended and adapted like no other, but that’s also why it’s a pain to configure. If you’d like to see the difference in resource utilization between linkerd-proxy and Envoy, check out this blog. Of course, that blog was written by Linkerd, so you might want to run your own tests as well; the actual resource usage might be similar when using the same features. Then the question still remains: which one is the simplest to manage?
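To give you a feel for what these sidecars actually do, here’s a heavily trimmed sketch of a static Envoy configuration: one listener that proxies all incoming HTTP traffic to a local upstream. In a mesh you never write this by hand; the control plane generates the equivalent configuration and pushes it to each proxy dynamically. The addresses and ports below are just illustrative:

```yaml
static_resources:
  listeners:
    - name: inbound
      address:
        socket_address: { address: 0.0.0.0, port_value: 15001 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: inbound_http
                route_config:
                  virtual_hosts:
                    - name: local_app
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: local_app }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: local_app
      connect_timeout: 1s
      type: STATIC
      load_assignment:
        cluster_name: local_app
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: 127.0.0.1, port_value: 8080 }
```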
Keep an eye out for Cilium Service Mesh
In my previous blog, Getting started with Cilium for Kubernetes networking and observability, I discussed the benefits of Cilium as a network plugin that uses eBPF. The team at Isovalent has now also released Cilium Service Mesh, and I highly recommend keeping an eye on this one! No more sidecar proxies (there’s one proxy per node instead of one per pod or service), and a lot more possibilities with eBPF working at the kernel layer, so we should expect many cool things from eBPF-powered service meshes.
Other service mesh options
Of course, there are also non-open-source solutions, as well as open source options not listed on the CNCF Landscape, but most of them are specific to a particular set of tools or (cloud) provider. For example:
- AWS App Mesh
- Azure Service Fabric Mesh
- Red Hat OpenShift Service Mesh
- Gloo Mesh
- Tigera Calico Cloud
- VMware Tanzu Service Mesh
If you’re investing in an ecosystem like AWS or OpenShift, there may be clear advantages to choosing the platform’s native service mesh — or you may not even have a choice.
Manage everything in one place!
We all know that tools like service meshes add complexity on top of complexity, so you might want to manage all of those components from a single pane of glass.
That’s what our Palette platform does for your Kubernetes environment.
Here’s an example of a Palette blueprint for a cluster with all the components needed to run a WordPress website, plus Istio deployed for your service meshy needs. Once deployed, this cluster will be able to fully manage itself, and any direct changes to the configurations will be automatically reconciled. This is the power of declarative management and deployment of your entire stack using Cluster Profiles in Palette, and you can try Palette absolutely free.
Final thoughts
While Kubernetes by itself is complex enough, there are people and teams out there making full use of microservices who need a way to do more advanced tracing, traffic shaping and observability. And then there are the rest of us, who would just like to make our service-to-service communication safer with features like automatic mTLS. Either way, the answer is a service mesh.
But given all the choices out there (and the cost and complexity they come with), we all need to make sure we do our research and pick the simplest tool that does the job. If you don’t, well, good luck opening Pandora’s box! 😂