This is the second post in our Service Mesh series. Where the first post introduced service meshes at a high level, this one will drill down further into common service mesh features and the problems they address.
Understanding the problems addressed by a service mesh is key to determining its suitability for your systems. However, it's not until we uncover these system issues that we can begin to identify the trade-offs involved in adoption. In broad strokes, service meshes have three fundamental benefits:
Network communication can fail in a number of ways. It may be due to an error generated by the remote service, or failure in the infrastructure. Overloaded networks may also result in latency spikes and bandwidth contention. Whatever the cause, applications should handle these issues gracefully. The worst response is for the application to enter a dysfunctional state that masks the underlying error.
Service meshes implement the most common strategies for enabling graceful degradation in the event of communication errors, namely timeouts, retries, and circuit breakers.
Every request to a remote service must have an explicit timeout. Without it, your application may wait indefinitely for a response from a failing or overloaded upstream service. It is one of the simplest defensive measures, but also one of the most easily neglected. Service meshes allow timeouts to be specified as simple configuration rather than being baked into each application.
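As a rough illustration of what each application would otherwise have to carry itself, here is a minimal sketch of an explicit timeout wrapped around an arbitrary call. The helper name `call_with_timeout` and the thread-based approach are assumptions made for this example; they are not how any particular mesh enforces timeouts.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s):
    """Run fn() in a worker thread and stop waiting after timeout_s seconds.

    Hypothetical helper for illustration only. The worker thread is not
    cancelled; the caller simply stops waiting for it.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"no response within {timeout_s}s") from None
    finally:
        # Do not block on the slow call while cleaning up.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(call_with_timeout(lambda: "ok", timeout_s=1.0))
    try:
        call_with_timeout(lambda: time.sleep(0.3), timeout_s=0.05)
    except TimeoutError as err:
        print("gave up:", err)
```

With a mesh, the equivalent of `timeout_s` lives in configuration, so it can be changed without redeploying the application.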
In distributed systems - specifically, applications built with microservices - it is quite common for failures to be transient. For example, when calling a load-balanced upstream service, the first request may fail due to errors on one service instance, while an immediate retry would be routed successfully to a different instance. The logic for retrying a request is implemented transparently within the mesh so that application code remains uncluttered by transient error handling.
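The kind of retry logic that a mesh takes out of application code might look like the sketch below. `TransientError`, `call_with_retries`, and the exponential backoff are illustrative assumptions, not the behaviour of a specific mesh.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. an HTTP 503)."""


def call_with_retries(fn, attempts=3, backoff_s=0.0):
    """Call fn() up to `attempts` times, retrying only on TransientError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError as err:
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff between attempts: 1x, 2x, 4x, ...
                time.sleep(backoff_s * (2 ** attempt))
    raise last_error


if __name__ == "__main__":
    calls = []

    def flaky():
        # Fails on the first call (one bad instance), succeeds afterwards.
        calls.append(1)
        if len(calls) < 2:
            raise TransientError("instance unavailable")
        return "ok"

    print(call_with_retries(flaky), "after", len(calls), "calls")
```

Retries are only safe for requests that can be repeated without side effects, which is one reason to keep the policy configurable per route.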
Circuit breakers are one of the more complex failure handling strategies and should be used with care. They allow you to define conditions under which no further connections will be sent to a given host/instance on the mesh. For example, you could set a limit on the number of concurrent connections to a host to prevent overloading. Another example is to specify the number of consecutive error responses that cause a host to be proactively removed from load balancing.
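The two conditions mentioned above, a cap on concurrent connections and ejection after consecutive errors, can be sketched for a single host as follows. The class name `HostBreaker` and its thresholds are hypothetical, and the sketch leaves out re-admitting an ejected host after a cool-down period.

```python
class HostBreaker:
    """Per-host circuit breaker sketch (illustrative, not a mesh API)."""

    def __init__(self, max_connections=2, max_consecutive_errors=3):
        self.max_connections = max_connections
        self.max_consecutive_errors = max_consecutive_errors
        self.active = 0
        self.consecutive_errors = 0
        self.ejected = False

    def try_acquire(self):
        """Return True if a new connection to this host is allowed."""
        if self.ejected or self.active >= self.max_connections:
            return False
        self.active += 1
        return True

    def release(self, success):
        """Record the outcome of a finished connection."""
        self.active -= 1
        if success:
            self.consecutive_errors = 0
        else:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_consecutive_errors:
                # Remove the host from load balancing.
                self.ejected = True


if __name__ == "__main__":
    breaker = HostBreaker(max_connections=1, max_consecutive_errors=2)
    print(breaker.try_acquire())   # first connection is allowed
    print(breaker.try_acquire())   # second is rejected: limit reached
```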
Because of the way service meshes address network reliability concerns, their architectures lend themselves well to layering on additional cross-cutting functionality. Security features may be among the most critical of those value-added features. The following paragraphs describe how these features - from public-key infrastructure through to application-tier authentication and authorization policies - build on one another to provide defence in depth.
Public Key Infrastructure (PKI) may seem like the least exciting feature of a service mesh, but it is one that most security features depend on. While the exact implementations differ between meshes, PKI allows cryptographic identities to be dynamically generated for workloads and distributed via certificates. These certificates then provide the basis for encrypting, authenticating, and authorizing communication within the mesh. Due to their dynamic lifecycles, certificates are automatically rotated by the mesh control plane - a recommended security practice that can be painful to implement without a team.
An extension of the PKI features is to encrypt communication between workloads with the same certificates and keys used to verify their identities. Mutual TLS (mTLS) not only ensures that traffic is encrypted, as "one-way" TLS does, but also provides a two-way authentication mechanism that ensures each side is communicating with a legitimate workload. A service mesh removes the complexity of managing the TLS handshake process within the application code. Instead, microservices can simply communicate in plaintext, and the service mesh will transparently encrypt and decrypt the traffic between workloads.
Another advantage of having cryptographic workload identities is that those identities can be used to restrict access patterns to microservices. Most service meshes also support alternate authentication mechanisms, like JWTs, which are typically used for user-based authentication. Regardless of the authentication method, once the source of a request has been identified, it can be compared to the authorization policies for the microservice. These authorization policies typically provide rich control over which HTTP methods and paths may be accessed on the service for layer 7 traffic, as well as controls for layer 4 traffic. The exact capabilities vary by service mesh, but restricting application-tier traffic is a valuable tool for achieving defence in depth.
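To make the idea of a layer 7 authorization policy concrete, here is a minimal sketch of a deny-by-default check that matches an authenticated identity against allowed methods and path prefixes. The `Rule` structure, the identity strings, and `is_allowed` are all invented for this example and do not mirror any mesh's policy language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    principal: str        # authenticated workload identity (hypothetical format)
    methods: frozenset    # allowed HTTP methods
    path_prefix: str      # allowed path prefix


def is_allowed(rules, principal, method, path):
    """Deny by default; allow only if some rule matches all three fields."""
    return any(
        rule.principal == principal
        and method in rule.methods
        and path.startswith(rule.path_prefix)
        for rule in rules
    )


if __name__ == "__main__":
    rules = [Rule("frontend", frozenset({"GET"}), "/orders")]
    print(is_allowed(rules, "frontend", "GET", "/orders/42"))
    print(is_allowed(rules, "frontend", "DELETE", "/orders/42"))
```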
The complexities introduced by microservice architectures lead to a number of new costs in the development and operations lifecycles. The increased coordination required when deploying interdependent microservices results in a multiplication of failure modes and, therefore, more planning efforts and an increased risk of outages. These issues are further exacerbated when coordination must occur across multiple and/or hybrid environments.
Adding insult to injury, modern-day redundancy typically means running microservices across two or more failure domains (e.g. availability zones), and traffic that crosses failure domains incurs transit costs.
Fortunately, service meshes provide mechanisms to reduce or eliminate all of these costs: namely, traffic shifting, multi-cluster support, and topology-aware routing.
Traffic shifting, or traffic splitting, typically refers to the ability to flexibly route traffic between two versions of the same workload. This provides significant risk mitigation when deploying microservices that other services depend on. It enables practices like canary testing, where a new version receives only a small fraction of requests until it is proven stable, and blue-green deployments. Orchestrating these practices outside of a service mesh is time-consuming and error-prone, and the effort required grows rapidly with the number of microservices involved. Service meshes make these operations relatively trivial at any scale, providing a declarative language for routing traffic based on workload labels.
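Conceptually, a canary split comes down to weighted selection between versions. The sketch below assumes hypothetical version labels `stable` and `canary` with a 95/5 split; a real mesh expresses this declaratively rather than in application code.

```python
import random


def pick_version(weights, rng=random):
    """Pick a workload version at random, in proportion to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


if __name__ == "__main__":
    split = {"stable": 95, "canary": 5}   # hypothetical labels and weights
    rng = random.Random(0)                # seeded so runs are repeatable
    sample = [pick_version(split, rng) for _ in range(10_000)]
    print("canary share:", sample.count("canary") / len(sample))
```

Promoting the canary is then a matter of adjusting the weights step by step, and rolling back means setting the canary weight to zero.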
Multi-cluster and hybrid cloud support is a feature set that most service meshes have well underway, though most implementations are still being iterated on. While some meshes, like Linkerd and Istio, were designed with specific underlying infrastructure in mind (e.g. Kubernetes), others, like Consul, have been platform agnostic from the start. Regardless of initial constraints, most service meshes attempt to solve the problems of service discovery, load balancing, and failover across clusters and cloud providers. While your mileage may vary with the current state of these features, attempting to design an equivalent solution from scratch is generally a non-starter.
Network transit fees may be one of the stealthiest costs when it comes to modern microservice deployments. East-west traffic grows rapidly with the number of microservices, and it is common for microservices to be spread across availability zones. For example, if load balancing algorithms are unaware of the underlying zones and instances are spread evenly across two or more zones, at least 50% of your requests will end up incurring transit fees. Furthermore, these costs are amplified by the number of microservices involved in processing a request.
Consider a simple application made up of three microservices, each with three instances spread evenly across three zones. Service A receives incoming requests and passes them to service B for processing, which in turn passes them to service C. For every 90 requests received by service A, you will incur egress charges for 120 requests' worth of data. Each instance of service A has a two-thirds chance of routing to an instance of service B in a different zone, and each instance of service B has the same chance of routing to an instance of service C in a different zone. Therefore, at each "hop" of the transaction, 60 of 90 requests will be sent across zone boundaries.
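The arithmetic in this example can be checked with a few lines of code. The sketch assumes one instance per zone for each service and uniformly random, zone-unaware load balancing; the function name `expected_cross_zone` is made up for illustration.

```python
from fractions import Fraction


def expected_cross_zone(requests, zones, hops):
    """Expected number of request hops that cross a zone boundary.

    Assumes instances are spread evenly across zones and each hop picks
    a target instance uniformly at random, ignoring zones.
    """
    p_cross = Fraction(zones - 1, zones)   # 2/3 when there are three zones
    return requests * p_cross * hops


if __name__ == "__main__":
    # A -> B and B -> C are the two hops in the example.
    print(expected_cross_zone(requests=90, zones=3, hops=2))
```

Under the same assumptions, each additional service in the request path adds another 60 cross-zone requests per 90 incoming ones.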
In order to prevent this explosion of costs, some service meshes, like Istio, provide topology-aware load balancing, which makes routing decisions based on available failure domain (e.g. availability zone) labels. The most common configuration is to have all traffic routed to service instances within the same zone when possible. Only if the instances in the current zone become unhealthy will traffic be routed to another zone. This provides a good balance between cost optimization and resiliency that is difficult to achieve on most platforms. (However, as of v1.17, Kubernetes also provides this functionality with Service Topology, though it is currently in alpha.)
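The "same zone when possible, otherwise fail over" rule can be sketched as a simple filter over candidate instances. The dictionary fields `zone` and `healthy` and the function `pick_candidates` are assumptions for this example; real meshes derive this information from platform labels and health checks.

```python
def pick_candidates(instances, local_zone):
    """Prefer healthy instances in the caller's zone; otherwise fail over
    to healthy instances in any other zone."""
    healthy = [i for i in instances if i["healthy"]]
    local = [i for i in healthy if i["zone"] == local_zone]
    return local or healthy


if __name__ == "__main__":
    instances = [
        {"name": "b-1", "zone": "zone-a", "healthy": False},
        {"name": "b-2", "zone": "zone-b", "healthy": True},
        {"name": "b-3", "zone": "zone-c", "healthy": True},
    ]
    # The local instance in zone-a is unhealthy, so traffic fails over.
    print([i["name"] for i in pick_candidates(instances, "zone-a")])
    # A caller in zone-b stays within its own zone.
    print([i["name"] for i in pick_candidates(instances, "zone-b")])
```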
As previously mentioned, microservice architectures come at the cost of complex application topologies. It is common for a single web transaction to result in dozens of calls within the system, as front-end services fan-out to multiple tiers of backends. Diagnosing failures requires a detailed view of exactly which services failed or incurred latency. Fortunately, service mesh proxies are ideally located in the mesh to provide exactly that information. Linkerd, Envoy, and Consul proxies all contain rich telemetry and tracing subsystems that yield detailed insight into network performance, application behaviour, and communication patterns. Armed with this data, it is possible to see exactly which microservices were called in a transaction, which calls resulted in errors, and how long each call took to complete. Not only is this level of visibility essential to microservice architectures, but it is costly to implement yourself.
Whilst there is a lot of hype around service meshes, there are also very good reasons to adopt one. Obviously, there are always trade-offs, but hopefully, this post has clarified reasons to begin evaluating some of the products out there. In a future post, we will provide guidelines to evaluate trade-offs and compare available solutions.