Building planet-scale distributed apps (2/2)
In the previous blog post in this two-part series, we discussed designing and architecting a distributed system. This post is about building and deploying such a system.
The pattern we follow covers specific aspects of implementing the taxi booking service we aspire to build (for details, check the previous post). Often we'll cover topics at useful levels of abstractness to apply this learning to other apps we may choose to build. Those applications need not be planet-scale but the learning is meant to be applied to a broader range of scenarios. The high-level categories of our discussion include:
- Cloud platform architecture - Google Cloud
- Kubernetes and microservices
- Orleans deployment - Actors - Design for concurrency and distribution
- CI/CD - Canary and Rolling upgrades - Spinnaker & Helm
- Liveness and Readiness probes
Cloud platform architecture and GKE - The design proposes to use a managed instance of Kubernetes as a container orchestrator. This may be run in Google Cloud, Azure or AWS. The choice of cloud in this case is Google Cloud and GKE (Google Kubernetes Engine) for Kubernetes. The taxi booking service in our case can be segregated based on region. Individual instances of the entire service may be deployed in appropriate geographical regions with no interconnections. The service would include separate instances of GKE deployed in different geographical regions serving local users. For ease of deployment, this may be packaged into a Google Cloud Deployment Manager of Terraform template once there is clarity on the platform deployment parameters and their optimal values. The design involves placement of Kubernetes clusters in their own dedicated VPCs (virtual private clouds) in appropriate regions; each of these regions are placed with cloud functions that direct the client apps or endpoint devices with region-specific URLs. The performance of DNS is critical but is not addressed in this article.
Kubernetes and microservices - The Kubernetes implementation outlined here demonstrates how microservices architecture, container orchestration, and cloud management platform come together to help build a robust, scalable solution. We'll not cover every service within the app, instead, we will focus on elements that convey new technology capabilities that are important to build a distributed system of this nature.
The figure below represents some of the microservices used within the application. This helps visualize the structure better and helps in subsequent discussions.
From a Visual Studio standpoint it is a single solution with all microservices added as separate projects. Each project/service that is part of the solution will have a corresponding Dockerfile that dictates its composition, runtime and how the app should be packaged.
Now we will discuss some of the services/Kubernetes components that directly aid in scaling the app. In our example from the previous post, the taxis are sending out geo-coordinates on a regular basis to indicate their position. They do that using a message queue in Google Pub/Sub. Every such message contains a Taxi ID, latitude, and longitude details. The Event Consumer microservice reads the Taxi ID and populates the current latitude and longitude of the taxi which sent out the message. The taxis are represented as actors within the SiloHost microservices that implements Orleans. The EventConsumer also sends the latest latitude and longitude of every taxi to Azure Search that allows the mobile app user to locate a geographically close taxi using a geospatial search. The EventProducer service exists only to simulate a massive fleet of moving taxis for load generation. The WebAPI service forms the gateway between the client app and the backend systems that allows to book or search for taxis.
Any internet facing service, in our case the WebAPI service, uses Kubernetes Ingress that provisions an L7 load balancer in the Google Cloud platform. If we chose to, we can make this an L4 load balancer and implement extensible service proxy to translate between REST and gRPC if our implementation dictates so. The SiloHost service is set to scale up to 90% of the capacity of the Kubernetes nodes. Optimizing this will be addresed in the 'stabilize and operate' phase.
Orleans and Actors - The system represents every user/rider, taxi, driver, etc as actors. This ensures there is a single in-memory representation of every entity that is important for the use case. This model ensures there is no need for a database lookup or an equivalent. Any transaction is committed to a single place in memory and thus there is no need to worry about race conditions.
As an example, every taxi is represented by a Taxi ID and in turn as an actor. Every service within the large system looks at one single source of truth about the state of a taxi by the Taxi ID/ actor ID. In the code sample below, the term used for the taxi is ‘car’. Eg: TaxiID = carid
The SiloHost is implemented using as below code:
Each container that runs SiloHost connects to the storage account name and establishes itself as a node in the cluster. The membership table is maintained in an Azure Table. Every container that is of type SiloHost that gets spun up within the cluster will do a lookup in the Azure Table and try to establish itself as a member. These instances should be reachable between one another via the network on the SiloHost port specified in the definition as shown in the code above. If they are unable to reach each other, SiloHosts gets a container ejected from the cluster and that information is recorded in the membership table. Here is an example of such an activity:
You could use other methods like SQL, Consul or custom mechanisms to maintain cluster membership information--here we chose to use Azure Tables.
The cluster automatically devises ways to distribute actors across one another. We can choose to store persistent actor data on a disk by using persistent volumes in Kubernetes and mount such volumes on to the SiloHost containers.
CI/CD - It is essential to be able to use a mature and robust CI/CD pipeline to innovate on a system like this continuously. In our scenario, we use VSTS (Azure DevOps) for continuous integration (CI) and Spinnaker for continuous deployment (CD).
The code maintenance model we follow has two branches - master and dev (you could have more). Master represents a stable version of the product/ service. Dev branch is used for features, or bug fixes or code reviews (there should be separate branches for these, but for simplicity, there's only two). Dev branch is purely a code repo and checking-in the code does not trigger any action. However, merging the changes from dev to master, once tests are completed, triggers a build.
Upon meeting a trigger condition, the build is executed as below:
The build implements the packer pattern where the code is built using larger sdk image and finally the binaries are copied over to a lighter runtime image.
Upon build, the image is pushed to Docker Hub which is the container repo/registry in this case. Note that every component used here is modularised and it could be swapped with almost any other product/service that does the equivalent function.
Once an image is pushed into the docker hub, Spinnaker takes over the deployment from this stage. Spinnaker is deployed using Helm charts and Halyard pre-packaged container. Consider this link if you plan to deploy Spinnaker. The pipeline setup is depicted below. A new image in the Docker Hub will trigger the pipeline.
A Kubernetes 'Deployment' is created to maintain the desired state config and the kind of update we need for this service.
Autoscaling may also be set here using Spinnaker. It uses constructs within Kubernetes and orchestrates the processes with ease.
Liveness and Readiness probes - While Kubernetes acts as a container orchestration engine, there are certain aspects the application should be conscious of. For example a container's running state in Kubernetes does not accurately indicate the application components or microservice is ready to take on traffic. We can aid Kubernetes with this information in the form of readiness and liveness probes. In our scenario of the SiloHost service described earlier, the container might get into a running state and Kubernetes could start receiving traffic to it due to the load balancer above it. But, the application is ready only when it has contacted the membership table and other nodes in the cluster and gets an active state. We should provide this information to Kubernetes before traffic is sent to the container.
In this blog post we have explored the details of deploying and implementing a high-scale, distributed app with Google Cloud, Kubernetes, Microsoft Orleans, .Net Core & Xamarin Forms. We’ll follow this up with the operational and stabilization aspects of this system in the next and concluding part of the blog series.
May 16, 2022
May 10, 2022
Dec 20, 2021