This post was co-authored by Anubhav Mehendru, Group Engineering Manager, Kaizala.

Mobile-only workers rely on Microsoft Kaizala, a simple and secure work management and mobile messaging app, to get their work done. Since COVID-19 has forced many people around the world to work from home, Kaizala usage has surged to nearly 3x its pre-COVID-19 level. While this is a good opportunity for the product to grow, it has increased pressure on the engineering team to ensure that the service scales with the increased usage while maintaining the customer-promised SLA of 99.99 percent.

Today, we're sharing some of our learnings about managing and scaling an enterprise-grade secure productivity app and the backend service behind it.

Foundation of Kaizala

Kaizala is a productivity tool primarily targeted at mobile-only users and is based on a microservice architecture with Microsoft Azure as the core cloud platform. Our workload runs on Azure Cloud Services, with Azure SQL DB and Azure Blob Storage used for primary storage. We use Azure Cache for Redis to handle caching, and Azure Service Bus and Azure Notification Hub manage asynchronous processing of events. Azure Active Directory (Azure AD) is used for user authentication. We use Azure Data Explorer and Azure Monitoring for data analytics. Azure Pipelines is used for automated safe deployments, which lets us deploy updates rapidly, multiple times a week, with high confidence.

We follow a safe deployment process that ensures minimal customer impact, with stage-wise release of new features and optimizations and full control over exposure and rollback.

In addition, we use a centralized configuration management system where all our configuration can be managed, such as exposure of a new feature to a set of users, groups, or tenants. We have fine-grained control over message processing rate, receipt processing, user classification, priority processing, slowing down non-core functionality, and so on. This allows us to rapidly prototype new features and optimizations over a set of users.
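The exposure control described above can be pictured with a small sketch. This is only an illustration under stated assumptions: `FeatureRule`, `ConfigStore`, and every field name are hypothetical, and the real system is a centralized service rather than an in-process dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureRule:
    """Exposure rule for one feature (hypothetical field names)."""
    enabled: bool = False
    tenants: set = field(default_factory=set)
    users: set = field(default_factory=set)


class ConfigStore:
    """In-process stand-in for a centralized configuration service."""

    def __init__(self):
        self._rules = {}

    def set_rule(self, feature, rule):
        self._rules[feature] = rule

    def is_exposed(self, feature, user_id, tenant_id):
        # Unknown or disabled features are never exposed.
        rule = self._rules.get(feature)
        if rule is None or not rule.enabled:
            return False
        # Exposed if the user or the user's tenant is on the allow list.
        return user_id in rule.users or tenant_id in rule.tenants
```

Flipping `enabled` back to `False` acts as an immediate rollback for everyone, which is the property that makes staged exposure low risk.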

Key resiliency strategies

We employ the following key resiliency strategies:

API rate limit

To protect our service from misuse, we need to keep the incoming calls from multiple clients within a safe limit. We incorporated a rate limiter based entirely on in-memory caching that does the job with negligible latency impact on customer operations.
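The post does not say which algorithm the rate limiter uses. As one possible shape, here is a minimal per-client token bucket kept entirely in memory. The class names, the `rate` and `capacity` parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Add tokens for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class RateLimiter:
    """Keeps one in-memory bucket per client id."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate, self._capacity, self._clock = rate, capacity, clock
        self._buckets = {}

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

Because each check is a dictionary lookup and a little arithmetic, the added latency per call stays negligible, at the cost of limits being enforced per node rather than globally.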

Optimized caching

To provide an optimal user experience, we created a generic in-memory caching infrastructure in which multiple compute nodes are able to quickly sync state changes using Azure Redis Pub/Sub. This avoided a significant number of external API calls, which effectively reduced our SQL load.
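The idea of node-local caches that stay in sync through published change notices can be sketched as follows. `InProcessBus` is a hypothetical stand-in for a Redis Pub/Sub channel so that the example runs without a Redis server, and `NodeCache` and `loader` are likewise illustrative names.

```python
class InProcessBus:
    """Stand-in for a pub/sub channel; delivers each key to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        for callback in list(self._subscribers):
            callback(key)


class NodeCache:
    """Per-node in-memory cache that drops entries when a change is announced."""

    def __init__(self, bus, loader):
        self._bus = bus
        self._loader = loader  # called on a cache miss, e.g. a SQL read
        self._data = {}
        bus.subscribe(self._on_change)

    def get(self, key):
        if key not in self._data:
            self._data[key] = self._loader(key)
        return self._data[key]

    def notify_changed(self, key):
        # Call after writing to the backing store so all nodes drop stale copies.
        self._bus.publish(key)

    def _on_change(self, key):
        self._data.pop(key, None)
```

Invalidating on change, rather than pushing the new value, keeps messages small and lets each node reload only the entries it actually serves.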

Prioritize critical operations

When the service is overloaded due to heavy customer traffic, we prioritize critical customer operations such as messaging over non-core operations such as receipts.
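One way to picture this prioritization is a priority queue that always serves critical work first and refuses new non-core work once a backlog threshold is reached. The sketch below is an assumption about the general technique, not the service's actual scheduler; `PriorityDispatcher` and `max_pending` are hypothetical names.

```python
import heapq
import itertools

CRITICAL, NON_CORE = 0, 1  # lower number is served first


class PriorityDispatcher:
    def __init__(self, max_pending):
        self._heap = []
        self._seq = itertools.count()  # keeps FIFO order within a priority
        self._max_pending = max_pending

    def submit(self, priority, op):
        # Under overload, shed non-core work but always accept critical work.
        if priority != CRITICAL and len(self._heap) >= self._max_pending:
            return False
        heapq.heappush(self._heap, (priority, next(self._seq), op))
        return True

    def next_op(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A rejected receipt can be retried later, whereas a delayed message is directly visible to the customer, which is why only the non-core class is shed.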

Isolation of core components

Core system components that support messaging are now completely isolated from non-core parts so that any overload does not impact core messaging operations. The isolation is done at every resource level, with separate compute nodes, a separate service bus for event processing, and completely separate storage for non-core operations.

Reduction in intra-node communication

We made multiple improvements to our message processing system that significantly reduced the instances of intra-node communication, which had created a heavy intra-node dependency and slowed down the entire message processing pipeline.

Controlled service rollout

We made multiple changes to our rollout process to ensure controlled rollout of new features and optimizations and to minimize any negative customer impact. Deployments moved to early-morning slots, when customer load is minimal, to prevent any downtime.

Monitoring and telemetry

We set up specific monitoring dashboards that give a quick overview of service health by tracking important parameters, such as CPU consumption, thread count, garbage collection (GC) rate, rate of incoming messages, unprocessed messages, lock contention rate, and connected clients.

GC rate

We have fine-tuned the options that control the rate of Gen2 GC in a cloud service according to the needs of the web and worker instances, to ensure minimal latency impact from GC during customer operations.

Node partitioning

Users need to be partitioned across multiple nodes to distribute ownership responsibility using a consistent hashing mechanism. This master ownership helps ensure that only the required users' information is stored in the in-memory cache on a particular node.
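A minimal consistent hash ring illustrates how user ownership can be spread across nodes. The hash function, the virtual node count, and the names `HashRing` and `owner` are assumptions chosen for this sketch, since the post does not describe the actual implementation.

```python
import bisect
import hashlib


def _hash(value):
    # Stable 64-bit hash so every node computes the same ring positions.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, user_id):
        # The owner is the first ring point clockwise from the user's hash.
        idx = bisect.bisect_right(self._points, _hash(user_id)) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added, only the users whose nearest ring point now belongs to the new node change owner, so most in-memory caches stay warm during scale-out.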

Active and passive users

In large group messaging operations, there are always users who are actively using the app while many others are not active. Our idea is to prioritize message delivery for active users so that even the last bucket of active users receives the message quickly.
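As a rough sketch of this idea, the function below orders a group's recipients so that recently active users come first. The five-minute `active_window`, the `last_active` mapping, and the function name are hypothetical choices for illustration.

```python
def delivery_order(member_ids, last_active, now, active_window=300.0):
    """Return members with recently active users first.

    `last_active` maps user id to the timestamp (in seconds) of the last
    observed activity; users missing from the mapping count as passive.
    """
    active, passive = [], []
    for uid in member_ids:
        seen = last_active.get(uid)
        if seen is not None and now - seen <= active_window:
            active.append(uid)
        else:
            passive.append(uid)
    # Most recently active users are the most likely to be waiting on the message.
    active.sort(key=lambda uid: last_active[uid], reverse=True)
    return active + passive
```

For example, with `now=1000.0` and users `b` and `c` last seen at 990 and 950, both land ahead of a user last seen at 100 or never seen at all.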

Serialization optimization

Default JSON serialization is expensive when input/output operations are very frequent, and it burns precious CPU cycles. ProtoBuf offers a fast binary serialization protocol, which we leveraged to optimize operations on large data structures.
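To show why a binary encoding is cheaper, the sketch below compares a JSON payload with a fixed-layout binary one. It uses Python's standard `struct` module as a stand-in and is not ProtoBuf; the receipt fields and function names are hypothetical.

```python
import json
import struct

# Fixed little-endian layout: message_id (u64), user_id (u64), status (u8).
RECEIPT = struct.Struct("<QQB")


def encode_json(message_id, user_id, status):
    record = {"message_id": message_id, "user_id": user_id, "status": status}
    return json.dumps(record).encode("utf-8")


def encode_binary(message_id, user_id, status):
    return RECEIPT.pack(message_id, user_id, status)


def decode_binary(payload):
    # Returns a (message_id, user_id, status) tuple.
    return RECEIPT.unpack(payload)
```

The binary form is 17 bytes regardless of the values, and it avoids repeating field names and parsing text on every operation. ProtoBuf adds schema evolution and variable-length encoding on top of the same basic idea.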

Scaling compute

We re-evaluated our compute usage across our multiple internal test and scale environments and judiciously reduced the compute node SKU to match actual needs and optimize cost of goods sold (COGS). While most of our traffic in an Azure region occurs during the daytime, load is minimal at night, which we leverage for heavy tasks such as redundant data cleanup, cache cleanups, GC, database re-indexing, and compliance jobs.

Scaling storage

With growing scale, the load of receipts became huge on the backend service and was consuming a lot of storage. While critical operations require highly consistent data, the requirement is lower for non-critical operations. We moved receipts to highly available NoSQL storage, which costs a tenth of the SQL storage.

Queries for background operations were spread out lazily to reduce the overall peak load on SQL storage. Certain non-critical operations were moved from a strongly consistent to an eventually consistent model to flatten the peak storage load, thus creating more capacity for more users.
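Spreading background queries can be as simple as assigning each job a start offset inside a time window instead of launching them all at once. The even-spacing policy and the name `spread_schedule` below are assumptions for illustration only, since the post does not describe the actual scheduling.

```python
def spread_schedule(job_ids, window_seconds):
    """Assign each background job an evenly spaced start offset in the window."""
    count = len(job_ids)
    if count == 0:
        return {}
    step = window_seconds / count
    # Job i starts i * step seconds into the window, so no two jobs start together.
    return {job_id: index * step for index, job_id in enumerate(job_ids)}
```

With four jobs and a 60-second window, the jobs start at 0, 15, 30, and 45 seconds, so the database sees four small bumps in load rather than one spike.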

Our future plans

As the COVID-19 situation remains grave, we expect an accelerated pace of Kaizala adoption from multiple customers. To keep up with the increase in messaging load and high customer usage, we're working on new enhancements and optimizations to ensure that we stay ahead of the curve, including:

  • Developing alternate messaging flows where users actively using the app can directly pull group messages even when the backend system is overloaded. Message delivery is prioritized for active users over passive users.
  • Working aggressively on distributed in-memory caching of data entities to enable fast user response, and on alternate designs to keep the cache in sync while minimizing stale data.
  • Moving from the current virtual machine (VM)-based model to a container-based deployment model to bring more agility and reduce operational cost.
  • Exploring alternate storage mechanisms that scale well with massive write operations for large consumer groups and support batched data flush in a single connection.
  • Actively exploring ideas around an active-active service configuration to minimize downtime due to datacenter outages and to lower Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Exploring ideas around moving some of the non-core functionality to passive scale units to utilize the standby compute and storage resources there.
  • Evaluating the dynamic scaling abilities of Azure Cloud Services, where we can automatically reduce the number of compute nodes during nighttime hours when our user load is less than a fifth of the peak load.
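The last item in the list, scaling the node count with load, can be sketched as a simple sizing rule. This is not an Azure API: `target_nodes`, the 20 percent headroom, and the node limits are hypothetical values chosen for illustration.

```python
import math


def target_nodes(current_load, per_node_capacity, min_nodes=2, max_nodes=20, headroom=0.2):
    """Return how many compute nodes to run for the observed load.

    `current_load` and `per_node_capacity` use the same unit, for example
    requests per second. `headroom` reserves spare capacity for bursts.
    """
    needed = math.ceil(current_load * (1 + headroom) / per_node_capacity)
    # Never go below the floor needed for availability or above the budget cap.
    return max(min_nodes, min(max_nodes, needed))
```

With a daytime load of 9,500 requests per second and 1,000 per node, the rule asks for 12 nodes. At a night load of 1,500 it falls back to the two-node floor.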
