From Monolith to Microservices
Druva is a cloud-based data protection company with 4,000 customers in 20 countries. To support its expanding global customer base, Druva containerized its architecture and adopted low-cost Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances on Amazon Web Services (AWS).
By transitioning from a monolith to a microservices style of architecture, Druva can autoscale its architecture in any country, ensure low-latency data transfers, and remain compliant with regulations on data security and residency. Its teams are now focused on differentiated value-added tasks rather than infrastructure maintenance.
“The speed at which we can launch instances and the flexibility of the overall compute infrastructure on AWS are astounding.”
Kiran Chitnis
Senior Director of Cloud Operations, Druva
Compliance with Regulatory Guidelines
Druva began its containerization journey in 2018 to dynamically scale its services and greatly reduce single points of failure. As its volume of microservices grew, the company began evaluating container orchestration services, both cloud-based and open-source, to ease the administrative burden on its operations team.
Most of Druva’s customers are Fortune 500 enterprises, such as big pharma corporations, U.S. government entities, or public agencies. Druva chose to use Amazon Elastic Container Service (Amazon ECS) for its global availability and compliance with the Federal Risk and Authorization Management Program (FedRAMP). Compliance with FedRAMP—as well as SOC 2, the Health Insurance Portability and Accountability Act (HIPAA), and the Payment Card Industry Data Security Standard (PCI DSS)—is a vital first step in obtaining and retaining high-profile customers.
“Our customers recognize AWS compliance and security certifications, which makes our lives much easier because we don’t need to pursue time-consuming, costly third-party assessments,” says Kiran Chitnis, senior director of cloud operations at Druva.
Global Footprint Ensures Data Residency and Low Latency
Containers have proven an efficient means for scaling the enterprise. As of 2021, Druva is active in 18 AWS Regions including AWS GovCloud (US), up from 11 regions when it started containerizing in 2018. Expanding its footprint and activating new AWS Regions has allowed Druva to onboard new customers in countries such as South Africa and Sweden while satisfying data residency and latency requirements. Low latency is critical for Druva’s customers when backing up large volumes of high-value data.
Druva’s customers can also maintain a lower recovery time objective (RTO) and recovery point objective (RPO) thanks to the scalability offered with containers. Previously, a customer’s backup job rerouted to an AWS Region at maximum server capacity would go into a queue, driving up the RPO until more resources became available. Now, with containerization, Druva seldom faces a scenario where a customer’s backup job has to wait because its infrastructure is saturated. “With automation now in place, we have seamless elasticity to grow and the orchestration capabilities to dynamically spin up resources in real time for varying workloads,” says Chitnis.
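The queueing effect described above can be illustrated with a toy model. This is a minimal sketch, not Druva's scheduler: the function name, the tick-based time model, and the sample numbers are all assumptions made for illustration. It compares a fixed-capacity fleet, where excess backup jobs wait in a queue, with an elastic fleet that grows to meet pending work.

```python
def backlog_over_time(arrivals, capacity_per_tick, elastic=False):
    """Return the number of queued backup jobs after each time tick.

    arrivals: jobs arriving in each tick (hypothetical sample data).
    capacity_per_tick: jobs the baseline fleet can process per tick.
    elastic: if True, capacity grows to cover all pending work.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        pending = backlog + arriving
        # A static fleet is capped at its fixed capacity; an elastic
        # fleet scales up to whatever is pending in this tick.
        capacity = max(capacity_per_tick, pending) if elastic else capacity_per_tick
        backlog = pending - min(pending, capacity)
        history.append(backlog)
    return history


arrivals = [5, 12, 12, 5]  # illustrative workload, not real data
static_backlog = backlog_over_time(arrivals, capacity_per_tick=10)
elastic_backlog = backlog_over_time(arrivals, capacity_per_tick=10, elastic=True)
```

In the static case, jobs queue up whenever arrivals exceed capacity, and every queued job pushes its recovery point further back. In the elastic case the backlog stays at zero.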
Compute Cost Reduction with 99.5% Uptime
As Druva expanded, it looked at ways to control operations costs, particularly in relation to labor. Before implementing containers, Druva was increasing its operations team headcount by 20–25 percent each year. Its entire infrastructure was static, so for each change in customer behavior, engineers had to manually adjust resources and monitor fleet performance.
In the first three years after containerizing, Druva nearly quadrupled the amount of data under management from 45 PB to 175 PB. But by leveraging out-of-the-box functionality of Amazon ECS, the business only increased its operations headcount by two during this period.
Costs have also decreased from using Amazon EC2 Spot Instances in conjunction with Amazon ECS. Before 2020, the company relied on a mix of up to 80 percent Amazon EC2 Reserved Instances, 10 percent On-Demand Instances, and very few Spot Instances.
The balance has now shifted so that Spot Instances handle most workloads, with Reserved Instances deployed when Spot Instances become unavailable. The company’s On-Demand Amazon EC2 instance usage has dropped to zero. As of 2021, Druva projects savings of 20–25 percent on its monthly Amazon EC2 compute bill. Critically, the shift hasn’t affected system uptime, which has held at 99.5 percent availability for over 10 years on AWS.
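The Spot-first policy with a Reserved fallback can be sketched in a few lines. This is an illustrative assumption about how such a policy could work, not Druva's actual orchestration tool: the function name, the capacity units, and the "unmet" bucket are hypothetical, and no AWS API is called.

```python
def allocate_capacity(demand, spot_available, reserved_available):
    """Split a capacity demand across purchase options, Spot first.

    All arguments are abstract capacity units (hypothetical).
    Returns how much of the demand each pool covers, plus any
    remainder that neither pool could satisfy.
    """
    # Prefer Spot capacity for as much of the demand as is available.
    spot = min(demand, spot_available)
    # Fall back to Reserved capacity only for what Spot could not cover.
    reserved = min(demand - spot, reserved_available)
    unmet = demand - spot - reserved
    return {"spot": spot, "reserved": reserved, "unmet": unmet}


# Spot covers most of the demand; Reserved absorbs the rest.
normal = allocate_capacity(demand=100, spot_available=80, reserved_available=50)
# Spot is unavailable, so Reserved capacity takes over.
spot_outage = allocate_capacity(demand=100, spot_available=0, reserved_available=50)
```

A policy like this never reaches for On-Demand capacity, which is consistent with the On-Demand usage of zero described above.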
Speedier Deployment with Native Integrations
Druva developed an API-based tool to automate the orchestration of Spot Instances, and it has benefited from native integrations among AWS services, including Amazon ECS. “The speed at which we can launch instances and the flexibility of the overall compute infrastructure on AWS are astounding,” Chitnis shares. “AWS manages the control plane for Amazon ECS and we only manage the data workloads and configuration. This gave us substantial leverage to speed up our deployment time.”
Druva recently onboarded its largest customer to date, the Port of New Orleans, with 20,000 account users and about 4 PB of data to secure. Accelerated software delivery on AWS enabled the company to onboard this customer in just three weeks with no manual intervention.
Druva’s customers are also saving time with its modern microservices architecture. In the case of the Port of New Orleans, backups that used to take a day now take just 30 minutes or less. “The move to Druva gave us the opportunity to consolidate data that was previously dispersed throughout the enterprise and off-site. We have the ability to restore and backup data within seconds and continue to meet data requirements in a timely manner,” says David Cordell, chief information officer at the Port of New Orleans.
Shifting Focus to Cloud Economics and New Technologies
The Druva platform’s high degree of automation allows the company’s operations team to build their skill sets in cloud economics instead of simply performing maintenance tasks. They can also more effectively identify and remove bottlenecks related to capacity constraints to further optimize their cloud architecture.
Druva is also investing more in R&D with the cost savings it has achieved. It’s giving engineers the freedom to experiment with new technology, such as using telemetry data to develop and train machine learning (ML) models with Amazon SageMaker, to better understand how customers use Druva. Engineering teams are starting to build business intelligence dashboards using tools including Amazon Athena and Amazon QuickSight for ML-powered insights. Chitnis concludes, “AWS makes it straightforward for us to take on any new projects and scale them for our global customer base.”