How We Scaled Salesforce Edge up to 5 Million Orgs

What was the business trigger for your project?

Since 2018, Salesforce Edge has provided internal content delivery network (CDN) services, onboarding approximately 130,000 web domain names, including some of the largest internal web properties.

Over the years, we’ve worked on improving the stability of the service, but it has struggled to keep up with our rapid business growth. We realized that our control plane architecture was nearing its scaling limits in terms of memory utilization and visible latency. This realization gave us the opportunity to design a new architecture and to reflect on the lessons learned from the initial project: what worked well, what was overlooked, what is no longer needed, and what has not scaled effectively.

Growth in number of onboarded customers over time.

Have you considered rewriting your software from scratch?

Yes. A rewrite from scratch can be justified for various reasons, including a significant change in underlying technologies or the belief that starting fresh would be faster than dealing with existing technical debt. In the case of our Edge software, we approached the re-architecture of the control plane and data plane components differently.

For the control plane server, we decided to switch from Python to Java, from etcd to Aurora, from Docker Compose to Kubernetes, and from private cloud to public cloud. These drastic changes warranted starting from scratch.

For the data plane, which handles internet traffic, we took a different approach. We valued Trust as our top priority and were concerned about losing the interoperability, compliance, and security hardening that we had gained over years of operation. Therefore, we opted for an intensive refactoring exercise on the existing code.

By choosing the refactoring approach for the data plane, we were able to leverage over 300 functional tests running in our existing CI, reducing the risk of introducing functional regressions. Our rollout strategy involved feature-flagging the new refactored code, allowing for a slow and staggered rollout of each feature. This approach also provided the flexibility to quickly roll back to the legacy code if needed.
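The feature-flagged rollout described above can be sketched roughly as follows. This is a minimal illustration, not the Edge implementation: the flag table ROLLOUT_PERCENT, the feature name, and the hash-based bucketing of hosts are all assumptions made for the example.

```python
import hashlib

# Hypothetical flag table: feature name -> percentage of hosts that run
# the refactored code path. Setting a value to 0 rolls the feature back.
ROLLOUT_PERCENT = {"streaming_config_loader": 25}


def bucket(host: str, feature: str) -> int:
    """Map a (host, feature) pair to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{feature}:{host}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_refactored(host: str, feature: str) -> bool:
    """Return True if this host should run the refactored code for the feature."""
    return bucket(host, feature) < ROLLOUT_PERCENT.get(feature, 0)


def load_config(host: str) -> str:
    """Dispatch to the refactored or the legacy path (both are stubs here)."""
    if use_refactored(host, "streaming_config_loader"):
        return "refactored"
    return "legacy"
```

Because each host always lands in the same bucket, raising the percentage only adds hosts to the refactored path, and dropping it to 0 sends every host back to the legacy code.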

How did you kick off the refactoring project?

We followed a systematic approach. First, we identified the measurable metrics we wanted to improve and agreed on their target values. Then, we collected profiling data on the current code to pinpoint the areas that needed improvement. Next, we generated ideas to address the identified pain points, prototyped them, and measured their impact on each metric.

Additionally, we estimated the effort required to implement each idea and used these figures to calculate a “cost-effectiveness score” for each one. By prioritizing the highest-scoring ideas, we focused on the low-hanging fruit that would deliver the largest impact. This gave the refactoring project a structured, data-driven kickoff.
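The exact formula behind the score is not given here. One plausible reading, assumed for this sketch, is measured impact divided by estimated effort. The idea names and numbers below are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: float       # improvement measured on the prototype (arbitrary units)
    effort_days: float  # estimated implementation effort

    @property
    def score(self) -> float:
        """Assumed cost-effectiveness score: impact per day of effort."""
        return self.impact / self.effort_days


def prioritize(ideas):
    """Return the ideas ordered from highest to lowest score."""
    return sorted(ideas, key=lambda idea: idea.score, reverse=True)


# Illustrative values only; not data from the project.
ideas = [
    Idea("idea A", impact=8.0, effort_days=4.0),
    Idea("idea B", impact=9.0, effort_days=30.0),
    Idea("idea C", impact=3.0, effort_days=1.0),
]
ranked = [idea.name for idea in prioritize(ideas)]
```

With these sample values, the cheap idea C ranks first even though idea B has the largest raw impact, which is the effect the score is meant to capture.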

Feature prioritization matrix based on cost-effectiveness.

What was the biggest pain point you needed to address?

Our biggest pain point was the limit on configuration size, and therefore on the number of onboarded customers: as the configuration grew, it frequently triggered the kernel’s Out-Of-Memory Killer. It became clear that keeping all the configuration in memory was not a viable solution.

To address this, we decided to change our configuration processing model. We adopted a streaming approach, in which we load one customer’s configuration at a time, process it, and then discard it. This shift allowed us to optimize our memory footprint and scale indefinitely, overcoming the limitations imposed by the configuration size.
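As a rough sketch of the streaming model, the generator below yields one customer’s configuration at a time, so only the current record and the compact derived state are held in memory. The line-delimited JSON input and the field names "domains" and "origin" are assumptions for illustration, not the Edge configuration format.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_customer_configs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed customer configuration per non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def build_routing_table(lines: Iterable[str]) -> Dict[str, str]:
    """Process configurations one by one, keeping only the derived routes."""
    routes: Dict[str, str] = {}
    for config in iter_customer_configs(lines):
        for domain in config["domains"]:
            routes[domain] = config["origin"]
        # The parsed config is dropped when the loop moves to the next record.
    return routes
```

Passing an open file object as lines would read the input lazily, so the full raw configuration never has to sit in memory at once.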

Memory footprint growth before and after the refactoring.

Have you also considered scaling vertically?

Yes. To maximize core utilization, we implemented multi-threading for our configuration processing. Instead of using locks to serialize access to shared resources, which can be inefficient, we adopted a map-reduce-like approach. Each worker thread operates on its own local resource, which the main thread later combines into the global resource. This approach minimizes contention and achieves a near-linear speedup: configuration processing time falls almost in proportion to the number of worker cores.
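The lock-free, map-reduce-like pattern can be sketched as follows. The sharding scheme, the record fields, and the function names are hypothetical. In standard CPython builds the global interpreter lock limits the speedup for CPU-bound work, so this shows the shape of the pattern rather than the performance result.

```python
from concurrent.futures import ThreadPoolExecutor


def process_shard(configs):
    """Map step: build a private table from one shard, with no shared state."""
    local = {}
    for config in configs:
        for domain in config["domains"]:
            local[domain] = config["origin"]
    return local


def load_parallel(configs, workers=4):
    """Split the list of configs across workers, then merge on the main thread."""
    shards = [configs[i::workers] for i in range(workers)]
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for local in pool.map(process_shard, shards):
            merged.update(local)  # reduce step: only the main thread writes here
    return merged
```

Since each worker writes only to its own dictionary and the merge happens in a single thread, no lock is needed anywhere in the sketch.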

Time to load the configuration by the number of config processing workers.

Can you share one learning from the original Edge architecture?

One key learning from the original Edge architecture was the need to prioritize scalability as a nonfunctional requirement. Initially, the focus was on developing new features to cater to a wide range of customer use cases. However, introducing new features without considering their impact on scalability could lead to a complex and inefficient architecture.

We realized that even a feature that was not heavily used could still hurt performance because of the architectural changes it introduced. This raised the question of whether to clean up these features or keep them as part of the refactoring process.

To address this, we decided to plan the new architecture as if these features did not exist. Our highest strategic priority became ensuring that the predominant features, which required scalability, were well supported. By taking this approach, we aimed to create an architecture that prioritized scalability and avoided unnecessary complexity caused by less-utilized features.

How drastic was the refactoring impact?

After several months of iterating over the new code, we successfully addressed all the initial pain points, optimized our performance metrics, and resumed the massive migration of domains into Edge. This was achieved through continuous integration, with bi-weekly releases, health-mediated rollout, and no downtime. Notably, we were able to onboard 5 million customers in our test lab and increase our production customer base from 130,000 to 2 million.