Service Architecture at SoundCloud — Part 3: Domain Gateways

This article is the last part in a series of posts that aims to shed some light on how service architecture has evolved at SoundCloud over the past few years, and on how we’re attempting to solve some of the most difficult challenges we encountered along the way.
In the second part of this series, we discussed how we evolved the use of the BFF pattern at SoundCloud by moving existing duplicated logic to a more centralized and elegant solution called Value-Added Services (VAS). We covered how we benefit from this architecture pattern, as we have all the authorization, content policies, and fetching logic of tracks and playlists in a single service.
In this blog post, we’ll cover how we evolved the concept of Value-Added Services to Domain Gateways, which allow us to extend those services to have read and write operations in a single and centralized service for each business domain.
Growing Aggregates
As we described in the previous installment, the core responsibility of a VAS is to serve our core aggregates, such as Track and Playlist. To do this, a VAS fetches state for associated entities and value objects from the corresponding Foundation[1] services, and then it applies business authorization rules. For example, the Tracks VAS will filter out all tracks that are geoblocked in certain territories.

Roughly speaking, one can imagine a VAS as a big fanout together with authorization logic. One of the first challenges that we faced while extending our Value-Added Services was the size of this fanout. As we added features to the platform, our aggregates grew, and with them the number of network calls.
On the other side, our BFFs often have different needs dictated by their applications. For example, one track feature might only be available on mobile, which makes fetching the entire track aggregate from the Web API unnecessary. Moreover, even within a single BFF, we sometimes support multiple aggregate representations that can be built without fetching all dependencies.
How can we provide centralized endpoints for serving aggregates that can be customized to the specific needs of BFFs? Luckily, this problem has a pretty straightforward solution: partial responses. This pattern allows API consumers to tell the producer which part of the response they’re going to consume by specifying a FieldMask in the request. Field masks support protobuf and JSON representations that make them essentially protocol agnostic.[2]
In our particular case, we use Twinagle — a protobuf IDL based on the Twirp protocol. Protobuf definitions provide type-safe construction and validation out of the box via FieldMaskUtils that we’ve ported to the ScalaPB library.
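To make the idea of a partial response concrete, here is a minimal sketch in plain Python. It is not Twinagle or the ScalaPB FieldMaskUtils port; the function name `apply_field_mask`, the dictionary-based track, and its field names are assumptions made only for illustration. A mask is modeled as a list of dot-separated paths, and only the fields named by those paths are kept.

```python
from typing import Any, Dict, Iterable


def apply_field_mask(message: Dict[str, Any], paths: Iterable[str]) -> Dict[str, Any]:
    """Return a new dict holding only the fields of `message` named in `paths`.

    Paths are dot-separated (e.g. "stats.likes"). Paths that do not exist in
    the message are ignored.
    """
    result: Dict[str, Any] = {}
    for path in paths:
        parts = path.split(".")
        # Resolve the value first, so an unknown path leaves no trace in the result.
        value: Any = message
        for part in parts:
            if not isinstance(value, dict) or part not in value:
                break
            value = value[part]
        else:
            # Every segment resolved: rebuild the nesting in the result.
            target = result
            for part in parts[:-1]:
                target = target.setdefault(part, {})
            target[parts[-1]] = value
    return result


# Hypothetical track aggregate; the field names are illustrative only.
track = {
    "urn": "soundcloud:tracks:1",
    "title": "Song",
    "stats": {"likes": 3, "reposts": 1},
    "downloadable": True,
}

partial = apply_field_mask(track, ["title", "stats.likes"])
```

A BFF that only renders the title and the like count would send a mask with those two paths and receive nothing else.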

One disadvantage of field masks for partial responses is a tighter coupling between microservice topology and aggregate schemas (IDLs). Field masks can be defined according to service dependencies and network calls to reduce the number of requests necessary to produce a BFF representation. At SoundCloud, our focus is more on the reduction of complexity in the edge layer (specifically in BFFs). While field masks can optimize network calls as well, it isn’t necessary to have a 1:1 mapping between field masks and network calls.
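The relationship between mask paths and network calls can be sketched as a simple lookup. The upstream names and the field-to-upstream table below are hypothetical, not SoundCloud's actual topology. The point is that several fields may be served by the same call, so the mapping is many-to-one rather than 1:1.

```python
from typing import Dict, Iterable, Set

# Hypothetical table: which upstream call supplies each top-level aggregate field.
FIELD_TO_UPSTREAM: Dict[str, str] = {
    "title": "tracks-metadata",
    "artwork": "tracks-metadata",
    "stats": "track-stats",
    "policy": "content-policies",
}


def plan_upstream_calls(paths: Iterable[str]) -> Set[str]:
    """Return the set of upstream calls needed to satisfy the requested paths."""
    calls: Set[str] = set()
    for path in paths:
        top_level = path.split(".", 1)[0]
        upstream = FIELD_TO_UPSTREAM.get(top_level)
        if upstream is not None:  # unknown fields trigger no call
            calls.add(upstream)
    return calls


# Two fields, one call: both are served by the same (hypothetical) upstream.
metadata_only = plan_upstream_calls(["title", "artwork"])
# A nested path still resolves through its top-level field.
with_stats = plan_upstream_calls(["title", "stats.likes"])
```

With a table like this, the schema can expose fields that make sense to BFFs while the fanout is pruned as a side effect.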
Commands
While we were extending the scope of the VAS to serve aggregates of our entities, we realized that the same service could also own the actions that mutate the state of a core entity (i.e. write operations) and that require authorization logic. To centralize more of the logic around core entities, we extended our VAS with commands. Some examples of these command operations in the Tracks domain include “download a track,” “like a track,” and “repost a track.”
Because a command lives in the VAS, it reduces complex logic in the BFFs (where such logic was previously duplicated) and makes access checks more reliable for operations that require access to a given track.
We can illustrate the case of liking a track in the Tracks VAS:

As we can see in the graph, BFFs send a request to the Tracks service to perform a track operation. “Like” operations are normally registered by the Likes service. This service isn’t aware of track authorizations; it only creates and deletes links between tracks or playlists and users. That’s why we first need to check whether the user who wants to like a track has access to it. The best place to centralize that check is the Tracks VAS.
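The flow described above can be sketched in a few lines of Python. All class and method names here are hypothetical, and the authorization rule is reduced to a single geoblocking check for brevity; a real VAS would apply the full set of content policies.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


class TrackNotAccessible(Exception):
    """Raised when the user may not access the track."""


@dataclass
class LikesService:
    """Stand-in for the Likes service: stores links, knows nothing about authorization."""
    links: Set[Tuple[str, str]] = field(default_factory=set)

    def create_like(self, user: str, track: str) -> None:
        self.links.add((user, track))


@dataclass
class TracksGateway:
    """Stand-in for the Tracks VAS: checks access, then delegates the write."""
    likes: LikesService
    # Hypothetical policy data: track -> territories where it is blocked.
    blocked_territories: Dict[str, Set[str]] = field(default_factory=dict)

    def can_access(self, track: str, territory: str) -> bool:
        return territory not in self.blocked_territories.get(track, set())

    def like_track(self, user: str, track: str, territory: str) -> None:
        # Authorization happens here, before the Likes service is called.
        if not self.can_access(track, territory):
            raise TrackNotAccessible(track)
        self.likes.create_like(user, track)


likes = LikesService()
gateway = TracksGateway(likes, blocked_territories={"track-2": {"DE"}})
gateway.like_track("user-1", "track-1", territory="DE")
```

The Likes stand-in stays free of access rules, so every BFF that goes through the gateway gets the same check.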
Separation of Queries and Commands
To summarize, the VAS interface consists of two parts: an endpoint to serve its aggregate according to BFF needs, which we call queries; and endpoints that expose core entity operations, which we call commands. This separation is the core idea behind the CQRS pattern and provides some practical benefits, as it’s possible to provide separate upstream services or stores for read and write operations. For example, the foundational service that provides operations to add or remove a follower relationship between two users (a write) is different from the service that serves follower counts (a read). This relationship between foundational services is now abstracted away from users of the Tracks VAS, which improves consistency and reduces complexity in the BFFs.
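As a rough sketch of this split, the following Python uses the follower example from the paragraph above. The class names are hypothetical, and how writes eventually reach the read side is deliberately left out. The gateway exposes one command and one query, each routed to a different upstream.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class FollowsService:
    """Hypothetical write-side upstream: owns follower relationships."""
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def add_follow(self, follower: str, followee: str) -> None:
        self.edges.add((follower, followee))


@dataclass
class FollowerCountsService:
    """Hypothetical read-side upstream: serves precomputed follower counts."""
    counts: Dict[str, int] = field(default_factory=dict)

    def follower_count(self, user: str) -> int:
        return self.counts.get(user, 0)


@dataclass
class UsersGateway:
    """Callers see one interface; reads and writes go to different upstreams."""
    writes: FollowsService
    reads: FollowerCountsService

    def follow(self, follower: str, followee: str) -> None:
        """Command: mutates state through the write-side service."""
        self.writes.add_follow(follower, followee)

    def follower_count(self, user: str) -> int:
        """Query: served by the read-side service."""
        return self.reads.follower_count(user)


follows = FollowsService()
counts = FollowerCountsService(counts={"user-b": 41})
users = UsersGateway(writes=follows, reads=counts)
users.follow("user-a", "user-b")
```

In this sketch the count is not updated by the command. Keeping the two sides in sync is a separate concern that the gateway's callers never see.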
Beyond Core Entities: Domain Gateways
As our VAS grew in scope, we identified that a single core entity (like a track) can be used in different domains, for different purposes, and with different access patterns and authorization requirements. For example, SoundCloud not only provides a consumer application for a music catalog; it also provides tools for creators to upload and distribute their music. Consumer and Creator are different domains, owned by different teams, and both reference and use tracks for different purposes within their specific domains.
A possible approach in this case is to implement everything that can possibly be done with a track (in all the different domains) — including related queries and commands — in a single VAS. This can work well for some time, but eventually there’s a risk of creating large amounts of coupling and complexity, causing friction and decreasing productivity.
A more scalable approach is to identify the different business domains that need to make use of a given entity and create a Domain Gateway for each of them. In essence, a Domain Gateway is an implementation of a VAS tied to a specific business domain. Each one can be maintained by different teams and represent different views on a given entity, relying on the same foundational layer of services. This façade can provide stability and act as an anti-corruption layer for each of the domains.
The Domain Gateway approach involves a certain level of duplication in exchange for autonomy and increased scalability. It makes sense in cases where the different domains have very different access patterns and highly disjoint feature sets, or where communication and collaboration between teams is difficult (for example, because teams are spread across non-overlapping time zones).
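A minimal Python sketch of two Domain Gateways over the same foundational data might look like this. The record fields, gateway names, and access rules are all assumptions for illustration, not a description of SoundCloud's services. Each gateway exposes its own view and its own authorization rule, and the small amount of duplicated lookup code is the trade-off mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TrackRecord:
    """Hypothetical foundational representation of a track."""
    urn: str
    owner: str
    title: str
    public: bool


class FoundationTracks:
    """Stand-in for a foundational service shared by both gateways."""

    def __init__(self, records: List[TrackRecord]) -> None:
        self._by_urn = {record.urn: record for record in records}

    def get(self, urn: str) -> Optional[TrackRecord]:
        return self._by_urn.get(urn)


class ConsumerGateway:
    """Consumer view: only public tracks, only listener-facing fields."""

    def __init__(self, foundation: FoundationTracks) -> None:
        self._foundation = foundation

    def track(self, urn: str) -> Optional[Dict[str, object]]:
        record = self._foundation.get(urn)
        if record is None or not record.public:
            return None
        return {"urn": record.urn, "title": record.title}


class CreatorGateway:
    """Creator view: owners see their own tracks, including private ones."""

    def __init__(self, foundation: FoundationTracks) -> None:
        self._foundation = foundation

    def track(self, urn: str, requesting_user: str) -> Optional[Dict[str, object]]:
        record = self._foundation.get(urn)
        if record is None or record.owner != requesting_user:
            return None
        return {"urn": record.urn, "title": record.title, "public": record.public}


foundation = FoundationTracks([
    TrackRecord("track-1", owner="alice", title="Demo", public=False),
    TrackRecord("track-2", owner="alice", title="Single", public=True),
])
consumer = ConsumerGateway(foundation)
creator = CreatorGateway(foundation)
```

Here the private track is invisible through the consumer gateway but fully visible to its owner through the creator gateway, even though both read from the same foundational store.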

Summary
As we discussed in the previous blog post, the evolution of SoundCloud’s architecture into a three-tier architecture with Value-Added Services as authoritative entry points for accessing aggregates has proven successful, even more so after evolving them into the concept of Domain Gateways. This is a pattern that we’ll continue adopting and applying in the future.
We plan to move other operations that are duplicated in our codebases to their respective gateways. This will give us more flexibility to evolve our system whenever we want to add new functionality, without the hassle of duplicating it in each of the BFFs.
In parallel, we’ll continue encouraging feature teams to evolve their microservices architecture around their core domain. This will create a more solid landscape where business logic is centralized and more easily accessible from other dependent systems.
Finally, we’re still exploring the possibilities enabled by Domain Gateways — including improved team autonomy and reduced cycle times for our development process.

1: For a review of SoundCloud’s architectural layers, refer to our previous blog post.
2: GraphQL is an alternative approach to provide an API that can be customized to consumer needs. Although it provides more flexibility, we decided that its benefits wouldn’t offset the cost of migrating from our standard Twinagle stack.