In Computer Science class, I learned to avoid premature optimizations. Thus, when building the first versions of Whaly, we never focused on optimizing it. As our business grew, however, we had to start scaling it. Below are some notes on the steps and paths we’ve taken so far!
This is a work in progress that is based on our own experiences so far, which may differ from yours. We’d love to hear your thoughts and perspectives, so feel free to reach out to email@example.com with any questions or comments! 💌
I hope this article can help you solve similar problems.
Whaly is a cloud Business Intelligence solution, distributed as a SaaS.
Business Users use it to query their business data (customers, operations, finance) to get answers. They build charts based on those queries and arrange them in dashboards.
The interesting properties of scaling such a system are:
During our scaling journey, our monitoring stack helped us:
None of these optimizations can be done without a proper monitoring stack. If you can’t measure what’s going on, spending effort and energy optimizing the wrong thing can be fatal to your business.
Our monitoring stack currently includes:
If you don't have these 3 types of tools in place, it’s probably not a good time to try to optimize your system.
Focus on implementing your monitoring stack before trying to optimize anything else.
Even if the house is on fire and your whole system is painfully slow, you need to diagnose the issue before pushing optimizations to production.
Your brain and your coworkers will want to switch into solution-mode without having a clear view of the problem. Try to resist that urge! While it’s reassuring to work on a solution (even a wrong one), it doesn’t mean that’s the right thing to do 😉
Scaling horizontally means that you deploy more instances of your apps on more servers. With the proper load balancing done upstream of your apps, it means that you can throw more CPUs and memory at your problem.
It’s not the silver bullet of optimization problems, but it’s a useful tool.
Our apps follow the twelve factors (https://12factor.net/) and run on Kubernetes, which makes them easy to scale up or down.
We're using a GitOps workflow to avoid clicking around our Cloud Provider's web console and to keep a better audit trail of all our production operations.
We had read and applied all the cool stuff that everyone talks about on the internet, so we thought we just had to push the “Scale” button we had earned with our top-notch infra design.
We threw more replicas at the problem. It helped a little, but not that much 😓 So we went further down the rabbit hole ⤵️
Caching is an obvious solution when dealing with expensive operations (HTTP calls, calculations, queries).
The main idea is that you don't have to do the same thing multiple times.
We began with local caches stored in the memory of our apps to start quickly.
To avoid exploding our app memory, we're using a fixed cache size with an LRU eviction policy (when too many values are cached, we evict the least recently used ones).
This approach was a really easy way to get started, as we simply needed to add a caching library to our code, but it began to fail as the replication factor of our applications increased over time.
Since each cache lives locally in a replica, the more we replicate our apps, the more the cache is duplicated. The cache hit ratio therefore drops quickly as we add replicas. Local caching and horizontal scaling don't work well together 💥
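A fixed-size cache with LRU eviction, like the one described above, can be sketched in a few lines of Python (our actual implementation uses a library; this is an illustrative stand-in with hypothetical names):

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""
    def __init__(self, max_size: int):
        self.max_size = max_size
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(max_size=2)
cache.set("chart:1", "payload-1")
cache.set("chart:2", "payload-2")
cache.get("chart:1")               # touch chart:1, so chart:2 becomes the LRU entry
cache.set("chart:3", "payload-3")  # over capacity: chart:2 is evicted
```

The bounded size is what keeps the app's memory footprint predictable regardless of traffic.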
We then centralized our caches in a central service backed by Redis to avoid this duplication issue.
Redis is a high performing in-memory database that can be used to store keys and values and share them across replicas.
If Redis is deployed in the same network as your apps, the network call needed to check whether a cached value exists is cheap, and it pays off whenever computing the value takes longer than that round trip.
Since a lot of our cached values themselves require HTTP calls to build, this is practically always worth it 👌
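The cache-aside flow against a central Redis can be sketched like this (a `FakeRedis` stand-in replaces a real Redis client so the sketch is self-contained; all names are hypothetical):

```python
import json

class CacheAside:
    """Check the central cache before doing the expensive work.
    `client` is anything exposing get/set with an expiry, e.g. a Redis client."""
    def __init__(self, client, ttl_seconds=300):
        self.client = client
        self.ttl = ttl_seconds

    def fetch(self, key, compute):
        cached = self.client.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit: one cheap network call, no work
        value = compute()              # cache miss: do the expensive work once
        self.client.set(key, json.dumps(value), ex=self.ttl)
        return value

class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value, ex=None):
        self.store[key] = value

cache = CacheAside(FakeRedis())
calls = []
result = cache.fetch("dash:42", lambda: calls.append(1) or {"rows": 3})
result = cache.fetch("dash:42", lambda: calls.append(1) or {"rows": 3})
```

Because the cache lives in one shared service, every replica sees the same entries, which restores the hit ratio that local caches lost.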
To check if our caches are working properly, we’re tracking the Cache Hit Ratio for each cache using 2 metrics: cache_requests and cache_hits.
Then, we can calculate the cache hit ratio on our dashboarding solution as cache_hits/cache_requests.
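The bookkeeping behind that ratio is tiny; a sketch (the real counters live in our metrics stack, this is just the arithmetic):

```python
class CacheMetrics:
    """Two counters per cache; the dashboard derives the ratio as hits/requests."""
    def __init__(self):
        self.cache_requests = 0
        self.cache_hits = 0

    def record(self, hit: bool):
        self.cache_requests += 1
        if hit:
            self.cache_hits += 1

    def hit_ratio(self) -> float:
        if self.cache_requests == 0:
            return 0.0
        return self.cache_hits / self.cache_requests

metrics = CacheMetrics()
for hit in (True, False, True, True):  # 3 hits out of 4 requests
    metrics.record(hit)
```

A falling ratio after a deploy is usually the first hint that a cache key changed shape or that replication diluted a local cache.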
Another metric that we’re not currently tracking would be the mean execution time on a cache hit vs. a cache miss. This would give us an idea of the latency gained from the cache.
As our processing grows along with the volume of cached values, we'll probably have to shard our Redis database and use different servers for different ranges of cache keys.
Our Redis is currently managed by our Cloud Provider, but self-hosting it is an option to reduce costs in the future.
As we are processing a lot of concurrent requests when people are opening dashboards at the same time, we have to optimize the order in which we execute those requests to:
In order to solve those challenges, we're mainly using 2 concurrency controls:
A mutex (also called a “lock”) helps you enforce the sequential execution of an operation identified by a key. It can be thought of as a queue with a concurrency of 1.
So, let's say you have an operation like “Load Chart 236”. You can put a mutex on this key, and when 5 simultaneous requests need this specific operation, the chart loads will run one after the other.
Hence, the first “Load Chart 236” run will fill the associated cache, and all following runs will read the cached value.
In this example, we avoid 4 calculations out of 5, boost our cache hit ratio, and even decrease latency, as we're no longer trying to access the same resources at the same time while calculating the value.
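The mutex-plus-cache combination can be sketched like this (a local, per-key lock in Python; our stack differs, and the names are illustrative):

```python
import threading
from collections import defaultdict

class KeyedMutex:
    """Hands out one lock per key, so work on the same key is serialized."""
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, key):
        with self._guard:          # protect the dict of per-key locks
            return self._locks[key]

mutex = KeyedMutex()
cache, calls = {}, []

def load_chart(chart_id):
    key = f"chart:{chart_id}"
    with mutex.lock_for(key):      # only one loader per chart at a time
        if key in cache:           # followers find the value already cached
            return cache[key]
        calls.append(chart_id)     # the expensive work happens exactly once
        cache[key] = f"data-{chart_id}"
        return cache[key]

# 5 simultaneous requests for the same chart
threads = [threading.Thread(target=load_chart, args=(236,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The first thread through the lock pays for the computation; the four others block briefly, then read the cache.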
We started with local mutexes managed at the replica level, but when we scaled our replicas horizontally this became less efficient, as many mutexes for the same key existed in the system, one per replica. Same story as our cache, and we applied the same fix: we moved our mutexes into Redis.
This pattern is detailed here: https://redis.io/docs/reference/patterns/distributed-locks/
Mutexes help us do each calculation exactly once across all our replicas and, with the help of the cache, reuse the value for subsequent queries.
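The core of the single-instance locking pattern from the Redis docs is `SET key token NX PX ttl` to acquire, plus a release that only succeeds if the caller still owns the token. A sketch of those semantics, using an in-memory dict as a stand-in for Redis so it runs anywhere (names hypothetical):

```python
import time
import uuid

class SingleNodeLock:
    """Sketch of the Redis locking pattern: acquire with NX + TTL,
    release only if the random token still matches (delete-if-owner)."""
    def __init__(self):
        self.store = {}  # in-memory stand-in for Redis: key -> (token, expiry)

    def acquire(self, key, ttl_seconds=10):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            return None                        # lock is held and not yet expired
        token = uuid.uuid4().hex               # random token identifies the owner
        self.store[key] = (token, now + ttl_seconds)
        return token

    def release(self, key, token):
        entry = self.store.get(key)
        if entry is not None and entry[0] == token:  # only the owner may release
            del self.store[key]
            return True
        return False

lock = SingleNodeLock()
token = lock.acquire("compile:dash:42")
blocked = lock.acquire("compile:dash:42")      # second caller is refused
```

The TTL is the safety net: if a replica crashes while holding the lock, the key expires and other replicas can make progress.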
A semaphore is a kind of mutex that allows multiple concurrent executions at the same time. For example, we can say that for a given key, such as “SQL Query on Warehouse 123”, we allow 5 concurrent executions, and the 6th and beyond have to wait for a slot to be released.
It’s like a queue with a configurable concurrency limit: you define the maximum number of concurrent slots, and the semaphore makes sure things stay under control.
Again, it can be used locally or centralized inside Redis quite easily.
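A local, per-key semaphore can be sketched like this (Python's standard library; the keys and the 5-slot limit mirror the example above, and the names are illustrative):

```python
import threading
import time
from collections import defaultdict

class KeyedSemaphore:
    """Allows up to `limit` concurrent executions per key; extra callers wait."""
    def __init__(self, limit: int):
        self._guard = threading.Lock()
        self._sems = defaultdict(lambda: threading.BoundedSemaphore(limit))

    def slot_for(self, key):
        with self._guard:          # protect the dict of per-key semaphores
            return self._sems[key]

sem = KeyedSemaphore(limit=5)
counter_lock = threading.Lock()
peak = running = 0

def run_query(warehouse_id):
    global peak, running
    with sem.slot_for(f"SQL Query on Warehouse {warehouse_id}"):
        with counter_lock:
            running += 1
            peak = max(peak, running)  # record the observed concurrency
        time.sleep(0.01)               # stand-in for the actual SQL query
        with counter_lock:
            running -= 1

# A peak of 12 simultaneous requests is spread over at most 5 slots
threads = [threading.Thread(target=run_query, args=(123,)) for _ in range(12)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The 12 requests still all complete; the semaphore only shapes *when* they run, which is exactly the peak-spreading behavior described below.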
We’re using semaphores to guard our resource-hungry operations (HTTP calls, CPU-intensive work). A semaphore is really helpful for spreading out a peak of requests while keeping concurrency within a limit.
We’re monitoring the wait time in Mutex and Semaphore in order to understand where requests might be stalled or delayed. This gives us a few helpful hints:
As Mutex and Semaphore create queues that are easy to monitor, we get good insights into where time is being spent in our system and what we should improve to reduce the lead time of a request.
Of course, most optimization opportunities were in our code and our algorithms. Below are the main ones we implemented to speed up the platform.
Our system has more reads than writes. This is because a dashboard is built once and displayed hundreds of times.
Hence, we realized that whenever we could do a calculation at write time rather than at read time, we should.
For example, we calculate a cache key that checks if any configuration in the data lineage of a dashboard has changed since the last query: did an underlying SQL query change? Is there a new filter in place on the table? Are the columns used still the same?
If the cache key changes, it means we have to “re-compile” our internal data structure for the dashboard, which is expensive. If it doesn’t, the existing cache is still up to date and we can keep using it.
We started by calculating this cache key at every read...
We soon realized that it was way more effective to calculate it during the writes.
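One common way to build such a key is to hash a canonical form of everything in the lineage that can invalidate the compiled structure. A sketch under that assumption (the config fields are hypothetical, not our actual schema):

```python
import hashlib
import json

def dashboard_cache_key(config: dict) -> str:
    """Hash the lineage-relevant configuration into a stable cache key."""
    canonical = json.dumps(config, sort_keys=True)  # stable key ordering
    return hashlib.sha256(canonical.encode()).hexdigest()

# Computed once, on write, and stored next to the dashboard:
config = {
    "sql": "SELECT * FROM orders",
    "filters": ["country"],
    "columns": ["amount"],
}
key_on_write = dashboard_cache_key(config)

# On read, compare against the stored key instead of re-deriving the lineage:
unchanged = dashboard_cache_key(config) == key_on_write  # cache still valid
config["filters"].append("segment")                      # a write changes the lineage
changed = dashboard_cache_key(config) != key_on_write    # must re-compile
```

Since dashboards are read far more often than written, moving the hashing to the write path removes it from the hot path entirely.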
For changes that don’t arrive through writes, like when people change the databases we query (data, schema), doing background checks and calculations is a good way to prepare ahead for future read queries.
We’re using GraphQL to communicate between our backend and our frontend service. It is a truly beautiful piece of technology, as frontend developers no longer have to combine multiple REST API results to build the frontend state.
A single GraphQL query can fetch multiple objects and their relationships, which gives a nice velocity boost to frontend work.
However, in some cases, we moved too fast and wrote inefficient queries.
For example, we have a chart object in GraphQL and a dataset object. Each chart is linked to a dataset that feeds it with data, and many charts can be built on the same dataset.
So, naively, when building the dashboard page, we asked for each chart object along with its related dataset object. This meant that a dataset object used by many charts was read and duplicated many times in the GraphQL response.
As “dataset” is a big object with many properties and children, duplicating it is a bad idea: it generates a more complicated SQL query, bigger payloads, etc.
In this case, it was more efficient to ask for the list of charts and the list of datasets and reconcile them directly on the frontend, rather than fanning out in the GraphQL query.
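The reconciliation step is a plain join by id on the client. A sketch with simplified, hypothetical object shapes (the real GraphQL objects are much bigger, which is the whole point):

```python
# Two flat lists instead of each chart embedding its full dataset:
charts = [
    {"id": 1, "dataset_id": 10},
    {"id": 2, "dataset_id": 10},  # shares dataset 10: no duplication in the payload
    {"id": 3, "dataset_id": 11},
]
datasets = [
    {"id": 10, "name": "orders", "columns": ["amount", "country"]},
    {"id": 11, "name": "sessions", "columns": ["duration"]},
]

# Index the datasets once, then attach each chart's dataset by reference:
datasets_by_id = {d["id"]: d for d in datasets}
resolved = [
    {**chart, "dataset": datasets_by_id[chart["dataset_id"]]}
    for chart in charts
]
```

Each dataset travels over the wire once, and charts sharing a dataset end up pointing at the same object in memory.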
In a lot of our applications, we were computing things at too coarse a level, like caching a customer's entire configuration at the customer level.
It worked fine when our customers were small and barely using us. When multiple teams at the same customer started using us full time, the amount of data to process per customer went through the roof, and we had to chase down hot spots everywhere our code wasn't granular enough.
The main idea is to do as few things as possible for any given query.
Doing everything every time is easy to code, hard to scale.
That’s all I have so far! Of course, this is a work in progress, and many more optimizations will be needed to reach the scale we’re aiming for, but we’re already happy to have come this far 👍 Needless to say, scaling a BI platform is a full-time job, and proper resources should be put in place to support it.