Scaling a BI system

Our journey at Whaly so far scaling our self-service business intelligence platform. Here are the steps we've taken and the learnings we've gathered along the way, from an engineering standpoint.

Scaling a BI system

In Computer Science class, I learned to avoid premature optimizations. Thus, when building the first versions of Whaly, we never focused on optimizing it. As our business grew, however, we had to start scaling it. Below are some notes on the steps and paths we’ve taken so far!

This is a work in progress that is based on our own experiences so far, which may differ from yours. We’d love to hear your thoughts and perspectives, so feel free to reach out to with any questions or comments! 💌

I hope this article can help you solve similar problems.


Whaly is a cloud Business Intelligence solution, distributed as a SaaS.

Business Users use it to query their business data (customers, operations, finance) to get answers. They build charts based on those queries and arrange them in dashboards.

The interesting properties of scaling such a system is that:

  • It is multi-tenant: in order to save costs, the “control plane” of our platform contains a lot of logic to manage all the workloads of our customers. Meaning that the same servers and databases are used to orchestrate queries, store the dashboard configurations, user access, and more - across all customers. The queries are executed in the “data plane,” which is the data warehouse of our customers through which we connect.
  • It has to deal with expensive calculations: most queries are run on a large volume of data stored into data warehouses that have high query latency and a per-query pricing model, which means that we have to play it smart to save on costs (the customer data warehouse bill 💸) and latency (to maintain a great customer experience ⚡).
  • High concurrency: when 10 people open 5 dashboards, each containing 100 charts at the same time (during the weekly “all hands” meeting, for example), you have 10x5x100 = 5,000 simultaneous requests, and people still expect the system to run swiftly. And this is just for a single customer. When 100s of customers are opening dashboards, you’re talking 500,000 requests all happening at the same time…
  • It's growing fast: each month we are serving 2x more queries than the previous month, which is great, but our engineering team and resources are not growing at the same rate 😅


During our scaling journey, our monitoring stack helped us:

  • Identify which processing/service/route needs to be optimized based on measured concurrency, latency and throughput
  • Check the impact of our optimizations to be sure that the approach was the proper one or that its implementation was done correctly

None of these optimizations can be done without a proper monitoring stack. If you can’t measure what’s going on, spending effort and energy optimizing the wrong thing can be fatal to your business.

Our monitoring stack currently includes:

  • A timeseries store (Prometheus) + dashboard UI (Grafana)
  • A logging solution (Google Cloud Logging)
  • Tracing solutions (Google Cloud Trace for all traces + custom trace service based on Redis for our query engine, that is also exposed to end users to help them understand how to optimize their queries)

If you don't have these 3 types of tools in place, it’s probably not a good time to try to optimize your system.

Focus on implementing your monitoring stack before trying optimize anything else.

Even if the house is on fire and your whole system is fucking slow, you need to diagnose the issue before pushing optimization in production.

Your brain and your coworkers will want to switch into solution-mode without having a clear view of the problem. Try to resist that urge! While it’s reassuring to work on a solution (even a wrong one), it doesn’t mean that’s the right thing to do 😉

Horizontal scaling

Scaling horizontally means that you deploy more instances of your apps on more servers. With the proper load balancing done upstream of your apps, it means that you can throw more CPUs and memory at your problem.

It’s not the silver bullet of optimization problems, but it’s a useful tool.

Our apps are following the 12 factors ( and are run on Kubernetes that make it easy to scale them up or down.

We’re using Gitops methodology to avoid going on the web console of our Cloud Provider and for a better audit of all our production operations.

We read and applied all the cool stuff that everyone is talking about on the internet, so we thought that we just had to push the “Scale” button that we earned with our top notch infra design.

We added more replicas to our problems. It helped a little, but not so much 😓 So we went further down the rabbit hole ⤵️


Caching is an obvious solution when dealing with expensive operations (HTTP calls, calculations, queries).

The main idea is that you don't have to do the same thing multiple times.

We began with local caches stored in the memory of our apps to start quickly.

To avoid exploding our app memory, we're using fixed cache size with an LRU eviction policy (when too many values are cached, we evict the least used one).

This approach was a really easy way to get started as we simply needed to add a  “cache” library in our code, but it quickly began to fail due to the replication factor of our applications that increased over time.

As each cache is stored locally in a replica, the more we replicate our apps, the more we have duplications of the cache. This means that the cache hit ratio quickly decreases as we add more replicas to our system. So local caching and horizontal caching don’t work well together 💥

Centralized cache

We then centralized our caches in a central service backed by Redis to avoid this duplication issue.

Redis is a high performing in-memory database that can be used to store keys and values and share them across replicas.

If deployed in the same network as your apps, the network call needed to check if a cache value exists is not too high to pay and is still worth it, if calculating the cached value is taking longer than this network call.

As a lot of cached values are themselves requiring HTTP calls to be built, this is practically always worth it 👌

Measuring performance

To check if our caches are working properly, we’re tracking the Cache Hit Ratio for each cache using 2 metrics.

  • cache_requests
  • cache_hits

Then, we can calculate the cache hit ratio on our dashboarding solution as cache_hits/cache_requests.

Another metric that we’re not currently tracking, would be the mean time of execution when there is a cache hit vs. when there is not. This would give us an idea of the latency gain of the cache.

Next step for caching

As our processing grows along with the volume of cache values, we'll probably have to shard our Redis database and use different servers per range of cache key.

Our Redis is currently managed by our Cloud Provider, but self hosting it is an option to reduce costs in the future.

Concurrency control

As we are processing a lot of concurrent requests when people are opening dashboards at the same time, we have to optimize the order in which we execute those requests to:

  • Avoid overloading systems when receiving peak traffic. Most services perform better when the load is spread. So we have to build some kind of queuing system and make some queries wait for the other to be processed in order.
  • Don't do the same thing at the same time before the caches are loaded (especially as our services are replicated and queries are balanced between the replicas).
  • Don’t let a single customer take up all the system resources: we have the responsibility of fairly allocating our processing power to our customers, so each one of them should be isolated at the processing level when ordering requests.

In order to solve those challenges, we're mainly using 2 concurrency controls:

  • Mutex
  • Semaphore


A mutex helps you to enforce the sequential execution of a processing identified by a key. It can also be called “lock”. It can be thought of a queue with a concurrency of 1.

So, let's say that you have a processing like “Load Chart 236”. You can put a mutex on this key and when receiving 5 simultaneous requests needing this specific processing, all the loading of this chart will be done one after the other.

Hence, the first processing of “Load Chart 236” will fill the associated cache and all following processing will read the cached value.

In this example, we can avoid 4 calculations out of 5, boosted our cache hit ratio and even decreased the latency as we're no longer trying to access the same resources at the same time when calculating our value.

We started with local mutex managed at the replicas level but when we scaled our replicas horizontally, it became less efficient as many mutexes of the same key were existing in the system, one for each replica. Same story as our cache, and we did the same thing: we moved our Mutex inside Redis.

This pattern is detailed here:

Mutexes are helping us to do each calculation exactly once across all our replicas and with the help of cache, reusing the value for subsequent queries.


A semaphore is a kind of Mutex, but this time, multiple concurrent executions are allowed at the same time. For example, we can say that for any given key, such as “SQL Query on Warehouse 123”, we can execute 5 concurrent executions, and the 6th+ will have to wait for a slot to be released.

It’s like a queue but with a configurable concurrency limit. You define the number of maximum concurrent slots and the semaphore is what’s making sure that things are under control.

Again, it can be used locally or centralized inside Redis quite easily.

We’re using Semaphore to represent our resource-hungry (HTTP calls, CPU) processings. Semaphore is really helpful to spread a peak of requests while maintaining a limit on the allowed concurrency.

Measuring performance

We’re monitoring the wait time in Mutex and Semaphore in order to understand where requests might be stalled or delayed. This gives us a few helpful hints:

  • Which processing we have to speed up to reduce the waiting time
  • When it’s the right moment to increase the maximum concurrency of our system by raising the bar on Semaphore.

As Mutex and Semaphore create queues that are easy to monitor, we get good insights into where time is being spent in our system and what we should improve to reduce the lead time of a request.

Code optimization

Of course, most optimizations were lying in our code and in our algorithms. Below are the main ones that we implemented to speed up the platform.

1. Do expensive calculation on writes or in background instead of on reads

Our system has more reads than writes. This is because a dashboard is built once and displayed hundreds of times.

Hence, we realized that whenever we could do a calculation on writes, we should, rather than during reads.

For example, we calculate a cache key that checks if any configuration on the data lineage of a dashboard has changed since the last query: did an underlying SQL query change? Is there a new filter in place on the table? Is the used columns still the same?

If the cache key change, it means that we have to “re-compile” our internal data structure of a dashboard which is expensive to do. If it doesn’t, we can still use the existing cache which is still up to date.

We started by calculating this cache key at every read...

We soon realized that it was way more effective to calculate it during the writes.

For things that don’t happen during writes, like when people are changing their databases that we query (data, schema), doing background checks and calculation is a good way of preparing  ahead for future read queries.

2. Be careful on nested GraphQL queries

We’re using GraphQL to communicate between our backend and our frontend service. It is a truly beautiful piece of technology as the frontend developers no longer have to combine multiple REST APIs results to generate the frontend state.

A single GraphQL query can query multiple objects and their relationships. It is giving a nice boost of velocity to the frontend work.

However, in some case, we were too fast and wrote sub-efficient queries.

For example, we have a chart object in GraphQL and a dataset object. Each chart is linked to a dataset that feeds it with data, and many charts can be built on the same dataset.

So, naively, when building the dashboard page, we asked for each chart object and its related dataset object. This meant that the dataset objects used in many charts were read and duplicated many times in the GraphQL response.

As “dataset” is a big object with many properties and children, duplicating it is not a good idea, it generates a more complicated SQL query, bigger payloads, etc.

In this case, it was more efficient to ask for the list of charts and for the list of datasets and do reconciliation on the frontend directly rather than having a fan out in the GraphQL query.

3. Granular context scoping

In a lot of our applications, we were calculating things at a high level. Like doing cache at the customer level for all its configuration.

It worked fine when our customers were small and barely using us. When multiple teams on the same customers started using us full time, the amount of data to process per customer went to the roof and we had to chase all over the place when our code wasn’t granular enough.

The main idea is to do as few things as possible for any given query.

Doing everything every time is easy to code, hard to scale.

That’s all I have so far! Of course, this is a work in progress, and many optimizations will be needed to achieve the scale that we’re aiming for, but we’re already happy to have come this far 👍  needless to say, scaling a BI is a full-time job and proper resources should be put in place to support it.