A Tuning Session for Allowing High Traffic Apps to Use a New System
A few months ago, we deployed to production a new feature — “Re-attribution”, and enabled usage for 30 clients in a closed beta. We usually uses gradual rollout as working methodology, and this was the case here.
The New Challenges
About a month ago, one of our account managers came over to ask if we can enable this ability to a few of his clients: “Just 5 more apps”.
Let’s take a moment to explain the difference, technology-wise, between the regular attribution matching and re-attribution matching:
- Regular Attribution tries to match between installs and their clicks. Meaning we are searching the clicks database for the install event to find a match. This event will be sent to our system once per app per user because the install event is singular by nature.
- Re-attribution tries to find a match for every app launch because re-attribution means trying to target existing users. Each launch can potentially derive from a different re-attribution campaign.
It means the traffic the re-attribution service handles is at least one order of magnitude higher than the regular attribution traffic. The actual re-attribution match should be made in less than a second. This rigid requirement demanded that we either improve the existing way of working or develop an entire new toolset. Another technical requirements of this system, is to be able to query external attribution sources via HTTP calls, which in the case of re-attribution, leads up to thousands of HTTP calls per seconds.
We realized that one of the bottlenecks of the new system is the time it spends on multiple HTTP requests that were sent to those said external attribution sources. Our regular attribution service only deals with around 50 of such requests per second. To improve this path, a new service was introduced – the IOCluster. It is relatively simple micro-service that was written in-house, that uses Kafka or RabbitMQ as its input, and is responsible for sending HTTP requests to the “outside world.” The logic was that each service that required a solution for outgoing HTTP traffic won’t have to implement it itself – it only needs to define an agreed-upon message detailing the actual HTTP request and then just put it on an outgoing queue – the IOCluster will take care of the HTTP traffic via async IO and write the results, if needed, into another queue. This decoupling will allow services to scale out more easily.
The Tuning Steps
We needed to balance between the following steps:
- Find the suitable number of worker threads (IO Threads) for one IOCluster node, so that the node itself would still consume from the input queue, while its threads are busy sending HTTP requests and waiting for their responses. All this while maintaining an acceptable CPU utilization of under 80 percent.
- Decide the actual number of needed nodes per cluster that can handle the potential high input traffic.
- The CPU/IO workload of the physical node instances was also an issue. Because of the fact that the main problem was traffic handling, we decided to use smaller machines with fewer CPU cores, but with better IO optimization, and to increase these machines number when needed.
This is how the actual work was implemented:
- We enabled the re-attribution feature for an app with many worldwide users, which immediately bumped our input traffic from 80-100 msgs/sec to around 1700 msgs/sec.
- The IOCluster began to work with 4 double-core machines, but we observed that it couldn’t handle all the inbound traffic, and the lag on its input queue began to climb. We also noticed that the 50 threads per node wasn’t enough, because the CPU utilization was pretty low (~30% avg). The first response was to increase the amount of worker threads, which enabled us to handle more open HTTP connections simultaneously and have better CPU utilization. Ultimately, the input queue lag decreased while the output increased radically.
- As additional apps were enabled to use the feature, the inbound traffic jumped. We added more and more input traffic on an app-to-app basis, while monitoring the limits of the single node. When we saw that the CPU load of a single node was too high and we couldn’t add any more worker threads, we added more nodes to the cluster.
We started with 4 IOCluster nodes with 50 worker threads each for an input traffic of 20 messages per second. we ended up with 8 IOCluster nodes with 300 worker threads per node, and managed to handle inbound traffic of more than 1500 messages per sec pretty easily, with options to add more nodes in case of increased traffic increase.
Sometimes tuning requires architectural changes. In other cases it requires more computing power and it is important to identify what exactly is needed – more threads, more CPU, etc. Sometimes it is simple code refactoring. In our case, with fast growing traffic, it is a combination in most of the cases.
We are planning to open source the IOCluster in the coming future. Stay tuned.