The performance degradation in the load testing tool
During benchmark run on a large dataset, I started to notice that longer benchmarks were showing decidedly worse numbers than short ones. In other words, a benchmark that is run for 1 minute is orders of magnitude higher latencies than a benchmark that is run for 30 seconds. And the longer the benchmark, the worst things off.
That raised a lot of red flags, and spawn a pretty serious investigation. We take performance very seriously and the observed behavior was that we were getting slower over time. We suspected a leak, high number of GC calls, or memory fragmentation. The scenario under test was a web application using the RavenDB API to talk to RavenDB. We run both the web application and the server under profilers and found a few hot spots, but nothing really major. There was no smoking gun.
Then we noticed that the load testing machine was sitting there with 100% CPU. I initially thought that this is us generating too much load for the machine, but that wasn’t it. We are using wrk2, which is capable of handling million of requests per seconds.
We were generating the requests dynamically using a Lua script, and in one of the scenarios under test, we have code like this:
path = "/orders/user/" .. page * pageSize .. "/" .. pageSize .. "/?userId=" .. item.id .. "&deep=y"
That isn’t the most optimal way to do things, I’ll admit. We can do better by using something like table.concat(), but the problem was that regardless of how you build the string, this is supposed to be fairly cheap. The wrk2 project is using LuaJIT, which has a reputation as a really scripting system. I never really thought that this would be a problem. Sure, it is a little wasteful, but it isn’t too bad, a few string temporaries and maybe some realloc() calls, but nothing major.
Instead, this resulted in us getting far worse results over time. It took a while to actually figure out why, but the root cause is in the way LuaJIT handles string hashing.
a = lj_getu32(str);
h ^= lj_getu32(str+len-4); b = lj_getu32(str+(len>>1)-2); h ^= b; h -= lj_rol(b, 14); b += lj_getu32(str+(len>>2)-1);
Strings in Lua are interned, which means that there is just a single copy of a string per value. That means that hashing is important, but the way it does hashing is to take the first 4 bytes, the last 4 bytes and the 4 bytes in the middle and use that for a hash. And that is it.
If you have a bunch of strings where those 3 locations match… well, welcome to hash collisions. At which point, what is supposed to be a O(1) call becomes an O(N) call. And creating strings will turn the operations into an O(N^2) operation!
Here is the reproduction code:
Change the prefix to be an empty string for a major performance boost. The actual bug is well known (5 or 6 years), but it was only recently fixed and not on the version that wrk2 is using.
We had to toss out the entire benchmarking set and start over because of this.
We were generating requests with random data, so some of them would hit this problem hard, and some would avoid it by magic. I was not expecting to debug hash collision in Lua code while trying to get some performance numbers from overloading RavenDB, quite random, literally.