Production postmortem: The big server that couldn’t handle the load
However, very recently they started to run into severe issues. RavenDB would complain that there isn’t sufficient memory to run.
The system metrics, however, said that there were still gobs of GBs available (I believe that this is the appropriate technical term).
After verifying the situation, the on-call engineer escalated the issue. The problem was weird. There was enough memory, for sure, but for some reason RavenDB would be unable to run properly.
An important aspect is that this user is running a multi-tenant system, with each tenant being served by its own database. Each database has a few indexes as well.
Once we figured that out, it was actually easy to understand what was going on.
There are actually quite a few limits that you have to take into account. I talked about them here. In that post, the issue was the maximum number of tasks defined by the system; once that limit is reached, you can no longer create new threads.
In this case, the suspect was: vm.max_map_count.
Beyond just total memory, Linux has a limit on the number of memory mappings that a single process may have. RavenDB uses Voron, its storage engine, which is based on mmap(), so each database and each index typically has multiple maps going on.
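To see how close a process is to that limit, you can compare the number of lines in /proc/<pid>/maps (one line per mapping) with the value in /proc/sys/vm/max_map_count. The following is a minimal Python sketch of that check, not anything RavenDB ships; the function names are made up for illustration, and it assumes a Linux machine with /proc mounted.

```python
from pathlib import Path


def count_mappings(maps_text: str) -> int:
    """Count mappings in the text of /proc/<pid>/maps (one non-empty line each)."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def read_mapping_count(pid: str = "self") -> int:
    """Read the current number of memory mappings for a process (Linux only)."""
    return count_mappings(Path(f"/proc/{pid}/maps").read_text())


def read_max_map_count() -> int:
    """Read the per-process mapping limit, vm.max_map_count (Linux only)."""
    return int(Path("/proc/sys/vm/max_map_count").read_text())


if __name__ == "__main__":
    # Only attempt the real read where procfs exposes the limit.
    if Path("/proc/sys/vm/max_map_count").exists():
        print(f"mappings in use: {read_mapping_count()}")
        print(f"vm.max_map_count: {read_max_map_count()}")
    else:
        print("procfs not available; this sketch only works on Linux")
```

Pass a process ID as a string (for example, the RavenDB process) to read_mapping_count() to inspect something other than the current process. Reading another user's /proc/<pid>/maps usually requires matching or elevated permissions.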
Given the number of databases involved…
The solution was to increase vm.max_map_count. We also added a task for ourselves: warn the user ahead of time when they are approaching the system’s limits.
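As a rough idea of what such an early warning could look like, here is a small Python sketch. It is not RavenDB's implementation: the function name and the 80% threshold are assumptions picked for illustration, and the real check may look at other limits as well.

```python
from pathlib import Path
from typing import Optional

# Hypothetical threshold: warn once 80% of the limit is in use.
WARN_RATIO = 0.8


def map_limit_warning(current: int, limit: int,
                      warn_ratio: float = WARN_RATIO) -> Optional[str]:
    """Return a warning message when mapping usage reaches warn_ratio, else None."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = current / limit
    if usage < warn_ratio:
        return None
    return (f"process uses {current} of {limit} memory mappings "
            f"({usage:.0%}); consider raising vm.max_map_count")


if __name__ == "__main__":
    maps_file = Path("/proc/self/maps")
    limit_file = Path("/proc/sys/vm/max_map_count")
    # Only attempt the real read on Linux, where procfs exposes both files.
    if maps_file.exists() and limit_file.exists():
        current = sum(1 for line in maps_file.read_text().splitlines() if line)
        limit = int(limit_file.read_text())
        print(map_limit_warning(current, limit) or "mapping count is within limits")
```

The limit itself is a kernel setting, so raising it is done outside the process, typically with sysctl (vm.max_map_count) by an administrator. The values in any example run are illustrative; 65530 is a common default, but check your own system.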