What you’ll learn

  • How to identify and troubleshoot disk issues caused by indexes in RavenDB.
  • Practical tools and techniques for diagnosing disk usage issues.
  • Using RavenDB tools to monitor index performance.
  • Where to look for information to resolve disk problems related to indexing.

Introduction

Disk usage issues often stem from indexes consuming more resources than expected. Identifying and resolving these problems is crucial for maintaining optimal database performance. This guide will help you understand how to find indexes that are eating up your precious disk resources and where to look for help to alleviate related concerns.

This article is useful both as general background and as a troubleshooting reference. If you don’t see a problem with your current setup, there is no reason to change anything. For the most part, RavenDB can manage your system for you without a human being directly involved.

This article covers what to do when the default behavior is not sufficient, and should be treated as intermediate-to-expert material on optimizing your RavenDB-based systems.

Indexing Performance View

Let’s start by going through a tool that is available to every RavenDB user, which is the Indexing Performance view in Studio.

The Indexing Performance View in RavenDB Studio is a powerful tool that provides insights into how your indexes are performing. Understanding these metrics can help you identify and troubleshoot issues that may be consuming excessive disk resources.

Purpose of the Indexing Performance View

The Indexing Performance View allows you to monitor and analyze the behavior of your indexes. By providing real-time data on various performance metrics, this tool helps you pinpoint problematic indexes and optimize them for better performance and resource efficiency.

Key metrics

Indexing Time: This metric shows the total time taken by the index to process documents. Long indexing times can indicate complex computations or large data sets.

  • Example: If an index is taking unusually long to process, it may be due to large fields or complex logic within the index definition.

Reduce Time: For map-reduce indexes, this metric shows the time taken for the reduce phase.

  • Example: If the reduce time is high, it may indicate that the reduction logic is too complex or the data set is too large. It may also indicate that the index has to touch a lot of values (distinct group-by elements) and has to do a lot of I/O because of that.

Indexing Throughput: This measures the number of documents indexed per second.

  • Example: Low throughput could point to inefficiencies in the indexing process, potentially due to resource contention or suboptimal index definitions.

Interpreting metrics

To effectively use the Indexing Performance View, it’s crucial to understand how to interpret these metrics:

  • High Indexing Time: Indicates potential inefficiencies in index processing. Look for ways to simplify the index logic or reduce the amount of data being indexed.
  • Long Reduce Time: May indicate complex reduction logic or large data sets. Simplify the reduction process where possible.
  • Low Indexing Throughput: Could be a sign of resource contention or inefficiencies. Investigate server performance and optimize index definitions.

Example: A common mistake is to index large text fields or frequently changing fields which can lead to high resource consumption. By analyzing these metrics, you can identify such issues and take corrective actions.
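To make the throughput metric concrete, here is a minimal sketch that ranks indexes by documents indexed per second. It does not call any RavenDB API: the index names and numbers are hypothetical and stand in for values you would read off the Indexing Performance view yourself.

```python
from dataclasses import dataclass


@dataclass
class IndexSample:
    """One observation noted down by hand from the Indexing Performance view."""
    name: str
    documents_indexed: int
    duration_seconds: float


def throughput(sample: IndexSample) -> float:
    """Documents indexed per second for a single sample."""
    if sample.duration_seconds <= 0:
        return 0.0
    return sample.documents_indexed / sample.duration_seconds


def slowest_first(samples: list[IndexSample]) -> list[tuple[str, float]]:
    """Rank indexes from lowest to highest throughput."""
    ranked = [(s.name, throughput(s)) for s in samples]
    return sorted(ranked, key=lambda pair: pair[1])


# Hypothetical index names and numbers, for illustration only.
samples = [
    IndexSample("Orders/ByCompany", 10_000, 2.0),
    IndexSample("Employees/ByBiography", 10_000, 40.0),
    IndexSample("Products/ByName", 10_000, 4.0),
]

for name, docs_per_second in slowest_first(samples):
    print(f"{name}: {docs_per_second:.0f} docs/sec")
```

The index at the top of the list is the first one worth opening and reviewing. In this made-up data it is the one that indexes a large text field.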

You can find more information about the Indexing Performance view in the documentation.

Common Indexing Definition Issues

To optimize your indexes and reduce disk usage, it’s essential to understand common indexing definition issues, including the use of References (using LoadDocument):

Over-Indexing

  • Issue: Indexing too many fields or large fields unnecessarily.
  • Solution: Review and simplify your index definitions to include only essential fields.

For instance, if you’re indexing an Employee entity, consider whether it’s necessary to index every field. Indexing large text fields like Biography or fields that rarely participate in queries can lead to excessive disk usage and slower performance. Instead, focus on fields that are frequently used in queries, such as LastName or Department.

Improper use of storing

Issue: Storing too much data in an index or not using the Store feature when dealing with projections.

Solution: Carefully consider which fields should be stored in the index. Storing too much data can increase the size of the index and slow down query performance. Conversely, not storing frequently accessed fields can lead to inefficiencies, as RavenDB may need to load the entire document during a query.

If you’re querying on a field but not retrieving it directly from the index (e.g., using the field only for search purposes and not for displaying results), there is no need to store that field in the index. Storing is only necessary when you want to retrieve the value from the index without loading the full document.

For example, if you’re querying on Employee data and often need to display an employee’s FullName, consider storing the FullName field. This allows RavenDB to retrieve it directly from the index without needing to fetch the entire document, optimizing both performance and resource usage. However, if you’re only searching by FullName and not retrieving it in the query results, there’s no need to store it.

If you are always loading the full document back, there is no point in storing a field and it would be just a waste of space. If you are loading specific fields, and you are already indexing on them, it may be worth it to store them (especially if it is complex to fetch the fields from the document). Note that storing fields is an advanced step, and should be taken only if you are both familiar with the feature and see the potential for a performance boost in using it.

Frequent updates

Issue: Indexes that are frequently triggered to process small amounts of data can be inefficient and consume a lot of resources.

Solution: Use static fields or less dynamic data where possible to reduce update frequency. Also consider using throttling and experimenting with the MapBatchSize configuration.
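To see why throttling helps with many small updates, consider the simplified model below. It is an assumption made for illustration, not RavenDB’s actual scheduling logic: every update that arrives within the throttle window of the current run is folded into that run. The workload and the `count_indexing_runs` helper are hypothetical.

```python
def count_indexing_runs(update_times_ms: list[int], throttle_ms: int) -> int:
    """Count indexing runs in a simplified throttling model.

    An update starts a new run only if at least `throttle_ms` have passed
    since the current run started. With a throttle of 0, every update
    triggers its own run.
    """
    runs = 0
    window_start = None
    for t in sorted(update_times_ms):
        if window_start is None or throttle_ms <= 0 or t - window_start >= throttle_ms:
            runs += 1
            window_start = t
    return runs


# Hypothetical workload: one document update every 100 ms for 10 seconds.
updates = list(range(0, 10_000, 100))

print(count_indexing_runs(updates, throttle_ms=0))      # 100
print(count_indexing_runs(updates, throttle_ms=1_000))  # 10
```

In this model, the same 100 updates are handled in 10 larger runs instead of 100 tiny ones. That is the effect you are after when an index is triggered constantly by small changes.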

LoadDocument Misuse

Issue: Using LoadDocument to fetch related documents can lead to significant performance issues if overused. When an index uses LoadDocument to reference related documents, it creates a dependency on the referenced documents. For example, imagine you have an index for Employee records where each employee document references a Department document via LoadDocument. If you have thousands of Employee documents that each reference the same Department, every time that Department document is updated, all related Employee documents might need to be re-indexed. This can result in a significant amount of re-indexing and thus, higher disk usage and slower performance.

Solution: Minimize the use of LoadDocument in your index definitions. Ensure that it is used only when absolutely necessary. If possible, denormalize data to reduce dependencies on LoadDocument.
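The fan-out described above is easy to underestimate, so here is a rough worst-case calculation. The figures are hypothetical, and the model assumes that every update to the referenced document re-indexes every document that references it.

```python
def reindex_operations(referencing_docs: int, reference_updates: int) -> int:
    """Worst-case number of documents re-indexed because a shared reference changed."""
    return referencing_docs * reference_updates


employees_per_department = 5_000   # hypothetical
department_updates_per_day = 20    # hypothetical

extra_work = reindex_operations(employees_per_department, department_updates_per_day)
print(extra_work)  # 100000
```

Twenty edits a day to a single Department document turn into 100,000 Employee re-index operations in this scenario, even though no Employee document changed.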

Complex index definitions

Issue: Indexes with complex calculations or transformations can be resource-intensive.

Solution: Simplify index logic and reduce unnecessary computations.

High cardinality fields

Issue: Indexing fields with many unique values can increase index size.

Solution: Avoid indexing high cardinality fields unless absolutely necessary.

Indexing large arrays

Issue: Indexing large arrays or collections can significantly increase resource usage.

Solution: Limit indexing to the most critical elements of large collections.

Redundant indexes on the same collection

Issue: Defining multiple indexes on the same collection of documents, where each index handles only one or two fields, can lead to inefficiencies. Managing several small indexes often results in more disk usage and increased resource consumption because each index must process and store data independently.

Solution: Whenever possible, merge multiple indexes into a single, more comprehensive index that processes all the necessary fields. By consolidating indexes, you reduce the overhead of maintaining separate indexes and improve overall performance. RavenDB Studio includes an Index Merge view that can help you identify opportunities to combine indexes efficiently. This tool suggests potential merges and provides insights on how to streamline your indexing strategy.
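Before opening the Index Merge view, you can get a first impression from a plain list of your indexes. The sketch below groups a hand-written inventory by collection and reports collections covered by more than one index. It is not the algorithm Studio uses, and the index names, collections and fields are hypothetical.

```python
from collections import defaultdict

# index name -> (collection, indexed fields); hypothetical inventory
Inventory = dict[str, tuple[str, set[str]]]


def merge_candidates(indexes: Inventory) -> dict[str, list[str]]:
    """Group index names by collection and keep collections with 2+ indexes."""
    by_collection: dict[str, list[str]] = defaultdict(list)
    for index_name, (collection, _fields) in indexes.items():
        by_collection[collection].append(index_name)
    return {c: sorted(names) for c, names in by_collection.items() if len(names) > 1}


def merged_fields(indexes: Inventory, names: list[str]) -> list[str]:
    """Union of the fields a single merged index would need to cover."""
    fields: set[str] = set()
    for name in names:
        fields |= indexes[name][1]
    return sorted(fields)


indexes: Inventory = {
    "Employees/ByLastName": ("Employees", {"LastName"}),
    "Employees/ByDepartment": ("Employees", {"Department"}),
    "Orders/ByCompany": ("Orders", {"Company"}),
}

candidates = merge_candidates(indexes)
for collection, names in candidates.items():
    print(collection, names, "->", merged_fields(indexes, names))
```

Here the two single-field Employees indexes are flagged, and a merged index would cover both Department and LastName. Whether a merge is actually safe depends on the index definitions, which is what the Index Merge view checks for you.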

For more information about these problems and how to fix them, see the documentation.

Finding which index ate my disk using iotop

When your disk resources start to dwindle due to heavy indexing activity in RavenDB, identifying the specific index responsible is crucial. One effective tool for diagnosing disk usage issues on Linux is iotop, a disk I/O monitor. For Windows users, a similar tool is the Resource Monitor, which provides detailed information about disk usage by processes.

Here’s a step-by-step guide to find which index is consuming the most resources:

  1. Install iotop

If iotop is not already installed on your Linux machine, install it with your package manager. For example, on Debian/Ubuntu:

sudo apt-get install iotop

  2. Run iotop

Launch iotop with root privileges and the -a option to show accumulated I/O instead of bandwidth. In this mode, iotop shows the amount of I/O that processes have done since iotop started:

sudo iotop -a

You’ll see something like this:

There’s a lot of information, but we’re mostly interested in these columns:

  • TID (Thread ID),
  • DISK READ,
  • DISK WRITE,
  • COMMAND.

  3. Identify resource-intensive threads

In iotop, you can sort by any of the visible columns using the arrow keys.

For example, I will sort by DISK WRITE:

The first row now shows the thread with the most write I/O:

Raven.Server -c /ravendb/con~ttings.json [Idx NorthwindNZ]

The Idx NorthwindNZ at the end is a clear indicator that we’re dealing with an index thread. NorthwindNZ is the truncated name of the originating database, NorthwindNZD (the full name won’t show up in iotop if it’s too long).

Let’s note its thread ID: 27349

  4. Go to RavenDB Studio

Open RavenDB Studio in your browser, go to Manage Server, and open Advanced in the Debug section.

This view lets you filter threads by ID, which is exactly what we need. When I enter the thread ID from the previous step, I get what I’m looking for:

Orders/ByShipment/Location is the name of the index that is causing the most write I/O on my server.

  5. Analyze and optimize the index

Once we’ve identified the problematic index, we can take steps to optimize it:

  • Review index definitions: Start by carefully reviewing your index definitions. Look for fields that can be removed or simplified.
  • Store frequently accessed fields: Use the storing data in indexes feature for fields that are frequently queried but infrequently changed. The documentation covers this feature in more detail.
  • Use Static Indexes: Where possible, use static indexes, which provide more control over indexing behavior and can be optimized for specific use cases.
  • Throttling: Consider throttling the indexing process. This can help balance the load and prevent the index from overwhelming system resources. The documentation covers throttling in more detail.

If you are running a RavenDB server on a Windows machine, you can use Resource Monitor instead. It lets you cut straight to the chase: because it shows the file path, finding the most I/O-intensive index is just a matter of sorting by Read, Write or Total (B/sec):

We can see that the index Orders/ByShipment/Location is doing a lot of writing.
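Going back to the iotop walkthrough, the "spot the index thread" part of step 3 can be sketched in a few lines. The command string and thread ID below are the ones from this article. The byte counts, the second row and the `index_thread_tag` helper are hypothetical, and the rows are assumed to have been copied by hand from the iotop screen.

```python
import re
from typing import Optional

# Matches a trailing "[...]" thread name at the end of the COMMAND column.
THREAD_NAME = re.compile(r"\[([^\]]+)\]\s*$")


def index_thread_tag(command: str) -> Optional[str]:
    """Return the (possibly truncated) name after 'Idx ', or None if not an index thread."""
    match = THREAD_NAME.search(command)
    if match is None:
        return None
    name = match.group(1)
    if not name.startswith("Idx "):
        return None
    return name[len("Idx "):]


# (thread ID, accumulated bytes written, COMMAND column); amounts are made up.
rows = [
    (1201, 12_000, "hypothetical-other-process"),
    (27349, 900_000, "Raven.Server -c /ravendb/con~ttings.json [Idx NorthwindNZ]"),
]

tid, written, command = max(rows, key=lambda row: row[1])
print(tid, index_thread_tag(command))  # 27349 NorthwindNZ
```

The thread ID is what you take to Studio. The tag only tells you that the thread belongs to an index and which database it comes from, and, as noted above, it may be truncated.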

Conclusions

We explored how to identify and troubleshoot disk usage issues caused by indexes in RavenDB, using tools like iotop for Linux and RavenDB Studio’s Indexing Performance View. We also reviewed common indexing definition issues.

Best Practices

  • Simplify Index Definitions: Regularly review and optimize your indexes.
  • Store Data in Indexes: Use this feature during projections for frequently queried but infrequently updated fields.
  • Throttling: Implement throttling to balance the indexing load.
  • Monitor Performance: Continuously monitor index performance using RavenDB tools.

Monitor and Analyze Regularly

Set up a monitoring system to keep track of key performance metrics. Regular analysis helps prevent unexpected spikes in resource usage and ensures efficient database performance.

Further Resources