Cloud: Maintenance & Troubleshooting


Overview

Our products come equipped with an advanced monitoring system that tracks critical performance metrics to ensure optimal operation. Each product may display one or more of the following labels, which provide real-time insight into its current status.

Labels

Label | Description
Node not responding | The product is not responding to input or commands, which suggests a critical issue or an unresponsive state. Possible causes include high CPU or IO usage, thrashing, cloud provider hardware failure, or application errors.
High CPU usage | The product's processor is under heavy load, which may affect performance.
High IO usage | The product is performing a high volume of input/output operations, which may indicate heavy disk or network activity. High IO usage can slow down data processing and overall system performance, and may call for optimization or a resource upgrade. It is sometimes caused by thrashing (excessive swapping due to lack of memory).
Low memory | The product is close to using all of its RAM, which limits its ability to handle additional tasks. Low memory can cause applications to slow down or crash, and indicates a need for memory optimization or more RAM.
CPU Credits are low | The product is running low on CPU credits (burstable instances only). Low CPU credits can reduce processing power and affect performance. Adjusting resource allocation or upgrading to a higher credit tier may be required.
Free disk space - 10% left | The product's available disk space is below 10%.
Free disk space is very low | The product's available disk space is below 7%.
Low uptime - below 15 minutes | The product's uptime has been below 15 minutes for at least 30 minutes.
Low uptime - below 2 hours | The product's uptime has been below 2 hours for at least 4 hours.
Low uptime - below 12 hours | The product's uptime has been below 12 hours for at least 24 hours.
Low uptime - below 48 hours | The product's uptime has been below 48 hours for at least 96 hours.
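
The threshold-based labels above can be illustrated with a short Python sketch. This is not the monitoring system's actual implementation: the function names, the sample format, and the reading of "below X for at least Y" (every uptime sample in the last Y minutes is below X, and the samples reach back at least Y minutes) are assumptions made for illustration. Only the thresholds and label names come from the table.

```python
from typing import List, Optional, Tuple

def disk_label(free_percent: float) -> Optional[str]:
    """Return the disk label for a free-space percentage (hypothetical helper)."""
    if free_percent < 7:
        return "Free disk space is very low"
    if free_percent < 10:
        return "Free disk space - 10% left"
    return None

# (uptime limit in minutes, observation window in minutes, label), from the table.
UPTIME_RULES = [
    (15, 30, "Low uptime - below 15 minutes"),
    (2 * 60, 4 * 60, "Low uptime - below 2 hours"),
    (12 * 60, 24 * 60, "Low uptime - below 12 hours"),
    (48 * 60, 96 * 60, "Low uptime - below 48 hours"),
]

def uptime_labels(samples: List[Tuple[int, int]]) -> List[str]:
    """samples: (minutes_ago, uptime_minutes) pairs; an assumed input format."""
    labels = []
    for limit, window, label in UPTIME_RULES:
        in_window = [uptime for age, uptime in samples if age <= window]
        # Require samples that reach back at least as far as the window.
        covered = any(age >= window for age, _ in samples)
        if covered and in_window and all(u < limit for u in in_window):
            labels.append(label)
    return labels

print(disk_label(8.5))  # Free disk space - 10% left
print(uptime_labels([(0, 5), (10, 10), (20, 3), (30, 12)]))
```

In the last call, the node never reached 15 minutes of uptime during the past 30 minutes, so only the first uptime rule matches; the longer windows are not yet covered by the samples.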

Cluster Health

Cluster Health helps keep your cluster healthy by providing suggestions and an incident history.
This dashboard has two separate sections: Incidents History and Suggestions.


Incidents History

The RavenDB Cloud monitoring system tracks incidents and cluster performance.
This section lets you analyse your cluster's incidents, filtered by time period, cluster node, severity, and category.

Cluster Health: Incidents History section

Categories and their descriptions

Incidents are split into six categories. The Description column gives additional information about the incidents in each category.

Category name | Description
Uptime | Product's uptime has been below a specified time.
Memory | Product is running low on available memory.
CPU | Product is running low on CPU credits or experiencing a high level of CPU utilization.
Disk | Product's available disk space is low.
IO | Product is experiencing high input/output operations.
Availability | Product is currently not responding to input or commands, or has been restarted.

Suggestions

This section displays suggestions for the selected cluster node, based on incident trends.

Cluster Health: Suggestions section

Suggestions are generated from the last 60 days, divided into two 30-day halves.

The first half (the previous period) runs from 60 days ago to 30 days ago.
The second half (the current period) runs from 30 days ago to now.
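
As a minimal sketch, the two periods can be computed from the current time as follows. The function name `suggestion_periods` is hypothetical and is not part of any RavenDB Cloud API; it only restates the date arithmetic described above.

```python
from datetime import datetime, timedelta, timezone

def suggestion_periods(now: datetime):
    """Return ((previous_start, previous_end), (current_start, current_end))."""
    previous = (now - timedelta(days=60), now - timedelta(days=30))
    current = (now - timedelta(days=30), now)
    return previous, current

now = datetime(2024, 3, 31, tzinfo=timezone.utc)
previous, current = suggestion_periods(now)
print(previous[0].date(), previous[1].date())  # 2024-01-31 2024-03-01
print(current[0].date(), current[1].date())    # 2024-03-01 2024-03-31
```

The previous period ends exactly where the current period begins, so together they cover the full 60 days.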

Suggestion types

Cluster Health is able to generate suggestions for five usage areas:

  • High CPU usage
  • High IO usage
  • Low memory mode
  • Low CPU credits
  • Server restarted due to Out of Memory

Analyses

Suggestions are generated using three types of analysis, each of which applies to specific suggestion types:

Analysis type | Applicable to | Description
Current period above previous period by percents | High CPU usage, High IO usage, Low memory mode, Low CPU credits | The duration of a specific event was longer in the current period than in the previous period.
Current period above previous period by numbers | Server restarted due to Out of Memory | The number of occurrences of a specific event was higher in the current period than in the previous period.
Current period above threshold by percents | High CPU usage, High IO usage, Low memory mode, Low CPU credits | The duration of a specific event in the current period was longer than the threshold (5%).
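
The three analyses can be sketched in Python as simple comparisons. This is an illustration, not the service's actual logic: the function names are hypothetical, and "by percents" is assumed here to mean the event's duration as a share of the period's length. Only the 5% threshold comes from the table above.

```python
THRESHOLD_PERCENT = 5.0  # threshold stated in the table above

def percent_of_period(event_hours: float, period_hours: float) -> float:
    """Share of the period during which the event was active (assumed meaning)."""
    return 100.0 * event_hours / period_hours

def above_previous_by_percents(current_pct: float, previous_pct: float) -> bool:
    return current_pct > previous_pct

def above_previous_by_numbers(current_count: int, previous_count: int) -> bool:
    return current_count > previous_count

def above_threshold_by_percents(current_pct: float,
                                threshold: float = THRESHOLD_PERCENT) -> bool:
    return current_pct > threshold

PERIOD_HOURS = 30 * 24  # each period is 30 days long

# Example: high CPU usage for 54 hours in the current period, 18 in the previous.
current_pct = percent_of_period(54, PERIOD_HOURS)   # 7.5
previous_pct = percent_of_period(18, PERIOD_HOURS)  # 2.5

print(above_previous_by_percents(current_pct, previous_pct))  # True
print(above_threshold_by_percents(current_pct))               # True
print(above_previous_by_numbers(1, 3))                        # False
```

In this example the high CPU usage would trigger both percent-based analyses, while one out-of-memory restart in the current period against three in the previous period would not trigger the count-based one.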

Below are examples of suggestions produced by each analysis:

Cluster Health: An example of the `Current period above previous period by percents` analysis

Cluster Health: An example of the `Current period above previous period by numbers` analysis

Cluster Health: An example of the `Current period above threshold by percents` analysis