Cluster Observer



Operation flow

  • To maintain the Replication Factor, every newly elected Leader starts measuring the health of each node by creating dedicated maintenance TCP connections to all other nodes in the cluster.

  • Each node reports the current status of all its databases at intervals of 500 milliseconds (by default).
    The Cluster Observer consumes those reports every 1000 milliseconds (by default).

  • Upon a node failure, the Dynamic Database Distribution sequence will take place in order to ensure that the Replication Factor does not change.

    For example:

    • Let us assume a five-node cluster with servers A, B, C, D, E.
      We create a database with a replication factor of 3 and define an ETL task.

    • The newly created database will be distributed automatically to three of the cluster nodes.
      Let's assume it is distributed to B, C, and E (so the database group is [B,C,E]),
      and the cluster decides that node C is responsible for performing the ETL task.

    • If node C goes offline or becomes unreachable, the Cluster Observer detects the issue and first responds as follows:

      • After the duration specified in the Cluster.TimeBeforeMovingToRehabInSec configuration,
        the observer moves node C to rehab mode, allowing time for recovery.
      • The ETL task fails over to another available node in the Database Group.

    • If node C remains offline beyond the period specified in the Cluster.TimeBeforeAddingReplicaInSec configuration,
      the observer, as a last resort, begins replicating the database to another cluster node that is not yet in the Database Group (A or D in this example).

    Note:

    • The Cluster Observer stores its information in memory, so when the Leader loses leadership,
      the collected reports of the Cluster Observer and its decision log are lost.
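
The failure-handling sequence in the example above can be sketched in a few lines of Python. This is a simplified illustration, not the server's actual implementation: the names ObserverSettings, decide, and pick_replacement are hypothetical, the threshold values are illustrative rather than documented defaults, and the sketch assumes the replacement is chosen from cluster nodes that do not yet hold the database.

```python
from dataclasses import dataclass


@dataclass
class ObserverSettings:
    # Hypothetical container for the two thresholds, which in a real server
    # come from Cluster.TimeBeforeMovingToRehabInSec and
    # Cluster.TimeBeforeAddingReplicaInSec.
    time_before_moving_to_rehab_sec: int
    time_before_adding_replica_sec: int


def decide(seconds_unreachable, settings):
    """Return the action a simplified observer would take for one node."""
    if seconds_unreachable >= settings.time_before_adding_replica_sec:
        return "add_replica"      # last resort: replicate to another node
    if seconds_unreachable >= settings.time_before_moving_to_rehab_sec:
        return "move_to_rehab"    # give the node time to recover
    return "wait"                 # still within the grace period


def pick_replacement(cluster_nodes, database_group):
    """Pick the first cluster node that does not already hold the database."""
    candidates = [node for node in cluster_nodes if node not in database_group]
    return candidates[0] if candidates else None


# Illustrative thresholds (assumed values, in seconds).
settings = ObserverSettings(
    time_before_moving_to_rehab_sec=60,
    time_before_adding_replica_sec=900,
)

cluster = ["A", "B", "C", "D", "E"]
group = ["B", "C", "E"]

print(decide(10, settings))              # wait
print(decide(120, settings))             # move_to_rehab
print(decide(1000, settings))            # add_replica
print(pick_replacement(cluster, group))  # A
```

Because a real observer keeps this state in memory on the Leader, a newly elected Leader would start counting from its own first reports rather than continue the previous Leader's timers.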

Interacting with the Cluster Observer

You can interact with the Cluster Observer using the following REST API calls:

| URL | Method | Query Params | Description |
|---|---|---|---|
| /admin/cluster/observer/suspend | POST | value=[bool] | Setting false will suspend the Cluster Observer operation for the current Leader term. |
| /admin/cluster/observer/decisions | GET | | Fetch the log of the recent decisions made by the Cluster Observer. |
| /admin/cluster/maintenance-stats | GET | | Fetch the latest reports of the Cluster Observer. |
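
As a rough illustration, the endpoints above could be called from Python using only the standard library. The server address http://localhost:8080 and the helper names observer_url and call are assumptions made for this sketch; a secured server would also require client-certificate authentication, which is not shown here.

```python
import urllib.parse
import urllib.request


def observer_url(server_url, path, **params):
    """Build the full URL for an endpoint, appending any query parameters."""
    url = server_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def call(server_url, path, method="GET", **params):
    """Send the request and return the response body as text."""
    request = urllib.request.Request(
        observer_url(server_url, path, **params), method=method
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example usage against a hypothetical unsecured server
# (left commented out so the block does not require a running server):
#
# server = "http://localhost:8080"
# call(server, "/admin/cluster/observer/suspend", method="POST", value="false")
# print(call(server, "/admin/cluster/observer/decisions"))
# print(call(server, "/admin/cluster/maintenance-stats"))
```

The boolean is passed as a lowercase string so that the query string reads value=false or value=true exactly as written.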