# Monitor the cluster
The absence of a centralized node in the cluster enhances resilience and scalability. Each node operates independently, allowing for a distributed approach to data management and processing. This design eliminates a single point of failure, ensuring that the failure of one node does not compromise the entire system.
Each node maintains a unique view of the cluster, enabling autonomous data processing and request handling. This independence provides greater flexibility and performance, as nodes operate in parallel without waiting for a central authority to coordinate actions.
To identify the source of issues, administrators monitor each node independently. This approach offers a comprehensive view of the cluster’s health and performance, facilitating more effective troubleshooting and optimization.
## Manual cluster monitoring with myq-tools

Manual cluster monitoring can be performed using myq-tools. Currently, this toolset includes a single utility known as `myq_status`, but additional tools may be added in the future.

The `myq_status` utility offers iostat-like views of MySQL `SHOW GLOBAL STATUS` variables, providing insights into the performance and status of the MySQL environment.
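For example, an invocation similar to the following shows a continuously refreshing view of the wsrep status counters. The `wsrep` view name is an assumption and may differ between myq-tools versions; check the views your installed version provides:

```
# Display an iostat-like, refreshing view of wsrep status counters.
# The "wsrep" view name is assumed; consult the myq-tools documentation
# for the views available in your version.
$ myq_status wsrep
```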
Monitor node status¶
Variable | What it indicates | Example values | Typical cause when not OK | Troubleshooting tips |
---|---|---|---|---|
wsrep_ready |
Node readiness to handle queries | ON , OFF |
SST in progress, Donor state, Desynced, local issues (disk, memory) |
* Check wsrep_local_state_comment (should be Synced )* Wait for SST/IST to finish * Review logs and system resource usage |
wsrep_local_state_comment |
Node’s current role/state | Joining , Donor , Synced , etc. |
Node is syncing or recovering |
* Monitor SST/IST progress * Wait until state changes to Synced * Avoid restarting other nodes during SST |
wsrep_local_cert_failures |
Number of failed certifications | 0, 1, 5, ... | Conflicting transactions or replication failures |
* Investigate frequent conflicts * Review queries or retry failed operations |
wsrep_local_bf_aborts |
Transactions aborted due to brute-force conflict resolution | 0, 10, 100, ... | High contention on hot rows or large transactions |
* Identify conflicting queries * Break up large writes * Reduce contention through schema or logic changes |
### wsrep_ready

The `wsrep_ready` variable indicates whether the node is ready to accept write operations. If this value is not `ON`, the node is not prepared to handle write requests, which may affect the overall functionality of the cluster.
```
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_ready';
```
Expected output

```{.text .no-copy}
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | ON    |
+---------------+-------+
```
### wsrep_local_state_comment

Check the node state by running `SHOW GLOBAL STATUS` for the `wsrep_local_state_comment` variable.
```
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
```
```{.text .no-copy}
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
```
Check for these results:

- `Joining` or `Donor`: The node is syncing with either SST or IST. This operation is expected during startup.
- `Disconnected` or `Initialized`: The node has not yet joined the cluster or has lost its connection.

If the node is syncing, wait until the operation finishes and the state changes to `Synced`.
If the node is stuck, do the following:

- Check the database error log (`/var/log/mysql/error.log`)
- Make sure there is enough disk space
- Confirm that the donor node is reachable and healthy

Check the network connectivity:

- Test ping and port access, especially TCP ports 4567, 4568, and 4444 (see the sketch after this list).
- Check the firewall rules and hostnames.

Restart the node if needed, and then monitor the log for progress.
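The following is a minimal sketch of such a connectivity check, run from the stuck node. The hostname `node2` is a placeholder for the donor or peer node's address:

```
# "node2" is a placeholder; replace it with the peer or donor node's hostname or IP.
ping -c 3 node2

# Galera ports: 4567 (group communication), 4568 (IST), 4444 (SST)
for port in 4567 4568 4444; do
    nc -z -v -w 3 node2 "$port"
done
```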
To monitor your Percona XtraDB Cluster effectively, set up triggers for the following checks in addition to the usual MySQL alerting.
## Check the cluster state of each node

| Variable | What it indicates | Example values | Typical cause when not OK | Troubleshooting tips |
|---|---|---|---|---|
| `wsrep_cluster_status` | Cluster-wide health | `Primary`, `Non-Primary` | Quorum loss, network partition, misconfiguration | Ensure a majority of nodes are running and reachable; check `wsrep_cluster_address` and firewall rules; bootstrap the cluster if needed |
| `wsrep_cluster_size` | Number of nodes currently in the cluster | 1, 2, 3, ... | Other nodes are offline or unreachable | Check node status across the cluster; restart missing nodes carefully |
| `wsrep_connected` | Whether the node is connected to the cluster | `ON`, `OFF` | Node cannot communicate with cluster peers | Confirm IP and port access (4567, 4568, 4444); verify network interfaces and the cluster address |
| `wsrep_flow_control_sent` | Times this node paused replication to slow the cluster | 0, 50, 1000, ... | The node’s receive queue is full; replication is lagging | Check disk and network speed; reduce long-running writes; avoid underpowered nodes in the cluster |
| `wsrep_flow_control_recv` | Times this node was paused by others via flow control | 0, 20, 300, ... | Other nodes are struggling to keep up, often due to resource issues | Look at cluster-wide disk/network performance; monitor whether a specific node is consistently causing throttling |
Connect to one of the nodes in the cluster using the MySQL client or another command-line tool.

Next, verify that `wsrep_cluster_status` equals `Primary`. This status indicates whether the cluster is stable. If the status is not `Primary`, the cluster may be experiencing issues, such as a split-brain scenario or insufficient nodes to form a quorum.
```
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
```
Expected output

```{.text .no-copy}
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+
```
### Verify node state

Check the `wsrep_connected` and `wsrep_ready` variables to ensure both equal `ON`.

The `wsrep_connected` variable indicates whether the node is connected to the cluster. If this value is not `ON`, the node is disconnected and cannot participate in cluster operations.
```
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_connected';
```
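On a healthy node, the expected output looks similar to the following:

```{.text .no-copy}
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| wsrep_connected | ON    |
+-----------------+-------+
```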
### Monitor replication conflicts

Identify excessive replication conflicts by monitoring the `wsrep_local_cert_failures` and `wsrep_local_bf_aborts` variables.

The `wsrep_local_cert_failures` variable tracks the number of certification failures during the replication process. Certification failures occur when a node attempts to apply a write operation that conflicts with another operation already applied to the cluster. A high number of certification failures can indicate frequent write conflicts, leading to performance issues and increased latency.
```
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
```
Expected output

```{.text .no-copy}
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| wsrep_local_cert_failures | 0     |
+---------------------------+-------+
```
The `wsrep_local_bf_aborts` variable tracks the number of aborts due to conflicts with write sets being processed. These conflicts typically happen when multiple nodes attempt to write to the same data simultaneously, resulting in conflicts that require one operation to be aborted.
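You can check this counter in the same way as the other status variables:

```
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';
```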
- You can identify excessive flow control messages by monitoring the `wsrep_flow_control_sent` and `wsrep_flow_control_recv` variables.
Flow control messages are signals used in the cluster to manage the flow of replication traffic between nodes. They help prevent a situation where a node becomes overwhelmed with incoming data, especially if it is lagging behind in processing transactions. By regulating the flow of data, these messages ensure that all nodes can keep up with the replication process without losing data integrity.
The `wsrep_flow_control_sent` variable counts the number of flow control messages sent by the node to manage replication traffic. Conversely, the `wsrep_flow_control_recv` variable tracks the number of flow control messages received by the node, indicating how often the node has to pause or slow down its processing to accommodate the flow control mechanism.

Monitoring these variables allows you to assess the frequency of flow control messages in the cluster. A high number of these messages may indicate performance issues, such as nodes struggling to keep up with replication, prompting you to investigate and optimize the cluster’s performance.
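For example, you can retrieve both counters (along with the related flow control variables) with a single pattern match:

```
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';
```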
- Large replication queues indicate a backlog of transactions waiting to be processed in a database cluster. You can identify these queues by monitoring the `wsrep_local_recv_queue` variable, as shown in the example after this list. When the replication queue grows significantly, it suggests that the system struggles to keep up with incoming changes, which can lead to delays in data synchronization across nodes. This situation may result in increased latency for read and write operations, potential data inconsistencies, and a negative impact on overall system performance. Addressing large replication queues is crucial for maintaining efficient database operations and ensuring timely data availability.
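For example, the current receive queue length can be checked directly:

```
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
```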
## Gather cluster metrics
Gathering cluster metrics for long-term analysis and visualization plays a crucial role in maintaining the health and performance of a database cluster. Consistent collection of specific performance data over time allows for the creation of graphs that facilitate monitoring and evaluation. Tracking these metrics enables the identification of trends, early detection of issues, and informed decision-making to optimize overall cluster performance. The following list outlines essential metrics that should be collected to ensure effective monitoring and analysis.
- Queue sizes: The `wsrep_local_recv_queue` and `wsrep_local_send_queue` variables provide insights into the sizes of the local receive and send queues in a database cluster.

    - The `wsrep_local_recv_queue` tracks the number of transactions waiting to be processed by the local node. A large receive queue may indicate that the node struggles to keep up with incoming replication traffic, potentially leading to delays in data synchronization and increased latency for read and write operations.

    - The `wsrep_local_send_queue` monitors the number of transactions that the local node has sent to other nodes but have not yet been acknowledged. A large send queue can suggest that other nodes are unable to process incoming changes quickly enough, which may also contribute to replication delays and affect overall system performance.

    Understanding these queue sizes is essential for diagnosing performance issues and ensuring efficient data replication across the cluster.
- Flow control metrics: The `wsrep_flow_control_sent` and `wsrep_flow_control_recv` variables provide important information about the flow control mechanism in a database cluster.

    - The `wsrep_flow_control_sent` variable indicates the number of flow control messages sent by the local node to other nodes. Flow control messages help manage the rate of data replication, ensuring that nodes do not become overwhelmed with incoming transactions. A high number of sent flow control messages may suggest that the local node frequently needs to slow down the replication process to maintain stability.

    - The `wsrep_flow_control_recv` variable tracks the number of flow control messages received by the local node from other nodes. This metric reflects how often the local node must pause or slow down its operations due to requests from other nodes. A high count of received flow control messages can indicate that the local node is experiencing pressure from its peers, which may lead to delays in processing transactions.

    Monitoring these flow control metrics is essential for understanding the dynamics of data replication within the cluster and for identifying potential performance bottlenecks.
- Replication metrics: The `wsrep_replicated` and `wsrep_received` variables provide critical insights into the replication process within a database cluster.

    - The `wsrep_replicated` variable indicates the total number of transactions that the local node has successfully replicated to other nodes in the cluster. This metric reflects the effectiveness of the replication process and helps assess how much data has been shared across the cluster. A high value for `wsrep_replicated` suggests that the node actively participates in the replication process and contributes to data consistency across all nodes.

    - The `wsrep_received` variable tracks the total number of transactions that the local node has received from other nodes. This metric shows how many transactions have been sent to the local node for processing. A high value for `wsrep_received` indicates that the node is receiving a significant amount of data from its peers, which can impact its performance if the incoming transaction rate exceeds its processing capacity.

    Understanding these replication metrics is essential for evaluating the health and efficiency of the replication process in the cluster. Monitoring both `wsrep_replicated` and `wsrep_received` helps identify potential issues related to data synchronization and performance bottlenecks.
    The `wsrep_replicated_bytes` and `wsrep_received_bytes` variables provide important information about the volume of data involved in the replication process within a database cluster.

    - The `wsrep_replicated_bytes` variable indicates the total number of bytes that the local node has successfully replicated to other nodes in the cluster. This metric reflects the amount of data shared across the cluster and helps assess the efficiency of the replication process. A high value for `wsrep_replicated_bytes` suggests that the node actively participates in data replication, contributing to overall data consistency.

    - The `wsrep_received_bytes` variable tracks the total number of bytes that the local node has received from other nodes. This metric shows the volume of data sent to the local node for processing. A high value for `wsrep_received_bytes` indicates that the node is receiving a significant amount of data from its peers, which can affect performance if the incoming data rate exceeds the node’s processing capacity.

    Understanding these replication metrics is essential for evaluating the health and efficiency of the replication process in the cluster. Monitoring both `wsrep_replicated_bytes` and `wsrep_received_bytes` helps identify potential issues related to data synchronization and performance bottlenecks.
## Use Percona Monitoring and Management

Percona Monitoring and Management includes two dashboards to monitor PXC:

- PXC/Galera Cluster Overview
- PXC/Galera Graphs

These dashboards are available from the menu. Refer to the official Percona Monitoring and Management documentation for details on installation and setup.