What do the group-management-service element's attributes in GlassFish v2's domain.xml mean?

This FAQ entry explains the meaning and implications of changing configuration values for the group-management-service element in domain.xml.

The group-management-service element provides attributes whose values determine the health monitoring and discovery protocol behavior in Shoal GMS.

The default values in the GlassFish domain.xml were arrived at through our testing, both functional and under system load, with an 8-instance cluster. We have also received feedback from customers that, with these values, a large GMS group in Sailfin worked without issue and reported group membership event notifications correctly, so these domain.xml values are a good default for clusters with a large number of instances.

The following attributes are present in the <group-management-service> element, along with their default values:

  • cluster-name-config.group-management-service.fd-protocol-max-tries = 3
  • cluster-name-config.group-management-service.fd-protocol-timeout-in-millis = 2000
  • cluster-name-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
  • cluster-name-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
  • cluster-name-config.group-management-service.ping-protocol-timeout-in-millis = 5000
  • cluster-name-config.group-management-service.vs-protocol-timeout-in-millis = 1500
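
In domain.xml itself, these settings appear as attributes on the <group-management-service> element under the cluster's <config> element. A sketch of how the element looks with the default values (attribute order and the surrounding contents of the <config> element may differ in your installation):

  <config name="cluster-name-config">
    ...
    <group-management-service fd-protocol-max-tries="3"
                              fd-protocol-timeout-in-millis="2000"
                              merge-protocol-max-interval-in-millis="10000"
                              merge-protocol-min-interval-in-millis="5000"
                              ping-protocol-timeout-in-millis="5000"
                              vs-protocol-timeout-in-millis="1500"/>
    ...
  </config>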

fd stands for Failure Detection

fd-protocol-max-tries is the maximum number of missed heartbeats that the GMS service provider's HealthMonitor waits for before marking an instance as suspected to have failed. In addition to the missed-heartbeat count, GMS also attempts a peer-to-peer connection with the suspected member; only if that also fails is the member marked as suspected failed.

fd-protocol-timeout-in-millis is the interval, in milliseconds, at which each instance sends out a heartbeat message announcing its Alive state. As a result, it is also the interval that the max-tries logic in the GMS service provider's Master Node waits between missed heartbeats before counting another one as missed.

Lowering fd-protocol-max-tries means a failure suspicion determination is reached after fewer missed heartbeats, and raising it means more missed heartbeats are required. More on the consequences of different settings appears below in the Impact of Changing Values section.
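
As a rough illustration with the default values (ignoring the peer-to-peer connection check, which adds some time on top):

  time-to-suspicion ≈ fd-protocol-max-tries x fd-protocol-timeout-in-millis
                    = 3 x 2000 ms
                    = 6000 ms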

merge-protocol-max-interval-in-millis and merge-protocol-min-interval-in-millis are no-op attributes that have no effect on GMS behavior. They remained in the v2 release due to an oversight. In the upcoming v2.1.1 release, we plan to deprecate or remove these attributes and introduce more meaningful, descriptive attribute names.

ping-protocol-timeout-in-millis is the initial discovery timeout. This is the amount of time an instance's GMS module will wait during instance startup (on a background thread, so that appserver startup does not block on the timeout) to discover the master member of the group; this is called the master node discovery protocol in GMS. The instance's GMS module sends out a master node query to the multicast group address and waits until a response is received or the timeout occurs. If the wait times out, i.e., the instance does not receive a master node response from another instance within this time, indicating the absence of a master, then it assumes the master role itself and sends out a master node announcement to the group. This instance subsequently responds to all future master node query messages from other members with a master node response. In the appserver, since the DAS joins a cluster as soon as the cluster is created, the DAS becomes the master member of the group ahead of time, allowing cluster members to discover the master quickly without having to wait out the timeout. More below on the impact of changing this setting.
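
A rough sketch of this startup discovery sequence:

  instance starts
    -> send master node query to the multicast group address
    -> response arrives within ping-protocol-timeout-in-millis (default 5000 ms)?
         yes: join the group under the existing master (typically the DAS)
         no:  assume the master role and send a master node announcement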

vs-protocol-timeout-in-millis is the Verify Suspect protocol's timeout, used by the HealthMonitor. Once a member is marked suspect based on missed heartbeats and a failed peer-to-peer connection check, the verify suspect protocol kicks in: it waits for the specified timeout, checking whether any further health state message arrives in that time and whether a peer-to-peer connection can be made with the suspect member. If neither succeeds (i.e., no health state update is received and the connection attempt fails), the suspected member is marked as confirmed failed and a failure notification is sent out.
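
Putting the stages together, an approximate timeline with the default values (the peer-to-peer connection attempts add further time on top, especially when they block on an unreachable host; see the Note below):

  ~6000 ms suspicion window (3 missed heartbeats at 2000 ms intervals)
    + ~1500 ms verify suspect wait (vs-protocol-timeout-in-millis)
    ≈ 7500 ms from the first missed heartbeat to the failure notification,
      plus the time spent on peer-to-peer connection attempts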

Impact of Changing Values

Failure Detection values

Setting fd-protocol-timeout-in-millis lower than the default results in each member sending heartbeat messages more frequently. A higher timeout value results in fewer heartbeats in the system, since the interval between heartbeats is longer.
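
As a rough illustration with the 8-instance cluster mentioned earlier:

  8 members, one heartbeat each per 2000 ms ≈ 4 heartbeats/second group-wide
  8 members, one heartbeat each per 1000 ms ≈ 8 heartbeats/second group-wide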

The mileage gained from tuning these values varies depending on how quickly and how reliably the deployment environment needs failures to be detected.

Setting fd-protocol-timeout-in-millis (and/or fd-protocol-max-tries) lower or higher has impacts you should consider (see the illustrative snippet after this list):

  • A lower timeout value results in more heartbeat messages periodically going out on the network than the system may actually need to perform its failure detection protocols effectively.
  • A lower timeout value (and/or fewer tries) could result in false positives, such as declaring a member failed when in fact its heartbeat simply did not arrive in time due to network load from other parts of the appserver.
  • With a higher timeout, failure detection takes a bit longer, with the added possibility that the failed member restarts while detection is still in progress, producing a new join notification without a preceding failure notification (since failure detection and determination had not completed). Such a join notification without a preceding failure notification is logged (thanks to Joe Fialli's work on adding more diagnostics).
  • Changing both the number of tries and the timeout has consequences that can be extrapolated from the above.
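
As an illustration only (these are not recommended values), a configuration biased toward faster detection, at the cost of the higher false-positive risk noted above, might lower both attributes; suspicion would then be reached after about 2 x 1500 ms = 3000 ms of silence:

  <group-management-service fd-protocol-max-tries="2"
                            fd-protocol-timeout-in-millis="1500"
                            merge-protocol-max-interval-in-millis="10000"
                            merge-protocol-min-interval-in-millis="5000"
                            ping-protocol-timeout-in-millis="5000"
                            vs-protocol-timeout-in-millis="1500"/>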

Improvement in Failure Detection protocols for GlassFish v2.1.1 (Sailfin 2.0)

In the upcoming GlassFish v2.1.1/Sailfin 2.0, Shoal adds the capability of Watchdog-type members (typical members are of Spectator or Core type), allowing processes such as the GlassFish Node Agent to participate in a cluster's GMS group. This helps detect and report process failures quickly without changing the timeout and tries values of the heartbeat-based failure detection system: the Node Agent detects a process failure faster than the heartbeat-based system can, and with the Watchdog facility it can report the failure to GMS as soon as it occurs. The heartbeat-based system then becomes a secondary fallback for process failures, while remaining the primary means of failure detection for hardware or network failures, where reachability to both the Node Agent and the instances involved is lost. The heartbeat-based system thus remains a significant part of the health monitoring functionality.

The tries, missed-heartbeat intervals, peer-to-peer connection based failure detection, watchdog-based failure reporting, and verify suspect protocols are all needed to ensure that failure detection is robust and reliable in GlassFish/Sailfin. Most of these protocols (except the watchdog protocol) are standard in many group communication solutions such as JGroups, Coherence, GridGain, GigaSpaces, etc., so our goal is to have parity with those solutions; the additional watchdog capability further augments the failure detection functionality.

Note

For hardware and/or network failures, GMS uses a default of 10 seconds to time out a blocking TCP connection attempt that would otherwise wait for the system-specific TCP retransmission timeout (typically 3 to 5 minutes). Combined with the heartbeat-based system, this effectively means a hardware or network failure is detected in approximately under 30 seconds.
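
One plausible breakdown of that figure with the default values, assuming one blocking TCP connection attempt in each of the suspicion and verification phases (an approximation for illustration, not an exact trace of the implementation):

  ~6000 ms   3 missed heartbeats at 2000 ms intervals
  ~10000 ms  blocked TCP connection attempt while checking the suspect
  ~1500 ms   Verify Suspect protocol timeout
  ~10000 ms  blocked TCP connection attempt during verification
  ---------
  ~27500 ms  total, i.e., approximately under 30 seconds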

Discovery related values

Setting the ping timeout lower causes a member to time out too quickly while discovering the master node; as a result, the group may end up with multiple masters, causing the master collision and resolution protocol to kick in. In that protocol, the competing masters tell each other who the true master candidate is, based on the sorted order of memberships by their UUIDs, but it can be messaging-intensive when there are a large number of masters in the group. Hence, the ping timeout should ideally be set to the default or higher, never lower.
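
For example, on a network where instances start slowly or discovery messages can be delayed, one might raise only the discovery timeout (an illustrative value, not a recommendation):

  <group-management-service fd-protocol-max-tries="3"
                            fd-protocol-timeout-in-millis="2000"
                            merge-protocol-max-interval-in-millis="10000"
                            merge-protocol-min-interval-in-millis="5000"
                            ping-protocol-timeout-in-millis="8000"
                            vs-protocol-timeout-in-millis="1500"/>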

For any further questions, please send email to users at shoal dot dev dot java dot net