If a machine has multiple network interfaces, how does one configure Shoal GMS to only use one network interface?

By default, Shoal GMS uses all available working network interfaces on a machine. To limit Shoal GMS
to a single network interface on a multihomed machine, one must configure the cluster as described below.

Here is a shell script that sets the cluster element's gms-bind-interface-address property using the asadmin command line.
Note that the shell script is not parameterized, but one can change the variables at the top to configure
it for one's own purposes. This script works on the "devtest-cluster" that is created by running "ant setup"
in glassfish/appserv-tests/devtest/replication.

#!/bin/sh -x

ASADMIN=${AS_HOME}/bin/asadmin
DAS=devtest-cluster-domain
DAS_GMS_BIND_ADDRESS=129.148.71.176
CLUSTER=devtest-cluster
DASCONFIG=server-config
INSTANCE1=instance1
INSTANCE1_ADDRESS=129.148.71.176
INSTANCE2=instance2
INSTANCE3=instance3
PORT="--port 4845"
USER="--user admin"

${ASADMIN} start-domain ${DAS}
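# The \$ is escaped here so that the literal token ${GMS_<cluster>_BIND_ADDRESS} is stored in domain.xml.
${ASADMIN} set ${PORT} ${USER} ${CLUSTER}.property.gms-bind-interface-address=\${GMS_${CLUSTER}_BIND_ADDRESS}
# The addresses below are shell variables, so they are NOT escaped and expand to the IP addresses set above.
${ASADMIN} set ${PORT} ${USER} ${DASCONFIG}.system-property.GMS_${CLUSTER}_BIND_ADDRESS=${DAS_GMS_BIND_ADDRESS}
${ASADMIN} set ${PORT} ${USER} ${INSTANCE1}.system-property.GMS_${CLUSTER}_BIND_ADDRESS=${INSTANCE1_ADDRESS}
# instance2 and instance3 reuse INSTANCE1_ADDRESS; use each instance's own address if it runs on another machine.
${ASADMIN} set ${PORT} ${USER} ${INSTANCE2}.system-property.GMS_${CLUSTER}_BIND_ADDRESS=${INSTANCE1_ADDRESS}
${ASADMIN} set ${PORT} ${USER} ${INSTANCE3}.system-property.GMS_${CLUSTER}_BIND_ADDRESS=${INSTANCE1_ADDRESS}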
# following is needed for sailfin communication application server if "default-cluster" is not deleted.
# ${ASADMIN} set ${PORT} ${USER} default-cluster.property.gms-bind-interface-address=\${GMS_${CLUSTER}_BIND_ADDRESS}
${ASADMIN} stop-domain ${DAS}
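
Since the script above is not parameterized, here is a minimal sketch of how the per-server settings could be generated from a list of target=address pairs. It only prints the asadmin commands (a dry run) and leaves out the --port and --user options. The function names bind_command and print_bind_commands and the target=address argument format are assumptions made for illustration, not part of asadmin or Shoal.

```shell
#!/bin/sh
# Dry-run sketch: prints the asadmin "set" commands instead of running them.
# bind_command and print_bind_commands are hypothetical helper names.

CLUSTER=devtest-cluster
PROP="GMS_${CLUSTER}_BIND_ADDRESS"

# Print the command that binds one target to one address.
# $1 = target (server or config name), $2 = IP address to bind to
bind_command() {
    printf 'asadmin set %s.system-property.%s=%s\n' "$1" "$PROP" "$2"
}

# Each argument is a target=address pair (an assumed format for this sketch).
print_bind_commands() {
    for pair in "$@"; do
        bind_command "${pair%%=*}" "${pair#*=}"
    done
}

print_bind_commands server-config=129.148.71.176 instance1=129.148.71.176
```

Once the printed commands look right, they can be piped to sh or pasted into the script above.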

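Here is what it will look like in domain.xml. Note that the fragments below use the token name
GMS_DEVTEST-CLUSTER_BIND_INTERFACE_ADDR rather than the name generated by the script above; whichever name is used,
the cluster property and the system-property entries must use the same one.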

In domain.xml of DAS for a cluster named "devtest-cluster", one sets the following.

<cluster config-ref="devtest-cluster-config" heartbeat-address="228.8.20.94"
         heartbeat-enabled="true" heartbeat-port="17227" name="devtest-cluster">
  <!--  ... deleted unrelated info ...-->
  <property name="gms-bind-interface-address"
            value="${GMS_DEVTEST-CLUSTER_BIND_INTERFACE_ADDR}"/>
</cluster>

Each server, and optionally the DAS, can either set the bind interface address
explicitly so that only that one is used, or leave it unset, in which case group
management services uses the default of all public network interface addresses.
To restrict a server to a single network interface address, one explicitly adds
the system-property "GMS_DEVTEST-CLUSTER_BIND_INTERFACE_ADDR" to that server.

For example, to set it in the DAS and in instance1 of devtest-cluster, one adds the
system-property to each server configuration element as follows.

<server config-ref="server-config" lb-weight="100" name="server">
  <!-- deleted non-essential info to this issue -->
  <system-property name="GMS_DEVTEST-CLUSTER_BIND_INTERFACE_ADDR"
                   value="129.148.71.168"/>
</server>
<server config-ref="devtest-cluster-config" lb-weight="100" name="instance1"
        node-agent-ref="devtest-agent">
  <!-- deleted non-essential info to this issue -->
  <system-property name="GMS_DEVTEST-CLUSTER_BIND_INTERFACE_ADDR"
                   value="129.148.71.169"/>
</server>

If the system property is not set, the group management service falls back to
its default handling of network interface addresses, which is to use all of them.

After the above steps have been taken, all GlassFish processes (the domain server (DAS), the NodeAgent, and the clustered instances)
should be stopped and restarted to pick up these changes. The NodeAgent was recently modified to join the Shoal group as a WATCHDOG,
so it also needs to be restarted to pick up this change.

To ensure that the DAS never uses default binding, every cluster defined in the DAS domain.xml must have the gms-bind-interface-address property set.
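
One way to check this requirement is to scan domain.xml for cluster elements that lack the property. The sketch below is a rough, line-oriented check written with awk, not a real XML parser: it assumes each cluster's name attribute appears in its opening tag before any property element, as in the fragment shown earlier. The function name clusters_missing_bind_address is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every <cluster> in the given
# domain.xml that has no gms-bind-interface-address property.
# Line-oriented sketch; it does not parse XML properly.
clusters_missing_bind_address() {
    awk '
        # Start of a <cluster ...> element (but not <clusters>).
        /<cluster([ \t>]|$)/ { in_cluster = 1; name = ""; found = 0 }

        # First name="..." attribute seen inside the element is the cluster name.
        in_cluster && name == "" && match($0, /(^|[ \t])name="[^"]*"/) {
            name = substr($0, RSTART, RLENGTH)
            sub(/^[ \t]*name="/, "", name)
            sub(/"$/, "", name)
        }

        # Remember whether the bind-address property is present.
        in_cluster && /name="gms-bind-interface-address"/ { found = 1 }

        # End of the element: report it if the property was never seen.
        /<\/cluster>/ {
            if (in_cluster && !found) print name
            in_cluster = 0
        }
    ' "$1"
}
```

Running clusters_missing_bind_address against the DAS domain.xml prints one cluster name per line; empty output means every cluster has the property set.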


Configuring GMS Failure Detection in Application Server

__For an introduction to GMS Failure Detection and other configuration settings, please see the following FAQ entry:
http://wiki.glassfish.java.net/Wiki.jsp?page=FaqShoalGMSAttributesInDomainXML__

To decrease the time it takes for GMS to detect hardware/network failure of a server instance within a cluster,
one needs to decrease the TCP socket timeout used when trying to access that machine.

The default value is currently 10 seconds. With the current defaults (as of January 2009), GMS takes approximately 28 seconds in total
to detect that a server instance has failed because of a hardware failure or a pulled network cable.

The configuration below sets the value to 3000 ms, or 3 seconds.
This value is not recommended, but it is necessary in order to detect a server instance failing due to a hardware or network failure
within 15 seconds. For software failures, GMS detects that a server instance has failed in around 8-9 seconds. The smaller the timeout value,
the greater the chance that GMS detects false failures, where the instance has not failed but merely did not respond within the
short window of time.

Configuration changes in domain.xml

In domain.xml, this is achieved by adding the property failure-detection-tcp-retransmit-timeout to the group-management-service
element in the configuration of cluster devtest-cluster.

<config dynamic-reconfiguration-enabled="true" name="devtest-cluster-config">
  ...
  <group-management-service fd-protocol-max-tries="3" fd-protocol-timeout-in-millis="2000" 
                            merge-protocol-max-interval-in-millis="10000" 
                            merge-protocol-min-interval-in-millis="5000" 
                            ping-protocol-timeout-in-millis="5000" vs-protocol-timeout-in-millis="1500">
        <!-- The property below configures GMS so that when it attempts to connect to a suspected failed server instance,
             the TCP socket creation timeout is 3 seconds. This value is probably too small, but it was necessary
             to achieve the goal of detecting hardware failure within 15 seconds. The default value of 10 seconds detects hardware failure in 28 seconds.
        -->
        <property name="failure-detection-tcp-retransmit-timeout" value="3000"/>
      </group-management-service>

It is also necessary to change this value for the domain admin server (DAS), since it is the GMS Master Node.
Typically, changing the "group-management-service" element within "default-config" is sufficient to achieve this for the DAS.

GMS Failure Detection States

The GMS failure detection algorithm uses the group-management-service configuration parameters from domain.xml as follows.

  • Normal operation
    • Each server instance in the cluster sends out a heartbeat message every fd-protocol-timeout-in-millis ms. The default is 2000 ms, or 2 seconds, as shown above.
  • Possibly Suspect
    • The Master Node suspects a server instance has failed when fd-protocol-max-tries * fd-protocol-timeout-in-millis ms pass and the Master Node
      has not received a heartbeat message from that server instance. The default is 6 seconds.
  • Confirmed Suspect
    1. To avoid waiting a long time trying to contact, via TCP, a machine that might have failed or had its network plug pulled, GMS attempts a timed TCP operation with the suspected machine, with a timeout of failure-detection-tcp-retransmit-timeout ms. If this step times out, proceed to Failure Validation.
    2. If the previous test concludes the machine is up, GMS uses a JXTA method that pings the TCP connection of the server instance to verify that the server instance is still processing messages. If the ping succeeds, the server instance is no longer suspect; go to Normal operation. If the ping fails, proceed to Failure Validation.
  • Failure Validation
    • Wait vs-protocol-timeout-in-millis ms, then check whether a heartbeat has arrived from the suspected failed server instance. If a heartbeat has arrived, go to Normal operation; otherwise go to the next step.
    • Repeat Confirmed Suspect steps 1 and 2. If there is still no proof that the server instance is working, the GMS Master Node sends a FAILURE notification for the failed server instance to the cluster.
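
The states above suggest a rough way to estimate the hardware-failure detection time: the heartbeat window, one timed TCP attempt, the validation wait, and a second timed TCP attempt. The sketch below adds these up under the assumption that both TCP attempts run to their full timeout. It is a back-of-the-envelope reading of the states listed here, not the actual GMS implementation, and the function name estimate_detection_ms is made up for illustration.

```shell
#!/bin/sh
# Rough worst-case estimate (in ms) of hardware-failure detection time.
# Assumes both timed TCP attempts run to their full timeout.
# $1 = fd-protocol-max-tries
# $2 = fd-protocol-timeout-in-millis
# $3 = failure-detection-tcp-retransmit-timeout
# $4 = vs-protocol-timeout-in-millis
estimate_detection_ms() {
    suspect=$(( $1 * $2 ))     # Possibly Suspect: missed heartbeats
    confirm=$3                 # Confirmed Suspect: first timed TCP attempt
    validate=$(( $4 + $3 ))    # Failure Validation: wait, then second TCP attempt
    echo $(( suspect + confirm + validate ))
}

estimate_detection_ms 3 2000 10000 1500   # default 10-second TCP timeout
estimate_detection_ms 3 2000 3000 1500    # 3-second TCP timeout
```

With the values from the configuration above, this prints 27500 and 13500, which is in line with the figures of approximately 28 seconds and within 15 seconds quoted earlier.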

