GlassFish 3.1 Next GMS Non-Multicast Design

Authors: Joe Fialli, Bobby Bissett

Date: 07/27/2011: No longer targeting GlassFish 3.2; now just calling it GlassFish Next.
Updated to reflect that VIRTUAL_MULTICAST_URI_LIST is deprecated (applies to GMS over JXTA only) and replaced with DISCOVERY_URI_LIST.

Date: 04/15/2011

1.0 Overview:

This document describes changes to support GMS in an environment without UDP multicast enabled between all clustered instances. It also discusses optimizations that enable scaling the maximum number of instances in a cluster when running in this mode. Lastly, it identifies changes that help decentralize GMS operations away from the GMS Master in preparation for the self-configuring cluster mode of running without a DAS. The DAS is isolated from application load on the cluster, so it was beneficial to place much of the GMS system-level
processing on it; that processing was not impacted by high application load on the CORE clustered instances.

1.1 Terminology

Shoal GMS is implemented as a GlassFish subproject and has terminology that is more generic than its usage within GlassFish.

This section provides a mapping from GlassFish clustering terms to Shoal GMS terms. When discussing GMS design within this document, the GMS terms, not the GlassFish instantiations of GMS concepts, are used. This section is intended to assist GlassFish reviewers in relating the discussion to GlassFish.

A GlassFish cluster maps to a Shoal GMS group.
A GlassFish clustered instance maps to a Shoal GMS group member.
The GlassFish DAS is a member of all GMS-enabled GlassFish clusters listed in its domain.xml configuration.
The GlassFish DAS is a GMS Spectator, an observer of the GMS group. A GMS Spectator does not
provide an HA replication store and is not considered a place to replicate session data within a GlassFish cluster.
The GlassFish DAS is typically the GMS Master, but since the GMS Master cannot be a single point of failure,
another GlassFish clustered instance assumes the GMS Master role if the DAS leaves the GMS group.
The GMS Master provides centralized processing so that all GMS members have the same GMS member view (the GlassFish clustered instances).

2.0 Design Notes

2.1. GMS Configuration Changes in GlassFish 3.1 Next domain.xml

See GF 3.1 Next GMS Configuration in domain.xml

The multicast sender interface is implemented by:

1. com.sun.enterprise.mgmt.transport.VirtualMulticastSender when the GlassFish cluster/group-management-service property
GMS_DISCOVERY_URI_LIST is set.

2. com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender when a multicast address and port are set
(and DISCOVERY_URI_LIST is not set).

3. The hybrid scenario, where a multicast address and port are set AND DISCOVERY_URI_LIST
is set, is not supported. (This is a proposed implementation simplification for the time being; see the GMS requirements document,
non-requirements GMS-1.1.3 and GMS-1.1.5.) The implementation rationale for not supporting the hybrid scenario is
that work would be needed to ensure that a given message sent both over UDP multicast and
individually to each instance listed in DISCOVERY_URI_LIST is not delivered in duplicate.
There is currently no mechanism to ensure that a given GMS message is delivered at most once.
We observed duplicate message delivery when VirtualMulticastSender both delivered messages
over UDP multicast, when it was enabled, and unicast the same message to each GMS member.
A sketch of how configuration selects the sender implementation follows below.
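
The following is a minimal sketch of that selection logic, written as a self-contained illustration. The Properties keys and the selector class are hypothetical placeholders, not the actual GlassFish configuration API; only the two sender class names above come from the implementation.

    // Illustrative sketch only: choose a multicast sender strategy based on which
    // configuration values are present. The property keys and their handling are
    // hypothetical placeholders, not the real GlassFish configuration API.
    import java.util.Properties;

    public class MulticastSenderSelector {

        enum SenderKind { VIRTUAL_MULTICAST, UDP_MULTICAST, UNSUPPORTED_HYBRID }

        static SenderKind select(Properties clusterProps) {
            boolean discoveryUriListSet =
                    clusterProps.getProperty("GMS_DISCOVERY_URI_LIST") != null;
            boolean udpMulticastSet =
                    clusterProps.getProperty("multicast-address") != null
                    && clusterProps.getProperty("multicast-port") != null;

            if (discoveryUriListSet && udpMulticastSet) {
                // The hybrid configuration is rejected: there is no duplicate-delivery
                // suppression, so a message could arrive twice (see item 3 above).
                return SenderKind.UNSUPPORTED_HYBRID;
            }
            if (discoveryUriListSet) {
                return SenderKind.VIRTUAL_MULTICAST;   // VirtualMulticastSender
            }
            return SenderKind.UDP_MULTICAST;           // BlockingIOMulticastSender
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("GMS_DISCOVERY_URI_LIST", "tcp://host1:9090,tcp://host2:9090");
            System.out.println(select(props));         // prints VIRTUAL_MULTICAST
        }
    }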

Outstanding need from the GlassFish 3.1 Next domain instance:
a globally unique ID to use as a namespace within which to evaluate the cluster name.
(One should be able to have two clusters with the same name when the clusters are created in the
context of different domains.) If this is not possible, then
see Dependencies GMS-1.2.2 and GMS-1.2.3 in the GMS Requirements document.

ASARCH review comments from May 10th:
1. Requested considering unifying the GMS TCP port with the asadmin port (need to consult with Alexey).
   We must consider that GMS lets the user enable SSL on the GMS TCP connection. Is it
   possible to do port unification with asadmin given the chance that SSL is enabled for GMS?
2. Another option is to have a RESTful service that provides ports for services. IIOP and GMS could use this.

3. File an RFE on enabling securing of the GMS heartbeat info sent over multicast. (next release)

2.1.1 GMS Non-Multicast Implementation

com.sun.enterprise.mgmt.transport.VirtualMulticastSender contains a list of all active members.

This list is maintained by adding a member that joins the group via
com.sun.enterprise.mgmt.transport.NetworkManager.addRemotePeer(PeerID)
and removing a member that leaves the group when the method
com.sun.enterprise.mgmt.transport.NetworkManager.removePeerID(PeerID)
is called.

The active member list is initialized with the values from DISCOVERY_URI_LIST (formerly VIRTUAL_MULTICAST_URI_LIST).
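
A minimal sketch of seeding that initial list from the comma-separated URI list is shown below. The PeerEndpoint record is a hypothetical stand-in for Shoal's PeerID, and the parsing shown is an assumption for illustration; the real code subsequently adds and removes peers through the NetworkManager methods above.

    // Illustrative sketch only: parse a comma-separated discovery URI list into an
    // initial set of peer endpoints. PeerEndpoint is a hypothetical stand-in for
    // Shoal's PeerID type.
    import java.net.URI;
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class DiscoveryUriList {

        record PeerEndpoint(String host, int port) { }

        static Set<PeerEndpoint> initialPeers(String discoveryUriList) {
            Set<PeerEndpoint> peers = new LinkedHashSet<>();
            for (String token : discoveryUriList.split(",")) {
                URI uri = URI.create(token.trim());        // e.g. tcp://host1:9090
                peers.add(new PeerEndpoint(uri.getHost(), uri.getPort()));
            }
            return peers;
        }

        public static void main(String[] args) {
            System.out.println(initialPeers("tcp://host1:9090, tcp://host2:9090"));
        }
    }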

VirtualMulticastSender.doBroadcast(Message msg) iterates over all active
members of the GMS group, sending the message to each member over unicast.
There is a proposal to make it configurable so that some messages can be
sent over lower-overhead UDP unicast rather than higher-overhead TCP unicast;
the GMS member heartbeat is an initial candidate for this. A sketch of the broadcast loop follows.
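
The following is a minimal sketch of the broadcast-over-unicast loop, assuming a hypothetical UnicastSender callback; it illustrates the approach rather than the actual VirtualMulticastSender code.

    // Illustrative sketch only: simulate a multicast broadcast by unicasting the same
    // message to every active member. UnicastSender and the byte[] message are
    // hypothetical simplified stand-ins for the Shoal transport types.
    import java.util.List;

    public class VirtualBroadcastSketch {

        interface UnicastSender {
            boolean send(String memberToken, byte[] message);
        }

        /** Returns true only if the message was accepted for every active member. */
        static boolean doBroadcast(List<String> activeMembers, byte[] message, UnicastSender sender) {
            boolean allSent = true;
            for (String member : activeMembers) {
                // One TCP (or, per the proposal above, UDP) unicast send per member.
                allSent &= sender.send(member, message);
            }
            return allSent;
        }

        public static void main(String[] args) {
            UnicastSender logger = (member, msg) -> {
                System.out.println("sending " + msg.length + " bytes to " + member);
                return true;
            };
            doBroadcast(List.of("instance101", "instance102", "instance103"),
                        "heartbeat".getBytes(), logger);
        }
    }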

2.2 HeartBeat Failure Detection

The current heartbeat failure detection implementation is optimized for
heartbeats sent via UDP multicast. The majority of the HealthMonitor
failure detection is centralized in the GMS Master; it is the only instance that monitors
all other instances' heartbeats.
It was desirable in GlassFish 2.x and GlassFish 3.1 for this centralized processing
to occur in the GMS Master, which typically is the DAS (domain administration server).
This isolated GMS health monitoring processing from the application load on the
cluster. The DAS was not a single point of failure as the default GMS Master:
if the DAS was stopped or failed, the other members in the cluster would identify this
and elect a successor GMS Master. (The GMS notification GroupLeadershipNotification
represents a change of GMS Master for the GMS group. A GMS group maps to a
GlassFish cluster.)

GlassFish 3.1 Next is targeted to run in configurations/environments where UDP multicast
is not available and the DAS may not be running (see ad hoc clustering). Thus, we will introduce
an alternative heartbeat failure detection algorithm that decentralizes some of the processing
and is optimized for unicast heartbeats.

2.2.1 GMS Heartbeat Failure Detection in GlassFish 3.1

The current heartbeat failure detection algorithm relies on UDP multicast transport.

  • Each cluster member broadcasts its heartbeat to all other members.
  • Each cluster member records heartbeats for all other members (timestamping the receive time).
  • Each cluster member cleans its connection caches to a member when it receives a health message of
    PLANNED_SHUTDOWN or FAILURE about that member.
  • Only the master member detects, validates, and notifies all other members of a suspected or failed instance.
  • All members in the cluster monitor the master's heartbeat. If an instance detects that the master has
    failed, it uses an algorithm shared by all members to compute the new master. Only the instance that
    is appointed new master by that algorithm sends notification to the other cluster members that
    it is the new Master (via GroupLeadershipNotification) and then that the former master has failed.

A health message is of type HEALTH_MESSAGE and is composed of
HealthMessage.Entry elements, each of which has a health state and a system advertisement.
(The GMS client start time in the system advertisement is instrumental in
detecting instances that have restarted faster than heartbeat failure
detection could detect that the previous instantiation of the instance
failed and was restarted.) When the health message is received,
HealthMonitor.receiveMessageEvent timestamps the local receive time on
the HealthMessage.Entry and places this entry in the HealthMonitor map
of clustered instance PeerID to last HealthMessage.Entry received.
The health message may be the first message received by an instance
from another instance, so an instance should send its first heartbeat
to all instances. There is dynamic contact information in the message
that is cached and used to send messages back to that instance; all
messages from that instance contain this contact info. The
heartbeat message is typically the first message received from a newly
joined member of the cluster, and it assists in filling the dynamic
routing cache with the info necessary to send GMS messages back to the
newly joined instance. A sketch of the receive-side bookkeeping follows.
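
The following is a minimal sketch of that receive-side bookkeeping, using simplified stand-ins for HealthMessage.Entry and PeerID; it illustrates the timestamping and the per-peer map rather than the actual HealthMonitor code.

    // Illustrative sketch only: on receipt of a health message entry, stamp the local
    // receive time and keep the latest entry per peer. HealthEntry and the String peer
    // ID are simplified stand-ins for HealthMessage.Entry and PeerID.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class HealthMonitorSketch {

        static final class HealthEntry {
            final String peerId;
            final String state;          // e.g. ALIVE, ALIVEANDREADY, PLANNED_SHUTDOWN
            final long memberStartTime;  // from the system advertisement; detects fast restarts
            volatile long localReceiveTime;

            HealthEntry(String peerId, String state, long memberStartTime) {
                this.peerId = peerId;
                this.state = state;
                this.memberStartTime = memberStartTime;
            }
        }

        // Map of clustered instance peer ID to the last health entry received from it.
        private final Map<String, HealthEntry> lastEntryByPeer = new ConcurrentHashMap<>();

        void receiveMessageEvent(HealthEntry entry) {
            entry.localReceiveTime = System.currentTimeMillis();   // local receive timestamp
            HealthEntry previous = lastEntryByPeer.put(entry.peerId, entry);
            if (previous != null && previous.memberStartTime != entry.memberStartTime) {
                // A different start time means the member restarted faster than failure
                // detection could report the previous instantiation's failure.
                System.out.println(entry.peerId + " appears to have restarted (rejoin case)");
            }
        }
    }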

When DISCOVERY_URI_LIST is enabled, the virtual
multicast sender over TCP currently works. But as the number of
instances in the cluster increases, the overhead of simulating
UDP multicast by sending a heartbeat to each instance over unicast
grows, particularly for the GMS Master: in
self-configuring clusters running without a DAS, one of the clustered
CORE instances would be performing the GMS system overhead while also
handling application requests.

Lastly, each heartbeat from the master also includes the last master
sequence ID sent by the master. Each instance inspects the sequence ID and
checks whether it has received all prior GMS notifications sent by the
master. If any GMS notifications from the master are missing,
the instance requests that the master rebroadcast the missed messages
to it. (This was a bug fix in GF 3.1 for dropped UDP messages.) A sketch of this gap check follows.
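
The following is a minimal sketch of the sequence-ID gap check, assuming a hypothetical rebroadcast-request callback; the actual GF 3.1 message exchange is not shown.

    // Illustrative sketch only: compare the master sequence ID carried in the master's
    // heartbeat with the highest notification sequence applied locally, and request a
    // rebroadcast of any gap. RebroadcastRequest is a hypothetical stand-in for the
    // real GMS request message.
    public class MasterSequenceCheck {

        interface RebroadcastRequest {
            void requestRange(long fromSeqInclusive, long toSeqInclusive);
        }

        private long highestSeqReceived;   // highest master notification sequence applied locally

        void onMasterHeartbeat(long lastSeqSentByMaster, RebroadcastRequest master) {
            if (lastSeqSentByMaster > highestSeqReceived) {
                // Some master notifications were missed (e.g. dropped UDP messages);
                // ask the master to resend only the missing range.
                master.requestRange(highestSeqReceived + 1, lastSeqSentByMaster);
            }
        }

        void onMasterNotification(long seq) {
            highestSeqReceived = Math.max(highestSeqReceived, seq);
        }
    }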

2.2.2 Lower Risk Incremental Optimizations to existing HealthMonitor

  • Provide a capability to configure UDP unicast to be used for heartbeat and/or GMS notifications.
  • Requires an ability to configure UDP unicast port and DatagramSocket.
  • Disclaimer: s
  • Incrementally Evolved Heartbeat failure detection algorithm

Assume a cluster of N members.

  • Each cluster member sends its ALIVE and ALIVEANDREADY heartbeats only to the master.
    (Savings of (N-1)^2 unicast heartbeat messages per heartbeat period from the other instances to the master.)
  • The master sends a heartbeat to the X members most likely to be selected as the next master (could default to all members).
    (Savings of (N-1) - X unicast heartbeat messages per heartbeat period from the master to the cluster;
    see the worked example after this list.)
  • Due to the savings on unicast heartbeat monitoring, when there is a change of master, initialize the heartbeat monitor
    for all existing members with a TCP ping, to ensure that it does not take longer to detect instances that fail
    at the same time, or close to the same time, as the master failed (shortening the window of possible missed failures due to a change
    in GMS Master).
  • When the master performs a planned shutdown, the master sends its heartbeat info to the master candidate.
  • Instances flush their connection cache when they receive a PLANNED_SHUTDOWN or FAILURE notification from the master.
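
As a worked example using the savings estimates given above: for a cluster of N = 10 members with X = 3 likely master candidates, sending instance heartbeats only to the master saves on the order of (N-1)^2 = 81 unicast heartbeat messages per heartbeat period, and limiting the master's outbound heartbeats to the X candidates saves a further (N-1) - X = 6 messages per period, compared with every member unicasting a simulated-multicast heartbeat to every other member.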

2.2.3 New Heartbeat Failure Detection algorithm

This new algorithm is optimized for no UDP multicast and no DAS running.
(These conditions are derived from PRD CLUST-2, Ad Hoc Clustering Support.)
It decentralizes more of the failure detection processing away from the GMS Master
while at the same time preserving the consistent view across the cluster by having
a Master (also known as a GroupLeader) that is the only instance to send
GMS views around.

Neighbor heartbeat failure detection decentralizes the initial identification that
an instance is potentially suspected of failure. Rather than the master watching
all other instances in the cluster, each clustered instance is responsible for
monitoring that it is receiving its neighbor's periodic heartbeat. Additionally,
an optimization can be added: if a GMS message is sent to a neighbor, the
receiver will treat the received message as proof that the sending instance is still
working and will infer an ALIVE/ALIVEANDREADY heartbeat message.

For all the members of a GMS group, GMS defines an ordering of the members based on
a sorting of the PeerID objects. Each member of the GMS group has the same view of
the active members of the GMS group. This ordering is already used to calculate
the replacement Master when the current GMS Master leaves the GMS group.
Now this ordering will also be used to calculate whether an instance is a neighbor of another
instance in the group.

2.2.3.1 Definitions:
Given a GlassFish 3.1 Next cluster with N members and a DAS, there will be N+1 GMS group members.
Given a GlassFish 3.1 Next cluster with N members and no DAS, there will be N GMS group members.

To normalize between the above two scenarios, consider for these definitions that the GMS group has M members.
There exists a well-defined ordering of these members, and we will refer
to these members with the notation m1 to mM.

For n between 1 and M-1,
mn sends its health messages to its neighbor mn+1, and mM sends its health messages to m1.
mn+1 is considered the neighbor health monitor for member mn;
when n is M, m1 is the neighbor health monitor for member mM.
A sketch of this neighbor calculation follows.
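
A minimal sketch of the neighbor calculation from the shared member ordering appears below; the String member tokens stand in for sorted PeerID objects, and the method name is illustrative only.

    // Illustrative sketch only: given the same sorted view of group members on every
    // instance, each member m_n sends heartbeats to m_(n+1), wrapping from m_M back
    // to m_1. The String member tokens stand in for sorted PeerIDs.
    import java.util.List;

    public class NeighborCalculator {

        /** Returns the member that 'self' sends its heartbeats to (its neighbor health monitor). */
        static String neighborOf(List<String> sortedMembers, String self) {
            int index = sortedMembers.indexOf(self);
            if (index < 0) {
                throw new IllegalArgumentException(self + " is not in the member view");
            }
            return sortedMembers.get((index + 1) % sortedMembers.size());
        }

        public static void main(String[] args) {
            List<String> view = List.of("instance101", "instance102", "instance103");
            System.out.println(neighborOf(view, "instance103"));   // wraps around to instance101
        }
    }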

2.2.3.2 Boundary cases for neighbor heartbeat failure detection:
These occur when group membership is changing quickly.

1. Cluster startup time.
Multiple instances may join at the same time. Since "asadmin
start-cluster" is a non-requirement for self-configuring clusters,
the GMS optimization that notifies GMS clients whether an instance is
joining as part of group startup or as an isolated instance startup
will not exist. All instance startups will be treated as a single
instance starting up. The neighbor heartbeat failure detection
algorithm will not work as quickly when multiple instances are
joining or leaving at the same time. (In configurations where the asadmin
subcommands start-cluster and stop-cluster are supported, GMS
notifications have a subevent to indicate whether the group or just
individual instances are changing their group membership.)
Neighbors will be constantly changing as instances start up. The
actual time between each instance starting, for a self-configuring
cluster with more than two instances, will be OS, VM, and GF 3.1 Next
implementation dependent, and the timings will change as we progress
through GF 3.1 Next. (Experience in GF 3.1: when "start-cluster" went
from serialized starting to parallel starting, the change in timing
uncovered other issues.)

2. Fast restart: when a clustered instance fails and the same identified
instance is quickly restarted.
In GlassFish 2.x, the nodeagent watched and restarted failed
instances. In GlassFish 3.1, one had the option of installing an
instance as an OS service; the OS would then monitor the process and
restart the service if it had a software failure (such as out of
memory, SEGV, etc.). Also, for this boundary case, rejoin detection
requires the neighbor to detect the restarted instance and add a
rejoin subevent to the GMS notifications JOIN and
JOINED_AND_READY.

3. Instances considered to be neighbors failing at the same time.
There will be a delay in detecting the failure of an instance whose
monitoring neighbor fails at the same time it does.

2.2.3.3 Description of inferred heartbeat from a GMS message.

Sending a GMS message that is not a HEALTH_MESSAGE to the neighbor monitoring instance:
1. When an instance sends a GMS message to its heartbeat neighbor,
it checks whether its current HealthMonitor state is ALIVE or ALIVEANDREADY.
For either of these states, the sending instance sets its last heartbeat timestamp for that
neighbor to the current system time (thus avoiding sending the next heartbeat message to that instance).

A neighbor health monitor receiving a GMS message from its neighbor:
1. When receiving a GMS message that is not of type HEALTH_MESSAGE,
check whether the message sender is considered a neighbor.
If so, check whether the last heartbeat from that neighbor was
ALIVE or ALIVEANDREADY. If so, treat the received message as
an inferred HealthMessage with the last received HEALTH_MESSAGE state.
Next, check that the NODEADV confirms that
the sending instance is the same invocation of the instance
as the last heartbeat was from (to detect the fast restart case when
they do not match). If it is from the same invocation of the GMS client,
then set the timestamp of the HealthMessage.Entry to the current
system receive time. A sketch of both sides of this inferred heartbeat follows.
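
The following is a minimal sketch of both sides of the inferred heartbeat described above; the types and field names are simplified stand-ins for the HealthMonitor state and the NODEADV start time.

    // Illustrative sketch only: any non-health GMS message exchanged with the heartbeat
    // neighbor can stand in for an explicit heartbeat. The fields below are simplified
    // stand-ins for the HealthMonitor state and the NODEADV (system advertisement) data.
    public class InferredHeartbeatSketch {

        private String localState = "ALIVE";         // ALIVE or ALIVEANDREADY when healthy
        private long   lastHeartbeatSentToNeighbor;  // sender-side bookkeeping

        private String lastNeighborState;            // receiver-side: last HEALTH_MESSAGE state
        private long   lastNeighborStartTime;        // from the neighbor's system advertisement
        private long   lastNeighborReceiveTime;

        /** Sender side: sending any GMS message to the neighbor counts as a heartbeat. */
        void onSendToNeighbor() {
            if (localState.equals("ALIVE") || localState.equals("ALIVEANDREADY")) {
                lastHeartbeatSentToNeighbor = System.currentTimeMillis();
                // The periodic heartbeat task can skip the next explicit heartbeat.
            }
        }

        /** Receiver side: a non-HEALTH_MESSAGE from the monitored neighbor refreshes its entry. */
        void onReceiveFromNeighbor(long senderStartTime) {
            boolean lastStateWasHealthy = "ALIVE".equals(lastNeighborState)
                    || "ALIVEANDREADY".equals(lastNeighborState);
            boolean sameInvocation = senderStartTime == lastNeighborStartTime;
            if (lastStateWasHealthy && sameInvocation) {
                // Treat the message as an inferred heartbeat with the last known state.
                lastNeighborReceiveTime = System.currentTimeMillis();
            }
            // If the start times differ, this is the fast restart case and is handled
            // by the rejoin detection path instead.
        }
    }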

2.3. Virtual Multicast Optimizations

Send in parallel threads over unicast to each cluster member.
Serialize the GMS message only once and share one NIO buffer (via different views)
to write the bytes of the serialized message over unicast messaging to each
member of the cluster. A sketch of this optimization follows.
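
The following is a minimal sketch of the proposed optimization, assuming hypothetical per-member channel targets: the message is serialized once and each worker thread writes its own duplicate view of the same buffer.

    // Illustrative sketch only: serialize the message once, then let each worker thread
    // write its own duplicate view of the same NIO buffer so position/limit are not
    // shared between concurrent unicast sends. The WritableByteChannel targets stand
    // in for the per-member TCP connections.
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelUnicastBroadcast {

        static void broadcast(byte[] serializedMessage, List<WritableByteChannel> members)
                throws InterruptedException {
            // One read-only backing buffer shared by all sends.
            ByteBuffer shared = ByteBuffer.wrap(serializedMessage).asReadOnlyBuffer();

            ExecutorService pool =
                    Executors.newFixedThreadPool(Math.max(1, Math.min(members.size(), 8)));
            for (WritableByteChannel member : members) {
                ByteBuffer view = shared.duplicate();   // independent position/limit, same bytes
                pool.execute(() -> {
                    try {
                        while (view.hasRemaining()) {
                            member.write(view);         // unicast write to this member
                        }
                    } catch (Exception e) {
                        // In the real sender a failed write would mark the member as suspect.
                        System.err.println("send to member failed: " + e);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(30, TimeUnit.SECONDS);
        }
    }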

2.4 GMS Master (GroupLeadershipNotification)

The first clustered instance started for a given cluster will be the GMS Master for that cluster.
The GMS notification GroupLeadershipNotification is sent to all clustered instances when
an instance becomes the GMS Master.

Rather than a requirement, the GMS module strongly recommends that when the elasticity
manager needs to stop an instance in the cluster, preference be given to NOT
stopping the current GMS Master. While the GMS Master is not a single point of failure, it would
improve the continuity of GMS processing if the GMS Master were not constantly changing.

The worst-case scenario would be GMS having an algorithm in which the longest-running member
becomes the new Master while the elasticity manager, when shrinking the cluster,
prefers to stop the longest-running member. These two rules together would
result in churning of the GMS Master (leading to an excess of GroupLeadershipNotifications
to the other clustered instances).