GlassFish 3.1 Next GMS Non-Multicast Design

Authors: Joe Fialli, Bobby Bissett

Date: 07/27/2011: No longer for GlassFish 3.2, just calling it GlassFish Next. Updated to reflect that VIRTUAL_MULTICAST_URI_LIST is deprecated (retained for GMS over jxta only) and replaced with DISCOVERY_URI_LIST.
Date: 04/15/2011

1.0 Overview

This document describes changes to support GMS in an environment without UDP multicast enabled between all clustered instances. It also discusses optimizations to enable scaling of the maximum number of instances in a cluster when running in this mode. Lastly, it identifies changes that assist in decentralizing GMS operations away from the GMS Master in preparation for the self-configuring cluster mode of running without a DAS. The DAS is isolated from application load on the cluster, so it was beneficial to place much of the GMS system-level processing on it; that processing was not impacted by high application load on the CORE clustered instances.

1.1 Terminology

Shoal GMS is implemented in a GlassFish subproject and has terminology that is more generic than its usage within GlassFish. This section provides a mapping from GlassFish clustering terms to Shoal GMS terms. When discussing GMS design within this document, the GMS terms, not the GlassFish instantiations of GMS concepts, are used. This section is to assist GlassFish reviewers in relating the discussion to GlassFish.

- A GlassFish cluster maps to a Shoal GMS Group.
- A GlassFish clustered instance maps to a Shoal GMS group member.
- A GlassFish DAS is a member of all gms-enabled GlassFish clusters listed in its domain.xml configuration.
- A GlassFish DAS is a GMS Spectator, an observer of the GMS Group. A GMS Spectator does not provide an HA replication store and is not considered as a place to replicate session data within a GlassFish cluster.
- A GlassFish DAS is typically the GMS Master, but since the GMS Master cannot be a single point of failure, if the DAS leaves the GMS Group, another GlassFish clustered instance will assume the GMS Master role.
- The GMS Master provides centralized processing so that all GMS members have the same GMS member view (GlassFish clustered instances).

2.0 Design Notes

2.1 GMS Configuration Changes in GlassFish 3.1 Next domain.xml

See GF 3.1 Next GMS Configuration in domain.xml.

The Multicast Sender interface is implemented by:

1. com.sun.enterprise.mgmt.transport.VirtualMulticastSender when the glassfish cluster/group-management-service property GMS_DISCOVERY_URI_LIST is set.
2. com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender when multicast address and port are set (and DISCOVERY_URI_LIST is not set).
3. The hybrid scenario, where multicast address and port are set AND DISCOVERY_URI_LIST is set, is not supported. (This is a proposed implementation simplification for the time being; see GMS requirements doc non-requirements GMS-1.1.3 and GMS-1.1.5.)

The implementation rationale for not supporting the hybrid scenario is that work would be needed to ensure that a given message sent both over UDP multicast and individually to each instance listed in DISCOVERY_URI_LIST was not delivered in duplicate. There is currently no mechanism to ensure that a given GMS message is delivered at most one time. We observed duplicate message delivery when VirtualMulticastSender was both delivering messages over UDP multicast, when it was enabled, and unicasting the same message to each GMS member.
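Below is a minimal sketch of the sender-selection rule just described. It is an illustration only, not the actual GlassFish code: the class, enum, and parameter names are invented for this document, and the default behavior when neither setting is present is out of scope.

    // Sketch only: illustrates the selection rule from section 2.1; names are hypothetical.
    final class MulticastSenderSelectionSketch {

        enum SenderKind { VIRTUAL_MULTICAST, BLOCKING_IO_MULTICAST, UNSUPPORTED }

        static SenderKind select(String discoveryUriList, String multicastAddress, Integer multicastPort) {
            boolean discoveryListSet = discoveryUriList != null && !discoveryUriList.isEmpty();
            boolean udpMulticastSet = multicastAddress != null && multicastPort != null;

            if (discoveryListSet && udpMulticastSet) {
                // Hybrid mode is not supported, to avoid duplicate delivery of the same message.
                return SenderKind.UNSUPPORTED;
            }
            if (discoveryListSet) {
                // com.sun.enterprise.mgmt.transport.VirtualMulticastSender
                return SenderKind.VIRTUAL_MULTICAST;
            }
            if (udpMulticastSet) {
                // com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender
                return SenderKind.BLOCKING_IO_MULTICAST;
            }
            // Neither setting present: configuration defaults apply (not covered by this sketch).
            return SenderKind.UNSUPPORTED;
        }
    }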
Outstanding need from GlassFish 3.2 domain instance: a globally unique id to use as a namespace within which to evaluate the cluster name. (One should be able to have two clusters with the same name when the clusters are created in the context of different domains.) If this is not possible, then see Dependencies GMS-1.2.2 and GMS-1.2.3 in the GMS Requirements document.

ASARCH Review comments from May 10th:

1. Requested considering the GMS TCP port to be port unified with the asadmin port. (Need to consult with Alexey.) Must consider that GMS enables the user to enable SSL on the GMS TCP connection. Is it possible to do port unification with asadmin given the chance of SSL being enabled for GMS?
2. Another option is to have a RESTful service that provides ports for services. IIOP and GMS could use this.
3. File an RFE on enabling securing of GMS heartbeat info sent over multicast. (next release)

2.1.1 GMS Non-Multicast Implementation

com.sun.enterprise.mgmt.transport.VirtualMulticastSender contains a list of all active members. This list is maintained by adding a member joining the group in com.sun.enterprise.mgmt.transport.NetworkManager.addRemotePeer(PeerID) and removing a member leaving the group when the method com.sun.enterprise.mgmt.transport.NetworkManager.removePeerID(PeerID) is called. The active member list is initialized with values from VIRTUAL_MULTICAST_URI_LIST. VirtualMulticastSender.doBroadcast(Message msg) iterates over all active members of the GMS group, sending the msg to each member over unicast. There is a proposal to enable configuration so that some messages can be sent over lower-overhead UDP unicast rather than higher-overhead TCP unicast; the GMS member heartbeat is an initial candidate for this. (A sketch of the broadcast loop follows.)
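The following sketch mirrors the broadcast-over-unicast behavior described for VirtualMulticastSender. The member-list handling and the sendUnicast helper are simplified stand-ins for illustration, not the Shoal APIs.

    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArraySet;

    // Sketch only: simulate a UDP multicast broadcast by unicasting the same message
    // to every active member of the group.
    final class VirtualBroadcastSketch {

        // Active member list, maintained as members join (addRemotePeer) and leave (removePeerID),
        // and initialized from the configured URI list.
        private final Set<String> activePeerIds = new CopyOnWriteArraySet<>();

        void addRemotePeer(String peerId) { activePeerIds.add(peerId); }
        void removePeerId(String peerId)  { activePeerIds.remove(peerId); }

        boolean doBroadcast(byte[] serializedMessage) {
            boolean allSent = true;
            for (String peerId : activePeerIds) {
                allSent &= sendUnicast(peerId, serializedMessage);
            }
            return allSent;
        }

        private boolean sendUnicast(String peerId, byte[] message) {
            // Placeholder for a TCP (or, per the proposal above, UDP) unicast send.
            return true;
        }
    }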
2.2 HeartBeat Failure Detection

The current heartbeat failure detection implementation is optimized for heartbeats sent via UDP multicast. The majority of the HealthMonitoring failure detection is centralized in the GMS Master, which is the only instance that monitors all other instances' heartbeats. It was desirable in GlassFish 2.x and GlassFish 3.1 for this centralized processing to occur in the GMS Master, which typically would be the DAS (domain administration server). This isolated GMS health monitoring processing from the application load on the cluster. The DAS was not a single point of failure as the default GMS Master: if the DAS was stopped or failed, the other members in the cluster would identify this and elect a successor GMS Master. (The GMS notification GroupLeadershipNotification represents a change in GMS Master for the GMS group. A GMS group maps to a GlassFish cluster.) GlassFish 3.1 Next is targeted to run in configurations/environments where UDP multicast is not available and the DAS may not be running (see ad hoc clustering). Thus, we will introduce an alternative heartbeat failure detection algorithm that decentralizes some of the processing and is optimized for unicast heartbeats.

2.2.1 GMS Heartbeat Failure Detection in GlassFish 3.1

The current heartbeat failure detection algorithm relies on UDP multicast transport.

- Each cluster member broadcasts its heartbeat to all other members.
- Each cluster member records heartbeats for all other members. (timestamps receive time)
- Each cluster member cleans its connection caches to a member when it receives a health message of PLANNED_SHUTDOWN or FAILURE about that member.
- Only the master member detects, validates and notifies all other members of a suspected or failed instance.
- All members in the cluster monitor the master's heartbeat. If an instance detects that the master has failed, it uses an algorithm shared by all members to compute the new master. Only the instance that is appointed new master by the algorithm sends notification to the other cluster members that it is the new Master (via GroupLeadershipNotification) and then that the former master has failed. (A sketch of this master computation follows the list.)
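A minimal sketch of the shared master computation referenced in the last item above. Every member runs the same deterministic rule over the same view; the specific rule shown here (first member of the sorted view after removing the failed master) and the use of plain strings instead of Shoal PeerID objects are assumptions for illustration.

    import java.util.List;
    import java.util.TreeSet;

    // Sketch only: because each member applies the same rule to the same sorted view,
    // all members agree on which instance is appointed the new master.
    final class MasterElectionSketch {

        static String computeNewMaster(List<String> currentView, String failedMaster) {
            TreeSet<String> ordered = new TreeSet<>(currentView); // shared, well-defined ordering
            ordered.remove(failedMaster);
            return ordered.isEmpty() ? null : ordered.first();
        }
    }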
A health message is of type HEALTH_MESSAGE and is composed of a HealthMessage.Entry, which has a health state and a system advertisement (the GMS client start time in the system advertisement is instrumental in detecting instances that have restarted faster than heartbeat failure detection could detect that the previous instantiation of the instance had failed and was restarted). When the health message is received, HealthMonitor.receiveMessageEvent timestamps the local receive time on the HealthMessage.Entry and places this entry in the HealthMonitor map of clustered instance PeerID to last HealthMessage.Entry received.

The health message may be the first message received by an instance from another instance, so an instance should send its first heartbeat to all instances. There is dynamic contact information in the message that is cached and used to send messages back to that instance; all messages from that instance contain this contact info. The heartbeat message is typically a candidate for being the first message received from a newly joined member of the cluster, and it assists in filling the dynamic routing cache with the info necessary to send GMS messages back to the newly joined instance.

When DISCOVERY_URI_LIST is enabled, the virtual multicast sender over TCP is currently working. But as the number of instances in the cluster increases, the overhead of simulating UDP multicast by sending a heartbeat to each instance over unicast will increase, particularly in the GMS Master, which for self-configuring clusters running without a DAS would be one of the clustered CORE instances performing GMS system overhead while also handling application requests.

Lastly, each heartbeat from the master also includes the last master sequence id sent by the master. Each instance inspects the sequence id and checks whether it has received all prior GMS notifications sent by the master. If there are any missing GMS notifications from the master, the instance requests that the master rebroadcast the missed messages to it. This was a bug fix in GF 3.1 for dropped UDP messages. (A sketch of this gap check follows.)
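A minimal sketch of that gap check, assuming the master's GMS notifications carry monotonically increasing sequence ids; the class and method names are illustrative, not the HealthMonitor API.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: an instance tracks the highest master sequence id it has processed and
    // compares it against the latest id carried in the master's heartbeat.
    final class MasterSequenceGapCheckSketch {

        private long highestProcessedSeqId;

        // Called whenever a GMS notification from the master is processed.
        void onMasterNotification(long seqId) {
            highestProcessedSeqId = Math.max(highestProcessedSeqId, seqId);
        }

        // Called when a heartbeat from the master arrives carrying the last sequence id it sent.
        // Returns the ids that should be requested for rebroadcast; empty means no gap.
        List<Long> onMasterHeartbeat(long lastSeqIdSentByMaster) {
            List<Long> missing = new ArrayList<>();
            for (long id = highestProcessedSeqId + 1; id <= lastSeqIdSentByMaster; id++) {
                missing.add(id);
            }
            return missing;
        }
    }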
2.2.2 Lower Risk Incremental Optimizations to existing HealthMonitor

- Provide a capability to configure UDP unicast to be used for heartbeat and/or GMS notifications.
- Requires an ability to configure UDP unicast port and DatagramSocket.
- Disclaimer: s
- Incrementally Evolved Heartbeat failure detection algorithm
Assume a cluster of N members.
- Each cluster member sends an ALIVE and ALIVEANDREADY heartbeat only to the master (see the sketch after this list).
(a savings of (N-1)(N-2), roughly (N-1)^2, unicast heartbeat messages per heartbeat period from the other instances, since each of the N-1 non-master members now sends one heartbeat instead of N-1)
- The master sends a heartbeat only to the X members in the cluster most likely to be selected as the next master. (could default to all members)
(a savings of (N-1) - X unicast heartbeat messages per heartbeat period from the master to the cluster)
- Due to the savings on unicast heartbeat monitoring, when there is a change in master, initialize the heartbeat monitor
for all existing members with a TCP ping to ensure that it does not take longer to detect instances that fail at the same time as, or close to the same time as, the master failed. (This shortens the window of possible missed failures due to a change in GMS Master.)
- When the master performs a planned shutdown, it sends its heartbeat info to the master candidate.
- Instances flush their connection cache when they receive a PLANNED_SHUTDOWN or FAILURE notification from the master.
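The sketch below illustrates the heartbeat routing rule in the incrementally evolved algorithm: non-master members heartbeat only to the master, and the master heartbeats to the X most likely master candidates. The candidate selection shown (a prefix of the shared sorted view) and the type names are assumptions for illustration.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: decide where a member sends its periodic heartbeat.
    // "view" is the shared, sorted ordering of all active members.
    final class HeartbeatRoutingSketch {

        static List<String> heartbeatTargets(String self, String master, List<String> view, int x) {
            List<String> targets = new ArrayList<>();
            if (!self.equals(master)) {
                targets.add(master);   // non-master members heartbeat only to the master
                return targets;
            }
            // The master heartbeats to the X members most likely to be selected as the next
            // master; assumed here to be the first X non-master entries of the sorted view.
            for (String member : view) {
                if (targets.size() >= x) {
                    break;
                }
                if (!member.equals(master)) {
                    targets.add(member);
                }
            }
            return targets;
        }
    }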
2.2.3 New Heartbeat Failure Detection Algorithm

This new algorithm will be optimized for no UDP multicast and no DAS running. (These conditions are derived from PRD CLUST-2 Ad Hoc Clustering Support.) It will decentralize more of the failure detection processing away from the GMS Master while at the same time preserving the consistent view across the cluster by having a Master (also known as a GroupLeader) that is the only instance to send GMS views around.

Neighbor heartbeat failure detection decentralizes the initial identification that an instance is potentially suspected of failure. Rather than the master watching all other instances in the cluster, each clustered instance is responsible for monitoring that it is receiving its neighbor's periodic heartbeat. Additionally, an optimization can be added so that if a GMS message is sent to a neighbor, the receiver treats the received message as proof that the sending instance is still working and infers an ALIVE/ALIVEANDREADY heartbeat message.

For all the members of a GMS Group, GMS defines an ordering of the members based on a sorting of the PeerID object. Each member of the GMS group has the same view of the active members of the GMS Group. This ordering is already used to calculate the replacement Master when the current GMS Master leaves the GMS group. Now this ordering will also be used to calculate whether an instance is a neighbor of another instance in the group.

2.2.3.1 Definitions

Given a GlassFish 3.1 Next cluster with N members and a DAS, there will be N+1 GMS group members. Given a GlassFish 3.1 Next cluster with N members and no DAS, there will be N GMS group members. To normalize between the above two scenarios, consider that the GMS group has M members for these definitions. There exists a well-defined ordering of these members, and we refer to them with the notation m1 to mM. For n between 1 and M-1, member mn sends its health messages to neighbor mn+1, and mM sends its health messages to m1. Member mn+1 is considered the neighbor health monitor for member mn, and m1 is the neighbor health monitor for member mM. (A sketch of this neighbor computation follows.)
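A minimal sketch of the neighbor computation defined above, assuming each member holds the same sorted view (Shoal sorts PeerID objects; plain strings are used here) and that the member itself appears in that view; the class and method names are illustrative.

    import java.util.List;

    // Sketch only: with a shared sorted view m1..mM, member mn sends its heartbeat to
    // m(n+1), wrapping around so that mM sends to m1. The inverse computation tells a
    // member which neighbor it is responsible for monitoring.
    final class NeighborSketch {

        // The member that "self" sends its heartbeat to (self's neighbor health monitor).
        static String heartbeatTargetOf(String self, List<String> sortedView) {
            int i = sortedView.indexOf(self);
            return sortedView.get((i + 1) % sortedView.size());
        }

        // The member whose heartbeat "self" is responsible for monitoring.
        static String monitoredNeighborOf(String self, List<String> sortedView) {
            int i = sortedView.indexOf(self);
            return sortedView.get((i - 1 + sortedView.size()) % sortedView.size());
        }
    }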
2.2.3.2 Boundary Cases for Neighbor Heartbeat Failure Detection

These occur when group membership is changing quickly.

1. Cluster startup time, with multiple instances joining at the same time. Since "asadmin start-cluster" is a non-requirement for self-configuring clusters, the GMS optimization to notify GMS clients whether an instance is joining as part of group startup or as an isolated instance startup will not exist; all instance startups will be considered a single instance starting up. The neighbor heartbeat failure detection algorithm will not work as quickly when multiple instances are joining or leaving at the same time. (In configurations where the asadmin subcommands start-cluster and stop-cluster are supported, GMS notifications have a subevent to indicate whether the group or just individual instances are changing their group membership.) Neighbors will be constantly changing as instances start up. The actual time between each instance starting for a self-configuring cluster with more than 2 instances will be OS, VM, and GF 3.1 Next implementation dependent, and the timings will change as we progress through GF 3.1 Next. (Experience in GF 3.1: when "start-cluster" went from serialized starting to parallel starting, the change in timing uncovered other issues.)
2. Fast restart: a clustered instance fails and the same identified instance is quickly restarted. In GlassFish 2.x, the nodeagent watched and restarted failed instances. In GlassFish 3.1, one has the option of installing an instance as an OS service, so the OS would monitor the process and restart the service if it had a software failure (such as out of memory, segv, ...). For this boundary case, rejoin detection requires the neighbor to detect the restarted instance and add a rejoin subevent to the GMS notifications of JOIN and JOINED_AND_READY.
3. Instances considered to be neighbors failing at the same time. There will be a delay in detecting the failure of an instance whose monitoring neighbor also fails.

2.2.3.3 Description of Inferred Heartbeat from a GMS Message

Sending a GMS message that is not a HEALTH_MESSAGE to the neighbor monitoring instance:

1. When an instance sends a GMS message to its heartbeat neighbor, it checks whether its current HealthMonitor state is ALIVE or ALIVEANDREADY. For either of these states, the sending instance sets its last heartbeat timestamp to that neighbor to the current system time, thus avoiding the next heartbeat message to that instance.

Neighbor health monitor receiving a GMS message from its neighbor:

1. When receiving a GMS message that is not of HEALTH_MESSAGE type, check whether the message sender is considered a neighbor. If so, check whether the last heartbeat from that neighbor was ALIVE or ALIVEANDREADY. If so, treat the received message as an inferred HealthMessage with the last received HEALTH_MESSAGE state. Next, check whether the NODEADV confirms that the sending instance is the same invocation of the instance as the last heartbeat was from (to detect the fast restart case when they do not match). If it is from the same invocation of the GMS client, then set the timestamp of the HealthMessage.Entry to the current system time as the receive time. (A sketch of this receive-side check appears at the end of this document.)

2.3 Virtual Multicast Optimizations

Send in parallel threads over unicast to each cluster member. Only serialize the GMS message once and share one NIO buffer (via different views) to write the bytes of the serialized message via unicast messaging to each member of the cluster.

2.4 GMS Master (GroupLeadershipNotification)

The first clustered instance started for a given cluster will be the GMS Master for the cluster. The GMS notification GroupLeadershipNotification is sent to all clustered instances when an instance is made the GMS Master. Rather than a requirement, the GMS module strongly recommends that when the elasticity manager needs to stop an instance in the cluster, preference be given to NOT stopping the current GMS Master. While the GMS Master is not a single point of failure, it would improve the continuity of GMS processing if the GMS Master is not constantly changing. The worst-case scenario would be that GMS had an algorithm where the longest-running member becomes the new Master and the elasticity manager, when shrinking the cluster, preferred to stop the longest-running member. These two rules would result in churning of the GMS Master, leading to an excess of GroupLeadershipNotification events to the other clustered instances.
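Referenced from section 2.2.3.3, the sketch below shows the receive-side inferred-heartbeat check. The state tracking, type names, and the use of a sender start time as a stand-in for the NODEADV invocation check are assumptions for illustration, not the HealthMonitor implementation.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: when a non-HEALTH_MESSAGE arrives from the monitored neighbor and that
    // neighbor's last reported state was ALIVE or ALIVEANDREADY, treat the message as an
    // inferred heartbeat, provided it comes from the same invocation (start time) of the
    // member. A mismatched start time indicates a fast restart and no heartbeat is inferred.
    final class InferredHeartbeatSketch {

        enum State { ALIVE, ALIVEANDREADY, OTHER }

        static final class LastHeartbeat {
            State state;
            long senderStartTime;   // stand-in for the start time carried in the NODEADV
            long receiveTimestamp;  // local receive time of the last real or inferred heartbeat
        }

        private final Map<String, LastHeartbeat> lastHeartbeatByPeer = new ConcurrentHashMap<>();
        private final String monitoredNeighbor;

        InferredHeartbeatSketch(String monitoredNeighbor) {
            this.monitoredNeighbor = monitoredNeighbor;
        }

        // Called when a real HEALTH_MESSAGE is received from any member.
        void onHealthMessage(String senderPeerId, State state, long senderStartTime) {
            LastHeartbeat entry = lastHeartbeatByPeer.computeIfAbsent(senderPeerId, id -> new LastHeartbeat());
            entry.state = state;
            entry.senderStartTime = senderStartTime;
            entry.receiveTimestamp = System.currentTimeMillis();
        }

        // Called for every received GMS message that is not a HEALTH_MESSAGE.
        void onNonHealthMessage(String senderPeerId, long senderStartTime) {
            if (!senderPeerId.equals(monitoredNeighbor)) {
                return; // only infer heartbeats for the neighbor this member monitors
            }
            LastHeartbeat last = lastHeartbeatByPeer.get(senderPeerId);
            boolean aliveState = last != null
                    && (last.state == State.ALIVE || last.state == State.ALIVEANDREADY);
            boolean sameInvocation = last != null && last.senderStartTime == senderStartTime;
            if (aliveState && sameInvocation) {
                last.receiveTimestamp = System.currentTimeMillis(); // infer the heartbeat
            }
        }
    }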