GlassFish Server Open Source Edition 3.1 - Shoal Group Management Service (GMS) for Runtime Clustering Services

Scope of the Project

Shoal GMS is a clustering framework that provides infrastructure to build fault tolerance, reliability and availability.
The clustering framework provides the following functionality to GlassFish services and user applications:

  • GMS event notification
    • register callback for GMS event notification
    • GMS runtime notifies when changes in group membership occur.
  • Cluster-wide Messaging
    • send a message to one member, to a sublist of members, or broadcast it to all members of the cluster
    • register message callback to process messages sent by other group members.
  • GMS status methods
    • list all members or only CORE members
    • request the status of any member in the cluster
  • GMS Group Configuration
    • configure the multicast address and port used by a group
    • configure the GMS listener BIND_INTERFACE_ADDRESS (the IP address of the network interface to use) on a multihomed machine
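
On a multihomed machine, the first step in choosing a BIND_INTERFACE_ADDRESS value is finding out which local addresses exist. The following is a minimal sketch using only the JDK; the class name is hypothetical, and the filter (interfaces that are up, non-loopback, multicast-capable, IPv4 only) is an assumption about what makes a reasonable candidate, not a rule taken from GMS.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical helper: lists local addresses that could serve as BIND_INTERFACE_ADDRESS. */
final class BindAddressCandidates {

    /** Returns IPv4 addresses of interfaces that are up, not loopback, and multicast-capable. */
    static List<InetAddress> list() {
        List<InetAddress> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result; // no interfaces reported on this host
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                if (!nic.isUp() || nic.isLoopback() || !nic.supportsMulticast()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    if (addr instanceof Inet4Address) {
                        result.add(addr);
                    }
                }
            }
        } catch (SocketException e) {
            // Interface state could not be read; return what was collected so far.
        }
        return result;
    }

    public static void main(String[] args) {
        for (InetAddress addr : list()) {
            System.out.println(addr.getHostAddress());
        }
    }
}
```

Each printed address is a candidate; which one is correct depends on which network the other cluster members can reach.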

A recent enhancement to GMS is a pluggable transport, together with a Grizzly implementation of that transport.
The GF HA subsystem builds on the GMS clustering framework and its messaging.

group-management-service feature status: features from the GlassFish issue tracker targeted for the GF v3.1 milestone

Feature-ID | Priority | Description | Eng Response | Owner(s) | Estimate (Person Days) | Source of Requirement | Status/Comments
GMS-01 | P1 | Shoal GMS over Grizzly implementation | YES | Joe Fialli, Bobby Bissett | DONE | switch to a well supported transport | Shoal GMS dev level testing confirmed working
GMS-02 | P1 | Integrate Shoal GMS into GF v3.1 using dev. stop-gap clustering cmds | YES | Joe Fialli, Bobby Bissett | near complete | earlier GMS in GF v3.1 testing | This is a transitional integration. Building of gms-adapter is not enabled in cluster pom.xml yet. Temporary GMS cluster config files/asadmin cluster cmds so that GMS in GF v3.1 testing can start.
GMS-03 | P1 | Shoal GMS integration via domain.xml configuration | YES | Bobby/Joe | estimateManDays | impl v2.1 functionality | Initial gmsconfig document presented to admin team; updating doc with feedback. Send out update to admin dev alias.
GMS-04 | P1 | Introduce GMS GroupHandle.getPreviousAliveOrReadyMembers() | YES | Joe Fialli | 5 days | HA request | Mahesh needs to use this during HA M3 development. Higher priority than rejoin.
GMS-05 | P1 | Introduce GMS rejoin subevent in JOIN and JOINED_AND_READY notification | YES | Bobby | estimateManDays | compensate for loss of v2.1 NodeAgent as GMS watchdog | The subevent informs the GMS client that a clustered instance failed and was restarted faster than GMS heartbeat failure detection could have detected the failure. In GF v3.1, a local initrc technique for monitoring a process and restarting it when it fails will cause this to occur.
GMS-06 | P2 | asadmin get-health clustered-instance or cluster | YES | Joe | 3-7 days | feature parity | note: only works for gms-enabled cluster
GMS-07 | P3 | GMS over Grizzly virtual multicast | NO | ownerTBD | estimateManDays | identified as desirable feature during GF v3.1 launch | status/comments
GMS-08 | P2 | multicast enabled diagnostic utility | YES | Bobby | estimate? | ease of support request | test already exists in Shoal GMS; it just needs to be properly packaged
GMS-09 | P2 | Monitoring Stat Providers | YES | Bobby/Joe | estimateManDays | | Provide stats that can be used for both monitoring and debugging: messaging throughput and event notification counters
GMS-10 | P2 | Upgrade from v2.1 cluster and group-management-service element's attributes/properties to v3.1 cluster/group-management-server | YES | Bobby | ?? | feature parity |
GMS-11 | P2 | Update external library Shoal GMS to meet GF v3.1 logging requirements | YES | Bobby/Joe | ?? | feature parity |

Feature Overview

GMS-01 Shoal GMS over Grizzly implementation

  • Description

Replace JXTA as the GMS transport provider with an implementation that uses Grizzly for TCP and UDP messages. The basic implementation is being provided by Bongjae Chang, a Shoal community member, but it needs further enhancement to reach production quality.

  • Sub-tasks
    • Tune the default Grizzly NetworkManager parameters. These are settable via the following GMS Property parameters: MAX_POOLSIZE, CORE_POOLSIZE, KEEP_ALIVE_TIME, POOL_QUEUE_SIZE, HIGH_WATER_MARK, MAX_PARALLEL, WRITE_TIMEOUT, MAX_WRITE_SELECTOR_POOL_SIZE. These parameters govern the resources available for handling incoming Grizzly traffic.
    • The GMS over Grizzly implementation will be tested over several of the critical test scenarios to establish its viability and stability for usage in a production quality GlassFish deployment. Initial stability tests will be based on module level distributed tests.
    • Requires QE to port the earlier GMS tests, which were based on Appserver and depended on Appserver asadmin commands, back to module level tests, as was done for the initial GlassFish v2 testing.
    • Fix P1 and P2 type bugs
  • Dependencies
    • Grizzly framework and util jars version 1.9.19
    • Add GMS messaging load testing, since HA relies directly on GMS over Grizzly messaging (in v2.1, HA used JXTA messaging directly).
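
The tuning parameters named in the sub-tasks above can be collected in a java.util.Properties object. The sketch below is illustrative only: the property names come from this page, but the class name, every value, and the "positive integer" sanity check are assumptions, since this page states neither units nor defaults. How the properties are handed to GMS is outside the scope of the sketch.

```java
import java.util.Properties;

/** Hypothetical holder for the GMS tuning property names listed on this page. */
final class GrizzlyTuningSketch {

    /** Property names exactly as listed in the sub-tasks. */
    static final String[] NAMES = {
        "MAX_POOLSIZE", "CORE_POOLSIZE", "KEEP_ALIVE_TIME", "POOL_QUEUE_SIZE",
        "HIGH_WATER_MARK", "MAX_PARALLEL", "WRITE_TIMEOUT", "MAX_WRITE_SELECTOR_POOL_SIZE"
    };

    /** Builds a Properties object. All values are placeholders, not recommended defaults. */
    static Properties placeholderSettings() {
        Properties props = new Properties();
        props.setProperty("MAX_POOLSIZE", "50");
        props.setProperty("CORE_POOLSIZE", "20");
        props.setProperty("KEEP_ALIVE_TIME", "60");
        props.setProperty("POOL_QUEUE_SIZE", "4096");
        props.setProperty("HIGH_WATER_MARK", "1024");
        props.setProperty("MAX_PARALLEL", "15");
        props.setProperty("WRITE_TIMEOUT", "10000");
        props.setProperty("MAX_WRITE_SELECTOR_POOL_SIZE", "30");
        return props;
    }

    /** Assumed sanity check: every listed name is present and parses as a positive integer. */
    static boolean isComplete(Properties props) {
        for (String name : NAMES) {
            String value = props.getProperty(name);
            if (value == null) {
                return false;
            }
            try {
                if (Integer.parseInt(value.trim()) <= 0) {
                    return false;
                }
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("complete: " + isComplete(placeholderSettings()));
    }
}
```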

GMS-02 Integrate Shoal GMS into GF v3.1 using dev. stop-gap clustering cmds

  • Sub-tasks
    • QE uses the stop-gap clustering cmds/configurations so that GMS over Grizzly testing in GF v3.1 does not have to wait until GMS-03 functionality is complete. Switching over should be a simple replacement of each stop-gap cluster command with the equivalent asadmin cluster command when GMS-03 is ready.
    • Make GMS OSGi compliant - GMS was not an OSGi compliant module. Changes are being incorporated to mavenize GMS so that it can be published to a Maven repository, and to use GF v3 Maven commands to build an OSGi compliant module.
    • Create a GMS Service in GlassFish that will act as a delegate to start and stop GMS module in each GF 3.1 instance

The GMS Service is an @Startup class that will be part of the startup services in a clustered GF instance. Requirements for when the GMS module is started still need to be closed: architects need to consider how lazy startup guidelines could introduce instability and unpredictability in group memberships. COMPLETED by Sheetal.

  • Dependencies

GMS for JoinedAndReady (instance start and restart), failure detection notifications and all Messaging pertaining to session replication.
DOL (possibly for deployment framework support to correctly handle automatic timer creation)

GMS-07 GMS over Grizzly virtual multicast

  • Description

Multicast is not enabled in every production environment, and it is not available in cloud environments.
This feature is desirable for those configurations.

Implement as a static list of IP addresses and ports that make up the cluster. There are no dynamic capabilities when multicast is not enabled in the network.
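
The static-list idea can be sketched as follows: parse a fixed list of members and emulate a broadcast by sending one unicast message per member. This is a minimal illustration using only the JDK, not the Shoal implementation. The class name, the "host:port,host:port" list format, and the UnicastSender callback are all hypothetical, and IPv6 literals are not handled.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: broadcast emulated as unicast sends over a static member list. */
final class VirtualMulticastSketch {

    /** Stand-in for whatever transport actually delivers a unicast message. */
    interface UnicastSender {
        void send(InetSocketAddress target, byte[] payload);
    }

    private final List<InetSocketAddress> members;

    VirtualMulticastSketch(List<InetSocketAddress> members) {
        this.members = Collections.unmodifiableList(new ArrayList<>(members));
    }

    /** Parses an assumed "host:port,host:port" format without doing any DNS lookup. */
    static List<InetSocketAddress> parseMembers(String spec) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : spec.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException("expected host:port but got: " + trimmed);
            }
            String host = trimmed.substring(0, colon);
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    /** Sends the payload to every member except self; returns the number of sends. */
    int broadcast(InetSocketAddress self, byte[] payload, UnicastSender sender) {
        int sent = 0;
        for (InetSocketAddress member : members) {
            if (member.equals(self)) {
                continue;
            }
            sender.send(member, payload);
            sent++;
        }
        return sent;
    }
}
```

Because the list is fixed at construction time, a member that is not in the list can never be reached, which matches the "no dynamic capabilities" limitation described above.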

  • Issues
    • Yet another configuration to test. Really need to run all existing shoal tests with this configuration to be sure it is working.

GMS-09 Monitoring Stat Providers

  • Description

Provide monitoring stats that will help with diagnosing GMS membership related issues.
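
As a rough illustration of the kind of counters such a provider might expose (event notification counts and messaging throughput), here is a minimal, thread-safe sketch using only the JDK. The class and method names are hypothetical, and it does not use the GlassFish monitoring framework; the event names follow the notification types listed under Dev Tests.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of GMS monitoring counters; not the GlassFish stat provider API. */
final class GmsStatsSketch {

    enum Event { JOIN, JOINED_AND_READY, SUSPECTED, FAILURE, PLANNED_SHUTDOWN }

    // The map is fully populated in the constructor and never modified afterwards,
    // so concurrent reads are safe; only the AtomicLong values change.
    private final Map<Event, AtomicLong> eventCounts = new EnumMap<>(Event.class);
    private final AtomicLong messagesSent = new AtomicLong();
    private final AtomicLong messagesReceived = new AtomicLong();

    GmsStatsSketch() {
        for (Event e : Event.values()) {
            eventCounts.put(e, new AtomicLong());
        }
    }

    void recordEvent(Event e) { eventCounts.get(e).incrementAndGet(); }

    void recordSent() { messagesSent.incrementAndGet(); }

    void recordReceived() { messagesReceived.incrementAndGet(); }

    long count(Event e) { return eventCounts.get(e).get(); }

    long sent() { return messagesSent.get(); }

    long received() { return messagesReceived.get(); }

    /** Average send rate over the given interval; returns 0 for a non-positive interval. */
    double sentPerSecond(long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return messagesSent.get() * 1000.0 / elapsedMillis;
    }
}
```

Counts of SUSPECTED versus FAILURE notifications are the sort of figures that could help separate transient heartbeat delays from real instance failures when diagnosing membership issues.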

One-pager / Functional Specification

Dev Tests

  • Information on GMS junit dev tests posted on Dev Tests page
  • TBD: instructions for running the automated distributed Shoal GMS developer unit tests will be added to the Dev Tests page.
  • 14 junit tests, run as part of "mvn install" or via "ant run-junit-tests".
  • distributed developer level tests (runnable with multiple instances on one machine AND distributed as one instance per machine.)
    • 4 gms event notification tests (join/joinAndReady/suspected/failure/plannedShutdown)
    • 2 gms messaging tests
  • Automation of nightly runs of the Shoal GMS distributed developer level tests (both on a single machine and distributed) via Hudson is in progress.

Quality

  • Link to Test Plans

Documentation

  • Link to Documentation

Milestone Schedule

Item # | Date/Milestone | GF Issue Tracker | Feature-ID | Description | QA/Docs Handover? | Status / Comments
01 | DONE | | GMS-01 | GMS over Grizzly | QA handover done, no doc handover needed | Initial distributed unit level testing done.
02 | DONE | gfit 12189 | GMS-02 | Integrate Shoal GMS into GF v3.1 using dev. stop-gap clustering cmds | |
03 | DONE | gfit 12190 | GMS-03 | Shoal GMS integration via domain.xml configuration | QA Handoff: PENDING (will complete on 6/28), Docs: yes | AS ARCH completed on gms config doc
04 | M3 DONE | gfit 12191 | GMS-04 | Introduce GMS GroupHandle.getPreviousAliveOrReadyMembers() | No | Dev test is sufficient. Only useful for consistent hash calculation in HA.
05 | M3 DONE | gfit 12192 | GMS-05 | Introduce GMS rejoin subevent in JOIN and JOINED_AND_READY notification | QA Handover YES, DOC: javadoc, Project Shoal web pages | Status/Comments
06 | M? v3.2 | | GMS-07 | GMS over Grizzly virtual multicast | N/A | while desirable, impl and testing resources not identified yet
07 | M4 DONE | gfit 12195 | GMS-08 | multicast enabled diagnostic utility | YES | DONE
08 | moved to v3.2 | gfit 12194 | GMS-09 | Monitoring Stat Providers | YES | This feature was moved from M4 to v3.2. Message throughput, thread utilization, number of detected SUSPECTED, number of FAILURES.
09 | M4(8/16) DONE | gfit 12563 | GMS-10 | Upgrade from v2.1 cluster and group-management-service element's attributes/properties to v3.1 cluster/group-management-server | YES | Manual testing completed. Automated testing to be done in admin devtest.
10 | M5(9/13) DONE | gfit 12196 | GMS-11 | Update external library Shoal GMS to meet GF v3.1 logging requirements | No QA handoff or doc | Status/Comments
11 | M1(5/24) DONE | | | GMS one-pager | | Identifying dependencies and new methods being added.
12 | M4 DONE | GFIT 12193 | GMS-06 | asadmin get-health cluster or clustered-instance | YES | automated test in admin devtest

Task | Target Milestone | Start | End Date | Owner(s) | Feature ID | Status / Comments
Test GMS over Grizzly in shoal unit testing based on evaluation criteria | DONE | | | Kazem, Steve DiMilla | | Maintain GF v2.1 quality. Run 6-8 critical scenarios of all scenarios over several iterations to establish feasibility of the Grizzly transport based implementation and stability levels equivalent to the JXTA based implementation. Still need scenario 51.
Automate running shoal distributed dev tests via Hudson | M1 | 05/03 | 05/17 | Steve DiMilla | GMS-01 | Automation of standard shoal notification validation nearly complete. Msg throughput test still being worked on.
OSGI shoal-gms.jar | M1 | DONE | DONE | Sheetal | GMS-02 | initial pass completed.
Load shoal-gms.jar only when gms-enabled | M2 | 5/2? | endDate | Bobby or Joe (whoever is free first to work on it) | GMS-02 or GMS-03 | Only load shoal-gms.jar when a clustered instance/DAS has a gms-enabled cluster. Check with Jerome if okay to enable gms in gfv3.1 before implementing this task.
Meet GF v3.1 OSGI requirements | M2 | start | end | Bobby | GMS-02 | Export minimal gms pkgs based on OSGI requirements documented in How to run OSGI Pkg Dep Analyzer
stop-gap clustering control in gf v3.1 | M1 | DONE | DONE | Steve DiMilla | GMS-02 | Dev handoff to QE was completed on April 29. Note: cluster/pom.xml is not building gms-adapter yet. Check with Jerome if okay to enable it before completing lazy loading of shoal-gms.jar only when cluster gms-enabled is on.
DAS joins all of its domain.xml clusters when started | DONE | Start | endDate | Joe | GMS-03 | task documents v2.1 impl behavior, may need to be revisited in v3.1
DAS dynamically joins cluster created by "asadmin create-cluster" | DONE | Start | endDate | Joe | GMS-03 | task documents v2.1 impl behavior, may need to be revisited in v3.1
Test GMS within gf v3.1 with stop-gap clustering | M2 | start | DONE | Kazem/Steve | featureID | Stop-gap clustering implemented and initial gms integration into gf is completed. cluster/gms-adapter checked in but building not enabled yet by cluster/pom.xml.
Test GMS within gf v3.1 with domain.xml/asadmin cmds | M? | start | endDate | Kazem | featureID | Use domain.xml config and GF v3.1 asadmin cmds. Depends on asadmin start-cluster, start-instance, stop-instance, stop-cluster.
asadmin get-health clustered-instance or cluster | M4 | startDate | 3-7 days | Joe | GMS-06 | Note: only works for gms-enabled cluster. Leverage existing GMS get-member-status method. Most time will be spent implementing the Admin CLI command side.
Developer unit test for Distributed State Cache | M3 | CANCELLED | 5 day task | Joe Fialli | featureId | Source of NPE and hangs in v2.1. IIOP will use DSC as it did in v2.1 to get IIOP address of other members in cluster.
Disable Distributed State Cache for GF v3.1 | M3 | | 1 day task | Joe Fialli | featureId | More efficient for IIOP to just read IIOP port info directly from domain.xml. Transaction is no longer using fencing, so add property to disable DSC completely.
Performance Tuning: Tune Grizzly NetworkManager default settings | M4 | StartDate | EndDate | Joe Fialli | featureId | tune default values
Place ALL SEVERE, WARNING, INFO log messages into logstrings.properties. Must have event id (GMS-050) | M5 | startDate | endDate | Bobby | GMS-11 | status/comments
Add Diagnostic messages for SEVERE and WARNING | M5 | startDate | endDate | Joe | GMS-11 | Minimally required for GMS properties that, when misset, cause an instance to not come up (e.g. BIND_INTERFACE_ADDRESS).

References / Links



gmsconfig_gfv3_1.pdf (application/x-download)
gms_gfv3_1_onepager.txt.pdf (application/pdf)
gmsconfig_gfv3_4.rtf.pdf (application/pdf)