MK-1 (Section 2.2)
Comment: Please mention in this section any risks involved due to the node agent not being available in V3.1.
Response: Agree. Added a line and referred to the complete discussion under Section 4.12.

MK-2 (Section 4.1)
Comment: Are there any significant differences in the way members advertise and discover themselves in the absence of JXTA? A brief overview of how this is done (including any requirement for multicast support, etc.) would be good here.
Response: Member discovery is implemented in the GMS Shoal layer code. The only difference between the JXTA and Grizzly transports is that JXTA may broadcast on ALL network interfaces of a multihomed machine, while Shoal over Grizzly broadcasts only on the first network interface it chooses to use. When gms-bind-interface-address is set on a multihomed machine, there is no difference between GMS over JXTA and GMS over Grizzly.
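
To make the bind-interface point concrete, here is a minimal sketch of starting a Shoal GMS member with the interface pinned through a startup property. The property key string, the member/cluster names, and the address are assumptions used for illustration; the one-pager refers to the setting only as gms-bind-interface-address.

```java
import java.util.Properties;

import com.sun.enterprise.ee.cms.core.GMSFactory;
import com.sun.enterprise.ee.cms.core.GroupManagementService;

public class BindInterfaceExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Pin GMS to one interface on a multihomed machine so that the JXTA and
        // Grizzly transports behave the same way. The property key string is an
        // assumption; the setting is only named gms-bind-interface-address above.
        props.put("BIND_INTERFACE_ADDRESS", "10.0.0.5");

        // Join cluster "cluster1" as member "instance1" (placeholder names);
        // member advertisement and discovery happen inside the Shoal GMS layer.
        GroupManagementService gms = (GroupManagementService) GMSFactory.startGMSModule(
                "instance1", "cluster1", GroupManagementService.MemberType.CORE, props);
        gms.join();
    }
}
```
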
MK-3 (Section 4.1)
Comment: It looks like the LB plugin can detect instance failures far earlier than GMS. This means that the replication layer may continue to replicate to a failed instance because it did not receive any failure event from GMS. I mentioned (in the LB feedback document) a possible approach to detect failures far more quickly than GMS in the absence of the node agent. Let me briefly mention it here: say the LB detects an instance failure (say node n2 failed); it then stamps the failed instance name in every redirected request (or even every request) for a certain period of time. A GF valve, installed in the web container layer, can interpret this info and pass it to GMS (this valve can be thought of as a watchdog implementation). GMS can then declare this node to be in the FAILED state (perhaps after doing a ping). This could be an interim solution until we resurrect the node agent. The valve itself can be provided by GMS (and discovered by the web container using hk2). Also, does Grizzly provide any way to discover node failures faster? (I guess not.)
Response: Disagree with this assessment. The LB does not detect that an instance has failed; it marks a non-responsive instance as unhealthy in < 30 ms. (The LB info is extracted from this document; see also the link to a sample loadbalancer.xml.) When the LB marks an instance as unhealthy due to non-responsiveness, it then starts active health monitoring of the instance to detect when it can reinstate it as "healthy". The LB's unhealthy state is equivalent to the GMS Shoal member state of SUSPECT. Unlike the GF v2.x node agent, the LB does not have a means to detect that an application server has FAILED faster and more correctly than GMS. Please read the GMS watchdog document to understand the node agent's advantage, from running locally, in detecting FAILURE faster and more correctly than GMS: the node agent monitored the process id of each instance it controlled, and when it noticed the process had died it restarted that instance.
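
For context on the watchdog comparison, here is a minimal, hedged sketch of how a locally running watchdog (the role the v2.x node agent played) could report an observed process death to GMS. The WATCHDOG member type and the announceWatchdogObservedFailure call are taken from the referenced GMS watchdog design and should be treated as assumptions; all names are placeholders.

```java
import java.util.Properties;

import com.sun.enterprise.ee.cms.core.GMSFactory;
import com.sun.enterprise.ee.cms.core.GroupManagementService;

public class LocalWatchdogExample {
    public static void main(String[] args) throws Exception {
        // Join the group as a WATCHDOG member rather than a CORE member. Both the
        // WATCHDOG member type and the reporting call below come from the GMS
        // watchdog design referenced above and should be read as assumptions here.
        GroupManagementService gms = (GroupManagementService) GMSFactory.startGMSModule(
                "nodeagent1", "cluster1", GroupManagementService.MemberType.WATCHDOG, new Properties());
        gms.join();

        // A local process monitor (what the v2.x node agent did) notices that the
        // process for "instance2" has died and reports it right away, letting GMS
        // verify and announce FAILURE sooner than its own heartbeat timeouts would.
        gms.announceWatchdogObservedFailure("instance2");
    }
}
```
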
MK-4 (Section 4.12)
Comment: Some components (like the EJB container) may have registered only failure event handlers in V2. This means that they will miss the REJOIN subevent unless they modify their code. Could you document why GMS does not internally translate this into a FAILURE followed by a JOINED_AND_READY event?
Response: We thought it would be confusing for a GMS client to receive a FAILURE notification AFTER the instance has already restarted. This is described in detail in Section 3.1 of GMS Watchdog; will document it here also. The REJOIN subevent indicates that an instantiation of the instance that joined the cluster at some time in the past has failed. The JOIN or JOINED_AND_READY notification represents the restarted instance, and the REJOIN subevent represents the failure of the past instantiation. If one were to ask for the member's state during the JOIN/JOINED_AND_READY notification, it would be ALIVE; if GMS issued a FAILURE for a restarted instance whose GMS member status was ALIVE, GMS clients would find that confusing. When the FAILURE cannot be detected before the restart, it is simply too confusing to report a FAILURE for a past instantiation of an instance that has already restarted and is joining the cluster.
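
For components that want to react to the REJOIN subevent, a minimal sketch of a JOINED_AND_READY callback is shown below, assuming the standard Shoal CallBack/action-factory client API; the getRejoinSubevent() accessor name is an assumption.

```java
import com.sun.enterprise.ee.cms.core.CallBack;
import com.sun.enterprise.ee.cms.core.GroupManagementService;
import com.sun.enterprise.ee.cms.core.JoinedAndReadyNotificationSignal;
import com.sun.enterprise.ee.cms.core.Signal;
import com.sun.enterprise.ee.cms.impl.client.JoinedAndReadyNotificationActionFactoryImpl;

public class RejoinAwareHandler implements CallBack {

    public void processNotification(Signal signal) {
        if (signal instanceof JoinedAndReadyNotificationSignal) {
            JoinedAndReadyNotificationSignal joined = (JoinedAndReadyNotificationSignal) signal;
            // The getRejoinSubevent() accessor is an assumption; the point is that
            // the restart of a previously failed instantiation arrives as a
            // subevent of the join, not as a separate FAILURE notification.
            if (joined.getRejoinSubevent() != null) {
                // Treat this like a FAILURE of the earlier instantiation of
                // signal.getMemberToken(): clean up its replicated state here,
                // then fall through to the normal "instance is ready" handling.
            }
        }
    }

    public static void register(GroupManagementService gms, CallBack handler) {
        gms.addActionFactory(new JoinedAndReadyNotificationActionFactoryImpl(handler));
    }
}
```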
|
MK-5 (General)
Comment: Can you comment on whether GMS will be started eagerly, or only when clients try to access GMS? Also, can the callback handlers be discovered by GMS through hk2?
Response: Will add that GMS is started when an instance belongs to a cluster whose cluster attribute gms-enabled is set to true, so GMS is eagerly started when gms-enabled is true in domain.xml. This will also be documented in the GMS configuration document. Design work is needed on how one interacts with GMS via hk2; that has not been done yet. Will add a TBD bullet under GMS one-pager Section 4.5.1 (public interfaces) for this request.
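
Until the hk2 interaction is designed, registration is programmatic. Here is a minimal sketch, assuming the Shoal GMSFactory lookup and action-factory API, with the group name as a placeholder.

```java
import com.sun.enterprise.ee.cms.core.CallBack;
import com.sun.enterprise.ee.cms.core.GMSFactory;
import com.sun.enterprise.ee.cms.core.GroupManagementService;
import com.sun.enterprise.ee.cms.impl.client.FailureNotificationActionFactoryImpl;

public class GmsClientRegistration {
    public static void register(CallBack failureHandler) throws Exception {
        // Assumes the runtime has already started GMS eagerly because the cluster
        // has gms-enabled="true" in domain.xml; the client only looks the running
        // module up and registers its callbacks programmatically. Discovery of
        // handlers through hk2 is the TBD item mentioned above.
        GroupManagementService gms = (GroupManagementService) GMSFactory.getGMSModule("cluster1");
        gms.addActionFactory(new FailureNotificationActionFactoryImpl(failureHandler));
    }
}
```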
|
MK-6 (General)
Comment: Are there any payload size limitations imposed by GMS or Grizzly?
Response: We will make the payload size limitations configurable. These are MAX_MESSAGE_SIZE and the multicast packet size. The current defaults are 128 KB for the message size and 64 KB for the multicast packet size. A broadcast message larger than the multicast packet size is simply virtually broadcast over TCP.
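
A minimal sketch of how these limits play out for a sender, assuming the Shoal GroupHandle API; the group and component names are placeholders, and the comments restate the defaults given above.

```java
import com.sun.enterprise.ee.cms.core.GMSFactory;
import com.sun.enterprise.ee.cms.core.GroupHandle;
import com.sun.enterprise.ee.cms.core.GroupManagementService;

public class PayloadSizeExample {
    public static void main(String[] args) throws Exception {
        GroupManagementService gms = (GroupManagementService) GMSFactory.getGMSModule("cluster1");
        GroupHandle group = gms.getGroupHandle();

        // A 100 KB payload: larger than the 64 KB default multicast packet size but
        // still under the 128 KB default MAX_MESSAGE_SIZE, so broadcasting it would
        // fall back to the TCP "virtual broadcast" path rather than UDP multicast.
        byte[] payload = new byte[100 * 1024];

        // Broadcast to every group member for the target component
        // "ReplicationService" (the component name is a placeholder).
        group.sendMessage("ReplicationService", payload);
    }
}
```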
|
MK-7 (General)
Comment: Can you comment on whether the send APIs are synchronous or asynchronous?
Response: The javadoc for GMS sendMessage does not provide any control over whether sends are synchronous or asynchronous; it is implementation dependent. The current GMS sendMessage implementation over Grizzly is synchronous. There is a possibility of using Grizzly's asynchronous write capabilities, but we would not recommend it: when message sending gets far ahead of message processing, the typical messaging technique for avoiding unbounded memory consumption on the message destination is to throttle the senders, and that is currently working well in our GMS messaging simulation runs. It is not possible to throttle senders when the send is asynchronous.
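
A minimal sketch of why the synchronous send acts as a throttle, assuming the Shoal GroupHandle point-to-point send API; the member and component names are placeholders.

```java
import com.sun.enterprise.ee.cms.core.GMSException;
import com.sun.enterprise.ee.cms.core.GroupHandle;

public class ThrottledSender {
    /**
     * Sends a batch of updates to one member. Because the current Grizzly-based
     * sendMessage is synchronous, each call blocks until the underlying write
     * completes, so a fast producer is throttled naturally instead of piling up
     * an unbounded backlog of queued messages.
     */
    public static void sendBatch(GroupHandle group, String targetMember, byte[][] updates)
            throws GMSException {
        for (byte[] update : updates) {
            // The target component name is a placeholder.
            group.sendMessage(targetMember, "ReplicationService", update);
        }
    }
}
```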
|