3.1 Load-balancer Review

Document: OnePagerforLoad-BalancerPlugin.html.pdf

JFD-1 (Section 2.1)
Comment: Is the load-balancer plugin optional? Can a customer use some other load balancer that supports session stickiness? Granted, they won't get intelligent failover to replica nodes, but I'm wondering whether customers must use the GF load balancer.
Response: It is not mandatory to use the load-balancer plugin to distribute traffic to a GF cluster. Other options, such as the reverse proxy plugin of our web server or Apache + mod_jk, will work fine as well.
JFD-2 (Section 4.1.3)
Comment: "The process to figure out the replica for a particular session is based on a broadcast mechanism." The in-memory replication spec indicates this is done via a consistent hash of the session id and no broadcast is needed. Is that correct?
Response: Yes; broadcast is the last resort, used only if the session is not found in the selected replica. I will clarify this further in the one-pager.
JFD-3 (Section 4.1.3)
Comment: Approach 1 vs. Approach 2. Approach 1 seems counter to the in-memory replication design, which relies on the ability of any instance to determine the replica via a consistent hash. Having the load balancer involved in selecting the replica seems counter to this design. My vote goes for Approach 2.
Response: The consistent-hash-based algorithm is a pluggable implementation. In the case of Approach 1, an appropriate implementation needs to be plugged in; it would replace the consistent hash algorithm and retrieve the replica information stamped by the load-balancer plugin.
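
The pluggability mentioned in JFD-3 (and MK-3 below) could look roughly like the sketch below. The ReplicaLocator interface and the two implementations are names invented for the illustration, and the hash shown is a simple modulo over the list of alive instances rather than a true consistent-hash ring; the point is only to show how an Approach 1 implementation could replace the default one and use the replica information stamped by the LB plugin.

    import java.util.List;

    interface ReplicaLocator {
        /** Returns the instance expected to hold the replica of this session. */
        String findReplicaInstance(String sessionId, List<String> aliveInstances, String lbStampedReplica);
    }

    /** Default behaviour: pick the replica by hashing the session id over the alive instances. */
    class ConsistentHashLocator implements ReplicaLocator {
        public String findReplicaInstance(String sessionId, List<String> aliveInstances, String lbStampedReplica) {
            int bucket = (sessionId.hashCode() & 0x7fffffff) % aliveInstances.size();
            return aliveInstances.get(bucket);
        }
    }

    /** Approach 1: trust the replica instance name stamped on the request by the LB plugin. */
    class LbStampedLocator implements ReplicaLocator {
        private final ReplicaLocator fallback = new ConsistentHashLocator();
        public String findReplicaInstance(String sessionId, List<String> aliveInstances, String lbStampedReplica) {
            return (lbStampedReplica != null)
                    ? lbStampedReplica
                    : fallback.findReplicaInstance(sessionId, aliveInstances, lbStampedReplica);
        }
    }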
MK-1 (Section GEN)
Comment: How does the LB detect failure, and how fast does it detect failure? If the LB can detect failures faster than GMS, can this information (that an instance has failed) be passed to GMS? Here is one approach (I may be totally wrong here, but...): say the LB detects an instance failure (node n2 failed); it then stamps the failed instance name in every redirected request (for a certain period of time). A GF valve, installed in the web container layer, can interpret this information and pass it (through a callback handler) to GMS. GMS can then declare this node to be in the FAILED state (perhaps after doing a ping). This could be an interim solution until we resurrect the node agent. In fact, the callback interface can be thought of as a way to have a pluggable 'fast failure detection' mechanism. The valve itself can be provided by GMS (and discovered by the web container using HK2).
Response: If there is an instance/system failure, it will be detected almost immediately. Network outages may take more time to detect. It is not difficult to detect instance failure; even now it can be detected using proxy headers and cookie information. There is no need to stamp any new headers.
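
MK-1's suggestion could be pictured roughly as follows. The header name X-LB-Failed-Instance and the FailureCallback interface are invented for the illustration, and a plain servlet filter stands in for the web-container valve mentioned in the comment; note that the response above considers such stamping unnecessary because existing proxy headers and cookie information already carry the needed information.

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;

    interface FailureCallback {
        /** Hypothetical hook through which GMS would be told about a suspected failure. */
        void suspectInstanceFailed(String instanceName);
    }

    public class FailureHintFilter implements Filter {
        private final FailureCallback gmsCallback;

        public FailureHintFilter(FailureCallback gmsCallback) {
            this.gmsCallback = gmsCallback;
        }

        public void init(FilterConfig config) {}

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            String failed = ((HttpServletRequest) req).getHeader("X-LB-Failed-Instance");
            if (failed != null) {
                // GMS could verify (e.g. ping the instance) before declaring it FAILED.
                gmsCallback.suspectInstanceFailed(failed);
            }
            chain.doFilter(req, res);
        }

        public void destroy() {}
    }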
MK-2 (Section 4.1.3)
Comment: "The process to figure out the replica for a particular session is based on a broadcast mechanism." Actually, the replication module uses a consistent hash algorithm to select a replica. The replication layer will resort to a broadcast only if the data is not found in the replica cache (of the instance that is computed as the replica). In fact, there is a separate API provided by GMS that exposes the previous view of the cluster, so we know where the data was replicated before an instance failure.
Response: I will update the document as per this information.
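
A minimal sketch of the retrieval flow described in MK-2: a direct load from the instance computed by the consistent hash, with a broadcast only as the fallback. The SessionStore interface and its method names are assumptions made for the illustration, not the actual replication API.

    import java.util.List;

    interface SessionStore {
        byte[] loadFromInstance(String instanceName, String sessionId);        // direct load
        byte[] broadcastLoad(String sessionId);                                // ask every instance
        String computeReplica(String sessionId, List<String> aliveInstances);  // consistent hash
    }

    class ReplicaLookup {
        private final SessionStore store;

        ReplicaLookup(SessionStore store) {
            this.store = store;
        }

        byte[] findSession(String sessionId, List<String> aliveInstances) {
            String expectedReplica = store.computeReplica(sessionId, aliveInstances);
            byte[] data = store.loadFromInstance(expectedReplica, sessionId);
            if (data == null) {
                // Last resort (e.g. after a failure changed the cluster view):
                // broadcast the lookup to the remaining instances.
                data = store.broadcastLoad(sessionId);
            }
            return data;
        }
    }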
MK-3 (Section 4.1.3)
Comment: Approach 1 vs. Approach 2. If we follow Approach 1 and are fronted by a different LB, the replication layer will have to resort to broadcast. I understand that GF will operate sub-optimally when fronted by a different LB, but in the more common case (that is, the single-failure case), by using a consistent hash algorithm the replication layer can still perform a direct load from the replica (even when fronted by a different LB).
Response: Since this is a pluggable solution, the consistent hash algorithm can be enabled when using another load balancer. Right?
MK-4 (Section 4.1.3)
Comment: I understand that the 'instance identification' logic will be duplicated in Approach 2, but let's say that GF's web container sets a cookie called (say) 'replicaInstance: instance6'. Would it be possible for the LB plugin to then rewrite this cookie on the way out? That way, the encoding logic would live in one place. Even if it is not possible, I still prefer Approach 2 for the reasons cited above. Moreover, to maintain stickiness, the web container anyway needs to encode the sticky instance name, which implies that the logic needs to be duplicated in both the LB and the web container layer. Correct?
Response: It needs to be handled by the web container. For the URL-rewriting case, the LB plugin would need to parse the complete response body to replace this value, which would cause a lot of performance degradation. Also, cookies have a path associated with them, which is best known to the web container.
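
A small illustration of the point in the MK-4 response: the web container knows the context path that should scope the cookie, which is why it is the natural place to set the replica hint. The cookie name replicaInstance comes from the comment; the helper class is hypothetical and uses only the standard Servlet API.

    import javax.servlet.http.Cookie;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ReplicaCookieWriter {

        /** Adds a cookie telling the LB plugin which instance holds the session replica. */
        public static void addReplicaHint(HttpServletRequest request,
                                          HttpServletResponse response,
                                          String replicaInstanceName) {
            Cookie hint = new Cookie("replicaInstance", replicaInstanceName);
            // The container knows the correct scope for the cookie; an external
            // LB rewriting it would have to guess or hard-code this path.
            String path = request.getContextPath();
            hint.setPath(path.isEmpty() ? "/" : path);
            response.addCookie(hint);
        }
    }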
MK-5 (Section 6.1)
Comment: Preferred fail-over feature support is listed as available only in MS4. This seems too late. Can it be made available by MS3?
Response: It will be difficult to get it done by MS3 owing to commitments to other tasks. We will try to make it available earlier in MS4.