The discussion mainly focused on a list of design decisions.
Most of these decisions have an impact on the requirements.

  • SS/SAS replication configuration per application

The replication of the SIP Session and the SIP Application Session is configured per application, based on the deployment descriptor.
Changing the replication parameters is only possible by re-deploying the application.

  • DS/DF replication configuration for the container

The replication of the DialogueSet and DialogueFragment is configured per container, probably as part of domain.xml.

  • Replication scope: 'modified-session' or 'session'

The scope of the SS/SAS/DF/DS replication is either modified-session or session.
The advantage of the session scope over the modified-session scope is that it will also replicate modified attributes in the session even if the application did not explicitly invoke setAttribute.

2007-06-09: However, well-written applications should not rely on this functionality. In general the modified-session scope is preferred for performance reasons. Both options will be provided and tested. Should we not just pick one option only? This would also simplify the testing.
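
As an illustration (plain JSR 289 servlet code; CallLogServlet and the "callLog" attribute are made up for this sketch), this is the kind of in-place change that is only picked up automatically under the session scope:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import javax.servlet.ServletException;
    import javax.servlet.sip.SipServlet;
    import javax.servlet.sip.SipServletRequest;
    import javax.servlet.sip.SipSession;

    public class CallLogServlet extends SipServlet {

        @Override
        protected void doInvite(SipServletRequest req) throws ServletException, IOException {
            SipSession session = req.getSession();

            @SuppressWarnings("unchecked")
            List<String> log = (List<String>) session.getAttribute("callLog");
            if (log == null) {
                log = new ArrayList<String>();
            }
            log.add(req.getCallId());

            // Under 'session' scope the container would also replicate this in-place
            // mutation. Under 'modified-session' scope the attribute must be re-set,
            // otherwise the change is not marked dirty and is not replicated.
            session.setAttribute("callLog", log);

            req.createResponse(200).send();
        }
    }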

  • Replication frequency: sip-transaction

The sip-transaction frequency is similar to the web-method frequency, i.e., replication is driven by the requests (incoming or outgoing).

  • Consistency HTTP and SS/SAS replication

The scope and frequency of the HTTP and SS/SAS replication must be the same.

  • Consistency of DS/DF and SS/SAS replication

Applications deployed on a container for which replication is not enabled will either fail or revert to non-replicating mode with a warning.
Dynamically changing the replication parameters of the container should not affect any already deployed applications.
E.g., disabling replication for the container (DF/DS) is not allowed when there are applications that have replication enabled (SS/SAS).

We should consider dynamically determining whether the DF/DS should be replicated, based on whether any of the applications associated with the chain have replication active. If none of the applications associated with a DF have replication enabled, it is a waste of resources to replicate the DF/DS.

If this is optimised, then allocating and initialising the JXTA resources can be delayed until the first application with replication enabled is deployed, improving the startup time of the container.

EVDV: Still, I think that it might be useful to always enable replication for the container by default, or at least have a default replication configuration. Explicit configuration of the container might not be needed if replication can be fully driven by the applications on top.
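
A rough sketch of the idea, with invented names (ApplicationInfo and DialogueFragmentReplicator are not SailFin classes): replication of a DF/DS is skipped unless at least one application in the chain replicates, and the JXTA transport is only initialised on first use.

    import java.util.List;

    interface ApplicationInfo {
        // Taken from the application's deployment descriptor.
        boolean isReplicationEnabled();
    }

    class DialogueFragmentReplicator {

        private volatile boolean transportInitialised;

        // Replicating the DF/DS is only useful if at least one application in
        // the chain replicates its SS/SAS state.
        boolean shouldReplicate(List<ApplicationInfo> chain) {
            for (ApplicationInfo app : chain) {
                if (app.isReplicationEnabled()) {
                    return true;
                }
            }
            return false;
        }

        void replicate(Object dialogueFragment, List<ApplicationInfo> chain) {
            if (!shouldReplicate(chain)) {
                return;                   // replicating would be a waste of resources
            }
            if (!transportInitialised) {
                initTransport();          // delayed JXTA allocation/initialisation
            }
            // ... serialise the DF/DS and ship it to the buddy ...
        }

        private synchronized void initTransport() {
            if (!transportInitialised) {
                // allocate and initialise the JXTA resources here
                transportInitialised = true;
            }
        }
    }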

  • sync and async replication

Both asynchronous and synchronous replication are supported.

Note that synchronous replication does not actually do a full round-trip handshake.
Synchronous replication blocks until the message is delivered to the TCP transport. This is not a guarantee that it has been received and stored by the buddy.
Asynchronous replication returns as soon as the message is put in the JXTA queue.
Effectively, this means that synchronous replication does not eliminate the possibility of getting out of sync with the buddy, but it does decrease the 'window of vulnerability'.
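
A simplified sketch (invented names, not the actual framework code) of the difference in semantics:

    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class ReplicationSender {

        private final BlockingQueue<byte[]> outQueue = new LinkedBlockingQueue<byte[]>();

        // Asynchronous: returns as soon as the message is enqueued.
        void sendAsync(byte[] replicaUpdate) {
            outQueue.offer(replicaUpdate);
        }

        // 'Synchronous': blocks until the message has been written to the TCP
        // transport, but does NOT wait for an acknowledgement from the buddy,
        // so delivery is still not guaranteed.
        void sendSync(byte[] replicaUpdate) throws IOException {
            writeToTcpTransport(replicaUpdate);
        }

        private void writeToTcpTransport(byte[] data) throws IOException {
            // Hand the bytes to the socket. A fully synchronous variant would
            // additionally wait here for a response message from the buddy.
        }
    }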

There is code present for a full synchronous solution, but this has never been system tested. It would also perform worse than the current synchronous replication, since it requires a response message.

Before the SailFin release the replication framework will use a new version of JXTA that offers various performance improvements. The queueing will be removed from the JXTA part and only the TCP queue will remain.
The difference between async and sync replication will probably become smaller.

AP EVDV + Johan: evaluate whether the current synchronous replication is acceptable.

20070906: The fully synchronous solution might suffer from hanging threads after a crash of the buddy. The GMS notifications indicating a buddy down will not unblock these threads. However, since the assumption is that a fully synchronous solution is not needed, this is not a problem.

  • lazy reactivation after failure

After a failure we will not try to eagerly re-activate the sessions. The mean time between failures, in combination with the mean time for reactivation (driven by either a timeout on the SAS or by a message on the session), will ensure that the chance of session loss at a second failure is negligible.

AP EVDV + Johan: check that the assumptions made by the model regarding mean time between failures and reactivation are acceptable. E.g., in practice the standard deviation of the failures might not be so high (e.g., more failures under high-load situations).

  • activation by timeouts in replica

We will monitor the timers in the replica for timer expiry. When a replica's timer expires, the replica will be activated.
The timeout used in the replica will be the original timeout of the timer plus an additional delta, to avoid false positives.
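
As a sketch (the class name and delta value are assumptions, not SailFin code):

    class ReplicaTimerMonitor {

        // Safety margin added on the replica side to avoid false positives;
        // the value here is only an example.
        private static final long ACTIVATION_DELTA_MS = 30000L;

        // The replica only fires once the primary has clearly missed the
        // original expiry time of the timer.
        boolean shouldActivate(long primaryExpiryTimeMs, long nowMs) {
            return nowMs >= primaryExpiryTimeMs + ACTIVATION_DELTA_MS;
        }
    }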

There will be at least one timer per SAS. This might prove a problem for the replica. After a failure of the primary instance, the replica might have to scan a million timers (300K sessions each with 3 or more timers).

In general, the number of objects is an issue. In a normal situation (no @SAK or encoded URI, only one application involved in the call) there will be four objects per SIP session: one SS, one SAS, one DF and one DS (although the latter might be modelled as one replicated object). When stating 300K sessions per instance, is this assuming all of these are present?

AP EVDV + Johan: make clear how many of each object should be present in the performance requirements of 300K sessions.
AP EVDV: explore the possibility of doing lazy SAS creation and examine the impact of this on the cleanup of the SSes.

It has to be investigated where the replica should be activated on timeout.
It would be most beneficial if the replica could be activated on the instance to which the LB will also direct the traffic. This could mean that the BEKey might have to be stored in the SAS. If we do this, then we should also model the timers as embedded in the SAS and we should have the deserialised timeout value at the SAS level.

AP EVDV + Joel: check if activating the timer on the correct instance is possible.

If it is not possible to do the reactivation on the correct instance, then it could still be beneficial to spread the reactivations over the cluster. The chance of being activated on the 'right' instance is the same, but at least the load will not be concentrated on the replica of the failed instance.

20070906: The indications are that it would indeed be possible to activate on the correct instance. The BEKey is already stored as part of the SAS and the LBManager is accessible from the SipSessionManager. The LBManager would implement the consistent hashing algorithm. The same mechanism as the reverse repair could be used to trigger the loading of the session on the correct instance; it would merely be one session id in the list instead of multiple session ids. The main question is how the replica gets the BEKey. Maybe it should then also be stored in deserialised form? Or we make this a two-step approach: first activate the replica on the buddy, then peek inside, then trigger the migration to the correct instance?
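
A hypothetical sketch of that routing decision (ConsistentHasher, TimerActivationRouter and the method names are invented, not the actual LBManager API):

    import java.util.List;

    interface ConsistentHasher {
        // Maps a BEKey to the instance that the LB would route traffic to.
        String instanceFor(String beKey, List<String> aliveInstances);
    }

    class TimerActivationRouter {

        private final ConsistentHasher hasher;
        private final String localInstance;

        TimerActivationRouter(ConsistentHasher hasher, String localInstance) {
            this.hasher = hasher;
            this.localInstance = localInstance;
        }

        // Decide where a SAS replica whose timer has expired should be activated.
        String chooseActivationTarget(String beKeyFromSas, List<String> aliveInstances) {
            String target = hasher.instanceFor(beKeyFromSas, aliveInstances);
            if (!target.equals(localInstance)) {
                // Trigger loading on 'target', e.g. with the same mechanism as
                // reverse repair, but with a list containing just this session id.
            }
            return target;
        }
    }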

  • peek mechanism

For the timer-based activation the replica must be able to check whether the re-activated copy is the most recent version in the cluster (since there is no version number in the timeout). For this, a peek functionality is needed, similar to the current load mechanism but reporting only the version instead of the complete replica.
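
Sketched as an interface (names assumed, not the actual replication framework API):

    interface ReplicaStore {

        // Existing style of load: broadcast if needed and return the newest full replica.
        byte[] load(String sessionId);

        // Proposed peek: report only the highest version number known in the
        // cluster, so a timer-driven activation can check for staleness cheaply.
        long peekVersion(String sessionId);
    }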

  • there will be a variant of the load request that returns and deletes both the active copy and the replicas.

Migration of active sessions is needed in a UC-based LB strategy (and in other situations). To support this there must be a variant of the load request that removes the active version. It is recommended that all replica versions are also removed. This should only be done when the re-activation or migration is completed.

We have to be careful with making a destructive load; the code should be very careful to call the load request only once. EVDV: I still do not completely get this. If the load request is done twice, then the second one will not result in a broadcast since the correct version is already present. If both load requests occur on the same instance from different threads, then I see a potential problem similar to the locking problem. If the requests come from different instances, then we are back to the locking problem. I am not sure whether solving the locking problem would not also automatically solve this.

  • sipSessionsUtil does not cause migration

SipSessionsUtil.getApplicationSession(id) will only return the version of the SAS that is available on the local instance. If the SAS is not present on the local instance it will return null.
It will not cause the migration of the SAS from another instance.

Also covering the migration is a nice-to-have, at the cost of some performance.
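
A usage sketch from the application side (the lookup class is invented; the method name getApplicationSessionById follows the final JSR 289 API, whereas the text above refers to it as getApplicationSession):

    import javax.servlet.sip.SipApplicationSession;
    import javax.servlet.sip.SipSessionsUtil;

    class LocalSasLookup {

        private final SipSessionsUtil sessionsUtil;

        LocalSasLookup(SipSessionsUtil sessionsUtil) {
            this.sessionsUtil = sessionsUtil;
        }

        SipApplicationSession find(String sasId) {
            // Returns null if the SAS is not in the local active cache; no
            // migration from another instance is triggered, so callers must
            // be prepared to handle null.
            return sessionsUtil.getApplicationSessionById(sasId);
        }
    }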

  • lazy reactivation after shutdown

Shutdown will be handled similar to failure. We will not actively re-activate sessions.

There might be a quiescing, probably only at the message level. This means that the LB will not quiesce, but the operator will still wait for a while until all the traffic in the instance has terminated.

There is no support for doing the actual shutdown based on the presence of traffic. This is purely a human decision and typically based on time instead of on counters.

  • After restart (after shutdown) or recovery (after failure) we do an eager reactivation (reverse-repair)

After a restart or recovery a repair-under-load and a reverse-repair are done.
The repair-under-load will ensure that the replica cache of the restarted/recovered instance is in sync with its neighbor.
The reverse-repair will ensure that the active cache of the restarted/recovered instance is in sync with its buddy.

Reverse repair needs information about the origin of the data.
It will probably be implemented thus:

    • the buddy sends a list of session ids to the restored instance.
    • the restored instance will do a 'normal' load request for all of them.

We could optimise the latter by only loading from the buddy, but this would expose us to the risk of multiple active copies (e.g., in race conditions).
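
A sketch of this flow with invented names (ReplicationClient and ReverseRepair are not the actual classes):

    import java.util.List;

    interface ReplicationClient {
        // Buddy -> restored instance: the ids of sessions that originated there.
        void sendSessionIdList(String targetInstance, List<String> sessionIds);

        // 'Normal' load request: broadcasts to the cluster if needed.
        Object load(String sessionId);
    }

    class ReverseRepair {

        private final ReplicationClient client;

        ReverseRepair(ReplicationClient client) {
            this.client = client;
        }

        // Buddy side: tell the restored instance which sessions it used to own.
        void notifyRestoredInstance(String restoredInstance, List<String> originatedThere) {
            client.sendSessionIdList(restoredInstance, originatedThere);
        }

        // Restored-instance side: reclaim each session with a normal load, which
        // avoids multiple active copies at the cost of extra broadcasts.
        void reclaim(List<String> sessionIds) {
            for (String id : sessionIds) {
                client.load(id);
            }
        }
    }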

  • replicate objects separately and no transactions

The objects (SS/SAS/DS/DF and possibly timers) are replicated separately.
Some of the actions might influence multiple objects (e.g., creating a new SS in a SAS).
The replications associated with these actions are not done inside one transaction.
This means that it can happen that one replication succeeds, but the associated replication does not. In such cases there might be an inconsistency in the replica.
We will try to avoid such situations on a best-effort basis, but they can never be avoided entirely without an (expensive and error-prone) transaction mechanism.

  • no integrity guarantee

As stated above, the replicas can become inconsistent. However, each object (or rather, artifact) also contains a version number. These version numbers cannot really be used for checking the consistency between artifacts, since versioned references between objects would lead to every object in a tree being updated for every replication of a part. This means that, in addition to the chance of getting inconsistent replicas, these inconsistent replicas can also not be detected.

  • CSeq for staleness check (DOS and no guarantee)

The version numbers are used in the HTTP case to check for staleness of the active session and to request the correct version of the session in case the active session is stale.

In SIP there is no compatible cookie mechanism that can be used to transport the latest version of the session. In SIP the CSeq can be used for similar purposes, but it has many limitations. This means it can be used for a sort of staleness check, but it should be taken as a hint of a potentially stale session rather than a guarantee of a stale session. I.e., there may be gaps in the CSeq number range.

This opens up the possibility of a DoS attack. A malicious or badly written client that skips CSeq numbers will cause unnecessary load requests.
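
As a sketch (assumed logic, not actual container code), the CSeq check can only flag a possibly stale session:

    class CSeqStalenessCheck {

        // Returns true if the locally active session might be stale: a gap in
        // the CSeq numbering may simply be a client that skipped numbers, so
        // this is a hint that triggers a load request, not proof of staleness.
        boolean maybeStale(long cseqInRequest, long highestCseqSeenLocally) {
            return cseqInRequest > highestCseqSeenLocally + 1;
        }
    }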

  • replication suspended during upgrade

During upgrade the replication is suspended.
Suppose that S2 is being upgraded. During the time S2 is unavailable, S1 would not replicate to S3, and no repair-under-load from S1 to S3 would be done either.

The latter was an assumption made by me (Erik); it has to be validated whether this actually occurs in a normal failure situation. In addition to lazy re-activation there might also be a lazy re-replication mechanism at work, whereas the FSD assumed an active re-replication mechanism (repair-under-load) for both the failure and the restore/restart case.

AP: Jan, check with Larry if we do lazy re-replication or repair-under-load.

  • no replication of ServletContext

The servlet context will not be replicated.
Correctly written applications should not use this functionality.
PGM needs this to associate its own keys with a SAS.

The way proposed by JSR 289 to do this is to allow the application to add additional keys to a SAS (called linkSession now in SailFin, but likely something like addKey() in the final version of JSR 289).

It might complicate the design if the application can look up a SAS by its secondary key. The replication framework has to be aware of the secondary keys. Possibly the secondary index can be created on the fly?
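
One possible shape of such an on-the-fly secondary index (purely hypothetical, not the SailFin design):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class SecondaryKeyIndex {

        private final ConcurrentMap<String, String> keyToSasId =
                new ConcurrentHashMap<String, String>();

        // Called when the application adds an additional key to a SAS
        // (linkSession / addKey).
        void addKey(String secondaryKey, String sasId) {
            keyToSasId.put(secondaryKey, sasId);
        }

        // Lookup used by the container; returns null if the key is unknown here.
        String resolve(String secondaryKey) {
            return keyToSasId.get(secondaryKey);
        }
    }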

  • Memory-full situations are handled by the load regulation and max number of sessions settings.

There are two ways of ensuring that the number of sessions and the related memory usage are kept in check. The overload protection mechanism from EAS 4.1 implements checks on the memory for throttling. At the first limit all new sessions are rejected, but subsequent requests are still allowed. At the second limit, both are rejected.

In addition, the active cache has a setting that controls the maximum number of active sessions. Together with an estimated session size, this should give some protection as well.
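
A sketch of how these two mechanisms could combine (class name, thresholds and method names are assumptions):

    class OverloadGuard {

        private final double firstMemoryLimit;    // e.g. 0.80 of heap: reject new sessions
        private final double secondMemoryLimit;   // e.g. 0.90 of heap: reject everything
        private final int maxActiveSessions;      // cap on the active cache

        OverloadGuard(double firstMemoryLimit, double secondMemoryLimit, int maxActiveSessions) {
            this.firstMemoryLimit = firstMemoryLimit;
            this.secondMemoryLimit = secondMemoryLimit;
            this.maxActiveSessions = maxActiveSessions;
        }

        // Initial requests (new sessions) are refused at the first limit or
        // when the active cache is full.
        boolean acceptInitialRequest(double memoryUsage, int activeSessions) {
            return memoryUsage < firstMemoryLimit && activeSessions < maxActiveSessions;
        }

        // Subsequent requests in existing sessions are only refused at the second limit.
        boolean acceptSubsequentRequest(double memoryUsage) {
            return memoryUsage < secondMemoryLimit;
        }
    }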

  • ? do we also support migration of active sessions that are locked?

The discussion is still ongoing; the proposal is to reject requests to migrate a locked active session.

  • ? do we do anything during network segmentation?

There was a proposal from /// to do a reconciliation. This would remove all the duplicates and stale replicas from the cluster after a merge.
This might also be useful before or after a reverse-repair. It could be cheaper than issuing a separate load request for each re-activated replica; instead a simple copy from the buddy would suffice.

  • We do not migrate or reactivate the complete tree (SAS, SSes, DFs, DSes?)

Just as replication is based on single artifacts, so are migration and re-activation.
It is not trivial to deduce the complete connected graph from the information stored in the replica (i.e., in deserialised form), so collecting the correct information together is not easy.

  • We quiesce on message level.

The LB will not do any quiescing before shutdown. It will handle this similarly to a failure. However, the instance is not shut down until it has had the time to handle the ongoing requests. This could be based on some counter of the number of ongoing requests, but this will not be automated.

  • Dual ethernet

We will use a dual/redundant Ethernet setup.
The Session Replication framework can work with this, provided the cluster is configured correctly.
The failover (and/or load sharing) of both connections is even transparent to the Java code.