Rolling Upgrade in Sailfin

Open Issues

  • How to handle the CANCEL in case of quiescing; it cannot be routed to the correct instance.
  • Can we really confirm the INVITE transaction at the 200OK instead of at the ACK? This will have some impact on timer handling and the persistent state.
  • How to handle PRACK and BYE on early dialogs?

Introduction

Scope

The purpose of the rolling upgrade is to upgrade an application, the AS, the OS, the hardware or a combination of these with a minimal loss of service and sessions.
This document describes a possible implementation of rolling upgrade in the Sailfin project. Parts in italics indicate questionable areas that need to be further refined.

Terminology

References

Architectural Introduction

Overview

The first implementation of rolling upgrade in Sailfin will be based on the save/restore principle. This assumes relatively few mutations to the caches happen during a roll (the time it takes to upgrade one instance in a rolling upgrade).

The solution is based on the current session replication solution.

Architectural Goals, Principles and Constraints

Goals

Principles

Constraints

Use-Case view

One roll

The operator ...

Logical View

RU relation with SSR

The main requirement is that during and after a rolling upgrade the sessions are still accessible. For this we need some of the functionality from Sip Session Replication (SSR).
More details and partly outdated discussions on SSR, including some notes on rolling upgrade, can be found in the geek notes [1]. The text in this document is both a summary of, and an extension to, those geek notes.

SSR design in a nutshell

During the design of SSR, it was decided that we would not cater for subsequent failures (also called back-to-back failures), given that the chances of these are very low. Instead we decided to rely on:

  • Lazy replication
    Updating the replica cache only when the active copy is modified because of new traffic or timer events.
  • Lazy re-activation
    Restoring the active copy only when new traffic arrives or a timer event occurs for a replica object.
  • Lazy migration
    Migrating the active copy when it is accessed due to new traffic or based on a timer event.

We also decided that we would not strive for a 'properly distributed' configuration (a situation where every artifact is active on its home instance, i.e., the home instance equals the owner instance).
The downside of that choice is that we cannot know whether a request can be handled locally or whether a re-activation or migration is needed to handle the request. To get this information, while still adhering to the lazy migration and re-activation principles, we introduced the expat list, which gives information on whether a session lives elsewhere in the cluster.
SSR handles shutdown very similarly to a failure, except that there is a quiescing period during the shutdown in which we try to complete ongoing transactions; this avoids transaction loss when the instance is stopped.

Upgrade and SSR

Rolling upgrade from an SSR perspective can be modeled as a sequence of subsequent failures. The lazy replication/re-activation/migration strategy will not work in such a case, since this does not guard against subsequent failures.
Therefore, we have to use eager replication and eager re-activation to establish a robust system again, where every set of data has a copy. There is no need to implement eager migration for this reason; a migrated session is already protected, since migration will also have triggered replication.
There are two main ways to restore the situation of the upgraded server as it was before the upgrade (minus any sessions that migrated during the upgrade): disk based repair and partner based repair. A third option, memory based repair, is listed below for completeness.

  • Disk based repair.
    This is optimized for the case where the upgrade time per instance is short enough that the majority of the sessions have not been changed. It has the advantage that most of the repair handling is done by the upgraded instance. This should limit the throughput loss and the traffic loss (errors) due to an overloaded instance during the repair.
    E.g., PGM with a lot of sessions (300K per instance) of which relatively few are touched during the upgrade is the prime example of this.
  • Partner based repair.
    This option involves the partners more in the repair process. This can lead to more throughput loss and more traffic loss due to high CPU usage during the repair. Also, it can cause a lot of traffic over the JXTA pipes (the complete replica and active cache).
    However, if we assume short lived sessions, most of which are updated during the upgrade of an instance, the differences between the two become less pronounced and the partner based repair might have advantages here. The question is, of course, whether SSR would be enabled at all for such applications.
    The current SSR implementation already has a time-based re-activation strategy for long lived sessions. If the definition of long-lived session can be re-configured during a rolling upgrade, this time-based re-activation can be used as a cheap substitute for the eager repair.
  • Memory based repair
    This option would only work for application upgrade and not for AS or OS upgrade, nor for hardware upgrade.

It would mean that two versions of the application would be deployed at the same time, where one is active and the other de-activated (or quiescing?). The active and replica cache could be transported between these in-memory applications very efficiently (with the same memory footprint). There will be some overhead because active sessions have to be serialised and de-serialised to move between classloaders.
This is mainly kept here as an option for further study, maybe in glassfish V3?

Technical Descriptions

In general the choreography of a rolling upgrade looks as follows.
The following steps are performed only once in the cluster:

  • set dynamic-reconfig to disabled
    This ensures that instances are not syncing their configuration and apps with the DAS anymore.
    (This is a manual asadmin set command)
  • We also require a property under <availability-service> to be set to "true" during rolling upgrade.
    example: <property name="rolling-upgrade-underway" value="true"/>
    This is a manual asadmin set command.
    This is optional and depends on the selected alternative for save; if we have separate save and restore commands then this is not needed.
  • Backup the cluster configuration
    (not sure why this is here - leaving it so it can be discussed)
  • Deploy the new version of the app on the DAS (in case of an application upgrade).
  • Roll each server in turn (described below)
  • reset the dynamic-reconfig and rolling-upgrade-underway flags
    The rolling-upgrade-underway is optional and depends on the selected alternative for save.
    Again, depending on whether we have separate save/restore commands.

Rolling each server in turn is done in the following steps. Each of these steps is explained in more detail below.

  1. Quiescence
    Tell the load balancer to disable the instance. This removes it from the consistent hash. There is a configurable time delay, which is a form of quiescing, allowing existing requests to finish processing.

This is an asadmin command.

  2. Save
    A new asadmin command is issued to save the data. This command will block until all the data has been saved (or until a configurable maximum time has passed, whichever comes first).
  3. Stop
    Shut down the instance. The instance is disabled. Wait a short while to give the in-flight replication data from that instance the chance to make it to the replication partner.

Alternatively, steps 2 and 3 can be combined into a modified stop command that would have a configurable timeout and that would save the data if the global rolling-upgrade flag is set.

  4. Upgrade
    Perform whatever upgrades are needed while the instance is stopped. This could include things like a hardware repair, an OS patch, etc. (If only an application is being upgraded, this step may be skipped, but the instance must still be stopped and restarted.)
  5. Restart
    Restart the instance. The instance will retrieve its configuration state and any new application version from the DAS. This will not yet result in any traffic being received, since the CLB is not enabled yet and any ongoing transactions should have finished either in the quiescing or in the upgrade period.
  6. Restore
    If rolling-upgrade-underway is set to true, the instance will restore its state from disk.
  7. Enable CLB
    The instance is enabled in the load balancer. The instance will start receiving new traffic again.
  8. Reconcile
    Then there are some repair/reconciliation actions on the restored state that was retrieved from disk, since it is outdated.
    The neighbors (which are involved in this repair) must not be upgraded before the repair is finished.
    We need a feedback mechanism to achieve this (probably just writing to the log file that reconciliation is done is not good enough).

Quiescence

The purpose of quiescence is to limit the number of lost transactions when the server is upgraded. There are two reasons why we might lose transactions if we do not quiesce. One reason applies to the BE and one to the FE.

  1. Since session replication only happens at the end of the SIP transaction, any SIP transaction that is not completed on the BE when the instance is stopped will be lost.
  2. Since the responses from a SIP request are always routed via the CLB Frontend (if FE and BE are not co-located for the request), no responses can be sent after the FE is gone (or is unreachable).

The command to quiesce is the same command that disables the instance in the load balancer (i.e., removes it from the consistent hash). This will start the quiescing period. However, the question is whether it should block until the quiescing period is completed. This is probably needed so that automated scripts can perform the quiescing without having to rely on log inspection or local sleeps.
We need more information on how to configure the quiescing period (default 32 seconds). Will this be part of the command as well?

BE Quiescence

Given the decision to only replicate on SIP transaction boundaries, there is no possibility to continue a transaction on another instance, since the transaction data is not available on any other instance.
If transactions are not to be lost, we must allow the BE to complete the ongoing transactions while not starting any new ones.
Stopping new transactions
New transactions can be started in various ways:

  • New incoming requests
    New incoming initial or subsequent requests must not end up on the quiescing instance. The reconfiguration of the CLB will ensure this.
  • Timeouts
    Any timeouts (SAS timeout, ST timeout) happening on the quiescing instance must not result in new transactions being started on that instance. This is already ensured by triggering timeouts on the current home instance. Since, after the CLB reconfiguration, the quiescing instance is no longer the home instance of any object owned by it, any timeout trigger will result in one of the following:
    • Migration
      If there are no ongoing transactions preventing migration.
    • Ignore
      If there is an ongoing transaction preventing migration, the timeout is ignored. However, the expiration time is not changed, and after fail-over the replica partner will cause the timer to expire. The maximum delay will be the quiescing time.
  • New outgoing requests
    New outgoing requests can start new SIP transactions. They do not happen by themselves, but they can be generated in the context of:
    • Timeouts
      Excluded based on the above.
    • Incoming requests
      Excluded based on the above.
    • Incoming responses
      We have two possibilities here.
      Either reject any outgoing request during quiescing (for the application this will probably look like an immediate error response received in a different thread, which is not a very nice way to handle this).
      Or just accept the request. The result is that there is less time available for quiescing on the newly started SIP transaction. This is not a major problem, but it increases the chance of a lost SIP transaction.
      The decision is to allow new transactions to be started in this case.
    • HTTP sessions
      HTTP requests will be quiesced in the same way as incoming SIP requests, so this should not happen.
    • True Out-of-Band
      If the outgoing request is sent based on an EJB request, we are talking about true out-of-band. A similar reasoning as for incoming responses holds here.

Finishing ongoing transactions
Even though we should prevent (as far as convenient) any new transaction from being started, we should also allow ongoing transactions to finish. There are several ways an ongoing transaction finishes or continues.
  • Success or Error response
    For some transactions (e.g., SUBSCRIBE, INFO, etc.) the transaction completes when a 2xx (success) or higher (error, e.g., 3xx, 4xx) response is received.

Responses are not routed according to the consistent hash; instead the VIA header includes a reference to the originating instance. Therefore, any response is routed to the correct instance and can be used to complete the transaction.

  • Provisional response
    A provisional response does not finish the transaction, but must be handled in the context of the ongoing dialog. Since the dialog is not yet replicated, it must be routed to the instance where the (early) dialog resides. Since the responses are already routed according to the beroute in the VIA header, there is no problem here.
  • Transaction Timeout
    When no response is received a transaction timeout occurs. The transaction timer is kept on the local instance, so the timeout will occur on the instance that started the transaction and the transaction can be completed.
  • CANCEL
    An INVITE transaction can be canceled by the UAC if the 200OK is not yet received. However, since the CANCEL is routed according to the same DCR rules as the original INVITE and the consistent hash has changed during the quiescing, the CANCEL will be received on the wrong instance (an instance that does not have the transaction) and will be rejected by that instance (481 from the transaction manager).
    Ideally, the CANCEL should be routed to the same instance as where the INVITE was routed.

However, it should be considered that an application must always be ready to accept a crossing of a CANCEL and the 200OK. It can respond to any 200OK or non-100 provisional response with a BYE.

  • PRACK
    An acknowledgement of a provisional response (PRACK) is related to an early dialog and should be routed to the instance where this dialog resides. Since the PRACK is routed according to the contact (TS is UAS) or route (TS is proxy) that was received in the provisional response, it will currently be routed to the wrong instance. Therefore, the retransmissions of the provisional response would not be stopped.

Ideally the PRACK would be routed to the instance where the transaction is ongoing.

  • BYE on early dialog.
    An application is allowed to send either a CANCEL (see above) or a BYE on a dialog after the (non-100) provisional response is received.
    In case of an early dialog the BYE should be routed to the instance where the transaction is handled, since the dialog is not yet replicated and loaded on any other instance.
    However, a BYE on a confirmed dialog should be routed according to bekey, to the current home instance.
  • ACK on 2xx
    The ACK is, unfortunately, routed according to the contact or route information that was received in the 200OK, like any other subsequent request (e.g., re-INVITE or BYE).
    This means in a quiescing situation, the ACK will be routed to the new instance.
    Currently, we replicate on the ACK and NOT on the 200OK for the INVITE scenario. Therefore, the ACK will be received on an instance that knows nothing of the transaction and also cannot load it. Consequently, it will be dropped. And since the ACK will then never be received on the correct instance, the 200OK will continue to be retransmitted until the timer expires, in which case a BYE will be sent.
    So from the UACs point of view it will look quite strange and eventually the transaction will fail.
    A possible solution to this problem would be to already replicate at the 200OK. Then the ACK could be routed to the new instance, which would migrate the ongoing session to that instance. However, this has a large impact on SSR handling. It has to be carefully checked what the consequences are of replicating a dialog before it is confirmed (e.g., the proxy implementation is now only exchanged for a serialisable version after the confirm happens). Since we would replicate the session in a state where the retransmit timer is running (for retransmitting the 200OK), this also implies that this timer must be guarded on the replica, where currently we only guard SAS and ST timers. Also, this solution would mainly work for ACK on 2xx and not cover any of the other scenarios in this section.
  • ACK on error response
    Like the ACK on 2xx, the ACK on error response is also used to stop re-transmissions. This ACK is hop-by-hop and not end-to-end and is handled in a different layer (in the transaction layer), but from a routing perspective the same issues hold as for the ACK on 2xx.
    Since we normally do not replicate on the error response, the dialog can never be migrated to another instance, so there is no possibility to handle this ACK on any other instance than the instance where the transaction is located.
  • Speedy NOTIFY
    A so-called speedy NOTIFY is a NOTIFY that is received before the corresponding 200OK response to the SUBSCRIBE. The speedy NOTIFY is a headache, since according to the spec it must also count as dialog confirmation.
    If a speedy NOTIFY is routed according to the be-key, it might be received on a different instance than where the 200OK is received (routed by be-route). It will then not be marked as speedy and will try to load the session. Since the dialog is not yet confirmed, it is not replicated and not ready for migration (remote locked).

The conclusion is that if the BE is quiescing, incoming initial and subsequent requests as well as timeouts will no longer happen. Responses will still be routed to the correct instance.

From the text above it is clear that there are some requests (CANCEL, ACK, PRACK, BYE) that should be routed to the instance where the transaction they pertain to is located. Several solutions for this have been proposed.

  • Include both be-route and be-key information in initial routing info
    In this alternative we route these 'special' requests (PRACK/ACK) according to the be-route instead of the be-key. Since we can only provide the information once to the UAC, on which all subsequent requests will be routed, this requires that both be-key and be-route are provided in the 200OK or any provisional responses. Any other (non-special) request can be routed (by the FE) according to the be-key information. Since both are available, special and non-special requests can be routed differently.
    This would require changes to the CLB, which currently only provides the be-key in the contact and route information, and the be-route only in the via headers.
    This solution does not work for CANCEL, since the contact or record-route information that includes the be-key and the be-route, is not used in the CANCEL. The CANCEL will contain the same information as the original INVITE.
    Also for the BYE there is the problem that the FE cannot determine whether the BYE is sent in an early or a confirmed dialog.
    For Speedy NOTIFY there is a similar problem. There is no way to know whether the NOTIFY is speedy or not.
    For PRACK and ACK there are different problems. Since the contact and record-route information are effectively immutable, the be-route would indicate the instance where the initial INVITE was handled. For ACKs and PRACKs on the responses to the initial INVITE this is good. However, in case of a combination of re-invite and migration, the re-invite might be received on a different instance than the instance that handled the initial invite. In such a case the be-route information is outdated, and ACK/PRACK requests could be routed to the wrong instance. This might be less of an issue, given that re-invites are relatively rare.
    We can also put a time-limit on the use of the be-route. Here we assume that re-routing based on be-route for special requests is only needed in a short time (32 seconds) after the last cluster reshape.
    In conclusion, this solution does not work properly for CANCEL, BYE or speedy NOTIFY, and only works for ACK/PRACK in the case where there are no re-invites after migration.
  • Include a timestamp and keep multiple cluster configurations
    Disclaimer: I'm not completely sure about this option, so the rest of the text should be read with this in mind.

Instead of including a be-route in the contact/route information for routing special requests, a timestamp is included. Based on the timestamp, the FE will route special requests according to either the current cluster config or the previous cluster config. This will only be done for 32 seconds after the last cluster reshape.
The solution is similar to the be-route alternative with a time limit on the use of the be-route. As far as I can see it has similar limitations.
The proposed solution from the CLB team was to keep all the ongoing sessions sticky for 32 seconds after a cluster reshape. However, there are some issues with SSR (which will not load any objects that do not have their home on the current instance). Also, this would allow new requests to be handled (e.g., OPTIONS or re-INVITE) during the quiescing phase, which will again lead to large chances of these transactions being lost. It would rely on the current transactions being finished within the 32 seconds and no new transactions being started...

  • Re-route after lock detection
    This solution entails that we always route the request according to be-key. The request might end up on the 'wrong' instance. Via the replication framework we will try to load the dialog with the indicated DFid (todo: check how this works for CANCEL, as this does not use the fid that is provided in the contact or route). If the dialog is in the local active cache, the request is handled. If the dialog is not in the local cache, a broadcasted load request is issued. For the ACK and PRACK, the load request should fail with a remote locked exception.
    If we receive the remoteLockedException, some requests, like PRACK/ACK/BYE, can be routed to the instance that generated the remote locked exception (FE->BE->BE). Responses (on BYE, PRACK) have to be routed back via the first BE, in order to remove transactions and reuse existing connections.
    This solution will work for ACK and PRACK as well as for BYE and speedy NOTIFY. It will not work for CANCEL, since the TM already returns the 481 result if the related transaction cannot be found on the instance. It will work after re-invites and after migrations. It does not need to be time-limited. The disadvantage is that the architecture gets a bit mixed up. A rough sketch of this re-routing idea is given after this list.
    Fortunately, the ACK on error responses is handled as a 2xx ACK if the TM cannot find it...
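
The following is a minimal sketch of this re-route-after-lock-detection idea. The DialogStore, InstanceRouter and RemoteLockedException types below are hypothetical stand-ins for the SSR/CLB internals (the real remote-locked signalling may look different), so this is an illustration of the flow rather than actual Sailfin code.

    // Hypothetical abstractions over the SSR/CLB internals; names are illustrative only.
    interface DialogStore {
        Object loadLocally(String dialogFragmentId);                       // active cache lookup
        Object broadcastLoad(String dialogFragmentId) throws RemoteLockedException;
    }

    interface InstanceRouter {
        void forwardToInstance(String instanceId, Object request);        // extra FE->BE->BE hop
    }

    class RemoteLockedException extends Exception {
        final String lockingInstance;                                      // instance holding the lock
        RemoteLockedException(String lockingInstance) { this.lockingInstance = lockingInstance; }
    }

    final class LockAwareRouter {
        private final DialogStore store;
        private final InstanceRouter router;

        LockAwareRouter(DialogStore store, InstanceRouter router) {
            this.store = store;
            this.router = router;
        }

        // Handle a 'special' request (ACK, PRACK, BYE, speedy NOTIFY) that was routed by be-key.
        void handleSpecialRequest(String dialogFragmentId, Object request) {
            if (store.loadLocally(dialogFragmentId) != null) {
                return;                                  // dialog is in the local active cache: handle here
            }
            try {
                store.broadcastLoad(dialogFragmentId);   // normal case: load/migrate the dialog here
            } catch (RemoteLockedException locked) {
                // The dialog is still locked on the instance with the ongoing transaction,
                // so hand the request over to that instance instead of failing.
                router.forwardToInstance(locked.lockingInstance, request);
            }
        }
    }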

FE Quiescence
Currently the incoming requests will be routed by the FE to the BE (if not co-located). These internally routed requests include a VIA header which identifies the connection used for the incoming request (connid) and an indication that this request was routed by the FE (felb).
Outgoing responses will be routed by the BE to the FE based on this VIA, which allows the FE to re-use the incoming connection to send the responses. This allows the same TCP connection or TLS connection that was used for the incoming request to be used for outgoing responses as well.
Unfortunately, this requires the FE to be available when sending an outgoing response. If the FE is externally addressable during the quiescing period, which it will be if during quiescing we only remove the BE from the CLB, the FE will keep on handling incoming requests and hence be required to be available for outgoing responses as well.
In order to do quiescing for the FE as well we have different alternatives.

  • Disable the FE for new incoming requests during the quiescing period; this includes closing any established connections from the external parties. However, any responses to requests that were previously sent via the FE should still be routed via the FE.
    If all internal communication between BE and FE reuses existing connections, then the responses can still be routed over these connections, provided that we do NOT close these internal connections (although the port is closed, which will prevent new internal connections from being established).

If responses can be sent over new connections from BE to FE without re-using any existing connections, then this requires a different port for new incoming requests than for outgoing responses. During the quiescing period the one is closed, but not the other.
At the end of the quiescing period the internal FE-BE traffic must also be stopped by closing all the internal connections from the quiescing instance (and closing the port, if that solution is chosen).
It needs to be investigated whether there is a fully meshed internal network connection pool that is used for sending these responses from the BE to the quiescing FE.

  • If the FE is not available when sending the outgoing response, and the top VIA indicates the felb and connid, we do not try to resend the response, but instead pop the VIA header and respond similarly to an FE that lost the connection identified by connid. This would require the BE to open a new connection directly to the UAC.
    It is unclear whether this works correctly for TLS. For TLS it is the UAC that must set up the connection. So in the case of TLS, when the connection is broken by the FE, the UAC will re-establish the TLS connection to a different FE. However, the BE has no way of knowing to which FE and which connid it should route. This would then also be an issue in a normal connection-loss situation.

In-flight data
At the start of the quiescing, not all the instances in the cluster will immediately have the same view of the re-configured consistent hash. There can be FEs that already sent data to a BE, because that is what the FE's consistent hash indicated. However, when the request is received on the BE that is being quiesced, its consistent hash will indicate that this is not the home instance.
There are several ways to handle this:

  • Issue an error response with a Retry-After header on these requests. There is some discussion on the correct error response, but at least this should not be a 503, since a 503 indicates unavailability at the IP-address level, which in our case is also the cluster level (saying that the complete cluster is not available for a while). A 302 or 500 error would be preferred.
    This is the current behaviour.
  • Handle the request as if it belongs here. This might not be wise, since there are several checks in the SSR part where the home instance is verified.
  • Reroute the request to the correct (current) home instance.
    This is probably complicated to achieve, since it blurs the boundary between FE and BE (the BE would be acting as an FE). It also introduces the risk of recursive behaviour (since the current home instance might not have updated its consistent hash yet).

The first solution is probably acceptable, since the amount of in-flight data should be small.
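
As an illustration of the first alternative, here is a minimal sketch using the JSR 289 servlet API; the helper class and the Retry-After value are invented for the example, and the actual CLB code may reject such requests at a lower layer.

    import java.io.IOException;
    import javax.servlet.sip.SipServletRequest;
    import javax.servlet.sip.SipServletResponse;

    final class InFlightRejector {

        // Assumed to be invoked when the local consistent hash says this quiescing
        // instance is not the home instance for an in-flight request.
        static void rejectWithRetryAfter(SipServletRequest req) throws IOException {
            // A 500 (or 302) rather than a 503, since 503 would suggest that the
            // whole cluster address is unavailable.
            SipServletResponse resp =
                    req.createResponse(SipServletResponse.SC_SERVER_INTERNAL_ERROR);
            resp.setHeader("Retry-After", "5");          // illustrative value only
            resp.send();
        }
    }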

Migration of session between 200OK and ACK

A problem related to the handling of ACK during quiescence is the locking of the dialogs. Currently the lock of the dialog (or better said, all SASes related to a dialog) is obtained at the INVITE, but released at the 200OK. It will be re-obtained at the ACK and released 32 seconds after the ACK is handled.

This means two things;

  • The SAS may be migrated between the 200OK and the ACK.
    If any other request is received for the same SAS (e.g., another session correlated via the SipApplicationSessionKey annotation) the SAS will be migrated. Then even if the ACK were received on the original instance (which is unlikely, see earlier), the ACK would either be dropped because the SAS is in the meantime remotely locked on another instance, or the SAS will migrate back, which is something we want to avoid from a performance point of view.
  • After the ACK is received and handled, there is a period of 32 seconds during which migration of the SAS is prevented (remoteLockedException).
    Redesign of the DLC should avoid the latter problem; the lock being kept for 32 seconds after the ACK is handled.

The first problem could be solved by replicating on the 200OK (see earlier) and routing based on the be-key in the ACK.
If we route the ACK to the original instance based on the be-route (see earlier on PRACK handling), then we would still have to extend the locking period to be in effect until the ACK is handled.

Save

During the save, snapshots of the active cache and the replica cache are written to disk.
Save is fairly simple. During the save, all the active caches are serialized by serializing the cache object in which they reside (normally a hashmap or similar). The replica cache is already serialized and can be written quite simply.
The location where the data is written must be configurable and depending on the upgrade scenario this could be a file on the local file system or on ram disk (application upgrade, AS upgrade) or on a NFS mounted central disk (OS upgrade, HW upgrade).

As a possible later optimization we can make the saving configurable per application and the reconciliation fault tolerant to write failures.
Could be combined with the start command?
The only mutations to the cache that happen during the save will be migrations (i.e., removals from the active cache). Since all traffic ports are closed (both externally and internally in the cluster) after the quiescing, only SSR requests will be handled in this phase. These kinds of mutations are not an issue; they will be corrected later during the reconciliation phase.
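
A minimal sketch of what the save step could look like for one manager, assuming the active cache is a map of serializable session objects and the replica cache already holds serialized byte arrays; the class name and the file layout are illustrative only, not the actual Sailfin implementation.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    final class CacheSnapshotWriter {

        static void save(Map<String, Serializable> activeCache,
                         Map<String, byte[]> replicaCache,
                         String snapshotPath) throws IOException {
            try (ObjectOutputStream out =
                         new ObjectOutputStream(new FileOutputStream(snapshotPath))) {
                // Active cache: serialize the map as a whole; the contained sessions
                // must be Serializable, as SSR already requires.
                out.writeObject(new HashMap<>(activeCache));
                // Replica cache: the entries are already in serialized form, so the
                // byte arrays are written as-is.
                out.writeObject(new HashMap<>(replicaCache));
            }
        }
    }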

Stop

The traffic to the instance is stopped. The pipes are kept open a tiny bit longer to allow in-flight traffic to complete its journey.
Now there are two options.

  • Disable replication from the replica source.
    I.e., during the upgrade the replica source will no longer replicate to us or to any other partner. The instance is expected to return in a relatively short time and will re-claim the replicas that were saved during its down period anyway.
  • Allow the normal behaviour of having the replica source reconfigure its pipes based on the new cluster configuration.
    The advantage is that this is the normal behaviour. Also, it is safer in case of a (very unexpected) failure during the upgrade (i.e., when the instance is not recoverable for some reason). The disadvantage is that this will create potential zombies on the current replication destination partner, which must be removed before the reconciliation.

Upgrade

Do all the stuff that needs to be done.

Restart

The instance is restarted. The ports are opened. Traffic can be received, but since the instance is not enabled in the CLB, no BE traffic will be received. However, FE traffic can immediately start.

Restore

The data is restored from disk. This requires reading from the previously configured file location. The data is deserialised to restore the active cache; the replica cache is already in serialized form, so it can be restored as is.
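
A sketch of the corresponding restore step, reading back the illustrative snapshot layout used in the save sketch earlier; again the class name is invented for the example.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.Serializable;
    import java.util.Map;

    final class CacheSnapshotReader {

        @SuppressWarnings("unchecked")
        static void restore(Map<String, Serializable> activeCache,
                            Map<String, byte[]> replicaCache,
                            String snapshotPath) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                         new ObjectInputStream(new FileInputStream(snapshotPath))) {
                // The active cache is deserialized into live session objects again...
                activeCache.putAll((Map<String, Serializable>) in.readObject());
                // ...while the replica cache entries stay in their serialized form.
                replicaCache.putAll((Map<String, byte[]>) in.readObject());
            }
        }
    }
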
Note: with this rolling upgrade functionality it will be possible to upgrade applications, but only under tight constraints, including:

  1. There is no support for application versioning in this release, so as each instance is updated, the application is newly deployed (instance by instance) with dynamic-reconfig disabled.
  2. This means that during the overall upgrade process, different versions will be in effect at the same time. So great care must be taken by app developers and deployers that the newly deployed app is completely consistent with the old one, particularly with regard to the usage of objects stored in the HTTP session, SIP application session, etc.

Applications are recommended to use a pattern where a version of the data is included in the serialized form, so as to be able to handle both old and new data formats in case any changes are made.
The container artifacts use the same pattern.
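
As an illustration of the recommended pattern, here is a hypothetical application class (the class and its fields are invented for the example) that writes an explicit format version into its serialized form, so that the upgraded application version can read data written by the old version:

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public class SubscriberData implements Serializable {

        private static final long serialVersionUID = 1L;
        private static final int FORMAT_VERSION = 2;   // bumped by the new application version

        private String subscriberId;
        private String displayName;                    // assume this field was added in version 2

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.writeInt(FORMAT_VERSION);
            out.writeUTF(subscriberId);
            out.writeUTF(displayName == null ? "" : displayName);
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            int version = in.readInt();
            subscriberId = in.readUTF();
            // Data written by the old application version lacks the new field.
            displayName = (version >= 2) ? in.readUTF() : "";
        }
    }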

Enable

The instance is enabled in the CLB and will start receiving BE traffic again.

Reconciliation

Between saving the snapshot and restoring it, all kinds of things might have happened that make the restored data invalid.

  • Active sessions have migrated because they have been accessed during the roll (either by traffic or because of a timeout). However, they are still present in the snapshot of the active cache.
  • Active sessions have been removed during the roll (after being migrated). However, they are still present in the snapshot of the active cache.
  • New sessions have been started on the replica source but not yet replicated to the upgraded instance. Depending on the solution these replicas may be on our replica destination already.
  • Sessions might have been updated, but the restored replica cache still contains the old version.
  • Sessions have been removed, but they are still present in the restored replica cache.

The solution to this is to do reconciliation. We distinguish two types of reconciliation: reconciliation of the active cache and reconciliation of the replica cache.

Active cache reconciliation

The idea behind repairing the active cache is very simple. Any items that were in the active cache of the upgraded instance should (under normal circumstances) be in the replica cache of the replication partner. During the upgrade, traffic meant for the upgrading instance is redirected to other instances. This will result in reactivation of those sessions, which means that the owner changes. So during the upgrade, the items in the replica cache that are still owned by the upgrading instance will only decrease (never increase). After the upgrade, we just have to remove all the items from the reloaded active cache that have been reactivated elsewhere in the cluster.

Until that task is completed, any items that have not yet been repaired in the active cache are marked as suspect, and access to those will trigger a load request, even if they are found in the active cache (since in the meantime they may have been re-activated elsewhere).

The way this will occur is that the instance will send out a propagated request to all members of the cluster. Each instance will reply with a list of ids owned by the caller instance (in practice usually only the next partner instance of the caller will reply).

Then:

  1. All the members of the active cache are marked as 'suspect'.
  2. Iterate over the members of the active cache; if an item's id is not in the list of replicas still owned by this instance, it is removed (only from the active cache). If it is in the list, it is marked as 'no longer suspect'.
    (See the 'geek notes', p. 56, for example pseudocode for this.)
    Note: this process occurs while the instance is also under load, so an incoming request during this period must also check the members of the active cache; if they are still marked 'suspect', it should check the owned list. If the item is in there, it is OK and can be returned. If not, it should be tossed and a fresh load call issued.
    (See the pseudocode in the geek notes, p. 57; a rough sketch is also given after this list.)
    As a future optimization we can consider a fault tolerant active cache reconciliation that does not depend on the saved copy of the active cache being complete.
    The solution assumes that both deleted and migrated sessions will be removed from the restored active cache. This leaves any migrated sessions, or sessions that were created as expats, where they are (i.e., they are not migrated back). This could lead to a 'bunched-up' effect due to the accumulation of migrated sessions.
    TODO: qualify this effect; it might not be that bad. We have to take into account the actual time the instance was disabled from the CLB, during which migration can occur, versus the reconciliation phase, where migration back is already occurring. Also, during the rest of the roll, migration back to already upgraded instances will happen, so the length of the total roll is essential as well.
    Migrating back any migrated sessions could be done in different ways, but the most robust solution is based on the expat list handling: request the expat list for the upgraded instance after the CLB is enabled again and actively try to load all the sessions in the expat list. There could be sessions locked at such a time. Then there are two options: leave those sessions (they will not add up to much anyway, so the bunch-up is avoided and they will be migrated at the next traffic or timer event) or retry (for how long and how often? There is no trigger on which to retry).
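
Here is a rough sketch of the suspect-marking pass described above; SessionEntry and the ownedIds set (the collected replies to the propagated request) are hypothetical simplifications of the SSR manager internals, not actual Sailfin classes.

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    final class ActiveCacheReconciler {

        static final class SessionEntry {
            volatile boolean suspect = true;             // step 1: everything restored starts as suspect
            // ... restored session state ...
        }

        // ownedIds: union of the replies to the propagated request, i.e. the ids of
        // replicas elsewhere in the cluster that are still owned by this instance.
        static void reconcile(ConcurrentHashMap<String, SessionEntry> activeCache,
                              Set<String> ownedIds) {
            for (Map.Entry<String, SessionEntry> e : activeCache.entrySet()) {
                if (ownedIds.contains(e.getKey())) {
                    e.getValue().suspect = false;        // still ours: no longer suspect
                } else {
                    activeCache.remove(e.getKey());      // re-activated or removed elsewhere: drop it
                }
            }
        }

        // Incoming traffic during the reconciliation must double-check suspect entries.
        static SessionEntry lookup(ConcurrentHashMap<String, SessionEntry> activeCache,
                                   Set<String> ownedIds, String id) {
            SessionEntry entry = activeCache.get(id);
            if (entry != null && entry.suspect) {
                if (ownedIds.contains(id)) {
                    entry.suspect = false;               // confirmed: safe to use
                } else {
                    activeCache.remove(id);              // toss it; caller issues a fresh load
                    entry = null;
                }
            }
            return entry;                                // null means: do a load call
        }
    }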

Replica cache reconciliation
This process is different from what the 'geek notes' describe, although similar in spirit.
In the geek notes the attempt is made to do the reconciliation work from instance1. We find it is easier to do it from instance4.
After the snapshot is restored, the upgraded instance triggers its replication source to do the following (a rough sketch is given after this list):

  • Query its replication partner (the upgraded instance) to get a list of replica id/version data elements.
  • Remove any replica that was created during the roll on any other instance (i.e., anything owned by our replication source that is not on its current replication partner).
    This could be avoided by disabling replication from the replica source during the upgrade.
  • Iterate over the query result:
    • If an id from this list does not exist in its active cache, issue a remove message to remove it from the restored replica cache.
    • If an id exists and the versions match, do nothing. The data in the replica cache is already up to date.
    • If an id exists and the active version is greater than the replica version, do a save. The data was outdated (possibly an updated version was created on the replica partner but removed again in the step above).
  • Iterate over the active cache:
    • If an id from the active cache does not exist in the replica list, do a save. This ensures that any newly added sessions are stored.
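
A rough sketch of this procedure as seen from the replication source; the UpgradedPartner and LocalActiveCache interfaces are hypothetical stand-ins for the replication framework messages, and the zombie-cleanup step (removing replicas created elsewhere during the roll) is omitted.

    import java.util.Map;

    final class ReplicaCacheReconciler {

        interface UpgradedPartner {
            Map<String, Long> queryReplicaVersions();          // replica id -> version on the upgraded instance
            void remove(String id);                            // drop a stale entry from its restored replica cache
            void save(String id, byte[] state, long version);  // (re)replicate the current state
        }

        interface LocalActiveCache {
            boolean contains(String id);
            long versionOf(String id);
            byte[] serializedState(String id);
            Iterable<String> ids();
        }

        static void reconcile(LocalActiveCache active, UpgradedPartner partner) {
            Map<String, Long> replicaVersions = partner.queryReplicaVersions();

            // Pass 1: walk the partner's restored replica cache.
            for (Map.Entry<String, Long> e : replicaVersions.entrySet()) {
                String id = e.getKey();
                if (!active.contains(id)) {
                    partner.remove(id);                      // session gone: remove the stale replica
                } else if (active.versionOf(id) > e.getValue()) {
                    partner.save(id, active.serializedState(id),
                                 active.versionOf(id));      // replica outdated: re-save
                }                                            // equal versions: already up to date
            }

            // Pass 2: walk the local active cache to catch sessions created during the roll.
            for (String id : active.ids()) {
                if (!replicaVersions.containsKey(id)) {
                    partner.save(id, active.serializedState(id), active.versionOf(id));
                }
            }
        }
    }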

Next roll

The next instance in the sequence should not be upgraded before the reconciliation is complete.
Well, it could be optimized if the next upgraded instance is neither of the two partners of the just-upgraded instance. In fact, if the distance in the replication ring between upgraded instances is at least three, there should be no functional effect of this; in theory they might even be upgraded simultaneously!

Re-balancing of TCP connections

The SCSF keeps a fixed number of TCP connections to the SIP-AS. In case one of the instances is disabled, the SCSF will notice this as a termination of a part of the connections (1/n-th of the connections, in a nicely balanced scenario). It will re-establish these TCP connections based on the termination trigger. The re-established connections will be redistributed over the currently available FE instances by the IP sprayer.
When the instance comes back up again, the SCSF will get no trigger. Therefore, none of the TCP connections will be re-established. Effectively, this means that the TCP connections are not evenly distributed over the FE instances.
Example: if we have 10 instances and 100 TCP connections, ideally every FE has 10 TCP connections. If one instance goes down, then after re-establishing the connections, 8 of the remaining instances will have 11 TCP connections and one will have 12 connections. After the instance is restored, 8 instances will have 11 connections, 1 instance will have 12 connections and one instance will have none.
Remember, this is only the FE TCP traffic that is not properly distributed, all instances will receive their fair share of BE traffic or of FE UDP traffic.
The previous solution from EAS was to close all the TCP ports in an upscale scenario, forcing a redistribution over the currently available instances.
A nicer, less brute-force solution would be to close connections regularly. This can be done for two reasons:

  1. The connection did not receive any traffic in a configurable period.
  2. The connection has been alive for a configurable period.

Period 2 should be longer than period 1 and ensure that even in the presence of constant traffic the connections will eventually be re-distributed evenly over the cluster.
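
A sketch of such a connection reaper, assuming a hypothetical TrackedConnection wrapper around the container's connection objects and a periodically scheduled task; period 1 corresponds to idleLimitMillis and period 2 to lifetimeLimitMillis.

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class ConnectionReaper implements Runnable {

        static final class TrackedConnection {
            volatile long establishedAt;   // set when the connection is accepted
            volatile long lastTrafficAt;   // updated on every read/write
            void close() { /* close the underlying TCP/TLS connection */ }
        }

        private final Map<String, TrackedConnection> connections = new ConcurrentHashMap<>();
        private final long idleLimitMillis;      // reason 1: no traffic for this long
        private final long lifetimeLimitMillis;  // reason 2: connection alive this long

        ConnectionReaper(long idleLimitMillis, long lifetimeLimitMillis) {
            this.idleLimitMillis = idleLimitMillis;
            this.lifetimeLimitMillis = lifetimeLimitMillis;
        }

        void register(String id, TrackedConnection connection) {
            connections.put(id, connection);     // called when a connection is accepted
        }

        @Override
        public void run() {                      // scheduled periodically
            long now = System.currentTimeMillis();
            for (Iterator<TrackedConnection> it = connections.values().iterator(); it.hasNext(); ) {
                TrackedConnection c = it.next();
                boolean idleTooLong = now - c.lastTrafficAt > idleLimitMillis;
                boolean aliveTooLong = now - c.establishedAt > lifetimeLimitMillis;
                if (idleTooLong || aliveTooLong) {
                    c.close();                   // the SCSF re-establishes the connection and the
                    it.remove();                 // IP sprayer redistributes it over all available FEs
                }
            }
        }
    }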

Use case realization

Process view

Implementation view

Data View

Deployment View

Size and Performance

Cost Estimate

The following tasks have been identified.

  1. Discussions on choreography (includes discussions with the CLB team on quiescing).

There seem to be several discussions on quiescing and VIP in parallel. Also there are open issues, see text.

We should synchronize this with the CLB team.

As a result of these meetings we might have some impact on CLB, e.g., handling of early dialogs.

There might be SSR impact (e.g., we might have to replicate on the 200OK for INVITE dialogs or at least avoid migration of the SAS between the 200OK and the ACK)

There might be NM impact (multiple ports for responses and requests)

  2. TCP connection redistribution

Closing of ports.

  3. Potential new asadmin commands

Potentially, we need new asadmin commands and this will include extra effort to create those.

  4. Save/restore functionality

This is already partly implemented, but needs to be refined. E.g., configuration of the destination?

Code reviews and edge case analysis still need to be done. Main work done by SUN.

  5. Reconciliation

This is also started already, but some parts are not yet done. Again, code reviews and edge case analysis are needed.

  6. EJB SFSB functionality

This is actually part of the previous two items. However, it is mentioned as a specific item since it might be decided to skip this in the first release.

There is some extra effort involved in this due to the different structure of the managers for EJB SFSBs.

  7. SSR high volume tests

At the moment there do not yet seem to be high volume tests for SSR in system test.
These kinds of tests are needed as a basis for the high volume rolling upgrade tests.

  8. Rolling upgrade function test

Expect some stabilising period here. Calculate in some support for fixing these issues (and remember that these are the same resources as are involved in the 'normal' SSR work, so preferably stabilise that first, so the focus can be on rolling upgrade).

  9. System test rolling upgrade

Need to define appropriate traffic scenarios here.
Again expect some performance issues, which might or might not be solved within the scope of the proposed solution.
Experience tells that performance problems take a long time to track down and probably also a long time to correct.
Also, SSR team members are needed for troubleshooting, domain knowledge, bug fixing, etc.
To conclude, the recommendation is to first finalise the normal SSR testing, so we can be confident that the SSR functionality is in a good state.
For Ericsson the effort would not be so much in coding, but more in reviews, design support and troubleshooting help.
The main effort for Ericsson should probably be on the rolling upgrade testing.
I'm not sure what kind of tests are already available from EAS in this area and whether they can be reused, nor do I know what the normal effort for this was in the EAS days.
The following figures are very tentative and are subject to change pending the ongoing discussions. They also do not include an optimism compensation factor (I'm known to be about 25% too optimistic).

  10. Quality