In memory Replication in GlassFish Server Open Source Edition 3.1
1.2. Name(s) and e-mail address of Document Author(s)/Supplier:
Mahesh Kannan (Mahesh.Kannan@Oracle.Com)
1.3. Date of This Document:
May 15 2010
2. Project Summary
2.1. Project Description: In memory Replication in GlassFish Server Open Source Edition 3.1
GlassFish V2 relied on its session state persistence feature to provide high availability of session state (HTTP and EJB) by enabling containers to persist HTTP or EJB session state to a persistent 'store' that can survive GlassFish instance failures. In order to interact with a variety of physical stores (such as file system, in-memory, HADB, etc.), GlassFish V2 defined an SPI called the BackingStore SPI. The SPI defined a set of classes that a store provider must implement and also a set of APIs for the containers to use. This allowed containers to persist session state to a variety of BackingStores without worrying about the actual physical store. One such BackingStore was in-memory replication, which replicated session state to another GlassFish instance so that the state remains available even in the presence of an instance failure.
In V3.1, the replication module has been redesigned to provide several improvements over the V2 implementation. First, it is an OSGi module just like many other modules in V3. Second, it internally uses a consistent hash algorithm to pick a replica instance within a cluster of instances. This allows the replication module to easily locate the replica (or find the replicated data) when the containers need to retrieve the data. The replication module allows any Serializable POJO to be replicated (and retrieved) in a cluster of GlassFish instances. This means that the module can be used to provide availability of any POJO, not just HTTP sessions and Stateful EJBs.
Basically, the replication layer provides a set of APIs to (a) save, (b) retrieve, (c) update and (d) delete data (some POJO). The save method is implemented by replicating the data to another instance in the cluster; the replica instance is picked using the consistent hash algorithm mentioned above. The retrieve operation is implemented by locating the replica and then pulling the data (POJO) from it.
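To make the shape of these four operations concrete, here is a minimal sketch of how a container might drive them through the BackingStore class listed in section 4.5. The SessionState POJO, the class name, and the SPI package names are assumptions made for this example.

// Illustrative only: how a container might use the BackingStore API (section 4.5)
// to save, retrieve, update and delete a serializable POJO.
import java.io.Serializable;

import org.glassfish.ha.store.api.BackingStore;          // assumed package for the SPI classes
import org.glassfish.ha.store.api.BackingStoreException; // shown later in section 4.5

public class SessionPersistenceSketch {

    /** Hypothetical serializable session state POJO. */
    public static class SessionState implements Serializable {
        public byte[] attributes;   // opaque serialized session attributes
    }

    private final BackingStore<String, SessionState> store;

    public SessionPersistenceSketch(BackingStore<String, SessionState> store) {
        this.store = store;
    }

    void saveOrUpdate(String sessionId, SessionState state, boolean isNew)
            throws BackingStoreException {
        // (a) save / (c) update: the data is replicated to the replica instance
        // picked by the consistent hash of the session id
        store.save(sessionId, state, isNew);
    }

    SessionState retrieve(String sessionId, long version) throws BackingStoreException {
        // (b) retrieve: the replica is located and the POJO is pulled back
        return store.load(sessionId, version);
    }

    void delete(String sessionId) throws BackingStoreException {
        // (d) delete
        store.remove(sessionId);
    }
}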
The rest of the document describes the various details of the in-memory replication module.
2.2. Risks and Assumptions:
For V3.1, we will continue to use the BackingStore SPI that was approved for the Sailfin release. The SPI allows our containers to save state (HTTP Session / Stateful EJB) to a variety of physical stores (such as database, file system, in-memory replication, etc.)
It is assumed that GMS will provide the APIs for the following:
1. APIs for sending and receiving serialized data (byte[]) to a specific target instance / all members in the cluster
2. APIs for receiving events about the health of members in the cluster. More specifically, the HA module will rely on the existence of JOIN_AND_READY, MEMBER_FAILED, CLUSTER_STOP and CLUSTER_START events. Also, the JOIN_AND_READY event can additionally carry a sub-event (REJOIN) to indicate that the member has rejoined the group after leaving it for a brief period of time. This sub-event is typically generated when an instance is restarted (and joins the cluster) even before GMS could detect the failure. More details can be found in the GMS one-pager.
3. It is assumed that, for this release, we cannot guarantee availability of session state in the event of multiple (more than one) server failures. This is because the replication module replicates the data to only one replica instance and hence may not be able to recover state in the presence of more than one server failure.
4. Metro HA (particularly reliable messaging) requires larger payloads. While the exact payload sizes are application dependent, it is assumed that GMS messaging layer can support these higher payload sizes.
5. For better performance, it is expected that the web container sets a cookie called 'replicaLocation' in the response. This cookie tells where the replica resides. It is assumed that the GlassFish LB plugin will - after a failure - route the request to the node indicated by the cookie (see the sketch after this list).
6. The 'replicaLocation' cookie approach could be taken by the EJB container as well, but this is NOT planned for this release (as it requires more work in the ORB, request/response interceptors, etc.)
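As a minimal sketch of assumption 5 above, the web container could set the cookie using the standard Servlet API. The cookie name comes from this document; the surrounding helper class and the path setting are hypothetical.

// Illustrative only: setting the 'replicaLocation' cookie described above.
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

public class ReplicaLocationCookieSketch {

    /** Adds a cookie telling the LB plugin which instance holds the replica. */
    static void addReplicaLocationCookie(HttpServletResponse response, String replicaInstanceName) {
        Cookie cookie = new Cookie("replicaLocation", replicaInstanceName);
        cookie.setPath("/");          // assumption: the cookie applies to the whole application
        response.addCookie(cookie);
    }
}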
3. Problem Summary
3.1. Problem Area:
3.2. Justification: Why is it important to do this project?
GlassFish 2.x ensured HTTP and EJB session state availability by providing an in-memory replication module. GlassFish 3.0 did not have the necessary clustering infrastructure to provide this HA feature. Since clustering support will be available in GlassFish 3.1, we need a lightweight, open source implementation of an in-memory replication module.
4. Technical Description:
Similar to V2.x, the general approach is to replicate session state from each instance in a cluster to a back-up / replica instance.
The replication layer itself will use the GMS communication APIs for transporting the data. It is assumed that GMS will use the appropriate transport APIs (like Grizzly APIs) to send the data.
The current plan is to leverage GMS for cluster group membership services, including initial bootstrapping/startup and various cluster shape changes such as instances being stopped and restarted, instances failing, new instances being added to the cluster, etc. It is assumed that the ReplicaSelector will register itself with GMS to know the current 'alive' members so that it can pick an 'alive' replica.
The plan is to have as close to zero configuration as possible. Availability configuration will continue to work as it has for previous HA enabled releases. The existing persistence-type ("replicated") will continue to be supported. This will allow QA and performance tests to run as they do with V2.x.
4.1. Details:
For the rest of the document we will use the following notations:
n1, n2, n3 denote the instances in the cluster. The terms nodes and instances will be used interchangeably.
s1, s2 and s3 are used to denote session ids, and k1, k2 and k3 will be used to denote keys (of any object that might be persisted in the store).
The terms keys, ids and session ids - in this document - mean the same.
Note: We will continue to use the same set of classes and interfaces that ASARCH approved for Sailfin release.
4.1.1 Replication in V2
Let us assume that the load balancer routes sessions s1, s2 and s3 to node n1. In V2, n1 would have replicated s1, s2 and s3 to n2. This is because it followed a 'buddy' replication scheme wherein an instance replicates all its data to its neighboring instance.
Now, let's say that n1 fails and the load balancer routes s1 to n2, and s2 and s3 to n3. The container(s) in both instances will ask the BackingStore to 'load' the data. The load request for s1 can be satisfied 'locally' by n2, since n1 originally replicated s1 to n2. However, the load requests for s2 and s3 cannot be satisfied 'locally' by n3. It will have to send out a 'broadcast' query to the entire cluster to retrieve the data from n2. We have seen that this approach leads to severe performance problems because, under heavy load, (a) there will be a lot of these broadcast requests and (b) the instance that has the replica also has to serve extra data to the requesting instances.
V2.1 used a number of classes like Metadata, SimpleMetadata, CompositeMetadata and BatchMetadata to replicate / persist data into an instance of BackingStore. This meant that the BackingStore implementations needed to understand these objects to correctly store / replicate data. Also, it was difficult to support a new container, as the addition of new containers typically meant new artifacts that must be understood by each BackingStore implementation.
4.1.2 Replication in V3
Again, assume that the load balancer routes sessions s1, s2 and s3 to node n1. In V3, n1 might replicate s1 to n2, and s2 and s3 to n3. This is because it uses a consistent hash algorithm to map a key to an (available) instance. The reason this is called a consistent hash is that the function yields the same (consistent) instance name when computed from any instance in the cluster. In V3, instead of using a consistent hash function directly, the replication module will define an interface called ReplicaSelector. The inputs to the ReplicaSelector are the key and the group name, and the ReplicaSelector returns an instance name (that is alive within the specified group) to which the data must be replicated. Each backup instance will store the replicated data in memory.
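The ReplicaSelector contract described above might be sketched as follows. The method name and the simple hash-based selection are assumptions for illustration; a production implementation would use a true consistent hash so that membership changes remap as few keys as possible.

// Illustrative sketch of the ReplicaSelector idea. The interface name comes from
// this document; the method signature and the naive selection are assumptions.
import java.util.List;

interface ReplicaSelector {
    /** Returns the name of an alive instance (in the given group) that should hold the replica for key. */
    String selectReplica(String key, String groupName);
}

// A naive selector: hash the key onto the current list of alive members,
// skipping the local instance so that data is never "replicated" to itself.
// Assumes at least two alive members and that the member list is kept up to
// date via GMS shape-change events.
class NaiveReplicaSelector implements ReplicaSelector {

    private final String localInstanceName;
    private final List<String> aliveMembers;

    NaiveReplicaSelector(String localInstanceName, List<String> aliveMembers) {
        this.localInstanceName = localInstanceName;
        this.aliveMembers = aliveMembers;
    }

    @Override
    public String selectReplica(String key, String groupName) {
        int index = (key.hashCode() & 0x7fffffff) % aliveMembers.size();
        String candidate = aliveMembers.get(index);
        if (candidate.equals(localInstanceName)) {
            // pick the next member so the replica lives on a different instance
            candidate = aliveMembers.get((index + 1) % aliveMembers.size());
        }
        return candidate;
    }
}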
Project Shoal will provide the in-memory replication implementation, via a class called DataStore that supports all methods of the GlassFish BackingStore interface. DataStore will use GMS under the covers to join a group and to replicate data (using the GMS send API). The reasons for doing this in Project Shoal are:
It ensures that the in-memory replication module is built using clean abstractions instead of having direct dependencies on container modules.
Also, it allows for independent testing without the need for a GlassFish cluster.
The main interfaces that are used by the containers are (a) BackingStoreFactory, (b) BackingStore and (c) BatchBackingStore. BackingStoreFactory is used to create the other two stores. BatchBackingStore is used only by the EJB Container. BackingStore will be used by Web, EJB and Metro.
In Sailfin, we took a different approach. The BackingStore SPI was extended by defining four new annotations, the most important being @StoreEntry and @Attribute. This allowed containers to define any POJO that contained the data to be persisted. The POJOs are annotated with these annotations, which allows containers to describe each attribute of the POJO. Here is an example:
@StoreEntry
public class PersistentState {

    @Attribute(name="id")
    public String getId() {...}

    @Attribute(name="version")
    public long getVersion() {...}

    @Attribute(name="state")
    public byte[] getState() {...}

    @Attribute(name="data1")
    public String getData1() {...}
}
Here we have defined a simple POJO called PersistentState and annotated the getter methods with the @Attribute annotation. Let's say that a container wants to store the above in a BackingStore. The container first calls BackingStoreFactory.createBackingStore() and passes PersistentState.class as one of the parameters. This allows the BackingStore implementation class to use the reflection APIs to examine the "Attributes" of this POJO. A database implementation might use these "Attributes" to create a table whose field names are the attribute names. Currently, only attributes that are primitives, primitive wrappers, String or byte[] types can be annotated with @Attribute.
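For illustration, a store implementation could discover the annotated attributes of such a POJO roughly as follows. This is a hypothetical sketch, not the actual ha-api code; the package name for @Attribute and its runtime retention are assumptions.

// Hypothetical sketch: how a BackingStore implementation might use reflection
// to discover the @Attribute getters of a @StoreEntry POJO passed to
// createBackingStore(). Not the actual ha-api code.
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

import org.glassfish.ha.store.annotations.Attribute;   // assumed package for the SPI annotation

class AttributeInspectorSketch {

    /** Maps attribute name -> getter for every @Attribute getter on the POJO class. */
    static Map<String, Method> discoverAttributes(Class<?> pojoClass) {
        Map<String, Method> attributes = new LinkedHashMap<>();
        for (Method m : pojoClass.getMethods()) {
            Attribute attr = m.getAnnotation(Attribute.class);  // assumes RUNTIME retention
            if (attr != null && m.getParameterCount() == 0) {
                attributes.put(attr.name(), m);
            }
        }
        // A database-backed store could use these names as column names; an
        // in-memory store could use them to copy or serialize individual attributes.
        return attributes;
    }
}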
Once the POJO is defined, it must be run through an annotation processor that is defined in ha-apt module. This will create a couple of classes:
For each POJO named X it will generate a sub class named X_Sub. This class will override all the getters and setters in the parent class (X) whose getters are marked with @Attribute. This allows the underlying physical store to determine which attributes are 'dirty' and perform an efficient update.
For each POJO named X it will generate a class called X_. This can be used for creating portable queries. Since GlassFish 3.1 containers do not perform any queries, we will not be implementing any query support in this release.
In V3.1, we will let the containers use any POJO that is annotated with @StoreEntry and @Attribute. In fact, it will be easy for the containers to annotate the V2 artifacts (like Metadata, WebMetadata, SimpleMetadata, SFSBBeanState, etc.)
These generated subclass POJOs are used in BackingStore operations. The generated subclass X_Sub (or PersistentState_Sub) will contain generated methods that allow dirty detection: if a setter was called on, say, attribute 'y', then the generated subclass remembers that attribute y is dirty. This allows, say, a database implementation to perform updates efficiently.
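As a rough illustration only (the actual code is produced by the ha-apt annotation processor), a generated subclass for PersistentState might look like the sketch below. The setter names and the dirty-tracking field are assumptions, and the sketch assumes the POJO also declares the corresponding setters, as described above.

// Hypothetical illustration of what the ha-apt processor might generate for
// PersistentState. The real generated code may differ.
import java.util.HashSet;
import java.util.Set;

class PersistentState_Sub extends PersistentState {

    private final Set<String> dirtyAttributes = new HashSet<>();

    @Override
    public void setVersion(long version) {     // assumes PersistentState declares this setter
        dirtyAttributes.add("version");
        super.setVersion(version);
    }

    @Override
    public void setState(byte[] state) {       // assumes PersistentState declares this setter
        dirtyAttributes.add("state");
        super.setState(state);
    }

    /** Lets the store perform a partial update of only the changed attributes. */
    Set<String> getDirtyAttributes() {
        return dirtyAttributes;
    }
}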
4.1.3 Similarities between V2 and V3.1 replication
The way HA feature is enabled / disabled remains the same
BackingStoreFactory is still used to create BackingStores for a particular persistence type
BackingStore APIs remain the same between V2 and V3.1. There are a few more methods in this class, but they are required only for Sailfin.
4.1.4 Assumptions about how BackingStore will be used
It is assumed (for this release) that a BackingStore will be used to persist a single type of object (that is, if the store was initialized with V.class, then store.save() will be used to save only objects of type V).
Let's say that instances inst1 and inst2 are alive and have replicated some data. Now, when instance inst3 comes up, the implementation must ensure that no data is replicated to inst3 until inst3 is in an 'operational' state.
4.1.5 Note for BackingStore providers
Ensure that the implementation class is annotated with @Service like:
@Service(name="replication")
public class ReplicatedBackingStoreFactory implements BackingStoreFactory {
    .......
}
Also, (to lazily load the actual BackingStoreFactory) write a separate OSGi module (V3 module) that implements the V3 Startup interface and registers a proxy BackingStoreFactory for the supported persistence type, as follows:
public class ProxyForReplicationBackingStoreFactory
        implements Startup, BackingStoreFactory {

    public Startup.Lifecycle getLifecycle() {
        // Register this proxy so that the real factory is loaded lazily
        BackingStoreFactoryRegistry.register("replication", this);
        return Startup.Lifecycle.SERVER;
    }

    //Other BackingStoreFactory methods...
}
4.1.6 Note for Containers
In V3.1, an instance of BackingStoreFactory can be obtained as follows:
Habitat habitat = ....;
BackingStoreFactory factory = habitat.getComponent(BackingStoreFactory.class, type);
OR
BackingStoreFactory factory = BackingStoreFactoryRegistry.getBackingStoreFactory(type);
where type is the persistence type (like "file", "replication", "db", "coherence", etc.)
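The hedged sketch below combines the registry lookup above with the createBackingStore() signature from section 4.5. The store name, the empty Properties, the reuse of the PersistentState POJO and the SPI package names are assumptions made for the example.

// Illustrative only: obtaining a factory from the registry and creating a
// store for the PersistentState POJO of section 4.1.2.
import java.util.Properties;

import org.glassfish.ha.store.api.BackingStore;                // assumed SPI packages
import org.glassfish.ha.store.api.BackingStoreException;
import org.glassfish.ha.store.api.BackingStoreFactory;
import org.glassfish.ha.store.spi.BackingStoreFactoryRegistry;

class BackingStoreLookupSketch {

    static BackingStore<String, PersistentState> createStore(String type)
            throws BackingStoreException {
        BackingStoreFactory factory =
                BackingStoreFactoryRegistry.getBackingStoreFactory(type);   // e.g. "replication"
        Properties env = new Properties();                                  // provider-specific settings
        return factory.createBackingStore(
                "web-sessions", String.class, PersistentState.class, env);  // hypothetical store name
    }
}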
4.2. Bug/RFE Number(s):
4.3. In Scope:
This project will deal with replication of HTTP Sessions, Stateful Session EJBs and some Metro artifacts.
4.4. Out of Scope:
The project deals with providing availability of sessions (both EJB and HTTP) in the presence of one instance failure
In the case of multiple failures, it is assumed that the time between successive failures is large enough for the sub-system to replicate the sessions to another replica instance
We will not support multiple replicas for this release.
4.5. Interfaces:
Note: We will continue to use the same set of classes and interfaces that ASARCH approved for Sailfin release.
BackingStoreFactory
/**
* A factory for creating BackingStore(s). Every provider must provide an
* implementation of this interface.
*
* The <code>createBackingStore(env)</code> method is called typically during
* container creation time.
*
* The <code>createBatchBackingStore()</code> method will be called whenever
* the container decides to save a set of states that belong to different
* applications.
*
*/
@Contract
public interface BackingStoreFactory {
/**
* This method is called to create a BackingStore that will store objects
* of type vClazz and keys of type kClazz.
*/
public <K, V> BackingStore<K, V> createBackingStore(
String storeName, Class<K> keyClazz, Class<V> vClazz, Properties env)
throws BackingStoreException;
/**
* This method is called to store a set of Objects objects atomically.
*/
public BatchBackingStore createBatchBackingStore(Properties env)
throws BackingStoreException;
}
"BackingStore"
/**
* An object that stores objects (of type V) against an id (of type K). This class defines the
* set of operations that a container could perform on a store.
*
* The store implementation must be thread safe.
*
*/
public abstract class BackingStore<K, V> {

    protected String getStoreName() {...}

    protected void initialize(String storeName, Class<K> keyClazz, Class<V> vClazz,
            Properties env) {...}

    public abstract V load(K key, long version) throws BackingStoreException;

    public abstract void save(K key, V value, boolean isNew) throws BackingStoreException;

    public abstract void remove(K key) throws BackingStoreException;

    public abstract void updateTimestamp(K key, long time) throws BackingStoreException;

    public abstract int removeExpired(long idleForMillis)
        throws BackingStoreException;

    public abstract int size() throws BackingStoreException;

    //Typically called during appserver shutdown
    public void close()
        throws BackingStoreException {
    }

    //Typically called during app undeployment.
    public abstract void destroy() throws BackingStoreException;
}
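To illustrate the provider side of this abstract class, here is a minimal, non-replicating sketch backed by a ConcurrentHashMap (roughly the role of the no-op implementation mentioned in section 4.10.1). It is an illustration only, not the shipped implementation; the class name and the SPI package names are assumptions.

// Minimal sketch of a BackingStore implementation backed by a local map.
// It does not replicate anything; it only shows how a provider extends the
// abstract class above.
import java.util.concurrent.ConcurrentHashMap;

import org.glassfish.ha.store.api.BackingStore;          // assumed SPI packages
import org.glassfish.ha.store.api.BackingStoreException;

class LocalMapBackingStore<K, V> extends BackingStore<K, V> {

    private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();

    @Override
    public V load(K key, long version) throws BackingStoreException {
        return data.get(key);               // a real store would honour the requested version
    }

    @Override
    public void save(K key, V value, boolean isNew) throws BackingStoreException {
        data.put(key, value);
    }

    @Override
    public void remove(K key) throws BackingStoreException {
        data.remove(key);
    }

    @Override
    public void updateTimestamp(K key, long time) throws BackingStoreException {
        // no timestamps are tracked in this sketch
    }

    @Override
    public int removeExpired(long idleForMillis) throws BackingStoreException {
        return 0;                           // nothing expires in this sketch
    }

    @Override
    public int size() throws BackingStoreException {
        return data.size();
    }

    @Override
    public void destroy() throws BackingStoreException {
        data.clear();
    }
}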
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface Version {
String value() default "";
}
4.5.3 Deprecated/Removed Interfaces: List existing public interfaces which will be deprecated or removed by this project.
Interface:
Reason for Removal:
4.6. Doc Impact:
Any asadmin commands that set various attributes of availability-service need to change (to not use any hadb configuration)
4.7. Admin/Config Impact:
Admin GUI pages that performed availability-service configuration need to change (to not use any hadb configuration)
Identify changes to GUIs, CLI, agents, plugins...
4.8. HA Impact:
Depends on GMS for communication APIs
Depends on GMS API for obtaining previous cluster state
For better performance, after an instance failure, the replication layer expects the GlassFish LB plugin to understand the 'replicaLocation' cookie and route the request to the instance indicated by this cookie. The cookie will be set by the web container.
4.9. I18N/L10N Impact:
Does this proposal impact internationalization or localization?
4.10. Packaging, Delivery & Upgrade:
4.10.1. Packaging
The BackingStore SPI classes will be available as an OSGi module and will be available in the web profile as well. The module will also contain a no-op BackingStore implementation.
4.10.2. Delivery
The following jars will be delivered to support the in-memory replication feature:
ha-api.jar will contain the BackingStore SPI.
shoal-replication.jar will provide the implementation of the BackingStore API using in-memory replication. It will live in the shoal project but will have a dependency on ha-api, which is in the glassfish repository.
4.10.3. Upgrade and Migration:
What impact will this proposal have on product upgrade and/or migration from prior releases?
Enumerate requirements this project has on upgrade and migration.
4.11. Security Impact:
N/A
4.12. Compatibility Impact
4.12.1 Changes to BackingStore interface
We have added the following three methods to the BackingStore. These methods were added for Sailfin to support the query feature. However, none of the GlassFish 3.1 containers will use these APIs.
The current plan is to retain the same V2 availability-service element. The HADB related attributes will not be used by V3. The plan is to mark the corresponding config attributes with @Deprecated
4.12.3 Impact on Upgrade tool
Since clustering was not in GlassFish v3, there is no upgrade from v3 to v3.1.
Upgrade from GlassFish v2 clustering applications to GlassFish v3.1 is also NOT needed, since we will retain the same availability-service elements from the v2 domain.xml.
4.13. Dependencies:
The replication module depends heavily on the GMS module for the following (see the sketch after this list):
Various cluster shape change notification events
GMSHandle.send() APIs to transmit replication messages (both to a specific target and broadcast to the group). Note that the replication module doesn't use any transport APIs to transmit or receive data; this is done under the covers by GMS. In V3.1, GMS will use Grizzly under the covers as the transport layer.
API to register event handlers for receiving messages.
The ability to get the previous view of the cluster when a node rejoins a cluster (or when a node fails).
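To summarize these needs in one place, the following is a purely hypothetical interface. The names are assumptions made for illustration and do not reflect the actual GMS API, which is described in the GMS one-pager.

// Hypothetical sketch of the messaging/membership capabilities the replication
// module needs from GMS, as listed above. NOT the actual GMS API.
import java.util.List;

interface GroupServicesNeededByReplication {

    /** Send serialized replication data to one target instance, or to the whole group when target is null. */
    void send(String targetInstanceName, byte[] payload);

    /** Register a handler that receives replication messages sent by other instances. */
    void registerMessageHandler(MessageHandler handler);

    /** Register a handler for cluster shape changes (JOIN_AND_READY, MEMBER_FAILED, CLUSTER_START, CLUSTER_STOP, REJOIN). */
    void registerMembershipListener(MembershipListener listener);

    /** The view of the cluster before the most recent shape change, used when a node fails or rejoins. */
    List<String> getPreviousAliveMembers();

    interface MessageHandler {
        void onMessage(String senderInstanceName, byte[] payload);
    }

    interface MembershipListener {
        void onMemberEvent(String memberName, String eventName);
    }
}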
In terms of maven dependencies,
ha-api.jar (contains SPI classes) will not depend on anything else.
shoal-replication.jar (contains in memory implementation classes) will depend on ha-api.jar
4.14. Testing Impact:
In V2, GlassFish used buddy replication to replicate data. Some of the SIFT tests might have been coded for this specific replication scheme. For example, after sending requests to instance1, the tests may expect some log messages to appear in instance2's server.log. These tests need to change. One approach is to rely on the 'replicaLocation' cookie (which tells the replica location).