GlassFish Server Open Source Edition 3.1 - Clustering Design Spec

This is the design specification for the 3.1 clustering support.
Introduction

Adding clustering support to GlassFish v3 involves a number of separate tasks. Here's a partial list:
Detailed Discussion of Architecture Components
Content | Synchronization approach |
---|---|
domain.xml | Key special case, see above. |
config files | File by file mod time check. Only modified files are sent. |
applications | Check mod time of each top level application directory. If the application has changed, all the application files are sent, as well as all the generated content related to the application. |
docroot | Check the mod time of each file or directory in the docroot directory, but not subdirectories. For each modified file or directory, send that file or (recursively) directory. The docroot directory might be very large, so we don't want to check every file all the time. By checking files in the docroot directory, we pick up changes to index.html. See the Open Issue section for some more discussion about docroot synchronization. |
lib | Recursively check mod time of each file. For each modified file, send that file. We assume that typically there are relatively few files in the lib directory (less than 20). |
config-specific directory | The config-specific directory is an optional subdirectory of the config directory, with the name of the instance's config. We check the mod time of any files or directories in the config-specific directory, but not any subdirectories. The config-specific directory may commonly contain lib and docroot subdirectories, and so might be very large. |
In all cases, if the instance has a file or directory in the check list that the server does not, the server will tell the instance to remove that file or directory. The same approach applies to applications (which includes all their generated content).
The specific list of config files to consider is specified in an internal default file, config-files. This file can be overridden by supplying a config-files file in the config directory of the domain. The file contains the list of config files to be synchronized, with one file name per line.
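For illustration only (the actual default list is defined by the internal config-files resource and may differ), such a file might look like:

domain.xml
logging.properties
login.conf
server.policy
keyfile
keystore.jks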
XXX - What about commands that touch other files but don't change domain.xml,
such as create-file-user?
XXX - What about add-on modules that need to modify or extend the synchronization algorithm?
XXX - Should we externalize the entire synchronization algorithm in an xml file, as v2 did?
I've considered two different approaches to startup synchronization: (1) synchronize from within the server process itself, via a startup service that runs before anything else reads the configuration; (2) synchronize from the asadmin client before it launches the server process.
A big advantage of #2 is that the asadmin client already has all the infrastructure necessary to talk to the DAS. #1 would require a special startup service that was sure to run before anything else that might use the configuration information. That seems fragile at best.
However, one of the issues is that the DAS will likely need to know that an instance is "starting", but not yet fully "started". This seems like something GMS could help with. I don't know whether we could run GMS in the asadmin process to notify the group that the server is starting, then run GMS in the server process, remain in the "starting" state, and finally enter the "started" state. Alternatively, with approach #1, can GMS be started very early in the server startup process? GMS gets its configuration data from the domain.xml file, which is a problem with this proposal.
More research is needed to determine the best approach, and to determine if GMS can be used during startup synchronization.
The synchronization protocol uses the standard asadmin remote command facility. We need one command, something like "_get_newer_files". The body of the request should be an XML document of the form:
<get-files>
  <instance>name</instance>
  <directory>dir</directory>
  <files>
    <file>
      <name>name</name>
      <time>time</time>
    </file>
    ...
  </files>
</get-files>
The response to this command is a zip file containing the newer files. It may contain more files than were requested. The mod times for the files are included in the zip file metadata. The existing mechanism for returning files (e.g., for get-client-stubs) can be used.
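For example, a request covering a single config file might look like the following; the instance name, file name, and the use of a millisecond modification time are illustrative assumptions:

<get-files>
  <instance>instance1</instance>
  <directory>config</directory>
  <files>
    <file>
      <name>logging.properties</name>
      <time>1288292400000</time>
    </file>
  </files>
</get-files>

The DAS would include logging.properties in the returned zip file only if its copy is newer than the time given in the request.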
Administrative commands that are executed on the DAS are replicated to the affected server instances. This is done by sending to the server instances the same admin command request that was sent to the DAS. Thus, each server instance will need the same admin command listener as the DAS. This wiki page goes into the finer details of the command replication feature. As a result of replicating commands on the DAS and the individual instances, the DAS and the instances will make the same changes to the domain's configuration. There are two aspects of this approach that interact with what we've described here.
If the server instance is updating and rewriting its version of domain.xml, the mod time of its domain.xml will likely be different than the version on the DAS. When the server instance restarts, it would find out that its domain.xml is out of date, which would trigger the full synchronization algorithm.
To prevent this, the DAS will need to send additional metadata with each request, describing the mod time that the domain.xml file must have after the command completes.
To allow deployment operations to be executed on the server instances as well as the DAS, the DAS needs to keep the original application archive, and send it to the instances so they can do the same deployment operation. See also the next section on application deployment issues.
XXX - are there any operations that are done during deployment that could not safely be executed identically on each server instance? For instance, does EJB CMP deployment access the database? We probably don't want every instance accessing the database to generate EJB CMP classes based on the database tables.
There are also interesting issues with handling application references, which Jerome will need to describe.
Note that the application synchronization approach will not allow hand-editing of deployment descriptors after deployment if we switch to synchronizing the original archive instead of the expanded application directory (e.g., for performance reasons).
It's not clear how heavily used this directory is. Certainly people are expecting to modify the content in this directory directly. If there is typically lots of content here, synchronization might be expensive. Similar to above, we may need a command that says "please synchronize the docroot now".
A node agent is a process that controls the life cycle of the server instances. On each node (machine) we have a node agent process per GlassFish domain. For example, if a GlassFish domain d1 contains a cluster c1 spanning machines m1, m2 and m3 with three server instances s1, s2, s3 on each of them, we need three node agents n1, n2 and n3. This is how it was in GlassFish v2.
[Diagram: domain d1 contains cluster c1 = {s1, s2, s3}; s1 runs on machine m1 with node agent n1, s2 on m2 with n2, s3 on m3 with n3; the DAS for d1 contacts n1, n2, and n3.]
Since n1, n2 and n3 are separate processes themselves, their life cycle needs to be managed by human administrators. Since we are making node agents optional for this release, we need an alternate mechanism for situations like:
As a first step, for GlassFish v3.1, we will assume that server instances are managed manually, or by using the platform-specific facilities (Windows Services, Linux rc files, Solaris SMF).
In a future release we will consider an approach such as the following:
To remotely start the server processes from a DAS process, we propose a solution that depends on the ubiquitous sshd, which is both standard and secure. Thus, when we want to start a process remotely from the DAS, we contact the ssh daemon running on a given port (default: 22) on a given machine and ask it to start the GlassFish server process. If sshd is not running, the administrator needs to manually start the server (by using a local asadmin command, start-server) or manually restart sshd.
For this to happen, the DAS needs to act as an ssh client; pure Java libraries for this are publicly available. In fact, the Hudson project uses this technique to remotely configure its secondary machines. Thus, the above picture now looks like:
[Diagram: as above, but with no node agents; the DAS for d1 contacts sshd on each of m1, m2, and m3, which start s1, s2, and s3 respectively. c1 = {s1, s2, s3}; d1 contains c1.]
Once a server is started on a machine, it follows the synchronization algorithm as described above.
As a cluster grows, it may not be possible to upgrade all nodes in the cluster simultaneously, because doing so means service downtime. To cut the downtime and ensure service availability, the system should be designed in such a way that, for a limited period, different nodes can be running slightly different versions of GlassFish. This means that we need to carefully manage compatibility of the synchronization protocol and the config files.
To install and configure a cluster, the following steps are needed:
1. Install the software on the DAS machine.
2. Create the domain.
3. Create a cluster with no members.
4. Install the software on an instance machine (if a different machine).
5. Create a local instance, providing:
   - DAS host and port
   - admin username and password
   - instance name (defaults to the local host name)
   - optionally, the name of the cluster to join
6. Optionally, use "asadmin create-service" to manage the instance.
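As an illustrative sketch of this sequence (the domain, cluster, and instance names are hypothetical, and supplying the DAS host/port and admin credentials through asadmin's connection options and its usual password prompting is an assumption):

(on the DAS machine)
% asadmin create-domain domain1
% asadmin start-domain domain1
% asadmin create-cluster cluster1

(on the instance machine)
% asadmin --host das.example.com --port 4848 create-local-instance --cluster cluster1 instance1
% asadmin create-service instance1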
Instances can be created that are not part of a cluster. These are called stand-alone instances. Stand-alone instances share some behaviors with instances that are part of a cluster, but they are different in other ways.
Like clustered instances, a stand-alone instance:
Stand-alone instances are implemented primarily to preserve conceptual compatibility with GlassFish 2 and to provide simpler operation for people that want multiple independent instances without having to deal with the configuration of a cluster. A stand-alone instance cannot become part of a cluster at a later time.
The design consequence of supporting stand-alone instances is that when administration software performs an operation on an instance and retrieves the cluster information for that instance, the cluster may be null.
This section gives details on how the DAS keeps track of the states of instances and how various events force an instance's state to change. Since various commands and subsystems need to know the state of an instance before taking an action, a new service, called InstanceStateService, will be made available, which interested parties can use by doing:
@Inject InstanceStateService states;
The following image shows the instance state diagram. The oval / circular boxes are various events and the rectangular / square boxes are the state of an instance as held in DAS.
The following image explains the various events that force state changes.
Here is a broad overview of how the DAS will keep track of the state of the instances:
To make life easy for the potential users of this service:
InstanceState.StateType.getDescription()
The one known drawback of this approach is that there is a chance that an instance is put in the RESTART_REQUIRED state even though a command was replicated to that instance successfully. This can happen in the following scenario:
In the above sequence, the DAS will mark the instance as RESTART_REQUIRED, which is actually redundant. We will live with this drawback for now (and, if possible, also find a solution) because the alternative approach of keeping the state at the instances has potential issues such as:
Since the state is going to be saved in file .instancestate
Due to time and resource constraints, the above will be implemented in different phases:
Note 1: TBD: Can we queue the deploy command? Is there a way to invoke the postDeploy/create-app/ref command alone on instances where a queued deploy command has to be executed? Needs more investigation.
Note 2: TBD: A first look at resource commands indicates that they can be queued and executed on instances later; needs more investigation before we can make the final decision.
Note 3: TBD: Versioning the state files - do we need to take care of it? How important is it?
The clustering implementation for 3.1 must support upgrades of clusters from a GlassFish v2 installation. This section describes design details for accomplishing an upgrade.
Generally, the steps for upgrading a domain with a cluster from v2 to 3.1 are:
There are several options for step (2):
2a) Use the manual synchronization commands, export-sync-bundle and import-sync-bundle, to establish the instances on each node.
2b) Use a sequence of create-local-instance commands to establish the instances on each node.
2c) Copy the nodeagents directory tree from the existing node, and run some command to cause the instances to be upgraded for 3.1.
2d) Use a new "recreate-cluster-instances" command that would use the SSH capability to establish the instances on each node.
Of these, the 3.1 release will support (2a) and (2b). If application-specific data must be preserved from the v2 instances, it will be up to the user to get that data copied to the instances under 3.1. Note: having instance-specific data is discouraged. An enhancement for (2b) would be to have a single command, "recreate-local-node", that would run the right create-local-instance commands for a node so that the user doesn't have to worry about what instances are on which nodes. A recreate-local-node command is not planned for 3.1.
To upgrade the DAS for the domain, the clustering data from v2 must be converted to the format for 3.1. This includes the clusters, servers, configs, and node-agent elements in the domain.xml file, and all elements that they reference. Conversion of this data is implemented using the upgrade framework that is already in place for GlassFish.
Since the supported options do not make use of the v2 nodeagents directory tree, there is no need to copy that over from the v2 installation. The node directory tree is recreated when the instances are reestablished in step 2.
The clustering software is partitioned into the following modules within the GlassFish source tree:
Module Name | Source Tree Subdirectory | Jar | Purpose |
---|---|---|---|
Cluster Admin CLI | cluster/cli | cluster-cli.jar | Contains only local asadmin commands. This module should never be loaded by the server. |
Cluster Admin | cluster/admin | cluster-admin.jar | Contains (among other things) the remote admin commands that run in the server. |
Cluster Common | cluster/common | cluster-common.jar | Contains classes that are shared among the other modules. |
Cluster SSH Provisioning | cluster/ssh | cluster-ssh.jar | Contains software related to ssh support for remote nodes. |
GlassFish GMS Bootstrap Module | cluster/gms-bootstrap | gms-bootstrap.jar | Software for detecting whether there are clusters that require loading the GMS software. |
GlassFish GMS Module | cluster/gms-adapter | gms-adapter.jar | Software that integrates the Shoal module into GlassFish for use in the group management service. |
The following new commands are used to configure and manage a cluster:
Usage: create-cluster [--config <config>] [--systemproperties (name=value)[:name=value]*] [--multicastport <multicastport>] [--multicastaddress <multicastaddress>] cluster_name
--multicastaddress and --multicastport are used to broadcast messages to all instances in the cluster. GMS uses them to monitor the health of instances in the cluster.
--multicastaddress is a renaming of the undocumented v2.x --heartbeataddress. --heartbeataddress is an alias of --multicastaddress. Valid values range from 224.0.0.0 through 239.255.255.255. The default is "228.9.XX.YY" where XX and YY are independent values between 0 and 255.
--multicastport is a renaming of the undocumented v2.1 --heartbeatport. --heartbeatport is an alias of --multicastport. Valid values are from 2048 to 32000. The default is a generated value within the valid range.
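For example (the cluster name and multicast values are illustrative):

% asadmin create-cluster --multicastaddress 228.9.1.2 --multicastport 2048 cluster1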
HADB options are no longer needed. They are ignored, with a warning that the options have been deprecated and may not be supported in a future release.
[--hosts hadb-host-list] [--haagentport port_number] [--haadminpassword password] [--haadminpasswordfile file_name] [--devicesize devicesize ] [--haproperty (name=value)[:name=value]*] [--autohadb=false]
Usage: create-instance --node <node_name> [--config <config_name> | --cluster <cluster_name>] [--systemproperties (name=value)[:name=value]*] [--portbase <port_number>] [--checkports[=<checkports(default:true)>]] instance_name
See the discussion on admin@glassfish.dev.java.net. Most likely this command will not be present in 3.1, but will be added later when we have node agent support, or the ssh-based equivalent.
Usage: create-local-instance [--node <node_name>] [--nodedir <node_path>] [--savemasterpassword[=<savemasterpassword(default:false)>]] [--config <config_name> | --cluster <cluster_name>] [--systemproperties (name=value)[:name=value]*] [--portbase <portbase>] [--checkports[=<checkports(default:true)>]] instance_name
See the discussion on admin@glassfish.dev.java.net. This new local command will be used on a node to initialize a server instance on that node.
The create-local-instance command also performs part of the function that was performed by create-node-agent in v2. It creates the file system structure, including the das.properties file so that instances have the information that they need to contact the DAS.
Many of those options wouldn't apply to the remote create-instance command, so that's probably a good reason to have two separate commands instead of a --local option on create-instance.
The --portbase and --checkports options work just like the corresponding arguments to create-domain.
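For example (the node, cluster, and instance names, as well as the port base, are hypothetical):

% asadmin create-local-instance --node node1 --cluster cluster1 --portbase 28000 instance1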
Usage: create-service [--name <name>] [--serviceproperties <serviceproperties>] [--dry-run[=<dry-run(default:false)>]] [--force[=<force(default:false)>]] [--domaindir <domaindir>] [--serviceuser <serviceuser>] [--nodedir <nodedir>] [--node <node>] [-?|--help[=<help(default:false)>]] [server_name]
Changes to the create-service command include:
--nodedir
--node
Usage: copy-config [--systemproperties (name=value)[:name=value]*] source_configuration_name destination_configuration_name
Usage: delete-cluster cluster_name
--autohadboverride option is to be ignored, with a warning that the option has been deprecated and may not be supported in a future release.
[ --autohadboverride={true|false} ]
Usage: delete-instance instance_name
Usage: delete-local-instance [--node <node_name>] [--nodedir <node_path>] instance_name
Unregisters an instance from domain.xml and deletes the filesystem structure for a local instance. The instance must be stopped. If this is the last instance using the node agent directory structure, then that directory structure is also removed.
Usage: delete-config configuration_name
Usage: export-sync-bundle --target <cluster | std-alone-instance> [--retrieve true|false] [file_name]
Usage: import-sync-bundle [--node node_name] [--nodedir node_path] --file xyz-sync-bundle.zip instance_name
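For example, to establish an instance on a node from a bundle exported on the DAS (the cluster, node, and instance names are hypothetical):

(on the DAS machine)
% asadmin export-sync-bundle --target cluster1 --retrieve true cluster1-sync-bundle.zip

(on the instance machine)
% asadmin import-sync-bundle --node node1 --file cluster1-sync-bundle.zip instance1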
Usage: start-cluster [--verbose[=<verbose(default:false)>]] cluster_name
Always and only a remote command, as in v2.
--autohadboverride option is to be ignored, with a warning that the option has been deprecated and may not be supported in a future release.
[ --autohadboverride={true|false} ]
Usage: start-instance [--nosync[=<nosync(default:false)>]] [--fullsync[=<fullsync(default:false)>]] [--debug={true|false}] instance_name
Need a variant of start-instance that works locally. It should probably mirror create-instance, either having a --local option or a start-local-instance command.
--setenv option is to be ignored, with a warning that the option has been deprecated and may not be supported in a future release.
[--setenv (name=value)[:name=value]*]
Usage: restart-instance [--debug={true|false}] instance_name
--debug is a boolean argument.
true --> restart the server with JPDA debugging enabled
false--> restart the server with JPDA debugging disabled
not set: restart with whatever it is set to now in the running server
The instance restarts itself. If you call restart-instance on the DAS, the DAS restarts itself.
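For example (the instance name is hypothetical):

% asadmin restart-instance --debug=true instance1

restarts instance1 with JPDA debugging enabled, regardless of how it is currently running.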
Usage: restart-local-instance [--node <node>] [--nodedir <nodedir>] instance_name
Uses the instance_name to get the host:port, then calls restart-instance on the instance itself.
Usage: start-local-instance [--verbose[=<verbose(default:false)>]] [--debug[=<debug(default:false)>]] [--nosync[=<nosync(default:false)>]] [--syncfull[=<syncfull(default:false)>]] [--node <node_name>] [--nodedir <node_path>] instance_name
Usage: stop-cluster [--verbose[=<verbose(default:false)>]] cluster_name
--autohadboverride option is to be ignored, with a warning that the option has been deprecated and may not be supported in a future release.
[ --autohadboverride={true|false} ]
Usage: stop-instance [--force[=<force(default:true)>]] instance_name
Always and only a remote command, as in v2.
Usage: stop-local-instance [--node <node>] [--nodedir <nodedir>] [--force[=<force(default:true)>]] instance_name
Do we need this? How does it work? Or do we just leave it to "kill" and the local services facility to stop the instance?
Usage: list-clusters [target]
Usage: list-instances [--timeoutmsec=n] [--nostatus] [--standaloneonly] [target]
The three options to the list-instances command are new for 3.1. The --timeoutmsec option limits the time that will be spent trying to determine the status of instances; the default is 2 seconds. The --nostatus option causes list-instances to not display whether instances are running. The --standaloneonly option lists only stand-alone instances.
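For example (the cluster name is hypothetical):

% asadmin list-instances --timeoutmsec=5000 cluster1

lists the instances in cluster1, allowing up to 5 seconds for the status checks.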
Usage: list-configs [target]
Usage: create-node-ssh --nodehost <nodehost> [--installdir <installdir>] [--nodedir <nodedir>] [--sshport <sshport>] [--sshuser <sshuser>] [--sshkeyfile <sshkeyfile>] [--force[=<force(default:false)>]] node_name
The --installdir default is the DAS installation directory.
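For example (the host, installation directory, ssh user, and node name are hypothetical):

% asadmin create-node-ssh --nodehost host1.example.com --installdir /opt/glassfish3 --sshuser gfadmin node1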
Usage: create-node-config [--nodedir <nodedir>] [--nodehost <nodehost>] [--installdir <installdir>] node_name
This section describes changes (or the lack of needed changes) for administrative commands that already exist in GlassFish.
Support for the --template domain_template option will be added to the create-domain command as part of this feature.
No changes to delete-domain are required to support clusters.
As a stretch goal, it would be nice to have an option to stop-domain that would stop all clusters and instances in the domain. However, no changes to stop-domain are required to support clusters.
Add the following argument:
[--debug={true|false}]
--debug is a boolean argument.
true--> restart the server with JPDA debugging enabled
false--> restart the server with JPDA debugging disabled
not set: restart with whatever it is set to now in the running server
Deployment to the DAS will be largely unmodified as compared to the non-clustered version of the application server. Deployment to remote instances from the DAS will be handled by a hidden (GlassFish-private) deploy command. This command will take 2 parameters:
The deploy command on the DAS will first perform a normal deployment on the DAS (without loading, like in v2), and a supplemental command will be registered for this DeployCommand. This supplemental command will be responsible for sending the hidden command invocation to all necessary instances (as determined by the application-refs added). The following diagram shows the invocation path.
[Invocation path: client -> DAS admin framework -> DAS deploy backend -> remote instance hidden __deploy -> remote instance deploy backend]
1. asadmin deploy X.war: the DAS admin framework receives the deploy request.
2. The @ExecuteOn(runtime=DAS) DeployCommand runs on the DAS: unpack archive, run the deploy phases, optional start, write domain.xml. On failure, the failure is returned to the client.
3. On success, for each server the DAS invokes the @ExecuteOn(Runtime=Server) HiddenDeployCommand on the remote instance: unpack archive, unpack generated content, write domain.xml, invoke load, which calls loadApp() in the remote deploy backend.
4. Each instance returns its result to the DAS, which collects all results and returns them to the client.
Nothing specific needs to be put in place to support undeployment in clustered mode. The undeploy command implementation will be annotated with the @ExecuteOn annotation to run on both the DAS and the remote instances.
@ExecuteOn(runtimes={DAS, INSTANCE})
@Service
public class UndeployCommand implements AdminCommand {
    ...
}
On the DAS, the redeploy command is today basically implemented as an undeploy command followed by a deploy command. As seen above, the undeploy() command invocation will naturally take care of cleaning up the deployed bits on both the DAS and the remote instances. The follow-up deploy() command will be unchanged from first-time deployment.
This was an open issue during our AS ARCH review. The current implementation does not pick up changes to files (e.g., in docroot) if they are two or more levels down in the directory hierarchy.
Here are some thoughts on how we may want to position this with end users.
We could have something like a synchronize command that users would have to invoke to get their content synchronized.
% asadmin synchronize --target <cluster | std-alone-instance>
The synchronize command may also have additional options to help identify a subset of the content that needs to be synchronized.
For example,
% asadmin synchronize --application <name-of-application> OR
% asadmin synchronize --docroot OR
% asadmin synchronize --config
It would be great if we could synchronize while the server is running, so that the user can shut down the DAS afterwards and/or does not have to wait until the instances are shut down.
One option would be to use the SSH-based SCP feature that Rajiv is adding. We could save the synchronization zip in a known location that server startup can pick up.
The user may also use the --fullsync option during instance startup, but in certain cases doing a full synchronization may be expensive.
docroot location
============
Since our current implementation is not compatible with GlassFish v2.x (docroot, changes to deployment descriptors, JSPs, etc. are not picked up), we may also want to take this opportunity to point docroot at a config-specific directory. For example:
Currently, docroot is global. Every config points to <instanceRoot>/docroot
All the contents are globally synchronized across the domain. This can be a problem when we have 100 instances (10 clusters with 10 instances each).
Alternative to consider: Use <instanceRoot>/config/<config-name>/docroot
An example of a config-name is "cluster1-config". In this scheme, only the associated clustered instances get the docroot content.
A more complex method of synchronizing docroot has been deferred for the 3.1 release. See issue 12029 for the rationale.