Elasticity Module One Pager

(template version 1.92)

1. Introduction

1.1. Project/Component Working Name:

Elasticity Module in GlassFish 3.2

1.2. Name(s) and e-mail address of Document Author(s)/Supplier:

Mahesh Kannan,
Santiago Pericassgeertsen,
Carla Mott

1.3. Date of This Document:

Apr-12-2011

2. Project Summary

2.1. Project Description:

A system is elastic if it can be easily resized to accommodate fluctuations in demand. Resizing in this context means growing or shrinking. A system provides auto-scaling if it can automatically resize based on fluctuations in demand.

GlassFish 3.2 should offer both elasticity and auto-scaling of clusters.

See below for a longer, more detailed technical description.

2.2. Risks and Assumptions:

It is assumed that there will be only one GlassFish instance running in each VM.

The metrics gatherer will accumulate data for each GlassFish instance in memory. It is assumed that there are enough system resources. See CLUST-3 Scalable Clusters in PRD.

The elasticity module will store metrics data for two hours for each GlassFish instance.

The elasticity module does not persist the metric data, so if the DAS crashes, data collection and rule comparisons start over.

It is assumed that the metrics needed to perform auto-scaling can be gathered in all the virtualized environments on which GlassFish runs.

The elasticity module will use the interface / abstraction provided by the IMS layer to gather metrics.

The elasticity module will contact the GlassFish instances for JVM level metrics.

It is assumed that the metrics needed by the elasticity module can be collected from IMS layer with no major / significant overhead.

It is assumed that the metrics needed by the elasticity module can be collected from the GlassFish instances with no major / significant overhead.

Calculating CPU usage on a multicore virtualized environment can be quite complicated. Incorrectly determining the CPU metric could make the entire cluster / pool of systems unstable.

Orchestrator will start or stop the GlassFish instances and the elasticity module will be notified when the instances are ready or have been stopped.

3. Problem Summary

3.1. Problem Area:

The elasticity module will allow a GlassFish cluster to scale up or down dynamically, depending on the values of key system metrics such as CPU, memory and response time.

3.2. Justification:

4. Technical Description:

4.1. Details:

The elasticity module in GlassFish 3.2 will rely on alerts to perform auto-scaling. An alert monitors a specific metric and performs one or more actions when the metric value reaches a certain threshold. The metric to use, the function (such as avg, min, max) to apply to the metric, and the threshold are configurable using asadmin commands or the admin GUI. The metric function is applied to the metric data collected from each GlassFish instance, so that each instance has one statistic value. Because a GlassFish cluster consists of multiple instances, the alert applies another function (called the cluster-aggregator) to the individual statistic values. The result of the aggregator function is then compared with the threshold using the comparison operator. If the comparison is true, the alert is in the 'alarm' state; otherwise it is in the 'ok' state.
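The evaluation flow described above can be sketched roughly as follows. This is a minimal illustration and not the module's implementation: the class AlertEvaluationSketch, its method names and the sample values are hypothetical. Only the three steps (per-instance function, cluster-aggregator, threshold comparison) come from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch; none of these names are part of the elasticity module.
final class AlertEvaluationSketch {

    enum State { OK, ALARM }

    static State evaluate(Map<String, List<Double>> samplesPerInstance,
                          ToDoubleFunction<List<Double>> function,
                          ToDoubleFunction<List<Double>> clusterAggregator,
                          BiPredicate<Double, Double> comparison,
                          double threshold) {
        // Step 1: apply the metric function so each instance has one statistic value.
        List<Double> perInstance = new ArrayList<>();
        for (List<Double> samples : samplesPerInstance.values()) {
            perInstance.add(function.applyAsDouble(samples));
        }
        // Step 2: apply the cluster-aggregator to the per-instance values.
        double clusterValue = clusterAggregator.applyAsDouble(perInstance);
        // Step 3: compare the aggregated value with the threshold.
        return comparison.test(clusterValue, threshold) ? State.ALARM : State.OK;
    }

    static double average(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    }

    static double minimum(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).min().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        Map<String, List<Double>> cpu = new LinkedHashMap<>();
        cpu.put("instance1", Arrays.asList(70.0, 80.0)); // average 75
        cpu.put("instance2", Arrays.asList(65.0, 75.0)); // average 70
        State state = evaluate(cpu, AlertEvaluationSketch::average,
                AlertEvaluationSketch::minimum, (value, limit) -> value > limit, 60.0);
        System.out.println(state); // prints ALARM, because min(75, 70) = 70 > 60
    }
}
```

Using minimum as the cluster-aggregator with a greater-than comparison means the alert only enters the 'alarm' state when every instance exceeds the threshold.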

An alert can be configured to take one or more actions for each of the states it can be in.

Alerts can be used to perform non auto-scaling operations such as sending an email or logging information about current state of the system.

There are three sub-modules in elasticity: Metrics Gatherer, Rule Engine and Actions Engine.

4.1.1 Metrics Gatherer

The Metrics Gatherer collects raw data from each of the instances (VM or GlassFish) in the cluster and stores those values for some period of time measured in hours. Raw data would be the actual value returned from IaaS Management Service (IMS) layer or the GlassFish instance. The rate at which the data is collected is not specified. Each data entry has a timestamp. There is one metrics gatherer for each type of metric (CPU, memory and response time). The Metrics Gatherer is a singleton and the data collected is used by all alert instances.
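One way to picture the timestamped, time-bounded storage described above is the following sketch. It is an assumption-laden illustration: MetricWindowSketch and its methods are hypothetical names, and the eviction policy (drop entries older than the retention period relative to the newest entry) is a guess at one reasonable approach, not the gatherer's actual behaviour.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of an in-memory, time-bounded metric store for one instance.
final class MetricWindowSketch {

    private static final class Entry {
        final long timestampMillis;
        final double value;

        Entry(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final long retentionMillis;
    private final Deque<Entry> entries = new ArrayDeque<>();

    MetricWindowSketch(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Stores a raw value with its timestamp and evicts entries older than the retention period.
    synchronized void record(long timestampMillis, double value) {
        entries.addLast(new Entry(timestampMillis, value));
        long cutoff = timestampMillis - retentionMillis;
        while (!entries.isEmpty() && entries.peekFirst().timestampMillis < cutoff) {
            entries.removeFirst();
        }
    }

    // Returns the values recorded at or after the given time, oldest first.
    synchronized List<Double> valuesSince(long fromMillis) {
        List<Double> result = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.timestampMillis >= fromMillis) {
                result.add(entry.value);
            }
        }
        return result;
    }
}
```

A statistics object could call valuesSince(now - sampleInterval) to obtain only the data it needs for one alert run.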

There is also a set of statistics objects to manipulate the data. The statistics object will retrieve all the data for the time period specified by the sample-interval for each instance. Then for each instance it applies the transformation specified by the function option in the command. The possible functions that can be applied to the data are: average, minimum, maximum, geometric mean, harmonic mean, and summation. This is done each time the alert is run.
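The functions listed above could be written along these lines. This is only a sketch under assumptions: StatisticsSketch is a hypothetical class, inputs are assumed to be non-empty, and the geometric and harmonic means additionally assume strictly positive values.

```java
import java.util.List;

// Hypothetical sketch of the statistic functions; assumes non-empty input lists.
final class StatisticsSketch {

    static double summation(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    static double average(List<Double> values) {
        return summation(values) / values.size();
    }

    static double minimum(List<Double> values) {
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
        }
        return min;
    }

    static double maximum(List<Double> values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Geometric mean computed as exp(mean(log x)); requires positive values.
    static double geometricMean(List<Double> values) {
        double logSum = 0.0;
        for (double v : values) {
            logSum += Math.log(v);
        }
        return Math.exp(logSum / values.size());
    }

    // Harmonic mean computed as n / sum(1 / x); requires positive values.
    static double harmonicMean(List<Double> values) {
        double reciprocalSum = 0.0;
        for (double v : values) {
            reciprocalSum += 1.0 / v;
        }
        return values.size() / reciprocalSum;
    }
}
```

For the samples 1, 2 and 4, the average is about 2.33, the geometric mean is 2 and the harmonic mean is about 1.71, which shows how the choice of function changes how strongly outliers influence the statistic.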

4.1.2 Rule Engine

The Rule Engine runs the alerts at a rate that is specified by the schedule option. The rule looks at the set of data (one value per instance) from the statistics objects. It then applies another transformation specified by the cluster-aggregator function. To determine the state of the cluster, it compares the calculated value to the threshold value using the comparison operator.

4.1.3 Actions Engine

An alert can be configured to execute zero or more actions for each state. Actions Engine looks up the action to be taken, if any, based on the state of the cluster and executes that action.

There are two pre-defined actions in the system. They are start-instance and stop-instance.

As described above, the Rule Engine runs the alert at a rate specified by the schedule option. After the rule is run, the Actions Engine executes the actions associated with the resulting state (even if the state has not changed since the last time the rule was run).
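The per-state action lookup can be sketched as below. ActionsEngineSketch and its methods are hypothetical names used only for illustration, and actions are modelled as plain Runnable objects, which is an assumption rather than the module's design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: actions are registered per state and run after every rule evaluation.
final class ActionsEngineSketch {

    enum State { OK, ALARM }

    private final Map<State, List<Runnable>> actionsByState = new EnumMap<>(State.class);

    void addAction(State state, Runnable action) {
        actionsByState.computeIfAbsent(state, s -> new ArrayList<>()).add(action);
    }

    // Executes every action registered for the given state, even if the state
    // is the same as on the previous run. Returns the number of actions executed.
    int execute(State state) {
        List<Runnable> actions = actionsByState.getOrDefault(state, Collections.emptyList());
        for (Runnable action : actions) {
            action.run();
        }
        return actions.size();
    }
}
```

Because actions fire on every run rather than only on state transitions, an action such as start-instance needs its own protection against being triggered repeatedly.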

4.1.3.1 start-instance Action

When start-instance action is called, the auto-scaling engine calls ElasticInstanceManager.startInstance() to start a GlassFish instance. The return value from that method is a Future object that can be used to determine when the operation completes.

When the start-instance operation completes, the auto-scaling engine will notify the Metrics gatherer engine to start collecting metrics from the newly started instance.

It is possible that two alerts, say cpu1 and cpu2, execute and fire two 'start-instance' actions. The start-instance action is responsible for interacting with ElasticInstanceManager (see below) to start an instance. When start-instance calls StartInstance.startInstance(), it marks the auto-scaling engine as being in a 'reconfigure' state and then calls ElasticInstanceManager.startInstance(). After an instance is started, the engine enters the 'warmup' period. Any requests for start-instance will be ignored by StartInstance during the reconfigure and warmup periods.

When ElasticInstanceManager.startInstance() is called, the elasticity engine will pass a set of GlassFish instance names. These are the instances that are in the 'alarm' state. The Orchestrator may use this data to avoid starting another GlassFish instance on the same machine (on which these GlassFish instances are running).
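The reconfigure and warmup behaviour can be illustrated with a small state holder. This is a sketch under assumptions: StartInstanceGateSketch, its phases and the explicit clock parameter are hypothetical, and the length of the warmup period is not specified in this document.

```java
// Hypothetical sketch of ignoring start-instance requests during reconfigure and warmup.
final class StartInstanceGateSketch {

    enum Phase { IDLE, RECONFIGURE, WARMUP }

    private final long warmupMillis; // assumed to be configurable; not specified here
    private Phase phase = Phase.IDLE;
    private long warmupEndsAtMillis;

    StartInstanceGateSketch(long warmupMillis) {
        this.warmupMillis = warmupMillis;
    }

    // Returns true if the caller may start an instance; false if the request is ignored.
    synchronized boolean tryBeginStart(long nowMillis) {
        refresh(nowMillis);
        if (phase != Phase.IDLE) {
            return false;
        }
        phase = Phase.RECONFIGURE;
        return true;
    }

    // Called when the start operation has completed; begins the warmup period.
    synchronized void instanceStarted(long nowMillis) {
        phase = Phase.WARMUP;
        warmupEndsAtMillis = nowMillis + warmupMillis;
    }

    synchronized Phase currentPhase(long nowMillis) {
        refresh(nowMillis);
        return phase;
    }

    private void refresh(long nowMillis) {
        if (phase == Phase.WARMUP && nowMillis >= warmupEndsAtMillis) {
            phase = Phase.IDLE;
        }
    }
}
```

With this shape, if cpu1 and cpu2 both fire start-instance in the same run, only the first call to tryBeginStart succeeds and the second is ignored.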

4.1.4 Time based alerts

Administrators can create time-based alerts that can be used to perform actions at specific times (irrespective of system load). Time-based alerts are associated with a time pattern that decides when they need to be executed. They are either in the 'ok' state or in the 'alarm' state. A time-based alert is in the 'alarm' state when the current time matches the time pattern associated with it. Time-based alerts can be configured to execute a set of actions on each of their states.
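The matching of the current time against a time pattern might look like the sketch below. It makes several assumptions that are not confirmed by this document: TimePatternSketch is a hypothetical class, each field is either "*" (any value) or a single number, and day-of-week follows the cron-like convention where both 0 and 7 mean Sunday.

```java
import java.time.LocalDateTime;

// Hypothetical sketch: each field is "*" or a single integer; ranges and names are not handled.
final class TimePatternSketch {

    private final String second, minute, hour, dayOfMonth, month, dayOfWeek, year;

    TimePatternSketch(String second, String minute, String hour, String dayOfMonth,
                      String month, String dayOfWeek, String year) {
        this.second = second;
        this.minute = minute;
        this.hour = hour;
        this.dayOfMonth = dayOfMonth;
        this.month = month;
        this.dayOfWeek = dayOfWeek;
        this.year = year;
    }

    // Returns true when the alert would be in the 'alarm' state at the given time.
    boolean matches(LocalDateTime time) {
        return fieldMatches(second, time.getSecond())
                && fieldMatches(minute, time.getMinute())
                && fieldMatches(hour, time.getHour())
                && fieldMatches(dayOfMonth, time.getDayOfMonth())
                && fieldMatches(month, time.getMonthValue())
                && fieldMatches(year, time.getYear())
                && dayOfWeekMatches(time);
    }

    private static boolean fieldMatches(String pattern, int actual) {
        return "*".equals(pattern) || Integer.parseInt(pattern) == actual;
    }

    // java.time numbers Monday..Sunday as 1..7; taking both sides modulo 7 maps Sunday to 0.
    private boolean dayOfWeekMatches(LocalDateTime time) {
        return "*".equals(dayOfWeek)
                || Integer.parseInt(dayOfWeek) % 7 == time.getDayOfWeek().getValue() % 7;
    }
}
```

Under these assumptions, a pattern with second, minute and hour set to 0 and day-of-week set to 2 matches only at midnight on Tuesdays.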

We will reuse the create-schedule asadmin sub-command that allows definition of a specific date and time. The syntax is

asadmin create-schedule [--second s] [--minute m] [--hour h] [--day-of-month [1,31]/Last] [--month 1-12/Jan-Dec] [--day-of-week [0,7]/Sun-Sat] [--year YYYY] schedule-name

Once a schedule is created, the create-time-based-alert command can be used to create a time-based alert (see Section 4.5 below for details on the commands).

4.1.5 Elasticity in ad-hoc clusters

If ad-hoc cluster support is only for bean stalk, then the elasticity module does not have to do anything. If ad-hoc clusters are also supported in non bean stalk environments, then the following requirements must be satisfied:
1. domain.xml must be available on all instances.
2. Every instance must be able to run the IMS APIs to collect VM level metrics.
3. The Orchestrator service must be available on all instances (to stop / start instances).

In an ad-hoc cluster, the Metrics gatherer, the Alert engine as well as the Action engine will all be run on the GMS (Group Management Service) elected group leader. Whenever GMS elects a new leader, the Metrics gatherer, Action engine and alert engines will be run on the newly elected leader.

4.2. Bug/RFE Number(s):

RFE # | Priority | Summary
16432 | P2 | This is the umbrella issue for the elasticity feature to be supported in GlassFish 3.2
16433 | P2 | Elasticity module must allow administrators to create time-based alerts.
16434 | P2 | Elasticity module should allow creation of alerts based on CPU utilization.
16435 | P2 | Elasticity module should allow creation of alerts based on memory usage.
16436 | P2 | Elasticity module should allow creation of alerts based on application's response times.

4.3. In Scope:

The tables below list the requirements that are within the scope of this module. Each requirement has an associated priority.

Feature # | Priority | Description | Comments
EL-2.0.1 | P1 | No programming skills shall be required from administrators to create new alarms |
EL-2.0.2 | P1 | Administrators shall be able to create or delete alarm instances without a restart | Provide CLI or GUI to create alarm instances or specify actions without requiring a restart of any instance OR system OR VM
EL-2.0.3 | P1 | Administrators shall be allowed to create custom alarm instances | Setting threshold values, duration
EL-2.0.4 | P1 | Administrators shall be able to easily create alarm instances that execute periodically OR at specific time | Provide UI to specify the periodicity / specific time (similar to cron). See DAS backup and recovery module
EL-2.0.5 | P1 | Administrators shall be able to specify the minimum and maximum number of instances in an auto-scaling cluster | Need to react to a change in min and max cluster size dynamically
EL-2.0.6 | P1 | Auto-scaling must be supported for EE services | Only scaling GlassFish instances
EL-2.0.7 | P1 | Auto-scaling runtime performance shall not have a significant impact on performance | Exact benchmark and overhead accepted TBD
EL-2.0.8 | P3 | Auto-scaling module must function properly even if DAS is unavailable | It is not clear if we are moving towards ad-hoc cluster. Depending on the decision about ad-hoc cluster we will prioritize this feature. Note dependent on IMS and Orchestrator services
EL-2.0.9 | P3 | It shall be possible to easily extend the auto-scaling system to support extensible alarms | Scripting could be a value-add in future releases
EL-2.0.10 | P1 | Auto-scaling module shall provide customizable alarms for JVM CPU utilization, JVM memory and response time |
EL-2.0.11 | P2 | Auto-scaling module shall provide customizable alarms for machine CPU utilization, machine memory |
EL-2.0.12 | P2 | Administrators shall be allowed to modify alarm instances |
EL-2.0.13 | P1 | Auto-scaling shall be supported only on virtualized environments |
EL-2.0.14 | P1 | Unit of auto-scaling is a GlassFish cluster |

Metrics that will be collected by elasticity module:

Feature # | Priority | Description | Comments
EL-3.0.1 | P2 | Support for monitoring CPU at VM level | CPU Utilization - The percentage of allocated compute units that are currently in use by the VM instance.
EL-3.0.2 | P3 | Support for monitoring CPU at machine level | Machine CPU Utilization - The percentage of allocated compute units that are currently in use on the machine. (Internal purposes only, no alarms set on this)
EL-3.0.3 | P2 | Support for monitoring Memory at VM level | The percentage of allocated memory units that are currently in use by the VM instance.
EL-3.0.4 | P3 | Support for monitoring Memory at machine level | The percentage of allocated memory units that are currently in use by the machine instance. (Internal purposes only, no alarms set on this)
EL-3.0.5 | P1 | Support for monitoring JVM Memory at GlassFish instance level | JVM memory heap size
EL-3.0.6 | P1 | Support for monitoring Response times of a particular administrator specified URL | Administrators will be able to create Alarm instances that monitor the response times of the specified URL. The URL can be a 'ping' URL or a request that accesses the most common path of the app
EL-3.0.7 | P3 | Support for monitoring Network Input at the VM level | The number of bytes received on all network interfaces by the VM instance. This metric identifies the volume of incoming network traffic to an application on a single instance.
EL-3.0.8 | P3 | Support for monitoring Network Output at the VM level | The number of bytes sent out on all network interfaces by the VM instance. This metric identifies the volume of outgoing network traffic to an application on a single instance.
EL-3.0.9 | P2 | Support for monitoring Disk Reads at the VM level | Bytes read from all disks available to the instance. This metric is used to determine the volume of the data the application reads from the hard disk of the instance. This can be used to determine the speed of the application.
EL-3.0.10 | P2 | Support for monitoring Disk Writes at the VM level | Bytes written to all disks available to the instance. This metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to determine the speed of the application.
EL-3.0.11 | P2 | Total number of cores on the machine |
EL-3.0.12 | P2 | Total number of VMs running on the machine |
EL-3.0.13 | P1 | Support for monitoring CPU at JVM level | CPU Utilization - The percentage of allocated compute units that are currently in use by the JVM instance.

4.4. Out of Scope:

Even though the Alert instances will survive DAS restart, the metrics collected by the Metrics modules will not survive restart.

The following table lists requirements that are not within the scope of this module.

# | Description | Comments
EL-2.1.1 | The set of alarms shall be extensible either by writing new rules or by combining existing alarms into one alarm |
EL-2.1.2 | The set of metrics available for auto-scaling shall be extensible |
EL-2.1.3 | Auto-scaling shall be supported for services such as load balancing, databases, etc. |
EL-2.1.4 | Auto-scaling shall be supported in a non-virtualized environment |

4.5. Interfaces:

The following commands will be added for the auto-scaling feature:

enable-auto-scaling, disable-auto-scaling, create-cpu-alert, create-memory-alert, create-response-time-alert, create-time-based-alert, delete-alert, list-alerts, add-alert-action, delete-alert-action, enable-alert, disable-alert, create-email-action, create-log-action, delete-action

Note:
If the elasticity engine is off, then the above commands will emit a warning saying that the elasticity engine must be turned on for the alerts to take effect. The alert configurations are, however, saved in domain.xml.

enable-auto-scaling
cluster-name

disable-auto-scaling
cluster-name

create-cpu-alert
--function (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
--sample-interval (1, 5, 15, 20, 30) measured in minutes
--schedule schedule-name
--threshold threshold-number
--comparison-operator (less-than, greater-than, less-than-or-equal, greater-than-or-equal, equal)
--cluster-aggregator (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
--cluster cluster-name
alert-name

create-memory-alert
--function (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
--sample-interval (1, 5, 15, 20, 30) measured in minutes
--schedule schedule-name
--threshold threshold-number
--comparison-operator (less-than, greater-than, less-than-or-equal, greater-than-or-equal, equal)
--cluster-aggregator (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
--cluster cluster-name
alert-name

create-response-time-alert
--function (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
--sample-interval (1, 5, 15, 20, 30) measured in minutes
--schedule schedule-name
--threshold threshold-number
--comparison-operator (less-than, greater-than, less-than-or-equal, greater-than-or-equal, equal)
--cluster-aggregator (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
--cluster cluster-name
--URL url
alert-name

create-time-based-alert
--schedule schedule-name
--cluster cluster-name
alert-name

delete-alert
--cluster cluster-name
alert-name

list-alerts
--cluster cluster-name
--alert-type  (cpu, memory, response-time, time-based)
--state (ok, alarm)

enable-alert
--cluster cluster-name
alert-name

disable-alert
--cluster cluster-name
alert-name

create-email-action
--cluster cluster-name
--to-address address
action-name

create-log-action
--cluster cluster-name
--log-level  (INFO, SEVERE, WARNING)
action-name

delete-alert-action
--cluster cluster-name
--alert-name alert-name
action-name

add-alert-action
--cluster cluster-name
--alert-name alert-name
--state (ok, alarm)
action-name

delete-action
--cluster cluster-name
action-name

There will be default configuration elements for the start-instance and stop-instance actions so the administrator can refer to those elements. These are provided since there are no parameters required for either action.

The following scenario details the asadmin commands the administrator must enter to create a CPU alarm that runs every 10 minutes and, if the average CPU usage is above the 60% threshold on all the instances in the cluster, emails the administrator and starts an instance. The sample interval tells the statistics object to request the last 15 minutes of data, and the schedule tells the rule how often to run (every 10 minutes). In this example, the predefined 10-minute schedule 'ten-minutes' is used, as well as the predefined action 'start-instance'.

asadmin create-cpu-alert --function average --sample-interval 15 --threshold 60 --schedule ten-minutes --comparison-operator greater-than --cluster-aggregator minimum --cluster c1 myAlert

asadmin create-email-action --cluster c1 --to-address me@oracle.com email1

asadmin add-alert-action --cluster c1 --state alarm --alert-name myAlert email1

asadmin add-alert-action --cluster c1 --state alarm --alert-name myAlert start-instance

4.5.1 Public Interfaces

List new, public interfaces this project exports.

  • Interface:
  • Comment:

4.5.2 Private Interfaces (Work in progress)

The Elasticity engine exposes the following interface that can be consumed by other modules.

@Contract
public interface ElasticityEngine {

    public boolean isEnabled();

}

The asadmin start-instance and asadmin stop-instance commands will check whether the elasticity engine is running by calling engine.isEnabled(). If it is running, these commands will emit a warning asking the user to turn elasticity off. The elasticity engine can be turned on or off by executing: asadmin enable-auto-scaling cluster-name OR asadmin disable-auto-scaling cluster-name

The elasticity engine defines an HK2 contract called ElasticInstanceManager that defines a set of methods to start / stop GlassFish instances. It is assumed that some module in GlassFish (most likely the Orchestrator) will implement this contract.

import java.util.Set;
import java.util.concurrent.Future;

import org.jvnet.hk2.annotations.Contract;

@Contract
public interface ElasticInstanceManager {

   /**
    * @param excludeInstances the set of instances that are heavily loaded. Orchestrator is
    *   advised not to start another instance on the machines where these instances are running.
    *
    * @return a Future that yields the name of the instance that was started
    */
   Future<String> startInstance(Set<String> excludeInstances);

   /**
    * @param instances the set of instances from which the Orchestrator can choose to stop one.
    *
    * @return a Future that yields the name of the instance that was stopped
    */
   Future<String> stopInstance(Set<String> instances);

}

The orchestrator may decide to start a new instance on a system that doesn't host any of the instances specified in the above set.

The Future object can be used by the elasticity engine to track when the operation completes. The elasticity engine enters a quiet period during this time (and most likely for an additional 'warm up' time after future.isDone() returns true).
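A caller might consume the returned Future roughly as follows. This is a sketch, not the engine's code: QuietPeriodSketch and the nested InstanceStarter interface are hypothetical stand-ins that only mirror the shape of startInstance(), and the timeout value is an assumption.

```java
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of waiting on the Future returned by a start operation.
final class QuietPeriodSketch {

    // Stand-in for the start method of the instance manager contract.
    interface InstanceStarter {
        Future<String> startInstance(Set<String> excludeInstances);
    }

    // Blocks until the instance has started or the (assumed) timeout expires.
    // While this call is in progress the engine would treat itself as being in a quiet period.
    static String startAndWait(InstanceStarter starter, Set<String> loadedInstances,
                               long timeoutSeconds) {
        Future<String> future = starter.startInstance(loadedInstances);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for instance start", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("Instance start did not complete", e);
        }
    }
}
```

A non-blocking variant could instead poll future.isDone() on each rule run and skip scaling actions until it returns true.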

4.5.3 Deprecated/Removed Interfaces:

NA

4.6. Doc Impact:

List any Documentation (man pages, manuals, service guides...) that will be impacted by this proposal.
All CLI commands will have to be documented.

4.7. Admin/Config Impact:

How will this change impact the administration of the product?

<schedules> <!-- TBD: Not sure if this is where schedules appear in domain.xml. Will update once I confirm this with Chris -->
 <schedule name="ten-minutes" second="0" minute="10" hour="0" day-of-month="*" month="*" day-of-week="*" year="*"/>
 <schedule name="weekly" second="0" minute="0" hour="0" day-of-month="*" month="*" day-of-week="2" year="*"/>
</schedules>
<cluster>
 <alerts>
    <alert name="alert-name1" type="CPU" threshold="30" schedule="ten-minutes" operator="greater-than" function="average" sample-interval="5" cluster-aggregator="max" enabled="true">
       <alert-actions>
         <alert-action action-state="ok-state" action-ref="log1"/>
       </alert-actions>
    </alert>
    <alert name="alert-name2" type="response-time" threshold="30" schedule="ten-minutes" operator="greater-than" function="average" sample-interval="5" cluster-aggregator="max" enabled="true">
        <response-url url="some url"/>
        <alert-actions>
         <alert-action action-state="ok-state" action-ref="log1"/>
         <alert-action action-state="alarm-state" action-ref="start-instance"/>
       </alert-actions>
    </alert>
    <alert name="alert-name3"  type="time-based" schedule="weekly">
        <alert-actions>
         <alert-action action-state="alarm-state" action-ref="start-instance"/>
        </alert-actions>
    </alert>
    <actions>
       <log-action action-name="log1" log-level="INFO"/>
       <email-action action-name="email1" to-address="some address"/> 
       <start-instance-action action-name="start-instance"/>
       <stop-instance-action action-name="stop-instance"/>
    </actions>
 </alerts>
</cluster>

4.8. HA Impact:

The Metric gatherers will be running in DAS. This means that when DAS stops / crashes all the collected metrics are lost. When it restarts, all the metric gatherers will start all over again.

4.9. I18N/L10N Impact:

Does this proposal impact internationalization or localization?

4.10. Packaging, Delivery & Upgrade:

4.10.1. Packaging

The alerts and their associated asadmin sub-commands will be packaged as an OSGi module.

Since the alerts are part of the open source distribution, we do not expect any IPS / pkg(5) to be created for these.

4.10.2. Delivery

No impact on product installation.

4.10.3. Upgrade and Migration:

This is a brand new feature and hence we do not expect any upgrade or migration issues.

4.11. Security Impact:

4.12. Compatibility Impact

NA (Elasticity is a new feature in 3.2)

List any requirements on upgrade tool and migration tool.

4.13. Dependencies:

4.13.1 Internal Dependencies

Elasticity depends on other HK2 services like Orchestrator to perform actions. The interaction between the elasticity engine and Orchestrator will be clearly defined.

Elasticity module also defines some metrics gatherers that could depend on IMS API.

Collecting metrics from individual GF instances will be performed by making REST calls. So, the metrics gatherers might have to use some Jersey APIs.

4.13.2 External Dependencies

None

4.14. Testing Impact:

How will the new feature(s) introduced by this project be tested?

One approach is to write a test application that will (say) consume a large amount of CPU cycles. Then a set of alerts can be created that monitor CPU usage and take various actions such as sending emails or starting an instance. Then the application can be deployed to see whether the actions were taken.
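As a rough idea of the load-generating part of such a test, the sketch below busy-loops on every available processor for a fixed duration. CpuLoadSketch is a hypothetical stand-alone class; an actual test application would presumably wrap similar logic in a deployable component such as a servlet.

```java
// Hypothetical sketch of a CPU load generator for exercising CPU alerts.
final class CpuLoadSketch {

    // Spins until the given duration has elapsed and returns the number of loop iterations.
    static long burn(long durationMillis) {
        long end = System.nanoTime() + durationMillis * 1_000_000L;
        long iterations = 0;
        while (System.nanoTime() < end) {
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) throws InterruptedException {
        long durationMillis = args.length > 0 ? Long.parseLong(args[0]) : 1000L;
        int workers = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[workers];
        // One busy thread per processor keeps overall CPU usage high for the duration.
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> burn(durationMillis));
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println("Finished CPU load on " + workers + " threads");
    }
}
```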

Do tests exist from prior releases (e.g. v2) that can be reused?

NA

Will new tests need to be written? Can they be automated?

Yes, new tests will be written and automated either by simple ant scripts or through Hudson jobs.

How will it impact existing devtests or SQE tests?

This is new functionality and therefore no dev tests or SQE tests exist.

5. Reference Documents:

6. Schedule:

6.1. Projected Availability:

Indicate which milestones from the current schedule the project
will be:

  • Initially integrated (may not be feature complete)
  • Feature complete (ready for handoff to QA)
  • At production quality level

  1. Instead of create-cpu-alert, create-memory-alert, etc, why not a generic "create-alert --type" to simplify the interface and user experience?  Thought is extensibility for future alerts without complicating asadmin interface & user experience.
  2. Can we leverage a common "create-alert" command (memory/CPU), with common options (and differentiated options via properties)? Thought is extensibility for future commands without complicating asadmin interface & user experience.
  3. Can we leverage a common "create-schedule" command and share with domain backup and recovery, or are they semantically different?
Posted by johnclingan at Apr 28, 2011 18:24