Elasticity Module One Pager (template version 1.92)
1. Introduction
RFE # | Priority | Summary |
---|---|---|
16432 | P2 | This is the umbrella issue for the elasticity feature to be supported in GlassFish 3.2 |
16433 | P2 | Elasticity module must allow administrators to create time-based alerts. |
16434 | P2 | Elasticity module should allow creation of alerts based on CPU utilization. |
16435 | P2 | Elasticity module should allow creation of alerts based on memory usage. |
16436 | P2 | Elasticity module should allow creation of alerts based on application's response times. |
The tables below list the requirements that are within the scope of this module. Each requirement has an associated priority.
Feature # | Priority | Description | Comments |
---|---|---|---|
EL-2.0.1 | P1 | No programming skills shall be required from administrators to create new alarms | |
EL-2.0.2 | P1 | Administrators shall be able to create or delete alarm instances without a restart | Provide CLI or GUI to create alarm instances or specify actions without requiring a restart of any instance OR system OR VM |
EL-2.0.3 | P1 | Administrators shall be allowed to create custom alarm instances | Setting threshold values, duration |
EL-2.0.4 | P1 | Administrators shall be able to easily create alarm instances that execute periodically OR at a specific time | Provide UI to specify the periodicity / specific time (similar to cron). See DAS backup and recovery module |
EL-2.0.5 | P1 | Administrators shall be able to specify the minimum and maximum number of instances in an auto-scaling cluster | Need to react to a change in min and max cluster size dynamically |
EL-2.0.6 | P1 | Auto-scaling must be supported for EE services | Only scaling GlassFish instances |
EL-2.0.7 | P1 | The auto-scaling runtime shall not have a significant impact on performance | Exact benchmark and acceptable overhead TBD |
EL-2.0.8 | P3 | Auto-scaling module must function properly even if DAS is unavailable | It is not clear if we are moving towards ad-hoc cluster. Depending on the decision about ad-hoc cluster we will prioritize this feature. Note dependent on IMS and Orchestrator services |
EL-2.0.9 | P3 | It shall be possible to easily extend the auto-scaling system to support extensible alarms | Scripting could be a value-add in future releases |
EL-2.0.10 | P1 | Auto-scaling module shall provide customizable alarms for JVM CPU utilization, JVM memory and response time | |
EL-2.0.11 | P2 | Auto-scaling module shall provide customizable alarms for machine CPU utilization, machine memory | |
EL-2.0.12 | P2 | Administrators shall be allowed to modify alarm instances | |
EL-2.0.13 | P1 | Auto-scaling shall be supported only on virtualized environments | |
EL-2.0.14 | P1 | Unit of auto-scaling is a GlassFish cluster | |
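As an illustration of EL-2.0.5, the following is a minimal sketch of how a scaling decision could be kept inside the configured minimum and maximum cluster size. The class `ClusterSizeBounds` and its method names are hypothetical and are not part of the proposed module; reacting to a changed min or max would amount to constructing a new instance with the new values.

```java
/** Hypothetical helper: keeps a scaling decision inside the configured cluster bounds. */
final class ClusterSizeBounds {
    private final int minInstances;
    private final int maxInstances;

    ClusterSizeBounds(int minInstances, int maxInstances) {
        if (minInstances < 0 || maxInstances < minInstances) {
            throw new IllegalArgumentException("require 0 <= min <= max");
        }
        this.minInstances = minInstances;
        this.maxInstances = maxInstances;
    }

    /** Clamps the cluster size an alert action asks for to the [min, max] range. */
    int clamp(int requestedSize) {
        return Math.max(minInstances, Math.min(maxInstances, requestedSize));
    }

    /** True if starting one more instance would still respect the maximum. */
    boolean canScaleUp(int currentSize) {
        return currentSize < maxInstances;
    }

    /** True if stopping one instance would still respect the minimum. */
    boolean canScaleDown(int currentSize) {
        return currentSize > minInstances;
    }
}
```

With bounds of 2 and 5, a request for 7 instances is clamped to 5, and a cluster already at 5 instances reports that it cannot scale up.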
Metrics that will be collected by elasticity module:
Feature # | Priority | Description | Comments |
---|---|---|---|
EL-3.0.1 | P2 | Support for monitoring CPU at VM level | CPU Utilization - The percentage of allocated compute units that are currently in use by the VM instance. |
EL-3.0.2 | P3 | Support for monitoring CPU at machine level | Machine CPU Utilization - The percentage of allocated compute units that are currently in use on the machine. (Internal purposes only, no alarms set on this) |
EL-3.0.3 | P2 | Support for monitoring Memory at VM level | The percentage of allocated memory units that are currently in use by the VM instance. |
EL-3.0.4 | P3 | Support for monitoring Memory at machine level | The percentage of allocated memory units that are currently in use by the machine instance. (Internal purposes only, no alarms set on this) |
EL-3.0.5 | P1 | Support for monitoring JVM Memory at GlassFish instance level | JVM memory heap size |
EL-3.0.6 | P1 | Support for monitoring Response times of a particular administrator-specified URL | Administrators will be able to create Alarm instances that monitor the response times of the specified URL. The URL can be a 'ping' URL or a request that accesses the most common path of the app |
EL-3.0.7 | P3 | Support for monitoring Network Input at the VM level | The number of bytes received on all network interfaces by the VM instance. This metric identifies the volume of incoming network traffic to an application on a single instance. |
EL-3.0.8 | P3 | Support for monitoring Network Output at the VM level | The number of bytes sent out on all network interfaces by the VM instance. This metric identifies the volume of outgoing network traffic to an application on a single instance. |
EL-3.0.9 | P2 | Support for monitoring Disk Reads at the VM level | Bytes read from all disks available to the instance. This metric is used to determine the volume of the data the application reads from the hard disk of the instance. This can be used to determine the speed of the application. |
EL-3.0.10 | P2 | Support for monitoring Disk Writes at the VM level | Bytes written to all disks available to the instance. This metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to determine the speed of the application. |
EL-3.0.11 | P2 | Total number of cores on the machine | |
EL-3.0.12 | P2 | Total number of VMs running on the machine | |
EL-3.0.13 | P1 | Support for monitoring CPU at JVM level | CPU Utilization - The percentage of allocated compute units that are currently in use by the JVM instance. |
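For the JVM heap metric (EL-3.0.5), a gatherer could read the value from the standard `java.lang.management` API. The sketch below is only an illustration: the class name `JvmHeapSample` is hypothetical, and expressing the metric as a percentage of the maximum heap is an assumption, since the table only says "JVM memory heap size".

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical gatherer for the JVM heap metric (EL-3.0.5). */
final class JvmHeapSample {

    /**
     * Returns used heap as a percentage of the maximum heap, or of the
     * committed heap when the JVM reports no maximum (getMax() == -1).
     */
    static double usedHeapPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return 100.0 * heap.getUsed() / limit;
    }

    public static void main(String[] args) {
        System.out.printf("JVM heap in use: %.1f%%%n", usedHeapPercent());
    }
}
```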
Even though the Alert instances will survive DAS restart, the metrics collected by the Metrics modules will not survive restart.
The following table lists requirements that are not within the scope of this module.
# | Description | Comments |
---|---|---|
EL-2.1.1 | The set of alarms shall be extensible either by writing new rules or by combining existing alarms into one alarm | |
EL-2.1.2 | The set of metrics available for auto-scaling shall be extensible | |
EL-2.1.3 | Auto-scaling shall be supported for services such as load balancing, databases, etc. | |
EL-2.1.4 | Auto-scaling shall be supported in a non-virtualized environment | |
The following commands will be added for the auto-scaling feature:
enable-auto-scaling, disable-auto-scaling, create-cpu-alert, create-memory-alert, create-response-time-alert, create-time-based-alert, delete-alert, list-alerts, add-alert-action, delete-alert-action, enable-alert, disable-alert, create-email-action, create-log-action, delete-action
Note:
If the elasticity engine is off, the above commands will emit a warning saying that the elasticity engine must be turned on for the alerts to take effect. The alert configurations are, however, still saved in domain.xml.
    enable-auto-scaling cluster-name
    disable-auto-scaling cluster-name
    create-cpu-alert
        --function (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
        --sample-interval (1, 5, 15, 20, 30) measured in minutes
        --schedule schedule-name
        --threshold threshold-number
        --comparison-operator (less-than, greater-than, less-than-or-equal, greater-than-or-equal, equal)
        --cluster-aggregator (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
        --cluster cluster-name
        alert-name
    create-memory-alert
        --function (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
        --sample-interval (1, 5, 15, 20, 30) measured in minutes
        --schedule schedule-name
        --threshold threshold-number
        --comparison-operator (less-than, greater-than, less-than-or-equal, greater-than-or-equal, equal)
        --cluster-aggregator (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
        --cluster cluster-name
        alert-name
    create-response-time-alert
        --function (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
        --sample-interval (1, 5, 15, 20, 30) measured in minutes
        --schedule schedule-name
        --threshold threshold-number
        --comparison-operator (less-than, greater-than, less-than-or-equal, greater-than-or-equal, equal)
        --cluster-aggregator (average, minimum, maximum, sum, geometric-mean, harmonic-mean)
        --cluster cluster-name
        --URL url
        alert-name
    create-time-based-alert --schedule schedule-name --cluster cluster-name alert-name
    delete-alert --cluster cluster-name alert-name
    list-alerts --cluster cluster-name --alert-type (cpu, memory, response-time, time-based) --state (ok, alarm)
    enable-alert --cluster cluster-name alert-name
    disable-alert --cluster cluster-name alert-name
    create-email-action --cluster cluster-name --to-address address action-name
    create-log-action --cluster cluster-name --log-level (INFO, SEVERE, WARNING) action-name
    delete-alert-action --cluster cluster-name --alert-name alert-name action-name
    add-alert-action --cluster cluster-name --alert-name alert-name --state (ok, alarm) action-name
    delete-action --cluster cluster-name action-name
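The --function and --cluster-aggregator options accept the same six values. As a rough sketch of what each value could compute, the enum below models them in Java. The name `StatFunction` and the mapping from CLI values to constants are assumptions made for illustration; the document does not say how the engine implements these functions.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical model of the values accepted by --function and --cluster-aggregator. */
enum StatFunction {
    AVERAGE, MINIMUM, MAXIMUM, SUM, GEOMETRIC_MEAN, HARMONIC_MEAN;

    /** Applies this function to a non-empty array of samples. */
    double apply(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        switch (this) {
            case AVERAGE:
                return Arrays.stream(samples).sum() / samples.length;
            case MINIMUM:
                return Arrays.stream(samples).min().getAsDouble();
            case MAXIMUM:
                return Arrays.stream(samples).max().getAsDouble();
            case SUM:
                return Arrays.stream(samples).sum();
            case GEOMETRIC_MEAN:
                // exp(mean(log x)) avoids overflow from multiplying many samples; needs x > 0
                return Math.exp(Arrays.stream(samples).map(Math::log).sum() / samples.length);
            case HARMONIC_MEAN:
                // n / sum(1/x); needs x != 0
                return samples.length / Arrays.stream(samples).map(x -> 1.0 / x).sum();
            default:
                throw new AssertionError(this);
        }
    }

    /** Maps a CLI value such as "geometric-mean" to the matching constant. */
    static StatFunction fromCliName(String name) {
        return valueOf(name.toUpperCase(Locale.ROOT).replace('-', '_'));
    }
}
```

For the samples 2 and 8, the average is 5, the geometric mean is 4, and the harmonic mean is 3.2, which shows how the two less common means weigh low samples more heavily than the average does.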
There will be default configuration elements for the start-instance and stop-instance actions, so the administrator can refer to those elements directly. These are provided because neither action requires any parameters.
The following scenario details the asadmin commands the administrator must enter to create a CPU alarm that runs every 10 minutes and, if CPU utilization is above the 60% threshold on all the instances in the cluster, emails the administrator and starts an instance. The sample interval tells the static object to request the last 15 minutes of data, and the schedule tells the rule how often to run (every 10 minutes). In this example, the predefined 10-minute schedule 'ten-minutes' is used, as well as the predefined action 'start-instance'.
    asadmin create-cpu-alert --function average --sample-interval 15 --threshold 60 --schedule ten-minutes --comparison-operator greater-than --cluster-aggregator minimum --cluster c1 myAlert
    asadmin create-email-action --cluster c1 --to-address me@oracle.com email1
    asadmin add-alert-action --cluster c1 --state alarm --alert-name myAlert email1
    asadmin add-alert-action --cluster c1 --state alarm --alert-name myAlert start-instance
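To make the evaluation in this scenario concrete, here is a minimal sketch of one plausible reading: each instance's samples are averaged over the sample interval, the minimum of those averages is taken across the cluster, and the result is compared with the threshold using greater-than. Taking the minimum is what makes the alert fire only when every instance is above the threshold. The class `CpuAlertSketch` and its methods are hypothetical, and the sample values are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical evaluation of the CPU alert from the scenario above. */
final class CpuAlertSketch {

    /** Average of one instance's samples over the sample interval (--function average). */
    static double average(double[] samples) {
        return Arrays.stream(samples).average().orElseThrow(IllegalArgumentException::new);
    }

    /**
     * Returns true when the alert is in the "alarm" state: the minimum of the
     * per-instance averages is greater than the threshold, which means every
     * instance is above it.
     */
    static boolean inAlarm(Map<String, double[]> samplesByInstance, double threshold) {
        double clusterValue = samplesByInstance.values().stream()
                .mapToDouble(CpuAlertSketch::average)
                .min()
                .orElseThrow(IllegalArgumentException::new);
        return clusterValue > threshold;
    }

    public static void main(String[] args) {
        Map<String, double[]> cluster = new LinkedHashMap<>();
        cluster.put("i1", new double[] {70, 80, 75}); // average 75
        cluster.put("i2", new double[] {65, 62, 68}); // average 65
        System.out.println(inAlarm(cluster, 60));     // all instances above 60

        cluster.put("i3", new double[] {20, 30, 25}); // average 25
        System.out.println(inAlarm(cluster, 60));     // one idle instance keeps the alert quiet
    }
}
```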
List new, public interfaces this project exports.
The Elasticity engine exposes the following interface that can be consumed by other modules.
    import org.jvnet.hk2.annotations.Contract;

    @Contract
    public interface ElasticityEngine {
        boolean isEnabled();
    }
The asadmin start-instance and asadmin stop-instance commands will check whether the elasticity engine is running by calling engine.isEnabled(). If it is running, these commands will emit a warning asking the user to turn elasticity off. The elasticity engine can be turned on or off by executing asadmin enable-auto-scaling cluster-name or asadmin disable-auto-scaling cluster-name.
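A minimal sketch of the check described above, assuming a hypothetical `InstanceCommandGuard` helper. The real start-instance and stop-instance command classes are not shown in this document, and the warning text is invented for illustration. The `ElasticityEngine` interface is repeated without the HK2 annotation so that the block compiles on its own.

```java
import java.util.Optional;

/** Same shape as the contract above, without the HK2 annotation. */
interface ElasticityEngine {
    boolean isEnabled();
}

/** Hypothetical stand-in for the check inside start-instance / stop-instance. */
final class InstanceCommandGuard {
    private final ElasticityEngine engine;

    InstanceCommandGuard(ElasticityEngine engine) {
        this.engine = engine;
    }

    /** Returns the warning to emit, or empty when the engine is off. */
    Optional<String> warningFor(String clusterName) {
        if (!engine.isEnabled()) {
            return Optional.empty();
        }
        // Illustrative wording only; the actual message is not specified here.
        return Optional.of("Elasticity is enabled for cluster " + clusterName
                + "; run 'asadmin disable-auto-scaling " + clusterName
                + "' before starting or stopping instances manually.");
    }
}
```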
The elasticity engine defines an HK2 contract called ElasticInstanceManager that defines a set of methods to start / stop GlassFish instances. It is assumed that some module in GlassFish (most likely the Orchestrator) will implement this contract.
    import java.util.Set;
    import java.util.concurrent.Future;
    import org.jvnet.hk2.annotations.Contract;

    @Contract
    public interface ElasticInstanceManager {

        /**
         * @param excludeInstances the set of instances that are heavily loaded. Orchestrator is
         *        advised not to start another instance on the machines where these instances are running.
         *
         * @return the name of the instance that was started
         */
        Future<String> startInstance(Set<String> excludeInstances);

        /**
         * @param instances the set of instances from which the Orchestrator can choose to stop one.
         *
         * @return the name of the instance that was stopped
         */
        Future<String> stopInstance(Set<String> instances);
    }
The orchestrator may decide to start a new instance on a system that doesn't host any of the instances specified in the above set.
The Future object can be used by the elasticity engine to track when the operation completes. The elasticity engine enters a quiet period during this time (and most likely for an additional 'warm up' time after future.isDone() returns true).
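The quiet period could be tracked along the following lines. This is a sketch under assumptions: the class `QuietPeriod`, the explicit clock parameter, and the fixed warm-up length are all hypothetical, and the document does not specify how long the warm-up lasts or how the engine schedules its checks.

```java
import java.util.concurrent.Future;

/** Hypothetical tracker for the quiet period that follows a start or stop request. */
final class QuietPeriod {
    private final long warmUpMillis;
    private Future<String> pending;
    private long doneAtMillis = -1;

    QuietPeriod(long warmUpMillis) {
        this.warmUpMillis = warmUpMillis;
    }

    /** Records an in-flight operation, e.g. the Future returned by startInstance(). */
    void begin(Future<String> operation) {
        pending = operation;
        doneAtMillis = -1;
    }

    /** True while the operation is running or its warm-up time has not yet elapsed. */
    boolean isQuiet(long nowMillis) {
        if (pending == null) {
            return false;
        }
        if (!pending.isDone()) {
            return true;
        }
        if (doneAtMillis < 0) {
            doneAtMillis = nowMillis; // first time completion is observed
        }
        if (nowMillis - doneAtMillis < warmUpMillis) {
            return true;
        }
        pending = null; // warm-up over; alerts may trigger actions again
        return false;
    }
}
```

Passing the current time in as a parameter keeps the sketch easy to test; a real engine would more likely read a clock itself and skip alert actions while isQuiet returns true.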
NA
List any Documentation (man pages, manuals, service guides...) that will be impacted by this proposal.
All CLI commands will have to be documented.
How will this change impact the administration of the product?
    <schedules>
        <!-- TBD: Not sure if this is where schedules appear in domain.xml. Will update once I confirm this with Chris -->
        <schedule name="ten-minutes" second="0" minute="10" hour="0" day-of-month="*" month="*" day-of-week="*" year="*"/>
        <schedule name="weekly" second="0" minute="0" hour="0" day-of-month="*" month="*" day-of-week="2" year="*"/>
    </schedules>
    <cluster>
        <alerts>
            <alert name="alert-name1" type="CPU" threshold="30" schedule="ten-minutes" operator="greater-than" function="average" sample-interval="5" cluster-aggregator="max" enabled="true">
                <alert-actions>
                    <alert-action action-state="ok-state" action-ref="log1"/>
                </alert-actions>
            </alert>
            <alert name="alert-name2" type="response-time" threshold="30" schedule="ten-minutes" operator="greater-than" function="average" sample-interval="5" cluster-aggregator="max" enabled="true">
                <response-url url="some url"/>
                <alert-actions>
                    <alert-action action-state="ok-state" action-ref="log1"/>
                    <alert-action action-state="alarm-state" action-ref="start-instance"/>
                </alert-actions>
            </alert>
            <alert name="alert-name3" type="time-based" schedule="weekly">
                <alert-actions>
                    <alert-action action-state="alarm-state" action-ref="start-instance"/>
                </alert-actions>
            </alert>
            <actions>
                <log-action action-name="log1" log-level="INFO"/>
                <email-action action-name="email1" to-address="some address"/>
                <start-instance-action action-name="start-instance"/>
                <stop-instance-action action-name="stop-instance"/>
            </actions>
        </alerts>
    </cluster>
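As a quick way to sanity-check a fragment like the one above, the alert elements can be read with the JDK's DOM parser. This is only an illustrative sketch: `AlertConfigReader` is a hypothetical class, and GlassFish binds domain.xml through its own configuration layer rather than through code like this. Note that a standard XML parser rejects unquoted attribute values, so numeric settings have to be written in the form threshold="30".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader that lists the alerts found in an XML fragment. */
final class AlertConfigReader {

    /** Returns "name:type:schedule" for every alert element in the given XML. */
    static List<String> readAlerts(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            // Malformed XML (for example an unquoted attribute value) ends up here
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        // Matches only elements named exactly "alert", not "alerts" or "alert-action"
        NodeList alerts = doc.getElementsByTagName("alert");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < alerts.getLength(); i++) {
            Element alert = (Element) alerts.item(i);
            result.add(alert.getAttribute("name") + ":" + alert.getAttribute("type")
                    + ":" + alert.getAttribute("schedule"));
        }
        return result;
    }
}
```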
The metric gatherers will run in the DAS. This means that when the DAS stops or crashes, all the collected metrics are lost. When it restarts, the metric gatherers start collecting all over again.
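The consequence of keeping metrics only in the DAS process can be pictured with a simple in-memory window of samples. The sketch below is an assumption about one possible data structure, not the module's design: `MetricWindow`, its retention period, and its method names are hypothetical. Because nothing is written to disk, a new process always starts with an empty window.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical in-memory store for one metric; contents are lost when the process exits. */
final class MetricWindow {

    private static final class Sample {
        final long timestampMillis;
        final double value;

        Sample(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final Deque<Sample> samples = new ArrayDeque<>();
    private final long retentionMillis;

    MetricWindow(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** Adds a sample and drops samples older than the retention period. */
    synchronized void record(long timestampMillis, double value) {
        samples.addLast(new Sample(timestampMillis, value));
        while (!samples.isEmpty()
                && samples.peekFirst().timestampMillis < timestampMillis - retentionMillis) {
            samples.removeFirst();
        }
    }

    /** Values recorded during the last sampleIntervalMillis up to nowMillis, oldest first. */
    synchronized double[] last(long sampleIntervalMillis, long nowMillis) {
        return samples.stream()
                .filter(s -> s.timestampMillis > nowMillis - sampleIntervalMillis
                        && s.timestampMillis <= nowMillis)
                .mapToDouble(s -> s.value)
                .toArray();
    }
}
```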
Does this proposal impact internationalization or localization?
The alerts and their associated asadmin sub-commands will be packed as an OSGi module.
Since the alerts are part of the open source distribution, we do not expect any IPS / pkg(5) to be created for these.
No impact on product installation.
This is a brand new feature and hence we do not expect any upgrade or migration issues.
NA (Elasticity is a new feature in 3.2)
List any requirements on upgrade tool and migration tool.
Elasticity depends on other HK2 services like Orchestrator to perform actions. The interaction between elasticity engine and Orchestrator will be clearly defined.
Elasticity module also defines some metrics gatherers that could depend on IMS API.
Collecting metrics from individual GF instances will be performed by making REST calls. So, the metrics gatherers might have to use some Jersey APIs.
None
How will the new feature(s) introduced by this project be tested?
One approach is to write a test application that will (say) consume a large amount of CPU cycles. A set of alerts can then be created that monitor CPU usage and take various actions, such as sending emails or starting an instance. The application can then be deployed to see whether the actions were taken.
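The CPU-consuming part of such a test application could be as small as the following sketch. The class `CpuBurner` is hypothetical and is shown as a plain Java program; in an actual test it would presumably be wrapped in a deployable component such as a servlet, which is not shown here. The 60-second duration is an arbitrary choice.

```java
/** Hypothetical load generator that keeps CPU cores busy for a fixed time. */
final class CpuBurner {

    /** Spins for roughly the given duration and returns the number of loop iterations. */
    static long burn(long durationMillis) {
        long deadline = System.nanoTime() + durationMillis * 1_000_000L;
        long iterations = 0;
        double sink = 0;
        while (deadline - System.nanoTime() > 0) {
            sink += Math.sqrt(iterations++);
        }
        // Use the accumulated value so the loop cannot be optimized away
        if (sink < 0) {
            throw new AssertionError("unreachable");
        }
        return iterations;
    }

    public static void main(String[] args) throws InterruptedException {
        // One busy thread per core, so that utilization rises on the whole machine
        int threads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> burn(60_000));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }
}
```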
Do tests exist from prior releases (e.g. v2) that can be reused?
NA
Will new tests need to be written? Can they be automated?
Yes, new tests will be written and automated, either with simple Ant scripts or through Hudson jobs.
How will it impact existing devtests or SQE tests?
This is new functionality and therefore no dev tests or SQE tests exist.
Indicate which milestones from the current schedule the project will be: