Integration with Graphite¶
Beaker can optionally be configured to send metrics to the Graphite real-time graphing system. Beaker sends metrics via UDP for efficiency, and to avoid impacting the performance and reliability of the system, so a version of Graphite with UDP listener support is required.
To enable Graphite integration, configure the hostname and port of the carbon daemon in /etc/beaker/server.cfg:
carbon.address = ('graphite.example.invalid', 2023)
carbon.prefix = 'beaker.'
The carbon.prefix setting is a prefix applied to the name of all metrics Beaker sends to Graphite. You can adjust the prefix to fit in with your site’s convention for Graphite metric names, or to distinguish multiple Beaker environments sharing a single Graphite instance.
Aggregating metrics¶
Beaker does not perform aggregation of metrics, and expects to send metrics to Graphite’s carbon-aggregator daemon (which forwards the metrics to carbon-cache for storage after aggregating them). The carbon.address setting should therefore be the address of the carbon-aggregator daemon.
Beaker may send three types of metrics: counters, gauges, and durations. (A duration is equivalent to a gauge except that it is in seconds instead of arbitrary units.) The type appears at the start of the metric name, after the configured prefix. For example, assuming the default prefix beaker., Beaker will periodically report the number of running recipes as beaker.gauges.recipes_running.
You should configure suitable aggregation rules for Beaker in /etc/carbon/aggregation-rules.conf. The following example assumes the default prefix beaker. and 1-minute storage resolution:
beaker.durations.<name> (60) = avg beaker.durations.<name>
beaker.counters.<name> (60) = sum beaker.counters.<name>
beaker.gauges.<name> (60) = avg beaker.gauges.<name>
System utilization metrics¶
To provide a real-time view of system utilization, Beaker updates the following gauges:
beaker.gauges.systems_idle_automated
beaker.gauges.systems_idle_broken
beaker.gauges.systems_idle_manual
beaker.gauges.systems_manual
beaker.gauges.systems_recipe
These metrics describe the current utilization of Automated, Manual and Broken systems in Beaker.
Automated systems are under the control of the Beaker scheduler, and are available to run submitted jobs. They are covered by the recipe (currently waiting for other recipes in a recipe set, provisioning the system or running a task as part of a recipe) and idle_automated (waiting to be assigned to a recipe) gauges.
Manual systems are available to Beaker users, but not to the scheduler. They are covered by the manual (assigned to a specific user) and idle_manual (not assigned to anyone) gauges.
Broken systems, covered by the idle_broken gauge, are awaiting investigation by system administrators before being placed back in the pool of available systems.
In addition to the metrics for every system known to Beaker, live metrics are also available for systems in the shared pool, which are equally available to all users of a Beaker installation. To access these metrics, replace .all with .shared.
Each of the system utilization gauges is also available broken down by architecture and by the lab controller that manages that system. For example, information on the idle x86_64 machines can be accessed as:
beaker.gauges.systems_idle_automated.by_arch.x86_64
As a system may support multiple architectures (e.g. both “i386” and “x86_64”, the by_arch metrics may not add up to the all metrics).
Information on the machines managed by a particular lab controller can be accessed as:
beaker.gauges.systems_idle_automated.by_lab.lchost_example_com
Recipe queue metrics¶
To provide a real-time view of the recipe queue, Beaker updates the following gauges:
beaker.gauges.recipes_new.all
beaker.gauges.recipes_processed.all
beaker.gauges.recipes_queued.all
beaker.gauges.recipes_scheduled.all
beaker.gauges.recipes_running.all
beaker.gauges.recipes_waiting.all
The new and processed states are transient states used when a job is initially submitted to Beaker. All recipes should move relatively quickly through these states to the queued state. If this isn’t happening, it is a sign that new jobs are arriving faster than the scheduler is able to process them.
The queued state indicates that initial processing of the recipe is complete, and it is ready to be assigned to a system. Depending on the strictness of the recipe’s host requirements, and the availability of suitable systems, recipes may spend an extended period of time in this state.
The scheduled state indicates that the recipe has been assigned a system (or a virtualized resource), but is waiting for other recipes in the same recipe set to be assigned a resource.
The waiting state indicates that the recipe is waiting for the initial reboot of the system that starts the kickstart-based provisioning process. Recipes should move relatively quickly through this state to the running state. If this isn’t happening, it is a sign that there is a problem somewhere in the Beaker installation (e.g. if the beaker-provision service is not running on one of the lab controllers, recipes assigned to that lab will get stuck in this state).
The running state indicates that the recipe is either waiting for the provisioning to complete, or is actually executing tasks on the assigned resource.
The number of recipes in scheduled and running may exceed the number of systems assigned to a recipe (as indicated by the systems_recipe gauge) as recipes may be executing on a dynamically created virtual machine.
To observe the utilization of dynamic virtualization resources, replace .all with .dynamic_virt_possible. These metrics show recipes which are either still under consideration for creation of a dynamic virtual machine, or which have already been assigned one.
Each of the recipe queue gauges is also available broken down by the architecture of the distro tree associated with the recipe. For example, information on the recipes currently in Beaker that require x86_64 machines can be accessed as:
beaker.gauges.recipes_queued.by_arch.x86_64
Dirty job count¶
Beaker populates this gauge with the number of jobs currently marked “dirty” in the database:
beaker.gauges.dirty_jobs
Jobs become “dirty” when their scheduling state has been changed (for example, the user cancels the job, or the harness completes a task) but the scheduler has not yet handled the status update.
A large value for this gauge indicates that there may be a problem with the scheduler causing a backlog of unhandled status updates.
System command metrics¶
Similar to the recipe queue metrics described above, Beaker provides a real-time view of the system command queue with the following gauges:
beaker.gauges.system_commands_queued.all
beaker.gauges.system_commands_running.all
The queued state represents commands which are in the queue but the beaker-provision daemon has not started running them yet. The running state represents commands which have started but not finished yet.
A large value for the queued gauge indicates that there may be a problem with the beaker-provision daemon on a lab controller causing a backlog of queued commands.
In addition, Beaker updates the following counters when a system command has finished (whether successfully or not):
beaker.counters.system_commands_completed.all
beaker.counters.system_commands_aborted.all
beaker.counters.system_commands_failed.all
Each of the command queue gauges and counters is also available broken down by the lab controller responsible for running the command.
Useful graphs¶
Below are some links to useful graphs showing the overall health and performance of your Beaker system. These URLs could be used as the basis for a dashboard or given to users. The URLs assume the default metric name prefix beaker. with a Graphite instance at graphite.example.com.
- Utilization of all systems
http://graphite.example.com/render/?width=1024&height=400 &areaMode=stacked &target=beaker.gauges.systems_idle_automated.all &target=beaker.gauges.systems_idle_broken.all &target=beaker.gauges.systems_idle_manual.all &target=beaker.gauges.systems_manual.all &target=beaker.gauges.systems_recipe.all
- Utilization of shared systems
http://graphite.example.com/render/?width=1024&height=400 &areaMode=stacked &target=beaker.gauges.systems_idle_automated.shared &target=beaker.gauges.systems_idle_broken.shared &target=beaker.gauges.systems_idle_manual.shared &target=beaker.gauges.systems_manual.shared &target=beaker.gauges.systems_recipe.shared
- Recipe queue
http://graphite.example.com/render/?width=1024&height=400 &areaMode=stacked &target=beaker.gauges.recipes_new.all &target=beaker.gauges.recipes_processed.all &target=beaker.gauges.recipes_queued.all &target=beaker.gauges.recipes_running.all &target=beaker.gauges.recipes_scheduled.all &target=beaker.gauges.recipes_waiting.all
- Recipe throughput
http://graphite.example.com/render/?width=1024&height=400 &target=beaker.counters.recipes_completed &target=beaker.counters.recipes_cancelled &target=beaker.counters.recipes_aborted