Event Driven Ansible

by Larry Gadallah

June 18, 2024

Classic Industrial Control Panel

In The Beginning

When Ansible was first introduced, most of its use was in an organic, human-in-the-loop mode. A developer would write an Ansible Playbook, create an inventory file of targets, and then manually run the playbook against the hosts in the inventory.

Automation Advances

Later on, when CI/CD pipelines became more prevalent, people started running Ansible jobs using automated triggers (e.g. git commit, cron job) rather than having a human trigger the Ansible run.

Classic Ansible Workflow

The Emergent Problem

Many systems provisioned with Ansible have monitoring and alerting configured, but the alerts they generate are typically handled by humans or by separate alert management systems. As a result, there is often significant latency between when a monitoring system reports an event and when an operator notices it and acts on it.

Event Driven Automation

A significant recent change to Ansible came with the introduction of Event Driven Ansible (EDA).

EDA introduces a new object to Ansible: the rulebook. The rulebook encapsulates three items:

Sources: The sources of events that drive the execution of the rulebook. Events are produced by 'source plugins' that interact with upstream systems such as webhooks and observability or monitoring systems. Source plugins are provided by Red Hat and by the Ansible community.
Rules: The conditions under which events are connected to actions.
Actions: The actions to take when a rule "fires". They may be Ansible Playbooks or ad-hoc actions using Ansible Modules.

Unlike a Playbook, which runs to completion, a running Ansible Rulebook waits for an event, evaluates it against the rules, and executes the actions of any rules that match. It then goes back to waiting for events.
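The wait, evaluate, act cycle can be sketched in a few lines of Python. This is only a conceptual model, not how EDA is implemented: the `Rule` class, the `run_rulebook` function, and the sample events are all hypothetical names invented for illustration, and a plain list stands in for a source plugin.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

Event = dict[str, Any]


@dataclass
class Rule:
    """Connects a condition on an event to an action (hypothetical model)."""
    name: str
    condition: Callable[[Event], bool]
    action: Callable[[Event], str]


def run_rulebook(source: Iterable[Event], rules: list[Rule]) -> list[str]:
    """Consume events from a source and fire every rule whose condition matches."""
    results = []
    for event in source:                # wait for the next event
        for rule in rules:              # evaluate the event against each rule
            if rule.condition(event):
                results.append(rule.action(event))  # execute the action
    # A real rulebook would keep waiting; this sketch stops when the list ends.
    return results


# Hypothetical events, standing in for what a source plugin would deliver.
events = [
    {"service": "web", "status": "down"},
    {"service": "db", "status": "up"},
]

rules = [
    Rule(
        name="restart failed service",
        condition=lambda e: e["status"] == "down",
        action=lambda e: f"restart {e['service']}",
    )
]

fired = run_rulebook(events, rules)
print(fired)  # ['restart web']
```

Only the first event satisfies the condition, so only one action fires. In an actual rulebook the action would be a playbook or module run rather than a returned string.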

Ansible with Event-driven Feedback

This is a significant change to the capabilities of Ansible, because it turns Ansible into a self-contained, closed-loop control system. It can monitor, almost in real time, the state of the systems under management, compare that state with the desired configuration, and then execute actions to move the systems back into the desired configuration.

While this is by no means the first or only closed-loop control system of this kind, it is notable that the Ansible ecosystem is relatively open. Many competitors have offerings from which similar systems can be created, but most of them are vendor-specific, locking the user into the vendor's ecosystem. Historically, Ansible has moved in the opposite direction, allowing users to integrate services and components from various vendors together.

Action Categories

The following are typical, frequently needed actions that Event-driven Ansible can help to address:

Automated remediation (patches, resource management)

When CVEs and vulnerabilities are published, a rulebook can be introduced to check for the relevant patches and automatically apply them if they are not present.

Ticket enrichment (collecting and adding data to an existing ticket)

For many types of IT tickets, a human has to find or research data and add it to the ticket before the ticket can be resolved. For recurring tickets of this sort, an Ansible rulebook could be created to read parameters from a new ticket, query the systems under management for the data needed to resolve it, and add that data to the original ticket.
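As a rough illustration of the enrichment step, the sketch below reads a host name from a ticket, looks up data for it, and appends the result as notes. Everything here is hypothetical: the ticket fields, the `enrich_ticket` function, and the `FAKE_FACTS` table stand in for a real ticketing system and real queries against managed hosts.

```python
from typing import Callable


def enrich_ticket(ticket: dict, lookup: Callable[[str], dict]) -> dict:
    """Return a copy of the ticket with data gathered for its host appended."""
    facts = lookup(ticket["host"])
    enriched = dict(ticket)  # leave the original ticket untouched
    new_notes = [f"{key}: {value}" for key, value in sorted(facts.items())]
    enriched["notes"] = list(ticket.get("notes", [])) + new_notes
    return enriched


# Hypothetical data that a query against the managed host might return.
FAKE_FACTS = {"web01": {"os": "RHEL 9", "uptime_days": 42}}

ticket = {"id": "T-1", "host": "web01", "summary": "Need OS version"}
result = enrich_ticket(ticket, lambda host: FAKE_FACTS.get(host, {}))
print(result["notes"])  # ['os: RHEL 9', 'uptime_days: 42']
```

Passing the lookup in as a function keeps the sketch independent of any particular inventory or ticketing API.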

Automated platform scaling (scale up/down based on load)

This is another instance of the autoscaling pattern, in use in many different contexts and environments. The rulebook could, for example, monitor the rate of incoming requests to a web service. If the rate exceeds a threshold, actions are triggered to add web service capacity; conversely, capacity is reduced when the rate of incoming requests declines.
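The decision a scaling rule has to make can be sketched as a small function. The function name, the per-replica thresholds, and the replica limits below are illustrative assumptions, not values taken from EDA or from any particular platform.

```python
def desired_replicas(
    current: int,
    requests_per_sec: float,
    scale_up_at: float = 100.0,    # assumed per-replica rate that triggers scale-up
    scale_down_at: float = 40.0,   # assumed per-replica rate that triggers scale-down
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Return the replica count after one scaling decision."""
    per_replica = requests_per_sec / current
    if per_replica > scale_up_at:
        return min(current + 1, max_replicas)
    if per_replica < scale_down_at:
        return max(current - 1, min_replicas)
    return current


print(desired_replicas(current=2, requests_per_sec=300))  # 3
print(desired_replicas(current=3, requests_per_sec=90))   # 2
print(desired_replicas(current=2, requests_per_sec=150))  # 2
```

In a rulebook, the comparison would live in the rule's condition and the change in replica count would be carried out by the triggered action.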

Risk mitigation (security patches and vulnerabilities)

This is similar to the Automated remediation action, but it might add a component dedicated to scanning systems under management for indications that they are vulnerable to a particular security issue. An Automated remediation action would then apply a patch that mitigates the issue.

Automated tuning and capacity management (storage, processing)

A very common example is handling systems under management that are running out of disk space. A rulebook could be configured to monitor available disk space and, when it falls below a certain threshold, take actions to either clean up unneeded files or expand the available disk space.
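A minimal sketch of such a threshold check follows. The two-tier policy (clean up first, expand only when space is critically low), the threshold values, and the action names are assumptions made for illustration. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def disk_action(
    free_bytes: int,
    total_bytes: int,
    cleanup_below: float = 0.20,  # assumed: clean up when under 20% free
    expand_below: float = 0.05,   # assumed: expand when under 5% free
) -> str:
    """Pick a (hypothetical) action name based on the fraction of free space."""
    free_fraction = free_bytes / total_bytes
    if free_fraction < expand_below:
        return "expand_volume"
    if free_fraction < cleanup_below:
        return "clean_up_files"
    return "none"


# Check the filesystem that holds the root directory of this machine.
usage = shutil.disk_usage("/")
print(disk_action(usage.free, usage.total))
```

The printed result depends on the machine the sketch runs on, which is why no expected output is shown.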

Scaling automation (VMs, network bandwidth)

This action is a variant of the Automated tuning and capacity management action, but applied to automation resources. When events indicate that too few builder nodes are available, a rulebook could execute actions to increase the number of builder nodes and, conversely, reduce the number when the load decreases.

Benefits

Event-driven Ansible allows organizations to address problems like:

Reducing toil

Often, organizations have large queues of low-value, repetitive tasks to complete. In many cases EDA can automate some or all of the processing required to complete these tasks.

Reliability and resilience

EDA can help mitigate the effects of asynchronous events, such as service outages or newly published security vulnerabilities, that adversely affect system uptime and reliability.

Automated remediation

When systems emit monitoring and observability signals that indicate known behavior patterns, low capacity, or scalability events, EDA can automatically take steps to remediate or repair the issue, or open tickets for investigation by humans.

Ticket enrichment

Many organizations have large queues of task tickets that require research before they can be addressed and closed. EDA can be made to automatically perform some or all of the required research and add the results to the ticket. This way, the human time to process the ticket might, for example, be reduced from 30 minutes to 5 minutes.

Summary

Open-loop

Classical use of Ansible is in an open-loop configuration:

Open-loop Control System

Advantages

Simple and easy to design
Low cost of implementation and maintenance
Convenient to use
Stable

Disadvantages

No feedback, so the process cannot be fully automated
Cannot detect or correct output deviations
Unreliable, since output can be affected by external system disturbances

Closed-loop

EDA moves the use of Ansible into a closed-loop configuration:

Closed-loop Control System

Advantages

Accurate and less error-prone
Automatically corrects errors
Supports automation

Disadvantages

Complex and more difficult to design
High maintenance required
Feedback signals may cause the system to oscillate
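
The oscillation risk can be illustrated with a toy simulation. The `simulate` function and all of its numbers are hypothetical. The second run uses separate scale-up and scale-down thresholds (a dead band), which is a common control-engineering mitigation and not something prescribed by EDA.

```python
def simulate(load: float, nodes: int, up_at: float, down_at: float,
             steps: int = 6) -> list[int]:
    """Return the node count after each control step for a constant load."""
    history = []
    for _ in range(steps):
        per_node = load / nodes
        if per_node > up_at:
            nodes += 1                  # too much load per node: add one
        elif per_node < down_at and nodes > 1:
            nodes -= 1                  # too little load per node: remove one
        history.append(nodes)
    return history


# Single threshold: every correction overshoots and triggers the opposite one.
flapping = simulate(load=250, nodes=2, up_at=100, down_at=100)
print(flapping)  # [3, 2, 3, 2, 3, 2]

# Separate thresholds leave a dead band in which no action is taken.
stable = simulate(load=250, nodes=2, up_at=100, down_at=60)
print(stable)    # [3, 3, 3, 3, 3, 3]
```

With a constant load of 250, two nodes carry 125 each and three nodes carry about 83 each, so a single threshold of 100 is crossed in alternating directions on every step.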