Operational and Predictive Intelligence - Ideas Portal
Status Will not implement
Created by Christian Mochales
Created on Feb 14, 2025

Reevaluate Recovery Action System architecture

Currently the recovery action system works asynchronously, which leaves room for failures and for missed episodes that have a recovery action associated with them.

This architecture also depends on many complex pipelines in GitLab CI/CD and on searches and KV stores in Splunk, which makes it hard to troubleshoot.

It would be great to explore alternatives that simplify the architecture, easing maintenance of the system and reducing its complexity. Ideally the system would work synchronously, improving response and recovery times, and would be simpler for customers to use.

In addition, the system currently doesn't allow for event-driven automation such as automatic analysis, increasing resources, or application deployment. If executed correctly, this can benefit operational teams by increasing the number of options they have for responding to an alarm.


Proposals:

  • Explore PagerDuty capabilities for the recovery action system

  • Analyze whether it would make sense to move the RA system to PagerDuty

  • Explore the possibility of creating a custom Splunk add-on to replace the Recovery Action launcher

  • Explore PagerDuty / Rundeck integration capabilities and benefits to simplify customer usage of the system

  • Explore event-driven automation for the RA system: Does it make sense? How do we avoid turning it into a job scheduler? Which actions should be allowed apart from recovery (automatic analysis generation, increasing resources on VMs, application redeployment...)?
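
To make the last proposal more concrete, here is a minimal Python sketch of one possible shape for it: a dispatcher that runs an action synchronously for each episode and only accepts actions from an explicit allowlist, which is one way to keep it from growing into a general job scheduler. Everything named here (Episode, ALLOWED_ACTIONS, dispatch and the handler functions) is hypothetical and is not part of the current RA system or of any Splunk, PagerDuty or Rundeck API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Episode:
    """Hypothetical alarm episode; the field names are assumptions."""
    episode_id: str
    action: str   # name of the requested action
    target: str   # host or application the action applies to


def restart_service(ep: Episode) -> str:
    # Placeholder: a real handler would call the automation backend.
    return f"restarted {ep.target}"


def generate_analysis(ep: Episode) -> str:
    # Placeholder for an automatic analysis step.
    return f"analysis generated for {ep.target}"


# Explicit allowlist: only actions registered here can be triggered by an event.
ALLOWED_ACTIONS: Dict[str, Callable[[Episode], str]] = {
    "restart_service": restart_service,
    "generate_analysis": generate_analysis,
}


def dispatch(ep: Episode) -> dict:
    """Run the episode's action synchronously and always return an outcome."""
    handler = ALLOWED_ACTIONS.get(ep.action)
    if handler is None:
        return {
            "episode_id": ep.episode_id,
            "status": "rejected",
            "detail": f"action '{ep.action}' is not allowed",
        }
    try:
        detail = handler(ep)
    except Exception as exc:  # report the failure instead of losing the episode
        return {"episode_id": ep.episode_id, "status": "failed", "detail": str(exc)}
    return {"episode_id": ep.episode_id, "status": "done", "detail": detail}
```

Because dispatch always returns a status for the episode it was given, a caller can record the outcome against that episode immediately, and a request for anything outside the allowlist is rejected rather than queued.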