The recovery action (RA) system currently works asynchronously, which leaves room for failed or missed episodes that have a recovery action associated with them.
The architecture also depends on many complex GitLab CI/CD pipelines and on Splunk searches and KV stores, which makes it hard to troubleshoot.
It would be worth exploring alternatives that simplify the architecture (easing maintenance and reducing complexity), make the system work synchronously (improving response and recovery times), and simplify customer usage of the system.
In addition, the system currently does not support event-driven automation such as automatic analysis, resource increases, or application deployment. Done well, this would benefit operational teams by giving them more options for responding to an alarm.
Proposals:
Explore PagerDuty capabilities for the recovery action system
Analyze whether it makes sense to move the RA system to PagerDuty
Explore the possibility of creating a custom Splunk add-on to replace the Recovery action launcher
Explore PagerDuty / Rundeck integration capabilities and benefits to simplify customer usage of the system
Explore event-driven automation for the RA system: Does it make sense? How do we avoid turning it into a job scheduler? Which actions should be allowed apart from recovery (automatic analysis generation, resource increases on VMs, application redeployment...)?
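
As a starting point for the event-driven discussion, here is a minimal Python sketch of one way to keep the system from drifting into a job scheduler: every action must be tied to an episode, and only explicitly registered action kinds can run. All names here (`ActionKind`, `AlertEvent`, `Dispatcher`) are hypothetical and do not refer to any existing Splunk, PagerDuty, or Rundeck API. The call is synchronous, so the caller gets the result or the failure immediately.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class ActionKind(Enum):
    """Hypothetical action kinds, taken from the list in the last proposal."""
    RECOVERY = "recovery"
    ANALYSIS = "analysis"
    SCALE_UP = "scale_up"
    REDEPLOY = "redeploy"


@dataclass(frozen=True)
class AlertEvent:
    episode_id: str     # assumed identifier of the episode that raised the alarm
    action: ActionKind  # what the alarm asks the system to do
    target: str         # assumed name of the affected host or application


Handler = Callable[[AlertEvent], str]


class Dispatcher:
    """Runs an action synchronously, only if its kind has been allowed."""

    def __init__(self) -> None:
        self._handlers: Dict[ActionKind, Handler] = {}

    def register(self, kind: ActionKind, handler: Handler) -> None:
        # Registering a handler is what puts an action kind on the allowlist.
        self._handlers[kind] = handler

    def dispatch(self, event: AlertEvent) -> str:
        # An action without an episode would be a scheduled job, so reject it.
        if not event.episode_id:
            raise ValueError("actions must be triggered by an episode")
        handler = self._handlers.get(event.action)
        if handler is None:
            raise PermissionError(f"action kind not allowed: {event.action.value}")
        # Synchronous call: the result (or exception) goes straight to the caller.
        return handler(event)
```

For example, registering only `ActionKind.ANALYSIS` means a redeploy request fails with `PermissionError` instead of being silently queued, which makes a missed action visible at the moment the alarm is handled.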