Incident Management

Goal

Update the core runbook technology to cater for the major incident use case.

Problems to solve

#Rigidity

Introduce a dynamic mode (allowing edits which were previously restricted to planning or paused runbook states)

#Visibility

Implement an activity feed

#Extensibility

Allow custom apps to be hosted within a runbook

Solutions

Dynamic Runbooks

To break the planning, rehearsal and live event cycle of a standard runbook we needed a more flexible runbook type, one that allows edits to be applied during a live run. This is where the dynamic setting came in. When enabled, it allows edits to a runbook during a live run. We achieved this by adding the new setting and ensuring that every editing capability took it into account.

Closing out a runbook needed a similar change. For a standard runbook we would use the finishing of the final task as the indicator that the runbook was complete. For incidents, however, we needed a manual closing mechanism so that post-resolution tasks could still be added to an ongoing incident, and we introduced a second setting to achieve this. The incident flag was kept separate from the dynamic flag so that new runbook types could benefit from being dynamic without necessarily being incidents.
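The two settings can be sketched roughly as below. This is a minimal illustration, not the production code: the `Runbook` shape, the stage names and the functions `canEdit`, `onTaskFinished` and `closeIncident` are all assumed names chosen for the example.

```typescript
// All names here are hypothetical; the real runbook model is not shown in this write-up.
type RunStage = "planning" | "paused" | "live" | "complete";

interface Runbook {
  stage: RunStage;
  dynamic: boolean; // when true, edits are allowed during a live run
  incident: boolean; // when true, the runbook must be closed manually
  remainingTasks: number;
}

// Edits were previously restricted to planning or paused runbooks.
// The dynamic setting extends this to live runs.
function canEdit(rb: Runbook): boolean {
  if (rb.stage === "planning" || rb.stage === "paused") return true;
  return rb.stage === "live" && rb.dynamic;
}

// A standard runbook completes when its final task finishes.
// An incident stays open so post-resolution tasks can still be added.
function onTaskFinished(rb: Runbook): Runbook {
  const remainingTasks = Math.max(0, rb.remainingTasks - 1);
  const autoComplete = remainingTasks === 0 && !rb.incident;
  return { ...rb, remainingTasks, stage: autoComplete ? "complete" : rb.stage };
}

// Manual close, used for incidents.
function closeIncident(rb: Runbook): Runbook {
  return { ...rb, stage: "complete" };
}
```

Keeping `dynamic` and `incident` as two independent booleans is what lets a runbook be editable while live without also requiring a manual close.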

Activity Feed

Incidents required a windowed view into all actions occurring on a runbook in real time. An incident is made up of participants who are either co-ordinators or those who action tasks, and the activity feed gave the co-ordinators an up-to-date view of the incident's progress. The app's event-based architecture meant we were able to create jobs that listened to all relevant events (e.g. task created) and created an activity feed entry for each one. Each entry was then broadcast over the web socket so that every live connection saw the update within the activity feed window. The activity feed also supported live chat, allowing the real-time comms that are an essential element of co-ordinating large-scale recovery events.
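The listen, record and broadcast flow described above might look something like the following sketch. It is an in-memory stand-in under stated assumptions: the event names, the `ActivityFeed` class and the callback-based `Connection` type are hypothetical, and a real deployment would push entries over a web socket rather than call a function.

```typescript
// Hypothetical event and entry shapes; the real payloads are not shown in this write-up.
interface RunbookEvent {
  type: string;
  runbookId: number;
  actor: string;
  detail: string;
}

interface ActivityEntry {
  runbookId: number;
  message: string;
  createdAt: string;
}

// Stand-in for a live web socket connection.
type Connection = (entry: ActivityEntry) => void;

// Assumed list of events worth showing in the feed (chat messages included).
const RELEVANT_EVENTS: string[] = ["task.created", "task.finished", "chat.message"];

class ActivityFeed {
  private entries: ActivityEntry[] = [];
  private connections: { runbookId: number; send: Connection }[] = [];

  // Register a live connection for one runbook; returns an unsubscribe function.
  subscribe(runbookId: number, send: Connection): () => void {
    const conn = { runbookId, send };
    this.connections.push(conn);
    return () => {
      this.connections = this.connections.filter((c) => c !== conn);
    };
  }

  // The "job": turn a relevant event into a feed entry, store it, broadcast it.
  handle(event: RunbookEvent): void {
    if (RELEVANT_EVENTS.indexOf(event.type) === -1) return;
    const entry: ActivityEntry = {
      runbookId: event.runbookId,
      message: `${event.actor}: ${event.type} (${event.detail})`,
      createdAt: new Date().toISOString(),
    };
    this.entries.push(entry);
    this.connections
      .filter((c) => c.runbookId === event.runbookId)
      .forEach((c) => c.send(entry));
  }

  // Entries recorded so far for one runbook, oldest first.
  history(runbookId: number): ActivityEntry[] {
    return this.entries.filter((e) => e.runbookId === runbookId);
  }
}
```

Because the feed only reacts to events, new entry types can be added by extending the list of relevant events without touching the code that raises them.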

Cutover apps

Many Cutover clients are already tied into 3rd party or internal software that still plays an integral part in any incident resolution. To provide a one-stop shop, we opened up the core application so that 3rd parties could host custom mini-apps inside the runbook. We achieved this by providing a set of building blocks based on our internal React component library. When the incident loaded, a request was made for each configured app to its 3rd party endpoint. The request included the runbook id and the response URL to which the make-up of the app should be POSTed back. On receipt of the 3rd party's response, a JSON parser converted the JSON into a series of nested React components, which were rendered inside the container for the mini-app. Interactions with the app were also posted back to the configured endpoint with details of the user, the runbook id, the element interacted with and any other element-specific details (e.g. selected option 'Critical'). This allowed clients to host apps responsible for organising Zoom bridges, managing access, creating summaries and closing out incidents on external systems.
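To illustrate the JSON-to-components step, here is a simplified sketch. The block types (`panel`, `text`, `select`), their fields and the `Interaction` payload are assumptions made up for the example, not the actual schema. To stay self-contained it renders to a markup-like string, where the real parser would produce nested React components.

```typescript
// Hypothetical building blocks; the real set comes from the internal component library.
type AppNode =
  | { type: "panel"; title: string; children: AppNode[] }
  | { type: "text"; value: string }
  | { type: "select"; id: string; options: string[] };

// Validate the 3rd party's JSON and convert it into a typed tree of building blocks.
function parseNode(raw: unknown): AppNode {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("node must be an object");
  }
  const node = raw as Record<string, unknown>;
  const options = node["options"];
  const children = node["children"];
  switch (node["type"]) {
    case "text":
      return { type: "text", value: String(node["value"]) };
    case "select":
      return {
        type: "select",
        id: String(node["id"]),
        options: Array.isArray(options) ? options.map((o) => String(o)) : [],
      };
    case "panel":
      return {
        type: "panel",
        title: String(node["title"]),
        children: Array.isArray(children) ? children.map((c) => parseNode(c)) : [],
      };
    default:
      throw new Error(`unknown building block: ${String(node["type"])}`);
  }
}

// Stand-in for rendering: each block maps to one component, nested recursively.
function render(node: AppNode): string {
  switch (node.type) {
    case "text":
      return `<Text>${node.value}</Text>`;
    case "select":
      return `<Select id="${node.id}">${node.options.join("|")}</Select>`;
    case "panel":
      return `<Panel title="${node.title}">${node.children.map((c) => render(c)).join("")}</Panel>`;
  }
}

// Assumed shape of the payload posted back to the endpoint when a user interacts.
interface Interaction {
  userId: number;
  runbookId: number;
  elementId: string;
  value: string; // element-specific detail, e.g. the selected option
}

function buildInteraction(userId: number, runbookId: number, elementId: string, value: string): Interaction {
  return { userId, runbookId, elementId, value };
}
```

Restricting the JSON to a known set of block types means an unrecognised block fails loudly at parse time instead of rendering something unexpected inside the runbook.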

Takeaways

Context is everything. Having a deep understanding of the client's needs is essential in devising solutions that solve problems and make sense within the existing product.
Everything is an event! Thinking in this way makes it easy to introduce side effects that run asynchronously, which keeps the app quick.
Protect against race conditions. Often the sequence of events cannot be guaranteed. I learnt a lot about handling this accordingly.
Compromise is key. There was a lot of debate around how we allowed 3rd party apps to be embedded. By developing the apps in-house we can see common patterns and gain a better understanding of how the apps are constructed. This allows us to create larger, less granular components that are less prone to design issues.