Cutover

Incident Management

Goal

Update the core runbook technology to cater for the major incident use case. 

Problems to solve

#Rigidity
Introduce a dynamic mode (allowing edits which were previously restricted to planning or paused runbook states)
#Visibility
Implement an activity feed
#Extensibility
Allow custom apps to be hosted within a runbook

Solutions

Incident Overview
[Diagram: 3rd Party Trigger, Incident Template, Runbook, 3rd Party Apps, Snippets, Linked Runbooks]
The incident runbook lifecycle required many core changes. First, we needed a more flexible runbook type. Introducing the dynamic type meant we could allow updates to the runbook during a live run. Next, we needed to ensure runbooks did not close out naturally (when the last task was completed), but instead could be closed by third-party apps. For customisation, we allowed clients to embed their own components inside the runbook. Finally, we needed a live window into all activity during the incident. We incorporated a dedicated component below the task list view to display actions as they occurred in real time.
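The editing and close-out rules described above can be sketched roughly as follows. This is a minimal illustration rather than the actual implementation: the type names, the state names, the `openTasks` field and the rule that closed runbooks are read-only are all assumptions made for the example.

```typescript
// All names here are hypothetical; they only illustrate the rules in the text.
type RunbookType = "standard" | "dynamic";
type RunbookState = "planning" | "running" | "paused" | "closed";

interface Runbook {
  type: RunbookType;
  state: RunbookState;
  openTasks: number; // tasks not yet completed
}

// Standard runbooks accept edits only while planning or paused;
// dynamic runbooks also accept edits during a live run.
function canEdit(runbook: Runbook): boolean {
  if (runbook.state === "closed") return false; // assumption: closed means read-only
  if (runbook.type === "dynamic") return true;
  return runbook.state === "planning" || runbook.state === "paused";
}

// Evaluated after a task completes. A standard runbook closes naturally once
// no tasks remain; a dynamic runbook stays open until it is closed explicitly.
function stateAfterTaskCompleted(runbook: Runbook): RunbookState {
  const finished = runbook.state === "running" && runbook.openTasks === 0;
  return finished && runbook.type === "standard" ? "closed" : runbook.state;
}

// Explicit close, for example when triggered by a third-party app.
function closeRunbook(runbook: Runbook): Runbook {
  return { ...runbook, state: "closed" };
}
```

Keeping the rules in small pure functions like these makes the difference between the two runbook types easy to test in isolation.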
Activity Feed
[Diagram: Event Bus, Job Listeners, WebSocket Server, Activity Feed, Live Clients, Live Chat]
Incidents required a live view of every action occurring on a runbook. An incident would be made up of participants who were either co-ordinators or those who actioned tasks. The activity feed gave the co-ordinators an up-to-date view of the incident's progress. The app's event-based architecture meant we were able to create jobs that listened for all relevant events (e.g. task created) and created an activity feed entry for each one. Each entry would then be broadcast over the WebSocket so that every live connection saw the update within the activity feed window. The activity feed also supported live chat, allowing the real-time comms that are an essential element of co-ordinating large-scale recovery events.
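The flow from event to feed entry to live clients could look something like the sketch below. It is an in-memory approximation under stated assumptions: the event names, `EventBus`, `FeedBroadcaster` and `LiveClient` are hypothetical, and `LiveClient.send` stands in for a real WebSocket connection.

```typescript
// Hypothetical event shapes; the real application's events are not listed in the text.
type RunbookEvent =
  | { type: "task.created"; runbookId: string; taskName: string; actor: string }
  | { type: "task.completed"; runbookId: string; taskName: string; actor: string }
  | { type: "chat.message"; runbookId: string; text: string; actor: string };

interface FeedEntry {
  runbookId: string;
  message: string;
  at: number; // timestamp in milliseconds
}

// Minimal synchronous event bus.
class EventBus {
  private listeners: Array<(event: RunbookEvent) => void> = [];
  subscribe(listener: (event: RunbookEvent) => void): void {
    this.listeners.push(listener);
  }
  publish(event: RunbookEvent): void {
    this.listeners.forEach((listener) => listener(event));
  }
}

// Stand-in for one live WebSocket connection viewing a runbook.
interface LiveClient {
  runbookId: string;
  send(payload: string): void;
}

class FeedBroadcaster {
  private clients = new Set<LiveClient>();
  // Registers a client and returns a function that disconnects it.
  connect(client: LiveClient): () => void {
    this.clients.add(client);
    return () => {
      this.clients.delete(client);
    };
  }
  // Sends the entry only to clients viewing the same runbook.
  broadcast(entry: FeedEntry): void {
    const payload = JSON.stringify(entry);
    this.clients.forEach((client) => {
      if (client.runbookId === entry.runbookId) client.send(payload);
    });
  }
}

function describeEvent(event: RunbookEvent): string {
  switch (event.type) {
    case "task.created":
      return `${event.actor} created task "${event.taskName}"`;
    case "task.completed":
      return `${event.actor} completed task "${event.taskName}"`;
    case "chat.message":
      return `${event.actor}: ${event.text}`;
  }
}

// The "job listener": turns every event into a feed entry and broadcasts it.
function attachActivityFeed(
  bus: EventBus,
  broadcaster: FeedBroadcaster,
  now: () => number = Date.now
): FeedEntry[] {
  const feed: FeedEntry[] = [];
  bus.subscribe((event) => {
    const entry: FeedEntry = {
      runbookId: event.runbookId,
      message: describeEvent(event),
      at: now(),
    };
    feed.push(entry);
    broadcaster.broadcast(entry);
  });
  return feed;
}
```

In this sketch live chat is modelled as just another event type, so chat messages and task updates share the same path to the feed.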
Cutover apps
[Diagram: 3rd Party Endpoint, Core App, Component Library, JSON Parser, Runbook Container, Mini App]
Many Cutover clients are already tied into 3rd party or internal software that still plays an integral part in any incident resolution. To provide a one-stop shop, we opened up the core application so that 3rd parties could host custom mini-apps inside the runbook. We achieved this by providing a set of building blocks from our internal React component library. On load of the incident, the core app would make a request to the 3rd party endpoint for each configured app. The request would include the runbook id and the response URL to which the make-up of the app should be POSTed. On receipt of the 3rd party's response, a JSON parser would convert the JSON into a series of nested React components, which were rendered inside the container for the mini app. Interactions with the app were also posted back to the configured endpoint with details of the user, the runbook id, the element interacted with and any other element-specific details (e.g. selected option 'Critical'). This allowed clients to host apps responsible for organising Zoom bridges, managing access, creating summaries and closing out incidents on external systems.
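As a rough illustration of the parsing step, the sketch below walks a JSON description of a mini app and builds nested output from a registry of building blocks. The JSON schema (`type`, `props`, `children`), the component names and the `parseMiniApp` function are assumptions for the example. To keep it self-contained, the building blocks return markup strings; in the real application they would be React components from the component library.

```typescript
// Hypothetical schema for a mini app; the real schema is not given in the text.
interface AppNode {
  type: string;
  props?: Record<string, string>;
  children?: AppNode[];
}

// A building block receives its props and its already-rendered children.
type Renderer = (props: Record<string, string>, children: string[]) => string;

// Stand-ins for component library building blocks.
// Note: values are not escaped here; a real renderer must handle that.
const registry = new Map<string, Renderer>([
  ["panel", (props, children) =>
    `<section title="${props.title ?? ""}">${children.join("")}</section>`],
  ["text", (props) => `<p>${props.value ?? ""}</p>`],
  ["button", (props) => `<button id="${props.id ?? ""}">${props.label ?? ""}</button>`],
]);

// Structural check so that malformed third-party JSON is rejected early.
// Only `type` and `children` are validated in this sketch.
function isAppNode(value: unknown): value is AppNode {
  if (typeof value !== "object" || value === null) return false;
  const node = value as { type?: unknown; children?: unknown };
  if (typeof node.type !== "string") return false;
  if (node.children === undefined) return true;
  return Array.isArray(node.children) && node.children.every(isAppNode);
}

// Depth-first conversion: children are rendered before their parent.
function renderNode(node: AppNode): string {
  const renderer = registry.get(node.type);
  if (!renderer) throw new Error(`Unknown component type: ${node.type}`);
  const children = (node.children ?? []).map(renderNode);
  return renderer(node.props ?? {}, children);
}

function parseMiniApp(json: string): string {
  const parsed: unknown = JSON.parse(json);
  if (!isAppNode(parsed)) throw new Error("Invalid mini app definition");
  return renderNode(parsed);
}
```

Restricting third parties to a fixed registry of known component types means an unrecognised `type` fails loudly instead of rendering arbitrary content inside the runbook.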

Takeaways

  • Context is everything. A deep understanding of the client’s needs is essential to devising solutions that solve their problems and make sense within the existing product.
  • Everything is an event! Thinking this way makes it easy to introduce asynchronous side effects, which keeps the app quick.
  • Protect against race conditions. Often the sequence of events cannot be guaranteed, and I learnt a lot about handling this accordingly.
  • Compromise is key. There was a lot of debate around how we allowed 3rd party apps to be embedded. By developing the apps in house we can see common patterns and gain a better understanding of how the apps are constructed. This allows us to create larger components that are less granular and less prone to design issues.
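On the race-condition point, one common way to cope with events that arrive out of order is to buffer them until the gaps are filled. The sketch below shows that general technique only; it is not necessarily the approach taken on this project, and it assumes each event carries a hypothetical `seq` number assigned by the producer.

```typescript
// Assumption: the producer stamps each event with an increasing sequence number.
interface SequencedEvent<T> {
  seq: number;
  payload: T;
}

// Delivers payloads strictly in sequence order, holding back early arrivals.
class InOrderBuffer<T> {
  private nextSeq: number;
  private pending = new Map<number, T>();
  private deliver: (payload: T) => void;

  constructor(deliver: (payload: T) => void, firstSeq: number = 1) {
    this.deliver = deliver;
    this.nextSeq = firstSeq;
  }

  receive(event: SequencedEvent<T>): void {
    // Ignore events already delivered or already waiting (duplicates).
    if (event.seq < this.nextSeq || this.pending.has(event.seq)) return;
    this.pending.set(event.seq, event.payload);
    // Flush every consecutive event that is now available.
    while (this.pending.has(this.nextSeq)) {
      const payload = this.pending.get(this.nextSeq) as T;
      this.pending.delete(this.nextSeq);
      this.nextSeq += 1;
      this.deliver(payload);
    }
  }
}
```

A production version would also need a policy for gaps that never fill, such as a timeout or a re-fetch of the missing event.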