Cutover

Linked runbooks

Goal

To create a hierarchy between runbooks, allowing a single master runbook to control multiple child runbooks.

Problems to solve

#Visibility
For large events, a single runbook often contains too much noise, because each team tends to merge its tasks into the same runbook.
#Performance
A single runbook containing every team's tasks becomes very large, which can result in poor performance.
#Security
All teams need access to the single runbook to complete their tasks. They may see tasks that aren't related to their activities.

Solutions

Linked runbook overview
[Diagram: User, Parent Runbook, Link Task, Link Table, Child Runbook]
To give large-scale events better visibility, performance, and security, we needed a way to break a single monolithic runbook into a hierarchy of team-specific runbooks. We introduced a polymorphic link table that can associate any task with any entity. Initially it was used to link a task in a parent runbook to a child runbook, making the parent the master controller and each child an isolated team view. The design is also extensible: future entity types can be linked without schema changes.
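As a rough illustration of the polymorphic link idea, the sketch below models a link row as a task id plus a type label and an id for the linked entity. The field names (`linkable_type`, `linkable_id`), the `LinkTable` class, and the "Dashboard" entity type are hypothetical and chosen only for this example; the real schema is not described in this write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One row of a polymorphic link table (hypothetical schema)."""
    task_id: int
    linkable_type: str  # names the kind of entity being linked, e.g. "Runbook"
    linkable_id: int    # primary key within that entity type


class LinkTable:
    """In-memory stand-in for the link table."""

    def __init__(self) -> None:
        self._rows: list[Link] = []

    def add(self, task_id: int, linkable_type: str, linkable_id: int) -> Link:
        link = Link(task_id, linkable_type, linkable_id)
        self._rows.append(link)
        return link

    def for_task(self, task_id: int) -> list[Link]:
        """All entities linked from a given task."""
        return [row for row in self._rows if row.task_id == task_id]

    def tasks_linking_to(self, linkable_type: str, linkable_id: int) -> list[int]:
        """All task ids that link to a given entity."""
        return [
            row.task_id
            for row in self._rows
            if (row.linkable_type, row.linkable_id) == (linkable_type, linkable_id)
        ]


links = LinkTable()
links.add(task_id=10, linkable_type="Runbook", linkable_id=501)
links.add(task_id=11, linkable_type="Runbook", linkable_id=502)
# A new entity type only needs a new type label, not a new column or table.
links.add(task_id=12, linkable_type="Dashboard", linkable_id=7)
```

Because the entity type is stored as data rather than as a dedicated foreign-key column, linking a new kind of entity is an insert rather than a migration. The trade-off is that the database can no longer enforce the reference with an ordinary foreign key.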
Creation Flow
[Diagram: Select Template, Parent Runbook, Link Task, Event Hub, Create Link, Convert, Child Runbook]
Creating linked runbooks had to be frictionless even at large scale. We introduced a template select modal with full-text search, filtering, and bulk selection so operators could link tens of runbooks in a single action. To ensure that no two link tasks ever share the same runbook, the system converts the selected template into a brand new child runbook and updates the link as soon as a link task is created inside a live runbook. If the link task is still inside a template, the link stays on the template and conversion happens later, when the template goes live.
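The conversion rule described above can be sketched as follows. This is a minimal model under stated assumptions: `Runbook`, `LinkTask`, `convert`, `create_link_task`, and `go_live` are hypothetical names, and the real conversion copies far more than a name.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple id generator for the example


@dataclass
class Runbook:
    name: str
    is_template: bool
    id: int = field(default_factory=lambda: next(_ids))


@dataclass
class LinkTask:
    parent: Runbook
    linked: Runbook  # a template until converted, then a dedicated child runbook


def convert(template: Runbook) -> Runbook:
    """Create a brand new live runbook from a template (hypothetical helper)."""
    return Runbook(name=template.name, is_template=False)


def create_link_task(parent: Runbook, template: Runbook) -> LinkTask:
    task = LinkTask(parent=parent, linked=template)
    if not parent.is_template:
        # Live parent: convert immediately so no two link tasks share a runbook.
        task.linked = convert(template)
    return task


def go_live(parent_template: Runbook, tasks: list[LinkTask]) -> tuple[Runbook, list[LinkTask]]:
    """Deferred conversion: run when a parent template goes live."""
    live_parent = convert(parent_template)
    live_tasks = [LinkTask(parent=live_parent, linked=convert(t.linked)) for t in tasks]
    return live_parent, live_tasks


live = Runbook("Release weekend", is_template=False)
db_template = Runbook("Database team", is_template=True)

# Selecting the same template twice still yields two distinct child runbooks.
first = create_link_task(live, db_template)
second = create_link_task(live, db_template)
```

Bulk selection in this model is just a loop over the chosen templates, for example `[create_link_task(live, t) for t in selected_templates]`.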
Run Flow
[Diagram: Parent Runbook, Event Hub, Start Runbook, Child Runbook, Update Progress]
At run time, the parent and child runbooks are decoupled via the event bus. Starting a link task publishes a task-started event; a subscriber job picks this up and starts the child runbook. When the child's final task completes, it publishes a runbook-complete event; another job picks this up and closes the parent link task. Timing information syncs bidirectionally through the same event architecture. Changes to the parent's start times propagate to all children via a dedicated sync job, which polls in order to handle race conditions arising from the asynchronous event flow.
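The run flow can be approximated with a tiny in-process event bus. Everything named here is an assumption for illustration: the topic strings (`task.started`, `runbook.complete`), the handler names, and the dictionaries standing in for persisted state. In the real system the subscribers are background jobs, which the queue-and-drain step below only imitates.

```python
from collections import defaultdict, deque


class EventBus:
    """Minimal bus: events are queued, then handled later, like background jobs."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._queue = deque()

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        self._queue.append((topic, payload))

    def drain(self):
        """Process queued events in order; handlers may publish further events."""
        while self._queue:
            topic, payload = self._queue.popleft()
            for handler in self._subscribers[topic]:
                handler(payload)


bus = EventBus()
links = {"link-task-1": "child-A"}        # link task id -> child runbook id
task_status = {"link-task-1": "pending"}
child_status = {"child-A": "pending"}


def start_link_task(task_id):
    task_status[task_id] = "in_progress"
    bus.publish("task.started", {"task_id": task_id})


def start_child_job(event):
    # Subscriber: start the child runbook behind a started link task.
    child_id = links.get(event["task_id"])
    if child_id is not None:
        child_status[child_id] = "running"


def complete_child(child_id):
    child_status[child_id] = "complete"
    bus.publish("runbook.complete", {"runbook_id": child_id})


def close_link_task_job(event):
    # Subscriber: close the parent link task that points at the finished child.
    for task_id, child_id in links.items():
        if child_id == event["runbook_id"]:
            task_status[task_id] = "complete"


bus.subscribe("task.started", start_child_job)
bus.subscribe("runbook.complete", close_link_task_job)

start_link_task("link-task-1")
bus.drain()                 # the child runbook is started by the subscriber
complete_child("child-A")
bus.drain()                 # the parent link task is closed by the subscriber
```

Note that between `publish` and `drain` the parent and child briefly disagree about state. That window is where the race conditions mentioned above come from.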

Takeaways

  • Because the event architecture is asynchronous, we ran into a few race conditions, something I hadn't come across much before. I learnt how complex timing calculations can be, and how polling, as opposed to recalculating after every action, avoids potential performance issues.
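To make the polling point concrete, here is a small sketch of one way to coalesce timing changes. It assumes a "dirty set" approach: each change only records the latest value, and a periodic poll does the actual recalculation. The `TimingSync` class and its fields are hypothetical, and copying the parent's start time straight onto each child is a deliberate simplification of the real timing calculations.

```python
from datetime import datetime, timedelta


class TimingSync:
    """Coalesces parent timing changes and applies them on a polling tick."""

    def __init__(self, children_of):
        self.children_of = children_of  # parent id -> list of child ids
        self.parent_start = {}
        self.child_start = {}
        self._dirty = set()
        self.recalculations = 0         # counts how often real work was done

    def parent_start_changed(self, parent_id, start):
        # Cheap: remember the latest value and mark the parent as dirty.
        self.parent_start[parent_id] = start
        self._dirty.add(parent_id)

    def poll(self):
        # Swap out the dirty set so changes arriving mid-poll wait for the next tick.
        dirty, self._dirty = self._dirty, set()
        for parent_id in dirty:
            self.recalculations += 1
            for child_id in self.children_of.get(parent_id, []):
                self.child_start[child_id] = self.parent_start[parent_id]


sync = TimingSync({"parent": ["child-A", "child-B"]})
base = datetime(2024, 1, 1, 9, 0)

# Three rapid changes arrive before the next poll.
for minutes in (0, 15, 30):
    sync.parent_start_changed("parent", base + timedelta(minutes=minutes))

sync.poll()  # one recalculation, using only the latest start time
```

Since the poll reads whatever the latest state is, the order in which the intermediate events arrived no longer matters, which is one way polling sidesteps this class of race condition.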
Target