IT Operations has always been difficult. There is always too much work to do, not enough time to do it, and frequent interrupts. Moreover, there is the relentless pressure from executives who hold the view that everything takes too long, breaks too often, and costs too much.
In search of improvement, we have repeatedly bet on new tools to improve our work. We’ve cycled through new platforms (e.g., Virtualization, Cloud, Docker, Kubernetes) and new automation (e.g., Puppet, Chef, Ansible). While each comes with its own merits, has the stress and overload on operations fundamentally changed?
Enterprises have also spent the past two decades liberally applying management frameworks like ITIL and COBIT. Would an average operations engineer say things have gotten better or worse?
In the midst of all of this, there is conventional wisdom that rarely gets questioned.
The first of these is the idea that grouping people by functional role should be the primary driver for org structure. I discussed the problem with this idea extensively in a previous post on silos.
The second bit of conventional wisdom is the extensive use of tickets to manage operations work. This post is about how ticket queues are likely causing more harm than good.
Tickets have become the go-to work management tool in today’s operations organizations. Need something done? Open a ticket. Does someone need something from you? A ticket will appear in your queue.
Tickets have become ubiquitous. Few will think twice about adding more ticket queues across an organization. However, what if we were wrong about tickets? What if ticket queues were a significant source of operational strife hiding in plain sight?
What is wrong with tickets?
More tickets mean more queues
Tickets on their own are relatively innocuous as they are just records. The issue is where you put those tickets. Tickets go into ticket queues, and then the problems start.
In a previous post on silos, I discussed the cost of queues. Queues add delay, increase risks, add variability, add overhead, lower quality, and decrease motivation.
These aren’t just my opinions. The cost of queues is backed by extensive research in fields as diverse as physical manufacturing and product development. Queuing theory is its own respected area of academic study.
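As one illustration from that field, here is a minimal sketch of the textbook M/M/1 model (a single worker, random arrivals, random service times). Real ticket queues are messier, so treat this as a simplifying assumption rather than a model of any particular team. The function name and the example rates are hypothetical. It uses the standard result that mean time waiting in queue is W_q = ρ / (μ − λ), where λ is the arrival rate, μ is the service rate, and ρ = λ / μ is utilization.

```python
import math


def mm1_mean_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request waits in queue before work starts (M/M/1 model).

    arrival_rate: average requests arriving per unit of time (lambda).
    service_rate: average requests the team can complete per unit of time (mu).
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_rate must be > 0")
    if arrival_rate >= service_rate:
        # Demand meets or exceeds capacity: the queue grows without bound.
        return math.inf
    utilization = arrival_rate / service_rate
    return utilization / (service_rate - arrival_rate)


if __name__ == "__main__":
    service_rate = 10.0  # hypothetical: the team can close 10 tickets per day
    for arrival_rate in (5.0, 8.0, 9.0, 9.5, 9.9):
        wait_days = mm1_mean_wait(arrival_rate, service_rate)
        utilization = arrival_rate / service_rate
        print(f"utilization {utilization:.0%}: mean wait {wait_days:.2f} days")
```

Under these assumptions, the mean wait is about 0.1 days at 50% utilization, 0.9 days at 90%, and 9.9 days at 99%. The delay grows nonlinearly as a team gets busier, which is why a queue that looks harmless at moderate load can become a serious source of delay near full load.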
In the rest of this post, I’ll use “tickets” and “queues” somewhat interchangeably. Just know that it is the queue part that causes the problems.
Tickets increase communication problems
Tickets delay feedback and learning
Almost every modern management philosophy (from Lean to DevOps and the Lean Startup) hinges on the concept of the organization learning quicker. All strive for increasingly tighter feedback loops so that analysis and decisions happen faster.
However, as Scott Prugh likes to remind me every time I bring up the topic, “Queues don’t learn.” Queues work against feedback loops by both injecting delays and stripping each request of its context. It is tough to become a learning organization when there is rampant use of ticket-driven request queues.
Tickets encourage bottlenecks
The nature of how teams work through ticket queues encourages bottlenecks. First, ticket queues are often used where there are capacity mismatches. For example, ticket queues are commonly used to buffer requests made of specialist teams (e.g., network, database, security) who are largely outnumbered by people who need them to do something. These requesting teams “stuff” the queue with requests, which causes the queue length and response times to grow. Because queues delay feedback, the requestors are often unaware of (or don’t care about) the capacity mismatch and continue to stuff the queue. This behavior increases both the queue length and response time, worsening the bottleneck.
Second, as queue lengths grow, the teams responsible for working the queue will instinctively look inward to protect their capacity. This natural reaction leads to optimizations for the team behind the queue — often at the expense of the broader organization. For example, if a firewall team only makes non-emergency changes on Mondays and Thursdays, it creates a delay for the rest of the organization even if it helps the firewall team optimize its workload.
Tickets reinforce siloed ways of working
Ticket queues act as buffers that allow teams to continue working in a siloed, disconnected manner. The more disconnected teams become, the more they behave like siloed pools of specialist labor.
Requests are made of these specialists via ticket queues, and requests are fulfilled as one-offs through semi-manual or manual efforts. Variability is high. Priorities and context are difficult to gauge.
As with the reaction to bottlenecks, the primary management focus is on protecting team capacity (not the needs of the broader organization). The more that silo effects are reinforced, the more disconnects, mistakes, and delays there will be. Ticket queues are the enablers of this downward spiral.
Tickets create snowflakes
The default FIFO nature of ticket queues encourages one-offs. When a ticket comes up to the top of the queue, the people working the ticket queue will spring into action, attempt to garner as much context as possible in the limited time they have, do the needful, and then move on to the next ticket.
This way of working — jumping from one request to another, each with seemingly random context — is a leading cause of snowflakes. “Snowflakes” is a term that describes something that may be technically correct (even perfect) but a one-off that is not reproducible. A manually updated server is an excellent example of a snowflake. You may be able to get it into a working state, but in all likelihood, it is going to be slightly different from other servers in your fleet (and often in undetectable ways).
The cost of snowflakes might seem minimal at first, but as environments grow, the costs quickly compound and create an expensive and seemingly intractable condition. As Keith Chambers has reminded me, “How many enterprises have had ‘snow days’ where some small, unexpected variation results in an incident that kills a team’s capacity for hours or days?”
Tickets obscure the value stream
So much of the work of the Lean, Agile, and DevOps movements has been about breaking down barriers to build a systemic view of the work that needs to be done to deliver value (this end-to-end systemic view is often referred to as the “value stream”). Because context matters in all knowledge work, understanding where each piece of work fits into the broader system is critical.
After all the work that has been done to provide transparency and build up context, breaking it down into a series of individual tickets obscures the value stream and scatters the context. In fact, breaking work down into smaller and smaller units is often seen as a ticket system best practice.
Tickets add management overhead
What are tickets good for?
Don’t get me wrong; tickets aren’t all bad. Tickets are just overused and used for the wrong reasons.
In my opinion, ticket systems are useful for raising true exceptions (e.g., logging bugs or enhancement requests) and for routing human-to-human communication when approvals are unavoidable.
Ticket queues are also useful when each request is atomic and isolated (e.g., traditional customer helpdesk or selling tickets at a box office). However, most of these requests are prime candidates for self-service automation.
When it comes to the complex knowledge work required to deliver and operate enterprise software-based services, the evidence seems clear that ticket queues are costly at best and destructive at worst.
How do we get rid of as many ticket queues as possible?
The toxic side effects of ticket-driven request queues first came to my attention when I saw Rundeck users working to eliminate the use of ticket queues. Their work would follow the same basic pattern:
1. Redesign the work (if possible) to avoid handoffs
Forward-thinking organizations have been focusing on creating “cross-functional teams” or “product-oriented teams” that can handle as much of the lifecycle as possible (without needing to hand off work to other teams). Eliminating handoffs, naturally, cuts down on the need for ticket queues and limits the opportunity for silo effects to take hold.
2. Apply the Operations as a Service design pattern to eliminate remaining ticket queues
Wherever ticket queues can’t be eliminated, replacing the queues with self-service is an excellent alternative. Operations as a Service is a design pattern that enables both the definition and execution of operations activity to be delegated across traditional organizational boundaries. Rundeck has been watching this design pattern emerge from the Rundeck user community and is documenting its positive effects. By replacing ticket queues with pull-based self-service interfaces, wait time is eliminated, feedback loops are shortened, breaks in context are avoided, and costly toil is eliminated.
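To make the shape of the pattern more concrete, here is a rough sketch of a pull-based self-service interface. It is an illustration only and is not Rundeck's API: every name in it (`SelfServiceCatalog`, `define`, `execute`, `restart_service`, the roles) is hypothetical, and the role check and audit log are my assumptions about what safe delegation would need. The idea is that the team with the expertise defines a vetted procedure once, and the teams that need it run it on demand instead of filing a ticket and waiting.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Operation:
    """A vetted procedure published by the team that owns the expertise."""
    name: str
    run: Callable[..., str]
    allowed_roles: FrozenSet[str]


class SelfServiceCatalog:
    """Hypothetical catalog: specialists define operations, others execute them."""

    def __init__(self) -> None:
        self._operations: Dict[str, Operation] = {}
        # Assumption: delegated execution should leave a record of who ran what.
        self.audit_log: List[Tuple[str, str, dict]] = []

    def define(self, name: str, run: Callable[..., str],
               allowed_roles: Iterable[str]) -> None:
        """Publish an operation and state which roles may run it."""
        self._operations[name] = Operation(name, run, frozenset(allowed_roles))

    def execute(self, name: str, requester: str, role: str, **params) -> str:
        """Run an operation immediately on behalf of the requester (no queue)."""
        if name not in self._operations:
            raise KeyError(f"unknown operation: {name}")
        operation = self._operations[name]
        if role not in operation.allowed_roles:
            raise PermissionError(f"role {role!r} may not run {name!r}")
        result = operation.run(**params)
        self.audit_log.append((requester, name, params))
        return result


def restart_service(service: str) -> str:
    """Stand-in for a real procedure; a real one would call actual tooling."""
    return f"restarted {service}"


if __name__ == "__main__":
    catalog = SelfServiceCatalog()
    # The operations team defines the procedure and who may use it.
    catalog.define("restart-service", restart_service, allowed_roles={"app-dev"})
    # An application developer runs it directly, with their own context intact.
    print(catalog.execute("restart-service", "dana", "app-dev", service="checkout"))
```

The design choice worth noticing is that the requester pulls the work when they need it, so there is no wait in a queue and no handoff where context can be lost, while the specialist team still controls what the procedure does and who may run it.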
It is time to take action