How we measure output and productivity
… getting more stuff done
At Trade Me we want to measure the overall health of Tech (that’s our team of 125 designers, developers, testers, BAs, and Squad Masters) to identify trends and to know if we are getting better (or worse!). We know that when we measure something it is a strong way of saying “This matters” which is why we put a lot of effort into deciding which metrics to collect.
There are lots of things that matter to us in Tech but the most important ones are:
- Get stuff out fast
- Have high quality
- Have happy clients (business people/end users)
- Have happy employees
- Build the right thing
We measure this with the following metrics:
[Figure: Key indicators of Tech Health]
While we measure all of the above, in this post I’m going to focus on technology output, which supports our goal of “Get stuff out fast”. In particular, I’ll focus on the number of stories shipped (I assume that everyone knows what cycle time is).
Counting the number of shipped user stories
Why user stories?
The idea behind counting how many user stories make it to production each week is that we want to measure productivity trends. Productivity in software development is impossible to measure (I think), so the best we can do is measure a proxy and look at trends. For this we need a unit that stands in for productivity, that we can count, and that, if it trends upwards, most likely means we’re getting more stuff out to end users.
We could use features, story points or stories as the unit (we know that lines of code or function points don’t work). Features are really hard to identify because we don’t track which stories belong to which feature, and as we build them incrementally we don’t just deploy at the “end of a feature”. Story points are difficult because humans subconsciously game metrics and will inadvertently inflate story points, which then makes them harder to use for forecasting. That leaves stories, where we can assume that the more we release, the more productive we probably are. When people game this metric (you really do get what you measure) they will release smaller stories, which is actually a good thing.
I am aware that stories come in different sizes, but this particular measure is only used as an aggregated number of stories released per week across the entire organisation, so we assume that it will even out across squads. As noted above, we’re only interested in trends, not actual numbers.
Really, why not story points?
At Trade Me squads that use 2-week sprints and Fibonacci story points will have this in common: 13 points is the biggest story they will commit to in a sprint. If it’s bigger than that, it needs to be split. I imagine that lends a degree of commonality to the points used. Wouldn’t aggregate trends in velocity, based on these points, be able to show you how much we’re delivering?
In theory normalised story points could be a useful indicator. The main problem with using them is the danger of half-knowledge outside the squad and of story points being abused. I have seen points turn into a currency, in the sense of “You guys owe me 5 points” or “Team B delivered 20 points last sprint. Why are you guys only delivering 17?”. When points are used to compare teams, in my experience people (especially business people) very quickly forget that story points indicate the relative size of stories to each other and are not an absolute currency of value.
So, in short, while I agree with you in theory, I’m very scared of using story points in this way as I have always seen them be misused. I am also wary of using the same measure for different things (i.e. points to measure both relative story size and team productivity) because when people game them there’ll be unintended and unknown side-effects.
This leaves the question of why we only want to count stories once they’re in production. This is because a user story only has value once it is in production and is used by end users. A measure of productivity that represents this is better than one that doesn’t.
I know that most measures outside a lab aren’t accurate, but we’re assuming that measures don’t need to be accurate if we only care about trends. Most measures are proxies, and this one could, for example, help us identify whether a problem is delivery speed or building the right thing. Say our output in user stories across the company doubles within a year but far fewer users use our product: then we should test the hypothesis that we’re building the wrong thing.
Practically, in terms of what to track for your squad, I’d count the actual story when it is released to production. In other words, every time you deploy, count the number of stories in that deploy. As prototype stories and research are necessary work but don’t provide end user value, I wouldn’t count them towards the output.
What about bug fixes?
While I think fixing bugs represents work and achievement, I don’t think bugs should be part of a metric that measures how much we get out to our end users. To me, working on a bug is finishing a story that has already been deployed and has been accounted for in our overall story number. Working on a bug represents so-called failure demand and takes time away from producing something of end user value (represented in a user story). In terms of tech output, working on a bug is a “wasteful” activity, and should the number of shipped stories across the company ever decline, one of the first things I’d look at would be failure demand and code quality (number of bugs that escaped to production).
Regarding the “Is it a bug or an enhancement” discussions I wouldn’t waste too much time. If you’re unsure just use your best judgment. It will even itself out over time.
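To make the counting rules above concrete, here is a minimal sketch in Python. It assumes each deploy can be represented as a date plus a list of work items with a type label; the names WorkItem, Deploy, COUNTED_KINDS and stories_shipped_per_week are all hypothetical and not taken from any tool we actually use. It counts only user stories, per ISO week, and skips bugs, prototypes and research.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Tuple

# Assumption: work items carry a type label. Only user stories count
# towards output; bugs, prototypes and research are left out.
COUNTED_KINDS = {"story"}


@dataclass(frozen=True)
class WorkItem:
    key: str   # hypothetical ticket identifier
    kind: str  # e.g. "story", "bug", "prototype", "research"


@dataclass(frozen=True)
class Deploy:
    deployed_on: date             # day the deploy reached production
    items: Tuple[WorkItem, ...]   # everything that went out in this deploy


def stories_shipped_per_week(deploys: Iterable[Deploy]) -> Dict[Tuple[int, int], int]:
    """Return {(iso_year, iso_week): number of user stories deployed}."""
    counts: Counter = Counter()
    for deploy in deploys:
        iso = deploy.deployed_on.isocalendar()
        week = (iso[0], iso[1])  # ISO year and ISO week number
        counts[week] += sum(1 for item in deploy.items if item.kind in COUNTED_KINDS)
    return dict(counts)
```

Summing the result over all squads gives the single organisation-wide number per week. An item whose type is unclear just gets whatever label your best judgment gives it.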
Does it work?
Below is our trend of user stories shipped since we introduced squads and agile:
There are some spikes representing bigger features/projects (ca. 3-6 months) going live but overall the trend is pointing upwards. It looks like we’re on the right track!
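One way to check “pointing upwards” without being misled by individual spikes is to fit a straight line through the weekly counts and look at its slope. The sketch below is an illustration only, not how our chart was produced; trend_slope is a hypothetical helper that computes an ordinary least-squares slope over week indices 0, 1, 2, and so on.

```python
from typing import Sequence


def trend_slope(weekly_counts: Sequence[float]) -> float:
    """Least-squares slope of the weekly counts, in stories per week.

    A positive value means the overall trend points upwards, even if
    individual weeks spike or dip.
    """
    n = len(weekly_counts)
    if n < 2:
        raise ValueError("need at least two weeks to estimate a trend")
    mean_x = (n - 1) / 2  # mean of the week indices 0..n-1
    mean_y = sum(weekly_counts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_counts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Made-up numbers for illustration only, with one spike in week 2.
example_counts = [10, 12, 40, 14, 16, 18]
print(trend_slope(example_counts) > 0)
```

The slope only describes the direction of the trend. As with the counts themselves, the actual value matters less than whether it stays positive over time.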
Over time I’d expect the trend to flatten out. Although our squads will keep improving, I’d expect our squads to be at such a high level that, like with any high performance team, any additional performance gains will be marginal. When that happens we should only see an increase in story number trends when we add more squads. This is a very good thing – it shows that we’ll have succeeded in creating an organisational structure that allows us to scale and take over the world.
Overall, we are using this information as part of a wider collection of metrics to assess our Tech Health. We have a strong agreement to use the information in the right way, including looking for trends and carrying out root cause analysis.