What you should know about QA Measurement in Operations

August 10, 2020

Bernat Fages

The goal of QA in the context of Operations is to measure the quality of a process. As a manager, you want to guarantee a minimum level of quality within the outcomes owned by your team.

In Marketing, a Manager would track things like conversion rates to understand the quality of their team's output. In Engineering, a Manager would likely track metrics around error rates and server response times. If these evolve in the right direction (i.e. conversion rates rise, error rates diminish), we know the team is influencing better outcomes.

In Operations, the practice of tracking a team's output quality is known as Quality Assurance (QA). The philosophy of QA is that, ideally, everyone on the team should arrive at the same result when given the same case in a specific process. So the key in QA is to measure outcome discrepancies between team members on a given case.

The rate at which these discrepancies happen is what we tend to refer to as the Disagreement – or Error – rate.
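
To make that concrete, here's a minimal sketch of how the rate can be computed once you have paired decisions for the same cases (the data and field names below are made up for illustration):

```python
# Minimal sketch: computing a disagreement (error) rate from paired decisions.
# Each review pairs two decisions made independently on the same case.
reviews = [
    {"case_id": "A-101", "decision_a": "approve",  "decision_b": "approve"},
    {"case_id": "A-102", "decision_a": "reject",   "decision_b": "approve"},
    {"case_id": "A-103", "decision_a": "approve",  "decision_b": "approve"},
    {"case_id": "A-104", "decision_a": "escalate", "decision_b": "reject"},
]

disagreements = sum(r["decision_a"] != r["decision_b"] for r in reviews)
disagreement_rate = disagreements / len(reviews)

print(f"Disagreement rate: {disagreement_rate:.0%}")  # 50% in this toy sample
```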

Sources of discrepancy

Fundamentally, each outcome disagreement can be traced back to one of two categories:

  1. Process Error: can be mitigated through better process design, typically via context, training, guideline, policy or knowledge improvements.
  2. Human Error: those inevitable mistakes that may still happen in a perfect process; can be mitigated by reducing cognitive load or time pressure, or by adding consensus-based mechanisms into the process.

Each category is influenced by a different set of levers. Thus, understanding each of them separately will help inform how we prioritize mitigation measures.

Measuring discrepancies

The ideal, but impractical, way of measuring quality would be to have everyone in the team perform the same tasks. Of course, that would defeat the whole purpose of having a team. We need to design a more scalable strategy instead.

Broadly speaking, there are two paths we can take here, depending on whether we have subject matter experts (SMEs) within the team:

Measuring Accuracy

If we have SMEs at our disposal, we can measure our operational Error rate. Its complement is Accuracy (Accuracy = 1 - Error rate), which represents how often an instance of a process is handled correctly.

Here are a few approaches you can implement to measure it, in decreasing order of effectiveness:

  1. Ongoing approvals. Consists of SMEs approving every single completed instance of a process, which guarantees SME-grade output quality. This approach is the most expensive of all, and requires a low ratio of reps per SME — although this will also depend on the time cost ratio of completing the task vs reviewing it.
  2. Ongoing random checks. Consists of SMEs evaluating random samples of completed tasks. This approach ensures an efficient and scalable ongoing estimation of accuracy. More suitable for teams with a moderate ratio of reps per SME. Requires a fair amount of SME bandwidth.
  3. Blind Golden set. Involves distributing a series of cases (the Golden set) to all reps and SMEs as part of their normal work. This is done by injecting these cases into the process queue in a randomized way. It needs to be a blind process, so the more randomization the better. This approach scales well, making it best for large rep-to-SME ratios. The downsides are lower process throughput, since a fraction of the tasks each rep resolves will be redundant, and the risk that blindness is compromised, which would make the metric less trustworthy (see the sketch after this list for how golden-set responses can be scored).
  4. Revealed Golden set. Involves surveying reps on a set of cases. Because it is effectively a simulation, it will produce biased results. Therefore, it's the least recommendable option, but also the least technically costly one.
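
As a rough illustration of approach (3), here's a minimal sketch of how golden-set responses could be scored against the SME answer key; the queue-injection mechanics are left out and all names and numbers are hypothetical:

```python
# Minimal sketch: scoring a blind Golden set against SME answers.
# golden_answers maps each injected case to the SME-agreed outcome.
golden_answers = {"G-1": "approve", "G-2": "reject", "G-3": "escalate"}

# rep_responses holds what each rep answered when the golden cases
# surfaced (blindly) in their normal queue.
rep_responses = {
    "rep_ana":  {"G-1": "approve", "G-2": "reject",  "G-3": "reject"},
    "rep_biel": {"G-1": "approve", "G-2": "approve", "G-3": "escalate"},
}

for rep, answers in rep_responses.items():
    correct = sum(answers[case] == truth for case, truth in golden_answers.items())
    accuracy = correct / len(golden_answers)
    print(f"{rep}: accuracy {accuracy:.0%}")
```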

Personally, I tend to find approach (2) optimal because it makes the most efficient use of resources while providing a measurement we can trust.
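
If you go with approach (2), the practical question is how large the random sample needs to be for the estimate to be trustworthy. A minimal sketch using a normal-approximation confidence interval, with hypothetical numbers:

```python
import math

# Minimal sketch: estimating accuracy from SME random checks.
# Hypothetical numbers: SMEs reviewed 200 randomly sampled tasks
# and found 188 handled correctly.
sampled, correct = 200, 188

accuracy = correct / sampled
# 95% confidence interval via the normal approximation (z ≈ 1.96).
margin = 1.96 * math.sqrt(accuracy * (1 - accuracy) / sampled)

print(f"Estimated accuracy: {accuracy:.1%} ± {margin:.1%}")
```

A wider margin than you're comfortable with means sampling more tasks or pooling reviews over a longer period.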

Alternatively, you may consider approach (1) if you can justify its cost (i.e. in a high-stakes process), because it will virtually provide you with a 0% error rate, assuming no human error and perfect SME knowledge.

Measuring Consensus

Consensus is the technical term for a set of strategies similar to the ones above, with a key difference: comparisons are made between reps, rather than between reps and SMEs.

Consensus indicates how often reps are in agreement when independently dealing with identical instances of a process. High consensus is a precondition for high Accuracy, as low consensus implies unreliable outcomes; note that it does not guarantee accuracy, since reps can all agree on the wrong answer.
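
As an illustration, a simple way to quantify this is the pairwise agreement rate across cases that were independently handled by more than one rep; the sketch below uses made-up data:

```python
from itertools import combinations

# Minimal sketch: pairwise agreement (consensus) on duplicated cases.
# Each case was independently handled by several reps.
decisions_by_case = {
    "C-1": ["approve", "approve", "approve"],
    "C-2": ["reject", "approve", "reject"],
    "C-3": ["escalate", "escalate"],
}

agreeing_pairs = total_pairs = 0
for decisions in decisions_by_case.values():
    for a, b in combinations(decisions, 2):
        total_pairs += 1
        agreeing_pairs += a == b

consensus = agreeing_pairs / total_pairs
print(f"Pairwise consensus: {consensus:.0%}")
```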

The most common ways of running Consensus are:

If you only intend to use Consensus for measurement, it should only be applied to a fraction of all tasks for efficiency reasons. However, you may consider leveraging the technique across all tasks as an accuracy-improving tactic. This can be done in combination with variable vote-count thresholds, weighted voting or disagreement votes.
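
As a rough sketch of what a vote-count threshold could look like in practice (the threshold and escalation behaviour below are hypothetical, not a prescription):

```python
from collections import Counter

# Minimal sketch: consensus as an accuracy-improving tactic.
# A case is auto-resolved once one outcome reaches the vote threshold;
# otherwise it is escalated (e.g. to an SME or a tie-breaking vote).
VOTE_THRESHOLD = 2  # hypothetical: 2 matching votes resolve a case

def resolve(votes):
    outcome, count = Counter(votes).most_common(1)[0]
    return outcome if count >= VOTE_THRESHOLD else "escalate_for_review"

print(resolve(["approve", "approve"]))           # approve
print(resolve(["approve", "reject", "reject"]))  # reject
print(resolve(["approve", "reject"]))            # escalate_for_review
```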

Comparing decisions

Whether you're measuring Accuracy, Consensus or both, you may want to consider how to account for disagreements:

Unless you're dealing with different levels of error cost within the same task, we strongly recommend keeping it simple and basing your disagreements on exact matches.
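
For illustration, here's a minimal sketch contrasting an exact-match comparison with per-field partial credit for a multi-field decision; the fields and values are hypothetical:

```python
# Minimal sketch: exact-match vs. per-field comparison of two decisions.
# Partial credit only makes sense when fields carry different error costs.
decision_a = {"category": "fraud", "action": "suspend", "severity": "high"}
decision_b = {"category": "fraud", "action": "suspend", "severity": "medium"}

exact_match = decision_a == decision_b  # False: counts as one full disagreement

# Alternative: partial credit per field.
fields = decision_a.keys()
field_agreement = sum(decision_a[f] == decision_b[f] for f in fields) / len(fields)

print(exact_match, f"{field_agreement:.0%}")  # False 67%
```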

Troubleshooting disagreements

Earlier in this guide we mentioned that there are two kinds of errors: process-based and human-based. In order to pinpoint the cause of an error, you will need accuracy and consensus data for the task that caused it.
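
The exact breakdown will depend on your process, but one plausible heuristic for combining the two signals looks like the sketch below; the thresholds are illustrative assumptions rather than recommendations:

```python
# Hypothetical heuristic for triaging a disagreement, given accuracy and
# consensus measured on the task type that produced it.
def likely_error_source(accuracy: float, consensus: float) -> str:
    if accuracy < 0.9 and consensus >= 0.9:
        # Reps agree with each other but not with SMEs: the guideline or
        # policy itself is probably leading everyone astray.
        return "process error (fix guidelines/policy)"
    if consensus < 0.9:
        # Reps disagree among themselves: ambiguous instructions or
        # individual slips under load.
        return "process ambiguity or human error (clarify process, reduce load)"
    return "isolated human error (consider consensus mechanisms)"

print(likely_error_source(accuracy=0.75, consensus=0.95))
```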