The goal of QA in the context of Operations is to measure the quality of a process. As a manager, you want to guarantee a minimum level of quality within the outcomes owned by your team.
In Marketing, a Manager would track things like conversion rates to understand the quality of their team's output. In Engineering, a Manager would likely track metrics such as error rates or server response times. If these evolve in the right direction (i.e. conversion rates rise, error rates diminish) we know the team is influencing better outcomes.
In Operations, the practice of tracking a team's output quality is known as Quality Assurance (QA). The philosophy of QA is that, ideally, everyone in the team should reach the same result when given the same case in a specific process. So, the key in QA is to measure outcome discrepancies between team members on a given case.
The rate at which these discrepancies happen is what we tend to refer to as the Disagreement – or Error – rate.
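As a rough sketch, a pairwise disagreement rate can be computed by comparing every pair of outcomes recorded for each case. The function and data shapes below are illustrative assumptions, not a prescribed implementation:

```python
from itertools import combinations

def disagreement_rate(outcomes):
    """Fraction of rep pairs that disagree on each case, averaged over cases.

    `outcomes` maps each case id to the list of outcomes produced by the
    reps who independently handled that case.
    """
    rates = []
    for case_outcomes in outcomes.values():
        pairs = list(combinations(case_outcomes, 2))
        if not pairs:
            continue  # only one rep saw this case; no comparison possible
        disagreements = sum(1 for a, b in pairs if a != b)
        rates.append(disagreements / len(pairs))
    return sum(rates) / len(rates) if rates else 0.0

# Hypothetical example: two cases, three reps each
outcomes = {
    "case-1": ["approve", "approve", "reject"],   # 2 of 3 pairs disagree
    "case-2": ["approve", "approve", "approve"],  # full agreement
}
```

Averaging per case (rather than pooling all pairs) keeps heavily-reviewed cases from dominating the metric.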
Sources of discrepancy
Fundamentally, each outcome disagreement can be traced back to one of two categories:
- Process Error: can be mitigated through better process, typically via context, training, guideline, policy or knowledge improvements.
- Human Error: the inevitable mistakes that may still happen in a perfect process; can be mitigated by reducing cognitive load and time pressure, or by adding consensus-based mechanisms into the process.
Each category is influenced by a different set of levers. Thus, gaining a separate understanding of each of these metrics will help inform how we prioritize different measures.
The ideal, but impractical, way of measuring quality would be to have everyone in the team perform the same tasks. Of course, that would defeat the whole purpose of having a team. We need to design a more scalable strategy instead.
Broadly speaking, there are two paths we can take here depending on whether we have subject matter experts (SME) within the team:
- If we do, we can measure our Error rate through our SMEs' expertise.
- If we don't, we will need to rely on a Consensus proxy metric, which will serve as a lower bound on our operational Error rate.
If we have SMEs at our disposal, we can measure our operational Error rate directly. The complementary metric is Accuracy, which represents how often an instance of a process is handled correctly.
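A minimal sketch of how Accuracy could be computed against SME-reviewed cases; the dictionary shapes here are assumptions for illustration:

```python
def accuracy(rep_outcomes, sme_outcomes):
    """Share of SME-reviewed cases where the rep's outcome matches the SME's.

    Both arguments map case ids to outcomes; only cases reviewed by an SME
    count toward the metric.
    """
    reviewed = [case for case in rep_outcomes if case in sme_outcomes]
    if not reviewed:
        return None  # no SME-reviewed overlap: nothing to measure
    correct = sum(1 for case in reviewed
                  if rep_outcomes[case] == sme_outcomes[case])
    return correct / len(reviewed)
```

The Error rate is then simply `1 - accuracy(...)` over the same reviewed cases.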
Here are a few approaches you can implement to measure it, in decreasing order of effectiveness:
1. Ongoing approvals. Consists of SMEs approving every single completed instance of a process, which guarantees SME-grade output quality. This approach is the most expensive of all and requires a low ratio of reps per SME (although this will also depend on the relative time cost of completing a task vs reviewing it).
2. Ongoing random checks. Consists of SMEs evaluating random samples of completed tasks. This approach ensures an efficient and scalable ongoing estimation of accuracy. It is more suitable for teams with a mid ratio of reps per SME, and requires a fair amount of SME bandwidth.
3. Blind Golden set. Involves distributing a series of cases to all reps and SMEs (the Golden set) as part of their normal work. This is done by injecting these cases into the process queue in a randomized way. It needs to be a blind process, so the more randomization the better. This approach scales well, making it best for large rep-to-SME ratios. The downsides are lower process throughput, since a fraction of the tasks resolved by every rep will be redundant, and the risk that the process isn't truly blind, which can make the metric less trustworthy.
4. Revealed Golden set. Involves surveying reps on a set of cases. Because it is effectively a simulation, it will produce biased results. Therefore, it's the least recommendable option, but also the least technically costly one.
Personally, I tend to find approach (2), ongoing random checks, the best option because it makes the most efficient use of resources while providing an accurate measurement we can trust.
Alternatively, you may consider approach (1), ongoing approvals, if you can justify its cost (i.e. in a high-stakes process): it will get you virtually a 0% error rate, assuming no human error and perfect SME knowledge.
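With random checks, it helps to know how precise the sampled estimate actually is. A sketch using the standard normal-approximation confidence interval; the review numbers are made up for illustration:

```python
import math

def accuracy_interval(correct, sampled, z=1.96):
    """Approximate 95% confidence interval for accuracy estimated from a
    random sample of SME-reviewed tasks (normal approximation)."""
    p = correct / sampled
    margin = z * math.sqrt(p * (1 - p) / sampled)
    return max(0.0, p - margin), min(1.0, p + margin)

# e.g. SMEs review 200 randomly sampled tasks and find 188 handled correctly
low, high = accuracy_interval(188, 200)
```

The interval narrows with the square root of the sample size, which is what makes random checks scale: doubling precision requires roughly four times the SME reviews.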
Consensus is the technical term used to define a set of strategies similar to the aforementioned ones, with a key difference: comparisons are done between reps, as opposed to reps and SMEs.
Consensus indicates how often reps are in agreement when independently dealing with identical instances of a process. High consensus is a precondition to high Accuracy, as low consensus implies unreliable outcomes.
The most common ways of running Consensus are:
- Globally. All reps resolve the same set of tasks. Paired with ground truth data (SME review), this is analogous to the Golden set approach, and can help us understand not only which reps agree most often but also which are most accurate.
- Locally. A subset of all reps resolve the same task. At a minimum this involves pairs of reps doing the same work. At enough scale, this should still allow us to construct a global picture by triangulating the pairwise data points. This approach is less wasteful than the Global one, with efficiency improving as redundancy per task decreases.
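The local variant can be sketched as follows: each sampled task is resolved independently by two reps, and per-rep agreement rates are aggregated from the pairwise results. The tuple shape and names are assumptions:

```python
from collections import defaultdict

def per_rep_agreement(pair_results):
    """Agreement rate per rep from locally paired task assignments.

    `pair_results` is a list of (rep_a, rep_b, agreed) tuples, one per
    task that was independently resolved by two reps.
    """
    agreed = defaultdict(int)
    total = defaultdict(int)
    for rep_a, rep_b, ok in pair_results:
        for rep in (rep_a, rep_b):
            total[rep] += 1
            agreed[rep] += int(ok)
    return {rep: agreed[rep] / total[rep] for rep in total}

# Hypothetical pairings across three reps
pairs = [
    ("ana", "ben", True),
    ("ana", "cho", False),
    ("ben", "cho", True),
]
```

Rotating the pairings over time is what lets the pairwise data triangulate into a global picture: every rep eventually gets compared against many others.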
If you only intend to use Consensus for measurement, it will only apply to a fraction of all tasks, for efficiency reasons. However, you may consider leveraging this technique across all tasks as an accuracy-improving tactic. This may be done in combination with variable vote number thresholds, weighted voting or disagreement votes.
Whether you're measuring Accuracy, Consensus or both, you may want to consider how to account for disagreements:
- Exact match. Agreements are considered binary (either 100% or 0%), there is no in-between.
- Similarity match. Each disagreement might penalize accuracy more or less depending on how different the two outcomes in the comparison are. For this approach to work we need to define a similarity function that calculates the distance between two results.
Unless you're dealing with different levels of error cost in the same task, we strongly recommend keeping it simple and basing your disagreements on exact matches.
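If you do need a similarity match, one illustrative choice of similarity function for tag-style outcomes is Jaccard similarity; the task shape here is an assumption:

```python
def tag_similarity(a, b):
    """Jaccard similarity between two sets of tags: 1.0 means identical
    outcomes, 0.0 means no overlap. Used as a graded agreement score
    instead of a binary exact match."""
    if not a and not b:
        return 1.0  # both empty: treat as full agreement
    return len(a & b) / len(a | b)

# Two reps tag the same ticket with partially overlapping labels
score = tag_similarity({"billing", "refund"}, {"billing", "escalation"})
```

A disagreement then penalizes accuracy by `1 - score` rather than a flat 1, so near-misses cost less than outright contradictions.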
Earlier in this guide we mentioned that there are two kinds of errors: process errors and human errors. In order to pinpoint the cause of an error, you will need both accuracy and consensus data on the task that caused it:
- If reps aren't aligned, it is a process error. Your guidelines might be ambiguous. Additional training might be required.
- If reps are mostly aligned, but in disagreement with SMEs, it is a process error. Your guidelines might be misleading.
- If reps are aligned with the SMEs but in disagreement with one mistaken rep, it may be a human error. To confirm, it's best to check with the rep.
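These rules can be sketched as a small triage function, assuming we have the deviating rep's outcome, the other reps' outcomes on the same case, and the SME's outcome (all names hypothetical):

```python
def diagnose(rep_outcome, peer_outcomes, sme_outcome):
    """Rough triage of a single disagreement, following the rules above.

    `peer_outcomes` holds the outcomes from the other reps who handled
    the same case. A 'likely_human_error' result should still be
    confirmed with the rep involved.
    """
    peers_aligned = len(set(peer_outcomes)) == 1
    if not peers_aligned:
        # Reps disagree among themselves: ambiguous guidelines
        return "process_error"
    if peer_outcomes[0] != sme_outcome:
        # Reps agree with each other but not with the SME: misleading guidelines
        return "process_error"
    if rep_outcome != sme_outcome:
        # One rep deviates from an otherwise-aligned group
        return "likely_human_error"
    return "no_error"
```

This is only a heuristic: a single case rarely settles the question, so in practice you would aggregate these labels over many tasks before acting on them.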