Made to Measure, by David Cole

The story is now infamous: between 2006 and 2009, Google tested 41 different shades of blue in order to determine which would generate the most clicks.

It came to light after designer Doug Bowman left his position as Visual Design Lead at Google because of the company’s data-driven culture. He wrote this about his departure:

I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can’t operate in an environment like that. I’ve grown tired of debating such minuscule design decisions… I won’t miss a design philosophy that lives or dies strictly by the sword of data.¹

The post spread quickly through the design community, with many designers expressing their disgust with Google’s approach. Fast Company ran a wonderful, violent headline on the story: “Google’s Marissa Mayer Assaults Designers with Data.”²

In the five years since, the event has become a legend in our field. “41 shades of blue” is now shorthand for flawed decision-making by data, as opposed to relying on the taste and instincts of an experienced designer.

In the same timeframe, the broad notion of design has undergone a remarkable transformation. Apple’s sustained dominance has popularized the concept of a “design-led” business. Design at Apple is not an afterthought, determined by A/B testing. Rather, it’s the starting point. An ad explaining their design process declares, “the first thing we ask is: what do we want people to feel?”³ They rely on intuition and principles of form. They believe money will naturally follow from there, and it has.

It’s not uncommon to encounter these two modes presented as a dichotomy, with headlines like “Should designers trust their instincts—or the data?”⁴ This model, positioning design and data against each other, reflects a misunderstanding of what data is and how it can be used. We in the design community need to reframe our discourse on this subject in order to take advantage of a critical opportunity to advance our practice.

The Data in Intuition

One significant barrier to moving this conversation forward is that intuition and instinct are difficult to define with precision. Intuitive decision-making is actually the product of several different forces. It’s a mix of rules of thumb, anecdotes from previous experiences, information from “soft” sciences, and listening to your gut.

Data, in the popular conception, seems to be defined as something along the lines of “hard numbers measured by a computer.” But the term covers far more than that. Survey results plotted over time are data, as is a single survey result or an in-person interview with a user. Data can even include a conversation with a relative about what you work on, and their difficulty in understanding it. All of these are data points containing information about how people use and understand your design work.

In truth, “data” fragments into many different sub-sections: usage data from server logs, data from split testing, data from aggregated or classified qualitative sources, and so on.

Really, data and intuition aren’t totally distinct: intuition is derived from past experiences and previous observations. It’s simply not useful to talk about these approaches as mutually exclusive, much less antithetical.

Idle Fears

Why, then, is there so much concern surrounding the relationship between intuition and data? The “41 shades” story points us to the answer.

In one read, this story comes off as an attempt to eliminate the designer from the process. If you can test every possible permutation imaginable, down to the finest details, you don’t need a professional to spend time reasoning about the differences.

And in another read, this story flies in the face of what we’re taught as designers. Color choices emerge from defined principles about form and aesthetics. One shade of blue “means” something that another does not. What happens if the best performing blue isn’t in-brand? Surely these decisions can’t be made in isolation.

Both of those reactions reflect valid concerns, but from the outside we don’t know what Google did with the information. Did they pick the best blue and move on with it? Did they take the time to learn why the best blue did so well, so they could re-apply those learnings elsewhere? Or perhaps there was no difference, and they fell back to the designer’s preference. Or maybe there was a huge difference, but they still went with the designer’s choice for aesthetic or brand reasons.

Imagine you were the designer in the group that ran that test, and one shade of blue performed significantly better than another. How could that be possible? Is it an aesthetic property of the blue? Perhaps it matches the other colors in some more satisfying way. Or does it have to do with how different it is from the other colors on the page? Maybe it’s the blue that is most distinct from black on the largest variety of screens. There are so many interesting possibilities to dig into.

Because the thing is, data is just data. It can’t be wrong, even if it can be misunderstood. If one blue outperformed another, and the test was constructed correctly, then it must be happening for a reason. The question of why opens up a vast learning opportunity. If there’s knowledge to be gained at that level of granularity, imagine how many different areas of inquiry there are in a single piece of design work.
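To make “constructed correctly” a little more concrete, here is a minimal sketch of one common check on a split test: a two-proportion z-test on click rates. This is not a description of what Google did. The function name and the click counts are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click rates.

    Assumes each view is an independent trial and that both samples are
    large enough for the normal approximation to be reasonable.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the hypothesis that the variants perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        raise ValueError("click rate is 0% or 100% in both variants; nothing to test")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for two shades of blue (not real data).
z, p = two_proportion_z_test(clicks_a=1000, views_a=20000,
                             clicks_b=1100, views_b=20000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value says only that the gap is unlikely to be noise. It says nothing about why one blue did better, which is the more interesting question. With 41 variants rather than two, the odds of a spurious winner also rise, so a stricter threshold (a Bonferroni correction, for instance) would be a sensible assumption to layer on top.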

Real Risks

By assuming the worst whenever we tell this story, we’re signaling that testing design changes can only produce negative outcomes. It’s so much more complex than that. Here’s what we should actually be concerned about when we hear a story like this:

Most websites, especially if they’re products, are complicated, multi-dimensional beasts. When discussing the improvement of a metric, we should ensure we’re not hurting ourselves elsewhere. It’s often true that behaviors are linked in unexpected ways. Clicks may be up, but long-term retention might start going down in a few months.

We should also acknowledge that numbers are just numbers. They can tell us only so much. A healthy testing culture complements quantitative testing with a qualitative component. Maybe the best blue got the most clicks, but only because it was extremely bright and distracting. In-person testing might reveal that people click a tiny bit more, but with so much stress and frustration that the experience is actually worse overall.

While I believe many more things are measurable than designers tend to think, some aspects of our work can’t be tied to a number. Brand loyalty, for example, is too long-term to tie to a single change. Major new features can take many months to show their impact, and it rarely makes sense to withhold a feature from your customers for that long. Some effects are too external to the system, like how a change might affect recruiting—a dynamic that is heavily influenced by the market and competitive forces. Some types of monitoring—say, surveying audiences that are hard to identify or access—may simply be too expensive or time-consuming to bother with. Despite all of this, deciding something is not measurable should be a conclusion subject to a lot of scrutiny. It’s much healthier to start with the assumption that testing is possible and data is valuable.

The big fear here is that companies focus only on what they can measure, at the expense of everything non-measurable. But we should not allow this fear to keep us from measuring as much as we can and applying the results.

The smarter approach is to build internal consensus that some important things simply can’t be tied to a metric, and go from there. As an example, consistency would be very difficult to measure. If it’s even possible, it would be quite complicated and expensive. Yet, few would argue against the idea that consistency matters. Evaluation is still possible through defined and agreed-upon principles that the whole organization respects.

These principles don’t need to be arbitrary: the difference between a three-pixel and a five-pixel line is significant. It’s also explicable. Our understanding of why certain visual design choices work better than others may be incomplete, but I can think of many good reasons to prefer one thickness over another. Varying line thickness can convey hierarchical relationships, or divide space with different amounts of strength, or establish a harmony with the stroke of some type, and so on. We can’t just attribute our design decisions to “good taste”—there are real reasons when we look for them.

Perhaps the most problematic aspect of the “41 shades” story goes unmentioned. It happened before the 41 shades:

A designer, Jamie Divine, had picked out a blue that everyone on his team liked. But a product manager tested a different color with users and found they were more likely to click on the toolbar if it was painted a greener shade.

As trivial as color choices might seem, clicks are a key part of Google’s revenue stream, and anything that enhances clicks means more money. Mr Divine’s team resisted the greener hue, so Ms Mayer split the difference by choosing a shade halfway between those of the two camps.⁵

Splitting the difference between two colors and just shipping is much riskier than running a test. Without a point of comparison, the success or failure of using that shade will teach us nothing. Maybe it fails. Then who was wrong? The designer, because it was slightly bluer? Or the PM, because it was slightly greener? There’s no way of knowing, and both parties lose. In the words of David Deutsch, if a solution is “no one’s idea of what will work, then why should it work?”⁶

Of course, it’s often the case that learning isn’t the primary goal of a test. Simply reaping the results from a high-performing variant can be the only concern. And it’s true that digging deeper into the dynamics of some change might not be worth the time or effort. That’s fine, but that choice should be recognized as both a missed opportunity and a real risk. When you learn nothing about why something succeeds, you don’t get to re-apply those learnings in future work. And more importantly, the more your product succeeds without your understanding why, the more likely it is that something will break without your knowledge. Imagine climbing into a rocket where the engineers aren’t exactly sure why it takes off, but hey, it sure is fast. Enjoy your flight!

Designing for Reality

If this just sounds like risk mitigation, it’s because I’m focusing on the fears I’m seeing in the design community. What gets much less attention is all of the untapped opportunity. Utilizing data as a designer isn’t just about building variants and running split tests.

Learning how to access and study existing usage data opens you up to an expanse of knowledge. There’s a lot more to this than checking the dashboard of Google Analytics. Robust logging organized meaningfully can be used to dive into every nook and cranny of your work. Monitoring key flows and tracking cohort-specific usage can surface issues you wouldn’t notice in simpler metrics, like time on site or monthly visits.
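As a rough illustration of cohort-specific tracking, the sketch below groups users by the week they signed up and reports what share of each cohort was active in each later week. The event format, a `(user_id, signup_day, activity_day)` tuple, is an assumption made for this example rather than any particular product’s log schema. It also assumes every user appears in at least one event, such as the signup itself.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def cohort_retention(events):
    """Map each signup week to {weeks_since_signup: share of cohort active}.

    `events` is an iterable of (user_id, signup_day, activity_day) tuples,
    a hypothetical log format used only for this sketch.
    """
    cohort_users = defaultdict(set)
    active = defaultdict(lambda: defaultdict(set))
    for user_id, signup_day, activity_day in events:
        weeks_since = (activity_day - signup_day).days // 7
        if weeks_since < 0:
            continue  # ignore activity logged before signup
        cohort = week_start(signup_day)
        cohort_users[cohort].add(user_id)
        active[cohort][weeks_since].add(user_id)
    return {
        cohort: {
            week: len(users) / len(cohort_users[cohort])
            for week, users in sorted(weeks.items())
        }
        for cohort, weeks in active.items()
    }


# Hypothetical events: two users who signed up in the same week.
events = [
    ("a", date(2014, 1, 6), date(2014, 1, 6)),
    ("a", date(2014, 1, 6), date(2014, 1, 14)),
    ("b", date(2014, 1, 7), date(2014, 1, 7)),
    ("b", date(2014, 1, 7), date(2014, 1, 9)),
]
for cohort, weeks in cohort_retention(events).items():
    print(cohort, weeks)
```

A table like this can show a cohort quietly dropping off after a design change even while aggregate numbers such as monthly visits hold steady.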

One of my favorite applications of data is during the design process itself. When you can quickly sample how people are really using your product, you can completely eliminate dummy behavior from your workflow, replacing Lorem Ipsum with the language people really use. Our tooling at Quora is sophisticated enough that static prototypes rarely exist for long: our first pass on the code often has production data immediately running through it. This means you’re designing for reality, which is often quite distinct from the ideal.
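One simple way to approximate this without sophisticated tooling is to pull a small sample of real strings into a prototype and make sure the sample includes the awkward extremes. The sketch below is an assumption-laden illustration, not a description of Quora’s tooling. The function name and the sample questions are hypothetical.

```python
import random


def sample_for_prototype(real_strings, count=5, seed=0):
    """Pick real strings for a mockup: the shortest, the longest, and random others.

    Including the extremes forces the layout to cope with the cases that
    placeholder text tends to hide.
    """
    if count < 2:
        raise ValueError("count must be at least 2 to include both extremes")
    cleaned = {s.strip() for s in real_strings if s.strip()}
    # Sort by length, then alphabetically, so the result is deterministic.
    unique = sorted(cleaned, key=lambda s: (len(s), s))
    if len(unique) <= count:
        return unique
    rng = random.Random(seed)
    picked = rng.sample(unique[1:-1], count - 2)
    return sorted([unique[0], unique[-1]] + picked, key=lambda s: (len(s), s))


# Hypothetical question titles standing in for sampled production data.
questions = [
    "Why?",
    "What is design?",
    "How do I learn typography?",
    "Is blue really the most clickable color?",
    "What are good rules of thumb for border widths?",
    "How should a small team decide which design changes are worth testing first?",
    "Which metrics matter?",
]
for title in sample_for_prototype(questions, count=4):
    print(title)
```

The fixed seed keeps the sample stable between runs, which makes it easier to compare two iterations of the same layout.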

Really, that’s what all of this is about: designing for reality. The way we design today is unlike anything done in the past. Our work is beginning to reach billions of people simultaneously. We’re building products that need to facilitate relationships across the globe. The scale, scope, and complexity of our work demands a nuanced understanding of our systems and the people within them.

Returning to Apple and Google: let’s remember that these companies may be direct competitors, but their strengths rarely overlap. Apple makes a point not to keep user data, and it’s held them back in cases like Maps, Siri, or Ping. When large amounts of information or complex social behaviors are at play, data-centric companies like Facebook and Google have repeatedly bested them. The iPhone, for all its wonders, is primarily a single-user device that succeeds when it is beautiful and delightful enough to warrant an expensive purchase. Those dynamics do not apply to every product in every market. We should be equipped to design ourselves out of any problem we face, and increasingly that requires an ability to handle tremendous complexity.

Designers too often see data as a threat, when in fact it’s an opportunity. Our collective fears are unfounded, based on a misconception of what’s possible. Embracing data affords us deeper understanding, faster learning, and more nuanced reasoning.

1. Douglas Bowman, “Goodbye, Google,” Creative Outlet of Douglas Bowman (blog), March 20, 2009.
2. Alissa Walker, “Google’s Marissa Mayer Assaults Designers with Data,” Fast Company, October 13, 2009.
3. “Apple - Designed by Apple - Intention,” YouTube video, posted by Apple, June 10, 2013.
4. Braden Kowitz, “Should Designers Trust Their Instincts—Or the Data?” Google Ventures (blog), 2013.
5. Laura M. Holson, “Putting a Bolder Face on Google,” New York Times, February 28, 2009.
6. David Deutsch, The Beginning of Infinity: Explanations That Transform the World (Penguin, 2011).