DILEMMA WORKS

Erik on product management and such

Product development by random walk (2025)


“Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just, does a treatment affect people, but how much does it affect them.”

- Gene V. Glass

One of the most counter-intuitive things about building digital products is that A/B testing for statistical significance between iterations can keep you from making the changes that really matter and lock you into a path of mindlessly iterating on minimal changes. And the problem gets worse as your sample size increases.
    The reason is that effect size, the practical significance of a change, needs to be considered as well. The larger your sample size gets, the smaller the minimum difference that can show up as statistically significant versus your baseline experience. But those differences may be too small to matter to users or to the business.
    If you don’t account for effect size and discount experiments where the statistically significant difference is simply too small, you’ll get stuck running hundreds of experiments per year, some of them “successful”, while the user experience gets no better at all. This is what I call product development by random walk.
    What you need to do to escape this trap is to require a minimum effect size for your experiments. If success requires a 20% improvement over baseline instead of 1.1%, you’ll find that it forces you to think bigger. And to think bigger you’ll need to be better informed. To be better informed you need to engage in research: of competitors, of users and of your industry. This will make the job more fun and you’ll become more effective.
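To make that concrete, here is a minimal sketch (a standard two-proportion power calculation with an illustrative 7% baseline; the function name and numbers are mine, not from any experimentation platform) of how the required sample size explodes as the minimum effect you are willing to chase shrinks:

```python
from scipy.stats import norm

def required_sample_per_arm(baseline, relative_lift, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided two-proportion test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2))

for lift in (0.011, 0.05, 0.20):  # 1.1%, 5% and 20% relative improvements
    print(f"{lift:.1%} lift on a 7% baseline -> "
          f"~{required_sample_per_arm(0.07, lift):,} users per arm")
```

Under these assumptions, chasing a 1.1% relative lift takes on the order of 1.7 million users per variant; a 20% lift needs only a few thousand. A threshold that small is reachable only because the sample is enormous, not because the change matters.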

To add context for this essay: I have spent a decade in China, working on launching products and brands in global markets. Product development by random walk is a common failure mode for product teams, caused by a fixation on quarterly metrics improvements at the expense of literally everything else that contributes to a consistent and coherent user experience. I will illustrate with this quote from a 15-year veteran of Alibaba, looking back at some of the company’s failures:

“I happened to experience both phases of Koubei. To be honest, based on my personal experience at the time, I knew it was bound to fail. Because our methods were extremely simple and crude: throwing money at operations and boosting data metrics.

As for the fundamental work like onboarding merchants, providing services, and building products—this was considered hard, dirty work. It had long cycles, required large investments, and produced slow results. Its impact on KPIs was far less direct and rapid than short-term spending on operations.

Consequently, either no one was willing to do this fundamental work, or the people who did it couldn’t get good performance reviews.

Coincidentally, in 2016, while at Alipay, I was transferred to support the offline payment battle. After six months, I felt it was a lost cause. It wasn’t because the market was too competitive or WeChat was too strong; it was because our evaluation mechanisms and culture were flawed.

We had lost the fortitude for long-termism. We preferred short-term stimulants. Our team’s collaborative combat effectiveness had also become inefficient.

...

How do you get results with a small investment in a short time? The answer is operations. Run an operational campaign, and the metrics soar. Use targeted data boosting, and the ROI is optimal. What if it stops growing? Define a new metric.

Funneling traffic between internal products, overlapping user counts… I’ve seen all the tricks. The value brought by long-term product construction pales in comparison and is therefore long-term ignored.”

-Source

A/B-testing as the main mode of analysis

When Doug Bowman, design director and Google's first visual designer, left the company in 2009, he wrote on his blog:
    "Yes, it’s true that a team at Google couldn’t decide between two blues, so they’re testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can’t operate in an environment like that. I’ve grown tired of debating such minuscule design decisions. There are more exciting design problems in this world to tackle."
    Well, what's wrong with being thorough and data-driven? He answered that too:
    "When a company is filled with engineers, it turns to engineering to solve problems. Reduce each decision to a simple logic problem. Remove all subjectivity and just look at the data. Data in your favor? Ok, launch it. Data shows negative effects? Back to the drawing board. And that data eventually becomes a crutch for every decision, paralyzing the company and preventing it from making any daring design decisions."
    As a single instance there isn’t much of an issue. But at an aggregate, organizational level, when product decisions are not informed by expertise and vision, the development process becomes a mindless random walk.



Working in product, the ratio of work to be done and decisions to be made to the time available for proper research and analysis is usually extremely lopsided. A/B testing is therefore a great help: it’s faster, it gives you an answer, and it requires less of your time (though more of developers’ and designers’). Build it, launch it, see what happens. Let the data speak for itself.
    This works in select scenarios. If you have a well-researched feature or product and you’re looking to patch edge cases without affecting the existing experience negatively, using data as validation is enough.
    There are other times when it isn’t. Any time you have a goal (making conversion rates go up or bounce rates go down) but no clear way of linking it to user needs, falling back to “what if we move the button around and see what happens” is not enough. Even if it works and your metric increases, the problem is that, as a principle, foregoing research and understanding of your problem takes you down a path where your product development process deteriorates into guesswork.
    A team that continuously skips user interviews, doesn’t spend time keeping itself informed about market trends and competitors, creates hypotheses inside the vacuum of its own office, then builds and tests them, and does only that week after week, will soon have lost the ability to think.
    They will also, unknowingly and unconsciously, have made an assumption about their product. 
    
The quiet assumption


A/B testing excels at answering narrow questions: Which headline converts better? Does a $9.99 price tag outsell $10?
    This approach feels deceptively safe. After all, who can argue with “letting the data speak”? But data without context is a compass without a map. Teams end up optimizing for local maxima (small, isolated wins) while staying blind to the global maximum of transformative innovation. Data tells you what is happening; research tells you why. Without the “why,” you’re left reacting to noise.
    Not every project or requirement needs to reconsider the current experience from the ground up. But if you never do, your quiet assumption is that your overall user experience is the best it could ever be, with only small room for tweaks to squeeze out a couple more percentage points of clicks and conversions.
    In the real world of limited hours and energy, teams can fall back on random iteration of new designs. It relieves you of the hard effort of doing research, of staying informed and of having a point of view on how products should adapt to users and their behavior. Work becomes mindless.
    But not only that: it becomes actively anti-progress. Past non-significant tests can be used as a club to beat down new initiatives from other teams, and by keeping every change small, it ties every team to the local maximum, walking in circles year after year.
Even a decade ago, the problems with an unbalanced product development process that relies solely on A/B testing at places like Google were being voiced, as Bowman’s parting note shows.


  
Outside of academia we may believe that using A/B tests to determine whether a change to the user experience has a real effect on user behavior and product performance is scientific and fault-free. The view inside the scientific community is not so rosy:

“... the widespread belief that scientific progress arises from the application of formal methods of statistical inference to random samples of data, something Guttman (1985, p. 3) characterizes as ‘illogical,’ and Gigerenzer (2004, p. 587) as ‘mindless.’” - The Limited Role of Formal Statistical Inference in Scientific Inference, in Statistical Inference in the 21st Century: A World Beyond p < 0.05, The American Statistician, Volume 73, 2019

In fact, over-reliance on statistical significance has been a major topic in the scientific community for more than five decades.


Doom loops and stalled feature development





 
Statistical significance (p-value) and effect significance (d-value)

“Statistical significance is the probability that the observed difference between two groups is due to chance. If the P value is larger than the alpha level chosen (eg, .05), any observed difference is assumed to be explained by sampling variability. With a sufficiently large sample, a statistical test will almost always demonstrate a significant difference, unless there is no effect whatsoever, that is, when the effect size is exactly zero; yet very small differences, even if significant, are often meaningless. Thus, reporting only the significant P value for an analysis is not adequate for readers to fully understand the results.

For example, if a sample size is 10 000, a significant P value is likely to be found even when the difference in outcomes between groups is negligible and may not justify an expensive or time-consuming intervention over another. The level of significance by itself does not predict effect size. Unlike significance tests, effect size is independent of sample size. Statistical significance, on the other hand, depends upon both sample size and effect size. For this reason, P values are considered to be confounded because of their dependence on sample size. Sometimes a statistically significant result means only that a huge sample size was used.”

- Journal of Graduate Medical Education

Larger sample sizes increase a test’s ability to detect small differences, often resulting in statistically significant outcomes even for trivial effect sizes.

Example: Conversion rate 10.1% vs. 10.0% (a difference of 0.1 percentage points)
Sample size = 1,000 per variant → p ≈ 0.94 (not significant; pooled two-proportion z-test)
Sample size = 1,000,000 per variant → p ≈ 0.02 (significant at the 0.05 level)

The absolute difference remains 0.1 percentage points, but the conclusion about significance flips due to sample size alone.
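The arithmetic behind that example is a standard pooled two-proportion z-test; here is a minimal sketch (the numbers are purely illustrative):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(p_a, p_b, n_per_arm):
    """Two-sided p-value for a pooled two-proportion z-test with equal arm sizes."""
    pooled = (p_a + p_b) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = abs(p_a - p_b) / se
    return 2 * (1 - norm.cdf(z))

for n in (1_000, 1_000_000):
    print(f"n = {n:>9,} per arm -> p = {two_proportion_p_value(0.101, 0.100, n):.3f}")
# The 0.1-point difference is identical in both runs; only the sample size
# changes the verdict from "not significant" to "significant".
```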





Cohen’s d can be used to determine whether an effect size is actually relevant to the business. If effect size thresholds are used properly, they disincentivize making a myriad of tiny, insignificant, random changes to a product. They break you free of the random walk and force you to think, and to think bigger.
     Imagine an A/B test with a statistically significant improvement of 1.1% over a baseline 7% conversion rate, or roughly 0.08 percentage points. Is this meaningful, and moreover, is it really reproducible? If you ran the same test three months later, would it still turn out statistically significant?
    That is why effect size and practical significance need to be considered alongside statistical significance, and the larger the sample size, the larger the threshold for a relevant effect size should be. A relative change of 1.1% from a baseline of 7% is probably meaningless, but a 20% change is real and meaningful.
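For conversion rates specifically, Cohen’s h (the analogue of d for comparing two proportions) gives a measure of magnitude that does not depend on sample size. A minimal sketch, reusing the illustrative 7% baseline from above:

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Cohen's h: effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

baseline = 0.07
for relative_lift in (0.011, 0.20):
    h = cohens_h(baseline, baseline * (1 + relative_lift))
    print(f"{relative_lift:.1%} relative lift -> h = {h:.3f}")
# A 1.1% relative lift gives h of roughly 0.003, indistinguishable from no effect;
# a 20% lift gives roughly 0.053, more than an order of magnitude larger. Where
# the "meaningful" threshold sits is a business decision, not a statistical one.
```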

“This has become such a problem in many fields of research that more than 800 researchers recently published a paper condemning the usage of p values since many rely too heavily on it and not enough on practical measures of significance that demonstrate realistic effects.[3] Completely abandoning measures of statistical significance like p values is probably a little too extreme, but their point is well taken. We should be publishing measures of both statistical and practical significance or meaningfulness.” - Practical Significance and Effect Sizes, Chris Bailey





The belief is that after a successful rejection of H0, it is highly probable that replications will also result in H0 rejection. This is often not true; for example, given typical power levels in behavioral science (around .50 for medium effect sizes), the chance of three replications all being significant is only one in eight, and the odds of five replications yielding as many as three significant results are only 50:50.
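Those odds follow directly from the power level. A quick sketch, assuming each replication is an independent test with power 0.5:

```python
from math import comb

power = 0.5  # typical power for medium effects in behavioral science

# Probability that all three replications reach significance
print(f"P(3 of 3 significant)  = {power ** 3:.3f}")  # 0.125, i.e. one in eight

# Probability that at least three out of five replications reach significance
p_three_of_five = sum(comb(5, k) * power**k * (1 - power)**(5 - k) for k in range(3, 6))
print(f"P(>=3 of 5 significant) = {p_three_of_five:.3f}")  # 0.500, a coin flip
```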

Timescales and speed

Speed of iteration is generally a good thing. But we need to consider whether it’s pointed in an informed direction. If we skip research and planning, we gain speed of iteration, but it is unclear where we’ll end up. Maybe where we started. When your strategy relies on randomness, you’re betting that blind luck will outperform intentionality.
    Consider the cost of testing 50 iterations of a card across a year, where one version eventually “wins” with a 0.5% conversion boost. But what if users are abandoning your app because the core workflow is broken?
    As I am now leading a cross-functional project redesigning a core component of the user flow, one that will take roughly three months to launch, one of the questions I’ve received is “what about speed? Normally we would be able to launch a change in two weeks.”
    Well, if we could do this in two weeks we already would have, and our current problems wouldn’t exist. We are not doing just one change: our scope, informed by qualitative and quantitative insights and shaped with the help of people from 10 markets, comprises twenty-some changes, which would actually take more time to launch if done one by one in the typical siloed fashion.

User research costs time and effort, but it pays you back tenfold, because it is the only way to keep your team from going down the path of iterating on nonsense. It is a way to pre-empt the deterioration of organizational intelligence just as much as it is a way of building better products.

Blending qualitative and quantitative

The antidote to random walk development is insight-driven experimentation. Start with qualitative research to define the problem, then use A/B testing to validate the solution.
    Take Slack. Before building their now-iconic platform, the team spent months observing how teams communicated (and failed to) in workplaces. They didn’t A/B test their way to dominance—they solved a visceral, researched pain point. Similarly, Figma’s shift from a prototyping tool to a collaborative design ecosystem was fueled by deep user interviews about the friction in designer-developer workflows. 
    These companies didn’t abandon A/B testing. They just stopped using it as a crutch.



The organizational aspect
The most important learning from this is not for individual product managers but for those building product organizations. A team that does only quick iterations and A/B testing will be directionless and stuck in place. 


The arts and crafts approach

Airbnb and Apple release biannual and annual updates to their products, and it goes without saying that these are not driven by experiment metrics but informed by experience and craft.







Tukey (1991) wrote that “It is foolish to ask ‘Are the effects of A and B different?’ They are always different—for some decimal place.”
- The Earth is Round (p < 0.05), Jacob Cohen, 1994

With a sufficiently large user base or enough experiment participants, you will almost inevitably find a “statistically significant” difference, even if that difference is practically meaningless. This leads to a tautological logic where you merely confirm that you collected data on hundreds or thousands of subjects, which you already knew, and that A = A, B = B, and A != B.


The Tyranny of the .05 Cutoff: The reliance on the arbitrary .05 significance level forces a rigid, dichotomous "reject/accept" decision. This "mechanical, objective, and content-independent yes-no decision process" is not how effective product development or science operates.