
Hypothesis Testing

When making data-driven decisions, the tools provided by probability and statistics play an important role.

Say that we set up an experiment as follows: about half of the users of our platform will be exposed to our newly developed feature, and the other half will be kept in our current, unmodified experience.

We start our reasoning from the standpoint of "we don't know whether our feature is an improvement." If we already knew that for sure, it would make sense to ship the feature to all users immediately, but we start from a more humble position.

It is also important that no selection bias is introduced by the user partitioning process. This means that users should be, as much as possible, independently and randomly assigned to each group. The reason is that if there is any distinction at all between the groups -- besides the feature we are testing for -- that distinction becomes a competing hypothesis for the cause of the observed results.
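As an illustration, one common way to get such an unbiased 50/50 split is to hash a stable user identifier together with an experiment-specific salt. The sketch below is in Python; the function name and salt are hypothetical, and only the idea matters.

    import hashlib

    def assign_group(user_id: str, salt: str = "new-feature-experiment") -> str:
        """Deterministically assign a user to 'treatment' or 'control'.

        Hashing the user id with an experiment-specific salt gives a
        stable, effectively random 50/50 split that does not depend on
        any user attribute, which helps avoid selection bias.
        """
        digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
        return "treatment" if int(digest, 16) % 2 == 0 else "control"

    # Each user always lands in the same group for this experiment.
    for uid in ["user-41", "user-42", "user-43"]:
        print(uid, assign_group(uid))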

I hope the concepts so far have made sense. Let's introduce some nomenclature for them. The group of users receiving our intervention during the experiment is called the treatment group. The group kept in the unmodified condition is the control group.

The initial assumption we start with is that there is no statistical difference between treatment and control, meaning we start our reasoning from the position that our intervention has a null effect. This is called the null hypothesis. It is the most conservative claim, and we assume it to be true for the sake of argument.

As a scientist who would like to present new results, our goal becomes disproving the null hypothesis. That is, providing substantial enough evidence that a reasonable third party following the same thread of reasoning would agree that what you claim -- that the intervention you designed has a significant effect -- is true.

Suppose that our goal is to increase the time users spend on our site, and assume that we can measure the time each user spends on the site each day. We then average this measurement over each group for the length of the experiment. We arrive at the result that, on average, users in the treatment group spend X% more time per day than users in the control group. Can we wrap up and consider the experiment a success?
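As a sketch of that comparison, here is a minimal Python example with made-up per-user numbers (the variable names and values are hypothetical; only the structure matters):

    import statistics

    # Hypothetical per-user average minutes on site per day, one value per user.
    control   = [12.1, 9.8, 15.3, 11.0, 13.7, 10.5]
    treatment = [13.4, 11.2, 16.0, 12.8, 14.9, 11.7]

    mean_control = statistics.mean(control)
    mean_treatment = statistics.mean(treatment)

    # The observed lift is the "X%" referred to in the text.
    lift = (mean_treatment - mean_control) / mean_control * 100
    print(f"control mean:   {mean_control:.1f} min/day")
    print(f"treatment mean: {mean_treatment:.1f} min/day")
    print(f"observed lift:  {lift:+.1f}%")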

Not quite. If we come back to our null hypothesis -- "there is no statistical difference between treatment and control" -- and assume it to be true, it is still possible that, due to the inherent variability of user behavior, we observe ups and downs in any measure. How likely is that? It can be quantified as a probability after making a few assumptions.

First, we need to model our measurement as a sample drawn from a random distribution. That is to say, the amount of time users spend on our site is an unpredictable random quantity, but we expect it to fall around a central value with some spread. The distribution that makes the fewest assumptions about the underlying phenomenon is the normal distribution, described by the Gaussian curve. It takes two parameters: the mean, around which the samples are centered, and the variance, a measure of how widely the samples deviate from the mean.
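For reference, the standard density of the normal distribution with mean mu and variance sigma squared is, in LaTeX notation:

    f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)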

Alright, coming back to the observation. We observed that users in the treatment group spent, on average, X% more time on our site than users in the control group. Now, we assume that the same distribution that describes control also describes treatment (thus assuming there is no underlying movement caused by our intervention). If that were true, and assuming that our measure is described by the normal distribution, we can calculate the exact probability of seeing, by chance alone, a measurement which deviates at least X% from the mean (estimated from control). This is called the p-value.
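To make that calculation concrete, here is a minimal sketch, reusing the hypothetical numbers from before and approximating the difference of sample means with a two-sided z-test under the normal assumption:

    import math
    import statistics
    from statistics import NormalDist

    # Same hypothetical per-user numbers as in the earlier sketch.
    control   = [12.1, 9.8, 15.3, 11.0, 13.7, 10.5]
    treatment = [13.4, 11.2, 16.0, 12.8, 14.9, 11.7]

    # Difference of the sample means and its standard error
    # (normal approximation, allowing unequal variances).
    diff = statistics.mean(treatment) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treatment) / len(treatment)
                   + statistics.variance(control) / len(control))

    # Under the null hypothesis the difference is centered at zero.
    # The two-sided p-value is the probability of a deviation at
    # least this large, in either direction, by chance alone.
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    print(f"z = {z:.2f}, p-value = {p_value:.3f}")

With samples this small, a t-test would be the more careful choice; the z-test is used here only to keep the sketch within the standard library.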

Rigorously speaking, the p-value is how likely it is that -- assuming the null hypothesis is true -- we would see a measurement which deviates at least as much as the one observed. As a shorthand, although not strictly correct, you can think of the p-value as "how likely is it that this was observed by chance?"

For most fields of science, a p-value smaller than 5% is considered statistically significant, a positive result; that is, it counts as enough evidence to disprove the null hypothesis.

That is to say, even in the best of circumstances, we accept some possibility that the observation was due to chance alone so that the result can be put forward. After a few independent replications, though, the likelihood that it was a fluke every single time becomes vanishingly small.

Note that a complete, rigorous critique may also examine other hypotheses: whether the sample sizes are large enough for the claimed size of the effect, whether the samples are representative of the larger population, whether the choice of random distribution was adequate, whether the method of partitioning treatment from control introduced biases, whether any discarded measurements were reasonable to discard, and so on.