A/B Testing: How effective is the treatment?¶

In this project, I will investigate the effect of a treatment given to a group of customers of a shopping website. In particular, I will explore differences in the relationships between features as well as differences in key metrics between the two groups - control and test.

For this project, I used the following Kaggle dataset:

  • https://www.kaggle.com/datasets/amirmotefaker/ab-testing-dataset

Furthermore, I drew inspiration from the following article, which provides a good introduction to the methodological basics of A/B testing.

  • https://medium.com/data-science/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499
In [120]:
# load packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
In [121]:
control_data = pd.read_csv('C:/Users/felix/OneDrive/Dokumente/Python Projects/AB_Testing/control_group.csv', delimiter=';')

test_data = pd.read_csv('C:/Users/felix/OneDrive/Dokumente/Python Projects/AB_Testing/test_group.csv', delimiter=';')
In [122]:
control_data.head()
Out[122]:
Campaign Name Date Spend [USD] # of Impressions Reach # of Website Clicks # of Searches # of View Content # of Add to Cart # of Purchase
0 Control Campaign 1.08.2019 2280 82702.0 56930.0 7016.0 2290.0 2159.0 1819.0 618.0
1 Control Campaign 2.08.2019 1757 121040.0 102513.0 8110.0 2033.0 1841.0 1219.0 511.0
2 Control Campaign 3.08.2019 2343 131711.0 110862.0 6508.0 1737.0 1549.0 1134.0 372.0
3 Control Campaign 4.08.2019 1940 72878.0 61235.0 3065.0 1042.0 982.0 1183.0 340.0
4 Control Campaign 5.08.2019 1835 NaN NaN NaN NaN NaN NaN NaN
In [123]:
test_data.head()
Out[123]:
Campaign Name Date Spend [USD] # of Impressions Reach # of Website Clicks # of Searches # of View Content # of Add to Cart # of Purchase
0 Test Campaign 1.08.2019 3008 39550 35820 3038 1946 1069 894 255
1 Test Campaign 2.08.2019 2542 100719 91236 4657 2359 1548 879 677
2 Test Campaign 3.08.2019 2365 70263 45198 7885 2572 2367 1268 578
3 Test Campaign 4.08.2019 2710 78451 25937 4216 2216 1437 566 340
4 Test Campaign 5.08.2019 2297 114295 95138 5863 2106 858 956 768
In [124]:
test_data.columns
Out[124]:
Index(['Campaign Name', 'Date', 'Spend [USD]', '# of Impressions', 'Reach',
       '# of Website Clicks', '# of Searches', '# of View Content',
       '# of Add to Cart', '# of Purchase'],
      dtype='object')
In [125]:
control_data.shape
Out[125]:
(30, 10)
In [126]:
test_data.shape
Out[126]:
(30, 10)

These are the meanings for the different features that we find in our data:

  • Campaign Name: The name of the campaign
  • Date: Date of the record
  • Spend: Amount spent on the campaign in dollars
  • Number of Impressions: Number of times the ad was displayed over the course of the campaign
  • Reach: The number of unique users reached by the ad
  • Number of Website Clicks: Number of website clicks received through the ads
  • Number of Searches: Number of users who performed searches on the website
  • Number of View Content: Number of users who viewed content and products on the website
  • Number of Add to Cart: Number of users who added products to the cart
  • Number of Purchase: Number of purchases
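
Before moving on to the plots, a quick optional check of the column types and missing values can be helpful (the head() output above already suggests that Date is stored as a string and that the control data contains at least one row with missing values):

# optional sanity check: column dtypes and missing values per column
print(control_data.dtypes)
print(control_data.isna().sum())
print(test_data.isna().sum())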
In [127]:
# plot the number of purchases over time for control and test separately
fig, ax = plt.subplots(figsize=(20, 5))
sns.lineplot(
    data=control_data,
    x='Date',
    y='# of Purchase',
    label='Control'
)
sns.lineplot(
    data=test_data,
    x='Date',
    y='# of Purchase',
    label='Test'
)
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()
In [128]:
# Show how the features relate to each other with a correlation matrix - control group & test group
plt.figure(figsize=(20, 10))

corr_matrix_control = control_data.corr(numeric_only=True)
corr_matrix_test = test_data.corr(numeric_only=True)
plt.subplot(1, 2, 1)
plt.title('Correlation Matrix for Control Group')
sns.heatmap(corr_matrix_control, annot=True)
plt.subplot(1, 2, 2)
plt.title('Correlation Matrix for Test Group')
sns.heatmap(corr_matrix_test, annot=True)
Out[128]:
<Axes: title={'center': 'Correlation Matrix for Test Group'}>

Description

From this comparison of the correlation matrices of the control and test group, we can hypothesize that:

  • There is a strong correlation between Reach and the # of Impressions in the control group. The coefficient of 0.96 suggests that these two variables move almost in lockstep. This relationship is weaker in the test group, with a coefficient of 0.78.

  • There is a small positive correlation between Spend [USD] and the # of Impressions in the control group, which suggests that the more the company invests in its campaign, the more impressions it receives. However, this effect remains rather small for the control group, with a coefficient of 0.27. The correlation is much weaker in the test group, suggesting that under the treatment, additional spend contributes less to the number of impressions.

  • Looking at the relationship between the # of Add to Cart and Spend, we can observe a slightly negative, but close to zero, correlation in the control group. In the test group, however, additional spend is slightly positively correlated with the # of Add to Cart. This suggests that the treatment might make the campaign spending more effective at increasing the number of items added to the cart.

  • Similarly, we can observe that the correlation between the # of Searches and the # of Purchase is close to zero (but negative) in the control group, while being moderately positive in the test group, with a coefficient of 0.29. This could lead us to the assumption that under the treatment, more searches translate into more final purchases.

In brief, the comparison of the two matrices shows that the treatment slightly diminishes the impact of spending on key metrics, while it strengthens the relationships between conversion-related metrics. In summary, the relationships between features differ between the control and test group, and some of these differences could boost the effectiveness of the company's campaign.
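
To make this comparison more concrete, one could also look at the element-wise difference between the two correlation matrices (a small illustrative sketch reusing corr_matrix_control and corr_matrix_test from above):

# difference between the test and control correlation matrices;
# positive values mark relationships that are stronger in the test group
corr_diff = corr_matrix_test - corr_matrix_control

plt.figure(figsize=(10, 6))
plt.title('Correlation Difference (Test - Control)')
sns.heatmap(corr_diff, annot=True, center=0, cmap='coolwarm')
plt.tight_layout()
plt.show()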

We will now investigate whether the observed differences between both groups are statistically significant, or rather due to chance.

Roadmap for Hypothesis Testing - which test to choose?¶

[Figure: roadmap for choosing the appropriate statistical test, taken from the article below]

(https://medium.com/data-science/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499)
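
For the metrics analyzed here, the roadmap effectively boils down to a simple decision rule: if both samples appear normally distributed, a parametric test such as the independent t-test is appropriate; otherwise, a non-parametric alternative such as the Mann-Whitney U test should be used. Below is a minimal sketch of that logic; choose_test is just an illustrative helper, and the 0.05 threshold mirrors the significance level used later on:

from scipy import stats

def choose_test(sample_a, sample_b, alpha=0.05):
    # D'Agostino-Pearson normality test for each sample
    _, p_a = stats.normaltest(sample_a)
    _, p_b = stats.normaltest(sample_b)
    if p_a >= alpha and p_b >= alpha:
        # both samples look approximately normal -> parametric (Welch's t-test)
        return stats.ttest_ind(sample_a, sample_b, equal_var=False)
    # at least one sample deviates from normality -> non-parametric test
    return stats.mannwhitneyu(sample_a, sample_b, alternative='two-sided')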

In [129]:
# prepare the data

# Convert the 'Date' column to datetime format
control_data['Date'] = pd.to_datetime(control_data['Date'], format='%d.%m.%Y')
test_data['Date'] = pd.to_datetime(test_data['Date'], format='%d.%m.%Y')


# fill missing values with 0
control_data.fillna(0, inplace=True)
test_data.fillna(0, inplace=True)
In [130]:
# Define a function for different metrics

def calculate_metrics(data):

    # average # of purchases per day
    avg_purchases = data['# of Purchase'].mean()

    # conversion rate: purchases per search (guarding against division by zero)
    conversion_rate = np.where(data['# of Searches'] > 0, data['# of Purchase'] / data['# of Searches'], 0).mean()

    # completion rate: purchases per add-to-cart (guarding against division by zero)
    completion_rate = np.where(data['# of Add to Cart'] > 0, data['# of Purchase'] / data['# of Add to Cart'], 0).mean()


    return round(avg_purchases, 2), round(conversion_rate, 2), round(completion_rate, 2)
    
In [131]:
# apply the function to both control and test data
control_metrics = calculate_metrics(control_data)
test_metrics = calculate_metrics(test_data)

print("Control Group Metrics:")
print(control_metrics)
print("\nTest Group Metrics:")
print(test_metrics)
Control Group Metrics:
(505.37, 0.26, 0.44)

Test Group Metrics:
(521.23, 0.22, 0.62)

In this brief comparison of the key metrics we will test, we can already see slight differences. It also confirms that we should perform a two-sided hypothesis test later on, since the metrics can either increase or decrease from the control to the test group.
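
To put these numbers side by side, we can compute the relative change from control to test for each metric (a small sketch reusing the control_metrics and test_metrics tuples from above):

# relative change (test vs. control) for each of the three metrics
metric_names = ['Avg. daily purchases', 'Conversion rate', 'Completion rate']
for name, c, t in zip(metric_names, control_metrics, test_metrics):
    print(f'{name}: control={c}, test={t}, relative change={(t - c) / c:+.1%}')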

In [132]:
# check for distribution of the data

conversion_rate_control = control_data['# of Purchase'] / control_data['# of Searches']
conversion_rate_test = test_data['# of Purchase'] / test_data['# of Searches']

completion_rate_control = control_data['# of Purchase'] / control_data['# of Add to Cart']
completion_rate_test = test_data['# of Purchase'] / test_data['# of Add to Cart']


def check_distribution(data, label):

    data = data.dropna()  
    # normaltest
    stat, p = stats.normaltest(data)
    print(f'Normality test statistic: {stat}, p-value: {p}')
    if p < 0.05:
        print(f'{label} does not follow a normal distribution.')
    else:
        print(f'{label} follows a normal distribution.')

    plt.figure(figsize=(10, 5))
    sns.histplot(data, kde=True, bins=15)
    plt.title(f'Distribution of {label}')
    plt.xlabel(label)
    plt.ylabel('Frequency')
    plt.show()
In [133]:
# check the distribution of # of Purchase for both groups

check_distribution(control_data['# of Purchase'], '# of Purchase control')
check_distribution(test_data['# of Purchase'], '# of Purchase test')
Normality test statistic: 0.6015538753614901, p-value: 0.7402428746244449
# of Purchase control follows a normal distribution.
Normality test statistic: 10.451490326105246, p-value: 0.005376352207858708
# of Purchase test does not follow a normal distribution.

From the analysis above, we can conclude that while the purchases of the control group follow (or at least resemble) a normal distribution, the purchases of the test group do not. Hence, we will perform our hypothesis testing using the Mann-Whitney U test.

In [134]:
check_distribution(conversion_rate_control, 'Conversion Rate Control')
check_distribution(conversion_rate_test, 'Conversion Rate Test')
Normality test statistic: 4.817573462427387, p-value: 0.08992433079743405
Conversion Rate Control follows a normal distribution.
Normality test statistic: 6.427494501253087, p-value: 0.040205669915583296
Conversion Rate Test does not follow a normal distribution.

From the analysis above, we can conclude that while the conversion rate of the control group follows (or at least resembles) a normal distribution, the conversion rate of the test group does not. Hence, we will perform our hypothesis testing using the Mann-Whitney U test.

In [135]:
check_distribution(completion_rate_control, 'Completion Rate Control')
check_distribution(completion_rate_test, 'Completion Rate Test')
Normality test statistic: 29.016169409956795, p-value: 5.00286598880455e-07
Completion Rate Control does not follow a normal distribution.
Normality test statistic: 0.9718944813842352, p-value: 0.6151142594600424
Completion Rate Test follows a normal distribution.

Here we have the opposite case: the completion rate of the control group does not follow a normal distribution, whereas it does for the test group. Again, this suggests using the Mann-Whitney U test.

In [136]:
from scipy.stats import mannwhitneyu
In [137]:
# prepare the data for hypothesis testing

control_data['conversion_rate'] = conversion_rate_control
test_data['conversion_rate'] = conversion_rate_test

control_data['completion_rate'] = completion_rate_control
test_data['completion_rate'] = completion_rate_test

# fill missing values with 0
control_data.fillna(0, inplace=True)
test_data.fillna(0, inplace=True)
In [138]:
# Function to perform statistical tests and return results

def perform_whitney_u_test(control_data, test_data, metric_name):
    control_metric = control_data[metric_name]
    test_metric = test_data[metric_name]

    mean_control = control_metric.mean()
    mean_test = test_metric.mean()
    
    # two-sided test, as motivated above
    stat, p_value = mannwhitneyu(control_metric, test_metric, alternative='two-sided')
    
    print(f'Mann-Whitney U Test for {metric_name}:')
    print(f'U-statistic: {stat}, p-value: {p_value}')
    
    if p_value < 0.05:
        print(f'The difference in {metric_name} between control and test groups is statistically significant.')
    else:
        print(f'The difference in {metric_name} between control and test groups is not statistically significant.')

     # Plot distributions
    plt.figure(figsize=(10, 5))
    sns.kdeplot(control_metric, label='Control', fill=True, color='skyblue', alpha=0.5)
    sns.kdeplot(test_metric, label='Test', fill=True, color='salmon', alpha=0.5)

    plt.axvline(mean_control, color='blue', linestyle='--', linewidth=1, alpha=0.4)
    plt.axvline(mean_test, color='red', linestyle='--', linewidth=1, alpha=0.4)
    
    plt.title(f'Distribution of {metric_name} (Mann–Whitney U Test)', fontsize=14)
    plt.xlabel(metric_name)
    plt.ylabel('Density')
    plt.legend()
    
    # place the p-value annotation in axes coordinates so it stays visible regardless of the metric's scale
    plt.text(x=0.05, y=0.9, s=f'p = {p_value:.4f}', transform=plt.gca().transAxes, fontsize=12, bbox=dict(boxstyle='round', facecolor='white', edgecolor='gray'))

    plt.grid(True, linestyle='--', alpha=0.4)
    plt.tight_layout()
    plt.show()
In [139]:
# Perform Hypothesis testing for # of Purchases
perform_whitney_u_test(control_data, test_data, '# of Purchase')
Mann-Whitney U Test for # of Purchase:
U-statistic: 439.0, p-value: 0.8766246981054522
The difference in # of Purchase between control and test groups is not statistically significant.
In [140]:
# Perform Hypothesis testing for conversion rate
perform_whitney_u_test(control_data, test_data, 'conversion_rate')
Mann-Whitney U Test for conversion_rate:
U-statistic: 527.0, p-value: 0.25805149508623626
The difference in conversion_rate between control and test groups is not statistically significant.
In [141]:
# Perform Hypothesis testing for completion rate
perform_whitney_u_test(control_data, test_data, 'completion_rate')
Mann-Whitney U Test for completion_rate:
U-statistic: 185.0, p-value: 9.211268793716519e-05
The difference in completion_rate between control and test groups is statistically significant.

Results of the Hypothesis Testing¶

  • We were able to detect a significant difference between the control and test group in only one of the investigated metrics: the completion rate. In other words, the treatment significantly increased the completion rate, meaning that customers in the treatment group more frequently went through with purchasing the items they had added to their cart than customers in the control group, which received no treatment.

  • There was no significant difference between the control and treatment group in the conversion rate or the number of purchases. Hence, no effect of the treatment on these metrics can be assumed at this point.
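
A statistically significant result by itself says little about the size of the effect. As an illustrative extension of the tests above, the U statistic can be converted into a rank-biserial correlation, a common effect-size measure for the Mann-Whitney U test (a small sketch):

# rank-biserial correlation derived from the U statistic: r = 1 - 2U / (n1 * n2);
# magnitudes further from zero indicate a larger separation between the groups' ranks
u_stat, _ = mannwhitneyu(control_data['completion_rate'], test_data['completion_rate'], alternative='two-sided')
n1, n2 = len(control_data), len(test_data)
rank_biserial = 1 - (2 * u_stat) / (n1 * n2)
print(f'Rank-biserial correlation for completion_rate: {rank_biserial:.2f}')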