In this project, I will investigate the effect of a treatment given to a group of customers on a shopping website. In particular, I will explore differences in the relationships between features as well as differences in key metrics between the two groups: control and test.
For this project, I used the following Kaggle datasets:
Furthermore, I drew inspiration from this article in particular, which provides a good introduction to the methodological basics of A/B testing.
# load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
control_data = pd.read_csv('C:/Users/felix/OneDrive/Dokumente/Python Projects/AB_Testing/control_group.csv', delimiter=';')
test_data = pd.read_csv('C:/Users/felix/OneDrive/Dokumente/Python Projects/AB_Testing/test_group.csv', delimiter=';')
control_data.head()
| | Campaign Name | Date | Spend [USD] | # of Impressions | Reach | # of Website Clicks | # of Searches | # of View Content | # of Add to Cart | # of Purchase |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Control Campaign | 1.08.2019 | 2280 | 82702.0 | 56930.0 | 7016.0 | 2290.0 | 2159.0 | 1819.0 | 618.0 |
| 1 | Control Campaign | 2.08.2019 | 1757 | 121040.0 | 102513.0 | 8110.0 | 2033.0 | 1841.0 | 1219.0 | 511.0 |
| 2 | Control Campaign | 3.08.2019 | 2343 | 131711.0 | 110862.0 | 6508.0 | 1737.0 | 1549.0 | 1134.0 | 372.0 |
| 3 | Control Campaign | 4.08.2019 | 1940 | 72878.0 | 61235.0 | 3065.0 | 1042.0 | 982.0 | 1183.0 | 340.0 |
| 4 | Control Campaign | 5.08.2019 | 1835 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
test_data.head()
| | Campaign Name | Date | Spend [USD] | # of Impressions | Reach | # of Website Clicks | # of Searches | # of View Content | # of Add to Cart | # of Purchase |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Test Campaign | 1.08.2019 | 3008 | 39550 | 35820 | 3038 | 1946 | 1069 | 894 | 255 |
| 1 | Test Campaign | 2.08.2019 | 2542 | 100719 | 91236 | 4657 | 2359 | 1548 | 879 | 677 |
| 2 | Test Campaign | 3.08.2019 | 2365 | 70263 | 45198 | 7885 | 2572 | 2367 | 1268 | 578 |
| 3 | Test Campaign | 4.08.2019 | 2710 | 78451 | 25937 | 4216 | 2216 | 1437 | 566 | 340 |
| 4 | Test Campaign | 5.08.2019 | 2297 | 114295 | 95138 | 5863 | 2106 | 858 | 956 | 768 |
test_data.columns
Index(['Campaign Name', 'Date', 'Spend [USD]', '# of Impressions', 'Reach', '# of Website Clicks', '# of Searches', '# of View Content', '# of Add to Cart', '# of Purchase'], dtype='object')
control_data.shape
(30, 10)
test_data.shape
(30, 10)
These are the meanings of the different features that we find in our data:

Campaign Name
: The name of the campaign

Date
: Date of the record

Spend [USD]
: Amount spent on the campaign in dollars

# of Impressions
: Number of impressions the ad received through the campaign

Reach
: The number of unique impressions received by the ad

# of Website Clicks
: Number of website clicks received through the ads

# of Searches
: Number of users who performed searches on the website

# of View Content
: Number of users who viewed content and products on the website

# of Add to Cart
: Number of users who added products to the cart

# of Purchase
: Number of purchases

# plot the number of purchases for control and test group separately
fig, ax = plt.subplots(figsize=(20, 5))
sns.lineplot(
    data=control_data,
    x='Date',
    y='# of Purchase',
    label='Control'
)
sns.lineplot(
    data=test_data,
    x='Date',
    y='# of Purchase',
    label='Test'
)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
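As a side note, the same figure could be produced with a single lineplot call by concatenating both frames into long format and letting seaborn handle colors and the legend via hue. This is just a minimal sketch of an alternative, assuming both frames keep their original columns:
# optional alternative: combine both groups into one long-format frame and plot with hue
combined = pd.concat([
    control_data.assign(Group='Control'),
    test_data.assign(Group='Test')
])
plt.figure(figsize=(20, 5))
sns.lineplot(data=combined, x='Date', y='# of Purchase', hue='Group')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()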
# Show how the features relate to each other with a correlation matrix - control group & test group
plt.figure(figsize=(20, 10))
corr_matrix_control = control_data.corr(numeric_only=True)
corr_matrix_test = test_data.corr(numeric_only=True)
plt.subplot(1, 2, 1)
plt.title('Correlation Matrix for Control Group')
sns.heatmap(corr_matrix_control, annot=True)
plt.subplot(1, 2, 2)
plt.title('Correlation Matrix for Test Group')
sns.heatmap(corr_matrix_test, annot=True)
From this comparison of the correlation matrices for the control and test groups, we can hypothesize the following:

There is a strong correlation between Reach and the # of Impressions in the control group. The coefficient of 0.96 suggests that these two variables are close to being equal. This relationship is weaker in the test group, with a coefficient of 0.78.

There is a small positive correlation between Spend [USD] and the # of Impressions in the control group, which suggests that the more the company invests in its campaign, the more impressions it obtains. However, this effect remains rather small for the control group, with a coefficient of 0.27, and it is even weaker in the test group, suggesting that the treatment weakens this relationship and makes additional spend contribute less to the number of impressions.

Looking at the relationship between the # of Add to Cart and Spend, we can observe a slightly negative, but close to zero, correlation in the control group. In the test group, however, additional spend is slightly positively correlated with the # of Add to Cart. This suggests that the treatment might make spending on the campaign more effective, increasing the number of items added to the cart.

Similarly, the relationship between the # of Searches and the # of Purchase is near zero (but negative) in the control group, while being moderately positive, with a coefficient of 0.29, in the test group. This could lead us to the assumption that, under the treatment, more searches translate into more final purchases.

Briefly put, the comparison of the two matrices shows that the treatment slightly diminishes the impact of spending on key metrics, while it strengthens the relationships along the conversion funnel. In summary, we see some differences in the relationships between features in the control and test groups. In other words, there seem to be differences between the control and treatment groups that could, among other things, boost the effectiveness of the company's campaign.
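To make this comparison easier to read at a glance, one could also plot the element-wise difference between the two correlation matrices. A minimal sketch (my own addition, reusing corr_matrix_control and corr_matrix_test from above):
# optional: element-wise difference of the two correlation matrices (test minus control)
corr_diff = corr_matrix_test - corr_matrix_control
plt.figure(figsize=(10, 6))
plt.title('Correlation Difference (Test - Control)')
sns.heatmap(corr_diff, annot=True, center=0, cmap='coolwarm')
plt.show()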
We will now investigate whether the observed differences between the two groups are statistically significant, or rather due to chance.
# prepare the data
# Convert the 'Date' column to datetime format
control_data['Date'] = pd.to_datetime(control_data['Date'], format='%d.%m.%Y')
test_data['Date'] = pd.to_datetime(test_data['Date'], format='%d.%m.%Y')
# fill missing values with 0
control_data.fillna(0, inplace=True)
test_data.fillna(0, inplace=True)
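A quick sanity check (my own addition) that the date conversion worked and that no missing values remain after imputation:
# sanity check: Date is now datetime64 and no NaNs are left
print(control_data['Date'].dtype, test_data['Date'].dtype)
print(control_data.isna().sum().sum(), test_data.isna().sum().sum())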
# Define a function for different metrics
def calculate_metrics(data):
    # average # of purchases per day
    total_purchases = data['# of Purchase'].mean()
    # conversion rate (purchases per search), averaged over days
    conversion_rate = np.where(data['# of Searches'] > 0, data['# of Purchase'] / data['# of Searches'], 0).mean()
    # completion rate (purchases per add-to-cart), averaged over days
    completion_rate = np.where(data['# of Add to Cart'] > 0, data['# of Purchase'] / data['# of Add to Cart'], 0).mean()
    return round(total_purchases, 2), round(conversion_rate, 2), round(completion_rate, 2)
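# Note (my reading of the function above): the rates are means of per-day ratios, not
# ratios of totals. As a hand check against the control head() shown earlier
# (day 1: 2290 searches, 1819 add-to-carts, 618 purchases):
# conversion ≈ 618 / 2290 ≈ 0.27 and completion ≈ 618 / 1819 ≈ 0.34.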
# apply the function to both control and test data
control_metrics = calculate_metrics(control_data)
test_metrics = calculate_metrics(test_data)
print("Control Group Metrics:")
print(control_metrics)
print("\nTest Group Metrics:")
print(test_metrics)
Control Group Metrics:
(505.37, 0.26, 0.44)

Test Group Metrics:
(521.23, 0.22, 0.62)
In this brief comparison of the key metrics we will test, we can already see slight differences. It also confirms that we should perform a two-sided hypothesis test later on, since the metrics can either increase or decrease when going from the control to the treatment group.
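For readability, the two metric tuples can also be placed side by side in a small DataFrame; the row labels below are my own shorthand for the three metrics:
# side-by-side view of the metric tuples computed above
metrics_summary = pd.DataFrame(
    {'Control': control_metrics, 'Test': test_metrics},
    index=['avg purchases per day', 'conversion rate', 'completion rate']
)
print(metrics_summary)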
# check the distributions of the data
conversion_rate_control = control_data['# of Purchase'] / control_data['# of Searches']
conversion_rate_test = test_data['# of Purchase'] / test_data['# of Searches']
completion_rate_control = control_data['# of Purchase'] / control_data['# of Add to Cart']
completion_rate_test = test_data['# of Purchase'] / test_data['# of Add to Cart']
def check_distribution(data, label):
    data = data.dropna()
    # normaltest
    stat, p = stats.normaltest(data)
    print(f'Normality test statistic: {stat}, p-value: {p}')
    if p < 0.05:
        print(f'{label} does not follow a normal distribution.')
    else:
        print(f'{label} follows a normal distribution.')
    plt.figure(figsize=(10, 5))
    sns.histplot(data, kde=True, bins=15)
    plt.title(f'Distribution of {label}')
    plt.xlabel(label)
    plt.ylabel('Frequency')
    plt.show()
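Since there are only 30 observations per group, a visual cross-check can complement the normality test. A minimal sketch (my own addition, using scipy's probplot for a Q-Q plot against the normal distribution):
# optional visual cross-check: Q-Q plot against a normal distribution
def qq_plot(data, label):
    plt.figure(figsize=(6, 6))
    stats.probplot(data.dropna(), dist='norm', plot=plt)
    plt.title(f'Q-Q plot of {label}')
    plt.show()

qq_plot(control_data['# of Purchase'], '# of Purchase control')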
# check the distributions of purchases for both groups
check_distribution(control_data['# of Purchase'], '# of Purchase control')
check_distribution(test_data['# of Purchase'], '# of Purchase test')
Normality test statistic: 0.6015538753614901, p-value: 0.7402428746244449
# of Purchase control follows a normal distribution.

Normality test statistic: 10.451490326105246, p-value: 0.005376352207858708
# of Purchase test does not follow a normal distribution.
From the analysis above, we can conclude that while the purchases in the control group follow (or resemble) a normal distribution, the purchases in the test group do not. Hence, we will do our hypothesis testing using the Mann-Whitney U test.
check_distribution(conversion_rate_control, 'Conversion Rate Control')
check_distribution(conversion_rate_test, 'Conversion Rate Test')
Normality test statistic: 4.817573462427387, p-value: 0.08992433079743405
Conversion Rate Control follows a normal distribution.

Normality test statistic: 6.427494501253087, p-value: 0.040205669915583296
Conversion Rate Test does not follow a normal distribution.
From the analysis above, we can conclude that while the conversion rate of the control group follows (or resembles) a normal distribution, the conversion rate of the test group does not. Hence, we will again do our hypothesis testing using the Mann-Whitney U test.
check_distribution(completion_rate_control, 'Completion Rate Control')
check_distribution(completion_rate_test, 'Completion Rate Test')
Normality test statistic: 29.016169409956795, p-value: 5.00286598880455e-07
Completion Rate Control does not follow a normal distribution.

Normality test statistic: 0.9718944813842352, p-value: 0.6151142594600424
Completion Rate Test follows a normal distribution.
Here, we have the opposite case: the completion rate of the control group does not follow a normal distribution, whereas that of the test group does. Again, this suggests using the Mann-Whitney U test.
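As an additional robustness check on these normality conclusions (an addition of mine, not part of the original analysis), the Shapiro-Wilk test, which is often recommended for small samples, can be run on the same series:
# optional robustness check: Shapiro-Wilk test on each series used above
for series, label in [
    (control_data['# of Purchase'], '# of Purchase control'),
    (test_data['# of Purchase'], '# of Purchase test'),
    (conversion_rate_control, 'Conversion Rate Control'),
    (conversion_rate_test, 'Conversion Rate Test'),
    (completion_rate_control, 'Completion Rate Control'),
    (completion_rate_test, 'Completion Rate Test'),
]:
    stat, p = stats.shapiro(series.dropna())
    print(f'{label}: W = {stat:.3f}, p = {p:.4f}')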
from scipy.stats import mannwhitneyu
# prepare the data for hypothesis testing
control_data['conversion_rate'] = conversion_rate_control
test_data['conversion_rate'] = conversion_rate_test
control_data['completion_rate'] = completion_rate_control
test_data['completion_rate'] = completion_rate_test
# fill missing values with 0
control_data.fillna(0, inplace=True)
test_data.fillna(0, inplace=True)
# Function to perform statistical tests and return results
def perform_whitney_u_test(control_data, test_data, metric_name):
    control_metric = control_data[metric_name]
    test_metric = test_data[metric_name]
    mean_control = control_metric.mean()
    mean_test = test_metric.mean()
    # two-sided test, as motivated above
    stat, p_value = mannwhitneyu(control_metric, test_metric, alternative='two-sided')
    print(f'Mann-Whitney U Test for {metric_name}:')
    print(f'U-statistic: {stat}, p-value: {p_value}')
    if p_value < 0.05:
        print(f'The difference in {metric_name} between control and test groups is statistically significant.')
    else:
        print(f'The difference in {metric_name} between control and test groups is not statistically significant.')
    # Plot distributions
    plt.figure(figsize=(10, 5))
    sns.kdeplot(control_metric, label='Control', fill=True, color='skyblue', alpha=0.5)
    sns.kdeplot(test_metric, label='Test', fill=True, color='salmon', alpha=0.5)
    plt.axvline(mean_control, color='blue', linestyle='--', linewidth=1, alpha=0.4)
    plt.axvline(mean_test, color='red', linestyle='--', linewidth=1, alpha=0.4)
    plt.title(f'Distribution of {metric_name} (Mann–Whitney U Test)', fontsize=14)
    plt.xlabel(metric_name)
    plt.ylabel('Density')
    plt.legend()
    plt.text(x=0.05, y=plt.ylim()[1]*0.9, s=f'p = {p_value:.4f}', fontsize=12, bbox=dict(boxstyle='round', facecolor='white', edgecolor='gray'))
    plt.grid(True, linestyle='--', alpha=0.4)
    plt.tight_layout()
    plt.show()
# Perform Hypothesis testing for # of Purchases
perform_whitney_u_test(control_data, test_data, '# of Purchase')
Mann-Whitney U Test for # of Purchase:
U-statistic: 439.0, p-value: 0.8766246981054522
The difference in # of Purchase between control and test groups is not statistically significant.
# Perform Hypothesis testing for conversion rate
perform_whitney_u_test(control_data, test_data, 'conversion_rate')
Mann-Whitney U Test for conversion_rate:
U-statistic: 527.0, p-value: 0.25805149508623626
The difference in conversion_rate between control and test groups is not statistically significant.
# Perform Hypothesis testing for completion rate
perform_whitney_u_test(control_data, test_data, 'completion_rate')
Mann-Whitney U Test for completion_rate:
U-statistic: 185.0, p-value: 9.211268793716519e-05
The difference in completion_rate between control and test groups is statistically significant.
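Before interpreting this result, a simple effect-size estimate can be layered on top of the significance test (my own addition): the rank-biserial correlation, which can be derived directly from the U statistic as r = 1 - 2U / (n1 * n2). Note that the sign depends on which sample's U is used; only the magnitude is read here.
# effect size for the Mann-Whitney U test: rank-biserial correlation
def rank_biserial(u_stat, n1, n2):
    return 1 - (2 * u_stat) / (n1 * n2)

# example with the completion-rate result above (U = 185, n1 = n2 = 30)
print(round(rank_biserial(185, 30, 30), 2))  # ≈ 0.59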
We were able to detect a significant difference between the control and test groups in only one of the investigated metrics: the completion rate. In other words, the treatment significantly increased the completion rate, meaning that customers in the treatment group went through with purchasing the items they had added to their cart more frequently than those in the control group, which received no treatment.

There was no significant difference between the control and treatment groups in the conversion rate or the number of purchases. Hence, no effect of the treatment on these metrics can be assumed at this point.