Hypothesis Testing Examples¶
This notebook demonstrates core hypothesis tests using publicly available datasets commonly used in data science:
Summary
| Independent variable | Dependent variable | Type of plots | Type of Hypothesis test |
|---|---|---|---|
| Continuous | Continuous | Scatter plots | Correlation test |
| Continuous | Categorical | Bar charts | (two categories) t-test/z-test, (more than 2 categories) F test |
| Categorical | Continuous | Joint Histograms | (two categories) t-test/z-test, (more than 2 categories) F test |
| Categorical | Categorical | Mosaic charts | Chi-Square independence test |
Examples
We are using the four use cases below to demonstrate these:
- Two‑proportion Z‑test: Marketing A/B conversion (Udacity e‑commerce A/B test)
  Data source: ab_data.csv (GitHub mirror)
- Welch's t‑test: Supplement efficacy (ToothGrowth: Orange Juice vs Vitamin C)
  Data source: Rdatasets (CSV)
- One‑way ANOVA + Tukey HSD: Branch‑wise gross income (Supermarket Sales)
  Data source: selva86/datasets (CSV)
- Chi‑square independence + Cramér's V: Gender vs Product category
  Data source: Retail data (CSV)
import numpy as np, pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.float_format', lambda x: f"{x:,.6f}")
Two‑Proportion Z‑Test : Marketing A/B Conversion¶
In digital marketing, companies frequently run A/B tests to compare two versions of a webpage or campaign. The objective is to determine which version leads to higher conversions (e.g., purchases, sign-ups). In this dataset, the control group saw the old landing page, while the treatment group saw the new page. The hypothesis test checks if the new page significantly improves conversion rates.
Result Interpretation: After running the two-proportion Z-test, we obtained a p-value.
If this p-value is greater than 0.05, we fail to reject the null hypothesis, meaning there is no statistically significant difference between the old and new pages. This suggests that the redesign did not lead to a measurable improvement in conversions.
If the p-value were below 0.05, we would conclude that the new page performs differently (better or worse) than the old one.
This insight helps marketing teams decide whether to adopt the new design or stick with the old one, ensuring data-driven decisions rather than relying on intuition.
Dataset: Udacity e‑commerce A/B test (ab_data.csv).
Link: https://github.com/beery4010/Analyze-AB-Test-Results/blob/master/ab_data.csv
Variables: group ∈ {control, treatment}, converted ∈ {0,1}.
Goal: Test whether conversion rates differ between old (control) and new (treatment) pages.
Hypotheses (two‑sided):
- $ H_0: p_{\text{treatment}} = p_{\text{control}} $
- $ H_1: p_{\text{treatment}} \neq p_{\text{control}} $
Assumptions: Independent Bernoulli trials; large sample sizes so normal approximation holds.
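For reference, the statistic computed below is the standard pooled two‑proportion z statistic (to our understanding, this pooled form is also what proportions_ztest uses for the two‑sample case):

$$ z = \frac{\hat{p}_{\text{treat}} - \hat{p}_{\text{ctrl}}}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_{\text{treat}}} + \frac{1}{n_{\text{ctrl}}}\right)}}, \qquad \hat{p} = \frac{x_{\text{treat}} + x_{\text{ctrl}}}{n_{\text{treat}} + n_{\text{ctrl}}} $$

where \(x\) counts conversions and \(n\) is the group size.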
from statsmodels.stats.proportion import proportion_confint
# Load dataset (raw CSV via GitHub)
url_ab = "https://raw.githubusercontent.com/beery4010/Analyze-AB-Test-Results/master/ab_data.csv"
ab = pd.read_csv(url_ab)
ab
| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |
| ... | ... | ... | ... | ... | ... |
| 294473 | 751197 | 2017-01-03 22:28:38.630509 | control | old_page | 0 |
| 294474 | 945152 | 2017-01-12 00:51:57.078372 | control | old_page | 0 |
| 294475 | 734608 | 2017-01-22 11:45:03.439544 | control | old_page | 0 |
| 294476 | 697314 | 2017-01-15 01:20:28.957438 | control | old_page | 0 |
| 294477 | 715931 | 2017-01-16 12:40:24.467417 | treatment | new_page | 0 |
294478 rows × 5 columns
# Align groups and landing pages as done in the Udacity project
ab = ab[((ab['group'] == 'control') & (ab['landing_page'] == 'old_page')) |
((ab['group'] == 'treatment') & (ab['landing_page'] == 'new_page'))]
# Compute conversions and sample sizes
conv_control = ab.loc[ab['group']=='control','converted'].sum()
n_control = ab.loc[ab['group']=='control','converted'].count()
conv_treat = ab.loc[ab['group']=='treatment','converted'].sum()
n_treat = ab.loc[ab['group']=='treatment','converted'].count()
print(f"In the control group, {n_control} users saw the old page and {conv_control} converted, giving p = {conv_control/n_control:.4f}")
print(f"In the treatment group, {n_treat} users saw the new page and {conv_treat} converted, giving p = {conv_treat/n_treat:.4f}")
In the control group, 145274 users saw the old page and 17489 converted, giving p = 0.1204
In the treatment group, 145311 users saw the new page and 17264 converted, giving p = 0.1188
# Two‑sided two‑proportion z‑test
z_stat, p_val = proportions_ztest([conv_treat, conv_control], [n_treat, n_control], alternative='two-sided')
print(f"Two‑proportion z‑test: z = {z_stat:.4f}, p = {p_val:.6f}")
Two‑proportion z‑test: z = -1.3116, p = 0.189653
# Wald 95% CI for each proportion
ci_treat = proportion_confint(conv_treat, n_treat, method='normal')
ci_ctrl = proportion_confint(conv_control, n_control, method='normal')
# Difference in proportions & CI (Wald)
p_hat_t = conv_treat / n_treat
p_hat_c = conv_control / n_control
diff = p_hat_t - p_hat_c
se = np.sqrt(p_hat_t*(1-p_hat_t)/n_treat + p_hat_c*(1-p_hat_c)/n_control)
ci_diff = (diff - 1.96*se, diff + 1.96*se)
print(f"Counts (treat/control): {conv_treat}/{n_treat} vs {conv_control}/{n_control}")
print(f"p_treat = {p_hat_t:.5f} 95% CI {ci_treat}")
print(f"p_ctrl = {p_hat_c:.5f} 95% CI {ci_ctrl}")
print(f"Diff (treat - control) = {diff:.5f} 95% CI {ci_diff}")
Counts (treat/control): 17264/145311 vs 17489/145274
p_treat = 0.11881 95% CI (0.11714362162601945, 0.12047087417952866)
p_ctrl = 0.12039 95% CI (0.11871294722381814, 0.12205966177710426)
Diff (treat - control) = -0.00158 95% CI (-0.003938713688889012, 0.0007806004935147211)
Decision rule: If p < 0.05, reject \(H_0\) and conclude conversion differs between pages. Otherwise, fail to reject \(H_0\).
As the two-proportion z-test gives p ≈ 0.19 > 0.05, we fail to reject the null hypothesis. This means the new and old pages convert users at approximately the same rate. We recommend that the e-commerce company keep the old page, which saves the time and money of rolling out a new one.
def plot_z_hypothesis(
data_list,
pop_mean=0.0,
pop_sd=1.0,
alternative='two.sided', # 'two.sided' | 'greater' | 'less'
type_test = 'mean', #'mean' | 'prob'
alpha=0.05,
label='Sampling distribution',
title='z-test (sampling distribution of x̄)',
x_label = "Sampling distribution of x̄",
figsize=(8, 5)
):
"""
Visualize z-test decision regions for the sampling distribution of the sample mean.
Parameters
----------
data_list : array-like
Sample values (used only to compute x̄ and n).
pop_mean : float
Hypothesized population mean (μ under H0).
pop_sd : float
Known population standard deviation (σ).
    alternative : str
        One of 'two.sided', 'greater', or 'less'.
    type_test : str
        'mean' annotates the sample mean (x̄); 'prob' annotates a z statistic instead.
alpha : float
Significance level for critical regions.
label : str
Label for the distribution curve.
title : str
Plot title.
figsize : tuple
Figure size.
Returns
-------
fig, ax : matplotlib Figure and Axes
"""
x = np.asarray(data_list)
n = len(x)
xbar = np.mean(x)
# Standard error of the mean
se = pop_sd / np.sqrt(n)
    # Plotting range: ±4 standard errors around μ
grid = np.linspace(pop_mean - 4 * se, pop_mean + 4 * se, 4001)
pdf = stats.norm.pdf(grid, loc=pop_mean, scale=se)
# Compute critical cutoffs under H0 for the chosen alternative
if alternative == 'two.sided':
# symmetric cutoffs: (alpha/2) and (1 - alpha/2)
lower_cut = stats.norm.ppf(alpha / 2, loc=pop_mean, scale=se)
upper_cut = stats.norm.ppf(1 - alpha / 2, loc=pop_mean, scale=se)
# Retain between cutoffs; reject outside
retain_mask = (grid >= lower_cut) & (grid <= upper_cut)
reject_mask = ~retain_mask
elif alternative == 'greater':
# reject on right tail
cutoff = stats.norm.ppf(1 - alpha, loc=pop_mean, scale=se)
retain_mask = (grid <= cutoff)
reject_mask = (grid > cutoff)
lower_cut, upper_cut = None, cutoff
elif alternative == 'less':
# reject on left tail
cutoff = stats.norm.ppf(alpha, loc=pop_mean, scale=se)
retain_mask = (grid >= cutoff)
reject_mask = (grid < cutoff)
lower_cut, upper_cut = cutoff, None
else:
raise ValueError("alternative must be one of {'two.sided','greater','less'}")
    # Collect the grid, densities, and retain mask into a DataFrame for plotting
df = pd.DataFrame({'x': grid, 'pdf': pdf, 'retain': retain_mask})
# Plot
fig, ax = plt.subplots(figsize=figsize)
ax.plot(df['x'], df['pdf'], color='black', lw=1.2, label=label)
# Shade retain region
ax.fill_between(df['x'], 0, df['pdf'], where=df['retain'], color='#69b3a2', alpha=0.4, label='Retain H₀')
# Shade reject region(s)
ax.fill_between(df['x'], 0, df['pdf'], where=~df['retain'], color='#e76f51', alpha=0.4, label='Reject H₀')
# Critical lines
if alternative == 'two.sided':
ax.axvline(lower_cut, color='#e76f51', ls='--', lw=1)
ax.axvline(upper_cut, color='#e76f51', ls='--', lw=1)
ax.text(lower_cut, ax.get_ylim()[1]*0.3, f"Lower crit\n{lower_cut:.2f}", ha='right', va='top', fontsize=9)
ax.text(upper_cut, ax.get_ylim()[1]*0.3, f"Upper crit\n{upper_cut:.2f}", ha='left', va='top', fontsize=9)
elif alternative == 'greater':
ax.axvline(upper_cut, color='#e76f51', ls='--', lw=1)
ax.text(upper_cut, ax.get_ylim()[1]*0.9, f"Crit\n{upper_cut:.2f}", ha='left', va='top', fontsize=9)
elif alternative == 'less':
ax.axvline(lower_cut, color='#e76f51', ls='--', lw=1)
ax.text(lower_cut, ax.get_ylim()[1]*0.9, f"Crit\n{lower_cut:.2f}", ha='right', va='top', fontsize=9)
# x̄ line and annotation
if(type_test == 'prob'):
ax.axvline(xbar, color='#264653', lw=2, ls='-', label='z')
ax.annotate(f"z = {xbar:.2f}",
xy=(xbar, stats.norm.pdf(xbar, loc=pop_mean, scale=se)),
xytext=(xbar, ax.get_ylim()[1]*0.6),
arrowprops=dict(arrowstyle='->', color='#264653'),
ha='center', color='#264653')
else:
ax.axvline(xbar, color='#264653', lw=2, ls='-', label='Sample mean (x̄)')
ax.annotate(f"x̄ = {xbar:.2f}",
xy=(xbar, stats.norm.pdf(xbar, loc=pop_mean, scale=se)),
xytext=(xbar, ax.get_ylim()[1]*0.6),
arrowprops=dict(arrowstyle='->', color='#264653'),
ha='center', color='#264653')
ax.set_title(title)
ax.set_xlabel(x_label)
ax.set_ylabel("Density")
ax.legend(loc='upper right', frameon=False)
ax.grid(alpha=0.15)
plt.show();
plot_z_hypothesis([z_stat], type_test = 'prob', title='Two proportion z-test', label = 'Standard Normal', x_label = '')

Welch’s t‑Test : Effect of Vitamin C on Tooth Growth¶
This classic dataset explores the impact of Vitamin C on tooth growth in guinea pigs, a foundational experiment in nutritional science. The response variable is the length of odontoblasts, which are cells responsible for tooth development. Sixty guinea pigs were randomly assigned to receive Vitamin C in one of two delivery methods:
1. Orange Juice (OJ)
2. Ascorbic Acid (VC) (a synthetic form of Vitamin C)
Each animal was also given one of three dose levels: 0.5 mg/day, 1 mg/day, or 2 mg/day. The experiment aims to determine whether the delivery method influences tooth growth, controlling for dosage.
Why it matters:
Understanding the effectiveness of different Vitamin C sources can guide dietary recommendations and supplement formulations. In modern analytics, this type of test parallels comparing two treatments or interventions in healthcare or A/B testing in product design.
Hypothesis:
\(H_0\): The mean tooth length is the same for both delivery methods (OJ and VC).
\(H_1\): The mean tooth length differs between the two methods.
This analysis focuses on the 0.5 mg/day dosage level; similar analyses can be performed for other dosages.
Result Interpretation: After running Welch’s t-test, if the p-value is less than 0.05, we reject the null hypothesis, concluding that the delivery method significantly affects tooth growth. If the p-value is greater than 0.05, we fail to reject \(H_0\), suggesting no measurable difference between OJ and VC. Additionally, reporting Cohen’s d helps quantify the magnitude of the difference, which is crucial for practical significance.
Dataset: R ToothGrowth (OJ vs VC).
CSV: https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/ToothGrowth.csv
Variables: len (tooth length), supp ∈ {OJ, VC}.
Hypotheses (two‑sided):
- \(H_0: \mu_{\text{OJ}} = \mu_{\text{VC}}\)
- \(H_1: \mu_{\text{OJ}} \neq \mu_{\text{VC}}\)
Assumptions: Independent samples; normality (approx); Welch’s t‑test does not assume equal variance.
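For reference, Welch's t statistic and the pooled‑SD Cohen's d reported below are

$$ t = \frac{\bar{x}_{\text{OJ}} - \bar{x}_{\text{VC}}}{\sqrt{\dfrac{s_{\text{OJ}}^2}{n_{\text{OJ}}} + \dfrac{s_{\text{VC}}^2}{n_{\text{VC}}}}}, \qquad d = \frac{\bar{x}_{\text{OJ}} - \bar{x}_{\text{VC}}}{s_p}, \quad s_p = \sqrt{\frac{(n_{\text{OJ}}-1)\,s_{\text{OJ}}^2 + (n_{\text{VC}}-1)\,s_{\text{VC}}^2}{n_{\text{OJ}} + n_{\text{VC}} - 2}} $$

with the degrees of freedom of \(t\) given by the Welch–Satterthwaite approximation.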
# Load ToothGrowth
url_tg = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/ToothGrowth.csv"
tg = pd.read_csv(url_tg)
# Drop any unnamed index column left over from the CSV export (if present)
for c in tg.columns:
if 'Unnamed' in c:
tg = tg.drop(columns=[c])
tg.head()
| | rownames | len | supp | dose |
|---|---|---|---|---|
| 0 | 1 | 4.200000 | VC | 0.500000 |
| 1 | 2 | 11.500000 | VC | 0.500000 |
| 2 | 3 | 7.300000 | VC | 0.500000 |
| 3 | 4 | 5.800000 | VC | 0.500000 |
| 4 | 5 | 6.400000 | VC | 0.500000 |
# Groups
oj = tg.loc[(tg['supp']=='OJ') & (tg.dose == 0.5),'len']
vc = tg.loc[(tg['supp']=='VC') & (tg.dose == 0.5),'len']
# Welch's t-test
t_stat, p_val = stats.ttest_ind(oj, vc, equal_var=False)
# Cohen's d (using pooled SD with group sizes)
def cohens_d(x, y):
nx, ny = len(x), len(y)
sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)
sp2 = ((nx-1)*sx2 + (ny-1)*sy2) / (nx+ny-2)
d = (np.mean(x) - np.mean(y)) / np.sqrt(sp2)
return d
d = cohens_d(oj, vc)
print(f"Welch t‑test: t = {t_stat:.4f}, p = {p_val:.6f}")
print(f"Mean's: OJ = {np.mean(oj):.3f}, VC = {np.mean(vc):.3f}")
print(f"Cohen's d = {d:.3f}")
Welch t‑test: t = 3.1697, p = 0.006359
Means: OJ = 13.230, VC = 7.980
Cohen's d = 1.418
Decision rule: If p < 0.05, reject \(H_0\) → evidence that the delivery method affects mean tooth length. As a rule of thumb, a Cohen's d of about 0.2 indicates a small difference between the groups, 0.5 a medium one, and 0.8 or greater a large effect.
Result: OJ is more effective at this dose. Welch's t-test shows a statistically significant difference between the two delivery methods at 0.5 mg/day (p ≈ 0.006 < 0.05), with orange juice leading to greater tooth growth, so the observed difference is unlikely to be due to random chance. Similar comparisons can be run at the other dose levels.
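An interval estimate of the difference complements the p-value by showing the plausible size of the effect in the original units. Below is a minimal sketch; the helper welch_mean_diff_ci is hypothetical (not part of the original analysis) and reuses the oj and vc series and the numpy/scipy imports from above.
# Sketch: 95% CI for the difference in means using the Welch–Satterthwaite df
def welch_mean_diff_ci(x, y, conf=0.95):
    nx, ny = len(x), len(y)
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    se = np.sqrt(vx/nx + vy/ny)
    # Welch–Satterthwaite approximation of the degrees of freedom
    dof = (vx/nx + vy/ny)**2 / ((vx/nx)**2/(nx-1) + (vy/ny)**2/(ny-1))
    diff = np.mean(x) - np.mean(y)
    tcrit = stats.t.ppf(1 - (1 - conf)/2, dof)
    return diff - tcrit*se, diff + tcrit*se
print("95% CI for mean(OJ) - mean(VC):", welch_mean_diff_ci(oj, vc))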
def plot_bivariate_histograms(dataset, con_col, cat_col, title='', x_label=''):
    # Overlaid histograms of the continuous column, one per level of the categorical column
    dataset.groupby([cat_col])[con_col].plot.hist(alpha=0.5)
    plt.legend(dataset.groupby([cat_col])[con_col].count().axes[0].tolist())
    plt.title(title)
    plt.xlabel(x_label)
    plt.show();
plot_bivariate_histograms(tg[tg.dose==0.5], 'len', 'supp', title='Joint histogram', x_label = 'Length')

One‑Way ANOVA : Branch‑wise Gross Income (with Tukey HSD)¶
This dataset captures detailed transaction records from a supermarket chain operating in Myanmar, covering three major cities: Yangon, Naypyitaw, and Mandalay. The data spans a three-month period from January to March 2019, providing a rich view of retail operations. Each record includes information on branch location, product line, customer demographics, payment methods, and financial metrics such as gross income and total sales.
For our hypothesis test, we focus on whether branch location influences gross income, which is a critical metric for profitability. Retail managers often need to know if certain branches consistently outperform others, as this insight can guide resource allocation, marketing strategies, and inventory planning.
Hypothesis:
\(H_0\): The mean gross income is the same across all three branches (A, B, C).
\(H_1\): At least one branch has a different mean gross income.
Why it matters: If the test reveals significant differences, management can investigate underlying factors such as customer purchasing power, branch size, or local marketing effectiveness. This analysis mirrors real-world business intelligence tasks where data-driven decisions optimize operations and profitability.
Result Interpretation: After running ANOVA, if the p-value is less than 0.05, we reject the null hypothesis, concluding that branch location impacts gross income. Post-hoc analysis using Tukey HSD identifies which branches differ significantly, enabling targeted strategies for improvement.
Dataset: Supermarket Sales (three branches: A, B, C).
CSV: https://raw.githubusercontent.com/selva86/datasets/master/supermarket_sales.csv
Outcome: gross income (continuous). Factor: Branch (A/B/C).
Hypotheses:
- \(H_0: \mu_A = \mu_B = \mu_C\)
- \(H_1\): At least one mean differs
Plan: Fit a one‑way ANOVA (ANalysis Of VAriance), check homogeneity of variances with Levene's test, then use Tukey HSD for post‑hoc pairwise comparisons.
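For reference, the one‑way ANOVA F statistic is the ratio of between‑group to within‑group mean squares:

$$ F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2 / (k-1)}{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 / (N-k)} $$

where \(k\) is the number of branches and \(N\) the total number of transactions; under \(H_0\), \(F\) follows an F distribution with \(k-1\) and \(N-k\) degrees of freedom.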
# Load Supermarket Sales
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/supermarket_sales.csv")
# Tidy column names
df.columns = [c.strip().replace(' ', '_').lower() for c in df.columns]
df
| | invoice_id | branch | city | customer_type | gender | product_line | unit_price | quantity | tax_5% | total | date | time | payment | cogs | gross_margin_percentage | gross_income | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.690000 | 7 | 26.141500 | 548.971500 | 1/5/2019 | 13:08 | Ewallet | 522.830000 | 4.761905 | 26.141500 | 9.100000 |
| 1 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.280000 | 5 | 3.820000 | 80.220000 | 3/8/2019 | 10:29 | Cash | 76.400000 | 4.761905 | 3.820000 | 9.600000 |
| 2 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.330000 | 7 | 16.215500 | 340.525500 | 3/3/2019 | 13:23 | Credit card | 324.310000 | 4.761905 | 16.215500 | 7.400000 |
| 3 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.220000 | 8 | 23.288000 | 489.048000 | 1/27/2019 | 20:33 | Ewallet | 465.760000 | 4.761905 | 23.288000 | 8.400000 |
| 4 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.310000 | 7 | 30.208500 | 634.378500 | 2/8/2019 | 10:37 | Ewallet | 604.170000 | 4.761905 | 30.208500 | 5.300000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 233-67-5758 | C | Naypyitaw | Normal | Male | Health and beauty | 40.350000 | 1 | 2.017500 | 42.367500 | 1/29/2019 | 13:46 | Ewallet | 40.350000 | 4.761905 | 2.017500 | 6.200000 |
| 996 | 303-96-2227 | B | Mandalay | Normal | Female | Home and lifestyle | 97.380000 | 10 | 48.690000 | 1,022.490000 | 3/2/2019 | 17:16 | Ewallet | 973.800000 | 4.761905 | 48.690000 | 4.400000 |
| 997 | 727-02-1313 | A | Yangon | Member | Male | Food and beverages | 31.840000 | 1 | 1.592000 | 33.432000 | 2/9/2019 | 13:22 | Cash | 31.840000 | 4.761905 | 1.592000 | 7.700000 |
| 998 | 347-56-2442 | A | Yangon | Normal | Male | Home and lifestyle | 65.820000 | 1 | 3.291000 | 69.111000 | 2/22/2019 | 15:33 | Cash | 65.820000 | 4.761905 | 3.291000 | 4.100000 |
| 999 | 849-09-3807 | A | Yangon | Member | Female | Fashion accessories | 88.340000 | 7 | 30.919000 | 649.299000 | 2/18/2019 | 13:28 | Cash | 618.380000 | 4.761905 | 30.919000 | 6.600000 |
1000 rows × 17 columns
# One-way ANOVA
model = ols('gross_income ~ C(branch)', data=df).fit()
anova_tbl = sm.stats.anova_lm(model, typ=2)
print(anova_tbl)
sum_sq df F PR(>F)
C(branch) 242.602644 2.000000 0.884583 0.413210
Residual 136,716.894906 997.000000 NaN NaN
# Levene test for homogeneity of variances
from scipy.stats import levene
branch_a = df[df['branch']=='A']['gross_income']
branch_b = df[df['branch']=='B']['gross_income']
branch_c = df[df['branch']=='C']['gross_income']
print("Levene p-value:", levene(branch_a, branch_b, branch_c).pvalue)
Levene p-value: 0.08946425577002974
# Tukey HSD post-hoc
tukey = pairwise_tukeyhsd(endog=df['gross_income'], groups=df['branch'], alpha=0.05)
print(tukey.summary())
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj lower upper reject
---------------------------------------------------
A B 0.358 0.9171 -1.7627 2.4788 False
A C 1.1784 0.3954 -0.9489 3.3057 False
B C 0.8203 0.6405 -1.3195 2.9602 False
---------------------------------------------------
Decision rule: If ANOVA p < 0.05, reject \(H_0\) and use Tukey HSD to identify differing pairs. If Levene p < 0.05, consider Welch’s ANOVA or robust alternatives.
As p = 0.413 (PR(>F) in the ANOVA table) is greater than 0.05, we fail to reject the null hypothesis and conclude that the three branches have similar mean gross income. Additionally, the variances are homogeneous (Levene's p-value > 0.05), and none of the pairwise comparisons is significant (Tukey's HSD is essentially a set of t-tests between each pair of branches, corrected for the family-wise error rate).
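Since Levene's p-value is above 0.05, Welch's ANOVA is not needed here. For completeness, a minimal sketch of how it could be run is below; it assumes a statsmodels version that exposes anova_oneway in statsmodels.stats.oneway (that function and its use_var='unequal' option are an assumption of this sketch, not part of the original analysis).
# Sketch: Welch's ANOVA as a robustness check if variances were unequal
# (assumes statsmodels.stats.oneway.anova_oneway is available)
from statsmodels.stats.oneway import anova_oneway
welch_res = anova_oneway(df['gross_income'], groups=df['branch'], use_var='unequal')
print(welch_res)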
bi_variate_boxplot = sns.boxplot(x="branch", y="gross_income", data=df)
bi_variate_boxplot.set(title = 'Box Chart of gross income across branches');

Chi‑Square Test of Independence: Gender × Product Category (Retail)¶
In multi‑category retail, merchandising teams care deeply about who buys what. This real customer shopping dataset covers transactions across multiple shopping malls in Istanbul (2021–2023) and includes gender and product category for each purchase. We’ll test whether product category preferences differ by gender, which can inform assortment planning, aisle placement, personalized recommendations, targeted promotions, and store layout decisions.
Variables:
- gender ∈ {Male, Female}
- category ∈ {e.g., Clothing, Electronics, Accessories, …} (multiple categories present in the file)
Hypotheses (Chi‑Square Test of Independence):
\(H_0\): Gender and product category are independent (no association).
\(H_1\): Gender and product category are associated (category preference depends on gender).
Why it matters:
A significant association suggests different category affinities by gender. Retailers can adjust promotions, content, inventory mix, and store displays to better match demand, which often improves conversion rates and gross margin.
Assumptions & Data Checks:
- Sufficient expected counts (preferably ≥ 5 per cell). If some categories are rare, consider grouping similar categories or analyzing the top N categories to satisfy assumptions.
- Observations are independent (each row is a separate transaction).
(These are standard conditions for the Chi‑Square test in categorical analysis.)
Dataset Source:
Customer Shopping Data (GitHub) — file: customer_shopping_data.csv
Interpretation:
1. If p < 0.05, reject \(H_0\): gender and product category are associated.
2. Cramér’s V indicates strength of association: ~0.1 = weak, ~0.3 = moderate, ~0.5 = strong.
We compute the Chi‑square test of independence and Cramér’s V as an effect size.
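For reference, Cramér's V rescales the chi-square statistic to the 0–1 range, which is what the code below computes:

$$ V = \sqrt{\frac{\chi^2 / n}{\min(r-1,\ c-1)}} $$

where \(n\) is the total number of observations and \(r\), \(c\) are the numbers of rows and columns of the contingency table.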
import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np
# Load dataset
url = "https://raw.githubusercontent.com/gokcengiz/Shopping-data-analysis/main/customer_shopping_data.csv"
df = pd.read_csv(url)
df
| | invoice_no | customer_id | gender | age | category | quantity | price | payment_method | invoice_date | shopping_mall |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I138884 | C241288 | Female | 28 | Clothing | 5 | 1,500.400000 | Credit Card | 5/8/2022 | Kanyon |
| 1 | I317333 | C111565 | Male | 21 | Shoes | 3 | 1,800.510000 | Debit Card | 12/12/2021 | Forum Istanbul |
| 2 | I127801 | C266599 | Male | 20 | Clothing | 1 | 300.080000 | Cash | 9/11/2021 | Metrocity |
| 3 | I173702 | C988172 | Female | 66 | Shoes | 5 | 3,000.850000 | Credit Card | 16/05/2021 | Metropol AVM |
| 4 | I337046 | C189076 | Female | 53 | Books | 4 | 60.600000 | Cash | 24/10/2021 | Kanyon |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99452 | I219422 | C441542 | Female | 45 | Souvenir | 5 | 58.650000 | Credit Card | 21/09/2022 | Kanyon |
| 99453 | I325143 | C569580 | Male | 27 | Food & Beverage | 2 | 10.460000 | Cash | 22/09/2021 | Forum Istanbul |
| 99454 | I824010 | C103292 | Male | 63 | Food & Beverage | 2 | 10.460000 | Debit Card | 28/03/2021 | Metrocity |
| 99455 | I702964 | C800631 | Male | 56 | Technology | 4 | 4,200.000000 | Cash | 16/03/2021 | Istinye Park |
| 99456 | I232867 | C273973 | Female | 36 | Souvenir | 3 | 35.190000 | Credit Card | 15/10/2022 | Mall of Istanbul |
99457 rows × 10 columns
# Build contingency table: product category vs gender
ct = pd.crosstab(df['category'], df['gender'])
chi2, p, dof, expected = chi2_contingency(ct)
print("Contingency Table:\n", ct)
print(f"Chi-square = {chi2:.4f}, p-value = {p:.6f}, dof = {dof}")
Contingency Table:
gender Female Male
category
Books 2906 2075
Clothing 20652 13835
Cosmetics 9070 6027
Food & Beverage 8804 5972
Shoes 5967 4067
Souvenir 3017 1982
Technology 2981 2015
Toys 6085 4002
Chi-square = 7.5679, p-value = 0.372234, dof = 7
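As a quick check of the expected-count condition listed under the assumptions, the sketch below reuses the expected matrix returned by chi2_contingency in the cell above.
# Check the chi-square assumption: every expected cell count should be >= 5
print("Minimum expected cell count:", round(float(expected.min()), 1))
print("All expected counts >= 5:", bool((expected >= 5).all()))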
# Compute Cramér's V
n = ct.values.sum()
phi2 = chi2 / n
r, c = ct.shape
cramers_v = np.sqrt(phi2 / min(r-1, c-1))
print(f"Cramér's V = {cramers_v:.3f}")
Cramér's V = 0.009
A Cramér's V of 0.009 indicates a negligible (very weak) association between gender and product category.
from statsmodels.graphics.mosaicplot import mosaic
def plot_mosaics(data, x_col, y_col, title='', colors_list =[]):
dict_of_tuples = {}
# create the clean set of percentages to print
for x_col_ in data[x_col].unique():
for y_col_ in data[y_col].unique():
n = len(data[(data[x_col]==x_col_)&(data[y_col]==y_col_)][x_col])
d = len(data[(data[x_col]==x_col_)][x_col])
len_ = len(data[x_col])
if((d==0) or (n/d<=0.04)):
# if the percentage within a class is less than 4%, do not print the percentage
dict_of_tuples[(str(x_col_), str(y_col_))] = ''
elif(n/len_<=0.02):
                # If it's a tiny class (less than 2% of the total data), do not print its percentage
dict_of_tuples[(str(x_col_), str(y_col_))] = ''
else:
dict_of_tuples[(str(x_col_), str(y_col_))] = str(int(n/d*100))+"%"
dict_of_colors = dict_of_tuples.copy()
if(len(colors_list)>0):
# create a clean set of colors
for i, x_col_ in enumerate(data[x_col].unique()):
for y_col_ in data[y_col].unique():
dict_of_colors[(str(x_col_), str(y_col_))] = {'color':colors_list[i], 'alpha':0.8}
# Plot the mosaic plot
labelizer = lambda k: dict_of_tuples[k]
fig, ax = plt.subplots(figsize=(8,6))
if(len(colors_list)>0):
mosaic(data.sort_values([x_col, y_col]), [x_col, y_col],
statistic = False, axes_label = True, label_rotation = [90, 0],
labelizer=labelizer, properties=dict_of_colors, gap=0.008, ax=ax)
else:
mosaic(data.sort_values([x_col, y_col]), [x_col, y_col],
statistic = False, axes_label = True, label_rotation = [90, 0],
labelizer=labelizer, gap=0.008, ax=ax)
if(title==''):
plt.title(str(y_col) + ' percentages across ' + str(x_col))
else:
plt.title(title)
plt.show();
plot_mosaics(df, 'category', 'gender')

Decision rule: If p < 0.05, reject \(H_0\). Report Cramér’s V to quantify the association strength.
As p is greater than 5%, we fail to reject the null hypothesis, indicating no significant association between gender and product category.
References¶
- Udacity A/B test (ab_data.csv) GitHub mirror: https://github.com/beery4010/Analyze-AB-Test-Results
- ToothGrowth dataset (Rdatasets CSV): https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/datasets/ToothGrowth.csv
- Supermarket Sales dataset (selva86/datasets CSV): https://github.com/selva86/datasets/blob/master/supermarket_sales.csv
- Retail data : https://github.com/gokcengiz/Shopping-data-analysis