
Hypothesis Testing Examples

This notebook demonstrates core hypothesis tests using publicly available datasets commonly used in data science:

Summary

| Independent variable | Dependent variable | Type of plot | Hypothesis test |
|---|---|---|---|
| Continuous | Continuous | Scatter plot | Correlation test |
| Continuous | Categorical | Bar charts | t-test/z-test (two categories); F-test (more than two) |
| Categorical | Continuous | Joint histograms | t-test/z-test (two categories); F-test (more than two) |
| Categorical | Categorical | Mosaic charts | Chi-square independence test |

Examples

We use the following four use cases to demonstrate these tests:

  • Two‑proportion Z‑test: Marketing A/B conversion (Udacity e‑commerce A/B test)
    Data source: ab_data.csv (GitHub mirror)
  • Welch's t‑test: Supplement efficacy (ToothGrowth: Orange Juice vs Vitamin C)
    Data source: Rdatasets (CSV)
  • One‑way ANOVA + Tukey HSD: Branch‑wise gross income (Supermarket Sales)
    Data source: selva86/datasets (CSV)
  • Chi‑square independence + Cramér's V: Gender vs Product category
    Data source: Retail data (CSV)
import numpy as np, pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.float_format', lambda x: f"{x:,.6f}")

Two‑Proportion Z‑Test: Marketing A/B Conversion

In digital marketing, companies frequently run A/B tests to compare two versions of a webpage or campaign. The objective is to determine which version leads to higher conversions (e.g., purchases, sign-ups). In this dataset, the control group saw the old landing page, while the treatment group saw the new page. The hypothesis test checks if the new page significantly improves conversion rates.

Result Interpretation: The two-proportion Z-test yields a p-value, which we interpret as follows:

If this p-value is greater than 0.05, we fail to reject the null hypothesis, meaning there is no statistically significant difference between the old and new pages. This suggests that the redesign did not lead to a measurable improvement in conversions.
If the p-value were below 0.05, we would conclude that the new page performs differently (better or worse) than the old one.

This insight helps marketing teams decide whether to adopt the new design or stick with the old one, ensuring data-driven decisions rather than relying on intuition.

Dataset: Udacity e‑commerce A/B test (ab_data.csv).
Link: https://github.com/beery4010/Analyze-AB-Test-Results/blob/master/ab_data.csv

Variables: group ∈ {control, treatment}, converted ∈ {0,1}.

Goal: Test whether conversion rates differ between old (control) and new (treatment) pages.

Hypotheses (two‑sided):
- $ H_0: p_{\text{treatment}} = p_{\text{control}} $
- $ H_1: p_{\text{treatment}} \neq p_{\text{control}} $

Assumptions: Independent Bernoulli trials; large sample sizes so normal approximation holds.
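To make the statistic concrete, here is a minimal sketch of the pooled two-proportion z statistic, which is what statsmodels' proportions_ztest computes by default; the counts below are illustrative placeholders, not the ab_data.csv values.

# Hedged sketch: pooled two-proportion z-statistic by hand (illustrative counts)
import numpy as np
from scipy import stats

conv_t, n_t = 1_200, 10_000   # hypothetical treatment conversions / sample size
conv_c, n_c = 1_150, 10_000   # hypothetical control conversions / sample size

p_t, p_c = conv_t / n_t, conv_c / n_c
p_pool = (conv_t + conv_c) / (n_t + n_c)               # pooled proportion under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
z = (p_t - p_c) / se
p_two_sided = 2 * stats.norm.sf(abs(z))                # two-sided p from |z|
print(f"z = {z:.4f}, two-sided p = {p_two_sided:.4f}")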

from statsmodels.stats.proportion import proportion_confint

# Load dataset (raw CSV via GitHub)
url_ab = "https://raw.githubusercontent.com/beery4010/Analyze-AB-Test-Results/master/ab_data.csv"
ab = pd.read_csv(url_ab)
ab
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
2 661590 2017-01-11 16:55:06.154213 treatment new_page 0
3 853541 2017-01-08 18:28:03.143765 treatment new_page 0
4 864975 2017-01-21 01:52:26.210827 control old_page 1
... ... ... ... ... ...
294473 751197 2017-01-03 22:28:38.630509 control old_page 0
294474 945152 2017-01-12 00:51:57.078372 control old_page 0
294475 734608 2017-01-22 11:45:03.439544 control old_page 0
294476 697314 2017-01-15 01:20:28.957438 control old_page 0
294477 715931 2017-01-16 12:40:24.467417 treatment new_page 0

294478 rows × 5 columns

# Align groups and landing pages as done in the Udacity project
ab = ab[((ab['group'] == 'control') & (ab['landing_page'] == 'old_page')) |
        ((ab['group'] == 'treatment') & (ab['landing_page'] == 'new_page'))]

# Compute conversions and sample sizes
conv_control = ab.loc[ab['group']=='control','converted'].sum()
n_control    = ab.loc[ab['group']=='control','converted'].count()
conv_treat   = ab.loc[ab['group']=='treatment','converted'].sum()
n_treat      = ab.loc[ab['group']=='treatment','converted'].count()
print('In the control group,', n_control, 'users saw the old page, of whom', conv_control, 'converted, giving p =', round(conv_control/n_control, 4))
print('In the treatment group,', n_treat, 'users saw the new page, of whom', conv_treat, 'converted, giving p =', round(conv_treat/n_treat, 4))
In the control group, 145274 users saw the old page, of whom 17489 converted, giving p = 0.1204
In the treatment group, 145311 users saw the new page, of whom 17264 converted, giving p = 0.1188
# Two‑sided two‑proportion z‑test
z_stat, p_val = proportions_ztest([conv_treat, conv_control], [n_treat, n_control], alternative='two-sided')
print(f"Two‑proportion z‑test: z = {z_stat:.4f}, p = {p_val:.6f}")
Two‑proportion z‑test: z = -1.3116, p = 0.189653
# Wald 95% CI for each proportion
ci_treat = proportion_confint(conv_treat, n_treat, method='normal')
ci_ctrl  = proportion_confint(conv_control, n_control, method='normal')

# Difference in proportions & CI (Wald)
p_hat_t = conv_treat / n_treat
p_hat_c = conv_control / n_control
diff = p_hat_t - p_hat_c
se = np.sqrt(p_hat_t*(1-p_hat_t)/n_treat + p_hat_c*(1-p_hat_c)/n_control)
ci_diff = (diff - 1.96*se, diff + 1.96*se)

print(f"Counts (treat/control): {conv_treat}/{n_treat} vs {conv_control}/{n_control}")
print(f"p_treat = {p_hat_t:.5f} 95% CI {ci_treat}")
print(f"p_ctrl  = {p_hat_c:.5f} 95% CI {ci_ctrl}")
print(f"Diff (treat - control) = {diff:.5f} 95% CI {ci_diff}")
Counts (treat/control): 17264/145311 vs 17489/145274
p_treat = 0.11881 95% CI (0.11714362162601945, 0.12047087417952866)
p_ctrl  = 0.12039 95% CI (0.11871294722381814, 0.12205966177710426)
Diff (treat - control) = -0.00158 95% CI (-0.003938713688889012, 0.0007806004935147211)

Decision rule: If p < 0.05, reject \(H_0\) and conclude conversion differs between pages. Otherwise, fail to reject \(H_0\).
As the two-proportion z-test gives p = 0.19 > 0.05, we fail to reject the null hypothesis: the new and old pages convert users at approximately the same rate. We recommend that the e-commerce company keep the old page, saving the time and money of rolling out a new one.
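Since we failed to reject, a natural follow-up is a sensitivity check: how much power did the experiment have to detect a modest lift? A minimal sketch using statsmodels' power utilities, where the 1-percentage-point lift and the rounded per-arm sample size are assumptions for illustration:

# Hedged sketch: post-hoc power for a hypothetical absolute lift
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_base = 0.1204                    # observed control conversion rate
lift = 0.01                        # hypothetical absolute lift to detect
es = proportion_effectsize(p_base + lift, p_base)   # Cohen's h effect size
power = NormalIndPower().solve_power(effect_size=es, nobs1=145_000,
                                     alpha=0.05, alternative='two-sided')
print(f"Power to detect a 1-point absolute lift: {power:.3f}")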

def plot_z_hypothesis(
    data_list,
    pop_mean=0.0,
    pop_sd=1.0,
    alternative='two.sided',   # 'two.sided' | 'greater' | 'less'
    type_test = 'mean', #'mean' | 'prob'
    alpha=0.05,
    label='Sampling distribution',
    title='z-test (sampling distribution of x̄)',
    x_label = "Sampling distribution of x̄",
    figsize=(8, 5)
):
    """
    Visualize z-test decision regions for the sampling distribution of the sample mean.

    Parameters
    ----------
    data_list : array-like
        Sample values (used only to compute x̄ and n).
    pop_mean : float
        Hypothesized population mean (μ under H0).
    pop_sd : float
        Known population standard deviation (σ).
    alternative : str
        'two.sided', 'greater', or 'less'.
    type_test : str
        'mean' annotates the sample mean x̄; 'prob' annotates a z statistic.
    alpha : float
        Significance level for the critical regions.
    label : str
        Label for the distribution curve.
    title : str
        Plot title.
    x_label : str
        x-axis label.
    figsize : tuple
        Figure size.

    Returns
    -------
    fig, ax : matplotlib Figure and Axes
    """
    x = np.asarray(data_list)
    n = len(x)
    xbar = np.mean(x)

    # Standard error of the mean
    se = pop_sd / np.sqrt(n)

    # Grid for plotting ±4 SE around the hypothesized mean μ
    grid = np.linspace(pop_mean - 4 * se, pop_mean + 4 * se, 4001)
    pdf = stats.norm.pdf(grid, loc=pop_mean, scale=se)

    # Compute critical cutoffs under H0 for the chosen alternative
    if alternative == 'two.sided':
        # symmetric cutoffs: (alpha/2) and (1 - alpha/2)
        lower_cut = stats.norm.ppf(alpha / 2, loc=pop_mean, scale=se)
        upper_cut = stats.norm.ppf(1 - alpha / 2, loc=pop_mean, scale=se)

        # Retain between cutoffs; reject outside
        retain_mask = (grid >= lower_cut) & (grid <= upper_cut)
        reject_mask = ~retain_mask

    elif alternative == 'greater':
        # reject on right tail
        cutoff = stats.norm.ppf(1 - alpha, loc=pop_mean, scale=se)
        retain_mask = (grid <= cutoff)
        reject_mask = (grid > cutoff)
        lower_cut, upper_cut = None, cutoff

    elif alternative == 'less':
        # reject on left tail
        cutoff = stats.norm.ppf(alpha, loc=pop_mean, scale=se)
        retain_mask = (grid >= cutoff)
        reject_mask = (grid < cutoff)
        lower_cut, upper_cut = cutoff, None
    else:
        raise ValueError("alternative must be one of {'two.sided','greater','less'}")

    # Build a DataFrame like the R pipeline for convenience (optional)
    df = pd.DataFrame({'x': grid, 'pdf': pdf, 'retain': retain_mask})

    # Plot
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot(df['x'], df['pdf'], color='black', lw=1.2, label=label)

    # Shade retain region
    ax.fill_between(df['x'], 0, df['pdf'], where=df['retain'], color='#69b3a2', alpha=0.4, label='Retain H₀')

    # Shade reject region(s)
    ax.fill_between(df['x'], 0, df['pdf'], where=~df['retain'], color='#e76f51', alpha=0.4, label='Reject H₀')

    # Critical lines
    if alternative == 'two.sided':
        ax.axvline(lower_cut, color='#e76f51', ls='--', lw=1)
        ax.axvline(upper_cut, color='#e76f51', ls='--', lw=1)
        ax.text(lower_cut, ax.get_ylim()[1]*0.3, f"Lower crit\n{lower_cut:.2f}", ha='right', va='top', fontsize=9)
        ax.text(upper_cut, ax.get_ylim()[1]*0.3, f"Upper crit\n{upper_cut:.2f}", ha='left', va='top', fontsize=9)
    elif alternative == 'greater':
        ax.axvline(upper_cut, color='#e76f51', ls='--', lw=1)
        ax.text(upper_cut, ax.get_ylim()[1]*0.9, f"Crit\n{upper_cut:.2f}", ha='left', va='top', fontsize=9)
    elif alternative == 'less':
        ax.axvline(lower_cut, color='#e76f51', ls='--', lw=1)
        ax.text(lower_cut, ax.get_ylim()[1]*0.9, f"Crit\n{lower_cut:.2f}", ha='right', va='top', fontsize=9)

    # x̄ line and annotation
    if(type_test == 'prob'):
        ax.axvline(xbar, color='#264653', lw=2, ls='-', label='z')
        ax.annotate(f"z = {xbar:.2f}",
                xy=(xbar, stats.norm.pdf(xbar, loc=pop_mean, scale=se)),
                xytext=(xbar, ax.get_ylim()[1]*0.6),
                arrowprops=dict(arrowstyle='->', color='#264653'),
                ha='center', color='#264653')
    else:
        ax.axvline(xbar, color='#264653', lw=2, ls='-', label='Sample mean (x̄)')
        ax.annotate(f"x̄ = {xbar:.2f}",
                xy=(xbar, stats.norm.pdf(xbar, loc=pop_mean, scale=se)),
                xytext=(xbar, ax.get_ylim()[1]*0.6),
                arrowprops=dict(arrowstyle='->', color='#264653'),
                ha='center', color='#264653')

    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel("Density")
    ax.legend(loc='upper right', frameon=False)
    ax.grid(alpha=0.15)
    plt.show();
plot_z_hypothesis([z_stat], type_test = 'prob', title='Two proportion z-test', label = 'Standard Normal', x_label = '')

[Figure: two-proportion z-test plot of the standard normal, with rejection regions and the observed z]

Welch’s t‑Test: Effect of Vitamin C on Tooth Growth

This classic dataset explores the impact of Vitamin C on tooth growth in guinea pigs, a foundational experiment in nutritional science. The response variable is the length of odontoblasts, which are cells responsible for tooth development. Sixty guinea pigs were randomly assigned to receive Vitamin C in one of two delivery methods:
1. Orange Juice (OJ)
2. Ascorbic Acid (VC) (a synthetic form of Vitamin C)

Each animal was also given one of three dose levels: 0.5 mg/day, 1 mg/day, or 2 mg/day. The experiment aims to determine whether the delivery method influences tooth growth, controlling for dosage.

Why it matters:
Understanding the effectiveness of different Vitamin C sources can guide dietary recommendations and supplement formulations. In modern analytics, this type of test parallels comparing two treatments or interventions in healthcare or A/B testing in product design.

Hypothesis:
\(H_0\): The mean tooth length is the same for both delivery methods (OJ and VC).
\(H_1\): The mean tooth length differs between the two methods.
This analysis focuses on the 0.5 mg/day dosage level; similar analyses can be performed for other dosages.

Result Interpretation: After running Welch’s t-test, if the p-value is less than 0.05, we reject the null hypothesis, concluding that the delivery method significantly affects tooth growth. If the p-value is greater than 0.05, we fail to reject \(H_0\), suggesting no measurable difference between OJ and VC. Additionally, reporting Cohen’s d helps quantify the magnitude of the difference, which is crucial for practical significance.

Dataset: R ToothGrowth (OJ vs VC).
CSV: https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/ToothGrowth.csv

Variables: len (tooth length), supp ∈ {OJ, VC}.

Hypotheses (two‑sided):
- \(H_0: \mu_{\text{OJ}} = \mu_{\text{VC}}\)
- \(H_1: \mu_{\text{OJ}} \neq \mu_{\text{VC}}\)

Assumptions: Independent samples; normality (approx); Welch’s t‑test does not assume equal variance.
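To make explicit what scipy computes, here is a minimal sketch of the Welch statistic and the Welch–Satterthwaite degrees of freedom, cross-checked against scipy; the two synthetic samples are placeholders, not the ToothGrowth values.

# Hedged sketch: Welch's t and Welch–Satterthwaite df by hand (synthetic data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(13.0, 4.0, size=10)   # hypothetical OJ-like sample
y = rng.normal(8.0, 2.5, size=10)    # hypothetical VC-like sample

vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
p = 2 * stats.t.sf(abs(t), df)

t_sp, p_sp = stats.ttest_ind(x, y, equal_var=False)   # scipy's Welch test
print(f"manual: t = {t:.4f}, df = {df:.2f}, p = {p:.6f}")
print(f"scipy:  t = {t_sp:.4f}, p = {p_sp:.6f}")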

# Load ToothGrowth
url_tg = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/ToothGrowth.csv"
tg = pd.read_csv(url_tg)
# Drop the Rdatasets index column
for c in tg.columns:
    if 'Unnamed' in c:
        tg = tg.drop(columns=[c])
tg.head()
rownames len supp dose
0 1 4.200000 VC 0.500000
1 2 11.500000 VC 0.500000
2 3 7.300000 VC 0.500000
3 4 5.800000 VC 0.500000
4 5 6.400000 VC 0.500000
# Groups
oj = tg.loc[(tg['supp']=='OJ') & (tg.dose == 0.5),'len']
vc = tg.loc[(tg['supp']=='VC') & (tg.dose == 0.5),'len']

# Welch's t-test
t_stat, p_val = stats.ttest_ind(oj, vc, equal_var=False)

# Cohen's d (using pooled SD with group sizes)
def cohens_d(x, y):
    nx, ny = len(x), len(y)
    sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)
    sp2 = ((nx-1)*sx2 + (ny-1)*sy2) / (nx+ny-2)
    d = (np.mean(x) - np.mean(y)) / np.sqrt(sp2)
    return d

d = cohens_d(oj, vc)

print(f"Welch t‑test: t = {t_stat:.4f}, p = {p_val:.6f}")
print(f"Mean's: OJ = {np.mean(oj):.3f}, VC = {np.mean(vc):.3f}")
print(f"Cohen's d = {d:.3f}")
Welch t‑test: t = 3.1697, p = 0.006359
Means: OJ = 13.230, VC = 7.980
Cohen's d = 1.418

Decision rule: If p < 0.05, reject \(H_0\) → evidence that delivery method affects mean tooth length. As a rule of thumb, a Cohen's d of about 0.2 indicates a small difference between the groups, about 0.5 a medium one, and 0.8 or greater a large effect.

Result: OJ is more effective at this dose: Welch's t-test shows a statistically significant difference between the two delivery methods at 0.5 mg/day, with orange juice leading to greater tooth growth. The p-value (about 0.006) is below \(0.05\), so the difference is unlikely to be due to random chance, and Cohen's d of about 1.42 indicates a large effect.
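A confidence interval for the mean difference complements the test and the effect size. A minimal sketch, reusing the oj and vc samples above and the same Welch standard error and Satterthwaite df as the test:

# Hedged sketch: 95% Welch CI for the mean difference (OJ - VC at 0.5 mg)
se_diff = np.sqrt(oj.var(ddof=1)/len(oj) + vc.var(ddof=1)/len(vc))
df_w = se_diff**4 / ((oj.var(ddof=1)/len(oj))**2/(len(oj)-1)
                     + (vc.var(ddof=1)/len(vc))**2/(len(vc)-1))
t_crit = stats.t.ppf(0.975, df_w)          # two-sided 95% critical value
diff = oj.mean() - vc.mean()
print(f"Diff = {diff:.3f}, 95% CI = ({diff - t_crit*se_diff:.3f}, {diff + t_crit*se_diff:.3f})")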

def plot_bivariate_histograms(dataset, con_col, cat_col, title='', x_label = ''):
    # Overlay one histogram of the continuous column per level of the categorical column
    dataset.groupby([cat_col])[con_col].plot.hist(alpha = 0.5)
    plt.legend(dataset.groupby([cat_col])[con_col].count().axes[0].tolist())
    plt.title(title)
    plt.xlabel(x_label)
    plt.show();
plot_bivariate_histograms(tg[tg.dose==0.5], 'len', 'supp', title='Joint histogram', x_label = 'Length')

[Figure: joint histogram of tooth length by supplement at the 0.5 mg/day dose]

One‑Way ANOVA: Branch‑wise Gross Income (with Tukey HSD)

This dataset captures detailed transaction records from a supermarket chain operating in Myanmar, covering three major cities: Yangon, Naypyitaw, and Mandalay. The data spans a three-month period from January to March 2019, providing a rich view of retail operations. Each record includes information on branch location, product line, customer demographics, payment methods, and financial metrics such as gross income and total sales.
For our hypothesis test, we focus on whether branch location influences gross income, which is a critical metric for profitability. Retail managers often need to know if certain branches consistently outperform others, as this insight can guide resource allocation, marketing strategies, and inventory planning.
Hypothesis:
\(H_0\): The mean gross income is the same across all three branches (A, B, C).
\(H_1\): At least one branch has a different mean gross income.

Why it matters: If the test reveals significant differences, management can investigate underlying factors such as customer purchasing power, branch size, or local marketing effectiveness. This analysis mirrors real-world business intelligence tasks where data-driven decisions optimize operations and profitability.

Result Interpretation: After running ANOVA, if the p-value is less than 0.05, we reject the null hypothesis, concluding that branch location impacts gross income. Post-hoc analysis using Tukey HSD identifies which branches differ significantly, enabling targeted strategies for improvement.

Dataset: Supermarket Sales (three branches: A, B, C).
CSV: https://raw.githubusercontent.com/selva86/datasets/master/supermarket_sales.csv

Outcome: gross income (continuous). Factor: Branch (A/B/C).

Hypotheses:
- \(H_0: \mu_A = \mu_B = \mu_C\)
- \(H_1\): At least one mean differs

Plan: Fit a one-way ANOVA (ANalysis Of VAriance), check homogeneity of variances with Levene's test, then run Tukey HSD for post-hoc pairwise comparisons.
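Before fitting the model, it may help to see what the F statistic is made of. A minimal sketch with small hypothetical groups, assembling F from between- and within-group sums of squares and cross-checking against scipy.stats.f_oneway:

# Hedged sketch: one-way ANOVA F statistic by hand (hypothetical groups)
import numpy as np
from scipy import stats

groups = [np.array([14.0, 15.2, 13.8, 16.1]),
          np.array([15.5, 16.8, 14.9, 17.0]),
          np.array([13.1, 12.7, 14.0, 13.5])]   # hypothetical branch samples

all_vals = np.concatenate(groups)
grand = all_vals.mean()
k, N = len(groups), len(all_vals)

ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ss_between / (k - 1)) / (ss_within / (N - k))   # F = MSB / MSW
p = stats.f.sf(F, k - 1, N - k)

F_sp, p_sp = stats.f_oneway(*groups)                 # scipy's version
print(f"manual: F = {F:.4f}, p = {p:.4f}")
print(f"scipy:  F = {F_sp:.4f}, p = {p_sp:.4f}")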

# Load Supermarket Sales
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/supermarket_sales.csv")

# Tidy column names
df.columns = [c.strip().replace(' ', '_').lower() for c in df.columns]

df
invoice_id branch city customer_type gender product_line unit_price quantity tax_5% total date time payment cogs gross_margin_percentage gross_income rating
0 750-67-8428 A Yangon Member Female Health and beauty 74.690000 7 26.141500 548.971500 1/5/2019 13:08 Ewallet 522.830000 4.761905 26.141500 9.100000
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.280000 5 3.820000 80.220000 3/8/2019 10:29 Cash 76.400000 4.761905 3.820000 9.600000
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.330000 7 16.215500 340.525500 3/3/2019 13:23 Credit card 324.310000 4.761905 16.215500 7.400000
3 123-19-1176 A Yangon Member Male Health and beauty 58.220000 8 23.288000 489.048000 1/27/2019 20:33 Ewallet 465.760000 4.761905 23.288000 8.400000
4 373-73-7910 A Yangon Normal Male Sports and travel 86.310000 7 30.208500 634.378500 2/8/2019 10:37 Ewallet 604.170000 4.761905 30.208500 5.300000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 233-67-5758 C Naypyitaw Normal Male Health and beauty 40.350000 1 2.017500 42.367500 1/29/2019 13:46 Ewallet 40.350000 4.761905 2.017500 6.200000
996 303-96-2227 B Mandalay Normal Female Home and lifestyle 97.380000 10 48.690000 1,022.490000 3/2/2019 17:16 Ewallet 973.800000 4.761905 48.690000 4.400000
997 727-02-1313 A Yangon Member Male Food and beverages 31.840000 1 1.592000 33.432000 2/9/2019 13:22 Cash 31.840000 4.761905 1.592000 7.700000
998 347-56-2442 A Yangon Normal Male Home and lifestyle 65.820000 1 3.291000 69.111000 2/22/2019 15:33 Cash 65.820000 4.761905 3.291000 4.100000
999 849-09-3807 A Yangon Member Female Fashion accessories 88.340000 7 30.919000 649.299000 2/18/2019 13:28 Cash 618.380000 4.761905 30.919000 6.600000

1000 rows × 17 columns

# One-way ANOVA
model = ols('gross_income ~ C(branch)', data=df).fit()
anova_tbl = sm.stats.anova_lm(model, typ=2)
print(anova_tbl)
                  sum_sq         df        F   PR(>F)
C(branch)     242.602644   2.000000 0.884583 0.413210
Residual  136,716.894906 997.000000      NaN      NaN
# Levene test for homogeneity of variances
from scipy.stats import levene
branch_a = df[df['branch']=='A']['gross_income']
branch_b = df[df['branch']=='B']['gross_income']
branch_c = df[df['branch']=='C']['gross_income']
print("Levene p-value:", levene(branch_a, branch_b, branch_c).pvalue)
Levene p-value: 0.08946425577002974
# Tukey HSD post-hoc
tukey = pairwise_tukeyhsd(endog=df['gross_income'], groups=df['branch'], alpha=0.05)
print(tukey.summary())
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     A      B    0.358 0.9171 -1.7627 2.4788  False
     A      C   1.1784 0.3954 -0.9489 3.3057  False
     B      C   0.8203 0.6405 -1.3195 2.9602  False
---------------------------------------------------

Decision rule: If ANOVA p < 0.05, reject \(H_0\) and use Tukey HSD to identify differing pairs. If Levene p < 0.05, consider Welch’s ANOVA or robust alternatives.

As p = 0.413 (PR(>F)) is greater than 0.05, we fail to reject the null hypothesis and conclude that the three branches have similar gross income. The variances are homogeneous (Levene's p-value > 0.05), and Tukey's HSD shows no significant difference for any pair (Tukey's test is essentially a set of t-tests that corrects for the family-wise error rate).
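Levene's test did not reject equal variances here, but if it had, the decision rule above points to Welch's ANOVA. A minimal sketch, implemented from the standard Welch (1951) formulas and applied to the branch samples defined earlier (pingouin also packages this as welch_anova):

# Hedged sketch: Welch's ANOVA, a heteroscedasticity-robust one-way test
import numpy as np
from scipy import stats

def welch_anova(*groups):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                   # precision weights n_i / s_i^2
    mean_w = (w * m).sum() / w.sum()            # variance-weighted grand mean
    lam = (((1 - w / w.sum()) ** 2) / (n - 1)).sum()
    F = ((w * (m - mean_w) ** 2).sum() / (k - 1)) / (1 + 2 * (k - 2) / (k**2 - 1) * lam)
    df2 = (k**2 - 1) / (3 * lam)                # Welch's denominator df
    return F, k - 1, df2, stats.f.sf(F, k - 1, df2)

F_w, df1, df2, p_w = welch_anova(branch_a, branch_b, branch_c)
print(f"Welch ANOVA: F = {F_w:.4f}, df = ({df1}, {df2:.1f}), p = {p_w:.4f}")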

bi_variate_boxplot = sns.boxplot(x="branch", y="gross_income", data=df)
bi_variate_boxplot.set(title = 'Box plot of gross income across branches');

[Figure: box plot of gross income across branches]

Chi‑Square Test of Independence: Gender × Product Category (Retail)

In multi‑category retail, merchandising teams care deeply about who buys what. This real customer shopping dataset covers transactions across multiple shopping malls in Istanbul (2021–2023) and includes gender and product category for each purchase. We’ll test whether product category preferences differ by gender, which can inform assortment planning, aisle placement, personalized recommendations, targeted promotions, and store layout decisions.
Variables:
- gender ∈ {Male, Female}
- category ∈ {e.g., Clothing, Electronics, Accessories, …} (multiple categories present in the file)

Hypotheses (Chi‑Square Test of Independence):
\(H_0\): Gender and product category are independent (no association).
\(H_1\): Gender and product category are associated (category preference depends on gender).

Why it matters:
A significant association suggests different category affinities by gender. Retailers can adjust promotions, content, inventory mix, and store displays to better match demand, which often improves conversion rates and gross margin.

Assumptions & Data Checks:
- Sufficient expected counts (preferably ≥ 5 per cell). If some categories are rare, consider grouping similar categories or analyzing the top N categories to satisfy assumptions (a quick programmatic check is sketched after this list).
- Observations are independent (each row is a separate transaction).
(These are standard conditions for the Chi‑Square test in categorical analysis.)
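As a quick illustration of the expected-count check mentioned above: chi2_contingency returns the matrix of expected frequencies, so the condition can be verified in one line. The counts in this toy table are hypothetical.

# Hedged sketch: verifying the expected-count condition on a toy table
import numpy as np
from scipy.stats import chi2_contingency

toy = np.array([[12, 30], [8, 25], [3, 10]])    # hypothetical contingency counts
_, _, _, expected = chi2_contingency(toy)       # expected frequencies under H0
print("min expected count:", expected.min().round(2))
print("all cells >= 5:", bool((expected >= 5).all()))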

Dataset Source:
Customer Shopping Data (GitHub) — file: customer_shopping_data.csv

Interpretation:
1. If p < 0.05, reject \(H_0\): gender and product category are associated.
2. Cramér’s V indicates strength of association: ~0.1 = weak, ~0.3 = moderate, ~0.5 = strong.

We compute the Chi‑square test of independence and Cramér’s V as an effect size.

import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np

# Load dataset
url = "https://raw.githubusercontent.com/gokcengiz/Shopping-data-analysis/main/customer_shopping_data.csv"
df = pd.read_csv(url)
df
invoice_no customer_id gender age category quantity price payment_method invoice_date shopping_mall
0 I138884 C241288 Female 28 Clothing 5 1,500.400000 Credit Card 5/8/2022 Kanyon
1 I317333 C111565 Male 21 Shoes 3 1,800.510000 Debit Card 12/12/2021 Forum Istanbul
2 I127801 C266599 Male 20 Clothing 1 300.080000 Cash 9/11/2021 Metrocity
3 I173702 C988172 Female 66 Shoes 5 3,000.850000 Credit Card 16/05/2021 Metropol AVM
4 I337046 C189076 Female 53 Books 4 60.600000 Cash 24/10/2021 Kanyon
... ... ... ... ... ... ... ... ... ... ...
99452 I219422 C441542 Female 45 Souvenir 5 58.650000 Credit Card 21/09/2022 Kanyon
99453 I325143 C569580 Male 27 Food & Beverage 2 10.460000 Cash 22/09/2021 Forum Istanbul
99454 I824010 C103292 Male 63 Food & Beverage 2 10.460000 Debit Card 28/03/2021 Metrocity
99455 I702964 C800631 Male 56 Technology 4 4,200.000000 Cash 16/03/2021 Istinye Park
99456 I232867 C273973 Female 36 Souvenir 3 35.190000 Credit Card 15/10/2022 Mall of Istanbul

99457 rows × 10 columns

# Build contingency table: Product category vs Gender
ct = pd.crosstab(df['category'], df['gender'])
chi2, p, dof, expected = chi2_contingency(ct)
print("Contingency Table:\n", ct)
print(f"Chi-square = {chi2:.4f}, p-value = {p:.6f}, dof = {dof}")
Contingency Table:
 gender           Female   Male
category                      
Books              2906   2075
Clothing          20652  13835
Cosmetics          9070   6027
Food & Beverage    8804   5972
Shoes              5967   4067
Souvenir           3017   1982
Technology         2981   2015
Toys               6085   4002
Chi-square = 7.5679, p-value = 0.372234, dof = 7
# Compute Cramér's V
n = ct.values.sum()
phi2 = chi2 / n
r, c = ct.shape
cramers_v = np.sqrt(phi2 / min(r-1, c-1))

print(f"Cramér's V = {cramers_v:.3f}")
Cramér's V = 0.009

Cramér's V = 0.009 indicates a negligible association, consistent with the non-significant chi-square result.
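For reporting, a bias-corrected version of Cramér's V (Bergsma's correction) is sometimes preferred, since the plain estimate is biased upward, especially in small samples. A minimal sketch reusing chi2, n, r, and c from the cells above:

# Hedged sketch: bias-corrected Cramér's V (Bergsma's correction)
phi2corr = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))   # shrink phi^2
r_corr = r - (r - 1) ** 2 / (n - 1)                       # corrected row count
c_corr = c - (c - 1) ** 2 / (n - 1)                       # corrected column count
v_corr = np.sqrt(phi2corr / min(r_corr - 1, c_corr - 1))
print(f"Bias-corrected Cramér's V = {v_corr:.3f}")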

from statsmodels.graphics.mosaicplot import mosaic
def plot_mosaics(data, x_col, y_col, title='', colors_list =[]):
    dict_of_tuples = {}

    # create the clean set of percentages to print
    for x_col_ in data[x_col].unique():
        for y_col_ in data[y_col].unique():
            n = len(data[(data[x_col]==x_col_)&(data[y_col]==y_col_)][x_col])
            d = len(data[(data[x_col]==x_col_)][x_col])
            len_ = len(data[x_col])
            if((d==0) or (n/d<=0.04)):
                # if the percentage within a class is less than 4%, do not print the percentage
                dict_of_tuples[(str(x_col_), str(y_col_))] = ''
            elif(n/len_<=0.02):
                # If it's a tiny class with less than 2% of the total data, do not print
                dict_of_tuples[(str(x_col_), str(y_col_))] = ''
            else:
                dict_of_tuples[(str(x_col_), str(y_col_))] = str(int(n/d*100))+"%"

    dict_of_colors = dict_of_tuples.copy()
    if(len(colors_list)>0):
        # create a clean set of colors
        for i, x_col_ in enumerate(data[x_col].unique()):
            for y_col_ in data[y_col].unique():
                dict_of_colors[(str(x_col_), str(y_col_))] = {'color':colors_list[i], 'alpha':0.8}

    # Plot the mosaic plot
    labelizer = lambda k: dict_of_tuples[k]    
    fig, ax = plt.subplots(figsize=(8,6))
    if(len(colors_list)>0):
        mosaic(data.sort_values([x_col, y_col]), [x_col, y_col], 
               statistic = False, axes_label = True, label_rotation = [90, 0],
               labelizer=labelizer, properties=dict_of_colors, gap=0.008, ax=ax)
    else:
        mosaic(data.sort_values([x_col, y_col]), [x_col, y_col], 
               statistic = False, axes_label = True, label_rotation = [90, 0],
               labelizer=labelizer, gap=0.008, ax=ax)
    if(title==''):
        plt.title(str(y_col) + ' percentages across ' + str(x_col))
    else:
        plt.title(title)
    plt.show();
plot_mosaics(df, 'category', 'gender')

[Figure: mosaic plot of gender percentages across product categories]

Decision rule: If p < 0.05, reject \(H_0\). Report Cramér’s V to quantify the association strength.
As p = 0.372 > 0.05, we fail to reject the null hypothesis, indicating no significant association between gender and product category.
