Hypothesis Testing Examples¶
This notebook demonstrates core hypothesis tests using publicly available datasets commonly used in data science:
Summary
| Independent variable | Dependent variable | Type of plots | Type of Hypothesis test |
|---|---|---|---|
| Continuous | Continuous | Scatter plots | Correlation test |
| Continuous | Categorical | Bar charts | (two categories) t-test/z-test, (more than 2 categories) F test |
| Categorical | Continuous | Joint Histograms | (two categories) t-test/z-test, (more than 2 categories) F test |
| Categorical | Categorical | Mosaic charts | Chi-Square independence test |
Examples
We are using the four use cases below to demonstrate these:
- Two‑proportion Z‑test: Marketing A/B conversion (Udacity e‑commerce A/B test)
  Data source: ab_data.csv (GitHub mirror)
- Welch's t‑test: Supplement efficacy (ToothGrowth: Orange Juice vs Vitamin C)
  Data source: Rdatasets (CSV)
- One‑way ANOVA + Tukey HSD: Branch‑wise gross income (Supermarket Sales)
  Data source: selva86/datasets (CSV)
- Chi‑square independence + Cramér's V: Gender vs Product category
  Data source: Retail data (CSV)
import numpy as np, pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.float_format', lambda x: f"{x:,.6f}")
Two‑Proportion Z‑Test : Marketing A/B Conversion¶
In digital marketing, companies frequently run A/B tests to compare two versions of a webpage or campaign. The objective is to determine which version leads to higher conversions (e.g., purchases, sign-ups). In this dataset, the control group saw the old landing page, while the treatment group saw the new page. The hypothesis test checks if the new page significantly improves conversion rates.
Result Interpretation: After running the two-proportion Z-test, we obtained a p-value.
If this p-value is greater than 0.05, we fail to reject the null hypothesis, meaning there is no statistically significant difference between the old and new pages. This suggests that the redesign did not lead to a measurable improvement in conversions.
If the p-value were below 0.05, we would conclude that the new page performs differently (better or worse) than the old one.
This insight helps marketing teams decide whether to adopt the new design or stick with the old one, ensuring data-driven decisions rather than relying on intuition.
Dataset: Udacity e‑commerce A/B test (ab_data.csv).
Link: https://github.com/beery4010/Analyze-AB-Test-Results/blob/master/ab_data.csv
Variables: group ∈ {control, treatment}, converted ∈ {0,1}.
Goal: Test whether conversion rates differ between old (control) and new (treatment) pages.
Hypotheses (two‑sided):
- $ H_0: p_{\text{treatment}} = p_{\text{control}} $
- $ H_1: p_{\text{treatment}} \neq p_{\text{control}} $
Assumptions: Independent Bernoulli trials; large sample sizes so normal approximation holds.
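For reference, the statistic computed below is the standard pooled two‑proportion z statistic (to our understanding, this pooled form is also what proportions_ztest uses for the two‑sample case):

$$ z = \frac{\hat{p}_{\text{treat}} - \hat{p}_{\text{ctrl}}}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_{\text{treat}}} + \frac{1}{n_{\text{ctrl}}}\right)}}, \qquad \hat{p} = \frac{x_{\text{treat}} + x_{\text{ctrl}}}{n_{\text{treat}} + n_{\text{ctrl}}} $$

where \(x\) counts conversions and \(n\) is the group size.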
from statsmodels.stats.proportion import proportion_confint
# Load dataset (raw CSV via GitHub)
url_ab = "https://raw.githubusercontent.com/beery4010/Analyze-AB-Test-Results/master/ab_data.csv"
ab = pd.read_csv(url_ab)
ab
| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |
| ... | ... | ... | ... | ... | ... |
| 294473 | 751197 | 2017-01-03 22:28:38.630509 | control | old_page | 0 |
| 294474 | 945152 | 2017-01-12 00:51:57.078372 | control | old_page | 0 |
| 294475 | 734608 | 2017-01-22 11:45:03.439544 | control | old_page | 0 |
| 294476 | 697314 | 2017-01-15 01:20:28.957438 | control | old_page | 0 |
| 294477 | 715931 | 2017-01-16 12:40:24.467417 | treatment | new_page | 0 |
294478 rows × 5 columns
# Align groups and landing pages as done in the Udacity project
ab = ab[((ab['group'] == 'control') & (ab['landing_page'] == 'old_page')) |
((ab['group'] == 'treatment') & (ab['landing_page'] == 'new_page'))]
# Compute conversions and sample sizes
conv_control = ab.loc[ab['group']=='control','converted'].sum()
n_control = ab.loc[ab['group']=='control','converted'].count()
conv_treat = ab.loc[ab['group']=='treatment','converted'].sum()
n_treat = ab.loc[ab['group']=='treatment','converted'].count()
print(f"In the control group, {n_control} users saw the old page and {conv_control} converted, giving p = {conv_control/n_control:.4f}")
print(f"In the treatment group, {n_treat} users saw the new page and {conv_treat} converted, giving p = {conv_treat/n_treat:.4f}")
In the control group, 145274 users saw the old page and 17489 converted, giving p = 0.1204
In the treatment group, 145311 users saw the new page and 17264 converted, giving p = 0.1188
# Two‑sided two‑proportion z‑test
z_stat, p_val = proportions_ztest([conv_treat, conv_control], [n_treat, n_control], alternative='two-sided')
print(f"Two‑proportion z‑test: z = {z_stat:.4f}, p = {p_val:.6f}")
Two‑proportion z‑test: z = -1.3116, p = 0.189653
# Wald 95% CI for each proportion
ci_treat = proportion_confint(conv_treat, n_treat, method='normal')
ci_ctrl = proportion_confint(conv_control, n_control, method='normal')
# Difference in proportions & CI (Wald)
p_hat_t = conv_treat / n_treat
p_hat_c = conv_control / n_control
diff = p_hat_t - p_hat_c
se = np.sqrt(p_hat_t*(1-p_hat_t)/n_treat + p_hat_c*(1-p_hat_c)/n_control)
ci_diff = (diff - 1.96*se, diff + 1.96*se)
print(f"Counts (treat/control): {conv_treat}/{n_treat} vs {conv_control}/{n_control}")
print(f"p_treat = {p_hat_t:.5f} 95% CI {ci_treat}")
print(f"p_ctrl = {p_hat_c:.5f} 95% CI {ci_ctrl}")
print(f"Diff (treat - control) = {diff:.5f} 95% CI {ci_diff}")
Counts (treat/control): 17264/145311 vs 17489/145274
p_treat = 0.11881 95% CI (0.11714362162601945, 0.12047087417952866)
p_ctrl = 0.12039 95% CI (0.11871294722381814, 0.12205966177710426)
Diff (treat - control) = -0.00158 95% CI (-0.003938713688889012, 0.0007806004935147211)
Decision rule: If p < 0.05, reject \(H_0\) and conclude conversion differs between pages. Otherwise, fail to reject \(H_0\).
As the two-proportion z-test gives p ≈ 0.19 > 0.05, we fail to reject the null hypothesis. This means the new and old pages convert users at approximately the same rate. We recommend that the e-commerce company keep the old page, which saves the time and money of rolling out a new one.
def plot_z_hypothesis(
data_list,
pop_mean=0.0,
pop_sd=1.0,
alternative='two.sided', # 'two.sided' | 'greater' | 'less'
type_test = 'mean', #'mean' | 'prob'
alpha=0.05,
label='Sampling distribution',
title='z-test (sampling distribution of x̄)',
x_label = "Sampling distribution of x̄",
figsize=(8, 5)
):
"""
Visualize z-test decision regions for the sampling distribution of the sample mean.
Parameters
----------
data_list : array-like
Sample values (used only to compute x̄ and n).
pop_mean : float
Hypothesized population mean (μ under H0).
pop_sd : float
Known population standard deviation (σ).
    alternative : str
        One of 'two.sided', 'greater', or 'less'.
    type_test : str
        'mean' annotates the sample mean (x̄); 'prob' annotates a z statistic instead.
alpha : float
Significance level for critical regions.
label : str
Label for the distribution curve.
title : str
Plot title.
figsize : tuple
Figure size.
Returns
-------
fig, ax : matplotlib Figure and Axes
"""
x = np.asarray(data_list)
n = len(x)
xbar = np.mean(x)
# Standard error of the mean
se = pop_sd / np.sqrt(n)
    # Plotting range: ±4 standard errors around μ
grid = np.linspace(pop_mean - 4 * se, pop_mean + 4 * se, 4001)
pdf = stats.norm.pdf(grid, loc=pop_mean, scale=se)
# Compute critical cutoffs under H0 for the chosen alternative
if alternative == 'two.sided':
# symmetric cutoffs: (alpha/2) and (1 - alpha/2)
lower_cut = stats.norm.ppf(alpha / 2, loc=pop_mean, scale=se)
upper_cut = stats.norm.ppf(1 - alpha / 2, loc=pop_mean, scale=se)
# Retain between cutoffs; reject outside
retain_mask = (grid >= lower_cut) & (grid <= upper_cut)
reject_mask = ~retain_mask
elif alternative == 'greater':
# reject on right tail
cutoff = stats.norm.ppf(1 - alpha, loc=pop_mean, scale=se)
retain_mask = (grid <= cutoff)
reject_mask = (grid > cutoff)
lower_cut, upper_cut = None, cutoff
elif alternative == 'less':
# reject on left tail
cutoff = stats.norm.ppf(alpha, loc=pop_mean, scale=se)
retain_mask = (grid >= cutoff)
reject_mask = (grid < cutoff)
lower_cut, upper_cut = cutoff, None
else:
raise ValueError("alternative must be one of {'two.sided','greater','less'}")
    # Collect the grid, densities, and retain mask into a DataFrame for plotting
df = pd.DataFrame({'x': grid, 'pdf': pdf, 'retain': retain_mask})
# Plot
fig, ax = plt.subplots(figsize=figsize)
ax.plot(df['x'], df['pdf'], color='black', lw=1.2, label=label)
# Shade retain region
ax.fill_between(df['x'], 0, df['pdf'], where=df['retain'], color='#69b3a2', alpha=0.4, label='Retain H₀')
# Shade reject region(s)
ax.fill_between(df['x'], 0, df['pdf'], where=~df['retain'], color='#e76f51', alpha=0.4, label='Reject H₀')
# Critical lines
if alternative == 'two.sided':
ax.axvline(lower_cut, color='#e76f51', ls='--', lw=1)
ax.axvline(upper_cut, color='#e76f51', ls='--', lw=1)
ax.text(lower_cut, ax.get_ylim()[1]*0.3, f"Lower crit\n{lower_cut:.2f}", ha='right', va='top', fontsize=9)
ax.text(upper_cut, ax.get_ylim()[1]*0.3, f"Upper crit\n{upper_cut:.2f}", ha='left', va='top', fontsize=9)
elif alternative == 'greater':
ax.axvline(upper_cut, color='#e76f51', ls='--', lw=1)
ax.text(upper_cut, ax.get_ylim()[1]*0.9, f"Crit\n{upper_cut:.2f}", ha='left', va='top', fontsize=9)
elif alternative == 'less':
ax.axvline(lower_cut, color='#e76f51', ls='--', lw=1)
ax.text(lower_cut, ax.get_ylim()[1]*0.9, f"Crit\n{lower_cut:.2f}", ha='right', va='top', fontsize=9)
# x̄ line and annotation
if(type_test == 'prob'):
ax.axvline(xbar, color='#264653', lw=2, ls='-', label='z')
ax.annotate(f"z = {xbar:.2f}",
xy=(xbar, stats.norm.pdf(xbar, loc=pop_mean, scale=se)),
xytext=(xbar, ax.get_ylim()[1]*0.6),
arrowprops=dict(arrowstyle='->', color='#264653'),
ha='center', color='#264653')
else:
ax.axvline(xbar, color='#264653', lw=2, ls='-', label='Sample mean (x̄)')
ax.annotate(f"x̄ = {xbar:.2f}",
xy=(xbar, stats.norm.pdf(xbar, loc=pop_mean, scale=se)),
xytext=(xbar, ax.get_ylim()[1]*0.6),
arrowprops=dict(arrowstyle='->', color='#264653'),
ha='center', color='#264653')
ax.set_title(title)
ax.set_xlabel(x_label)
ax.set_ylabel("Density")
ax.legend(loc='upper right', frameon=False)
ax.grid(alpha=0.15)
plt.show();
plot_z_hypothesis([z_stat], type_test = 'prob', title='Two proportion z-test', label = 'Standard Normal', x_label = '')

Welch’s t‑Test : Effect of Vitamin C on Tooth Growth¶
This classic dataset explores the impact of Vitamin C on tooth growth in guinea pigs, a foundational experiment in nutritional science. The response variable is the length of odontoblasts, which are cells responsible for tooth development. Sixty guinea pigs were randomly assigned to receive Vitamin C in one of two delivery methods:
1. Orange Juice (OJ)
2. Ascorbic Acid (VC) (a synthetic form of Vitamin C)
Each animal was also given one of three dose levels: 0.5 mg/day, 1 mg/day, or 2 mg/day. The experiment aims to determine whether the delivery method influences tooth growth, controlling for dosage.
Why it matters:
Understanding the effectiveness of different Vitamin C sources can guide dietary recommendations and supplement formulations. In modern analytics, this type of test parallels comparing two treatments or interventions in healthcare or A/B testing in product design.
Hypothesis:
\(H_0\): The mean tooth length is the same for both delivery methods (OJ and VC).
\(H_1\): The mean tooth length differs between the two methods.
This analysis focuses on the 0.5 mg/day dosage level; similar analyses can be performed for other dosages.
Result Interpretation: After running Welch’s t-test, if the p-value is less than 0.05, we reject the null hypothesis, concluding that the delivery method significantly affects tooth growth. If the p-value is greater than 0.05, we fail to reject \(H_0\), suggesting no measurable difference between OJ and VC. Additionally, reporting Cohen’s d helps quantify the magnitude of the difference, which is crucial for practical significance.
Dataset: R ToothGrowth (OJ vs VC).
CSV: https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/ToothGrowth.csv
Variables: len (tooth length), supp ∈ {OJ, VC}.
Hypotheses (two‑sided):
- \(H_0: \mu_{\text{OJ}} = \mu_{\text{VC}}\)
- \(H_1: \mu_{\text{OJ}} \neq \mu_{\text{VC}}\)
Assumptions: Independent samples; normality (approx); Welch’s t‑test does not assume equal variance.
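For reference, Welch's t statistic and the pooled‑SD Cohen's d reported below are

$$ t = \frac{\bar{x}_{\text{OJ}} - \bar{x}_{\text{VC}}}{\sqrt{\dfrac{s_{\text{OJ}}^2}{n_{\text{OJ}}} + \dfrac{s_{\text{VC}}^2}{n_{\text{VC}}}}}, \qquad d = \frac{\bar{x}_{\text{OJ}} - \bar{x}_{\text{VC}}}{s_p}, \quad s_p = \sqrt{\frac{(n_{\text{OJ}}-1)\,s_{\text{OJ}}^2 + (n_{\text{VC}}-1)\,s_{\text{VC}}^2}{n_{\text{OJ}} + n_{\text{VC}} - 2}} $$

with the degrees of freedom of \(t\) given by the Welch–Satterthwaite approximation.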
# Load ToothGrowth
url_tg = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/ToothGrowth.csv"
tg = pd.read_csv(url_tg)
# Drop any unnamed index column left over from the CSV export (if present)
for c in tg.columns:
if 'Unnamed' in c:
tg = tg.drop(columns=[c])
tg.head()
| | rownames | len | supp | dose |
|---|---|---|---|---|
| 0 | 1 | 4.200000 | VC | 0.500000 |
| 1 | 2 | 11.500000 | VC | 0.500000 |
| 2 | 3 | 7.300000 | VC | 0.500000 |
| 3 | 4 | 5.800000 | VC | 0.500000 |
| 4 | 5 | 6.400000 | VC | 0.500000 |
# Groups
oj = tg.loc[(tg['supp']=='OJ') & (tg.dose == 0.5),'len']
vc = tg.loc[(tg['supp']=='VC') & (tg.dose == 0.5),'len']
# Welch's t-test
t_stat, p_val = stats.ttest_ind(oj, vc, equal_var=False)
# Cohen's d (using pooled SD with group sizes)
def cohens_d(x, y):
nx, ny = len(x), len(y)
sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)
sp2 = ((nx-1)*sx2 + (ny-1)*sy2) / (nx+ny-2)
d = (np.mean(x) - np.mean(y)) / np.sqrt(sp2)
return d
d = cohens_d(oj, vc)
print(f"Welch t‑test: t = {t_stat:.4f}, p = {p_val:.6f}")
print(f"Mean's: OJ = {np.mean(oj):.3f}, VC = {np.mean(vc):.3f}")
print(f"Cohen's d = {d:.3f}")
Welch t‑test: t = 3.1697, p = 0.006359
Means: OJ = 13.230, VC = 7.980
Cohen's d = 1.418
Decision rule: If p < 0.05, reject \(H_0\) → evidence that the delivery method affects mean tooth length. As a rule of thumb, a Cohen's d of about 0.2 indicates a small difference between the groups, 0.5 a medium one, and 0.8 or greater a large effect.
Result: OJ is more effective at this dose. Welch's t-test shows a statistically significant difference between the two delivery methods at 0.5 mg/day (p ≈ 0.006 < 0.05), with orange juice leading to greater tooth growth, so the observed difference is unlikely to be due to random chance. Similar comparisons can be run at the other dose levels.
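An interval estimate of the difference complements the p-value by showing the plausible size of the effect in the original units. Below is a minimal sketch; the helper welch_mean_diff_ci is hypothetical (not part of the original analysis) and reuses the oj and vc series and the numpy/scipy imports from above.
# Sketch: 95% CI for the difference in means using the Welch–Satterthwaite df
def welch_mean_diff_ci(x, y, conf=0.95):
    nx, ny = len(x), len(y)
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    se = np.sqrt(vx/nx + vy/ny)
    # Welch–Satterthwaite approximation of the degrees of freedom
    dof = (vx/nx + vy/ny)**2 / ((vx/nx)**2/(nx-1) + (vy/ny)**2/(ny-1))
    diff = np.mean(x) - np.mean(y)
    tcrit = stats.t.ppf(1 - (1 - conf)/2, dof)
    return diff - tcrit*se, diff + tcrit*se
print("95% CI for mean(OJ) - mean(VC):", welch_mean_diff_ci(oj, vc))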
def plot_bivariate_histograms(dataset, con_col, cat_col, title='', x_label=''):
    # Overlaid histograms of the continuous column, one per level of the categorical column
    dataset.groupby([cat_col])[con_col].plot.hist(alpha=0.5)
    plt.legend(dataset.groupby([cat_col])[con_col].count().axes[0].tolist())
    plt.title(title)
    plt.xlabel(x_label)
    plt.show();
plot_bivariate_histograms(tg[tg.dose==0.5], 'len', 'supp', title='Joint histogram', x_label = 'Length')

One‑Way ANOVA : Branch‑wise Gross Income (with Tukey HSD)¶
This dataset captures detailed transaction records from a supermarket chain operating in Myanmar, covering three major cities: Yangon, Naypyitaw, and Mandalay. The data spans a three-month period from January to March 2019, providing a rich view of retail operations. Each record includes information on branch location, product line, customer demographics, payment methods, and financial metrics such as gross income and total sales.
For our hypothesis test, we focus on whether branch location influences gross income, which is a critical metric for profitability. Retail managers often need to know if certain branches consistently outperform others, as this insight can guide resource allocation, marketing strategies, and inventory planning.
Hypothesis:
\(H_0\): The mean gross income is the same across all three branches (A, B, C).
\(H_1\): At least one branch has a different mean gross income.
Why it matters: If the test reveals significant differences, management can investigate underlying factors such as customer purchasing power, branch size, or local marketing effectiveness. This analysis mirrors real-world business intelligence tasks where data-driven decisions optimize operations and profitability.
Result Interpretation: After running ANOVA, if the p-value is less than 0.05, we reject the null hypothesis, concluding that branch location impacts gross income. Post-hoc analysis using Tukey HSD identifies which branches differ significantly, enabling targeted strategies for improvement.
Dataset: Supermarket Sales (three branches: A, B, C).
CSV: https://raw.githubusercontent.com/selva86/datasets/master/supermarket_sales.csv
Outcome: gross income (continuous). Factor: Branch (A/B/C).
Hypotheses:
- \(H_0: \mu_A = \mu_B = \mu_C\)
- \(H_1\): At least one mean differs
Plan: Fit a one‑way ANOVA (ANalysis Of VAriance), check homogeneity of variances with Levene's test, then use Tukey HSD for post‑hoc pairwise comparisons.
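For reference, the one‑way ANOVA F statistic is the ratio of between‑group to within‑group mean squares:

$$ F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2 / (k-1)}{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 / (N-k)} $$

where \(k\) is the number of branches and \(N\) the total number of transactions; under \(H_0\), \(F\) follows an F distribution with \(k-1\) and \(N-k\) degrees of freedom.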
# Load Supermarket Sales
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/supermarket_sales.csv")
# Tidy column names
df.columns = [c.strip().replace(' ', '_').lower() for c in df.columns]
df
| | invoice_id | branch | city | customer_type | gender | product_line | unit_price | quantity | tax_5% | total | date | time | payment | cogs | gross_margin_percentage | gross_income | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.690000 | 7 | 26.141500 | 548.971500 | 1/5/2019 | 13:08 | Ewallet | 522.830000 | 4.761905 | 26.141500 | 9.100000 |
| 1 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.280000 | 5 | 3.820000 | 80.220000 | 3/8/2019 | 10:29 | Cash | 76.400000 | 4.761905 | 3.820000 | 9.600000 |
| 2 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.330000 | 7 | 16.215500 | 340.525500 | 3/3/2019 | 13:23 | Credit card | 324.310000 | 4.761905 | 16.215500 | 7.400000 |
| 3 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.220000 | 8 | 23.288000 | 489.048000 | 1/27/2019 | 20:33 | Ewallet | 465.760000 | 4.761905 | 23.288000 | 8.400000 |
| 4 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.310000 | 7 | 30.208500 | 634.378500 | 2/8/2019 | 10:37 | Ewallet | 604.170000 | 4.761905 | 30.208500 | 5.300000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 233-67-5758 | C | Naypyitaw | Normal | Male | Health and beauty | 40.350000 | 1 | 2.017500 | 42.367500 | 1/29/2019 | 13:46 | Ewallet | 40.350000 | 4.761905 | 2.017500 | 6.200000 |
| 996 | 303-96-2227 | B | Mandalay | Normal | Female | Home and lifestyle | 97.380000 | 10 | 48.690000 | 1,022.490000 | 3/2/2019 | 17:16 | Ewallet | 973.800000 | 4.761905 | 48.690000 | 4.400000 |
| 997 | 727-02-1313 | A | Yangon | Member | Male | Food and beverages | 31.840000 | 1 | 1.592000 | 33.432000 | 2/9/2019 | 13:22 | Cash | 31.840000 | 4.761905 | 1.592000 | 7.700000 |
| 998 | 347-56-2442 | A | Yangon | Normal | Male | Home and lifestyle | 65.820000 | 1 | 3.291000 | 69.111000 | 2/22/2019 | 15:33 | Cash | 65.820000 | 4.761905 | 3.291000 | 4.100000 |
| 999 | 849-09-3807 | A | Yangon | Member | Female | Fashion accessories | 88.340000 | 7 | 30.919000 | 649.299000 | 2/18/2019 | 13:28 | Cash | 618.380000 | 4.761905 | 30.919000 | 6.600000 |
1000 rows × 17 columns
# One-way ANOVA
model = ols('gross_income ~ C(branch)', data=df).fit()
anova_tbl = sm.stats.anova_lm(model, typ=2)
print(anova_tbl)
sum_sq df F PR(>F)
C(branch) 242.602644 2.000000 0.884583 0.413210
Residual 136,716.894906 997.000000 NaN NaN
# Levene test for homogeneity of variances
from scipy.stats import levene
branch_a = df[df['branch']=='A']['gross_income']
branch_b = df[df['branch']=='B']['gross_income']
branch_c = df[df['branch']=='C']['gross_income']
print("Levene p-value:", levene(branch_a, branch_b, branch_c).pvalue)
Levene p-value: 0.08946425577002974
# Tukey HSD post-hoc
tukey = pairwise_tukeyhsd(endog=df['gross_income'], groups=df['branch'], alpha=0.05)
print(tukey.summary())
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj lower upper reject
---------------------------------------------------
A B 0.358 0.9171 -1.7627 2.4788 False
A C 1.1784 0.3954 -0.9489 3.3057 False
B C 0.8203 0.6405 -1.3195 2.9602 False
---------------------------------------------------
Decision rule: If ANOVA p < 0.05, reject \(H_0\) and use Tukey HSD to identify differing pairs. If Levene p < 0.05, consider Welch’s ANOVA or robust alternatives.
As p = 0.413 (PR(>F) in the ANOVA table) is greater than 0.05, we fail to reject the null hypothesis and conclude that the three branches have similar mean gross income. Additionally, the variances are homogeneous (Levene's p-value > 0.05), and none of the pairwise comparisons is significant (Tukey's HSD is essentially a set of t-tests between each pair of branches, corrected for the family-wise error rate).
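Since Levene's p-value is above 0.05, Welch's ANOVA is not needed here. For completeness, a minimal sketch of how it could be run is below; it assumes a statsmodels version that exposes anova_oneway in statsmodels.stats.oneway (that function and its use_var='unequal' option are an assumption of this sketch, not part of the original analysis).
# Sketch: Welch's ANOVA as a robustness check if variances were unequal
# (assumes statsmodels.stats.oneway.anova_oneway is available)
from statsmodels.stats.oneway import anova_oneway
welch_res = anova_oneway(df['gross_income'], groups=df['branch'], use_var='unequal')
print(welch_res)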
bi_variate_boxplot = sns.boxplot(x="branch", y="gross_income", data=df)
bi_variate_boxplot.set(title = 'Box Chart of gross income across branches');

Chi‑Square Test of Independence: Gender × Product Category (Retail)¶
In multi‑category retail, merchandising teams care deeply about who buys what. This real customer shopping dataset covers transactions across multiple shopping malls in Istanbul (2021–2023) and includes gender and product category for each purchase. We’ll test whether product category preferences differ by gender, which can inform assortment planning, aisle placement, personalized recommendations, targeted promotions, and store layout decisions.
Variables:
- gender ∈ {Male, Female}
- category ∈ {e.g., Clothing, Electronics, Accessories, …} (multiple categories present in the file)
Hypotheses (Chi‑Square Test of Independence):
\(H_0\): Gender and product category are independent (no association).
\(H_1\): Gender and product category are associated (category preference depends on gender).
Why it matters:
A significant association suggests different category affinities by gender. Retailers can adjust promotions, content, inventory mix, and store displays to better match demand, which often improves conversion rates and gross margin.
Assumptions & Data Checks:
- Sufficient expected counts (preferably ≥ 5 per cell). If some categories are rare, consider grouping similar categories or analyzing the top N categories to satisfy assumptions.
- Observations are independent (each row is a separate transaction).
(These are standard conditions for the Chi‑Square test in categorical analysis.)
Dataset Source:
Customer Shopping Data (GitHub) — file: customer_shopping_data.csv
Interpretation:
1. If p < 0.05, reject \(H_0\): gender and product category are associated.
2. Cramér’s V indicates strength of association: ~0.1 = weak, ~0.3 = moderate, ~0.5 = strong.
We compute the Chi‑square test of independence and Cramér’s V as an effect size.
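For reference, Cramér's V rescales the chi-square statistic to the 0–1 range, which is what the code below computes:

$$ V = \sqrt{\frac{\chi^2 / n}{\min(r-1,\ c-1)}} $$

where \(n\) is the total number of observations and \(r\), \(c\) are the numbers of rows and columns of the contingency table.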
import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np
# Load dataset
url = "https://raw.githubusercontent.com/gokcengiz/Shopping-data-analysis/main/customer_shopping_data.csv"
df = pd.read_csv(url)
df
| | invoice_no | customer_id | gender | age | category | quantity | price | payment_method | invoice_date | shopping_mall |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I138884 | C241288 | Female | 28 | Clothing | 5 | 1,500.400000 | Credit Card | 5/8/2022 | Kanyon |
| 1 | I317333 | C111565 | Male | 21 | Shoes | 3 | 1,800.510000 | Debit Card | 12/12/2021 | Forum Istanbul |
| 2 | I127801 | C266599 | Male | 20 | Clothing | 1 | 300.080000 | Cash | 9/11/2021 | Metrocity |
| 3 | I173702 | C988172 | Female | 66 | Shoes | 5 | 3,000.850000 | Credit Card | 16/05/2021 | Metropol AVM |
| 4 | I337046 | C189076 | Female | 53 | Books | 4 | 60.600000 | Cash | 24/10/2021 | Kanyon |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99452 | I219422 | C441542 | Female | 45 | Souvenir | 5 | 58.650000 | Credit Card | 21/09/2022 | Kanyon |
| 99453 | I325143 | C569580 | Male | 27 | Food & Beverage | 2 | 10.460000 | Cash | 22/09/2021 | Forum Istanbul |
| 99454 | I824010 | C103292 | Male | 63 | Food & Beverage | 2 | 10.460000 | Debit Card | 28/03/2021 | Metrocity |
| 99455 | I702964 | C800631 | Male | 56 | Technology | 4 | 4,200.000000 | Cash | 16/03/2021 | Istinye Park |
| 99456 | I232867 | C273973 | Female | 36 | Souvenir | 3 | 35.190000 | Credit Card | 15/10/2022 | Mall of Istanbul |
99457 rows × 10 columns
# Build contingency table: product category vs gender
ct = pd.crosstab(df['category'], df['gender'])
chi2, p, dof, expected = chi2_contingency(ct)
print("Contingency Table:\n", ct)
print(f"Chi-square = {chi2:.4f}, p-value = {p:.6f}, dof = {dof}")
Contingency Table:
gender Female Male
category
Books 2906 2075
Clothing 20652 13835
Cosmetics 9070 6027
Food & Beverage 8804 5972
Shoes 5967 4067
Souvenir 3017 1982
Technology 2981 2015
Toys 6085 4002
Chi-square = 7.5679, p-value = 0.372234, dof = 7
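As a quick check of the expected-count condition listed under the assumptions, the sketch below reuses the expected matrix returned by chi2_contingency in the cell above.
# Check the chi-square assumption: every expected cell count should be >= 5
print("Minimum expected cell count:", round(float(expected.min()), 1))
print("All expected counts >= 5:", bool((expected >= 5).all()))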
# Compute Cramér's V
n = ct.values.sum()
phi2 = chi2 / n
r, c = ct.shape
cramers_v = np.sqrt(phi2 / min(r-1, c-1))
print(f"Cramér's V = {cramers_v:.3f}")
Cramér's V = 0.009
A Cramér's V of 0.009 indicates a negligible (very weak) association between gender and product category.
from statsmodels.graphics.mosaicplot import mosaic
def plot_mosaics(data, x_col, y_col, title='', colors_list =[]):
dict_of_tuples = {}
# create the clean set of percentages to print
for x_col_ in data[x_col].unique():
for y_col_ in data[y_col].unique():
n = len(data[(data[x_col]==x_col_)&(data[y_col]==y_col_)][x_col])
d = len(data[(data[x_col]==x_col_)][x_col])
len_ = len(data[x_col])
if((d==0) or (n/d<=0.04)):
# if the percentage within a class is less than 4%, do not print the percentage
dict_of_tuples[(str(x_col_), str(y_col_))] = ''
elif(n/len_<=0.02):
                # If it's a tiny class (less than 2% of the total data), do not print its percentage
dict_of_tuples[(str(x_col_), str(y_col_))] = ''
else:
dict_of_tuples[(str(x_col_), str(y_col_))] = str(int(n/d*100))+"%"
dict_of_colors = dict_of_tuples.copy()
if(len(colors_list)>0):
# create a clean set of colors
for i, x_col_ in enumerate(data[x_col].unique()):
for y_col_ in data[y_col].unique():
dict_of_colors[(str(x_col_), str(y_col_))] = {'color':colors_list[i], 'alpha':0.8}
# Plot the mosaic plot
labelizer = lambda k: dict_of_tuples[k]
fig, ax = plt.subplots(figsize=(8,6))
if(len(colors_list)>0):
mosaic(data.sort_values([x_col, y_col]), [x_col, y_col],
statistic = False, axes_label = True, label_rotation = [90, 0],
labelizer=labelizer, properties=dict_of_colors, gap=0.008, ax=ax)
else:
mosaic(data.sort_values([x_col, y_col]), [x_col, y_col],
statistic = False, axes_label = True, label_rotation = [90, 0],
labelizer=labelizer, gap=0.008, ax=ax)
if(title==''):
plt.title(str(y_col) + ' percentages across ' + str(x_col))
else:
plt.title(title)
plt.show();
plot_mosaics(df, 'category', 'gender')

Decision rule: If p < 0.05, reject \(H_0\). Report Cramér’s V to quantify the association strength.
As p is greater than 5%, we fail to reject the null hypothesis, indicating no significant association between gender and product category.
References¶
- Udacity A/B test (ab_data.csv) GitHub mirror: https://github.com/beery4010/Analyze-AB-Test-Results
- ToothGrowth dataset (Rdatasets CSV): https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/datasets/ToothGrowth.csv
- Supermarket Sales dataset (selva86/datasets CSV): https://github.com/selva86/datasets/blob/master/supermarket_sales.csv
- Retail data : https://github.com/gokcengiz/Shopping-data-analysis