Analyzing Train Delays in India Using DBSCAN¶

Use Case Overview¶

Train delays are a common challenge in India’s vast railway network, impacting millions of passengers daily. Understanding patterns behind these delays can help optimize scheduling and improve operational efficiency. In this blog, we collected real-time train running status data, calculated delay metrics, and applied clustering techniques to identify geographical and temporal patterns of delays across stations. The goal is to uncover hotspots where delays frequently occur and analyze their severity.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Web Scraping with BeautifulSoup¶

To gather train delay data, we used BeautifulSoup, a Python library for parsing HTML and XML documents. The data was sourced from runningstatus.in, which provides live train status updates. BeautifulSoup allowed us to:
1. Extract structured data from HTML tables containing station names, scheduled and actual arrival/departure times, and delay information.
2. Handle inconsistencies in HTML tags and parse nested elements like <abbr> for station codes and <small> for delay status.
3. Convert the extracted data into a pandas DataFrame for further analysis.

This approach automated data collection across roughly 1,800 candidate train numbers, producing a dataset of 26,152 rows.
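The extraction steps listed above can be sketched on a tiny, self-contained example. The markup below is illustrative only and is not copied from runningstatus.in; the helper name `parse_station_cell` is also hypothetical. It shows how an `<abbr>` tag can carry the station name and code while a `<small>` tag carries the status text.

```python
from bs4 import BeautifulSoup

# Illustrative markup only: NOT the real page source of runningstatus.in
SAMPLE_ROW = """
<table><tbody><tr>
  <td><abbr title="KYN">Kalyan</abbr><small>58 Km/Hr</small></td>
  <td>09:12PM / 09:35PM</td>
  <td>09:15PM / 09:45PM</td>
  <td>PF: 4</td>
</tr></tbody></table>
"""

def parse_station_cell(td):
    """Return (name, code, status) from one station cell (hypothetical helper)."""
    abbr = td.find("abbr")
    small = td.find("small")
    name = abbr.get_text(strip=True) if abbr else td.get_text(strip=True)
    code = abbr.get("title") if abbr else None
    status = small.get_text(strip=True) if small else None
    return name, code, status

soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
cells = soup.find("tr").find_all("td")
station = parse_station_cell(cells[0])
# "scheduled / actual" pairs are split on the slash
scheduled, actual = [p.strip() for p in cells[1].get_text(" ", strip=True).split("/")]
print(station, scheduled, actual)
```

Using `abbr.get("title")` returns None when the attribute is missing, which avoids a separate `has_attr` check.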

For the Janmabhoomi Express, the live running status shows the following:

The corresponding html code is:

def get_table(train_number):
    # The link takes two inputs, train number and date in YYYYMMDD (2025-12-13)
    link = 'https://runningstatus.in/status/'+ str(train_number)+'-on-20251213'
    try:
        response = requests.get(link, timeout=15)
    except requests.exceptions.RequestException:
        # A timeout typically means there is no train for this number on the date
        print("request failed for", train_number)
        return pd.DataFrame()
    if response.status_code == 200:
        page_content = response.text
        soup = BeautifulSoup(page_content, 'html.parser')
    else:
        print(train_number, 'Failed')
        return pd.DataFrame()
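    table = soup.find("table")
    if table is None:
        # The page loaded but contains no status table
        print(train_number, 'No table found')
        return pd.DataFrame()
    last_updated = None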
    for thead in table.find_all("thead"):
        # thead contains the header row in the table
        last_row = thead.find("tr")
        # tr is the row
        if last_row:
            td = last_row.find("td", colspan=True)
            if td and "Last Updated" in td.get_text(strip=True):
                last_updated = td.get_text(strip=True)
                break

    rows = []
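    # html.parser does not insert a missing <tbody>, so fall back to the table itself
    tbody = table.find("tbody") or table
    for tr in tbody.find_all("tr"):
        # td is every cell (table data)
        tds = tr.find_all("td")
        # A data row should have exactly 4 cells: station, arrival, departure, platform
        if len(tds) != 4:
            continue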

        # --- Station cell ---
        station_cell = tds[0]
        abbr = station_cell.find("abbr") # this is the station name abbreviation
        station_name = abbr.get_text(strip=True) if abbr else station_cell.get_text(strip=True)
        station_code = abbr["title"] if (abbr and abbr.has_attr("title")) else None

        delay_small = station_cell.find("small") # this is delay status
        delay_status = delay_small.get_text(strip=True) if delay_small else None

        # --- Arrival cell ---
        arrival_cell = tds[1]
        arrival_text = arrival_cell.get_text(" ", strip=True)  # e.g., "02:11 PM / 02:11 PM" or "Source"
        arrival_status_tag = arrival_cell.find("span")
        arrival_status = arrival_status_tag.get_text(strip=True) if arrival_status_tag else None

        # Parse arrival times if present
        arr_sch, arr_act = None, None
        if "/" in arrival_text:
            parts = [p.strip() for p in arrival_text.split("/")]
            if len(parts) == 2:
                arr_sch, arr_act = parts

        # --- Departure cell ---
        departure_cell = tds[2]
        departure_text = departure_cell.get_text(" ", strip=True)  # e.g., "02:12 PM / 02:12 PM" or "Destination"
        departure_status_tag = departure_cell.find("span")
        departure_status = departure_status_tag.get_text(strip=True) if departure_status_tag else None

        dep_sch, dep_act = None, None
        if "/" in departure_text:
            parts = [p.strip() for p in departure_text.split("/")]
            if len(parts) == 2:
                dep_sch, dep_act = parts

        # --- PF cell ---
        pf = tds[3].get_text(strip=True)

        rows.append({
            "Station Name": station_name,
            "Station Code": station_code,
            "Delay Status": delay_status,
            "Arrival Scheduled": arr_sch,
            "Arrival Actual": arr_act,
            "Arrival Status": arrival_status,
            "Departure Scheduled": dep_sch,
            "Departure Actual": dep_act,
            "Departure Status": departure_status,  # e.g., "Destination"
            "PF": pf
        })

    df = pd.DataFrame(rows)
    df['train_number'] = train_number
    print(train_number)
    return df

Superfast express trains usually have numbers of the form 12XXX or 22XXX. This analysis focuses on superfast trains because they run regularly and cover major stations.

# Try every candidate number in both superfast ranges; get_table returns an
# empty DataFrame when nothing could be scraped for a number
frames = []
for train_number in list(range(12101, 12999)) + list(range(22101, 22999)):
    frames.append(get_table(train_number))
# Concatenate once; each train keeps its own 0..n row index (ignore_index=False)
train_details = pd.concat(frames, ignore_index=False)
train_details

	Station Name	Station Code	Delay Status	Arrival Scheduled	Arrival Actual	Arrival Status	Departure Scheduled	Departure Actual	Departure Status	PF	train_number
0	Lokmanyatilak	LTT		None	None	Source	08:35PM	08:36PM 01M Late	08:36PM	PF: 3	12101
1	Kalyan	KYN	58 Km/Hr	09:12PM	09:35PM 23M Late	09:35PM	09:15PM	09:45PM 30M Late	09:45PM	PF: 4	12101
2	Bhusaval	BSL	30 Km/Hr	02:50AM	03:20AM 30M Late	03:20AM	02:55AM	03:28AM 33M Late	03:28AM	PF: 5	12101
3	Akola	AK	120 Km/Hr	04:50AM	05:20AM 30M Late	05:20AM	04:55AM	05:24AM 29M Late	05:24AM	PF: 2	12101
4	Badnera	BD	105 Km/Hr	06:20AM	06:36AM 16M Late	06:36AM	06:23AM	06:39AM 16M Late	06:39AM	PF: 2	12101
...	...	...	...	...	...	...	...	...	...	...	...
27	Keshorai Patan	KPTN	126 Km/Hr	09:23AM	09:37AM 14M Late	09:37AM	09:25AM	09:43AM 18M Late	09:43AM	PF: 1	22998
28	Kota	KOTA	95 Km/Hr	09:50AM	09:57AM 07M Late	09:57AM	10:00AM	10:08AM 08M Late	10:08AM	PF: 2	22998
29	New Kota	NKOT	64 Km/Hr	10:13AM	10:21AM 08M Late	10:21AM	10:15AM	10:24AM 09M Late	10:24AM	PF: 2	22998
30	Ramganj Mandi	RMA	121 Km/Hr	11:03AM	11:14AM 11M Late	11:14AM	11:05AM	11:19AM 14M Late	11:19AM	PF: 1	22998
31	Jhalawar City	JLWC	55 Km/Hr	12:05PM	12:02PM No Delay	12:02PM	None	None	Destination	PF: 1	22998

26152 rows × 11 columns

train_details.columns = ['station_name', 'station_code', 'delay_status', 'scheduled_arrival', 'actual_arrival', 'arrival_status', 'scheduled_departure', 'actual_departure', 'departure_status', 'pf', 'train_number']

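Splitting the scraped "actual" columns into a clock time and the delay text, then parsing the times.

# A raw value looks like "09:35PM 23M Late": splitting on the first 'M' (the end of AM/PM)
# separates the clock time from the delay text
train_details[['actual_departure', 'delay_departure']] = train_details['actual_departure'].str.split('M', n=1, expand=True)
# 'arrival_departure' holds the arrival delay text (name kept to match the output shown later)
train_details[['actual_arrival', 'arrival_departure']] = train_details['actual_arrival'].str.split('M', n=1, expand=True)
time_cols = ['scheduled_arrival', 'actual_arrival', 'scheduled_departure', 'actual_departure']
# The split consumed the 'M' of AM/PM, so put it back before parsing
train_details['actual_departure'] = train_details['actual_departure'] + 'M'
train_details['actual_arrival'] = train_details['actual_arrival'] + 'M'
# The strings carry no date, so every parsed value lands on the same calendar date
for time_col in time_cols:
    train_details[time_col] = pd.to_datetime(train_details[time_col])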
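The delay text split off above (the `delay_departure` and `arrival_departure` columns) already states the delay in words, so it can serve as a cross-check on the timestamp arithmetic. The following is a minimal sketch, assuming the text only takes the forms visible in the sample output ("01M Late", "8H 36M Late", "8H Late", "No Delay"); the function name `delay_text_to_minutes` is hypothetical.

```python
import re

# Optional hours part, optional minutes part, then the word "Late"
_DELAY_RE = re.compile(r"(?:(\d+)\s*H)?\s*(?:(\d+)\s*M)?\s*Late", re.IGNORECASE)

def delay_text_to_minutes(text):
    """Convert delay text such as '8H 36M Late' to minutes.

    Returns 0 for 'No Delay' and None for missing or unrecognised text.
    """
    if text is None:
        return None
    text = text.strip()
    if text.lower() == "no delay":
        return 0
    match = _DELAY_RE.fullmatch(text)
    # Reject a bare "Late" with neither hours nor minutes
    if not match or (match.group(1) is None and match.group(2) is None):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

examples = [" 01M Late", "8H 36M Late", "8H Late", " No Delay", "Source"]
parsed = [delay_text_to_minutes(t) for t in examples]
print(parsed)
```

Applied with `Series.map`, this would give a delay column that does not depend on which calendar date the parsed times fall on.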

Creating delay variables

train_details['arrival_delay'] = (train_details.actual_arrival - train_details.scheduled_arrival)/pd.Timedelta(minutes=1)
train_details['departure_delay'] = (train_details.actual_departure - train_details.scheduled_departure)/pd.Timedelta(minutes=1)
train_details['arrival_delay'] = np.maximum(0, train_details['arrival_delay'].fillna(0))
train_details['departure_delay'] = np.maximum(0, train_details['departure_delay'].fillna(0))
train_details

# Handling the midnight wraparound:
# The parsed times all share one calendar date, so a scheduled/actual pair that straddles
# midnight can produce a spurious delay close to 24*60 = 1440 minutes.
# Values of 1350 or more are treated as wraparound and replaced by 1440 minus the computed value.
wrapped = train_details.arrival_delay >= 1350
train_details.loc[wrapped, 'arrival_delay'] = 1440 - train_details.loc[wrapped, 'arrival_delay']
train_details.sort_values('arrival_delay', ascending=True).tail(25)

	Unnamed: 0	station_name	station_code	delay_status	scheduled_arrival	actual_arrival	arrival_status	scheduled_departure	actual_departure	departure_status	pf	train_number	delay_departure	arrival_departure	arrival_delay	departure_delay
3491	32	Howrah	HWH	98 Km/Hr	2025-12-14 07:35:00	2025-12-14 15:28:00	03:28PM	NaT	NaT	Destination	PF: 10	12334	NaN	7H 53M Late	473.0	0.0
13082	15	Rourkela	ROU	-4 Km/Hr	2025-12-14 12:57:00	2025-12-14 20:57:00	08:57PM	2025-12-14 13:05:00	2025-12-14 21:41:00	09:41PM	PF: 3	12871	8H 36M Late	8H Late	480.0	516.0
3490	31	Bardhaman	BWN	4 Km/Hr	2025-12-14 05:46:00	2025-12-14 13:53:00	01:53PM	2025-12-14 05:50:00	2025-12-14 13:56:00	01:56PM	PF: 3	12334	8H 6M Late	8H 7M Late	487.0	486.0
4571	33	Mathura	MTJ	91 Km/Hr	2025-12-14 02:55:00	2025-12-14 11:05:00	11:05AM	2025-12-14 02:57:00	2025-12-14 11:08:00	11:08AM	PF: 2	12409	8H 11M Late	8H 10M Late	490.0	491.0
23263	18	Joychandi Pahar	JOC	21 Km/Hr	2025-12-14 06:50:00	2025-12-14 15:10:00	03:10PM	2025-12-14 06:52:00	2025-12-14 15:36:00	03:36PM	-	22844	8H 44M Late	8H 20M Late	500.0	524.0
975	5	Bina	BINA	13 Km/Hr	2025-12-14 00:50:00	2025-12-14 09:11:00	09:11AM	2025-12-14 00:53:00	2025-12-14 09:15:00	09:15AM	PF: 3	12162	8H 22M Late	8H 21M Late	501.0	502.0
842	5	Adra	ADRA	101 Km/Hr	2025-12-14 00:05:00	2025-12-14 09:10:00	09:10AM	2025-12-14 00:10:00	2025-12-14 09:17:00	09:17AM	PF: 1	12152	9H 7M Late	9H 5M Late	545.0	547.0
977	7	Itarsi	ET	96 Km/Hr	2025-12-14 04:15:00	2025-12-14 13:20:00	01:20PM	2025-12-14 04:20:00	2025-12-14 13:23:00	01:23PM	PF: 2	12162	9H 3M Late	9H 5M Late	545.0	543.0
976	6	Rani Kamalapati	RKMP	91 Km/Hr	2025-12-14 02:40:00	2025-12-14 11:46:00	11:46AM	2025-12-14 02:50:00	2025-12-14 11:56:00	11:56AM	PF: 5	12162	9H 6M Late	9H 6M Late	546.0	546.0
978	8	Timarni	TBN	130 Km/Hr	2025-12-14 04:58:00	2025-12-14 14:09:00	02:09PM	2025-12-14 05:00:00	2025-12-14 14:10:00	02:10PM	PF: 1	12162	9H 10M Late	9H 11M Late	551.0	550.0
4572	34	Hazrat Nizamuddin	NZM	68 Km/Hr	2025-12-14 05:00:00	2025-12-14 14:11:00	02:11PM	NaT	NaT	Destination	PF: 2	12409	NaN	9H 11M Late	551.0	0.0
23264	19	Purulia	PRR	55 Km/Hr	2025-12-14 07:30:00	2025-12-14 16:41:00	04:41PM	2025-12-14 07:32:00	2025-12-14 16:45:00	04:45PM	PF: 3	22844	9H 13M Late	9H 11M Late	551.0	553.0
979	9	Harda	HD	83 Km/Hr	2025-12-14 05:11:00	2025-12-14 14:26:00	02:26PM	2025-12-14 05:13:00	2025-12-14 14:28:00	02:28PM	PF: 3	12162	9H 15M Late	9H 15M Late	555.0	555.0
13083	16	Raj Gangpur	GP	40 Km/Hr	2025-12-14 13:30:00	2025-12-14 22:48:00	10:48PM	2025-12-14 13:32:00	2025-12-14 22:51:00	10:51PM	PF: 1	12871	9H 19M Late	9H 18M Late	558.0	559.0
843	6	Purulia	PRR	55 Km/Hr	2025-12-14 00:55:00	2025-12-14 10:22:00	10:22AM	2025-12-14 01:00:00	2025-12-14 10:25:00	10:25AM	PF: 3	12152	9H 25M Late	9H 27M Late	567.0	565.0
13084	17	Garpos	GPH	53 Km/Hr	2025-12-14 13:56:00	2025-12-14 23:24:00	11:24PM	2025-12-14 13:57:00	2025-12-14 23:25:00	11:25PM	PF: 1	12871	9H 28M Late	9H 28M Late	568.0	568.0
980	10	Khandwa Junction	KNW	91 Km/Hr	2025-12-14 06:43:00	2025-12-14 16:20:00	04:20PM	2025-12-14 06:45:00	2025-12-14 16:25:00	04:25PM	PF: 5	12162	9H 40M Late	9H 37M Late	577.0	580.0
981	11	Burhanpur	BAU	130 Km/Hr	2025-12-14 07:38:00	2025-12-14 17:16:00	05:16PM	2025-12-14 07:40:00	2025-12-14 17:18:00	05:18PM	PF: 1	12162	9H 38M Late	9H 38M Late	578.0	578.0
13085	18	Bamra	BMB	109 Km/Hr	2025-12-14 14:01:00	2025-12-14 23:39:00	11:39PM	2025-12-14 14:02:00	2025-12-14 23:41:00	11:41PM	PF: 1	12871	9H 39M Late	9H 38M Late	578.0	579.0
982	12	Bhusaval	BSL	115 Km/Hr	2025-12-14 08:25:00	2025-12-14 18:03:00	06:03PM	2025-12-14 08:30:00	2025-12-14 08:30:00	08:30AM	PF: 4	12162	9H 38M Late	9H 38M Late	578.0	0.0
13086	19	Bagdihi	BEH	128 Km/Hr	2025-12-14 14:13:00	2025-12-14 23:53:00	11:53PM	2025-12-14 14:14:00	2025-12-14 23:53:00	11:53PM	PF: 2	12871	9H 39M Late	9H 40M Late	580.0	579.0
844	7	Chakaradharpur	CKP	85 Km/Hr	2025-12-14 02:55:00	2025-12-14 12:37:00	12:37PM	2025-12-14 02:57:00	2025-12-14 12:41:00	12:41PM	PF: 1	12152	9H 44M Late	9H 42M Late	582.0	584.0
23265	20	Tatanagar	TATA	53 Km/Hr	2025-12-14 09:30:00	2025-12-14 19:21:00	07:21PM	2025-12-14 09:55:00	2025-12-14 09:55:00	09:55AM	PF: 2	22844	9H 51M Late	9H 51M Late	591.0	0.0
845	8	Rourkela	ROU	78 Km/Hr	2025-12-14 04:20:00	2025-12-14 14:42:00	02:42PM	2025-12-14 04:28:00	2025-12-14 14:49:00	02:49PM	PF: 3	12152	10H 21M Late	10H 22M Late	622.0	621.0
846	9	Jharsuguda	JSG	71 Km/Hr	2025-12-14 06:13:00	2025-12-14 17:03:00	05:03PM	2025-12-14 06:15:00	2025-12-14 17:08:00	05:08PM	PF: 1	12152	10H 53M Late	10H 50M Late	650.0	653.0
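The wraparound handling above patches values near 1440 after the subtraction. An alternative, which is not the approach used in this post, is to do the subtraction modulo one day so that a pair straddling midnight comes out right directly. The sketch below works on minute-of-day values; the names `wrapped_delay_minutes` and `minute_of_day` and the 90-minute early-arrival window (mirroring the 1350 threshold) are assumptions for illustration.

```python
MINUTES_PER_DAY = 24 * 60

def minute_of_day(hour, minute):
    """Minutes since midnight for a 24-hour clock time."""
    return hour * 60 + minute

def wrapped_delay_minutes(scheduled_min, actual_min, max_early=90):
    """Signed delay in minutes, allowing the actual time to fall on the next day.

    A difference within `max_early` minutes of a full day is read as an
    early arrival and returned as a negative number.
    """
    diff = (actual_min - scheduled_min) % MINUTES_PER_DAY
    if diff >= MINUTES_PER_DAY - max_early:
        return diff - MINUTES_PER_DAY
    return diff

# Scheduled 23:55, arrived 00:05 the next day: 10 minutes late
late_over_midnight = wrapped_delay_minutes(minute_of_day(23, 55), minute_of_day(0, 5))
# Scheduled 00:05, arrived 23:55 the previous evening: 10 minutes early
early_before_midnight = wrapped_delay_minutes(minute_of_day(0, 5), minute_of_day(23, 55))
# Scheduled 07:35, arrived 15:28 the same day
long_delay = wrapped_delay_minutes(minute_of_day(7, 35), minute_of_day(15, 28))
print(late_over_midnight, early_before_midnight, long_delay)
```

The trade-off is that a genuine delay longer than 22.5 hours would be misread as an early arrival, so the window needs to be chosen with the data in mind.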

Creating a station-level delay dataset. Small stations (those with fewer than 10 train stops in the day) are ignored.

delays = train_details.groupby(['station_name', 'station_code']).aggregate({
    'arrival_delay': 'mean',
    'departure_delay': 'mean',
    'delay_status': 'count'
}).reset_index()
delays = delays[delays.delay_status>=10].reset_index(drop=True)
delays.sort_values('delay_status', ascending=False)

	station_name	station_code	arrival_delay	departure_delay	delay_status
556	Vijayawada	BZA	6.866337	7.564356	187
514	Surat	ST	5.010204	5.663265	187
94	Bhusaval	BSL	13.020513	11.815385	185
229	Itarsi	ET	10.010471	10.460733	183
550	Vadodara	BRC	7.391753	6.835052	182
...	...	...	...	...	...
215	Harpalpur	HPP	17.600000	18.100000	10
220	Hindaun City	HAN	37.400000	37.100000	10
241	Jamtara	JMT	59.900000	60.000000	10
214	Harihar	HRR	2.300000	2.300000	10
40	Aunrihar	ARJ	36.600000	29.200000	10

577 rows × 5 columns

Geocoding with Google Maps API¶

To visualize delays geographically, we needed latitude and longitude for each station. The Google Geocoding API was used to convert station names into coordinates:
1. Constructed queries like "Station Name (Code) Railway station, India" to improve accuracy.
2. Applied region bias (region='in') to ensure results were relevant to India.
3. Parsed JSON responses to extract lat and lng values.

This step enabled mapping delays across India, providing spatial insight into where clusters of delays occur.

import config
import requests

def get_lat_lng(location, gmaps_key=config.gmaps_key, region_bias = 'in'):
    """
    Returns (lat, lng) for a given location using the Google Geocoding API.
    Raises on network/HTTP errors; returns (0, 0) when the API reports a
    non-OK status or an empty result list.
    """
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {
        "address": location,
        "key": gmaps_key,
    }
    # Optional: region bias (e.g., "in" for India) to improve relevance
    if region_bias:
        params["region"] = region_bias

    response = requests.get(url, params=params, timeout=15)
    response.raise_for_status()  # network/HTTP-level errors

    payload = response.json()

    status = payload.get("status")
    if status != "OK":
        # Common statuses: ZERO_RESULTS, OVER_QUERY_LIMIT, REQUEST_DENIED, INVALID_REQUEST
        print("Geocoding failed for", location, "-", status, payload.get("error_message"))
        return 0, 0

    results = payload.get("results", [])
    if not results:
        return 0,0

    location_obj = results[0]["geometry"]["location"]  # guaranteed when status == OK
    lat = location_obj["lat"]
    lng = location_obj["lng"]
    return lat, lng

# Build a query such as "Kalyan (KYN) Railway station, India" for each station
delays['address'] = delays['station_name']+' ('+ delays.station_code + ') Railway station, India'
delays[['lat','long']] = delays['address'].apply(lambda addr: pd.Series(get_lat_lng(addr), index=["lat", "long"]))
delays

	station_name	station_code	arrival_delay	departure_delay	delay_status	address	lat	long
0	Abhaipur	AHA	5.785714	5.857143	13	Abhaipur (AHA) Railway station, India	25.215607	86.322353
1	Abu Road	ABR	2.250000	1.975000	38	Abu Road (ABR) Railway station, India	24.480749	72.785071
2	Achhnera	AH	33.800000	34.200000	17	Achhnera (AH) Railway station, India	27.177370	77.753791
3	Adoni	AD	16.208333	16.541667	21	Adoni (AD) Railway station, India	15.631882	77.275883
4	Adra	ADRA	33.920000	28.640000	24	Adra (ADRA) Railway station, India	23.496133	86.676755
...	...	...	...	...	...	...	...	...
572	Washim	WHM	0.687500	0.875000	14	Washim (WHM) Railway station, India	20.103064	77.148166
573	Yadgir	YG	19.050000	19.000000	17	Yadgir (YG) Railway station, India	16.742595	77.133291
574	Yelahanka	YNK	5.352941	6.470588	16	Yelahanka (YNK) Railway station, India	13.104980	77.591798
575	Yerraguntla	YA	22.437500	23.062500	15	Yerraguntla (YA) Railway station, India	14.645948	78.549873
576	Yesvantpur	YPR	1.421053	2.605263	28	Yesvantpur (YPR) Railway station, India	13.023212	77.551373

577 rows × 8 columns

# Fixing mapping issues
delays.loc[delays.address == 'Kareli (KY) Railway station, India', ['lat', 'long']] = [22.931886, 79.066079]
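Because `get_lat_lng` returns (0, 0) on failure, and because a single far-away point stretches the min-max scaling used later, it can help to screen the coordinates before clustering. The sketch below is one way to do that, assuming a deliberately loose bounding box for India; the box limits and the helper names are assumptions for illustration, not values used in this post.

```python
# Rough, deliberately loose bounding box for India (an assumption for this sketch)
INDIA_LAT_RANGE = (6.0, 38.0)
INDIA_LNG_RANGE = (68.0, 98.0)

def is_plausible_india_coord(lat, lng):
    """True if (lat, lng) falls inside the rough bounding box above."""
    return (INDIA_LAT_RANGE[0] <= lat <= INDIA_LAT_RANGE[1]
            and INDIA_LNG_RANGE[0] <= lng <= INDIA_LNG_RANGE[1])

def find_suspect_rows(records):
    """Return records whose coordinates look wrong, including the (0, 0) failure sentinel."""
    return [r for r in records if not is_plausible_india_coord(r["lat"], r["long"])]

# Two rows from the table above plus a made-up failed lookup
sample_records = [
    {"station_code": "AHA", "lat": 25.215607, "long": 86.322353},
    {"station_code": "YPR", "lat": 13.023212, "long": 77.551373},
    {"station_code": "FAILED", "lat": 0, "long": 0},  # hypothetical failed geocode
]
suspects = find_suspect_rows(sample_records)
print([r["station_code"] for r in suspects])
```

With the DataFrame from this post, the same check could be run on `delays.to_dict("records")`. It only catches points outside the box; a station geocoded to the wrong place inside India, as in the manual fix above, still needs a visual check on the map.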

Clustering with DBSCAN¶

For clustering, we used DBSCAN (Density-Based Spatial Clustering of Applications with Noise) from scikit-learn. DBSCAN suits this use case because:
1. It groups stations with similar delay patterns without requiring the number of clusters upfront.
2. It handles noise explicitly, marking outliers (stations with unusual delay behavior) with a label of -1.
3. It works well with spatial data when coordinates are combined with delay metrics.

So that no single feature dominates the Euclidean distance, we scaled average arrival delay, latitude, and longitude to the [0, 1] range using MinMaxScaler before applying DBSCAN with:
1. eps = 0.07 (neighborhood radius in the scaled units)
2. min_samples = 5 (minimum number of points in a neighborhood, counting the point itself, for it to be a core point)

The result: stations grouped into clusters based on delay severity and geographic proximity, revealing delay hotspots for the day analyzed.

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler
import numpy as np

scaler = MinMaxScaler().set_output(transform="pandas")
X = scaler.fit_transform(delays[['arrival_delay', 'lat', 'long']])

# Run DBSCAN with a single eps, expressed in the scaled [0, 1] feature units (not meters)
eps = 0.07
min_samples = 5

db = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean')
delays['labels'] = db.fit_predict(X)

delays['labels'].unique()

array([ 0,  1, -1,  2,  3])
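The post does not say how eps = 0.07 was chosen. A common heuristic, offered here as a sketch rather than as the method used above, is to look at each point's distance to its min_samples-th nearest point (counting itself) and pick eps near the elbow of the sorted curve. The example runs on synthetic data standing in for the scaled feature matrix; `X_demo` and the demo eps of 0.2 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled (arrival_delay, lat, long) matrix
X_demo = rng.random((200, 3))

min_samples = 5
# Querying the training points returns each point as its own nearest neighbour,
# so the last column is the distance to the min_samples-th point counting itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = distances[:, -1]

# Sorted k-distances: plotting this curve and reading off the elbow suggests an eps
k_dist_sorted = np.sort(k_dist)

# A point is a DBSCAN core point exactly when its k-distance is at most eps
eps_demo = 0.2  # arbitrary value for the demo, not a recommendation
core_from_kdist = np.flatnonzero(k_dist <= eps_demo)
core_from_dbscan = DBSCAN(eps=eps_demo, min_samples=min_samples).fit(X_demo).core_sample_indices_
print(len(core_from_kdist), len(core_from_dbscan))
```

Running the same steps on the real scaled matrix `X` would show how many stations become core points at eps = 0.07 and how sensitive the clustering is to that choice.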

# Details of each cluster (string names avoid the pandas deprecation warning for np.mean/np.std)
delays.groupby('labels').aggregate({
    'arrival_delay':['min', 'max', 'mean', 'std'],
    'departure_delay':['min', 'max', 'mean', 'std']
})

	arrival_delay				departure_delay
	min	max	mean	std	min	max	mean	std
labels
-1	0.000000	113.750000	34.124167	23.680011	0.071429	113.833333	33.884537	24.351357
0	1.470588	8.000000	4.306110	2.471251	1.500000	8.368421	4.545647	2.490323
1	0.000000	50.615385	13.112440	9.528010	0.000000	50.615385	13.106862	9.426237
2	26.833333	35.368421	30.642969	2.950345	26.367647	35.894737	29.186663	3.379501
3	30.615385	44.200000	38.826864	4.979863	31.000000	45.500000	39.613117	5.290575

Cluster Analysis¶

Cluster -1 (Noise, color:black)¶

Arrival Delay: Mean ≈ 34.12 min, max up to 113.75 min
Departure Delay: Mean ≈ 33.88 min, max up to 113.83 min
Interpretation: These stations do not sit in any dense region of the scaled delay-and-location space, so DBSCAN marks them as outliers. Their average delays range from 0 to about 114 minutes, and the group includes the most extreme cases, which is why its mean is high.

Cluster 0 (Minimal Delays, color: green, mostly near Patna)¶

Arrival Delay: Mean ≈ 4.31 min, max ≈ 8 min
Departure Delay: Mean ≈ 4.55 min
Interpretation: Stations in this cluster are highly efficient with almost negligible delays. Likely major hubs or well-managed routes.

Cluster 1 (Moderate Delays, color: blue, most of India)¶

Arrival Delay: Mean ≈ 13.11 min, max ≈ 50.62 min
Departure Delay: Mean ≈ 13.11 min
Interpretation: These stations experience moderate delays, possibly due to congestion or operational constraints.

Cluster 2 (High Delays, color: orange, Bilaspur-Raigarh stretch)¶

Arrival Delay: Mean ≈ 30.64 min, max ≈ 35.37 min
Departure Delay: Mean ≈ 29.19 min
Interpretation: Stations consistently facing high delays. These could be bottlenecks in the network or areas with infrastructure limitations.

Cluster 3 (Severe Delays, color: red, Ongole-Khammam-Nalgonda stretch)¶

Arrival Delay: Mean ≈ 38.83 min, max ≈ 44.20 min
Departure Delay: Mean ≈ 39.61 min
Interpretation: Chronic delay hotspots. Likely major junctions with heavy traffic or systemic scheduling issues.

Key Insights¶

Cluster 0 represents the best-performing stations.
Clusters 2 and 3 highlight critical problem areas needing immediate attention.
Noise (-1) includes extreme anomalies, which might require separate investigation.

Visualizing the Results¶

Using GeoPandas and Matplotlib, we plotted:
1. Clusters on an India map with color-coded labels.
2. Delay intensity using a heatmap-style scatter plot.

These visualizations make it easy to identify regions where delays are concentrated, such as major junctions or congested routes.

# Set the default colors
my_colors_list = {-1: 'black', 0: 'green', 1: 'blue', 2: 'orange', 3: 'red'}
delays['label_colors'] = delays.labels.map(my_colors_list)

import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Has India map with state boundaries
map_gdf = gpd.read_file('https://gist.githubusercontent.com/jbrobst/56c13bbbf9d97d187fea01ca62ea5112/raw/e388c4cae20aa53cb5090210a42ebb9b765c0a36/india_states.geojson')

fig, ax = plt.subplots(figsize=(10, 10))
map_gdf.plot(ax=ax, color='white', edgecolor='lightgrey')
delays.plot(x='long', y='lat', c = 'label_colors', kind='scatter', alpha=0.75, ax=ax)
plt.title("Clusters of train stations based on delays on 13/14-12-2025")

# Add legends
legend_spec = [
    ("black",  "Outliers"),
    ("green",  "No delays"),
    ("blue",   "Average delays"),
    ("orange", "High delays"),
    ("red",    "Severe delays"),
]

# Build legend handles
handles = [Line2D([0], [0],marker='o', color='w', label=label, markerfacecolor=color, markersize=10) for color, label in legend_spec]
ax.legend(handles=handles, title="Cluster meaning", loc="upper right", frameon=True)
plt.show();

[Figure: station clusters on the India map, colored by cluster label]

fig, ax = plt.subplots(figsize=(10, 10))
map_gdf.plot(ax=ax, color='white', edgecolor='lightgrey')
delays.plot(x='long', y='lat', s = 'delay_status', c = 'arrival_delay', colormap='Reds', kind='scatter', ax=ax)
plt.title("Average delays on 13/14-12-2025")
plt.show()

[Figure: stations on the India map, shaded by average arrival delay and sized by number of stops]

Key Insights¶

Stations such as Khammam and Raigarh showed high average delays.
Clusters often formed around busy corridors, indicating systemic congestion.
Outliers represented stations with unusually high delays, possibly due to local operational issues.

This blog demonstrates how clustering techniques like DBSCAN can uncover delay hotspots and operational bottlenecks in India’s railway network. To make this approach more impactful, we could extend the analysis across longer timeframes and incorporate additional factors such as weather, seasonal demand, and maintenance schedules. These enhancements will enable predictive modeling for proactive scheduling and resource allocation. Operationally, Indian Railways can leverage these insights to prioritize infrastructure upgrades in chronic delay clusters, implement dynamic timetabling, and improve passenger communication for better reliability and customer experience.

(Blog text improved using Gen AI)
