
Analyzing Train Delays in India Using DBSCAN

Use Case Overview

Train delays are a common challenge in India’s vast railway network, impacting millions of passengers daily. Understanding patterns behind these delays can help optimize scheduling and improve operational efficiency. In this blog, we collected real-time train running status data, calculated delay metrics, and applied clustering techniques to identify geographical and temporal patterns of delays across stations. The goal is to uncover hotspots where delays frequently occur and analyze their severity.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Web Scraping with BeautifulSoup

To gather train delay data, we used BeautifulSoup, a Python library for parsing HTML and XML documents. The data was sourced from runningstatus.in, which provides live train status updates. BeautifulSoup allowed us to:
1. Extract structured data from HTML tables containing station names, scheduled and actual arrival/departure times, and delay information.
2. Handle inconsistencies in HTML tags and parse nested elements like <abbr> for station codes and <small> for delay status.
3. Convert the extracted data into a pandas DataFrame for further analysis.

This approach automated data collection for thousands of trains, creating a rich dataset of over 26,000 rows.
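
The parsing steps above can be tried offline on a small piece of markup. The sketch below is illustrative only: `SAMPLE_HTML` is a hypothetical, simplified stand-in for one station row (not the site's real HTML), and `parse_rows` and `split_times` are helper names invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a simplified stand-in for one station row, not the site's real HTML.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <td><abbr title="KYN">Kalyan</abbr> <small>58 Km/Hr</small></td>
      <td>09:12PM / 09:35PM</td>
      <td>09:15PM / 09:45PM</td>
      <td>PF: 4</td>
    </tr>
    <tr><td colspan="4">Last Updated: (placeholder)</td></tr>
  </tbody>
</table>
"""


def split_times(cell):
    """Split a 'scheduled / actual' cell into two strings, or (None, None)."""
    parts = [p.strip() for p in cell.get_text(" ", strip=True).split("/")]
    return tuple(parts) if len(parts) == 2 else (None, None)


def parse_rows(html):
    """Return one dict per station row found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 4:  # skip rows that are not station rows
            continue
        abbr = cells[0].find("abbr")    # station name (text) and code (title attribute)
        small = cells[0].find("small")  # status shown under the station name
        arr_sch, arr_act = split_times(cells[1])
        records.append({
            "station_name": abbr.get_text(strip=True) if abbr else None,
            "station_code": abbr.get("title") if abbr else None,
            "delay_status": small.get_text(strip=True) if small else None,
            "arrival_scheduled": arr_sch,
            "arrival_actual": arr_act,
            "pf": cells[3].get_text(strip=True),
        })
    return records


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

The list of dicts can be passed straight to `pd.DataFrame(rows)`, which is the same hand-off the full scraper uses.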

For the Janmabhoomi Express, the live running status page shows the following:
[Image: live running status table for the Janmabhoomi Express]

The corresponding HTML is:
[Image: HTML source of the same table]

def get_table(train_number):
    # The link takes two inputs, train number and date in YYYYMMDD (2025-12-13)
    link = 'https://runningstatus.in/status/'+ str(train_number)+'-on-20251213'
    try:
        response = requests.get(link, timeout=15)
    except requests.RequestException:
        # Timeout happens if there is no train for the train number on the date
        print("timeout for", train_number)
        return pd.DataFrame()
    if response.status_code == 200:
        page_content = response.text
        soup = BeautifulSoup(page_content, 'html.parser')
    else:
        print(train_number, 'Failed')
        return pd.DataFrame()
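    table = soup.find("table")
    if table is None:
        # The page loaded but contains no status table
        print(train_number, 'No table found')
        return pd.DataFrame()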
    last_updated = None
    for thead in table.find_all("thead"):
        # thead contains the header row in the table
        last_row = thead.find("tr")
        # tr is the row
        if last_row:
            td = last_row.find("td", colspan=True)
            if td and "Last Updated" in td.get_text(strip=True):
                last_updated = td.get_text(strip=True)
                break

    rows = []
    tbody = table.find("tbody") or table  # fall back to the table itself if there is no <tbody>
    for tr in tbody.find_all("tr"):
        # td is every cell (table data)
        tds = tr.find_all("td")
        # Each station row should have exactly 4 cells; skip anything else
        if len(tds) != 4:
            continue

        # --- Station cell ---
        station_cell = tds[0]
        abbr = station_cell.find("abbr") # this is the station name abbreviation
        station_name = abbr.get_text(strip=True) if abbr else station_cell.get_text(strip=True)
        station_code = abbr["title"] if (abbr and abbr.has_attr("title")) else None

        delay_small = station_cell.find("small") # this is delay status
        delay_status = delay_small.get_text(strip=True) if delay_small else None

        # --- Arrival cell ---
        arrival_cell = tds[1]
        arrival_text = arrival_cell.get_text(" ", strip=True)  # e.g., "02:11 PM / 02:11 PM" or "Source"
        arrival_status_tag = arrival_cell.find("span")
        arrival_status = arrival_status_tag.get_text(strip=True) if arrival_status_tag else None

        # Parse arrival times if present
        arr_sch, arr_act = None, None
        if "/" in arrival_text:
            parts = [p.strip() for p in arrival_text.split("/")]
            if len(parts) == 2:
                arr_sch, arr_act = parts

        # --- Departure cell ---
        departure_cell = tds[2]
        departure_text = departure_cell.get_text(" ", strip=True)  # e.g., "02:12 PM / 02:12 PM" or "Destination"
        departure_status_tag = departure_cell.find("span")
        departure_status = departure_status_tag.get_text(strip=True) if departure_status_tag else None

        dep_sch, dep_act = None, None
        if "/" in departure_text:
            parts = [p.strip() for p in departure_text.split("/")]
            if len(parts) == 2:
                dep_sch, dep_act = parts

        # --- PF cell ---
        pf = tds[3].get_text(strip=True)

        rows.append({
            "Station Name": station_name,
            "Station Code": station_code,
            "Delay Status": delay_status,
            "Arrival Scheduled": arr_sch,
            "Arrival Actual": arr_act,
            "Arrival Status": arrival_status,
            "Departure Scheduled": dep_sch,
            "Departure Actual": dep_act,
            "Departure Status": departure_status,  # e.g., "Destination"
            "PF": pf
        })

    df = pd.DataFrame(rows)
    df['train_number'] = train_number
    print(train_number)
    return df

Superfast express trains usually have numbers of the form 12XXX or 22XXX. This analysis focuses on them because they run regularly and cover major stations.

# Collect one DataFrame per train number, then concatenate once
# (concatenating inside the loop copies the growing frame on every iteration)
frames = [get_table(train_number) for train_number in range(12101, 12999)]
frames += [get_table(train_number) for train_number in range(22101, 22999)]
train_details = pd.concat(frames, ignore_index=False)
train_details
|  | Station Name | Station Code | Delay Status | Arrival Scheduled | Arrival Actual | Arrival Status | Departure Scheduled | Departure Actual | Departure Status | PF | train_number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lokmanyatilak | LTT |  | None | None | Source | 08:35PM | 08:36PM 01M Late | 08:36PM | PF: 3 | 12101 |
| 1 | Kalyan | KYN | 58 Km/Hr | 09:12PM | 09:35PM 23M Late | 09:35PM | 09:15PM | 09:45PM 30M Late | 09:45PM | PF: 4 | 12101 |
| 2 | Bhusaval | BSL | 30 Km/Hr | 02:50AM | 03:20AM 30M Late | 03:20AM | 02:55AM | 03:28AM 33M Late | 03:28AM | PF: 5 | 12101 |
| 3 | Akola | AK | 120 Km/Hr | 04:50AM | 05:20AM 30M Late | 05:20AM | 04:55AM | 05:24AM 29M Late | 05:24AM | PF: 2 | 12101 |
| 4 | Badnera | BD | 105 Km/Hr | 06:20AM | 06:36AM 16M Late | 06:36AM | 06:23AM | 06:39AM 16M Late | 06:39AM | PF: 2 | 12101 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27 | Keshorai Patan | KPTN | 126 Km/Hr | 09:23AM | 09:37AM 14M Late | 09:37AM | 09:25AM | 09:43AM 18M Late | 09:43AM | PF: 1 | 22998 |
| 28 | Kota | KOTA | 95 Km/Hr | 09:50AM | 09:57AM 07M Late | 09:57AM | 10:00AM | 10:08AM 08M Late | 10:08AM | PF: 2 | 22998 |
| 29 | New Kota | NKOT | 64 Km/Hr | 10:13AM | 10:21AM 08M Late | 10:21AM | 10:15AM | 10:24AM 09M Late | 10:24AM | PF: 2 | 22998 |
| 30 | Ramganj Mandi | RMA | 121 Km/Hr | 11:03AM | 11:14AM 11M Late | 11:14AM | 11:05AM | 11:19AM 14M Late | 11:19AM | PF: 1 | 22998 |
| 31 | Jhalawar City | JLWC | 55 Km/Hr | 12:05PM | 12:02PM No Delay | 12:02PM | None | None | Destination | PF: 1 | 22998 |

26152 rows × 11 columns

train_details.columns = ['station_name', 'station_code', 'delay_status', 'scheduled_arrival', 'actual_arrival', 'arrival_status', 'scheduled_departure', 'actual_departure', 'departure_status', 'pf', 'train_number']

Creating variables such as delay in departure and arrival times.

train_details[['actual_departure', 'delay_departure']] = train_details['actual_departure'].str.split('M', n=1, expand=True)
train_details[['actual_arrival', 'arrival_departure']] = train_details['actual_arrival'].str.split('M', n=1, expand=True)
time_cols = ['scheduled_arrival', 'actual_arrival', 'scheduled_departure', 'actual_departure']
train_details['actual_departure'] = train_details['actual_departure'] + 'M'
train_details['actual_arrival'] = train_details['actual_arrival'] + 'M'
for time_col in time_cols:
    train_details[time_col] = pd.to_datetime(train_details[time_col])
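
To see what the split on 'M' does, here is a small sketch on made-up values shaped like the scraped "actual" column. The sample strings and the explicit `format="%I:%M%p"` are assumptions for illustration; the code above lets pandas infer the format instead.

```python
import pandas as pd

# Made-up values shaped like the scraped "actual" column: clock time, then delay text
raw = pd.Series(["08:36PM 01M Late", "12:02PM No Delay", None])

# Split once at the first 'M', which is the M of AM/PM
parts = raw.str.split("M", n=1, expand=True)

clock = parts[0] + "M"           # put back the 'M' consumed by the split
remainder = parts[1].str.strip()  # delay text that followed the clock time

# Explicit 12-hour format; missing values become NaT
parsed = pd.to_datetime(clock, format="%I:%M%p")

print(pd.DataFrame({"clock": clock, "remainder": remainder, "parsed": parsed}))
```

With an explicit format and no date in the string, pandas fills in 1900-01-01 as the date. That does not affect the delay, since only the difference between two times on the same date is used.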

Creating delay variables

train_details['arrival_delay'] = (train_details.actual_arrival - train_details.scheduled_arrival)/pd.Timedelta(minutes=1)
train_details['departure_delay'] = (train_details.actual_departure - train_details.scheduled_departure)/pd.Timedelta(minutes=1)
train_details['arrival_delay'] = np.maximum(0, train_details['arrival_delay'].fillna(0))
train_details['departure_delay'] = np.maximum(0, train_details['departure_delay'].fillna(0))
train_details
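# Correcting for midnight wrap-around:
# The parsed times carry no real date, so when the scheduled and actual arrival fall on either side of midnight
# the raw difference comes out close to 24*60 = 1440 minutes. Values of 1350 or more are replaced with 1440 minus the raw value.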
train_details.loc[train_details.arrival_delay >= 1350, 'arrival_delay'] = 1440 - train_details[train_details.arrival_delay >= 1350].arrival_delay
train_details.sort_values('arrival_delay', ascending=True).tail(25)
|  | Unnamed: 0 | station_name | station_code | delay_status | scheduled_arrival | actual_arrival | arrival_status | scheduled_departure | actual_departure | departure_status | pf | train_number | delay_departure | arrival_departure | arrival_delay | departure_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3491 | 32 | Howrah | HWH | 98 Km/Hr | 2025-12-14 07:35:00 | 2025-12-14 15:28:00 | 03:28PM | NaT | NaT | Destination | PF: 10 | 12334 | NaN | 7H 53M Late | 473.0 | 0.0 |
| 13082 | 15 | Rourkela | ROU | -4 Km/Hr | 2025-12-14 12:57:00 | 2025-12-14 20:57:00 | 08:57PM | 2025-12-14 13:05:00 | 2025-12-14 21:41:00 | 09:41PM | PF: 3 | 12871 | 8H 36M Late | 8H Late | 480.0 | 516.0 |
| 3490 | 31 | Bardhaman | BWN | 4 Km/Hr | 2025-12-14 05:46:00 | 2025-12-14 13:53:00 | 01:53PM | 2025-12-14 05:50:00 | 2025-12-14 13:56:00 | 01:56PM | PF: 3 | 12334 | 8H 6M Late | 8H 7M Late | 487.0 | 486.0 |
| 4571 | 33 | Mathura | MTJ | 91 Km/Hr | 2025-12-14 02:55:00 | 2025-12-14 11:05:00 | 11:05AM | 2025-12-14 02:57:00 | 2025-12-14 11:08:00 | 11:08AM | PF: 2 | 12409 | 8H 11M Late | 8H 10M Late | 490.0 | 491.0 |
| 23263 | 18 | Joychandi Pahar | JOC | 21 Km/Hr | 2025-12-14 06:50:00 | 2025-12-14 15:10:00 | 03:10PM | 2025-12-14 06:52:00 | 2025-12-14 15:36:00 | 03:36PM | - | 22844 | 8H 44M Late | 8H 20M Late | 500.0 | 524.0 |
| 975 | 5 | Bina | BINA | 13 Km/Hr | 2025-12-14 00:50:00 | 2025-12-14 09:11:00 | 09:11AM | 2025-12-14 00:53:00 | 2025-12-14 09:15:00 | 09:15AM | PF: 3 | 12162 | 8H 22M Late | 8H 21M Late | 501.0 | 502.0 |
| 842 | 5 | Adra | ADRA | 101 Km/Hr | 2025-12-14 00:05:00 | 2025-12-14 09:10:00 | 09:10AM | 2025-12-14 00:10:00 | 2025-12-14 09:17:00 | 09:17AM | PF: 1 | 12152 | 9H 7M Late | 9H 5M Late | 545.0 | 547.0 |
| 977 | 7 | Itarsi | ET | 96 Km/Hr | 2025-12-14 04:15:00 | 2025-12-14 13:20:00 | 01:20PM | 2025-12-14 04:20:00 | 2025-12-14 13:23:00 | 01:23PM | PF: 2 | 12162 | 9H 3M Late | 9H 5M Late | 545.0 | 543.0 |
| 976 | 6 | Rani Kamalapati | RKMP | 91 Km/Hr | 2025-12-14 02:40:00 | 2025-12-14 11:46:00 | 11:46AM | 2025-12-14 02:50:00 | 2025-12-14 11:56:00 | 11:56AM | PF: 5 | 12162 | 9H 6M Late | 9H 6M Late | 546.0 | 546.0 |
| 978 | 8 | Timarni | TBN | 130 Km/Hr | 2025-12-14 04:58:00 | 2025-12-14 14:09:00 | 02:09PM | 2025-12-14 05:00:00 | 2025-12-14 14:10:00 | 02:10PM | PF: 1 | 12162 | 9H 10M Late | 9H 11M Late | 551.0 | 550.0 |
| 4572 | 34 | Hazrat Nizamuddin | NZM | 68 Km/Hr | 2025-12-14 05:00:00 | 2025-12-14 14:11:00 | 02:11PM | NaT | NaT | Destination | PF: 2 | 12409 | NaN | 9H 11M Late | 551.0 | 0.0 |
| 23264 | 19 | Purulia | PRR | 55 Km/Hr | 2025-12-14 07:30:00 | 2025-12-14 16:41:00 | 04:41PM | 2025-12-14 07:32:00 | 2025-12-14 16:45:00 | 04:45PM | PF: 3 | 22844 | 9H 13M Late | 9H 11M Late | 551.0 | 553.0 |
| 979 | 9 | Harda | HD | 83 Km/Hr | 2025-12-14 05:11:00 | 2025-12-14 14:26:00 | 02:26PM | 2025-12-14 05:13:00 | 2025-12-14 14:28:00 | 02:28PM | PF: 3 | 12162 | 9H 15M Late | 9H 15M Late | 555.0 | 555.0 |
| 13083 | 16 | Raj Gangpur | GP | 40 Km/Hr | 2025-12-14 13:30:00 | 2025-12-14 22:48:00 | 10:48PM | 2025-12-14 13:32:00 | 2025-12-14 22:51:00 | 10:51PM | PF: 1 | 12871 | 9H 19M Late | 9H 18M Late | 558.0 | 559.0 |
| 843 | 6 | Purulia | PRR | 55 Km/Hr | 2025-12-14 00:55:00 | 2025-12-14 10:22:00 | 10:22AM | 2025-12-14 01:00:00 | 2025-12-14 10:25:00 | 10:25AM | PF: 3 | 12152 | 9H 25M Late | 9H 27M Late | 567.0 | 565.0 |
| 13084 | 17 | Garpos | GPH | 53 Km/Hr | 2025-12-14 13:56:00 | 2025-12-14 23:24:00 | 11:24PM | 2025-12-14 13:57:00 | 2025-12-14 23:25:00 | 11:25PM | PF: 1 | 12871 | 9H 28M Late | 9H 28M Late | 568.0 | 568.0 |
| 980 | 10 | Khandwa Junction | KNW | 91 Km/Hr | 2025-12-14 06:43:00 | 2025-12-14 16:20:00 | 04:20PM | 2025-12-14 06:45:00 | 2025-12-14 16:25:00 | 04:25PM | PF: 5 | 12162 | 9H 40M Late | 9H 37M Late | 577.0 | 580.0 |
| 981 | 11 | Burhanpur | BAU | 130 Km/Hr | 2025-12-14 07:38:00 | 2025-12-14 17:16:00 | 05:16PM | 2025-12-14 07:40:00 | 2025-12-14 17:18:00 | 05:18PM | PF: 1 | 12162 | 9H 38M Late | 9H 38M Late | 578.0 | 578.0 |
| 13085 | 18 | Bamra | BMB | 109 Km/Hr | 2025-12-14 14:01:00 | 2025-12-14 23:39:00 | 11:39PM | 2025-12-14 14:02:00 | 2025-12-14 23:41:00 | 11:41PM | PF: 1 | 12871 | 9H 39M Late | 9H 38M Late | 578.0 | 579.0 |
| 982 | 12 | Bhusaval | BSL | 115 Km/Hr | 2025-12-14 08:25:00 | 2025-12-14 18:03:00 | 06:03PM | 2025-12-14 08:30:00 | 2025-12-14 08:30:00 | 08:30AM | PF: 4 | 12162 | 9H 38M Late | 9H 38M Late | 578.0 | 0.0 |
| 13086 | 19 | Bagdihi | BEH | 128 Km/Hr | 2025-12-14 14:13:00 | 2025-12-14 23:53:00 | 11:53PM | 2025-12-14 14:14:00 | 2025-12-14 23:53:00 | 11:53PM | PF: 2 | 12871 | 9H 39M Late | 9H 40M Late | 580.0 | 579.0 |
| 844 | 7 | Chakaradharpur | CKP | 85 Km/Hr | 2025-12-14 02:55:00 | 2025-12-14 12:37:00 | 12:37PM | 2025-12-14 02:57:00 | 2025-12-14 12:41:00 | 12:41PM | PF: 1 | 12152 | 9H 44M Late | 9H 42M Late | 582.0 | 584.0 |
| 23265 | 20 | Tatanagar | TATA | 53 Km/Hr | 2025-12-14 09:30:00 | 2025-12-14 19:21:00 | 07:21PM | 2025-12-14 09:55:00 | 2025-12-14 09:55:00 | 09:55AM | PF: 2 | 22844 | 9H 51M Late | 9H 51M Late | 591.0 | 0.0 |
| 845 | 8 | Rourkela | ROU | 78 Km/Hr | 2025-12-14 04:20:00 | 2025-12-14 14:42:00 | 02:42PM | 2025-12-14 04:28:00 | 2025-12-14 14:49:00 | 02:49PM | PF: 3 | 12152 | 10H 21M Late | 10H 22M Late | 622.0 | 621.0 |
| 846 | 9 | Jharsuguda | JSG | 71 Km/Hr | 2025-12-14 06:13:00 | 2025-12-14 17:03:00 | 05:03PM | 2025-12-14 06:15:00 | 2025-12-14 17:08:00 | 05:08PM | PF: 1 | 12152 | 10H 53M Late | 10H 50M Late | 650.0 | 653.0 |
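
The adjustment above only touches raw differences of 1350 minutes or more. As far as the code shown goes, a late arrival that crosses midnight (scheduled 11:50PM, actual 12:10AM) gives a negative raw difference instead, which the earlier clipping step sets to 0. One possible alternative, sketched here and not used for the results in this post, is to wrap every difference into a 24-hour window. The function name `wrapped_delay_minutes` and the assumption that the true offset lies within 12 hours either way are introduced for this example.

```python
import pandas as pd


def wrapped_delay_minutes(scheduled, actual, fmt="%I:%M%p"):
    """Signed delay in minutes from clock strings such as '11:50PM'.

    Assumes the true offset lies within 12 hours either way, so a raw
    difference beyond that is treated as having crossed midnight.
    """
    sched = pd.to_datetime(scheduled, format=fmt)
    act = pd.to_datetime(actual, format=fmt)
    raw = (act - sched) / pd.Timedelta(minutes=1)
    # Shift, wrap into [0, 1440), then shift back to [-720, 720)
    return (raw + 720) % 1440 - 720


scheduled = pd.Series(["11:50PM", "12:05AM", "09:12PM"])
actual = pd.Series(["12:10AM", "11:55PM", "09:35PM"])
delays_min = wrapped_delay_minutes(scheduled, actual)
print(delays_min.tolist())  # [20.0, -10.0, 23.0]
```

Negative values are early arrivals and can still be clipped to 0 afterwards. The 12-hour assumption holds for the delays listed above, which top out near 11 hours, but it would misread a train running more than 12 hours late.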

Creating a delay dataset at station level, ignoring small stations (those with fewer than 10 train stops in a day).

delays = train_details.groupby(['station_name', 'station_code']).aggregate({
    'arrival_delay': 'mean',
    'departure_delay': 'mean',
    'delay_status': 'count'
}).reset_index()
delays = delays[delays.delay_status>=10].reset_index(drop=True)
delays.sort_values('delay_status', ascending=False)
|  | station_name | station_code | arrival_delay | departure_delay | delay_status |
|---|---|---|---|---|---|
| 556 | Vijayawada | BZA | 6.866337 | 7.564356 | 187 |
| 514 | Surat | ST | 5.010204 | 5.663265 | 187 |
| 94 | Bhusaval | BSL | 13.020513 | 11.815385 | 185 |
| 229 | Itarsi | ET | 10.010471 | 10.460733 | 183 |
| 550 | Vadodara | BRC | 7.391753 | 6.835052 | 182 |
| ... | ... | ... | ... | ... | ... |
| 215 | Harpalpur | HPP | 17.600000 | 18.100000 | 10 |
| 220 | Hindaun City | HAN | 37.400000 | 37.100000 | 10 |
| 241 | Jamtara | JMT | 59.900000 | 60.000000 | 10 |
| 214 | Harihar | HRR | 2.300000 | 2.300000 | 10 |
| 40 | Aunrihar | ARJ | 36.600000 | 29.200000 | 10 |

577 rows × 5 columns

Geocoding with Google Maps API

To visualize delays geographically, we needed latitude and longitude for each station. The Google Geocoding API was used to convert station names into coordinates:
1. Constructed queries like "Station Name (Code) Railway station, India" to improve accuracy.
2. Applied region bias (region='in') to ensure results were relevant to India.
3. Parsed JSON responses to extract lat and lng values.

This step enabled mapping delays across India, providing spatial insight into where clusters of delays occur.
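
The query construction and JSON parsing steps can be checked without calling the API. In the sketch below, `build_query` and `extract_lat_lng` are hypothetical helper names, and the two payloads are hand-written samples that follow the `status` / `results[0].geometry.location` layout of a Geocoding response; they are not real API output.

```python
def build_query(station_name, station_code):
    """Build the address string sent to the geocoder."""
    return f"{station_name} ({station_code}) Railway station, India"


def extract_lat_lng(payload):
    """Return (lat, lng) from a geocoding payload, or None if nothing usable is there."""
    if payload.get("status") != "OK":
        return None
    results = payload.get("results") or []
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample payloads, for illustration only
sample_ok = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 24.480749, "lng": 72.785071}}}],
}
sample_empty = {"status": "ZERO_RESULTS", "results": []}

print(build_query("Abu Road", "ABR"))
print(extract_lat_lng(sample_ok))
print(extract_lat_lng(sample_empty))
```

Returning `None` for a failed lookup is a design choice made in this sketch; it keeps unresolved stations distinguishable from a genuine coordinate, so they can be reviewed or dropped before scaling and clustering.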

import config
import requests

def get_lat_lng(location, gmaps_key=config.gmaps_key, region_bias = 'in'):
    """
    Returns (lat, lng) for a given location using the Google Geocoding API.
    Returns (0, 0) if the API status is not OK or no results come back; raises on HTTP errors.
    """
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {
        "address": location,
        "key": gmaps_key,
    }
    # Optional: region bias (e.g., "in" for India) to improve relevance
    if region_bias:
        params["region"] = region_bias

    response = requests.get(url, params=params, timeout=15)
    response.raise_for_status()  # network/HTTP-level errors

    payload = response.json()

    status = payload.get("status")
    if status != "OK":
        # Common statuses: ZERO_RESULTS, OVER_QUERY_LIMIT, REQUEST_DENIED, INVALID_REQUEST
        print(location, status, payload.get("error_message"))
        return 0, 0  # sentinel for a failed lookup; check for such rows before clustering

    results = payload.get("results", [])
    if not results:
        return 0,0

    location_obj = results[0]["geometry"]["location"]  # guaranteed when status == OK
    lat = location_obj["lat"]
    lng = location_obj["lng"]
    return lat, lng
delays['address'] = delays['station_name']+' ('+ delays.station_code + ') Railway station, India'
delays[['lat','long']] = delays['address'].apply(lambda addr: pd.Series(get_lat_lng(addr), index=["lat", "long"]))
delays
|  | station_name | station_code | arrival_delay | departure_delay | delay_status | address | lat | long |
|---|---|---|---|---|---|---|---|---|
| 0 | Abhaipur | AHA | 5.785714 | 5.857143 | 13 | Abhaipur (AHA) Railway station, India | 25.215607 | 86.322353 |
| 1 | Abu Road | ABR | 2.250000 | 1.975000 | 38 | Abu Road (ABR) Railway station, India | 24.480749 | 72.785071 |
| 2 | Achhnera | AH | 33.800000 | 34.200000 | 17 | Achhnera (AH) Railway station, India | 27.177370 | 77.753791 |
| 3 | Adoni | AD | 16.208333 | 16.541667 | 21 | Adoni (AD) Railway station, India | 15.631882 | 77.275883 |
| 4 | Adra | ADRA | 33.920000 | 28.640000 | 24 | Adra (ADRA) Railway station, India | 23.496133 | 86.676755 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 572 | Washim | WHM | 0.687500 | 0.875000 | 14 | Washim (WHM) Railway station, India | 20.103064 | 77.148166 |
| 573 | Yadgir | YG | 19.050000 | 19.000000 | 17 | Yadgir (YG) Railway station, India | 16.742595 | 77.133291 |
| 574 | Yelahanka | YNK | 5.352941 | 6.470588 | 16 | Yelahanka (YNK) Railway station, India | 13.104980 | 77.591798 |
| 575 | Yerraguntla | YA | 22.437500 | 23.062500 | 15 | Yerraguntla (YA) Railway station, India | 14.645948 | 78.549873 |
| 576 | Yesvantpur | YPR | 1.421053 | 2.605263 | 28 | Yesvantpur (YPR) Railway station, India | 13.023212 | 77.551373 |

577 rows × 8 columns

# Fixing mapping issues
delays.loc[delays.address == 'Kareli (KY) Railway station, India', ['lat', 'long']] = [22.931886, 79.066079]

Clustering with DBSCAN

For clustering, we used DBSCAN (Density-Based Spatial Clustering of Applications with Noise) from scikit-learn. DBSCAN is ideal for this use case because:
1. It identifies clusters of stations with similar delay patterns without requiring the number of clusters upfront.
2. It handles noise effectively, marking outliers (stations with unique delay behavior) with a label of -1.
3. Works well with spatial data when combined with delay metrics.

To ensure fair clustering, we normalized arrival delays and station coordinates using MinMaxScaler before applying DBSCAN with:
1. eps = 0.07 (distance threshold in normalized units)
2. min_samples = 5 (minimum points to form a cluster)

The result: stations grouped into clusters based on delay severity and geographic proximity, revealing hotspots of chronic delays.

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler
import numpy as np
scaler = MinMaxScaler().set_output(transform="pandas")
X = scaler.fit_transform(delays[['arrival_delay', 'lat', 'long']])
# Run DBSCAN with a single eps, expressed in the normalized (0 to 1) feature space, not in meters
eps = 0.07
min_samples = 5

db = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean')
delays['labels'] = db.fit_predict(X)
delays['labels'].unique()
array([ 0,  1, -1,  2,  3])
# Details of each cluster
delays.groupby('labels').aggregate({
    'arrival_delay': ['min', 'max', 'mean', 'std'],
    'departure_delay': ['min', 'max', 'mean', 'std']
})
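| labels | arrival_delay min | arrival_delay max | arrival_delay mean | arrival_delay std | departure_delay min | departure_delay max | departure_delay mean | departure_delay std |
|---|---|---|---|---|---|---|---|---|
| -1 | 0.000000 | 113.750000 | 34.124167 | 23.680011 | 0.071429 | 113.833333 | 33.884537 | 24.351357 |
| 0 | 1.470588 | 8.000000 | 4.306110 | 2.471251 | 1.500000 | 8.368421 | 4.545647 | 2.490323 |
| 1 | 0.000000 | 50.615385 | 13.112440 | 9.528010 | 0.000000 | 50.615385 | 13.106862 | 9.426237 |
| 2 | 26.833333 | 35.368421 | 30.642969 | 2.950345 | 26.367647 | 35.894737 | 29.186663 | 3.379501 |
| 3 | 30.615385 | 44.200000 | 38.826864 | 4.979863 | 31.000000 | 45.500000 | 39.613117 | 5.290575 |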

Cluster Analysis

Cluster -1 (Noise, color:black)

  • Arrival Delay: Mean ≈ 34.12 min, max up to 113.75 min
  • Departure Delay: Mean ≈ 33.88 min, max up to 113.83 min
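  • Interpretation: These stations do not have enough close neighbours in the combined delay-and-location space to join any cluster, so DBSCAN labels them as noise. The group is mixed: average arrival delays range from 0 to 113.75 min, and it includes the most extreme cases in the dataset.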

Cluster 0 (Minimal Delays, color: green, mostly near Patna)

  • Arrival Delay: Mean ≈ 4.31 min, max ≈ 8 min
  • Departure Delay: Mean ≈ 4.55 min
  • Interpretation: Stations in this cluster are highly efficient with almost negligible delays. Likely major hubs or well-managed routes.

Cluster 1 (Moderate Delays, color: blue, most of India)

  • Arrival Delay: Mean ≈ 13.11 min, max ≈ 50.62 min
  • Departure Delay: Mean ≈ 13.11 min
  • Interpretation: These stations experience moderate delays, possibly due to congestion or operational constraints.

Cluster 2 (High Delays, color: orange, Bilaspur-Raigarh stretch)

  • Arrival Delay: Mean ≈ 30.64 min, max ≈ 35.37 min
  • Departure Delay: Mean ≈ 29.19 min
  • Interpretation: Stations consistently facing high delays. These could be bottlenecks in the network or areas with infrastructure limitations.

Cluster 3 (Severe Delays, color: red, Ongole-Khammam-Nalgonda stretch)

  • Arrival Delay: Mean ≈ 38.83 min, max ≈ 44.20 min
  • Departure Delay: Mean ≈ 39.61 min
  • Interpretation: Chronic delay hotspots. Likely major junctions with heavy traffic or systemic scheduling issues.

Key Insights

  1. Cluster 0 represents the best-performing stations.
  2. Clusters 2 and 3 highlight critical problem areas needing immediate attention.
  3. Noise (-1) includes extreme anomalies—these might require separate investigation.

Visualizing the Results

Using GeoPandas and Matplotlib, we plotted:
1. Clusters on an India map with color-coded labels.
2. Delay intensity using a heatmap-style scatter plot.

These visualizations make it easy to identify regions where delays are concentrated, such as major junctions or congested routes.

# Set the default colors
my_colors_list = {-1: 'black', 0: 'green', 1: 'blue', 2: 'orange', 3: 'red'}
delays['label_colors'] = delays.labels.map(my_colors_list)
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Has India map with state boundaries
map_gdf = gpd.read_file('https://gist.githubusercontent.com/jbrobst/56c13bbbf9d97d187fea01ca62ea5112/raw/e388c4cae20aa53cb5090210a42ebb9b765c0a36/india_states.geojson')

fig, ax = plt.subplots(figsize=(10, 10))
map_gdf.plot(ax=ax, color='white', edgecolor='lightgrey')
delays.plot(x='long', y='lat', c = 'label_colors', kind='scatter', alpha=0.75, ax=ax)
plt.title("Clusters of train stations based on delays on 13/14-12-2025")

# Add legends
legend_spec = [
    ("black",  "Outliers"),
    ("green",  "No delays"),
    ("blue",   "Average delays"),
    ("orange", "High delays"),
    ("red",    "Severe delays"),
]

# Build legend handles
handles = [Line2D([0], [0],marker='o', color='w', label=label, markerfacecolor=color, markersize=10) for color, label in legend_spec]
ax.legend(handles=handles, title="Cluster meaning", loc="upper right", frameon=True)
plt.show()

[Figure: Clusters of train stations based on delays on 13/14-12-2025, plotted on a map of India]

fig, ax = plt.subplots(figsize=(10, 10))
map_gdf.plot(ax=ax, color='white', edgecolor='lightgrey')
delays.plot(x='long', y='lat', s = 'delay_status', c = 'arrival_delay', colormap='Reds', kind='scatter', ax=ax)
plt.title("Average delays on 13/14-12-2025")
plt.show()

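[Figure: Average delays on 13/14-12-2025, plotted on a map of India; marker size shows the number of stops and color shows the average arrival delay]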

Key Insights

  1. Stations like Khammam and Raigarh had high average delays.
  2. Clusters often formed around busy corridors, indicating systemic congestion.
  3. Outliers represented stations with unusually high delays, possibly due to local operational issues.

This blog demonstrates how clustering techniques like DBSCAN can uncover delay hotspots and operational bottlenecks in India’s railway network. Because the data here covers a single day of running status, the natural next step is to extend the analysis across longer timeframes and incorporate additional factors such as weather, seasonal demand, and maintenance schedules. These enhancements would enable predictive modeling for proactive scheduling and resource allocation. Operationally, Indian Railways could use these insights to prioritize infrastructure upgrades in chronic delay clusters, implement dynamic timetabling, and improve passenger communication for better reliability and customer experience.

(Blog text improved using Gen AI)
