Analyzing Train Delays in India Using DBSCAN¶
Use Case Overview¶
Train delays are a common challenge in India’s vast railway network, impacting millions of passengers daily. Understanding patterns behind these delays can help optimize scheduling and improve operational efficiency. In this blog, we collected real-time train running status data, calculated delay metrics, and applied clustering techniques to identify geographical and temporal patterns of delays across stations. The goal is to uncover hotspots where delays frequently occur and analyze their severity.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
Web Scraping with BeautifulSoup¶
To gather train delay data, we used BeautifulSoup, a Python library for parsing HTML and XML documents. The data was sourced from runningstatus.in, which provides live train status updates. BeautifulSoup allowed us to:
1. Extract structured data from HTML tables containing station names, scheduled and actual arrival/departure times, and delay information.
2. Handle inconsistencies in HTML tags and parse nested elements like <abbr> for station codes and <small> for delay status.
3. Convert the extracted data into a pandas DataFrame for further analysis.
This approach automated data collection for thousands of trains, creating a rich dataset of over 26,000 rows.
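As a rough illustration of the three steps above, the sketch below parses a single hand-written table row. The HTML snippet is hypothetical: it only mimics the `<abbr>` and `<small>` structure described here and is not copied from runningstatus.in, so the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above;
# the real page may use different tags or attributes.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <td><abbr title="KYN">Kalyan</abbr> <small>58 Km/Hr</small></td>
      <td>09:12PM / 09:35PM</td>
      <td>09:15PM / 09:45PM</td>
      <td>PF: 4</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
for tr in soup.find("tbody").find_all("tr"):
    tds = tr.find_all("td")
    abbr = tds[0].find("abbr")    # station name, with the code in its title attribute
    small = tds[0].find("small")  # status text shown under the station name
    records.append({
        "station_name": abbr.get_text(strip=True),
        "station_code": abbr["title"],
        "delay_status": small.get_text(strip=True) if small else None,
        "arrival": tds[1].get_text(" ", strip=True),
        "departure": tds[2].get_text(" ", strip=True),
        "pf": tds[3].get_text(strip=True),
    })
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)` to get one row per station.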
For example, the live running status page for the Janmabhoomi Express shows the following:

The corresponding HTML code is:

def get_table(train_number):
    # The link takes two inputs: train number and date in YYYYMMDD format (here 2025-12-13)
    link = 'https://runningstatus.in/status/' + str(train_number) + '-on-20251213'
    try:
        response = requests.get(link, timeout=15)
    except requests.RequestException:
        # Timeout happens if there is no train for the train number on the date
        print("timeout for", train_number)
        return pd.DataFrame()
    if response.status_code == 200:
        page_content = response.text
        soup = BeautifulSoup(page_content, 'html.parser')
    else:
        print(train_number, 'Failed')
        return pd.DataFrame()
    table = soup.find("table")
    if table is None:
        # The page loaded but contains no status table
        print(train_number, 'Failed')
        return pd.DataFrame()
    last_updated = None  # parsed for reference; not used further below
    for thead in table.find_all("thead"):
        # thead contains the header row in the table
        last_row = thead.find("tr")
        # tr is the row
        if last_row:
            td = last_row.find("td", colspan=True)
            if td and "Last Updated" in td.get_text(strip=True):
                last_updated = td.get_text(strip=True)
                break
    rows = []
    tbody = table.find("tbody")
    body_rows = tbody.find_all("tr") if tbody else []
    for tr in body_rows:
        # td is every cell (table data)
        tds = tr.find_all("td")
        if len(tds) != 4:
            # each station row should have exactly 4 cells
            continue
        # --- Station cell ---
        station_cell = tds[0]
        abbr = station_cell.find("abbr")  # this is the station name abbreviation
        station_name = abbr.get_text(strip=True) if abbr else station_cell.get_text(strip=True)
        station_code = abbr["title"] if (abbr and abbr.has_attr("title")) else None
        delay_small = station_cell.find("small")  # this is delay status
        delay_status = delay_small.get_text(strip=True) if delay_small else None
        # --- Arrival cell ---
        arrival_cell = tds[1]
        arrival_text = arrival_cell.get_text(" ", strip=True)  # e.g., "02:11 PM / 02:11 PM" or "Source"
        arrival_status_tag = arrival_cell.find("span")
        arrival_status = arrival_status_tag.get_text(strip=True) if arrival_status_tag else None
        # Parse arrival times if present
        arr_sch, arr_act = None, None
        if "/" in arrival_text:
            parts = [p.strip() for p in arrival_text.split("/")]
            if len(parts) == 2:
                arr_sch, arr_act = parts
        # --- Departure cell ---
        departure_cell = tds[2]
        departure_text = departure_cell.get_text(" ", strip=True)  # e.g., "02:12 PM / 02:12 PM" or "Destination"
        departure_status_tag = departure_cell.find("span")
        departure_status = departure_status_tag.get_text(strip=True) if departure_status_tag else None
        # Parse departure times if present
        dep_sch, dep_act = None, None
        if "/" in departure_text:
            parts = [p.strip() for p in departure_text.split("/")]
            if len(parts) == 2:
                dep_sch, dep_act = parts
        # --- PF (platform) cell ---
        pf = tds[3].get_text(strip=True)
        rows.append({
            "Station Name": station_name,
            "Station Code": station_code,
            "Delay Status": delay_status,
            "Arrival Scheduled": arr_sch,
            "Arrival Actual": arr_act,
            "Arrival Status": arrival_status,
            "Departure Scheduled": dep_sch,
            "Departure Actual": dep_act,
            "Departure Status": departure_status,  # e.g., "Destination"
            "PF": pf
        })
    df = pd.DataFrame(rows)
    df['train_number'] = train_number
    print(train_number)
    return df
Superfast express trains usually have numbers of the form 12XXX or 22XXX. This analysis focuses on these trains, as they operate regularly and cover major stations.
train_details = pd.DataFrame()
for train_number in range(12101, 12999):
    train_details = pd.concat([train_details, get_table(train_number)], ignore_index=False)
for train_number in range(22101, 22999):
    train_details = pd.concat([train_details, get_table(train_number)], ignore_index=False)
train_details
| | Station Name | Station Code | Delay Status | Arrival Scheduled | Arrival Actual | Arrival Status | Departure Scheduled | Departure Actual | Departure Status | PF | train_number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lokmanyatilak | LTT | None | None | None | Source | 08:35PM | 08:36PM 01M Late | 08:36PM | PF: 3 | 12101 |
| 1 | Kalyan | KYN | 58 Km/Hr | 09:12PM | 09:35PM 23M Late | 09:35PM | 09:15PM | 09:45PM 30M Late | 09:45PM | PF: 4 | 12101 |
| 2 | Bhusaval | BSL | 30 Km/Hr | 02:50AM | 03:20AM 30M Late | 03:20AM | 02:55AM | 03:28AM 33M Late | 03:28AM | PF: 5 | 12101 |
| 3 | Akola | AK | 120 Km/Hr | 04:50AM | 05:20AM 30M Late | 05:20AM | 04:55AM | 05:24AM 29M Late | 05:24AM | PF: 2 | 12101 |
| 4 | Badnera | BD | 105 Km/Hr | 06:20AM | 06:36AM 16M Late | 06:36AM | 06:23AM | 06:39AM 16M Late | 06:39AM | PF: 2 | 12101 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27 | Keshorai Patan | KPTN | 126 Km/Hr | 09:23AM | 09:37AM 14M Late | 09:37AM | 09:25AM | 09:43AM 18M Late | 09:43AM | PF: 1 | 22998 |
| 28 | Kota | KOTA | 95 Km/Hr | 09:50AM | 09:57AM 07M Late | 09:57AM | 10:00AM | 10:08AM 08M Late | 10:08AM | PF: 2 | 22998 |
| 29 | New Kota | NKOT | 64 Km/Hr | 10:13AM | 10:21AM 08M Late | 10:21AM | 10:15AM | 10:24AM 09M Late | 10:24AM | PF: 2 | 22998 |
| 30 | Ramganj Mandi | RMA | 121 Km/Hr | 11:03AM | 11:14AM 11M Late | 11:14AM | 11:05AM | 11:19AM 14M Late | 11:19AM | PF: 1 | 22998 |
| 31 | Jhalawar City | JLWC | 55 Km/Hr | 12:05PM | 12:02PM No Delay | 12:02PM | None | None | Destination | PF: 1 | 22998 |
26152 rows × 11 columns
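Growing a DataFrame with `pd.concat` inside a loop copies the accumulated data on every iteration. A common alternative, sketched below, collects the per-train frames in a list and concatenates once at the end. `fetch_stub` is a hypothetical stand-in for `get_table` so the example runs offline, and the optional pause between requests is an assumption about polite scraping rather than something the site documents.

```python
import time
import pandas as pd

def fetch_stub(train_number):
    """Hypothetical stand-in for get_table(); returns a tiny frame without any network call."""
    if train_number % 2:  # pretend odd train numbers have no running train
        return pd.DataFrame()
    return pd.DataFrame({"Station Name": ["A", "B"], "train_number": train_number})

def collect(train_numbers, fetch=fetch_stub, pause_seconds=0.0):
    """Fetch every train once and concatenate the non-empty results in a single step."""
    frames = []
    for number in train_numbers:
        frame = fetch(number)
        if not frame.empty:
            frames.append(frame)
        time.sleep(pause_seconds)  # optional pause between requests
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

sample = collect(range(12101, 12105))
```

Note that this sketch uses `ignore_index=True`, which gives one continuous row index, whereas the loop above keeps each train's own 0-based index.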
train_details.columns = ['station_name', 'station_code', 'delay_status', 'scheduled_arrival', 'actual_arrival', 'arrival_status', 'scheduled_departure', 'actual_departure', 'departure_status', 'pf', 'train_number']
Creating variables such as delay in departure and arrival times.
train_details[['actual_departure', 'delay_departure']] = train_details['actual_departure'].str.split('M', n=1, expand=True)
train_details[['actual_arrival', 'arrival_departure']] = train_details['actual_arrival'].str.split('M', n=1, expand=True)
time_cols = ['scheduled_arrival', 'actual_arrival', 'scheduled_departure', 'actual_departure']
train_details['actual_departure'] = train_details['actual_departure'] + 'M'
train_details['actual_arrival'] = train_details['actual_arrival'] + 'M'
for time_col in time_cols:
    train_details[time_col] = pd.to_datetime(train_details[time_col])
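The string handling above can also be sketched in plain Python with a regular expression. This is an illustrative alternative, not the notebook's method: `split_status` is a hypothetical helper, and the input formats it accepts ("09:35PM 23M Late", "03:28PM 7H 53M Late", "12:02PM No Delay") are assumed from the values visible in the scraped table.

```python
import re
from datetime import datetime

# A clock time such as "09:35PM", optionally followed by a note such as "23M Late"
STATUS_RE = re.compile(r"^(?P<clock>\d{1,2}:\d{2}\s*[AP]M)\s*(?P<note>.*)$")

def split_status(text):
    """Split a status string into (time of day, reported delay in minutes)."""
    match = STATUS_RE.match(text.strip())
    if match is None:
        return None, None  # e.g. "Source" or "Destination"
    clock_text = match.group("clock").replace(" ", "")
    clock = datetime.strptime(clock_text, "%I:%M%p").time()
    note = match.group("note")
    hours = re.search(r"(\d+)\s*H", note)
    minutes = re.search(r"(\d+)\s*M", note)
    if hours is None and minutes is None:
        # No numeric delay in the note: zero if the site says so, otherwise unknown
        reported = 0 if "No Delay" in note else None
    else:
        reported = 60 * int(hours.group(1) if hours else 0)
        reported += int(minutes.group(1) if minutes else 0)
    return clock, reported
```

The reported delay from the site can then be compared against the delay computed from the scheduled and actual times as a sanity check.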
Creating delay variables
train_details['arrival_delay'] = (train_details.actual_arrival - train_details.scheduled_arrival)/pd.Timedelta(minutes=1)
train_details['departure_delay'] = (train_details.actual_departure - train_details.scheduled_departure)/pd.Timedelta(minutes=1)
train_details['arrival_delay'] = np.maximum(0, train_details['arrival_delay'].fillna(0))
train_details['departure_delay'] = np.maximum(0, train_details['departure_delay'].fillna(0))
train_details
# Filtering for midnight issues:
# When the day changes and the arrival is just after midnight, the delay is shown as 24*60=1440 rather than the actual delay
train_details.loc[train_details.arrival_delay >= 1350, 'arrival_delay'] = 1440 - train_details[train_details.arrival_delay >= 1350].arrival_delay
train_details.sort_values('arrival_delay', ascending=True).tail(25)
| | Unnamed: 0 | station_name | station_code | delay_status | scheduled_arrival | actual_arrival | arrival_status | scheduled_departure | actual_departure | departure_status | pf | train_number | delay_departure | arrival_departure | arrival_delay | departure_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3491 | 32 | Howrah | HWH | 98 Km/Hr | 2025-12-14 07:35:00 | 2025-12-14 15:28:00 | 03:28PM | NaT | NaT | Destination | PF: 10 | 12334 | NaN | 7H 53M Late | 473.0 | 0.0 |
| 13082 | 15 | Rourkela | ROU | -4 Km/Hr | 2025-12-14 12:57:00 | 2025-12-14 20:57:00 | 08:57PM | 2025-12-14 13:05:00 | 2025-12-14 21:41:00 | 09:41PM | PF: 3 | 12871 | 8H 36M Late | 8H Late | 480.0 | 516.0 |
| 3490 | 31 | Bardhaman | BWN | 4 Km/Hr | 2025-12-14 05:46:00 | 2025-12-14 13:53:00 | 01:53PM | 2025-12-14 05:50:00 | 2025-12-14 13:56:00 | 01:56PM | PF: 3 | 12334 | 8H 6M Late | 8H 7M Late | 487.0 | 486.0 |
| 4571 | 33 | Mathura | MTJ | 91 Km/Hr | 2025-12-14 02:55:00 | 2025-12-14 11:05:00 | 11:05AM | 2025-12-14 02:57:00 | 2025-12-14 11:08:00 | 11:08AM | PF: 2 | 12409 | 8H 11M Late | 8H 10M Late | 490.0 | 491.0 |
| 23263 | 18 | Joychandi Pahar | JOC | 21 Km/Hr | 2025-12-14 06:50:00 | 2025-12-14 15:10:00 | 03:10PM | 2025-12-14 06:52:00 | 2025-12-14 15:36:00 | 03:36PM | - | 22844 | 8H 44M Late | 8H 20M Late | 500.0 | 524.0 |
| 975 | 5 | Bina | BINA | 13 Km/Hr | 2025-12-14 00:50:00 | 2025-12-14 09:11:00 | 09:11AM | 2025-12-14 00:53:00 | 2025-12-14 09:15:00 | 09:15AM | PF: 3 | 12162 | 8H 22M Late | 8H 21M Late | 501.0 | 502.0 |
| 842 | 5 | Adra | ADRA | 101 Km/Hr | 2025-12-14 00:05:00 | 2025-12-14 09:10:00 | 09:10AM | 2025-12-14 00:10:00 | 2025-12-14 09:17:00 | 09:17AM | PF: 1 | 12152 | 9H 7M Late | 9H 5M Late | 545.0 | 547.0 |
| 977 | 7 | Itarsi | ET | 96 Km/Hr | 2025-12-14 04:15:00 | 2025-12-14 13:20:00 | 01:20PM | 2025-12-14 04:20:00 | 2025-12-14 13:23:00 | 01:23PM | PF: 2 | 12162 | 9H 3M Late | 9H 5M Late | 545.0 | 543.0 |
| 976 | 6 | Rani Kamalapati | RKMP | 91 Km/Hr | 2025-12-14 02:40:00 | 2025-12-14 11:46:00 | 11:46AM | 2025-12-14 02:50:00 | 2025-12-14 11:56:00 | 11:56AM | PF: 5 | 12162 | 9H 6M Late | 9H 6M Late | 546.0 | 546.0 |
| 978 | 8 | Timarni | TBN | 130 Km/Hr | 2025-12-14 04:58:00 | 2025-12-14 14:09:00 | 02:09PM | 2025-12-14 05:00:00 | 2025-12-14 14:10:00 | 02:10PM | PF: 1 | 12162 | 9H 10M Late | 9H 11M Late | 551.0 | 550.0 |
| 4572 | 34 | Hazrat Nizamuddin | NZM | 68 Km/Hr | 2025-12-14 05:00:00 | 2025-12-14 14:11:00 | 02:11PM | NaT | NaT | Destination | PF: 2 | 12409 | NaN | 9H 11M Late | 551.0 | 0.0 |
| 23264 | 19 | Purulia | PRR | 55 Km/Hr | 2025-12-14 07:30:00 | 2025-12-14 16:41:00 | 04:41PM | 2025-12-14 07:32:00 | 2025-12-14 16:45:00 | 04:45PM | PF: 3 | 22844 | 9H 13M Late | 9H 11M Late | 551.0 | 553.0 |
| 979 | 9 | Harda | HD | 83 Km/Hr | 2025-12-14 05:11:00 | 2025-12-14 14:26:00 | 02:26PM | 2025-12-14 05:13:00 | 2025-12-14 14:28:00 | 02:28PM | PF: 3 | 12162 | 9H 15M Late | 9H 15M Late | 555.0 | 555.0 |
| 13083 | 16 | Raj Gangpur | GP | 40 Km/Hr | 2025-12-14 13:30:00 | 2025-12-14 22:48:00 | 10:48PM | 2025-12-14 13:32:00 | 2025-12-14 22:51:00 | 10:51PM | PF: 1 | 12871 | 9H 19M Late | 9H 18M Late | 558.0 | 559.0 |
| 843 | 6 | Purulia | PRR | 55 Km/Hr | 2025-12-14 00:55:00 | 2025-12-14 10:22:00 | 10:22AM | 2025-12-14 01:00:00 | 2025-12-14 10:25:00 | 10:25AM | PF: 3 | 12152 | 9H 25M Late | 9H 27M Late | 567.0 | 565.0 |
| 13084 | 17 | Garpos | GPH | 53 Km/Hr | 2025-12-14 13:56:00 | 2025-12-14 23:24:00 | 11:24PM | 2025-12-14 13:57:00 | 2025-12-14 23:25:00 | 11:25PM | PF: 1 | 12871 | 9H 28M Late | 9H 28M Late | 568.0 | 568.0 |
| 980 | 10 | Khandwa Junction | KNW | 91 Km/Hr | 2025-12-14 06:43:00 | 2025-12-14 16:20:00 | 04:20PM | 2025-12-14 06:45:00 | 2025-12-14 16:25:00 | 04:25PM | PF: 5 | 12162 | 9H 40M Late | 9H 37M Late | 577.0 | 580.0 |
| 981 | 11 | Burhanpur | BAU | 130 Km/Hr | 2025-12-14 07:38:00 | 2025-12-14 17:16:00 | 05:16PM | 2025-12-14 07:40:00 | 2025-12-14 17:18:00 | 05:18PM | PF: 1 | 12162 | 9H 38M Late | 9H 38M Late | 578.0 | 578.0 |
| 13085 | 18 | Bamra | BMB | 109 Km/Hr | 2025-12-14 14:01:00 | 2025-12-14 23:39:00 | 11:39PM | 2025-12-14 14:02:00 | 2025-12-14 23:41:00 | 11:41PM | PF: 1 | 12871 | 9H 39M Late | 9H 38M Late | 578.0 | 579.0 |
| 982 | 12 | Bhusaval | BSL | 115 Km/Hr | 2025-12-14 08:25:00 | 2025-12-14 18:03:00 | 06:03PM | 2025-12-14 08:30:00 | 2025-12-14 08:30:00 | 08:30AM | PF: 4 | 12162 | 9H 38M Late | 9H 38M Late | 578.0 | 0.0 |
| 13086 | 19 | Bagdihi | BEH | 128 Km/Hr | 2025-12-14 14:13:00 | 2025-12-14 23:53:00 | 11:53PM | 2025-12-14 14:14:00 | 2025-12-14 23:53:00 | 11:53PM | PF: 2 | 12871 | 9H 39M Late | 9H 40M Late | 580.0 | 579.0 |
| 844 | 7 | Chakaradharpur | CKP | 85 Km/Hr | 2025-12-14 02:55:00 | 2025-12-14 12:37:00 | 12:37PM | 2025-12-14 02:57:00 | 2025-12-14 12:41:00 | 12:41PM | PF: 1 | 12152 | 9H 44M Late | 9H 42M Late | 582.0 | 584.0 |
| 23265 | 20 | Tatanagar | TATA | 53 Km/Hr | 2025-12-14 09:30:00 | 2025-12-14 19:21:00 | 07:21PM | 2025-12-14 09:55:00 | 2025-12-14 09:55:00 | 09:55AM | PF: 2 | 22844 | 9H 51M Late | 9H 51M Late | 591.0 | 0.0 |
| 845 | 8 | Rourkela | ROU | 78 Km/Hr | 2025-12-14 04:20:00 | 2025-12-14 14:42:00 | 02:42PM | 2025-12-14 04:28:00 | 2025-12-14 14:49:00 | 02:49PM | PF: 3 | 12152 | 10H 21M Late | 10H 22M Late | 622.0 | 621.0 |
| 846 | 9 | Jharsuguda | JSG | 71 Km/Hr | 2025-12-14 06:13:00 | 2025-12-14 17:03:00 | 05:03PM | 2025-12-14 06:15:00 | 2025-12-14 17:08:00 | 05:08PM | PF: 1 | 12152 | 10H 53M Late | 10H 50M Late | 650.0 | 653.0 |
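Because the scraped values are times of day without a date, a scheduled time just before midnight and an actual time just after it produce a large negative difference, and the reverse case produces a difference close to 1440 minutes. One way to handle both cases is to wrap the difference into a 24-hour window. The sketch below assumes that no train is more than 12 hours late or early (the largest delay in the table above is about 11 hours), and `wrapped_delay` is a hypothetical helper rather than part of the notebook.

```python
from datetime import time

MINUTES_PER_DAY = 24 * 60
HALF_DAY = MINUTES_PER_DAY // 2

def minutes_of_day(t):
    """Minutes elapsed since midnight for a time of day."""
    return t.hour * 60 + t.minute

def wrapped_delay(scheduled, actual):
    """Signed delay in minutes between two times of day, wrapped into [-720, 720)."""
    diff = minutes_of_day(actual) - minutes_of_day(scheduled)
    return (diff + HALF_DAY) % MINUTES_PER_DAY - HALF_DAY

# Scheduled 11:50 PM, arrived 12:05 AM the next day: 15 minutes late
late_after_midnight = wrapped_delay(time(23, 50), time(0, 5))
# Scheduled 12:05 AM, arrived 11:55 PM the previous day: 10 minutes early
early_before_midnight = wrapped_delay(time(0, 5), time(23, 55))
```

Early arrivals come out negative, so they can still be clipped to zero with `max(0, ...)` as in the notebook.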
Creating a delay dataset at the station level, ignoring small stations (those with fewer than 10 train stops in a day).
delays = train_details.groupby(['station_name', 'station_code']).aggregate({
'arrival_delay': 'mean',
'departure_delay': 'mean',
'delay_status': 'count'
}).reset_index()
delays = delays[delays.delay_status>=10].reset_index(drop=True)
delays.sort_values('delay_status', ascending=False)
| | station_name | station_code | arrival_delay | departure_delay | delay_status |
|---|---|---|---|---|---|
| 556 | Vijayawada | BZA | 6.866337 | 7.564356 | 187 |
| 514 | Surat | ST | 5.010204 | 5.663265 | 187 |
| 94 | Bhusaval | BSL | 13.020513 | 11.815385 | 185 |
| 229 | Itarsi | ET | 10.010471 | 10.460733 | 183 |
| 550 | Vadodara | BRC | 7.391753 | 6.835052 | 182 |
| ... | ... | ... | ... | ... | ... |
| 215 | Harpalpur | HPP | 17.600000 | 18.100000 | 10 |
| 220 | Hindaun City | HAN | 37.400000 | 37.100000 | 10 |
| 241 | Jamtara | JMT | 59.900000 | 60.000000 | 10 |
| 214 | Harihar | HRR | 2.300000 | 2.300000 | 10 |
| 40 | Aunrihar | ARJ | 36.600000 | 29.200000 | 10 |
577 rows × 5 columns
Geocoding with Google Maps API¶
To visualize delays geographically, we needed latitude and longitude for each station. The Google Geocoding API was used to convert station names into coordinates:
1. Constructed queries like "Station Name (Code) Railway station, India" to improve accuracy.
2. Applied region bias (region='in') to ensure results were relevant to India.
3. Parsed JSON responses to extract lat and lng values.
This step enabled mapping delays across India, providing spatial insights into where clusters of delays occur.
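The three steps can be sketched without any network call, as below. `build_geocode_url` and `extract_lat_lng` are hypothetical helper names, `YOUR_API_KEY` is a placeholder, and the sample payload is hand-written to follow the `results[0].geometry.location` shape of a Geocoding API response, reusing the coordinates reported for Abu Road later in this post.

```python
import json
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(station_name, station_code, api_key, region="in"):
    """Build the request URL for one station, with a region bias towards India."""
    address = f"{station_name} ({station_code}) Railway station, India"
    query = urlencode({"address": address, "key": api_key, "region": region})
    return f"{GEOCODE_URL}?{query}"

def extract_lat_lng(payload):
    """Return (lat, lng) from a decoded response, or None if there is no usable result."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written payload for illustration only; a real response has many more fields
SAMPLE_PAYLOAD = json.loads(
    '{"status": "OK", "results": [{"geometry": '
    '{"location": {"lat": 24.480749, "lng": 72.785071}}}]}'
)

url = build_geocode_url("Abu Road", "ABR", "YOUR_API_KEY")
coords = extract_lat_lng(SAMPLE_PAYLOAD)
```

Returning `None` for a failed lookup, instead of a coordinate pair, makes missing stations easy to spot before they reach the map or the clustering step.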
import config  # assumed to be a local module that stores the API key as config.gmaps_key
import requests
def get_lat_lng(location, gmaps_key=config.gmaps_key, region_bias='in'):
    """
    Returns (lat, lng) for a given location using the Google Geocoding API.
    Raises on network/HTTP errors; returns (0, 0) when the API reports no usable result.
    """
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {
        "address": location,
        "key": gmaps_key,
    }
    # Optional: region bias (e.g., "in" for India) to improve relevance
    if region_bias:
        params["region"] = region_bias
    response = requests.get(url, params=params, timeout=15)
    response.raise_for_status()  # network/HTTP-level errors
    payload = response.json()
    status = payload.get("status")
    if status != "OK":
        # Common statuses: ZERO_RESULTS, OVER_QUERY_LIMIT, REQUEST_DENIED, INVALID_REQUEST
        error_message = payload.get("error_message")
        print("Geocoding failed for", location, status, error_message)
        return 0, 0
    results = payload.get("results", [])
    if not results:
        return 0, 0
    location_obj = results[0]["geometry"]["location"]  # present when status == OK
    lat = location_obj["lat"]
    lng = location_obj["lng"]
    return lat, lng
delays['address'] = delays['station_name']+' ('+ delays.station_code + ') Railway station, India'
delays[['lat','long']] = delays['address'].apply(lambda addr: pd.Series(get_lat_lng(addr), index=["lat", "lng"]))
delays
| | station_name | station_code | arrival_delay | departure_delay | delay_status | address | lat | long |
|---|---|---|---|---|---|---|---|---|
| 0 | Abhaipur | AHA | 5.785714 | 5.857143 | 13 | Abhaipur (AHA) Railway station, India | 25.215607 | 86.322353 |
| 1 | Abu Road | ABR | 2.250000 | 1.975000 | 38 | Abu Road (ABR) Railway station, India | 24.480749 | 72.785071 |
| 2 | Achhnera | AH | 33.800000 | 34.200000 | 17 | Achhnera (AH) Railway station, India | 27.177370 | 77.753791 |
| 3 | Adoni | AD | 16.208333 | 16.541667 | 21 | Adoni (AD) Railway station, India | 15.631882 | 77.275883 |
| 4 | Adra | ADRA | 33.920000 | 28.640000 | 24 | Adra (ADRA) Railway station, India | 23.496133 | 86.676755 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 572 | Washim | WHM | 0.687500 | 0.875000 | 14 | Washim (WHM) Railway station, India | 20.103064 | 77.148166 |
| 573 | Yadgir | YG | 19.050000 | 19.000000 | 17 | Yadgir (YG) Railway station, India | 16.742595 | 77.133291 |
| 574 | Yelahanka | YNK | 5.352941 | 6.470588 | 16 | Yelahanka (YNK) Railway station, India | 13.104980 | 77.591798 |
| 575 | Yerraguntla | YA | 22.437500 | 23.062500 | 15 | Yerraguntla (YA) Railway station, India | 14.645948 | 78.549873 |
| 576 | Yesvantpur | YPR | 1.421053 | 2.605263 | 28 | Yesvantpur (YPR) Railway station, India | 13.023212 | 77.551373 |
577 rows × 8 columns
# Fixing mapping issues
delays.loc[delays.address == 'Kareli (KY) Railway station, India', ['lat', 'long']] = [22.931886, 79.066079]
Clustering with DBSCAN¶
For clustering, we used DBSCAN (Density-Based Spatial Clustering of Applications with Noise) from scikit-learn. DBSCAN is ideal for this use case because:
1. It identifies clusters of stations with similar delay patterns without requiring the number of clusters upfront.
2. Handles noise effectively, marking outliers (stations with unique delay behavior) with a label of -1.
3. Works well with spatial data when combined with delay metrics.
To ensure fair clustering, we normalized arrival delays and station coordinates using MinMaxScaler before applying DBSCAN with:
1. eps = 0.07 (distance threshold in normalized units)
2. min_samples = 5 (minimum points to form a cluster)
The result: stations grouped into clusters based on delay severity and geographic proximity, revealing hotspots of chronic delays.
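To make the behaviour of these two parameters concrete, here is a small sketch on synthetic data. The points are invented purely for illustration (two tight groups of stations plus one isolated station) and are not taken from the scraped dataset; only `eps = 0.07` and `min_samples = 5` mirror the settings above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# Synthetic (arrival_delay, lat, long) rows, invented for illustration
low_delay_group = [[5 + 0.2 * i, 25.0 + 0.01 * i, 85.0 + 0.01 * i] for i in range(6)]
high_delay_group = [[40 + 0.2 * i, 16.0 + 0.01 * i, 80.0 + 0.01 * i] for i in range(6)]
isolated_station = [[100.0, 20.0, 75.0]]
points = np.array(low_delay_group + high_delay_group + isolated_station)

# Scale every feature to [0, 1] so that eps applies equally to delay and coordinates
scaled = MinMaxScaler().fit_transform(points)

# Same settings as above: eps in normalized units, at least 5 points per dense region
labels = DBSCAN(eps=0.07, min_samples=5).fit_predict(scaled)
```

Each tight group becomes its own cluster, while the isolated station has no neighbours within `eps` and receives the noise label -1.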
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler
import numpy as np
scaler = MinMaxScaler().set_output(transform="pandas")
X = scaler.fit_transform(delays[['arrival_delay', 'lat', 'long']])
# Run DBSCAN with a single eps, expressed in normalized (min-max scaled) units rather than meters
eps = 0.07
min_samples = 5
db = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean')
delays['labels'] = db.fit_predict(X)
delays['labels'].unique()
array([ 0, 1, -1, 2, 3])
# Details of each cluster
delays.groupby('labels').aggregate({
    'arrival_delay': ['min', 'max', 'mean', 'std'],
    'departure_delay': ['min', 'max', 'mean', 'std']
})
| labels | arrival_delay (min) | arrival_delay (max) | arrival_delay (mean) | arrival_delay (std) | departure_delay (min) | departure_delay (max) | departure_delay (mean) | departure_delay (std) |
|---|---|---|---|---|---|---|---|---|
| -1 | 0.000000 | 113.750000 | 34.124167 | 23.680011 | 0.071429 | 113.833333 | 33.884537 | 24.351357 |
| 0 | 1.470588 | 8.000000 | 4.306110 | 2.471251 | 1.500000 | 8.368421 | 4.545647 | 2.490323 |
| 1 | 0.000000 | 50.615385 | 13.112440 | 9.528010 | 0.000000 | 50.615385 | 13.106862 | 9.426237 |
| 2 | 26.833333 | 35.368421 | 30.642969 | 2.950345 | 26.367647 | 35.894737 | 29.186663 | 3.379501 |
| 3 | 30.615385 | 44.200000 | 38.826864 | 4.979863 | 31.000000 | 45.500000 | 39.613117 | 5.290575 |
Cluster Analysis¶
Cluster -1 (Noise, color: black)¶
- Arrival Delay: Mean ≈ 34.12 min, max up to 113.75 min
- Departure Delay: Mean ≈ 33.88 min, max up to 113.83 min
- Interpretation: These stations do not fit the density pattern of any cluster. The group spans the full range of delays (from 0 to about 114 min) and contains the most extreme cases, so DBSCAN marks them as outliers.
Cluster 0 (Minimal Delays, color: green, mostly near Patna)¶
- Arrival Delay: Mean ≈ 4.31 min, max ≈ 8 min
- Departure Delay: Mean ≈ 4.55 min
- Interpretation: Stations in this cluster are highly efficient with almost negligible delays. Likely major hubs or well-managed routes.
Cluster 1 (Moderate Delays, color: blue, most of India)¶
- Arrival Delay: Mean ≈ 13.11 min, max ≈ 50.62 min
- Departure Delay: Mean ≈ 13.11 min
- Interpretation: These stations experience moderate delays, possibly due to congestion or operational constraints.
Cluster 2 (High Delays, color: orange, Bilaspur-Raigarh stretch)¶
- Arrival Delay: Mean ≈ 30.64 min, max ≈ 35.37 min
- Departure Delay: Mean ≈ 29.19 min
- Interpretation: Stations consistently facing high delays. These could be bottlenecks in the network or areas with infrastructure limitations.
Cluster 3 (Severe Delays, color: red, Ongole-Khammam-Nalgonda stretch)¶
- Arrival Delay: Mean ≈ 38.83 min, max ≈ 44.20 min
- Departure Delay: Mean ≈ 39.61 min
- Interpretation: Chronic delay hotspots. Likely major junctions with heavy traffic or systemic scheduling issues.
Key Insights¶
- Cluster 0 represents the best-performing stations.
- Clusters 2 and 3 highlight critical problem areas needing immediate attention.
- Noise (-1) includes extreme anomalies—these might require separate investigation.
Visualizing the Results¶
Using GeoPandas and Matplotlib, we plotted:
1. Clusters on an India map with color-coded labels.
2. Delay intensity using a heatmap-style scatter plot.
These visualizations make it easy to identify regions where delays are concentrated, such as major junctions or congested routes.
# Set the default colors
my_colors_list = {-1: 'black', 0: 'green', 1: 'blue', 2: 'orange', 3: 'red'}
delays['label_colors'] = delays.labels.map(my_colors_list)
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
# Has India map with state boundaries
map_gdf = gpd.read_file('https://gist.githubusercontent.com/jbrobst/56c13bbbf9d97d187fea01ca62ea5112/raw/e388c4cae20aa53cb5090210a42ebb9b765c0a36/india_states.geojson')
fig, ax = plt.subplots(figsize=(10, 10))
map_gdf.plot(ax=ax, color='white', edgecolor='lightgrey')
delays.plot(x='long', y='lat', c = 'label_colors', kind='scatter', alpha=0.75, ax=ax)
plt.title("Clusters of train stations based on delays on 13/14-12-2025")
# Add legends
legend_spec = [
("black", "Outliers"),
("green", "No delays"),
("blue", "Average delays"),
("orange", "High delays"),
("red", "Severe delays"),
]
# Build legend handles
handles = [Line2D([0], [0],marker='o', color='w', label=label, markerfacecolor=color, markersize=10) for color, label in legend_spec]
ax.legend(handles=handles, title="Cluster meaning", loc="upper right", frameon=True)
plt.show();

fig, ax = plt.subplots(figsize=(10, 10))
map_gdf.plot(ax=ax, color='white', edgecolor='lightgrey')
delays.plot(x='long', y='lat', s = 'delay_status', c = 'arrival_delay', colormap='Reds', kind='scatter', ax=ax)
plt.title("Average delays on 13/14-12-2025")
plt.show()

Key Insights¶
- Stations like Khammam and Raigarh had high average delays.
- Clusters often formed around busy corridors, indicating systemic congestion.
- Outliers represented stations with unusually high delays, possibly due to local operational issues.
This blog demonstrates how clustering techniques like DBSCAN can uncover delay hotspots and operational bottlenecks in India’s railway network. To make this approach more impactful, we could extend the analysis across longer timeframes and incorporate additional factors such as weather, seasonal demand, and maintenance schedules. These enhancements will enable predictive modeling for proactive scheduling and resource allocation. Operationally, Indian Railways can leverage these insights to prioritize infrastructure upgrades in chronic delay clusters, implement dynamic timetabling, and improve passenger communication for better reliability and customer experience.
(Blog text improved using Gen AI)