Machine Learning Algorithms: Clustering

What is Clustering?

Clustering is an unsupervised machine learning technique that groups a set of data points into clusters based on similarity. The goal is for data points within the same cluster to be more similar to each other than to data points in other clusters. Similarity can be measured with various distance metrics, such as Euclidean distance or cosine similarity, depending on the nature of the data.
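
As a quick illustration of the two metrics just mentioned, here is a minimal NumPy sketch (the vectors are toy values, purely illustrative):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean distance: straight-line distance between the two points
euclidean = np.linalg.norm(a - b)

# Cosine similarity: cosine of the angle between the vectors (1.0 means same direction)
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean distance: {euclidean:.3f}")  # ~3.742
print(f"Cosine similarity:  {cosine_sim:.3f}") # 1.000, since b is a scaled copy of a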

1. K-Means Clustering

K-Means partitions the data into k clusters by minimizing the distance between each data point and the centroid of the cluster it is assigned to.
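
To make "minimizing the distance to the centroid" concrete, here is a minimal NumPy sketch of one K-Means iteration (assignment step followed by the centroid update); the toy points, initial centroids, and iteration count are illustrative only:

import numpy as np

# Toy 2-D data and two deliberately poor initial centroids
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

for _ in range(5):
    # Assignment step: each point joins the cluster of its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(distances, axis=1)
    # Update step: each centroid moves to the mean of the points assigned to it
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])

print(labels)     # e.g. [0 0 0 1 1 1]
print(centroids)  # roughly [[1.03, 0.97], [8.0, 8.0]]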

Use case:

Customer segmentation: e-commerce platforms use K-Means clustering to group customers by purchasing behavior for targeted marketing.

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset - Dataset can be downloaded from Kaggle
path = r"Clustering\archive\Mall_Customers.csv"
data = pd.read_csv(path)

# Display the first few rows of the dataset
print(data.head())

# Select the features 'Annual Income' and 'Spending Score'
X = data[['Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the data to normalize it
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_scaled)

# Add the cluster labels to the original dataset
data['Cluster'] = kmeans.labels_

# Plot the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis', marker='o', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='x', label='Centroids')
plt.title("K-Means Clustering (Customer Segmentation)")
plt.xlabel("Annual Income (scaled)")
plt.ylabel("Spending Score (scaled)")
plt.legend()
plt.grid(True)
plt.show()

Visualization:

[Figure: scatter plot of the five customer clusters by scaled annual income and spending score, with the centroids marked as red X's]

2. K-Medians Clustering

K-Medians computes cluster centers using the median instead of the mean, which makes it more robust to outliers.
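
The effect is easiest to see on toy data containing a single outlier; a minimal sketch comparing the median-based center with the mean used by K-Means (the income values are illustrative):

import numpy as np

# Incomes assigned to one cluster, with one extreme outlier
cluster_points = np.array([40000, 42000, 45000, 47000, 500000])

mean_centre = np.mean(cluster_points)      # 134800.0, pulled heavily toward the outlier
median_centre = np.median(cluster_points)  # 45000.0, barely affected

print(f"Mean centre:   {mean_centre:.0f}")
print(f"Median centre: {median_centre:.0f}")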

Use case:

Clustering by median income: K-Medians can group cities by median income, which is useful in economic research.

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = {
    'City': ['City A', 'City B', 'City C', 'City D', 'City E', 'City F', 'City G', 'City H', 'City I', 'City J'],
    'Median Income': [50000, 62000, 58000, 30000, 95000, 42000, 74000, 36000, 87000, 40000],
    'Population': [100000, 150000, 120000, 80000, 250000, 90000, 130000, 70000, 200000, 110000]
}

# Convert to a DataFrame
data = pd.DataFrame(data)

# Display the dataset
print(data)

# Select the relevant features: 'Median Income' and 'Population'
X = data[['Median Income', 'Population']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering as a base
kmeans = KMeans(n_clusters=3, random_state=42).fit(X_scaled)

# Function to compute medians for clusters
def compute_medians(X, labels):
    unique_labels = np.unique(labels)
    medians = np.zeros((len(unique_labels), X.shape[1]))
    for label in unique_labels:
        cluster_points = X[labels == label]
        medians[label] = np.median(cluster_points, axis=0)
    return medians

# Iterate to approximate K-Medians
for _ in range(5): # Run iterations to refine medians
    labels = kmeans.labels_
    medians = compute_medians(X_scaled, labels)

    # Update KMeans with the computed medians as initial centroids
    kmeans = KMeans(n_clusters=3, init=medians, n_init=1, random_state=42)
    kmeans.fit(X_scaled)

# Add cluster labels to the original dataset
data['Cluster'] = kmeans.labels_

# Plot the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis', marker='o', s=100)
plt.scatter(medians[:, 0], medians[:, 1], s=300, c='red', marker='x', label='Medians')

# Annotate the points with city names
for i, city in enumerate(data['City']):
    plt.text(X_scaled[i, 0], X_scaled[i, 1], city, fontsize=9, ha='right', color='black')

plt.title("Approximate K-Medians Clustering (Synthetic Data: Median Income and Population)")
plt.xlabel("Median Income (scaled)")
plt.ylabel("Population (scaled)")
plt.legend()
plt.grid(True)
plt.show()

Visualization:

[Figure: scatter plot of the ten cities by scaled median income and population, colored by cluster, with the cluster medians marked as red X's]

3. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

BIRCH clusters incoming data incrementally, which makes it very efficient on large datasets.
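
Because BIRCH summarizes the data in a compact CF-tree, scikit-learn's Birch also exposes partial_fit for incremental updates, which is what makes it usable on streams; a minimal sketch with synthetic batches (the data and parameters are illustrative, not part of the example below):

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(42)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
birch = Birch(n_clusters=3)

# Feed the model one small batch at a time, as readings might arrive from IoT sensors
for i in range(6):
    batch = rng.normal(loc=centres[i % 3], scale=0.5, size=(50, 2))
    birch.partial_fit(batch)

# After streaming, assign cluster labels to newly arriving points
new_points = np.array([[0.2, -0.1], [5.1, 4.8], [9.7, 0.3]])
print(birch.predict(new_points))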

Use case:

Real-time data clustering: BIRCH is well suited to clustering data in real-time applications such as IoT systems.

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

# Create synthetic dataset representing cities, population, and sensor data (e.g., temperature or energy consumption)
data = {
    'City': ['City A', 'City B', 'City C', 'City D', 'City E', 'City F', 'City G', 'City H', 'City I', 'City J'],
    'Population': [100000, 150000, 120000, 80000, 250000, 90000, 130000, 70000, 200000, 110000],
    'Average Temperature (°C)': [22.4, 25.8, 21.3, 18.9, 27.5, 19.6, 23.1, 15.7, 26.3, 20.8],  # Synthetic sensor data
    'Energy Consumption (MWh)': [1200, 1500, 1100, 800, 2000, 900, 1300, 600, 1900, 950]  # Another form of sensor data
}

# Convert to a DataFrame
data = pd.DataFrame(data)

# Display the synthetic dataset
print(data)

# Select the relevant features: 'Population', 'Average Temperature', and 'Energy Consumption'
X = data[['Population', 'Average Temperature (°C)', 'Energy Consumption (MWh)']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply BIRCH clustering
birch_model = Birch(n_clusters=3)  # Using n_clusters=3 as an example
birch_model.fit(X_scaled)

# Add cluster labels to the original dataset
data['Cluster'] = birch_model.labels_

# Plot the clusters using Population vs Average Temperature for simplicity
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=birch_model.labels_, cmap='viridis', marker='o', s=100)
plt.title("BIRCH Clustering (Synthetic IoT Data: Population, Temperature, and Energy Usage)")

# Annotate the points with city names
for i, city in enumerate(data['City']):
    plt.text(X_scaled[i, 0], X_scaled[i, 1], city, fontsize=9, ha='right', color='black')

plt.xlabel("Population (scaled)")
plt.ylabel("Average Temperature (°C) (scaled)")
plt.grid(True)
plt.show()

# Display the dataset with cluster labels
print(data)

Visualization:

[Figure: BIRCH clusters of the ten cities plotted by scaled population and average temperature]

4. Fuzzy C-Means

Fuzzy C-Means assigns each data point to multiple clusters with varying degrees of membership, rather than making hard assignments.
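
The soft assignments live in the membership matrix u that skfuzzy returns; a minimal sketch on a tiny 1-D toy dataset (the values are illustrative) showing that every point receives a degree of membership in every cluster:

import numpy as np
import skfuzzy as fuzz

# Toy 1-D data: two loose groups plus one point roughly in between (shape: features x samples)
values = np.array([[1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 3.0]])

cntr, u, _, _, _, _, _ = fuzz.cluster.cmeans(values, c=2, m=2, error=0.005, maxiter=1000)

# u has shape (n_clusters, n_samples) and each column sums to 1
for i, x in enumerate(values[0]):
    print(f"x={x:.1f}  membership in cluster 0: {u[0, i]:.2f}, cluster 1: {u[1, i]:.2f}")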

Use case:

Image segmentation: Fuzzy C-Means can be used to segment medical images, where a pixel may belong to more than one tissue type.

Code:

# pip install scikit-fuzzy scikit-image numpy matplotlib
import numpy as np
import matplotlib.pyplot as plt
import skfuzzy as fuzz
from skimage import data
from skimage.color import rgb2gray
from sklearn.preprocessing import StandardScaler

# Load a sample image from skimage (replace this with actual medical image data if available)
# Convert the image to grayscale to simulate pixel intensities for segmentation
image = rgb2gray(data.astronaut())  # Using 'astronaut' image for demonstration; replace with medical image
image = image[100:200, 100:200]  # Crop part of the image for simplicity

# Display the original grayscale image
plt.figure(figsize=(5, 5))
plt.imshow(image, cmap='gray')
plt.title("Original Grayscale Image")
plt.show()

# Reshape the image into a 1D array (each pixel as a data point)
pixels = image.reshape(-1, 1)

# Standardize the pixel intensities
scaler = StandardScaler()
pixels_scaled = scaler.fit_transform(pixels)

# Apply Fuzzy C-Means Clustering with 3 clusters (simulating 3 tissue types)
n_clusters = 3
cntr, u, _, _, _, _, _ = fuzz.cluster.cmeans(pixels_scaled.T, c=n_clusters, m=2, error=0.005, maxiter=1000)

# Assign each pixel to the cluster with the highest membership
cluster_labels = np.argmax(u, axis=0)

# Reshape the cluster labels back into the original image dimensions
segmented_image = cluster_labels.reshape(image.shape)

# Display the segmented image
plt.figure(figsize=(5, 5))
plt.imshow(segmented_image, cmap='viridis')
plt.title("Fuzzy C-Means Segmentation")
plt.show()

Visualization:

[Figure: the cropped grayscale input image]

[Figure: the Fuzzy C-Means segmentation of the image into three clusters]

5. Mini Batch K-Means

Mini Batch K-Means is a faster variant of K-Means that updates the centroids using random subsets of the data, making it scalable to large datasets.
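
Because the centroids are updated from small random batches, the same model can also be fed data incrementally with partial_fit, which is how you would handle a corpus too large to process in one pass; a minimal sketch on synthetic numeric batches (the data and parameters are illustrative, not part of the document example):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
mbk = MiniBatchKMeans(n_clusters=3, random_state=42, n_init=3)

# Stream the data in small chunks; each call refines the centroids using one mini-batch
for step in range(30):
    batch = rng.normal(loc=centres[step % 3], scale=0.5, size=(64, 2))
    mbk.partial_fit(batch)

print(mbk.cluster_centers_)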

Use case:

Document clustering: Mini Batch K-Means works well for clustering large text corpora, e.g., grouping news articles by topic.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# Create a synthetic dataset of documents (simulating news articles)
documents = [
    "The economy is growing steadily with the GDP increasing by 3%.",
    "New advancements in technology are driving growth in various sectors.",
    "The recent election has resulted in significant changes in policies.",
    "Healthcare improvements have led to better outcomes for patients.",
    "The stock market is volatile with mixed results from major companies.",
    "Climate change continues to be a pressing issue around the world.",
    "Sports events are being affected by ongoing global health concerns.",
    "Travel restrictions are being lifted as vaccination rates rise.",
    "New trends in fashion are emerging from various parts of the globe.",
    "Education reform is essential for preparing future generations."
]

# Convert documents to a DataFrame
df = pd.DataFrame(documents, columns=['Document'])

# Display the first few documents
print(df)

# Convert text documents to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['Document'])

# Apply Mini Batch K-Means clustering
n_clusters = 3  # Adjust the number of clusters as needed
mbkmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=2, random_state=42)
mbkmeans.fit(X)

# Add cluster labels to the original dataset
df['Cluster'] = mbkmeans.labels_

# Display the documents with their corresponding cluster labels
print(df)

# Plotting the cluster assignments
# For visualization, we'll use the first two dimensions from the TF-IDF features
X_dense = X.toarray()  # Convert sparse matrix to dense
plt.figure(figsize=(10, 6))

# Scatter plot based on the first two features
plt.scatter(X_dense[:, 0], X_dense[:, 1], c=mbkmeans.labels_, cmap='viridis', marker='o', s=100)

# Annotate the points with document indices
for i, doc in enumerate(df['Document']):
    plt.text(X_dense[i, 0], X_dense[i, 1], str(i), fontsize=12, ha='right')

plt.title("Mini Batch K-Means Clustering (Synthetic Document Data)")
plt.xlabel("Feature 1 (TF-IDF)")
plt.ylabel("Feature 2 (TF-IDF)")
plt.grid(True)
plt.show()

6. DBSCAN

DBSCAN clusters data based on density: it identifies clusters in high-density regions and treats points in low-density regions as noise.

Use case:

Geospatial clustering: DBSCAN is used to cluster GPS coordinates to identify areas of high activity (e.g., traffic congestion).
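
One caveat for GPS data: the full example below measures plain Euclidean distance on raw latitude/longitude values, which is a reasonable shortcut over a small area. For geographic data spanning larger distances, DBSCAN can instead use the haversine metric on coordinates converted to radians, so that eps corresponds to a true great-circle distance; a minimal sketch of that variant (coordinates and eps are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

# GPS points in degrees; the haversine metric expects [lat, lon] in radians
coords_deg = np.array([
    [37.7749, -122.4194],
    [37.7750, -122.4195],
    [37.7580, -122.4280],
    [37.7800, -122.4000],
])
coords_rad = np.radians(coords_deg)

# eps is an angle: 500 m divided by the Earth's radius (~6371 km)
eps_rad = 0.5 / 6371.0
db = DBSCAN(eps=eps_rad, min_samples=2, metric='haversine')
print(db.fit_predict(coords_rad))  # nearby points share a label, isolated points get -1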

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Simulate a dataset of GPS coordinates (latitude and longitude)
data = {
    'Latitude': [37.7749, 37.775, 37.7748, 37.7751, 37.7752, 37.758, 37.759, 37.760, 37.780, 37.781],
    'Longitude': [-122.4194, -122.4195, -122.4193, -122.4196, -122.4197, -122.428, -122.429, -122.430, -122.400, -122.401]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the dataset
print(df)

# Extract latitude and longitude for DBSCAN
X = df[['Latitude', 'Longitude']].values

# Apply DBSCAN for Geospatial Clustering
dbscan = DBSCAN(eps=0.01, min_samples=2)  # eps is the maximum distance between points in the same cluster
df['Cluster'] = dbscan.fit_predict(X)

# Plot the clusters
plt.figure(figsize=(10, 8))

# Assign different colors to different clusters; outliers will be marked as -1
unique_labels = np.unique(df['Cluster'])

for cluster_label in unique_labels:
    if cluster_label == -1:
        # Plot noise points (outliers)
        plt.scatter(df.loc[df['Cluster'] == cluster_label, 'Longitude'], 
                    df.loc[df['Cluster'] == cluster_label, 'Latitude'], 
                    color='red', label='Noise', marker='x', s=100)
    else:
        # Plot clustered points
        plt.scatter(df.loc[df['Cluster'] == cluster_label, 'Longitude'], 
                    df.loc[df['Cluster'] == cluster_label, 'Latitude'], 
                    label=f'Cluster {cluster_label}', s=100)

# Add titles and labels
plt.title("DBSCAN Geospatial Clustering of GPS Coordinates", fontsize=16)
plt.xlabel("Longitude", fontsize=14)
plt.ylabel("Latitude", fontsize=14)
plt.legend()
plt.grid(True)

# Show plot
plt.show()

# Print the resulting DataFrame with cluster labels
print(df)

Visualization:

[Figure: DBSCAN clusters of the GPS coordinates plotted by longitude and latitude, with noise points marked as red X's]

7. OPTICS

OPTICS is a density-based clustering algorithm similar to DBSCAN, but it can detect clusters of varying density.
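
Under the hood, OPTICS orders the points and records a reachability distance for each one; "valleys" in the reachability plot correspond to clusters, which is why regions of different density can be separated. A minimal sketch that extracts this plot from the fitted model's attributes (synthetic data, illustrative parameters):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

rng = np.random.default_rng(42)
# A dense blob, a sparse blob, and uniform background noise
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(100, 2)),
    rng.normal([5, 5], 1.5, size=(100, 2)),
    rng.uniform(-3, 8, size=(30, 2)),
])

optics = OPTICS(min_samples=10)
optics.fit(X)

# Reachability distances in cluster order: low "valleys" are clusters
reachability = optics.reachability_[optics.ordering_]
plt.plot(reachability)
plt.title("OPTICS Reachability Plot")
plt.xlabel("Points (cluster order)")
plt.ylabel("Reachability distance")
plt.show()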

Use case:

Astronomical data clustering: OPTICS is used to cluster astronomical data, for example identifying galaxy clusters of different densities.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

# Simulate a dataset of astronomical data (e.g., positions of galaxies)
np.random.seed(42)
n_points = 500

# Cluster 1 (dense cluster of galaxies)
galaxies_1 = np.random.normal(loc=[100, 200], scale=5, size=(150, 2))

# Cluster 2 (less dense cluster of galaxies)
galaxies_2 = np.random.normal(loc=[300, 100], scale=15, size=(100, 2))

# Cluster 3 (another dense cluster)
galaxies_3 = np.random.normal(loc=[400, 400], scale=5, size=(200, 2))

# Outliers (random points)
outliers = np.random.uniform(low=[50, 50], high=[450, 450], size=(50, 2))

# Combine all data points into one dataset
astronomical_data = np.vstack([galaxies_1, galaxies_2, galaxies_3, outliers])

# Create a DataFrame for the data
df = pd.DataFrame(astronomical_data, columns=['X', 'Y'])

# Apply OPTICS clustering
optics = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.1)
df['Cluster'] = optics.fit_predict(astronomical_data)

# Plot the clusters
plt.figure(figsize=(10, 8))

# Assign different colors to different clusters; outliers will be marked as -1
unique_labels = np.unique(df['Cluster'])

for cluster_label in unique_labels:
    if cluster_label == -1:
        # Plot noise points (outliers)
        plt.scatter(df.loc[df['Cluster'] == cluster_label, 'X'], 
                    df.loc[df['Cluster'] == cluster_label, 'Y'], 
                    color='red', label='Noise', marker='x', s=50)
    else:
        # Plot clustered points
        plt.scatter(df.loc[df['Cluster'] == cluster_label, 'X'], 
                    df.loc[df['Cluster'] == cluster_label, 'Y'], 
                    label=f'Cluster {cluster_label}', s=50)

# Add titles and labels
plt.title("OPTICS Clustering of Astronomical Data (Simulated Galaxies)", fontsize=16)
plt.xlabel("X Coordinate", fontsize=14)
plt.ylabel("Y Coordinate", fontsize=14)
plt.legend()
plt.grid(True)

# Show plot
plt.show()

# Print the resulting DataFrame with cluster labels
print(df)

Visualization:

[Figure: OPTICS clusters of the simulated galaxy positions, with outliers marked as red X's]

8. Fuzzy K-Modes

Fuzzy K-Modes clusters categorical data with fuzzy assignments similar to Fuzzy C-Means, but it uses modes instead of means.

Use case:

Categorical data clustering: Fuzzy K-Modes is useful for clustering survey responses or customer preferences.

Code:

# pip install kmodes
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

# Create a synthetic dataset representing survey responses (categorical data)
data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male'],
    'Preferred Product': ['A', 'B', 'A', 'C', 'C', 'B', 'A', 'C', 'B', 'C'],
    'Payment Method': ['Credit Card', 'Debit Card', 'Cash', 'Credit Card', 'Cash', 'Cash', 'Debit Card', 'Credit Card', 'Cash', 'Credit Card'],
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South']
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the dataset
print(df)

# Convert the categorical data to numeric using Label Encoding (KModes handles categorical data internally, but this helps with visualization)
df_numeric = df.apply(lambda x: pd.factorize(x)[0])

# Apply Fuzzy K-Modes (we're using standard K-Modes here as Fuzzy K-Modes isn't readily available in libraries)
kmodes = KModes(n_clusters=3, init='Huang', n_init=5, verbose=1)

# Fit the model and predict clusters
clusters = kmodes.fit_predict(df_numeric)

# Add cluster labels to the original dataset
df['Cluster'] = clusters

# Display the DataFrame with cluster labels
print(df)

# Set up markers and colors for each cluster
markers = ['o', 's', 'D']  # Circle, square, and diamond for different clusters
colors = ['blue', 'green', 'orange']

# Create the plot
plt.figure(figsize=(10, 6))

# Plot each cluster with a different color and marker
for i, cluster_label in enumerate(np.unique(clusters)):
    clustered_data = df[df['Cluster'] == cluster_label]
    plt.scatter(clustered_data.index, [i] * len(clustered_data), color=colors[i], marker=markers[i], s=150, label=f'Cluster {cluster_label}')
    
    # Annotate each point with its categorical values (Gender, Product, Payment Method, Region)
    for j in clustered_data.index:
        plt.text(j, i, f"{df.loc[j, 'Gender']}, {df.loc[j, 'Preferred Product']}, {df.loc[j, 'Payment Method']}", 
                 fontsize=9, ha='left', color='black')

# Add titles and labels
plt.title("Fuzzy K-Modes Clustering of Survey Responses", fontsize=16)
plt.xlabel("Survey Respondent Index", fontsize=14)
plt.ylabel("Cluster", fontsize=14)
plt.yticks([0, 1, 2], ['Cluster 0', 'Cluster 1', 'Cluster 2'])
plt.legend()

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

Visualization:

[Figure: survey respondents plotted by index and assigned cluster, annotated with their categorical attributes]

9. Expectation-Maximization (EM)

The EM algorithm estimates the parameters of a probabilistic model and is typically used for clustering together with Gaussian Mixture Models (GMMs).
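
For clustering itself, EM fits the means, covariances, and mixing weights of the Gaussians, and every point then receives a posterior probability of belonging to each component rather than only a hard label; a minimal sketch of those soft assignments (toy data, illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two well-separated blobs
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(6, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)

# Compare hard labels with the soft EM responsibilities (rows of predict_proba sum to 1)
points = np.array([[0.0, 0.0], [6.0, 6.0], [3.0, 3.0]])  # the last point sits near the boundary
print(gmm.predict(points))
print(gmm.predict_proba(points))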

Use case:

Anomaly detection in finance: by fitting a GMM, EM can be used to detect outliers in financial transaction data.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Simulate a dataset representing financial transaction data
# Columns: 'Amount' (transaction amount), 'Time' (time of transaction)
np.random.seed(42)

# Cluster 1 (normal transactions, low amounts)
normal_1 = np.random.normal(loc=[100, 50], scale=[10, 5], size=(200, 2))

# Cluster 2 (normal transactions, high amounts)
normal_2 = np.random.normal(loc=[1000, 100], scale=[50, 10], size=(100, 2))

# Anomalous transactions (outliers)
anomalies = np.random.uniform(low=[5000, 150], high=[10000, 200], size=(10, 2))

# Combine the normal transactions and anomalies
transaction_data = np.vstack([normal_1, normal_2, anomalies])

# Create a DataFrame for the data
df = pd.DataFrame(transaction_data, columns=['Amount', 'Time'])

# Fit a Gaussian Mixture Model (GMM) with 2 components (normal transaction clusters)
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=42)
gmm.fit(transaction_data)

# Predict the probability of each point belonging to the Gaussian mixture
probs = gmm.score_samples(transaction_data)

# Define a threshold for anomalies (e.g., points with low probability are flagged as anomalies)
threshold = np.percentile(probs, 2)  # The lowest 2% of points are flagged as anomalies
df['Anomaly'] = probs < threshold

# Plot the transactions and anomalies
plt.figure(figsize=(10, 8))

# Plot normal transactions
normal_transactions = df[df['Anomaly'] == False]
plt.scatter(normal_transactions['Amount'], normal_transactions['Time'], c='green', label='Normal Transactions', s=50)

# Plot anomalies
anomalous_transactions = df[df['Anomaly'] == True]
plt.scatter(anomalous_transactions['Amount'], anomalous_transactions['Time'], c='red', label='Anomalies', s=100, marker='x')

# Add titles and labels
plt.title("Anomaly Detection in Financial Transactions Using GMM", fontsize=16)
plt.xlabel("Transaction Amount", fontsize=14)
plt.ylabel("Transaction Time", fontsize=14)
plt.legend()
plt.grid(True)

# Show plot
plt.show()

# Display the first few rows of the dataset with the anomaly flag
print(df.head())

Visualization:

[Figure: transaction amount vs. time, with normal transactions in green and GMM-flagged anomalies marked as red X's]

10. Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters, either by starting with each data point as its own cluster and merging (agglomerative) or by starting with one large cluster and splitting it (divisive). It does not require the number of clusters to be specified in advance.
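
Even though the number of clusters is not fixed in advance, a flat clustering can be read off afterwards by cutting the dendrogram, either into a chosen number of clusters or at a distance threshold; a minimal sketch using scipy's fcluster (toy data, illustrative threshold):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two small toy groups of points
X = np.vstack([rng.normal(0, 0.3, size=(5, 2)), rng.normal(5, 0.3, size=(5, 2))])

linked = linkage(X, method='ward')

# Cut the tree into a fixed number of clusters...
labels_k = fcluster(linked, t=2, criterion='maxclust')
# ...or at a distance threshold, letting the data decide how many clusters emerge
labels_d = fcluster(linked, t=2.0, criterion='distance')

print(labels_k)
print(labels_d)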

Use case:

Taxonomy construction: this algorithm is commonly used to build biological taxonomies, for example classifying species by genetic similarity.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

# Simulate a dataset representing species genetic similarities (e.g., based on genetic markers)
np.random.seed(42)

# Genetic features for 6 species (simulating genetic markers or characteristics)
data = {
    'Species': ['Species A', 'Species B', 'Species C', 'Species D', 'Species E', 'Species F'],
    'Genetic Marker 1': [0.1, 0.3, 0.2, 0.5, 0.7, 0.9],
    'Genetic Marker 2': [0.2, 0.4, 0.2, 0.6, 0.8, 1.0],
    'Genetic Marker 3': [0.3, 0.1, 0.5, 0.4, 0.9, 0.8],
    'Genetic Marker 4': [0.4, 0.3, 0.6, 0.3, 0.6, 0.7]
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Display the dataset
print(df)

# Extract only the genetic markers for clustering
genetic_data = df.iloc[:, 1:].values  # Exclude species names

# Standardize the data (important for genetic similarity clustering)
scaler = StandardScaler()
genetic_data_scaled = scaler.fit_transform(genetic_data)

# Apply hierarchical clustering using the 'ward' method (agglomerative)
linked = linkage(genetic_data_scaled, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 6))
dendrogram(linked, labels=df['Species'].values, leaf_rotation=90, leaf_font_size=12)
plt.title('Hierarchical Clustering Dendrogram (Genetic Similarities)', fontsize=16)
plt.xlabel('Species', fontsize=14)
plt.ylabel('Distance (Genetic Similarity)', fontsize=14)
plt.grid(True)
plt.tight_layout()
plt.show()

Visualization:

[Figure: dendrogram of the six species built from the scaled genetic markers]

11. Minimum Spanning Tree (MST)

The Minimum Spanning Tree (MST) algorithm builds a tree that connects all data points with the minimum total edge weight. It is commonly used to minimize the cost of connecting nodes in a network, such as laying cables.

MST is not a clustering algorithm in the traditional sense like K-Means or DBSCAN, but it can be used as a tool to support the clustering process, as the sketch below shows.
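
A common way to turn an MST into clusters is to build the tree and then delete its k-1 longest edges; the connected components that remain form the k clusters. A minimal sketch of that idea with networkx (toy points, illustrative k):

import numpy as np
import networkx as nx
from scipy.spatial import distance_matrix

rng = np.random.default_rng(42)
# Two toy groups of points
points = np.vstack([rng.normal(0, 0.5, size=(5, 2)), rng.normal(5, 0.5, size=(5, 2))])

# Build a complete weighted graph over the points and take its minimum spanning tree
dists = distance_matrix(points, points)
G = nx.Graph()
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        G.add_edge(i, j, weight=dists[i, j])
mst = nx.minimum_spanning_tree(G)

# Remove the k-1 longest MST edges (k=2 here); the remaining components are the clusters
k = 2
edges_sorted = sorted(mst.edges(data=True), key=lambda e: e[2]['weight'], reverse=True)
mst.remove_edges_from([(u, v) for u, v, _ in edges_sorted[:k - 1]])
print(list(nx.connected_components(mst)))  # two sets of point indices, one per group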

Use case:

Network design: MST can be used to design efficient network structures, such as power grids or pipelines, where the goal is to connect all nodes at minimal cost.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from scipy.spatial import distance_matrix

# Simulate a dataset representing cities (nodes) with x, y coordinates (positions)
np.random.seed(42)

# Generate random coordinates for 10 cities (nodes)
cities = pd.DataFrame({
    'City': ['City A', 'City B', 'City C', 'City D', 'City E', 'City F', 'City G', 'City H', 'City I', 'City J'],
    'X': np.random.uniform(0, 100, 10),
    'Y': np.random.uniform(0, 100, 10)
})

# Display the dataset
print(cities)

# Create a distance matrix representing the cost between each pair of cities
distance_matrix_df = pd.DataFrame(distance_matrix(cities[['X', 'Y']], cities[['X', 'Y']]), 
                                  index=cities['City'], columns=cities['City'])

# Display the distance matrix
print("
Distance Matrix:
", distance_matrix_df)

# Create a graph from the distance matrix
G = nx.Graph()

# Add nodes (cities) to the graph
for i, city in cities.iterrows():
    G.add_node(city['City'], pos=(city['X'], city['Y']))

# Add edges (distances between cities) to the graph
for i in range(len(cities)):
    for j in range(i + 1, len(cities)):
        G.add_edge(cities['City'][i], cities['City'][j], weight=distance_matrix_df.iloc[i, j])

# Compute the Minimum Spanning Tree (MST) using Kruskal's algorithm
mst = nx.minimum_spanning_tree(G, algorithm='kruskal')

# Plot the cities and the MST
plt.figure(figsize=(10, 8))

# Get positions for the cities
pos = {city['City']: (city['X'], city['Y']) for i, city in cities.iterrows()}

# Draw the nodes (cities)
nx.draw_networkx_nodes(G, pos, node_size=500, node_color='lightblue', alpha=0.9, label='Cities')

# Draw the MST edges
nx.draw_networkx_edges(mst, pos, edgelist=mst.edges(), edge_color='green', width=2, label='MST Edges')

# Draw the edge labels (distances)
edge_labels = nx.get_edge_attributes(mst, 'weight')
nx.draw_networkx_edge_labels(mst, pos, edge_labels=edge_labels, font_size=8)

# Draw the city labels
nx.draw_networkx_labels(G, pos, font_size=10, font_family="sans-serif")

# Add title and labels
plt.title("Minimum Spanning Tree for Network Design (Simulated Cities)", fontsize=16)
plt.legend(loc='upper left')
plt.grid(True)
plt.show()

# Print the total weight (cost) of the MST
mst_total_cost = sum(nx.get_edge_attributes(mst, 'weight').values())
print(f"
Total cost of the Minimum Spanning Tree: {mst_total_cost:.2f}")

Visualization:

[Figure: the ten simulated cities with the minimum spanning tree edges drawn in green and labeled with their distances]

Reference:

https://ai.gopubby.com/100-ml-algorithms-clustering-11-algorithms-fabe73910da6
