I am just starting out with machine learning and am working through simple examples.
I have a *.csv file whose columns contain text, numbers and dates. My goal, for learning purposes, is to classify (cluster) the rows of this file in an unsupervised way, for example with KMeans.
I don’t know how to combine all the transformers so that their output can be passed to KMeans.
Any suggestions would be appreciated.
```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin


class DateTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_date_encoded = np.zeros_like(X)
        valid_dates = ~np.isnat(X).all(axis=1)
        if np.any(valid_dates):
            X_date_encoded[valid_dates] = (X[valid_dates] - X[valid_dates].min()) / (
                X[valid_dates].max() - X[valid_dates].min())
        return X_date_encoded


def classify_excel_rows(file_path, n_clusters=3):
    """
    Classify the rows of an Excel sheet using unsupervised learning (KMeans).

    Args:
    - file_path: The path to the Excel file.
    - n_clusters: The number of clusters to form.

    Returns:
    - A DataFrame with an additional column 'Cluster' indicating the cluster each row belongs to.
    """
    # Read the Excel file
    data = pd.read_csv(file_path, sep=';', skiprows=4)
    data['Value'] = data['Value'].str.replace(',', '.').astype(float)
    data['Value'] = pd.to_numeric(data['Value'], errors='coerce')
    date_format = "%d.%m.%Y"
    data['Date_1'] = pd.to_datetime(data['Date_1'], format=date_format, errors='coerce')
    data['Date_2'] = pd.to_datetime(data['Date_2'], format=date_format, errors='coerce')

    # Identify different data types
    text_columns = data.select_dtypes(include=['object']).columns.tolist()
    numeric_columns = data.select_dtypes(include=['float64']).columns
    date_columns = data.select_dtypes(include=['datetime']).columns

    # Preprocess the data
    numeric_transformer = make_pipeline(StandardScaler())
    text_transformer = make_pipeline(TfidfVectorizer())
    date_transformer = make_pipeline(DateTransformer())

    X = ????  # how to combine all transformers to use it with KMeans?
    [...]

    # Perform K-Means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X)
```
I have tried the example code above, but I don’t know how to use KMeans with this mix of data types for unsupervised learning/clustering. I expect the rows of my *.csv file to be clustered according to their column values.
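For reference, here is a minimal sketch of one way the combination might look, using scikit-learn's `ColumnTransformer` to route each group of columns to its own transformer before KMeans. Several things here are assumptions and not part of my real data: the small in-memory DataFrame, the `Text` column name, the helper names `dates_to_days` and `build_clusterer`, and the choice to encode dates as days since 1970-01-01 with median imputation for missing values. It is not meant as a definitive solution.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def dates_to_days(frame):
    """Hypothetical helper: datetime columns -> days since 1970-01-01 (NaT becomes NaN)."""
    epoch = pd.Timestamp("1970-01-01")
    days = frame.apply(lambda col: (col - epoch).dt.total_seconds() / 86400.0)
    return days.to_numpy()


def build_clusterer(text_columns, numeric_columns, date_columns, n_clusters=3):
    """Hypothetical helper: preprocessing for each column group, followed by KMeans."""
    numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    date_pipe = make_pipeline(
        FunctionTransformer(dates_to_days),
        SimpleImputer(strategy="median"),
        StandardScaler(),
    )
    transformers = [
        ("numeric", numeric_pipe, list(numeric_columns)),
        ("date", date_pipe, list(date_columns)),
    ]
    # TfidfVectorizer expects a 1-D sequence of strings, so each text column
    # is passed by name (a string, not a list) and gets its own vectorizer.
    for i, col in enumerate(text_columns):
        transformers.append((f"text_{i}", TfidfVectorizer(), col))
    preprocess = ColumnTransformer(transformers)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    return make_pipeline(preprocess, kmeans)


# Made-up sample rows standing in for the parsed CSV file.
data = pd.DataFrame({
    "Text": ["red apple", "green apple", "fast car", "slow car", "red car", "green pear"],
    "Value": [1.0, 1.2, 50.0, 48.5, np.nan, 0.9],
    "Date_1": pd.to_datetime(
        ["01.01.2020", "05.01.2020", "01.06.2023", "10.06.2023", None, "03.01.2020"],
        format="%d.%m.%Y", errors="coerce",
    ),
})
data["Text"] = data["Text"].fillna("")  # TfidfVectorizer cannot handle NaN

model = build_clusterer(["Text"], ["Value"], ["Date_1"], n_clusters=2)
data["Cluster"] = model.fit_predict(data)
print(data)
```

The idea is that `ColumnTransformer` stacks the outputs of all transformers side by side into one feature matrix, so the whole thing can be fitted with a single `fit_predict` call. Because KMeans relies on Euclidean distance, a wide TF-IDF block can outweigh the scaled numeric and date columns, so the relative weighting of the groups may need tuning.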