Data Analyst KIM


[ML] Building a Classification Model with the Kaggle Stroke Dataset (Imbalanced Data, SMOTE)

김두연 2023. 11. 3. 16:38

Dataset: Cerebral Stroke Prediction-Imbalanced Dataset (Identify Stroke on Imbalanced Dataset), www.kaggle.com

1. Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from sklearn.impute import KNNImputer #Imputation
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

2. Inspecting the Data

df = pd.read_csv('dataset.csv')
df.info()
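
Along with df.info(), counting missing values per column shows what the imputer in step 3 will need to fill:

# count missing values per column before choosing an imputation strategy
print(df.isnull().sum())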

import seaborn as sns

sns.countplot(x='stroke',data=df)
plt.title("Imbalanced data")
plt.show()

# minority-class ratio: 783 stroke cases out of 43,400 rows
print('Target class is {0:0.4f} %'.format(df['stroke'].mean() * 100))

<Target class is 1.8041 %>


3. Data Preprocessing

  • Encoding - one-hot
  • Missing values - KNNImputer
  • Scaling - min-max normalization
# one-hot encode the categorical columns
df = pd.get_dummies(df,columns=['gender','ever_married','work_type','Residence_type','smoking_status'])

# impute missing values with KNN
imputer = KNNImputer(missing_values=np.nan)
tab = imputer.fit_transform(df)
df_new = pd.DataFrame(tab, columns=df.columns)

# separate features and target
X = df_new.drop('stroke',axis=1)
y = df_new['stroke']

# scale features to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
MM = MinMaxScaler()
X = MM.fit_transform(X)
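
One caveat: MM.fit_transform(X) fits the scaler on all rows before the train/test split, so test-set statistics leak into the training data. A minimal alternative sketch, assuming X here is the unscaled feature matrix: split first, then fit the scaler on the training split only.

# sketch: split first, then fit the scaler on the training rows only
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
scaler = MinMaxScaler()
X_tr = scaler.fit_transform(X_tr)  # learn min/max from training data only
X_te = scaler.transform(X_te)      # reuse the training min/max on test data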

4. Splitting the Dataset

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

print("X_train.shape : ", X_train.shape)
print("y_train.shape : ", y_train.shape)
print("X_test.shape : " ,X_test.shape)
print("y_test.shape : " ,y_test.shape)


5. Modeling

5-1. The Imbalanced Data As-Is

from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier

dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
lr = LogisticRegression()
kc = KNeighborsClassifier()
svc = SVC()
gbc = GradientBoostingClassifier()
lgv = LGBMClassifier()

models = [dt,rf,lr,kc,svc,gbc,lgv]

for model in models:
    print("MODEL NAME:",model)
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    
    print(classification_report(y_test,y_pred))
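
With only about 1.8 % positive samples, most of these models score high accuracy while recalling almost no stroke cases, so the minority-class recall and f1-score in the reports above are the numbers to watch. Besides resampling, several sklearn estimators accept a class_weight parameter; a minimal sketch of that alternative (not used in this post):

# sketch: reweight classes instead of resampling;
# 'balanced' weights each class inversely to its frequency
lr_w = LogisticRegression(class_weight='balanced', max_iter=1000)
lr_w.fit(X_train, y_train)
print(classification_report(y_test, lr_w.predict(X_test)))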

5-2. Oversampling (SMOTE)

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=1)
X_sm,y_sm = sm.fit_resample(X,y)

X_train,X_test,y_train,y_test = train_test_split(X_sm,y_sm,test_size=0.2,random_state=1)

dt1 = DecisionTreeClassifier()
rf1 = RandomForestClassifier()
lr1 = LogisticRegression()
kc1 = KNeighborsClassifier()
lgv1 = LGBMClassifier()

models = [dt1,rf1,lr1,kc1,lgv1]

for model in models:
    print("MODEL NAME:",model)
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    
    print(classification_report(y_test,y_pred))
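
Note that SMOTE is applied to the full X and y before the split, so synthetic points (partly derived from what becomes the test set) end up on both sides of the split, and these scores are likely optimistic. A common alternative, sketched below assuming X and y are the original pre-SMOTE data: split first and oversample only the training portion.

# sketch: oversample the training split only, so the test set
# contains only real, untouched samples
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
X_tr_sm, y_tr_sm = SMOTE(random_state=1).fit_resample(X_tr, y_tr)

rf_sm = RandomForestClassifier()
rf_sm.fit(X_tr_sm, y_tr_sm)
print(classification_report(y_te, rf_sm.predict(X_te)))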

5-3. Undersampling

from imblearn.under_sampling import RandomUnderSampler

us = RandomUnderSampler(random_state=1)
X_us,y_us = us.fit_resample(X,y)

X_train,X_test,y_train,y_test = train_test_split(X_us,y_us,test_size=0.2,random_state=1)

dt2 = DecisionTreeClassifier()
rf2 = RandomForestClassifier()
lr2 = LogisticRegression()
kc2 = KNeighborsClassifier()
lgv2 = LGBMClassifier()

models = [dt2,rf2,lr2,kc2,lgv2]

for model in models:
    print("MODEL NAME:",model)
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    
    print(classification_report(y_test,y_pred))
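
To avoid writing the split-resample-fit steps by hand, imblearn also provides a Pipeline that applies the sampler only to the training folds during cross-validation. A minimal sketch, assuming the original X and y:

# sketch: the sampler runs only inside each CV training fold
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('sampler', RandomUnderSampler(random_state=1)),
    ('clf', RandomForestClassifier()),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')  # f1 of the positive class
print(scores.mean())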


6. Final Model = the Oversampled Model
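
For a less manual comparison, classification_report can also return a dict, which makes it easy to tabulate the minority-class f1-score across models; a small sketch (the minority_f1 helper is made up for illustration, and the '1.0' key assumes the float labels produced by the imputer):

# sketch: collect the minority-class f1-score per model
def minority_f1(model, X_test, y_test):
    report = classification_report(y_test, model.predict(X_test), output_dict=True)
    return report['1.0']['f1-score']  # labels are floats (0.0/1.0) after imputation

for model in models:
    print(type(model).__name__, minority_f1(model, X_test, y_test))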

 
