[Logistic Regression] Modeling with the Breast Cancer Dataset

Data Analyst KIM

김두연 2023. 5. 30. 23:08

Predicting breast cancer with logistic regression

1. Importing packages

import pandas as pd
from sklearn import datasets
from sklearn.metrics import accuracy_score, roc_auc_score  # evaluation metrics
from sklearn.preprocessing import StandardScaler         # feature standardization
from sklearn.model_selection import train_test_split     # train/test split
from sklearn.linear_model import LogisticRegression      # logistic regression
from sklearn.model_selection import GridSearchCV         # grid search

2. Loading the dataset

dataset = datasets.load_breast_cancer()    # breast cancer diagnosis dataset
dir(dataset)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

dataset.data.shape                        # shape of the predictor matrix

(569, 30)

dataset.feature_names                     # predictor names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

dataset.target_names                      # target labels (0 = malignant, 1 = benign)

array(['malignant', 'benign'], dtype='<U9')
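Before modeling, it can help to check how the two classes are balanced. The snippet below is a minimal sketch that is not part of the original notebook; it assumes only that label 0 maps to 'malignant' and label 1 to 'benign', as the order of target_names above indicates.

```python
import numpy as np
from sklearn import datasets

dataset = datasets.load_breast_cancer()

# np.bincount counts how often each integer label (0 and 1) occurs in the target.
counts = np.bincount(dataset.target)

# Pair each count with its class name, in label order.
for name, count in zip(dataset.target_names, counts):
    print(f"{name}: {count} ({count / counts.sum():.1%})")
```

If the classes turn out to be noticeably imbalanced, accuracy alone can look better than the model really is, which is one reason to report AUC alongside it.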

# Convert to a DataFrame
breast_cancer_df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)  # predictors as a DataFrame
breast_cancer_df['label'] = dataset.target                                         # add the target column
breast_cancer_df

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension	label
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.30010	0.14710	0.2419	0.07871	...	17.33	184.60	2019.0	0.16220	0.66560	0.7119	0.2654	0.4601	0.11890	0
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.08690	0.07017	0.1812	0.05667	...	23.41	158.80	1956.0	0.12380	0.18660	0.2416	0.1860	0.2750	0.08902	0
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.19740	0.12790	0.2069	0.05999	...	25.53	152.50	1709.0	0.14440	0.42450	0.4504	0.2430	0.3613	0.08758	0
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.24140	0.10520	0.2597	0.09744	...	26.50	98.87	567.7	0.20980	0.86630	0.6869	0.2575	0.6638	0.17300	0
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.19800	0.10430	0.1809	0.05883	...	16.67	152.20	1575.0	0.13740	0.20500	0.4000	0.1625	0.2364	0.07678	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
564	21.56	22.39	142.00	1479.0	0.11100	0.11590	0.24390	0.13890	0.1726	0.05623	...	26.40	166.10	2027.0	0.14100	0.21130	0.4107	0.2216	0.2060	0.07115	0
565	20.13	28.25	131.20	1261.0	0.09780	0.10340	0.14400	0.09791	0.1752	0.05533	...	38.25	155.00	1731.0	0.11660	0.19220	0.3215	0.1628	0.2572	0.06637	0
566	16.60	28.08	108.30	858.1	0.08455	0.10230	0.09251	0.05302	0.1590	0.05648	...	34.12	126.70	1124.0	0.11390	0.30940	0.3403	0.1418	0.2218	0.07820	0
567	20.60	29.33	140.10	1265.0	0.11780	0.27700	0.35140	0.15200	0.2397	0.07016	...	39.42	184.60	1821.0	0.16500	0.86810	0.9387	0.2650	0.4087	0.12400	0
568	7.76	24.54	47.92	181.0	0.05263	0.04362	0.00000	0.00000	0.1587	0.05884	...	30.37	59.16	268.6	0.08996	0.06444	0.0000	0.0000	0.2871	0.07039	1

569 rows × 31 columns

# Separate the inputs and the target
X, y = dataset['data'], dataset['target']

3. Splitting the dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
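The split above is purely random, so the class ratio in the train and test sets can drift slightly from the full data. As an optional variation (not what the original notebook does), train_test_split accepts a stratify argument that preserves the class ratio. The sketch below uses hypothetical variable names (X_tr, X_te, y_tr, y_te) to avoid clashing with the ones above, and its downstream scores would not necessarily match the numbers reported in this post.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)

# stratify=y keeps the malignant/benign ratio (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# y is 0/1, so its mean is the share of label 1 (benign).
print("benign share, full data:", round(y.mean(), 3))
print("benign share, train    :", round(y_tr.mean(), 3))
print("benign share, test     :", round(y_te.mean(), 3))
```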

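4. Preprocessing

# Standardization
scaler = StandardScaler()                   # z-score scaling: zero mean, unit variance per feature
X_train_std = scaler.fit_transform(X_train) # fit on the training set, then transform it
X_test_std = scaler.transform(X_test)       # transform the test set with the training-set statistics
                                            # fit is not called on the test set, so no test information leaks into training
                                            # apart from tree-based models, most ML algorithms benefit from standardization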
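The same "fit on train only" rule also matters inside cross-validation. One common way to enforce it is to wrap the scaler and the model in a scikit-learn Pipeline, so the scaler is refit on each training fold. This is a sketch of that alternative, not the approach used in the rest of this post; the names pipe and cv_scores are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# The pipeline standardizes first, then fits logistic regression.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# In each fold the scaler is fitted on that fold's training part only.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=3, scoring="accuracy")
print("CV accuracy per fold:", cv_scores)

# Fitting the pipeline on the raw training data handles scaling internally.
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))
```

With a pipeline, the unscaled X_train and X_test are passed directly, which removes the risk of accidentally calling fit_transform on the test set.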

5. Training the model

model = LogisticRegression()
model.fit(X_train_std,y_train)

LogisticRegression()
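Once the model is fitted, its coefficients can be inspected. Because the features were standardized, the absolute size of each coefficient gives a rough sense of how strongly that feature drives the prediction. The following is an illustrative sketch rather than part of the original notebook; the choice of showing the top five features is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.3, random_state=1
)
X_train_std = StandardScaler().fit_transform(X_train)

model = LogisticRegression().fit(X_train_std, y_train)

# For a binary problem coef_ has shape (1, n_features).
coefs = model.coef_[0]

# Indices of the five coefficients with the largest absolute value.
top = np.argsort(np.abs(coefs))[::-1][:5]
for i in top:
    # A positive coefficient pushes toward label 1 (benign), a negative one toward label 0 (malignant).
    print(f"{dataset.feature_names[i]:<25s} {coefs[i]: .3f}")
```

Many of the 30 features are strongly correlated (radius, perimeter, and area, for example), so individual coefficients should be read with caution.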

6. Evaluating the model

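# Compute accuracy and AUC
test_pred = model.predict(X_test_std)              # predicted class labels
train_score = model.score(X_train_std, y_train)    # accuracy on the training set
test_score = model.score(X_test_std, y_test)       # accuracy on the test set
test_auc = roc_auc_score(y_test, test_pred)        # note: AUC computed from hard class labels, not probabilities

# Print the results
print('Training accuracy: {:.3f}'.format(train_score))
print('Test accuracy: {:.3f}, test AUC: {:.3f}'.format(test_score, test_auc))

Training accuracy: 0.990
Test accuracy: 0.971, test AUC: 0.970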
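One caveat about the AUC above: it is computed from the predicted class labels. For a binary problem, AUC on hard 0/1 predictions reduces to the balanced accuracy (the average of sensitivity and specificity). ROC AUC is normally computed from continuous scores such as predict_proba. The sketch below rebuilds the same model and compares the two; the variable names auc_from_labels, auc_from_proba, bal_acc, and cm are introduced here, and no output is shown because it was not part of the original run.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
model = LogisticRegression().fit(X_train_std, y_train)

test_pred = model.predict(X_test_std)               # hard class labels (0 or 1)
test_proba = model.predict_proba(X_test_std)[:, 1]  # estimated P(label == 1), i.e. benign

auc_from_labels = roc_auc_score(y_test, test_pred)   # what the post computes
auc_from_proba = roc_auc_score(y_test, test_proba)   # the usual ROC AUC
bal_acc = balanced_accuracy_score(y_test, test_pred)

print(f"AUC from labels       : {auc_from_labels:.3f}")
print(f"Balanced accuracy     : {bal_acc:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")

# Rows are true classes, columns are predicted classes (label order 0, 1).
cm = confusion_matrix(y_test, test_pred)
print(cm)
```

The confusion matrix is worth printing here because, for a diagnosis task, a malignant case predicted as benign is usually the costlier error.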

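+ Model prediction and evaluation with different optimization methods

LogisticRegression() in scikit-learn offers the following solvers for optimizing the regression coefficients.

lbfgs : supports parallel processing; handles multinomial (multiclass) problems
liblinear : works well on small datasets but is slow on large ones; no parallel processing; may stop at a non-optimal point in some cases
newton-cg : allows more precise optimization but is slow; handles multinomial (multiclass) problems
sag, saga : gradient-descent-based solvers that work well on large datasets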

# Create the model object
lr = LogisticRegression()

# Candidate solvers for optimizing the coefficients
param_grid = {'solver':['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']}


# Create the grid search object
lr_gscv = GridSearchCV(estimator=lr, 
                        param_grid=param_grid, 
                        scoring='accuracy', 
                        cv=3,                        # cv : number of cross-validation folds
                        refit=True,                  # refit : retrain on the full training set with the best hyperparameters
                        n_jobs=-1,                   # n_jobs : number of CPU cores to use, n_jobs=-1 : use all cores
                        verbose=2)


# Fit the grid search
lr_gscv.fit(X_train_std, y_train)

# Print the grid search results
print('Best hyperparameters: {0}'.format(lr_gscv.best_params_))
print('Cross-validated accuracy with the best hyperparameters: {0:.2f}'.format(lr_gscv.best_score_))

# Best model (refit on the full training set)
best_model = lr_gscv.best_estimator_

# Compute accuracy and AUC on the test set
test_pred = best_model.predict(X_test_std)              # predicted class labels
test_accuracy = accuracy_score(y_test, test_pred)
test_auc = roc_auc_score(y_test, test_pred)             # AUC from hard class labels, as in step 6

# Print the results
print('Test accuracy: {:.3f}, test AUC: {:.3f}'.format(test_accuracy, test_auc))

Fitting 3 folds for each of 5 candidates, totalling 15 fits
Best hyperparameters: {'solver': 'saga'}
Cross-validated accuracy with the best hyperparameters: 0.97
Test accuracy: 0.971, test AUC: 0.970


C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(
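The ConvergenceWarning above means that a stochastic solver stopped at the iteration limit (max_iter defaults to 100 in LogisticRegression) before the coefficients converged, so the selected 'saga' model is not a fully optimized fit. A possible follow-up, sketched below, is to raise max_iter and look at the per-solver scores in cv_results_ instead of only the winner. The value 5000 is an arbitrary assumption, not a tuned setting, and whether it is enough to silence the warning has not been verified here.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
X_train_std = StandardScaler().fit_transform(X_train)

solvers = ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']

# max_iter=5000 is an assumed, generous limit so slow solvers get more room to converge.
gscv = GridSearchCV(
    estimator=LogisticRegression(max_iter=5000),
    param_grid={'solver': solvers},
    scoring='accuracy',
    cv=3,
)
gscv.fit(X_train_std, y_train)

# cv_results_ holds one row per candidate; keep the columns that matter here.
results = pd.DataFrame(gscv.cv_results_)[
    ['param_solver', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```

If the solvers end up with near-identical mean scores, the differences between them are likely within the fold-to-fold noise, and the choice of solver matters more for speed than for accuracy on a dataset of this size.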
