Scikit-learn 주요 함수 정리 -(1) 전처리, 특성추출, 평가 함수

2024. 1. 18. 21:15

Scikit-learn 이란?

Python 기반의 머신러닝 라이브러리로, 머신러닝 관련 다양한 알고리즘과 함수들을 포함하고 있어, 머신러닝 프로젝트에서는 필수 라이브러리이다.

설치 방법

pip install scikit-learn

Scikit-learn 주요 함수

사실 Scikit-learn은 계속 새로운 버전이 등장하는 라이브러리이기 때문에, 라이브러리 내의 모든 함수를 보기 위해서는 공식 홈페이지에 방문하는 것이 좋다.
아래 정리된 내용은 지금껏 Scikit-learn을 사용해 오면서, 순전히 주관적인 기준으로 Scikit-learn의 주요 함수를 정리한 것이다.
Scikit-learn은 크게 분류하면, Classification, Regression , Clustering 등의 특정 알고리즘을 구현해 놓은 알고리즘 함수와 전처리, 특성추출, 평가등을 통해, 머신러닝 데이터를 쉽게 처리할 수 있게 하는 비알고리즘 함수로 나눌 수 있다.

[머신러닝 데이터처리 함수]

분류	함수명	설명
preprocessing	StandardScaler	z-score normalization
	MinMaxScaler	min-max normalization
	OrdinalEncoder	범주형 변수 숫자로 encoding
model_selection	train_test_split	trainset, testset split
	KFold	K-fold 교차 검증
	cross_validate	교차 검증
	GridSearchCV	하이퍼파라미터 튜닝(Grid Search로)
	RandomizedSearchCV	하이퍼파라미터 튜닝(Random Search로)
metrics	accuracy_score	accuracy 계산
	top_k_accuracy_score	top-k accuracy 계산
	auc	auc 계산
	roc_curve	roc curve(fpr, tpr)
	roc_auc_score	roc curve에서 auc 계산
	confusion_matrix	confusion matrix
	recall_score	reacall 계산
	classification_report	분류 모델 성능 요약 report 생성
	mean_squared_error	MSE 계산
	r2_score	결정계수 계산

<preprocessing>

preprocessing은 숫자형 데이터를 normalize하거나, encoding 하는 데 사용된다.

StandardScaler : 평균 0, 표준편차 1로 값을 normalize한다.
MinMaxScaler : min-max normalization을 진행한다. (0~1사이의 범위를 가짐)

→ fit을 이용하여, scaler를 조정(최대,최소 나 평균 표준편차)하고, transform을 통해, 데이터를 변환한다. fit_transform을 사용하면, 해당 데이터셋 내에서 scale을 조정할 수 있다.

from sklearn import preprocessing
import numpy as np

if __name__ == '__main__':
    # x = [1,2,3,...,9]
    x = np.arange(10).reshape(10,1)
    
    # z-score normalize
    std_normal = preprocessing.StandardScaler()
    normalized_x =  std_normal.fit_transform(x)
    
    
    # min-max normalize
    minmax_normal = preprocessing.MinMaxScaler()
    normalized_x = minmax_normal.fit_transform(x)

OrdinalEncoder : 범주형 데이터들에 숫자를 mapping 해줄 때 사용한다.

→ Scaler와 마찬가지로, fit을 이용하여, 문자를 숫자로 만드는 dictionary 구조를 만들 수 있고, transform을 이용하여 적용할 수 있다. fit_transform으로 dictionary 생성 및 변환이 가능하고, inverse_transform을 통해, 거꾸로 숫자에서 문자로 바꾸는 변환도 가능하다.

from sklearn import preprocessing
import numpy as np

if __name__ == '__main__':
    category = np.array(["사과", "딸기", "배", "두리안"])
    category = np.expand_dims(category,axis=-1)
    encoder.fit_transform(category)
    
    # mapping 정보 확인 (해당 list의 index에 번호 대응)
    print(encoder.categories_)

<model_selection>

model_selection은 데이터를 나누거나, 교차검증, 하이퍼 파라미터등을 통해, model 실험에 도움이 되는 함수들이 존재한다.

train_test_split : data를 train과 test로 분할해준다. 비율을 지정할 수 있는 등, 굉장히 유용하여, 많이 사용한다.

→ array를 넣어주고, train size나 test size 중 1개를 설정하여(비율), 정한다. random_state를 명시적으로 지정하면, 복원 가능하다.

from sklearn import datasets
from sklearn import model_selection

if __name__ == '__main__':
	# iris data load
	iris_data = datasets.load_iris()
    
    X = iris_data["data"]
	Y = iris_data["target"]
    
    # train과 test를 8:2로 분리
    x_train, x_test, y_train, y_test = model_selection.train_test_split(X,Y, train_size = 0.8)
    
    print(x_train.size) #480
    print(x_test.size)  #120

KFold : K-fold 교차 검증을 위한, 객체 생성에 사용된다.

→ n_split를 통해, K값을 정해주고, shuffle을 통해, 각 반복에서 데이터를 섞을지 여부를 선택가능하다.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np


if __name__ == '__main__':
    iris = load_iris()
    X, Y = iris.data, iris.target

    # K-Fold 교차 검증
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)

    model = LogisticRegression()

    fold_accuracies = []

    for train_index, test_index in kfold.split(X):
        x_train, x_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]

        model.fit(x_train, y_train)

        y_pred = model.predict(x_test)

        accuracy = accuracy_score(y_test, y_pred)
        fold_accuracies.append(accuracy)
        
    for i, accuracy in enumerate(fold_accuracies, 1):
        print(f"Fold {i} Accuracy: {accuracy}")

    mean_accuracy = np.mean(fold_accuracies)
    print(f"\nMean Cross-Validation Accuracy: {mean_accuracy}")

GridSearchCV : GridSearch로 최적의 하이퍼파라미터를 찾는다.
RandomizedSearchCV : Random Search로 최적의 하이퍼파라미터를 찾는다. GridSearch보다 빠르다.

→ 최적화하려는 모델을 인자로 넣어준다. pram_grid 인자에 탐색할 하이퍼파라미터의 리스트들을 딕셔너리 형태로 넣어준다. scoring 인자를 통해, 모델의 성능 평가를 위한 지표 설정이 가능하다. cv를 통해, Fold 수를 지정 가능하다.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy.stats import uniform, randint

if __name__ == '__main__':
    iris = load_iris()
    X, y = iris.data, iris.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression()

    param_grid = {
        'C': [0.001, 0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2']
    }

    param_dist = {
        'C': uniform(0.001, 100),
        'penalty': ['l1', 'l2']
    }

    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    print("GridSearchCV - Best Parameters:", grid_search.best_params_)
    print("GridSearchCV - Best Accuracy:", grid_search.best_score_)

    random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=0)
    random_search.fit(X_train, y_train)

    print("\nRandomizedSearchCV - Best Parameters:", random_search.best_params_)
    print("RandomizedSearchCV - Best Accuracy:", random_search.best_score_)

<metrics>

metrics는 모델의 성능을 쉽게 측정할 수 있도록 한다. torch 등의 딥러닝에서 얻은 데이터도 numpy나 list로 변환하여, 사이킷런 의 metrics를 이용하면, 매우 쉽게 성능을 구할 수 있다.

accuracy_score : accuracy 값을 구할 수 있다.
top_k_accuracy_score : top-k accuracy(상위 k개 중 정답 존재하는지) score를 구할 수 있다.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, top_k_accuracy_score, confusion_matrix, recall_score, r2_score, classification_report

if __name__ == '__main__':
    iris = load_iris()
    X,y = iris["data"], iris["target"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy Score: {accuracy}")
    # Accuracy Score: 1.0
    
    top_k_acc = top_k_accuracy_score(y_test, model.predict_proba(X_test), k=2)
    print(f"Top-k Accuracy Score: {top_k_acc}")
    # Top-k Accuracy Score: 1.0

roc_curve : 분류 모델의 roc curve의 x축과 y축 (각각 fpr, tpr) 값을 구할 수 있다.
auc : AUC(area under the curve)를 쉽게 구할 수 있다. auc를 사용하기 전에는 roc_curve함수를 먼저 사용하여, fpr과 tpr을 구해야 한다.
roc_auc_score : auc 값을 roc_curve 선행 없이 구할 수 있다.
confusion_matrix : confusion matrix를 쉽게 구할 수 있다.
recall_score : recall 값을 구할 수 있다.
classification_report : recall, precision , f1 score등의 결과를 쉽게 구할 수 있다.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import auc, roc_curve, roc_auc_score, confusion_matrix, recall_score, classification_report
import numpy as np

if __name__ == '__main__':
    iris = load_iris()
    X,y = iris["data"], iris["target"]
    
    y = np.array([1 if i==0 else 0 for i in y])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    area_under_curve = auc(fpr, tpr)
    print(f"AUC Score: {area_under_curve}")
    # AUC Score: 1.0

    plt.plot(fpr, tpr, label='ROC Curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

    roc_auc = roc_auc_score(y_test, y_prob)
    print(f"ROC AUC Score: {roc_auc}")
    # ROC AUC Score: 1.0

    conf_matrix = confusion_matrix(y_test, y_pred)
    print(f"Confusion Matrix:\n{conf_matrix}")
    # Confusion Matrix:
    # [[20  0]
    # [ 0 10]]

    recall = recall_score(y_test, y_pred)
    print(f"Recall Score: {recall}")
    # Recall Score: 1.0

    class_report = classification_report(y_test, y_pred)
    print(f"Classification Report:\n{class_report}")
    # Classification Report:
    #               precision    recall  f1-score   support
    # 
    #            0       1.00      1.00      1.00        20
    #            1       1.00      1.00      1.00        10
    # 
    #     accuracy                           1.00        30
    #    macro avg       1.00      1.00      1.00        30
    # weighted avg       1.00      1.00      1.00        30

mean_squared_error : 평균 제곱 오차(Mean Squared Error, MSE)를 구할 수 있다.
r2_score : 회귀 모델의 결정 계수를 구할 수 있다.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

if __name__ == '__main__':
    iris = load_iris()
    X,y = iris["data"][:,0].reshape(-1,1), iris["data"][:,-1]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    print(f"Mean Sqared Error:{mse}")
    # Mean Sqared Error:0.1541458433507937
    
    r2 = r2_score(y_test, y_pred) 
    print(f"R2 Score:{r2}")
    # R2 Score:0.7575009893273535

'Python' 카테고리의 다른 글

Multiprocessing으로 Pandas Apply 속도 향상 하기 (41)	2023.11.13
Pandas 데이터 구조 & 함수 정리 (2)	2023.11.02
Pypi 사용법 & 명령어 모음 & 폐쇄망 사용법 (1)	2023.08.16
JAX: Just Another XLA 설명 (2)	2023.08.07
Pandas 성능 향상을 위한 방법들 (2)	2023.07.21

DevHwi

Scikit-learn 주요 함수 정리 -(1) 전처리, 특성추출, 평가 함수

Scikit-learn 이란?

설치 방법

Scikit-learn 주요 함수

'Python' 카테고리의 다른 글

+ Recent posts

티스토리툴바