xgboost

XGBoost⚑

XGBoost(Xtreme Gradient Boosting)는 그라디언트 부스팅 알고리즘의 고성능 구현으로, 주로 예측 정확도를 높이고 학습 속도를 향상시키기 위해 다양한 최적화 기법이 추가된 모델입니다. XGBoost는 분류와 회귀 문제를 모두 해결할 수 있으며, 특히 Kaggle 대회와 같은 머신러닝 경진대회에서 높은 성능을 자랑합니다.

주요 개념과 특징⚑

기본 원리⚑

XGBoost는 앙상블 학습 방법의 일종으로, 여러 개의 약한 학습기를 결합하여 강력한 예측 모델을 만듭니다.
기본적으로 결정 트리(Decision Tree)를 약한 학습기로 사용합니다. 새로운 트리를 추가할 때마다 이전 트리들의 오류를 줄이는 방향으로 학습합니다.

목표 함수⚑

XGBoost는 손실 함수(Loss Function)와 정규화 항(Regularization Term)으로 구성된 목표 함수를 최소화합니다.
손실 함수는 모델 예측값과 실제 값 간의 차이를 측정하며, 정규화 항은 모델의 복잡도를 제어합니다.

Gradient Boosting⚑

각 반복에서 잔차(residual)를 최소화하는 새로운 트리를 학습합니다.
잔차는 이전 모델의 예측값과 실제 값의 차이입니다.
학습 과정에서 손실 함수의 그라디언트를 계산하여 새로운 트리를 추가합니다.

정규화⚑

XGBoost는 모델의 복잡도를 제어하기 위해 정규화를 사용합니다.
정규화 항은 트리의 리프 노드 개수와 리프 노드의 가중치 크기를 포함합니다.
이를 통해 과적합을 방지하고 모델의 일반화 성능을 향상시킵니다.

분산 처리⚑

XGBoost는 분산 및 병렬 처리를 지원하여 대규모 데이터셋에서 빠른 학습이 가능합니다.
데이터 분할과 트리 생성 과정을 병렬로 수행합니다.

Pruning (가지치기)⚑

XGBoost는 사후 가지치기(post-pruning)를 사용하여 불필요한 노드를 제거하고 모델의 복잡도를 줄입니다.
가지치기는 최소 손실 감소량(minimum loss reduction)을 기준으로 수행됩니다.

Shrinkage (Learning Rate)⚑

각 트리의 가중치를 줄이기 위해 학습률(learning rate)을 적용합니다.
학습률은 각 트리의 예측값에 곱해져 트리의 기여도를 조절합니다.
이를 통해 모델이 너무 빠르게 과적합되지 않도록 합니다.

학습 과정⚑

초기화⚑

모델의 초기 예측값을 설정합니다. (예: 회귀 문제에서는 타겟 값의 평균)

반복⚑

각 반복에서 새로운 트리를 학습합니다.
잔차를 계산하고, 이를 최소화하는 방향으로 트리를 학습합니다.
각 트리의 예측값에 학습률을 곱해 모델에 추가합니다.

    def fit(self, X, y):
        # Initialize model with mean of target values
        self.init_pred = np.mean(y)
        y_pred = np.full(y.shape, self.init_pred)

        for _ in range(self.n_estimators):
            # Calculate residuals
            residuals = y - y_pred

            # Train a new tree on residuals
            tree = self._build_tree(X, residuals, depth=0)

            # Update predictions
            y_pred += self.learning_rate * self._predict_tree(X, tree)

            # Save the trained tree
            self.trees.append(tree)

학습 과정은 _build_tree 함수를 recursive 하게 적용하여 트리를 생성합니다.
최적의 분할을 찾아가며 Recursive 하게 트리를 생성합니다.

    def _build_tree(self, X, residuals, depth):
        if depth >= self.max_depth or len(X) < self.min_samples_split:
            return np.mean(residuals)

        best_split = self._find_best_split(X, residuals)
        if best_split is None:
            return np.mean(residuals)

        left_idx, right_idx = best_split['left_idx'], best_split['right_idx']
        left_subtree = self._build_tree(X[left_idx], residuals[left_idx], depth + 1)
        right_subtree = self._build_tree(X[right_idx], residuals[right_idx], depth + 1)

        return {
            'feature': best_split['feature'],
            'threshold': best_split['threshold'],
            'left': left_subtree,
            'right': right_subtree
        }

손실 함수(Loss Function)⚑

XGBoost의 손실 감소량(Gain) 공식은 그라디언트 부스팅 알고리즘의 원리와 정규화 항을 적용한 손실 함수의 도함수로부터 유도됩니다. 이 공식을 이해하려면 몇 가지 기본 개념과 수식을 살펴볼 필요가 있습니다.

1. 목적 함수(Objective Function)⚑

XGBoost는 특정 손실 함수를 최소화하는 방향으로 모델을 학습합니다. 회귀 문제의 경우, 일반적으로 사용되는 손실 함수는 다음과 같습니다:

\[ L(y, \hat{y}) = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

여기에 정규화 항을 추가하여 모델의 복잡도를 제어합니다:

\[ L(\theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \Omega(f_t) \]

여기서 \(\Omega(f_t)\)는 정규화 항입니다. 트리 기반 모델의 경우, 다음과 같이 정의됩니다:

\[ \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2 \]

여기서 \(T\)는 트리의 리프 노드 수, \(w_j\)는 리프 노드의 가중치입니다.

2. 2차 근사(Taylor Expansion)⚑

XGBoost는 손실 함수를 2차 테일러 근사(Taylor Expansion)하여 모델을 업데이트합니다.

\[ \mathbf{L}(\mathbf{y}, \hat{\mathbf{y}} + \mathbf{f}_t(\mathbf{x})) \approx \mathbf{L}(\mathbf{y}, \hat{\mathbf{y}}) + \nabla \mathbf{L}(\mathbf{y}, \hat{\mathbf{y}})^T \mathbf{f}_t(\mathbf{x}) + \frac{1}{2} \mathbf{f}_t(\mathbf{x})^T \mathbf{H} \mathbf{f}_t(\mathbf{x}) \]

즉, 손실 함수를 1차 및 2차 도함수를 사용하여 근사합니다.
각 데이터 포인트 \(i\)에 대해 손실 함수의 근사는 다음과 같습니다

\[ l(y_i, \hat{y}_i + f_t(x_i)) \approx l(y_i, \hat{y}_i) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \]

여기서:

\(g_i = \frac{\partial l(y_i, \hat{y}_i)}{\partial \hat{y}_i}\)는 손실 함수의 1차 도함수 (그라디언트)
\(h_i = \frac{\partial^2 l(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}\)는 손실 함수의 2차 도함수 (헤시안)

3. 트리의 손실 감소량⚑

트리의 각 분할에서 손실 감소량은 다음과 같이 계산됩니다:

\[ \text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \]

여기서:

\(G_L\)와\(G_R\)는 각각 왼쪽과 오른쪽 노드의 그라디언트 합
\(H_L\)와\(H_R\)는 각각 왼쪽과 오른쪽 노드의 헤시안 합
\(\lambda\)는 L2 정규화 항
\(\gamma\)는 분할을 만들기 위한 최소 손실 감소량

3.1 손실 감소량 유도⚑

각 노드의 손실:
- 각 리프 노드에서의 손실은 다음과 같이 정의됩니다:

\[ \text{Loss} = \sum_{i \in \text{left}} g_i w_{\text{left}} + \frac{1}{2} \sum_{i \in \text{left}} h_i w_{\text{left}}^2 + \sum_{i \in \text{right}} g_i w_{\text{right}} + \frac{1}{2} \sum_{i \in \text{right}} h_i w_{\text{right}}^2 \]

최적의 리프 노드 가중치 \(w\):
- 최적의 리프 노드 가중치는 다음과 같이 주어집니다:

\[ w^* = -\frac{\sum g_i}{\sum h_i + \lambda} \]

분할 전후의 손실 감소량:
- 전체 손실에서 분할 전후의 손실 감소량을 계산하여 Gain을 구합니다. 최적의 가중치를 적용한 후의 손실을 계산하면:

\[ \text{Loss}_{\text{left}} = -\frac{1}{2} \frac{(\sum_{i \in \text{left}} g_i)^2}{\sum_{i \in \text{left}} h_i + \lambda} \]

\[ \text{Loss}_{\text{right}} = -\frac{1}{2} \frac{(\sum_{i \in \text{right}} g_i)^2}{\sum_{i \in \text{right}} h_i + \lambda} \]

전체 손실 감소량:
- 분할 후의 손실 감소량을 계산합니다. 이때, 정규화 항과 최소 손실 감소량\(\gamma\)를 고려합니다.

\[ \text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \]

이 공식은 각 분할의 손실 감소량을 정확하게 계산하여 최적의 분할을 선택하는 데 사용됩니다. 이를 통해 모델의 성능을 최적화할 수 있습니다.

코드⚑

_find_best_split 함수를 통해 최적의 분할을 찾습니다.
각 feature와 threshold에 대해 손실 함수를 계산하고, 가장 낮은 손실을 갖는 분할을 선택합니다.
실제 XGBoost 라이브러리에서는 이러한 과정을 최적화하여 빠르게 수행합니다.

def _find_best_split(self, X, gradients, hessians):
    best_split = None
    best_gain = -float('inf')

    for feature in range(X.shape[1]):
        thresholds = np.unique(X[:, feature])
        for threshold in thresholds:
            left_idx = X[:, feature] < threshold
            right_idx = ~left_idx

            if np.sum(left_idx) == 0 or np.sum(right_idx) == 0:
                continue

            # Left and right gradients and hessians
            grad_left = np.sum(gradients[left_idx])
            grad_right = np.sum(gradients[right_idx])
            hess_left = np.sum(hessians[left_idx])
            hess_right = np.sum(hessians[right_idx])

            # Gain from the split
            gain = 0.5 * (
                (grad_left ** 2 / (hess_left + self.reg_lambda)) +
                (grad_right ** 2 / (hess_right + self.reg_lambda)) -
                ((grad_left + grad_right) ** 2 / (hess_left + hess_right + self.reg_lambda))
            ) - self.gamma

            if gain > best_gain:
                best_gain = gain
                best_split = {
                    'feature': feature,
                    'threshold': threshold,
                    'left_idx': left_idx,
                    'right_idx': right_idx
                }

    return best_split

종료 조건⚑

설정한 트리 개수(n_estimators)에 도달하거나 손실 함수의 감소가 특정 기준 이하로 떨어지면 학습을 종료합니다.
- gamma: 리프 노드 분할을 위한 최소 손실 감소량을 설정합니다.
- max_depth: 트리의 최대 깊이를 설정합니다.

    def _build_tree(self, X, residuals, depth):
        if depth >= self.max_depth or len(X) < self.min_samples_split:
            return np.mean(residuals)

        best_split = self._find_best_split(X, residuals)
        if best_split is None:
            return np.mean(residuals)

장점⚑

높은 성능: 정규화, 가지치기, 학습률 조절 등을 통해 높은 예측 성능을 제공합니다.
빠른 학습: 분산 처리와 병렬 처리를 통해 빠른 학습이 가능합니다.
유연성: 회귀, 분류, 랭킹 등 다양한 문제에 적용할 수 있습니다.
하이퍼파라미터 튜닝: 다양한 하이퍼파라미터를 통해 모델 성능을 세밀하게 조정할 수 있습니다.

Implementation⚑

import numpy as np

class SimpleXGBoost:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=2, reg_lambda=1.0, gamma=0):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.reg_lambda = reg_lambda
        self.gamma = gamma
        self.trees = []

    def fit(self, X, y):
        # Initialize model with mean of target values
        self.init_pred = np.mean(y)
        y_pred = np.full(y.shape, self.init_pred)

        for _ in range(self.n_estimators):
            # Calculate residuals
            residuals = y - y_pred

            # Gradients and Hessians for second-order gradient boosting
            gradients = residuals
            hessians = np.ones_like(gradients)

            # Train a new tree on gradients
            tree = self._build_tree(X, gradients, hessians, depth=0)

            # Update predictions
            y_pred += self.learning_rate * self._predict_tree(X, tree)

            # Save the trained tree
            self.trees.append(tree)

    def predict(self, X):
        # Start with initial prediction
        y_pred = np.full((X.shape[0],), self.init_pred)

        # Add the predictions from each tree
        for tree in self.trees:
            y_pred += self.learning_rate * self._predict_tree(X, tree)

        return y_pred

    def _build_tree(self, X, gradients, hessians, depth):
        if depth >= self.max_depth or len(X) < self.min_samples_split:
            return np.sum(gradients) / (np.sum(hessians) + self.reg_lambda)

        best_split = self._find_best_split(X, gradients, hessians)
        if best_split is None:
            return np.sum(gradients) / (np.sum(hessians) + self.reg_lambda)

        left_idx, right_idx = best_split['left_idx'], best_split['right_idx']
        left_subtree = self._build_tree(X[left_idx], gradients[left_idx], hessians[left_idx], depth + 1)
        right_subtree = self._build_tree(X[right_idx], gradients[right_idx], hessians[right_idx], depth + 1)

        return {
            'feature': best_split['feature'],
            'threshold': best_split['threshold'],
            'left': left_subtree,
            'right': right_subtree
        }

    def _find_best_split(self, X, gradients, hessians):
        best_split = None
        best_gain = -float('inf')

        for feature in range(X.shape[1]):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                left_idx = X[:, feature] < threshold
                right_idx = ~left_idx

                if np.sum(left_idx) == 0 or np.sum(right_idx) == 0:
                    continue

                # Left and right gradients and hessians
                grad_left = np.sum(gradients[left_idx])
                grad_right = np.sum(gradients[right_idx])
                hess_left = np.sum(hessians[left_idx])
                hess_right = np.sum(hessians[right_idx])

                # Gain from the split
                gain = 0.5 * (
                    (grad_left ** 2 / (hess_left + self.reg_lambda)) +
                    (grad_right ** 2 / (hess_right + self.reg_lambda)) -
                    ((grad_left + grad_right) ** 2 / (hess_left + hess_right + self.reg_lambda))
                ) - self.gamma

                if gain > best_gain:
                    best_gain = gain
                    best_split = {
                        'feature': feature,
                        'threshold': threshold,
                        'left_idx': left_idx,
                        'right_idx': right_idx
                    }

        return best_split

    def _predict_tree(self, X, tree):
        if not isinstance(tree, dict):
            return np.full(X.shape[0], tree)

        feature = tree['feature']
        threshold = tree['threshold']
        left_idx = X[:, feature] < threshold
        right_idx = ~left_idx

        predictions = np.empty(X.shape[0])
        predictions[left_idx] = self._predict_tree(X[left_idx], tree['left'])
        predictions[right_idx] = self._predict_tree(X[right_idx], tree['right'])

        return predictions

# Load dataset (using sklearn)
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load dataset
boston = fetch_openml(name='boston', version=1, as_frame=False)
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SimpleXGBoost model
model = SimpleXGBoost(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = np.mean((y_test - y_pred) ** 2)
print(f'Mean Squared Error: {mse:.4f}')