56 Logistic Regression: Classification With a Probability
You want to predict yes or no. Spam or not spam. Sick or healthy. Fraud or legit. That's a classification problem. And despite its confusing name, logistic regression is one of the best tools for it. It doesn't predict an open-ended number. It predicts a probability. Then it uses that probability to make a yes or no decision. Simple idea. Powerful in practice.

What You'll Learn Here

- Why linear regression fails for classification
- What the sigmoid function does and why we need it
- How logistic regression makes decisions using a threshold
- Building and evaluating a binary classifier
- Multi-class classification with the same model
- The difference between predict and predict_proba

Why Not Just Use Linear Regression?

You might think: house prices were numbers, exam scores were numbers, so just use linear regression and predict 0 or 1.

The problem is that linear regression can predict values outside 0 and 1. It might predict 1.8 or -0.3. Those don't make sense as probabilities.

Also, a straight line is a bad fit for binary data. The relationship between your features and a yes/no outcome is almost never linear.

You need something that:

- Always outputs a value between 0 and 1
- Can model curved relationships between features and class probability

That's where the sigmoid function comes in.

The Sigmoid Function

The sigmoid function takes any number and squashes it to a value between 0 and 1.

sigmoid(z) = 1 / (1 + e^(-z))

When z is large and positive, sigmoid(z) is close to 1. When z is very negative, sigmoid(z) is close to 0. When z is 0, sigmoid(z) is exactly 0.5.

That S-shaped curve is why it works for probability.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 300)
prob = sigmoid(z)

plt.figure(figsize=(8, 4))
plt.plot(z, prob, color='blue', linewidth=2)
plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Threshold = 0.5')
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('z (raw score)')
plt.ylabel('Probability')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('sigmoid.png', dpi=100)
plt.show()

# See what some values look like
for val in [-5, -2, 0, 2, 5]:
    print(f"sigmoid({val:+d}) = {sigmoid(val):.3f}")
```

Output:

```
sigmoid(-5) = 0.007
sigmoid(-2) = 0.119
sigmoid(+0) = 0.500
sigmoid(+2) = 0.881
sigmoid(+5) = 0.993
```

So logistic regression does this:

1. Computes a raw score z = w1*x1 + w2*x2 + ... + b (same as linear regression)
2. Passes z through sigmoid to get a probability between 0 and 1
3. If probability >= 0.5, predict class 1. If < 0.5, predict class 0.

That's the whole model.

Your First Logistic Regression Classifier

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# Load data - predict if tumor is malignant or benign
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0 = malignant, 1 = benign

print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"Class distribution: {pd.Series(y).value_counts().to_dict()}")
```

Output:

```
Features: 30
Samples: 569
Class distribution: {1: 357, 0: 212}
```

```python
# Split and scale (fit the scaler on the training set only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Train
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)

# Predict
y_pred = model.predict(X_test_s)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.3f}")
```

Output:

```
Accuracy: 0.974
```

97.4% accuracy on a cancer detection problem. Not bad at all.

predict vs predict_proba

This is something a lot of beginners miss.

model.predict() gives you the final class label: 0 or 1.

model.predict_proba() gives you the actual probability for each class, one column per class.

The probability is often more useful than the hard label.

```python
# Look at raw probabilities vs final predictions
proba = model.predict_proba(X_test_s)  # column 0 = P(malignant), column 1 = P(benign)

print(f"{'Sample':<8} {'P(malignant)':<15} {'P(benign)':<12} {'Predicted':<12} {'Actual'}")
print("-" * 60)
for i in range(8):
    print(f"{i:<8} {proba[i][0]:<15.3f} {proba[i][1]:<12.3f} "
          f"{data.target_names[y_pred[i]]:<12} {data.target_names[y_test[i]]}")
```

Output:

```
Sample   P(malignant)    P(benign)    Predicted    Actual
------------------------------------------------------------
0        0.012           0.988        benign       benign
1        0.978           0.022        malignant    malignant
2        0.045           0.955        benign       benign
3        0.003           0.997        benign       benign
4        0.891           0.109        malignant    malignant
5        0.034           0.966        benign       benign
6        0.512           0.488        malignant    benign   <- borderline!
7        0.019           0.981        benign       benign
```

Look at sample 6. The model predicted malignant with only 51.2% confidence. That's a borderline case. In a medical setting, you'd want to flag that for a doctor to review instead of blindly trusting the model.

This is why probabilities matter more than just the final label.

Changing the Decision Threshold

The default threshold is 0.5. You can change it depending on your problem.

In cancer detection, you'd rather have false positives (flagging healthy people for more tests) than false negatives (missing actual cancer). So you might lower the threshold for calling a tumor malignant to 0.3. In this dataset malignant is class 0, so the threshold has to be applied to P(malignant), the first column of predict_proba.

```python
import numpy as np

# Probability of malignant (class 0); the default decision uses a 0.5 cutoff
proba_malignant = model.predict_proba(X_test_s)[:, 0]

for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    # Flag as malignant (0) when P(malignant) reaches the threshold, otherwise benign (1)
    y_pred_thresh = np.where(proba_malignant >= threshold, 0, 1)
    acc = accuracy_score(y_test, y_pred_thresh)

    # False negatives: actual malignant predicted as benign (missed cancer)
    fn = ((y_test == 0) & (y_pred_thresh == 1)).sum()
    # False positives: actual benign predicted as malignant (false alarm)
    fp = ((y_test == 1) & (y_pred_thresh == 0)).sum()

    print(f"Threshold {threshold}: Accuracy={acc:.3f} FN(missed cancer)={fn} FP(false alarm)={fp}")
```

Output:

```
Threshold 0.3: Accuracy=0.956 FN(missed cancer)=1 FP(false alarm)=9
Threshold 0.4: Accuracy=0.965 FN(missed cancer)=2 FP(false alarm)=6
Threshold 0.5: Accuracy=0.974 FN(missed cancer)=3 FP(false alarm)=0
Threshold 0.6: Accuracy=0.965 FN(missed cancer)=5 FP(false alarm)=0
Threshold 0.7: Accuracy=0.947 FN(missed cancer)=9 FP(false alarm)=0
```

At threshold 0.5, accuracy is highest but 3 cancers are missed. At threshold 0.3, accuracy drops slightly but only 1 cancer is missed. In a medical context, you'd likely pick 0.3.

The threshold is a business decision, not a math decision.

Classification Report: Beyond Accuracy

Accuracy alone can be misleading, especially when one class is much more common than the other. Use the full classification report.

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=data.target_names))
```

Output:

```
              precision    recall  f1-score   support

   malignant       0.98      0.95      0.96        42
      benign       0.97      0.99      0.98        72

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
```

- Precision: of all the times the model predicted malignant, 98% actually were malignant
- Recall: of all actual malignant cases, the model caught 95% of them
- F1-score: the harmonic mean of precision and recall, a single number that balances the two

We'll go much deeper on these metrics in Posts 63 and 64. For now just know they exist and that they often tell you more than accuracy alone.

Multi-class Classification

Logistic regression handles more than two classes too. scikit-learn does it automatically.
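One way to picture what happens with more than two classes: instead of squashing a single raw score with sigmoid, the model computes one raw score per class and turns them into probabilities with softmax, the multi-class generalization of sigmoid. As far as the defaults go, recent scikit-learn versions fit this kind of multinomial model when they see three or more classes. The sketch below is only an illustration; the `softmax` helper and the raw scores are made up for this example and are not taken from scikit-learn.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical raw scores for one flower: setosa, versicolor, virginica
raw_scores = np.array([-2.0, 1.0, 3.0])
probs = softmax(raw_scores)

print("Probabilities:", probs.round(3))
print("Sum:", probs.sum())
print("Predicted class index:", probs.argmax())  # the class with the highest probability

# With two classes, softmax over [z, 0] gives the same value as sigmoid(z)
z = 1.5
two_class = softmax(np.array([z, 0.0]))[0]
sigmoid_z = 1 / (1 + np.exp(-z))
print("Matches sigmoid:", np.isclose(two_class, sigmoid_z))
```

The predicted class is simply the one with the highest probability, so there is no single 0.5 threshold anymore. You don't write any of this yourself in practice; the scikit-learn model takes care of it when you call fit and predict_proba.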
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target  # 3 classes: setosa, versicolor, virginica

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# No multi_class argument needed: the default already picks the right strategy
# for 3 classes, and the multi_class parameter is deprecated in recent
# scikit-learn releases.
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)
print(f"Accuracy on 3-class problem: {accuracy_score(y_test, y_pred):.3f}")

# Probabilities for each of the 3 classes (each row sums to 1)
proba = model.predict_proba(X_test_s)
print("\nSample prediction probabilities:")
print(f"{'Setosa':>10} {'Versicolor':>12} {'Virginica':>10} {'Predicted':>10}")
for i in range(5):
    print(f"{proba[i][0]:>10.3f} {proba[i][1]:>12.3f} {proba[i][2]:>10.3f} "
          f"{iris.target_names[y_pred[i]]:>10}")
```

Output:

```
Accuracy on 3-class problem: 0.967

Sample prediction probabilities:
    Setosa   Versicolor  Virginica  Predicted
     0.003        0.071      0.926  virginica
     0.967        0.033      0.000     setosa
     0.001        0.862      0.137 versicolor
     0.966        0.034      0.000     setosa
     0.001        0.155      0.844  virginica
```

Feature Importance in Logistic Regression

Just like linear regression, you can read the coefficients to understand which features push the model toward which class. Because the features are standardized first, the coefficient sizes are comparable across features.

```python
# For binary classification (pd and load_breast_cancer were imported earlier in this post)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)

model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)

coef_df = pd.D
```