Calculus is a branch of mathematics that gives tools to study the rate of change of functions through two main areas: derivatives and integrals. In the context of machine learning and data science, you might use integrals to calculate the area under the curve (for instance, to evaluate the performance of a model with the ROC curve, or to calculate probability from densities).

In this article, you’ll learn about integrals and the area under the curve using a practical data science example: the area under the ROC curve, used to compare the performance of two machine learning models. Building from this example, you’ll see the notion of the area under the curve and integrals from a mathematical point of view (from my book Essential Math for Data Science).

Let’s say that you would like to predict the quality of wines from several of their chemical properties. You want to do a binary classification of the quality (distinguishing very good wines from not very good ones). You’ll develop methods that let you evaluate your models on imbalanced data using the area under the Receiver Operating Characteristics (ROC) curve.

To do this, we’ll use a dataset showing various chemical properties of red wines and ratings of their quality. The related paper is Cortez, Paulo, et al. “Modeling wine preferences by data mining from physicochemical properties.” Decision Support Systems 47.4 (2009): 547-553.

Figure 1: Illustration of wine quality modeling.

As illustrated in Figure 1, the dataset represents chemical analyses of wines (the features) and ratings of their quality. This rating is the target: this is what you’ll try to estimate.

First, let’s load the data and have a look at the features:

This shows that, even with a random model, the accuracy is not bad at all: it doesn’t mean that the model is good. To summarize, when the classes contain different numbers of observations, you can’t rely on the accuracy to evaluate your model’s performance. In our example, the model could output only zeros and you would get around 86% accuracy. You need other metrics to assess the performance of models with imbalanced datasets.

ROC Curves

A good alternative to the accuracy is the Receiver Operating Characteristics (ROC) curve. You can check the very good explanations of Aurélien Géron about ROC curves in Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems.

The main idea is to separate the estimations from the model into four categories:

- The true positives (TP): the prediction is 1 and the true class is 1.
- The false positives (FP): the prediction is 1 but the true class is 0.
- The true negatives (TN): the prediction is 0 and the true class is 0.
- The false negatives (FN): the prediction is 0 but the true class is 1.

Let’s calculate these values for your first logistic regression model. You can use the function confusion_matrix from Sklearn. It presents a table organized as follows:

Figure 2: Illustration of a confusion matrix.

You can see that there is no positive observation that has been correctly classified (TP) with the random model.

In classification tasks, you want to estimate the class of data samples. For models like logistic regression, which output probabilities between 0 and 1, you need to convert this score to the class 0 or 1 using a decision threshold, or just threshold. A probability above the threshold is considered as a positive class. For instance, using the default choice of the decision threshold at 0.5, you consider that the estimated class is 1 when the model outputs a score above 0.5. However, you can choose other thresholds, and the metrics you use to evaluate the performance of your model will depend on this threshold.

With the ROC curve, you consider multiple thresholds between 0 and 1 and calculate the true positive rate as a function of the false positive rate for each of them. You can use the function roc_curve from Sklearn to calculate the false positive rate (fpr) and the true positive rate (tpr). The function also outputs the corresponding thresholds.
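To make the link between the ROC curve and an area concrete, here is a minimal sketch in plain Python. It is not the article's wine model: the labels and scores are small hand-made values chosen for illustration, and the helper names (`confusion_counts`, `roc_points`, `area_under_curve`) are hypothetical. It counts TP, FP, TN and FN at a threshold, sweeps the threshold to get (fpr, tpr) points, and approximates the area under those points with the trapezoidal rule. The sketch treats a score at or above the threshold as positive, which is an assumption made so that each observed score can serve as a threshold.

```python
# Toy, hand-made data (hypothetical, not the wine dataset): 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60, 0.35, 0.80]


def confusion_counts(y_true, scores, threshold=0.5):
    """Count (TP, FP, TN, FN); a score >= threshold is predicted as class 1."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(y_true, scores):
    """Return one (fpr, tpr) pair per threshold, from strictest to loosest."""
    # Start with an infinite threshold (nothing predicted positive),
    # then use each distinct score as a threshold, in decreasing order.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
        fpr = fp / (fp + tn)  # false positive rate
        tpr = tp / (tp + fn)  # true positive rate
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Approximate the integral of tpr over fpr with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


tp, fp, tn, fn = confusion_counts(y_true, scores)
accuracy = (tp + tn) / len(y_true)
all_zeros_accuracy = y_true.count(0) / len(y_true)
auc_value = area_under_curve(roc_points(y_true, scores))

print("TP, FP, TN, FN:", (tp, fp, tn, fn))
print("accuracy at threshold 0.5:", accuracy)
print("accuracy of an all-zeros model:", all_zeros_accuracy)
print("area under the ROC curve:", auc_value)
```

On this toy data, a model that outputs only zeros reaches the same accuracy as the thresholded scores (0.8 in both cases), while the area under the ROC curve (0.875) still reflects how well the scores rank positives above negatives. In practice you would likely rely on Sklearn's `confusion_matrix` and `roc_curve` rather than hand-written helpers like these.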