F1 Score in Machine Learning: Formula, Precision and Recall

In machine learning, high accuracy is not always the ultimate goal, especially when dealing with imbalanced datasets.

For example, consider a medical test that is 95% accurate overall because it correctly identifies healthy patients, yet fails to detect most actual disease cases. The high accuracy conceals a significant weakness. This is where the F1 Score proves helpful.

That is why the F1 Score gives equal importance to precision (the percentage of selected items that are relevant) and recall (the percentage of relevant items that are selected), so that models can be evaluated reliably even when the data is skewed.

What is the F1 Score in Machine Learning?

The F1 Score is a popular performance metric in machine learning that combines precision and recall into a single measure. It is particularly useful for classification tasks with imbalanced data, where accuracy can be misleading.

Because it averages precision and recall using the harmonic mean, the F1 Score does not favor one type of error over the other: both incorrectly rejected positives (false negatives) and incorrectly accepted negatives (false positives) are taken into account.

Understanding the Basics: Accuracy, Precision, and Recall 

1. Accuracy

Definition: Accuracy measures the overall correctness of a model by calculating the ratio of correctly predicted observations (both true positives and true negatives) to the total number of observations.

Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • TP: True Positives
  • TN: True Negatives
  • FP: False Positives
  • FN: False Negatives

When Accuracy Is Useful:

  • Ideal when the dataset is balanced and false positives and negatives have similar consequences.
  • Common in general-purpose classification problems where the data is evenly distributed among classes.

Limitations:

  • It can be misleading in imbalanced datasets.
    Example: In a dataset where 95% of samples belong to one class, predicting every sample as that class yields 95% accuracy even though the model has learned nothing useful (see the sketch after this list).
  • Doesn’t differentiate between the types of errors (false positives vs. false negatives).
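
To see this effect concretely, here is a minimal sketch (using scikit-learn and made-up labels, so the numbers are purely illustrative) of a "model" that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced data: 95 negative samples, 5 positive samples
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a lazy "model" that always predicts the majority class

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks impressive
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- exposes that nothing was learned
```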

2. Precision

Definition: Precision is the proportion of correctly predicted positive observations to the total predicted positives. It tells us how many of the predicted positive cases were actually positive.

Formula:

Precision = TP / (TP + FP)

Intuitive Explanation:

Of all instances that the model classified as positive, how many are truly positive? High precision means fewer false positives.

When Precision Matters:

  • When the cost of a false positive is high.
  • Examples:
    • Email spam detection: We don’t want essential emails (non-spam) to be marked as spam.
    • Fraud detection: Avoid flagging too many legitimate transactions.

3. Recall (Sensitivity or True Positive Rate)

Definition: Recall is the proportion of actual positive cases that the model correctly identified.

Formula:

Recall = TP / (TP + FN)

Intuitive Explanation:

Out of all real positive cases, how many did the model successfully detect? High recall means fewer false negatives.

When Recall Is Critical:

  • When a positive case has serious consequences.
  • Examples:
    • Medical diagnosis: Missing a disease (false negative) can be fatal.
    • Security systems: Failing to detect an intruder or threat.

Precision and recall provide a deeper understanding of a model’s performance, especially when accuracy alone isn’t enough. Their trade-off is often handled using the F1 Score, which we’ll explore next.
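
To make the three formulas concrete, here is a minimal sketch in plain Python using hypothetical confusion-matrix counts (the numbers are invented for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, tn, fp, fn = 80, 900, 20, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # overall correctness
precision = tp / (tp + fp)                   # of the predicted positives, how many were correct
recall    = tp / (tp + fn)                   # of the actual positives, how many were found

print(f"Accuracy:  {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.889
```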

The Confusion Matrix: Foundation for Metrics


A confusion matrix is a fundamental tool in machine learning that visualizes the performance of a classification model by comparing predicted labels against actual labels. It categorizes predictions into four distinct outcomes.

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

Understanding the Components

  • True Positive (TP): Correctly predicted positive instances.
  • True Negative (TN): Correctly predicted negative instances.
  • False Positive (FP): Incorrectly predicted as positive when negative.
  • False Negative (FN): Incorrectly predicted as negative when positive.

These components are essential for calculating various performance metrics:

Calculating Key Metrics

  • Accuracy: Measures the overall correctness of the model.
    Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision: Indicates the accuracy of the positive predictions.
    Formula: Precision = TP / (TP + FP)
  • Recall (Sensitivity): Measures the model’s ability to identify all positive instances.
    Formula: Recall = TP / (TP + FN)
  • F1 Score: Harmonic mean of precision and recall, balancing the two.
    Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

These metrics, all derived from the confusion matrix, allow classification models to be evaluated and optimized with respect to the goal at hand.
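
As a sketch of how these quantities are typically obtained in practice (assuming scikit-learn and small made-up label lists), the confusion matrix and the derived metrics can be computed as follows:

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Made-up ground-truth and predicted labels for a binary problem
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix in scikit-learn's ordering: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 Score: ", f1_score(y_true, y_pred))
```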

F1 Score: The Harmonic Mean of Precision and Recall

Definition and Formula:

The F1 Score is the harmonic mean of Precision and Recall:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

It condenses a model's performance into a single value that accounts for both false positives and false negatives.


Why the Harmonic Mean is Used:

The harmonic mean is used instead of the arithmetic mean because it gives more weight to the smaller of the two values (Precision or Recall). If either one is low, the F1 score drops sharply, which enforces the roughly equal importance of the two measures.
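
A quick sketch with deliberately lopsided (hypothetical) values shows the difference:

```python
precision, recall = 0.9, 0.1  # hypothetical, deliberately unbalanced values

arithmetic_mean = (precision + recall) / 2            # 0.50 -- hides the weak recall
f1 = 2 * (precision * recall) / (precision + recall)  # 0.18 -- the harmonic mean (F1)

print(arithmetic_mean, round(f1, 2))
```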

Range of F1 Score:

  • 0 to 1: The F1 score ranges from 0 (worst) to 1 (best).
    • 1: Perfect precision and recall.
    • 0: Either precision or recall is 0, indicating poor performance.

Example Calculation:

Given a confusion matrix with:

  • TP = 50, FP = 10, FN = 5
  • Precision = 50 / (50 + 10) = 0.833
  • Recall = 50 / (50 + 5) = 0.909

Plugging these values into the formula gives F1 = 2 * (0.833 * 0.909) / (0.833 + 0.909) ≈ 0.87, a strong score that reflects a good balance between precision and recall.
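
A short Python sketch confirms the arithmetic (using the TP, FP, and FN counts given above):

```python
tp, fp, fn = 50, 10, 5

precision = tp / (tp + fp)                            # 0.833
recall    = tp / (tp + fn)                            # 0.909
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean of the two

print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
# Precision=0.833, Recall=0.909, F1=0.870
```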

Comparing Metrics: When to Use F1 Score Over Accuracy

When to Use F1 Score?

  1. Imbalanced Datasets:

The F1 score is more appropriate when the classes in the dataset are imbalanced (e.g., fraud detection, disease diagnosis). In such situations, accuracy is deceptive: a model can achieve high accuracy simply by classifying most of the majority class correctly while performing poorly on the minority class.

  2. When Both False Positives and False Negatives Are Costly

The F1 score is most suitable when both false positives (Type I errors) and false negatives (Type II errors) are costly. In medical testing or spam detection, for example, both kinds of mistake matter almost equally.

How F1 Score Balances Precision and Recall:

The F1 Score combines precision (how many of the predicted positive cases were actually positive) and recall (how many of the actual positive cases were identified) into a single measure.

Because the harmonic mean penalizes whichever of the two is lower, a model cannot achieve a high F1 score by excelling at one measure while neglecting the other.

This is especially valuable in problems where weak performance on either objective is unacceptable, as is the case in many critical fields.

Use Cases Where F1 Score is Preferred:

1. Medical Diagnosis

For a disease such as cancer, we want a test that rarely misses a patient who has the disease but also rarely misidentifies a healthy individual as positive. The F1 score helps keep both types of error in check.

2. Fraud Detection

In financial transaction processing, fraud detection models must catch fraudulent transactions (high recall) without flagging an excessive number of genuine transactions as fraudulent (high precision). The F1 score ensures this balance.

When Is Accuracy Sufficient?

  1. Balanced Datasets

When the classes in the dataset are balanced, accuracy is usually a reasonable way to measure the model’s performance, since a good model should predict both classes equally well.

  2. Low Impact of False Positives/Negatives

When false positives and false negatives carry little cost, accuracy remains a good measure of the model.

Key Takeaway

Use the F1 Score when the data is imbalanced, when false positives and false negatives are equally important, and in high-risk areas such as medical diagnosis and fraud detection.

Use accuracy when the classes are balanced and false negatives and false positives have little impact on the outcome.

Because the F1 Score considers both precision and recall, it is especially convenient in tasks where the cost of mistakes is significant.

Interpreting the F1 Score in Practice

What Constitutes a “Good” F1 Score?

What counts as a good F1 score depends on the context and domain of the application.

  • High F1 Score (0.8–1.0): Indicates the model achieves both high precision and high recall.
  • Moderate F1 Score (0.6–0.8): Indicates reasonable performance, but with ample room for improvement.
  • Low F1 Score (<0.6): Indicates the model needs substantial improvement.

In high-stakes settings such as medical diagnostics or fraud detection, even a moderate F1 score may be insufficient, and higher scores are generally required.

Using F1 Score for Model Selection and Tuning

The F1 score is instrumental in:

  • Comparing Models: It offers a fair basis for comparison, especially in the presence of class imbalance.
  • Hyperparameter Tuning: Hyperparameters can be adjusted to maximize the model’s F1 score rather than its accuracy.
  • Threshold Adjustment: The classification decision threshold can be tuned to trade precision against recall and thereby increase the F1 score.

For example, we can combine cross-validation with grid search or random search to find the hyperparameters that yield the highest F1 score.
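
The following is a minimal sketch of that idea (assuming scikit-learn, a logistic regression model, and a synthetic imbalanced dataset; the parameter grid is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score

# Synthetic imbalanced dataset: roughly 10% positive samples
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Grid search that selects hyperparameters by cross-validated F1, not accuracy
param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test F1 score:  ", f1_score(y_test, search.predict(X_test)))
```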

Macro, Micro, and Weighted F1 Scores for Multi-Class Problems

In multi-class classification, averaging methods are used to compute the F1 score across multiple classes:

  • Macro F1 Score: Computes the F1 score for each class separately and then takes the unweighted average, treating all classes equally regardless of how often they occur.
  • Micro F1 Score: Aggregates the true positives, false positives, and false negatives across all classes before computing a single F1 score, so frequent classes carry more influence than rare ones.
  • Weighted F1 Score: Computes the F1 score for each class (using F1 = 2 * (precision * recall) / (precision + recall)) and then averages them, weighting each class by its number of true instances (its support). This accounts for class imbalance by giving more weight to the more populated classes.

The choice of averaging method depends on the requirements of the specific application and the nature of the data.
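
The sketch below (assuming scikit-learn and a small made-up multi-class example) shows how each averaging strategy is selected through the average parameter of f1_score:

```python
from sklearn.metrics import f1_score

# Made-up multi-class labels; class 0 is much more frequent than classes 1 and 2
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 1, 2, 2, 2]

print("Macro F1:   ", f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print("Micro F1:   ", f1_score(y_true, y_pred, average="micro"))     # pools TP/FP/FN across classes
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))  # weights classes by support
```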

Conclusion

The F1 Score is a crucial metric in machine learning, especially when dealing with imbalanced datasets or when false positives and negatives carry significant consequences. Its ability to balance precision and recall makes it indispensable in medical diagnostics and fraud detection.

The MIT IDSS Data Science and Machine Learning program offers comprehensive training for professionals to deepen their understanding of such metrics and their applications. 

This 12-week online course, developed by MIT faculty, covers essential topics including predictive analytics, model evaluation, and real-world case studies, equipping participants with the skills to make informed, data-driven decisions.
