
Illuminating Intelligence - ML

Assessing Predictive Power for Major Power Outages in the U.S.

Data source: data

My exploratory data analysis on this dataset can be found here.

Framing the Problem

Problem Identification

Prediction problem: Predict the severity of an outage (classes: 1-minor, 2-moderate, 3-major, 4-severe)

Type: Multiclass Classification

Response variable: Four severity categories - category name (label): minor (1), moderate (2), major (3), severe (4)

Evaluation metric: Matthews’ correlation coefficient
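To make the evaluation metric concrete, here is a minimal sketch of the multiclass Matthews correlation coefficient computed directly from label counts. The function name `multiclass_mcc` and the toy labels are illustrative assumptions, not part of the original analysis. In practice a library routine such as scikit-learn's `matthews_corrcoef` would normally be used instead.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from two label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                    # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))    # correctly predicted samples
    t_k = Counter(y_true)                              # times each class truly occurs
    p_k = Counter(y_pred)                              # times each class is predicted
    numerator = c * s - sum(t_k[k] * p_k[k] for k in t_k)
    denominator = sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    # Convention: return 0.0 when the coefficient is undefined (zero denominator)
    return numerator / denominator if denominator else 0.0


# Toy severity labels (1-minor ... 4-severe), for illustration only
y_true = [1, 2, 3, 4, 4, 4, 2, 1]
y_pred = [1, 2, 4, 4, 4, 3, 2, 1]
print(round(multiclass_mcc(y_true, y_pred), 3))
```

A value of 1 means perfect agreement, 0 means no better than chance, and negative values mean systematic disagreement.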

Exploratory Analysis for Feature Engineering

Determining appropriate thresholds for outage severity (target variable)

Figure 1

Figure 1 shows a boxplot of the number of customers affected per outage. The observed ranges are used to set the severity thresholds in terms of customers affected.

Figure 2

Figure 2 shows a boxplot of the value ranges for outage duration in hours. The observed ranges are used to determine outage severity in terms of outage duration.
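One possible way to turn boxplot ranges into four severity labels is sketched below. The report does not state the actual thresholds or how the two attributes are combined, so everything here is an assumption made for illustration: the quartile cut points, the helper name `severity_from_quartiles`, the rule of taking the worse of the two dimensions, and the sample numbers.

```python
from bisect import bisect_left
from statistics import quantiles


def severity_from_quartiles(values):
    """Label each value 1-4 by the quartile range it falls into."""
    cuts = quantiles(values, n=4)        # three cut points: Q1, median, Q3
    # Severity is 1 plus the number of cut points strictly below the value
    return [1 + bisect_left(cuts, v) for v in values]


# Made-up numbers, for illustration only
customers_affected = [1_000, 20_000, 50_000, 75_000, 120_000, 150_000, 300_000, 900_000]
duration_hours = [1, 3, 70, 12, 24, 48, 5, 200]

by_customers = severity_from_quartiles(customers_affected)
by_duration = severity_from_quartiles(duration_hours)

# Assumption: an outage is as severe as its worse dimension
severity = [max(c, d) for c, d in zip(by_customers, by_duration)]
print(severity)
```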

Determining appropriate transformations for learning features

Figure 3

Figure 3 shows the distribution of utility contribution (in %) to GSP, subset by severity. Since the distribution of every class appears to be centered at a point on the x-axis, binning the utility contribution percentages by 1-D k-means clusters could capture the individual pattern of each severity class.
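The 1-D k-means binning described above can be sketched with scikit-learn's `KBinsDiscretizer`. The synthetic percentages, the choice of three bins, and the variable names are assumptions for illustration. The report does not state the number of bins actually used.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# Synthetic utility-contribution percentages clustered around three centers
util_pct = np.concatenate([
    rng.normal(1.0, 0.2, 50),
    rng.normal(3.0, 0.2, 50),
    rng.normal(6.0, 0.2, 50),
]).reshape(-1, 1)        # scikit-learn expects a 2-D (n_samples, n_features) array

# strategy="kmeans" places bin edges between the centers of 1-D k-means clusters
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
util_bin = binner.fit_transform(util_pct)

print(binner.bin_edges_[0])      # learned bin edges for the single feature
print(np.unique(util_bin))       # ordinal bin ids
```

Because the edges follow the cluster centers rather than fixed widths, values near the same center end up in the same bin even when the centers are unevenly spaced.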

Figure 4

Figure 4 shows the distribution of the annual number of customers served in the U.S. state where an outage occurred, subset by severity. The composition of severity classes fluctuates and changes significantly at different levels of customers served, with somewhat similar patterns every 3-4 bins. This suggests that a uniform discretization (binning) strategy could group similar data points to reduce noise while emphasizing relevant patterns across the groups.
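Uniform (equal-width) discretization can be sketched in a few lines. The helper name `uniform_bins`, the choice of five bins, and the sample figures are illustrative assumptions rather than the values used in the analysis.

```python
def uniform_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based ids)."""
    lo, hi = min(values), max(values)
    if hi == lo:                     # constant column: everything falls in bin 0
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up annual customers-served figures, for illustration only
customers_served = [0.5e6, 1.2e6, 2.0e6, 4.8e6, 7.5e6, 10.0e6, 15.5e6]
print(uniform_bins(customers_served, n_bins=5))
```

Unlike k-means binning, the edges here depend only on the minimum and maximum, so sparsely populated ranges still get their own bins.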

Figure 5

Figure 5 shows the distribution of anomaly levels, subset by severity. Interestingly, the distributions across the classes already change gradually across the anomaly levels, except for a spike in class 4 (severe) for levels above 1.5. To avoid losing information through unnecessary transformations, the anomaly level attribute is kept in its original form.

Baseline Model

Preliminary Description

Model: Random Forest Classifier

Features - name (type, transformation):

Figure 6

Figure 6 shows the confusion matrix of the baseline model. Importantly, it emphasizes the class imbalance present in the dataset through the high number of true positives and true negatives for the severe class.

Figure 7

Figure 7 contains one ROC curve for each class, generated independently using the One-vs-Rest multiclass mechanism on the test set. The model has good class-specific performance, with AUCs consistently above 0.71 for every class, indicating a good capacity to distinguish any given class from the rest.

Detailed Description & Performance of Baseline Model

Model: Random Forest Classifier



Overall: The baseline model achieved a training Matthews correlation coefficient (MCC) of 0.74, which indicates a strong correlation between actual and predicted labels. However, the model only achieved a testing MCC of 0.31. Unlike an accuracy of 0.31, a correlation of 0.31 indicates fair predictive power, since it reflects a moderate degree of agreement between predicted and actual labels. Nonetheless, the observed drop in performance is indicative of overfitting: the model is not generalizing well to unseen data. Similarly, the baseline model achieved 81% training accuracy but only 51% testing accuracy, which is also indicative of overfitting and an inability to generalize to unseen data.

Addressing class imbalance: On inspection, there is a subtle class imbalance, with class 4 (severe) representing 42% of labels in the train set and 41% of labels in the test set. The MCC is therefore an insightful performance indicator, because its calculation accounts for both false positives and false negatives. In our multiclass context, the contrast among macro, micro, and weighted average MCC is also important for understanding how the model performs for each class. Since the weighted MCC accounts for class frequencies and uses weights to balance each class’s influence on the final metric value, a weighted MCC higher than the macro and micro MCCs indicates the model is performing well across all classes. Granted, our class imbalance was subtle to begin with.

Generalization capacity: Although the train and test performance metrics above suggest the model is not generalizing well to unseen data, a close analysis of the per-class Receiver Operating Characteristic (ROC) curves, generated on the test set using the One-vs-Rest mechanism, reveals that the model is actually performing relatively well on unseen data. With areas under the ROC curve ranging from 0.72 to 0.81 across all classes, the model shows a good capacity to distinguish any given class from the rest, which is a promising sign for its applicability to unseen data.
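To illustrate what a One-vs-Rest AUC measures, the sketch below computes it for a single class by pairwise score comparison, which is equivalent to the area under that class's ROC curve. The function name, the toy labels, and the predicted probabilities are assumptions for illustration. A library routine such as scikit-learn's `roc_auc_score` would normally be used on real model output.

```python
def one_vs_rest_auc(y_true, scores, positive_class):
    """AUC for one class against all others, via pairwise score comparison.

    Equals the probability that a random positive sample is scored higher
    than a random negative sample (ties count as one half).
    """
    pos = [s for y, s in zip(y_true, scores) if y == positive_class]
    neg = [s for y, s in zip(y_true, scores) if y != positive_class]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and hypothetical predicted probabilities for class 4 (severe)
y_true = [4, 4, 1, 2, 3, 4, 2, 1]
p_severe = [0.9, 0.6, 0.2, 0.4, 0.7, 0.8, 0.1, 0.3]
print(one_vs_rest_auc(y_true, p_severe, positive_class=4))
```

Repeating the call once per class, each time with that class's predicted probabilities, gives the per-class AUCs. Note that AUC is computed from ranked scores rather than hard labels, so it can look healthy even when accuracy and MCC on hard predictions drop.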

Conclusion (baseline): The model is acceptable in its ability to differentiate any given class (out of 4) from the rest, on both train and test sets. However, it overfits and therefore does not generalize well to unseen data.

Final Model

Preliminary Description

Model: Random Forest Classifier

Features from Baseline - name (type, transformation):

New Features - name (type, transformation):

Methodology: hyperparameters max tree depth, n, and m are determined by grid search
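A grid search of this kind might look like the sketch below, which uses scikit-learn's `GridSearchCV` with MCC as the scoring function and 5-fold cross-validation. The data are synthetic and the grid values are hypothetical. Reading "n" as `n_estimators` and "m" as `max_features` is an assumption, since the report does not define them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 4-class data standing in for the engineered outage features
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; parameter meanings for "n" and "m" are assumed
param_grid = {
    "max_depth": [3, 6, None],
    "n_estimators": [25, 50],       # assumed meaning of n
    "max_features": [2, "sqrt"],    # assumed meaning of m
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=make_scorer(matthews_corrcoef),   # select the configuration by MCC
    cv=5,
)
search.fit(X_train, y_train)

test_mcc = matthews_corrcoef(y_test, search.predict(X_test))
print(search.best_params_)
print(round(search.best_score_, 3), round(test_mcc, 3))
```

Comparing the cross-validated score with the held-out test score gives a rough check on how far the tuned model still overfits.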

Figure 8

Figure 8 shows the confusion matrix of the final model. Similar to baseline, the higher values along the diagonal indicate the model is consistently making correct predictions for each class. Conversely, several high values off the diagonal indicate those are instances where the model mistakenly predicted an incorrect class for some true label.

Figure 9

Figure 9 shows the ROC curves for each class. Similar to baseline, the curves, and their areas, indicate strong class-specific performance for all classes, specifically the model’s ability to distinguish any given class from the rest.

Detailed Description & Performance of Final Model

Model: Random Forest Classifier



Across both performance metrics, accuracy and MCC, the final model is a slight improvement over the baseline model, though both models still do not generalize well to unseen data, as is evident from the heavy performance drops when evaluated on the test set. Interestingly, the final model retains virtually every advantage of the baseline model, including good performance across all classes (weighted MCC > micro and macro MCC) and good class-specific performance. The latter is shown by the test ROC curves, whose areas under the curve are consistently high, indicating the model’s ability to discern any given class from the rest. Nonetheless, there is still a high degree of overfitting in the final model that is causing poor performance on unseen data.

Improvements from Baseline:

1) Higher MCC and Accuracy: The final model consistently achieves higher performance metrics than the baseline model. This is particularly significant for MCC, which considers both class-specific performance (macro MCC) and class frequencies (weighted MCC), making it more robust and reliable given our slightly imbalanced dataset.

2) Stability: The final model was tuned through grid search with 5-fold cross-validation, so it has been tested much more thoroughly across different samples of the dataset and across configurations. Its performance results are therefore likely to be more stable and reliable than those of the baseline.

Fairness Analysis

Fairness Analysis Description

Purpose: Assess how the final model performs with respect to two groups, that is, how “fairly” it generates predictions for the two groups.

Group 1: Outages in states serving higher annual numbers of customers

Group 2: Outages in states serving lower annual numbers of customers

Null Hypothesis: The final model is fair. Its Matthews correlation coefficients for Group 1 and Group 2 are approximately the same, and any observed difference is due to chance.

Alternative Hypothesis: The final model is unfair. Its Matthews correlation coefficient for Group 1 is lower than its MCC for Group 2, and the observed difference is unlikely to have occurred by chance alone.


Rationale: The two groups were chosen because the number of customers a state serves is a relevant distinction when comparing outages. Testing whether the final model performs better or worse for one group than the other, or comparably for both, lets us assess how “fair” the model is toward the groups.

Figure 10

Figure 10 shows the distribution of annual total customers served in the U.S. state where an outage occurred, marked with three plausible thresholds for separating low from high values: the median, the mean, and the 75th percentile. Because the data are right-skewed and the median splits the observations into two equal halves, the median is the chosen threshold.

Figure 11

Figure 11 shows the empirical distribution of the difference in MCC under the null hypothesis that the final model is equally fair in its predictions for outages in states serving lower numbers of customers and in states serving higher numbers of customers. With a p-value of approximately 0.35, the observed difference of -0.03 could reasonably have occurred by chance.
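The permutation test behind a figure like this can be sketched as follows. The data are synthetic, and the helper names, the 1,000 permutations, and the seed are assumptions for illustration. The group split uses the median of customers served, and the p-value is one-sided to match the alternative hypothesis that Group 1 (higher) has the lower MCC.

```python
import random
from collections import Counter
from math import sqrt
from statistics import median


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (0.0 when undefined)."""
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k, p_k = Counter(y_true), Counter(y_pred)
    num = c * s - sum(t_k[k] * p_k[k] for k in t_k)
    den = sqrt((s * s - sum(v * v for v in p_k.values()))
               * (s * s - sum(v * v for v in t_k.values())))
    return num / den if den else 0.0


def mcc_gap(y_true, y_pred, is_high):
    """MCC of the high-customers group minus MCC of the low-customers group."""
    hi = [(t, p) for t, p, g in zip(y_true, y_pred, is_high) if g]
    lo = [(t, p) for t, p, g in zip(y_true, y_pred, is_high) if not g]
    return mcc(*zip(*hi)) - mcc(*zip(*lo))


def permutation_p_value(y_true, y_pred, is_high, n_perm=1000, seed=0):
    """One-sided p-value: share of shuffled gaps at or below the observed gap."""
    rng = random.Random(seed)
    observed = mcc_gap(y_true, y_pred, is_high)
    labels = list(is_high)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)          # break any link between group and outcome
        hits += mcc_gap(y_true, y_pred, labels) <= observed
    return observed, hits / n_perm


# Synthetic stand-in data: labels, noisy predictions, customers served per state
rng = random.Random(1)
y_true = [rng.randint(1, 4) for _ in range(200)]
y_pred = [t if rng.random() < 0.6 else rng.randint(1, 4) for t in y_true]
customers_served = [rng.uniform(0.5e6, 15e6) for _ in range(200)]

threshold = median(customers_served)          # median split into low vs. high
is_high = [c > threshold for c in customers_served]

observed, p_value = permutation_p_value(y_true, y_pred, is_high)
print(round(observed, 3), p_value)
```

Shuffling only the group labels keeps each outage's true and predicted severity paired, so the simulated differences reflect what group membership alone would produce under the null hypothesis.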

Fairness Analysis Conclusion

P-value: 0.35

Conclusion: Since our p-value of 0.35 is well above our significance level of 0.01, we fail to reject the null hypothesis and conclude that a difference in MCC as large as the one observed could reasonably arise by chance. The observed difference in MCC of roughly -0.03 is small and not statistically significant, so our final model is likely to have achieved MCC parity.