Illuminating Cognizance

Project - A Comprehensive Look Into Major Power Outages in the U.S.

My predictive analysis on this dataset can be found here.

Introduction

About the Data:

The subject dataset contains information on major power outages in the continental U.S.

Number of observations: 1534 rows/outages
Relevant columns/attributes:
- U.S._STATE: Represents all the states in the continental U.S.
- POSTAL.CODE (or STATE.ABBR): Represents the postal code of the U.S. states
- ANOMALY.LEVEL: This represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season. It is estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region (5°N to 5°S, 120–170°W)
- OUTAGE.START.DATE: This variable indicates the day of the year when the outage event started (as reported by the corresponding Utility in the region)
- OUTAGE.START.TIME: This variable indicates the time of the day when the outage event started (as reported by the corresponding Utility in the region)
- OUTAGE.RESTORATION.DATE: This variable indicates the day of the year when power was restored to all the customers (as reported by the corresponding Utility in the region)
- OUTAGE.RESTORATION.TIME: This variable indicates the time of the day when power was restored to all the customers (as reported by the corresponding Utility in the region)
- CAUSE.CATEGORY: Categories of all the events causing the major power outages
- CAUSE.CATEGORY.DETAIL: Detailed description of the event categories causing the major power outages
- OUTAGE.DURATION: Duration of outage events (in minutes)
- DEMAND.LOSS.MW: Amount of peak demand lost during an outage event (in Megawatt) [but in many cases, total demand is reported]
- RES.PRICE: Monthly electricity price in the residential sector (cents/kilowatt-hour)
- PC.REALGSP.STATE: Per capita real gross state product (GSP) in the U.S. state (measured in 2009 chained U.S. dollars)
- PCT_WATER_INLAND: Percentage of inland water area in the U.S. state as compared to the overall inland water area in the continental U.S. (in %)

Source: data

Analysis Question and Significance

Main Question: Do power outage attributes, specifically start time, consumption information, cause category, and environmental anomaly level, seem to influence the nature of the outages, as measured by duration and missingness of peak demand?

If there are trends, are they statistically significant?

Significance: Analyzing the relationships between metrics of major power outages can provide data-driven insights that may lead to a better understanding of the factors that contribute to, and the outcomes of, outages. In the context of urban planning, understanding power outages can help planners and city officials coordinate preventive and response mechanisms that improve the lives of city inhabitants.

Cleaning and EDA

Data Cleaning

Combined OUTAGE.START.DATE and OUTAGE.START.TIME into one column OUTAGE.START, and combined OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME into one column OUTAGE.END
- For both the outage start and end, combining the date and time columns lets the values be stored as type datetime and leaves two fewer, yet more meaningful, columns that better reflect the data generating process, which is to record the date and time an outage occurred. Keeping date and time together in the datetime data type also permits a wider range of operations and plotting capabilities.
Converted outage duration from minutes to hours (DURATION.HR)
- This is partly a personal preference: converting numerical time data to a larger unit can simplify the EDA process because the numbers are smaller, which is often more intuitive and easier to read. The connection to the data generating process is unchanged, which is to record the length (duration) of an outage.
Dropped the columns OUTAGE.START.DATE, OUTAGE.RESTORATION.DATE, and OUTAGE.RESTORATION.TIME, which helps simplify the analysis and reduce memory use by keeping a smaller dataset. OUTAGE.START.TIME was not dropped because it is used in later steps.
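
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in the analysis. The sample rows, the date and time string layout (STAMP_FORMAT), and the helper name clean_row are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical raw rows; the dates and times are made up for illustration.
raw_rows = [
    {"OUTAGE.START.DATE": "2011-07-01", "OUTAGE.START.TIME": "5:00:00 PM",
     "OUTAGE.RESTORATION.DATE": "2011-07-03", "OUTAGE.RESTORATION.TIME": "8:00:00 PM",
     "OUTAGE.DURATION": 3060},  # minutes
    {"OUTAGE.START.DATE": "2014-05-11", "OUTAGE.START.TIME": "6:38:00 PM",
     "OUTAGE.RESTORATION.DATE": "2014-05-11", "OUTAGE.RESTORATION.TIME": "6:39:00 PM",
     "OUTAGE.DURATION": 1},  # minutes
]

# Assumed layout of the combined date and time strings.
STAMP_FORMAT = "%Y-%m-%d %I:%M:%S %p"

# Columns removed after combining; OUTAGE.START.TIME is kept for later steps.
DROPPED = ("OUTAGE.START.DATE", "OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME")


def clean_row(row):
    """Return a copy of row with combined timestamps and duration in hours."""
    out = {key: value for key, value in row.items() if key not in DROPPED}
    out["OUTAGE.START"] = datetime.strptime(
        f'{row["OUTAGE.START.DATE"]} {row["OUTAGE.START.TIME"]}', STAMP_FORMAT)
    out["OUTAGE.END"] = datetime.strptime(
        f'{row["OUTAGE.RESTORATION.DATE"]} {row["OUTAGE.RESTORATION.TIME"]}', STAMP_FORMAT)
    out["DURATION.HR"] = row["OUTAGE.DURATION"] / 60  # minutes to hours
    return out


cleaned = [clean_row(row) for row in raw_rows]
```

With date and time held in one datetime value, differences such as OUTAGE.END minus OUTAGE.START can be computed directly, which is one of the operations the combined columns make easier.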

Data snippet:

U.S._STATE	POSTAL.CODE	ANOMALY.LEVEL	OUTAGE.START.TIME	CAUSE.CATEGORY	CAUSE.CATEGORY.DETAIL	DEMAND.LOSS.MW	RES.PRICE	PC.REALGSP.STATE	PCT_WATER_INLAND	DURATION.HR
Minnesota	MN	-0.3	5:00:00 PM	severe weather	nan	nan	11.6	51268	5.47874	51
Minnesota	MN	-0.1	6:38:00 PM	intentional attack	vandalism	nan	12.12	53499	5.47874	0.0166667
Minnesota	MN	-1.5	8:00:00 PM	severe weather	heavy wind	nan	10.87	50447	5.47874	50
Minnesota	MN	-0.1	4:30:00 AM	severe weather	thunderstorm	nan	11.79	51598	5.47874	42.5
Minnesota	MN	1.2	2:00:00 AM	severe weather	nan	250	13.07	54431	5.47874	29

Univariate Analysis

Figure 1

Figure 1 shows the distribution of outage duration in hours. Outages that lasted under 5 hours are the most common at 38.8%, followed by outages that lasted between 5 and 15 hours at 14.1%. The histogram then gradually flattens out, with decreasing proportions as duration increases.

Figure 2

Figure 2 shows the distribution of outage cause categories. Outages caused by severe weather are the most common at 49.7%, followed by outages caused by intentional attacks and system operability disruptions, at 27.2% and 8.3%, respectively. The least common cause of outages is islanding at 2.9%.

Bivariate Analysis

Figure 3

Figure 3 shows a scatterplot between outage duration in hours and monthly residential electricity price. Setting outliers aside, there is no apparent trend. The majority of outages are less than 200 hours in duration, with monthly residential electricity prices ranging from 6 to 20 cents per kilowatt-hour.

Figure 4

Figure 4 shows a scatterplot between outage duration in hours and per capita real gross state product. Setting outliers aside, there is no apparent trend. As in the previous plot, the majority of outages are less than 200 hours in duration, with per capita real gross state product ranging from 31K to 65K U.S. dollars.

Interesting Aggregates

Pivot table snippet

U.S._STATE	equipment failure	fuel supply emergency	intentional attack	islanding	public appeal	severe weather	system operability disruption
Florida	9.24167	nan	0.833333	nan	72	107.003	3.42833
California	8.74683	102.577	15.7743	3.58095	33.8019	48.8062	6.06111
Kentucky	10.8667	209.5	1.8	nan	nan	74.6685	nan
Texas	6.76	232	4.97949	nan	19.0069	64.2482	13.5133
Indiana	0.0166667	204	7.03125	2.08889	nan	75.3882	77.86

The pivot table above shows the average outage duration aggregated by state and cause category. When visualized, the data facilitates a comparison of outages by cause category among states, particularly which cause categories are prevalent in which states and which are not. Knowing this can help planners channel investment and effort toward the most appropriate solutions.
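
As a rough sketch of how such a table can be built, the following groups durations by state and cause category and averages each group. It uses only the standard library. The records and the function name pivot_mean are hypothetical, and a missing combination is represented by None where the snippet above shows nan.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (state, cause category, duration in hours) records.
records = [
    ("Texas", "severe weather", 60.0),
    ("Texas", "severe weather", 70.0),
    ("Texas", "equipment failure", 6.0),
    ("Florida", "severe weather", 100.0),
]


def pivot_mean(rows):
    """Average duration for every (state, cause) pair; absent pairs are None."""
    groups = defaultdict(list)
    for state, cause, hours in rows:
        groups[(state, cause)].append(hours)
    states = sorted({state for state, _, _ in rows})
    causes = sorted({cause for _, cause, _ in rows})
    return {
        state: {
            cause: mean(groups[(state, cause)]) if (state, cause) in groups else None
            for cause in causes
        }
        for state in states
    }


table = pivot_mean(records)
```

Each outer key is a state and each inner key is a cause category, mirroring the rows and columns of the pivot table.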

Interesting multivariate plots

Figure 5

Figure 5 shows average outage duration aggregated by state and cause categories. Looking at Michigan for example, equipment failure has the greatest average outage duration at approximately 440.6 hours, followed by severe weather at 80.5 hours. Notably, severe weather seems to be a common cause in virtually every state.

Figure 6

Figure 6 shows average anomaly level by cause category, subsetted by cause detail. As mentioned, severe weather is the most common cause category by a wide margin, while “public appeal” has an unusually high average ONI value at 2.3.

Assessment of Missingness

NMAR Analysis

The DEMAND.LOSS.MW column could be NMAR (Not Missing At Random). Outages that occurred near the fringes of, or completely outside, peak demand hours (4 PM - 9 PM) are more likely to lack demand lost data than outages that spanned a full peak demand window. In addition, low peak demand lost values may be omitted (not recorded) if considered trivial, so lower values of DEMAND.LOSS.MW are more likely to be missing (NMAR: missingness depends on the values themselves).

Potential additional data: Knowing the specific peak demand hours window in each outage region could explain the missingness of peak demand lost data, making it MAR under the rationale above. The general consensus for the peak electricity demand window is 4 PM - 9 PM; however, this window may vary from place to place.

Missingness Dependency

DEMAND.LOSS.MW could possibly depend on outage start time

Rationale: A short outage that began outside of, or further from, high-demand hours (4 PM - 9 PM) is likely to not have data for DEMAND.LOSS.MW.

DEMAND.LOSS.MW is likely to not depend on PCT_WATER_INLAND

Recall: PCT_WATER_INLAND ~ percentage of inland water area in the U.S. state as compared to the overall inland water area in the continental U.S. (in %).

Dependency Test 1 (start times)

Figure 7

Figure 7 shows the distribution of outage start times by whether peak demand lost was missing. The median start time of outages with peak demand lost recorded is later, and closer to the peak demand window, than the median start time of outages where it is missing; this observation agrees with the rationale above.

Null Hypothesis: The start times between outages where the amount of peak demand lost is missing, and outages where the amount of peak demand lost is not missing, have the same distribution. Any observed difference is due to chance alone.

Alternative Hypothesis: The outage start times by missingness of peak demand lost have different distributions. The observed difference is unlikely due to chance alone.

Test statistic: difference in group median start time (in seconds)

The median is the appropriate measure of central tendency in this case because 1) we are interested in the peak demand window, and the median lies closer to it, and 2) the distribution of outage start times is not normal, so the mean is pulled towards outlying start times.

Significance level: 5%

Method: shuffle DEMAND.MISSING (status of missing) column to simulate under null hypothesis

Figure 8

Figure 8 shows the simulated test statistics, differences of median outage start times in seconds, including the observed difference and the 5% significance level. As shown, our observation lies to the left of the significance level.

Result: P-value = 0.1796, Fail to reject the null hypothesis at a 5% significance level

There is not sufficient evidence that the missingness of peak demand lost depends on outage start times.
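
The shuffling procedure in this test can be sketched as below, using only the standard library. This is an illustration under stated assumptions rather than the code behind the reported p-value: the statistic is taken as the absolute difference in group medians (the source does not say whether the signed or absolute difference was used), the names median_diff and permutation_p_value are hypothetical, and the start times and missingness flags are made-up toy values.

```python
import random
from statistics import median


def median_diff(values, is_missing):
    """Absolute difference between the medians of the two missingness groups."""
    missing = [v for v, flag in zip(values, is_missing) if flag]
    present = [v for v, flag in zip(values, is_missing) if not flag]
    return abs(median(missing) - median(present))


def permutation_p_value(values, is_missing, n_reps=2000, seed=0):
    """Share of shuffled statistics at least as large as the observed one."""
    rng = random.Random(seed)
    observed = median_diff(values, is_missing)
    labels = list(is_missing)
    hits = 0
    for _ in range(n_reps):
        rng.shuffle(labels)  # break any link between missingness and start time
        if median_diff(values, labels) >= observed:
            hits += 1
    return observed, hits / n_reps


# Hypothetical start times in seconds since midnight, not the real data.
start_seconds = [3600, 7200, 18000, 30600, 43200, 50400, 61200, 66480, 72000, 79200]
demand_missing = [True, True, True, False, True, False, False, True, False, False]

observed, p_value = permutation_p_value(start_seconds, demand_missing)
```

The null hypothesis is rejected only when p_value falls below the chosen significance level of 0.05. Shuffling the labels keeps both group sizes fixed, so only the pairing between start time and missingness changes from one repetition to the next.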

Dependency Test 2 (state water percentage)

Figure 9

(Note: ignore the “mean” annotation on the boxplots; the mean vertical line only applies to the histogram.)

Figure 9 shows the distribution of state water percentage (PCT_WATER_INLAND) by whether peak demand lost is missing. The two distributions have the same median, and the mean is significantly biased towards outlier states with high proportions of inland water. Although the distributions are not exactly normal, the shared median indicates that they are centered at the same location. A difference in medians would therefore detect nothing, so the Kolmogorov-Smirnov statistic, which is also sensitive to differences in shape, may be appropriate.

Figure 10

Figure 10 shows the cumulative distribution functions of the two distributions above. The greatest difference seems to be at roughly 1.7 state water percentage.

Null Hypothesis: The distribution of state inland water percentage in outages where peak demand lost is missing is the same as in outages where peak demand lost is not missing. Any observed difference is due to chance alone.

Alternative Hypothesis: The distributions of state inland water percentage between the two groups are different. The observed difference is unlikely due to chance alone.

Test statistic: K-S statistic

Significance level: 5%

Method: shuffle DEMAND.MISSING (status of missing) column to simulate under null hypothesis.

Figure 11

Figure 11 shows the empirical distribution of the K-S Statistic as previously described. Our observed K-S Statistic is roughly 0.084 and lies to the right of the 5% significance level.

Result: P-value = 0.0036, Reject the null hypothesis at a 5% significance level

Missingness of peak demand lost could possibly depend on the state's proportion of inland water relative to the continental U.S. (possibly MAR).
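
For reference, the K-S statistic is the largest vertical gap between the two empirical cumulative distribution functions, which is the quantity read off Figure 10. A minimal standard-library sketch of the statistic and the shuffling procedure follows. The function names and the toy water percentages are hypothetical, and this is not the code that produced the reported values.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


def ks_permutation_p_value(values, is_missing, n_reps=1000, seed=0):
    """Share of shuffled K-S statistics at least as large as the observed one."""
    rng = random.Random(seed)

    def stat(labels):
        missing = [v for v, flag in zip(values, labels) if flag]
        present = [v for v, flag in zip(values, labels) if not flag]
        return ks_statistic(missing, present)

    observed = stat(is_missing)
    labels = list(is_missing)
    hits = 0
    for _ in range(n_reps):
        rng.shuffle(labels)  # simulate missingness unrelated to water percentage
        if stat(labels) >= observed:
            hits += 1
    return observed, hits / n_reps


# Hypothetical PCT_WATER_INLAND values and missingness flags, not the real data.
water_pct = [0.6, 1.1, 1.7, 2.3, 3.1, 5.5, 1.7, 2.9, 4.0, 0.9]
demand_missing = [True, False, True, True, False, False, True, False, True, False]

observed_ks, p_value = ks_permutation_p_value(water_pct, demand_missing)
```

Because the statistic is a maximum over the whole range of values, it can pick up differences between two groups even when their medians coincide.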

Hypothesis Testing

General purpose: Test the observed distribution of anomaly level by cause category

Selected Columns: ANOMALY.LEVEL and CAUSE.CATEGORY

Preliminary Inspections

Figure 12

Figure 12 indicates that the observed distribution of anomaly levels is non-normal. Since the distribution is right-skewed, the mean is biased towards higher anomaly levels.

Figure 13

Figure 13 shows the observed distribution of anomaly levels at a more granular level, that is for each cause category. As observed in Figure 12, besides equipment failure (normal with a slight mean deviation) and fuel supply emergency (normal with a stretch), the distributions are non-normal.

Conclusion from preliminary inspection:

A non-parametric test (permutation) seems appropriate because the observed distribution of anomaly levels is non-normal.
Significance: In relation to the initial main question, we are interested in assessing the relationship between the outage attributes. A good assessment of the anomaly levels is whether or not extreme values of the ONI index are associated with a different mix of outage cause categories than regular values.
- Details of which values are considered extreme are in the test description below.

Test Begin

Null Hypothesis: The distribution of cause categories among outages with extreme anomaly levels is the same as among outages with regular anomaly levels. Any observed difference is due to chance alone.

Alternative Hypothesis: The distributions of outage cause categories between the two groups are different. The observed difference is unlikely due to chance alone.

Test statistic: Total Variation Distance

The TVD is appropriate because it permits a direct comparison of the discrepancy in cause category distributions between extreme versus regular anomaly values, and the subject distribution is categorical.

Significance level: 5%

Method: shuffle the binary column that indicates whether the associated anomaly level is extreme to simulate under null hypothesis

Anomaly levels above 0.5 or below -0.5 are considered extreme and are indicative of El Niño and La Niña, respectively.

Figure 14

Figure 14 shows the empirical distribution of the simulated TVDs, in which our observed TVD lies to the right of the 5% significance level.

Conclusion: P-value = 0.0014, Reject the null hypothesis at a 5% significance level.

The distribution of cause categories in outages with extreme anomaly levels is statistically significantly different from that in outages with regular anomaly levels.
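
The test above can be sketched as follows with the standard library alone. The TVD between two categorical distributions is half the sum of the absolute differences in category proportions. The function names are hypothetical, the anomaly levels and cause categories are toy values, and the strict inequality at the 0.5 threshold follows the wording "above 0.5 or below -0.5". This is an illustration, not the code that produced the reported p-value.

```python
import random
from collections import Counter


def is_extreme(anomaly_level, threshold=0.5):
    """True when the ONI value is above +threshold or below -threshold."""
    return abs(anomaly_level) > threshold


def tvd(categories_a, categories_b):
    """Total variation distance between two categorical distributions."""
    freq_a, freq_b = Counter(categories_a), Counter(categories_b)
    n_a, n_b = len(categories_a), len(categories_b)
    all_categories = set(freq_a) | set(freq_b)
    return sum(abs(freq_a[c] / n_a - freq_b[c] / n_b) for c in all_categories) / 2


def tvd_permutation_p_value(categories, extreme_flags, n_reps=1000, seed=0):
    """Share of shuffled TVDs at least as large as the observed TVD."""
    rng = random.Random(seed)

    def stat(flags):
        extreme = [c for c, flag in zip(categories, flags) if flag]
        regular = [c for c, flag in zip(categories, flags) if not flag]
        return tvd(extreme, regular)

    observed = stat(extreme_flags)
    flags = list(extreme_flags)
    hits = 0
    for _ in range(n_reps):
        rng.shuffle(flags)  # simulate cause category unrelated to the extreme flag
        if stat(flags) >= observed:
            hits += 1
    return observed, hits / n_reps


# Toy anomaly levels and cause categories for illustration only.
anomaly = [-0.3, -0.1, -1.5, -0.1, 1.2, 0.7, -0.9, 0.2]
causes = ["severe weather", "intentional attack", "severe weather", "severe weather",
          "severe weather", "public appeal", "severe weather", "intentional attack"]

flags = [is_extreme(level) for level in anomaly]
observed_tvd, p_value = tvd_permutation_p_value(causes, flags)
```

A TVD of 0 means the two groups have identical category proportions, and a TVD of 1 means they share no categories at all.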

Thank you!