Machine Learning For Energy Theft Detection

Goddy Tams George
7 min read · Dec 21, 2020

Introduction

Estimates suggest that Non-Technical Loss (NTL) in a sub-Saharan power distribution company may reach 40% of the total energy received from the grid. Industrial NTL detection systems still rely largely on expert knowledge to decide whether to carry out costly on-site inspections of customers.

With millions of prepaid meters soon to be deployed to most households through the National Mass Metering Programme, power distribution companies would have to find novel ways to deal with the challenges of meter bypass and energy theft as a whole. This project pushes power distribution companies to move to large-scale deployments of automated systems that learn NTL profiles from data.

Machine learning has entered our everyday lives, from online communications to the recommendations on your favourite streaming site. When it comes to energy theft detection, too, machine learning has proved potentially very effective.

In this project, I used a novel system that combines automated statistical decision-making with expert knowledge. Data was sourced from Blue Reeva Energy, a hypothetical power distribution company in Rwanda. First, I deployed a machine learning framework that classifies customers by their likelihood of meter bypass, using a variety of features derived from the customers’ consumption data. The methodology is tailored to the level of noise in the data. Second, to allow human experts to feed their knowledge into the decision loop, I built an app using R’s Shiny web framework to output prediction results at various granularity levels.
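To make the Shiny output layer concrete, here is a minimal sketch of what such an app could look like. The data frame `predictions` and its columns `CustomerID` and `BypassProbability` are illustrative assumptions, not names from the actual dataset.

```r
# Minimal sketch of the Shiny prediction dashboard (assumed column
# names: CustomerID, BypassProbability -- not from the original data).
predictions <- data.frame(
  CustomerID = c("C001", "C002", "C003"),
  BypassProbability = c(0.91, 0.12, 0.67)
)

# Rank customers from highest to lowest bypass risk
ranked <- predictions[order(-predictions$BypassProbability), ]

if (requireNamespace("shiny", quietly = TRUE)) {
  library(shiny)
  ui <- fluidPage(
    titlePanel("Meter Bypass Risk"),
    tableOutput("riskTable")
  )
  server <- function(input, output) {
    output$riskTable <- renderTable(ranked)
  }
  # shinyApp(ui, server)  # uncomment to launch locally
}
```

Ranking by probability lets managers work down the list and inspect the riskiest customers first.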

Inverse behaviour between inspections and energy theft

From the chart above, we can see theft drop as inspections occur. One striking observation is November, where theft settled at a new average: physical inspections alone are not effective at detecting and curtailing electricity theft. Machine learning offers the capability to classify customers by their probability of meter bypass, learning patterns from their consumption data and from customers previously caught bypassing.
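The inverse relationship in the chart can be checked numerically with a simple correlation. The monthly figures below are illustrative, not the company's actual numbers.

```r
# Sketch of the inspection-vs-theft comparison behind the chart above
# (monthly figures are illustrative, not the company's actual data).
monthly <- data.frame(
  month = month.abb[7:12],
  inspections = c(20, 45, 60, 80, 85, 90),
  theft_cases = c(50, 42, 35, 28, 30, 33)
)

# A negative correlation reflects the inverse behaviour in the chart
r_it <- cor(monthly$inspections, monthly$theft_cases)
print(r_it)
```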

ggplot(df) +
  geom_bar(aes(x = Status, fill = Status), position = "dodge") +
  ggtitle("Bypass Vs No-Bypass")
Status of a meter bypass and no bypass

As the chart above shows, our hypothetical power distribution company has fewer no-bypass cases than bypass cases. Relative to Western countries, a typical power distribution company in sub-Saharan Africa bills far less energy than it receives as input, mainly because commercial losses are endemic. Read another way, the chart also suggests that physical investigations alone have not been effective at reducing the large number of bypasses.

Exploratory data analysis

To guide the analysis, I tried to answer the following questions about the bypass and no-bypass segments:

  1. Which tariff class is most likely to be bypassed?
  2. Which meter brand is most likely to be bypassed?
  3. Do customers with high availability bypass more than those with low availability?
  4. Do customers with high load bypass more than those with low load?
df %>%
  count(Status, Customer.tariff.as.per.bill) %>%
  filter(Status == "Bypass") %>%
  top_n(10) %>%
  ggplot(aes(reorder(Customer.tariff.as.per.bill, n), n, fill = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Tariffs with the Largest Bypass Cases") +
  xlab("Tariff")

As seen above, R2 and C1 are the most bypassed tariff classes. Next, we look at the most bypassed meter brands.

df %>%
  count(Status, Meter.Make) %>%
  filter(Status == "Bypass") %>%
  top_n(10) %>%
  ggplot(aes(reorder(Meter.Make, n), n, fill = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Meters with the Largest Bypass Cases") +
  xlab("Meter")
Status of meters with the highest bypass case

K-ELECTRIC meters are clearly the most bypassed meters in our hypothetical power distribution company. This brings us to the next question: do customers with high availability bypass the most?

df %>%
  count(Status, Availability) %>%
  filter(Status == "Bypass", Availability > 0) %>%
  top_n(10) %>%
  ggplot(aes(reorder(Availability, n), n, fill = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Availability with the Largest Bypass Cases") +
  xlab("Availability")
Bypass cases based on monthly availability

We can see that higher availability indicates a higher probability of bypass. The chart shows this may not always hold, and there are other causal variables we might want to consider, but let's leave it at this for now.

# Load
df %>%
  count(Status, Estimated.Connected.Load) %>%
  filter(Status == "Bypass", Estimated.Connected.Load > 0) %>%
  top_n(10) %>%
  ggplot(aes(reorder(Estimated.Connected.Load, n), n, fill = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Estimated Connected Load with the Largest Bypass Cases") +
  xlab("Estimated Connected Load")
Cases of meter bypass by the estimated connected load

This is not the ideal approach: I should have binned the load and derived the bypass frequency for each band. Still, it gives us a crude idea of the load levels that are most often bypassed.
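The binning idea just described can be sketched as follows. The load values, statuses and band cut-offs here are illustrative assumptions, not the actual data.

```r
# Sketch of the binning approach: group connected load into bands and
# compute the bypass rate per band (all values here are illustrative).
df_demo <- data.frame(
  Estimated.Connected.Load = c(2, 5, 9, 15, 22, 40, 3, 18, 27, 55),
  Status = c("Bypass", "No-Bypass", "Bypass", "Bypass", "Bypass",
             "No-Bypass", "No-Bypass", "Bypass", "Bypass", "Bypass")
)

# Bin the continuous load into low / medium / high bands
df_demo$LoadBand <- cut(
  df_demo$Estimated.Connected.Load,
  breaks = c(0, 10, 25, Inf),
  labels = c("low", "medium", "high")
)

# Share of bypass cases within each load band
bypass_rate <- tapply(df_demo$Status == "Bypass", df_demo$LoadBand, mean)
print(bypass_rate)
```

Comparing rates per band, rather than raw counts per exact load value, removes the distortion caused by some load values simply being more common.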

Machine Learning

numericVarName <- names(which(sapply(df, is.numeric)))
corr <- cor(df[, numericVarName], use = "pairwise.complete.obs")
ggcorrplot(corr, lab = TRUE)

The code above runs a correlation analysis to examine the relationships between the numeric variables in the dataset.

Feature Engineering

# Recode meter brands with bypass cases into high and low
high_bypass_meters <- c("ANALOGUE", "CONLOG", "DINRAIL", "EDMI", "EMCON",
                        "GENUS", "HEXING", "HOLLEY", "ITRON", "K-ELECTRIC",
                        "MOJEC", "TECNO", "L&G", "LANDIS", "MBH",
                        "UNISTAR", "SIEMENS")
df <- df %>%
  mutate(MeterBypass = case_when(
    Meter.Make %in% high_bypass_meters ~ "high",
    TRUE ~ "low"))
table(df$MeterBypass)

# Recode the tariffs with the highest bypass cases
df <- df %>%
  mutate(TariffBypass = case_when(
    Customer.tariff.as.per.bill %in% c("R2", "A1", "C1", "C2", "R3") ~ "high",
    TRUE ~ "low"))
table(df$TariffBypass)

Other engineered features have been omitted here deliberately.

# Bar plots of categorical variables
p1 <- ggplot(df, aes(x = MeterBypass)) + ggtitle("Meter") + xlab("Meter") +
  geom_bar(aes(y = 100 * (..count..) / sum(..count..)), width = 0.5) +
  ylab("Percentage") + coord_flip() + theme_minimal()
p2 <- ggplot(df, aes(x = FeederBypass)) + ggtitle("Feeder") + xlab("Feeder") +
  geom_bar(aes(y = 100 * (..count..) / sum(..count..)), width = 0.5) +
  ylab("Percentage") + coord_flip() + theme_minimal()
p3 <- ggplot(df, aes(x = TariffBypass)) + ggtitle("Tariff") + xlab("Tariff") +
  geom_bar(aes(y = 100 * (..count..) / sum(..count..)), width = 0.5) +
  ylab("Percentage") + coord_flip() + theme_minimal()
p4 <- ggplot(df, aes(x = Status)) + ggtitle("Status") + xlab("Status") +
  geom_bar(aes(y = 100 * (..count..) / sum(..count..)), width = 0.5) +
  ylab("Percentage") + coord_flip() + theme_minimal()
p5 <- ggplot(df, aes(x = group_cons)) + ggtitle("Consumption") + xlab("Consumption") +
  geom_bar(aes(y = 100 * (..count..) / sum(..count..)), width = 0.5) +
  ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)
Bar plots of categorical variables

Let’s try a simple feature selection with the chi-square test.

# Feature selection with chi-square
# Run the test on the categorical predictors only
cateVar <- ml %>%
  dplyr::select(-Status) %>%
  keep(is.factor)

chi.square <- vector()
p.value <- vector()
for (i in seq_along(cateVar)) {
  test <- chisq.test(ml$Status, cateVar[[i]], correct = FALSE)
  p.value[i] <- test$p.value
  chi.square[i] <- unname(test$statistic)
}

chi_square_test <- tibble(variable = names(cateVar)) %>%
  add_column(chi.square = chi.square) %>%
  add_column(p.value = p.value)
knitr::kable(chi_square_test)
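Once the table is built, a simple rule is to keep only variables whose p-value falls below a significance threshold. The variable names and p-values below are illustrative, not the actual test output.

```r
# Hedged sketch: keep variables whose chi-square p-value is below 0.05
# (names and p-values here are illustrative, not from the data).
chi_results <- data.frame(
  variable = c("MeterBypass", "TariffBypass", "FeederBypass"),
  p.value  = c(0.001, 0.030, 0.400)
)

# Variables significantly associated with bypass status
selected <- chi_results$variable[chi_results$p.value < 0.05]
print(selected)
```

A small p-value means the variable's distribution differs between the bypass and no-bypass groups, so it is worth keeping as a predictor.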

Splitting the data into train and test sets

# BUILDING THE MACHINE LEARNING MODEL
set.seed(2017)  # set the seed before partitioning so the split is reproducible
intrain <- createDataPartition(ml$Status, p = 0.75, list = FALSE)
training <- ml[intrain, ]
testing <- ml[-intrain, ]

Confirming the split is correct and making predictions


dim(training); dim(testing)

# Fitting the logistic regression model
LogModel <- glm(Status ~ ., data = training, family = binomial, maxit = 100)
print(summary(LogModel))

# Feature analysis
anova(LogModel, test = "Chisq")
head(testing)

# Assessing the predictive ability of the logistic regression model
testing$Status <- as.factor(testing$Status)

fitted.results <- predict(LogModel, newdata = testing, type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClassificError <- mean(fitted.results != testing$Status)
print(paste("Logistic Regression Accuracy", 1 - misClassificError))
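Accuracy alone can be misleading when bypass and no-bypass cases are imbalanced, so it helps to also inspect a confusion matrix. The 0/1 labels below are illustrative (1 = bypass), matching the thresholding step above.

```r
# Confusion matrix sketch with illustrative 0/1 labels (1 = bypass),
# matching the 0.5 thresholding step used above.
predicted <- c(1, 0, 1, 1, 0, 0, 1, 0)
actual    <- c(1, 0, 0, 1, 0, 1, 1, 0)

cm <- table(Predicted = predicted, Actual = actual)

# Precision: of the customers flagged as bypass, how many truly were
precision <- cm["1", "1"] / sum(cm["1", ])
# Recall: of the true bypass cases, how many the model caught
recall <- cm["1", "1"] / sum(cm[, "1"])

print(cm)
print(c(precision = precision, recall = recall))
```

For theft detection, recall is often the metric that matters most: a missed bypass case is lost revenue, while a false flag only costs one inspection.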

My next article will be about building the Shiny app that outputs our prediction results so that managers can make decisions with them. The app will output a list of customers with their bypass probability scores. Please follow me on my social media handles for other interesting stuff. Thank you for reading.

