Note. - This article has some improvements from Yana Kane-Esrig( https://www.linkedin.com/in/ykaneesrig/ ), mentioned in this article: http://laranikalranalytics.blogspot.com/2021/03/updated-using-r-and-h2o-to-identify.html
We will identify anomalous units on the production line by using measurements data from testing stations and Isolation Forest model. Anomalous products are not failures, anomalies are units close to measurement limits, so we can display maintenance warnings before the station starts to make scrap.
Before starting we need the next software installed and working:
- R language installed.
- H2O open source framework.
- Java 8 ( For H2O ). Open JDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
- R studio.
Get your data.
About the data: Since I cannot use my real data, for this article I am using SECOM Data Set from UCI Machine Learning Repository
I downloaded SECOM data to /tmp
How many records?:
Training data set - In my real project, I use 100 thousand test passed records, it is around a month of production data.
Testing data set - I use the last 24 hours of testing station data.
Note. On a real environment, get and process testing stations data one by one is the suggested approach.
Let's start coding:
# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
# Reading data file
setwd( "/tmp" )
allData = read.csv( "secom.data", sep = " ", header = FALSE, encoding = "UTF-8" )
################################################################################
# Dataset fixing, there are a lot of NaN records
################################################################################
for(i in 1:ncol(allData)) {
if( sum(is.na( allData[, i])) > 0 ) {
#cat( "Processing column:", i, ", number missing:", sum( is.na( allData[, i])), "\n" )
# These next two lines of code were left here for better understanding, the code may be one line.
temp_missing = is.na( allData[, i]) # Identify all missing(NAs) column values
ValidColumnValues = allData[,i][!temp_missing] # Store valid column values
# Set valid values to NAs found on this column with random values from sample function
allData [,i][temp_missing] = sample(ValidColumnValues, size = sum(temp_missing), replace = TRUE)
}
}
# spliting all data, the first 90% for training and the rest 10% for testing our model.
trainingData = allData[1:floor(dim(allData)[1]*.9),]
testingData = allData[(floor(dim(allData)[1]*.9)+1):dim(allData)[1],]
################################################################################
# Creating Anomaly Detection Model
################################################################################
h2o.init( nthreads = -1, max_mem_size = "5G", port = 3333 )
h2o.removeAll() ## Removes the data from the h2o cluster in preparation for our final model.
# Convert the training dataset to H2O format.
trainingData_hex = as.h2o( trainingData, destination_frame = "train_hex" )
# Build an Isolation forest model
trainingModel = h2o.isolationForest( training_frame = trainingData_hex
, sample_rate = 0.1
, max_depth = 32
, ntrees = 100
)
# According to H2O doc: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
# Isolation Forest is similar in principle to Random Forest and is built on the basis of decision trees.
# Isolation Forest creates multiple decision trees to isolate observations.
#
# Trees are split randomly, The assumption is that:
#
# IF ONE UNIT MEASUREMENTS ARE SIMILAR TO OTHERS,
# IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
#
# The less splits needed, the unit is more likely to be anomalous.
#
# The average number of splits is then used as a score.
# Calculate score for training dataset
score <- h2o.predict( trainingModel, trainingData_hex )
result_pred <- as.vector( score$predict )
################################################################################
# Setting threshold value for anomaly detection.
################################################################################
# Setting desired yield threshold percentage.
yieldThreshold = .99 # Let's say we want a 99% yield
# Using yield threshold to get score limit to filter anomalous units.
scoreLimit = round( quantile( result_pred, yieldThreshold ), 3 )
################################################################################
# Get anomalies from testing data, using model and scoreLimit got using training data.
################################################################################
# Convert testing data frame to H2O format.
testingDataH2O = as.h2o( testingData, destination_frame = "testingData_hex" )
# Get score using training model
testingScore <- h2o.predict( trainingModel, testingDataH2O )
# Add row score at the beginning of testing dataset
testingData = cbind( RowScore = round( as.vector( testingScore$predict ), 3 ), testingData )
# Get anomalies from testing data
anomalies = testingData[ testingData$RowScore > scoreLimit, ]
if( dim(anomalies)[1] > 0 ){
cat( "Email to Engineering: Anomalies detected in the sample data, station needs maintenance." )
# Plotting anomalies found.
plot( x = row.names(anomalies)
, y = anomalies$RowScore
, xlab = "Main Dataset Row Number."
, ylab = "Anomaly Score"
, main = paste0( "Anomalies, Yield Threshold: "
, yieldThreshold
, ", Score Limit: "
, scoreLimit )
, pch = 2
, cex.main = 1
, frame.plot = FALSE
, col = "blue", panel.first=grid() )
}
# Advantages of using this approach:
# This is a very fast way to get anomalies.
# Very easy to implement,
# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
# Reading data file
setwd( "/tmp" )
allData = read.csv( "secom.data", sep = " ", header = FALSE, encoding = "UTF-8" )
################################################################################
# Dataset fixing, there are a lot of NaN records
################################################################################
for(i in 1:ncol(allData)) {
if( sum(is.na( allData[, i])) > 0 ) {
#cat( "Processing column:", i, ", number missing:", sum( is.na( allData[, i])), "\n" )
# These next two lines of code were left here for better understanding, the code may be one line.
temp_missing = is.na( allData[, i]) # Identify all missing(NAs) column values
ValidColumnValues = allData[,i][!temp_missing] # Store valid column values
# Set valid values to NAs found on this column with random values from sample function
allData [,i][temp_missing] = sample(ValidColumnValues, size = sum(temp_missing), replace = TRUE)
}
}
# spliting all data, the first 90% for training and the rest 10% for testing our model.
trainingData = allData[1:floor(dim(allData)[1]*.9),]
testingData = allData[(floor(dim(allData)[1]*.9)+1):dim(allData)[1],]
################################################################################
# Creating Anomaly Detection Model
################################################################################
h2o.init( nthreads = -1, max_mem_size = "5G", port = 3333 )
h2o.removeAll() ## Removes the data from the h2o cluster in preparation for our final model.
# Convert the training dataset to H2O format.
trainingData_hex = as.h2o( trainingData, destination_frame = "train_hex" )
# Build an Isolation forest model
trainingModel = h2o.isolationForest( training_frame = trainingData_hex
, sample_rate = 0.1
, max_depth = 32
, ntrees = 100
)
# According to H2O doc: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
# Isolation Forest is similar in principle to Random Forest and is built on the basis of decision trees.
# Isolation Forest creates multiple decision trees to isolate observations.
#
# Trees are split randomly, The assumption is that:
#
# IF ONE UNIT MEASUREMENTS ARE SIMILAR TO OTHERS,
# IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
#
# The less splits needed, the unit is more likely to be anomalous.
#
# The average number of splits is then used as a score.
# Calculate score for training dataset
score <- h2o.predict( trainingModel, trainingData_hex )
result_pred <- as.vector( score$predict )
################################################################################
# Setting threshold value for anomaly detection.
################################################################################
# Setting desired yield threshold percentage.
yieldThreshold = .99 # Let's say we want a 99% yield
# Using yield threshold to get score limit to filter anomalous units.
scoreLimit = round( quantile( result_pred, yieldThreshold ), 3 )
################################################################################
# Get anomalies from testing data, using model and scoreLimit got using training data.
################################################################################
# Convert testing data frame to H2O format.
testingDataH2O = as.h2o( testingData, destination_frame = "testingData_hex" )
# Get score using training model
testingScore <- h2o.predict( trainingModel, testingDataH2O )
# Add row score at the beginning of testing dataset
testingData = cbind( RowScore = round( as.vector( testingScore$predict ), 3 ), testingData )
# Get anomalies from testing data
anomalies = testingData[ testingData$RowScore > scoreLimit, ]
if( dim(anomalies)[1] > 0 ){
cat( "Email to Engineering: Anomalies detected in the sample data, station needs maintenance." )
# Plotting anomalies found.
plot( x = row.names(anomalies)
, y = anomalies$RowScore
, xlab = "Main Dataset Row Number."
, ylab = "Anomaly Score"
, main = paste0( "Anomalies, Yield Threshold: "
, yieldThreshold
, ", Score Limit: "
, scoreLimit )
, pch = 2
, cex.main = 1
, frame.plot = FALSE
, col = "blue", panel.first=grid() )
}
# Advantages of using this approach:
# This is a very fast way to get anomalies.
# Very easy to implement,
Enjoy it!!!...
Carlos Kassab
More information about R: