Note: This is an update to the article: http://laranikalranalytics.blogspot.com/2019/03/using-r-and-h2o-to-identify-product.html
- It includes some updates, but also a code optimization from Yana Kane-Esrig (https://www.linkedin.com/in/ykaneesrig/), as she mentioned in a message:
"The code you posted has two nested for() {} loops. It took a very long time to run.
I used just one for() loop. It was much faster."
Here is her original code:
num_rows = nrow( allData )
for( i in 1:ncol( allData ) ) {
  temp = allData[, i]
  cat( "Processing column:", i, ", number missing:", sum( is.na( temp ) ), "\n" )
  temp_missing = is.na( allData[, i] )
  temp_values = allData[, i][ !temp_missing ]
  temp_random = sample( temp_values, size = num_rows, replace = TRUE )
  temp_imputed = temp
  temp_imputed[ temp_missing ] = temp_random[ temp_missing ]
  # describe( temp_imputed )
  allData[, i] = temp_imputed
  cat( "Processed column:", i, ", number missing:", sum( is.na( allData[, i] ) ), "\n\n" )
}
Many thanks to Yana for this improvement: the original code took 18.3406 secs; now it takes 0.1352897 secs on the same computer, with the same data.
Even though I made some changes to her code, I am using her original idea of imputing missing values at random.
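Building on the same idea, the explicit for() loop can be avoided entirely. Here is a minimal sketch of that variant (imputeRandom is a helper name introduced only for this illustration; the article's code below keeps the loop):

imputeRandom = function( column ) {
  # Replace the NAs of one column with values resampled from its non-NA values.
  missing = is.na( column )
  if( any( missing ) ) {
    column[ missing ] = sample( column[ !missing ], size = sum( missing ), replace = TRUE )
  }
  column
}
allData[] = lapply( allData, imputeRandom ) # The [] keeps allData a data.frame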
We will identify anomalous products on the production line by using measurements from testing stations and deep learning models. Anomalous products are not failures; these anomalies are products close to the measurement limits, so we can display warnings before the process starts producing failed products, and this way the stations get maintenance in time.
Before starting, we need the following software installed and working:
- R development environment; I suggest you read my article: Installing our R development environment on Ubuntu 20.04
- R packages we need (see the install sketch after this list):
- h2o
- ggplot2
- plotly
- Testing station data; I suggest analyzing it station by station.
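If any of these packages are missing, they can be installed from CRAN. A minimal sketch (h2o can also be installed from the H2O.ai download page, which may provide a newer build):

# Install the required packages from CRAN (one-time setup).
install.packages( c( "h2o", "ggplot2", "plotly" ) )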
Get your data.
About the data: since I cannot use my real data, for this article I am using the SECOM Data Set from the UCI Machine Learning Repository (a download sketch follows below).
How many records?
- Training data set: in my real project, I use 100 thousand test-passed records, which is around a month of production data.
- Testing data set: I use the last 24 hours of testing station data.
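To reproduce this article, the SECOM file has to be at the path used by the code below. A minimal download sketch, assuming the UCI repository URL is still valid (verify it before running):

# Download the SECOM data file to the path the code below reads from.
dir.create( "/tmp/UCI_ML_SecomData", showWarnings = FALSE )
download.file( "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
             , destfile = "/tmp/UCI_ML_SecomData/secom.data" )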
Let the fun begin
# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
suppressWarnings( suppressMessages( library( ggplot2 ) ) )
suppressWarnings( suppressMessages( library( plotly ) ) )
# Reading data file
dataFile = "/tmp/UCI_ML_SecomData/secom.data"
allData = read.csv( dataFile, sep = " ", header = FALSE, encoding = "UTF-8" )
################################################################################
# Dataset fixing: there are a lot of NaN records
################################################################################
startTime = Sys.time()
cat("Dataset Fixing Start Time:", format(startTime, "%Y-%m-%d %H:%M:%S %Z"))
## Dataset Fixing Start Time: 2021-03-25 18:27:41 CST
for( i in 1:ncol( allData ) ) {
  if( sum( is.na( allData[, i] ) ) > 0 ) {
    #cat( "Processing column:", i, ", number missing:", sum( is.na( allData[, i] ) ), "\n" )
    # The next two lines are kept separate for better understanding; they could be combined into one.
    temp_missing = is.na( allData[, i] )               # Identify all missing (NA) column values
    ValidColumnValues = allData[, i][ !temp_missing ]  # Store the valid column values
    # Replace the NAs found in this column with random values drawn from the valid values by sample()
    allData[, i][ temp_missing ] = sample( ValidColumnValues, size = sum( temp_missing ), replace = TRUE )
  }
}
cat("Dataset Fixing End Time:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"))
## Dataset Fixing End Time: 2021-03-25 18:27:41 CST
Sys.time() - startTime
## Time difference of 0.2357759 secs ===>>> Elapsed time
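# Optional sanity check: after the imputation loop there should be no missing
# values left anywhere in the data set.
sum( is.na( allData ) ) # Expected: 0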
# Splitting the data: the first 90% for training and the remaining 10% for testing our model.
trainingData = allData[1:floor(dim(allData)[1]*.9),]
testingData = allData[(floor(dim(allData)[1]*.9)+1):dim(allData)[1],]
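# Optional check of the split sizes. For the full SECOM file (1567 rows) this
# should give 1410 training rows and 157 testing rows.
dim( trainingData )
dim( testingData )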
################################################################################
# Creating Deep Learning Model
# To check the process, go to: http://localhost:54321/flow/index.html.
# Then, go to menu->Admin->Jobs.
################################################################################
startTime = Sys.time()
cat("H2O Processing Start Time:", format(startTime, "%Y-%m-%d %H:%M:%S %Z"))
## H2O Processing Start Time: 2021-03-25 18:27:41 CST
h2o.init( nthreads = -1, max_mem_size = "5G" )
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /tmp/RtmpGKcodE/file5f8812bef14e/h2o_laranikal_started_from_r.out
## /tmp/RtmpGKcodE/file5f8844a5bf51/h2o_laranikal_started_from_r.err
##
##
## Starting H2O JVM and connecting: .. Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 seconds 527 milliseconds
## H2O cluster timezone: America/Mexico_City
## H2O data parsing timezone: UTC
## H2O cluster version: 3.32.0.5
## H2O cluster version age: 9 days
## H2O cluster name: H2O_started_from_R_laranikal_cut876
## H2O cluster total nodes: 1
## H2O cluster total memory: 4.44 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.5.0 (2018-04-23)
h2o.no_progress() # Disable progress bars for Rmd
h2o.removeAll() # Cleans h2o cluster state.
# Convert the training dataset to H2O format.
trainingData_hex = as.h2o( trainingData, destination_frame = "train_hex" )
# Set the input variables
featureNames = colnames(trainingData_hex)
# Creating the first model version.
trainingModel = h2o.deeplearning( x = featureNames, training_frame = trainingData_hex
, model_id = "Station1DeepLearningModel"
, activation = "Tanh"
, autoencoder = TRUE
#, reproducible = FALSE
, reproducible = TRUE
, l1 = 1e-5
, ignore_const_cols = FALSE
, seed = 1234
, hidden = c( 400, 200, 400 ), epochs = 50 )
# Getting the reconstruction error (MSE, Mean Squared Error) on the training data,
# to set the minimum MSE value above which a record is flagged as an anomaly
trainMSE = as.data.frame( h2o.anomaly( trainingModel
, trainingData_hex
, per_feature = FALSE ) )
# Check the first 30 trainMSE records, sorted in descending order, to see our outliers
head( sort( trainMSE$Reconstruction.MSE , decreasing = TRUE ), 30)
## [1] 0.023172912 0.021540237 0.021323148 0.020527838 0.013531867 0.010603028
## [7] 0.008443205 0.007816767 0.006420300 0.006343981 0.005977489 0.005313815
## [13] 0.005027686 0.004914719 0.004863762 0.004280178 0.004228994 0.004144566
## [19] 0.003994022 0.003870735 0.003727515 0.003621605 0.003583236 0.003351647
## [25] 0.003085467 0.003080419 0.002970324 0.002835871 0.002804033 0.002716736
# Plotting the errors of reconstructing our training data, to get a graphical view
# of our data reconstruction errors
# plot( sort( trainMSE$Reconstruction.MSE ), main = 'Reconstruction Error', ylab = "MSE Value." )
plot_ly( x = seq_len( nrow( trainMSE ) ), y = sort( trainMSE$Reconstruction.MSE ), type = "bar" )
# ggplot(trainMSE, aes( x = rownames( trainMSE ), y = Reconstruction.MSE, group=1 ) ) + geom_line()
ggplot( trainMSE, aes( x = Reconstruction.MSE ) ) + geom_histogram( binwidth = .002, fill="green" ) +
geom_vline( aes( xintercept = mean( Reconstruction.MSE ) ),
color = "blue", linetype = "dashed", size = 1 )
ggplotly()
##############################################################################
# Setting reconstruction error threshold value for anomaly detection.
##############################################################################
# Looking at the chart and the first 30 MSE records sorted in decreasing order, we can
# choose .01 as our minimum MSE before flagging a record as an anomaly, because just a
# few records have an MSE above .01, and we can treat those as outliers.
# This value is something you must decide for your own data.
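# An alternative, data-driven way to choose the threshold (a sketch only; this
# article keeps the hand-picked .01 below): take a high quantile of the
# training reconstruction errors, e.g. the 99th percentile.
quantile( trainMSE$Reconstruction.MSE, probs = 0.99 )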
# Updating trainingData data set with reconstruction error < .01
trainingDataNew = trainingData[ trainMSE$Reconstruction.MSE < .01, ]
h2o.removeAll() ## Remove the data from the h2o cluster in preparation for our final model.
# Convert our new training data frame to H2O format.
trainingDataNew_hex = as.h2o( trainingDataNew, destination_frame = "train_hex" )
# Creating the final model.
trainingModelNew = h2o.deeplearning( x = featureNames, training_frame = trainingDataNew_hex
, model_id = "Station1DeepLearningModel"
, activation = "Tanh"
, autoencoder = TRUE
#, reproducible = FALSE
, reproducible = TRUE
, l1 = 1e-5
, ignore_const_cols = FALSE
, seed = 1234
, hidden = c( 400, 200, 400 ), epochs = 50 )
################################################################################
# Get anomalies from testing data, using model and threshold set using training data.
################################################################################
# Convert our testing data frame to H2O format.
testingDataH2O = as.h2o( testingData, destination_frame = "test_hex" )
# Getting anomalies found in testing data.
testMSE = as.data.frame( h2o.anomaly( trainingModelNew
, testingDataH2O
, per_feature = FALSE ) )
# Binding our data.
testingData = cbind( MSE = testMSE$Reconstruction.MSE , testingData )
# Filter testing data using the MSE value set as minimum.
anomalies = testingData[ testingData$MSE >= .01, ]
# When anomalies are detected, send a notice to the maintenance area.
if( dim(anomalies)[1] > 0 ){
cat( "Anomalies detected in the sample data, station needs maintenance." )
}
## Anomalies detected in the sample data, station needs maintenance.
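# In a real deployment the "notice" could be an e-mail or a dashboard alert.
# A minimal sketch that just persists the anomalous rows for the maintenance
# team to review (the output file path is illustrative):
if( dim( anomalies )[1] > 0 ) {
  write.csv( anomalies, file = "/tmp/station1_anomalies.csv", row.names = FALSE )
}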
################################################################################
# Remember to shut down your local H2O instance, unless you want to continue analyzing
# using H2O web: http://localhost:54321/flow/index.html
################################################################################
h2o.shutdown(prompt=FALSE)
cat("H2O Processing End Time:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"))
## H2O Processing End Time: 2021-03-25 18:53:17 CST
Sys.time() - startTime
## Time difference of 25.59161 mins ===>>> Elapsed time
Enjoy it!!!
Carlos Kassab
More information about R: