
UPDATED: Using R and H2O to identify product anomalies during the manufacturing process.


- This post includes updates and a code optimization contributed by Yana Kane-Esrig (https://www.linkedin.com/in/ykaneesrig/), who mentioned in a message:
"The code you posted has two nested for() {} loops. It took a very long time to run.
I used just one for() loop. It was much faster."
Here is her original code:
num_rows = nrow(allData)
for(i in 1:ncol(allData)) {
  temp = allData[, i]
  cat( "Processing column:", i, ", number missing:", sum( is.na(temp)), "\n" )
  temp_missing = is.na( allData[, i])
  temp_values = allData[, i][!temp_missing]
  temp_random = sample(temp_values, size = num_rows, replace = TRUE)
  
  temp_imputed = temp
  temp_imputed[temp_missing] = temp_random[temp_missing]
  # describe(temp_imputed)
  allData[, i] = temp_imputed
  
  cat( "Processed column:", i, ", number missing:", sum( is.na(allData[, i])), "\n \n" )
}

Many thanks to Yana for this improvement: the original code took 18.3406 seconds, and now it takes 0.1352897 seconds on the same computer with the same data.

Although I made some changes to her code, I am still using her original idea of imputing missing values with random samples from the observed data.
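As a self-contained illustration of this random-sampling imputation idea, here is a minimal sketch on a toy data frame (the values here are made up, not from the SECOM data):

```r
# Toy example: fill NAs in each column by sampling at random
# from the observed (non-missing) values of that same column.
set.seed(42)
df = data.frame(a = c(1, NA, 3, NA, 5), b = c(10, 20, 30, 40, 50))
for (i in 1:ncol(df)) {
  missing = is.na(df[, i])
  if (any(missing)) {
    df[, i][missing] = sample(df[, i][!missing], size = sum(missing), replace = TRUE)
  }
}
sum(is.na(df))  # 0: no missing values remain
```

Every imputed value comes from the same column's observed distribution, which keeps each column's marginal distribution roughly intact.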

Introduction:

We will identify anomalous products on the production line using measurements from testing stations and a deep learning model. Anomalous products are not failures; they are products close to the measurement limits. Detecting them lets us display warnings, so the stations can get maintenance before the process starts producing failed products.

Before starting, we need the following software installed and working:

- R development environment. I suggest my article: Installing our R development environment on Ubuntu 20.04

- Required R packages:
    - h2o
    - ggplot2
    - plotly

- Testing station data. I suggest testing station by station.
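If the packages are not yet installed, they can be added from CRAN (note that the H2O documentation also describes a version-pinned install of h2o; the plain CRAN install below is the simplest option):

```r
# Install the required packages from CRAN (run once)
install.packages(c("h2o", "ggplot2", "plotly"))
```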


Get your data.


About the data: since I cannot use my real data, for this article I am using the SECOM Data Set from the UCI Machine Learning Repository.
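The code below expects the file at /tmp/UCI_ML_SecomData/secom.data. Assuming the usual UCI repository layout (please verify the URL against the SECOM dataset page before running), the data can be fetched like this:

```shell
# Download the SECOM dataset into the path used by the R code below
# (URL assumed from the UCI repository's usual layout; verify before running)
mkdir -p /tmp/UCI_ML_SecomData
cd /tmp/UCI_ML_SecomData
wget https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data
```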


How many records?
Training data set: in my real project, I use 100 thousand test-passed records, which is around a month of production data.
Testing data set: I use the last 24 hours of testing station data.


Let the fun begin.

# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
suppressWarnings( suppressMessages( library( ggplot2 ) ) ) 
suppressWarnings( suppressMessages( library( plotly ) ) ) 

# Reading data file
dataFile = "/tmp/UCI_ML_SecomData/secom.data"
allData = read.csv( dataFile, sep = " ", header = FALSE, encoding = "UTF-8" )

################################################################################
# Dataset fixing, there are a lot of NaN records
################################################################################
startTime = Sys.time()
cat("Dataset Fixing Start Time:", format(startTime, "%Y-%m-%d %H:%M:%S %Z"))
## Dataset Fixing Start Time: 2021-03-25 18:27:41 CST
for(i in 1:ncol(allData)) {
  if( sum(is.na( allData[, i])) > 0 ) {
    #cat( "Processing column:", i, ", number missing:", sum( is.na( allData[, i])), "\n" )
    # The next two lines were left separate for better understanding; they could be one line.
    temp_missing = is.na( allData[, i]) # Identify all missing(NAs) column values
    ValidColumnValues = allData[,i][!temp_missing] # Store valid column values
    # Replace the NAs in this column with values sampled at random from the valid values
    allData [,i][temp_missing] = sample(ValidColumnValues, size = sum(temp_missing), replace = TRUE)
  }
}

cat("Dataset Fixing End Time:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"))
## Dataset Fixing End Time: 2021-03-25 18:27:41 CST
Sys.time() - startTime
## Time difference of 0.2357759 secs ===>>> Elapsed time
# Splitting all data: the first 90% for training and the remaining 10% for testing our model.
trainingData = allData[1:floor(dim(allData)[1]*.9),]
testingData = allData[(floor(dim(allData)[1]*.9)+1):dim(allData)[1],]
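The same 90/10 split logic, shown on a small made-up data frame to make the row counts easy to verify:

```r
# Illustration of the 90/10 split on a toy 100-row data frame
toy = data.frame(x = 1:100)
cut = floor(nrow(toy) * 0.9)
toyTrain = toy[1:cut, , drop = FALSE]            # first 90% of the rows
toyTest  = toy[(cut + 1):nrow(toy), , drop = FALSE]  # remaining 10%
nrow(toyTrain)  # 90
nrow(toyTest)   # 10
```

Note this is a sequential split, not a random one; with production data ordered by time, this keeps the most recent records for testing.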


################################################################################
# Creating Deep Learning Model
# To check the process, go to: http://localhost:54321/flow/index.html.
# Then, go to menu->Admin->Jobs.
################################################################################
startTime = Sys.time()
cat("H2O Processing Start Time:", format(startTime, "%Y-%m-%d %H:%M:%S %Z"))
## H2O Processing Start Time: 2021-03-25 18:27:41 CST
  h2o.init( nthreads = -1, max_mem_size = "5G" )
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/RtmpGKcodE/file5f8812bef14e/h2o_laranikal_started_from_r.out
##     /tmp/RtmpGKcodE/file5f8844a5bf51/h2o_laranikal_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: .. Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 527 milliseconds 
##     H2O cluster timezone:       America/Mexico_City 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.5 
##     H2O cluster version age:    9 days  
##     H2O cluster name:           H2O_started_from_R_laranikal_cut876 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   4.44 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.5.0 (2018-04-23)
  h2o.no_progress() # Disable progress bars for Rmd
  h2o.removeAll() # Cleans h2o cluster state.
  
  # Convert the training dataset to H2O format.
  trainingData_hex = as.h2o( trainingData, destination_frame = "train_hex" )
  
  # Set the input variables
  featureNames = colnames(trainingData_hex)
  
  # Creating the first model version.
  trainingModel = h2o.deeplearning( x = featureNames, training_frame = trainingData_hex
                                    , model_id = "Station1DeepLearningModel"
                                    , activation = "Tanh"
                                    , autoencoder = TRUE
                                    #, reproducible = FALSE
                                    , reproducible = TRUE
                                    , l1 = 1e-5
                                    , ignore_const_cols = FALSE
                                    , seed = 1234
                                    , hidden = c( 400, 200, 400 ), epochs = 50 )
  
  
  # Getting the anomaly with training data to set the min MSE( Mean Squared Error )
  # value before setting a record as anomaly
  trainMSE = as.data.frame( h2o.anomaly( trainingModel
                                         , trainingData_hex
                                         , per_feature = FALSE ) )
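  With per_feature = FALSE, h2o.anomaly() returns one number per row: the mean squared error between the input row and the autoencoder's reconstruction of it. Conceptually, in plain R terms (x and x_hat below are made-up stand-ins for the input and reconstructed matrices):

```r
# Per-row reconstruction MSE, as h2o.anomaly() computes it conceptually
x     = matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)  # original rows
x_hat = matrix(c(1, 2, 3, 6), nrow = 2, byrow = TRUE)  # "reconstructed" rows
mse_per_row = rowMeans((x - x_hat)^2)
mse_per_row  # 0 for the perfectly reconstructed row, 2 for the other
```

  Rows the autoencoder reconstructs poorly (high MSE) are the candidates for anomalies.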
  
  # Check the first 30 trainMSE records, sorted in descending order, to see our outliers
  head( sort( trainMSE$Reconstruction.MSE , decreasing = TRUE ), 30)
##  [1] 0.023172912 0.021540237 0.021323148 0.020527838 0.013531867 0.010603028
##  [7] 0.008443205 0.007816767 0.006420300 0.006343981 0.005977489 0.005313815
## [13] 0.005027686 0.004914719 0.004863762 0.004280178 0.004228994 0.004144566
## [19] 0.003994022 0.003870735 0.003727515 0.003621605 0.003583236 0.003351647
## [25] 0.003085467 0.003080419 0.002970324 0.002835871 0.002804033 0.002716736
  # 0.020288603 0.017976305 0.012772556 0.011556780 0.010143009 0.009524983 0.007363854 
  # 0.005889714 0.005604329 0.005189614[11] 0.005185285 0.005118595 0.004639442 0.004497609
  # 0.004438342 0.004419993 0.004298936 0.003961503 0.003651326 0.003426971 0.003367108
  # 0.003169319 0.002901914 0.002852006 0.002772110 0.002765924 0.002754586 0.002748887 
  # 0.002619872 0.002474702
  
  # Plotting the reconstruction errors of our training data, to get a graphical view
  # of the data reconstruction errors
  
  # plot( sort( trainMSE$Reconstruction.MSE ), main = 'Reconstruction Error', ylab = "MSE Value." )
  
  plot_ly( x = rownames( trainMSE ), y = sort( trainMSE$Reconstruction.MSE ), type = "bar" )
[Interactive bar chart: sorted reconstruction error (Reconstruction.MSE) per training record.]
  # ggplot(trainMSE, aes( x = rownames( trainMSE ), y = Reconstruction.MSE, group=1 ) ) + geom_line()
  

  ggplot( trainMSE, aes( x = Reconstruction.MSE ) ) + geom_histogram( binwidth = .002, fill="green" ) +
    geom_vline( aes( xintercept = mean( Reconstruction.MSE ) ),
                color = "blue", linetype = "dashed", size = 1 )

  ggplotly()
[Interactive histogram: count of training records by Reconstruction.MSE, with a dashed blue line at the mean.]
  ##############################################################################
  # Setting reconstruction error threshold value for anomaly detection.
  ##############################################################################
  
  # Looking at the chart and the first 30 MSE records sorted in decreasing order,
  # we can choose .01 as our minimum MSE before flagging a record as an anomaly,
  # because only a few records have an MSE at or above the second decimal place,
  # and we can treat those as outliers.
  # This value is something you must decide for your data.
  
  # Updating the trainingData data set, keeping only records with reconstruction error < .01
  trainingDataNew = trainingData[ trainMSE$Reconstruction.MSE < .01, ]
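Instead of eyeballing the chart, a data-driven alternative (not used in this article, shown here only as an option) is to take a high quantile of the training MSE as the threshold. A sketch on made-up reconstruction errors:

```r
# Hypothetical alternative: use e.g. the 99th percentile of the training MSE
# as the anomaly threshold instead of a hand-picked value
mse = c(rep(0.001, 98), 0.015, 0.02)  # made-up reconstruction errors
threshold = quantile(mse, 0.99)
sum(mse > threshold)  # 1: how many training rows would be flagged
```

The quantile approach flags a fixed fraction of records by construction, so it still deserves a sanity check against the histogram before adopting it.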
  
  h2o.removeAll() ## Remove the data from the h2o cluster in preparation for our final model.
  
  # Convert our new training data frame to H2O format.
  trainingDataNew_hex = as.h2o( trainingDataNew, destination_frame = "train_hex" )
  
  # Creating the final model.
  trainingModelNew = h2o.deeplearning( x = featureNames, training_frame = trainingDataNew_hex
                                       , model_id = "Station1DeepLearningModel"
                                       , activation = "Tanh"
                                       , autoencoder = TRUE
                                       #, reproducible = FALSE
                                       , reproducible = TRUE
                                       , l1 = 1e-5
                                       , ignore_const_cols = FALSE
                                       , seed = 1234
                                       , hidden = c( 400, 200, 400 ), epochs = 50 )

  

  
################################################################################
# Get anomalies from testing data, using model and threshold set using training data.
################################################################################
  
  # Convert our testing data frame to H2O format.
  testingDataH2O = as.h2o( testingData, destination_frame = "test_hex" )
  
  # Getting anomalies found in testing data.
  testMSE = as.data.frame( h2o.anomaly( trainingModelNew
                                        , testingDataH2O
                                        , per_feature = FALSE ) )
  
  # Binding our data.
  testingData = cbind( MSE = testMSE$Reconstruction.MSE , testingData )
  
  # Filter testing data using the MSE value set as minimum.
  anomalies = testingData[ testingData$MSE >= .01,  ]
  
  # When anomalies detected, send a notice to maintenance area.
  if( dim(anomalies)[1] > 0 ){
    cat( "Anomalies detected in the sample data, station needs maintenance." )
  }
## Anomalies detected in the sample data, station needs maintenance.
################################################################################
# Remember to shut down your local H2O instance, unless you want to continue analyzing
# using the H2O web UI: http://localhost:54321/flow/index.html
################################################################################

h2o.shutdown(prompt=FALSE)
  
cat("H2O Processing End Time:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"))  
## H2O Processing End Time: 2021-03-25 18:53:17 CST
Sys.time() - startTime
## Time difference of 25.59161 mins ===>>> Elapsed time

Enjoy!





