Skip to main content

Using R and H2O Isolation Forest to identify product anomalies during the manufacturing process.

Note. - This article has some improvements from Yana Kane-Esrig( https://www.linkedin.com/in/ykaneesrig/ ), mentioned in this article: http://laranikalranalytics.blogspot.com/2021/03/updated-using-r-and-h2o-to-identify.html

Introduction:


We will identify anomalous units on the production line by using measurements data from testing stations and Isolation Forest model. Anomalous products are not failures, anomalies are units close to measurement limits, so we can display maintenance warnings before the station starts to make scrap.


Before starting we need the next software installed and working:

R language installed.
H2O open source framework.
- Java 8 ( For H2O ). Open JDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
R studio.

Get your data.

About the data: Since I cannot use my real data, for this article I am using SECOM Data Set from UCI Machine Learning Repository    

I downloaded SECOM data to /tmp


How many records?: 
Training data set - In my real project, I use 100 thousand test passed records, it is around a month of production data.
Testing data set - I use the last 24 hours of testing station data.

Note. On a real environment, get and process testing stations data one by one is the suggested approach.


Let's start coding:




Enjoy it!!!...
Carlos Kassab

More information about R:


Popular posts from this blog

UPDATED: Using R and H2O to identify product anomalies during the manufacturing process.

Note.  This is an update to article:  http://laranikalranalytics.blogspot.com/2019/03/using-r-and-h2o-to-identify-product.html - It has some updates but also code optimization from  Yana Kane-Esrig(  https://www.linkedin.com/in/ykaneesrig/ ) , as she mentioned in a message: The code you posted has two nested for() {} loops. It took a very long time to run. I used just one for() loop. It was much faster   Here her original code: num_rows=nrow(allData) for(i in 1:ncol(allData)) {   temp = allData [,i]   cat( "Processing column:", i, ", number missing:", sum( is.na(temp)), "\n" )    temp_mising =is.na( allData[, i])    temp_values = allData[,i][! temp_mising]    temp_random = sample(temp_values, size = num_rows, replace = TRUE)      temp_imputed = temp   temp_imputed[temp_mising]= temp_random [temp_mising]   # describe(temp_imputed)   allData [,i] = temp_imputed      cat( "Process...

Using R and H2O Isolation Forest anomaly detection for data quality, further analysis.

Introduction: This is the second article on data quality, for the first part, please go to:  http://laranikalranalytics.blogspot.com/2019/11/using-r-and-h2o-isolation-forest-for.html Since Isolation Forest is building an ensemble of isolation trees, and these trees are created randomly, there is a lot of randomness in the isolation forest training, so, to have a more robust result, 3 isolation forest models will be trained for a better anomaly detection. I will also use Apache Spark for data handling. For a full example, testing data will be used after training the 3 IF(Isolation Forest) models. This way of using Isolation Forest is kind of a general usage also for maintenance prediction. I am working with data from file: https://www.kaggle.com/bradklassen/pga-tour-20102018-data NOTE: There was a problem with the data from the link above, so I created some synthetic data that can be downloaded from this link:  Golf Tour Synthetic Data # Set Java parameters, en...

Using R and H2O Isolation Forest For Data Quality

Introduction: We will identify anomalous patterns in data, this process is useful, not only to find inconsistencies and errors but also to find abnormal data behavior, being useful even to find cyber attacks on organizations. On this article there is more information as reference: Data Quality and Anomaly Detection Thoughts For Web Analytics Before starting we need the next software installed and working: -  R language installed. -  H2O open source framework. - Java 8 ( For H2O ). Open JDK:  https://github.com/ojdkbuild/contrib_jdk8u-ci/releases -  R studio. About the data used in this article. # I am using https://www.kaggle.com/bradklassen/pga-tour-20102018-data # The version I have is not the most updated version but anyway, a new version # may be used. Leaving this paragraph as a note, please refer to the next paragraph. NOTE: There was a problem with the data from the link above, so I created some synthetic data that can be ...