
Posts

Using R and H2O Isolation Forest For Data Quality

Introduction:
We will identify anomalous patterns in data. This process is useful not only for finding inconsistencies and errors, but also for finding abnormal data behavior; it can even be used to detect cyber attacks on organizations.
There is more information for reference in this article: Data Quality and Anomaly Detection Thoughts For Web Analytics

Before starting, we need the following software installed and working:

R language installed.
H2O open source framework.
- Java 8 ( For H2O ). Open JDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
RStudio.

About the data used in this article.

# I am using https://www.kaggle.com/bradklassen/pga-tour-20102018-data
# The version I have is not the most recent one, but a newer version may be used.
# The file I am using is a 950 MB csv file with 9,720,530 records, including the header.
#
# One very important thing is that, instead of getting lost in more than 9 million records,
# we will just be looking at 158 records with anom…
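The preview above cuts off, but the general approach can be sketched. The following is a minimal, hypothetical sketch ( the local file path and the H2O parameters are assumptions, not taken from the original post ) of loading the csv with data.table and training an H2O Isolation Forest so that every record gets an anomaly score:

# Minimal sketch, assuming the Kaggle csv has been downloaded to a local path.
suppressWarnings( suppressMessages( library( h2o ) ) )
suppressWarnings( suppressMessages( library( data.table ) ) )

h2o.init( nthreads = -1, max_mem_size = "8G" )

pgaData = fread( "/tmp/pga_tour.csv", header = TRUE )  # ~9.7 million records
pgaHex  = as.h2o( pgaData )

# Train the Isolation Forest on the columns we want to check for quality issues.
isoModel = h2o.isolationForest( training_frame = pgaHex
                              , sample_rate = 0.1
                              , max_depth = 32
                              , ntrees = 100 )

# In the prediction frame, higher "predict" values mean the record was easier
# to isolate, i.e. more anomalous.
scores = as.data.frame( h2o.predict( isoModel, pgaHex ) )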
Recent posts

Predicting Car Battery Failure With R And H2O - Study

Using R and H2O Isolation Forest to predict car battery failures. Carlos Kassab 2019-May-24
This is a study about what might be possible if car makers started using machine learning in our cars to predict failures.

# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
suppressWarnings( suppressMessages( library( data.table ) ) )
suppressWarnings( suppressMessages( library( plotly ) ) )
suppressWarnings( suppressMessages( library( DT ) ) )

# Reading data file
# Data from: https://www.kaggle.com/yunlevin/levin-vehicle-telematics
dataFileName = "/Development/Analytics/AnomalyDetection/AutomovileFailurePrediction/v2.csv"
carData = fread( dataFileName, skip=0, header = TRUE )

carBatteryData = data.table( TimeStamp = carData$timeStamp
                           , BatteryVoltage = as.numeric( carData$battery ) )
rm(carData)

# Data cleaning, filtering and conversion
carBatteryData = na.omit( carBatteryData )

# Keeping just valid Values
# Acco…
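The code preview is cut off above. The following is a minimal sketch of how the workflow might continue ( this is not the original post's code; the parameter values are assumptions ): start H2O, train an Isolation Forest on the battery voltage readings, and score each reading.

# Minimal sketch ( assumed continuation, not the original code )
h2o.init( nthreads = -1 )

batteryHex = as.h2o( carBatteryData[ , .( BatteryVoltage ) ] )

isoModel = h2o.isolationForest( training_frame = batteryHex
                              , sample_rate = 0.1
                              , ntrees = 100 )

# "predict" is a normalized anomaly score: higher values mean the reading was
# isolated with a shorter path, i.e. it looks more anomalous.
batteryScores = as.data.frame( h2o.predict( isoModel, batteryHex ) )
carBatteryData$AnomalyScore = batteryScores$predict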

Production Line Stations Maintenance Prediction - Process Flow.

Steps Needed in a Process to Detect Anomalies and Get a Maintenance Notice Before Scrap Is Created on the Production Line.

Describing the process flow from my previous articles ( 1, 2 ):

Get Training Data.
  At least 2 weeks of passed units measurements.
Data Cleaning.
  Ensure no null values.
  At least 95% of data must have measurement values.
Anomalies Detection Model Creation.
  Deep Learning Autoencoders, or
  Isolation Forest.
Set Yield Threshold Desired, Normally 99%.
Get Prediction Value Limit by Linking Yield Threshold to Training Data Using The Anomaly Detection Model Created.
Get Testing Data.
  Last 24 Hour Data From Station Measurements, Passed And Failed Units.
Testing Data Cleaning.
  Ensure no null values.
Get Anomalies From Testing Data by Using The Model Created And Prediction Limit Found Before.
If Anomalies Found, Notify Maintenance to Avoid Scrap.
Display Chart Showing Last 24 Hour Anomalies And Failures Found:
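As a minimal sketch of the yield-threshold and anomaly-flagging steps above ( the object names isoModel, trainHex, testHex and testData are assumptions, standing for the anomaly detection model and the cleaned training / testing frames ), the yield threshold can be linked to a prediction value limit by taking the matching quantile of the training scores:

# Desired yield threshold, normally 99%.
yieldThreshold = 0.99

# Score the training data with the anomaly detection model and take the
# quantile that corresponds to the yield threshold as the prediction limit.
trainScores     = as.data.frame( h2o.predict( isoModel, trainHex ) )$predict
predictionLimit = quantile( trainScores, probs = yieldThreshold )

# Any testing record scoring above the limit is flagged as an anomaly.
testScores = as.data.frame( h2o.predict( isoModel, testHex ) )$predict
anomalies  = testData[ testScores > predictionLimit ]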


As you can see ( Anomalies in blue, Failures in orange ), we are detecting anomalies ( Units clo…

Using R and H2O Isolation Forest to identify product anomalies during the manufacturing process.

Introduction:

We will identify anomalous units on the production line by using measurement data from testing stations and an Isolation Forest model. Anomalous products are not failures; anomalies are units close to measurement limits, so we can display maintenance warnings before the station starts to make scrap.
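To make that distinction concrete, here is a tiny, purely hypothetical example ( the 10 V to 13 V specification limits and the 0.2 V warning band are invented for illustration; in the article the anomaly flag actually comes from the Isolation Forest score ): a failure is a measurement outside the specification limits, while an anomaly is still inside the limits but drifting close to them.

# Hypothetical specification limits for one measurement ( illustration only ).
lowerSpec = 10.0
upperSpec = 13.0

voltage = c( 11.5, 11.6, 12.9, 13.4, 11.4 )

failure = voltage < lowerSpec | voltage > upperSpec          # outside the limits
anomaly = !failure & ( voltage > upperSpec - 0.2 |           # inside, but close
                       voltage < lowerSpec + 0.2 )           # to a limit

data.frame( voltage, failure, anomaly )
# The 12.9 V unit is an anomaly ( close to the upper limit ) and the 13.4 V
# unit is a failure; maintenance gets warned before more units drift out of spec.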


Before starting, we need the following software installed and working:

R language installed.
H2O open source framework.
- Java 8 ( For H2O ). Open JDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
RStudio.

Get your data.
About the data: Since I cannot use my real data, for this article I am using the SECOM Data Set from the UCI Machine Learning Repository.

I downloaded SECOM data to c:/tmp
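As a minimal sketch of reading those files ( in the UCI archive, secom.data holds the sensor measurements and secom_labels.data holds the pass / fail flag and a timestamp; the code itself is an assumption, not the original post's ):

suppressWarnings( suppressMessages( library( data.table ) ) )

# secom.data: one row per unit, space separated sensor measurements.
secomData = fread( "c:/tmp/secom.data", header = FALSE, sep = " " )

# secom_labels.data: pass / fail flag ( -1 = pass, 1 = fail ) and a timestamp.
secomLabels = fread( "c:/tmp/secom_labels.data", header = FALSE, sep = " " )
setnames( secomLabels, c( "PassFail", "TimeStamp" ) )

allData = cbind( secomLabels, secomData )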

How many records?
Training data set - In my real project, I use 100 thousand test-passed records; that is around a month of production data.
Testing data set - I use the last 24 hours of testing station data.
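A minimal sketch of that split, reusing the allData frame from the reading sketch above ( with SECOM the volumes are of course much smaller than the 100 thousand records of the real project, and the timestamp format is an assumption ):

# Convert the timestamp so we can filter by time.
allData[ , TimeStamp := as.POSIXct( TimeStamp, format = "%d/%m/%Y %H:%M:%S" ) ]

# Training data set: only units that passed the test stations.
trainData = allData[ PassFail == -1 ]

# Testing data set: the last 24 hours of station data, passed and failed units.
testData = allData[ TimeStamp >= max( TimeStamp, na.rm = TRUE ) - 24 * 3600 ]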
Note: In a real environment, getting and processing testing station data one by one is the suggested…