Skip to main content

Using R and H2O Isolation Forest For Data Quality

# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) ) 
# For interactive plotting
suppressWarnings( suppressMessages( library( dygraphs ) ) )
suppressWarnings( suppressMessages( library( dplyr ) ) )
suppressWarnings( suppressMessages( library( DT ) ) )

# Start a single-node instance of H2O using all available processor cores and reserve 5GB of memory
h2oServer = h2o.init( ip = "localhost", port = 54321, max_mem_size = "5g", nthreads = -1 )
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.out
##     /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 395 milliseconds 
##     H2O cluster timezone:       America/Mexico_City 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.26.0.6 
##     H2O cluster version age:    1 month and 8 days  
##     H2O cluster name:           H2O_started_from_R_ckassab_aat507 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   4.44 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.1 (2019-07-05)
h2o.removeAll() # Removes all data from h2o cluster, ensuring it is clean.
h2o.no_progress()  # Turn off progress bars for notebook readability

# Setting H2O timezone for proper date data type handling
#h2o.getTimezone() ===>>> UTC
#h2o.listTimezones() # We can see all H2O timezones
h2o.setTimezone("US/Central")
## [1] "US/Central"
# Note. I am using Ubuntu 19.10, using /tmp directory
# Every time I boot my computer, I need to copy the data file again to /tmp
# directory.

# Importing data file and setting data types accordingly.
allData = read.csv( "/tmp/PGA_Tour_Golf_Data_2019_Kaggle.csv", sep = ",", header = T )

# When using as.Posixct H2O is not importing data, so we are using as.Date.
allData$Date = as.Date( allData$Date )
allData$Value = as.numeric(allData$Value)

# Convert dataset to H2O format.
allData_hex = as.h2o( allData )

# Build an Isolation forest model
startTime <- Sys.time()
startTime
## [1] "2019-11-10 20:10:30 CST"
trainingModel = h2o.isolationForest( training_frame = allData_hex
                                     , sample_rate = 0.1
                                     , max_depth = 32
                                     , ntrees = 100
                                    )
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Stopping tolerance is ignored for _stopping_rounds=0..
Sys.time()
## [1] "2019-11-10 20:20:15 CST"
Sys.time() - startTime
## Time difference of 9.756691 mins
# According to H2O doc: 
# http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
#
# Isolation Forest is similar in principle to Random Forest and is built on 
# the basis of decision trees. 

# Isolation Forest creates multiple decision trees to isolate observations.
# 
# Trees are split randomly, The assumption is that:
#   
#   IF ONE UNIT MEASUREMENTS ARE SIMILAR TO OTHERS,
#   IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
# 
#   The less splits needed, the unit is more likely to be anomalous.
# 
# The average number of splits is then used as a score.

# Calculate score for all data.
startTime <- Sys.time()
startTime
## [1] "2019-11-10 20:20:15 CST"
score = h2o.predict( trainingModel, allData_hex )
result_pred = as.vector( score$predict )
Sys.time()
## [1] "2019-11-10 20:23:18 CST"
Sys.time() - startTime
## Time difference of 3.056829 mins
################################################################################
# Setting threshold value for anomaly detection.
################################################################################

# Setting desired threshold percentage.
threshold = .999 # Let's say we want the .001% data different than the rest.

# Using this threshold to get score limit to filter data anomalies.
scoreLimit = round( quantile( result_pred, threshold ), 4 )

# Add row score at the beginning of dataset
allData = cbind( RowScore = round( result_pred, 4 ), allData )

# Get data anomalies by filtering all data.
anomalies = allData[ allData$RowScore > scoreLimit, ]

# As we can see in the summary:
summary(anomalies)
##     RowScore              Player.Name        Date           
##  Min.   :0.9540   Jonas Blixt   : 231   Min.   :2019-07-07  
##  1st Qu.:0.9565   Jordan Spieth : 231   1st Qu.:2019-08-25  
##  Median :0.9614   Julian Etulain: 221   Median :2019-08-25  
##  Mean   :0.9640   Johnson Wagner: 213   Mean   :2019-08-24  
##  3rd Qu.:0.9701   John Chin     : 209   3rd Qu.:2019-08-25  
##  Max.   :1.0000   Keegan Bradley: 209   Max.   :2019-08-25  
##                   (Other)       :8325                       
##                            Statistic   
##  Club Head Speed                : 234  
##  Driving Pct. 300-320 (Measured): 193  
##  Carry Efficiency               : 163  
##  First Tee Early Lowest Round   : 161  
##  First Tee Late Lowest Round    : 160  
##  GIR Percentage - 100+ yards    : 158  
##  (Other)                        :8570  
##                                                      Variable   
##  First Tee Early Lowest Round - (LOW RND)                : 103  
##  First Tee Late Lowest Round - (LOW RND)                 :  96  
##  First Tee Late Lowest Round - (ROUNDS)                  :  64  
##  Driving Pct. 300-320 (Measured) - (TOTAL DRVS - OVERALL):  61  
##  GIR Percentage - 175-200 yards - (%)                    :  61  
##  First Tee Early Lowest Round - (ROUNDS)                 :  58  
##  (Other)                                                 :9196  
##      Value       
##  Min.   :  1268  
##  1st Qu.: 53058  
##  Median : 87088  
##  Mean   :111716  
##  3rd Qu.:184278  
##  Max.   :220583  
## 
# The Statistic: GIR Percentage - 100+ yards is one of the most important values
# Filtering all anomalies within this Statistic value
statisticFilter = "GIR Percentage - 100+ yards"

specificVar = anomalies %>%
  filter(Statistic==statisticFilter)

cat( statisticFilter,": ", dim(specificVar)[1] )
## GIR Percentage - 100+ yards :  158
if( dim(specificVar)[1]  > 0 ) {

  # We want to know the relation between Players and "Approaches from 200-225 yards"
  # So, in order to get a chart, we assign a code to each player
  # Since factors in R are really integer values, we do this to get the codes:
  specificVar$PlayerCode = as.integer(specificVar$Player.Name) 
  
  # To sort our dataset we convert the date to numeric 
  specificVar$DateAsNum = as.numeric( paste0( substr(specificVar$Date,1,4)
                                                      , substr(specificVar$Date,6,7)
                                                      , substr(specificVar$Date,9,10) ) )
  # And sort the data frame.
  specificVar = specificVar[order(specificVar$DateAsNum),]
  # Set records num using a sequence.
  rownames(specificVar) = seq(1:dim(specificVar)[1])
  
  colNamesFinalTable = c( "PlayerCode", "Player.Name", "Date", "Variable", "Value" )
  specificVar = specificVar[, colNamesFinalTable]
  specificVar$PlayerCode = as.factor(specificVar$PlayerCode)
  
  # Creating our final dataframe for our chart.
  specificVarChartData = data.frame( SeqNum = as.integer( rownames(specificVar) )
                                             , PlayerCode = specificVar$PlayerCode
                                             , Value = specificVar$Value
                                             )
  

  
  AnomaliesGraph = dygraph( specificVarChartData, main = ''
                      , xlab = paste(statisticFilter,"Anomaly Number."), ylab = "Player Code." ) %>%
    dyAxis("y", label = "Player Code.") %>%
    dyAxis("y2", label = "Value.", independentTicks = TRUE) %>%
    dySeries( name = "PlayerCode", label = "Player Code.", drawPoints = TRUE, pointShape = "dot"
              , color = "blue", pointSize = 2 ) %>%
    dySeries( name = "Value", label = "Value.", drawPoints = TRUE, pointShape = "dot"
              , color = "green", pointSize = 2, axis = 'y2' ) %>%
    dyRangeSelector()
  dyOptions( AnomaliesGraph, digitsAfterDecimal = 0 )
}
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
Sample chart with the anomalies found:



Sample data table with the anomalies found:








Here is the code on github, including the total html functional demo:
https://github.com/LaranIkal/DataQuality


Enjoy it!!!...

Carlos Kassab

More information about R:

Popular posts from this blog

UPDATED: Using R and H2O to identify product anomalies during the manufacturing process.

Note.  This is an update to article:  http://laranikalranalytics.blogspot.com/2019/03/using-r-and-h2o-to-identify-product.html - It has some updates but also code optimization from  Yana Kane-Esrig(  https://www.linkedin.com/in/ykaneesrig/ ) , as she mentioned in a message: The code you posted has two nested for() {} loops. It took a very long time to run. I used just one for() loop. It was much faster   Here her original code: num_rows=nrow(allData) for(i in 1:ncol(allData)) {   temp = allData [,i]   cat( "Processing column:", i, ", number missing:", sum( is.na(temp)), "\n" )    temp_mising =is.na( allData[, i])    temp_values = allData[,i][! temp_mising]    temp_random = sample(temp_values, size = num_rows, replace = TRUE)      temp_imputed = temp   temp_imputed[temp_mising]= temp_random [temp_mising]   # describe(temp_imputed)   allData [,i] = temp_imputed      cat( "Process...

Installing our R development environment on Ubuntu 20.04

  Step 1: Install R,  Here the link with instructions:  How to instal R on Ubuntu 20.04 Adding the steps I followed because sometimes the links become unavailable: Add GPG key: $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 Output: Executing: /tmp/apt-key-gpghome.NtZgt0Un4R/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 gpg: key 51716619E084DAB9: public key "Michael Rutter " imported gpg: Total number processed: 1 gpg: imported: 1 Add repository: $ sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' Output: Hit:1 https://deb.opera.com/opera-stable stable InRelease Hit:2 http://us.archive.ubuntu.com/ubuntu focal InRelease Hit:3 http://archive.canonical.com/ubuntu focal InRelease ...

Using R and H2O Isolation Forest anomaly detection for data quality, further analysis.

Introduction: This is the second article on data quality, for the first part, please go to:  http://laranikalranalytics.blogspot.com/2019/11/using-r-and-h2o-isolation-forest-for.html Since Isolation Forest is building an ensemble of isolation trees, and these trees are created randomly, there is a lot of randomness in the isolation forest training, so, to have a more robust result, 3 isolation forest models will be trained for a better anomaly detection. I will also use Apache Spark for data handling. For a full example, testing data will be used after training the 3 IF(Isolation Forest) models. This way of using Isolation Forest is kind of a general usage also for maintenance prediction. I am working with data from file: https://www.kaggle.com/bradklassen/pga-tour-20102018-data NOTE: There was a problem with the data from the link above, so I created some synthetic data that can be downloaded from this link:  Golf Tour Synthetic Data # Set Java parameters, en...