Using R and H2O Isolation Forest anomaly detection for data quality: further analysis

Introduction: This is the second article on data quality; for the first part, please go to:

An Isolation Forest is an ensemble of isolation trees, and each tree is built randomly, so there is a lot of randomness in Isolation Forest training. To get a more robust result, three Isolation Forest (IF) models will be trained for better anomaly detection. I will also use Apache Spark for data handling. To make the example complete, testing data will be used after the three IF models are trained. This way of using Isolation Forest is also a fairly general pattern that applies to predictive maintenance.

I am working with data from file:

# Set Java parameters, enough memory for Java.
options( java.parameters = c( "-Xmx40G" ) ) # 40 GB RAM for Java

# Loading libraries
suppressWarnings(suppressMessages(library(sparklyr)))
suppressWarnings(suppres…
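The idea of training several Isolation Forest models to dampen the effect of random tree construction could look roughly like the sketch below. It is only an illustration under stated assumptions: the toy data frame, the column names `x1` and `x2`, the seeds, the number of trees, and the 1% cutoff are all hypothetical, and averaging the per-model path lengths is just one possible way to combine three models, not necessarily the one used in this article. Spark is left out to keep the sketch short.

```r
library(h2o)

# Start (or connect to) a local H2O cluster.
h2o.init(nthreads = -1)

# Hypothetical toy data: 500 ordinary rows plus 5 rows far from the rest.
set.seed(42)
train_df <- data.frame(
  x1 = c(rnorm(500), rnorm(5, mean = 8)),
  x2 = c(rnorm(500), rnorm(5, mean = 8))
)
train_hf <- as.h2o(train_df)

# Train three Isolation Forest models that differ only in their random seed.
seeds <- c(1, 2, 3)  # assumed values, chosen only for illustration
models <- lapply(seeds, function(s) {
  h2o.isolationForest(
    training_frame = train_hf,
    x = c("x1", "x2"),
    ntrees = 100,
    seed = s
  )
})

# Score the data with each model. "mean_length" is the mean path length
# across trees: a shorter path means the row is easier to isolate.
mean_lengths <- sapply(models, function(m) {
  as.data.frame(h2o.predict(m, train_hf))$mean_length
})

# Combine the three models by averaging their path lengths per row.
train_df$avg_mean_length <- rowMeans(mean_lengths)

# Flag the rows with the shortest averaged path length (assumed 1% cutoff).
cutoff <- quantile(train_df$avg_mean_length, 0.01)
train_df$anomaly <- train_df$avg_mean_length <= cutoff

# Inspect the most anomalous rows first.
head(train_df[order(train_df$avg_mean_length), ])
```

Averaging is used here because a row that looks anomalous in only one randomly built forest gets pulled back toward the bulk of the data, while a row that all three forests isolate quickly stays at the top of the list. New (testing) data could be scored the same way by passing another H2O frame to `h2o.predict`.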