Showing posts from November, 2019

Using R and H2O Isolation Forest For Data Quality

Introduction: We will identify anomalous patterns in data. This process is useful not only for finding inconsistencies and errors but also for detecting abnormal data behavior, and it can even help uncover cyber attacks on organizations. For more background, see this article: Data Quality and Anomaly Detection Thoughts For Web Analytics

Before starting, we need the following software installed and working:

- R language.
- H2O open source framework.
- Java 8 (for H2O). OpenJDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
- RStudio.

About the data used in this article.

# I am using https://www.kaggle.com/bradklassen/pga-tour-20102018-data
# The version I have is not the most recent one, but a newer version
# may be used.

I am leaving the paragraph above as a note; please refer to the next paragraph.

NOTE: There was a problem with the data from the link above, so I created some synthetic data that can be downloaded from thi