
Predicting Car Battery Failure With R And H2O - Study

# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) ) 
suppressWarnings( suppressMessages( library( data.table ) ) )
suppressWarnings( suppressMessages( library( plotly ) ) )
suppressWarnings( suppressMessages( library( DT ) ) )

# Reading data file
# Data from: https://www.kaggle.com/yunlevin/levin-vehicle-telematics
dataFileName = "/Development/Analytics/AnomalyDetection/AutomovileFailurePrediction/v2.csv"
carData = fread( dataFileName, skip=0, header = TRUE )
carBatteryData = data.table( TimeStamp = carData$timeStamp
                             , BatteryVoltage = as.numeric( carData$battery ) 
                            )
rm(carData)

# Data cleaning, filtering and conversion
carBatteryData = na.omit( carBatteryData ) # Keeping only valid values ( removing rows with NA )

# According to this article: 
# https://shop.advanceautoparts.com/r/advice/car-maintenance/car-battery-voltage-range
#
# A perfect voltage ( without any devices or electronic systems plugged in )
# is between 13.7 and 14.7V.
# If the battery isn't fully charged, it will diminish to 12.4V at 75%,
# 12V when it's only operating at 25%, and down to 11.9V when it's completely discharged.
#
# Battery voltage while a load is connected is much lower:
# it should be something between 9.5V and 10.5V.
#
# This voltage interval ensures that your battery can store and deliver enough
# current to start your car and power all your electronics and electric devices
# without any difficulty.
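# As a side illustration ( not part of the original analysis ), the ranges quoted above
# can be turned into a simple labeling helper. The function name and the labels are
# assumptions; the cut points come from the article referenced above.
labelBatteryState = function( voltage ) {
  cut( voltage
       , breaks = c( -Inf, 9.5, 10.5, 11.9, 12.4, 13.7, 14.7, Inf )
       , labels = c( "Below load range", "Normal under load", "Discharged"
                     , "Low charge", "Partial/Resting charge", "Ideal charging"
                     , "Above ideal range" )
      )
}
# Example usage: labelBatteryState( c( 10.1, 12.5, 14.1 ) )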

carBatteryData = carBatteryData[BatteryVoltage >= 9.5] # Keeping only voltages greater than or equal to 9.5V
carBatteryData$TimeStamp = as.POSIXct( paste0( substr(carBatteryData$TimeStamp,1,17),"00" ) ) # Truncating timestamps to the minute ( seconds set to "00" )
carBatteryData = unique(carBatteryData) # Removing duplicate voltage readings
carBatteryData = carBatteryData[order(TimeStamp)]


# Splitting the data: the last date is used for testing and the rest for training.
lastDate = max( as.Date( format( carBatteryData$TimeStamp, "%Y-%m-%d" ) ) )
trainingData = carBatteryData[ as.Date( format( carBatteryData$TimeStamp, "%Y-%m-%d" ) ) != lastDate ]
testingData = carBatteryData[ as.Date( format( carBatteryData$TimeStamp, "%Y-%m-%d" ) ) == lastDate ]
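
# A quick sanity check of the split ( a minimal sketch, not in the original post ):
# number of readings and time range covered by each subset.
cat( "Training rows:", nrow( trainingData )
     , "- from", format( min( trainingData$TimeStamp ) )
     , "to", format( max( trainingData$TimeStamp ) ), "\n" )
cat( "Testing rows:", nrow( testingData ), "- last day:", format( lastDate ), "\n" )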



################################################################################
# Creating Anomaly Detection Model
################################################################################

  h2o.init( nthreads = -1, max_mem_size = "5G" )
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\LaranIkal\AppData\Local\Temp\Rtmp6lTw4H/h2o_LaranIkal_started_from_r.out
##     C:\Users\LaranIkal\AppData\Local\Temp\Rtmp6lTw4H/h2o_LaranIkal_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 899 milliseconds 
##     H2O cluster timezone:       America/Mexico_City 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.24.0.2 
##     H2O cluster version age:    1 month and 7 days  
##     H2O cluster name:           H2O_started_from_R_LaranIkal_tzd452 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   4.44 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.6.0 (2019-04-26)
  h2o.no_progress() # Disable progress bars for Rmd
  h2o.removeAll() # Cleans h2o cluster state.
## [1] 0
  # Convert the training dataset to H2O format.
  trainingData_hex = as.h2o( trainingData[,2], destination_frame = "train_hex" )
  
  # Build an Isolation forest model
  trainingModel = h2o.isolationForest( training_frame = trainingData_hex
                                       , sample_rate = 0.1
                                       , max_depth = 32
                                       , ntrees = 100
                                      )
  
  # According to H2O doc: 
  # http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
  #
  # Isolation Forest is similar in principle to Random Forest and is built on the basis of decision trees. 
  
  # Isolation Forest creates multiple decision trees to isolate observations.
  # 
  # Trees are split randomly. The assumption is that:
  #   
  #   IF ONE UNIT'S MEASUREMENTS ARE SIMILAR TO THE OTHERS,
  #   IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
  # 
  #   The fewer splits needed, the more likely the unit is to be anomalous.
  # 
  # The average number of splits is then used as a score.

  # Calculate score for training dataset
  score <- h2o.predict( trainingModel, trainingData_hex )
  result_pred <- as.vector( score$predict )
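
  # Optional check ( a sketch, not part of the original code ): inspect the distribution
  # of the anomaly scores in the "predict" column, to see where a high quantile
  # threshold will fall before it is chosen in the next step.
  print( summary( result_pred ) )
  print( quantile( result_pred, probs = c( 0.90, 0.95, 0.99, 0.995 ) ) )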


################################################################################
# Setting threshold value for anomaly detection.
################################################################################

  # Setting desired threshold percentage.
  threshold = .995 # Assume 99.5% of the voltage readings are normal
  
  # Using the above threshold to get the score limit used to flag anomalous voltage readings.
  scoreLimit = round( quantile( result_pred, threshold ), 4 )
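
  # Sanity check ( a sketch based on the assumption above ): with a 0.995 quantile,
  # roughly 0.5% of the training readings should score above the limit.
  cat( "Score limit:", scoreLimit
       , "- training readings above the limit:", sum( result_pred > scoreLimit )
       , "of", length( result_pred ), "\n" )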
  

  
################################################################################
# Get anomalous voltage readings from the testing data, using the model and the scoreLimit obtained from the training data.
################################################################################

  # Convert testing data frame to H2O format.
  testingDataH2O = as.h2o( testingData[,2], destination_frame = "testingData_hex" )
  
  # Get score using training model
  testingScore <- h2o.predict( trainingModel, testingDataH2O )

  # Add row score at the beginning of testing dataset
  testingData = cbind( RowScore = round( as.vector( testingScore$predict ), 4 ), testingData )

  # Check if there are anomalous voltage readings from testing data
  anomalies = testingData[ testingData$RowScore > scoreLimit, ]
  # An additional filter is applied before recommending maintenance:
  # If there are more than 3 anomalous voltage readings, display an alert.
  if( dim( anomalies )[1]  > 3 ) { 
    cat( "Show alert on car display: Battery got anomalous voltage readings, it is recommended to take it to service." )
    
    plot_ly( data = anomalies
             , x = ~TimeStamp
             , y = ~BatteryVoltage
             , type = 'scatter'
             , mode = "lines"
             , name = 'Anomalies') %>%
      layout( yaxis = list( title = 'Battery Voltage.' )
              , xaxis = list( categoryorder='trace', title = 'Date - Time.' )
               )
  }
## Show alert on car display: Battery shows anomalous voltage readings; service is recommended.
if( dim( anomalies )[1]  > 3 ) { 
  DT::datatable(anomalies[,c(2,3)], rownames = FALSE )
}
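
# To put the anomalies in context, the flagged readings can be overlaid on the full
# last-day series. This is a sketch, not part of the original post; it reuses the
# testingData and anomalies objects created above.
if( dim( anomalies )[1] > 3 ) {
  plot_ly( data = testingData
           , x = ~TimeStamp
           , y = ~BatteryVoltage
           , type = 'scatter'
           , mode = 'lines'
           , name = 'Battery Voltage' ) %>%
    add_trace( data = anomalies
               , x = ~TimeStamp
               , y = ~BatteryVoltage
               , type = 'scatter'
               , mode = 'markers'
               , name = 'Anomalies' ) %>%
    layout( yaxis = list( title = 'Battery Voltage.' )
            , xaxis = list( title = 'Date - Time.' ) )
}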
Using this approach we may prevent failures in cars, not only in batteries but in many other cases where sensors are used.
Carlos Kassab