
Going from zero to R-Analytics with your team

Before continuing with the posts about how to do things with R, I have decided to describe how I led the creation of an analytics team, starting from zero.

My only intention here is for this information to be useful to companies looking to create their own analytics team.

Well, first things first: the people for your team.

How many people you need will depend on your company's size and on how much money the company wants to invest in creating the analytics team, but you must have at least 2 developers; just make sure they are real developers.

Now, you need to define the software to use.

Database: here I suggest starting small, say with MariaDB, or with SQL Server if you want to pay for a license. Just make sure you can run analytic functions, also known as window functions; they are very useful when doing analytics. Please read this article to learn more about window functions: https://dzone.com/articles/mysql-8-vs-mariadb-comparison-of-window-functions
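If you have not used them before, a window function computes an aggregate over a group of related rows while keeping every individual row in the result. As a rough sketch, this is how a per-group average and a running total could be queried from PowerShell (this assumes the SqlServer module for database access; the server, table, and column names are hypothetical, so adjust them to your own schema):

```powershell
# Minimal sketch: running a window-function query from PowerShell.
# Assumes the SqlServer module (Install-Module SqlServer) and a reachable
# database; table and column names below are hypothetical.
Import-Module SqlServer

$query = @"
SELECT
    ProductLine,
    ProductionDate,
    UnitsProduced,
    -- Average units produced per product line, repeated on every row
    AVG(UnitsProduced) OVER (PARTITION BY ProductLine) AS AvgUnitsPerLine,
    -- Running total by date within each product line
    SUM(UnitsProduced) OVER (PARTITION BY ProductLine
                             ORDER BY ProductionDate)  AS RunningTotal
FROM dbo.DailyProduction;
"@

Invoke-Sqlcmd -ServerInstance "MyDbServer" -Database "Analytics" -Query $query |
    Format-Table -AutoSize
```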

ETL software: unless you want to pay for a license from IBM, Microsoft, or someone else, my suggestion is to use a simple language, as we did with PowerShell, and create your basic libraries, for example:

- SQL functions to read data from and write data to your database.
- Log functions to write to and delete from the log text files.
- E-mail functions to send simple emails, with or without attachments.

The point here is to create all the basic functions that ETL software needs.
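To make this concrete, here is a minimal sketch of what such a library could look like. The function and parameter names are illustrative, not the actual code we used; it assumes the SqlServer module for database access and an internal SMTP server for notifications.

```powershell
# Minimal sketch of a basic ETL helper library; names are illustrative.
Import-Module SqlServer   # provides Invoke-Sqlcmd

function Get-EtlData {
    # Read data from the database and return it as objects.
    param([string]$Server, [string]$Database, [string]$Query)
    Invoke-Sqlcmd -ServerInstance $Server -Database $Database -Query $Query
}

function Write-EtlLog {
    # Append a timestamped entry to a plain-text log file.
    param([string]$LogFile, [string]$Message)
    "$(Get-Date -Format 'yyyy-MM-dd HH:mm:ss') $Message" | Add-Content -Path $LogFile
}

function Remove-OldEtlLogs {
    # Delete log files older than a given number of days.
    param([string]$LogFolder, [int]$Days = 30)
    Get-ChildItem -Path $LogFolder -Filter *.log |
        Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-$Days) } |
        Remove-Item
}

function Send-EtlMail {
    # Send a simple notification e-mail, optionally with attachments.
    param([string]$To, [string]$Subject, [string]$Body, [string[]]$Attachments)
    $mailArgs = @{
        SmtpServer = "smtp.mycompany.local"   # hypothetical internal SMTP server
        From       = "etl@mycompany.local"
        To         = $To
        Subject    = $Subject
        Body       = $Body
    }
    if ($Attachments) { $mailArgs.Attachments = $Attachments }
    Send-MailMessage @mailArgs
}
```

With pieces like these in place, each ETL script becomes a short sequence of read, transform, write, log, and notify calls.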

One more important reason why I like PowerShell is Visual Studio Code, a very useful tool for editing and debugging PowerShell scripts.

Besides the ETL software, it is important to build the database structure used to configure the ETLs. For this I designed something called AMS (Automatic Monitoring System); it holds the ETL configuration and logs, and with the help of the team a web tool was created to configure each ETL we built.
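I am not describing the actual AMS schema here, but as a rough idea, the configuration side of such a system can start as simply as one table of ETL definitions plus one run log, created from PowerShell like everything else. The table and column names below are hypothetical and use SQL Server syntax:

```powershell
# Hypothetical sketch of the kind of tables an ETL configuration/monitoring
# database could hold; this is not the actual AMS schema.
$ddl = @"
CREATE TABLE EtlJob (
    EtlJobId     INT          NOT NULL PRIMARY KEY,
    Name         VARCHAR(100) NOT NULL,
    SourceQuery  VARCHAR(MAX) NOT NULL,  -- query or file pattern to extract
    TargetTable  VARCHAR(100) NOT NULL,  -- destination table to load
    Schedule     VARCHAR(50)  NOT NULL,  -- e.g. 'daily 06:00'
    NotifyEmail  VARCHAR(200) NULL,
    IsEnabled    BIT          NOT NULL DEFAULT 1
);

CREATE TABLE EtlRunLog (
    EtlRunLogId  INT IDENTITY(1,1) PRIMARY KEY,
    EtlJobId     INT          NOT NULL REFERENCES EtlJob (EtlJobId),
    StartedAt    DATETIME     NOT NULL,
    FinishedAt   DATETIME     NULL,
    RowsLoaded   INT          NULL,
    Status       VARCHAR(20)  NOT NULL,  -- 'Running', 'Succeeded', 'Failed'
    ErrorMessage VARCHAR(MAX) NULL
);
"@

Invoke-Sqlcmd -ServerInstance "MyDbServer" -Database "AMS" -Query $ddl
```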

Dashboard software: I like Tableau, but you can also find open-source options: https://opensource.com/business/16/11/open-source-dashboard-tools-visualizing-data

Documentation: you will need at least 2 kinds of documents:

- A basic spreadsheet to create an activities plan; you really do not need anything complex, nor do you need to pay for special software.
- A Word or LibreOffice document template to capture requirement specifications from the users. Having this document is very important; it is very useful for everybody to see the same thing. If you do not create a requirement specification, it will cause confusion about what really needs to be developed.

At this point we have all the elements needed for a basic business intelligence team; this is the first part. For this first part, if your developers are good, they will not need any paid training: putting all of the above to work requires only basic knowledge that the developers can learn on their own.

Now, for each project, you need to involve a key user from the company's operations and an analyst, the person who serves as the interface between operations and your analytics team.

Converting your Business Intelligence team into an Advanced Analytics team

Learning how to do advanced analysis of your data using mathematical algorithms requires specific training: not only to understand how the algorithm used in each scenario works, but also to know how to create models and how to do the different kinds of analysis required before modeling. That is why there are now new positions in the analytics team, such as data engineer, data scientist, etc.

The first thing is to get good training. We used promidat.com; they are people with good knowledge and experience in training. The site is in Spanish, but if you need it, they can deliver the training in English.

It is important to have more than one person taking the training at the same time because of the feedback they can give each other.

You can also use one of the many training sites such as DataCamp, which is good, but nothing compares to official training.

It is important that, at the end of the training, your team creates real projects that are useful for the company, not just exercises to pass the training level. By doing this, you ensure that everything learned is applied to practical cases in the company.

It took 2 and a half years to carry out this strategy, because nobody taught me how to do it; I had to figure out which strategies to follow and guide the team in the right direction. I think that, knowing the strategy, it might take about 1.5 years.


Enjoy it!

Carlos Kassab
https://www.linkedin.com/in/carlos-kassab-48b40743/







