
Rain prediction model using machine learning

Problem Statement

Can we predict whether it will rain tomorrow or not using data?  

Solution: A classification model (Logistic Regression) built with Machine Learning can be used to forecast whether it will rain tomorrow or not.

The post aims to convey things that can be achieved with the help of Machine Learning, and as always there is room for improvements/suggestions. 

Dataset: This dataset contains about 10 years of daily weather observations. Source of Data 

In this post, I’ll briefly explain the different sections, and later on, I’ll write a more detailed post on each section.

Social Impact  

Australia has gone through one of the worst bushfires in NSW. Drought and water scarcity have been persistent problems. A machine learning model with a reasonable level of prediction accuracy would help in making sure an adequate amount of resources can be allocated for rainwater harvesting.   

Exploratory Data Analysis 

Summary of Data: Rows and Columns

This dataset consists of 142,191 rows and 24 columns, with RainTomorrow being the dependent variable.

Figure 1 Snapshot of Weather Dataset

Descriptive Statistics 

Figure 2 Descriptive Statistics 

  • Min_Temp ranges from -8.5 to 33 with a standard deviation of 6.4
  • The hottest day in the data reached 48 degrees Celsius
  • On average, wind speed remains pretty similar at 9 am and 3 pm.

More insights can be derived from descriptive statistics. The idea is to get a feel for the data; later on, depending on requirements, different parameters can be assessed.
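As a rough sketch of how such a summary can be produced, the snippet below calls pandas `describe()` on a tiny made-up frame. The values and the wind speed column names (`WindSpeed9am`, `WindSpeed3pm`) are assumptions for illustration, not the actual weather data.

```python
import pandas as pd

# Made-up stand-in for the weather data (illustrative values only).
df = pd.DataFrame({
    "Min_Temp": [-8.5, 10.0, 12.5, 33.0],
    "WindSpeed9am": [9.0, 13.0, 7.0, 11.0],   # hypothetical column name
    "WindSpeed3pm": [11.0, 9.0, 13.0, 7.0],   # hypothetical column name
})

# describe() returns count, mean, std, min, quartiles and max per numeric column.
summary = df.describe()
print(summary.loc[["min", "max", "mean", "std"]])
```

On the real dataset the same call would be made on the loaded DataFrame.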

Missing Value Imputation 

There are different ways of handling missing values in the data. We can delete those observations or can fill them with statistical measures. In this case, statistical measures like mode and mean have been used to replace missing values in categorical and numerical variables, respectively. 

Figure 3 Number of Observations with Missing Values in Different Variables 
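The imputation rule described above (mean for numerical variables, mode for categorical ones) could look roughly like this in pandas. The frame is made up, and `Wind_Dir` and the `impute` helper are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the weather data (illustrative values only).
df = pd.DataFrame({
    "Min_Temp": [12.0, np.nan, 9.5, 15.0],
    "Wind_Dir": ["N", "SW", None, "SW"],   # hypothetical categorical column
})

def impute(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their mean and all other columns with their mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

print(df.isnull().sum())   # number of missing values per column before imputation
filled = impute(df)
print(filled)
```

In a modelling workflow, the mean and mode would usually be computed on the training data only and then reused for the test data.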

Relationship between Variables  

Portland, Walpole, and Cairns have a higher probability of rain compared to other areas. Similarly, months 6 to 8 (June to August) receive more rain. This indicates that location and month should be included in the final model.

Figure 4 Effect of Month and Location on Rain Tomorrow 
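One way to check the effect of location and month is to compare the share of rainy next days per group. The sketch below uses a four-row made-up frame; the `Date` column, the `Yes`/`No` coding of the target, and the `RainFlag` helper column are assumptions.

```python
import pandas as pd

# Made-up stand-in for the weather data (illustrative values only).
df = pd.DataFrame({
    "Date": ["2015-06-01", "2015-06-02", "2015-12-01", "2015-12-02"],
    "Location": ["Portland", "Portland", "Cairns", "Cairns"],
    "RainTomorrow": ["Yes", "Yes", "No", "Yes"],
})

# Derive the month from the date and turn the target into 0/1.
df["Month"] = pd.to_datetime(df["Date"]).dt.month
df["RainFlag"] = (df["RainTomorrow"] == "Yes").astype(int)

# The mean of a 0/1 flag is the fraction of rows where it rains the next day.
rain_by_location = df.groupby("Location")["RainFlag"].mean()
rain_by_month = df.groupby("Month")["RainFlag"].mean()
print(rain_by_location.sort_values(ascending=False))
print(rain_by_month)
```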

Skewness in Numerical Variables 

Skewness tends to harm the analysis, and transformations such as log or square root can be used to make skewed variables more normally distributed.

Figure 5 Skewness in Variables 
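A minimal sketch of measuring skewness and applying the two transformations mentioned above. The right-skewed series is synthetic (drawn from an exponential distribution), and the name `Rainfall` is only a placeholder.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed variable (not the real data).
rng = np.random.default_rng(0)
rainfall = pd.Series(rng.exponential(scale=5.0, size=1000), name="Rainfall")

skew_before = rainfall.skew()
skew_log = np.log1p(rainfall).skew()    # log(1 + x), which is defined at x = 0
skew_sqrt = np.sqrt(rainfall).skew()

print(f"raw: {skew_before:.2f}, log1p: {skew_log:.2f}, sqrt: {skew_sqrt:.2f}")
```

`log1p` is used instead of a plain log because variables like rainfall can be exactly zero.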

Correlation Between Variables 

Correlation helps us see how strongly each independent variable is related to the dependent variable, and it also helps us identify and remove independent variables that are highly correlated with each other.

Figure 6 Highly Correlated Variables 

Min_Temp and Temp9am are highly correlated, which indicates that only one of those variables should be included in our model.
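Dropping one variable from each highly correlated pair can be sketched as follows. The data is synthetic, with Temp9am deliberately built to track Min_Temp; the 0.9 threshold and the `Humidity9am` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the real weather data).
rng = np.random.default_rng(0)
min_temp = rng.normal(12, 6, size=500)
df = pd.DataFrame({
    "Min_Temp": min_temp,
    "Temp9am": min_temp + rng.normal(4, 1, size=500),   # tracks Min_Temp by construction
    "Humidity9am": rng.uniform(20, 100, size=500),      # unrelated by construction
})

corr = df.corr().abs()

# Keep only the upper triangle so each pair of variables is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```

This keeps the first variable of each pair and drops the later one; which of the two to keep is a modelling choice.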


Splitting the data into Training and Test Set 

In most machine learning workflows, the analysis and model fitting are done on the training data, and the best model is then evaluated on the held-out test data.

Figure 7 Splitting data into Training and Test 
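A split like this is commonly done with scikit-learn's `train_test_split`. The sketch below uses synthetic features and labels; the 80/20 ratio, the stratification, and the random seed are assumptions, not necessarily the settings used for the figure.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced 0/1 target (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.22).astype(int)

# Hold out 20% for testing; stratify keeps the rain / no-rain ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```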

Model Building

Different algorithms can be used for making the predictive model. I’ll be using simple logistic regression for demonstrations. A similar approach can be used for applying more sophisticated algorithms like random forest, decision trees, XGBoost, etc. 

Logistic Regression

A simple version of logistic regression, without changing the parameter settings, is applied to the training set and later evaluated on the test set.

Figure 8 Logistic Regression 
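Fitting a logistic regression with default settings might look like the sketch below, assuming scikit-learn. The two features are synthetic and already standardized, and the rule that generates the labels is made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: two standardized features and a 0/1 target (not the real data).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
logit = 2.0 * X[:, 0] - 1.0 * X[:, 1]            # made-up relationship
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()    # default parameter settings
model.fit(X_train, y_train)     # fit on the training set only
y_pred = model.predict(X_test)  # predictions for the held-out test set
print(model.coef_)
```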

Model Evaluation  

Different evaluation metrics can be used based on the problem and industry. In this case, the accuracy score has been used. 

Figure 9 Accuracy of Prediction Model 

This simple algorithm has a predictive accuracy of 99.76%.
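For reference, the accuracy score is the share of predictions that match the true labels. The sketch below computes it with scikit-learn on ten made-up labels and prints a confusion matrix alongside, since accuracy on its own can hide errors on the less frequent class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = rain tomorrow, 0 = no rain tomorrow.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)   # correct predictions / all predictions
cm = confusion_matrix(y_true, y_pred)       # rows: true class, columns: predicted class

print(f"accuracy: {accuracy:.2f}")
print(cm)
```

Here 8 of the 10 predictions are correct, which corresponds to the diagonal of the confusion matrix (6 + 2).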

Model Improvement

Hyperparameter tuning can be done to achieve better results. Advanced techniques like model stacking and deep learning models can be used in different settings.
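As one possible illustration of hyperparameter tuning, the sketch below runs a cross-validated grid search over the regularization strength `C` of a logistic regression with scikit-learn's `GridSearchCV`. The data is synthetic, and the grid values and five folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (not the real weather data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Candidate values for C; smaller C means stronger regularization.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The search should be run on the training data only, so that the test set stays untouched for the final evaluation.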

If you would like to speak to Decision Inc. about Machine Learning and how it can relate to your business… not just the weather, then please get in touch.

Rajat Gupta
Decision Inc. Australia
