Problem Statement
Can we predict whether it will rain tomorrow or not using data?
Solution: A classification model (logistic regression) built with Machine Learning can be used to forecast whether it will rain tomorrow or not.
The post aims to convey things that can be achieved with the help of Machine Learning, and as always there is room for improvements/suggestions.
Dataset: This dataset contains about 10 years of daily weather observations. Source of Data
In this post, I’ll briefly explain the different sections, and later on, I’ll write a more detailed post on each section.
Social Impact
Australia has gone through one of the worst bushfires in NSW. Drought and water scarcity have been persistent problems. A machine learning model with a reasonable level of prediction accuracy would help in making sure an adequate amount of resources can be allocated for rainwater harvesting.
Exploratory Data Analysis
Summary of Data: Rows and Columns
This dataset consists of 142191 rows and 24 columns, with RainTomorrow being the dependent variable.

Figure 1: Snapshot of Weather DataSet
Descriptive Statistics

Figure 2: Descriptive Statistics
- Min_Temp ranges from -8.5 to 33 with a standard deviation of 6.4
- The hottest day recorded in the dataset reached 48 degrees
- On average Wind speed remains pretty similar at 9 am and 3 pm.
More insights can be derived from descriptive statistics. The idea is to get a feel for the data; later on, different parameters can be assessed depending on requirements.
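As a rough sketch of how a summary like this can be produced, pandas' `describe()` returns the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column. The small table below is made-up illustrative data, not the real weather dataset, and the column names other than Min_Temp are assumptions.

```python
import pandas as pd

# Toy stand-in for the weather data; values are made up for illustration only.
df = pd.DataFrame({
    "Min_Temp": [13.4, 7.4, 12.9, 9.2, 17.5],
    "Max_Temp": [22.9, 25.1, 25.7, 28.0, 32.3],      # assumed column name
    "WindSpeed9am": [20.0, 4.0, 19.0, 11.0, 7.0],    # assumed column name
    "WindSpeed3pm": [24.0, 22.0, 26.0, 9.0, 20.0],   # assumed column name
})

# Number of rows and columns
print(df.shape)

# Count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary.loc[["min", "max", "mean", "std"]])
```

On the real data, the same two calls would give the row/column counts and the table shown in Figure 2.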
Missing Value Imputation
There are different ways of handling missing values in the data. We can delete those observations or can fill them with statistical measures. In this case, statistical measures like mode and mean have been used to replace missing values in categorical and numerical variables, respectively.

Figure 3: Number of Observations with Missing Values in Different Variables
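A minimal sketch of this kind of imputation, assuming pandas is used: numerical columns are filled with their mean and categorical columns with their mode. The data and the WindGustDir column name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; values and the WindGustDir column are illustrative.
df = pd.DataFrame({
    "Min_Temp": [13.4, np.nan, 12.9, 9.2],
    "WindGustDir": ["W", "WNW", None, "W"],
})

# Count missing values per column before imputation
missing_before = df.isnull().sum()
print(missing_before)

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Numerical variables: replace missing values with the column mean
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Categorical variables: replace missing values with the most frequent value
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

Strictly speaking, the fill values should be computed on the training data only and then reused on the test data, so that nothing from the test set leaks into the model.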
Relationship between Variables
Portland, Walpole and Cairns have a higher probability of rain compared with other areas. Similarly, months 6 to 8 (June to August) receive more rain. This suggests that both location and month are worth including in the final model.

Figure 4: Effect of Month and Location on Rain Tomorrow
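One way to get at this relationship is to compute the share of rainy next days per location and per month with a `groupby`. The sketch below uses a handful of made-up rows; the Date and Location column names and the Yes/No coding of RainTomorrow are assumptions about the dataset layout.

```python
import pandas as pd

# Made-up rows for illustration; not real observations.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-06-01", "2015-06-02", "2015-07-01",
        "2015-01-10", "2015-01-11", "2015-02-03",
    ]),
    "Location": ["Portland", "Portland", "Cairns", "Cairns", "Uluru", "Uluru"],
    "RainTomorrow": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# Derive the month and a 0/1 flag for the target
df["Month"] = df["Date"].dt.month
df["RainFlag"] = (df["RainTomorrow"] == "Yes").astype(int)

# The mean of a 0/1 flag is the observed proportion of rainy next days
rain_by_location = df.groupby("Location")["RainFlag"].mean().sort_values(ascending=False)
rain_by_month = df.groupby("Month")["RainFlag"].mean()

print(rain_by_location)
print(rain_by_month)
```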
Skewness in Numerical Variables
Skewness tends to harm the analysis, and transformations such as log or square root can be used to make the variables more normally distributed.

Figure 5: Skewness in Variables
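To illustrate the idea, the sketch below measures skewness with pandas and applies a log transformation to a right-skewed variable. The rainfall values are made up, and `np.log1p` (log of 1 + x) is my choice here because it copes with zero values; the post does not say which exact transformation was applied.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values: mostly dry days and a couple of heavy downpours.
rainfall = pd.Series([0.0, 0.0, 0.2, 0.0, 1.0, 0.6, 0.0, 12.4, 0.0, 45.0], name="Rainfall")

# Positive skewness means a long right tail
skew_before = rainfall.skew()

# log1p computes log(1 + x), so zeros stay at zero instead of becoming -inf
rainfall_log = np.log1p(rainfall)
skew_after = rainfall_log.skew()

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The transformed variable is still skewed, but less so than the original, which is usually the practical goal.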
Correlation Between Variables
Correlation helps us see how the independent variables relate to the dependent variable, and it also helps us remove independent variables that are highly correlated with each other.

Figure 6: Highly Correlated Variables
Min_Temp and Temp9am are highly correlated, which indicates that only one of those variables should be included in our model.
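A small sketch of how highly correlated pairs can be found and one variable of each pair dropped. The data is made up so that Temp9am closely tracks Min_Temp; the 0.9 threshold and the Humidity3pm column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; Temp9am is constructed to follow Min_Temp closely.
df = pd.DataFrame({
    "Min_Temp": [13.4, 7.4, 12.9, 9.2, 17.5, 14.6],
    "Temp9am": [16.9, 10.1, 15.0, 12.1, 20.3, 17.8],
    "Humidity3pm": [22.0, 25.0, 30.0, 16.0, 33.0, 23.0],  # assumed column name
})

# Absolute pairwise Pearson correlations
corr = df.corr().abs()

# Keep only the upper triangle so each pair is looked at once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second variable of every pair above the (assumed) 0.9 threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```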
Splitting the data into Training and Test Set
In most machine learning modelling, the model is built and tuned on the training data, and the best-performing model is then evaluated on the test data, which it has not seen before.

Figure 7: Splitting data into Training and Test
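The split can be sketched with scikit-learn's `train_test_split`. The features and target below are randomly generated stand-ins, and the 80/20 ratio, the `random_state` and the use of `stratify` are assumptions rather than the settings used in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features (X) and the 0/1 rain target (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Hold out 20% of the rows for testing; stratify keeps the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```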
Model Building
Different algorithms can be used for making the predictive model. I’ll be using simple logistic regression for demonstrations. A similar approach can be used for applying more sophisticated algorithms like random forest, decision trees, XGBoost, etc.
Logistic Regression
A simple version of logistic regression, with the default parameter settings, is fitted to the training set and later evaluated on the test set.

Figure 8: Logistic Regression
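For readers who want to try this, here is a minimal sketch of fitting a default logistic regression with scikit-learn. The data is synthetic (two random features and a noisy 0/1 target), so it only illustrates the workflow, not the weather model itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features and a noisy binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default settings, no tuning
model = LogisticRegression()
model.fit(X_train, y_train)

# Hard 0/1 predictions and the predicted probability of class 1 ("rain")
y_pred = model.predict(X_test)
rain_probability = model.predict_proba(X_test)[:, 1]

print(y_pred[:5], rain_probability[:5].round(2))
```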
Model Evaluation
Different evaluation metrics can be used based on the problem and industry. In this case, the accuracy score has been used.

Figure 9: Accuracy of Prediction Model
This simple algorithm has a predictive accuracy of 99.76% on the test set.
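Accuracy is simply the share of predictions that match the actual outcome. A tiny hand-made example with scikit-learn's `accuracy_score` (the labels below are invented, with 1 meaning rain and 0 meaning no rain):

```python
from sklearn.metrics import accuracy_score

# Invented labels for illustration: 1 = rain tomorrow, 0 = no rain.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 1]

# 6 of the 8 predictions match the true labels
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.75
```

Because rainy days are usually the minority, it is worth comparing the accuracy against always predicting "no rain" before reading too much into the number.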
Model Improvement
Hyperparameter tuning can be done to achieve better results. Advanced techniques like model stacking and deep learning models can be used in different settings.
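As one possible starting point, hyperparameter tuning for logistic regression can be sketched with scikit-learn's `GridSearchCV`. The data is synthetic again, and the grid over the regularisation strength `C` and the 5-fold cross-validation are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: two features and a noisy binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Candidate values for C (inverse regularisation strength); an assumed grid
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Try every candidate with 5-fold cross-validation and keep the best by accuracy
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```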
If you would like to speak to Decision Inc. about Machine Learning and how it can relate to your business… not just the weather, then please get in touch.