My background is in civil engineering and I have worked as a site engineer, so it feels great to write about an industry that I am so familiar with. One of the main things civil engineers strive to do is to make sure that the concrete used in construction achieves its maximum characteristic strength (measured after 1 day, 7 days, 28 days and over the long term). Different factors affect the strength of concrete, such as the amounts of cement, ash, slag, superplasticiser and water, to name but a few.
Figure 1 Snapshot of Dataset
This dataset will help explain the application of machine learning and predictive modelling in the construction industry.
Source of Dataset: github.com/stedy/Machine-Learning-with-R-datasets
In Part 1 I will cover in-depth exploratory data analysis, which will form the foundation of future posts covering feature engineering and modelling techniques such as linear regression, random forest, lasso regression and neural networks.
As the name suggests, predictive analytics tries to predict an outcome based on historical data. It has received a lot of attention in recent years due to advancements in computing techniques and the increase in computing power. The idea behind this series of articles is to show how we can use machine learning algorithms on this dataset and the value they can generate for the business.
My goal: Build a Regression Model for predicting strength (target variable) of concrete
Snapshot of the Dataset:
It is good practice to get familiar with the dataset before analysing it.
We have different columns like cement, slag, water, ash, strength etc.
Figure 2 Descriptive Statistics
- The average strength of concrete is 35.8 with a standard deviation of 16.7 and a range of 2.3 – 82.6.
- Cement is the most expensive component and has the highest variation (standard deviation of 104.50) among the variables.
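A minimal sketch of how descriptive statistics like these could be produced with pandas. The handful of rows below are illustrative only and the file name in the comment is an assumption; the column names follow those mentioned in this post.

```python
import pandas as pd

# Illustrative rows only; the real data would be loaded with something like
# pd.read_csv("concrete.csv") (file name assumed, not taken from the post).
df = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6, 266.0],
    "water": [162.0, 228.0, 192.0, 228.0],
    "age": [28, 270, 360, 90],
    "strength": [79.99, 40.27, 44.30, 47.03],
})

# describe() returns count, mean, std, min, quartiles and max per column;
# transposing gives one row per variable, which is easier to scan.
summary = df.describe().T
print(summary[["mean", "std", "min", "max"]])
```

Reading the `std` column side by side is a quick way to spot which variable varies the most.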
Exploratory Data Analysis (EDA)
We usually spend 60-80% of our time on EDA, depending on the project. The following are generic activities, and I have used Pandas Profiling to perform most of the EDA. Here is a link to the package: https://github.com/pandas-profiling/pandas-profiling
- Imputing Missing Values
We usually impute missing values with summary statistics such as the median or mean for numerical variables and the mode for categorical variables. In some cases, when more than 70-80% of the values in a variable are missing, we can delete that variable, depending on how much information it contains. However, we do not have any missing values in this dataset.
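Since this dataset has nothing to impute, here is a minimal sketch of the approach on a made-up frame. The `grade` and `mostly_empty` columns are hypothetical, and the 70% cut-off is just one choice within the range mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps, purely for illustration.
df = pd.DataFrame({
    "cement": [540.0, np.nan, 198.6, 266.0, 380.0],
    "grade": ["A", "B", None, "B", "B"],                    # hypothetical categorical column
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
})

# Share of missing values per column; drop columns above the chosen threshold.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.7].index)

# Median for numeric columns, mode for everything else.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```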
Figure 3 Missing Values in the Dataset
- One Hot Encoding
Most machine learning algorithms do not understand text, so we need to convert it to numbers. Although we do not have any categorical variable as such in this dataset, if we dig further, we find that the age column behaves more like a categorical variable than a continuous numeric one, so I treat it as categorical.
Figure 4 Changing age into categorical data type
Figure 5 One Hot Encoding
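The two steps in the figures above can be sketched roughly as follows, assuming a pandas DataFrame with an `age` column. The rows are illustrative and the `age` prefix is my own choice.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6, 266.0],
    "age": [28, 7, 28, 90],
})

# Step 1: treat age as a categorical variable rather than a number.
df["age"] = df["age"].astype("category")

# Step 2: one hot encode it, giving one indicator column per distinct age.
encoded = pd.get_dummies(df, columns=["age"], prefix="age")
print(encoded.columns.tolist())
```

Each row ends up with exactly one of the `age_*` indicators switched on.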
- Distributions of Variables
Here we look at each variable in-depth and try to form a clear picture of what is happening in the data.
Figure 6 Percentage of Zeros in Columns
Slag, ash and superplastic are add-on materials in concrete and are not used in every mix, which explains why their quantity is so often zero.
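A minimal sketch of how the percentage of zeros per column could be computed with pandas. The rows are made up for illustration.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6, 266.0],
    "slag": [0.0, 142.5, 132.4, 0.0],
    "ash": [0.0, 0.0, 0.0, 94.0],
    "superplastic": [2.5, 0.0, 0.0, 0.0],
})

# (df == 0) is a boolean frame; its column means are the share of zeros.
zero_pct = (df == 0).mean() * 100
print(zero_pct.sort_values(ascending=False))
```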
Figure 7 Distribution of Cement
We can see from the above figure that the distribution of cement is skewed. Many machine learning algorithms work best when the data is approximately normally distributed. We can usually use transformations such as log, sqrt, etc. to remove or reduce the skewness. This will be done in the feature engineering blog post. A similar kind of analysis will be done for each variable.
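To make the idea concrete, here is a small sketch that measures skewness before and after a log and a square-root transformation. The values are made up to be right-skewed and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values standing in for a column such as cement.
cement = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 200.0, 540.0], name="cement")

skew_raw = cement.skew()             # sample skewness of the raw values
skew_sqrt = np.sqrt(cement).skew()   # milder transformation
skew_log = np.log1p(cement).skew()   # log(1 + x), safe when a column contains zeros

print(f"raw={skew_raw:.2f} sqrt={skew_sqrt:.2f} log={skew_log:.2f}")
```

On right-skewed data like this, both transformations pull the long tail in, with the log doing so more strongly than the square root.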
- Scatterplots of Different Variables
Figure 8 Relationship between Strength and other variables
Figure 9 Correlation and HeatMap
As expected, there is a positive relationship between strength and cement quantity. There are some outliers where the cement quantity is more than 500 but the strength is not particularly high. As we are only looking at two variables at a time, it would be challenging to reach a conclusion from this visualisation alone. Adding more water than required can negatively affect the strength of concrete. A similar relationship is observed between strength and coarseagg.
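The numbers behind a correlation heatmap can be computed with pandas alone. This is a sketch on made-up rows, chosen so that cement rises and water falls with strength; it is not the dataset itself.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "cement": [150.0, 200.0, 300.0, 400.0, 500.0],
    "water": [220.0, 200.0, 190.0, 170.0, 160.0],
    "strength": [15.0, 22.0, 35.0, 48.0, 60.0],
})

# Pearson correlation between every pair of columns.
corr = df.corr()

# Correlation of each feature with the target, strongest positive first.
strength_corr = corr["strength"].drop("strength").sort_values(ascending=False)
print(strength_corr)
```

If seaborn is installed, passing `corr` to `seaborn.heatmap` with `annot=True` is one way to draw the heatmap.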
I will focus more on Feature Engineering in a future post. We try to enrich our dataset by making new features out of the existing features.
Techniques used will be:
- Making Polynomial features out of existing features
- Creating new features based on domain knowledge such as the ratio of cement and water.
- Applying different transformations like sqrt, log, etc to the variables
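As a rough preview of these three techniques, here is a sketch using only pandas and NumPy. The rows are illustrative and the new column names are my own, not ones from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6, 266.0],
    "water": [162.0, 228.0, 192.0, 228.0],
})

features = df.copy()

# Polynomial features: a squared term and an interaction term.
features["cement_sq"] = features["cement"] ** 2
features["cement_x_water"] = features["cement"] * features["water"]

# Domain-knowledge feature: the ratio of cement to water.
features["cement_water_ratio"] = features["cement"] / features["water"]

# Transformations of existing variables.
features["cement_log"] = np.log1p(features["cement"])
features["water_sqrt"] = np.sqrt(features["water"])

print(features.round(2))
```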
The flowchart below is a framework to evaluate whether we would like to keep the features in our final model or not. Depending on the increment/decrement of the model’s accuracy, we can decide whether to include a particular feature or not.
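One possible reading of this keep-or-drop decision, sketched with NumPy. The assumptions are mine: "accuracy" is measured as R² on a held-out slice, the model is ordinary least squares, the data is synthetic, and `holdout_r2` and `keep_feature` are hypothetical helper names.

```python
import numpy as np

def holdout_r2(X, y, n_train):
    """Fit least squares on the first n_train rows and return R^2 on the rest."""
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    coef, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    resid = y[n_train:] - X[n_train:] @ coef
    total = y[n_train:] - y[n_train:].mean()
    return 1.0 - (resid @ resid) / (total @ total)

def keep_feature(X_base, candidate, y, n_train, min_gain=0.0):
    """Keep the candidate only if it improves the held-out score by more than min_gain."""
    base = holdout_r2(X_base, y, n_train)
    extended = holdout_r2(np.column_stack([X_base, candidate]), y, n_train)
    return extended - base > min_gain, base, extended

# Synthetic data in which strength is driven by the cement/water ratio.
rng = np.random.default_rng(0)
cement = rng.uniform(100, 500, 200)
water = rng.uniform(120, 250, 200)
strength = 40 * cement / water + rng.normal(0, 0.5, 200)

X_base = np.column_stack([cement, water])
keep, r2_base, r2_ext = keep_feature(X_base, cement / water, strength, n_train=150)
print(keep, round(r2_base, 3), round(r2_ext, 3))
```

In practice a cross-validated score and a non-zero `min_gain` would make the decision less sensitive to a single lucky split.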