# Discuss in the report the performance of the linear and nonlinear regression models. Which one performs better? And why?

Computational Civil Engineering

Portfolio - Computational Exercise

Learning outcome 1: Demonstrate competence in programming fundamentals, including structure and best practice.

Learning outcome 2: Apply numerical methods in a programming context to solve common civil engineering problems.

Learning outcome 3: Write programs to generate data for or to solve civil engineering problems.

Learning outcome 4: Critically compare numerical methods and programmes, considering computational efficiency and accuracy of the results.

In this task, you will work with a dataset that contains information on the ingredients (the amount of cement, blast furnace slag, fly ash, water, superplasticizer, and coarse and fine aggregate in the concrete, all in kg/m³) and age of concrete specimens (in days), along with their corresponding compressive strength (MPa). Your goal is to build a model to predict the compressive strength of concrete based on its composition and age (in Part A) and implement it in a MATLAB function (in Part B). Besides the MATLAB code, you will need to produce a structured report with a flowchart or pseudocode of the algorithm underlying your analyses and a detailed discussion of the options taken to complete this task, namely regarding the structure of your code and the built-in MATLAB functions or external libraries you may have used. You must explain each step and why it is necessary. Regarding the code itself, remember that it should be well-structured, as compact as possible, and easy to read.

N.B. The MATLAB code must be saved and submitted in *.m format, and the final report must be presented in a single PDF file. All elements must be in selectable text (i.e., screenshots are unacceptable) and must be submitted through the module's Blackboard page.

To complete Part A, you will need to:

Question 1. Load the dataset into MATLAB and visualize the relationships between each variable and the compressive strength value.
The dataset can be downloaded from the UCI Machine Learning Repository here:

Load it into MATLAB and gain a first understanding of the data by calculating the average and the standard deviation of each column in the dataset. Then, create a plot representing the individual correlation between each variable and the compressive strength of the concrete specimens. To enhance the readability of the output, use a plot with multiple subplots, and add the average and the standard deviation values calculated earlier to the corresponding subplot as a title. Label the axes appropriately and add gridlines and box borders to your plots.

Remember to make your code as compact as possible.
Finally, export the plot produced in this first activity as a *.JPG file and use it in your report to discuss your initial thoughts about the variables that seem to present a stronger correlation with the compressive strength of the concrete, if any.
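
As a rough illustration of the steps in Question 1, the sketch below computes column statistics, draws one subplot per variable, and exports the figure as a JPG. It is only a sketch under stated assumptions: the data are synthetic stand-ins (the real dataset is not loaded here), the matrix layout (eight predictors followed by strength in column 9), the short variable labels, and the output file name `correlations.jpg` are all hypothetical.

```matlab
% Synthetic stand-in for the dataset (assumption): 8 predictors + strength in column 9.
rng(1);
data  = [rand(100, 8) * 100, rand(100, 1) * 80];
names = {'Cement', 'Slag', 'Fly ash', 'Water', 'Superplasticizer', ...
         'Coarse agg.', 'Fine agg.', 'Age'};   % hypothetical short labels

mu    = mean(data);   % column-wise averages
sigma = std(data);    % column-wise standard deviations

fig = figure('Visible', 'off');
for k = 1:8
    subplot(2, 4, k);
    scatter(data(:, k), data(:, 9), 10, 'filled');
    title(sprintf('%s: mean %.1f, std %.1f', names{k}, mu(k), sigma(k)));
    xlabel(names{k});
    ylabel('Strength (MPa)');
    grid on; box on;
end

print(fig, 'correlations.jpg', '-djpeg');   % export the figure as a JPG file
```

Using a single loop over the columns keeps the script compact: the statistics, titles, and axis labels all come from the same index `k`.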

Question 2. Expand your code to calculate the R² score associated with each variable using a linear regression model. You'll find that some of the variables included in the dataset present a very weak correlation (i.e., a very low R² score) and, thus, can potentially be considered non-significant and excluded from the analysis. With that in mind, create an input to prompt the user to enter the R² score to be used to split the dataset into potentially significant and non-significant variables. Store the variables found to be potentially significant into a new array and output the number of variables included in that array to the command window and the name of those variables. Note that this output must be able to accommodate all possible results, from no variables included in the sub-dataset to all variables included in the sub-dataset. You can use the example below as a guide.
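
Separately from the guide example mentioned in the brief, the thresholding step of Question 2 might be sketched as follows. For a single-variable linear fit, the R2 score equals the squared Pearson correlation, so `corrcoef` is used here. The data are synthetic, the variable labels are hypothetical, and the threshold is hard-coded so the sketch runs non-interactively; in the actual script it would come from `input`.

```matlab
% Synthetic stand-in data (assumption): 8 predictors, strength driven by columns 1 and 8.
rng(1);
X = rand(100, 8);
y = 30 * X(:, 1) + 10 * X(:, 8) + randn(100, 1);
names = {'Cement', 'Slag', 'Fly ash', 'Water', 'Superplasticizer', ...
         'Coarse agg.', 'Fine agg.', 'Age'};   % hypothetical labels

% R2 of each single-variable linear fit = squared correlation coefficient.
R2 = zeros(1, 8);
for k = 1:8
    c = corrcoef(X(:, k), y);
    R2(k) = c(1, 2)^2;
end

% In the assignment this would be: threshold = input('Enter the R2 threshold: ');
threshold = 0.05;

keep = R2 >= threshold;   % logical mask of potentially significant variables
Xsig = X(:, keep);        % sub-dataset holding only those variables

% The message covers every case, from no variables kept to all eight.
if ~any(keep)
    fprintf('No variables reach R2 >= %.2f.\n', threshold);
else
    fprintf('%d variable(s) reach R2 >= %.2f: %s\n', ...
            nnz(keep), threshold, strjoin(names(keep), ', '));
end
```

A logical mask keeps the selection, the count (`nnz`), and the names in step with each other without extra bookkeeping.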

Question 3. Build a linear regression model to predict the compressive strength of concrete based on its composition and age. Calculate the R² score of the model and output it to the command window, together with the number of variables used in the analysis. You can use the example below as a guide.
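
Independently of the guide example mentioned in the brief, one possible way to fit a multiple linear regression in base MATLAB is ordinary least squares with the backslash operator. The sketch below assumes synthetic data with three hypothetical predictors; the R2 score is computed as 1 minus the ratio of the residual to the total sum of squares.

```matlab
% Synthetic stand-in data (assumption): 3 predictors with a known linear trend.
rng(1);
X = rand(100, 3);
y = 5 + 30 * X(:, 1) - 12 * X(:, 2) + randn(100, 1);

A    = [ones(size(X, 1), 1), X];   % design matrix with an intercept column
beta = A \ y;                      % least-squares coefficients
yhat = A * beta;                   % fitted values

% Coefficient of determination: 1 - SS_res / SS_tot.
R2lin = 1 - sum((y - yhat).^2) / sum((y - mean(y)).^2);

fprintf('Linear model: R2 = %.3f using %d variable(s)\n', R2lin, size(X, 2));
```

Toolbox functions such as `fitlm` (Statistics and Machine Learning Toolbox) offer the same fit with more diagnostics, if that toolbox is available.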

Question 4. Build a nonlinear regression model to predict the compressive strength of concrete based on its composition and age. Calculate the R² score of the nonlinear model and output it to the command window, together with the number of variables used.
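
The brief does not prescribe a particular nonlinear model. As one possible choice, the sketch below uses a second-order polynomial model, which is nonlinear in the predictors but can still be fitted by linear least squares. This is an assumption made for illustration, not the required method; the data are synthetic and the two predictors are hypothetical.

```matlab
% Synthetic stand-in data (assumption): strength depends nonlinearly on 2 predictors.
rng(1);
X = rand(100, 2) + 0.1;   % shifted so that log() stays finite
y = 20 * log(X(:, 1)) + 15 * X(:, 2).^2 + 0.5 * randn(100, 1);

Alin  = [ones(100, 1), X];   % linear terms only
Aquad = [Alin, X.^2];        % linear terms plus squared terms

% Helper: in-sample R2 of the least-squares fit for a given design matrix.
r2 = @(A) 1 - sum((y - A * (A \ y)).^2) / sum((y - mean(y)).^2);

R2lin  = r2(Alin);
R2quad = r2(Aquad);

fprintf('Linear model:    R2 = %.3f using %d variable(s)\n', R2lin,  size(X, 2));
fprintf('Quadratic model: R2 = %.3f using %d variable(s)\n', R2quad, size(X, 2));
```

Because the quadratic design matrix contains every column of the linear one, its in-sample R2 cannot be lower. Other model forms (for example, a fit with `fitnlm` from the Statistics and Machine Learning Toolbox) are equally valid choices.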

Question 5. Visualize the predicted compressive strengths against the actual compressive strengths for both the linear and nonlinear models using a scatter plot. Label the axes appropriately and add gridlines and box borders to your plot.
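
A predicted-versus-actual plot along these lines could be sketched as below. The "predictions" are synthetic stand-ins generated by adding noise to the actual values (an assumption made purely so the sketch runs on its own); in the real script they would come from the fitted models. The dashed 1:1 line is an optional extra that marks perfect prediction.

```matlab
% Synthetic stand-ins (assumption): actual strengths and two sets of predictions.
rng(1);
y    = 20 + 40 * rand(100, 1);   % "actual" strengths (MPa)
yLin = y + 6 * randn(100, 1);    % stand-in for linear-model predictions
yNon = y + 3 * randn(100, 1);    % stand-in for nonlinear-model predictions

fig = figure('Visible', 'off');
scatter(y, yLin, 15, 'b', 'filled');
hold on;
scatter(y, yNon, 15, 'r');
lims = [min(y), max(y)];
plot(lims, lims, 'k--');         % 1:1 reference line
hold off;

xlabel('Actual compressive strength (MPa)');
ylabel('Predicted compressive strength (MPa)');
legend('Linear model', 'Nonlinear model', 'Perfect prediction', ...
       'Location', 'northwest');
grid on; box on;
```

Points that cluster more tightly around the 1:1 line indicate the better-performing model.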

Question 6. Discuss in the report the performance of the linear and nonlinear regression models. Which one performs better? And why? Rerun your code for different R² scores (i.e., sub-datasets with different sizes) and discuss the impact of the number of variables on the performance of the model. To support your conclusions, you can plot the R² score obtained for different numbers of variables (from 1 to 8).
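
One way to produce the suggested plot of R2 against the number of variables is sketched below. It assumes a particular strategy that the brief does not mandate: variables are ranked by their individual R2 score and added to a linear model one at a time. The data and the weights `w` are synthetic.

```matlab
% Synthetic stand-in data (assumption): 8 predictors with decreasing influence.
rng(1);
X = rand(200, 8);
w = [30 20 10 5 2 1 0.5 0];   % hypothetical weights
y = X * w' + randn(200, 1);

% Rank the variables by their single-variable R2 score.
R2single = zeros(1, 8);
for k = 1:8
    c = corrcoef(X(:, k), y);
    R2single(k) = c(1, 2)^2;
end
[~, order] = sort(R2single, 'descend');

% Fit a linear model with the n best-ranked variables, n = 1..8.
R2model = zeros(1, 8);
for n = 1:8
    A   = [ones(200, 1), X(:, order(1:n))];
    res = y - A * (A \ y);
    R2model(n) = 1 - sum(res.^2) / sum((y - mean(y)).^2);
end

fig = figure('Visible', 'off');
plot(1:8, R2model, '-o');
xlabel('Number of variables');
ylabel('R2 score');
grid on; box on;
```

Note that the in-sample R2 of nested least-squares models never decreases as variables are added, so a rising curve alone does not show that the extra variables improve predictions on unseen mixes.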

Now that you have already gained a thorough understanding of the data and the impact of the different parameters on the performance of the model, in Part B, you will need to:

1. Choose the best-performing model used in Part A and build a function that takes as input the cement, slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age of the concrete and returns the predicted compressive strength.

2. Test your function on a few examples of concrete mixes not included in the original dataset and discuss its performance in the report.
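
A possible shape for the Part B function is sketched below, assuming for illustration that a linear model was chosen. The function name `predictStrength`, the extra `beta` argument, the coefficient values, and the example mix are all hypothetical: the coefficients are placeholders, not fitted values, and would be replaced by those obtained in Part A.

```matlab
% Placeholder coefficients (NOT fitted values): intercept followed by 8 slopes.
beta = [1; 0.1; 0.1; 0.1; -0.2; 0.3; 0.01; 0.01; 0.1];

% Hypothetical mix (kg/m3) at an age of 28 days, used only to exercise the function.
fc = predictStrength(300, 0, 0, 180, 5, 1000, 800, 28, beta);
fprintf('Predicted strength: %.1f MPa\n', fc);

function fc = predictStrength(cement, slag, flyAsh, water, sp, coarse, fine, age, beta)
% predictStrength  Evaluate a linear model fc = beta(1) + x * beta(2:end).
%   The first eight inputs follow the order listed in the brief; beta is a
%   9-by-1 coefficient vector (assumed to come from the Part A fit).
    x  = [cement, slag, flyAsh, water, sp, coarse, fine, age];
    fc = beta(1) + x * beta(2:end);
end
```

Passing the coefficients in as an argument is a design choice for this sketch; they could equally be stored inside the function once the final model is fixed.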