Climate Change Data Analysis

Aditya Shakya , IIT Gandhinagar, s.aditya@iitgn.ac.in

Shantanu Sahu , IIT Gandhinagar, shantanu.s@iitgn.ac.in

Varun Barala , IIT Gandhinagar, barala.v@iitgn.ac.in

Objective

To examine the linear regression model before and after applying SVD to the data file, and to display results for matrices of various sizes. We show the interdependencies of various features and the consequence of eliminating them, demonstrate a broad understanding of PCA in order to choose the optimal dimension for the SVD application, and attempt to model the dataset with alternative models and regression approaches.

Dataset

The data was available in CSV format, which you may get here. It covers several columns, such as floor area, monthly minimum, average, and maximum temperatures, and elevation, with a total of around 75,000 rows. Some columns are numeric, whereas others, such as "facility type" and "building class," are alphabetic. We therefore used one-hot encoding to transform these columns into integer columns for our study, resulting in a file of 123 columns; one-hot encoding created 59 additional columns, each holding 0 or 1 instead of an alphabetic string. In the final phase of the model-building process, the column "site eui" is used as the dependent variable. Thus, two arrays named X and Y were created, with X containing all columns except site eui and Y containing only site eui. The site eui column represents the amount of energy a building consumes, as reflected in its utility bills. These X and Y values were used to build the model.
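The encoding and X/Y split described above can be sketched with pandas on a toy frame; the column names and values here are illustrative stand-ins for the real CSV:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real CSV.
df = pd.DataFrame({
    "floor_area": [1200.0, 850.0, 2300.0],
    "facility_type": ["Office", "School", "Office"],
    "site_eui": [90.1, 75.4, 120.8],
})

# One-hot encode the alphabetic columns into 0/1 integer columns.
df = pd.get_dummies(df, columns=["facility_type"], dtype=int)

# Split into features X and target Y (site_eui).
Y = df["site_eui"]
X = df.drop(columns=["site_eui"])
```

On the real file, the same `get_dummies` call over all alphabetic columns is what grows the frame from 64 to 123 columns.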

We then normalized all of the columns except the "site eui" column, taking x in the formula below to be, in different runs, the column mean, variance, or standard deviation, and compared the results:

              cell_value = (cell_value - avg_of_column) / x

After that, the NaN values in the file were processed. Because the data was almost complete, deleting a row or column was not an option, so we used the fillna function to fill each empty cell with the column's mean value. In multivariate linear regression, normalization is required to bring all variables into the same range: when a model is built without normalization, the coefficients of some independent variables become very large, resulting in poor modeling.
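A minimal sketch of the fill-and-normalize step, using synthetic columns and taking x to be the column standard deviation:

```python
import numpy as np
import pandas as pd

# Toy columns with missing cells, standing in for the real data.
df = pd.DataFrame({"elevation": [10.0, np.nan, 30.0],
                   "avg_temp": [15.0, 25.0, np.nan]})

# Fill each empty cell with its column's mean rather than dropping rows.
df = df.fillna(df.mean())

# cell_value = (cell_value - avg_of_column) / x, here with x = column std.
normalized = (df - df.mean()) / df.std()
```

Swapping `df.std()` for `df.var()` or `df.mean()` gives the other two variants compared in the study.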

Importing File For Regression

Linear regression Modeling

Because the expected output for the "Test" file was not included in the zip file, we worked only with the file named "Train". We split the train file's dataset into two parts: training takes place on the first 60,000 rows, while the remaining rows are used as a new test set to check and compare the model. The effectiveness of a linear regression model is assessed using its model score, the coefficient of determination of the prediction. The best score any model can attain is 1, which is trivially achieved if the value of the output (dependent) variable itself is substituted as an independent variable, giving it a coefficient of one and a score of one.

On the 60,000 rows, a linear regression model (M1) was built with the site eui column as the output variable, and its score was 0.23. This model was then evaluated on the remaining rows, yielding a model score of 0.15. The train score was poor because the data we were given had a lot of variability; the test score is also poor since the predictions come from the train model, which already has a low score.
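The split-and-score procedure can be sketched with plain NumPy on synthetic data; the manual R² below is the same coefficient of determination reported as the model score:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # stand-in for the feature columns
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)

# Train on the first block of rows, test on the remainder.
split = 60
X_tr, y_tr = X[:split], y[:split]
X_te, y_te = X[split:], y[split:]

# Ordinary least squares fit: coef minimizes ||X_tr @ coef - y_tr||.
coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

def r2(X, y, coef):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    resid = y - X @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

train_score = r2(X_tr, y_tr, coef)
test_score = r2(X_te, y_te, coef)
```

On the real data the same split (60,000 train rows, the rest held out) produced the 0.23 and 0.15 scores above.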

Linear Regression

Applying Singular Value Decomposition

The Singular Value Decomposition (SVD) of a matrix is a factorization of that matrix into three matrices. It possesses several intriguing algebraic characteristics and conveys key geometrical and theoretical insights about linear transformations.

The SVD of an m×n matrix A is given by the formula:

A = U S Vᵀ

where U is the m×n matrix of the orthonormal eigenvectors of AAᵀ, Vᵀ is the transpose of the n×n matrix containing the orthonormal eigenvectors of AᵀA, and S is the n×n diagonal matrix of singular values, which are the square roots of the eigenvalues of AᵀA.
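The factorization can be checked numerically with NumPy's reduced SVD on a small matrix:

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)   # m = 4, n = 3

# Reduced SVD: U is m x n, s holds the n singular values, Vt is n x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A is recovered as U @ diag(s) @ Vt.
A_rec = U @ np.diag(s) @ Vt
```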

The first 60,000 rows of the matrix were used to generate two linear regression models, one without SVD (M2) and the other with SVD (M3). Both models were then tested independently on the same 15,000 rows of the matrix after SVD.

Using Old Model M2

We tested the model built from the above 60,000 rows of the matrix without SVD (M2) on the last 15,000 rows of the SVD matrix. If we kept the dimension at 122, i.e., the same shape as the original matrix, this yielded the same score of 0.15, which is expected because a full-rank SVD reproduces the original matrix. We then evaluated the score of model M2 for different values of the input variable K in the SVD implementation.
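Truncating the SVD to a chosen K can be sketched as follows, on a synthetic matrix; keeping all columns reproduces the original matrix exactly, which is why the score is unchanged in that trivial case:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 8))        # synthetic stand-in for the data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k):
    # Keep only the k largest singular values; the rest are discarded.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# With k equal to the full column count, the original matrix is recovered,
# so any downstream regression score is unchanged -- the trivial case.
full = rank_k_approx(A.shape[1])
```

Sweeping `k` over a range and scoring the regression at each value produces the score-vs-K curve discussed below.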

The score of model M2 vs. k is shown in the graph below.

Linear Reg with SVD (using Model M2)

Using New model M3

Because of the considerable variation of the data following SVD modeling, model M3's scores were mostly negative. When the dataset is poor and has a lot of variability, the model score can be negative.

Calculating Mean Square Error for the above model

The MSE for the above model was 7.546e+29, which is extremely high, indicating that the data set is poor.
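The MSE itself is just the average squared residual; a minimal sketch on made-up values:

```python
import numpy as np

# Illustrative true and predicted targets.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])

# Mean squared error: average of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)
```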

Deleting larger values of variable Y

We tried deleting the rows with predicted Y values greater than 10,000 in order to improve the score. This made the score more accurate, indicating that some values in the dataset were substantially hurting the overall linear regression modeling; as a result, 2,233 rows were removed from the data set. We used a threshold of 10,000 because we wanted to see what would happen if very large values were excluded from the dataset, and all of the variables' coefficients were either low (on the order of 10³ to 10⁴) or high (on the order of 10¹² to 10¹⁶).
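The row-filtering step can be sketched with pandas on a toy frame (values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the real data; two rows exceed the threshold.
df = pd.DataFrame({"site_eui": [120.0, 15000.0, 340.0, 99999.0],
                   "floor_area": [1.0, 2.0, 3.0, 4.0]})

# Drop every row whose target exceeds the chosen threshold of 10,000.
threshold = 10_000
kept = df[df["site_eui"] <= threshold]
removed = len(df) - len(kept)
```

Applied to the real data, this filter is what removed the 2,233 offending rows.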

This data set's model score was 0.14 for K = 122, rather than the very large negative number obtained for model M3. After eliminating these rows, the graph of model score vs. K for model M3 is shown below.

Linear Regression with SVD on model M3

Calculating MSE

Principal Component Analysis (PCA) Analysis

PCA is a dimensionality-reduction technique for lowering the dimensionality of large data sets by turning a huge collection of variables into a smaller set that preserves the bulk of the information in the larger set. Naturally, lowering the number of variables in a data set affects accuracy; the trade-off in dimensionality reduction is to give up some accuracy in return for simplicity, because smaller data sets are easier to explore and visualize, and analyzing them is much easier.

PCA analysis automatically normalizes the original matrix before applying SVD and linear regression, which results in a higher score. We ran a PCA analysis on the original dataset, which gave us an array of all the features and their influence on Y. From it we can see that the attribute "floor area" is the most influential, i.e., has the highest coefficient, followed by "id" and "heating degree days," and so on. This leads to the conclusion that as K is changed, columns will be picked for SVD analysis from left to right in this ordering.
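Ranking features by their loading on the first principal component can be sketched with NumPy on synthetic data (the first column is constructed to have the largest variance, so it should dominate PC1):

```python
import numpy as np

rng = np.random.default_rng(2)
# Four synthetic features with decreasing variance.
X = rng.normal(size=(200, 4)) * np.array([5.0, 1.0, 0.5, 0.1])

# Center the data, then take the SVD; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rank the original columns by the magnitude of their loading on PC1.
loadings = np.abs(Vt[0])
order = np.argsort(loadings)[::-1]
```

On the real data, the same ranking is what places floor area first, followed by id and heating degree days.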

The influence of several columns on "Y" is depicted in the graph below, along with their respective percentages. The X axis depicts the array of columns from PCA analysis, while the Y axis depicts their relative importance.

PCA Analysis

Some New Modeling Ways:

By reducing the number of columns to K (reduced dimension)

The SVD of any matrix outputs three matrices U, s, and Vᵀ (with Vᵀ having dimension d×d). In the next analysis we multiplied the initial dataframe (converted to a NumPy array) by the matrix V to construct a matrix with only K columns and 75,000 rows, where K is the number of dimensions used in the SVD analysis above. A linear regression model applied to this reduced matrix scored far better than the above analysis, where the matrix had 123 columns with low rank (M3): for the same K = 10, this analysis scored 0.04, whereas model M3 scored -1.92.
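Projecting the data onto the first K right singular vectors can be sketched as follows (a 100×12 synthetic array stands in for the 75,000×123 frame):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 12))      # stand-in for the full data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 10
# Multiply by the first k right singular vectors: the result keeps one row
# per original row but has only k columns instead of 12.
X_reduced = X @ Vt[:k].T
```

The regression is then fit on `X_reduced`, a genuinely k-dimensional matrix, instead of a full-width matrix of low rank.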

As can be seen, lowering columns improved the model by making the score somewhat positive.

Deleting columns derived from “facility_type”

After looking at the coefficients of the various features, we observed that the values of the one-hot encoded columns had a lot of variability, leading to such a poor SVD score; a score can be negative only when the data is truly bad. We therefore removed the "facility type" columns and ran linear regression (using the standard deviation to standardize the dataset) on both matrices, with and without SVD, which revealed a significant difference in the regression score after applying SVD. This supported the claim that the one-hot encoding of facility type was degrading the dataset.
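Dropping every one-hot column derived from facility type can be sketched with pandas (toy frame, illustrative column names):

```python
import pandas as pd

# Toy frame after one-hot encoding; names are illustrative.
df = pd.DataFrame({
    "floor_area": [1200.0, 850.0],
    "facility_type_Office": [1, 0],
    "facility_type_School": [0, 1],
    "site_eui": [90.1, 75.4],
})

# Drop every one-hot column that came from "facility_type".
cols = [c for c in df.columns if c.startswith("facility_type")]
df = df.drop(columns=cols)
```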

As can be seen in the graph above, eliminating the column “facility_type” improved the score and brought it closer to zero rather than negative infinity.

As can be seen in the 'First Singular Vector vs. Second Singular Vector' graph below, there is less scatter than in the earlier PCA graph, and the model became more linear once the "facility type" columns were removed.

First k dimension plot

The influence of several columns on "Y" is depicted in the graph below, along with their respective percentages. The X axis depicts the array of columns from PCA analysis, while the Y axis depicts their relative importance.

PCA Analysis after dropping Facility Type

Removing Facility Type and using original Normalization