In this post I provide some quick code snippets demonstrating how to use my C3 library to perform low-rank regression of scattered data. The approach is based on our recently accepted paper:

Alex Gorodetsky and John Jakeman: Gradient-based Optimization for Regression in the Functional Tensor-Train Format. arXiv:1801.00885v2 (accepted, Journal of Computational Physics), forthcoming.

In that paper, we demonstrate that this approach is valid on a wide variety of data sets and compare it to many machine learning algorithms. The results provide an initial indication that low-rank formats have the potential to be a powerful general-purpose machine learning modeling format. In later posts, I will discuss in more detail the exact type of low-rank model used and the details of its implementation. The example in this post is intended to get someone started quickly and to demonstrate the parallel to other machine learning techniques.

I assume that you have installed the library with the Python interface as described in the README. First, let's load numpy, matplotlib, and scikit-learn. We will compare our approach on the “diabetes” data set from scikit-learn.

1: """ Author: Alex Gorodetsky -- goroda@umich.edu """
2: """ License: BSD 3 clause """
3: 
4: import numpy as np
5: import matplotlib.pyplot as plt
6: from sklearn import linear_model, neural_network
7: from sklearn import datasets
8: from sklearn.metrics import mean_squared_error, r2_score

Next, we import c3py and set up a function that calls the library to fit the data.

import c3py

def fit_ft(xdata, ydata, poly_order=3, initial_rank=1):
    """ Setup and solve a regression problem in low-rank format """
    ndata, dim = xdata.shape

    # set up lower and upper bounds of the approximation
    lbounds = np.min(xdata, axis=0) - 0.5
    ubounds = np.max(xdata, axis=0) + 0.5

    # Initialize the parameters of a low-rank model.
    # Each dimension is parameterized by polynomials
    # of order poly_order, thus each contains poly_order+1 unknowns
    ftrain = c3py.FunctionTrain(dim)
    for dim_index, lb, ub in zip(range(dim), lbounds, ubounds):
        ftrain.set_dim_opts(dim_index, "legendre", lb, ub, poly_order+1)

    # set up the initial rank of the approximation
    ranks = [1] + [initial_rank] * (dim-1) + [1]
    ftrain.set_ranks(ranks)

    # Build the model while adapting the rank
    ftrain.build_data_model(ndata, xdata, ydata,
                            adaptrank=1, kickrank=1,            # adapt params
                            obj="LS_SPARSECORE", regweight=1e0) # sparsity regularization
    return ftrain
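
Before moving on to real data, it can be useful to smoke-test fit_ft on synthetic data. The snippet below is just an illustrative sketch (it is not from the paper): it samples a simple three-dimensional product function, fits it with the helper defined above, and prints the training error.

# Quick smoke test of fit_ft on synthetic data (illustrative only)
np.random.seed(0)
xsynth = np.random.uniform(-1.0, 1.0, size=(200, 3))  # 200 samples in 3 dimensions
ysynth = np.prod(xsynth, axis=1)                       # separable target f(x) = x1*x2*x3
ft_synth = fit_ft(xsynth, ysynth, poly_order=2, initial_rank=1)
mse_synth = mean_squared_error(ysynth, ft_synth.eval(xsynth))
print("Training MSE on the synthetic data is {0:5e}".format(mse_synth))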

Next we load the diabetes data set and split it into training and testing data. This data set has a total of 442 instances, each described by 10 features. We take the last 20 instances for testing and train on the rest, the same split used in the scikit-learn linear regression example.

diabetes = datasets.load_diabetes()
X = diabetes.data
Y = diabetes.target

ndata, nfeatures = X.shape
ntrain = ndata - 20   # hold out the last 20 instances for testing
ntest = ndata - ntrain

xtrain = X[:ntrain, :]
xtest = X[ntrain:, :]
ytrain = Y[:ntrain]
ytest = Y[ntrain:]

For comparison with our method, we compute three baselines: a constant prediction equal to the mean of the training targets, a scikit-learn linear regression model, and a scikit-learn neural network.

# Baseline guess
y_pred_baseline = np.mean(ytrain) * np.ones(ytest.shape)
mse_baseline = mean_squared_error(y_pred_baseline, ytest)
print("Mean Squared Error of the baseline fit is {0:5f}".format(mse_baseline))

# Linear regression
linmod = linear_model.LinearRegression() # create the object
linmod.fit(xtrain, ytrain)               # train
y_pred_lin = linmod.predict(xtest)       # test
mse_lin = mean_squared_error(y_pred_lin, ytest)
print("Mean Squared Error of the linear fit is {0:5f}".format(mse_lin))

# Neural Network
neural = neural_network.MLPRegressor(activation='relu', solver='lbfgs') # lbfgs because the data set is small
neural.fit(xtrain, ytrain)
y_pred_nn = neural.predict(xtest)
mse_nn = mean_squared_error(y_pred_nn, ytest)
print("Mean Squared Error of the neural network fit is {0:5f}".format(mse_nn))

Finally, we train and evaluate the functional tensor-train (FT) model.

# Low-rank regression
ft = fit_ft(xtrain, ytrain)
y_pred_ft = ft.eval(xtest)
mse_ft = mean_squared_error(y_pred_ft, ytest)
print("Mean Squared Error of the low-rank fit is {0:5f}".format(mse_ft))

The output from this code is:

Mean Squared Error of the baseline fit is 5568.964625
Mean Squared Error of the linear fit is 2004.567603
Mean Squared Error of the neural network fit is 1806.851797
Mean Squared Error of the low-rank fit is 1814.017628
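
Note that r2_score was imported earlier but never used. As an optional aside (a quick sketch, not part of the original comparison), it can be used to report the coefficient of determination for each model as well:

# Optional: report R^2 (coefficient of determination) for each model
print("R^2 of the linear fit is {0:5f}".format(r2_score(ytest, y_pred_lin)))
print("R^2 of the neural network fit is {0:5f}".format(r2_score(ytest, y_pred_nn)))
print("R^2 of the low-rank fit is {0:5f}".format(r2_score(ytest, y_pred_ft)))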

The neural network results vary from run to run, most likely due to the random initialization of its weights. The linear, neural network, and low-rank models all provide approximately the same quality of fit (recall that there are only 20 testing samples!).
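
If you want the neural network run to be repeatable, one option (an aside, not part of the runs above) is to fix its seed through scikit-learn's random_state argument:

# Fixing random_state makes repeated runs give the same neural network fit
neural = neural_network.MLPRegressor(activation='relu', solver='lbfgs', random_state=0)
neural.fit(xtrain, ytrain)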

