Tutorial 2. Fitting of 1D NMR line shape

by Evgenii Kovrigin, 05/17/2011

In this tutorial we will load the datasets created in the previous tutorial and fit their data.

IMPORTANT NOTE: We cannot fit simulated ideal data whose noise RMSD was set to 0, because this number is used to weight the residuals: target-function = sum-of-squares/noise-RMSD. If the noise RMSD is 0, fitting fundamentally fails because the target function becomes infinite! We still want to fit these data to see whether fitting gives us back the parameters used in the simulation. The workaround is simply to simulate data with a very small noise RMSD, which closely reproduces the ideal profile while still allowing a test of the fitting algorithms.
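To make the weighting concrete, here is a minimal sketch (the exact normalization IDAP uses may differ in detail; Y_data and Y_model are hypothetical vectors standing in for the experimental points and the model curve):

Y_RMSD   = 1e-6;                         % small but non-zero noise RMSD
residual = Y_data - Y_model;             % misfit at every point
target   = sum(residual.^2) / Y_RMSD;    % finite only when Y_RMSD > 0; diverges as Y_RMSD -> 0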

Contents

close all
clear all

Load previously prepared dataset

Import saved data

temp=load('tutorial_1.mat');

Extract objects from the import structure

test1=temp.test1;
test2=temp.test2;
setup=temp.setup;

Check contents

test1   % show summary of the dataset
test1.plot_simulation('Ideal');  % show original simulation and perturbed dataset
test2   % show summary of the dataset
test2.plot_simulation('Noisy');  % show original simulation and perturbed dataset
setup
test1 = 

  NMRLineShapes1D handle

  Properties:
                    operation_mode: 'Simulation'
                         data_type: [1x18 char]
                              name: 'Test_1'
                   model_names_map: [10x1 containers.Map]
                  number_of_models: 10
                     model_handles: {1x10 cell}
                       model_names: {1x10 cell}
                model_descriptions: {1x10 cell}
            model_parameters_names: {1x10 cell}
               active_model_number: 2
         active_model_N_parameters: 9
                         algorithm: 'not defined'
               best_fit_parameters: [1x9 double]
        best_fit_parameters_conf95: []
                       data_length: 50
                        X_original: [50x1 double]
                                 X: [50x1 double]
                               XSE: [50x1 double]
                        Y_original: [50x1 double]
                                 Y: [50x1 double]
                               YSE: [50x1 double]
                           Y_ideal: [50x1 double]
              MonteCarlo_range_min: [1x9 double]
              MonteCarlo_range_max: [1x9 double]
              parameter_limits_min: [1x9 double]
              parameter_limits_max: [1x9 double]
                     smooth_points: 250
                            X_name: [1x17 char]
                            Y_name: [1x18 char]
                           X_range: [-100 300]
                           Y_range: []
                       X_direction: 'reverse'
              model_numeric_solver: 'analytical'
             model_numeric_options: 'no options'
               options_lsqcurvefit: [1x1 struct]
                options_fminsearch: [1x1 struct]
        options_fmincon_active_set: [1x1 struct]
    options_fmincon_interior_point: [1x1 struct]
                         conf_high: 97.5000
                          conf_low: 2.5000
                     fraction_conf: 0.0500
                      N_max_trials: 200



test2 = 

  NMRLineShapes1D handle

  Properties:
                    operation_mode: 'Simulation'
                         data_type: [1x18 char]
                              name: 'Test_2'
                   model_names_map: [10x1 containers.Map]
                  number_of_models: 10
                     model_handles: {1x10 cell}
                       model_names: {1x10 cell}
                model_descriptions: {1x10 cell}
            model_parameters_names: {1x10 cell}
               active_model_number: 2
         active_model_N_parameters: 9
                         algorithm: 'not defined'
               best_fit_parameters: [1x9 double]
        best_fit_parameters_conf95: []
                       data_length: 50
                        X_original: [50x1 double]
                                 X: [50x1 double]
                               XSE: [50x1 double]
                        Y_original: [50x1 double]
                                 Y: [50x1 double]
                               YSE: [50x1 double]
                           Y_ideal: [50x1 double]
              MonteCarlo_range_min: [1x9 double]
              MonteCarlo_range_max: [1x9 double]
              parameter_limits_min: [1x9 double]
              parameter_limits_max: [1x9 double]
                     smooth_points: 250
                            X_name: [1x17 char]
                            Y_name: [1x18 char]
                           X_range: [-100 300]
                           Y_range: []
                       X_direction: 'reverse'
              model_numeric_solver: 'analytical'
             model_numeric_options: 'no options'
               options_lsqcurvefit: [1x1 struct]
                options_fminsearch: [1x1 struct]
        options_fmincon_active_set: [1x1 struct]
    options_fmincon_interior_point: [1x1 struct]
                         conf_high: 97.5000
                          conf_low: 2.5000
                     fraction_conf: 0.0500
                      N_max_trials: 200



setup = 

         Rtotal: 1.0000e-03
        LRratio: 0.7500
      Log10_K_A: 5
          k_2_A: 10
           w0_1: 0
           w0_2: 200
         FWHH_1: 20
         FWHH_2: 20
    ScaleFactor: 1

Remember that test1 is an ideal dataset with zero noise RMSD. Recalculate its data using a very small noise RMSD.

X_RMSD=0;  % does not need to be changed
Y_RMSD=1e-6;
test1.simulate_noisy_data([ setup.Rtotal  setup.LRratio   setup.Log10_K_A   setup.k_2_A  ...
              setup.w0_1   setup.w0_2   setup.FWHH_1  setup.FWHH_2  ...
              setup.ScaleFactor ], X_RMSD, Y_RMSD);
figure1=test1.plot_simulation('Nearly ideal');

Prepare starting parameters and their limits

Each parameter has to have formally assigned minimum and maximum values that it cannot exceed during a fitting run. The Monte-Carlo range sets the interval from which random starting parameters are drawn when determining confidence intervals via the Monte-Carlo protocol.

Parameters that are going to be fixed in the fit may have all these limits set to 0. Let's fit only for the binding affinity constant, as if all other parameters were known by design of our virtual experiment.

% EQUILIBRIUM CONSTANT (entered as logarithm of base 10)
log10_KA=log10(1e1); % deliberately offset from the true value

Absolute limits:

log10_KA_min=log10(1e1);
log10_KA_max=log10(1e10);

These ranges are disregarded by the 'simplex' algorithm but obeyed by all others.

Monte-Carlo search range:

log10_KA_Monte_Carlo_min=log10(5e1);
log10_KA_Monte_Carlo_max=log10(1e7);

NOTE: These ranges are only used in Monte-Carlo runs if session.vary_starting_parameters='yes'. Its default value (used here) is 'no': simulated noisy data sets are generated anew for every trial, but fitting always starts from the 'best-fit' parameter set, the same set used to synthesize the noisy data. A true Monte-Carlo approach calls for varying these starting parameters before every fitting run as well, but (1) in many cases I did not see a difference in the results, while (2) the 'no' mode may be noticeably faster.
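For reference, switching to the full Monte-Carlo mode would be a one-line change (a sketch, assuming vary_starting_parameters is a settable property of the TotalFit session object created below; we keep the default in this tutorial):

% session.vary_starting_parameters='yes';  % randomize starting parameters within the Monte-Carlo ranges in every trial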

Collect the parameters in a vector corresponding to the U model.

To view the order of parameters in the model, simply display the active model of the dataset:
test1.show_active_model()
ans =

Active model:
Model 2:  "U-model"
Model description
"1D NMR line shape for the U model, no temperature/field dependence, analytical"
Model handle: line_shape_equations_1D.U_model_1D
Current solver: analytical
Model parameters: 
1: Rtotal     2: LRratio     3: log10(K_A)     4: k_2_A     5: w0_1     6: w0_2     7: FWHH_1     8: FWHH_2     9: ScaleFactor     



Create a vector of parameter values

starting_parameters=[ setup.Rtotal     setup.LRratio     log10_KA     setup.k_2_A  ...
                      setup.w0_1     setup.w0_2     setup.FWHH_1     setup.FWHH_2     setup.ScaleFactor ];

Create vectors of absolute limits. Parameters that will not be optimized do not need meaningful limits (use 0).

min_limits=[ 0 0 log10_KA_min 0 0 0 0 0 0 ];
max_limits=[ 0 0 log10_KA_max 0 0 0 0 0 0 ];

Create vectors of Monte-Carlo ranges. They will only be used if you decide to fit data with determination of confidence intervals.

min_MC = [ 0 0 log10_KA_Monte_Carlo_min 0 0 0 0 0 0 ];
max_MC = [ 0 0 log10_KA_Monte_Carlo_max 0 0 0 0 0 0 ];

Prepare fitting session object

A fitting session is performed using a TotalFit object. The datasets themselves have fitting functionality, but their internal procedures are far less powerful than those offered by TotalFit.

Choose a name (to be used for saving data)

session_name='KA_fit_to_ideal_data';

Create fitting session

session=TotalFit(session_name);

Load the dataset with the nearly ideal (vanishingly small noise) simulated data.

session.copy_dataset_into_array(test1);
session.show_datasets()
ans =

Fitting session: KA_fit_to_ideal_data
Current datasets: 
-----------------------------
Dataset 1
--------------- 
Priority = 1
Name: Test_1 
Type: 1D NMR Line shapes
Active model:
Model 2:  "U-model"
Model description
"1D NMR line shape for the U model, no temperature/field dependence, analytical"
Model handle: line_shape_equations_1D.U_model_1D
Current solver: analytical
Model parameters: 
1: Rtotal     2: LRratio     3: log10(K_A)     4: k_2_A     5: w0_1     6: w0_2     7: FWHH_1     8: FWHH_2     9: ScaleFactor     




NOTE: The dataset is assigned a serial number (1 here), by which it is addressed within the session object.
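For illustration only (hypothetical calls, not executed in this tutorial, which keeps one dataset per session): a second dataset copied into the same session would receive serial number 2 and would be addressed by that number in all subsequent calls.

% session.copy_dataset_into_array(test2);           % would become Dataset 2
% session.dataset_plot_fit(2, 'Test_2 / Initial');  % addressed by its serial number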

Prepare fitting environment

session.initialize_relation_matrix(sprintf('%s.all_datasets.txt', session_name))
session.generate_fitting_environment(sprintf('%s.fitting_environment.txt', session_name))
Dataset listing is written to "KA_fit_to_ideal_data.all_datasets.txt".

Fitting environment saved in KA_fit_to_ideal_data.fitting_environment.txt.

ans =

New parameter relations (indexed as in all_parameters vector):

[1]     1:Rtotal
[2]     1:LRratio
[3]     1:log10(K_A)
[4]     1:k_2_A
[5]     1:w0_1
[6]     1:w0_2
[7]     1:FWHH_1
[8]     1:FWHH_2
[9]     1:ScaleFactor

Summary of datasets with lookup vectors (position of each parameter in all_parameter vector)
--------
Dataset 1/Test_1,  model "U-model"
Parameters          :    Rtotal    LRratio    log10(K_A)    k_2_A    w0_1    w0_2    FWHH_1    FWHH_2    ScaleFactor
Lookup vector       :   1   2   3   4   5   6   7   8   9



Assign starting parameters and limits

dataset_serial_number=1;  % serial number assigned to test1 in this session
session.assign_parameter_values(dataset_serial_number, starting_parameters,...
                                min_MC, max_MC, min_limits, max_limits);

Inspect initial guess

figure_handle=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Initial',session.name));

Pretty terrible.

Save the initial state of the fitting session to disk for future inspection and reuse

session.export_results(sprintf('%s.initial_guess', session_name), {figure_handle})
Output current session state, statistics and graphs into "KA_fit_to_ideal_data.initial_guess/".
Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_ideal_data.initial_guess/adjustable_parameters_only.txt".
Statistics on per-dataset basis is written into "KA_fit_to_ideal_data.initial_guess/all_datasets.txt".
Writing figures...
Figure 1: small PNG image,  large PNG image,  EPS image,  MATLAB figure file - written
Done

Set which parameters of the model will be fixed and which will be variable

parameter_modes=[ 0 0 1 0 0 0 0 0 0];
session.set_parameters_modes(dataset_serial_number, parameter_modes);

Fit 'nearly ideal' dataset without determination of confidence intervals

Select the fitting algorithm. Choices: 'Newton-active-set', 'Newton-interior-point', 'simplex', 'GlobalSearch', 'DirectSearch'. The Newton algorithms are fast gradient searches: the fastest of all, but they tend to be trapped in local minima or stuck on shallow slopes. The 'simplex' is slower and far more reliable, yet it may still get stuck in a local minimum. The last two are the slowest algorithms, coming from the Global Optimization Toolbox. They make an effort to ensure that the minimum of the sum of squares of residuals is a global one, but they are at least 50x slower. Each fitting algorithm has options controlling its operation (options_fminsearch, options_fmincon_active_set, options_fmincon_interior_point, GlobalSearch_options, MultiStart_options, DirectSearch_options). They may be set as properties of TotalFit, but we do not need them right now.
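For example, tightening the simplex convergence tolerances could look like this (a sketch, assuming options_fminsearch holds a standard optimset structure, as its name suggests; we keep the defaults here):

% session.options_fminsearch = optimset(session.options_fminsearch, 'TolFun', 1e-12, 'TolX', 1e-12);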

fit_only_algorithm='GlobalSearch';

Report on start-up

fprintf(1,'\n\n==============\nFitting Algorithm: %s\nVariable parameters:', fit_only_algorithm);

==============
Fitting Algorithm: GlobalSearch
Variable parameters:

Fit without error determination

tic
session.fit_only(fit_only_algorithm);
toc
GlobalSearch stopped because it analyzed all the trial points.

All 62 local solver runs converged with a positive local solver exit flag.
Elapsed time is 7.236392 seconds.

Report on finishing

session.show_fit_statistics()
session.transfer_all_parameters_to_individual_datasets();
ans =



Fitting results for "KA_fit_to_ideal_data"
==========================

Fitting method: GlobalSearch
Number of active datasets (non-zero priority): 1

Reporting only on variable parameters (dataset-#:parameter-name):
(# in the fit-vector)            [min value, max value] for 95% confidence interval
--------------------------
(1) name:  "  1:log10(K_A)"
(1) value=  4.806708e+00      [ 0.000000e+00       0.000000e+00 ]

Total sum of squares (SS) = 8.86423e-07
Statistics per dataset:
------------
(1) "Test_1":     Priority=1,     R^2=0.987,    1e+02% of total SS,   Model= "U-model" 


Inspect final result

figure_handle=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Final',session.name));

Export final result

session.export_results(sprintf('%s.fit_only.%s', session_name, fit_only_algorithm), {figure_handle})
Output current session state, statistics and graphs into "KA_fit_to_ideal_data.fit_only.GlobalSearch/".
Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_ideal_data.fit_only.GlobalSearch/adjustable_parameters_only.txt".
Statistics on per-dataset basis is written into "KA_fit_to_ideal_data.fit_only.GlobalSearch/all_datasets.txt".
Writing figures...
Figure 1: small PNG image,  large PNG image,  EPS image,  MATLAB figure file - written
Done

FIT A 'NOISY' DATASET

Create a new fitting session and repeat all of the above steps for the second dataset.

session_name='KA_fit_to_noisy_data';
session=TotalFit(session_name);
session.copy_dataset_into_array(test2);
session.initialize_relation_matrix(sprintf('%s.all_datasets.txt', session_name))
session.generate_fitting_environment(sprintf('%s.fitting_environment.txt', session_name))
dataset_serial_number=1;
session.assign_parameter_values(dataset_serial_number, starting_parameters,...
                                min_MC, max_MC, min_limits, max_limits);
Dataset listing is written to "KA_fit_to_noisy_data.all_datasets.txt".

Fitting environment saved in KA_fit_to_noisy_data.fitting_environment.txt.

ans =

New parameter relations (indexed as in all_parameters vector):

[1]     1:Rtotal
[2]     1:LRratio
[3]     1:log10(K_A)
[4]     1:k_2_A
[5]     1:w0_1
[6]     1:w0_2
[7]     1:FWHH_1
[8]     1:FWHH_2
[9]     1:ScaleFactor

Summary of datasets with lookup vectors (position of each parameter in all_parameter vector)
--------
Dataset 1/Test_2,  model "U-model"
Parameters          :    Rtotal    LRratio    log10(K_A)    k_2_A    w0_1    w0_2    FWHH_1    FWHH_2    ScaleFactor
Lookup vector       :   1   2   3   4   5   6   7   8   9



Inspect initial guess

figure_handle=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Initial',session.name));

Pretty terrible.

Save the initial state of the fitting session to disk for future inspection and reuse

session.export_results(sprintf('%s.initial_guess', session_name), {figure_handle})
Output current session state, statistics and graphs into "KA_fit_to_noisy_data.initial_guess/".
Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_noisy_data.initial_guess/adjustable_parameters_only.txt".
Statistics on per-dataset basis is written into "KA_fit_to_noisy_data.initial_guess/all_datasets.txt".
Writing figures...
Figure 1: small PNG image,  large PNG image,  EPS image,  MATLAB figure file - written
Done

Set parameter modes and fit

parameter_modes=[ 0 0 1 0 0 0 0 0 0];
session.set_parameters_modes(dataset_serial_number, parameter_modes);
% fit_only_algorithm='simplex';
fit_only_algorithm='GlobalSearch';  % Global Optimization Toolbox algorithm
fprintf(1,'\n\n==============\nFitting Algorithm: %s\nVariable parameters:', fit_only_algorithm);
tic
session.fit_only(fit_only_algorithm);
toc
session.show_fit_statistics()
session.transfer_all_parameters_to_individual_datasets();
figure_handle=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Final',session.name));
session.export_results(sprintf('%s.fit_only.%s', session_name, fit_only_algorithm), {figure_handle})

==============
Fitting Algorithm: GlobalSearch
Variable parameters:
GlobalSearch stopped because it analyzed all the trial points.

All 48 local solver runs converged with a positive local solver exit flag.
Elapsed time is 4.580126 seconds.

ans =



Fitting results for "KA_fit_to_noisy_data"
==========================

Fitting method: GlobalSearch
Number of active datasets (non-zero priority): 1

Reporting only on variable parameters (dataset-#:parameter-name):
(# in the fit-vector)            [min value, max value] for 95% confidence interval
--------------------------
(1) name:  "  1:log10(K_A)"
(1) value=  4.918282e+00      [ 0.000000e+00       0.000000e+00 ]

Total sum of squares (SS) = 1.31037e-06
Statistics per dataset:
------------
(1) "Test_2":     Priority=1,     R^2=0.967,    1e+02% of total SS,   Model= "U-model" 



Output current session state, statistics and graphs into "KA_fit_to_noisy_data.fit_only.GlobalSearch/".
Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_noisy_data.fit_only.GlobalSearch/adjustable_parameters_only.txt".
Statistics on per-dataset basis is written into "KA_fit_to_noisy_data.fit_only.GlobalSearch/all_datasets.txt".
Writing figures...
Figure 1: small PNG image,  large PNG image,  EPS image,  MATLAB figure file - written
Done

Fit noisy dataset with determination of confidence intervals

IMPORTANT NOTE: Confidence intervals are ONLY meaningful if the model fits the data well, that is, the curve goes through the points! Confidence intervals and parameter correlation plots then show how the variation of curves passing WITHIN the error bars may be accounted for by variation in the parameters. If the curve DOES NOT pass through the error bars, a Monte-Carlo run only wastes your time, because the randomizer varies the synthetic data AROUND THE IDEAL SIMULATED CURVE. If that curve does not go through the experimental data to begin with, the entire Monte-Carlo run is meaningless, because its random variation does not reproduce the deviation of the data from the model curve.
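Conceptually, every Monte-Carlo trial fits a synthetic dataset of this kind (a sketch only; the actual IDAP randomizer may differ in detail, and Y_best_fit stands for the best-fit curve sampled at the experimental X values):

Y_synthetic = Y_best_fit + Y_RMSD .* randn(size(Y_best_fit));   % noise is added AROUND the model curve, not around the data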

Before starting this fitting, start the parallel processing framework by issuing the 'matlabpool open N' command, where N is the number of your CPU cores. The IDAP algorithm will then farm the Monte-Carlo runs out to multiple CPUs. If you do not have multiple CPUs, do not do anything; fitting will still work, just more slowly.
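A sketch of starting the pool with the syntax of this MATLAB release (newer releases replace matlabpool with parpool; adjust the worker count to your machine):

if matlabpool('size') == 0     % no worker pool open yet
    matlabpool('open', 4);     % e.g., 4 CPU cores
end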

fit_algorithm='simplex';
% fit_algorithm='GlobalSearch';  % would be too slow
session.fit(fit_algorithm, 'single_cluster');
session.show_fit_statistics()
session.transfer_all_parameters_to_individual_datasets();
figure_handle1=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Final',session.name));
Starting fitting session using simplex algorithm on a single cluster.
Working folder: "single_cluster_fit"
Variable parameter:
    '  1:log10(K_A)'

Total dataset is saved into "single_cluster_fit".
Checking on "single_cluster_fit":  10 trials.

Checking on "single_cluster_fit":  20 trials.

Checking on "single_cluster_fit":  30 trials.

Checking on "single_cluster_fit":  40 trials.

Checking on "single_cluster_fit":  50 trials.

Using ALL 50 fitting runs ignoring the exit flags indicating that 0 runs failed to converge).
Elapsed time is 4.127840 seconds.

ans =



Fitting results for "KA_fit_to_noisy_data"
==========================

Fitting method: simplex
Number of active datasets (non-zero priority): 1

Reporting only on variable parameters (dataset-#:parameter-name):
(# in the fit-vector)            [min value, max value] for 95% confidence interval
--------------------------
(1) name:  "  1:log10(K_A)"
(1) value=  4.939737e+00      [ 4.658623e+00       5.786297e+00 ]

Total sum of squares (SS) = 1.30989e-06
Statistics per dataset:
------------
(1) "Test_2":     Priority=1,     R^2=0.967,    1e+02% of total SS,   Model= "U-model" 


Remember: the confidence intervals are direct percentiles of all results. You can see all fitting results in the session properties all_raw_starting_parameters and all_raw_final_parameters, as well as in all_filtered_starting_parameters and all_filtered_final_parameters, which contain the same information but include only the successful runs.
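The reported interval can be cross-checked directly from these arrays (a sketch, assuming all_filtered_final_parameters has one row per successful trial and one column per parameter in the model order shown earlier; prctile requires the Statistics Toolbox):

KA_column = 3;                                                   % position of log10(K_A) in the parameter vector
KA_trials = session.all_filtered_final_parameters(:, KA_column);
prctile(KA_trials, [2.5 97.5])                                   % compare with the reported 95% confidence interval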

Show distribution of fitting parameters

If fitted parameters ended up at or beyond the parameter limits, they are excluded from these distributions.

variable_parameter_index=1; % we only have one here
histogram_bins=20;
figure_handle2=session.plot_one_histogram(variable_parameter_index, histogram_bins);

Save data

session.export_results(sprintf('%s.Fit.%s', session_name, fit_algorithm), {figure_handle1 figure_handle2})
Output current session state, statistics and graphs into "KA_fit_to_noisy_data.Fit.simplex/".
Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_noisy_data.Fit.simplex/adjustable_parameters_only.txt".
Statistics on per-dataset basis is written into "KA_fit_to_noisy_data.Fit.simplex/all_datasets.txt".
Writing figures...
Figure 1: small PNG image,  large PNG image,  EPS image,  MATLAB figure file - written
Figure 2: small PNG image,  large PNG image,  EPS image,  MATLAB figure file - written
Done

CONCLUSIONS AND COMMENTS

  1. Avoid using Newton-based optimization algorithms: they are very prone to getting stuck, especially in Monte-Carlo fitting. Use 'simplex' or one of the global optimization algorithms instead.
  2. Simplex is a reliable choice when fitting for confidence interval determination. It ignores parameter limits, however, which sometimes leads to problems with convergence. The Global Optimization algorithms are great but too slow for this fitting mode.