Tutorial 2. Fitting of 1D NMR line shape
by Evgenii Kovrigin, 05/17/2011
In this tutorial we will load a dataset created in the previous tutorial and fit its data.
IMPORTANT NOTE: We cannot fit simulated ideal data where noise RMSD was set to 0 because this number is used to weigh the residuals as target-function=sum-of-squares/noise-RMSD. If noise RMSD is 0---fitting fundamentally fails because target-function becomes infinite! We still want to do it to see whether fitting will give us back the parameters used in a simulation. A 'go-around' is simply to simulate data with a very small noise RMSD thus closely reproducing the ideal profile and enabling a test of the fitting algorithms.
Contents
- Load previously prepared dataset
- Prepare starting parameters and their limits
- Collect parameter in a vector corresponding to U model.
- Prepare fitting session object
- Assign starting parameters and limits
- Set which parameters of the model will be fixed and variable
- Fit 'nearly ideal' dataset without determination of confidence intervals
- FIT A 'NOISY' DATASET
- Fit noisy dataset with determination of confidence intervals
- Show distribution of fitting parameters
- CONCLUSIONS AND COMMENTS
close all clear all
Load previously prepared dataset
Import saved data
temp=load('tutorial_1.mat');
Extract objects from the import structure
test1=temp.test1; test2=temp.test2; setup=temp.setup;
Check contents
test1 % show summary of the dataset test1.plot_simulation('Ideal'); % show original simulation and perturbed dataset test2 % show summary of the dataset test2.plot_simulation('Noisy'); % show original simulation and perturbed dataset setup
test1 = NMRLineShapes1D handle Properties: operation_mode: 'Simulation' data_type: [1x18 char] name: 'Test_1' model_names_map: [10x1 containers.Map] number_of_models: 10 model_handles: {1x10 cell} model_names: {1x10 cell} model_descriptions: {1x10 cell} model_parameters_names: {1x10 cell} active_model_number: 2 active_model_N_parameters: 9 algorithm: 'not defined' best_fit_parameters: [1x9 double] best_fit_parameters_conf95: [] data_length: 50 X_original: [50x1 double] X: [50x1 double] XSE: [50x1 double] Y_original: [50x1 double] Y: [50x1 double] YSE: [50x1 double] Y_ideal: [50x1 double] MonteCarlo_range_min: [1x9 double] MonteCarlo_range_max: [1x9 double] parameter_limits_min: [1x9 double] parameter_limits_max: [1x9 double] smooth_points: 250 X_name: [1x17 char] Y_name: [1x18 char] X_range: [-100 300] Y_range: [] X_direction: 'reverse' model_numeric_solver: 'analytical' model_numeric_options: 'no options' options_lsqcurvefit: [1x1 struct] options_fminsearch: [1x1 struct] options_fmincon_active_set: [1x1 struct] options_fmincon_interior_point: [1x1 struct] conf_high: 97.5000 conf_low: 2.5000 fraction_conf: 0.0500 N_max_trials: 200 test2 = NMRLineShapes1D handle Properties: operation_mode: 'Simulation' data_type: [1x18 char] name: 'Test_2' model_names_map: [10x1 containers.Map] number_of_models: 10 model_handles: {1x10 cell} model_names: {1x10 cell} model_descriptions: {1x10 cell} model_parameters_names: {1x10 cell} active_model_number: 2 active_model_N_parameters: 9 algorithm: 'not defined' best_fit_parameters: [1x9 double] best_fit_parameters_conf95: [] data_length: 50 X_original: [50x1 double] X: [50x1 double] XSE: [50x1 double] Y_original: [50x1 double] Y: [50x1 double] YSE: [50x1 double] Y_ideal: [50x1 double] MonteCarlo_range_min: [1x9 double] MonteCarlo_range_max: [1x9 double] parameter_limits_min: [1x9 double] parameter_limits_max: [1x9 double] smooth_points: 250 X_name: [1x17 char] Y_name: [1x18 char] X_range: [-100 300] Y_range: [] X_direction: 'reverse' model_numeric_solver: 'analytical' model_numeric_options: 'no options' options_lsqcurvefit: [1x1 struct] options_fminsearch: [1x1 struct] options_fmincon_active_set: [1x1 struct] options_fmincon_interior_point: [1x1 struct] conf_high: 97.5000 conf_low: 2.5000 fraction_conf: 0.0500 N_max_trials: 200 setup = Rtotal: 1.0000e-03 LRratio: 0.7500 Log10_K_A: 5 k_2_A: 10 w0_1: 0 w0_2: 200 FWHH_1: 20 FWHH_2: 20 ScaleFactor: 1
Remember that test1 is ideal dataset with zero noise RMSD. Recalculate data using very small number for noise RMSD
X_RMSD=0; % does not need to be changed Y_RMSD=1e-6; test1.simulate_noisy_data([ setup.Rtotal setup.LRratio setup.Log10_K_A setup.k_2_A ... setup.w0_1 setup.w0_2 setup.FWHH_1 setup.FWHH_2 ... setup.ScaleFactor ], X_RMSD, Y_RMSD); figure1=test1.plot_simulation('Nearly ideal');
Prepare starting parameters and their limits
Each parameter has to have formally assigned minimum and maximum values it cannot exceed in fitting run. The Monte-Carlo range sets the interval the random starting parameters are taken from for determination of confidence interval via Monte-Carlo protocol.
Parameters, which are going to be fixed in fitting may have all these limits set to 0. Let's only fit for binding affinity constant as if all other parameters are known by design of our virtual experiment.
% EQUILIBRIUM CONSTANT (entered as logarithm of base 10) log10_KA=log10(1e1); % have it off-set from true value
Absolute limits:
log10_KA_min=log10(1e1); log10_KA_max=log10(1e10);
These ranges are disregarded by 'simplex' algorithm but obeyed by all others.
Monte-Carlo search range:
log10_KA_Monte_Carlo_min=log10(5e1); log10_KA_Monte_Carlo_max=log10(1e7);
NOTE: These ranges are only used in Monte-Carlo runs if session.vary_starting_parameters='yes'. Its default value (as it is now) is 'no' % consider turning to 'no'. Thus simulated noisy data sets are generated anew but fitting starts from the 'best-fit' parameter set. This set is actually used to synthesize noisy data. True Monte-Carlo approach calls for varying these starting parameters too before every fitting run but (1) in many times I did not see difference in results while (2) 'no' mode may be noticeably faster.
Collect parameter in a vector corresponding to U model.
To view the order of parameters in the model---simply display the active model of the dataset:
test1.show_active_model()
ans = Active model: Model 2: "U-model" Model description "1D NMR line shape for the U model, no temperature/field dependence, analytical" Model handle: line_shape_equations_1D.U_model_1D Current solver: analytical Model parameters: 1: Rtotal 2: LRratio 3: log10(K_A) 4: k_2_A 5: w0_1 6: w0_2 7: FWHH_1 8: FWHH_2 9: ScaleFactor
Create a vector of parameter values
starting_parameters=[ setup.Rtotal setup.LRratio log10_KA setup.k_2_A ...
setup.w0_1 setup.w0_2 setup.FWHH_1 setup.FWHH_2 setup.ScaleFactor ];
Create vectors of absolute limits. Parameters, which will not be optimized do not need meaningful limits (use 0).
min_limits=[ 0 0 log10_KA_min 0 0 0 0 0 0 ]; max_limits=[ 0 0 log10_KA_max 0 0 0 0 0 0 ];
Create vectors of Monte-Carlo ranges. They will only be used if you decide to fit data with determination of confidence intervals.
min_MC = [ 0 0 log10_KA_Monte_Carlo_min 0 0 0 0 0 0 ]; max_MC = [ 0 0 log10_KA_Monte_Carlo_max 0 0 0 0 0 0 ];
Prepare fitting session object
Fitting session is performed using TotalFit object. The datasets themselves have fitting functionality but their internal procedures are far less powerful than the ones offered by TotalFit.
Choose a name (to be used for saving data)
session_name='KA_fit_to_ideal_data';
Create fitting session
session=TotalFit(session_name);
Load dataset with ideal (no noise) simulated data.
session.copy_dataset_into_array(test1); session.show_datasets()
ans = Fitting session: KA_fit_to_ideal_data Current datasets: ----------------------------- Dataset 1 --------------- Priority = 1 Name: Test_1 Type: 1D NMR Line shapes Active model: Model 2: "U-model" Model description "1D NMR line shape for the U model, no temperature/field dependence, analytical" Model handle: line_shape_equations_1D.U_model_1D Current solver: analytical Model parameters: 1: Rtotal 2: LRratio 3: log10(K_A) 4: k_2_A 5: w0_1 6: w0_2 7: FWHH_1 8: FWHH_2 9: ScaleFactor
NOTE: The dataset gets a serial number (1 here), which it is addressed by in the session object.
Prepare fitting environment
session.initialize_relation_matrix(sprintf('%s.all_datasets.txt', session_name)) session.generate_fitting_environment(sprintf('%s.fitting_environment.txt', session_name))
Dataset listing is written to "KA_fit_to_ideal_data.all_datasets.txt". Fitting environment saved in KA_fit_to_ideal_data.fitting_environment.txt. ans = New parameter relations (indexed as in all_parameters vector): [1] 1:Rtotal [2] 1:LRratio [3] 1:log10(K_A) [4] 1:k_2_A [5] 1:w0_1 [6] 1:w0_2 [7] 1:FWHH_1 [8] 1:FWHH_2 [9] 1:ScaleFactor Summary of datasets with lookup vectors (position of each parameter in all_parameter vector) -------- Dataset 1/Test_1, model "U-model" Parameters : Rtotal LRratio log10(K_A) k_2_A w0_1 w0_2 FWHH_1 FWHH_2 ScaleFactor Lookup vector : 1 2 3 4 5 6 7 8 9
Assign starting parameters and limits
dataset_serial_number=1; % session.assign_parameter_values(dataset_serial_number, starting_parameters,... min_MC, max_MC, min_limits, max_limits);
Inspect initial guess
figure_handle=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Initial',session.name));
Pretty terrible.
Save initial state of fitting session on a disk for future inspection and reuse
session.export_results(sprintf('%s.initial_guess', session_name), {figure_handle})
Output current session state, statistics and graphs into "KA_fit_to_ideal_data.initial_guess/". Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_ideal_data.initial_guess/adjustable_parameters_only.txt". Statistics on per-dataset basis is written into "KA_fit_to_ideal_data.initial_guess/all_datasets.txt". Writing figures... Figure 1: small PNG image, large PNG image, EPS image, MATLAB figure file - written Done
Set which parameters of the model will be fixed and variable
parameter_modes=[ 0 0 1 0 0 0 0 0 0]; session.set_parameters_modes(dataset_serial_number, parameter_modes);
Fit 'nearly ideal' dataset without determination of confidence intervals
Select fitting algorithm. Choices: 'Newton-active-set', 'Newton-interior-point', 'simplex', 'GlobalSearch', 'DirectSearch'. The first one is fast gradient search. The fastest of all and tends to be trapped in local minima or stuck on shallow slopes. The 'simplex' is slower and far more reliable. Yet may get stuck in a local minimum. The two last ones are the slowest algorithms coming from Global Optimization Toolbox. They make effort to ensure that the minimum of sum of squares of residual is a global one but, at least, 50x slower. Each fitting algorithm has options controlling its operation (options_fminsearch, options_fmincon_active_set, options_fmincon_interior_point, GlobalSearch_options, MultiStart_options, DirectSearch_options). They may be set as a property of TotalFit but we do not need them right now.
fit_only_algorithm='GlobalSearch';
Report on start-up
fprintf(1,'\n\n==============\nFitting Algorithm: %s\nVariable parameters:', fit_only_algorithm);
============== Fitting Algorithm: GlobalSearch Variable parameters:
Fit without error determination
tic session.fit_only(fit_only_algorithm); toc
GlobalSearch stopped because it analyzed all the trial points. All 62 local solver runs converged with a positive local solver exit flag. Elapsed time is 7.236392 seconds.
Report on finishing
session.show_fit_statistics() session.transfer_all_parameters_to_individual_datasets();
ans = Fitting results for "KA_fit_to_ideal_data" ========================== Fitting method: GlobalSearch Number of active datasets (non-zero priority): 1 Reporting only on variable parameters (dataset-#:parameter-name): (# in the fit-vector) [min value, max value] for 95% confidence interval -------------------------- (1) name: " 1:log10(K_A)" (1) value= 4.806708e+00 [ 0.000000e+00 0.000000e+00 ] Total sum of squares (SS) = 8.86423e-07 Statistics per dataset: ------------ (1) "Test_1": Priority=1, R^2=0.987, 1e+02% of total SS, Model= "U-model"
Inspect final result
figure_handle=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Initial',session.name));
Export final result
session.export_results(sprintf('%s.fit_only.%s', session_name, fit_only_algorithm), {figure_handle})
Output current session state, statistics and graphs into "KA_fit_to_ideal_data.fit_only.GlobalSearch/". Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_ideal_data.fit_only.GlobalSearch/adjustable_parameters_only.txt". Statistics on per-dataset basis is written into "KA_fit_to_ideal_data.fit_only.GlobalSearch/all_datasets.txt". Writing figures... Figure 1: small PNG image, large PNG image, EPS image, MATLAB figure file - written Done
FIT A 'NOISY' DATASET
Create new fitting session: repeat all above steps for the second dataset.
session_name='KA_fit_to_noisy_data'; session=TotalFit(session_name); session.copy_dataset_into_array(test2); session.initialize_relation_matrix(sprintf('%s.all_datasets.txt', session_name)) session.generate_fitting_environment(sprintf('%s.fitting_environment.txt', session_name)) dataset_serial_number=1; session.assign_parameter_values(dataset_serial_number, starting_parameters,... min_MC, max_MC, min_limits, max_limits);
Dataset listing is written to "KA_fit_to_noisy_data.all_datasets.txt". Fitting environment saved in KA_fit_to_noisy_data.fitting_environment.txt. ans = New parameter relations (indexed as in all_parameters vector): [1] 1:Rtotal [2] 1:LRratio [3] 1:log10(K_A) [4] 1:k_2_A [5] 1:w0_1 [6] 1:w0_2 [7] 1:FWHH_1 [8] 1:FWHH_2 [9] 1:ScaleFactor Summary of datasets with lookup vectors (position of each parameter in all_parameter vector) -------- Dataset 1/Test_2, model "U-model" Parameters : Rtotal LRratio log10(K_A) k_2_A w0_1 w0_2 FWHH_1 FWHH_2 ScaleFactor Lookup vector : 1 2 3 4 5 6 7 8 9
Inspect initial guess
figure_handle=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Initial',session.name));
Pretty terrible.
Save initial state of fitting session on a disk for future inspection and reuse
session.export_results(sprintf('%s.initial_guess', session_name), {figure_handle})
Output current session state, statistics and graphs into "KA_fit_to_noisy_data.initial_guess/". Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_noisy_data.initial_guess/adjustable_parameters_only.txt". Statistics on per-dataset basis is written into "KA_fit_to_noisy_data.initial_guess/all_datasets.txt". Writing figures... Figure 1: small PNG image, large PNG image, EPS image, MATLAB figure file - written Done
Set parameter modes and fit
parameter_modes=[ 0 0 1 0 0 0 0 0 0]; session.set_parameters_modes(dataset_serial_number, parameter_modes); % fit_only_algorithm='simplex'; fit_only_algorithm='GlobalSearch'; % Global Optimization Toolbox algorithm fprintf(1,'\n\n==============\nFitting Algorithm: %s\nVariable parameters:', fit_only_algorithm); tic session.fit_only(fit_only_algorithm); toc session.show_fit_statistics() session.transfer_all_parameters_to_individual_datasets(); figure_handle=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Final',session.name)); session.export_results(sprintf('%s.fit_only.%s', session_name, fit_only_algorithm), {figure_handle})
============== Fitting Algorithm: GlobalSearch Variable parameters: GlobalSearch stopped because it analyzed all the trial points. All 48 local solver runs converged with a positive local solver exit flag. Elapsed time is 4.580126 seconds. ans = Fitting results for "KA_fit_to_noisy_data" ========================== Fitting method: GlobalSearch Number of active datasets (non-zero priority): 1 Reporting only on variable parameters (dataset-#:parameter-name): (# in the fit-vector) [min value, max value] for 95% confidence interval -------------------------- (1) name: " 1:log10(K_A)" (1) value= 4.918282e+00 [ 0.000000e+00 0.000000e+00 ] Total sum of squares (SS) = 1.31037e-06 Statistics per dataset: ------------ (1) "Test_2": Priority=1, R^2=0.967, 1e+02% of total SS, Model= "U-model" Output current session state, statistics and graphs into "KA_fit_to_noisy_data.fit_only.GlobalSearch/". Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_noisy_data.fit_only.GlobalSearch/adjustable_parameters_only.txt". Statistics on per-dataset basis is written into "KA_fit_to_noisy_data.fit_only.GlobalSearch/all_datasets.txt". Writing figures... Figure 1: small PNG image, large PNG image, EPS image, MATLAB figure file - written Done
Fit noisy dataset with determination of confidence intervals
IMPORTANT NOTE: Confidence intervals are ONLY meaningful if the model fits data well that is a curve goes trough the points! Confidence intervals and parameter correlation plots then show how variation of the curves passing WITHIN errorbars may be accounted for by variation in parameters. If the curve DOES NOT pass through the error bars---Monte-Carlo run only wastes your time because the randomizer varies the synthetic data AROUND THE IDEAL SIMULATED CURVE. However, if this curve IS NOT GOING THROUGH the experimental data to begin with---entire Monte-Carlo run is meaningless because its random variation does reproduce deviation of the data from the model curve.
Before starting this fitting: start parallel processing framework by issuing 'matlabpool open N' command, where N is the number of your CPUs. The IDAP algorithm will then farm Monte-Carlo runs on multiple CPUs. If you do not have multiple CPU---do not do anything and it will still work (slower).
fit_algorithm='simplex'; % fit_algorithm='GlobalSearch'; % would be too slow session.fit(fit_algorithm, 'single_cluster'); session.show_fit_statistics() session.transfer_all_parameters_to_individual_datasets(); figure_handle1=session.dataset_plot_fit(dataset_serial_number, sprintf('%s / Final',session.name));
Starting fitting session using simplex algorithm on a single cluster. Working folder: "single_cluster_fit" Variable parameter: ' 1:log10(K_A)' Total dataset is saved into "single_cluster_fit". Checking on "single_cluster_fit": 10 trials. Checking on "single_cluster_fit": 20 trials. Checking on "single_cluster_fit": 30 trials. Checking on "single_cluster_fit": 40 trials. Checking on "single_cluster_fit": 50 trials. Using ALL 50 fitting runs ignoring the exit flags indicating that 0 runs failed to converge). Elapsed time is 4.127840 seconds. ans = Fitting results for "KA_fit_to_noisy_data" ========================== Fitting method: simplex Number of active datasets (non-zero priority): 1 Reporting only on variable parameters (dataset-#:parameter-name): (# in the fit-vector) [min value, max value] for 95% confidence interval -------------------------- (1) name: " 1:log10(K_A)" (1) value= 4.939737e+00 [ 4.658623e+00 5.786297e+00 ] Total sum of squares (SS) = 1.30989e-06 Statistics per dataset: ------------ (1) "Test_2": Priority=1, R^2=0.967, 1e+02% of total SS, Model= "U-model"
Remember: the confidence interval are direct percentiles of all results: you can see all fitting results in session properties: all_raw_starting_parameters and all_raw_final_parameters as well as in all_filtered_starting_parameters and all_filtered_final_parameters, which contain the same information but have only successful runs.
Show distribution of fitting parameters
If fitting parameters ended up at or beyond parameter limits they are thrown out of these distributions
variable_parameter_index=1; % we only have one here
histogram_bins=20;
figure_handle2=session.plot_one_histogram(variable_parameter_index, histogram_bins);
save data
session.export_results(sprintf('%s.Fit.%s', session_name, fit_algorithm), {figure_handle1 figure_handle2})
Output current session state, statistics and graphs into "KA_fit_to_noisy_data.Fit.simplex/". Adjustable parameters and cumulative dataset statistics is written into "KA_fit_to_noisy_data.Fit.simplex/adjustable_parameters_only.txt". Statistics on per-dataset basis is written into "KA_fit_to_noisy_data.Fit.simplex/all_datasets.txt". Writing figures... Figure 1: small PNG image, large PNG image, EPS image, MATLAB figure file - written Figure 2: small PNG image, large PNG image, EPS image, MATLAB figure file - written Done
CONCLUSIONS AND COMMENTS
- Avoid using Newton-based optimization algorithms --- they are so prone to be stuck, especially in Monte_Carlo fitting. Use 'simplex' or one of Global optimization algorithms instead.
- Simplex is a reliable choice when fitting for confidence intervals determination. It ignores parameter limits, however, which sometimes leads to problems with convergence. Global Optmization Algorithms are great but too slow for this fitting mode.