Concepts of data fitting
General ideas
- Data analysis is performed in two steps. First, you simply fit to see whether the model you use can account for the data at all (setup_parameters.fit_mode='fit_only'). Try different models and compile a selection of plausible models that produce good fits (after fitting is done, rename the folders with the results). For the best possible fit, use one of the more sophisticated global optimization algorithms (parameter setup_parameters.fit_only_algorithm).
- You can then test the relative likelihood of different models using Akaike's Information Criterion (see code/Control_scripts_archive/hypothesis_testing.m).
- Once you have selected the most likely model, determine confidence intervals for the optimized parameters; see Monte-Carlo analysis of optimization parameters.
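As an illustration of the kind of comparison hypothesis_testing.m performs, the corrected Akaike criterion for least-squares fits can be computed from the sum of squares SS, the number of data points n, and the number of fitted parameters K. This is a sketch with example numbers, not the actual script:

```
% Sketch of AIC-based model comparison for least-squares fits
% (illustrative only; see code/Control_scripts_archive/hypothesis_testing.m
% for the actual implementation).
n  = 120;                % number of data points (example value)
SS = [4.1 3.9];          % best-fit sums of squares of models 1 and 2
K  = [3 5];              % numbers of fitted parameters in each model
% corrected Akaike Information Criterion (AICc):
AICc = n.*log(SS./n) + 2*K + 2*K.*(K+1)./(n-K-1);
% relative likelihood of each model versus the best one:
rel_likelihood = exp((min(AICc) - AICc)/2)
```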
Monte-Carlo analysis of optimization parameters
Introduction
Every functional dependence (equation) has specific properties: some parameters are strongly constrained by the shape of the experimental data, while others are defined only weakly. In addition, the specific distribution of experimental points in the dataset has a great impact on how well particular parameters are defined by fitting that dataset, and noise in the experimental data affects different parameters to different degrees. Perturbing the experimental data and independent variables, followed by fitting in multiple trials, is called Monte-Carlo analysis and provides a measure of confidence in the best-fit values. Multiple independent fitting runs allow building distributions for each fitting parameter as well as revealing their interdependency (correlation), such that a deviation in one parameter value may be compensated by a change in the others without a significant effect on the sum of squares.
Diagnostics of convergence in fitting runs
The TotalFit session property vary_starting_parameters controls whether the standard randomizer (or other randomizing modules) varies the starting parameters for the Monte-Carlo runs. Ideally, it should be 'yes' at all times, but this setting entails very slow convergence. In addition, the Newton-based algorithms ('Newton-active-set' and 'Newton-interior-point') sometimes finish without improving on the initial guess. If this happens when vary_starting_parameters='yes', the problem is very difficult to detect because you obtain a very broad random distribution of 'best-fit' results. The only way to detect this lack of convergence is to compare the properties all_raw_starting_parameters and all_raw_final_parameters to see whether the fitting process changed the values of the fitting parameters. The properties all_filtered_starting_parameters and all_filtered_final_parameters contain the same information but include only the successful runs.
One way to make this lack of convergence easier to detect is to set vary_starting_parameters to 'no', so that the simulated data are perturbed with random noise but fitting always starts from the best-fit result for the experimental data. In this case, when the fitting algorithm is not doing anything, you see confidence intervals matching the best-fit parameters (zero-width intervals). The solution in this situation is to change the minimizer or alter its options. I chose 'no' as the default setting (switchable by the user).
Sums of squares from all fitting runs are also collected in the all_raw_SS and all_filtered_SS arrays, as are the exit flags from the fitting routine (all_raw_exitflags and all_filtered_exitflags). Look at these arrays when troubleshooting your Monte-Carlo runs.
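As an example, a quick convergence check could compare the starting and final parameter arrays directly. This is a sketch; it assumes fit_session is your TotalFit session and that the parameters are stored with one row per trial:

```
% Compare starting and final parameter values over successful runs
start_p = fit_session.all_filtered_starting_parameters;
final_p = fit_session.all_filtered_final_parameters;
rel_change = abs(final_p - start_p) ./ abs(start_p);
% Columns that are essentially zero flag parameters that never moved,
% i.e. the solver did not improve on the starting point:
suspect = find(max(rel_change, [], 1) < 1e-6)
```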
You may observe a graphical result of every Monte-Carlo fitting run by setting the monte_carlo_monitoring property of the TotalFit session. The options are OFF, BRIEF, and SEE-ALL-FIGURES:
- SEE-ALL-FIGURES - in a single-core run you will see figures for the individual Monte-Carlo fitting results. This is only for troubleshooting, to check whether the perturbations of the data and starting parameters still produce reasonable fits. The session prints the figure and stops; type 'return' to continue to the next trial and its figure, and 'dbquit' to quit the mode. NOTE: this mode only works on datasets that are included in series (individual datasets will not be plotted). Do not use it in real runs: the large number of figures will cause a memory overflow!
- BRIEF - show only the trial number and SS.
- OFF - silent run (default setting).
See TotalFit.m for more properties controlling the Monte-Carlo runs.
Setting up the Monte-Carlo fitting
Monte-Carlo analysis is typically started by setting setup_parameters.fit_mode='fit_with_error_analysis'. NOTE: uncertainties of the variable parameters are ONLY MEANINGFUL IF your MODEL FITS the data WELL. This mode generally requires a lot of time; see the Parallelization section below for improving performance.
Several important aspects of Monte-Carlo analysis are described below.
- For each Monte-Carlo run the data are perturbed according to their uncertainties to create a synthetic noisy dataset. Importantly, the independent variables are also perturbed according to their uncertainties. The randomizer property of the TotalFit session contains a handle to the function that performs these perturbations (according to the specifics of each experiment).
- Starting parameters for every trial are drawn randomly from the range defined by the [parameter]_Monte_Carlo_min/max values.
- The most effective approach is to perform these runs in parallel, using either MATLAB parallelization on a multi-core machine or a cluster, or fitting daemons running on multiple computers. The master session monitors convergence by examining the files output by the daemons. Once the confidence intervals have stabilized, it sends a stop signal to the daemons.
- If you do not have enough CPU power, or convergence from random points is unsatisfactory, you can set setup_parameters.vary_starting_parameters='no'. In this case a new synthetic noisy dataset is created for each run, but the starting parameters remain the same: the best-fit parameters from fitting the original experimental data (instead of a randomly chosen point).
- If you are still short of CPU power, you can also switch to a simpler solver and set the starting parameters to the best-fit parameters found by a more sophisticated solver in a previous run. The fitting algorithms are described below in Fitting algorithms.
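Putting these settings together, a typical error-analysis setup might look like the following sketch. Only the field names mentioned above are taken from TotalFit; Kd is a hypothetical parameter name used for illustration:

```
% Sketch of a Monte-Carlo error-analysis setup (illustrative only).
setup_parameters.fit_mode = 'fit_with_error_analysis';
% Start every trial from the best-fit point when CPU power is limited:
setup_parameters.vary_starting_parameters = 'no';
% When starting points ARE varied, each trial draws them from a range
% defined by [parameter]_Monte_Carlo_min/max values, e.g. for a
% hypothetical parameter Kd (set wherever your model defines its ranges):
% Kd_Monte_Carlo_min = 1e-7;
% Kd_Monte_Carlo_max = 1e-3;
```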
Display of the Monte-Carlo results
The Monte-Carlo algorithm computes percentiles of the optimization results from the fitting runs and reports confidence intervals. However, the 'best-fit' result is only one of the fitting results in a virtual experiment with synthetic noisy data. Therefore, observing histograms of the parameters is more informative: I prefer to take the confidence intervals as calculated but extract the most probable value directly from the distribution histograms. For more on plotting histograms and parameter correlations, see docs/TotalFit/methods_index: FITTING WITH DETERMINATION OF CONFIDENCE INTERVALS.
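For instance, a 95% interval and a histogram for one parameter could be extracted as follows. This is a sketch: prctile requires the Statistics Toolbox, and the row-per-trial layout of the results array is an assumption:

```
% Percentile confidence interval and histogram for one fitted parameter
p  = fit_session.all_filtered_final_parameters(:, 1);  % first parameter
ci = prctile(p, [2.5 97.5])    % 95% confidence interval from percentiles
hist(p, 30)                    % read the most probable value off the peak
```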
Fitting algorithms
Algorithms for optimizing the variable parameters of the models to minimize deviation from the experimental data are based on standard MATLAB optimization functions. They come in two flavors: local and global solvers. Generally, the difference is that local solvers find a single minimum and do not attempt to verify whether it is global, while global solvers perform multiple local-solver runs in an attempt to find the best local minimum. For descriptions of these algorithms, see the Optimization and Global Optimization toolboxes, respectively. All solvers return the parameters of the minimum they found and save their detailed output in the optimization_output property of the TotalFit session. The settings to invoke these algorithms and their general features are listed below:
'Newton-active-set' and 'Newton-interior-point'
- Local solvers based on the Newton algorithm with bounds (fmincon).
- Options may be set through the options_fmincon_active_set and options_fmincon_interior_point properties of the TotalFit session.
- Very fast algorithms, but prone to stalling if the experimental data contain a lot of noise. This problem is easily detected: the optimization parameters change very slightly, if at all, from their initial values during fitting (due to a very shallow gradient of the sum of squares).
- See fmincon description in MATLAB Optimization Toolbox.
'simplex'
- A local solver based on the Nelder-Mead (simplex) algorithm, fminsearch. Ignores parameter bounds.
- Options may be set through options_fminsearch property of the TotalFit session.
- A slow algorithm, but much more robust than the Newton-based ones.
- See fminsearch description in MATLAB Optimization Toolbox.
'GlobalSearch'
- A global solver that uses fmincon: it performs multiple runs of the local solver and attempts to verify whether a better minimum exists than the one already found. Runs are started from points chosen as most likely to lead to the global minimum, and the solver uses a heuristic to decide when to stop. It looks for the global minimum but may miss some local ones. Does not use parallel processors.
- Local solver options may be set through the GlobalSearch_options property of the TotalFit session. Additional options may be set through the GlobalSearch_object property of the TotalFit session.
- Returns parameters of the lowest minimum. Additional minima are given in all_solutions property of the TotalFit session and detailed output in optimization_output.
- See GlobalSearch description in MATLAB Global Optimization Toolbox.
'MultiStart'
- A global solver that performs multiple local runs of fmincon. The number of runs is specified by the MultiStart_starting_points property of the TotalFit session. Runs are started from random points chosen within the parameter bounds.
- Uses parallel processors by default. Issue matlabpool open N-cores-you-have before the run.
- Local solver options may be set through the MultiStart_options property of the TotalFit session. Additional options may be set through the MultiStart_object property of the TotalFit session.
- Returns parameters of the lowest minimum. Additional minima are given in all_solutions property of the TotalFit session and detailed output in optimization_output.
- See MultiStart description in MATLAB Global Optimization Toolbox.
'DirectSearch'
- A global solver: a very sophisticated version of the simplex algorithm. Uses the patternsearch global optimization function, which is not gradient-based and therefore remains stable on very rough or very shallow surfaces. Generally a slow algorithm, but it uses parallel processors by default. Issue matlabpool open N-cores-you-have before the run.
- Local solver options may be set through the DirectSearch_options property of the TotalFit session. Its 'SearchMethod' field controls how the algorithm places new points on the sum-of-squares surface.
- Returns one minimum. The all_solutions property of the TotalFit session is set to an empty array; detailed output is saved in optimization_output.
- See the patternsearch description in the MATLAB Global Optimization Toolbox.
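To illustrate how a solver is selected and tuned, a sketch follows. Only the property names listed above are taken from TotalFit; the optimset fields are standard MATLAB optimization options, and the tolerance value is an arbitrary example:

```
% Sketch: choosing and tuning a global solver (illustrative only).
setup_parameters.fit_only_algorithm = 'MultiStart';        % pick the solver
fit_session.MultiStart_starting_points = 50;               % local runs to perform
fit_session.MultiStart_options = optimset('TolFun', 1e-9); % options for local fmincon
```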
Parallelization
Introduction
There are two separate concepts in parallelizing fitting runs. First, multiple local cores or a cluster may be used through the MATLAB Parallel Computing Toolbox. Second, multiple machines that are not part of a cluster may be employed for a distributed run. For clarity, in what follows I use 'cluster' to refer to a machine with multiple CPUs/cores or to an actual cluster, while I use 'core' for a single CPU/core of a multi-core workstation.
For a detailed description of fitting methods implemented in TotalFit see FITTING WITH DETERMINATION OF CONFIDENCE INTERVALS
Using parallelization tools of MATLAB
- To use parallel computing on a multi-core machine you need the Parallel Computing Toolbox in MATLAB. To use a cluster you need the MATLAB Distributed Computing Server installed on the cluster.
- Remember: parallel computing is NOT ALWAYS faster than a single processor, because communication between the individual cores (workers) slows down processing.
- To activate parallel computing on a multi-core machine or a cluster, issue matlabpool open N-cores-you-have before the run. matlabpool close formally stops the workers on the parallel cores.
- To utilize the CPUs of a cluster, see the MATLAB documentation on establishing communication with the MATLAB Distributed Computing Server and creating the corresponding configuration for a parallel job.
- Parallel processors are utilized by default by the MultiStart and DirectSearch algorithms (this may be changed through their options).
- For Monte-Carlo runs use 'single-cluster' mode.
- By default, Monte-Carlo runs will use up to 8 (as of 2011) cores on a local machine, or more on a cluster. Set the number_of_trials property of the TotalFit object to a value larger than the number of your CPUs/cores. This value defines how many jobs are farmed out in one block to your CPUs/cores in parallel (before the data are saved to disk and the next block of jobs is farmed out).
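A typical parallel session on an 8-core machine might look like the following sketch (matlabpool is the 2011-era command; later MATLAB releases replaced it with parpool):

```
matlabpool open 8                    % start one worker per core
fit_session.number_of_trials = 32;   % larger than the core count: jobs per block
% ... start the Monte-Carlo analysis here ...
matlabpool close                     % release the workers
```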
Using TotalFit daemons
If you do not have a cluster but have network access to multiple workstations, TotalFit implements MATLAB-independent distributed-computing functionality through ssh and disk sharing.
- For performing Monte-Carlo runs on multiple cores you need to
- Daemons communicate with the main fitting session via text files placed in the daemons' working folders. Therefore, daemons can also be manipulated manually: creating a stop.signal file forces a daemon to stop calculations and enter waiting mode, and creating an exit.signal file (while the daemon is in waiting mode) forces it to quit.
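For example, a daemon can be signaled from a shell as follows. The working-folder path here is a hypothetical placeholder; only the stop.signal and exit.signal file names come from TotalFit:

```shell
DAEMON_DIR=./daemon_work            # hypothetical daemon working folder
mkdir -p "$DAEMON_DIR"
touch "$DAEMON_DIR/stop.signal"     # daemon stops calculations, enters waiting mode
touch "$DAEMON_DIR/exit.signal"     # a waiting daemon quits
ls "$DAEMON_DIR"
```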