Back to main topic

General Description

PURPOSE

Dataset is a superclass, a parent of all classes for different types of data. I design these classes so that every dataset object contains data and a list of of specific mathematical models associated with this type of data. This way superclass methods don't need to know a correct model for each data type and a structure of X datasets.

TWO MODES OF OPERATION: SIMULATION AND FITTING
DEVELOPING SUBCLASSES
SIMULATION MODE
FITTING MODE
MODELS
MODEL PARAMETERS
FITTING ALGORITHMS
FITTING WITH ERROR ESTIMATES
CONFIDENCE INTERVALS FROM MONTE-CARLO ANALYSIS

Back to Contents

TWO MODES OF OPERATION: SIMULATION AND FITTING

The class objects may be created with one of the two modes: 'Simulation' or 'Fitting' meant, correspondingly, to perform simulation analysis of the models or fitting of real experimental data.

Specific methods serving these two modes perform checking whether they are called in the mode the are allowed to operate. You cannot change the mode of the existing object. This is done for making sure the internal state of the object is consistent. If you need to overcome this restriction you simply create a new object of the same class and feed it with the simulated (perturbed) data from X and Y properties. However, many methods are working in both modes, such as fitting.

Back to Contents

DEVELOPING SUBCLASSES

You can only derive one level of subclass that will have Simulation mode and Fitting mode functionalities. If you want further sub-subclasses - they need to specialize and NOT use variable parameters. This is due to the fact that MATLAB, when passing variable number of parameters, always packages them in another cell array every time and call from subclass->superclass constructor is different from sub-subclass->subclass->superclass preventing use of multiple parameters in a sub-subclass!

Back to Contents

SIMULATION MODE

Simulation mode is used for visualization of the model behavior and analysis of model's features.

In Simulation mode you can (1) select a model using set_active_model() method, (2) set values of independent variable(s): set_X(X), (3) simulate data for the parameter set: simulate_data(parameters) (4) plot them: plot_simulation (5) generate noise-perturbed Y-data and plot them (simulate_noisy_data() and plot_simulation()). If you need to produce more than one perturbed dataset from original ideal data you may simply use restore_original_XY(), which will recover original X and Y values without recomputing the model.

Special note on data perturbation with gaussian noise: simulate_noisy_data() computes ideal data and perturbs them with random gaussian noise with standard deviation X_std and Y_std. All these perturbations are done using X and Y data saved in 'original' arrays so the function may be run multiple times and will use original source datasets for simulations. To keep X or Y from being perturbed, - simply enter 0 in place of X_std or Y_std.

The superclass provides basic methods to do all that with simple models. Plotting and handling of data for more complex models is performed by specific methods of derivative classes, replacing methods of the parent superclass.

Back to Contents

FITTING MODE

The Fitting mode is for fitting of real experimental or simulated data with different models and statistical assessment of the model likelihood to correctly describe the data.

FITTING MODE: DATA INPUT The data name, X, Y data and standard errors of Y must be set upon object creation. Standard errors of X are optional, set separately by set_XSE() method. If they are present, Monte-Carlo error analysis will take them into account. Note that the input data are COLUMN-format. The name may be reset later as well (set_name()).

NOTE ON UNCERTAINTIES IN THE X AND Y DATA: Rememeber that I assume these uncertainties on experimental data are STANDARD DEVIATIONS, not CONFIDENCE INTERVALS!!! If you have confidence intervals instead you should include a custom generate_perturbed_XY() function in your subclass which reduces them by a factor 1.96 to approximate standard deviations of random noise. See example in R2_Dispersion.m.

Back to Contents

MODELS

A list of models applicable to the data is defined in the definition of the derivative class. The list may be viewed by list_models() method. Any model from the list may be set 'active' by using set_active_model() method supplied with the name of the model from the list.

An acitive model returns calculated Y values through model() method and a weighed sum of squares of residuals from experimental data through SS() method.

If the model uses numerical procedures to compute the signal---use model_numeric_solver and model_numeric_options properties to control operation of the numeric routine. These parameters have to be reset to specific values in specific subclasses of the Dataset.

Back to Contents

MODEL PARAMETERS

Numeric values of the model parameters is a vector, which user supplies. One sets parameters using set_parameters() method. Use plot_fit() to view quality of the initial guess.

Monte-Carlo ranges and parameters limits are assigned by using set_ranges_and_limits(). Supplying an empty array, [], in place of any of the parameters of set_ranges_and_limits() results in retaining previous values of this parameter.

Default Monte-Carlo ranges and parameters limits are [-Inf, Inf] so you need to explicitly set them to enable meaningful Monte-Carlo trials.

The best-fit parameters originating from external fitting algorithm (TotalFit.fit()) are imported through assign_best_fit_parameters(). The 'algorithm' gives the name of the algorithm used to determine these values.

Back to Contents

FITTING ALGORITHMS

There are three fitting routines for the data using Levenberg-Marquardt, simplex and Newton algorithms.

If one does NOT need error estimates then fitting may be done much quicker using simple_fit() method with 'Levenberg-Marquardt', 'Newton-active-set', 'Newton-interior-point' or 'simplex' modes. The first three obey parameter limits, the simplex does not. Your reasons to choose one over another: convergence to acceptable results.

Coarse estimate of speed: fit('Levenberg-Marquardt', initial_parameters) = 1 fit('Newton-active-set', initial_parameters) = 2 fit('Newton-interior-point', initial_parameters) = 4 fit('simplex', initial_parameters) = 6

Back to Contents

FITTING WITH ERROR ESTIMATES

To fit and estimate confidence intervals for the model parameters one can use fit() method.

Levenberg-Marquardt is the fastest least squares method. Use fit('Levenberg-Marquardt/Jacobian', initial_parameters). It provides for rigorous estimate of parameter uncertanties but ignores individual errors of the data points (both X and Y) or treats all Y values as similarly reliable. It is possible to set upper and lower parameter bounds with this algorithm (upper_bounds_lsqcurvefit and lower_bounds_lsqcurvefit). The other options for fitting may be set through options_lsqcurvefit structure (see options for 'lsqcurvefit' in the MATLAB Help system).

The same algorithm but using Monte-Carlo error estimate is run via fit('Levenberg-Marquardt', initial_parameters). It is much slower but provides parameter correlation plots that are helpful as they visualize the distribution the confidence intervals are taken from. This method in combination with Jacobian-based method may serve for testing whether the Monte-Carlo procedure provides for adequate estimate of parameter uncertanties for specific data and the models.

Simplex method is very robust, slow and uses weights given by errors of individual points in both X and Y dimensions. This is the slowest method by design. Its run options may be set through options_fminsearch structure (see options for 'fminsearch()' in MATLAB Help). To run fitting using simplex method use fit('simplex', initial_parameters)

Newton method is based on using gradients. This is not the most suitable one for solving sum-of-squares problems but may be quicker than simplex and uses individual X and Y errors as weights for residuals. To run it use fit('Newton', initial_parameters).

'Levenberg-Marquardt/Jacobian' is the method of choice for fitting in this class because it is the fastest one providing error estimate if your data DON NOT have significant errors in X values and ALL Y-values are EQUALLY PRECISE. Other three handle the data with unequal Y errors and errors in X. They also provide correlation plots for parameters of the model. The last two may also be used with piece-wise target functions encountered with multi-model fitting (the Levenberg-Marquardt functions are based on MATLAB's lsqcurvefit() that cannot handle piecewise data, it needs contiguous X dataset, as well as its error estimate requires a full column rank jacobian, not possible with the piecewise data).

Back to Contents

CONFIDENCE INTERVALS FROM MONTE-CARLO ANALYSIS

The fit('Levenberg-Marquardt/Jacobian', initial_parameters) method provides for statistically rigorous estimates of confidence intevals if your data DON NOT have significant errors in X values and ALL Y-values are EQUALLY PRECISE. The p value is set through options_lsqcurvefit) based on MATLAB's lsqcurvefit() and nlparci() functions. However, this approach also cannot handle piecewise data that will be encountered in global analysis of multiple datasets in IDAP project (in TotalFit class). Therefore, another method for estimation of confidence intervals is implemented here based on Monte-Carlo simulations and distribution analysis. Side-by-side runs of the built-in and Monte-Carlo methods allow for validation of the simulation procedure.

NOTE: The Monte-Carlo analysis in Dataset and in TotalFit are independent, separate procedures. It is generally advisable to do fitting and data analysis through TotalFit because those procedures are more sophisticated. However, for convenience, Dataset includes its own fitting and Monte-Carlo capability.

To estimate confidence intervals the fit() method (1) determines best-fit values of parameters for the experimental dataset, (2) calculates ideal values of Y, the dependent variable, and (3) perturbs Y using YSE, standard errors of Y, as a standard deviation for generating normal distirbution of perturbed Y values. If standard errors of X are present the algorithm also perturbs X. Then the starting parameters for fitting are randomly generated using ranges defined by MonteCarlo_range_min and MonteCarlo_range_max values. Then the synthetic data is fit to obtain best-fit parameters for the perturbed dataset. These steps are repeated multiple times.

The confidence intervals of the variable parameters in the model are detemined by taking a percentile range of the distribution of results of fitting trials (conf_high and conf_low). The default confidence interval is estimated for 95% of fitting results falling within the estimated bounds. This is not a very rigorous estimate of confidence intervals. Therefore, the methods fit('Levenberg-Marquardt/Jacobian', initial_parameters) and fit('Levenberg-Marquardt', initial_parameters) need to be run on every new type of data one after another to make sure that confidence intervals that come out from two different techniques are similar. The 'Levenberg-Marquardt' is not applicable if the error in X data is significant and/or Y errors are dramatically different between the points.

Number of trials needed for the accurate estimate of parameter errors is determined automatically during Monte-Carlo run. The algorithm watches the confidence intervals to reach a plateau when adding more and more fitting trials to a distribution analysis. Variable fraction_conf sets a threshold to stop Monte-Carlo run. This variable is a fraction of the full confidence interval and its change after adding next 10 trials have to be smaller than fraction_conf to terminate Monte-Carlo analysis. Reasonable value is 0.05. However, to keep Monte-Carlo from going indefinitely in case of multi-minima (periodic) functions, the variable N_max_trials limit the total length of the run. Reasonable value is 200.

NOTE: Fitting using TotalFit object DOES NOT call fitting methods outlined above. It only calls SS() method and runs its own fitting and collects its own statistics.

Back to Contents