Chi2fit.Statistics.empirical_cdf
empirical_cdf(data, bin \\ {1.0, 0.5}, algorithm \\ :wilson, correction \\ 0)
Specs
empirical_cdf([{float(), number()}], {number(), number()}, algorithm(), integer()) :: {cdf(), bins :: [float()], numbins :: pos_integer(), sum :: float()}
Generates an empirical Cumulative Distribution Function from sample data.
Three parameters determine the resulting empirical distribution:
algorithm for assigning errors,
the size of the bins,
a correction for limiting the bounds on the 'y' values
When modeling, for example, task effort or duration, some measured tasks have a recorded time of 0. In practice this means that the task effort lies between 0 and 1 hour. This is where binning of the data comes in: specify a bin size to control how it is done. A bin size of 1 means that an effort of 0 is mapped to an effort of 1/2 (the middle of the bin). This also prevents problems when the fitted distribution cannot cope with an effort of zero.
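The mapping described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the module `BinSketch` and the function `to_bin/2` are hypothetical, and the sketch assumes that the `bin` tuple is read as `{size, offset}`, with `offset` giving the relative position inside the bin (0.5 being the middle).

```elixir
defmodule BinSketch do
  # Hypothetical helper, not part of Chi2fit.
  # Assumption: `size` is the bin width and `offset` is the relative
  # position inside the bin that represents it (0.5 = middle of the bin).
  def to_bin(value, {size, offset}) when size > 0 do
    # `/` always returns a float in Elixir, so Float.floor/1 is safe here.
    (Float.floor(value / size) + offset) * size
  end
end

# An effort of 0 falls in the bin [0, 1) and is represented by its middle.
IO.inspect(BinSketch.to_bin(0, {1.0, 0.5}))
# → 0.5

# An effort of 2.3 falls in the bin [2, 3).
IO.inspect(BinSketch.to_bin(2.3, {1.0, 0.5}))
# → 2.5
```

With this reading, no binned value is ever exactly zero as long as the offset is positive.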
Supports two ways of assigning errors: Wald score or Wilson score. See [1]. Valid values for the algorithm argument are :wald or :wilson.
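For orientation, the two scores are usually written as below. These are the standard textbook forms, given here as an assumption about what the two options compute; the exact expressions used by the library may differ. The symbols are introduced for illustration only: \hat{y} is the empirical 'y' value, N the number of data points, and z the quantile for the chosen confidence level.

```latex
% Wald interval (symmetric around \hat{y})
\hat{y} \pm z \sqrt{\frac{\hat{y}\,(1 - \hat{y})}{N}}

% Wilson score interval (asymmetric around \hat{y}, stays within [0, 1])
\frac{1}{1 + \frac{z^2}{N}}
\left(
  \hat{y} + \frac{z^2}{2N}
  \pm z \sqrt{\frac{\hat{y}\,(1 - \hat{y})}{N} + \frac{z^2}{4N^2}}
\right)
```

The Wald width vanishes at \hat{y} = 0 and \hat{y} = 1, whereas the Wilson width does not because of the extra z^2 / (4N^2) term under the square root.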
In the handbook of MCMC [1] a cumulative distribution is constructed. For the largest 'x' value
in the sample, the 'y' value is exactly one (1). In combination with the Wald score this
gives zero errors on the value '1'. If the resulting distribution is used to fit a curve
this may give an infinite contribution to the maximum likelihood function.
Use the correction number to have a 'y' value of slightly less than 1 to prevent this from
happening.
The combination of a correction of 0, the :wald algorithm, and the 'linear' model for handling asymmetric errors is especially problematic.
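The zero-error problem at 'y' = 1 can be reproduced numerically. The sketch below is illustrative only: `ScoreSketch`, `wald/3`, and `wilson/3` are hypothetical names, the formulas are the standard textbook Wald and Wilson expressions rather than the library's own code, and z = 1.0 is an assumed default.

```elixir
defmodule ScoreSketch do
  # Hypothetical helpers, not part of Chi2fit.

  # Wald half-width: z * sqrt(p * (1 - p) / n)
  def wald(p, n, z \\ 1.0), do: z * :math.sqrt(p * (1.0 - p) / n)

  # Wilson score interval, returned as {low, high}
  def wilson(p, n, z \\ 1.0) do
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z / denom * :math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n))
    {center - half, center + half}
  end
end

# At y = 1 the Wald error is exactly zero ...
IO.inspect(ScoreSketch.wald(1.0, 100))
# → 0.0

# ... while the Wilson interval keeps a finite width.
{low, high} = ScoreSketch.wilson(1.0, 100)
IO.inspect(high - low > 0.0)
# → true

# A 'y' value slightly below 1 gives a non-zero Wald error as well.
IO.inspect(ScoreSketch.wald(0.99, 100) > 0.0)
# → true
```

A zero error on a data point makes its weight in a fit unbounded, which is why a 'y' value slightly below 1 or the Wilson score avoids the issue.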
The algorithm parameter determines how the errors on the 'y' value are determined. Currently
supported values include :wald and :wilson.
References
[1] "Handbook of Monte Carlo Methods" by Kroese, Taimre, and Botev, section 8.4
[2] https://en.wikipedia.org/wiki/Cumulative_frequency_analysis
[3] https://arxiv.org/pdf/1112.2593v3.pdf
[4] https://en.wikipedia.org/wiki/Student%27s_t-distribution:
90% confidence ==> t = 1.645 for many data points (> 120)
70% confidence ==> t = 1.000