Signature | Description | Parameters |
---|---|---|
`#include <DataFrame/DataFrameStatsVisitors.h>`<br><br>`template<typename T, typename I = unsigned long, std::size_t A = 0>`<br>`struct LowessVisitor;`<br><br>`template<typename T, typename I = unsigned long, std::size_t A = 0>`<br>`using lowess_v = LowessVisitor<T, I, A>;` | This is a “single action visitor”, meaning it is passed the whole data vector in one call and you must use the `single_act_visit()` interface.<br><br>This functor performs LOcally WEighted Scatterplot Smoothing (LOWESS). A LOWESS function outputs smoothed estimates of the dependent variable (y) at the given independent variable (x) values. This lowess function implements the algorithm given in the reference below using local linear estimates.<br><br>Suppose the input data has N points. The algorithm estimates each smoothed yi by taking the frac * N points closest to (xi, yi), based on their x values, and fitting yi with a weighted linear regression. The weight for (xj, yj) is the tricube function applied to \|xi - xj\| (the weight functions are written out below the table).<br><br>If loop_n > 1, further weighted local linear regressions are performed, where the weights are the same as above times the lowess_bisquare function of the residuals. Each iteration takes approximately the same amount of time as the original fit, so these iterations are expensive. They are most useful when the noise has extremely heavy tails, such as Cauchy noise. Noise with less heavy tails, such as t-distributions with df > 2, is less problematic. The weights downgrade the influence of points with large residuals. In the extreme case, points whose residuals are larger than 6 times the median absolute residual are given weight 0.<br><br>delta can be used to save computations. For each xi, regressions are skipped for points closer than delta. The next regression is fit for the farthest point within delta of xi, and all points in between are estimated by linearly interpolating between the two regression fits. Judicious choice of delta can cut computation time considerably for large data (N > 5000). A good choice is delta = 0.01 * range(independent var); a sketch of this calculation follows the code example below. Some experimentation is likely required to find a good choice of frac and loop_n for a particular dataset.<br><br>Reference: Cleveland, W.S. (1979) "Robust Locally Weighted Regression and Smoothing Scatterplots". Journal of the American Statistical Association 74 (368): 829-836.<br><br>`get_result()` returns the vector of fitted y values. There is also `get_residual_weights()` that returns the residual weights. The first column passed must be the dependent (y) or endog column. The second column passed must be the independent (x) or exog column. A usage sketch with non-default constructor arguments follows the table.<br><br>`explicit LowessVisitor(std::size_t loop_n = 3, T frac = 2.0 / 3.0, T delta = 0, bool sorted = false);`<br>loop_n: The number of iterations.<br>frac: Between 0 and 1. The fraction of the data used when estimating each y-value.<br>delta: Distance within which to use linear interpolation instead of weighted regression.<br>sorted: Whether the x and y columns are already sorted in order of ascending x values. | T: Column data type.<br>I: Index type.<br>A: Memory alignment boundary for vectors. Default is system default alignment. |
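For reference, these are the tricube regression weights and the bisquare robustness weights the description refers to, following Cleveland (1979). Here d_i is the distance from x_i to the farthest of the frac * N points used for its local fit, e_j are the residuals of the previous pass, and m is the median absolute residual; a point whose residual reaches 6m therefore gets weight 0, as stated above.

$$
W(u) = \begin{cases} \left(1 - |u|^{3}\right)^{3} & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}
\qquad
w_{ij} = W\left(\frac{|x_i - x_j|}{d_i}\right)
$$

$$
B(u) = \begin{cases} \left(1 - u^{2}\right)^{2} & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}
\qquad
\delta_j = B\left(\frac{e_j}{6m}\right), \qquad m = \operatorname{median}(|e_j|)
$$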
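The test program further down drives the visitor with its default constructor arguments. The following is a minimal sketch of passing non-default arguments; it assumes a `MyDataFrame` (the same alias used in the test) that already holds the `dep_var` and `indep_var` columns, and the parameter values and the `run_lowess_sketch()` name are purely illustrative.

```cpp
// Sketch only: the parameter values and run_lowess_sketch() are illustrative,
// not part of the library. MyDataFrame is assumed to be the same DataFrame
// alias used in the test below, already loaded with "dep_var" and "indep_var".
#include <DataFrame/DataFrameStatsVisitors.h>

#include <iostream>

static void run_lowess_sketch(MyDataFrame &df)  {

    // 5 robustifying iterations, half of the data per local fit,
    // no interpolation shortcut, x values not pre-sorted.
    LowessVisitor<double>   lowess(5,       // loop_n
                                   0.5,     // frac
                                   0.0,     // delta
                                   false);  // sorted

    // Dependent (y/endog) column first, independent (x/exog) column second
    df.single_act_visit<double, double>("dep_var", "indep_var", lowess);

    std::cout << "First smoothed y: " << lowess.get_result().front()
              << ", first residual weight: "
              << lowess.get_residual_weights().front() << std::endl;
}
```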
```cpp
static void test_LowessVisitor()  {

    std::cout << "\nTesting LowessVisitor{ } ..." << std::endl;

    std::vector<unsigned long>  idx = {
        123450, 123451, 123452, 123453, 123454, 123455, 123456, 123457, 123458, 123459, 123460,
        123461, 123462, 123466, 123467, 123468, 123469, 123470, 123471, 123472, 123473,
    };
    std::vector<double>         x_vec = {
        0.5578196, 2.0217271, 2.5773252, 3.4140288, 4.3014084, 4.7448394, 5.1073781, 6.5411662,
        6.7216176, 7.2600583, 8.1335874, 9.1224379, 1.9296663, 2.3797674, 3.2728619, 4.2767453,
        5.3731026, 5.6476637, 8.5605355, 8.5866354, 8.7572812,
    };
    std::vector<double>         y_vec = {
        18.63654, 103.49646, 150.35391, 190.51031, 208.70115, 213.71135, 228.49353, 233.55387,
        234.55054, 223.89225, 227.68339, 223.91982, 168.01999, 164.95750, 152.61107, 160.78742,
        168.55567, 152.42658, 221.70702, 222.69040, 243.18828,
    };
    MyDataFrame                 df;

    df.load_data(std::move(idx),
                 std::make_pair("indep_var", x_vec),
                 std::make_pair("dep_var", y_vec));

    LowessVisitor<double>   l_v;

    df.single_act_visit<double, double>("dep_var", "indep_var", l_v);

    auto    actual_yfit = std::vector<double> {
        68.1432, 119.432, 122.75, 135.633, 142.724, 165.905, 169.447, 185.617, 186.017, 191.865,
        198.03, 202.234, 206.178, 215.053, 216.586, 220.408, 226.671, 229.052, 229.185, 230.023,
        231.657,
    };

    for (size_t idx = 0; idx < actual_yfit.size(); ++idx)
        assert(fabs(l_v.get_result()[idx] - actual_yfit[idx]) < 0.001);

    auto    actual_weights = std::vector<double> {
        0.641773, 0.653544, 0.940738, 0.865302, 0.990575, 0.971522, 0.92929, 0.902444, 0.918228,
        0.924041, 0.855054, 0.824388, 0.586045, 0.945216, 0.94831, 0.998031, 0.999834, 0.991263,
        0.993165, 0.972067, 0.990308,
    };

    for (size_t idx = 0; idx < actual_weights.size(); ++idx)
        assert(fabs(l_v.get_residual_weights()[idx] - actual_weights[idx]) < 0.00001);
}
```
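The test above leaves delta at its default of 0, so every point gets its own weighted regression. For larger inputs, the guideline from the description (delta = 0.01 * range of the independent variable) can be computed directly from the x data before constructing the visitor. The helper below is a sketch under that guideline; `suggested_delta()` is a made-up name, not a library function.

```cpp
// Sketch only: suggested_delta() is an illustrative helper, not part of the
// library. It applies the "delta = 0.01 * range(independent var)" guideline.
#include <algorithm>
#include <vector>

template<typename T>
static T suggested_delta(const std::vector<T> &x_vec)  {

    const auto  [min_it, max_it] =
        std::minmax_element(x_vec.begin(), x_vec.end());

    return (T(0.01) * (*max_it - *min_it));
}

// Usage with the vectors from the test above:
//
//     LowessVisitor<double>   l_v (3, 2.0 / 3.0, suggested_delta(x_vec));
//
//     df.single_act_visit<double, double>("dep_var", "indep_var", l_v);
```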