DataFrame is a templatized and heterogeneous C++ container designed for data analysis for statistical, machine-learning, or financial applications.
Artificial-Intelligence Visitors |
---|
struct AffinityPropVisitor{ } |
struct BiasVisitor{ } |
struct CubicSplineFitVisitor{ } |
struct DecomposeVisitor{ } |
struct EntropyVisitor{ } |
struct ExponentialFitVisitor{ } |
struct FastFourierTransVisitor{ } |
struct ImpurityVisitor{ } |
struct KMeansVisitor{ } |
struct LinearFitVisitor{ } |
struct LogFitVisitor{ } |
struct LossFunctionVisitor{ } |
struct LowessVisitor{ } |
struct NormalizeVisitor{ } |
struct PolicyLearningLossVisitor{ } |
struct PolyFitVisitor{ } |
struct ProbabilityDistVisitor{ } |
struct RectifyVisitor{ } |
struct SigmoidVisitor{ } |
struct SLRegressionVisitor{ } |
struct StandardizeVisitor{ } |
Types |
---|
enum class box_cox_type{ } |
enum class bucket_type{ } |
enum class concat_policy{ } |
enum class decompose_type{ } |
enum class drop_policy{ } |
enum class fill_policy{ } |
enum class exponential_decay_spec{ } |
enum class hampel_type{ } |
enum class impurity_type{ } |
enum class Index2D{ } |
enum class io_format{ } |
enum class join_policy{ } |
enum class linreg_moving_mean_type{ } |
enum class loss_function_type{ } |
enum class mad_type{ } |
enum class mean_type{ } |
enum class nan_policy{ } |
enum class pattern_spec{ } |
enum prob_dist_type{ } |
enum class quantile_policy{ } |
enum class random_policy{ } |
enum class rank_policy{ } |
enum class rectify_type{ } |
enum class remove_dup_spec{ } |
enum class return_policy{ } |
enum class roll_policy{ } |
enum class shift_policy{ } |
enum class sigmoid_type{ } |
enum class sort_spec{ } |
enum class sort_state{ } |
enum class time_frequency{ } |
operator df_divides( ) |
operator df_minus( ) |
operator df_multiplies( ) |
operator df_plus( ) |
struct BadRange{ } |
struct ColNotFound{ } |
struct DataFrameError{ } |
struct InconsistentData{ } |
struct MemUsage{ } |
struct NotFeasible{ } |
struct NotImplemented{ } |
Stand-alone Numeric Generators |
---|
gen_bernoulli_dist( ) |
gen_binomial_dist( ) |
gen_cauchy_dist( ) |
gen_chi_squared_dist( ) |
gen_dft_sample_freq( ) |
gen_even_space_nums( ) |
gen_exponential_dist( ) |
gen_extreme_value_dist( ) |
gen_fisher_f_dist( ) |
gen_gamma_dist( ) |
gen_geometric_dist( ) |
gen_log_space_nums( ) |
gen_lognormal_dist( ) |
gen_negative_binomial_dist( ) |
gen_normal_dist( ) |
gen_poisson_dist( ) |
gen_student_t_dist( ) |
gen_sym_triangle( ) |
gen_triangular_nums( ) |
gen_uniform_int_dist( ) |
gen_uniform_real_dist( ) |
gen_weibull_dist( ) |
Views have many practical use cases. A view is a slice of a DataFrame that refers to the original DataFrame's data. It looks exactly like a DataFrame, but if you modify any data in the view, the corresponding data point(s) in the original DataFrame are also modified, and vice versa. There are certain things you cannot do with a view. For example, you cannot add or delete columns, extend the index column, ...
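The reference semantics described above can be illustrated without the library itself. The following is a minimal sketch, not DataFrame's view API: `ColumnView` and `view_shares_storage` are hypothetical names, and the view covers a single `std::vector` column only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a view over one column (not DataFrame's API).
// It stores a pointer into the original data instead of copying it.
template<typename T>
class ColumnView
{
public:
    // Views the half-open range [begin, end) of the given column.
    ColumnView(std::vector<T> &column, std::size_t begin, std::size_t end)
    {
        assert(begin <= end && end <= column.size());
        data_ = column.data() + begin;
        size_ = end - begin;
    }

    T &operator[](std::size_t i) { return data_[i]; }
    const T &operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return size_; }

private:
    T *data_ { nullptr };
    std::size_t size_ { 0 };
};

// Shows that writes through the view and writes to the original column
// are visible on both sides.
inline bool view_shares_storage()
{
    std::vector<double> column { 1.0, 2.0, 3.0, 4.0, 5.0 };
    ColumnView<double> view(column, 1, 4);  // refers to 2.0, 3.0, 4.0

    view[0] = 20.0;    // modifies column[1]
    column[3] = 40.0;  // visible as view[2]

    return column[1] == 20.0 && view[2] == 40.0 && view.size() == 3;
}
```

Like the views described above, this sketch offers no way to grow or shrink the underlying data; it can only read and modify what is already there. Resizing the original vector would invalidate the stored pointer.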
In general there are two kinds of views
Visitors are the main mechanism for implementing analytical (i.e. statistical, financial, machine-learning) algorithms. You can follow the visitor interface to add your own custom algorithms and thereby extend the DataFrame package. Visitors also play several roles that in other packages may be handled by separate interfaces: apply, transformer, and algorithm. For example, a visitor can transform column(s), or it can treat the column(s) as read-only and implement an algorithm.
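To make the idea concrete, here is a minimal, library-independent sketch of a read-only visitor. The names `MeanVisitorSketch`, `visit_column`, and `visitor_example` are hypothetical and only mimic the general shape of a visitor (a functor that is fed one value at a time and then queried for its result). Consult DataFrame.h for the actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical read-only visitor: accumulates a running sum and count,
// then reports the mean. It never modifies the column it visits.
struct MeanVisitorSketch
{
    void pre() { sum_ = 0.0; count_ = 0; result_ = 0.0; }  // reset state

    void operator()(const double &value)  // called once per element
    {
        sum_ += value;
        ++count_;
    }

    // Finalizes the result; an empty column yields 0.0 by convention here.
    void post()
    {
        result_ = count_ ? sum_ / static_cast<double>(count_) : 0.0;
    }

    double get_result() const { return result_; }

private:
    double sum_ { 0.0 };
    std::size_t count_ { 0 };
    double result_ { 0.0 };
};

// Hypothetical driver standing in for the container: it walks a column
// and hands every element to the visitor.
template<typename V>
V &visit_column(const std::vector<double> &column, V &visitor)
{
    visitor.pre();
    for (const double &value : column)
        visitor(value);
    visitor.post();
    return visitor;
}

// Small self-check showing how the pieces fit together.
inline void visitor_example()
{
    const std::vector<double> column { 1.0, 2.0, 3.0, 4.0 };
    MeanVisitorSketch mean_visitor;

    assert(visit_column(column, mean_visitor).get_result() == 2.5);
}
```

A transforming visitor would follow the same shape but receive the elements by non-const reference and write to them.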
There are two visitor interfaces:
DataFrame gives you the ability to allocate memory on custom alignment boundaries.
You can use this feature to take advantage of SIMD instructions in modern CPUs. Since DataFrame algorithms all operate on vectors of data (columns), this can come in handy in conjunction with compiler optimizations. You can also use alignment to prevent false cache-line sharing between multiple columns.
There are convenient typedefs that define DataFrames that allocate memory on, for example, 64, 128, 256, ... byte boundaries. See DataFrame Library Types.
When you get access to columns in a DataFrame, you will get a reference to a StlVecType. StlVecType is just a std::vector with a custom allocator for the requested alignment.
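The idea of a std::vector with an alignment-aware allocator can be sketched as follows. This is an illustration under stated assumptions, not the library's own allocator: `AlignedAllocatorSketch`, `AlignedVector64`, `is_aligned`, and `aligned_example` are hypothetical names, and the code relies on C++17 aligned `operator new`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator (not the library's own): every allocation starts
// on an Alignment-byte boundary.
template<typename T, std::size_t Alignment>
struct AlignedAllocatorSketch
{
    static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");
    static_assert(Alignment >= alignof(T),
                  "Alignment must not be weaker than the type requires");

    using value_type = T;

    // Needed explicitly because of the non-type template parameter.
    template<typename U>
    struct rebind { using other = AlignedAllocatorSketch<U, Alignment>; };

    AlignedAllocatorSketch() noexcept = default;

    template<typename U>
    AlignedAllocatorSketch(
        const AlignedAllocatorSketch<U, Alignment> &) noexcept { }

    T *allocate(std::size_t n)
    {
        void *raw = ::operator new(n * sizeof(T),
                                   static_cast<std::align_val_t>(Alignment));
        return static_cast<T *>(raw);
    }

    void deallocate(T *ptr, std::size_t) noexcept
    {
        ::operator delete(ptr, static_cast<std::align_val_t>(Alignment));
    }
};

// The allocator is stateless, so all instances compare equal.
template<typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocatorSketch<T, A> &,
                const AlignedAllocatorSketch<U, A> &) noexcept { return true; }

template<typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocatorSketch<T, A> &,
                const AlignedAllocatorSketch<U, A> &) noexcept { return false; }

// A column type whose storage begins on a 64-byte boundary.
template<typename T>
using AlignedVector64 = std::vector<T, AlignedAllocatorSketch<T, 64>>;

inline bool is_aligned(const void *ptr, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Small self-check: the column's buffer honors the requested alignment.
inline void aligned_example()
{
    AlignedVector64<double> column(1000, 1.5);

    assert(is_aligned(column.data(), 64));
}
```

A 64-byte boundary matches the cache-line size of many current CPUs, which is why it is a common choice when the goal is to keep separate columns from sharing a cache line.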
Random generators, and a few other numeric generators, were added as a series of convenient stand-alone functions for generating random numbers (they cover all C++ standard distributions). You can seamlessly use these routines to generate random DataFrame columns.
See this document and file RandGen.h and dataframe_tester.cc.
For the definition and defaults of RandGenParams, see this document and file DataFrameTypes.h
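As a rough illustration of what generating a random column involves, the sketch below uses only the C++ standard `<random>` facilities. `make_normal_column` is a hypothetical helper, not the library's `gen_normal_dist`. The library's own routines and the RandGenParams type are described in the files referenced above.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not the library's gen_normal_dist): returns n draws
// from a normal distribution. A fixed seed makes the output repeatable on
// a given standard-library implementation.
inline std::vector<double>
make_normal_column(std::size_t n, double mean, double stddev,
                   unsigned int seed)
{
    assert(stddev > 0.0);

    std::mt19937 engine(seed);
    std::normal_distribution<double> dist(mean, stddev);
    std::vector<double> column;

    column.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        column.push_back(dist(engine));
    return column;
}
```

The resulting vector could then be loaded into a container as a column. Passing the seed explicitly is a design choice of this sketch that keeps test runs reproducible.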
The DataFrame library is almost a header-only library. Currently the only library source file is DateTime.cc.
Starting from the root directory:
The include directory contains almost all of the code. It includes .h and .tcc files; the latter are C++ template code files (mostly located in the Internals subdirectory). The main header file is DataFrame.h. It contains the DataFrame class and its public interface, with comprehensive comments for each public interface call. The rest of the files will show you how the sausage is made. The include directory also contains subdirectories that hold mostly internal DataFrame implementation. One exception is the Utils subdirectory.
The src directory contains Linux-only makefiles and the Utils subdirectory.
The test directory contains all the test source files, mocked data files, and test output files. The main test source files are dataframe_tester.cc and dataframe_tester_2.cc. They contain test cases for all functionalities of DataFrame. They are not very well organized yet; I plan to make the test cases more organized.
Using plain make and make-files:
Go to the root of the repository, where the license file is, and execute build_all.sh. This builds the library and test executables for Linux/Unix flavors only.
Using CMake:
You can build this on Linux, Windows, macOS, and more; see README.
Using Package Managers:
You can also use Conan or VCPKG; see README.