C++ DataFrame

This is a templatized and heterogeneous C++ container with data-analysis functionality and interface.


DataFrame class is defined as:
template<typename I, typename H>
class DataFrame;
I specifies the index column type.
H specifies a heterogeneous vector type to contain DataFrame columns. H can only be one of HeteroVector, HeteroView, or HeteroPtrView; the latter two make the DataFrame a view.
Some of the methods in DataFrame return another DataFrame or one of the above views depending on what you asked for. DataFrame and view instances should be indistinguishable from the user's point of view.
A few convenient type aliases are provided:
template<typename I>
using StdDataFrame = DataFrame<I, HeteroVector>;

template<typename I>
using DataFrameView = DataFrame<I, HeteroView>;

template<typename I>
using DataFramePtrView = DataFrame<I, HeteroPtrView>;

DataFrame library interface is separated into two main categories:
  1. Accessing, adding, slicing & dicing, joining, & groupby'ing ... (The first column in the table below)
  2. Analytical algorithms being statistical, machine-learning, financial analysis … (The second and third columns in the table below)
I employ regular parameterized methods (i.e. member functions) to implement item (1). For item (2), I chose the visitor pattern.
Please see the table below for a comprehensive list of methods, visitors, and types, along with documentation and sample code for each feature.


Table of Features (with Code Samples)

DataFrame Member Methods
append_column( 2 )
append_index( 2 )
bucketize( )
bucketize_async( )
combine( 3 )
concat( )
concat_view( )
consolidate( 4 )
create_column( )
drop_missing( )
empty( )
fill_missing( 2 )
from_string( )
from_string_async( )
gen_datetime_index( )
gen_sequence_index( )
get_col_unique_values( )
get_column( 4 )
get_columns_info( )
get_data( )
get_data_by_idx( 2 )
get_data_by_loc( 2 )
get_data_by_rand( )
get_data_by_sel( 5 )
get_index( 2 )
get_memory_usage( )
get_reindexed( )
get_reindexed_view( )
get_row( 2 )
get_view( )
get_view_by_idx( 2 )
get_view_by_loc( 2 )
get_view_by_rand( )
get_view_by_sel( 5 )
groupby1( )
groupby1_async( )
groupby2( )
groupby2_async( )
groupby3( )
groupby3_async( )
has_column( 2 )
is_equal( )
join_by_column( )
join_by_index( )
load_align_column( )
load_column( 3 )
load_data( )
load_index( 2 )
make_consistent( )
modify_by_idx( )
multi_visit( )
pattern_match( )
read( )
read_async( )
remove_column( 2 )
remove_data_by_idx( )
remove_data_by_loc( )
remove_data_by_sel( 3 )
remove_duplicates( 6 )
remove_lock( )
rename_column( )
replace( 2 )
replace_async( 2 )
replace_index( )
retype_column( )
rotate( )
self_concat( )
self_rotate( )
self_shift( )
shape( )
set_lock( )
shapeless( )
shift( 2 )
shrink_to_fit( )
shuffle( )
single_act_visit( 5 )
single_act_visit_async( 5 )
sort( 5 )
sort_async( 5 )
to_string( )
to_string_async( )
transpose( )
value_counts( 2 )
visit( 5 )
visit_async( 5 )
write( )
write_async( )
DataFrame Built-in Visitors
struct AbsVisitor{ }
struct AffinityPropVisitor{ }
struct AutoCorrVisitor{ }
struct BetaVisitor{ }
struct BiasVisitor{ }
struct BoxCoxVisitor{ }
struct CategoryVisitor{ }
struct ClipVisitor{ }
struct CorrVisitor{ }
struct CovVisitor{ }
struct CumMaxVisitor{ }
struct CumMinVisitor{ }
struct CumProdVisitor{ }
struct CumSumVisitor{ }
struct DecomposeVisitor{ }
struct DotProdVisitor{ }
struct ExpandingRollAdopter{ }
struct ExponentiallyWeightedMeanVisitor{ }
struct ExponentialRollAdopter{ }
struct ExpoSmootherVisitor{ }
struct FactorizeVisitor{ }
struct FastFourierTransVisitor{ }
struct GeometricMeanVisitor{ }
struct HampelFilterVisitor{ }
struct HarmonicMeanVisitor{ }
struct HWExpoSmootherVisitor{ }
struct KMeansVisitor{ }
struct KthValueVisitor{ }
struct LogFitVisitor{ }
struct LowessVisitor{ }
struct MADVisitor{ }
struct MaxSubArrayVisitor{ }
struct MaxVisitor{ }
struct MeanVisitor{ }
struct MedianVisitor{ }
struct MinSubArrayVisitor{ }
struct MinVisitor{ }
struct ModeVisitor{ }
struct NLargestVisitor{ }
struct NMaxSubArrayVisitor{ }
struct NMinSubArrayVisitor{ }
struct NormalizeVisitor{ }
struct NSmallestVisitor{ }
struct PolyFitVisitor{ }
struct ProdVisitor{ }
struct QuadraticMeanVisitor{ }
struct QuantileVisitor{ }
struct RankVisitor{ }
struct SampleZScoreVisitor{ }
struct SEMVisitor{ }
struct SigmoidVisitor{ }
struct SimpleRollAdopter{ }
struct SLRegressionVisitor{ }
struct StandardizeVisitor{ }
struct StatsVisitor{ }
struct StdVisitor{ }
struct StepRollAdopter{ }
struct SumVisitor{ }
struct TrackingErrorVisitor{ }
struct TTestVisitor{ }
struct WeightedMeanVisitor{ }
struct ZScoreVisitor{ }
DataFrame Built-in Financial Visitors
struct AccumDistVisitor{ }
struct ArnaudLegouxMAVisitor{ }
struct AvgDirMovIdxVisitor{ }
struct BollingerBand{ }
struct CCIVisitor{ }
struct CenterOfGravityVisitor{ }
struct ChaikinMoneyFlowVisitor{ }
struct CoppockCurveVisitor{ }
struct DecayVisitor{ }
struct DoubleCrossOver{ }
struct DrawdownVisitor{ }
struct EBSineWaveVisitor{ }
struct EhlerSuperSmootherVisitor{ }
struct EntropyVisitor{ }
struct FisherTransVisitor{ }
struct GarmanKlassVolVisitor{ }
struct HeikinAshiCndlVisitor{ }
struct HodgesTompkinsVolVisitor{ }
struct HoltWinterChannelVisitor{ }
struct HullRollingMeanVisitor{ }
struct HurstExponentVisitor{ }
struct KamaVisitor{ }
struct MACDVisitor{ }
struct MassIndexVisitor{ }
struct OnBalanceVolumeVisitor{ }
struct ParabolicSARVisitor{ }
struct ParkinsonVolVisitor{ }
struct PercentPriceOSCIVisitor{ }
struct PivotPointSRVisitor{ }
struct PSLVisitor{ }
struct RateOfChangeVisitor{ }
struct ReturnVisitor{ }
struct RollingMidValueVisitor{ }
struct RSIVisitor{ }
struct RSXVisitor{ }
struct SharpeRatioVisitor{ }
struct SlopeVisitor{ }
struct TrueRangeVisitor{ }
struct TTMTrendVisitor{ }
struct UlcerIndexVisitor{ }
struct UltimateOSCIVisitor{ }
struct VarIdxDynAvgVisitor{ }
struct VertHorizFilterVisitor{ }
struct VWAPVisitor{ }
struct VWBASVisitor{ }
struct WilliamPrcRVisitor{ }
struct YangZhangVolVisitor{ }
DataFrame Types
enum class box_cox_type{ }
enum class bucket_type{ }
enum class concat_policy{ }
enum class decompose_type{ }
enum class drop_policy{ }
enum class fill_policy{ }
enum class exponential_decay_spec{ }
enum class hampel_type{ }
enum class Index2D{ }
enum class io_format{ }
enum class join_policy{ }
enum class mad_type{ }
enum class mean_type{ }
enum class nan_policy{ }
enum class pattern_spec{ }
enum class quantile_policy{ }
enum class random_policy{ }
enum class rank_policy{ }
enum class remove_dup_spec{ }
enum class return_policy{ }
enum class shift_policy{ }
enum class sigmoid_type{ }
enum class sort_spec{ }
enum class sort_state{ }
enum class time_frequency{ }
operator df_divides( )
operator df_minus( )
operator df_multiplies( )
operator df_plus( )
struct BadRange{ }
struct ColNotFound{ }
struct DataFrameError{ }
struct InconsistentData{ }
struct MemUsage{ }
struct NotFeasible{ }
struct NotImplemented{ }
Stand-alone Numeric Generators
gen_bernoulli_dist( )
gen_binomial_dist( )
gen_cauchy_dist( )
gen_chi_squared_dist( )
gen_even_space_nums( )
gen_exponential_dist( )
gen_extreme_value_dist( )
gen_fisher_f_dist( )
gen_gamma_dist( )
gen_geometric_dist( )
gen_log_space_nums( )
gen_lognormal_dist( )
gen_negative_binomial_dist( )
gen_normal_dist( )
gen_poisson_dist( )
gen_student_t_dist( )
gen_sym_triangle( )
gen_triangular_nums( )
gen_uniform_int_dist( )
gen_uniform_real_dist( )
gen_weibull_dist( )

Multithreading

  1. DataFrame uses static containers to achieve type heterogeneity. By default, these static containers are unprotected. This is by design, so by default there is no locking overhead. If you use DataFrame in a multithreaded program, you must provide a SpinLock, which is defined in the ThreadGranularity.h file. DataFrame will use your SpinLock to protect the containers.
    Please see set_lock() and remove_lock() above, and dataframe_tester.cc#3767, for a code example.
  2. In addition, instances of DataFrame are not thread-safe either. In other words, a single instance of DataFrame must not be used in multiple threads without protection, unless it is used as read-only.
  3. In the meantime, DataFrame utilizes multithreading in two different ways internally:
    1. Async Interface: There are asynchronous versions of some methods. For example, you have sort()/sort_async(), visit()/visit_async(), and more. The latter versions return a std::future that could execute in parallel.
    2. DataFrame uses multiple threads internally, unbeknownst to the user, in some of its algorithms when appropriate. The user can control (or turn off) the multithreading by calling set_thread_level(), which sets the maximum number of threads to be used. The default is 0. The optimal number of threads is a function of the user's hardware/software environment and is usually obtained by trial and error. set_thread_level(), and the threading level in general, is a static property; once set, it applies to all instances.
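
To make the first point more concrete, here is a minimal sketch of why static containers need a lock. It is not the library's code: TinySpinLock, TinyHeteroStore, and fill_concurrently() are hypothetical names invented for this illustration, and the sketch only assumes one plausible way of getting heterogeneity out of static containers (one static map per element type, shared by all instances). It also uses std::async and std::future, the same standard facility the async methods are said to return.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical minimal spin lock; NOT the SpinLock from ThreadGranularity.h.
class TinySpinLock {
public:
    void lock() noexcept {
        while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() noexcept { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical heterogeneous store: one static map per element type T,
// shared by all instances and keyed by the owning instance.
// (A fuller version would also erase its entries on destruction.)
class TinyHeteroStore {
public:
    template<typename T>
    void push_back(const T &value) {
        std::lock_guard<TinySpinLock> guard(shared_lock());
        storage<T>()[this].push_back(value);
    }

    template<typename T>
    std::size_t size() const {
        std::lock_guard<TinySpinLock> guard(shared_lock());
        const auto &table = storage<T>();
        const auto iter = table.find(this);
        return iter == table.end() ? 0 : iter->second.size();
    }

private:
    template<typename T>
    static std::unordered_map<const TinyHeteroStore *, std::vector<T>> &
    storage() {
        static std::unordered_map<const TinyHeteroStore *, std::vector<T>> table;
        return table;
    }

    static TinySpinLock &shared_lock() {
        static TinySpinLock lock;
        return lock;
    }
};

// Two tasks fill two different stores in parallel. Both tasks touch the
// same static map for int, which is why the shared lock is required.
inline std::size_t
fill_concurrently(TinyHeteroStore &a, TinyHeteroStore &b, int count) {
    auto work = [count](TinyHeteroStore &store) {
        for (int i = 0; i < count; ++i)
            store.push_back<int>(i);
    };
    std::future<void> fa = std::async(std::launch::async, work, std::ref(a));
    std::future<void> fb = std::async(std::launch::async, work, std::ref(b));

    fa.get();
    fb.get();
    return a.size<int>() + b.size<int>();
}
```

Even though the two tasks in this sketch never share a store instance, they still race on the shared static map unless the lock is taken. That is the situation the first point above describes.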

Views

Views have useful and practical use-cases. A view is a slice of a DataFrame that is a reference to the original DataFrame. It appears exactly the same as a DataFrame, but if you modify any data in the view, the corresponding data point(s) in the original DataFrame will also be modified and vice versa. There are certain things you cannot do in views. For example, you cannot add or delete columns, extend the index column, ...
For more understanding, look at this document further and/or the test files.
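
The reference semantics described above can be sketched with standard containers alone. This is only an illustration of the idea, not how the library builds its views: make_slice_view() is a hypothetical helper, and the "view" here is simply a vector of std::reference_wrapper over a slice of one column.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: build a "view" over rows [first, last) of a column.
// The view stores references to the original elements, not copies.
// Precondition (not checked here): first <= last <= column.size().
template<typename T>
std::vector<std::reference_wrapper<T>>
make_slice_view(std::vector<T> &column, std::size_t first, std::size_t last) {
    return std::vector<std::reference_wrapper<T>>(column.begin() + first,
                                                  column.begin() + last);
}
```

Writing through the view, for example view[0].get() = 20.0 on a view that starts at row 1, changes row 1 of the original column, and a later change to the original is visible through the view. In this sketch, growing the original vector could invalidate the stored references, which is one reason a reference-based slice cannot safely be extended.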


Visitors

Visitors are the main mechanism to implement analytical (i.e. statistical, financial, machine-learning) algorithms. You can easily follow the visitor interface to add your own custom algorithm, thereby extending the DataFrame package. Visitors also play several roles that in other packages may be handled by separate interfaces: they play the role of apply, transformer, and algorithms. For example, a visitor can transform column(s), or it may take the column(s) as read-only and implement an algorithm.
There are two visitor interfaces:

  1. Regular visit. This visitor is invoked by calling the visit() method on a DataFrame instance. In this case, DataFrame passes the given index and column(s) data points one by one to the visitor functor. This is convenient for algorithms that can operate on one data point at a time. Examples are the correlation and variance visitors.
  2. Single-action visit. This visitor is invoked by calling the single_act_visit() method on a DataFrame instance. In this case, begin and end iterators for the given index and column(s) are passed to the visitor functor, so the functor has access to all index and column(s) data at once. This is necessary for algorithms that need the whole data together. Examples are the return and median visitors.
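
The difference between the two calling conventions can be sketched without the library. Everything below is hypothetical: RunningSumSketch, MedianSketch, visit_sketch(), and single_act_visit_sketch() are stand-ins written for this illustration, and the real visit() and single_act_visit() signatures (which take column names and template arguments) are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Regular style: the functor is called once per (index, value) pair.
struct RunningSumSketch {
    double sum = 0.0;

    void operator()(const std::size_t & /*index*/, const double &value) {
        sum += value;
    }
};

// Single-action style: the functor is called once with iterator ranges,
// so it can sort a copy of the whole column to find the median.
struct MedianSketch {
    double median = 0.0;

    template<typename IdxIter, typename ColIter>
    void operator()(IdxIter /*idx_begin*/, IdxIter /*idx_end*/,
                    ColIter col_begin, ColIter col_end) {
        std::vector<double> sorted(col_begin, col_end);

        if (sorted.empty()) return;
        std::sort(sorted.begin(), sorted.end());

        const std::size_t mid = sorted.size() / 2;

        median = (sorted.size() % 2 == 1)
                     ? sorted[mid]
                     : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
};

// Hypothetical stand-in for visit(): feeds data points one by one.
template<typename V>
V &visit_sketch(const std::vector<std::size_t> &index,
                const std::vector<double> &column, V &visitor) {
    const std::size_t n = std::min(index.size(), column.size());

    for (std::size_t i = 0; i < n; ++i)
        visitor(index[i], column[i]);
    return visitor;
}

// Hypothetical stand-in for single_act_visit(): hands over whole ranges.
template<typename V>
V &single_act_visit_sketch(const std::vector<std::size_t> &index,
                           const std::vector<double> &column, V &visitor) {
    visitor(index.begin(), index.end(), column.begin(), column.end());
    return visitor;
}
```

A sum can be accumulated one data point at a time, so it fits the regular style. A median needs the whole column at once, so it fits the single-action style.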
Most visitors share some common interfaces. For example, almost all visitors provide the following:
get_result(): It returns the result of the visitor/algorithm.
pre(): It is called by DataFrame each time before it starts passing data to the visitor. pre() is the place to initialize the process.
post(): It is called by DataFrame each time it is done passing data to the visitor.
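
As a rough sketch of that lifecycle, the hypothetical visitor below resets its state in pre(), accumulates in operator(), finalizes in post(), and exposes the answer through get_result(). MeanSketchVisitor and run_visit_sketch() are names invented for this illustration; the driver only plays the role the text assigns to DataFrame and uses the row number as a stand-in index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical visitor; it mimics the common interface only.
class MeanSketchVisitor {
public:
    // Called before any data is passed: reset all state.
    void pre() {
        sum_ = 0.0;
        count_ = 0;
        result_ = 0.0;
    }

    // Called once per data point.
    void operator()(const std::size_t & /*index*/, const double &value) {
        sum_ += value;
        ++count_;
    }

    // Called after the last data point: finalize the result.
    void post() {
        if (count_ > 0)
            result_ = sum_ / static_cast<double>(count_);
    }

    double get_result() const { return result_; }

private:
    double      sum_ = 0.0;
    std::size_t count_ = 0;
    double      result_ = 0.0;
};

// Hypothetical driver: pre(), then every data point, then post().
template<typename V>
V &run_visit_sketch(const std::vector<double> &column, V &visitor) {
    visitor.pre();
    for (std::size_t i = 0; i < column.size(); ++i)
        visitor(i, column[i]);  // the row number stands in for the index
    visitor.post();
    return visitor;
}
```

Because pre() runs before every pass, the same visitor instance can be reused on a second column without carrying over the sum from the first.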

See this document, DataFrameStatsVisitors.h, DataFrameMLVisitors.h, DataFrameFinancialVisitors.h, DataFrameTransformVisitors.h, and test/dataframe_tester[_2].cc for more examples and documentation.


Numeric Generators

Random generators, and a few other numeric generators, were added as a series of convenient stand-alone functions to generate random numbers (they cover all C++ standard distributions). You can seamlessly use these routines to generate random DataFrame columns.
See this document, the file RandGen.h, and dataframe_tester.cc. For the definition and defaults of RandGenParams, see this document and the file DataFrameTypes.h.
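
For readers unfamiliar with the standard distributions mentioned above, the sketch below shows the general idea of a stand-alone function that returns a column of random numbers. It uses only the C++ standard <random> header. make_normal_column() and its parameters are hypothetical and are not the signature of gen_normal_dist(), which takes a RandGenParams as noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not the library's gen_normal_dist()): returns n
// draws from a normal distribution, ready to be loaded as a column.
inline std::vector<double>
make_normal_column(std::size_t n, double mean, double stddev,
                   unsigned int seed) {
    std::mt19937                     engine(seed);  // fixed seed: repeatable
    std::normal_distribution<double> dist(mean, stddev);
    std::vector<double>              column;

    column.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        column.push_back(dist(engine));
    return column;
}
```

With a fixed seed, two calls with the same arguments produce the same column in the same program, which is convenient for tests.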


Code Structure

The DataFrame library is almost header-only, with a few boilerplate source files as exceptions: HeteroVector.cc, HeteroView.cc, and a few others. There is also DateTime.cc.

Starting from the root directory:

The include directory contains most of the code. It includes .h and .tcc files. The latter are C++ template code files (they are mostly located in the Internals subdirectory). The main header file is DataFrame.h. It contains the DataFrame class and its public interface. There are comprehensive comments for each public interface call in that file. The rest of the files will show you how the sausage is made. The include directory also contains subdirectories that contain mostly internal DataFrame implementation. One exception is DateTime.h, which is located in the Utils subdirectory.

The src directory contains Linux-only makefiles and a few subdirectories that contain various source files.

The test directory contains all the test source files, mocked data files, and test output files. The main test source files are dataframe_tester.cc and dataframe_tester_2.cc. They contain test cases for all functionalities of DataFrame. They are not very well organized; I plan to make the test cases more organized.


Build Instructions

Using plain make and makefiles:
Go to the root of the repository, where the license file is, and execute build_all.sh. This will build the library and test executables for Linux/Unix flavors only.
Using cmake:
Please see the README file. Thanks to @justinjk007, you should be able to build this on Linux, Windows, Mac, and more.


Motivation

Although Pandas has a spot-on interface and is full of useful functionality, it lacks performance and scalability. For example, it is hard to decipher high-frequency intraday data, such as Options data or S&P500 constituents tick-by-tick data, using Pandas. Another issue I have often encountered is that research is done in Python, because it has tools such as Pandas, but execution in production is in C++ for its efficiency, reliability, and scalability. Therefore, there is this translation, or sometimes a bridge, between research and execution. Also, in this day and age, C++ needs a heterogeneous data container. Mainly because of these factors, I implemented the C++ DataFrame.
I welcome all contributions from people with expertise, interest, and time to do it. I will add more functionalities from time to time, but currently my spare time is limited.