C++ DataFrame

DataFrame is a templatized and heterogeneous C++ container designed for data analysis in statistical, machine-learning, or financial applications.




The DataFrame class is defined as:
    template<typename I, typename H>
    class DataFrame;


I specifies the index column type.
H specifies a heterogeneous vector type to contain DataFrame columns. Don't get hung up on this too much; instead, use the convenient typedefs in DataFrame Library Types.
H can only be:
Template parameter A refers to the byte-boundary alignment to be used in memory allocations. The default is the system default boundary for each type. See DataFrame Library Types for convenient typedefs, especially under the Library-wide Types section. Also, see the Memory Alignment section below.

Some of the methods in DataFrame return another DataFrame or one of the above views depending on what you asked for. DataFrame and view instances should be indistinguishable from the user's point of view.
See the Views section below. Also, see DataFrame Library Types for convenient typedefs.



The DataFrame library interface is separated into two main categories:
  1. Accessing, adding, slicing & dicing, joining, & groupby'ing ... (The first column in the table below)
  2. Analytical algorithms: statistical, machine-learning, financial analysis ... (The second and third columns in the table below)
I employ regular parameterized methods (i.e. member functions) to implement item (1). For item (2), I chose the visitor pattern.
Please see the table below for a comprehensive list of methods, visitors, and types, along with documentation and sample code for each feature.


Table of Functionalities — with Code Samples

DataFrame Member Functions
append_column( 2 )
append_index( 2 )
append_row( )
bucketize( )
bucketize_async( )
col_name_to_idx( )
col_idx_to_name( )
combine( 3 )
concat( )
concat_view( )
consolidate( 4 )
create_column( )
describe( )
drop_missing( )
empty( )
fill_missing( 2 )
from_indicators( )
from_string( )
from_string_async( )
gen_datetime_index( )
gen_sequence_index( )
get_col_unique_values( )
get_column( 4 )
get_columns_info( )
get_data()
get_data_by_idx( 2 )
get_data_by_loc( 2 )
get_data_by_rand( )
get_data_by_sel( 5 )
get_index( 2 )
get_memory_usage( )
get_reindexed( )
get_reindexed_view( )
get_row( 2 )
get_view()
get_view_by_idx( 2 )
get_view_by_loc( 2 )
get_view_by_rand( )
get_view_by_sel( 5 )
groupby1( )
groupby1_async( )
groupby2( )
groupby2_async( )
groupby3( )
groupby3_async( )
has_column( 2 )
is_equal( )
join_by_column( )
join_by_index( )
load_align_column( )
load_column( 3 )
load_data( )
load_index( 2 )
load_indicators( )
load_result_as_column( )
make_consistent( )
modify_by_idx( )
multi_visit( )
pattern_match( )
read( )
read_async( )
remove_column( 2 )
remove_data_by_idx( )
remove_data_by_loc( )
remove_data_by_sel( 3 )
remove_duplicates( 6 )
remove_lock( )
rename_column( )
replace( 2 )
replace_async( 2 )
replace_index( )
retype_column( )
rotate( )
self_concat( )
self_rotate( )
self_shift( )
shape( )
set_lock( )
shapeless( )
shift( 2 )
shrink_to_fit( )
shuffle( )
single_act_visit( 5 )
single_act_visit_async( 5 )
sort( 5 )
sort_async( 5 )
to_string( )
to_string_async( )
transpose( )
value_counts( 2 )
visit( 5 )
visit_async( 5 )
write( )
write_async( )
DataFrame Built-in Visitors
struct AbsVisitor{ }
struct AffinityPropVisitor{ }
struct AutoCorrVisitor{ }
struct BetaVisitor{ }
struct BiasVisitor{ }
struct BoxCoxVisitor{ }
struct CategoryVisitor{ }
struct ClipVisitor{ }
struct CorrVisitor{ }
struct CovVisitor{ }
struct CubicSplineFitVisitor{ }
struct CumMaxVisitor{ }
struct CumMinVisitor{ }
struct CumProdVisitor{ }
struct CumSumVisitor{ }
struct DecomposeVisitor{ }
struct DotProdVisitor{ }
struct EntropyVisitor{ }
struct ExpandingRollAdopter{ }
struct ExponentialFitVisitor{ }
struct ExponentiallyWeightedCorrVisitor{ }
struct ExponentiallyWeightedCovVisitor{ }
struct ExponentiallyWeightedMeanVisitor{ }
struct ExponentiallyWeightedVarVisitor{ }
struct ExpoSmootherVisitor{ }
struct FactorizeVisitor{ }
struct FastFourierTransVisitor{ }
struct FixedAutoCorrVisitor{ }
struct GeometricMeanVisitor{ }
struct HampelFilterVisitor{ }
struct HarmonicMeanVisitor{ }
struct HWExpoSmootherVisitor{ }
struct ImpurityVisitor{ }
struct KMeansVisitor{ }
struct KthValueVisitor{ }
struct LinearFitVisitor{ }
struct LinregMovingMeanVisitor{ }
struct LogFitVisitor{ }
struct LowessVisitor{ }
struct MADVisitor{ }
struct MaxSubArrayVisitor{ }
struct MaxVisitor{ }
struct MeanVisitor{ }
struct MedianVisitor{ }
struct MinSubArrayVisitor{ }
struct MinVisitor{ }
struct ModeVisitor{ }
struct NLargestVisitor{ }
struct NMaxSubArrayVisitor{ }
struct NMinSubArrayVisitor{ }
struct NonZeroRangeVisitor{ }
struct NormalizeVisitor{ }
struct NSmallestVisitor{ }
struct PolyFitVisitor{ }
struct ProdVisitor{ }
struct QuadraticMeanVisitor{ }
struct QuantileVisitor{ }
struct RankVisitor{ }
struct SampleZScoreVisitor{ }
struct SEMVisitor{ }
struct SigmoidVisitor{ }
struct SimpleRollAdopter{ }
struct SLRegressionVisitor{ }
struct StableMeanVisitor{ }
struct StandardizeVisitor{ }
struct StatsVisitor{ }
struct StdVisitor{ }
struct StepRollAdopter{ }
struct SumVisitor{ }
struct TrackingErrorVisitor{ }
struct TTestVisitor{ }
struct WeightedMeanVisitor{ }
struct ZeroLagMovingMeanVisitor{ }
struct ZScoreVisitor{ }
DataFrame Built-in Financial Visitors
struct AccumDistVisitor{ }
struct ArnaudLegouxMAVisitor{ }
struct AvgDirMovIdxVisitor{ }
struct BalanceOfPowerVisitor{ }
struct BollingerBand{ }
struct CCIVisitor{ }
struct CenterOfGravityVisitor{ }
struct ChaikinMoneyFlowVisitor{ }
struct ChandeKrollStopVisitor{ }
struct CoppockCurveVisitor{ }
struct DecayVisitor{ }
struct DoubleCrossOver{ }
struct DrawdownVisitor{ }
struct EBSineWaveVisitor{ }
struct EhlerSuperSmootherVisitor{ }
struct FisherTransVisitor{ }
struct GarmanKlassVolVisitor{ }
struct HeikinAshiCndlVisitor{ }
struct HodgesTompkinsVolVisitor{ }
struct HoltWinterChannelVisitor{ }
struct HullRollingMeanVisitor{ }
struct HurstExponentVisitor{ }
struct KamaVisitor{ }
struct KeltnerChannelsVisitor{ }
struct MACDVisitor{ }
struct MassIndexVisitor{ }
struct OnBalanceVolumeVisitor{ }
struct ParabolicSARVisitor{ }
struct ParkinsonVolVisitor{ }
struct PercentPriceOSCIVisitor{ }
struct PivotPointSRVisitor{ }
struct PrettyGoodOsciVisitor{ }
struct PSLVisitor{ }
struct RateOfChangeVisitor{ }
struct ReturnVisitor{ }
struct RollingMidValueVisitor{ }
struct RSIVisitor{ }
struct RSXVisitor{ }
struct RVIVisitor{ }
struct SharpeRatioVisitor{ }
struct SlopeVisitor{ }
struct T3MovingMeanVisitor{ }
struct TreynorRatioVisitor{ }
struct TrixVisitor{ }
struct TrueRangeVisitor{ }
struct TTMTrendVisitor{ }
struct UlcerIndexVisitor{ }
struct UltimateOSCIVisitor{ }
struct VarIdxDynAvgVisitor{ }
struct VertHorizFilterVisitor{ }
struct VortexVisitor{ }
struct VWAPVisitor{ }
struct VWBASVisitor{ }
struct WilliamPrcRVisitor{ }
struct YangZhangVolVisitor{ }
DataFrame Types
enum class box_cox_type{ }
enum class bucket_type{ }
enum class concat_policy{ }
enum class decompose_type{ }
enum class drop_policy{ }
enum class fill_policy{ }
enum class exponential_decay_spec{ }
enum class hampel_type{ }
struct impurity_type{ }
enum class Index2D{ }
enum class io_format{ }
enum class join_policy{ }
enum class linreg_moving_mean_type{ }
enum class mad_type{ }
enum class mean_type{ }
enum class nan_policy{ }
enum class pattern_spec{ }
enum class quantile_policy{ }
enum class random_policy{ }
enum class rank_policy{ }
enum class remove_dup_spec{ }
enum class return_policy{ }
enum class roll_policy{ }
enum class shift_policy{ }
enum class sigmoid_type{ }
enum class sort_spec{ }
enum class sort_state{ }
enum class time_frequency{ }
operator df_divides( )
operator df_minus( )
operator df_multiplies( )
operator df_plus( )
struct BadRange{ }
struct ColNotFound{ }
struct DataFrameError{ }
struct InconsistentData{ }
struct MemUsage{ }
struct NotFeasible{ }
struct NotImplemented{ }
Stand-alone Numeric Generators
gen_bernoulli_dist( )
gen_binomial_dist( )
gen_cauchy_dist( )
gen_chi_squared_dist( )
gen_dft_sample_freq( )
gen_even_space_nums( )
gen_exponential_dist( )
gen_extreme_value_dist( )
gen_fisher_f_dist( )
gen_gamma_dist( )
gen_geometric_dist( )
gen_log_space_nums( )
gen_lognormal_dist( )
gen_negative_binomial_dist( )
gen_normal_dist( )
gen_poisson_dist( )
gen_student_t_dist( )
gen_sym_triangle( )
gen_triangular_nums( )
gen_uniform_int_dist( )
gen_uniform_real_dist( )
gen_weibull_dist( )



Multithreading

  1. DataFrame uses static containers to achieve type heterogeneity. By default, these static containers are unprotected. This is by design, so by default there is no locking overhead. If you use DataFrame in a multithreaded program, you must provide a SpinLock (defined in the ThreadGranularity.h file). DataFrame will use your SpinLock to protect the containers.
    Please see set_lock() and remove_lock() above, and dataframe_tester.cc#3767 for a code example.
  2. In addition, instances of DataFrame are not thread-safe either. In other words, a single instance of DataFrame must not be used in multiple threads without protection, unless it is used as read-only.
  3. In the meantime, DataFrame utilizes multithreading in two different ways internally:
    1. Async Interface: There are asynchronous versions of some methods. For example, you have sort()/sort_async(), visit()/visit_async(), and more. The latter versions return a std::future that could execute in parallel.
    2. DataFrame uses multiple threads, internally and unbeknownst to the user, in some of its algorithms when appropriate. The user can control (or turn off) the multithreading by calling set_thread_level(), which sets the maximum number of threads to be used. The default is 0. The optimal number of threads is a function of the user's hardware/software environment and is usually obtained by trial and error. set_thread_level(), and the threading level in general, is a static property; once set, it applies to all instances.
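As a rough illustration of the sync/async pairing described in item 3.1, here is a minimal sketch in plain standard C++. It does not use DataFrame at all: toy_sort() and toy_sort_async() are hypothetical names invented for this example, and the real sort()/sort_async() signatures may differ. It only shows the general shape of an operation whose async variant returns a std::future.

```cpp
#include <algorithm>
#include <future>
#include <utility>
#include <vector>

// Hypothetical synchronous operation (not a DataFrame call):
// returns a sorted copy of a column.
std::vector<double> toy_sort(std::vector<double> col) {
    std::sort(col.begin(), col.end());
    return col;
}

// Hypothetical asynchronous counterpart: launches the same work on another
// thread and hands back a std::future for the result.
std::future<std::vector<double>> toy_sort_async(std::vector<double> col) {
    return std::async(std::launch::async, toy_sort, std::move(col));
}
```

A caller would typically keep the returned future, do other work, and call get() on it when the sorted data is needed; get() blocks until the task has finished.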



Views

Views have useful and practical use-cases. A view is a slice of a DataFrame that is a reference to the original DataFrame. It appears exactly the same as a DataFrame, but if you modify any data in the view, the corresponding data point(s) in the original DataFrame will also be modified and vice versa. There are certain things you cannot do in views. For example, you cannot add or delete columns, extend the index column, ...

In general, there are two kinds of views:

  1. Regular Views: You can change data in the view or in the original DataFrame and see the change on both sides.
  2. Const Views: You cannot change data in the view. But you can change the data in the original DataFrame or through another view, and it will be reflected in the const view.
Why would you use views? For more understanding, look at this document further and/or the test files.
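To make the reference semantics of regular and const views concrete, here is a small sketch in plain standard C++. It is not how DataFrame implements views: ColumnView, make_view(), and make_const_view() are hypothetical names made up for this illustration, and the sketch covers a single column only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, minimal view over a contiguous slice of one column.
// It stores a pointer and a length; it never owns or copies the data.
template<typename T>
class ColumnView {
public:
    ColumnView(T *begin, std::size_t size) : begin_(begin), size_(size) {}

    std::size_t size() const { return size_; }

    T &operator[](std::size_t i) const {
        assert(i < size_);
        return begin_[i];
    }

private:
    T *begin_;
    std::size_t size_;
};

// Regular view: writes through the view land in the original column.
template<typename T>
ColumnView<T> make_view(std::vector<T> &col, std::size_t first, std::size_t count) {
    assert(first + count <= col.size());
    return ColumnView<T>(col.data() + first, count);
}

// Const view: the element type is const T, so writing through it does not
// compile, but changes made to the original column remain visible.
template<typename T>
ColumnView<const T> make_const_view(const std::vector<T> &col, std::size_t first, std::size_t count) {
    assert(first + count <= col.size());
    return ColumnView<const T>(col.data() + first, count);
}
```

With a column {10, 20, 30, 40, 50} and a view over three elements starting at position 1, assigning 21 through the view changes the second element of the column, and a const view over the same slice sees that change too. Note that this toy version is invalidated if the original vector reallocates, which is one reason a view cannot be allowed to grow or restructure its source.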




Visitors

Visitors are the main mechanism to implement analytical (i.e. statistical, financial, machine-learning) algorithms. You can easily follow the visitor's interface to add your custom algorithm and thereby extend the DataFrame package. Visitors also play several roles that in other packages may be handled by separate interfaces. Visitors play the role of apply, transformer, and algorithms. For example, a visitor can transform column(s), or it may take the column(s) as read-only and implement an algorithm.
There are two visitor interfaces:

  1. Regular visit. This visitor is invoked by calling the visit() method on a DataFrame instance. In this case, DataFrame passes the given index and column(s) data points one-by-one to the visitor functor. This is convenient for algorithms that can operate on one data point at a time. Examples are correlation or variance visitors.
  2. Single-action visit. This visitor is invoked by calling the single_act_visit() method on a DataFrame instance. In this case, begin and end iterators for the given index and column(s) are passed to the visitor functor, so the functor has access to all index and column(s) data at once. This is necessary for algorithms that need the whole data together. Examples are return or median visitors.
There are some common interfaces in most of the visitors. For example the following interfaces are common between almost all visitors:
get_result(): It returns the result of the visitor/algorithm.
pre(): It is called by DataFrame each time before starting to pass the data to the visitor. pre() is the place to initialize the process.
post(): It is called by DataFrame each time it is done with passing data to the visitor.
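The two visiting styles and the pre()/post()/get_result() protocol can be sketched without the library as follows. Everything here is an assumption made for illustration: ToyMeanVisitor, ToyMedianVisitor, toy_visit(), and toy_single_act_visit() are hypothetical stand-ins, and the exact operator() signatures that DataFrame expects from a visitor may differ (see the visitor header files for the real ones).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Regular-style visitor: receives one (index, value) pair per call.
struct ToyMeanVisitor {
    void pre() { sum_ = 0.0; count_ = 0; result_ = 0.0; }
    void operator()(const std::size_t & /*idx*/, const double &val) {
        sum_ += val;
        ++count_;
    }
    void post() { result_ = count_ > 0 ? sum_ / static_cast<double>(count_) : 0.0; }
    double get_result() const { return result_; }

private:
    double sum_ = 0.0;
    std::size_t count_ = 0;
    double result_ = 0.0;
};

// Single-action-style visitor: receives begin/end iterators once,
// so it can look at the whole column (needed for a median).
struct ToyMedianVisitor {
    void pre() { result_ = 0.0; }
    template<typename IdxIter, typename ValIter>
    void operator()(IdxIter, IdxIter, ValIter val_begin, ValIter val_end) {
        std::vector<double> tmp(val_begin, val_end);
        if (tmp.empty()) return;
        std::sort(tmp.begin(), tmp.end());
        const std::size_t mid = tmp.size() / 2;
        result_ = (tmp.size() % 2 == 1) ? tmp[mid] : (tmp[mid - 1] + tmp[mid]) / 2.0;
    }
    void post() {}
    double get_result() const { return result_; }

private:
    double result_ = 0.0;
};

// Hypothetical driver mimicking a regular visit: pre(), one call per row, post().
template<typename V>
V &toy_visit(const std::vector<std::size_t> &index, const std::vector<double> &col, V &visitor) {
    assert(index.size() == col.size());
    visitor.pre();
    for (std::size_t i = 0; i < col.size(); ++i) visitor(index[i], col[i]);
    visitor.post();
    return visitor;
}

// Hypothetical driver mimicking a single-action visit: pre(), one call, post().
template<typename V>
V &toy_single_act_visit(const std::vector<std::size_t> &index, const std::vector<double> &col, V &visitor) {
    assert(index.size() == col.size());
    visitor.pre();
    visitor(index.begin(), index.end(), col.begin(), col.end());
    visitor.post();
    return visitor;
}
```

For a column {4.0, 1.0, 3.0, 2.0}, both toy visitors report 2.5 from get_result(). The design point is that the algorithm lives in a small independent object with its own state, while the driver only decides how the data is fed to it.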

See this document, DataFrameStatsVisitors.h, DataFrameMLVisitors.h, DataFrameFinancialVisitors.h, DataFrameTransformVisitors.h, and test/dataframe_tester[_2].cc for more examples and documentation.

I have been asked many times why I chose the visitor pattern for algorithms as opposed to having member functions.
Because I wanted algorithms to be independent objects. To be more precise as to why:



Memory Alignment

DataFrame gives you the ability to allocate memory on custom alignment boundaries.
You can use this feature to take advantage of SIMD instructions in modern CPUs. Since DataFrame algorithms all operate on vectors of data (columns), this can come in handy in conjunction with compiler optimizations.
There are convenient typedefs that define DataFrames that allocate memory, for example, on 64, 128, 256, ... byte boundaries. See DataFrame Library Types.
When you get access to columns in a DataFrame, you will get a reference to a StlVecType. StlVecType is just a std::vector with a custom allocator for the requested alignment.
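The idea of a std::vector with a custom allocator for a requested alignment can be sketched as below. This is not the library's allocator or its StlVecType: AlignedAllocator, AlignedVector, and is_aligned() are hypothetical names for this example, which assumes C++17 (aligned operator new).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator that returns storage aligned to Align bytes.
template<typename T, std::size_t Align>
struct AlignedAllocator {
    static_assert((Align & (Align - 1)) == 0, "alignment must be a power of two");

    using value_type = T;

    AlignedAllocator() noexcept = default;
    template<typename U>
    AlignedAllocator(const AlignedAllocator<U, Align> &) noexcept {}

    // An explicit rebind is needed because Align is a non-type template parameter.
    template<typename U>
    struct rebind { using other = AlignedAllocator<U, Align>; };

    T *allocate(std::size_t n) {
        return static_cast<T *>(::operator new(n * sizeof(T), std::align_val_t(Align)));
    }
    void deallocate(T *p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Align));
    }
};

// The allocator is stateless, so any two instances compare equal.
template<typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A> &, const AlignedAllocator<U, A> &) noexcept { return true; }
template<typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A> &, const AlignedAllocator<U, A> &) noexcept { return false; }

// A std::vector whose buffer starts on an Align-byte boundary.
template<typename T, std::size_t Align>
using AlignedVector = std::vector<T, AlignedAllocator<T, Align>>;

// Helper to check whether a pointer sits on the given boundary.
inline bool is_aligned(const void *p, std::size_t align) {
    assert(align != 0);
    return reinterpret_cast<std::uintptr_t>(p) % align == 0;
}
```

For example, an AlignedVector<double, 64> keeps its data() pointer on a 64-byte boundary, including after it reallocates on growth, which is the property vectorized loops can benefit from.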




Numeric Generators

Random generators, and a few other numeric generators, were added as a series of convenient stand-alone functions to generate random numbers (they cover all C++ standard distributions). You can seamlessly use these routines to generate random DataFrame columns.
See this document, the file RandGen.h, and dataframe_tester.cc. For the definition and defaults of RandGenParams, see this document and the file DataFrameTypes.h.
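As a rough picture of what a stand-alone generator of this kind does, here is a sketch built only on the standard <random> header. ToyGenParams and toy_gen_normal() are hypothetical names for this example; they are not the library's RandGenParams or gen_normal_dist(), whose fields and defaults are defined in DataFrameTypes.h and RandGen.h.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical parameter bundle (an assumption, not the real RandGenParams).
struct ToyGenParams {
    double mean = 0.0;      // mean of the normal distribution
    double std = 1.0;       // standard deviation, must be positive
    unsigned int seed = 0;  // fixed seed makes the sequence reproducible
};

// Generates n normally distributed numbers, ready to be used as a column.
std::vector<double> toy_gen_normal(std::size_t n, const ToyGenParams &p = ToyGenParams{}) {
    assert(p.std > 0.0);
    std::mt19937 gen(p.seed);
    std::normal_distribution<double> dist(p.mean, p.std);
    std::vector<double> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) out.push_back(dist(gen));
    return out;
}
```

Calling toy_gen_normal() twice with the same seed yields the same sequence, which is convenient for reproducible tests. The resulting vector could then be loaded into a DataFrame as a column.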




Code Structure

The DataFrame library is almost header-only, with a few boilerplate source-file exceptions: HeteroVector.cc, HeteroView.cc, and a few others. There is also DateTime.cc.

Starting from the root directory:

The include directory contains most of the code. It includes .h and .tcc files. The latter are C++ template code files (they are mostly located in the Internals subdirectory). The main header file is DataFrame.h. It contains the DataFrame class and its public interface. There are comprehensive comments for each public interface call in that file. The rest of the files will show you how the sausage is made. The include directory also contains subdirectories that hold mostly internal DataFrame implementation. One exception is DateTime.h, which is located in the Utils subdirectory.

The src directory contains Linux-only makefiles and a few subdirectories that contain various source files.

The test directory contains all the test source files, mocked data files, and test output files. The main test source files are dataframe_tester.cc and dataframe_tester_2.cc. They contain test cases for all functionalities of DataFrame. They are not very well organized; I plan to make the test cases more organized.




Build Instructions

Using plain make and makefiles:
Go to the root of the repository, where the license file is, and execute build_all.sh. This will build the library and test executables for Linux/Unix flavors only.
Using cmake:
Please see the README file. Thanks to @justinjk007, you should be able to build this on Linux, Windows, Mac, and more.




Motivation

Although Pandas has a spot-on interface and is full of useful functionalities, it lacks performance and scalability. For example, it is hard to decipher high-frequency intraday data, such as Options data or S&P500 constituents tick-by-tick data, using Pandas. Another issue I have encountered often is that research is done in Python, because it has tools such as Pandas, but execution in production is in C++ for its efficiency, reliability, and scalability. Therefore, there is a translation step, or sometimes a bridge, between research and execution. Also, in this day and age, C++ needs a heterogeneous data container. Mainly because of these factors, I implemented the C++ DataFrame.
I welcome all contributions from people with expertise, interest, and time to do it. I will add more functionalities from time to time, but currently my spare time is limited.