Signature Description Parameters

bool
read(const char *file_name,
     io_format iof = io_format::csv,
     bool columns_only = false,
     size_t starting_row = 0,
     size_t num_rows = std::numeric_limits<size_t>::max());
        
It reads the contents of a text file or stream into this DataFrame. Three formats are currently supported: csv, csv2, and json. See the io_format documentation page.
CSV file format must be:
  INDEX:<Number of data points>:<Comma delimited list of values>
  <Column1 name>:<Number of data points>:<Column1 type>:<Comma delimited list of values>
  <Column2 name>:<Number of data points>:<Column2 type>:<Comma delimited list of values>
      .
      .
      .
        
All empty lines and lines starting with # are skipped. For examples, see the files in the test directory.
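
To make the line layout concrete, here is a minimal sketch of splitting one column line into its four parts and of the skip rule for empty and # lines. It is not the library's parser: ColumnLine, is_skipped_line and parse_column_line are hypothetical names, values are kept as raw strings, and the type token is stored verbatim (check the files in the test directory for how it is actually written). String values that themselves contain commas are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for one parsed column line; not a DataFrame type.
struct ColumnLine {
    std::string              name;
    std::size_t              count = 0;
    std::string              type;
    std::vector<std::string> values;
};

// True for lines the format says are skipped: empty, or starting with '#'.
inline bool is_skipped_line(const std::string &line) {
    return line.empty() || line.front() == '#';
}

// Splits "<name>:<count>:<type>:<v1,v2,...>" into its four parts.
// Throws (from std::stoul) if the count field is not a number.
inline ColumnLine parse_column_line(const std::string &line) {
    ColumnLine         out;
    std::istringstream in(line);
    std::string        count_str;
    std::string        values_str;

    std::getline(in, out.name, ':');
    std::getline(in, count_str, ':');
    std::getline(in, out.type, ':');
    std::getline(in, values_str);  // everything after the third ':'
    out.count = static_cast<std::size_t>(std::stoul(count_str));

    // Split the value list on commas, keeping each value as text.
    std::istringstream vals(values_str);
    std::string        item;
    while (std::getline(vals, item, ','))
        out.values.push_back(item);
    return out;
}
```

A caller would typically test each line with is_skipped_line() first and only then hand it to parse_column_line().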

CSV2 file format must be (this is similar to Pandas csv format):
  INDEX:<Number of data points>:<Index type>:,<Column1 name>:<Number of data points>:<Column1 type>,<Column2 name>:<Number of data points>:<Column2 type>, . . .
  Comma delimited rows of values
      .
      .
      .
        
All empty lines and lines starting with # are skipped. For examples, see the IBM and FORD files in the test directory.
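
As an illustration of the CSV2 header layout, the following sketch splits a header line into per-column specs. It is an assumption-laden example, not the library's code: HeaderField and parse_csv2_header are hypothetical names, the type token is stored verbatim, and no whitespace trimming is done.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for one "<name>:<count>:<type>" header field.
struct HeaderField {
    std::string name;
    std::size_t count = 0;
    std::string type;
};

// Splits a CSV2 header on commas, then each field on colons.
// The trailing ':' after the index type is consumed as a delimiter.
inline std::vector<HeaderField> parse_csv2_header(const std::string &header) {
    std::vector<HeaderField> fields;
    std::istringstream       in(header);
    std::string              spec;

    while (std::getline(in, spec, ',')) {
        std::istringstream spec_in(spec);
        HeaderField        field;
        std::string        count_str;

        std::getline(spec_in, field.name, ':');
        std::getline(spec_in, count_str, ':');
        std::getline(spec_in, field.type, ':');
        field.count = static_cast<std::size_t>(std::stoul(count_str));
        fields.push_back(field);
    }
    return fields;
}
```

The rows that follow the header would then be split on commas, with the i-th value of each row belonging to the i-th header field.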

JSON file format looks like this:
  {
    "INDEX":{"N":3,"T":"ulong","D":[123450,123451,123452]},
    "col_3":{"N":3,"T":"double","D":[15.2,16.34,17.764]},
    "col_4":{"N":3,"T":"int","D":[22,23,24]},
    "col_str":{"N":3,"T":"string","D":["11","22","33"]},
    "col_2":{"N":3,"T":"double","D":[8,9.001,10]},
    "col_1":{"N":3,"T":"double","D":[1,2,3.456]}
  }
        
Please note that DataFrame json does not follow the json spec 100%. In json, dictionary fields have no particular order. But in DataFrame json:
  1. Column “INDEX” must be the first column, if it exists
  2. Fields in column dictionaries must be in N (number of data points), T (type), D (data) order
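
The field-order rule can be sketched with a small writer that always emits N, T, D in that order. This is only an illustration for integer columns; json_int_column is a hypothetical helper, not part of DataFrame.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Builds one column entry with its fields in the required N, T, D order.
// Hypothetical helper for illustration; handles integer data only.
inline std::string json_int_column(const std::string &name,
                                   const std::string &type,
                                   const std::vector<int> &data) {
    std::ostringstream out;

    out << '"' << name << "\":{\"N\":" << data.size()
        << ",\"T\":\"" << type << "\",\"D\":[";
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i > 0)
            out << ',';
        out << data[i];
    }
    out << "]}";
    return out.str();
}
```

When assembling a whole document from such entries, the "INDEX" entry would be written first, followed by the other columns.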


In all formats the following data types are supported:
          float
          double
          longdouble  -- long double
          int
          uint        -- unsigned int
          long
          longlong    -- long long int
          ulong       -- unsigned long
          ulonglong   -- unsigned long long int
          string
          bool
          DateTime    -- DateTime data in format of <Epoch seconds>.<nanoseconds> (1516179600.874123908)
        
In case of io_format::csv2 the following additional types are also supported:
          DateTimeAME -- DateTime string printed in American style (MM/DD/YYYY HH:MM:SS.mmm)
          DateTimeEUR -- DateTime string printed in European style (YYYY/MM/DD HH:MM:SS.mmm)
          DateTimeISO -- DateTime string printed in ISO style (YYYY-MM-DD HH:MM:SS.mmm)
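
For the DateTime type, a value such as 1516179600.874123908 carries more significant digits than a double can hold, so it is safer to split the text at the dot than to parse it as a floating-point number. The sketch below shows that idea. EpochTime and parse_epoch are hypothetical names, and the digits after the dot are read as a plain integer nanosecond count; how the library treats fractions shorter than nine digits is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical holder for the two parts of "<Epoch seconds>.<nanoseconds>".
struct EpochTime {
    std::int64_t seconds = 0;
    std::int32_t nanoseconds = 0;
};

// Splits the text at the first '.' and parses each side as an integer.
// Assumption: the part after the dot is a nanosecond count as written.
inline EpochTime parse_epoch(const std::string &text) {
    EpochTime         out;
    const std::size_t dot = text.find('.');

    // substr(0, npos) yields the whole string when there is no dot.
    out.seconds = std::stoll(text.substr(0, dot));
    if (dot != std::string::npos && dot + 1 < text.size())
        out.nanoseconds =
            static_cast<std::int32_t>(std::stol(text.substr(dot + 1)));
    return out;
}
```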
        
NOTE: This version of read() can be substantially faster, especially for larger files, than opening the file yourself and using the stream version of read() below.
file_name: Complete path to the file
iof: Specifies the I/O format. The default is CSV
columns_only: If true, the index column is not read.
              You may want to do that to read multiple files into the same DataFrame.
              If columns_only is false the index column must exist in the stream.
              If columns_only is true the index column may or may not exist
starting_row: Zero-based number of the row to start reading. starting_row and
              num_rows can be used to read large files in chunks.
num_rows: Number of rows to read starting at starting_row
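
One way to use starting_row and num_rows for chunked reading is to precompute the (starting_row, num_rows) pairs and issue one read() per pair. The sketch below only computes the pairs; plan_chunks is a hypothetical helper, and it assumes the total row count is known in advance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns (starting_row, num_rows) pairs that cover total_rows rows in
// chunks of at most chunk_rows rows. The last chunk may be shorter.
inline std::vector<std::pair<std::size_t, std::size_t>>
plan_chunks(std::size_t total_rows, std::size_t chunk_rows) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;

    if (chunk_rows == 0)
        return chunks;  // nothing sensible to plan
    for (std::size_t start = 0; start < total_rows; start += chunk_rows)
        chunks.emplace_back(start, std::min(chunk_rows, total_rows - start));
    return chunks;
}
```

Each pair could then be passed as the starting_row and num_rows arguments of read(), loading one chunk into a fresh DataFrame at a time.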
        

template<typename S>
bool
read(S &in_s,
     io_format iof = io_format::csv,
     bool columns_only = false);
        
Same as read() above, but takes a reference to a stream

std::future<bool>
read_async(const char *file_name,
           io_format iof = io_format::csv,
           bool columns_only = false);
        
Same as read() above, but executed asynchronously

template<typename S>
std::future<bool>
read_async(S &in_s,
           io_format iof = io_format::csv,
           bool columns_only = false);
        
Same as read_async() above, but takes a reference to a stream

bool
from_string(const char *data_frame); 
        
This is a convenience function (a simple implementation) that restores a DataFrame from a string previously generated by calling to_string(). It uses the read() member function of DataFrame. Together, these functions can be used to transmit a DataFrame from one place to another or to store a DataFrame in databases, caches, etc.

I have been asked why I implemented from_string instead of, or before, a binary format.
Binary serialization is a legitimate request, and I will add that option when I find the time. But a binary format is more involved to implement, and it is not always more efficient than a string format. Two issues stand out:
  1. Consider options market data. Options prices and sizes are usually small numbers. For example, take the number 0.5. In string format that is 3 bytes, ".5|". In binary format it is always 8 bytes. So, if you have a dataset with millions or billions of such numbers, it makes a significant difference.
  2. In binary format you must deal with big-endian vs. little-endian byte order. It is a pain in the neck and affects efficiency.
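
The two points above can be made concrete with a little arithmetic and a byte swap. This is a sketch only: text_bytes, binary_bytes and byte_swap64 are hypothetical helpers, and the 8-byte figure assumes sizeof(double) is 8, as it is on common platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes needed to store `count` copies of one delimited text token.
inline std::size_t text_bytes(const std::string &token, std::size_t count) {
    return token.size() * count;
}

// Bytes needed to store `count` values as raw doubles.
inline std::size_t binary_bytes(std::size_t count) {
    return sizeof(double) * count;
}

// Reverses the byte order of a 64-bit value. A binary reader needs this
// kind of step when the writer's byte order differs from its own.
inline std::uint64_t byte_swap64(std::uint64_t value) {
    std::uint64_t out = 0;

    for (int i = 0; i < 8; ++i) {
        out = (out << 8) | (value & 0xFF);  // move the lowest byte up
        value >>= 8;
    }
    return out;
}
```

For one million values of ".5|", text_bytes() gives 3,000,000 bytes against 8,000,000 bytes from binary_bytes(), and a cross-endian binary reader would additionally run something like byte_swap64() on every value.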
data_frame: A null terminated string that was generated by calling to_string(). It must contain a complete DataFrame

std::future<bool>
from_string_async(const char *data_frame); 
        
Same as from_string() above, but executed asynchronously
static void test_read()  {

    std::cout << "\nTesting read() ..." << std::endl;

    MyDataFrame df_read;

    try  {
        std::future<bool>   fut2 = df_read.read_async("sample_data.csv");

        fut2.get();
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
    df_read.write<std::ostream, int, unsigned long, double, std::string, bool>(std::cout);

    StdDataFrame<std::string>   df_read_str;

    try  {
        df_read_str.read("sample_data_string_index.csv");
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
    df_read_str.write<std::ostream, int, unsigned long, double, std::string, bool>(std::cout);

    StdDataFrame<DateTime>  df_read_dt;

    try  {
        df_read_dt.read("sample_data_dt_index.csv");
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
    df_read_dt.write<std::ostream, int, unsigned long, double, std::string, bool>(std::cout);
}

// -----------------------------------------------------------------------------

static void test_io_format_csv2()  {

    std::cout << "\nTesting io_format_csv2( ) ..." << std::endl;

    std::vector<unsigned long>  ulgvec2 =
        { 123450, 123451, 123452, 123450, 123455, 123450, 123449, 123450, 123451, 123450, 123452, 123450, 123455, 123450,
          123454, 123450, 123450, 123457, 123458, 123459, 123450, 123441, 123442, 123432, 123450, 123450, 123435, 123450 };
    std::vector<unsigned long>  xulgvec2 = ulgvec2;
    std::vector<int>            intvec2 =
        { 1, 2, 3, 4, 5, 3, 7, 3, 9, 10, 3, 2, 3, 14, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3, 36, 2, 45, 2 };
    std::vector<double>         xdblvec2 =
        { 1.2345, 2.2345, 3.2345, 4.2345, 5.2345, 3.0, 0.9999, 10.0, 4.25, 0.009, 8.0, 2.2222, 3.3333,
          11.0, 5.25, 1.009, 2.111, 9.0, 3.2222, 4.3333, 12.0, 6.25, 2.009, 3.111, 10.0, 4.2222, 5.3333 };
    std::vector<double>         dblvec22 =
        { 0.998, 0.3456, 0.056, 0.15678, 0.00345, 0.923, 0.06743, 0.1, 0.0056, 0.07865, 0.0111, 0.1002, -0.8888,
          0.14, 0.0456, 0.078654, -0.8999, 0.8002, -0.9888, 0.2, 0.1056, 0.87865, -0.6999, 0.4111, 0.1902, -0.4888 };
    std::vector<std::string>    strvec2 =
        { "4% of something", "Description 4/5", "This is bad", "3.4% of GDP", "Market drops", "Market pulls back",
          "$15 increase", "Running fast", "C++14 development", "Some explanation", "More strings", "Bonds vs. Equities",
          "Almost done", "XXXX04", "XXXX2", "XXXX3", "XXXX4", "XXXX4", "XXXX5", "XXXX6",
          "XXXX7", "XXXX10", "XXXX11", "XXXX02", "XXXX03" };
    std::vector<bool>           boolvec = { true, true, true, false, false, true };

    MyDataFrame df;

    df.load_data(std::move(ulgvec2), std::make_pair("ul_col", xulgvec2));
    df.load_column("xint_col", std::move(intvec2), nan_policy::dont_pad_with_nans);
    df.load_column("str_col", std::move(strvec2), nan_policy::dont_pad_with_nans);
    df.load_column("dbl_col", std::move(xdblvec2), nan_policy::dont_pad_with_nans);
    df.load_column("dbl_col_2", std::move(dblvec22), nan_policy::dont_pad_with_nans);
    df.load_column("bool_col", std::move(boolvec), nan_policy::dont_pad_with_nans);

    df.write<std::ostream, int, unsigned long, double, bool, std::string>(std::cout, false, io_format::csv2);

    MyDataFrame df_read;

    try  {
        df_read.read("csv2_format_data.csv", io_format::csv2);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
    df_read.write<std::ostream, int, unsigned long, double, bool, std::string>(std::cout, false, io_format::csv2);
}

// -----------------------------------------------------------------------------

static void test_DT_IBM_data()  {

    std::cout << "\nTesting DT_IBM_data(  ) ..." << std::endl;

    typedef StdDataFrame<DateTime>  DT_DataFrame;

    DT_DataFrame    df;

    df.read("DT_IBM.csv", io_format::csv2);

    assert(df.get_column<double>("IBM_Open")[0] == 98.4375);
    assert(df.get_column<double>("IBM_Close")[18] == 97.875);
    assert(df.get_index()[18] == DateTime(20001128));
    assert(fabs(df.get_column<double>("IBM_High")[5030] - 111.8) < 0.001);
    assert(df.get_column<long>("IBM_Volume")[5022] == 21501100L);
    assert(df.get_index()[5020] == DateTime(20201016));
}

// -----------------------------------------------------------------------------

static void test_to_from_string()  {

    std::cout << "\nTesting to_from_string() ..." << std::endl;

    std::vector<unsigned long>  idx =
        { 123450, 123451, 123452, 123453, 123454, 123455, 123456, 123457, 123458, 123459, 123460, 123461, 123462, 123466 };
    std::vector<double> d1 = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 };
    std::vector<double> d2 = { 8, 9, 10, 11, 12, 13, 14, 20, 22, 23, 30, 31, 32, 1.89 };
    std::vector<double> d3 = { 15, 16, 17, 18, 19, 20, 21, 0.34, 1.56, 0.34, 2.3, 0.1, 0.89, 0.45 };
    std::vector<int>    i1 = { 22, 23, 24, 25, 99, 100, 101, 3, 2 };
    std::vector<std::string>    strvec =
        { "zz", "bb", "cc", "ww", "ee", "ff", "gg", "hh", "ii", "jj", "kk", "ll", "mm", "nn" };
    MyDataFrame         df;

    df.load_data(std::move(idx),
                 std::make_pair("col_1", d1),
                 std::make_pair("col_2", d2),
                 std::make_pair("col_3", d3),
                 std::make_pair("col_4", i1),
                 std::make_pair("str_col", strvec));

    std::future<std::string>    f = df.to_string_async<double, int, std::string>();
    const std::string           str_dump = f.get();

    // std::cout << str_dump << std::endl;

    MyDataFrame df2;

    df2.from_string(str_dump.c_str());
    // std::cout << '\n' << std::endl;
    // df2.write<std::ostream, double, int, std::string>(std::cout);
    assert((df.is_equal<double, int, std::string>(df2)));
}

// -----------------------------------------------------------------------------

static void test_reading_in_chunks()  {

    std::cout << "\nTesting reading_in_chunks(  ) ..." << std::endl;

    try  {
        StrDataFrame    df1;

        df1.read("data/SHORT_IBM.csv", io_format::csv2, false, 0, 10);
        assert(df1.get_index().size() == 10);
        assert(df1.get_column<double>("IBM_Close").size() == 10);
        assert(df1.get_index()[0] == "2014-01-02");
        assert(df1.get_index()[9] == "2014-01-15");
        assert(fabs(df1.get_column<double>("IBM_Close")[0] - 185.53) < 0.0001);
        assert(fabs(df1.get_column<double>("IBM_Close")[9] - 187.74) < 0.0001);

        StrDataFrame    df2;

        df2.read("data/SHORT_IBM.csv", io_format::csv2, false, 800, 10);
        assert(df2.get_index().size() == 10);
        assert(df2.get_column<double>("IBM_Close").size() == 10);
        assert(df2.get_index()[0] == "2017-03-08");
        assert(df2.get_index()[9] == "2017-03-21");
        assert(fabs(df2.get_column<double>("IBM_Close")[0] - 179.45) < 0.0001);
        assert(fabs(df2.get_column<double>("IBM_Close")[9] - 173.88) < 0.0001);

        StrDataFrame    df3;

        df3.read("data/SHORT_IBM.csv", io_format::csv2, false, 1716, 10);
        assert(df3.get_index().size() == 5);
        assert(df3.get_column<double>("IBM_Close").size() == 5);
        assert(df3.get_index()[0] == "2020-10-26");
        assert(df3.get_index()[4] == "2020-10-30");
        assert(fabs(df3.get_column<double>("IBM_Close")[0] - 112.22) < 0.0001);
        assert(fabs(df3.get_column<double>("IBM_Close")[4] - 111.66) < 0.0001);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
}
C++ DataFrame