Developer Guide for Intel® Data Analytics Acceleration Library 2019 Update 4
Data sources define interfaces for access and management of data in raw format and out-of-memory data. A data source is closely coupled with the data dictionary that describes the structure of the data associated with the data source. To create the associated data dictionary, you can do one of the following:
The getDictionary() method returns the dictionary associated with the data source.
Data sources stream and transform raw out-of-memory data into numeric in-memory data accessible through numeric table interfaces. A data source is associated with the corresponding numeric table. To allocate the associated numeric table, you can do one of the following:
The getNumericTable() method returns the numeric table associated with the data source.
To retrieve the number of columns (features) in a raw data set, use the getNumberOfColumns() method. To retrieve the number of rows (observations) available in a raw data set, use the getNumberOfAvailableRows() method. The getStatus() method returns the current status of the data source:
readyForLoad - the data is available for the load operation.
waitingForData - the data source is waiting for new data to arrive later; designated for data sources that handle asynchronous data streaming, that is, data that arrives in blocks at different points in time.
endOfData - all the data is already loaded.
Because the entire out-of-memory data set may not fit into memory, and for performance reasons, Intel® DAAL loads data in blocks. Use the loadDataBlock() method to load the next block of data into the numeric table. The method can load a data block either into an internally allocated numeric table or into a numeric table that you provide. In both cases, specifying the number of rows to load is optional. The method also recalculates the basic statistics associated with this numeric table.
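The block-wise loading pattern described above can be sketched with a small stand-in class. This is a minimal illustration, not Intel DAAL code: MockDataSource, MockStatus, and sumInBlocks are hypothetical names, and the mock only mirrors the loadDataBlock()/getStatus() idea on an in-memory vector, with each row reduced to a single number.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a data source; this is NOT the Intel DAAL class.
enum class MockStatus { readyForLoad, endOfData };

class MockDataSource {
public:
    explicit MockDataSource(std::vector<double> rows) : rows_(std::move(rows)) {}

    // Copies up to maxRows not-yet-loaded rows into `block`
    // and returns the number of rows copied.
    std::size_t loadDataBlock(std::size_t maxRows, std::vector<double>& block) {
        assert(maxRows > 0);
        const std::size_t n = std::min(maxRows, rows_.size() - next_);
        const auto first = rows_.begin() + static_cast<std::ptrdiff_t>(next_);
        block.assign(first, first + static_cast<std::ptrdiff_t>(n));
        next_ += n;
        return n;
    }

    // Reports whether more rows are available for loading.
    MockStatus getStatus() const {
        return next_ < rows_.size() ? MockStatus::readyForLoad
                                    : MockStatus::endOfData;
    }

private:
    std::vector<double> rows_;  // pretend this lives outside of memory
    std::size_t next_ = 0;      // index of the first row not yet loaded
};

// Streams the source block by block and accumulates a running sum,
// so the caller only ever holds one block of at most blockSize rows.
double sumInBlocks(MockDataSource& ds, std::size_t blockSize) {
    std::vector<double> block;
    double total = 0.0;
    while (ds.getStatus() == MockStatus::readyForLoad) {
        ds.loadDataBlock(blockSize, block);
        for (double v : block) total += v;
    }
    return total;
}
```

The loop stops when the status changes from readyForLoad to endOfData, so the caller does not need to know the total number of rows in advance.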
Intel DAAL maintains the list of possible values associated with categorical features to convert them into a numeric form. In this list, a new index is assigned to each new value found in the raw data set. You can get the list of possible values from the possibleValues collection associated with the corresponding feature in the data source. If you have several data sets with the same data structure and you want to use continuous indexing, do the following:
Retrieve the data dictionary from the last data source using the getDictionary() method.
Assign this dictionary to the next data source using the setDictionary() method.
Repeat these steps for each subsequent data source.
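The effect of passing the dictionary from one data source to the next can be illustrated with a toy model. The sketch below does not use Intel DAAL: CategoryDictionary and encode are hypothetical names, and a std::map stands in for the list of possible values of a single categorical feature.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of the list of possible values for one categorical feature:
// each distinct raw value is mapped to the index assigned when it was first seen.
using CategoryDictionary = std::map<std::string, std::size_t>;

// Converts raw categorical values into indices. A value that is not yet in the
// dictionary receives the next free index; a known value reuses its index.
std::vector<std::size_t> encode(const std::vector<std::string>& raw,
                                CategoryDictionary& dict) {
    std::vector<std::size_t> out;
    out.reserve(raw.size());
    for (const std::string& value : raw) {
        // emplace() leaves an existing entry untouched and returns it.
        const auto it = dict.emplace(value, dict.size()).first;
        out.push_back(it->second);
    }
    return out;
}
```

Encoding {"red", "green", "red"} and then {"blue", "green"} with the same dictionary gives {0, 1, 0} and {2, 1}: "green" keeps index 1 in the second data set and "blue" receives the next free index. With a fresh dictionary for each data set, "blue" would receive index 0 and clash with "red".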
Intel DAAL implements classes for some popular types of data sources. Each of these classes takes a feature manager class as the class template parameter. The feature manager parses, filters, and normalizes the data and converts it into a numeric format. The following are the data sources and the corresponding feature manager classes:
CSVFeatureManager provides additional capabilities for feature modification. Use addModifier() to enable a specific modification when loading data to a numeric table:
Feature managers provide additional capabilities for modifying the input data while it is loaded. Use the feature modifier entity to define the desired modification. Feature modifiers enable you to implement a wide range of feature extraction or transformation techniques, for instance, feature binarization, one-hot encoding, or polynomial feature generation. To enable a specific modification, use the addModifier() method that accepts two parameters:
    // Create a DataSource object (for example, FileDataSource)
    FileDataSource<CSVFeatureManager> ds("file.csv", options);

    // Specify the feature subset and the modifier
    auto featureIds = features::list("f1", "f2");
    auto featureModifier = modifiers::csv::continuous();

    // Add the modifier to the feature manager
    ds.getFeatureManager().addModifier(featureIds, featureModifier);

    // Trigger data loading
    ds.loadDataBlock();

A feature subset may be defined with the functions list(…), range(…), all(), or allReverse() located in the namespace data_management::features. You can use numerical or string identifiers to refer to a particular feature in the data set. A string identifier may correspond to a feature name (for instance, a name in the CSV header or an SQL table column name), and a numerical identifier corresponds to the index of a feature. The following code block shows several ways to define a feature subset. f1, f2, and f4 are the names of the respective columns in the CSV file or SQL table, and the numbers 0 and 2 to 4 are column indices, counted from the leftmost column.
    features::list("f1", "f2");   // String identifiers
    features::list(0, 3);         // Numerical identifiers
    features::list("f1", 2);      // Mixed identifiers
    features::range(0, 4);        // Range of features, the same as list(0,…,4)
    features::range("f1", "f4");  // Range with string identifiers
    features::all();              // Refer to all features in the data set
    features::allReverse();       // Like features::all() but in reverse order

    // With STL vector
    std::vector<features::IdFactory> fv;
    fv.push_back("f2");
    fv.push_back(3);
    features::list(fv);

    // With C++11 initializer list
    features::list({ "f2", 3, "f1" });

We will use the term input features to refer to the columns of raw out-of-memory data and the term output features for the columns of numeric in-memory data. A feature modifier transforms the specified subset of input features into output features. The number of output features is determined by the modifier. A feature modifier is expected to read the values corresponding to the specified input features from the i-th row and write the modified values to the i-th row of the output numeric table. In the general case, a feature modifier can map an arbitrary number of input features to an arbitrary number of output features. Let's assume that we added m modifiers along with the feature subsets F_1, ..., F_m and that the j-th modifier has C_j output columns, where
    class MyFeatureModifier : public modifiers::csv::FeatureModifierBase
    {
    public:
        virtual void initialize(modifiers::csv::Config &config);
        virtual void apply(modifiers::csv::Context &context);
        virtual void finalize(modifiers::csv::Config &config);
    };

Use the addModifier(…) method to add the user-defined modifier to the feature manager:
    ds.getFeatureManager().addModifier(
        features::list(0, 3),
        modifiers::custom<MyFeatureModifier>()
    );

A feature modifier's lifetime consists of three stages: