lightgbm.Dataset

class lightgbm.Dataset(data, label=None, reference=None, weight=None, group=None, init_score=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)[source]

Bases: object

Dataset in LightGBM.

__init__(data, label=None, reference=None, weight=None, group=None, init_score=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)[source]

Initialize Dataset.

Parameters:
  • data (string, numpy array, pandas DataFrame, H2O DataTable's Frame, scipy.sparse or list of numpy arrays) – Data source of Dataset. If string, it represents the path to a text file.
  • label (list, numpy 1-D array, pandas Series / one-column DataFrame or None, optional (default=None)) – Label of the data.
  • reference (Dataset or None, optional (default=None)) – If this Dataset is for validation, the training Dataset should be used as reference.
  • weight (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Weight for each instance.
  • group (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Group/query size for Dataset.
  • init_score (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Init score for Dataset.
  • silent (bool, optional (default=False)) – Whether to suppress messages during construction.
  • feature_name (list of strings or 'auto', optional (default="auto")) – Feature names. If ‘auto’ and data is pandas DataFrame, the data's column names are used.
  • categorical_feature (list of strings or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.
  • params (dict or None, optional (default=None)) – Other parameters for Dataset.
  • free_raw_data (bool, optional (default=True)) – If True, raw data is freed after constructing inner Dataset.

Methods

__init__(data[, label, reference, weight, …]) Initialize Dataset.
add_features_from(other) Add features from other Dataset to the current Dataset.
construct() Lazy init: construct the inner Dataset on demand.
create_valid(data[, label, weight, group, …]) Create validation data aligned with the current Dataset.
get_data() Get the raw data of the Dataset.
get_feature_penalty() Get the feature penalty of the Dataset.
get_field(field_name) Get property from the Dataset.
get_group() Get the group of the Dataset.
get_init_score() Get the initial score of the Dataset.
get_label() Get the label of the Dataset.
get_monotone_constraints() Get the monotone constraints of the Dataset.
get_ref_chain([ref_limit]) Get a chain of Dataset objects.
get_weight() Get the weight of the Dataset.
num_data() Get the number of rows in the Dataset.
num_feature() Get the number of columns (features) in the Dataset.
save_binary(filename) Save Dataset to a binary file.
set_categorical_feature(categorical_feature) Set categorical features.
set_feature_name(feature_name) Set feature name.
set_field(field_name, data) Set property into the Dataset.
set_group(group) Set group size of Dataset (used for ranking).
set_init_score(init_score) Set init score of Booster to start from.
set_label(label) Set label of Dataset.
set_reference(reference) Set reference Dataset.
set_weight(weight) Set weight of each instance.
subset(used_indices[, params]) Get subset of current Dataset.

add_features_from(other)[source]

Add features from other Dataset to the current Dataset.

Both Datasets must be constructed before calling this method.

Parameters: other (Dataset) – The Dataset to take features from.
Returns: self – Dataset with the new features added.
Return type: Dataset
construct()[source]

Lazy init: construct the inner Dataset if it has not been constructed yet.

Returns: self – Constructed Dataset object.
Return type: Dataset
create_valid(data, label=None, weight=None, group=None, init_score=None, silent=False, params=None)[source]

Create validation data aligned with the current Dataset.

Parameters:
  • data (string, numpy array, pandas DataFrame, H2O DataTable's Frame, scipy.sparse or list of numpy arrays) – Data source of Dataset. If string, it represents the path to a text file.
  • label (list, numpy 1-D array, pandas Series / one-column DataFrame or None, optional (default=None)) – Label of the data.
  • weight (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Weight for each instance.
  • group (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Group/query size for Dataset.
  • init_score (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Init score for Dataset.
  • silent (bool, optional (default=False)) – Whether to suppress messages during construction.
  • params (dict or None, optional (default=None)) – Other parameters for validation Dataset.
Returns:

valid – Validation Dataset with reference to self.

Return type:

Dataset

get_data()[source]

Get the raw data of the Dataset.

Returns: data – Raw data used in the Dataset construction.
Return type: string, numpy array, pandas DataFrame, H2O DataTable’s Frame, scipy.sparse, list of numpy arrays or None
get_feature_penalty()[source]

Get the feature penalty of the Dataset.

Returns: feature_penalty – Feature penalty for each feature in the Dataset.
Return type: numpy array or None
get_field(field_name)[source]

Get property from the Dataset.

Parameters: field_name (string) – The field name of the information.
Returns: info – A numpy array with information from the Dataset.
Return type: numpy array
get_group()[source]

Get the group of the Dataset.

Returns: group – Group size of each group.
Return type: numpy array or None
get_init_score()[source]

Get the initial score of the Dataset.

Returns: init_score – Init score of Booster.
Return type: numpy array or None
get_label()[source]

Get the label of the Dataset.

Returns: label – The label information from the Dataset.
Return type: numpy array or None
get_monotone_constraints()[source]

Get the monotone constraints of the Dataset.

Returns: monotone_constraints – Monotone constraints: -1, 0 or 1, for each feature in the Dataset.
Return type: numpy array or None
get_ref_chain(ref_limit=100)[source]

Get a chain of Dataset objects.

Starts with this Dataset (r), then goes to r.reference (if it exists), then to r.reference.reference, and so on, until ref_limit is reached or a reference loop is detected.

Parameters: ref_limit (int, optional (default=100)) – The maximum number of references in the chain.
Returns: ref_chain – Chain of references of the Datasets.
Return type: set of Dataset
get_weight()[source]

Get the weight of the Dataset.

Returns: weight – Weight for each data point from the Dataset.
Return type: numpy array or None
num_data()[source]

Get the number of rows in the Dataset.

Returns: number_of_rows – The number of rows in the Dataset.
Return type: int
num_feature()[source]

Get the number of columns (features) in the Dataset.

Returns: number_of_columns – The number of columns (features) in the Dataset.
Return type: int
save_binary(filename)[source]

Save Dataset to a binary file.

Parameters: filename (string) – Name of the output file.
Returns: self – Returns self.
Return type: Dataset
set_categorical_feature(categorical_feature)[source]

Set categorical features.

Parameters: categorical_feature (list of int or strings) – Names or indices of categorical features.
Returns: self – Dataset with the categorical features set.
Return type: Dataset
set_feature_name(feature_name)[source]

Set feature name.

Parameters: feature_name (list of strings) – Feature names.
Returns: self – Dataset with the feature names set.
Return type: Dataset
set_field(field_name, data)[source]

Set property into the Dataset.

Parameters:
  • field_name (string) – The field name of the information.
  • data (list, numpy 1-D array, pandas Series or None) – The array of data to be set.
Returns:

self – Dataset with the property set.

Return type:

Dataset

set_group(group)[source]

Set group size of Dataset (used for ranking).

Parameters: group (list, numpy 1-D array, pandas Series or None) – Group size of each group.
Returns: self – Dataset with the group set.
Return type: Dataset
set_init_score(init_score)[source]

Set init score of Booster to start from.

Parameters: init_score (list, numpy 1-D array, pandas Series or None) – Init score for Booster.
Returns: self – Dataset with the init score set.
Return type: Dataset
set_label(label)[source]

Set label of Dataset.

Parameters: label (list, numpy 1-D array, pandas Series / one-column DataFrame or None) – The label information to be set into Dataset.
Returns: self – Dataset with the label set.
Return type: Dataset
set_reference(reference)[source]

Set reference Dataset.

Parameters: reference (Dataset) – Reference that is used as a template to construct the current Dataset.
Returns: self – Dataset with the reference set.
Return type: Dataset
set_weight(weight)[source]

Set weight of each instance.

Parameters: weight (list, numpy 1-D array, pandas Series or None) – Weight to be set for each data point.
Returns: self – Dataset with the weight set.
Return type: Dataset
subset(used_indices, params=None)[source]

Get subset of current Dataset.

Parameters:
  • used_indices (list of int) – Indices used to create the subset.
  • params (dict or None, optional (default=None)) – These parameters will be passed to Dataset constructor.
Returns:

subset – Subset of the current Dataset.

Return type:

Dataset