lightgbm.Dataset

class lightgbm.Dataset(data, label=None, reference=None, weight=None, group=None, init_score=None, silent='warn', feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)[source]

Bases: object

Dataset in LightGBM.

__init__(data, label=None, reference=None, weight=None, group=None, init_score=None, silent='warn', feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)[source]

Initialize Dataset.

Parameters
  • data (str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable's Frame, scipy.sparse, Sequence, list of Sequence or list of numpy array) – Data source of Dataset. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM) or a LightGBM Dataset binary file.

  • label (list, numpy 1-D array, pandas Series / one-column DataFrame or None, optional (default=None)) – Label of the data.

  • reference (Dataset or None, optional (default=None)) – If this is a Dataset for validation, the training data should be used as reference.

  • weight (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Weight for each instance.

  • group (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

  • init_score (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None, optional (default=None)) – Init score for Dataset.

  • silent (bool, optional (default=False)) – Whether to suppress messages during construction.

  • feature_name (list of str, or 'auto', optional (default="auto")) – Feature names. If ‘auto’ and data is a pandas DataFrame, the DataFrame's column names are used.

  • categorical_feature (list of str or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.

  • params (dict or None, optional (default=None)) – Other parameters for Dataset.

  • free_raw_data (bool, optional (default=True)) – If True, raw data is freed after constructing inner Dataset.

Methods

__init__(data[, label, reference, weight, ...])

Initialize Dataset.

add_features_from(other)

Add features from other Dataset to the current Dataset.

construct()

Lazy init: build the inner Dataset if it has not been constructed yet.

create_valid(data[, label, weight, group, ...])

Create validation data aligned with the current Dataset.

get_data()

Get the raw data of the Dataset.

get_feature_name()

Get the names of columns (features) in the Dataset.

get_field(field_name)

Get property from the Dataset.

get_group()

Get the group of the Dataset.

get_init_score()

Get the initial score of the Dataset.

get_label()

Get the label of the Dataset.

get_params()

Get the used parameters in the Dataset.

get_ref_chain([ref_limit])

Get a chain of Dataset objects.

get_weight()

Get the weight of the Dataset.

num_data()

Get the number of rows in the Dataset.

num_feature()

Get the number of columns (features) in the Dataset.

save_binary(filename)

Save Dataset to a binary file.

set_categorical_feature(categorical_feature)

Set categorical features.

set_feature_name(feature_name)

Set feature name.

set_field(field_name, data)

Set property into the Dataset.

set_group(group)

Set group size of Dataset (used for ranking).

set_init_score(init_score)

Set init score of Booster to start from.

set_label(label)

Set label of Dataset.

set_reference(reference)

Set reference Dataset.

set_weight(weight)

Set weight of each instance.

subset(used_indices[, params])

Get subset of current Dataset.

add_features_from(other)[source]

Add features from other Dataset to the current Dataset.

Both Datasets must be constructed before calling this method.

Parameters

other (Dataset) – The Dataset to take features from.

Returns

self – Dataset with the new features added.

Return type

Dataset

construct()[source]

Lazy init: build the inner Dataset if it has not been constructed yet.

Returns

self – Constructed Dataset object.

Return type

Dataset

create_valid(data, label=None, weight=None, group=None, init_score=None, silent='warn', params=None)[source]

Create validation data aligned with the current Dataset.

Parameters
  • data (str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable's Frame, scipy.sparse, Sequence, list of Sequence or list of numpy array) – Data source of Dataset. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM) or a LightGBM Dataset binary file.

  • label (list, numpy 1-D array, pandas Series / one-column DataFrame or None, optional (default=None)) – Label of the data.

  • weight (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Weight for each instance.

  • group (list, numpy 1-D array, pandas Series or None, optional (default=None)) – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

  • init_score (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None, optional (default=None)) – Init score for Dataset.

  • silent (bool, optional (default=False)) – Whether to suppress messages during construction.

  • params (dict or None, optional (default=None)) – Other parameters for validation Dataset.

Returns

valid – Validation Dataset with reference to self.

Return type

Dataset

get_data()[source]

Get the raw data of the Dataset.

Returns

data – Raw data used in the Dataset construction.

Return type

str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable’s Frame, scipy.sparse, Sequence, list of Sequence or list of numpy array or None

get_feature_name()[source]

Get the names of columns (features) in the Dataset.

Returns

feature_names – The names of columns (features) in the Dataset.

Return type

list

get_field(field_name)[source]

Get property from the Dataset.

Parameters

field_name (str) – The field name of the information.

Returns

info – A numpy array with information from the Dataset.

Return type

numpy array or None

get_group()[source]

Get the group of the Dataset.

Returns

group – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

Return type

numpy array or None

get_init_score()[source]

Get the initial score of the Dataset.

Returns

init_score – Init score of Booster.

Return type

numpy array or None

get_label()[source]

Get the label of the Dataset.

Returns

label – The label information from the Dataset.

Return type

numpy array or None

get_params()[source]

Get the used parameters in the Dataset.

Returns

params – The used parameters in this Dataset object.

Return type

dict or None

get_ref_chain(ref_limit=100)[source]

Get a chain of Dataset objects.

Starts with this Dataset (call it r), then goes to r.reference (if it exists), then to r.reference.reference, and so on, until ref_limit is reached or a reference loop is found.

Parameters

ref_limit (int, optional (default=100)) – The maximum number of references to follow.

Returns

ref_chain – Chain of references of the Datasets.

Return type

set of Dataset

get_weight()[source]

Get the weight of the Dataset.

Returns

weight – Weight for each data point from the Dataset.

Return type

numpy array or None

num_data()[source]

Get the number of rows in the Dataset.

Returns

number_of_rows – The number of rows in the Dataset.

Return type

int

num_feature()[source]

Get the number of columns (features) in the Dataset.

Returns

number_of_columns – The number of columns (features) in the Dataset.

Return type

int

save_binary(filename)[source]

Save Dataset to a binary file.

Note

init_score is not saved in the binary file. If you need it, set it again after loading the Dataset.

Parameters

filename (str or pathlib.Path) – Name of the output file.

Returns

self – Returns self.

Return type

Dataset

set_categorical_feature(categorical_feature)[source]

Set categorical features.

Parameters

categorical_feature (list of int or str) – Names or indices of categorical features.

Returns

self – Dataset with set categorical features.

Return type

Dataset

set_feature_name(feature_name)[source]

Set feature name.

Parameters

feature_name (list of str) – Feature names.

Returns

self – Dataset with set feature name.

Return type

Dataset

set_field(field_name, data)[source]

Set property into the Dataset.

Parameters
  • field_name (str) – The field name of the information.

  • data (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None) – The data to be set.

Returns

self – Dataset with set property.

Return type

Dataset

set_group(group)[source]

Set group size of Dataset (used for ranking).

Parameters

group (list, numpy 1-D array, pandas Series or None) – Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

Returns

self – Dataset with set group.

Return type

Dataset
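Example: the group layout described above can be checked with plain Python. This sketch does not call LightGBM; it only shows how group sizes map to record ranges, and the variable names are illustrative.

```python
from itertools import accumulate

group = [10, 20, 40, 10, 10, 10]
n_samples = sum(group)  # must equal the number of rows in the Dataset

# Cumulative boundaries: group i covers 0-based rows
# boundaries[i] .. boundaries[i + 1] - 1.
boundaries = [0, *accumulate(group)]

# 1-based inclusive (first, last) record ranges, matching the description above.
ranges = [(start + 1, end) for start, end in zip(boundaries, boundaries[1:])]
```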

set_init_score(init_score)[source]

Set init score of Booster to start from.

Parameters

init_score (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None) – Init score for Booster.

Returns

self – Dataset with set init score.

Return type

Dataset

set_label(label)[source]

Set label of Dataset.

Parameters

label (list, numpy 1-D array, pandas Series / one-column DataFrame or None) – The label information to be set into Dataset.

Returns

self – Dataset with set label.

Return type

Dataset

set_reference(reference)[source]

Set reference Dataset.

Parameters

reference (Dataset) – Reference that is used as a template to construct the current Dataset.

Returns

self – Dataset with set reference.

Return type

Dataset

set_weight(weight)[source]

Set weight of each instance.

Parameters

weight (list, numpy 1-D array, pandas Series or None) – Weight to be set for each data point.

Returns

self – Dataset with set weight.

Return type

Dataset

subset(used_indices, params=None)[source]

Get subset of current Dataset.

Parameters
  • used_indices (list of int) – Indices used to create the subset.

  • params (dict or None, optional (default=None)) – These parameters will be passed to Dataset constructor.

Returns

subset – Subset of the current Dataset.

Return type

Dataset