lightgbm.Booster

class lightgbm.Booster(params=None, train_set=None, model_file=None, model_str=None)[source]

Bases: object

Booster in LightGBM.

__init__(params=None, train_set=None, model_file=None, model_str=None)[source]

Initialize the Booster.

Parameters:
  • params (dict or None, optional (default=None)) – Parameters for Booster.

  • train_set (Dataset or None, optional (default=None)) – Training dataset.

  • model_file (str, pathlib.Path or None, optional (default=None)) – Path to the model file.

  • model_str (str or None, optional (default=None)) – Model will be loaded from this string.
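
The two construction paths can be sketched as follows. This is a minimal, hypothetical example (the toy data, parameter values and "model.txt" filename are illustrative only):

    import numpy as np
    import lightgbm as lgb

    # Construct a Booster for training from an in-memory Dataset.
    X = np.random.rand(100, 5)
    y = np.random.rand(100)
    train_set = lgb.Dataset(X, label=y)
    booster = lgb.Booster(params={"objective": "regression", "verbose": -1},
                          train_set=train_set)

    # Train a few iterations by hand; lgb.train() is the usual entry point.
    for _ in range(10):
        booster.update()

    # Alternatively, load a previously saved model from a file:
    # booster = lgb.Booster(model_file="model.txt")

Later sketches on this page reuse this hypothetical booster and train_set.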

Methods

__init__([params, train_set, model_file, ...])

Initialize the Booster.

add_valid(data, name)

Add validation data.

current_iteration()

Get the index of the current iteration.

dump_model([num_iteration, start_iteration, ...])

Dump Booster to JSON format.

eval(data, name[, feval])

Evaluate the model on data.

eval_train([feval])

Evaluate the model on the training data.

eval_valid([feval])

Evaluate the model on the validation data.

feature_importance([importance_type, iteration])

Get feature importances.

feature_name()

Get names of features.

free_dataset()

Free Booster's Datasets.

free_network()

Free Booster's network.

get_leaf_output(tree_id, leaf_id)

Get the output of a leaf.

get_split_value_histogram(feature[, bins, ...])

Get split value histogram for the specified feature.

lower_bound()

Get lower bound value of a model.

model_from_string(model_str)

Load Booster from a string.

model_to_string([num_iteration, ...])

Save Booster to string.

num_feature()

Get number of features.

num_model_per_iteration()

Get number of models per iteration.

num_trees()

Get number of weak sub-models.

predict(data[, start_iteration, ...])

Make a prediction.

refit(data, label[, decay_rate, reference, ...])

Refit the existing Booster with new data.

reset_parameter(params)

Reset parameters of Booster.

rollback_one_iter()

Roll back one iteration.

save_model(filename[, num_iteration, ...])

Save Booster to file.

set_leaf_output(tree_id, leaf_id, value)

Set the output of a leaf.

set_network(machines[, local_listen_port, ...])

Set the network configuration.

set_train_data_name(name)

Set the name of the training Dataset.

shuffle_models([start_iteration, end_iteration])

Shuffle models.

trees_to_dataframe()

Parse the fitted model and return in an easy-to-read pandas DataFrame.

update([train_set, fobj])

Update Booster for one iteration.

upper_bound()

Get upper bound value of a model.

add_valid(data, name)[source]

Add validation data.

Parameters:
  • data (Dataset) – Validation data.

  • name (str) – Name of validation data.

Returns:

self – Booster with the validation data added.

Return type:

Booster
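
A minimal sketch, reusing the hypothetical booster and train_set from the __init__ example (the held-out data is illustrative):

    # Build the validation Dataset against the training Dataset so that
    # feature binning is aligned, then register it under a name.
    X_val = np.random.rand(50, 5)
    valid_set = lgb.Dataset(X_val, label=np.random.rand(50), reference=train_set)
    booster.add_valid(valid_set, name="valid_0")
    # eval_valid() will now also report metrics for "valid_0".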

current_iteration()[source]

Get the index of the current iteration.

Returns:

cur_iter – The index of the current iteration.

Return type:

int

dump_model(num_iteration=None, start_iteration=0, importance_type='split', object_hook=None)[source]

Dump Booster to JSON format.

Parameters:
  • num_iteration (int or None, optional (default=None)) – Index of the iteration that should be dumped. If None, if the best iteration exists, it is dumped; otherwise, all iterations are dumped. If <= 0, all iterations are dumped.

  • start_iteration (int, optional (default=0)) – Start index of the iteration that should be dumped.

  • importance_type (str, optional (default="split")) – What type of feature importance should be dumped. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

  • object_hook (callable or None, optional (default=None)) – If not None, object_hook is a function called while parsing the json string returned by the C API. It may be used to alter the json, to store specific values while building the json structure. It avoids walking through the structure again. It saves a significant amount of time if the number of trees is huge. Signature is def object_hook(node: dict) -> dict. None is equivalent to lambda node: node. See documentation of json.loads() for further details.

Returns:

json_repr – JSON format of Booster.

Return type:

dict
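
A hedged sketch of dumping the model and using object_hook to strip a field during parsing (the "internal_count" field name is assumed to be present in the dumped tree nodes; reuses the hypothetical booster from above):

    model_json = booster.dump_model()
    print(len(model_json["tree_info"]))  # number of dumped trees

    # object_hook is called on every parsed JSON object; here it drops a field
    # while the structure is being built, avoiding a second walk over the JSON.
    def drop_counts(node):
        node.pop("internal_count", None)
        return node

    slim_json = booster.dump_model(object_hook=drop_counts)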

eval(data, name, feval=None)[source]

Evaluate the model on data.

Parameters:
  • data (Dataset) – Data to evaluate.

  • name (str) – Name of the data.

  • feval (callable, list of callable, or None, optional (default=None)) –

    Customized evaluation function. Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.

    preds : numpy 1-D array or numpy 2-D array (for multi-class task)

    The predicted values. For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes]. If custom objective function is used, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.

    eval_data : Dataset

    A Dataset to evaluate.

    eval_name : str

    The name of evaluation function (without whitespace).

    eval_result : float

    The eval result.

    is_higher_better : bool

    Whether a higher eval result is better, e.g. AUC is is_higher_better.

Returns:

result – List with (dataset_name, eval_name, eval_result, is_higher_better) tuples.

Return type:

list
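
A sketch of a custom metric passed as feval, reusing the hypothetical booster and valid_set from the examples above:

    def mae(preds, eval_data):
        # Must return (eval_name, eval_result, is_higher_better).
        value = np.mean(np.abs(eval_data.get_label() - preds))
        return "mae", value, False

    results = booster.eval(valid_set, name="valid_0", feval=mae)
    # results also contains the built-in metric, e.g.
    # [("valid_0", "l2", ..., False), ("valid_0", "mae", ..., False)]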

eval_train(feval=None)[source]

Evaluate the model on the training data.

Parameters:

feval (callable, list of callable, or None, optional (default=None)) –

Customized evaluation function. Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.

preds : numpy 1-D array or numpy 2-D array (for multi-class task)

The predicted values. For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes]. If custom objective function is used, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.

eval_data : Dataset

The training dataset.

eval_name : str

The name of evaluation function (without whitespace).

eval_result : float

The eval result.

is_higher_better : bool

Whether a higher eval result is better, e.g. AUC is is_higher_better.

Returns:

result – List with (train_dataset_name, eval_name, eval_result, is_higher_better) tuples.

Return type:

list

eval_valid(feval=None)[source]

Evaluate the model on the validation data.

Parameters:

feval (callable, list of callable, or None, optional (default=None)) –

Customized evaluation function. Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.

preds : numpy 1-D array or numpy 2-D array (for multi-class task)

The predicted values. For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes]. If custom objective function is used, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.

eval_data : Dataset

The validation dataset.

eval_name : str

The name of evaluation function (without whitespace).

eval_result : float

The eval result.

is_higher_better : bool

Whether a higher eval result is better, e.g. AUC is is_higher_better.

Returns:

result – List with (validation_dataset_name, eval_name, eval_result, is_higher_better) tuples.

Return type:

list

feature_importance(importance_type='split', iteration=None)[source]

Get feature importances.

Parameters:
  • importance_type (str, optional (default="split")) – How the importance is calculated. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

  • iteration (int or None, optional (default=None)) – Limit number of iterations in the feature importance calculation. If None, if the best iteration exists, it is used; otherwise, all trees are used. If <= 0, all trees are used (no limits).

Returns:

result – Array with feature importances.

Return type:

numpy array
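
A short sketch pairing each importance with its feature name, reusing the hypothetical booster from above:

    importances = booster.feature_importance(importance_type="gain")
    for name, imp in zip(booster.feature_name(), importances):
        print(f"{name}: {imp:.2f}")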

feature_name()[source]

Get names of features.

Returns:

result – List with names of features.

Return type:

list of str

free_dataset()[source]

Free Booster’s Datasets.

Returns:

self – Booster without Datasets.

Return type:

Booster

free_network()[source]

Free Booster’s network.

Returns:

self – Booster with freed network.

Return type:

Booster

get_leaf_output(tree_id, leaf_id)[source]

Get the output of a leaf.

Parameters:
  • tree_id (int) – The index of the tree.

  • leaf_id (int) – The index of the leaf in the tree.

Returns:

result – The output of the leaf.

Return type:

float

get_split_value_histogram(feature, bins=None, xgboost_style=False)[source]

Get split value histogram for the specified feature.

Parameters:
  • feature (int or str) –

    The feature name or index the histogram is calculated for. If int, interpreted as index. If str, interpreted as name.

    Warning

    Categorical features are not supported.

  • bins (int, str or None, optional (default=None)) – The maximum number of bins. If None, or an int greater than the number of unique split values while xgboost_style=True, the number of bins equals the number of unique split values. If str, it should be one of the values supported by the numpy.histogram() function.

  • xgboost_style (bool, optional (default=False)) – Whether the returned result should be in the same form as it is in XGBoost. If False, the returned value is tuple of 2 numpy arrays as it is in numpy.histogram() function. If True, the returned value is matrix, in which the first column is the right edges of non-empty bins and the second one is the histogram values.

Returns:

  • result_tuple (tuple of 2 numpy arrays) – If xgboost_style=False, the values of the histogram of used splitting values for the specified feature and the bin edges.

  • result_array_like (numpy array or pandas DataFrame (if pandas is installed)) – If xgboost_style=True, the histogram of used splitting values for the specified feature.
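
A sketch with the default xgboost_style=False, whose return value mirrors numpy.histogram() (feature index 0 is illustrative and assumed to appear in at least one split; reuses the hypothetical booster from above):

    hist, bin_edges = booster.get_split_value_histogram(feature=0, bins=10)
    # hist has len(bin_edges) - 1 entries, one count per bin.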

lower_bound()[source]

Get lower bound value of a model.

Returns:

lower_bound – Lower bound value of the model.

Return type:

float

model_from_string(model_str)[source]

Load Booster from a string.

Parameters:

model_str (str) – Model will be loaded from this string.

Returns:

self – Loaded Booster object.

Return type:

Booster

model_to_string(num_iteration=None, start_iteration=0, importance_type='split')[source]

Save Booster to string.

Parameters:
  • num_iteration (int or None, optional (default=None)) – Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.

  • start_iteration (int, optional (default=0)) – Start index of the iteration that should be saved.

  • importance_type (str, optional (default="split")) – What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

Returns:

str_repr – String representation of Booster.

Return type:

str
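
A sketch of a string round trip, e.g. for storing a model in a database rather than on disk (reuses the hypothetical booster from above):

    model_str = booster.model_to_string()
    restored = lgb.Booster(model_str=model_str)
    assert restored.num_trees() == booster.num_trees()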

num_feature()[source]

Get number of features.

Returns:

num_feature – The number of features.

Return type:

int

num_model_per_iteration()[source]

Get number of models per iteration.

Returns:

model_per_iter – The number of models per iteration.

Return type:

int

num_trees()[source]

Get number of weak sub-models.

Returns:

num_trees – The number of weak sub-models.

Return type:

int

predict(data, start_iteration=0, num_iteration=None, raw_score=False, pred_leaf=False, pred_contrib=False, data_has_header=False, validate_features=False, **kwargs)[source]

Make a prediction.

Parameters:
  • data (str, pathlib.Path, numpy array, pandas DataFrame, pyarrow Table, H2O DataTable's Frame or scipy.sparse) – Data source for prediction. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM).

  • start_iteration (int, optional (default=0)) – Start index of the iteration to predict. If <= 0, starts from the first iteration.

  • num_iteration (int or None, optional (default=None)) – Total number of iterations used in the prediction. If None, if the best iteration exists and start_iteration <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used (no limits). If <= 0, all iterations from start_iteration are used (no limits).

  • raw_score (bool, optional (default=False)) – Whether to predict raw scores.

  • pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.

  • pred_contrib (bool, optional (default=False)) –

    Whether to predict feature contributions.

    Note

    If you want to get more explanations for your model’s predictions using SHAP values, like SHAP interaction values, you can install the shap package (https://github.com/slundberg/shap). Note that unlike the shap package, with pred_contrib we return a matrix with an extra column, where the last column is the expected value.

  • data_has_header (bool, optional (default=False)) – Whether the data has header. Used only if data is str.

  • validate_features (bool, optional (default=False)) – If True, ensure that the features used to predict match the ones used to train. Used only if data is pandas DataFrame.

  • **kwargs – Other parameters for the prediction.

Returns:

result – Prediction result. Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when pred_contrib=True).

Return type:

numpy array, scipy.sparse or list of scipy.sparse
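
A sketch of plain predictions and per-feature contributions, reusing the hypothetical booster from above (the new rows are illustrative):

    X_new = np.random.rand(10, 5)
    preds = booster.predict(X_new)
    contribs = booster.predict(X_new, pred_contrib=True)
    # contribs has n_features + 1 columns; the last one is the expected value,
    # so each row sums to the corresponding raw score.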

refit(data, label, decay_rate=0.9, reference=None, weight=None, group=None, init_score=None, feature_name='auto', categorical_feature='auto', dataset_params=None, free_raw_data=True, validate_features=False, **kwargs)[source]

Refit the existing Booster with new data.

Parameters:
  • data (str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable's Frame, scipy.sparse, Sequence, list of Sequence or list of numpy array) – Data source for refit. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM).

  • label (list, numpy 1-D array, pandas Series / one-column DataFrame, pyarrow Array or pyarrow ChunkedArray) – Label for refit.

  • decay_rate (float, optional (default=0.9)) – Decay rate of refit, will use leaf_output = decay_rate * old_leaf_output + (1.0 - decay_rate) * new_leaf_output to refit trees.

  • reference (Dataset or None, optional (default=None)) –

    Reference for data.

    New in version 4.0.0.

  • weight (list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None, optional (default=None)) –

    Weight for each data instance. Weights should be non-negative.

    New in version 4.0.0.

  • group (list, numpy 1-D array, pandas Series, pyarrow Array, pyarrow ChunkedArray or None, optional (default=None)) –

    Group/query size for data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

    New in version 4.0.0.

  • init_score (list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None, optional (default=None)) –

    Init score for data.

    New in version 4.0.0.

  • feature_name (list of str, or 'auto', optional (default="auto")) –

    Feature names for data. If ‘auto’ and data is pandas DataFrame, data columns names are used.

    New in version 4.0.0.

  • categorical_feature (list of str or int, or 'auto', optional (default="auto")) –

    Categorical features for data. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature. Floating point numbers in categorical features will be rounded towards 0.

    New in version 4.0.0.

  • dataset_params (dict or None, optional (default=None)) –

    Other parameters for Dataset data.

    New in version 4.0.0.

  • free_raw_data (bool, optional (default=True)) –

    If True, raw data is freed after constructing inner Dataset for data.

    New in version 4.0.0.

  • validate_features (bool, optional (default=False)) –

    If True, ensure that the features used to refit the model match the original ones. Used only if data is pandas DataFrame.

    New in version 4.0.0.

  • **kwargs – Other parameters for refit. These parameters will be passed to predict method.

Returns:

result – Refitted Booster.

Return type:

Booster
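
A minimal sketch: keep the learned tree structures but blend the leaf outputs toward newer data (the new data is hypothetical; reuses the hypothetical booster from above):

    X_new = np.random.rand(100, 5)
    y_new = np.random.rand(100)
    # With decay_rate=0.9, each refitted leaf keeps 90% of its old output.
    refitted = booster.refit(X_new, y_new, decay_rate=0.9)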

reset_parameter(params)[source]

Reset parameters of Booster.

Parameters:

params (dict) – New parameters for Booster.

Returns:

self – Booster with new parameters.

Return type:

Booster

rollback_one_iter()[source]

Roll back one iteration.

Returns:

self – Booster with one iteration rolled back.

Return type:

Booster

save_model(filename, num_iteration=None, start_iteration=0, importance_type='split')[source]

Save Booster to file.

Parameters:
  • filename (str or pathlib.Path) – Filename to save Booster.

  • num_iteration (int or None, optional (default=None)) – Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.

  • start_iteration (int, optional (default=0)) – Start index of the iteration that should be saved.

  • importance_type (str, optional (default="split")) – What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

Returns:

self – Returns self.

Return type:

Booster
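
For example, saving to disk and loading back through the constructor (the filename is illustrative; reuses the hypothetical booster from above):

    booster.save_model("model.txt")
    loaded = lgb.Booster(model_file="model.txt")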

set_leaf_output(tree_id, leaf_id, value)[source]

Set the output of a leaf.

New in version 4.0.0.

Parameters:
  • tree_id (int) – The index of the tree.

  • leaf_id (int) – The index of the leaf in the tree.

  • value (float) – Value to set as the output of the leaf.

Returns:

self – Booster with the leaf output set.

Return type:

Booster

set_network(machines, local_listen_port=12400, listen_time_out=120, num_machines=1)[source]

Set the network configuration.

Parameters:
  • machines (list, set or str) – Names of machines.

  • local_listen_port (int, optional (default=12400)) – TCP listen port for local machines.

  • listen_time_out (int, optional (default=120)) – Socket time-out in minutes.

  • num_machines (int, optional (default=1)) – The number of machines for distributed learning application.

Returns:

self – Booster with the network configured.

Return type:

Booster

set_train_data_name(name)[source]

Set the name of the training Dataset.

Parameters:

name (str) – Name for the training Dataset.

Returns:

self – Booster with the training Dataset name set.

Return type:

Booster

shuffle_models(start_iteration=0, end_iteration=-1)[source]

Shuffle models.

Parameters:
  • start_iteration (int, optional (default=0)) – The first iteration that will be shuffled.

  • end_iteration (int, optional (default=-1)) – The last iteration that will be shuffled. If <= 0, means the last available iteration.

Returns:

self – Booster with shuffled models.

Return type:

Booster

trees_to_dataframe()[source]

Parse the fitted model and return in an easy-to-read pandas DataFrame.

The returned DataFrame has the following columns.

  • tree_index : int64, which tree a node belongs to. 0-based, so a value of 6, for example, means “this node is in the 7th tree”.

  • node_depth : int64, how far a node is from the root of the tree. The root node has a value of 1, its direct children are 2, etc.

  • node_index : str, unique identifier for a node.

  • left_child : str, node_index of the child node to the left of a split. None for leaf nodes.

  • right_child : str, node_index of the child node to the right of a split. None for leaf nodes.

  • parent_index : str, node_index of this node’s parent. None for the root node.

  • split_feature : str, name of the feature used for splitting. None for leaf nodes.

  • split_gain : float64, gain from adding this split to the tree. NaN for leaf nodes.

  • threshold : float64, value of the feature used to decide which side of the split a record will go down. NaN for leaf nodes.

  • decision_type : str, logical operator describing how to compare a value to threshold. For example, split_feature = "Column_10", threshold = 15, decision_type = "<=" means that records where Column_10 <= 15 follow the left side of the split, and all other records follow the right side. None for leaf nodes.

  • missing_direction : str, split direction that missing values should go to. None for leaf nodes.

  • missing_type : str, describes what types of values are treated as missing.

  • value : float64, predicted value for this leaf node, multiplied by the learning rate.

  • weight : float64 or int64, sum of Hessian (second-order derivative of objective), summed over observations that fall in this node.

  • count : int64, number of records in the training data that fall into this node.

Returns:

result – Returns a pandas DataFrame of the parsed model.

Return type:

pandas DataFrame
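
A short sketch of one common use, counting how often each feature is split on (requires pandas; reuses the hypothetical booster from above):

    df = booster.trees_to_dataframe()
    # Leaf rows have split_feature set to None and are dropped by value_counts().
    print(df["split_feature"].value_counts())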

update(train_set=None, fobj=None)[source]

Update Booster for one iteration.

Parameters:
  • train_set (Dataset or None, optional (default=None)) – Training data. If None, last training data is used.

  • fobj (callable or None, optional (default=None)) –

    Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

    preds : numpy 1-D array or numpy 2-D array (for multi-class task)

    The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.

    train_data : Dataset

    The training dataset.

    grad : numpy 1-D array or numpy 2-D array (for multi-class task)

    The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point.

    hess : numpy 1-D array or numpy 2-D array (for multi-class task)

    The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point.

    For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes], and grad and hess should be returned in the same format.

Returns:

is_finished – Whether the update was successfully finished.

Return type:

bool
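
A sketch of one manual boosting round with a custom squared-error objective. A fresh, hypothetical Booster is built with objective "none" so that the preds passed to fobj are raw scores (the toy data is illustrative):

    def squared_error(preds, train_data):
        residual = preds - train_data.get_label()
        grad = 2.0 * residual                  # d/dpreds of (preds - y)^2
        hess = np.full_like(residual, 2.0)     # second derivative
        return grad, hess

    custom_train = lgb.Dataset(np.random.rand(100, 5), label=np.random.rand(100))
    custom_booster = lgb.Booster(params={"objective": "none", "verbose": -1},
                                 train_set=custom_train)
    finished = custom_booster.update(fobj=squared_error)
    # rollback_one_iter() would undo this round if needed.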

upper_bound()[source]

Get upper bound value of a model.

Returns:

upper_bound – Upper bound value of the model.

Return type:

float