lightgbm.Booster
- class lightgbm.Booster(params=None, train_set=None, model_file=None, model_str=None, silent='warn')[source]
Bases: object
Booster in LightGBM.
- __init__(params=None, train_set=None, model_file=None, model_str=None, silent='warn')[source]
Initialize the Booster.
- Parameters
params (dict or None, optional (default=None)) – Parameters for Booster.
train_set (Dataset or None, optional (default=None)) – Training dataset.
model_file (str, pathlib.Path or None, optional (default=None)) – Path to the model file.
model_str (str or None, optional (default=None)) – Model will be loaded from this string.
silent (bool, optional (default=False)) – Whether to print messages during construction.
Methods
__init__([params, train_set, model_file, ...]) – Initialize the Booster.
add_valid(data, name) – Add validation data.
attr(key) – Get attribute string from the Booster.
current_iteration() – Get the index of the current iteration.
dump_model([num_iteration, start_iteration, ...]) – Dump Booster to JSON format.
eval(data, name[, feval]) – Evaluate for data.
eval_train([feval]) – Evaluate for training data.
eval_valid([feval]) – Evaluate for validation data.
feature_importance([importance_type, iteration]) – Get feature importances.
feature_name() – Get names of features.
free_dataset() – Free Booster's Datasets.
free_network() – Free Booster's network.
get_leaf_output(tree_id, leaf_id) – Get the output of a leaf.
get_split_value_histogram(feature[, bins, ...]) – Get split value histogram for the specified feature.
lower_bound() – Get lower bound value of a model.
model_from_string(model_str[, verbose]) – Load Booster from a string.
model_to_string([num_iteration, ...]) – Save Booster to string.
num_feature() – Get number of features.
num_model_per_iteration() – Get number of models per iteration.
num_trees() – Get number of weak sub-models.
predict(data[, start_iteration, ...]) – Make a prediction.
refit(data, label[, decay_rate]) – Refit the existing Booster by new data.
reset_parameter(params) – Reset parameters of Booster.
rollback_one_iter() – Rollback one iteration.
save_model(filename[, num_iteration, ...]) – Save Booster to file.
set_attr(**kwargs) – Set attributes to the Booster.
set_network(machines[, local_listen_port, ...]) – Set the network configuration.
set_train_data_name(name) – Set the name to the training Dataset.
shuffle_models([start_iteration, end_iteration]) – Shuffle models.
trees_to_dataframe() – Parse the fitted model and return in an easy-to-read pandas DataFrame.
update([train_set, fobj]) – Update Booster for one iteration.
upper_bound() – Get upper bound value of a model.
- attr(key)[source]
Get attribute string from the Booster.
- Parameters
key (str) – The name of the attribute.
- Returns
value – The attribute value. Returns None if attribute does not exist.
- Return type
str or None
- current_iteration()[source]
Get the index of the current iteration.
- Returns
cur_iter – The index of the current iteration.
- Return type
int
- dump_model(num_iteration=None, start_iteration=0, importance_type='split', object_hook=None)[source]
Dump Booster to JSON format.
- Parameters
num_iteration (int or None, optional (default=None)) – Index of the iteration that should be dumped. If None, if the best iteration exists, it is dumped; otherwise, all iterations are dumped. If <= 0, all iterations are dumped.
start_iteration (int, optional (default=0)) – Start index of the iteration that should be dumped.
importance_type (str, optional (default="split")) – What type of feature importance should be dumped. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.
object_hook (callable or None, optional (default=None)) – If not None, object_hook is a function called while parsing the JSON string returned by the C API. It may be used to alter the JSON or to store specific values while building the JSON structure. This avoids walking through the structure again and saves a significant amount of time if the number of trees is huge. Signature is def object_hook(node: dict) -> dict. None is equivalent to lambda node: node. See the documentation of json.loads() for further details.
- Returns
json_repr – JSON format of Booster.
- Return type
dict
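The object_hook mechanism described above is the one provided by json.loads(). A minimal sketch of the idea, using only the standard library: the JSON string below is a hypothetical, heavily simplified stand-in for a dumped model (its key names are assumptions for illustration), and collect_leaves is a hypothetical hook that records leaf values during parsing instead of walking the structure afterwards.

```python
import json

# Hypothetical, simplified stand-in for the JSON string returned by the C API.
# The key names are illustrative assumptions, not a guaranteed schema.
model_json = (
    '{"tree_info": [{"tree_index": 0, "tree_structure": {'
    '"split_feature": 2, "threshold": 0.5, '
    '"left_child": {"leaf_value": -0.1}, '
    '"right_child": {"leaf_value": 0.2}}}]}'
)

leaf_values = []


def collect_leaves(node: dict) -> dict:
    """Record every leaf value while the JSON is being parsed."""
    if "leaf_value" in node:
        leaf_values.append(node["leaf_value"])
    # Returning the node unchanged keeps the parsed structure intact.
    return node


# json.loads calls the hook once for every JSON object it decodes.
parsed = json.loads(model_json, object_hook=collect_leaves)
```

With a real Booster, the same hook would be passed as dump_model(object_hook=collect_leaves).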
- eval(data, name, feval=None)[source]
Evaluate for data.
- Parameters
data (Dataset) – Data to evaluate on.
name (str) – Name of the data.
feval (callable or None, optional (default=None)) –
Customized evaluation function. Should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.
- preds (list or numpy 1-D array) – The predicted values. If fobj is specified, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.
- eval_data (Dataset) – The evaluation dataset.
- eval_name (str) – The name of evaluation function (without whitespace).
- eval_result (float) – The eval result.
- is_higher_better (bool) – Whether a higher eval result is better, e.g. AUC is is_higher_better.
For multi-class task, the preds are grouped by class_id first, then by row_id. To get the prediction for the i-th row in the j-th class, use preds[j * num_data + i].
- Returns
result – List with evaluation results.
- Return type
list
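A minimal sketch of a function with the feval shape described above. LabelHolder is a hypothetical stand-in for Dataset so that the sketch runs without LightGBM; it assumes the real Dataset exposes labels through get_label(). The helper multiclass_pred is likewise hypothetical and only illustrates the preds[j * num_data + i] layout.

```python
class LabelHolder:
    """Hypothetical stand-in for a Dataset; only get_label() is mimicked."""

    def __init__(self, label):
        self._label = list(label)

    def get_label(self):
        return self._label


def mean_abs_error(preds, eval_data):
    """Custom metric returning (eval_name, eval_result, is_higher_better)."""
    labels = eval_data.get_label()
    total = sum(abs(p - y) for p, y in zip(preds, labels))
    # Lower error is better, so is_higher_better is False.
    return "mean_abs_error", total / len(labels), False


def multiclass_pred(preds, num_data, row_id, class_id):
    """Pick one value from preds grouped by class_id first, then by row_id."""
    return preds[class_id * num_data + row_id]
```

With a real Booster, such a function would be passed as feval, for example booster.eval(data, name, feval=mean_abs_error).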
- eval_train(feval=None)[source]
Evaluate for training data.
- Parameters
feval (callable or None, optional (default=None)) –
Customized evaluation function. Should accept two parameters: preds, train_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.
- preds (list or numpy 1-D array) – The predicted values. If fobj is specified, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.
- train_data (Dataset) – The training dataset.
- eval_name (str) – The name of evaluation function (without whitespace).
- eval_result (float) – The eval result.
- is_higher_better (bool) – Whether a higher eval result is better, e.g. AUC is is_higher_better.
For multi-class task, the preds are grouped by class_id first, then by row_id. To get the prediction for the i-th row in the j-th class, use preds[j * num_data + i].
- Returns
result – List with evaluation results.
- Return type
list
- eval_valid(feval=None)[source]
Evaluate for validation data.
- Parameters
feval (callable or None, optional (default=None)) –
Customized evaluation function. Should accept two parameters: preds, valid_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.
- preds (list or numpy 1-D array) – The predicted values. If fobj is specified, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.
- valid_data (Dataset) – The validation dataset.
- eval_name (str) – The name of evaluation function (without whitespace).
- eval_result (float) – The eval result.
- is_higher_better (bool) – Whether a higher eval result is better, e.g. AUC is is_higher_better.
For multi-class task, the preds are grouped by class_id first, then by row_id. To get the prediction for the i-th row in the j-th class, use preds[j * num_data + i].
- Returns
result – List with evaluation results.
- Return type
list
- feature_importance(importance_type='split', iteration=None)[source]
Get feature importances.
- Parameters
importance_type (str, optional (default="split")) – How the importance is calculated. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.
iteration (int or None, optional (default=None)) – Limit number of iterations in the feature importance calculation. If None, if the best iteration exists, it is used; otherwise, all trees are used. If <= 0, all trees are used (no limits).
- Returns
result – Array with feature importances.
- Return type
numpy array
- feature_name()[source]
Get names of features.
- Returns
result – List with names of features.
- Return type
list
- free_dataset()[source]
Free Booster’s Datasets.
- Returns
self – Booster without Datasets.
- Return type
Booster
- free_network()[source]
Free Booster’s network.
- Returns
self – Booster with freed network.
- Return type
Booster
- get_leaf_output(tree_id, leaf_id)[source]
Get the output of a leaf.
- Parameters
tree_id (int) – The index of the tree.
leaf_id (int) – The index of the leaf in the tree.
- Returns
result – The output of the leaf.
- Return type
float
- get_split_value_histogram(feature, bins=None, xgboost_style=False)[source]
Get split value histogram for the specified feature.
- Parameters
feature (int or str) –
The feature name or index the histogram is calculated for. If int, interpreted as index. If str, interpreted as name.
Warning
Categorical features are not supported.
bins (int, str or None, optional (default=None)) – The maximum number of bins. If None, or int and > number of unique split values and xgboost_style=True, the number of bins equals number of unique split values. If str, it should be one from the list of the supported values by numpy.histogram() function.
xgboost_style (bool, optional (default=False)) – Whether the returned result should be in the same form as it is in XGBoost. If False, the returned value is tuple of 2 numpy arrays as it is in numpy.histogram() function. If True, the returned value is matrix, in which the first column is the right edges of non-empty bins and the second one is the histogram values.
- Returns
result_tuple (tuple of 2 numpy arrays) – If xgboost_style=False, the values of the histogram of used splitting values for the specified feature and the bin edges.
result_array_like (numpy array or pandas DataFrame (if pandas is installed)) – If xgboost_style=True, the histogram of used splitting values for the specified feature.
- lower_bound()[source]
Get lower bound value of a model.
- Returns
lower_bound – Lower bound value of the model.
- Return type
double
- model_from_string(model_str, verbose='warn')[source]
Load Booster from a string.
- Parameters
model_str (str) – Model will be loaded from this string.
verbose (bool, optional (default=True)) – Whether to print messages while loading model.
- Returns
self – Loaded Booster object.
- Return type
Booster
- model_to_string(num_iteration=None, start_iteration=0, importance_type='split')[source]
Save Booster to string.
- Parameters
num_iteration (int or None, optional (default=None)) – Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.
start_iteration (int, optional (default=0)) – Start index of the iteration that should be saved.
importance_type (str, optional (default="split")) – What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.
- Returns
str_repr – String representation of Booster.
- Return type
str
- num_feature()[source]
Get number of features.
- Returns
num_feature – The number of features.
- Return type
int
- num_model_per_iteration()[source]
Get number of models per iteration.
- Returns
model_per_iter – The number of models per iteration.
- Return type
int
- num_trees()[source]
Get number of weak sub-models.
- Returns
num_trees – The number of weak sub-models.
- Return type
int
- predict(data, start_iteration=0, num_iteration=None, raw_score=False, pred_leaf=False, pred_contrib=False, data_has_header=False, is_reshape=True, **kwargs)[source]
Make a prediction.
- Parameters
data (str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable's Frame or scipy.sparse) – Data source for prediction. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM).
start_iteration (int, optional (default=0)) – Start index of the iteration to predict. If <= 0, starts from the first iteration.
num_iteration (int or None, optional (default=None)) – Total number of iterations used in the prediction. If None, if the best iteration exists and start_iteration <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used (no limits). If <= 0, all iterations from start_iteration are used (no limits).
raw_score (bool, optional (default=False)) – Whether to predict raw scores.
pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
pred_contrib (bool, optional (default=False)) –
Whether to predict feature contributions.
Note
If you want to get more explanations for your model’s predictions using SHAP values, like SHAP interaction values, you can install the shap package (https://github.com/slundberg/shap). Note that unlike the shap package, with pred_contrib we return a matrix with an extra column, where the last column is the expected value.
data_has_header (bool, optional (default=False)) – Whether the data has header. Used only if data is str.
is_reshape (bool, optional (default=True)) – If True, result is reshaped to [nrow, ncol].
**kwargs – Other parameters for the prediction.
- Returns
result – Prediction result. Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when pred_contrib=True).
- Return type
numpy array, scipy.sparse or list of scipy.sparse
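A minimal sketch of how one row of a pred_contrib=True result might be read, given that the last column is the expected value. The numbers are made up for illustration, and split_contrib_row is a hypothetical helper. That the contributions plus the expected value add up to the raw score is an assumption based on how SHAP values are defined, not something stated above.

```python
def split_contrib_row(row):
    """Split one pred_contrib row into (feature contributions, expected value)."""
    # Every column except the last belongs to one feature;
    # the last column is the expected value.
    return list(row[:-1]), row[-1]


# Made-up row for a model with three features.
row = [0.25, -0.5, 0.125, 1.0]
contributions, expected_value = split_contrib_row(row)

# Assumption: contributions plus expected value reproduce the raw score.
raw_score = sum(contributions) + expected_value
```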
- refit(data, label, decay_rate=0.9, **kwargs)[source]
Refit the existing Booster by new data.
- Parameters
data (str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable's Frame or scipy.sparse) – Data source for refit. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM).
label (list, numpy 1-D array or pandas Series / one-column DataFrame) – Label for refit.
decay_rate (float, optional (default=0.9)) – Decay rate of refit, will use leaf_output = decay_rate * old_leaf_output + (1.0 - decay_rate) * new_leaf_output to refit trees.
**kwargs – Other parameters for refit. These parameters will be passed to predict method.
- Returns
result – Refitted Booster.
- Return type
Booster
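The decay_rate formula of refit can be worked through for a single leaf. This is a sketch of the arithmetic only; blend_leaf_output is a hypothetical helper name, not part of the library.

```python
def blend_leaf_output(old_leaf_output, new_leaf_output, decay_rate=0.9):
    """Apply leaf_output = decay_rate * old + (1.0 - decay_rate) * new."""
    return decay_rate * old_leaf_output + (1.0 - decay_rate) * new_leaf_output


# decay_rate=1.0 keeps the old leaf output, decay_rate=0.0 takes the new one.
kept = blend_leaf_output(1.0, 2.0, decay_rate=1.0)
replaced = blend_leaf_output(1.0, 2.0, decay_rate=0.0)
halfway = blend_leaf_output(1.0, 2.0, decay_rate=0.5)
```

With the default decay_rate=0.9, most of the old leaf output is retained and the new data only nudges it.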
- reset_parameter(params)[source]
Reset parameters of Booster.
- Parameters
params (dict) – New parameters for Booster.
- Returns
self – Booster with new parameters.
- Return type
Booster
- rollback_one_iter()[source]
Rollback one iteration.
- Returns
self – Booster with one iteration rolled back.
- Return type
Booster
- save_model(filename, num_iteration=None, start_iteration=0, importance_type='split')[source]
Save Booster to file.
- Parameters
filename (str or pathlib.Path) – Filename to save Booster.
num_iteration (int or None, optional (default=None)) – Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.
start_iteration (int, optional (default=0)) – Start index of the iteration that should be saved.
importance_type (str, optional (default="split")) – What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.
- Returns
self – Returns self.
- Return type
Booster
- set_attr(**kwargs)[source]
Set attributes to the Booster.
- Parameters
**kwargs – The attributes to set. Setting a value to None deletes an attribute.
- Returns
self – Booster with set attributes.
- Return type
Booster
- set_network(machines, local_listen_port=12400, listen_time_out=120, num_machines=1)[source]
Set the network configuration.
- Parameters
machines (list, set or str) – Names of machines.
local_listen_port (int, optional (default=12400)) – TCP listen port for local machines.
listen_time_out (int, optional (default=120)) – Socket time-out in minutes.
num_machines (int, optional (default=1)) – The number of machines for distributed learning application.
- Returns
self – Booster with set network.
- Return type
Booster
- set_train_data_name(name)[source]
Set the name to the training Dataset.
- Parameters
name (str) – Name for the training Dataset.
- Returns
self – Booster with set training Dataset name.
- Return type
Booster
- shuffle_models(start_iteration=0, end_iteration=-1)[source]
Shuffle models.
- Parameters
start_iteration (int, optional (default=0)) – The first iteration that will be shuffled.
end_iteration (int, optional (default=-1)) – The last iteration that will be shuffled. If <= 0, means the last available iteration.
- Returns
self – Booster with shuffled models.
- Return type
Booster
- trees_to_dataframe()[source]
Parse the fitted model and return in an easy-to-read pandas DataFrame.
The returned DataFrame has the following columns.
tree_index: int64, which tree a node belongs to. 0-based, so a value of 6, for example, means “this node is in the 7th tree”.
node_depth: int64, how far a node is from the root of the tree. The root node has a value of 1, its direct children are 2, etc.
node_index: str, unique identifier for a node.
left_child: str, node_index of the child node to the left of a split. None for leaf nodes.
right_child: str, node_index of the child node to the right of a split. None for leaf nodes.
parent_index: str, node_index of this node’s parent. None for the root node.
split_feature: str, name of the feature used for splitting. None for leaf nodes.
split_gain: float64, gain from adding this split to the tree. NaN for leaf nodes.
threshold: float64, value of the feature used to decide which side of the split a record will go down. NaN for leaf nodes.
decision_type: str, logical operator describing how to compare a value to threshold. For example, split_feature = "Column_10", threshold = 15, decision_type = "<=" means that records where Column_10 <= 15 follow the left side of the split, while all other records follow the right side. None for leaf nodes.
missing_direction: str, split direction that missing values should go to. None for leaf nodes.
missing_type: str, describes what types of values are treated as missing.
value: float64, predicted value for this leaf node, multiplied by the learning rate.
weight: float64 or int64, sum of hessian (second-order derivative of objective), summed over observations that fall in this node.
count: int64, number of records in the training data that fall into this node.
- Returns
result – Returns a pandas DataFrame of the parsed model.
- Return type
pandas DataFrame
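A minimal sketch of how the columns above fit together when routing one record through one tree. Plain dicts stand in for DataFrame rows so the sketch runs without pandas. The node_index strings and values are made up, leaf_value_for is a hypothetical helper, and only the "<=" decision type is handled (missing values are ignored).

```python
# Made-up rows for a single tree with one split and two leaves.
rows = [
    {"node_index": "0-S0", "left_child": "0-L0", "right_child": "0-L1",
     "split_feature": "Column_10", "threshold": 15.0, "decision_type": "<="},
    {"node_index": "0-L0", "left_child": None, "right_child": None,
     "value": -0.2},
    {"node_index": "0-L1", "left_child": None, "right_child": None,
     "value": 0.3},
]


def leaf_value_for(rows, record, root_index):
    """Follow left_child / right_child from the root until a leaf is reached."""
    nodes = {row["node_index"]: row for row in rows}
    node = nodes[root_index]
    while node["left_child"] is not None:  # leaf nodes have no children
        if node["decision_type"] != "<=":
            raise ValueError("this sketch only handles the '<=' decision type")
        go_left = record[node["split_feature"]] <= node["threshold"]
        node = nodes[node["left_child"] if go_left else node["right_child"]]
    return node["value"]
```

For example, a record with Column_10 equal to 15 follows the left side of the split, and a record with Column_10 equal to 16 follows the right side.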
- update(train_set=None, fobj=None)[source]
Update Booster for one iteration.
- Parameters
train_set (Dataset or None, optional (default=None)) – Training data. If None, last training data is used.
fobj (callable or None, optional (default=None)) –
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- preds (list or numpy 1-D array) – The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.
- train_data (Dataset) – The training dataset.
- grad (list or numpy 1-D array) – The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point.
- hess (list or numpy 1-D array) – The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point.
For multi-class task, the preds are grouped by class_id first, then by row_id. To get the prediction for the i-th row in the j-th class, use preds[j * num_data + i], and group grad and hess in the same way.
- Returns
is_finished – Whether the update was successfully finished.
- Return type
bool
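A minimal sketch of a function with the fobj shape described above, using binary log loss as the example objective. TrainLabels is a hypothetical stand-in for Dataset so that the sketch runs without LightGBM; it assumes the real Dataset exposes labels through get_label(). Because preds are raw margins, the sigmoid is applied inside the function.

```python
import math


class TrainLabels:
    """Hypothetical stand-in for a Dataset; only get_label() is mimicked."""

    def __init__(self, label):
        self._label = list(label)

    def get_label(self):
        return self._label


def binary_logloss_objective(preds, train_data):
    """Return (grad, hess) of binary log loss with respect to raw margins."""
    labels = train_data.get_label()
    # preds are raw margins, so convert them to probabilities first.
    probs = [1.0 / (1.0 + math.exp(-p)) for p in preds]
    grad = [p - y for p, y in zip(probs, labels)]  # first-order derivative
    hess = [p * (1.0 - p) for p in probs]          # second-order derivative
    return grad, hess
```

With a real Booster, such a function would be passed as fobj, for example booster.update(fobj=binary_logloss_objective).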