Main CV logic for LightGBM — lgb.cv • lightgbm

Cross validation logic used by LightGBM

lgb.cv(
  params = list(),
  data,
  nrounds = 100L,
  nfold = 3L,
  label = NULL,
  weight = NULL,
  obj = NULL,
  eval = NULL,
  verbose = 1L,
  record = TRUE,
  eval_freq = 1L,
  showsd = TRUE,
  stratified = TRUE,
  folds = NULL,
  init_model = NULL,
  colnames = NULL,
  categorical_feature = NULL,
  early_stopping_rounds = NULL,
  callbacks = list(),
  reset_data = FALSE,
  serializable = TRUE,
  eval_train_metric = FALSE
)

Arguments

params	a list of parameters. See the "Parameters" section of the documentation for a list of parameters and valid values.
data	a `lgb.Dataset` object, used for training. Some functions, such as `lgb.cv`, may allow you to pass other types of data like `matrix` and then separately supply `label` as a keyword argument.
nrounds	number of training rounds
nfold	the original dataset is randomly partitioned into `nfold` equal size subsamples.
label	Vector of labels, used if `data` is not an `lgb.Dataset`
weight	vector of response values. If not NULL, will set to dataset
obj	objective function, can be character or custom objective function. Examples include `regression`, `regression_l1`, `huber`, `binary`, `lambdarank`, `multiclass`, `multiclass`
eval	evaluation function(s). This can be a character vector, function, or list with a mixture of strings and functions. a. character vector: If you provide a character vector to this argument, it should contain strings with valid evaluation metrics. See The "metric" section of the documentation for a list of valid metrics. b. function: You can provide a custom evaluation function. This should accept the keyword arguments `preds` and `dtrain` and should return a named list with three elements: `name`: A string with the name of the metric, used for printing and storing results. `value`: A single number indicating the value of the metric for the given predictions and true values `higher_better`: A boolean indicating whether higher values indicate a better fit. For example, this would be `FALSE` for metrics like MAE or RMSE. c. list: If a list is given, it should only contain character vectors and functions. These should follow the requirements from the descriptions above.
verbose	verbosity for output, if <= 0 and `valids` has been provided, also will disable the printing of evaluation during training
record	Boolean, TRUE will record iteration message to `booster$record_evals`
eval_freq	evaluation output frequency, only effective when verbose > 0 and `valids` has been provided
showsd	`boolean`, whether to show standard deviation of cross validation. This parameter defaults to `TRUE`. Setting it to `FALSE` can lead to a slight speedup by avoiding unnecessary computation.
stratified	a `boolean` indicating whether sampling of folds should be stratified by the values of outcome labels.
folds	`list` provides a possibility to use a list of pre-defined CV folds (each element must be a vector of test fold's indices). When folds are supplied, the `nfold` and `stratified` parameters are ignored.
init_model	path of model file or `lgb.Booster` object, will continue training from this model
colnames	feature names, if not null, will use this to overwrite the names in dataset
categorical_feature	categorical features. This can either be a character vector of feature names or an integer vector with the indices of the features (e.g. `c(1L, 10L)` to say "the first and tenth columns").
early_stopping_rounds	int. Activates early stopping. When this parameter is non-null, training will stop if the evaluation of any metric on any validation set fails to improve for `early_stopping_rounds` consecutive boosting rounds. If training stops early, the returned model will have attribute `best_iter` set to the iteration number of the best iteration.
callbacks	List of callback functions that are applied at each iteration.
reset_data	Boolean, setting it to TRUE (not the default value) will transform the booster model into a predictor model which frees up memory and the original datasets
serializable	whether to make the resulting objects serializable through functions such as `save` or `saveRDS` (see section "Model serialization").
eval_train_metric	`boolean`, whether to add the cross validation results on the training data. This parameter defaults to `FALSE`. Setting it to `TRUE` will increase run time.

Value

a trained model lgb.CVBooster.

Early Stopping

"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.

If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping.

If you want to only consider the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).

Examples

# \donttest{
setLGBMthreads(2L)
data.table::setDTthreads(1L)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(
  objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = 1.0
  , num_threads = 2L
)
model <- lgb.cv(
  params = params
  , data = dtrain
  , nrounds = 5L
  , nfold = 3L
)
#> [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000608 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 232
#> [LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 116
#> [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000593 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 232
#> [LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 116
#> [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000593 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 232
#> [LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 116
#> [LightGBM] [Info] Start training from score 0.474436
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Info] Start training from score 0.490557
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Info] Start training from score 0.481345
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [1]:  valid's l2:0.000307078+0.000434274 
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [2]:  valid's l2:0.000307078+0.000434274 
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [3]:  valid's l2:0.000307078+0.000434274 
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [4]:  valid's l2:0.000307078+0.000434274 
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [5]:  valid's l2:0.000307078+0.000434274 
# }