Logic to train with LightGBM

lgb.train(
  params = list(),
  data,
  nrounds = 100L,
  valids = list(),
  obj = NULL,
  eval = NULL,
  verbose = 1L,
  record = TRUE,
  eval_freq = 1L,
  init_model = NULL,
  colnames = NULL,
  categorical_feature = NULL,
  early_stopping_rounds = NULL,
  callbacks = list(),
  reset_data = FALSE,
  serializable = TRUE
)

Arguments

params

a list of parameters. See the "Parameters" section of the documentation for a list of parameters and valid values.

data

an lgb.Dataset object, used for training. Some functions, such as lgb.cv, may allow you to pass other types of data, such as a matrix, and then separately supply label as a keyword argument.

nrounds

number of training rounds

valids

a list of lgb.Dataset objects, used for validation

obj

objective function. This can be a character string naming a built-in objective or a custom objective function. Examples of built-in objectives include regression, regression_l1, huber, binary, lambdarank, and multiclass.
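
As an illustration of the custom form, below is a minimal sketch of a binary logistic objective. It assumes get_field() is available for reading the label from the lgb.Dataset (older releases exposed getinfo() instead); the function must return the gradient and Hessian of the loss with respect to the raw scores in preds.

# Sketch: custom binary logistic objective
logregobj <- function(preds, dtrain) {
  labels <- get_field(dtrain, "label")
  probs <- 1.0 / (1.0 + exp(-preds))  # raw scores -> probabilities
  list(
    grad = probs - labels             # first derivative of the log loss
    , hess = probs * (1.0 - probs)    # second derivative of the log loss
  )
}
# then e.g.: lgb.train(params = params, data = dtrain, nrounds = 10L, obj = logregobj)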

eval

evaluation function(s). This can be a character vector, function, or list with a mixture of strings and functions.

  • a. character vector: If you provide a character vector to this argument, it should contain strings with valid evaluation metrics. See the "metric" section of the documentation for a list of valid metrics.

  • b. function: You can provide a custom evaluation function. This should accept the keyword arguments preds and dtrain and should return a named list with three elements (a sketch follows this list):

    • name: A string with the name of the metric, used for printing and storing results.

    • value: A single number indicating the value of the metric for the given predictions and true values

    • higher_better: A boolean indicating whether higher values indicate a better fit. For example, this would be FALSE for metrics like MAE or RMSE.

  • c. list: If a list is given, it should only contain character vectors and functions. These should follow the requirements from the descriptions above.
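
As a sketch of the custom-metric form described in (b), the following computes mean absolute error. It assumes get_field() is available for reading the label from the lgb.Dataset (older releases exposed getinfo() instead).

# Sketch: custom evaluation metric (mean absolute error)
mae_metric <- function(preds, dtrain) {
  labels <- get_field(dtrain, "label")
  list(
    name = "mae"                          # used for printing and storing results
    , value = mean(abs(preds - labels))   # metric value for these predictions
    , higher_better = FALSE               # lower MAE indicates a better fit
  )
}
# then e.g.: lgb.train(params = params, data = dtrain, nrounds = 10L, valids = valids, eval = mae_metric)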

verbose

verbosity for output. If <= 0 and valids has been provided, printing of evaluation results during training is also disabled.

record

Boolean; if TRUE, iteration messages will be recorded to booster$record_evals

eval_freq

evaluation output frequency, only effective when verbose > 0 and valids has been provided

init_model

path to a model file or an lgb.Booster object; training will continue from this model
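
For example, boosting can be resumed from an earlier model. This is a sketch; model_1 stands in for a previously trained lgb.Booster (or the path to a file written with lgb.save()), and params and dtrain are assumed to exist.

# Sketch: continue training for 5 more rounds from an existing booster
model_2 <- lgb.train(
  params = params
  , data = dtrain
  , nrounds = 5L
  , init_model = model_1
)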

colnames

feature names, if not null, will use this to overwrite the names in dataset

categorical_feature

categorical features. This can either be a character vector of feature names or an integer vector with the indices of the features (e.g. c(1L, 10L) to say "the first and tenth columns").
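
Both forms are illustrated below (a sketch; the column names "city" and "region" are hypothetical and assumed to sit in positions 1 and 10 of the training data).

# Sketch: mark the same columns as categorical, by name or by index
model <- lgb.train(
  params = params
  , data = dtrain
  , nrounds = 10L
  , categorical_feature = c("city", "region")  # by feature name
)
# equivalently: categorical_feature = c(1L, 10L)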

early_stopping_rounds

int. Activates early stopping. When this parameter is non-null, training will stop if the evaluation of any metric on any validation set fails to improve for early_stopping_rounds consecutive boosting rounds. If training stops early, the returned model will have attribute best_iter set to the iteration number of the best iteration.
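
For instance, after a run that stopped early, the best iteration and its score can be read back from the booster (a sketch; model is assumed to come from a call like the one in "Examples").

model$best_iter   # iteration number of the best iteration
model$best_score  # metric value at that iteration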

callbacks

List of callback functions that are applied at each iteration.

reset_data

Boolean; setting it to TRUE (not the default value) will transform the booster model into a predictor model, which frees up memory and the original datasets

serializable

whether to make the resulting objects serializable through functions such as save or saveRDS (see section "Model serialization").
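
With serializable = TRUE, the fitted booster can be written to and restored from disk with base R serialization, for example (a sketch; the file name is arbitrary):

saveRDS(model, "lightgbm_model.rds")
model_restored <- readRDS("lightgbm_model.rds")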

Value

a trained booster model (an lgb.Booster object).

Early Stopping

"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.

If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping.

If you want to only consider the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).
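
A sketch of restricting early stopping to the first metric: both "l2" and "l1" are evaluated and recorded, but only "l2" is used to decide when to stop.

params <- list(
  objective = "regression"
  , metric = c("l2", "l1")       # "l2" is the first metric
  , first_metric_only = TRUE     # only "l2" is considered for early stopping
  , early_stopping_rounds = 3L
)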

Examples

# \donttest{
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
params <- list(
  objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = 1.0
)
valids <- list(test = dtest)
model <- lgb.train(
  params = params
  , data = dtrain
  , nrounds = 5L
  , valids = valids
  , early_stopping_rounds = 3L
)
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000945 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 232
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 116
#> [LightGBM] [Info] Start training from score 0.482113
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [1] "[1]:  test's l2:6.44165e-17"
#> [1] "Will train until there is no improvement in 3 rounds."
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [1] "[2]:  test's l2:1.97215e-31"
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [1] "[3]:  test's l2:0"
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
#> [1] "[4]:  test's l2:0"
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
#> [1] "[5]:  test's l2:0"
#> [1] "Did not meet early stopping, best iteration is: [3]:  test's l2:0"
# }