Feature filtering using a `mlr3filters::Filter` object, see the mlr3filters package.

If a `Filter` can only operate on a subset of columns based on column type, then only these features are considered and filtered. `nfeat` and `frac` count only the features of the types that the `Filter` can operate on; this means, e.g., that setting `nfeat` to 0 will only remove features of the types that the `Filter` can work with.
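The type restriction described above can be illustrated with a short sketch. It assumes that the `variance` filter from mlr3filters only operates on integer and numeric features, and that the `german_credit` task shipped with mlr3 contains factor features in addition to integer ones; both are assumptions about those packages, not statements from this page.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

# Assumption: "german_credit" mixes integer and factor features
task = tsk("german_credit")

# Keep only the single best feature among those the Filter can score
pof = po("filter", filter = flt("variance"), filter.nfeat = 1)
filtered = pof$train(list(task))[[1]]

# Features of types the Filter supports vs. all other features
ft = task$feature_types
operable = ft$id[ft$type %in% pof$filter$feature_types]
other = setdiff(task$feature_names, operable)

# Number of supported-type features that survive (nfeat applies to these only)
sum(filtered$feature_names %in% operable)

# Features of unsupported types are passed through untouched
all(other %in% filtered$feature_names)
```

Under these assumptions, `filter.nfeat = 1` limits only the features the `Filter` can score, while every other feature remains in the task.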

`R6Class` object inheriting from `PipeOpTaskPreprocSimple`/`PipeOpTaskPreproc`/`PipeOp`.

`PipeOpFilter$new(filter, id = filter$id, param_vals = list())`

- `filter` :: `Filter`
  `Filter` used for feature filtering. This argument is always cloned; to access the `Filter` inside `PipeOpFilter` by-reference, use `$filter`.
- `id` :: `character(1)`
  Identifier of the resulting object, defaulting to the `id` of the `Filter` being used.
- `param_vals` :: named `list`
  List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default `list()`.
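A minimal sketch of the construction arguments follows, using the `variance` filter as an arbitrary choice. It shows the default `id` and checks that the `Filter` held by the `PipeOpFilter` is a clone rather than the object that was passed in.

```r
library("mlr3filters")
library("mlr3pipelines")

f = flt("variance")

# param_vals sets hyperparameters at construction time
pof = PipeOpFilter$new(f, param_vals = list(filter.nfeat = 2))

# The id defaults to the id of the Filter
pof$id

# The PipeOp works on its own copy; the original object is not shared
identical(pof$filter, f)
```

Because the argument is cloned, later changes to `f` are not expected to affect `pof`; the copy inside the PipeOp is reached through `pof$filter`.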

Input and output channels are inherited from `PipeOpTaskPreproc`.

The output is the input `Task` with features removed that were filtered out.

The `$state` is a named `list` with the `$state` elements inherited from `PipeOpTaskPreproc`, as well as:

- `scores` :: named `numeric`
  Scores calculated for all features of the training `Task`, which are used as cutoff for feature filtering. If `frac` or `nfeat` is given, the underlying `Filter` may choose not to calculate scores for all features that are given. This only includes features on which the `Filter` can operate; e.g. if the `Filter` can only operate on numeric features, then scores for factor features will not be given.
- `features` :: `character`
  Names of features that are being kept. Features of types that the `Filter` cannot operate on are always kept.
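As a small sketch of how the two state elements relate, the following trains a `PipeOpFilter` with a score cutoff on the `iris` task and then reads `scores` and `features` from `$state`. The `variance` filter and the cutoff value of 1 are arbitrary choices for illustration.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

# Keep features whose variance is at least 1
pof = po("filter", filter = flt("variance"), filter.cutoff = 1)
filtered = pof$train(list(tsk("iris")))[[1]]

# Named numeric vector of filter scores from training
pof$state$scores

# Names of the features that were kept
pof$state$features

# The kept features are exactly those remaining in the output Task
setequal(pof$state$features, filtered$feature_names)
```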

The parameters are the parameters inherited from `PipeOpTaskPreproc`, as well as the parameters of the `Filter` used by this object. In addition, the following parameters are introduced:

- `filter.nfeat` :: `numeric(1)`
  Number of features to select. Mutually exclusive with `frac`, `cutoff`, and `permuted`.
- `filter.frac` :: `numeric(1)`
  Fraction of features to keep. Mutually exclusive with `nfeat`, `cutoff`, and `permuted`.
- `filter.cutoff` :: `numeric(1)`
  Minimum value of filter heuristic for which to keep features. Mutually exclusive with `nfeat`, `frac`, and `permuted`.
- `filter.permuted` :: `integer(1)`
  If this parameter is set, a random permutation of each feature is added to the task before applying the filter. All features selected before the `permuted`-th permuted feature is selected are kept. This is similar to the approach in Wu (2007) and Thomas (2017). Mutually exclusive with `nfeat`, `frac`, and `cutoff`.

Note that at least one of `filter.nfeat`, `filter.frac`, `filter.cutoff`, and `filter.permuted` must be given; since they are mutually exclusive, this means exactly one.
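The selection rule behind `filter.permuted` can be sketched in base R. This is an illustration of the rule as described above, not the package's implementation: the scores, the `_perm` naming scheme, and the helper `select_before_permuted()` are all hypothetical.

```r
# Hypothetical scores for real features and their permuted copies
scores = c(x1 = 0.9, x2 = 0.7, x1_perm = 0.5, x3 = 0.4,
           x2_perm = 0.3, x4 = 0.2, x3_perm = 0.1)
# Assumption for this sketch: permuted copies carry a "_perm" suffix
is_permuted = grepl("_perm$", names(scores))

# Keep all real features ranked before the `permuted`-th permuted feature
select_before_permuted = function(scores, is_permuted, permuted) {
  ord = order(scores, decreasing = TRUE)
  ranked_perm = is_permuted[ord]
  # rank position at which the `permuted`-th permuted feature appears
  hits = which(ranked_perm & cumsum(ranked_perm) == permuted)
  stop_at = if (length(hits)) hits[1] else length(ord) + 1
  keep = ord[seq_len(stop_at - 1)]
  # drop the permuted copies themselves from the result
  names(scores)[keep][!is_permuted[keep]]
}

select_before_permuted(scores, is_permuted, permuted = 1)
#> [1] "x1" "x2"
select_before_permuted(scores, is_permuted, permuted = 2)
#> [1] "x1" "x2" "x3"
```

A larger `permuted` value is more permissive, since more permuted features must outrank a real feature before selection stops.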

This does *not* use the `$.select_cols` feature of `PipeOpTaskPreproc` to select only features compatible with the `Filter`; instead the whole `Task` is used by `private$.get_state()` and subset internally.

Fields inherited from `PipeOpTaskPreproc`, as well as:

- `filter` :: `Filter`
  `Filter` that is being used for feature filtering. Do *not* use this slot to get to the feature filtering scores after training; instead, use `$state$scores`. Read-only.

Methods inherited from `PipeOpTaskPreprocSimple`/`PipeOpTaskPreproc`/`PipeOp`.

Wu Y, Boos DD, Stefanski LA (2007). “Controlling Variable Selection by the Addition of Pseudovariables.” *Journal of the American Statistical Association*, **102**(477), 235–243. doi: 10.1198/016214506000000843.

Thomas J, Hepp T, Mayr A, Bischl B (2017). “Probing for Sparse and Fast Variable Selection with Model-Based Boosting.” *Computational and Mathematical Methods in Medicine*, **2017**, 1–8. doi: 10.1155/2017/1421409.

https://mlr3book.mlr-org.com/list-pipeops.html

Other PipeOps: `PipeOpEnsemble`, `PipeOpImpute`, `PipeOpTargetTrafo`, `PipeOpTaskPreprocSimple`, `PipeOpTaskPreproc`, `PipeOp`, `mlr_pipeops_boxcox`, `mlr_pipeops_branch`, `mlr_pipeops_chunk`, `mlr_pipeops_classbalancing`, `mlr_pipeops_classifavg`, `mlr_pipeops_classweights`, `mlr_pipeops_colapply`, `mlr_pipeops_collapsefactors`, `mlr_pipeops_colroles`, `mlr_pipeops_copy`, `mlr_pipeops_datefeatures`, `mlr_pipeops_encodeimpact`, `mlr_pipeops_encodelmer`, `mlr_pipeops_encode`, `mlr_pipeops_featureunion`, `mlr_pipeops_fixfactors`, `mlr_pipeops_histbin`, `mlr_pipeops_ica`, `mlr_pipeops_imputeconstant`, `mlr_pipeops_imputehist`, `mlr_pipeops_imputelearner`, `mlr_pipeops_imputemean`, `mlr_pipeops_imputemedian`, `mlr_pipeops_imputemode`, `mlr_pipeops_imputeoor`, `mlr_pipeops_imputesample`, `mlr_pipeops_kernelpca`, `mlr_pipeops_learner`, `mlr_pipeops_missind`, `mlr_pipeops_modelmatrix`, `mlr_pipeops_multiplicityexply`, `mlr_pipeops_multiplicityimply`, `mlr_pipeops_mutate`, `mlr_pipeops_nmf`, `mlr_pipeops_nop`, `mlr_pipeops_ovrsplit`, `mlr_pipeops_ovrunite`, `mlr_pipeops_pca`, `mlr_pipeops_proxy`, `mlr_pipeops_quantilebin`, `mlr_pipeops_randomprojection`, `mlr_pipeops_randomresponse`, `mlr_pipeops_regravg`, `mlr_pipeops_removeconstants`, `mlr_pipeops_renamecolumns`, `mlr_pipeops_replicate`, `mlr_pipeops_scalemaxabs`, `mlr_pipeops_scalerange`, `mlr_pipeops_scale`, `mlr_pipeops_select`, `mlr_pipeops_smote`, `mlr_pipeops_spatialsign`, `mlr_pipeops_subsample`, `mlr_pipeops_targetinvert`, `mlr_pipeops_targetmutate`, `mlr_pipeops_targettrafoscalerange`, `mlr_pipeops_textvectorizer`, `mlr_pipeops_threshold`, `mlr_pipeops_tunethreshold`, `mlr_pipeops_unbranch`, `mlr_pipeops_updatetarget`, `mlr_pipeops_vtreat`, `mlr_pipeops_yeojohnson`, `mlr_pipeops`

library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

# setup PipeOpFilter to keep the 5 most important
# features of the spam task w.r.t. their AUC
task = tsk("spam")
filter = flt("auc")
po = po("filter", filter = filter)
po$param_set
#> <ParamSetCollection:auc>
#>                 id    class lower upper nlevels        default value
#> 1:    filter.nfeat ParamInt     0   Inf     Inf <NoDefault[3]>
#> 2:     filter.frac ParamDbl     0     1     Inf <NoDefault[3]>
#> 3:   filter.cutoff ParamDbl  -Inf   Inf     Inf <NoDefault[3]>
#> 4: filter.permuted ParamInt     1   Inf     Inf <NoDefault[3]>
#> 5:  affect_columns ParamUty    NA    NA     Inf  <Selector[1]>
po$param_set$values$filter.nfeat = 5

# filter the task
filtered_task = po$train(list(task))[[1]]

# filtered task + extracted AUC scores
filtered_task$feature_names
#> [1] "capitalAve"      "capitalLong"     "charDollar"      "charExclamation"
#> [5] "your"
head(po$state$scores, 10)
#> charExclamation     capitalLong      capitalAve            your      charDollar
#>       0.3290461       0.3041626       0.2882004       0.2801659       0.2721394
#>    capitalTotal            free             our             you          remove
#>       0.2622801       0.2327285       0.2109325       0.2104681       0.2031303

# feature selection embedded in a holdout resampling
# keep 50% of features based on their AUC score
task = tsk("spam")
gr = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
learner = GraphLearner$new(gr)
rr = resample(task, learner, rsmp("holdout"), store_models = TRUE)
rr$learners[[1]]$model$auc$scores
#>   charExclamation       capitalLong        capitalAve              your
#>      3.290018e-01      3.084719e-01      2.924356e-01      2.850997e-01
#>        charDollar      capitalTotal              free               you
#>      2.760477e-01      2.690304e-01      2.328002e-01      2.133331e-01
#>               our            remove             money               all
#>      2.127344e-01      2.049659e-01      1.848303e-01      1.800999e-01
#>                hp            num000          business              over
#>      1.768315e-01      1.592152e-01      1.529875e-01      1.490547e-01
#>              mail          internet               hpl            george
#>      1.395390e-01      1.362281e-01      1.362075e-01      1.341867e-01
#>             email           receive           address             order
#>      1.316039e-01      1.303801e-01      1.246968e-01      1.142778e-01
#>              make           num1999          charHash            credit
#>      1.090133e-01      1.049933e-01      1.024926e-01      9.926152e-02
#>              will            people              labs         addresses
#>      9.423281e-02      9.040350e-02      7.689188e-02      7.541491e-02
#>            num650             num85               edu               lab
#>      6.979414e-02      6.939648e-02      6.787860e-02      6.004967e-02
#>        technology            telnet           meeting              data
#>      5.498094e-02      5.137943e-02      4.946566e-02      4.597672e-02
#>                pm            report           project            num857
#>      3.984151e-02      3.941819e-02      3.742082e-02      3.490039e-02
#> charSquarebracket            num415          original        conference
#>      3.485239e-02      3.285303e-02      2.864972e-02      2.808021e-02
#>                cs                re              font     charSemicolon
#>      2.658932e-02      2.658113e-02      2.309021e-02      2.247249e-02
#>  charRoundbracket            direct             num3d             table
#>      1.810618e-02      1.206585e-02      9.208792e-03      2.783626e-03
#>             parts
#>      5.883081e-05