Demos and Resources


Feature Selection Algorithms - A Brief Guide

P. Somol, P. Pudil

(source codes available below)

   

(to be further extended)

It should be stressed that the following is only a very brief summary of our experience regarding the applicability of a limited number of sub-optimal feature selection methods. The whole family of existing methods (and methodologies) is constantly growing; for more information, please follow the references below.

Consider a traditional feature selection task: we have training data of C classes with approximately S samples per class, each sample represented by F numerical features, and we have some reasonable feature subset evaluation criterion. Suppose we do not face too many complications such as mixed feature types (numerical, nominal), extreme differences in class sizes, incomplete data, etc. Real-world problems usually involve many such complications, but in this explanation we only want to compare search strategies as such.

Be warned immediately against naive feature ordering according to individual criterion values. Very few real datasets have so simple a character that this would work. You need methods that reveal at least some inter-feature relations.
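A tiny C sketch may make the danger concrete. The criterion table below is purely hypothetical (invented for illustration, not taken from any dataset): features 1 and 2 are individually weak but complementary, so ranking by individual scores keeps the wrong pair.

```c
#include <assert.h>

/* Hypothetical criterion over subsets of 3 features, encoded as bitmasks.
 * Features 1 and 2 are individually weak but complement each other. */
static double criterion(unsigned mask)
{
    switch (mask) {
    case 0x1: return 0.50;  /* {0}   */
    case 0x2: return 0.30;  /* {1}   */
    case 0x4: return 0.35;  /* {2}   */
    case 0x3: return 0.55;  /* {0,1} */
    case 0x5: return 0.60;  /* {0,2} */
    case 0x6: return 0.90;  /* {1,2} */
    default:  return 0.00;
    }
}

/* Naive ranking: keep the two features with the best individual scores. */
unsigned rank_best_pair(void)
{
    double s[3];
    int worst = 0;
    for (int f = 0; f < 3; ++f) s[f] = criterion(1u << f);
    for (int f = 1; f < 3; ++f) if (s[f] < s[worst]) worst = f;
    return 0x7u & ~(1u << worst);
}

/* Exhaustive search over all pairs, for comparison. */
unsigned true_best_pair(void)
{
    unsigned best = 0;
    double bestv = -1.0;
    for (unsigned m = 1; m < 8; ++m) {
        int bits = ((m >> 0) & 1) + ((m >> 1) & 1) + ((m >> 2) & 1);
        if (bits == 2 && criterion(m) > bestv) { bestv = criterion(m); best = m; }
    }
    return best;
}
```

Individual ranking keeps features {0,2} (mask 0x5, value 0.60), while the truly best pair is {1,2} (mask 0x6, value 0.90).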

The next simplest choice is sequential forward selection (SFS) or sequential backward selection (SBS), which have properties that make them the preferred choice of many practitioners: 1) they are extremely simple to implement, 2) they are very fast, consisting of only F core steps (although each step involves at most F criterion tests), 3) they give solutions for all subset sizes in one run. SFS is usually faster than SBS; SBS may better reflect feature relations in larger subsets.
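For illustration, the whole SFS loop fits in a few lines of C. The criterion below is a hypothetical stand-in (a weighted feature sum with a penalty for one redundant pair); subsets are encoded as bitmasks over F features:

```c
#include <assert.h>

#define F 4  /* total number of candidate features */

/* Hypothetical subset criterion: weighted sum with a redundancy
 * penalty when features 0 and 2 appear together. */
static double crit(unsigned mask)
{
    static const double w[F] = { 0.40, 0.30, 0.35, 0.20 };
    double v = 0.0;
    for (int f = 0; f < F; ++f)
        if (mask & (1u << f)) v += w[f];
    if ((mask & 0x5u) == 0x5u) v -= 0.30;
    return v;
}

/* Sequential forward selection: in each of d steps, add the feature
 * that yields the highest criterion value in the current context. */
unsigned sfs(int d)
{
    unsigned sel = 0;
    for (int step = 0; step < d; ++step) {
        int best = -1;
        double bestv = -1.0;
        for (int f = 0; f < F; ++f) {
            if (sel & (1u << f)) continue;
            double v = crit(sel | (1u << f));
            if (v > bestv) { bestv = v; best = f; }
        }
        sel |= 1u << best;  /* once added, never reconsidered */
    }
    return sel;
}
```

SBS is the mirror image: start from the full set and repeatedly remove the feature whose removal hurts the criterion least.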

The problem with SFS and SBS is their single-track search. They tend to "go straight to the closest local optimum" of the search space. Technically speaking, they do not address the feature nesting problem: once a feature is added/removed, it remains so forever, even if the context changes and the feature could become important/useless inside some newly assembled subset. On most real datasets both SFS and SBS are beaten considerably by more sophisticated methods.

Significantly better results can be expected from the floating search methods (SFFS, SBFS). They make use of backtracking and are capable of correcting wrong inclusion/removal decisions. Floating search has become widely popular because of its good performance/speed ratio, although some critics claim it is too slow. Our experience shows that one can expect roughly 10x longer computational time compared to SFS/SBS. This is obviously only a rough estimate, because the overall speed depends primarily on the dimensionality of the problem. With dozens of features it usually does not matter whether the search takes 10 seconds (SFS) or 2 minutes (SFFS). With thousands of features all sequential methods may become unusable, as even the simplest sequential forward selection may take days due to the overwhelming number of possible feature combinations to be tested. The original floating search description may be found in:

    Floating search methods in feature selection
    P. Pudil, J. Novovicová and J. Kittler
    Pattern Recognition Letters, 15 (11) (1994) pp. 1119-1125

This description is formally well written; however, you may get the impression that the method is more complicated than it really is. The same holds for our C code (see below): it is very flexible, incorporating various additions and an intelligent scheme for coding subset configurations, so the search principle may look somewhat hidden in it. But rest assured that the floating search itself is very simple once you get into it.
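To back this claim up, here is a minimal SFFS sketch in C, stripped of all the flexibility of the published code. The three-feature criterion table is hypothetical (made up for illustration); best[k] records the best value found so far for each subset size, and the backward (floating) step removes features only while that beats the recorded best of the smaller size:

```c
#include <assert.h>

#define NF 3  /* total number of features */

/* Hypothetical criterion values for all 8 subsets (index = bitmask).
 * Feature 0 is individually best, but the pair {1,2} wins overall. */
static const double J[8] = {
    0.00,  /* {}      */
    0.50,  /* {0}     */
    0.40,  /* {1}     */
    0.55,  /* {0,1}   */
    0.30,  /* {2}     */
    0.60,  /* {0,2}   */
    0.90,  /* {1,2}   */
    0.85   /* {0,1,2} */
};

/* Sequential forward floating selection: plain SFS steps, each followed
 * by backward steps taken only while they beat the best known subset
 * of the smaller size. Returns the best subset of size d found. */
unsigned sffs(int d)
{
    double best[NF + 1];
    unsigned bestset[NF + 1] = { 0 };
    unsigned sel = 0;
    int k = 0;
    for (int i = 0; i <= NF; ++i) best[i] = -1.0;
    while (k < NF) {
        /* inclusion: add the most significant feature */
        int add = -1; double av = -1.0;
        for (int f = 0; f < NF; ++f)
            if (!(sel & (1u << f)) && J[sel | (1u << f)] > av) {
                av = J[sel | (1u << f)]; add = f;
            }
        sel |= 1u << add; ++k;
        if (av > best[k]) { best[k] = av; bestset[k] = sel; }
        /* conditional exclusion: backtrack while it pays off */
        while (k > 2) {
            int rem = -1; double rv = -1.0;
            for (int f = 0; f < NF; ++f)
                if ((sel & (1u << f)) && J[sel & ~(1u << f)] > rv) {
                    rv = J[sel & ~(1u << f)]; rem = f;
                }
            if (rv <= best[k - 1]) break;
            sel &= ~(1u << rem); --k;
            best[k] = rv; bestset[k] = sel;
        }
    }
    return bestset[d];
}
```

On this table plain SFS would return {0,2} as the best pair; SFFS backtracks out of feature 0 once the triple reveals that {1,2} is better.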

Adaptive floating search (ASFFS, ASFBS) is an extension of the classical floating search. It should be noted that the adaptive floating search can seem rather complicated and does not necessarily bring much: according to our long-term experience, by using the adaptive floating search instead of the classical floating search you would get results occasionally better by 1-5%, but at the cost of several times longer computational time. By contrast, the classical floating search has proved to be a great tool, especially when compared to older methods like Plus-L-Minus-R or the basic sequential forward and backward selection.

Apart from floating search, we can recommend something even more flexible if you intend to achieve the best possible results and are ready to experiment a bit: the Oscillating Search (OS), introduced in the following paper:

    Oscillating Search Algorithms For Feature Selection (paper can be downloaded here),
    Somol P., Pudil P.
    Proc. 15th IAPR International Conference on Pattern Recognition, Barcelona, 406-409, 2000.

The algorithm is simpler than the adaptive (and even the classical) floating search and is often capable of finding significantly better results (where this is principally possible, depending on the data used). The OS is more or less a variant of gradient methods known from other contexts. It may be used in different ways: as a very fast method or a very thorough one. It may be time-restricted for use in real-time systems, and it may operate as a deterministic or a randomised procedure. The point is that the method always starts the search from some given dimensionality (subset size), which makes it usable even in cases where the other methods fail due to computational problems. If you have 10000 features and need to find the best 2000, this procedure may be the only one we can imagine giving at least some reasonable results in acceptable time. Alternatively, it is possible to try genetic algorithms in such cases; however, experiments suggest that the sequential methods described above tend to give better solutions.
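The oscillation idea can be sketched in C as follows. This is a simplified deterministic variant over four features with a hypothetical criterion lookup table (the published code is considerably more general: randomisation, time limits, etc.). A down-swing removes o features and greedily adds o back; an up-swing does the opposite; the depth o grows up to a user-set limit Delta whenever neither swing improves the current subset:

```c
#include <assert.h>

#define F 4  /* total number of features */

/* Hypothetical criterion values for all 16 subsets (index = bitmask). */
static const double J[16] = {
    0.00, 0.50, 0.30, 0.55, 0.35, 0.60, 0.58, 0.62,
    0.20, 0.52, 0.45, 0.56, 0.90, 0.80, 0.78, 0.70
};

/* One greedy inclusion step: add the most beneficial feature. */
static unsigned add_best(unsigned s)
{
    int best = -1; double bv = -1.0;
    for (int f = 0; f < F; ++f)
        if (!(s & (1u << f)) && J[s | (1u << f)] > bv) {
            bv = J[s | (1u << f)]; best = f;
        }
    return best < 0 ? s : s | (1u << best);
}

/* One greedy exclusion step: drop the least useful feature. */
static unsigned del_worst(unsigned s)
{
    int best = -1; double bv = -1.0;
    for (int f = 0; f < F; ++f)
        if ((s & (1u << f)) && J[s & ~(1u << f)] > bv) {
            bv = J[s & ~(1u << f)]; best = f;
        }
    return best < 0 ? s : s & ~(1u << best);
}

/* Oscillating search around a subset of fixed target size (e.g. an
 * SFS result); delta bounds the oscillation depth o. */
unsigned oscillating_search(unsigned init, int delta)
{
    unsigned sel = init;
    int o = 1;
    while (o <= delta) {
        int improved = 0;
        unsigned t = sel;
        for (int i = 0; i < o; ++i) t = del_worst(t);  /* down-swing */
        for (int i = 0; i < o; ++i) t = add_best(t);
        if (J[t] > J[sel]) { sel = t; improved = 1; }
        t = sel;
        for (int i = 0; i < o; ++i) t = add_best(t);   /* up-swing */
        for (int i = 0; i < o; ++i) t = del_worst(t);
        if (J[t] > J[sel]) { sel = t; improved = 1; }
        o = improved ? 1 : o + 1;  /* reset depth on improvement */
    }
    return sel;
}
```

On this table an SFS initialization for the pair gives {0,2} (mask 0x5, value 0.60); the first up-swing already discovers the better pair {2,3} (mask 0xC, value 0.90).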

According to our current experience, we would recommend:
1) First try the classical SFFS floating search as a good compromise between speed and effectiveness. Moreover, it finds solutions for all subset sizes.
2) Try to play a little with simple variants of the OS algorithm, particularly the randomised OS. With randomised OS, try as many different random starting points as your time allows. Also try using the OS to "tune" the results from 1). If the criterion to be maximised is not monotonic - as, e.g., in the "wrapper" case where you directly maximise the classifier recognition rate - focus on those subset sizes that seemed most promising in 1); with monotonic criteria it is generally difficult to tell which subset size is recommendable.

As for truly optimal search, there is unfortunately no generally applicable and fast tool available. Full search (testing all possible combinations) is obviously applicable only to small dimensionalities due to the exponential nature of the problem. For cases where the criterion function fulfils one additional condition - the "monotonicity" condition, which, assuming we want to maximize the criterion, in short says that adding a feature to a subset must not decrease the criterion value - you can try one of the Branch & Bound algorithm versions. As for Branch & Bound speed, it is a sort of lottery at the moment: it depends on many things and may take anything between O(n) and twice the time needed for full search. A detailed discussion of this topic, together with a description of modern versions of the algorithm, can be found in TPAMI 7/2004 (full reference below).
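A minimal Branch & Bound sketch in C, assuming a monotonic criterion (here a hypothetical non-negative weighted sum, so removing a feature can never increase the value). The search removes N-D features from the full set; whenever the current node already scores no better than the best complete subset found, the whole subtree is pruned, because monotonicity guarantees no descendant can do better:

```c
#include <assert.h>

#define N 5  /* total features */
#define D 3  /* target subset size */

static const double w[N] = { 0.8, 0.2, 0.5, 0.1, 0.4 };

/* Monotonic criterion: removing a feature never increases the value. */
static double J(unsigned mask)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        if (mask & (1u << i)) s += w[i];
    return s;
}

static double bound;   /* value of the best D-subset found so far */
static unsigned best;

static void bb(unsigned mask, int removed, int next)
{
    double v = J(mask);
    if (v <= bound) return;              /* prune the whole subtree */
    if (removed == N - D) {              /* complete solution reached */
        bound = v; best = mask;
        return;
    }
    for (int f = next; f < N; ++f)       /* remove one more feature;   */
        if (mask & (1u << f))            /* increasing f avoids visiting */
            bb(mask & ~(1u << f), removed + 1, f + 1);  /* duplicates */
}

unsigned branch_and_bound(void)
{
    bound = -1.0;
    best = 0;
    bb((1u << N) - 1u, 0, 0);
    return best;
}
```

With a trivial additive criterion like this the optimum is just the D largest weights; the point is only to show the pruning mechanics, which remain valid for any criterion satisfying the monotonicity condition.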


Source codes:

1) A freely available C source code of the floating search methods (SFFS, SBFS, ASFFS, ASBFS) can be found at the Elsevier server. It is an "electronic annex" of our paper:

    Adaptive floating search methods in feature selection
    P. Somol, P. Pudil, J. Novovicova and P. Paclik
    Pattern Recognition Letters, 20 (11-13) (1999) pp. 1157-1163

The annex contains two text files, one for the classical floating search and one for the "adaptive" version. Each text file is actually a C source with a sufficient textual description in the header. The sources should be accessible through the PRL pages at www.elsevier.com as "PRL Annexes" or "Electronic Annexes". However, as many people ask us for help because of difficulties with the PRL pages, we make the sources available here:

     Floating Search source code download, and
     Adaptive Floating Search source code download.

2) Oscillating search source code is available in a similar form here. The procedure looks technically quite complicated, but the only reason is the extent of the implemented optional features, not the core algorithm as such. It should not be a problem to adjust it for use with your own criterion procedures. For first experiments we recommend staying with the simplest (sequential) form of OS, low Delta values (2-3) and some simple initialization (SFS or random).

    Oscillating Search source code download.

The codes come from experimental versions of the Feature Selection Toolbox, a tool developed in our department. The F.S.Toolbox as a whole is not currently available for general use.


   
    References (ours, selected):
  1. Somol P., Pudil P., Kittler J. (UK): Fast Branch & Bound Algorithms For Optimal Feature Selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(7):900–912, July 2004.
  2. Somol P., Pudil P.: Feature Selection Toolbox. Pattern Recognition 35(12):2749-2759, 2002.
  3. Somol P., Pudil P., Grim J.: Branch & Bound Algorithm with Partial Prediction For Use with Recursive and Non-Recursive Criterion Forms. Lecture Notes in Computer Science LNCS 2013, Springer, pp. 230-239, 2001.
  4. Somol P., Pudil P.: Oscillating search algorithms for feature selection. In: Proceedings of the 15th International Conference on Pattern Recognition. (Sanfeliu A., Villanueva J. J., Vanrell M. eds.). IEEE Computer Society, Los Alamitos, pp. 406-409, 2000.
  5. Somol P., Pudil P., Novovicova J., Paclik P.: Adaptive floating search methods in feature selection. Pattern Recognition Letters, 20(11-13):1157-1163, 1999.
  6. Novovicova J., Pudil P., Kittler J. (UK): Divergence Based Feature Selection for Multimodal Class Densities. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(2):218-223, 1996.
  7. Pudil P., Novovicova J., Kittler J. (UK): Floating Search Methods in Feature Selection. Pattern Recognition Letters, 15(11):1119-1125, 1994.

    References (general):
  1. S.Theodoridis and K.Koutroumbas. Pattern Recognition: 2nd edition. Academic Press, Inc., 2003.
  2. A.Webb. Statistical Pattern Recognition: 2nd edition. John Wiley & Sons, 2002.
  3. R.O.Duda, P.E.Hart and D.G.Stork. Pattern Classification: 2nd edition. Wiley-Interscience, 2000.
  4. K.Fukunaga. Introduction to Statistical Pattern Recognition: 2nd edition. Academic Press, Inc., 1990.
  5. P.A. Devijver and J.Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, 1982.

