Over the years, the use of Machine Learning in real life has increased manifold. Be it the diagnosis of cancerous cells or the prediction of stock market prices, machine learning models are being utilized extensively. With the sudden spurt of machine learning, our focus has turned to methods that improve and optimize its performance.
Feature selection is one such method.
In feature selection, we find the optimal feature subset that contributes most to our predicted variable. The performance of machine learning models depends heavily on the feature set. Retaining only the significant features vastly reduces training time and can also improve accuracy.
Why do we need feature selection?
- Improve generalization of models by reducing overfitting of data.
- Remove unnecessary/redundant data.
- Curtail the curse of dimensionality.
- Optimize training time.
The dataset used is Breast Cancer Wisconsin (Diagnostic), which consists of 30 real-valued features.
Genetic algorithm is a search and optimization algorithm based on the principle of natural evolution. The algorithm tries to ‘mimic’ the concept of human evolution by modifying a set of individuals called a population, followed by a random selection of parents from this population to carry out reproduction in the form of mutation and crossover. This process continues until the stopping criterion is met. In the end, the algorithm returns the best individual as the solution.
This algorithm is based on the fact that ‘good’ parents produce ‘good’ offspring, which causes the algorithm to converge to an optimal value. Since it is a derivative-free method and has a purely random approach, we can expect it to converge to a global optimum over time.
Steps involved in Genetic Algorithm
INITIALIZATION OF POPULATION
Population is the group of individuals that forms a generation; it is a subset of possible solutions that undergoes reproduction. A chromosome represents an individual in the population and, in computational terms, is represented by a binary string. For feature selection, the chromosome's length is taken as the number of features in the dataset, and a 1 (or 0) at the ith position indicates the presence (or absence) of the ith feature in the solution.
A large population size slows the algorithm down, while a small population does not provide enough diversity, so it is important to choose the population size wisely. Here, I have used a population size of 40 (roughly 1.5 × the number of features in the dataset).
Here, we are initializing our population to find the top N features.
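A minimal NumPy sketch of this initialization (the function name and parameter names are my own; each chromosome starts with exactly N bits set so the search begins from N-feature subsets):

```python
import numpy as np

def init_population(pop_size, n_features, n_select):
    """Create pop_size binary chromosomes, each with exactly n_select 1-bits."""
    population = np.zeros((pop_size, n_features), dtype=int)
    for chromosome in population:
        # pick n_select distinct feature indices to switch on
        selected = np.random.choice(n_features, size=n_select, replace=False)
        chromosome[selected] = 1
    return population

# 40 chromosomes over the 30 dataset features, each selecting N = 15 features
population = init_population(pop_size=40, n_features=30, n_select=15)
```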
CALCULATION OF FITNESS VALUES
Fitness function evaluates how ‘fit’ an individual is. The term ‘fit’ indicates how close the individual’s solution is to the optimal solution. An individual with a high fitness value is considered better and poses an increased chance of being selected for reproduction.
Usually, the fitness function is the same as the optimization function; for example, in a maximization problem, the fitness function is the function to be maximized. Targeting improved model performance, I used the cross-validated F1 score of an MLPClassifier() trained on the individual’s feature subset as the fitness value.
F1 score is used as the accuracy metric because:
- The dataset is quite imbalanced, so accuracy_score would be biased towards the majority class.
- A high F1 score indicates better classification, i.e., better model performance.
Here, get_fitness() finds the population's fitness values by decoding each chromosome into a feature subset, and calculate_fitness() finds the corresponding f1 score.
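A sketch of these two helpers, assuming scikit-learn (the MLP hyperparameters and the 3-fold cross-validation are my choices, not necessarily the original's):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def calculate_fitness(X, y, chromosome):
    """Cross-validated F1 score of an MLP trained on the selected features."""
    mask = chromosome.astype(bool)   # decode chromosome into a feature subset
    if not mask.any():               # empty subset: worst possible fitness
        return 0.0
    model = MLPClassifier(max_iter=500)
    return cross_val_score(model, X[:, mask], y, cv=3, scoring="f1").mean()

def get_fitness(X, y, population):
    """Fitness value of every chromosome in the population."""
    return np.array([calculate_fitness(X, y, c) for c in population])
```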
SELECTION OF PARENTS FOR REPRODUCTION
Parent selection, one of the most crucial steps, is the process of choosing individuals (parents) from the population for reproduction, to produce the next generation.
Fitness Proportionate Parent Selection is the widely accepted criterion for parent selection. It ensures that every individual gets a chance to be selected as a parent, with a probability proportionate to its fitness value. In this way, the underlying idea behind the genetic algorithm is preserved.
There are various methods for parent selection like Tournament Selection, Roulette Wheel Selection, Stochastic Universal Sampling, Rank Selection, Random Selection, etc. I have used Roulette Wheel Selection here.
In Roulette Wheel Selection, a fixed point is chosen on a pie chart prepared from the fitness values. On every rotation, whichever individual comes in front of the point is selected for reproduction. This means that an individual with a greater area on the pie chart (i.e., a greater fitness value) has a higher probability of being selected.
- Find the sum of all fitness values in the population (S).
- Find the normalized fitness values (= fitness value / S).
- Find the cumulative fitness values (see Figure 3).
- Generate a random number p in [0, 1].
- Select the first individual whose cumulative fitness value exceeds p. E.g., in Figure 3, consider the Cumulative Fitness Values column: if p < 0.232, Individual 1 is selected; if 0.232 < p < 0.426, Individual 2 is selected, and so on.
Shown below is the implementation of Roulette Wheel Selection.
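A possible NumPy sketch of this scheme, following the steps above (the function name and the `searchsorted` shortcut are my own; the original implementation may differ):

```python
import numpy as np

def roulette_wheel_selection(population, fitness, n_parents):
    """Select n_parents chromosomes with probability proportional to fitness."""
    cumulative = np.cumsum(fitness / fitness.sum())
    parents = []
    for _ in range(n_parents):
        p = np.random.rand()
        # first individual whose cumulative fitness exceeds p
        idx = np.searchsorted(cumulative, p, side="right")
        idx = min(idx, len(fitness) - 1)   # guard against float round-off
        parents.append(population[idx])
    return np.array(parents)
```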
REPRODUCTION: CROSSOVER AND MUTATION
Reproduction involves forming a new generation by the mating of parents. Mating is implemented through the process of crossover. Mutation is used to add slight randomness to the individual to introduce diversity in the population.
Various crossover operations, like One Point Crossover, Two Point Crossover, and Uniform Crossover, are in use. Here, I have used the Two Point Crossover technique, which swaps the genetic material between two points chosen randomly on the parents.
Usually, a probability is assigned to this process, indicating the chance of crossover for a given pair. Crossover is a high-probability event and is typically assigned a probability between 0.65 and 0.80. Here, I have used a probability of 0.78.
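A sketch of Two Point Crossover under these settings (function and parameter names are mine):

```python
import numpy as np

def two_point_crossover(parent1, parent2, p_crossover=0.78):
    """Swap the genes between two random cut points with probability p_crossover."""
    child1, child2 = parent1.copy(), parent2.copy()
    if np.random.rand() < p_crossover:
        # two distinct cut points define the segment to exchange
        a, b = sorted(np.random.choice(len(parent1) + 1, size=2, replace=False))
        child1[a:b] = parent2[a:b]
        child2[a:b] = parent1[a:b]
    return child1, child2
```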
Mutation is used to introduce a slight variation in the chromosome by tweaking one of its genes. Its probability is kept very low to preserve the integrity of the population.
In general, mutation is done by randomly flipping a bit of a random individual in the population. Following this conventional mutation process, I observed that after many generations the number of selected features deviated considerably from N. To reduce this deviation, I instead swapped a ‘0’ bit with a ‘1’ bit, which keeps the number of selected features constant and reduced the deviation by a good extent.
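The count-preserving mutation described above can be sketched as follows (the function name and the mutation probability are my own assumptions):

```python
import numpy as np

def mutate(chromosome, p_mutation=0.1):
    """Swap a random 1-bit with a random 0-bit, keeping the feature count fixed."""
    mutated = chromosome.copy()
    if np.random.rand() < p_mutation:
        ones = np.flatnonzero(mutated == 1)
        zeros = np.flatnonzero(mutated == 0)
        if len(ones) > 0 and len(zeros) > 0:
            mutated[np.random.choice(ones)] = 0   # drop one selected feature
            mutated[np.random.choice(zeros)] = 1  # add one unselected feature
    return mutated
```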
After reproduction, a new generation is formed, and then the stopping criterion is checked. If the condition is satisfied, the algorithm terminates; otherwise, the process is repeated with the mutated population as the original population.
A few of the stopping criteria are listed below.
- Fixing the number of generations: This is not a very good method as we may miss out on the best generation because of an upper limit on the number of generations.
- Lack of progress in the fitness of the best individual of the population: Lack of progress does not necessarily imply convergence as evolution proceeds with punctuated equilibria that can later lead to improvement.
- Setting a lower threshold on the variance of fitness values in a population: when the individuals become so similar that we are unlikely to obtain a better individual in future generations, the algorithm stops and indicates convergence.
Here, I have used the third stopping criterion.
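A minimal sketch of this variance-based check (the function name and the threshold value are my own assumptions, not the article's):

```python
import numpy as np

def has_converged(fitness, var_threshold=1e-4):
    """Stop when the population's fitness variance falls below the threshold."""
    return bool(np.var(fitness) < var_threshold)
```

In the main loop, this check runs after each new generation is formed; when it returns True, the best individual of the current population is reported as the solution.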
Conclusion and final remarks
On running the algorithm multiple times for the same N, it was observed that it gave a different set of N features each time. The future scope lies mainly in assessing how reliably the Genetic Algorithm identifies the optimal feature subset, which would require multiple runs for each value of N and varying the variance threshold in the stopping criterion. The features obtained here should therefore be considered suggestive. With adequate knowledge about the features, the Genetic Algorithm could prove to be a great start in obtaining optimal feature subsets.
You can find the whole source code here. Feel free to go through it and suggest any improvements or optimizations.