Entropy, Information Gain, Gini Index, Reducing Impurity?

Beytullah Soylev
3 min read · Sep 30, 2023


Entropy, Information Gain, Gini Index, and Reducing Impurity are all important concepts in data science, particularly for decision tree algorithms.

Entropy & Gini

Entropy is a measure of uncertainty or randomness in a dataset. It is calculated by summing, over every possible outcome, the probability of that outcome multiplied by the logarithm of that probability, and negating the result: H = −Σ pᵢ log₂ pᵢ. A dataset with high entropy is more uncertain, while a dataset with low entropy is more certain.
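
As a concrete sketch (using NumPy and a plain list of class labels; the "yes"/"no" purchase labels are made up for illustration), entropy can be computed like this:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits: H = -sum(p_i * log2(p_i)) over the class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# A 50/50 mix of classes is maximally uncertain; a single class is perfectly certain.
print(entropy(["yes", "no", "yes", "no"]))    # 1.0 bit
print(entropy(["yes", "yes", "yes", "yes"]))  # -0.0, i.e. no uncertainty in a pure set
```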

Information Gain is a measure of how much uncertainty is reduced by splitting a dataset on a particular feature. It is calculated by subtracting the weighted average entropy of the child nodes from the entropy of the parent node. A feature with a high information gain is a good candidate for splitting the dataset, as splitting on it removes more of the overall uncertainty.
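
Continuing the sketch above (reusing the entropy helper; the split into two child groups is hypothetical), information gain can be computed as:

```python
def information_gain(parent_labels, child_label_groups):
    """IG = entropy(parent) - weighted average entropy of the child nodes."""
    n = len(parent_labels)
    weighted_children = sum(
        len(child) / n * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_children

# A split that produces two pure children removes all of the parent's uncertainty.
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```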

Gini Index is another measure of impurity in a dataset. It is calculated by subtracting the sum of the squared probabilities of each possible outcome from one: Gini = 1 − Σ pᵢ². A dataset with a high Gini Index is more impure, while a dataset with a low Gini Index is purer.
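
A matching sketch for the Gini index, under the same NumPy-and-label-list assumptions as above:

```python
def gini_index(labels):
    """Gini impurity: 1 - sum(p_i^2). 0 is pure; 0.5 is the worst case for two classes."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

print(gini_index(["yes", "no", "yes", "no"]))    # 0.5 (maximally impure for two classes)
print(gini_index(["yes", "yes", "yes", "yes"]))  # 0.0 (pure)
```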

Reducing Impurity is the process of making a dataset purer. This is done by splitting the dataset on the features that reduce its overall impurity the most.

Decision tree algorithms use these concepts to build predictive models. At each node in the decision tree, the algorithm selects the feature (and split point) that yields the highest information gain or, equivalently, the largest reduction in Gini Index, and splits the dataset on it. This process is repeated until each leaf node in the tree contains a pure dataset or another stopping criterion, such as a maximum depth, is met.
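
As one illustration (not part of the original discussion), scikit-learn's DecisionTreeClassifier exposes this choice through its criterion parameter; here it is sketched on the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" selects splits by information gain;
# criterion="gini" (the default) selects splits by the reduction in Gini impurity.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy of the fitted tree
```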

Here is an example of how entropy, information gain, and Gini index can be used to reduce impurity in a decision tree:

Suppose we have a dataset of customers who have purchased a product or not. The dataset contains the following features:

  • Age
  • Gender
  • Income

We want to build a decision tree to predict whether or not a customer will purchase the product.

We start by calculating the entropy of the entire dataset. This gives us a measure of the overall uncertainty in the dataset.

Next, we calculate the information gain for each feature. This gives us a measure of how much uncertainty is reduced by splitting the dataset on that feature.

We then select the feature with the highest information gain and split the dataset on that feature.

We repeat this process until each leaf node in the tree contains a pure dataset.
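
Here is a minimal sketch of that loop, reusing the entropy and information_gain helpers from above; the customer records and the candidate splits are made up purely for illustration:

```python
# Hypothetical customer records: (age, gender, income, purchased)
customers = [
    (23, "F", 28_000, "yes"),
    (27, "M", 32_000, "yes"),
    (29, "F", 52_000, "yes"),
    (34, "M", 45_000, "no"),
    (41, "F", 61_000, "no"),
    (55, "M", 75_000, "no"),
]
labels = [c[3] for c in customers]

def split_labels(rows, predicate):
    """Partition the purchase labels according to a candidate test."""
    left = [r[3] for r in rows if predicate(r)]
    right = [r[3] for r in rows if not predicate(r)]
    return [left, right]

candidate_splits = {
    "age < 30":        lambda r: r[0] < 30,
    "gender == 'F'":   lambda r: r[1] == "F",
    "income < 50_000": lambda r: r[2] < 50_000,
}

# Information gain of each candidate split; the greedy algorithm keeps the largest.
# With these made-up records, "age < 30" wins and becomes the root split.
for name, pred in candidate_splits.items():
    print(name, information_gain(labels, split_labels(customers, pred)))
```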

For example, we might split the dataset on the age feature. We might find that customers who are younger than 30 are more likely to purchase the product. We would then create two child nodes, one for customers who are younger than 30 and one for customers who are 30 or older.

We would then repeat the process for each child node, splitting them on the gender feature and then the income feature.

By the end of the process, we would have a decision tree that predicts whether or not a customer will purchase the product based on their age, gender, and income.
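
To close the loop, here is an end-to-end sketch with scikit-learn; the eight customer records, the 0/1 gender encoding, and the income values are all made up for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical customer records (income in thousands; gender encoded as 0/1).
data = pd.DataFrame({
    "age":       [22, 25, 28, 35, 40, 45, 52, 60],
    "gender":    [0, 1, 0, 1, 0, 1, 0, 1],
    "income":    [30, 80, 40, 55, 35, 70, 45, 90],
    "purchased": [1, 1, 1, 0, 0, 0, 0, 0],
})

X = data[["age", "gender", "income"]]
y = data["purchased"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# export_text prints the learned splits as if/else rules; for these made-up
# records a single split on age, near the 30 mark, already yields pure leaves.
print(export_text(tree, feature_names=["age", "gender", "income"]))
```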

Reducing impurity in decision trees is important because it can lead to more accurate and reliable models. By splitting the dataset on features that are likely to reduce the overall impurity of the dataset, we can create models that are better at predicting the outcome of interest.


“Don’t be afraid to fail. It’s not the end of the world, and in many ways, it’s the first step toward learning something and getting better at it.” — Jon Hamm
