The measure of the information in a dataset is given by information gain, which is based on entropy (Shannon entropy). This concept is widely used in building decision trees in machine learning.
What is this entropy?
As the name suggests, it's the disturbance in the dataset. Let's say we have a dataset with a few attributes, and one of these attributes is "clicksource". It can take either the "mobile" or the "desktop" value (in reality it can have more than 2 values, but for simplicity's sake we will consider only these 2).
Now let's assume we have 150 records. We will walk through the examples below to explain entropy and then look at the mathematical formula for it.
Example 1
There are 75 records which have the "mobile" value for the clicksource attribute.
The remaining 75 records have the "desktop" value for the same attribute.
Question -> Is this data uniform w.r.t. clicksource? Can it be further classified/split based upon clicksource?
Example 2
There are 100 records which have the "mobile" value for the clicksource attribute.
The remaining 50 records have the "desktop" value for the same attribute.
Question -> Does this dataset have better uniformity (for lack of a better term) than the one in Example 1? If yes, then why?
Example 3
There are 140 records which have the "mobile" value for the clicksource attribute.
The remaining 10 records have the "desktop" value for the same attribute.
Question -> Does this dataset have better uniformity than in Examples 1 and 2? If yes, then why?
When you think about these, you realize that the dataset in Example 3 is more uniform than in the previous examples. The reason: most (approx. 93%) of the records belong to the same class (or category), "mobile", for the "clicksource" attribute. If there is more uniformity, then there is less disturbance; in other words, there is less entropy.
In other words, if we can NOT further classify the dataset, then we have zero entropy. But if we can, then the entropy is greater than zero. And if the entropy is 1, then the (two-class) dataset is completely random. Are these fair statements to make?
Mathematical representation
===========================
Now let's see how we can represent entropy mathematically, the fun part.
As per the ID3 algorithm (try to correlate this with the examples above):
Collection = S (150 records in the above examples)
p(C) = proportion of S belonging to class C (in the above examples the classes are "mobile" and "desktop")
Entropy of S = sum over all classes C of ( - p(C) log2 p(C) )
Why do we use log2 (log base 2)? We will discuss that later.
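To make this concrete, below is a minimal sketch in Python of the formula above. The function name entropy and the use of the math module are my own choices for illustration; this is not code from ID3 itself.

import math

def entropy(counts):
    # counts holds the number of records in each class,
    # e.g. [75, 75] for Example 1 below.
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count == 0:
            continue                 # an empty class contributes nothing
        p = count / total            # p(C): proportion of S belonging to class C
        result -= p * math.log2(p)   # accumulate - p(C) log2 p(C)
    return result

We will reuse this sketch to check the hand calculations below.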
Example 1 entropy
===================
p(C) for "mobile" = 75/150 = 0.5
p(C) for "desktop" = 75/150 = 0.5
(hint: log base 2 of 0.5 is -1)
Entropy = -0.5 * (log2 0.5) - 0.5 * (log2 0.5)
        = -0.5 * -1 - 0.5 * -1
        = 0.5 + 0.5 = 1
Which is right, as the dataset is completely random: both attribute values (in this case 2, "mobile" and "desktop") occur equally often.
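Checking with the entropy sketch from above (a hypothetical helper, not part of the original formula):

print(entropy([75, 75]))   # Example 1: 75 "mobile" + 75 "desktop" -> 1.0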
Example 2 entropy
===================
p(C) for "mobile" = 100/150 = 0.67
p(C) for "desktop" = 50/150 = 0.33
(hint: log base 2 of 0.67 is about -0.58, log base 2 of 0.33 is about -1.6)
Entropy = -0.67 * (log2 0.67) - 0.33 * (log2 0.33)
        = -0.67 * -0.58 - 0.33 * -1.6
        = 0.39 + 0.53 = ~0.92
Less entropy than the first example, but still fairly high ...
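The exact value from the sketch above is slightly below the hand-rounded figure:

print(entropy([100, 50]))  # Example 2: 100 "mobile" + 50 "desktop" -> ~0.918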
Example 3 entropy
===================
p(C) for "mobile" = 140/150 = 0.93
p(C) for "desktop" = 10/150 = 0.07
(hint: log base 2 of 0.93 is about -0.105, log base 2 of 0.07 is about -3.8)
Entropy = -0.93 * (log2 0.93) - 0.07 * (log2 0.07)
        = -0.93 * -0.105 - 0.07 * -3.8
        = 0.098 + 0.266 = ~0.36
Much less entropy.
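The exact value from the sketch above comes out a little below the hand-rounded ~0.36:

print(entropy([140, 10]))  # Example 3: 140 "mobile" + 10 "desktop" -> ~0.353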
Let's say all the records have the "mobile" value for the clicksource attribute. What will the entropy be? Make a guess.
Here it is: log base 2 of 1 (i.e. 150/150) is 0, hence the entropy is zero. The records are fully uniform, i.e. they all belong to just one class.
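The sketch above agrees, assuming we pass a zero count for the missing class:

print(entropy([150, 0]))   # all 150 records are "mobile" -> 0.0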
Now, why base 2?
The last line hints at it: log base 2 of 1 is zero, so a fully uniform dataset scores 0. Just as importantly, with base 2 a perfectly random two-class split scores exactly 1 (the entropy is measured in bits), which makes log base 2 the right factor for this measurement :) .
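A small sketch to illustrate the point, under the assumption of a two-class split: the base of the logarithm only rescales the measure, but base 2 makes the maximum for two classes exactly 1 (one bit), which is the convenient 0-to-1 scale used above.

import math

p = 0.5
print(-p * math.log2(p) - p * math.log2(p))   # base 2: a 50/50 split scores exactly 1.0
print(-p * math.log(p) - p * math.log(p))     # natural log: same shape, but ~0.693 (nats)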