Tuesday, December 16, 2014

Entropy, as used in machine learning, for the layman

Entropy (Shannon entropy) measures how mixed, or impure, a dataset is with respect to some attribute. Information gain, the reduction in entropy produced by a split, is built on top of it. Both concepts are widely used in building decision trees (used in machine learning).
What is this entropy?
As the name suggests, it's the disturbance in the dataset. Let's say we have a dataset with a few attributes, one of which is "clicksource". It can have one of two values, "mobile" or "desktop" (in reality it can have more than 2, but for simplicity's sake we will consider only these 2 values).
Now let's assume we have 150 records. We will use the examples below to explain entropy and then look at the mathematical formula for it.
Example 1 
There are 75 records which have the "mobile" value for the clicksource attribute.
The remaining 75 records have the "desktop" value for the same attribute.
Question -> Is this uniform data w.r.t. clicksource? Can it be further classified/split based upon clicksource?
Example 2
There are 100 records which have the "mobile" value for the clicksource attribute.
The remaining 50 records have the "desktop" value for the same attribute.
Question -> Does this dataset have better uniformity (for lack of a better term) than in example 1? If yes, then why?
Example 3
There are 140 records which have the "mobile" value for the clicksource attribute.
The remaining 10 records have the "desktop" value for the same attribute.
Question -> Does this dataset have better uniformity than in examples 1 and 2? If yes, then why?

When you think about these, you realize that the dataset in example 3 is more uniform than in the previous examples. The reason: most (approx. 93%) of the records belong to the same class (or category), "mobile", for the "clicksource" attribute. If there is more uniformity then there is less disturbance; in other words, there is less entropy.
Put another way, if every record belongs to the same class, so that the dataset can NOT be split any further on this attribute, then the entropy is zero. If it can be split, the entropy is greater than zero. With two classes the entropy peaks at 1, which happens when the records are divided equally between the classes, i.e. the dataset is as mixed as it can be. Are these fair statements to make?
Mathematical representation
===========================
Now let's see how we can represent entropy mathematically, the fun part.
As per the ID3 algorithm (try to correlate this with the examples above):

Collection = S (the 150 records in the above examples)
p(C) = Proportion of S belonging to class C (in the above examples the class is "mobile" or "desktop")

Entropy of S = Sum over all classes C of ( - p(C) * log2 p(C) )
Why log2 (log to base 2)? We will discuss that later.
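
The formula above translates almost word for word into code. Below is a minimal Python sketch; the function name `entropy` and the choice to pass in a list of proportions are illustrative assumptions, not something prescribed by ID3.

```python
import math

def entropy(proportions):
    """Shannon entropy in bits: the sum of -p * log2(p) over all classes."""
    # A class with p == 0 is skipped: log2(0) is undefined, and by
    # convention such a class contributes nothing to the sum.
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# Two classes, each holding half of the records.
print(entropy([0.5, 0.5]))  # 1.0
```

Each term is one class's contribution, so the same function works unchanged when an attribute has more than 2 values.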

Example 1 entropy
===================
p(C) for "mobile" = 75/150 = 0.5
p(C) for "desktop" = 75/150 = 0.5

(hint = log base 2 of 0.5 is equal to -1)
Entropy = -0.5 * (log2 0.5) - 0.5 * (log2 0.5)
= -0.5 * -1 - 0.5 * -1
= 0.5 + 0.5 = 1
Which is right, as the dataset is as mixed as it can be: both values of the attribute (mobile and desktop) occur equally often.
Example 2 entropy
===================
p(C) for "mobile" = 100/150 = 0.67
p(C) for "desktop" = 50/150 = 0.33

(hint = log base 2 of 0.67 is equal to -0.58, log base 2 of 0.33 is -1.6)
Entropy = -0.67 * (log2 0.67) - 0.33 * (log2 0.33)
= -0.67 * -0.58 - 0.33 * -1.6
= 0.39 + 0.53 = ~0.92
Less entropy than in the first example, but still fairly high.
Example 3 entropy
===================
p(C) for "mobile" = 140/150 = 0.93
p(C) for "desktop" = 10/150 = 0.07

(hint = log base 2 of 0.93 is equal to -0.105, log base 2 of 0.07 is -3.8)
Entropy = -0.93 * (log2 0.93) - 0.07 * (log2 0.07)
= -0.93 * -0.105 - 0.07 * -3.8
= 0.098 + 0.27 = ~0.36 (about 0.35 if the proportions are not rounded)
Much less entropy.

Let's say all records have the "mobile" value for the clicksource attribute. What will the entropy be? Make a guess.
Here it is: log base 2 of 1 (i.e. 150/150) is 0. Hence the entropy is zero. The records are fully uniform, i.e. they all belong to just 1 class.
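
The four cases above can be checked with a short script. This is only a sketch: the record lists are rebuilt from the counts given in the examples, and the names `attribute_entropy` and `datasets` are made up for illustration. The hand calculations round the proportions to 2 decimals, so they can differ slightly from these results.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Entropy (in bits) of a list of attribute values, one per record."""
    counts = Counter(values)   # e.g. {"mobile": 100, "desktop": 50}
    total = len(values)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# clicksource values for the 150 records in each example
datasets = {
    "Example 1": ["mobile"] * 75 + ["desktop"] * 75,
    "Example 2": ["mobile"] * 100 + ["desktop"] * 50,
    "Example 3": ["mobile"] * 140 + ["desktop"] * 10,
    "All mobile": ["mobile"] * 150,
}

for name, values in datasets.items():
    print(f"{name}: {attribute_entropy(values):.3f}")
# Example 1: 1.000
# Example 2: 0.918
# Example 3: 0.353
# All mobile: 0.000
```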

Now why base 2
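The zero in the last case would come out in any base, since the log of 1 is zero in every base. Base 2 is the convention because it measures entropy in bits, and it makes the entropy of two equally likely classes (Example 1) exactly 1, which gives a convenient 0-to-1 scale for a two-class attribute :) .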
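
One way to see what the choice of base actually changes is to compute the same entropy with different bases. In the sketch below, the function name `entropy_in_base` and its `base` parameter are illustrative assumptions. It suggests that the single-class case is zero whatever the base, while base 2 is what puts a 50/50 split at 1 (one bit).

```python
import math

def entropy_in_base(proportions, base):
    """Entropy with a configurable log base (2 gives bits, e gives nats)."""
    return sum(-p * math.log(p, base) for p in proportions if p > 0)

half_and_half = [0.5, 0.5]
single_class = [1.0]

# A 50/50 split: the value depends on the base, and is 1 only for base 2.
print(entropy_in_base(half_and_half, 2))
print(entropy_in_base(half_and_half, math.e))   # roughly 0.69
print(entropy_in_base(half_and_half, 10))       # roughly 0.30

# A single class: zero in every base, because log(1) is 0 in every base.
print(entropy_in_base(single_class, 2), entropy_in_base(single_class, 10))

# Four equally likely classes in base 2: the maximum grows to about 2 bits,
# so "1" is the ceiling only for a two-class attribute.
print(entropy_in_base([0.25] * 4, 2))
```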

Wednesday, September 24, 2014

Cloudera VM setup on VMPlayer, to develop hadoop applications


Setup for Cloudera VM
======================

a) download VM player from VMWare site
1. https://my.vmware.com/web/vmware/downloads
2. select "Download" opposite to "VMware Player". It is almost at the end of the page.
3. Now select the VMPlayer for your operating system.
b) cloudera standalone VM for VMPlayer
1. download it from -> http://www.cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html
1.1 Select version CDH 5.1x, if not already selected.
1.2 Download for VMWare.
2. unzip it using winzip or 7zip or any other unzip utility you might have.
2.1 remember the location where it is unzipped
3. Open VMWare player (downloaded in a) above)
4. From the VM player window, click on "Open virtual machine" in the right hand pane.
5. Select the location from 2.1 above.
6. VM Player has now opened the VM
c) VMPlayer playing the VM
d) change the VM memory or storage settings by right clicking on the VM name in VMPlayer and selecting "settings"
d.1) select "hard disk" from the "hardware" tab and expand it to allocate more space.
d.1.1) if you have the space and are trying to process huge data, give this more than 100GB.
d.2) select "memory" from the "hardware" tab to change the memory settings
d.2.1) Give 2GB or more.
e) For this VM, the user id and password are both cloudera
f) run some hdfs commands
- hadoop fs -ls  /
- the user directory in hdfs is  /user/cloudera
- hadoop fs -ls /user/cloudera
- . (dot) used as a directory in an hdfs command translates to this directory (/user/cloudera)
- upload a file to hdfs
- on the linux console type this -> touch test
- issue this command (without quotes) - "hadoop fs -put test . "
- the above and  "hadoop fs -put test /user/cloudera/" mean the same thing

g) hadoop configuration is at -> /etc/hadoop/conf
h) type "hadoop version" to verify the version of hadoop
i) Download netbeans (make sure you have allocated 2GB or more memory to this VM)
- Open the Firefox browser from within the VM
- if you are not able to connect to the internet then a) make sure your net connection is working and
b) right click at the top right corner of the VM and change the "network adapter" settings from the "hardware" tab
Choose the right network settings. Mostly a "Bridged" network connection is needed.
- Go to netbeans.org
- click on downloads
- on the next page, click on "Download" under Java SE
- save the file
- once the download finishes, go to this directory /home/cloudera/Downloads
- cd /home/cloudera/Downloads
- to make the script executable, run the below command
- chmod +x netbeans-8.0.1-javase-linux.sh
- to start installing netbeans, run the below command
- ./netbeans-8.0.1-javase-linux.sh
- accept all the defaults and keep clicking "next" and at the end "install"

Now you are ready to develop big data processing programs.

Thursday, April 17, 2014

virtual machine in VMPlayer not connecting to internet

Use a bridged connection, then click on network adapter and uncheck the network adapters which are not needed. Select only the needed network adapters.