{"id":106,"date":"2012-11-06T02:18:38","date_gmt":"2012-11-06T02:18:38","guid":{"rendered":"http:\/\/www.clayford.net\/statistics\/?p=106"},"modified":"2023-08-20T11:14:49","modified_gmt":"2023-08-20T15:14:49","slug":"machine-learning-for-hackers-chapter-2","status":"publish","type":"post","link":"https:\/\/www.clayford.net\/statistics\/machine-learning-for-hackers-chapter-2\/","title":{"rendered":"Machine Learning for Hackers, Chapter 2"},"content":{"rendered":"<p>Chapter 2 of <em>Machine Learning for Hackers<\/em> is called Data Exploration. It explains means, medians, quartiles, variance, histograms, scatterplots, things like that. It&#8217;s a quick and effective introduction to some basic statistics. If you know stats pretty well you can probably skip it, but I found it pretty useful for its intro to the ggplot2 package.<\/p>\n<p>The data they explore is collection of heights and weights for males and females. You can download it\u00a0<a href=\"https:\/\/github.com\/johnmyleswhite\/ML_for_Hackers\/blob\/master\/02-Exploration\/data\/01_heights_weights_genders.csv\">here<\/a>. 
The authors explore the data by creating histograms and scatterplots.<\/p>\n<p>Here&#8217;s how they create a histogram of the heights:<\/p>\n<pre>ggplot(ch2, aes(x = Height)) + geom_histogram(binwidth=1)<\/pre>\n<p><a href=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-107\" title=\"mlh_ch2_1\" src=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_1-300x300.png\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_1-300x300.png 300w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_1-150x150.png 150w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_1.png 672w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><br \/>\nChanging the binwidth parameter changes the size of the bins in inches.\u00a0They also create kernel density estimates (KDE):<\/p>\n<pre>ggplot(ch2, aes(x = Height)) + geom_density()<\/pre>\n<p><a href=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-108\" title=\"mlh_ch2_2\" src=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_2-300x300.png\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_2-300x300.png 300w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_2-150x150.png 150w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_2.png 672w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><br \/>\nNotice that line of code is the same as the histogram code but with a different function called after the &#8220;+&#8221;.\u00a0Now that plots all the height data. 
They then show you how to create separate KDEs for each gender:<\/p>\n<pre>ggplot(ch2, aes(x = Height, fill=Gender)) + geom_density() + facet_grid(Gender ~ .)<\/pre>\n<p><a href=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-109\" title=\"mlh_ch2_3\" src=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_3-300x300.png\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_3-300x300.png 300w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_3-150x150.png 150w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_3.png 672w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><br \/>\nHistograms are good for one variable. If you want to explore two numeric variables, you bust out a scatterplot. Continuing with ggplot, they do the following:<\/p>\n<pre>ggplot(ch2, aes(x = Height, y = Weight)) + geom_point()<\/pre>\n<p><a href=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-110\" title=\"mlh_ch2_4\" src=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_4-300x300.png\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_4-300x300.png 300w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_4-150x150.png 150w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_4.png 672w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><br \/>\nThere&#8217;s your scatterplot. 
Next they show how to easily run a smooth prediction line through the plot with a confidence band:<\/p>\n<pre>ggplot(ch2, aes(x = Height, y = Weight)) + geom_point() + geom_smooth()<\/pre>\n<p><a href=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_5.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-111\" title=\"mlh_ch2_5\" src=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_5-300x300.png\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_5-300x300.png 300w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_5-150x150.png 150w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_5.png 672w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><br \/>\nThen they show you again how to distinguish between the genders:<\/p>\n<pre>ggplot(ch2, aes(x = Height, y = Weight, color = Gender)) + geom_point()<\/pre>\n<p><a href=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_6.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-112\" title=\"mlh_ch2_6\" src=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_6-300x300.png\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_6-300x300.png 300w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_6-150x150.png 150w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_6.png 672w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><br \/>\nUsing this last scatterplot with genders called out,\u00a0they decide to give a sneak preview of machine learning using a simple classification model. 
First they\u00a0code the genders as Male=1 and Female=0, like so:<\/p>\n<pre>ch2 <- transform(ch2, Male = ifelse(Gender == 'Male', 1, 0))<\/pre>\n<p>Then they use glm to create a logit model that attempts to predict gender based on height and weight:<\/p>\n<pre>logit.model <- glm(Male ~ Height + Weight, data=ch2, family=binomial)<\/pre>\n<p>Finally they redraw the scatterplot, but this time use the logit model parameters to draw a \"separating hyperplane\" through the scatterplot:<\/p>\n<pre>ggplot(ch2, aes(x = Height, y = Weight, color = Gender)) + geom_point() +\r\n stat_abline(intercept = - coef(logit.model)[1] \/ coef(logit.model)[3],\r\n slope = - coef(logit.model)[2] \/ coef(logit.model)[3],\r\n geom = 'abline', color = 'black')<\/pre>\n<p><a href=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_9.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-113\" title=\"mlh_ch2_9\" src=\"http:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_9-300x300.png\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_9-300x300.png 300w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_9-150x150.png 150w, https:\/\/www.clayford.net\/statistics\/wp-content\/uploads\/2012\/11\/mlh_ch2_9.png 672w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><br \/>\nThe formula for the line is \\( y = - \\frac{\\alpha}{\\beta_{2}} - \\frac{\\beta_{1}}{\\beta_{2}}x \\), where x is Height and y is Weight. It comes from setting the estimated log-odds, \\( \\alpha + \\beta_{1}x + \\beta_{2}y \\), equal to 0 and solving for y. They don't tell you that in the book, but I thought I would state that for the record. Also the code in the book for this portion has typos and draws something different. What I provide above replicates Figure 2-31 in the book.<\/p>\n<p>Again I felt like this was a solid intro to the ggplot2 package. And of course it's never bad to review the basics. 
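One thing the chapter doesn&#8217;t do is check how well this classifier actually performs. As a quick aside (my addition, not from the book), you can compute its accuracy on the training data; this assumes the ch2 data frame and logit.model fit from above:<\/p>\n<pre>pred.male <- ifelse(predict(logit.model, type = 'response') > 0.5, 1, 0)\r\nmean(pred.male == ch2$Male)  # proportion classified correctly<\/pre>\n<p>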
\"You have to be sound in the fundamentals\" as sports analysts are fond of saying.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Chapter 2 of Machine Learning for Hackers is called Data Exploration. It explains means, medians, quartiles, variance, histograms, scatterplots, things&#8230; <a class=\"read-more\" href=\"https:\/\/www.clayford.net\/statistics\/machine-learning-for-hackers-chapter-2\/\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[15,13],"tags":[],"class_list":["post-106","post","type-post","status-publish","format-standard","hentry","category-machine-learning-for-hackers","category-using-r"],"_links":{"self":[{"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/posts\/106","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/comments?post=106"}],"version-history":[{"count":4,"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/posts\/106\/revisions"}],"predecessor-version":[{"id":893,"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/posts\/106\/revisions\/893"}],"wp:attachment":[{"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/media?parent=106"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/categories?post=106"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.clayford.net\/statistics\/wp-json\/wp\/v2\/tags?post=106"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}
}