
## Data Science from Scratch: First Principles with Python (2015)

### Chapter 11. Machine Learning

I am always ready to learn although I do not always like being taught.

Winston Churchill

# What Is Machine Learning?

- Predicting whether an email message is spam or not
- Predicting whether a credit card transaction is fraudulent
- Predicting which advertisement a shopper is most likely to click on
- Predicting which football team is going to win the Super Bowl

# Overfitting and Underfitting

```python
import random

def split_data(data, prob):
    """split data into fractions [prob, 1 - prob]"""
    results = [], []
    for row in data:
        # each row lands in the first list with probability prob
        results[0 if random.random() < prob else 1].append(row)
    return results
```

```python
def train_test_split(x, y, test_pct):
    data = zip(x, y)                              # pair corresponding values
    train, test = split_data(data, 1 - test_pct)  # split the data set of pairs
    x_train, y_train = zip(*train)                # magical un-zip trick
    x_test, y_test = zip(*test)
    return x_train, x_test, y_train, y_test
```

```python
# SomeKindOfModel, xs, and ys are placeholders for your own model and data
model = SomeKindOfModel()
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.33)
model.train(x_train, y_train)
performance = model.test(x_test, y_test)
```
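To see why the held-out test set matters, here is a minimal sketch (not from the original text) of a deliberately overfit model. `MemorizingModel` is a hypothetical class that simply stores every training pair, the labels are random noise, and the split is a seeded shuffle-and-slice rather than a per-row coin flip, so that the run is reproducible.

```python
import random

class MemorizingModel:
    """Hypothetical model that overfits by storing every training pair."""

    def train(self, xs, ys):
        self.table = dict(zip(xs, ys))
        labels = list(ys)
        # unseen inputs fall back to the most common training label
        self.default = max(set(labels), key=labels.count)

    def predict(self, x):
        return self.table.get(x, self.default)

    def test(self, xs, ys):
        correct = sum(1 for x, y in zip(xs, ys) if self.predict(x) == y)
        return correct / len(xs)

rng = random.Random(0)                   # seeded for reproducibility
xs = list(range(200))                    # 200 distinct inputs
ys = [rng.random() < 0.5 for _ in xs]    # labels are pure noise

pairs = list(zip(xs, ys))
rng.shuffle(pairs)
cut = len(pairs) * 2 // 3                # roughly two-thirds for training
x_train, y_train = zip(*pairs[:cut])
x_test, y_test = zip(*pairs[cut:])

model = MemorizingModel()
model.train(x_train, y_train)
train_acc = model.test(x_train, y_train)
test_acc = model.test(x_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

Because the labels carry no signal, the perfect training score says nothing about new data; the test score is the honest estimate and should land near chance.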

# Correctness

- True positive: “This message is spam, and we correctly predicted spam.”
- False positive (Type 1 Error): “This message is not spam, but we predicted spam.”
- False negative (Type 2 Error): “This message is spam, but we predicted not spam.”
- True negative: “This message is not spam, and we correctly predicted not spam.”

| | Spam | not Spam |
|---|---|---|
| predict “Spam” | True Positive | False Positive |
| predict “Not Spam” | False Negative | True Negative |
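As a rough illustration of how the four cells of a confusion matrix get filled in, the sketch below tallies them from paired predictions and true labels. The function name `confusion_counts` and the toy lists are assumptions made for this example, not part of the original text.

```python
def confusion_counts(predicted, actual):
    """Count (tp, fp, fn, tn) from parallel lists of booleans,
    where True means "spam"."""
    tp = fp = fn = tn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1        # predicted spam, really spam
        elif p and not a:
            fp += 1        # predicted spam, really not spam
        elif a:
            fn += 1        # predicted not spam, really spam
        else:
            tn += 1        # predicted not spam, really not spam
    return tp, fp, fn, tn

# toy data, made up for illustration
predicted = [True, True, False, False, True]
actual    = [True, False, True, False, True]
counts = confusion_counts(predicted, actual)
print(counts)   # (2, 1, 1, 1)
```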

| | leukemia | no leukemia | total |
|---|---|---|---|
| “Luke” | 70 | 4,930 | 5,000 |
| not “Luke” | 13,930 | 981,070 | 995,000 |
| total | 14,000 | 986,000 | 1,000,000 |

```python
def accuracy(tp, fp, fn, tn):
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

print(accuracy(70, 4930, 13930, 981070))     # 0.98114
```

```python
def precision(tp, fp, fn, tn):
    # fraction of positive predictions that were correct
    return tp / (tp + fp)

print(precision(70, 4930, 13930, 981070))    # 0.014
```

```python
def recall(tp, fp, fn, tn):
    # fraction of actual positives that were identified
    return tp / (tp + fn)

print(recall(70, 4930, 13930, 981070))       # 0.005
```

```python
def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)

    return 2 * p * r / (p + r)
```
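The F1 score is the harmonic mean of precision and recall, so it divides by zero when a model produces no true positives at all. The sketch below is one possible guard, applied to the “Luke” counts; the name `safe_f1` and the choice to return 0.0 in the degenerate cases are assumptions for this example, not the author's definition.

```python
def safe_f1(tp, fp, fn):
    """F1 score that returns 0.0 instead of dividing by zero.
    (Hypothetical helper; the 0.0 fallback is a convention, not a rule.)"""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

luke_f1 = safe_f1(70, 4930, 13930)
print(round(luke_f1, 4))       # 0.0074
print(safe_f1(0, 0, 0))        # 0.0
```

A score this close to zero confirms what the 98% accuracy hides: the “Luke” test is nearly useless as a predictor.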

# Feature Extraction and Selection

- Does the email contain the word “Viagra”?
- How many times does the letter d appear?
- What was the domain of the sender?
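The three questions above each yield a different kind of feature: a yes/no value, a number, and a choice from a discrete set. A minimal sketch of extracting them might look like the following, assuming an email is represented as a dict with `sender` and `body` keys (a made-up format for this example).

```python
def extract_features(email):
    """Turn one email (a dict with hypothetical 'sender' and 'body' keys)
    into the three example features."""
    body = email["body"].lower()
    return {
        "contains_viagra": "viagra" in body,                   # yes/no
        "d_count": body.count("d"),                            # numeric
        "sender_domain": email["sender"].rsplit("@", 1)[-1],   # categorical
    }

# made-up message for illustration
email = {
    "sender": "promo@example.com",
    "body": "Buy discounted Viagra today",
}
features = extract_features(email)
print(features)
```

Which kind of feature you have constrains which models you can use, so it helps to keep the three types distinct from the start.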

# For Further Exploration

- Keep reading! The next several chapters are about different families of machine-learning models.
- The Coursera Machine Learning course is the original MOOC and is a good place to get a deeper understanding of the basics of machine learning. The Caltech Machine Learning MOOC is also good.
- *The Elements of Statistical Learning* is a somewhat canonical textbook that can be downloaded online for free. But be warned: it’s very mathy.
