How to Get Started with Machine Learning: Installing Python, SciKit-Learn Environment, The Fundamentals of Machine Learning and using an Estimator to Recognize Numbers

I'm going to learn Python for machine learning, specifically Scikit-Learn, and I'm going to document the process while I learn it: live, kind of. I'm starting from a position of having never looked at a Python script before. I am literate and semi-functional with the likes of C++, Java, HTML/CSS, VBA, SQL, and Maple. I'll be coming at this from a beginner's level, and I'll also note down how long I spend learning this stuff, to give a realistic gauge of what it takes.

I'll drop the citation right here for now, since this entire post is about learning Scikit:

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Clearly, the first thing I need to do is install software that will let me run Python, plus the components Scikit-Learn needs. According to the scikit website those are Python (3.3/+), NumPy (1.8.2/+), and SciPy (0.13.3/+).

- Some time passes reading up on how to do this 

Retro-Related: Getting into Coding and How it Feels When Learning a New Programming Language.

Installing Software to use Python and SciKit-Learn

There are basically three choices for installing:
  1. Canopy - Scientific Python Distribution & Integrated Analysis Environment
  2. Anaconda - world’s most popular Python data science platform.
  3. WinPython - Python for Windows, that is all.
All three support Windows, but I'm going with Anaconda because it appears to have wider support and adoption. I downloaded the 64-bit graphical installer for Python 3.6, and it ran a setup wizard that says "Welcome to Anaconda3 5.0.1 (64-bit) Setup". Click 'NEXT' a couple of times, select the install directory and watch as it unpacks 2.4GB of software.

While waiting for it to install, I checked out the Anaconda Cheat Sheet and Documentation.

Now you'll find Anaconda Navigator and Anaconda Prompt on the Start Menu. Navigator is a GUI, whereas Prompt is the one to use if you'd rather type commands. I'll start by trying out the GUI: launch Navigator, then click Environments > Create + > Name/Version/R > Create. Within the newly created environment there's a drop-down menu that defaults to "Installed"; change it to "All", then search for 'scikit'. Select 'scikit-learn' and click Apply. A popup shows all the packages that will be installed as dependencies; click Apply again.

After the scikit-learn packages finish installing, press the play button beside the environment name to launch it in a terminal or in the Python shell. Now we are ready.
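
A quick way to confirm the environment really has everything is to try importing each package from the Python shell. The following is only a minimal sketch: the helper name installed_versions is made up for illustration, and it assumes each package exposes a __version__ attribute (scikit-learn, NumPy and SciPy all do).

```python
import importlib


def installed_versions(module_names):
    """Return {module name: version string, or None if the import fails}.

    Hypothetical helper for illustration; not part of Anaconda or scikit-learn.
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # package is missing from this environment
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Note that scikit-learn is imported under the module name "sklearn".
report = installed_versions(["sklearn", "numpy", "scipy"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line prints NOT INSTALLED, the shell was probably launched from a different environment than the one where the packages were applied.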

Time invested so far: 2 hours while simultaneously running ETH Miner and watching the latest Mr. Robot episode.

Some Very Simple Fundamentals of Machine Learning

Let's get into the basics with the scikit-learn tutorial. I like this line:

"a learning problem considers a set of n samples of data and then tries to predict properties of unknown data."

Supervised learning is either classification, where sample data is sorted into a number of specific categories (text recognition or face ID, for example), or regression, where the desired outcome is a continuous value that is a function of many variables. Then there's unsupervised learning, where you might examine a data set for clusters or determine the distribution of the data.

Usually, you would take a data set and split it in two. One part is the training set, on which you learn things about the data; the other is the testing set, on which you test the things you learned.
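
To make the split concrete, here is a minimal sketch in plain Python. The function split_dataset and its parameters test_fraction and seed are hypothetical names chosen for illustration, and the toy data is invented. In practice scikit-learn ships its own helper for this, train_test_split in sklearn.model_selection.

```python
import random


def split_dataset(samples, test_fraction=0.5, seed=0):
    """Shuffle a copy of `samples` and return (training set, testing set).

    Hypothetical helper for illustration only.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)               # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: (sample, label) pairs where the label is 1 for odd numbers.
data = [(x, x % 2) for x in range(10)]
train, test = split_dataset(data, test_fraction=0.5)
print(len(train), "training samples,", len(test), "testing samples")
```

Shuffling before slicing matters when the data is stored in some order (all the zeros first, say), because otherwise the training set could miss whole categories.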

Basic Machine Learning Sample Problem, Example with Training an Estimator Classifier to Recognize Numbers.

Back in the environment, select "Open with Python" and it launches the Python shell. From here I followed the scikit-learn tutorial and typed:


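>>> from sklearn import datasets
>>> iris = datasets.load_iris()      # loaded in the tutorial, not used below
>>> digits = datasets.load_digits()  # images of handwritten digits
>>> from sklearn import svm
>>> # create a support vector classifier, then train it on all but the last sample
>>> clf = svm.SVC(gamma=0.001, C=100.)
>>> clf.fit(digits.data[:-1], digits.target[:-1])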
OUTPUT:SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

The object returned by datasets.load_digits() is dictionary-like and holds images of handwritten digits. You can look at one by typing "digits.images[7]", which prints the array of pixel values for a handwritten seven. Once you've loaded the iris and digits datasets, you import svm from sklearn, which provides the SVC estimator class. Then you create the clf estimator and fit it, passing the input data and the target labels. The slice [:-1] takes the whole digits data set except the last sample, hence the minus one.
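
To see what is actually inside the digits data set, a few shape checks help. This is a small sketch using only attributes that load_digits() provides (data, images, target); the expected values in the comments are for the data set as bundled with scikit-learn.

```python
from sklearn import datasets

digits = datasets.load_digits()

# data holds one flattened row of 64 pixel values per image.
print(digits.data.shape)       # (1797, 64)

# images holds the same pixels arranged as 8x8 grids.
print(digits.images.shape)     # (1797, 8, 8)

# Sample 7 is an 8x8 array of pixel intensities, labelled as the digit 7.
print(digits.images[7].shape)  # (8, 8)
print(digits.target[7])        # 7

# The flattened row and the 8x8 image contain the same numbers.
same_pixels = bool((digits.data[7] == digits.images[7].ravel()).all())
print(same_pixels)             # True
```

The estimator is trained on digits.data (the flattened rows), not on digits.images, which is why fit() receives digits.data in the session above.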

Now that the estimator is trained, you can run the clf.predict function on the last sample, the one that was left out of training:


>>> clf.predict(digits.data[-1:])
output: array([8])
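
Predicting a single held-out digit doesn't say much about how good the classifier is. A rough sketch of a fuller check, reusing the same SVC settings but training on the first half of the data and testing on the second half, could look like the following. The half-and-half split and the variable names are assumptions for illustration, not something taken from the tutorial session above.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
n_samples = len(digits.data)
half = n_samples // 2

# Train on the first half of the samples only.
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:half], digits.target[:half])

# Predict the second half, which the estimator has never seen.
predicted = clf.predict(digits.data[half:])

# Fraction of held-out digits whose predicted label matches the true label.
accuracy = (predicted == digits.target[half:]).mean()
print(f"accuracy on {n_samples - half} held-out digits: {accuracy:.3f}")
```

The exact figure depends on the scikit-learn version and on how the data is split, so treat the printed number as a rough gauge rather than a benchmark.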


Well, that's it. Another 90 minutes have passed, so in 3.5 hours I have:
  1. Learned what Anaconda Python is and how to install it.
  2. Learned how to install packages and create the SciKit-Learn environment.
  3. Learned about how Machine Learning algorithms function with data sets.
  4. Run my first Machine Learning algorithm to read a number from an array and classify what the number looks like. It was an 8.

The Next Steps: Explore some additional basics of Machine Learning Classification.
