Archive
K-means clustering and comparison of K-means clustering algorithms in Python, Java and R
What is K-means Clustering:
The k-means algorithm is one of the simplest clustering techniques commonly used in data analytics.
Original Definition :
The k-means algorithm is an evolutionary algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It then assigns each observation to clusters based upon the observation’s proximity to the mean of the cluster. The cluster’s mean is then recomputed and the process begins again.
A very Simple Example:
If you have a collection of dog and you would want to cluster them based on their gender then K =2 would be perfect.
A very complex Example:
If you have the same collection of dog and you would want to cluster them based on their size you could put K=5 (Extra Large, large, medium, small, extra small) however if you would use their family, you may be using K=50, may be cause you just only know 50 types of dog families and this may not be suitable scenario.
K-means clustering algorithms in Python:
#!/usr/bin/python
import sys
from math import fabs
from org.apache.pig.scripting import Pig
filename = “student.txt”
k = 4
tolerance = 0.01
MAX_SCORE = 4
MIN_SCORE = 0
MAX_ITERATION = 100
# initial centroid, equally divide the space
initial_centroids = “”
last_centroids = [None] * k
for i in range(k):
last_centroids[i] = MIN_SCORE + float(i)/k*(MAX_SCORE-MIN_SCORE)
initial_centroids = initial_centroids + str(last_centroids[i])
if i!=k-1:
initial_centroids = initial_centroids + “:”
P = Pig.compile(“”"register udf.jar
DEFINE find_centroid FindCentroid(‘$centroids’);
raw = load ‘student.txt’ as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into ‘output’;
“”")
converged = False
iter_num = 0
while iter_num<MAX_ITERATION:
Q = P.bind({‘centroids’:initial_centroids})
results = Q.runSingle()
if results.isSuccessful() == “FAILED”:
raise “Pig job failed”
iter = results.result(“result”).iterator()
centroids = [None] * k
distance_move = 0
# get new centroid of this iteration, caculate the moving distance with last iteration
for i in range(k):
tuple = iter.next()
centroids[i] = float(str(tuple.get(1)))
distance_move = distance_move + fabs(last_centroids[i]-centroids[i])
distance_move = distance_move / k;
Pig.fs(“rmr output”)
print(“iteration ” + str(iter_num))
print(“average distance moved: ” + str(distance_move))
if distance_move<tolerance:
sys.stdout.write(“k-means converged at centroids: [")
sys.stdout.write(",".join(str(v) for v in centroids))
sys.stdout.write("]\n”)
converged = True
break
last_centroids = centroids[:]
initial_centroids = “”
for i in range(k):
initial_centroids = initial_centroids + str(last_centroids[i])
if i!=k-1:
initial_centroids = initial_centroids + “:”
iter_num += 1
if not converged:
print(“not converge after ” + str(iter_num) + ” iterations”)
sys.stdout.write(“last centroids: [")
sys.stdout.write(",".join(str(v) for v in last_centroids))
sys.stdout.write("]\n”)
K-means clustering algorithms in Java:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class FindCentroid extends EvalFunc<Double> {
double[] centroids;
public FindCentroid(String initialCentroid) {
String[] centroidStrings = initialCentroid.split(“:”);
centroids = new double[centroidStrings.length];
for (int i=0;i<centroidStrings.length;i++)
centroids[i] = Double.parseDouble(centroidStrings[i]);
}
@Override
public Double exec(Tuple input) throws IOException {
double min_distance = Double.MAX_VALUE;
double closest_centroid = 0;
for (double centroid : centroids) {
double distance = Math.abs(centroid – (Double)input.get(0));
if (distance < min_distance) {
min_distance = distance;
closest_centroid = centroid;
}
}
return closest_centroid;
}
}
K-means clustering algorithms in R:
kmeans =
function(points, ncenters, iterations = 10, distfun = NULL) {
if(is.null(distfun))
distfun =
function(a,b) norm(as.matrix(a-b), type = ‘F’)
newCenters =
kmeans.iter(
points,
distfun,
ncenters = ncenters)
for(i in 1:iterations) {
newCenters = kmeans.iter(points, distfun, centers = newCenters)}
newCenters}
kmeans.iter =
function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
from.dfs(mapreduce(input = points,
map =
if (is.null(centers)) {
function(k,v) keyval(sample(1:ncenters,1),v)}
else {
function(k,v) {
distances = apply(centers, 1, function(c) distfun(c,v))
keyval(centers[which.min(distances),], v)}},
reduce = function(k,vv) keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
to.data.frame = T)}
Details on Scikit-Learn Python based Machine Learning Library
SCiKit-Learn Python based Machine Learning Library which is open sources through BSD.
URL:
Dependency:
- Python >= 2.6
- numpy > = 1.3
- scipy >= 0.7
Includes supervised learning algorithms:
- Generalized Linear Model with with scipy.sparse bindings for wide features datasets
- Support Vector Machine (SVM) based on libsvm
- Stochastic Gradient Descent
- bayesian methods
- Gaussian Processes
- Nearest Neighbors
- Partial Least Squares
- Naive Bayes
- Decision Trees
- Ensemble methods
- Multiclass and multilabel algorithms
- Feature selection
- L1 and L1+L2 regularized regression methods aka Lasso and Elastic Net models implemented with algorithms such as LARS and coordinate descent
- Linear and Quadratic Discriminant Analysis
Includes unsupervised clustering algorithms:
- Gaussian mixture models
- kmeans++
- meanshift
- affinity propagation
- Manifold learning
- spectral clustering
- Decomposing signals in components (matrix factorization problems)
- Covariance estimation
- Novelty and Outlier Detection
- Hidden Markov Models (HMMs)
Include other tools:
- feature extractors for text content (token and char ngrams + hashing vectorizer)
- univariate feature selections
- a simple pipe line tool
- numerous implementations of cross validation strategies
- performance metrics evaluation and ploting (ROC curve, AUC, confusion matrix, …)
- a grid search utility to perform hyper-parameters tuning using parallel cross validation
- integration with joblib for caching partial results when working in interactive environment (e.g. using ipython)
Samples:
- Each algorithm implementation comes with sample programs demonstrating it’s usage either on toy data or real life datasets.
Source code:
- Get the source code from Git-hub
My Experiment with Python based Open source data visualization and analysis Tool – Orange (Part 1)
Install Orange 2.0b for Windows completely which will install pyMl, PythonWin, QT and several other Python libraries along with Orange.
With Orange you can access regular tab or comma delimited data or you can use C4.5 data which is separated into two separate file as *.data and *.names. Mainly Orange supports C4.5, Assistant, Retis, and tab-delimited (native Orange) data formats.
Loading Tab Delimited Data:
Launch PythonWin and have a Tab Delimited content to load it
>>> import orange
>>> print orange.version
2.0b (19:12:26, Feb 14 2012)
>>> data = orange.ExampleTable(“C:\Python27\lenses.tab”)
>>> print data.domain.attributes
<Orange.feature.Discrete ‘age’, Orange.feature.Discrete ‘prescription’, Orange.feature.Discrete ‘astigmatic’, Orange.feature.Discrete ‘tear_rate’>
>>>
>>> for i in data.domain.attributes:
… print i.name,
… print
…
age
prescription
astigmatic
tear_rate
>>> for i in range(3):
… print data[i]
…
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
>>> for i in range(5):
… print data[i]
…
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
['young', 'myope', 'yes', 'normal', 'hard']
['young', 'hypermetrope', 'no', 'reduced', 'none']
Loading C4.5 Delimited Data:
Launch PythonWin and have a C4.5 content to load it
>>> import os
>>> os.chdir(“c:/Python27/ascdata”)
>>> os.listdir(os.curdir)
['car.data', 'car.names', 'mydata.txt']
>>> car_data = orange.ExampleTable(“car”)
>>> print car_data.domain.attributes
<Orange.feature.Discrete ‘buying’, Orange.feature.Discrete ‘maint’, Orange.feature.Discrete ‘doors’, Orange.feature.Discrete ‘persons’, Orange.feature.Discrete ‘lugboot’, Orange.feature.Discrete ‘safety’>
>>> for i in range(10):
… print car_data[i]
…
['v-high', 'v-high', '2', '2', 'small', 'low', 'unacc']
['v-high', 'v-high', '2', '2', 'small', 'med', 'unacc']
['v-high', 'v-high', '2', '2', 'small', 'high', 'unacc']
['v-high', 'v-high', '2', '2', 'med', 'low', 'unacc']
['v-high', 'v-high', '2', '2', 'med', 'med', 'unacc']
['v-high', 'v-high', '2', '2', 'med', 'high', 'unacc']
['v-high', 'v-high', '2', '2', 'big', 'low', 'unacc']
['v-high', 'v-high', '2', '2', 'big', 'med', 'unacc']
['v-high', 'v-high', '2', '2', 'big', 'high', 'unacc']
['v-high', 'v-high', '2', '4', 'small', 'low', 'unacc']
Machine Learning Libraries in Python
Here is a collection of Machine Learning Libraries in Python:
PyBrain (http://pybrain.org/)
PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.
mlPy (http://mlpy.sourceforge.net/)
mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is Open Source, distributed under the GNU General Public License version 3.
PyML: Machine Learning in Python (http://pyml.sourceforge.net/)
PyML is an interactive object oriented framework for machine learning written in Python. PyML focuses on SVMs and other kernel methods. It is supported on Linux and Mac OS X.
Features
- Classifiers: support vector machines, nearest neighbor, ridge regression
- Multi-class methods (one-against-rest and one-against-one)
- Feature selection (filter methods, RFE)
- Model selection
- Preprocessing and normalization
- Syntax for combining classifiers
- Classifier testing (cross-validation, error rates, ROC curves)
- Various kernels for biological sequences (several variants of the spectrum kernel, and the weighted-degree kernel).
Shogun – A Large Scale Machine Learning Toolbox (http://www.shogun-toolbox.org/)
The machine learning toolbox’s focus is on large scale kernel methods and especially on Support Vector Machines (SVM) [1]. It provides a generic SVM object interfacing to several different SVM implementations, among them the state of the art OCAS [21], Liblinear [20], LibSVM [2], SVMLight, [3] SVMLin [4] and GPDT [5]. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid Kernel but also comes with a number of recent string kernels as e.g. the Locality Improved [6], Fischer [7], TOP [8], Spectrum [9], Weighted Degree Kernel (with shifts) [10] [11] [12]. For the latter the efficient LINADD [12] optimizations are implemented. For linear SVMs the COFFIN framework [22][23] allows for on-demand computing feature spaces on-the-fly, even allowing to mix sparse, dense and other data types. Furthermore, SHOGUN offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel which can be constructed by a weighted linear combination of a number of sub-kernels, each of which not necessarily working on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning [13] [14] [18] [19]. Currently SVM one-class, 2-class and multiclass classification and regression problems can be dealt with. However SHOGUN also implements a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM), (Kernel) Perceptrons and features algorithms to train hidden markov models. The input feature-objects can be dense, sparse or strings and of type int/short/double/char and can be converted into different feature types. Chains of preprocessors (e.g. substracting the mean) can be attached to each feature object allowing for on-the-fly pre-processing.
SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python and is proudly released as Machine Learning Open Source Software.
MDP – Modular toolkit for Data Processing (http://mdp-toolkit.sourceforge.net/)
Modular toolkit for Data Processing (MDP) is a Python data processing framework.
From the user’s perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
From the scientific developer’s perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new implemented units are then automatically integrated with the rest of the library.
The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.
Orange http://orange.biolab.si/
Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.
MILK: MACHINE LEARNING TOOLKIT (http://packages.python.org/milk/)
Milk is a machine learning toolkit in Python. Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.
For unsupervised learning, milk supports k-means clustering and affinity propagation. Milk is flexible about its inputs. It optimised for numpy arrays, but can often handle anything (for example, for SVMs, you can use any dataype and any kernel and it does the right thing).
There is a strong emphasis on speed and low memory usage. Therefore, most of the performance sensitive code is in C++. This is behind Python-based interfaces for convenience.
scikit-learn: machine learning in Python (http://scikit-learn.sourceforge.net/stable/)
scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.
