Archive for the ‘Machine Learning’ Category

List of machine learning algorithms supported in Mahout 0.7


Here is a list of Mahout built-in algorithms available in 0.7:

  •   arff.vector: : Generate Vectors from an ARFF file or directory
  •   baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  •   canopy: : Canopy clustering
  •   cat: : Print a file or resource as the logistic regression models would see it
  •   cleansvd: : Cleanup and verification of SVD output
  •   clusterdump: : Dump cluster output to text
  •   clusterpp: : Groups Clustering Output In Clusters
  •   cmdump: : Dump confusion matrix in HTML or text formats
  •   cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  •   cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  •   dirichlet: : Dirichlet Clustering
  •   eigencuts: : Eigencuts spectral clustering
  •   evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  •   fkmeans: : Fuzzy K-means clustering
  •   fpg: : Frequent Pattern Growth
  •   hmmpredict: : Generate random sequence of observations by given HMM
  •   itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  •   kmeans: : K-means clustering
  •   lucene.vector: : Generate Vectors from a Lucene index
  •   matrixdump: : Dump matrix in CSV format
  •   matrixmult: : Take the product of two matrices
  •   meanshift: : Mean Shift clustering
  •   minhash: : Run Minhash clustering
  •   parallelALS: : ALS-WR factorization of a rating matrix
  •   recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  •   recommenditembased: : Compute recommendations using item-based collaborative filtering
  •   regexconverter: : Convert text files on a per line basis based on regular expressions
  •   rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  •   rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  •   runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  •   runlogistic: : Run a logistic regression model against CSV data
  •   seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  •   seq2sparse: : Sparse Vector generation from Text sequence files
  •   seqdirectory: : Generate sequence files (of Text) from a directory
  •   seqdumper: : Generic Sequence File dumper
  •   seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  •   seqwiki: : Wikipedia xml dump to sequence file
  •   spectralkmeans: : Spectral k-means clustering
  •   split: : Split Input data into test and train sets
  •   splitDataset: : split a rating dataset into training and probe parts
  •   ssvd: : Stochastic SVD
  •   svd: : Lanczos Singular Value Decomposition
  •   testnb: : Test the Vector-based Bayes classifier
  •   trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  •   trainlogistic: : Train a logistic regression using stochastic gradient descent
  •   trainnb: : Train the Vector-based Bayes classifier
  •   transpose: : Take the transpose of a matrix
  •   validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  •   vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  •   vectordump: : Dump vectors from a sequence file to text
  •   viterbi: : Viterbi decoding of hidden states from given output states sequence

To see detailed usage for any of the commands above, pass its name to the mahout script. For example:

$ mahout kmeans

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths>              comma separated archives to be unarchived
                               on the compute machines.
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-files <paths>                 comma separated files to be copied to the
                               map reduce cluster
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-libjars <paths>               comma separated jar files to include in the
                               classpath.
-tokenCacheFile <tokensFile>   name of the file with the tokens

Missing required option --clusters

Usage:
[--input <input> --output <output> --distanceMeasure <distanceMeasure>
--clusters <clusters> --numClusters <k> --convergenceDelta <convergenceDelta>
--maxIter <maxIter> --overwrite --clustering --method <method>
--outlierThreshold <outlierThreshold> --help --tempDir <tempDir> --startPhase
<startPhase> --endPhase <endPhase>]
--clusters (-c) clusters    The input centroids, as Vectors.  Must be a
                            SequenceFile of Writable, Cluster/Canopy.  If k is
                            also specified, then a random set of vectors will
                            be selected and written out to this path first
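
To illustrate how the options in the usage message fit together, here is a small Python sketch that assembles a `mahout kmeans` command line without running it. The flag names are taken from the usage text above. The helper name `build_kmeans_command`, the paths, and the default values are hypothetical and chosen only for illustration.

```python
def build_kmeans_command(input_path, output_path, clusters_path,
                         k=None, max_iter=10, convergence_delta=0.5):
    """Assemble an argument list for `mahout kmeans` (nothing is executed)."""
    cmd = [
        "mahout", "kmeans",
        "--input", input_path,
        "--output", output_path,
        "--clusters", clusters_path,  # required, per the error message above
        "--maxIter", str(max_iter),
        "--convergenceDelta", str(convergence_delta),
        "--overwrite",
        "--clustering",
    ]
    if k is not None:
        # Per the usage text, giving k makes Mahout write a random set of
        # vectors to the clusters path first.
        cmd += ["--numClusters", str(k)]
    return cmd


# Hypothetical paths, for illustration only.
command = build_kmeans_command("vectors", "kmeans-out", "initial-clusters", k=5)
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run` on a machine where the mahout script is on the PATH.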

K-means clustering and comparison of K-means clustering algorithms in Python, Java and R

June 2, 2012

What is K-means Clustering:

The k-means algorithm is one of the simplest clustering techniques commonly used in data analytics.

Original Definition :

The k-means algorithm is an evolutionary algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It then assigns each observation to clusters based upon the observation’s proximity to the mean of the cluster. The cluster’s mean is then recomputed and the process begins again.
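
The assign-then-recompute loop described in this definition can be sketched in a few lines of plain Python. This is a minimal illustration and not a production implementation: it assumes squared Euclidean distance, starts from the first k observations (a simplification; real implementations usually pick starting points more carefully), and stops when no observation changes cluster. The function names are my own.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    centroids = list(points[:k])  # simplistic start: the first k observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest cluster mean.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # nothing moved, so the means will not change either
        labels = new_labels
        # Update step: recompute each mean from its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old mean if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(data, k=2)
print(labels)
print(centroids)
```

On this toy data both starting points lie in the same group, yet after two rounds the first three observations share one cluster and the last three share the other.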

A very simple example:

If you have a collection of dogs and want to cluster them by gender, K = 2 is the natural choice.

A more complex example:

If you want to cluster the same collection of dogs by size, you could use K = 5 (extra large, large, medium, small, extra small). If you clustered them by breed family instead, you might pick K = 50 simply because you only know 50 dog families, and that may not be a suitable choice for the data.

K-means clustering algorithms in Python (a Jython driver script for Apache Pig):

#!/usr/bin/python
import sys
from math import fabs
from org.apache.pig.scripting import Pig

filename = "student.txt"
k = 4
tolerance = 0.01

MAX_SCORE = 4
MIN_SCORE = 0
MAX_ITERATION = 100

# initial centroids: divide the score range into k equal parts
initial_centroids = ""
last_centroids = [None] * k
for i in range(k):
    last_centroids[i] = MIN_SCORE + float(i)/k*(MAX_SCORE-MIN_SCORE)
    initial_centroids = initial_centroids + str(last_centroids[i])
    if i != k-1:
        initial_centroids = initial_centroids + ":"

# one k-means iteration as a Pig script; $centroids is bound on every pass
P = Pig.compile("""register udf.jar
                   DEFINE find_centroid FindCentroid('$centroids');
                   raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
                   centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
                   grouped = group centroided by centroid;
                   result = foreach grouped generate group, AVG(centroided.gpa);
                   store result into 'output';
                """)

converged = False
iter_num = 0
while iter_num < MAX_ITERATION:
    Q = P.bind({'centroids': initial_centroids})
    results = Q.runSingle()
    # isSuccessful() returns a boolean, so test it directly
    if not results.isSuccessful():
        raise RuntimeError("Pig job failed")
    rows = results.result("result").iterator()
    centroids = [None] * k
    distance_move = 0
    # get the new centroids of this iteration and calculate how far they moved
    for i in range(k):
        row = rows.next()
        centroids[i] = float(str(row.get(1)))
        distance_move = distance_move + fabs(last_centroids[i]-centroids[i])
    distance_move = distance_move / k
    Pig.fs("rmr output")
    print("iteration " + str(iter_num))
    print("average distance moved: " + str(distance_move))
    if distance_move < tolerance:
        sys.stdout.write("k-means converged at centroids: [")
        sys.stdout.write(",".join(str(v) for v in centroids))
        sys.stdout.write("]\n")
        converged = True
        break
    last_centroids = centroids[:]
    initial_centroids = ""
    for i in range(k):
        initial_centroids = initial_centroids + str(last_centroids[i])
        if i != k-1:
            initial_centroids = initial_centroids + ":"
    iter_num += 1

if not converged:
    print("did not converge after " + str(iter_num) + " iterations")
    sys.stdout.write("last centroids: [")
    sys.stdout.write(",".join(str(v) for v in last_centroids))
    sys.stdout.write("]\n")

 

K-means clustering algorithms in Java (the FindCentroid Pig UDF registered by the script above):

 

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Pig UDF that maps a value to the nearest of a fixed set of centroids.
public class FindCentroid extends EvalFunc<Double> {
    double[] centroids;

    // initialCentroid is a colon-separated list of centroid values
    public FindCentroid(String initialCentroid) {
        String[] centroidStrings = initialCentroid.split(":");
        centroids = new double[centroidStrings.length];
        for (int i = 0; i < centroidStrings.length; i++)
            centroids[i] = Double.parseDouble(centroidStrings[i]);
    }

    @Override
    public Double exec(Tuple input) throws IOException {
        double min_distance = Double.MAX_VALUE;
        double closest_centroid = 0;
        for (double centroid : centroids) {
            double distance = Math.abs(centroid - (Double) input.get(0));
            if (distance < min_distance) {
                min_distance = distance;
                closest_centroid = centroid;
            }
        }
        return closest_centroid;
    }
}
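
For readers who do not use Java, the lookup performed by the UDF above can be mirrored in Python. This is only a sketch of the same logic: it parses a colon-separated centroid string, as built by the driver script, and returns the centroid closest to a value. The function name `find_centroid` is borrowed from the Pig alias, and the sample value is made up.

```python
def find_centroid(centroid_string, value):
    """Return the centroid closest to value; centroids are colon-separated."""
    centroids = [float(c) for c in centroid_string.split(":")]
    # min() keeps the first centroid on ties, like the strict '<' in the Java loop.
    return min(centroids, key=lambda c: abs(c - value))


# "0.0:1.0:2.0:3.0" is the string the driver builds for k = 4 over the range 0 to 4.
print(find_centroid("0.0:1.0:2.0:3.0", 2.4))
```

A GPA of 2.4 is assigned to the centroid 2.0, since that is the nearest of the four starting values.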

 

K-means clustering algorithms in R (a MapReduce formulation using mapreduce, keyval and from.dfs, as in the RHadoop rmr package):

kmeans =
  function(points, ncenters, iterations = 10, distfun = NULL) {
    if(is.null(distfun))
      distfun =
        function(a,b) norm(as.matrix(a-b), type = 'F')
    newCenters =
      kmeans.iter(
        points,
        distfun,
        ncenters = ncenters)
    for(i in 1:iterations) {
      newCenters = kmeans.iter(points, distfun, centers = newCenters)}
    newCenters}

kmeans.iter =
  function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
    from.dfs(mapreduce(input = points,
                       map =
                         if (is.null(centers)) {
                           function(k,v) keyval(sample(1:ncenters,1),v)}
                         else {
                           function(k,v) {
                             distances = apply(centers, 1, function(c) distfun(c,v))
                             keyval(centers[which.min(distances),], v)}},
                       reduce = function(k,vv) keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
             to.data.frame = T)}
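
The map and reduce steps in the R listing can be hard to see through the nesting, so here is one iteration written out in plain Python as a rough analogue. It runs in memory with no Hadoop involved, keys each point by the index of its nearest center (the R code keys by the center itself), and uses squared Euclidean distance. All names and the toy data are my own.

```python
from collections import defaultdict


def nearest(centers, point):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], point)))


def kmeans_iteration(points, centers):
    # Map: emit (index of nearest center, point) pairs.
    pairs = [(nearest(centers, p), p) for p in points]
    # Shuffle: group the emitted points by key.
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    # Reduce: average each group column by column to get the new centers.
    return [tuple(sum(col) / len(col) for col in zip(*groups[key]))
            for key in sorted(groups)]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(3):
    centers = kmeans_iteration(points, centers)
print(centers)  # [(0.0, 1.0), (10.0, 11.0)]
```

As in the R version, a center that attracts no points simply disappears from the output of that iteration.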

 

 

Pattern Analysis Computation Methods and Algorithms for Machine Learning


In machine learning, pattern recognition is the assignment of a label to a given input value. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes (for example, determine whether a given email is “spam” or “non-spam”). However, pattern recognition is a more general problem that encompasses other types of output as well. Other examples are regression, which assigns a real-valued output to each input; sequence labeling, which assigns a class to each member of a sequence of values (for example, part of speech tagging, which assigns a part of speech to each word in an input sentence); and parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence.
Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to do “fuzzy” matching of inputs. This is opposed to pattern matching algorithms, which look for exact matches in the input with pre-existing patterns. A common example of a pattern-matching algorithm is regular expression matching, which looks for patterns of a given sort in textual data and is included in the search capabilities of many text editors and word processors. In contrast to pattern recognition, pattern matching is generally not considered a type of machine learning, although pattern-matching algorithms (especially with fairly general, carefully tailored patterns) can sometimes succeed in providing similar-quality output to the sort provided by pattern-recognition algorithms.
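
The contrast between exact pattern matching and "fuzzy" matching can be shown with the Python standard library. This sketch is only an analogy: `difflib` ranks strings by similarity and is not a learned model, and the word list and the 0.8 cutoff are arbitrary choices for illustration.

```python
import difflib
import re

words = ["colour", "color", "colr", "collar", "cooler"]

# Pattern matching: a regular expression accepts only strings that fit exactly.
exact = [w for w in words if re.fullmatch(r"colou?r", w)]

# Fuzzy matching: rank candidates by similarity to a query and keep the close ones.
fuzzy = difflib.get_close_matches("color", words, n=3, cutoff=0.8)

print(exact)
print(fuzzy)
```

The regular expression accepts only the two spellings it was written for, while the similarity ranking also admits the misspelling "colr" and still rejects "collar" and "cooler".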

 

Pattern Analysis Computation Methods:

  • Ridge regression
  • Regularized Fisher discriminant
  • Regularized kernel Fisher discriminant
  • Maximizing variance
  • Maximizing covariance
  • Canonical correlation analysis
  • Kernel CCA
  • Regularized CCA
  • Kernel regularized CCA
  • Smallest enclosing hyper sphere
  • Soft minimal hyper sphere
  • nu-soft minimal hyper sphere
  • Hard margin SVM
  • 1-norm soft margin SVM
  • 2-norm soft margin SVM
  • Ridge regression optimization
  • Quadratic e-insensitive
  • Linear e-insensitive SVR
  • nu-SVR 
  • Soft ranking 
  • Cluster quality 
  • Cluster optimization strategy 
  • Multiclass clustering
  • Relaxed multiclass clustering 
  • Visualization quality

Pattern Analysis Algorithms:

  • Normalization 
  • Centering data 
  • Simple novelty detection 
  • Parzen based classifier 
  • Cholesky decomposition or dual Gram-Schmidt
  • Standardizing data 
  • Kernel Fisher discriminant 
  • Primal PCA 
  • Kernel PCA 
  • Whitening 
  • Primal CCA 
  • Kernel CCA 
  • Principal components regression 
  • PLS feature extraction 
  • Primal PLS 
  • Kernel PLS 
  • Smallest hyper sphere enclosing data 
  • Soft hyper sphere minimization 
  • nu-soft minimal hyper sphere 
  • Hard margin SVM 
  • Alternative hard margin SVM 
  • 1-norm soft margin SVM 
  • nu-SVM
  • 2-norm soft margin SVM
  • Kernel ridge regression
  • 2-norm SVR
  • 1-norm SVR
  • nu-support vector regression
  • Kernel perceptron
  • Kernel adatron 
  • On-line SVR
  • nu-ranking
  • On-line ranking
  • Kernel k-means
  • MDS for kernel-embedded data
  • Data visualization

Source: http://www.kernel-methods.net/algos.html

Keywords: Data Mining, Machine Learning, Algorithms

 

Details on Scikit-Learn Python based Machine Learning Library


scikit-learn is a Python-based machine learning library released as open source under the BSD license.

URL:

Dependency:

  • Python >= 2.6
  • numpy >= 1.3
  • scipy >= 0.7

Includes supervised learning algorithms:

  • Generalized Linear Models with scipy.sparse bindings for wide-feature datasets
  • Support Vector Machine (SVM) based on libsvm
  • Stochastic Gradient Descent
  • Bayesian methods
  • Gaussian Processes
  • Nearest Neighbors
  • Partial Least Squares
  • Naive Bayes
  • Decision Trees
  • Ensemble methods
  • Multiclass and multilabel algorithms
  • Feature selection
  • L1 and L1+L2 regularized regression methods aka Lasso and Elastic Net models implemented with algorithms such as LARS and coordinate descent
  • Linear and Quadratic Discriminant Analysis

Includes unsupervised clustering algorithms:

  • Gaussian mixture models
  • kmeans++
  • meanshift
  • affinity propagation
  • Manifold learning
  • spectral clustering
  • Decomposing signals in components (matrix factorization problems)
  • Covariance estimation
  • Novelty and Outlier Detection
  • Hidden Markov Models (HMMs)

Includes other tools:

  • feature extractors for text content (token and char ngrams + hashing vectorizer)
  • univariate feature selections
  • a simple pipeline tool
  • numerous implementations of cross validation strategies
  • performance metrics evaluation and plotting (ROC curve, AUC, confusion matrix, …)
  • a grid search utility to perform hyper-parameters tuning using parallel cross validation
  • integration with joblib for caching partial results when working in interactive environment (e.g. using ipython)
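
To make the idea of a cross validation strategy concrete, here is a small sketch of k-fold splitting in plain Python. It is not scikit-learn's API, only an illustration of the technique: the data is cut into consecutive folds, and each fold serves once as the test set while the rest is used for training. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(test)
```

With 10 samples and 3 folds the test sets have sizes 4, 3 and 3, and together they cover every index once.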

Samples:

  • Each algorithm implementation comes with sample programs demonstrating its usage on either toy data or real-life datasets.

Source code:

  • Get the source code from GitHub
Categories: Machine Learning, Python

Top most algorithms used in Data Mining


I am trying to compile a comprehensive list of data mining algorithms, and while doing so I found that a top 10 list can be put together in several ways.

According to a scientific research paper, the following top 10 data mining algorithms were identified by the IEEE International Conference on Data Mining (ICDM) in December 2006. They are among the most influential data mining algorithms in the research community:

  1. C4.5
  2. k-Means
  3. SVM
  4. Apriori
  5. EM
  6. PageRank
  7. AdaBoost
  8. kNN
  9. Naive Bayes
  10. CART

Public Voting:

  1. Decision Trees/Rules
  2. Regression
  3. Clustering
  4. Statistics (descriptive)
  5. Visualization
  6. Time series/Sequence analysis
  7. Support Vector (SVM)
  8. Association rules
  9. Ensemble methods
  10. Text Mining
  11. Neural Nets
  12. Boosting
  13. Bayesian
  14. Bagging
  15. Factor Analysis
  16. Anomaly/Deviation detection
  17. Social Network Analysis
  18. Survival Analysis
  19. Genetic algorithms
  20. Uplift modeling

Based on voting on the Mahout user mailing list, here is the list:

  1. Matrix factorization (SVD)
  2. k-means
  3. Naive Bayes
  4. Dirichlet Process Clustering
  5. Matrix Factorization
  6. Frequent Pattern Matching
  7. LDA
  8. Expectation Maximization
  9. SVM
  10. Decision Trees
  11. Logistic Regression
  12. Random Forest

Resources: 

Machine Learning Libraries in Python

April 25, 2012

Here is a collection of Machine Learning Libraries in Python:

PyBrain (http://pybrain.org/)

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.

mlpy (http://mlpy.sourceforge.net/)

mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is Open Source, distributed under the GNU General Public License version 3.

PyML: Machine Learning in Python (http://pyml.sourceforge.net/)
PyML is an interactive object oriented framework for machine learning written in Python. PyML focuses on SVMs and other kernel methods. It is supported on Linux and Mac OS X.
Features

  • Classifiers: support vector machines, nearest neighbor, ridge regression
  • Multi-class methods (one-against-rest and one-against-one)
  • Feature selection (filter methods, RFE)
  • Model selection
  • Preprocessing and normalization
  • Syntax for combining classifiers
  • Classifier testing (cross-validation, error rates, ROC curves)
  • Various kernels for biological sequences (several variants of the spectrum kernel, and the weighted-degree kernel).

Shogun – A Large Scale Machine Learning Toolbox (http://www.shogun-toolbox.org/)
The machine learning toolbox’s focus is on large scale kernel methods and especially on Support Vector Machines (SVM) [1]. It provides a generic SVM object interfacing to several different SVM implementations, among them the state of the art OCAS [21], Liblinear [20], LibSVM [2], SVMLight [3], SVMLin [4] and GPDT [5]. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid Kernel, but also comes with a number of recent string kernels such as the Locality Improved [6], Fisher [7], TOP [8], Spectrum [9] and Weighted Degree Kernel (with shifts) [10] [11] [12]. For the latter the efficient LINADD [12] optimizations are implemented. For linear SVMs the COFFIN framework [22][23] allows for computing feature spaces on the fly, even allowing sparse, dense and other data types to be mixed. Furthermore, SHOGUN offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel, which can be constructed as a weighted linear combination of a number of sub-kernels, each of which need not work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning [13] [14] [18] [19]. Currently SVM one-class, 2-class and multiclass classification and regression problems can be dealt with. However, SHOGUN also implements a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM) and (Kernel) Perceptrons, and features algorithms to train hidden Markov models. The input feature-objects can be dense, sparse or strings, of type int/short/double/char, and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing for on-the-fly pre-processing.
SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python and is proudly released as Machine Learning Open Source Software.

MDP – Modular toolkit for Data Processing (http://mdp-toolkit.sourceforge.net/)
Modular toolkit for Data Processing (MDP) is a Python data processing framework.
From the user’s perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
From the scientific developer’s perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new implemented units are then automatically integrated with the rest of the library.
The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.

Orange (http://orange.biolab.si/)

Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.

Milk: Machine Learning Toolkit (http://packages.python.org/milk/)
Milk is a machine learning toolkit in Python. Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.
For unsupervised learning, milk supports k-means clustering and affinity propagation. Milk is flexible about its inputs. It is optimised for numpy arrays, but can often handle anything (for example, for SVMs, you can use any datatype and any kernel and it does the right thing).
There is a strong emphasis on speed and low memory usage. Therefore, most of the performance sensitive code is in C++. This is behind Python-based interfaces for convenience.

 

scikit-learn: machine learning in Python (http://scikit-learn.sourceforge.net/stable/)

scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.

 

Categories: Machine Learning, Python