Top 10 Classic Data Mining Algorithms (with Python Source Code)
http://blog.csdn.net/rss.html?type=column&column=top10-algorithm-dm
1. The Apriori Algorithm
If a transaction contains X, it is likely to also contain Y. Written as the association rule {X}→{Y}, this way of discovering relationships between items is called association analysis. The best-known problem in association analysis is market basket analysis: in supermarkets there is a curious phenomenon that customers who buy diapers often buy beer as well, i.e. {diapers}→{beer}. The story goes that wives ask their husbands to pick up diapers for the baby on the way home, and after buying the diapers the husbands usually grab their favorite beer too.
To judge how reasonable a rule is, two measures are introduced: support and confidence. For a data set of N transactions they are defined as

    support(X→Y) = σ(X ∪ Y) / N
    confidence(X→Y) = σ(X ∪ Y) / σ(X)

where σ(·) counts the transactions that contain an itemset. Support measures how frequently the itemset {X, Y} appears in the data set, while confidence measures how frequently Y appears among the transactions that contain X.
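As a quick illustration (a toy example of my own, not from the original post), the snippet below computes support and confidence for the rule {diapers}→{beer} over five hand-made transactions:

# Toy example: support and confidence for the rule {diapers} -> {beer}.
# The five transactions below are made up purely for illustration.
transactions = [
    {'bread', 'milk'},
    {'bread', 'diapers', 'beer', 'eggs'},
    {'milk', 'diapers', 'beer', 'cola'},
    {'bread', 'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers', 'cola'},
]
N = float(len(transactions))
count_X = sum(1 for t in transactions if {'diapers'} <= t)            # sigma(X)
count_XY = sum(1 for t in transactions if {'diapers', 'beer'} <= t)   # sigma(X u Y)
print('support    = %.2f' % (count_XY / N))               # 3/5 = 0.60
print('confidence = %.2f' % (count_XY / float(count_X)))  # 3/4 = 0.75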
For a data set containing d items, the number of possible rules is R = 3^d − 2^(d+1) + 1, so a brute-force approach is far too expensive. To address this, R. Agrawal and R. Srikant proposed the Apriori algorithm. Like most association analysis algorithms, Apriori works in two steps: (1) frequent itemset generation; (2) rule generation.
A priori is Latin for "from before", i.e. prior knowledge. The Apriori algorithm rests on a prior principle that could not be simpler: every subset of a frequent itemset is also frequent. Equivalently, once an itemset such as {beer} is found to be infrequent, every superset of it can be pruned without ever being counted.
Both frequent itemset generation and rule generation rely on this pruning; see [2] for details.
class associationRule:
    def __init__(self, dataSet):
        self.sentences = map(set, dataSet)
        self.minSupport = 0.5
        self.minConf = 0.98
        self.numSents = float(len(self.sentences))
        self.supportData = {}
        self.L = []
        self.ruleList = []

    def createC1(self):
        """create candidate itemsets of size 1 (C1)"""
        C1 = []
        for sentence in self.sentences:
            for word in sentence:
                if not [word] in C1:
                    C1.append([word])
        C1.sort()
        return map(frozenset, C1)

    def scan(self, Ck):
        """generate frequent itemsets Lk from candidate itemsets Ck"""
        wscnt = {}
        retList = []
        # calculate support for every itemset in Ck
        for words in Ck:
            for sentence in self.sentences:
                if words.issubset(sentence):
                    if not wscnt.has_key(words):
                        wscnt[words] = 1
                    else:
                        wscnt[words] += 1
        for key in wscnt:
            support = wscnt[key] / self.numSents
            if support >= self.minSupport:
                retList.append(key)
            self.supportData[key] = support
        self.L.append(retList)

    def aprioriGen(self, Lk, k):
        """candidate generation: merge a pair of frequent (k-1)-itemsets
        only if their first k-2 items are identical"""
        retList = []
        lenLk = len(Lk)
        for i in range(lenLk):
            for j in range(i + 1, lenLk):
                L1 = list(Lk[i])[:k - 2]
                L2 = list(Lk[j])[:k - 2]
                L1.sort()
                L2.sort()
                if L1 == L2:
                    retList.append(Lk[i] | Lk[j])
        return retList

    def apriori(self):
        """generate a list of frequent itemsets"""
        C1 = self.createC1()
        self.scan(C1)
        k = 2
        while k <= 3:  # only itemsets up to size 3 are generated here
            Ck = self.aprioriGen(self.L[k - 2], k)
            self.scan(Ck)
            k += 1

    def generateRules(self):
        """generate a list of rules"""
        for i in range(1, len(self.L)):  # only itemsets with two or more items
            for freqSet in self.L[i]:
                H1 = [frozenset([word]) for word in freqSet]
                if i > 1:
                    self.rulesFromConseq(freqSet, H1)
                else:
                    self.calcConf(freqSet, H1)  # itemsets with exactly two items

    def calcConf(self, freqSet, H):
        """calculate confidence, eliminate some rules by confidence-based pruning"""
        prunedH = []
        for conseq in H:
            conf = self.supportData[freqSet] / self.supportData[freqSet - conseq]
            if conf >= self.minConf:
                print "%s --> %s, conf=%.3f" % (map(str, freqSet - conseq), map(str, conseq), conf)
                self.ruleList.append((freqSet - conseq, conseq, conf))
                prunedH.append(conseq)
        return prunedH

    def rulesFromConseq(self, freqSet, H):
        """generate more association rules from freqSet+H"""
        m = len(H[0])
        if len(freqSet) > m + 1:  # try further merging
            Hmp1 = self.aprioriGen(H, m + 1)     # create new candidates Hm+1
            Hmp1 = self.calcConf(freqSet, Hmp1)  # keep only the consequents that survive pruning
            if len(Hmp1) > 1:
                self.rulesFromConseq(freqSet, Hmp1)

Reading the data set (the post tests on mushroom.dat; the script below reads 'test.txt'):

def read_file(raw_file):
    """read file: one transaction per line, items separated by whitespace"""
    return [sorted(list(set(e.split()))) for e in open(raw_file).read().strip().split('\n')]

def main():
    sentences = read_file('test.txt')
    assrules = associationRule(sentences)
    assrules.apriori()
    assrules.generateRules()

if __name__ == "__main__":
    main()
['76'] --> ['34'], conf=1.000
The source code runs quite slowly on some data sets and still needs some optimization. Several public data sets are available for testing association analysis.
2. References
[1] Peter Harrington, Machine Learning in Action.
[2] Tan et al., Introduction to Data Mining.
[Data Mining] Classification: Naive Bayes
1. Algorithm Overview
Naive Bayes is a commonly used supervised learning algorithm. It is easy to implement, requires no iteration, and rests on a solid mathematical foundation, namely Bayes' theorem.
This post uses spelling correction as an example to explain how a Naive Bayes classifier can be implemented. Given a word w typed by the user, the spell checker tries to infer the correct word c the user most likely intended. Of course, the input may itself be correct. For instance, for the input thew, the user may have meant the, or perhaps thaw. To resolve this, the classifier uses the posterior probability P(c|w), the probability that the intended word is c given that w was typed. To find the most likely c, we look for the c with the largest P(c|w), i.e. we solve argmax_c P(c|w).
By Bayes' theorem, P(c|w) = P(w|c) P(c) / P(w).
Since P(w) is the same for every candidate c, the problem is equivalent to argmax_c P(w|c) P(c).
P(w|c) is the probability that the user typed w while intending c, and P(c) can be estimated by counting how often c occurs in a text corpus. P(w|c), however, is not so easy to compute. Norvig [1] gives a simplified scheme: (1) if w is spelled correctly and appears in the corpus, return w; (2) otherwise, generate all candidates c at edit distance 1 from w and return the one with the highest corpus frequency; (3) otherwise, generate all candidates at edit distance 2 and return the most frequent one; (4) otherwise, return w itself. Two words are at edit distance 1 if one can be transformed into the other by a single deletion, transposition, replacement, or insertion.
import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    # word -> count, with a default count of 1 so unseen words are not impossible
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """all strings at edit distance 1 from word"""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    """known words at edit distance 2 from word"""
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
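A quick sanity check (my own usage sketch, assuming a corpus file big.txt such as the one Norvig distributes is present in the working directory; the expected outputs are the ones reported in Norvig's essay):

# Run after the block above; results depend on the corpus used.
print(correct('speling'))    # expected: 'spelling'  (edit distance 1)
print(correct('korrectud'))  # expected: 'corrected' (edit distance 2)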
2. References
[1] Peter Norvig, How to Write a Spelling Corrector.
[2] 阮一峰 (Ruan Yifeng), Bayesian Inference and Its Internet Applications (III): Spelling Check.
[Data Mining] Classification: Decision Tree
1. The ID3 Algorithm
ID3 is a classic decision tree algorithm; C4.5 and CART were both developed from it. The leaf nodes of a decision tree hold class labels, while the internal nodes hold attribute test conditions. Starting from the root, a record is checked against the test condition at each node and the appropriate branch is followed until a leaf is reached; the class label of that leaf is the predicted class of the record.
ID3 uses information gain as the splitting criterion, so the best split is the one that maximizes the information gain:

    information gain = entropy(parent) − Σ_j (N(v_j)/N) · entropy(v_j)

where v_j ranges over the children produced by the split, N(v_j) is the number of records in child v_j, and N is the number of records at the parent node.
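As a small worked example (numbers of my own, using the toy data set that appears in the code below: two 'yes' and three 'no' records), the parent entropy and the gain of splitting on the first feature can be computed directly:

from math import log

# Parent node: 2 'yes' and 3 'no' records.
parent = -(2/5.0)*log(2/5.0, 2) - (3/5.0)*log(3/5.0, 2)   # ~0.971

# Split on 'no surfacing': value 1 -> {yes, yes, no}, value 0 -> {no, no}.
child1 = -(2/3.0)*log(2/3.0, 2) - (1/3.0)*log(1/3.0, 2)   # ~0.918
child0 = 0.0                                              # pure node
gain = parent - (3/5.0)*child1 - (2/5.0)*child0           # ~0.420
print(gain)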
The ID3 algorithm proceeds as follows:
1. If all records at a node carry the same class label, stop splitting.
2. If no features remain to split on, assign the node the majority class and stop splitting.
3. Otherwise, choose the feature that gives the best split, create one branch per value of that feature, and build the subtrees recursively.
Source code (taken from [1]):
from math import log
import operator
import matplotlib.pyplot as plt

def calcEntropy(dataSet):
    """calculate the Shannon entropy"""
    numEntries = len(dataSet)
    labelCounts = {}
    for entry in dataSet:
        entry_label = entry[-1]
        if entry_label not in labelCounts:
            labelCounts[entry_label] = 0
        labelCounts[entry_label] += 1
    entropy = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        entropy -= prob * log(prob, 2)
    return entropy

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

def splitDataSet(dataSet, axis, pivot):
    """split dataset on a feature"""
    retDataSet = []
    for entry in dataSet:
        if entry[axis] == pivot:
            reduced_entry = entry[:axis]
            reduced_entry.extend(entry[axis + 1:])
            retDataSet.append(reduced_entry)
    return retDataSet

def bestFeatureToSplit(dataSet):
    """choose the best feature to split on"""
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcEntropy(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for axis in range(numFeatures):
        # create unique list of feature values
        featureList = [entry[axis] for entry in dataSet]
        uniqueFeaList = set(featureList)
        newEntropy = 0.0
        for value in uniqueFeaList:
            subDataSet = splitDataSet(dataSet, axis, value)
            prob = float(len(subDataSet)) / len(dataSet)
            newEntropy += prob * calcEntropy(subDataSet)
        infoGain = baseEntropy - newEntropy
        # find the best information gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = axis
    return bestFeature

def majorityVote(classList):
    """take a majority vote"""
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1  # count the vote for this class
    sortedClassCount = sorted(classCount.iteritems(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [entry[-1] for entry in dataSet]
    # stop when all classes are equal
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # when no more features, return the majority vote
    if len(dataSet[0]) == 1:
        return majorityVote(classList)
    bestFeature = bestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeature]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeature])
    featureList = [entry[bestFeature] for entry in dataSet]
    uniqueFeaList = set(featureList)
    # split the data set according to the values of the best feature
    for value in uniqueFeaList:
        subLabels = labels[:]  # copy labels per branch so recursive calls do not interfere
        subDataSet = splitDataSet(dataSet, bestFeature, value)
        myTree[bestFeatLabel][value] = createTree(subDataSet, subLabels)
    return myTree

Visualization of the classification result (figure omitted).
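For reference, a minimal run on the toy data set above (my own usage sketch; the nested dict shown is what Machine Learning in Action reports for this data):

# Build the tree from the toy data set defined in createDataSet().
myData, myLabels = createDataSet()
tree = createTree(myData, myLabels)
print(tree)
# expected: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}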
2. References
[1] Peter Harrington, Machine Learning in Action.
[Data Mining] Clustering: k-means
1. Algorithm Overview
In classification, a classifier is trained on a data set whose samples carry class labels and can then assign classes to unseen samples; this is known as supervised learning. When the training samples carry no class labels, clustering is needed instead: similar samples are grouped into the same cluster, where similarity is usually measured by distance. Clustering is known as unsupervised learning.
k-means is one of the most commonly used clustering algorithms; the k refers to the number of clusters. By the definition of clustering, every sample should be closer to the centroid of its own cluster than to the centroids of the other k−1 clusters. In essence, k-means minimizes the objective function

    J = Σ_{i=1..k} Σ_{x ∈ C_i} ||x − μ_i||²

where x is a sample point and C_i is the i-th cluster with centroid μ_i. The simplest effective representative of a cluster is the mean of all its sample points, i.e. the centroid; this is where the name "means" comes from.
The k-means algorithm proceeds as follows:
  Pick k initial centroids (usually at random).
  Repeat until convergence:
    For each sample, find the nearest of the k centroids and assign the sample to that centroid's cluster.
    Recompute the centroids; convergence is reached when the centroids no longer change.
The code follows [1, 2]; for a visualization of the result, see [2].
import numpy as np
import scipy.spatial.distance as ssd
import matplotlib.pyplot as plt

def read_file(fn):
    raw_file = open(fn)
    dataSet = []
    for raw_row in raw_file.readlines():
        row = raw_row.strip().split('\t')
        dataSet.append((float(row[0]), float(row[1])))
    return np.array(dataSet)

def firstCentroids(k, dataSet):
    """create the first (random) centroids within the range of the data"""
    num_columns = dataSet.shape[1]
    centroids = np.zeros((k, num_columns))
    for j in range(num_columns):
        minJ = min(dataSet[:, j])
        rangeJ = max(dataSet[:, j]) - minJ
        for i in range(k):
            centroids[i, j] = minJ + rangeJ * np.random.uniform(0, 1)
    return np.array(centroids)

def kmeans(k, dataSet):
    num_rows, num_columns = dataSet.shape
    centroids = firstCentroids(k, dataSet)
    # store the cluster each sample belongs to and the squared distance to it
    clusterAssment = np.zeros((num_rows, 2))
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # find the closest centroid for every sample
        for i in range(num_rows):
            minDis = np.inf
            minIndex = -1
            for j in range(k):
                distance = ssd.euclidean(dataSet[i, :], centroids[j, :])
                if distance < minDis:
                    minDis = distance
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDis ** 2
        # update the centroid locations
        for cent in range(k):
            ptsInCent = dataSet[np.nonzero(clusterAssment[:, 0] == cent)[0]]
            centroids[cent, :] = np.mean(ptsInCent, axis=0)
    return centroids, clusterAssment
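A minimal way to try the code (a sketch of my own, run after the block above: it generates random 2-D points instead of reading a tab-separated file, so the read_file step is skipped; note that with purely random initialization a centroid can occasionally end up with no points, in which case simply rerun):

# Cluster 100 random 2-D points into k=3 clusters and plot the result.
np.random.seed(0)
data = np.random.rand(100, 2)
centroids, assignments = kmeans(3, data)
plt.scatter(data[:, 0], data[:, 1], c=assignments[:, 0])
plt.scatter(centroids[:, 0], centroids[:, 1], marker='+', s=200)
plt.show()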
Drawbacks: k has to be chosen in advance; the result depends on the randomly chosen initial centroids, so the algorithm may converge to a local optimum; and it is sensitive to outliers.
2. References
[1] Peter Harrington, Machine Learning in Action.
[2] zouxy09, Machine Learning Algorithms and Python Practice (5): k-means Clustering.
[3] The Top Ten Algorithms in Data Mining, CRC Press.
[Data Mining] Classification: kNN
1. Algorithm Overview
The idea behind kNN is simple: compute the distance between the sample to be classified and every sample in the training set, take the k nearest training samples, count the class labels among these k samples, and assign the sample the majority class. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity.
The Iris data set is used for testing; the code follows [1].
import numpy as np
import scipy.spatial.distance as ssd

def read_data(fn):
    """read dataset and separate it into characteristics data and label data"""
    # read dataset file
    with open(fn) as f:
        raw_data = np.loadtxt(f, delimiter=',', dtype="float",
                              skiprows=1, usecols=None)
    # initialize
    charac = []
    label = []
    # obtain input characteristics and label
    for row in raw_data:
        charac.append(row[:-1])
        label.append(int(row[-1]))
    return np.array(charac), np.array(label)

def knn(k, dtrain, dtest, dtr_label):
    """k-nearest neighbors algorithm"""
    pred_label = []
    # for each instance in the test dataset, calculate
    # distances with respect to the train dataset
    for di in dtest:
        distances = []
        for ij, dj in enumerate(dtrain):
            distances.append((ssd.euclidean(di, dj), ij))
        # sort the distances to get the k nearest neighbors
        k_nn = sorted(distances)[:k]
        # classify according to the most common label
        dlabel = []
        for dis, idtr in k_nn:
            dlabel.append(dtr_label[idtr])
        pred_label.append(np.argmax(np.bincount(dlabel)))
    return pred_label

def evaluate(result):
    """evaluate the predicted labels"""
    eval_result = np.zeros(2, int)
    for x in result:
        # pred_label == dte_label
        if x == 0:
            eval_result[0] += 1
        # pred_label != dte_label
        else:
            eval_result[1] += 1
    return eval_result

dtrain, dtr_label = read_data('iris-train.csv')
dtest, dte_label = read_data('iris-test.csv')

K = [1, 3, 7, 11]
print "knn classification result for iris data set:\n"
print "k | number of correct/wrong classified test records"
for k in K:
    pred_label = knn(k, dtrain, dtest, dtr_label)
    eval_result = evaluate(pred_label - dte_label)
    # print the evaluated result to the screen
    print k, " | ", eval_result[0], "/", eval_result[1]
print
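If the iris-train.csv / iris-test.csv files are not at hand, the knn() function above can be exercised on a tiny hand-made data set (a sketch of my own, not part of the original post):

# Two well-separated 2-D classes: label 0 near the origin, label 1 near (5, 5).
toy_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
toy_labels = np.array([0, 0, 1, 1])
toy_test = np.array([[0.1, 0.1], [4.8, 5.2]])
print(knn(3, toy_train, toy_test, toy_labels))   # expected: [0, 1]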
2. References
[1] M. Saad Nurul Ishlah, Python: Simple K Nearest Neighbours Classifier.