Top 10 Classic Data Mining Algorithms (with Python source code)
http://blog.csdn.net/rss.html?type=column&column=top10-algorithm-dm
1. The Apriori Algorithm
If a transaction contains X, it is likely to also contain Y; written as a rule, this is {X}→{Y}. The task of discovering such relationships between items is called association analysis. The best-known problem in association analysis is the market-basket problem: supermarket data shows a curious pattern in which customers who buy diapers often also buy beer, i.e. {diapers}→{beer}. The story goes that wives asked their husbands to pick up diapers for the baby on the way home, and after grabbing the diapers the husbands would often also buy the beer they liked.
To judge how reasonable a rule is, two measures are introduced: support and confidence. For a rule {X}→{Y} over N transactions they are defined as support({X}→{Y}) = σ(X ∪ Y) / N and confidence({X}→{Y}) = σ(X ∪ Y) / σ(X), where σ(·) counts the transactions containing an itemset. Support measures how frequently the itemset (X, Y) appears in the data set, while confidence measures how frequently Y appears in transactions that contain X.
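As a quick check of these definitions, here is a minimal sketch that computes the support and confidence of {diapers}→{beer} on a small made-up list of transactions (the five baskets below are invented purely for illustration):
#toy example: support and confidence of the rule {diapers} -> {beer}
#(the five transactions are made up purely for illustration)
transactions = [
    set(['diapers', 'beer', 'milk']),
    set(['diapers', 'beer']),
    set(['diapers', 'bread']),
    set(['beer', 'bread']),
    set(['diapers', 'beer', 'bread']),
]
N = float(len(transactions))
count_X  = sum(1 for t in transactions if set(['diapers']).issubset(t))          #sigma(X)
count_XY = sum(1 for t in transactions if set(['diapers', 'beer']).issubset(t))  #sigma(X u Y)
print("support = %.2f"    % (count_XY / N))        #3/5 = 0.60
print("confidence = %.2f" % (count_XY / count_X))  #3/4 = 0.75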
For a data set with d distinct items, the number of possible rules is R = 3^d − 2^(d+1) + 1, so a brute-force enumeration is far too expensive. To address this, R. Agrawal and R. Srikant proposed the Apriori algorithm. Like most association-analysis algorithms, Apriori works in two steps: (1) frequent itemset generation and (2) rule generation.
"A priori" is Latin for "from before". The Apriori algorithm rests on a prior that could hardly be simpler: every subset of a frequent itemset is also frequent (equivalently, if an itemset is infrequent, all of its supersets are infrequent).
Both frequent itemset generation and rule generation use this principle for pruning; see [2] for details, and the small sketch below.
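For instance, candidate pruning can be expressed as a small standalone check: a candidate k-itemset can be discarded without scanning the data if any of its (k−1)-subsets is not frequent. The helper below is only a sketch of this idea (has_infrequent_subset is not part of the class that follows):
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """return True if some (k-1)-subset of the candidate k-itemset is not in
    frequent_prev (the set of frequent (k-1)-itemsets), so the candidate can
    be pruned without scanning the data"""
    k = len(candidate)
    for subset in combinations(candidate, k - 1):
        if frozenset(subset) not in frequent_prev:
            return True
    return False

#e.g. {A,B,C} can be pruned here because {B,C} is not frequent
frequent_2 = set([frozenset(['A', 'B']), frozenset(['A', 'C'])])
print(has_infrequent_subset(frozenset(['A', 'B', 'C']), frequent_2))  #True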
class associationRule:
    def __init__(self,dataSet):
        self.sentences=map(set,dataSet)  #list of transactions, each as a set of items
        self.minSupport=0.5
        self.minConf=0.98
        self.numSents=float(len(self.sentences))
        self.supportData={}
        self.L=[]
        self.ruleList=[]

    def createC1(self):
        """create candidate itemsets of size 1 (C1)"""
        C1=[]
        for sentence in self.sentences:
            for word in sentence:
                if not [word] in C1:
                    C1.append([word])
        C1.sort()
        return map(frozenset,C1)

    def scan(self,Ck):
        """generate frequent itemsets Lk from candidate itemsets Ck"""
        wscnt={}
        retList=[]
        #calculate support for every itemset in Ck
        for words in Ck:
            for sentence in self.sentences:
                if words.issubset(sentence):
                    if not wscnt.has_key(words): wscnt[words]=1
                    else: wscnt[words]+=1
        for key in wscnt:
            support=wscnt[key]/self.numSents
            if support>=self.minSupport:
                retList.append(key)
            self.supportData[key]=support
        self.L.append(retList)

    def aprioriGen(self,Lk,k):
        """the candidate generation: merge a pair of frequent (k-1)-itemsets
        only if their first k-2 items are identical
        """
        retList=[]
        lenLk=len(Lk)
        for i in range(lenLk):
            for j in range(i+1,lenLk):
                L1=list(Lk[i])[:k-2]; L2=list(Lk[j])[:k-2]
                L1.sort(); L2.sort()
                if L1==L2:
                    retList.append(Lk[i]|Lk[j])
        return retList

    def apriori(self):
        """generate a list of frequent itemsets"""
        C1=self.createC1()
        self.scan(C1)
        k=2
        while(k<=3):  #only generates frequent itemsets up to size 3
            Ck=self.aprioriGen(self.L[k-2],k)
            self.scan(Ck)
            k+=1

    def generateRules(self):
        """generate a list of rules"""
        for i in range(1,len(self.L)): #consider only itemsets with two or more items
            for freqSet in self.L[i]:
                H1=[frozenset([word]) for word in freqSet]
                if(i>1): self.rulesFromConseq(freqSet,H1)
                else: self.calcConf(freqSet,H1) #itemsets with exactly two items

    def calcConf(self,freqSet,H):
        """calculate confidence, eliminate some rules by confidence-based pruning"""
        prunedH=[]
        for conseq in H:
            conf=self.supportData[freqSet]/self.supportData[freqSet-conseq]
            if conf>=self.minConf:
                print "%s --> %s, conf=%.3f"%(map(str,freqSet-conseq), map(str,conseq), conf)
                self.ruleList.append((freqSet-conseq,conseq,conf))
                prunedH.append(conseq)
        return prunedH

    def rulesFromConseq(self,freqSet,H):
        """generate more association rules from freqSet+H"""
        m=len(H[0])
        if len(freqSet)>m+1: #try further merging
            Hmp1=self.aprioriGen(H,m+1) #create new candidates Hm+1
            Hmp1=self.calcConf(freqSet,Hmp1) #keep only consequents that pass the confidence test
            if len(Hmp1)>1:
                self.rulesFromConseq(freqSet,Hmp1)
Reading the data set (e.g. the mushroom.dat data set):
def read_file(raw_file):
    """read the data file: one transaction per line, items separated by whitespace"""
    return [sorted(list(set(e.split()))) for e in
            open(raw_file).read().strip().split('\n')]

def main():
    sentences=read_file('test.txt')  #replace with e.g. 'mushroom.dat'
    assrules=associationRule(sentences)
    assrules.apriori()
    assrules.generateRules()

if __name__=="__main__":
    main()
Sample output:
['76'] --> ['34'], conf=1.000
The source code runs slowly on some data sets and still needs some optimization. There are a number of publicly available data sets for testing association analysis.
2. References
[1] Peter Harrington, Machine Learning in Action.
[2] Tan et al., Introduction to Data Mining.
[Data Mining] Classification: Naive Bayes
1. Algorithm Overview
Naive Bayes is a commonly used supervised learning algorithm. It is easy to implement, requires no iteration, and rests on a solid mathematical foundation, namely Bayes' theorem.
This post uses spelling correction as an example to show how a Naive Bayes classifier works. Given a word w typed by the user, the spelling checker tries to infer the most likely intended word c. Of course, the typed word may already be correct. For example, if the user types thew, they may have meant the, or they may have meant thaw. To resolve this, the classifier uses the posterior probability P(c|w), the probability that the intended word is c given that w was typed. To find the most likely c, we look for the c with the largest P(c|w), i.e. we solve argmax_c P(c|w).
By Bayes' theorem, P(c|w) = P(w|c) P(c) / P(w)
Since P(w) is the same for every candidate c, the problem is equivalent to argmax_c P(w|c) P(c)
P(w|c) is the probability that the user types w when the intended word is c. P(c) can be estimated by counting how often c occurs in a text corpus, but P(w|c) is not so easy to compute. Norvig [1] gives a simplified shortcut:
(1) if w is spelled correctly and appears in the corpus, return w;
(2) otherwise, generate all candidates c at edit distance 1 from w and return the one that occurs most frequently in the corpus;
(3) otherwise, generate all candidates c at edit distance 2 from w and return the one that occurs most frequently in the corpus;
(4) if none of the above applies, return w.
Two words are at edit distance 1 if one can be turned into the other by a single deletion, transposition, replacement or insertion; for a word of length n this gives n deletions, n−1 transpositions, 26n replacements and 26(n+1) insertions, i.e. 54n+25 candidate strings before removing duplicates.
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    """count word frequencies; unseen words get a default count of 1 (smoothing)"""
    model=collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """all strings at edit distance 1: deletions, transpositions, replacements, insertions"""
    splits=[(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes=[a + b[1:] for a, b in splits if b]
    transposes=[a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces=[a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts=[a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    """strings at edit distance 2 that actually occur in the corpus"""
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    """pick the most frequent candidate, preferring smaller edit distance"""
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
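A minimal usage sketch, assuming big.txt is present so that NWORDS has been built as above; the corrections shown in the comments are what one would typically expect, but the actual output depends on the corpus:
#assuming NWORDS has been built from big.txt as above
print(correct('speling'))    #typically 'spelling' (edit distance 1)
print(correct('korrecter'))  #typically 'corrector' (edit distance 2)
print(correct('thew'))       #'the' or 'thaw', depending on the corpus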
2. References
[1] Peter Norvig, How to Write a Spelling Corrector.
[2] 阮一峰 (Ruan Yifeng), Bayesian Inference and Its Internet Applications (3): Spelling Check.
[Data Mining] Classification: Decision Tree
1. The ID3 Algorithm
ID3 is a classic decision tree algorithm; both C4.5 and CART were developed from it. In a decision tree the leaf nodes carry class labels and the internal nodes carry attribute test conditions. Starting from the root, a record is checked against the test condition at each node and sent down the corresponding branch until it reaches a leaf; the class label of that leaf is the predicted class of the record.
ID3 uses information gain as the splitting criterion, so the best split is the one with the largest information gain:
information gain = entropy of the parent node − weighted sum of the entropies of the child nodes,
where the entropy of a node with class proportions p_i is −Σ_i p_i log2(p_i) and each child is weighted by the fraction of records it receives.
The ID3 algorithm proceeds as follows:
1. if all records at a node have the same class label, stop splitting;
2. if there are no features left to split on, assign the majority class label to the node and stop splitting;
3. otherwise choose the feature giving the best split, create one branch per value of that feature, and build the subtrees recursively.
Source code (adapted from [1]):
from math import log
import operator
import matplotlib.pyplot as plt  #only needed for the tree plotting code, which is omitted here

def calcEntropy(dataSet):
    """calculate the Shannon entropy"""
    numEntries=len(dataSet)
    labelCounts={}
    for entry in dataSet:
        entry_label=entry[-1]
        if entry_label not in labelCounts:
            labelCounts[entry_label]=0
        labelCounts[entry_label]+=1
    entropy=0.0
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries
        entropy-=prob*log(prob,2)
    return entropy

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    return dataSet, labels

def splitDataSet(dataSet,axis,pivot):
    """split dataset on a feature: keep entries whose feature `axis` equals `pivot`,
    with that feature column removed"""
    retDataSet=[]
    for entry in dataSet:
        if entry[axis]==pivot:
            reduced_entry=entry[:axis]
            reduced_entry.extend(entry[axis+1:])
            retDataSet.append(reduced_entry)
    return retDataSet

def bestFeatureToSplit(dataSet):
    """choose the best feature to split on (largest information gain)"""
    numFeatures=len(dataSet[0])-1
    baseEntropy=calcEntropy(dataSet)
    bestInfoGain=0.0; bestFeature=-1
    for axis in range(numFeatures):
        #create unique list of the values of this feature
        featureList=[entry[axis] for entry in dataSet]
        uniqueFeaList=set(featureList)
        newEntropy=0.0
        for value in uniqueFeaList:
            subDataSet=splitDataSet(dataSet,axis,value)
            prob=float(len(subDataSet))/len(dataSet)
            newEntropy+=prob*calcEntropy(subDataSet)
        infoGain=baseEntropy-newEntropy
        #find the best information gain
        if infoGain>bestInfoGain:
            bestInfoGain=infoGain
            bestFeature=axis
    return bestFeature

def majorityVote(classList):
    """take a majority vote"""
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1
    sortedClassCount=sorted(classCount.iteritems(),
                            key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList=[entry[-1] for entry in dataSet]
    #stop when all classes are equal
    if classList.count(classList[0])==len(classList):
        return classList[0]
    #when no more features, return majority vote
    if len(dataSet[0])==1:
        return majorityVote(classList)
    bestFeature=bestFeatureToSplit(dataSet)
    bestFeatLabel=labels[bestFeature]
    myTree={bestFeatLabel:{}}
    del(labels[bestFeature])
    featureList=[entry[bestFeature] for entry in dataSet]
    uniqueFeaList=set(featureList)
    #split dataset according to the values of the best feature
    for value in uniqueFeaList:
        subLabels=labels[:]  #copy the labels so recursive calls do not interfere with each other
        subDataSet=splitDataSet(dataSet,bestFeature,value)
        myTree[bestFeatLabel][value]=createTree(subDataSet,subLabels)
    return myTree
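A minimal usage sketch with the toy data set defined above; the printed nested dictionary is the learned tree:
#build and print the tree for the toy data set above
dataSet, labels = createDataSet()
myTree = createTree(dataSet, labels[:])  #pass a copy, since createTree modifies the label list
print(myTree)
#expected: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}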
Visualization of the classification result (plotting code and figure omitted here).
2. References
[1] Peter Harrington, Machine Learning in Action.
[Data Mining] Clustering: k-means
1. Algorithm Overview
Classification means that a classifier, trained on a training set whose samples have known class labels, can assign classes to new, unlabeled samples; it is a form of supervised learning. When the training samples carry no class labels, clustering is needed instead: similar samples are grouped into the same cluster, where similarity is usually measured by distance. Clustering is a form of unsupervised learning.
k-means is one of the most commonly used clustering algorithms; the k refers to the number of clusters. By the definition of clustering, a sample should be closer to the centroid of its own cluster than to the centroids of the other k−1 clusters. In essence, k-means minimizes the objective function J = Σ_{j=1..k} Σ_{x in C_j} ||x − μ_j||², where x is a sample point and C_j is the j-th cluster. The simplest effective representative of a cluster is the mean of all its sample points, i.e. the centroid μ_j; this is where the name "means" comes from.
The k-means algorithm proceeds as follows:
choose k initial centroids (usually at random)
repeat until convergence {
    for every sample, find the nearest of the k centroids and assign the sample to that centroid's cluster
    recompute the centroids; when the centroids no longer change, the algorithm has converged
}
The code follows [1, 2]; see [2] for visualizing the result.
import numpy as np
import scipy.spatial.distance as ssd
import matplotlib.pyplot as plt

def read_file(fn):
    raw_file=open(fn)
    dataSet=[]
    for raw_row in raw_file.readlines():
        row=raw_row.strip().split('\t')
        dataSet.append((float(row[0]),float(row[1])))
    return np.array(dataSet)

def firstCentroids(k,dataSet):
    """create the first centroids, drawn uniformly at random within the range of the data"""
    num_columns=dataSet.shape[1]
    centroids=np.zeros((k,num_columns))
    for j in range(num_columns):
        minJ=min(dataSet[:,j])
        rangeJ=max(dataSet[:,j])-minJ
        for i in range(k):
            centroids[i,j]=minJ+rangeJ*np.random.uniform(0,1)
    return np.array(centroids)

def kmeans(k,dataSet):
    num_rows,num_columns=dataSet.shape
    centroids=firstCentroids(k,dataSet)
    #store, for every sample, the cluster it belongs to and the squared distance to its centroid
    clusterAssment=np.zeros((num_rows,2))
    clusterChanged=True
    while clusterChanged:
        clusterChanged=False
        #find the closest centroid for every sample
        for i in range(num_rows):
            minDis=np.inf;minIndex=-1
            for j in range(k):
                distance=ssd.euclidean(dataSet[i,:],centroids[j,:])
                if distance<minDis:
                    minDis=distance;minIndex=j
            if(clusterAssment[i,0]!=minIndex): clusterChanged=True
            clusterAssment[i,:]=minIndex,minDis**2
        #update the centroid locations
        for cent in range(k):
            ptsInCent=dataSet[np.nonzero(clusterAssment[:,0]==cent)[0]]
            centroids[cent,:]=np.mean(ptsInCent,axis=0)
    return centroids,clusterAssment
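A minimal usage sketch, assuming a tab-separated file of 2-D points such as the testSet.txt used in [1] (the file name here is only an assumption):
#run k-means with k=4 on a tab-separated file of 2-D points
#(testSet.txt is the file name used in [1]; adjust it to your own data)
dataSet = read_file('testSet.txt')
centroids, clusterAssment = kmeans(4, dataSet)
print(centroids)                     #final cluster centers
print(np.sum(clusterAssment[:, 1]))  #sum of squared errors, the objective being minimized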
Drawbacks: the number of clusters k must be specified in advance; the result depends on the randomly chosen initial centroids, so the algorithm may converge to a poor local optimum; and because centroids are means, the method is sensitive to outliers.
2. References
[1] Peter Harrington, Machine Learning in Action.
[2] zouxy09, Machine Learning Algorithms and Python Practice (5): k-means Clustering.
[3] The Top Ten Algorithms in Data Mining, CRC Press.
[Data Mining] Classification: kNN
1. Algorithm Overview
The idea behind kNN is simple: compute the distance between the sample to be classified and every sample in the training set, take the k training samples with the smallest distances, count the class labels among those k samples, and by majority vote assign the most common label to the test sample. The distance can be measured with the Euclidean distance, the Manhattan distance, or the cosine distance.
The Iris data set is used for testing; the code follows [1].
import numpy as np
import scipy.spatial.distance as ssd

def read_data(fn):
    """read the data set and separate it into characteristics (features)
    and label data
    """
    # read dataset file
    with open(fn) as f:
        raw_data = np.loadtxt(f, delimiter= ',', dtype="float",
                              skiprows=1, usecols=None)
    #initialize
    charac=[]; label=[]
    #obtain input characteristics and labels
    for row in raw_data:
        charac.append(row[:-1])
        label.append(int (row[-1]))
    return np.array(charac),np.array(label)

def knn(k,dtrain,dtest,dtr_label):
    """k-nearest neighbors algorithm"""
    pred_label=[]
    #for each instance in the test set, calculate its
    #distance to every instance in the training set
    for di in dtest:
        distances=[]
        for ij,dj in enumerate(dtrain):
            distances.append((ssd.euclidean(di,dj),ij))
        #sort the distances to get the k nearest neighbors
        k_nn=sorted(distances)[:k]
        #classify according to the most common label among the neighbors
        dlabel=[]
        for dis,idtr in k_nn:
            dlabel.append(dtr_label[idtr])
        pred_label.append(np.argmax(np.bincount(dlabel)))
    return pred_label

def evaluate(result):
    """count correctly and wrongly classified test records"""
    eval_result=np.zeros(2,int)
    for x in result:
        #pred_label==dte_label
        if x==0:
            eval_result[0]+=1
        #pred_label!=dte_label
        else:
            eval_result[1]+=1
    return eval_result

dtrain,dtr_label=read_data('iris-train.csv')
dtest,dte_label=read_data('iris-test.csv')
K=[1,3,7,11]
print "knn classification result for iris data set:\n"
print "k | number of correct/wrong classified test records"
for k in K:
    pred_label=knn(k,dtrain,dtest,dtr_label)
    eval_result=evaluate(pred_label-dte_label)
    #print the evaluation result to the screen
    print k," | ", eval_result[0], "/", eval_result[1]
print
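As mentioned above, other distance measures can be substituted for the Euclidean distance. The sketch below is a hypothetical variant of knn with the metric passed in as a parameter, so that for example scipy's cosine or Manhattan (cityblock) distance can be used instead:
def knn_with_metric(k, dtrain, dtest, dtr_label, metric=ssd.euclidean):
    """same as knn() above, but with the distance measure as a parameter,
    e.g. metric=ssd.cosine or metric=ssd.cityblock (Manhattan)"""
    pred_label = []
    for di in dtest:
        #distance of the test instance to every training instance
        distances = sorted((metric(di, dj), ij) for ij, dj in enumerate(dtrain))
        #majority vote among the k nearest neighbors
        dlabel = [dtr_label[idx] for _, idx in distances[:k]]
        pred_label.append(np.argmax(np.bincount(dlabel)))
    return pred_label

#for example:
#pred_label = knn_with_metric(3, dtrain, dtest, dtr_label, metric=ssd.cosine)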
2. References
[1] M. Saad Nurul Ishlah, Python: Simple K Nearest Neighbours Classifier.