Learning Hadoop: Implementing the K-Means Algorithm

This content is licensed under CC 4.0 BY-SA.

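This article describes how to implement the K-Means clustering algorithm with Hadoop MapReduce. The steps are: choose initial centers at random, compute each point's distance to the centers and assign it to a cluster, update the centers, and repeat until the centers are stable. In the code, the CenterInitial function initializes the centers, the map phase computes distances and assigns points, the reduce phase computes the new centers, and the NewCenter function compares new and old centers to decide whether clustering has finished. Run arguments such as `2 input.txt output` request 2 cluster centers. The source code is available on GitHub, and the reference links give more detail.

This example uses Hadoop for cluster analysis, implementing the KMeans algorithm with MapReduce.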

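1. Introduction to the KMeans algorithm:

The k-means algorithm takes a parameter k and partitions n input data objects into k clusters such that objects in the same cluster are highly similar, while objects in different clusters are less similar. Cluster similarity is measured against a "center object" (center of gravity) obtained as the mean of the objects in each cluster.

K-means is the most classic partition-based clustering method and one of the ten classic data mining algorithms. Its basic idea is to cluster around k points in space, assigning each object to the center nearest to it, and to update the cluster centers iteratively until the clustering stabilizes.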

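Suppose the sample set is to be divided into c classes. The algorithm is:

(1) Choose suitable initial centers for the c classes;

(2) In the k-th iteration (here k is the iteration index, not the number of clusters), compute the distance from every sample to each of the c centers and assign the sample to the class of the nearest center;

(3) Update the center of each class, for example by taking the mean of its members;

(4) If none of the c cluster centers changes after the update in (2) and (3), stop; otherwise continue iterating.

The main strengths of the algorithm are its simplicity and speed. Its outcome depends chiefly on the choice of initial centers and on the distance formula.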
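
As a compact restatement of steps (2) to (4), the formulas below use notation assumed here rather than taken from the article: \mu_j for the center of class j, S_j for the set of samples assigned to it, \ell(x) for the class of sample x, and \varepsilon for a small stopping threshold. Squared Euclidean distance is used because that is what the code in section 3 computes; other distance measures are possible.

```latex
\begin{aligned}
\text{assign: } & \ell(x) = \arg\min_{1 \le j \le c} \lVert x - \mu_j \rVert^2 \\
\text{update: } & \mu_j = \frac{1}{|S_j|} \sum_{x \in S_j} x, \qquad S_j = \{\, x : \ell(x) = j \,\} \\
\text{stop: }   & \max_{1 \le j \le c} \lVert \mu_j^{\text{new}} - \mu_j^{\text{old}} \rVert^2 \le \varepsilon
\end{aligned}
```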

The test data here are coordinate points, and the goal is to find the cluster center for each point. The data format is:

(1,1)
(2,2)
(99,99)
(100,100)
(101,101)

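2. Implementation steps

step1: Randomly select K points as the initial centers;

step2: For every point, compute its distance to each center and assign it to the class of the nearest center;

step3: Sum the coordinates of the points in each class and take the mean as the new center;

step4: If the centers did not change, stop. Otherwise go back to step2.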
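
To make the four steps concrete before the MapReduce version, here is a minimal in-memory sketch in plain Java. It is not the article's implementation: the class `KMeansSketch` and its method names are hypothetical, points are stored as `double[]` instead of strings, and the first k points are taken as initial centers (which is what the article's code does, rather than a random pick).

```java
import java.util.Arrays;

class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the center closest to point p (ties go to the lower index).
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int i = 1; i < centers.length; i++) {
            if (squaredDistance(p, centers[i]) < squaredDistance(p, centers[best])) {
                best = i;
            }
        }
        return best;
    }

    // Runs step1 to step4 and returns the final centers.
    static double[][] cluster(double[][] points, int k, double threshold) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) {
            centers[i] = points[i].clone();            // step1: first k points
        }
        double maxShift;
        do {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {                // step2: assign to nearest center
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            maxShift = 0;
            for (int c = 0; c < k; c++) {              // step3: mean becomes the new center
                if (counts[c] == 0) {
                    continue;                          // an empty cluster keeps its center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                maxShift = Math.max(maxShift, squaredDistance(centers[c], updated));
                centers[c] = updated;
            }
        } while (maxShift > threshold);                // step4: stop when centers are stable
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 2}, {99, 99}, {100, 100}, {101, 101}};
        System.out.println(Arrays.deepToString(cluster(points, 2, 0.0001)));
    }
}
```

On the five sample points with k = 2, this sketch converges to the centers (1.5, 1.5) and (100.0, 100.0), the same values shown in the run result in section 5.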

3. Code:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;

public class KMeans {
    // Current cluster centers, one "(x,y)" string per center.
    // Note: a static field is only shared inside one JVM, so the mappers see
    // these values only when the job runs in the same process as the driver.
    public static String[] centerlist;
    public static int k = 0; // number of clusters (K)

    public static class MapClass
            extends Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value,
                        Context context) throws IOException,
                        InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String outValue = itr.nextToken();
                String[] list = outValue.replace("(", "").replace(")", "").split(",");

                // Find the center with the smallest squared Euclidean distance.
                float min = Float.MAX_VALUE;
                int pos = 0;
                for (int i = 0; i < centerlist.length; i++) {
                    String[] centerStrings = centerlist[i].replace("(", "").replace(")", "").split(",");
                    float distance = 0;
                    for (int j = 0; j < list.length; j++) {
                        float diff = Float.parseFloat(list[j]) - Float.parseFloat(centerStrings[j]);
                        distance += diff * diff;
                    }
                    if (min > distance) {
                        min = distance;
                        pos = i;
                    }
                }
                // Emit (nearest center, point).
                context.write(new Text(centerlist[pos]), new Text(outValue));
            }
        }
    }
    
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterable<Text> values,
                           Context context) throws IOException, InterruptedException {
            // The key is the old center; its dimension fixes the size of the sum.
            int length = key.toString().replace("(", "").replace(")", "").split(",").length;
            float[] ave = new float[length];
            StringBuilder outVal = new StringBuilder();
            int count = 0;
            for (Text val : values) {
                outVal.append(val.toString()).append(" ");
                String[] tmp = val.toString().replace("(", "").replace(")", "").split(",");
                for (int i = 0; i < tmp.length; i++) {
                    ave[i] += Float.parseFloat(tmp[i]);
                }
                count++;
            }
            // Average the coordinates and format the new center as "(x,y,...)".
            StringBuilder center = new StringBuilder("(");
            for (int i = 0; i < length; i++) {
                if (i > 0) {
                    center.append(",");
                }
                center.append(ave[i] / count);
            }
            center.append(")");
            System.out.println(center);
            // Output line: old center <TAB> member points ... new center
            context.write(key, new Text(outVal.toString() + center));
        }
    }
    // Reads the input file and uses its first strcen.length lines as the initial centers.
    public static void CenterInitial(String strInPath, String[] strcen) throws IOException {
        String[] list;
        Configuration conf = new Configuration(); // load the Hadoop file system configuration
        conf.set("hadoop.job.ugi", "hadoop,hadoop");
        // FileSystem is the entry point for HDFS access; get the one that serves this URI.
        FileSystem fs = FileSystem.get(URI.create(strInPath), conf);
        FSDataInputStream in = null;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            in = fs.open(new Path(strInPath));
            // Copy the whole file into the in-memory buffer (50-byte copy buffer).
            IOUtils.copyBytes(in, out, 50, false);
            list = out.toString().split("\n");
        } finally {
            IOUtils.closeStream(in);
            out.close();
        }
        if (list.length < strcen.length) {
            throw new IOException("Input has fewer points than the requested number of centers");
        }
        for (int i = 0; i < strcen.length; i++) {
            strcen[i] = list[i].trim();
        }
    }
    // Reads the reducer output, stores the new centers in centerlist, and returns
    // the largest squared distance that any center moved in this iteration.
    public static float NewCenter(String strOutPath) throws IOException {
        String[] list;
        float should = 0;
        String resultPath = strOutPath + "/part-r-00000";
        Configuration conf = new Configuration(); // load the Hadoop file system configuration
        conf.set("hadoop.job.ugi", "hadoop,hadoop");
        // FileSystem is the entry point for HDFS access; get the one that serves this URI.
        FileSystem fs = FileSystem.get(URI.create(resultPath), conf);
        FSDataInputStream in = null;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            in = fs.open(new Path(resultPath));
            // Copy the whole file into the in-memory buffer (50-byte copy buffer).
            IOUtils.copyBytes(in, out, 50, false);
            list = out.toString().split("\n");
        } finally {
            IOUtils.closeStream(in);
            out.close();
        }

        // Each line: old center <TAB> member points ... new center
        for (int i = 0; i < k; i++) {
            String[] l = list[i].replace("\t", " ").split(" ");
            String[] oldcenter = l[0].replace("(", "").replace(")", "").split(",");              // previous center
            String[] finalcenter = l[l.length - 1].replace("(", "").replace(")", "").split(","); // new center
            centerlist[i] = l[l.length - 1]; // keep the newest center for the next map pass
            float tmp = 0;
            for (int j = 0; j < oldcenter.length; j++) {
                tmp += Math.pow(Float.parseFloat(oldcenter[j]) - Float.parseFloat(finalcenter[j]), 2);
            }
            if (should <= tmp)
                should = tmp;
        }
        System.out.println("New Center...");
        return should;
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        int times = 0;
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: KMeans K <in> <out>");
            System.exit(2);
        }
        k = Integer.parseInt(otherArgs[0]);
        centerlist = new String[k];
        CenterInitial(otherArgs[1], centerlist); // initialize the centers
        double s = 0;
        double shold = 0.0001; // stop once no center moves farther than this (squared distance)
        do {
            Job job = new Job(conf, "KMeans");
            job.setJarByClass(KMeans.class);
            job.setMapperClass(MapClass.class);
            job.setReducerClass(Reduce.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Delete the previous output directory before every iteration.
            FileSystem fs = FileSystem.get(conf);
            fs.delete(new Path(otherArgs[2]), true);
            FileInputFormat.setInputPaths(job, new Path(otherArgs[1]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
            if (!job.waitForCompletion(true)) {
                // Stop instead of repeating a failed job with a stale distance value.
                System.err.println("KMeans job failed");
                System.exit(1);
            }
            s = NewCenter(otherArgs[2]);
            times++;
        } while (s > shold);
        System.out.println("Iterator: " + times); // number of iterations, i.e. how often the centers were recomputed
    }
}

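4. Code walkthrough

CenterInitial function: initializes the centers, using the first k points of the input file; for example, with k=2 the initial centers are (1,1) and (2,2);

map: computes the distance from each point to every center and assigns the point to a class;

After the first pass, the map output (grouped by center) is:

(1,1)	(1,1)
(2,2)	(2,2) (99,99) (100,100) (101,101)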
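
The assignment performed by map can be tried outside Hadoop with a small sketch. The class `MapStepSketch` and its methods are hypothetical helpers, not part of the article's code; a `LinkedHashMap` stands in for the grouping by key that the MapReduce shuffle performs on the emitted (center, point) pairs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MapStepSketch {
    // Parses a point such as "(1,2)" into {1.0f, 2.0f}.
    static float[] parse(String point) {
        String[] parts = point.replace("(", "").replace(")", "").split(",");
        float[] coords = new float[parts.length];
        for (int i = 0; i < parts.length; i++) {
            coords[i] = Float.parseFloat(parts[i].trim());
        }
        return coords;
    }

    // Returns the center string with the smallest squared distance to the point.
    static String nearestCenter(String point, String[] centers) {
        float[] p = parse(point);
        String best = centers[0];
        float min = Float.MAX_VALUE;
        for (String center : centers) {
            float[] c = parse(center);
            float dist = 0;
            for (int i = 0; i < p.length; i++) {
                dist += (p[i] - c[i]) * (p[i] - c[i]);
            }
            if (dist < min) {
                min = dist;
                best = center;
            }
        }
        return best;
    }

    // Groups the points by their nearest center, in first-seen key order.
    static Map<String, List<String>> assign(String[] points, String[] centers) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String p : points) {
            groups.computeIfAbsent(nearestCenter(p, centers), key -> new ArrayList<>()).add(p);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] points = {"(1,1)", "(2,2)", "(99,99)", "(100,100)", "(101,101)"};
        String[] centers = {"(1,1)", "(2,2)"};
        assign(points, centers)
                .forEach((c, ps) -> System.out.println(c + "\t" + String.join(" ", ps)));
    }
}
```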

reduce: computes the new center of each cluster and writes it to part-r-00000;

After the first reduce:

(1,1)	(1,1) (1,1)
(2,2)	(2,2) (99,99) (100,100) (101,101) (75.5,75.5)
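
The averaging done by reduce can be checked in isolation. The sketch below is a hypothetical standalone helper (`ReduceStepSketch.mean` does not appear in the article's code); it sums the coordinates of the member points and divides by their count, using `float` as the article's reducer does.

```java
class ReduceStepSketch {
    // Returns the coordinate-wise mean of the points, formatted as "(x,y,...)".
    static String mean(String[] members) {
        int dim = members[0].split(",").length;
        float[] sum = new float[dim];
        for (String m : members) {
            String[] parts = m.replace("(", "").replace(")", "").split(",");
            for (int i = 0; i < dim; i++) {
                sum[i] += Float.parseFloat(parts[i]);
            }
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < dim; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(sum[i] / members.length);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        String[] cluster = {"(2,2)", "(99,99)", "(100,100)", "(101,101)"};
        System.out.println(mean(cluster)); // prints (75.5,75.5)
    }
}
```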

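NewCenter function: compares the new centers computed by reduce with the previous ones. The comparison is the distance between the two points; if that distance is very small, the cluster centers can be considered stable.
The function also stores the new centers from the reduce output (after the first pass, (1,1) and (75.5,75.5)) in centerlist, where they serve as the basis for the next map assignment.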
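
The stopping test can be sketched on its own as follows. `ConvergenceSketch` and its method names are hypothetical; the logic mirrors what the article's code does, namely taking the largest squared distance between an old center and its replacement and comparing it with the 0.0001 threshold used in main.

```java
class ConvergenceSketch {
    // Largest squared distance between corresponding old and new centers.
    static double maxSquaredShift(double[][] oldCenters, double[][] newCenters) {
        double max = 0;
        for (int i = 0; i < oldCenters.length; i++) {
            double shift = 0;
            for (int j = 0; j < oldCenters[i].length; j++) {
                double d = oldCenters[i][j] - newCenters[i][j];
                shift += d * d;
            }
            max = Math.max(max, shift);
        }
        return max;
    }

    // True when no center moved farther than the threshold.
    static boolean converged(double[][] oldCenters, double[][] newCenters, double threshold) {
        return maxSquaredShift(oldCenters, newCenters) <= threshold;
    }

    public static void main(String[] args) {
        double[][] before = {{1, 1}, {2, 2}};
        double[][] after = {{1, 1}, {75.5, 75.5}};
        System.out.println(maxSquaredShift(before, after));   // 10804.5, so another pass is needed
        System.out.println(converged(before, after, 0.0001)); // false
    }
}
```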

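centerlist is a static String array of the KMeans class, so it acts like a global variable. A static field is shared only within one JVM, so this works when the driver and the map tasks run in the same process. With many centers or a large data set, the centers can instead be written to a file on HDFS and read back from that file in each iteration.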
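
For a small number of centers, one alternative to a file is to pack the centers into a single string. The sketch below shows only that encoding step in plain Java; `CenterCodecSketch` and the `;` separator are assumptions made for illustration. In a Hadoop job such a string could be stored with `Configuration.set` in the driver and read in the mapper's `setup` method through `context.getConfiguration().get`, under a property name of your choosing (for example a hypothetical `kmeans.centers`).

```java
import java.util.Arrays;

class CenterCodecSketch {
    // Separator that cannot occur inside a "(x,y)" point string.
    static final String SEPARATOR = ";";

    // Packs all centers into one string, e.g. "(1,1);(75.5,75.5)".
    static String encode(String[] centers) {
        return String.join(SEPARATOR, centers);
    }

    // Restores the array of centers from the packed string.
    static String[] decode(String encoded) {
        return encoded.split(SEPARATOR);
    }

    public static void main(String[] args) {
        String packed = encode(new String[]{"(1,1)", "(75.5,75.5)"});
        System.out.println(packed);
        System.out.println(Arrays.toString(decode(packed)));
    }
}
```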

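The main function is a loop. Before each MapReduce job starts, the output of the previous run must be deleted, because a job will not start if its output directory already exists.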

5. Run arguments

2 input.txt output

Here 2 means 2 cluster centers.

Result:

(1.5,1.5)	(1,1) (2,2) (1.5,1.5)
(100.0,100.0)	(99,99) (100,100) (101,101) (100.0,100.0)

Source code: https://github.com/y521263/Hadoop_in_Action

References:

http://blog.csdn.net/shizhixin/article/details/8968977

http://cloudcomputing.ruc.edu.cn/