Machine Learning - Unsupervised

This content is licensed under the CC 4.0 BY-SA license.

Original article: https://www.tutorialspoint.com/machine_learning/machine_learning_unsupervised.htm

So far, you have seen how to make a machine learn to find the solution to a target that we define. In regression, we train the machine to predict a future value. In classification, we train the machine to assign an unknown object to one of the categories we have defined. In short, we have been training machines so that they can predict Y for our data X. Given a huge data set with no categories defined in advance, it would be difficult to train the machine using supervised learning. What if the machine could examine and analyze data running into several gigabytes or terabytes and tell us how many distinct categories that data contains?

As an example, consider voter data. Using some inputs from each voter (these are called features in AI terminology), we let the machine estimate how many voters would vote for one political party, how many would vote for another, and so on. In general, then, we give the machine a huge set of data points X and ask, “What can you tell me about X?” The question might also be “What are the five best groups we can make out of X?” or even “What three features occur together most frequently in X?”
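To make the last question concrete, here is a minimal sketch of counting which three features occur together most often. The voter records and feature names are invented for illustration, and the helper `most_common_triple` is a hypothetical name, not something taken from the original tutorial.

```python
from collections import Counter
from itertools import combinations

# Hypothetical voter records: each set lists the features present for one voter.
voters = [
    {"urban", "renter", "student", "cyclist"},
    {"urban", "renter", "student"},
    {"rural", "homeowner", "retired"},
    {"urban", "renter", "student", "retired"},
    {"rural", "homeowner", "student"},
]

def most_common_triple(records):
    """Return the 3-feature combination found in the most records, with its count."""
    counts = Counter()
    for features in records:
        # sorted() gives every triple a canonical order so equal sets match.
        counts.update(combinations(sorted(features), 3))
    return counts.most_common(1)[0]

triple, count = most_common_triple(voters)
print(triple, count)
```

No labels are supplied anywhere in this sketch. The answer comes only from structure already present in the data, which is the defining trait of an unsupervised question.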

This is exactly what unsupervised learning is all about.

Algorithms for Unsupervised Learning

Let us now discuss one of the most widely used algorithms for grouping data in unsupervised machine learning.

k-means clustering

The 2000 and 2004 Presidential elections in the United States were close, very close. The largest percentage of the popular vote that any candidate received was 50.7% and the lowest was 47.9%. If a small percentage of voters had switched sides, the outcome of the election would have been different. There are small groups of voters who, when properly appealed to, will switch sides. These groups may not be huge, but with such close races, they may be big enough to change the outcome of the election. How do you find these groups of people? How do you appeal to them with a limited budget? The answer is clustering.

Let us understand how it is done.

First, you collect information on people either with or without their consent: any sort of information that might give some clue about what is important to them and what will influence how they vote.
Then you put this information into some sort of clustering algorithm.
Next, for each cluster (it would be smart to choose the largest one first) you craft a message that will appeal to these voters.
Finally, you deliver the campaign and measure whether it is working.
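
The third step, working through the clusters from largest to smallest, can be sketched as follows. The `assignments` list stands in for the output of whatever clustering algorithm was used in the second step. Both that list and the helper `clusters_by_size` are hypothetical, made up for this illustration.

```python
from collections import Counter

# Hypothetical output of a clustering step: one cluster id per voter.
assignments = [2, 0, 2, 1, 2, 0, 2, 1, 0, 2]

def clusters_by_size(labels):
    """Return (cluster_id, size) pairs, largest cluster first."""
    return Counter(labels).most_common()

ordered = clusters_by_size(assignments)
for cluster_id, size in ordered:
    print(f"cluster {cluster_id}: {size} voters")
```

With a limited budget, a campaign would presumably start at the top of this list and stop when the money runs out.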

Clustering is a type of unsupervised learning that automatically forms clusters of similar things. It is like automatic classification. You can cluster almost anything, and the more similar the items within a cluster are, the better the clustering is. In this chapter, we are going to study one type of clustering algorithm called k-means. It is called k-means because it finds ‘k’ unique clusters, and the center of each cluster is the mean of the values in that cluster.
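
The description above can be turned into a minimal sketch in plain Python. It rests on several assumptions that are not in the original text: points are tuples of numbers, similarity is measured by squared Euclidean distance, the initial centers are k randomly chosen data points, and the loop stops once the centers no longer move. The names `k_means`, `squared_distance`, and `mean_point` are hypothetical helpers.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k distinct positions sampled from the data.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its cluster's points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # An empty cluster keeps its previous center.
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # centers stopped moving, so assignments are stable
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other, with centers near (1.17, 1.5) and (8.5, 8.83). Which cluster receives id 0 depends on the random start. On less cleanly separated data, different starting centers can produce different clusters, so practical implementations often run several restarts.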

Cluster Identification

Cluster identification tells an algorithm, “Here’s some data. Now group similar things together and tell me about those groups.” The key difference from classification is that in classification you know what you are looking for, whereas in clustering you do not.

Clustering is sometimes called unsupervised classification because it produces the same kind of result as classification does, a group assignment for every item, but without having predefined classes.
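
One practical consequence is that cluster ids carry no meaning of their own: a clustering can group items exactly as a classifier would while naming the groups differently. The sketch below checks whether two labelings describe the same grouping once label names are ignored. The function `same_grouping` and the sample labels are hypothetical, chosen only for illustration.

```python
def same_grouping(labels_a, labels_b):
    """True if two labelings split the items into identical groups,
    ignoring what the groups are called."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # setdefault stores the first pairing seen and returns the stored one,
        # so any later conflicting pairing is detected.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

# Predefined class names from a classifier versus arbitrary ids from a clustering.
classifier_output = ["party_x", "party_x", "party_y", "party_y", "party_x"]
clustering_output = [1, 1, 0, 0, 1]
matches = same_grouping(classifier_output, clustering_output)
print(matches)
```

Here the two outputs agree on which voters belong together, yet only the classifier can say what each group is. Interpreting a cluster is left to whoever inspects it.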

We have now covered both supervised and unsupervised learning. To understand the remaining categories of machine learning, we must first understand Artificial Neural Networks (ANN), which are covered in the next chapter.