René's URL Explorer Experiment

Title: Understanding KNN in Depth and Extending It to ANN · Issue #38 · aialgorithm/Blog · GitHub

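Description: 1. How KNN (K-Nearest Neighbors) works. The principle of KNN fits in one sentence: combine the labels of the k “neighbors” to predict the new sample. More specifically, for KNN classification: given a training set and a new sample Xu, find the K samples in the training set that are closest to Xu (k=5 in the figure below) and assign Xu the class (label) held by the majority of those K samples. From this, KNN has three basic elements: the choice of K, the distance metric, and the decision rule. Each is discussed below. 1.1 Distance metrics: KNN uses distance to measure how close two samples are, and uses it to identify for the new sample the nearest...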

Opengraph URL: https://github.com/aialgorithm/Blog/issues/38

X: @github

Domain: github.com

Hey, it has json ld scripts:

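{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Understanding KNN in Depth and Extending It to ANN","articleBody":"## 1. How KNN (K-Nearest Neighbors) Works\r\nThe principle of KNN fits in one sentence: **combine the labels of the k “neighbors” to predict the new sample.**\r\nMore specifically, for KNN classification: given a training set and a new sample Xu, find the K samples in the training set that are closest to Xu (k=5 in the figure below) and assign Xu the class (label) held by the majority of those K samples.\r\n\r\n![](https://upload-images.jianshu.io/upload_images/11682271-1bd0a58bbe6f3f90.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\nFrom this, KNN has three basic elements: the choice of K, the distance metric, and the decision rule. Each is discussed below.\r\n\r\n### 1.1 Distance Metrics\r\nKNN uses distance to measure how close two samples are, and uses it to identify the K training samples nearest to the new sample (this is the key step of the algorithm). Commonly used metrics are the Manhattan distance and the Euclidean distance:\r\n\r\n- Manhattan distance formula:\r\n\r\n![](https://upload-images.jianshu.io/upload_images/11682271-86f9852ba8ca496a.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\n\r\n- Euclidean distance formula:\r\n\r\n![](https://upload-images.jianshu.io/upload_images/11682271-0d1161aa10669e59.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\nBoth are simple to compute: they aggregate the differences between two samples (x, y) over every feature i.\r\nIn the figure below (two features), the blue line is the Manhattan distance (imagine driving from one intersection in Manhattan to another: the distance actually driven is the “Manhattan distance”, also called the city block distance) and the red line is the Euclidean distance:\r\n\r\n![](https://upload-images.jianshu.io/upload_images/11682271-70fbf7d8b61024df.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\nThe Manhattan and Euclidean distances are both special cases of the Minkowski distance (p=1 gives the Manhattan distance; p=2 gives the Euclidean distance).\r\n![](https://upload-images.jianshu.io/upload_images/11682271-c7249abf6d758618.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\nIn most cases the two give similar results in KNN. They differ in some situations:\r\n- For high-dimensional features, the Manhattan distance (lower p) is less affected by the curse of dimensionality and tends to work better (see https://bib.dbvis.de/uploadedFiles/155.pdf ).\r\n- The Euclidean distance (higher p) puts more weight on features where the two samples differ a lot.\r\n\r\nOther distance measures can also be used to measure how close samples are; see this introduction to [distance metrics](https://mp.weixin.qq.com/s/AKx9N01-xlLgL2_oFa1KUg).\r\n\r\n**A caveat for the Minkowski distance: differences in feature scale**\r\n\r\nWhen computing distances, pay attention to differences in feature scale. Suppose each sample has two features, age and salary. In the Euclidean distance, (age1-age2)² is far smaller than (salary1-salary2)², so without feature scaling the distance is dominated by salary (the feature with larger values). Feature scaling solves this by bringing all values to the same order of magnitude. The usual fix is to standardize or normalize the data, for example by mapping every numeric feature to a common range such as 0~1.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-6cde6ee02a4907fe.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n### 1.2 Decision Rule\r\nThe decision rule determines the label of the new sample once its K nearest training samples have been found.\r\n\r\n- KNN classification: usually a majority vote. The class held by most of the k “neighbors” is the predicted class. (Votes can be weighted by distance, typically by the inverse of the distance, so that closer neighbors count more.)\r\n- KNN regression: usually the mean. The average of the labels (target values) of the k “neighbors” is the prediction (likewise, it can be weighted by distance).\r\n\r\nTaking the mean or the majority vote of the K “neighbors” is in fact empirical risk minimization.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-3b874318ed8202c9.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n### 1.3 Choosing K\r\nK is a hyperparameter of KNN: the number of “neighbors” whose labels are consulted.\r\nThere is a counterintuitive effect: when K is small, model complexity (capacity) is high, training error decreases, and generalization weakens; when K is large, model complexity is low, training error increases, and generalization improves to some extent.\r\n\r\n![](https://upload-images.jianshu.io/upload_images/11682271-9d552e745b4ad062.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\n\r\nThe reason is that when K is small (e.g. k==1), the prediction uses only the training samples in a small neighborhood. The model fits closely: the decision simply follows the nearest training sample (neighbor). But when the training set contains “noisy samples”, the model is easily affected by them (as in the overfitting panel of the figure: wherever a noisy sample sits, the decision boundary is drawn around it). This increases the variance of “learning”, i.e. the model **overfits easily**. Listening to more “neighbors” reduces the influence of the noise. When K is too large the situation is reversed and the model tends to underfit.\r\n\r\nK is usually chosen by grid search with cross-validation.\r\n\r\n\r\n\r\n\r\n\r\n\r\n## 2. Implementing KNN\r\nThere are two common ways to implement KNN: brute-force search and the KD-tree.\r\n### 2.1 Brute-Force Search\r\n\r\nThe most direct implementation is brute-force search: compute the distance between the input sample and every training sample, then take a majority vote among the k nearest. When the training set or the feature dimension is large, this is very slow and often impractical (for N samples of dimension D, brute-force search costs O(D*N) per query). The code below implements brute-force search:\r\n```\r\nimport numpy as np\r\nfrom matplotlib import pyplot\r\nfrom collections import Counter\r\n\r\n\r\ndef k_nearest_neighbors(data, predict, k=5):\r\n    distances = []\r\n    for group in data:\r\n        for features in data[group]:  # distance between the new sample (predict) and each training sample\r\n            euclidean_distance = np.sqrt(np.sum((np.array(features)-np.array(predict))**2))   # Euclidean distance\r\n            # euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))    # faster way to compute the Euclidean distance\r\n            distances.append([euclidean_distance, group])\r\n    # print(sorted(distances))\r\n    sorted_distances = [i[1] for i in sorted(distances)]\r\n    top_nearest = sorted_distances[:k]\r\n    # print(top_nearest)  # e.g. ['red','black','red']\r\n    group_res = Counter(top_nearest).most_common(1)[0][0]\r\n    confidence = Counter(top_nearest).most_common(1)[0][1] * 1.0 / k\r\n    # confidence: share of the k neighbors that voted for the winning class\r\n    return group_res, confidence\r\n```\r\nCheck the classification of a new iris sample (the training samples fall into three classes: 'blue', 'green', 'yellow'). The new sample (the red point) is classified as yellow, and the result is plotted:\r\n![](https://upload-images.jianshu.io/upload_images/11682271-e297473a40feb308.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n```\r\n# Use (part of) the iris dataset\r\ndataset = {\r\n    'blue': [[5.1, 3.5, 1.4, 0.2],\r\n        [4.9, 3. , 1.4, 0.2],\r\n        [4.7, 3.2, 1.3, 0.2],\r\n        [4.6, 3.1, 1.5, 0.2],\r\n        [5. , 3.6, 1.4, 0.2]], \r\n    'green': [[6.7, 3. , 5.2, 2.3],\r\n        [6.3, 2.5, 5. , 1.9],\r\n        [6.5, 3. , 5.2, 2. ],\r\n        [6.2, 3.4, 5.4, 2.3],\r\n        [5.9, 3. , 5.1, 1.8]],\r\n    'yellow':[[5.5, 2.4, 3.7, 1. ],\r\n        [5.8, 2.7, 3.9, 1.2],\r\n        [6. , 2.7, 5.1, 1.6],\r\n        [5.4, 3. , 4.5, 1.5],\r\n        [6. , 3.4, 4.5, 1.6]]\r\n\r\n} \r\nnew_features = [6. , 3. , 4.8, 1.8] \r\n# Find the nearest neighbors of the new sample in the dataset\r\ngroup_res, confidence = k_nearest_neighbors(dataset, new_features, k=3)\r\nprint(group_res, confidence)   \r\n\r\nfor i in dataset:\r\n    for ii in dataset[i]:\r\n        pyplot.scatter(ii[0], ii[1], s=50, color=i)  # plot the dataset samples (first two features only)\r\n        \r\npyplot.scatter(new_features[0], new_features[1], s=100, color='red')  # plot the new sample in red\r\npyplot.show()\r\n```\r\n\r\n\r\n### 2.2 KD-Tree\r\nThe drawback of brute-force search is that it blindly computes the distance between the new sample and every training sample to find the K nearest, even though the true neighbors are only a small part of the data. How can we efficiently pre-select that part first, and compute distances only for those candidates?\r\n\r\nOne solution is the KD-tree, which lets the search skip most data points and so reduces the computation. Building the tree takes O(n * log(n)) time, and a nearest-neighbor query on low-dimensional data takes about O(log(n)) on average. KNN with a KD-tree has two main steps: 1. build the KD-tree; 2. use it to quickly find the K nearest neighbors and make the decision.\r\n\r\n- Building the KD-tree\r\n\r\n\u003eA KD-tree is a binary tree over n feature dimensions that partitions the samples in n-dimensional space into small regions (see the KD-tree partition figure below). To build it from m samples with n features, compute the variance of each of the n features and use the k-th feature nk, the one with the largest variance, to split at the root node. For this feature, choose the sample whose value is the median nkv of nk as the split point. Samples whose k-th feature is less than nkv go into the left subtree; samples whose k-th feature is greater than or equal to nkv go into the right subtree. For the left and right subtrees, repeat the same procedure, picking the feature with the largest variance for each subtree's root node, and build the KD-tree recursively.\r\n\r\n\r\nFor example, take six 2-D samples {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}. The KD-tree is built as follows:\r\n\r\n1) Find the splitting feature: the variances of the six points along x and y are 6.97 and 5.37, so the variance is larger along x and the tree splits on the first feature.\r\n\r\n2) Determine the median split point (7,2): sorting the data by x, the middle value used for the split is 7 (the upper median of the six values), so the split point is (7,2). The splitting hyperplane of this node is the line x=7, which passes through (7,2) and is perpendicular to the splitting dimension.\r\n\r\n3) Determine the left and right subspaces: the hyperplane x=7 splits the space into two parts. The part with x\u003c7 is the left subspace, containing 3 nodes {(2,3),(5,4),(4,7)}; the other part is the right subspace, containing 2 nodes {(9,6), (8,1)}.\r\n\r\n4) Split the left-subtree nodes {(2,3),(5,4),(4,7)} and the right-subtree nodes {(9,6), (8,1)} in the same way to obtain the final KD-tree.\r\n\r\nThe resulting KD-tree:\r\n![](https://upload-images.jianshu.io/upload_images/11682271-fca4d2cf82c3d524.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\n\r\n\r\n- Using the KD-tree to quickly find the K nearest neighbors and decide\r\n\r\nOnce the KD-tree is built, we can predict target points (samples to be predicted) from the test set. For a target point, first find the leaf node of the KD-tree whose region contains it. With the target point as the center and the distance from the target point to the leaf node's sample as the radius, we get a hypersphere; the nearest neighbor must lie inside this hypersphere. Then go back to the leaf's parent and check whether the hyperrectangle of the other child intersects the hypersphere. If it does, search that child for a closer neighbor and update the nearest neighbor if one is found. If it does not, simply go back to the parent's parent and continue the search in its other subtree. When the backtracking reaches the root node, the algorithm ends, and the stored nearest neighbor is the final result.\r\n\r\nAs the description shows, the KD-tree partition greatly reduces wasted work: many sample points need no distance computation at all because their hyperrectangle does not intersect the hypersphere. This saves a great deal of computation time.\r\n\r\nUsing the KD-tree built above, let us trace the nearest-neighbor search for the point (2,4.5).\r\n\r\nStart with a binary search: from (7,2) we reach node (5,4), whose splitting hyperplane is y = 4. Because the query point has y = 4.5, we enter the right subspace and reach (4,7), forming the search path \u003c(7,2), (5,4), (4,7)\u003e. The distance from (4,7) to the query point is 3.202, while the distance from (5,4) is 3.041, so (5,4) is the current nearest point. Draw a circle centered at (2,4.5) with radius 3.041, as shown in the figure below. The circle crosses the hyperplane y = 4, so we must also search the left subspace of (5,4), i.e. add node (2,3) to the search path, giving \u003c(7,2), (2,3)\u003e. Searching on to the leaf (2,3): it is closer to (2,4.5) than (5,4) is, so the nearest neighbor is updated to (2,3) and the nearest distance to 1.5. Backtrack to (5,4) and finally to the root (7,2): the circle centered at (2,4.5) with radius 1.5 does not cross the splitting hyperplane x = 7, as shown in the figure below. The backtracking is now complete, and the search returns the nearest neighbor (2,3) with distance 1.5.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-c4cec92e806fff45.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\nBuilding on the nearest-neighbor search: once the first nearest neighbor is found, mark it as selected. In the second round, ignore the selected samples and search for the nearest neighbor again. After k rounds we have the K nearest neighbors of the target. Then decide: for KNN classification, predict the class that occurs most often among the K nearest neighbors (majority vote); for KNN regression, use the mean output of the K nearest neighbors as the prediction.\r\n\r\n\u003eThe KD-tree works well for nearest-neighbor search in low dimensions, but search becomes very inefficient as the number of dimensions grows (the curse of dimensionality). The Ball tree was proposed to address the KD-tree's problems on high-dimensional data. The KD-tree splits the data iteratively along the Cartesian coordinate axes, whereas the Ball tree partitions the data with a series of hyperspheres instead of hyperrectangles. See reference 2 at the end of the article for details.\r\n\r\n\r\n\r\n## 3. Pros and Cons of KNN\r\n\r\n### 3.1 Main Advantages\r\n1. The algorithm is simple and intuitive, and applies easily to regression and multi-class classification.\r\n\r\n2. It makes no assumptions about the data, is accurate, and is relatively insensitive to outliers.\r\n\r\n3. Because KNN decides the class mainly from a limited number of nearby samples rather than by discriminating class regions, it suits sample sets whose class regions overlap or are not linearly separable.\r\n\r\n\r\n\r\n### 3.2 Main Disadvantages\r\n\r\n1. Heavy computation, especially with many samples or many features. Building structures such as KD-trees and ball trees also needs a lot of memory.\r\n\r\n2. The prediction depends only on a small number of k neighboring samples, so with imbalanced classes, accuracy on rare classes is low.\r\n\r\n3. It is a lazy learner, so prediction is slower than for algorithms such as logistic regression. Computation happens only at prediction time: the similarity between the sample to classify and every sample in the training set must be computed to find the K nearest samples for the decision.\r\n\r\n4. Unlike decision trees and similar methods, KNN does not take feature importance into account: every normalized feature has the same influence.\r\n\r\n5. Compared with decision trees and logistic regression, KNN models are less interpretable.\r\n\r\n6. KNN models show little diversity, so ensembling them does little to improve performance.\r\n\r\n\r\n\r\n## 4. Extensions of KNN\r\n\r\n### 4.1 Nearest Centroid\r\n\r\nThis algorithm is even simpler than KNN. It first groups the samples by output class. For the CL samples of class L, it averages each of the n feature dimensions over those CL samples; the n means form the centroid of the class. In the same way, every class ends up with one centroid.\r\n\r\nTo predict, we only need to compare the distances from the sample to these centroids; the class of the nearest centroid is the predicted class. This algorithm is often used for text classification.\r\n\r\n### 4.2 ANN\r\nNearest-neighbor search is scaled to large datasets with ANN (Approximate Nearest Neighbor) algorithms, which avoid brute-force distance computation altogether. ANN algorithms allow a small amount of error in the neighbor search and, on large datasets, can reach excellent accuracy in a short time. ANN implementations include Spotify's ANNOY, Google's ScaNN, Facebook's Faiss, and HNSW. HNSW is described below.\r\n\r\n- Hierarchical Navigable Small World (HNSW)\r\n\r\nHNSW is an ANN algorithm based on a multi-layer graph. During insertion, the HNSW graph is built incrementally by randomly choosing each element's maximum layer from an exponentially decaying probability distribution. This ensures that layer=0 has many elements for fine-grained search, while layer=2 has fewer elements, by a factor of e^-2, for coarse search. A nearest-neighbor search starts with a coarse search at the top layer and works downward to the bottom layer. The graph is traversed with a greedy routing algorithm until the required number of neighbors is found.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-e3c6b130eef56eea.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\nANN is easy to use through the hnswlib library (hnswlib is also widely used for large-scale vector similarity search). Example code on the iris data:\r\n```\r\n# pip install hnswlib  # install hnswlib\r\nimport hnswlib\r\nimport numpy as np\r\n\r\n# Same iris data as above: the first 5 rows are blue, the middle 5 green, the last 5 yellow\r\ndataset2 = np.array([\r\n        [5.1, 3.5, 1.4, 0.2],\r\n        [4.9, 3. , 1.4, 0.2],\r\n        [4.7, 3.2, 1.3, 0.2],\r\n        [4.6, 3.1, 1.5, 0.2],\r\n        [5. , 3.6, 1.4, 0.2], \r\n        [6.7, 3. , 5.2, 2.3],\r\n        [6.3, 2.5, 5. , 1.9],\r\n        [6.5, 3. , 5.2, 2. ],\r\n        [6.2, 3.4, 5.4, 2.3],\r\n        [5.9, 3. , 5.1, 1.8],\r\n        [5.5, 2.4, 3.7, 1. ],\r\n        [5.8, 2.7, 3.9, 1.2],\r\n        [6. , 2.7, 5.1, 1.6],\r\n        [5.4, 3. , 4.5, 1.5],\r\n        [6. , 3.4, 4.5, 1.6]\r\n])\r\n\r\n# Build the index\r\ndef fit_hnsw_index(features, ef=100, M=16, save_index_file=False):\r\n    # Convenience function to create HNSW graph\r\n    # features : list of lists containing the embeddings\r\n    # ef, M: parameters to tune the HNSW algorithm\r\n    \r\n    num_elements = len(features)\r\n    labels_index = np.arange(num_elements)\r\n    EMBEDDING_SIZE = len(features[0])\r\n    # Declaring index; possible space options are l2, cosine or ip\r\n    p = hnswlib.Index(space='l2', dim=EMBEDDING_SIZE)\r\n    # Initing index - the maximum number of elements should be known\r\n    p.init_index(max_elements=num_elements, ef_construction=ef, M=M)\r\n    # Element insertion\r\n    p.add_items(features, labels_index)\r\n    # Controlling the recall by setting ef; ef should always be \u003e k\r\n    p.set_ef(ef) \r\n    \r\n    # If you want to save the graph to a file\r\n    if save_index_file:\r\n         p.save_index(save_index_file)\r\n    \r\n    return p\r\n\r\np = fit_hnsw_index(dataset2)  # build the HNSW index\r\n\r\n```\r\n\r\nOnce the index is built, it quickly returns the k approximate nearest neighbors. On this example dataset the result matches the KNN result: the neighbor indices are [9,12,14], so most of the neighbors (the samples at indices 12 and 14) are “yellow”, and the final class is “yellow”.![](https://upload-images.jianshu.io/upload_images/11682271-c99673897e803be0.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n```\r\n# Query the k nearest neighbors quickly through the HNSW index\r\n\r\nann_neighbor_indices, ann_distances = p.knn_query(new_features, 3)\r\n\r\nprint('K nearest neighbors:', ann_neighbor_indices)\r\n\r\nprint('Distances:', ann_distances)  # with space='l2', hnswlib reports squared L2 distances\r\n```\r\n\r\n\u003e References  1. https://www.joinquant.com/view/community/detail/c2c41c79657cebf8cd871b44ce4f5d97   2. https://www.cnblogs.com/pinard/p/6061661.html\r\n3. https://github.com/spotify/annoy 4. https://github.com/nmslib/hnswlib\r\n\r\n---\r\nThis article was first published on the WeChat official account “算法进阶”; the related code is available via “Read the original” on the official account.\r\n","author":{"url":"https://github.com/aialgorithm","@type":"Person","name":"aialgorithm"},"datePublished":"2021-12-13T15:04:29.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":0},"url":"https://github.com/aialgorithm/Blog/issues/38"}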
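
Sections 1.1 and 1.2 of the article body above describe the Minkowski distance and a vote weighted by inverse distance. The following is a minimal sketch of that combination on the article's iris subset, using only the standard library. The function names `minkowski` and `weighted_knn` and the small epsilon added to the distance are assumptions made for this illustration; they do not come from the article.

```python
from collections import defaultdict

# The article's 15-sample iris subset, keyed by class label.
dataset = {
    'blue': [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
             [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2]],
    'green': [[6.7, 3.0, 5.2, 2.3], [6.3, 2.5, 5.0, 1.9], [6.5, 3.0, 5.2, 2.0],
              [6.2, 3.4, 5.4, 2.3], [5.9, 3.0, 5.1, 1.8]],
    'yellow': [[5.5, 2.4, 3.7, 1.0], [5.8, 2.7, 3.9, 1.2], [6.0, 2.7, 5.1, 1.6],
               [5.4, 3.0, 4.5, 1.5], [6.0, 3.4, 4.5, 1.6]],
}
new_features = [6.0, 3.0, 4.8, 1.8]


def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def weighted_knn(data, query, k=3, p=2):
    """Classify query by an inverse-distance-weighted vote of its k nearest neighbors."""
    distances = sorted(
        (minkowski(features, query, p), label)
        for label, samples in data.items()
        for features in samples
    )
    votes = defaultdict(float)
    for d, label in distances[:k]:
        # Hypothetical epsilon: avoids division by zero for exact duplicates.
        votes[label] += 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)


print(weighted_knn(dataset, new_features, k=3))  # yellow
```

With k=3 the weighted vote agrees with the article's unweighted result: the single closest neighbor is green, but the two yellow neighbors together carry more weight.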

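The KD-tree walkthrough in section 2.2 of the article body above (six 2-D points, query point (2,4.5)) can be reproduced with a short sketch. This is one common recursive formulation, not the article's own code: it updates the best candidate on the way down instead of descending to a leaf first, and it represents nodes as plain dicts. The names `build_kdtree`, `nearest`, and `variance` are assumptions made for this illustration.

```python
from math import dist  # Python 3.8+


def variance(values):
    """Population variance; only the ranking between dimensions matters here."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def build_kdtree(points):
    """Recursively build a KD-tree, splitting on the highest-variance dimension."""
    if not points:
        return None
    dims = len(points[0])
    axis = max(range(dims), key=lambda d: variance([p[d] for p in points]))
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # upper median, as in the article's example
    return {
        'point': points[mid],
        'axis': axis,
        'left': build_kdtree(points[:mid]),
        'right': build_kdtree(points[mid + 1:]),
    }


def nearest(node, target, best=None):
    """Return (distance, point) for the nearest neighbor of target."""
    if node is None:
        return best
    d = dist(node['point'], target)
    if best is None or d < best[0]:
        best = (d, node['point'])
    diff = target[node['axis']] - node['point'][node['axis']]
    near, far = (node['left'], node['right']) if diff < 0 else (node['right'], node['left'])
    best = nearest(near, target, best)
    # Visit the far side only if the current best circle crosses the splitting plane.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(tree['point'])             # (7, 2)
print(nearest(tree, (2, 4.5)))   # (1.5, (2, 3))
```

The pruning test `abs(diff) < best[0]` is the code form of the circle-versus-hyperplane check in the walkthrough: at the root, the distance 5 to the plane x = 7 exceeds the best distance 1.5, so the right subtree is never visited.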

Links:

Understanding KNN in Depth and Extending It to ANN	https://github.com/aialgorithm/Blog/issues/38#top
aialgorithm	https://github.com/aialgorithm
on Dec 13, 2021	https://github.com/aialgorithm/Blog/issues/38#issue-1078639166
https://bib.dbvis.de/uploadedFiles/155.pdf	https://bib.dbvis.de/uploadedFiles/155.pdf
Distance metrics	https://mp.weixin.qq.com/s/AKx9N01-xlLgL2_oFa1KUg
https://www.joinquant.com/view/community/detail/c2c41c79657cebf8cd871b44ce4f5d97	https://www.joinquant.com/view/community/detail/c2c41c79657cebf8cd871b44ce4f5d97
https://www.cnblogs.com/pinard/p/6061661.html	https://www.cnblogs.com/pinard/p/6061661.html
https://github.com/spotify/annoy	https://github.com/spotify/annoy
https://github.com/nmslib/hnswlib	https://github.com/nmslib/hnswlib