René's URL Explorer Experiment


Title: 一文概览NLP算法(Python) · Issue #52 · aialgorithm/Blog · GitHub — "An Overview of NLP Algorithms (Python)"

Open Graph Title: 一文概览NLP算法(Python) · Issue #52 · aialgorithm/Blog (same as Title, without the trailing " · GitHub")

X Title: 一文概览NLP算法(Python) · Issue #52 · aialgorithm/Blog (same as Open Graph Title)

Description: I. Introduction to natural language processing (NLP). NLP means using computers to analyze and generate natural language (text, speech), so that humans can interact with computer systems in natural language and manage information more conveniently and effectively. NLP is one of AI's longer-standing fields, but because of the complexity of language (diversity of expression, ambiguity, vagueness, and so on), its progress and payoff have been relatively slow. Bill Gates once said, "NLP is the jewel in the crown of AI." Dazzling to behold, yet still out of reach (...). To lift NLP's veil of mystery, this article walks through the NLP pipeline, its main tasks and algorithms, ...

Open Graph Description: same as Description, truncated slightly earlier.

X Description: same as Description, truncated earlier still.

Open Graph URL: https://github.com/aialgorithm/Blog/issues/52

X: @github


Domain: github.com


Hey, it has JSON-LD scripts: a schema.org `DiscussionForumPosting` (headline: 一文概览NLP算法(Python); author: aialgorithm, https://github.com/aialgorithm; datePublished: 2022-05-21T11:57:45.000Z; url: https://github.com/aialgorithm/Blog/issues/52; comment count: 0). Its `articleBody` carries the full post, rendered and translated below.

## I. Introduction to natural language processing (NLP)
> Natural language processing means using computers to analyze and generate natural language (text and speech), so that humans can interact with computer systems in natural language and manage information more conveniently and effectively.

NLP is one of the longer-standing fields of AI, but because of the complexity of language (diversity of expression, ambiguity, vagueness, and so on), its progress and payoff have been comparatively slow. Bill Gates once said, "NLP is the jewel in the crown of AI." Dazzling to behold, yet still out of reach (...).
![](https://upload-images.jianshu.io/upload_images/11682271-2808ff813c4db7f6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

To lift NLP's veil of mystery, this article walks through the NLP pipeline and its main tasks and algorithms, ending with a practical NLP project (a hands-on run through the classic text classification task). `A quick aside: my knowledge is limited; please leave a comment to point out anything I got wrong~~`

## II. Main NLP tasks and techniques
**NLP tasks fall roughly into three levels: lexical analysis, syntactic analysis, and semantic analysis. Concretely, this article proceeds in word → sentence → document order, introducing the tasks and the corresponding techniques at each level. The first half of this section (tokenization, named entity recognition, word vectors, and so on) can be viewed as foundational NLP tasks; the second half (sentence relations, text generation, and classification) as the main application tasks.**
![](https://upload-images.jianshu.io/upload_images/11682271-865816ee430c1e08.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
Here is a technology roadmap of natural language processing, showing the branches of NLP tasks and mainstream models:
![](https://upload-images.jianshu.io/upload_images/11682271-06b637d032ccde3f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
> A high-resolution version can be downloaded from the following path (original author graykode): https://github.com/aialgorithm/AiPy/tree/master/Ai%E7%9F%A5%E8%AF%86%E5%9B%BE%E5%86%8C/Ai_Roadmap

### 2.1 Data cleaning + tokenization (sequence labeling tasks)
- Corpus cleaning. After obtaining a text corpus, the usual first step is to analyze and clean the text: use regular expressions to delete digits and punctuation (usually noise that does not help the actual task), tokenize, then drop irrelevant words (stop words); for English, also unify plural, voice, tense, and other inflected word forms, i.e. stemming / lemmatization.

- Tokenization: splitting text into word units (tokens), a common sequence labeling task. English and other Latin-script languages come naturally tokenized by spaces. ![](https://upload-images.jianshu.io/upload_images/11682271-10407a47b0dca356.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) Chinese text runs words together, so tools like jieba (which implements maximum-probability word segmentation with a trie plus the Viterbi algorithm) are used:

```
import jieba
jieba.lcut("我的地址是上海市松江区中山街道华光药房")

>>> ['我', '的', '地址', '是', '上海市', '松江区', '中山', '街道', '华光', '药房']
```

- Stemming / lemmatization of the English tokens (removing tense, voice, and plural information, unifying each word into a single base form). This is not mandatory; whether tense and voice information should be kept depends on the task. WordNetLemmatizer, SnowballStemmer, and similar tools are available.

- After tokenizing and cleaning, compare the before-and-after results and fine-tune. Statistics such as word frequencies and sentence lengths help here, as does visualization with tools like a word cloud:

```
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# raw_text_corpus: the original corpus joined into one string
ham_msg_cloud = WordCloud(width=520, height=260, max_font_size=50,
                          background_color="black", colormap='Blues').generate(raw_text_corpus)
plt.figure(figsize=(16, 10))
plt.imshow(ham_msg_cloud, interpolation='bilinear')
plt.axis('off')  # turn off axis
plt.show()
```
![](https://upload-images.jianshu.io/upload_images/11682271-0f69a151afb9dadf.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

### 2.2 Part-of-speech tagging (a sequence labeling task)

Part-of-speech tagging performs a light analysis of sentence constituents, distinguishing nouns, verbs, adjectives, and the like. For syntactic parsing and information extraction tasks, POS-tagged text is a great convenience (applications elsewhere seem rare).

Common POS-tagging methods are rule-based, statistical, and deep-learning based; tools such as HanLP and jieba offer this functionality.

![](https://upload-images.jianshu.io/upload_images/11682271-b21b8398d45c3326.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

### 2.3 Named entity recognition (a sequence labeling task)
Named entity recognition (NER) is a supervised sequence labeling task, also called "proper-name recognition": identifying entities with specific meaning in text, chiefly person names, place names, organization names, times, proper nouns, and other key information.
![](https://upload-images.jianshu.io/upload_images/11682271-1b9bfec39809004f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
Once NER has recognized the key person and place names, information like "who went where and did what" can be extracted easily, which serves information extraction, question answering, and similar tasks. Mainstream NER model implementations include BiLSTM-CRF and BERT-CRF; for a simple Chinese NER project see https://github.com/Determined22/zh-NER-TF

### 2.4 Word vectors (representation learning)
A computer cannot understand the meaning behind the words of natural language text. Before a model sees anything, the words must be represented numerically; two conversions are common: one-hot encoding and distributed word embeddings.

- One-hot encoding: the simplest representation is surely one-hot, where each word gets its own dimension marking whether it occurs. One step further, a sentence is represented by **summing** the one-hot vectors of its words: the familiar bag-of-words (BoW) sentence representation.

```
## Bag-of-words representation
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(
    analyzer='word',
    strip_accents='ascii',
    lowercase=True,
    max_features=100,
)
```
![](https://upload-images.jianshu.io/upload_images/11682271-4a52e50e27fb3f9f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
- Distributed word embeddings: a natural language has tens of thousands of words, so one-hot encoding suffers from high dimensionality and encodes no relations between words. A more effective approach is the distributed word embedding: a neural network learns a low-dimensional, dense vector representation that implicitly captures relations between words. Common models for learning per-word vectors include Word2Vec, FastText, and BERT; after representation learning, similar words lie close together in the vector space.
![](https://upload-images.jianshu.io/upload_images/11682271-ab08b6c01dd9ca92.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
```
# FastText embedding model (gensim 3.x parameter names;
# gensim >= 4.0 renames size -> vector_size and iter -> epochs)
from gensim.models import FastText

model = FastText(text, size=100, sg=1, window=3, min_count=1, iter=10,
                 min_n=3, max_n=6, word_ngrams=1, workers=12)
print(model.wv['hello'])  # the learned word vector for 'hello'
model.save('./data/fasttext100dim')
```
In particular, it is large-scale self-supervised pre-training methods such as BERT that have brought NLP a new spring~
![](https://upload-images.jianshu.io/upload_images/11682271-51774d3b3fed28f2.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
- The learned word vectors can additionally be feature-weighted by importance; a well-chosen weighting scheme can lift a task's performance noticeably. Common schemes include chi-squared (chi2) and TF-IDF weighting. TF-IDF is a statistics-based method whose core idea is that a term's importance is proportional to how often it appears in a given document and inversely proportional to how often it appears in other documents.
![](https://upload-images.jianshu.io/upload_images/11682271-f53a15fe8bfb6d8a.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
```
# TF-IDF can be called directly from sklearn
from sklearn.feature_extraction.text import TfidfTransformer
```

### 2.5 Syntactic and semantic dependency parsing

> Syntactic and semantic dependency parsing are the foundational sentence-level tasks of traditional NLP. Semantic dependency parsing analyzes the semantic relations between content words within a sentence's structure: relations that are factual or logical, and that exist only once the words enter a sentence. Its goal is to answer the sentence's "who did what to whom, when and where". For example, for the sentence "张三昨天告诉李四一个秘密" ("Zhang San told Li Si a secret yesterday"), semantic dependency parsing can answer four questions: who told Li Si a secret, whom did Zhang San tell a secret, when did Zhang San tell Li Si the secret, and what did Zhang San tell Li Si.
![](https://upload-images.jianshu.io/upload_images/11682271-6581c59b1e7abc21.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Traditional NLP largely followed linguists' generalizations about natural language. Syntactic and semantic analysis can surface the relations between words (subject-verb-object, agent-patient, and so on), used to write text rules and to extract information (e.g. regex matching layered with semantic rules for knowledge extraction or feature construction). See the spaCy library, or the HIT LTP demo: http://ltp.ai/demo.html

With the spread of deep learning's powerful sequential models (RNN/LSTM and the like) and word embedding methods, which can to some degree capture a sentence's latent grammatical structure and learn contextual information, these traditional lexical and syntactic pipeline stages have gradually been displaced.

### 2.6 Similarity algorithms (sentence-relation tasks)
NLP tasks frequently need to judge how similar two documents are (a sentence-relation problem), e.g. a retrieval system returning the most relevant texts, or a recommender suggesting similar articles. Common methods for text similarity matching include edit distance, WMD, BM25, word-vector similarity, Approximate Nearest Neighbor search, and supervised (neural network) models that score the similarity between texts.

![](https://upload-images.jianshu.io/upload_images/11682271-e667811041f2ae84.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
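Before moving on to classification, a small self-contained sketch of the word-vector similarity route above (the toy vocabulary and its vectors are invented; in practice `wv` would be a trained model's lookup, e.g. `fmodel.wv` from section 3.3 below):

```
import numpy as np

def avg_vector(tokens, wv, size=100):
    """Average the word vectors of a token list into one sentence vector."""
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(size)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Hypothetical 4-dimensional toy vocabulary standing in for a trained model
wv = {'free': np.array([1.0, 0.0, 0.0, 1.0]),
      'prize': np.array([0.9, 0.1, 0.0, 1.0]),
      'meeting': np.array([0.0, 1.0, 1.0, 0.0])}
s1 = avg_vector(['free', 'prize'], wv, size=4)
s2 = avg_vector(['free', 'meeting'], wv, size=4)
print(cosine(s1, s2))  # closer to 1.0 = more similar
```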
### 2.7 Text classification tasks

Text classification is the classic NLP task: mapping a text sequence to predicted categories.
- In one variant, the input is a sequence and the output is a category for the sequence as a whole, as in SMS and Weibo classification or intent recognition.
- In the other, the output is a category for each position of the sequence; the sequence labeling discussed above can be viewed as classification at word granularity, as in named entity recognition.

For classification, end-to-end learning with pre-training plus a (neural network) classification model is the mainstream: deep learning learns the feature representation and then classifies, greatly reducing manual feature work. Judging from real project experience, though, for some hard (noisy) tasks, adding some hand-engineered features is still well worth it.
![](https://upload-images.jianshu.io/upload_images/11682271-33c01000f13f64a9.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

### 2.8 Text generation tasks

Text generation is prediction from a category to a sequence, or from one sequence to another. Divided by input type, automatic text generation includes text-to-text generation, meaning-to-text generation, data-to-text generation, and image-to-text generation. Concrete applications include machine translation, text summarization, reading comprehension, chit-chat dialogue, writing, and image captioning. Commonly used models include RNN, CNN, seq2seq, and the Transformer.
![](https://upload-images.jianshu.io/upload_images/11682271-8922e985dad726ab.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Here too, text generation based on large-scale pre-trained models is a hot topic; see "A Survey of Pretrained Language Models Based Text Generation".

## III. Hands-on: spam SMS text classification

### 3.1 Load and inspect the SMS data

This project learns a spam SMS classification model from labeled SMS texts. There are 5,572 samples in total, labeled either spam or ham; it is a typical class-imbalanced binary classification problem.
![](https://upload-images.jianshu.io/upload_images/11682271-33738bf2e358bd20.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

```
# Full source: https://github.com/aialgorithm/Blog
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

spam_df = pd.read_csv('./data/spam.csv', header=0, encoding="ISO-8859-1")

# Show the data
_, ax = plt.subplots(1, 2, figsize=(10, 5))
spam_df['label'].value_counts().plot(ax=ax[0], kind="bar", rot=90, title='label')
spam_df['label'].value_counts().plot(ax=ax[1], kind="pie", rot=90, title='label', ylabel='')
print("Dataset size: ", spam_df.shape)

spam_df.head(5)
```

### 3.2 Data cleaning and preprocessing
Cleaning removes noise from the data. Here the SMS text is split on whitespace, lowercased, stripped of non-English characters and stop words, and stemmed. Because the number of digits in an SMS may itself carry meaning, digits are replaced with 'x' rather than deleted. Finally, the label is encoded as a numeric (0/1) spam indicator.

```
# Imports
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from string import punctuation

import re  # regular expressions
stop_words = set(stopwords.words('english'))
non_words = list(punctuation)

# Lemmatization / stemming
# from nltk.stem import WordNetLemmatizer
# wnl = WordNetLemmatizer()
stemmer = SnowballStemmer('english')
def stem_tokens(tokens, stemmer):
    stems = []
    for token in tokens:
        stems.append(stemmer.stem(token))
    return stems

### Remove non-English characters and replace digits with x
def clean_non_english_xdig(txt, isstem=True, gettok=True):
    txt = re.sub('[0-9]', 'x', txt)      # replace digits with x
    txt = txt.lower()                    # lowercase everything
    txt = re.sub('[^a-zA-Z]', ' ', txt)  # replace non-letters with spaces
    word_tokens = word_tokenize(txt)     # tokenize
    if not isstem:  # stem or not
        filtered_word = [w for w in word_tokens if w not in stop_words]  # drop stop words
    else:
        filtered_word = [stemmer.stem(w) for w in word_tokens if w not in stop_words]  # drop stop words and stem
    if gettok:  # return a token list or a joined string
        return filtered_word
    else:
        return " ".join(filtered_word)

# Clean the data
spam_df['token'] = spam_df.message.apply(lambda x: clean_non_english_xdig(x))

# Integer-encode the labels
spam_df['label'] = (spam_df.label == 'spam').astype(int)

spam_df.head(3)
```
![](https://upload-images.jianshu.io/upload_images/11682271-165f259dbbc0d678.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
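As a quick sanity check of the cleaning pipeline, continuing with the `clean_non_english_xdig` defined above (the sample message and the expected output are illustrative):

```
# Hypothetical SMS message, to show the cleaning steps end to end
sample = "WINNER!! You have won a 1000 GBP prize, call 09061701461 now!"
print(clean_non_english_xdig(sample))
# digits -> 'x', lowercased, non-letters stripped, stop words dropped, stemmed;
# roughly: ['winner', 'xxxx', 'gbp', 'prize', 'call', 'xxxxxxxxxxx']
```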
### 3.3 FastText word-vector representation learning
Word text must be converted into numeric word vectors before it can be fed to a model. Common representations include bag-of-words, fastText, and BERT; here a fastText model is trained. Its main inputs are the tokenized corpus (the more training text the better; when the corpus at hand is limited, taking a suitable large-scale pre-trained model from GitHub for the word vectors is also a good option) and the vector dimension size (an empirical rule for the dimension is dim > 8.33 · log N, where N is the vocabulary size: only a sufficiently large dimension can express the meanings of a vocabulary of that scale; see "最小熵原理(六):词向量的维度应该怎么选择?" by 苏剑林, https://kexue.fm/archives/7695). When the corpus is large, workers enables multi-process training (the remaining parameters and the theory behind representation learning will get their own article later, or can be looked up independently).

```
# Train the FastText embedding model (gensim 3.x parameter names)
from gensim.models import FastText

fmodel = FastText(spam_df.token, size=100, sg=1, window=3, min_count=1, iter=10,
                  min_n=3, max_n=6, word_ngrams=1, workers=12)
print(fmodel.wv['hello'])  # print the word vector for 'hello'
# fmodel.save('./data/fasttext100dim')
```
![](https://upload-images.jianshu.io/upload_images/11682271-d7947d5c6dfd6df8.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

A sentence vector is generated for each message by averaging the word vectors of all the words in the sentence.

```
fmodel = FastText.load('./data/fasttext100dim')

# Average all word vectors of a sentence to produce its sentence vector
def build_sentence_vector(sentence, w2v_model, size=100):
    sen_vec = np.zeros((size,))
    count = 0
    for word in sentence:
        try:
            sen_vec += w2v_model.wv[word]
            count += 1
        except KeyError:
            continue
    if count != 0:
        sen_vec /= count
    return sen_vec

# Sentence vectors
sents_vec = []
for sent in spam_df['token']:
    sents_vec.append(build_sentence_vector(sent, fmodel, size=100))

print(len(sents_vec))
```

### 3.4 Train the text classification model
The example uses fastText embeddings plus a LightGBM binary classifier; the class imbalance is handled with LightGBM's cost-sensitive learning (class_weight='balanced'). The hyperparameters below were set by hand; a proper hyperparameter search would likely do better.

```
### Train the text classification model
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

train_x, test_x, train_y, test_y = train_test_split(sents_vec, spam_df.label,
                                                    test_size=0.2, shuffle=True, random_state=42)
clf = LGBMClassifier(class_weight='balanced', n_estimators=300, num_leaves=64,
                     reg_alpha=1, reg_lambda=1, random_state=42)
# clf = LogisticRegression(class_weight='balanced', random_state=42)

clf.fit(train_x, train_y)

import pickle
# Save the model
pickle.dump(clf, open('./saved_models/spam_clf.pkl', 'wb'))

# Load the model
model = pickle.load(open('./saved_models/spam_clf.pkl', 'rb'))
```

### 3.5 Model evaluation
With a 0.2 train/test split, AUC, F1-score, and related metrics are verified separately on the training and test sets; both show solid results.
![](https://upload-images.jianshu.io/upload_images/11682271-df79bdfb07472962.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

```
from sklearn.metrics import auc, roc_curve, f1_score, precision_score, recall_score

def model_metrics(model, x, y):
    """Evaluate the model and plot its ROC curve."""
    yhat = model.predict(x)
    yprob = model.predict_proba(x)[:, 1]
    fpr, tpr, _ = roc_curve(y, yprob, pos_label=1)
    metrics = {'AUC': auc(fpr, tpr), 'KS': max(tpr - fpr),
               'f1': f1_score(y, yhat), 'P': precision_score(y, yhat),
               'R': recall_score(y, yhat)}

    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, 'k--', label='ROC (area = {0:.2f})'.format(roc_auc), lw=2)
    plt.xlim([-0.05, 1.05])  # pad the axes so the curve stays clear of the borders
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")

    return metrics

print('train ', model_metrics(clf, train_x, train_y))
print('test ', model_metrics(clf, test_x, test_y))
```
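Finally, a hedged end-to-end inference sketch that chains the pieces above together (the paths match the training code above; the helper name `predict_spam` and the sample message are my own, not from the original post):

```
import pickle
import numpy as np
from gensim.models import FastText

# Load the artifacts saved above
fmodel = FastText.load('./data/fasttext100dim')
clf = pickle.load(open('./saved_models/spam_clf.pkl', 'rb'))

def predict_spam(message):
    """Clean -> embed -> classify a single SMS message; returns P(spam)."""
    tokens = clean_non_english_xdig(message)               # cleaning function from 3.2
    vec = build_sentence_vector(tokens, fmodel, size=100)  # sentence vector from 3.3
    return clf.predict_proba(np.array([vec]))[0, 1]

print(predict_spam("Free entry! Text WIN to claim your prize"))
```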

route-pattern: /_view_fragments/issues/show/:user_id/:repository/:id/issue_layout(.:format)
route-controller: voltron_issues_fragments
route-action: issue_layout
hovercard-subject-tag: issue:1243956233
github-keyboard-shortcuts: repository,issues,copilot
google-site-verification: Apib7-x98H0j5cPqHWwSMm6dNU4GmODRoqxLiDzdx9I
octolytics-url: https://collector.github.com/github/collect
analytics-location: ///voltron/issues_fragments/issue_layout
fb:app_id: 1401488693436528
apple-itunes-app: app-id=1477376905, app-argument=https://github.com/_view_fragments/issues/show/aialgorithm/Blog/52/issue_layout
twitter:image: https://opengraph.githubassets.com/2502da34874dcd80733c75b4f0870e48fddcaf147ffbf1014dddcfeec2a3fe4e/aialgorithm/Blog/issues/52
twitter:card: summary_large_image
og:image: https://opengraph.githubassets.com/2502da34874dcd80733c75b4f0870e48fddcaf147ffbf1014dddcfeec2a3fe4e/aialgorithm/Blog/issues/52
og:image:alt: (same as Description, truncated)
og:image:width: 1200
og:image:height: 600
og:site_name: GitHub
og:type: object
og:author:username: aialgorithm
hostname: github.com
expected-hostname: github.com
turbo-cache-control: no-preview
go-import: github.com/aialgorithm/Blog git https://github.com/aialgorithm/Blog.git
octolytics-dimension-user_id: 33707637
octolytics-dimension-user_login: aialgorithm
octolytics-dimension-repository_id: 147093233
octolytics-dimension-repository_nwo: aialgorithm/Blog
octolytics-dimension-repository_public: true
octolytics-dimension-repository_is_fork: false
octolytics-dimension-repository_network_root_id: 147093233
octolytics-dimension-repository_network_root_nwo: aialgorithm/Blog
turbo-body-classes: logged-out env-production page-responsive
disable-turbo: false
browser-stats-url: https://api.github.com/_private/browser/stats
browser-errors-url: https://api.github.com/_private/browser/errors
ui-target: full
theme-color: #1e2327
color-scheme: light dark

Links:

aialgorithm https://github.com/aialgorithm
Blog https://github.com/aialgorithm/Blog
Code https://github.com/aialgorithm/Blog
Issues 66 https://github.com/aialgorithm/Blog/issues
Pull requests 0 https://github.com/aialgorithm/Blog/pulls
Actions https://github.com/aialgorithm/Blog/actions
Projects 0 https://github.com/aialgorithm/Blog/projects
Security https://github.com/aialgorithm/Blog/security
Insights https://github.com/aialgorithm/Blog/pulse
一文概览NLP算法(Python) https://github.com/aialgorithm/Blog/issues/52#top
aialgorithm https://github.com/aialgorithm
on May 21, 2022 https://github.com/aialgorithm/Blog/issues/52#issue-1243956233
https://github.com/aialgorithm/AiPy/tree/master/Ai%E7%9F%A5%E8%AF%86%E5%9B%BE%E5%86%8C/Ai_Roadmap
https://github.com/Determined22/zh-NER-TF
http://ltp.ai/demo.html
最小熵原理(六):词向量的维度应该怎么选择? https://kexue.fm/archives/7695
(the remaining in-post links are camo.githubusercontent.com proxy URLs for the images embedded above)

Viewport: width=device-width


URLs of crawlers that visited me.