# Tree Models + Neural Networks: Joining Forces (Python)
Source: https://github.com/aialgorithm/Blog/issues/57
Following the view of the paper *Revisiting Deep Learning Models for Tabular Data*, ensemble tree models usually excel at heterogeneous datasets such as tabular data; they are the undisputed champions there. Among them, LightGBM is an enhanced GBDT that natively supports categorical variables and greatly improves training efficiency at the engineering level. For an introduction to tree models, see the earlier article [一文讲透树模型](http://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==&mid=2247488558&idx=1&sn=476991d1c8e16db31f71a18c41c98acd&chksm=ebbd968edcca1f988730379ed1553a846c9ef27cc26d2ce7f2a6034c3199358096dcc6b817d6&token=38808633&lang=zh_CN#rd).

Deep neural networks (DNNs) excel at homogeneous high-dimensional data: they learn low-dimensional, dense, distributed representations from high-dimensional sparse ones, which is why they dominate fields such as natural language and image recognition (for an introduction and hands-on guide, see [一文搞定深度学习建模全流程](http://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==&mid=2247486048&idx=1&sn=bbbe904159a9f9a65940057992a2ce8b&chksm=ebbd88c0dcca01d6a5054651e0f28266e4ad5dd762bdd1f3b6f4eadc0590af50af2ef653b14b&token=38808633&lang=zh_CN#rd)). For heterogeneous, dense tabular data, in my experience the nonlinear capacity of DNNs is less efficient than that of tree models.

So a very natural idea is to combine the strengths of tree models and neural networks. For example, use an NN to learn text embeddings and then feed them into a tree model (e.g. word2vec + LightGBM for text classification; see [NLP建模全流程](http://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==&mid=2247489096&idx=1&sn=700d19511c6fff982082148ff1d9496c&chksm=ebbd94e8dcca1dfe998bae685e7c37c9bd89717834060225193f092df3caf12d3480800ef04f&token=38808633&lang=zh_CN#rd)). Or, going the other way, let a tree model learn the tabular data first, output each sample's high-dimensional leaf-node feature representation, and feed that into a DNN.

Below we combine LightGBM + DNN and evaluate the classification performance on a credit-default tabular dataset.

### Data processing and tree model training

LightGBM handles missing values and categorical variables out of the box and has strong nonlinear fitting ability, so very little feature engineering is needed and modeling is straightforward.

```python
## Full code and data: https://github.com/aialgorithm/Blog

# split into training and test sets
train_x, test_x, train_y, test_y = train_test_split(
    train_bank[num_feas + cate_feas], train_bank.isDefault,
    test_size=0.3, random_state=0)

# train the model
lgb = lightgbm.LGBMClassifier(n_estimators=5, num_leaves=5,
                              class_weight='balanced', metric='AUC')
lgb.fit(train_x, train_y)
print('train', model_metrics(lgb, train_x, train_y))
print('test', model_metrics(lgb, test_x, test_y))
```

With this simple setup, the test AUC reaches 0.8656.

### Tree + neural network

Next we extract the tree model's leaf-node paths as features and apply a simple feature-selection step:

```python
import numpy as np

# predict leaf indices: one column per tree, shape (n_samples, n_estimators)
y_pred = lgb.predict(train_bank[num_feas + cate_feas], pred_leaf=True)

# one-hot encode the leaf indices
train_matrix = np.zeros([len(y_pred), len(y_pred[0]) * lgb.get_params()['num_leaves']],
                        dtype=np.int64)
print(train_matrix.shape)

for i in range(len(y_pred)):
    temp = np.arange(len(y_pred[0])) * lgb.get_params()['num_leaves'] + np.array(y_pred[i])
    train_matrix[i][temp] += 1

# drop all-zero features
df2 = pd.DataFrame(train_matrix)
droplist2 = []
for k in df2.columns:
    if not df2[k].any():
        droplist2.append(k)
print(len(droplist2))
df2 = df2.drop(droplist2, axis=1).add_suffix('_lgb')

# concatenate the original features with the leaf-node features
df_final2 = pd.concat([train_bank[num_feas], df2], axis=1)
df_final2.head()
```

We then feed the concatenated original + leaf-path features into a neural network, tuned with a grid search:

```python
# split into training and test sets
train_x, test_x, train_y, test_y = train_test_split(
    df_final2, train_bank.isDefault, test_size=0.3, random_state=0)

# NN model evaluation: returns (AUC, KS)
def model_metrics2(nnmodel, x, y):
    yprob = nnmodel.predict(x.replace([np.inf, -np.inf], np.nan).fillna(0))[:, 0]
    fpr, tpr, _ = roc_curve(y, yprob, pos_label=1)
    return auc(fpr, tpr), max(tpr - fpr)


import keras
from keras import regularizers
from keras.layers import Dense, Dropout, BatchNormalization
from keras.models import Sequential
from keras.callbacks import EarlyStopping

np.random.seed(1)  # fix the random seed so runs are reproducible

bestval = 0
# build the NN and grid-search the structure hyperparameters:
# input layer; layer_nums hidden ReLU layers of k neurons each; output layer
for layer_nums in range(2):                                 # number of hidden layers
    for k in list(range(1, 100, 5)):                        # neurons per hidden layer
        for norm in [0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8]:  # regularization strength
            print("****** hidden layers / neurons / norm ******", layer_nums, k, norm)
            model = Sequential()
            model.add(BatchNormalization())  # batch-normalize the input
            for _ in range(layer_nums):
                model.add(Dense(k,
                                kernel_initializer='random_uniform',  # uniform initialization
                                activation='relu',
                                kernel_regularizer=regularizers.l1_l2(l1=norm, l2=norm),  # L1 + L2 penalties
                                use_bias=True))  # hidden layer
                model.add(Dropout(norm))         # dropout regularization
            model.add(Dense(1, use_bias=True, activation='sigmoid'))  # output layer

            # compile: binary cross-entropy loss, Adam optimizer
            model.compile(optimizer='adam', loss=keras.losses.binary_crossentropy)

            # train; note the test split doubles as the early-stopping validation set here
            history = model.fit(train_x.replace([np.inf, -np.inf], np.nan).fillna(0),
                                train_y,
                                epochs=1000,      # max training epochs
                                batch_size=1000,  # batch size per step
                                validation_data=(test_x.replace([np.inf, -np.inf], np.nan).fillna(0), test_y),
                                callbacks=[EarlyStopping(monitor='val_loss', patience=10)],  # early stopping
                                verbose=False)    # silence per-epoch output
            print("best train/val loss:", min(history.history['loss']), min(history.history['val_loss']))
            print('------------train------------\n', model_metrics2(model, train_x, train_y))
            print('------------test------------\n', model_metrics2(model, test_x, test_y))
            test_auc = model_metrics2(model, test_x, test_y)[0]
            if test_auc > bestval:
                bestval = test_auc
                bestparas = ['bestval, layer_nums, k, norm', bestval, layer_nums, k, norm]

# evaluate the fit: blue = training loss, red = validation loss
plt.plot(history.history['loss'], c='blue')
plt.plot(history.history['val_loss'], c='red')
plt.show()
model.summary()  # model summary
print(bestparas)
```

In this experiment, tree + neural network gives a decent lift in test AUC: the tree model alone scores 0.8656, while tree + neural network reaches 0.8776, an improvement of 1.2 percentage points.

### Other experimental results

In Microsoft's experiments, tree + neural network (DeepGBM) also brings some gains across a range of tasks. If interested, see the references at the end.

LGB + DNN (or a single-layer LR on top) is a worthwhile idea that can improve model performance somewhat, but note that it also complicates deployment and iteration. In short, tree + neural network is a good story, but the ending is not especially dazzling.

> References:
> DeepGBM (KDD 2019): https://www.microsoft.com/en-us/research/uploads/prod/2019/08/deepgbm_kdd2019__CR_.pdf
> https://github.com/motefly/DeepGBM

*Author: [aialgorithm](https://github.com/aialgorithm) · Published 2022-07-27*
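The leaf-index encoding loop above can also be written as a single vectorized step. Here is a minimal, self-contained sketch on hypothetical leaf indices (a 2-sample, 3-tree model with `num_leaves=4`; the hand-written array stands in for what `predict(..., pred_leaf=True)` would return):

```python
import numpy as np

# Hypothetical pred_leaf output: shape (n_samples, n_trees),
# each entry is the leaf index the sample lands in, in [0, num_leaves).
num_leaves = 4
leaf_idx = np.array([[0, 2, 3],
                     [1, 2, 0]])
n_samples, n_trees = leaf_idx.shape

# same encoding as the article's loop, vectorized:
# global column = tree_index * num_leaves + leaf_index
cols = np.arange(n_trees) * num_leaves + leaf_idx
onehot = np.zeros((n_samples, n_trees * num_leaves), dtype=np.int64)
onehot[np.arange(n_samples)[:, None], cols] = 1

print(onehot.shape)        # (2, 12)
print(onehot.sum(axis=1))  # exactly one hot bit per tree -> [3 3]
```

Each row ends up with exactly `n_trees` ones, one per tree, which is what makes the resulting matrix sparse and why the article's zero-column drop step removes so many features.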
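The `model_metrics2` helper above reports AUC and `max(tpr - fpr)`, which is the KS statistic common in credit scoring. As a sanity check, both can be recomputed from first principles; this sketch uses only NumPy on toy scores and follows the same definitions that `roc_curve`/`auc` implement:

```python
import numpy as np

def auc_ks(y_true, y_score):
    """AUC via the pairwise-rank formula and KS as the max tpr-fpr gap."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # AUC = P(score_pos > score_neg), ties counted as 1/2
    diff = pos[:, None] - neg[None, :]
    auc_val = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    # KS: sweep every observed score as a threshold
    thresholds = np.unique(y_score)
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return auc_val, (tpr - fpr).max()

a, k = auc_ks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(round(a, 2), round(k, 2))  # -> 0.75 0.5
```

A perfect ranker gives (1.0, 1.0); a random one hovers near (0.5, 0.0). KS is often preferred over raw accuracy for imbalanced default data because, like AUC, it is threshold-free.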
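The `EarlyStopping(monitor='val_loss', patience=10)` callback used above stops training once the validation loss has gone `patience` consecutive epochs without improving. A minimal pure-Python sketch of that rule (ignoring Keras details such as `min_delta` and best-weight restoring):

```python
def early_stop_epoch(val_losses, patience=10):
    """Return the epoch at which patience-based early stopping would fire."""
    best, best_epoch = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch          # new best: reset patience
        elif epoch - best_epoch >= patience:
            return epoch                            # patience exhausted: stop
    return len(val_losses) - 1                      # ran to the end

losses = [0.9, 0.7, 0.6, 0.65, 0.64, 0.66]
print(early_stop_epoch(losses, patience=3))  # -> 5 (best was epoch 2)
```

With `epochs=1000` in the grid search above, this is what keeps each candidate network from training far past its best validation loss.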
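For readers without LightGBM or Keras installed, the same leaf-feature pipeline can be sketched end to end with scikit-learn alone, substituting `GradientBoostingClassifier` for LightGBM and logistic regression for the DNN, on synthetic data. All names below are illustrative, not from the article's credit dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# stage 1: the tree model
gbt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbt.fit(X_tr, y_tr)

# .apply() returns the leaf each sample lands in, per tree:
# shape (n_samples, n_estimators, 1) for binary classification
enc = OneHotEncoder(handle_unknown='ignore')
leaf_tr = enc.fit_transform(gbt.apply(X_tr)[:, :, 0])
leaf_te = enc.transform(gbt.apply(X_te)[:, :, 0])

# stage 2: a downstream model on the leaf features
# (LR here; the article uses a DNN, and also concatenates the raw features)
clf = LogisticRegression(max_iter=1000)
clf.fit(leaf_tr, y_tr)

auc_tree = roc_auc_score(y_te, gbt.predict_proba(X_te)[:, 1])
auc_leaf = roc_auc_score(y_te, clf.predict_proba(leaf_te)[:, 1])
print(f"tree AUC={auc_tree:.3f}, tree+LR AUC={auc_leaf:.3f}")
```

This is the classic GBDT + LR recipe; whether the second stage beats the tree alone depends on the dataset, which matches the article's closing caveat.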