René's URL Explorer Experiment

Title: A Walkthrough of the Full Financial Risk-Control Modeling Workflow (Python) · Issue #44 · aialgorithm/Blog · GitHub

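Description: 1. Introduction to credit risk control. Credit risk control is one of the most successful applications of data mining, because the lending industry has abundant data and clear, varied use cases. Put simply, credit risk control means judging whether a borrower will repay on time (for example, on next month's due date). More precisely, it is a combined assessment of ability to repay and willingness to repay; lending decisions are made on the basis of that advance judgment, which greatly improves the efficiency of financial services. Unlike other industrial machine learning settings, finance is extremely risk-averse, so it places heavy weight on model interpretability and stability. The usual industry practice is to mine features along many dimensions and build a set of interpretable, stable rules and...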

Opengraph URL: https://github.com/aialgorithm/Blog/issues/44

X: @github

Domain: github.com

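Article (from the page's JSON-LD, type DiscussionForumPosting):

Headline: A Walkthrough of the Full Financial Risk-Control Modeling Workflow (Python)
Author: aialgorithm (https://github.com/aialgorithm)
Published: 2022-03-08T07:39:09.000Z
URL: https://github.com/aialgorithm/Blog/issues/44
Comments: 0

## 1. Introduction to credit risk control

Credit risk control is one of the most successful applications of data mining, because the lending industry has abundant data and clear, varied use cases.

Put simply, credit risk control means judging whether a borrower will repay on time (for example, on next month's due date). More precisely, it is a combined assessment of ability to repay and willingness to repay; lending decisions are made on the basis of that advance judgment, which greatly improves the efficiency of financial services.
![](https://upload-images.jianshu.io/upload_images/11682271-5d87f2b74df9c759.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Unlike other industrial machine learning settings, finance is extremely risk-averse, so it places heavy weight on model interpretability and stability. The usual industry practice is to mine features along many dimensions and build a set of interpretable, stable rules and risk models that make a decision for each order, user, or action.

The risk model used before lending (at application time) is called the application scorecard, or A card. The A card is the key risk model; a common industry view is that the application scorecard can cover 80% of credit risk. There are also the behavior scorecard used during the loan (B card), the collection scorecard (C card), anti-fraud models, and so on.
> - A card (Application score card): quantitatively assesses the applicant at application time (credit card or loan application).
> - B card (Behavior score card): predicts, at a point during use (while the loan or credit card is active), the probability of delinquency within a given future period.
> - C card (Collection score card): predicts, once an account is overdue and in collections, the probability of repayment within a given future period.

![](https://upload-images.jianshu.io/upload_images/11682271-4ba2725efc5a869a.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Good features are critical for both models and rules. For an application scorecard (A card), features mainly fall into three groups:

- 1. Credit history: number and amount of credit transactions, debt-to-income ratio, number of credit-bureau inquiries, length of credit history, number of newly opened credit accounts, credit utilization, number and amount of delinquencies, types of credit products, recovery records. (**Credit-transaction features are usually the most important.** Without this information on past ability and willingness to repay, a risk model is usually close to useless.)

- 2. Basic profile and transaction records: age, marital status, education, job type and annual salary, wage income, deposits (AUM), assets, housing-fund and tax payments, non-credit transaction flows, and similar records. (These mainly reflect ability to repay. They can also be combined with multi-party verification of document authenticity and with ring-fraud signals such as shared phone numbers or ID numbers to identify fraud risk. Note that features such as gender, skin color, region, ethnicity, and religion should be used with caution: they may help the model, but they can also lead to algorithmic discrimination.)

- 3. Public negative records: bankruptcy, civil judgments, administrative penalties, court enforcement, gambling or fraud blacklists, and so on. (This data is not always obtainable and usually has a high missing rate, so its contribution to the model is modest; it matters more from the willingness-to-repay and fraud perspectives.)

## 2. The application scorecard (A card) end to end

For the hands-on part we use the classic application scorecard as the example, with the dataset from the Zhongyuan Bank personal loan default prediction competition. We build the application scoring model with the credit-scoring Python library toad, the tree model LightGBM, and logistic regression (LR). (Note: some financial terms used here are not explained in detail for reasons of space; please look them up if anything is unclear.)

### 2.1 Model definition

Defining an application scoring model mainly means using a series of data analyses to determine the modeling samples and the label.
> First, a few risk-control terms. Refer back to them if a concept is unclear:
>
> Delinquency bucket (M): the number of days between the due date and the actual repayment date, grouped into intervals that define a delinquency status. M is short for "month". (Note: institutions may define the intervals differently.)
> - M0: currently not overdue (also written C, for Current)
> - M1: 1-30 days overdue
> - M2: 31-60 days overdue
> - M3: 61-90 days overdue
> - M4: 91-120 days overdue
> - M5: 121-150 days overdue
> - M6: 151-180 days overdue
> - M7+: more than 180 days overdue
>
> Observation point: the sample-level time window. It is the point in time used to build the sample set (for example, users who applied for a loan in October 2010). Its definition differs by stage and is somewhat abstract, so here are two examples. For an application model, the observation point is the user's application time, for instance taking all loan applications from January to December 2019 as the sample set. For a behavior model, the observation point is a specific date, for instance taking the loans that were active and not delinquent on June 15, 2019.
>
> Observation period: the feature-level time window. It is the relative time window used to construct features, for example the 12 months before the loan application (all data from October 2009 up to the application in October 2010 can be used, giving features such as average spending amount, number of purchases, and number of loans). The observation period aligns the features across samples, and its length is generally determined by the data. One point to watch: only data from *before* this application may be used, otherwise you get data leakage (time travel, that is, using the future to predict the past).
>
> Performance period: the label-level time window. It is the time window used to define the good/bad label Y. Credit risk is inherently lagged: a borrower only starts repaying one month after the loan (the first installment), and some borrowers pay several installments before becoming delinquent.

![](https://upload-images.jianshu.io/upload_images/11682271-19d86a2797966f82.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

For ready-made competition data, the time span of the features (observation period), the samples, and the label definition have all been analyzed and fixed in advance. In a real business, however, defining the samples and the model is itself a key part of building an application scorecard. In practice nobody may hand you ready-made data and labels (in some companies the business side does analyze the good/bad definition in advance and passes it to the modelers), so the job is rarely as simple as running a classifier.

Determining the sample size and the label determines how much data the model learns from and how it tells good samples from bad ones. If samples are scarce or the label is poorly defined, the result will predictably be poor. (As a rule of thumb, the more samples that meet the modeling criteria the better, ideally at least several thousand per class.)

The label definition may look simple at first, for example "a good user is one who has never been overdue, a bad user is one who is overdue", but quantifying it is not trivial. Two main factors need to be considered:

- [Definition of bad] How many days overdue counts as a bad customer? For example, does a customer who was only 2 days overdue count as bad for modeling?

Under the guidance of the Basel Accords, a customer more than 90 days overdue (M4+) is generally defined as bad. More generally, roll rate analysis can be used to decide how many days counts as "bad". The basic idea is to measure the probability that a customer who is M periods overdue rolls to M+1 periods overdue. (We cannot wait for every customer to be overdue for a year before finally labeling them bad: the time cost is too high, and the sample would be pitifully small.) In the example below, roll rate analysis gives the probability of deterioration for each bucket. A customer currently not overdue (M0) stays not overdue next month with probability 99.71%; a customer currently at M1 remains overdue next month with probability 54.34%; a customer currently at M2 remains overdue next month with probability as high as *90.04%*. M2 is therefore a fairly clear turning point, and M2+ can be used as the definition of a bad sample.
![](https://upload-images.jianshu.io/upload_images/11682271-616b66b8057fd4fc.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

- [Performance period] How long after the application (the observation point) must performance be observed before we can determine fairly conclusively whether the customer becomes delinquent? For example, if a customer repaid every installment on time in the first 60 days after borrowing, can we already decide whether that customer is good or bad?

This is the question of choosing the performance period, and the usual tool is vintage analysis. (In credit, vintage analysis is used not only to estimate how long it takes for good and bad customers to be fully revealed, that is, the maturity period, but also to compare risk strategies across different periods.) By looking at how the cumulative number of bad users grows over time, we determine the minimum number of periods needed to reveal most of the bad customers. In the example below the bad definition is M4+. We can see that for each cohort the M4+ bad customers are mostly revealed after about 9 or 10 months, after which the total number of bad customers is fairly stable. So here the performance period can be set to 9 or 10 months.
![](https://upload-images.jianshu.io/upload_images/11682271-a499d22130f3cf9c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

With the bad definition and the performance period fixed, we can determine the labels and the final modeling sample:
- Good users: samples with no delinquency within the performance period (e.g. 9 months).
- Bad users: samples that reach the bad definition (e.g. M2+) within the performance period (e.g. 9 months).
- Gray (indeterminate) users: samples with some delinquency within the performance period that does not reach the bad definition (e.g. M2+). Note: in practice, users who were at most 3 days overdue are often also treated as good.

For example, if it is now the end of October 2022 and the performance period is 9 months, we can take the samples that applied in January 2022 or earlier (this application time is the observation point), assign good/bad labels, and build the model.

As the discussion above makes clear, good users usually far outnumber bad users, so this is a typical highly imbalanced classification problem. Ways to handle the imbalance are discussed below.
![](https://upload-images.jianshu.io/upload_images/11682271-3c3d72f6b3779cbb.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

### 2.2 Loading and preprocessing the data
> The data dictionary for this dataset, the competition description, and the code for this article can be downloaded from the corresponding code directory of the https://github.com/aialgorithm/Blog project.

![](https://upload-images.jianshu.io/upload_images/11682271-ab5bf896f90483ce.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The dataset is Zhongyuan Bank's personal loan default prediction dataset; some fields have been anonymized (financial data is mostly confidential). The main feature groups are basic personal information, financial capacity, loan history, and so on.
![](https://upload-images.jianshu.io/upload_images/11682271-c9616409a351512b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
The data has 10,000 samples and 38 raw features; isDefault is the label, indicating whether the loan defaulted.

![](https://upload-images.jianshu.io/upload_images/11682271-0f13029019594303.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

```
import pandas as pd
pd.set_option("display.max_columns", 50)

train_bank = pd.read_csv('./train_public.csv')

print(train_bank.shape)
train_bank.head()
```

Preprocessing mainly handles the date fields and noisy values, and separates categorical from numeric features.
```
import datetime

# Date field: convert issue_date to a pandas datetime and derive numeric features
train_bank['issue_date'] = pd.to_datetime(train_bank['issue_date'])
# Extract features at several time scales
train_bank['issue_date_y'] = train_bank['issue_date'].dt.year
train_bank['issue_date_m'] = train_bank['issue_date'].dt.month
# Time difference from a baseline, in days
base_time = datetime.datetime.strptime('2000-01-01', '%Y-%m-%d')   # arbitrary baseline date
train_bank['issue_date_diff'] = train_bank['issue_date'].apply(lambda x: x-base_time).dt.days
# earlies_credit_mon appears to be in year-month format; here we simply extract the year
train_bank['earlies_credit_mon'] = train_bank['earlies_credit_mon'].map(lambda x:int(sorted(x.split('-'))[0]))
train_bank.head()


# Years of employment: fill missing values, then map to integers
train_bank['work_year'] = train_bank['work_year'].fillna('10+ years')

work_year_map = {'10+ years': 10, '2 years': 2, '< 1 year': 0, '3 years': 3, '1 year': 1,
     '5 years': 5, '4 years': 4, '6 years': 6, '8 years': 8, '7 years': 7, '9 years': 9}
train_bank['work_year']  = train_bank['work_year'].map(work_year_map)

# Loan grade: ordinal encoding
train_bank['class'] = train_bank['class'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6})

# Missing values: fill with a sentinel
train_bank = train_bank.fillna('9999')

# Separate numeric and categorical features

drop_list = ['isDefault','earlies_credit_mon','loan_id','user_id','issue_date']
num_feas = []
cate_feas = []

for col in train_bank.columns:
    if col not in drop_list:
        try:
            train_bank[col] = pd.to_numeric(train_bank[col]) # convert to numeric
            num_feas.append(col)
        except (ValueError, TypeError):
            # not parseable as a number: treat as categorical
            train_bank[col] = train_bank[col].astype('category')
            cate_feas.append(col)

print(cate_feas)
print(num_feas)

```

### 2.3 Scorecard modeling with LightGBM
If you use LightGBM for default prediction, the code is essentially finished after some simple data processing. The lgb tree model is a strong ensemble learner with built-in handling of missing values and categorical variables, so little feature processing is needed. Modeling is very convenient, performance is usually good, and it can also output feature importances.

(By the way, for application scorecards the industry more often uses logistic regression (LR), because the model is simple and more interpretable.)

```
import lightgbm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, f1_score, precision_score, recall_score

def model_metrics(model, x, y):
    """ Evaluate a fitted classifier: AUC, KS, F1, precision, recall, plus a ROC plot """
    yhat = model.predict(x)
    yprob = model.predict_proba(x)[:,1]
    fpr,tpr,_ = roc_curve(y, yprob,pos_label=1)
    roc_auc = auc(fpr, tpr)
    metrics = {'AUC':roc_auc,'KS':max(tpr-fpr),
               'f1':f1_score(y,yhat),'P':precision_score(y,yhat),'R':recall_score(y,yhat)}

    plt.plot(fpr, tpr, 'k--', label='ROC (area = {0:.2f})'.format(roc_auc), lw=2)

    plt.xlim([-0.05, 1.05])  # pad the axis limits so the curve does not touch the border
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")

    return metrics

# Split into training and test sets
train_x, test_x, train_y, test_y = train_test_split(train_bank[num_feas + cate_feas], train_bank.isDefault,test_size=0.3, random_state=0)

# Train the model (num_leaves is the LightGBM parameter name; the original passed `leaves`)
lgb=lightgbm.LGBMClassifier(n_estimators=5,num_leaves=5, class_weight= 'balanced',metric = 'auc')
lgb.fit(train_x, train_y)
print('train ',model_metrics(lgb,train_x, train_y))
print('test ',model_metrics(lgb,test_x,test_y))

```
![](https://upload-images.jianshu.io/upload_images/11682271-3d3c06d6245b7e5d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
```
from lightgbm import plot_importance
plot_importance(lgb)
```
![](https://upload-images.jianshu.io/upload_images/11682271-0b452ccfd4208f71.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

### 2.4 Scorecard modeling with LR
LR, logistic regression, is a generalized linear model. Because it is simple and interpretable, it is the most commonly used model in the financial industry.

Precisely because LR is so simple and has no non-linear capacity, we usually need fairly involved feature engineering, such as binning with WOE encoding, to give the model some non-linearity.
For the theory and optimization of LR, these two articles are strongly recommended:
- [《全面解析并实现逻辑回归(Python)》](https://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==&mid=2247486157&idx=1&sn=a823b2920efdfc621a5b599112c08ed4&scene=19#wechat_redirect)
- [《逻辑回归优化技巧总结（全）》](https://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==&mid=2247486347&idx=1&sn=e8951e7299f267a5cd1eeb944d19de02&scene=19#wechat_redirect)

Below we use toad for feature analysis, feature selection, binning, and WOE encoding.
#### 2.4.1 Feature selection

```
import toad

# Exploratory data analysis
toad.detector.detect(train_bank)

# Feature selection by missing rate, IV, and correlation
train_selected, dropped = toad.selection.select(train_bank,target = 'isDefault', empty = 0.5, iv = 0.05, corr = 0.7, return_drop=True, exclude=['earlies_credit_mon','loan_id','user_id','issue_date'])
print(dropped)
print(train_selected.shape)

# Split into training and test sets
train_x, test_x, train_y, test_y = train_test_split(train_selected.drop(['loan_id','user_id','isDefault','issue_date','earlies_credit_mon'],axis=1), train_selected.isDefault,test_size=0.3, random_state=0)
```
#### 2.4.2 Chi-square binning
```
# Chi-square binning of the features
combiner = toad.transform.Combiner()

# Fit on the training data and specify the binning method

combiner.fit(pd.concat([train_x,train_y], axis=1), y='isDefault',method= 'chi',min_samples = 0.05,exclude=[])

# Export the binning result as a dict

bins = combiner.export()

bins
```

After binning, each feature is discretized into a set of bins.
![](https://upload-images.jianshu.io/upload_images/11682271-64701b4dfd77865f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Next comes a step characteristic of LR feature engineering: manually adjusting the bins for monotonicity.

The point of this step is mostly to constrain the features to be explainable in business terms; its effect on model fit is not necessarily positive. Here we make the subjective assumption that, for most features, the bad rate across bins should follow some monotonic relationship, and that a bad rate that goes up and down is hard to explain. For the number of credit inquiries, for example, the higher the bin value, the higher the bad rate should be. (Note: a feature such as age may not satisfy this kind of monotonic relationship.)

As an example, look at the binning of the variable scoring_low (the variable adjusted in the code below). Using the bad_rate trend plot, and keeping each bin's share of samples at no less than 0.05, we adjust the bins until they are monotonic. (Other features can be adjusted the same way; monotonicity adjustment is fairly time-consuming.)
```
from toad.plot import bin_plot, badrate_plot, proportion_plot

adj_var = 'scoring_low'
# Bins before adjustment: [560.4545455, 621.8181818, 660.0, 690.9090909, 730.0, 775.0]
adj_bin = {adj_var: [ 660.0, 700.9090909, 730.0, 775.0]}

c2 = toad.transform.Combiner()
c2.set_rules(adj_bin)

data_ = pd.concat([train_x,train_y], axis=1)
data_['type'] = 'train'
temp_data = c2.transform(data_[[adj_var,'isDefault','type']], labels=True)

# badrate_plot(temp_data, target = 'isDefault', x = 'type', by = adj_var)
# proportion_plot(temp_data[adj_var])
bin_plot(temp_data, target = 'isDefault',x=adj_var)
```
- Before adjustment
![](https://upload-images.jianshu.io/upload_images/11682271-a492a76162505004.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
- After adjustment
![](https://upload-images.jianshu.io/upload_images/11682271-dec183ef6f748701.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

```
# Update the combiner with the adjusted bins
combiner.set_rules(adj_bin)
combiner.export()

```

![](https://upload-images.jianshu.io/upload_images/11682271-4e58fb8da859c32c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
#### 2.4.3 WOE encoding
Next we WOE-encode the bins of each feature. WOE encoding assigns a different weight to each bin, which adds non-linearity to the LR model.
```
# Compute WOE on the training set only, otherwise the label leaks
transer = toad.transform.WOETransformer()
binned_data = combiner.transform(pd.concat([train_x,train_y], axis=1))

# Map the WOE values back onto the data: fit_transform on the training set, transform on the test set
data_tr_woe = transer.fit_transform(binned_data, binned_data['isDefault'],  exclude=['isDefault'])
data_tr_woe.head()

## test woe

# Bin first
binned_data = combiner.transform(test_x)
# Then apply the WOE mapping learned on the training set
data_test_woe = transer.transform(binned_data)
data_test_woe.head()
```

#### 2.4.4 Training the LR model
Train the model on the WOE-encoded training data. For highly imbalanced datasets such as those in financial risk control, common practice is to oversample the minority class or to use cost-sensitive learning with class_weight='balanced', which increases the learning weight of the minority class. See: [《一文解决样本不均衡（全）》](https://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==&mid=2247487430&idx=1&sn=abb25dfb333c53634f435c101e1fb8dd&scene=19#wechat_redirect)

For weak models such as LR, the gap between training and test metrics is usually small, meaning overfitting is rare.
![](https://upload-images.jianshu.io/upload_images/11682271-d746cf754c655691.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

```
# Train the LR model
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight='balanced')
lr.fit(data_tr_woe.drop(['isDefault'],axis=1), data_tr_woe['isDefault'])

print('train ',model_metrics(lr,data_tr_woe.drop(['isDefault'],axis=1), data_tr_woe['isDefault']))
print('test ',model_metrics(lr,data_test_woe,test_y))
```

#### 2.4.5 Applying the scorecard
With the trained LR model we output a table of the (probability) score distribution. Combining the false-rejection rate (good customers wrongly rejected), the recall, and business needs, we can choose a suitable score threshold (cutoff). (Note: in real applications the probability is usually also transformed non-linearly into a more intuitive integer score, score=A-B*ln(odds), so that the scorecard can be applied in a more intuitive and uniform way.)

```

train_prob = lr.predict_proba(data_tr_woe.drop(['isDefault'],axis=1))[:,1]
test_prob = lr.predict_proba(data_test_woe)[:,1]


# Group the predicted scores in bins with same number of samples in each (i.e. "quantile" binning)
toad.metrics.KS_bucket(train_prob, data_tr_woe['isDefault'], bucket=10, method = 'quantile')
```
When a user's predicted probability exceeds the chosen threshold, the user's default probability is considered high and the loan application can be rejected.
![](https://upload-images.jianshu.io/upload_images/11682271-2e81e6d2cb398f26.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)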
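The article's last section mentions converting the predicted probability into a score with score=A-B*ln(odds), without showing the conversion. A minimal sketch of one common way to fix A and B follows. It assumes that odds means the bad odds p/(1-p), that a reference score is pinned to reference odds, and that the score drops by a fixed number of points (PDO) each time the odds double. The function name `prob_to_score` and the default values 600, 1/50, and 20 are illustrative assumptions, not taken from the original article.

```python
import math


def prob_to_score(p_bad, base_score=600.0, base_odds=1.0 / 50.0, pdo=20.0):
    """Map a predicted default probability to a scorecard-style score.

    Implements score = A - B * ln(odds), with odds = p_bad / (1 - p_bad).
    All default parameter values are hypothetical, chosen for illustration:
      base_score: score assigned when odds == base_odds
      base_odds:  reference bad odds (here 1 bad to 50 good)
      pdo:        points by which the score drops when the odds double
    """
    if not 0.0 < p_bad < 1.0:
        raise ValueError("p_bad must be strictly between 0 and 1")
    b = pdo / math.log(2)                    # B: doubling the odds costs `pdo` points
    a = base_score + b * math.log(base_odds) # A: pins base_odds to base_score
    odds = p_bad / (1.0 - p_bad)
    return a - b * math.log(odds)


if __name__ == "__main__":
    # A few example default probabilities; a higher probability gives a lower score
    for p in (0.01, 1.0 / 51.0, 0.05, 0.20):
        print(f"p_bad={p:.4f} -> score={prob_to_score(p):.1f}")
```

Under these assumptions a higher default probability always yields a lower score, so a probability cutoff chosen from the KS bucket table corresponds to a single score cutoff, and applications scoring below it would be rejected.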


