René's URL Explorer Experiment


Title: An End-to-End Walkthrough of Financial Risk-Control Modeling (Python) · Issue #44 · aialgorithm/Blog · GitHub

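Description: 1. Introduction to Credit Risk Control. Credit risk control is one of the most successful applications of data mining, because the credit industry has abundant data and clear, varied business scenarios. Put simply, credit risk control predicts whether a borrower will repay on time (for example, on next month's due date). More precisely, it weighs both ability and willingness to repay, and that advance judgment is the basis of trust on which the loan is granted, which greatly improves the efficiency of financial business. Unlike other industrial machine-learning settings, finance is extremely risk-averse, so it places particular weight on model interpretability and stability. The usual industry practice is to mine features across many dimensions and build an interpretable, stable set of rules and...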

Opengraph URL: https://github.com/aialgorithm/Blog/issues/44

X: @github


Domain: github.com


Hey, it has json ld scripts:
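{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"An End-to-End Walkthrough of Financial Risk-Control Modeling (Python)","articleBody":"## 1. Introduction to Credit Risk Control\r\n\r\nCredit risk control is one of the most successful applications of data mining, because the credit industry has abundant data and clear, varied business scenarios.\r\n\r\nPut simply, credit risk control predicts whether a borrower will repay on time (for example, on next month's due date). More precisely, it weighs both ability and willingness to repay, and that advance judgment is the basis of trust on which the loan is granted, which greatly improves the efficiency of financial business.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-5d87f2b74df9c759.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\nUnlike other industrial machine-learning settings, finance is extremely risk-averse, so it places particular weight on model interpretability and stability. The usual industry practice is to mine features across many dimensions and build an interpretable, stable set of rules and risk models that make a decision on each order, user, or action.\r\n\r\nThe pre-loan (application-stage) risk model is also called the application scorecard, or A-card. The A-card is the key risk-control model; the industry consensus is that the application scorecard can cover 80% of credit risk. There are also the behavior scorecard (B-card) used while the loan is active, the collection scorecard (C-card), anti-fraud models, and so on.\r\n\u003eA-card (Application score card): quantitatively evaluates the applicant at application time (applying for a credit card or a loan).\r\nB-card (Behavior score card): predicts, at a point during use (while the loan or credit card is active), the probability of delinquency within a given future period.\r\nC-card (Collection score card): predicts the probability of repayment within a given future period once the account is already overdue and in collection.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-4ba2725efc5a869a.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\nA good feature matters a great deal to both models and rules. For the application scorecard (A-card), features fall mainly into three groups:\r\n\r\n\r\n- 1. Credit history: number and amount of credit transactions, income-to-debt ratio, number of credit-bureau inquiries, length of credit history, number of newly opened credit accounts, credit utilization, number and amount of delinquencies, credit product types, and recovery records. (**Credit-transaction features are usually the most important**; without this information on past repayment ability and willingness, the risk model is usually useless.)\r\n\r\n- 2. Basic profile and transaction records: age, marital status, education, job type and annual salary, wage income, deposits (AUM), assets, housing provident fund and tax payments, non-credit transaction flows, and so on. (These mainly assess repayment ability. They can also be combined with multi-source verification of document authenticity and with group-fraud signals such as shared phone numbers or ID numbers to identify fraud risk. Note that features such as gender, skin color, region, ethnicity, and religion should be used with caution: they may improve the model but can also lead to algorithmic discrimination.)\r\n\r\n- 3. Public negative records: bankruptcy debts, civil judgments, administrative penalties, court enforcement, gambling- or fraud-related blacklists, and so on. (Such data is not always available, usually has a high missing rate, and contributes only moderately to the model; it matters more for the willingness-to-repay and fraud dimensions.)\r\n\r\n\r\n\r\n## 2. The Full Application Scorecard (A-card) Workflow\r\n\r\nThe hands-on part uses the classic application scorecard as the example. It uses the dataset from the Zhongyuan Bank personal loan default prediction competition, together with the credit-scoring Python library toad, the tree model LightGBM, and logistic regression (LR), to build the application scoring model. (Note: for reasons of space, some financial terms used here are not explained in detail; look them up if anything is unclear.)\r\n\r\n### 2.1 Model Definition\r\n\r\nDefining the application scoring model mainly means using a series of data analyses to determine the modeling samples and labels.\r\n\u003eFirst, some financial risk-control terminology. Refer back to it whenever a concept is unclear:\r\nDelinquency bucket (M): the delinquency status obtained by bucketing the number of days between the scheduled and the actual repayment date. M stands for Month; each bucket spans about one month past due. (Note: different institutions may define the buckets differently.)\r\nM0: currently not overdue (also written C, for Current)\r\nM1: 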
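1-30 days overdue\r\nM2: 31-60 days overdue\r\nM3: 61-90 days overdue\r\nM4: 91-120 days overdue\r\nM5: 121-150 days overdue\r\nM6: 151-180 days overdue\r\nM7+: more than 180 days overdue\r\n\r\n\u003eObservation point: the sample-level time window, i.e. the point in time used to build the sample set (for example, users who applied for a loan in October 2010). Its definition differs by stage and is somewhat abstract, so here are two examples. For an application model, the observation point is the user's application time, e.g. take all loan applications from January to December 2019 as the sample set. For a behavior model, the observation point is a specific date, e.g. take the loans that were outstanding and not delinquent on June 15, 2019 as the sample set.\r\n\u003eObservation period: the feature-level time window, i.e. the relative window over which features are built, for example the 12 months before the loan application (for an application in October 2010, any data from October 2009 up to the application can be used, such as the user's average spending amount, number of purchases, or number of loans). The observation period is fixed so that features are aligned across samples; its length usually depends on the data. One thing to watch: only feature data from *before the application* may be used, otherwise you get data leakage (time travel: using the future to predict the past).\r\n\u003ePerformance period: the label-level time window, i.e. the window over which the good/bad label Y is defined. Credit risk is inherently lagged, because a borrower only starts repaying one month after borrowing (the first installment) and may repay several installments before becoming delinquent.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-19d86a2797966f82.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\nWith ready-made competition data, the feature time span (observation period), the samples, and the label definition have all been determined in advance. In real business, however, defining the samples and the model is itself a key part of building the application scorecard. In a real setting nobody may hand you ready-made data and labels (at some companies the business side analyzes the good/bad definition in advance and gives it to the modelers), so it is not as simple as just running a classifier.\r\n\r\nDetermining the sample size and labels means deciding how many samples the model learns from to tell good from bad. If samples are scarce or the label definition is flawed, the learned result will predictably be poor. (As for sample size, empirically the more eligible samples the better; ideally each class has at least a few thousand samples.)\r\n\r\nThe label definition may seem intuitively simple, for example 'a good user has never been overdue and a bad user is overdue', but quantifying it is not simple. Two main factors need to be considered:\r\n\r\n\r\n- [Definition of bad] How many days overdue counts as a bad customer? For example, does being only 2 days overdue make someone a bad customer for modeling?\r\n\r\n\r\nFollowing the Basel Accord guidance, customers more than 90 days overdue (M4+) are generally defined as bad. More generally, roll rate analysis can be used to decide how many days overdue counts as 'bad'. The basic idea is to estimate the probability that a customer in bucket M rolls forward to bucket M+1. (We cannot wait for every customer to be a year overdue before finally calling them bad: the time cost is too high, and there would be pitifully few samples.) In the example below we use roll rates to analyze the probability of worsening at each bucket. An account currently not overdue (M0) stays not overdue next month with probability 99.71%; an account currently at M1 is still overdue next month with probability 54.34%; at M2 the probability of remaining overdue next month is as high as *90.04%*. M2 is therefore a fairly clear turning point, and M2+ can be used as the definition of a bad sample.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-616b66b8057fd4fc.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\n- 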
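[Performance period] How long after the application (i.e. the observation point) must performance be observed before we can determine fairly reliably whether the customer becomes delinquent? For example, if a customer paid every installment on time during the first 60 days after borrowing, can we already label them good or bad?\r\n\r\nThis is the question of choosing the performance period. The usual method is vintage analysis (in credit, vintage analysis can be used not only to estimate how long it takes for good/bad status to be fully revealed, i.e. the maturity period, but also to compare risk strategies across periods). By analyzing how the cumulative number of bad users grows over time, we determine the minimum number of periods needed to reveal most bad customers. In the example below the bad definition is M4+. We can see that for each cohort the M4+ bad customers are essentially all revealed after about 9 or 10 months of performance, after which the total number of bad customers is fairly stable. So we can set the performance period to 9 or 10 months.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-a499d22130f3cf9c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\nWith the bad definition and the required performance period fixed, we can define the sample labels and finalize the modeling samples:\r\n- Good users: samples with no delinquency within the performance period (e.g. 9 months).\r\n- Bad users: samples that reach the bad definition (e.g. M2+) within the performance period (e.g. 9 months).\r\n- Gray users: samples with some delinquency within the performance period that does not reach the bad definition (e.g. M2+). Note: in practice, users who were overdue by no more than 3 days are often also treated as good.\r\n\r\nFor example, if it is now the end of October 2022 and the performance period is 9 months, we can take samples that applied in January 2022 or earlier (this is the observation point), assign good/bad labels, and build the model.\r\n\r\nAs the discussion above makes clear, good users usually far outnumber bad users. This is a typical highly imbalanced classification setting; methods for handling the imbalance are discussed below.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-3c3d72f6b3779cbb.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\n### 2.2 Loading and Preprocessing the Data\r\n\u003eThe data dictionary, competition description, and code for this article can be downloaded from the corresponding code directory of the https://github.com/aialgorithm/Blog project.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-ab5bf896f90483ce.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\nThe dataset is Zhongyuan Bank's personal loan default prediction dataset; some fields have been anonymized (financial data is mostly confidential). The main feature fields cover basic personal information, financial capacity, loan history, and so on.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-c9616409a351512b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\nThe data has 10,000 samples and 38 raw features; isDefault is the label, indicating whether the loan defaulted.\r\n\r\n\r\n![](https://upload-images.jianshu.io/upload_images/11682271-0f13029019594303.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n```\r\nimport pandas as pd\r\npd.set_option(\"display.max_columns\",50)\r\n\r\ntrain_bank = pd.read_csv('./train_public.csv')\r\n\r\nprint(train_bank.shape)\r\ntrain_bank.head()\r\n```\r\n\r\n\r\nPreprocessing mainly handles the date fields and noisy data, and separates categorical from numeric features.\r\n```\r\n# Date type: convert issue_date to a pandas datetime and derive numeric features\r\ntrain_bank['issue_date'] = pd.to_datetime(train_bank['issue_date'])\r\n# Extract features at multiple time scales\r\ntrain_bank['issue_date_y'] = 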
train_bank['issue_date'].dt.year\r\ntrain_bank['issue_date_m'] = train_bank['issue_date'].dt.month\r\n# Time difference from a reference date, in days\r\nbase_time = pd.Timestamp('2000-01-01')   # arbitrary reference date\r\ntrain_bank['issue_date_diff'] = (train_bank['issue_date'] - base_time).dt.days\r\n# earlies_credit_mon appears to be in a year-month format; simply extract the year here\r\ntrain_bank['earlies_credit_mon'] = train_bank['earlies_credit_mon'].map(lambda x:int(sorted(x.split('-'))[0]))\r\ntrain_bank.head()\r\n\r\n\r\n# Years of employment: fill missing values, then map to integers\r\ntrain_bank['work_year'] = train_bank['work_year'].fillna('10+ years')\r\n\r\nwork_year_map = {'10+ years': 10, '2 years': 2, '\u003c 1 year': 0, '3 years': 3, '1 year': 1,\r\n     '5 years': 5, '4 years': 4, '6 years': 6, '8 years': 8, '7 years': 7, '9 years': 9}\r\ntrain_bank['work_year']  = train_bank['work_year'].map(work_year_map)\r\n\r\ntrain_bank['class'] = train_bank['class'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6})\r\n\r\n# Fill remaining missing values with a sentinel\r\ntrain_bank = train_bank.fillna('9999')\r\n\r\n# Separate numeric and categorical features\r\n\r\ndrop_list = ['isDefault','earlies_credit_mon','loan_id','user_id','issue_date']\r\nnum_feas = []\r\ncate_feas = []\r\n\r\nfor col in train_bank.columns:\r\n    if col not in drop_list:\r\n        try:\r\n            train_bank[col] = pd.to_numeric(train_bank[col]) # convert to numeric\r\n            num_feas.append(col)\r\n        except (ValueError, TypeError):\r\n            train_bank[col] = train_bank[col].astype('category')\r\n            cate_feas.append(col)\r\n            \r\nprint(cate_feas)\r\nprint(num_feas)\r\n\r\n```\r\n\r\n### 2.3 Scorecard Modeling with LightGBM\r\nIf you use LightGBM for default prediction, simple data processing is about all the code there is. LightGBM, a tree-based model, is a strong ensemble learner with built-in handling of missing values and categorical variables, so little feature processing is needed. Modeling is very convenient, performance is usually good, and it can output feature importances.\r\n\r\n(By the way, in industry, application scorecards more often use logistic regression (LR), because the model is simple and easy to interpret.)\r\n\r\n```\r\nimport lightgbm\r\nimport matplotlib.pyplot as plt\r\nfrom sklearn.metrics import roc_curve, auc, f1_score, precision_score, recall_score\r\nfrom sklearn.model_selection import train_test_split\r\n\r\ndef model_metrics(model, x, y):\r\n    \"\"\" Evaluate the model and plot its ROC curve \"\"\"\r\n    yhat = model.predict(x)\r\n    yprob = model.predict_proba(x)[:,1]\r\n    fpr,tpr,_ = roc_curve(y, yprob,pos_label=1)\r\n    metrics = {'AUC':auc(fpr, tpr),'KS':max(tpr-fpr),\r\n               
'f1':f1_score(y,yhat),'P':precision_score(y,yhat),'R':recall_score(y,yhat)}\r\n    \r\n    roc_auc = auc(fpr, tpr)\r\n\r\n    plt.plot(fpr, tpr, 'k--', label='ROC (area = {0:.2f})'.format(roc_auc), lw=2)\r\n\r\n    plt.xlim([-0.05, 1.05])  # pad the axis limits so the curve does not touch the border\r\n    plt.ylim([-0.05, 1.05])\r\n    plt.xlabel('False Positive Rate')\r\n    plt.ylabel('True Positive Rate')  # Chinese labels also work but require a suitable font\r\n    plt.title('ROC Curve')\r\n    plt.legend(loc=\"lower right\")\r\n\r\n\r\n    return metrics\r\n# Split into training and test sets\r\ntrain_x, test_x, train_y, test_y = train_test_split(train_bank[num_feas + cate_feas], train_bank.isDefault,test_size=0.3, random_state=0)\r\n\r\n# Train the model\r\nlgb=lightgbm.LGBMClassifier(n_estimators=5,num_leaves=5, class_weight= 'balanced',metric = 'auc')\r\nlgb.fit(train_x, train_y)\r\nprint('train ',model_metrics(lgb,train_x, train_y))\r\nprint('test ',model_metrics(lgb,test_x,test_y))\r\n\r\n```\r\n![](https://upload-images.jianshu.io/upload_images/11682271-3d3c06d6245b7e5d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n```\r\nfrom lightgbm import plot_importance\r\nplot_importance(lgb)\r\n```\r\n![](https://upload-images.jianshu.io/upload_images/11682271-0b452ccfd4208f71.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n### 2.4 Scorecard Modeling with LR\r\nLR, i.e. logistic regression, is a generalized linear model. Because it is simple and easy to interpret, it is the most commonly used model in the financial industry.\r\n\r\nPrecisely because LR is so simple and has no nonlinear capacity, we often need fairly elaborate feature engineering, such as binning with WOE encoding, to give the model nonlinear capacity.\r\nFor the principles of LR and how to optimize it, these are strongly recommended reading:\r\n- [《全面解析并实现逻辑回归(Python)》 (A Full Analysis and Implementation of Logistic Regression in Python)](https://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==\u0026mid=2247486157\u0026idx=1\u0026sn=a823b2920efdfc621a5b599112c08ed4\u0026scene=19#wechat_redirect)\r\n- [《逻辑回归优化技巧总结(全)》 (A Complete Summary of Logistic Regression Optimization Techniques)](https://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==\u0026mid=2247486347\u0026idx=1\u0026sn=e8951e7299f267a5cd1eeb944d19de02\u0026scene=19#wechat_redirect)\r\n\r\nBelow we use toad for feature analysis, feature selection, feature binning, and WOE encoding.\r\n#### 2.4.1 Feature Selection\r\n\r\n```\r\nimport toad\r\n\r\n# Exploratory data analysis (EDA)\r\ntoad.detector.detect(train_bank)\r\n\r\n# Feature selection by missing rate, IV, and correlation\r\ntrain_selected, dropped = 
toad.selection.select(train_bank,target = 'isDefault', empty = 0.5, iv = 0.05, corr = 0.7, return_drop=True, exclude=['earlies_credit_mon','loan_id','user_id','issue_date'])\r\nprint(dropped)\r\nprint(train_selected.shape)\r\n\r\n# Split into training and test sets\r\ntrain_x, test_x, train_y, test_y = train_test_split(train_selected.drop(['loan_id','user_id','isDefault','issue_date','earlies_credit_mon'],axis=1), train_selected.isDefault,test_size=0.3, random_state=0)\r\n```\r\n#### 2.4.2 Chi-square Binning\r\n```\r\n# Chi-square binning of the features\r\ncombiner = toad.transform.Combiner()\r\n\r\n# Fit on the training data with the chosen binning method\r\n\r\ncombiner.fit(pd.concat([train_x,train_y], axis=1), y='isDefault',method= 'chi',min_samples = 0.05,exclude=[])\r\n\r\n# Export the binning result as a dict\r\n\r\nbins = combiner.export()\r\n\r\nbins \r\n```\r\n\r\nBinning discretizes each feature into a set of bins.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-64701b4dfd77865f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n\r\nNext comes a step characteristic of LR feature engineering: manually adjusting the bins for monotonicity.\r\n\r\nThis step is mostly about constraining features for business interpretability; its effect on model fit is not necessarily positive. Here we subjectively assume that, for most features, the bad rate across bins should follow some monotonic relationship, whereas a bad rate that rises and falls is hard to explain. For example, for the number of credit inquiries, the higher the bin value, the higher the bad rate should be. (Note: a feature such as age may not satisfy this monotonic relationship.)\r\n\r\n\r\nWe can inspect the binning of a single variable (the code below uses scoring_low) and, guided by the bad_rate trend plot while keeping each bin's share of samples at no less than 0.05, adjust the bins until they are monotonic. (Other features can be adjusted in the same way; monotonicity adjustment is rather time-consuming.)\r\n```\r\nadj_var = 'scoring_low'\r\n# Original bins before adjustment: [560.4545455, 621.8181818, 660.0, 690.9090909, 730.0, 775.0]\r\nadj_bin = {adj_var: [ 660.0, 700.9090909, 730.0, 775.0]}\r\n\r\nc2 = toad.transform.Combiner()\r\nc2.set_rules(adj_bin)\r\n\r\ndata_ = pd.concat([train_x,train_y], axis=1)\r\ndata_['type'] = 'train'\r\ntemp_data = c2.transform(data_[[adj_var,'isDefault','type']], labels=True)\r\n\r\nfrom toad.plot import bin_plot, badrate_plot, proportion_plot\r\n# badrate_plot(temp_data, target = 'isDefault', x = 'type', by = adj_var)\r\n# proportion_plot(temp_data[adj_var])\r\nbin_plot(temp_data, target = 'isDefault',x=adj_var)\r\n```\r\n- 
Before adjustment\r\n![](https://upload-images.jianshu.io/upload_images/11682271-a492a76162505004.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n- After adjustment\r\n![](https://upload-images.jianshu.io/upload_images/11682271-dec183ef6f748701.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n```\r\n# Apply the adjusted bins\r\ncombiner.set_rules(adj_bin)\r\ncombiner.export()\r\n\r\n```\r\n\r\n![](https://upload-images.jianshu.io/upload_images/11682271-4e58fb8da859c32c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n#### 2.4.3 WOE Encoding\r\nNext, WOE-encode the bins of each feature. WOE encoding gives each bin its own weight, which adds nonlinearity to the LR model.\r\n```\r\n# Compute WOE on the training set only; otherwise the labels leak\r\ntranser = toad.transform.WOETransformer()\r\nbinned_data = combiner.transform(pd.concat([train_x,train_y], axis=1))\r\n\r\n# Map the WOE values onto the data: fit_transform on the training set, transform on the test set\r\ndata_tr_woe = transer.fit_transform(binned_data, binned_data['isDefault'],  exclude=['isDefault'])\r\ndata_tr_woe.head()\r\n\r\n## test woe\r\n\r\n# Bin first\r\nbinned_data = combiner.transform(test_x)\r\n# Then map to WOE values; the test set uses transform only\r\ndata_test_woe = transer.transform(binned_data)\r\ndata_test_woe.head()\r\n```\r\n\r\n#### 2.4.4 Training LR\r\nTrain the model on the WOE-encoded training data. For a highly imbalanced dataset like financial risk control, a common approach is to oversample the minority class or to use cost-sensitive learning via class_weight='balanced', which increases the learning weight of the minority class. See: [《一文解决样本不均衡(全)》 (Handling Class Imbalance, Complete Guide)](https://mp.weixin.qq.com/s?__biz=MzI4MDE1NjExMQ==\u0026mid=2247487430\u0026idx=1\u0026sn=abb25dfb333c53634f435c101e1fb8dd\u0026scene=19#wechat_redirect)\r\n\r\nFor weak models such as LR, the gap between training and test metrics is usually small, i.e. overfitting is rare.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-d746cf754c655691.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n\r\n```\r\n# Train the LR model\r\nfrom sklearn.linear_model import LogisticRegression\r\n\r\nlr = LogisticRegression(class_weight='balanced')\r\nlr.fit(data_tr_woe.drop(['isDefault'],axis=1), data_tr_woe['isDefault'])\r\n\r\nprint('train ',model_metrics(lr,data_tr_woe.drop(['isDefault'],axis=1), data_tr_woe['isDefault']))\r\nprint('test 
',model_metrics(lr,data_test_woe,test_y))\r\n```\r\n\r\n#### 2.4.5 Applying the Scorecard\r\nUsing the trained LR model, output a table of the (probability) score distribution. Combined with the false-rejection rate, recall, and business needs, this lets us choose a suitable score threshold (cutoff). (Note: in practice, the probability is usually also transformed nonlinearly into a more intuitive integer score, score=A-B*ln(odds), so that the scorecard can be applied more intuitively and consistently.)\r\n\r\n\r\n```\r\n\r\ntrain_prob = lr.predict_proba(data_tr_woe.drop(['isDefault'],axis=1))[:,1]\r\ntest_prob = lr.predict_proba(data_test_woe)[:,1]\r\n\r\n\r\n# Group the predicted scores in bins with same number of samples in each (i.e. \"quantile\" binning)\r\ntoad.metrics.KS_bucket(train_prob, data_tr_woe['isDefault'], bucket=10, method = 'quantile')\r\n```\r\nWhen a user's predicted probability exceeds the chosen threshold, the user's default probability is high and the loan application can be rejected.\r\n![](https://upload-images.jianshu.io/upload_images/11682271-2e81e6d2cb398f26.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\r\n","author":{"url":"https://github.com/aialgorithm","@type":"Person","name":"aialgorithm"},"datePublished":"2022-03-08T07:39:09.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":0},"url":"https://github.com/44/Blog/issues/44"}
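
Section 2.4.5 of the article body above mentions converting the predicted probability into an integer score via score = A - B*ln(odds), but does not show it. A minimal sketch of that conversion follows. The calibration convention used here (a base score at some base odds, plus a fixed number of points to double the odds, PDO) and the values 600, 1/50 and 20 are assumptions for illustration and do not come from the article; `scorecard_params` and `prob_to_score` are hypothetical helper names.

```python
import math


def scorecard_params(base_score=600.0, base_odds=1 / 50, pdo=20.0):
    """Return (A, B) for score = A - B * ln(odds), with odds = p / (1 - p).

    Assumed convention (not from the article): the score equals `base_score`
    when the bad:good odds equal `base_odds`, and drops by `pdo` points
    every time the odds double.
    """
    B = pdo / math.log(2)
    A = base_score + B * math.log(base_odds)
    return A, B


def prob_to_score(p, A, B, eps=1e-6):
    """Map a predicted default probability p to a score (higher = safer)."""
    p = min(max(p, eps), 1 - eps)  # keep the log-odds finite at p = 0 or 1
    odds = p / (1 - p)
    return A - B * math.log(odds)


A, B = scorecard_params()
for p in (1 / 51, 1 / 26, 0.5):
    # round() gives the integer score that a scorecard would report
    print(p, round(prob_to_score(p, A, B)))
```

Under these assumed settings, a default probability of 1/51 (odds of 1:50) maps to 600 points, and each doubling of the odds lowers the score by 20 points. The same function could be applied element-wise to `train_prob` or `test_prob` from the article's code.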


