
Imperfect Intelligence, Part I – Garbage Data

Artificial Intelligence is here to make your life easier. It can recommend movies you will like, drive you around relatively safely, diagnose your medical ailments, and even predict what you are going to buy before you know it yourself. In the right hands, AI is undoubtedly here to enhance your life. But what happens when we use AI to support big financial decisions? Could an AI solution reject a first-time buyer’s mortgage application simply because they spent too much on Deliveroo last month?

AI has one glaring Achilles heel: the data fed to it carries only the meaning we humans assign to it. Data cannot be objective. Rather, its significance is constructed by humans, fed into machine programmes written by humans, and its results are subsequently interpreted by humans.

As AI and machine learning continue to permeate more and more business decisions, we need to be wary of their shortcomings and how to mitigate them. It is important to remember the common refrain “garbage in, garbage out”: if the input data is biased, a computer algorithm cannot produce useful output. While financial services firms can use algorithms to support many of their core functions, like deciding who to lend to and where to invest, these will not work unless they are trained on accurate, realistic and plentiful data.

I have narrowed it down to three key reasons why businesses end up creating “garbage” data that robs their AI programmes of objectivity:

Framing the Problem

For every solution a deep-learning programme tries to deliver, there must first be a problem. When machines are trained to produce a certain output, they are entirely at the mercy of how the business defines that problem.

Take, for example, an algorithm designed to increase the profits of a retail credit division. Framed by the business in this manner, the solution may well discover that people who are less likely to repay their debts on time are more immediately profitable, and therefore make negligent recommendations on product suitability, an outcome the business surely did not seek.
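To make this concrete, here is a minimal sketch, with invented applicants and scoring functions, of how the framing of the objective alone can flip which customer an algorithm favours:

```python
# A minimal sketch (hypothetical applicants and scores) of objective framing.
from dataclasses import dataclass

@dataclass
class Applicant:
    expected_fees: float        # revenue from interest and late-payment fees
    p_timely_repayment: float   # estimated probability of repaying on time

def profit_only_score(a: Applicant) -> float:
    # Mis-framed objective: late payers generate fees, so they score highly.
    return a.expected_fees

def suitability_score(a: Applicant) -> float:
    # Reframed objective: revenue only counts if the product is likely
    # to be affordable for the customer.
    return a.expected_fees * a.p_timely_repayment

risky = Applicant(expected_fees=900.0, p_timely_repayment=0.3)
safe = Applicant(expected_fees=400.0, p_timely_repayment=0.95)

assert profit_only_score(risky) > profit_only_score(safe)   # prefers risk
assert suitability_score(safe) > suitability_score(risky)   # prefers fit
```

Nothing about the underlying model changes between the two functions; only the question the business asked of it does.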

The first step to creating a robust, useful and accurate AI solution is a clear and objective formulation of the business problem, one that takes the end customer’s interests into account. Without this, the final solution will be riddled with bias and error before technical development has even begun.

Data Collection

The selection and curation of the data a machine interprets has a huge influence on the outcome. If the data set does not reflect reality, it will give you skewed results, or worse still, reinforce existing biases or barriers.

Last year Amazon had to halt its AI recruiting tool, which reviewed job applications with the aim of improving the company’s talent-spotting process. As tech in general is a male-dominated field, the programme taught itself that male candidates were preferable and discriminated against female candidates. Any resume that contained the word “women’s”, as in “women’s tennis team” or “women’s leadership group”, was automatically penalized.
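One simple mitigation, sketched below with invented column names and an arbitrary tolerance, is to compare the make-up of the training sample against the population the model will actually score before any training begins:

```python
# A minimal sketch of a pre-training representativeness check.
import pandas as pd

# Hypothetical training data, heavily skewed towards one group.
train = pd.DataFrame({"gender": ["M"] * 80 + ["F"] * 20})

# The population the model is expected to score in production.
population_share = {"M": 0.5, "F": 0.5}

train_share = train["gender"].value_counts(normalize=True)
for group, expected in population_share.items():
    observed = train_share.get(group, 0.0)
    if abs(observed - expected) > 0.1:  # arbitrary tolerance for illustration
        print(f"Warning: {group} is {observed:.0%} of the training data "
              f"but {expected:.0%} of the population")
```

A check like this cannot remove bias on its own, but it can at least surface a skewed sample before the model learns from it.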

Furthermore, the baseline for most statistical models is historical data, which provides the trends that train the model effectively. When there is not enough historical data, the output is inherently skewed. If there is nothing to compare your findings to, it is very hard to tell how accurate your model is.

Even if you have enough data to train the machine, it is important to double-check that you have the right data to paint an accurate picture.

Most banks are at the mercy of the data points their systems happen to capture, data points that were never designed to support a specific AI problem statement. This can mean key information is overlooked and never makes it into the algorithm as an input.

Say you choose to investigate the main reasons why customers are unable to make their mortgage repayments using internal account and transaction data; it is plausible you may not have enough context to generate an accurate finding. You may have captured customers’ age, income and Deliveroo habits, but this might not give you the whole picture. Are those most likely to miss a mortgage repayment also carers for elderly parents? Have they been on holiday and forgotten to pay their bills in advance? Have they had a change in relationship status?

Data Preparation

Feature selection is a crucial component of data mining, and for this reason has been described as the “art” of machine learning. Every data set is made up of different “attributes”, which must be identified as significant or insignificant before being fed into a computer algorithm.
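To show how mechanical this step can be, here is a minimal sketch, on toy data with invented feature names, in which an automated selector flags a sensitive attribute as the single most significant one, exactly the situation the next paragraphs warn about:

```python
# A minimal sketch of automated feature selection on toy data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 1000

# Hypothetical attributes captured for each applicant.
age = rng.integers(18, 80, n)
income = rng.normal(40_000, 10_000, n)
postcode = rng.integers(0, 100, n)

# Toy label deliberately correlated with age: older applicants repay here.
repaid = (age + rng.normal(0, 10, n) > 45).astype(int)

X = np.column_stack([age, income, postcode])
selector = SelectKBest(f_classif, k=1).fit(X, repaid)

feature_names = ["age", "income", "postcode"]
print([feature_names[i] for i in selector.get_support(indices=True)])
# Prints ['age']: the statistics alone say nothing about whether using
# age as a lending criterion is ethically appropriate.
```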

Problems arise when feature selection is itself subject to human bias, or when the features the data is trained on may be ethically inappropriate. For example, a computer algorithm used in the US to help predict the likelihood of criminals reoffending wrongly flagged black defendants as almost twice as likely to reoffend as their white counterparts.

If an AI/ML programme indicates that age is the most significant factor in determining creditworthiness, because the older you are the better you are at repaying loans, does that mean young people should not qualify for home loans?

Understanding Garbage Data

You don’t need to be a data scientist or a computer programmer to understand that if the data used for an AI programme is flawed and subject to human bias, then anything it tells us is equally flawed and distorted. If we don’t account for these shortcomings, we cannot be objective, and we may simply reinforce the very biases we are trying to eliminate.

With good data, AI can be put to impressive use, and I’m not just talking about recommending movies on Netflix. However, even with good data, algorithms can be trained on hidden biases that will continue to give us distorted results!
