Uploaded On 2017-11-07



Presentation Transcript

Slide1

Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector

Matt Thomson

17/11/2016

Slide2

Outline

Introduction

Traditional Fraud Detection

Assurance Scoring

Machine Learning

Business Rules

Anomaly Detection

Graph Links

Slide3

Who am I?

Matt Thomson

Senior Data Scientist at Capgemini

PhD in Astrophysics (http://arxiv.org/abs/1010.3315)

Several years' experience in fraud detection

Capgemini

Big Data Analytics team

~100 Data Scientists, Big Data Engineers and Data Analysts

Focus on Open Source and Big Data technologies to solve client problems

Sponsor the meetup today!

Slide4

Introduction to the Problem

Public sector is constantly working in an environment of reduced resources

Want to provide a better service but with greater efficiency

Therefore very important that limited resources are focussed correctly

Assurance Scoring

Use ML and other analytical methods to identify the least risky people or applications, so that investigators' resources can be targeted on the most risky

Slide5

Hypothetical Example – 2016 Olympics tickets

Imagine running the application process for selling tickets to the 2016 Olympics

Avoid selling tickets to touts/resellers

Vast majority of people applying for tickets are genuine

Fraud detection with a big class imbalance problem (<0.1%)

Avoid the approach of investigating each person applying

Let's say we know from the 2012 Olympics which people ended up reselling their tickets: this is the training data

Use ML to identify the 30% (say) least likely to be touts, who are fast-tracked

Investigators focus on the high risk

Slide6
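The fast-track idea in the Olympics example can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the two features and the logistic regression model are stand-ins chosen for brevity, and the positive rate here is far higher than the <0.1% quoted on the slide so that the toy example trains reliably.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for historical (2012) applications: two made-up features
# and a rare "resold their tickets" label. None of this is real data.
n_hist = 20_000
X_hist = rng.normal(size=(n_hist, 2))
logit = -5.0 + 2.0 * X_hist[:, 0]
y_hist = rng.random(n_hist) < 1.0 / (1.0 + np.exp(-logit))

# class_weight="balanced" is one common way to cope with class imbalance.
model = LogisticRegression(class_weight="balanced").fit(X_hist, y_hist)

# Score new applications and fast-track the 30% with the lowest risk score.
X_new = rng.normal(size=(1_000, 2))
risk = model.predict_proba(X_new)[:, 1]   # estimated probability of being a tout
cutoff = np.quantile(risk, 0.30)
fast_track = risk <= cutoff               # True = fast-tracked, False = left for investigators
```

The 30% figure is a policy choice rather than a model output; in practice it would be set by weighing the investigators' capacity against the cost of a missed tout.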

Traditional Fraud Detection

Slide7

Assurance Scoring

Focus on low-risk

Allows resources to be better focussed

Not limited to Machine Learning

Built using Python!

Pandas, Scikit-learn etc.

Scala version using Spark MLlib

Slide8

Assurance Scoring

Slide9

POLE ‘Analytical’ Data Layer

Disparate data sources - Atomic Layer

Atomic data is Transformed and Loaded into POLE

POLE Layer

Event

Location

Object

Person

Slide10

POLE ‘Analytical’ Data Layer

POLE contains ALL entities from the Atomic Layer, plus their inter-linkages

Slide11
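As a rough illustration of the POLE layer described on the two slides above, entities of the four types (Person, Object, Location, Event) and their inter-linkages could be held in a small in-memory structure. The class names, method names and example IDs below are hypothetical and are not taken from the talk; a real implementation would sit on a database or graph store.

```python
from dataclasses import dataclass, field

POLE_TYPES = {"Person", "Object", "Location", "Event"}


@dataclass(frozen=True)
class Entity:
    entity_id: str
    pole_type: str   # one of POLE_TYPES
    source: str      # which atomic-layer data source the entity came from


@dataclass
class PoleLayer:
    entities: dict = field(default_factory=dict)
    links: set = field(default_factory=set)   # undirected links between entity IDs

    def add_entity(self, entity):
        if entity.pole_type not in POLE_TYPES:
            raise ValueError(f"unknown POLE type: {entity.pole_type}")
        self.entities[entity.entity_id] = entity

    def link(self, id_a, id_b):
        if id_a not in self.entities or id_b not in self.entities:
            raise KeyError("both entities must be loaded before linking")
        self.links.add(frozenset((id_a, id_b)))

    def neighbours(self, entity_id):
        return {other for pair in self.links if entity_id in pair
                for other in pair if other != entity_id}


layer = PoleLayer()
layer.add_entity(Entity("p1", "Person", "applications_db"))
layer.add_entity(Entity("e1", "Event", "applications_db"))   # a ticket application
layer.add_entity(Entity("l1", "Location", "address_file"))   # a postcode
layer.link("p1", "e1")
layer.link("e1", "l1")
```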

Assurance Scoring

Slide12

Machine learning

Transform

Selection

Model

Training

Validation

Test

Feature extraction and selection

Model Building

Variety of output files: logs, graphics, pickle models, etc.

Testing: Unit tests, monitoring tests and integration tests

Vector Build

Input Data

Manipulate, Explore Data

Framework: Structure, flexibility, consistency

Slide13

Machine Learning: Feature Engineering

SQL, Python

Transform

Explore

Select

Ask questions, validate

Refine features

Feature Extraction

Data exploration

Feature selection

Historical Data

Slide14

Machine Learning: Model Building

Training

Validation

Test

Split Datasets

Build Models

Hyper-parameter tuning

Selected features

Models

Training results

Validation results

Test results

Compare Models

Slide15
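The model building slide above (split datasets, build models, tune hyper-parameters, compare models) can be sketched with scikit-learn as below. This is a minimal sketch, not the speaker's framework: the data is synthetic, and the 60/20/20 split, the four candidate models and the ROC AUC metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the selected feature vectors.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Split datasets: 60% training, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Build models: a tiny hyper-parameter grid over two model families.
candidates = {
    "logreg_C=0.1": LogisticRegression(C=0.1, max_iter=1000),
    "logreg_C=1.0": LogisticRegression(C=1.0, max_iter=1000),
    "forest_depth=3": RandomForestClassifier(max_depth=3, n_estimators=50, random_state=0),
    "forest_depth=8": RandomForestClassifier(max_depth=8, n_estimators=50, random_state=0),
}

# Compare models on the validation set only.
val_auc = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_auc[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

best_name = max(val_auc, key=val_auc.get)

# The test set is touched once, for the selected model.
test_auc = roc_auc_score(y_test, candidates[best_name].predict_proba(X_test)[:, 1])
```

Keeping the test set out of the comparison loop is what lets the final test result serve as an unbiased estimate for the chosen model.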

Low risk? High risk? Depends on classifier’s threshold

True positives: applications the model correctly classifies as high risk

True negatives: applications the model correctly classifies as low risk

False positives: applications the model scores as high risk but are not

False negatives: applications the model scores as low risk but were in fact high risk

Slide16
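The dependence on the classifier's threshold described on the slide above can be shown with a small worked example. The labels and scores are invented for illustration, and `confusion_counts` is a hypothetical helper, not part of any library.

```python
import numpy as np


def confusion_counts(y_true, scores, threshold):
    """Count TP/TN/FP/FN when scores >= threshold are treated as high risk."""
    y_true = np.asarray(y_true, dtype=bool)
    flagged = np.asarray(scores) >= threshold
    return {
        "TP": int(np.sum(flagged & y_true)),
        "TN": int(np.sum(~flagged & ~y_true)),
        "FP": int(np.sum(flagged & ~y_true)),
        "FN": int(np.sum(~flagged & y_true)),
    }


# Six made-up applications: 1 = actually high risk, 0 = actually low risk.
y_true = [0, 0, 0, 1, 0, 1]
scores = [0.05, 0.20, 0.40, 0.55, 0.70, 0.90]

low = confusion_counts(y_true, scores, threshold=0.3)
# low  == {"TP": 2, "TN": 2, "FP": 2, "FN": 0}
high = confusion_counts(y_true, scores, threshold=0.6)
# high == {"TP": 1, "TN": 3, "FP": 1, "FN": 1}
```

Raising the threshold fast-tracks more applications and removes a false positive, at the cost of a high-risk case slipping through as a false negative.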

Assurance Scoring

Slide17

Business Rules

Identifying fraud has often been done using deterministic rules

Look for transactions near a threshold or at the end of the day

Primarily data queries on your feature vector

Olympics example – Anyone applying for more than £10,000 of tickets

Slide18
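Since the business rules slide above describes rules as data queries on the feature vector, they can be sketched as boolean filters over a pandas DataFrame. The column names, the example rows and the "near threshold" and "end of day" cut-offs are assumptions for illustration; only the £10,000 rule comes from the slide.

```python
import pandas as pd

# Made-up feature vectors for four ticket applications.
applications = pd.DataFrame({
    "application_id": [1, 2, 3, 4],
    "ticket_value_gbp": [120.0, 15_000.0, 9_950.0, 480.0],
    "submitted_hour": [10, 14, 23, 9],
})

# Each rule is a deterministic query returning one boolean per application.
RULES = {
    "high_value": lambda df: df["ticket_value_gbp"] > 10_000,
    "near_threshold": lambda df: df["ticket_value_gbp"].between(9_500, 10_000),
    "end_of_day": lambda df: df["submitted_hour"] >= 23,
}

for name, rule in RULES.items():
    applications[name] = rule(applications)

# An application hit by any rule goes to the high-risk bucket.
applications["rule_hits"] = applications[list(RULES)].sum(axis=1)
high_risk_ids = applications.loc[applications["rule_hits"] > 0, "application_id"].tolist()
```

Keeping the rules in a named dictionary makes it easy to report which rule fired for a given application, which investigators typically need.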

Assurance Scoring

Slide19

Anomaly Detection

Use the training data to create a baseline of applications by postcode (say)

If a particular postcode has a larger than expected number of applications, then those cases are pushed into the high-risk bucket

Slide20
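The postcode baseline on the anomaly detection slide above can be sketched as a comparison of observed against expected counts. The postcodes, the baseline shares and the 1.5x cut-off below are all invented for illustration; the talk does not say how "larger than expected" is defined.

```python
import pandas as pd

# Baseline from training data: historical share of applications per postcode.
baseline_share = pd.Series({"AB1": 0.50, "CD2": 0.30, "EF3": 0.20})

# New batch of 100 applications (made up).
new_apps = pd.DataFrame({"postcode": ["AB1"] * 40 + ["CD2"] * 25 + ["EF3"] * 35})

observed = new_apps["postcode"].value_counts().reindex(baseline_share.index, fill_value=0)
expected = baseline_share * len(new_apps)

# Flag postcodes with far more applications than the baseline predicts.
ratio = observed / expected
flagged_postcodes = ratio[ratio > 1.5].index.tolist()

new_apps["high_risk"] = new_apps["postcode"].isin(flagged_postcodes)
```

Here only EF3 is flagged (35 applications against an expected 20). Postcodes absent from the baseline are ignored in this sketch and would need their own handling.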

Assurance Scoring

Slide21

Graph Links - Matching

Key part of assurance scoring: bringing data together from disparate sources

Attribute          Data Source 1    Data Source 2
Name               Matt Thomson     Matthew Thosmon
Phone Number       07123 456 789    07123 456 798
Favourite Sport    Football         Cricket

Probability of Match: 80%

Slide22
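One simple way to score the two records on the matching slide above is a weighted average of per-field string similarities. This is a hedged sketch: it uses `difflib.SequenceMatcher` from the Python standard library, the field weights are invented, and the resulting score is a heuristic that is not expected to reproduce the 80% figure on the slide, whose method is not described.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


record_1 = {"name": "Matt Thomson", "phone": "07123 456 789", "sport": "Football"}
record_2 = {"name": "Matthew Thosmon", "phone": "07123 456 798", "sport": "Cricket"}

# Hypothetical weights: identifying fields count for more than preferences.
WEIGHTS = {"name": 0.5, "phone": 0.4, "sport": 0.1}

field_scores = {f: similarity(record_1[f], record_2[f]) for f in WEIGHTS}
match_score = sum(WEIGHTS[f] * field_scores[f] for f in WEIGHTS)
```

Records whose score exceeds a chosen cut-off would be linked as the same person; the mismatched favourite sport lowers the score only slightly because of its small weight.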

Assurance Scoring

Slide23

Further Details

Come and find me!

matt.thomson@capgemini.com / @MattGThomson

Assurance Scoring brochure: http://ow.ly/4nbEUI

Blogs:

Introduction: https://www.capgemini.com/node/1380596

Integrating multiple techniques: http://bit.ly/24BmszV

Machine Learning: http://bit.ly/1QTMGnq

Many more on other topics

Slide24

We’re Hiring!

Data Science

https://www.uk.capgemini.com/careers/jobs/data-scientist-0

Big Data Engineer

https://www.uk.capgemini.com/careers/jobs/big-data-engineer

matt.thomson@capgemini.com

Slide25