Slide1
Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector
Matt Thomson
17/11/2016
Slide2
Outline
Introduction
Traditional Fraud Detection
Assurance Scoring
Machine Learning
Business Rules
Anomaly Detection
Graph Links
Slide3
Who am I?
Matt Thomson
Senior Data Scientist at Capgemini
PhD in Astrophysics (http://arxiv.org/abs/1010.3315)
Several years' experience in fraud detection
Capgemini
Big Data Analytics team
~100 Data Scientists, Big Data Engineers and Data Analysts
Focus on Open Source and Big Data technologies to solve client problems
Sponsor the meetup today!
Slide4
Introduction to the Problem
Public sector constantly working in an environment of reduced resources
Want to provide a better service but with greater efficiency
Therefore very important that limited resources are focussed correctly
Assurance Scoring
Use ML and other analytical methods to identify the least risky people or applications so that investigators' resources can be targeted on the most risky
Slide5
Hypothetical Example – 2016 Olympics tickets
Imagine running the application process for selling tickets to the 2016 Olympics
Avoid selling tickets to touts/resellers
Vast majority of people applying for tickets are genuine
Fraud detection with big class imbalance problem (<0.1%)
Avoid approach of investigating each person applying
Let's say we know from 2012 Olympics which people ended up reselling their tickets – training data
Use ML to identify the 30% (say) least likely to be touts – fast tracked
Investigators focus on the high risk
Slide6
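A minimal sketch of the fast-track idea in the Olympics example above, assuming a model has already produced a risk score per applicant (higher meaning more likely to be a tout). The function name, the applicant IDs and the score values are hypothetical and only illustrate the 30% cut.

```python
def split_fast_track(scores, fast_track_fraction=0.30):
    """Split applicant IDs into (fast_track, investigate) by predicted risk.

    scores maps applicant ID -> risk score, where higher means riskier.
    """
    # Rank applicants from least to most risky.
    ranked = sorted(scores, key=scores.get)
    # Number of applicants to fast-track, rounded to the nearest integer.
    n_fast = round(len(ranked) * fast_track_fraction)
    return ranked[:n_fast], ranked[n_fast:]


# Hypothetical risk scores for ten applicants (illustrative values only).
example_scores = {
    "A": 0.02, "B": 0.91, "C": 0.10, "D": 0.05, "E": 0.40,
    "F": 0.33, "G": 0.07, "H": 0.65, "I": 0.15, "J": 0.01,
}
fast_track, investigate = split_fast_track(example_scores)
```

Here the three lowest-scoring applicants are fast tracked and the remaining seven stay in the pool that investigators prioritise from the top down.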
Traditional Fraud Detection
Slide7
Assurance Scoring
Focus on low-risk
Allows resources to be better focussed
Not limited to Machine Learning
Built using Python!
Pandas, Scikit-learn etc
Scala version using Spark MLlib
Slide8
Assurance Scoring
Slide9
POLE ‘Analytical’ Data Layer
Disparate data sources - Atomic Layer
Atomic data is Transformed and Loaded into POLE
POLE Layer: Person, Object, Location, Event
Slide10
POLE ‘Analytical’ Data Layer
POLE contains ALL entities from the Atomic Layer, plus their inter-linkages
Slide11
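One way to picture the POLE layer described above is as a set of typed entities plus undirected links between them. The sketch below is an assumption about how such a layer could be represented in Python; the class names, field names and IDs are hypothetical and are not taken from the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str        # one of "Person", "Object", "Location", "Event"
    entity_id: str   # identifier carried over from the atomic layer


@dataclass
class PoleLayer:
    entities: set = field(default_factory=set)
    links: set = field(default_factory=set)   # unordered pairs of entities

    def add_link(self, a, b):
        """Register both entities and an undirected link between them."""
        self.entities.update((a, b))
        self.links.add(frozenset((a, b)))

    def neighbours(self, entity):
        """Return every entity directly linked to the given one."""
        return {other for link in self.links if entity in link
                for other in link if other != entity}


# Hypothetical example: a person made an application from an address.
pole = PoleLayer()
applicant = Entity("Person", "p1")
address = Entity("Location", "loc1")
application = Entity("Event", "app1")
pole.add_link(applicant, application)
pole.add_link(application, address)
```

Calling `pole.neighbours(application)` returns both the applicant and the address, which is the kind of inter-linkage the analytical layer is meant to expose.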
Assurance Scoring
Slide12
Machine learning
Framework: Structure, flexibility, consistency
[Diagram labels: Input Data; Manipulate, Explore Data; Feature extraction and selection (Transform, Selection); Vector Build; Model Building (Model; Training, Validation, Test)]
Variety of output files: logs, graphics, pickle models, etc
Testing: Unit tests, monitoring tests and integration tests
Slide13
Machine learning: Feature Engineering
SQL, Python
[Diagram labels: Historical Data; Feature Extraction (Transform); Data exploration (Explore); Feature selection (Select); Ask questions, validate; Refine features]
Slide14
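As a rough illustration of the feature-extraction step above, the sketch below turns one raw application record into a flat feature dictionary. The record layout and every feature name are hypothetical; the slides do not say which features were actually used.

```python
from datetime import date


def extract_features(application):
    """Turn one raw application record into a flat numeric feature dict."""
    tickets = application["tickets"]          # list of (event, price) pairs
    applied = application["applied_on"]
    return {
        "n_tickets": len(tickets),
        "total_value": sum(price for _, price in tickets),
        "n_distinct_events": len({event for event, _ in tickets}),
        # weekday() is 5 for Saturday and 6 for Sunday.
        "applied_on_weekend": int(applied.weekday() >= 5),
    }


# Hypothetical raw record (illustrative values only).
raw = {
    "applicant_id": "p1",
    "applied_on": date(2015, 3, 14),   # a Saturday
    "tickets": [("100m final", 150.0), ("100m final", 150.0), ("Diving", 40.0)],
}
features = extract_features(raw)
```

Candidate features like these would then go through the explore, select and refine loop shown in the diagram before any model sees them.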
Machine Learning: Model Building
[Diagram labels: Selected features; Split Datasets (Training, Validation, Test); Build Models; Hyper-parameter tuning; Models; Training results, Validation results, Test results; Compare Models]
Slide15
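The split, tune and compare loop above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the models from the talk: the data is synthetic, and a one-feature k-nearest-neighbours classifier stands in for whatever Scikit-learn models were really built, with k as the hyper-parameter being tuned.

```python
import random


def split_dataset(rows, seed=0, train=0.6, validation=0.2):
    """Shuffle rows and split them into training, validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def knn_predict(train_rows, x, k):
    """Predict a 0/1 label for x by majority vote of its k nearest neighbours."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train_rows, eval_rows, k):
    """Fraction of eval_rows whose label is predicted correctly."""
    correct = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return correct / len(eval_rows)


# Synthetic one-feature data: label 1 (high risk) is more likely for larger x.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, int(x + rng.gauss(0, 0.15) > 0.5)))

train_set, validation_set, test_set = split_dataset(data)

# Tune the hyper-parameter k on the validation set only (odd values avoid ties).
candidate_ks = [1, 3, 5, 9, 15]
validation_scores = {k: accuracy(train_set, validation_set, k) for k in candidate_ks}
best_k = max(validation_scores, key=validation_scores.get)

# Report the chosen model once on the untouched test set.
test_accuracy = accuracy(train_set, test_set, best_k)
```

Keeping the test set out of the tuning loop is what makes the final comparison between models a fair one.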
Low risk? High risk? Depends on classifier’s threshold
True positives: applications the model correctly classifies as high risk
True negatives: applications the model correctly classifies as low risk
False positives: applications the model scores as high risk but are not
False negatives: applications the model scores as low risk but were in fact high risk
Slide16
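To make the dependence on the threshold concrete, the sketch below counts the four outcomes defined above for the same scores at two different thresholds. The scores, labels and threshold values are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, TN, FP and FN when scores >= threshold are flagged high risk."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, is_high_risk in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_high_risk:
            counts["TP"] += 1
        elif flagged:
            counts["FP"] += 1
        elif is_high_risk:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical model scores and true outcomes (True = actually high risk).
scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [False, False, True, False, True, True]

at_050 = confusion_counts(scores, labels, threshold=0.5)
at_030 = confusion_counts(scores, labels, threshold=0.3)
```

Lowering the threshold from 0.5 to 0.3 removes the one false negative in this toy data, at the price of sending more applications to investigators. For assurance scoring, false negatives are the costly error because they are risky cases that get fast tracked.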
Assurance Scoring
Slide17
Business Rules
Identifying fraud has often been done using deterministic rules
Look for transactions near a threshold or at the end of the day
Primarily data queries on your feature vector
Olympics example – anyone applying for more than £10,000 of tickets
Slide18
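A business rule of the kind described above is just a deterministic query against the feature vector. The sketch below assumes the feature vector is a dictionary with a hypothetical `total_value` field, and reads the Olympics rule as a limit on the total value of tickets requested.

```python
def exceeds_ticket_value(features, limit=10_000):
    """Rule from the Olympics example: more than 10,000 pounds of tickets."""
    return features["total_value"] > limit


# Each rule is a name plus a predicate over one feature vector (a dict here).
RULES = {
    "ticket_value_over_10k": exceeds_ticket_value,
}


def triggered_rules(features, rules=RULES):
    """Return the names of all rules that fire for this application."""
    return [name for name, rule in rules.items() if rule(features)]


# Hypothetical applications (illustrative values only).
large_order = triggered_rules({"total_value": 12_500.0})
small_order = triggered_rules({"total_value": 340.0})
```

Keeping rules in a named registry makes it easy to report which rule pushed an application into the high-risk bucket.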
Assurance Scoring
Slide19
Anomaly Detection
Use the training data to create a baseline of applications by postcode (say)
If a particular postcode has a larger than expected number of applications then those cases are pushed into the high-risk bucket
Slide20
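The postcode baseline described above could be sketched as follows. The flagging criterion (observed count more than `ratio` times the expected count) and the floor for unseen postcodes are assumptions made for illustration; a production baseline would also need to allow for statistical noise in small counts.

```python
from collections import Counter


def flag_anomalous_postcodes(training_postcodes, new_postcodes, ratio=2.0):
    """Flag postcodes whose new application count far exceeds the baseline."""
    baseline = Counter(training_postcodes)
    current = Counter(new_postcodes)
    n_train, n_new = len(training_postcodes), len(new_postcodes)
    flagged = set()
    for postcode, count in current.items():
        # Expected count if this postcode kept its training-data share.
        # Postcodes unseen in training get a floor of one application.
        expected = max(baseline[postcode], 1) / n_train * n_new
        if count > ratio * expected:
            flagged.add(postcode)
    return flagged


# Hypothetical postcodes: EF3 made up 10% of training data but 27% of new data.
training = ["AB1"] * 60 + ["CD2"] * 30 + ["EF3"] * 10
new = ["AB1"] * 30 + ["CD2"] * 14 + ["EF3"] * 16
high_risk_postcodes = flag_anomalous_postcodes(training, new)
```

Applications from any flagged postcode would then be routed to the high-risk bucket regardless of their model score.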
Assurance Scoring
Slide21
Graph Links - Matching
Key part of assurance scoring – bringing data together from disparate sources
Probability of Match: 80%
Attribute        Data Source 1   Data Source 2
Name             Matt Thomson    Matthew Thosmon
Phone Number     07123 456 789   07123 456 798
Favourite Sport  Football        Cricket
Slide22
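The matching example above can be approximated with per-attribute string similarity combined into a single score. This sketch uses `difflib.SequenceMatcher` from the Python standard library and hypothetical attribute weights. It yields a similarity score rather than a calibrated probability, and it is not intended to reproduce the 80% figure on the slide.

```python
from difflib import SequenceMatcher


def _normalise(text):
    """Lower-case the text and remove all whitespace."""
    return "".join(text.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case and spaces."""
    return SequenceMatcher(None, _normalise(a), _normalise(b)).ratio()


def match_score(record_a, record_b, weights):
    """Weighted average of per-attribute similarities for two records."""
    total = sum(weights.values())
    return sum(w * similarity(record_a[attr], record_b[attr])
               for attr, w in weights.items()) / total


# The two records from the slide.
source_1 = {"name": "Matt Thomson", "phone": "07123 456 789", "sport": "Football"}
source_2 = {"name": "Matthew Thosmon", "phone": "07123 456 798", "sport": "Cricket"}

# Hypothetical weights: identifying attributes count for more than preferences.
weights = {"name": 0.45, "phone": 0.45, "sport": 0.10}
score = match_score(source_1, source_2, weights)
```

Weighting name and phone number above favourite sport reflects the intuition in the slide: a transposed digit or a misspelt surname should not prevent a link, while a differing preference carries little evidence either way.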
Assurance Scoring
Slide23
Further Details
Come and find me!
matt.thomson@capgemini.com / @MattGThomson
Assurance Scoring brochure: http://ow.ly/4nbEUI
Blogs:
Introduction: https://www.capgemini.com/node/1380596
Integrating multiple techniques: http://bit.ly/24BmszV
Machine Learning: http://bit.ly/1QTMGnq
Many more on other topics
Slide24
We’re Hiring!
Data Science
https://www.uk.capgemini.com/careers/jobs/data-scientist-0
Big Data Engineer
https://www.uk.capgemini.com/careers/jobs/big-data-engineer
matt.thomson@capgemini.com
Slide25