The Role of Metadata in Machine Learning for TAR
Amanda Jones Marzieh Bazrafshan Fernando Delgado Tania Lihatsh Tami Schuyler
ajones@h5.com mbazrafshan@h5.com fdelgado@h5.com tlihatsh@h5.com tschuyler@h5.com
Metadata Use in TAR – Lack of Consensus
It is generally agreed across the industry that metadata is a critical component of ESI for eDiscovery.
Some view incorporation of metadata into machine learning algorithm development for TAR as a matter of course.
Others view it as atypical of, if not incompatible with, machine learning approaches to document classification.
Metadata in TAR – Goals of the Study
If metadata provides information that is vital for manual document review in eDiscovery, though, why would it be any less valuable for TAR?
Goals of the current study:
  Establish the potential benefit of incorporating metadata into TAR algorithm development processes.
  Establish the potential benefit of leveraging widely inclusive sets of metadata, as opposed to limited pre-determined sets.
  Establish the potential benefit of integrating metadata using techniques that preserve the added layer of information associated with metadata values.
Metadata in TAR – Data & Methods
3 Distinct Data Sets:
  1 drawn from Topic 301 of the TREC 2010 Interactive Task
  2 proprietary business data sets
Random sample of 4,500 individually labeled documents for each
Split into a 3,000-document Control Set and a 1,500-document Training Set
All machine learning models developed using an open source Support Vector Machine (SVM) implementation
Performance Metric: Area Under the Receiver Operating Characteristic Curve (AUROC)
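The sampling design and metric on this slide can be sketched in a few lines. This is a minimal illustration and not the study's code: the function names, the fixed seed, and the rank-sum formulation of AUROC are assumptions made here. The slides say only that an open source SVM implementation produced the scores, so the SVM itself is left out.

```python
import random


def split_control_training(doc_ids, n_control=3000, n_training=1500, seed=0):
    """Randomly partition labeled documents into a Control Set and a Training Set.

    The 3000/1500 sizes follow the slide; the seed is an arbitrary assumption.
    """
    shuffled = list(doc_ids)
    if len(shuffled) < n_control + n_training:
        raise ValueError("not enough documents for the requested split")
    random.Random(seed).shuffle(shuffled)
    control = shuffled[:n_control]
    training = shuffled[n_control:n_control + n_training]
    return control, training


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    labels are 0/1; tied scores receive the average of their ranks.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

In this sketch a model would be fit on the Training Set and its scores on the Control Set passed to `auroc`. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.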
Metadata in TAR – Metadata Choices
Metadata availability varied across data sets
Fields were chosen opportunistically based on availability and amenability to feature transformation
Fields that were populated for fewer than 5% of the documents were omitted
Continuous metadata values were transformed into categorical values.
  For example, date values were collapsed into simple Month-Year values; file size values were assigned to categories ranging from very small to very large.
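The transformations described above might look roughly like the following. This is a sketch under stated assumptions: the byte cut-offs in `SIZE_BINS`, the category labels, the `MM-YYYY` output format, and all function names are hypothetical, since the slides give only the 5% population threshold and the "very small to very large" range.

```python
from datetime import date

# Hypothetical upper bounds in bytes; the slides do not state the real boundaries.
SIZE_BINS = [
    (10_000, "very_small"),
    (100_000, "small"),
    (1_000_000, "medium"),
    (10_000_000, "large"),
]


def month_year(d):
    """Collapse a date into a Month-Year category, e.g. 03-2001."""
    return d.strftime("%m-%Y")


def size_category(n_bytes):
    """Map a file size to a coarse category from very small to very large."""
    for upper_bound, label in SIZE_BINS:
        if n_bytes < upper_bound:
            return label
    return "very_large"


def keep_field(values, min_fill=0.05):
    """Keep a field unless it is populated for fewer than 5% of documents."""
    if not values:
        return False
    populated = sum(1 for v in values if v not in (None, ""))
    return populated / len(values) >= min_fill
```

Collapsing continuous values this way lets each category behave like an ordinary token, which fits the approach of adding metadata to document text described on the later slides.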
Metadata in TAR – Metadata Choices
Standard Metadata:
  Author
  Sender, Recipient, Cc
  Subject, Title, File Name
  Document Type, File Extension
  Sent Date, Created Date
  Sender Domain, Recipient Domain
Extended Metadata (all of the above, plus):
  All Custodians, Primary Custodian
  Record Type
  Attachment Name
  Bates Prefix
  Drop Id
  Company/Organization
  Native File Size, Text Size
  Normalized Date, Parent Date
  Family Count, Attachment Count
  Recipient Count, Cc Count, Combined Recipient Count
  Page Count
Metadata in TAR – Incremental Testing
Hypothesis 1: Incorporating metadata into the machine learning process will lead to improved model performance.
  Text from Standard Metadata added to the body text of documents
  There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Set 3.
Hypothesis 2: Incorporating the text from Extended Metadata will lead to superior results as compared to incorporating Standard Metadata alone.
  Text from Extended Metadata added to the body text of documents, compared to models based on the addition of Standard Metadata
  There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Sets 1 and 3.
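For Hypotheses 1 and 2, metadata values are simply added to the body text as more words. A minimal sketch of that step follows; the dictionary layout, the lower-case field keys, and the function name are assumptions for illustration, and the field list is only a subset of the Standard Metadata slide.

```python
# Hypothetical field keys standing in for some of the Standard Metadata fields.
STANDARD_FIELDS = ["author", "sender", "recipient", "cc", "subject", "title", "file_name"]


def augment_body(doc, fields):
    """Append the values of the chosen metadata fields to the body text as plain text.

    Field origin is deliberately discarded here: a sender name and the same
    name appearing in the body become indistinguishable tokens.
    """
    parts = [doc.get("body", "")]
    for field in fields:
        value = doc.get(field)
        if value:  # skip unpopulated fields
            parts.append(str(value))
    return " ".join(parts)
```

Passing a longer field list to the same function gives the Extended Metadata condition, so the two hypotheses differ only in which fields are included.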
Metadata in TAR – Incremental Testing
Hypothesis 3: Using metadata values in ways that preserve both attribute and value information will result in superior performance.
  Extended Metadata values prefixed to indicate their field origins and added to body text, compared to models with Extended Metadata added as plain text
  Improvements varied across the three data sets but were significant for Data Set 2.
  Dual modeling: prefixed metadata values and simple body text modeled independently, with the scores from the two models multiplied to arrive at a final score; dual models compared to single models with prefixed Extended Metadata
  There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Sets 2 and 3.
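The two techniques tested under Hypothesis 3 can be illustrated as follows. This is a hedged sketch: the `field__word` token format, the tokenizer, and the function names are assumptions, and the slides do not say what scale the two model scores were on before being multiplied (values in [0, 1] are assumed here).

```python
import re


def prefixed_tokens(doc, fields):
    """Turn metadata values into tokens that carry their field of origin.

    "Jane" in the sender field becomes sender__jane, which a model can treat
    differently from the word "jane" appearing in the body text.
    """
    tokens = []
    for field in fields:
        value = doc.get(field)
        if not value:
            continue
        for word in re.findall(r"\w+", str(value).lower()):
            tokens.append(f"{field}__{word}")
    return tokens


def dual_score(metadata_score, body_score):
    """Combine the scores of the metadata model and the body-text model by multiplication."""
    return metadata_score * body_score
```

With multiplication, a document ranks highly only when both models score it well, so a low score from either model pulls the final score down.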
Overall Findings – MD Can Improve TAR
Stepping back from incremental pairwise comparisons, clearer answers and more striking differences emerge.
Models incorporating Extended Metadata significantly outperformed models based on body text alone in each condition for every data set.
Overall Findings – More MD Is Better
Similarly strong trends can be observed when each model created using Standard Metadata is compared to its Extended Metadata counterpart.
Extended Metadata improvements were highly significant in all cases for Data Sets 1 and 3 and significant for the dual model in Data Set 2.
Metadata in TAR – Conclusions
Incorporating metadata as an integral component of machine learning processes for TAR in eDiscovery will benefit the community of practice.
  Neglecting this resource is, at best, a missed opportunity. In an information retrieval effort, why leave information on the table?
To realize the full potential of using metadata in machine learning for TAR, practitioners should not rely solely on a limited, intuitive set of metadata.
  Examining the contributions of specific metadata fields at a more granular level could be very worthwhile.
  Is “all available” always the best choice?
There are still more questions than answers when it comes to the use of metadata in modeling for TAR.
  More effective algorithms?
  Better techniques for capturing the full metadata contribution?
Questions?