BitCurator User Forum, Northwestern University, April 27-28, 2017 - Tools for identifying duplicate files and known software files
Slide1
Creighton Barrett
Digital Archivist, Dalhousie University Archives
BitCurator User Forum, Northwestern University
April 27-28, 2017
Tools for identifying duplicate files and known software files
Slide2
Tools
2017 BitCurator User Forum - Tools for identifying duplicate files and known software files
Slide3
FSlint (finds file system “lint”)
Duplicates
Installed packages
Bad names
Name clashes
Temp files
Bad symlinks
Bad IDs
Empty directories
Non-stripped binaries
Redundant whitespace
Slide4
FSlint – Duplicates
Image source (BitCurator wiki): https://wiki.bitcurator.net/index.php?title=Identify_and_delete_duplicate_files
Slide5
FTK – Flag Duplicates
Simpler process than FSlint, still a powerful feature
Checks entire file and generates MD5
Assigns primary status to first instance of each MD5
Assigns secondary status to subsequent instances of each MD5
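The primary/secondary logic described on this slide can be sketched in a few lines of Python. This is not FTK's implementation, only an illustration of the idea: hash each whole file with MD5, mark the first file seen for each hash as "Primary", and mark every later match as "Secondary". The function names and the sorted traversal order are assumptions made for this sketch; FTK's own ordering of "first instance" may differ.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the entire file contents in chunks and return the MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def flag_duplicates(root):
    """Return {path: (md5, status)} where status is "Primary" or "Secondary".

    Hypothetical helper: "first instance" here means first in sorted path
    order, which is an assumption of this sketch.
    """
    seen = set()
    flags = {}
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    for path in files:
        md5 = md5_of_file(path)
        status = "Secondary" if md5 in seen else "Primary"
        seen.add(md5)
        flags[str(path)] = (md5, status)
    return flags
```

Calling `flag_duplicates("/path/to/accession")` on a directory of extracted files returns one entry per file. Because the status depends on traversal order, which copy ends up "Primary" is arbitrary unless the order is chosen deliberately.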
Slide6
FTK – Flag Duplicates
Slide7
NSRL Reference Data Set (RDS)
Image source (NSRL): https://www.nsrl.nist.gov/Documents/Data-Formats-of-the-NSRL-Reference-Data-Set-16.pdf
Slide8
NSRL Reference Data Set (RDS)
Hashsets and metadata used in file identification
Data can be used in third-party digital forensics tools
RDS is updated four times each year
As of v2.55, RDS is partitioned into four divisions:
Modern – applications created in or after 2000
Legacy – applications created in or before 1999
Android – Mobile apps for the Android OS
iOS – Mobile apps for iOS
Slide9
FTK – Known File Filter (KFF)
KFF data – hash values of known files that are compared against files in an FTK case
KFF data can come from pre-configured libraries (e.g., NSRL RDS, DHS, ICE) or custom libraries
FTK ships with a version of NSRL RDS bifurcated into “Ignore” and “Alert” libraries
KFF Server – used to process KFF data against evidence in an FTK case
KFF Import Utility – used to import and index KFF data
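As a rough illustration of what a known file filter does, the sketch below classifies one MD5 against an "Ignore" library and an "Alert" library, then applies a combined filter of the kind used later in this deck (primary status and not KFF Ignore). It does not use FTK or the KFF Server. The names `kff_status`, `FileRecord` and `primary_not_ignored`, the "No match" label, and the choice to let "Alert" take precedence over "Ignore" are all assumptions of this sketch.

```python
from dataclasses import dataclass


def kff_status(md5, ignore_library, alert_library):
    """Classify one MD5 against two sets of lowercase hex digests.

    Hypothetical helper. Checking the Alert library first is an assumption.
    """
    md5 = md5.strip().lower()
    if md5 in alert_library:
        return "Alert"
    if md5 in ignore_library:
        return "Ignore"
    return "No match"  # label chosen for this sketch, not an FTK status


@dataclass
class FileRecord:
    path: str
    md5: str
    duplicate_status: str  # "Primary" or "Secondary"


def primary_not_ignored(records, ignore_library, alert_library):
    """Keep records that are Primary and whose KFF status is not Ignore."""
    return [
        record
        for record in records
        if record.duplicate_status == "Primary"
        and kff_status(record.md5, ignore_library, alert_library) != "Ignore"
    ]
```

Both libraries are plain Python sets here, so a custom library is just another set of hashes merged into one of them before filtering.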
Slide10
FTK – Known File Filter (KFF)
Slide11
Other tools to work with NSRL RDS
nsrlsvr - https://github.com/rjhansen/nsrlsvr/
Keeps track of 40+ million hash values in an in-memory dataset to facilitate fast user queries
Supports custom libraries (“local corpus”)
nsrllookup - https://rjhansen.github.io/nsrllookup/
Command-line application
Works with tools like hashdeep: http://md5deep.sourceforge.net/
National Software Reference Library - MD5/SHA1/File Name search - http://nsrl.hashsets.com
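The in-memory lookup idea behind these tools can be approximated with a Python set. The sketch below does not talk to nsrlsvr and does not reproduce nsrllookup. It assumes a comma-separated hash file with a header row that includes a column named `MD5` (the column name is an assumption here and should be checked against the RDS release in use), plus an optional "local corpus" set of a repository's own known hashes. The function names are hypothetical.

```python
import csv


def load_md5_set(csv_path, md5_column="MD5"):
    """Read one column of hashes from a CSV with a header row into a set.

    The column name "MD5" is an assumption of this sketch.
    """
    hashes = set()
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(md5_column) or "").strip().lower()
            if value:
                hashes.add(value)
    return hashes


def split_known_unknown(md5_values, reference_set, local_corpus=frozenset()):
    """Split hashes into (known, unknown) using set membership tests."""
    known, unknown = [], []
    for value in md5_values:
        value = value.strip().lower()
        if value in reference_set or value in local_corpus:
            known.append(value)
        else:
            unknown.append(value)
    return known, unknown
```

Set membership tests take roughly constant time, which is why an in-memory dataset keeps queries fast. The cost is memory: holding tens of millions of hash strings in a Python set can take several gigabytes, so a dedicated server is the more practical choice at full RDS scale.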
Slide12
Bill Freedman fonds filtered in FTK
| Filter | Description | # of files | Size |
|---|---|---|---|
| Unfiltered | All files in case | 26,651,084 | 3,568 GB |
| Primary status | Duplicate File indicator IS “Primary” | 731,417 | 83.48 GB |
| Secondary status | Duplicate File indicator IS “Secondary” | 16,569,218 | 271.5 GB |
| KFF Ignore | Match all files where KFF status IS “Ignore” | 2,548,119 | 44.29 GB |
| No KFF Ignore | Match all files where KFF status IS NOT “Ignore” + KFF status IS “Not checked” | 24,102,965 | 3,524 GB |
| Primary status + No KFF Ignore | Match all files where duplicate file indicator IS “Primary” + KFF status IS NOT “Ignore” | 626,351 | 71.95 GB |
| Actual files + Primary status + No KFF Ignore | Match all disk-bound files where duplicate file indicator IS “Primary” + KFF status IS NOT “Ignore” | 103,412 | 61.81 GB |
Slide13
Questions
Does it matter which duplicate file is selected for preservation? What if there are MD5 matches with different file names or extensions?
Can queries against NSRL RDS be incorporated into BitCurator workflows? Could provenance-based libraries of “known file” hashes be incorporated into BitCurator workflows?
Can repositories share provenance-based hash libraries (expose our “local corpus” of MD5s…)?