Slide1
WARCreate
Create Wayback-Consumable WARC Files from Any Webpage
Mat Kelly, Michele C. Weigle, Michael L. Nelson
{mkelly,mweigle,mln}@cs.odu.edu
Old Dominion University; Norfolk, VA
Slide2
What is WARCreate?
Google Chrome extension
Creates WARC files
Enables preservation by users from their browser
First steps in bringing Institutional Archiving facilities to the PC
Slide3
Target Content
Unreachable by web crawlers
  Behind authentication
  Not listed in search engines (Deep Web)
Private
  We don’t want our bank statements in Wayback
Non-pertinent to public
  Others have little interest in our Facebook comments
Slide4
Preserving More!
Much digital information is needlessly lost
User chooses what they deem important
Compatible with standard archiving tools.
Slide5
WYSIWYG
Facebook-Supplied Data Dump
Archive created from WARCreate in Wayback
Slide6
WYSIWYG
Using Scraping Tools (e.g. wget)
Archive created from WARCreate in Wayback
Slide7
WYSIWYG
A Crawler Has No Context
Archive created from WARCreate in Wayback
Slide8
WYSIWYG
IA/HERITRIX OBEY ROBOTS
Archive created from WARCreate in Wayback
Slide9
Goals
Make it easy to use (GUI-based, no command line)
Make it useful (fill the need)
Demonstrate novelty of browser-instigated preservation
Show value of WARC format for Personal Web preservation
Bring WARC format to Personal Digital Archiving
Slide10
WARC Generation is Quick & Easy
Navigate to a webpage
Click the WARCreate Icon
Click Generate WARC
Extension Output Options:
  In-Browser viewing of raw WARC
  Download to Local Disk
Slide11
Creating a WARC
Slide12
I’ve Made a WARC. Now what?
What you do with the archive is up to you.
Install it in your local Wayback instance
  Who has their own Wayback Instance!?
  Wayback is free & open source
  That seems like a lot of work!
  One additional reason for users NOT to preserve what they would like archived
Slide13
WARC Creation & Replay
1. User visits a website using their browser
2. WARCreate captures the HTTP Headers
3. User Selects “Generate WARC” button in WARCreate
4. WARC generated, saved locally (…to directory accessible to local Wayback)
5. Local Wayback instance indexes WARC
6. User accesses local Wayback to view preserved content
Slide14
Suite Installation & Interaction
Drag & Drop .zip to hard drive
Start relevant services using GUI
Execute WARCreate process
View Archive at http://localhost/wayback
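A minimal sketch of how a replay address under http://localhost/wayback might be formed. It assumes the local Wayback instance follows the common Wayback convention of prefix, 14-digit UTC timestamp, then the original URL; the function name and the example page are illustrative and not part of the suite.

```python
from datetime import datetime, timezone

# Address of the local Wayback instance, as given on the slide.
WAYBACK_PREFIX = "http://localhost/wayback"


def replay_url(original_url: str, captured_at: datetime,
               prefix: str = WAYBACK_PREFIX) -> str:
    """Build a Wayback-style replay URL: <prefix>/<YYYYMMDDhhmmss>/<original URL>.

    Assumption: the local instance uses the usual Wayback URL layout.
    """
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{prefix.rstrip('/')}/{timestamp}/{original_url}"


if __name__ == "__main__":
    captured = datetime(2012, 7, 15, 22, 15, 59, tzinfo=timezone.utc)
    print(replay_url("http://example.com/", captured))
```

If the local instance is configured with a different access point, only the prefix would need to change.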
Slide15
Replay of Preserved Twitter page
Slide16
And My Bank Statements?
Preserved content:
  never leaves WARC files
  never leaves local machine
WARCreate provides preliminary encoding/encryption support
Wayback instance is hosted on your own machine; no external access by default
Slide17
What It Doesn’t Do
Archive entire websites with a click
Submit your WARCs to IA
Contain comprehensive support for WARC format
  A subset is utilized and all generated WARCs validated at time of creation
Provide a direct means for replay
  Replay is executed through the XAMPP suite
Slide18
Why Use a Client-Side Server?
Server scripts do what JS can’t
Can reside on your machine!
Controls are GUI based
Resource fetching w/o XSS issues
[Diagram: XAMPP-Based Personal Web Archiving Suite. Components: Local Wayback Instance, WARCreate Server-Side Support, Memento Proxy, … Built on: Tomcat, Apache]
Slide19
Extras: Memento Support
Suite includes a tailored Timegate
Memento abstraction is beyond WARC
Point MementoFox (or other Memento tools) to localhost
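A rough sketch of what a Memento client sends when pointed at a local Timegate. In the Memento protocol the desired date travels in an Accept-Datetime request header formatted as an HTTP date. The Timegate path used here is hypothetical; the slides only say to point the tool at localhost.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical location of the suite's Timegate; the real path is not given in the slides.
TIMEGATE_PREFIX = "http://localhost/timegate/"


def timegate_request(original_url: str, desired: datetime):
    """Return the (url, headers) pair a Memento client would send to a Timegate."""
    # Accept-Datetime carries the desired date as an HTTP date in GMT.
    http_date = format_datetime(desired.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + original_url, {"Accept-Datetime": http_date}


if __name__ == "__main__":
    url, headers = timegate_request(
        "http://example.com/",
        datetime(2012, 7, 15, 22, 15, 59, tzinfo=timezone.utc),
    )
    print(url)
    print(headers)
```

The Timegate would then answer by redirecting to the archived copy closest to that date, which here would be a page in the local Wayback instance.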
Slide20
How it All Relates
[Diagram]
Browser Extensions (in the browser): WARCreate, MementoFox
WARCreate: Generates WARC file
  WARC/1.0
  WARC-Type: warcinfo
  WARC-Date: 2012-07-15T22:15:59.485Z
  WARC-Filename: 2220471175c820fee3fec986040ebd1f.warc
Local Wayback Instance: Index WARCs
Local Timegate: Send Desired Date; Memento negotiated & returned
Personal Archives Accessible at localhost
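The WARC header fragment shown on this slide can be sketched as a complete warcinfo record. This is an illustrative serializer only, not WARCreate's own code: the software name is a placeholder, the record ID is a random UUID, and the date is written to seconds precision rather than the milliseconds shown on the slide.

```python
import uuid
from datetime import datetime, timezone

CRLF = "\r\n"


def warcinfo_record(filename: str, created: datetime,
                    software: str = "example-archiver (hypothetical)") -> bytes:
    """Serialize a minimal WARC/1.0 warcinfo record as bytes."""
    # The block of a warcinfo record is a list of "name: value" fields.
    block = (f"software: {software}{CRLF}"
             f"format: WARC File Format 1.0{CRLF}").encode("utf-8")
    date = created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: warcinfo",
        f"WARC-Date: {date}",
        f"WARC-Filename: {filename}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/warc-fields",
        f"Content-Length: {len(block)}",  # length of the block in bytes
    ]
    # Header lines, a blank line, the block, then two CRLFs to close the record.
    head = (CRLF.join(headers) + CRLF + CRLF).encode("utf-8")
    return head + block + (CRLF + CRLF).encode("utf-8")


if __name__ == "__main__":
    record = warcinfo_record(
        "2220471175c820fee3fec986040ebd1f.warc",
        datetime(2012, 7, 15, 22, 15, 59, tzinfo=timezone.utc),
    )
    print(record.decode("utf-8"))
```

A full archive of a page would follow this record with further records (for example the captured HTTP responses), which a Wayback instance can then index.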
Slide21
Contribution of Work
Facilitate browser-based Personal Web Archiving
Determine feasibility of fully Client-Side Preservation
Integrate with existing tools for establishing use cases
Slide22
WARCreate
Create Wayback-Consumable WARC Files from Any Webpage
http://WARCreate.com
Slide23
Backup Slides
Slide24
Future Work
Decouple from “server”
Refine Memento integration
Reference full WARC spec
Built-in WARC validation
Built-in replay
Compression
Optimization (removing duplicates)
…many more
Slide25
Extras: Configuration Sanity Check
Server scripts make up for JavaScript shortcomings
The server can reside on your machine!
Setup, Start, Stop are GUI based

Capability                In WARCreate
WARC Validation           ✗
AJAX XSS Circumvention    ✗
HTML5 Sandbox Escaping    ✗
Memento Support           ✗
Local Wayback Instance    ✗
Slide26
Extras: Configuration Sanity Check
Apache allows generated WARCs to be validated
JavaScript cannot write to disk, server-side scripts can
Server prevents hot-linking & has security
Content better preserved using server techs

Capability                In WARCreate
WARC Validation           ✓
AJAX XSS Circumvention    ✓
HTML5 Sandbox Escaping    ✓
Memento Support           ?
Local Wayback Instance    ✗
Slide27
Extras: Configuration Sanity Check
Memento requires Wayback
Wayback requires Tomcat
∴ Memento requires Tomcat
Memento Timegate requires Python + modules (pre-packaged + included)

Capability                In WARCreate
WARC Validation           ✓
AJAX XSS Circumvention    ✓
HTML5 Sandbox Escaping    ✓
Memento Support           ✓
Local Wayback Instance    ✓
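The dependency reasoning on this slide (Memento requires Wayback, Wayback requires Tomcat, therefore Memento requires Tomcat) is a transitive closure over "requires" relations. A small sketch, with the table below transcribed from the slide and the function name chosen only for illustration:

```python
# Direct requirements as stated on the slide.
REQUIRES = {
    "Memento": {"Wayback"},
    "Wayback": {"Tomcat"},
    "Memento Timegate": {"Python+modules"},
}


def all_requirements(component, requires=REQUIRES):
    """Return every direct and indirect requirement of a component."""
    seen = set()
    stack = [component]
    while stack:
        # Follow each requirement of the current component once.
        for dep in requires.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


if __name__ == "__main__":
    print(sorted(all_requirements("Memento")))
```

Because the suite ships these pieces pre-packaged, the user does not have to resolve this chain by hand.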