WARCreate: Create Wayback-Consumable WARC Files from Any Webpage


Uploaded On 2018-09-29



Presentation Transcript

Slide1

WARCreate
Create Wayback-Consumable WARC Files from Any Webpage

Mat Kelly, Michele C. Weigle, Michael L. Nelson
{mkelly,mweigle,mln}@cs.odu.edu
Old Dominion University; Norfolk, VA

Slide2

What is WARCreate?

Google Chrome extension
Creates WARC files
Enables preservation by users from their browser
First steps in bringing institutional archiving facilities to the PC

Slide3

Target Content

Unreachable by web crawlers
  Behind authentication
  Not listed in search engines (Deep Web)
Private
  We don’t want our bank statements in Wayback
Not pertinent to the public
  Others have little interest in our Facebook comments

Slide4

Preserving More!
Much digital information is needlessly lost
User chooses what they deem important
Compatible with standard archiving tools

Slide5

WYSIWYG

[Screenshots: Facebook-Supplied Data Dump; Archive created from WARCreate in Wayback]

Slide6

WYSIWYG

[Screenshots: Using Scraping Tools (e.g. wget); Archive created from WARCreate in Wayback]

Slide7

WYSIWYG

[Screenshots: A Crawler Has No Context; Archive created from WARCreate in Wayback]

Slide8

WYSIWYG

[Screenshots: IA/Heritrix Obey robots.txt; Archive created from WARCreate in Wayback]

Slide9

Goals
Make it easy to use (GUI-based, no command line)
Make it useful (fill the need)
Demonstrate novelty of browser-instigated preservation
Show value of WARC format for personal Web preservation
Bring WARC format to Personal Digital Archiving

Slide10

WARC Generation is Quick & Easy

Navigate to a webpage
Click the WARCreate icon
Click Generate WARC
Extension output options:
  In-browser viewing of raw WARC
  Download to local disk

Slide11

Creating a WARC

Slide12

I’ve Made a WARC. Now what?
What you do with the archive is up to you.
Install it in your local Wayback instance
  Who has their own Wayback instance!?
  Wayback is free & open source
  That seems like a lot of work!
  One additional reason for users NOT to preserve what they would like archived

Slide13

WARC Creation & Replay

1. User visits a website using their browser
2. WARCreate captures the HTTP headers
3. User selects the “Generate WARC” button in WARCreate
4. WARC generated, saved locally to a directory accessible to the local Wayback
5. Local Wayback instance indexes the WARC
6. User accesses the local Wayback to view preserved content

Slide14

Suite Installation & Interaction
Drag & drop .zip to hard drive
Start relevant services using GUI
Execute WARCreate process
View archive at http://localhost/wayback
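
The address above is the entry point to the local Wayback instance. As a rough sketch of how an individual capture might be addressed, the snippet below builds a Wayback-style replay URL of the form prefix/14-digit timestamp/original URL. The prefix is taken from the slide; the function name and the assumption that the local instance uses this path layout are illustrative only.

```python
from datetime import datetime, timezone

# Assumption: the local Wayback instance answers at the address shown on the slide.
WAYBACK_PREFIX = "http://localhost/wayback"


def replay_url(original_url: str, captured_at: datetime) -> str:
    """Build a Wayback-style replay URL: <prefix>/<YYYYMMDDhhmmss>/<original URL>."""
    # Wayback timestamps are 14 digits, expressed in UTC.
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


if __name__ == "__main__":
    when = datetime(2012, 7, 15, 22, 15, 59, tzinfo=timezone.utc)
    print(replay_url("http://twitter.com/", when))
```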

Slide15

Replay of Preserved Twitter page

Slide16

And My Bank Statements?
Preserved content:
  never leaves WARC files
  never leaves local machine
WARCreate provides preliminary encoding/encryption support
Wayback instance is hosted on your own machine; no external access by default

Slide17

What It Doesn’t Do
Archive entire websites with a click
Submit your WARCs to IA
Contain comprehensive support for WARC format
  A subset is utilized and all generated WARCs are validated at time of creation
Provide a direct means for replay
  Replay is executed through the XAMPP suite

Slide18

Why Use a Client-Side Server?

Server scripts do what JS can’t
Can reside on your machine!
Controls are GUI based
Resource fetching w/o XSS issues

[Diagram: XAMPP-Based Personal Web Archiving Suite, built on Apache and Tomcat; components: Local Wayback Instance, WARCreate Server-Side Support, Memento Proxy]

Slide19

Extras: Memento Support
Suite includes a tailored Timegate
Memento abstraction is beyond WARC
Point MementoFox (or other Memento tools) to localhost
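
To illustrate what a Memento tool sends when pointed at a local Timegate, here is a minimal sketch that prepares (without sending) a datetime-negotiation request. Memento clients state the desired date in an Accept-Datetime header formatted as an HTTP date in GMT. The Timegate path below is hypothetical; the slides only say "localhost".

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical address of the suite's local Timegate (not given in the slides).
LOCAL_TIMEGATE = "http://localhost/timegate/"


def timegate_request(original_url: str, desired: datetime) -> Request:
    """Prepare, but do not send, a Memento datetime-negotiation request."""
    # The desired date travels in the Accept-Datetime header as an HTTP date in GMT.
    http_date = format_datetime(desired.astimezone(timezone.utc), usegmt=True)
    return Request(LOCAL_TIMEGATE + original_url,
                   headers={"Accept-Datetime": http_date})


if __name__ == "__main__":
    req = timegate_request("http://twitter.com/",
                           datetime(2012, 7, 15, 22, 15, 59, tzinfo=timezone.utc))
    print(req.full_url)
    print(req.header_items())
```

Passing the prepared request to `urllib.request.urlopen` would perform the negotiation against a running Timegate; that step is left out here because it needs the suite's services to be started.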

Slide20

How it All Relates

[Diagram]
Browser extensions: WARCreate, MementoFox
WARCreate generates WARC file:
    WARC/1.0
    WARC-Type: warcinfo
    WARC-Date: 2012-07-15T22:15:59.485Z
    WARC-Filename: 2220471175c820fee3fec986040ebd1f.warc
Local Wayback instance: index WARCs
Local Timegate: send desired date; Memento negotiated & returned
Personal archives accessible at localhost
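
The WARC header excerpt in the diagram can be reproduced with a short sketch. The code below serializes a warcinfo record followed by a response record in WARC/1.0 layout (version line, named fields, blank line, block, two trailing CRLFs). It is not WARCreate's own code: the function names, the `software` payload line, and the sample HTTP response are illustrative assumptions.

```python
import uuid
from datetime import datetime, timezone

CRLF = "\r\n"


def warc_record(warc_type: str, payload: bytes, extra: dict, when: datetime) -> bytes:
    """Serialize one WARC/1.0 record: version line, fields, blank line, block."""
    fields = {
        "WARC-Type": warc_type,
        "WARC-Date": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        **extra,
        "Content-Length": str(len(payload)),
    }
    head = "WARC/1.0" + CRLF + "".join(f"{k}: {v}{CRLF}" for k, v in fields.items())
    # The block is followed by two CRLFs that terminate the record.
    return head.encode("utf-8") + CRLF.encode() + payload + (CRLF * 2).encode()


def build_warc(filename: str, url: str, http_response: bytes, when: datetime) -> bytes:
    """Return a warcinfo record followed by a response record for one capture."""
    info = warc_record(
        "warcinfo",
        b"software: example-sketch\r\n",  # illustrative payload, not WARCreate's
        {"WARC-Filename": filename, "Content-Type": "application/warc-fields"},
        when,
    )
    resp = warc_record(
        "response",
        http_response,  # captured HTTP status line, headers, and body
        {"WARC-Target-URI": url,
         "Content-Type": "application/http; msgtype=response"},
        when,
    )
    return info + resp


if __name__ == "__main__":
    captured = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
    data = build_warc("example.warc", "http://example.com/", captured,
                      datetime(2012, 7, 15, 22, 15, 59, tzinfo=timezone.utc))
    print(data.decode("utf-8"))
```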

Slide21

Contribution of Work
Facilitate browser-based Personal Web Archiving
Determine feasibility of fully client-side preservation
Integrate with existing tools for establishing use cases

Slide22

WARCreate
Create Wayback-Consumable WARC Files from Any Webpage
http://WARCreate.com

Slide23

Backup Slides

Slide24

Future Work
Decouple from “server”
Refine Memento integration
Reference full WARC spec
Built-in WARC validation
Built-in replay
Compression
Optimization (removing duplicates)
…many more

Slide25

Extras: Configuration Sanity Check
Server scripts make up for JavaScript shortcomings
The server can reside on your machine!
Setup, Start, Stop are GUI based

[Checklist graphic, “In WARCreate”: WARC Validation, AJAX XSS Circumvention, HTML5 Sandbox Escaping, Memento Support, Local Wayback Instance; marks shown: ✗ ✗ ✗]

Slide26

Extras: Configuration Sanity Check
Apache allows generated WARCs to be validated
JavaScript cannot write to disk, server-side scripts can
Server prevents hot-linking & has security
Content better preserved using server techs

[Checklist graphic, “In WARCreate”: WARC Validation, AJAX XSS Circumvention, HTML5 Sandbox Escaping, Memento Support, Local Wayback Instance; marks shown: ✓ ?]
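
As an illustration of what server-side WARC validation could involve, the sketch below checks one serialized record for the four fields that WARC/1.0 requires in every record (WARC-Type, WARC-Date, WARC-Record-ID, Content-Length). It is an assumption about the kind of check meant, not the validator used by the suite, and the function name is hypothetical.

```python
# Fields that WARC/1.0 requires in every record.
REQUIRED_FIELDS = ("WARC-Type", "WARC-Date", "WARC-Record-ID", "Content-Length")


def check_record(record: bytes) -> list:
    """Return a list of problems found in one serialized WARC record (empty if none)."""
    head, sep, rest = record.partition(b"\r\n\r\n")
    if not sep:
        return ["no blank line separating header from block"]
    problems = []
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    if not lines[0].startswith("WARC/"):
        problems.append("missing WARC version line")
    # Collect named fields; field names are compared case-insensitively.
    fields = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:
            fields[name.strip().lower()] = value.strip()
    for name in REQUIRED_FIELDS:
        if name.lower() not in fields:
            problems.append(f"missing {name}")
    length = fields.get("content-length")
    if length is not None:
        if not length.isdigit():
            problems.append("Content-Length is not a number")
        elif len(rest) < int(length):
            problems.append("block shorter than Content-Length")
    return problems


if __name__ == "__main__":
    sample = (b"WARC/1.0\r\nWARC-Type: warcinfo\r\n"
              b"WARC-Date: 2012-07-15T22:15:59Z\r\n"
              b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
              b"Content-Length: 2\r\n\r\nhi\r\n\r\n")
    print(check_record(sample))
```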

Slide27

Extras: Configuration Sanity Check
Memento requires Wayback
Wayback requires Tomcat
∴ Memento requires Tomcat
Memento Timegate req’s Python + modules (pre-packaged + included)

[Checklist graphic, “In WARCreate”: WARC Validation, AJAX XSS Circumvention, HTML5 Sandbox Escaping, Memento Support, Local Wayback Instance]