Slide1
Archive What I See Now
Mat Kelly, Michael L. Nelson, Michele C. Weigle
Old Dominion University
{mkelly,mln,mweigle}@cs.odu.edu
Web Science and Digital Libraries Research Group
ws-dl.blogspot.com
Slide2
What's the Problem?
Web archives capture a lot but not everything
Individuals' interests may not be captured
Timely capture is important
Capture capability must be enabled for all
November 12, 2013
Salt Lake City, Utah
2013 Archive-It Partner Meeting
Slide3
Use Case: Capturing Breaking Stories
Calls for seed URIs are reactive
Not quick enough for rapidly evolving events
Slide4
Use Case: Capturing Breaking Stories
Intermediate mementos missed
The story is incomplete
Slide5
Use Case: Capturing Breaking Stories
Slide6
The Amateur Archivist's Approach
Users take ad hoc approaches
Screenshots of Pages
Why? Tools are hard.
Build more accessible tools
Appeal to standards (e.g., WARC, ISO 28500:2009)
Make interoperable
Slide7
The Institutional Dilemma
Safety of Archives Requires $
No $, No Institution
Users' Hard Drives Fail
No Access to Save-As files and Screenshots
A hybrid approach is needed to leverage institutional safety, formats, and tech while still allowing direct user deposits
Slide8
Video Here
Show use case where other tools cannot capture, e.g., behind authentication
Juxtapose to Archive.is, WebCite, Save Webpage As
Slide9
Scratch Slide
Slide10
So we built it!
WARCreate – Google Chrome extension
Create web archives from browser
Capture personalized content
Preserve on a whim
Mat Kelly and Michele C. Weigle, "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2012). Washington, DC, June 2012, pp. 437-438.
Mat Kelly, Michele C. Weigle, Michael Nelson. "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session: Web Archiving; 2012 Jul 25; Washington, DC.
Slide11
WARCreate – How it Works
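As a rough illustration of the output side of this process, the sketch below wraps one captured HTTP response in a WARC/1.0 response record of the kind Wayback can replay. It is not WARCreate's code (the extension is written in JavaScript); the function name build_warc_response_record and the example URI are assumptions made for illustration only.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_response_record(target_uri: str, http_response: bytes,
                               captured_at: datetime) -> bytes:
    """Wrap a raw HTTP response (status line, headers, body) in one WARC/1.0 record.

    captured_at should be timezone-aware; it is converted to UTC for WARC-Date.
    """
    warc_date = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", warc_date),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the WARC header; the block is followed by two CRLFs.
    return head + CRLF + http_response + CRLF + CRLF


# Hypothetical captured response, as a browser extension might observe it.
example_response = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>what I see now</html>"
)
example_record = build_warc_response_record(
    "http://example.com/",
    example_response,
    datetime(2013, 11, 12, 18, 0, 0, tzinfo=timezone.utc),
)
```

Appending such records to a single file (usually preceded by a warcinfo record) gives a WARC that replay tools can index; a real capture also needs request records and one response record per embedded resource.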
Slide12
Preserving the Original Context
Facebook-Supplied Data Dump
Archive created from WARCreate in Wayback
Liberated Data Doesn't Give The Whole Picture
Slide13
Preserving the Original Context
Using Scraping Tools (e.g., wget)
Archive created from WARCreate in Wayback
The Target Controls What is Allowed
Slide14
Preserving the Original Context
A Crawler Has No Context
Archive created from WARCreate in Wayback
No Credentials, No Entry, No Archiving
Slide15
Preserving the Original Context
IA/Heritrix Obey Robots
Archive created from WARCreate in Wayback
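The check a robots-obeying crawler makes before fetching a page can be sketched with Python's standard robots.txt parser. This is only an illustration of the general behavior, not Heritrix's implementation; the helper name crawler_may_fetch, the user agent string, and the example rules are assumptions.

```python
from urllib.robotparser import RobotFileParser


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules a site might publish to keep crawlers out of one area.
example_rules = "User-agent: *\nDisallow: /private/\n"
blocked = crawler_may_fetch(example_rules, "ExampleCrawler",
                            "http://example.com/private/page.html")
allowed = crawler_may_fetch(example_rules, "ExampleCrawler",
                            "http://example.com/public.html")
```

A browser-based capture never performs this check, because it records what the user's own browser has already loaded.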
No Means No, if They Say and you Obey
Slide16
So we built it!
PROBLEM: Users don't know what to do with WARCs
Slide17
So, again, we built it!
Web Archiving Integration Layer (WAIL)
Heritrix, Wayback, etc. packaged for PC
GUI front-end allows "One-Click Preservation"
Provides means to replay WARCs
Mat Kelly, Michele C. Weigle, Michael Nelson. "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving," Personal Digital Archiving 2013, Poster Session; 2013 Feb 21; College Park, MD.
Mat Kelly, Michael Nelson and Michele C. Weigle. "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013, Workshops and Sessions: Web Archiving; 2013 Jul 24; Alexandria, VA.
Slide18
So, again, we built it!
PROBLEM: Users want to preserve, but store at institutions for safekeeping
PROBLEM: Even with replay, not everyone wants to use Chrome
Slide19
The Plan
Port WARCreate to Firefox
Add functionality in:
…to upload WARCs to:
Implement Sequential Archiving
Slide20
Porting WARCreate to Firefox
Disjoint extension/add-on APIs
Little logic can be re-used
Problems with HTTP header capture in Chrome are trivial in Firefox
Chrome = highly asynchronous fetching
JavaScript code to save to local file system from Chrome for WARCreate is re-usable
Slide21
The Plan
Port WARCreate to Firefox ✓ In βeta now!
Add functionality in:
…to upload WARCs to:
Implement Sequential Archiving
Slide22
The Plan
Port WARCreate to Firefox
Add functionality in:
…to upload WARCs to:
Implement Sequential Archiving
Slide23
Uploading WARCs: An Open Question
Working with Archive-It to determine feasibility of user-provided WARCs
Consideration of data integrity
Should data be merged with Archive-It-crawled WARCs?
How do we account for your www.facebook.com vs. my www.facebook.com?
Privacy?
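One way to picture the "your www.facebook.com vs. my www.facebook.com" question is to key deposits by both URI and payload digest, so that two users' captures of the same URI are never collapsed into one. The sketch below is an assumption about how such bookkeeping could look, not a description of Archive-It's pipeline; the function names, depositor names, and payloads are hypothetical. The "sha1:" plus Base32 form mirrors the style commonly seen in WARC-Payload-Digest headers.

```python
import base64
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<Base32>' label for the payload bytes."""
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")


def index_deposits(
    deposits: List[Tuple[str, str, bytes]]
) -> Dict[Tuple[str, str], List[str]]:
    """Group (depositor, uri, payload) deposits by (uri, payload digest)."""
    index: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for depositor, uri, payload in deposits:
        index[(uri, payload_digest(payload))].append(depositor)
    return dict(index)


# Two hypothetical users deposit captures of the same URI with different content.
example_deposits = [
    ("alice", "https://www.facebook.com/", b"<html>Alice's news feed</html>"),
    ("bob", "https://www.facebook.com/", b"<html>Bob's news feed</html>"),
]
example_index = index_deposits(example_deposits)
```

Here the index holds two entries for one URI, which makes the merge and privacy questions concrete: each entry is a distinct, personal view that a crawler could not have produced.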
Slide24
The Plan
Port WARCreate to Firefox
Add functionality in:
…to upload WARCs to:
Implement Sequential Archiving
Slide25
Sequential Archiving?
Like focused crawl but URIs defined on per-site basis to be comprehensive
Similar to Archive Facebook but generalized
Implement into WARCreate
Utilize per-site specification to keep tools from breaking ★

                   Facebook    Google+    Twitter
personal stream    wall        posts      my tweets
global stream      news feed   streams    followees' tweets
multimedia-photos  photos      photos     N/A
multimedia-videos  videos      videos     N/A
photo collection   albums      N/A        N/A
posts              notes       N/A        N/A
friends            friends     circles    following
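A per-site specification of this kind can be sketched as a lookup from generic concepts to each site's own section names. The structure below is a minimal illustration, not the format of the actual spec: the dictionary layout, the site keys (inferred from each column's vocabulary), and the function name sections_to_archive are all assumptions. Unrecognized sites return None so a caller can fall back to discovery and scraping.

```python
from typing import Dict, List, Optional

# Hypothetical encoding of the table: generic concept -> site-specific section.
# None marks a concept with no counterpart on that site (N/A in the table).
SITE_SPEC: Dict[str, Dict[str, Optional[str]]] = {
    "facebook": {
        "personal stream": "wall",
        "global stream": "news feed",
        "multimedia-photos": "photos",
        "multimedia-videos": "videos",
        "photo collection": "albums",
        "posts": "notes",
        "friends": "friends",
    },
    "google+": {
        "personal stream": "posts",
        "global stream": "streams",
        "multimedia-photos": "photos",
        "multimedia-videos": "videos",
        "photo collection": None,
        "posts": None,
        "friends": "circles",
    },
    "twitter": {
        "personal stream": "my tweets",
        "global stream": "followees' tweets",
        "multimedia-photos": None,
        "multimedia-videos": None,
        "photo collection": None,
        "posts": None,
        "friends": "following",
    },
}


def sections_to_archive(site: str) -> Optional[List[str]]:
    """List the sections to capture, in spec order, or None for unknown sites."""
    spec = SITE_SPEC.get(site)
    if spec is None:
        return None  # not a recognized site: caller falls back to scraping
    return [section for section in spec.values() if section is not None]
```

Keeping the mapping as data rather than code means a change to a site's layout only requires updating the specification, not the archiving tool.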
The Digital Libraries Approach ★ - versus - Discovery & Scraping: The Information Retrieval Approach
Slide26
Sequential Archiving = Lots of Maintenance
Only (and optionally) applied on recognized sites, with scraping as fallback for establishing hierarchy
Standardized spec* prototype is live online
Lives online; tools refer to it and are always updated
* M. Kelly, An Extensible Framework for Creating Personal Archives of Web Resources Requiring Authentication, Aug 2012
Slide27
Summary
Firefox WARCreate in Beta
Chrome WARCreate Users Can Currently Archive What They See Now
Sequential Archiving Implemented in Chrome WARCreate, needs porting
Next Big Hurdle: Working with Archive-It on WARC upload logistics
Slide28
Archive What I See Now
Download Our Archiving Tools!
Share Your Use Cases for Capturing the Unpreserved and the Unpreservable
Help Us Improve Our Tools, Give Feedback!
http://bit.ly/wc-wail
In Beta, Available Soon!
Web Archiving Integration Layer (WAIL)
One-Click Preservation
Heritrix, Wayback and Others On Your PC!
WARCreate for Chrome
Create WARC files from any web page from your browser