Big Data: Movement, Crunching, and Sharing - PowerPoint Presentation

Presentation Transcript

Slide1

Big Data: Movement, Crunching, and Sharing

Guy Almes, Academy for Advanced Telecommunications

13 February 2015

Slide2

Overarching theme

Understanding the interplay among data movement, crunching, and sharing is key.

Slide3

This is a persistent theme

mid-1980s: NSF launched two closely related programs

The NSF Supercomputer Centers brought HPC and the emergent computational science to the mainstream of NSF-funded research

The NSFnet program, needed to connect science users to those supercomputers, resulted in connecting all our research universities to the Internet

File transfer of huge (e.g., one megabyte!) files was a key issue

Thus, A&M connected to NSFnet in August 1987

Slide4

An ongoing theme

The Internet “outgrew” the narrow mission of connecting universities to supercomputers

But, in its broad missions, it often neglects the big-data needs of university researchers

Thus, having spawned the commercial Internet in the early 1990s, the universities created Internet2 in 1996

Again, a dramatic improvement in our ability to move huge (e.g., one gigabyte) files

Note the Teragrid network as a false step

Slide5

To the present

First, note A&M innovation in the ScienceDMZ, so that key data-intensive resources, e.g., the gridftp servers of the Brazos high-throughput cluster, have direct access to the wide-area network (LEARN, Internet2, ESnet, etc.)

Recently, that wide-area infrastructure has been upgraded to 100-Gb/s

We’ll look at these in turn

Slide6

ScienceDMZ

You can achieve high-speed wide-area flows only if packet loss is very, very small and the MTU is not small

This fails if you try to extend these flows into the general-purpose campus LAN

Beginning 2009, we designed the Data Intensive Network to place key resources adjacent to the wide-area network

This idea, called “ScienceDMZ” and popularized by ESnet, is now widely adopted across the country

If both source and destination of a high-speed wide-area flow are on ScienceDMZs, very good performance can be achieved

Example: gridftp servers supporting flows to/from the 240 TByte file system for the Brazos high-throughput cluster

Slide7

100 Gb/s Upgrade

The Internet2 backbone is built around 100-Gb/s circuits (and with up to 80 such circuits/lambdas per fiber)

With a combination of NSF and local funding, LEARN is evolving to 100-Gb/s:

Now: 100-Gb/s College Station to Houston

Now: 100-Gb/s Houston to Internet2 backbone at Greenspoint

Now: 100-Gb/s Austin to Dallas

Now: 100-Gb/s Dallas to Internet2 backbone at Kansas City

Future: 100-Gb/s College Station to Dallas, and Austin to San Antonio to Houston

This would then result in a consistent 100-Gb/s wide-area infrastructure
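
A quick arithmetic sketch (an illustration, not content from the slides) of what these figures imply: 80 lambdas at 100 Gb/s each is about 8 Tb/s per fiber, and at full line rate a single 100-Gb/s circuit moves one terabyte in well under two minutes:

    # Capacity implied by the 100-Gb/s upgrade figures (illustrative arithmetic).
    lambdas_per_fiber = 80                       # 100-Gb/s circuits (lambdas) per fiber
    print(f"Per-fiber capacity: {lambdas_per_fiber * 100 / 1000:.0f} Tb/s")

    tbyte_bits = 1e12 * 8                        # one terabyte expressed in bits
    print(f"1 TByte at 100 Gb/s line rate: {tbyte_bits / 100e9:.0f} s")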

Slide8

Summary of the current good situation

The ScienceDMZ and the emerging 100-Gb/s infrastructure permit very good end-to-end performance to resources on the ScienceDMZ

Software tools such as gridftp and Globus Online, and discipline-specific tools such as Phedex, permit wide-area flows in excess of 1 TByte/hour to be sustained from other high-end sites

Emerging “Advanced Layer-2 Services”, based on software-defined networking techniques, may be very important
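
To put the 1 TByte/hour figure in context, a rough conversion (illustrative assumptions, not measurements from the talk): 1 TByte/hour is only a few percent of a 100-Gb/s path, and even at that sustained rate the 240 TByte Brazos file system mentioned earlier would take on the order of ten days to copy:

    # Rough context for the sustained wide-area rates quoted on this slide.
    tbyte_per_hour_gbps = 1e12 * 8 / 3600 / 1e9      # 1 TByte/hour expressed in Gb/s
    share_of_100g = tbyte_per_hour_gbps / 100        # fraction of a 100-Gb/s path
    print(f"1 TByte/hour ~ {tbyte_per_hour_gbps:.2f} Gb/s ({share_of_100g:.1%} of 100 Gb/s)")

    brazos_tbytes = 240                              # Brazos file system size (slide 6)
    hours = brazos_tbytes / 1.0                      # at a sustained 1 TByte/hour
    print(f"240 TByte at 1 TByte/hour: {hours:.0f} h (~{hours / 24:.0f} days)")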

Slide9

Crunching

Several key computing resources are already on A&M’s ScienceDMZ:

Parallel file system of Brazos

Similarly for Eos

Similarly for Ada, a new very large x86 cluster

Emerging: the Power7 cluster and (eventually?) the BlueGene cluster

Data-moving servers attached to the parallel file systems of these resources

And, using tools such as Globus Online, large data flows can be achieved to the computing resources of NSF/XSEDE and the DoE
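
As a concrete illustration of the kind of scripted data movement Globus Online enables, here is a minimal sketch using the Globus Python SDK (globus_sdk). The endpoint IDs, paths, and access token are placeholders, obtaining the token via the OAuth2 consent flow is omitted, and the slides do not prescribe this particular interface:

    # Minimal sketch: submit a recursive, checksum-verified bulk transfer
    # between two Globus endpoints. Endpoint IDs, paths, and TRANSFER_TOKEN
    # are hypothetical placeholders.
    import globus_sdk

    TRANSFER_TOKEN = "..."                                 # access token for the Transfer API
    SRC_ENDPOINT = "<brazos-science-dmz-endpoint-uuid>"    # hypothetical source endpoint
    DST_ENDPOINT = "<xsede-site-endpoint-uuid>"            # hypothetical destination endpoint

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

    tdata = globus_sdk.TransferData(
        tc, SRC_ENDPOINT, DST_ENDPOINT,
        label="Brazos to XSEDE bulk copy",
        sync_level="checksum")                  # re-copy only files whose checksums differ
    tdata.add_item("/scratch/project/dataset/", "/work/project/dataset/", recursive=True)

    task = tc.submit_transfer(tdata)
    print("Submitted transfer task:", task["task_id"])

In practice such a transfer would run between data-transfer nodes like the gridftp servers on the ScienceDMZ described earlier.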

Slide10

Sharing

Things are more primitive here. One can only point to:

A few discipline-specific examples, e.g., the Phedex system of the Large Hadron Collider’s CMS collaboration

Some key tools:

InCommon / Shibboleth provide federated identity/authentication

Globus Online provides some support for controlled sharing

But, generally, this situation does not match our need to share data among key scientific collaborations

Slide11

An important work in progress

Controlled high-performance sharing of data is key to effective scientific collaboration in the big-data world