
Putting some into HDF5 - PowerPoint Presentation

Uploaded On 2016-04-18


Gerd Heber & Joe Lee, The HDF Group, Champaign, Illinois, USA. This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C.

Presentation Transcript

Slide1

Putting some into HDF5

Gerd Heber & Joe Lee
The HDF Group
Champaign, Illinois, USA

This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C

Slide2

The Return of

Slide3

Outline

“The Big Schism”
A Shiny New Engine
Getting off the Ground
Future Work

July 14 – 17, 2015

Slide4

“The Big Schism”

An HDF5 file is a Smart Data Container

“This is what happens, Larry, when you copy an HDF5 file into HDFS!” (Walter Sobchak)

Natural Habitat: Traditional File System

Block Store: Hadoop “File System” (HDFS)

Ouch!

Don’t mess with HDF5!

Slide5

Now What?

Ask questions:
Who wants HDF5 files in Hadoop? (volatile)
Who wants to program MapReduce? (nobody)
How big are your HDF5 files? (long-tailed distribution)
No size (solution) fits all...

Do experiments:
Reverse-engineer the format (students, weirdos)
In-core processing (fiddly)
Convert to Avro (some success)
Sit tight and wait for something better!

Slide6

Spark Concepts

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be only created through deterministic operations on either (1) a dataset in stable storage or (2) other existing RDDs.
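To make the definition concrete, here is a toy, pure-Python model of the idea. It is not Spark's implementation: the class name `ToyRDD` and its methods are hypothetical stand-ins that mimic only the properties named above (read-only, partitioned, and created either from a dataset or from another such collection by a deterministic operation).

```python
from typing import Callable, Iterable, List, Tuple


class ToyRDD:
    """Toy stand-in for an RDD: an immutable, partitioned collection of records.

    Hypothetical illustration only; Spark's real RDDs are lazy, distributed,
    and track lineage for fault recovery.
    """

    def __init__(self, partitions: Iterable[Iterable]):
        # Tuples make the partition list and every partition read-only.
        self._partitions: Tuple[tuple, ...] = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_stable_data(cls, records: Iterable, num_partitions: int) -> "ToyRDD":
        """Way (1): build from a dataset by splitting it into partitions."""
        records = list(records)
        parts: List[list] = [records[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    def flat_map(self, f: Callable) -> "ToyRDD":
        """Way (2): derive a new collection from an existing one.

        f should be deterministic; each record yields zero or more outputs.
        """
        return ToyRDD([[out for rec in part for out in f(rec)]
                       for part in self._partitions])

    def collect(self) -> list:
        """Gather the records of all partitions into one list."""
        return [rec for part in self._partitions for rec in part]

    def num_partitions(self) -> int:
        return len(self._partitions)


base = ToyRDD.from_stable_data(range(6), num_partitions=2)
derived = base.flat_map(lambda x: [x, x * 10])
print(derived.num_partitions(), sorted(derived.collect()))
```

Note that `flat_map` never modifies `base`; it returns a new object, which is the read-only property in miniature.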

Slide7

What’s Great about Spark

Refreshingly abstract
Supports Python
Typically runs in RAM
Has batteries included

Slide8

Experimental Setup

GSSTF_NCEP.3 collection, 7/1/1987 to 12/31/2008
7,850 HDF-EOS5 files, 16 MB per file, ~120 GB total
4 variables on a daily 1440x720 grid:
Sea level pressure (hPa)
2m air temperature (C)
Sea surface skin temperature (C)
Sea surface saturation humidity (g/kg)

Lenovo ThinkPad X230T
Intel Core i5-3320M (2 cores, 4 threads), 8 GB of RAM, Samsung SSD 840 Pro
Windows 8.1 (64-bit), Apache Spark 1.3.0
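As a rough sanity check, the dataset figures above can be reproduced with a few lines of arithmetic. The one assumption not stated on the slide is that each grid value is stored as a 4-byte float; compression and metadata overhead are ignored.

```python
from datetime import date

# Figures quoted on the slide.
first_day, last_day = date(1987, 7, 1), date(2008, 12, 31)
n_files = 7850
grid_cells = 1440 * 720
n_variables = 4

# Assumption (not stated on the slide): each value is a 4-byte float.
bytes_per_value = 4

days_in_range = (last_day - first_day).days + 1   # inclusive day count
bytes_per_file = grid_cells * n_variables * bytes_per_value
total_bytes = n_files * bytes_per_file

print(f"days in range:    {days_in_range}")
print(f"payload per file: {bytes_per_file / 2**20:.1f} MiB")
print(f"payload in total: {total_bytes / 2**30:.1f} GiB")
```

Under that assumption the raw payload comes to roughly 15.8 MiB per file and about 121 GiB overall, in line with the quoted 16 MB and ~120 GB. The date range spans 7,855 days, slightly more than the 7,850 files.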

Slide9

Getting off the Ground

Where do they dwell?

Slide10

General Strategy

Create our first RDD – “list of file names/paths/...”
Traverse base directory, compile list of HDF5 files
Partition the list via SparkContext.parallelize()
Use the RDD’s flatMap method to calculate something interesting, e.g., summary statistics
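The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the file extensions, the dataset path `DATASET_PATH`, and the fill value `FILL_VALUE` are assumptions chosen for the example, and the helper names are hypothetical. `h5py` and `pyspark` are imported lazily, so the directory traversal can be tried without either installed.

```python
import os
from typing import Iterator, List, Sequence

# Assumptions for illustration (none of these appear on the slide):
# file extensions, the dataset path inside each file, and the fill value.
HDF5_EXTENSIONS = (".h5", ".he5", ".hdf5")
DATASET_PATH = "/HDFEOS/GRIDS/NCEP/Data Fields/Tair_2m"
FILL_VALUE = -999.0


def list_hdf5_files(base_dir: str,
                    extensions: Sequence[str] = HDF5_EXTENSIONS) -> List[str]:
    """Traverse base_dir and compile a sorted list of HDF5 file paths."""
    found = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            if name.lower().endswith(tuple(extensions)):
                found.append(os.path.join(root, name))
    return sorted(found)


def read_valid_values(path: str) -> Iterator[float]:
    """flatMap worker: open one file and yield its non-fill values."""
    import h5py  # lazy import; only the Spark workers need it
    with h5py.File(path, "r") as f:
        data = f[DATASET_PATH][...]   # read the whole grid into memory
    for value in data.ravel():
        if value != FILL_VALUE:
            yield float(value)


def mean_over_files(base_dir: str, num_partitions: int = 4) -> float:
    """List the files, partition the list, and reduce to a mean with Spark."""
    from pyspark import SparkContext
    sc = SparkContext(appName="hdf5-summary")
    try:
        paths = sc.parallelize(list_hdf5_files(base_dir), num_partitions)
        return paths.flatMap(read_valid_values).mean()
    finally:
        sc.stop()
```

Only the list of paths travels through the driver; each worker opens its own files with the ordinary HDF5 library, which is why the files need to sit on a file system every worker can reach. A median would additionally need a sort or a collect of the values, which this sketch leaves out.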


Calculating Tair_2m mean and median for 3.5 years took about 10 seconds on my notebook.

Slide11

Variations

Instead of traversing directories, you can provide a CSV file of [HDF5 file names, path names, hyperslab selections, etc.] to partition
A fast SSD array goes a long way
If you have a distributed file system (e.g., GPFS, Lustre, Ceph), you should be able to feed large numbers of Spark workers (running on a cluster)
If you don’t have a parallel file system and use most of the data in a file, you can stage (copy) the files first on the cluster nodes
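The CSV variation might look like the sketch below. The column layout (file, dataset, row and column bounds), the file names, and the dataset path are all hypothetical, and the selection is restricted to a 2-D hyperslab expressed as two slices. The row and column bounds assume the grid is stored as 720 rows by 1440 columns.

```python
import csv
import io
from typing import List, NamedTuple


class Selection(NamedTuple):
    """One unit of work: a file, a dataset inside it, and a 2-D hyperslab."""
    file_name: str
    dataset_path: str
    rows: slice
    cols: slice


def parse_manifest(text: str) -> List[Selection]:
    """Parse CSV rows of the hypothetical form
    file,dataset,row_start,row_stop,col_start,col_stop
    """
    selections = []
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        name, dset, r0, r1, c0, c1 = (field.strip() for field in row)
        selections.append(
            Selection(name, dset, slice(int(r0), int(r1)), slice(int(c0), int(c1))))
    return selections


def read_selection(sel: Selection) -> list:
    """Worker: read only the requested hyperslab from one file."""
    import h5py  # lazy import; only the Spark workers need it
    with h5py.File(sel.file_name, "r") as f:
        return f[sel.dataset_path][sel.rows, sel.cols].ravel().tolist()


# Hypothetical manifest: two halves of a grid from two different files.
manifest = """\
# file, dataset, row_start, row_stop, col_start, col_stop
day1.he5, /grid/Tair_2m, 0, 360, 0, 1440
day2.he5, /grid/Tair_2m, 360, 720, 0, 1440
"""
work = parse_manifest(manifest)
print(len(work), work[0].rows, work[1].dataset_path)
# With Spark, the list would be partitioned and processed along the lines of:
#   sc.parallelize(work, num_partitions).flatMap(read_selection)
```

Carrying the hyperslab in the work item means each worker reads only the part of a file it needs, which matters most when only a small fraction of each file is used.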

Slide12

Conclusion

Forget MapReduce, stop worrying about HDFS
With Spark, exploiting data parallelism has never been more accessible (easier and cheaper)
Current HDF5 to Spark on-ramps can be effective under the right circumstances, but are kludgy
Work with us to build the right things right!

Slide13

References

[BigHDF] https://www.hdfgroup.org/pubs/papers/Big_HDF_FAQs.pdf

[Blog] https://hdfgroup.org/wp/2015/04/putting-some-spark-into-hdf-eos/

[Report] Zaharia et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, UC Berkeley, 2011.

[Spark] https://spark.apache.org/

[YouTube] Mark Madsen: Big Data, Bad Analogies, 2014.

Slide14

Thank You

Slide15

This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C