HDF Cloud Services: Moving HDF5 to the Cloud

Presentation Transcript

1. HDF Cloud Services: Moving HDF5 to the Cloud
John Readey, The HDF Group
jreadey@hdfgroup.org

2. Outline
- Brief review of HDF5
- Motivation for HDF Cloud Services
- The HDF REST API
- H5serv – REST API reference implementation
- Storage for data in the cloud
- HDF Scalable Data Service (HSDS) – HDF at cloud scale

3. What is HDF5?
Depends on your point of view:
- a C API
- a file format
- a data model
Think of HDF5 as a file system within a file, with chunking and compression, plus NumPy-style data selection.
Note: NetCDF4 is based on HDF5.
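The "file system within a file" analogy can be sketched with plain Python (a toy illustration only, not the real HDF5 API): groups behave like directories addressed by paths, and datasets support NumPy-style slice selection.

```python
# Toy model: groups are dicts, a dataset is a nested list (all names here
# are illustrative, mirroring the /g1/g1.1/dset1 layout of a demo file).
f = {"g1": {"g1.1": {"dset1": [[i * 10 + j for j in range(10)]
                               for i in range(4)]}}}

def get_by_path(root, path):
    # walk a '/'-separated path like a directory tree
    node = root
    for part in path.strip("/").split("/"):
        node = node[part]
    return node

dset = get_by_path(f, "/g1/g1.1/dset1")
block = [row[0:4] for row in dset[0:2]]   # NumPy-style [0:2, 0:4] selection
```

With h5py the equivalent selection would be written `f["/g1/g1.1/dset1"][0:2, 0:4]`.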

4. HDF5 Features
Some nice things about HDF5:
- Sophisticated type system
- Highly portable binary format
- Hierarchical objects (directed graph)
- Compression
- Fast data slicing/reduction
- Attributes
- Bindings for C, Fortran, Java, Python
Things HDF5 doesn't do (out of the box):
- Scalable analytics (other than MPI)
- Distributed access
- Multiple writer/multiple reader
- Fine-grained access control
- Query/search
- Web access

5. Why HDF in the Cloud?
It can provide a cost-effective infrastructure:
- Pay for what you use vs. pay for what you may need
- Lower overhead: no hardware setup, network configuration, etc.
It can potentially benefit from cloud-based technologies:
- Elastic compute – scale compute resources dynamically
- Object-based storage – low cost, built-in redundancy
- Community platform (potentially) – enables interested users to bring their applications to the data

6. What do we need to bring HDF to the cloud?
- Define a web API for HDF: a REST-based API vs. the C API of HDF5
- Determine the storage medium: disk? Object storage? NoSQL?
- Create a web service that implements the REST API, preferably high performance and scalable
The REST API and reference service are available now (h5serv). Work on a scalable service using object-based storage has just started.

7. Why an HDF5 Web API?
Motivation to create a web API:
- Anywhere-referenceable data, i.e. a URI
- Network transparency
- Clients can be lighter weight
- Support for multiple writer/multiple reader
- Enables web UIs
- Increased scope for features/performance boosters, e.g.:
  - In-memory cache of recently used data
  - Transparently support parallelism (e.g. processing requests in a cluster)
  - Support alternative storage technologies (e.g. object storage)

8. A simple diagram of the REST API

9. What makes it RESTful?
- Client-server model
- Stateless (no client context stored on the server)
- Cacheable – clients can cache responses
- Resources identified by URIs (datasets, groups, attributes, etc.)
- Standard HTTP methods and behaviors:

Method  | Safe | Idempotent | Description
GET     | Y    | Y          | Get a description of a resource
POST    | N    | N          | Create a new resource
PUT     | N    | Y          | Create a new named resource
DELETE  | N    | Y          | Delete a resource
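The safe/idempotent distinction in the table can be made concrete with a toy in-memory resource store (an illustrative sketch, not h5serv's implementation): repeating a POST creates a second resource, while repeating a PUT or DELETE leaves the server in the same state.

```python
import uuid

class ResourceStore:
    """Toy store illustrating REST method semantics."""

    def __init__(self):
        self.resources = {}

    def post(self, data):
        # POST: not idempotent -- every call mints a new resource id
        rid = str(uuid.uuid4())
        self.resources[rid] = data
        return rid

    def put(self, name, data):
        # PUT: idempotent -- repeating the call leaves the same state
        self.resources[name] = data
        return name

    def delete(self, name):
        # DELETE: idempotent -- deleting twice leaves the same state
        self.resources.pop(name, None)

store = ResourceStore()
a = store.post({"shape": 10})
b = store.post({"shape": 10})   # a second POST creates a distinct resource
```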

10. Example URI
http://tall.data.hdfgroup.org:7253/datasets/feef70e8-16a6-11e5-994e-06fc179afd5e/value?select=[0:4,0:4]
- Scheme: the connection protocol (http)
- Domain: HDF5 files on the server can be viewed as domains (tall.data.hdfgroup.org)
- Port: the port the server is running on (7253)
- Resource: identifier for the resource (dataset values in this case)
- Query param: modifies how the data will be returned (e.g. hyperslab selection)
Note: no run-time context!
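The pieces of this URI can be pulled apart with Python's standard library, including decoding the `select` hyperslab into slice objects (the slice-parsing helper is my own sketch, not part of any HDF library):

```python
from urllib.parse import urlparse, parse_qs

url = ("http://tall.data.hdfgroup.org:7253/datasets/"
       "feef70e8-16a6-11e5-994e-06fc179afd5e/value?select=[0:4,0:4]")
parts = urlparse(url)
# parts.scheme -> "http", parts.hostname -> "tall.data.hdfgroup.org",
# parts.port -> 7253, parts.path -> the resource

# decode the select query parameter into Python slice objects
sel = parse_qs(parts.query)["select"][0]          # "[0:4,0:4]"
slices = [slice(*map(int, dim.split(":")))
          for dim in sel.strip("[]").split(",")]
```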

11. Example POST Request – Create Dataset

Request:
POST /datasets HTTP/1.1
Content-Length: 39
User-Agent: python-requests/2.3.0 CPython/2.7.8 Darwin/14.0.0
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
host: newdset.datasettest.test.hdfgroup.org
Accept: */*
Accept-Encoding: gzip, deflate

{ "shape": 10, "type": "H5T_IEEE_F32LE" }

Response:
HTTP/1.1 201 Created
Date: Thu, 29 Jan 2015 06:14:02 GMT
Content-Length: 651
Content-Type: application/json
Server: TornadoServer/3.2.2

{ "id": "0568d8c5-a77e-11e4-9f7a-3c15c2da029e", "attributeCount": 0, "created": "2015-01-29T06:14:02Z", "lastModified": "2015-01-29T06:14:02Z", … ] }
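The request body and headers above can be reconstructed with the standard library; `json.dumps` yields exactly the 39-byte body the `Content-Length` header reports, and Basic auth is just base64 of `user:password` (the slide's header encodes the classic "Aladdin:open sesame" example credentials).

```python
import base64
import json

# the JSON body from the slide's POST request
payload = {"shape": 10, "type": "H5T_IEEE_F32LE"}
body = json.dumps(payload)
content_length = len(body)        # 39, matching the Content-Length header

# Basic auth is base64("user:password")
auth = "Basic " + base64.b64encode(b"Aladdin:open sesame").decode("ascii")
```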

12. Client/Server Architecture
(Diagram) Client software stacks – C/Fortran applications via the HDF5 lib with a REST VOL, the NetCDF4 lib, Python applications via h5pyd (REST backend), command line tools, and browser web applications – all talk to the HDF service (h5serv or …) over the HDF REST API (HTTP).
Note: clients don't need to know what's going on inside the service!

13. Reference Implementation – h5serv
- Open source implementation of the HDF REST API
- Get it at: https://github.com/HDFGroup/h5serv
- First release in 2015 – many features added since then
- Easy to set up and configure
- Runs on Windows/Linux/Mac
- Not intended for heavy production use: the implementation is single threaded, so each request is completed before the next one is processed

14. H5serv Highlights
- Written in Python using the Tornado framework (uses h5py and hdf5lib)
- REST-based API
- HTTP requests/responses in JSON or binary
- Full CRUD (create/read/update/delete) support
- Most HDF5 features (compound types, compression, chunking, links)
- Content directory
- Self-contained web server
- Open source (except the web UI)
- UUID identifiers for groups/datasets/datatypes
- Authentication and/or public access
- Object-level access control (read/write control per object)
- Query support

15. H5serv Architecture
(Diagram) Request/response flow: Request Handler → HDF5Db → h5py → hdf5lib → file storage.

16. H5serv Performance
- IO-intensive benchmark: read an n×n×n data cube as n×n×1 slices
- Binary is 10x faster than JSON
- Still 5x slower than NFS access of an HDF5 file!
- Not much effort has been spent on performance so far
- Write results are comparable to reads

17. Sample Applications
Even though h5serv has limited scalability, some interesting applications have been built with it. A couple of examples:
- The HDF Group is developing an AJAX-based HDF viewer for the web
- Anika Cartas at NASA Goddard developed a "Global Fire Emissions Database" – also a web-based app
- Ellen Johnson has created sample MATLAB scripts using the REST API (stay tuned for her talk)
- H5pyd – an h5py-compatible Python SDK; see: https://github.com/HDFGroup/h5pyd
- Command line tools – coming soon

18. Web UI – Display HDF Content in a browser

19. Global Fire Emissions Database

20. H5pyd – Python Client for the REST API
- h5py-like client library for Python apps; the HDF5 library is not needed on the client
- Calls to HDF5 in h5py are replaced by HTTP requests to h5serv
- Provides most of the functionality of the h5py high-level library
- The same code can work with local h5py (files) or h5pyd (REST API)
- Extensions for HDF REST API-specific features, e.g. query support
- Future work: an HDF5 REST VOL library for C/Fortran clients that provides the HDF5 API with a REST backend

21. Command Line Tools
Tools for common admin tasks:
- List files ('domains' in HDF REST API parlance) hosted by the service
- Update permissions
- Download content as local HDF5 files
- Upload local HDF5 files
- Output the contents of an HDF5 domain (similar to the h5dump or h5ls command line tools)

22. Object Storage
- The most common storage technology used in the cloud
- Manages data as objects: a key (any string) maps to a data blob
- Data sizes from 1 byte to 5 TB (AWS)
- Cost effective compared with other cloud storage technologies
- Built-in redundancy
- Potentially high throughput
- Different implementations – public: AWS S3, Google Cloud Storage, …; private: Ceph, OpenStack Swift, … – mostly compatible

23. Storage Costs
How much will it cost to store 1 PB for one year on AWS? The answer depends on the technology and the tradeoffs you are willing to accept…

Technology           | What it is               | Cost for 1 PB/1 yr | Fine print
Glacier              | Offline (tape) storage   | $125K              | 4-hour latency for first read; additional costs for restore
S3 Infrequent Access | Nearline object storage  | $157K              | $0.01/GB data retrieval charge – $10K to read the entire PB!
S3                   | Online object storage    | $358K              | Request pricing $0.01 per 10K requests; transfer-out charge $0.01/GB
EBS                  | Attachable disk storage  | $629K              | Extra charges for guaranteed IOPS; needs backups
EFS                  | Shared network (NFS)     | $3,774K            | Not all NFSv4.1 features supported (e.g. file locking)
DynamoDB             | NoSQL database           | $3,145K            | Extra charge for IOPS
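Most of the table's figures fall out of a flat per-GB-month rate times 12 months times 2^20 GB per PB. The rates below are my assumption of the circa-2016 AWS US East list prices that reproduce the table; S3 Standard is omitted because its pricing was tiered rather than a single flat rate.

```python
# assumed per-GB-month list prices (circa 2016, US East) -- an assumption,
# chosen because they reproduce the slide's table
RATES = {
    "Glacier": 0.01,
    "S3 Infrequent Access": 0.0125,
    "EBS": 0.05,
    "EFS": 0.30,
    "DynamoDB": 0.25,
}

GB_PER_PB = 2 ** 20   # AWS prices are quoted per GB

def annual_cost_1pb(rate_per_gb_month):
    """Annual cost in dollars to store 1 PB at the given flat rate."""
    return rate_per_gb_month * 12 * GB_PER_PB

for tech, rate in RATES.items():
    print(f"{tech}: ${annual_cost_1pb(rate) // 1000:,.0f}K")
```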

24. Object Storage Challenges for HDF
- Not POSIX!
- High latency (~0.25 s) per request
- Not write/read consistent
- High throughput needs some tricks (use many async requests)
- Request charges can add up (public cloud)
For HDF5, using the HDF5 library directly on an object storage system is a non-starter. An alternative solution will be needed…
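The "many async requests" trick can be sketched with asyncio: when requests are issued concurrently, the total wait is roughly one round-trip latency instead of N of them. The fetch here is a simulated stand-in (an assumption), not a real S3 client.

```python
import asyncio

async def fetch_chunk(key):
    # stand-in for an object-store GET (a real client would use an
    # async HTTP library); the sleep simulates per-request latency
    await asyncio.sleep(0.01)
    return key, b"data"

async def fetch_all(keys):
    # all requests in flight at once: total time ~ one latency, not N
    return await asyncio.gather(*(fetch_chunk(k) for k in keys))

results = asyncio.run(fetch_all([f"chunk_{i}" for i in range(16)]))
```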

25. How to store HDF5 data in an object store?
Idea: store each HDF5 file as a single object store object.
- Read on demand
- Update locally – write back the entire file to the store
But…
- Slow – need to read the entire file for each read
- Consistency issues for updates
- Limited to the maximum object size (AWS = 5 TB)

26. Objects as Objects!
Big idea: map individual HDF5 objects (datasets, groups, chunks) to object storage objects.
- Maximum storage object size is limited
- Data can be accessed efficiently
- Only data that is modified needs to be updated
- (Potentially) multiple clients can read/update the same "file"
Example:
- A dataset is partitioned into chunks
- Each chunk is stored as an object
- Dataset metadata (type, shape, attributes, etc.) is stored in a separate object
(Diagram: each chunk gets persisted as a separate object.)
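The chunk-to-object mapping can be sketched in a few lines: given a hyperslab selection, compute which chunk indices it touches, and derive an object key per chunk. The key scheme here is illustrative (an assumption, not HSDS's actual layout).

```python
from itertools import product

def chunk_indices(slices, chunk_shape):
    """Chunk indices touched by a hyperslab selection."""
    ranges = [range(s.start // c, (s.stop - 1) // c + 1)
              for s, c in zip(slices, chunk_shape)]
    return set(product(*ranges))

def chunk_key(dset_id, idx):
    # illustrative key scheme (an assumption): "<dataset-id>-chunk-i_j"
    return f"{dset_id}-chunk-" + "_".join(map(str, idx))

# a 2-D dataset chunked 100x100: the selection [0:150, 0:50] touches
# exactly two chunks, so only two objects need to be fetched
touched = chunk_indices((slice(0, 150), slice(0, 50)), (100, 100))
```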

27. HDF Scalable Data Service (HSDS)
A highly scalable implementation of the HDF REST API.
Goals:
- Support any sized repository, any number of users/clients, any request volume
- Provide data as fast as the client can pull it in
- Targeted for AWS but portable to other public/private clouds
- Cost effective: use AWS S3 as primary storage, decouple storage and compute costs, elastically scale compute with usage

28. Architecture for HSDS
Legend:
- Client: any user of the service
- LB: load balancer – distributes requests to service nodes
- SN: service nodes – process requests from clients (with help from data nodes)
- DN: data nodes – each responsible for a partition of the object store
- Object Store: base storage service (e.g. AWS S3)

29. HSDS Architecture Highlights
- DNs provide a read/write-consistent layer on top of AWS S3
- DNs also serve as a data cache (improves performance and lowers S3 request costs)
- SNs deterministically know which DNs are needed to serve a given request
- The number of DNs/SNs can grow or shrink depending on demand
- Minimal operational cost would be 1 SN, 1 DN, and the S3 data storage costs
- Query operations can run across all data nodes in parallel
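One common way to get the deterministic SN-to-DN mapping is to hash the object key and take it modulo the number of data nodes, so every service node computes the same owner for a given key without coordination. This is a sketch of the general technique, not necessarily HSDS's exact scheme.

```python
import hashlib

def dn_for_key(key, num_dns):
    # stable hash of the object key: every SN computes the same DN for a
    # given key, so no shared routing table is needed (illustrative sketch)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_dns
```

Note that a plain modulo remaps most keys when `num_dns` changes; a production service scaling DNs up and down would likely prefer consistent hashing to limit data movement.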

30. HSDS Timeline and Next Steps
- Work started July 1, 2016
- Supported by NASA Cooperative Agreement NNX16AL91A; working with the NASA OpenNEX team
- Also includes client components: HDF REST VOL, h5pyd, command line tools
- The project scope is the next two years, but we hope to have a prototype available sooner
- We would love feedback on the design, use cases, or additional features you'd like to see

31. To Find Out More
- H5serv: https://github.com/HDFGroup/h5serv
- Documentation: http://h5serv.readthedocs.io/
- H5pyd: https://github.com/HDFGroup/h5pyd
- RESTful HDF5 white paper: https://www.hdfgroup.org/pubs/papers/RESTful_HDF5.pdf
- OpenNEX: https://nex.nasa.gov/nex/static/htdocs/site/extra/opennex/
- Blog articles:
  - https://hdfgroup.org/wp/2015/04/hdf5-for-the-web-hdf-server/
  - https://hdfgroup.org/wp/2015/12/serve-protect-web-security-hdf5/