Slide1
DAP, ERDDAP, and Tabular (Sequence) Datasets
Try it: http://coastwatch.pfeg.noaa.gov/erddap
Bob Simons <bob.simons@noaa.gov>
NOAA NMFS SWFSC ERD
Diagram labels: OBIS, SOS, Custom, DAP, ERDDAP, ..., Database, ERDDAP, Files; Your Favorite Client Software
Slide2
My Goals for this Presentation
Tell you more about ERDDAP.
Raise awareness and appreciation of tabular data.
Convince you that tabular datasets are best served as DAP sequences.
  And that serving them in DAP as 1D or 2D gridded datasets is a bad idea.
  (This has nothing to do with how they are stored.)
Bonus: 3 powerful ideas:
  Abstractions (capture the essence; hide the instance details)
  Representations (different file formats)
  Reusability (value is multiplied)
Slide3
1) ERDDAP
Slide4
Slide5
Slide6
ERDDAP Features
(Re)serves diverse local and remote datasets.
  Abstraction: thanks to DAP, the source differences are hidden.
Serves gridded and tabular datasets.
Offers a unified place to search for datasets.
  Full-text, category-based, or advanced.
Encourages improved metadata.
  So users can understand the dataset.
Offers a standard way to request data from any dataset.
  For humans: forms on web pages.
  For computers: DAP, WMS, (SOS) web services.
Offers a choice of response file formats.
  Different representations.
Standardizes time formats. (Here, different representations are trouble.)
  As Strings - ISO 8601:2004(E), e.g., 2014-07-01T20:00:00Z
  As numbers - seconds since 1970-01-01T00:00:00Z
Is reusable.
Slide7
2) Tabular Data
Slide8
Tabular Datasets
Tabular data sources: databases, OBIS, SOS, CSV files, flat .nc files, CF DSG .nc files, ...
Geospatial
  CF Discrete Sampling Geometry (DSG) feature types:
    Point: whale sightings
    Profile: disposable CTD
    TimeSeries: moored buoy
    TimeSeriesProfile: CTD
    Trajectory: ship
    TrajectoryProfile: profiling glider
Non-Geospatial
  laboratory data, references, fish disease lists, ecosystem: what eats what, ...
Larry Ellison is rich because databases are reusable for numerous types of data.
Slide9
(ERD)DAP Data Requests: Gridded vs. Tabular Datasets
Gridded Datasets (DAP projection constraints)
  DAP: ?temperature[437][46:1:162][122:282]
  ERDDAP: ?temperature[(2014-07-01)][(22):(51)][(-145):(-105)]
Tabular Datasets (DAP selection constraints)
  DAP: ?s.id,s.owner,s.time,s.latitude,s.longitude,s.wtemp&s.id="sp031"&s.time>=1404172800
  ERDDAP: ?id,owner,time,latitude,longitude,wtemp&id="sp031"&time>=2014-07-01
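The two tabular requests above constrain time in two ways: as a number (1404172800) and as an ISO 8601 string (2014-07-01). A minimal Python sketch of the conversion between those two representations follows. The function names are illustrative only, and the code assumes all times are UTC; it is not taken from ERDDAP itself.

```python
from datetime import datetime, timezone


def iso_to_epoch_seconds(iso_text: str) -> int:
    """Convert an ISO 8601 date or date-time (assumed UTC) to
    seconds since 1970-01-01T00:00:00Z."""
    text = iso_text.rstrip("Z")  # drop the trailing UTC designator, if any
    fmt = "%Y-%m-%dT%H:%M:%S" if "T" in text else "%Y-%m-%d"
    moment = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


def epoch_seconds_to_iso(seconds: int) -> str:
    """Convert seconds since 1970-01-01T00:00:00Z to an ISO 8601 string."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(iso_to_epoch_seconds("2014-07-01"))  # 1404172800
print(epoch_seconds_to_iso(1404172800))    # 2014-07-01T00:00:00Z
```

The printed values match the pair of time constraints used in the DAP and ERDDAP requests.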
id     owner  type       time                  latitude  longitude  wtemp  atmp
46088  NDBC   3m Discus  1993-06-01T14:20:00Z  48.336    -123.159   16.4   18.0
46088  NDBC   3m Discus  1993-06-01T14:50:00Z  48.336    -123.159   16.5   18.2
...    ...    ...        ...                   ...       ...        ...    ...
SANF1  SFSU   C-MAN      1968-10-14T16:00:00Z  24.456    -81.877    15.8   14.9
SANF1  SFSU   C-MAN      1968-10-14T17:00:00Z  24.456    -81.877    15.8   14.8
...    ...    ...        ...                   ...       ...        ...    ...
Slide10
(ERD)DAP Sequence Requests vs. Database SQL Requests
(ERD)DAP: ?id,owner,type,time,latitude,longitude,wtemp&id="46088"&time>=2014-07-01
SQL: SELECT id,owner,type,time,latitude,longitude,wtemp FROM s WHERE id="46088" AND time>=2014-07-01
Pablo Picasso: "Good artists copy, great artists steal."
Slide11
Related Tables vs. One Table
Join (Denormalized)
id     owner  type       latitude  longitude  time                  wtemp  atmp
46088  NDBC   3m Discus  48.336    -123.159   1993-06-01T14:20:00Z  16.4   18.0
46088  NDBC   3m Discus  48.336    -123.159   1993-06-01T14:50:00Z  16.5   18.2
...    ...    ...        ...       ...        ...                   ...    ...
NC312  NCSU   C-MAN      24.456    -81.877    1968-10-14T16:00:00Z  15.8   14.9
NC312  NCSU   C-MAN      24.456    -81.877    1968-10-14T17:00:00Z  15.8   14.8
...    ...    ...        ...       ...        ...                   ...    ...

Normalized

Observation Table
id     time                  wtemp  atmp
46088  1993-06-01T14:20:00Z  16.4   18.0
46088  1993-06-01T14:50:00Z  16.5   18.2
...    ...                   ...    ...
NC312  1968-10-14T16:00:00Z  15.8   14.9
NC312  1968-10-14T17:00:00Z  15.8   14.8
...    ...                   ...    ...

Buoy Table
id     owner  type       latitude  longitude
46088  NDBC   3m Discus  48.336    -123.159
41005  NDBC   6m Discus  32.501    -79.099
BP114  BP     3m Discus  36.905    -75.713
NC312  NCSU   C-MAN      24.456    -81.877
...    ...    ...        ...       ...
Slide12
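The relationship between the Buoy Table, the Observation Table, and the joined (denormalized) table from the "Related Tables vs. One Table" slide can be sketched with Python's built-in sqlite3 module. The table names `buoy` and `observation` and the column types are assumptions made for this illustration. The rows are the sample values shown on the slide.

```python
import sqlite3

# In-memory database holding the two normalized tables (names are illustrative)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE buoy (id TEXT PRIMARY KEY, owner TEXT, type TEXT,
                   latitude REAL, longitude REAL);
CREATE TABLE observation (id TEXT, time TEXT, wtemp REAL, atmp REAL);
""")
conn.executemany("INSERT INTO buoy VALUES (?,?,?,?,?)", [
    ("46088", "NDBC", "3m Discus", 48.336, -123.159),
    ("41005", "NDBC", "6m Discus", 32.501, -79.099),
    ("BP114", "BP", "3m Discus", 36.905, -75.713),
    ("NC312", "NCSU", "C-MAN", 24.456, -81.877),
])
conn.executemany("INSERT INTO observation VALUES (?,?,?,?)", [
    ("46088", "1993-06-01T14:20:00Z", 16.4, 18.0),
    ("46088", "1993-06-01T14:50:00Z", 16.5, 18.2),
    ("NC312", "1968-10-14T16:00:00Z", 15.8, 14.9),
    ("NC312", "1968-10-14T17:00:00Z", 15.8, 14.8),
])

# Joining on id repeats each buoy's metadata on every observation row,
# which yields the single denormalized table.
joined = conn.execute("""
    SELECT b.id, b.owner, b.type, b.latitude, b.longitude,
           o.time, o.wtemp, o.atmp
    FROM observation AS o JOIN buoy AS b ON b.id = o.id
    ORDER BY b.id, o.time
""").fetchall()

for row in joined:
    print(row)
```

Buoys with no observations in this sample (41005, BP114) do not appear in the joined result because the sketch uses an inner join.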
Yeah, but why doesn't ERDDAP support nested sequences?
It does, but just internally.
ERDDAP (re)presents the dataset as a single table.
One table is an abstraction. It hides details.
The average user understands a table.
One vs. many tables: just different representations.
This lets all tabular datasets have the same structure.
The result of a DAP or SQL query is always one table.
There are many file format representations of one table.
Slide13
3) Tabular datasets are best served as DAP sequences.
(Why DAP Sequences Rock!)
And that serving them in DAP as 1D or 2D gridded datasets is a bad idea.
(This has nothing to do with how they are stored.)
Slide14
Why Sequences Rock! Reason #1
If the data is coming from a relational database, OBIS, or SOS, the dataset can't be served as a gridded dataset.
  There are no index (row) numbers.
  It isn't easy/possible to know how many rows there are.
  The order of the rows may change at any time.
  New rows are added as new data arrives: frequently.
Slide15
Why Sequences Rock! Reason #2
Serving tabular data in DAP as 1D or 2D gridded datasets is a bad idea.
Logic:
  Men:mortal. Socrates:man. Socrates:mortal.
  Grids:handled well by DAP. Treat table as:grid. Treat table as grid:handled well?
Grid dimensions usually represent a physical continuum.
  DAP: ?temperature[408:437][46:1:162][122:282]
  ERDDAP: ?temperature[(2014-06-01):(2014-06-30)][(22):(51)][(-145):(-105)]
No arrangement of tabular dataset dimensions works well.
  2D [buoy][time]: buoy is not a continuum, time leads to wasted space
  1D [time]: fine, but then you need 1000 datasets (1 per buoy)
  1D [row]: aggregated, but row isn't a continuum.
In every case, it's hard to know which rows to request.
  The rows you want are scattered through the dataset,
  so you have to either download everything or make numerous requests.
Serving a DSG file directly: too many formats, too hard to query.
Slide16
Why Sequences Rock! Reason #3
DAP sequence requests use the terminology of the dataset. (It's easy.)
  ?id,owner,type,latitude,longitude&distinct()
  ?id,type,latitude,longitude&owner="NDBC"&distinct()
  ?id&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&distinct()
  ?id&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&time>=2014-07-01&distinct()
  ?&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&time>=2014-07-01
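The requests above are plain strings built from the dataset's own variable names, so a client can assemble them mechanically. The following Python sketch shows one way to do that and to turn a query into a request URL. The helper names and the dataset ID `hypotheticalBuoys` are assumptions for illustration. The URL layout (`/tabledap/` + dataset ID + file type + query) follows ERDDAP's tabledap convention, and the percent-encoding step is a general URL precaution.

```python
from urllib.parse import quote


def tabledap_query(variables, constraints=(), distinct=False):
    """Build a selection-constraint query string in the style shown above."""
    query = ",".join(variables)
    for constraint in constraints:
        query += "&" + constraint
    if distinct:
        query += "&distinct()"
    return "?" + query


def tabledap_url(base_url, dataset_id, file_type, query):
    """Combine the pieces into a URL, percent-encoding characters such as
    the double quote, < and > that are not safe in a URL."""
    encoded = quote(query[1:], safe=",=&()")
    return f"{base_url}/tabledap/{dataset_id}{file_type}?{encoded}"


# The second request from the list above
q = tabledap_query(["id", "type", "latitude", "longitude"],
                   ['owner="NDBC"'], distinct=True)
print(q)

# "hypotheticalBuoys" is a made-up dataset ID used only for illustration
url = tabledap_url("http://coastwatch.pfeg.noaa.gov/erddap",
                   "hypotheticalBuoys", ".csv", q)
print(url)
```

Changing `file_type` (for example `.csv` to `.json`) asks for a different representation of the same table without changing the query.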
index    id     owner  type       latitude  longitude  time                  wtemp  atmp
1        46088  NDBC   3m Discus  48.336    -123.159   1993-06-01T14:20:00Z  16.4   18.0
2        46088  NDBC   3m Discus  48.336    -123.159   1993-06-01T14:50:00Z  16.5   18.2
137522   BP114  BP     3m Discus  36.905    -75.713    2003-02-09T02:00:00Z  16.7   12.2
137523   BP114  BP     3m Discus  36.905    -75.713    2003-02-09T04:00:00Z  16.6   12.0
1732156  NC312  NCSU   C-MAN      24.456    -81.877    1968-10-14T16:00:00Z  15.8   14.9
1732157  NC312  NCSU   C-MAN      24.456    -81.877    1968-10-14T17:00:00Z  15.8   14.8
3282459  41005  NDBC   6m Discus  32.501    -79.090    1984-08-22T14:20:00Z  14.6   26.8
3282460  41005  NDBC   6m Discus  32.501    -79.090    1984-08-22T14:50:00Z  14.7   26.2
Making these requests with index numbers is a difficult (not for Roberto), multi-step, programming task. And it's inefficient.
Slide17
Why Sequences Rock! Reason #4
Because declarative languages (SQL, DAP selection constraints) let you describe what you want, not how to get it.
  ?id,owner,type,latitude,longitude&distinct()
  ?id,type,latitude,longitude&owner="NDBC"&distinct()
  ?id&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&distinct()
  ?id&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&time>=2014-07-01&distinct()
  ?&latitude>=22&latitude<=55&longitude>=-145&longitude<=-105&time>=2014-07-01
With imperative languages (C, Fortran, Java, Python), you must describe, step-by-step, how to solve the problem.
  1) Request all latitudes.
  2) Filter.
  3) Request all longitudes.
  4) Multiple requests because data is scattered throughout the dataset.
Slide18
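The contrast drawn in Reason #4 can be sketched in Python. The small in-memory `rows` structure stands in for a hypothetical dataset served as a 1D [row] grid, with made-up row order. The imperative function walks through the four steps listed for imperative languages, and the declarative function simply states the selection constraints. Both function names are illustrative.

```python
# Hypothetical 1D [row] dataset: each variable is an array indexed by row number.
rows = {
    "id":        ["46088", "BP114", "NC312", "41005", "46088", "NC312"],
    "latitude":  [48.336, 36.905, 24.456, 32.501, 48.336, 24.456],
    "longitude": [-123.159, -75.713, -81.877, -79.090, -123.159, -81.877],
}


def imperative_index_ranges(lat_min, lat_max, lon_min, lon_max):
    """Work out which row-index ranges to request, step by step."""
    # 1) Request all latitudes.  2) Filter to candidate row indexes.
    lats = rows["latitude"]
    keep = [i for i, lat in enumerate(lats) if lat_min <= lat <= lat_max]
    # 3) Request all longitudes and filter the candidates again.
    lons = rows["longitude"]
    keep = [i for i in keep if lon_min <= lons[i] <= lon_max]
    # 4) Matching rows are scattered, so group them into contiguous
    #    index ranges; each range would need its own data request.
    ranges = []
    for i in keep:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1][1] = i
        else:
            ranges.append([i, i])
    return [tuple(r) for r in ranges]


def declarative_query(lat_min, lat_max, lon_min, lon_max):
    """State what is wanted; the server decides how to find it."""
    return (f"?id&latitude>={lat_min}&latitude<={lat_max}"
            f"&longitude>={lon_min}&longitude<={lon_max}&distinct()")


print(imperative_index_ranges(22, 55, -145, -105))  # [(0, 0), (4, 4)]
print(declarative_query(22, 55, -145, -105))
```

With the sample rows, the imperative path ends with two separate index ranges to fetch, while the declarative path is a single request string.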
Why Sequences Rock! Reason #5
Because the other options all suck.
Serving the datasets as grids doesn't work.
  You now understand why, right?
Serve the data files via FTP.
  Getting a chunk of data is all or nothing.
  Makes the user deal with various file formats.
Custom forms and web services are too much work to make.
  Custom: 6+ months per dataset? Ongoing maintenance. No consistency!
  Reusable: 1 day, minimal maintenance, consistent!
Give trusted colleagues access to the database or the files.
  That's not making the data public!
Don't let anyone else use the data.
  This is actually the #1 method of fisheries data distribution.
Slide19
My Goals for this Presentation
Tell you more about ERDDAP.
Raise awareness and appreciation of tabular data.
Convince you that tabular datasets are best served as DAP sequences.
  And that serving them in DAP as 1D or 2D gridded datasets is a bad idea.
  (This has nothing to do with how they are stored.)
Bonus: 3 powerful ideas:
  Abstractions (capture the essence; hide the instance details)
  Representations (different file formats)
  Reusability (value is multiplied)
Slide20
Thank you!