SQMD: Architecture for Scalable, Distributed Database System


Built on virtual private servers. e-Science for cheminformatics and drug discovery, 4th IEEE International Conference on e-Science 2008. Kangseok Kim and Marlon E. Pierce, Community Grids Laboratory, Indiana University.



Presentation Transcript

Slide1

SQMD: Architecture for Scalable, Distributed Database System built on Virtual Private Servers

e-Science for cheminformatics and drug discovery
4th IEEE International Conference on e-Science 2008

Kangseok Kim, Marlon E. Pierce
Community Grids Laboratory, Indiana University
kakim@indiana.edu, marpierc@indiana.edu

Rajarshi Guha
School of Informatics, Indiana University
rguha@indiana.edu

Slide2

Introduction

- Huge increase in the size of datasets in a variety of fields, e.g.
  - Scientific observations for e-Science
  - Sensors (video, environmental)
  - Data fetched from the Internet defining users' interests
- We need data management, partitioning, and processing strategies that are scalable.
- We also need to find effective ways to use our overabundance of computing power.
- Cloud computing and virtualization
  - The partitioning of a database over virtual private servers can be a critical factor for scalability and performance.
  - Virtual private servers are used to facilitate concurrent access to individual applications (databases) residing on multiple virtual platforms, on one or more physical machines, with effective resource use and management, as compared to a single application (database) on a physical machine.

Slide3

Distributed database system built on virtual private servers

- The database system is composed of three tiers:
  - web service (WS) client (front-end)
  - web service and message service system (middleware)
  - agents and a collection of databases (back-end)
- The distributed database system allows WS clients to access data from databases distributed over virtual private servers.
- Databases are distributed over multiple virtual private servers by fragmenting the data using two different methods:
  - data clustering
  - horizontal (or equal) partitioning
- The distributed database system is a network of two or more PostgreSQL databases that reside on one or more virtual private servers.
- The lab uses 8 virtual private servers on one physical machine with OpenVZ virtualization technology.
- A WS client can simultaneously access (or query) the data in several databases in a single distributed environment.
- The SQMD (Single Query Multiple Database) mechanism transmits a single query that synchronously operates on multiple databases, using the publish/subscribe paradigm.
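The SQMD idea above can be illustrated with a minimal sketch. It assumes an in-process stand-in for the message broker: the names `Broker`, `DbAgent`, and `sqmd_query` are hypothetical, each agent filters an in-memory list instead of calling PostgreSQL through JDBC, and delivery is serial rather than concurrent.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = dict
Predicate = Callable[[Row], bool]


class Broker:
    """Toy publish/subscribe broker: each topic maps to subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message) -> list:
        # Deliver the same message to every subscriber and collect the replies.
        return [handler(message) for handler in self._subscribers[topic]]


class DbAgent:
    """Hypothetical stand-in for a DB agent that owns one database fragment."""

    def __init__(self, name: str, fragment: List[Row]) -> None:
        self.name = name
        self.fragment = fragment

    def handle_query(self, predicate: Predicate) -> List[Row]:
        return [row for row in self.fragment if predicate(row)]


def sqmd_query(broker: Broker, predicate: Predicate) -> List[Row]:
    """Publish one query, then serially aggregate the per-agent responses."""
    merged: List[Row] = []
    for partial in broker.publish("query/response", predicate):
        merged.extend(partial)
    return merged


# Illustrative data only: three fragments, one agent per fragment.
broker = Broker()
fragments = [
    [{"cid": 1, "score": 0.2}, {"cid": 2, "score": 0.9}],
    [{"cid": 3, "score": 0.4}],
    [{"cid": 4, "score": 0.8}, {"cid": 5, "score": 0.1}],
]
for i, fragment in enumerate(fragments):
    broker.subscribe("query/response", DbAgent(f"agent-{i}", fragment).handle_query)

hits = sqmd_query(broker, lambda row: row["score"] < 0.5)
print(sorted(row["cid"] for row in hits))  # [1, 3, 5]
```

The client issues one query and never learns how many fragments exist, which is the transparency the middleware is meant to provide.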

Slide4

Scalable distributed database system architecture

[Figure: architecture diagram. Components: WS Client (front-end user interface); Web Server hosting the Web Service and Message Service; Message/Service System (Broker); three DB Host Servers, each running a DB Agent (JDBC to PostgreSQL). Query/Response messages flow between the components. Broker topics: 1. Query/Response, 2. Heart-beat.]

Slide5

Example query and total number of hits for varying R (distance cutoff)

    SELECT cid, structure FROM pubchem_3d WHERE cube_enlarge ( COORDS, R, 12 ) @> momsim

- cid - compound ID
- pubchem_3d - 3D structures for a public repository of chemical information, including connection tables, properties, and biological assay results
- COORDS - 12-D shape descriptor of the query molecule
- R - user-specified distance cutoff, used to retrieve those points from the database whose distance to the query point is within the cutoff
- cube_enlarge - PostgreSQL function that generates the bounding hypercube from the query point
- momsim - 12-D CUBE field
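The containment test in the query above can be sketched outside the database. This assumes COORDS is a single point, so the enlarged cube spans `[c - R, c + R]` in each of the 12 dimensions; the function name `in_enlarged_cube` is hypothetical, and treating the boundary as inclusive is an assumption.

```python
from typing import Sequence


def in_enlarged_cube(query: Sequence[float], r: float, point: Sequence[float]) -> bool:
    """True if `point` lies in the hypercube of half-width r centred on `query`."""
    if len(query) != len(point):
        raise ValueError("query and point must have the same dimensionality")
    # Every coordinate must fall within r of the query coordinate.
    return all(abs(q - p) <= r for q, p in zip(query, point))


# Illustrative 12-D descriptors, not real PubChem data.
query = [0.0] * 12
near = [0.25] * 12            # within 0.3 of the query in every dimension
far = [0.25] * 11 + [0.75]    # one coordinate falls outside the cube

print(in_enlarged_cube(query, 0.3, near))  # True
print(in_enlarged_cube(query, 0.3, far))   # False
```

A row is a hit only if all 12 coordinates fall inside the cube, which is why the hit count grows quickly as R increases.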

The example query finds all rows of the database for which the 12-D shape descriptor lies in the hypercubical region defined by cube_enlarge.

Total number of hits for varying R, using the above query

| R | Total number of response data | Size in bytes |
|-----|---------|------------|
| 0.3 | 495 | 80,837 |
| 0.4 | 6,870 | 1,121,181 |
| 0.5 | 37,049 | 6,043,337 |
| 0.6 | 113,123 | 18,447,438 |
| 0.7 | 247,171 | 40,302,297 |

Slide6

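Total latency = Transit cost ($T_{\text{client2ws}}$) + Web service cost ($T_{\text{ws2db}}$)

[Figure: timing diagram spanning the Web Service (WS) Client, WS, Broker, and DB Agents. Labelled intervals: $T_{\text{query}}$, $T_{\text{response}}$, $T_{\text{client2ws}}$ (transit cost), $T_{\text{ws2db}}$ (WS cost), $T_{\text{aggregation}}$, $T_{\text{agent2db}}$.]

$$\text{Total latency} = T_{\text{client2ws}} + T_{\text{ws2db}}$$

Slide7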

Mean query response time in a centralized (not fragmented) database

- Time to transmit a query ($T_{\text{query}}$) to, and receive a response ($T_{\text{response}}$) from, the web service running on the web server
- Time spent in the web service for serially aggregating responses from databases
- Time between submitting a query from an agent to a database server and retrieving the responses of the query from it, including the corresponding execution time of the agent
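The cost components listed above can be combined in a small sketch. Two readings are assumptions taken from the timing diagram rather than stated in the text: that the transit cost is the sum of the query and response transmission times, and that the web service cost is the sum of the aggregation time and the agent-to-database time. The class name `LatencySample` and the numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    """One query's timing components, all in the same unit (e.g. milliseconds)."""

    t_query: float        # client -> web service transmission
    t_response: float     # web service -> client transmission
    t_aggregation: float  # serial aggregation inside the web service
    t_agent2db: float     # agent submits the query and retrieves the results

    @property
    def t_client2ws(self) -> float:
        # Assumed: transit cost = query transmission + response transmission.
        return self.t_query + self.t_response

    @property
    def t_ws2db(self) -> float:
        # Assumed: web service cost = aggregation + agent-to-database time.
        return self.t_aggregation + self.t_agent2db

    @property
    def total(self) -> float:
        # Total latency = T_client2ws + T_ws2db, as on the latency slide.
        return self.t_client2ws + self.t_ws2db


# Made-up numbers, not measurements from the experiments.
sample = LatencySample(t_query=5.0, t_response=20.0, t_aggregation=15.0, t_agent2db=160.0)
print(sample.t_client2ws, sample.t_ws2db, sample.total)  # 25.0 175.0 200.0
```

Keeping the components separate makes it easy to see which term dominates as the result set grows.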

As the distance R increases, the time needed to perform a query in the database increases because the result set grows, so the query processing cost clearly becomes the largest portion of the total cost.

Slide8

Performance analysis

- We show the performance of the query/response interaction mechanism between a client and the distributed databases, including the overheads of virtualized deployments compared to real (physical) host deployments, for two different data partitioning strategies: horizontal partitioning vs. data clustering.
- In our experiment with virtual private servers:
  - with the data clustering method, we allocated memory to each virtual server in proportion to the size of its cluster
  - with the horizontal partitioning method, we allocated the same amount of memory to each server
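The two memory allocation policies above can be sketched as follows. The function names, the 8000 MB budget, and the cluster sizes are all hypothetical; the slides do not give the actual memory figures.

```python
from typing import List


def equal_allocation(total_mb: int, n_servers: int) -> List[int]:
    """Horizontal partitioning: every server gets the same share."""
    return [total_mb // n_servers] * n_servers


def proportional_allocation(total_mb: int, cluster_sizes: List[int]) -> List[int]:
    """Data clustering: each server's share is proportional to its cluster size."""
    total_rows = sum(cluster_sizes)
    # Integer division may leave a few MB unallocated; acceptable for a sketch.
    return [total_mb * size // total_rows for size in cluster_sizes]


print(equal_allocation(8000, 8))                        # [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
print(proportional_allocation(8000, [100, 300, 600]))   # [800, 2400, 4800]
```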

Slide9

$$\text{Speedup} = \frac{\text{mean query response time in a centralized database system}}{\text{mean query response time in a distributed database system}}$$
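The speedup ratio above can be computed directly from repeated timing measurements. This is a minimal sketch: the function name `speedup` is hypothetical and the timings are made-up values, not results from the experiments.

```python
from statistics import mean
from typing import Sequence


def speedup(centralized_times: Sequence[float], distributed_times: Sequence[float]) -> float:
    """Mean centralized response time divided by mean distributed response time."""
    return mean(centralized_times) / mean(distributed_times)


# Illustrative timings in seconds.
centralized = [8.0, 12.0]   # mean 10.0
distributed = [2.0, 3.0]    # mean 2.5

print(speedup(centralized, distributed))  # 4.0
```

A value above 1 means the distributed deployment answered faster on average than the centralized one.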


Horizontal partitioning is faster than data clustering because the fragments produced by the data clustering method can differ in the number of data items they contain.

Slide10

Mean query processing time in each cluster (R = 0.5)

As the number of responses produced by a query in a large cluster increases, the time needed to perform the query in that cluster increases as well. In other words, the total active (hash) index set for the query grows as the distance R increases.

To avoid as much disk access as possible, and thus improve query processing performance, the total index set needs to fit in main memory.

Slide11

Mean query response time

Slide12

Summary and future work

- The SQMD mechanism, based on the publish/subscribe paradigm, transmits a single query that simultaneously operates on multiple databases. It hides the details of data distribution in the middleware, making the distributed databases transparent to heterogeneous web service clients.
- The results of experiments with our distributed system indicate that the performance of virtual private servers on a single machine (host) is comparable to that of eight physical machines (hosts).
- In future work, we need to decrease the workload of aggregating the results of a query in the web service.
- We will investigate the use of the M-tree index.
  - M-tree indexes allow one to perform queries using hyperspherical regions.
  - This would allow us to avoid the extra hits we currently obtain due to the hypercube representation.
- To eliminate unnecessary query processing on some of the databases distributed by the data clustering method, we should consider query optimization that localizes a query to specific databases.
- We will also extend the evaluation to the (optimized) effective use of resources other than memory in our distributed database system.
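The extra hits attributed to the hypercube representation can be illustrated with a small comparison. This sketch assumes Euclidean distance for the hyperspherical region and uses 2-D points for readability instead of the 12-D descriptors; the function names and points are hypothetical, and no actual M-tree index is built.

```python
import math
from typing import List, Sequence, Tuple


def in_cube(query: Sequence[float], point: Sequence[float], r: float) -> bool:
    """Point lies within r of the query along every axis (bounding hypercube)."""
    return all(abs(q - p) <= r for q, p in zip(query, point))


def in_sphere(query: Sequence[float], point: Sequence[float], r: float) -> bool:
    """Point lies within Euclidean distance r of the query (hypersphere)."""
    return math.dist(query, point) <= r


query = (0.0, 0.0)
radius = 1.0
points: List[Tuple[float, float]] = [(0.5, 0.5), (0.9, 0.9), (1.5, 0.0)]

cube_hits = [p for p in points if in_cube(query, p, radius)]
sphere_hits = [p for p in points if in_sphere(query, p, radius)]
extra_hits = [p for p in cube_hits if p not in sphere_hits]

print(cube_hits)    # [(0.5, 0.5), (0.9, 0.9)]
print(sphere_hits)  # [(0.5, 0.5)]
print(extra_hits)   # [(0.9, 0.9)]
```

Points near the corners of the cube pass the hypercube test but lie farther than R from the query, and the fraction of such points grows with the number of dimensions.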
