/
CS639:  Data Management for CS639:  Data Management for

CS639: Data Management for - PowerPoint Presentation

martin
martin . @martin
Follow
65 views
Uploaded On 2023-11-15

CS639: Data Management for - PPT Presentation

Data Science Lecture 7 Wrapping up RDBMS Theodoros Rekatsinas 1 2 Announcements Final date To be set Issue with admins Questions on PA2 Submission Template Please follow instructions ID: 1031743

price group product quantity group price quantity product sum transaction select txn count wrote date attributes purchasewhere condition author

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "CS639: Data Management for" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

1. CS639: Data Management for Data ScienceLecture 7: Wrapping up RDBMSTheodoros Rekatsinas1

2. 2AnnouncementsFinal date: To be set. Issue with admins.Questions on PA2?Submission TemplatePlease follow instructionsResearch projects postedPlease sign up to multiple projects that you find interestingSign up in Slack

3. Today’s LectureFinish SQLOverview of an RDBMSTransactions and ACID3

4. 1. SQL (Aggregation and Group By)4

5. 5AggregationSELECT COUNT(*)FROM ProductWHERE year > 1995Except COUNT, all aggregations apply to a single attributeSELECT AVG(price)FROM ProductWHERE maker = “Toyota”SQL supports several aggregation operations:SUM, COUNT, MIN, MAX, AVG

6. 6COUNT applies to duplicates, unless otherwise statedSELECT COUNT(category) FROM ProductWHERE year > 1995Note: Same as COUNT(*). Why?We probably want:SELECT COUNT(DISTINCT category)FROM ProductWHERE year > 1995Aggregation: COUNT

7. 7Purchase(product, date, price, quantity)More ExamplesSELECT SUM(price * quantity)FROM PurchaseSELECT SUM(price * quantity)FROM PurchaseWHERE product = ‘bagel’What do these mean?

8. 8Simple AggregationsPurchaseProductDatePriceQuantitybagel10/21120banana10/30.510banana10/10110bagel10/251.5020SELECT SUM(price * quantity)FROM PurchaseWHERE product = ‘bagel’50 (= 1*20 + 1.50*20)

9. 9Grouping and AggregationSELECT product, SUM(price * quantity) AS TotalSalesFROM PurchaseWHERE date > ‘10/1/2005’GROUP BY productLet’s see what this means…Find total sales after 10/1/2005 per product.Purchase(product, date, price, quantity)

10. 10Grouping and Aggregation1. Compute the FROM and WHERE clauses2. Group by the attributes in the GROUP BY3. Compute the SELECT clause: grouped attributes and aggregatesSemantics of the query:

11. 111. Compute the FROM and WHERE clausesProductDatePriceQuantityBagel10/21120Bagel10/251.5020Banana10/30.510Banana10/10110SELECT product, SUM(price*quantity) AS TotalSalesFROM PurchaseWHERE date > ‘10/1/2005’GROUP BY productFROM

12. ProductDatePriceQuantityBagel10/21120Bagel10/251.5020Banana10/30.510Banana10/10110122. Group by the attributes in the GROUP BYSELECT product, SUM(price*quantity) AS TotalSalesFROM PurchaseWHERE date > ‘10/1/2005’GROUP BY productGROUP BY ProductDatePriceQuantityBagel10/2112010/251.5020Banana10/30.51010/10110

13. 133. Compute the SELECT clause: grouped attributes and aggregatesSELECT product, SUM(price*quantity) AS TotalSalesFROM PurchaseWHERE date > ‘10/1/2005’GROUP BY productProductTotalSalesBagel50Banana15SELECTProductDatePriceQuantityBagel10/2112010/251.5020Banana10/30.51010/10110

14. 14HAVING ClauseSame query as before, except that we consider only products that have more than100 buyersHAVING clauses contains conditions on aggregatesSELECT product, SUM(price*quantity)FROM PurchaseWHERE date > ‘10/1/2005’GROUP BY productHAVING SUM(quantity) > 100Whereas WHERE clauses condition on individual tuples…

15. 15General form of Grouping and AggregationS = Can ONLY contain attributes a1,…,ak and/or aggregates over other attributesC1 = is any condition on the attributes in R1,…,RnC2 = is any condition on the aggregate expressionsSELECT SFROM R1,…,RnWHERE C1GROUP BY a1,…,akHAVING C2Why?

16. 16General form of Grouping and AggregationSELECT SFROM R1,…,RnWHERE C1GROUP BY a1,…,akHAVING C2Evaluation steps:Evaluate FROM-WHERE: apply condition C1 on the attributes in R1,…,Rn GROUP BY the attributes a1,…,ak Apply condition C2 to each group (may have aggregates) Compute aggregates in S and return the result

17. 17Group-by v.s. Nested QueryFind authors who wrote ³ 10 documents:Attempt 1: with nested queriesSELECT DISTINCT Author.nameFROM AuthorWHERE COUNT( SELECT Wrote.url FROM Wrote WHERE Author.login = Wrote.login) > 10Author(login, name)Wrote(login, url)This isSQL bya novice

18. 18Group-by v.s. Nested QueryFind all authors who wrote at least 10 documents:Attempt 2: SQL style (with GROUP BY)SELECT Author.nameFROM Author, WroteWHERE Author.login = Wrote.loginGROUP BY Author.nameHAVING COUNT(Wrote.url) > 10No need for DISTINCT: automatically from GROUP BYThis isSQL byan expert

19. Group-by vs. Nested QueryWhich way is more efficient?Attempt #1- With nested: How many times do we do a SFW query over all of the Wrote relations?Attempt #2- With group-by: How about when written this way?With GROUP BY can be much more efficient!

20. 2. Overview of an RDBMS20

21. RDBMS ArchitectureHow does a SQL engine work ?SQL QueryRelational Algebra (RA) PlanOptimized RA PlanExecutionDeclarative query (from user)Translate to relational algebra expressionFind logically equivalent- but more efficient- RA expressionExecute each operator of the optimized plan!

22. Logical vs. Physical OptimizationLogical optimization (we will only see this one):Find equivalent plans that are more efficientIntuition: Minimize # of tuples at each step by changing the order of RA operatorsPhysical optimization:Find algorithm with lowest IO cost to execute our planIntuition: Calculate based on physical parameters (buffer size, etc.) and estimates of data size (histograms)ExecutionSQL QueryRelational Algebra (RA) PlanOptimized RA Plan

23. Recall: Logical Equivalence of RA PlansGiven relations R(A,B) and S(B,C):Here, projection & selection commute: What about here? 

24.  R(A,B)S(B,C)T(C,D)sA<10 SELECT R.A,S.DFROM R,S,TWHERE R.B = S.B AND S.C = T.C AND R.A < 10;R(A,B) S(B,C) T(C,D)Translating to RA

25. Logical OptimizationHeuristically, we want selections and projections to occur as early as possible in the plan Terminology: “push down selections” and “pushing down projections.”Intuition: We will have fewer tuples in a plan.Could fail if the selection condition is very expensive (say runs some image processing algorithm). Projection could be a waste of effort, but more rarely.

26.  R(A,B)S(B,C)T(C,D)sA<10 SELECT R.A,S.DFROM R,S,TWHERE R.B = S.B AND S.C = T.C AND R.A < 10;R(A,B) S(B,C) T(C,D)Optimizing RA PlanPush down selection on A so it occurs earlier

27.  R(A,B)S(B,C)T(C,D) SELECT R.A,S.DFROM R,S,TWHERE R.B = S.B AND S.C = T.C AND R.A < 10;R(A,B) S(B,C) T(C,D)Optimizing RA PlanPush down selection on A so it occurs earlier sA<10

28.  R(A,B)S(B,C)T(C,D) SELECT R.A,S.DFROM R,S,TWHERE R.B = S.B AND S.C = T.C AND R.A < 10;R(A,B) S(B,C) T(C,D)Optimizing RA PlanPush down projection so it occurs earlier sA<10

29.  R(A,B)S(B,C)T(C,D) SELECT R.A,S.DFROM R,S,TWHERE R.B = S.B AND S.C = T.C AND R.A < 10;R(A,B) S(B,C) T(C,D)Optimizing RA PlanWe eliminate B earlier!sA<10 In general, when is an attribute not needed…?

30. 3. Transactions and ACID30

31. Transactions: Basic DefinitionA transaction (“TXN”) is a sequence of one or more operations (reads or writes) which reflects a single real-world transition.START TRANSACTION UPDATE Product SET Price = Price – 1.99 WHERE pname = ‘Gizmo’COMMITIn the real world, a TXN either happened completely or not at all

32. Transactions: Basic DefinitionA transaction (“TXN”) is a sequence of one or more operations (reads or writes) which reflects a single real-world transition.In the real world, a TXN either happened completely or not at allExamples:Transfer money between accountsPurchase a group of productsRegister for a class (either waitlist or allocated)

33. 33Transactions in SQLIn “ad-hoc” SQL:Default: each statement = one transactionIn a program, multiple statements can be grouped together as a transaction:START TRANSACTION UPDATE Bank SET amount = amount – 100 WHERE name = ‘Bob’ UPDATE Bank SET amount = amount + 100 WHERE name = ‘Joe’COMMIT

34. 34Transaction Properties: ACIDAtomicState shows either all the effects of txn, or none of themConsistentTxn moves from a state where integrity holds, to another where integrity holdsIsolatedEffect of txns is the same as txns running one after another (ie looks like batch mode)DurableOnce a txn has committed, its effects remain in the databaseACID continues to be a source of great debate!

35. 35ACID: AtomicityTXN’s activities are atomic: all or nothingIntuitively: in the real world, a transaction is something that would either occur completely or not at allTwo possible outcomes for a TXNIt commits: all the changes are madeIt aborts: no changes are made

36. TransactionsA key concept is the transaction (TXN): an atomic sequence of db actions (reads/writes)Atomicity: An action either completes entirely or not at all36 AcctBalancea1020,000a2015,000 AcctBalancea1017,000a2018,000Transfer $3k from a10 to a20:Debit $3k from a10Credit $3k to a20Crash before 1,After 1 but before 2, After 2.Written naively, in which states is atomicity preserved?DB Always preserves atomicity!

37. 37ACID: ConsistencyThe tables must always satisfy user-specified integrity constraintsExamples:Account number is uniqueStock amount can’t be negativeSum of debits and of credits is 0How consistency is achieved:Programmer makes sure a txn takes a consistent state to a consistent stateSystem makes sure that the txn is atomic

38. 38ACID: IsolationA transaction executes concurrently with other transactionsIsolation: the effect is as if each transaction executes in isolation of the others.E.g. Should not be able to observe changes from other transactions during the run

39. Challenge: Scheduling Concurrent TransactionsThe DBMS ensures that the execution of {T1,…,Tn} is equivalent to some serial executionOne way to accomplish this: LockingBefore reading or writing, transaction requires a lock from DBMS, holds until the endKey Idea: If Ti wants to write to an item x and Tj wants to read x, then Ti, Tj conflict. Solution via locking:only one winner gets the lockloser is blocked (waits) until winner finishesA set of TXNs is isolated if their effect is as if all were executed serially39What if Ti and Tj need X and Y, and Ti asks for X before Tj, and Tj asks for Y before Ti?-> Deadlock! One is aborted…All concurrency issues handled by the DBMS…

40. 40ACID: DurabilityThe effect of a TXN must continue to exist (“persist”) after the TXNAnd after the whole program has terminatedAnd even if there are power failures, crashes, etc.And etc…Means: Write data to disk

41. Ensuring Atomicity & DurabilityDBMS ensures atomicity even if a TXN crashes!One way to accomplish this: Write-ahead logging (WAL)Key Idea: Keep a log of all the writes done.After a crash, the partially executed TXNs are undone using the logWrite-ahead Logging (WAL): Before any action is finalized, a corresponding log entry is forced to disk41We assume that the log is on “stable” storageAll atomicity issues also handled by the DBMS…

42. Challenges for ACID propertiesIn spite of failures: Power failures, but not media failuresUsers may abort the program: need to “rollback the changes”Need to log what happenedMany users executing concurrentlyCan be solved via locking (we’ll see this next lecture!)And all this with… Performance!!

43. A Note: ACID is contentious!Many debates over ACID, both historically and currentlyMany newer “NoSQL” DBMSs relax ACIDIn turn, now “NewSQL” reintroduces ACID compliance to NoSQL-style DBMSs…ACID is an extremely important & successful paradigm, but still debated!

44. Summary of DBMSDBMS are used to maintain, query, and manage large datasets.Provide concurrency, recovery from crashes, quick application development, integrity, and securityKey abstractions give data independenceDBMS R&D is one of the broadest fields in CS. Fact! 44