Review of Database Software Design
Wednesday, 13 April
Disks
(Chapter 13)
- Mechanics: read/write of whole blocks. Timing: seek, rotational
latency, transfer time (a worked example follows this list)
- Disk I/O is the dominant "cost" of computation:
- Speeding it up: Organizing data by cylinders, using multiple
disks, elevator algorithm
- Failures: checksums, stable storage (mirroring), and RAID
- Blocks containing records - fixed length or variable length. Headers
are needed to describe what's there.
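As a rough illustration of the timing components above, here is a small calculation sketch in Python; all the drive parameters (seek time, rotation speed, transfer rate, block size) are made-up "typical" figures, not numbers from the chapter.

    # Rough estimate of the time to read one random block from disk.
    # All drive parameters are assumed "typical" values for illustration.
    avg_seek_ms = 9.0                  # average seek time
    rpm = 7200                         # rotation speed
    transfer_mb_per_s = 100.0          # sustained transfer rate
    block_kb = 16                      # block size

    rotational_latency_ms = 0.5 * (60_000 / rpm)        # half a rotation on average
    transfer_ms = (block_kb / 1024) / transfer_mb_per_s * 1000

    total_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
    print(f"seek {avg_seek_ms:.2f} + rotation {rotational_latency_ms:.2f} "
          f"+ transfer {transfer_ms:.2f} = {total_ms:.2f} ms per block")

The point of working it out: seek and rotational latency dwarf the transfer time, which is why organizing data by cylinders (so consecutive blocks need no new seek) and the elevator algorithm pay off.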
Indexes
- B+ tree: Disk structure
for rapid lookup. A relation can have both primary and secondary
(possibly non-unique) indexes. Useful to fetch records (tuples) in
order, and for range queries.
- Hashing into buckets
(blocks) as primary storage technique. No "data structure" on disk;
each record is simply stored in a "bucket" of one or more blocks,
determined by "hashing" its primary key. To find a record, hash the
sought key and read that bucket -- often just one block read! Useless
for ordering or range queries, and you cannot also hash the relation some
other way (see the sketch after this list).
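A minimal sketch of hashing into buckets, with Python lists standing in for the blocks of each bucket; the bucket count and record layout are arbitrary choices for illustration.

    # Each record is stored in the bucket chosen by hashing its primary key.
    # In a real system a bucket is one or more disk blocks; a list stands in here.
    NUM_BUCKETS = 8
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def bucket_of(key):
        return hash(key) % NUM_BUCKETS     # hash the key to pick a bucket

    def insert(record):                    # record is (key, data)
        buckets[bucket_of(record[0])].append(record)

    def lookup(key):
        # read just that one bucket -- often a single block read
        return [r for r in buckets[bucket_of(key)] if r[0] == key]

    insert((42, "alice"))
    insert((7, "bob"))
    print(lookup(42))                      # [(42, 'alice')]

Nothing here helps with ordering or a range query such as keys between 10 and 50: the matching records are scattered across buckets, which is exactly the limitation noted above.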
Query Execution: scanning and algorithms
(Chapter 15) Cost measured in the number of block I/Os required
- Scanning tables: sequential, or using index.
- There is a trade-off here: depending on how many records fit in
a block, it may be quicker, for example, to scan the whole table and sort
than to use the index.
- One-pass algorithms
- Tuple at a time, for example, selection
- Grouping
- Binary operations, such as join, cost is B(R) + B(S)
- Number of buffers needed -- for join, for example, can the smaller
relation fit in memory? This places limits on the size of the inputs.
- Nested-loop join - no size limit; cost is B(S) + B(S)B(R)/M (see the
sketch after this list)
- Two-pass algorithms
- Generally, cost is 3(B(R) + B(S))
- using sorting
- Multiway merge-sort - be able to describe it; cost: read, write,
and read again of everything
- Sort-join
- using Hashing, size limit on smaller relation only
- index-based: selection and join using an index; zig-zag join, using 2
B-trees
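To see where the B(S) + B(S)B(R)/M figure for nested-loop join comes from, here is a sketch that counts block I/Os; the relations are represented as lists of blocks, and the join-on-the-first-attribute convention and all names are assumptions for illustration. (The code uses M-1 buffers for S, the precise form; the formula above treats M-1 as roughly M.)

    # Block-based nested-loop join, counting block reads.
    # R and S are lists of blocks (each block a list of tuples); S is smaller.
    def nested_loop_join(R, S, M):
        io, result = 0, []
        for i in range(0, len(S), M - 1):           # read S in chunks of M-1 blocks
            chunk = S[i:i + M - 1]
            io += len(chunk)
            s_tuples = [t for block in chunk for t in block]
            for r_block in R:                       # read all of R for each chunk of S
                io += 1
                for r in r_block:
                    for s in s_tuples:
                        if r[0] == s[0]:            # join on the first attribute
                            result.append(r + s[1:])
        return result, io

    # B(S)=4, B(R)=10, M=3  ->  about B(S) + B(S)*B(R)/(M-1) = 4 + 20 = 24 I/Os
    S = [[(k, "s%d" % k)] for k in range(4)]
    R = [[(k % 4, "r%d" % k)] for k in range(10)]
    joined, io = nested_loop_join(R, S, 3)
    print(io, len(joined))                          # 24 I/Os, 10 result tuples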
Query compiler
(Chapter 16)
- select-from-where to relational algebra (tree)
- Project on SELECT list, select on conditions in WHERE clause,
cross-product of FROM list
- Pushing operations up or down the tree, selection in particular
- If the conditions are connected by AND, they can be done in any
order.
- Involving one attribute, push selection down to the
relation(s) containing it
- Involving [equality of] attributes in 2 relations, convert the
cross-product to a join ("join condition")
- cost estimation
- At this stage in the logical query plan, we want to estimate
the size, in tuples, of the result and intermediate stages. This will
be based on statistics of T(R), and V(R,a) -- number of distinct values of a in R.
(For example, if we want a particular value of a, we would expect to
select T(R)/V(R,a) tuples; see the sketch after this list.)
- Best join order: minimize intermediate results
- We should choose the order of the various operations, joins in
particular, to minimize the estimated size of intermediate results. (The
size of the final result will be the same however we arrange the plan.)
- If there are several joins, this is non-trivial, as there are
many possibilities. A simple strategy is to use a left-deep tree, and
pick as the first join the pair with the smallest result.
- Note that most algorithms are asymmetric, with the build relation on the left (put
into M) and the probe
relation on the right (read a tuple at a time, or use index, if
available)
- choosing algorithms
- Finally, choose the best algorithm for each operation, starting
with scanning of tables
- Passing intermediate results
- pipelining
- storing in Memory buffers
- materializing = writing blocks to disk
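Returning to the cost-estimation step referenced above, here is a small sketch of the two textbook size estimates applied to made-up statistics; the relation names, attributes, and numbers are all invented for illustration.

    # T[r]: number of tuples in r.  V[(r, a)]: number of distinct values of a in r.
    T = {"R": 10_000, "S": 5_000, "U": 2_000}
    V = {("R", "a"): 50, ("S", "a"): 100, ("S", "b"): 200, ("U", "b"): 40}

    def select_size(r, a):
        # estimated size of selecting one particular value of a:  T(R)/V(R,a)
        return T[r] / V[(r, a)]

    def join_size(r, s, a):
        # estimated size of a natural join on common attribute a:
        # T(R) * T(S) / max(V(R,a), V(S,a))
        return T[r] * T[s] / max(V[(r, a)], V[(s, a)])

    print(select_size("R", "a"))       # 10000 / 50 = 200 tuples expected

    # Left-deep strategy: pick as the first join the pair with the smallest estimate.
    candidates = {("R", "S"): join_size("R", "S", "a"),
                  ("S", "U"): join_size("S", "U", "b")}
    print(min(candidates, key=candidates.get))      # ('S', 'U'): 50,000 vs 500,000

The greedy "smallest first join" choice matches the simple left-deep strategy described above; the estimates only guide the plan, since the final result size is fixed.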
Logging and Recovery
(Chapter 17)
- Transactions are the basic units that maintain consistency of a
database. In case of power failure, we must recover the latest
consistent state from what we find on disk.
- Accordingly, it is important to write on disk a log of all the
changes made by each transaction, and, when a transaction commits,
write that fact on the log and flush
the log to disk.
- UNDO logging
- When all changes by T have reached disk, then write <commit
T> on log, then flush the log
- To recover after a crash, scan the log backwards and undo all
changes by uncommitted transactions (a sketch follows this list)
- REDO logging
- Write <commit T> on the log and flush the log to disk before
writing any of T's changes to disk; then write them as soon as
convenient.
- To recover after a crash, scan the log backwards to identify
the committed transactions, then forwards to redo all the changes made
by them.
- (There is also UNDO-REDO logging, which is flexible in when data
goes to disk, at the expense of writing both old and new values in the
log.)
- A checkpoint can limit
the length of the log required to be scanned, and kept.
- an archive is a dump of
the whole database, necessary in case of catastrophic failure. Best
kept off-site, and the log also. In fact, the log is supposed to be on
"stable storage", with the redundancy provided by transmitting it to a
secure location.
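A minimal sketch of UNDO-log recovery, as referenced above: the log and the disk state are just Python data here, and the record layout is an invented stand-in for the chapter's <START T>, <T, X, old value>, <COMMIT T> notation.

    # Recovery with an UNDO log: scan backwards, restoring the old value for
    # every change made by a transaction with no <commit> record in the log.
    log = [
        ("start", "T1"), ("write", "T1", "A", 5),    # ("write", T, element, old value)
        ("start", "T2"), ("write", "T2", "B", 10),
        ("commit", "T1"),
        ("write", "T2", "C", 7),
        # crash happens here -- T2 never committed
    ]
    disk = {"A": 50, "B": 99, "C": 77}               # state found on disk after the crash

    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in reversed(log):                        # scan the log backwards
        if rec[0] == "write" and rec[1] not in committed:
            _, _txn, element, old_value = rec
            disk[element] = old_value                # undo the uncommitted change
    print(disk)    # {'A': 50, 'B': 10, 'C': 7}: T2 is undone, T1's commit stands

REDO recovery is the mirror image: identify the committed transactions, then scan forwards and re-apply their new values, which is why a redo log must record new values rather than old ones.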
Concurrency control - locking, etc. - exam will be limited to locking
Why? A transaction should, in isolation, maintain the database in a
consistent state. Even a read-only transaction should "see" a
consistent state. If transactions were executed one at a time
(serially) there would be no problems. But concurrency is good, so we
need to avoid conflict between transactions.
- Read-only transactions will not cause inconsistency, but probably
want to "see" a consistent state.
- Transactions that write must avoid conflict with other
concurrent transactions. There are 2 problems where we cannot say that
one transaction "came before" another:
- The same element is read by one transaction and written by
another
- The same element is written by two different transactions
- Something must be done either to prevent these "race conditions"
from occurring, or to abort one of the transactions involved. (It will
have to be restarted.) A toy example of a lost update follows this list.
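To make the write conflicts above concrete, here is the classic "lost update", written as plain Python rather than any real DBMS interface; the interleaving is the whole point.

    # Two transactions each read X, add 10, and write it back.
    # With this interleaving, one of the updates is lost.
    X = 100

    t1_local = X             # T1 reads X = 100
    t2_local = X             # T2 reads X = 100, before T1 has written
    X = t1_local + 10        # T1 writes 110
    X = t2_local + 10        # T2 writes 110 -- T1's update is lost
    print(X)                 # 110, not the 120 a serial schedule would give

Locking, discussed next, prevents this by making T2 wait until T1 has released its lock on X.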
Locking
The most common method of concurrency control is to have each
transaction lock the elements it plans to read or write. A lock manager
will either grant the lock request, or delay the transaction until the
conflicting lock held by another transaction has been released.
- 2-phase locking: Phase 1: acquire all the locks you will need; after
that you may start releasing locks (see the lock-table sketch after the
lock-modes list below)
- All locks must be released - in fact, this is not up to the
programmer, but the system!
- No locks may be requested after the first unlock, lest another
transaction get "in the middle"
Different lock modes
- Readers do not conflict with each other, so they can have "shared
locks." These lock out any writers.
- Writers must request "exclusive locks" since they need to avoid
conflict with other writers.
- If there are many active readers, a writer might have to wait a
long time. (starvation)
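Here is a minimal lock-table sketch combining the shared/exclusive modes above with the all-at-once release of 2-phase locking; the class name, method names, and the "False means wait" convention are invented for illustration, and no waiting queue or deadlock handling is included.

    # Shared/exclusive lock table: readers share, a writer excludes everyone.
    class LockTable:
        def __init__(self):
            self.locks = {}                     # element -> (mode, set of holders)

        def request(self, txn, element, mode):  # mode is "S" (shared) or "X" (exclusive)
            held = self.locks.get(element)
            if held is None:
                self.locks[element] = (mode, {txn})
                return True                     # granted
            held_mode, holders = held
            if mode == "S" and held_mode == "S":
                holders.add(txn)                # readers do not conflict with each other
                return True
            if holders == {txn}:                # sole holder may upgrade S -> X
                self.locks[element] = ("X" if mode == "X" else held_mode, holders)
                return True
            return False                        # conflict: the transaction must wait

        def release_all(self, txn):             # phase 2 of 2PL: give everything back
            for element in list(self.locks):
                _, holders = self.locks[element]
                holders.discard(txn)
                if not holders:
                    del self.locks[element]

    lt = LockTable()
    print(lt.request("T1", "A", "S"), lt.request("T2", "A", "S"))   # True True: shared
    print(lt.request("T3", "A", "X"))                               # False: writer waits
    lt.release_all("T1"); lt.release_all("T2")
    print(lt.request("T3", "A", "X"))                               # True: now granted

This also shows the starvation point: as long as new readers keep arriving and sharing the lock, T3's exclusive request keeps being refused.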
Optimistic methods (not on exam)
- Timestamps: Each
transaction is given a unique timestamp. It is allowed to run, unless
one of two things occurs:
- Late read: It attempts
to read an element already written by a transaction with a later
timestamp. Possibly the database can supply the proper previous value,
else it must abort.
- Late write: It
attempts to write an element already read by a transaction with a later
timestamp. In this case it must abort.
- Validation: Transactions
have a read phase and a write phase. We know the read set(RS) and write
set(WS) of each transaction that is concurrently active.
- During the read phase they do all their reading and prepare
what they want to write
- Validation: the transaction is ready to commit, but it must abort if its RS or WS overlaps with
the WS of any previously validated recent transaction.
- It now does its writing.
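A sketch of the validation test under stated assumptions: we keep the write set of each recently validated transaction, and a new transaction passes only if neither its read set nor its write set intersects any of them. The data structures and names are invented for illustration.

    # Validation-based ("optimistic") concurrency control.
    validated = [                                   # (txn, write set) of recent validated txns
        ("T1", {"A", "B"}),
        ("T2", {"C"}),
    ]

    def validate(txn, read_set, write_set):
        for other, other_ws in validated:
            if read_set & other_ws or write_set & other_ws:
                return False                        # overlap: abort and restart
        validated.append((txn, write_set))          # passed: now it may do its writing
        return True

    print(validate("T3", {"D"}, {"E"}))             # True: no overlap
    print(validate("T4", {"A"}, {"F"}))             # False: it read A, which T1 wrote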
Dirty reads, Deadlocks - Ahh, transactions
- a "Dirty Read" is a transaction reading a value written by an
incomplete transaction. Since that transaction may still abort for any
number of reasons, the "dirty" value becomes invalid, and should not
have been read. This can lead to "cascading rollback."
- To avoid the problem:
- Strict locking: No locks may be released until a
transaction is surely going to commit.
- Timestamps: Reader must be delayed until other transaction
commits or aborts (and this could lead to deadlock)
- Validation: The problem does not occur, since no writing is done
until the commit decision has been made.
- Deadlock:
- So, course evaluator "Ben," new on the job, was following the
instruction to "wait for Prof. Jensen to sign a paper" before
conducting the course evaluation. Meanwhile, Prof. Jensen was waiting
for the evaluator to emerge from the room before entering, as custom
requires. Being humans, we eventually detected this deadlock.
- Any time a computer system delays something, such as a lock
request, there is a potential for deadlock, and this must be detected,
either by timeout (I got tired of waiting) or by constructing a "wait-for"
graph and searching it for cycles (a sketch follows this list). If a cycle
is detected, one transaction must be aborted.
- In a locking scheme, if the elements can be ordered (say, by
disk address) and locks requested in that order, deadlocks are
avoided. -- some of you did that on the assignment!
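A sketch of deadlock detection with a wait-for graph, as mentioned above: an edge T -> U means "T is waiting for a lock held by U", and a cycle means deadlock, so some transaction on the cycle must be aborted. The example graph and function names are invented for illustration.

    # Depth-first search for a cycle in the wait-for graph.
    waits_for = {
        "T1": ["T2"],
        "T2": ["T3"],
        "T3": ["T1"],      # T3 waits for T1, closing the cycle
        "T4": [],
    }

    def has_cycle(graph):
        visiting, done = set(), set()

        def dfs(node):
            if node in visiting:
                return True                   # back edge: a cycle exists
            if node in done:
                return False
            visiting.add(node)
            if any(dfs(nxt) for nxt in graph.get(node, [])):
                return True
            visiting.remove(node)
            done.add(node)
            return False

        return any(dfs(t) for t in graph)

    print(has_cycle(waits_for))               # True: T1 -> T2 -> T3 -> T1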
Lin Jensen