CS 457/557 Assignment 4 - Algorithms

Due Friday, 9 February 2018

Suppose a Student relation at Bishop's consists of T=4000 tuples, and occupies B=400 blocks. There are M=100 memory blocks available. There are B+ trees on both id and name. We want a sorted list under the following conditions. Should we use a B+ index, or a sequential scan, followed by a sort? (As we shall see, a 2 phase sort requires 3B block I/O s: read, write sublists, read back.)
Give the cost in Block reads for each method you consider.

We want the whole list, sorted by name. There is an index on name, but the blocks are NOT in name order. (Note that we can read 100 blocks into memory, at which point there is a 25% chance that the next tuple will already be in memory.)
We just want students named 'Wang, Wei'
We want only first year students, in ID order (so, ID > 2220000 or some such number) and we have statistics that this will be less than 25% of the tuples. Does your answer depend on whether the blocks are in ID order?
(Graduate Students) In general, if we want the whole relation sorted, as in (a), for what ratio for tuples per block, T/B would using a B+ index be advantageous?

There may be occasions where an index might be useful in computing a join. Consider the following: A professor might want a list of the students in her class. As well as T(enroll), We know V(enroll, coursecode), [in fact, this is <= T(courses)]. This will give us an estimate of how many students might be in a class. Let's estimate average class size for Bishop's as 40. At this point, we have a smaller relation, and wish to join it with the student relation of question 1. Should we use an index? Assume a 2-level B+ tree, with the root block kept in memory. (So to access a single tuple takes 2 block reads: one for the leaf node, + 1 for the data).
In fact, we only need a semi-join, the student fields for the matching rows, as in
```
select * from students where id IN
	(select id from enroll 
	 where course = 'MAT123')
```
(a) Describe an algorithm for a set intersection operator R ∩ S, where S is "smaller" and B(S) < M, so a search structure for S will fit into M blocks of memory. Assume that R, at least, is a set (has no duplicates). How much blocking is involved in your algorithm? That is, when can the first tuple be output?
3.(b) If R might have duplicates, how could we keep them out of the result?
An "outer join" of 2 relations includes also tuples of one or both relations that "do not match", the output being the "dangling" tuple joined with nulls for the other relation. Describe an algorithm for R natural right outer join S where S fits in main memory (B(S) < M). When can the first joined tuple be output? When are the dangling tuples from S output?
(Graduate Students) A full outer join, assuming S fits in memory. As above, but when do the dangling tuples from R appear in the output?