Database Software -  Assignment 6

Due Frday, 23 February 2018

Let's consider Hashing

1. We have seen that hashing can be used for indexing, since we can determine from the primary key which "bucket" a tuple will be stored in. If we choose the number of buckets so that each  usually fits in one block, then only one block needs to be read or written for a given value of the key.

Let's suppose that 11 tuples of some relation can fit into a block, and we expect to have up to 40,000 of them. How many "buckets" would you recommend? (remember that the number of tuples in each bucket will vary somewhat).


2. Also, hashing can be used for a (2-pass) join R ⋈ S, where M buffers are available, and (for the smaller relation S) M < B(S) < M2, we can use hashing on the key in a 2-pass join. In this case, if the relation is already "hashed" into buckets for indexing, do we need to "re-hash" the relation into a different number of buckets? If so, how many? Let's try this for:
M = 105 (100 + 5 "extra" just in case)
B(S) = 4900, and we assume that each block can hold many tuples, and averages 90% full.
B(R) is not limited, but if you want a figure, lets say B(R) = 20,000

  1. For the" index", (looking up records,) how many buckets should we have. (in other words, what is the MOD we use after computing the hash function)
  2. For a 2-pass hash join using primary key, how many buckets do we need? (there is probably a range of acceptable values)
  3. If either R or S is stored by hashing, as in question 1., can we use the buckets of the index in the first pass of the hash join?
  4. can we use the same hash function?
  5. In the second pass, we do a one-pass join on the individual buckets. Does this need to be done using hashing? If we do, can we use the same hash function, or must we use a different one? (If you are not sure, it is prudent to use a different one.)
  6. Discribe pass 1 of the hash join. How are the M buffers used?
  7. Discribe pass 2 of the hash join. Once all the buckets have been joined, what is left to do?