CS 307, Assignment 6

Due as specified in course webpage.

Submit on paper, as I expect you to draw some trees. Please note that the Science Secretary leaves at noon on Fridays. You will need to run EXPLAIN on 4 queries, and print them out.

Query plans using EXPLAIN

explain select .... will show you the query plan for any query. From it you can deduce the relational algebra "tree", the first operations will be at the bottom, and the last one at the top of the indented list. It will also show you what algorithm is chosen to do something.
Please note how joins are processed. Also note whether indexes are used, where they exist. (eg. Battles has an index on the primary key).
"Scan" means get rows from a table. Sequential (seq) is the fastest, Index scan will get the rows in order by primary key. (An index is quite useful for a very large table, but is unlikely to be chosen as our databases are relatively small.)
Finally, there is an estimate of "cost" on each line. The top line will have total cost. (Projection operations are not explicitly shown, though they affect the "width" of results.)

What to do:

For each explain of a query, first derive the relational algebra tree that is used, and draw it, showing the joins ("Join Filter" is the join condition), and where the selections are placed ("Filter"). The order of joins is important to note, and also the algorithm chosen. The query planner tries to estimate how many rows will result from each operation, in order to figure "cost." Please comment on the "cost" estimates you observe.

Here are two queries, explained, from the ships database. Both have the same result, the second does the useless join with battles defined in the view OB (as a number of you did.)

ships=# select a.ship, a.battle, b.battle
from outcomes a, outcomes b
where a.ship=b.ship
  and a.battle < b.battle
;
      ship       |     battle      |     battle      
-----------------+-----------------+-----------------
 California      | Guadalcanal     | Surigao Strait 
 Duke of York    | North Cape      | Surigao Strait 
 Petri           | Denmark Strait  | North Cape     
 Yorktown        | Guadalcanal     | MidwayIsland   
(4 rows)

ships=# explain select a.ship, a.battle, b.battle
from outcomes a, outcomes b
where a.ship=b.ship
  and a.battle < b.battle
;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Merge Join  (cost=143.69..241.66 rows=1768 width=57)
   Merge Cond: (a.ship = b.ship)
   Join Filter: (a.battle < b.battle)
   ->  Sort  (cost=71.84..74.42 rows=1030 width=38)
         Sort Key: a.ship
         ->  Seq Scan on outcomes a  (cost=0.00..20.30 rows=1030 width=38)
   ->  Sort  (cost=71.84..74.42 rows=1030 width=38)
         Sort Key: b.ship
         ->  Seq Scan on outcomes b  (cost=0.00..20.30 rows=1030 width=38)
(9 rows)

ships=# explain select a.ship, a.battle, b.battle
from ob a, ob b
where a.ship=b.ship
  and a.battle < b.battle
;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Hash Join  (cost=26.97..51.59 rows=1 width=57)
   Hash Cond: (public.outcomes.ship = public.outcomes.ship)
   Join Filter: (public.outcomes.battle < public.outcomes.battle)
   ->  Hash Join  (cost=1.11..25.54 rows=26 width=38)
         Hash Cond: (public.outcomes.battle = public.battles.name)
         ->  Seq Scan on outcomes  (cost=0.00..20.30 rows=1030 width=38)
         ->  Hash  (cost=1.05..1.05 rows=5 width=19)
               ->  Seq Scan on battles  (cost=0.00..1.05 rows=5 width=19)
   ->  Hash  (cost=25.54..25.54 rows=26 width=38)
         ->  Hash Join  (cost=1.11..25.54 rows=26 width=38)
               Hash Cond: (public.outcomes.battle = public.battles.name)
               ->  Seq Scan on outcomes  (cost=0.00..20.30 rows=1030 width=38)
               ->  Hash  (cost=1.05..1.05 rows=5 width=19)
                     ->  Seq Scan on battles  (cost=0.00..1.05 rows=5 width=19)
(14 rows)

Please also comment on the plan, which seems to be based upon wild estimates.
PostgreSQL has a command, VACUUM ANALYZE, that can be run to clean up a database, and optionally analyze it, which should give the query planner more information about things like the number of rows to expect.

After running this command, here are the two new plans: (If the tree remains the same for each query, you can just say so, but if it changes, draw the new one.)

ships=# vacuum analyze;
VACUUM
ships=# explain select a.ship, a.battle, b.battle
from outcomes a, outcomes b
where a.ship=b.ship
  and a.battle < b.battle
;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Hash Join  (cost=1.54..3.30 rows=11 width=48)
   Hash Cond: (a.ship = b.ship)
   Join Filter: (a.battle < b.battle)
   ->  Seq Scan on outcomes a  (cost=0.00..1.24 rows=24 width=32)
   ->  Hash  (cost=1.24..1.24 rows=24 width=32)
         ->  Seq Scan on outcomes b  (cost=0.00..1.24 rows=24 width=32)
(6 rows)

ships=# explain select a.ship, a.battle, b.battle
from ob a, ob b
where a.ship=b.ship
  and a.battle < b.battle
;
                                    QUERY PLAN                                  
  
--------------------------------------------------------------------------------
--
 Hash Join  (cost=3.76..5.77 rows=8 width=48)
   Hash Cond: (public.outcomes.battle = public.battles.name)
   ->  Hash Join  (cost=2.65..4.54 rows=9 width=48)
         Hash Cond: (public.outcomes.battle = public.battles.name)
         ->  Hash Join  (cost=1.54..3.30 rows=11 width=48)
               Hash Cond: (public.outcomes.ship = public.outcomes.ship)
               Join Filter: (public.outcomes.battle < public.outcomes.battle)
               ->  Seq Scan on outcomes  (cost=0.00..1.24 rows=24 width=32)
               ->  Hash  (cost=1.24..1.24 rows=24 width=32)
                     ->  Seq Scan on outcomes  (cost=0.00..1.24 rows=24 width=32
)
         ->  Hash  (cost=1.05..1.05 rows=5 width=16)
               ->  Seq Scan on battles  (cost=0.00..1.05 rows=5 width=16)
   ->  Hash  (cost=1.05..1.05 rows=5 width=16)
         ->  Seq Scan on battles  (cost=0.00..1.05 rows=5 width=16)
(14 rows)

Join or subquery? -- Timetable database.

Four done, four more to go. For the next two, you are more on your own. Do the EXPAIN in a terminal yourself, and interpret as before.

select students.* from        
students natural join enroll
where course='CSC101A01';

select * from        
students where linux in (select linux from  enroll
where course='CSC101A01');

Your own database

Finally, explain, and comment on the plan, for two queries you use in your own database. If you have a view defined involving a join, that would be a good one to do.