Unit6
Unit6
Lesson 14
Indexing and B-Tree Family
Objectives:
- Understand types of indexing.
- Implement multilevel indexing.
- Learn about the B-tree family of data structures.
What are the indexes?
- Data structures used to speed up query/data retrieval.
- Without indexes, full table scans are needed (very slow for large data).
Types of indexes:
- Ordered Indexes
- Store search keys in sorted order.
- Clustering Index: data sorted same as index.
- Non-Clustering Index: data not sorted same as index. Subtypes:
- Dense: one index entry for each record.
Sparse: one entry per block or few records (requires clustering).
- Hashed Indexes
- Use a hash function to map search keys to buckets.
- Fast for equality searches, but not good for range queries.
Multilevel Indexing
- Used when index is too large for memory. Involves:
- Inner index (on disk).
- Outer index (in memory) pointing to inner index.
Secondary Indexes
- Built on non-primary key attributes.
- Useful when frequent queries are made on non-key fields.
B-Tree and B+ Tree
- B-Tree: Balanced search tree; stores keys and data in all nodes. B+ Tree:
Hashing and Hash Indexes
- Uses buckets and chaining to resolve collisions.
- Ideal for exact matches, not for range queries. Static vs Dynamic Hashing:
- Static: fixed number of buckets.
- Dynamic: buckets grow dynamically (e.g., linear hashing).
Conclusion:
- Indexing is essential for performance in databases.
- Choose index type (ordered vs hashed) based on query needs.
- B+ Trees are widely used due to balance and support for ranges.
Lesson 15
Query Processing and Optimization
Objectives:
- Understand the process of query handling.
- Learn how to evaluate and optimize queries.
- Calculate query evaluation costs.
Query Processing Steps
- Parsing & Translation
- SQL → relational algebra (logical plan).
- Evaluation
- Choose and execute physical plan.
- Optimization
- Select most efficient plan based on cost.
Evaluation Techniques
- Relational Algebra: Used for internal query representation.
- Query Plan: Specifies how to evaluate the query (includes access paths, algorithms).
- Pipelining: Pass output of one operation directly to next without storing.
- Materialization: Store intermediate results in temporary tables.
Query Cost Factors:
- Disk I/O (dominant cost in older systems).
- CPU cost (important in memory/SSD systems).
- Buffer availability, data layout, and parallelism also affect cost.
Join Algorithms
- Nested-Loop Join: Compare every tuple.
- Block Nested-Loop: Compare block-wise (less costly).
- Indexed Join: Use index for faster lookups.
- Merge Join: Efficient if inputs are sorted.
Other Operations
- Sorting: Done with external sort-merge for large datasets.
- Duplicate Elimination: Use hashing or sorting.
- Projection, Set Operations, Aggregation, and Outer Join: Have cost implications based on method used.
Pipelining vs Materialization
- Pipelining:
- Lower cost.
- Used when intermediate data doesn’t need to be stored.
- Materialization:
- Necessary for operations like sort or when multiple passes are needed.
Memory & Cache Optimization
- Cache-conscious algorithms improve performance by reducing CPU cache misses.
- Query compilation (instead of interpretation) speeds up execution.
Conclusion:
- Efficient query processing relies on good optimization.
- Cost models guide the optimizer in choosing the best plan.
- Understanding evaluation strategies and join algorithms helps improve query performance.
This post is licensed under CC BY 4.0 by the author.