How come the computer doesn’t have to read the entire table when the column is indexed? [duplicate]

Let’s say a table with two columns has 100 quadrillion records. And I want to find a record that has column #2 equal something.

If column #2 is indexed it returns the result immediately, but if it’s not the computer has to read the entire table so it takes a long time.

How come the computer doesn’t have to read the entire table when the column is indexed? How does it know what the result is without reading the table?

The “index” of a book is not a great metaphor, IMO. A better metaphor is trying to look up a word in a dictionary. Imagine that you want to look up a word in the Oxford English Dictionary (OED), which is a massive dictionary that comprises multiple volumes.

The words in the OED are sorted alphabetically, of course, to make it easy to look up a word. But what if the words were not sorted? What if they were added to the dictionary in random order?

Let’s do a thought experiment where we want to look up the word “quadrillion” in this hypothetical, unsorted dictionary, and compare that to looking up the same word in a normal, sorted dictionary.

Let’s assume the dictionary has 10,000 pages and there is exactly 1 word on each page.

Unsorted Dictionary

To find the word “quadrillion” in a dictionary that isn’t alphabetically sorted, you would need to start at one end (say page 1). If the word on page 1 is “quadrillion”, then congratulations, you’re done! Of course, since the dictionary is random, odds are that “quadrillion” is not on the first page, which means you need to check the next page, and then the page after that, etc. until you find “quadrillion”.

How many pages do you need to look at (in the worst case) to find the word “quadrillion”? Well, potentially all of them… “quadrillion” could by chance be the very last word in the dictionary, so you’d need to look at all 10,000 pages in the worst case scenario.

This would be an excruciatingly tedious task, and its analogous to a database looking at every single row of a table.

Sorted Dictionary

Fortunately, dictionaries are sorted alphabetically, which is why we can look up a word in minutes, not weeks. We all know how to find a word in the dictionary, but pretend for a moment that you had to write an algorithm to do it. How would that algorithm work?

Open the dictionary up to the middle page (5,000) and read the word on that page: “mellifluous”.
This word is higher (alphabetically) than quadrillion, so we know that quadrillion must be somewhere between pages 5,001 and 10,000.
Next, we go to to page 7,500, which is the midpoint of 5,001 and 10,000.
The word on this page is “sacrosanct”. This is lower than “quadrillion”, so now we know quadrillion is between pages 5,001 and 7,500.
Now we flip to 6,250, which is the midpoint of 5,001 and 7,500.
…
We can repeat this process of dividing the remaining pages in half and figuring out which half the word is in until, eventually, we’ll open to the page that contains “quadrillion”, and we’re done!

How efficient is this compared to the unsorted dictionary algorithm? We still probably had to look at a lot of pages, right?

We can compute the efficiency by noticing that at each step we are eliminating half of the pages. So by counting the number of pages that are left, we can count how many steps it takes to get to any word:

Start with 10,000 pages.
Narrow it down to 5,000 pages.
Then 2,500.
1,250
625
312 (we’ll round down if there are any decimal places)
156
78
39
19
9
4
2
1

By the time we get to 1 page remaining, we know with certainty that we’ve either found the word we were looking for, or it doesn’t exist in the dictionary. We can count the number of steps and see that we only had to look at 14 pages — in the worst case — to find any word. That’s a huge improvement over looking at all 10,000 pages!

More generally, the worst-case lookup time is log2(n), where n is the number of pages in the dictionary. (Try computing it yourself to see if your answer agrees with mine.) This is a very desirable property for an algorithm, because the log function grows very slowly. If the dictionary had a billion pages instead of 10k, it would still only take 30 steps to find any given word in it! (Again, try computing it yourself.)

This is analogous to the database using an index to find a row in a table.

B-Tree

A B-tree is a data structure that allows us to use an algorithm similar to the one we use to find a word in a sorted dictionary. This is a very important data structure in fundamental computer science.

http://en.wikipedia.org/wiki/B-tree

Many database indexes are actually b-trees under the hood.

Pedantry

You said:

If column #2 is indexed it returns the result immediately, but if it’s not the computer has to read the entire table so it takes a long time

The “immediately” part isn’t really true. It’s still an iterative algorithm that takes longer to run the more items that need to be looked at. It’s such an efficient algorithm that the results might feel immediate, but it’s important to note that it’s not a constant time algorithm.

Basically, because an index is ordered, and when searching through an ordered data set, you don’t have to search every item to find the element you’re looking for; there are faster ways.

Discussing database indexing in detail can become very arcane very quickly, but the simplest way to answer this is that you can use techniques such as a binary search to find the element much more quickly, for reasons that should be intuitively obvious to anyone who’s ever played the “I’m thinking of a number between 1 and 100” guessing game. You know how you can always guess the number within 7 guesses, not 100, if your guesses cut the size of the number of values to be “searched” in half each time? Databases do something very similar with indexing.

It is not that the computer knows what the result is without reading the table. It actually does quite a lot of work to find the result, but it is very fast, so it appears instantaneous to you. But yes, certainly, it does not read the entire table.

The way it works is implementation dependent, but a popular simple algorithm which serves for illustration purposes is binary search. It works similarly to how you find a word in a dictionary. Clearly, you don’t read the entire dictionary every time you are looking for a word, do you?

What you do is that you open up the dictionary somewhere near the middle, (the algorithm will go to the exact middle, but that’s just computers finding precision easier than approximation,) then you read just one word from that page, and see whether the word you are looking for comes before or after the word you read. If it comes before, you ignore the second half of the book, and focus only on the first half; if it comes after, you ignore the first half of the book, and you focus only on the second half: then you open up the page in the middle of the half that you are focusing on, read another word, see if your word comes before or after, and keep repeating these steps, until you have found the right page.

The reason why this works is because the words in the dictionary are sorted. If they were not sorted, we would have to go through the entire dictionary to find a word. What an index does on a database is that it provides a sorted view of the data, so that binary search can be used to find a value.

Using binary search you can find the right page for any word in a thousand-page dictionary with just 10 page lookups, and you can find the right page for any word in a 4-billion page dictionary with just 32 page lookups. As you see, it scales non-linearly, to our advantage, which is sweet.

binary beans

You have a bunch of red jelly beans and a bunch of blue jelly beans.

You put the red jelly beans to the left and the blue ones to the right.

Now in the left-hand-side-world of red jelly beans you separate the ones that have no freckles to the left, the ones that have many freckles to the right.

So now you have something like

                    Root
              Reds/      Blues
           /     
no freckles     freckles

You can keep branching your tree on different attributes and eventually if you want to find a red jelly bean with freckles you only have to take 2 steps, you don’t have to pick up every single bean and check, put it back, and grab a new one until you reach a freckled red bean.

No matter how many jelly beans you have, finding a red one or a blue one is very simple.

A tree works a similar way (binary tree, B+tree, trie, etc.) by branching differences in data.

princess hotel

Indexing is just another way of exploiting patterns in data for efficiency.

Say you have a hotel and you put all the VIPs on the top floor, all the princesses on the 3rd floor, all the kings on the 2nd floor, etc.

Then you don’t have to search through every room when you want to find a princess, you just go to the 3rd floor. It cuts down your seek time by a lot.

Now imagine doing that with your data.

An index is therefore not all the data, it’s just a map that tells you where all the data is. If your data has some nice patterns that can be exploited, then your index will make your searching very efficient.

Others have given more technical answers, so I will provide an analogy—and I realize this analogy may be a bit old fashioned:

How do you look for a number in a phone book?

Do you start at page one reading every name until you find it? Of course not, you split the huge thing in two, check where you’re at then split it again and again, until you converge closer and closer to a page likely to contain the listing.

This is possible because we know the information in the phone book has been sorted alphabetically. Thus we are effectively performing the same steps as a binary-search algorithm.

Related concepts hold true for all books. When you open any book you’ll often find an “index” within the first few pages that list relevant sections with pointers to data (page numbers).

Creating a database index for a particular table column is the same concept.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 17:33

Thẻ: computer-science, database, indexing, search

Thiết kế website giá rẻ

Danh mục