content/articles/boolean-search.bbcode

----- 
permalink: boolean-search
filters_pre: 
- bbcode
title: Understanding Boolean Search
comments: []

date: 2005-12-10 12:57:16 +01:00
tags: 
- internet
- google
type: article
toc: true
-----
These days, it is necessary to use a search engine to find the information you want. When the World Wide Web was smaller, search engines weren't an essential websurfing tool, but now that the Web has grown exponentially and hosts literally billions of documents and files, even normal searches aren't enough to find important information, especially when it is not readily available. So, I'm going to show you a more powerful way to search.

[b]Learning how to search[/b]

Imagine yourself in the shoes of someone who has never used the Internet before. That's pretty rare nowadays, but it does happen. Take my dad, for example, who recently asked me something like "Where can I find a map of the Internet?". I explained that there wasn't any such thing because the Web is too dynamic to be mappable, and that's why we use search engines.

I introduced him to Google [1], and he has since started to use search engines regularly. He didn't have much luck on his first few tries, but eventually he learned how to search properly.

Searching the web is easy (just type in a word and hit enter), but finding stuff can be tricky, especially if you don't know enough about a subject to narrow your search down. Most people (including myself) tend to find what they're looking for only after multiple searches: we start with a general term, check the results, and restrict the next search based on what we learned from the previous one. While this is generally successful, every once in a while you will find yourself going in circles.

Let's look at a sample situation: I want to learn Ruby on Rails [2] and I want a free host to try it out. So, I go on Google and type something like: 

[i]ruby on rails free hosting[/i]

I immediately find various blog entries referring to "Rails Playground" [3], a project that aims to offer free hosting for trying out the Ruby-based framework. It seems to be the perfect solution: they offer, completely free, enough space to try out Rails. It's a pity they recently decided to close new account registration, so now the whole thing is useless.

Variants of the search query mentioned above all bring up stuff related to Rails Playground. The project became so well-known that almost every Rails-related blog mentioned it at some point as the only place offering free hosting supporting Rails. Since it is useless now, is there a way to prevent Google (or other search engines) from displaying results related to Rails Playground? Yes!

You would need something like this: 

[i]rails free hosting -playground -railsplayground[/i]

In this new query I excluded the words "playground" and "railsplayground" by putting a minus sign before them, so I would find other results that didn't refer to the project. In the end, I didn't actually find any other free hosting that supported Rails, but I did find the following: 

- a company which offers free Rails hosting for testing purposes (until they officially launch their service) 
- a guy who offered some space on his private server for testing Rails (no longer available) 

Although I didn't find anything equivalent to Rails Playground, I didn't waste time either going in circles or scrolling through tons of pages trying to find something else. Actually, most people know how to exclude (or include) words in Google searches, but they rarely do it. Furthermore, most people don't know that there are many more search functions available on almost all the popular search engines. These functions, like the minus sign, are called Boolean operators.
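
To make the minus sign less mysterious, here is a minimal Python sketch of how word exclusion could work. It is only an illustration under my own assumptions: the [i]matches[/i] function and the three sample "pages" are made up, and a real search engine works on an index rather than scanning text like this.

```python
def matches(query, text):
    """Return True if text has every plain term and none of the '-' terms."""
    words = set(text.lower().split())
    for term in query.lower().split():
        if term.startswith("-"):
            # Excluded term: reject the page as soon as the word shows up.
            if term[1:] in words:
                return False
        elif term not in words:
            # Required term is missing.
            return False
    return True


# Hypothetical page texts, invented for this example.
pages = [
    "rails playground offers free hosting",
    "free rails hosting for testing purposes",
    "railsplayground closed new registrations for free hosting of rails apps",
]

query = "rails free hosting -playground -railsplayground"
results = [page for page in pages if matches(query, page)]
print(results)
```

Only the second page survives: the other two contain one of the excluded words.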


[b]A few words about Boolean algebra: [/b]

Boolean searches get their name from George Boole[4], the inventor of Boolean algebra[5], which is a particular algebraic structure involving three fundamental operators: AND, OR and NOT. If you attended any math class or course you should already be familiar with it. If not, here is a short summary of some of the concepts I will discuss in upcoming sections.

In Boolean searches (as opposed to Boolean algebra), the expressions A, B, C, etc. can be considered words, and "A <Boolean operator> B" can be considered a search query.

- A AND B: pages must contain both words A and B.
- A OR B: pages must contain either the word A or the word B (or both).
- NOT A: pages must not contain the word A.
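
One way to picture the three operators is as set operations on the pages that contain each word. The Python sketch below assumes a tiny, made-up collection of four pages and a hypothetical [i]containing[/i] helper; it is meant as an illustration, not as a description of how any particular search engine is built.

```python
# Hypothetical mini "web": page id -> page text.
pages = {
    1: "food for dogs",
    2: "food for cats",
    3: "toys for dogs",
    4: "bird seed",
}


def containing(word):
    """Return the ids of all pages that contain the given word."""
    return {pid for pid, text in pages.items() if word in text.split()}


all_pages = set(pages)
a = containing("food")  # pages 1 and 2
b = containing("dogs")  # pages 1 and 3

a_and_b = a & b         # intersection: pages with both words
a_or_b = a | b          # union: pages with at least one of the words
not_a = all_pages - a   # complement: pages without the word "food"

print(sorted(a_and_b), sorted(a_or_b), sorted(not_a))
```

With these sample pages, AND keeps only page 1, OR keeps pages 1, 2 and 3, and NOT "food" keeps pages 3 and 4.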

Trivial. Now let's see some more examples:

- (A OR B) AND (NOT C): here I used brackets to create nesting, which causes expressions within brackets to be carried out before the rest, so the query means: "search for pages containing either A or B but which do not contain C".

- (A OR (C AND D)) AND (NOT (F OR G)): similar to the previous one but more complex: "search for pages containing either A or both C and D. Additionally, neither F nor G can be present".

In some applications, like electrical circuits, NOR, NAND and XOR operators are also used to express Not OR, Not AND and eXclusive OR. As for search engines, only some of them support the XOR operator. A XOR B means that pages can contain either A but not B or B but not A.
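
Nested expressions and XOR are easier to follow with something you can run. The following Python sketch evaluates a query, written as nested tuples, against the set of words found on a page. The tuple notation and the [i]evaluate[/i] function are my own assumptions for the sake of the example.

```python
def evaluate(expr, words):
    """Evaluate a nested Boolean query against the set of words on a page."""
    if isinstance(expr, str):
        # A bare word is true when the page contains it.
        return expr in words
    op, *args = expr
    if op == "NOT":
        return not evaluate(args[0], words)
    if op == "AND":
        return all(evaluate(arg, words) for arg in args)
    if op == "OR":
        return any(evaluate(arg, words) for arg in args)
    if op == "XOR":
        # True when exactly one of the two operands is true.
        return evaluate(args[0], words) != evaluate(args[1], words)
    raise ValueError(f"unknown operator: {op}")


# (A OR (C AND D)) AND (NOT (F OR G))
query = ("AND", ("OR", "A", ("AND", "C", "D")), ("NOT", ("OR", "F", "G")))

print(evaluate(query, {"A"}))       # True: A is present, F and G are not
print(evaluate(query, {"C", "D"}))  # True: both C and D are present
print(evaluate(query, {"A", "F"}))  # False: F is present
print(evaluate(("XOR", "A", "B"), {"A", "B"}))  # False: both are present
```

The innermost brackets are evaluated first simply because the function calls itself on each sub-expression before combining the results.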


[b]Boolean search and Google[/b]

After reading this you might want to try typing Boolean expressions like "(food AND for) AND (cats OR DOGS) AND (NOT birds)" into a search engine, but that won't work. A Boolean expression typed "as is" rarely works on a search engine (it isn't supported because it's considered to be not user friendly enough). Google in particular adopted a more intuitive way[6] of performing Boolean searches.  

For starters, you almost always perform a Boolean search when searching for something on Google, simply because they decided (like most major search engines) to automatically include the AND operator unless OR is specified.

Searching the phrase "food for dogs" actually corresponds to "food AND for AND dogs" (using the proper Boolean expression).  Presumably, this was done to prevent the search engine from delivering too many (and usually inconsistent) results. The other possibility (the default in MySQL's FULLTEXT boolean search[7]) would be to use the OR operator by default.  Thus, searching for "food for dogs" might deliver results about food for cats, other pets, or even food in general. 
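
The difference between the two defaults is easy to see with a small Python sketch. The [i]search[/i] function, its [i]default[/i] parameter and the sample pages are all assumptions I made up for illustration; they don't reflect how Google or MySQL actually implement their searches.

```python
def search(query, pages, default="AND"):
    """Return pages matching all terms (AND) or at least one term (OR)."""
    terms = query.lower().split()
    combine = all if default == "AND" else any
    return [
        page for page in pages
        if combine(term in page.lower().split() for term in terms)
    ]


# Hypothetical page texts, invented for this example.
pages = [
    "food for dogs",
    "food for cats",
    "healthy food recipes",
    "dogs and cats",
    "bird seed",
]

print(search("food for dogs", pages))                # implicit AND
print(search("food for dogs", pages, default="OR"))  # implicit OR
```

With an implicit AND only the first page matches, while an implicit OR returns every page except "bird seed", which is why an OR default tends to produce far more (and far less relevant) results.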

To improve the precision of their searches, Google also automatically excludes common words (like "for" in the example above). Usually you will find what you are looking for even with common words excluded. On occasion, however, a common word needs to be included in a search: to force Google to include a word, just add a plus symbol before it, like "+for".
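
As a rough sketch of the idea, the Python snippet below drops common words from a query unless they are prefixed with a plus sign. The [i]STOP_WORDS[/i] set and the [i]effective_terms[/i] function are hypothetical: I don't know which words Google actually ignores, so the list is just a guess for illustration.

```python
# Hypothetical list of common words; not Google's actual list.
STOP_WORDS = {"a", "an", "for", "of", "the", "to"}


def effective_terms(query):
    """Return the terms that would actually be searched for."""
    terms = []
    for term in query.lower().split():
        if term.startswith("+"):
            # A leading plus forces the word to be kept.
            terms.append(term[1:])
        elif term not in STOP_WORDS:
            terms.append(term)
    return terms


print(effective_terms("food for dogs"))   # ['food', 'dogs']
print(effective_terms("food +for dogs"))  # ['food', 'for', 'dogs']
```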

Similarly, a minus in front of a word (rails free hosting -playground -railsplayground) forces Google to exclude a word from the search query: in other words, the minus sign is Google's version of the Boolean NOT operator.

In order to transform the Boolean expression that I used at the start of this section - (food AND for) AND (cats OR DOGS) AND (NOT birds) - into a proper query accepted by Google, I have to write: "food for" "cats OR dogs" -birds. The OR operator [i]must[/i] be specified, and anything in parentheses roughly corresponds to quotation marks because Google searches for the exact phrase enclosed in the quotation marks (also evaluating an OR operator, if present).

The biggest limitation of Google when it comes to Boolean searches is the lack of support for nested expressions. Something like (food AND (NOT for)) AND (cats OR dogs) AND (NOT birds) cannot be translated into something like [i]"food -for" "cats OR dogs" -birds[/i] because Google will not evaluate the "-" operator if it is enclosed in quotation marks. Something more complex like:

[i]((food AND for) AND (cats OR DOGS) AND (NOT birds)) OR ((stuff AND for) AND (goats OR horses) AND (NOT (cows OR bulls)))[/i]

cannot be translated into a Google-friendly query. Normal people probably won't ever do that complicated a search, but you never know...


[b]All the other search engines, strategies and conclusions[/b]

There are various articles (see [8][9][10]) about how Boolean search has been implemented in the major search engines, and AltaVista[11], AlltheWeb[12] and MSN Search[13] seem to support Boolean search features better than Google. All of them support the standard Boolean operators, as well as the "+" and "-" symbols, but apparently only MSN Search supports full Boolean search queries with nesting. I actually managed to execute my previous complex example: 

[i]((food AND for) AND (cats OR DOGS) AND (NOT birds)) OR ((stuff AND for) AND (goats OR horses) AND (NOT (cows OR bulls)))[/i]

and I got some decent results. The only (understandable) exception is that I had to specify +for to have the word "for" included.

Although Boolean search is useful, it is not the only way to get relevant results as quickly as possible. Additional thinking is required to prepare a query properly. In everyday life, you won't really use heavily nested queries, simply because other methods are more effective. If you're interested in learning how to search I'd recommend a very informative article available at Waikato University[14].

I found out that a mix of multiple search attempts and basic Boolean queries (word exclusion in particular) can deliver pertinent results fairly readily. Suppose you've heard something regarding a person named Halley who contributes to an IT-related community, that someone mentioned the word "kernel" when talking about him, and that you remember it wasn't referring to Linux. You could come up with something like: 

Halley kernel -Linux

Et voilà: Halley's CyberArmy Profile[15] appears as the first search result in Google! If you typed just [i]Halley[/i] you wouldn't find the right one right away; you would probably get more information about Halley's Comet or the astronomer Sir Edmund Halley. If you typed [i]kernel Halley[/i] you'd find something about Kernel Halley on zZine first and then on CyberArmy lower down in the search results.

Boolean search can be useful, but it must not be abused. Google's decision to implement only partial Boolean support, without the standard Boolean operators, was probably the best choice to achieve both pertinent results and user-friendliness. 

[b]Notes and further resources[/b]
[1] Google Inc.: [url]http://www.google.com/[/url]
[2] Ruby on Rails framework: [url]http://www.rubyonrails.org/[/url]
[3] Rails Playground: [url]http://www.railsplayground.com/[/url]
[4] George Boole, Wikipedia Page: [url]http://en.wikipedia.org/wiki/George_Boole[/url]
[5] Boolean Algebra, Wikipedia Page: [url]http://en.wikipedia.org/wiki/Boolean_algebra[/url]
[6] Google Help on Advanced Search: [url]http://www.google.com/help/refinesearch.html[/url]
[7] MySQL FULLTEXT boolean search: [url]http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html[/url]
[8] Search engines that implement boolean search (outdated): [url]http://searchenginewatch.com/facts/article.php/2155991[/url]
[9] Boolean Searching on the Internet: [url]http://library.albany.edu/internet/boolean.html[/url]
[10] How to choose a search engine or directory: [url]http://library.albany.edu/internet/choose.html#logic[/url]
[11] AltaVista Special Search Terms: [url]http://www.altavista.com/help/search/syntax[/url]
[12] AlltheWeb Query Language: [url]http://alltheweb.com/help/faqs/query_language#2[/url]
[13] MSN Search: [url]http://search.msn.com/[/url]
[14] "The Assignment Process: Search Strategies": [url]http://www.waikato.ac.nz/library/learning/g_strategies.shtml[/url]
[15] Halley's CyberArmy Profile: [url]http://www.cyberarmy.net/~Halley/[/url]