contents/articles/boolean-search.html

-----
title: "Understanding Boolean Search"
content-type: article
timestamp: 1134215836
tags: "internet|google"
-----

<p>These days, you need a search engine to find the information you want. When the World Wide Web was
    smaller, search engines weren't an essential websurfing tool, but now that the Web has grown exponentially and
    hosts literally billions of documents and files, even normal searches aren't always enough to find important
    information, especially when it is not readily available. So, I'm going to show you a more powerful way to search.</p>

<h3>Imagine yourself in the shoes of someone who has never used the Internet before.</h3>
<p>That's pretty rare nowadays, but it does happen. Take my dad, for example, who recently asked me something like
    "Where can I find a map of the Internet?". I explained that there wasn't any such thing because the Web is too
    dynamic to be mappable, and that's why we use search engines.</p>

<p>I introduced him to Google [1], and he has since started to use search engines regularly. He didn't have much luck on
    his first few tries, but eventually he learned how to search properly.</p>

<p>Searching the web is easy (just type in a word and hit enter), but finding stuff can be tricky, especially if you
    don't know enough about a subject to narrow your search down. Most people (including myself) tend to find what
    they're looking for only after multiple searches: we start with a general item, check the results, and restrict the
    next search based on what we learned from the previous one. While this is generally successful, every once in a
    while you will find yourself going in circles.</p>

<h3>Let's look at a sample situation: I want to learn Ruby on Rails</h3>
<p>I also want a free host to try it out. So, I go to Google and type something like:</p>

<em>ruby on rails free hosting</em>

<p>I immediately find various blog entries referring to "Rails Playground" [<a href="#">3</a>], a project that aims
    to offer free hosting for trying out the Ruby-based framework. It seems to be the perfect solution: they offer,
    completely free, enough space to try out Rails. It's a pity they recently decided to close new account registration,
    so now the whole thing is useless.</p>

<p>Variants of the search query mentioned above bring up stuff related to Rails Playground. The project became so
    well-known that almost every Rails-related blog mentioned it at some point as the only place offering free hosting
    supporting Rails. Since it is useless now, is there a way to prevent Google (or other search engines) from
    displaying Rails Playground related results? Yes!</p>

<p>You would need something like this:</p>

<em>rails free hosting -playground -railsplayground</em>

<p>In this new query I excluded the words "playground" and "railsplayground" by putting a minus sign before them, so
    that I would find other results that didn't refer to the project. In the end, I didn't actually find any other free
    hosting that supported Rails, but I did find the following:</p>

<ul>
    <li>a company which offers free Rails hosting for testing purposes (until they officially launch their service)</li>
    <li>a guy who offered some space on his private server for testing Rails (no longer available)</li>
</ul>

<p>Although I didn't find anything equivalent to Rails Playground, I didn't waste time either going in circles or
    scrolling through tons of pages trying to find something else. Actually, most people know how to exclude (or
    include) words in Google searches but they rarely do it. Furthermore, most people don't know that there are many
    more search functions available on almost all the popular search engines. These functions, like the minus sign, are
    called Boolean operators.</p>


<h3>A few words about Boolean algebra:</h3>
<p>Boolean searches get their name from George Boole[4], the
    inventor of Boolean algebra[5], which is a particular algebraic structure involving three fundamental operators:
    AND, OR
    and NOT. If you have attended any math class or course you should already be familiar with it. If not, here is a
    short summary of some of the concepts I will discuss in upcoming sections.</p>

<p>In Boolean searches (as opposed to
    Boolean algebra), the expressions A, B, C, etc. can be considered words, and "A &lt;Boolean operator&gt; B" can be
    considered a search query.</p>

<ul>
    <li>A AND B: pages must contain both words A and B.</li>
    <li>A OR B: pages must contain the word A, the word B, or both.</li>
    <li>NOT A: pages must not contain the word A.</li>
</ul>
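<p>As a rough illustration, the three operators map directly onto set operations. The sketch below is only a toy
    model: the <code>PAGES</code> collection and the <code>containing</code> helper are made up for this example and
    have nothing to do with how a real search engine stores its index.</p>

```python
# Hypothetical mini-corpus: page name -> set of words found on that page.
PAGES = {
    "page1": {"food", "for", "dogs"},
    "page2": {"food", "for", "cats"},
    "page3": {"birds", "food"},
    "page4": {"rails", "hosting"},
}

# The set of all page names, needed to express NOT as a complement.
ALL = set(PAGES)


def containing(word):
    """Return the names of all pages that contain the given word."""
    return {name for name, words in PAGES.items() if word in words}


a_and_b = containing("food") & containing("dogs")  # A AND B: intersection
a_or_b = containing("dogs") | containing("cats")   # A OR B: union
not_a = ALL - containing("food")                   # NOT A: complement

print(sorted(a_and_b))  # ['page1']
print(sorted(a_or_b))   # ['page1', 'page2']
print(sorted(not_a))    # ['page4']
```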

<p>Trivial. Now let's
    see some more examples:</p>

<ul>
    <li>(A OR B) AND (NOT C): here I used brackets to create nesting, which causes
        expressions within brackets to be evaluated before the rest, so the query means: "search for pages containing
        either A
        or B but which do not contain C".</li>
    <li>(A OR (C AND D)) AND (NOT (F OR G)): similar to but more complex than the
        previous one: "search for pages containing either A or both C and D. Additionally, neither F nor G can be
        present".</li>
</ul>
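<p>If it helps, the two nested expressions can be written as ordinary Python predicates over the set of words found
    on a page. The function names are invented for this sketch, and the letters simply stand in for words.</p>

```python
def matches_first(words):
    """(A OR B) AND (NOT C)"""
    return ("A" in words or "B" in words) and "C" not in words


def matches_second(words):
    """(A OR (C AND D)) AND (NOT (F OR G))"""
    left = "A" in words or ("C" in words and "D" in words)
    right = not ("F" in words or "G" in words)
    return left and right


print(matches_first({"A", "X"}))   # True: contains A and no C
print(matches_first({"B", "C"}))   # False: C is present
print(matches_second({"C", "D"}))  # True: both C and D, neither F nor G
print(matches_second({"A", "F"}))  # False: F is present
print(matches_second({"C"}))       # False: C alone is not enough
```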

<p>In some applications, like electrical circuits, NOR, NAND and XOR operators are also used to express
    Not OR, Not AND and eXclusive OR. As for search engines, only some of them support the XOR operator. A XOR B means
    that
    pages can contain either A but not B, or B but not A.</p>
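<p>For reference, here is a small sketch that prints the truth table of these three derived operators. The helper
    functions are defined here purely for illustration.</p>

```python
from itertools import product


def xor(a, b):
    """True when exactly one of the two inputs is true."""
    return a != b


def nand(a, b):
    """Not AND: false only when both inputs are true."""
    return not (a and b)


def nor(a, b):
    """Not OR: true only when both inputs are false."""
    return not (a or b)


# Print one row per combination of inputs: A, B, A XOR B, A NAND B, A NOR B.
for a, b in product([False, True], repeat=2):
    print(a, b, xor(a, b), nand(a, b), nor(a, b))
```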

<h3>Boolean search and
    Google</h3>
<p>After reading this you might want to try typing Boolean expressions like "(food AND for)
    AND (cats OR DOGS) AND (NOT birds)" into a search engine, but that won't work. A Boolean expression typed "as is"
    rarely
    works on a search engine (it isn't supported because it's considered to be not user friendly enough). Google in
    particular adopted a more intuitive way[6] of performing Boolean searches.</p>

<p>For starters, you almost always perform a Boolean search when searching something on Google simply because they
    decided (like most major search engines
    have) to automatically include the AND operator unless OR is specified.</p>

<p>Searching the phrase "food for dogs"
    actually corresponds to "food AND for AND dogs" (using the proper Boolean expression). Presumably, this was done to
    prevent the search engine from delivering too many (and usually inconsistent) results. The other possibility (the
    default in MySQL's FULLTEXT boolean search[7]) would be to use the OR operator by default. Thus, searching for "food
    for
    dogs" might deliver results about food for cats, other pets, or even food in general.</p>
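<p>The difference between the two defaults can be sketched in a few lines. This is a deliberately simplified model
    (the <code>match</code> function and its <code>default</code> parameter are invented for the example) and it does
    not reproduce how Google or MySQL actually rank or filter results.</p>

```python
def match(query, page_words, default="AND"):
    """Check a space-separated query against the set of words on a page.

    With default="AND" every term must be present;
    with default="OR" a single matching term is enough.
    """
    terms = query.lower().split()
    hits = [term in page_words for term in terms]
    return all(hits) if default == "AND" else any(hits)


cat_page = {"food", "for", "cats"}

print(match("food for dogs", cat_page, default="AND"))  # False: "dogs" is missing
print(match("food for dogs", cat_page, default="OR"))   # True: "food" and "for" match
```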

<p>To improve the
    precision of their searches, Google also implements automatic exclusion for common words (like "for" in the example
    above). To be fair, you will usually find what you
    are looking for even with common words excluded. On occasion, however, a common word needs to be included in a
    search: to force Google to include a word, just add a plus
    symbol before it, like "+for".</p>

<p>Similarly, a minus in front of a word (rails free hosting -playground
    -railsplayground) forces Google to exclude a word from the search query: in other words, the minus sign is Google's
    version of the Boolean NOT operator.</p>
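<p>To make the "+" and "-" notation concrete, here is a minimal sketch of how such a query could be split into
    required and excluded words and then checked against a page. The <code>parse_query</code> and <code>matches</code>
    functions are hypothetical and only model the behaviour described in this article, not Google's real query
    parser.</p>

```python
def parse_query(query):
    """Split a query into (required, excluded) lists of lowercase words.

    A leading "-" marks a word as excluded; a leading "+" is dropped,
    since every remaining word is treated as required (implicit AND).
    """
    required, excluded = [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        else:
            word = token.lstrip("+").lower()
            if word:
                required.append(word)
    return required, excluded


def matches(query, page_words):
    """True if the page has every required word and no excluded word."""
    required, excluded = parse_query(query)
    has_all = all(word in page_words for word in required)
    has_banned = any(word in page_words for word in excluded)
    return has_all and not has_banned


query = "rails free hosting -playground -railsplayground"
print(parse_query(query))
# (['rails', 'free', 'hosting'], ['playground', 'railsplayground'])
print(matches(query, {"rails", "free", "hosting", "trial"}))       # True
print(matches(query, {"rails", "free", "hosting", "playground"}))  # False
```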

<p>In order to transform the Boolean expression that I used at the start of this
    section - (food AND for) AND (cats OR DOGS) AND (NOT birds) - into a proper query accepted by Google, I have to
    write:
    "food for" "cats OR dogs" -birds. The OR operator <em>must</em> be specified, and anything in parentheses roughly
    corresponds to quotation marks because Google searches for the exact phrase enclosed in the quotation marks (also
    evaluating an OR operator, if present).</p>

<p>The biggest limitation of Google when it comes to Boolean searches is
    the lack of support for nested expressions. Something like (food AND (NOT for)) AND (cats OR dogs) AND (NOT birds)
    cannot be translated into something like <em>"food -for" "cats OR dogs" -birds</em> because Google will not evaluate
    the
    "-" operator if it is enclosed in quotation marks. Something more complex like:</p>

<em>((food AND for) AND (cats
    OR DOGS) AND (NOT birds)) OR ((stuff AND for) AND (goats OR horses) AND (NOT (cows OR
    bulls)))</em>

<p>cannot be translated into a Google-friendly query. Normal people probably won't ever do
    that complicated a search, but you never know...</p>
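<p>Evaluating a nested expression like this one is not hard for a program, which is why some engines can afford to
    support it. As a sketch only, the query can be represented as nested tuples and checked recursively against the
    words on a page. The tuple format and the <code>evaluate</code> function are assumptions made for this example.</p>

```python
def evaluate(expr, words):
    """Recursively evaluate a nested Boolean expression against a set of words.

    An expression is either a plain word (str) or a tuple of the form
    ("AND", e1, e2, ...), ("OR", e1, e2, ...) or ("NOT", e).
    """
    if isinstance(expr, str):
        return expr.lower() in words
    op, *args = expr
    if op == "AND":
        return all(evaluate(arg, words) for arg in args)
    if op == "OR":
        return any(evaluate(arg, words) for arg in args)
    if op == "NOT":
        (arg,) = args  # NOT takes exactly one operand
        return not evaluate(arg, words)
    raise ValueError(f"unknown operator: {op}")


# The complex query from the text, written as nested tuples.
QUERY = (
    "OR",
    ("AND", ("AND", "food", "for"), ("OR", "cats", "dogs"), ("NOT", "birds")),
    ("AND", ("AND", "stuff", "for"), ("OR", "goats", "horses"),
     ("NOT", ("OR", "cows", "bulls"))),
)

print(evaluate(QUERY, {"food", "for", "dogs"}))             # True
print(evaluate(QUERY, {"stuff", "for", "goats"}))           # True
print(evaluate(QUERY, {"stuff", "for", "horses", "cows"}))  # False
```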

<h3>All the other search engines, strategies and conclusions</h3>
<p>There are various articles (see
    [8][9][10]) about how Boolean search has been implemented in the major search engines. AltaVista[11],
    AlltheWeb[12] and MSN Search[13] seem to support Boolean search features better than Google. All of them support the
    standard Boolean operators, as well as the "+" and "-" symbols, but apparently only MSN Search supports
    full
    Boolean search queries with nesting: I actually managed to execute my previous complex example:</p>

<em>((food
    AND for) AND (cats OR DOGS) AND (NOT birds)) OR ((stuff AND for) AND (goats OR horses) AND (NOT (cows OR
    bulls)))</em>

<p>and I got some decent results. The only (understandable) exception is that I had to specify +for to have the word
    "for" included.</p>

<p>Although Boolean search is useful, it is not the only way to get
    relevant results as quickly as possible. Additional thinking is required to prepare a query properly. In everyday
    life,
    you won't really use heavily nested queries, simply because other methods are more effective. If you're interested
    in
    learning how to search, I'd recommend a very informative article available at Waikato University[14].</p>

<p>I found
    out that a mix of multiple search attempts and basic Boolean queries (word exclusion in
    particular)
    can deliver pertinent results fairly readily. Suppose you've heard something regarding a person named Halley who
    contributes to an IT-related community and that someone mentioned the word "kernel" when talking about him, and you
    remember that it wasn't referring to Linux. You could come up with something like:</p>

<em>Halley kernel -Linux</em>

<p>Et voilà: Halley's CyberArmy Profile[15] appears as the first search result in Google! If you typed
    just <em>Halley</em> you wouldn't have found the right one right away; you would probably get more information about
    Halley's Comet or the astronomer Sir Edmund Halley. If you typed <em>kernel Halley</em> you'd have found something
    about
    Kernel Halley on zZine first and then on CyberArmy lower down in the search results.</p>

<p>Boolean search can be
    useful, but it must not be abused. Google's decision to implement only partial Boolean support without the standard
    Boolean
    operators was probably the best choice to achieve both pertinent results and user-friendliness.</p>

<h3>Notes and further resources</h3>
<ul>
    <li>[1] Google Inc.: <a href="http://www.google.com/">http://www.google.com/</a></li>
    <li>[2] Ruby on Rails framework: <a href="http://www.rubyonrails.org/">http://www.rubyonrails.org/</a></li>
    <li>[3] Rails Playground: <a href="http://www.railsplayground.com/">http://www.railsplayground.com/</a></li>
    <li>[4] George Boole, Wikipedia Page: <a
            href="http://en.wikipedia.org/wiki/George_Boole">http://en.wikipedia.org/wiki/George_Boole</a></li>
    <li>[5] Boolean
        Algebra, Wikipedia Page: <a
            href="http://en.wikipedia.org/wiki/Boolean_algebra">http://en.wikipedia.org/wiki/Boolean_algebra</a></li>
    <li>[6] Google
        Help on Advanced Search: <a
            href="http://www.google.com/help/refinesearch.html">http://www.google.com/help/refinesearch.html</a></li>
    <li>[7] MySQL
        FULLTEXT boolean search: <a
            href="http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html">http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html</a>
    </li>
    <li>[8]
        Search engines that implement boolean search (outdated): <a
            href="http://searchenginewatch.com/facts/article.php/2155991">http://searchenginewatch.com/facts/article.php/2155991</a>
    </li>
    <li>[9]
        Boolean Searching on the Internet: <a
            href="http://library.albany.edu/internet/boolean.html">http://library.albany.edu/internet/boolean.html</a>
    </li>
    <li>[10]
        How to choose a search engine or directory: <a
            href="http://library.albany.edu/internet/choose.html#logic">http://library.albany.edu/internet/choose.html#logic</a>
    </li>
    <li>[11]
        AltaVista Special Search Terms: <a
            href="http://www.altavista.com/help/search/syntax">http://www.altavista.com/help/search/syntax</a>
    </li>
    <li>[12]
        AlltheWeb Query Language: <a
            href="http://alltheweb.com/help/faqs/query_language#2">http://alltheweb.com/help/faqs/query_language#2</a>
    </li>
    <li>[13]
        MSN Search: <a href="http://search.msn.com/">http://search.msn.com/</a>
    </li>
    <li>[14] "The Assignment Process: Search
        Strategies": <a
            href="http://www.waikato.ac.nz/library/learning/g_strategies.shtml">http://www.waikato.ac.nz/library/learning/g_strategies.shtml</a>
    </li>
    <li>[15]
        Halley's CyberArmy Profile: <a href="http://www.cyberarmy.net/~Halley/">http://www.cyberarmy.net/~Halley/</a>
    </li>
</ul>