Version at: 26/10/2013, 08:46

#How to Search for Text

Each page on Tatoeba features a box that allows you to search for text within the collection of sentences. The search engine is Sphinx, which offers [documentation](http://sphinxsearch.com/docs/current.html#boolean-syntax) of its sophisticated capabilities. However, some features of Sphinx are not relevant to Tatoeba, while other aspects that are most likely to be important to users of the site are not immediately apparent.

One of the most important things to note is that in many languages, including English, Sphinx **stems** the search words by default. This means that it removes trailing letters from both search words and indexed words. Thus a search for *pare* will also find *pared* and *paring*.

If you want to find an exact match for a word, you must precede it with an equals sign: *=pare*. This may come as a surprise to users who are accustomed to Google Search, where wrapping a word or phrase in double quotes forces an exact match. In Sphinx, double quotes have a different function, which only affects multiword (phrase) searches: wrapping a phrase in double quotes requires matching sentences to contain words in the specified sequence. Simply placing a phrase in quotes does not suppress stemming of its individual words. To do that, you will need to place an equals sign before each word in the phrase for which you want to suppress stemming.

As an example, take the search *like thing*. This will find *like things*, *likely things*, and even *things like*. Adding quotes, as in *"like thing"*, will prevent a match against *things like* (where the words appear in the wrong order), but it will continue to match *like things*, *likely things*, and so on. By contrast, *"=like =thing"* will only match *like thing* (which does not occur in the Tatoeba corpus).

Version at: 26/10/2013, 08:50

#How to Search for Text

Each page on Tatoeba features a box that allows you to search for text within the collection of sentences. The search engine is Sphinx, which offers [documentation](http://sphinxsearch.com/docs/current.html#boolean-syntax) of its sophisticated capabilities. However, some features of Sphinx are not relevant to Tatoeba, while other aspects that are most likely to be important to users of the site are not immediately apparent.

One of the most important things to note is that in many languages, including English, Sphinx **stems** the search words by default. This means that it removes trailing letters from both search words and indexed words. Thus a search for *pare* will also find *pared* and *paring*.

If you want to find an exact match for a word, you must precede it with an equals sign, as in *=pare*. This may come as a surprise to users who are accustomed to Google Search, where wrapping a word or phrase in double quotes forces an exact match. In Sphinx, double quotes have a different function, which only affects multiword (phrase) searches: wrapping a phrase in double quotes requires matching sentences to contain words in the specified continuous sequence. Simply placing a phrase in quotes does not suppress stemming of its individual words. To do that, you will need to place an equals sign before each word in the phrase for which you want to suppress stemming.

As an example, take the search *like thing*. This will find *like things*, *likely things*, and even *things like*. Adding quotes, as in *"like thing"*, will prevent a match against *things like* (where the words appear in the wrong order), but it will continue to match *like things*, *likely things*, and so on. By contrast, *"=like =thing"* will only match *like thing* (which does not occur in the Tatoeba corpus). Dropping the quotes and one of the equals signs, as in *like =thing*, will also find sentences where the words appear in a different order, and not adjacent to each other, such as "Such a strange thing is not likely to happen."
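
If you build queries often, the rule described above (an equals sign before each word, plus double quotes to fix the word order) can be sketched as a small Python helper. This is only an illustration: the function names `exact_word` and `exact_phrase` are hypothetical and are not part of Tatoeba or Sphinx. The code does nothing more than assemble the text you would type into the search box.

```python
def exact_word(word: str) -> str:
    """Prefix a word with '=' so that it is matched without stemming."""
    # Hypothetical helper: leaves a word alone if it is already marked exact.
    return word if word.startswith("=") else "=" + word


def exact_phrase(phrase: str) -> str:
    """Build a quoted phrase query in which every word is marked exact."""
    # The double quotes require the words to appear in this sequence;
    # the '=' before each word suppresses stemming of that word.
    words = phrase.split()
    return '"' + " ".join(exact_word(w) for w in words) + '"'


print(exact_word("pare"))          # =pare
print(exact_phrase("like thing"))  # "=like =thing"
```

To suppress stemming for only some words of a phrase, apply `exact_word` to those words alone and add the quotes yourself.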
