Version at: 02/01/2020, 03:46

# Using the Tatoeba Corpus for Your Own Projects

## Terms of Use

* Read the [Terms of Use](http://tatoeba.org/eng/terms_of_use).
* Note that the Terms of Use for the audio files are not the same as those for the text of sentences. See [the list of audio lists](https://tatoeba.org/eng/sentences_lists/of_user/CK/audio%20-/page:1/sort:modified/direction:desc) to see which license, if any, each contributor has offered for use of their files outside of tatoeba.org. You should verify these licenses by clicking "audio files" on each member's profile.

## Warning: The Tatoeba Corpus is not error-free.

* Due to the nature of a public collaborative project, this data will never be 100% free of errors.
* Be aware of the following.
   * Though we recommend native-speaker contributions, a number of non-native speakers have contributed in languages they are learning.
   * We ask our members not to change archaic language to something that currently sounds natural.
   * Translations may not always be accurate, even when each of the linked sentences is correct on its own.
   * Some sentences, even those written by native speakers, may not be the most natural way to say something, because the contributor tried to mirror the original language too closely.

## Suggestions for Those Planning to Use the Corpus

* Don't use the whole corpus as-is. First filter out obviously suspect items, such as those tagged @need native check, @change, archaic, or non-sentence. ([Browse Tags](http://tatoeba.org/eng/tags/view_all) to find others.)
* You may want to eliminate all sentences not "owned" by native speakers.  However, even this will not guarantee perfect data.  *(See [Tatoeba.org Native Speakers](http://bit.ly/nativespeakers) maintained by CK)*
* You should inform your audience that the data may contain errors *(See [an example](http://tatoeba.org/eng/sentences/show/2535464))* and explain what steps you have taken to help minimize the errors.
* Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.
* If you are creating materials for people studying a foreign language, you might want to use only sentences you have personally proofread. This helps ensure that you are not teaching people a mistake. *(See a live example on the right side of this page: [http://www.manythings.org/bilingual/](http://www.manythings.org/bilingual/))*
* If you are looking for bilingual pairs in which one language is English, see [http://www.manythings.org/anki/](http://www.manythings.org/anki/). These are proofread English sentences that are paired with sentences by native speakers in the other languages. This may help you avoid using sentences that are more likely to contain errors. (See the Terms of Use on that page.)
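
The tag-based filtering suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes a tab-separated tags file with rows of the form sentence id, tag name, and a tab-separated sentences file with rows of the form sentence id, language code, text. The function names `load_suspect_ids` and `filter_sentences` are hypothetical, and the tag spellings are copied from the list above, so check them against the Browse Tags page before relying on them.

```python
import csv
import io

# Tag names copied from the suggestion above; verify the exact spelling
# against the Browse Tags page, since matching here is exact.
SUSPECT_TAGS = {"@need native check", "@change", "archaic", "non-sentence"}


def load_suspect_ids(tags_file, suspect_tags=SUSPECT_TAGS):
    """Return the ids of sentences that carry at least one suspect tag.

    Assumed row layout (tab-separated): sentence id, tag name.
    """
    reader = csv.reader(tags_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0] for row in reader if len(row) >= 2 and row[1] in suspect_tags}


def filter_sentences(sentences_file, suspect_ids):
    """Yield (id, lang, text) for every sentence not flagged as suspect.

    Assumed row layout (tab-separated): sentence id, language code, text.
    """
    reader = csv.reader(sentences_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 3 and row[0] not in suspect_ids:
            yield tuple(row[:3])


# Tiny in-memory stand-ins for the downloaded files.
tags = io.StringIO("1\t@change\n2\tproverb\n")
sentences = io.StringIO("1\teng\tHe go home.\n2\teng\tBetter late than never.\n")
for sentence in filter_sentences(sentences, load_suspect_ids(tags)):
    print(sentence)
```

With real data, the two `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`. Filtering on tags alone does not guarantee clean data, since many problem sentences are never tagged.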

## Download the Tatoeba Corpus

* [Downloads](http://tatoeba.org/eng/download_tatoeba_example_sentences) are updated every Saturday.

Version at: 09/05/2020, 15:54

# Using the Tatoeba Corpus for Your Own Projects

## Terms of Use

Read the [Terms of Use](http://tatoeba.org/eng/terms_of_use). Note that the terms of use for the audio files are not the same as those for sentence text. See [the list of audio lists](https://tatoeba.org/eng/sentences_lists/of_user/CK/audio%20-/page:1/sort:modified/direction:desc) to see which license, if any, each contributor has offered for use of their files outside of tatoeba.org. You should verify these licenses by clicking "audio files" on each member's profile.

## Processing the Tatoeba Corpus

You will probably want to filter out sentences that:
* require correction or improvement
* sound unnatural
* are poor or unnatural translations of other sentences

You may also want to filter out those that:
* contain vulgar language or sexual references
* contain archaic or old-fashioned content
* are particularly long

You can use various forms of metadata to aid with this process:
* tags (for instance, "@change", "archaic", "vulgar"; see [Tags](http://tatoeba.org/eng/tags/view_all) for more)
* sentence ratings
* contributors' self-reported skill in the language (as indicated in their profiles)
* whether the contributor is a self-reported native speaker (as indicated in their profiles and/or a separately-maintained [list of native speakers](http://bit.ly/nativespeakers))

Note that most sentences that do not have errors are not explicitly marked with an "OK" rating or tag. Taking all of this into account, you will probably need to perform both custom automated processing and manual review.
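
One way to combine an automated pass with manual review is sketched below. It is a minimal illustration, not a prescribed workflow: the `Sentence` record, the `select_sentences` function, the 100-character limit, and the representation of native speakers as a set of (username, language) pairs are all assumptions made for this example. How you build that set (from profiles or from the separately maintained list) is up to you.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Sentence(NamedTuple):
    """Hypothetical in-memory record for one corpus sentence."""
    id: int
    lang: str
    text: str
    owner: str


def select_sentences(
    sentences: Iterable[Sentence],
    native_speakers: Set[Tuple[str, str]],
    max_chars: int = 100,  # arbitrary cutoff for "particularly long"
) -> Tuple[List[Sentence], List[Sentence]]:
    """Split sentences into (kept, needs_review).

    Sentences longer than max_chars are dropped outright. Of the rest,
    those owned by a self-reported native speaker of the sentence's
    language are kept; all others are queued for manual review.
    """
    kept, needs_review = [], []
    for s in sentences:
        if len(s.text) > max_chars:
            continue
        if (s.owner, s.lang) in native_speakers:
            kept.append(s)
        else:
            needs_review.append(s)
    return kept, needs_review


sample = [
    Sentence(1, "eng", "I like tea.", "alice"),
    Sentence(2, "eng", "I am liking tea.", "bob"),
]
kept, needs_review = select_sentences(sample, {("alice", "eng")})
print(len(kept), len(needs_review))
```

Routing non-native contributions to a review queue rather than discarding them reflects the point that the absence of an "OK" mark does not mean a sentence is wrong.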

## Suggestions for Those Planning to Use the Corpus

* Tell your audience how you selected the sentences.
* Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.

## Download the Tatoeba Corpus

[Downloads](http://tatoeba.org/eng/download_tatoeba_example_sentences) are updated every Saturday.
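
Because the downloads are refreshed weekly, a project can check whether its local copy predates the most recent Saturday before deciding to fetch again. The following is a small sketch of that check. The function names `latest_saturday` and `is_stale` are hypothetical, and the exact time of day at which the export is published is not stated here, so a copy downloaded on a Saturday is treated as current.

```python
from datetime import date, timedelta


def latest_saturday(today: date) -> date:
    """Return the most recent Saturday on or before `today`."""
    # date.weekday() numbers Monday as 0, so Saturday is 5.
    days_since_saturday = (today.weekday() - 5) % 7
    return today - timedelta(days=days_since_saturday)


def is_stale(downloaded_on: date, today: date) -> bool:
    """True if the local copy was downloaded before the latest Saturday."""
    return downloaded_on < latest_saturday(today)


if is_stale(downloaded_on=date(2020, 5, 8), today=date.today()):
    print("Local copy predates the latest weekly export; consider updating.")
```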
