Notice

This page shows a previous version of the article.

Using the Tatoeba Corpus for Your Own Projects

Terms of Use

  • Read the Terms of Use.
  • Note that the Terms of Use for the audio files are not the same as those for the text of sentences. See the list of audio lists for the sentences that have audio and the license, if any, under which each contributor has offered their files for use outside of tatoeba.org.

Warning: The Tatoeba Corpus is not error-free.

  • Due to the nature of a public collaborative project, this data will never be 100% free of errors.
  • Be aware of the following.
    • Though we recommend native-speaker contributions, a number of non-native speakers have contributed in languages they are learning.
    • We ask our members not to change archaic language to something that currently sounds natural.
  • Translations may not always be accurate, even when each of the linked sentences is a correct sentence on its own.

Suggestions for Those Planning to Use the Corpus

  • Don't use the whole corpus as-is. Filter out obviously suspect items first, such as items tagged @need native check, @change, archaic, or non-sentence. (Browse Tags to find others.)
  • You may want to eliminate all sentences not "owned" by native speakers. However, even this will not guarantee perfect data. (See Tatoeba.org Native Speakers, maintained by CK.)
  • You should inform your audience that the data may contain errors (see an example) and explain what steps you have taken to minimize them.
  • Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.
  • If you are creating materials for people studying a foreign language, you might want to use only sentences you have personally proofread. This helps make sure that what you are teaching people isn't a mistake. (See a live example on the right side of this page: http://www.manythings.org/bilingual/)
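
The first two suggestions above (dropping suspect tags and keeping only sentences owned by native speakers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an official tool: it assumes tab-separated rows of the form (sentence id, language, text, owner) for sentences and (sentence id, tag name) for tags, so check the actual column layout of the files you download. The function names and the `native_owners` set are hypothetical, and the sample rows are made up for the demonstration.

```python
import csv
import io

# Tags named in the suggestions above; extend this set after browsing the tag list.
SUSPECT_TAGS = {"@need native check", "@change", "archaic", "non-sentence"}


def read_tsv(handle):
    """Return a reader over tab-separated rows, leaving quote characters untouched."""
    return csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def suspect_ids(tag_rows, suspect_tags=SUSPECT_TAGS):
    """Return the ids of sentences carrying at least one suspect tag.

    Assumption: each row is (sentence_id, tag_name).
    """
    return {row[0] for row in tag_rows if len(row) >= 2 and row[1] in suspect_tags}


def filter_sentences(sentence_rows, tag_rows, native_owners=None):
    """Keep sentences with no suspect tag and, optionally, a native-speaker owner.

    Assumption: each sentence row is (sentence_id, lang, text, owner).
    native_owners is a hypothetical set of (lang, username) pairs that you
    build yourself; pass None to skip the ownership check.
    """
    bad = suspect_ids(tag_rows)
    kept = []
    for row in sentence_rows:
        if len(row) < 4:
            continue  # skip malformed rows rather than guessing at their meaning
        sentence_id, lang, text, owner = row[:4]
        if sentence_id in bad:
            continue
        if native_owners is not None and (lang, owner) not in native_owners:
            continue
        kept.append((sentence_id, lang, text))
    return kept


# Made-up sample data standing in for the downloaded files.
sentences = io.StringIO(
    "1\teng\tHello.\talice\n"
    "2\teng\tThou art kind.\tbob\n"
    "3\tfra\tBonjour.\talice\n"
)
tags = io.StringIO("2\tarchaic\n3\tgreeting\n")
natives = {("eng", "alice"), ("eng", "bob")}

kept = filter_sentences(read_tsv(sentences), read_tsv(tags), natives)
print(kept)  # [('1', 'eng', 'Hello.')]
```

In this sample, sentence 2 is dropped because of its archaic tag and sentence 3 because its owner is not listed as a native speaker of that language. As noted above, filtering of this kind reduces errors but does not guarantee perfect data, so proofreading is still worthwhile for teaching materials.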

Download the Tatoeba Corpus