Notice

This page shows a previous version of the article.

Using the Tatoeba Corpus for Your Own Projects

Terms of Use

Warning: The Tatoeba Corpus is not error-free.

  • Due to the nature of a public collaborative project, this data will never be 100% free of errors.
  • Be aware of the following:
    • We allow non-native speakers to contribute in languages they are learning.
    • We ask our members not to change archaic language to something that currently sounds natural.
    • We allow our members to submit book titles and other things you might not consider sentences.
  • Translations may not always be accurate, even when each of the linked sentences is correct on its own.

Suggestions for Those Planning to Use the Corpus

  • Don't use the whole corpus as is. Filter out obviously suspect items first, such as items tagged @need native check, @change, archaic, non-sentence, etc. (Browse Tags to find others.)
  • You may want to eliminate all sentences not "owned" by native speakers. However, even this will not guarantee perfect data. (See Tatoeba.org Native Speakers, maintained by CK.)
  • You should inform your audience that the data may contain errors (see an example) and explain what steps you have taken to minimize them.
  • Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.
  • If you are creating materials for people studying a foreign language, you might want to use only sentences you have personally proofread. This helps make sure that you are not teaching people a mistake.
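
The first two suggestions above (dropping items with suspect tags and keeping only sentences owned by native speakers) can be sketched in Python as follows. This is a minimal illustration, not an official tool. It assumes two tab-separated inputs whose column layouts are hypothetical here: a tags file with rows of sentence id and tag name, and a sentences file with rows of sentence id, language code, text, and owner username. Check the actual layout of the files you download before relying on it. The function names and the `native_owners` set are also made up for this example.

```python
import csv
import io

# Tags named in the suggestions above; extend this set after browsing the tag list.
SUSPECT_TAGS = {"@need native check", "@change", "archaic", "non-sentence"}


def load_suspect_ids(tags_file, suspect_tags=SUSPECT_TAGS):
    """Return the ids of sentences that carry at least one suspect tag.

    Assumed row layout (tab-separated): sentence_id, tag_name.
    """
    reader = csv.reader(tags_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0] for row in reader if len(row) >= 2 and row[1] in suspect_tags}


def filter_sentences(sentences_file, suspect_ids, native_owners=None):
    """Yield (sentence_id, lang, text) for rows that pass the filters.

    Assumed row layout (tab-separated): sentence_id, lang, text, owner.
    native_owners is an optional set of (lang, username) pairs. When it is
    given, a sentence is kept only if its owner is listed as a native
    speaker of the sentence's language.
    """
    reader = csv.reader(sentences_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip malformed rows
        sentence_id, lang, text, owner = row[:4]
        if sentence_id in suspect_ids:
            continue  # tagged as suspect
        if native_owners is not None and (lang, owner) not in native_owners:
            continue  # not owned by a listed native speaker
        yield sentence_id, lang, text


# Tiny in-memory example; real use would pass open file objects instead.
tags = io.StringIO("2\t@change\n3\tarchaic\n")
sentences = io.StringIO(
    "1\teng\tHello.\talice\n"
    "2\teng\tHelo.\talice\n"
    "3\teng\tThou art kind.\tbob\n"
    "4\teng\tGood morning.\tcarol\n"
)
suspect = load_suspect_ids(tags)
kept = list(filter_sentences(sentences, suspect, native_owners={("eng", "alice")}))
print(kept)
```

In this example only sentence 1 survives: sentences 2 and 3 carry suspect tags, and sentence 4 is not owned by a listed native speaker. As noted above, even this filtering does not guarantee error-free data.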

Download the Tatoeba Corpus

  • Downloads are updated every Saturday at 9:00 a.m. (GMT).
  • Note: The files are actually ready about 15 minutes after that.
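
If your project refreshes its copy of the corpus automatically, it can help to compute when the next export should be ready. The sketch below takes the schedule stated above (Saturday, 9:00 a.m. GMT, plus about 15 minutes) and treats GMT as UTC, which is an assumption. The function name is hypothetical, and the 15-minute margin is only approximate, so a real job may want a larger buffer.

```python
from datetime import datetime, timedelta, timezone


def next_download_ready(now):
    """Return the next Saturday 09:15 UTC at or after `now`.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    days_ahead = (5 - now.weekday()) % 7  # Monday is 0, Saturday is 5
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=9, minute=15, second=0, microsecond=0
    )
    if candidate < now:
        # It is already Saturday and past 09:15, so wait for next week.
        candidate += timedelta(days=7)
    return candidate


print(next_download_ready(datetime.now(timezone.utc)))
```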