Using the Tatoeba Corpus for Your Own Projects
Warning: The Tatoeba Corpus is not error-free.
- Due to the nature of a public collaborative project, this data will never be 100% free of errors.
- Be aware of the following:
- We allow non-native speakers to contribute in languages they are learning.
- We ask our members not to change archaic language to something that currently sounds natural.
- We allow our members to submit book titles and other things you might not consider sentences.
- Translations may not always be accurate, even when each linked sentence is correct on its own.
Suggestions for Those Planning to Use the Corpus
- Don't use the whole corpus. Filter out obviously suspect items first, such as those tagged @need native check, @change, archaic, or non-sentence. (Browse Tags to find others.)
- You may want to eliminate all sentences not "owned" by native speakers. However, even this will not guarantee perfect data. (See Tatoeba.org Native Speakers, maintained by CK.)
- You should inform your audience that the data may contain errors (See an example) and explain what steps you have taken to help minimize the errors.
- Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.
- If you are creating materials for people studying a foreign language, you might want to use only sentences you have personally proofread. This helps ensure that you aren't teaching people a mistake.
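The filtering suggested above can be sketched in Python. This is a minimal sketch, not Tatoeba's own tooling. It assumes tab-separated exports in which a sentence row is (sentence ID, language code, text, owner username) and a tag row is (sentence ID, tag name); check the column layout of the files you actually download. The helper names (`read_tsv`, `suspect_ids`, `filter_sentences`) and the `native_speakers` set of (language, username) pairs are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io

# Tags treated as suspect. The names come from the suggestions above;
# extend the set after browsing the tag list.
SUSPECT_TAGS = {"@need native check", "@change", "archaic", "non-sentence"}


def read_tsv(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def suspect_ids(tag_rows, suspect_tags=SUSPECT_TAGS):
    """Return the IDs of sentences carrying at least one suspect tag.

    Assumes each row is (sentence_id, tag_name). Tag names are compared
    case-insensitively, which is a choice made for this sketch.
    """
    lowered = {tag.lower() for tag in suspect_tags}
    return {
        row[0]
        for row in tag_rows
        if len(row) >= 2 and row[1].strip().lower() in lowered
    }


def filter_sentences(sentence_rows, tag_rows, native_speakers=None):
    """Keep sentences with no suspect tag and, optionally, a native owner.

    Assumes each sentence row is (sentence_id, lang, text, owner).
    native_speakers is a hypothetical set of (lang, username) pairs;
    pass None to skip the ownership check.
    """
    bad = suspect_ids(tag_rows)
    kept = []
    for row in sentence_rows:
        if len(row) < 4:
            continue  # skip blank or malformed rows
        sentence_id, lang, text, owner = row[:4]
        if sentence_id in bad:
            continue
        if native_speakers is not None and (lang, owner) not in native_speakers:
            continue
        kept.append((sentence_id, lang, text))
    return kept


if __name__ == "__main__":
    # Made-up sample data in the assumed layout.
    sentences = read_tsv(
        "1\teng\tHello.\talice\n"
        "2\teng\tThou art kind.\tbob\n"
        "3\tfra\tBonjour.\talice\n"
    )
    tags = read_tsv("2\tarchaic\n")
    natives = {("eng", "alice"), ("eng", "bob")}
    print(filter_sentences(sentences, tags, natives))
```

With the sample data, sentence 2 is dropped because of its tag and sentence 3 is dropped because its owner is not listed as a native French speaker, leaving only sentence 1. As noted above, even this filtering will not guarantee perfect data.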
Download the Tatoeba Corpus