Using the Tatoeba Corpus for Your Own Projects
Warning: The Tatoeba Corpus is not error-free.
- Due to the nature of a public collaborative project, this data will never be 100% free of errors.
- Be aware of the following:
- Though we recommend native-speaker contributions, a number of non-native speakers have contributed in languages they are learning.
- We ask our members not to change archaic language to something that currently sounds natural, so some sentences are deliberately archaic.
- Translations may not always be accurate, even when each linked sentence is correct on its own.
Suggestions for Those Planning to Use the Corpus
- Don't use the whole corpus as-is. First filter out obviously suspect items, such as those tagged @need native check, @change, archaic, or non-sentence. (Browse Tags to find others.)
- You may want to eliminate all sentences not "owned" by native speakers. However, even this will not guarantee perfect data. (See Tatoeba.org Native Speakers, maintained by CK.)
- You should inform your audience that the data may contain errors (see an example) and explain what steps you have taken to minimize them.
- Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.
- If you are creating materials for people studying a foreign language, you might want to use only sentences you have personally proofread. This helps ensure that you are not teaching people a mistake.
*(See a live example on the right side of this page: http://www.manythings.org/bilingual/).
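The first two suggestions above (dropping items with suspect tags and keeping only sentences owned by native speakers) can be sketched in Python. This is a minimal illustration under stated assumptions, not Tatoeba's own tooling: the `Sentence` record, the `filter_sentences` function, and the shape of the inputs (pairs of sentence ID and tag name, pairs of language code and username) are hypothetical names chosen for the example. You would need to load these values from whichever export files you actually download.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple

# Tag names taken from the suggestions above, lowercased for comparison.
# Extend this set after browsing the tag list.
SUSPECT_TAGS = {"@need native check", "@change", "archaic", "non-sentence"}


class Sentence(NamedTuple):
    """Hypothetical record for one corpus sentence."""
    sentence_id: int
    lang: str    # language code of the sentence
    text: str
    owner: str   # username of the member who "owns" the sentence


def filter_sentences(
    sentences: Iterable[Sentence],
    tags: Iterable[Tuple[int, str]],
    native_speakers: Iterable[Tuple[str, str]],
) -> List[Sentence]:
    """Keep sentences that have no suspect tag and a native-speaker owner.

    tags:            (sentence_id, tag_name) pairs (assumed layout)
    native_speakers: (lang, username) pairs for self-declared natives (assumed layout)
    """
    # IDs of every sentence carrying at least one suspect tag.
    flagged: Set[int] = {
        sid for sid, tag in tags if tag.strip().casefold() in SUSPECT_TAGS
    }
    natives: Set[Tuple[str, str]] = set(native_speakers)
    return [
        s for s in sentences
        if s.sentence_id not in flagged and (s.lang, s.owner) in natives
    ]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Sentence(1, "eng", "Hello.", "alice"),
        Sentence(2, "eng", "Thou art kind.", "alice"),
        Sentence(3, "eng", "I has a pen.", "bob"),
    ]
    sample_tags = [(2, "archaic")]
    sample_natives = [("eng", "alice")]
    for kept in filter_sentences(sample, sample_tags, sample_natives):
        print(kept.sentence_id, kept.text)
```

As the suggestions note, a filter like this only reduces errors. It does not replace proofreading, and it should be rerun whenever you refresh your copy of the data.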
Download the Tatoeba Corpus