Goals of the Tatoeba Project

Tatoeba is an open-source project for collecting a large database of sentences and their translations into other languages. See the Wikipedia page on Tatoeba for more background and this post for a history of the project. Anyone can use the collection of sentences and translations for free for any purpose, private or public, open-source or commercial. Trang, the founder of the Tatoeba project, has expressed the goal as follows: "... we gather a lot of data, try to organize it, ensure it is of good quality and make it freely accessible, downloadable and redistributable, so that anyone who has a great idea for a language learning application (or a language tool) can just focus on coding the application and rely on us to provide data of excellent quality."

In addition to serving the public as a reference for translations of sentences from one language to another, the site is also of interest to those who study the process of translation in a collaborative community.

Language Learning at Tatoeba

While Tatoeba offers many helpful resources to language learners, language instruction is not its core mission. In contrast to multilanguage sites such as Lang-8 or single-language sites such as lernu.net for Esperanto, Tatoeba does not formally offer any of the following:

  • structured presentation of grammar or vocabulary
  • mentoring (in which an instructor is formally assigned to a student)
  • a way to segregate material that is written as an exercise from material intended to serve as a reference to other language learners
  • a dedicated message system for requesting translation of particular sentences
  • a means of indicating the skill level of a contributor in any particular language
  • a means of grouping sentences by difficulty

All of these items can be addressed informally on Tatoeba. For instance, one can find self-identified native speakers by using an externally published list, and then send a private message requesting translation of particular sentences. Sentences can be tagged to indicate their unsuitability as a reference. But this must all be done on an ad hoc basis; there is no comprehensive, dedicated infrastructure for these tasks.

In order to learn a language, one must be willing to say and write sentences that are incorrect, or that at least do not sound like those a native speaker would produce, in the hope that the errors will be corrected and explained. In this respect, a language learner who submits sentences on Tatoeba is likely not only to have difficulty finding someone to correct them, but also to work against the overall goal of providing a high-quality reference for others.

Native-Level Sentences

Ideally, every sentence that might be used for reference would meet these criteria:

  • it could be produced by a speaker who had spoken the language surrounded by native speakers from early childhood to early adulthood
  • it would not be corrected by most such speakers, including the original speaker, if the speaker could see it written out

A sentence might contain errors when it is first added, but the idea is that the collaborative review process should correct them within a reasonably short time afterward.

