Notice

This page shows a previous version of the article.

Using the Tatoeba Corpus for Your Own Projects

Terms of Use

Read the Terms of Use. Note that the terms of use for the audio files are not the same as those for sentence text. See the list of audio lists to find out which license, if any, each contributor has offered for use of their files outside of tatoeba.org. You should verify these licenses by clicking "audio files" on each member's profile.

Processing the Tatoeba Corpus

You will probably want to filter out sentences that:

  • require correction or improvement
  • sound unnatural
  • are poor or unnatural translations of other sentences

You may also want to filter out those that:

  • contain vulgar language or sexual references
  • contain archaic or old-fashioned content
  • are particularly long

You can use various forms of metadata to aid with this process:

  • tags (for instance, "@change", "archaic", "vulgar"; see Tags for more)
  • sentence ratings
  • contributors' self-reported skill in the language (as indicated in their profiles)
  • whether the contributor is a self-reported native speaker (as indicated in their profiles and/or a separately-maintained list of native speakers)

Note that most sentences that do not have errors are not explicitly marked with an "OK" rating or tag. Taking all of this into account, you will probably need to perform both custom automated processing and manual review.
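
The automated part of this process might look something like the following minimal sketch, which assumes the sentences and their tags have already been loaded into memory. The Sentence class, the keep_sentence and split_for_review helpers, the 100-character limit, and the (language, username) representation of the native-speaker list are all hypothetical choices made for illustration; they are not part of any Tatoeba tool or export format. Only the three tag names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """Hypothetical in-memory record; not a Tatoeba export format."""
    sentence_id: int
    lang: str
    text: str
    author: str
    tags: frozenset = frozenset()


# Tags named in this article. Extend the set after consulting the Tags page.
EXCLUDED_TAGS = frozenset({"@change", "archaic", "vulgar"})

# Arbitrary cutoff for "particularly long" sentences (an assumption).
MAX_CHARS = 100


def keep_sentence(sentence, native_speakers,
                  max_chars=MAX_CHARS, excluded_tags=EXCLUDED_TAGS):
    """Return True if the sentence passes every automated filter.

    native_speakers is assumed to be a set of (lang, username) pairs
    built from whatever native-speaker list you maintain.
    """
    if sentence.tags & excluded_tags:
        return False  # carries at least one excluded tag
    if len(sentence.text) > max_chars:
        return False  # too long
    return (sentence.lang, sentence.author) in native_speakers


def split_for_review(sentences, native_speakers):
    """Split sentences into (kept, rejected) lists.

    Rejected sentences are returned rather than discarded so that they
    can be checked by hand if desired.
    """
    kept, rejected = [], []
    for sentence in sentences:
        if keep_sentence(sentence, native_speakers):
            kept.append(sentence)
        else:
            rejected.append(sentence)
    return kept, rejected


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    natives = {("eng", "alice")}
    corpus = [
        Sentence(1, "eng", "I like tea.", "alice"),
        Sentence(2, "eng", "Thou art late.", "alice", frozenset({"archaic"})),
        Sentence(3, "eng", "I like coffee.", "bob"),
    ]
    kept, rejected = split_for_review(corpus, natives)
    print("kept:", [s.sentence_id for s in kept])
    print("rejected:", [s.sentence_id for s in rejected])
```

Passing these filters does not mean a sentence is correct. Since most error-free sentences carry no explicit "OK" marking, the kept list is only a starting point for manual review.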

Suggestions for Those Planning to Use the Corpus

  • Tell your audience how you selected the sentences.
  • Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.

Download the Tatoeba Corpus

Downloads are updated every Saturday.