Version at: 03/10/2023, 10:02 vs. version at: 07/10/2023, 00:06

Version at: 07/10/2023, 00:06

# Using the Tatoeba Corpus for Your Own Projects

## Terms of Use

**Creative Commons License**

By default, Tatoeba licenses the text of sentences under the **Creative Commons Attribution 2.0 France license (CC-BY 2.0 FR)**. The "BY" element imposes a single condition on using, reusing, modifying and distributing a sentence: attribution. That is, these uses are allowed only if the name of the author is cited.

Read the complete [Terms of Use](http://tatoeba.org/eng/terms_of_use). 
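As an illustration of the attribution condition, here is a minimal Python sketch that builds a credit line for one sentence. The function name `format_attribution` and the wording of the credit line are hypothetical, not a format prescribed by Tatoeba; the Terms of Use and the FAQ entry on attribution remain the authority on what a proper credit must contain.

```python
def format_attribution(sentence_id, author, license_name="CC BY 2.0 FR"):
    """Build a one-line credit for a single sentence.

    The output format is a hypothetical example, not an official one.
    """
    if not author:
        # The license condition is attribution, so refuse to emit a credit without a name.
        raise ValueError("attribution requires the author's name")
    return f"Sentence #{sentence_id} by {author}, from Tatoeba ({license_name})"


print(format_attribution(1234, "example_user"))
```

Raising an error for a missing author keeps unattributed sentences from slipping silently into a published project.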

**License for Audio Files**

Note that the terms of use for the **audio files** are not the same as for the text of sentences.

See the [list of audio lists](https://tatoeba.org/eng/sentences_lists/of_user/CK/audio%20-/page:1/sort:modified/direction:desc) to find the license, if any, under which each audio contributor has offered their files for use outside of tatoeba.org. You should verify these licenses by clicking "audio files" on each member's profile.

## Processing the Tatoeba Corpus

You will probably want to filter out sentences that:

* require correction or improvement
* sound unnatural
* are poor or unnatural translations of other sentences

You may also want to filter out those that:

* contain vulgar language or sexual references
* contain archaic or old-fashioned content
* are untrue
* are sexist, racist, insulting to others, or otherwise inappropriate for your audience
* are particularly long

You can use various forms of metadata to aid with this process:

* tags (for instance, "@change", "archaic", "vulgar"; see [Tags](http://tatoeba.org/eng/tags/view_all) for more)
* sentence ratings
* contributors' self-reported skill in the language (as indicated in their profiles). Note that several members rate themselves as native speakers of multiple languages and that self-reported levels may not be accurate.
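The tag-based and length-based filtering described above could be sketched in Python roughly as follows. This is a minimal sketch under assumptions not stated on this page: that sentences arrive as tab-separated `id`, `language`, `text` rows and tags as tab-separated `sentence_id`, `tag` rows. The function names, the `EXCLUDED_TAGS` set and the `max_chars` threshold are hypothetical. Check the layout of the files you actually download before relying on it.

```python
import csv
import io

# Hypothetical choice of tags to exclude; the tag names are the examples given above.
EXCLUDED_TAGS = {"@change", "archaic", "vulgar"}


def load_tagged_ids(tag_file, excluded_tags=EXCLUDED_TAGS):
    """Return the ids of sentences carrying at least one excluded tag.

    Assumes each row is: sentence_id <TAB> tag_name.
    """
    reader = csv.reader(tag_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]) for row in reader if len(row) >= 2 and row[1] in excluded_tags}


def filter_sentences(sentence_file, rejected_ids, lang="eng", max_chars=80):
    """Yield (id, text) pairs that pass the language, tag and length filters.

    Assumes each row is: sentence_id <TAB> language_code <TAB> text.
    """
    reader = csv.reader(sentence_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed rows
        sentence_id, row_lang, text = int(row[0]), row[1], row[2]
        if row_lang != lang:
            continue  # wrong language
        if sentence_id in rejected_ids:
            continue  # carries an excluded tag
        if len(text) > max_chars:
            continue  # particularly long
        yield sentence_id, text


# Tiny in-memory sample standing in for the downloaded files.
sample_tags = io.StringIO("2\tvulgar\n3\tby a native speaker\n")
sample_sentences = io.StringIO(
    "1\teng\tI like tea.\n"
    "2\teng\tSome rude sentence.\n"
    "3\tfra\tJ'aime le thé.\n"
    "4\teng\t" + "very " * 30 + "long sentence.\n"
)

rejected = load_tagged_ids(sample_tags)
kept = list(filter_sentences(sample_sentences, rejected))
print(kept)
```

Quoting is disabled in the reader because sentence text can contain quotation marks that should be kept literally. Sentence ratings and contributors' skill levels could be applied in the same way, as additional sets of ids to accept or reject.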

If you are using the data to create language learning materials:

* You should probably use only sentences that you or someone else has personally proofread and not rejected, since you do not want to be teaching people errors.

* CK has selected a list of over 900,000 proofread English sentences for use in his projects, which you might find useful. The list can be downloaded: [tatoeba.org/eng/sentences_lists/show/907/und](https://tatoeba.org/eng/sentences_lists/show/907/und)

* [Sentences Tagged OK](https://tatoeba.org/en/tags/show_sentences_with_tag/7)

Note that not all sentences that are OK are explicitly marked with an "OK" rating or tag, and some sentences that do have errors are not marked with a negative rating or tag. Taking all of this into account, you will probably need to perform both custom automated processing and manual review.
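One way to combine automated processing with manual review is to sort sentences into three buckets and send only the undecided ones to a human. The sketch below assumes you have already built two sets of sentence ids, one for sentences someone has proofread and one for sentences with a negative tag or rating. The function name `triage` and the bucket names are hypothetical.

```python
def triage(sentence_ids, approved_ids, flagged_ids):
    """Split ids into accepted, rejected and needs-review buckets.

    approved_ids: ids that someone has proofread and not rejected.
    flagged_ids: ids carrying a negative tag or rating.
    A flag wins over an approval, so an id in both sets is rejected.
    """
    buckets = {"accepted": [], "rejected": [], "review": []}
    for sentence_id in sentence_ids:
        if sentence_id in flagged_ids:
            buckets["rejected"].append(sentence_id)
        elif sentence_id in approved_ids:
            buckets["accepted"].append(sentence_id)
        else:
            # Neither approved nor flagged: unmarked sentences may still contain errors.
            buckets["review"].append(sentence_id)
    return buckets


result = triage([1, 2, 3, 4], approved_ids={1, 2}, flagged_ids={2, 3})
print(result)
```

Letting a flag override an approval is a cautious design choice for learning materials; a project with different priorities might resolve the conflict the other way or route such sentences to review.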


## Suggestions for Those Planning to Use the Corpus

* Tell your audience how you selected the sentences.
  * See an example on this page: [www.manythings.org/bilingual](http://www.manythings.org/bilingual/)
* Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.

## Download the Tatoeba Corpus

[Downloads](http://tatoeba.org/eng/download_tatoeba_example_sentences) are updated every Saturday.
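Because the exports are refreshed weekly, a project can check whether its local copy predates the most recent Saturday. The following Python sketch compares calendar dates only; the time of day at which the export is published is not stated here, so a copy fetched on a Saturday is treated as current. The function names are hypothetical.

```python
from datetime import date, timedelta


def latest_saturday(today):
    """Return the most recent Saturday on or before `today`."""
    # date.weekday() numbers Monday as 0, so Saturday is 5.
    days_since_saturday = (today.weekday() - 5) % 7
    return today - timedelta(days=days_since_saturday)


def is_stale(last_download, today):
    """True if a Saturday has passed since the local copy was downloaded."""
    return last_download < latest_saturday(today)


# A copy fetched on Tuesday 3 October 2023, checked the following Tuesday.
print(is_stale(date(2023, 10, 3), date(2023, 10, 10)))
```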

## FAQ

* [How do I give proper attribution?](https://en.wiki.tatoeba.org/articles/show/faq#i-would-like-to-use-tatoeba's-data-for-my-project.)
* [Where can I download Tatoeba's audio data?](https://en.wiki.tatoeba.org/articles/show/faq#where-can-i-download-tatoeba's-audio-data?)
* [How can I download all sentences and translations in specific languages?](https://en.wiki.tatoeba.org/articles/show/faq#how-can-i-download-all-sentences-and-translations-)
* [API](https://en.wiki.tatoeba.org/articles/show/api)
