Notice

This page shows a previous version of the article.

How to mass import sentences

Tatoeba has a feature to mass import sentences, but it is restricted to admins. It is not very user-friendly and can be harmful in the wrong hands.

The page to mass import sentences is: https://tatoeba.org/eng/sentences/import

There are two sections on this page:

  • Single sentences
  • Sentences and translations

Single sentences

This section allows you to import a list of sentences in a single language. You cannot import a list that mixes sentences in different languages.

The format of the file is simple: each line is one sentence.

Example: https://gist.github.com/trang/50497ce80f494d2be801879ed3193129
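If your sentences come from another source (a spreadsheet export, a script, etc.), a small helper can produce the one-sentence-per-line layout described above. This is only a sketch: the function name `to_import_lines` is made up for illustration, and it assumes that line breaks inside a sentence and empty entries are unwanted.

```python
def to_import_lines(sentences):
    """Return file content with exactly one sentence per line.

    Assumptions (not stated by Tatoeba): whitespace inside a sentence,
    including line breaks and tabs, can be collapsed to single spaces,
    and empty entries should be skipped.
    """
    cleaned = []
    for sentence in sentences:
        # split() with no argument splits on any whitespace run,
        # so this also removes leading and trailing whitespace.
        text = " ".join(sentence.split())
        if text:
            cleaned.append(text)
    return "\n".join(cleaned) + "\n"


content = to_import_lines(["Hello.", "  How are you?  ", "", "Line one\nline two."])
print(content)
```

The resulting string can then be written to a file opened with `encoding="utf-8"`, which matches the encoding requirement for imports.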

The form to import sentences has 3 fields:

  • Language of the sentences: this is a dropdown for you to choose the language of the sentences.
  • File: this is a button which will open a popup for you to select the file to import.
  • Numeric user id: this is a text input for you to enter the ID of the user.

Sentences and translations

This section allows you to import sentences and their translations. Again, it is not possible to mix languages.

The format of the file is: sentence [tab] translation. The separator between the sentence and the translation must be a tab, and only a tab. It cannot be a sequence of spaces.

Example: https://gist.github.com/trang/149ac3faddee6f3bdfde5ae97a303816
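Because a run of spaces looks just like a tab in many editors, it can help to check the file before uploading it. The sketch below is not part of Tatoeba; the function name `find_bad_pair_lines` is hypothetical, and it assumes that every line should contain exactly one tab with non-empty text on both sides.

```python
def find_bad_pair_lines(text):
    """Return (line_number, reason) tuples for lines that are not
    'sentence<TAB>translation'."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        parts = line.split("\t")
        if len(parts) != 2:
            # Zero tabs (e.g. spaces used as separator) or more than one tab.
            problems.append((number, f"expected 1 tab, found {len(parts) - 1}"))
        elif not parts[0].strip() or not parts[1].strip():
            problems.append((number, "empty sentence or translation"))
    return problems


sample = "Hello.\tBonjour.\nGood night.    Bonne nuit.\n"
print(find_bad_pair_lines(sample))  # → [(2, 'expected 1 tab, found 0')]
```

An empty result means every line has the expected shape; it says nothing about the quality of the sentences themselves.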

The form has 4 fields:

  • Language of the sentences: this is a dropdown for you to choose the language of the sentences.
  • Language of the translations: this is a dropdown for you to choose the language of the translations.
  • File: this is a button which will open a popup for you to select the file to import.
  • Numeric user id: this is a text input for you to enter the ID of the user.

How to find the user ID

  • Go to the List of all members.
  • Search for the user in the sidebar.
  • This will lead you to the latest activity page of the user.
  • This page can also be accessed from the user profile, by clicking on the link "show latest activity" in the "Stats" section in the sidebar.
  • Check the URL of the page. It looks like https://tatoeba.org/eng/users/show/{number}. The {number} is the user ID.
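If you copy the URL from the browser, the ID can be pulled out with a few lines of code rather than by eye. This is a minimal sketch that assumes the URL has the shape shown above; the function name `user_id_from_url` and the ID 12345 are purely illustrative.

```python
import re
from urllib.parse import urlparse


def user_id_from_url(url):
    """Return the numeric user ID from a .../users/show/{number} URL,
    or None if the URL does not have that shape."""
    # Look only at the path so that a query string or fragment is ignored.
    path = urlparse(url).path
    match = re.search(r"/users/show/(\d+)/?$", path)
    return int(match.group(1)) if match else None


# 12345 is a made-up ID used only for illustration.
print(user_id_from_url("https://tatoeba.org/eng/users/show/12345"))  # → 12345
```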

Encoding requirement

The file that you upload must be encoded in UTF-8. If you import a file that is not encoded in UTF-8 and the sentences contain special characters, the imported text will be garbled.

You can check the encoding of your file online. Your file is encoded in UTF-8 if the result says Detected by Chared: utf_8.

There are also ways to check offline, depending on which OS you are using. The easiest approach is to search the web (for instance: detect file encoding).
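One OS-independent offline check is to try decoding the file's bytes as UTF-8. The sketch below uses only the Python standard library; the function name `is_utf8` is made up for illustration. Note the limits of this approach: it tells you whether the bytes are valid UTF-8, not which other encoding the file uses if they are not, and a plain ASCII file always passes because ASCII is a subset of UTF-8.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the given bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8("café".encode("utf-8")))    # → True
print(is_utf8("café".encode("latin-1")))  # → False
```

To check a real file, read it in binary mode and pass the bytes in, for example `is_utf8(open("sentences.txt", "rb").read())`, where `sentences.txt` stands for your own file.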

Warnings

(1) There is no safety net with this feature. Once you click "import", all the sentences will be processed, and there is no easy way to cancel or revert.

  • It's a good idea to limit each import to fewer than 400 sentences. Larger imports sometimes work, but keeping each import small is better for others using the website at the same time.
  • Make sure the encoding is UTF-8.
  • Make sure you entered the correct user ID.
  • Make sure you selected the correct language(s).
  • Don't import files from just anyone who asks you. Only import from trusted contributors or you could be importing a ton of bad quality sentences.
    • For single sentence imports, please only import sentences by native speakers.
    • For imports of sentences and translations, be sure that the language of one side of the pair is the contributor's native language.
  • You no longer have to worry about duplicates, because Tatoeba handles them.
  • It is recommended to run a test on the dev website first, and to upload your file to the main website only if the test went well.
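To stay under the suggested size, a large file can be split into several smaller ones before importing. The following is a sketch only: `split_into_batches` is a hypothetical helper, and it assumes that the limit of fewer than 400 applies to the number of lines in a file.

```python
def split_into_batches(lines, max_per_batch=399):
    """Split a list of lines into consecutive batches of at most
    max_per_batch lines each (399 keeps every batch below 400)."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    return [
        lines[start:start + max_per_batch]
        for start in range(0, len(lines), max_per_batch)
    ]


sentences = [f"Sentence {i}." for i in range(1000)]
batches = split_into_batches(sentences)
print([len(batch) for batch in batches])  # → [399, 399, 202]
```

Each batch can then be written to its own UTF-8 file and imported separately, which also makes it easier to stop early if the first import reveals a problem.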

(2) The logs of the sentences will indicate you (the admin) as the person who created the sentences.

  • If you are importing sentences for another user, make sure they are aware of this and that they don't mind.
  • The user will still be assigned as the owner, but the logs will state that you are the contributor, and the "Contributions" stats will increase on your profile and not on the owner's profile.

(3) When importing sentences and translations, it is technically impossible to know which sentence is the original and which one is the translation. Both sentences and translations will be considered originals.

  • The mass import feature functions as if someone was adding the sentence, then adding the translation separately as a new sentence (instead of clicking on the "Translate" button), and then using the linking feature to connect the sentence and the translation together.
  • As a result, it doesn't matter if the file you are importing has the translations in the first column and the original sentences in the second column.
  • However, this also means that the mass import feature does not accurately reflect reality. It loses information that is not critical but could be useful in some cases (for instance, for handling the permission to change the license of a sentence).