Notice

This page shows a previous version of the article.

How to mass import sentences

[The information on this page is outdated: this feature is currently disabled and is to be rewritten at some point.]

Tatoeba has a feature to mass import sentences, but it is restricted to admins. It is not designed to be user-friendly for all contributors, and it can be harmful in the wrong hands.

The page to mass import sentences is: https://tatoeba.org/eng/sentences/import

There are two sections on this page:

  • Single sentences
  • Sentences and translations

Single sentences

This section allows you to import a list of sentences that are all in the same language. You cannot import a list of sentences in a mix of different languages.

The format of the file is simple: each line is one sentence.

Example: https://gist.github.com/trang/50497ce80f494d2be801879ed3193129

The form to import sentences has 3 fields:

  • Language of the sentences: a dropdown for you to choose the language of the sentences
  • File: a button which will open a popup for you to select the file to import
  • Numeric user id: a text input for you to enter the ID of the user

Sentences and translations

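This section allows you to import sentences and their translations. As with single sentences, languages cannot be mixed: all the sentences must be in one language, and all the translations must be in one language.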

The format of the file is sentence [tab] translation. The separator between the sentence and translation must be a tab. It cannot be a space or sequence of spaces.

Example: https://gist.github.com/trang/149ac3faddee6f3bdfde5ae97a303816
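
Since the separator must be exactly one tab, it can help to check a file before uploading it. The following is a minimal sketch, not part of Tatoeba itself; the function name find_malformed_lines and the sample text are made up for illustration. It reports the lines that do not have the form sentence [tab] translation.

```python
from typing import List


def find_malformed_lines(text: str) -> List[int]:
    """Return 1-based numbers of lines that are not 'sentence<TAB>translation'."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        parts = line.split("\t")
        # A valid line has exactly one tab with non-empty text on both sides.
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            bad.append(number)
    return bad


# Hypothetical sample: line 2 uses spaces instead of a tab.
sample = "Hello.\tBonjour.\nGood night.  Bonne nuit.\nThanks.\tMerci.\n"
print(find_malformed_lines(sample))  # [2]
```

An empty result means every line has exactly one tab with text on each side. It says nothing about whether the languages are correct.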

The form has 4 fields:

  • Language of the sentences: a dropdown to choose the language of the sentences
  • Language of the translations: a dropdown to choose the language of the translations
  • File: a button which will open a popup to select the file to import
  • Numeric user id: a text input to enter the ID of the user

How to find the user ID

  • Go to the List of all members.
  • Search for the user in the sidebar.
  • This will lead you to the "latest activity" page for the user.
  • This page can also be accessed from the user profile by clicking on the "show latest activity" link in the "Stats" section in the sidebar.
  • Check the URL of the page. It will look like https://tatoeba.org/eng/users/show/{number}. The {number} is the user ID.
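
If you need to do this for several users, the last step can be sketched in code. This is only an illustration that assumes the URL pattern described above; the function name user_id_from_url and the ID 12345 are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def user_id_from_url(url: str) -> Optional[int]:
    """Extract the numeric user ID from a .../users/show/{number} URL."""
    path = urlparse(url).path  # drops any query string or fragment
    match = re.search(r"/users/show/(\d+)/?$", path)
    return int(match.group(1)) if match else None


# 12345 is a made-up ID used only for this example.
print(user_id_from_url("https://tatoeba.org/eng/users/show/12345"))  # 12345
```

The function returns None when the URL does not end in a numeric ID, which is a sign that you are not on the "latest activity" page.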

Encoding requirement

The file that you upload must be encoded in UTF-8. If you import a file that is not encoded in UTF-8 and the sentences contain special characters, the imported text will be garbled.

You can check the encoding of your file online. Your file is encoded in UTF-8 if the result says Detected by Chared: utf_8.

There are also ways to check offline, depending on which OS you are using. The simplest approach is to search the web (for instance: detect file encoding).
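
One OS-independent offline check is to try decoding the file strictly as UTF-8. The sketch below does that; it is a different technique from the online detector mentioned above, and the function names are made up for illustration.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding raises on invalid bytes
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path: str) -> bool:
    """Read a file as raw bytes and check whether it is valid UTF-8."""
    with open(path, "rb") as handle:
        return is_utf8(handle.read())


print(is_utf8("café".encode("utf-8")))    # True
print(is_utf8("café".encode("latin-1")))  # False
```

A True result only means the bytes are valid UTF-8. A plain ASCII file also passes, which is fine because ASCII is a subset of UTF-8.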

Warnings

(1) There is no safety net with this feature. Once you click "import", all the sentences will be processed, and there is no easy way to cancel or revert.

  • Make sure the encoding is UTF-8.
  • Make sure you entered the correct user ID.
  • Make sure you selected the correct language(s).
  • Don't import files from just anyone who asks you. Only import from trusted contributors.
    • To avoid importing a large number of poor-quality sentences, you should import only sentences that are in the native/strongest language of the user.
    • Do not import the sentences if you suspect there may be copyright/license issues. Ask the user if they wrote all the sentences themselves.
  • You no longer have to worry about duplicates because Tatoeba handles them. If you accidentally import a file that you have already imported, it should have no effect.
  • Do a test on the dev website first, and only upload your file on the main website if it went well on the dev website.

(2) The mass import feature can slow down the website.

  • It's a good idea to limit the number of sentences imported per operation to 400 or fewer. Larger imports will work sometimes, but it's better for others using the website at the same time if you don't import too many at once.
  • It's also a good idea to import on weekends rather than on weekdays, as there is usually less traffic. Saturday and Sunday between 00:00 and 09:00 (GMT+1) are when Tatoeba has the least traffic.
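
To stay within the suggested limit, a large file can be split into smaller batches before importing. The following is a minimal sketch; the function name split_into_batches is hypothetical, and the default of 400 simply mirrors the recommendation above.

```python
from typing import List


def split_into_batches(lines: List[str], batch_size: int = 400) -> List[List[str]]:
    """Split a list of lines into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]


# Made-up data: 1000 placeholder sentences.
sentences = [f"Sentence {n}." for n in range(1000)]
batches = split_into_batches(sentences)
print([len(batch) for batch in batches])  # [400, 400, 200]
```

Each batch can then be saved to its own UTF-8 file and imported as a separate operation.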

(3) The logs of the sentences will indicate you (the admin) as the person who created the sentences.

  • If you are importing sentences for another user, make sure they are aware of this and do not mind.
  • The user will still be assigned as the owner, but the logs will state that you are the contributor, and the "Contributions" stats will increase on your profile and not on the owner's profile.

(4) When sentences and translations are imported, the interface does not record which sentence is actually the original and which one is the translation. Both sentences and translations will be considered as original.

  • The mass import feature functions as if someone were adding the sentence, then adding the translation separately as a new sentence (instead of clicking on the "Translate" button), and then using the linking feature to connect the sentence and the translation.
  • As a result, it doesn't matter whether the file you are importing has the translations in the first column and the original sentences in the second column, or vice versa.
  • However, this also means that the mass import feature does not reflect reality accurately. It loses information that is not completely critical, but could be useful in some cases (for instance for handling the permission to change the license of a sentence).