Version at: 10/11/2018, 20:01 vs. version at: 10/11/2018, 20:02

Version at: 10/11/2018, 20:01

# How to mass import sentences

Tatoeba has a feature to mass import sentences, but it is restricted to admins. It is not very user-friendly and can be harmful in the wrong hands.

The page to mass import sentences is: [https://tatoeba.org/eng/sentences/import](https://tatoeba.org/eng/sentences/import)

There are two sections on this page:

- Single sentences
- Sentences and translations

## Single sentences

This section allows you to import a list of sentences that are all in the **same language**. You cannot import a list that mixes different languages.

The format of the file is simple: each line is one sentence. 

Example: [https://gist.github.com/trang/50497ce80f494d2be801879ed3193129](https://gist.github.com/trang/50497ce80f494d2be801879ed3193129)


The form to import sentences has 3 fields:

- **Language of the sentences**: this is a dropdown for you to choose the language of the sentences.
- **File**: this is a button which will open a popup for you to select the file to import.
- **Numeric user id**: this is a text input for you to enter the ID of the user.

## Sentences and translations

This section allows you to import sentences and their translations. Again, it is not possible to mix languages.

The format of the file is: sentence [tab] translation. The separator between the sentence and the translation must be a single tab character. It cannot be a sequence of several spaces.

Example: [https://gist.github.com/trang/149ac3faddee6f3bdfde5ae97a303816](https://gist.github.com/trang/149ac3faddee6f3bdfde5ae97a303816)

The form has 4 fields:

- **Language of the sentences**: this is a dropdown for you to choose the language of the sentences.
- **Language of the translations**: this is a dropdown for you to choose the language of the translations.
- **File**: this is a button which will open a popup for you to select the file to import.
- **Numeric user id**: this is a text input for you to enter the ID of the user.

## How to find the user ID

* Go to the [List of all members](https://tatoeba.org/eng/users/all).
* Search for the user in the sidebar.
* This will lead you to the latest activity page of the user.
* This page can also be accessed from the user profile, by clicking on the link "show latest activity" in the "Stats" section in the sidebar.
* Check the URL of the page. It looks like `https://tatoeba.org/eng/users/show/{number}`. The `{number}` is the user ID.

## Encoding requirement

The file that you upload must be encoded in **UTF-8**. If you import a file that is not encoded in UTF-8 and the sentences contain special characters, the imported text will be garbled.

You can [check the encoding of your file online](https://nlp.fi.muni.cz/projects/chared/). Your file is encoded in UTF-8 if the result says `Detected by Chared: utf_8`.

There are also ways to check offline, depending on which OS you are using. The best approach is to search for it (for instance: [detect file encoding](http://google.com/search?q=detect+file+encoding)).

## Warnings

1. There is no safety net with this feature. Once you click "import", all the sentences will be processed and there is no easy way to cancel or revert.

  * Make sure the encoding is UTF-8.
  * Make sure you entered the correct user ID.
  * Make sure you selected the correct language(s).
  * Don't import files from just anyone who asks you. Only import from trusted contributors, or you could be importing a large number of poor-quality sentences.
  * Luckily, you no longer have to worry about duplicates because Tatoeba handles them.
  * It is recommended to run a test on the [dev website](https://dev.tatoeba.org) first, and only upload your file to the main website if the test went well.

2. The logs of the sentences will indicate *you* (the admin) as the person who created the sentences. If you are importing sentences for another user, make sure they are aware of this and do not mind. The user will still be assigned as the owner, but the logs will state that you are the contributor, and the "Contributions" stats will increase on *your profile* and not on the owner's profile.

3. When importing sentences and translations, it is technically impossible to know which sentence is the original and which one is the translation. Both sentences and translations will be considered as original.

  * The mass import feature functions as if someone was adding the sentence, then adding the translation separately as a new sentence (instead of clicking on the "Translate" button), and then using the linking feature to connect the sentence and the translation together.
  * As a result, it doesn't matter if the file you are importing has the translations in the first column and the original sentences in the second column.
  * This also means that the mass import feature does not accurately reflect reality. It loses information that is not critical but could be useful in some cases (for instance, for handling the permission to change the license of a sentence).

Version at: 10/11/2018, 20:02

# How to mass import sentences

Tatoeba has a feature to mass import sentences, but it is restricted to admins. It is not very user-friendly and can be harmful in the wrong hands.

The page to mass import sentences is: [https://tatoeba.org/eng/sentences/import](https://tatoeba.org/eng/sentences/import)

There are two sections on this page:

- Single sentences
- Sentences and translations

## Single sentences

This section allows you to import a list of sentences that are all in the **same language**. You cannot import a list that mixes different languages.

The format of the file is simple: each line is one sentence. 

Example: [https://gist.github.com/trang/50497ce80f494d2be801879ed3193129](https://gist.github.com/trang/50497ce80f494d2be801879ed3193129)
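
A file in this format can also be generated from a script. The following is a minimal Python sketch, not part of Tatoeba itself: the helper name `write_sentence_file` is hypothetical, and skipping empty entries is an assumption made here for convenience. It writes one sentence per line and saves the file as UTF-8.

```python
from pathlib import Path


def write_sentence_file(sentences, path):
    """Write one sentence per line to `path`, encoded as UTF-8.

    Hypothetical helper: returns the number of sentences written.
    """
    cleaned = []
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue  # assumption: empty entries are simply skipped
        if "\n" in text or "\r" in text:
            # A sentence spanning several lines would be read as several sentences.
            raise ValueError(f"Sentence spans several lines: {text!r}")
        cleaned.append(text)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)
```

For instance, `write_sentence_file(["Hello.", "How are you?"], "english.txt")` would produce a two-line file ready to be selected in the **File** field.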


The form to import sentences has 3 fields:

- **Language of the sentences**: this is a dropdown for you to choose the language of the sentences.
- **File**: this is a button which will open a popup for you to select the file to import.
- **Numeric user id**: this is a text input for you to enter the ID of the user.

## Sentences and translations

This section allows you to import sentences and their translations. Again, it is not possible to mix languages.

The format of the file is: sentence [tab] translation. The separator between the sentence and the translation must be a single tab character. It cannot be a sequence of several spaces.

Example: [https://gist.github.com/trang/149ac3faddee6f3bdfde5ae97a303816](https://gist.github.com/trang/149ac3faddee6f3bdfde5ae97a303816)
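
Since the separator is easy to get wrong, it can help to check the file before uploading it. The sketch below is an illustration only: the function name `find_bad_lines` is hypothetical, and the rule that each line must contain exactly one tab is an assumption derived from the format described above, not a documented behavior of the importer.

```python
def find_bad_lines(lines):
    """Return (line_number, reason) for each line that is not 'sentence<TAB>translation'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        parts = text.split("\t")
        if len(parts) != 2:
            # Assumption: exactly one tab per line (no tab, or extra tabs, is flagged).
            problems.append((number, f"expected 1 tab, found {len(parts) - 1}"))
        elif not parts[0].strip() or not parts[1].strip():
            problems.append((number, "empty sentence or translation"))
    return problems


if __name__ == "__main__":
    sample = [
        "I am here.\tJe suis ici.\n",          # valid: one tab
        "Spaces only.    Que des espaces.\n",  # invalid: spaces instead of a tab
    ]
    for number, reason in find_bad_lines(sample):
        print(f"line {number}: {reason}")
```

An empty result means every line has one sentence and one translation separated by a tab.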

The form has 4 fields:

- **Language of the sentences**: this is a dropdown for you to choose the language of the sentences.
- **Language of the translations**: this is a dropdown for you to choose the language of the translations.
- **File**: this is a button which will open a popup for you to select the file to import.
- **Numeric user id**: this is a text input for you to enter the ID of the user.

## How to find the user ID

* Go to the [List of all members](https://tatoeba.org/eng/users/all).
* Search for the user in the sidebar.
* This will lead you to the latest activity page of the user.
* This page can also be accessed from the user profile, by clicking on the link "show latest activity" in the "Stats" section in the sidebar.
* Check the URL of the page. It looks like `https://tatoeba.org/eng/users/show/{number}`. The `{number}` is the user ID.
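
If you copy the URL from the browser, the last step can be sketched in Python as follows. This is only an illustration: `user_id_from_url` is a hypothetical helper, and it assumes the URL ends with `users/show/{number}` as shown above.

```python
from urllib.parse import urlparse


def user_id_from_url(url):
    """Extract the numeric user ID from a '.../users/show/{number}' URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Expected tail of the path: "users", "show", "<number>"
    if (
        len(parts) >= 3
        and parts[-3:-1] == ["users", "show"]
        and parts[-1].isascii()
        and parts[-1].isdigit()
    ):
        return int(parts[-1])
    raise ValueError(f"Not a user activity URL: {url!r}")
```

For example, `user_id_from_url("https://tatoeba.org/eng/users/show/12345")` returns the integer `12345`, which is the value to enter in the **Numeric user id** field.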

## Encoding requirement

The file that you upload must be encoded in **UTF-8**. If you import a file that is not encoded in UTF-8 and the sentences contain special characters, the imported text will be garbled.

You can [check the encoding of your file online](https://nlp.fi.muni.cz/projects/chared/). Your file is encoded in UTF-8 if the result says `Detected by Chared: utf_8`.

There are also ways to check offline, depending on which OS you are using. The best approach is to search for it (for instance: [detect file encoding](http://google.com/search?q=detect+file+encoding)).
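
One OS-independent way to check offline is to try decoding the file as UTF-8. The following is a minimal Python sketch with a hypothetical helper name, `is_utf8`. It only tells you whether the bytes are valid UTF-8; it does not identify which other encoding a failing file uses.

```python
def is_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Note that a file containing only plain ASCII characters also passes, because ASCII is a subset of UTF-8. Such a file has no special characters that could be garbled.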

## Warnings

**(1)** There is no safety net with this feature. Once you click "import", all the sentences will be processed and there is no easy way to cancel or revert.

  * Make sure the encoding is UTF-8.
  * Make sure you entered the correct user ID.
  * Make sure you selected the correct language(s).
  * Don't import files from just anyone who asks you. Only import from trusted contributors, or you could be importing a large number of poor-quality sentences.
  * Luckily, you no longer have to worry about duplicates because Tatoeba handles them.
  * It is recommended to run a test on the [dev website](https://dev.tatoeba.org) first, and only upload your file to the main website if the test went well.

**(2)** The logs of the sentences will indicate *you* (the admin) as the person who created the sentences.

  * If you are importing sentences for another user, make sure they are aware of this and do not mind.
  * The user will still be assigned as the owner, but the logs will state that you are the contributor, and the "Contributions" stats will increase on *your profile* and not on the owner's profile.

**(3)** When importing sentences and translations, it is technically impossible to know which sentence is the original and which one is the translation. Both sentences and translations will be considered as original.

  * The mass import feature functions as if someone was adding the sentence, then adding the translation separately as a new sentence (instead of clicking on the "Translate" button), and then using the linking feature to connect the sentence and the translation together.
  * As a result, it doesn't matter if the file you are importing has the translations in the first column and the original sentences in the second column.
  * This also means that the mass import feature does not accurately reflect reality. It loses information that is not critical but could be useful in some cases (for instance, for handling the permission to change the license of a sentence).
