Version at: 03/05/2014, 08:30 vs. version at: 21/09/2014, 12:53

Version at: 03/05/2014, 08:30

# Adding a New Language to the Corpus (for Developers)

##Introduction

These are the instructions for adding a language to the Tatoeba corpus. Instructions for adding a language in which the Tatoeba UI will be displayed are found elsewhere.

These instructions were modified from [Assembla](https://www.assembla.com/spaces/tatoeba2/wiki/Adding_a_language_in_Tatoeba).

##FAQ for users
The FAQ for users who want to add a new language: [How to request a new language](http://tatoeba.org/eng/faq#new-language).

##Language icon
1. Create the icon for the language. The icon must be a PNG file measuring 30x20 pixels. In theory, each icon has a 1px line of color #dcdcdc along its bottom and right borders, and most icons have also had their luminosity adjusted so that they are slightly paler than the original image. These details are optional for now; the essential requirement is a 30x20 PNG file.

2. Commit the image to the repository. The icons are stored in the app/webroot/img/flags folder. If you don't have repository access and don't want to obtain it, ask someone who does to commit the image for you.

3. Update the app/webroot/img/flags directory on the server, to retrieve the new images for the new languages.

##Source code

There is a [script](https://github.com/Tatoeba/tatoeba2/blob/master/docs/database/scripts/add_lang.sh) to add the new language code to the appropriate files. It takes the following parameters:

* the three-letter ISO 639 code (e.g., "epo" for Esperanto)
* the English name of the language (e.g., "Esperanto")
* the ID of a list containing at least 5 sentences in the given language (to find the ID of the list, search the [list of lists](http://tatoeba.org/eng/sentences_lists/index) for the appropriate list, then hover over it with your mouse to see the URL, which contains the list ID)
* the string "dev" (on a development machine) or "prod" (on the server)
* the username for the database
* the password for the database
* the database name

It should be executed from the directory in which the script is located.

Once this script is executed, the new language will be available. You should execute it locally first and verify that it works correctly (see the "Tests" section below). The sentences that were in the list with ID <list_id> should have their language set to the new language (instead of being set to unknown). The script edits the following files:

* **app/model/sentence.php**
Adds the language ISO code to the $validate array. Languages that are not part of this array are not allowed.

* **app/views/helpers/languages.php**
Adds the language ISO code and the name to the languagesArray() method.

* **docs/generate\_sphinx\_conf.php**
Adds the language ISO code and name to the $languages array. Also adds the ISO code to the $cjkLanguages array if the language uses Chinese, Japanese or Korean characters.

After you make your changes and test them according to the next section, commit your code to the repository, or have someone do it for you. See [Repositories](repositories). Refer to the appropriate issue ticket(s) in your comment and indicate the languages that were added.

##Tests
* Test that language detection works (or can work) by adding a sentence with 'auto-detect'. There should be a list of sentences in the language in question on Tatoeba (named after that language).
* Test that you can change the language of a sentence into the language in question.
* Check that the count displays properly in the languages stats.
 
##On the server

* If everything is fine so far, go to the server.
* Execute "htop" to verify that the load is below 2.
* Update the repository if necessary.
* Connect to the MySQL database of the prod version.
* CALL add\_new\_language(iso\_code, list\_id, tag\_name);
* exit
* Check that the sentences that were in the list and tags now have the appropriate icon.
* Check that the language appears in the languages stats.
* cp /usr/local/etc/sphinx.conf /usr/local/etc/sphinx.conf.old
* php generate\_sphinx\_conf.php > /usr/local/etc/sphinx.conf
* Change the necessary things in the new config file (user, password, database and port). Look at the old conf file for reference.
* indexer --all --rotate & disown

##Historical information only
We used to follow this procedure for adding a new language (example: French):

* Create the folder /app/locale/fre/LC_MESSAGES
* Copy default.pot into this folder
* Rename it to default.po
* Open default.po with PoEdit (http://www.poedit.net/) and translate.
* Save. It will generate a *.mo file, which is used when replacing strings at runtime.

and when new strings were added:

* Follow the cake i18n instructions to generate the up-to-date POT file.
* Open the PO file (PO, not POT).
* In the menu, choose Catalog > Update from POT file…
* Select the newly generated POT file

The language of the page is set through the URL.
Example: http://localhost/tatoeba2/fre/sentences/index

###Language detection
In the past, we used to edit this:

* **app/controllers/components/google\_language\_api.php**
Adds the corresponding case to the google2TatoebaCode() method, if Google supports the detection for the language. See the Language enum.

but now tatodetect takes care of language detection.

Resources:

* http://blog.jaysalvat.com/articles/choix-des-langues-par-url-dans-cakephp.php
* http://www.formation-cakephp.com/41/multilingue-18n-l10n

Version at: 21/09/2014, 12:53

# Adding a New Language to the Corpus (for Developers)

##Introduction

These are the instructions for adding a language to the Tatoeba corpus. Instructions for adding a language in which the Tatoeba UI will be displayed are found elsewhere.

These instructions were modified from [Assembla](https://www.assembla.com/spaces/tatoeba2/wiki/Adding_a_language_in_Tatoeba).

##FAQ for users
The FAQ for users who want to add a new language: [How to request a new language](http://tatoeba.org/eng/faq#new-language).

##Language icon
1. Create the icon for the language. The icon must be a PNG file measuring 30x20 pixels. In theory, each icon has a 1px line of color #dcdcdc along its bottom and right borders, and most icons have also had their luminosity adjusted so that they are slightly paler than the original image. These details are optional for now; the essential requirement is a 30x20 PNG file.

2. Commit the image to the repository. The icons are stored in the app/webroot/img/flags folder. If you don't have repository access and don't want to obtain it, ask someone who does to commit the image for you.

3. Update the app/webroot/img/flags directory on the server, to retrieve the new images for the new languages.
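
The size requirement in step 1 could be checked automatically before committing an icon. The following is a minimal sketch in Python, not part of the Tatoeba code base: the function names `png_size` and `is_valid_flag_icon` are made up for illustration, and the check only reads the PNG header, so it does not verify the border line or the luminosity.

```python
import struct
from typing import Tuple

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # Bytes 8-11 hold the chunk length, bytes 12-15 the chunk type.
    if data[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    # Width and height are big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_valid_flag_icon(data: bytes) -> bool:
    """True when the bytes are a PNG of exactly 30x20 pixels."""
    try:
        return png_size(data) == (30, 20)
    except ValueError:
        return False
```

A file could be checked with `is_valid_flag_icon(open(path, "rb").read())`, where `path` points at the new icon.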

##Source code

There is a [script](https://github.com/Tatoeba/tatoeba2/blob/master/docs/database/scripts/add_lang.sh) to add the new language code to the appropriate files. It takes the following parameters:

* the three-letter ISO 639 code (e.g., "epo" for Esperanto)
* the English name of the language (e.g., "Esperanto")
* the ID of a list containing at least 5 sentences in the given language (to find the ID of the list, search the [list of lists](http://tatoeba.org/eng/sentences_lists/index) for the appropriate list, then hover over it with your mouse to see the URL, which contains the list ID)
* the string "dev" (on a development machine) or "prod" (on the server)
* the username for the database
* the password for the database
* the database name

It should be executed from the directory in which the script is located.
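
The parameters listed above could be assembled and sanity-checked before the script is run. The sketch below is in Python and assumes that the parameters are positional and passed in the order listed, which this page does not state explicitly. The function name `build_add_lang_argv` and the validation rules (lowercase code, positive list ID) are illustrative assumptions, not behavior of add_lang.sh itself.

```python
import re


def build_add_lang_argv(iso_code, english_name, list_id, mode,
                        db_user, db_password, db_name):
    """Return the argument list for ./add_lang.sh (assumed positional order)."""
    # Assumption: the code is written as three lowercase letters, e.g. "epo".
    if not re.fullmatch(r"[a-z]{3}", iso_code):
        raise ValueError("expected a three-letter ISO 639 code, e.g. 'epo'")
    if mode not in ("dev", "prod"):
        raise ValueError("mode must be 'dev' or 'prod'")
    if int(list_id) <= 0:
        raise ValueError("list ID must be a positive integer")
    # "./add_lang.sh" is relative because the script is meant to be run
    # from the directory in which it is located.
    return ["./add_lang.sh", iso_code, english_name, str(list_id), mode,
            db_user, db_password, db_name]
```

The resulting list could then be passed to `subprocess.run(argv, cwd=script_dir, check=True)`, where `script_dir` is the directory containing the script.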

Once this script is executed, the new language will be available. You should execute it locally first and verify that it works correctly (see the "Tests" section below). The sentences that were in the list with ID <list_id> should have their language set to the new language (instead of being set to unknown). The script edits the following files:

* **app/model/sentence.php**
Adds the language ISO code to the $validate array. Languages that are not part of this array are not allowed.

* **app/views/helpers/languages.php**
Adds the language ISO code and the name to the languagesArray() method.

* **docs/generate\_sphinx\_conf.php**
Adds the language ISO code and name to the $languages array. Also adds the ISO code to the $cjkLanguages array if the language uses Chinese, Japanese or Korean characters.

After you make your changes and test them according to the "Tests" section below, commit your code to the repository, or have someone do it for you. See [Repositories](repositories). Refer to the appropriate issue ticket(s) in your commit message and indicate the languages that were added.

##Configure Sphinx
Sphinx has a list of characters that it treats as parts of words. Every character not in this list is treated as a word separator and is not indexed. So you need to make sure that the characters used by the language are in this list. If you’re sure they are, you can skip this section.

You need a list of the Unicode code points used by the language, and you must make sure they are included in the [charset_table](http://sphinxsearch.com/docs/current.html#conf-charset-table) option that is defined in generate\_sphinx\_conf.php. Include case folding rules if appropriate. (gillux can help with that.)
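
One way to obtain such a list is to derive it from sample text in the language. The Python sketch below is only an illustration under stated assumptions: the function name `charset_entries` is hypothetical, the output uses the `U+XXX` and `U+XXX->U+YYY` notation of the Sphinx charset_table option, and combining marks are skipped, so scripts that rely on them would need extra entries added by hand.

```python
def charset_entries(sample_text):
    """Return sorted charset_table entries for the letters in sample_text."""
    entries = set()
    for ch in sample_text:
        if not ch.isalnum():
            # Spaces, punctuation and combining marks are skipped here,
            # so they remain word separators for Sphinx.
            continue
        lower = ch.lower()
        if len(lower) == 1 and lower != ch:
            # Case folding rule: map the uppercase code point to lowercase,
            # and declare the lowercase code point itself as well.
            entries.add("U+%X->U+%X" % (ord(ch), ord(lower)))
            entries.add("U+%X" % ord(lower))
        else:
            entries.add("U+%X" % ord(ch))
    return sorted(entries)


# Example: print a comma-separated fragment for a short Esperanto sample.
print(", ".join(charset_entries("\u0108u \u011di")))
```

The printed fragment would still have to be merged by hand with the existing charset_table value, since overlapping ranges may already cover some of the characters.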

##Refresh the Sphinx indexes
Add some real sentences in the language, so that you can check if searching works after reindexing. Then, regenerate the Sphinx config using the **generate\_sphinx\_conf.php** script, and refresh the Sphinx index for that new language (indexer index_XXX --rotate).

##Tests
* Test that language detection works (or can work) by adding a sentence with 'auto-detect'. There should be a list of sentences in the language in question on Tatoeba (named after that language).
* Test that you can change the language of a sentence into the language in question.
* Check that the count displays properly in the languages stats.
* Check that search works by looking up a word used in sentences that were added before reindexing.
* Check that the random sentence feature works.
 
##On the server

* If everything is fine so far, go to the server.
* Execute "htop" to verify that the load is below 2.
* Update the repository if necessary.
* Connect to the MySQL database of the prod version.
* CALL add\_new\_language(iso\_code, list\_id, tag\_name);
* exit
* Check that the sentences that were in the list and tags now have the appropriate icon.
* Check that the language appears in the languages stats.
* cp /usr/local/etc/sphinx.conf /usr/local/etc/sphinx.conf.old
* php generate\_sphinx\_conf.php > /usr/local/etc/sphinx.conf
* Change the necessary things in the new config file (user, password, database and port). Look at the old conf file for reference.
* indexer --all --rotate & disown
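
The load check at the start of this list could also be scripted instead of read from htop. The Python sketch below is an assumption-laden illustration: it treats "the load" as the 1-minute load average, which this page does not specify, and the function names are made up.

```python
import os


def load_is_acceptable(load_average, threshold=2.0):
    """True when the given load average is strictly below the threshold."""
    return load_average < threshold


def current_load_is_acceptable(threshold=2.0):
    """Check the 1-minute load average of this machine (Unix only)."""
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    one_minute, _five, _fifteen = os.getloadavg()
    return load_is_acceptable(one_minute, threshold)
```

Calling `current_load_is_acceptable()` on the server before running the remaining steps would give the same go/no-go answer as the manual check.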

##Historical information only
We used to follow this procedure for adding a new language (example: French):

* Create the folder /app/locale/fre/LC_MESSAGES
* Copy default.pot into this folder
* Rename it to default.po
* Open default.po with PoEdit (http://www.poedit.net/) and translate.
* Save. It will generate a *.mo file, which is used when replacing strings at runtime.

and when new strings were added:

* Follow the cake i18n instructions to generate the up-to-date POT file.
* Open the PO file (PO, not POT).
* In the menu, choose Catalog > Update from POT file…
* Select the newly generated POT file

The language of the page is set through the URL.
Example: http://localhost/tatoeba2/fre/sentences/index

###Language detection
In the past, we used to edit this:

* **app/controllers/components/google\_language\_api.php**
Adds the corresponding case to the google2TatoebaCode() method, if Google supports the detection for the language. See the Language enum.

but now tatodetect takes care of language detection.

Resources:

* http://blog.jaysalvat.com/articles/choix-des-langues-par-url-dans-cakephp.php
* http://www.formation-cakephp.com/41/multilingue-18n-l10n
