Notice

This page shows a previous version of the article.

Adding a New Language to the Corpus (for Developers)

Introduction

These are the instructions for adding a language to the Tatoeba corpus. Instructions for adding a language in which the Tatoeba UI will be displayed are found elsewhere.

These instructions were modified from Assembla.

FAQ for users

The FAQ for users who want to add a new language: How to request a new language.

Language icon

  1. Create the icon for the language. The icon should be a PNG file of 30x20 pixels. Each icon should have a 1px line of color #dcdcdc along its bottom and right edges. Most of the icons have also been through a luminosity change that makes them a bit paler than the original image, but that is not necessary.

  2. Commit the image to the repository. The icons are stored in the app/webroot/img/flags folder. If you don't have repository access and don't want to obtain it, ask someone who has it to commit the image for you.

  3. Update the app/webroot/img/flags directory on the server, to retrieve the new images for the new languages.
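
Before committing an icon, it can be handy to confirm its size automatically. The following is a minimal sketch in Python, not part of the Tatoeba code base: the function names are hypothetical, and it only checks the 30x20 dimensions by reading the PNG header (it does not check the #dcdcdc border).

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
EXPECTED_SIZE = (30, 20)  # (width, height) in pixels, as required for language icons


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk at the start of PNG data."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # Bytes 8-11 hold the chunk length, bytes 12-15 the chunk type.
    if data[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    # Width and height are stored as two big-endian 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_valid_flag_icon(data: bytes) -> bool:
    """True if the PNG data has the size expected for a language icon."""
    return png_dimensions(data) == EXPECTED_SIZE
```

To check a file, read it in binary mode and pass the bytes in, for example is_valid_flag_icon(open(path, "rb").read()), where path points at the new icon in app/webroot/img/flags.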

Source code

There is a script to add the new language code to the appropriate files. It takes the following parameters:

  • the three-letter ISO 639-3 code (e.g., "epo" for Esperanto)
  • the English name of the language (e.g., "Esperanto")
  • the ID of a list containing at least 5 sentences in the given language (to find the ID of the list, search the list of lists for the appropriate list, then hover over it with your mouse to see the URL, which contains the list ID)
  • the string "local" (on a local virtual machine), "dev" (on a development machine), or "prod" (on the server)
  • the username for the database
  • the password for the database
  • the database name
  • the string "update" (to update files), "run" (to run the updated SQL procedure), or "update_and_run" (to do both)

It should be executed from the directory in which the script is located.
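
To make the parameter order concrete, here is a small Python sketch that validates the eight arguments in the order listed above. It is only an illustration: the class and function names are hypothetical, the real script may validate differently, and the check that the ISO code is three lowercase ASCII letters is an assumption based on the "epo" example.

```python
import re
from dataclasses import dataclass

ENVIRONMENTS = ("local", "dev", "prod")
ACTIONS = ("update", "run", "update_and_run")


@dataclass
class AddLanguageArgs:
    iso_code: str
    english_name: str
    list_id: int
    environment: str
    db_user: str
    db_password: str
    db_name: str
    action: str


def parse_args(argv):
    """Validate the eight positional arguments, in the documented order."""
    if len(argv) != 8:
        raise ValueError("expected 8 arguments, got %d" % len(argv))
    iso_code, name, list_id, env, user, password, db_name, action = argv
    # Assumption: codes are three lowercase ASCII letters, like "epo".
    if not re.fullmatch(r"[a-z]{3}", iso_code):
        raise ValueError("ISO code must be three lowercase letters")
    if not re.fullmatch(r"[0-9]+", list_id):
        raise ValueError("list ID must be a number")
    if env not in ENVIRONMENTS:
        raise ValueError("environment must be one of %s" % (ENVIRONMENTS,))
    if action not in ACTIONS:
        raise ValueError("action must be one of %s" % (ACTIONS,))
    return AddLanguageArgs(iso_code, name, int(list_id), env,
                           user, password, db_name, action)
```

For example, parse_args(["epo", "Esperanto", "123", "local", "dbuser", "dbpass", "tatoeba", "update"]) returns an object whose list_id is the integer 123 (the list ID, user, password and database name here are made-up values).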

The script also issues the following reminders:

(1) If the language has a two-letter code (not all languages do), update $iso639_3_to_iso639_1 in app/views/helpers/languages.php.

(2) If the new language is written right-to-left, update app/views/helpers/languages.php (the script prints this path relative to its own directory, as ../../../app/views/helpers/languages.php).

(3) If the new language is stemmed, or is written with CJK characters, or contains characters not previously used in Tatoeba, or has no word boundaries, update generate_sphinx_conf.php.

In general, a developer will run the script with the "update" string, then commit the changed files. Later, once the changes have been pulled onto the server, the script can be run with the string "run". (Alternatively, the final step in the script can be executed directly by running the call_add_language procedure.)

Once this procedure is executed, the new language will be available. You should execute it locally and verify that it works correctly. (See the "Tests" section below.) The sentences that were in the list with id <list_id> should have their language set to the new language (instead of being set to language unknown).

The script edits the following files:

  • app/model/sentence.php: Adds the language ISO code to the $validate array. Languages that are not part of this array are not allowed.

  • app/views/helpers/languages.php: Adds the language ISO code and the name to the languagesArray() method.

  • docs/generate_sphinx_conf.php: Adds the language ISO code and name to the $languages array. Also adds the ISO code to the $cjkLanguages array if the language uses Chinese, Japanese or Korean characters.
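
After running the script with "update", a quick sanity check is to confirm that each of the three files now mentions the new code. The Python sketch below is a hypothetical helper, not part of the script. It assumes the code appears in the PHP sources as a quoted string literal such as 'epo', which is not confirmed here.

```python
import re


def mentions_code(php_source, iso_code):
    """True if the source contains the ISO code as a quoted string literal."""
    pattern = "['\"]" + re.escape(iso_code) + "['\"]"
    return re.search(pattern, php_source) is not None


def files_missing_code(sources, iso_code):
    """Return the paths whose contents do not mention the ISO code.

    `sources` maps a file path to that file's text.
    """
    return [path for path, text in sources.items()
            if not mentions_code(text, iso_code)]
```

Read the three files listed above into a dictionary and pass it in; an empty result means every file mentions the code. It does not prove the edits are correct, only that they happened.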

After you make your changes and test them according to the next section, commit your code to the repository, or have someone do it for you. See Repositories. Refer to the appropriate issue ticket(s) in your comment and indicate the languages that were added.

Configure Sphinx

Sphinx has a list of characters that are considered as belonging to words. Any character not in this list is ignored. (It will be treated as a word separator.) So you need to make sure that the characters used by the new language are in this list. If you’re sure they are, you can skip reading this section.

You need a list of the Unicode code points used by the language. Make sure they all appear in the charset_table option defined in generate_sphinx_conf.php. Include case folding rules if appropriate. (gillux can help with that.)
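
As an illustration of what charset_table entries look like, the Python sketch below turns a set of code points into Sphinx-style ranges ("U+430..U+44F") and builds a case folding rule ("U+410..U+42F->U+430..U+44F"). The function names are hypothetical and this is not how generate_sphinx_conf.php builds its table. It is only a way to draft entries that you then add by hand.

```python
def to_ranges(code_points):
    """Collapse code points into sorted (start, end) runs of consecutive values."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [tuple(r) for r in ranges]


def fmt(cp):
    """Format a code point the way charset_table expects, e.g. U+430."""
    return "U+%X" % cp


def plain_entries(code_points):
    """charset_table entries for characters that map to themselves."""
    return [fmt(a) if a == b else fmt(a) + ".." + fmt(b)
            for a, b in to_ranges(code_points)]


def folding_entry(upper_start, upper_end, lower_start):
    """Entry that maps a range of uppercase letters onto lowercase ones."""
    lower_end = lower_start + (upper_end - upper_start)
    if upper_start == upper_end:
        return fmt(upper_start) + "->" + fmt(lower_start)
    return "%s..%s->%s..%s" % (fmt(upper_start), fmt(upper_end),
                               fmt(lower_start), fmt(lower_end))
```

For the basic Russian alphabet, ", ".join([folding_entry(0x410, 0x42F, 0x430)] + plain_entries(range(0x430, 0x450))) produces "U+410..U+42F->U+430..U+44F, U+430..U+44F".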

Refresh the Sphinx indexes

Add some real sentences in the language, so you can verify that searching works after reindexing. Then, regenerate the Sphinx config using the generate_sphinx_conf.php script, and refresh the Sphinx index for that new language (indexer index_XXX --rotate).

Tests

  • Test that the language detection works (or can work) by adding a sentence with 'auto-detect'. Try examples from the list on Tatoeba (named after the language in question) that had to exist when the language was requested.
  • Test that you can change the language of a sentence into the language in question.
  • Check that the count displays properly in the language stats.
  • Check that search is working by looking up a word used in sentences that were added before reindexing.
  • Check that the random sentence feature is working.

On the server

  • If everything is fine so far, go to the server.
  • Execute "htop" to verify that the load is below 2.
  • Update the repository if necessary
  • Connect to the mysql database of the prod version.
  • CALL add_new_language(iso_code, list_id, tag_name);
  • exit
  • Check that the sentences that were in the list and tags now have the appropriate icon.
  • Check that the language appears in the language stats.
  • cp /usr/local/etc/sphinx.conf /usr/local/etc/sphinx.conf.old
  • php generate_sphinx_conf.php > /usr/local/etc/sphinx.conf
  • Change the necessary things in the new config file (user, password, database and port). Look at the old conf file for reference.
  • indexer --all --rotate & disown
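
The load check in the list above can also be done without htop. The Python sketch below is an assumption-laden illustration: the checklist does not say which load average to look at, so this uses the 1-minute value, and the function names are hypothetical.

```python
import os

MAX_LOAD = 2.0  # threshold from the checklist above


def load_is_acceptable(one_minute_load, max_load=MAX_LOAD):
    """True if the given load average is strictly below the threshold."""
    return one_minute_load < max_load


def current_load():
    """1-minute load average of this machine (Unix only)."""
    return os.getloadavg()[0]
```

On the server, load_is_acceptable(current_load()) tells you whether it is reasonable to continue with the database and reindexing steps.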

Historical information only

We used to follow this procedure for adding a new language (example: French):

  • Create the folder /app/locale/fre/LC_MESSAGES
  • Copy default.pot into this folder
  • Rename it to default.po
  • Open default.po with PoEdit (http://www.poedit.net/) and translate.
  • Save. It will generate a *.mo file, which is used when replacing strings at runtime.

and when new strings were added:

  • Follow the cake i18n instructions to generate the up-to-date POT file.
  • Open the PO file (PO, not POT).
  • In the menu: Catalog > Update from POT file…
  • Choose the POT file that was newly generated

The language of the page is set through the URL. Example: http://localhost/tatoeba2/fre/sentences/index

Language detection

In the past, we used to edit this:

  • app/controllers/components/google_language_api.php: Adds the corresponding case to the google2TatoebaCode() method, if Google supports the detection for the language. See the Language enum.

but now tatodetect takes care of language detection.

Resources:

  • http://blog.jaysalvat.com/articles/choix-des-langues-par-url-dans-cakephp.php
  • http://www.formation-cakephp.com/41/multilingue-18n-l10n