Notice

This page shows a previous version of the article.

How to add a new transcription, transliteration or alternative script

This article explains how to add a new transcription, transliteration or alternative script on Tatoeba. From here on, we refer to all three as “transcriptions” because they are technically handled the same way on Tatoeba. The goal of transcriptions is to allow people to read sentences in a different writing system.

Warning

This article is subject to change.

Requirements

Members may request the addition of new transcriptions by posting a message on the Wall. There are a few requirements.

  • The writing system must have an identified ISO 15924 code. This is a 4-letter code that identifies a script.

  • If there are no existing transcriptions for the language, the ISO 15924 code of the script used in existing sentences must be identified as well.

  • A link to a page (such as a Wikipedia article) that explains the transcription system and shows that it is used in real life. You may also comment on how useful the transcription would be for Tatoeba.

  • A substantial list of transcription pairs. A transcription pair is a sentence together with its expected transcription. “Substantial” means the list shows how the transcription system works as fully as possible; the more pairs, the better. Developers will use the list to check that the transcription algorithm works and to maintain it without having to be experts in the transcription system themselves. Ideally, it should include both real-life examples of transcriptions and “edge-case transcriptions” that show particular cases the algorithm should handle. Try to think hard about all the possibilities. A few hints: how should proper nouns and punctuation be handled?
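To illustrate how such a list can be used, here is a minimal sketch in Python. It is not Tatoeba's actual code: `is_iso15924_shaped`, `check_pairs` and the toy `toy_transcribe` function are hypothetical names introduced for this example. The first helper only checks that a code has the four-letter shape of an ISO 15924 code (it does not verify that the code is registered); the second reports every pair that a transcription function fails to reproduce.

```python
import re
from typing import Callable, List, Tuple

# ISO 15924 codes are four letters, conventionally written with an
# initial capital (for example "Latn" or "Cyrl").
ISO15924_SHAPE = re.compile(r"[A-Z][a-z]{3}")


def is_iso15924_shaped(code: str) -> bool:
    """Shape check only: does not confirm the code is actually registered."""
    return ISO15924_SHAPE.fullmatch(code) is not None


def check_pairs(
    transcribe: Callable[[str], str],
    pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (sentence, expected, actual) for every pair that does not match."""
    failures = []
    for sentence, expected in pairs:
        actual = transcribe(sentence)
        if actual != expected:
            failures.append((sentence, expected, actual))
    return failures


# Hypothetical toy transcription: maps a few lowercase Cyrillic letters
# to Latin and leaves every other character untouched.
TOY_TABLE = str.maketrans({"д": "d", "а": "a", "н": "n", "е": "e", "т": "t"})


def toy_transcribe(sentence: str) -> str:
    return sentence.translate(TOY_TABLE)


pairs = [
    ("да", "da"),
    ("нет", "net"),
    ("Да!", "Da!"),  # edge case: capital letter and punctuation
]

failures = check_pairs(toy_transcribe, pairs)
for sentence, expected, actual in failures:
    print(f"{sentence!r}: expected {expected!r}, got {actual!r}")
```

In this toy setup the third pair fails because the table has no entry for the capital letter, which is the kind of gap that edge-case pairs are meant to expose.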

Autogenerated transcriptions

We provide autogenerated transcriptions for each language we allow transcriptions in. This means that a piece of software reads the original sentences and converts them into the target script as soon as they are added or modified. You can see whether a transcription was autogenerated or further edited by contributors by hovering over it with the mouse.

Depending on the quality, accuracy and reliability of the software used, autogenerated transcriptions may be editable by contributors or displayed with a warning icon. Editable transcriptions are marked with a pen icon on the left (which may appear disabled depending on your editing rights for that particular transcription). No pen means the transcription is not editable at all (or that you're not logged in).
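The pen icon rules can be summarised as a small decision function. This is only a sketch of the behaviour described in the paragraph above, not the site's implementation; `PenIcon`, `pen_icon` and the three boolean parameters are hypothetical names.

```python
from enum import Enum


class PenIcon(Enum):
    HIDDEN = "hidden"      # no pen shown at all
    DISABLED = "disabled"  # pen shown but greyed out
    ENABLED = "enabled"    # pen shown and usable


def pen_icon(type_editable: bool, logged_in: bool, has_edit_rights: bool) -> PenIcon:
    """Decide how the pen icon appears next to a transcription.

    type_editable:   this type of transcription is editable at all
    logged_in:       the viewer is logged in
    has_edit_rights: the viewer may edit this particular transcription
    """
    if not type_editable or not logged_in:
        return PenIcon.HIDDEN
    if has_edit_rights:
        return PenIcon.ENABLED
    return PenIcon.DISABLED
```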

Developers decide whether a given type of transcription is made editable or not after consulting contributors and checking accuracy against the provided list of transcription pairs. Editable transcriptions usually meet the following requirements:

  • The format of the transcription (syntax, etc.) is clearly defined. If it's not, it is preferable not to allow editing until a format has been agreed upon, in order to avoid inconsistencies.

  • The autogeneration software is mostly reliable but produces errors from time to time. If it's rather unreliable, editing may be prevented because fixing all the transcriptions would require too much human work from contributors. If it's nearly 100% accurate, editing may be prevented as well, unless the errors it does produce are substantial.
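As a rough illustration of the two criteria above, the following Python sketch turns them into a single function. In practice the decision is made by developers after consulting contributors, not by a formula: the function name `should_allow_editing` and the threshold values are assumptions chosen purely for the example.

```python
def should_allow_editing(
    format_defined: bool,
    accuracy: float,
    substantial_errors: bool = False,
    unreliable_below: float = 0.5,     # hypothetical threshold
    near_perfect_from: float = 0.99,   # hypothetical threshold
) -> bool:
    """Sketch of the editability criteria.

    accuracy is the share of transcription pairs (0.0 to 1.0) that the
    autogeneration software reproduces correctly.
    """
    if not format_defined:
        # No agreed format: editing would introduce inconsistencies.
        return False
    if accuracy < unreliable_below:
        # Too unreliable: fixing everything by hand is too much work.
        return False
    if accuracy >= near_perfect_from:
        # Near-perfect: editing is only worthwhile for substantial errors.
        return substantial_errors
    # Mostly reliable with occasional errors: let contributors fix them.
    return True
```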