How to add a new transcription, transliteration or alternative script

This article explains how to add a new transcription, transliteration or alternative script on Tatoeba. The rest of this article refers to all three as “transcriptions” because Tatoeba handles them the same way technically. The goal of transcriptions is to let people read sentences in a different writing system.

Requirements

Members may request the addition of new transcriptions by posting a message on the Wall. There are a few requirements:

  • The writing system must have an identified ISO 15924 code. This is a 4-letter code that identifies scripts.

  • If there are no existing transcriptions for the language, the ISO 15924 code of the script used in existing sentences must be identified as well.

  • A link to a page (such as a Wikipedia article) that explains the transcription system and shows that it is used in real life, along with a comment explaining why it would be useful for Tatoeba.

  • A list of a substantial number of transcription pairs. A transcription pair is a sentence and its expected transcription. A “substantial number” means enough to show as fully as possible how the transcription system works; the more, the better. Developers will use the list to check that the transcription algorithm works and to maintain it without having to be experts in the transcription system themselves. Ideally, the list should include both real-life examples of transcriptions and “edge-case transcriptions” that show particular cases the algorithm should handle. Try to think of all the possibilities. For example, how should proper nouns and punctuation be handled?
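
To illustrate how a list of transcription pairs might be used, here is a minimal sketch in Python. It is not Tatoeba's actual code: the names is_script_code, check_pairs, toy_transcribe, TOY_MAP and PAIRS are all hypothetical, and the five-letter Cyrillic-to-Latin mapping is a toy that exists only for this example. The sketch checks that a script code has the four-letter shape of an ISO 15924 code (it does not check that the code is actually registered), then runs a transcription function over the pairs and reports every mismatch.

```python
import re
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (sentence, expected transcription)


def is_script_code(code: str) -> bool:
    """Check the shape of an ISO 15924 code, e.g. "Latn" or "Cyrl".

    This only tests the format (one uppercase letter followed by three
    lowercase letters), not whether the code exists in the standard.
    """
    return re.fullmatch(r"[A-Z][a-z]{3}", code) is not None


def check_pairs(
    transcribe: Callable[[str], str], pairs: List[Pair]
) -> List[Tuple[str, str, str]]:
    """Return (sentence, expected, actual) for every pair that fails."""
    failures = []
    for sentence, expected in pairs:
        actual = transcribe(sentence)
        if actual != expected:
            failures.append((sentence, expected, actual))
    return failures


# Toy mapping, assumed for this example only: lowercase letters, nothing else.
TOY_MAP = {"д": "d", "а": "a", "н": "n", "е": "e", "т": "t"}


def toy_transcribe(sentence: str) -> str:
    """Replace each known character; leave unknown characters unchanged."""
    return "".join(TOY_MAP.get(char, char) for char in sentence)


# Hypothetical pairs: two plain cases and one edge case
# (capital letter plus punctuation).
PAIRS = [
    ("да", "da"),
    ("нет", "net"),
    ("Да.", "Da."),
]

if __name__ == "__main__":
    for sentence, expected, actual in check_pairs(toy_transcribe, PAIRS):
        print(f"{sentence!r}: expected {expected!r}, got {actual!r}")
```

In this sketch the two plain pairs pass, while the edge-case pair fails because the toy mapping does not know about capital letters. This is the kind of problem that edge-case pairs are meant to reveal.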

Autogenerated transcriptions

We provide autogenerated transcriptions for each language for which we allow transcriptions. This means that a piece of software reads the original sentences and converts them into the target script as soon as they are added or modified. You can see whether a transcription has been autogenerated or further edited by contributors by hovering the mouse pointer over it.

Depending on the quality, accuracy and reliability of the software used, autogenerated transcriptions may be further editable by contributors or displayed with a warning icon. Editable transcriptions are marked with a pen icon on the left (which may appear disabled depending on your editing rights on that particular transcription). No pen means the transcription is not editable at all (or that you're not logged in).

Developers decide whether a given type of transcription is made editable or not after consulting contributors and checking accuracy against the provided list of transcription pairs. Editable transcriptions usually meet the following requirements:

  • The format of the transcription (syntax, etc.) is clearly defined. If it's not, it is preferable not to allow editing until a format has been agreed upon, in order to avoid inconsistencies.

  • The autogeneration software is mostly reliable but produces errors from time to time. If it's not very reliable, editing may be prevented because it would require too much human work from contributors to fix all the transcriptions. If it's close to 100% accurate, editing may be prevented as well.
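
The two requirements above can be sketched as a simple rule in Python. This is only an illustration: in practice developers decide after consulting contributors, and the function name suggest_editable and both thresholds (0.80 and 0.99) are assumptions made up for this example, not values used by Tatoeba. The accuracy is meant to be the fraction of transcription pairs that the software gets right.

```python
def suggest_editable(
    format_defined: bool,
    accuracy: float,
    low: float = 0.80,   # assumed threshold: below this, too much work to fix
    high: float = 0.99,  # assumed threshold: at or above this, near-perfect
) -> bool:
    """Suggest whether a type of transcription could be made editable.

    accuracy is the fraction (0.0 to 1.0) of transcription pairs for which
    the autogenerated transcription matches the expected one.
    """
    if not format_defined:
        # No agreed format: editing would create inconsistencies.
        return False
    # Editable only when the software is mostly reliable but not near-perfect.
    return low <= accuracy < high


if __name__ == "__main__":
    print(suggest_editable(True, 0.93))   # mostly reliable, format defined
    print(suggest_editable(True, 0.50))   # not very reliable
    print(suggest_editable(True, 0.999))  # close to 100% accurate
    print(suggest_editable(False, 0.93))  # format not agreed upon
```

Only the first call suggests making the transcription editable; the other three correspond to the cases where editing may be prevented.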
