Notice

This page shows a previous version of the article.

Language exceptions

Language classification is a complex task, and it is not part of Tatoeba's mission. We rely on the ISO 639-3 standard to define what a valid language is. In other words, if a language is not defined in this standard, we do not support it in Tatoeba.

There are, however, some exceptions. This article lists these exceptions and explains the reasons behind them.

How our language request requirements evolved

To understand why we have exceptions, it is important to understand how Tatoeba's requirements evolved.

At the very beginning of Tatoeba (that is, in 2006), languages were associated with a 2-letter code. The codes did not explicitly follow any standard; they were simply chosen based on common sense.

Our decision to follow the ISO 639-3 standard dates from December 2009. It was motivated by the desire to support Shanghainese in Tatoeba. We realized we needed a linguistic framework to categorize languages, and we chose to rely on the ISO 639-3 standard.

Our decision to follow this standard more strictly was made official in February 2011. We started to notice users adding sentences in unknown or new constructed languages and realized that we could not support every constructed language that people may come up with. We added the following warning to our instructions for language requests:

IMPORTANT: We cannot add your language if it does not have an ISO 639-3 code. At this point we already have a lot of languages to deal with, and it's a bit too complicated to deal with languages that are not "officially" recognized.

More recently, we decided to restrict language requests to individual languages. This was made official in October 2018. We realized that Tatoeba does not properly take macrolanguages into account, and until we come up with a technical solution for this issue, we will avoid adding more macrolanguages.

Why do we have exceptions?

We do not consider the ISO 639-3 standard to be the holy grail of language categorization. It is only a tool that helps us make decisions on how we organize the data collected through Tatoeba, but Tatoeba has its own history and we take this history into account more than anything else.

Whenever we introduce changes to our rules, the new rules apply to new language requests, not to existing languages.

We will not make any changes to already supported languages unless there has been a discussion and an agreement with the contributors on what to do with their language.

If continuing to support a language does not cause any major issue, even though the language does not fit the new policies, then we will not bother removing or changing it.

Will we make more exceptions?

No.

If we receive a language request that does not fit our current rules, we can review the rules so that the language can be added without making it an exception.

But if we cannot agree on new rules that would support the requested language, then the language will not be supported.

Our current exceptions

This list may not be exhaustive. If you notice other exceptions, please contact Trang.

Arabic (ara)

Added in April 2009.

Arabic is defined in the ISO 639-3 standard, but as a macrolanguage. Our current requirements state that we only accept individual languages.

CycL (cycl)

Added in August 2010.

CycL is an exception because it is not defined in the ISO 639-3 standard.

Toki Pona (toki)

Added in November 2010.

Toki Pona is an exception because it is not defined in the ISO 639-3 standard.

Berber (ber)

Added in June 2010.

Berber is an exception because it is not defined in the ISO 639-3 standard. It is, however, defined as a collection of languages in the ISO 639-2 and ISO 639-5 standards.

Albanian (sqi)

Added in August 2018.

Albanian is an exception for reasons similar to Arabic's. It is a macrolanguage, but we only accept individual languages.
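
As a rough illustration only, the rules and exceptions described in this article could be summarized in a few lines of Python. This is a sketch, not how Tatoeba actually processes requests: the names EXCEPTIONS, review_request and is_exception are hypothetical, and the iso_scope argument assumes that someone has already looked up the language in the ISO 639-3 standard.

```python
# Exceptions listed in this article: Tatoeba code -> reason it is an exception.
EXCEPTIONS = {
    "ara": "macrolanguage",
    "cycl": "not defined in ISO 639-3",
    "toki": "not defined in ISO 639-3",
    "ber": "not defined in ISO 639-3 (collection of languages in ISO 639-2/5)",
    "sqi": "macrolanguage",
}


def review_request(iso_scope):
    """Apply the current rules to a NEW language request.

    iso_scope is a hypothetical input: "individual", "macrolanguage",
    or None when the language has no ISO 639-3 entry.
    """
    if iso_scope is None:
        return "rejected: not defined in ISO 639-3"
    if iso_scope != "individual":
        return "rejected: only individual languages are accepted"
    return "accepted"


def is_exception(code):
    """Already supported languages keep their status; new rules do not apply to them."""
    return code in EXCEPTIONS
```

The two functions are kept separate on purpose: new rules apply only to new requests, while the languages in the exceptions list stay supported regardless of what review_request would return for them today.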

How to request a new language