Version at: 26/04/2013, 19:36 vs. version at: 26/04/2013, 22:30
##Introduction

This article by Trang is a must-read for anyone who is serious about contributing to Tatoeba.

Here is an outline of how to be a good contributor:

1. [Understand the context of the project](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule1)
2. [Understand how the corpus is structured](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule2)
3. [Focus on the main sentence, not the other translations](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule3)
4. [Translate the sentence as a whole rather than as a collection of individual words](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule4)
5. [Do not edit a sentence if, by itself, it is correct](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule5)
6. [Do not change the language in which a sentence is written](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule6)
7. [Make sure you are adding comments to the right sentence](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule7)
8. [Do not add sentences from copyrighted content](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule8)
9. [Do not annotate sentences](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule9)
10. [Give us feedback](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule10)
11. [Do not wait for us to code it if you can code it](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule11)
12. [Indicate your languages in your profile](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule12)
13. [Encourage and educate new (or old) contributors](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule13)
14. [Spread the love](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule14)

##1. Understand the context of the project

* I started this project in 2006. The initiative was driven by my passion for language learning and frustration about not finding an adequate online dictionary. The project is focused on **sentences** and I insist on **sentences**. Sample sentences were (and still are) a very scarce resource. Please only add **complete sentences** if you are going to contribute.
* I was alone on this project for some time. It was only three years later, in 2009, that other people (all computer science students) started to help me out by coding more features.
* Tatoeba is NOT a commercial project. We're not a company, and we're not paid for doing any of this. It is something that we're working on in our **free time**.
* To be honest, we don't exclude the possibility of starting a company someday, but that is if and only if we have an innovative, coherent and ethical business model (yeah, good luck). Having ads everywhere and driving a lot of traffic, or forcing people to pay to access the data, is out of the question.

##2. Understand how the corpus is structured

The corpus is structured not as a **table** but as a **graph** (in the computer science sense of the word). What does that mean? Well, imagine you had to extract part of the corpus and write it on paper. You might do something like this:

<table>
<tr> <th>English</th> <th>French</th> <th>Spanish</th> </tr>
<tr> <td>My name is Trang.</td> <td>Je m'appelle Trang.</td> <td>Me llamo Trang.</td> </tr>
<tr> <td>How are you?</td> <td>Comment vas-tu?</td> <td>¿Cómo estás?</td> </tr>
<tr> <td>...</td> <td>...</td> <td>...</td> </tr>
</table>

That's a **table** structure. There are **rows** and **columns**: a row contains sentences with the same meaning, and a column contains sentences in the same language. That's the first approach anyone might take, but that's NOT how the corpus is constructed.

Our corpus is set up like this: *[note: diagram needs to be imported]*

That's a **graph** structure. There are **nodes** and **edges**: each node represents a sentence, and each edge represents the link between two sentences. When two sentences are linked, they have the same meaning.

The structure of the representation has a big effect on the way you can contribute to the corpus. One important implication of the graph structure is that you can add **multiple translations in the same language** for a specific sentence. You think there are two ways to translate a sentence and you really can't decide which would be the best? Well, just add both!

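To make the graph idea a bit more concrete, here is a rough sketch in Python. It is only an illustration, not how Tatoeba is actually implemented: the `Corpus` class, its method names, and the language codes `eng`, `fra` and `spa` are all made up for this example. Sentences are nodes, links are edges, and nothing prevents one sentence from having two translations in the same language.

```python
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Toy model (hypothetical): sentences are nodes, links are edges."""
    sentences: dict = field(default_factory=dict)  # id -> (language, text)
    links: set = field(default_factory=set)        # each link is frozenset({id_a, id_b})
    next_id: int = 1

    def add_sentence(self, lang, text):
        """Add a node and return its id."""
        sentence_id = self.next_id
        self.sentences[sentence_id] = (lang, text)
        self.next_id += 1
        return sentence_id

    def add_translation(self, source_id, lang, text):
        """Translating adds one node AND one edge back to the source sentence."""
        new_id = self.add_sentence(lang, text)
        self.links.add(frozenset((source_id, new_id)))
        return new_id

    def translations(self, sentence_id, lang=None):
        """Return the texts directly linked to a sentence, optionally for one language."""
        result = []
        for link in self.links:
            if sentence_id in link:
                (other_id,) = link - {sentence_id}
                other_lang, other_text = self.sentences[other_id]
                if lang is None or other_lang == lang:
                    result.append(other_text)
        return sorted(result)


corpus = Corpus()
english = corpus.add_sentence("eng", "How are you?")
corpus.add_translation(english, "fra", "Comment vas-tu ?")
# Two Spanish translations of the same sentence: a table row could not hold both.
corpus.add_translation(english, "spa", "¿Cómo estás?")
corpus.add_translation(english, "spa", "¿Cómo está usted?")
spanish = corpus.translations(english, lang="spa")
```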
Some other implications are pointed out below.


##3. Focus on the main sentence, not the other translations

When you translate a sentence, you are in fact adding a **sentence** (node) and a **link** (edge) between the original sentence and your translation. The only thing you need to care about is that you are adding a proper translation to the "main sentence" (the one at the top, in larger type).

Suppose you want to add a Spanish translation to the English sentence "How are you?", which happens to have an existing French translation, "Comment vas-tu ?". You would see the following:

_How are you?_

_=> Comment vas-tu ?_

You could add _"¿Cómo estás?"_ (informal) or _"¿Cómo está usted?"_ (formal). Even better, you could add both as separate translations of the same original sentence. You should not restrict yourself to choosing an informal translation simply because one was chosen by the person who was working in French. You should only care that your translation is a proper translation of the **English sentence**: if someone had to translate your contribution back to English, _"How are you?"_ would be a possibility.


##4. Translate the sentence as a whole rather than as a collection of individual words

We are not interested in having sentences that sound like they were written by a primitive robot that translated the individual words without regard for the sentence as a whole. We want sentences that a native speaker would really say. Translating is a very difficult task, we know. But if you are translating into your native language, you should always, always reread your translation in isolation from the original, and ask yourself if it is actually something people would say. If, for some reason, you want to add a word-by-word translation that is not what a native speaker would say, you should use the comments.

You are allowed to translate into a language other than your own, in which case you are forgiven for not writing native-like sentences. But in this case, please add a comment or tag to request that a native speaker check your sentences and correct any mistakes.

Tatoeba is not only about providing translations; it's also about gathering data on a language. Even if none of the sentences in a given language has yet been translated, those sentences are informative in themselves. However, they must be representative of their language.

To put it another way, the sentences are the basic layer in the Tatoeba corpus. The links between the sentences form another layer. But the corpus should have value even without that second layer.


##5. Do not edit a sentence if, by itself, it is correct

As I mentioned in the previous section, sentences in Tatoeba have value apart from their translations. Consequently, before you modify a sentence, look at it without paying attention to its translations, and ask yourself _"Does this sentence have any spelling or grammar mistakes? Does it sound weird?"_ If the answer is "No", then do NOT edit it, **leave it alone**! If one or more of the translations is incorrect, you should break the links between them (or ask someone else to do it for you), but the **content** of sentences that are valid in themselves should remain unchanged.

You may be tempted to edit a sentence so that its meaning matches all the other sentences. Or perhaps you want to turn a sentence into a word-for-word translation. But this is not a good idea. Not only does a word-for-word translation run the risk of sounding unnatural (cf. rule #4), but you're replacing data when you could be adding new data and keeping the old.

Sometimes, as the result of a mistake, a sentence may not match the original AT ALL. For instance:

_My name is Trang._

_=> Je m'appelle Trang._

_=> Vamos a la playa._

You notice that the Spanish sentence (which says "Let's go to the beach") has nothing to do with the English sentence.

Perhaps you don't speak Spanish very well, so you're not confident about modifying the Spanish sentence and decide to change the English sentence instead. However, then you create another problem: the French sentence won't fit the English sentence anymore...

Perhaps you are a native Spanish speaker and are tempted to change the Spanish sentence. In this particular case, it would still be acceptable because the Spanish sentence is not linked to any other sentence. But if someone had translated that Spanish sentence into Italian, "correcting" the Spanish sentence would cause a conflict with the Italian translation.

Then there is a problem you may not have thought of: when changing the meaning of a sentence, you are potentially erasing unique vocabulary. What if the Spanish sentence was currently the only one with "playa" in it?

The best way to proceed in this kind of situation is to add a new Spanish translation (_Me llamo Trang_) and "unlink" the current Spanish translation. NOTE: Only "trusted users" (users with a level of advanced contributor or higher) can unlink. However, anyone can post a comment to request that a sentence be unlinked.


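In graph terms, the recommended fix touches only the edges and adds a node; no existing sentence is rewritten. The following Python sketch shows this on the example above. The sentence ids, the language codes and the plain dict/set representation are assumptions chosen for illustration, not Tatoeba's actual data model.

```python
# Nodes: id -> (language, text). The ids are arbitrary, for illustration only.
sentences = {
    1: ("eng", "My name is Trang."),
    2: ("fra", "Je m'appelle Trang."),
    3: ("spa", "Vamos a la playa."),
}
# Edges: each link is an unordered pair of sentence ids.
links = {frozenset((1, 2)), frozenset((1, 3))}

# Step 1: add the correct Spanish translation as a NEW sentence and link it.
sentences[4] = ("spa", "Me llamo Trang.")
links.add(frozenset((1, 4)))

# Step 2: "unlink" the mismatched pair. Only the edge is removed;
# sentence 3 stays in the corpus, so the word "playa" is not lost.
links.discard(frozenset((1, 3)))

# Texts that are now linked to the English sentence.
linked_to_english = sorted(
    sentences[other_id][1]
    for link in links
    if 1 in link
    for other_id in link - {1}
)
```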
##6. Do not change the language in which a sentence is written

While you should correct the language flag for a sentence if it is wrong (for instance, it is flagged as Chinese when it is in fact Japanese), you shouldn't replace, say, a Japanese sentence with a Chinese sentence that has the same meaning.

The problem is that a sentence can be associated with data, such as comments, that depend on its language. People can post comments on sentences, and the comments may be valid only because the sentence was in a certain language.

At the moment, this is mostly an issue for Japanese sentences, which are associated with special annotations. These annotations are not displayed because they are not useful for normal users. If you change a Japanese sentence into an English sentence, then the annotations that were associated with it won't make sense anymore.

##7. Make sure you are adding comments to the right sentence

When you post a comment, the comment is only associated with the main sentence, so make sure that your comment is related to that particular sentence. For example, imagine that you want to point out a spelling mistake, as in the following:

_My name is Trang._

_=> Je m'appel Trang._

_=> Me llamo Trang._

You can see that the French sentence is wrong. It should be "appelle" and not "appel". If you posted your comment here, it would be associated with the English sentence (which is the main sentence, displayed at the top). This is not what you want. The right thing to do is to click on the French sentence first. The display will change to:

_Je m'appel Trang._

_=> My name is Trang._

_=> Me llamo Trang._

And then you can post your comment.

Now there is the case where you want to point out that a translation is wrong. Your comment will be related to two sentences, so where should you post it? Well, ideally, for this type of situation, there should be the possibility of commenting on a **link** between two sentences. But we don't have that, so we can only comment on a **sentence**. You are free to decide where you want to post your comment. Just remember that your comment must be related to the main sentence.


##8. Do not add sentences from copyrighted content

We are distributing the corpus under the [Creative Commons Attribution](http://creativecommons.org/licenses/by/2.0/fr/) (or CC-BY) license. This makes it possible for anyone to re-use this data in any way they want as long as they mention Tatoeba in their work.

As a contributor, you have agreed to the terms of use, and therefore you are providing your contributions under the CC-BY license as well. This means we can reuse your data in any way we want as long as we mention you, which we do via the logs and statistics.

But providing your work under CC-BY means you also have some responsibilities for what you provide. And you have to know that you **cannot** legally redistribute data if it was copied from a source that doesn't clearly state that you can. Typically, you cannot (legally) copy all the sentences from a textbook and add them to Tatoeba.

Don't worry, you (and we) won't land in jail and be in debt for life if you've added a couple of sentences from a textbook. But the law forbids us from taking a significant amount of someone's work and reusing it without their consent. Producing sentences and translations is work, so be careful where you get the sentences from. Preferably, come up with your own sentences or take them from books that are in the public domain.

If you have added or seen sentences that were copied from a copyrighted source, change a few words so that they are no longer exactly the same sentences. Or go negotiate with the authors and convince them to release their work under the CC-BY license so we can re-use it.

Please follow these guidelines so that we don't get sued.


##9. Do not annotate sentences

We want sentences to remain as natural as possible, so do not add annotations. For example, we do NOT want sentences like these:

1. I (female) am happy.
2. It's raining cats and dogs. (idiom)
3. I like her/him.

Regarding sentences 1 and 2, if you need to indicate that a sentence is a proverb, female speech, or whatever, then post a comment about it (or tag it, if you are a trusted user), but please do NOT add this information directly in the sentence.

Regarding sentence 3, instead of having only one sentence, split it into two sentences. Remember, you have the right to add multiple translations in the same language. So the following is encouraged:

> Je l'aime bien.
>
> => I like her.
>
> => I like him.

There are various reasons why we don't want annotations.

1. They can be a problem for people who are using our data in order to improve a natural language processing system, for instance.
2. Your translation can be retranslated into another language, and it is harder for people to translate sentences that contain alternatives (like "him/her"), since the alternatives may force changes to other parts of the sentence, making the result unwieldy.
3. If we want to record audio for the sentence, we will need to choose what exactly to record, and annotations don't help.


##10. Give us feedback

We know that Tatoeba is not perfect, so don't hesitate to [tell us](http://tatoeba.org/pages/contact) what you think is missing (though it is a good idea to check whether the subject has already been discussed on the [Wall](http://tatoeba.org/wall)). Also tell us if you see any spelling mistakes, feel that some explanations are not clear, or encounter bugs.

We also know that Tatoeba is a cool project so feel free to tell us you like it too :P


##11. Do not wait for us to code it if you can code it

As much as we welcome feedback, we welcome **INITIATIVE** even more. There are just sooo many things we could do. We can't take care of everything.

For instance, we are distributing the _**entire**_ corpus, but many people probably don't need _**all**_ the sentences in _**all**_ the languages. You may just want the English-Spanish sentences. Well, instead of asking and waiting for us to provide a file with only English-Spanish sentences, you can code a tool (and please, tell us if you do) that will extract only what you want from our files.

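Such an extraction tool could look something like the Python sketch below. It assumes, purely for illustration, that the sentences come as tab-separated lines of `id`, `language code`, `text`, and that the links come as tab-separated pairs of ids; check the actual layout of the downloaded files before relying on it. The function name `extract_pairs` and the sample data are hypothetical.

```python
import io


def extract_pairs(sentence_lines, link_lines, lang_a, lang_b):
    """Return (text_a, text_b) for every link going from lang_a to lang_b.

    Assumed input format (an assumption, not a documented guarantee):
      sentence_lines: "id<TAB>language<TAB>text"
      link_lines:     "source_id<TAB>target_id"
    """
    # Keep only the sentences in the two languages we care about.
    wanted = {}
    for line in sentence_lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) < 3:
            continue  # skip malformed lines
        sentence_id, lang, text = parts
        if lang in (lang_a, lang_b):
            wanted[sentence_id] = (lang, text)

    # Walk the links and keep those joining lang_a to lang_b.
    pairs = []
    for line in link_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        source = wanted.get(parts[0])
        target = wanted.get(parts[1])
        if source and target and source[0] == lang_a and target[0] == lang_b:
            pairs.append((source[1], target[1]))
    return pairs


# Tiny in-memory sample standing in for the real files.
sample_sentences = io.StringIO(
    "1\teng\tHow are you?\n"
    "2\tfra\tComment vas-tu ?\n"
    "3\tspa\t¿Cómo estás?\n"
    "4\tspa\t¿Cómo está usted?\n"
)
sample_links = io.StringIO("1\t2\n2\t1\n1\t3\n3\t1\n1\t4\n4\t1\n")

pairs = extract_pairs(sample_sentences, sample_links, "eng", "spa")
for english_text, spanish_text in pairs:
    print(english_text, "=>", spanish_text)
```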
That's just one example, but if you are a programmer, there could be many things you could do yourself instead of waiting for us to do them. But of course, tell us so we don't duplicate your effort.

You should also know that we are actually open source (under the AGPL license), but we are not really "promoting" this aspect because:

1. The code hasn't met my standards of elegance yet... Still too many parts that make me cringe when I look at them.
2. We still don't have a sound methodology and organization in our way of working and I really don't have time to manage more people.

However, if you love the project and are really motivated to join the development team, then feel free to contact us =)


##12. Indicate your languages in your profile

In case you didn't know, you can edit your profile by clicking on your username (at the top, in the menu bar).

Since Tatoeba involves languages, it can be very useful for other users to know which languages you can speak and how well you can speak them. We don't have a specific "languages" field, so write about it in your profile description (in the section "Something about you"). And tell other users to indicate their languages as well (if they haven't already), especially if they have already contributed.


##13. Encourage and educate new (or old) contributors

Community is very important in a project like Tatoeba. We just can't fulfill our ambitions without a strong community. But how do you build one? Well, one thing is NOT to make new users feel lost and isolated.

Part of this depends on the system. It has to be designed in a way that not only enables but also encourages users to interact with each other. Tatoeba offers a Wall, private messages, and comments.

And, of course, the other part depends on the community itself. It must make an effort to build its strength. If someone is asking a question that you can answer, don't hesitate to help out. If you notice someone is doing something wrong, don't hesitate to (politely!) tell them the right way to do it. If you notice someone or some people have been contributing significantly, don't hesitate to drop a line (in a private message or on the Wall) to say "congratulations" or "thank you" for their work.

More generally speaking, if you have any ideas about how to make Tatoeba a more socially pleasant place to be, then go ahead!


##14. Spread the love

Last but not least: you love the project, we love the project, we all want this project to become the greatest language tool of all time, so bring more people into this adventure!

In the end, anyone who knows how to read and how to write can participate. There's no need to be a polyglot. Even if you can "just" hunt for mistakes and correct them or point them out, this will already be extremely helpful. If you have programming skills, you can be helpful in working with our software. The more people we have, the more mistakes we can fix, and the more data we can produce that people can rely on. And everyone can live happily ever after.

Please follow these guidelines so that we don't get sued.


##9. Do not annotate sentences

We want sentences to remain as natural as possible, so do not add annotations. For example we do NOT want sentences like these:


1.  I (female) am happy.
2.  It's raining cats and dogs. (idiom)
3.  I like her/him.


Regarding sentences 1 and 2, if you need to indicate that a sentence is a proverb or female speech or whatsoever, then post a comment about it (or tag it, if you are a trusted user), but please do NOT add this information directly in the sentence.

Regarding sentence 3, instead of having only one sentence, split it into two sentences. Remember, you have the right to add multiple translations in a same language. So the following is encouraged:

> Je l'aime bien.
> 
> =&gt; I like her.
> 
> =&gt; I like him.





There are various reasons why we don't want annotations.


1.  They can be a problem for people who are using our data in order to improve a natural language processing system, for instance.
2.  Your translation can be retranslated into another language, and it's less easy for people to translate sentences that contain alternatives (like "him/her"), since they may result in changes to other parts of the sentences, making the result unwieldy.
3.  If we want to record audio for the sentence, we will need to choose what exactly to record, and annotations don't help.








##10. Give us feedback##

We know that Tatoeba is not perfect, so don't hesitate to [tell us](http://tatoeba.org/pages/contact) what you think is missing (though it is a good idea to see whether the subject has already been discussed on the [Wall](http://tatoeba.org/wall) already). Also tell us if you see any spelling mistake, feel that some explanations are not clear, or encounter bugs.

We also know that Tatoeba is a cool project so feel free to tell us you like it too :P


##11. Do not wait for us to code it if you can code it

As much as we welcome feedback, we welcome even more **INITIATIVE**. There are just sooo many things we could do. We can't take care of everything.

For instance we are distributing the _**entire **_corpus, but many people probably don't need **_all_** the sentences in _**all**_ the languages. You may just want the English-Spanish sentences. Well instead of asking and waiting for us to provide a file with only English-Spanish sentences, you can code a tool (and please, tell us if you do) that will extract only what you want from our files.

That's just one example but if you are a programmer, there could be many things you could do yourself instead of waiting for us to do it. But of course, tell us so we don't duplicate your effort.

You also have to know that we are actually open source (under AGPL license) but we are not really "promoting" this aspect because:

1.  The code hasn't met my standards of elegance yet... Still too many parts that make me cringe when I look at them.
2.  We still don't have a sound methodology and organization in our way of working and I really don't have time to manage more people.
However if you love the project and are really motivated to join the development team, then feel free to contact us =)



##12. Indicate your languages in your profile

For people who didn't know, you can edit your profile by clicking on your username (at the top, in the menu bar).

Since Tatoeba involves languages, it can be very useful for other users to know which languages you can speak and how well you can speak them. We don't have a specific "languages" field, so write about it in your profile description (in the section "Something about you"). And tell other users to indicate their languages as well (if they haven't already), especially if they have already contributed.



##13. Encourage and educate new (or old) contributors

Community is very important in a project like Tatoeba. We just can't fulfill our ambitions without a strong community. But how do you build one? Well, one thing is NOT to make new users feel lost and isolated.

Part of this depends on the system. It has to be designed in a way that not only enables but also encourages users to interact with each other. Tatoeba offers a Wall, private messages, and comments.

And, of course, the other part depends on the community itself. It must make an effort to build its strength. If someone is asking a question to which you can answer, don't hesitate to help out. If you notice someone is going something wrong, don't hesitate to (politely!) tell them the right way to do it. If you notice someone or some people have been contributing significantly, don't hesitate to drop a line (in a private message or on the Wall) to say "congratulations" or "thank you" for their work.

More generally speaking, if you have any idea on how to make Tatoeba a more socially pleasant place to be, then go ahead!


##14. Spread the love

Last but not least: you love the project, we love the project, we all want this project to become the greatest language tool of all time, so bring more people into this adventure!

In the end, anyone who knows how to read and how to write can participate. There's no need to be a polyglot. Even if you can "just" hunt for mistakes and correct them or point them out, this will be already extremely helpful. If you have programming skills, you can be helpful in working with our software. The more people we have, the more mistakes we can fix, and the more data we can produce that people can rely on. And everyone can live happily ever after.

version at: 26/04/2013, 22:30

##Introduction

This article by Trang is a must-read for anyone who is serious about contributing to Tatoeba. 

Here is an outline of how to be a good contributor:

1.  [Understand the context of the project](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule1)
2.  [Understand how the corpus is structured](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule2)
3.  [Focus on the main sentence, not the other translations](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule3)
4.  [Translate the sentence as a whole rather than as a collection of individual words](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule4)
5.  [Do not edit a sentence if, by itself, it is correct](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule5)
6.  [Do not change the language in which a sentence is written](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule6)
7.  [Make sure you are adding comments to the right sentence](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule7)
8.  [Do not add sentences from copyrighted content](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule8)
9.  [Do not annotate sentences](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule9)
10.  [Give us feedback](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule10)
11.  [Do not wait for us to code it if you can code it](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule11)
12.  [Indicate your languages in your profile](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule12)
13.  [Encourage and educate new (or old) contributors](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule13)
14.  [Spread the love](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule14)

##1. Understand the context of the project

*   I started this project in 2006. The initiative was driven by my passion for language learning and my frustration at not finding an adequate online dictionary. The project is focused on **sentences**, and I insist on **sentences**. Sample sentences were (and still are) a very scarce resource. Please only add **complete sentences** if you are going to contribute.
*   I was alone on this project for some time. It was only three years later, in 2009, that other people (all computer science students) started to help me out by coding more features.
*   Tatoeba is NOT a commercial project. We're not a company, and we're not paid for doing any of this. It is something that we're working on in our **free time**.
*   To be honest, we don't exclude the possibility of starting a company someday, but that is if and only if we have an innovative, coherent and ethical business model (yeah, good luck). Having ads everywhere and driving a lot of traffic, or forcing people to pay to access the data, is out of the question.

##2. Understand how the corpus is structured

The corpus is structured not as a **table** but as a **graph** (in the computer science sense of the word). What does that mean? Well, imagine you had to extract part of the corpus and write it on paper. You might do something like this:

<table>
<tr> <th>English</th> <th>French</th> <th>Spanish</th> </tr>
<tr> <td>My name is Trang.</td> <td>Je m'appelle Trang.</td> <td>Me llamo Trang.</td> </tr>
<tr> <td>How are you?</td> <td>Comment vas-tu?</td> <td>¿Cómo estás?</td> </tr>
<tr> <td>...</td> <td>...</td> <td>...</td> </tr>
</table>

That's a **table** structure. There are **rows** and **columns**: a row contains sentences with the same meaning, and a column contains sentences with the same language. That's the first approach anyone might take, but that's NOT how the corpus is constructed.

Our corpus is set up like this: *[note: diagram needs to be imported]*

That's a **graph** structure. There are **nodes** and **edges**: each node represents a sentence, and each edge represents the link between two sentences. When two sentences are linked, they have the same meaning.

The structure of the representation has a big effect on the way you can contribute to the corpus. One important implication of the graph structure is that you can add **multiple translations in the same language** for a specific sentence. You think there are two ways to translate a sentence and you really can't decide which would be the best? Well, just add both!

Some other implications are pointed out below.
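To make the graph idea a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how Tatoeba is actually implemented: the `Corpus` class, its method names, and the language codes (`eng`, `fra`, `spa`) are all assumptions made up for this example.

```python
from collections import defaultdict


class Corpus:
    """Toy model of a sentence graph: sentences are nodes, links are edges."""

    def __init__(self):
        self.sentences = {}            # sentence id -> (language, text)
        self.links = defaultdict(set)  # sentence id -> ids of linked sentences
        self._next_id = 1

    def add_sentence(self, lang, text):
        sid = self._next_id
        self._next_id += 1
        self.sentences[sid] = (lang, text)
        return sid

    def link(self, a, b):
        # Links are undirected: each sentence is a translation of the other.
        self.links[a].add(b)
        self.links[b].add(a)

    def translations(self, sid, lang=None):
        """Texts of the sentences directly linked to `sid`, optionally in one language."""
        result = []
        for other in sorted(self.links[sid]):
            other_lang, text = self.sentences[other]
            if lang is None or other_lang == lang:
                result.append(text)
        return result


corpus = Corpus()
en = corpus.add_sentence("eng", "How are you?")
fr = corpus.add_sentence("fra", "Comment vas-tu ?")
es_informal = corpus.add_sentence("spa", "¿Cómo estás?")
es_formal = corpus.add_sentence("spa", "¿Cómo está usted?")
for sid in (fr, es_informal, es_formal):
    corpus.link(en, sid)

# Two Spanish translations of the same English sentence: no column to squeeze into.
print(corpus.translations(en, lang="spa"))
```

In a table, the two Spanish sentences would have to compete for a single cell. In the graph, they are simply two more nodes linked to the same English sentence.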


##3. Focus on the main sentence, not the other translations

When you translate a sentence, you are in fact adding a **sentence** (node) and a **link** (edge) between the original sentence and your translation. The only thing you need to care about is that you are adding a proper translation to the "main sentence" (the one at the top, in larger type).

Assume you wanted to add a Spanish translation to the English sentence "How are you?", which happens to have an existing French translation "Comment vas-tu ?" You would see the following:

_How are you?_

_=&gt; Comment vas-tu ?_

You could add _"¿Cómo estás?"_ (informal) or _"¿Cómo está usted?"_ (formal). Even better, you could add both as separate translations of the same original sentence. You should not restrict yourself to choosing an informal translation simply because one was chosen by the person who was working in French. You should only care that your translation is a proper translation of the **English sentence**: if someone had to translate your contribution back to English, _"How are you?"_ would be a possibility.


##4. Translate the sentence as a whole rather than as a collection of individual words

We are not interested in having sentences that sound like they were written by a primitive robot that translated the individual words without regard for the sentence as a whole. We want sentences that a native speaker would really say. Translating is a very difficult task, we know. But if you are translating into your native language, you should always, always reread your translation in isolation from the original, and ask yourself if it is actually something people would say. If, for some reason, you want to add a word-by-word translation that is not what a native speaker would say, you should use the comments.

You are allowed to translate into a language other than your own, in which case you are forgiven for not writing native-like sentences. But in this case, please add a comment or tag to request that a native speaker check your sentences and correct any mistakes.

Tatoeba is not only about providing translations; it's also about gathering data on a language. Even if none of the sentences in a given language has been translated yet, those sentences are informative in themselves. However, they must be representative of their language.

To put it another way, the sentences are the basic layer in the Tatoeba corpus. The links between the sentences form another layer. But the corpus should have value even without that second layer.



##5. Do not edit a sentence if, by itself, it is correct

As I mentioned in the previous section, sentences in Tatoeba have value apart from their translations. Consequently, before you modify a sentence, look at it without paying attention to its translations, and ask yourself _"Does this sentence have any spelling or grammar mistakes? Does it sound weird?"_. If the answer is "No", then do NOT edit it, **leave it alone**! If one or more of the translations is incorrect, you should break the links between them (or ask someone else to do it for you), but the **content** of sentences that are valid in themselves should remain unchanged.

You may be tempted to edit a sentence so that its meaning matches all the other sentences. Or perhaps you want to turn a sentence into a word-for-word translation. But this is not a good idea. Not only does a word-for-word translation run the risk of sounding unnatural (cf. rule #4), but you're replacing data when you could be adding new data and keeping the old.

Sometimes, as the result of a mistake, a sentence may not match the original AT ALL. For instance:



_My name is Trang._

_=&gt; Je m'appelle Trang._

_=&gt; Vamos a la playa._

You notice that the Spanish sentence (which says "Let's go to the beach") has nothing to do with the English sentence.

Perhaps you don't speak Spanish very well so you're not confident in modifying the Spanish sentence and decide to change the English sentence instead. However, then you create another problem: the French sentence won't fit the English sentence anymore...

Perhaps you are a native Spanish speaker and are tempted to change the Spanish sentence. In this particular case, it would still be acceptable because the Spanish sentence is not linked to any other sentence. But if someone had translated that Spanish sentence into Italian, "correcting" the Spanish sentence would cause a conflict with the Italian translation.

Then there is a problem you may not have thought of: when changing the meaning of a sentence, you are potentially erasing unique vocabulary. What if the Spanish sentence was currently the only one with "playa" in it?

The best way to proceed in this kind of situation is to add a new Spanish translation (_Me llamo Trang_) and "unlink" the current Spanish translation. NOTE: Only "trusted users" (users with a level of advanced contributor or higher) can unlink. However, anyone can post a comment to request that a sentence be unlinked.
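Here is a rough Python sketch of why adding and unlinking is safer than editing. The ids, the language codes, and the helper names `add_translation` and `unlink` are invented for the illustration and do not describe the real Tatoeba code.

```python
# Sentences (nodes) and links (edges), using made-up ids.
sentences = {
    1: ("eng", "My name is Trang."),
    2: ("fra", "Je m'appelle Trang."),
    3: ("spa", "Vamos a la playa."),
}
links = {frozenset({1, 2}), frozenset({1, 3})}


def add_translation(sentences, links, source_id, lang, text):
    """Add a new sentence and link it to `source_id`; return the new id."""
    new_id = max(sentences) + 1
    sentences[new_id] = (lang, text)
    links.add(frozenset({source_id, new_id}))
    return new_id


def unlink(links, a, b):
    """Remove the link between two sentences; both sentences are kept."""
    links.discard(frozenset({a, b}))


new_id = add_translation(sentences, links, 1, "spa", "Me llamo Trang.")
unlink(links, 1, 3)

# The mismatched sentence is still in the corpus, so "playa" is not lost.
print(sentences[3])
print(sorted(sorted(pair) for pair in links))
```

Nothing is overwritten here: the corpus gains a correct Spanish translation, loses only the wrong link, and keeps every sentence it had before.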


##6. Do not change the language in which a sentence is written

While you should correct the language flag for a sentence if it is wrong (for instance, it is flagged as Chinese when it is in fact Japanese), you shouldn't replace, say, a Japanese sentence with a Chinese sentence that has the same meaning.

The problem is that a sentence can be associated with data, such as comments, that depend on its language. People can post comments on sentences, and the comments may be valid only because the sentence was in a certain language.

At the moment, this is mostly an issue for Japanese sentences, which are associated with special annotations. These annotations are not displayed because they are not useful for normal users. If you change a Japanese sentence into an English sentence, then the annotations that were associated with it won't make sense anymore.

##7. Make sure you are adding comments to the right sentence

When you post a comment, the comment is only associated with the main sentence, so make sure that your comment is related to that particular sentence. For example, imagine that you want to point out a spelling mistake, as in the following:

_My name is Trang._

_=&gt; Je m'appel Trang._

_=&gt; Me llamo Trang._

You can see that the French sentence is wrong. It should be "appelle" and not "appel". If you posted your comment here, it would be associated with the English sentence (which is the main sentence, displayed at the top). This is not what you want. The right thing to do is to click on the French sentence first. The display will change to:

_Je m'appel Trang._

_=&gt; My name is Trang._

_=&gt; Me llamo Trang._

And then you can post your comment.

Now there is the case where you want to point out that a translation is wrong. Your comment will be related to two sentences, so where should you post it? Well, ideally, for this type of situation, there should be the possibility of commenting on a **link** between two sentences. But we don't have that, so we can only comment on a **sentence**. You are free to decide where you want to post your comment. Just remember that your comment must be related to the main sentence.


##8. Do not add sentences from copyrighted content

We are distributing the corpus under the [Creative Commons Attribution](http://creativecommons.org/licenses/by/2.0/fr/) (or CC-BY) license. This makes it possible for anyone to re-use this data in any way they want as long as they mention Tatoeba in their work.

As a contributor, you have agreed with the terms of use, and therefore you are providing your contributions under the CC-BY license as well. This means we can reuse your data in any way we want as long as we mention you, which we do via the logs and statistics.

But providing your work under CC-BY means you also have some responsibilities for what you provide. And you have to know that you **cannot** legally redistribute data if it was copied from a source that doesn't clearly state that you can. Typically, you cannot (legally) copy all the sentences from a textbook and add them to Tatoeba.

Don't worry, you (and we) won't land in jail and be in debt for life if you've added a couple of sentences from a textbook. But the law forbids us from taking a significant amount of someone's work and reusing it without their consent. Producing sentences and translations is work, so be careful where you get the sentences from. Preferably, come up with your own sentences or take them from books that are in the public domain.

If you have added or seen sentences that were copied from a copyrighted source, change a few words so that they are no longer exactly the same sentences. Or go negotiate with the authors and convince them to release their work under the CC-BY license so we can re-use it.

Please follow these guidelines so that we don't get sued.


##9. Do not annotate sentences

We want sentences to remain as natural as possible, so do not add annotations. For example, we do NOT want sentences like these:


1.  I (female) am happy.
2.  It's raining cats and dogs. (idiom)
3.  I like her/him.


Regarding sentences 1 and 2, if you need to indicate that a sentence is a proverb, female speech, or anything else of that kind, then post a comment about it (or tag it, if you are a trusted user), but please do NOT add this information directly in the sentence.

Regarding sentence 3, instead of having only one sentence, split it into two sentences. Remember, you have the right to add multiple translations in the same language. So the following is encouraged:

> Je l'aime bien.
> 
> =&gt; I like her.
> 
> =&gt; I like him.





There are various reasons why we don't want annotations.


1.  They can be a problem for people who are using our data in order to improve a natural language processing system, for instance.
2.  Your translation can be retranslated into another language, and it's harder for people to translate sentences that contain alternatives (like "him/her"), since they may result in changes to other parts of the sentence, making the result unwieldy.
3.  If we want to record audio for the sentence, we will need to choose what exactly to record, and annotations don't help.
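If you want to check your own sentences before adding them, a rough heuristic like the following Python sketch can flag the kinds of annotations shown above. The two patterns and the function name `looks_annotated` are assumptions made for this example; they will also flag legitimate sentences that happen to contain parentheses or a slash, so treat a hit as a reason to reread the sentence, not as a verdict.

```python
import re

# Rough, assumed patterns; real sentences may need more careful rules.
PARENTHETICAL = re.compile(r"\([^)]*\)")   # e.g. "(female)", "(idiom)"
ALTERNATIVE = re.compile(r"\b\w+/\w+\b")   # e.g. "her/him"


def looks_annotated(sentence):
    """Return True if the sentence seems to contain an annotation."""
    return bool(PARENTHETICAL.search(sentence) or ALTERNATIVE.search(sentence))


examples = [
    "I (female) am happy.",
    "It's raining cats and dogs. (idiom)",
    "I like her/him.",
    "I like her.",
    "I like him.",
]
for s in examples:
    print(looks_annotated(s), s)
```

The first three examples are flagged, while the two split sentences ("I like her." and "I like him.") pass.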








##10. Give us feedback

We know that Tatoeba is not perfect, so don't hesitate to [tell us](http://tatoeba.org/pages/contact) what you think is missing (though it is a good idea to check whether the subject has already been discussed on the [Wall](http://tatoeba.org/wall)). Also tell us if you see any spelling mistakes, feel that some explanations are not clear, or encounter bugs.

We also know that Tatoeba is a cool project so feel free to tell us you like it too :P


##11. Do not wait for us to code it if you can code it

As much as we welcome feedback, we welcome **INITIATIVE** even more. There are just sooo many things we could do. We can't take care of everything.

For instance, we are distributing the _**entire**_ corpus, but many people probably don't need _**all**_ the sentences in _**all**_ the languages. You may just want the English-Spanish sentences. Well, instead of asking and waiting for us to provide a file with only English-Spanish sentences, you can code a tool (and please, tell us if you do) that will extract only what you want from our files.

That's just one example but if you are a programmer, there could be many things you could do yourself instead of waiting for us to do it. But of course, tell us so we don't duplicate your effort.
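As a starting point, such an extraction tool could look something like the Python sketch below. It assumes the corpus comes as two tab-separated files, one with `id`, language code and text per line, and one with pairs of linked sentence ids. That layout, the language codes, and the function names are assumptions for the example, so check them against the files you actually download.

```python
import csv
import io


def load_sentences(f):
    """Read lines of the assumed form: id <TAB> language <TAB> text."""
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): (row[1], row[2]) for row in reader if len(row) >= 3}


def extract_pairs(sentences, links_file, src_lang, dst_lang):
    """Yield (source text, translation text) for links going src_lang -> dst_lang."""
    reader = csv.reader(links_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue
        a, b = int(row[0]), int(row[1])
        if a in sentences and b in sentences:
            (lang_a, text_a), (lang_b, text_b) = sentences[a], sentences[b]
            if lang_a == src_lang and lang_b == dst_lang:
                yield text_a, text_b


# Tiny in-memory stand-ins for the real files (hypothetical data).
sample_sentences = io.StringIO(
    "1\teng\tMy name is Trang.\n"
    "2\tfra\tJe m'appelle Trang.\n"
    "3\tspa\tMe llamo Trang.\n"
)
sample_links = io.StringIO("1\t2\n2\t1\n1\t3\n3\t1\n")

sentences = load_sentences(sample_sentences)
pairs = list(extract_pairs(sentences, sample_links, "eng", "spa"))
print(pairs)
```

With real files, you would pass file objects opened with `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` samples. Loading every sentence into a dictionary keeps the sketch short, but it may use a lot of memory on the full corpus.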

You should also know that we are actually open source (under the AGPL license), but we are not really "promoting" this aspect because:

1.  The code hasn't met my standards of elegance yet... Still too many parts that make me cringe when I look at them.
2.  We still don't have a sound methodology and organization in our way of working and I really don't have time to manage more people.

However, if you love the project and are really motivated to join the development team, then feel free to contact us =)



##12. Indicate your languages in your profile

In case you didn't know, you can edit your profile by clicking on your username (at the top, in the menu bar).

Since Tatoeba involves languages, it can be very useful for other users to know which languages you can speak and how well you can speak them. We don't have a specific "languages" field, so write about it in your profile description (in the section "Something about you"). And tell other users to indicate their languages as well (if they haven't already), especially if they have already contributed.



##13. Encourage and educate new (or old) contributors

Community is very important in a project like Tatoeba. We just can't fulfill our ambitions without a strong community. But how do you build one? Well, one thing is NOT to make new users feel lost and isolated.

Part of this depends on the system. It has to be designed in a way that not only enables but also encourages users to interact with each other. Tatoeba offers a Wall, private messages, and comments.

And, of course, the other part depends on the community itself. It must make an effort to build its strength. If someone is asking a question that you can answer, don't hesitate to help out. If you notice someone is doing something wrong, don't hesitate to (politely!) tell them the right way to do it. If you notice someone or some people have been contributing significantly, don't hesitate to drop a line (in a private message or on the Wall) to say "congratulations" or "thank you" for their work.

More generally speaking, if you have any ideas about how to make Tatoeba a more socially pleasant place to be, then go ahead!


##14. Spread the love

Last but not least: you love the project, we love the project, we all want this project to become the greatest language tool of all time, so bring more people into this adventure!

In the end, anyone who knows how to read and how to write can participate. There's no need to be a polyglot. Even if you can "just" hunt for mistakes and correct them or point them out, this will already be extremely helpful. If you have programming skills, you can be helpful in working with our software. The more people we have, the more mistakes we can fix, and the more data we can produce that people can rely on. And everyone can live happily ever after.
