Version at: 26/04/2013, 16:53 vs. version at: 26/04/2013, 16:54
##Introduction

This article by Trang is a must-read for anyone who is serious about contributing to Tatoeba.

Here is an outline of how to be a good contributor:

1. [Understand the context of the project](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule1)
2. [Understand how the corpus is structured](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule2)
3. [Do not pay attention to the other translations](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule3)
4. [Do not translate word for word](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule4)
5. [Do not edit a sentence if, by itself, it is correct](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule5)
6. [Do not change the language in which a sentence is written](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule6)
7. [Make sure you are adding comments to the right sentence](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule7)
8. [Do not add sentences from copyrighted content](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule8)
9. [Do not annotate sentences](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule9)
10. [Give us feedback](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule10)
11. [Do not wait for us to code it if you can code it](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule11)
12. [Indicate your languages in your profile](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule12)
13. [Encourage and educate new (or even not so new) contributors](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule13)
14. [Spread the love](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule14)

##1. Understand the context of the project

I will (someday) write a more detailed history, but here are the basic facts you should be aware of.

* I started this project in 2006. The initiative was driven by my passion for language learning and frustration about not finding an adequate online dictionary. The project is focused on **sentences** and I insist on **sentences**. Sample sentences were (and still are) a very scarce resource. Please only add **complete sentences** if you are going to contribute.
* I was alone on this project for some time. It was only three years later, in 2009, that other people (all computer science students) started to help me out by coding more features.
* Tatoeba is NOT a commercial project. We're not a company, we're not paid for doing any of this. It is something that we're working on in our **free time**.
* To be honest, we don't exclude the possibility of starting a company someday, but that is if and only if we have an innovative, coherent and ethical business model (yeah, good luck). Having ads everywhere and driving a lot of traffic, or forcing people to pay to access the data, is out of the question.

##2. Understand how the corpus is structured

The corpus is structured not as a **table** but as a **graph** (in the computer science sense of the word). What does that mean? Well, imagine you had to extract part of the corpus and write it on paper. You might do something like this:

<table><tbody>
<tr> <th style="text-align: left;">English</th> <th style="text-align: left;">French</th> <th style="text-align: left;">Spanish</th> </tr>
<tr> <td>My name is Trang.</td> <td>Je m'appelle Trang.</td> <td>Me llamo Trang.</td> </tr>
<tr> <td>How are you?</td> <td>Comment vas-tu?</td> <td>¿Cómo estás?</td> </tr>
<tr> <td>...</td> <td>...</td> <td>...</td> </tr>
</tbody></table>

That's a **table** structure. There are **rows** and **columns**: a row contains sentences with the same meaning, and a column contains sentences in the same language. That's the first approach anyone might take, but that's NOT how the corpus is constructed.

Our corpus is set up like this: *[note: diagram needs to be imported]*

That's a **graph** structure. There are **nodes** and **edges**: each node represents a sentence, and each edge represents the link between two sentences. When two sentences are linked, they have the same meaning.

The structure of the representation has a big effect on the way you can contribute to the corpus. One important implication of the graph structure is that you can add **multiple translations in the same language** for a specific sentence. You think there are two ways to translate a sentence and you really can't decide which would be the best? Well, just add both!

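To make the graph idea more concrete, here is a minimal Python sketch. It is not how Tatoeba is actually implemented: the `Corpus` class, its method names and the three-letter language codes are all made up for illustration, and it assumes that a link works in both directions.

```python
from collections import defaultdict


class Corpus:
    """Toy sentence graph: nodes are sentences, edges are links (hypothetical model)."""

    def __init__(self):
        self.sentences = {}            # sentence id -> (language, text)
        self.links = defaultdict(set)  # sentence id -> ids of linked sentences
        self._next_id = 1

    def add_sentence(self, lang, text):
        """Add a node and return its id."""
        sentence_id = self._next_id
        self._next_id += 1
        self.sentences[sentence_id] = (lang, text)
        return sentence_id

    def link(self, a, b):
        """Add an edge; assumed symmetric (each sentence translates the other)."""
        self.links[a].add(b)
        self.links[b].add(a)

    def add_translation(self, original_id, lang, text):
        """Translating means adding a node AND an edge to the original."""
        new_id = self.add_sentence(lang, text)
        self.link(original_id, new_id)
        return new_id

    def translations(self, sentence_id, lang=None):
        """Texts of directly linked sentences, optionally filtered by language."""
        result = []
        for other_id in sorted(self.links[sentence_id]):
            other_lang, other_text = self.sentences[other_id]
            if lang is None or other_lang == lang:
                result.append(other_text)
        return result


corpus = Corpus()
eng = corpus.add_sentence("eng", "How are you?")
corpus.add_translation(eng, "fra", "Comment vas-tu?")
corpus.add_translation(eng, "spa", "¿Cómo estás?")
corpus.add_translation(eng, "spa", "¿Cómo está usted?")

spanish = corpus.translations(eng, lang="spa")
```

Here `spanish` holds both Spanish sentences for the same English sentence, which a one-cell-per-language table could not represent.
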
Some other implications are pointed out below.

##3. Do not pay attention to the other translations

When you translate a sentence, you are in fact **adding a sentence** (a node) and **adding a link** (an edge) between the "original" sentence and your translation. So the only thing you need to care about is that you are adding a proper translation to the "main sentence" (the one at the top, written in a bigger size).

More concretely, if you were in this situation and wanted to add a Spanish translation to the English sentence:

_How are you?_

_=> Comment vas-tu?_

You could add _"¿Cómo estás?"_ (casual) just as well as _"¿Cómo está usted?"_ (formal). Or you could add both (because you can add multiple translations in the same language).

If you understand French, it **doesn't matter** if the French sentence is the casual form; you only have to make sure that your translation is a proper translation of the **English sentence**. A proper translation means that if someone had to translate your contribution back to English, _"How are you?"_ would be a possibility.

##4. Do not translate word for word

We are not interested in having sentences that sound like they were written by a robot. We want sentences that really are what a native speaker would say. Translating is a very difficult task, we know it. But if you are translating into your native language, you should always, always re-read your translation as if it were a standalone sentence, and ask yourself if it is actually something people would say. You can use the comments to indicate a literal translation.

If you are not translating into your native language (which you can do), you are forgiven for not writing native-like sentences. But in this case, please make sure you find a native speaker to check your sentences so that your possible mistakes get corrected more quickly.

The point is to understand that Tatoeba is not only about providing translations, it's also about gathering data about a language. Tatoeba could simply be limited to adding sentences without translating them at all. If we were to extract only the sentences in Italian, we would want each of them to be representative of the Italian language.

The sentences are the basic layer. The links between the sentences are another layer. But the corpus should make sense without those links.

##5. Do not edit a sentence if, by itself, it is correct

As I mentioned just above, Tatoeba could simply be limited to adding sentences without translating them at all. Consequently, before you modify a sentence, look at it without paying attention to its translations, and ask yourself _"Does this sentence have any spelling or grammar mistakes? Does it sound weird?"_. If the answer is "No", then do NOT edit it, **leave it alone**!

I am explaining this because you may be tempted to edit a sentence so that its meaning matches all the other sentences.

It could be because you want to turn a sentence into a more "literal" translation. But this is not a good idea. Obviously, if we don't want you to translate word for word (cf. rule #4), we also don't want you to change a sentence into a word-for-word translation.

It could also be because the sentence doesn't match AT ALL. For instance:

_My name is Trang._

_=> Je m'appelle Trang._

_=> Vamos a la playa._

You notice that the Spanish sentence (which says "Let's go to the beach") has nothing to do with the English sentence.

Perhaps you don't speak Spanish very well so you're not confident in modifying the Spanish sentence and decide to change the English sentence. Problem: what about the French sentence? It won't fit the English sentence anymore...

Perhaps you are a native Spanish speaker and decide to change the Spanish sentence. In this particular case, it would still be acceptable because the Spanish sentence is not linked to any other sentence. But if someone had translated that Spanish sentence into Italian, "correcting" the Spanish sentence would cause a conflict with the Italian translation.

Then there is a problem you may not have thought of: when changing the meaning of a sentence, you are potentially erasing unique vocabulary. What if the Spanish sentence was currently the only one with "playa" in it?

So the best way to proceed in this kind of situation is to add a new Spanish translation (_Me llamo Trang_) and "unlink" the current Spanish translation. NOTE: Not everyone can unlink. Only "trusted users" can. You can post a comment to request a sentence to be unlinked.

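A small Python sketch of the "add a new translation, then unlink" approach, using the example above. The ids, the data layout and the `add_translation` and `unlink` helpers are hypothetical and only meant to show the idea, not how the site does it.

```python
# Sentences as id -> (language, text); ids are invented for this example.
sentences = {
    1: ("eng", "My name is Trang."),
    2: ("fra", "Je m'appelle Trang."),
    3: ("spa", "Vamos a la playa."),
}
# Links as unordered pairs, stored as (smaller id, larger id).
links = {(1, 2), (1, 3)}


def add_translation(sentences, links, original_id, lang, text):
    """Add a new sentence and link it to the original one."""
    new_id = max(sentences) + 1
    sentences[new_id] = (lang, text)
    links.add((min(original_id, new_id), max(original_id, new_id)))
    return new_id


def unlink(links, a, b):
    """Remove the link between two sentences; both sentences stay in the corpus."""
    links.discard((min(a, b), max(a, b)))


# Instead of editing sentence 3, add the correct translation and unlink the wrong one.
new_id = add_translation(sentences, links, 1, "spa", "Me llamo Trang.")
unlink(links, 1, 3)
```

After these two steps the sentence containing "playa" is still in the corpus, untouched, and the English sentence is linked only to sentences that actually match it.
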
##6. Do not change the language in which a sentence is written

If the language flag of a sentence is wrong (for instance it was flagged as Chinese when it is in fact Japanese), then of course, you can change it. That's not what I mean by "Do not change the language in which a sentence is written".

What I mean is that you shouldn't replace a Japanese sentence with a Chinese sentence that has the same meaning (and that applies to any language of course). It shouldn't happen often, but if you're in a situation where you want to do that, then don't.

The problem is that a sentence can be associated with data that depends on its language, such as comments. People can post comments on sentences, and the comments may be valid only because the sentence was in a certain language.

At the moment it is more of an issue for Japanese sentences, which are associated with some sort of annotations. These annotations are not displayed because they are not useful for normal users. If you change a Japanese sentence into an English sentence, then the annotations that were associated with it won't make sense anymore.

##7. Make sure you are adding comments to the right sentence

When you post a comment, the comment is only associated with the main sentence, so make sure that your comment is related to that particular sentence. Typically, if you want to point out a spelling mistake, like here:

_My name is Trang._

_=> Je m'appel Trang._

_=> Me llamo Trang._

You can see that the French sentence is wrong. It should be "appelle" and not "appel". If you post your comment here, it would be associated with the English sentence (because it's at the top, so it's the main sentence). This is not what you want. The right thing to do is to click on the French sentence first. It will change the configuration into:

_Je m'appel Trang._

_=> My name is Trang._

_=> Me llamo Trang._

And then you can post your comment.

Now there is the case where you want to point out that a translation is wrong. Your comment will be related to two sentences, so where should you post it? Well, ideally, for this type of situation, there should be the possibility to comment on a **link** between two sentences. But we don't have that, we can only comment on a **sentence**. So you are free to decide where you want to post your comment. Just remember that it's good as long as your comment is related to the main sentence.

##8. Do not add sentences from copyrighted content

We are distributing the corpus under the [Creative Commons Attribution](http://creativecommons.org/licenses/by/2.0/fr/) (or CC-BY) license. It makes it possible for anyone to re-use this data in any way they want as long as they mention Tatoeba in their work.

As a contributor, you have agreed to the terms of use (which of course you haven't read), and therefore you are providing your contributions under the CC-BY license as well. Which means we can re-use your data in any way we want as long as we mention you. So we are re-using your work in Tatoeba, and we mention you through the logs and the stats.

But providing your work under CC-BY means you also have some responsibilities regarding what you provide. And you have to know that you **cannot** legally redistribute data if it was copied from a source that doesn't clearly state that you can do it. Typically, you cannot (legally) copy all the sentences from a textbook and add them to Tatoeba.

Don't worry, you (and we) won't go to jail and be in debt for life if you've added a couple of sentences from a textbook (hopefully...). But the law forbids us from taking someone's work and re-using it without their consent. Producing sentences and translations is work, so be careful where you get the sentences from. Preferably, come up with your own sentences or take them from books that are in the public domain.

If you have added or have seen sentences that were copied from copyrighted material, change a few words so that it won't be exactly the same sentence. Or, go negotiate with the authors and convince them to release their work under the CC-BY license so we can re-use it.

I'm not going to argue about whether all of this makes sense or not (obviously I don't believe it does), but it will help us a lot if everyone does what is necessary so we don't get sued.

##9. Do not annotate sentences

We want sentences to remain as "raw" as possible so do not add annotations. For example we do NOT want sentences like this:

1. I (female) am happy.
2. It's raining cats and dogs. (idiom)
3. I like her/him.

Regarding sentences 1 and 2, if you need to indicate that a sentence is a proverb or female speech or whatever else, then post a comment about it (or tag it, if you are a trusted user), but please do NOT add this information directly in the sentence.

Regarding sentence 3, instead of having only one sentence, split it into two sentences. Remember, you have the right to add multiple translations in the same language. So it's okay to have this:

> Je l'aime bien.
>
> => I like her.
>
> => I like him.

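For sentences that already contain a slash alternative, the split can be sketched in a few lines of Python. This is only an illustration: `split_alternatives` is a made-up helper, it handles a single `word/word` pair, and a human should still check that each resulting sentence sounds natural.

```python
import re


def split_alternatives(sentence):
    """Turn one 'a/b' alternative into separate sentences (hypothetical helper)."""
    match = re.search(r"(\w+)/(\w+)", sentence)
    if match is None:
        return [sentence]  # nothing to split
    before = sentence[:match.start()]
    after = sentence[match.end():]
    return [before + option + after for option in match.groups()]


clean_sentences = split_alternatives("I like her/him.")
```

Each resulting sentence can then be added as its own translation, with no annotation left in the text.
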
There are various reasons why we don't want annotations.

1. They can be a problem for people who are using our data in order to improve a natural language processing system, for instance.
2. Your translation can be retranslated into another language, and it's less easy for people to translate sentences that contain alternatives (like "him/her").
3. If we want to record audio for the sentence, we will need to choose what exactly to record, and annotations don't help.

##10. Give us feedback

We know that Tatoeba is not perfect so don't hesitate to [tell us](http://tatoeba.org/pages/contact) what you think is missing (just make sure no one has talked about it on the [Wall](http://tatoeba.org/wall) already). Also tell us if you see any spelling mistakes, feel that some explanations are not clear, or encounter bugs.

We also know that Tatoeba is a cool project so feel free to tell us you like it too :P

##11. Do not wait for us to code it if you can code it

As much as we welcome feedback, we welcome even more **INITIATIVE**. There are just sooo many things we could do. We can't take care of everything.

For instance we are distributing the _**entire**_ corpus, but many people probably don't need _**all**_ the sentences in _**all**_ the languages. You may just want the English-Spanish sentences. Well, instead of asking and waiting for us to provide a file with only English-Spanish sentences, you can code a tool (and please, tell us if you do) that will extract only what you want from our files.

That's just one example but if you are a programmer, there could be many things you could do yourself instead of waiting for us to do it. But of course, tell us so we don't start working on something you plan to work on.

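As a rough idea of what such an extraction tool could look like, here is a Python sketch. It assumes two tab-separated exports: one with a sentence id, a language code and the text on each line, and one with pairs of linked sentence ids. That layout, the `eng`/`spa` codes and the `extract_pairs` function are assumptions made for this example, so check them against the actual files before relying on it.

```python
import io


def extract_pairs(sentence_lines, link_lines, source_lang, target_lang):
    """Return (source text, target text) pairs for one language pair."""
    # Keep only sentences in the two languages we care about.
    sentences = {}
    for line in sentence_lines:
        sentence_id, lang, text = line.rstrip("\n").split("\t", 2)
        if lang in (source_lang, target_lang):
            sentences[sentence_id] = (lang, text)

    # Keep only links that go from the source language to the target language.
    pairs = []
    for line in link_lines:
        source_id, target_id = line.rstrip("\n").split("\t")
        source = sentences.get(source_id)
        target = sentences.get(target_id)
        if source and target and source[0] == source_lang and target[0] == target_lang:
            pairs.append((source[1], target[1]))
    return pairs


# Tiny in-memory stand-ins for the exported files (ids are invented).
sentence_data = io.StringIO(
    "1\teng\tMy name is Trang.\n"
    "2\tfra\tJe m'appelle Trang.\n"
    "3\tspa\tMe llamo Trang.\n"
)
link_data = io.StringIO("1\t2\n2\t1\n1\t3\n3\t1\n")

pairs = extract_pairs(sentence_data, link_data, "eng", "spa")
```

With real files you would pass open file objects instead of the `io.StringIO` stand-ins, since the function only needs something it can read line by line.
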
You also have to know that we are actually open source (under the AGPL license) but we are not really "promoting" this aspect because:

1. The code hasn't met my standards of elegance yet... Still too many parts that make me cringe when I look at them.
2. We still don't have a sound methodology and organization in our way of working and I really don't have time to manage more people.

However, if you love the project and are really motivated to join the development team, then feel free to contact us =)

##12. Indicate your languages in your profile

For people who didn't know, you can edit your profile by clicking on your username (at the top, in the menu bar).

Since Tatoeba involves languages, it can be very useful for other users to know which languages you can speak and how well you can speak them. We don't have a specific "languages" field so you will have to write about it in your profile description (in the section "Something about you").

And tell other users to indicate their languages as well (if they haven't already), especially if they have contributed.

##13. Encourage and educate new (or even not so new) contributors

The community is very important in a project like Tatoeba; we just can't achieve our ambition without a strong community. But how do you build a strong community? Well, one thing is NOT to make new users feel lost and isolated.

Part of this depends on the system. It has to be designed in a way that not only enables but also encourages users to interact with each other. Tatoeba is not great at that, but you have the minimum (private messages, wall, comments).

And the other part depends of course on the community itself. There must be an effort from the community to build a stronger community. So if someone is asking a question you can answer, don't hesitate to help out. If you notice someone is doing something wrong, don't hesitate to tell them the right way to do it. If you notice someone or some people have been contributing significantly, don't hesitate to drop a line (in a private message or on the Wall) to say "congratulations" or "thank you" for their work.

More generally speaking, if you have any idea on how to make Tatoeba a more socially pleasant place to be, then go ahead!

##14. Spread the love

Last but not least: you love the project, we love the project, we all want this project to become the greatest language tool of all time, so bring more people into this adventure!

In the end, anyone who knows how to read and how to write can participate. There's no need to be a polyglot. If you can "just" hunt for mistakes and correct them or point them out, it will already be extremely helpful. The more people, the more mistakes we can take down, the more data we can produce that people can rely on. And everyone can live happily ever after.

Version at: 26/04/2013, 16:53

##Introduction

This article by Trang is a must-read for anyone who is serious about contributing in Tatoeba. Here is an outline of how to be a good contributor:

1.  [Understand the context of the project](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule1)
2.  [Understand how the corpus is structured](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule2)
3.  [Do not pay attention to the other translations](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule3)
4.  [Do not translate word for word](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule4)
5.  [Do not edit a sentence if, by itself, it is correct](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule5)
6.  [Do not change the language in which a sentence is written](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule6)
7.  [Make sure you are adding comments to the right sentence](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule7)
8.  [Do not add sentences from copyrighted content](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule8)
9.  [Do not annotate sentences](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule9)
10.  [Give us feedback](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule10)
11.  [Do not wait for us to code it if you can code it](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule11)
12.  [Indicate your languages in your profile](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule12)
13.  [Encourage and educate new (or even not so new) contributors](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule13)
14.  [Spread the love](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule14)

[](http://www.blogger.com/post-edit.g?blogID=2196533844101218567&postID=5954885196540160002)

##1. Understand the context of the project

I will (someday) write a more detailed history, but here are the basic facts you should be aware of.

*   I started this project in 2006. The initiative was driven by my passion for language learning and frustration about not finding an adequate online dictionary.   The project is focused on **sentences** and I insist on **sentences**. Sample sentences were (and still are) a very scarce resource. Please only add **complete sentences** if you are going to contribute.
*   I was alone on this project for some time. It was only three years later, in 2009, that other people (all computer science students) started to help me out by coding more features.
*   Tatoeba is NOT a commercial project. We're not a company, we're not paid for doing any of this. It is is something that we're working on in our **free time**.
*   To be honest, we don't exclude the possibility of starting a company someday, but that is if and only if we have an innovative, coherent and ethical business model (yeah, good luck). Having ads everywhere and driving a lot of traffic, or forcing people to pay to access the data, is out of the question.

[](http://www.blogger.com/post-edit.g?blogID=2196533844101218567&postID=5954885196540160002)

##2. Understand how the corpus is structured

The corpus is structured not as a **table** but as a **graph** (in the computer science sense of the word). What does that mean? Well, imagine you had to extract part of the corpus and write it on paper. You might do something like this:

<table><tbody>
<tr>     <th style="text-align: left;">English</th>     <th style="text-align: left;">French</th>     <th style="text-align: left;">Spanish</th> </tr>
<tr>         <td>My name is Trang.</td>     <td>Je m'appelle Trang.</td>     <td>Me llamo Trang.</td> </tr>
<tr>     <td>How are you?</td>     <td>Comment vas-tu?</td>     <td>&#191;Cómo estás?</td> </tr>
<tr>     <td>...</td>     <td>...</td>     <td>...</td> </tr>
</tbody></table>

That's a **table** structure. There are **rows** and **columns**: a row contains sentences with the same meaning, and a column contains sentences with the same language. That's the first approach anyone might take, but that's NOT how the corpus is constructed.

Our corpus is set up like this: *[note: diagram needs to be imported]*

That's a **graph** structure. There are **nodes** and **edges**: each node represents a sentence, and each edge represent the link between two sentences. When two sentences are linked, they have the same meaning.

The structure of the representation has a big effect on the way you can contribute to the corpus. One important implication of the graph structure is that you can add **multiple translations in the same language** for a specific sentence. You think there are two ways to translate a sentence and you really can't decide which would be the best? Well, just add both!

Some other implications are pointed out below.


##3. Do not pay attention to the other translations**

When you translate a sentence, you are in fact **adding a sentence** (a node) and **adding a link** (an edge) between the "original" sentence and your translation. So the only thing you need to care about is that you are adding a proper translation to "main sentence" (the one at the top, written in bigger size).

More concretely, if you were in this situation and wanted to add a Spanish translation to the English sentence:

_How are you?_

_=&gt; Comment vas-tu?_

You could add _"&#191;Cómo estás?"_ (casual) as much as you could add _"&#191;Cómo está usted?"_ (formal). Or you could add both (because you can add multiple translations in a same language).

If you understand French, it **doesn't matter** if the French sentence is the casual form, you only have to worry about the fact that your translation is a proper translation of the **English sentence**. A proper translation means that if someone had to translate your contribution back to English, _"How are you?"_ would be a possibility.


##4. Do not translate word for word**

We are not interested in having sentences that sound like they were written by a robot. We want sentences that really are what a native speaker would say. Translating is a very difficult task, we know it. But if you are translating into your native language, you should always, always re-read your translation as if it was a single sentence, and ask yourself if it is actually something people would say. You can use the comments to indicate a literal translation.

If you are not translating into your native language (which you can), you are forgiven for not writing native-like sentences. But in this case, please make sure you find a native speaker to check your sentences so that your possible mistakes get corrected more quickly.

The point is to understand that Tatoeba is not only about providing translations, it's also about gathering data about a language. Tatoeba could simply be limited to adding sentences without translating them at all. If we were to extract only the sentences in Italian, we would like that each of them are representative of the Italian language.

The sentences are the basic layer. The links between the sentences is another layer. But the corpus should make sense without those links.



##5. Do not edit a sentence if, by itself, it is correct

As I mentioned just above, Tatoeba could simply be limited to adding sentences without translating them at all. Consequently, before you modify a sentence, look at it without paying attention to its translations, and ask yourself _"Does this sentence have any spelling or grammar mistake? Does it sound weird?"_. If the answer is "No", then do NOT edit it, **leave it alone**!






I am explaining this because you may be tempted to edit a sentence so that its meaning matches all the other sentences. 



It could be because you want to turn a sentence into a more "literal" translation. But this is not a good idea. Obviously, if we don't want you to translate word for word (cf. rule #4), we also don't want you to change a sentence into a word for word translation.




It could also be because the sentence doesn't match AT ALL. For instance:



_My name is Trang._

_=&gt; Je m'appelle Trang._

_=&gt; Vamos a la playa._

You notice that the Spanish sentence (which says "Let's go to the beach") has nothing to do with the English sentence.

Perhaps you don't speak Spanish very well so you're not confident in modifying the Spanish sentence and decide to change the English sentence. Problem: what about the French sentence? It won't fit the English sentence anymore...

Perhaps you are a native Spanish speaker and decide to change the Spanish sentence. In this particular case, it would still be acceptable because the Spanish sentence is not linked to any other sentence. But if someone had translated that Spanish sentence into Italian, "correcting" the Spanish sentence would cause a conflict with the Italian translation.

Then there is a problem you may have not thought of: when changing the meaning of a sentence, you are potentially erasing unique vocabulary. What if the Spanish sentence was currently the only one with "playa" in it?

So the best way to proceed in this kind of situation is to add a new Spanish translation (_Me llamo Trang_) and "unlink" the current Spanish translation. NOTE: Not everyone can unlink. Only "trusted users" can. You can post a comment to request a sentence to be unlinked.


##6. Do not change the language in which a sentence is written

If the language flag of a sentence is wrong (for instance it was flagged as Chinese when it is in fact Japanese), then of course, you can change it. That's not what I mean by "Do not change the language in which a sentence is written".

What I mean is that you shouldn't replace a Japanese sentence by a Chinese sentence with the same meaning (and that applies to any language of course). It shouldn't often happen, but if you're in a situation where you want to do that, then don't.

The problem is that a sentence can be associated to data that is dependent on its language. For instance comments. People can post comments on sentences, and the comments may be valid only because the sentence was in a certain language.

At the moment it is more an issue for Japanese sentences, which are associated to some sort of annotations. These annotations are not displayed because they are not useful for normal users. If you change a Japanese sentence into an English sentence, then the annotations that were associated to it won't make sense anymore.

##7. Make sure you are adding comments to the right sentence

When you post a comment, the comment is only associated to the main sentence, so make sure that your comment is related to that particular sentence. Typically, if you want to point out a spelling mistake, like here:

_My name is Trang._

_=&gt; Je m'appel Trang._

_=&gt; Me llamo Trang._

You can see that the French sentence is wrong. It should be "appelle" and not "appel". If you post your comment here, it would be associated to the English sentence (because it's at the top, so it's the main sentence). This is not what you want. The right thing to do is to click on the French sentence first. It will change the configuration into:

_Je m'appel Trang._

_=&gt; My name is Trang._

_=&gt; Me llamo Trang._

And then you can post your comment.

Now there is the case where you want to point out that a translation is wrong. Your comment will be related to two sentences, so where should you post it? Well, ideally, for this type of situation, it should be possible to comment on a **link** between two sentences. But we don't have that; we can only comment on a **sentence**. So you are free to decide where you want to post your comment. Just remember that it's fine as long as your comment is related to the main sentence.


##8. Do not add sentences from copyrighted content

We are distributing the corpus under the [Creative Commons Attribution](http://creativecommons.org/licenses/by/2.0/fr/) (or CC-BY) license. It makes it possible for anyone to re-use this data in any way they want as long as they mention Tatoeba in their work.

As a contributor, you have agreed to the terms of use (which of course you haven't read), and therefore you are providing your contributions under the CC-BY license as well. This means we can re-use your data in any way we want as long as we mention you. So we are re-using your work in Tatoeba, and we mention you through the logs and the stats.

But providing your work under CC-BY means you also have some responsibility for what you provide. And you have to know that you **cannot** legally redistribute data if it was copied from a source that doesn't clearly state that you can do it. Typically, you cannot (legally) copy all the sentences from a textbook and add them to Tatoeba.

Don't worry, you (and we) won't go to jail and be in debt for life if you've added a couple of sentences from a textbook (hopefully...). But the law forbids us from taking someone's work and re-using it without their consent. Producing sentences and translations is work, so be careful where you get the sentences from. Preferably, come up with your own sentences or take them from books that are in the public domain.

If you have added or have seen sentences that were copied from copyrighted material, change a few words so that it won't be exactly the same sentence. Or, go negotiate with the authors and convince them to release their work under the CC-BY license so we can re-use it.

I'm not going to argue about whether all of this makes sense or not (obviously I don't believe it does), but it will help us a lot if everyone does what is necessary so we don't get sued.


##9. Do not annotate sentences

We want sentences to remain as "raw" as possible so do not add annotations. For example we do NOT want sentences like this:


1.  I (female) am happy.
2.  It's raining cats and dogs. (idiom)
3.  I like her/him.


Regarding sentences 1 and 2, if you need to indicate that a sentence is a proverb, female speech, or whatever else, then post a comment about it (or tag it, if you are a trusted user), but please do NOT add this information directly in the sentence.

Regarding sentence 3, instead of having only one sentence, split it into two sentences. Remember, you have the right to add multiple translations in the same language. So it's okay to have this:

> Je l'aime bien.
> 
> =&gt; I like her.
> 
> =&gt; I like him.





There are various reasons why we don't want annotations.


1.  They can be a problem for people who are using our data in order to improve a natural language processing system, for instance.
2.  Your translation can be retranslated into another language, and it's harder for people to translate sentences that contain alternatives (like "him/her").
3.  If we want to record audio for the sentence, we will need to choose what exactly to record, and annotations don't help.








##10. Give us feedback

We know that Tatoeba is not perfect, so don't hesitate to [tell us](http://tatoeba.org/pages/contact) what you think is missing (just make sure no one has talked about it on the [Wall](http://tatoeba.org/wall) already). Also tell us if you see any spelling mistakes, feel that some explanations are not clear, or encounter bugs.

We also know that Tatoeba is a cool project so feel free to tell us you like it too :P


##11. Do not wait for us to code it if you can code it

As much as we welcome feedback, we welcome even more **INITIATIVE**. There are just sooo many things we could do. We can't take care of everything.

For instance, we are distributing the _**entire**_ corpus, but many people probably don't need _**all**_ the sentences in _**all**_ the languages. You may just want the English-Spanish sentences. Well, instead of asking and waiting for us to provide a file with only English-Spanish sentences, you can code a tool (and please, tell us if you do) that will extract only what you want from our files.

That's just one example but if you are a programmer, there could be many things you could do yourself instead of waiting for us to do it. But of course, tell us so we don't start working on something you plan to work on.

You also have to know that we are actually open source (under the AGPL license) but we are not really "promoting" this aspect because:

1.  The code hasn't met my standards of elegance yet... Still too many parts that make me cringe when I look at them.
2.  We still don't have a sound methodology and organization in our way of working and I really don't have time to manage more people.

However, if you love the project and are really motivated to join the development team, then feel free to contact us =)



##12. Indicate your languages in your profile

For people who didn't know, you can edit your profile by clicking on your username (at the top, in the menu bar).

Since Tatoeba involves languages, it can be very useful for other users to know which languages you can speak and how well you can speak them. We don't have a specific "languages" field so you will have to write about it in your profile description (in the section "Something about you").

And tell other users to indicate their languages as well (if they haven't already), especially if they have contributed.



##13. Encourage and educate new (or even not so new) contributors

The community is very important in a project like Tatoeba; we just can't achieve our ambition without a strong community. But how do you build a strong community? Well, one thing is NOT to make new users feel lost and isolated.

Part of this depends on the system. It has to be designed in a way that not only enables but also encourages users to interact with each other. Tatoeba is not great at that, but you have the minimum (private messages, wall, comments).

And the other part depends of course on the community itself. There must be an effort from the community to build a stronger community. So if someone is asking a question that you can answer, don't hesitate to help out. If you notice someone is doing something wrong, don't hesitate to tell them the right way to do it. If you notice someone or some people have been contributing significantly, don't hesitate to drop a line (in a private message or on the Wall) to say "congratulations" or "thank you" for their work.

More generally speaking, if you have any idea on how to make Tatoeba a more socially pleasant place to be, then go ahead!


##14. Spread the love

Last but not least: you love the project, we love the project, we all want this project to become the greatest language tool of all time, so bring more people into this adventure!

In the end, anyone who knows how to read and how to write can participate. There's no need to be a polyglot. If you can "just" hunt for mistakes and correct them or point them out, it will already be extremely helpful. The more people, the more mistakes we can take down, and the more data we can produce that people can rely on. And everyone can live happily ever after.

version at: 26/04/2013, 16:54

##Introduction

This article by Trang is a must-read for anyone who is serious about contributing to Tatoeba.

Here is an outline of how to be a good contributor:

1.  [Understand the context of the project](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule1)
2.  [Understand how the corpus is structured](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule2)
3.  [Do not pay attention to the other translations](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule3)
4.  [Do not translate word for word](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule4)
5.  [Do not edit a sentence if, by itself, it is correct](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule5)
6.  [Do not change the language in which a sentence is written](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule6)
7.  [Make sure you are adding comments to the right sentence](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule7)
8.  [Do not add sentences from copyrighted content](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule8)
9.  [Do not annotate sentences](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule9)
10.  [Give us feedback](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule10)
11.  [Do not wait for us to code it if you can code it](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule11)
12.  [Indicate your languages in your profile](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule12)
13.  [Encourage and educate new (or even not so new) contributors](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule13)
14.  [Spread the love](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule14)

##1. Understand the context of the project

I will (someday) write a more detailed history, but here are the basic facts you should be aware of.

*   I started this project in 2006. The initiative was driven by my passion for language learning and my frustration at not finding an adequate online dictionary. The project is focused on **sentences**, and I insist on **sentences**. Sample sentences were (and still are) a very scarce resource. Please only add **complete sentences** if you are going to contribute.
*   I was alone on this project for some time. It was only three years later, in 2009, that other people (all computer science students) started to help me out by coding more features.
*   Tatoeba is NOT a commercial project. We're not a company, and we're not paid for doing any of this. It is something that we're working on in our **free time**.
*   To be honest, we don't exclude the possibility of starting a company someday, but that is if and only if we have an innovative, coherent and ethical business model (yeah, good luck). Having ads everywhere and driving a lot of traffic, or forcing people to pay to access the data, is out of the question.

##2. Understand how the corpus is structured

The corpus is structured not as a **table** but as a **graph** (in the computer science sense of the word). What does that mean? Well, imagine you had to extract part of the corpus and write it on paper. You might do something like this:

<table><tbody>
<tr>     <th style="text-align: left;">English</th>     <th style="text-align: left;">French</th>     <th style="text-align: left;">Spanish</th> </tr>
<tr>         <td>My name is Trang.</td>     <td>Je m'appelle Trang.</td>     <td>Me llamo Trang.</td> </tr>
<tr>     <td>How are you?</td>     <td>Comment vas-tu?</td>     <td>&#191;Cómo estás?</td> </tr>
<tr>     <td>...</td>     <td>...</td>     <td>...</td> </tr>
</tbody></table>

That's a **table** structure. There are **rows** and **columns**: a row contains sentences with the same meaning, and a column contains sentences in the same language. That's the first approach anyone might take, but that's NOT how the corpus is constructed.

Our corpus is set up like this: *[note: diagram needs to be imported]*

That's a **graph** structure. There are **nodes** and **edges**: each node represents a sentence, and each edge represents the link between two sentences. When two sentences are linked, they have the same meaning.

The structure of the representation has a big effect on the way you can contribute to the corpus. One important implication of the graph structure is that you can add **multiple translations in the same language** for a specific sentence. You think there are two ways to translate a sentence and you really can't decide which would be the best? Well, just add both!
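To make the graph idea concrete, here is a minimal Python sketch. It is not how Tatoeba is actually implemented: the `Corpus` class, its method names, and the language codes are all made up for illustration, and links are assumed to be undirected.

```python
from collections import defaultdict


class Corpus:
    """Toy sentence graph (hypothetical): nodes are sentences, edges are links."""

    def __init__(self):
        self.sentences = {}            # sentence id -> (language code, text)
        self.links = defaultdict(set)  # sentence id -> ids of directly linked sentences
        self._next_id = 1

    def add_sentence(self, lang, text):
        """Add a node and return its id."""
        sid = self._next_id
        self._next_id += 1
        self.sentences[sid] = (lang, text)
        return sid

    def link(self, a, b):
        """Add an edge. Assumed undirected: each sentence translates the other."""
        self.links[a].add(b)
        self.links[b].add(a)

    def translations(self, sid, lang=None):
        """Texts of the sentences directly linked to sid, optionally in one language."""
        linked = (self.sentences[t] for t in self.links[sid])
        return [text for code, text in linked if lang is None or code == lang]


corpus = Corpus()
en = corpus.add_sentence("eng", "How are you?")
fr = corpus.add_sentence("fra", "Comment vas-tu?")
es_casual = corpus.add_sentence("spa", "¿Cómo estás?")
es_formal = corpus.add_sentence("spa", "¿Cómo está usted?")

corpus.link(en, fr)
corpus.link(en, es_casual)  # two translations in the same language
corpus.link(en, es_formal)  # are simply two edges from the same node

print(corpus.translations(en, "spa"))
print(corpus.translations(fr))
```

In this toy graph the French sentence is directly linked only to the English one, while the two Spanish sentences hang off the English node independently. A table with one Spanish column could not hold both of them in the same row.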

Some other implications are pointed out below.


##3. Do not pay attention to the other translations

When you translate a sentence, you are in fact **adding a sentence** (a node) and **adding a link** (an edge) between the "original" sentence and your translation. So the only thing you need to care about is that you are adding a proper translation to the "main sentence" (the one at the top, written in a bigger size).

More concretely, if you were in this situation and wanted to add a Spanish translation to the English sentence:

_How are you?_

_=&gt; Comment vas-tu?_

You could add _"&#191;Cómo estás?"_ (casual) just as well as _"&#191;Cómo está usted?"_ (formal). Or you could add both (because you can add multiple translations in the same language).

If you understand French, it **doesn't matter** if the French sentence is the casual form. You only have to make sure that your translation is a proper translation of the **English sentence**. A proper translation means that if someone had to translate your contribution back to English, _"How are you?"_ would be a possibility.


##4. Do not translate word for word

We are not interested in having sentences that sound like they were written by a robot. We want sentences that really are what a native speaker would say. Translating is a very difficult task, and we know it. But if you are translating into your native language, you should always, always re-read your translation as if it were a standalone sentence, and ask yourself if it is actually something people would say. You can use the comments to indicate a literal translation.

If you are not translating into your native language (which you can do), you are forgiven for not writing native-like sentences. But in this case, please make sure you find a native speaker to check your sentences so that your possible mistakes get corrected more quickly.

The point is to understand that Tatoeba is not only about providing translations, it's also about gathering data about a language. Tatoeba could simply be limited to adding sentences without translating them at all. If we were to extract only the sentences in Italian, we would like each of them to be representative of the Italian language.

The sentences are the basic layer. The links between the sentences are another layer. But the corpus should make sense without those links.



##5. Do not edit a sentence if, by itself, it is correct

As I mentioned just above, Tatoeba could simply be limited to adding sentences without translating them at all. Consequently, before you modify a sentence, look at it without paying attention to its translations, and ask yourself _"Does this sentence have any spelling or grammar mistake? Does it sound weird?"_. If the answer is "No", then do NOT edit it, **leave it alone**!

I am explaining this because you may be tempted to edit a sentence so that its meaning matches all the other sentences.

It could be because you want to turn a sentence into a more "literal" translation. But this is not a good idea. Obviously, if we don't want you to translate word for word (cf. rule #4), we also don't want you to change a sentence into a word for word translation.

It could also be because the sentence doesn't match AT ALL. For instance:



_My name is Trang._

_=&gt; Je m'appelle Trang._

_=&gt; Vamos a la playa._

You notice that the Spanish sentence (which says "Let's go to the beach") has nothing to do with the English sentence.

Perhaps you don't speak Spanish very well so you're not confident in modifying the Spanish sentence and decide to change the English sentence. Problem: what about the French sentence? It won't fit the English sentence anymore...

Perhaps you are a native Spanish speaker and decide to change the Spanish sentence. In this particular case, it would still be acceptable because the Spanish sentence is not linked to any other sentence. But if someone had translated that Spanish sentence into Italian, "correcting" the Spanish sentence would cause a conflict with the Italian translation.

Then there is a problem you may not have thought of: when changing the meaning of a sentence, you are potentially erasing unique vocabulary. What if the Spanish sentence was currently the only one with "playa" in it?

So the best way to proceed in this kind of situation is to add a new Spanish translation (_Me llamo Trang_) and "unlink" the current Spanish translation. NOTE: Not everyone can unlink. Only "trusted users" can. You can post a comment to request a sentence to be unlinked.
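A small sketch of why this is the safer route, using a toy graph in Python. The data layout and the helper names `unlink` and `add_translation` are hypothetical and are not Tatoeba's actual code. The point is that unlinking removes one edge and leaves the sentence itself untouched.

```python
# Hypothetical toy graph: sentence ids map to text, links are unordered pairs of ids.
sentences = {
    1: "My name is Trang.",
    2: "Je m'appelle Trang.",
    3: "Vamos a la playa.",
}
links = {frozenset({1, 2}), frozenset({1, 3})}


def unlink(links, a, b):
    """Remove the edge between a and b. Both sentences stay in the corpus."""
    links.discard(frozenset({a, b}))


def add_translation(sentences, links, source_id, text):
    """Add a new sentence node and link it to source_id."""
    new_id = max(sentences) + 1
    sentences[new_id] = text
    links.add(frozenset({source_id, new_id}))
    return new_id


# Add the correct Spanish translation, then detach the mismatched one.
new_id = add_translation(sentences, links, 1, "Me llamo Trang.")
unlink(links, 1, 3)

print(sentences[3])   # the "playa" sentence is still in the corpus
print(sorted(sorted(pair) for pair in links))
```

Editing sentence 3 in place would instead have changed a node that other sentences might be linked to, and would have erased its vocabulary from the corpus.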


##6. Do not change the language in which a sentence is written

If the language flag of a sentence is wrong (for instance it was flagged as Chinese when it is in fact Japanese), then of course, you can change it. That's not what I mean by "Do not change the language in which a sentence is written".

What I mean is that you shouldn't replace a Japanese sentence with a Chinese sentence that has the same meaning (and that applies to any language of course). It shouldn't happen often, but if you're in a situation where you want to do that, then don't.

The problem is that a sentence can be associated with data that depends on its language, such as comments. People can post comments on sentences, and the comments may be valid only because the sentence was in a certain language.

At the moment it is more of an issue for Japanese sentences, which are associated with some sort of annotations. These annotations are not displayed because they are not useful for normal users. If you change a Japanese sentence into an English sentence, then the annotations that were associated with it won't make sense anymore.

##7. Make sure you are adding comments to the right sentence

When you post a comment, the comment is only associated with the main sentence, so make sure that your comment is related to that particular sentence. A typical case is when you want to point out a spelling mistake, like here:

_My name is Trang._

_=&gt; Je m'appel Trang._

_=&gt; Me llamo Trang._

You can see that the French sentence is wrong. It should be "appelle" and not "appel". If you post your comment here, it will be associated with the English sentence (because it's at the top, so it's the main sentence). This is not what you want. The right thing to do is to click on the French sentence first. The display will change to:

_Je m'appel Trang._

_=&gt; My name is Trang._

_=&gt; Me llamo Trang._

And then you can post your comment.

Now there is the case where you want to point out that a translation is wrong. Your comment will be related to two sentences, so where should you post it? Well, ideally, for this type of situation, it should be possible to comment on a **link** between two sentences. But we don't have that; we can only comment on a **sentence**. So you are free to decide where you want to post your comment. Just remember that it's fine as long as your comment is related to the main sentence.


##8. Do not add sentences from copyrighted content

We are distributing the corpus under the [Creative Commons Attribution](http://creativecommons.org/licenses/by/2.0/fr/) (or CC-BY) license. It makes it possible for anyone to re-use this data in any way they want as long as they mention Tatoeba in their work.

As a contributor, you have agreed to the terms of use (which of course you haven't read), and therefore you are providing your contributions under the CC-BY license as well. This means we can re-use your data in any way we want as long as we mention you. So we are re-using your work in Tatoeba, and we mention you through the logs and the stats.

But providing your work under CC-BY means you also have some responsibility for what you provide. And you have to know that you **cannot** legally redistribute data if it was copied from a source that doesn't clearly state that you can do it. Typically, you cannot (legally) copy all the sentences from a textbook and add them to Tatoeba.

Don't worry, you (and we) won't go to jail and be in debt for life if you've added a couple of sentences from a textbook (hopefully...). But the law forbids us from taking someone's work and re-using it without their consent. Producing sentences and translations is work, so be careful where you get the sentences from. Preferably, come up with your own sentences or take them from books that are in the public domain.

If you have added or have seen sentences that were copied from copyrighted material, change a few words so that it won't be exactly the same sentence. Or, go negotiate with the authors and convince them to release their work under the CC-BY license so we can re-use it.

I'm not going to argue about whether all of this makes sense or not (obviously I don't believe it does), but it will help us a lot if everyone does what is necessary so we don't get sued.


##9. Do not annotate sentences

We want sentences to remain as "raw" as possible so do not add annotations. For example we do NOT want sentences like this:


1.  I (female) am happy.
2.  It's raining cats and dogs. (idiom)
3.  I like her/him.


Regarding sentences 1 and 2, if you need to indicate that a sentence is a proverb, female speech, or whatever else, then post a comment about it (or tag it, if you are a trusted user), but please do NOT add this information directly in the sentence.

Regarding sentence 3, instead of having only one sentence, split it into two sentences. Remember, you have the right to add multiple translations in the same language. So it's okay to have this:

> Je l'aime bien.
> 
> =&gt; I like her.
> 
> =&gt; I like him.





There are various reasons why we don't want annotations.


1.  They can be a problem for people who are using our data in order to improve a natural language processing system, for instance.
2.  Your translation can be retranslated into another language, and it's harder for people to translate sentences that contain alternatives (like "him/her").
3.  If we want to record audio for the sentence, we will need to choose what exactly to record, and annotations don't help.








##10. Give us feedback

We know that Tatoeba is not perfect, so don't hesitate to [tell us](http://tatoeba.org/pages/contact) what you think is missing (just make sure no one has talked about it on the [Wall](http://tatoeba.org/wall) already). Also tell us if you see any spelling mistakes, feel that some explanations are not clear, or encounter bugs.

We also know that Tatoeba is a cool project so feel free to tell us you like it too :P


##11. Do not wait for us to code it if you can code it

As much as we welcome feedback, we welcome even more **INITIATIVE**. There are just sooo many things we could do. We can't take care of everything.

For instance, we are distributing the _**entire**_ corpus, but many people probably don't need _**all**_ the sentences in _**all**_ the languages. You may just want the English-Spanish sentences. Well, instead of asking and waiting for us to provide a file with only English-Spanish sentences, you can code a tool (and please, tell us if you do) that will extract only what you want from our files.
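Such a tool could look roughly like the Python sketch below. It assumes two tab-separated inputs, one with an id, a language code, and the text on each line, and one with a sentence id and a translation id on each line, with every link listed in both directions. Check the actual layout of the downloadable files before relying on this, because the format, the language codes, and the function names `read_rows` and `extract_pairs` are assumptions made for illustration.

```python
import io


def read_rows(lines, width):
    """Yield tab-separated rows that have exactly `width` fields."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == width:
            yield fields


def extract_pairs(sentence_lines, link_lines, source_lang, target_lang):
    """Return (source text, target text) for linked sentences in the two languages."""
    wanted = {}
    for sid, lang, text in read_rows(sentence_lines, 3):
        if lang in (source_lang, target_lang):
            wanted[sid] = (lang, text)

    pairs = []
    for a, b in read_rows(link_lines, 2):
        if a in wanted and b in wanted:
            lang_a, text_a = wanted[a]
            lang_b, text_b = wanted[b]
            # Assumes each link appears in both directions, so keeping only
            # source -> target rows avoids duplicates.
            if lang_a == source_lang and lang_b == target_lang:
                pairs.append((text_a, text_b))
    return pairs


# Made-up sample data standing in for the real files.
sentences_data = io.StringIO(
    "1\teng\tMy name is Trang.\n"
    "2\tfra\tJe m'appelle Trang.\n"
    "3\tspa\tMe llamo Trang.\n"
)
links_data = io.StringIO("1\t2\n2\t1\n1\t3\n3\t1\n")

pairs = extract_pairs(sentences_data, links_data, "eng", "spa")
for english, spanish in pairs:
    print(english, "=>", spanish)
```

With real files you would pass open file objects instead of the `io.StringIO` samples.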

That's just one example but if you are a programmer, there could be many things you could do yourself instead of waiting for us to do it. But of course, tell us so we don't start working on something you plan to work on.

You also have to know that we are actually open source (under the AGPL license) but we are not really "promoting" this aspect because:

1.  The code hasn't met my standards of elegance yet... Still too many parts that make me cringe when I look at them.
2.  We still don't have a sound methodology and organization in our way of working and I really don't have time to manage more people.

However, if you love the project and are really motivated to join the development team, then feel free to contact us =)



##12. Indicate your languages in your profile

For people who didn't know, you can edit your profile by clicking on your username (at the top, in the menu bar).

Since Tatoeba involves languages, it can be very useful for other users to know which languages you can speak and how well you can speak them. We don't have a specific "languages" field so you will have to write about it in your profile description (in the section "Something about you").

And tell other users to indicate their languages as well (if they haven't already), especially if they have contributed.



##13. Encourage and educate new (or even not so new) contributors

The community is very important in a project like Tatoeba; we just can't achieve our ambition without a strong community. But how do you build a strong community? Well, one thing is NOT to make new users feel lost and isolated.

Part of this depends on the system. It has to be designed in a way that not only enables but also encourages users to interact with each other. Tatoeba is not great at that, but you have the minimum (private messages, wall, comments).

And the other part depends of course on the community itself. There must be an effort from the community to build a stronger community. So if someone is asking a question that you can answer, don't hesitate to help out. If you notice someone is doing something wrong, don't hesitate to tell them the right way to do it. If you notice someone or some people have been contributing significantly, don't hesitate to drop a line (in a private message or on the Wall) to say "congratulations" or "thank you" for their work.

More generally speaking, if you have any idea on how to make Tatoeba a more socially pleasant place to be, then go ahead!


##14. Spread the love

Last but not least: you love the project, we love the project, we all want this project to become the greatest language tool of all time, so bring more people into this adventure!

In the end, anyone who knows how to read and how to write can participate. There's no need to be a polyglot. If you can "just" hunt for mistakes and correct them or point them out, it will already be extremely helpful. The more people, the more mistakes we can take down, and the more data we can produce that people can rely on. And everyone can live happily ever after.
