Version at: 23/02/2014, 00:48 vs. version at: 23/02/2014, 03:23

Version at: 23/02/2014, 00:48

GSoC ideas for student projects
===============================

This page lists example ideas for students who would like to take part in Google Summer of Code and be mentored by Tatoeba. To quote the [GSoC FAQ](http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2013/help_page#3._What_is_an_Ideas_list):

<blockquote>
<p>An Ideas list should be a list of suggested student projects. This list is meant to introduce contributors to your project's needs and to provide inspiration to would-be student applicants. It is useful to classify each idea as specifically as possible, e.g. "must know Python" or "easier project; good for a student with more limited experience with C++." If your organization plans to provide a proposal template for the students, it would be good to include it on your Ideas list.</p>

<p>Keep in mind that your ideas list should be a starting point for student proposals; we've heard from past mentoring organization participants that some of their best student projects are those that greatly expanded on a proposed idea or were blue-sky proposals not mentioned on the ideas list at all. A link to a bug tracker for your open source organization is NOT an ideas list.</p>

<p>You can check out the <a href="http://community.kde.org/GSoC/2011/Ideas">Ideas list for KDE</a> for Google Summer of Code in 2011 to get an idea of what we’re looking for in an ideas list. </p>
</blockquote>

If you're a student, you're invited to discuss any of these ideas, as well as propose your own. To contact us, use one of these:

* [Tatoeba Wall page](http://tatoeba.org/wall/index)
* Email/Google groups: [Tatoeba GSoC mailing list](https://groups.google.com/forum/?fromgroups=#!forum/tatoeba-gsoc)
* IRC: [Tatoeba on #Freenode](irc://irc.freenode.net/tatoeba), [Webchat](http://webchat.freenode.net?channels=tatoeba)
* XMPP: [Tatoeba conference room on chat.tatoeba.org](xmpp:tatoeba@chat.tatoeba.org?join)


Current site
------------

Extending the current PHP site: programming in PHP with the CakePHP framework, shell tools for maintenance (e.g., better export scripts), and possibly JavaScript.

### Better export scripts

Currently, CSV dumps are done weekly. They require the database to be switched into read-only mode, take 5 to 10 minutes, and omit some important information, such as tag creators and comments. CSV dumps are important for people who cooperate with Tatoeba by creating additional tools, so their quality is vital for healthy collaboration.

**Deliverables**: A database export mechanism that:

  * Dumps all interesting information (everything that's currently in the data dumps plus modification history, sentence comments, the wall, etc.).
  * Can create an incremental dump (faster dumps will allow making them more often).
  * Provides an interface for collaborators to get notifications about new dumps and allows automatic access.
  * (Advanced) Provides a stream of updates in the form of WebSockets or a similar mechanism.

**Prerequisite knowledge**: a scripting language (Python preferred), PHP, MySQL.
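
The incremental dump mentioned above could work roughly as follows. This is only a minimal sketch: the `sentences` table, its `modified` column, and the tab-separated layout are assumptions made for illustration, and an in-memory SQLite database stands in for the real MySQL database.

```python
import csv
import io
import sqlite3


def dump_incremental(conn, since, out):
    """Write rows modified after `since` as tab-separated lines.

    Returns the newest modification timestamp seen, which can be stored
    and passed as `since` for the next run.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    latest = since
    rows = conn.execute(
        "SELECT id, lang, text, modified FROM sentences "
        "WHERE modified > ? ORDER BY modified, id",
        (since,),
    )
    for sentence_id, lang, text, modified in rows:
        writer.writerow([sentence_id, lang, text, modified])
        latest = max(latest, modified)
    return latest


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences "
    "(id INTEGER PRIMARY KEY, lang TEXT, text TEXT, modified INTEGER)"
)
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?, ?)",
    [(1, "eng", "Hello.", 100), (2, "eng", "Good morning.", 200), (3, "eng", "Goodbye.", 300)],
)

buf = io.StringIO()
checkpoint = dump_incremental(conn, 150, buf)
```

Because only rows newer than the stored checkpoint are read, such a dump would not need a long read-only window, although whether that holds for the real database depends on how modifications are tracked there.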

### Administrative Scripts

The current site has experienced a couple of crashes and instabilities. We would also like to increase the number of users who are fully capable of administering it easily and reliably. To ease administration and allow quick recovery from disaster, a number of scripts covering common administrative tasks are needed.

**Deliverables**: Shell scripts that cover:

 * backup
 * adding new languages
 * preserving existing translations when typos in the source UI strings are fixed 
 * getting external services up and running
 * updating the production site from the repository
 * deployment on a real server from scratch
 * monitoring the server and logging load and activity
 * other necessary tasks

**Prerequisite knowledge**: a scripting language (bash, Python, Perl, etc.), possibly familiarity with a build system (Ansible, Vagrant), and possibly familiarity with setting up and maintaining a monitoring system (New Relic, Nagios, Cacti, Munin).

### API

The Tatoeba database is used either through the main website interface or through data dumps. Having a real API that can be called through AJAX and return machine-readable results would provide real-time access for external applications.

**Deliverables**: A web application that provides a set of API calls for data stored in the current database. The API should cover all data available through the current web interface, including sentence comments, wall comments, recently added sentences, and top recent contributors.

**Prerequisite knowledge**: a web application language (Python or PHP preferred), MySQL, and a data exchange format such as JSON or XML.
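
To make "machine-readable results" concrete, here is a rough sketch of what one API call might return as JSON. The field names (`id`, `lang`, `text`, `owner`, `translations`), the error format, and the in-memory data are all assumptions for illustration; the actual design is up to the student and mentors.

```python
import json

# Hypothetical in-memory stand-ins for the sentence and link tables.
SENTENCES = {
    1: {"lang": "eng", "text": "I like mountains.", "owner": "example_user"},
    2: {"lang": "fra", "text": "J'aime les montagnes.", "owner": "another_user"},
}
LINKS = {1: [2], 2: [1]}


def sentence_response(sentence_id):
    """Build the JSON body for a hypothetical 'get sentence' API call."""
    sentence = SENTENCES.get(sentence_id)
    if sentence is None:
        return json.dumps({"error": "not found", "id": sentence_id})
    payload = {"id": sentence_id, **sentence}
    # Include direct translations so a client needs only one request.
    payload["translations"] = [
        {"id": linked_id, **SENTENCES[linked_id]}
        for linked_id in LINKS.get(sentence_id, [])
    ]
    return json.dumps(payload, ensure_ascii=False)
```

A client would parse the body with any JSON library; an XML variant could expose the same structure.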

### Improvements in user interface for end users

Tatoeba now handles several kinds of queries, but more are desired. The translation interface, in particular, needs improvement. Examples of desired types of queries:

* Get all sentences in a given language by a given username that have not yet been translated into a given language.

    * For example: Show me all English sentences by CK not yet translated into Japanese.

* Same as above, but limited to sentences with audio.

    * For example: Show me all English sentences by CK with audio not yet translated into Japanese.

* Get all sentences by native speakers of a given language not yet translated into my own native language.

    * For example: Show me all English sentences by native speakers not yet translated into Japanese.

* Get all sentences in a given language with a certain tag not yet translated into a given language.

    * For example: Show me all English sentences with the tag "restaurant" not yet translated into Japanese.

* Same as above, but limited to sentences by native speakers not yet translated into a given language.

    * For example: Show me all English sentences by native speakers with the tag "weather" not yet translated into Japanese.

* Get all sentences in a given language under a certain length not yet translated into a given language.

    * For example: Show me all Japanese sentences less than 50 characters that aren't yet translated into English.

* Same as above, but limited to native speaker sentences.

* Same as above, but limited to sentences by a given username.

* Get all sentences in a given language that match a given search keyword and aren't yet translated into a given language.

    * For example: Show all English sentences with the word "mountain" that aren't yet translated into Japanese.

* Same as above, but limited to native speaker sentences.

* Same as above, but limited to sentences by a given username.

**Deliverables:** Implementation of some or all of the above. The project might include additional queries. A generic way of adding new types of queries would be highly desirable.

**Prerequisite knowledge**: PHP, CakePHP.
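
One possible shape for a "generic way of adding new types of queries" is to treat each condition as a small, composable filter. The sketch below is written in Python for brevity rather than in the site's PHP, and the `Sentence` fields and filter names are hypothetical; a real implementation would most likely translate such filters into database queries instead of scanning a list.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class Sentence:
    id: int
    lang: str
    text: str
    owner: str
    has_audio: bool = False
    tags: Set[str] = field(default_factory=set)
    translated_into: Set[str] = field(default_factory=set)


Filter = Callable[[Sentence], bool]


def in_language(lang: str) -> Filter:
    return lambda s: s.lang == lang


def by_user(name: str) -> Filter:
    return lambda s: s.owner == name


def with_audio() -> Filter:
    return lambda s: s.has_audio


def with_tag(tag: str) -> Filter:
    return lambda s: tag in s.tags


def shorter_than(length: int) -> Filter:
    return lambda s: len(s.text) < length


def not_translated_into(lang: str) -> Filter:
    return lambda s: lang not in s.translated_into


def run_query(sentences: Iterable[Sentence], *filters: Filter) -> List[Sentence]:
    """Keep the sentences that satisfy every filter."""
    return [s for s in sentences if all(f(s) for f in filters)]


# Hypothetical data, for illustration only.
corpus = [
    Sentence(1, "eng", "The weather is nice.", "alice", has_audio=True,
             tags={"weather"}, translated_into={"fra"}),
    Sentence(2, "eng", "Where is the restaurant?", "alice",
             tags={"restaurant"}, translated_into={"jpn"}),
    Sentence(3, "fra", "Bonjour.", "bob"),
]
untranslated = run_query(
    corpus, in_language("eng"), by_user("alice"), not_translated_into("jpn")
)
```

With this structure, each "same as above, but limited to..." variant in the list becomes one extra filter argument rather than a new hand-written query.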

### Allow users to follow each other

This feature would allow users to keep track of sentences newly created by other users, much as Twitter does. Users could be notified of new sentences from the users they follow and browse them. Whether the list of who is following whom is public or private should be discussed prior to development.

**Deliverables:** a means to follow one or more users; a page that displays the sentences of followed users and allows browsing and searching through them; configurable notifications about new sentences from followed users; a display of who is following whom.

**Prerequisite knowledge**: PHP, CakePHP.
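
As a starting point for discussion, the follow relationship and the resulting feed can be modelled very simply. This is only a sketch in Python with hypothetical names; the real feature would live in the CakePHP code base and store the relationship in the database.

```python
from collections import defaultdict


class FollowGraph:
    """Tracks who follows whom and builds a feed of followed users' sentences."""

    def __init__(self):
        self._following = defaultdict(set)

    def follow(self, follower, followed):
        # Following yourself is ignored in this sketch.
        if follower != followed:
            self._following[follower].add(followed)

    def unfollow(self, follower, followed):
        self._following[follower].discard(followed)

    def following(self, user):
        return set(self._following.get(user, set()))

    def feed(self, user, sentences):
        """Return (created, owner, text) tuples by followed users, newest first."""
        followed = self._following.get(user, set())
        return sorted((s for s in sentences if s[1] in followed), reverse=True)


# Hypothetical data: (creation order, owner, text).
posts = [
    (1, "alice", "Hello."),
    (2, "bob", "Hi."),
    (3, "alice", "Bye."),
    (4, "carol", "Ciao."),
]
graph = FollowGraph()
graph.follow("dave", "alice")
graph.follow("dave", "carol")
dave_feed = graph.feed("dave", posts)
```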

### Word requests

This feature would allow users to request example sentences that show the correct usage of a given word or phrase. Contributors could browse lists of 'requested words' and add sentences that include them.

People interested in this idea should consider and discuss possible implementation details prior to development. Typical questions include:

* What should be the scope of the lists (per-language, per-user…)?
* How should the lists be maintained?
* Can we indicate that a 'requested word' now has enough example sentences? If yes, how?
* What’s the lifecycle of a typical requested word?
* What if users want to express additional information in their requests, such as the context or sense for the requested word?
* What about synonyms, inflections…?

**Deliverables:** a means to request example sentences for a given word; a way to easily contribute new sentences that show examples of requested words.

**Prerequisite knowledge**: PHP, CakePHP.
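
One of the questions above, whether a requested word already has enough example sentences, can be illustrated with a small sketch. The threshold `enough` and the exact-match tokenization are assumptions; in particular, this version deliberately ignores inflections and synonyms, which is exactly the kind of detail a proposal should address.

```python
import re


def open_requests(requested_words, sentences, enough=3):
    """Return {word: count} for requested words that still lack examples.

    A word counts as covered by a sentence only if it appears as an
    exact, case-insensitive token.
    """
    counts = {word.lower(): 0 for word in requested_words}
    for sentence in sentences:
        tokens = set(re.findall(r"\w+", sentence.lower()))
        for word in counts:
            if word in tokens:
                counts[word] += 1
    return {word: count for word, count in counts.items() if count < enough}


# Hypothetical data, for illustration only.
examples = [
    "The mountain is high.",
    "A mountain road.",
    "Mountain air is fresh.",
]
still_open = open_requests(["mountain", "river"], examples, enough=3)
```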

### Show pronunciation in IPA for sentences

IPA stands for "International Phonetic Alphabet" and is used to describe the pronunciation of human languages in an unambiguous way. As such, it helps with learning languages whose pronunciation rules are complex (e.g., English). Tatoeba could display IPA pronunciation for each sentence in basically the same way it currently displays pronunciation for Japanese using kana. One possible way of performing the task is to use an external library or application to prepare IPA annotations. For example, [eSpeak](http://espeak.sourceforge.net/) seems to be able to handle several popular languages and has an IPA converter.

**Deliverables:** A mechanism that shows IPA pronunciation for some languages (chosen by the student). This can be done server-side (as a standalone service or part of existing code) or client-side (using JavaScript). The mechanism should allow pre-generating pronunciation descriptions and should provide a means to manually edit the pronunciation later. It can rely on a third-party tool to generate pronunciation descriptions.

**Prerequisite knowledge**: web technology, some web application stack (PHP, Python or CppCMS preferred).
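
A rough sketch of the pre-generate-then-edit idea follows. It assumes the eSpeak command-line tool is installed and that its `--ipa` option is available in the installed version; the `PronunciationStore` class and its method names are hypothetical, and a real implementation would persist the data rather than keep it in memory.

```python
import subprocess


def espeak_ipa(text, voice="en"):
    """Ask the eSpeak command-line tool for an IPA transcription.

    Assumes `espeak` is on the PATH; -q suppresses audio output.
    """
    result = subprocess.run(
        ["espeak", "-q", "--ipa", "-v", voice, text],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


class PronunciationStore:
    """Keeps generated transcriptions and lets manual edits override them."""

    def __init__(self, generate):
        self._generate = generate  # any callable: text -> IPA string
        self._generated = {}
        self._manual = {}

    def pregenerate(self, sentence_id, text):
        self._generated[sentence_id] = self._generate(text)

    def edit(self, sentence_id, ipa):
        self._manual[sentence_id] = ipa

    def get(self, sentence_id):
        # A manual correction always wins over the generated value.
        return self._manual.get(sentence_id, self._generated.get(sentence_id))
```

In use, one would create `PronunciationStore(espeak_ipa)`; passing the generator in as a parameter also makes it easy to swap eSpeak for another tool per language.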

New site & CppCMS
-----------------

Helping Sysko with tatowiki, tatodb. Extending [CppCMS](http://cppcms.com/wikipp/en/page/main). As the new site is still mostly being planned, there are no specific project ideas for the moment. Please ask on the IRC channel for more information. Note that many projects in this category will have an experimental nature, and their scope highly depends on your skills.

Standalone user tools
---------------------

Work on [shtooka recorder](http://a4esl.com/temporary/tatoeba/shtooka/) (or swac-record), [tatoparser](https://github.com/qdii/tatoeba_parser), [katoeba](https://github.com/sadhen/katoeba) and similar tools. Create new tools for advanced contributors and common users, like apps for smartphones.

### Android/iPhone application

iPhone users account for about 12% of site visitors, and Android users for about 7%. A dedicated application might help them immensely.

**Deliverables**: A smartphone application for easy access to Tatoeba. Examples of features:

* Querying the online Tatoeba site
* Adding sentences
* Translating
* Performing typical corpus maintenance tasks (linking/unlinking sentences, changing sentence language, tagging, etc.)
* Access to wall and sentence comments
* Recording voice
* Offline database access (more difficult!)

Note: You are not expected to implement all of these features during a single GSoC. Depending on your skills, you might prepare a proposal for a basic set of features (if you don't have much experience in mobile development yet) or a more complex or targeted application (if you do have experience and want to prepare something more feature-complete).

**Prerequisite knowledge**: Java and Android development or iPhone and iOS development; using web services.

### Streamlined linking of multiple sentences

Where multiple sentences in a source language have the same translation in the target language, make it easy to link those source sentences to the same target translation. Collecting the sentences that are likely to have the same translation could be as simple as presenting sentences in order of creation, since variants of a sentence that vary only in, e.g., the number of the pronoun (where the singular and plural forms of the second person map to the same word in English) are likely to be entered consecutively.

**Prerequisite knowledge**: JavaScript, possibly Java, possibly SQL.

### Help bots

Produce bots like [those in Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Bots) to help with maintenance and repetitive tasks, such as fixing common mistakes, wrong flags, etc. As on Wikipedia, users that are actually bots should be identified somehow on the website side. Ideally, create a library that could be used as a base to create bots that interact with the Tatoeba website.

Note: this idea would highly benefit from having a real API, which is another project listed here.

**Prerequisite knowledge**: web technologies.

Other ideas
-----------

External services, like [CK's Temporary Tatoeba site](http://a4esl.com/temporary/tatoeba/). Other ideas.

### XMPP Integration for Tatoeba

Use the XMPP communications protocol to integrate the manipulation of sentences, comments, and wall posts, as well as live feeds of latest comments, sentence additions, and wall posts, with services such as pubsub.

**Deliverables**:

* An XEP that outlines the protocol Tatoeba would use for all of those operations over XMPP, ready to be submitted to the XSF.
* An implementation of this XEP in XMPP clients as plugins; Poezio and Gajim are top priorities.
* An implementation of this XEP on the server side as a module; Prosody is a top priority.

**Prerequisite knowledge**: XMPP, PubSub, Python, Lua, and familiarity with the Prosody/Gajim/Poezio codebases and plugin architecture.


### SRS deck generator

Spaced Repetition Systems such as Anki and Mnemosyne are popular tools for learning languages. However, preparing a good SRS deck is a time-consuming task. Therefore, an automated way to generate a deck from a list of sentences (e.g., sentences on a Tatoeba list, sentences tagged by some specific tag, etc.) would help language learners.

**Deliverables**: an application (preferably a web-based one) that would use the Tatoeba database (for example in the form of a weekly CSV data dump) to create SRS decks for major flash card applications. Examples of features:

* Generate a simple deck from a Tatoeba list, tag, or search query.
* Generate an N+1-style deck based on the user's list of known words and the Tatoeba database. (The user gives a list of N words that s/he already knows. The system chooses new sentences in which exactly one word is unknown and the rest belong to the already known set.)
* Generated decks have proper internal structure (as for Anki decks: proper field scheme is used to store knowledge, so editing is easy).

**Prerequisite knowledge**: any web stack, though Django or CppCMS is preferred; Python or C++; knowledge about SRS.

### Browsable graph of sentence links

Given a sentence, display it together with its linked sentences, up to a given depth, as a [graph](https://en.wikipedia.org/wiki/Graph_%28data_structure%29) [like this](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule2). The main purpose of such a graph is to show users at a glance how Tatoeba is structured. The current interface doesn’t provide such a view, but it’s important that users understand the actual structure of Tatoeba. This idea could be freely extended to a complete interface that allows linking and unlinking with a click, filtering by language, editing sentences, or whatever else you can think of.

**Deliverables**: a web application or a client-side JavaScript program that provides a graph view of a group of sentences and allows manipulating them. The code can either operate on the database directly or use existing or planned APIs.

Note that this idea can be implemented as part of the current code base in PHP or as an experimental service for a new CppCMS site (preferred).

**Prerequisite knowledge**: PHP, Python or CppCMS.

version at: 23/02/2014, 03:23

GSoC ideas for student projects
===============================

This page lists example ideas for students who would like to take part in [Google Summer of Code](http://www.google-melange.com/) and be mentored by Tatoeba. To quote [GSoC FAQ](http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2013/help_page#3._What_is_an_Ideas_list):

<blockquote>
<p>An Ideas list should be a list of suggested student projects. This list is meant to introduce contributors to your project's needs and to provide inspiration to would-be student applicants. It is useful to classify each idea as specifically as possible, e.g. "must know Python" or "easier project; good for a student with more limited experience with C++." If your organization plans to provide a proposal template for the students, it would be good to include it on your Ideas list.</p>

<p>Keep in mind that your ideas list should be a starting point for student proposals; we've heard from past mentoring organization participants that some of their best student projects are those that greatly expanded on a proposed idea or were blue-sky proposals not mentioned on the ideas list at all. A link to a bug tracker for your open source organization is NOT an ideas list.</p>

<p>You can check out the <a href="http://community.kde.org/GSoC/2011/Ideas">Ideas list for KDE</a> for Google Summer of Code in 2011 to get an idea of what we’re looking for in an ideas list. </p>
</blockquote>

If you're a student, you're invited to discuss any of these ideas, as well as propose your own. To contact us, use one of these:

* [Tatoeba Wall page](http://tatoeba.org/wall/index)
* Email/Google groups: [Tatoeba GSoC mailing list](https://groups.google.com/forum/?fromgroups=#!forum/tatoeba-gsoc)
* IRC: [Tatoeba on #Freenode](irc://irc.freenode.net/tatoeba), [Webchat](http://webchat.freenode.net?channels=tatoeba)
* XMPP: [Tatoeba conference room on chat.tatoeba.org](xmpp:tatoeba@chat.tatoeba.org?join)


Current site
------------

Extending the current PHP site:

* programming in PHP with the CakePHP framework
* shell tools for maintenance (e.g.: better export scripts)
* JavaScript

### Better export scripts

Currently, CSV dumps are done weekly. They require the database to be switched into read-only mode, take 5 to 10 minutes, and omit some important information, such as tag creators and comments. CSV dumps are important for people who cooperate with Tatoeba by creating additional tools, so their quality is vital for healthy collaboration.

**Deliverables**: A database export mechanism that:

  * Dumps all interesting information (everything that's currently in the data dumps plus modification history, sentence comments, the wall, etc.)
  * Can create an incremental dump (faster dumps will allow making them more often)
  * Provides an interface for collaborators to get notifications about new dumps and allows automatic access.
  * (Advanced) Provides a stream of updates in the form of WebSockets or a similar mechanism.
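
As a rough illustration of the incremental dump idea, the sketch below writes only the rows changed since the previous dump. The row fields (`id`, `lang`, `text`, `modified`), the tab-separated layout, and the function name `incremental_dump` are assumptions made for this example; they do not describe the actual Tatoeba schema or export format.

```python
import csv
import io
from datetime import datetime


def incremental_dump(rows, since):
    """Return tab-separated text containing only rows modified after `since`.

    `rows` is any iterable of dicts with the hypothetical keys
    id, lang, text and modified (a datetime).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        if row["modified"] > since:
            writer.writerow(
                [row["id"], row["lang"], row["text"], row["modified"].isoformat()]
            )
    return buffer.getvalue()


# Hypothetical sample data standing in for a database query result.
rows = [
    {"id": 1, "lang": "eng", "text": "Hello.", "modified": datetime(2014, 2, 1)},
    {"id": 2, "lang": "fra", "text": "Bonjour.", "modified": datetime(2014, 2, 20)},
]
dump = incremental_dump(rows, since=datetime(2014, 2, 15))
```

A real implementation would also need to record deletions, which a "modified after" filter alone cannot express.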

**Prerequisite knowledge**: a scripting language (Python preferred), PHP, MySQL.

### Administrative Scripts

The current site has experienced a couple of crashes and periods of instability. We would like to increase the number of users who are fully capable of administering it easily and reliably. To ease administration and allow quick recovery from disaster, a number of scripts covering common administrative tasks are needed.

**Deliverables**: Shell scripts that cover:

 * backup
 * adding new languages
 * preserving existing translations when typos in the source UI strings are fixed 
 * getting external services up and running
 * updating the production site from the repository
 * deployment on a real server from scratch
 * monitoring the server and logging load and activity
 * other necessary tasks

**Prerequisite knowledge**: a scripting language (bash, Python, Perl, etc.), possibly familiarity with a provisioning tool (Ansible, Vagrant), and possibly familiarity with setting up and maintaining a monitoring system (New Relic, Nagios, Cacti, Munin).

### API

The Tatoeba database is used either through the main website interface or through data dumps. Having a real API that can be called through AJAX and return machine-readable results would provide real-time access for external applications.

**Deliverables**: A web application that provides a set of API calls for data stored in the current database. The API should cover all data available through the current web interface, including sentence comments, wall comments, recently-added sentences and top recent contributors.
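
To make "machine-readable results" more concrete, here is one possible shape for a JSON response describing a sentence and its translations. The field names and the function `sentence_to_json` are purely illustrative assumptions; the actual API design is up to the student proposal.

```python
import json


def sentence_to_json(sentence, translations):
    """Serialise one sentence and its direct translations to a JSON string.

    Both arguments use hypothetical dict keys: id, lang, text.
    """
    payload = {
        "id": sentence["id"],
        "lang": sentence["lang"],
        "text": sentence["text"],
        "translations": [
            {"id": t["id"], "lang": t["lang"], "text": t["text"]}
            for t in translations
        ],
    }
    # ensure_ascii=False keeps non-Latin scripts readable in the output.
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)


response = sentence_to_json(
    {"id": 1, "lang": "eng", "text": "Hello."},
    [{"id": 2, "lang": "fra", "text": "Bonjour."}],
)
```

An external application would then parse the response with any JSON library instead of scraping the website or waiting for the next data dump.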

**Prerequisite knowledge**: a web application language (Python or PHP preferred), MySQL, and a data exchange format such as JSON or XML.

### Improvements in user interface for end users

Tatoeba now handles several kinds of queries, but more are desired. The translation interface, in particular, needs improvement. Examples of desired types of queries:

* Get all sentences in a given language by a given username that have not yet been translated into a given language.

    * For example: Show me all English sentences by CK not yet translated into Japanese.

* Same as above, but limited to sentences with audio.

    * For example: Show me all English sentences by CK with audio not yet translated into Japanese.

* Get all sentences by native speakers of a given language not yet translated into my own native language.

    * For example: Show me all English sentences by native speakers not yet translated into Japanese.

* Get all sentences in a given language with a certain tag not yet translated into a given language.

    * For example: Show me all English sentences with the tag "restaurant" not yet translated into Japanese.

* Same as above, but limited to sentences by native speakers not yet translated into a given language.

    * For example: Show me all English sentences by native speakers with the tag "weather" not yet translated into Japanese.

* Get all sentences in a given language under a certain length not yet translated into a given language.

    * For example: Show me all Japanese sentences less than 50 characters that aren't yet translated into English.

* Same as above, but limited to native speaker sentences.

* Same as above, but limited to sentences by a given username.

* Get all sentences in a given language that match a given search keyword and aren't yet translated into a given language.

    * For example: Show all English sentences with the word "mountain" that aren't yet translated into Japanese.

* Same as above, but limited to native speaker sentences.

* Same as above, but limited to sentences by a given username.
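
One way to think about a generic mechanism for such queries is to combine a fixed "not yet translated" base query with small, composable filters. The sketch below does this over in-memory objects; the `Sentence` fields and all function names are hypothetical, and the real site would express the same logic as database or search-engine queries.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    # Hypothetical fields, not the real Tatoeba schema.
    id: int
    lang: str
    text: str
    owner: str
    owner_is_native: bool = False
    has_audio: bool = False
    tags: frozenset = frozenset()
    translated_into: frozenset = frozenset()


# Each filter is a function taking a Sentence and returning True or False.
def by_owner(name):
    return lambda s: s.owner == name


def with_audio(s):
    return s.has_audio


def native_only(s):
    return s.owner_is_native


def with_tag(tag):
    return lambda s: tag in s.tags


def shorter_than(length):
    return lambda s: len(s.text) < length


def containing(word):
    return lambda s: word.lower() in s.text.lower()


def find_untranslated(sentences, source_lang, target_lang, *filters):
    """Sentences in source_lang lacking a target_lang translation, narrowed by filters."""
    return [
        s for s in sentences
        if s.lang == source_lang
        and target_lang not in s.translated_into
        and all(f(s) for f in filters)
    ]


corpus = [
    Sentence(1, "eng", "I climbed the mountain.", "CK", True, True,
             frozenset({"travel"}), frozenset({"fra"})),
    Sentence(2, "eng", "It is raining.", "CK", True, False,
             frozenset({"weather"}), frozenset({"jpn"})),
    Sentence(3, "eng", "The mountain is high.", "alice"),
]

# "Show me all English sentences by CK with audio not yet translated into Japanese."
ck_audio = find_untranslated(corpus, "eng", "jpn", by_owner("CK"), with_audio)
```

Adding a new type of query then only means writing one more filter function.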

**Deliverables:** Implementation of some or all of the above. The project might include additional queries. A generic way of adding new types of queries would be highly desirable.

**Prerequisite knowledge**: PHP, CakePHP.

### Allow users to follow each other

This feature would allow users to keep track of sentences newly created by other users, much as on Twitter. Users could be notified of new sentences from the users they follow, and browse them. Whether the list of who’s following whom is public or private should be discussed prior to development.

**Deliverables:** a means to follow one or more users; a page that displays the sentences of the followed users and allows browsing and searching through them; configurable notifications about new sentences from the followed users; a display of who’s following whom.

**Prerequisite knowledge**: PHP, CakePHP.

### Word requests

This feature would allow users to request example sentences that show the correct usage of a given word or phrase. Contributors could browse lists of 'requested words' and add sentences that include them.

People interested in this idea should consider and discuss possible implementation details prior to development. Typical questions include:

* What should be the scope of the lists (per-language, per-user…)? 
* How should the lists be maintained? 
* Can we indicate that a 'requested word' now has enough example sentences? If yes, how? 
* What’s the lifecycle of a typical requested word? 
* What if users want to express additional information in their requests, such as the context or sense for the requested word? 
* What about synonyms, inflections… ?

**Deliverables:** a means to express the need for example sentences for a given word; a way to easily contribute new sentences that show examples of the requested words.

**Prerequisite knowledge**: PHP, CakePHP.

### Show pronunciation in IPA for sentences

IPA stands for "International Phonetic Alphabet" and is used to describe the pronunciation of human languages in an unambiguous way. As such, it helps in learning languages whose pronunciation rules are complex (e.g., English). Tatoeba could display IPA pronunciation for each sentence in basically the same way it currently displays pronunciation for Japanese using kana. One possible way of performing the task is to use an external library or application to prepare IPA annotations. For example, [eSpeak](http://espeak.sourceforge.net/) seems to be able to handle several popular languages and has an IPA converter.

**Deliverables:** A mechanism that shows IPA pronunciation for some languages (chosen by the student). This can be done server-side (as a standalone service or part of the existing code) or client-side (using JavaScript). The mechanism should allow pre-generating pronunciation descriptions and should provide a means to edit them manually later. It can rely on a third-party tool to generate pronunciation descriptions.
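
A minimal sketch of the server-side variant might look like the following. It assumes that the `espeak` command-line program is installed and relies on its `-q` (no audio), `--ipa` and `-v` (voice) options; the function names and the table of manual overrides are hypothetical and only illustrate the "generate, then allow manual edits" idea.

```python
import subprocess


def espeak_ipa_command(text, voice="en"):
    """Build the eSpeak invocation that prints IPA instead of speaking."""
    return ["espeak", "-q", "--ipa", "-v", voice, text]


def generate_ipa(text, voice="en"):
    """Run eSpeak (which must be installed) and return its IPA output."""
    completed = subprocess.run(
        espeak_ipa_command(text, voice),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


def ipa_for(sentence_id, text, overrides, generator=generate_ipa):
    """Prefer a manually edited transcription; otherwise generate one."""
    if sentence_id in overrides:
        return overrides[sentence_id]
    return generator(text)


# Hypothetical store of manual corrections, keyed by sentence id.
manual_overrides = {42: "həˈloʊ"}
command = espeak_ipa_command("Hello", voice="en")
```

Keeping manual corrections separate from generated output means that regenerating transcriptions with a newer tool version would not overwrite human edits.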

**Prerequisite knowledge**: web technology, some web application stack (PHP, Python or CppCMS preferred).

New site & CppCMS
-----------------

Helping Sysko with tatowiki, tatodb. Extending [CppCMS](http://cppcms.com/wikipp/en/page/main). As the new site is still mostly being planned, there are no specific project ideas for the moment. Please ask on the IRC channel for more information. Note that many projects in this category will have an experimental nature, and their scope highly depends on your skills.

Standalone user tools
---------------------

Work on [shtooka recorder](http://a4esl.com/temporary/tatoeba/shtooka/) (or swac-record), [tatoparser](https://github.com/qdii/tatoeba_parser), [katoeba](https://github.com/sadhen/katoeba) and similar tools. Create new tools for advanced contributors and common users, like apps for smartphones.

### Android/iPhone application

iPhone users make up about 12% of the site's visitors, and Android users about 7%. A dedicated application could help them immensely.

**Deliverables**: A smartphone application for easy access to Tatoeba. Examples of features:

* Querying the online Tatoeba site
* Adding sentences
* Translating
* Performing typical corpus maintenance tasks (linking/unlinking sentences, changing sentence language, tagging etc.)
* Access to wall and sentence comments
* Recording voice
* Offline database access (more difficult!)

Note: You are not expected to implement all of these features during a single GSoC. Depending on your skills, you might prepare a proposal for a basic set of features (if you don't have much experience in mobile development yet) or a more complex or targeted application (if you do have experience and want to prepare something more feature-complete).

**Prerequisite knowledge**: Java and Android development or iPhone and iOS development; using web services.

### Streamlined linking of multiple sentences

Where multiple sentences in a source language have the same translation in the target language, make it easy to link those source sentences to the same target translation. Collecting the sentences that are likely to have the same translation could be as simple as presenting sentences in order of creation, since variants of a sentence that vary only in, e.g., the number of the pronoun (where the singular and plural forms of the second person map to the same word in English) are likely to be entered consecutively.

**Prerequisite knowledge**: JavaScript, possibly Java, possibly SQL

### Help bots

Produce bots like [those on Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Bots) to help with maintenance and repetitive tasks such as fixing common mistakes or correcting wrong flags. As on Wikipedia, user accounts that are actually bots should be identified as such on the website. Ideally, create a library that could be used as a base for bots that interact with the Tatoeba website.

Note: this idea would benefit greatly from a real API, which is another project listed here.

**Prerequisite knowledge**: web 

Other ideas
-----------

External services, like [CK's Temporary Tatoeba site](http://a4esl.com/temporary/tatoeba/). Other ideas.

### XMPP Integration for Tatoeba

Use the XMPP communications protocol to integrate the manipulation of sentences, comments, and wall posts, as well as live feeds of latest comments, sentence additions, and wall posts, with services such as pubsub.

**Deliverables**:

* An XEP that outlines the protocol Tatoeba would use for all of those operations over XMPP, ready to be submitted to the XSF.
* An implementation of this XEP in XMPP clients as plugins; poezio and gajim are top priorities.
* An implementation of this XEP on the server side as a module; prosody is a top priority.

**Prerequisite knowledge**: XMPP, PubSub, Python, Lua, familiarity with the prosody/gajim/poezio codebases and plugin architecture


### SRS deck generator

Spaced Repetition Systems such as Anki and Mnemosyne are popular tools for learning languages. However, preparing a good SRS deck is a time-consuming task. Therefore, an automated way to generate a deck from a list of sentences (e.g., sentences on a Tatoeba list, sentences tagged by some specific tag, etc.) would help language learners.

**Deliverables**: an application (preferably a web-based one) that would use the Tatoeba database (for example in the form of a weekly CSV data dump) to create SRS decks for major flash card applications. Examples of features:

* Generate a simple deck from a Tatoeba list, tag, or search query.
* Generate an N+1-style deck based on the user's list of known words and the Tatoeba database. (The user gives a list of N words that s/he already knows. The system chooses new sentences in which exactly one word is unknown and the rest belong to the already known set.)
* Generated decks have proper internal structure (as for Anki decks: proper field scheme is used to store knowledge, so editing is easy).
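
The N+1 selection rule described in the list above can be sketched in a few lines. The tokenizer here is deliberately naive (it splits on anything that is not a letter and ignores inflection), and the function names are hypothetical; a real deck generator would need per-language word segmentation.

```python
import re


def tokenize(sentence):
    """Lower-case the sentence and split it into runs of letters (naive)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def n_plus_one(sentences, known_words):
    """Return (sentence, new_word) pairs where exactly one word is unknown."""
    known = {word.lower() for word in known_words}
    selected = []
    for sentence in sentences:
        unknown = {word for word in tokenize(sentence) if word not in known}
        if len(unknown) == 1:
            selected.append((sentence, unknown.pop()))
    return selected


known = ["I", "like", "the", "cat"]
candidates = [
    "I like the cat.",              # nothing new to learn
    "I like the dog.",              # exactly one new word: "dog"
    "The dog likes green apples.",  # too many new words
]
deck = n_plus_one(candidates, known)
```

After the learner studies a card, the new word can be added to the known set so that the next round unlocks further sentences.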

**Prerequisite knowledge**: any web stack, though Django or CppCMS is preferred; Python or C++; knowledge about SRS.

### Browsable graph of sentence links

Given a sentence, display it together with its linked sentences, up to a given depth, as a [graph](https://en.wikipedia.org/wiki/Graph_%28data_structure%29) [like this](http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html#rule2). The main purpose of such a graph is to show users at a glance how Tatoeba is structured. The current interface doesn’t provide such a view, but it’s important that users understand the actual structure of Tatoeba. This idea could be freely extended to a complete interface that allows linking and unlinking with a click, filtering by language, editing sentences, or whatever else you can think of.
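
Collecting the sentences to draw is essentially a breadth-first traversal that stops at the chosen depth. The sketch below assumes the links are available as a hypothetical adjacency mapping from a sentence id to the ids of its directly linked sentences; how that mapping is obtained (database, API, data dump) is left open.

```python
from collections import deque


def linked_within_depth(links, start, max_depth):
    """Breadth-first search returning {sentence_id: depth} up to max_depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if depths[current] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in links.get(current, ()):
            if neighbour not in depths:
                depths[neighbour] = depths[current] + 1
                queue.append(neighbour)
    return depths


# Hypothetical link data: 1 is linked to 2 and 3, 2 to 4, 4 to 5.
links = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
nearby = linked_within_depth(links, start=1, max_depth=2)
```

The resulting depths can be used to lay out direct translations next to the starting sentence and indirect ones further away.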

**Deliverables**: a web application or a client-side JavaScript program that provides a graph view of a group of sentences and allows manipulating them. The code can either operate on the database directly or use existing or planned APIs.

Note that this idea can be implemented as part of the current code base in PHP or as an experimental service for a new CppCMS site (preferred).

**Prerequisite knowledge**: PHP, Python or CppCMS.
