GSoC 2016 Project ideas
This page lists project ideas for students who would like to take part in Google Summer of Code 2016 and be mentored by Tatoeba.
Warning
This page is still a work in progress.
About Tatoeba
Tatoeba is a platform that aims to build a large database of sentences translated into as many languages as possible. The initial idea was to have a tool in which you could search for certain words, and it would return example sentences containing these words, with their translations in the desired languages. The name Tatoeba comes from this concept: "tatoeba" means "for example" in Japanese.
You can browse the blog or the wiki for more information about the project.
A note for students
If you are a student and are interested in working on one of the projects listed below, note that at this stage Google has not yet chosen which organizations will participate in GSoC 2016. The list of accepted mentoring organizations will be published on February 29. Until that date, Tatoeba is not officially part of GSoC 2016.
Of course this should not stop you from getting started on a project ahead of time. If you choose to do so, below are our recommendations and requirements.
Make sure that you have read the GSoC FAQ and that you understand how the program works. Please check the calendar for the various deadlines.
Spend time using Tatoeba. You need to have a good understanding of the current features. Note that we have a dev website where you can test anything you want without being afraid of polluting the production website.
If your project involves implementing something for the current version of Tatoeba, we expect you to show us that you understand our development process and our tools. The main way to do so is to actually try to contribute some code. We have a guide for new developers. You can simply follow it.
If your project will not affect Tatoeba's code itself, do still read the guide, but stop at the "Get in touch with us" part, and let us know your project idea.
Start preparing your GSoC proposal. You won't be implementing anything (at least not anything related to a GSoC project) until you are officially a GSoC student for Tatoeba. Be aware that we have certain requirements regarding GSoC proposals.
Last but not least, remember that the ideas listed on this page are only ideas. They are here to give you inspiration on what projects you could do with us but you are in no way limited to these ideas.
Ideas
Mobile friendly user interface
Around 40% of Tatoeba's visitors browse the website from a mobile device, but the usability of the current website on mobile devices is very poor. The idea of this project is to redesign the UI to improve the user experience for visitors on mobile devices.
Word requests
Imagine that you are learning a language, and you are reading an article in this foreign language. You come across a new word, and would like to have more example sentences that illustrate the usage of this word. You could go to Tatoeba and search for this word. But what if you don't find any sentences?
Tatoeba currently doesn't have any feature to support this situation. Our users would like to be able to easily create word requests, where they can submit a word in a certain language to request that other contributors create sentences around this word.
Achievement system
The idea behind the achievement system is to give users specific tasks to do and reward them with a badge/medal when they complete the tasks.
This system can be useful to guide new contributors into learning about the features of Tatoeba progressively, or just to let them know what to do next after they register. At the moment, after a user registers on Tatoeba, they are largely left on their own to figure out what to do next.
This system can also make contributing more engaging for the more advanced contributors.
Improvement of communication tools
The Wall is the main place for members to communicate with each other publicly. There are however no categories like in a regular forum. All the topics are mixed together. As a result, one cannot easily find all the posts where people introduce themselves, or all the posts where people submit suggestions, or all the posts that are announcements from the admins.
The private messages are very old style. There is no notion of a discussion thread, and therefore each message is displayed alone, even if it is a reply to a previous message. This makes it rather impractical to have a conversation with private messages.
The goals of this project are:
- to improve the Wall, or possibly replace it with a forum, or implement a forum in addition to the Wall.
- to change the private messages system so that all the messages from the same discussion are displayed in a single thread, rather than as separate private messages.
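As a rough illustration of the threading idea, here is a minimal sketch in Python that groups messages into discussion threads by following a reply link. The `Message` class, its fields (in particular `reply_to`) and the function `group_into_threads` are hypothetical and do not reflect Tatoeba's actual code or database schema. The sketch also assumes that message ids increase over time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    # Hypothetical fields; the real private-message table may differ.
    id: int
    sender: str
    text: str
    reply_to: Optional[int] = None  # id of the message this one answers


def group_into_threads(messages):
    """Map each thread's root message id to its messages, oldest first."""
    by_id = {m.id: m for m in messages}

    def root_of(message):
        # Follow reply_to links until reaching a message that is not a reply
        # (or whose parent is unknown). The `seen` set guards against cycles.
        seen = set()
        while message.reply_to in by_id and message.id not in seen:
            seen.add(message.id)
            message = by_id[message.reply_to]
        return message.id

    threads = defaultdict(list)
    # Assumption: a smaller id means an older message.
    for m in sorted(messages, key=lambda m: m.id):
        threads[root_of(m)].append(m)
    return dict(threads)


messages = [
    Message(1, "alice", "Hi, could you check my sentence?"),
    Message(2, "bob", "Sure, which one?", reply_to=1),
    Message(3, "carol", "Welcome to Tatoeba!"),
    Message(4, "alice", "The one I added today.", reply_to=2),
]
threads = group_into_threads(messages)
for root_id, thread in threads.items():
    print(root_id, [m.id for m in thread])
```

With this sample data, messages 1, 2 and 4 end up in one thread and message 3 in another. A real implementation would more likely store a thread identifier with each message rather than recompute it on every page load.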
Permissions management
The permissions of a user are based mostly on the user's status: depending on whether you are a contributor, advanced contributor, corpus maintainer or admin, you will have access to more or fewer features. For instance, advanced contributors can add tags to a sentence, while regular contributors cannot. Corpus maintainers can delete sentences while other contributors cannot.
The goal of this project is to design and implement a more refined permission system, with an interface to manage these permissions.
Here are examples of things that we cannot do at the moment, and that could be part of the project:
- Prevent a user from adding new sentences, but still allow them to translate sentences.
- Restrict the languages in which a user can contribute.
- Prevent a user from posting on the Wall, but still allow them to comment on sentences.
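The three examples above could be modelled with per-user rules rather than a single status. The following Python sketch shows one possible shape for such rules. The action names, the `Permissions` class and its `can` method are all assumptions made for illustration, not part of Tatoeba's code.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical action names, chosen only for illustration.
ADD_SENTENCE = "add_sentence"
TRANSLATE = "translate"
POST_ON_WALL = "post_on_wall"
COMMENT_ON_SENTENCE = "comment_on_sentence"


@dataclass
class Permissions:
    # Actions this user is explicitly forbidden to perform.
    denied_actions: set = field(default_factory=set)
    # Languages the user may contribute in; None means no restriction.
    allowed_languages: Optional[set] = None

    def can(self, action, language=None):
        """Return True if the user may perform `action` (in `language`, if given)."""
        if action in self.denied_actions:
            return False
        if language is not None and self.allowed_languages is not None:
            return language in self.allowed_languages
        return True


user = Permissions(
    denied_actions={ADD_SENTENCE, POST_ON_WALL},
    allowed_languages={"eng", "fra"},
)
print(user.can(TRANSLATE, "fra"))        # True
print(user.can(ADD_SENTENCE, "fra"))     # False
print(user.can(TRANSLATE, "deu"))        # False
print(user.can(COMMENT_ON_SENTENCE))     # True
print(user.can(POST_ON_WALL))            # False
```

This sketch uses a deny list on top of an "everything allowed" default. Combining such per-user rules with the existing status-based permissions is one of the design questions the project would have to answer.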
Audio
Tatoeba provides audio for some sentences. These recordings are made by volunteers, and the process of contributing audio is a bit complicated. This is due to the fact that audio was not at the core of the project.
Audio is still a great addition to the project and Tatoeba has received more and more audio contributions over the years. But the audio content lacks the structure that the sentences in the textual corpus benefit from.
- There is no easy way to know (from the website) who is the author of an audio file, nor when it was contributed (cf. Github issue #547).
- It is not possible either to attach several audio files to the same sentence (to illustrate different accents of the same language, for instance).
- It is a bit tedious to update and maintain the audio. Contributors have to follow a certain procedure, then their audio has to be uploaded to the server, then one of the server admins has to run a script to update the database. Surely we can make this simpler.
- It would also be nice if users could record audio directly through the web page (see this proof of concept).
The goal of this project would be to implement the necessary features for better management of the audio content in Tatoeba.
Better export
Tatoeba shares its data via CSV files that can be downloaded from the Downloads page of the website. The CSV files are generated on a weekly basis. Third parties can reuse this data in their projects. However, it's not easy to do so because this approach has many limitations:
- Third parties must download the whole corpus. There is no way to download only a part of it, for instance only sentences in a given set of languages.
- We don’t provide diffs between versions. Even if a relatively small part of the corpus has changed, third parties must download the whole corpus again for each new version.
- The format of the data is documented, yet subject to change at any time. There is no way to notify third parties about this.
- Third parties must wait a week to get new data.
- Third parties must do some preliminary work to restructure the data the way they need it.
- Probably other things.
We would love to see more projects reusing our data, but all this is definitely an entry barrier for many of them. So what can we do to make our export files easier to use?
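To make two of these limitations concrete (no partial downloads, no diffs), here is a small Python sketch of what third parties currently have to do on their own. It assumes a tab-separated sentences file with three columns (id, language code, text). As noted above, the export format may change, so treat that layout as an assumption. The function names `read_sentences` and `diff_exports` are made up for this example.

```python
import csv
import io


def read_sentences(tsv_file, languages=None):
    """Return {sentence_id: (lang, text)}, optionally limited to some languages."""
    sentences = {}
    # Assumed layout: id <TAB> language code <TAB> text, without quoting.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip lines that do not match the assumed layout
        sentence_id, lang, text = row
        if languages is None or lang in languages:
            sentences[int(sentence_id)] = (lang, text)
    return sentences


def diff_exports(old, new):
    """Compare two exports and return the added, removed and changed ids."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(i for i in old.keys() & new.keys() if old[i] != new[i])
    return added, removed, changed


# Two tiny in-memory "exports" standing in for two weekly files.
old_export = "1\teng\tHello.\n2\tfra\tBonjour.\n3\tdeu\tHallo.\n"
new_export = "1\teng\tHello!\n3\tdeu\tHallo.\n4\tfra\tSalut.\n"

old = read_sentences(io.StringIO(old_export))
new = read_sentences(io.StringIO(new_export))
added, removed, changed = diff_exports(old, new)
print(added, removed, changed)  # [4] [2] [1]

french_only = read_sentences(io.StringIO(new_export), languages={"fra"})
print(french_only)  # {4: ('fra', 'Salut.')}
```

Both operations require downloading and parsing the full file first. An improved export could instead offer per-language files or ready-made diffs, so that this work does not have to be repeated by every third party.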
Mentors
All the students will be mentored by BOTH gillux and Trang.
gillux
Trang
Contact
Google group
Our Google group is called tatoebaproject. This is your main entry point to get in touch with us in the scope of Google Summer of Code.
IRC / XMPP
To interact with us in real time, you are welcome to join our IRC channel: #tatoeba on freenode. Note that we are more likely to be online during the weekend than during weekdays.
If you don't want to install an IRC client, you can use the Webchat.
In case IRC is not your type of protocol, you can instead join our Jabber room at tatoeba@chat.tatoeba.org.
Tatoeba Wall
The Wall is the place where Tatoeba's community discusses things, asks questions, and exchanges ideas. We usually read all the messages on the Wall, so you could also get in touch with us from there.
However, your message could go unnoticed if it gets buried under some passionate discussion, so we recommend that you use the Google group first.