GSoC 2017 Project ideas
Note that at this stage Google has not yet chosen which organizations will participate in GSoC 2017. The list of accepted mentoring organizations will be published on February 27. Until that date, Tatoeba is not officially part of GSoC 2017.
Tatoeba is a large database of sentences and translations. Its content is ever-growing and results from the voluntary contributions of thousands of members.
Tatoeba provides a tool for you to see examples of how words are used in the context of a sentence. You specify words that interest you, and it returns sentences containing these words with their translations in the desired languages. The name Tatoeba (which means "for example" in Japanese) captures this concept.
How to get started
Are you a student interested in participating in GSoC with Tatoeba mentoring you? Here's how you can get started.
Spend time using Tatoeba. You need to have a good understanding of the current functionalities. Note that we have a dev website where you can test anything you want without being afraid of polluting the prod website.
We'll expect you to show us that you understand our development process and our tools. The best way to do this is to actually contribute some code, by fixing a small bug or implementing a small enhancement. For this, follow our guide for new developers.
Read our requirements regarding GSoC proposals. Start thinking about what you would write in your proposal, and if you need any information, ask us!
Remember that the ideas listed on this page are only ideas. They are here to give you inspiration on what projects you could do with us but you are in no way limited to these ideas.
Imagine that you are learning a language, and you are reading an article in this foreign language. You come across new words, and would like to have more example sentences that illustrate the usage of these words. You could go to Tatoeba and search for them. But what if you don't find any sentences?
To address this, we made it possible for users to create vocabulary lists. When they add a vocabulary item for which no sentence exists, this item is listed on a page for "Sentences wanted". From this page, contributors can browse vocabulary items with fewer than 10 sentences, and create sentences for these vocabulary items.
This feature still needs a lot of improvement. For instance:
- There is no way to filter out or remove "spam" vocabulary items.
- There is no system to bump up more demanded vocabulary items.
- Only sentences containing an exact match of the vocabulary item are linked to it.
The idea behind the achievement system is to give users specific tasks to do and reward them with a badge/medal when they complete the tasks.
This system can be useful to guide new contributors into learning about the features of Tatoeba progressively, or just to know what to do next after they register. Indeed, at the moment, after a user registers on Tatoeba, they are kind of left to themselves to figure out what to do next.
This system can also make contributing more engaging for the more advanced contributors.
Improvement of communication tools
The Wall is the main place for members to communicate with each other publicly. There are, however, no categories like in a regular forum. All the topics are mixed together. As a result, one cannot easily find all the posts where people introduce themselves, or all the posts where people submit suggestions, or all the posts that are announcements from the admins.
The private messages are very old style. There is no notion of a discussion thread, and therefore each message is displayed alone, even if it was a reply to a previous message. This makes it rather impractical to have a conversation with private messages.
The goal of this project is:
- to improve the Wall, or possibly replace it with a forum, or implement a forum in addition to the Wall.
- to change the private messages system to display all the messages from the same discussion in a single thread, rather than as separate private messages.
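As a rough illustration of the threading idea, here is a minimal sketch in Python. It assumes each private message stores the id of the message it replies to; the `Message` class, the `reply_to` field and the function names are hypothetical and do not reflect Tatoeba's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    id: int
    sender: str
    body: str
    # Hypothetical field: id of the message this one replies to, if any.
    reply_to: Optional[int] = None


def thread_root(msg_id, by_id):
    """Follow reply_to links up to the first message of the discussion.

    Assumes a reply always points at an earlier message, so there are no cycles.
    A reply whose parent is missing is treated as the start of its own thread.
    """
    current = by_id[msg_id]
    while current.reply_to is not None and current.reply_to in by_id:
        current = by_id[current.reply_to]
    return current.id


def group_into_threads(messages):
    """Return a dict mapping the root message id to the messages of that thread."""
    by_id = {m.id: m for m in messages}
    threads = defaultdict(list)
    # Sorting by id keeps each thread in the order the messages were created.
    for m in sorted(messages, key=lambda m: m.id):
        threads[thread_root(m.id, by_id)].append(m)
    return dict(threads)
```

With such a grouping, the inbox could list one entry per thread instead of one entry per message.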
The permissions of a user are based mostly on the user's status: depending on whether you are a contributor, advanced contributor, corpus maintainer or admin, you will have access to more or fewer features. For instance, advanced contributors can add tags to a sentence, while regular contributors cannot. Corpus maintainers can delete others’ sentences while other contributors cannot.
The goal of this project is to design and implement a more refined permission system, with an interface to manage these permissions.
Here are examples of things that we cannot do at the moment, and that could be part of the project:
- Prevent a user from adding new sentences, but still allow them to translate sentences.
- Restrict the languages in which a user can contribute.
- Prevent a user from posting on the Wall only, while still allowing them to comment on sentences.
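To make these examples more concrete, here is a minimal sketch in Python of per-user restrictions layered on top of the status-based permissions. Everything in it is an assumption made for illustration: the action names, the `BASE_ACTIONS` table and the `UserPermissions` class are hypothetical and are not part of Tatoeba's code.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical action names, loosely based on the examples in the text.
COMMON = {"add_sentence", "translate_sentence", "post_wall", "comment_sentence"}
BASE_ACTIONS = {
    "contributor": COMMON,
    "advanced_contributor": COMMON | {"add_tag"},
    "corpus_maintainer": COMMON | {"add_tag", "delete_others_sentence"},
}


@dataclass
class UserPermissions:
    status: str
    # Actions explicitly taken away from this user.
    denied: Set[str] = field(default_factory=set)
    # None means "no language restriction".
    allowed_languages: Optional[Set[str]] = None


def can(user, action, lang=None):
    """Check the status first, then the per-user restrictions."""
    if action not in BASE_ACTIONS.get(user.status, set()):
        return False
    if action in user.denied:
        return False
    if lang is not None and user.allowed_languages is not None:
        return lang in user.allowed_languages
    return True


# A contributor who may translate but not add new sentences.
translator_only = UserPermissions("contributor", denied={"add_sentence"})
# A contributor restricted to French and Japanese.
restricted = UserPermissions("contributor", allowed_languages={"fra", "jpn"})
```

A real design would also need a way to store these restrictions and an interface to manage them, which is the main part of the project.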
Tatoeba provides audio for some sentences. These recordings are made by volunteers, but because audio was initially not at the core of the project, the process of contributing audio is a bit complicated.
Audio was still a great addition and Tatoeba has received more and more audio contributions over the years. However the audio content lacks many features.
- It is not possible to attach several audio recordings to the same sentence (to illustrate different accents of the same language, for instance).
- Contributors cannot record audio directly through the web page (see this proof of concept)
The goal of this project would be to implement the necessary features for a better management of the audio content in Tatoeba.
Better exports
Tatoeba shares its data via CSV files that can be downloaded from the Downloads page of the website. CSVs are generated on a weekly basis. Third parties can reuse this data in their projects. However, it's not easy to do so because this approach has many limits:
- Third parties must download the whole corpus. There is no way to download a part of it, for instance only sentences in a given set of languages.
- We don’t provide diffs between versions. Even if a relatively small part of the corpus changed, third parties must download the whole corpus at each new version.
- The format of the data is documented, yet subject to change at any time. There is no way to notify third parties about this.
- Third parties must wait a week to get new data.
- Third parties must do some preliminary work to restructure the data the way they need it.
- Probably other things.
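To give an idea of the preliminary work involved, here is a minimal sketch in Python of what a third party currently has to do by hand for the first two points: keep only some languages, and compute what changed between two weekly exports. It assumes a tab-separated file where each row is a sentence id, a language code and the sentence text; check the documentation on the Downloads page for the actual format. The function names are hypothetical.

```python
import csv
import io


def read_sentences(fileobj):
    """Read rows of (id, language, text) into a dict keyed by sentence id.

    Assumed layout: tab-separated, no quoting. Rows with fewer than
    three columns are skipped.
    """
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): (row[1], row[2]) for row in reader if len(row) >= 3}


def filter_languages(sentences, languages):
    """Keep only the sentences written in one of the given languages."""
    return {sid: v for sid, v in sentences.items() if v[0] in languages}


def diff_exports(old, new):
    """Compare two exports and report added, removed and changed sentences."""
    added = {sid: new[sid] for sid in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {
        sid: new[sid]
        for sid in old.keys() & new.keys()
        if old[sid] != new[sid]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Tiny in-memory examples standing in for two weekly export files.
last_week = read_sentences(io.StringIO("1\teng\tHello.\n2\tfra\tBonjour.\n"))
this_week = read_sentences(io.StringIO("1\teng\tHello!\n3\tdeu\tHallo.\n"))
changes = diff_exports(last_week, this_week)
```

If exports offered this kind of filtering and diffing directly, third parties would not need to download and reprocess everything each week.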
We would love to see more projects reusing our data, but all this is definitely an entry barrier for many of them. So what can we do to make our export files easier to use?
App using Tatoeba's data
As mentioned in the "Better exports" idea above, Tatoeba shares its data and we are always happy to see projects reusing our data. Do you have a nice idea of an app that you could build from it? This can be a GSoC project as well.
Just one thing: make sure you check this list of projects that use our corpus. Maybe someone else already had the idea before you. So try to find the gaps. Make something innovative!
Note that this project idea is closely tied to the "Better exports" idea, except it tackles the problem from a more concrete angle. Since you will be reusing our data, you will experience real situations where you can see how we can improve the way we share our data. You will be in a better position to find out, or help us find out, what we could do to make it easier for you (and other people like you) to get started with their projects.
As a collaborative project that is open for anyone to join, one of the challenges that Tatoeba faces constantly is to provide data of good quality. Not all Tatoeba contributors are highly skilled in the language(s) they contribute in, and therefore contributions are not always good: they may contain spelling mistakes or grammatical mistakes, they may not sound natural, the translations may be inaccurate or just plain wrong.
Although Tatoeba has some mechanisms to manage quality, these mechanisms are not optimal. Users still need to make extra efforts to figure out when they can really rely on a sentence or translation.
What can we improve in our current system, to provide sentences and translations of higher quality? How can we assess the quality of a sentence or of a translation, so that language learners or third party tools can easily filter out sentences of bad quality, or of uncertain quality?
Our Google group is called tatoebaproject. This is your main entry point to get in touch with us in the scope of Google Summer of Code.
To interact with us in real time, you are welcome to join our Gitter chatroom. We may not be online when you drop by, but feel free to leave a message nonetheless.
The Wall is the place where Tatoeba's community discusses things, asks questions, and exchanges ideas. We usually read all the messages on the Wall, so you could also get in touch with us from there.
However, your message could go unnoticed if it gets buried behind some passionate discussion, so we recommend that you use the Google group first.