Offline translations for the Hvalur project

Jirka contacted Tatoeba after running into issues while bulk-uploading translations previously made for the Hvalur project, using Gemini to automate the upload process. Hvalur is an Icelandic-Czech dictionary project that reuses Icelandic sentences from Tatoeba. Some of these Icelandic sentences have also been translated into Czech, and Jirka was trying to upload this translation work back to Tatoeba.

Below is a copy of Jirka's email explaining this translation work.

Translation workflow

I proceed in three passes over CSV files.

First pass: purely mechanical machine translation, whole file at once, using either the Google Translate API (until the end of 2025) or DeepL (since 2026). DeepL has officially supported Icelandic since Jan 7, 2026, so I just switched, although their "beta" was superior to Google even before that.

Second pass: manual translation. (When I was still using Google Translate in the first pass, essentially every sentence needed re-translating. With DeepL, I can even keep some sentences exactly as DeepL phrased them, but there are still many very basic mistakes.) In this pass, I spend only a minority of my time on translation itself. I'm excerpting new meanings for hvalur.org, reporting any issues with the original to Tatoeba, and, most importantly, using each sentence to improve my own language skills.

Third pass: a check of output quality and translation accuracy by a chatbot (Gemini meets my needs there very well). I manually correct any issues it finds. This is often an iterative process, because some sentences are complicated to translate without any defined context. But I'm able to feed long blocks of input to the chatbot at a time, so this proceeds much faster than the second pass.

Formats

The input to the first pass is whatever CSV the hvalur.org project gave me. For every subsequent pass, I chose to work with a much simpler CSV whose individual lines look roughly like this: "123456";"Icelandic sentence";"Czech sentence". I use a plain text editor, editing the CSV directly. The third pass checks for any file formatting irregularities as well.
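As an illustration only (this is not Jirka's actual tooling), a file in this format could be read with Python's standard csv module. The function name read_items and the sample rows are hypothetical placeholders; only the semicolon separator, the ASCII double quote, and the three-column layout come from the description above.

```python
import csv
import io

# Hypothetical sample in the layout described above: id; Icelandic; Czech.
SAMPLE = (
    '"123456";"Icelandic sentence";"Czech sentence"\n'
    '"123457";"Another Icelandic sentence";"Another Czech sentence"\n'
)


def read_items(stream):
    """Return a list of (sentence_id, icelandic, czech) tuples."""
    reader = csv.reader(stream, delimiter=";", quotechar='"')
    items = []
    for sentence_id, icelandic, czech in reader:
        items.append((int(sentence_id), icelandic, czech))
    return items


items = read_items(io.StringIO(SAMPLE))
for sentence_id, icelandic, czech in items:
    print(sentence_id, "|", icelandic, "|", czech)
```

When reading a real file rather than an in-memory string, the csv module documentation recommends opening it with newline="" and an explicit encoding such as UTF-8.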

Icelandic and Czech normally don't use the ASCII double quote character for quoting. If any quoting appears in an existing item, it would use „this form“ and not disrupt the CSV format's use of "English quoting". However, separator choice or character escaping may need attention if you are thinking of establishing language-independent tooling.
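To make the escaping concern concrete, here is a small sketch using Python's standard csv module. The helper name to_line and the sample strings are made up for illustration. Typographic quotes pass through untouched, while an embedded ASCII double quote has to be escaped; the csv module does this by doubling it.

```python
import csv
import io


def to_line(sentence_id, icelandic, czech):
    """Serialize one item as a fully quoted, semicolon-separated line."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=";",
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        lineterminator="\n",
    )
    writer.writerow([sentence_id, icelandic, czech])
    return buf.getvalue()


# Typographic quotes do not collide with the CSV quote character.
typographic = to_line(1, "„a“", "„b“")

# An ASCII double quote inside a field is doubled by the writer.
ascii_quoted = to_line(2, 'He said "hi"', "x")

# Reading the line back recovers the original field.
round_trip = next(csv.reader(io.StringIO(ascii_quoted), delimiter=";", quotechar='"'))

print(typographic, ascii_quoted, round_trip, sep="")
```

A hand-edited file would need the same doubling applied manually wherever an ASCII double quote appears inside a sentence, which is one reason a language-independent tool might prefer a different separator or format.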

I believe that simplicity of the format is paramount, to allow people to perform translation and troubleshooting without distractions, as well as easy and reliable integration with the tooling of their choosing (often LLM-assisted, I expect). There is no need for a GUI tool (other than the tatoeba.org website); all you'd need to provide would be an interchange format, and people can stick to whatever GUI tool they already have (if they require one).

Whether the interchange format is CSV, JSON, or XML is immaterial, as any of these can be converted into any other with ease. File-level validation always helps (reject any file that does not conform exactly to your specification); providing item-by-item processing results in a similar machine-readable format may be helpful for some users, too.
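A minimal sketch of what such validation could look like, assuming the three-column semicolon-separated format from the Formats section. The rules, the function names check_row and validate_file, and the shape of the JSON report are all hypothetical; Tatoeba does not currently define such a specification.

```python
import csv
import io
import json


def check_row(row):
    """Return an error message for a malformed row, or None if it is fine."""
    if len(row) != 3:
        return "expected 3 fields, got %d" % len(row)
    sentence_id, icelandic, czech = row
    if not (sentence_id.isascii() and sentence_id.isdigit()):
        return "sentence id is not numeric"
    if not icelandic.strip() or not czech.strip():
        return "empty sentence"
    return None


def validate_file(text):
    """Validate a whole file; accept it only if every row passes."""
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"', strict=True)
    items = []
    try:
        for row in reader:
            error = check_row(row)
            items.append({"line": reader.line_num, "ok": error is None, "error": error})
    except csv.Error as exc:
        # Malformed quoting and similar low-level problems end up here.
        items.append({"line": reader.line_num, "ok": False, "error": str(exc)})
    accepted = bool(items) and all(item["ok"] for item in items)
    return {"accepted": accepted, "items": items}


good = '"1";"a";"b"\n'
bad = '"1";"a";"b"\n"x";"a";"b"\n"2";"a"\n'

good_report = validate_file(good)
bad_report = validate_file(bad)

# The item-by-item results are plain data, so they serialize directly to JSON.
print(json.dumps(bad_report, ensure_ascii=False, indent=2))
```

The file is rejected as a whole if any row fails, while the per-item list tells the uploader exactly which lines to fix.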

Upload

I'm not saying that an API that could only process one item at a time would be a bad move, either; in my case, it would be harder to use than a bulk import facility, but way easier than playing a web browser like I did last night. (Whatever Gemini coded for my present upload batch is throwaway. I don't even store the prompts that I gave to Gemini. Your website's going to evolve, so in the absence of anything better, my next batch would have to come with a freshly coded uploader.)

Playing a web browser is, most of all, fragile, so if I don't have dedicated hardware for the task, and dedicated attention and electricity budgets for it, it could be painful to try to keep it running for, say, 21 days straight (8400 items divided by 400 items allowed per day). I had this current upload batch running overnight so as not to accidentally minimize its browser window or something. I'm quite happy that ~1775 items out of 2000 apparently made it through, as that makes any follow-up corrections a manageable size for me. (4 out of 2000 were items already deleted on the Tatoeba side, leaving 1996 items as my target size for this batch.)
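The figures above can be checked with a few lines of arithmetic. The numbers are taken from the email; the variable names are only illustrative, and the uploaded count is the approximate value quoted there.

```python
import math

total_items = 8400      # items waiting to be uploaded
daily_limit = 400       # items allowed per day
days_needed = math.ceil(total_items / daily_limit)

batch_size = 2000       # items in the current batch
already_deleted = 4     # items no longer present on the Tatoeba side
uploaded = 1775         # approximate count that made it through

target = batch_size - already_deleted
remaining = target - uploaded
success_rate = uploaded / target

print(days_needed, "days at the daily limit")
print(target, "target items,", remaining, "left to follow up on")
print("success rate: {:.1%}".format(success_rate))
```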

Gemini chose Go as the programming language and a library called chromedp for the browser component, if that matters.
