Requirements list for duplicate merging process

This is an edited and expanded version of the requirements list regarding the duplicate merging script rewrite. The original list was composed by alanf, gillux, Link Mauve, liori and "two unnamed authors".


Requirements for the deduplication code (script, runtime code, or combination of the two):

  • Either:
    • does not block the database, and gracefully handles sentences that are added, deleted, or modified while the script is running (possibly by ignoring them), or
    • does block the database, but runs so quickly that any inconvenience is minimal
  • Does not mess up the database if it is aborted prematurely (for instance, if the server goes down).
  • Either:
    • reindexes the search query database (perhaps by calling the Sphinx reindexing command), or
    • prompts the caller of the script to run the reindexing command when finished
  • Performs incremental runs (e.g. "consider only sentences that have changed according to the contributions table between date X and date Y", not just sentences that have been added after some date/ID).
  • Integration with the PHP code? (i.e. run at sentence upload time.) This could also give normal users a way to link sentences.
  • Does not use an O(n²) algorithm.
  • Intelligently handles cycles (sentences that link to themselves directly or indirectly) rather than getting stuck in infinite loops or trying to add the same comment or link to a sentence multiple times.
  • Maybe provides warnings before deletion? For example, the first run only adds a comment saying "this sentence is a duplicate of that sentence", and the second run actually removes the duplicates.
  • Who owns the merged sentence?
  • Merging comments, links, logs.
  • Merge tags except for "@duplicate", which should be dropped.
  • Update users’ favorite sentences.
  • Add a new type of contribution to the contributions table: 'sentence'. Any time a sentence is deleted by the deduplication process, a new entry is added to the contributions table to log the deletion.
  • Can run on a dev machine as well as on the server. Username, password, etc. are read from the command line or config files. (We can just do it as a Cake console script run from the command line.) (We should basically go over the database schema, look at all the places where a sentence ID shows up, and decide what to do in each of those places.)
  • Ignore whitespace when comparing sentences. Equivalently, but more efficiently, have the site collapse runs of whitespace in sentences when they're added. (We'd also have to run a script once to do the same for existing sentences.)
  • Prioritization
    • Sentence with low correctness value (e.g., -1) always loses to a sentence with higher correctness. (A sentence may be assigned low correctness and thus be displayed in red because its contributor has, for instance, consistently added copyrighted sentences. We would want versions of sentences from trustworthy users instead, even if the content was equivalent.)
    • Sentence with audio wins over sentence without.
      • If multiple sentences have audio, and we're using the "dry run" technique, the first run could check for this situation so we could listen for ourselves in case the audio for one is better. Or we could allow multiple audio files per sentence, which would require a change to the database schema.
    • Older sentence wins over newer.
    • Prioritize contributions by (self-identified) native speakers over those by users who do not self-identify as native. (This was suggested by CK, who points out that some users and "re-users" of the corpus place more trust in sentences by self-identified natives.)
    • Prioritize owned sentences over orphan sentences. (According to CK, the old script did this.)
  • [2014-10-07] When sentences are merged, the merged sentence should replace the others in the lists where they were present.
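
The O(n²) and whitespace requirements above can be satisfied together by grouping sentences under a normalized key instead of comparing every pair. A minimal sketch in Python, assuming each sentence is a record with hypothetical `id`, `lang`, and `text` fields (the real schema will differ):

```python
import re
from collections import defaultdict

def normalize(text):
    # Collapse runs of whitespace so sentences differing only in
    # spacing compare as equal (the normalization rule suggested above).
    return re.sub(r"\s+", " ", text.strip())

def group_duplicates(sentences):
    # Group sentences by (lang, normalized text) in a single O(n) pass,
    # instead of comparing every pair (O(n^2)).
    groups = defaultdict(list)
    for s in sentences:
        groups[(s["lang"], normalize(s["text"]))].append(s)
    # Only keys with more than one sentence are actual duplicate sets.
    return [g for g in groups.values() if len(g) > 1]

sentences = [
    {"id": 1, "lang": "eng", "text": "Hello  world."},
    {"id": 2, "lang": "eng", "text": "Hello world."},
    {"id": 3, "lang": "fra", "text": "Bonjour."},
]
dupes = group_duplicates(sentences)
```

Sentences in different languages never land in the same group, so identical text in two languages (see the open question in the additional notes) is never merged by this scheme.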
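
The prioritization rules could be expressed as a single sort key. The sketch below is one possible reading: the list above does not fix the relative precedence of the rules, so the ordering chosen here (correctness, then audio, then native speaker, then ownership, with age as the final tiebreaker) is an assumption, and the field names are hypothetical:

```python
def merge_priority(s):
    # Sort key for picking the merge winner; smaller tuples win.
    # Precedence among the rules is assumed, not specified in the list.
    return (
        -s["correctness"],                   # higher correctness wins; -1 always loses
        0 if s["has_audio"] else 1,          # sentence with audio wins
        0 if s["owner_is_native"] else 1,    # self-identified native wins
        0 if s["owner"] is not None else 1,  # owned wins over orphan
        s["id"],                             # older (lower id) wins as tiebreaker
    )

def pick_winner(duplicates):
    # Return the sentence the others should be merged into.
    return min(duplicates, key=merge_priority)

native_no_audio = {"id": 2, "correctness": 0, "has_audio": False,
                   "owner_is_native": True, "owner": "alice"}
orphan_with_audio = {"id": 5, "correctness": 0, "has_audio": True,
                     "owner_is_native": False, "owner": None}
winner = pick_winner([native_no_audio, orphan_with_audio])
```

Under this precedence, `orphan_with_audio` wins despite being newer and unowned, because audio outranks everything except correctness.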

Nice to have

  • Some way of handling the comment threads on the existing sentences so that they can still be understood when the sentences are merged.
  • Can be executed as a cron job (in other words, reliable enough that it doesn't require preparation before it works, observation while it works, or cleanup after it works).


  • Normalize punctuation (’ vs. ', ！ vs. !, directional vs. straight quotation marks, etc.) — this is a task for a separate script.
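
Such a script could fold typographic variants to canonical forms with a simple translation table. A sketch, assuming folding toward ASCII; the character set shown is illustrative, not exhaustive:

```python
# Hypothetical mapping for a separate punctuation-normalization script:
# typographic characters are folded to ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
    "\uFF01": "!",   # fullwidth exclamation mark
})

def normalize_punctuation(text):
    return text.translate(PUNCT_MAP)
```

Whether to normalize in place or only for comparison is a policy question: some languages (e.g. Japanese) conventionally use fullwidth punctuation, so blind folding may be wrong there.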

Additional notes

  • Another approach is to prevent the addition of duplicates in the first place (see point 3 of the requirements). The question is whether that can be done efficiently. Actually, you would need to lock the database, check for an existing sentence, add it, and unlock (not a big deal given the rate at which sentences are added).
    • Adding this might require quite a lot of changes to the UI and such so that it won't confuse users. For example, a user might accidentally create a duplicate sentence that they intended to edit later.
    • For adding a translation: we could just link the existing sentence with the one the user intended to translate. But this could be used as a way to bypass normal user rights and link two arbitrary sentences. Isn't that ultimately wanted, though, and was it disabled only because no good UI was found?
    • For adding a new sentence: just show an error.
    • For editing a sentence into one that already exists: ???
  • Can a sentence be written as the same text while having different meanings in different languages, thus actually requiring separate sentences?
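
The lock/check/add/unlock sequence described above can be delegated to the database itself via a unique constraint, which makes the check and the insert atomic without an explicit lock. A sketch using SQLite for illustration only; the table and column names are hypothetical, not Tatoeba's actual schema:

```python
import sqlite3

# UNIQUE (lang, text) lets the database reject duplicates atomically.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sentences (
        id   INTEGER PRIMARY KEY,
        lang TEXT NOT NULL,
        text TEXT NOT NULL,
        UNIQUE (lang, text)
    )
""")

def add_sentence(conn, lang, text):
    # Insert a sentence; return the id of the new row, or of the
    # already-existing duplicate if the insert is rejected.
    try:
        with conn:  # transaction: commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO sentences (lang, text) VALUES (?, ?)",
                (lang, text))
            return cur.lastrowid
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT id FROM sentences WHERE lang = ? AND text = ?",
            (lang, text)).fetchone()
        return row[0]

first = add_sentence(conn, "eng", "Hello.")
dup = add_sentence(conn, "eng", "Hello.")  # rejected; returns first's id
```

Because the constraint includes the language, the same text in two different languages is still allowed as two separate sentences, which keeps the open question below a policy decision rather than a technical one.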

