Version at: 31/03/2014, 02:44 vs. version at: 29/04/2014, 18:39

Version at: 31/03/2014, 02:44

Notes on duplicate merging process
==================================

This is a (very slightly) edited copy of the requirements list regarding the duplicate merging script rewrite. The [original list](http://piratepad.net/Pc2jyM71Ut) was composed by alanf, gillux, Link Mauve, liori and "two unnamed authors".

Requirements
------------

Requirements for the deduplication code (script, runtime code, or combination of the two):
 
  * Does not block the database.
  * Does not mess up the database if it is aborted prematurely (for instance, if the server goes down).
  * Performs incremental runs (e.g. "consider only sentences that have *changed* according to the contributions table between date X and date Y", not just sentences that have been added after some date/ID).
  * Integration with the PHP code (i.e. run at sentence upload time)? This could also be a way to link sentences for normal users.
  * Does not use an O(n²) algorithm.
  * Handles cycles intelligently (sentences that link to themselves directly or indirectly) without getting stuck in infinite loops or trying to add the same comment or link to a sentence multiple times.
  * Maybe provides warnings before deletion? Like: first run only adds a comment saying "this sentence is a duplicate of that sentence", and second run actually removes them?
  * Who owns the merged sentence?
  * Sentences with audio are prioritized as the target sentence. (Need to decide what to do if multiple duplicate sentences have audio: which one wins? If we use the "dry run" technique, the first run could check for this situation so we could listen for ourselves in case the audio for one is better. Or we could just use a rule (newest wins, or oldest wins). Why not allow multiple audio recordings per sentence? Multiple speakers would add value to the audio anyway. Good idea, but I don’t think it’s currently possible. Of course. :-) )
  * Merging comments, tags, links, logs.
  * Update users’ favorite sentences.
  * Add a new type of contribution to the contributions table: '  sentence'. Any time a sentence is deleted due to the deduplication process, a new entry in the contributions table is added to log the deletion.
  * Can run on dev machine as well as on server. Username, password, etc. are read from command line or another file. (We can just do it as a Cake console script which is run from the command line.)
 
(We should basically go over the database schema and look at all the places where sentence ID shows up, and decide what to do in those places.)
 
Nice to have
------------

  * Some way of handling the comment threads on the existing sentences so that they can still be understood when the sentences are merged.
  * Can be executed as a cron job (in other words, reliable enough that it doesn't require preparation before it works, observation while it works, or cleanup after it works).
 
Nonrequirements
---------------
 
  * Does not handle sentences that come in while the script is running(?)
  * Normalize punctuation (’ vs. ', ! vs. ! etc.); this is a task for a separate script.
 
Additional notes
----------------

  * Another approach is to prevent the addition of duplicates in the first place (see point 3 of requirements). The question is whether that can be done efficiently. Actually, you need to lock the database, check for an existing sentence, add it and unlock (not a big deal given the rate at which sentences are added).
      * Adding this might require quite a lot of changes to the UI and such so that it won't be confusing to the user. For example, when a user accidentally adds a sentence that is a duplicate, even if they meant to edit the sentence later.
      * For adding a translation: we could just link the existing sentence with the one the user intended to translate. But this could be used as a way to bypass normal user rights and link two sentences arbitrarily. Isn’t that ultimately wanted, and was it disabled only because no good UI was found?
      * For adding a new sentence: just show an error.
      * For editing a sentence into an already existing one: ???
  * Can a sentence be written as the same text while having different meanings in different languages, thus actually requiring separate sentences?

Version at: 29/04/2014, 18:39

Notes on duplicate merging process
==================================

This is a (very slightly) edited copy of the requirements list regarding the duplicate merging script rewrite. The [original list](http://piratepad.net/Pc2jyM71Ut) was composed by alanf, gillux, Link Mauve, liori and "two unnamed authors".

Requirements
------------

Requirements for the deduplication code (script, runtime code, or combination of the two):
 
  * Does not block the database.
  * Does not mess up the database if it is aborted prematurely (for instance, if the server goes down).
  * Performs incremental runs (e.g. "consider only sentences that have *changed* according to the contributions table between date X and date Y", not just sentences that have been added after some date/ID).
  * Integration with the PHP code (i.e. run at sentence upload time)? This could also be a way to link sentences for normal users.
  * Does not use an O(n²) algorithm.
  * Handles cycles intelligently (sentences that link to themselves directly or indirectly) without getting stuck in infinite loops or trying to add the same comment or link to a sentence multiple times.
  * Maybe provides warnings before deletion? Like: first run only adds a comment saying "this sentence is a duplicate of that sentence", and second run actually removes them?
  * Who owns the merged sentence?
  * Sentences with audio are prioritized as the target sentence. (Need to decide what to do if multiple duplicate sentences have audio: which one wins? If we use the "dry run" technique, the first run could check for this situation so we could listen for ourselves in case the audio for one is better. Or we could just use a rule (newest wins, or oldest wins). Why not allow multiple audio recordings per sentence? Multiple speakers would add value to the audio anyway. Good idea, but I don’t think it’s currently possible. Of course. :-) )
  * Merging comments, tags, links, logs.
  * Update users’ favorite sentences.
  * Add a new type of contribution to the contributions table: '  sentence'. Any time a sentence is deleted due to the deduplication process, a new entry in the contributions table is added to log the deletion.
  * Can run on dev machine as well as on server. Username, password, etc. are read from command line or another file. (We can just do it as a Cake console script which is run from the command line.)
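
As a rough illustration of the "no O(n²)" and "audio wins" points above, duplicates can be found in a single pass by bucketing sentences on a key instead of comparing every pair. This is only a sketch in Python, not the project's code: the record fields (`id`, `lang`, `text`, `has_audio`) are hypothetical, grouping by language plus text is an assumption, and "lowest ID wins" is just one of the tie-break rules floated in the list.

```python
from collections import defaultdict


def find_duplicate_groups(sentences):
    """Group sentence records by (lang, text) in one pass over the input.

    `sentences` is an iterable of dicts with the hypothetical keys
    'id', 'lang', 'text' and 'has_audio'.
    """
    groups = defaultdict(list)
    for s in sentences:
        groups[(s["lang"], s["text"])].append(s)
    # Only buckets with more than one sentence are duplicates.
    return [g for g in groups.values() if len(g) > 1]


def pick_target(group):
    """Choose the sentence to keep: audio first, then lowest id (assumed rule)."""
    return min(group, key=lambda s: (not s["has_audio"], s["id"]))


sentences = [
    {"id": 1, "lang": "eng", "text": "Hello.", "has_audio": False},
    {"id": 2, "lang": "eng", "text": "Hello.", "has_audio": True},
    {"id": 3, "lang": "fra", "text": "Salut.", "has_audio": False},
    {"id": 4, "lang": "eng", "text": "Hello.", "has_audio": False},
]
groups = find_duplicate_groups(sentences)
targets = [pick_target(g)["id"] for g in groups]
```

Here sentences 1, 2 and 4 end up in one group and sentence 2 is kept because it has audio. Other priorities (for example native-speaker ownership) could be added as further elements of the sort key.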
 
(We should basically go over the database schema and look at all the places where sentence ID shows up, and decide what to do in those places.)
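
Links are one of those places. A minimal sketch of re-pointing links during a merge, assuming links are stored as directed `(from_id, to_id)` pairs (the function name and data shape are made up for illustration): links that would turn into self-links are dropped, and a set keeps the same link from being added twice.

```python
def merge_links(links, duplicate_ids, target_id):
    """Re-point links from duplicate sentences to the target sentence.

    `links` is a set of (from_id, to_id) pairs. A link between two
    sentences of the same duplicate group would become a self-link,
    so it is dropped; the result set holds each link at most once.
    """
    dup = set(duplicate_ids)
    merged = set()
    for a, b in links:
        a = target_id if a in dup else a
        b = target_id if b in dup else b
        if a != b:
            merged.add((a, b))
    return merged


links = {(1, 2), (2, 1), (1, 10), (10, 1), (2, 10), (10, 2), (4, 11), (11, 4)}
result = merge_links(links, duplicate_ids=[1, 4], target_id=2)
```

In this example sentences 1 and 4 are merged into 2: the link between 1 and 2 disappears, the links to 10 collapse into a single pair per direction, and the links to 11 are moved over to sentence 2.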
 
Nice to have
------------

  * Some way of handling the comment threads on the existing sentences so that they can still be understood when the sentences are merged.
  * Can be executed as a cron job (in other words, reliable enough that it doesn't require preparation before it works, observation while it works, or cleanup after it works).
  * Prioritize contributions by (self-identified) native speakers over those who do not self-identify as native. (suggested by CK, who points out that some users and "re-users" of the corpus place more trust in sentences by self-identified natives)

Nonrequirements
---------------
 
  * Does not handle sentences that come in while the script is running(?)
  * Normalize punctuation (’ vs. ', ! vs. ! etc.); this is a task for a separate script.
 
Additional notes
----------------

  * Another approach is to prevent the addition of duplicates in the first place (see point 3 of requirements). The question is whether that can be done efficiently. Actually, you need to lock the database, check for an existing sentence, add it and unlock (not a big deal given the rate at which sentences are added).
      * Adding this might require quite a lot of changes to the UI and such so that it won't be confusing to the user. For example, when a user accidentally adds a sentence that is a duplicate, even if they meant to edit the sentence later.
      * For adding a translation: we could just link the existing sentence with the one the user intended to translate. But this could be used as a way to bypass normal user rights and link two sentences arbitrarily. Isn’t that ultimately wanted, and was it disabled only because no good UI was found?
      * For adding a new sentence: just show an error.
      * For editing a sentence into an already existing one: ???
  * Can a sentence be written as the same text while having different meanings in different languages, thus actually requiring separate sentences?
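
The "lock, check, add, unlock" idea from the first note could look roughly like the following. This is a sketch only: it uses Python with an in-memory SQLite database as a stand-in for the real database, the `sentences` table and its columns are hypothetical, and it assumes that the same text in two languages counts as two separate sentences.

```python
import sqlite3


def add_sentence(conn, lang, text):
    """Return (sentence_id, created); check and insert share one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking
    try:
        row = conn.execute(
            "SELECT id FROM sentences WHERE lang = ? AND text = ?", (lang, text)
        ).fetchone()
        if row is not None:
            conn.execute("COMMIT")
            return row[0], False  # duplicate: hand back the existing id
        cur = conn.execute(
            "INSERT INTO sentences (lang, text) VALUES (?, ?)", (lang, text)
        )
        conn.execute("COMMIT")
        return cur.lastrowid, True
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None lets the code manage transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE sentences ("
    "id INTEGER PRIMARY KEY, lang TEXT, text TEXT, UNIQUE (lang, text))"
)
first = add_sentence(conn, "eng", "Hello.")
second = add_sentence(conn, "eng", "Hello.")
third = add_sentence(conn, "fra", "Hello.")
```

The second call returns the ID of the first sentence instead of creating a new row, which is the information the UI would need to show an error or link to the existing sentence. The `UNIQUE` constraint is a safety net in case some other code path inserts without going through this function.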
