
[BUG!] Database optimization


nostra:
If you have just a small number of duplicates, it is really not worth the time spent on optimization or on the routine I described above. Removing all XXX thousand invisible movies also takes a very long time (in your case I would say about 8 hours to delete 245,000 invisible movies).


--- Quote ---I guess this might be the result of updating different people at different times, and in the interim the capitalization of the entry has been changed at IMDb. PVD then considers the item to be new because the titles are different, even though the URLs are the same.
--- End quote ---

Hm, actually PVD should give the URL a higher priority. I will take a look to see whether there is a bug or something.


--- Quote ---Is the optimize routine removing duplicates even though I typically abort it before it's finished?
--- End quote ---

Yes, it does.

rick.ca:

--- Quote ---If you have just a small amount of duplicates it is really not worth the time spent for optimization or for the routine I described above.
--- End quote ---

Yes, that was my conclusion. I also wanted to point out that a database that includes filmography data will contain many records that appear to be duplicates but are not.

I suspect I did my experiment too soon after optimizing to get much information about duplicates. Next time, I'll do an export before and after, so I can determine the nature of the duplicates that are created (and who knows, the number may be very small) and which are removed by the routine. I believe you've mentioned it simply deletes the second record it encounters. If there's no merging of data, then some data is likely being lost. For example, if a duplicate is created because the title of a movie has been changed since it was first downloaded, then any unique data added to that record will be lost. If that was filmography data, then updating the affected people again will re-add the movie, again using the different title and creating a duplicate.

Maybe what the routine should be doing, if the URL is the same, is merge the data and change the title to that of the most recently added record. :-\
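
Something like this, as a rough Python sketch of that merge-by-URL idea. The field names, merge rules and sample data are just my assumptions for illustration, not PVD's actual schema or behaviour:

--- Code ---
def merge_duplicates_by_url(records):
    """Collapse records that share a URL, keeping the most recently
    added title while preserving any unique data from older records."""
    merged = {}
    for rec in records:
        key = rec["url"]
        if key not in merged:
            merged[key] = dict(rec)
            continue
        kept = merged[key]
        # Use the title (and timestamp) of the most recently added record.
        if rec["date_added"] > kept["date_added"]:
            kept["title"] = rec["title"]
            kept["date_added"] = rec["date_added"]
        # Fill in any fields the kept record is missing, so unique
        # data from the duplicate isn't silently deleted with it.
        for field, value in rec.items():
            if not kept.get(field):
                kept[field] = value
    return list(merged.values())

# Placeholder data: same URL, different capitalization of the title.
movies = [
    {"url": "https://www.imdb.com/title/tt1111111/", "title": "Some Movie",
     "date_added": 1, "comment": "my notes"},
    {"url": "https://www.imdb.com/title/tt1111111/", "title": "SOME Movie",
     "date_added": 2, "comment": ""},
]
print(merge_duplicates_by_url(movies))
# One record survives: the newer title, with the older record's notes kept.
--- End code ---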

mgpw4me@yahoo.com:
I've run the optimize routine a couple of times and won't use the remove-duplicates option for people anymore. The process only looks at the person's name and removes duplicates regardless of filmography, URL, date of birth, etc.

Case in point:

Pamela Anderson (the famous one)
Pamela Anderson (from "Showgirls")

Duplicate removal deleted the famous P.A., or maybe it kept the better actress. ;D
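
To illustrate, a rough Python sketch of what name-only removal appears to do. The field names and placeholder URLs are my assumptions, not PVD's internals:

--- Code ---
def remove_duplicate_people(people):
    """Keep only the first record seen for each name."""
    seen = {}
    for person in people:
        # Only the name is consulted; URL, date of birth and
        # filmography play no part in the comparison.
        seen.setdefault(person["name"], person)
    return list(seen.values())

people = [
    {"name": "Pamela Anderson", "url": "https://www.imdb.com/name/nm1111111/"},
    {"name": "Pamela Anderson", "url": "https://www.imdb.com/name/nm2222222/"},
]
print(remove_duplicate_people(people))
# Only one Pamela Anderson survives, even though the URLs show these
# are two different people.
--- End code ---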

rick.ca:
How about that?! I didn't believe you because I always assumed it must use the URL to determine whether duplicate names are duplicate records. (It's difficult to tell what's happening when there are 33,000 people in the database.) So I exported a list of people and determined there were 53 duplicate names (mainly ones like "David Brown"). Spot checking these found no duplicate records—they're all unique people. The routine removed all of them! Whether design flaw or bug, it needs to be fixed.
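
For anyone who wants to repeat the check, here's a rough Python sketch of that spot check. The (name, URL) export format and the placeholder URLs are assumptions:

--- Code ---
from collections import defaultdict

# Group an exported people list by name, then flag names whose entries
# point at different URLs: those are distinct people, not duplicates.
exported = [
    ("David Brown", "https://www.imdb.com/name/nm1111111/"),
    ("David Brown", "https://www.imdb.com/name/nm2222222/"),
    ("Jane Smith", "https://www.imdb.com/name/nm3333333/"),
]

urls_by_name = defaultdict(set)
for name, url in exported:
    urls_by_name[name].add(url)

for name, urls in sorted(urls_by_name.items()):
    if len(urls) > 1:
        print(f"{name}: {len(urls)} entries with different URLs "
              f"(unique people, not duplicates)")
--- End code ---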

mgpw4me@yahoo.com:
I suppose the effort that could be spent resolving the bug would be better spent ensuring duplicates aren't added in the first place; then the duplicate-removal feature would not be necessary. I can see how this would work if a person's name and IMDb ID were indexed by the database as a single unique key (either value could match another record's, but not both together). This would tie PVD to IMDb (because of the IMDb ID) so tightly that I don't think it's desirable.
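
A rough sketch of that composite-key idea using SQLite. The table layout and placeholder IDs are my assumptions, not PVD's actual schema:

--- Code ---
import sqlite3

# Name and IMDb ID are unique *together*: either value alone may
# repeat, but not the pair.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE people (
        name    TEXT NOT NULL,
        imdb_id TEXT NOT NULL,
        UNIQUE (name, imdb_id)
    )
""")

# Two different people sharing a name: both accepted.
con.execute("INSERT INTO people VALUES ('Pamela Anderson', 'nm1111111')")
con.execute("INSERT INTO people VALUES ('Pamela Anderson', 'nm2222222')")

# The same (name, ID) pair again: rejected at insert time, so the
# duplicate never enters the database in the first place.
try:
    con.execute("INSERT INTO people VALUES ('Pamela Anderson', 'nm1111111')")
except sqlite3.IntegrityError as exc:
    print("duplicate rejected:", exc)
--- End code ---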

On the topic of movie duplicate removal, I assume it's using the same sort of processing. It's much more complex, since a title could have different versions with the same IMDb ID, with or without the same name. The file path "might" work for some people, but I have more than half my collection on DVDs (maybe with different volume IDs, or not; I don't know).

Perhaps a confirmation dialog (similar to the file scan results) is needed.

If I'd known the first time I ran duplicate removal against the people in my database that I'd lose 700+ entries (1%), I'd probably have canceled the process. Or not. I can be stubborn that way.


Oh look, the pussycat's going to scratch me... Go ahead, pussycat, SCRATCH ME.
