Author Topic: [BUG!] Database optimization  (Read 9536 times)


Offline Minous

  • Member
  • *
  • Posts: 13
    • View Profile
[BUG!] Database optimization
« on: October 19, 2010, 07:14:36 pm »
I've got a database with about 350 movies and a dozen TV shows (plus all of their actors). I sometimes run across movies that are duplicates (in the actor's information tab), so I decided to "Optimize" the database. Everything except "delete movie duplicates" works just fine; however, when I try to use that option the program runs for 4+ hours and seems to be hung. Is this a bug, something that needs to be fixed, or just normal behavior?
« Last Edit: November 14, 2010, 11:48:52 pm by rick.ca »

Offline nostra

  • Administrator
  • *****
  • Posts: 2852
    • View Profile
    • Personal Video Database
Re: Database optimization
« Reply #1 on: October 19, 2010, 07:22:16 pm »
It can take quite a lot of time to remove duplicates if you have many people in your database with filled filmographies. I would just try running the process overnight. The other question is why you got those duplicates in the first place. What set of plugins do you use to get the info?
Gentlemen, you can’t fight in here! This is the War Room!

Offline Minous

  • Member
  • *
  • Posts: 13
    • View Profile
Re: Database optimization
« Reply #2 on: October 19, 2010, 07:29:02 pm »
This database is fairly old (I've been updating it for about two years). I've used three plugins: the IMDB, allmovies and TVDB plugins. Is there a simple method for dumping movie titles and then re-importing all the data en masse?

Offline nostra

  • Administrator
  • *****
  • Posts: 2852
    • View Profile
    • Personal Video Database
Re: Database optimization
« Reply #3 on: October 19, 2010, 07:32:13 pm »
If you do not have duplicates in your movie list, I would do the following:
1. Select Filters -> Advanced filters -> Not visible
2. Select all of the movies in the list
3. Delete all records
4. Select Filters -> Advanced filters -> Not Visible
5. Switch to people
6. Reimport information for all people (select all and run the plugin(s))
Gentlemen, you can’t fight in here! This is the War Room!

Offline rick.ca

  • Global Moderator
  • *****
  • Posts: 3241
  • "I'm willing to shoot you!"
    • View Profile
Re: Database optimization
« Reply #4 on: October 20, 2010, 04:01:43 am »
My database has about 2,200 movies, 33,000 people (all with filmographies) and 245,000 invisible movies (i.e., data for the filmographies). I can't remember the last time I let the optimize routine run long enough to remove duplicates. I usually give up after 12-15 hours.

Maybe I need to delete and re-download the people data as suggested, but I don't want to do that unless I'm sure it will make a difference. With these kinds of numbers, it's very difficult to tell how extensive the problem is. Many apparent duplicates are separate records for different years of a series. Those are necessary so the series will appear with the year a particular actor appeared in it. And however uncommon, it is possible for two different videos to have the same title and year.

So this is what I did: I exported my invisible movies to Excel and used that to find duplicate URLs. There are 20 of them. In each case the titles are identical, except for differences in their capitalization. I guess this might be the result of updating different people at different times, and in the interim the capitalization of the entry has been changed at IMDb. PVD then considers the item to be new because the titles are different, even though the URLs are the same. In any case, it really doesn't matter. There are 20 duplications in 145,000 records (0.014%). It's of no possible consequence.
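
For anyone who wants to repeat the check without Excel, a rough script version is below. The movies.csv file name and the "title"/"url" column names are assumptions about the export, not something PVD guarantees.
Code:
import csv
from collections import defaultdict

# Collect every title seen for each URL in the exported movie list.
titles_by_url = defaultdict(list)

with open("movies.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = (row["url"] or "").strip().lower()
        if url:  # records with no URL can't be matched this way
            titles_by_url[url].append(row["title"])

# Any URL that occurs more than once is a candidate duplicate, even if
# the titles differ only in capitalization.
for url, titles in titles_by_url.items():
    if len(titles) > 1:
        print(url, "->", titles)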

I suppose one question remains. Is the optimize routine removing duplicates even though I typically abort it before it's finished? I think I let it run overnight less than a week ago. Maybe that's why there are so few duplicates. :-\

And now for the funny part. I decided I should do the same thing for my visible movies. Sometimes I notice a duplicate that's happened because I've added something using a different title than an already existing "wish list" item. There are 20 of those as well! Not hard to lose among 2,200 movies, but still...

Offline nostra

  • Administrator
  • *****
  • Posts: 2852
    • View Profile
    • Personal Video Database
Re: Database optimization
« Reply #5 on: October 20, 2010, 11:45:31 am »
If you have just a small number of duplicates it is really not worth the time spent on optimization or on the routine I described above. Removing hundreds of thousands of invisible movies also takes a very long time (in your case I would say about 8 hours to delete 245,000 invisible movies).

Quote
I guess this might be the result of updating different people at different times, and in the interim the capitalization of the entry has been changed at IMDb. PVD then considers the item to be new because the titles are different, even though the URLs are the same.

Hm, actually PVD should give the URL a higher priority. I will take a look to see if there is a bug or something.

Quote
Is the optimize routine removing duplicates even though I typically abort it before it's finished?

Yes, it does
Gentlemen, you can’t fight in here! This is the War Room!

Offline rick.ca

  • Global Moderator
  • *****
  • Posts: 3241
  • "I'm willing to shoot you!"
    • View Profile
Re: Database optimization
« Reply #6 on: October 20, 2010, 09:44:53 pm »
Quote
If you have just a small number of duplicates it is really not worth the time spent on optimization or on the routine I described above.

Yes, that was my conclusion. I also wanted to point out that a database that includes filmography data will contain many records that appear to be duplicates but are not.

I suspect I did my experiment too soon after optimizing to get much information about duplicates. Next time, I'll do an export before and after, so I can determine the nature of the duplicates that are created (and who knows, the number may be very small) and which are removed by the routine. I believe you've mentioned it simply deletes the second record it encounters. If there's no merging of data, then some data is likely being lost. For example, if a duplicate is created because the title of a movie has been changed since it was first downloaded, then any unique data added to that record will be lost. If that was filmography data, then updating the affected people again will add the movie, again using the different title and creating a duplicate.

Maybe, when the URL is the same, the routine should merge the data and change the title to that of the most recently added record. :-\
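
Just to make that concrete, here is a rough sketch of what "merge instead of delete" could mean. This is not PVD's actual code; the field names (url, title, date_added) are invented for illustration.
Code:
def merge_duplicates(records):
    # Collapse records that share a URL: keep one, take the title (and
    # date) from whichever record was added most recently, and fill any
    # blank field on the kept record from the other.
    merged, by_url = [], {}
    for rec in records:
        key = (rec.get("url") or "").strip().lower()
        if not key:
            merged.append(rec)        # no URL: never merged automatically
            continue
        kept = by_url.get(key)
        if kept is None:
            by_url[key] = dict(rec)
            continue
        if rec["date_added"] > kept["date_added"]:
            kept["title"] = rec["title"]
            kept["date_added"] = rec["date_added"]
        for field, value in rec.items():
            if not kept.get(field):   # only fill fields that are blank
                kept[field] = value
    return merged + list(by_url.values())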

mgpw4me@yahoo.com

  • Guest
Re: [SOLVED] Database optimization
« Reply #7 on: November 14, 2010, 08:55:03 pm »
I've run the optimize routine a couple of times and won't use the remove duplicates option for people anymore. The process only looks at the name of the person and removes duplicates regardless of filmography, URL, date of birth, etc.

Case in point:

Pamela Anderson (the famous one)
Pamela Anderson (from "Showgirls")

Duplicate removal deleted the famous P.A. or maybe it kept the better actress  ;D
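
A toy example of why that happens (nothing to do with PVD's real code, and the URLs below are placeholders, not the actors' real pages): keying duplicates on the name alone throws one of them away, while keying on name plus URL keeps both.
Code:
people = [
    {"name": "Pamela Anderson", "url": "https://example.org/person/1"},  # the famous one
    {"name": "Pamela Anderson", "url": "https://example.org/person/2"},  # the one from "Showgirls"
]

def dedupe(records, key):
    # Keep the first record seen for each key, drop the rest.
    seen, kept = set(), []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            kept.append(rec)
    return kept

print(len(dedupe(people, key=lambda r: r["name"])))              # 1 -- one actress is lost
print(len(dedupe(people, key=lambda r: (r["name"], r["url"])))) # 2 -- both survive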

Offline rick.ca

  • Global Moderator
  • *****
  • Posts: 3241
  • "I'm willing to shoot you!"
    • View Profile
Re: [SOLVED] Database optimization
« Reply #8 on: November 14, 2010, 11:47:30 pm »
How about that?! I didn't believe you because I always assumed it must use the URL to determine whether duplicate names are duplicate records. (It's difficult to tell what's happening when there are 33,000 people in the database.) So I exported a list of people and determined there were 53 duplicate names (mainly ones like "David Brown"). Spot checking these found no duplicate records—they're all unique people. The routine removed all of them! Whether design flaw or bug, it needs to be fixed.

mgpw4me@yahoo.com

  • Guest
Re: [BUG!] Database optimization
« Reply #9 on: November 15, 2010, 07:31:17 pm »
I suppose any effort to resolve the bug would be better spent ensuring that duplicates aren't added in the first place; then the duplicate removal feature would not be necessary. I can see how this would work if a person's name and IMDb ID were indexed by the database as a single unique key (either value could be the same as in another record, but not both). But that would tie PVD to IMDb (because of the IMDb ID) so tightly that I don't think it's desirable.
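
For what it's worth, here is roughly what such a constraint looks like in plain SQLite (shown through Python). The table layout and the IMDb IDs are made up for the example; this is not PVD's schema.
Code:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE people (
        name    TEXT NOT NULL,
        imdb_id TEXT,
        UNIQUE (name, imdb_id)  -- a record may repeat the name or the id, but not both
    )
""")
con.execute("INSERT INTO people VALUES ('Pamela Anderson', 'nm1111111')")
con.execute("INSERT INTO people VALUES ('Pamela Anderson', 'nm2222222')")      # ok: different id
try:
    con.execute("INSERT INTO people VALUES ('Pamela Anderson', 'nm1111111')")  # rejected
except sqlite3.IntegrityError as err:
    print("duplicate blocked:", err)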

On the topic of movie duplicate removal, I assume it uses the same sort of processing. It's much more complex, since a title could have different versions with the same IMDB ID... with or without having the same name. The file path "might" work for some people, but I have more than half my collection on DVDs (maybe with different VOLUME IDs, or not, I don't know).

Perhaps a confirmation dialog (similar to the file scan results) is needed.

If I'd known the first time I ran duplicate removal against the people in my database that I'd lose 700+ entries (1%), I'd probably have canceled the process. Or not. I can be stubborn that way.



Oh look the pussycat's going to scratch me...  Go ahead pussycat SCRATCH ME.
« Last Edit: November 15, 2010, 07:35:42 pm by mgpw4me@yahoo.com »

Offline rick.ca

  • Global Moderator
  • *****
  • Posts: 3241
  • "I'm willing to shoot you!"
    • View Profile
Re: [BUG!] Database optimization
« Reply #10 on: November 16, 2010, 02:13:12 am »
Whether people or movies, any two records that have different IMDb (or other source) numbers are not duplicates. Having multiple movie records for different versions of the same movie is a user database design issue: they're still duplicates, just duplicates that the user wouldn't want removed. The problem is with other possibilities, like a person who was added from another source but who already exists in the database.

Maybe the answer is to provide tools that would assist the user in reviewing and fixing these issues. Like a command that bookmarks all duplicates, and another to merge two records (hopefully in some "intelligent" manner).

mgpw4me@yahoo.com

  • Guest
Re: [BUG!] Database optimization
« Reply #11 on: November 16, 2010, 04:11:08 am »
Quote
The problem is with other possibilities, like a person who was added from another source but who already exists in the database.

That is my exact experience. 

Sometimes I add a movie that can't be found on IMDB. I know who was in the movie, but don't know what other movies (if any) they were in. I dig around (say, at HKMDB.COM) to find a filmography I can match up with an IMDB person. With luck, my movie has an AKA that isn't listed on IMDB, and I can fix it by simply adding the URL and running the plug-in.

And sometimes, no. 

I have about 24 people that I cannot match up and have no URL for. I know who is in the movie only from the credits (or from the source of the original movie file).

My point is that the URL can't be counted on to be unique and definitive. Having an option to include or exclude null URLs would reduce the problem.
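
As a sketch of what that option might mean in practice (invented names, not an existing PVD feature): a record without a URL would only ever be matched by name if the user opts in; otherwise it is never flagged as a duplicate.
Code:
def duplicate_key(person, match_null_urls=False):
    # Key a hypothetical duplicate scan would compare on.
    url = (person.get("url") or "").strip().lower()
    if url:
        return ("url", url)
    if match_null_urls:
        return ("name", person["name"].strip().lower())
    return ("unique", id(person))  # sentinel: never collides with another record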

Automated detection and manual correction sounds like a reasonable way to deal with this. In my case that is where I'd bow out... 700 manual corrections is just too much... unless I could mark records as having been processed and not have to deal with them when I run another duplicate check (unless there was "more information" to the contrary provided by the scanning process).

I'd rather see the duplicate removal option removed entirely and dealt with in PVD 1.1 or later. There are so many better things that could be done with the time.

Offline rick.ca

  • Global Moderator
  • *****
  • Posts: 3241
  • "I'm willing to shoot you!"
    • View Profile
Re: [BUG!] Database optimization
« Reply #12 on: November 16, 2010, 04:51:49 am »
Quote
I have about 24 people that I cannot match up and have no URL for.

You could add any available URL (e.g., for the person's page at HKMDB.com). If that's too much trouble or no such page exists, use a dummy URL. It could be anything unique (in proper URL form, I suppose), but maybe something that would tell you where the data came from.

Quote
In my case that is where I'd bow out... 700 manual corrections is just too much... unless I could mark records as having been processed and not have to deal with them when I run another duplicate check (unless there was "more information" to the contrary provided by the scanning process).

If providing URLs won't do the trick (and I can imagine you might not want to tackle the 700-record backlog), yes, I would mark them somehow. Searching or grouping could be used to segregate them; that wouldn't have to be built into the tool.

mgpw4me@yahoo.com

  • Guest
Re: [BUG!] Database optimization
« Reply #13 on: November 16, 2010, 05:02:19 am »
I suppose adding a fake URL would work, but... well... you know... it sucks.

I could easily export/import to make the change, or write a short script. I'd just hate to have something marked as definitely unique when at some future time I may stumble on a proper information source. Such things happen... I'm still hunting for 50 or 60 posters.
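
For what it's worth, the script route could be as small as the sketch below. The people.csv file name, the "name"/"url" columns and the credits.local pattern are all placeholders, and the dummy URLs stay easy to spot and replace later if a real source ever turns up.
Code:
import csv

# Give every person without a URL a dummy one that records where the
# data came from, then write a file that can be re-imported.
with open("people.csv", newline="", encoding="utf-8") as src, \
     open("people_fixed.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if not (row["url"] or "").strip():
            row["url"] = "http://credits.local/" + row["name"].replace(" ", "_")
        writer.writerow(row)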