Topic: File scanner regular expressions

AimHere

  • Power User
  • Posts: 209
File scanner regular expressions
« on: July 31, 2011, 05:41:14 pm »
Hi,

I need some help with the configuration for the File Scanner, specifically, regular expressions.

I like how, when importing video files, PVD strips things like "CD1", "CD2", and so on from the filenames when generating a movie title. So, for example, "This Movie CD1 (2011, SomeStudio)" becomes "This Movie" in the title field.

Thing is, I have more and more movies that came from multiple DVD sets, where each DVD is broken into CD-sized files as well. They have filenames like "That Movie D1CD1", "That Movie D1CD2", "That Movie D2CD1", etc. ... OR, "Another Movie D1A", "Another Movie D1B", "Another Movie D2A", etc.

When I point PVD's file scanner to movies like these, I wind up with titles like "That Movie D1CD" or "Another Movie D" in the Scan Results window. Note how it retains part of the disc/CD identification tag. I have to edit the title and strip them out by hand.

Also, with the "DxCDy" naming convention, PVD doesn't group all the parts together under the same title; instead, I get a separate item for each DVD ("That Movie D1CD" AND "That Movie D2CD"). So, I have to select both items and right-click to choose "Same Movie".

Now, I don't really know much about regular expressions, so I'm not sure how to go about fixing this in the preferences for the File Scanner. I'd like to retain the "D1CD1" or "D1A" tagging for the filenames, but keep it from carrying over to the titles in the File Scanner.

Any ideas?

Aimhere

rick.ca

  • Global Moderator
  • Posts: 3241
  • "I'm willing to shoot you!"
Re: File scanner regular expressions
« Reply #1 on: July 31, 2011, 07:16:11 pm »
The ability to recognize patterns like "That Movie D1CD2" is exactly why regex is used. The expressions in your configuration are evaluated until one matches the filename in question. You need to add one in the correct position that matches this particular pattern. For example...

(?i)^.*\\(?P<title>.*) D[0-9]CD[0-9].*

...would match D:\Video\Movies\That Movie D1CD2 plus whatever.mkv. A more elaborate expression could match variations in the " DxCDy " pattern, like ".DxCDyy." or "- Dx CDy" (spaces are significant!).
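As a rough check, the expression above can be tried in Python's re module, which accepts the same (?P<name>...) named-group syntax. This is only a sketch: PVD's own regex engine may differ from Python's in details, and the function name extract_title is made up for illustration. The sample path is the one from the post.

```python
import re

# Expression from the post. The raw string keeps the backslashes intact:
# "\\" in the regex matches one literal backslash in the path.
DISC_PATTERN = re.compile(r"(?i)^.*\\(?P<title>.*) D[0-9]CD[0-9].*")


def extract_title(path):
    """Return the captured title, or None when the pattern does not match.

    extract_title is a hypothetical helper, not part of PVD.
    """
    match = DISC_PATTERN.match(path)
    return match.group("title") if match else None


if __name__ == "__main__":
    print(extract_title(r"D:\Video\Movies\That Movie D1CD2 plus whatever.mkv"))
```

Because the first ".*" is greedy, it runs to the last backslash, so only the filename part is considered for the title.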

Better practice would be to rename all movie files to include the year. That helps resolve ambiguity in the title (when searching a data source) and allows a simple expression to match both Title and Year without fail...

(?i)^.*\\(?P<title>.*) \((?P<year>(19|2\d)\d{2})\).*

...matches anything like D:\Video\Movies\That Movie (2011) plus whatever.mkv.
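The title-and-year expression can be checked the same way. The sketch below assumes Python's re module behaves like PVD's engine for this pattern; parse_title_year is a hypothetical helper name. Note that the expression expects the closing parenthesis directly after the year, so a name like "That Movie (2011, SomeStudio).avi" is not matched by it as written.

```python
import re

# Expression from the post, unchanged.
YEAR_PATTERN = re.compile(r"(?i)^.*\\(?P<title>.*) \((?P<year>(19|2\d)\d{2})\).*")


def parse_title_year(path):
    """Return (title, year) as strings, or None if the pattern does not match.

    parse_title_year is a hypothetical helper, not part of PVD.
    """
    match = YEAR_PATTERN.match(path)
    if match is None:
        return None
    return match.group("title"), match.group("year")


if __name__ == "__main__":
    print(parse_title_year(r"D:\Video\Movies\That Movie (2011) plus whatever.mkv"))
    # No match here: a comma follows the year instead of ")".
    print(parse_title_year(r"D:\Video\Movies\That Movie (2011, SomeStudio).avi"))
```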

But if you don't want to bother renaming these files (or any other files matching some other patterns) to include the year, there's nothing wrong with having multiple expressions to match various patterns.

Use the Utility to test regular expressions for extracting movie data from file names (link on Download page) to determine the expression needed in any circumstance.

AimHere
Re: File scanner regular expressions
« Reply #2 on: August 01, 2011, 09:45:31 pm »
Hi rick,

Sorry, I should have been more clear. The movie filenames already do include the year, e.g. "This Movie D1A (2011, Some Studio).avi", "That Movie D2CD1 (2010, AnotherStudio).avi", etc. I just want to strip out the "D1A" or "D2CD1" parts completely to create the titles in PVD's file scanner, rather than leaving the fragments that I'm getting (and have to edit out) as shown in my original post. And I want to do it in a way which is compatible with the existing (default) regular expressions already in the File Scanner configuration (I will accept modifying existing regexes to do what needs to be done).

AimHere
Re: File scanner regular expressions
« Reply #3 on: August 02, 2011, 12:13:26 am »
Okay, I managed to figure out the strings I needed to add to the "find and replace" section in the File Scanner preferences to strip out the strings I wanted to remove:

(?i)\bD\d{1,2}CD\d{1,2}\b
(?i)\bD\d{1,2}[a-z]\b
(?i)\bD\d{1,2}\b


I just added these after the existing "(?i)\bCD\d{1,2}\b" line, and tested on a folder full of files named as we've been discussing. This seems to remove all of the "D1CD1" and "D1A" style strings from the filenames. Maybe there's a more elegant way to do the same thing with fewer RegExes, but I'll take what works.
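A minimal Python sketch of how these rules behave, under two assumptions not stated in the thread: that each "find and replace" rule replaces its match with an empty string, and that rules run in list order. The whitespace clean-up step and the single combined rule are additions for illustration only; the combined rule is one possible way to fold the three new expressions into one.

```python
import re

# Existing default rule followed by the three additions from the post.
ORIGINAL_RULES = [
    r"(?i)\bCD\d{1,2}\b",
    r"(?i)\bD\d{1,2}CD\d{1,2}\b",
    r"(?i)\bD\d{1,2}[a-z]\b",
    r"(?i)\bD\d{1,2}\b",
]

# Hypothetical alternative: the three additions folded into one rule.
COMBINED_RULES = [
    r"(?i)\bCD\d{1,2}\b",
    r"(?i)\bD\d{1,2}(?:CD\d{1,2}|[a-z])?\b",
]


def strip_disc_tags(name, rules):
    """Apply each rule in order, replacing matches with an empty string."""
    for rule in rules:
        name = re.sub(rule, "", name)
    # Illustration only: collapse the doubled space left behind by a removal.
    return re.sub(r"\s{2,}", " ", name).strip()


if __name__ == "__main__":
    for sample in ("That Movie D1CD1 (2010, AnotherStudio)",
                   "Another Movie D1A (2011, Some Studio)",
                   "This Movie CD1 (2011, SomeStudio)"):
        print(strip_disc_tags(sample, ORIGINAL_RULES))
```

The default CD rule does not touch "D1CD1" on its own, because there is no word boundary between "1" and "C"; that is why a separate rule is needed for that tag.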

I've noticed another problem, though: the first line in the default set of "find and replace" expressions, "(?i).?((?<!\b).){0,5}Rip", is clearly intended to remove strings like "DVDRip" and "BDRip". But, I'm finding it ALSO removes any unrelated string that happens to end with the substring "rip". In other words, it removes words like "Trip", "Strip", "Grip", and so on. (And words like "Gripped" get mangled to "ped".) After some wrangling with it, I managed to come up with a RegEx that removes things like "DVDRip" while leaving words like "ripper", "tripping", etc. alone:

(?i)\b(CD|DVD|BD|Blu-Ray|BluRay|HD-DVD|HDDVD|VHS|Vinyl|Cassette|Tape|.{0})Rip(\s|\b)


Yeah, it's not pretty, but I figure it covers all the "rip" bases I'll ever encounter. :D (I had to add the fiddly bit at the end to strip out spaces that were creeping into the modified string.)
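The difference between the two "Rip" expressions can be seen in a short Python sketch. It assumes the rule replaces its match with an empty string and that Python's re module treats these patterns the same way PVD does; remove_matches is a hypothetical helper name.

```python
import re

# Default rule quoted in the post.
DEFAULT_RIP = r"(?i).?((?<!\b).){0,5}Rip"

# Revised rule from the post.
REVISED_RIP = (r"(?i)\b(CD|DVD|BD|Blu-Ray|BluRay|HD-DVD|HDDVD|VHS|Vinyl|"
               r"Cassette|Tape|.{0})Rip(\s|\b)")


def remove_matches(pattern, name):
    """Delete every match of pattern from name (hypothetical helper)."""
    return re.sub(pattern, "", name)


if __name__ == "__main__":
    for sample in ("Road Trip DVDRip", "Gripped"):
        print(repr(remove_matches(DEFAULT_RIP, sample)),
              repr(remove_matches(REVISED_RIP, sample)))
```

The revised rule only fires when "Rip" starts at a word boundary or directly follows one of the listed media names, which is why "Trip" and "Gripped" survive.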

Aimhere

rick.ca
Re: File scanner regular expressions
« Reply #4 on: August 02, 2011, 02:18:59 am »
You might be getting carried away with trying to remove things. The purpose is to recognize Title and, if possible, Year. If you can do that, you're done, regardless of what other crap is in the filename. Hopefully, the Title is always at the beginning. If you can recognize the disk numbering stuff, and it always comes after the Title, then you've got the Title. If the Year is always four numbers between a "(" and a "," or a ")", then you've got that too. There's nothing to be removed.
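The approach described above (recognize the disc tag and the year, and take whatever precedes them as the title) might look like the following Python sketch. The expression is not from the thread and has not been tried in PVD; it assumes names of the form "Title Discnumber (Year, Studio).ext" with disc tags such as CD1, D1, D1A or D1CD1, and parse_name is a hypothetical helper.

```python
import re

# Hypothetical single expression: optional path, lazy title, optional disc
# tag, then a four-digit year followed by "," or ")".
NAME_PATTERN = re.compile(
    r"(?i)^(?:.*\\)?"                                   # optional folder path
    r"(?P<title>.+?)"                                   # title, as short as possible
    r"(?: (?:CD\d{1,2}|D\d{1,2}(?:CD\d{1,2}|[a-z])?))?"  # optional disc tag
    r" \((?P<year>(?:19|20)\d{2})[,)]"                  # "(Year," or "(Year)"
)


def parse_name(path):
    """Return (title, year) as strings, or None when nothing matches."""
    match = NAME_PATTERN.match(path)
    if match is None:
        return None
    return match.group("title"), match.group("year")


if __name__ == "__main__":
    print(parse_name("This Movie D1A (2011, Some Studio).avi"))
    print(parse_name(r"D:\Video\Movies\That Movie D2CD1 (2010, AnotherStudio).avi"))
```

Nothing is removed here: the disc tag is simply matched outside the title group, so it never reaches the title.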

If, once you have a handle on regex basics, you still find it too complicated, you should probably be considering renaming the files so they can be recognized. Once you get existing files cleaned-up, you can adopt a workflow that renames files in consistent/recognizable patterns as soon as they hit your HDD. Think about it. It has to be done at some point. The earlier you rename files to something sensible, the less trouble they'll be.

I'm not advocating work for the sake of "neatness." This includes doing things like configuring torrent and ripping software so they do the renaming automatically. For example, my TV episodes are found, downloaded and renamed automatically. Properly renamed, they're automatically recognized and processed by PVD, and then the meta data fed automatically to my media manager. Aside from "supervising" the process, my job is to sit on the couch and enjoy my media using a remote. 8)

AimHere
Re: File scanner regular expressions
« Reply #5 on: August 06, 2011, 05:53:46 pm »
See, the default set of RegExes in PVD's file scanner configuration already removes things like "CD1" or "CD01" when reading the filenames, using the pattern "(?i)\bCD\d{1,2}\b" in the "find and replace" section. I'm just trying to get it to do the same for "D1CD1" and the like. Without the additions I've made to the RegExes, the file scanner would give me titles like "Some Movie D1CD". (This is why the file scanner configuration HAS a "find and replace" section... it strips out a lot of garbage that would otherwise get included in the extracted titles, BEFORE even attempting the extraction.)

A lot of the movies in my collection were released as multi-disc collections, with each disc in multiple 700MB AVI chunks. The filenames HAVE to include strings like "D1CD1" so each file has a unique filename. I could use something like VirtualDub to combine the chunks into a single AVI, but this is an awful lot of work given the number of files involved (with me acquiring more all the time), not to mention presenting the danger of exceeding file-size limits that may exist when I want to back up these files on DVD-R. Also, the second disc for each title often contains ONLY "extras", not any part of the main movie, so I wouldn't want to combine that with the first disc (usually I only want to watch the main movie, but I still want to keep the extras around in case I ever feel the urge to watch them). In any event, I'm still going to have multiple files for any particular movie.

I'm already renaming the files as I acquire them... believe me, the original filenames as posted on Usenet are practically worthless.  (They don't include readable movie titles or anything useful.)  :P

Edit: I should mention, all of my movies have filenames of the general form "Title Discnumber (Year, Studio).avi", where "Discnumber" is the "CD1" or "D1A" or whatever. I suppose I could move the "Discnumber" part to the very end, maybe that would keep it from being included in the title without all of the fiddly RegEx stuff. But then I'd want to rename all of my existing AVI files (several thousand, burned onto DVD-R) to the new format for the sake of consistency, and besides, the RegExes I came up with seem to work fine...

Aimhere
« Last Edit: August 06, 2011, 06:09:55 pm by AimHere »

rick.ca
Re: File scanner regular expressions
« Reply #6 on: August 08, 2011, 10:52:09 pm »
I'm not questioning the need for some kind of disc number indicator in the filename. I'm just pointing out that removing them as a means to isolate the title is pointless if their pattern is not consistent enough. Besides, you could remove all of them and still be left with junk that is not part of the title. On the other hand, if there's any pattern that marks the end of the title, that's all you need. If that doesn't exist, you're probably better off changing the file names.

Quote
I should mention, all of my movies have filenames of the general form "Title Discnumber (Year, Studio).avi", where "Discnumber" is the "CD1" or "D1A" or whatever. I suppose I could move the "Discnumber" part to the very end, maybe that would keep it from being included in the title without all of the fiddly RegEx stuff.

Exactly. It's unfortunate the form is not "Title (Year) Discnumber Studio.avi." Then [Title] and [Year] would be recognized without fail and would, in turn, dramatically improve the accuracy of online searches.

If that pattern is consistent, you should be getting the Year (i.e., from the " (Year, " pattern). That not only gives you [Year], but establishes the end of whatever the disc number indicator is. Your problem is then reduced to identifying a pattern indicating the beginning of that (CD#|C#|D# etc.). But...

Quote
...the RegExes I came up with seem to work fine.

It seems you're done anyway.

AimHere
Re: File scanner regular expressions
« Reply #7 on: August 26, 2011, 04:57:26 pm »
Sorry if it seems like I'm being obstinate about all this. I do appreciate your attempts to point out alternative strategies. :)

Maybe it would be easier to come up with a more elegant RegEx to import data into PVD if I used a different file-naming scheme. But, whatever RegEx[es] I come up with still have to be able to process the existing file name format anyway, in case I ever have to re-scan my collection or move it onto different media in the future. Given my limited knowledge of RegExes, I did the best I could. It just seemed easier to strip out the discnumber strings than to come up with a RegEx that could ignore them.

Be that as it may... in the future, what if I enclose the "Discnumber" in something like brackets [], so that I had file names like "Title (Year, Studio)[Discnumber].avi"? What kind of RegEx would be needed to parse that?

rick.ca
Re: File scanner regular expressions
« Reply #8 on: August 26, 2011, 06:51:46 pm »
Quote
Be that as it may... in the future, what if I enclose the "Discnumber" in something like brackets [], so that I had file names like "Title (Year, Studio)[Discnumber].avi"? What kind of RegEx would be needed to parse that?

Studio and Discnumber are not among the variables that can be saved. So, for the purpose of the scanner, the relevant pattern is "Title (Year, blah, blah, blah.avi"—from which [Title] and [Year] will be captured with absolute certainty.
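For the proposed "Title (Year, Studio)[Discnumber].avi" scheme, an expression along these lines would be enough. It is a sketch adapted from the title-and-year expression earlier in the thread, with the closing ")" swapped for a comma; it is not a tested PVD setting, and parse_bracketed is a hypothetical helper name. Like the earlier expressions, it expects a full path containing at least one backslash.

```python
import re

# Adapted expression: title, then " (Year," and anything after it is ignored.
BRACKET_PATTERN = re.compile(r"(?i)^.*\\(?P<title>.*) \((?P<year>(19|2\d)\d{2}),.*")


def parse_bracketed(path):
    """Return (title, year) as strings, or None when nothing matches."""
    match = BRACKET_PATTERN.match(path)
    if match is None:
        return None
    return match.group("title"), match.group("year")


if __name__ == "__main__":
    print(parse_bracketed(r"D:\Video\Movies\That Movie (2010, AnotherStudio)[D2CD1].avi"))
```

The studio and the bracketed disc tag fall into the trailing ".*", so neither needs a rule of its own.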

 
