New IMDb People v3 (Selenium) script comments

afrocuban:
Many thanks! It's good to clean up the code and fix additional things in order to revive the scripts!



I have some questions about Alternative Names, though:


Is there any specific reason you changed ItemValue1 to AltNames1?


I also don't understand what the var AltNames is used for, and where.

One more thing I'm not sure about: how does this work in the line you posted?

--- Quote --- If AltNames1 <> '' then AddFieldValueXML('AltNames', ItemValue1);
--- End quote ---

Ivek23:

--- Quote from: afrocuban on January 10, 2025, 02:14:32 am ---Many thanks! It's good to clean up the code and fix additional things in order to revive the scripts!



I have some questions about Alternative Names, though:


Is there any specific reason you changed ItemValue1 to AltNames1?


I also don't understand what the var AltNames is used for, and where.

One more thing I'm not sure about: how does this work in the line you posted?

--- Quote --- If AltNames1 <> '' then AddFieldValueXML('AltNames', ItemValue1);
--- End quote ---

--- End quote ---

I apologize, I forgot to make the change. This is the correct version:


--- Quote --- If AltNames1 <> '' then AddFieldValueXML('AltNames', AltNames1);
--- End quote ---

The variable is named AltNames or AltNames1 because I have more code for Alternative Names in different parts of the script, in Function ParsePage_IMDBPersonBASE, to better transfer the information to the altnames field and the comment field.

You can also use ItemValue1 in this code if you prefer.

afrocuban:
Thanks!


If someone (like me) wants the bio field to look only like this (without the Biography link, because it is redundant with the comment field):



--- Quote ---

 In 2015, he played a journalist in <link url="http://www.imdb.com/title/tt1895587/">Под лупом (2015)</link>, which, like Birdman, won the Academy Award for Best Picture. In 2016, he starred as <link url="http://www.imdb.com/name/nm4669566/">Ray Kroc</link>, the developer of McDonald's, in the drama <link url="http://www.imdb.com/title/tt4276820/">Osnivač (2016)</link>.
 
 He is a visiting scholar at Carnegie Mellon University.
--------------------------------------------------------------------------
- IMDb Mini Biography By: firehouse44 and Pedro Borges
--------------------------------------------------------------------------
BirthName:  Michael John Douglas
--------------------------------------------------------------------------
--- End quote ---


then the script options need to be set like this:



--- Quote ---  BIO_INFO_IN_BIO                        = False ;   //Use the PVD field ~bio~ for not storing the person Biography Info Url link for Biography Pages.
   //BIO_INFO_IN_BIO                     = True ;   //Use the PVD field ~bio~ for storing the person Biography Info Url link for Biography Pages.
  BIO_URL_IN_BIO                        = True ;   //Use the PVD field ~bio~ for storing the person Url's for Biography Info (Mini Bio) for Biography Pages.
   //BIO_URL_IN_BIO                          = False ;   //Use the PVD field ~bio~ for not storing the person Url's for Biography Info (Mini Bio) for Biography Pages.
   //IMDB_MINI_IN_BIO                               = True ;   //Use the PVD field ~bio~ for storing the person IMDb Mini Biography letters for Biography Info (Mini Bio) for Biography Pages.
  IMDB_MINI_IN_BIO                            = False ;   //Use the PVD field ~bio~ for not storing the person IMDb Mini Biography letters for Biography Info (Mini Bio) for Biography Pages.


--- End quote ---


and the function has to be:



--- Quote ---
Function ParsePage_IMDBPeopleBIO(HTML:String):Cardinal; //BlockOpen
//Returns:
//     Result:=prFinished; Script has finished gathering data
//     Result:=prError; If any big problem with exit;
//Retrieve: ~bio~ Biography from "Mini Bio" IMDB section
Var
curPos,endPos,debug_pos1:Integer;
ItemValue:String;
PersonID,ItemValue0,ItemValue10,ItemValue1,ItemValue11:String;
ItemList,ItemList00,ItemList0,ItemList1,ItemList11,ItemList12:String;
FinalValue: String;
ItemList2,ItemList10,ItemList20,ItemValue3:String;
BirhNameValue: String;
Begin
LogMessage('ParsePage_IMDBPeopleBIO: Starting processing.');
LogMessage('HTML length: ' + IntToStr(Length(HTML)));
LogMessage('Function ParsePage_IMDBPeopleBIO BEGIN=====================||');
Result:=prFinished;  //It will change to prError if any big problem with exit;

LogMessage('Result set to prFinished');  //Log the initial result setting

//(*
//Get "Biography" info
curPos:=Pos('<h1 class="ipc-title__text">Biography</h1>',HTML);      //Strings start which opens the block content data. WEB_SPECIFIC
if (curPos=0) then Exit;   
//*)
//(*
ItemList2:='';
ItemList11:='';
//*)
(*
ItemList2:='';
ItemList11:='';
//*)
//(*   
//Get PersonID
//LogMessage('Attempting to find PersonID');
PersonID := TextBetWeenFirst(HTML, '<link rel="canonical" href="https://', '/">');  //WEB_SPECIFIC   
if (Length(PersonID) > 2) then begin
ItemList2 := '--------------------------------------------------------------------------'+#13+'<link url="http://' + PersonID + '/#overview">Biography Info</link>';
//ItemList2 := '--------------------------------------------------------------------------'+#13+'<link url="http://www.imdb.com/name/' + PersonID + '/bio/#overview">Biography Info</link>';
LogMessage('Get result PersonID: ' + PersonID + '||');
end else begin
LogMessage('Error: PersonID not found');
Result := prError;      //Set the result to error if PersonID is not found
end;   
//*)   
//(*
//Get "Biography" info
LogMessage('Attempting to find Biography section');
curPos := Pos('<div data-testid="sub-section-mini_bio"', HTML);         //Updated to reflect new layout
if (curPos = 0) then Begin
LogMessage('Error: Biography section not found');
Result := prError;      //Set the result to error if the section is not found
Exit;
End;
endPos := Pos('</ul>', Copy(HTML, curPos, Length(HTML) - curPos + 1)) + curPos - 1;
if endPos = curPos - 1 then Begin
LogMessage('Error: End of Biography section not found');
Result := prError;      //Set the result to error if the section is not found
Exit;
End;
ItemList0 := Copy(HTML, curPos, endPos - curPos + Length('</ul>'));    //Include </ul> in the end position
LogMessage('Biography section found');

//Extract "Mini bio" Biography text
LogMessage('Extracting Mini Bio text:');
curPos := Pos('<div class="ipc-html-content-inner-div" role="presentation">', ItemList0);      //Updated to reflect new layout
LogMessage('curPos for Mini Bio set to: ' + IntToStr(curPos));
if curPos > 0 then Begin
endPos := Pos('</ul>', Copy(ItemList0, curPos, Length(ItemList0) - curPos + 1)) + curPos - 1;      //Update to match exact structure
LogMessage('endPos for Mini Bio set to: ' + IntToStr(endPos));
if endPos > curPos Then Begin
ItemValue := Trim(Copy(ItemList0, curPos, endPos - curPos + Length('</ul>')));
               
//Normalize whitespace but keep empty lines
ItemValue := StringReplace(ItemValue, #13#10, #10, True, True, False);      //Normalize line endings
ItemValue := StringReplace(ItemValue, #13, #10, True, True, False);
ItemValue := StringReplace(ItemValue, #10#10, #13#10#13#10, True, True, False);      //Preserve empty lines
ItemValue := StringReplace(ItemValue, #10, ' ', True, True, False);
ItemValue := StringReplace(ItemValue, #13#10#13#10, #10#10, True, True, False);      //Revert empty line placeholders
While Pos('  ', ItemValue) > 0 Do
ItemValue := StringReplace(ItemValue, '  ', ' ', True, True, False);

//Transform links
ItemValue := StringReplace(ItemValue, '<a class="ipc-md-link ipc-md-link--entity" href="', '<link url="http://www.imdb.com', True, True, False);
ItemValue := StringReplace(ItemValue, '/?ref_=nmbio_mbio">', '/">', True, True, False);
ItemValue := StringReplace(ItemValue, '</a>', '</link>', True, True, False);

//Remove unwanted tags
ItemValue := StringReplace(ItemValue, '<div class="ipc-html-content-inner-div" role="presentation">', '', True, True, False);
ItemValue := StringReplace(ItemValue, '<div class="ipc-html-content ipc-html-content--base ipc-metadata-list-item-html-item" role="presentation">', '', True, True, False);
ItemValue := StringReplace(ItemValue, '<>', '', True, True, False);
ItemValue := StringReplace(ItemValue, '</ul>', '', True, True, False);

If Not(BIO_URL_IN_BIO) then ItemValue:=RemoveTagsEx00(ItemValue);   
If Not(BIO_URL_IN_BIO) then ItemValue:=StringReplace(ItemValue,'</link>','',True,True,False);

If ItemValue <> '' then ItemList := ItemValue;

//LogMessage('      Get result bio (from Mini bio)002:'+ItemList+'||');           
If ItemList <> '' then ItemList11:=ItemList11+ItemList;

End Else LogMessage('Error: End position not found for Mini Bio');
End Else LogMessage('Error: Start position not found for Mini Bio');
//(*
// Extract the final "IMDb Mini Biography By: ..." value and clean tags
If Pos('- IMDb Mini Biography By:', ItemList0) > 0 Then Begin
curPos := Pos('- IMDb Mini Biography By:', ItemList0);
endPos := Pos('<>', Copy(ItemList0, curPos, Length(ItemList0) - curPos + 1)) + curPos - 1;
FinalValue := Copy(ItemList0, curPos, endPos - curPos + Length('<>'));

// Clean surrounding tags without using RemoveTags
FinalValue := StringReplace(FinalValue, '<div class="ipc-html-content-inner-div" role="presentation">', '', True, True, False);
FinalValue := StringReplace(FinalValue, '<div class="ipc-html-content ipc-html-content--base ipc-metadata-list-item-html-item" role="presentation">', '', True, True, False);
FinalValue := StringReplace(FinalValue, '<>', '', True, True, False);
FinalValue := StringReplace(FinalValue, '</ul>', '', True, True, False);

// Remove existing occurrence of FinalValue from ItemList
ItemList := StringReplace(ItemList, FinalValue, '', True, True, False);

// Log the cleaned ItemList
//LogMessage('   *   Cleaned ItemList without FinalValue:' + ItemList + '||');

// Append the final value to ItemList only if it's not already present
If Pos(FinalValue, ItemList) = 0 Then Begin
If Length(ItemList) > 0 Then
ItemList := ItemList + #13#10 + '--------------------------------------------------------------------------'+ #13#10 + FinalValue
Else
ItemList := FinalValue;
End;

// Log the updated ItemList
//LogMessage('   *   Get result bio (from Mini bio)002:' + ItemList + '||');

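If Not(IMDB_MINI_IN_BIO) then Begin      //Strip the "- IMDb Mini..." credit line only when the option is off
curPos:=Pos('- IMDb Mini',ItemList);
if curPos >0 then ItemList := Copy(ItemList,0,curPos-1);
End;      //Without Begin..End the Copy above also ran with a stale curPos when the option is on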
//LogMessage('      Get result bio (from Mini bio) a:'+ItemList+'||');
//LogMessage('      Get result bio (from Mini bio):'+ItemList+'||');
If ItemList <> '' then ItemList11:=ItemList11+ItemList;
End;
//*)

//AddFieldValueXML('bio', ItemList);
        //LogMessage('Added ItemList to XML: ' + ItemList);

If (ItemList11 <> '') AND (ItemList2 <> '') Then
//ItemList12:=ItemList11;
ItemList12:=ItemList11+#13+ItemList2;

//Get "Birth name" Biography text
ItemList00:='';
//ItemList10:=TextBetWeenFirst(HTML,'" data-testid="title"><hgroup><h1 class="ipc-title__text"','<h3 class="ipc-title__text"><span>Contribute to this page</span></h3>');
curPos := PosFrom('<h3 class="ipc-title__text"><span id="overview">Overview', HTML,curPos);
EndPos:=PosFrom('<></section>',HTML,curPos);
ItemList00:=Copy(HTML,curPos,endPos-curPos);
//LogMessage('  ** Parse Biography '+#13+ItemList00+' **');
//(*
If (Length(ItemList00)>0) Then Begin
ItemValue10:=TextBetWeenFirst(ItemList00,'<li role="presentation" class="ipc-metadata-list__item" id="name" data-testid="list-item"><span class="ipc-metadata-list-item__label" aria-disabled="false">Birth name</span>','<><><></li>');
//if BIRTH_NAME_IN_TRANSNAME then
//if ItemValue10 <> '' then
//AddFieldValueXML('transname',ItemValue10);
If ItemValue10 <> '' then
//LogMessage('      Get result from Birth Name02:'+ItemValue10+'||');
If ItemValue10 <> '' then ItemValue10:='BirthName:  '+ItemValue10;
If ItemValue10 <> '' then ItemList12:=ItemList12+#13+'--------------------------------------------------------------------------'+#13+ItemValue10;
If ItemValue10 <> '' then BirhNameValue:='--------------------------------------------------------------------------' + #13#10 + ItemValue10 + #13#10 + '--------------------------------------------------------------------------'
End;   
//*)         
If BIO_INFO_IN_BIO then
AddFieldValueXML('bio', ItemList12)
Else If Not(BIO_INFO_IN_BIO) and BIO_URL_IN_BIO and (IMDB_MINI_IN_BIO) then
AddFieldValueXML('bio', ItemList + #13#10 + BirhNameValue)
Else
AddFieldValueXML('bio', ItemList11);


Result := prFinished;

LogMessage('Function ParsePage_IMDBPeopleBIO END=====================||');
End; //BlockClose
//*)

--- End quote ---
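
As a reading aid, the last If/Else chain of the function above decides which variable ends up in the bio field. The sketch below only mirrors that chain as a standalone Free Pascal program; BioSource is a hypothetical helper that is not part of the script, and it returns the variable name as text instead of calling AddFieldValueXML.

```pascal
program BioFlagsSketch;
{$mode objfpc}{$H+}

// Hypothetical helper (not in the script): names the variable that the final
// If/Else chain of ParsePage_IMDBPeopleBIO passes to AddFieldValueXML('bio', ...)
// for a given combination of the three options.
function BioSource(BioInfoInBio, BioUrlInBio, ImdbMiniInBio: Boolean): String;
begin
  if BioInfoInBio then
    Result := 'ItemList12'
  else if BioUrlInBio and ImdbMiniInBio then
    Result := 'ItemList + BirhNameValue'
  else
    Result := 'ItemList11';
end;

begin
  // Options as posted above: BIO_INFO_IN_BIO=False, BIO_URL_IN_BIO=True, IMDB_MINI_IN_BIO=False
  WriteLn(BioSource(False, True, False));
  // Two other combinations for comparison
  WriteLn(BioSource(True, True, False));
  WriteLn(BioSource(False, True, True));
end.
```

With the options as posted, the sketch prints ItemList11, so the last Else branch is the one that writes the bio field; the other two calls print ItemList12 and ItemList + BirhNameValue.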


In v4 I'll probably leave only the links, people ID and PID in the comment field and move the rest to bio where, to me, it naturally belongs.

Ivek23:
No, that's not true. Not everything belongs in the bio field, only what is there now, because the bio pages simply contain no other information. The rest can't be imported anywhere else, only into the comment field.

For example, the proper name (such as Andrew Keegan (I)), jobtitle, Filmography - Career and alternative names are not found on the bio pages, so this information is transferred to the comment field.

afrocuban:
Oh, I see! Thanks! I was thinking of an update like AddFieldValueXML('bio', ItemList + #13#10 + CommentValuesForBio), for example, pulling those values out of Base, or merging the functions... It was just an idea, not thought through in detail at all...

Or this concept: add an additional argument to the Bio function, and sort that out in ParsePage when calling the Bio function...



--- Quote ---Function ParsePage(...);
Var
    HTML, CommentValuesForBio: String;
    BioResult: Cardinal;      // The Bio function returns Cardinal, not String
Begin
    // Parse Base Page
    HTML := 'Base page HTML content';
    CommentValuesForBio := ParsePage_IMDBPersonBASE(HTML);

    // Parse Bio Page and combine with the 10th result from base page
    HTML := 'Bio page HTML content';
    BioResult := ParsePage_IMDBPeopleBIO(HTML, CommentValuesForBio);

    // Log the result code; the combined text is written to bio inside the Bio function
    LogMessage('Final Bio Result: ' + IntToStr(BioResult));
End;

--- End quote ---


and then in the Bio function:


--- Quote ---Function ParsePage_IMDBPeopleBIO(HTML: String; CommentValuesForBio: String): Cardinal;
Var
    .....
    BioList, BioItemListFinal: String;
Begin
    // Extract and process bio information
    BioList:= 'Bio result';

    // Combine with the 10th result from the base page
    BioItemListFinal := BioList+ #13#10 + CommentValuesForBio;

    // Store the combined result in the bio field
    AddFieldValueXML('bio', BioItemListFinal);
    Result := prFinished;
End;


--- End quote ---
I don't know...
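
The extra-argument idea above can be tried outside PVD as a small standalone Free Pascal program. This is only a sketch under assumptions: SketchParseBase, SketchParseBio and StoreBio are hypothetical stand-ins (StoreBio replaces AddFieldValueXML), and the page contents are placeholder strings. Whether a value really survives between the base-page and bio-page passes inside PVD is not checked here.

```pascal
program BioArgumentSketch;
{$mode objfpc}{$H+}

var
  BioField: String;  // stand-in for the PVD bio field

// Stand-in for AddFieldValueXML('bio', ...): just remembers the value.
procedure StoreBio(const Value: String);
begin
  BioField := Value;
end;

// Hypothetical base-page parser: returns the text that should later be
// appended to bio instead of going to the comment field.
function SketchParseBase(const HTML: String): String;
begin
  Result := 'values collected from the base page';
end;

// Hypothetical bio parser with the extra argument from the post.
// Returns True when something was stored.
function SketchParseBio(const HTML, CommentValuesForBio: String): Boolean;
var
  BioList: String;
begin
  BioList := 'text collected from the bio page';
  // Append the base-page values only when there are any
  if CommentValuesForBio <> '' then
    BioList := BioList + #13#10 + CommentValuesForBio;
  StoreBio(BioList);
  Result := BioList <> '';
end;

var
  Extra: String;
begin
  Extra := SketchParseBase('base page HTML');
  if SketchParseBio('bio page HTML', Extra) then
    WriteLn(BioField);
end.
```

The bio function keeps a status-style return value and writes the combined text itself, which matches the Cardinal return type of the original function more closely than returning the text would.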
