Integrating Selenium to PVD

Ivek23:

--- Quote from: afrocuban on December 15, 2024, 09:26:45 pm ---
--- Quote from: Ivek23 on December 15, 2024, 12:29:31 pm ---The Python attachment contains the fetch_imdbs_titles.py and imdb_aka_fetcher.py scripts, which show how the geckodriver path for Firefox is specified when writing Python scripts.

Could you write a Python script (selenium_script.py) and the Pascal Script part (Invoke Selenium Script) for the IMDB_[EN][HTTPS]_TEST_2c 2c script for me, and attach both here?

Where do I need to put geckodriver for Firefox?

--- End quote ---

It isn't clear who you are asking to do this, but in case you are asking me: I am still at the very beginning of even comprehending the concept, let alone coding it. Interacting with AI can be, and is, extremely frustrating, and whatever I tried I had to test live; otherwise I had to start over each time. So asking me to provide it for you isn't a productive way to go, unless you too want to go crazy like I did while upgrading the FA script, hahahah. To be able to parse the FA trailers page (that is, to download and parse the dynamic HTML content on FA), I think I'll need a month at least, but I'm not giving up.

Meanwhile, I have started to fix and upgrade the IMDb people script. I have already fixed the "bio" field, but I want to tweak and upgrade it further before posting it.
--- End quote ---

OK, I was just asking whether that was possible. I would still like the Python script (selenium_script.py) to be published; it would be useful for other users as well, and maybe someone else could then help with this.

afrocuban:
I installed selenium, beautifulsoup4, node.js and puppeteer. None of them downloaded the dynamic content locally.

It looks like we have no choice for now...

I tried it for People credits only...

Ivek23:

--- Quote from: afrocuban on December 17, 2024, 04:38:10 am ---I installed selenium, beautifulsoup4, node.js and puppeteer. None of them downloaded the dynamic content locally.

It looks like we have no choice for now...

I tried it for People credits only...
--- End quote ---

Here is the URL for Aaron Spelling's credits:

https://www.imdb.com/name/nm0005455/?showAllCredits=true

I have already found a way to download all the AKA titles, and it works in test form, but some details are still missing and need to be tested, as does selenium_script.py.

In the IMDB_[EN][HTTPS]_TEST_2c 2c script I had to change parts of the code so that Function ParsePage_IMDBMovieAKA is now the only one that is used.

afrocuban:

--- Quote from: Ivek23 on December 17, 2024, 07:08:14 pm ---
--- Quote from: afrocuban on December 17, 2024, 04:38:10 am ---I installed selenium, beautifulsoup4, node.js and puppeteer. None of them downloaded the dynamic content locally.

It looks like we have no choice for now...

I tried it for People credits only...
--- End quote ---

Here is the URL for Aaron Spelling's credits:

https://www.imdb.com/name/nm0005455/?showAllCredits=true

I have already found a way to download all the AKA titles, and it works in test form, but some details are still missing and need to be tested, as does selenium_script.py.

In the IMDB_[EN][HTTPS]_TEST_2c 2c script I had to change parts of the code so that Function ParsePage_IMDBMovieAKA is now the only one that is used.

--- End quote ---

Great to hear. I eventually succeeded in downloading the full content of https://www.imdb.com/name/nm0005455/?showAllCredits=true to a local file. The trick was to download and save it as MHTML. Now I'm looking at how to parse that page, and later at how to invoke the .py from within the .psf, that is, how to pass the URL to the .py...
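
One possible way to get at the HTML inside a saved .mhtml file: MHTML is a MIME multipart container, so Python's standard email package can usually unpack it. This is only a minimal sketch, assuming the page markup sits in the first text/html part. The function name and the tiny sample below are made up for illustration; a real saved IMDb page will be far larger and may be laid out differently.

```python
import email
from email import policy


def extract_html_from_mhtml(mhtml_bytes):
    """Return the first text/html part of an MHTML document as a string."""
    message = email.message_from_bytes(mhtml_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (for example
            # quoted-printable) and decodes using the part's charset.
            return part.get_content()
    raise ValueError("no text/html part found in the MHTML data")


# Hypothetical, hand-written sample in the general shape of a saved page.
SAMPLE_MHTML = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="----Boundary"\r\n'
    b"\r\n"
    b"------Boundary\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<html><body><a href=3D"/title/tt0000001/">Caf=C3=A9</a></body></html>\r\n'
    b"------Boundary--\r\n"
)

sample_html = extract_html_from_mhtml(SAMPLE_MHTML)
```

For a file on disk, the bytes would come from open(path, "rb").read(); the extracted string could then be saved as a plain .htm file for the PVD script to parse.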

Ivek23:
Here is a script to help

python

--- Quote ---import sys
import os
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
import time

# Check that the IMDb URL was passed as a parameter
if len(sys.argv) < 2:
    print("IMDb URL was not passed as a parameter.")
    sys.exit(1)

imdb_url = sys.argv[1]  # IMDb URL from the command line

# Path to geckodriver.exe
gecko_path = "C:/Projects/geckodriver.exe"  # Adjust to the location of your driver

# Get the current application path
app_path = os.path.dirname(os.path.abspath(__file__))  # Path to the current Python script

# Check whether your "PVD_0.9.9.21_MOD-Simple AllMovies" folder is on drive D: or elsewhere
# (raw string, so that the backslashes are not read as escape sequences)
pvd_path = r"D:\MyPVD\PVD_0.9.9.21_MOD-Simple AllMovies"  # Set this path once so that it does not change

# If you want a universal path, join onto app_path instead
output_path = os.path.join(pvd_path, "Scripts", "Tmp", "downpage-UTF8_NO_BOM.htm")

# Create the folder if it does not exist
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Create the browser object
service = Service(gecko_path)
driver = webdriver.Firefox(service=service)

try:
    # Open the IMDb page
    driver.get(imdb_url)
    print(f"Page {imdb_url} has loaded.")

    # Wait for the page to load
    time.sleep(5)

    # Get the full HTML source of the page
    html_source = driver.page_source

    # Save the HTML to a file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(html_source)
    print(f"HTML saved to file: {output_path}")

finally:
    # Close the browser
    driver.quit()
--- End quote ---

I apologize if some parts of the text are in Slovenian; I used ChatGPT - GPT Chat Free Online AI and asked it questions in my own language.

You need to change a few items in the script, including pvd_path, which can either point to your own PVD folder or be replaced by a universal path to the program's PVD folder built from app_path.
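
A possible way to avoid the hard-coded folder, offered only as a sketch: the Pascal side in this thread starts the script with the URL and an output file as two separate arguments, so the Python side could read the second argument when it is present and otherwise fall back to a path next to the script itself. The helper name resolve_paths and the fallback location are assumptions for illustration, not part of the posted script.

```python
import os

# Assumed fallback, relative to the folder that holds the Python script.
DEFAULT_RELATIVE_OUTPUT = os.path.join("Tmp", "downpage-UTF8_NO_BOM.htm")


def resolve_paths(argv, script_file):
    """Return (imdb_url, output_path) from command-line style arguments.

    argv[1] is the URL. argv[2], if given, is the output file; otherwise
    the output goes into Tmp/ next to the script.
    """
    if len(argv) < 2:
        raise SystemExit("IMDb URL was not passed as a parameter.")
    imdb_url = argv[1]
    if len(argv) >= 3:
        output_path = argv[2]
    else:
        script_dir = os.path.dirname(os.path.abspath(script_file))
        output_path = os.path.join(script_dir, DEFAULT_RELATIVE_OUTPUT)
    return imdb_url, output_path
```

In the script itself this would be called as resolve_paths(sys.argv, __file__), and the returned output_path would take the place of the pvd_path-based one.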

You also need to change certain paths in the PVD script, such as in these parts of the code:
--- Quote ---Function GetDownloadURL:AnsiString; //BlockOpen
Var
curPos:Integer;
ScriptPath,MovieID:String;
Begin
LogMessage('Test initialization of the log system.');
LogMessage('Test message: log works.');
LogMessage('Function GetDownloadURL BEGIN======================|');
LogMessage('Global Var-Mode|'+IntToStr(Mode)+'|');
LogMessage('Global Var-DownloadURL|'+DownloadURL+'|');
//Check for the needed external files.
ScriptPath:=GetAppPath+'Scripts\';
If Not(FileExists(ScriptPath+'PVdBDownPage.exe')) Then Begin
ShowMessage ('This script needs this external file to work.'+Chr(13)+'• PVdBDownPage.exe'+Chr(13)+'Read script text for further information',SCRIPT_NAME);
Mode:=smFinished;
Result:=''; //If error returns empty string
exit;
End;
If (Mode=smSearch) Then Begin
//Get stored URL if exist.
StoredURL:=GetFieldValueXML('url');
LogMessage('Stored URL is:'+StoredURL+'||');
//Standardize the URL
StoredURL:=LowerCase(StoredURL);
StoredURL:=StringReplace(StoredURL,'https','http',True,True,False);
StoredURL:=StringReplace(StoredURL,'http://imdb.com/', 'http://www.imdb.com/', True,True,False);
StoredURL:=StringReplace(StoredURL,'http://httpbin.org/response-headers?key=','',True,False,False);
StoredURL:=StringReplace(StoredURL,' ',BASE_URL_SUF,True,True,False)+BASE_URL_SUF; //Ensure that the URL always ends with BASE_URL_SUF (even in the last position)
LogMessage('* Stored URL is:'+StoredURL+'||');
//Get IMDB ID if exist.
curPos:=Pos(BASE_URL_PRE,StoredURL);
If 0<curPos Then Begin //Get IMDB_ID for search
LogMessage(' IMDB URL.');
MovieID:=TextBetWeen(StoredURL,BASE_URL_PRE,BASE_URL_SUF,false,curPos); //WEB_SPECIFIC
DownloadURL:=BASE_URL_PRE_TRUE+ MovieID +BASE_URL_SUF; //WEB_SPECIFIC
LogMessage(' Parse stored information DownloadURL:'+DownloadURL+' ||');
Mode:=smNormal; //->Go to function ParsePage for parse the film information
Result:=GetAppPath+DUMMY_HTML_FILE; //Any small existing file, used to cheat PVdB's automatic download.
LogMessage('Function GetDownloadURL END====================== with Mode='+IntToStr(Mode)+' Result='+Result+'|');
exit;
End Else Begin //The movie URL does not exist, so search mode is needed. Download the search page.
//ShowMessage('No IMDB URL.',SCRIPT_NAME);
LogMessage(' No IMDB URL.');
Mode:=smSearch; //->Go to function ParsePage to search for the URL (in this function you cannot use user functions)
DownloadURL:=''; //Has no movie URL.
Result:=GetAppPath+DUMMY_HTML_FILE; //Any small existing file, used to cheat PVdB's automatic download.
LogMessage('Function GetDownloadURL END====================== with Mode='+IntToStr(Mode)+' Result='+Result+'|');
exit; //Go to the
End;
End;
//No other modes need handling in this function.
//smNormal = 1; //This script downloads with an external program (not with GetDownloadURL), so it makes only one pass through ParsePage to retrieve all info, credits, poster, etc.; other field modes aren't necessary
//smSearchList = 8; //Used in ParsePage to request the download of the https link returned by the user in the (AddSearchResult) window
Result:=GetAppPath+DUMMY_HTML_FILE; //Any small existing file, used to cheat PVdB's automatic download.
LogMessage('Function GetDownloadURL END====================== with Mode='+IntToStr(Mode)+' Result='+Result+'|');
exit;
End; //BlockClose
.
.
.
Function DownloadPage(URL:AnsiString):String; //BlockOpen
//Returns the URL page text. If error returns empty string
Var
i:Integer;
ScriptPath,WebText:String;
Begin
LogMessage(Chr(9)+Chr(9)+'Function DownloadPage BEGIN======================|');
LogMessage(Chr(9)+Chr(9)+'Global Var-DownloadURL|'+DownloadURL+' |');
LogMessage(Chr(9)+Chr(9)+' Local Var-URL|'+URL+' |');
ScriptPath:=GetAppPath+'Scripts\';
//Delete the old downloaded page file.
While FileExists(ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM) Do Begin
LogMessage(Chr(9)+Chr(9)+'Deleting existing file: ' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM);
FileExecute('cmd.exe', '/C del "'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM+'"');
LogMessage(Chr(9)+Chr(9)+' Waiting 1s for delete:'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM);
wait (1000);
End;

// Download the URL page.
//LogMessage(Chr(9)+Chr(9)+' Download with PVdBDownPage in file:|'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM+' the information of:|'+URL+' ||');
//FileExecute(ScriptPath+'PVdBDownPage.exe', '"'+URL+'" "'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM+'"');

LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:| ' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM + ' the information of:|' + URL + '||');
LogMessage(Chr(9)+Chr(9)+'Executing Python script to download URL content.');
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py" "' + URL + '" "' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM + '"');

// Wait for the download to finish and for the downloaded page to exist.
i:=0; // INTERNET_TEST_ITERATIONS
While Not(FileExists(ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM)) Do Begin
LogMessage(Chr(9)+Chr(9)+' Waiting 5s for exists of:'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM);
wait (5000);
i:=i+1;
If i=INTERNET_TEST_ITERATIONS Then Begin
if 2=MessageBox('Too many faulty attempts to internet connection.'+Chr(13)+ 'Retry or Cancel?',SCRIPT_NAME,5) then begin
LogMessage(Chr(9)+Chr(9)+'Function DownloadPage END with NOT INTERNET connection ===============|');
Result:='';
Exit;
End;
i:=0;
End;
End;

LogMessage(Chr(9)+Chr(9)+' Now present complete page file: '+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM);
WebText:=FileToString(ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM);
LogMessage(Chr(9)+Chr(9)+'File content length: ' + IntToStr(Length(WebText)));
LogMessage(Chr(9)+Chr(9)+'File content (first 100 chars): ' + Copy(WebText, 1, 100));
WebText:=ConvertEncoding(WebText, 65001);
Result:=WebText;

// Some download data validations.
if (Pos('404 Not Found',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The URL is not in use (404 Not Found).'+Chr(13)+'Go to the provider web in order to find the good page',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' 404 Not Found|');
Result:='';
End;

if (Pos('404 Error - IMDb',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The URL is not in use (404 Error - IMDb).'+Chr(13)+'Go to the provider web in order to find the good page',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' 404 Error - IMDb|');
Result:='';
End;

if (Pos('Page not found',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The URL is not in use (Page not found).'+Chr(13)+'Go to the provider web in order to find the good page',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' Page not found|');
Result:='';
End;

if (Pos('405 Method not allowed',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The URL has HTTP method problems (405 Method not allowed).'+Chr(13)+'Go to the provider web in order to find the good page',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' 405 Method not allowed|');
Result:='';
End;
if (Pos('Too many request',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The provider has banned your IP (Too many request).'+Chr(13)+'Go to the provider web and resolve the captcha in order to prove you are not a robot',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' Banned IP|');
Result:='';
End;

   LogMessage('Value BASE_DOWNLOAD_FILE_NO_BOM: ' + BASE_DOWNLOAD_FILE_NO_BOM);
LogMessage(Chr(9)+Chr(9)+'Function DownloadPage END======================|');
exit;
End; //BlockClose

Function DownloadImage(URL:AnsiString;OutPutFile:AnsiString):Integer; //BlockOpen
//Returns 1 or 0 if the downloaded image file exists in Exit.
//Var
//i:Integer;
//ScriptPath:String;
Begin
   (*
LogMessage(Chr(9)+Chr(9)+'Function DownloadImage BEGIN======================|');
LogMessage(Chr(9)+Chr(9)+'Global Var-DownloadURL|'+DownloadURL+' |');
LogMessage(Chr(9)+Chr(9)+' Local Var-URL|'+URL+' |');
LogMessage(Chr(9)+Chr(9)+' Local Var-OutPutFile|'+OutPutFile+'|');
ScriptPath:=GetAppPath+'Scripts\';
//Delete the old downloaded page file. Needed in order to wait for the curl download included in the PowerShell command.
While FileExists(OutPutFile) Do Begin
FileExecute('cmd.exe', '/C del "'+OutPutFile+'"');
LogMessage(Chr(9)+Chr(9)+' Waiting 1s for delete:'+OutPutFile);
wait (1000);
End;
//Download the URL page.
LogMessage(Chr(9)+Chr(9)+' Download with PVdBDownPage in file:|'+OutPutFile+' the information of:|'+URL+' ||');
FileExecute(ScriptPath+'PVdBDownPage.exe', '"'+URL+'" "'+OutPutFile+'"');
//Wait for the download to finish and for the downloaded page to exist.
i:=0; // INTERNET_TEST_ITERATIONS
While Not(FileExists(OutPutFile)) Do Begin
LogMessage(Chr(9)+Chr(9)+' Waiting 2s for exists of:'+OutPutFile);
wait (2000);
i:=i+1;
If i=INTERNET_TEST_ITERATIONS Then Begin //When downloading images the script cannot ask the user about the internet connection, because the file may simply not exist.
LogMessage(Chr(9)+Chr(9)+'Function DownloadImage END with NOT file downloaded ===============|');
Result:=0;
exit;
End;
End;
LogMessage(Chr(9)+Chr(9)+' Now present complete page file: '+OutPutFile);
Result:=1;
LogMessage(Chr(9)+Chr(9)+'Function DownloadImage END======================|');
exit;
*)
End; //BlockClose
.
.
.
Function ParsePage(HTML:String;URL:AnsiString):Cardinal; //BlockOpen
Var
MovieID,titleValue,yearValue:String;
ResultTmp:Cardinal;
Date:String;
Fullinfo,Movie_URL,IMDB_URL:String;
DateParts:TWideArray;
   Fullinfo1,MovieID1:String;
Begin
.
.
.
//Parse Also Known As provider page = BASE_URL_AKA-------------------------------------------------------------------
If (GET_FULL_AKA and Not(USE_SAVED_PVDCONFIG and (Copy(PVDConfigOptions,opAKA,1)='0'))) Then Begin
//If (GET_FULL_AKA and (MediaType='Movie') and Not(USE_SAVED_PVDCONFIG and (Copy(PVDConfigOptions,opAKA,1)='0'))) Then Begin
//If (GET_FULL_AKA and Not(USE_SAVED_PVDCONFIG and (Copy(PVDConfigOptions,opAKA,1)='0'))) Then Begin
DownloadURL:=StringReplace(BASE_URL_AKA,'%IMDB_ID',MovieID,True,True,False);
HTML:=DownloadPage(DownloadURL); //True page for parsing
         //HTML := DownloadPage(DownloadURL, 'Tmp\downpage-UTF8_NO_BOM_AKA.htm'); // True page for parsing
         //BASE_DOWNLOAD_FILE_NO_BOM_AKA = 'Tmp\downpage-UTF8_NO_BOM_AKA.htm';
HTML:=HTMLToText(HTML);
ResultTmp:=ParsePage_IMDBMovieAKA(HTML);
If Not(ResultTmp=prFinished) then Result:=ResultTmp;
End;

--- End quote ---

But in the ParsePage function, leave only the one block shown in my example, and it should work.
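
One thing that may be worth guarding against: DownloadPage above only waits until the output file exists, and a file opened for writing already exists before its content is complete. A possible safeguard on the Python side, sketched here under that assumption, is to write to a temporary file in the same folder and rename it to the final name only when the content is complete. The function name write_html_atomically is hypothetical.

```python
import os
import tempfile


def write_html_atomically(output_path, html_source):
    """Write html_source so output_path only appears once it is complete."""
    directory = os.path.dirname(os.path.abspath(output_path))
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the same folder, so the final rename
    # stays on one drive.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(html_source)
        # os.replace swaps the finished file into place in one step.
        os.replace(temp_path, output_path)
    except BaseException:
        # Clean up the partial file if anything went wrong.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

In selenium_script.py this would replace the plain open(...)/write(...) pair, called as write_html_atomically(output_path, html_source).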
