
Integrating Selenium to PVD


afrocuban:
Thanks. I'll keep this in mind.

I am now too deep into the People script to leave it at this point, and I thought the movie script would be your focus, with me helping with the surrounding scripts and Selenium itself. I thought I'd need more time to get into Selenium, but I was lucky to adapt quickly. At the moment I have adapted the script to download multiple pages locally, so it's easier to track what is actually scraped; I got the Selenium script to download those pages, and now I'm in the middle of getting the dynamic content actually downloaded. I got it working for the People Credits page:





--- Quote ---
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os


# Paths
CHROME_DRIVER_PATH = r"C:\PersonalVideoDB\Scripts\Tmp\chromedriver.exe"
CHROME_BINARY_PATH = r"C:\GoogleChromePortable64\App\Chrome-bin\chrome.exe"
SAVE_PATH = r"C:\PersonalVideoDB\Scripts\Tmp\downpage-UTF8_NO_BOM-Credit.mhtml"


# IMDb URL
IMDB_URL = "https://www.imdb.com/name/nm0000040/?showAllCredits=true"


# Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = CHROME_BINARY_PATH  # Specify the Chrome binary location
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-gpu")  # Disable GPU for headless mode stability
chrome_options.add_argument("--headless")  # Running Chrome in headless mode
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")


# Service object for ChromeDriver
service = Service(executable_path=CHROME_DRIVER_PATH)


# Initialize the WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)


# Add a custom cookie
driver.get("https://www.imdb.com")  # Open the base URL to set the cookie
cookie = {'name': 'example_cookie', 'value': 'example_value', 'domain': 'imdb.com'}
driver.add_cookie(cookie)


# Navigate to the IMDb page
driver.get(IMDB_URL)


# Wait for the page to fully load and specific element to ensure all content is loaded
try:
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "span.ipc-title__text"))
    )
except Exception as e:
    print(f"Error waiting for the page to load: {e}")


# Get page source
page_source = driver.page_source


# Construct the MHTML content manually.
# Note: the part header declares quoted-printable, but page_source is
# inserted as-is (unencoded); lenient viewers accept this, strict MHTML
# readers may not.
mhtml_content = f"""MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0000_01D4E1C0.CE6AA5F0"

This document is a Single File Web Page, also known as a Web Archive file.

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0
Content-Location: {IMDB_URL}
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

{page_source}

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0--
"""


# Write the MHTML data to the specified file path
with open(SAVE_PATH, "w", encoding="utf-8") as file:
    file.write(mhtml_content)


# No extra wait is needed here: the with-block above has already flushed and closed the file


# Confirm file creation
if os.path.exists(SAVE_PATH):
    print(f"Page saved successfully to {SAVE_PATH}")
else:
    print(f"Failed to save the page to {SAVE_PATH}")


# Close the browser
driver.quit()

--- End quote ---
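One caveat with the hand-built MHTML above: the part header declares quoted-printable, but the HTML is written unencoded. If a stricter MHTML reader rejects the file, the body can be properly encoded with the stdlib `quopri` module. A minimal sketch, mirroring the layout of the script above (CRLF line endings and multi-resource archives are ignored for brevity):

```python
import quopri

def save_mhtml(html, url, path,
               boundary="----=_NextPart_000_0000_01D4E1C0.CE6AA5F0"):
    """Write a single-part MHTML file whose body really is quoted-printable."""
    # quopri escapes every non-ASCII byte as =XX, so the file is pure ASCII
    body = quopri.encodestring(html.encode("utf-8")).decode("ascii")
    mhtml = (
        "MIME-Version: 1.0\n"
        f'Content-Type: multipart/related; boundary="{boundary}"\n'
        "\n"
        f"--{boundary}\n"
        f"Content-Location: {url}\n"
        "Content-Transfer-Encoding: quoted-printable\n"
        "Content-Type: text/html; charset=UTF-8\n"
        "\n"
        f"{body}\n"
        f"--{boundary}--\n"
    )
    with open(path, "w", encoding="ascii") as f:
        f.write(mhtml)
```

In the script above, this would replace the manual `mhtml_content` construction: `save_mhtml(page_source, IMDB_URL, SAVE_PATH)`.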


I also just added an Awards function to the script, and modified the DownloadPage and ParsePage functions to split downpage-UTF8_NO_BOM.htm so that a different file is downloaded for each function: Principal, Bio, Credit, Awards and Genre.

Now that I have all these pages, it'll be easier to track and parse, at least for me, not knowing how to code.

The Awards page is crucial for all the other scripts, because there I have to make the script mimic clicking on "More"/"All" and similar buttons, and actually search for and recognize them all, and wait for them. On filmaffinity.co it was easier because the values are hidden behind the button, but on IMDb that's not the case, so it's more challenging.
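The "find and click every expander" idea could be sketched roughly like this. The XPath is only a guess at matching "More"/"All"-style buttons and is not verified against the current IMDb markup; `driver` is whatever webdriver instance the surrounding script already created:

```python
import time

# Guess at an XPath matching buttons whose visible text contains "more" or
# "all" (case-insensitive via translate); the real IMDb markup may well
# need a different selector.
EXPANDER_XPATH = (
    "//button[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',"
    " 'abcdefghijklmnopqrstuvwxyz'), 'more') or "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',"
    " 'abcdefghijklmnopqrstuvwxyz'), 'all')]"
)

def expand_all_sections(driver, pause=1.0):
    """Click every 'More'/'All' expander button found on the current page."""
    # "xpath" is the string value of selenium's By.XPATH constant
    for button in driver.find_elements("xpath", EXPANDER_XPATH):
        try:
            driver.execute_script("arguments[0].scrollIntoView();", button)
            button.click()
            time.sleep(pause)  # give the revealed content a moment to render
        except Exception:
            pass  # stale or hidden button; skip it
```

Calling `expand_all_sections(driver)` after `driver.get(...)` and before grabbing `page_source` should surface the hidden rows, assuming the XPath actually matches IMDb's buttons.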

After that comes the challenge of passing the URL to the Selenium script, and then readapting the scripts again so that they don't try to download the pages themselves (I need them now to see what Selenium needs to download, but properly). After that, as a long-term goal, I see the .psf files serving only to call the Selenium script and pass it the Title and year; Selenium would then do the whole job of finding the movie, including offering us a choice, after which come parsing, extracting and formatting the data, and passing it back to the .psf to get it into the PVD database. The concept is feasible:
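One small piece of that pipeline can already be pinned down: composing the IMDb search URL from the Title and year needs URL-encoding, which the stdlib handles. A sketch, using the `find?q=...&s=tt` endpoint that appears in the search script quoted later in this thread:

```python
from urllib.parse import quote_plus

def build_search_url(title, year):
    """Compose an IMDb title-search URL from a movie title and year."""
    query = quote_plus(f"{title} {year}")  # encodes spaces, accents, &, etc.
    return f"https://www.imdb.com/find?q={query}&s=tt"

print(build_search_url("Blade Runner", "1982"))
# https://www.imdb.com/find?q=Blade+Runner+1982&s=tt
```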

afrocuban:

Python scripts using Selenium can definitely parse data from an IMDb page. Let's extend our existing script to extract and parse relevant information such as movie titles, roles, and other credits from the IMDb page.

Example Script to Parse Data from IMDb Page:
python

--- Quote ---
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import csv
import os  # needed for the os.path.exists check after saving the CSV

# Paths
CHROME_DRIVER_PATH = r"Q:\\ChromeDriver-win64\\chromedriver.exe"
CHROME_BINARY_PATH = r"Q:\\GoogleChromePath\\chrome.exe"

# IMDb URL
IMDB_URL = "https://www.imdb.com/name/nm0000040/?showAllCredits=true"

# Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = CHROME_BINARY_PATH
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

# Service object for ChromeDriver
service = Service(executable_path=CHROME_DRIVER_PATH)

# Initialize the WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Navigate to the specific IMDb page
driver.get(IMDB_URL)

# Wait for the credits section to load
# (.filmo-category-section matches the legacy IMDb filmography layout;
#  the current React-based pages may need a different selector)
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".filmo-category-section")))

# Scroll to the bottom to ensure all content is loaded
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(10)
driver.execute_script("window.scrollTo(0, 0);")
time.sleep(5)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(10)

# Extract movie titles and roles
credits = driver.find_elements(By.CSS_SELECTOR, ".filmo-category-section .filmo-row")

from selenium.common.exceptions import NoSuchElementException

# Prepare the data for CSV
data = []
for credit in credits:
    # find_element raises NoSuchElementException when nothing matches,
    # so guard each lookup instead of testing the result for truthiness
    try:
        title = credit.find_element(By.CSS_SELECTOR, "b a").text
    except NoSuchElementException:
        title = "N/A"
    try:
        year = credit.find_element(By.CSS_SELECTOR, ".year_column").text.strip()
    except NoSuchElementException:
        year = "N/A"
    role_elements = credit.find_elements(By.CSS_SELECTOR, "a[href*='?ref_=nmbio_']")  # Adjust the selector to match your needs
    roles = [role.text for role in role_elements] if role_elements else ["N/A"]
    data.append([title, year, ", ".join(roles)])

# Save the data to a CSV file
csv_path = r"C:\\PersonalVideoDB\\Scripts\\Tmp\\credits.csv"
with open(csv_path, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Year", "Roles"])
    writer.writerows(data)

# Confirm file creation
if os.path.exists(csv_path):
    print(f"Data saved successfully to {csv_path}")
else:
    print(f"Failed to save the data to {csv_path}")

# Close the browser
driver.quit()
--- End quote ---
Key Points:
Wait for the Page to Fully Load: Ensure dynamic content is loaded by waiting for specific elements.

Scroll to Load All Content: Scroll up and down to trigger lazy loading.

Extract Relevant Data: Parse movie titles, years, and roles from the IMDb page.

Save Data: Write the parsed data to a CSV file for easy import into your database.

afrocuban:
The concept is feasible and can be efficiently implemented. The idea of using a Pascal/Delphi script to call a Python script with Selenium is quite practical. Here’s a step-by-step outline on how to achieve this efficiently:

1. Pascal/Delphi Script (.psf)
Your Delphi/Pascal application calls a Python script.

It passes the movie title and year to the Python script.

2. Python Script with Selenium
The Python script searches IMDb for the movie.

It offers titles if there are multiple matches and lets you choose.

It parses the relevant data, formats it, and passes it back to the Pascal script.

3. Pascal/Delphi Script Receives Data
The Pascal script receives the data and integrates it into your database.

Detailed Steps
Step 1: Pascal/Delphi Script to Call Python Script
Here’s an example of how to call a Python script from Pascal/Delphi:

pascal

--- Quote ---
program CallPython;

uses
  ShellAPI, SysUtils, Windows;

var
  Title, Year: string;
  PythonExe, ScriptPath, Parameters: string;
  ExecInfo: TShellExecuteInfo;
  ResultFile: TextFile;
  Line: string;
begin
  Title := 'MovieTitle';  // These would be input parameters in your actual app
  Year := '2021';

  PythonExe := 'C:\Path\To\Python\python.exe';
  ScriptPath := 'C:\Path\To\Script\imdb_script.py';
  Parameters := Format('"%s" "%s" "%s"', [ScriptPath, Title, Year]);

  // ShellExecute returns immediately, which would let us read the result
  // file before Python has written it; use ShellExecuteEx instead and
  // wait for the Python process to finish
  FillChar(ExecInfo, SizeOf(ExecInfo), 0);
  ExecInfo.cbSize := SizeOf(ExecInfo);
  ExecInfo.fMask := SEE_MASK_NOCLOSEPROCESS;
  ExecInfo.lpVerb := 'open';
  ExecInfo.lpFile := PChar(PythonExe);
  ExecInfo.lpParameters := PChar(Parameters);
  ExecInfo.nShow := SW_HIDE;

  if not ShellExecuteEx(@ExecInfo) then
  begin
    Writeln('Failed to execute Python script');
    Exit;
  end;
  WaitForSingleObject(ExecInfo.hProcess, INFINITE);
  CloseHandle(ExecInfo.hProcess);
  Writeln('Python script executed successfully');

  // The Python script writes its results to a text file
  AssignFile(ResultFile, 'C:\Path\To\Script\results.txt');
  Reset(ResultFile);
  while not Eof(ResultFile) do
  begin
    ReadLn(ResultFile, Line);
    Writeln(Line);  // Process each line (this would be your actual data processing)
  end;
  CloseFile(ResultFile);
end.
--- End quote ---
Step 2: Python Script to Search and Parse IMDb
Here’s an example Python script that receives parameters, searches IMDb, and writes results to a file:

python

--- Quote ---
import sys
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By  # find_element_by_* helpers were removed in Selenium 4
import time

CHROME_DRIVER_PATH = "path/to/chromedriver"
IMDB_URL_TEMPLATE = "https://www.imdb.com/find?q={query}&s=tt"

def main():
    if len(sys.argv) != 3:
        print("Usage: imdb_script.py <Title> <Year>")
        return

    title = sys.argv[1]
    year = sys.argv[2]
    query = f"{title} {year}"

    service = Service(executable_path=CHROME_DRIVER_PATH)
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(IMDB_URL_TEMPLATE.format(query=query))

    time.sleep(3)  # Wait for page to load

    # Assuming the first result is the desired movie
    # (these CSS selectors match the legacy IMDb layout and may need updating)
    first_result = driver.find_element(By.CSS_SELECTOR, ".result_text a")
    first_result.click()

    time.sleep(3)  # Wait for movie page to load

    movie_title = driver.find_element(By.CSS_SELECTOR, "h1").text
    movie_year = driver.find_element(By.CSS_SELECTOR, "#titleYear a").text

    # Collect other relevant data...
    director = driver.find_element(By.CSS_SELECTOR, "div.credit_summary_item a").text

    # Write results to file
    with open("path/to/results.txt", "w", encoding="utf-8") as file:
        file.write(f"Title: {movie_title}\n")
        file.write(f"Year: {movie_year}\n")
        file.write(f"Director: {director}\n")

    driver.quit()

if __name__ == "__main__":
    main()
--- End quote ---
Step 3: Process Results in Pascal/Delphi
Back in your Pascal/Delphi script, read and process the results written by the Python script.

Efficiency Considerations
Parallel Execution: If you’re processing multiple movies, consider parallel execution where possible.

Caching: Implement caching for already fetched and processed data to avoid redundant processing.

Error Handling: Implement robust error handling and logging to manage issues with web scraping or data extraction.
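For the caching point, a simple file-based cache keyed by a hash of the URL is enough to skip pages that were already fetched. A sketch; `CACHE_DIR` and the `fetch` callback are placeholders, not part of the scripts above:

```python
import hashlib
import json
import os

CACHE_DIR = "cache"  # placeholder; point this at a Tmp folder in practice

def cached_fetch(url, fetch):
    """Return cached data for url if available, else call fetch(url) and cache the result."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)  # cache hit: no browser work needed
    data = fetch(url)  # cache miss: fetch(url) would drive Selenium here
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data
```

Anything JSON-serializable that the Selenium side extracts (title, year, roles, ...) can be stored this way, so re-running a script on the same person or movie costs no browser time.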

Summary
This approach ensures the seamless integration of Python and Pascal/Delphi scripts, leveraging the strengths of each for your specific needs. This should streamline the process and reduce manual intervention, improving overall efficiency.

afrocuban:
At the very end, setting up the environment is too complex for an average end user, so at the beginning I'll try exploring options for checking whether the user has Python, Selenium and the Chrome driver, and if not, offering to download and set all of that up for them...


A long road ahead for the inevitable transition to almost fully Selenium-like tools, but once set up it'll be way easier, because at the very least we won't have to download pages, overriding the "HTTPS issue" once and for all.
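A first sketch of that prerequisite check is possible with the stdlib alone; the minimum Python version and the assumption that chromedriver is on PATH are both placeholders to adjust:

```python
import shutil
import sys
from importlib import util

def check_environment():
    """Report which prerequisites (Python version, selenium, chromedriver) are present."""
    return {
        "python": sys.version_info >= (3, 8),                      # minimum version is an assumption
        "selenium": util.find_spec("selenium") is not None,        # is the package importable?
        "chromedriver": shutil.which("chromedriver") is not None,  # assumes the driver is on PATH
    }

missing = [name for name, ok in check_environment().items() if not ok]
if missing:
    print("Missing prerequisites:", ", ".join(missing))  # here the download/setup offer would go
else:
    print("Environment looks complete.")
```

Checking for a portable Chrome install at a fixed path would be an extra `os.path.exists` test, since portable builds don't register themselves anywhere the system can find.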

afrocuban:
Regarding your script in Slovenian, I just asked Copilot to translate it to English and here it is:


--- Quote ---
import sys
import os
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
import time

# Check if IMDb URL is provided as a parameter
if len(sys.argv) < 2:
    print("IMDb URL was not provided as a parameter.")
    sys.exit(1)

imdb_url = sys.argv[1]  # IMDb URL from the command line

# Path to geckodriver.exe
gecko_path = "C:/Projects/geckodriver.exe"  # Adjust the path according to the driver location

# Get the current application path
app_path = os.path.dirname(os.path.abspath(__file__))  # Path to the current Python script

# Check if your "PVD_0.9.9.21_MOD-Simple AllMovies" folder is on the D: drive or elsewhere
pvd_path = "D:\\MyPVD\\PVD_0.9.9.21_MOD-Simple AllMovies"  # Set this path once, so it does not change

# If you want a universal path, use app_path to combine
output_path = os.path.join(pvd_path, "Scripts", "Tmp", "downpage-UTF8_NO_BOM.htm")

# Check if the folder exists, if not, create it
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Create a browser object
service = Service(gecko_path)
driver = webdriver.Firefox(service=service)

try:
    # Open the IMDb page
    driver.get(imdb_url)
    print(f"The page {imdb_url} is loaded.")

    # Wait for the page to load
    time.sleep(5)

    # Get the entire source HTML of the page
    html_source = driver.page_source

    # Save the HTML to a file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(html_source)
    print(f"HTML is saved to file: {output_path}")

finally:
    # Close the browser
    driver.quit()
--- End quote ---
