Integrating Selenium to PVD

afrocuban:

Sorry for joining the conversation uninvited, and without knowing programming. But I learned a bit along the way while trying to download the FA page with trailers locally. I managed to download it as downpage_trailer-UTF8_NO_BOM_FA.htm alongside downpage-UTF8_NO_BOM.htm in order to try to parse them both, but there is no good news: I can parse them both, but the dynamic content isn't downloaded.

I don't think we can do it with PVdBDownPage.exe:

--- Quote ---// Download the URL page.
LogMessage(Chr(9) + Chr(9) + ' Download with PVdBDownPage in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');
FileExecute(ScriptPath + 'PVdBDownPage.exe', '"' + URL + '" "' + ScriptPath + OutFile + '"');

--- End quote ---

because it cannot download dynamic content:
--- Quote ---This is likely due to the JavaScript that loads the video elements dynamically on the original page. When you download the page using DownloadPage, the JavaScript doesn’t execute, so the dynamic content (like embedded trailers) isn’t included in the downloaded HTML file.
To work around this, we have a couple of options:

Use a Web Browser Automation Tool: Tools like Selenium can control a web browser to visit the page and let all JavaScript execute before saving the fully rendered page’s HTML. This way, the dynamic content will be included.

Find the Video URL Manually: Check the original HTML of the page in your browser (right-click and “View Page Source” or inspect elements with Developer Tools). Look for the actual URLs of the video files. You can then use those URLs directly in your script.

Using a web browser automation tool like Selenium is a bit more involved and requires additional setup, but it is a more robust solution for handling dynamic content. If you prefer to go this route, I can guide you through the setup process.

Alternatively, manually inspecting the page’s source for video URLs is quicker but less convenient for repeated tasks.
--- End quote ---
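For the second option (finding the video URLs by hand), a small sketch of how the same search could be done on a saved HTML file. It only helps when the file already contains the video elements, for example a page saved from the browser after it finished loading. The tag names (iframe, video, source) and the sample HTML are assumptions for illustration; the real FA markup may differ.

```python
from html.parser import HTMLParser


class MediaSrcCollector(HTMLParser):
    """Collect the src attribute of tags that usually embed video."""

    # Assumed set of tags; adjust after inspecting the real page.
    MEDIA_TAGS = {"iframe", "video", "source"}

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in self.MEDIA_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def media_sources(html):
    """Return the media URLs found in html, in document order."""
    collector = MediaSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Hypothetical sample, not copied from FilmAffinity.
SAMPLE_HTML = """<div class="divvid"><iframe src="https://www.youtube.com/embed/EXAMPLE_ID"></iframe></div>
<video controls><source src="https://example.com/trailer.mp4" type="video/mp4"></video>
<img src="poster.jpg">"""
```

Calling media_sources(SAMPLE_HTML) returns the iframe and source URLs and skips the image.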

--- Quote ---Here's the rest of the conversation:
--- End quote ---
Invoking Selenium directly from within a PVD script (written in Pascal) is not straightforward because Selenium typically operates via Python, Java, or another high-level language. However, you can use a workaround by invoking an external Python script from your PVD Pascal script.

Here’s how you can integrate Selenium into your workflow:

Create a Python Script Using Selenium:

This script will navigate to the URL, wait for the content to load, and save the fully rendered HTML.

python

--- Quote ---from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Setup the WebDriver (Selenium 4 takes the driver path through a Service
# object; the older executable_path argument was removed in later 4.x releases)
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))

# Navigate to the URL
driver.get("https://www.filmaffinity.com/en/evideos.php?movie_id=xxxx")

# Wait for the trailers to load (adjust the waiting condition as needed)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
)

# Get the fully rendered HTML
html = driver.page_source

# Save the HTML to a file
with open('downpage_trailer-UTF8_NO_BOM_FA.htm', 'w', encoding='utf-8') as file:
    file.write(html)

# Close the browser
driver.quit()
--- End quote ---

Invoke the Python Script from Your PVD Script:

Use the FileExecute method to call the Python script from within your Pascal script.

pascal

--- Quote ---// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');

// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py"');
--- End quote ---
In this setup:

The Python script handles the browsing and downloading of the fully rendered HTML using Selenium.

Your PVD script uses the FileExecute method to call the Python script, passing any necessary parameters if needed.

After the Python script finishes and the HTML is saved, your PVD script can continue processing the saved HTML as usual.

Example Complete Workflow
1. Python Script (selenium_script.py)

python

--- Quote ---from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys

# Get the URL and output file from the command line arguments
url = sys.argv[1]
output_file = sys.argv[2]

# Setup the WebDriver (Selenium 4 takes the driver path through a Service
# object; the older executable_path argument was removed in later 4.x releases)
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))

# Navigate to the URL
driver.get(url)

# Wait for the trailers to load (adjust the waiting condition as needed)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
)

# Get the fully rendered HTML
html = driver.page_source

# Save the HTML to a file
with open(output_file, 'w', encoding='utf-8') as file:
    file.write(html)

# Close the browser
driver.quit()
--- End quote ---
2. Pascal Script (Invoke Selenium Script)

pascal

--- Quote ---// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');

// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py" "' + URL + '" "' + ScriptPath + OutFile + '"');
--- End quote ---
By following this workflow, you can effectively integrate Selenium into your PVD script to handle pages with dynamic content. This way, you can ensure that all necessary content, like embedded trailers, is included in the HTML that your PVD script processes.

How would selenium know which url to open?
Great question. You can pass the URL to Selenium through command line arguments when you invoke the Python script. This way, your PVD script can dynamically supply the URL to Selenium. Here's how you can do it:

Step-by-Step Implementation
Modify Your Python Script to Accept Command Line Arguments:

You’ll adjust the Python script to accept the URL as an argument. This ensures that Selenium knows which URL to open.

Example Python Script (selenium_script.py):

python

--- Quote ---from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys

# Get the URL and output file from the command line arguments
url = sys.argv[1]
output_file = sys.argv[2]

# Setup the WebDriver (Selenium 4 takes the driver path through a Service
# object; the older executable_path argument was removed in later 4.x releases)
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))

# Navigate to the URL
driver.get(url)

# Wait for the trailers to load (adjust the waiting condition as needed)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
)

# Get the fully rendered HTML
html = driver.page_source

# Save the HTML to a file
with open(output_file, 'w', encoding='utf-8') as file:
    file.write(html)

# Close the browser
driver.quit()
--- End quote ---
Invoke the Python Script from Your PVD Script:

Use the FileExecute method to call the Python script, passing the URL and the output file path as arguments.

Example Pascal Script:

pascal

--- Quote ---// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');

// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py" "' + URL + '" "' + ScriptPath + OutFile + '"');
--- End quote ---
How It Works
Python Script: The Python script (selenium_script.py) receives the URL and the output file path as command line arguments. It then uses Selenium to open the URL, waits for the trailers to load, and saves the fully rendered HTML to the specified output file.

Pascal Script: Your PVD script constructs the command to run the Python script with the necessary arguments (the URL and the output file path) and then executes it. This ensures that the correct URL is passed to Selenium dynamically.

This approach allows you to integrate Selenium seamlessly into your PVD script, enabling it to handle pages with dynamic content effectively. If you need further assistance with any part of the setup, feel free to ask!
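A possible variation on the script above, as a sketch only: it reads the same two arguments with argparse, so a missing argument gives a usage message instead of an IndexError, and it closes the browser even if the wait times out. The --timeout option and the function names are my own additions. Calling webdriver.Chrome() without a driver path assumes Selenium 4.6 or newer, which can locate chromedriver by itself.

```python
import argparse


def parse_args(argv):
    """Parse the arguments that the PVD script passes on the command line."""
    parser = argparse.ArgumentParser(
        description="Save a JavaScript-rendered page to a file."
    )
    parser.add_argument("url", help="page to open")
    parser.add_argument("output_file", help="where to write the rendered HTML")
    # Hypothetical extra option, not part of the original script.
    parser.add_argument("--timeout", type=int, default=10,
                        help="seconds to wait for the trailers to appear")
    return parser.parse_args(argv)


def fetch_rendered_html(url, timeout):
    """Open url in Chrome, wait for the trailer element, return the HTML."""
    # Imported here so argument errors are reported even without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Selenium 4.6+ finds the driver
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
        )
        return driver.page_source
    finally:
        # Runs on success and on timeout, so no browser window is left open.
        driver.quit()


def main(argv):
    args = parse_args(argv)
    html = fetch_rendered_html(args.url, args.timeout)
    with open(args.output_file, "w", encoding="utf-8") as file:
        file.write(html)
    return 0
```

To use it as selenium_script.py, the last line of the file would call main(sys.argv[1:]) after an import sys; the FileExecute line in the PVD script stays the same.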

How slow is this process?
The process of using Selenium to render and download the HTML can be slower than simply downloading a static page because Selenium has to:

Open a Web Browser: Launching a browser instance (e.g., Chrome or Firefox) takes a few seconds.

Navigate to the URL: Just like a human user, Selenium will navigate to the specified URL, which involves network latency.

Wait for JavaScript Execution: Selenium waits for the page to fully load, including the execution of any JavaScript that dynamically loads content. This can vary in time depending on the complexity of the page and the speed of your internet connection.

Save the HTML: Once the page is fully loaded, Selenium will extract and save the HTML content, which is relatively quick.

Estimated Time
Browser Launch: 1-3 seconds

Page Navigation: 1-5 seconds (depends on network speed)

JavaScript Execution & Content Load: 3-10 seconds (depends on page complexity)

Saving HTML: <1 second

In total, it can take anywhere from 5 to 20 seconds or more per page, depending on various factors.
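To see what these numbers mean for a whole batch, the stage estimates can simply be added up. This sketch assumes one fresh browser per page and pages processed one after another; "<1 second" for saving is treated as 0 to 1.

```python
# (min, max) seconds per stage, taken from the estimates above.
STAGE_SECONDS = {
    "browser_launch": (1, 3),
    "page_navigation": (1, 5),
    "javascript_and_content": (3, 10),
    "saving_html": (0, 1),
}


def estimate_seconds(pages):
    """Return (low, high) total seconds for the given number of pages."""
    low = sum(lo for lo, _ in STAGE_SECONDS.values()) * pages
    high = sum(hi for _, hi in STAGE_SECONDS.values()) * pages
    return low, high
```

For one page this gives 5 to 19 seconds, and for 100 movies 500 to 1900 seconds, so roughly 8 to 32 minutes.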

Performance Considerations
Network Speed: Faster internet connections will reduce the navigation and loading time.

Page Complexity: Pages with heavy JavaScript and multiple dynamic elements will take longer to fully load.

Browser & System Performance: Using a faster browser and a system with more resources (CPU, RAM) will speed up the process.

While this is slower compared to a direct HTML download using PVdBDownPage.exe, it ensures that all dynamic content is fully loaded and accessible. This trade-off can be worthwhile if you need to scrape content that is loaded dynamically by JavaScript.

If speed becomes a critical issue, you might look into optimizing the Selenium script further, such as by reducing the wait times where possible or running multiple instances in parallel if applicable.
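The "multiple instances in parallel" idea could look roughly like the sketch below. It is a separate Python launcher, not something PVD does by itself. It assumes selenium_script.py sits in the current folder and takes the URL and output file as arguments, as in the script above. The limit of 3 workers is an arbitrary choice, since every instance opens its own browser and uses memory.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def render_one(job):
    """Run selenium_script.py for one (url, output_file) pair."""
    url, output_file = job
    completed = subprocess.run(
        [sys.executable, "selenium_script.py", url, output_file]
    )
    return completed.returncode


def render_many(jobs, worker=render_one, max_workers=3):
    """Run the jobs concurrently; results come back in the order of jobs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, jobs))
```

A call such as render_many([(url1, 'page1.htm'), (url2, 'page2.htm')]) returns one exit code per page; the worker parameter is only there so the function can be tried with a stand-in before touching a real browser.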

afrocuban:
ON AKAS

Yes, you’re correct. When dealing with elements that dynamically load additional content upon interaction, such as the "12 more" button on the IMDb page, Selenium is an excellent tool to use.

How Selenium Can Help
* Automate Browser Actions: Selenium can open the browser and navigate to the desired URL.
* Simulate User Interaction: Selenium can simulate clicking on the "12 more" button to load the additional AKAs.
* Wait for Dynamic Content: Selenium can wait for the dynamically loaded content to appear before capturing the page’s HTML.
* Extract Fully Rendered HTML: Once all content is loaded, Selenium can extract and save the fully rendered HTML, including the dynamically loaded AKAs.

Example Selenium Workflow

Here’s how you might modify your Selenium script to handle this case:
1. Python Script Using Selenium (selenium_imdb.py):

python

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys

# Get the URL and output file from the command line arguments
url = sys.argv[1]
output_file = sys.argv[2]

# Setup the WebDriver (Selenium 4 takes the driver path through a Service
# object; the older executable_path argument was removed in later 4.x releases)
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))

# Navigate to the URL
driver.get(url)

# Wait for the "12 more" button to be clickable and click it
more_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//span[contains(text(), '12 more')]"))
)
more_button.click()

# Wait for the new content to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-testid='sub-section-akas']"))
)

# Get the fully rendered HTML
html = driver.page_source

# Save the HTML to a file
with open(output_file, 'w', encoding='utf-8') as file:
    file.write(html)

# Close the browser
driver.quit()

2. Invoke the Python Script from Your PVD Script:

pascal

// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');

// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_imdb.py" "' + URL + '" "' + ScriptPath + OutFile + '"');

Summary
* Automate Browser Actions: Open the browser and navigate to the URL.
* Simulate User Click: Click the "12 more" button to load additional AKAs.
* Wait for Content: Wait for the new content to load.
* Save Fully Rendered HTML: Save the fully rendered HTML to a file.
* Integrate with PVD Script: Use FileExecute in your PVD script to call the Selenium script and save the HTML.

By using Selenium, you can ensure that all dynamic content, such as the additional AKAs on the IMDb page, is loaded and saved for processing by your PVD script. This makes Selenium a powerful tool for dealing with dynamically loaded web content.
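One thing to watch in the IMDb script: it waits for a button with the exact text "12 more", so on a title with a different number of AKAs, or with no button at all, the wait would presumably time out and nothing would be saved. A small helper like the one below could make the click optional. The helper name and its two parameters are hypothetical; it is written so it can be tried without a browser.

```python
def click_if_present(find_button, not_found_errors):
    """Click the element returned by find_button().

    Returns True after a click, or False if find_button() raised one of
    not_found_errors (for Selenium waits that would be TimeoutException).
    """
    try:
        button = find_button()
    except not_found_errors:
        return False
    button.click()
    return True


# Assumed usage inside the Selenium script (not executed here):
#
#   from selenium.common.exceptions import TimeoutException
#   clicked = click_if_present(
#       lambda: WebDriverWait(driver, 10).until(
#           EC.element_to_be_clickable(
#               (By.XPATH, "//span[contains(text(), ' more')]"))),
#       (TimeoutException,),
#   )
```

The script would then save driver.page_source either way, so the PVD script still gets a file to parse when there is nothing to expand. Matching on ' more' instead of '12 more' is also an assumption; it may match other elements on the page and needs checking against the real IMDb markup.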

afrocuban:
And to be sure it's not about PVD:

So, regardless of whether PVD handles HTTPS or not, no application could actually access dynamically loaded content without tools like Selenium?

That's right. Dynamically loaded content, which relies on JavaScript to fetch and display data after the initial HTML page load, generally can't be accessed using tools that only handle static content, such as curl or traditional HTTP request libraries. These tools can fetch the initial HTML, but they can't execute JavaScript to render additional content.

Why Tools Like Selenium Are Necessary
* JavaScript Execution: Selenium can control a web browser to execute JavaScript, allowing it to load dynamic content.
* User Interaction: Selenium can simulate user actions such as clicking buttons, filling out forms, and scrolling, which are often required to trigger JavaScript that loads additional content.
* Full Page Rendering: Selenium waits until the page is fully rendered, ensuring all dynamic elements are loaded before extracting the HTML.

Use Cases for Selenium
* Expanding Sections: Like the "12 more" button on IMDb, where content only appears after clicking.
* Infinite Scrolling: Pages that load more content as you scroll down.
* Interactive Elements: Forms, drop-downs, and other UI elements that trigger content changes.

Without tools like Selenium, accessing this kind of dynamically loaded content programmatically is very difficult, as these elements require interaction and JavaScript execution that static tools can't handle.
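A small illustration of the difference, using only Python's built-in HTML parser. Both HTML snippets are invented for the example: the first stands for what a static download would contain (the trailer markup exists only as text inside a script), the second for what a browser would hold after running that script. The class name divvid is the one used in the scripts above.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html, css_class):
    """Return how many elements in html have css_class."""
    parser = ClassCounter(css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical page as downloaded: the element is created later by the script.
STATIC_HTML = """<html><body><div id="trailers"></div>
<script>
document.getElementById("trailers").innerHTML = '<div class="divvid">trailer</div>';
</script></body></html>"""

# Hypothetical page after a browser has executed the script.
RENDERED_HTML = """<html><body><div id="trailers"><div class="divvid">trailer</div></div>
</body></html>"""
```

Here count_class(STATIC_HTML, "divvid") finds nothing, because the parser treats the script body as plain text, while count_class(RENDERED_HTML, "divvid") finds the element. The same check could be run on a downloaded file to see whether it is worth parsing for trailers at all.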

Ivek23:
In the Python attachment there are the fetch_imdbs_titles.py and imdb_aka_fetcher.py scripts, which may help and show the specific paths of geckodriver for Firefox when writing Python scripts.

I would like to ask you to make a Python script (selenium_script.py) and a Pascal script (Invoke Selenium Script) adapted to the IMDB_[EN][HTTPS]_TEST_2c 2c script for me, and to add all of this in an attachment here.

Where do I need to add geckodriver for Firefox?

afrocuban:

--- Quote from: Ivek23 on December 15, 2024, 12:29:31 pm ---In the Python attachment there are the fetch_imdbs_titles.py and imdb_aka_fetcher.py scripts, which may help and show the specific paths of geckodriver for Firefox when writing Python scripts.

I would like to ask you to make a Python script (selenium_script.py) and a Pascal script (Invoke Selenium Script) adapted to the IMDB_[EN][HTTPS]_TEST_2c 2c script for me, and to add all of this in an attachment here.

Where do I need to add geckodriver for Firefox?

--- End quote ---

It isn't clear who you are asking to do this, but in case you are asking me: I am still at the very beginning of even comprehending the concept, let alone coding it. Interacting with AI can be, and is, extremely frustrating, and whatever I tried, I needed to try it live; otherwise I had to start over each time. So asking me to provide it for you isn't a productive way, unless you too want to go crazy like I did while upgrading the FA script, hahahah. To get to the point of being able to parse the FA trailers page (meaning being able to download and parse the dynamic HTML content on FA), I think I'll need a month at least, but I'm not surrendering.

Meanwhile, I started to fix and upgrade the IMDb people script. I have already fixed the "bio" field, but I need and want to tweak, update and upgrade it further before posting it in a meaningful form.
