Python (+Selenium) Chrome script for IMDb People script

afrocuban:
This is a Selenium script written specifically for the People script here:


Script Description: This script automates the download of various IMDb pages using Selenium and ChromeDriver, including handling cookies and popups, and saves the resulting HTML to local files.
It detects your localization automatically: it calls the http://ipinfo.io API to get your country code, then looks that code up in a dictionary to pick the matching language. If you don't want this, comment out the first part of the script and uncomment the block at the end of the script. Open the script in a text editor and read the notes there about this.
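A minimal sketch of the localization step described above, not the script's actual code. It assumes the `https://ipinfo.io/json` endpoint and its `country` field; the entries in `COUNTRY_TO_LANGUAGE`, the function names and the `en-US` fallback are illustrative. It uses the standard library's `urllib` rather than `requests`, only so that the sketch has no extra dependency.

```python
import json
import urllib.request

# Illustrative mapping only; the real script ships its own dictionary.
COUNTRY_TO_LANGUAGE = {
    "US": "en-US",
    "DE": "de-DE",
    "FR": "fr-FR",
    "SI": "sl-SI",
}

DEFAULT_LANGUAGE = "en-US"  # assumed fallback, not taken from the script


def language_for_country(country_code):
    """Map a two-letter country code to a language tag, with a fallback."""
    if not country_code:
        return DEFAULT_LANGUAGE
    return COUNTRY_TO_LANGUAGE.get(country_code.strip().upper(), DEFAULT_LANGUAGE)


def detect_country(timeout=5):
    """Ask ipinfo.io for the country code; return None on network or parse errors."""
    try:
        with urllib.request.urlopen("https://ipinfo.io/json", timeout=timeout) as resp:
            return json.load(resp).get("country")
    except (OSError, ValueError):
        return None
```

Calling `language_for_country(detect_country())` then gives a language tag even when the lookup fails, which is roughly what the manual block at the end of the script would otherwise set by hand.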
For this to work ensure that:


--- Quote ---A. You have installed Python
B. You have installed selenium and requests by running:



--- Quote ---pip install selenium requests
--- End quote ---


C. Your Chrome binary is on your PATH
D. Your Python folder is on your PATH
E. pythonw.exe is present and its containing folder is on your PATH
--- End quote ---
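Prerequisites A to E can be checked quickly before running anything. This is only a sketch: the executable names `python`, `pythonw` and `chrome` are assumptions about how they appear on your PATH, and `check_prerequisites` is a hypothetical helper, not part of the script.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True or False."""
    results = {}
    # Executables that should be reachable through PATH (names are assumptions)
    for exe in ("python", "pythonw", "chrome"):
        results[exe] = shutil.which(exe) is not None
    # Packages that should have been installed with pip
    for module in ("selenium", "requests"):
        results[module] = importlib.util.find_spec(module) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Anything reported as MISSING points to the matching item in the list above.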


This script:


--- Quote ---1. Uses the Chrome browser instead of Firefox
2. Uses chromedriver.exe instead of geckodriver
3. Starts chromedriver.exe silently
4. Silently invokes the browser in headless mode (no pop-up browser windows)
5. Scrapes the .htm page of a given URL
6. No path needs to be set manually inside the script: it is set relative to the path of the Selenium script!
--- End quote ---
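Points 1 to 5 can be sketched roughly as follows, assuming Selenium 4 and a recent Chrome. The flag list and the names `headless_arguments` and `save_page` are illustrative and not taken from the actual script, which also handles cookies and popups.

```python
def headless_arguments():
    """Chrome command-line flags for running without a visible window."""
    return ["--headless=new", "--disable-gpu", "--window-size=1920,1080"]


def save_page(url, output_path, driver_path):
    """Load url in headless Chrome and write the rendered HTML to output_path."""
    # Imported lazily so the helper above works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for argument in headless_arguments():
        options.add_argument(argument)

    service = Service(executable_path=driver_path)
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        with open(output_path, "w", encoding="utf-8") as handle:
            handle.write(driver.page_source)
    finally:
        # Always close the browser, even if loading or saving failed
        driver.quit()
```

The `--headless=new` flag needs a reasonably recent Chrome; older versions used plain `--headless`.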

To use the relative path, ensure that:


--- Quote ---6A. You put this script into the "Scripts" folder of your PVD instance.
6B. You put the appropriate chromedriver.exe into the "Scripts" folder, too.
--- End quote ---
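The relative path from point 6 can be derived from the script's own location. A minimal sketch, where `script_dir` and `chromedriver_path` are hypothetical helper names:

```python
import os


def script_dir(script_file):
    """Absolute path of the folder that contains the given script file."""
    return os.path.dirname(os.path.abspath(script_file))


def chromedriver_path(script_file, driver_name="chromedriver.exe"):
    """Expected location of chromedriver next to the script."""
    return os.path.join(script_dir(script_file), driver_name)
```

Inside the Selenium script you would call `chromedriver_path(__file__)`, which is why 6A and 6B ask for both files to sit in the same "Scripts" folder.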

To invoke the Selenium script silently from PVD's .psf script (so that no cmd window pops up for the Selenium script), be sure to use pythonw.exe instead of python.exe, for example:

--- Quote ---FileExecute('pythonw.exe', '"' + ScriptPath + 'selenium_script-Chrome_People.py" "' + URL + '" "' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM + '"');
--- End quote ---

The maintainers of the corresponding scripts, currently Ivek and me, will probably take care of this last point if they are interested, but check that it is there anyway.
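On the Python side, the two quoted arguments from the FileExecute call arrive in `sys.argv`. The sketch below shows one way to read them; `parse_arguments` and the default file name are hypothetical, and the real script may handle a missing output path differently.

```python
DEFAULT_OUTPUT = "downpage-UTF8_NO_BOM.htm"  # hypothetical default file name


def parse_arguments(argv):
    """Return (url, output_path) from a command line like the one PVD builds."""
    args = argv[1:]  # argv[0] is the script name itself
    if not args:
        raise SystemExit("usage: script.py URL [OUTPUT_PATH]")
    url = args[0]
    # The output path is optional so the script can also be run with a URL only
    output_path = args[1] if len(args) > 1 else DEFAULT_OUTPUT
    return url, output_path
```

In the script this would be called as `parse_arguments(sys.argv)` after `import sys`.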

You may first want to test the script manually from cmd, for example:


--- Quote ---C:\Users\user\selenium_script-Chrome_People.py "https://www.imdb.com/name/nm0000017"
--- End quote ---

From this point on, everything is automated and headless.

afrocuban:
Here's an optimized Selenium script that should reduce the wait time significantly.

afrocuban:
New scripts. Delete the earlier ones and put these into the Scripts folder.

Read more here:


--- Quote ---http://www.videodb.info/forum_en/index.php/topic,4367.msg22727.html#msg22727
--- End quote ---

Ivek23:
The selenium_script-People_4_pages_v3.2 script does not transfer all awards data because it does not open all the "More" buttons, at least that was the case for me.

Here is my updated part of the code that opens more of the "More" buttons, and it works, so you will have to adapt it for the Chrome version.


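--- Quote ---# Imports used by this excerpt (base_url, tmp_dir, gecko_path and
# firefox_options are not defined here; they come from the rest of the script)
import logging
import os
import threading
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Define URLs and save paths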
URLS_AND_PATHS = {
    f"{base_url}/awards/": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Awards.htm"),
    f"{base_url}/bio/": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Bio.htm"),
    f"https://www.imdb.com/search/title/?explore=genres&role={base_url.split('/')[-1]}": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Genres.htm"),
    f"{base_url}/?showAllCredits=true": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Credit.htm")
}

# Improved function to click all "More" buttons with scrolling
def click_all_more_buttons(driver):
    """
    Scrolls down the page and clicks all the "More" buttons that are visible.
    """
    body = driver.find_element(By.TAG_NAME, 'body')
    while True:
        try:
            # Find visible "More" buttons
            more_buttons = driver.find_elements(By.XPATH, '//span[contains(@class, "ipc-see-more__text")]/..')

            # If no buttons are found, break the loop
            if not more_buttons:
                logging.info("No more 'More' buttons found.")
                break

            # Iterate through and click all visible "More" buttons
            for button in more_buttons:
                try:
                    # Scroll into view before clicking
                    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", button)
                    time.sleep(0.5)  # Allow page to stabilize
                    button.click()
                    logging.info("Clicked a 'More' button.")
                    time.sleep(1)  # Allow time for new content to load
                except Exception as e:
                    logging.warning(f"Error clicking a 'More' button: {e}")
                    continue

            # Scroll the page down to load more buttons
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(1)  # Wait for page to load more buttons
        except Exception as e:
            # Any unexpected error ends the search; log it so it is not hidden
            logging.info(f"Stopped looking for 'More' buttons: {e}")
            break

# Function to download a page
def download_page(imdb_url, output_path, retries=3):
    for attempt in range(retries):
        driver = None  # Lets the finally block know whether a driver was started
        try:
            # Initialize FirefoxDriver
            service = Service(gecko_path)
            driver = webdriver.Firefox(service=service, options=firefox_options)
            logging.info(f"Started FirefoxDriver for: {imdb_url}")

            driver.get(imdb_url)
            logging.info(f"Loaded URL: {imdb_url}")

            # Handle "Select Your Preferences" popup
            try:
                popup = WebDriverWait(driver, 5).until(
                    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'sc-kDvujY')]"))
                )
                accept_button = WebDriverWait(driver, 5).until(
                    EC.element_to_be_clickable((By.XPATH, "//button[@data-testid='accept-button']"))
                )
                accept_button.click()
                logging.info("Accepted preferences popup.")
            except TimeoutException:
                logging.info("No preferences popup appeared.")

            # Click all "More" buttons on the page
            click_all_more_buttons(driver)

            # Save the HTML after clicking all "More" buttons
            html_source = driver.page_source
            with open(output_path, 'w', encoding='utf-8') as file:
                file.write(html_source)
            logging.info(f"Saved HTML to: {output_path}")
            break
        except Exception as e:
            logging.error(f"Error in attempt {attempt + 1}: {e}")
        finally:
            # Quit only if the driver was actually started in this attempt
            if driver is not None:
                driver.quit()

# Download pages in parallel
threads = []
for url, path in URLS_AND_PATHS.items():
    thread = threading.Thread(target=download_page, args=(url, path))
    threads.append(thread)
    thread.start()

# Wait for all downloads to finish
for thread in threads:
    thread.join()
--- End quote ---

afrocuban:
Strange.


I deleted the record for Alfonso Cuaron (https://www.imdb.com/name/nm0190859/), who has 262 wins and 207 nominations, imported it from scratch, and counted manually up to 262+207=469, and they are all there.

Can you provide me the link you were stuck with, so I could try to reproduce?
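Instead of counting by hand, the saved awards page can be checked with a few lines of Python. This is only a sketch: the class name in `AWARD_ITEM_CLASS` is an assumption about IMDb's current markup and should be verified against the saved downpage-UTF8_NO_BOM-Awards.htm file; `ClassCounter` and `count_elements` are hypothetical helpers.

```python
from html.parser import HTMLParser

# Assumed class name for one award entry; check it against the saved page.
AWARD_ITEM_CLASS = "ipc-metadata-list-summary-item"


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or empty, so fall back to ""
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements(html_text, class_name=AWARD_ITEM_CLASS):
    """Return how many elements in html_text carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    return parser.count
```

Reading the saved file with `encoding="utf-8"` and passing its text to `count_elements` gives a number to compare with the wins plus nominations shown on IMDb, so a page where some "More" buttons were not opened should show up as a lower count.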
