
IMDb & TMDb Movie Scraper
This project is a Python scraping tool that collects information about movies and TV series from IMDb, then enriches each record with additional metadata from The Movie Database (TMDb) API. The cleaned data, covering titles, ratings, budgets, cast, and crew, is exported to structured JSON and CSV files for analysis.
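As a rough illustration of the export step, the sketch below writes a list of records to both formats using only the standard library. The function name export_records, the field names, and the choice to join list values with "; " in the CSV are assumptions made for this example, not details taken from the project.

```python
import csv
import json
from pathlib import Path


def export_records(records, json_path, csv_path):
    """Write a list of dicts to a JSON file and a CSV file (illustrative helper)."""
    records = list(records)

    # JSON keeps nested values such as cast lists as they are.
    Path(json_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # CSV needs one fixed header: take the union of keys in first-seen order.
    fieldnames = list(dict.fromkeys(key for rec in records for key in rec))

    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        # restval fills in fields that a given record does not have.
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for rec in records:
            # Flatten list values (assumed here for cast and crew) into one cell.
            row = {
                key: "; ".join(map(str, val)) if isinstance(val, (list, tuple)) else val
                for key, val in rec.items()
            }
            writer.writerow(row)
```

Calling export_records(items, "movies.json", "movies.csv") would produce both files from the same in-memory list; the file names here are placeholders.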
The project is built on the Scrapy framework, which manages the scraping workflow. Because IMDb's pages are dynamic and JavaScript-heavy, the scraper uses Selenium to drive a web browser so that all content is loaded before extraction. Once the initial movie data is collected, the script makes concurrent API calls to TMDb using the Requests library to fetch additional fields. Combining the two sources yields a more detailed dataset than a single source would provide.
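The enrichment step could look roughly like the sketch below. It assumes each scraped item carries an imdb_id field, that concurrency comes from a thread pool, and that lookups go through TMDb's /3/find endpoint with external_source=imdb_id. The helper names fetch_tmdb and enrich, the tmdb_ key prefix, and the choice to skip failed lookups silently are all hypothetical; the project's actual code may differ.

```python
from concurrent.futures import ThreadPoolExecutor

TMDB_FIND_URL = "https://api.themoviedb.org/3/find/{imdb_id}"


def fetch_tmdb(imdb_id, api_key, timeout=10):
    """Look up one IMDb ID on TMDb and return the first match, or {}."""
    # Third-party dependency; imported lazily so enrich() can be exercised
    # with a stub fetcher when Requests is not installed.
    import requests

    resp = requests.get(
        TMDB_FIND_URL.format(imdb_id=imdb_id),
        params={"api_key": api_key, "external_source": "imdb_id"},
        timeout=timeout,
    )
    resp.raise_for_status()
    data = resp.json()
    matches = data.get("movie_results") or data.get("tv_results") or []
    return matches[0] if matches else {}


def enrich(items, fetch, max_workers=8):
    """Merge TMDb fields into each item, running the lookups in parallel."""

    def merge(item):
        try:
            extra = fetch(item["imdb_id"])
        except Exception:
            # Assumption: a failed lookup leaves the IMDb record unchanged.
            extra = {}
        # Prefix TMDb keys so they never overwrite the scraped IMDb fields.
        return {**item, **{f"tmdb_{key}": val for key, val in extra.items()}}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input items.
        return list(pool.map(merge, items))
```

A call such as enrich(items, lambda imdb_id: fetch_tmdb(imdb_id, api_key)) ties the two together. Passing the fetcher in as an argument keeps the merge logic testable without network access.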
The project includes two spiders, one basic and one advanced. The basic_scrapper spider is a straightforward implementation for smaller-scale tasks. The advance_scrapper spider is designed for large-scale data collection: it uses Python's concurrent.futures.ThreadPoolExecutor to run multiple browser instances in parallel, which increases scraping throughput. It also manages its work through a queue of scraping jobs organized by year and month, with error handling and automatic retries for failed jobs.
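The queue-and-retry pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the spider's actual code: build_tasks and run_tasks are hypothetical names, the scrape callable stands in for whatever a worker does with its own Selenium browser, and the retry limit is an arbitrary default.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def build_tasks(start_year, end_year):
    """Fill a queue with one (year, month, attempts) job per calendar month."""
    tasks = queue.Queue()
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            tasks.put((year, month, 0))  # 0 = no attempts made yet
    return tasks


def run_tasks(tasks, scrape, workers=4, max_retries=2):
    """Drain the queue with a thread pool, re-queuing failed jobs."""
    results = {}  # (year, month) -> whatever scrape() returned
    failed = []   # jobs that exhausted their retries

    def worker():
        # In the real spider each worker would presumably own one browser.
        while True:
            try:
                year, month, attempts = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker
            try:
                results[(year, month)] = scrape(year, month)
            except Exception:
                if attempts < max_retries:
                    # Put the job back; this worker keeps looping, so a
                    # re-queued job is always picked up by someone.
                    tasks.put((year, month, attempts + 1))
                else:
                    failed.append((year, month))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        for future in futures:
            future.result()  # surface unexpected worker errors

    return results, failed
```

For example, run_tasks(build_tasks(2015, 2020), scrape_month, workers=4) would process 72 monthly jobs with four workers, where scrape_month is a placeholder for the per-month scraping function. Tracking the attempt count inside each job keeps the retry state with the task rather than in shared bookkeeping.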