Scraping and cleaning a Disney movies database in Python

Goal

Before a Disney movies database can be analyzed, the data must first be cleaned.

The data set is a list of Walt Disney Pictures films, with details for each movie such as the people involved, the running time, and the release date.

Data

The data is scraped from the following Wikipedia webpage: https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films

Libraries used

  • Beautiful Soup (to scrape the data)
  • Requests (to get the Wikipedia webpage)
  • JSON (to parse, save, and load the movies list as a JSON file)
  • Datetime (to convert date strings to datetime objects)
  • Pickle (to save and load the movies list once it includes datetime objects)
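
The split between JSON and Pickle in the list above can be illustrated with a minimal sketch. The helper names and the sample record are hypothetical and are not taken from the project; the sketch only assumes that the movies list is a list of dictionaries.

```python
import json
import pickle
from datetime import datetime


def save_json(path, data):
    """Write JSON-compatible data (strings, numbers, lists, dicts) to a file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read data back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_pickle(path, data):
    """Write arbitrary Python objects, including datetime, to a binary file."""
    with open(path, "wb") as f:
        pickle.dump(data, f)


def load_pickle(path):
    """Read pickled data back. Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


def is_json_serializable(data):
    """Return True if the json module can serialize the data as-is."""
    try:
        json.dumps(data)
    except TypeError:
        return False
    return True


# Hypothetical records: one before and one after date conversion.
raw_movie = {"title": "Example Film", "Release date": "January 1, 1999"}
clean_movie = {"title": "Example Film", "Release date": datetime(1999, 1, 1)}
```

With these helpers, the raw list can be stored as JSON, while the cleaned list, which contains datetime objects, would go through Pickle. Pickle files should only be loaded from trusted sources.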

Final output

The final output can be accessed here.

Issues fixed:

  1. Removed movies from the movies list that do not have a linked Wikipedia page
  2. Fixed an error caused by some table rows not containing a table header element
  3. Fixed a JSON file save error caused by datetime objects, which are not JSON serializable (used the Pickle library instead)
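
The first two fixes above could look roughly like the following sketch. It is not the project's actual code: the CSS selectors (`table.wikitable i` for movie titles and `table.infobox tr` for detail rows) and the function names are assumptions about the page layout, and they may need adjusting against the live Wikipedia markup.

```python
from bs4 import BeautifulSoup


def linked_movie_paths(list_page_html):
    """Collect the href of every movie title that links to its own page.

    Titles without an <a> element are skipped, which drops movies that
    have no linked Wikipedia page.
    """
    soup = BeautifulSoup(list_page_html, "html.parser")
    paths = []
    for title in soup.select("table.wikitable i"):  # assumed selector
        link = title.find("a")
        if link is None or not link.get("href"):
            continue  # no linked page for this movie
        paths.append(link["href"])
    return paths


def parse_info_box(movie_page_html):
    """Turn info box rows into a dict, skipping rows without a header cell."""
    soup = BeautifulSoup(movie_page_html, "html.parser")
    info = {}
    for row in soup.select("table.infobox tr"):  # assumed selector
        header = row.find("th")
        value = row.find("td")
        if header is None or value is None:
            continue  # e.g. a title-only or image-only row
        key = header.get_text(" ", strip=True)
        info[key] = value.get_text(" ", strip=True)
    return info
```

Checking for `None` before using the result of `find` is also a common way to avoid "'NoneType' object has no attribute" errors when an expected element is missing.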

Data cleaning tasks performed:

  1. Removed references "[1]", "[2]", etc. from the data
  2. Split up remaining strings of names into lists (nested lists)
  3. Investigated and fixed "'NoneType' object has no attribute" errors for some movies
  4. Running time: converted strings to integers
  5. Release date: converted date strings to datetime objects
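
A minimal sketch of cleaning tasks 1, 2, 4, and 5 is shown below. The function names, the assumption that names are separated by line breaks, and the two date formats are illustrative guesses, not the project's confirmed implementation.

```python
import re
from datetime import datetime


def remove_references(text):
    """Strip citation markers such as "[1]" or "[12]" from a string."""
    return re.sub(r"\[\d+\]", "", text).strip()


def split_names(text):
    """Split a multi-line string of names into a list of clean names."""
    parts = [remove_references(part) for part in text.split("\n")]
    return [part for part in parts if part]


def minutes_to_int(running_time):
    """Convert a string like "83 minutes" to the integer 83."""
    if running_time is None:
        return None
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None


def parse_release_date(text):
    """Convert a date string to a datetime object, or None if unparseable."""
    if text is None:
        return None
    # Drop any trailing note in parentheses, then any citation markers.
    cleaned = remove_references(text.split("(")[0])
    # Assumed formats: "February 7, 1940" and "7 February 1940".
    for fmt in ("%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None
```

Returning `None` for values that cannot be parsed keeps one malformed entry from stopping the whole cleaning run; such entries can then be inspected separately.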

See the full output on GitHub →