Scraping and cleaning a Disney movies database in Python

Goal

Before a Disney movies database can be analyzed, the data must first be scraped and cleaned.

The data set is a list of Walt Disney Pictures films scraped from Wikipedia, with fields for each movie such as running time, release date, and lists of names.

Data

The data is scraped from the following Wikipedia webpage: https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films

Libraries used

Beautiful Soup (to scrape the data)
Requests (to get the Wikipedia webpage)
JSON (to parse, save, and load the movies list as a JSON file)
Datetime (to convert date strings to datetime objects)
Pickle (to save and load the list once it includes datetime objects)
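The split between JSON and Pickle in the list above can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample record, its field names, and the helper names (save_json, load_json, save_pickle, load_pickle) are assumptions made for the example.

```python
import json
import os
import pickle
import tempfile
from datetime import datetime

# Hypothetical sample record; the real list comes from the scraper.
movies = [
    {
        "title": "Example Film",
        "Running time": 83,
        "Release date": datetime(1950, 1, 2),
    }
]


def save_json(path, data):
    """Write plain data (strings, numbers, lists, dicts) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read data back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_pickle(path, data):
    """Write any picklable Python object, including datetime, to disk."""
    with open(path, "wb") as f:
        pickle.dump(data, f)


def load_pickle(path):
    """Read a pickled object back (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# The json module cannot serialize datetime objects and raises TypeError.
try:
    json.dumps(movies)
    json_ok = True
except TypeError:
    json_ok = False

# Pickle round-trips the same list, datetime objects included.
tmp_dir = tempfile.mkdtemp()
pickle_path = os.path.join(tmp_dir, "movies.pickle")
save_pickle(pickle_path, movies)
restored = load_pickle(pickle_path)

# JSON still works for data that holds only strings and numbers.
json_path = os.path.join(tmp_dir, "movies.json")
save_json(json_path, [{"title": "Example Film", "Running time": 83}])
plain = load_json(json_path)
```

JSON stays readable and portable for the raw scraped strings, while Pickle is a reasonable fallback once the cleaned list holds Python-specific objects.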

Final output

The final output can be accessed here.

Issues fixed:

Removed movies from the movies list that do not have a linked Wikipedia page
Fixed an error caused by some table rows having no table header element
Fixed the JSON file save error caused by datetime objects, which are not JSON serializable (used the Pickle library instead)
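The first two fixes above can be sketched with Beautiful Soup as follows. This is an illustration under assumptions: the HTML snippets are simplified, hypothetical stand-ins for the Wikipedia markup, and the function names linked_movie_paths and parse_infobox are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one linked title and one unlinked title.
SAMPLE_LIST_HTML = """
<table>
  <tr><td><i><a href="/wiki/Example_Film">Example Film</a></i></td></tr>
  <tr><td><i>Unlinked Film</i></td></tr>
</table>
"""

# Hypothetical info box: the second row has no <th> element.
SAMPLE_INFOBOX_HTML = """
<table class="infobox">
  <tr><th colspan="2">Example Film</th></tr>
  <tr><td colspan="2">Poster caption</td></tr>
  <tr><th>Directed by</th><td>Jane Doe</td></tr>
  <tr><th>Running time</th><td>83 minutes</td></tr>
</table>
"""


def linked_movie_paths(html):
    """Return the href of every italicized title that links to a page."""
    soup = BeautifulSoup(html, "html.parser")
    paths = []
    for title in soup.find_all("i"):
        link = title.find("a", href=True)
        if link is None:
            # No linked page for this movie, so there is nothing to scrape.
            continue
        paths.append(link["href"])
    return paths


def parse_infobox(html):
    """Turn info box rows into a dict, skipping rows without th or td."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    info = {}
    if table is None:
        return info
    for index, row in enumerate(table.find_all("tr")):
        header = row.find("th")
        value = row.find("td")
        if index == 0 and header is not None:
            # Assumption: the first row holds the movie title.
            info["title"] = header.get_text(" ", strip=True)
        elif header is not None and value is not None:
            key = header.get_text(" ", strip=True)
            info[key] = value.get_text(" ", strip=True)
        # Rows missing a header or a value cell are ignored here; calling
        # .get_text() on a missing element would raise the NoneType error.
    return info
```

Checking for None before calling a method on the result of find() is what avoids the "'NoneType' object has no attribute" family of errors.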

Data cleaning tasks performed:

Removed references "[1]", "[2]", etc. from the data
Split up remaining strings of names into lists (nested lists)
Investigated and fixed "'NoneType' object has no attribute" errors for some movies
Running time: converted strings to integers
Release date: converted date strings to datetime objects
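The cleaning tasks above can be sketched with the standard library alone. The function names and the input formats are assumptions for illustration (for example, that names are separated by newlines and that dates look like "January 2, 1950"); the scraped data may need additional cases.

```python
import re
from datetime import datetime


def strip_references(text):
    """Remove bracketed citation markers such as "[1]" or "[23]"."""
    return re.sub(r"\[\d+\]", "", text).strip()


def split_names(text):
    """Split a newline-separated string of names into a list."""
    return [name.strip() for name in text.split("\n") if name.strip()]


def parse_running_time(text):
    """Return the first number in a string such as "83 minutes" as an int.

    Returns None when the value is missing or contains no digits.
    """
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def parse_release_date(text):
    """Convert a string containing "Month D, YYYY" to a datetime object.

    Returns None when no date in that assumed format is found.
    """
    match = re.search(r"[A-Z][a-z]+ \d{1,2}, \d{4}", text or "")
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(), "%B %d, %Y")
    except ValueError:
        # The text looked like a date but did not use a full month name.
        return None
```

Returning None for missing or unparseable values keeps one bad record from stopping the whole cleaning run.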

See the full output on GitHub.