What is the purpose of the project, and who are we building our analysis for?
Scenario:
Our stakeholder needs data-driven insights to determine what type of app to build for the Apple App Store
Questions to be answered:
What app categories are the most popular?
What price should the app be?
How do I maximize user ratings?
My Responsibilities
Based on the situation above, I have used a data set that includes data for 7197 iOS apps from the Apple App Store from July 2017. The data set has been obtained from Kaggle.
From this data set, I used two tables in the SQL code in order to analyze the data and make meaningful conclusions:
Before diving into advanced analysis using SQL, I conduct EDA on the data set to understand the characteristics and structure of the data, and any issues that need to be addressed beforehand, such as missing or inconsistent data or data types, and any errors or outliers
I also query the data for the min, max, and average ratings
-- **EXPLORATORY DATA ANALYSIS**
-- Checking the number of unique apps in both tables
select count(distinct id) as unique_app_ids
from AppleStore
select count(distinct id) as unique_app_ids
from appleStore_description_combined
-- Outputs for both SQL queries above show the same result: 7197
/*
Check for any missing values in key fields (app name, user rating, app genre, description)
from both tables
*/
select count(*) as missing_values_app
from AppleStore
where track_name is null or user_rating is null or prime_genre is null
select count(*) as missing_values_desc
from appleStore_description_combined
where app_desc is null
-- Output shows 0 missing values for both tables
-- Finding out the number of apps per genre
select
prime_genre,
count(*) as number_of_apps
from AppleStore
group by prime_genre
order by number_of_apps desc
-- Get an overview of the apps' ratings
select
min(user_rating) as min_rating,
max(user_rating) as max_rating,
avg(user_rating) as avg_rating
from AppleStore
Now that we've established that there is no missing data, and we have a basic statistical summary of the data (min, max, average), we can move on to the actual data analysis.
Checking whether paid apps have higher ratings than free apps:
** DATA ANALYSIS **
-- Determing whether paid apps have higher ratings than free apps
select
case
when price > 0 then 'Paid'
else 'Free'
end as app_type,
avg(user_rating) as avg_rating
from AppleStore
group by app_type
-- On average, the rating of Paid apps is slightly (10.19%) higher than Free apps
Output:
On average, the rating of Paid apps is slightly (10.19%) higher than Free apps
Checking if apps with more supported languages have higher ratings:
select
case
when lang_num < 10 then '<10 Languages Supported'
when lang_num between 10 and 30 then '10-30 Languages Supported'
when lang_num > 30 then '>30 Languages Supported'
END as language_bucket,
avg(user_rating) as avg_rating
from AppleStore
group by language_bucket
order by avg_rating desc
Output:
Checking what genres have the lowest ratings:
select
prime_genre,
avg(user_rating) as avg_rating
from AppleStore
group by prime_genre
order by avg_rating
limit 10
Output:
The above result may mean users are not satisfied with apps in these genres, so there may be a pain point to solve. However, this may also mean these app types (such as Finance) are hard to build and therefore harder to get a good rating from users
Checking if correlation exists between length of app description and user rating:
select
case
when length(d.app_desc) < 500 then 'Short'
when length(d.app_desc) between 500 and 1000 then 'Medium'
else 'Long'
END as desc_length_bucket,
avg(a.user_rating) as avg_rating
from AppleStore a
join AppleStore_description_combined d
on a.id = d.id
group by desc_length_bucket
order by avg_rating desc
Output:
Checking the top-rated apps for each genre:
select
prime_genre,
track_name,
user_rating
from (
select
prime_genre,
track_name,
user_rating,
rank() over(partition by prime_genre order by user_rating desc, rating_count_tot desc) as rank
from AppleStore
) a
where a.rank = 1
Output:
Results & Recommendations
The SQL output shows that paid apps generally have better ratings than free apps—this doesn't necessarily mean that paid apps are "better" than free apps. It may mean that users who pay for an app have higher engagement with the app and thus have a higher perceived value for it. The recommendation to the stakeholder is to charge a price for their app if they believe it solves a critical pain point in a high quality manner
The output also shows that apps that support 10-30 languages have a higher rating than apps that support less than 10 or more than 30 languages. This does not necessarily mean that apps that support between 10-30 languages are of higher quality, but that apps with a good amount of language support focus on the right audience, and satisfy user a higher number of users since they cater to a larger audience. An app with more than 30 languages may receive lower ratings since it may lack a balance between useful functionality and language support
App genres such as "Finance" and "Book" received one of the lowest ratings, suggesting that there is room for improvement for apps in these industries. However, this may also mean these apps in these industries are harder to build and maintain and therefore harder to get a good rating from users
Apps with a longer description have better ratings—it is likely that users appreciate apps that have a clear and transparent description of the features and limitations of the app before they download it, which sets appropriate expectations, leading to higher ratings
During the EDA phase, the SQL query to find the number of apps per genre provides an output where the Games and Entertainment app genres have the highest number of apps. This suggests that these are highly competitive app categories and it may be difficult for a new app in these spaces to gain meaningful traction