
Answering Business Questions Using SQL
In this project walkthrough, we’ll explore how to use SQL for data analysis from a digital music store and answer critical business questions. By working with the Chinook database—a sample database that represents a digital media store similar to iTunes—we’ll demonstrate how SQL can drive data-informed decision-making in a business context.
The Chinook database contains information about artists, albums, tracks, customers, and sales data. Through strategic SQL queries, we’ll help the business understand its market trends, evaluate employee performance, and identify growth opportunities. This project showcases real-world SQL applications that data analysts encounter daily.
We’ll take you through writing increasingly complex SQL queries, from basic exploratory analysis to advanced queries using Common Table Expressions (CTEs) and subqueries.
What You’ll Learn
By the end of this tutorial, you’ll know how to:
- Navigate complex relational database schemas with multiple tables
- Write SQL queries using joins to connect data across multiple tables
- Use Common Table Expressions (CTEs) to organize complex queries
- Apply subqueries to calculate percentages and comparative metrics
- Analyze business data to provide actionable insights
- Connect SQL queries to Python in Jupyter
Before You Start: Pre-Instruction
To make the most of this project walkthrough, follow these preparatory steps:
- Review the Project
- Prepare Your Environment
- If you’re using the Dataquest platform, everything is already set up for you
- If you’re working locally, you’ll need:
- Get Comfortable with SQL Fundamentals
- You should be familiar with basic SQL keywords:
SELECT
,FROM
,GROUP BY
, andJOIN
- Some experience with CTEs and subqueries will be helpful, but not required
- New to Markdown? We recommend learning the basics: Markdown Guide
- You should be familiar with basic SQL keywords:
Setting Up Your Environment
Before we get into our analysis, let’s set up our Jupyter environment to work with SQL. We’ll use some SQL magic commands that allow us to write SQL directly in Jupyter cells.
%%capture
%load_ext sql
%sql sqlite:///chinook.db
Learning Insight: The
%%capture
magic command suppresses any output messages from the cell, keeping our notebook clean. The%load_ext sql
command loads the SQL extension, and%sql sqlite:///chinook.db
connects us to our database.
Now let’s verify our connection and explore what tables are available in our database:
%%sql
SELECT name
FROM sqlite_master
WHERE type='table';
This special SQLite query shows us all the table names in our database. The Chinook database contains 11 tables representing different aspects of a digital music store:
album
: Album detailsartist
: Artist informationcustomer
: Customer information with assigned support representativesemployee
: Store employees, including sales support agentsgenre
: Music genresinvoice
: Sales transactionsinvoice_line
: Individual items within each invoicemedia_type
: Format types (MP3, AAC, etc.)playlist
: Curated playlistsplaylist_track
: Tracks within each playlisttrack
: Song information
Understanding the Database Schema
Working with relational databases means understanding how tables connect to each other. The Chinook database uses primary and foreign keys to establish these relationships. Here’s a simplified view of the key relationships between the tables we’ll be working with:
customer
is linked toemployee
throughsupport_rep_id
invoice
is linked tocustomer
throughcustomer_id
invoice_line
is linked toinvoice
throughinvoice_id
track
is linked toalbum
,invoice_line
, andgenre
throughalbum_id
,track_id
, andgenre_id
, respectively
Let’s preview some of our key tables to understand the data we’re working with:
%%sql
SELECT *
FROM track
LIMIT 5;
track_id | name | album_id | media_type_id | genre_id | composer | milliseconds | bytes | unit_price |
---|---|---|---|---|---|---|---|---|
1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 | Angus Young, Malcolm Young, Brian Johnson | 343719 | 11170334 | 0.99 |
2 | Balls to the Wall | 2 | 2 | 1 | None | 342562 | 5510424 | 0.99 |
3 | Fast As a Shark | 3 | 2 | 1 | F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman | 230619 | 3990994 | 0.99 |
4 | Restless and Wild | 3 | 2 | 1 | F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman | 252051 | 4331779 | 0.99 |
5 | Princess of the Dawn | 3 | 2 | 1 | Deaffy & R.A. Smith-Diesel | 375418 | 6290521 | 0.99 |
%%sql
SELECT *
FROM invoice_line
LIMIT 5;
invoice_line_id | invoice_id | track_id | unit_price | quantity |
---|---|---|---|---|
1 | 1 | 1158 | 0.99 | 1 |
2 | 1 | 1159 | 0.99 | 1 |
3 | 1 | 1160 | 0.99 | 1 |
4 | 1 | 1161 | 0.99 | 1 |
5 | 1 | 1162 | 0.99 | 1 |
Learning Insight: When working with a new database, always preview your tables with LIMIT to understand the data structure before writing complex queries. This helps you identify column names, data types, and potential relationships without flooding your output with hundreds of rows.
Business Question 1: Which Music Genres Should We Focus on in the USA?
The Chinook store wants to understand which music genres are most popular in the United States market. This information will help them decide which new albums to add to their catalog. Let’s build a query to analyze genre popularity by sales.
Building Our Analysis with a CTE
We’ll use a Common Table Expression (CTE) to create a temporary result set that combines data from multiple tables:
%%sql
WITH genre_usa_tracks AS (
SELECT
il.invoice_line_id,
g.name AS genre,
t.track_id,
i.billing_country AS country
FROM track t
JOIN genre g ON t.genre_id = g.genre_id
JOIN invoice_line il ON t.track_id = il.track_id
JOIN invoice i ON il.invoice_id = i.invoice_id
WHERE i.billing_country = 'USA'
)
SELECT
genre,
COUNT(*) AS tracks_sold,
COUNT(*) * 100.0 / (SELECT COUNT(*) FROM genre_usa_tracks) AS percentage
FROM genre_usa_tracks
GROUP BY genre
ORDER BY tracks_sold DESC;
genre | tracks_sold | percentage |
---|---|---|
Rock | 561 | 53.37773549000951 |
Alternative & Punk | 130 | 12.369172216936251 |
Metal | 124 | 11.798287345385347 |
R&B/Soul | 53 | 5.042816365366318 |
Blues | 36 | 3.4253092293054235 |
Alternative | 35 | 3.330161750713606 |
Latin | 22 | 2.093244529019981 |
Pop | 22 | 2.093244529019981 |
Hip Hop/Rap | 20 | 1.9029495718363463 |
Jazz | 14 | 1.3320647002854424 |
Easy Listening | 13 | 1.236917221693625 |
Reggae | 6 | 0.570884871550904 |
Electronica/Dance | 5 | 0.47573739295908657 |
Classical | 4 | 0.38058991436726924 |
Heavy Metal | 3 | 0.285442435775452 |
Soundtrack | 2 | 0.19029495718363462 |
TV Shows | 1 | 0.09514747859181731 |
Learning Insight: CTEs make complex queries more readable by breaking them into logical steps. Here, we first create a filtered dataset of USA purchases, then analyze it. The 100.0 in our percentage calculation ensures we get decimal results instead of integer division.
Our results show that Rock music dominates the USA market with over 50% of sales, followed by Latin, Metal, and Alternative & Punk. This suggests the store should prioritize these genres when selecting new inventory.
Key Insights from Genre Analysis
- Rock dominates: With 561 tracks sold (53.4%), Rock is by far the most popular genre
- Latin music surprise: The second most popular genre is Latin (10.3%), indicating a significant market segment
- Long tail effect: Many genres have very small percentages, suggesting niche markets
Business Question 2: Analyzing Employee Sales Performance
The company wants to evaluate its sales support agents’ performance to identify top performers and areas for improvement. Let’s analyze which employees generate the most revenue.
%%sql
SELECT
e.first_name || ' ' || e.last_name AS employee_name,
e.hire_date,
COUNT(DISTINCT c.customer_id) AS customer_count,
SUM(i.total) AS total_sales_dollars,
SUM(i.total) / COUNT(DISTINCT c.customer_id) AS avg_dollars_per_customer
FROM customer c
JOIN invoice i ON c.customer_id = i.customer_id
JOIN employee e ON c.support_rep_id = e.employee_id
GROUP BY e.employee_id, e.hire_date
ORDER BY total_sales_dollars DESC;
employee_name | hire_date | customer_count | total_sales_dollars | avg_dollars_per_customer |
---|---|---|---|---|
Jane Peacock | 2017-04-01 00:00:00 | 21 | 1731.5100000000039 | 82.45285714285733 |
Margaret Park | 2017-05-03 00:00:00 | 20 | 1584.0000000000034 | 79.20000000000017 |
Steve Johnson | 2017-10-17 00:00:00 | 18 | 1393.920000000002 | 77.44000000000011 |
Learning Insight: When using
GROUP BY
with aggregate functions, remember to include all non-aggregated columns in yourGROUP BY
clause. This is required in most SQL flavors (though SQLite is more forgiving). The||
operator concatenates strings in SQLite.
Performance Analysis Results
Our analysis reveals interesting patterns:
- Jane Peacock leads with the highest average dollars per customer, despite not having the most customers
- Margaret Park’s performance is solid, with metrics close to Jane’s, suggesting a consistent level of customer value delivery
- Steve Johnson, the newest employee, shows promising performance with metrics similar to more experienced staff
Business Question 3: Combining SQL with Python for Visualization
While SQL excels at data retrieval and transformation, combining it with Python enables powerful visualizations. Let’s demonstrate how to pass SQL query results to Python:
import pandas as pd
# Store our query as a string
query = """
SELECT
genre,
COUNT(*) AS tracks_sold
FROM genre_usa_tracks
GROUP BY genre
ORDER BY tracks_sold DESC
LIMIT 10;
"""
# Execute the query and store results
result = %sql \$query
# Convert to pandas DataFrame
df = result.DataFrame()
Learning Insight: The
%sql
inline magic (single percent sign) allows us to execute SQL and capture the results in Python. The dollar sign syntax (\$query
) lets us reference Python variables within SQL magic commands.
Challenges and Considerations
During our analysis, we encountered several important SQL concepts worth highlighting:
1. Integer Division Pitfall
When calculating percentages, SQL performs integer division by default:
-- This returns 0 for all percentages
SELECT COUNT(*) / (SELECT COUNT(*) FROM table) AS percentage
-- This returns proper decimals
SELECT COUNT(*) * 100.0 / (SELECT COUNT(*) FROM table) AS percentage
2. JOIN Selection Matters
We used INNER JOIN
throughout because we only wanted records that exist in all related tables. If we needed to include customers without invoices, we would use LEFT JOIN
instead.
3. Subquery Performance
Our percentage calculation uses a subquery that executes for each row. For larger datasets, consider using window functions or pre-calculating totals in a CTE.
Sharing Your Work with GitHub Gists
GitHub Gists provide an excellent way to share your SQL projects without the complexity of full repositories. Here’s how to share your work:
- Navigate to gist.github.com
- Create a new gist
- Name your file with the
.ipynb
extension for Jupyter notebooks or.sql
for SQL scripts - Paste your code and create either a public or secret gist
Gists automatically render Jupyter notebooks with all outputs preserved, making them perfect for sharing analysis results with stakeholders or including in your portfolio of projects.
Summary of Analysis
In this project, we’ve demonstrated how SQL can answer critical business questions for a digital music store:
- Genre Analysis: We identified Rock as the dominant genre in the USA market (53.4%), with Latin music as a surprising second place
- Employee Performance: We evaluated sales representatives, discovering that Jane Peacock leads in average revenue per customer
- Technical Skills: We applied CTEs, subqueries, multiple joins, and aggregate functions to solve real business problems
These insights enable data-driven decisions about inventory management, employee training, and market strategy.
Next Steps
To extend this analysis and deepen your SQL skills, consider these challenges:
- Time-based Analysis: How do sales trends change over time? Add date filtering to identify seasonal patterns
- Customer Segmentation: Which customers are the most valuable? Create customer segments based on purchase behavior
- Product Recommendations: Which tracks are commonly purchased together? Use self-joins to find associations
- International Markets: Expand the genre analysis to compare preferences across different countries
If you’re new to SQL and found this project challenging, start with our SQL Fundamentals skill path to build the foundational skills needed for complex analysis. The course covers essential topics like joins, aggregations, and subqueries that we’ve used throughout this project.
Happy querying!
Source link