Customer Behavior and Sales Analysis for Chinook Digital Store.
- Oluwakemi Oyefeso
- Jan 13
- 10 min read
Updated: Jan 29

Problem Statement:
Customer Behavior and Sales Analysis for Chinook Digital Store from 2009 to 2013 to identify growth opportunities, enhance its marketing strategies, and improve customer retention.
Goals
Understand Current Trends:
Gain insights into sales performance and customer retention patterns.
Identify Growth Opportunities:
Highlight underperforming areas and potential revenue streams.
Develop Data-Driven Strategies:
Create targeted marketing campaigns and personalized recommendations.
Key Stakeholders
Executive Management: To guide strategic decisions on market expansion and resource allocation.
Marketing Team: To craft data-driven campaigns and optimize ROI.
Sales Team: To improve strategies for high-value customer retention and sales growth.
Customer Service Team: To enhance customer experience through data insights.
Data:
Source: SQLite database provided by AltSchool Africa.
Techniques Applied:
Left Join: merging columns from different tables.
Customer Segmentation: RFM (Recency, Frequency, Monetary) Analysis.
Market Basket Analysis: Calculating metrics like Support, Confidence, and Lift.
I have also provided YouTube video links to explain the queries in detail.
Data Limitation:
No gender-specific or user feedback data.
Scope
Part 1: Data extraction, transformation, and cleaning.
Key Steps:
Step 1: Data Collection and Preparation:
Phases in Step 1:
Data Collection:
The original Chinook dataset was an SQLite file containing 11 tables.
I uploaded this file into SQLite and wrote queries to extract the required tables and columns.
Additionally, I created new tables for CLV, Customer Segmentation, and Market Basket Analysis.

Creating a Comprehensive Tracks Table:
I extracted Genre Name and MediaType from their respective tables and merged them into a new tracks2 table using a LEFT JOIN. This table served as the foundation for further analysis.
CREATE TABLE tracks2 AS
SELECT
t.TrackId,
t.Name,
t.AlbumId,
t.MediaTypeId,
t.GenreId,
t.Composer,
t.UnitPrice,
g.Name AS Genre,
m.Name AS MediaType
FROM tracks AS t
LEFT JOIN genres AS g
ON t.GenreId = g.GenreId
LEFT JOIN media_types AS m
ON t.MediaTypeId = m.MediaTypeId


Building the Fact Table:
Using the tracks2 table, I created the artist_album_sales fact table, consolidating data from multiple tables.
(Link to the YOUTUBE video presentation on building the artist_album_sales table.)
CREATE TABLE artist_album_sales AS
SELECT
i.InvoiceId,
i.InvoiceDate,
i.CustomerId,
iv.TrackId,
t.Name AS TrackName,
a.ArtistId,
at.Name AS ArtistName,
t.AlbumId,
a.Title AS AlbumTitle,
t.GenreId,
t.Genre,
t.MediaTypeId,
t.MediaType,
t.Composer,
iv.UnitPrice,
SUM(iv.Quantity) AS Quantity,
ROUND(SUM(iv.UnitPrice * iv.Quantity), 2) AS Amount
FROM invoices AS i
LEFT JOIN invoice_items AS iv
ON i.InvoiceId = iv.InvoiceId
LEFT JOIN tracks2 AS t
ON iv.TrackId = t.TrackId
LEFT JOIN albums AS a
ON t.AlbumId = a.AlbumId
LEFT JOIN artists AS at
ON a.ArtistId = at.ArtistId
GROUP BY
i.InvoiceId,
i.InvoiceDate,
i.CustomerId,
iv.TrackId,
t.Name,
a.ArtistId,
at.Name,
t.AlbumId,
a.Title,
t.GenreId,
t.Genre,
t.MediaTypeId,
t.MediaType,
t.Composer,
iv.UnitPriceOutput:
The artist_album_sales, containing 17 columns, was downloaded as a CSV file for further analysis after checking for duplicates.
Checking for duplicates:
I checked the artist_album_sales table for duplicates, and there were none.
Duplicates Query:
SELECT
InvoiceId,
InvoiceDate,
CustomerId,
TrackId,
AlbumId,
ArtistId,
UnitPrice,
SUM(Quantity) AS Quantity,
ROUND(SUM(UnitPrice * Quantity), 2) AS Amount,
COUNT(*) AS DuplicateCount
FROM artist_album_sales
GROUP BY
InvoiceId,
InvoiceDate,
CustomerId,
TrackId,
AlbumId,
ArtistId,
UnitPrice
HAVING COUNT(*) > 1;Step 2: Calculating Customer Lifetime Value (CLV)
Phases in Step 2:
Metrics Needed to calculate CLV:
Purchase Frequency: How often customers made purchases.
Average Purchase Value: The average revenue generated per transaction.
Customer Lifespan: The time between the first and last purchase.
What is CLV and why is it important to Chinook?
CLV is a metric that provides insight into how long a customer has been actively purchasing over time.
Benefits of Calculating CLV for Chinook
Identify and Retain High-Value Customers:
Helps pinpoint the most valuable customers and develop strategies to nurture and retain them.
Guides decisions on how much to invest in acquiring new customers based on expected revenue from existing ones.
Optimize Marketing Spend:
Enables efficient allocation of marketing resources by focusing on high-CLV customers who generate the most revenue.
Reduces acquisition costs by prioritizing retention efforts over constantly acquiring new customers.
Enhance Customer Experience and Personalization:
Provides insights into the preferences and behaviors of high-value customers.
Allows for tailored product recommendations, exclusive offers, and improved customer service, increasing engagement and loyalty.
Improve Business Strategy and Revenue Forecasting:
Helps project future revenue, assess growth potential, and make informed decisions on product development, expansion, and investments.
Supports strategic planning by predicting long-term profitability.
Measure Customer Relationship Effectiveness:
Acts as an indicator of customer satisfaction, loyalty, and repeat purchase behavior.
A high CLV signals strong customer relationships, while a declining CLV may highlight areas for improvement.
By leveraging CLV analysis, Chinook can maximize profitability, enhance customer retention, and make data-driven decisions for long-term success.
(Link to the YOUTUBE video presentation on CLV.)
CLV Calculation Query
WITH PurchaseData AS (
SELECT
ar.CustomerId,
c.FirstName || ' ' || c.LastName AS CustomerName,
COUNT(ar.InvoiceId) AS PurchaseFrequency,
ROUND(AVG(ar.Amount), 2) AS AveragePurchaseValue
FROM
artist_album_sales AS ar
LEFT JOIN customers as c
ON ar.CustomerId = c.CustomerId
GROUP BY
ar.CustomerId
),
CustomerLifespan AS (
SELECT
ar.CustomerId,
i.BillingAddress,
i.BillingCity,
i.BillingState,
i.BillingCountry,
i.BillingPostalCode,
(JULIANDAY(MAX(ar.InvoiceDate)) - JULIANDAY(MIN(ar.InvoiceDate))) / 365 AS CustomerLifespan
FROM
artist_album_sales AS ar
LEFT JOIN invoices AS i
GROUP BY
ar. CustomerId,i.BillingAddress,
i.BillingCity,
i.BillingState,
i.BillingCountry,
i.BillingPostalCode
)
SELECT
pd.CustomerId,
pd.CustomerName,
pd.AveragePurchaseValue,
pd.PurchaseFrequency,
cl.CustomerLifespan,
ROUND(pd.AveragePurchaseValue * pd.PurchaseFrequency * cl.CustomerLifespan, 2) AS CustomerLifetimeValue,
cl.BillingAddress,
cl.BillingCity,
cl.BillingState,
cl.BillingCountry,
cl.BillingPostalCode
FROM
PurchaseData AS pd
JOIN
CustomerLifespan AS cl ON pd.CustomerId = cl.CustomerIdOutput:
Shows the CLV per customer and this table was downloaded as a CSV file for further analysis and visualization in Power BI.
NOTES:
JULIANDAY(MAX(ar.InvoiceDate)): Calculates the Julian day number for the latest invoice date in the dataset for each customer. The MAX function ensures you're getting the most recent InvoiceDate.
JULIANDAY(MIN (ar.InvoiceDate)): Calculates the Julian day number for the earliest invoice date in the dataset for each customer. The MIN function ensures you're getting the earliest InvoiceDate.
Subtracting the Two Julian Day Numbers gives you the number of days between the earliest and latest invoice dates for each customer. This represents the duration or "lifespan" of the customer's activity with the store.
Dividing by 365 converts the number of days into years. This gives you the CustomerLifespan in years.
Step 3: Customer Segmentation (RFM Analysis)
RFM analysis enables customer segmentation by categorizing them based on three key factors: Recency (how recently they made a purchase), Frequency (how often they buy), and Monetary Value (how much they spend).
Phases in Step 3:
Frequency Calculation:
Frequency refers to how often a customer purchases within a specific period. A high frequency means the customer is loyal.
Query:
CREATE TABLE FREQUENCY AS
WITH PurchaseFrequency AS (
SELECT
ar.CustomerId,
COUNT (DISTINCT ar.InvoiceId) AS Frequency
FROM
artist_album_sales AS ar
GROUP BY
ar.CustomerId
)
SELECT
c.CustomerId,
c.FirstName || ' ' || c.LastName AS CustomerName,
pf.Frequency
FROM
customers AS c
JOIN
PurchaseFrequency AS pf ON c.CustomerId = pf.CustomerId
ORDER BY
Frequency DESC;Output:

Monetary Calculation:
Monetary measures the total spending of a customer. It looks at the cumulative revenue generated by each customer. A high monetary value means the customer is more valuable to the business.
Query:
CREATE TABLE MonetaryCal AS
WITH CustomerMonetary AS (
SELECT
ar.CustomerId,
ROUND(SUM(ar.Amount),2) AS TotalMonetary
FROM
artist_album_sales AS ar
GROUP BY
ar.CustomerId
)
SELECT
c.CustomerId,
c.FirstName || ' ' || c.LastName AS CustomerName,
COALESCE(cm.TotalMonetary, 0) AS Monetary
FROM
customers AS c
LEFT JOIN
CustomerMonetary AS cm ON c.CustomerId = cm.CustomerId
GROUP BY
c.CustomerId, c.FirstName,c.LastName
ORDER BY
Monetary DESC;Output:

Recency Calculation:
Recency refers to how recently a customer has made a purchase. Customers who purchased recently are often more engaged, responsive, and likely to buy more.
Query:
CREATE TABLE RecencyCal AS
WITH LastPurchase AS (
SELECT
ar.CustomerId,
MAX(ar.InvoiceDate) AS LastPurchaseDate
FROM
artist_album_sales AS ar
GROUP BY
ar.CustomerId
)
SELECT
c.CustomerId,
c.FirstName || ' ' || c.LastName AS CustomerName,
ROUND((julianday('now') - julianday(lp.LastPurchaseDate)), 2) AS Recency -- days since last purchase rounded to 2 decimal places
FROM
customers AS c
LEFT JOIN -- Use LEFT JOIN to include customers even if they have no purchases
LastPurchase AS lp ON c.CustomerId = lp.CustomerId
GROUP BY
c.CustomerId, c.FirstName, c.LastName, lp.LastPurchaseDate -- Group by all selected fields
ORDER BY
Recency ASC; -- Most recent customers firstOutput:

Combining RFM Values:
After calculating the Recency, Frequency, and Monetary values, I consolidated them into a new table ~ RFM Table, for easier segmentation.
CREATE TABLE RFMTable AS
SELECT DISTINCT
rc.CustomerId,
rc.CustomerName,
rc.Recency,
fc.Frequency,
mc.Monetary
FROM
RecencyCal AS rc
LEFT JOIN
Frequency AS fc ON rc.CustomerId = fc.CustomerId
LEFT JOIN
MonetaryCal AS mc ON mc.CustomerId = fc.CustomerId
ORDER BY
rc.Recency ASC, fc.Frequency DESC, mc.Monetary DESC;
Output:

Calculating the RFM Scores:
Each RFM factor is usually scored or ranked, often on a scale of 1 to 5.
1 represents the best score (e.g., most recent purchase, highest frequency, highest spending), and 5 represents the lowest score.
Query:
CREATE TABLE RFM_Scores AS
WITH RankedRFM AS (
-- Rank customers based on Recency, Frequency, and Monetary
SELECT
CustomerId,
CustomerName,
Recency,
Frequency,
Monetary,
NTILE(5) OVER (ORDER BY Recency ASC) AS RecencyScore,
NTILE(5) OVER (ORDER BY Frequency DESC) AS FrequencyScore,
NTILE(5) OVER (ORDER BY Monetary DESC) AS MonetaryScore
FROM RFMTable
)
SELECT
CustomerId,
CustomerName,
RecencyScore,
FrequencyScore,
MonetaryScore
FROM RankedRFMOutput:

Segmenting Customers:
Customers were segmented based on their RFM scores into 5 different categories as follows:
Best Customers 💎: RFM = 1, indicating recent, frequent, and high-value purchases.
Loyal Customers 🏅 (Default): RFM ≤ 3, indicating recent, frequent, and high-value purchases, but with a score less than 1.
New Customers 🆕: R ≤ 2, FM ≥ 4, indicating recent purchases but with low frequency and low monetary value.
At-Risk Customers ⚠️: R ≥ 4, FM ≤ 2, indicating purchases are not recent, but high frequency and have high monetary value.
Hibernating Customers 💤: RFM score ≥ 4, representing customers who have not made recent purchases, buy infrequently, and have low purchase value.
Query:
CREATE TABLE customer_seg AS
SELECT
CustomerId,
CustomerName,
RecencyScore,
FrequencyScore,
MonetaryScore,
CASE
WHEN RecencyScore = 1 AND FrequencyScore = 1 AND MonetaryScore = 1 THEN 'Best Customer'
WHEN RecencyScore <= 3 AND FrequencyScore <= 3 AND MonetaryScore <= 3 THEN 'Loyal Customers'
WHEN RecencyScore <= 2 AND FrequencyScore >= 4 AND MonetaryScore >= 4 THEN 'New Customers'
WHEN RecencyScore >= 4 AND FrequencyScore <= 2 AND MonetaryScore <= 2 THEN 'At Risk'
WHEN RecencyScore >= 4 AND FrequencyScore >= 4 AND MonetaryScore >= 4 THEN 'Hibernating'
ELSE 'Loyal Customers'
END AS Segment
FROM RFM_Scores;Output:
The customer_seg table was downloaded as a CSV file for further analysis and visualization.
NOTES:
The NTILE function splits each RFM metric into 5 bins (i.e. from 1 to 5):
Recency:
Bin 1 = recent purchasers
Bin 5 = less engaged.
Frequency and Monetary:
Bin 1 = high-frequency/high-value.
Bin 5 = lower values.
I used the CASE statement to categorize customers into five segments based on their RFM scores
(Link to the YOUTUBE video presentation on Customer Segmentation.)
Step 4: Market Basket Analysis (MBA):
Market Basket Analysis (MBA) is a data analysis technique used to uncover customers' purchasing patterns by identifying products that are frequently bought together. It provides valuable insights that help retailers make strategic decisions related to product placement, promotions, and bundling, all of which contribute to increasing sales.
Phases in Step 4:
Steps to Perform Market Basket Analysis:
Data Source: The analysis requires transaction data, typically obtained from POS transactions, customer receipts, or transaction logs.
Creating Association Rules: The core of MBA is the discovery of relationships between products. This is achieved by calculating three key metrics:
Support: The frequency with which a product or set of products appears in transactions.
Confidence: The probability that if product A is bought, product B will also be bought.
Lift: The likelihood that product B will be purchased when product A is bought, compared to the probability of the purchase occurring by random chance.
I analyzed sales transactions for tracks and albums at the Chinook digital store.
Link to the YOUTUBE video presentation on the Market Basket Analysis.
MBA Query for Tracks:
WITH MarketB AS (
SELECT
a.TrackName AS InitialBasket,
b.TrackName AS SecondBasket,
COUNT(1) AS TotalFrequency,
(SELECT COUNT(InvoiceId) FROM artist_album_sales)AS TotalTransactions,
(SELECT COUNT(InvoiceId) FROM artist_album_sales AS ar WHERE a.TrackName = ar.TrackName) AS Frequency_InitialBasket,
(SELECT COUNT(InvoiceId) FROM artist_album_sales AS ar WHERE b.TrackName = ar.TrackName) AS Frequency_SecondBasket
FROM artist_album_sales AS a
LEFT JOIN artist_album_sales AS b
ON a.InvoiceId=b.InvoiceId
WHERE a.TrackName>b.TrackName
GROUP BY a.TrackName,b.TrackName
)
SELECT
InitialBasket,
SecondBasket,
TotalFrequency,
ROUND(CAST(TotalFrequency AS FLOAT) * 100.0 / TotalTransactions, 2) AS Support,
ROUND(CAST(TotalFrequency AS FLOAT)* 100.0 / Frequency_InitialBasket, 2) AS Confidence,
ROUND(((CAST(TotalFrequency AS FLOAT) * 100.0) / TotalTransactions) /(((CAST(Frequency_InitialBasket AS FLOAT) * 100.0) / TotalTransactions) *
((CAST(Frequency_SecondBasket AS FLOAT) * 100.0) / TotalTransactions)), 2) AS Lift
FROM MarketBOutput:
The MarketB table was downloaded as a CSV file for further analysis and visualization in Power BI.
MBA Query for Albums:
Next, I also created the MBA for Albums:
WITH MarketBSK AS (
SELECT
a.AlbumTitle AS InitialBasket,
b.AlbumTitle AS SecondBasket,
COUNT(1) AS TotalFrequency,
(SELECT COUNT(InvoiceId) FROM artist_album_sales)AS TotalTransactions,
(SELECT COUNT(InvoiceId) FROM artist_album_sales AS ar WHERE a.AlbumTitle = ar.AlbumTitle) AS Frequency_InitialBasket,
(SELECT COUNT(InvoiceId) FROM artist_album_sales AS ar WHERE b.AlbumTitle = ar.AlbumTitle) AS Frequency_SecondBasket
FROM artist_album_sales AS a
LEFT JOIN artist_album_sales AS b
ON a.InvoiceId=b.InvoiceId
WHERE a.AlbumTitle>b.AlbumTitle
GROUP BY a.AlbumTitle,b.AlbumTitle
)
SELECT
InitialBasket,
SecondBasket,
TotalFrequency,
ROUND(CAST(TotalFrequency AS FLOAT) * 100.0 / TotalTransactions, 2) AS Support,
ROUND(CAST(TotalFrequency AS FLOAT)*100.0/ Frequency_InitialBasket, 2) AS Confidence,
ROUND(((CAST(TotalFrequency AS FLOAT)*100.0) / TotalTransactions) /(((CAST(Frequency_InitialBasket AS FLOAT)*100.0) / TotalTransactions) *
((CAST(Frequency_SecondBasket AS FLOAT)*100.0) / TotalTransactions)), 2) AS Lift
FROM MarketBSK
Output:
The MarketBSK table was downloaded as a CSV file for further analysis and visualization in Power BI.
Step 5: Artist Image Table:
Here are the steps I took to create this table:
Data Preparation:
After generating a visual representation of the top 10 selling artists for each year, I created a Google Sheets workbook (Excel can also be used) with two columns: one for the artist names and another for the image URLs.
I then copied the list of the top artists into the Google Sheets workbook.
Image Collection:
I searched Google for images of each artist and copied the image URLs into the corresponding rows in the Google Sheets workbook.
File Conversion and Upload:
I downloaded the Google Sheets workbook as a CSV file.
I uploaded the CSV file to SQLite. This step was necessary because I wanted the Artist Image table to include ArtistIDs and the amount sold.
Table Creation:
I created a new table named TopArtist by performing a left join between the CSV table and the artists_album_sales table, which contains the ArtistIDs and amount sold.
I downloaded the table as CSV and renamed it ArtistImage
This process ensured that the TopArtist table included the artist images and the relevant sales data, making it easier to analyze and visualize the information.
CREATE TABLE TopArtist AS
SELECT
a.ArtistId,
ai.ArtistName,
a.Amount,
ai.ImageURL
FROM ArtistsImage AS ai
LEFT JOIN artist_album_sales AS a
ON TRIM(LOWER(ai.ArtistName)) = TRIM(LOWER(a.ArtistName));Result:
CONCLUSION
We have successfully extracted detailed RFM metrics for customer segmentation, allowing us to categorize customers into actionable groups. Additionally, the Market Basket Analysis identified key product combinations and their Support, Confidence, and Lift scores, which will inform strategic bundling and cross-selling recommendations.
This SQL phase of the project provides a solid foundation by transforming raw data into actionable metrics.
These insights will be visualized and explored further in Power BI to uncover trends and provide actionable business recommendations.
In the next phase, we will import the calculated metrics into Power BI to build interactive dashboards. These dashboards will provide deeper insights into the Chinook Digital Store's customer behaviors, product trends, and strategic opportunities.
Deliverables from SQL Phase:
Tables Extracted From Original File:
Fact Table: artist_album_sales Table
Customer Lifetime Value Table (CLV)
Customer Segmentation Table
Market Basket Table for Tracks (Confidence, support, and lift)
Market Basket Table for ALBUMS
TopArtist table
Key SQL Queries:
Customer segmentation query.
Association rule mining query.





Comments