Business Analysis#

Project Link: Kaggle Website

Come along for the ride as we explore a complex sales dataset, find patterns in its sales and growth, and visualize our findings.

You can follow along with the project and the code by expanding the collapsed code samples.

Click me

print("Hello World")

python

Part 1- Importing the Database and Data Cleanup#

If you want to follow along and write the code with me then click here.

Python Code Along : Jupyter Notebook Creation

We first set up a Jupyter notebook and import the dataset. Make sure you have the Global Superstore.txt file (get it from here) and place it in the same directory as the notebook. Then copy and run this Python code in the notebook.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# plotly express figures aren't showing. here is a fix:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# create the dataframe
df = pd.read_csv("./Global Superstore.txt", sep="\t")  # the file is tab-separated

python

After importing the dataset, we can start cleaning the data.

Python Code Along : Standardizing the DataFrame

# we standardize the columns
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace("-","_")
df.columns = df.columns.str.replace(" ","_")

# This column has no useful information
df=df.drop('记录数',axis=1)

# we convert the date columns into python datetime
df['order_date'] = pd.to_datetime(df['order_date'])
df['ship_date'] = pd.to_datetime(df['ship_date'])

python

We standardize the column names: we lowercase everything and replace whitespace and - with _ so the columns are easy to reference later on.
We drop the 记录数 column (Chinese for "number of records"), which carries no useful information.
We then convert the date columns to datetime.
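To see what the renaming does without loading the full dataset, here is a minimal sketch on a toy frame. The three header names are hypothetical examples in the raw style (capitals, spaces, hyphens), not a claim about the exact headers in the file.

```python
import pandas as pd

# Toy frame with hypothetical raw-style headers (assumption, for illustration only)
toy = pd.DataFrame(columns=["Order Date", "Sub-Category", "Ship Mode"])

# Same three steps as above, chained; regex=False treats "-" and " " as plain text
toy.columns = (
    toy.columns
    .str.lower()
    .str.replace("-", "_", regex=False)
    .str.replace(" ", "_", regex=False)
)

print(list(toy.columns))
```

After this, every column can be reached with the same lowercase, underscore-separated style, for example toy["sub_category"].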

Python Code Along : Checking for Invalid and Null Values

df.info()

python

If we run the above command to get a summary of the dataframe (columns, non-null counts, and dtypes), we get the result below:

<class 'pandas.core.frame.DataFrame'>
    RangeIndex: 51290 entries, 0 to 51289
    Data columns (total 26 columns):
    #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
    0   category        51290 non-null  object
    1   city            51290 non-null  object
    2   country         51290 non-null  object
    3   customer_id     51290 non-null  object
    4   customer_name   51290 non-null  object
    5   discount        51290 non-null  float64
    6   market          51290 non-null  object
    7   order_date      51290 non-null  datetime64[ns]
    8   order_id        51290 non-null  object
    9   order_priority  51290 non-null  object
    10  product_id      51290 non-null  object
    11  product_name    51290 non-null  object
    12  profit          51290 non-null  float64
    13  quantity        51290 non-null  int64
    14  region          51290 non-null  object
    15  row_id          51290 non-null  int64
    16  sales           51290 non-null  int64
    17  segment         51290 non-null  object
    18  ship_date       51290 non-null  datetime64[ns]
    19  ship_mode       51290 non-null  object
    20  shipping_cost   51290 non-null  float64
    21  state           51290 non-null  object
    22  sub_category    51290 non-null  object
    23  year            51290 non-null  int64
    24  market2         51290 non-null  object
    25  weeknum         51290 non-null  int64
    dtypes: datetime64[ns](2), float64(3), int64(5), object(16)
    memory usage: 10.2+ MB

txt

We then check the info output to see whether the dataset has any missing values. It doesn’t: every column has 51290 non-null entries.
Therefore, we don’t need to replace null values with default values.
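Reading the non-null counts by eye works, but the same check can be done directly. Here is a small sketch on a toy frame (the values are made up for illustration) that counts missing values per column:

```python
import numpy as np
import pandas as pd

# Toy data with two deliberately missing values (hypothetical, for illustration)
toy = pd.DataFrame({
    "sales": [19, 21, np.nan],
    "city": ["Los Angeles", None, "Hadera"],
})

null_counts = toy.isna().sum()          # missing values per column
total_nulls = int(null_counts.sum())    # missing values in the whole frame

print(null_counts)
print("total missing:", total_nulls)
```

On our dataset the same call, df.isna().sum(), should report zero for every column, given the info output above.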

Python Code Along : Checking for Duplicate Values

duplicate_series = df.duplicated()
print("\nDuplicate counts:")
print(duplicate_series.value_counts())

python

If we run the above command, we get the result below:

Duplicate counts:
False    51290
Name: count, dtype: int64

txt

We then check whether there are any duplicate rows in the dataframe. There aren’t any.
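If duplicates had shown up, a quick way to count and drop them could look like the sketch below. The toy rows are invented for illustration.

```python
import pandas as pd

# Toy data where the second row repeats the first (hypothetical values)
toy = pd.DataFrame({
    "order_id": ["A-1", "A-1", "B-2"],
    "product_id": ["P1", "P1", "P2"],
    "sales": [19, 19, 21],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```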

Part 2- Understanding the Data#

To understand the data better, we use built-in pandas methods like describe() and value_counts().

Python Code Along : Describing the Data

df[['sales','discount','profit','quantity', 'shipping_cost']].describe()

python

The big picture is shown below. We have more than 50k rows of data. Sales values range widely, with a mean of about 246 but a maximum exceeding 22,000, which suggests a few very large transactions. Discounts are usually small: they average around 14 percent, and at least half of the rows have no discount at all. Profit varies greatly, from large losses (about −6,600) to big gains (about 8,400), indicating inconsistent profitability. We should diagnose the negative profits later, because a single transaction that loses about $6,600 is bad for business.

Quantities are low: the mean is about 3.5 items per row and the median is 3, so these are mostly small purchases. Shipping costs average about 26 but reach as high as 933, which is a strong variation. They plausibly scale with order size, but describe() alone cannot confirm that, so we will have to investigate later.

Overall, the data shows large disparities between small, frequent sales and a few very large, high value orders.

	sales	discount	profit	quantity	shipping_cost
count	51290.00	51290.00	51290.00	51290.00	51290.00
mean	246.498440	0.142908	28.610982	3.476545	26.375818
std	487.567175	0.212280	174.340972	2.278766	57.296810
min	0.000000	0.000000	-6599.978000	1.000000	0.002000
25%	31.000000	0.000000	0.000000	2.000000	2.610000
50%	85.000000	0.000000	9.240000	3.000000	7.790000
75%	251.000000	0.200000	36.810000	5.000000	24.450000
max	22638.000000	0.850000	8399.976000	14.000000	933.570000
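Whether shipping cost really moves with order size is something we can check with a correlation matrix. This is only a sketch on invented toy numbers; on the real data you would call .corr() on df[['sales', 'quantity', 'shipping_cost']].

```python
import pandas as pd

# Toy numbers (hypothetical) where shipping cost grows with the sales amount
toy = pd.DataFrame({
    "sales": [10, 50, 200, 1000],
    "quantity": [1, 2, 3, 5],
    "shipping_cost": [1.0, 4.0, 20.0, 90.0],
})

corr = toy.corr()  # Pearson correlation between every pair of columns
sales_vs_shipping = corr.loc["sales", "shipping_cost"]

print(corr.round(2))
```

A value close to 1 means the two columns rise together, a value near 0 means there is no linear relationship.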

So at this stage we have some fundamental questions to ask about the data.

How do we have rows with a sales amount of 0?
How do we have rows with 0 or -6599.978 dollars of profit? How can profit be negative? Is this an error or normal?
It seems important to know how many items were sold at zero or negative profit, perhaps grouped by store location to determine store profitability, or to diagnose the cause later on.
How is each store or market faring based on its sales figures? Which items have the highest profit?
The Global Superstore seems to sell nothing besides furniture, office supplies, and technology. How can it make a loss as large as -6599 on a sale if it isn’t selling risky goods like food?
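The first few questions can be turned into a quick count. Below is a minimal sketch on a toy frame with invented values. It assumes the column names sales, profit, and market from our cleaned dataframe, and it groups by market only as one possible choice of "location".

```python
import pandas as pd

# Toy rows (hypothetical values) covering zero, negative and positive profit
toy = pd.DataFrame({
    "market": ["US", "US", "EU", "EU", "APAC"],
    "sales": [19, 0, 111, 250, 85],
    "profit": [9.33, 0.0, -40.0, 36.81, -6.5],
})

# Rows with a sales amount of 0
zero_sales = toy[toy["sales"] == 0]

# Flag zero and negative profit, then count the flags per market
flags = toy.assign(
    zero_profit=toy["profit"] == 0,
    negative_profit=toy["profit"] < 0,
)
summary = flags.groupby("market")[["zero_profit", "negative_profit"]].sum()

print(len(zero_sales))
print(summary)
```

Swapping "market" for "country" or "state" in the groupby gives the same counts at a finer level.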

Python Code Along : Understanding Each Column

print(df['country'].value_counts(),"\n")
# Mostly US. Followed by Australia, France, Mexico, ...
print(df['city'].value_counts(),"\n")
# New York City, Los Angeles, Philadelphia, ...
print(df['category'].value_counts(),"\n")
# Office Supplies, Technology, Furniture
print(df['sub_category'].value_counts(),"\n")
# Lots of subcategories
print(df['market'].value_counts(),"\n")
# Global Market Abbr, will expand later
print(df['product_id'].value_counts(),"\n")
# Most frequent product ID is OFF-AR-10003651 = Newell 350 (some kind of Art supply)
print(df['ship_mode'].value_counts(),"\n")
# Standard Class, Second Class, First Class, Same Day
print(df['weeknum'].value_counts().head(10),"\n")
# Gives us the 10 weeks with the most order rows.
df["pocessing_time"] = (df["ship_date"] - df["order_date"])
# difference between customer orders and us shipping
df["pocessing_time"] = df["pocessing_time"].apply(lambda x: x.days)
df['pocessing_time'].value_counts() # gives us the days between customer orders and us shipping

python

Countries#

Country	Count
United States	9994
Australia	2837
France	2827
Mexico	2644
Germany	2065
…	…
South Sudan	2
Chad	2
Swaziland	2
Eritrea	2
Bahrain	2
Total unique countries:	147

There are 147 unique countries in the dataset. Most rows come from the US (9,994), while countries like South Sudan, Chad, Swaziland, Eritrea, and Bahrain appear only twice each.

Cities#

City	Count
New York City	915
Los Angeles	747
Philadelphia	537
San Francisco	510
Santo Domingo	443
…	…
Hadera	1
Morley	1
Villeneuve-la-Garenne	1
Torremolinos	1
Redwood City	1
Total unique cities:	3636

The four most frequent cities are all in the US. At the other end, cities like Hadera, Morley, Villeneuve-la-Garenne, Torremolinos, and Redwood City appear only once.

Markets#

Market	Count
APAC	11002
LATAM	10294
EU	10000
US	9994
EMEA	5029
Africa	4587
Canada	384

Most rows come from APAC (Asia Pacific), followed by LATAM (Latin America), the EU, and the US.

Categories#

Category	Count
Office Supplies	31273
Technology	10141
Furniture	9876

The categories are Office Supplies, Technology, and Furniture. Office Supplies has the most rows by far, followed by Technology and Furniture.

Sub Categories#

Sub-Category	Count
Binders	6152
Storage	5059
Art	4883
Paper	3538
Chairs	3434
Phones	3357
Furnishings	3170
Accessories	3075
Labels	2606
Envelopes	2435
Supplies	2425
Fasteners	2420
Bookcases	2411
Copiers	2223
Appliances	1755
Machines	1486
Tables	861

There are 17 sub-categories. Binders, Storage, and Art are the most frequent, while Tables is the least frequent.

Products#

Product ID	Count
OFF-AR-10003651	35
OFF-AR-10003829	31
OFF-BI-10002799	30
OFF-BI-10003708	30
FUR-CH-10003354	28
…	…
TEC-PH-10001146	1
FUR-TA-10001289	1
OFF-CUI-10001302	1
OFF-AP-10002421	1
TEC-MA-10001031	1
Total unique products:	10292

There are 10292 unique products in the dataset. Some products, like TEC-PH-10001146, appear only once. The most frequent product IDs are in the OFF (Office Supplies) category.

Shipping#

Ship Mode	Count
Standard Class	30775
Second Class	10309
First Class	7505
Same Day	2701

It turns out most of the shipments were standard class. Same Day delivery is the least common shipment mode.

Weeks#

Week Number	Count
47	1527
46	1524
45	1508
52	1461
38	1453
48	1441
49	1440
39	1426
51	1381
50	1378

There are 52 unique weeks in the dataset. Most of the busiest weeks fall in the last quarter of the year, and weeks 47, 46, and 45 have the most order rows. This is useful insight for planning promotions and staffing.

Processing Time#

Processing Time	Count
4	14434
5	11221
2	7026
6	6255
3	5035
7	3057
0	2600
1	1662

The most common processing time for a customer order is 4 days, which is also roughly the mean. Many orders still take 6 or 7 days to process, and some ship in less than 2 days.
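We can back this up with a little arithmetic on the counts from the table above. The sketch below only re-uses those printed counts, so it runs without the dataset. On the real dataframe the equivalent would be df['pocessing_time'].mean() and .mode().

```python
import pandas as pd

# Counts copied from the Processing Time table (days -> number of rows)
counts = pd.Series({4: 14434, 5: 11221, 2: 7026, 6: 6255,
                    3: 5035, 7: 3057, 0: 2600, 1: 1662})

total = int(counts.sum())

# Weighted mean: each number of days weighted by how often it occurs
mean_days = sum(days * n for days, n in counts.items()) / total

# Most common processing time (the mode)
most_common = int(counts.idxmax())

# Share of rows that take 6 days or longer
share_slow = counts[counts.index >= 6].sum() / total

print(total, round(mean_days, 2), most_common, round(share_slow, 3))
```

The mean comes out just under 4 days, and a bit under a fifth of the rows take 6 days or longer.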

Discounts#

We can add a cell to our Jupyter notebook that labels each row by how large its discount is.

def discount_labeling(row):
    if row['discount'] == 0:
        discount_label = 'none'
    elif row['discount'] < 0.10:
        discount_label = 'low'
    elif row['discount'] < 0.30:
        discount_label = 'medium'
    elif row['discount'] < 0.60:
        discount_label = 'high'
    else:
        discount_label = 'extreme'
    return discount_label

df['discount_bucket'] = df.apply(discount_labeling, axis=1)
print(df['discount_bucket'].value_counts()) # mostly no discount
negative_profits = df[df['profit'] < 0]
print("\nDiscount Bucket for sales with negative profits")
print(negative_profits['discount_bucket'].value_counts())

python

Discount Bucket for all sales#

Discount Bucket for all sales	Count
none	29009
medium	10969
high	6551
extreme	4150
low	611

Most of our sales are not discounted. Interestingly, “low” discounts (above 0 but below 10 percent) are rare, with only 611 rows.

We then filter the negative-profit rows and look at how they relate to discounts.

Discount Bucket for Sales with Negative Profits#

Discount Bucket	Count
high	5710
extreme	4150
medium	2641
low	43

The loss-making sales mostly carry high or extreme discounts. Maybe at the end of a season we have to give large discounts to clear stock.
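Putting the two tables side by side makes the pattern clearer. The sketch below divides the negative-profit counts by the overall counts for each bucket. The numbers are copied from the tables above. The one assumption is that the "none" bucket has zero negative-profit rows, since it does not appear in the second table.

```python
import pandas as pd

# Counts copied from "Discount Bucket for all sales"
all_sales = pd.Series({"none": 29009, "low": 611, "medium": 10969,
                       "high": 6551, "extreme": 4150})

# Counts copied from "Discount Bucket for Sales with Negative Profits"
negative = pd.Series({"low": 43, "medium": 2641, "high": 5710, "extreme": 4150})

# Assumption: a bucket missing from the second table has 0 loss-making rows
negative = negative.reindex(all_sales.index, fill_value=0)

# Share of rows in each bucket that lost money
loss_share = (negative / all_sales).round(3)

print(loss_share)
```

Every single row in the extreme bucket lost money, and so did roughly 87 percent of the high bucket, compared to about 7 percent of the low bucket.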

Final DataFrame#

After analyzing all the important columns, we can show the head of the DataFrame. This is the DataFrame we will visualize in the next part.

category	city	country	customer_id	customer_name	market	order_date	order_id	order_priority	product_id	product_name	profit	quantity	region	row_id	sales	segment	ship_date	ship_mode	shipping_cost	state	sub_category	year	market2	weeknum	pocessing_time	market_expanded	month	gross_margin
Office Supplies	Los Angeles	United States	LS-172304	Lycoris Saunders	US	2011-01-07	CA-2011-130813	High	OFF-PA-10002005	Xerox 225	9.3312	3	West	36624	19	Consumer	2011-01-09	Second Class	4.37	California	Paper	2011	North America	2	2	United States	January	49.111579
Office Supplies	Los Angeles	United States	MV-174854	Mark Van Huff	US	2011-01-21	CA-2011-148614	Medium	OFF-PA-10002893	Wirebound Service Call Books, 5 1/2” x 4”	9.2928	2	West	37033	19	Consumer	2011-01-26	Standard Class	0.94	California	Paper	2011	North America	4	5	United States	January	48.909474
Office Supplies	Los Angeles	United States	CS-121304	Chad Sievert	US	2011-08-05	CA-2011-118962	Medium	OFF-PA-10000659	Adams Phone Message Book, Professional, 400 Me…	9.8418	3	West	31468	21	Consumer	2011-08-09	Standard Class	1.81	California	Paper	2011	North America	32	4	United States	August	46.865714
Office Supplies	Los Angeles	United States	CS-121304	Chad Sievert	US	2011-08-05	CA-2011-118962	Medium	OFF-PA-10001144	Xerox 1913	53.2608	2	West	31469	111	Consumer	2011-08-09	Standard Class	4.59	California	Paper	2011	North America	32	4	United States	August	47.982703
Office Supplies	Los Angeles	United States	AP-109154	Arthur Prichep	US	2011-09-29	CA-2011-146969	High	OFF-PA-10002105	Xerox 223	3.1104	1	West	32440	6	Consumer	2011-10-03	Standard Class	1.32	California	Paper	2011	North America	40	4	United States	September	51.840000
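The table above has three columns that none of the code so far creates: market_expanded, month, and gross_margin. Here is a minimal sketch of how they could be derived, reverse-engineered from the values in the table. The gross margin matches profit / sales * 100 for the rows shown, and month matches the month name of order_date. The market mapping is an assumption: only "US" to "United States" is visible in the table, so other markets simply keep their abbreviation here.

```python
import pandas as pd

# Two rows copied from the table above (only the columns we need)
toy = pd.DataFrame({
    "market": ["US", "US"],
    "order_date": pd.to_datetime(["2011-01-07", "2011-08-05"]),
    "sales": [19, 111],
    "profit": [9.3312, 53.2608],
})

# Hypothetical mapping: only the "US" entry is confirmed by the table
market_names = {"US": "United States"}
toy["market_expanded"] = toy["market"].map(market_names).fillna(toy["market"])

# Month name of the order date, e.g. "January"
toy["month"] = toy["order_date"].dt.month_name()

# Gross margin in percent; rows with zero sales become NaN instead of dividing by 0
nonzero_sales = toy["sales"].where(toy["sales"] != 0)
toy["gross_margin"] = toy["profit"] / nonzero_sales * 100

print(toy[["market_expanded", "month", "gross_margin"]])
```

Guarding against zero sales matters here because describe() showed a minimum sales value of 0.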

Part 3- Data Aggregating and Visualization#

Now that we have all the data we need, it is time to start visualizing. I have chosen the following data visualizations:

Category Distribution#

Python Code Along : Visualizing Category Distribution

# Products overview based on subcategory
product_sales = df.groupby(['sub_category', 'product_name'])['sales'].sum().reset_index()

top_5_products_per_subcategory = (
    product_sales
    .sort_values(['sub_category', 'sales'], ascending=[True, False])
    .groupby('sub_category')
    .head(5)
)

print(top_5_products_per_subcategory)

category_counts = df['category'].value_counts()

plt.figure(figsize=(5, 5))
plt.pie(
    category_counts,
    labels=category_counts.index,
    autopct='%1.2f%%',
    startangle=180,
    colors=sns.color_palette('plasma', len(category_counts))
)
plt.title('Category Distribution')
plt.show()

fig = px.treemap(
    product_sales,
    path=['sub_category', 'product_name'],
    values='sales',
    title='Treemap of Product Sales Grouped by Subcategory',
    color='sales',
    color_continuous_scale='Plasma',
    hover_data={'sales': ':.2f'}
)

fig.show()

python

Category Distribution

We find out that most of our sales are office supplies.

Click on each subcategory to see the sales and products of that subcategory. Click the top bar to go back to the main view.

Category Overview#

Python Code Along : Products Category Overview

# Products overview based on category
fig = px.sunburst(
    df,
    path=['category', 'sub_category', 'product_name'],
    values='sales',
    hover_data=['sales']
)
fig.update_layout(height=600,title_text='Products overview based on category')
fig.show()

python

Click on each subcategory to see the sales and products of that subcategory. Click again to go back to the main view.

The sunburst makes it easy to see the subcategories of each category.

Customer Lifetime Value#

Python Code Along : Customer Lifetime Value

customer_lifetime_value = (
    df.groupby('customer_name')['profit']
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)

print(customer_lifetime_value)

top_30 = customer_lifetime_value.head(30)
bottom_30 = customer_lifetime_value.tail(30)

# Set the style before creating the axes so it applies to them.
sns.set(style="whitegrid")
# I want to show these side by side, so we need subplots.
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 8))

sns.barplot(
    data=top_30,
    y='customer_name',
    x='profit',
    palette='plasma',
    ax=axs[0]
)
axs[0].set_title('Top 30 Customers by Lifetime Profit')
axs[0].set_xlabel('Total Profit')
axs[0].set_ylabel('Customer Name')

sns.barplot(
    data=bottom_30,
    y='customer_name',
    x='profit',
    palette='plasma',
    ax=axs[1]
)
axs[1].set_title('Bottom 30 Customers by Lifetime Profit')
axs[1].set_xlabel('Total Profit')
axs[1].set_ylabel('Customer Name')

plt.tight_layout()  # keep the long customer names from overlapping the other plot
plt.show()

python

Customer Lifetime

We can see our most profitable and least profitable customers side by side. One customer, Cindy Stewart, has a very negative total profit. Let’s investigate further.

Case Study: Visualizing Cindy Stewart#

Python Code Along : Visualizing Cindy Stewart

cindy_data = df[df['customer_name'] == 'Cindy Stewart']

# Find the list of all products she bought and the profit of all of those.
product_profits = (
    cindy_data.groupby('product_name')['profit']
    .sum()
    .reset_index()
    .sort_values('profit', ascending=False)
)

plt.figure(figsize=(8, 10))
sns.barplot(
    data=product_profits,
    x='profit',
    y='product_name',
    palette='plasma'
)

plt.title("Products Purchased by Cindy Stewart and Profits")
plt.xlabel("Total Profit")
plt.ylabel("Product Name")
plt.show()

python

Cindy Stewart is the least profitable customer we found. We should find out who she is and how her purchases ended up costing us so much money.

Cindy

It turns out we should stop selling the Cubify CubeX and SanDisk memory products. Without them we would lose much less money.

Sales of each subcategory#

Python Code Along : Sales of each subcategory

# Total sales of all subcategories
subcategory_sales = df.groupby('sub_category')['sales'].agg(['sum', 'mean']).reset_index()
# mean = total sales for subcategory / number of rows (transactions) in that subcategory
# sum = total sale for subcategory

plt.figure(figsize=(10, 6))
sns.barplot(
    data=subcategory_sales,
    x='sub_category',
    y='sum',
    palette='plasma'
)
plt.title('Total Sales by Subcategory')
plt.xlabel('Subcategory')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(10, 6))
sns.barplot(
    data=subcategory_sales,
    x='sub_category',
    y='mean',
    palette='plasma'
)
plt.title('Mean Sales per Transaction by Subcategory')
plt.xlabel('Subcategory')
plt.ylabel('Mean Sales')
plt.xticks(rotation=45)
plt.show()

python

Total sales by subcategory

Mean sales by subcategory

The store appears to be selling too many tables, probably because they are priced too low: unprofitable for us and a bargain for customers, as we are going to find out later.

Sales in each market#

Python Code Along : Sales in each market

# Sales in each market for each country
sales_by_market_country = (
    df.groupby(['market_expanded', 'country'])['sales']
    .sum()
    .reset_index()
    .sort_values(['market_expanded', 'sales'], ascending=[True, False])
)

markets = sales_by_market_country['market_expanded'].unique()

# Draw one bar chart per market and keep the figures in a list
figs = []
for market in markets:
    market_data = sales_by_market_country[sales_by_market_country['market_expanded'] == market]
    fig = px.bar(
        market_data,
        x='country',
        y='sales',
        title=f'Sales by Country in {market} Market',
        labels={'sales': 'Total Sales', 'country': 'Country'},
        color='country'
    )
    fig.show()
    figs.append(fig)

python

Category gross margin#

Python Code Along : Category gross margin

# Category Gross Margin and Sales and profit of sub categories

sales_profit_by_category_market = (
    df.groupby(['category', 'market_expanded'])[['sales', 'profit']]
    .sum()
    .reset_index()
)

# melting lets us plot different columns in the same figure
melted_main = sales_profit_by_category_market.melt(
    id_vars=['category', 'market_expanded'],
    value_vars=['sales', 'profit'],
    var_name='Metric',
    value_name='Value'
)


fig_main = px.bar(
    melted_main,
    x='category',
    y='Value',
    color='Metric',
    barmode='group',
    facet_col='market_expanded',
    title='Total Sales and Profit by Category Across Markets',
    labels={'Value': 'Amount', 'category': ' '}
)
# The blank 'category' label avoids repeating the axis title in every facet.

for i in range(len(fig_main.layout.annotations)):
    fig_main.layout.annotations[i].text = fig_main.layout.annotations[i].text.split('=')[-1]
    # Just the name of the market not the market_expanded part
fig_main.write_html("category_gross.html")
fig_main.show()


sales_profit_by_category_market['gross_margin'] = sales_profit_by_category_market['profit'] / sales_profit_by_category_market['sales']

# gross margin based on category/market
fig_margin = px.bar(
    sales_profit_by_category_market,
    x='category',
    y='gross_margin',
    facet_col='market_expanded',
    color='category',
    title='Gross Margin by Category Across Markets',
    labels={'gross_margin': 'Gross Margin', 'category': ' '}
)

for i in range(len(fig_margin.layout.annotations)):
    fig_margin.layout.annotations[i].text = fig_margin.layout.annotations[i].text.split('=')[-1] # Just the name of the market not the market_expanded part
fig_margin.write_html("category_gross_margin.html")
fig_margin.show()

python

Part 4- Conclusions#

Using Python, pandas, and plotting libraries like matplotlib, seaborn, and plotly, we learned the following:

High discounts correlate with more unprofitable sales.
The Technology category leads in profitability, while Furniture often shows lower margins.
The United States dominates in total sales, and Canada is the best market for expansion.
Many states are unprofitable overall. We should address this, perhaps by closing the worst branches.
Standard Class is the most used shipping mode.
Subcategories like Copiers and Phones are consistently profitable, making them strategic focus areas.
We should stop selling tables or aggressively increase their prices.
Some items like the CubeX are unprofitable. We should address these items.