Amir Daniali

Back

Business Analysis#

picture

Project Link: Kaggle Website

Come along the ride with me as we explore a complex sales dataset, find patterns about the sales and growth, and visualize our findings.

Part 1- Importing the Database and Data Cleanup#

If you want to follow along and write the code with me then click here.

Python Code Along : Jupyter Notebook Creation

We first import the database and setup a jupyter notebook. Make sure you have needed Global Superstore.txt file and get it from here. Then copy and run this python command in a jupyter notebook. Make sure to place the Global Superstore.txt file in the same directory of the notebook.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# plotly express figures aren't showing. here is a fix:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# create the dataframe
df = pd.read_csv("./Global Superstore.txt", sep="	")
python

After importing the dataset, we can start cleaning the data.

Python Code Along : Standardizing the DataFrame

# we standardize the columns
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace("-","_")
df.columns = df.columns.str.replace(" ","_")

# This column has no useful information
df=df.drop('记录数',axis=1)

# we convert the date columns into python datetime
df['order_date'] = pd.to_datetime(df['order_date'])
df['ship_date'] = pd.to_datetime(df['ship_date'])
python
  • We make sure to unify the dataset column names to avoid capital letters and turn whitespace and - into _ for unification later on.
  • We delete the useless 记录数 column to clean the dataframe. This column has no information in it.
  • We then convert the dates into datetime.

Python Code Along : Checking for Invalid and Null Values

df.info()
python

If we run the above command to find the shape of the dataframe, we get the result below:

  • We then check to see if the database has any missing values by checking the info, which it doesn’t
  • Therefore, we don’t need to replace null values with default values.

Python Code Along : Checking for Duplicate Values

duplicate_series = df.duplicated()
print("\nDuplicate counts:")
print(duplicate_series.value_counts())
python

If we run the above command, we get the result below:

Duplicate counts:
False    51290
Name: count, dtype: int64
txt
  • we then check if there are any duplicates in the dataframe, which there aren’t any.

Part 2- Understanding the Data#

In order to understand the data better we use multiple built-in pandas modules like describe() and value_counts().

Python Code Along : Describing the Data

df[['sales','discount','profit','quantity', 'shipping_cost']].describe()
python

We find out what the big picture looks like as shown below. It turns out we have more than 50k rows of data. Sales values range widely, with a mean of about 246 but a maximum exceeding 22,000. This suggests some very large transactions. Discounts are usually small, averaging around 14 percent, with most at zero. Profit varies greatly from large losses (−6,600) to big gains (over 8,000), indicating inconsistent profitability. We should try to diagnose the negative profitability later. A store of this size shouldn’t have transactions with -$6000 profit. It is just bad for business.

Quantities are low on average. They are about 3 to 4 items per order. This suggests that they are mostly small purchases. Shipping costs correlate somewhat with order size, averaging about 26 but reaching as high as 933, showing strong variation. We will have to investigate later.

Overall, the data shows large disparities between small, frequent sales and a few very large, high value orders.

salesdiscountprofitquantityshipping_cost
count51290.0051290.0051290.0051290.0051290.00
mean246.4984400.14290828.6109823.47654526.375818
std487.5671750.212280174.3409722.27876657.296810
min0.0000000.000000-6599.9780001.0000000.002000
25%31.0000000.0000000.0000002.0000002.610000
50%85.0000000.0000009.2400003.0000007.790000
75%251.0000000.20000036.8100005.00000024.450000
max22638.0000000.8500008399.97600014.000000933.570000

So at this stage we have some fundamental questions to ask about the data.

  • How do we have rows with a sales amount of 0?
  • How do we have rows with 0 or -6599.978 dollars of profit? How can profit be negative? Is this error or normal?
  • Seems like its very important to know how many items were sold with zero or negative profits. Perhaps grouped by the store location to determine store profitibility? Or maybe to diagnose the cause later on.
  • How are each stores or markets fairing based on their sales figuers? Which items have the highest profit?
  • Seems like the global superstore is not selling anything beside furniture, office supplies, and technologies. How can they make such massive losses like -6599 in profit if they arent selling risky materials like food?

Python Code Along : Understanding Each Column


Countries#

CountryCount
United States9994
Australia2837
France2827
Mexico2644
Germany2065
South Sudan2
Chad2
Swaziland2
Eritrea2
Bahrain2
Total unique countries:147

There are 147 unique countries in the dataset. We can see that the stores are mostly from the US. There are also a few countries like South Sudan, Chad, Swaziland, Eritrea, and Bahrain that have just a few stores as well.


Categories#

CategoryCount
Office Supplies31273
Technology10141
Furniture9876

We can see that the categories are Office Supplies, Technology, and Furniture. There Office Supplies have the most items sold, followed by Technology and Furniture.


Shipping#

Ship ModeCount
Standard Class30775
Second Class10309
First Class7505
Same Day2701

It turns out most of the shipments were standard class. Same Day delivery is the least common shipment mode.

Final DataFrame#

After we finally analyze all the important columns, we can show the head of the DataFrame. We will later visualize this final DataFrame.

categorycitycountrycustomer_idcustomer_namediscountmarketorder_dateorder_idorder_priorityproduct_idproduct_nameprofitquantityregionrow_idsalessegmentship_dateship_modeshipping_coststatesub_categoryyearmarket2weeknumpocessing_timemarket_expandedmonthgross_margin
Office SuppliesLos AngelesUnited StatesLS-172304Lycoris Saunders0.0US2011-01-07CA-2011-130813HighOFF-PA-10002005Xerox 2259.33123West3662419Consumer2011-01-09Second Class4.37CaliforniaPaper2011North America22United StatesJanuary49.111579
Office SuppliesLos AngelesUnited StatesMV-174854Mark Van Huff0.0US2011-01-21CA-2011-148614MediumOFF-PA-10002893Wirebound Service Call Books, 5 1/2” x 4”9.29282West3703319Consumer2011-01-26Standard Class0.94CaliforniaPaper2011North America45United StatesJanuary48.909474
Office SuppliesLos AngelesUnited StatesCS-121304Chad Sievert0.0US2011-08-05CA-2011-118962MediumOFF-PA-10000659Adams Phone Message Book, Professional, 400 Me…9.84183West3146821Consumer2011-08-09Standard Class1.81CaliforniaPaper2011North America324United StatesAugust46.865714
Office SuppliesLos AngelesUnited StatesCS-121304Chad Sievert0.0US2011-08-05CA-2011-118962MediumOFF-PA-10001144Xerox 191353.26082West31469111Consumer2011-08-09Standard Class4.59CaliforniaPaper2011North America324United StatesAugust47.982703
Office SuppliesLos AngelesUnited StatesAP-109154Arthur Prichep0.0US2011-09-29CA-2011-146969HighOFF-PA-10002105Xerox 2233.11041West324406Consumer2011-10-03Standard Class1.32CaliforniaPaper2011North America404United StatesSeptember51.840000

Part 3- Data Aggregating and Visualization#

Now that we have all the data we need, it is time to start visualizing. I have chosen the following data visualizations:

Category Distribution#

Python Code Along : Visulaizing Category Distribution

Category Distribution

We find out that most of our sales are office supplies.

Click on each subcategory to see the sales and products of that subcategory. Click the top bar to go back to the main view.

Category Overview#

Python Code Along : Products Category Overview

# Products overview based on category
fig = px.sunburst(df,
    path=['category','sub_category', 'product_name'],
    values='sales',
    hover_data ='sales'
    )
fig.update_layout(height=600,title_text='Products overview based on category')
fig.show()
python

Click on each subcategory to see the sales and products of that subcategory. Click again to go back to the main view.

We can very well see the subcategories of our products.

Customer Lifetime Value#

Python Code Along : Customer Lifetime Value

Customer Lifetime

We can see our most profitable and least profitable customers together. We seem to have a customer named Cindy Steward that has a very negative amount of profit. Lets investigate further.

Case Study: Visualizing Cindy Steward#

Python Code Along : Visualizing Cindy Steward

Cindy Stewart is the most unprofitable customer found. We should find who Cindy Stewart is and how she was able to make so much money just buying stuff from this company, leading to us losing so much money.

Cindy

It turns out it we should stop selling the cubify CubeX and sandisk memory products and we would be so much less unprofitable.

Sales of each subcategory#

Python Code Along : Sales of each subcategory

Total sales by subcategory

Mean sales by subcategory

We see that the store is selling way too many tables. Probably because the price is way too cheap and unprofitable for us (and profitable for customers) as we are going to find out later.

Sales in each market#

Python Code Along : Sales in each market

Sales in each market#

Python Code Along : Category gross margin

Part 4- Conclusions#

By utilizing pandas, python and plotting libraries like matplotlib, seaborn and plotly we learned.

  • High discounts correlate with more unprofitable sales.
  • Technology category leads in profitability, while Furniture often shows lower margins.
  • United States dominates in total sales, canada is the best market for expansion.
  • Many many states are overall unprofitable. we should really address this. maybe close the worst branches.
  • Standard shipping class is the most used shipping.
  • Subcategories like Copiers and Phones are consistently profitable, making them strategic focus areas.
  • We should stop selling tables or agressively increase prices.
  • Some items like CubeX are unprofitable. We should address these items.
Business Data Analysis Portfolio
https://amirdaniali.com/projects/business_analysis/
Author Amir Daniali
AI Disclaimer Fully human! No AI was used.
Published at October 28, 2025