Hot Cereals vs Cold Cereals: A 5-Day Data Analysis Challenge for Beginners (Part One)

Lina
5 min read · Jan 21, 2022


Everyone loves cereal! A warm bowl of cereal contains all the energy you need to start your day. In this post, I would like to write about a little challenge I did on Kaggle. I found a post about a 5-day data challenge that is friendly for beginners. I plan to write up the whole challenge, but this post covers day 1 through day 3, which go from simple exploration to statistical analysis. I also modified some parts of the challenge, so sit comfortably and read ^^

Day-1 Challenge

The day-1 challenge is very simple: import the libraries that will be used in this challenge and create a summary of the dataset. Before diving into that, I first read the information about the dataset on its dataset page, 80 Cereals.

import libraries

The libraries I used include pandas, matplotlib, seaborn (imported as sns), and scipy.stats, since I expected to do some statistical analysis too. Once they were imported, I read the data using pd.read_csv() and saved it in the variable df. Next, to summarize the dataset, I used df.head(), df.info(), and df.describe(). For the description, I added the parameter include = 'all' to display descriptive statistics for both numerical and categorical values.
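A minimal sketch of the load-and-summarize step described above. The helper name load_and_summarize is hypothetical, and only pandas is imported here; matplotlib, seaborn, and scipy.stats would be imported the same way when they are needed.

```python
import pandas as pd


def load_and_summarize(path_or_buffer):
    """Read a CSV and print the three day-1 summaries."""
    df = pd.read_csv(path_or_buffer)
    print(df.head())                   # first five rows
    df.info()                          # dtypes and non-null counts (prints directly)
    print(df.describe(include="all"))  # statistics for numerical and categorical columns
    return df
```

Calling it would look like df = load_and_summarize("cereal.csv"), where the file name is an assumption about how the 80 Cereals file is saved locally.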

I divided the columns of this dataset into numerical and categorical. In total, there are 16 columns: 3 are categorical, containing the name of the cereal product, the manufacturer, and the type (hot or cold). The other 13 contain the nutrition values of each cereal product, ranging from calories and protein to rating.
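One way to do this split by dtype is sketched below. The helper name split_columns and the tiny example frame are made up for illustration; the frame is not the real dataset.

```python
import pandas as pd


def split_columns(df):
    """Return (numerical, categorical) column names based on dtype."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# Tiny made-up frame, only to show the call.
example = pd.DataFrame({"name": ["A", "B"], "type": ["C", "H"], "calories": [70, 100]})
print(split_columns(example))
```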

df.head() result
df.info() result
df.describe(include = 'all') result

From these tables, I found that the dataset is pretty neat. It does not need further cleaning for null values, because it does not contain any. So I can continue to the next stage: plotting histograms.
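Besides reading the df.info() output, the null count can be checked directly. This is a small sketch; count_missing is a hypothetical helper name and the example frame is made up.

```python
import pandas as pd


def count_missing(df):
    """Return the number of null values in each column."""
    return df.isnull().sum()


# Made-up frame with one missing value, only to show the output shape.
example = pd.DataFrame({"calories": [70, None], "type": ["C", "H"]})
missing = count_missing(example)
print(missing)
print("Any nulls:", bool(missing.any()))
```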

Day-2 Challenge

The dataset has 13 numerical columns. In the day-2 challenge, I plot those values as histograms to see the distribution of each variable. For this visualization, I used pandas with df.hist() and added figsize to make the plots bigger and more readable.
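A sketch of that call is below. The figsize value is an assumption, since the post does not state the size that was used, and plot_histograms is a hypothetical wrapper name.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(df, figsize=(16, 12)):
    """Draw one histogram per numerical column; pandas skips non-numeric columns."""
    axes = df.hist(figsize=figsize)
    plt.tight_layout()  # keep subplot titles from overlapping
    return axes
```

After calling plot_histograms(df), plt.show() displays the figure when running outside a notebook.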

Histogram of 13 numerical variables

Because I planned to analyse the difference in nutrition values using a t-test, I wanted variables that are approximately normally distributed. Most of the variables are not quite normally distributed, but some that look plausibly normal are calories, protein, sodium, and carbo. For now, I will start from these variables.

To make sure, I added probability plots using pylab as follows:

probability plot
sodium probability plot
carbo probability plot

In a probability plot, normally distributed values appear as points that lie close to the line. Of the 4 plots, sodium and carbo look nearly normally distributed, so I will focus the test on these variables.
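A probability plot like the ones described here can be sketched with scipy.stats.probplot. This version draws on a matplotlib Axes rather than pylab, and probability_plot is a hypothetical helper name, so treat it as one possible way to get the same kind of plot.

```python
import matplotlib.pyplot as plt
from scipy import stats


def probability_plot(values, title):
    """Plot sample quantiles against normal quantiles and return the fit's r value."""
    fig, ax = plt.subplots()
    # probplot returns the ordered data plus the least-squares fit (slope, intercept, r).
    (osm, osr), (slope, intercept, r) = stats.probplot(values, dist="norm", plot=ax)
    ax.set_title(title)
    return r
```

For example, probability_plot(df["sodium"], "sodium probability plot") draws the sodium plot. An r value close to 1 means the points sit close to the line.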

Day-3 Challenge

Next, after deciding which variables to test, I am ready to use a t-test to determine whether there is any significant difference in carbohydrate and sodium content between hot and cold cereals.

Before running the t-test with ttest_ind(), I created the variables hot_cereal_sodium, cold_cereal_sodium, hot_cereal_carbo and cold_cereal_carbo. After that, I ran the t-test by passing the two variables being compared to ttest_ind(). I also added equal_var = False, since the variances of the two groups most likely differ significantly.
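The split and the test can be sketched in one small function. The wrapper name welch_by_type is hypothetical; the column name type and the labels 'H' and 'C' come from the dataset as described in this post.

```python
import pandas as pd
from scipy.stats import ttest_ind


def welch_by_type(df, column):
    """Compare `column` between hot ('H') and cold ('C') cereals."""
    hot = df.loc[df["type"] == "H", column]
    cold = df.loc[df["type"] == "C", column]
    # equal_var=False selects Welch's t-test, which does not assume equal variances.
    return ttest_ind(hot, cold, equal_var=False)
```

Calling welch_by_type(df, "sodium").pvalue then gives the p-value for sodium, and the same call with "carbo" gives the one for carbohydrates.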

Note: There are only 3 'H' rows and 74 'C' rows in the dataset. We can check this using df['type'].value_counts().

Use ttest_ind() to start the t-test
Create table to summarize the p-value
p-value

I created a table to summarize the results and got two different p-values. The null hypothesis is that there is no difference in mean nutrition value between hot and cold cereals, and I reject it when the p-value is below 0.05. From the result, I can conclude that there is a significant difference in sodium content between hot and cold cereals (p-value = 0.024). Meanwhile, for carbohydrate content, there is no significant difference between hot and cold cereals (p-value = 0.72).
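The decision rule can be written out as a small table. This is a sketch with a hypothetical helper name, p_value_table; the two p-values are the ones reported above, and 0.05 is the threshold used in this post.

```python
import pandas as pd


def p_value_table(p_values, alpha=0.05):
    """Build a table of p-values and whether each one falls below alpha."""
    table = pd.DataFrame({"p_value": pd.Series(p_values)})
    table["significant"] = table["p_value"] < alpha
    return table


# The two p-values reported in this post.
summary = p_value_table({"sodium": 0.024, "carbo": 0.72})
print(summary)
```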

I also summarized the mean and standard deviation of sodium and carbohydrate content in a table:

Create table to summarize mean and standard deviation
Table summary of mean and standard deviation
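A table like this can be sketched with a groupby on the type column. The helper name mean_std_by_type is hypothetical, and the layout of the original table may differ.

```python
import pandas as pd


def mean_std_by_type(df, columns=("sodium", "carbo")):
    """Return mean and standard deviation of each column for every cereal type."""
    return df.groupby("type")[list(columns)].agg(["mean", "std"])
```

The result has one row per type ('C' and 'H') and a two-level column index, so a single value is read with, for example, table.loc["H", ("sodium", "mean")].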

I also plotted the distribution of sodium and carbohydrate:

Create side by side histograms
The distribution of sodium and carbohydrate in hot cereals and cold cereals
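One possible way to draw such side-by-side histograms is sketched below. The layout (one panel per nutrient, with hot and cold cereals overlaid), the figure size, and the helper name are all assumptions, not necessarily how the figure above was made.

```python
import matplotlib.pyplot as plt
import pandas as pd


def side_by_side_histograms(df, columns=("sodium", "carbo")):
    """One panel per column, overlaying the cereal types found in df['type']."""
    fig, axes = plt.subplots(1, len(columns), figsize=(12, 4), squeeze=False)
    for ax, column in zip(axes[0], columns):
        for label, group in df.groupby("type"):
            ax.hist(group[column], alpha=0.6, label=label)
        ax.set_title(column)
        ax.legend()
    fig.tight_layout()
    return fig
```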

However, there are some problems in the dataset that should be considered in order to obtain more accurate hypothesis testing. First, the cold cereals dominate the data, while the hot cereals include only three products. Second, one of the hot cereal values is negative, which certainly affects the results. These issues could lead to higher variance and inaccuracy. To minimize this effect, I think it is more reasonable to collect more nutrition data for hot cereals before analyzing.
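One way to handle the negative value is sketched below, under the assumption that a negative entry is a placeholder for a missing measurement rather than a real value (the post does not confirm this). The helper name negatives_to_nan is hypothetical.

```python
import pandas as pd


def negatives_to_nan(df):
    """Return a copy where negative numerical values are replaced with NaN."""
    cleaned = df.copy()
    numeric_cols = cleaned.select_dtypes(include="number").columns
    # mask() replaces entries where the condition is True with NaN.
    cleaned[numeric_cols] = cleaned[numeric_cols].mask(cleaned[numeric_cols] < 0)
    return cleaned
```

The NaN values then have to be dropped before testing, for example by passing nan_policy="omit" to ttest_ind(). With only three hot cereals, that leaves very little data, which is one more reason to collect more of it.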

Despite the inaccuracy, the main purpose of this challenge is to grasp the basic concept of statistical analysis using the t-test. I will continue the challenge with day 4 and day 5, which cover visualization of categorical data and the chi-squared test. Here is the 4th day challenge.

If you are interested, please read the full code on my GitHub or Kaggle. I would also love to read your comments or input to improve this project. ^^

You can also try this challenge; just click: The 5-Day Data Challenge | Kaggle

Thank you for reading and have a nice day ^^
