# Statistics for Data Scientists

Statistics is a form of mathematical analysis that uses models, representations, and summaries to describe a given set of experimental data or real-life studies. It can also be defined as “the science of the collection, analysis, interpretation, presentation, and organization of data.”

Statistics summarize information because discussing every individual data point is impractical.

#### Types of Statistics

There are two types of statistics:

• Descriptive Statistics – Descriptive statistics deals with the analysis and methods related to the collection, organization, summarization, and presentation of data. By applying the techniques of descriptive statistics, raw data is collected and transformed into a meaningful form.
• Inferential Statistics – Inferential statistics draws conclusions and makes decisions about a population using information drawn from a sample.

#### What are a Data Series and a Dataset?

• Data Series: A row or column of numbers plotted in a chart is called a data series. Example 1, Data Series 1: 19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1
• Dataset: A collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer.

Let’s consider an air quality dataset to summarize all the measures:

There is a variety of descriptive statistics:

• Measures of central tendency – mean, median, mode
• Measures of dispersion – range, variance, standard deviation
• Measures of shape – skewness, kurtosis
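The dispersion measures listed above can be sketched with Python's standard `statistics` module, using Data Series 1 from earlier. This is a minimal illustration that assumes the series is treated as a sample (so the variance divides by n − 1); shape measures are left out because the standard library has no built-in function for them.

```python
import statistics

# Data Series 1 from the example above
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Sample variance and standard deviation (assumption: the data is a sample,
# so the sum of squared deviations is divided by n - 1)
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

print("Range:", data_range)
print("Sample variance:", round(sample_variance, 2))
print("Sample standard deviation:", round(sample_std, 2))
```

If the series is instead regarded as the whole population, `statistics.pvariance` and `statistics.pstdev` divide by n rather than n − 1.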

In general, statistics summarizes information about data in a meaningful and relevant way.

For example: how many people in India were affected by the coronavirus? The total affected as of September 30, 2020 was 6,623,515.

What other statistics can you think of?

• “Total Active Coronavirus-Affected People”
• “Total Recovered”
• “Total Deaths”

#### Mean

• The mean is the simple arithmetic average of a set of two or more numbers: the sum of the values divided by the number of values.
• The mean is the most common measure of the location of a set of points. However, it is very sensitive to outliers.
• The mean can only be used with numeric data.
• In Excel, it can be computed with AVERAGE().
• In R, use mean(); in Python, use statistics.mean() from the standard library or numpy.mean().
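As a minimal sketch, the mean of Data Series 1 can be computed with Python's standard `statistics` module. The second calculation appends a hypothetical outlier (1000, which is not part of the original series) purely to illustrate how sensitive the mean is to extreme values.

```python
import statistics

# Data Series 1 from the example above
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Mean = sum of values / number of values = 209 / 12
mean_value = statistics.mean(data)
print(round(mean_value, 2))  # 17.42

# Hypothetical outlier, added only for illustration
with_outlier = data + [1000]
mean_with_outlier = statistics.mean(with_outlier)
print(mean_with_outlier)  # 93
```

A single extreme value moves the mean from about 17.42 to 93, which is why the median is often reported alongside it.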

#### Median

• The median is the middle number, found by ordering all data points and picking the one in the middle (or, if there are two middle numbers, taking the mean of those two numbers).
• It may be thought of as the "middle" value of a data set.
• The position of the middle value is r = (m + 1) / 2, where m is the total number of values in the dataset. When m is even, r falls between two positions, so select the two values that are the same distance from both ends (positions m/2 and m/2 + 1) and average them.

Let’s continue example 1. Arrange the data in increasing order: 1, 2, 2, 2, 4, 4, 18, 19, 32, 33, 41, 51. As m = 12 is even, take the average of the two middle numbers (the 6th and 7th values, 4 and 18).

Median = (4 + 18) / 2 = 11
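The calculation above can be reproduced in Python. The sketch below follows the manual procedure step by step and then compares the result with `statistics.median` from the standard library; the variable names are illustrative.

```python
import statistics

# Data Series 1 from example 1
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

ordered = sorted(data)  # arrange in increasing order
m = len(ordered)        # total number of values

if m % 2 == 1:
    # Odd count: a single middle value
    median_manual = ordered[m // 2]
else:
    # Even count: average the two middle values
    lower = ordered[m // 2 - 1]
    upper = ordered[m // 2]
    median_manual = (lower + upper) / 2

print(median_manual)            # 11.0
print(statistics.median(data))  # 11.0
```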

#### Mode

• The frequency of an attribute value is the number of times the value occurs in the data set.
• It is found by collecting and organizing the data in order to count the frequency of each result.
• The mode is the most frequent number, that is, the number that occurs the highest number of times.
• The notions of frequency and mode are typically used with categorical data, but they can be used with any data type.
• Let’s continue example 1: the dataset is 19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1, and the mode is 2, which occurs three times.
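A short Python sketch of the same idea: `collections.Counter` counts the frequency of each value, and the most frequent one is the mode. The list of colours is a hypothetical example added to show that the mode also works on categorical data.

```python
import statistics
from collections import Counter

# Data Series 1 from example 1
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Frequency of each value in the series
frequencies = Counter(data)

# The most frequent value and how often it occurs
mode_value, mode_count = frequencies.most_common(1)[0]
print(mode_value, mode_count)  # 2 3

# The standard library gives the same mode directly
print(statistics.mode(data))   # 2

# Hypothetical categorical data, for illustration only
colours = ["red", "blue", "red", "green"]
colour_mode = statistics.mode(colours)
print(colour_mode)             # red
```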

#### Data Distribution

• We can describe the series from example 1 (19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1) as: “Minimum of 1, maximum of 51, average of 17.42.”
• Given this description of the data series, what picture do we form of the data? The easiest way to visualize data is to look at its “distribution”.

Frequency Distribution: A distribution is a visualization of a frequency distribution table:

• A frequency distribution is a representation, in either graphical or tabular format, that displays the number of observations within a given interval.
• In a frequency distribution, we count how many times each observation occurs when observations are repeated.

| Class | Number of Students |
|-------|--------------------|
| 1     | 10                 |
| 2     | 8                  |
| 3     | 15                 |
| 4     | 20                 |
| 5     | 18                 |
| 6     | 6                  |
| 7     | 23                 |
| Total | 100                |

Frequency Distribution

Steps to find the frequency distribution:

• Find the range of the given data (largest number – smallest number).
• Determine the width of the class interval.
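The two steps above can be sketched in Python as follows. The number of classes (`num_classes = 6`) and the choice to round the class width up to a whole number are assumptions made for illustration; the text does not prescribe either.

```python
import math
from collections import Counter

# Data Series 1 from example 1
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Step 1: range = largest number - smallest number
data_range = max(data) - min(data)  # 51 - 1 = 50

# Step 2: class width (assumption: 6 classes, width rounded up)
num_classes = 6
width = math.ceil(data_range / num_classes)  # ceil(50 / 6) = 9

# Assign each value to a class index and count the members of each class
start = min(data)
counts = Counter((value - start) // width for value in data)

# Print the frequency table with inclusive integer class limits
for i in range(num_classes):
    low = start + i * width
    high = low + width - 1
    print(f"{low:>2}-{high:<2} {counts[i]}")
```

With these assumptions, the first class (1 to 9) holds six of the twelve values, which shows how strongly the series is concentrated at the low end.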

#### Types of Frequency

There are two types of frequency:

1. Relative Frequency: To compute relative frequency, one obtains a frequency count for the total population and a frequency count for a subgroup or class interval of the population.

Relative Frequency = Frequency of class interval / Total number of observations

2. Cumulative Frequency: The cumulative frequency for each class interval is the frequency for that class interval added to the preceding cumulative total.

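| Class (Rs) | Frequency (Students) | Relative Frequency | Cumulative Relative Frequency |
|------------|----------------------|--------------------|-------------------------------|
| 20-30      | 5                    | 0.125              | 0.125                         |
| 30-40      | 8                    | 0.2                | 0.325                         |
| 40-50      | 9                    | 0.225              | 0.55                          |
| 50-60      | 10                   | 0.25               | 0.8                           |
| 60-70      | 6                    | 0.15               | 0.95                          |
| 70-80      | 2                    | 0.05               | 1                             |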
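The relative and cumulative columns of the table above can be reproduced with a few lines of Python. This is a sketch using the class frequencies from the table; `itertools.accumulate` builds the running total, and the variable names are illustrative.

```python
from itertools import accumulate

# Class intervals and frequencies taken from the table above
classes = ["20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
frequencies = [5, 8, 9, 10, 6, 2]

total = sum(frequencies)  # 40 observations in all

# Relative frequency = class frequency / total observations
relative = [f / total for f in frequencies]

# Cumulative frequency = class frequency + preceding cumulative total
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency, as shown in the last table column
cumulative_relative = [c / total for c in cumulative]

for label, f, r, c, cr in zip(classes, frequencies, relative,
                              cumulative, cumulative_relative):
    print(f"{label:>6} {f:>3} {r:.3f} {c:>3} {cr:.3f}")
```

The last cumulative value always equals the total count (40 here), and the last cumulative relative value always equals 1, which is a quick check that no class was missed.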

Author:  Mohit T (Algae Services)