# Statistics for Data Scientists

• Statistics is a form of mathematical analysis.
• It uses models, representations, and synopses for a given set of experimental data or real-life studies.
• It is “the science of the collection, analysis, interpretation, presentation, and organization of data.”

Statistics summarize information because discussing every individual data point is impractical.

#### There are two types of statistics

• Descriptive Statistics:
• Descriptive statistics deals with the methods used to collect, organize, summarize, and present data.
• Using the techniques of descriptive statistics, raw data is collected and transformed into a meaningful form.
• Inferential Statistics:
• Inferential statistics draws conclusions and makes decisions about a population using information drawn from a sample.
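
As a rough illustration of drawing a conclusion about a population from a sample, the sketch below estimates a population mean from a random sample. The population (the integers 1 to 10,000), the sample size of 400, and the seed are all assumptions chosen only for this example.

```python
import random
from statistics import mean

random.seed(42)  # assumed seed, only so the run is repeatable

# Hypothetical population: the integers 1..10000.
population = list(range(1, 10001))
population_mean = mean(population)  # 5000.5

# Draw a simple random sample without replacement.
sample = random.sample(population, 400)
sample_mean = mean(sample)

# The sample mean serves as an estimate of the population mean.
print(f"population mean: {population_mean}")
print(f"sample mean:     {sample_mean:.1f}")
```

The sample mean will usually differ a little from the population mean; larger samples tend to narrow that gap.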

#### Let's also understand: what are a data series and a dataset?

• Data Series: A row or column of numbers that is plotted in a chart is called a data series. Data Series 1: 29, 41, 32, 12, 21, 12, 2, 40, 20, 12, 41, 11
• Dataset: A collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer.

Let’s consider an air quality dataset to summarize all the measures:

Variety of descriptive statistics:

• Measures of central tendency
• Mean
• Median
• Mode
• Measures of dispersion
• Range
• Variance
• Standard deviation
• Measures of shape
• Skewness
• Kurtosis
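
The measures of dispersion and shape listed above can be sketched in Python using only the standard library. This is a minimal illustration on Data Series 1 from this article. It assumes population formulas (dividing by the number of values) and reports excess kurtosis (kurtosis minus 3); the helper names `central_moment`, `skewness`, and `excess_kurtosis` are made up for this example.

```python
from statistics import mean, pvariance, pstdev

# Data Series 1 from this article.
series = [29, 41, 32, 12, 21, 12, 2, 40, 20, 12, 41, 11]

def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mu = mean(values)
    return sum((x - mu) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness (population form, an assumption here)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (population form, an assumption here)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

data_range = max(series) - min(series)  # largest value minus smallest value
variance = pvariance(series)            # population variance
std_dev = pstdev(series)                # population standard deviation

print("range:", data_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
print("skewness:", round(skewness(series), 3))
print("excess kurtosis:", round(excess_kurtosis(series), 3))
```

For sample-based estimates, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.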

In general, statistics summarizes information about data in a meaningful and relevant way.

For example: how many people in India have been affected by coronavirus? The total affected as of September 30th, 2020 is 6,623,515.

What other statistics can you think of?

• “Total active coronavirus-affected people”
• “Total recovered”
• “Total deaths”

#### Mean

• The mean is the simple mathematical average of a set of two or more numbers: the sum of the values divided by the number of values.
• The mean is the most common measure of the location of a set of points. However, the mean is very sensitive to outliers.
• The mean can only be used with numeric data.
• In Excel, it can be computed with AVERAGE().
• In R, use mean(); in Python, use statistics.mean() or numpy.mean().
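
A minimal Python sketch of the mean, using the example 1 series from this article. The extra value of 1000 is a hypothetical outlier added only to show how strongly a single extreme value pulls the mean.

```python
from statistics import mean

# Example 1 series from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# The mean is the sum of the values divided by how many there are.
manual_mean = sum(data) / len(data)
library_mean = mean(data)

# Append one extreme value (a hypothetical outlier) and recompute.
with_outlier = data + [1000]
outlier_mean = sum(with_outlier) / len(with_outlier)

print(round(manual_mean, 2))   # 17.42
print(round(outlier_mean, 2))  # 93.0
```

One added value moves the mean from about 17.42 to 93, which is why the median is often reported alongside it.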

#### Median

• The median is the middle number, found by ordering all data points and picking the one in the middle (or, if there are two middle numbers, taking the mean of those two numbers).
• It may be thought of as the "middle" value of a data set.
• Here, m = the total number of values in the dataset and r = the position of the middle value in the ordered data. When m is even, there is no single middle position, so select the two values that are the same distance from both ends.

Let’s continue example 1. Arrange the data in increasing order: 1, 2, 2, 2, 4, 4, 18, 19, 32, 33, 41, 51. As m = 12 is even, take the average of the two middle numbers, 4 and 18.

Median = (4 + 18) / 2 = 11
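
The same calculation can be sketched in Python. The helper name `manual_median` is made up for this illustration; `statistics.median` is the standard-library equivalent.

```python
from statistics import median

# Example 1 series from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

def manual_median(values):
    """Order the values, then pick the middle one (or average the middle two)."""
    ordered = sorted(values)
    m = len(ordered)
    mid = m // 2
    if m % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two values nearest the center.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(manual_median(data))  # 11.0
print(median(data))         # 11.0
```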

#### Mode

• The frequency of an attribute value is the number of times the value occurs in the data set.
• It is found by collecting and organizing the data in order to count the frequency of each result.
• The mode is the most frequent number, that is, the number that occurs the highest number of times.
• The notions of frequency and mode are typically used with categorical data, but they can be used on any data type.
• Let’s continue example 1: The dataset is 19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1 and the mode is 2, which occurs three times.
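
Counting frequencies and picking the mode can be sketched with `collections.Counter` from the Python standard library. The list of colors is a hypothetical categorical example, added only to show that the same approach works on non-numeric data.

```python
from collections import Counter

# Example 1 series from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Count how many times each value occurs.
frequency = Counter(data)

# most_common(1) returns the single (value, count) pair with the highest count.
mode_value, mode_count = frequency.most_common(1)[0]
print(mode_value, mode_count)  # 2 3

# Hypothetical categorical data: the mode is the most frequent label.
colors = ["red", "blue", "red", "green", "red", "blue"]
color_mode, color_count = Counter(colors).most_common(1)[0]
print(color_mode, color_count)  # red 3
```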

#### Data Distribution

• We can describe the series from example 1 (19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1) as: “Minimum of 1, maximum of 51, average of about 17.42.”
• Given this description of the data series, what picture do we form of the data? The easiest way to visualize data is to look at its “distribution”.

Frequency Distribution: A distribution is a visualization of a frequency distribution table:

• Frequency distribution is a representation, either in a graphical or tabular format, that displays the number of observations within a given interval.
• In a frequency distribution, we count how many times each observation occurs when observations are repeated.

| Class | Number of Students |
|-------|--------------------|
| 1 | 10 |
| 2 | 8 |
| 3 | 15 |
| 4 | 20 |
| 5 | 18 |
| 6 | 6 |
| 7 | 23 |
| Total | 100 |

Frequency Distribution

Steps to find the frequency distribution

• Find the range for the given data (Largest Number – Smallest Number)
• Determine the width of the class interval
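
The two steps above can be sketched in Python, followed by the counting step they lead to. This is a minimal illustration on the example 1 series. The choice of 5 classes is an assumption, as is the convention that the last class includes the largest value.

```python
import math

# Example 1 series from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]
num_classes = 5  # assumed number of class intervals

# Step 1: range = largest number - smallest number.
low, high = min(data), max(data)
data_range = high - low

# Step 2: width of each class interval, rounded up to a whole number.
width = math.ceil(data_range / num_classes)

# Count the observations falling in each interval [lower, lower + width).
# The last class is closed on the right so the maximum value is included.
counts = [0] * num_classes
for x in data:
    index = min((x - low) // width, num_classes - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    lower = low + i * width
    print(f"{lower}-{lower + width}: {count}")
```

With these assumptions the range is 50, the class width is 10, and the class counts are 6, 2, 0, 2, and 2.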

#### Types of Frequency: There are two types of frequency, described below

1. Relative Frequency: The share of all observations that fall in a given class interval. To compute it, obtain the frequency count for the class interval and the frequency count for the total population, then divide the former by the latter.

Relative Frequency = Frequency of class interval / Total number of observations

2. Cumulative Frequency: The cumulative frequency for each class interval is the frequency for that class interval added to the preceding cumulative total.

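| Class (Rs) | Frequency (Students) | Relative Frequency | Cumulative Relative Frequency |
|------------|----------------------|--------------------|-------------------------------|
| 20-30 | 5 | 0.125 | 0.125 |
| 30-40 | 8 | 0.2 | 0.325 |
| 40-50 | 9 | 0.225 | 0.55 |
| 50-60 | 10 | 0.25 | 0.8 |
| 60-70 | 6 | 0.15 | 0.95 |
| 70-80 | 2 | 0.05 | 1 |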
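
The relative and cumulative values in the table can be reproduced with a short Python sketch. The class labels and frequencies are taken from the table; the variable names are chosen only for this illustration. The cumulative column is computed as a running total of the counts divided by the total number of observations.

```python
from itertools import accumulate

# Class intervals and frequencies from the table above.
classes = ["20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
frequencies = [5, 8, 9, 10, 6, 2]

total = sum(frequencies)  # 40 observations in all

# Relative frequency = frequency of class interval / total observations.
relative = [f / total for f in frequencies]

# Cumulative frequency = running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency = running total / total observations.
cumulative_relative = [c / total for c in cumulative]

for row in zip(classes, frequencies, relative, cumulative, cumulative_relative):
    print(*row)
```

The last cumulative value always equals the total count (40 here), so the last cumulative relative value is 1.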

Author:  Mohit T (Algae Services)