What is Statistics in Data Science?

Statistics for Data Scientists

Statistics is a form of mathematical analysis.
It uses models, representations and summaries to describe a given set of experimental data or real-life studies.
It is “the science of the collection, analysis, interpretation, presentation, and organization of data.”

Statistics summarize information because discussing every individual data point is impractical.

There are two types of statistics

Descriptive Statistics:

Descriptive statistics deals with the methods for collecting, organizing, summarizing and presenting data.
By applying the techniques of descriptive statistics, raw data is collected and transformed into a meaningful form.

Inferential Statistics

Inferential statistics draws conclusions and makes decisions about a population using information drawn from a sample.
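
To make the sample-to-population idea concrete, here is a minimal Python sketch. The population is synthetic (randomly generated values standing in for real measurements), and the names `population` and `sample` are illustrative assumptions, not something from a real study.

```python
import random
from statistics import mean

# Hypothetical population: 10,000 made-up measurements between 0 and 500.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(0, 500) for _ in range(10_000)]

# Inferential step: look at only 200 randomly chosen values...
sample = rng.sample(population, 200)

# ...and use the sample mean as an estimate of the population mean.
sample_mean = mean(sample)
population_mean = mean(population)

print(f"Sample mean (estimate): {sample_mean:.2f}")
print(f"Population mean (usually unknown in practice): {population_mean:.2f}")
```

In a real study the population mean is not available; it is computed here only to show that the sample estimate lands close to it.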

Let’s also understand: what are a data series and a dataset?

Data Series: A row or column of numbers that are plotted in a chart is called a data series. Data Series 1: 29,41,32,12,21,12,2,40,20,12,41,11

Dataset: A collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer.

Let’s consider an air quality dataset to summarize all the measures:

Variety of descriptive statistics:

Measures of central tendency

Mean
Median
Mode

Measures of dispersion

Range
Variance
Standard deviation

Measures of shape

Skewness
Kurtosis
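
As a rough illustration of the dispersion and shape measures listed above, the sketch below uses the 12-value series from example 1 of this post. It assumes population (divide-by-n) formulas and one common moment-based definition of skewness and excess kurtosis; statistical libraries may apply bias corrections and report slightly different numbers. The helper `central_moment` is a hypothetical name introduced for this example.

```python
from statistics import mean, pvariance, pstdev

# Series from example 1 of this post.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Measures of dispersion.
data_range = max(data) - min(data)   # largest value minus smallest value
variance = pvariance(data)           # population variance
std_dev = pstdev(data)               # population standard deviation


def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mu = mean(values)
    return sum((x - mu) ** k for x in values) / len(values)


# Measures of shape (moment-based definitions, an assumption of this sketch).
skewness = central_moment(data, 3) / central_moment(data, 2) ** 1.5
excess_kurtosis = central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

print("Range:", data_range)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 2))
print("Skewness:", round(skewness, 2))
print("Excess kurtosis:", round(excess_kurtosis, 2))
```

A positive skewness here reflects the long tail of larger values (32, 33, 41, 51) to the right of the bulk of the data.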

In general, statistics summarizes information about data in a meaningful and relevant way.

For example: how many people in India have been affected by the coronavirus? The total affected as of September 30th, 2020 is 6,623,515.

What other statistics can you think of?

“Total Active Corona Virus affected People”
“Total Recovered”
“Total Deaths”

Mean

The mean is the simple arithmetic average of a set of two or more numbers.
The mean is the most common measure of the location of a set of points. However, the mean is very sensitive to outliers.
The mean can only be used with numeric data.
In Excel, it can be computed with AVERAGE().
In R, use mean(); in Python, use statistics.mean() or numpy.mean().
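
A short Python sketch of the mean, using the series from example 1 of this post. The extra value of 1000 is a made-up outlier added only to show how strongly a single extreme point can pull the mean.

```python
from statistics import mean

# Series from example 1 of this post.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Mean = sum of the values divided by how many values there are.
average = mean(data)
print(f"Mean: {average:.2f}")

# Hypothetical outlier: one extreme value shifts the mean substantially.
with_outlier = mean(data + [1000])
print(f"Mean with an outlier of 1000: {with_outlier:.2f}")
```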

Median:

The middle number, found by ordering all data points and picking the one in the middle (or, if there are two middle numbers, taking the mean of those two numbers).
It may be thought of as the "middle" value of a data set.
The position of the middle value is typically given by r = (m + 1) / 2, where:
m = total number of values in the dataset
r = position of the middle value (when m is even, r falls between two positions, so select the two values that are the same distance from both ends)

Let’s continue example 1: Arrange the data in increasing order: 1, 2, 2, 2, 4, 4, 18, 19, 32, 33, 41, 51. As m is even (m = 12), take the average of the two middle numbers, 4 and 18.

Median = (4 + 18) / 2 = 11
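
The same calculation can be sketched in Python. The function `median_by_position` is a hypothetical helper written for this example; it follows the "sort, then pick the middle" rule described above, and the standard library's `statistics.median` is used as a cross-check.

```python
from statistics import median

# Series from example 1 of this post (unsorted).
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]


def median_by_position(values):
    """Sort the values and return the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    m = len(ordered)
    if m % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[m // 2]
    # Even count: average the two values on either side of the middle.
    lower = ordered[m // 2 - 1]
    upper = ordered[m // 2]
    return (lower + upper) / 2


median_value = median_by_position(data)
print("Median (by position):", median_value)
print("Median (statistics.median):", median(data))
```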

Mode

The frequency of an attribute value is the number of times the value occurs in the data set.
It is found by collecting and organizing the data in order to count the frequency of each result.
The mode is the most frequent number, that is, the number that occurs the highest number of times.
The notions of frequency and mode are typically used with categorical data, but they can be used with any data type.
Let’s continue example 1: The dataset is 19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1 and the mode is 2, which occurs three times.
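
Counting frequencies and picking the mode can be sketched with Python's `collections.Counter`. The variable names are illustrative; the data is the example 1 series.

```python
from collections import Counter

# Series from example 1 of this post.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Frequency of each value: how many times it occurs in the data.
frequency = Counter(data)

# The mode is the value with the highest frequency.
mode_value, mode_count = frequency.most_common(1)[0]

print("Frequencies:", dict(frequency))
print(f"Mode: {mode_value} (occurs {mode_count} times)")
```

The same approach works for categorical data, for example a list of strings, because `Counter` only needs values that can be compared for equality.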

Data Distribution

We can describe the series we looked at in example 1 (19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1) as: “Minimum of 1, maximum of 51, average of about 17.42.”
Given this description of the data series, what picture do we form of the data? The easiest way to visualize data is to look at its “distribution”.

Frequency Distribution: A distribution is a visualization of a frequency distribution table:

Frequency distribution is a representation, either in a graphical or tabular format, that displays the number of observations within a given interval.
In a frequency distribution, we count how many times each observation occurs when observations are repeated.

Class	Number of Students
1	10
2	8
3	15
4	20
5	18
6	6
7	23
Total	100

Frequency Distribution

Steps to find the frequency distribution

Find the range for the given data (Largest Number – Smallest Number)
Determine the width of the class interval
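
The two steps above can be sketched in Python on the example 1 series. The choice of 5 classes is an assumption made for illustration, and the final tallying step is an addition of this sketch; class boundaries are a design decision and other conventions are equally valid.

```python
import math

# Series from example 1 of this post.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Step 1: range = largest number - smallest number.
lowest, highest = min(data), max(data)
data_range = highest - lowest

# Step 2: class width, assuming we want 5 classes (a hypothetical choice).
num_classes = 5
width = math.ceil(data_range / num_classes)

# Tally: count how many values fall in each class interval.
# The largest value is placed in the last class so nothing is dropped.
counts = [0] * num_classes
for value in data:
    index = min((value - lowest) // width, num_classes - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    start = lowest + i * width
    print(f"{start} to {start + width}: {count}")
```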

Types of Frequency: There are two types of frequency, described below.

Relative Frequency: To compute relative frequency, one obtains a frequency count for the total population and a frequency count for a subgroup or class interval of the population.

Relative Frequency = Frequency of class interval / Total number of observations

Cumulative Frequency: Cumulative frequency for each class interval is the frequency for that class interval added to the preceding cumulative total.

Class (Rs)	Frequency (Students)	Relative Frequency	Cumulative Relative Frequency
20-30	5	0.125	0.125
30-40	8	0.2	0.325
40-50	9	0.225	0.55
50-60	10	0.25	0.8
60-70	6	0.15	0.95
70-80	2	0.05	1
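
The relative and cumulative columns in the table above can be reproduced with a few lines of Python. This is a minimal sketch using only the class frequencies from the table; the variable names are illustrative.

```python
from itertools import accumulate

# Class intervals and frequencies taken from the table above.
classes = ["20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
frequencies = [5, 8, 9, 10, 6, 2]

total = sum(frequencies)  # total number of observations

# Relative frequency = frequency of class interval / total observations.
relative = [f / total for f in frequencies]

# Cumulative frequency = running total of the class frequencies.
cumulative_counts = list(accumulate(frequencies))

# Dividing the running totals by the overall total gives the last table column.
cumulative_relative = [c / total for c in cumulative_counts]

for label, f, r, c in zip(classes, frequencies, relative, cumulative_relative):
    print(f"{label}\t{f}\t{r:.3f}\t{c:.3f}")
```

Computing the running total on the integer counts first, and dividing afterwards, avoids accumulating small floating-point errors in the final column.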

Author: Mohit T (Algae Services)
