Statistics for Data Scientists
- Statistics is a form of mathematical analysis
- It uses models, representations and synopses for a given set of experimental data or real-life studies.
- It is “the science of the collection, analysis, interpretation, presentation, and organization of data.”
There are two types of statistics:
- Descriptive Statistics:
- Descriptive statistics deals with the analysis and methods related to the collection, organization, summarization and presentation of data.
- The techniques of descriptive statistics turn collected raw data into a meaningful form.
- Inferential Statistics:
- Inferential statistics draws conclusions and makes decisions about a population using information drawn from a sample.
Let’s also understand what a data series and a dataset are:
- Data Series: A row or column of numbers that is plotted in a chart is called a data series. Data Series 1: 29,41,32,12,21,12,2,40,20,12,41,11
- Dataset: A collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer.
Let’s consider a dataset of air quality to summarize all the measures:
Variety of descriptive statistics:
- Measures of central tendency
- Mean
- Median
- Mode
- Measures of dispersion
- Range
- Variance
- Standard deviation
- Measures of shape
- Skewness
- Kurtosis
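As a rough illustration of the dispersion and shape measures listed above, here is a minimal Python sketch using only the standard library. It uses Data Series 1 from this post. The helper names `central_moment`, `skewness` and `kurtosis` are hypothetical, and the shape measures use the simple population-moment definitions (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2), which is only one of several conventions in use.

```python
from statistics import mean, pstdev, pvariance, variance

# Data Series 1 from this post
data = [29, 41, 32, 12, 21, 12, 2, 40, 20, 12, 41, 11]

# Measures of dispersion
data_range = max(data) - min(data)   # largest value minus smallest value
pop_variance = pvariance(data)       # population variance (divide by n)
sample_variance = variance(data)     # sample variance (divide by n - 1)
pop_std = pstdev(data)               # population standard deviation


def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k (hypothetical helper)."""
    mu = mean(values)
    return sum((x - mu) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based kurtosis: m4 / m2^2 (3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


print("Range:", data_range)
print("Population variance:", round(pop_variance, 2))
print("Population std dev:", round(pop_std, 2))
print("Skewness:", round(skewness(data), 2))
print("Kurtosis:", round(kurtosis(data), 2))
```

Libraries such as SciPy or pandas offer ready-made versions of these measures, but their defaults (for example, sample versus population formulas, or excess kurtosis) may differ from the definitions assumed here.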
In general, statistics summarizes information about data in a meaningful and relevant way.
For example: how many people in India have been affected by the coronavirus? The total affected as of September 30th, 2020 is 6,623,515.
What other statistics can you think of?
- “Total Active Corona Virus affected People”
- “Total Recovered”
- “Total Deaths”
Mean
- The mean is the simple mathematical average of a set of two or more numbers
- The mean is the most common measure of the location of a set of points. However, the mean is very sensitive to outliers.
- The mean can only be used with numeric data.
- In Excel, it can be computed with AVERAGE().
- In R, use mean(); in Python, use statistics.mean() (or numpy.mean() for arrays).
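As a minimal sketch, the mean can be computed in Python either by hand or with the standard library's `statistics.mean`. The data is the series used in example 1 of this post. The outlier value 500 is an assumption, added only to illustrate how strongly a single extreme value pulls the mean.

```python
from statistics import mean

# Data series used in example 1 of this post
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Mean = sum of the values divided by the number of values
manual_mean = sum(data) / len(data)
library_mean = mean(data)

# Hypothetical outlier (500), appended for illustration only
with_outlier = data + [500]
mean_with_outlier = mean(with_outlier)

print("Mean:", round(manual_mean, 2))
print("Mean with outlier:", round(mean_with_outlier, 2))
```

One added value moves the mean from roughly 17 to above 50, which is why the median is often reported alongside the mean.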
Median
- The middle number, found by ordering all data points and picking the one in the middle (or, if there are two middle numbers, taking the mean of those two numbers).
- It may be thought of as the "middle" value of a data set.
- The position of the middle value in the ordered data is r = (m + 1) / 2, where m = total number of values in the dataset and r = position of the middle value (when m is even, select the two values that are at the same distance from both ends).
Let’s continue example 1: arrange the data in increasing order:
1,2,2,2,4,4,18,19,32,33,41,51
As m is even, take the average of the 2 middle numbers:
Median = (4+18)/2 = 11
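The procedure above (sort, then pick the middle value or average the two middle values) can be sketched in Python as follows. The function name `median_by_position` is hypothetical, and the sketch assumes a non-empty list. The standard library's `statistics.median` is included as a cross-check.

```python
from statistics import median

# Data series from example 1
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]


def median_by_position(values):
    """Sort the values and return the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    m = len(ordered)              # total number of values
    mid = m // 2
    if m % 2 == 1:                # odd count: a single middle value
        return ordered[mid]
    # even count: average the two values equidistant from both ends
    return (ordered[mid - 1] + ordered[mid]) / 2


result = median_by_position(data)
print(result)          # 11.0
print(median(data))    # 11.0, same result from the standard library
```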
Mode
- The frequency of an attribute value is the number of times the value occurs in the data set.
- It is found by collecting and organizing the data in order to count the frequency of each result.
- The mode is the most frequent number—that is, the number that occurs the highest number of times.
- The notions of frequency and mode are typically used with categorical data, but they can be used with any data type.
- Let’s continue example 1: The dataset is 19,4,33,2,51,32,2,41,18,2,4,1 and the mode is 2.
- We can describe the series we looked at in example 1 (19,4,33,2,51,32,2,41,18,2,4,1) as: “Minimum of 1, Maximum of 51, Average of 17.42.”
- Given this description of the data series, what picture do we form of the data? The easiest way to visualize data is to look at its “distribution”.
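As a small sketch of the ideas above, Python's `collections.Counter` can count the frequency of each value and report the most frequent one. The data is the example 1 series. The variable names are illustrative only.

```python
from collections import Counter

# Data series from example 1
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Frequency of each value: how many times it occurs in the data
frequency = Counter(data)

# The mode is the value with the highest frequency
mode_value, mode_count = frequency.most_common(1)[0]

# Short description of the series: minimum, maximum and average
summary = (min(data), max(data), round(sum(data) / len(data), 2))

print("Mode:", mode_value, "occurs", mode_count, "times")
print("Min, max, average:", summary)
```

If several values tie for the highest frequency, `most_common(1)` returns only one of them. `statistics.multimode` (Python 3.8+) returns all of them.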
Frequency Distribution: A distribution is a visualization of a frequency distribution table:
- A frequency distribution is a representation, in either graphical or tabular format, that displays the number of observations within a given interval.
- In a frequency distribution, we count how many times each observation occurs when observations are repeated.
| Class | Number of Students |
|-------|--------------------|
| 1     | 10                 |
| 2     | 8                  |
| 3     | 15                 |
| 4     | 20                 |
| 5     | 18                 |
| 6     | 6                  |
| 7     | 23                 |
| Total | 100                |

Frequency Distribution
Steps to find the frequency distribution
- Find the range for the given data (Largest Number – Smallest Number)
- Determine the width of the class interval
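The two steps above, followed by counting the observations in each interval, can be sketched as below. The number of classes (5) is an assumption chosen for illustration, the data is the example 1 series, and `class_index` is a hypothetical helper. The last interval is treated as closed so that the maximum value is included.

```python
import math
from collections import Counter

# Data series from example 1
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]
num_classes = 5  # assumption: chosen only for this illustration

# Step 1: range = largest number - smallest number
smallest, largest = min(data), max(data)
data_range = largest - smallest

# Step 2: width of the class interval, rounded up to a whole number
width = math.ceil(data_range / num_classes)


def class_index(x):
    """Index of the interval containing x; the maximum falls into the last class."""
    return min((x - smallest) // width, num_classes - 1)


# Count the observations that fall into each interval
counts = Counter(class_index(x) for x in data)

table = []
for i in range(num_classes):
    lower = smallest + i * width
    upper = lower + width
    table.append((lower, upper, counts[i]))

for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```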
Types of Frequency: There are two types of frequency, described below.
- Relative Frequency: To compute relative frequency, one obtains a frequency count for the total population and a frequency count for a subgroup or class interval of the population.
Relative Frequency = Frequency of Class Interval / Total Number of Observations
- Cumulative Frequency: Cumulative frequency for each class interval is the frequency for that class interval added to the preceding cumulative total.
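| Class (Rs) | Frequency (Students) | Relative Frequency | Cumulative Relative Frequency |
|------------|----------------------|--------------------|-------------------------------|
| 20-30      | 5                    | 0.125              | 0.125                         |
| 30-40      | 8                    | 0.2                | 0.325                         |
| 40-50      | 9                    | 0.225              | 0.55                          |
| 50-60      | 10                   | 0.25               | 0.8                           |
| 60-70      | 6                    | 0.15               | 0.95                          |
| 70-80      | 2                    | 0.05               | 1                             |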
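The table above can be reproduced with a short Python sketch. The class labels and frequencies are taken from the table. The variable names are illustrative. Note that the table's last column is the running total expressed as a fraction of all observations, so the sketch computes both the cumulative counts and that cumulative fraction.

```python
from itertools import accumulate

# Class intervals and frequencies from the table above
classes = ["20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
frequencies = [5, 8, 9, 10, 6, 2]

total = sum(frequencies)  # total number of observations

# Relative frequency = frequency of class interval / total observations
relative = [f / total for f in frequencies]

# Cumulative frequency = frequency of the class plus the preceding cumulative total
cumulative = list(accumulate(frequencies))

# Cumulative frequency as a fraction of the total (last column of the table)
cumulative_relative = [c / total for c in cumulative]

for row in zip(classes, frequencies, relative, cumulative, cumulative_relative):
    print(*row, sep=" | ")
```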