What is Statistics in Data Science?

  

Statistics for Data Scientists

             

Statistics is a form of mathematical analysis that uses models, representations and summaries to describe a given set of experimental data or real-life studies. We can also say that it is “the science of the collection, analysis, interpretation, presentation, and organization of data.”







Statistics summarize information because discussing every individual data point is impractical.

 

Types of Statistics: There are two types of statistics

  • Descriptive Statistics – Descriptive statistics deals with the analysis and methods related to the collection, organization, summarization and presentation of data. Using the techniques of descriptive statistics, raw data is collected and transformed into a meaningful form.
  • Inferential Statistics – Inferential statistics draws conclusions and makes decisions about a population using information drawn from a sample.


What are a Data Series and a Dataset?

  • Data Series: A row or column of numbers plotted in a chart is called a data series. Data Series 1: 19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1
  • Dataset: A collection of related sets of information made up of separate elements that a computer can manipulate as a unit.


 

Let’s consider an air quality dataset to summarize all the measures:



  

There are several kinds of descriptive statistics:

  • Measures of central tendency – mean, median, mode
  • Measures of dispersion – range, variance, standard deviation
  • Measures of shape – skewness, kurtosis
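As a rough illustration of the measures of dispersion listed above, the following minimal sketch computes the range, variance and standard deviation of Data Series 1 with Python's standard `statistics` module. The choice of the population formulas (dividing by n) is an assumption made here; the article does not say which version it intends.

```python
import statistics

# Data Series 1 from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Population variance and standard deviation (divide by n).
# statistics.variance / statistics.stdev give the sample
# versions (divide by n - 1) instead.
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(f"Range: {data_range}")
print(f"Variance: {variance:.2f}")
print(f"Standard deviation: {std_dev:.2f}")
```

Which version to use depends on whether the series is treated as the whole population or as a sample drawn from a larger one.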

 

In general, statistics summarizes information about data in a meaningful and relevant way.

For example: how many people in India have been affected by the coronavirus? The total affected as of September 30, 2020 is 6,623,515.

 

What other statistics can you think of?

  • “Total Active Coronavirus Affected People”
  • “Total Recovered”
  • “Total Deaths”


Mean

  • The mean is the simple mathematical average of a set of two or more numbers.
  • The mean is the most common measure of the location of a set of points. However, it is very sensitive to outliers.
  • The mean can only be used with numeric data.
  • In Excel, it can be computed with AVERAGE().
  • In R, use mean(); in Python, use statistics.mean() or numpy.mean().
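The points above can be sketched in Python using the standard `statistics` module on Data Series 1. The extra value of 1000 is a made-up outlier, added only to show how strongly a single extreme value moves the mean.

```python
import statistics

# Data Series 1 from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Mean = sum of the values divided by the number of values.
mean_value = statistics.mean(data)
print(f"Mean: {mean_value:.2f}")

# Hypothetical outlier (not part of the article's data) to
# illustrate the mean's sensitivity to extreme values.
data_with_outlier = data + [1000]
mean_with_outlier = statistics.mean(data_with_outlier)
print(f"Mean with outlier: {mean_with_outlier:.2f}")
```

One added value moves the mean from about 17.42 to 93, which is larger than every original data point.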

 



 

Median


  • The median is the middle number, found by ordering all data points and picking the one in the middle (or, if there are two middle numbers, taking the mean of those two).
  • It may be thought of as the "middle" value of a data set.
  • Where m = total number of values in the dataset and r = position of the middle value in the ordered data. For odd m, r = (m + 1)/2; for even m, select the two values that are the same distance from both ends, at positions m/2 and m/2 + 1.

Let’s continue example 1: arrange the data in increasing order: 1, 2, 2, 2, 4, 4, 18, 19, 32, 33, 41, 51. As m = 12 is even, take the average of the two middle numbers, 4 and 18.

Median = (4 + 18) / 2 = 11
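The calculation above can be reproduced with a short Python sketch. It works out the median by hand from the sorted data and then compares the result with `statistics.median` from the standard library.

```python
import statistics

# Data Series 1 from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

ordered = sorted(data)
m = len(ordered)  # total number of values

if m % 2 == 1:
    # Odd count: a single middle value.
    median_value = ordered[m // 2]
else:
    # Even count: average the two middle values.
    median_value = (ordered[m // 2 - 1] + ordered[m // 2]) / 2

print(f"Median: {median_value}")

# The standard library gives the same result.
library_median = statistics.median(data)
print(f"statistics.median: {library_median}")
```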

 


                                                                                         

 Mode

  • The frequency of an attribute value is the number of times the value occurs in the data set.
  • It is found by collecting and organizing the data in order to count the frequency of each result.
  • The mode is the most frequent number, that is, the number that occurs the highest number of times.
  • The notions of frequency and mode are typically used with categorical data, but they can be used with any data type.
  • Let’s continue example 1: the dataset is 19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1 and the mode is 2, which occurs three times.
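A minimal Python sketch of the frequency count and mode described above, using `collections.Counter` from the standard library. The list of colors is a hypothetical example, included only to show that the same approach works for categorical data.

```python
from collections import Counter

# Data Series 1 from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Count how many times each value occurs.
frequency = Counter(data)

# most_common(1) returns the (value, count) pair with the
# highest count.
mode_value, mode_count = frequency.most_common(1)[0]
print(f"Mode: {mode_value} (occurs {mode_count} times)")

# Hypothetical categorical data, not from the article.
colors = ["red", "blue", "red", "green"]
color_mode, color_count = Counter(colors).most_common(1)[0]
print(f"Most frequent color: {color_mode}")
```

If several values tie for the highest count, a data set has more than one mode; `statistics.multimode` returns all of them.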



 


Data Distribution

  • We can describe the series from example 1 (19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1) as: “Minimum of 1, maximum of 51, average of 17.42.”
  • Given this description of the data series, what picture do we form of the data? The easiest way to visualize data is to look at its “distribution”.
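To make this concrete, the sketch below computes the three summary numbers for the example series and then prints a simple text plot with one star per occurrence of each value. The text plot is only an illustrative stand-in for a chart; it is not part of the original article.

```python
from collections import Counter

# Data Series 1 from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

minimum = min(data)
maximum = max(data)
average = sum(data) / len(data)
print(f"Minimum of {minimum}, maximum of {maximum}, average of {average:.2f}")

# One row per distinct value, one star per occurrence.
counts = Counter(data)
for value in sorted(counts):
    print(f"{value:>2} | {'*' * counts[value]}")
```

The summary alone does not reveal that half of the values are 4 or less; the plot makes that clustering visible.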

 

Frequency Distribution: A distribution can be visualized from a frequency distribution table:

  • A frequency distribution is a representation, in either graphical or tabular format, that displays the number of observations within a given interval.
  • In a frequency distribution, we count how many times each observation (or each interval of observations) occurs in the data.

Class    Number of Students
1        10
2        8
3        15
4        20
5        18
6        6
7        23
Total    100

 

 

 

Frequency Distribution

Steps to find the frequency distribution 

  • Find the range of the given data (largest number – smallest number)
  • Determine the width of the class interval
  • Count the number of observations that fall into each class interval
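The steps above can be sketched in Python on Data Series 1, followed by counting the observations in each class. The number of classes (5) is an assumption chosen for this illustration, as is the convention of placing the largest value in the last class; the article does not prescribe either.

```python
import math

# Data Series 1 from this article.
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Step 1: range = largest number - smallest number.
lowest, highest = min(data), max(data)
data_range = highest - lowest

# Step 2: class width. The number of classes is an assumed
# value for this example.
num_classes = 5
class_width = math.ceil(data_range / num_classes)

# Step 3: count the observations in each class. Each class
# includes its start and excludes its end, except the last
# class, which also takes the largest value.
counts = [0] * num_classes
for value in data:
    index = min((value - lowest) // class_width, num_classes - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    start = lowest + i * class_width
    end = start + class_width
    print(f"{start:>2} to {end:>2}: {count}")
```

A different number of classes gives a different width and a different picture of the same data, so it is worth trying a few values.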

 

Types of Frequency: There are two types of frequency, described below

  1. Relative Frequency: To compute relative frequency, obtain the frequency count for a class interval (or subgroup) of the population and divide it by the frequency count for the total population.

Relative Frequency = Frequency of Class Interval / Total Number of Observations

  2. Cumulative Frequency: The cumulative frequency for each class interval is the frequency for that class interval added to the preceding cumulative total. Applying the same running total to the relative frequencies gives the cumulative relative frequency, which reaches 1 at the last class interval.

 

Class (Rs)    Frequency (Students)    Relative Frequency    Cumulative Relative Frequency
20-30         5                       0.125                 0.125
30-40         8                       0.2                   0.325
40-50         9                       0.225                 0.55
50-60         10                      0.25                  0.8
60-70         6                       0.15                  0.95
70-80         2                       0.05                  1
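The table values can be reproduced with a short Python sketch that starts from the class frequencies. The variable names are illustrative; `itertools.accumulate` from the standard library produces the running totals.

```python
from itertools import accumulate

# Class intervals (Rs) and frequencies from the table.
classes = ["20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
frequencies = [5, 8, 9, 10, 6, 2]

total = sum(frequencies)

# Relative frequency = class frequency / total observations.
relative = [f / total for f in frequencies]

# Cumulative frequency = running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency = running total / total.
cumulative_relative = [c / total for c in cumulative]

for row in zip(classes, frequencies, relative, cumulative, cumulative_relative):
    label, freq, rel, cum, cum_rel = row
    print(f"{label:<6} {freq:>3} {rel:>6.3f} {cum:>3} {cum_rel:>6.3f}")
```

The last cumulative relative frequency is always 1, which is a quick check that no class has been missed.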

 

 

 

 Author:  Mohit T (Algae Services)
