What is Statistics in Data Science?

  

                                            Statistics for Data Scientist

             

  • Statistics is a form of mathematical analysis.
  • It uses models, representations, and synopses for a given set of experimental data or real-life studies.
  • It is “the science of the collection, analysis, interpretation, presentation, and organization of data.”







Statistics summarize information, because describing every individual data point is impractical.

 

There are two types of statistics:

  • Descriptive Statistics:
    • Descriptive statistics deals with the analysis and methods related to the collection, organization, summarizing, and presentation of data.
    • By applying the techniques of descriptive statistics, raw data is collected and transformed into a meaningful form.
  • Inferential Statistics:
    • Inferential statistics draws conclusions and makes decisions about a population using information drawn from a sample.


Let's also understand what a data series and a dataset are.

  • Data Series: A row or column of numbers that is plotted in a chart is called a data series. For example, Data Series 1: 29, 41, 32, 12, 21, 12, 2, 40, 20, 12, 41, 11
  • Dataset: A collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer.


 

Let's consider an air quality dataset to summarize all the measures:



  

Variety of descriptive statistics:

  • Measures of central tendency
    • Mean
    • Median
    • Mode
  • Measures of dispersion
    • Range
    • Variance
    • Standard deviation
  • Measures of shape
    • Skewness
    • Kurtosis
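
As a rough illustration of the dispersion measures listed above, here is a minimal Python sketch. It uses the twelve-value series from the examples in this article and treats it as a full population (an assumption, so the population versions of variance and standard deviation are used).

```python
import statistics

# Twelve-value series used in this article's examples
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Population variance and standard deviation (assumption: the series
# is the whole population, so we divide by n rather than n - 1)
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print("Range:", data_range)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 2))
```

If the series were only a sample from a larger population, `statistics.variance()` and `statistics.stdev()` would be the more usual choice.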

 

In general, statistics summarizes information about data in a meaningful and relevant way.

For example: How many people in India have been affected by the coronavirus? The total affected as of September 30th, 2020 is 6,623,515.

 

What other statistics can you think of?

  • “Total Active Corona Virus affected People”
  • “Total Recovered”
  • “Total Deaths”


Mean

  • The mean is the simple mathematical average of a set of two or more numbers.
  • The mean is the most common measure of the location of a set of points. However, the mean is very sensitive to outliers.
  • The mean can only be used with numeric data.
  • In Excel, it can be computed with AVERAGE().
  • In R, use mean(); in Python, use statistics.mean() or numpy.mean().
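
To make the points above concrete, here is a small Python sketch that computes the mean of the article's example series by hand and with the standard library. The extra value of 1000 is a made-up outlier, added only to show how strongly a single extreme value can pull the mean.

```python
import statistics

# Example series used in this article
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Mean by definition: sum of the values divided by how many there are
mean_manual = sum(data) / len(data)

# Same result from the standard library
mean_stdlib = statistics.mean(data)

print(round(mean_manual, 2))  # 17.42

# Hypothetical outlier (not part of the original data)
with_outlier = data + [1000]
mean_with_outlier = statistics.mean(with_outlier)
print(mean_with_outlier)
```

With the single outlier added, the mean rises from about 17.42 to 93, even though the other twelve values are unchanged.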

 



 

Median


  • The median is the middle number, found by ordering all data points and picking the one in the middle (or, if there are two middle numbers, taking the mean of those two numbers).
  • It may be thought of as the "middle" value of a data set.
  • The position of the middle value in the ordered data is r = (m + 1) / 2, where m = total number of values in the dataset and r = position of the middle value. (When m is even, select the two values that are the same distance from both ends.)

Let's continue example 1. Arrange the data in increasing order: 1, 2, 2, 2, 4, 4, 18, 19, 32, 33, 41, 51. As m is even (m = 12), take the average of the two middle numbers, 4 and 18.

Median = (4 + 18) / 2 = 11
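
The same calculation can be sketched in Python. The `median` helper below is written for illustration only; the standard library's `statistics.median()` gives the same answer.

```python
import statistics

# Example 1 series
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

def median(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    m = len(ordered)
    mid = m // 2
    if m % 2 == 1:
        # Odd count: a single middle value exists
        return ordered[mid]
    # Even count: average the two values around the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median(data))             # 11.0
print(statistics.median(data))  # 11.0
```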

 


                                                                                         

Mode

  • The frequency of an attribute value is the number of times the value occurs in the data set.
  • It is found by collecting and organizing the data in order to count the frequency of each result.
  • The mode is the most frequent number, that is, the number that occurs the highest number of times.
  • The notions of frequency and mode are typically used with categorical data, but they can be used with any data type.
  • Let's continue example 1: The dataset is 19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1 and the mode is 2, which occurs three times.
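
One possible way to find the mode in Python is to count frequencies with `collections.Counter`. The `colors` list is a made-up categorical example, included only to show that the mode also works on non-numeric data.

```python
from collections import Counter
import statistics

# Example 1 series
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]

# Count how many times each value occurs
counts = Counter(data)

# The most common value and its frequency
mode_value, mode_frequency = counts.most_common(1)[0]
print(mode_value, mode_frequency)  # 2 3

# Hypothetical categorical data (not from the article)
colors = ["red", "blue", "red", "green"]
print(statistics.mode(colors))  # red
```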



 


Data Distribution

  • We can describe the series from example 1 (19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1) as: “Minimum of 1, maximum of 51, average of about 17.42.”
  • Given this description of the data series, what picture do we form of the data? The easiest way to visualize data is to look at its “distribution”.

 

Frequency Distribution: A distribution is a visualization of a frequency distribution table.

  • A frequency distribution is a representation, in either graphical or tabular format, that displays the number of observations within a given interval.
  • In a frequency distribution, we count how many times each observation occurs when observations are repeated.

Class    Number of Students
1        10
2        8
3        15
4        20
5        18
6        6
7        23
Total    100
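
Since a distribution is a visualization of a frequency table, the table above can be turned into a simple text bar chart. This is only a sketch; the `render_bars` helper is a name invented for this example.

```python
# Frequency table from the article: class -> number of students
frequency = {1: 10, 2: 8, 3: 15, 4: 20, 5: 18, 6: 6, 7: 23}

# The frequencies should add up to the total number of students
total = sum(frequency.values())

def render_bars(freq):
    """Return one text line per class, drawing one '#' per student."""
    return [f"Class {label}: {'#' * count} ({count})"
            for label, count in freq.items()]

for line in render_bars(frequency):
    print(line)
print("Total:", total)
```

The longest bar (class 7) stands out immediately, which is harder to see from the numbers alone.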

 

 

 

Frequency Distribution

Steps to find the frequency distribution 

  • Find the range for the given data (Largest Number – Smallest Number)
  • Determine the width of the class interval
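
The two steps above can be sketched in Python as follows. The number of classes (5) is an assumption chosen for illustration, and the final counting step is added here only to show what the class width is used for.

```python
import math

# Example series used earlier in this article
data = [19, 4, 33, 2, 51, 32, 2, 41, 18, 2, 4, 1]
num_classes = 5  # assumption: picked for illustration

# Step 1: range = largest number - smallest number
data_range = max(data) - min(data)

# Step 2: width of each class interval, rounded up
class_width = math.ceil(data_range / num_classes)

# Count the observations falling in each interval. Each interval
# includes its lower bound and excludes its upper bound, except the
# last one, which also includes the maximum value.
lowest = min(data)
counts = [0] * num_classes
for value in data:
    index = min((value - lowest) // class_width, num_classes - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    lower = lowest + i * class_width
    upper = lower + class_width
    print(f"{lower}-{upper}: {count}")
```

Choosing a different number of classes changes the width and therefore the shape of the resulting table, so it is worth trying a few values.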

 

Types of Frequency: There are two types of frequency, described below.

  1. Relative Frequency: To compute relative frequency, obtain a frequency count for the total population and a frequency count for a subgroup or class interval of the population.

Relative Frequency = Frequency of class interval / Total number of observations

  2. Cumulative Frequency: The cumulative frequency for each class interval is the frequency for that class interval added to the preceding cumulative total. In the table below, the cumulative column is expressed as a proportion of the total.

 

Class (Rs)   Frequency (Students)   Relative Frequency   Cumulative Relative Frequency
20-30        5                      0.125                0.125
30-40        8                      0.2                  0.325
40-50        9                      0.225                0.55
50-60        10                     0.25                 0.8
60-70        6                      0.15                 0.95
70-80        2                      0.05                 1
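
The relative and cumulative columns of this table can be reproduced with a few lines of Python. This is a minimal sketch that takes the class frequencies from the table as given.

```python
from itertools import accumulate

# Class intervals and frequencies from the table
classes = ["20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
frequencies = [5, 8, 9, 10, 6, 2]

total = sum(frequencies)

# Relative frequency = frequency of class interval / total observations
relative = [f / total for f in frequencies]

# Cumulative frequency = running total of the frequencies
cumulative = list(accumulate(frequencies))

# Cumulative frequency as a proportion of the total
cumulative_relative = [c / total for c in cumulative]

for label, f, r, c in zip(classes, frequencies, relative, cumulative_relative):
    print(label, f, round(r, 3), round(c, 3))
```

The last cumulative relative frequency is always 1, which is a quick check that no class has been missed.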

 

 

 

 Author:  Mohit T (Algae Services)
