The Waterfall Model is the first widely used SDLC model, introduced in 1970. It is a sequential model, i.e., all the stages are connected in series: the next phase starts only once the previous phase is completed. It is called the "Waterfall Model" because its diagrammatic representation resembles a waterfall. Since the work in this model is done sequentially, there is no overlap between the steps.


Introduction

Imbalanced data is a common problem with data used for machine learning classification. Class imbalance can be found in many different areas, including medical diagnosis, spam filtering, and fraud detection.

The main problem with imbalanced data is that most machine learning algorithms work best with balanced data, i.e., data where each class of the target variable is equally represented. The misleading performance this causes is known as the Accuracy Paradox: a model can score high accuracy simply by always predicting the majority class.

E.g., in the Credit Card Fraud Detection dataset, only 1.7% of transactions are labeled fraudulent. But thinking about it naively, a classifier that determines…
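To make the Accuracy Paradox concrete, here is a minimal sketch of that naive behavior. The labels are synthetic, generated at roughly the 1.7% fraud rate quoted above, and the "classifier" simply always predicts the majority class:

```python
import numpy as np

# Synthetic labels mimicking the ~1.7% fraud rate quoted above
# (0 = legitimate, 1 = fraudulent); the exact rate here is illustrative.
rng = np.random.default_rng(seed=0)
y_true = (rng.random(100_000) < 0.017).astype(int)

# A "classifier" that always predicts the majority class (legitimate)
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(f"accuracy of always predicting 'legitimate': {accuracy:.3f}")  # ~0.983

# Despite ~98% accuracy, it catches zero fraudulent transactions
fraud_recall = (y_pred[y_true == 1] == 1).mean()
print(f"recall on the fraud class: {fraud_recall:.3f}")  # 0.000
```

The near-perfect accuracy hides the fact that the model is useless for the minority class, which is exactly why accuracy alone is a poor metric on imbalanced data.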


The term linear regression generally refers to predicting a real number, though it can also be used for classification (predicting a category or class). The "linear" in the name refers to the fact that the method models the data as a linear combination of the explanatory variables.

Types of Linear Regression

Linear regression can be further divided into two types of algorithms:

Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
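As a minimal sketch of simple linear regression, the snippet below fits a line to synthetic data with scikit-learn (used here purely for illustration; the data and coefficients are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: one independent variable x, one numerical dependent variable y
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 10, size=(100, 1))               # single feature, shape (n, 1)
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 1, 100)   # y ≈ 3x + 5 plus noise

model = LinearRegression().fit(x, y)
print(f"slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")
print(f"prediction at x=4: {model.predict([[4.0]])[0]:.2f}")
```

The fitted slope and intercept should come out close to the 3 and 5 used to generate the data, which is the whole job of simple linear regression: recover the single linear relationship between x and y.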


Introduction

Descriptive statistics is the practice of summarizing data through quantitative measures like the mean and median so as to get a better understanding of it. It does not involve any generalization or inference beyond what is available. In other words, descriptive statistics simply represent the data (sample) at hand and are not based on any theory of probability.

Measures of Central Tendency

There are 3 measures of central tendency:

  1. Mean: It is also known as the simple average and is the most common measure of central tendency. Its huge downside, shown in the sketch below, is that it is easily affected by outliers (abruptly high values…
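The following small sketch illustrates that outlier sensitivity. The salary figures, including the single extreme value, are entirely made up:

```python
import numpy as np

# Illustrative salaries; the single outlier (1,000,000) is hypothetical
salaries = np.array([40_000, 45_000, 50_000, 55_000, 60_000, 1_000_000])

print(f"mean:   {np.mean(salaries):,.0f}")    # ~208,333 — dragged up by the outlier
print(f"median: {np.median(salaries):,.0f}")  # 52,500 — robust to the outlier
```

One extreme value quadruples the mean while barely moving the median, which is why the median is often preferred when the data contains outliers.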

What is Inferential Statistics?

Inferential statistics is a branch of statistics used in data science to draw valuable inferences from the data by looking into different graphs and plots. It relies heavily on probability theory and distributions.

Distribution

A distribution is a function that shows the possible values for a variable and how often they occur. It assigns a probability to every possible outcome and zero probability to outcomes outside the range.

Types of Probability Distributions

There are many different probability distributions. Some of the most common include the normal, chi-square, binomial, and Poisson distributions. …
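A small sketch, using NumPy's random generator, draws samples from the distributions just named; the parameter values are arbitrary, and the comments give the theoretical means the empirical averages should approach:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Draw samples from a few of the distributions named above
normal   = rng.normal(loc=0.0, scale=1.0, size=10_000)   # normal (Gaussian)
binomial = rng.binomial(n=10, p=0.5, size=10_000)        # binomial
poisson  = rng.poisson(lam=3.0, size=10_000)             # Poisson
chisq    = rng.chisquare(df=4, size=10_000)              # chi-square

# Empirical means approach the theoretical ones as the sample grows
print(f"normal mean ≈ 0:     {normal.mean():.2f}")
print(f"binomial mean ≈ 5:   {binomial.mean():.2f}")   # n * p = 10 * 0.5
print(f"poisson mean ≈ 3:    {poisson.mean():.2f}")    # λ
print(f"chi-square mean ≈ 4: {chisq.mean():.2f}")      # df
```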


Introduction

Principal Component Analysis, or PCA, is a method for finding low-dimensional representations of a dataset that retain as much of the original variation as possible. In data science it is generally used to reduce the number of inputs so the model trains faster (often in the case of neural networks, as they take a lot of time to train). Another use case is visualizing clusters in data with n features that could not otherwise be plotted. In PCA, each new dimension is a linear combination of the original n features.
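As a minimal sketch, the snippet below runs PCA on a synthetic 10-feature dataset using scikit-learn (an illustrative tool choice; the data is generated from two hidden factors so that a 2-dimensional representation can capture most of the variation):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 200 samples with n = 10 correlated features
rng = np.random.default_rng(seed=0)
latent = rng.normal(size=(200, 2))                  # 2 underlying factors
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))

# Reduce to 2 components, e.g. for plotting or faster training
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                    # shape (200, 2)

# Each new dimension is a linear combination of the 10 original features
print(pca.components_.shape)                        # (2, 10)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```

The `components_` matrix holds the linear-combination weights mentioned above, and `explained_variance_ratio_` reports how much of the original variation the low-dimensional representation keeps.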

Idea

Dimensionality reduction methods, such as PCA…


Introduction

Ensemble Learning is a machine learning technique that combines several base models in order to produce one optimal predictive model. Its main goal is to increase the overall accuracy of the model.
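Before listing the types, here is a minimal sketch of the idea using majority (hard) voting, one of several ensemble methods, with scikit-learn on synthetic data; the choice of base models is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different base models combined by majority (hard) voting
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1_000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
], voting="hard")

ensemble.fit(X_train, y_train)
print(f"ensemble accuracy: {ensemble.score(X_test, y_test):.3f}")
```

Each base model votes on the class, and the ensemble returns the majority decision, which tends to smooth out the individual models' mistakes.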

Types of Ensemble Methods


Introduction

The Life Cycle of a Data Science Project can be divided into 4 Main Parts:

  1. Business Understanding
  2. Data Acquisition and Understanding
  3. Modeling
  4. Deployment

All these parts, along with their sub-processes, make up the data science workflow. Now let's take a deeper look into each of them.

Business Understanding

The first thing to do when we start a data science project is to build some business understanding. The main objective here is to identify the central goal of the project.

The entire cycle revolves around a business goal. What will you solve if you don’t have a precise problem? It is essential to understand the Business…


Introduction

One of the important things we can do with supervised learning is classification, and one of the most basic machine learning algorithms to start classification with is the Naive Bayes algorithm.

It is a very basic algorithm based upon the main theorem of conditional probability, Bayes' Theorem. Since it is a probabilistic machine learning algorithm, it can be used in a wide variety of classification tasks.
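As a minimal sketch of such a classification task, the snippet below uses scikit-learn's GaussianNB, one of several Naive Bayes variants, on the classic iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# A classic classification task: predict the iris species
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()            # Naive Bayes variant for continuous features
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```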

Bayes Theorem

Bayes’ Theorem is a simple mathematical formula used for calculating conditional probabilities.

Conditional probability is a measure of the probability of an event occurring given that another event has already occurred.

The formula for…
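In standard notation, conditional probability and Bayes' Theorem are written as:

```latex
% Conditional probability: the probability of A given that B has occurred
P(A \mid B) = \frac{P(A \cap B)}{P(B)}

% Bayes' Theorem, which follows from the definition above
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```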

Dhruv Khanna

Data Science Enthusiast
