The Waterfall Model, introduced in 1970, was the first widely used SDLC model. It is a sequential model: the stages are connected in series, and each phase starts only after the previous phase is completed. It is called the “Waterfall Model” because its diagram resembles a waterfall. Since work proceeds sequentially, there is no overlap between the steps.

Imbalanced data is a common problem in machine learning classification. Class imbalance appears in many areas, including medical diagnosis, spam filtering, and fraud detection.

The main problem with imbalanced data is that most machine learning algorithms work best with balanced data, i.e. data where each class of the target variable is equally represented. On imbalanced data, a model can score a high accuracy while failing on the minority class, which is known as the Accuracy Paradox.

E.g., in the Credit Card fraud detection dataset, only 1.7% of transactions are labeled fraudulent. Put simply, a classifier that determines…
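
The effect of such an imbalance on accuracy can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the dataset size of 1,000 transactions is made up, the 1.7% fraud rate is taken from the example above, and the "classifier" is a deliberately trivial one that always predicts the majority class.

```python
# Hypothetical dataset: 1,000 transactions, 1.7% of them fraudulent (label 1).
n_total = 1000
n_fraud = 17
labels = [1] * n_fraud + [0] * (n_total - n_fraud)

# A trivial "classifier" that always predicts the majority class (not fraud).
predictions = [0] * n_total

# Accuracy: fraction of all transactions that were labeled correctly.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / n_total

# Recall on the fraud class: fraction of actual frauds that were caught.
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = true_positives / n_fraud

print(f"accuracy = {accuracy:.3f}")
print(f"fraud recall = {recall:.3f}")
```

The accuracy looks excellent even though not a single fraudulent transaction is detected, which is why metrics such as recall are usually checked alongside accuracy on imbalanced data.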

The term linear regression generally refers to predicting a real number. However, it can also be used for classification (predicting a category or a class). The term linear in the name linear regression refers to the fact that the method models data with a linear combination of the explanatory variables.

Linear regression can be further divided into two types of algorithms:

**Simple Linear Regression:**

If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
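
A minimal sketch of simple linear regression, assuming the usual least-squares fit of a line `y = slope * x + intercept`. The function name `fit_simple_linear_regression` and the sample data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"slope = {slope}, intercept = {intercept}")

# Predict the dependent variable for a new value of the independent variable.
prediction = slope * 6.0 + intercept
print(f"prediction for x = 6: {prediction}")
```

The sketch assumes the `x` values are not all identical; otherwise `sxx` is zero and the slope is undefined.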

Descriptive statistics summarizes data through quantitative measures such as the mean and median to give a better understanding of the data. It does not involve any generalization or inference beyond what is available. This means that descriptive statistics simply represent the available data (the sample) and are not based on probability theory.

There are 3 measures of central tendency:

- Mean: It is also known as the simple average and is the most common measure of central tendency. Its major downside is that it is easily affected by outliers (abruptly high values…
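
The sensitivity of the mean to outliers can be seen in a small sketch using Python's standard `statistics` module. The numbers are made up for illustration, and the median is shown only as a point of comparison.

```python
import statistics

# Made-up sample without outliers.
values = [30, 32, 35, 38, 40]

# The same sample with one abruptly high value added.
values_with_outlier = values + [500]

mean_before = statistics.mean(values)
mean_after = statistics.mean(values_with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(values_with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

A single extreme value more than triples the mean here, while the median barely moves.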

Inferential Statistics is a branch of statistics that is used in Data Science to draw valuable inferences from the data by looking into different graphs and plots. It relies heavily on probability theory and distributions.

A distribution is a function that shows the possible values for a variable and how often they occur. It assigns a probability to every possible outcome and a probability of zero to out-of-range outcomes.

There are many different types of probability distributions. Some of them include the normal distribution, chi-square distribution, binomial distribution, and Poisson distribution. …
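
As a hedged illustration of a distribution as a function from outcomes to probabilities, the sketch below tabulates a binomial distribution. The helper name `binomial_pmf` and the parameters (4 trials, success probability 0.5) are assumptions chosen for the example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    if k < 0 or k > n:
        # Out-of-range outcomes get probability zero.
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 4, 0.5

# Map every possible outcome to its probability.
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}
total = sum(distribution.values())

for k, prob in distribution.items():
    print(f"P(X = {k}) = {prob}")
print(f"sum of probabilities = {total}")
```

The probabilities over all possible outcomes add up to 1, and an impossible outcome such as 5 successes in 4 trials gets probability 0.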

Principal Component Analysis, or PCA, is a method for finding low-dimensional representations of a dataset that retain as much of the original variation as possible. It is generally used in Data Science to reduce the number of inputs so that a model trains faster (often done for Neural Networks, as they take a lot of time to train). Another use case is plotting clusters of n features that could not otherwise be visualized. In PCA, each of the new dimensions is a linear combination of the original n features.
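
A minimal sketch of the idea, using only the standard library: it centers the data, builds the covariance matrix, and finds the first principal component with power iteration. Power iteration is a simplification chosen here for readability; libraries typically use an eigendecomposition or SVD instead. The function name and the sample data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (component, projections) for the first principal component.

    rows: list of equal-length lists of floats (samples x features).
    """
    n = len(rows)
    d = len(rows[0])
    # Center each feature at zero mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: converges to the direction of largest variance.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Each projection is a linear combination of the original features.
    projections = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, projections


# Made-up 2-feature data lying on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, projections = first_principal_component(data)
print("component:", component)
print("1-D representation:", projections)
```

Because the sample points lie on a single line, one component captures all of the variation. The sketch assumes the data is not constant and that the starting vector is not orthogonal to the leading component.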

Dimensionality reduction methods, such as PCA…

Ensemble Learning is a machine learning technique that combines several base models to produce a single, stronger predictive model. Its main goal is to increase the overall accuracy of the model.
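
One simple way to combine base models is majority voting, sketched below. The three base models are hypothetical threshold rules invented for the example; voting is only one of several possible combination strategies.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the base models' predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical base models, each a simple threshold rule on one number.
def model_a(x):
    return 1 if x > 3 else 0


def model_b(x):
    return 1 if x > 5 else 0


def model_c(x):
    return 1 if x > 7 else 0


def ensemble_predict(x):
    """Combine the base models' predictions by majority vote."""
    return majority_vote([model(x) for model in (model_a, model_b, model_c)])


results = [ensemble_predict(x) for x in (2, 4, 6, 8)]
print(results)
```

With an odd number of binary voters there are no ties, so the ensemble always returns the label that at least two of the three base models agree on.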

The life cycle of a Data Science project can be divided into four main parts:

- Business Understanding
- Data Acquisition and Understanding
- Modeling
- Deployment

Together with their sub-processes, these parts make up the data science workflow. Now let’s take a deeper look into these parts.

The first thing to do when starting a Data Science project is to build an understanding of the business. The main objective here is to identify the central goal of the project.

The entire cycle revolves around a business goal. What will you solve if you don’t have a precise problem? It is essential to understand the Business…

One of the important things we can do with supervised learning is classification, and one of the basic machine learning algorithms for getting started with classification is the Naive Bayes algorithm.

It is a very basic algorithm based on the central theorem of conditional probability, “the Bayes Theorem”. Since it is a probabilistic machine learning algorithm, it can be used in a wide variety of classification tasks.

Bayes’ Theorem is a simple mathematical formula used for calculating conditional probabilities.

Conditional probability is a measure of the probability of an event occurring given that another event has already occurred.

The formula for…
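
As a hedged illustration, Bayes' theorem in its standard form, P(A|B) = P(B|A) · P(A) / P(B), can be evaluated numerically. The function name `bayes_posterior` and all probabilities below are made up for the example (a spam filter that looks at one word).

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # Total probability of the evidence B.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Made-up numbers for illustration only.
p_spam = 0.2              # P(spam)
p_word_given_spam = 0.6   # P(word appears | spam)
p_word_given_ham = 0.05   # P(word appears | not spam)

p_spam_given_word = bayes_posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(f"P(spam | word) = {p_spam_given_word:.2f}")
```

Observing the word raises the probability of spam from the prior of 0.2 to a posterior of about 0.75, which is the kind of update a Naive Bayes classifier performs for each feature.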

Data Science Enthusiast