The Waterfall Model, introduced in 1970, was the first widely used SDLC model. It is a sequential model: the stages are arranged in series, and a phase starts only after the previous phase is complete. It is called the “Waterfall Model” because its diagram resembles a waterfall. Because the work proceeds sequentially, the phases do not overlap.
Imbalanced data is a common problem in machine learning classification. Class imbalance appears in many areas, including medical diagnosis, spam filtering, and fraud detection.
The main problem with imbalanced data is that most machine learning algorithms work best with balanced data, i.e. data where each class of the target variable is equally represented. On imbalanced data, a model can score a high accuracy while rarely or never detecting the minority class. This is known as the Accuracy Paradox.
E.g., in the Credit Card fraud detection dataset, only 1.7% of transactions are labeled fraudulent. But if we think about it in layman's terms, a classifier that determines…
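To make the Accuracy Paradox concrete, here is a minimal sketch. It assumes a synthetic list of 1,000 labels with 17 fraud cases (1.7%) and a hypothetical baseline that labels every transaction as non-fraud; neither comes from a real dataset, and the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the classifier caught."""
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    actual_positives = sum(1 for t in y_true if t == positive)
    return true_positives / actual_positives if actual_positives else 0.0


# Assumed toy data: 1 = fraud, 0 = legitimate (17 of 1,000 are fraud).
y_true = [1] * 17 + [0] * 983

# Hypothetical baseline: always predict "legitimate".
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print(f"accuracy     = {acc:.3f}")  # 0.983
print(f"fraud recall = {rec:.3f}")  # 0.000
```

The baseline looks excellent on accuracy yet catches no fraud at all, which is why metrics such as recall are usually reported alongside accuracy on imbalanced data.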
This tutorial is part one of a series on the subprocesses involved in a Data Science project. I will try to answer the questions that come to the mind of a data science rookie, covering the why, how, and what along the way. We will also use the pandas library to work with the different datasets.
Data is “anything and everything”: it is a collection of information.
The term linear regression generally refers to predicting a real number. However, it can also be used for classification (predicting a category or a class). The term linear in the name linear regression refers to the fact that the method models data with a linear combination of the explanatory variables.
Linear regression can be further divided into two types of algorithms:
Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, the algorithm is called Simple Linear Regression.
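As an illustration of the single-variable case, the sketch below fits a line y = intercept + slope * x by ordinary least squares using the closed-form formulas. The function name and the toy data are assumptions made for this example.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumed toy data that lies exactly on y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

slope, intercept = fit_simple_linear_regression(x, y)
prediction = intercept + slope * 6  # predict y for a new x

print(slope, intercept, prediction)  # 2.0 1.0 13.0
```

With real, noisy data the points will not sit exactly on the line, and the same formulas return the line that minimizes the squared vertical distances.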
Descriptive Statistics creates a summary of data through quantitative measures such as the mean and median to give a better understanding of the data. It does not involve any generalization or inference beyond the data available. In other words, descriptive statistics only represent the available data (the sample) and are not based on probability theory.
There are 3 measures of central tendency:
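The three measures usually meant here are the mean, the median, and the mode. A minimal sketch using Python's standard `statistics` module follows; the sample values are made up for illustration.

```python
import statistics

# Assumed toy sample.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # arithmetic average: 30 / 6
median = statistics.median(data)  # middle of the sorted data: (3 + 5) / 2
mode = statistics.mode(data)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Here the mean is 5, the median is 4.0, and the mode is 3; the three differ because the sample is skewed toward larger values.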
Inferential Statistics is a branch of statistics used in Data Science to draw valuable inferences from data by examining different graphs and plots. It relies heavily on probability theory and distributions.
A distribution is a function that shows the possible values of a variable and how often they occur. It assigns a probability to every possible outcome and a probability of zero to out-of-range outcomes.
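A small sketch of this idea, assuming the variable is the sum of two fair six-sided dice (an example chosen here, not taken from the article): each possible sum gets a probability, and any value outside 2 to 12 gets zero.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution():
    """Map each possible sum of two fair dice to its exact probability."""
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    total = sum(counts.values())  # 36 equally likely rolls
    return {value: Fraction(count, total) for value, count in counts.items()}


dist = two_dice_distribution()


def probability(value):
    """Probability of a given sum; out-of-range values have probability 0."""
    return dist.get(value, Fraction(0))


print(probability(7))      # 1/6, the most likely sum
print(probability(13))     # 0, impossible with two dice
print(sum(dist.values()))  # 1, probabilities over all outcomes add up to one
```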
There are many different types of probability distributions. Some of them include the normal distribution, chi-square distribution, binomial distribution, and Poisson distribution. …
Principal Component Analysis, or PCA, is a method for finding a low-dimensional representation of a dataset that retains as much of the original variation as possible. In Data Science it is generally used to reduce the number of inputs so that a model trains faster (often done for Neural Networks, as they take a long time to train). Another use case is plotting clusters of data with n features, which would otherwise not be possible to visualize. In PCA, each of the new dimensions is a linear combination of the original n features.
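The sketch below shows one common way to compute PCA: center the data and take its singular value decomposition with NumPy. The `pca` helper and the synthetic three-feature dataset are assumptions made for illustration, not part of any library.

```python
import numpy as np


def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured along each direction, as a share of the total.
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    # Each new dimension is a linear combination of the original features.
    scores = X_centered @ components.T
    return scores, components, ratio


# Assumed synthetic data: three features driven by one hidden factor t.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    -t + 0.1 * rng.normal(size=200),
])

scores, components, ratio = pca(X, n_components=2)
print(scores.shape)  # (200, 2): three features reduced to two
print(ratio)         # share of the variance kept by each component
```

Because the three features are almost perfectly correlated, the first component should capture nearly all of the variance, so the two-dimensional `scores` can be plotted with little loss of information.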
Dimensionality reduction methods, such as PCA…
Data Science Enthusiast