Feature Engineering Made Easy
Sinan Ozdemir, Divya Susarla
Cover
版权信息
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to Feature Engineering
Motivating example – AI-powered communications
Why feature engineering matters
What is feature engineering?
Understanding the basics of data and machine learning
Supervised learning
Unsupervised learning
Unsupervised learning example – marketing segments
Evaluation of machine learning algorithms and feature engineering procedures
Example of feature engineering procedures – can anyone really predict the weather?
Steps to evaluate a feature engineering procedure
Evaluating supervised learning algorithms
Evaluating unsupervised learning algorithms
Feature understanding – what’s in my dataset?
Feature improvement – cleaning datasets
Feature selection – say no to bad attributes
Feature construction – can we build it?
Feature transformation – enter math-man
Feature learning – using AI to better our AI
Summary
Feature Understanding – What's in My Dataset?
The structure, or lack thereof, of data
An example of unstructured data – server logs
Quantitative versus qualitative data
Salary ranges by job classification
The four levels of data
The nominal level
Mathematical operations allowed
The ordinal level
Mathematical operations allowed
The interval level
Mathematical operations allowed
Plotting two columns at the interval level
The ratio level
Mathematical operations allowed
Recap of the levels of data
Summary
Feature Improvement – Cleaning Datasets
Identifying missing values in data
The Pima Indian Diabetes Prediction dataset
Exploratory data analysis (EDA)
Dealing with missing values in a dataset
Removing harmful rows of data
Imputing the missing values in data
Imputing values in a machine learning pipeline
Pipelines in machine learning
Standardization and normalization
Z-score standardization
The min-max scaling method
The row normalization method
Putting it all together
Summary
Feature Construction
Examining our dataset
Imputing categorical features
Custom imputers
Custom category imputer
Custom quantitative imputer
Encoding categorical variables
Encoding at the nominal level
Encoding at the ordinal level
Bucketing continuous features into categories
Creating our pipeline
Extending numerical features
Activity recognition from the Single Chest-Mounted Accelerometer dataset
Polynomial features
Parameters
Exploratory data analysis
Text-specific feature construction
Bag-of-words representation
CountVectorizer
CountVectorizer parameters
The TF-IDF vectorizer
Using text in machine learning pipelines
Summary
Feature Selection
Achieving better performance in feature engineering
A case study – a credit card defaulting dataset
Creating a baseline machine learning pipeline
The types of feature selection
Statistical-based feature selection
Using Pearson correlation to select features
Feature selection using hypothesis testing
Interpreting the p-value
Ranking the p-value
Model-based feature selection
A brief refresher on natural language processing
Using machine learning to select features
Tree-based model feature selection metrics
Linear models and regularization
A brief introduction to regularization
Linear model coefficients as another feature importance metric
Choosing the right feature selection method
Summary
Feature Transformations
Dimension reduction – feature transformations versus feature selection versus feature construction
Principal Component Analysis
How PCA works
PCA with the Iris dataset – manual example
Creating the covariance matrix of the dataset
Calculating the eigenvalues of the covariance matrix
Keeping the top k eigenvalues (sorted in descending order)
Using the kept eigenvectors to transform new data-points
Scikit-learn's PCA
How centering and scaling data affects PCA
A deeper look into the principal components
Linear Discriminant Analysis
How LDA works
Calculating the mean vectors of each class
Calculating within-class and between-class scatter matrices
Calculating eigenvalues and eigenvectors for S_W⁻¹S_B
Keeping the top k eigenvectors, ordered by descending eigenvalue
Using the top eigenvectors to project onto the new space
How to use LDA in scikit-learn
LDA versus PCA – Iris dataset
Summary
Feature Learning
Parametric assumptions of data
Non-parametric fallacy
The algorithms of this chapter
Restricted Boltzmann Machines
Not necessarily dimension reduction
The graph of a Restricted Boltzmann Machine
The restriction of a Boltzmann Machine
Reconstructing the data
MNIST dataset
The BernoulliRBM
Extracting PCA components from MNIST
Extracting RBM components from MNIST
Using RBMs in a machine learning pipeline
Using a linear model on raw pixel values
Using a linear model on extracted PCA components
Using a linear model on extracted RBM components
Learning text features – word vectorizations
Word embeddings
Two approaches to word embeddings – Word2Vec and GloVe
Word2Vec – another shallow neural network
The gensim package for creating Word2Vec embeddings
Application of word embeddings – information retrieval
Summary
Case Studies
Case study 1 – facial recognition
Applications of facial recognition
The data
Some data exploration
Applied facial recognition
Case study 2 – predicting topics of hotel review data
Applications of text clustering
Hotel review data
Exploration of the data
The clustering model
SVD versus PCA components
Latent semantic analysis
Summary
Other Books You May Enjoy
Leave a review – let other readers know what you think