Feature engineering, also known as feature extraction or feature discovery, is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data.[1] The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to that process.
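For example, a raw timestamp is rarely useful to a model directly, but features derived from it with domain knowledge often are. A minimal sketch using pandas (the column names are illustrative):

import pandas as pd

# Raw data: a single timestamp column.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2023-01-06 08:15", "2023-01-07 23:40", "2023-01-09 12:05",
])})

# Domain knowledge suggests that hour of day, day of week, and weekend
# status are more predictive than the raw timestamp itself.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5
print(df)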
The feature engineering process is:[2]
The following list[4] provides some typical ways to engineer useful features:
Features vary in significance.[8] Even relatively insignificant features may contribute to a model. Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).[9]
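As an illustration, scikit-learn's univariate feature selection keeps only the k features that score highest against the target (a minimal sketch; the dataset and the choice of k = 10 are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

# Keep the 10 features with the highest ANOVA F-score against the label;
# discarding weak features reduces the risk of overfitting.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (569, 30) -> (569, 10)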
Feature explosion occurs when the number of identified features grows inappropriately. Common causes include:
Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection.[10]
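For instance, L1 regularization shrinks the coefficients of uninformative features to exactly zero, effectively pruning an exploded feature set (a minimal sketch on synthetic data; the alpha value is arbitrary):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 50 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only 2 matter

# The L1 penalty drives the coefficients of the 48 irrelevant features to zero.
model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))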
Automation of feature engineering is a research topic that dates back to the 1990s.[11] Machine learning software that incorporates automated feature engineering has been commercially available since 2016.[12] Related academic literature can be roughly separated into two types:
MRDTL generates features in the form of SQL queries by successively adding clauses to the queries.[citation needed] For instance, the algorithm might start out with
SELECT COUNT(*) FROM ATOM t1 LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id GROUP BY t1.mol_id
The query can then successively be refined by adding conditions, such as "WHERE t1.charge <= -0.392".[citation needed]
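Applied to the query above, such a refinement would produce, for example:
SELECT COUNT(*) FROM ATOM t1 LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id WHERE t1.charge <= -0.392 GROUP BY t1.mol_id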
However, most MRDTL studies base their implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by techniques such as tuple ID propagation.[13][14] Efficiency can be further increased by incremental updates, which avoid recomputing features from scratch.[15][promotional source?]
There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:
[OneBM] helps data scientists reduce data exploration time, allowing them to try out many ideas by trial and error in a short time. It also enables non-experts who are unfamiliar with data science to quickly extract value from their data with little effort, time, and cost.[20]
The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition.[32][33]
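DFS is implemented in the open-source Featuretools library. A minimal sketch of its use (the table and column names are illustrative; the calls reflect the Featuretools 1.x API):

import pandas as pd
import featuretools as ft

customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [25.0, 40.0, 10.0],
})

# Build an entity set linking the parent table (customers) to the child
# table (transactions) through the shared customer_id key.
es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions, index="transaction_id")
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# DFS stacks aggregation and transform primitives across the relationship,
# yielding features such as SUM(transactions.amount) per customer.
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
print(feature_defs)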
A feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where groups of features created from multiple data sources can be created or updated, and where new datasets can be built from those feature groups for training models or for use by applications that do not want to compute the features themselves but simply retrieve them when needed to make predictions.[34]
A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used.[35]
Feature stores can be standalone software tools or built into machine learning platforms.
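As a minimal illustration of these capabilities, a feature store could expose an interface like the following (a hypothetical sketch, not the API of any particular product):

from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class FeatureStore:
    # Toy in-memory feature store: registers versioned feature-generation
    # code, applies it to raw data, and serves the results on request.
    registry: Dict[Tuple[str, int], Callable] = field(default_factory=dict)
    values: Dict[Tuple[str, int], dict] = field(default_factory=dict)

    def register(self, name: str, version: int, fn: Callable) -> None:
        # Store the (versioned) code used to generate the feature.
        self.registry[(name, version)] = fn

    def materialize(self, name: str, version: int, raw: dict) -> None:
        # Apply the registered code to raw data and cache the results.
        fn = self.registry[(name, version)]
        self.values[(name, version)] = {k: fn(v) for k, v in raw.items()}

    def serve(self, name: str, version: int, entity_id):
        # Return a precomputed feature value, e.g. to a model at prediction time.
        return self.values[(name, version)][entity_id]

store = FeatureStore()
store.register("avg_purchase", 1, lambda xs: sum(xs) / len(xs))
store.materialize("avg_purchase", 1, {"user_42": [25.0, 40.0, 10.0]})
print(store.serve("avg_purchase", 1, "user_42"))  # 25.0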
Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error.[36][37] Deep learning algorithms may be used to process a large raw dataset without explicit feature engineering.[38] However, deep learning algorithms still require careful preprocessing and cleaning of the input data.[39] In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process.[40]
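For example, even a deep network trained on tabular data usually benefits from the inputs being cleaned and standardized first (a minimal sketch on synthetic data):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=100.0, scale=50.0, size=(1000, 8))  # raw, unscaled inputs

# Zero-mean, unit-variance inputs make gradient-based training more stable.
X = StandardScaler().fit_transform(X_raw)
print(X.mean(axis=0).round(2), X.std(axis=0).round(2))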