AI algorithms

Written by Marco Salerno. Last updated 2 months ago.

XGBoost

https://xgboost.readthedocs.io/en/latest/parameter.html
XGBoost (Extreme Gradient Boosting) implements gradient boosting: it builds a strong predictor by training a sequence of weak predictors (typically decision trees), each one correcting the errors left by the ones before it. It is a non-parametric machine learning algorithm, meaning it does not rely on assumptions about the underlying distribution of the data. Memory usage can become a concern on extremely large datasets.
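The idea of "a sequence of weak predictors, each improving on the previous ones" can be sketched in a few lines of plain Python. This is a toy illustration only, not XGBoost's implementation: the helper names (`fit_stump`, `boost`, `mse`) and the data are made up, the weak predictor is a single-split "stump", and the regularization and other refinements of the real library are left out.

```python
# Toy gradient boosting for squared error with one-split "stumps" as weak predictors.
# Illustration only: XGBoost itself is far more elaborate.

def fit_stump(xs, residuals):
    """Pick the threshold that best splits the residuals into two constant halves."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: a simple curve that a single stump cannot fit well.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

one_round = boost(xs, ys, n_rounds=1)
fifty_rounds = boost(xs, ys, n_rounds=50)
print("training MSE after 1 round:  ", mse(one_round, xs, ys))
print("training MSE after 50 rounds:", mse(fifty_rounds, xs, ys))
```

Each round only has to explain what the earlier rounds got wrong, which is why many weak predictors together can fit a relationship that none of them could fit alone.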

Random Forests

https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.RandomForestRegressor.html
Random Forest is an ensemble algorithm formed by averaging the outputs of a set of decision trees. It is a non-parametric machine learning algorithm, meaning it does not rely on assumptions about the underlying distribution of the data. It can be memory-intensive when the forest has many trees or the dataset has many features.
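The "averaging" step can be checked directly with scikit-learn's `RandomForestRegressor`, the implementation linked above. The sketch below assumes NumPy and scikit-learn are installed; the data is synthetic and the settings (`n_estimators=25`, `random_state=0`) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with every individual tree, then average the results by hand.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should agree with the hand-computed average.
matches = np.allclose(manual_average, forest.predict(X))
print("forest prediction equals the mean of its trees:", matches)
```

Every tree is kept in memory after training, which is one reason the memory footprint grows with the number of trees.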

Extra Trees

https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html
Extra Trees (Extremely Randomized Trees) Regressor is a machine learning model that predicts numerical values using an ensemble of decision trees. It introduces extra randomness in how tree splits are chosen (split thresholds are drawn at random rather than optimized) and, optionally, in data sampling, which makes it more robust and less prone to overfitting. It is a non-parametric machine learning algorithm, meaning it does not rely on assumptions about the underlying distribution of the data. It is usually faster to train than Random Forest but has similar memory limitations.

LightGBM

https://lightgbm.readthedocs.io/en/stable
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

  • Faster training speed and higher efficiency.

  • Lower memory usage.

  • Better accuracy.

  • Support of parallel, distributed, and GPU learning.

  • Capable of handling large-scale data.

Linear Regression

https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LinearRegression.html
Linear regression is a parametric statistical technique that fits a linear relationship between the target and the features. It is best suited to the Z-Score preprocessor.
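As a rough sketch of how the two fit together, the plain-Python example below standardizes one feature and then fits an ordinary least squares line to it. It assumes that the Z-Score preprocessor means "subtract the mean and divide by the standard deviation"; the function names and the numbers are made up for illustration.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Made-up raw feature values and targets.
raw_feature = [10.0, 20.0, 30.0, 40.0, 50.0]
target = [1.0, 3.0, 5.0, 7.0, 9.0]

scaled_feature = z_score(raw_feature)
slope, intercept = fit_line(scaled_feature, target)

# With a standardized feature, the intercept is the mean of the target and the
# slope is the change in the target per one standard deviation of the feature.
print("slope:", slope, "intercept:", intercept)
```

Standardized inputs put all coefficients on a comparable scale, which makes a linear model easier to read and compare across features.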

Keras Neural Networks

https://keras.io/about/
Neural networks are loosely modeled on the layers of neurons in the human brain. They fit multiple layers of interconnected nodes, producing a non-linear transformation of the input data. Like the tree-based algorithms, they are non-parametric in the sense used here: they do not rely on assumptions about the underlying distribution of the data.

Support Vector Machines

https://scikit-learn.org/1.5/modules/generated/sklearn.svm.SVR.html
With a linear kernel, an SVM behaves like a parametric model that fits a linear relationship; with a nonlinear kernel such as RBF, it can also capture non-parametric, nonlinear relationships to some extent. However, SVMs might not perform as well on highly nonlinear data as non-parametric methods like decision trees or neural networks. Because SVMs are sensitive to the scale of their inputs, preprocessing steps like feature scaling are recommended before applying them to any data type.
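One common way to make sure scaling always happens before the SVM is to chain the two steps in a scikit-learn pipeline. The sketch below assumes NumPy and scikit-learn are installed; the data is synthetic, and the `SVR` settings shown (`kernel="rbf"`, `C=1.0`, `epsilon=0.1`) are simply the library defaults written out, not a recommendation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Two synthetic features on very different scales (made up for illustration).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=300),                # values around 1
    rng.normal(scale=1000.0, size=300),  # values around 1000
])
y = np.sin(X[:, 0]) + X[:, 1] / 1000.0

# StandardScaler is fitted on the training data and applied before SVR sees it.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X, y)
predictions = model.predict(X)
print("R^2 on the training data:", model.score(X, y))
```

Without the scaling step, the large-valued feature would dominate the distances the RBF kernel is built on, and the small-valued one would be almost ignored.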

Generalized Additive Models

https://www.statsmodels.org/stable/gam.html
GAMs are a flexible class of statistical models that can accommodate both parametric and non-parametric relationships between the predictors and the response variable. This flexibility makes them suitable for various data types and distributions. They are especially useful when the relationship between predictors and the response is unknown or suspected to be nonlinear.

DeepTables

https://deeptables.readthedocs.io/en/latest/
DeepTables is a machine learning tool for efficiently working with tabular data using neural networks. It automates data preprocessing, feature engineering, model selection, hyperparameter tuning, and ensemble learning. DeepTables can handle both parametric and non-parametric variables.

Scalability with large datasets

Certain models do not scale well and will struggle with datasets containing millions of rows, or with high-dimensional datasets containing hundreds of features. To reduce training time and avoid out-of-memory errors when using these models:

  • Use a smaller training universe

  • Shorten the dataset period or lower the dataset frequency (use longer intervals between samples)

  • Reduce the number of features
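The effect of the three reductions above can be estimated with a quick back-of-the-envelope script. The dataset below is entirely hypothetical (100 tickers, 104 weekly samples, 20 features), as are the field names; the point is only to show how quickly the number of values shrinks when the reductions are combined.

```python
import random

# Hypothetical dataset: one row per (week, ticker), each with a dict of features.
random.seed(0)
tickers = [f"T{i}" for i in range(100)]
feature_names = [f"f{i}" for i in range(20)]
rows = [
    {"week": week, "ticker": ticker,
     "features": {name: random.random() for name in feature_names}}
    for week in range(104)
    for ticker in tickers
]

universe = set(tickers[:50])        # 1. smaller training universe
first_week, step = 52, 4            # 2. shorter period, lower frequency
kept_features = feature_names[:5]   # 3. fewer features

reduced = [
    {"week": r["week"], "ticker": r["ticker"],
     "features": {name: r["features"][name] for name in kept_features}}
    for r in rows
    if r["ticker"] in universe and r["week"] >= first_week and r["week"] % step == 0
]

cells_before = len(rows) * len(feature_names)
cells_after = len(reduced) * len(kept_features)
print(f"rows: {len(rows)} -> {len(reduced)}")
print(f"feature values: {cells_before} -> {cells_after}")
```

Because the reductions multiply, modest cuts on each axis can bring a dataset down by one or two orders of magnitude.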

Scale Well: LightGBM, XGBoost, Random Forest, Extra Trees, Linear Regression, Keras, DeepTables.

Do Not Scale Well: Support Vector Machines (SVMs), Generalized Additive Models (GAMs).
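For kernel SVMs, one reason is that training compares pairs of rows, so the work grows roughly with the square of the row count. The sketch below estimates how much memory a full, dense kernel matrix of 64-bit floats would need if it were stored outright. This is a simplification: real implementations usually keep only a cache of kernel values rather than the whole matrix, and the function name is made up. It does not describe GAMs.

```python
def dense_kernel_matrix_gib(n_rows, bytes_per_value=8):
    """Memory needed for a full n x n matrix of float64 values, in GiB."""
    return n_rows * n_rows * bytes_per_value / 2**30


for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} rows -> {dense_kernel_matrix_gib(n):,.1f} GiB")
```

Ten times more rows means a hundred times more pairwise values, which is why reducing the row count helps these models so much.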
