Instructional guides on specific topics in Machine Learning/AI. Because the questions raised in one class are often generalizable, these articles may be inspired by a single student with a good question. For a broad overview of Machine Learning, continue below.
This is content for my machine learning course offered through UCLA and is still under construction. Check back for code snippets in Python and R as well as text descriptions to be updated or added when complete.
Throughout history we have been fascinated by the mysterious processes operating hidden within our brains. The distinction between animals, which have nervous systems, and plants, which don't, prompted early speculation into neural structure and function that has continued to the present day. Although historical questioning, guided by insufficient technology, also led to what are now bizarre speculations, from an animating force ("animal magnetism") to deciphering personality traits and talents from bumps on the skull, advances in physics and engineering would eventually establish a foundation for neuroscience and artificial intelligence. Chief among these advances was the invention of the computer. Parallels with the human brain, emphasized by pioneers such as Turing, Zuse, and von Neumann, reimagined the computer: no longer a mere computing machine for automation, it could simulate neural systems as well.
In the years following this realization, the model of "brain as computer" would inspire the development of artificial neural networks and a new area of research eventually named computational neuroscience. Importantly, however, much of the progress in artificial learning systems has been made without strict structural equivalence to brains; efforts to achieve it have traditionally hit roadblocks. More modern approaches take a "soft" stance, focusing instead on the mathematics of the brain, and have, interestingly, found that artificial neural networks, which are built without precise mimicry in mind, mathematically resemble their biological counterparts. This serendipitous convergence suggests that computation may be a fundamental building block of the physical world and that the materials we see and interact with may simply establish boundary conditions, different ways of shaping or realizing the underlying computation.
As such, statistics and calculus are fundamental to Machine Learning and neuroscience because they hint not at what is immediately perceptible but at what could be. In any scientific work, we must nevertheless play a game in which we are concerned with the hidden, fundamental features of the physical world yet are in the unfortunate position of probing them with whatever is easily accessible. That means we are stuck with language and all its failings. Our initial task then centers on terminology. For instance, while human learning may be intuitive, since it is experiential, confusion instantly arises when combining the words "machine" and "learning." What is a machine? And, subsequently, what does it mean for a machine to learn?
Life first appeared billions of years ago in the form of single-celled organisms, multicellular life roughly 600 million years ago, and simple nervous systems around 550-600 million years ago; throughout, nature has been ceaselessly generating patterns according to mathematical principles or physical laws. Life and neural systems are among these patterns, as are learning and memory. When a dog learns to expect a treat after an action, the dog's nervous system undergoes structural changes, associating the action and its context (the room, the time of day, the sounds) with receiving that treat.
The changes in the nervous system described here can be likened to an employee network undergoing changes as projects and demands shift: employees are hired or leave, and teams and supervisors are reorganized. Pushing the employee network example further raises the question of whether learning and memory are simply mathematical properties of networks in general. If so, do viruses or bacteria learn? Obviously, if learning is restricted to nervous systems, the answer is no. However, both are networks of complex chemistry; each attempts to modify its underlying relationships or links to adapt to changing contexts or environments. Importantly, these changes, breaking and adding network links, are examples of learning and memory in a far deeper sense. The pathogen is not simply a network in a vacuum; it is a network that changes specifically to become incorporated into the larger networks of host organisms (including humans). From a computational perspective, boundaries once thought to be well defined, between living and non-living as well as natural and artificial, become increasingly dubious.
It is no longer strange to think of machines that learn, matching or even surpassing human abilities. This has become clear with the success of AlphaGo, a neural network algorithm trained to play the ancient Chinese game Go, as well as with recent releases from OpenAI, including the GPT-3 and GPT-4 algorithms, which are examples of a type of neural network called a transformer, and ChatGPT, the popular software interface to these models. For now, however, the immediate concern, after accepting that artificial learning is similar and increasingly equivalent to biological learning, is to clarify the guidelines, best practices, and core mathematical concepts that make machine learning work.
Machine learning (ML) is centered on the assumption that a dataset contains meaningful patterns and that these patterns, or meaningful relationships, can be represented mathematically as functions. The task in machine learning is simply to develop algorithms that learn these functions from the dataset. ML algorithms are broadly categorized as unsupervised and supervised learning algorithms. The two types learn, or extract, functions from data in different ways.
For example, imagine a dataset containing features of houses (square footage, number of bedrooms, etc.) along with each house's market value. Here, a house's value is clearly a function of (or determined by) its features. We can then use ML to learn this mapping function: a function that accepts a house's features as input and returns a value estimate. To accurately estimate these values, however, the function has to start somewhere, initially providing poor estimates; the error (the difference between the predicted and observed values) helps adjust the function and push the estimates to better fit reality. Since an actual housing value guides this process, we are "supervising" the learning, and the housing dataset can therefore be thought of as containing patterns suited for supervised machine learning algorithms, a family of algorithms that need guidance from the data to learn anything.
Alternatively, let's imagine that there are only a few houses whose market value is known, but there is a large database with features or attributes of houses. This new dataset (where the prices are unknown) remains information rich. When price was accessible, the features could be used to predict trends (or variability) in price. Clearly, this can no longer be modeled directly, but note that the relationship between a house's attributes and its price has not disappeared.
Figure 1. Houses in a neighborhood, showcasing various styles and sizes, which are features that contribute differently to the purchase price. A dataset is like the snapshot above, only it represents the scene, and the relationships it depicts, in numbers; numbers that machine learning algorithms learn from, turning the static snapshot into insight applicable to any house, insofar as the same features (e.g. square footage, number of bedrooms, etc.) can be tabulated.
Square footage is, as before, related to the price; in general, the bigger the house, the more expensive it is. By grouping the houses according to their features, it is possible to indirectly uncover something such as price. How might this work? To start, picture each house as a point on a 2D grid. Imagine different groupings emerging; some points (houses) are close and others are far. This is the goal: we want to arrive at this picture.
Now let's revisit where we were before this visualization exercise. We have a dataset that isn't two dimensional. If we were to plot the houses on a 2D grid, we would have to arbitrarily select two of the measured features in our dataset. We are therefore looking for a function that takes houses' features as input and outputs their coordinates in a new, low dimensional space, where distant points are interpreted as houses with very different features. Notably, some houses may be neither extremely close nor far, so there also has to be a method that defines the boundaries between similar and different types. The two-step scenario above, (1) reducing dimensionality and (2) establishing rules to group the data into different types, broadly characterizes unsupervised learning algorithms. Each type, group, or cluster detected by an unsupervised learning algorithm can offer predictions: if pricing data is available for some but not all houses of a similar type, predictions, albeit less reliable than with supervision, are possible. The clusters are the bridge that extends from a dataset that cannot directly address a problem toward that problem and its solution.
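As a rough sketch of these two steps in code (the house features below are invented, and the choice of three clusters is an arbitrary assumption), we can reduce hypothetical house features to two dimensions with PCA and then group them with k-means:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# hypothetical house features (no prices needed for unsupervised learning)
rng = np.random.default_rng(0)
houses = pd.DataFrame({
    "sqft": rng.normal(1800, 400, 200),
    "bedrooms": rng.integers(1, 6, 200),
    "bathrooms": rng.integers(1, 4, 200),
    "lot_size": rng.normal(7000, 1500, 200),
})

# step 1: reduce the features to a 2D space
coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(houses))

# step 2: group the houses in the reduced space (3 clusters is an arbitrary choice)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)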
When carefully unpacking the descriptions of supervised and unsupervised algorithms, machine learning, somewhat unintuitively, reduces not to developing better learning algorithms but to identifying and curating high quality datasets that accurately represent reality. If datasets are poor or too simple, trivializing the complexities, nothing substantive (or useful) can be learned. Put this way, the question "what is machine learning?" becomes "what does the data say?" This subtle transition moves ML from an apparently specialized subject into a generic science, as this new question drives every scientist in every discipline. Subsequently, ML is not revolutionary. It is guided by old, well established scientific practices: collecting data to represent reality, simulating experiments, defining control groups and hypotheses, and applying the mathematical/statistical tools that are in part the bedrock of all scientific work.
In this sense, data must be collected carefully, interpreted, and questioned in statistical/machine learning as in all areas of science. We’ll therefore discuss this process next, probing datasets and readying them for more advanced analyses or determining that they aren’t currently suitable for anything else (e.g. the data can’t be corrected or the corrections would introduce significant bias).
Let's take some time to review where we have been and the road ahead.
Roadmap
Now let’s incorporate some more terminology: algorithm. We’ll briefly revisit the distinction between unsupervised and supervised learning by simply introducing this new word.
Algorithm is a complex word that has historically been used in many different contexts. Here, we’ll be referring to an algorithm as a series of mathematical operations. An algorithm for our purposes is then like a recipe: first, we add this, then combine that, and finally bake it. While this is superficially mathematical — note words like “add” and “combine” in the example — the steps or mathematical operations performed by Machine Learning algorithms tend to be more complex.
Unsurprisingly, Machine Learning algorithms make predictions. Some algorithms are trained to predict observed labels (numbers or categories). Building on the previous section, these are called supervised learning algorithms, since they are "supervised" (guided) by the label. What these algorithms do is consistent with everyday experience. When answering a quiz question, your response is compared to an expected response (supervision). The deviation between your response and the expected one shapes future study habits and responses. If, however, there is no expected response to provide this guidance, a scenario that lacks a simple everyday example like quiz or test taking, the machine learning algorithm is said to be "unsupervised." Unsupervised learning algorithms organize related rows, observations, or instances according to descriptive attributes (commonly called features). These attributes or features are the columns in a dataset. In supervised learning, they predict the label. In unsupervised learning, they describe the similarities/differences that serve as grouping criteria to organize the data into clusters or groups. In sum, unsupervised learning algorithms predict cluster membership whereas supervised algorithms predict a label, comparing it to the true, expected, or observed label.
Though Machine Learning (ML) algorithms make predictions, the above discussion suggests this simple statement is deceptively complex. From the data for training the algorithm and the prediction type (e.g., diagnostic labels such as “positive” or “negative” or numbers such as volume, mass, or density) to evaluation (has the algorithm learned?), there’s a lot to consider. It is instructive to start with the most basic question: what is a Machine Learning algorithm?
Machine learning algorithms can be categorized as either instance-based or model-based (the most common). A model is a function, but it is intended to simulate a true function, a process that remains unknown to the analyst and must be estimated from incomplete and noisy datasets or samples. The functions that characterize model-based machine learning accept data and supply an output (a prediction). Instance-based machine learning does not rely on a model (or function). Models are built from data and are thereafter independent of it; logically, this implies that instance-based algorithms remain dependent on the data. Algorithms of this type may, for instance, predict labels by evaluating distances between new observations, which is to say, observations not used for training, and the training data. A classic example is k-Nearest Neighbors (KNN). Due to the prominence of model-based algorithms, subsequent focus is on these rather than instance-based algorithms. To that end, a basic model-based example will be helpful in understanding more sophisticated algorithms.
A linear model such as ordinary least squares (OLS) regression is the simplest example. In OLS regression the function is y = mx + b, the linear equation, where x is a feature vector that predicts y, and m and b are parameters, the slope and intercept, respectively, that are estimated from the data. OLS can be visualized using a scatter plot with y on the vertical (y-) axis and the feature, x, on the x-axis. In this plot, there are many plausible lines, though not all are equally useful. OLS selects the best line from the numerous possibilities by identifying the one that minimizes the sum of the squared deviations between it and the individual data points. For this basic one-predictor example, the best line can be identified exactly. However, as will mostly be the case with later examples in the book, exact solutions are often impractical, and the parameter values are approximations. In this sense, the learning in Machine Learning algorithms is parameter estimation. Just as there are many ways humans and animals learn, there are many methods for estimating the parameters. We will begin examining in detail the ways machines (models or functions) set parameters when discussing How Machine Learning Works.
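To make this concrete, the sketch below estimates the slope and intercept for a one-predictor example two ways: with the textbook closed-form formulas and with NumPy's least-squares solver. The data are made up for illustration.
import numpy as np

# toy data: y is roughly a line with noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 4.0 + rng.normal(0, 1.0, 50)

# closed-form OLS estimates: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b = y.mean() - m * x.mean()
print(m, b)  # should land near the true values of 2.5 and 4.0

# the same estimates via NumPy's least-squares solver
X = np.column_stack([x, np.ones_like(x)])
(m_ls, b_ls), *_ = np.linalg.lstsq(X, y, rcond=None)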
The classical approach to understanding how something works is to take it apart. When we do this, we realize hidden inside the ML algorithm or implied by it is the data (or its inputs). The first step is then to analyze the data and prepare it. A typical workflow includes the following considerations:
Datasets and the ML models that are trained on them never go beyond representations, static snapshots of a dynamic, constantly changing world. In other words, a dataset is always a simplification, and the strategy for preparing datasets for ML tasks should be consistent with this simplification. Some changes to the data are not strictly necessary and depend on the algorithm being used.
import seaborn as sns
import matplotlib.pyplot as plt
mpg_data = sns.load_dataset("mpg")
mpg_data.head()
# kde = kernel density estimation, estimates a smooth line to approximate the distribution in the histogram
sns.displot(mpg_data, x = "mpg", kind = "hist", kde = True)
plt.show()
Extreme data points, or outliers, can impact machine learning models negatively or positively. On the negative side, outliers are by definition inconsistent with general trends in the data. If the goal is to maximize predictive success, the model must accommodate all data points, including the outliers, a task that results in an overly complex solution that will perform poorly on new data. As such, a strategy to limit the influence of outliers can help improve long-term success. However, some machine learning algorithms are impacted to a lesser extent by outliers, and retaining these data points can also add quality and diversity, offering a more realistic representation of the prediction problem. The choice between addressing them or not will therefore depend on (1) the algorithm, (2) the size, and (3) the quality of the data available for training and evaluating. Quality is determined through exploratory data analysis, but knowledge about the data that cannot be inferred from the data alone is also very important. For instance, outliers can reflect simple instrumentation errors (e.g. damaged sensors returning nonsense data), and such offending data points should always be removed. When deciding to address outliers that are not errors, removing them remains an option, referred to as trimming, but replacing them with new, less extreme values is an alternative, referred to as winsorizing. Here, a predictor/feature is binned into quantiles, then a range is defined by the analyst such that data points falling outside it are outliers. Finally, the values at the upper and lower bounds of the range are assigned to the more extreme data points.
Figure. Left, a simple linear regression model is fit to five data points, an outlier appears in the upper right; the trend line is influenced by this one value. Right, it seems like the slope should be far less steep for the majority of the data, which is seen when addressing the outlier by replacing it with the maximum of the four other data points.
import seaborn as sns
import numpy as np
mpg_data = sns.load_dataset("mpg")
num_cols = list(mpg_data.select_dtypes(np.number).columns)
# center and scale (Z-standardization)
# see also the StandardScaler
scaled_data = mpg_data[num_cols].apply(lambda x: (x-x.mean())/(x.std(ddof=0)))
# upper and lower outliers, defined as 3 standard deviation units from the mean
scaled_data[abs(scaled_data) >= 3] = np.nan
# remove these observations
scaled_data.dropna(how = 'any', axis = 'index', inplace = True)
# alternatively, use quantiles
# below, the function detects the cutoffs for the top and bottom 5%
def find_top_bottom_dist(df, variable):
    bottom_dist = df[variable].quantile(0.05)
    top_dist = df[variable].quantile(0.95)
    return bottom_dist, top_dist

# detect these boundaries for the mpg and weight columns/variables
boundaries = {
    "mpg": find_top_bottom_dist(mpg_data, "mpg"),
    "weight": find_top_bottom_dist(mpg_data, "weight"),
}

# subset the data using the boundaries
# below, the function creates a boolean array, flagging any observation falling outside the lower and upper boundaries
def locate_outliers(df, variable, lower, upper):
    locations = np.where(df[variable] > upper, True,
                         np.where(df[variable] < lower, True, False))
    return locations

locations = [locate_outliers(mpg_data, col, *boundaries[col]) for col in ["mpg", "weight"]]
# remove rows flagged as outliers in either column
mpg_data_trimmed = mpg_data.loc[~(locations[0] | locations[1])]
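The code above trims the flagged rows. The alternative described earlier, winsorizing, caps extreme values at the boundaries instead of removing them; a minimal sketch using pandas' clip method with the same 5%/95% cutoffs:
# winsorize instead of trim: cap values at the 5th and 95th percentiles
mpg_winsorized = mpg_data.copy()
for col in ["mpg", "weight"]:
    lower, upper = boundaries[col]
    mpg_winsorized[col] = mpg_winsorized[col].clip(lower=lower, upper=upper)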
Scale can severely impact machine learning models, making training more difficult, lowering predictive success, and affecting interpretability for some algorithms. As a practical example of the importance of scale, imagine needing to select the minimum distance between two locations from several routes, some measured in kilometers and others in miles. Without converting the distances to a common unit of measurement, comparisons are ambiguous and will lead to errors. Common approaches for putting data on a common scale fall into the broad categories of standardization and normalization. Scaling does not change the shape of a variable's distribution. However, normalization also refers to methods that reshape poorly behaved distributions so that they resemble well defined theoretical distributions (e.g. the normal or Gaussian distribution).
Standardization (z-scaling)
This approach centers each numeric variable, predictor, or feature by subtracting the mean from each value, then scales the result by dividing it by the standard deviation. Each newly scaled column has a mean of 0 and a standard deviation of 1, or approximately so (Figure, A).
Range Scaling (Min-Max)
Range scaling assigns new upper and lower bounds, commonly [0,1]. However, any arbitrary range, b and a, is possible by modifying the range scaling formula as shown in the Figure below (B,C).
Figure. Formulae for scaling data. A) Z-scaling or standardization. B) Range scaling [0,1]. C) Generalization of range scaling, bounding the data between two arbitrary values, b and a.
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X_train_s = s.fit_transform(X_train)
# scale test data relative to training data
X_test_s = s.transform(X_test)
# the scalers return NumPy arrays
# to convert back to pandas use pd.DataFrame(X_train_s, columns=X_train.columns)
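Range scaling has an analogous helper in Scikit-Learn, MinMaxScaler. A minimal sketch is below, again assuming X_train and X_test come from an earlier train/test split; the feature_range argument implements the arbitrary [a, b] generalization from the figure.
from sklearn.preprocessing import MinMaxScaler

# rescale each feature to [0, 1] using the training data's min and max
mm = MinMaxScaler()
X_train_mm = mm.fit_transform(X_train)
X_test_mm = mm.transform(X_test)

# any arbitrary range is possible, e.g. [-1, 1]
mm_wide = MinMaxScaler(feature_range=(-1, 1))
X_train_wide = mm_wide.fit_transform(X_train)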
Variance is a low-level proxy for variable/feature importance. Low variance implies low discriminability among the observations (rows). The goal in supervised machine learning is to predict an outcome accurately. This outcome varies among observations (rows), so the same should be true of useful/important features.
Alternatively, a high correlation between features implies redundant information, so removing these redundancies will reduce dimensionality and possibly improve models.
The row and column counts of a dataset are its dimensions. Dimensionality refers to the column count. A single row in a dataset represents an observation (a data point) in a high dimensional space. This space describes how the observation varies according to the features (columns).
To think about datasets as spaces and introduce dimensionality as a key topic in machine learning, start by imagining a simple two-column dataset describing the position of objects in a room; each is uniquely located, varying along the length and width; some are extremely close, others are far, and portions of the room are empty. If we add a third column for position along the height dimension, some overlapping objects may now appear in different locations. Describing the objects along more dimensions draws out their differences. Given continuous monitoring of the room, we could even add columns for millisecond, day, month, and year, assuming the objects are movable and an object's position in May could therefore be 1 mm different from its position in July. With each new column, objects become increasingly different from one another. At a certain point, it is no longer clear this is desirable. Are millimeter differences in position actually meaningful? It may be true that objects change position over time, but if these changes are small and most objects never move, the columns for time are complicating the dataset without adding much value. Dimensionality reduction involves identifying high and low information columns in order to identify a lower dimensional space for the data. Reducing the dimensionality of a dataset (1) decreases training (or compute) time and (2) restricts the number of ways the data can vary, which, as was seen in the room example, puts the focus on learning generalizable trends and away from meaningless detail. Key dimensionality reduction approaches are variance and correlation filters, which flag columns that are nearly constant (such as when there are many zeros) or columns that are highly similar, that is, columns whose values are identical or nearly identical in each row, suggesting they contain the same information. Additionally, there are importance filters (Figure. The concept of importance), which flag columns that might help predict an outcome, and principal component analysis (PCA), a technique that projects data into fewer dimensions while retaining a high percentage of the variability.
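The sketch below applies a variance filter, a correlation filter, and PCA to the numeric columns of the mpg dataset loaded earlier; the cutoffs (0.01 for variance, 0.9 for correlation, 95% retained variance for PCA) are arbitrary choices for illustration.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = mpg_data.select_dtypes(np.number).dropna()

# variance filter: drop near-constant columns (threshold is arbitrary)
vt = VarianceThreshold(threshold=0.01)
X_var = vt.fit_transform(X)

# correlation filter: flag one column from each highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

# PCA: project onto the components retaining ~95% of the variance
X_pca = PCA(n_components=0.95).fit_transform(StandardScaler().fit_transform(X))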
Importance refers to the relationship between a predictor and an outcome or label that improves predictions. Predictors that lack importance do not appear to be related to the outcome or labels (Figure. The concept of importance).
PCA attempts to identify a lower dimensional representation of the data that retains the variability, then projects the data into this new space, where the features (predictors or columns) become principal component (PC) coordinates that define the locations of the observations (rows) in the reduced space.
Figure. The concept of dimensionality reduction. The houses shown above are distinct from one another. However, if given the task of identifying characteristics that estimate property value, the obvious and numerous differences among the pictured houses are not all relevant. Machine learning (ML)/AI is a matter of taking the complex scene (the data) and reducing it to a more meaningful (predictive) subset. Formally, this creative or artistic effort on the part of researchers is called feature engineering, which involves dimensionality reduction: reducing the number of columns in a dataset by dropping features, creating new composite features, or running preliminary experiments to identify quality predictors. In this housing example, clearly square footage and other proxies of size, including the number of bathrooms or bedrooms, will likely predict property value. But we should always rely on quantitative analyses to support our intuitions or hypotheses.
Figure. The concept of importance. Snowflakes are variable. Similarities emerge because of physics. Uniqueness emerges because of randomness. An ML/AI algorithm wants to extract the themes or commonalities, the generative laws, while discounting meaningless variability. Analysts support this effort by "cleaning" the training data. This effort might itself involve machine learning; an algorithm may pre-process the data, removing the meaningless variability to improve the success of a second, final algorithm, one deployed in the real world.
In biological systems, learning refers to structural and functional changes in response to the outside world that help predict future events, avoiding error and improving efficiency. At school, students study course content, textbook and notes, followed by various assignments that are returned with feedback. The feedback system is rule-abiding: assignments are worth points, and success is represented as a percentage (earned points relative to total points). The percentage shapes future performance, adjusting past strategies and habits to avoid future errors.
Computers (machines) similarly study datasets, supplying an output. This output is evaluated using cost/loss functions, returning a value that quantifies the performance (success or failure). An optimization algorithm is then used to find the parameter estimates that lead to predictive success.
Key Points
Supervised machine learning algorithms (1) make predictions that are (2) compared to actual labels or values and (3) adjust their parameters to progressively reduce error; the mismatch between the prediction and the actual label. See also What is Machine Learning? The term “label” is purposefully vague because supervision includes numeric labels, as in prices of houses, as well as category labels such as houses that are “expensive” and “inexpensive” or drugs that are “effective” and “ineffective” in treating a disease. In practice, the category labels are also treated numerically (e.g. 0 and 1) since mathematics only operates on numbers. But the methods for training and evaluating algorithms on classification and regression tasks differ, as we will see later.
Optimization algorithms try to efficiently locate parameter values that minimize or maximize cost functions; the terms cost and loss are sometimes used interchangeably, but a distinction does exist at a more rigorous, mathematical level: loss refers to the error in predicting a single instance or observation, whereas the cost is the average loss.
Machine learning algorithms that are supervised make predictions that should approximate the observed labels if they are going to be useful. Loss/cost functions are mathematical expressions that quantify the prediction performance. They depend on the machine learning algorithm's parameters. So, "training" or "learning" is the process of finding the parameter values that improve performance.
A loss function is a specialized case of the more generic term in Machine Learning, the objective function. It is a function that is minimized or maximized by an optimization algorithm such as Gradient Descent (GD). Optimization algorithms are discussed in the following section. For now, it suffices to say that loss functions (and cost functions; the cost is the average loss over the entire dataset) set the ML algorithm's parameters with the assistance of an optimization algorithm. The loss describes performance, but it can be less intuitive, which is why the R2 for regression or the area under the ROC curve (ROC AUC) for classification may be used for interpretation. When used in this capacity, there is no mathematical optimization, so the term "metrics" is more appropriate. Additionally, since loss functions are numerous and the choice of one vs. another is strongly associated with specific ML algorithms, the data, and the problem, only the squared error will be discussed in detail, as it is both simple and intuitive.
Squared Error Loss
OLS, or Ordinary Least Squares, a basic regression algorithm, learns by optimizing the squared loss (see Technical Detail for clarification on this statement). For this basic regression algorithm, the goal is to estimate the parameters of a line of best fit; a line, recall, is defined by the linear equation y = mx + b. The OLS model, formally written as f(x) rather than "y", outputs a prediction. The squared loss subtracts that prediction from the observed value and squares the result. It is a loss/cost function that is differentiable and convex (U-shaped), properties that make it easier to locate its global minimum with respect to (w.r.t.) the OLS model parameters (the slope and intercept). In the next section, we will discuss the popular Gradient Descent optimization algorithm as one approach to accelerate the discovery of the best parameter estimates.
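A small numeric illustration of the squared loss and its average, the Mean Squared Error (MSE), for a candidate line; the toy values and the chosen slope and intercept are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# a candidate OLS model f(x) = m*x + b
m, b = 2.0, 0.0
predictions = m * x + b

# squared loss per observation, and the cost (average loss, i.e., the MSE)
squared_losses = (y - predictions) ** 2
mse = squared_losses.mean()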
OLS regression has a closed form solution, which means there is an exact as opposed to an approximate solution. But even here, as in cases where no closed form exists, computational complexity (time and compute resources) can make iterative optimization desirable.
Training a machine learning algorithm is an effort to bring predictions closer to the observed values (or ground truth). The training data are used to estimate a set of values (parameters) to accomplish this. Many algorithms also accept user defined values (hyperparameters) that further adjust the predictions and improve generalizability. Mathematically, the task is the minimization of a function that quantifies the performance (the loss/cost function). This function depends on the parameters, and optimization algorithms are methods to guide the search for the best values. As parameters increase or decrease, the output of the function changes. The output can vary in complex ways as the input changes; however, the functions used to train machine learning algorithms are often convex, meaning a graph of the function resembles a bowl. Such functions can be described by calculus, which makes optimization methods based on derivatives popular. In calculus, a derivative expresses a rate of change between an input and an output. When the output depends on more than one input, each input's contribution is evaluated using partial derivatives.
Gradient Descent (GD) is an optimization algorithm that involves computing the input/output relationships (the gradient, which gives the direction and magnitude of the rate of change via the partial derivatives), then applying an update rule to adjust the parameters of the machine learning algorithm or model, improving the predictions next time. Accordingly, Gradient Descent is run iteratively to eventually identify the parameter values that minimize the loss function. As the name implies, Gradient Descent depends on computing gradients. To optimize an ML algorithm, the gradient of the cost is computed w.r.t. each of its parameters. In OLS regression, the case is simpler due to the small number of parameters, and this will help illustrate the GD algorithm in action. Here, we will assume the optimized parameter is represented by the Greek letter theta; this will be the coefficient that describes the slope in the linear equation underlying OLS regression. Assuming the gradient has been computed, GD next applies an update rule, multiplying the gradient by the learning rate and then subtracting the result from the existing parameter estimate, theta. The subtraction reverses the gradient's sign, which is required to move down the gradient toward the global minimum of the Mean Squared Error (or the cost in general).
Figure. A graphical representation of parameter optimization using the Gradient Descent (GD) algorithm. For clarity, one parameter estimate is shown (blue ball). The cost function (curve) is the Mean Squared Error (MSE). By computing the gradient, the update equation or rule shown defines how theta (the blue ball) should be adjusted to minimize error. The learning rate (eta) controls the size of the change; too large a change pushes the estimate past the global minimum.
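The update rule is compact enough to sketch directly. Below, gradient descent estimates a single slope parameter (theta) for a no-intercept line under the MSE; the learning rate, iteration count, and simulated data are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + rng.normal(0, 1.0, 100)  # the true slope is 3.0

theta = 0.0   # initial slope estimate
eta = 0.01    # learning rate

for _ in range(500):
    predictions = theta * x
    # gradient of the MSE with respect to theta
    gradient = -2.0 * np.mean(x * (y - predictions))
    # update rule: step against the gradient, scaled by the learning rate
    theta = theta - eta * gradient

print(theta)  # should approach 3.0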
Machine Learning algorithms are powerful prediction engines. They are designed around autonomous detection and extraction of patterns hidden among the hundreds or thousands of variables in a dataset. At first glance, this seems purely beneficial. But technologies that automate and even extend human capabilities may behave unexpectedly and inconsistently with our values and interests. It is therefore necessary to rigorously evaluate whether an algorithm is successful and diagnose why this is the case. Researchers and analysts do so by simulating real world cases that challenge trained ML algorithms, revealing successes or failures in a low risk environment. The most common simulation method is dividing the available data into 2 or 3 datasets, depending on the amount of data. Two datasets are for training and testing, and the third is for validating the trained model before its final evaluation on the test dataset. How does this work? As discussed earlier, ML algorithms learn by setting optimal values for their parameters through data. The term "optimal" importantly refers to our specific sample of data, not all possible datasets, so it is unclear whether the algorithm is optimized for any arbitrary sample. Splitting the data into these 2 or 3 sets, given the original sample is sufficiently large, simulates additional samples, and this helps argue that the success of the algorithm is not restricted to our unique case (which must be seen as just one instance of data sampled from many theoretical possibilities; similarly sized datasets that may not physically exist at the moment but could). Apart from collecting and actualizing these samples, the next best option is splitting the data that is available (see the figure at the bottom of this section illustrating data splitting for classification versus regression tasks).
While splitting the data is common, the composition of the new datasets differs by chance, possibly resulting in overly optimistic or pessimistic performance estimates. For example, the training and testing sets may contain a large portion of observations that are easier or harder to predict accurately; success or failure then depends on chance. To address this, rather than a training, a validation, and a test set, the data can be iteratively split and the performance of the ML algorithm averaged across these iterations. Sequentially splitting the data into multiple training and testing sets by chance alone does not solve the problem. What is needed is a procedure that evaluates every observation in both a training and a testing capacity. This ensures the measures used to evaluate the algorithm are not biased by including or excluding easy or difficult to predict observations. Imposing this constraint on the splitting is referred to as Cross Validation (CV). Sometimes the number of splits is explicitly noted, as in 3-fold, 5-fold, or 10-fold Cross Validation, and Cross Validation is often repeated, as in 10-fold Cross Validation repeated 3 or more times. Splitting in this way provides a fair, unbiased representation of the data. However, much of this effort can be undone by using poor performance metrics, and it is to these that we turn our attention next.
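A minimal sketch of both strategies with Scikit-Learn, a single train/test split followed by 5-fold cross validation; X and y are assumed to be a feature table and a numeric label, and linear regression stands in for any model.
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# one train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross validation: every observation is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=cv)
print(scores.mean())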
When the ML task is supervised but the label or outcome used to guide the training is categorical or textual (though it may be encoded numerically, such as 1 or 0), the task is classification. Classification metrics can be fairly intuitive, owing to medical diagnostics and terms like false positives that are increasingly commonplace. They quantify the degree to which predicted labels (or probability scores) approximate observed labels (also referred to as the ground truth).
Accuracy
While accuracy is a common, generic word for describing model performance, it also has a specific meaning: the proportion of correct predictions to all predictions, whether correct or incorrect. Correct predictions include the True Positives (TP) and True Negatives (TN), and the prediction total consists of these (TP and TN) in addition to the incorrect counts, the False Positives (FP) and False Negatives (FN) (equation 4). The inclusion of the TN in the equation introduces bias in the presence of class or category imbalances, which arise commonly in scientific data where there are fewer "active" examples for supervision than "negative" ones. For instance, few chemicals have therapeutic value relative to all possible chemicals. Randomly sampling chemicals and testing their efficacy as treatments will likely return many negative results. An ML algorithm trained on a dataset compiling these results may learn much more about what a drug is not than anything else. And if accuracy is chosen as the metric, the high number of true negatives obscures the potentially small number of true positives.
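The imbalance problem is easy to demonstrate: a "classifier" that labels every chemical inactive scores high accuracy while detecting nothing. The counts below are invented for illustration.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 990 inactives (0) and 10 actives (1)
y_true = np.array([0] * 990 + [1] * 10)
# a useless model that always predicts "inactive"
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99, despite learning nothing
print(recall_score(y_true, y_pred))    # 0.0, no actives detected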
ROC AUC
The Area Under the ROC Curve (AUC) assesses the true positive rate (TPR, or sensitivity) as a function of the false positive rate (FPR, or 1 - specificity) while varying the probability threshold (T) for a label (Active/Inactive). If the computed probability score (x) is greater than the threshold (T), the observation is assigned to the active class. Integrating the curve provides an estimate of classifier performance, with the top left corner giving an AUC of 1.0, denoting maximum sensitivity to detect all targets or actives in the data without any false positives (equations 5-8). A theoretical random classifier gives an AUC of 0.5.
Precision – evaluates the proportion of predictions assigned to the positive class that actually belong to it, TP / (TP + FP).
Recall – evaluates the proportion of actual positives that the model correctly assigns to the positive class, TP / (TP + FN); also called sensitivity or the true positive rate.
Figure. Evaluating classification models with ROC analysis. Left, the true positive and false positive rates are computed for different probability score cutoffs. The cutoff binarizes the scores and errors are tabulated (see formulae in this section). Right, data plotted; the resulting curve is integrated, summarizing the classification success over the different cutoffs. Perfect performance is a curve that covers the entire plot area, forming a 90-degree angle at the upper left corner.
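A short sketch computing the ROC AUC, precision, and recall from probability scores; the labels and scores below are made up, and the 0.5 cutoff is an arbitrary choice.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
# probability scores for the positive (active) class
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

print(roc_auc_score(y_true, y_score))

# precision and recall require hard labels; threshold the scores at 0.5
y_pred = (y_score >= 0.5).astype(int)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))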
Recall that regression refers to a model that predicts numeric labels or outcomes from feature vectors or predictors. It aims to minimize differences between the predicted and observed numbers. Correlation is one method to quantify this agreement. Correlation metrics are diverse, but Pearson's correlation is most common in ML. This correlation compares how two numeric vectors co-vary and outputs a coefficient ranging from -1, a perfect negative (opposite) relationship, to +1, a perfect positive relationship. Correlation is transparent and interpretable, though on its own it may offer an incomplete evaluation, since it quantifies association rather than the size of the prediction errors. Tradition dictates that OLS regression models be evaluated and compared using the square of Pearson's correlation coefficient. The squared correlation is the Coefficient of Determination, interpreted as the percentage of variability in the numeric outcome or label ("y" in the equation y = mx + b) that the features or predictors account for.
In most cases the predictions need to be close rather than exact, and analysts will often defer to other metrics that directly quantify the difference between the predicted and true values, in other words, the "error." The first of these is the Root Mean Squared Error (RMSE). It is the square root of the mean squared difference between the predicted and true values. It is on the same scale as the numeric outcome or label, making it easily interpretable. Because the differences are squared before averaging, large errors in prediction are strongly emphasized by the metric, which may be undesirable if, for example, the large errors are anomalies. The RMSE may overstate anomalies or outliers, giving the impression the model is less successful. The Mean Absolute Error (MAE), the mean absolute difference between predicted and true values, weights errors, large or small, equivalently, and for this reason addresses potential shortcomings of the RMSE. As a rule of thumb, context determines the ideal metric, but it is trivial to evaluate regression models using many metrics.
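The corresponding regression metrics are one-liners in Scikit-Learn; in the sketch below, model, X_test, and y_test are assumed to exist from an earlier fit and split.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = model.predict(X_test)  # 'model' is assumed to be a fitted regressor

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)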
To this point much of the focus has been on parameters, since they define and differentiate ML algorithms; however, earlier we briefly noted that many algorithms can be further customized through hyperparameters, that is, parameters that are not estimated directly by the algorithm but are set externally. Hyperparameters, much like parameters, may take many different values. Few of the many possibilities offer improvements; most do little or nothing, or at worst negatively impact the algorithm. The hyperparameter tuning problem is therefore another optimization problem.
Historically, there are two primary methods to handle this optimization: (1) Grid Search, an exhaustive search over a grid of possible values for the hyperparameters; and (2) Random Search, which, rather than searching exhaustively, iteratively samples from hyperparameter distributions whose bounds are specified by the analyst. The strategy in defining the bounds is to keep them broad, increase the number of samples to ensure coverage of the distribution, and, after charting performance relative to the sampled values, narrow in on more precise bounds that contain promising candidates. Although random search may seem counterintuitive, it offers significant benefits compared to an exhaustive grid search, as the grid must be defined by the analyst and is hence arbitrary. One can hedge one's bets by expanding the grid, but the number of values increases exponentially. For an ML algorithm with 4 tunable hyperparameters, 3 with 4 options each and 1 taking any number between 0 and 1, there are 4x4x4 or 64 combinations to evaluate before including the 4th hyperparameter, which could increase the total combinations in the grid to 1000 or more; that is, 1000 models that must be trained, evaluated, and stored in computer memory. Yet this still represents a basic case. In practice, then, random search offers a good, if not necessary, first pass and may even outperform an exhaustive search when analysts restrict their grids to reduce the number of models that must be fit.
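A side-by-side sketch of the two approaches in Scikit-Learn; Ridge regression and its alpha hyperparameter stand in for the generic case, and the grid values, distribution bounds, and n_iter are arbitrary.
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# exhaustive grid search: every value in the grid is evaluated
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1, 10, 100]}, cv=5)

# random search: sample 20 candidates from a distribution with analyst-defined bounds
rand = RandomizedSearchCV(Ridge(), param_distributions={"alpha": loguniform(1e-3, 1e3)},
                          n_iter=20, cv=5, random_state=0)

# both are fit the same way, e.g. grid.fit(X_train, y_train) and rand.fit(X_train, y_train)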
AI/ML makes key contributions to neuroscience research in two big ways: (1) prediction and automation, and (2) interpretation, in the form of artificial neural networks that simulate biological learning and the detection of patterns or trends that would otherwise go unnoticed in large, complex datasets. The contributions to the field therefore aid both theoretical and practical work.
We defined ML as building models from data. Models are functions that attempt to approximate input/output relationships from samples of data. The noise in the data ensures that some uncertainty is built-in; however, larger samples of data increase the chance that the model offers a fair and accurate approximation. Assuming quality, size, and diversity of the datasets for training, learning, in the context of ML, is a matter of identifying the model parameters that optimize performance.
When referring to an ML model's performance, this means measuring whether it successfully predicts a label, category, or number, or organizes the data into well defined clusters based on similarities/differences. ML algorithms are for the most part model-based and are disconnected from the data after training, a term that refers to the process of optimizing the ML model's (or algorithm's) parameters. Methods that optimize the parameter values depend on (1) a loss/cost function and (2) an algorithm that accelerates the search for the best parameter values, often by computing partial derivatives for each parameter w.r.t. the cost (the average loss, that is, over all instances or observations in the data). Many ML algorithms include parameters that are not directly estimated from the data but are set by the analyst, called hyperparameters. These must still be optimized using the data; however, the process is done externally by iteratively fitting ML models with different values and evaluating the predictions. The evaluation is done using performance metrics such as R2, MAE, and RMSE for regression and Accuracy, Sensitivity, Specificity, and the ROC AUC for classification. To ensure the optimal parameter and hyperparameter values generalize, which is to say that the final model not only successfully predicts a single sample of data (the training data) but any arbitrary sample, analysts design different validation strategies.
Two common validation strategies include: (1) a train/validation/test split, where the original dataset is split into three; and (2) repeatedly splitting the dataset into training and testing sets and averaging the performance metric across splits. When the splitting is repeated with the constraint that every instance or observation is evaluated in both a training and testing context, this is called Cross Validation or 3, 5, or 10-Fold Cross Validation; the fold number referring to the number of splits, 3, 5 or 10, etc.
To this point, while some code examples have been shown, our focus has been on theory. However, Machine Learning (ML) algorithms are implemented in programming languages, and by actually seeing and working with these implementations, it is possible to develop a deeper respect for and understanding of the theory. This departure from explicit theory to application involves a new set of challenges, namely the obstacles of programming proficiency and the complexity of the specialized ML/AI software libraries. Examples that follow will be in R using the caret package and in Python using Scikit-Learn. Let's explore each of these libraries and their interfaces in greater detail. After getting comfortable, we'll move on to specific algorithms or use cases.
Scikit-Learn is a powerful ML library that interfaces with NumPy and Pandas, two data storage, processing, and analysis libraries for data science and scientific programming. The latter libraries extend Python's built-in data storage types, permitting much more efficient processing and analysis than is possible with lists, dictionaries, tuples, and sets. Pandas offers an alternative to R's data.frame type in the form of the DataFrame, which largely preserves the look and feel and enables the user to perform similar analyses in Python. However, this highlights a major difference between ML interfaces in R versus Python: the number of external libraries that must be loaded into the Python session. Since R was developed as a statistical analysis language, many data processing and analysis capabilities are built-in, with some modern extensions that improve efficiency, though they are not strictly needed. In the following section, we will fit a simple OLS regression according to the principles outlined throughout the chapter, namely defining an evaluation strategy, such as splitting the data or Cross Validation or both, and then training the model accordingly. The ML libraries for R and Python can do similar things; however, note that a precise comparison between the two may be lacking in certain cases, as caret (in R) does a lot behind the scenes, hidden, that is, from the user. These same operations hidden in the R examples may need to be performed explicitly (in multiple steps) with Scikit-Learn, giving the impression the Python examples show something different. Nevertheless, the outcome is the same.
Figure. Importing functionality from the Python library, Scikit-Learn, for basic ML modeling using the linear regression algorithm
Loading the essentials
Libraries in Python may be loaded in their entirety, resembling R, or selectively. For our basic OLS regression example, we will selectively load everything that is needed. First, we are loading the LinearRegression class, followed by a function to split the data into training and test sets. Also, the StandardScaler is imported, which puts the data on the same scale, and the final import, make_pipeline, helps create automated pipelines. We will be using pipelines throughout to process and train our models. They are like recipes for our approach, which may then be followed or adapted by others with little or no additional code.
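In text form, the four imports described above look approximately like this (a reconstruction based on the description; the figure shows the course's exact code):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline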
Defining the evaluation strategy
Evaluation strategies are methods to estimate model success. In earlier sections, we discussed the train/test split but acknowledged a random split may bias the evaluation, later introducing a complementary method, CV. We will be implementing both. First, the function train_test_split is applied to split the data once. Note the random_state argument is set for reproducibility, and the test_size is the fraction of data in the test set. It is set at .20 (.80/.20, train/test, respectively).
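Approximately, with X holding the features and y the labels (names assumed, and the random_state value arbitrary):
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)  # 80/20 train/test split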
Training the model
Next, we want to train the model. We will do this by creating a pipeline that (1) scales the data and then (2) fits the OLS regression model. The pipeline is then supplied to GridSearchCV, which does an exhaustive search over all hyperparameter values in the grid, a Python dictionary named "param_grid" in this case. In addition to the dictionary of hyperparameter values, GridSearchCV has an argument that sets the cross-validation strategy. Here, it is 10-Fold Cross Validation; that is, dividing the training data into 10 train/test sets. After instantiating GridSearchCV, we use its fit method to apply steps 1 and 2 to the 10 folds or train/test splits.
Figure. Illustrating the final steps in building the linear regression model, which include a pipeline that scales the data followed by training with linear regression. In the next block, we define a grid search, that is, fitting a model for each combination of parameter values in param_grid ten times (from cv=10). Finally, we call the fit method of the saved pipeline named cv_pipe.
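A reconstruction of those final steps based on the description above; the contents of param_grid are a placeholder, since plain LinearRegression exposes few tunable hyperparameters (fit_intercept is used purely for illustration).
from sklearn.model_selection import GridSearchCV

# step 1 (scale) and step 2 (fit OLS) chained into a single pipeline
pipe = make_pipeline(StandardScaler(), LinearRegression())

# hyperparameter grid; pipeline steps are addressed as <step name>__<parameter>
param_grid = {"linearregression__fit_intercept": [True, False]}

# 10-fold cross validation over every combination in the grid
cv_pipe = GridSearchCV(pipe, param_grid=param_grid, cv=10)
cv_pipe.fit(X_train, y_train)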
Caret is an R programming library that provides a training, validating, and testing interface. Some algorithms are pre-packaged, but many must be installed. Basic model-building in caret involves a two-step process handled by the functions trainControl and train. In the next sections, each of these functions will be explained in detail. To use caret, first install the library or package with install.packages("caret"), then load it into the R session with library(caret).
We previously discussed methods to evaluate Machine Learning models. Under the Python section above, two examples were shown: (1) a train/test split and (2) cross validation (repeatedly splitting the training set to generate multiple training and testing sets for hyperparameter tuning). Caret also has a method to split the data, createDataPartition, as well as a way to tune hyperparameters with cross validation. The latter is done by creating a trainControl object, which can be likened to Scikit-Learn's GridSearchCV. The trainControl object is then supplied as an argument to caret's train function, which plays a role similar to Scikit-Learn's pipe, though the nesting is reversed: in Python the pipeline is supplied to GridSearchCV, as in GridSearchCV(pipe), whereas in R the trainControl object is supplied to train, as in train(trainControl); in each case the object enclosed by the parentheses is an argument.
Defining the evaluation strategy
Let’s go through each argument to the trainControl object shown above:
This will be the strategy to evaluate models. Assuming the model has hyperparameters, the grid of potential values will be trained and evaluated according to the strategy. The strategy shown is repeated 10-fold cross validation. Recall that this divides the data into 10 train/test splits, ensuring that every row appears in a new training set as well as a test set.
The number of splits for the cross-validation strategy is set by the number argument. Here, it is 10.
When the validation strategy involves repetition, the repeats argument sets the value. As seen above, repeated cross validation has been set; therefore, repeats is relevant and is set to 3. In sum, arguments 1, 2, and 3 state that 10 partitions should be created, repeating this 3 times for a total of 30 partitions. If the model has tunable hyperparameters, each combination of values will be evaluated on 30 train/test partitions or folds.
Since it may be undesirable to store all of the raw evaluation data, users may prefer restricting what is returned. The option pictured in this case is to return only the final evaluation, that is, the results for the model with the best performing hyperparameter values over the validation strategy.
The evaluation strategy may require fitting many models, and serially fitting, predicting, and quantifying performance can be slow. Setting this argument to TRUE instructs caret to parallelize the evaluation. Note that caret does not do this internally; a secondary package is needed to parameterize the parallelization (e.g., the number of CPU cores). This notably differs from Python and Scikit-Learn, where the argument n_jobs may be set directly when instantiating the algorithm.
Setting the index gives the user control over the instances in the training/testing partitions/folds. It helps when comparing algorithms on the same dataset.
This is a function that accepts the ML algorithm's predictions and the observed values from the data, then computes and returns the performance value. While the user can create a custom function, caret provides core metrics built-in. The defaultSummary (built-in) computes the regression metrics MAE, RMSE, and R2. The summaryFunction results are gathered and aggregated by caret and returned as a dataframe in a nested list.
Training the model
OLS is a linear regression method that seeks to minimize the sum of squared residuals, which are the differences between observed and predicted values. It assumes a linear relationship between the dependent variable and one or more independent variables, and estimates the coefficients that best fit this relationship. The method is sensitive to outliers and multicollinearity but provides simple, interpretable results when assumptions like homoscedasticity, linearity, and normal distribution of errors are met.
There are, however, issues that arise with OLS, chief among them the tendency to overfit, or overlearn, trends that are not generalizable (what might also be called noise). If we collected more data, in other words, those trends would disappear.
Extensions of linear regression attempt to augment the standard OLS regression approach. Examples include Ridge Regression (which adds an L2 penalty to reduce overfitting and handle multicollinearity), Lasso Regression (which adds an L1 penalty to encourage sparsity by shrinking less important coefficients to zero), and Elastic Net (which combines L1 and L2 regularization). These methods help prevent overfitting, handle high-dimensional data, and improve model generalization, especially when dealing with correlated or irrelevant features.
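A minimal sketch of the three regularized variants in Scikit-Learn; the alpha and l1_ratio values are arbitrary and would normally be tuned.
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                    # L2 penalty
lasso = Lasso(alpha=0.1)                    # L1 penalty, drives some coefficients to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2
# each is fit like OLS, e.g. ridge.fit(X_train, y_train)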
SVMs are in some respects similar to linear regression. They are a family of supervised learning algorithms that classify data by finding a hyperplane that maximally separates classes. For non-linearly separable data, that is, when we cannot use a straight line or plane to divide the data, SVMs use kernel functions (such as polynomial or radial basis functions) to map the input data to higher-dimensional spaces (that are difficult to imagine). Interestingly, the separation is possible in these mathematical spaces. Why use kernel functions? On its face, it might seem that we could simply modify the feature space. After all, for the simplest case of an outcome (what we are attempting to predict), Y, and a single predictor X, we could square X to add non-linearity into the data, improving the fit between Y and X while still using basic OLS regression. But this becomes impractical with more features. Given a new feature, Z, there is now an interaction between X and Z that must be included to accurately model the relationship between the features and Y. The number of features grows until the dataset is horribly complex and a predictive algorithm, especially a simple one like OLS linear regression, struggles to learn.
Kernel functions are methods that augment predictive algorithms. In the case of SVMs, they make the mathematics of finding the hyperplane considerably easier. The algorithm aims to maximize the margin, the distance between the hyperplane and the nearest data points of each class, enhancing robustness against overfitting. SVMs work well for high-dimensional data but, even with the use of kernel functions, the computation becomes expensive as datasets grow. More complex algorithms such as deep (or multi-layered) artificial neural networks may be preferable in these circumstances since the many layers are like stacked importance filters, progressively reducing the influence of poorly predictive features during the training phase.
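A minimal sketch using caret's radial-basis SVM (method "svmRadial", backed by the kernlab package): sigma controls the kernel width and C the cost of margin violations. The grid values are illustrative, and train_df, y, and ctrl are the same placeholders as before.

library(caret)

svm_grid <- expand.grid(
  sigma = c(0.01, 0.05, 0.1),   # RBF kernel width
  C     = c(0.25, 1, 4)         # cost of margin violations
)

svm_fit <- train(
  y ~ .,
  data       = train_df,
  method     = "svmRadial",
  preProcess = c("center", "scale"),  # SVMs are sensitive to feature scales
  trControl  = ctrl,
  tuneGrid   = svm_grid
)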
KNN is a simple, instance-based learning algorithm that classifies data points based on the majority class of their k nearest neighbors in feature space. It does not explicitly build a model, instead relying on the proximity of data points during prediction. The choice of k and the distance metric (e.g., Euclidean or Manhattan) significantly impact performance. KNN is intuitive and easy to implement, but its performance can degrade with high-dimensional data and large datasets due to computational costs.
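Since the only hyperparameter in the basic method is k, the caret sketch is short; as before, train_df, y, and ctrl are placeholders and the range of k is illustrative.

library(caret)

knn_fit <- train(
  y ~ .,
  data       = train_df,
  method     = "knn",
  preProcess = c("center", "scale"),                 # distances are scale-dependent
  trControl  = ctrl,
  tuneGrid   = expand.grid(k = seq(3, 25, by = 2))   # odd k helps avoid voting ties in classification
)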
Decision trees are hierarchical models that recursively split the data into subsets based on feature values, forming a tree structure. Each internal node represents a feature, and each branch represents a decision rule, leading to leaf nodes that represent predicted outcomes. The algorithm selects splits that maximize a criterion like information gain or Gini impurity. Decision trees are interpretable and handle both categorical and continuous data, but they are prone to overfitting. Pruning techniques or ensemble methods like Random Forests can be applied to improve generalization.
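A hedged caret sketch using the rpart implementation of CART, where the complexity parameter cp governs pruning; train_df, y, and ctrl remain placeholder names.

library(caret)

tree_fit <- train(
  y ~ .,
  data      = train_df,
  method    = "rpart",                                          # CART decision tree
  trControl = ctrl,
  tuneGrid  = expand.grid(cp = seq(0.001, 0.05, by = 0.005))    # complexity parameter used for pruning
)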
Random Forest is an ensemble of Decision Trees that uses feature sampling and bootstrapping to decorrelate the trees, creating different, complementary perspectives. This overcomes one of the key limitations of Decision Trees: the tendency to overfit. Bootstrapping involves sampling with replacement, which preserves the size of the dataset. If we simplify a dataset to just 4 rows, A, B, C, D, a bootstrap sample might be A, A, B, D. Random Forest fits multiple decision trees on these different samples and then aggregates the predictions: for example, it averages them, or, if the prediction is a category or label, it counts them as votes, with the category or label receiving the most votes becoming the final prediction. Combining bootstrapping and aggregation in this way is commonly abbreviated as "bagging."
Bagging is short for bootstrap aggregating. It is a method to decorrelate ensembles such that the average of multiple models improves upon the performance of each considered individually. Here, the training data is sampled with replacement (bootstrapping), then multiple models are trained on these new samples. Importantly, bagging is model agnostic, though Random Forest implements it with Decision Trees.
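Both ideas can be sketched in a few lines of R. The toy letters mirror the A, B, C, D example above, the mtry values are illustrative, and train_df, y, and ctrl remain placeholder names.

# A bootstrap sample of the toy dataset A, B, C, D: sampling with replacement
# keeps the dataset the same size but repeats some rows and omits others
set.seed(1)
sample(c("A", "B", "C", "D"), size = 4, replace = TRUE)
# one possible draw resembling the A, A, B, D example in the text

# Random Forest via caret; mtry is the number of features sampled at each split
library(caret)
rf_fit <- train(
  y ~ .,
  data      = train_df,
  method    = "rf",
  trControl = ctrl,
  tuneGrid  = expand.grid(mtry = c(2, 4, 6))
)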
AdaBoost is a boosting algorithm that combines the predictions of weak learners, typically shallow decision trees, into a stronger overall model. It works by sequentially training learners, where each subsequent learner focuses on correcting the errors made by the previous ones. During training, more weight is assigned to misclassified samples, pushing the next learner to focus on these harder cases. The final prediction is a weighted sum of the weak learners’ outputs. AdaBoost is simple and effective for many problems but sensitive to noisy data and outliers.
Gradient Boosting builds an ensemble of decision trees by optimizing the model in a stage-wise manner, minimizing a chosen loss function. At each iteration, a new tree is added to correct the residual errors of the current model. The new tree is fitted to the gradient of the loss function with respect to the model’s predictions, hence the name “gradient boosting.” This algorithm is highly flexible, allowing for custom loss functions and regularization techniques. However, gradient boosting can overfit if not properly tuned and may be computationally intensive for large datasets.
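A hedged caret sketch using the gbm implementation (the gbm package is assumed to be installed); the grid below is illustrative rather than a recommendation, and train_df, y, and ctrl are placeholders.

library(caret)

gbm_grid <- expand.grid(
  n.trees           = c(100, 300, 500),   # number of boosting iterations
  interaction.depth = c(1, 3),            # depth of each tree
  shrinkage         = c(0.01, 0.1),       # learning rate
  n.minobsinnode    = 10                  # minimum observations per terminal node
)

gbm_fit <- train(
  y ~ .,
  data      = train_df,
  method    = "gbm",
  trControl = ctrl,
  tuneGrid  = gbm_grid,
  verbose   = FALSE                       # silence gbm's per-iteration output
)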
XGBoost is an optimized implementation of gradient boosting that includes regularization techniques (L1 and L2), which help prevent overfitting, and employs advanced features like parallel processing, tree pruning, and efficient handling of sparse data. It also uses a more sophisticated approach to model training, including the ability to handle missing data internally. XGBoost is known for its speed and performance in machine learning competitions, but it requires careful tuning of hyperparameters to achieve the best results.
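caret exposes XGBoost through the xgbTree method (the xgboost package is assumed to be installed), whose tuning grid expects all seven parameters shown below; the values are illustrative, and train_df, y, and ctrl remain placeholders.

library(caret)

xgb_grid <- expand.grid(
  nrounds          = c(200, 500),    # boosting iterations
  max_depth        = c(3, 6),        # tree depth
  eta              = c(0.05, 0.3),   # learning rate
  gamma            = 0,              # minimum loss reduction required to split
  colsample_bytree = 0.8,            # feature subsampling per tree
  min_child_weight = 1,              # minimum sum of instance weight per child
  subsample        = 0.8             # row subsampling per tree
)

xgb_fit <- train(
  y ~ .,
  data      = train_df,
  method    = "xgbTree",
  trControl = ctrl,
  tuneGrid  = xgb_grid
)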
CatBoost is a gradient boosting algorithm designed to handle categorical features natively, without the need for extensive preprocessing like one-hot encoding. It employs ordered boosting, which reduces the bias introduced during training by carefully controlling how data is split between training and validation subsets. CatBoost also provides excellent support for handling imbalanced datasets and reducing overfitting through techniques like random permutations and robust default hyperparameters. It is often preferred when dealing with complex datasets containing many categorical variables.
Neural networks are a class of machine learning models inspired by the human brain’s structure. They consist of interconnected layers of nodes (neurons) that process inputs by applying weights, biases, and activation functions, allowing the network to learn complex patterns in data. The most basic form is the feedforward neural network (FNN), where information flows in one direction, from input to output. Neural networks are trained using backpropagation, where errors are propagated backward through the layers to update the weights using a gradient descent algorithm. These models are highly flexible and can approximate any continuous function but often require large datasets, significant computational resources, and proper regularization techniques to avoid overfitting.
An MLP, or multilayer perceptron, is a feedforward network and represents the foundational artificial neural network architecture, where information flows in one direction: from the input layer, through one or more hidden layers, to the output layer. Each neuron in a layer is connected to all neurons in the next layer through weighted connections, and activation functions like ReLU or sigmoid are applied to introduce non-linearity. MLPs (and FNNs generally) are primarily used for tasks where inputs are independent, such as image classification or tabular data analysis. While they are relatively simple, they can model complex relationships when enough neurons and layers are included, although they may require large amounts of data and regularization to avoid overfitting.
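A single-hidden-layer MLP can be sketched with caret's nnet method (from the nnet package), where size is the number of hidden units and decay is an L2 weight penalty; the regression-oriented linout option and the grid values are assumptions about the task, and train_df, y, and ctrl remain placeholders.

library(caret)

nnet_fit <- train(
  y ~ .,
  data       = train_df,
  method     = "nnet",                               # single-hidden-layer perceptron
  preProcess = c("center", "scale"),
  trControl  = ctrl,
  tuneGrid   = expand.grid(size  = c(3, 5, 10),      # hidden units
                           decay = c(0, 0.01, 0.1)), # weight decay (L2 regularization)
  linout     = TRUE,                                 # linear output for regression; drop for classification
  trace      = FALSE,
  maxit      = 500
)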
CNNs are specifically designed for handling grid-like data, such as images or video frames. They consist of convolutional layers that apply filters to the input data, detecting local patterns such as edges, textures, or shapes. These layers are followed by pooling layers that reduce the spatial dimensions while retaining important information, thus making the network more computationally efficient. CNNs are highly effective in image recognition, object detection, and image segmentation tasks due to their ability to capture spatial hierarchies in the data. Fully connected layers at the end of the network compile the learned features to perform classification or regression.
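As a hedged sketch, assuming the keras R package with a TensorFlow backend is installed, a small CNN for 28 x 28 grayscale images with 10 classes might look like the following; the layer sizes and data shape are illustrative assumptions.

library(keras)

cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%    # learn local edge/texture filters
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%    # downsample while keeping salient activations
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%                              # unroll feature maps for the dense head
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")  # class probabilities

cnn %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = "adam",
  metrics   = "accuracy"
)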
RNNs are designed to process sequential data, where the order of inputs matters. They incorporate loops within their architecture, allowing information from previous time steps to influence the current output, thus giving the network a form of memory. This makes RNNs ideal for tasks like language modeling, time-series forecasting, and speech recognition. However, standard RNNs struggle with long-term dependencies due to issues like the vanishing gradient, making them less effective for long sequences unless modified with specialized units.
Long Short-Term Memory networks, or LSTMs, are an advanced form of RNN designed to overcome the limitations of traditional RNNs in learning long-term dependencies, such as "forgetting" earlier inputs as the sequence grows. LSTMs introduce memory cells that can maintain information over long periods, along with gating mechanisms (input, forget, and output gates) to control the flow of information in and out of these cells. This architecture enables LSTMs to remember relevant data for longer sequences, making them highly effective for applications like language translation, speech recognition, and video analysis where long-range dependencies are crucial.
Gated Recurrent Units or GRUs are a simplified variant of LSTMs that also address the vanishing gradient problem in RNNs. They merge the forget and input gates into a single update gate, reducing the complexity while retaining the ability to model long-term dependencies. GRUs are computationally more efficient than LSTMs and often perform comparably in tasks like text generation and time-series prediction. Their simpler structure makes them faster to train and less prone to overfitting in certain applications.
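A minimal sequence-model sketch in the keras R interface, assuming 50 time steps with 8 features per step and a single numeric target; these shapes are illustrative, and swapping layer_lstm for layer_gru gives the GRU variant.

library(keras)

rnn <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(50, 8)) %>%  # memory cells over the sequence; layer_gru(units = 32) is a drop-in alternative
  layer_dense(units = 1)                              # single regression output

rnn %>% compile(loss = "mse", optimizer = "adam")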
GANs consist of two neural networks—a generator and a discriminator—that are trained simultaneously through adversarial learning. The generator tries to create fake data that mimics the real data, while the discriminator evaluates whether a given sample is real or generated. The two networks compete in a zero-sum game, with the generator improving its ability to fool the discriminator over time. GANs have revolutionized tasks like image generation, style transfer, and data augmentation, producing highly realistic images, videos, and even audio. However, they can be challenging to train due to instability and mode collapse.
GNNs are specialized neural networks designed to operate on graph-structured data, where nodes represent entities and edges represent relationships between them. Traditional neural networks are not well-suited for graphs, as graphs lack a regular structure. GNNs address this by applying convolution-like operations directly to the nodes and aggregating information from neighboring nodes through message-passing mechanisms. This enables GNNs to capture both the features of individual nodes and the graph topology, making them useful for tasks like node classification, link prediction, and graph generation. GNNs are widely used in domains such as social network analysis, molecular chemistry, and recommendation systems.
Transformers are a neural network architecture designed for handling sequential data without relying on recurrence, as used in traditional RNNs. Instead, transformers use a mechanism called self-attention, which allows each element in a sequence to attend to all other elements simultaneously, capturing long-range dependencies efficiently. Transformers consist of an encoder-decoder structure where the encoder processes the input sequence, and the decoder generates the output. They have become the state-of-the-art in natural language processing tasks, including machine translation, text summarization, and question-answering, and are the foundation of models like BERT, GPT, and T5. The parallel nature of transformers enables them to scale efficiently to large datasets, making them highly effective for handling long sequences.
Diffusion models are generative models that create new data samples by learning a reverse process of gradually denoising data that has been perturbed by adding noise. The model starts by adding Gaussian noise to the training data in small steps until it becomes unrecognizable, and then it learns how to reverse this process, step by step, to generate new data. This technique has shown great promise in generating high-quality images, competing with GANs in tasks like image synthesis. Diffusion models are more stable to train than GANs and can generate diverse and realistic outputs, making them a powerful tool in the field of generative modeling, especially in areas like computer vision.
The neural network architecture typically used in diffusion models includes several key components:
UNet Backbone: A common neural network architecture for diffusion models is the UNet, which consists of an encoder and a decoder. The encoder progressively downsamples the noisy input to capture global information, while the decoder upsamples it to generate the denoised output. Skip connections are used to connect layers from the encoder to the corresponding layers in the decoder, allowing the model to retain fine-grained information while reconstructing the data.
Time Embedding: Diffusion models also incorporate a time-embedding mechanism, which encodes the noise level or time step in the diffusion process. This embedding is added to the input to inform the model about the specific noise level it needs to reverse. Typically, sinusoidal embeddings similar to those used in transformers are applied to provide the model with a notion of time progression during the denoising process.
Score Network: The neural network, often referred to as the score network, predicts the noise present at each time step. Instead of directly generating new data points like GANs, the score network predicts the gradient of the log probability density of the noisy data, guiding the reverse diffusion process.
Sampling Process: In the reverse process, diffusion models use a sampling algorithm (e.g., Langevin dynamics or other stochastic sampling methods) to iteratively remove noise, starting from pure noise and gradually reconstructing the original data. The model uses the score network to estimate how to remove the noise at each step, generating data that looks realistic.
Recall from our earlier discussion that unsupervised learning is a type of machine learning where the model learns patterns and relationships in data without explicit labels or guidance. Unlike supervised learning, which covers the preceding examples we have been discussing, unsupervised learning algorithms work with unlabeled data. That is to say, each row does not already fall into a discrete bin or category, and should such bins or categories exist, they must be discovered. Let's take a look at some examples below. We will begin with a neural network architecture.
Autoencoders are unsupervised neural networks used primarily for tasks like dimensionality reduction, anomaly detection, and data denoising. They consist of an encoder that compresses the input into a low-dimensional latent space and a decoder that reconstructs the input from this compressed representation. The goal is to learn a compact yet informative encoding of the data. Variants of autoencoders, such as variational autoencoders (VAEs), are used for generative tasks where the network generates new data similar to the input by sampling from the learned latent space.
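A hedged sketch of a plain autoencoder in the keras R interface, assuming flattened 28 x 28 inputs and a 32-dimensional latent space; the dimensions are illustrative assumptions.

library(keras)

input_dim  <- 784   # e.g., flattened 28 x 28 images
latent_dim <- 32    # size of the compressed representation

inputs  <- layer_input(shape = c(input_dim))
encoded <- inputs  %>% layer_dense(units = latent_dim, activation = "relu")     # encoder
decoded <- encoded %>% layer_dense(units = input_dim, activation = "sigmoid")   # decoder

autoencoder <- keras_model(inputs, decoded)
autoencoder %>% compile(loss = "mse", optimizer = "adam")

# Training maps inputs back onto themselves, e.g. fit(autoencoder, x_train, x_train, epochs = 20)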
K-Means is a partition-based clustering algorithm that divides data into k clusters by minimizing the variance within each cluster. The algorithm starts by randomly initializing k cluster centroids. Each data point is then assigned to the nearest centroid based on a distance metric, usually Euclidean distance. Once all points are assigned, the centroids are updated to be the mean of the points in each cluster. This process of assigning points and updating centroids is repeated until the centroids no longer change significantly, indicating convergence. K-Means is efficient for large datasets and works well when clusters are spherical and equally sized, but it can struggle with non-convex clusters and is sensitive to the initial placement of centroids.
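In base R, K-Means requires only the built-in kmeans function; X below is a placeholder numeric matrix or data frame, and the choice of 3 clusters is illustrative.

set.seed(42)
X_scaled <- scale(X)                              # distance-based methods benefit from scaling
km <- kmeans(X_scaled, centers = 3, nstart = 25)  # nstart reruns with different random centroids

km$cluster       # cluster assignment for each row
km$centers       # final centroid coordinates
km$tot.withinss  # total within-cluster variance (useful for an elbow plot over several k)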
Hierarchical clustering builds a tree-like structure (dendrogram) of nested clusters, allowing for exploration of clustering at various levels of granularity. It comes in two types: agglomerative (bottom-up) and divisive (top-down). In agglomerative clustering, each data point starts as its own cluster, and pairs of clusters are iteratively merged based on a similarity metric (e.g., single-linkage, complete-linkage, or average-linkage) until all points belong to a single cluster. In divisive clustering, the process starts with all points in one cluster, which is then recursively split. Unlike K-Means, hierarchical clustering does not require the number of clusters to be specified in advance and can capture more complex relationships, such as nested clusters or varying cluster sizes. However, it can be computationally expensive for large datasets and is sensitive to outliers.
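Agglomerative hierarchical clustering is likewise available in base R through dist and hclust; X is the same placeholder matrix, and the linkage choice and number of clusters are illustrative.

d  <- dist(scale(X), method = "euclidean")   # pairwise distances between rows
hc <- hclust(d, method = "average")          # average-linkage merging

plot(hc)                       # dendrogram: cut at different heights for different granularities
clusters <- cutree(hc, k = 4)  # extract a flat assignment with 4 clusters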