UCLA COM SCI 750 Talks


Introduction

This post is a walkthrough of a series of two public talks given in April and May 2023 for the COM SCI series at UCLA. The talks covered transformers and diffusion models, with specific examples such as OpenAI's ChatGPT and DALL-E 2, for a general audience, while remaining sufficiently technical to interest individuals with more experience in AI/machine learning. As such, there are conceptual, analogic, and technical descriptions throughout to address that unique audience. By posting this content here, I hope to deliver it as a narrative both similar to and different from the actual talk, periodically adding more discussion of the slides, some of which, due to time constraints, were covered only minimally or not at all. The main difference between the two talks was that the second emphasized art and the ongoing conversation among AI algorithms, scientists in the AI field, and society on far-ranging topics, from human uniqueness and creativity, to "natural" and "artificial" art, to the rising tension between humans and AI as society faces the prospect of replacement vs. partnership.

What is Artificial Intelligence?

The term and underlying theme of Artificial Intelligence (AI) are often at odds. Before the term was ever widespread, the concept of sentient non-humans had been a fixture, preoccupying philosophers, troubling society, and filling the imaginations of writers with often dystopian imagery. It is a past that, even as we move daringly from imagination to reality, distorts our relationship with recent AI algorithms like ChatGPT, DALL-E 2, and Stable Diffusion, among countless similar examples. Researchers building and modifying these algorithms are much closer to the subject matter, and most, while recognizing improvements, rightly acknowledge key limitations. This is unfortunately not relayed to the general public effectively, since the message is too often controlled by the companies that stand to profit from AI supremacy, companies that, in part implicitly (since selling something well and truly believing in it overlap), drive the narrative that the old order is unstable and imminent revolution awaits. So what is Artificial Intelligence (AI), beyond the hype and beyond the fiction? I defer to one of the founders of AI, Alan Turing, expanding on his idea of the Imitation Game. As described by Turing, the game involves an exchange between a human and an algorithm; the human is tasked with identifying whether the interaction is with a fellow human or an algorithm. The idea was that our perception (of ourselves, of our abilities) prevents us from seeing deeper or alternative interpretations of words like "sentient," "thinking," and "intelligence." That is, to "think" is tied so strongly to examples of human thought, rather than to something more universal such as a computational process, that no rational argument could convince us otherwise. But given an experience in which the human was indistinguishable from the machine, our biases would fade, leaving us little choice but to apply terms like sentience, thinking, and intelligence, no matter how strange it might seem, to algorithms or machines.
Properly understood, however, Turing's game offers a definition for Artificial General Intelligence (AGI), not AI. To understand AI, and maybe even AGI, there is no need for contrived games. Just observe the natural world; acknowledge the game's reversal in the scenes around us. Acknowledge the game whereby nature plays the inquisitor, challenging us to make the case for our uniqueness. In this game, if we see ourselves mirrored in the world, consistently emulating nature, then we have little choice but to conclude that we are part of, and products of, nature's creative engine, which unfolds patterns from the smallest to the largest organisms, from the inanimate to the animate. We must conclude that if we could emulate the creative engine itself, the possibilities would be endless, even to the point of reinventing our own cognitive capacities. Putting these loftier goals aside, however, AI can be defined simply as the art of detecting and mirroring patterns in data. In keeping with the talk's title, namely the "Promise" and "Peril": patterns, some of which are meaningful, are not always easily seen by humans; hence the promise of "seeing the unseen." Alternatively, many patterns may be incoherent snapshots, blurred forms crafted into sweeping narratives, or, far worse, embedded in datasets so large and complex that it makes little sense to even attempt interpretation, for fear that we would undoubtedly be crafting fictions. The latter is an alarming trend in AI research, spurred on by the "new," the "cool," and AI treated not as science but as commodity. These fall squarely into the category of "Peril," since they emphasize a very pessimistic take on AI, one where AI captures the public's attention through empty promises and loses its trust soon after. As it stands, the road is very much open; the hope is that even in our failings, acknowledging them as such, we might appeal to our "better angels" and chart a new, promising course.

Nature as Grand Imitation Game

When defining AI as the art of detecting and mirroring patterns in data, the designation "artificial" is called into question, since nature has been, and continues to be, the masterful creative engine underlying the world. This engine creates forms. The persisting forms, resulting from chance or necessity, are perpetually recycled, resurfacing in the animate and the inanimate. And the imitation show is ever present for those willing to look closely enough at the seemingly insignificant. In the image below, we can see mimicry (or convergence) of form: starting at the upper left, a coral reef and the cortical folds of the human brain; then, mimicry of foraging animals drawn on rock surfaces by our ancient human ancestors; and then a collage pairing artwork by Jackson Pollock and Norman Lewis (lower left and upper right, respectively) with natural images or drawings (top left, cortical neurons drawn by the neuroanatomist Ramón y Cajal; bottom right, a photograph of tree branches). Pollock's and Lewis's works are very natural, even biological. The abstract quality is simply what the eye is drawn to because the artist's intent is the abstract. But in the imitation game, the abstract is only our own ignorance of the concrete, natural, scientific, or mathematical reference or impression that drives the artist. And lastly, we develop AI algorithms like ChatGPT, driven to create as nature creates, as the artist creates, as we all create, telling one grand story: The Imitation Game.

The AI Researcher in this Game

The role of the AI researcher is to listen to and observe nature's patterns, translating and communicating them to computers, building algorithms to take up the task of observation, understanding, and generation. It is something Walt Whitman observed long before. We are all interconnected, bonded to nature, which is to say, the mirroring prevalent in nature eventually turns inward with the mirroring of ourselves. What will we see? In large language models (LLMs), we are presented with the first formidable example of what lies ahead, both startling and exciting. The remainder of this talk moves toward a basic overview of AI algorithm development, culminating with examples in GPT and DALL-E 2 for non-specialists.

What are Artificial Neural Networks?

Artificial Neural Networks (ANNs) have gained popularity recently, though the earliest examples date back to the 1950s, and, thinking analogically, tracing ideas back to their origins, one might see connections to ancient art and storytelling. When investigating this ancient link further, the ANN becomes surprisingly familiar. We can begin with (1) a Story, describing it as (2) a Graph, which (3) is an ANN; to next train the ANN is (4) to Learn the Story, while the trained ANN finally (5) Tells the Story. The process of building networks involves developing the infrastructure that bridges the beginning and the ending. In the image below, the bridge is unknown. We initially observe the "black box": we see a fragmentary scene, such as the artist at work in the studio, and later see the completed work on museum walls. We infer that there is a process even as we remain ignorant of how precisely the art moves from studio to museum. This "process" can be thought of technically as a set of parameters which, initially random, reflecting our state of ignorance, are progressively adjusted as, like a detective in a noir novel, we retrieve the fragmentary information (or data), hopefully finding appropriate estimates. I refer to "hope" in this context to acknowledge that not all stories are good. Some must be discarded, rethought, and reassembled, never to leave the drafting stage.

To make this ANN-as-story analogy more concrete, we can imagine two scenarios: topic modeling, in the area of natural language processing (NLP), and object recognition, a task (or type of story building) arising in machine vision research. For continuity, the object recognition task is framed as detecting the artist or art movement for an art image. In both cases, for simplicity, the encoding is unrealistic, but it illustrates that ANNs communicate in math; the analyst acts as the translator from observations to numbers and must define an appropriate encoding if one does not already exist*. Similarly, the parameter(s) or "Thetas" above the arrows are a simplification to help visualize the parameter as the "glue" binding the task input to the desired output.

*Images, for one, are already numeric: arrays of pixel intensities ranging from 0-255, distributed across three 2D arrays (also called matrices) for the red, green, and blue color channels. We dig deeper into encoding (and its dilemmas) in the discussion on transformer ANNs, and ChatGPT in particular.

Scientist as Artist and Storyteller

When adopting this "story" analogy, many apparent boundaries between the arts and the sciences disappear. And the formerly abstract and unapproachable is reframed as experiential and known. Everyone has a story and has been swept up in the stories of others. So science, math, and, particularly relevant here, artificial neural networks are far more familiar than they seem. It is simply a matter of how one looks at the world. As we move into more technical detail, it is important not to lose this underlying child-like wonder that is always unifying the seemingly distinct. In the image/slide below, I outlined this "sense of wonder" that binds science to art, showing how science relates to story and storytelling. The images chosen are purposefully abstract, but they are all familiar once revealed. This again emphasizes that differences and challenges are not absolutes. If the road toward our experience, toward relatability, is available, then the abstract becomes clear; a fact that was always known. Importantly, this also suggests that the artificial, the machine, is not the abstract, unknown "thing" that presents an existential crisis for humanity. When we see AI through this lens of anxieties, we fail to realize the promise. We fail to realize that understanding, using, and building AI algorithms is to tell our own story. And should we disapprove or look away, we must recognize that the mirror is facing us. What story do you see? Let's move on to the specifics of what neural networks are and how they are trained in the next section.

From left to right: a rock surface, its coloration resulting from bacteria on the rock's surface; next, bubbles created when swirling water and oil-based paint, where the immiscible mixture creates colored dots that give the impression of a solid rather than a liquid. The grayscale image that follows is simply a flowerbed, though it resembles an electron micrograph. The final image is colored ink in water; as it disperses (diffusion), it creates interesting geometries that resemble computer models of proteins (often boldly colored to clarify the different subunits and motifs) or colored plumes of smoke.

The Functional Units of Artificial Neural Networks

Artificial Neural Networks are composed of cells, much like their biological counterparts; however, the term is used flexibly, since in machine vision tasks the cells are filters (convolutional filters, or image feature extractors) and include many parameters, whereas the more traditional artificial cell or neuron, introduced in the Rosenblatt Perceptron (primitively mapping inputs directly to outputs) and the later Multilayer Perceptron, has only two parameters (a weight and a bias). Biological and artificial neurons are much like the larger networks they belong to (or participate in), and therefore, revisiting our earlier analogy, they are like graphs and stories. In the image/slide below, I refer to the cell as the "scene builder," in the sense that every story is composed of characters. Ultimately, these characters are the stars, driving the story forward. Stripping away all detail of the story, biological or artificial cells fit into a template that starts with inputs, acts on those inputs, and concludes by outputting the result of the action. The "action" is described by the parameters (here, theta and "b" for bias). In the simple cell shown, the action should be somewhat familiar, as it is the famous linear equation (y = mx + b), which defines the relationship between "x" and "y" controlled by "m" and "b." The more complex networks we will discuss are not as simple in their details, but the principles hold as complexity increases. Namely, the cells are functions mapping inputs to outputs, controlled by parameters, and the broader network is then simply a nested function that describes the flow of inputs into outputs.
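The cell-as-function idea above can be sketched in a few lines. This is a minimal illustration of the simple linear cell from the slide, with a ReLU activation added to hint at the non-linearity real networks wrap around it; the input and parameter values are arbitrary.

```python
# A minimal artificial neuron: two parameters (theta, the weight, and b,
# the bias) control how the input is mapped to the output.

def neuron(x, theta, b):
    """The cell's 'action': the linear equation y = theta * x + b."""
    return theta * x + b

def relu(z):
    """A common non-linearity: pass positives through, zero out negatives."""
    return max(0.0, z)

out = relu(neuron(x=2.0, theta=0.5, b=-0.25))
print(out)  # 0.75
```

Stacking many such cells, each feeding its output to the next layer, yields the nested function the paragraph describes.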

Training Artificial Neural Networks (ANNs) is an optimization task, and it is common to apply an algorithm based on computing partial derivatives describing how each parameter contributes to the prediction error; a derivative is a rate of change (like "meters per second"). The partial derivative is necessary because any one neuron or cell is a composite of other sources of error. In other words, there are multiple dependencies, and we isolate them to correctly adjust the parameters, reducing future prediction error. An optimization algorithm such as Gradient Descent (GD), or an extension of GD, takes the partial derivatives and "walks" the model toward parameter estimates that minimize the error, formally referred to as a loss function (for the error in predicting a single observation) or cost function (for the average over a sample of observations). And it is this loss or cost function (rather than the generic term, error) that the partial derivative actually applies to; but for the purposes of a conceptual understanding, this distinction is not necessary. In the figure or slide below, I continue to use the generic ERROR in parallel with the technical terminology. The cycle shown begins with (1) a prediction, which is compared to an observed value through the loss function. In a basic regression task, that is, predicting a number, the loss might be the squared error, which, as the name implies, involves squaring the difference between predicted and observed values. This is followed by (2) computing the gradient (the vector of partial derivatives) and then (3) applying an update rule; for example, the one defined by GD, which appears in the lower lefthand corner of the slide. The update rule shows the negation of the gradient. In the bowl-like (convex) distribution shown in the slide, the blue ball represents the current value.
The derivative (or slope of the curve at the current parameter value) is indicated by the downward arrow, suggesting the ball (parameter value) is moving down, as if a force has been applied, toward the bottom of the bowl. Mathematically, the actual gradient points in the opposite direction, so to push the parameter downward (applying the "force," so to speak), the gradient must be "flipped." This is the role of the "-" in the GD update formula. The iterative process shown is repeated many times (over many epochs), and often on batches or samples rather than the entire dataset, for computational efficiency. Progressively, this learning, or, more directly stated, parameter estimation, builds an approximate solution to the problem of translating input (e.g., words) to output (e.g., the next word in the sequence). In the next slide, we explore a real example to better visualize the complexity of parameter estimation.
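The full cycle (predict, compute the loss, take partial derivatives, negate the gradient, update) can be sketched end to end on a toy problem. Here plain Gradient Descent fits the two parameters of the linear cell to fabricated data; the data, learning rate, and epoch count are illustrative choices.

```python
# Fitting y = theta * x + b with squared-error loss and Gradient Descent.

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # fabricated; true rule y = 2x + 1
theta, b = 0.0, 0.0                            # random-ish start: our ignorance
lr = 0.05                                      # step size

for epoch in range(2000):
    grad_theta = grad_b = 0.0
    for x, y in data:
        pred = theta * x + b                   # (1) predict
        err = pred - y                         # compare to the observed value
        grad_theta += 2 * err * x              # (2) partial derivative w.r.t. theta
        grad_b += 2 * err                      #     partial derivative w.r.t. b
    n = len(data)
    theta -= lr * grad_theta / n               # (3) update: note the negation
    b -= lr * grad_b / n                       #     of the (averaged) gradient

print(round(theta, 2), round(b, 2))  # walks toward 2.0 and 1.0
```

Swapping the full-dataset loop for a random subset of `data` each epoch would turn this into the batch-based variant mentioned above.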

Moviemaking: Artificial Neural Networks in the Editing Room

The example shown is a sample from a multi-layer neural network specialized for handling image data. Here, the cells are image filters and, as shown in the image, have many more parameters. The image filters or cells take as input the image, consisting of pixel values (0-255) across the three color channels (RGB), each of which is its own 2D array (also called a matrix); the shape could therefore be something like 128x128x3 for color or 128x128x1 for grayscale. This hopefully captures the increased complexity of large neural networks. The network, shown only in part for clarity, is comparatively small by today's standards, with around 17 layers and 1000 cells or filters specialized for extracting features from the input image(s). The "film" is a sampling of the outputs from early, intermediate, and late layers, relative to the final output layer, which supplies the prediction. Here, the prediction might be one of artist recognition, where the extracted features in our "film" tell the story underlying the data, mapping, in other words, input to output; the story's beginning to its ending.
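A single filter-cell of this kind can be illustrated directly. The sketch below slides a hand-picked 3x3 vertical-edge filter over a fabricated grayscale image; in a trained network, the filter's nine values would be learned parameters rather than chosen by hand, and a color image would simply add a third channel axis (e.g., 6x6x3).

```python
import numpy as np

# A toy 6x6 grayscale "image": dark on the left, bright on the right.
image = np.zeros((6, 6))
image[:, 3:] = 255.0

# A 3x3 vertical-edge filter (illustrative, not learned).
filt = np.array([[-1.0, 0.0, 1.0],
                 [-1.0, 0.0, 1.0],
                 [-1.0, 0.0, 1.0]])

# Slide the filter over every 3x3 patch ("valid" convolution).
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * filt)

print(out)  # large responses only where the dark/bright boundary sits
```

Each layer of a vision network applies many such filters at once, and later layers filter the outputs of earlier ones, which is what produces the progressively more abstract frames of the "film."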

Where we were, where we are

The earliest implementation of Artificial Neural Networks is the Rosenblatt Perceptron, reported in the late 1950s. This was followed by the multilayer perceptron (MLP) and the backpropagation algorithm, which defines a method, based on the chain rule in calculus, to compute network gradients (e.g., the partial derivatives discussed above in Training Neural Networks); the algorithm was a key development in building larger networks. But backpropagation does not control the magnitude of the gradient, and as researchers attempted to build networks of increasing complexity, new issues emerged. It was shown that the gradients quickly become unstable, either "exploding" or "vanishing," and Gradient Descent (GD) failed to properly update the parameters. These issues were later addressed by introducing scaling layers for "exploding" gradients, and by minimizing the possibility of "vanishing" gradients through other changes (e.g., different activation functions, an active research area that first introduced non-linearity, passing the neuron's output through a non-linear function before relaying it to the subsequent layer, and later defined functions with these properties whose derivatives do not collapse toward zero). The latter developments paved the way for very large networks, so distinct in size and complexity that the term "deep learning" (DL) gained traction; it is increasingly a description of any modern application of an artificial neural network, without a definite connection to size or to the neural networks of the pre-deep-learning era. We can place the beginning of this trend around 2012 with AlexNet, a genuinely "deep" example and a significant increase in complexity relative to preceding efforts.
Since then, networks have only gotten bigger, reaching billions of parameters with OpenAI's GPT-3 (175 billion parameters, reported in 2020), which was later packaged alongside ethical safeguards and enhancements from human feedback (Reinforcement Learning from Human Feedback, RLHF) and publicly released in late 2022 under the name ChatGPT. In the next sections, we will explore GPT in detail.
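The vanishing-gradient problem and the activation-function fix can be seen numerically. The sketch below compares the derivative of the sigmoid, a classic non-linearity whose slope collapses for large inputs, against ReLU, whose derivative stays at 1 for any positive input; the sample points are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    # The sigmoid's slope: largest at z = 0, shrinking toward 0 elsewhere.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_deriv(z):
    # At exactly z = 0 the derivative is undefined; implementations
    # conventionally pick 0 or 1 (we pick 0 here).
    return 1.0 if z > 0 else 0.0

for z in (0.0, 5.0, 10.0):
    print(z, sigmoid_deriv(z), relu_deriv(z))
# sigmoid's slope: 0.25, ~0.0066, ~0.000045; ReLU's: 0.0, 1.0, 1.0
```

Backpropagation multiplies such slopes layer after layer, so a deep stack of sigmoids drives the product toward zero, while ReLU-like functions keep the gradient signal alive.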

Introducing Transformers and GPT

The "GPT" part of ChatGPT refers to the algorithm's generative engine. The generative pre-trained transformer (GPT) is an artificial neural network (ANN). Specifically, it is a collection of smaller modules or "mini" ANNs that are strategically configured for encoding language rules and then applying them to successfully predict the next item in a sequence (e.g., a word or punctuation symbol). The ANN configuration (layout, design, or architecture) behind GPT is a transformer; a transformer is comprised of two distinct ANNs, the Encoder and the Decoder. GPT, however, is generative and consists of only the Decoder. (Generative models learn a distribution from training data and then attempt to reconstruct or generate novel instances that resemble the training data, instances that aren't exactly like the training data but are from the same distribution; for example, implicit in training data consisting of many cat images is a distribution or family resemblance we might call "catness" or "cat-like," which, if successfully learned, can be sampled from to generate new images of cats.)
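The "learn a distribution, then sample from it" idea can be shown with a deliberately tiny stand-in. Below, a one-dimensional Gaussian plays the role of the vastly richer distributions ("catness") that real generative models learn; the training numbers are fabricated.

```python
import random
import statistics

# Fabricated "training data": observations drawn from some unknown process.
training = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 1.95]

# "Learning" here is just estimating the distribution's parameters.
mu = statistics.mean(training)
sigma = statistics.stdev(training)

# "Generation" is sampling novel instances from the learned distribution:
# they resemble the data without copying any single training point.
random.seed(0)
novel = [random.gauss(mu, sigma) for _ in range(3)]
print(novel)
```

GPT does the analogous thing over sequences of tokens: it learns a distribution over what comes next and samples from it.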

In the next sections, we will begin unraveling GPT, which is to say the Decoder: its submodules and relevant processing steps, namely (1) Tokenization and Encoding, (2) Embedding, (3) Positional Encoding, and (4) Attention and Masked Attention.

The Universal Machine: the Transformer Architecture

Below is a simplified diagram of a transformer-type neural network, consisting of two sub-networks, the Encoder and the Decoder. The Encoder learns the rules of the data, creating a space, in some sense a virtual universe, of datapoints from the training data. It offers a rough approximation of what each datapoint means that is progressively built up, becoming more sophisticated, as the data moves through its layers. Notably, the functionality of the Encoder may be incorporated into the Decoder, creating a more streamlined architecture. The Decoder traditionally received the Encoder output as input, subsequently refining it to perform a prediction task (sentence completion, named entity recognition (NER), or language translation, among many other possibilities). GPT, a Decoder-only architecture, uses its own output. And this cyclic behavior makes the trained model generative: (1) the user supplies a prompt, (2) the model predicts the next token (word or punctuation), and (3) the prediction is appended to the user's input, extending it, and the result is passed back through the model for another prediction round, until a termination criterion is reached, such as an "end of sentence" (EOS) token. In the two diagrams below, the task is limited to text: translation or word generation. However, the transformer architecture is highly versatile, since most data is inherently sequential or may be coerced into a sequential representation. Next, we will discuss the details that make GPT work, starting with "tokenization," which entails dividing a sequence of text into chunks, then moving on to "embedding," a learned numeric representation of these chunks or tokens, likened to their coordinates (locations) in a high-dimensional universe. Tokenization algorithms are distinct from GPT, while embedding is an artificial neural network module, formally an embedding layer, incorporated into GPT (the Decoder-only transformer).
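The three-step generation cycle can be sketched with a stand-in model. Below, a hypothetical lookup table of word-to-word rules plays the part of the trained Decoder, and tokenization is plain whitespace splitting (real systems learn subword tokenizers); only the loop structure mirrors the real thing.

```python
# A stand-in "model": maps the last token to a predicted next token.
RULES = {"the": "cat", "cat": "sat", "sat": "down", "down": "<EOS>"}

def tokenize(text):
    """Divide the text into chunks (here, naive whitespace splitting)."""
    return text.lower().split()

def predict_next(tokens):
    """Stand-in for a forward pass through the Decoder."""
    return RULES.get(tokens[-1], "<EOS>")

def generate(prompt, max_tokens=10):
    tokens = tokenize(prompt)              # (1) user supplies a prompt
    for _ in range(max_tokens):
        nxt = predict_next(tokens)         # (2) predict the next token
        if nxt == "<EOS>":                 # termination criterion
            break
        tokens.append(nxt)                 # (3) append, then loop again
    return " ".join(tokens)

print(generate("The cat"))  # the cat sat down
```

Replacing `predict_next` with a real transformer forward pass, and `tokenize` with a learned subword tokenizer, gives the actual GPT generation loop.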

Generative Pre-trained Transformers (GPT)

The LEGO Block World of Large Language Models (LLMs)

Language is a human invention. And while other species have symbolic communication systems, none is apparently as rich and impactful as human language, despite our best efforts to train non-human primates on the semantics and syntax of human languages. While it is not beyond the realm of possibility that we may train machines to communicate, it is unclear, even among humans, what is meant by "understanding" and "meaning"; if two people agree, does it suggest they agree identically, or do they actually misunderstand each other but, lacking insight into each other's inner workings, simply fail to acknowledge the uncertainty? To train the machine to understand implies the analyst has resolved these complexities, that they understand. At the least, they must understand the computational architecture that could support human language, since, even given a billion texts on which to build an algorithm, it would be difficult to dissociate a rule-following or algorithmic machine from one that could be said to understand as well. To make such a claim depends on preliminary knowledge of an "understanding machine" and the criteria for defining something as such. Data cannot teach anything that isn't at least tacitly understood by the analyst. Otherwise, it is equivalent to random noise, and any attempt at interpretation may rely on imagination alone. Accordingly, to avoid speculation and overinterpretation, it is best to think about artificial neural networks (Large Language Models (LLMs) like ChatGPT) as "block" manipulators. The classical "block" in science and math is the LEGO block, owing to parallels with nature's subunits (e.g., DNA) and their manipulation, which, from the human perspective, is to say the mathematical operations that act on them (ADD, SUBTRACT, DIVIDE, etc.). The next few slides further explore this LEGO block analogy.

Embedding: Organizing the Library at the Edge of the Universe

GPT as the Masterful LEGO Block Generator

Infinitely Generative Processes Eventually Invent Themselves

Humanity has, from antiquity to the present moment, experienced expanding creative potential. In pre-global civilizations, creative acts were insular and, for the most part, unrelatable, reflecting the sacred and ritualistic features of those cultures; to understand, in this context, requires membership in the society or culture. Eventually, cultures crossed borders and became less distinct, and creative expression grew increasingly approachable, generating outputs that others could understand and react to. Departing from local, insular concerns for global, existential questions, humanity entered a dialogue phase, taking the form "this is how the world is ___," to which someone else replies, "no, no, no, it is actually like this ___," resembling the point/counterpoint structure commonly attributed to scientific work. Below, I gathered a few coarse examples to show how this dialogue phase proceeded. Ultimately, the evolution reflects a dialogue in which humanity becomes increasingly aware of its position in the world, generating, creating, ceaselessly, until there is no object "out there" to discuss artistically; no substantive human dialogue about an outside world. Our only choice is, as in a Shakespearean play, to write ourselves into the scene; to create the play within the play, living fictions, drawing on real-life canvases, often imagining what could be but unfortunately is not. We are now at the precipice, or, as Robert Frost wrote, "two roads diverged . . .," tasked with imagining the unimaginable, desiring what we cannot create. I am not suggesting that art stops or has stopped. We will continue to create; there are, and will be, aesthetic experiences that connect us to our ancient origins, but we will remain perpetually unsatisfied. The world for us will exist at the horizon, leaving us wondering what's next, wondering what could be.
The way forward, if society desires it, must involve an AI partnership. Much like our ancient ancestors, who captured fire, controlled light, and soon after etched their imaginings on cave walls and rock surfaces, we might just invent ourselves.

From left to right: early Ancient Egyptian art (pre-Greek and Roman influences); next, one of Mondrian's famous "Composition" paintings, a departure from the former representational art, or art that depicts or mirrors the world as given. The next example in the series, from the artist Lee Krasner, reflects the natural progression from, or response to, Mondrian into increasing abstraction, as if the ongoing dialogue were: "no, no, no, the world isn't as it appears. Truth is hidden in the whirl of perceptual experience. We know, but know vaguely." I speculate this is strongly associated with the scientific discoveries in physics of that era, first with special relativity, suggesting that two individuals, one of whom is in motion, could not agree on basic measurements, followed shortly after by the more disruptive quantum mechanics, which brought fundamental uncertainty to popular culture. That is, uncertainty distinct from human imprecision in measurement, or from subjective limitations in how the world appears (to us) when stationary or in motion. Next come Warhol's famous works, an inversion back to representational art, as if abstraction had grown tired and unrelatable in its insistence on the work meaning nothing, on being whatever you want it to be. Logically, the response, referring to Warhol's works specifically, is to claim that everything is art, or can be. Art is the act. Warhol, unlike the abstract expressionists Krasner, Pollock, and de Kooning, takes ownership of what the art is. After all, it cannot be art if the artist fails to claim, "This is what it is. It is art. Put it in the museum." However, this response makes art and the artist commonplace. It demystifies art (so to speak). The last work in the series captures this movement away from the abstract and then away from Warhol. It is Elaine de Kooning's cave art. It is clearly what it intends to be. It is representational. It is mimicry.
But her work acknowledges this, and therefore transcends "mere imitation." It is now the artist connecting to the ancient past, telling the story of the past through a modern lens. It is the artist rediscovering, or creating, the artist on canvas and entering into a time-defying dialogue in which the preceding points and counterpoints, responses and rebuttals, are subsumed into one grand generative process: the artist as storyteller. This suggests not only that AI art like DALL-E 2's is art, but that the future of art in many ways requires AI. Otherwise, the story will not compel us. We will appreciate it, but it won't rattle us.
