March 6, 2025

Understanding Attention in LLMs

Posted by Bartosz Milewski under Large Language Model, Neural Networks | Tags: AI, artificial-intelligence, Attention, Large Language Model, llm, machine-learning, Neural Networks, nlp, Transformers |
[2] Comments

There are many excellent AI papers and tutorials that explain the attention pattern in Large Language Models. But this essentially simple pattern is often obscured by implementation details and optimizations. In this post I will try to cut to the essentials.

In a nutshell, the attention machinery tries to get at a meaning of a word (more precisely, a token). This should be easy in principle: we could just look it up in the dictionary. For instance, the word “triskaidekaphobia” means “extreme superstition regarding the number thirteen.” Simple enough. But consider the question: What does “it” mean in the sentence “look it up in the dictionary”? You could look up the word “it” in the dictionary, but that wouldn’t help much. More ofthen than not, we guess the meaning of words from their context, sometimes based on a whole conversation.

The attention mechanism is a way to train a language model to be able to derive a meaning of a word from its context.

The first step is the rough encoding of words (tokens) as multi-dimensional vectors. We’re talking about 12,288 dimensions for GPT-3. That’s an enormous semantic space. It means that you can adjust the meaning of a word by nudging it in thousands of different directions by varying amounts (usually 32-bit floating point numbers). This “nudging” is like adding little footnotes to a word, explaining what was its precise meaning in a particular situation.

(Note: In practice, working with such huge vectors would be prohibitively expensive, so they are split between multiple heads of attention and processed in parallel. For instance, GPT-3 has 96 heads of attention, and each head works within a 128-dimensional vector space.)

We are embedding the input word as a 12,288-dimensional vector $\vec E$ , and we are embedding $N$ words of context as 12,288-dimensional vectors, $\vec E_i$ , $i = 1..N$ (for GPT-3, $N=2048$ tokens). Initially, the mapping from words to embeddings is purely random, but as the training progresses, it is gradually refined using backpropagation.

(Note: In most descriptions, you’ll see a whole batch of words begin embedded all at once, and the context often being identical with that same batch–but this is just an optimization.)

The goal is to refine the embedding $\vec E$ by adding to it a small delta $\Delta \vec E$ that is derived from the context $\vec E_i$ .

This is usually described as the vector $\vec E$ querying the context, and the context responding to this query.

First, we apply a gigantic trainable 12,288 by 12,288 matrix $W_Q$ to the embedding $\vec E$ , to get the query vector:

$\vec Q = W_Q \vec E$

You may think of $\vec Q$ as the question: “Who am I with respect to the Universe?” The entries in the matrix $W_Q$ are the weights that are learned by running the usual backpropagation.

We then apply another trainable matrix $W_K$ to every vector $\vec E_i$ of the context to get a batch of key vectors:

$\vec K_i = W_K \vec E_i$

You may think of $\vec K_i$ as a possible response from the $i$ ‘th component of the context to all kinds of questions.

The next step is crucial: We calculate how relevant each element of the context is to our word. In other words, we are focusing our attention on particular elements of the context. We do it by taking $N$ scalar products $\vec K_i \cdot \vec Q$ .

A scalar product can vary widely. A large positive number means that the $i$ ‘th element of the context is highly relevant to the meaning of $\vec E$ . A large negative number, on the other hand, means that it has very little to contribute.

What we really need is to normalize these numbers to a range between zero (not relevant) and one (extremely relevant) and make sure that they all add up to one. This is normally done using the softmax procedure: we first raise $e$ to the power of the given number to make the result non-negative:

$a_i = exp (\vec K_i \cdot \vec Q)$

We then divide it by the total sum, to normalize it:

$A_i = \dfrac {a_i}{ \sum_{i = 1}^N a_i}$

These are our attention weights. (Note: For efficiency, before performing the softmax, we divide all numbers by the square root of the dimension of the vector space $\sqrt{d_k}$ .)

Attention weights tell us how much each element of the context can contribute to the meaning of $\vec E$ , but we still don’t know what it contributes. We figure this out by multiplying each element of the context by yet another trainable matrix, $W_V$ . The result is a batch of value vectors $\vec V_i$ :

$\vec V_i = W_V \vec E_i$

We now accumulate all these contribution, weighing them by their attention weights $A_i$ . We get the adjusted meaning of our original embedding $\vec E$ (this step is called the residual connection):

$\vec E' = \vec E + \sum_{i = 1}^N A_i \vec V_i$

The result $\vec E'$ is infused with the additional information gained from a larger context. It’s closer to the actual meaning. For instance, the word “it” would be nudged towards the noun that it stands for.

And that’s essentially the basic block of the attention system. The rest is just optimization and composition.

One major optimization is gained by processing a whole window of tokens at once (2048 tokens for GPT-3). In particular, in the self-attention pattern we use the same batch of tokens for both the input and the context. In general, though, the context can be distinct from the input. It could, for instance, be a sentence in a different language.

Another optimization is the partitioning of all three vectors, $\vec Q$ , $\vec K_i$ , and $\vec V_i$ between the heads of attention. Each of these heads operates inside a smaller subspace of the vector space. For GPT-3 these are 128-dimensional subspaces. The heads produce smaller-dimensional deltas, $\Delta \vec E$ , which are concatenated into larger vectors, which are then added to the original embeddings through residual connection.

In GPT-3, each multi-headed attention block is followed by a multi-layer perceptron, MLP, and this transformation is repeated 96 times. Most steps in an LLM are just linear algebra; except for softmax and the activation function in the MLP, which are non-linear.

All this work is done just to produce the next word in a sentence.

January 4, 2025

Legalizing Comonad Composition

Posted by Bartosz Milewski under Category Theory, Haskell, Programming | Tags: Category Theory, Comonad, Game of Life, Haskell |
[7] Comments

The yearly Advent of Code is always a source of interesting coding challenges. You can often solve them the easy way, or spend days trying to solve them “the right way.” I personally prefer the latter. This year I decided to do some yak shaving with a puzzle that involved looking for patterns in a grid. The pattern was the string XMAS, and it could start at any location and go in any direction whatsoever.

My immediate impulse was to elevate the grid to a comonad. The idea is that a comonad describes a data structure in which every location is a center of some neighborhood, and it lets you apply an algorithm to all neighborhoods in one fell swoop. Common examples of comonads are infinite streams and infinite grids.

Why would anyone use an infinite grid to solve a problem on a finite grid? Imagine you’re walking through a neighborhood. At every step you may hit the boundary of a grid. So a function that retrieves the current state is allowed to fail. You may implement it as returning a Maybe value. So why not pre-fill the infinite grid with Maybe values, padding it with Nothing outside of bounds. This might sound crazy, but in a lazy language it makes perfect sense to trade code for data.

I won’t bore you with the details, they are available at my GitHub repository. Instead, I will discuss a similar program, one that I worked out some time ago, but wasn’t satisfied with the solution: the famous Conway’s Game of Life. This one actually uses an infinite grid, and I did implement it previously using a comonad. But this time I was more ambitious: I wanted to generate this two-dimensional comonad by composing a pair of one-dimensional ones.

The idea is simple. Each row of the grid is an infinite bidirectional stream. Since it has a specific “current position,” we’ll call it a cursor. Such a cursor can be easily made into a comonad. You can extract the current value; and you can duplicate a cursor by creating a cursor of cursors, each shifted by the appropriate offset (increasing in one direction, decreasing in the other).

A two-dimensional grid can then be implemented as a cursor of cursors–the inner one extending horizontally, and the outer one vertically.

It should be a piece of cake to define a comonad instance for it: extract should be a composition of (extract . extract) and duplicate a composition of (duplicate . fmap duplicate), right? It typechecks, so it must be right. But, just in case, like every good Haskell programmer, I decided to check the comonad laws. There are three of them:

extract . duplicate = id
fmap extract . duplicate = id
duplicate . duplicate = fmap duplicate . duplicate

And they failed! I must have done something illegal, but what?

In cases like this, it’s best to turn to basics–which means category theory. Compared to Haskell, category theory is much less verbose. A comonad is a functor $W$ equipped with two natural transformations:

$\varepsilon \colon W \to \text{Id}$

$\delta \colon W \to W \circ W$

In Haskell, we write the components of these transformations as:

extract :: w a -> a
duplicate :: w a -> w (w a)

The comonad laws are illustrated by the following commuting diagrams. Here are the two counit laws:

and one associativity law:

These are the same laws we’ve seen above, but the categorical notation makes them look more symmetric.

So the problem is: Given a comonad $W$ , is the composition $W \circ W$ also a comonad? Can we implement the two natural transformations for it?

$\varepsilon_c \colon W \circ W \to \text{Id}$

$\delta_c \colon W \circ W \to W \circ W \circ W \circ W$

The straightforward implementation would be:

$W \circ W \xrightarrow{\varepsilon \circ W} W \xrightarrow{\varepsilon} \text{Id}$

corresponding to (extract . extract), and:

$W \circ W \xrightarrow{W \circ \delta} W \circ W \circ W \xrightarrow{\delta \circ W \circ W} W \circ W \circ W \circ W$

corresponding to (duplicate . fmap duplicate).

To see why this doesn’t work, let’s ask a more general question: When is a composition of two comonads, say $W_2 \circ W_1$ , again a comonad? We can easily define a counit:

$W_2 \circ W_1 \xrightarrow{\varepsilon_2 \circ W_1} W \xrightarrow{\varepsilon_1} \text{Id}$

The comultiplication, though, is tricky:

$W_2 \circ W_1 \xrightarrow{W_2 \circ \delta_1} W_2 \circ W_1 \circ W_1 \xrightarrow{\delta_2 \circ W} W_2 \circ W_2 \circ W_1 \circ W_1$

Do you see the problem? The result is $W_2^2 \circ W_1^2$ but it should be $(W_2 \circ W_1)^2$ . To make it a comonad, we have to be able to push $W_2$ through $W_1$ in the middle. We need $W_2$ to distribute over $W_1$ through a natural transformation:

$\lambda \colon W_2 \circ W_1 \to W_1 \circ W_2$

But isn’t that only relevant when we compose two different comonads–surely any functor distributes over itself! And there’s the rub: Not every comonad distributes over itself. Because a distributive comonad must preserve the comonad laws. In particular, to restore the the counit law we need this diagram to commute:

and for the comultiplication law, we require:

Even if the two comonad are the same, the counit condition is still non-trivial:

The two whiskerings of $\varepsilon$ are in general not equal. All we can get from the original comonad laws is that they are only equal when applied to the result of comultiplication:

$(\varepsilon \circ W) \cdot \delta = (W \circ \varepsilon) \cdot \delta$ .

Equipped with the distributive mapping $\lambda$ we can complete our definition of comultiplication for a composition of two comonads:

$W_2 \circ W_1 \xrightarrow{W_2 \circ \delta_1} W_2 \circ W_1^2 \xrightarrow{\delta_2 \circ W} W_2^2 \circ W_1^2 \xrightarrow{W_2 \circ \lambda \circ W_1} (W_2 \circ W_1)^2$

Going back to our Haskell code, we need to impose the distributivity condition on our comonad. There is a type class for it defined in Data.Distributive:

class Functor w => Distributive w where
  distribute :: Functor f => f (w a) -> w (f a)

Thus the general formula for composing two comonads is:

instance (Comonad w2, Comonad w1, Distributive w1) => 
  Comonad (Compose w2 w1) where
    extract = extract . extract . getCompose
    duplicate = fmap Compose . Compose .
                fmap distribute . duplicate . fmap duplicate .
                getCompose

In particular, it works for composing a comonad with itself, as long as the comonad distributes over itself.

Equipped with these new tools, let’s go back to implementing a two-dimensional infinite grid. We start with an infinite stream:

data Stream a = (:>) { headS :: a
                     , tailS :: Stream a}
  deriving Functor

infixr 5 :>

What does it mean for a stream to be distributive? It means that we can transpose a “matrix” whose rows are streams. The functor f is used to organize these rows. It could, for instance, be a list functor, in which case you’d have a list of (infinite) streams.

  [   1 :>   2 :>   3 .. 
  ,  10 :>  20 :>  30 ..
  , 100 :> 200 :> 300 .. 
  ]

Transposing a list of streams means creating a stream of lists. The first row is a list of heads of all the streams, the second row is a list of second elements of all the streams, and so on.

  [1, 10, 100] :>
  [2, 20, 200] :>
  [3, 30, 300] :>
  ..

Because streams are infinite, we end up with an infinite stream of lists. For a general functor, we use a recursive formula:

instance Distributive Stream where
    distribute :: Functor f => f (Stream a) -> Stream (f a)
    distribute stms = (headS  stms) :> distribute (tailS  stms)

(Notice that, if we wanted to transpose a list of lists, this procedure would fail. Interestingly, the list monad is not distributive. We really need either fixed size or infinity in the picture.)

We can build a cursor from two streams, one going backward to infinity, and one going forward to infinity. The head of the forward stream will serve as our “current position.”

data Cursor a = Cur { bwStm :: Stream a
                    , fwStm :: Stream a }
  deriving Functor

Because streams are distributive, so are cursors. We just flip them about the diagonal:

instance Distributive Cursor where
    distribute :: Functor f => f (Cursor a) -> Cursor (f a)
    distribute fCur = Cur (distribute (bwStm  fCur)) 
                          (distribute (fwStm  fCur))

A cursor is also a comonad:

instance Comonad Cursor where
  extract (Cur _ (a :> _)) = a
  duplicate bi = Cur (iterateS moveBwd (moveBwd bi)) 
                     (iterateS moveFwd bi)

duplicate creates a cursor of cursors that are progressively shifted backward and forward. The forward shift is implemented as:

moveFwd :: Cursor a -> Cursor a
moveFwd (Cur bw (a :> as)) = Cur (a :> bw) as

and similarly for the backward shift.

Finally, the grid is defined as a cursor of cursors:

type Grid a = Compose Cursor Cursor a

And because Cursor is a distributive comonad, Grid is automatically a lawful comonad. We can now use the comonadic extend to advance the state of the whole grid:

generations :: Grid Cell -> [Grid Cell]
generations = iterate $ extend nextGen

using a local function:

nextGen :: Grid Cell -> Cell
nextGen grid
  | cnt == 3 = Full
  | cnt == 2 = extract grid
  | otherwise = Empty
  where
      cnt = countNeighbors grid

You can find the full implementation of the Game of Life and the solution of the Advent of Code puzzle, both using comonad composition, on my GitHub.

November 16, 2024

Sieves and Sheaves

Posted by Bartosz Milewski under Category Theory, Topology | Tags: Category Theory, math, mathematics, Topology |
Leave a Comment

Previously: Covering Sieves.

We’ve seen an intuitive description of presheaves as virtual objects. We can use the same trick to visualize natural transformations.

A natural transformation can be drawn as a virtual arrow $\alpha$ between two virtual objects corresponding to two presheaves $S$ and $P$ . Indeed, for every $s_a \in S a$ , seen as an arrow $a \to S$ , we get an arrow $a \to P$ simply by composition $\alpha \circ s_a$ . Notice that we are thus defining the composition with $\alpha$ , because we are outside of the original category. A component $\alpha_a$ of a natural transformation is a mapping between two arrows.

Untitled Artwork

This composition must be associative and, indeed, associativity is guaranteed by the naturality condition. For any arrow $f \colon a \to b$ , consider a zigzag path from $a$ to $P$ given by $\alpha \circ s_b \circ f$ . The two ways of associating this composition give us $\alpha_a \circ S f = P f \circ \alpha_b$ .

Untitled Artwork

Let’s now recap our previous definitions: A cover of $u$ is a bunch of arrows converging on $u$ satisfying certain conditions. These conditions are defined in terms of a coverage. For every object $u$ we define a whole family of covers, and then combine them into one big collection that we call the coverage.

A sheaf is a presheaf that is compatible with a coverage. It means that for every cover $\{u_i\}$ , if we pick a compatible family of $x_i \in P u_i$ that agrees on all overlaps, then this uniquely determines the element (virtual arrow) $x \in P u$ .

Untitled Artwork

A covering sieve of $u$ is a presheaf that extends a cover $\{u_i\}$ . It assigns a singleton set to each $u_i$ and all its open subsets (that is objects that have arrows pointing to $u_i$ ); and an empty set otherwise. In particular, the sieve includes all the overlaps, like $u_i \cap u_j$ , even if they are not present in the original cover.

The key observation here is that a sieve can serve as a blueprint for, or a skeleton of, a compatible family $\{ x_i \}$ . Indeed, $S_u$ maps all objects either to singletons or to empty sets. In terms of virtual arrows, there is at most one arrow going to $S_u$ from any object. This is why a natural transformation from $S_u$ to any presheaf $P$ produces a family of arrows $x_i \in P u_i$ . It picks a single arrow from each of the hom-sets $u_i \to P$ .

Untitled Artwork

The sieve includes all intersections, and all diagrams involving those intersections necessarily commute. They commute because the category we’re working with is thin, and so is the category extended by adding the virtual object $S_u$ . Thus a family generated by a natural transformation $\alpha \in Nat (S_u, P)$ is automatically a compatible family. Therefore, if $P$ is a sheaf, it determines a unique element $x \in P u$ .

This lets us define a sheaf in terms of sieves, rather than coverages.

A presheaf $P$ is a sheaf if and only if, for every covering sieve $S_u$ of every $u$ , there is a one-to-one correspondence between the set of natural transformations $Nat (S_u, P)$ and the set $P u$ .

In terms of virtual arrows, this means that there is a one-to-one correspondence between arrows $\alpha \colon S_u \to P$ and $x \colon u \to P$ .

Untitled Artwork
Next: Subobject Classifier

October 31, 2024

Covering Sieves

Posted by Bartosz Milewski under Category Theory, Topology | Tags: Category Theory, math, mathematics, Topology |
[2] Comments

Previously: Sheaves as Virtual Objects.

In order to define a sheaf, we have to start with coverage. A coverage defines, for every object $u$ , a family of covers that satisfy the sub-coverage conditions. Granted, we can express coverage using objects and arrows, but it would be much nicer if we could use the language of functors and natural transformations.

Let’s start with the idea that, categorically, a cover of $u$ is a bunch of arrows converging on $u$ . Each arrow $p_i \colon u_i \to u$ is a member of the hom-set $\mathcal C (u_i, u)$ . Now consider the fact that $\mathcal C (-, u)$ is a presheaf, $\mathcal C^{op} \to \mathbf{Set}$ , and ask the question: Is a cover a “subfunctor” of $\mathcal C (-, u)$ ?

A subfunctor of a presheaf $P$ is defined as a functor $S$ such that, for each object $v$ , $S v$ is a subset of $P v$ and, for each arrow $f \colon v \to w$ , the function $S f \colon S w \to S v$ is a restriction of $P f$ .

Untitled Artwork

In general, a cover does not correspond to a subfunctor of the hom-functor. Let’s see why, and how we can fix it.

Let’s try to define $S$ , such that $S u_i$ is non-empty for any object $u_i$ that’s in the cover of $u$ , and empty otherwise. As a presheaf, we could represent it as a virtual object with arrows coming from all $\{ u_i \}$ ‘s.

Untitled Artwork

Now consider an object $v$ that is not in the cover, but it has an arrow $f \colon v \to u_k$ connecting it to some element $u_k$ of the cover. Functoriality requires the (virtual) composition $s_k \circ f$ to exist. Untitled Artwork

Thus $v$ must be included in the cover–if we want $S$ to be a functor.

In particular, if we are looking at a category of open sets with inclusions, this condition means that all (open) sub-sets of the covering sets must also be included in the cover. Such a “downward closed” family of sets is called a sieve.

Imagine sets in the cover as holes in a sieve. Smaller sets that can “pass through” these holes must also be parts of the sieve.

If you start with a cover, you can always extend it to a covering sieve by adding more arrows. It’s as if you started with a few black holes, and everything that could fall into them, would fall.

We have previously defined sheaves in terms of coverings. In the next installment we’ll see that they can equally well be defined using covering sieves.

Next Sieves and Sheaves.

October 24, 2024

Sheaves as Virtual Objects

Posted by Bartosz Milewski under Category Theory, Topology | Tags: Category Theory, math, mathematics, Topology |
1 Comment

Previously: Coverages and Sites

The definition of a sheaf is rather complex and involves several layers of abstraction. To help us navigate this maze we can use some useful intuitions. One such intuition is to view objects in our category as some kind of sets (in particular, open sets, when we talk about topology), and arrows as set inclusions. An arrow from $v$ to $u$ means that $v$ is a subset of $u$ .

A cover of $u$ is a family of arrows $\{ p_i \colon u_i \to u \}$ . A coverage assigns a collection of covers to every object, satisfying the sub-coverage conditions described in the previous post. A category with coverage is called a site.

The next layer of abstraction deals with presheaves, which are set-valued contravariant functors. Interestingly, there is a way to interpret a presheaf as an extension of the original category. I learned this trick from Paolo Perrone.

We may represent a presheaf $P$ using virtual hom-sets. First we add one virtual object, let’s call it $\bullet$ , to our category. The set $P u$ is then interpreted as the set of arrows from $u$ to $\bullet$ .

Untitled Artwork

Moreover, we can represent the action of $P$ on arrows as simple composition. Take an arrow $f \colon v \to u$ . The presheaf lifts it to a function between sets: $P f \colon P u \to P v$ (contravariance means that the arrow is reversed). For any $h \in P u$ we can define the composition $h \circ f$ to be $(P f) h$ .

Untitled Artwork

Incidentally, if the functor $P$ is representable, it means that we can replace the virtual object $\bullet$ with an actual object in our category.

Notice that, even though the category of open sets with inclusions is a poset (hom-sets are either singletons or empty, and all diagrams automatically commute), the added virtual hom-sets usually contain lots of arrows. In topology these hom-sets are supposed to represent sets of continuous functions over open sets.

We can interpret the virtual object $\bullet$ as representing an imaginary open set that “includes” all the objects $u$ for which $P u$ is non-empty, but we have to imagine that it’s possible to include an object in more than one way, to account for multiple arrows. In fact, in what follows we won’t be assuming that the underlying category is a poset, so virtual hom-sets are nothing special.

To express the idea of intersections of open sets, we use commuting diagrams. For every pair of objects $u_i$ and $u_j$ that are in the cover of $u$ , an object $v$ is in their intersection if the following diagram commutes:

Untitled Artwork

Note that in a poset all diagrams commute, but here we’re generalizing this condition to an arbitrary category. We could say that $v$ is in the intersection of $u_i$ and $u_j$ seen as covers of $u$ .

Equipped with this new insight, we can now express the sheaf condition. We assume that there is a coverage defined in our category. We are adding one more virtual object $\bullet$ for the presheaf $P$ , with bunches of virtual arrows pointing to it.

For every cover $\{ p_i \colon u_i \to u \}$ we try to select a family of virtual arrows, $s_i \colon u_i \to \bullet$ . It’s as if the objects $u_i$ , besides covering $u$ , also covered the virtual object $\bullet$ .

We call the family $\{s_i \}$ a matching family, if this new covering respects the existing intersections. If $v$ is in the intersection of $u_i$ and $u_j$ (as covers of $u$ , see the previous diagram), then we want the following diagram to also commute:
Untitled Artwork
In other words, the $\{u_i\}$ ‘s intersect as covers of $\bullet$ .

A presheaf $P$ is a sheaf if, for every covering family $p_i$ and every matching family $s_i$ there exists a unique $s \colon u \to \bullet$ that factorizes those $s_i$ ‘s:
Untitled Artwork
Translating it back to the language of topology: There is a unique global function $s$ defined over $u$ whose restrictions are $s_i$ ‘s.

The advantage of this approach is that it’s easy to imagine the sheafification of an arbitrary presheaf by freely adding virtual arrows (the $s$ ‘s and their compositions with $p_i$ ‘s in the above diagram) to all intersection diagrams.

Next: Covering Sieves

October 7, 2024

Coverages and Sites

Posted by Bartosz Milewski under Topology | Tags: math |
[4] Comments

Previously: Sheaves and Topology.

In our quest to rewrite topology using the language of category theory we introduced the category of open sets with set inclusions as morphisms. But when we needed to describe open covers, we sort of cheated: we chose to talk about set unions. Granted, set unions can be defined as coproducts in this category (not to be confused with coproducts in the category of sets and functions, where they correspond to disjoint unions). This poset of open sets with finite products and infinite coproducts is called a frame. There is however a more general definition of coverage that is applicable to categories that are not necessarily posets.

A cover of an open set $u$ is a family of open sets $u_i$ that doesn’t leave any part of $u$ uncovered. In terms of inclusions, we can rephrase this as having a family of of morphisms $p_i \colon u_i \to u$ . But then how do we ensure that these sets indeed cover the whole of $u$ ?
Cover
The familiar trick of category theory is to look at the totality of all covers. Suppose that for every open set $u$ we have not one, but a whole collection of covers: that is a collection of families $\{u_i\}$ , each indexed by a potentially different set $I$ .

Now consider an open subset $v \subseteq u$ . If the family $u_i$ is a cover of $u$ than the family of intersections $v_i = v \cup u_i$ should form a cover of $v$ .

Indeed, an intersection of two open sets is again an open set, so all $v_i$ ‘s are automatically open. And if no part of $u$ was left uncovered by $u_i$ ‘s, then no part of $v$ is left uncovered by $v_i$ ‘s.

Conversely, imagine that we had an incomplete cover, with a hole large enough to stick an open set $v$ in it. The empty intersection of that “cover” with $v$ would then produce no cover of $v$ . So the requirement that each cover produces a smaller sub-cover eliminates the possibility of substantial holes in coverage. This is good enough for our generalization.

We are now ready to define coverage using categorical language. We just have to make sure that this definition reproduces the set-theoretic picture when we replace objects with open sets, and arrows with inclusions.

A coverage $J$ on a category $\mathcal C$ assigns to each object $u$ a collection of families of arrows $p_i \colon u_i \to u$ .

CatCover3

For every such family $\{ p_i\}$ , and every object $v$ equipped with an arrow $g \colon v \to u$ , there exist a covering family $q_j \colon v_j \to v$ that is a sub-family of $u_i$ .

CatCover

This means that for every $v_j$ we can find its “parent” $u_i$ , i.e., every inclusion $g \circ q_j \colon v_j \to u$ can be factored through some $p_i$ :

$g \circ q_j = p_i \circ k_{i j}$

CatCover

A category with a coverage is called a site. As we’ve seen before, the definition of a sheaf uses a coverage, so a site provides the minimum of structure for a sheaf.

For completeness, here’s the definition of a sheaf on a site $(\mathcal C , J)$ as explained in detail in the previous post:

Our starting point is a presheaf $P \colon \mathcal C^{op} \to Set$ , abstracting the idea of assigning a set of functions to every open set (an object of $\mathcal C$ ). $P$ maps arrows in $\mathcal C$ (inclusions of open sets) to restrictions of said functions. This presheaf is a sheaf if:

For every covering family $p_i \colon u_i \to u$
and every compatible family (tuple) of elements $s_i \in P u_i$ , such that for every $v$ that has arrows to two objects: $f \colon v \to u_i$ and $g \colon v \to u_j$ , such that $p_i \circ f = p_j \circ g$ , we have:
$(P f) s_i = (P g) s_j$

(the restrictions on all overlaps coincide)
there is a unique element $s \in P u$ such that $(P p_i) s = s_i$ for all $i$ (we can collate all individual functions).

Untitled Artwork

As you’d expect in topology, this definition doesn’t mention sizes or distances. More interestingly, we don’t talk about points. Normally a topological space is defined as a set of points, and so are open sets. The categorical language lets us talk about point-free topologies.

There is a further generalization of sheaves, where the target of the functor $P$ is not $\mathbf{Set}$ but a category with some additional structure. This makes sense, because the set of functions defined over an open set has usually more structure. It’s the structure induced by the target of these functions. For instance real- or complex-valued functions can be added, subtracted, and multiplied–point-wise. Division of functions is not well defined because of zeros, so they only form a ring.

There is an intriguing possibility that the definition of a coverage could be used to generalize convolutional neural networks (CNN). For instance, voice or image recognition involves applying a sliding window to the input. This is usually done using fixed-sized (square) window, but what really matters is that the input is covered by overlapping continuous areas that capture the 2-dimensional (topological) nature of the input. The same idea can be applied to higher dimensional data. In particular, we can treat time as an additional dimension, capturing the idea of continuous motion.

Next Sheaves as Virtual Objects.

August 18, 2024

Sheaves and Topology

Posted by Bartosz Milewski under Category Theory, Topology | Tags: Category Theory, Topology |
Leave a Comment

Previously: Presheaves and Topology.

In all branches of science we sooner or later encounter the global vs. local duality. Topology is no different.

In topology we have the global definition of continuity: counter-images of all open sets are open. But we perceive a discontinuity as a local jump. How are the two pictures related, and can we express this topologically, that is without talking about sizes and distances?

All we have at our disposal are open sets, so exactly what properties of open sets are the most relevant? They do form a (thin) category with inclusions as arrows, but so does any set of subsets. As it turns out open sets can be stitched together to create coverings. Such coverings let us zoom in on finer and finer details, thus creating the bridge between the global and the local picture.

Open sets are plump–they can easily fill the bulk of space. They are also skinless, so they can’t touch each other without some overlap. That makes them perfect for constructing covers.

Covering, unlike tiling, requires overlapping. To create a leak-free roof, you need your tiles to overlap. The idea is that, if we were defining functions over a tiling, it would be possible for them to make sudden jumps at tile boundaries. Open coverings overlap, so such functions have to flow continuously.

An open cover of a set $u$ is a family of open sets $\{u_i\}$ such that $u$ is their union:

$u = \bigcup_{i \in I} u_i$

Here $I$ is a set used for indexing the family.

If we have a continuous function $f$ defined over $u$ , then all its restrictions $f|_{u_i}$ are also continuous (this follows from the condition that an intersection of open sets is open). Thus going from global to local is easy.

The converse is more interesting. Suppose that we have a family of functions $f_i$ , one per each open set $u_i$ , and we want to reconstruct the global function $f$ defined over the set $u$ covered by $u_i$ ‘s. This is only possible if the individual functions agree on overlaps.

Take two functions: $f_i$ defined over $u_i$ , and $f_j$ defined over $u_j$ . If the two sets overlap, each of the functions can be restricted to the overlap $u_i \cap u_j$ . We want these restrictions to be equal:

$f_i|_{u_i \cap u_j} = f_j|_{u_i \cap u_j}$

If all individual continuous functions agree on the overlaps then they uniquely determine a global continuous function $f$ defined over the whole set $u$ . You can stitch or collate functions that are defined locally.

In the language of category theory we talk about functions in bulk. We define a functor–a presheaf $P$ –that maps all open sets to sets of continuous functions. In this language, to an open cover $\{u_i\}$ corresponds a family of functions $\{f_i\}$ that are members of the individual sets $P u_i$ . Every such selection forms a giant $I$ -indexed tuple, that is an element of the cartesian product:

$\{f_i | i \in I\} \in \prod_{i} P u_i$

Similarly, we can gather functions that are defined over the intersections of sets into a product:

$\prod_{i j} P (u_i \cap u_j)$

(Notice that every empty intersection corresponds to a single trivial function that we call absurd in Haskell.)

Set inclusions generate function restrictions. In particular, for every intersection $u_i \cap u_j$ we have a pair of restrictions:

$f_i \mapsto f_i|_{u_i \cap u_j}$

$f_j \mapsto f_j|_{u_i \cap u_j}$

These restrictions can be seen as functions between sets:

$P u_i \to P (u_i \cap u_j)$

$P u_j \to P (u_i \cap u_j)$

If all such restrictions are pairwise equal, we call $\{f_i\}$ a matching family, and for every such matching family there is a unique element $f \in P u$ such that $f_i = f|_{u_i}$ , for all $i$ .

These pairs of restrictions define two mappings between our big products:

$p, q : \prod_i P u_i \rightrightarrows \prod_{i j} P (u_i \cap u_j)$

Think of each function as acting on a tuple $\{f_k\}$ and producing a matrix indexed by elements of $I$ :

$(p\; \{f_k\})_{i j} = f_i|_{u_i \cap u_j}$

$(q\; \{f_k\})_{i j} = f_j|_{u_i \cap u_j}$

Our matching condition can be expressed in the language of category theory by saying that the following diagram is an equalizer of $p$ and $q$ (the two parallel arrows):

$P u \xrightarrow{e} \prod_i P u_i \rightrightarrows \prod_{i j} P (u_i \cap u_j)$

Here $e$ is defined as mapping a function $f \in P u$ to a tuple of its restrictions $\{ f|{u_i}\}$ . These restrictions are then required to match when further restricted by $p$ and $q$ to all possible intersections.

A presheaf $P$ is called a sheaf if, for every open covering $\{u_i\}$ , a matching family $\{f_i\}$ uniquely determines the element of $P u$ of the equalizer above. This element corresponds to the function $f$ that is the result of stitching of individual functions.

Notice that, even though we tried to use the categorical language as much as possible, we still had to rely on the language of sets to define coverings. To abstract away from set theory and traditional topology, we need to talk about sites.

Next: Coverages and Sites .

August 7, 2024

Presheaves and Topology

Posted by Bartosz Milewski under Category Theory, Topology | Tags: Category Theory, Topology |
[7] Comments

Previously: Topology as a Dietary Choice.

Category theory lets us change the focus from individual objects to relationships between them. Since topology is defined using open sets, we’d start by concentrating on relations between sets.

One such obvious relation is inclusion. It imposes a categorical structure on the subsets of a given set $X$ . We draw arrows between two sets whenever one is a subset of the other. These arrows satisfy the axioms of a category: there is an identity arrow for every object (every set is its own subset) and arrows compose (inclusion is transitive). Not every pair of objects is connected by an arrow–some sets are disjoint, others overlap only partially. We may include the whole space as the terminal object (with arrows coming from every subset) and the empty set $\emptyset$ as the initial object (with arrows going to every set). As categories go, this is a thin category, because there is at most one arrow between any two objects.

Every topological space gives thus rise to a thin category that abstracts the structure of its open sets. But the real reason for defining a topology is to be able to talk about continuous functions. These are functions between topological spaces such that the inverse image of every open set is open. Here, again, category theory tells us not to think about the details of how these functions are defined, but rather what we can do with them. And not just one function at a time, but the whole bunch at once.

So let’s talk about sets of functions. We have our topological space $X$ , and to each open subset $u$ we will assign a set of continuous function on it. These could be functions to real or complex numbers, or whatever–we don’t care. All we care about is that they form a set.

Since open sets in $X$ form a (thin) category, we are talking about assigning to each object (open set) $u$ its own set (of continuous functions) $P u$ . Notice however that these sets of functions are not independent of each other. If one open set is a subset of another, it inherits all the functions defined over the larger set. These are the same functions, the only difference being that their arguments are restricted to a smaller subset. For instance, given two sets $v \subseteq u$ and a function $f \colon u \to \mathbb R$ , there is a function $f|_{v} \colon v \to \mathbb R$ such that $f|_{v} = f$ on $v$ .

Let’s restate these statements in the categorical language. We already have a category $X$ of open sets with inclusion. The sets of functions on these open sets are objects in the category $\mathbf{Set}$ . We have defined a mapping $P$ between these two categories that assigns sets of functions to open sets.

Notice that we are dealing with two different categories whose objects are sets. One has inclusions as arrows, the other has functions as arrows. (To confuse matters even more, the objects in the second category represent sets of functions.)

To define a functor between categories, we also need a mapping of arrows to accompany the mapping of objects. An arrow $v \to u$ means that $v \subseteq u$ . Corresponding to it, we have a function $P u \to P v$ that assigns to each $f \in P u$ its restriction $f|_{v} \in P v$ .

Together, these mappings define a functor $P \colon X^{op} \to \mathbf{Set}$ . The “op” notation means that the directions of arrows are reversed: the functor is “contravariant.”

A functor must preserve the structure of a category, that is identity and composition. In our case this follows from the fact that an identity $u \subseteq u$ maps to a trivial do-nothing restriction, and that restrictions compose: $(f|_v)|_w = f|_w$ for $w \subseteq v \subseteq u$ .

There is a special name for contravariant functors from any category $\mathcal C$ to $\mathbf{Set}$ . They are called presheaves, exactly because they were first introduced in the context of topology as precursors of “shaves.” Consequently, the simpler functors $\mathcal C \to \mathbf{Set}$ had to be confusingly called co-presheaves.

Presheaves on $\mathcal C$ form their own category, often denoted by $\hat{\mathcal C}$ , with natural transformations as arrows.

Next: Sheaves and Topology.

July 12, 2024

Topology as a Dietary Choice

Posted by Bartosz Milewski under Topology | Tags: Topology |
[2] Comments

What is Topology?

When talking about topology, people draw cups with handles turning into donuts. When I think of topology, I see nutritious food.

In mathematics, topology is defined as a family of subsets of some space $X$ . We call these subsets open. Open sets are like meaty, skinless fruits.

Watermelon

For instance, in standard topology, the inside of a ball in 3-d is considered meaty. Contrast this with an empty sphere, a curve, or a point–these are skinny when embedded in 3-d–they have no nutritional value.

In one dimension (on a line), the inside of a segment is meaty, but a segment with endpoints is not open, because it has a rind (the endpoints).

These four conditions define a topology.

The intersection of any two open sets is again an open set. This is what I mean by skinlessness. If you included skins, the intersection could end up skinny.
A union of open sets is again open. It’s even more juicy, and no skin can be produced by a union. There is a subtlety there: You can take a union of an arbitrary number of open sets and it’s still open. But you have to be careful with intersections–only finite intersections are allowed. That’s because by intersecting an infinite number of open sets you could end up with something very skinny–like a single point.
The whole space $X$ is open. In a sense, it defines what it means to be juicy and it doesn’t have a skin because it has no contact with the outside–it is its own Universe.
As usual, an empty set is an odd item. Even though it’s empty, it’s considered open. You may think of it as a union of zero open sets.

There are some extreme topologies, like the discrete topology in which all subsets are open (even individual points), and a trivial (indiscrete) topology where only the whole space and the empty set are open. But most topologies are reasonable and adhere to our intuitions. So let’s not worry about pathologies.

Continuity

Consider a function from one topological space $X$ to another topological space $Y$ . Intuitively, a function is continuous if it doesn’t make sudden jumps. So naively you might think that a continuous function would map open sets to open sets. But that’s not true. For instance a constant function maps any open set to a point which, in most topologies, is not open.

In fact any time a function stalls, or makes a turnaround (like the function $y = x^2$ at $x = 0$ ) you get a skinny point in its image.

The correct definition goes in the opposite direction: a function is continuous if and only if the pre-image of every open set is open.

First of all, a function cannot stall or turn around in the $x$ direction, since that would mean mapping one point to many.

Secondly, if a function makes a jump at some point $x$ , it’s possible to surround $f(x)$ with a small open set whose counter-image contains $x$ as its boundary.

It’s also possible to define a continuous function as a pair of functions. One function $f$ is the usual mapping of points from $X$ to $Y$ . The other function $g$ maps open sets in $Y$ to open sets in $X$ . The pair $(f, g)$ defines a continuous function if for all points $x \in X$ and open sets $O$ in $Y$ we have the following equivalence:

$f(x) \in O \iff x \in g(O)$

The left-hand side tells us that $x$ is in the pre-image of $O$ under $f$ . The right-hand side tells us that $g$ maps $O$ to this preimage. This formula looks a bit like an adjunction between $f$ and $g$ . It’s an example of a more general notion of Chu constructions.

Finally, the cups and donuts magic trick uses invertible continuous functions called homeomorphisms to deform shapes without tearing or gluing them.

Next: Presheaves and Topology.

March 24, 2024

Neural Networks, Pre-lenses, and Triple Tambara Modules, Part II

Posted by Bartosz Milewski under Category Theory, Lens, Neural Networks, Programming | Tags: AI, Category Theory, Lens, Neural Networks, Optics, Profunctors, Tambara Modules |
1 Comment

I will now provide the categorical foundation of the Haskell implementation from the previous post. A PDF version that contains both parts is also available.

The Para Construction

There’s been a lot of interest in categorical foundations of deep learning. The basic idea is that of a parametric category, in which morphisms are parameterized by objects from a monoidal category $\mathcal P$ :

Screenshot 2024-03-24 at 15.00.20
Here, $p$ is an object of $\mathcal P$ .

When two such morphisms are composed, the result is parameterized by the tensor product of the parameters.

Screenshot 2024-03-24 at 15.00.34

An identity morphism is parameterized by the monoidal unit $I$ .

If the monoidal category $\mathcal P$ is not strict, the parametric composition and identity laws are not strict either. They are satisfied up to associators and unitors of $\mathcal P$ . A category with lax composition and identity laws is called a bicategory. The 2-cells in a parametric bicategory are called reparameterizations.

Of particular interest are parameterized bicategories that are built on top of actegories. An actegory $\mathcal C$ is a category in which we define an action of a monoidal category $\mathcal P$ :

$\bullet \colon \mathcal P \times \mathcal C \to C$

satisfying some obvious coherency conditions (unit and composition):

$I \bullet c \cong c$

$p \bullet (q \bullet c) \cong (p \otimes q) \bullet c$

There are two basic constructions of a parametric category on top of an actegory called $\mathbf{Para}$ and $\mathbf{coPara}$ . The first constructs parametric morphisms from $a$ to $b$ as $f_p = p \bullet a \to b$ , and the second as $g_p = a \to p \bullet b$ .

Parametric Optics

The $\mathbf{Para}$ construction can be extended to optics, where we’re dealing with pairs of objects from the underlying category (or categories, in the case of mixed optics). The parameterized optic is defined as the following coend:

$O \langle a, da \rangle \langle p, dp \rangle \langle s, ds \rangle = \int^{m} \mathcal C (p \bullet s, m \bullet a) \times \mathcal C (m \bullet da, dp \bullet ds)$

where the residues $m$ are objects of some monoidal category $\mathcal M$ , and the parameters $\langle p, dp \rangle$ come from another monoidal category $\mathcal P$ .

In Haskell, this is exactly the existential lens:

data ExLens a da p dp s ds = 
  forall m . ExLens ((p, s)  -> (m, a))  
                    ((m, da) -> (dp, ds))

There is, however, a more general bicategory of pre-optics, which underlies existential optics. In it, both the parameters and the residues are treated symmetrically.

The PreLens Bicategory

Pre-optics break the feedback loop in which the residues from the forward pass are fed back to the backward pass. We get the following formula:

$\begin{aligned}O & \langle a, da \rangle \langle m, dm \rangle \langle p, dp \rangle \langle s, ds \rangle = \\ &\mathcal C (p \bullet s, m \bullet a) \times \mathcal C (dm \bullet da, dp \bullet ds) \end{aligned}$

We interpret this as a hom-set from a pair of objects $\langle s, ds \rangle$ in $\mathcal C^{op} \times C$ to the pair of objects $\langle a, da \rangle$ also in $\mathcal C^{op} \times C$ , parameterized by a pair $\langle m, dm \rangle$ in $\mathcal M \times \mathcal M^{op}$ and a pair $\langle p, dp \rangle$ from $\mathcal P^{op} \times \mathcal P$ .

To simplify notation, I’ll use the bold $\mathbf C$ for the category $\mathcal C^{op} \times \mathcal C$ , and bold letters for pairs of objects and (twisted) pairs of morphisms. For instance, $\bold f \colon \bold a \to \bold b$ is a member of the hom-set $\mathbf C (\bold a, \bold b)$ represented by a pair $\langle f \colon a' \to a, g \colon b \to b' \rangle$ .

Similarly, I’ll use the notation $\bold m \bullet \bold a$ to denote the monoidal action of $\mathcal M^{op} \times \mathcal M$ on $\mathcal C^{op} \times \mathcal C$ :

$\langle m, dm \rangle \bullet \langle a, da \rangle = \langle m \bullet a, dm \bullet da \rangle$

and the analogous action of $\mathcal P^{op} \times \mathcal P$ .

In this notation, the pre-optic can be simply written as:

$O\; \bold a\, \bold m\, \bold p\, \bold s = \bold C (\bold m \bullet \bold a, \bold p \bullet \bold b)$

and an individual morphism as a triple:

$(\bold m, \bold p, \bold f \colon \bold m \bullet \bold a \to \bold p \bullet \bold b)$

Pre-optics form hom-sets in the $\mathbf{PreLens}$ bicategory. The composition is a mapping:

$\mathbf C (\bold m \bullet \bold b, \bold p \bullet \bold c) \times \mathbf C (\bold n \bullet \bold a, \bold q \bullet \bold b) \to \mathbf C (\bold (\bold m \otimes \bold n) \bullet \bold a, (\bold q \otimes \bold p) \bullet \bold c)$

Indeed, since both monoidal actions are functorial, we can lift the first morphism by $(\bold q \bullet -)$ and the second by $(\bold m \bullet -)$ :

$\mathbf C (\bold m \bullet \bold b, \bold p \bullet \bold c) \times \mathbf C (\bold n \bullet \bold a, \bold q \bullet \bold b) \xrightarrow{(\bold q \bullet) \times (\bold m \bullet)}$

$\mathbf C (\bold q \bullet \bold m \bullet \bold b, \bold q \bullet \bold p \bullet \bold c) \times \mathbf C (\bold m \bullet \bold n \bullet \bold a,\bold m \bullet \bold q \bullet \bold b)$

We can compose these hom-sets in $\mathbf C$ , as long as the two monoidal actions commute, that is, if we have:

$\bold q \bullet \bold m \bullet \bold b \to \bold m \bullet \bold q \bullet \bold b$

for all $\bold q$ , $\bold m$ , and $\bold b$ .
The identity morphism is a triple:

$(\bold 1, \bold 1, \bold{id} )$

parameterized by the unit objects in the monoidal categories $\mathbf M$ and $\mathbf P$ . Associativity and identity laws are satisfied modulo the associators and the unitors.

If the underlying category $\mathcal C$ is monoidal, the $\mathbf{PreOp}$ bicategory is also monoidal, with the obvious point-wise parallel composition of pre-optics.

Triple Tambara Modules

A triple Tambara module is a functor:

$T \colon \mathbf M^{op} \times \mathbf P \times \mathbf C \to \mathbf{Set}$

equipped with two families of natural transformations:

$\alpha \colon T \, \bold m \, \bold p \, \bold a \to T \, (\bold n \otimes \bold m) \, \bold p \, (\bold n \bullet a)$

$\beta \colon T \, \bold m \, \bold p \, (\bold r \bullet \bold a) \to T \, \bold m \, (\bold p \otimes \bold r) \, \bold a$

and some coherence conditions. For instance, the two paths from $T \, \bold m \, \bold p\, (\bold r \bullet \bold a)$ to $T \, (\bold n \otimes \bold m)\, (\bold p \otimes \bold r) \, (\bold n \bullet \bold a)$ must give the same result.

One can also define natural transformations between such functors that preserve the two structures, and define a bicategory of triple Tambara modules $\mathbf{TriTamb}$ .

As a special case, if we chose the category $\mathcal P$ to be the trivial one-object monoidal category, we get a version of (double-) Tambara modules. If we then take the coend, $P \langle a, b \rangle = \int^m T \langle m, m\rangle \langle a, b \rangle$ , we get regular Tambara modules.

Pre-optics themselves are an example of a triple Tambara representation. Indeed, for any fixed $\bold a$ , we can define a mapping $\alpha$ from the triple:

$(\bold m, \bold p, \bold f \colon \bold m \bullet \bold a \to \bold p \bullet \bold b)$

to the triple:

$(\bold n \otimes \bold m, \bold p, \bold f' \colon (\bold n \otimes \bold m) \bullet \bold a \to \bold p \bullet (\bold n \bullet \bold b))$

by lifting of $\bold f$ by $(\bold n \bullet -)$ and rearranging the actions using their commutativity.
Similarly for $\beta$ , we map:

$(\bold m, \bold p, \bold f \colon \bold m \bullet \bold a \to \bold p \bullet (\bold r \bullet \bold b))$

to:

$(\bold m , (\bold p \otimes \bold r), \bold f' \colon \bold m \bullet \bold a \to (\bold p \otimes \bold r) \bullet \bold b)$

Tambara Representation

The main result is that morphisms in $\mathbf {PreOp}$ can be expressed using triple Tambara modules. An optic:

$(\bold m, \bold p, \bold f \colon \bold m \bullet \bold a \to \bold p \bullet \bold b)$

is equivalent to a triple end:

$\int_{\bold r \colon \mathbf P} \int_{\bold n \colon \mathbf M} \int_{T \colon \mathbf{TriTamb}} \mathbf{Set} \big(T \, \bold n \, \bold r \, \bold a, T \, (\bold m \otimes \bold n) \, (\bold r \otimes \bold p) \, \bold b \big)$

Indeed, since pre-optics are themselves triple Tambara modules, we can apply the polymorphic mapping of Tambara modules to the identity optic $(\bold 1, \bold 1, \bold{id} )$ and get an arbitrary pre-optic.

Conversely, given an optic:

$(\bold m, \bold p, \bold f \colon \bold m \bullet \bold a \to \bold p \bullet \bold b)$

we can construct the polymorphic mapping of triple Tambara modules:

$\begin{aligned} & T \, \bold n \, \bold r \, \bold a \xrightarrow{\alpha} T \, (\bold m \otimes \bold n) \, \bold r \, (\bold m \bullet \bold a) \xrightarrow{T \, \bold f} T \, (\bold m \otimes \bold n) \, \bold r \, (\bold p \bullet \bold b) \xrightarrow{\beta} \\ & T \, (\bold m \otimes \bold n) \, (\bold r \otimes \bold p) \, \bold b \end{aligned}$

Bibliography

Brendan Fong, Michael Johnson, Lenses and Learners,
Brendan Fong, David Spivak, Rémy Tuyéras, Backprop as Functor: A compositional perspective on supervised learning, 34th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS) 2019, pp. 1-13, 2019.
G.S.H. Cruttwell, Bruno Gavranović, Neil Ghani, Paul Wilson, Fabio Zanasi, Categorical Foundations of Gradient-Based Learning
Bruno Gavranović, Compositional Deep Learning
Bruno Gavranović, Fundamental Components of Deep Learning, PhD Thesis. 2024

Bartosz Milewski's Programming Cafe

Understanding Attention in LLMs

Legalizing Comonad Composition

Sieves and Sheaves

Covering Sieves

Sheaves as Virtual Objects

Coverages and Sites

Sheaves and Topology

Presheaves and Topology

Topology as a Dietary Choice

What is Topology?

Continuity

Neural Networks, Pre-lenses, and Triple Tambara Modules, Part II

The Para Construction

Parametric Optics

The PreLens Bicategory

Triple Tambara Modules

Tambara Representation

Bibliography

Top Posts

License

Blogroll

Follow Me

Archives