Just random scribblings.  This is pretty standard stuff, but I worked this out for someone recently, so I’m typing it out.

If you talk to a category theorist and ask for a definition of a monad, it looks something like this (taken from Wikipedia): a monad is a triple $(F, \eta, \mu)$ consisting of an endofunctor $F$, and natural transformations $\eta : 1 \to F$ and $\mu : F^2 \to F$, satisfying the following laws:

• $\mu \circ F \mu = \mu \circ \mu F$
• $\mu \circ F \eta = \mu \circ \eta F = 1_F$

Enough mumbo-jumbo.  It’s worth taking a second to think about what that means.  The two natural transformations just give us a single way to treat any of the infinite family of objects $F^n X$ (for any integer $n \geq 0$) as just $F X$.  We can flatten them, in essence.  If there are too few $F$‘s, you apply $\eta$.  If there are too many, you apply $\mu$.  The monad laws above just guarantee that it doesn’t matter at what level (number of applictaions of the functor) you work; you’ll get the same result.

It’s definitely time for an example.  Suppose you have a fixed monoid $M$.  Now, in the category $\mathbf{Set}$, there is a functor $F: X \mapsto X \times M$, which just takes any set to the cartesian product of itself and the monoid – i.e., $F X$ consists of values from the underlying set, plus a “side value” from the monoid.  We can also define natural transformations $\eta(x) = (x,1_M)$, and $\mu((x, a), b) = (x, a \cdot b)$ where multiplication is taken from the monoid structure.  In other words, if you don’t give me a side value (that’s $F^0 X$), then I’ll take the side value to be the identity from the monoid.  If you give me several side values in a given order (that’s $F^n X, n > 1$) I’ll multiply them in the monoid to decide what the side value is.

Verifying the monad laws, stated above, is fairly straightforward.  Really, the tricky part is just to understand the meaning of the pieces.  Generally speaking, for a natural transformation $\tau$, think of $F \tau$ as applying the transformation on the left hand side of a tuple.  Then think of $\tau F$ as being the specialization of $\tau$ to values of type $F X$.  Here’s what happens:

• If $\mu : ((x,a),b) \mapsto (x, a \cdot b)$, then $F \mu : (((x,a),b),c) \mapsto ((x, a \cdot b), c)$.  Then $\mu \circ F \mu : (((x,a),b), c) \mapsto (x, a \cdot b \cdot c)$.
• If $\mu : ((x,a),b) \mapsto (x, a \cdot b)$, then $\mu F : (((x,a),b),c) \mapsto ((x, a), b \cdot c)$.  Then $\mu \circ \mu F : (((x,a),b),c) \mapsto (x, a \cdot b \cdot c)$.  By associativity in the monoid, this is the same as the above, satisfying the first law.
• If $\eta : x \mapsto (x, 1_M)$, then $F \eta : (x, a) \mapsto ((x, 1_M), a)$ and $\eta F : (x,a) \mapsto ((x,a), 1_M)$.  Verifying the second monad law is left as an exercise (just write out what the function composition means, and apply $\mu$).

In Haskell, of course, we prefer to think of monads differently, in terms of a type constructor and two functions “return” and “(>>=)”, satisfying the following laws:

return a >>= f           = f a
m >>= return             = m
(m >>= f) >>= g          = m >>= (\x -> f x >>= g)

The “return” function takes any value, and wraps it in the monad.  The second operation, often read as “bind”, is a little trickier.  It takes one value wrapped in the monad, and function from a plain value of the first type to a second value wrapped in the monad, and produces the second value wrapped in the monad.  Here are a few examples:

1. Containers can be seen as monads.  In this case, if I have a crate of thingamabobs, and each of those thingamabobs can be unpacked into a crate of whatsits, then I may as well (by unpacking the thingamabobs as needed) have a very large crate full of whatsits.
2. Famously, I/O actions are monads.  If I have an I/O action that produces a character string, and for any character string, I could get an I/O action that produces a number… then I can string them together into a single interaction producing a number.  (The resulting I/O action will get the character string, compute the second action, and perform it to get the number, all in one go.)

A lot of this corresponds exactly to the mathematical definition above.  The type constructor corresponds with an endofunctor (in the category of Haskell types), and the “return” function with the natural transformation $\eta$.  That is, “return” takes a value of any possible type T, and turns it into a value of type M T (M being our monad type constructor).  The idea of a natural transformation is partially captured by polymorphic functions in Haskell.  (Partially because we’d also need to verify that the tranformation commutes with morphisms in the category; i.e., that “return (f x)” is the same as “fmap f (return x)” for all values of f and x.)

It turns out that $\mu$ is actually the standard library operation called “join”, which is defined in terms of the above, as follows:

join a = a >>= id

In other words, if I have a value of type (M (M T)), then I can turn it into a value of type (M T), by grabbing the encapsulated value of type (M T) and, well, doing nothing at all with it.

There’s actually one more piece we’re missing.  In mathematical language, a functor simultaneously maps objects to other objects in a category, and also morphisms to other morphisms.  So to recover a functor from the Haskell definition of a monad, we also need to decide how to map functions from type X to type Y into other functions from type M X into type M Y.  It turns out we can recover this from the two operations, return and bind, that are standard for monads.

fmap f x = x >>= (return . f)

In other words, if I have a function $f : A \to B$, then I can easily obtain a function $M f : M A \to M B$ as follows.  Use the bind operation to grab the encapsulated value, apply the function f to it, and then wrap it back up again with the return operation.

In the other direction, given an appropriate “fmap” and “join”, the bind operation is easy to implement as well:

a >>= f = join (fmap f a)

Basically, the idea here is that because M is a functor, a function $f : A \to M B$ can be turned into a function $M f : M A \to M (M B)$.  Then one can just apply the join ( $\mu$) operation on the result to obtain the desired result.