Building AWS Lambda Functions with Clojure

So this is a quick tutorial on getting up and running on AWS Lambda in Clojure. It took me a while to get things running correctly so I’m just going to document what I did here.

Let’s say we want to set up a RESTful service called adder. Setup a new Leiningen project using the template lein-clojure-lambda-template. It’s deployed to Clojars so you can just run the command as is:

$ lein new aws-lambda-serverless adder
$ cd adder && tree
├── project.clj
├── src
│   └── adder
│       └── core.clj
└── test
    └── adder
        └── core_test.clj

4 directories, 4 files

Change directory into it and you’ll see a directory structure like the above, pretty standard stuff. Inside src/adder/core.clj is the main stuff you want to focus on.

(ns adder.core
   :methods [^:static [handler [java.util.Map] String]]))

(defprotocol ConvertibleToClojure
  (->cljmap [o]))

(extend-protocol ConvertibleToClojure
  (->cljmap [o] (let [entries (.entrySet o)]
                (reduce (fn [m [^String k v]]
                          (assoc m (keyword k) (->cljmap v)))
                        {} entries)))

  (->cljmap [o] (vec (map ->cljmap o)))

  (->cljmap [o] o)

  (->cljmap [_] nil))

(defn -handler [s]
  (println (->cljmap s))
  (println "Hello World!"))

Here are the main points:

  • In setting the namespace a :gen-class directive is supplied to generate a bunch of class files adder/core*.class so that we can generate named classes that can be called in Java. I won’t go into it here but checkout the docs Ahead-of-time Compilation and Class Generation.
  • The :methods [^:static [handler [java.util.Map] String]])) line defines a static method handler which takes in a parameter of type java.util.Map and returns a value of type String. The ^ inside ^:static is called a metadata marker and in this instance sets :static to true.
  • A protocol, ->cljmap, has been defined which takes in one parameter o. All it does is this, if o is a Java object, it’ll be converted to the corresponding Clojure object.
  • The function definition for -handler. Everything above is boilerplate, this is the only thing you need to modify. (->cljmap s) is the payload.

To implemenet the handler we would just write something like this. Inputs generally come in json format,

;; Ex. {"input": [1, 2, 3, 4]}
(defn adder [nums]
  (reduce + nums))
(defn -handler [s]
  (adder (:input (->cljmap s) #",")))

Say you want to be able to handle multiple inputs. You would need to modify the method definition too.

;; Ex. {"input": [[1, 2, 3, 4],[5, 6, 7, 8]]}
:methods [^:static [handler [java.util.Map] clojure.lang.LazySeq]]))
(defn adder [nums]
  (reduce + nums))
(defn -handler [s]
  (map (adder (:input (->cljmap s) #","))))

And that should be enough to get you anywhere you want to go. Now to deploy it. Make sure you’ve already installed and setup the AWS CLI.

$ pip3 install awscli
aws configure

Create your uploadable binary

$ lein uberjar

Create the Lambda function. You’ll need to create a role that your Lambda can assume to have the proper permissions.

AWS_ACCOUNT_ID=XXXXXXXXXX # Should be some number
ROLENAME=XXXXXX # A string name

aws lambda create-function \
    --function-name $FUNCTION_NAME \
    --handler $HANDLER \
    --runtime java8 \
    --memory $MEMORY \
    --timeout $TIMEOUT \
    --role arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE \
    --zip-file fileb://target/adder-$VERSION-SNAPSHOT-standalone.jar

Now your Clojure code is on Lambda. To test whether it’s working

aws lambda invoke --function-name adder --payload "{\"input\": [[1, 2, 3, 4]]}" outfile.txt

You should now have the output saved in outfile.txt. But invoking the Lambda directly isn’t ideal. AWS has something called API Gateway which can be connected to, creating a REST endpoint.


# Create REST API: Regional endpoint with API key enabled
aws apigateway create-rest-api \
    --name adder-api \
    --endpoint-configuration types=REGIONAL \
    --api-key-source HEADER
# Get the ID of the created API (assumes you don't have any other API gateways)
REST_API_ID=$(aws apigateway get-rest-apis | jq ".items[0].id" | tr -d '"')
# Get the default resource ID of the created API
RESOURCE_ID=$(aws apigateway get-resources --rest-api-id $REST_API_ID | jq ".items[0].id" | tr -d '"')

# To get URI:
# Attach our Lambda to the above method
aws apigateway put-integration \
    --rest-api-id $REST_API_ID \
    --resource-id $RESOURCE_ID \
    --type AWS
    --http-method POST \
    --integration-http-method POST \
    --authorization-type NONE \
    --uri $URI

# Deploy to a `dev` stage
aws apigateway create-deployment \
    --rest-api-id $REST_API_ID \
    --stage-name dev

Above we create a REST API that’s open to the world. To add an X-API-KEY header for authentication:


# Create an API key
aws apigateway create-api-key --name adder-api-key
KEY_ID=$(aws apigateway get-key-ids | jq ".items[0].id" | tr -d '"')

# Create a usage plan and connect it to the deployed API above
aws apigateway create-usage-plan \
    --name basic-usage-plan \
    --api-stages apiId="$REST_API_ID",stage="dev"
USAGE_PLAN_ID=$(aws apigateway get-usage-plans | jq ".items[0].id" | tr -d '"')

# Connect the created API key to the usage plan
aws apigateway create-usage-plan-key \
    --key-id $KEY_ID \
    --key-type "API_KEY" \
    --usage-plan-id $USAGE_PLAN_ID \
 # Get the API key value to put in your request header
 API_KEY=$(aws apigateway get-api-key --api-key $KEY_ID --include-value | jq ".value" | tr -d '"')

 echo "Your API Key: $API_KEY"

Honestly, setting up the API Gateway via command line is messy, doing it from the console is much nicer in my opinion. But anyways. To test that everythings working send a post request to https://{restapi_id}.execute-api.{region}{stage_name}/ with the API key as a header. In Python this would be:

import requests

rest_api_id = ...
region = ...
stage_name = "dev"

url = f"https://{rest_api_id}.execute-api.{region}{stage_name}/"
payload = {"input": [[1, 2, 3, 4]]}
headers = {
    "content-type": "application/json",
    "x-api-key": XXXXXXXXXXXXXXX

response, data=json.dumps(payload), headers=headers)

That’s it, you’re done. By the way if you want to update your function, you would just run something like:


lein uberjar
aws lambda update-function-code \
    --function-name $FUNCTION_NAME \
    --zip-file fileb://target/adder-$VERSION-SNAPSHOT-standalone.jar

Just to close off, this was what worked well for me, it was simple to setup and is relatively easy to maintain. You may find better mileage elsewhere. A nice option to consider if you’re willing to put a bit of time into learning it is the Serverless framework which provides a lot more in terms of automation, features and has a ton of open source support (26,000+ GitHub stars!).

Unit testing in Python with Docker Containers

I’m working on a new Python package called diehard and was looking for a way to run unit tests across multiple Python environments. The standard approach seems to be to use tox. My main issue with it is that you need the Python versions being tested already installed to run tests and that’s a hassle if you’re working on multiple machines. As a result any sort of testing uses a local build, which to me, can lead to possible, annoying issues with dependencies and versioning. I think it’s much simpler and cleaner to just throw everything into a Dockerfile so the environment gets built to specification for everyone. This way you at least get hard guarantees that things actually work in the environment you’re testing with.

FROM ubuntu:16.04

RUN apt-get update && \
    apt-get -y install software-properties-common \
                       python-software-properties && \
    add-apt-repository -y ppa:deadsnakes/ppa && apt-get update && \
    apt-get autoclean

RUN apt-get -y install \
    python2.7 python2.7-dev \
    python3.4 python3.4-dev \
    python3.5 python3.5-dev \
    python3.6 python3.6-dev && \
    apt-get autoclean

RUN apt-get install wget && \
    wget && \
    python2.7 && \
    python3.4 && \
    python3.5 && \
    python3.6 && \
    rm && \
    apt-get autoclean

The test environment was split into two Dockerfiles, this is the first layer. Basically, the OS gets loaded and because Ubuntu 16.04 doesn’t support Python natively, you gotta get it from a PPA.

Note: As you can see, Ubuntu 16.04 was used, but there was no design decision around that. I just happened to have used it in the past and was comfortable with it … I guess it’s also a rather popular Ubuntu version.

Next we have the Dockerfile for my package.

FROM eltonlaw/pybase

COPY ./requirements /diehard/requirements
RUN pip2.7 install -r /diehard/requirements/dev.txt && \
    pip3.4 install -r /diehard/requirements/dev.txt && \
    pip3.5 install -r /diehard/requirements/dev.txt && \
    pip3.6 install -r /diehard/requirements/dev.txt

COPY ./ ./setup.cfg ./pytest.ini ./README.rst /diehard/
COPY ./docs /diehard/docs
COPY ./tests/ /diehard/tests
COPY ./diehard /diehard/diehard
WORKDIR /diehard
RUN pip2.7 install -e . && \
    pip3.4 install -e . && \
    pip3.5 install -e . && \
    pip3.6 install -e .

CMD ["python3.6", "-m", "pytest"]

The FROM eltonlaw/pybase means that this stacks on top of the eltonlaw/pybase image (I named the image built from the Dockerfile before pybase and uploaded it to Docker Hub under my account). Now we can go about the regular setup for each version being used. Install the core stuff: cpython interpreter, pip, setuptools. Then pip install dependencies. The next few lines are just copying over files. It might seem weird to copy over so many files manually and it is but it’s partially because I didn’t want to copy everything over and partially because every time something changes in a file, the Dockerfile reruns from that line onward. So if we’ve already built the image once and cached everything, if we say changed a file in ./docs, the COPY ./ ... line would use cache but everything from COPY ./docs /diehard/docs and onwards would rerun from scratch. Splitting things up into multiple COPY’s makes it very slightly more efficient (assuming you’re not changing your and etc. often), and it’s just my subjective, preferred way of doing things. WORKDIR, unsurprisingly, changes the working directory to /diehard. Next the RUN pip2.7 install ... lines installs the package in each python environment. Lastly, the CMD ... line just specifies the default command passed when you run the image. I’ve set it to run pytest with the python3.6 environment.

Okay the environments setup, this is how to run tests. I made a tiny Makefile like this:

	pip install -r requirements.txt

	docker pull eltonlaw/pybase
	docker build -t diehard .
	docker run diehard python2.7 -m pytest
	docker run diehard python3.4 -m pytest
	docker run diehard python3.5 -m pytest
	docker run diehard python3.6 -m pytest

So all you have to do is run make test and your tests run on all environments.

By the way, the reason that the test environment is split like this is because originally, with everything in one Dockerfile, the Travis build took 4 minutes. The installation of Python versions is pretty much constant so pulling a built image speeds things up a lot. Great, things work locally now. But I wanted to integrate this with TravisCI so I did some Googling and modified the .travis.yml into this.

sudo: required

language: python

    - docker

    - docker pull eltonlaw/pybase

    make test

A High Level Explanation of TensorFlow

At the time of writing the current version of TensorFlow is 1.1.0

Why TensorFlow?

The first thought that comes to mind is that it’s backed by Google, meaning there’s probably a lot of smart people leading the project who are a kind of filthy rich. TensorFlow was built to support very flexible ideas that could be quickly put into product, so code used for research and prototyping can quickly be made production ready. Just the fact that this is used for Google’s machine learning is a good sign, they handle huge amounts of traffic so it’s definitely battle tested. TensorFlow supports many platforms (CPU, TPU, GPU, Android, iOS, Raspberry Pi), which can lead to some very interesting use cases. TensorFlow’s open source community is thriving and adoption/support is very high, at the time I’m writing this, the TensorFlow GitHub Repo has roughly 60,000 stars! Lastly, some of the tools that TensorFlow has available are amazing such as the beautiful TensorBoard - oh boy, I love, love, love it.

Great Ideas

TensorFlow is built on a foundation of really great ideas, for the full schtick I highly recommend the TensorFlow White Papers. For the distilled (and possibly wrongly interpreted) version, read on.

Dataflow Graph


A TensorFlow program can be described as a data-flow graph. All our nodes are linked together through their inputs/dependencies. Take a look at the image above, another way to think about this is that you have this A node at the beginning and that could be your data, which is in the form of an n-dimensional array, or what they call a tensor. These tensors(or data) flow along the graph edges, where the nodes are some sort of operation on the input. Hence, the name TensorFlow. As an example, the image above might’ve been the result of a TensorFlow program such as the following:

B = tf.nn.tanh(A)
C = tf.random_normal(tf.shape(tf.transpose(A))
G = 0.2
D = tf.ones_like(B) * G
E = tf.matmul(B, C) + D
F = tf.nn.softmax_cross_entropy_with_logits(E) 

Symbolic Computation

So the picture in your head right now is a graph right? Keep that in mind, now as you add operations and variables and stuff, this graph doesn’t get computed right away. Rather, you have to create a Session object which is a sort of environment that TensorFlow provides to evaluate nodes in a graph. TensorFlow programs are split into a construction phase that assembles a graph and an execution phase that runs the graph.

# Step 1 - Construction Phase
a = tf.constant([[1., 2.], [3., 4.])
b = tf.constant([[1., 2.], [3., 4.])
c = tf.add(a, b)
# Step 2 - Execution Phase
with tf.Session() as sess:
    print(c) # <tf.Tensor 'Add:0' shape=(2, 2) dtype=float32>
    print( # [[ 2.,  4.],[ 6.,  8.]]
    print(c.eval() # [[ 2.,  4.],[ 6.,  8.]]

Everything in TensorFlow is a Tensor. By printing without first evaluating we can see that, c is a symbolic representation of the output of an Operation, it doesn’t mean anything until we actually evaluate it in an environment(Session) and get whatever it’s supposed to return. Rather than holding a value, c, it holds the steps required to compute what we’ve defined as c, which is why when we print it, an Add:0 operation is displayed. TensorFlow is stating that it’s going to add it’s inputs together later. (Note: the 0 after Add: represents the device on which this tensor will be computed, more on this later).

If you’ve played around with bot TensorFlow and NumPy you’ll notice that TensorFlow mirrors a lot of NumPy’s functionality; they share methods such as creating ndarrays of zeros and ones, reshaping, matrix multiplication etc. It’s because of this symbolic computation idea that everything has been reimplemented. To reiterate, NumPy operations/matrixes are instantiated instantly when created. In TensorFlow, when these operations/matrixes are created they add a new node to the current graph, connected to other nodes by dependencies.

Partial Execution

Having a symbolic programming model has the added benefit of partial execution, you can run any arbitrary subgraph. When you do a, it only computes the required nodes, everything else is left alone.

A = tf.constant([[1., 2.], [3., 4.])
B = tf.constant([[5., 6.], [7., 8.])
C = tf.constant([[9., 0.], [1., 2.])
D = A + B # '+' operator is equivalent to tf.add(x, y)
E = C + D
with tf.Session() as sess: # Only computes A, B and D can actually take a parameter, feed_dict, where you can substitute the inputs of the node you’re computing with some custom input.

A = tf.constant([1, 2, 3, 4])
B = tf.constant([4, 5, 6, 7])
C = tf.constant([0, 1, 2, 3])
D = tf.add(A, B)
E = tf.add(C, D)
with tf.Session() as sess:
    print(, feed_dict={C: [100, 100, 100, 100]}))
    # Prints [105, 107, 109, 111]


TensorFlow makes it very easy to bundle up a trained model so that you can use it elsewhere. Since it supports many platforms, you do something like train on some GPU’s then export it to a Raspberry Pi or Mobile device to perform inferences. Another reason that checkpoints are cool is that you can use them to save the progress of your model, say if you wanted to stop training halfway through and continue tomorrow.

Gradient Computation

TensorFlow can automatically calculate gradients, a much welcome relief, I’ve tried it once for a project and honestly it was brutal. Hand calculating gradients gets messy super fast, and this only gets more complicated as you add more layers and use different activation functions. TensorFlow currently has implementations of many popular optimization algorithms: Gradient Descent, AdaGrad, RMSProp, etc. For the full list checkout tf.train.

cost = tf.nn.log_poisson_loss(y, y_pred)
learning_rate = 0.01
train = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)

Behind the scenes

Single-Device Implementation

In the simplest scenario (single device, single process) the operations are executed sequentially.


Image licensed under Creative Commons by TensorFlow. Link

TensorFlow calculates the number of dependencies for every operation, the number of operations that need to be done before this operation can be performed. Nodes that have no dependencies are note being blocked and thus can be run immediately, so they are added to the queue. Imagine the following calculation:

((3 * 2)+ 6)/4

At the very beginning, the number of dependencies assigned to calculating that first multiplication step is 0, because we’re already given both arguments, 3 and 2. Now let’s look at the addition part of it: (3*2+6). If we wanted to do the addition, we already have 6 but we don’t have the result of the multiplication, so the number of dependencies for this addition operation is 1. Going further, if we wanted to do the division operation, we would be missing both results from the addition and multiplication processes so it’s number of dependencies is 2.

To make this more concrete, here’s a visualization of the calculation:

tf_single_exec_1 tf_single_exec_2 tf_single_exec_3 tf_single_exec_4

Multi-Device Implementation


Image licensed under Creative Commons by TensorFlow. Link

Things get a bit more complicated as you move from single-device to multi-device since two new problems emerge (design decisions might be a better word for it). I’ll highlight the problem first before explaining the TensorFlow implementation. The first problem is how the framework will handle node placement. In the single-device implementation the graph just runs on the same device from start to finish so there’s nothing to optimize. However, as soon as you have more than one device, consideration needs to be put on how to split up the nodes. Imagine you have three devices (A, B & C), which nodes on your computational graph should be put on device A, which on device B and which on device C? In a perfect world, the optimal solution is to calculate the total amount of work of the graph and each of it’s nodes and split it right down the middle, allocating a third of the work to each device. While logical, this thought process is naive and breaks down quickly since we haven’t yet considered blocking and non-blocking operations. As seen in the previous section, many parts of a graph may require as input, output from previous operations. These sequential operations can’t be parallelized. Another assumption held for the previous statements was the idea that our three devices A, B and C are equal power, that’s usually not true in the real world. For example, a computer can have a CPU and a GPU, how should TensorFlow optimize for an imbalanced situation like this?

That leads us directly into the second major issue/design decision, which is very much tied to the previous problem/design decision. How should the framework handle cross-device communication? After all, the whole point of using multiple devices is to speed up computation time. If the cost of cross-device communication was too high, it would be pointless to use multiple devices.

Here’s how TensorFlow tackles these problems:

  1. TensorFlow’s node placement algorithm contains a cost function which contains the input and output of each operation, and an estimate of the computation time for each node in a graph. Using these two metrics: estimated computation time and the estimated cost of device-to-device communication, TensorFlow searches for the optimal device allocation configuration.
  2. Cross device communication is done through send and recv(receive) nodes. Notice in the below image that any edges (lines) crossing across devices gets substituted with a send and recv node. From the 2015 white paper:

“By handling communication in this manner, we also allow the scheduling of individual nodes of the graph on different devices to be decentralized into the workers: the Send and Receive nodes impart the necessary synchronization between different workers and devices, and the master only needs to issue a single Run request per graph execution to each worker that has any nodes for the graph, rather than being involved in the scheduling of every node or every cross-device communication. This makes the system much more scalable and allows much finer-granularity node executions than if the scheduling were forced to be done by the master.” 1


Image licensed under Creative Commons by TensorFlow. Link


global_step refers to the number of batches trained by the graph. The reason why this is explicitly defined is that if you stop training for the day and continue the next day, the local iteration number will start from 1 again, but the global step keeps a record of the total number of iterations.

  1. Abadi et. al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Retrieved May 23, 2017, from 

Optical Character Recognition using Expert Systems

Optical character recognition is solved and expert systems have fallen out of favour, however, I do believe that there is at least some merit to understanding some of this stuff. For one, despite the incredible accuracy percentages that deep learning is achieving, I get the feeling that the current state-of-the-art is missing something fundamental. It’s just a feeling, maybe I’m wrong, but I don’t think it hurts to explore new ideas, or in this case, rehash old ones.

Methods Used


Image convolution is the idea of applying aggregating the values of neighbouring pixels. If you want to apply convolution to an image, you take one pixel, look at the pixels around it; above, below, to the right, to the left, in the four corners and take some combination of these 9 values (9 including the pixel we’re focusing on) as the new value of our pixel. The combination is called a kernel or mask. Apply the kernel to every pixel in an image at the same time and that’s image convolution in a nutshell. I specifically mention at the same time because if you were to apply the kernel one by one, it would mess up the output. If you think about it, after a while you would be mixing mutated values with non-mutated values. Anyways, check out this wikipedia article for common kernels/masks and their effects on an image.

Zhang-Suen Thinning

The Zhang-Suen Thinning algorithm thins an image by looping through every pixel in an image and identifying pixels that are redundant to the structure of an image. There are some specific conditions it uses, but the main idea is that if a pixel is black and there are of black pixels around it, you can change this pixel to white without erasing any structure. To see how it works check the rosetta code implementation here.

Canny Edge Detection

The Canny Edge Detector has multiple stages:

  1. Noise reduction using Gaussian Kernel
  2. Calculate gradient using Sobel Kernel
  3. We know that the gradient direction is normal to edges so armed with the gradient magnitude and direction from part 2, we suppress all pixels that aren’t local maximums
  4. Hysteresis Thresholding. We define threshold values maxVal and minVal. Anything above maxVal is definitely an edge, anything below is definitely not an edge, anything in between is iffy, could be an edge or a non-edge. See More

Otsu Thresholding

Image thresholding algorithm to reduce a grayscale image into a binary image. Otsu’s algorithms looks for the threshold value (0-255) that minimizes the weighted within-class variance. See More


The breakdown: I had some images of handwritten digits which I fed through some mutation functions (thinning, scaling, convolution). Then the image was divided into 9 equal blocks which we use as our features. These features are stored along with their label. Come prediction time, calculate how different each stored feature is with a new image. The label of the feature that is least different is used as the prediction.

To get a better sense of how things work, here are the first 10 symbols as they go through the system:

1) Load image

2) Symbols separated into individual images (Optional - I was using sprites) symbol

3-1) Horizontal Convolution using a (5, 1) vector filled with 0.2 as the mask

3-2) Vertical Convolution using (1, 5) column vector filled with 0.2 as the mask

3-3) Binarize image using Otsu Thresholding and Scale to 32x32

3-4) Zhang-Suen Thinning

3-5) Canny Edge Detection

3-6) 3 x 3 Feature Vector

Only displaying the first one (cause the image is pretty big), check here to see the rest


At this point it might be good to check out the actual source code because I’ve kept this explanation high-level.

Training is really simple. I start off by creating a class FeatureDescriptions which will have methods to 1) train by adding features, add and 2) use these stored features to predict new images, predict.

fd = FeatureDescriptions()

Our feature descriptions are a placed into a python dictionary, one key for each label. Each key is initialized with an empty list.

self.features = {'0':[], '1':[], '2':[], '3':[] ... '9':[]}

Each 3x3 feature vector in our training set is added to it’s corresponding label in the self.features dictionary.

for feature, label in feature_label_pairs[:split]:

Note: The split variable is the index that we split the train images from the test images.


To classify new images, the system feeds a list of test images into our trained FeatureDescriptions object and calculates the Euclidean distance to every stored feature vector.

for feature, label in feature_label_pairs[split:]:
    distances = fd.predict(feature)
    prediction = np.argmin(distances)

The label with the feature vector closest to the new image is selected as the prediction. With 35 training images and 15 test images, this system achieves a 37.5% accuracy rate.

Closing Remarks

Experts systems are an old technique and we can see that they perform poorly on handwritten digits. From a computational perspective, calculating pairwise euclidean distances is a nightmare. Additionally, this system does not account for translational invariance. As an example, if you look at (3-6), the 2 was not centered which could cause problems, say if we had another 2 but it was right justified. However, I would like to mention that this system managed to squeeze out 37.5% accuracy despite only 35 training images, which is less feasible with a statistical system. (By the way, if you want the definition of something going wrong, look at that second image. It’s supposed to be 1 but by the end of the transforms it looks like a binary search tree)


Ahmed, M., & Ward, R. K. (2000). An expert system for general symbol recognition. Pattern Recognition, 33(12), 1975-1988. doi:10.1016/s0031-3203(99)00191-0

A Visual Intuition for Factoring

I’m a big fan of the 3Blue1Brown YouTube channel which provides pretty, visual intuitions of some pretty gnarly math so check it out if you haven’t already.

Well this math isn’t gnarly but I think it’s pretty neat so let’s talk about factoring. Say we have a square:


Remembering that the area of a square is just length times width, the area of this square would just be x*x or x2 Now take this same square and imagine we added y to both sides.


The new total area would now consist of:

  • The original squares area, x*x
  • The area of the rectangle on the bottom, y * x
  • The area of the rectangle on the side, x * y
  • The area of the square in the bottom right y * y.

Adding all of these together you get:

x2 + 2xy + y2

Actually, there’s a shortcut you can see that adding y to both sides changes the width and length to x+y, already signifying (x+y)2. But I think the earlier method is more generalizable. Think about how this would look adding another dimension (x+y)3. There would be much more little extra squares. That’s why it evaluates into:

x3 + 3x2y + 3xy2 + y3

Doesn’t it now make a lot more sense that Pascal’s Triangle increases in complexity as the number of dimensions increases?

At the heart of this technique is this little trick to split everything into chunks which you can individually calculate the area of, summed together they would create the area of the original. I think it’s an interesting little mathematical sleight of hand to pop into the toolbox.