# Attention

## Why attention?

Attention means focusing on the most important parts of the input. It comes from a key observation: not all words are equal, and some words are more crucial to understanding a sentence than others. Take the sentence "It is raining outside". You would probably still understand that it's raining outside if I just said: "Rain! Out!". In this case, _it_ and _is_ are largely redundant. If a model is trying to understand the sentence, throwing out _it_ and _is_ is probably not going to make much of a difference.

## What's attention?

So how can a model focus only on the most important parts? One way is to multiply the important parts by a large factor while shrinking the values of the unimportant parts (to a machine, those parts are just numbers). That is what an attention mechanism does.
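As a rough illustration of that idea, here is a minimal sketch using the example sentence from earlier. The per-word values and importance scores are made up for illustration, not produced by any trained model, and turning scores into weights with an exponential normalization is just one common choice.

```python
import numpy as np

# Hypothetical one-number "value" for each word in "It is raining outside".
words = ["it", "is", "raining", "outside"]
values = np.array([0.1, 0.2, 3.0, 2.0])

# Hypothetical importance scores: higher means more important.
scores = np.array([0.0, 0.0, 2.0, 1.5])

# Turn the scores into positive weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

# Scale each value by its weight, then add them up into one summary number.
weighted = weights * values
summary = weighted.sum()

for word, w in zip(words, weights):
    print(f"{word:>8}: weight={w:.3f}")
print("summary:", summary)
```

The words with the higher scores (_raining_ and _outside_) end up with most of the weight, so they dominate the summary, while _it_ and _is_ contribute very little.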

## Try attention in code

In [None]:
%matplotlib inline

import numpy as np
from matplotlib import pyplot as plt

In [None]:
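def softmax(x, t=1):
    """Softmax over the last axis; a smaller temperature t gives sharper weights."""
    # subtract the max before exponentiating to avoid overflow (the result is unchanged)
    z = x / t
    exp = np.exp(z - z.max(-1, keepdims=True))

    # sums over the last axis
    sum_exp = exp.sum(-1, keepdims=True)

    return exp / sum_exp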

In [None]:
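num = 5

# random scores turned into attention weights; the low temperature makes them sharply peaked
weights = softmax(np.random.randn(num), t=0.1)
# random values that the weights will be applied to
data = np.random.randn(num)

print(weights)
print(data)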

In [None]:
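# plain average: every element counts equally
average = data.mean()
# attention-weighted sum: elements with larger weights count more
attn_applied = weights @ data

print(average)
print(attn_applied)

# index of the largest weight, and the data value at that index
print(weights.argmax())
print(data[weights.argmax()])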

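Notice that the attention-weighted sum usually lands much closer to the data value with the largest weight than the plain average does. With a low temperature (`t=0.1`), the weights tend to be nearly one-hot, so the result is dominated by that single element.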
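To see how much the temperature matters, the following sketch repeats the experiment with fixed numbers at several temperatures. The scores and data are hypothetical values chosen only so the run is reproducible, and `softmax` is redefined so the snippet runs on its own.

```python
import numpy as np

def softmax(x, t=1.0):
    # Subtract the max before exponentiating for numerical stability.
    z = np.asarray(x, dtype=float) / t
    exp = np.exp(z - z.max(-1, keepdims=True))
    return exp / exp.sum(-1, keepdims=True)

# Hypothetical fixed scores and data, chosen only for illustration.
scores = np.array([0.5, -1.0, 2.0, 0.1, 1.0])
data = np.array([1.0, -2.0, 4.0, 0.5, 3.0])

results = {}
for t in (10.0, 1.0, 0.1):
    weights = softmax(scores, t=t)
    # Store the largest weight and the attention-weighted sum for this temperature.
    results[t] = (weights.max(), weights @ data)
    print(f"t={t:>4}: max weight={weights.max():.3f}, weighted sum={weights @ data:.3f}")

print("plain average:", data.mean())
print("value at largest score:", data[scores.argmax()])
```

At a high temperature the weights are nearly uniform and the weighted sum stays close to the plain average. As the temperature drops, the largest weight grows and the weighted sum moves toward the value at the highest-scoring position.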