The results show that, compared with the baselines, the proposed method achieves better perplexity at lower cost (left) and makes better use of long context (right). The figure below shows the per-token forward time (latency) of each method as the context length grows, at a fixed batch size and with models of comparable parameter counts. The per-token forward time of the Transformer increases linearly with context length, while the forward time of the other two methods remains essentially constant. At long context, the proposed method is faster than the Transformer and comparable to the other recurrent baseline.
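To make that latency trend concrete, here is a minimal sketch (not the paper's benchmark) of why per-token decode cost grows with context length under self-attention but stays flat for a recurrent layer. The width `d`, the context lengths, and the cost functions are illustrative assumptions, not measured numbers.

```python
# Illustrative operation counts only; asymptotic sketches, not the
# latencies reported in the figure.

def attention_step_cost(t: int, d: int) -> int:
    # With a KV cache, each new token still attends to all t previous
    # tokens, so per-token work grows linearly with context length.
    return t * d

def recurrent_step_cost(t: int, d: int) -> int:
    # A recurrent layer only updates a fixed-size hidden state, so
    # per-token work does not depend on how long the context already is.
    return d * d

for t in (2_000, 8_000, 32_000):
    print(f"context={t:>6}  attention={attention_step_cost(t, 2048):>12,}  "
          f"recurrent={recurrent_step_cost(t, 2048):>12,}")
```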
The embarrassing reality for recurrent models: the 2016 scaling law paper showed that they could neither scale like Transformers nor use long context effectively. Is this really still the case? In this project, the researchers re-evaluated these findings, as shown in the figure. On the left, you can see that one of the most popular recurrent architectures today scales similarly to the strong Transformer, a huge improvement since 2016. On the right, however, the long-context problem remains. On average, later tokens in a sequence should be easier to predict, because they are conditioned on more information.
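One common way to produce a "perplexity vs. token index" curve like the one on the right (a sketch of the general recipe, not the authors' evaluation code; `model` is a hypothetical next-token language model) is to average the negative log-likelihood at each position over many long sequences:

```python
import torch
import torch.nn.functional as F

def nll_by_position(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Mean negative log-likelihood at each token index.

    token_ids: (batch, seq_len) integer tensor of evaluation sequences.
    model:     assumed to map (batch, seq_len) ids to (batch, seq_len, vocab) logits.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return nll.mean(dim=0)   # average over sequences -> (seq_len - 1,)
```

A curve that keeps decreasing at later indices means the model is actually benefiting from the extra context; a curve that flattens early suggests it is not.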
This is true for the Transformer: the average perplexity at each token index keeps decreasing throughout its context window. In contrast, for the recurrent model the decrease stops after a certain point in the context. This result reflects an awkward reality for existing recurrent layers. On the one hand, their main advantage over self-attention is linear (rather than quadratic) complexity, and this asymptotic advantage is only realized in long contexts. On the other hand, once the context is long enough, existing recurrent layers struggle to really take advantage of the additional conditioning information. The difficulty with long context is inherent in the nature of the layer: unlike the self-attention mechanism, a recurrent layer must compress the context into a fixed-size hidden state.
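A minimal sketch of the structural difference described above, with hypothetical shapes and weights: a recurrent layer folds every new token into a state of fixed size, while self-attention keeps the entire context available when producing the next output.

```python
import torch

d = 256
W = torch.randn(d, d) / d**0.5   # illustrative recurrence weights

def recurrent_layer(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (seq_len, d). The state stays (d,) regardless of seq_len."""
    state = torch.zeros(d)
    for x in tokens:
        state = torch.tanh(W @ state + x)   # whole context compressed here
    return state

def attention_output(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (seq_len, d). The 'memory' is the full, growing context."""
    q = tokens[-1]                           # query for the latest position
    scores = (tokens @ q) / d**0.5           # score against every past token
    weights = scores.softmax(dim=0)
    return weights[:, None].mul(tokens).sum(dim=0)
```

The recurrent state has the same capacity no matter how long the input is, which is exactly why extra context eventually stops helping, whereas attention's memory grows with the sequence at quadratic total cost.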