How does ChatGPT's Transformer produce intelligence?
Posted: April 4, 2023, 09:18
Can someone who understands the Transformer explain it? Why is it so impressive?
The foundations of intelligence come down to just two things:
softmax is not attention. Inside the Transformer there are three matrices, K, Q, and V; I have never fully worked out how they actually fit together.

FoxMe wrote: April 4, 2023, 15:48
"attention is a technique that is meant to mimic cognitive attention."
attention also looks simple enough. The so-called softmax is just exponential weighting:
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
(formula image: https://wikimedia.org/api/rest_v1/media ... f51f905b08)
softmax is nothing new in statistics. It amplifies the large values and suppresses the small ones; is that all this "attention" amounts to?
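To make "amplifies the large, suppresses the small" concrete, here is a minimal sketch of softmax in Python (numpy assumed; the scores are made up for illustration):

Code:
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([1.0, 2.0, 4.0])
print(softmax(scores))
# ~[0.04 0.11 0.84]: a 1-vs-4 gap in the scores becomes a ~20x gap
# in the weights, so the largest score dominates the mixture.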
ChatGPT explained:
Query, key, and value vectors are used to compute a weighted sum of the values that are relevant to a given query, based on the similarity between the query and the key vectors.
Query Vector: A query vector is a representation of a specific input or context that is being used to retrieve relevant information from a set of values. In the context of attention mechanisms, the query vector is used to compute the similarity between the query and the key vectors.
Key Vector: A key vector is a representation of a value or set of values that is being searched for relevance to the query vector. In the context of attention mechanisms, the key vector is compared with the query vector to produce a similarity score, which determines how heavily the associated value contributes to the output.
Value Vector: A value vector is a representation of a piece of information that is associated with a specific key vector. In the context of attention mechanisms, the value vectors are combined in a weighted sum, with each weight given by the similarity between the query and that value's key.
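A minimal sketch of the three roles above, assuming one query attending over three key/value pairs (all numbers made up for illustration):

Code:
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

q = np.array([1.0, 0.0])           # query: what we are looking for
K = np.array([[1.0, 0.0],          # keys: what each value "advertises"
              [0.0, 1.0],
              [0.7, 0.7]])
V = np.array([[10.0, 0.0],         # values: the information itself
              [0.0, 10.0],
              [5.0, 5.0]])

weights = softmax(K @ q)   # similarity of the query to each key
output = weights @ V       # weighted sum of the value vectors
print(weights, output)     # the first key matches best, so its value dominates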
A query vector is a mathematical representation of a search query in a vector space model, which is a common approach used in information retrieval systems to match search queries with relevant documents.
In a vector space model, each document and query is represented as a vector of numerical values that correspond to the presence or absence of certain terms in the document or query. The terms used in the model are typically identified by a process called term extraction or feature selection, which identifies the most important words or phrases in the document corpus.
To create a query vector, the search query is typically preprocessed to remove stop words and other irrelevant terms, and then converted into a vector representation using the same set of terms as the document vectors. This vector can then be used to measure the similarity between the query and each document in the corpus, usually by calculating the cosine similarity between the query vector and each document vector.
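A small sketch of that retrieval picture, with made-up term-count vectors over a four-word vocabulary; cosine similarity then ranks the documents against the query:

Code:
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative vocabulary: ["transformer", "attention", "softmax", "recipe"]
query = np.array([1.0, 1.0, 0.0, 0.0])   # "transformer attention"
doc1  = np.array([3.0, 2.0, 1.0, 0.0])   # a post about transformers
doc2  = np.array([0.0, 0.0, 0.0, 5.0])   # a cooking post

print(cosine(query, doc1))   # ~0.94: shares the query's terms
print(cosine(query, doc2))   # 0.0: no overlapping terms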
In the context of attention mechanisms, the key vector is a vector representation of a set of values or pieces of information that are being searched for relevance to a given query vector. The key vector is used to compute the similarity between the query vector and each key, which in turn is used to weight the contribution of each value vector in the final output of the attention mechanism.
The choice of the key vector representation can have a significant impact on the performance of the attention mechanism. In some cases, the key vector can be a direct copy of the value vector, while in other cases, the key vector can be a transformed version of the value vector.
One common approach is dot-product attention, where the similarity between the query and key vectors is computed as the dot product of the two vectors; in the Transformer, the query and key vectors are each produced by their own learned linear transformation of the input. Another approach, known as additive attention, combines the query and key vectors through a small learned network that outputs the similarity score.
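A sketch of the two choices just described, with random stand-ins for what would be learned parameters: keys taken as a direct copy of the values, versus keys produced by their own linear transformation:

Code:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # four input vectors of dimension 8
q = rng.normal(size=(8,))     # a query vector

# Option 1: the keys are a direct copy of the values.
K1 = X

# Option 2: the keys come from a separate linear transformation
# (W_k would be learned during training; random here for illustration).
W_k = rng.normal(size=(8, 8))
K2 = X @ W_k

print(K1 @ q)   # dot-product similarity scores under option 1
print(K2 @ q)   # generally different scores under option 2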
During the self-attention computation, each query vector attends to all key vectors in the input sequence, and the attention scores determine how much weight to assign to each value vector when computing the output. The output is then computed as a weighted sum of the value vectors, where the weights are determined by the attention scores.
The value vectors are important because they provide the actual information that is used to compute the output of the self-attention mechanism. By attending to different subsets of the value vectors with different weights, the self-attention mechanism can selectively emphasize or suppress different aspects of the input sequence, enabling the model to capture complex relationships between different elements of the sequence.
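A tiny illustration of that emphasize/suppress behavior, with made-up scores: sharp scores make the output track a single value vector, while flat scores blend all the values:

Code:
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

sharp = softmax(np.array([8.0, 0.0, 0.0]))   # one score dominates
flat  = softmax(np.array([0.1, 0.0, 0.1]))   # scores nearly equal

print(sharp @ V)   # ~[1.0, 0.0]: effectively selects the first value
print(flat @ V)    # ~[0.5, 0.5]: averages the values evenly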
In a Transformer model, the encoder is responsible for processing the input sequence and creating a representation of it that can be used by the decoder to generate the output sequence.

TheMatrix wrote: April 5, 2023, 13:43
ChatGPT explained:
The Transformer model is composed of an encoder and a decoder, each containing multiple layers of self-attention and feedforward neural networks. The encoder processes the input sequence, while the decoder generates the output sequence. Both the encoder and decoder layers use residual connections and layer normalization to improve training stability.
In a Transformer model, the decoder is one of two main components (the other being the encoder) used for sequence-to-sequence tasks such as machine translation, text summarization, and question answering. The decoder takes in the output of the encoder, which is a set of encoded representations of the input sequence, and generates the output sequence.

TheMatrix wrote: April 5, 2023, 13:43
ChatGPT explained:
The Transformer model is composed of an encoder and a decoder, each containing multiple layers of self-attention and feedforward neural networks. The encoder processes the input sequence, while the decoder generates the output sequence. Both the encoder and decoder layers use residual connections and layer normalization to improve training stability.
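A bare-bones sketch of one encoder layer as the quoted passage describes it: a self-attention sublayer and a feedforward sublayer, each wrapped in a residual connection followed by layer normalization. The self_attention and ffn arguments are stand-ins for the real sublayers:

Code:
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, ffn):
    # Residual connection + layer norm around each sublayer.
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + ffn(x))
    return x

# Toy stubs standing in for the real sublayers, just to show the wiring:
x = np.ones((5, 16))
out = encoder_layer(x, self_attention=lambda h: 0.1 * h, ffn=lambda h: 0.2 * h)
print(out.shape)   # (5, 16)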
In a Transformer model, the key vectors are usually selected as a function of the input sequence of tokens, using learned weights or embeddings; a sketch of the general steps follows the quoted passage below.

TheMatrix wrote: April 5, 2023, 13:43
ChatGPT explained:
In self-attention, the input sequence is transformed into three vectors: a query vector, a key vector, and a value vector. These vectors are obtained by multiplying the input sequence with learned weight matrices. The dot product of the query vector with the key vector produces a score for each position in the input sequence. The scores are normalized using a softmax function to obtain attention weights, which are used to weigh the corresponding value vectors. The weighted value vectors are then summed to obtain the final output representation.
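Following the quoted steps end to end, a minimal sketch: token embeddings are multiplied by weight matrices to get Q, K, and V (random stand-ins for learned parameters), dot products give scores, softmax turns them into attention weights, and the weighted value vectors are summed. The 1/sqrt(d) scaling of the scores is the Transformer convention, not mentioned in the quote:

Code:
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab, d = 100, 16
tokens = np.array([5, 42, 7])     # a toy input sequence of token ids
E = rng.normal(size=(vocab, d))   # embedding table (learned in practice)
X = E[tokens]                     # (3, d) input embeddings

# Learned weight matrices (random stand-ins for illustration):
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)     # dot-product scores for every position pair
weights = softmax(scores)         # each row sums to 1
output = weights @ V              # weighted sums of the value vectors
print(output.shape)               # (3, 16)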
Someone truly skilled should be able to do it.
Agreed. Intelligence requires creative thinking; you have to be able to keep putting forward hypotheses,