Exploiting scale in both training data and model size has been central to the success of deep learning. When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy.
The basic idea of MoE is to split the FFN into multiple sub-networks ("experts"): for each input token, only a subset of the experts is activated. The sub-networks behave as different "experts": during training, each absorbs different information and knowledge from the dataset; during inference, only some of the experts are activated, chosen based on the input token. A minimal sketch follows below.
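A minimal sketch of a token-level top-k MoE layer, assuming PyTorch; the dimensions, the expert FFN shape, and the linear router are illustrative placeholders, not any specific model's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k MoE layer: a router picks k experts per token."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):
        # x: (n_tokens, d_model); flatten (batch, seq) into n_tokens first
        logits = self.router(x)                         # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # k experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                           # tokens that chose expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                # expert e stays inactive
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Note that the router decides per token, so in a given forward pass only the selected experts' parameters are ever touched; the rest sit idle.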
FFNN (Feedforward Neural Network)
An FFNN allows the model to use the contextual information created by the attention mechanism, transforming it further to capture more complex relationships in the data.
An MLP (multilayer perceptron) is a type of FFNN; in Transformers, the FFN block is a small MLP applied independently at each token position, as sketched below.
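A minimal sketch of this position-wise FFN block, again assuming PyTorch; d_model and d_hidden are illustrative values:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: applied independently to every token."""
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)  # expand
        self.fc2 = nn.Linear(d_hidden, d_model)  # project back
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return self.fc2(self.act(self.fc1(x)))
```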
Only a portion of the parameters is activated for each token (see the back-of-the-envelope count after these notes)
Each expert learns different information during training
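To make "a portion of the parameters" concrete, a back-of-the-envelope count using the illustrative dimensions from the MoE sketch above (8 experts, 2 active per token), not any real model's figures:

```python
# Sparse activation in the illustrative MoE layer above (biases ignored).
d_model, d_hidden, n_experts, top_k = 512, 2048, 8, 2

params_per_expert = 2 * d_model * d_hidden    # two weight matrices per expert
total = n_experts * params_per_expert         # parameters stored
active = top_k * params_per_expert            # parameters used per token
print(f"total: {total:,}  active per token: {active:,}  ({active/total:.0%})")
# total: 16,777,216  active per token: 4,194,304  (25%)
```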
Mixtral Paper: Mixtral of Experts (Jiang et al., 2024), arXiv:2401.04088