Bumblebee.Text.ModernBertDecoder (Bumblebee v0.7.0)
ModernBERT Decoder model family.
ModernBERT Decoder uses the same architecture as ModernBERT but is trained with a causal language modeling objective for text generation tasks.
Architectures
:base - plain ModernBERT Decoder without any head on top
:for_causal_language_modeling - ModernBERT Decoder with a language modeling head. The head returns logits for each token in the original sequence
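As a minimal sketch, an architecture can be chosen when configuring the model specification. This assumes Bumblebee is already a dependency of the project and relies on the general `Bumblebee.configure/2` and `Bumblebee.build_model/1` functions. No pretrained checkpoint is loaded here, so parameters would still have to be initialized or loaded separately.

```elixir
# Configure a specification with the language modeling head.
# Passing architecture: :base instead would give the plain decoder.
spec =
  Bumblebee.configure(Bumblebee.Text.ModernBertDecoder,
    architecture: :for_causal_language_modeling
  )

# Build the Axon model described by the specification.
model = Bumblebee.build_model(spec)
```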
Inputs
"input_ids" - {batch_size, sequence_length}: Indices of input sequence tokens in the vocabulary.
"attention_mask" - {batch_size, sequence_length}: Mask indicating which tokens to attend to. This is used to ignore padding tokens, which are added when processing a batch of sequences with different lengths.
"position_ids" - {batch_size, sequence_length}: Indices of the position of each input sequence token in the position embeddings.
"attention_head_mask" - {num_blocks, num_attention_heads}: Mask to nullify selected heads of the self-attention blocks in the decoder.
"input_embeddings" - {batch_size, sequence_length, hidden_size}: Embedded representation of "input_ids", which can be specified for more control over how "input_ids" are embedded than the model's internal embedding lookup. If "input_embeddings" are present, then "input_ids" will be ignored.
"cache": A container with cached layer results used to speed up sequential decoding (autoregression). With cache, certain hidden states are taken from the cache, rather than recomputed on every decoding pass. The cache should be treated as opaque and initialized with Bumblebee.Text.Generation.init_cache/4.
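The sketch below shows one way the inputs described above might be assembled as a map of Nx tensors. The token ids and the `max_length` value are made up for illustration; in practice the ids come from a tokenizer. The argument order passed to `init_cache/4` (spec, batch size, maximum length, inputs) is an assumption and should be checked against the `Bumblebee.Text.Generation` documentation.

```elixir
# Illustrative token ids only; real ids come from a tokenizer.
input_ids = Nx.tensor([[12, 845, 90, 0], [7, 33, 0, 0]])

# 1 marks real tokens, 0 marks padding.
attention_mask = Nx.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

inputs = %{"input_ids" => input_ids, "attention_mask" => attention_mask}

{batch_size, _sequence_length} = Nx.shape(input_ids)

spec =
  Bumblebee.configure(Bumblebee.Text.ModernBertDecoder,
    architecture: :for_causal_language_modeling
  )

# Hypothetical upper bound on the decoded sequence length.
max_length = 16

# The cache is opaque: create it with init_cache/4 and pass it along as is.
cache = Bumblebee.Text.Generation.init_cache(spec, batch_size, max_length, inputs)
inputs = Map.put(inputs, "cache", cache)
```

Only "input_ids" and "attention_mask" are filled in here; the remaining inputs are left out of this sketch.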
Global layer options
:output_hidden_states - when true, the model output includes all hidden states
:output_attentions - when true, the model output includes all attention weights
Configuration
:vocab_size - the vocabulary size of the token embedding. This corresponds to the number of distinct tokens that can be represented in model input and output. Defaults to 50368
:max_positions - the maximum sequence length that this model can process. ModernBERT Decoder uses RoPE (Rotary Position Embedding) instead of absolute position embeddings. Defaults to 8192
:hidden_size - the dimensionality of hidden layers. Defaults to 768
:num_blocks - the number of Transformer blocks in the decoder. Defaults to 22
:num_attention_heads - the number of attention heads for each attention layer in the decoder. Defaults to 12
:intermediate_size - the dimensionality of the intermediate layer in the transformer feed-forward network (FFN) in the decoder. Defaults to 1152
:activation - the activation function used in the gated FFN. Defaults to :gelu
:dropout_rate - the dropout rate for embedding and decoder. Defaults to 0.0
:attention_dropout_rate - the dropout rate for attention weights. Defaults to 0.0
:layer_norm_epsilon - the epsilon used by the layer normalization layers. Defaults to 1.0e-5
:initializer_scale - the standard deviation of the normal initializer used for initializing kernel parameters. Defaults to 0.02
:local_attention_window - the window size for local attention layers. Defaults to 128
:layer_types - a list of layer types for each layer, where each element is either :sliding_attention (local attention with sliding window) or :full_attention (global attention)
:rotary_embedding_base_local - base for computing rotary embedding frequency for local (sliding) attention layers. Defaults to 10000.0
:rotary_embedding_base - base for computing rotary embedding frequency for global attention layers. Defaults to 160000.0
:num_labels - the number of labels to use in the last layer for the classification task. Defaults to 2
:id_to_label - a map from class index to label. Defaults to %{}
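To illustrate how these options might be overridden, the following sketch configures a deliberately tiny decoder, for example for quick experiments. The specific values are arbitrary and not recommended settings. It also assumes that :layer_types needs one entry per block, which is inferred from the option's description rather than confirmed.

```elixir
# A small, randomly initialized configuration; all values are illustrative.
spec =
  Bumblebee.configure(Bumblebee.Text.ModernBertDecoder,
    architecture: :for_causal_language_modeling,
    hidden_size: 64,
    num_blocks: 2,
    num_attention_heads: 4,
    intermediate_size: 96,
    local_attention_window: 8,
    # Assumed: one entry per block, alternating local and global attention.
    layer_types: [:sliding_attention, :full_attention]
  )

# Options that are not overridden keep the defaults listed above,
# for example spec.vocab_size and spec.max_positions.
```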