Library scope
Outline of the library
This is an outline of the library, of the problems it seeks to solve, and of the solutions it uses to do so. It is designed to help compare engineering decisions and identify areas for improvement.
Specs
Loading and validating the configuration
Brief overview here..
Validating yaml fields
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Normalizing range-based slice notation
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Loading models
Brief overview here..
Loading models from local
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Loading models from Hugging Face
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Tokenizer
Brief overview here..
Finding differences in the tokenizers of the ingredient models
Problem
When we merge multiple models, they may have differing tokenizers and vocabularies. To create a good final model, we want to preserve, as well as possible, the encodings of the tokens that each model was trained on. For this reason we produce a union of the tokens to be used.
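The union described above can be sketched over plain dictionaries. This is a minimal illustration of the idea, not the library's actual merging code; the function name `union_vocabs` and the toy vocabs are made up for the example:

```python
from typing import Dict, List


def union_vocabs(vocabs: List[Dict[str, int]]) -> Dict[str, int]:
    """Union the token sets of several vocabularies, assigning contiguous ids.

    Tokens from earlier vocabs are inserted first in their original id order,
    so their relative order is preserved; tokens that appear only in later
    vocabs are appended at the end.
    """
    merged: Dict[str, int] = {}
    for vocab in vocabs:
        # iterate tokens in their original id order for determinism
        for token in sorted(vocab, key=vocab.get):
            if token not in merged:
                merged[token] = len(merged)
    return merged


vocab_a = {"<s>": 0, "hello": 1, "world": 2}
vocab_b = {"<s>": 0, "hello": 1, "there": 2}
print(union_vocabs([vocab_a, vocab_b]))
# → {'<s>': 0, 'hello': 1, 'world': 2, 'there': 3}
```

Shared tokens keep the ids of the first vocabulary they appear in, which matters later when embeddings are aligned to the merged vocabulary.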
Constraints & caveats
What are the constraints, explain here
Functions used
Step 1
Loading all of the tokenizers with the load_all_tokenizers method inside the TokenizerLoader class:

```python
class TokenizerLoader:
    @staticmethod
    def load_all_tokenizers(models_ids: List[str], config: ApplicationConfig) -> Dict[str, PreTrainedTokenizerBase]:
        all_tokenizers = {}
        for model_id in models_ids:
            try:
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=config.trust_remote_code,
                )
            except Exception as e:
                error_message = f"Error loading tokenizer for {model_id}: {e}"
                logging.error(error_message)
                raise RuntimeError(error_message)
            all_tokenizers[model_id] = tokenizer
        return all_tokenizers
```

Step 2
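The load-and-collect pattern above (fail fast, wrap the failure with context) can be exercised in isolation. In this sketch `load_one` is a hypothetical stand-in for `AutoTokenizer.from_pretrained`, so the logic is testable without downloading anything:

```python
import logging
from typing import Callable, Dict, List


def load_all(model_ids: List[str], load_one: Callable[[str], object]) -> Dict[str, object]:
    """Load one resource per model id; abort with context on the first failure."""
    loaded: Dict[str, object] = {}
    for model_id in model_ids:
        try:
            loaded[model_id] = load_one(model_id)
        except Exception as e:
            message = f"Error loading tokenizer for {model_id}: {e}"
            logging.error(message)
            # re-raise so a partially loaded set is never returned to the caller
            raise RuntimeError(message) from e
    return loaded


fake_registry = {"model-a": "tok-a", "model-b": "tok-b"}
print(load_all(["model-a", "model-b"], fake_registry.__getitem__))
# → {'model-a': 'tok-a', 'model-b': 'tok-b'}
```

Raising instead of skipping a failed model is deliberate: merging with a missing tokenizer would silently change which vocabularies participate in the union.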
Checking for differences between the tokenizers of the models to be merged.
Checking three different things:
1. vocabularies
2. special tokens
3. added tokens encoders
Firstly, we attempt to find differences in the vocabularies, i.e. the token-to-id mappings that the models were trained with.

```python
@staticmethod
def _compare_tokenizer_vocabs(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    vocab_a = tokenizer_a.get_vocab()
    vocab_b = tokenizer_b.get_vocab()
    if vocab_a != vocab_b:
        logging.info(f"Tokenizer for model {model_a} has different vocab compared to model {model_b}.")
        return True
    return False
```

In the above, we leverage the get_vocab() method from the Hugging Face transformers library, which returns the vocab of the tokenizer as a dictionary, and we compare the dictionaries for equality.
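As a toy illustration of why whole-dict equality is used here (the two vocabs below are made up), comparing the dictionaries catches re-assigned ids that a comparison of the token sets alone would miss:

```python
vocab_a = {"<s>": 0, "hello": 1, "world": 2}
vocab_b = {"<s>": 0, "hello": 2, "world": 1}  # same tokens, swapped ids

print(vocab_a == vocab_b)            # dict equality flags the id mismatch
print(set(vocab_a) == set(vocab_b))  # token sets alone would look identical
```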
```python
# example implementation from the Llama architecture
def get_vocab(self):
    """Returns vocab as a dict"""
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
    vocab.update(self.added_tokens_encoder)
    return vocab
```

Secondly, we compare the special tokens, i.e. the tokens with a reserved role such as the unknown, beginning-of-sequence, and padding tokens.
```python
@staticmethod
def _compare_special_tokens(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    special_tokens_a = tokenizer_a.special_tokens_map
    special_tokens_b = tokenizer_b.special_tokens_map
    if special_tokens_a != special_tokens_b:
        logging.info(f"Tokenizer for model {model_a} has different special tokens compared to model {model_b}.")
        return True
    return False
```

To do so, we use the tokenizer property from the Hugging Face transformers library called special_tokens_map (HF), which gives us the mapping of special token attributes (cls_token, unk_token, etc.) to their values. We then compare both mappings for equality.
```python
@property
def special_tokens_map(self) -> Dict[str, Union[str, List[str]]]:
    """
    `Dict[str, Union[str, List[str]]]`: A dictionary mapping special token class attributes
    (`cls_token`, `unk_token`, etc.) to their values (`'unk'`, `'cls'`, etc.).

    Convert potential tokens of `tokenizers.AddedToken` type to string.
    """
    set_attr = {}
    for attr in self.SPECIAL_TOKENS_ATTRIBUTES:
        attr_value = getattr(self, attr)
        if attr_value:
            set_attr[attr] = attr_value
    return set_attr
```

Thirdly, we compare the added tokens encoders, which hold the tokens added to the tokenizer on top of the base vocabulary, together with their indices.
```python
@staticmethod
def _compare_added_tokens_encoders(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    added_tokens_encoder_a = tokenizer_a.added_tokens_encoder
    added_tokens_encoder_b = tokenizer_b.added_tokens_encoder
    if added_tokens_encoder_a != added_tokens_encoder_b:
        logging.info(
            f"Tokenizer for model {model_a} has different added tokens encoder compared to model {model_b}.")
        return True
    return False
```

To do so, we utilize the tokenizer property, similarly to the above, called added_tokens_encoder (HF), which returns the mapping from each added token string to its index.
```python
@property
def added_tokens_encoder(self) -> Dict[str, int]:
    """
    Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
    optimisation in `self._added_tokens_encoder` for the slow tokenizers.
    """
    return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
```

Lastly, we put everything together and combine the three checks with a logical OR over every pair of models. If even one check reports a difference for any pair, we record that differences exist as the output of this function, and in the next steps we proceed to merge a common tokenizer to eliminate them.
```python
@staticmethod
def check_tokenizers_for_differences(tokenizers: Dict[str, PreTrainedTokenizerBase]) -> bool:
    differences_found = False
    for (model_a, tokenizer_a), (model_b, tokenizer_b) in combinations(tokenizers.items(), 2):
        differences_found |= TokenizerValidator._compare_tokenizer_vocabs(model_a, tokenizer_a, model_b, tokenizer_b)
        differences_found |= TokenizerValidator._compare_special_tokens(model_a, tokenizer_a, model_b, tokenizer_b)
        differences_found |= TokenizerValidator._compare_added_tokens_encoders(model_a, tokenizer_a, model_b, tokenizer_b)
    return differences_found
```
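The pairwise iteration with itertools.combinations and the |= accumulation can be seen on a toy example. The three models and their vocabularies below are hypothetical, and plain dict inequality stands in for the three comparison methods:

```python
from itertools import combinations

vocabs = {
    "model-a": {"<s>": 0, "hi": 1},
    "model-b": {"<s>": 0, "hi": 1},
    "model-c": {"<s>": 0, "hey": 1},
}

differences_found = False
for (name_a, vocab_a), (name_b, vocab_b) in combinations(vocabs.items(), 2):
    # |= stays True once any pair differs, while still visiting every pair
    # (useful when each comparison also logs which pair disagrees)
    differences_found |= (vocab_a != vocab_b)

print(differences_found)  # → True: model-c disagrees with the other two
```

Note that the loop deliberately does not short-circuit: visiting all pairs means every mismatch gets logged, not just the first one found.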
Creating a common tokenizer for the output model
Problem
Lorem Ipsum
Constraints & caveats
What are the constraints, explain here
Functions used
Step 1
Loading all of the tokenizers with the load_all_tokenizers method inside the TokenizerLoader class:

```python
class TokenizerLoader:
    @staticmethod
    def load_all_tokenizers(models_ids: List[str], config: ApplicationConfig) -> Dict[str, PreTrainedTokenizerBase]:
        all_tokenizers = {}
        for model_id in models_ids:
            try:
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=config.trust_remote_code,
                )
            except Exception as e:
                error_message = f"Error loading tokenizer for {model_id}: {e}"
                logging.error(error_message)
                raise RuntimeError(error_message)
            all_tokenizers[model_id] = tokenizer
        return all_tokenizers
```

Step 2
Checking for differences between the tokenizers of the models to be merged.
Checking three different things:
1. vocabularies
2. special tokens
3. added tokens encoders
Firstly, we attempt to find differences in vocabularies. This means the __ that the model __.

```python
# first
```

Secondly, we ___

```python
# second
```

Thirdly, we __, which ___.

```python
# third
```

And finally, ____

```python
# fourth
```
Adapters
Loading adapter from Hugging Face
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Merging adapter
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a