Library scope
Outline of the library
This is an outline of the library, of the problems it seeks to solve, and of the solutions it uses to do so. It is designed to help compare engineering decisions and identify areas for improvement.
Specs
Loading and validating the configuration
Brief overview here..
Validating yaml fields
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Normalizing range-based slice notation
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Loading models
Brief overview here..
Loading models from local
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Loading models from Hugging Face
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Tokenizer
Brief overview here..
Finding differences in the tokenizers of the ingredient models
Problem
When we merge multiple models, they may have differing tokenizers and vocabularies. To create a good final model, we want to preserve, as well as possible, the encodings of the tokens that each model was trained on. For this reason we produce a union of the tokens to be used.
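The union described above can be sketched over plain dictionaries. This is a minimal illustration of the idea, not the library's actual merging code; the function name `union_vocabs` and the toy vocabs are made up for the example:

```python
from typing import Dict, List


def union_vocabs(vocabs: List[Dict[str, int]]) -> Dict[str, int]:
    """Union the token sets of several vocabularies, assigning contiguous ids.

    Tokens from earlier vocabs are inserted first in their original id order,
    so their relative order is preserved; tokens that appear only in later
    vocabs are appended at the end.
    """
    merged: Dict[str, int] = {}
    for vocab in vocabs:
        # iterate tokens in their original id order for determinism
        for token in sorted(vocab, key=vocab.get):
            if token not in merged:
                merged[token] = len(merged)
    return merged


vocab_a = {"<s>": 0, "hello": 1, "world": 2}
vocab_b = {"<s>": 0, "hello": 1, "there": 2}
print(union_vocabs([vocab_a, vocab_b]))
# → {'<s>': 0, 'hello': 1, 'world': 2, 'there': 3}
```

Shared tokens keep the ids of the first vocabulary they appear in, which matters later when embeddings are aligned to the merged vocabulary.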
Constraints & caveats
What are the constraints, explain here
Functions used
Step 1
Loading all of the tokenizers with the load_all_tokenizers method inside the TokenizerLoader class:

```python
class TokenizerLoader:
    @staticmethod
    def load_all_tokenizers(models_ids: List[str], config: ApplicationConfig) -> Dict[str, PreTrainedTokenizerBase]:
        all_tokenizers = {}
        for model_id in models_ids:
            try:
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=config.trust_remote_code,
                )
            except Exception as e:
                error_message = f"Error loading tokenizer for {model_id}: {e}"
                logging.error(error_message)
                raise RuntimeError(error_message)
            all_tokenizers[model_id] = tokenizer
        return all_tokenizers
```

Step 2
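The load-and-collect pattern above (fail fast, wrap the failure with context) can be exercised in isolation. In this sketch `load_one` is a hypothetical stand-in for `AutoTokenizer.from_pretrained`, so the logic is testable without downloading anything:

```python
import logging
from typing import Callable, Dict, List


def load_all(model_ids: List[str], load_one: Callable[[str], object]) -> Dict[str, object]:
    """Load one resource per model id; abort with context on the first failure."""
    loaded: Dict[str, object] = {}
    for model_id in model_ids:
        try:
            loaded[model_id] = load_one(model_id)
        except Exception as e:
            message = f"Error loading tokenizer for {model_id}: {e}"
            logging.error(message)
            # re-raise so a partially loaded set is never returned to the caller
            raise RuntimeError(message) from e
    return loaded


fake_registry = {"model-a": "tok-a", "model-b": "tok-b"}
print(load_all(["model-a", "model-b"], fake_registry.__getitem__))
# → {'model-a': 'tok-a', 'model-b': 'tok-b'}
```

Raising instead of skipping a failed model is deliberate: merging with a missing tokenizer would silently change which vocabularies participate in the union.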
Checking for differences between the tokenizers of the models to be merged.
Checking three different things:
1. vocabularies
2. special tokens
3. added tokens encoders
Firstly, we attempt to find differences in the vocabularies, i.e. the token-to-id mappings that the models were trained with.

```python
@staticmethod
def _compare_tokenizer_vocabs(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    vocab_a = tokenizer_a.get_vocab()
    vocab_b = tokenizer_b.get_vocab()
    if vocab_a != vocab_b:
        logging.info(f"Tokenizer for model {model_a} has different vocab compared to model {model_b}.")
        return True
    return False
```

In the above, we leverage the get_vocab() method from the Hugging Face transformers library, which returns the vocab of the tokenizer as a dictionary, and we compare the dictionaries for equality.
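As a toy illustration of why whole-dict equality is used here (the two vocabs below are made up), comparing the dictionaries catches re-assigned ids that a comparison of the token sets alone would miss:

```python
vocab_a = {"<s>": 0, "hello": 1, "world": 2}
vocab_b = {"<s>": 0, "hello": 2, "world": 1}  # same tokens, swapped ids

print(vocab_a == vocab_b)            # dict equality flags the id mismatch
print(set(vocab_a) == set(vocab_b))  # token sets alone would look identical
```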
```python
# example implementation from the Llama architecture
def get_vocab(self):
    """Returns vocab as a dict"""
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
    vocab.update(self.added_tokens_encoder)
    return vocab
```

Secondly, we compare the special tokens, i.e. the tokens with a reserved role such as the unknown, beginning-of-sequence, and padding tokens.
```python
@staticmethod
def _compare_special_tokens(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    special_tokens_a = tokenizer_a.special_tokens_map
    special_tokens_b = tokenizer_b.special_tokens_map
    if special_tokens_a != special_tokens_b:
        logging.info(f"Tokenizer for model {model_a} has different special tokens compared to model {model_b}.")
        return True
    return False
```

To do so, we use the tokenizer property from the Hugging Face transformers library called special_tokens_map (HF), which gives us the mapping of special token attributes (cls_token, unk_token, etc.) to their values. We then compare both mappings for equality.
```python
@property
def special_tokens_map(self) -> Dict[str, Union[str, List[str]]]:
    """
    `Dict[str, Union[str, List[str]]]`: A dictionary mapping special token class attributes
    (`cls_token`, `unk_token`, etc.) to their values (`'unk'`, `'cls'`, etc.).

    Convert potential tokens of `tokenizers.AddedToken` type to string.
    """
    set_attr = {}
    for attr in self.SPECIAL_TOKENS_ATTRIBUTES:
        attr_value = getattr(self, attr)
        if attr_value:
            set_attr[attr] = attr_value
    return set_attr
```

Thirdly, we compare the added tokens encoders, which hold the tokens added to the tokenizer on top of the base vocabulary, together with their indices.
```python
@staticmethod
def _compare_added_tokens_encoders(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    added_tokens_encoder_a = tokenizer_a.added_tokens_encoder
    added_tokens_encoder_b = tokenizer_b.added_tokens_encoder
    if added_tokens_encoder_a != added_tokens_encoder_b:
        logging.info(
            f"Tokenizer for model {model_a} has different added tokens encoder compared to model {model_b}.")
        return True
    return False
```

To do so, we utilize the tokenizer property, similarly to the above, called added_tokens_encoder (HF), which returns the mapping from each added token string to its index.
```python
@property
def added_tokens_encoder(self) -> Dict[str, int]:
    """
    Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
    optimisation in `self._added_tokens_encoder` for the slow tokenizers.
    """
    return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
```

Lastly, we put everything together and combine the three checks with a logical OR over every pair of models. If even one check reports a difference for any pair, we record that differences exist as the output of this function, and in the next steps we proceed to merge a common tokenizer to eliminate them.
```python
@staticmethod
def check_tokenizers_for_differences(tokenizers: Dict[str, PreTrainedTokenizerBase]) -> bool:
    differences_found = False
    for (model_a, tokenizer_a), (model_b, tokenizer_b) in combinations(tokenizers.items(), 2):
        differences_found |= TokenizerValidator._compare_tokenizer_vocabs(model_a, tokenizer_a, model_b, tokenizer_b)
        differences_found |= TokenizerValidator._compare_special_tokens(model_a, tokenizer_a, model_b, tokenizer_b)
        differences_found |= TokenizerValidator._compare_added_tokens_encoders(model_a, tokenizer_a, model_b, tokenizer_b)
    return differences_found
```
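The pairwise iteration with itertools.combinations and the |= accumulation can be seen on a toy example. The three models and their vocabularies below are hypothetical, and plain dict inequality stands in for the three comparison methods:

```python
from itertools import combinations

vocabs = {
    "model-a": {"<s>": 0, "hi": 1},
    "model-b": {"<s>": 0, "hi": 1},
    "model-c": {"<s>": 0, "hey": 1},
}

differences_found = False
for (name_a, vocab_a), (name_b, vocab_b) in combinations(vocabs.items(), 2):
    # |= stays True once any pair differs, while still visiting every pair
    # (useful when each comparison also logs which pair disagrees)
    differences_found |= (vocab_a != vocab_b)

print(differences_found)  # → True: model-c disagrees with the other two
```

Note that the loop deliberately does not short-circuit: visiting all pairs means every mismatch gets logged, not just the first one found.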
Creating a common tokenizer for the output model
Problem
Lorem Ipsum
Constraints & caveats
What are the constraints, explain here
Functions used
Step 1
Loading all of the tokenizers with the load_all_tokenizers method inside the TokenizerLoader class:

```python
class TokenizerLoader:
    @staticmethod
    def load_all_tokenizers(models_ids: List[str], config: ApplicationConfig) -> Dict[str, PreTrainedTokenizerBase]:
        all_tokenizers = {}
        for model_id in models_ids:
            try:
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=config.trust_remote_code,
                )
            except Exception as e:
                error_message = f"Error loading tokenizer for {model_id}: {e}"
                logging.error(error_message)
                raise RuntimeError(error_message)
            all_tokenizers[model_id] = tokenizer
        return all_tokenizers
```

Step 2
Checking for differences between the tokenizers of the models to be merged.
Checking three different things:
1. vocabularies
2. special tokens
3. added tokens encoders
Firstly, we attempt to find differences in vocabularies. This means the __ that the model __.

```python
# first
```

Secondly, we ___

```python
# second
```

Thirdly, we __, which ___.

```python
# third
```

And finally, ____

```python
# fourth
```
Adapters
Loading adapter from Hugging Face
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a
Merging adapter
Problem
Constraints
Functions used: snippet plus github link
- def this_is_function(a): return a