flow-merge Help

Library scope

Outline of the library

This is an outline of the library, the problems it seeks to solve, and the solutions it uses to solve them. It is designed to help compare engineering decisions and identify areas for improvement.

Specs

Loading and validating the configuration

Brief overview here..

Validating yaml fields

  • Problem

  • Constraints

  • Functions used: snippet plus github link

  • def this_is_function(a): return a

Normalizing range-based slice notation

  • Problem

  • Constraints

  • Functions used: snippet plus github link

  • def this_is_function(a): return a

Loading models

Brief overview here..

Loading models from local

  • Problem

  • Constraints

  • Functions used: snippet plus github link

  • def this_is_function(a): return a

Loading models from Hugging Face

  • Problem

  • Constraints

  • Functions used: snippet plus github link

  • def this_is_function(a): return a

Tokenizer

Brief overview here..

Finding differences in the tokenizers of the ingredient models

  • Problem

    When we merge multiple models, they might have differing tokenizers and vocabularies. To create a good final model, we want to preserve, as faithfully as possible, the encoding of the tokens each model was trained on. For this reason we produce a union of the tokens to be used (an illustrative sketch of this union idea is shown after the functions below).

  • Constraints & caveats

    What are the constraints, explain here

  • Functions used

    Step 1
    Loading all of the tokenizers with the load_all_tokenizers method of the TokenizerLoader class.

    class TokenizerLoader:
        @staticmethod
        def load_all_tokenizers(
            models_ids: List[str], config: ApplicationConfig
        ) -> Dict[str, PreTrainedTokenizerBase]:
            all_tokenizers = {}
            for model_id in models_ids:
                try:
                    tokenizer = AutoTokenizer.from_pretrained(
                        model_id,
                        trust_remote_code=config.trust_remote_code,
                    )
                except Exception as e:
                    error_message = f"Error loading tokenizer for {model_id}: {e}"
                    logging.error(error_message)
                    raise RuntimeError(error_message)
                all_tokenizers[model_id] = tokenizer
            return all_tokenizers
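
    For illustration only, a call to this loader could look like the sketch below. The model ids and the way ApplicationConfig is constructed here are assumptions made for the example, not part of flow-merge's API.

    # Hypothetical usage sketch. ApplicationConfig is assumed to accept a
    # trust_remote_code flag, matching how load_all_tokenizers reads it above.
    model_ids = ["mistralai/Mistral-7B-v0.1", "teknium/OpenHermes-2.5-Mistral-7B"]  # example ingredient models
    config = ApplicationConfig(trust_remote_code=False)  # assumed constructor

    tokenizers = TokenizerLoader.load_all_tokenizers(model_ids, config)
    for model_id, tokenizer in tokenizers.items():
        print(model_id, len(tokenizer))  # len() includes added tokens in the vocabulary size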

    Step 2
    Checking for differences between the tokenizers of the models to be merged.

    Checking three different things:
    1. vocabularies
    2. special tokens
    3. added tokens encoders

    Firstly, we attempt to find differences in the vocabularies. The vocabulary is the mapping from token strings to token ids that the model was trained with, so any missing token or reassigned id counts as a difference.

    @staticmethod
    def _compare_tokenizer_vocabs(
        model_a: str,
        tokenizer_a: PreTrainedTokenizerBase,
        model_b: str,
        tokenizer_b: PreTrainedTokenizerBase,
    ) -> bool:
        vocab_a = tokenizer_a.get_vocab()
        vocab_b = tokenizer_b.get_vocab()
        if vocab_a != vocab_b:
            logging.info(
                f"Tokenizer for model {model_a} has different vocab compared to model {model_b}."
            )
            return True
        return False

    In the above, we leverage the get_vocab() method from the Hugging Face transformers library, which returns the tokenizer's vocabulary as a dictionary, and we compare the two dictionaries for equality.

    # example implementation from Llama architecture
    def get_vocab(self):
        """Returns vocab as a dict"""
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab
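
    As a quick illustration of what this comparison catches: get_vocab() returns a plain {token: id} dictionary, so tokenizers trained with different vocabularies simply compare as unequal dicts. The two checkpoints below are stand-ins chosen for the example, not models that flow-merge requires.

    # Illustrative sketch: two unrelated tokenizers produce different vocab dicts.
    from transformers import AutoTokenizer

    tok_a = AutoTokenizer.from_pretrained("gpt2")                # ~50257 tokens, byte-level BPE
    tok_b = AutoTokenizer.from_pretrained("bert-base-uncased")   # ~30522 tokens, WordPiece

    vocab_a = tok_a.get_vocab()
    vocab_b = tok_b.get_vocab()
    print(vocab_a == vocab_b)          # False
    print(len(vocab_a), len(vocab_b))  # the differing sizes already reveal the mismatch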

    Secondly, we compare the special tokens, which mark roles such as the beginning and end of a sequence, padding, and unknown tokens.

    @staticmethod
    def _compare_special_tokens(
        model_a: str,
        tokenizer_a: PreTrainedTokenizerBase,
        model_b: str,
        tokenizer_b: PreTrainedTokenizerBase,
    ) -> bool:
        special_tokens_a = tokenizer_a.special_tokens_map
        special_tokens_b = tokenizer_b.special_tokens_map
        if special_tokens_a != special_tokens_b:
            logging.info(
                f"Tokenizer for model {model_a} has different special tokens compared to model {model_b}."
            )
            return True
        return False

    To do so, we use the tokenizer property from the Hugging Face transformers library called special_tokens_map (HF), which gives us the mapping of special token attributes (cls_token, unk_token, etc.) to their values. We then compare both mappings for equality.

    @property
    def special_tokens_map(self) -> Dict[str, Union[str, List[str]]]:
        """
        `Dict[str, Union[str, List[str]]]`: A dictionary mapping special token class attributes
        (`cls_token`, `unk_token`, etc.) to their values (`'<unk>'`, `'<cls>'`, etc.).

        Convert potential tokens of `tokenizers.AddedToken` type to string.
        """
        set_attr = {}
        for attr in self.SPECIAL_TOKENS_ATTRIBUTES:
            attr_value = getattr(self, attr)
            if attr_value:
                set_attr[attr] = attr_value
        return set_attr
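
    As an illustration of what this property holds (GPT-2 is used below purely as a stand-in), the map is an ordinary dictionary, so any disagreement in, for example, the eos_token or pad_token between two tokenizers makes the equality check above fail.

    # Illustrative sketch of special_tokens_map on a stand-in tokenizer.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.special_tokens_map)
    # GPT-2 prints something like:
    # {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}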

    Thirdly, we compare the added tokens encoders, which map the tokens that were added to a tokenizer after training (for example with add_tokens) to their ids.

    @staticmethod
    def _compare_added_tokens_encoders(
        model_a: str,
        tokenizer_a: PreTrainedTokenizerBase,
        model_b: str,
        tokenizer_b: PreTrainedTokenizerBase,
    ) -> bool:
        added_tokens_encoder_a = tokenizer_a.added_tokens_encoder
        added_tokens_encoder_b = tokenizer_b.added_tokens_encoder
        if added_tokens_encoder_a != added_tokens_encoder_b:
            logging.info(
                f"Tokenizer for model {model_a} has different added tokens encoder compared to model {model_b}."
            )
            return True
        return False

    To do so we utilize the tokenizer property, similarly to the above, called added_tokens_encoder (HF), which returns the sorted mapping from added token string to index.

    @property
    def added_tokens_encoder(self) -> Dict[str, int]:
        """
        Returns the sorted mapping from string to index. The added tokens encoder is cached for
        performance optimisation in `self._added_tokens_encoder` for the slow tokenizers.
        """
        return {
            k.content: v
            for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])
        }
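
    For illustration (again with a stand-in checkpoint), tokens registered after training, e.g. with add_tokens(), show up in this mapping, so two copies of the same base tokenizer diverge as soon as one of them has been extended.

    # Illustrative sketch: an added token appears in added_tokens_encoder and
    # therefore makes _compare_added_tokens_encoders report a difference.
    from transformers import AutoTokenizer

    tok_a = AutoTokenizer.from_pretrained("gpt2")
    tok_b = AutoTokenizer.from_pretrained("gpt2")
    tok_b.add_tokens(["<merge-demo-token>"])    # hypothetical extra token

    print(tok_a.added_tokens_encoder == tok_b.added_tokens_encoder)  # False
    print("<merge-demo-token>" in tok_b.added_tokens_encoder)        # True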

    Lastly, we put everything together and combine the three checks with a logical OR across every pair of tokenizers. If even one check reports a difference, we record that the tokenizers differ, and in the next steps we proceed to create a common tokenizer to eliminate that difference.

    @staticmethod
    def check_tokenizers_for_differences(
        tokenizers: Dict[str, PreTrainedTokenizerBase],
    ) -> bool:
        differences_found = False
        for (model_a, tokenizer_a), (model_b, tokenizer_b) in combinations(tokenizers.items(), 2):
            differences_found |= TokenizerValidator._compare_tokenizer_vocabs(
                model_a, tokenizer_a, model_b, tokenizer_b
            )
            differences_found |= TokenizerValidator._compare_special_tokens(
                model_a, tokenizer_a, model_b, tokenizer_b
            )
            differences_found |= TokenizerValidator._compare_added_tokens_encoders(
                model_a, tokenizer_a, model_b, tokenizer_b
            )
        return differences_found
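
    The Problem statement above mentions producing a union of the ingredient models' tokens. How flow-merge builds that common tokenizer is covered in the next spec; purely to illustrate the idea, a union vocabulary for two stand-in tokenizers could be sketched as follows (none of this is flow-merge's implementation).

    # Illustrative sketch only: compute the union of two vocabularies and see which
    # tokens a chosen base tokenizer would be missing. Not flow-merge's implementation.
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained("gpt2")                 # stand-in base tokenizer
    other = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in second ingredient

    union_tokens = set(base.get_vocab()) | set(other.get_vocab())
    missing_from_base = sorted(union_tokens - set(base.get_vocab()))

    print(len(union_tokens), "tokens in the union")
    print(len(missing_from_base), "tokens would need to be added to the base tokenizer")
    # In a real merge the missing tokens would be registered (e.g. via base.add_tokens(...))
    # and the models' embedding matrices resized to match the new vocabulary size.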

Creating a common tokenizer for the output model

  • Problem

    Lorem Ipsum

  • Constraints & caveats

    What are the constraints, explain here

  • Functions used

    Step 1
    Loading all of the tokenizers with the load_all_tokenizers method of the TokenizerLoader class.

    class TokenizerLoader:
        @staticmethod
        def load_all_tokenizers(
            models_ids: List[str], config: ApplicationConfig
        ) -> Dict[str, PreTrainedTokenizerBase]:
            all_tokenizers = {}
            for model_id in models_ids:
                try:
                    tokenizer = AutoTokenizer.from_pretrained(
                        model_id,
                        trust_remote_code=config.trust_remote_code,
                    )
                except Exception as e:
                    error_message = f"Error loading tokenizer for {model_id}: {e}"
                    logging.error(error_message)
                    raise RuntimeError(error_message)
                all_tokenizers[model_id] = tokenizer
            return all_tokenizers

    Step 2
    Checking for differences between the tokenizers of the models to be merged.

    Checking three different things:
    1. vocabularies
    2. special tokens
    3. added tokens encoders

    Firstly, we attempt to find differences in the vocabularies. This means the __ that the model __.

    # first

    Secondly, we ___

    # second

    Thirdly, we __, which ___.

    # third

    And finally, ____

    # fourth

Adapters

Loading adapter from Hugging Face

  • Problem

  • Constraints

  • Functions used: snippet plus github link

  • def this_is_function(a): return a

Merging adapter

  • Problem

  • Constraints

  • Functions used: snippet plus github link

  • def this_is_function(a): return a
Last modified: 22 August 2024