Extremely brief paper notes
Here I share my brief and shallow summary notes on some of the articles I find interesting.
Sharding: Splitting the model up and distributing its weights across different devices/GPUs. How to: use accelerate to save the model in shards.
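A minimal sketch of saving a model in shards with accelerate's save_model; the checkpoint name, output directory, and shard size below are placeholder assumptions:

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Example checkpoint; any Hugging Face causal LM works here
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Write the weights as multiple shards of at most 2GB each
accelerator = Accelerator()
accelerator.save_model(model=model, save_directory="./model_sharded", max_shard_size="2GB")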
Quantization: Map a model's weights from a larger bit representation to a smaller one without losing too much information.
Bfloat16: Same number of exponent bits as float32 (8 bits) but fewer mantissa bits (7 vs. 23), so it keeps float32's dynamic range at lower precision.
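A quick, purely illustrative sketch of that range claim in PyTorch:

import torch

x = torch.randn(4, dtype=torch.float32)
x_bf16 = x.to(torch.bfloat16)  # drops mantissa bits, keeps the exponent range

print(torch.finfo(torch.float32).max)   # ~3.4e38
print(torch.finfo(torch.bfloat16).max)  # ~3.4e38 -> same dynamic range as float32
print(torch.finfo(torch.float16).max)   # 65504   -> much smaller range

The bitsandbytes 4-bit (NF4) config below uses bfloat16 as the compute dtype: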
from torch import bfloat16
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,               # 4-bit quantization
    bnb_4bit_quant_type='nf4',       # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)
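A minimal usage sketch, assuming the goal is to load a Hugging Face causal LM with this config (the model name below is just an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # apply the 4-bit NF4 settings above
    device_map="auto",               # let accelerate place the layers automatically
)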
There are lots of pre-quantized models (GPTQ, GGUF, AWQ) on TheBloke's Hugging Face account.
GGUF: Allows users to run an LLM on the CPU and offload some of its layers to the GPU. Well-suited for Apple devices.
# GGUF models load via ctransformers; use `gpu_layers` to specify how many
# layers will be offloaded to the GPU.
from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral", gpu_layers=50, hf=True
)
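Since hf=True returns a transformers-compatible model, it can be dropped into a regular pipeline; a sketch assuming the tokenizer from the original unquantized Zephyr repo:

from transformers import AutoTokenizer, pipeline

# Tokenizer comes from the original (unquantized) model repo
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
print(pipe("Tell me a joke about quantization.")[0]["generated_text"])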
AWQ: Like GPTQ, but it assumes not all weights are equally important for performance, so a small fraction of the weights is skipped during quantization. Small quantization loss, relatively new. How to: vLLM.
from vllm import LLM, SamplingParams

# Greedy decoding, up to 256 new tokens
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)

llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization='awq',
    dtype='half',                # AWQ kernels run in fp16
    gpu_memory_utilization=.95,
    max_model_len=4096
)

prompt = "Tell me a joke about quantization."  # any string (or list of strings) works
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)