Deploy Some LLMs
TyeYeah

Large Language Models (LLMs) are neural networks with many parameters (typically billions or more) that are trained on large quantities of text using self-supervised or semi-supervised learning. LLMs can perform well at a wide variety of natural language processing tasks, such as text generation, text classification, question answering, and machine translation. LLMs can also capture the syntax, semantics, and general knowledge of human language, and demonstrate some special abilities that are not present in small-scale language models. Some examples of LLMs are GPT-3, BERT, T5, and ChatGPT.
(Powered by New Bing)

As LLMs (especially ChatGPT) gain popularity as an overnight sensation, the related technologies are evolving fast. We had better integrate LLMs into our workflow as soon as possible, to keep up with this trend and increase work efficiency in a solid way.

To do research, deploying a model locally seems better than just calling a hosted API. Individual researchers usually only have consumer-level GPUs for inference, so here I introduce some open-source or lightweight models that I have deployed before.

LLaMA

Here is the official repo from Facebook (now Meta); the code and academic paper are open-source, but the model weights have to be requested by filling out an online form.

Actually, the requested weights can only be loaded by the model source code in the official repo above, so the community converted the original LLaMA weights to Hugging Face format (e.g. the Decapoda Research HF page), and there are universal toolkits like text-generation-webui that can load LLaMA and other models easily.

Visit text-generation-webui/docs/LLaMA-model.md to see how to load LLaMA in text-generation-webui.

First of all, prepare the Linux environment, including the Nvidia GPU driver and Conda.
Then prepare the textgen conda environment

$ conda create -n textgen python=3.10.9
$ conda activate textgen

Install the web UI

$ git clone https://github.com/oobabooga/text-generation-webui
$ cd text-generation-webui
$ pip install -r requirements.txt

Download the model

# `download-model.py` is a good script to download models from Hugging Face
# use `python download-model.py organization/model` like:
$ python download-model.py decapoda-research/llama-7b-hf
# the target model will be saved in the `models/` folder
$ ls models/
config.yaml decapoda-research_llama-7b-hf
place-your-models-here.txt

Start with a specific model

$ python server.py --model llama-7b

Note that if you want to load 4-bit models like Neko-Institute-of-Science/LLaMA-7B-4bit-128g, extra steps may be needed.

Alpaca

Stanford Alpaca is an instruction-following language model that is fine-tuned from Meta’s LLaMA 7B model. It is developed by researchers at Stanford CRFM and can perform various tasks based on natural language instructions.

The current Alpaca model is fine-tuned from a 7B LLaMA model on 52K instruction-following data generated by the Self-Instruct techniques. In a preliminary human evaluation, Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.
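
For reference, each record in the released alpaca_data.json is a simple instruction/input/output triple. Below is a minimal sketch for peeking at the data; the field names follow the Stanford Alpaca repo, and the path assumes you have already cloned it.

# Minimal sketch: inspect the 52K instruction-following records used for fine-tuning.
# Field names (instruction/input/output) follow the Stanford Alpaca repo.
import json

with open("alpaca_data.json") as f:
    data = json.load(f)

print(len(data))              # ~52K records
print(data[0].keys())         # dict_keys(['instruction', 'input', 'output'])
print(data[0]["instruction"])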

(figure: parse analysis)

The most valuable part is its fine-tuning script, which for the first time gives us a practical chance to fine-tune an LLM ourselves.

First of all, prepare the environment

$ git clone https://github.com/tatsu-lab/stanford_alpaca
$ cd stanford_alpaca
$ conda create -n stanford python=3.10
$ conda activate stanford
$ pip install -r requirements.txt

Prepare the LLaMA weights; here are several ways to get them

# download through `git lfs clone`, but it often gets stuck; use bwm-ng to monitor bandwidth
$ git lfs install
$ git clone https://huggingface.co/decapoda-research/llama-7b-hf


# pyllama also provides a download method
$ pip install pyllama -U
$ python -m llama.download --model_size 7B,30B --folder /tmp/pyllama_data
# more at https://github.com/juncongmoo/pyllama

# a more universal method is using huggingface_hub
$ pip install huggingface_hub
$ ipython
...
In [1]: from huggingface_hub import snapshot_download

In [2]: snapshot_download(repo_id='decapoda-research/llama-7b-hf')
Fetching 42 files: 100%|██████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 8660.38it/s]
Out[2]: '/path/to/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348'
# the model is finally saved in `~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348`
# `AutoModel.from_pretrained()` would theoretically work as well

# or, with more control
In [3]: snapshot_download(repo_id='decapoda-research/llama-7b-hf',
   ...:                   repo_type="model",                  # "model" (default) or "dataset"
   ...:                   local_dir='/path/to/local/folder',  # where to store the files
   ...:                   resume_download=True,               # resume after an interruption
   ...:                   )

In [4]: snapshot_download(repo_id='decapoda-research/llama-7b-hf',
   ...:                   repo_type="dataset",                # set to "dataset" when the repo_id points to a dataset repo
   ...:                   local_dir='/path/to/local/folder',
   ...:                   resume_download=True,
   ...:                   token="hf_xxxxxxxxxxxxxx")          # not always necessary, but required by gated repos

Then finetune

### multi-GPU (parallel) training works
$ torchrun --nproc_per_node=2 --master_port=12345 train.py \
--model_name_or_path ~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348/ \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir ./output_model_path \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
--tf32 True

### single-GPU training works
$ python train.py \
--model_name_or_path ~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348/ \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir ./output_model_path \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True

### train from scratch
# $ python train.py \
# --data_path ./alpaca_data.json \
# --output_dir ./output_model_path

We will then hit ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported., which can be solved as described in huggingface/transformers#22222 (comment).

To solve it, replace LLaMATokenizer with LlamaTokenizer in the tokenizer_config.json of decapoda-research/llama-7b-hf,
or try another branch of transformers:

pip install git+https://github.com/mbehm/transformers

Then we can run this successfully.
Sadly it requires more than 40 GB of memory on a single graphics card, which rules consumer GPUs out (even the RTX 4090 only has 24 GB of memory for now).

Alpaca-LoRA

There are more lightweight ways to run LLaMA or Alpaca models, like ggerganov/llama.cpp, but Alpaca-LoRA is what really brings the capability of training an LLM to consumer GPUs.

This repository contains code for reproducing the Stanford Alpaca results using low-rank adaptation (LoRA). LoRA makes it possible to train and fine-tune on an RTX 3090 and similar cards, though the fine-tuned models are just average.
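
Under the hood, alpaca-lora relies on the PEFT library to inject the low-rank adapters. Roughly, the LoRA hyperparameters passed to finetune.py below map onto a peft.LoraConfig like this; it is a minimal sketch of the idea, not the repo's exact code.

# Minimal sketch of attaching LoRA adapters to a causal LM with PEFT;
# the values mirror the finetune.py hyperparameters shown below.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections that get adapters
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only a small fraction of weights are trainable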

Prepare the environment

$ git clone https://github.com/tloen/alpaca-lora
$ cd alpaca-lora
$ conda create -n lora python=3.10
$ conda activate lora
$ pip install -r requirements.txt

Training (finetune.py)

$ python finetune.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path './alpaca_data.json' \
--output_dir './lora-alpaca'

# or with the hyperparameters set explicitly
$ python finetune.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path './alpaca_data.json' \
--output_dir './lora-alpaca' \
--batch_size 128 \
--micro_batch_size 4 \
--num_epochs 3 \
--learning_rate 1e-4 \
--cutoff_len 512 \
--val_set_size 2000 \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--lora_target_modules '[q_proj,v_proj]' \
--train_on_inputs \
--group_by_length

Two problems show up here. One is the same tokenizer issue when loading decapoda-research/llama-7b-hf with the newest transformers, which can be solved as above.
The other concerns bitsandbytes; see issue #46.

# the error goes like:
AttributeError: /path/to/miniconda3/envs/lora/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
# to solve it:
$ cd /path/to/miniconda3/envs/lora/lib/python3.10/site-packages/bitsandbytes/
$ cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so # the CUDA version should be chosen according to the detailed error message

Eventually finetune.py runs.

Inference (generate.py)

$ python generate.py \
--load_8bit \
--base_model 'decapoda-research/llama-7b-hf' \
--lora_weights 'tloen/alpaca-lora-7b'

A Gradio web page is then available; it can host both the published LoRA weights and the models you fine-tuned yourself.
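
If you want to query the LoRA model from code instead of the Gradio page, a rough sketch looks like this; it assumes the published tloen/alpaca-lora-7b adapter and uses a simplified prompt rather than the repo's full template.

# Rough sketch: load the base LLaMA model, attach the published LoRA weights,
# and generate from a simplified Alpaca-style prompt.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
model.eval()

prompt = "### Instruction:\nName three primary colors.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))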

ChatGLM-6B

ChatGLM-6B is an open bilingual language model based on General Language Model (GLM) from THUDM. (Now you can check ChatGLM2-6B for an update)

ChatGLM-130B, based on GLM-130B, has been tested and used by many commercial companies, and achieves results roughly comparable to GPT-3 175B (davinci).

Another cool thing is CodeGeeX, a large-scale multilingual code generation model with 13 billion parameters, pre-trained on a large code corpus covering more than 20 programming languages. However, running it locally occupies quite a lot of memory, so the VS Code/JetBrains plugins are recommended instead.

Here we try to deploy ChatGLM-6B locally.

$ git clone https://github.com/THUDM/ChatGLM-6B
$ cd ChatGLM-6B
$ conda create -n chatglm python=3.10
$ conda activate chatglm
$ pip install -r requirements.txt

Model weights can be found on Hugging Face/THUDM, and we can directly run a demo

# official `gradio` way
$ python web_demo.py
# or try `streamlit`
$ pip install streamlit
$ pip install streamlit-chat
$ streamlit run web_demo2.py --server.port 6006
# or console interaction
$ python cli_demo.py

(figures: web demo, CLI demo)
API deployment

# install dependencies
$ pip install fastapi uvicorn
# then run
$ python api.py
# access it, through the CLI or code
$ curl -X POST "http://127.0.0.1:8000" \
-H 'Content-Type: application/json' \
-d '{"prompt": "你好", "history": []}'
# response
{
"response":"你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。",
"history":[["你好","你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。"]],
"status":200,
"time":"2023-03-23 21:38:40"
}
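
The same endpoint can of course be called from Python; below is a minimal sketch with requests, using the URL and port as started by api.py above.

# Minimal sketch: call the ChatGLM-6B api.py endpoint from Python.
import requests

resp = requests.post(
    "http://127.0.0.1:8000",
    json={"prompt": "你好", "history": []},
)
data = resp.json()
print(data["response"])
# pass data["history"] back as "history" in the next request to keep the conversation going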

Or generate a dialogue in Python directly

>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "你好", history=[])
>>> print(response)
你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
>>> print(response)
晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:

1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。

如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。

MOSS

MOSS is an open-source plugin-augmented conversational language model, another LLM from China.
It is also a lightweight model that can be loaded on consumer GPUs.

Prepare the environment.

$ git clone https://github.com/OpenLMLab/MOSS.git
$ cd MOSS
$ conda create --name moss python=3.8
$ conda activate moss
$ pip install -r requirements.txt

Start a web demo

# streamlit
$ streamlit run moss_web_demo_streamlit.py --server.port 8888
# gradio
$ python moss_web_demo_gradio.py
# cli
$ python moss_cli_demo.py

(figure: MOSS example)
API demo

$ python moss_api_demo.py
## curl moss
$ curl -X POST "http://localhost:19324" \
-H 'Content-Type: application/json' \
-d '{"prompt": "你是谁?"}'
# response
{"response":"\n<|Worm|>: 你好,有什么我可以帮助你的吗?","history":[["你好","\n<|Worm|>: 你好,有什么我可以帮助你的吗?"]],"status":200,"time":"2023-04-28 09:43:41","uid":"10973cfc-85d4-4b7b-a56a-238f98689d47"}
## curl moss multi-round, by filling `uid`
$ curl -X POST "http://localhost:19324" \
-H 'Content-Type: application/json' \
-d '{"prompt": "你是谁?", "uid":"10973cfc-85d4-4b7b-a56a-238f98689d47"}'

Read the README to see how to generate content in Python and for more detailed runtime parameters.
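
For a rough idea, generation with plain transformers looks something like the sketch below; the repo id fnlp/moss-moon-003-sft and the <|Human|>/<|MOSS|> prompt markers follow my reading of the MOSS README and may differ between model variants.

# Rough sketch of generating with MOSS through transformers; the prompt format
# and sampling parameters are approximations of the README, not exact values.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "fnlp/moss-moon-003-sft"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).half().cuda()
model = model.eval()

query = "<|Human|>: Hi, who are you?<eoh>\n<|MOSS|>:"
inputs = tokenizer(query, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, do_sample=True, temperature=0.7,
                             top_p=0.8, max_new_tokens=256)
# decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))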

RWKV

ChatRWKV is like ChatGPT, but powered by the RWKV (100% RNN) language model, which (as of now) is the only RNN that can match transformers in quality and scaling while being faster and saving VRAM.

There is not much content in its README.md, but it is easy to interact with the model.

# install dependencies
$ conda create --name rwkv python=3.10
$ conda activate rwkv
$ conda install numpy
$ conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
$ pip install prompt_toolkit

# run v2/chat.py
$ python v2/chat.py

Models can be found at HuggingFace/BlinkDL; different names (raven, pile, novel, …) indicate different training corpora. Download a model and modify the model path in v2/chat.py to use it.
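
Since each RWKV checkpoint is a single .pth file, hf_hub_download from huggingface_hub is handy for fetching just one of them. A minimal sketch follows; the repo id is one of BlinkDL's, and the filename is a placeholder to replace with an actual file listed in the repo.

# Minimal sketch: download one RWKV checkpoint and point v2/chat.py at it.
# The filename below is a placeholder; pick a real .pth listed in the repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="BlinkDL/rwkv-4-raven",
    filename="RWKV-4-Raven-7B-xxx.pth",  # replace with an actual checkpoint name
)
print(path)  # set this path as the model in v2/chat.py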

https://github.com/l15y/wenda is recommended for hosting web pages for interaction (it also supports LLaMA, ChatGLM, MOSS, …).

Vicuna

Vicuna-13B is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation (Chatbot Arena) using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* of the quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases.

So it seems that, as of this post, Vicuna-13B is the best-performing open-source LLM. We can deploy it using FastChat.

First of all, install FastChat via pip or from source

# pip, which is much easier
$ pip3 install fschat

# from source
$ git clone https://github.com/lm-sys/FastChat.git
$ cd FastChat
$ pip3 install --upgrade pip # enable PEP 660 support
$ pip3 install -e .

Vicuna weights are released as delta weights to comply with the LLaMA model license. We can add the released delta to the original LLaMA weights to obtain the Vicuna weights.
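
Conceptually, applying a delta just means adding each released delta tensor to the corresponding original LLaMA weight. Below is a toy sketch of that idea only; the real fastchat.model.apply_delta script also handles tokenizer and vocabulary-size details.

# Toy sketch of delta application: vicuna_weight = llama_weight + delta.
# This ignores the tokenizer/vocab handling that the real FastChat script does.
import torch

def apply_delta(base_state: dict, delta_state: dict) -> dict:
    target = {}
    for name, delta in delta_state.items():
        target[name] = base_state[name] + delta
    return target

base = {"w": torch.tensor([1.0, 2.0])}
delta = {"w": torch.tensor([0.5, -0.5])}
print(apply_delta(base, delta))  # {'w': tensor([1.5000, 1.5000])}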

Remember to download the LLaMA weights first; we can do that with snapshot_download from huggingface_hub as shown earlier.

Then convert

$ python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-7b \
--target-model-path /path/to/output/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1

FastChat can run a chat in the terminal like this:

$ python3 -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0

So we can start Vicuna the same way, just pointing --model-path at the converted Vicuna weights.

To serve using the web UI, we need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers.
(figure: FastChat serving architecture)

Launch the controller

$ python3 -m fastchat.serve.controller

Launch the model worker(s)

$ python3 -m fastchat.serve.model_worker --model-path /path/to/model/weights

To ensure that your model worker is connected to your controller properly, send a test message using the following command:

$ python3 -m fastchat.serve.test_message --model-name vicuna-7b

You will see a short output.

Launch the Gradio web server

$ python3 -m fastchat.serve.gradio_web_server

This is the user interface that users will interact with.

(figure: Vicuna web demo)

For API access, visit OpenAI-Compatible RESTful APIs & SDK, which lets us interact with the deployed model through the openai API.
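
A rough sketch of that usage with the (pre-1.0) openai Python package follows; it assumes the API server has been started with fastchat.serve.openai_api_server on port 8000 and that the model name matches what your worker registered.

# Rough sketch: talk to FastChat's OpenAI-compatible server with the openai package.
# Assumes: python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
import openai

openai.api_key = "EMPTY"                      # the local server ignores the key
openai.api_base = "http://localhost:8000/v1"

completion = openai.ChatCompletion.create(
    model="vicuna-7b-v1.1",                   # use the model name your worker registered
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
)
print(completion.choices[0].message.content)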

Summary

This blog only covers some of the featured models, like the first popular open-source LLM LLaMA and its variants (Alpaca, Vicuna), the low-cost training/fine-tuning method Alpaca-LoRA, and models from China: ChatGLM, MOSS and RWKV.

See more LLMs that we might be able to deploy at FindTheChatGPTer.

More online sites for trying models can be found at gpt4free and Awesome Free ChatGPT.


Aside from these Large Language Models, there are also other kinds of interesting models, like multi/cross-modality and fusion models, that we can give a try.

LLaVA

https://github.com/haotian-liu/LLaVA

MiniGPT-4

https://github.com/Vision-CAIR/MiniGPT-4

(figures: MiniGPT-4 overview, online demo)

VisualGLM-6B

https://github.com/THUDM/VisualGLM-6B

(figures: VisualGLM-6B overview, chat example, web demo)

Stable Diffusion

https://github.com/Stability-AI/stablediffusion

See more painting models at https://github.com/hua1995116/awesome-ai-painting
