Getting Started with Large Models

Inference with a Pretrained Large Model

  • 1 Install the required libraries
pip install torch torchvision transformers
  • 2 Import the required libraries and load the pretrained model and its tokenizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Initialize the model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

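Here, 'gpt2' is the model name. It identifies a pretrained model that can be downloaded from the Hugging Face model hub. GPT2Tokenizer converts the input text into a form the model can process, and GPT2LMHeadModel is the model we will use.

Alternatively, download the model files and load them from a local directory. The following files are needed: pytorch_model.bin, config.json, generation_config.json, merges.txt, tokenizer.json, vocab.json.
Loading from a local path looks like this: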

from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Initialize the model and tokenizer from the local directory
tokenizer = GPT2Tokenizer.from_pretrained('local_path_to_model')
model = GPT2LMHeadModel.from_pretrained('local_path_to_model')
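  • 3 Set the inference parameters, such as the length of the generated text and the temperature
# Set the inference parameters
max_length = 100  # maximum length of the generated text, in tokens
temperature = 1.0  # controls randomness: larger values give more random text
num_return_sequences = 1  # number of generated sequences to return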
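To make the effect of temperature concrete, here is a minimal sketch in plain Python of temperature-scaled softmax, the usual way a temperature is applied to next-token scores. The function name and the three logits are hypothetical and chosen only for illustration; this is not the library's internal code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the highest-scoring token, while a higher temperature flattens the distribution, so sampled text becomes more varied.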
  • 4 Encode the input text into a form the model can process
# Encode the input text
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

return_tensors='pt' returns PyTorch tensors; return_tensors='tf' returns TensorFlow tensors.

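  • 5 Generate text with the model
# Run inference with the model
# Note: temperature only takes effect when sampling is enabled (do_sample=True);
# without it, generate() uses greedy decoding and the output is deterministic.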
output = model.generate(
    input_ids,
    max_length=max_length,
    temperature=temperature,
    num_return_sequences=num_return_sequences,
)
  • 6 Decode the output into human-readable text
# Decode the generated text
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

Here output[0] selects the first generated sequence for decoding.
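The encode and decode steps are inverses of each other. The sketch below illustrates that round trip with a toy word-level tokenizer. ToyTokenizer and its "<eos>" token are hypothetical stand-ins written for this example; the real GPT-2 tokenizer uses byte-level BPE and a much larger vocabulary.

```python
class ToyTokenizer:
    """Word-level stand-in for a real tokenizer (illustration only)."""

    def __init__(self, words, eos_token="<eos>"):
        self.eos_token = eos_token
        # ID 0 is reserved for the special end-of-sequence token
        self.id_to_token = [eos_token] + sorted(set(words))
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        """Map each whitespace-separated word to its integer ID."""
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids, skip_special_tokens=False):
        """Map IDs back to words, optionally dropping the special token."""
        tokens = [self.id_to_token[i] for i in ids]
        if skip_special_tokens:
            tokens = [t for t in tokens if t != self.eos_token]
        return " ".join(tokens)

tok = ToyTokenizer("Once upon a time".split())
ids = tok.encode("Once upon a time")
generated = ids + [tok.token_to_id["<eos>"]]  # pretend the model appended <eos>
print(ids)
print(tok.decode(generated, skip_special_tokens=True))
```

With skip_special_tokens=True the end-of-sequence marker is dropped, which is why the decoded text in step 6 contains only ordinary words.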

  • 7 Putting the code above together
# Import the required libraries
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def gpt2_inference(input_text, model_name='gpt2', max_length=100, temperature=1.0, num_return_sequences=1):
    """
    Run inference with a GPT-2 model.

    Args:
    input_text (str): The prompt that the model continues.
    model_name (str): Name of the pretrained model. Defaults to 'gpt2'.
    max_length (int): Maximum length of the generated text in tokens, prompt included. Defaults to 100.
    temperature (float): Controls randomness; larger values give more random text. Defaults to 1.0.
    num_return_sequences (int): Number of sequences to generate. Defaults to 1.

    Returns:
    output_text (str): The first generated sequence, decoded to text.
    """

    # Initialize the model and tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    # Encode the input text
    input_ids = tokenizer.encode(input_text, return_tensors='pt')

    # Run inference with the model
    output = model.generate(
        input_ids,
        max_length=max_length,
        temperature=temperature,
        num_return_sequences=num_return_sequences,
    )

    # Decode the generated text
    output_text = tokenizer.decode(output[0], skip_special_tokens=True)

    return output_text


if __name__ == "__main__":
    model_name = 'gpt2'
    max_length = 100
    temperature = 1.0
    num_return_sequences = 1

    # Run inference with the function, passing the settings defined above
    input_text = "Hey you guys, how is the weather in NewYork?"
    output_text = gpt2_inference(
        input_text,
        model_name=model_name,
        max_length=max_length,
        temperature=temperature,
        num_return_sequences=num_return_sequences,
    )

    print(output_text)

This produces the following output (the quality is rather poor):

Hey you guys, how is the weather in NewYork?

I'm not sure. I'm not sure if it's good or bad. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure. I'm not sure.
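Repetition of this kind is typical of greedy decoding, which always picks the single most likely next token and can therefore fall into a loop. The sketch below reproduces the effect on a toy next-token table. The table NEXT and its probabilities are invented for illustration and have nothing to do with GPT-2's real distribution.

```python
import random

# Hypothetical next-token probabilities, invented for this example
NEXT = {
    "I'm": {"not": 0.6, "fine.": 0.4},
    "not": {"sure.": 0.9, "bad.": 0.1},
    "sure.": {"I'm": 0.7, "Maybe.": 0.3},
    "fine.": {"I'm": 1.0},
    "bad.": {"I'm": 1.0},
    "Maybe.": {"I'm": 1.0},
}

def generate(start, steps, sample=False, seed=None):
    """Extend `start` by `steps` tokens, greedily or by weighted sampling."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(steps):
        choices = NEXT[tokens[-1]]
        if sample:
            words = list(choices)
            weights = [choices[w] for w in words]
            nxt = rng.choices(words, weights=weights, k=1)[0]
        else:
            # Greedy decoding: always take the most probable next token
            nxt = max(choices, key=choices.get)
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("I'm", 8))               # greedy: cycles through the same phrase
print(generate("I'm", 8, sample=True))  # sampled: can leave the cycle
```

In transformers, sampling is switched on by passing do_sample=True to generate(), and only then does the temperature setting influence the result.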

Fine-Tuning a Large Model (Full-Parameter)

  • 1 Create your own dataset, train.txt
    with the following content:
Hello, how are you?
I am fine, thank you.
What is your name?
My name is ChatGPT.
What can you do?
I can help you with various tasks, such as answering questions, writing emails, and more.
What is the weather like today?
I am sorry, as an AI, I don't have real-time capabilities.
Can you tell me a joke?
Sure, why don't scientists trust atoms? Because they make up everything!
Can you write a poem?
In the heart of the night, under the soft moonlight, the world is full of wonder, with stars shining bright.

Each line here is one training sample.
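As a rough illustration of what "one sample per line" means, the sketch below reads a text file and keeps every non-empty line as a separate sample. The helper read_samples is hypothetical and written for this example; it only approximates how a line-by-line dataset splits a file and does no tokenization.

```python
import os
import tempfile

def read_samples(path):
    """Return one training sample per non-empty line of the file."""
    with open(path, encoding="utf-8") as f:
        return [line for line in f.read().splitlines() if line and not line.isspace()]

# Write a small file to a temporary directory so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Hello, how are you?\nI am fine, thank you.\n\nWhat is your name?\n")
    samples = read_samples(path)

print(len(samples))
for sample in samples:
    print(repr(sample))
```

One consequence worth noting: a question and its answer sit on different lines, so they end up in different samples and the model never sees them together as a pair.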

  • 2 Load the private dataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",  # path to the private dataset
    block_size=128,  # maximum number of tokens per sample
)
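  • 3 Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,  # number of training epochs
    per_device_train_batch_size=1,  # number of samples per batch on each device
    save_steps=10,  # save a checkpoint every 10 steps
    save_total_limit=2,  # keep at most 2 checkpoints
)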
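  • 4 Define how samples are batched during training
def data_collator(examples):
    # torch.stack needs tensors of equal length; this holds here because the batch size is 1.
    # For causal language modeling, the labels are the input IDs themselves.
    return {
        'input_ids': torch.stack([f['input_ids'] for f in examples]),
        'labels': torch.stack([f['input_ids'] for f in examples]),
    }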
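With a batch size larger than 1, lines of different lengths can no longer be stacked directly and must be padded first. The following is a minimal sketch of that padding logic using plain Python lists. The function pad_batch and the pad_id value are hypothetical; -100 is the label value that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss.

```python
def pad_batch(sequences, pad_id, ignore_index=-100):
    """Pad token-ID lists to a common length and build matching labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)  # 0 marks padding
        labels.append(list(seq) + [ignore_index] * pad)    # padding is ignored by the loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Two hypothetical tokenized lines of different lengths
batch = pad_batch([[11, 12, 13], [21]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

In a real training script the three lists would be converted to tensors before being handed to the model.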
  • 5 Build the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
  • 6 Start training
trainer.train()
  • 7 Run inference on the earlier input again with the fine-tuned model
model = GPT2LMHeadModel.from_pretrained('./results/checkpoint-10')
output = model.generate(input_ids, max_length=max_length, temperature=temperature)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
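The folder name checkpoint-10 follows from the training settings. The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, and that the last partial batch is kept. The helper checkpoint_steps is hypothetical and only mirrors the step counting; it does not call the Trainer.

```python
import math

def checkpoint_steps(num_samples, batch_size, epochs, save_steps):
    """Return the total number of training steps and the steps with a periodic save."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    saved_at = list(range(save_steps, total_steps + 1, save_steps))
    return total_steps, saved_at

# train.txt has 12 lines; batch size 1, 1 epoch, save_steps=10 as configured above
total, saved = checkpoint_steps(num_samples=12, batch_size=1, epochs=1, save_steps=10)
print(total, saved)
```

Under these assumptions training runs for 12 steps and the only periodic checkpoint is written at step 10, so the last two steps are not part of the loaded checkpoint.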
  • 8 The complete fine-tuning code
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, LineByLineTextDataset, Trainer, TrainingArguments

# Step 1: Load pre-trained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(r'your\path\to\gpt2')
model = GPT2LMHeadModel.from_pretrained(r'your\path\to\gpt2')

# Step 2: Define parameters for generation
max_length = 100   # Maximum length of the output text
temperature = 1.0  # Temperature parameter for controlling randomness in outputs

# Step 3: Prepare input for the model
input_text = "Hello, how are you?"  # The input text for the model
input_ids = tokenizer.encode(input_text, return_tensors='pt')  # Convert the input text to its corresponding input IDs

# Step 4: Generate output
output = model.generate(input_ids, max_length=max_length, temperature=temperature)

# Step 5: Decode the output
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

# Step 6: Fine-tuning on new dataset

# Load the dataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=128,
)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=10,
    save_total_limit=2,
)

# Define a function to collate data during training
def data_collator(examples):
    return {
        'input_ids': torch.stack([f['input_ids'] for f in examples]),
        'labels': torch.stack([f['input_ids'] for f in examples]),
    }

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Start training
trainer.train()

# After fine-tuning, you can use the fine-tuned model for inference as before
model = GPT2LMHeadModel.from_pretrained('./results/checkpoint-10')
output = model.generate(input_ids, max_length=max_length, temperature=temperature)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

Output before fine-tuning:

Hello, how are you?

I'm a little bit of a nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd. I'm a big nerd

Output after fine-tuning:

Hello, how are you?

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm fine.

I'm

Although the output is still poor, it has clearly moved toward the content of the fine-tuning dataset.


https://www.lihaibao.cn/2024/01/26/大模型入门/
Author: Seal Li
Posted on: January 26, 2024