2025 – Samuel 拾光札记

[Review] LongCat-Flash

2025-11-01 16:55 | 218 | 0 | LLM Reports | 1931 characters | 8 minutes

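Title: LongCat-Flash Technical Report[1] arXiv GitHub LongCat-Flash is a model with 560B total parameters that activates 18.6B–31.3B of them, about 27B on average. It emphasizes computational efficiency and agent capability. It adopts two distinct architectural designs: zero-computation experts and shortcu…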

LLM MoE

[Review] DeepSeek V3.2 Exp

2025-10-12 11:20 | 291 | 0 | LLM Reports | 811 characters | 4 minutes

Title: DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention[1] Paper GitHub The model uses DeepSeek Sparse Attention to cut inference cost substantially without a noticeable drop in accuracy. 1. Model architecture and DeepSeek V…

Attention LLM

[Review] On-Policy Distillation

2025-10-06 9:36 | 314 | 0 | ICLR | 681 characters | 4 minutes

Title: On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes[1] FROM ICLR 2024 Google DeepMind arXiv In general KD (Knowledge Distillation) methods, the teacher-model output and student-model output distributions…

Distillation LLM

[Review] ACEBench: A Benchmark for Evaluating LLM Tool Calling

2025-10-03 14:14 | 492 | 0 | arXiv | 1446 characters | 6 minutes

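Title: ACEBench: Who Wins the Match Point in Tool Usage?[1] FROM arXiv 2025 Foreword: this paper is about the advantages of ACEBench over other benchmarks, and it covers ACEBench's data construction method and data structure. I mainly want to use it to introduce approaches to data construction. Although this post is limited to AC…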

Benchmark LLM

Data Construction Pipelines of Mainstream LLMs

2025-9-16 12:59 | 329 | 0 | Original | 5974 characters | 24 minutes

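1. Qwen3 1. Pre-training Qwen2.5-VL is fine-tuned to extract text from PDFs. The extracted text is refined with Qwen2.5 to improve data quality. Qwen2.5, Qwen2.5-Math, and Qwen2.5-Coder are used to synthesize domain-specific data such as text, question-answer pairs, instructions, and code snippets. Domain-specific models are used to synthesize data: Qwen2.5-Math and Qw…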

LLM

[Review] FoT, a CoT Variant

2025-8-23 14:29 | 401 | 0 | ICML | 1573 characters | 7 minutes

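Title: Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning[1] FROM ICML 2025 Huawei Noah's Ark Lab arXiv GitHub The authors propose the FoT (Forest-of-Thought) reasoning framework. Key feature: it uses multiple ToTs for collective decision-making, improving reasoning…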

[Review] DISTILLM-2

2025-8-20 21:54 | 347 | 0 | ICML | 2152 characters | 10 minutes

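Title: DISTILLM-2: A Contrastive Approach Boosts the Distillation of LLMs[1] FROM ICML 2025 oral arXiv GitHub In the development of large language models, model distillation is key to balancing high performance against low deployment cost. DISTILLM-2, with its novel contrastive learning loss…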

Distillation LLM

[Review] MoLE

2025-8-20 21:45 | 312 | 0 | ICML | 799 characters | 4 minutes

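Title: Mixture of Lookup Experts[1] FROM ICML 2025 oral arXiv GitHub MoE models activate only a subset of experts at inference time, yet every expert must still be loaded into memory, which consumes a large amount of GPU memory. Loading only the activated experts instead would increase inference latency. The authors therefore propose Mixture of Lookup Experts (MoL…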

LLM MoE

[Review] rStar-Math

2025-8-20 21:35 | 308 | 0 | ICML | 816 characters | 4 minutes

Title: rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking[1] FROM ICML 2025 oral arXiv GitHub rStar-Math greatly improves the math reasoning ability of small language models (SLMs). e.g. Qwen2.5-Math-7B 5…

LLM Self-evolving

[Review] Small Data Triggers a Large Shift

2025-8-19 22:50 | 291 | 0 | ICML | 720 characters | 4 minutes

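Title: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs[1] FROM ICML 2025 oral arXiv GitHub 👍 The paper opens with a red ⚠️ to flag that it contains model-generated content some readers may find disturbing. For most models, just a small amount of insecur…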

LLM SFT

Archives

Categories

Yearly archive: 2025
