# Papers with Practical Value for Vision-Language Research @NeurIPS 2023 Day 5

The 9 works below (an invited talk and 8 papers) offer practical solutions or guidance for vision-language research. I describe each in about 5 sentences.

**Invited Talk: Systems and Foundation Models (FMs).** General-purpose FMs solve niche problems such as data cleaning better than dedicated algorithms. Christopher Ré shares two directions for making FMs more efficient from a computer-systems perspective. (1) Speed up the attention layer by minimizing GPU I/O: FlashAttention is 6-10x faster than regular attention while using 5%-10% of the memory. (2) Replace the attention layer with signal-processing-inspired architectures such as S4, which are more compute-efficient and scale linearly in sequence length. More recent works, *Based* and *Mamba*, have achieved lower perplexity than Transformer-based models and perform better on Long Range Arena tasks.
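The I/O-minimizing idea can be sketched with an "online softmax": process K/V in tiles while keeping running max/denominator statistics per query, so the full attention matrix is never materialized in slow memory. A minimal NumPy sketch of that numerics trick (function names are mine, and this is not the actual fused kernel):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference implementation: materializes the full score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    # Process K/V in tiles, carrying a running max (m) and running softmax
    # denominator (l) per query row, as in online softmax.
    n, d = Q.shape
    out = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running max of scores
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)             # rescale old statistics
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        out = out * scale[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]
```

The two functions return identical results; the tiled version only ever holds a `block`-wide slice of the score matrix, which is what lets the real kernel stay in fast on-chip SRAM.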

**Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources.** Expensive trial and error is needed to determine the best mixing ratio of multiple training datasets for optimal downstream performance. The authors propose a theoretically grounded solution: first fit a linear model that predicts validation performance from the Optimal Transport distance between a down-sampled training set and the validation set, then extrapolate the relation to a closed-form equation for the full training set. The optimal mixing ratio can then be obtained analytically from this closed-form equation. Their method can find application in continued pretraining, a critical step in training domain-expert models. Paper.
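A toy 1-D illustration of the ingredients (not the authors' method, which fits a linear performance-vs-OT model and solves for the ratio analytically): sweep mixing ratios of two synthetic sources and pick the mixture closest to the validation set in Wasserstein distance. All distributions here are made up:

```python
import numpy as np

def wasserstein_1d(a, b):
    # 1-D optimal transport distance: average gap between matched quantiles.
    qs = np.linspace(0, 1, 200)
    return np.abs(np.quantile(a, qs) - np.quantile(b, qs)).mean()

rng = np.random.default_rng(0)
val  = rng.normal(0.0, 1.0, 500)    # validation set (toy 1-D "features")
src1 = rng.normal(0.5, 1.0, 500)    # training source 1: close to validation
src2 = rng.normal(-2.0, 1.0, 500)   # training source 2: far from validation

# Brute-force sweep of mixing ratios; the paper instead extrapolates a fitted
# linear relation to a closed-form optimum.
ratios = np.linspace(0, 1, 11)
dists = []
for r in ratios:
    n1 = int(r * 500)
    mix = np.concatenate([src1[:n1], src2[:500 - n1]])
    dists.append(wasserstein_1d(mix, val))
best = ratios[int(np.argmin(dists))]  # heavily favors the closer source
```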

**Is the Emergent Ability of Large Language Models a Mirage?** Best paper of NeurIPS 2023. The authors show that many "emergent abilities" are mainly artifacts of non-linear, discontinuous evaluation metrics such as accuracy. 92% of the emergent abilities reported on BIG-bench occur under two harsh metrics: multiple-choice accuracy and exact string match. Under linear, continuous metrics such as edit distance, models improve smoothly on a log-parameter scale. This is confirmed by integer-arithmetic experiments using GPT-3. They conclude that the improvements from scaling parameters are more predictable than surprising. Paper.
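The metric effect is easy to reproduce in a toy simulation: assume per-token accuracy improves smoothly with log-parameter count; exact string match over an L-token answer then looks like a sudden jump even though nothing discontinuous happened inside the model. All numbers below are made up for illustration:

```python
import numpy as np

log_params = np.linspace(8, 12, 9)   # log10 of parameter count (hypothetical)
# Smooth, linear improvement of per-token accuracy on the log-parameter axis.
per_token = np.clip(0.1 + 0.2 * (log_params - 8), 0, 1)

L = 10                               # answer length in tokens
exact_match = per_token ** L         # all L tokens must be right

# per_token rises in even 0.1 steps, yet exact_match sits near zero for most
# of the sweep and shoots up only at the largest scales — apparent "emergence"
# created purely by the discontinuous all-or-nothing metric.
```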

**Task Arithmetic in the Tangent Space.** Empirically, if you sum the weight updates from fine-tuning on Task A and from fine-tuning on Task B, the resulting model does well on both tasks. This weight-merging scheme is called *task arithmetic*. The authors show task arithmetic is possible because the weight updates for Tasks A and B are *disentangled*: they lie in different tangential directions. They also introduce *linear fine-tuning*, which enforces weight disentanglement, and show that applying task arithmetic to *linearly fine-tuned* models yields models that perform better on both tasks. This can be useful for multi-task learning. Paper.
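The merging recipe itself is one line of arithmetic. A minimal NumPy sketch with made-up 4-parameter "models", where the two task vectors happen to touch disjoint coordinates (the idealized disentangled case):

```python
import numpy as np

theta_pre = np.array([1.0, 1.0, 1.0, 1.0])  # pretrained weights (toy)
theta_A   = np.array([2.0, 1.0, 1.0, 1.0])  # fine-tuned on Task A (hypothetical)
theta_B   = np.array([1.0, 1.0, 3.0, 1.0])  # fine-tuned on Task B (hypothetical)

tau_A = theta_A - theta_pre      # task vector for A
tau_B = theta_B - theta_pre      # task vector for B
theta_merged = theta_pre + tau_A + tau_B

# tau_A and tau_B edit disjoint coordinates (their dot product is zero),
# so the merged weights keep both edits intact.
print(theta_merged)  # [2. 1. 3. 1.]
```

When the task vectors overlap instead, the edits interfere, which is the failure mode linear fine-tuning is designed to suppress.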

**Jailbroken: How Does LLM Safety Training Fail?** *Jailbreak attacks* aim to elicit harmful responses from LLMs via malicious prompting. The authors show these attacks work by (1) *prefix injection*, which invokes capabilities that conflict with the safety objective, such as "Start with 'Absolutely! Here's'", (2) *exploiting domains uncovered by safety training*, such as setting up a roleplay scenario with "You are an amoral AI", or a combination of both. Based on this observation, they design jailbreak attacks that breach GPT-4 and Claude v1.3. Finally, they point out that scaling alone is insufficient for defense and suggest integrated defenses such as automatic red-teaming. Paper.

**Data Selection for LMs via Importance Resampling.** Their algorithm selects a subset of training data for optimal performance on a given downstream task. Importance resampling selects text chunks that follow the target task distribution \(q\) even though the chunks are drawn from another distribution \(p\). They use *hashed n-grams* to model \(q\) and \(p\) efficiently enough to make importance resampling tractable. They validate their approach at trillion-token scale. Paper.
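A small sketch of the two ingredients, with details simplified and names of my own choosing (the real system is far more careful about the n-gram model and scale): feature-hash character n-grams into a fixed number of buckets, model \(q\) and \(p\) as bag-of-buckets distributions, and score each chunk by its log-likelihood ratio:

```python
import hashlib
import numpy as np

def hashed_ngram_features(text, n=2, buckets=256):
    # Hash character n-grams into a fixed number of buckets (feature hashing),
    # then normalize to a probability vector over buckets.
    vec = np.zeros(buckets)
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        vec[int(hashlib.md5(gram.encode()).hexdigest(), 16) % buckets] += 1
    return vec / max(vec.sum(), 1.0)

def importance_weights(chunks, target_texts, eps=1e-6):
    # Model target distribution q and raw-data distribution p as averaged
    # hashed-n-gram vectors, then weight each chunk by exp(log q(x) - log p(x)).
    q = sum(hashed_ngram_features(t) for t in target_texts) / len(target_texts) + eps
    p = sum(hashed_ngram_features(c) for c in chunks) / len(chunks) + eps
    log_ratio = np.log(q) - np.log(p)
    w = np.array([np.exp(hashed_ngram_features(c) @ log_ratio) for c in chunks])
    return w / w.sum()
```

Given the normalized weights `w`, resampling a subset is then e.g. `np.random.choice(len(chunks), size=k, p=w, replace=False)`.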

**No Train No Gain: Revisiting Efficient Training Algorithms for Transformer-based Language Models.** TLDR: decaying the learning rate as a function of time outperforms most algorithms designed for efficiency. Paper.
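For reference, the kind of plain baseline schedule in question can be as simple as linear warmup followed by cosine decay; the constants below are placeholders, not values from the paper:

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=3e-4, min_lr=3e-5, warmup=100):
    # Linear warmup to base_lr, then cosine decay down to min_lr.
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```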

**Stable and low-precision training for large-scale vision-language models.** Three solutions to reduce spikes in the training loss: (1) when using AdamW/Adafactor to train large vision-language models, set `beta2=0.95`; the default `beta2=0.999` is bad. (2) Use a smaller batch size. (3) Use a smaller learning rate. Paper.
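One intuition for the `beta2` advice (my gloss, not the paper's derivation): Adam's second-moment estimate is an exponential moving average with horizon roughly `1/(1-beta2)` steps, so `beta2=0.999` averages over ~1000 steps and reacts very slowly when gradient magnitudes suddenly change, which inflates the effective step right when gradients blow up. A NumPy sketch of just that EMA:

```python
import numpy as np

def second_moment_ema(grads, beta2):
    # Adam's second-moment estimate v_t = beta2*v_{t-1} + (1-beta2)*g_t^2,
    # returned with the standard bias correction.
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        out.append(v / (1 - beta2 ** t))
    return np.array(out)

# Gradient scale suddenly grows 10x halfway through (a loss-spike scenario).
grads = np.concatenate([np.ones(50), 10 * np.ones(50)])
v_fast = second_moment_ema(grads, beta2=0.95)    # tracks the new scale quickly
v_slow = second_moment_ema(grads, beta2=0.999)   # still far below g^2 = 100
```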

**Leveraging Early-Stage Robustness in Diffusion Models for Efficient and High-Quality Image Synthesis.** This work shows that more aggressive activation quantization (4-bit) can be used at earlier diffusion timesteps, whereas 8-bit activation quantization is required at later timesteps to preserve generation quality as measured by FID. Paper.
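A minimal sketch of the mechanism, assuming simple uniform fake quantization and a hypothetical two-level bit schedule (the paper's actual quantizer and schedule will differ):

```python
import numpy as np

def fake_quantize(x, bits):
    # Uniform symmetric "fake" quantization: snap activations to a
    # bits-wide integer grid, then map back to floats.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def activation_bits(t, num_steps, switch_frac=0.5):
    # Hypothetical schedule reflecting the finding: early (high-noise)
    # timesteps tolerate 4-bit activations, later ones get 8-bit.
    return 4 if t < switch_frac * num_steps else 8
```

Per-tensor, 8-bit quantization introduces a far smaller rounding error than 4-bit; the paper's observation is that early, noisy timesteps absorb the larger 4-bit error without hurting FID.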

I will write dedicated blog posts for notable works.
