FACILITY 7

AI Model Testing Laboratory

Domestic Inference Benchmarking — Industrial Standard

Model Test Results

FORM 734-B
DWG: AMTL-2025-001
Date / Shift: 2025-01-15, 03:47 (Night Shift)
Model Specification: Llama 3.1 8B Instruct (Meta AI Consortium)
Hardware Configuration: NVIDIA GeForce RTX 3090; Ollama v0.4.0 runtime; Q4_K_M quantization; 24GB GDDR6X
Test Protocols: MMLU-Pro, GPQA, IFEval
Engineer's Notes: VRAM utilization peaked at 14.2GB. Throughput held at 47 tok/s sustained. Six hours of continuous operation without thermal throttling or OOM termination. Quantization loss within acceptable parameters for reasoning tasks.

Date / Shift: 2025-01-12, 22:15 (Night Shift)
Model Specification: Qwen2.5 14B Instruct (Alibaba DAMO Academy)
Hardware Configuration: RTX 3090 + 64GB System RAM; llama.cpp build 4000; Q5_K_M quantization; mixed GPU/CPU inference
Test Protocols: HumanEval, MBPP, MultiPL-E, LiveCodeBench
Engineer's Notes: Eight transformer layers offloaded to system memory. Throughput dropped to 23 tok/s, bottlenecked by the CPU. Code generation exceptional in C++ and Python. Rust compilation tasks failed due to context window exhaustion, not model incompetence.

Date / Shift: 2025-01-08, 17:33 (Day Shift)
Model Specification: Mistral Small 24B Instruct (Mistral AI, Paris, France)
Hardware Configuration: Dual RTX 3090 SLI-NVLink; vLLM 0.6.5 engine; BF16 native precision; 48GB aggregate VRAM
Test Protocols: MATH, GSM8K, BBH
Engineer's Notes: The 24B-parameter model fits within the 48GB aggregate VRAM at native BF16 precision, without quantization. 89 tok/s sustained, the highest throughput recorded. Chain-of-thought reasoning substantially outperforms all quantized alternatives tested to date. NVLink inter-GPU latency negligible.
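
The memory figures in the entries above can be sanity-checked with a rough weights-only estimate: parameter count times bits per weight, divided by eight. The sketch below is illustrative only and is not part of the facility's test procedure. The function name is hypothetical, the parameter counts are the nominal sizes from the model names, and the bits-per-weight values for Q4_K_M and Q5_K_M are approximate assumptions that vary by model and llama.cpp version. The estimate excludes KV cache, activations, and runtime overhead, so a measured peak such as the 14.2GB above will be higher than the weights-only number.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes.

    Excludes KV cache, activations, and runtime overhead.
    """
    # (params_billions * 1e9 weights) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


# Nominal sizes taken from the model names; bits-per-weight values for the
# K-quants are rough assumptions, not measured figures.
configs = [
    ("Llama 3.1 8B Instruct, Q4_K_M", 8, 4.85),
    ("Qwen2.5 14B Instruct, Q5_K_M", 14, 5.69),
    ("Mistral Small 24B Instruct, BF16", 24, 16.0),
]

for name, params_b, bpw in configs:
    print(f"{name}: ~{weight_footprint_gb(params_b, bpw):.1f} GB of weights")
```

Under these assumptions the BF16 case works out to 48 GB of weights for a nominal 24B parameters, which suggests why that configuration needs the full dual-GPU aggregate and leaves little headroom for anything beyond the weights.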

Auxiliary Testing Chamber

Roleplaying evaluation protocols, creative generation stress tests, character consistency verification procedures, and experimental prompt engineering trials. These materials resist standard quantitative analysis but contribute essential qualitative data to the model assessment matrix.
