[{"content":" Tokenization # Intro # The process of encoding strings into tokens. A Tokenizer is a class that implements the encode and decode methods.\nassert [15496, 11, 995, 0] == Tokenizer.encode(\u0026#34;Hello, 🌍! 你好!\u0026#34;) assert \u0026#34;Hello, 🌍! 你好!\u0026#34; == Tokenizer.decode([15496, 11, 995, 0]) Vocabulary size: the number of possible tokens. Spaces are also part of tokens \u0026ldquo;hello world, hello world\u0026rdquo; -\u0026gt; [\u0026ldquo;hello\u0026rdquo;, \u0026quot; world\u0026quot;, \u0026ldquo;,\u0026rdquo;, \u0026quot; hello\u0026quot;, \u0026quot; world\u0026quot;] -\u0026gt; [24912, 2375, 11, 40617, 2375] In BPE tokenizer, spaces are put in front intentionally during the pre-tokenization process. Compression ratio = Number of bytes / Number of tokens Character-based tokenization # Converting each Unicode character into a code point (integer)\nProblem 1: this is a very large vocabulary (around 150K Unicode characters) Problem 2: many characters are quite rare (e.g., 🫪) Compression ratio is around 1.53 (due to those non-ASCII characters that uses 2, 3, 4 bytes in UTF-8, e.g., len(你) = 3 bytes) Byte-based tokenization # Converting each byte (typical in UTF-8) into a code point (integer, which is between 0 and 255) Vocabulary = 256, which is very small. 
Problem: Compression ratio = 1, which is terrible, leading to too long sequences (attention is quadratic)\nWord-based tokenizer # Converting each word into a code point\nProblems:\nVocabulary is unbounded The model won\u0026rsquo;t learn much about those many rare words Byte Pair Encoding (BPE) # Wikipedia\nBasic idea: train the tokenizer on raw text to automatically determine the vocabulary Intuition: common sequences of characters are represented by a single token, rare sequences are represented by many tokens Sketch: start with each byte as a token, then successively merge the most common pairs of adjacent tokens\nCode below comes from stanford-cs336/lectures\ntokenizer = BPETokenizer(params) string = \u0026#34;the quick brown fox\u0026#34; indices = tokenizer.encode(string) reconstructed_string = tokenizer.decode(indices) assert string == reconstructed_string def train_bpe(string: str, num_merges: int) -\u0026gt; BPETokenizerParams: text(\u0026#34;Start with the list of bytes of `string`.\u0026#34;) indices = list(map(int, string.encode(\u0026#34;utf-8\u0026#34;))) merges: dict[tuple[int, int], int] = {} # index1, index2 =\u0026gt; merged index vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)} # index -\u0026gt; bytes for i in range(num_merges): # Count the number of occurrences of each pair of tokens counts = count_adjacent_pairs(indices) # Find the most common pair pair = max(counts, key=counts.get) # Merge that pair new_index = 256 + i # @inspect new_index merges[pair] = new_index # @inspect merges vocab[new_index] = vocab[pair[0]] + vocab[pair[1]] indices = merge(indices, pair, new_index) compression_ratio = get_compression_ratio(string, indices) return BPETokenizerParams(vocab=vocab, merges=merges) @dataclass(frozen=True) class BPETokenizerParams: \u0026#34;\u0026#34;\u0026#34;All you need to specify a BPETokenizer.\u0026#34;\u0026#34;\u0026#34; vocab: dict[int, bytes] # index -\u0026gt; bytes merges: dict[tuple[int, int], int] # index1,index2 
-\u0026gt; new_index class BPETokenizer(Tokenizer): \u0026#34;\u0026#34;\u0026#34;BPE tokenizer given a set of merges and a vocabulary.\u0026#34;\u0026#34;\u0026#34; def __init__(self, params: BPETokenizerParams): self.params = params def encode(self, string: str) -\u0026gt; list[int]: indices = list(map(int, string.encode(\u0026#34;utf-8\u0026#34;))) # Note: this is a very slow implementation for pair, new_index in self.params.merges.items(): indices = merge(indices, pair, new_index) return indices def decode(self, indices: list[int]) -\u0026gt; str: bytes_list = list(map(self.params.vocab.get, indices)) string = b\u0026#34;\u0026#34;.join(bytes_list).decode(\u0026#34;utf-8\u0026#34;) return string ","date":"2026年4月1日","externalUrl":null,"permalink":"/note/stanford-cs336/cs336-lecture-1---tokenization/","section":"笔记","summary":"","title":"CS336 Lecture 1 - Tokenization","type":"note"},{"content":" Tensors basics # Tensors store parameters, gradients, optimizer state, data, and activations. PyTorch docs on tensors\nHow to initialize a tensor in PyTorch:\nx = torch.tensor([[1., 2, 3], [4, 5, 6]]) x = torch.zeros(4, 8) x = torch.ones(4, 8) x = torch.randn(4, 8) x = torch.empty(4, 8) nn.init.trunc_normal_(x, mean=0, std=1, a=-2, b=2) Each tensor has a rank, which is the number of dimensions.\nx = torch.zeros(4) # rank 1 tensor (vector) x = torch.zeros(4, 8) # rank 2 tensor (matrix) x = torch.zeros(4, 8, 2) # rank 3 tensor In Transformers, we will see tensors of rank 4:\nB = 32 # Batch size S = 16 # Sequence length H = 16 # Number of heads D = 64 # Hidden dimension per head x = torch.zeros(B, S, H, D) Tensors memory # Almost everything in a tensor is stored as floating point numbers\nfloat32 # float32 is the default data type for a PyTorch tensor\nfloat16 # x = torch.tensor([1e-8], dtype=torch.float16) assert x == 0 # Underflow! bfloat16 # Google Brain developed bfloat (brain floating point) in 2018 to address this issue. 
bfloat16 uses the same memory as float16 but has the same dynamic range as float32! The only catch is that the resolution is worse, but this matters less for deep learning.\nx = torch.tensor([1e-8], dtype=torch.bfloat16) assert x != 0 # No underflow! Mixed precision # Implications on training:\nTraining with fp32 works, but requires lots of memory. Training with fp16 and even bf16 is risky, and you can get instability. Solution: mixed precision training [Micikevicius+ 2017]\nUse bf16 for parameters, activations, and gradients Use fp32 for optimizer states PyTorch has an automatic mixed precision (AMP) library. [docs] It tries to cast things into bf16 when safe (matmuls, not exp).\nwith torch.amp.autocast(\u0026#34;cuda\u0026#34;, dtype=torch.bfloat16): x = torch.zeros(4, 8) fp8 # In 2022, FP8 was standardized, motivated by machine learning workloads [primer]. H100s support two variants of FP8: E4M3 (range [-448, 448]) and E5M2 (range [-57344, 57344]).\nfp4 # In 2025, NVIDIA developed nvfp4 Only 4 bits per value! Values: -6, -4, -3, -2, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2, 3, 4, 6 Use a separate scale factor per block, so you actually get more dynamic range (values just can\u0026rsquo;t vary freely from their neighbors). Nemotron 3 Super was trained in NVFP4 Some of this is done in NVIDIA libraries outside of user control.\nTensors on GPUs # Tensors are stored in CPU memory by default.\nx = torch.zeros(32, 32) assert x.device == torch.device(\u0026#34;cpu\u0026#34;) Move the tensor to GPU memory (device 0)\ny = x.to(\u0026#34;cuda:0\u0026#34;) assert y.device == torch.device(\u0026#34;cuda\u0026#34;, 0) Or create a tensor directly on the GPU\nz = torch.zeros(32, 32, device=\u0026#34;cuda:0\u0026#34;) ","date":"2026年4月7日","externalUrl":null,"permalink":"/note/stanford-cs336/cs336-lecture-2---1.-memory-accounting/","section":"笔记","summary":"","title":"CS336 Lecture 2 - 1. Memory Accounting","type":"note"},{"content":"In the 2026 lecture, einops was covered first. 
Tensor operations # Tensor storage # PyTorch tensors are pointers into allocated memory with metadata describing how to get to any element of the tensor.\nx = torch.tensor([ [0., 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], ]) # Go to next row (dim 0), skip 4 elements assert x.stride(0) == 4 # Go to next column (dim 1), skip 1 element assert x.stride(1) == 1 # To find an element: r, c = 1, 2 index = r * x.stride(0) + c * x.stride(1) assert index == 6 Tensor slicing # Many operations simply provide a different view of the tensor, which means they do not make a copy.\nx = torch.tensor([[1., 2, 3], [4, 5, 6]]) y = x[0] assert same_storage(x, y) # same_storage is a helper, not a PyTorch built-in y = x[:, 1] assert same_storage(x, y) y = x.view(3, 2) assert same_storage(x, y) y = x.transpose(1, 0) assert same_storage(x, y) x[0][0] = 100 assert y[0][0] == 100 x = torch.tensor([[1., 2, 3], [4, 5, 6]]) y = x.transpose(1, 0) assert not y.is_contiguous() try: y.view(2, 3) assert False except RuntimeError as e: assert \u0026#34;view size is not compatible with input tensor\u0026#39;s size and stride\u0026#34; in str(e) y = x.transpose(1, 0).contiguous().view(2, 3) # Hard copy happened assert not same_storage(x, y) Tensor elementwise # These operations apply an operation to each element of the tensor and return a (new) tensor of the same shape.\nx = torch.tensor([1, 4, 9]) assert torch.equal(x.pow(2), torch.tensor([1, 16, 81])) assert torch.equal(x.sqrt(), torch.tensor([1, 2, 3])) assert torch.equal(x.rsqrt(), torch.tensor([1, 1 / 2, 1 / 3])) # i -\u0026gt; 1/sqrt(x_i) assert torch.equal(x + x, torch.tensor([2, 8, 18])) assert torch.equal(x * 2, torch.tensor([2, 8, 18])) assert torch.equal(x / 0.5, torch.tensor([2, 8, 18])) triu takes the upper triangular part of a matrix\nx = torch.ones(3, 3).triu() assert torch.equal(x, torch.tensor([ [1, 1, 1], [0, 1, 1], [0, 0, 1], ])) Tensor matmul # x = torch.ones(16, 32) w = torch.ones(32, 2) y = x @ w assert y.size() == torch.Size([16, 2]) In 
general, we perform operations for every example in a batch and token in a sequence.\nx = torch.ones(4, 8, 16, 32) w = torch.ones(32, 2) y = x @ w assert y.size() == torch.Size([4, 8, 16, 2]) In this case, we iterate over values of the first 2 dimensions of x and multiply by w (batched matmul + automatic broadcasting).\nTensor einops # Einops motivation # In traditional PyTorch code, it is easy to mess up the dimensions.\nEinops is a library for manipulating tensors where dimensions are named. It is inspired by Einstein summation notation (Einstein, 1916). [Einops tutorial]\nJaxtyping basics (from 2025) # Old way:\nx = torch.ones(2, 2, 1, 3) # batch seq heads hidden New (jaxtyping) way:\n# from jaxtyping import Float x: Float[torch.Tensor, \u0026#34;batch seq heads hidden\u0026#34;] = torch.ones(2, 2, 1, 3) Note: This is just documentation (no enforcement), so the mismatched shape below is still legal:\nx: Float[torch.Tensor, \u0026#34;batch seq heads hidden\u0026#34;] = torch.ones(100, 5) Einops einsum # Einsum is generalized matrix multiplication with good bookkeeping.\nx: Float[torch.Tensor, \u0026#34;batch seq1 hidden\u0026#34;] = torch.ones(2, 3, 4) y: Float[torch.Tensor, \u0026#34;batch seq2 hidden\u0026#34;] = torch.ones(2, 3, 4) Old way:\nz = x @ y.transpose(-2, -1) # batch, sequence, sequence New (einops) way:\n# from einops import einsum z = einsum(x, y, \u0026#34;batch seq1 hidden, batch seq2 hidden -\u0026gt; batch seq1 seq2\u0026#34;) Or you can use ... to represent broadcasting over any number of dimensions:\nz = einsum(x, y, \u0026#34;... seq1 hidden, ... seq2 hidden -\u0026gt; ... seq1 seq2\u0026#34;) Einops reduce # You can reduce a single tensor via some operation (e.g., sum, mean, max, min).\nx = torch.ones(2, 3, 4) # batch seq hidden # Old way y = x.sum(dim=-1) # New (einops) way y = reduce(x, \u0026#34;... 
hidden -\u0026gt; ...\u0026#34;, \u0026#34;sum\u0026#34;) Einops rearrange # Sometimes, a dimension represents two dimensions, and you want to operate on one of them.\nx = torch.ones(3, 8) # seq total_hidden \u0026hellip; where total_hidden is a flattened representation of heads * hidden1 (2x4 matrix)\nw = torch.ones(4, 4) # hidden1 hidden2 Break up total_hidden into two dimensions (heads and hidden1)\nx = rearrange(x, \u0026#34;... (heads hidden1) -\u0026gt; ... heads hidden1\u0026#34;, heads=2) Perform the transformation by w\nx = einsum(x, w, \u0026#34;... hidden1, hidden1 hidden2 -\u0026gt; ... hidden2\u0026#34;) Combine heads and hidden2 back together\nx = rearrange(x, \u0026#34;... heads hidden2 -\u0026gt; ... (heads hidden2)\u0026#34;) Tensor operations flops # A floating-point operation (FLOP) is a basic operation like addition (x + y) or multiplication (x * y).\nIntuitions # Training GPT-3 (2020) took 3.14e23 FLOPs. Training GPT-4 (2023) is speculated to have taken 2e25 FLOPs.\nH100 has a peak performance of 1979 TFLOP/s with sparsity, and 50% of that without [specs]\nLinear model # if torch.cuda.is_available(): B = 16384 # Number of points D = 32768 # Dimension of each point K = 8192 # Number of outputs else: B = 1024 D = 256 K = 64 x = torch.ones(B, D, device=cuda_if_available()) w = torch.randn(D, K, device=cuda_if_available()) y = x @ w How many FLOPs does this matmul take? We have one multiplication (x[i][j] * w[j][k]) and one addition per (i, j, k) triple.\nactual_num_flops = 2 * B * D * K We can also time this operation to see how long it takes.\nactual_time = benchmark(lambda: x @ w) The actual FLOP/s of this operation:\nactual_flop_per_sec = actual_num_flops / actual_time Each GPU has a specification sheet that provides the peak performance. 
Note that the FLOP/s depends heavily on the data type!\npromised_flop_per_sec = get_promised_flop_per_sec(x.dtype) Model FLOPs utilization (MFU) # MFU = (actual FLOP/s) / (promised FLOP/s) [ignore communication/overhead] Usually, ≥ 0.5 is quite good.\nSummary # Matrix multiplications dominate: (2 m n p) FLOPs for an (m × n) by (n × p) product FLOP/s depends on hardware (B200 \u0026raquo; H100) and data type (bfloat16 \u0026raquo; float32) MFU: (actual FLOP/s) / (promised FLOP/s) Arithmetic intensity # The total time it takes from input to output depends on:\nAccelerator speed (FLOP/s) Memory bandwidth (bytes/s) assert h100_flop_per_sec == 1979e12 / 2 # Half without sparsity assert h100_bytes_per_sec == 3.35e12 ReLU # Note: Just in case, $\\mathrm{ReLU}(x) = \\mathrm{max}(0, x)$\nn = 1024 * 1024 x = torch.ones(n, dtype=torch.bfloat16, device=cuda_if_available()) y = torch.relu(x) bytes = (2 * n) + (2 * n) # Read x, write y (bf16 is 2 bytes/float) flops = n # n comparisons communication_time = bytes / h100_bytes_per_sec # 1.252e-6 computation_time = flops / h100_flop_per_sec # 1.060e-9 Arithmetic intensity: how much actual work per byte for this workload?\nh100_accelerator_intensity = h100_flop_per_sec / h100_bytes_per_sec # ~295.3731 arithmetic_intensity = flops / bytes # ~1/4 assert arithmetic_intensity \u0026lt; h100_accelerator_intensity So ReLU is memory-bound (communication time \u0026gt; computation time), which leads to low MFU.\nGELU # Note: $\\mathrm{GELU}(x) = x \\cdot \\Phi(x) \\approx 0.5 x \\left( 1 + \\tanh \\left( \\sqrt{\\frac{2}{\\pi}} (x + 0.044715 x^3) \\right) \\right)$\nIn case you forgot, $\\Phi(x)$ is the cumulative distribution function of the standard normal distribution.\nimport torch.nn.functional as F n = 1024 * 1024 x = torch.ones(n, dtype=torch.bfloat16, device=cuda_if_available()) y = F.gelu(x) bytes = (2 * n) + (2 * n) # Read x, write y (bf16 is 2 bytes/float) flops = 20 * n arithmetic_intensity = flops / bytes # ~5 assert arithmetic_intensity \u0026lt; h100_accelerator_intensity Obviously GELU 
has higher arithmetic intensity than ReLU, but it is still memory-bound.\nDot product # n = 1024 * 1024 x = torch.ones(n, dtype=torch.bfloat16, device=cuda_if_available()) w = torch.ones(n, dtype=torch.bfloat16, device=cuda_if_available()) y = x @ w bytes = (2 * n) + (2 * n) + 2 # Read x, read w, write y flops = 2 * n - 1 # n multiplications, n-1 additions arithmetic_intensity = flops / bytes # ~1/2 assert arithmetic_intensity \u0026lt; h100_accelerator_intensity Still memory-bound.\nMatrix vector product # n = 1024 x = torch.ones(n, dtype=torch.bfloat16, device=cuda_if_available()) w = torch.ones(n, n, dtype=torch.bfloat16, device=cuda_if_available()) y = x @ w bytes = (2 * n) + (2 * n * n) + (2 * n) # Read x, read w, write y flops = n * (2 * n - 1) # n dot products arithmetic_intensity = flops / bytes # ~1 assert arithmetic_intensity \u0026lt; h100_accelerator_intensity Memory-bound.\nIntensity matmul # n = 1024 x = torch.ones(n, n, dtype=torch.bfloat16, device=cuda_if_available()) w = torch.ones(n, n, dtype=torch.bfloat16, device=cuda_if_available()) y = x @ w bytes = (2 * n * n) + (2 * n * n) + (2 * n * n) # Read x, read w, write y flops = n * n * (2 * n - 1) # n^2 dot products arithmetic_intensity = flops / bytes # ~n/3 assert arithmetic_intensity \u0026gt; h100_accelerator_intensity # 341.1667 \u0026gt; 295.3731 Obviously, it is compute-bound.\nTraining Transformers is compute-bound, since it involves big matrix multiplications. Matrix-vector product is what happens during inference, which is why inference is memory-bound.\nNote: Arithmetic/accelerator intensity also depends on the precision (bf16 versus fp32)\nRoofline plots # We can visualize the relationship between arithmetic intensity and performance using roofline plots.\nThe red region corresponds to low arithmetic intensity operations (e.g., ReLU, GELU), which are memory-bound: their performance scales with available bandwidth. 
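As a rough numerical sketch of the roofline idea, attainable throughput can be modelled as the smaller of the compute peak and arithmetic intensity times memory bandwidth. The helper names below are made up for this note, and the H100 constants are the ones assumed earlier in this section, so treat the output as illustrative rather than measured.

```python
# Illustrative roofline model. The helper names are hypothetical, and the
# H100 constants are the values assumed earlier in this note.
H100_FLOP_PER_SEC = 1979e12 / 2   # dense bf16 peak (half of the sparse figure)
H100_BYTES_PER_SEC = 3.35e12      # memory bandwidth


def roofline_flop_per_sec(arithmetic_intensity, peak=H100_FLOP_PER_SEC, bandwidth=H100_BYTES_PER_SEC):
    # Memory-bound work is capped by intensity * bandwidth; compute-bound work by peak.
    return min(peak, arithmetic_intensity * bandwidth)


def is_memory_bound(arithmetic_intensity, peak=H100_FLOP_PER_SEC, bandwidth=H100_BYTES_PER_SEC):
    # The ridge point peak / bandwidth separates the two regimes.
    return arithmetic_intensity < peak / bandwidth


n = 1024
workloads = {
    'relu': 1 / 4,               # flops / bytes from the ReLU example
    'matmul': (2 * n - 1) / 6,   # n * n * (2n - 1) flops over 6 * n * n bytes
}
for name, intensity in workloads.items():
    bound = 'memory-bound' if is_memory_bound(intensity) else 'compute-bound'
    print(name, bound, roofline_flop_per_sec(intensity))
```

Raising the bandwidth argument moves the ridge point to the left, which is the BW1 versus BW2 comparison in the plot.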
As bandwidth increases (from BW1 to BW2), performance improves, but only up to a point. Once the arithmetic intensity is high enough, the workload enters the yellow zone, where it may be memory-bound under lower bandwidth (BW1) but already compute-bound under higher bandwidth (BW2). Finally, in the green region (e.g., matmul), performance is fully compute-bound, and further increases in bandwidth no longer help — only improving compute throughput matters. ","date":"2026年4月7日","externalUrl":null,"permalink":"/note/stanford-cs336/cs336-lecture-2---2.-compute-accounting/","section":"笔记","summary":"","title":"CS336 Lecture 2 - 2. Compute Accounting","type":"note"},{"content":" Deep linear network # Consider a deep network with L layers and D-dimensional inputs, activations, and outputs.\nclass Block(nn.Module): \u0026#34;\u0026#34;\u0026#34;Simple block that applies a linear transformation followed by a ReLU nonlinearity.\u0026#34;\u0026#34;\u0026#34; def __init__(self, dim: int): super().__init__() self.weight = nn.Parameter(torch.randn(dim, dim) / math.sqrt(dim)) def forward(self, x: torch.Tensor) -\u0026gt; torch.Tensor: x = x @ self.weight # Linear x = F.relu(x) # Activation return x class DeepNetwork(nn.Module): \u0026#34;\u0026#34;\u0026#34;Map `dim`-vector to a `dim`-vector.\u0026#34;\u0026#34;\u0026#34; def __init__(self, dim: int, num_layers: int): super().__init__() self.layers = nn.ModuleList([Block(dim) for i in range(num_layers)]) def forward(self, x: torch.Tensor) -\u0026gt; torch.Tensor: # Apply all the layers sequentially for layer in self.layers: x = layer(x) # @stepover return x def get_num_parameters(model: nn.Module) -\u0026gt; int: return sum(param.numel() for param in model.parameters()) D = 8 L = 3 model = DeepNetwork(dim=D, num_layers=L).to(cuda_if_available()) num_parameters = get_num_parameters(model) assert num_parameters == (D * D) * L # Run the model on a batch of data B = 4 # Batch size x = torch.randn(B, D, device=cuda_if_available()) y = 
model(x) Gradients basics # So far, we\u0026rsquo;ve constructed tensors and passed them through operations (forward). Now, we\u0026rsquo;re going to compute the gradient (backward).\nAs a simple example, let\u0026rsquo;s consider the simple linear model: y = x · w, loss = 0.5(y - 5)^2\nForward pass: compute loss\nx = torch.tensor([1., 2, 3]) w = torch.tensor([1., 1, 1], requires_grad=True) # Want gradient pred_y = x @ w loss = 0.5 * (pred_y - 5).pow(2) Backward pass: compute gradients\nloss.backward() assert torch.equal(w.grad, torch.tensor([1., 2, 3])) What happens under the hood (PyTorch autograd) # In this example, PyTorch performs two phases: forward and backward.\n1. Forward pass (build computation graph) # During the forward pass, PyTorch does two things at the same time:\nCompute actual values Record how those values were computed Concretely:\nx, w → matmul → pred_y → subtract(5) → pow(2) → multiply(0.5) → loss Each operation creates a new tensor and attaches a grad_fn (gradient function), which represents:\nhow this tensor was computed how to compute its gradient during backward This forms a computation graph:\nNodes = tensors / operations Edges = data dependencies It is a directed acyclic graph (DAG) constructed dynamically Only tensors with requires_grad=True (like w) are tracked as endpoints for gradients.\n2. 
Backward pass (compute gradients) # For a clearer view of backpropagation, see Backpropagation calculus | Deep Learning Chapter 4 by 3Blue1Brown [Bilibili]\nWhen we call:\nloss.backward() PyTorch:\nStarts from the loss with:\nd(loss)/d(loss) = 1 Traverses the computation graph in reverse order\nAt each node, applies the chain rule: $$ \\frac{\\mathrm{d}L}{\\mathrm{d}w} = \\frac{\\mathrm{d}L}{\\mathrm{d}y} \\cdot \\frac{\\mathrm{d}y}{\\mathrm{d}w} $$\nEach operation knows its local derivative, so it multiplies:\nincoming gradient × local derivative Gradients are propagated backward step by step:\nloss → pred_y → w Final gradients are accumulated into leaf tensors:\nw.grad = [1, 2, 3] Gradient FLOPs # Let\u0026rsquo;s count the FLOPs for computing gradients.\nB = 1024 # Number of points D = 256 # Dimension Define a simplified model (2-layer linear network):\nx = torch.ones(B, D, device=cuda_if_available()) w1 = torch.randn(D, D, device=cuda_if_available(), requires_grad=True) w2 = torch.randn(D, D, device=cuda_if_available(), requires_grad=True) # Forward pass h1 = einsum(x, w1, \u0026#34;batch in, in out -\u0026gt; batch out\u0026#34;) # x h2 = einsum(h1, w2, \u0026#34;batch in, in out -\u0026gt; batch out\u0026#34;) # h1 loss = (h2.mean() - 0)**2 # Regress everything to 0 (arbitrary) # Backward pass h1.retain_grad() # For debugging h2.retain_grad() # For debugging loss.backward() Zoom in on one layer # Let\u0026rsquo;s focus on the second layer (h2 = h1 @ w2)\nForward pass: Recall the number of forward FLOPs:\nnum_forward_flops = 2 * B * D * D Backward pass: How many FLOPs does the backward pass take?\nWe need to compute:\nh1.grad = d loss / d h1 w2.grad = d loss / d w2 h1_grad = einsum(h2.grad, w2, \u0026#34;batch out, in out -\u0026gt; batch in\u0026#34;) assert torch.allclose(h1.grad, h1_grad) w2_grad = einsum(h2.grad, h1, \u0026#34;batch out, batch in -\u0026gt; in out\u0026#34;) assert torch.allclose(w2.grad, w2_grad) num_backward_flops = (2 * B * D * 
D) + (2 * B * D * D) Note that the backward pass is 2x more expensive than the forward pass.\nConsider all layers # This was just for w2, need to apply it to all parameters in the network.\nPutting it together:\nForward pass: 2 (# data points) (# parameters) FLOPs Backward pass: 4 (# data points) (# parameters) FLOPs Total: 6 (# data points) (# parameters) FLOPs This is for multilayer perceptrons (MLPs) \u0026hellip;but it turns out to be a good approximation for Transformers for short context lengths as well.\nOptimizer # Recall our deep network.\nB = 2 # Batch size D = 4 # Dimensionality of input, activations, and output L = 3 # Number of layers model = DeepNetwork(dim=D, num_layers=L).to(cuda_if_available()) Let\u0026rsquo;s define the AdaGrad optimizer\nmomentum = SGD + exponential averaging of grad AdaGrad = SGD + averaging by grad^2 RMSProp = AdaGrad but with exponential averaging of grad^2 Adam = RMSProp + momentum AdaGrad [Duchi+ 2011]\noptimizer = AdaGrad(model.parameters(), lr=0.01) state = model.state_dict() # Compute gradients x = torch.randn(B, D, device=cuda_if_available()) y = torch.tensor([4., 5.], device=cuda_if_available()) pred_y = model(x).mean() loss = F.mse_loss(input=pred_y, target=y) loss.backward() # Take a step optimizer.step() optimizer_state = {i: dict(p_state) for i, (p, p_state) in enumerate(optimizer.state.items())} # Free up the memory optimizer.zero_grad(set_to_none=True) Memory # num_parameters = D * D * L parameter_memory = 2 * num_parameters # (2 bytes for bf16) gradient_memory = 2 * num_parameters # (2 bytes for bf16) optimizer_state_memory = 4 * num_parameters # (4 bytes for fp32) activation_memory = 2 * (B * D * L) # (2 bytes for bf16) It is customary to use fp32 for stability (accumulating averages over powers over many steps). 
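To get a feel for scale, the same per-parameter accounting can be applied to a hypothetical model with one billion parameters. The model size is an assumption chosen for illustration, not a figure from the lecture, and activations are left out because they depend on the architecture and batch size.

```python
# Per-parameter memory accounting for a hypothetical 1e9-parameter model.
# Byte counts follow the note above: bf16 parameters and gradients, fp32 AdaGrad state.
num_parameters = 1_000_000_000               # assumed size, for illustration only

parameter_memory = 2 * num_parameters        # bf16: 2 bytes each
gradient_memory = 2 * num_parameters         # bf16: 2 bytes each
optimizer_state_memory = 4 * num_parameters  # fp32 second moments: 4 bytes each

total_bytes = parameter_memory + gradient_memory + optimizer_state_memory
print(total_bytes / 1e9, 'GB before activations')
```

Under these assumptions the optimizer state alone takes as much memory as the parameters and gradients combined.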
Optimizer state memory:\nAdaGrad: 4 bytes/parameter for storing second moments Adam: 8 bytes/parameter for storing first and second moments # Putting it all together total_memory = parameter_memory + activation_memory + gradient_memory + optimizer_state_memory Compute (for one training step) # num_parameters = D * D * L flops = 6 * B * num_parameters Transformers # The accounting for a Transformer is more complicated, but the same idea. Assignment 1 will ask you to do that. Blog post describing memory usage for Transformer training [article] Blog post describing FLOPs for a Transformer: [article]\nTrain loop # # True linear function with weights (0, 1, 2, ..., D-1) D = 16 # Dimensionality true_w = torch.arange(D, dtype=torch.float32, device=cuda_if_available()) # Data loader that generates (x, y) pairs B = 4 # Batch size def get_batch() -\u0026gt; tuple[torch.Tensor, torch.Tensor]: x = torch.randn(B, D).to(cuda_if_available()) true_y = x @ true_w return (x, true_y) # Define the model and optimizer L = 2 # Number of layers model = DeepNetwork(dim=D, num_layers=L).to(cuda_if_available()) optimizer = AdaGrad(model.parameters(), lr=0.01) # Train! num_train_steps = 10 for t in range(num_train_steps): # Get data x, y = get_batch() # Forward (compute loss) pred_y = model(x).mean() loss = F.mse_loss(pred_y, y) # Backward (compute gradients) loss.backward() # Update parameters optimizer.step() optimizer.zero_grad(set_to_none=True) ","date":"2026年5月5日","externalUrl":null,"permalink":"/note/stanford-cs336/cs336-lecture-2---3.-full-example/","section":"笔记","summary":"","title":"CS336 Lecture 2 - 3. 
Full Example","type":"note"},{"content":" Gradient accumulation # Large batch sizes: improve training stability However, activation memory scales with batch size, so might run out.\nB = 64 # Batch size D = 1024 # Dimensionality L = 16 # Number of layers activation_memory = 2 * B * D * L # (2 bytes for bf16) Gradient accumulation:\nCompute gradient on micro batches Accumulate the gradients (don\u0026rsquo;t zero it out) Every batch_size / micro_batch_size steps, update the parameters and zero out the gradients micro_batch_size = 256 activation_memory = 2 * micro_batch_size * D * L # (2 bytes for bf16) Activation checkpointing # For training, we need to store the activations of all layers For inference, we don\u0026rsquo;t compute gradients, so we only need to store the current layer\u0026rsquo;s activations.\nThe memory usage is\nB = 64 # Batch size D = 1024 # Dimensionality L = 16 # Number of layers x = torch.randn(B, D, device=cuda_if_available(), requires_grad=True) activation_memory = 2 * B * D * L model = DeepNetwork(dim=D, num_layers=L).to(cuda_if_available()) memory = get_max_memory_usage(lambda: model(x).sum().backward()) Can we reduce this?\nActivation checkpointing = gradient checkpointing = rematerialization Key idea:\nForward pass: keep only activations at subset of layers Backward pass: recompute the missing activations from the last checkpoint Philosophy: tradeoff memory for compute class DeepNetworkCheckpointed(nn.Module): \u0026#34;\u0026#34;\u0026#34;Same as DeepNetwork, but with activation checkpointing.\u0026#34;\u0026#34;\u0026#34; def __init__(self, dim: int, num_layers: int): super().__init__() self.layers = nn.ModuleList([Block(dim) for i in range(num_layers)]) def forward(self, x: torch.Tensor) -\u0026gt; torch.Tensor: # Apply all the layers sequentially for layer in self.layers: # KEY: only store activations at checkpoints, recompute the rest \u0026#34;\u0026#34;\u0026#34; ==\u0026gt; \u0026#34;\u0026#34;\u0026#34; x = 
torch.utils.checkpoint.checkpoint(layer, x) return x # Store all activations: x g1 h1 g2 h2 g3 h3 g4 h4 # Activation checkpointing: x h1 h2 h3 h4 # Define the model with checkpointing model = DeepNetworkCheckpointed(dim=D, num_layers=L).to(cuda_if_available()) checkpointed_memory = get_max_memory_usage(lambda: model(x).sum().backward()) Can we reduce this even more, especially for deep networks (large L)?\n# Store all layers: | h1 h2 h3 h4 h5 h6 h7 h8 h9 | # Store no layers: | | # Store some layers: | h3 h6 h9 | How frequently to checkpoint?\nIf store each layer\u0026rsquo;s activations, then activation memory is O(L) and no recomputation. If store no activations, then activation memory is O(1) and compute is O(L^2) (recompute from the start for each layer). If store every sqrt(L) layers, then activation memory is O(sqrt(L)) and O(L) recomputation. Summary # Everything is operations on tensors (parameters, gradients, activations, optimizer states, data) einops: better way to think about tensor operations 6 (# data points) (# parameters) FLOPs per training step Arithmetic intensity / roofline analysis: compute-bound or memory-bound? Matrix multiplications are compute-bound, elementwise operations are memory-bound Gradient accumulation, activation checkpointing: reduce memory to use bigger batch sizes ","date":"2026年5月8日","externalUrl":null,"permalink":"/note/stanford-cs336/cs336-lecture-2---4.-more-memory-optimizations/","section":"笔记","summary":"","title":"CS336 Lecture 2 - 4. 
More Memory Optimizations","type":"note"},{"content":" Please read along with the original slides\nThe modern Transformer recipe # The original Transformer is still the conceptual starting point, but modern large language models are not usually exact copies of it.\nA typical modern dense LLM block looks more like this:\ninput hidden states enter a normalisation layer; the normalised states go into causal self-attention; the attention result is added back through a residual connection; another normalisation layer is applied; the result goes into a feed-forward network, usually with a gated activation; another residual addition produces the block output. In rough form:\nx = x + Attention(Norm(x)) x = x + FFN(Norm(x)) This is called pre-norm, because the normalisation happens before the attention or FFN sublayer.\nThe important design idea is that the residual stream remains a relatively clean path through the network. The normalisation affects the branch computation, but it does not directly sit on the main residual path after every addition.\nThat sounds like a small rearrangement, but it matters a lot for training stability.\nPre-norm vs post-norm # The original Transformer used post-norm:\nx = Norm(x + Attention(x)) x = Norm(x + FFN(x)) Modern LLMs usually use pre-norm:\nx = x + Attention(Norm(x)) x = x + FFN(Norm(x)) The practical reason is stability.\nWith post-norm, the normalisation is placed after the residual addition. This can interfere with the residual signal path and can make gradients behave badly in deeper networks. The lecture discusses two related explanations:\ngradient attenuation, where gradients shrink or become poorly conditioned through depth; gradient spikes, where training becomes unstable and requires careful warmup or smaller learning rates. Pre-norm became the standard because it tends to make large models easier to train. 
It helps preserve the residual stream and allows larger learning rates or less fragile warmup schedules.\nA useful mental model:\nPost-norm says: “normalise the result after mixing the residual and new computation.” Pre-norm says: “normalise only the input to the new computation, and leave the residual highway mostly untouched.” That second design is friendlier to very deep networks.\nSome newer models go further and add extra normalisation outside the residual stream. This is sometimes called double norm or non-residual post-norm. The motivation is not to return to the old post-norm design, but to add additional control without damaging the residual pathway.\nLayerNorm vs RMSNorm # The original Transformer used LayerNorm. Many modern LLMs use RMSNorm instead.\nLayerNorm normalises both the mean and the variance of the hidden vector.\n$$ y = \\frac{x - \\mathbb{E}[x]}{\\sqrt{\\mathrm{Var}[x] + \\epsilon}} \\cdot \\gamma + \\beta $$\nRMSNorm is simpler: it normalises using the root mean square over the $d$ entries of the hidden vector and does not subtract the mean.\n$$ y = \\frac{x}{\\sqrt{\\frac{1}{d} \\|x\\|^2_2 + \\epsilon}} \\cdot \\gamma $$\nThe lecture’s main point is that RMSNorm is not popular because it radically changes the model’s expressive power. It is popular because it is cheaper and works about as well.\nThe advantages are:\nfewer operations, because it does not compute the mean; fewer parameters, because it usually does not use a bias term; less data movement; better wall-clock performance in practice. A key systems lesson appears here:\nFLOPs are not runtime.\nEven if normalisation is a tiny fraction of total FLOPs, it can still matter because runtime is often affected by memory movement, kernel launches, and bandwidth. 
Matrix multiplications dominate FLOPs, but small operations can still hurt performance if they move data inefficiently.\nSo RMSNorm is a good example of an architectural choice that looks mathematically minor but is useful from a systems perspective.\nDropping bias terms # Modern Transformer implementations often remove bias terms from linear layers and normalisation layers.\nFor example, instead of:\nFFN(x) = activation(xW1 + b1)W2 + b2 many modern models use something closer to:\nFFN(x) = activation(xW1)W2 This is not because bias terms are impossible to use. Older models used them. The argument is more pragmatic:\nbias terms add parameters; they add memory movement; they often do not provide a large enough benefit; removing them can slightly simplify optimisation and implementation. This is part of a broader modern LLM design trend: if a component costs memory bandwidth but does not clearly improve quality, people tend to remove it.\nFeed-forward activations: from ReLU to GLU # The feed-forward network is a major part of each Transformer block. In many LLMs, it contains a large fraction of the model’s parameters and compute.\nOlder models used activations like:\nReLU GELU Swish Modern models increasingly use gated activations, especially:\nGeGLU SwiGLU A standard FFN looks like this:\nFFN(x) = activation(xW1)W2 A gated FFN adds another linear projection and multiplies the two branches elementwise:\nFFN(x) = (activation(xW1) ⊙ xV)W2 For SwiGLU, the gated branch is:\nSwiGLU(x) = Swish(xW1) ⊙ xV The intuition is that the model gets a learned gate. Instead of merely transforming features, it can also decide which feature channels should pass through more strongly.\nGated variants add parameters, so models usually reduce the feed-forward hidden dimension when using them. 
A common rule is:\nstandard FFN: \\(d_{\\text{ff}} \\approx 4 d_{\\text{model}}\\) GLU-style FFN: \\(d_{\\text{ff}} \\approx \\frac{8}{3} d_{\\text{model}}\\) This keeps the parameter count roughly comparable.\nThe lecture’s practical conclusion is:\nReLU and GELU can still work. GPT-3 used GELU and obviously worked. But most recent models have moved towards SwiGLU or GeGLU. The empirical evidence suggests gated activations give fairly consistent gains. So, for a modern LLM implementation, SwiGLU + RMSNorm + pre-norm is a very normal choice.\nSerial vs parallel Transformer blocks # The usual Transformer block is serial:\nx = x + Attention(Norm(x)) x = x + MLP(Norm(x)) Attention is computed first, then the MLP.\nSome models use a parallel block:\nx = x + Attention(Norm(x)) + MLP(Norm(x)) This can be faster if implemented carefully, because:\nthe same normalised input can be shared; matrix multiplications may be fused; attention and MLP branches can be scheduled more efficiently. Models such as GPT-J, PaLM, and GPT-NeoX used parallel layers.\nHowever, the lecture notes that most models still use the serial design. Parallel layers are interesting, but they have not become the universal default.\nThe practical takeaway:\nserial blocks are the safe, standard choice; parallel blocks can be useful for efficiency; but parallelisation is an implementation and scaling trade-off, not a guaranteed quality improvement. Position embeddings # Position information is necessary because attention by itself does not know token order. 
Without positional information, a Transformer would treat a sequence too much like an unordered set.\nThe lecture covers several position embedding families.\nSinusoidal embeddings # The original Transformer used fixed sine and cosine functions.\nThe model adds a position-dependent vector to each token embedding:\nembedding(token, position) = token_embedding + sinusoidal_position_vector This gives the model a smooth notion of position, but it is still an additive absolute-position method.\nLearned absolute embeddings # Models like GPT-2 and GPT-3 used learned position embeddings.\nembedding(token, position) = token_embedding + learned_position_vector This is simple and effective, but it is tied to absolute positions. It also does not naturally extrapolate to longer sequence lengths.\nRelative position embeddings # Relative position methods try to make attention depend on the distance between tokens rather than their absolute indices.\nInstead of “token at position 17 attends to token at position 4”, the model can reason more like “this token attends to another token 13 positions earlier”.\nThis is often more natural for language.\nRoPE: Rotary Position Embeddings # RoPE is now one of the dominant choices in modern LLMs.\nThe key idea is elegant:\nEncode position by rotating query and key vectors, so their inner product depends on relative position.\nRather than adding a position vector to the embedding, RoPE modifies the query and key vectors inside attention.\nFor each pair of hidden dimensions, RoPE applies a 2D rotation whose angle depends on the token position. When the model computes the dot product between a query and a key, the result naturally contains information about the relative distance between their positions.\nA useful way to think about it:\ntoken content gives the base vector; position rotates that vector; attention compares rotated query/key vectors; the comparison depends on relative offset. 
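As a rough illustration of the four points above, here is a minimal pure-Python sketch of the rotation. The helper names rope_rotate and dot, the base of 10000 and the toy 4-dimensional vectors are assumptions made for this example, not code from the lecture. It rotates each pair of dimensions by an angle proportional to the position, then checks that two query/key placements with the same offset of 13 produce the same score.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    # Rotate each consecutive (even, odd) pair of dimensions by an angle
    # that grows with the token position; later pairs rotate more slowly.
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key content vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (13), very different absolute positions.
score_a = dot(rope_rotate(q, 17), rope_rotate(k, 4))
score_b = dot(rope_rotate(q, 113), rope_rotate(k, 100))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree up to floating-point error because the product of the two rotations depends only on the difference between the positions.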
This is different from sinusoidal embeddings because RoPE is multiplicative/rotational rather than additive. It avoids some unwanted cross terms produced by simply adding position vectors to token embeddings.\nIn implementation, RoPE is usually applied to queries and keys, not values.\nThat detail matters: attention scores come from \\(\\mathrm{Q}\\mathrm{K}^\\mathrm{T}\\), so applying RoPE to \\(\\mathrm{Q}\\) and \\(\\mathrm{K}\\) directly affects how tokens attend to each other by position.\nFeed-forward dimension: why \\(d_{\\text{ff}} \\approx 4 d_{\\text{model}}\\)? # A standard Transformer FFN expands the hidden dimension and then projects it back down.\nIf the model dimension is \\(d_{\\text{model}}\\), the feed-forward dimension is often:\n$$ d_{\\text{ff}} = 4 d_{\\text{model}} $$\nThis rule appears again and again across models.\nFor GLU-style FFNs, because there is an extra gate projection, the expansion is often reduced to:\n$$ d_{\\text{ff}} \\approx \\frac{8}{3} d_{\\text{model}} $$\nThe lecture frames this as a surprisingly strong consensus. There are exceptions, but most models stay in a fairly conservative range.\nOne famous exception is T5-11B, which used an enormous feed-forward multiplier:\nd_ff = 65,536 d_model = 1,024 That is a 64x multiplier.\nBut the lecture is careful here: the fact that something works does not mean it is optimal. T5 v1.1 later moved to a more conventional GeGLU setup with a much smaller multiplier.\nIn summary:\n4x is the boring but strong default; 8/3x is common for GLU variants; extreme FFN widths can work, but are not obviously the best use of parameters; most successful LLMs are less adventurous than people might expect. Attention heads and head dimension # A standard multi-head attention setup usually satisfies:\nnum_heads × head_dim = d_model For example:\nd_model = 4096 num_heads = 32 head_dim = 128 This is not mathematically required. 
A model could choose a total attention dimension larger or smaller than d_model.\nBut most models stay close to the simple rule.\nThere are exceptions, especially in some Google models such as T5 and LaMDA, where the ratio between total head dimension and model dimension can be larger than 1.\nThe lecture’s attitude here is quite sceptical:\nthis convention is widely used; it seems to work; but there is not necessarily deep validation proving it is uniquely optimal. So it is a consensus default, not a law of nature.\nDeep vs wide: model aspect ratio # Another hyperparameter is the model\u0026rsquo;s aspect ratio:\nd_model / num_layers This roughly asks:\nShould the model be wide with fewer layers, or deep with narrower layers?\nThe lecture notes that many successful models fall into a broad range, often around 100–200, though there are outliers.\nThere is no single magic number.\nThe important systems consideration is that extremely deep models are harder to parallelise. Layers are sequential: layer 12 depends on layer 11, which depends on layer 10, and so on. That creates latency and limits parallel execution.\nVery wide models, by contrast, can often use larger matrix multiplications, which GPUs are good at.\nSo the depth/width decision is not only about model quality. It is also about:\ntraining throughput; inference latency; pipeline parallelism; hardware utilisation; communication cost across devices. A model that is theoretically elegant but slow to train or serve may be a poor engineering choice.\nVocabulary size # Vocabulary size depends heavily on language coverage and production needs.\nFor mostly monolingual English models, typical vocabulary sizes are around 30k-50k tokens. Examples include GPT-2/3 around 50k and LLaMA around 32k.\nFor multilingual or production systems, vocabularies are often much larger, around 100k-250k+ tokens. 
Examples include PaLM, mT5, Qwen, DeepSeek, Gemma, and GPT-4-scale tokenizers.\nThe reason is simple: multilingual models need to represent many writing systems, languages, scripts, and character combinations. A small vocabulary can make non-English text inefficient, producing too many tokens for the same sentence.\nThe practical takeaway:\nsmall vocabularies are fine for narrow language coverage; multilingual models usually need larger vocabularies; tokenisation is one of the major places where models still differ. Dropout and regularization # Classic neural networks often rely on dropout to prevent overfitting.\nFor LLM pretraining, the argument against dropout is reasonable:\nthe dataset is huge; models often see each token only once or a small number of times; memorisation is less like the small-data regime; dropout can slow or destabilise optimisation. Older models often used dropout, including GPT-2, GPT-3, T5, and OPT.\nNewer models often use little or no dropout during pretraining. Instead, they may rely on weight decay.\nBut weight decay in LLMs is not simply about preventing overfitting. The lecture highlights that weight decay interacts with the learning-rate schedule, especially cosine decay.\nSo regularization in LLM pretraining is often better understood as part of the optimisation dynamics rather than merely a defence against train/test overfitting.\nStability tricks # Large-model training can fail in messy ways. The loss curve may spike, gradients may explode, or the model may become unstable late in training.\nThe lecture focuses on one dangerous component: Softmax\nSoftmax involves exponentials and normalisation. If logits become too large, the output can become extremely sharp or numerically unstable.\nThere are two main softmax locations in an LLM:\nthe final output softmax over vocabulary; the attention softmax over tokens. 
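To make the sharpness problem concrete, the following pure-Python sketch scales the same three logits by growing factors and prints the resulting distribution. The function name softmax and the example logits are assumptions chosen for illustration. The max-subtraction inside the function is the standard overflow guard, included here only so the example runs for large inputs.

```python
import math

def softmax(logits):
    # Subtracting the maximum leaves the result unchanged mathematically,
    # but keeps exp() from overflowing when logits are large.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

base_logits = [1.0, 2.0, 3.0]
for scale in (1, 10, 100):
    probs = softmax([scale * z for z in base_logits])
    # As the scale grows, almost all probability mass moves to one entry.
    print(scale, [round(p, 4) for p in probs])
```

With large logits the distribution is effectively one-hot, which is the regime the tricks in this section try to avoid.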
Modern models use several tricks to keep these stable.\nOutput Softmax stability: z-loss # The z-loss penalises the log normalisation term in the output softmax.\nThe softmax probability is:\n$$ p_i = \\frac{e^{z_i}}{Z} $$\nwhere the normalisation term is:\n$$ Z = \\sum_j e^{z_j} $$\nTaking the logarithm gives:\n$$ \\log p_i = z_i - \\log Z $$\nThe z-loss adds an auxiliary penalty term:\n$$ L_z = \\alpha (\\log Z)^2 $$\nwhere \\(\\alpha\\) is usually a very small constant.\nThe intuition:\nif logits become huge, \\(Z\\) becomes huge; huge logits make the softmax distribution extremely sharp; penalising \\(\\log Z\\) discourages logits from growing uncontrollably. This improves numerical stability during training, especially for very large language models.\nPaLM used this trick, and the lecture lists other models that also adopted it, such as Baichuan 2, DCLM, OLMo 2, and OLMo 3.\nAttention Softmax stability: QK norm # Attention scores are computed from queries and keys:\n$$ \\text{scores} = \\frac{\\mathrm{Q}\\mathrm{K}^{\\mathrm{T}}}{\\sqrt{d_k}} $$\nThen softmax is applied.\nIf Q and K have large norms, the dot products can become large, making the attention softmax too sharp or unstable.\nQK-norm normalises queries and keys before they enter the attention softmax.\nIn simplified form:\n$$ \\begin{align*} \\mathrm{Q} \u0026amp;= \\text{Norm}(\\mathrm{Q}) \\\\ \\mathrm{K} \u0026amp;= \\text{Norm}(\\mathrm{K}) \\\\ \\text{scores} \u0026amp;= \\frac{\\mathrm{Q}\\mathrm{K}^{\\mathrm{T}}}{\\sqrt{d_k}} \\end{align*} $$\nThis directly targets attention stability.\nThe lecture notes that QK-norm appears in several recent models, including DCLM, OLMo 2, Gemma 2, Qwen3, OLMo 3, and Gemma 4.\nLogit soft-capping # Another stability trick is logit soft-capping.\nInstead of allowing logits to grow without bound, the model passes them through a tanh-based cap:\nlogits = soft_cap × tanh(logits / soft_cap) This keeps logits within a controlled range.\nThe upside:\nprevents attention or output 
logits from blowing up; can improve numerical stability. The downside:\nit may hurt performance if the cap restricts useful confidence too much; it adds another hyperparameter; it may not be universally beneficial. So this is a stability tool, but not necessarily a free lunch.\nAttention variants: MHA, MQA, and GQA # Standard multi-head attention uses separate query, key, and value heads.\nThe problem becomes especially important during inference.\nDuring text generation, the model generates one token at a time. It stores previous keys and values in a KV cache so it does not need to recompute them for every new token.\nHowever, the KV cache can become large. Moving it in and out of memory can become a bottleneck.\nMulti-Query Attention, MQA # MQA keeps multiple query heads, but uses only one shared set of key and value projections.\nThis reduces KV cache size and memory traffic during inference.\nThe trade-off is that it can hurt quality, because all query heads share the same key/value representation.\nGrouped-Query Attention, GQA # GQA is a compromise between MHA and MQA. Instead of one shared K/V head for all query heads, groups of query heads share K/V heads.\nFor example, a model might have 32 query heads, but only 4 key/value heads. In this case, every group of 8 query heads shares one K/V set.\nSo the spectrum looks roughly like this:\nMHA: every query head has its own key/value projections; GQA: groups of query heads share the same key/value projections; MQA: all query heads share a single global set of key/value projections. GQA gives a knob for balancing:\ninference efficiency; KV cache size; model expressiveness; quality. 
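A back-of-the-envelope calculation shows how much this knob moves the KV cache. The sketch below uses the 32 query heads and 4 key/value heads from the example above. The 32 layers, head dimension of 128, 8192-token context and 2 bytes per value (fp16) are assumptions picked for illustration, and kv_cache_bytes is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Factor 2: every layer stores one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed model shape; only the number of K/V heads changes between variants.
config = dict(num_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(num_kv_heads=32, **config)  # one K/V head per query head
gqa = kv_cache_bytes(num_kv_heads=4, **config)   # 8 query heads share each K/V head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one K/V head

for name, size in (('MHA', mha), ('GQA', gqa), ('MQA', mqa)):
    print(name, size / 2**30, 'GiB')
```

Under these assumptions the cache for a single sequence is 4 GiB for MHA, 0.5 GiB for GQA and 0.125 GiB for MQA, so the cache scales linearly with the number of K/V heads.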
The lecture’s conclusion is that MQA can sometimes introduce a small perplexity degradation, while GQA often preserves most of the quality of full multi-head attention while still significantly reducing KV-cache cost.\nThis is why GQA has become very common in production LLMs.\nSparse and sliding-window attention # Full attention is quadratic in sequence length:\n$$ \\text{cost} \\approx \\mathrm{O}(n^2) $$\nFor long contexts, this becomes expensive.\nSparse or sliding-window attention restricts which tokens can attend to which other tokens.\nFor example, in sliding-window attention, each token attends only to nearby tokens within a fixed window.\nThe trade-off:\nfull attention is more expressive but expensive; local attention is cheaper but may miss long-range dependencies. A modern compromise is to interleave local and full attention layers.\nFor example:\nLayer 1: sliding-window attention Layer 2: sliding-window attention Layer 3: sliding-window attention Layer 4: full attention repeat This allows most layers to be cheaper while occasional full-attention layers move global information across the sequence.\nThe lecture mentions this as an emerging standard trick in models such as Command A, LLaMA 4, Gemma 3/4, and OLMo 3.\nA useful mental model:\nlocal attention handles nearby syntax and local coherence; occasional full attention handles global dependencies; interleaving gives a practical cost/quality trade-off. Final Takeaways # Modern dense LLM architectures are less chaotic than they look. 
Many successful models share a relatively stable recipe:\npre-norm rather than post-norm; RMSNorm rather than LayerNorm; no bias terms in many linear/normalisation layers; SwiGLU or GeGLU rather than ReLU; RoPE for position information; \\(d_{\\text{ff}} \\approx 4 d_{\\text{model}}\\), or \\(\\approx \\frac{8}{3} d_{\\text{model}}\\) for GLU-style FFNs; num_heads × head_dim ≈ d_model; little or no dropout during pretraining; weight decay mainly as an optimisation/stability tool; GQA/MQA to reduce KV-cache cost during inference; QK-norm, z-loss, or logit soft-capping for stability; sparse/sliding-window attention when long context makes full attention too expensive. The most important meta-lesson is this:\nLLM architecture design is not only about mathematical expressiveness. It is also about optimisation stability, memory movement, inference latency, and hardware efficiency.\nThat is why small-looking choices such as RMSNorm, removing bias terms, choosing GQA, or applying QK-norm can matter. They may not change the high-level Transformer story, but they make the model easier to train, cheaper to serve, or more stable at scale.\n","date":"2026年5月13日","externalUrl":null,"permalink":"/note/stanford-cs336/cs336-lecture-3---architectures--hyperparameters/","section":"笔记","summary":"","title":"CS336 Lecture 3 - Architectures and Hyperparameters","type":"note"},{"content":" Please read along with the original slides\nThis lecture discusses two major directions for making large language models more efficient and scalable:\nAlternatives to standard attention, especially for long-context modelling; Mixture of Experts (MoE), which increases model capacity without increasing the activated compute proportionally. The high-level motivation is simple: standard Transformer attention becomes expensive when the context length grows, while dense feed-forward layers become expensive when we want larger model capacity. 
Attention alternatives try to reduce the cost of sequence modelling, while MoE tries to decouple total parameter count from per-token FLOPs.\nPart I: Attention Alternatives # Why do we need attention alternatives? # In a standard Transformer, self-attention has quadratic cost with respect to sequence length. If the sequence length is \\(n\\), the attention score matrix has size \\(n \\times n\\). This becomes increasingly expensive for long-context models.\nA basic engineering toolkit already exists:\nlocal attention global attention sliding-window attention FlashAttention and other systems-level optimisations hybrid attention layouts These methods are useful, but they are still relatively conservative. The lecture then moves to more radical alternatives such as linear attention, recurrent attention forms, Mamba-like state-space models, gated delta networks, and sparse attention.\nLinear attention # Standard attention can be written as:\n$$ \\mathrm{Attn}(Q, K, V) = \\rho(QK^\\top)V $$\nwhere:\n$$ Q \\in \\mathbb{R}^{n \\times d_k}, \\quad K \\in \\mathbb{R}^{n \\times d_k}, \\quad V \\in \\mathbb{R}^{n \\times d_v} $$\nThe expensive part is the computation of:\n$$ QK^\\top $$\nwhich costs roughly:\n$$ O(n^2 d_k) $$\nand produces an \\(n \\times n\\) matrix.\nIf we temporarily ignore the softmax and assume \\(\\rho\\) is the identity function, we can reorder the computation:\n$$ (QK^\\top)V = Q(K^\\top V) $$\nThis changes the computation pattern. Instead of first forming an \\(n \\times n\\) attention matrix, we first compute:\n$$ K^\\top V $$\nwhose shape is:\n$$ d_k \\times d_v $$\nThen we multiply \\(Q\\) by this smaller matrix.\nThe cost changes from something like:\n$$ n^2 d_k + n^2 d_v $$\nto:\n$$ 2 n d_k d_v $$\nSo the complexity becomes linear in sequence length \\(n\\), assuming \\(d_k\\) and \\(d_v\\) are fixed.\nOf course, this is not the same as standard softmax attention. 
The difficult part is not just the matrix reordering; the real problem is how to approximate or replace the softmax attention behaviour while keeping the computation linear.\nRecurrent form of linear attention # Linear attention has another useful property: it can be written in a recurrent form.\nStarting from:\n$$ (QK^\\top)V = Q(K^\\top V) $$\nwe can define a recurrent state:\n$$ S_t = S_{t-1} + k_t v_t^\\top $$\nand compute the output as:\n$$ y_t = q_t^\\top S_t $$\nHere:\n\\(S_t\\) is a compressed memory state; each new token updates the state using \\(k_t v_t^\\top\\); the current query \\(q_t\\) reads from that state. This makes linear attention look like an RNN. The model maintains a state and updates it token by token.\nThe nice part is the duality:\nduring training, we can use a parallel form; during inference, we can use a recurrent form. This is attractive because autoregressive inference is already sequential, so a recurrent update can be very efficient.\nFrom linear attention to Mamba-2 # Mamba-2 can be viewed as a generalisation of linear attention with additional gating.\nLinear attention has:\n$$ S_t = S_{t-1} + k_t v_t^\\top $$\n$$ y_t = q_t^\\top S_t $$\nMamba-2 introduces a position-dependent decay or gate:\n$$ S_t = \\gamma_t S_{t-1} + k_t v_t^\\top $$\n$$ y_t = q_t^\\top S_t + v_t^\\top D $$\nwhere:\n$$ \\gamma_t = f(x_t) $$\nThe gate \\(\\gamma_t\\) allows the model to decide how much previous state should be kept. This makes the model more expressive than plain linear attention.\nThis is also why the lecture says \u0026ldquo;gating is good\u0026rdquo;. A simple additive state update is limited; a gated update gives the model a way to forget, preserve, or modulate information.\nGated Delta Net # Gated Delta Net generalises this idea further. 
It not only gates the previous state, but also selectively erases information from the state.\nA simplified form is:\n$$ S_t = \\gamma_t (I - \\beta_t k_t k_t^\\top) S_{t-1} + \\beta_t k_t v_t^\\top $$\n$$ y_t = q_t^\\top S_t $$\nwhere:\n$$ \\gamma_t = f(x_t), \\quad \\beta_t = f(x_t) $$\nThere are two key mechanisms here.\nFirst, \\(\\beta_t\\) controls whether the current input should be written into the state. If:\n$$ \\beta_t = 0 $$\nthen the model performs a \u0026ldquo;no input\u0026rdquo; operation.\nSecond, the term:\n$$ I - \\beta_t k_t k_t^\\top $$\ncan erase information in the direction of the current key. This is more flexible than merely adding new information into the state.\nThis connects to ideas such as fast weight programming and test-time training, where model states or weights are dynamically updated based on the current input.\nHybrid architectures # A major theme in the lecture is that pure attention alternatives are not always used alone. Many recent models use hybrid architectures.\nExamples mentioned include:\nMinimax M1 / minimax-text-01; Nemotron 3; Qwen 3.5 / Qwen Next; Mamba-attention hybrids; Gated Delta Net / attention hybrids. The common pattern is to use attention only in some layers, and replace the rest with more efficient recurrent or linear-time modules.\nFor example, Minimax M1 uses a 7-to-1 hybrid structure: seven linear attention layers and one full attention layer. Nemotron 3 uses a Mamba-attention hybrid. Qwen Next uses a Gated Delta Net / attention hybrid.\nThe reason hybrid designs are attractive is that full attention is still powerful. It provides direct token-token interaction, which is difficult to fully replace. But full attention is expensive, especially for long context. 
Hybrid models try to keep enough full attention for quality while using cheaper alternatives for scalability.\nSparse attention as another alternative # Another direction is sparse attention.\nInstead of replacing attention with a recurrent or linear module, sparse attention keeps the attention mechanism but attends only to a subset of tokens.\nThe lecture discusses DeepSeek Sparse Attention (DSA). The idea is to use a lightweight indexer that selects which tokens should be attended to. This can reduce the attention cost while preserving access to important context.\nA key advantage is that sparse attention can sometimes be adapted after dense short-context pretraining. That means we may not need to train an entirely new architecture from scratch.\nCompared with linear attention, sparse attention is less radical. It still performs token-token attention, but only over selected tokens.\nPart II: Mixture of Experts # What is a Mixture of Experts model? # A Mixture of Experts model replaces a dense feed-forward network with multiple expert feed-forward networks and a routing mechanism.\nIn a normal dense Transformer block, every token passes through the same MLP. In an MoE block, each token is routed to only a small number of experts.\nA simplified MoE layer looks like this:\n$$ h_t = \\sum_{i=1}^{N} g_{i,t} \\mathrm{FFN}_i(u_t) + u_t $$\nwhere:\n\\(N\\) is the number of experts; \\(\\mathrm{FFN}_i\\) is the \\(i\\)-th expert; \\(g_{i,t}\\) is the gate value for token \\(t\\) and expert \\(i\\); \\(u_t\\) is the token representation. Usually, most \\(g_{i,t}\\) values are zero. Only the selected experts are activated.\nThe key idea is that: MoE increases the total number of parameters while keeping the activated FLOPs per token relatively small.\nThis is why MoE is often described as sparse activation.\nWhy are MoEs popular? # MoE models are becoming popular for several reasons.\nFirst, at the same FLOP budget, having more total parameters can improve performance. 
MoE allows the model to have a large number of parameters, but only activate a small subset for each token.\nSecond, MoEs can train faster to reach a given quality level. Some results show that MoEs achieve similar or better performance than dense models with less training compute.\nThird, MoEs are naturally compatible with expert parallelism. Since different experts can be placed on different devices, MoE creates another dimension of parallelism beyond data parallelism, tensor parallelism, and pipeline parallelism.\nFourth, many high-performing open models use MoE or MoE-like designs. The lecture mentions models such as Mixtral, DBRX, Grok, Qwen MoE, DeepSeek MoE, DeepSeek V3, and Llama 4.\nThe practical motivation is, dense models scale compute and parameters together, while MoE partially separates them.\nWhy were MoEs not always popular? # MoEs are powerful, but they are also harder to train and deploy.\nThe lecture highlights several issues:\ninfrastructure complexity multi-node communication overhead routing instability heuristic training objectives load balancing problems fine-tuning overfitting additional stochasticity during serving MoE is not just a modelling trick. It is also a systems problem.\nA dense MLP is easy to execute: every token goes through the same computation. An MoE layer requires routing, dispatching tokens to experts, executing expert computation, and then combining outputs. If experts are distributed across devices, this introduces communication overhead.\nSo MoE only becomes truly attractive when the infrastructure can handle sparse dispatch efficiently.\nWhat usually varies in MoE design? # The lecture gives three main axes of MoE design:\nrouting function expert sizes training objectives Most modern MoEs replace the MLP layer with an MoE layer. 
Less commonly, some models apply MoE to attention heads.\nThe most important part is usually the router.\nRouting function overview # Routing decides which expert handles each token.\nMany routing algorithms can be reduced to \u0026ldquo;choose top-\\(k\\)\u0026rdquo;. The lecture mentions three broad styles:\ntoken chooses expert; expert chooses token; global routing via optimisation. The most common approach is token-choice top-\\(k\\) routing.\nIn token-choice routing, each token computes scores for all experts, then selects the top \\(k\\) experts.\nIn expert-choice routing, each expert chooses tokens.\nIn global routing, the model solves a matching or assignment problem.\nMost practical MoE systems use token-choice routing because it is simple and scalable.\nTop-\\(k\\) routing in detail # A router computes a score between token \\(t\\) and expert \\(i\\). In many models, this score comes from a simple learned projection.\nA common form is:\n$$ s_{i,t} = \\mathrm{Softmax}_i(u_t^\\top e_i) $$\nwhere:\n\\(u_t\\) is the token representation; \\(e_i\\) is a learned expert embedding or router weight; \\(s_{i,t}\\) is the routing score, normalised over the \\(N\\) experts. Then the gate is:\n$$ g_{i,t} = \\begin{cases} s_{i,t}, \u0026amp; s_{i,t} \\in \\mathrm{TopK}(\\{s_{j,t} \\mid 1 \\leq j \\leq N\\}, k) \\\\ 0, \u0026amp; \\text{otherwise} \\end{cases} $$\nThis means only the selected experts receive the token.\nIn practice, different models vary in the details:\nSwitch Transformer uses \\(k=1\\); GShard uses \\(k=2\\); Mixtral uses \\(k=2\\); Grok uses \\(k=2\\); Qwen uses \\(k=4\\); DBRX uses \\(k=4\\); DeepSeek uses larger \\(k\\), such as 6, 7, or 8 depending on the version. Some models apply softmax before selecting top-\\(k\\). 
Others select top-\\(k\\) first and then renormalise the selected scores.\nRouting is similar to a small classifier # Conceptually, the router is like a small classifier or logistic regressor placed before the experts.\nFor each token, it predicts which experts are most suitable. The experts are then applied sparsely.\nThis is why MoE can feel like putting a small \u0026ldquo;expert selection\u0026rdquo; network before the FFN.\nHowever, it is not the same as a normal dense layer. A dense layer would mix all experts for every token, which would destroy the compute advantage. MoE only activates a small subset.\nSo the router is not just adding capacity; it is deciding which capacity to activate.\nOther routing methods # The lecture also mentions less common routing methods.\nOne is reinforcement learning. Early MoE work sometimes used RL to learn routing policies. In principle, this makes sense because routing involves discrete decisions. However, RL methods such as REINFORCE have high gradient variance and add complexity. They work, but not clearly enough to dominate practical systems.\nAnother method is linear assignment. This formulates routing as an optimisation or matching problem. It can produce more globally balanced routing decisions, but it is more complex and less commonly used in large-scale practical MoEs.\nFine-grained experts and shared experts # Recent MoE models often use many smaller experts instead of fewer large experts.\nDeepSeek-style models use fine-grained expert segmentation. The idea is to split experts into smaller pieces and activate more of them.\nSome models also use shared experts. A shared expert is always active, while routed experts are selected by the router.\nThe motivation is:\nrouted experts provide specialisation; shared experts provide common capacity; fine-grained experts allow more flexible combinations. DeepSeek and Qwen use shared experts and fine-grained routed experts. 
OLMoE ablations show gains from fine-grained experts, but not necessarily from shared experts. So this design choice is useful, but its value may depend on model scale, training setup, and implementation details.\nTraining MoEs: the core difficulty # The main training problem is that sparse routing decisions are not differentiable.\nWe want sparsity because it gives efficiency. But discrete top-\\(k\\) selection creates optimisation difficulties.\nThe lecture lists three solutions:\nreinforcement learning; stochastic perturbations; heuristic balancing losses. In practice, modern MoEs mostly use heuristic balancing losses, sometimes combined with noise or other stabilisation tricks.\nStochastic routing approximations # Early MoE work added stochasticity to routing.\nFor example, Shazeer et al. used Gaussian perturbations. The router adds noise before selecting experts. This has two effects:\nexperts become less brittle; the model learns a ranking over experts rather than relying on deterministic hard choices too early. Switch Transformer also used stochastic jitter, which applies a uniform multiplicative perturbation to the router input or logits. The goal is similar: encourage exploration and reduce expert collapse.\nHowever, later work removed some of these tricks when they were not consistently helpful.\nLoad balancing losses # A major issue in MoE training is expert imbalance.\nIf the router sends too many tokens to a small number of experts, those experts become overloaded while others are underused. This is bad for both learning and systems efficiency.\nSwitch Transformer uses an auxiliary load balancing loss. A simplified version is:\n$$ L_{\\mathrm{aux}} = \\alpha N \\sum_{i=1}^{N} f_i P_i $$\nwhere:\n\\(N\\) is the number of experts; \\(f_i\\) is the fraction of tokens routed to expert \\(i\\); \\(P_i\\) is the fraction of router probability assigned to expert \\(i\\); \\(\\alpha\\) controls the strength of the auxiliary loss. 
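The simplified loss above can be evaluated by hand on a toy batch. The sketch below assumes top-1 routing, two experts and four tokens. The function name switch_aux_loss, the value alpha = 0.01 and the router probabilities are illustrative assumptions, not values taken from the lecture.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    # router_probs[t][i]: router softmax probability of expert i for token t.
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f_i: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[probs.index(max(probs))] += 1
    f = [c / num_tokens for c in counts]

    # P_i: average router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Tokens alternate between the two experts.
balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
# Every token prefers expert 0.
collapsed = [[0.9, 0.1]] * 4

print(switch_aux_loss(balanced), switch_aux_loss(collapsed))
```

In this toy case the balanced batch gives 0.01 (that is, alpha) and the collapsed batch gives 0.018, so the penalty grows as routing concentrates on one expert.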
The intuition is that if an expert is used too frequently, the balancing loss pushes the router away from it.\nThis prevents expert collapse and improves hardware utilisation.\nPer-expert and per-device balancing # DeepSeek V1/V2 use balancing objectives at different levels.\nPer-expert balancing is similar to Switch Transformer. It encourages tokens to be distributed evenly across experts.\nPer-device balancing aggregates experts by device. This matters because in distributed MoE training, imbalance is not only about experts. It is also about devices.\nIf one GPU receives too many tokens because it hosts popular experts, the whole system slows down.\nSo the balancing objective must consider:\nexpert-level load; device-level load; communication load. This is where MoE becomes deeply connected to systems design.\nDeepSeek V3: auxiliary-loss-free balancing # DeepSeek V3 introduces a variation based on per-expert biases.\nThe router score is adjusted with a learned or dynamically updated bias:\n$$ s'_{i,t} = s_{i,t} + b_i $$\nHere \\(s_{i,t}\\) is the router score of expert \\(i\\) for token \\(t\\), and \\(b_i\\) is the bias of that expert. Then top-\\(k\\) routing is performed using the biased score.\nThe bias makes underused experts more likely to receive tokens and overused experts less likely to receive tokens. DeepSeek calls this \u0026ldquo;auxiliary-loss-free balancing\u0026rdquo;.\nHowever, the lecture notes that this is not fully auxiliary-loss-free, because DeepSeek V3 still uses a complementary sequence-wise auxiliary loss.\nSo the better interpretation is that DeepSeek V3 reduces reliance on the traditional expert-level auxiliary loss, but does not remove balancing objectives completely.\nWhat happens without load balancing? # If load balancing is removed, routing can become highly imbalanced.\nSome experts may receive almost all tokens, while others receive very few or none. This harms both quality and efficiency.\nThe lecture shows that removing load balancing can cause unstable expert usage patterns. 
The training loss might still go down, but the model becomes less efficient and less robust.\nSystems side of MoE training # MoE enables expert parallelism.\nIn a dense model, the MLP is replicated or partitioned in relatively standard ways. In an MoE model, different experts can live on different devices. Tokens are routed across devices to the experts they select.\nThis creates new forms of parallelism:\ndata parallelism; tensor/model parallelism; expert parallelism; combinations of expert, model, and data parallelism. However, this also creates communication overhead. Token dispatch and combine operations can become bottlenecks.\nModern MoE systems use optimised sparse matrix multiplication and token dispatch. The lecture mentions MegaBlocks as an example of a library that uses smarter sparse matrix multiplication for MoE execution.\nWhether this is beneficial depends on hardware, batch size, routing balance, implementation quality, and interconnect bandwidth.\nMoE and communication reduction # Some recent architectures modify the MoE design to reduce communication.\nNemotron 3 introduces a design where activations are down-projected before expert dispatch. The idea is to reduce the amount of data that must be communicated between devices.\nThis reflects an important systems principle: in distributed MoE, routing quality is not enough; communication volume also matters.\nIf the model sends high-dimensional activations across devices for every routed token, communication can dominate. Reducing activation size before dispatch is one way to improve efficiency.\nStochasticity of MoE models during serving # MoE models can have additional stochasticity beyond ordinary dense models.\nOne reason is token dropping. If an expert has limited capacity, and too many tokens are routed to it in the same batch, some tokens may be dropped or rerouted.\nThis can happen at the batch level. 
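A toy sketch can make the batch-level effect concrete. It assumes a simple first-come-first-served capacity rule in which overflow tokens are dropped; the function name `dispatch_with_capacity` and the token labels are hypothetical, and real serving systems may reroute or pad instead.

```python
def dispatch_with_capacity(expert_choices, capacity):
    # expert_choices: list of (token_name, expert_index) in batch order.
    # Each expert accepts at most `capacity` tokens; later arrivals are dropped.
    load = {}
    kept, dropped = [], []
    for token, expert in expert_choices:
        if load.get(expert, 0) < capacity:
            load[expert] = load.get(expert, 0) + 1
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped


mine = [('my_tok_0', 0), ('my_tok_1', 0)]
other = [('other_tok_0', 0), ('other_tok_1', 0)]

# Alone in the batch: both of my tokens fit into expert 0.
kept_alone, dropped_alone = dispatch_with_capacity(mine, capacity=2)

# Batched behind another request that also prefers expert 0: mine overflow.
kept_shared, dropped_shared = dispatch_with_capacity(other + mine, capacity=2)

print(dropped_alone)
print(dropped_shared)
```

The same two tokens are processed in one batch and dropped in the other, even though the request itself did not change.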
Therefore, another user’s query in the same batch can affect whether your token gets processed by its selected expert.\nThis is not typical for dense models, where each request is usually independent except for low-level numerical nondeterminism.\nRouter stability and z-loss # MoE routers can be numerically unstable.\nThe router often uses softmax over expert logits. Small numerical differences in logits can lead to different expert choices, especially when logits are close. This is dangerous because routing is discrete.\nOne solution is to compute the router in float32, even if the rest of the model uses lower precision.\nAnother stabilisation method is router z-loss. A simplified form is:\n$$ L_z = \\frac{1}{B} \\sum_{i=1}^{B} \\left( \\log \\sum_{j=1}^{N} e^{x_{ij}} \\right)^2 $$\nHere \\(B\\) is the number of tokens in the batch, \\(N\\) is the number of experts, and \\(x_{ij}\\) is the router logit of expert \\(j\\) for token \\(i\\). The z-loss penalises large router logits and helps keep routing numerically stable.\nThe intuition is that router logits should not grow too large, because overconfident routing can become unstable.\nMoE fine-tuning issues # Sparse MoEs can overfit during fine-tuning, especially when the fine-tuning dataset is small.\nOne reason is that only a subset of experts receives updates for a given batch. Some experts may be undertrained or over-specialised.\nThe lecture mentions two approaches:\nZoph et al.: fine-tune non-MoE MLPs; DeepSeek: use a large amount of supervised fine-tuning data, such as 1.4M SFT examples. Upcycling dense models into MoE models # Upcycling asks:\nCan we initialise an MoE model from a pretrained dense language model?\nInstead of training an MoE from scratch, we copy or transform parts of a dense model into multiple experts.\nFor example, MiniCPM-MoE uses a MiniCPM base model and turns it into an MoE with top-\\(k=2\\), 8 experts, and around 4B active parameters.\nQwen MoE is another example. 
It is initialised from Qwen 1.8B and uses top-\\(k=4\\), 60 experts, and 4 shared experts.\nThe motivation is obvious:\ndense pretraining is expensive; MoE training is complex; upcycling may reuse existing dense models and reduce cost. The key question is whether copied experts can specialise after continued training.\nDeepSeek MoE evolution # The lecture ends by walking through DeepSeek MoE architectures.\nDeepSeek MoE V1 # DeepSeek MoE V1 uses:\n16B total parameters; 2.8B active parameters; standard top-\\(k\\) routing; 2 shared experts; fine-grained experts; standard auxiliary-loss balancing at expert and device levels. DeepSeek MoE V2 # DeepSeek MoE V2 scales up to:\n236B total parameters; 21B active parameters; 2 shared experts; 160 fine-grained experts; 6 active routed experts; top-\\(M\\) device routing; communication balancing loss. The additional device routing and communication balancing reflect the fact that MoE at this scale is heavily constrained by distributed systems efficiency.\nDeepSeek MoE V3 # DeepSeek V3 scales further to:\n671B total parameters; 37B active parameters; 1 shared expert; 256 fine-grained routed experts; 8 active routed experts; sigmoid + softmax top-\\(k\\) routing; top-\\(M\\) device routing; auxiliary-loss-free style balancing; complementary sequence-wise auxiliary loss. The architecture is not just about adding experts. It also adds mechanisms for balancing, communication control, and routing stability.\nMy understanding # The lecture connects two ideas that initially look separate: attention alternatives and MoE.\nAttention alternatives deal with the sequence-length problem. MoE deals with the model-capacity problem. Both directions use sparsity or structured computation:\nsparse attention selects tokens; linear attention compresses history into a state; Mamba-like models use gated recurrent states; MoE selects experts. 
The common theme is that modern LLM scaling is no longer just \u0026ldquo;make everything dense and bigger\u0026rdquo;. It is increasingly about choosing which computation should be activated.\nThis also explains why systems engineering becomes central. Sparse or conditional computation sounds efficient in theory, but the real benefit depends on implementation. Routing, communication, batching, memory layout, and numerical stability all matter.\nFor MoE in particular, the model is only half of the story. The other half is whether the system can route and execute tokens efficiently.\n","date":"2026年5月24日","externalUrl":null,"permalink":"/note/stanford-cs336/cs336-lecture-4---attention-alternatives-and-mixture-of-experts/","section":"笔记","summary":"","title":"CS336 Lecture 4 - Attention Alternatives and Mixture of Experts","type":"note"},{"content":" Please read along with the original slides\nStanford CS336 shifts pretty hard here from \u0026ldquo;model architecture\u0026rdquo; into systems territory.\nThis lecture is basically the point where transformers stop feeling abstract and start colliding with hardware reality.\nA lot of modern ML progress is not just \u0026ldquo;better models\u0026rdquo;. It\u0026rsquo;s also:\nfaster matrix multiplication better memory layouts lower precision arithmetic compiler tricks smarter scheduling fewer memory accesses Once you start looking at GPUs this way, FlashAttention stops looking like black magic and starts looking like a very aggressive memory optimization.\nWhy GPUs matter so much for LLMs # The lecture starts with a pretty blunt observation: LLM scaling depends on compute scaling.\nFor a while, classical CPU scaling mostly came from Dennard scaling and frequency improvements. Clock speeds went up, transistors got smaller, everything got faster automatically. That trend basically died in the 2000s.\nModern scaling instead comes from parallelism. 
GPUs scaled extremely aggressively over the past decade, especially for matrix operations.\nThere is also an important asymmetry:\ncompute throughput keeps growing very quickly memory bandwidth grows much more slowly This imbalance ends up dominating almost every optimization later in the lecture.\nCPUs vs GPUs # The lecture frames CPUs and GPUs as fundamentally different optimization targets.\nCPUs care about latency. They want a small number of threads to finish quickly. So CPUs invest heavily in:\nbranch prediction speculative execution large caches complicated control logic GPUs care about throughput. They want enormous numbers of operations happening simultaneously, even if individual threads are relatively \u0026ldquo;dumb\u0026rdquo;. So GPUs instead spend silicon budget on:\nlots of ALUs many lightweight execution units huge parallel execution capability This tradeoff matters because ML workloads are unusually regular. Matrix multiplications do not require complicated branching logic. They mostly need raw arithmetic throughput. That makes GPUs a very good fit.\nThe execution structure of a GPU # Streaming Multiprocessors (SMs) # An SM is basically a large execution cluster inside the GPU. A GPU contains many SMs operating independently.\nEach SM has:\ncompute units (ALU) schedulers registers shared memory You can think of SMs as tiny massively parallel processors living inside the GPU.\nThreads # Threads are the smallest execution units. Each thread runs the same instruction sequence on different data.\nThis is the SIMT model: Single Instruction, Multiple Threads.\nBlocks # Threads are grouped into blocks. A block executes on one SM and shares access to that SM\u0026rsquo;s shared memory.\nWarps # Threads execute in groups called warps. On NVIDIA GPUs, a warp contains 32 threads. The GPU scheduler issues instructions warp-by-warp, not thread-by-thread. 
This detail explains a surprising amount of GPU behavior later.\nGPU memory hierarchy # The hierarchy roughly looks like this:\nRegisters Shared memory / L1 cache L2 cache Global memory (HBM / DRAM) Shared memory is SRAM-based and extremely fast, but expensive and limited. Global memory is large but comparatively slow.\nThis creates the central problem of GPU programming: how to keep compute units busy without constantly waiting for memory?\nTensor cores changed everything # Modern GPUs contain dedicated matrix multiplication hardware called tensor cores.\nThis matters because matrix multiplication throughput exploded compared to ordinary floating point operations, which is why transformers are so aligned with GPU hardware.\nThe memory wall # Compute scaling outpaced memory scaling. This means modern GPUs can theoretically perform absurd amounts of arithmetic, but often cannot fetch data fast enough to stay fully utilized. A lot of ML systems engineering is really about avoiding memory bottlenecks rather than increasing raw compute.\nThe roofline model # Performance is limited by either:\ncompute throughput memory bandwidth Arithmetic intensity roughly means: operations performed per byte moved\nIf arithmetic intensity is low, the workload becomes memory-bound. If arithmetic intensity is high enough, the workload becomes compute-bound. Control divergence # GPUs execute threads in warps, and all threads in the same warp share one instruction stream. A conditional branch is efficient when every thread in the warp takes the same path, because the warp can continue executing normally.\nThe problem appears when different threads in the same warp take different branches. The GPU cannot execute both paths independently at the same time. Instead, it runs one branch while masking out the inactive threads, then runs the other branch while masking out the remaining threads. 
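The masking behaviour can be mimicked with a small pure-Python model. This is only a sketch under strong assumptions: one pass per branch path, equal cost for both paths, and a 32-thread warp; the names `warp_passes` and `utilization` are hypothetical and nothing here reflects real scheduler details.

```python
WARP_SIZE = 32


def warp_passes(branch_taken):
    # branch_taken[i] is True if thread i of the warp takes the if-path.
    # The warp issues one pass per path that at least one thread needs.
    masks = []
    if any(branch_taken):
        masks.append([b for b in branch_taken])      # if-path active mask
    if not all(branch_taken):
        masks.append([not b for b in branch_taken])  # else-path active mask
    return masks


def utilization(masks):
    # Fraction of thread slots doing useful work across all passes.
    active = sum(sum(m) for m in masks)
    return active / (len(masks) * WARP_SIZE)


# Every thread takes the same path: a single pass, nothing masked.
uniform = warp_passes([True] * WARP_SIZE)

# Even and odd threads disagree: two passes, half the warp masked in each.
diverged = warp_passes([tid % 2 == 0 for tid in range(WARP_SIZE)])

print(len(uniform), utilization(uniform))
print(len(diverged), utilization(diverged))
```

In this model the diverged warp needs two passes and only half of its thread slots are active in each of them.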
The final result is correct, but part of the warp is idle during each branch.\nThis is called control divergence. It is not mainly a memory bandwidth problem; it comes from the SIMT execution model itself. Divergence reduces effective parallelism, so GPU kernels usually try to keep neighboring threads following the same control flow whenever possible.\nLow precision computation # Lower precision means:\nless memory traffic smaller tensors higher arithmetic intensity better tensor core throughput For FP32:\n4-byte reads 4-byte writes For FP16:\n2-byte reads 2-byte writes Same operation count, less data movement.\nMixed precision and tensor cores # Modern tensor cores heavily optimize lower precision operations like:\nFP16 BF16 FP8 Usually accumulation still happens in FP32 for numerical stability.\nThis is one reason mixed precision training became standard.\nFP8 and MXFP8 # MXFP8 is a block-scaled FP8 format. Instead of using one scale for an entire tensor, it assigns separate scaling factors to small groups of values. This gives the format more local dynamic range, which helps preserve accuracy while still keeping the storage and memory bandwidth benefits of FP8.\nThe extra scaling metadata also makes layout transformations more complicated. A transpose is no longer just a different view of the same underlying values, because the scale factors are tied to the original block layout. After transposition, the grouping of values changes, so the tensor often needs to be quantized again with a different set of scales.\nIn practice, MXFP8 training systems may keep separate quantized layouts for the same tensor: one for the original orientation and one for the transposed orientation. This reduces transpose overhead during computation, but it also increases memory usage and implementation complexity.\nOperator fusion # Many elementwise operations are limited more by memory traffic than by arithmetic cost. 
If each operation is executed as a separate CUDA kernel, every intermediate result has to be written to global memory and then read back by the next kernel.\nFor example:\ny = sin(x)**2 + cos(x)**2 A naive implementation may launch separate kernels for sin, cos, squaring, addition, and other intermediate steps. Although each operation is simple, the repeated global memory reads and writes dominate the runtime.\nOperator fusion reduces this overhead by combining multiple operations into a single kernel. Intermediate values can stay in registers or local on-chip storage instead of being materialized in global memory. This reduces kernel launch overhead and, more importantly, cuts down unnecessary memory traffic.\nThis kind of fusion is especially effective for pointwise operations, where the computation per element is small and the main bottleneck is moving data to and from HBM.\nRecomputation # During training, the backward pass typically requires intermediate activations produced during the forward pass.\nA straightforward implementation stores all activations in memory so they can later be reused during gradient computation.\nHowever, activation storage creates substantial memory traffic:\nactivations must be written to global memory during the forward pass later retrieved again during the backward pass For deep networks, especially transformers, these memory accesses can become more expensive than the extra arithmetic needed to recompute the activations.\nRecomputation (or activation checkpointing) trades additional compute for reduced memory usage and lower memory bandwidth pressure.\nInstead of storing every intermediate activation, the system stores only selected checkpoints and recomputes missing activations when needed during backpropagation.\nOn modern GPUs, this tradeoff is often favorable because compute throughput scales faster than memory bandwidth.\nMemory coalescing # Global memory is served through aligned memory transactions rather than isolated 
scalar reads. When a warp issues a load instruction, the GPU checks the addresses requested by its 32 threads. If these addresses are contiguous or fall within a small number of aligned memory segments, the load can be served efficiently. This access pattern is called memory coalescing.\nFor row-major matrices, elements in the same row are contiguous, while elements in the same column are separated by the row stride. Therefore, a kernel where neighbouring threads read along a row is usually more memory efficient than one where neighbouring threads read down a column, even if both perform the same arithmetic.\nCoalescing explains why memory layout and thread mapping matter in CUDA. Tiling builds on the same idea: a kernel first loads matrix tiles from global memory into shared memory using coalesced accesses, then reuses those values locally instead of repeatedly reading from global memory.\nTiling # Tiling reorganizes matrix multiplication around blocks of data rather than individual output elements. Instead of having each thread repeatedly fetch values from global memory, the kernel loads a tile from matrix A and a tile from matrix B into shared memory, then uses those tiles to compute part of a tile in matrix C.\nThe outer loop moves across tiles along the reduction dimension. For each step, the current A and B tiles are loaded once from global memory and reused by many threads inside the block. The inner loop then multiplies elements from those shared-memory tiles and accumulates partial sums for the output tile.\nThis changes the memory behavior of matrix multiplication. In a naive kernel, the same values from A and B may be loaded from global memory many times by different threads. With tiling, those values are fetched from global memory much less often and reused from faster on-chip memory.\nThe main benefit is not fewer FLOPs. The arithmetic is the same. 
The benefit is higher arithmetic intensity: more multiply-add operations are performed for each byte loaded from global memory.\nWhy tiling improves arithmetic intensity # Tiling does not change the number of multiply-add operations in matrix multiplication. It changes where the input values are read from.\nIn a non-tiled matrix multiplication, each input value may be read from global memory N times, because different output elements repeatedly need the same values. With tile size T, each input value is read from global memory only N/T times. After a tile is loaded into shared memory, the values inside that tile can be reused T times before the kernel moves to the next tile.\nThis gives a factor of T reduction in global memory access under the simplified square-matrix model shown above.\nThe result is higher arithmetic intensity. The kernel performs the same arithmetic, but it performs more multiply-add operations for each byte fetched from global memory. This makes it easier for the GPU to use its compute units instead of waiting on HBM.\nPractical limits of tiling # The ideal tile shape is constrained by the hardware. A tile has to fit into shared memory, map well to warps, and produce coalesced global memory loads. Matrix dimensions also matter. If the dimensions are not divisible by the tile size, boundary tiles may contain inactive threads or unused elements.\nAlignment adds another constraint. Global memory is served through aligned memory transactions, so a tile is efficient when its rows line up with these transaction boundaries. In the aligned case, the data loaded from global memory mostly belongs to the tile being computed.\nIn the unaligned case, the same logical tile may cross transaction boundaries. The GPU then has to fetch extra memory segments, and some of the fetched values are not useful for the current tile. 
The arithmetic work is almost unchanged, but the number of memory transactions increases.\nThis is one reason GPU performance curves are often jagged. A small change in matrix shape can change tile utilization, memory alignment, or scheduling behavior, even when the FLOP count changes only slightly.\nWave quantization # Wave quantization is another source of uneven performance. The GPU schedules thread blocks across a fixed number of SMs. If the number of tiles maps cleanly onto the available SMs, most SMs stay busy. If the tile count slightly exceeds a multiple of the SM count, the GPU may need an extra scheduling wave with only a small amount of remaining work.\nFor example, increasing a matrix dimension from 1792 to 1793 can increase the number of tiles from 98 to 120 for a particular tile shape. On an A100 with 108 SMs, 98 tiles fit into one wave, while 120 tiles require a second wave. The second wave is under-filled, so utilization drops.\nThis explains why a larger matrix can sometimes run slower than a slightly smaller one. The arithmetic changed only a little, but the mapping onto hardware changed a lot.\nFlashAttention # FlashAttention uses the same hardware-aware ideas from tiled matrix multiplication, but applies them to attention. The point is not to approximate attention or reduce the number of attention scores. It still computes exact attention. The improvement comes from reducing memory traffic.\nWhy standard attention is memory hungry # Scaled dot-product attention can be written as:\n$$ S = QK^T $$\n$$ P = \\operatorname{softmax}(S) $$\n$$ O = PV $$\nThe expensive part is not only the matrix multiplication. A naive implementation materializes the attention score matrix S and often the probability matrix P in global memory. For a sequence length of N, these matrices have size N × N.\nThis creates a large amount of HBM traffic. 
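To get a feel for the sizes involved, here is a back-of-the-envelope calculation. It assumes FP16 storage (2 bytes per element), a single head and a single sequence, and counts only the traffic for S and P while ignoring Q, K, V and the output; the function names are hypothetical.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    # One N x N matrix (S or P) for a single head and a single sequence.
    return seq_len * seq_len * bytes_per_element


def naive_hbm_traffic_bytes(seq_len, bytes_per_element=2):
    # Naive pipeline: write S, read S for softmax, write P, read P for P @ V.
    return 4 * attention_matrix_bytes(seq_len, bytes_per_element)


for n in (1024, 4096, 16384):
    matrix_mib = attention_matrix_bytes(n) / 2**20
    traffic_mib = naive_hbm_traffic_bytes(n) / 2**20
    print(n, matrix_mib, traffic_mib)
```

Because the matrices are N × N, quadrupling the sequence length multiplies both numbers by sixteen, and real models multiply them again by the number of heads and the batch size.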
The GPU computes QK^T, writes the scores to global memory, reads them back for softmax, writes the softmax result, then reads it again for the multiplication with V.\nThe FLOP count is high, but the memory traffic is the larger problem.\nTiling attention # FlashAttention avoids materializing the full attention matrix in HBM. It splits Q, K, and V into blocks and computes attention block by block.\nFor each query block, the kernel loads a block of K and V, computes a partial attention result, and updates the output. The intermediate attention scores are kept in on-chip memory as much as possible. They do not need to be written out as a full N × N matrix.\nThis is the same basic idea as tiled matrix multiplication: move a small block of data into fast memory, reuse it, and avoid repeated global memory access.\nThe softmax problem # The difficult part is softmax. Matrix multiplication can be tiled directly, but softmax normally needs information from the whole row because each output depends on the row maximum and the normalization sum.\nFlashAttention handles this with online softmax. Instead of computing softmax after the full attention row has been materialized, the kernel updates the row maximum and normalization term incrementally as it processes each tile.\nFor each tile, the algorithm keeps track of:\nthe running maximum for numerical stability the running normalization denominator the partial weighted sum with V This makes it possible to compute the same softmax result tile by tile without storing the full attention matrix.\nForward pass intuition # In the forward pass, FlashAttention combines several ideas:\ntile-wise computation of QK^T fusion of scaling, masking, exponentiation, and normalization online softmax across tiles immediate multiplication with the corresponding V tile The intermediate attention matrix exists conceptually, but not as a large tensor stored in HBM. 
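The running-maximum and running-denominator bookkeeping can be sketched for a single query row in pure Python. This is a minimal sketch, not the FlashAttention kernel: values are scalars instead of vectors, there is no masking or scaling, and the names `online_softmax_weighted_sum` and `reference` are assumptions made for illustration.

```python
import math


def online_softmax_weighted_sum(scores, values, tile_size):
    # Computes sum_j softmax(scores)_j * values[j] one tile at a time,
    # keeping only a running max, a running denominator and a running sum.
    running_max = float('-inf')
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), tile_size):
        s_tile = scores[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


def reference(scores, values):
    # Ordinary softmax that sees the whole row at once.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tiled = online_softmax_weighted_sum(scores, values, tile_size=2)
full = reference(scores, values)
print(tiled, full)
```

The two results should agree up to floating-point rounding. Only one tile of scores is alive at a time in this loop; the full row of scores is never stored.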
This is the main reason FlashAttention is faster and more memory efficient than a naive implementation.\nBackward pass and recomputation # The backward pass uses the same memory-saving philosophy. Instead of storing every intermediate attention value from the forward pass, FlashAttention recomputes some of them tile by tile during backward.\nThis trades extra computation for much lower memory usage. On modern GPUs, this tradeoff often makes sense because compute throughput is abundant while HBM bandwidth and capacity are more limited.\nTakeaway # The lecture\u0026rsquo;s main message is that GPU performance is often limited by data movement rather than raw arithmetic.\nSeveral techniques follow from that idea:\ncoalescing makes global memory transactions more efficient fusion avoids unnecessary intermediate writes recomputation trades extra FLOPs for lower memory pressure tiling moves reusable data into shared memory FlashAttention applies these ideas to attention This is a useful way to read many ML systems papers. 
The important question is often not just \u0026ldquo;how many FLOPs does this use?\u0026rdquo;, but \u0026ldquo;where does the data live, and how many times does it move?\u0026rdquo;\n","date":"2026年5月27日","externalUrl":null,"permalink":"/note/stanford-cs336/cs336-lecture-5---gpus/","section":"笔记","summary":"","title":"CS336 Lecture 5 - GPUs","type":"note"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/categories/attention/","section":"Categories","summary":"","title":"Attention","type":"categories"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/categories/cs336/","section":"Categories","summary":"","title":"CS336","type":"categories"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/categories/cuda/","section":"Categories","summary":"","title":"CUDA","type":"categories"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/categories/flashattention/","section":"Categories","summary":"","title":"FlashAttention","type":"categories"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/categories/gpu/","section":"Categories","summary":"","title":"GPU","type":"categories"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/categories/llm/","section":"Categories","summary":"","title":"LLM","type":"categories"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/authors/martincao/","section":"Authors","summary":"","title":"Martincao","type":"authors"},{"content":"","date":"2026年5月27日","externalUrl":null,"permalink":"/categories/mlsys/","section":"Categories","summary":"","title":"MLsys","type":"categories"},{"content":"","date":"2026年5月24日","externalUrl":n
ull,"permalink":"/categories/moe/","section":"Categories","summary":"","title":"MoE","type":"categories"},{"content":"","date":"2026年5月24日","externalUrl":null,"permalink":"/categories/transformer/","section":"Categories","summary":"","title":"Transformer","type":"categories"},{"content":"","date":"2026年5月13日","externalUrl":null,"permalink":"/categories/pytorch/","section":"Categories","summary":"","title":"PyTorch","type":"categories"},{"content":"","date":"2026年4月16日","externalUrl":null,"permalink":"/categories/lda/","section":"Categories","summary":"","title":"LDA","type":"categories"},{"content":"","date":"2026年4月16日","externalUrl":null,"permalink":"/categories/ml/","section":"Categories","summary":"","title":"ML","type":"categories"},{"content":" Mathematical foundations # 1. Mean vector # For one-dimensional data:\n$$ \\bar{x} = \\frac{1}{n} \\sum_{i=1}^n x^{(i)} $$\nFor multi-dimensional data, each sample is a vector:\n$$ \\mathbf{x}^{(i)} = \\begin{bmatrix} x_1^{(i)} \\\\ x_2^{(i)} \\\\ \\vdots \\\\ x_d^{(i)} \\end{bmatrix} $$\nIts mean vector is defined as:\n$$ \\mathbf{\\mu} = \\frac{1}{n} \\sum_{i=1}^n \\mathbf{x}^{(i)}, \\quad \\mathbf{\\mu} \\in \\mathbb{R}^d $$\nWritten out component by component:\n$$ \\mathbf{\\mu} = \\begin{bmatrix} \\mu_1 \\\\ \\mu_2 \\\\ \\vdots \\\\ \\mu_d \\end{bmatrix} = \\begin{bmatrix} \\frac{1}{n} \\sum_{i=1}^n x_1^{(i)} \\\\ \\frac{1}{n} \\sum_{i=1}^n x_2^{(i)} \\\\ \\vdots \\\\ \\frac{1}{n} \\sum_{i=1}^n x_d^{(i)} \\\\ \\end{bmatrix} $$\nFor example, suppose the dataset \\(\\mathbf{x}\\) contains three two-dimensional samples:\n$$ x^{(1)} = \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix},\\space x^{(2)} = \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix},\\space x^{(3)} = \\begin{bmatrix} 5 \\\\ 6 \\end{bmatrix} $$\nThen its mean vector is:\n$$ \\mathbf{\\mu} = \\frac{1}{3} \\left( \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix} + \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix} + \\begin{bmatrix} 5 \\\\ 6 \\end{bmatrix} \\right) = \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix} $$\n2. 
Covariance matrix # For one-dimensional data, the variance is:\n$$ \\mathrm{Var}(x)=\\frac1n\\sum_{i=1}^{n} (x^{(i)} - \\bar x)^2 $$\nTo measure how two variables vary together, we define the covariance as:\n$$ \\mathrm{Cov}(x, y) = \\frac{1}{n} \\sum_{i=1}^{n} (x^{(i)} - \\bar{x}) (y^{(i)} - \\bar{y}) $$\nFor example, take\n$$ x^{(1)} = \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix},\\space x^{(2)} = \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix},\\space x^{(3)} = \\begin{bmatrix} 5 \\\\ 6 \\end{bmatrix} $$\nWriting \\(x_1\\) for the first dimension and \\(x_2\\) for the second, we get\n$$ \\bar{x}_1 = \\frac{1 + 3 + 5}{3} = 3,\\space \\bar{x}_2 = \\frac{2 + 4 + 6}{3} = 4 $$\nTherefore:\n$$ \\begin{align*} \\mathrm{Cov}(x_1, x_2) \u0026amp;= \\frac{1}{3}[(1 - 3)(2 - 4) + (3 - 3)(4 - 4) + (5 - 3)(6 - 4)] \\\\ \u0026amp;= \\frac{1}{3}(4 + 0 + 4) = \\frac{8}{3} \\end{align*} $$\nFor multi-dimensional data, where each sample is a vector, the covariance matrix is defined as:\n$$ \\Sigma = \\frac{1}{n} \\sum_{i=1}^{n} (\\mathbf{x}^{(i)} - \\mathbf{\\mu})(\\mathbf{x}^{(i)} - \\mathbf{\\mu})^T $$\nwhere \\(\\Sigma \\in \\mathbb{R}^{d \\times d}\\).\nEach entry of \\(\\Sigma\\) is the covariance between two dimensions:\n$$ \\Sigma_{jk} = \\mathrm{Cov}(x_j, x_k) $$\nSo for a two-dimensional vector\n$$ \\mathbf{x} = \\begin{bmatrix} x_1 \\\\ x_2 \\end{bmatrix} $$\nthe covariance matrix is:\n$$ \\Sigma = \\begin{bmatrix} \\mathrm{Var}(x_1) \u0026amp; \\mathrm{Cov}(x_1, x_2) \\\\ \\mathrm{Cov}(x_2, x_1) \u0026amp; \\mathrm{Var}(x_2) \\end{bmatrix} $$\nFor example, continuing with the dataset above:\n$$ \\mathbf{x}^{(1)} = \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix},\\space \\mathbf{x}^{(2)} = \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix},\\space \\mathbf{x}^{(3)} = \\begin{bmatrix} 5 \\\\ 6 \\end{bmatrix} $$\nWe already know that:\n$$ \\mathbf{\\mu} = \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix} $$\nThen:\n$$ \\mathbf{x}^{(1)} - \\mathbf{\\mu} = \\begin{bmatrix} -2 \\\\ -2 \\end{bmatrix},\\space \\mathbf{x}^{(2)} - \\mathbf{\\mu} = \\begin{bmatrix} 0 \\\\ 0 \\end{bmatrix},\\space \\mathbf{x}^{(3)} - \\mathbf{\\mu} = \\begin{bmatrix} 2 \\\\ 2 \\end{bmatrix} $$\nTherefore:\n$$ \\begin{align*} \\Sigma \u0026amp;= \\frac{1}{3} \\left[ \\begin{bmatrix} -2 \\\\ -2 \\end{bmatrix} \\begin{bmatrix} -2 \u0026amp; -2 \\end{bmatrix} + \\begin{bmatrix} 0 \\\\ 0 \\end{bmatrix} 
\\begin{bmatrix} 0 \u0026amp; 0 \\end{bmatrix} + \\begin{bmatrix} 2 \\\\ 2 \\end{bmatrix} \\begin{bmatrix} 2 \u0026amp; 2 \\end{bmatrix} \\right] \\\\ \u0026amp;= \\frac{1}{3} \\left[ \\begin{bmatrix} 4 \u0026amp; 4 \\\\ 4 \u0026amp; 4 \\end{bmatrix} + \\begin{bmatrix} 0 \u0026amp; 0 \\\\ 0 \u0026amp; 0 \\end{bmatrix} + \\begin{bmatrix} 4 \u0026amp; 4 \\\\ 4 \u0026amp; 4 \\end{bmatrix} \\right] \\\\ \u0026amp;= \\frac{1}{3} \\begin{bmatrix} 8 \u0026amp; 8 \\\\ 8 \u0026amp; 8 \\end{bmatrix} \\\\ \u0026amp;= \\begin{bmatrix} \\frac{8}{3} \u0026amp; \\frac{8}{3} \\\\ \\frac{8}{3} \u0026amp; \\frac{8}{3} \\end{bmatrix} \\end{align*} $$\nProjecting onto a line # The core idea of Fisher linear discriminant analysis is to project high-dimensional samples onto a line:\n$$ y = \\mathbf{w}^T \\mathbf{x} $$\nwhere:\n\\(\\mathbf{x} \\in \\mathbb{R}^d\\) is the original sample; \\(\\mathbf{w} \\in \\mathbb{R}^d\\) is the projection direction; \\(y \\in \\mathbb{R}\\) is the scalar obtained after projection. Geometrically, this projects every sample onto the line spanned by \\(\\mathbf{w}\\).\nMean after projection # Let the mean vector of the original data be \\(\\mathbf{\\mu}\\).\nThen the mean of the projected data is:\n$$ \\mu_y = \\mathbf{w}^T \\mathbf{\\mu} $$\nThis follows from the linearity of expectation:\n$$ \\begin{align*} \\mu_y \u0026amp;= \\mathbb{E}[y] \\\\ \u0026amp;= \\mathbb{E}[\\mathbf{w}^T\\mathbf{x}] \\\\ \u0026amp;= \\mathbf{w}^T\\mathbb{E}[\\mathbf{x}] \\\\ \u0026amp;= \\mathbf{w}^T\\mathbf{\\mu} \\end{align*} $$\nSo the centre of each class is also mapped onto this line by the projection.\nVariance after projection # Let the original covariance matrix be \\(\\Sigma\\).\nThen the variance of the projected data is:\n$$ \\sigma_y^2 = \\mathbf{w}^T \\Sigma \\mathbf{w} $$\nProof:\n$$ \\begin{align*} \\sigma_y^2 \u0026amp;= \\mathrm{Var}(y) \\\\ \u0026amp;= \\mathrm{Var}(\\mathbf{w}^T\\mathbf{x}) \\\\ \u0026amp;= \\mathbb{E}\\left[(\\mathbf{w}^T\\mathbf{x} - \\mathbf{w}^T\\mathbf{\\mu})^2\\right] \\\\ \u0026amp;= \\mathbb{E}\\left[(\\mathbf{w}^T(\\mathbf{x}-\\mathbf{\\mu}))^2\\right] \\\\ \u0026amp;= \\mathbb{E}\\left[\\mathbf{w}^T(\\mathbf{x}-\\mathbf{\\mu})(\\mathbf{x}-\\mathbf{\\mu})^T\\mathbf{w}\\right] \\\\ \u0026amp;= \\mathbf{w}^T \\mathbb{E}\\left[(\\mathbf{x}-\\mathbf{\\mu})(\\mathbf{x}-\\mathbf{\\mu})^T\\right] \\mathbf{w} \\\\ \u0026amp;= \\mathbf{w}^T \\Sigma \\mathbf{w} \\end{align*} 
$$\n因此，投影后的方差是协方差矩阵的一个二次型。\nFisher 准则 # 假设我们有两个类别：\n类别 0，均值为 \\(\\mathbf{\\mu}_0\\)，协方差为 \\(\\Sigma_0\\) 类别 1，均值为 \\(\\mathbf{\\mu}_1\\)，协方差为 \\(\\Sigma_1\\) 投影到 \\(\\mathbf{w}\\) 后：\n投影均值变为： $$ m_0 = \\mathbf{w}^T \\mathbf{\\mu}_0,\\quad m_1 = \\mathbf{w}^T \\mathbf{\\mu}_1 $$\n投影方差变为： $$ s_0^2 = \\mathbf{w}^T \\Sigma_0 \\mathbf{w},\\quad s_1^2 = \\mathbf{w}^T \\Sigma_1 \\mathbf{w} $$\n投影后，分类问题变成一维问题。\n因此，一个好的投影方向应满足：\n投影后的类别中心尽可能远； 各类别内部投影后尽可能集中。 因此 Fisher 提出如下目标函数：\n$$ J(\\mathbf{w}) = \\frac{(m_0 - m_1)^2}{s_0^2 + s_1^2} $$\n其中：\n分子衡量投影后类别均值之间的距离； 分母衡量投影后类别内部的总离散程度。 代入均值/方差公式得：\n$$ J(\\mathbf{w}) = \\frac{(\\mathbf{w}^T(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1))^2}{\\mathbf{w}^T(\\Sigma_0+\\Sigma_1)\\mathbf{w}} $$\nScatter Matrices # 为了简化记号，引入两个散度矩阵。\n类内散度矩阵（within-class scatter matrix） 表示各类别内部的离散程度：\n$$ S_W = \\Sigma_0 + \\Sigma_1 $$\n类间散度矩阵（between-class scatter matrix） 表示类别中心之间的分离程度：\n$$ S_B = (\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)^T $$\n利用上述定义：\n$$ \\mathbf{w}^T S_B \\mathbf{w} = (\\mathbf{w}^T(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1))^2 $$\n以及\n$$ \\begin{align*} \\mathbf{w}^T S_W \\mathbf{w} \u0026amp;= \\mathbf{w}^T \\Sigma_0 \\mathbf{w} + \\mathbf{w}^T \\Sigma_1 \\mathbf{w} \\\\ \u0026amp;= s_0^2+s_1^2 \\end{align*} $$\n因此 Fisher 准则可写为：\n$$ J(\\mathbf{w}) = \\frac{\\mathbf{w}^T S_B \\mathbf{w}}{\\mathbf{w}^T S_W \\mathbf{w}} $$\n最优投影方向 # 我们希望找到使 Fisher 准则最大的投影方向：\n$$ \\mathbf{w}^* = \\arg\\max_{\\mathbf{w}} \\frac{\\mathbf{w}^T S_B \\mathbf{w}}{\\mathbf{w}^T S_W \\mathbf{w}} $$\n由于缩放 \\(\\mathbf{w}\\) 不会改变该比值，\n$$ J(c\\mathbf{w}) = J(\\mathbf{w}),\\quad c\\neq 0 $$\n因此只有方向重要。\n于是可加约束：\n$$ \\mathbf{w}^T S_W \\mathbf{w}=1 $$\n将问题转化为：\n$$ \\max_{\\mathbf{w}} \\mathbf{w}^T S_B \\mathbf{w} \\quad \\text{subject to } \\mathbf{w}^T S_W \\mathbf{w}=1 $$\n拉格朗日推导 # 构造拉格朗日函数：\n$$ L(\\mathbf{w},\\lambda) = \\mathbf{w}^T S_B \\mathbf{w} - \\lambda(\\mathbf{w}^T S_W \\mathbf{w}-1) $$\n对 \\(\\mathbf{w}\\) 求导并令其为零：\n$$ \\frac{\\partial L}{\\partial 
\\mathbf{w}} = 2S_B\\mathbf{w} - 2\\lambda S_W\\mathbf{w} = 0 $$\n因此：\n$$ S_B\\mathbf{w} = \\lambda S_W\\mathbf{w} $$\n这就是一个广义特征值问题。\n代入 \\(S_B\\) 的形式 # 回忆：\n$$ S_B = (\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)^T $$\n代入得：\n$$ (\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)^T\\mathbf{w} = \\lambda S_W\\mathbf{w} $$\n注意到：\n$$ (\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)^T\\mathbf{w} $$\n是一个标量。令：\n$$ c=(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)^T\\mathbf{w} $$\n则：\n$$ c(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1) = \\lambda S_W\\mathbf{w} $$\n两边左乘 \\(S_W^{-1}\\)：\n$$ \\mathbf{w} = \\frac{c}{\\lambda} S_W^{-1} (\\mathbf{\\mu}_0-\\mathbf{\\mu}_1) $$\n由于 \\(\\frac{c}{\\lambda}\\) 仅为标量，方向不变，因此得到：\n$$ \\boxed{ \\mathbf{w}^* = S_W^{-1}(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1) } $$\n这就是 Fisher 线性判别方向。\n直观理解：\n\\((\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)\\) 表示两个类别中心的方向； \\(S_W^{-1}\\) 会削弱类内方差大的方向的权重。 分类规则 # 得到 \\(\\mathbf{w}\\) 后，将新样本 \\(\\mathbf{x}\\) 投影：\n$$ y = \\mathbf{w}^T \\mathbf{x} $$\n一个简单的判别阈值是投影后两个类别均值的中点：\n$$ \\theta = \\frac{m_0 + m_1}{2} $$\n其中：\n$$ m_0 = \\mathbf{w}^T\\mathbf{\\mu}_0,\\quad m_1 = \\mathbf{w}^T\\mathbf{\\mu}_1 $$\n分类规则：\n$$ \\begin{cases} y \u0026gt; \\theta,\\quad \\text{Class 1} \\\\ y \\le \\theta,\\quad \\text{Class 0} \\end{cases} $$\n即将样本归入投影后距离更近的类别中心。\nFisher LDA 总结 # Fisher LDA 的完整流程如下：\n计算类别均值 \\(\\mathbf{\\mu}_0,\\mathbf{\\mu}_1\\) 计算协方差矩阵 \\(\\Sigma_0,\\Sigma_1\\) 构造散度矩阵： $$ S_W=\\Sigma_0+\\Sigma_1,\\quad S_B=(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1)^T $$\n求最优投影方向： $$ \\mathbf{w}^*=S_W^{-1}(\\mathbf{\\mu}_0-\\mathbf{\\mu}_1) $$\n将样本投影到直线上： $$ y=\\mathbf{w}^T\\mathbf{x} $$\n通过阈值进行分类： $$ y \\mathop{\\gtrless}_{\\text{Class 0}}^{\\text{Class 1}} \\theta $$\n因此，Fisher LDA 将高维二分类问题转化为一维阈值分类问题。\n","date":"2026年4月16日","externalUrl":null,"permalink":"/note/design-and-practice-of-intelligent-internet-of-things-machine-learning/fisher-lda/","section":"笔记","summary":"","title":"线性判别分析（Fisher 
判别）","type":"note"},{"content":"","date":"2026年4月16日","externalUrl":null,"permalink":"/note/design-and-practice-of-intelligent-internet-of-things-machine-learning/","section":"笔记","summary":"","title":"智能物联网设计与实践（机器学习）","type":"note"},{"content":"免责声明：这些笔记中的部分内容和代码片段来自 Stanford CS336 的课程资料，尤其是 lectures 和 spring2025-lectures，并依据 MIT License 使用。\n","date":"2026年4月1日","externalUrl":null,"permalink":"/note/stanford-cs336/","section":"笔记","summary":"","title":"Stanford CS336","type":"note"},{"content":"","date":"2026年4月1日","externalUrl":null,"permalink":"/note/","section":"笔记","summary":"","title":"笔记","type":"note"},{"content":"","date":"2025年11月12日","externalUrl":null,"permalink":"/categories/cloud/","section":"Categories","summary":"","title":"Cloud","type":"categories"},{"content":"","date":"2025年11月12日","externalUrl":null,"permalink":"/categories/devops/","section":"Categories","summary":"","title":"DevOps","type":"categories"},{"content":"","date":"2025年11月12日","externalUrl":null,"permalink":"/authors/holgerhuo/","section":"Authors","summary":"","title":"Holgerhuo","type":"authors"},{"content":"","date":"2025年11月12日","externalUrl":null,"permalink":"/categories/k8s/","section":"Categories","summary":"","title":"K8s","type":"categories"},{"content":"","date":"2025年11月12日","externalUrl":null,"permalink":"/categories/kubernetes/","section":"Categories","summary":"","title":"Kubernetes","type":"categories"},{"content":"","date":"2025年11月12日","externalUrl":null,"permalink":"/categories/linux/","section":"Categories","summary":"","title":"Linux","type":"categories"},{"content":"This article is a repost of Provisioning a Highly-Available Production Kubernetes Cluster originally published by Holger Huo\nAll credit goes to the original author. 
This copy is provided here for educational and non-commercial purposes only.\nThis documentation will guide you through setting up a production-level, highly available multi-node Kubernetes cluster using kubeadm, containerd, kube-vip, Cilium, MetalLB, and OpenEBS. If you are using a managed Kubernetes service (K8s-as-a-Service) or already have an existing cluster, feel free to skip this section.\nPrerequisites # At least 3 nodes are required for an HA control plane At least 2 CPU cores and 2 GB of RAM per node Full network connectivity between all machines in the cluster (public or private network is fine) Unique hostname, MAC address, and product_uuid for every node RHEL 10 (all recent distros should work, but this tutorial is tailored for RHEL) Using MAAS, cloud-init, Ansible, or similar bare-metal provisioning tools is strongly recommended to automate and accelerate the setup process. Setting Up Host OS # To prepare the host OS for Kubernetes installation, we need to disable swap and the firewall, set SELinux to permissive mode, and enable IP forwarding.\n# Disable swap sudo swapoff -a sudo sed -i \u0026#39;/^[^#].*\\s\\+swap\\s\\+.*$/d\u0026#39; /etc/fstab # Also check systemd.swap # Disable firewall sudo systemctl disable --now firewalld # Disable SELinux sudo setenforce 0 sudo sed -i \u0026#39;s/^SELINUX=enforcing$/SELINUX=permissive/\u0026#39; /etc/selinux/config # Enable forwarding cat \u0026lt;\u0026lt;EOF | sudo tee /etc/sysctl.d/k8s.conf net.ipv4.ip_forward = 1 net.ipv6.conf.all.forwarding = 1 vm.nr_hugepages = 1024 EOF sudo sysctl --system sudo dnf install epel-release -y sudo dnf install wget htop btop curl vim nano git jq -y sudo dnf update -y Install containerd # containerd is a high-performance container runtime that is widely used in production Kubernetes clusters. 
We will install and configure containerd as the container runtime for our Kubernetes cluster.\nsudo dnf -y install dnf-plugins-core sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo sudo dnf install containerd.io -y containerd config default | sudo tee /etc/containerd/config.toml sudo sed -i \u0026#39;s/SystemdCgroup = false/SystemdCgroup = true/\u0026#39; /etc/containerd/config.toml sudo systemctl enable --now containerd cat \u0026lt;\u0026lt;EOF | sudo tee /etc/crictl.yaml runtime-endpoint: \u0026#34;unix:///run/containerd/containerd.sock\u0026#34; timeout: 0 debug: false EOF Install Kubernetes # cat \u0026lt;\u0026lt;EOF | sudo tee /etc/yum.repos.d/kubernetes.repo [kubernetes] name=Kubernetes baseurl=https://pkgs.k8s.io/core:/stable:/v1.34/rpm/ enabled=1 gpgcheck=1 gpgkey=https://pkgs.k8s.io/core:/stable:/v1.34/rpm/repodata/repomd.xml.key exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni EOF sudo yum install -y kubelet kubeadm kubectl --setopt=disable_excludes=kubernetes sudo systemctl enable --now kubelet Before proceeding with the Kubernetes setup, it is recommended to reboot the system to use the latest kernel version.\necho \u0026#34;exclude=containerd containerd.io kernel*\u0026#34; | sudo tee -a /etc/yum.conf sudo reboot Setup Kubernetes # To set up the Kubernetes cluster, we will use kubeadm, the official tool for bootstrapping Kubernetes clusters. First, we will set up kube-vip as a static pod. This provides a virtual IP for the control plane nodes, so that we can reach the cluster even if one of the control plane nodes goes down. Then we will bootstrap the cluster and join the other nodes. Finally, we will set up Cilium CNI and MetalLB for networking and load balancing, OpenEBS for storage, and other useful add-ons.\nFrom this point on, run the commands as the root user.\nsudo -i kube-vip # On all control plane nodes, we will set up kube-vip as a static pod. 
Replace the VIP value with your desired Virtual IP.\nexport VIP=\u0026lt;vip\u0026gt; export INTERFACE=\u0026lt;inter-node-network-interface\u0026gt; KVVERSION=$(curl -sL https://api.github.com/repos/kube-vip/kube-vip/releases | jq -r \u0026#34;.[0].name\u0026#34;) alias kube-vip=\u0026#34;ctr image pull ghcr.io/kube-vip/kube-vip:$KVVERSION; ctr run --rm --net-host ghcr.io/kube-vip/kube-vip:$KVVERSION vip /kube-vip\u0026#34; kube-vip manifest pod \\ --interface $INTERFACE \\ --address $VIP \\ --controlplane \\ --arp \\ --leaderElection | tee /etc/kubernetes/manifests/kube-vip.yaml # Only on bootstrap node, see https://github.com/kube-vip/kube-vip/issues/684 sed -i \u0026#39;s#path: /etc/kubernetes/admin.conf#path: /etc/kubernetes/super-admin.conf#\u0026#39; /etc/kubernetes/manifests/kube-vip.yaml \u0026amp;\u0026amp; systemctl restart kubelet This sets up kube-vip as a static pod using ARP mode so that kubelet will manage its lifecycle.\nBootstrap Cluster # kubeadm config images pull \\ --cri-socket unix:///var/run/containerd/containerd.sock kubeadm init \\ --control-plane-endpoint vip.k8s.example.com:6443 \\ --cri-socket unix:///var/run/containerd/containerd.sock \\ --upload-certs \\ --service-cidr 172.18.0.0/16 \\ --pod-network-cidr 172.19.0.0/16 \\ --service-dns-domain cluster.local \\ --skip-phases=addon/kube-proxy Now, kubeadm has initialized the cluster by installing control plane components through static pods, generating certificates for component/pod communications, and setting up kubeconfig for cluster access.\nkube-vip patch can now be reverted\nsed -i \u0026#39;s#path: /etc/kubernetes/super-admin.conf#path: /etc/kubernetes/admin.conf#\u0026#39; /etc/kubernetes/manifests/kube-vip.yaml \u0026amp;\u0026amp; systemctl restart kubelet You can copy the kubeconfig file to your home directory for easier access.\nmkdir -p $HOME/.kube sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config sudo chown $(id -u):$(id -g) $HOME/.kube/config Join Other Nodes # For 
additional Control Plane nodes, repeat the Host OS setup and the kube-vip setup (without the initial patch). For Worker nodes, only the Host OS setup is required.\nThen, use the kubeadm join command provided in the output of kubeadm init to add the remaining control plane and worker nodes.\nUntaint Control Plane Nodes (Optional) # By default, control plane nodes are tainted and won\u0026rsquo;t run user workloads. To allow workloads to run on them, remove the taint:\n# kubectl taint nodes \u0026lt;key\u0026gt;- : remove taint # kubectl taint nodes \u0026lt;key\u0026gt;=\u0026lt;value\u0026gt; : add taint kubectl taint nodes --all node-role.kubernetes.io/control-plane- # To re-taint: kubectl taint nodes \u0026lt;node-name\u0026gt; node-role.kubernetes.io/control-plane:NoSchedule Networking and Load Balancing # Now that we\u0026rsquo;ve set up the cluster, we can interact with it from a local client. To proceed, you need to have Helm installed on your local client.\ndnf install -y helm Setting Up Cilium CNI # We will use Cilium as the CNI plugin for the cluster. Cilium is a powerful and flexible CNI plugin that provides advanced networking features such as network policies, load balancing, and service mesh capabilities.\nhelm repo add cilium https://helm.cilium.io/ API_SERVER_IP=\u0026lt;vip\u0026gt; # replace with your api server vip or FQDN API_SERVER_PORT=6443 helm install cilium cilium/cilium --version 1.18.3 \\ --namespace kube-system \\ --set kubeProxyReplacement=true \\ --set k8sServiceHost=${API_SERVER_IP} \\ --set k8sServicePort=${API_SERVER_PORT} \\ --set ipam.mode=kubernetes \\ --set hubble.relay.enabled=true \\ --set hubble.peerService.clusterDomain=cluster.local # Verify installation kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement Install MetalLB # We will use MetalLB as the load balancer for the cluster. 
MetalLB is a popular load balancer for bare-metal Kubernetes clusters that provides network load balancing using standard protocols such as BGP and ARP.\nkubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.15.2/config/manifests/metallb-native.yaml Next, we need to configure MetalLB with a pool of IP addresses that it can use for load balancing. Replace the address-pool values with your desired IP range.\ncat \u0026lt;\u0026lt;EOF | kubectl apply -f - apiVersion: metallb.io/v1beta1 kind: IPAddressPool metadata: namespace: metallb-system name: public spec: addresses: - \u0026#34;192.168.20.0/24\u0026#34; --- apiVersion: metallb.io/v1beta1 kind: L2Advertisement metadata: name: public namespace: metallb-system spec: ipAddressPools: - public EOF OpenEBS for Storage # OpenEBS is very heavy and generally consumes more than 4 GiB of memory on each control node. If you don\u0026rsquo;t need such a distributed storage solution, consider using NFS or a host-path provisioner. To achieve HA persistent storage, OpenEBS requires at least 3 nodes, each running etcd and the OpenEBS operator.\nPrepare Storage Nodes # Login to Storage Nodes # Enable the nvme_tcp kernel module.\nsudo mkdir -p /etc/modules-load.d/ echo \u0026#34;nvme_tcp\u0026#34; | sudo tee /etc/modules-load.d/openebs.conf sudo modprobe nvme_tcp Label host as Storage Nodes # kubectl label node \u0026lt;node_name\u0026gt; openebs.io/engine=mayastor Install OpenEBS # Please update values according to your needs.\nhelm repo add openebs https://openebs.github.io/openebs helm install openebs \\ --namespace openebs openebs/openebs \\ --create-namespace \\ --set loki.enabled=false --set alloy.enabled=false --set minio.enabled=false Setup Storage Class # You should now have the openebs-hostpath and openebs-single-replica storage classes available. 
However, openebs-single-replica does not have a storage pool yet.\nSetup Mayastor Disk Pool # kubectl get sc # show predefined storage classes ls -l /dev/disk/by-id/ # show available block devices cat \u0026lt;\u0026lt;EOF | kubectl create -f - apiVersion: \u0026#34;openebs.io/v1beta3\u0026#34; kind: DiskPool metadata: name: pool-on-node-1 namespace: openebs spec: node: workernode-1-hostname disks: [\u0026#34;aio:///dev/disk/by-id/\u0026lt;id\u0026gt;\u0026#34;] EOF Create Storage Class Using Mayastor # cat \u0026lt;\u0026lt;EOF | kubectl create -f - apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: mayastor-1 parameters: protocol: nvmf repl: \u0026#34;1\u0026#34; provisioner: io.openebs.csi-mayastor EOF csi-driver-nfs # If you have an existing NFS server, you can use it as a storage backend for your Kubernetes cluster by installing the csi-driver-nfs CSI driver.\nhelm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts helm install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \\ --namespace kube-system \\ --version 4.12.0 \\ --set externalSnapshotter.enabled=true \\ --set controller.runOnControlPlane=true \\ --set controller.replicas=2 Create Storage Class # cat \u0026lt;\u0026lt;EOF | kubectl create -f - apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: nfs-csi provisioner: nfs.csi.k8s.io parameters: server: nfs-server.default.svc.cluster.local share: / # csi.storage.k8s.io/provisioner-secret is only needed for providing mountOptions in DeleteVolume # csi.storage.k8s.io/provisioner-secret-name: \u0026#34;mount-options\u0026#34; # csi.storage.k8s.io/provisioner-secret-namespace: \u0026#34;default\u0026#34; reclaimPolicy: Delete volumeBindingMode: Immediate allowVolumeExpansion: true mountOptions: - nfsvers=4.1 EOF With that, you have successfully set up a production-level, highly available multi-node Kubernetes cluster using kubeadm, containerd, Cilium, MetalLB, OpenEBS, and 
kube-vip. You can now deploy your applications and services on the cluster.\nOther Useful Add-ons # cert-manager # cert-manager is a powerful tool for managing TLS certificates in Kubernetes. It automates the process of obtaining, renewing, and managing certificates from various sources such as Let\u0026rsquo;s Encrypt, HashiCorp Vault, and more.\nhelm install \\ cert-manager oci://quay.io/jetstack/charts/cert-manager \\ --version v1.19.1 \\ --namespace cert-manager \\ --create-namespace \\ --set crds.enabled=true ","date":"2025年11月12日","externalUrl":null,"permalink":"/blog/provisioning-a-highly-available-production-kubernetes-cluster/","section":"Blog","summary":"","title":"Provisioning a Highly-Available Production Kubernetes Cluster","type":"blog"},{"content":"","date":"2025年2月23日","externalUrl":null,"permalink":"/","section":"","summary":"","title":"","type":"home"},{"content":"欢迎来到我的博客！在这里，我分享关于技术、生活方式和交通等主题的想法、研究和故事。从对最新科技趋势的深入探讨，到日常生活的轻松讨论，每个人都能在这里找到感兴趣的内容。\n随时浏览，尽情探索！\n","date":"2025年2月23日","externalUrl":null,"permalink":"/blog/","section":"Blog","summary":"","title":"Blog","type":"blog"},{"content":"","date":"2025年2月22日","externalUrl":null,"permalink":"/categories/%E6%97%85%E8%A1%8C%E6%94%BB%E7%95%A5-/-travel-guide/","section":"Categories","summary":"","title":"旅行攻略 / Travel Guide","type":"categories"},{"content":"从零开始，带你轻松掌握日本铁路系统，玩转城市与远方\n序言：初入日本的铁路迷宫 # 你第一次来日本旅游。当飞机降落在成田机场，你早已抑制不住内心的兴奋。然而，刚推着行李走到“地铁口”，你就被眼前的景象吓住了： 满墙的线路图，密密麻麻如同一张复杂的电路板；自动售票机上不断闪烁的字母和数字，似乎需要高等数学才能解读；更别提耳边嘈杂的广播和不间断涌动的人潮。你突然意识到，原本以为“坐地铁很简单”的自己，或许完全低估了日本铁路的复杂程度。 这时候的你可能会怀疑：“这到底是交通工具，还是考验智商的迷宫？”\n别慌，这种“懵圈”状态再正常不过。日本的铁路系统堪称全球最复杂的网络之一：从新干线到地铁、从特急列车到地方支线，各种线路纵横交错，构成了一个庞大的交通王国。不过，好消息是——不论你对它有多陌生，只要掌握几个关键技巧，这个看似繁杂的系统就能瞬间变身成一件既高效又有趣的旅行工具。\n在接下来的文章中，我将用最简单易懂的语言，从基础概念开始，一步步带你掌握日本铁路的方方面面。无论你是喜爱自由行的背包客，还是首次踏上日本的旅游新手，都能跟着这份指南，从一片茫然到满怀自信，轻松玩转日本铁路。让我们一同出发，揭开日本铁道的神秘面纱吧！\n那么，让我们一起出发，揭开日本铁道的秘密吧！\n起步：日本铁路的基本概念 # 日本的铁路历史可以追溯到 1872 
年，当时第一条铁路在新桥与横滨之间正式通车，连接了日本最重要的贸易枢纽之一。此后的百余年间，日本铁路迅速发展，从工业时代的货物运输逐渐转型为现代社会的城市通勤和高速铁路旅游，最终形成了如今覆盖全国的庞大铁路网络。\n那么，为什么日本的铁路会给人一种错综复杂的感觉？不外乎以下几点原因：\n历史背景：最初铁路由各地政府和私营企业分别建设，导致今天存在众多不同的运营商，每家都管辖自己的一部分线路。 高密度人口：二战后，日本城市化进程飞速发展。为了满足大都市圈数以百万计的通勤需求，必须建立紧凑而高效的铁路网络。 空间局促：在寸土寸金的都市区，铁路常常不得不采用“地下—地面—高架”多层叠加的方式，以尽可能利用有限的城市空间，这无形中也让线路布局更加复杂。 尽管看上去“密不透风”，日本铁路却以高效和精准闻名。它的主要特色包括：\n几乎精确到秒的列车时刻表，延误超过一分钟就可能登上地方新闻； 多家运营商之间的高度协同，一张 IC 卡便可在绝大多数线路上通行； 丰富而灵活的线路分布，从乡村小站到城市核心，无缝覆盖各类出行需求。 除此之外，日本丰富多彩的铁路文化也为它增添了独特魅力：新干线象征着高速科技，有些观光列车则提供奢华座椅乃至温泉足浴，还有车站便当、美妙的车站音乐、独树一帜的列车设计等诸多细节，让日本铁路旅行不只是单纯的“交通”，更是一场文化体验之旅。\n第一次买票：从机场到市中心 # 此时你正拖着行李站在成田机场的铁路售票机前，脑子里只有一个疑问：“我要去池袋，但到底要怎么买票？”\n面对屏幕上五花八门的选项，你开始有点后悔没提前做功课。不过别着急，只要了解了基本的购票原理，你会发现买票其实并没有那么难。\n乘车券制度：为什么他会吐出两张票？ # 在日本的铁路体系中，**乘车券（乗車券）**是最基础、最传统的一种车票。它对应从一个车站到另一个车站的“基础运费”，也就是说，购买了乘车券后，铁路公司就有责任把你从 A 车站运送到 B 车站。\n乘车券的价格主要由两站之间的距离决定，具体可参照车站公示的票价表（售票机也能自动计算）。 在车站的路线图上，通常会用明显的标识显示你所在车站，以及与目标车站之间的票价。 除了乘车券，日本铁路中还有几类重要的附加车票，专门针对特定列车或增加额外服务。它们通常和乘车券“配套”使用。\n特急券 # 当你乘坐特急列车时，除了乘车券以外，你需要额外购买一张特急券。 特急列车一般速度更快、停站更少，也往往提供更舒适的座椅。 指定席券 # 如果列车提供“指定席”选项（尤其是长途列车、特急列车、新干线等），你可以额外购买指定席券来享受固定座位；否则可以直接搭乘“自由席”（先到先得）。 一般你在购票时会被询问要指定席还是自由席。 如果你错过了指定的班次，车票会自动转为自由席。 新干线特急券 # 这是一种特殊的特急券，专门用于搭乘新干线列车。与普通的特急券类似，它是附加收费的一种，必须与基础的乘车券搭配使用。新干线的席位等级分为以下四种：\n自由席：最便宜，无固定座位，可能需要排队或者全程“站票”。（包含乘车券+自由席特急券） 指定席：最常见，有固定座位。（包含乘车券+指定席特急券） 绿色车厢（Green Car)：一种高级座席，类似中国高铁的一等座。（包含乘车券+绿色车厢特急券） Gran Class：最高级，类似中国高铁的商务座。目前只在东北、北海道、北陆、上越、山形新干线上运营。（包含乘车券+Gran Class特急券） IC 卡：你的万能通行证 # 如果你厌倦了每到一个站就排队买票，想让出行更简单，那么一张IC 卡（非接触式交通卡）几乎可以称得上是“神兵利器”。它不仅能在大多数铁路、地铁、公交上使用，还可以在便利店、自动贩卖机、车站储物柜等处消费，可谓一卡多用。\n最常见的 IC 卡包括：\nSuica（JR 东日本发行） Pasmo（东京地铁） ICOCA（JR 西日本） 它们彼此功能相似，并且在日本全国大部分地区都能相互通用。\n使用范围与限制 # 地域限制：绝大多数大城市都可以使用，但在某些偏远地区或地方线，闸机仅支持纸质车票，需先确认。 跨区域限制：如果你乘坐的路线跨越不同的 JR 公司或其他私铁公司管辖区，需要在边界车站出站后重新进站。若直接用 IC 卡跨区，可能会在出站时遇到闸机无法识别行程的问题。 特急与新干线：用 IC 卡进站通常只会扣除基础运费（乘车券），如果要乘坐特急或新干线，需要另外购买特急券或新干线特急券（线下或线上预订）。 退卡：不同 IC 卡只能在其所属地区退还。比如 Suica 只可在 JR 东日本区域（如东京）退卡，ICOCA 只能在 JR 西日本区域（如大阪）退卡。 通票：为旅行设计的高性价比之选 # 
如果你计划在日本多城奔走，或是在单个城市内频繁搭乘公共交通，不妨考虑通票（Rail Pass）。它可以让你在指定时间内无限次乘坐目标范围内的列车，既能节省交通成本，又能省去一遍遍买票的麻烦。\n通票主要分为以下几类：\n全国通票（JR Pass） # 覆盖日本全国的 JR 线路，包括大部分新干线（除个别超高速列车如“のぞみ NOZOMI”）和地方 JR 线。 区域通票 # 面向特定地区，比如关西铁路周游券、北海道铁路周游券等，价格更亲民。 城市或地铁通票 # 适合只在某个城市活动的旅行者。比如： 东京地铁一日/二日/三日券：在 24/48/72 小时内可不限次数乘坐东京地铁和都营地铁。 如何购买车票 # 寻找售票机 在车站附近，特别是检票口旁边，通常能见到“自動券売機”的售票机。 选择目的地 屏幕上会显示线路图或车站列表。你可以点击目的地，机器会自动计算票价。 如果你想要“中文界面”，有时机子会将车站名显示为罗马拼音，比如“NIPPORI”而不是“日暮里”，这可能让你更难对照实际站名，请根据个人习惯选择。 支付票款 许多私铁公司或地铁售票机并不支持国际信用卡，这时只能使用现金。JR 的“みどりの窓口”（人工窗口）通常可以刷卡。 线上预订 你也可以在线上预定车票，比如京成电铁的 Skyliner 或新干线。这里建议搜索小红书。 回到你的行程：从成田机场到池袋 # 现在，你回过神来站在售票机前，已经掌握了必要的知识。 从成田机场到池袋，你可以：\n传统方式：买两张纸质票，一张京成电铁的乘车券到日暮里，再买一张东京地铁的票到池袋。 现代方式：购买Suica卡，充值3000日元，直接刷卡进站。 你决定选择后者，毕竟省事儿，而且那张绿色的小卡片摸起来质感不错。 下面是详细步骤：\n在 JR 东日本的黑色售票机购买 Suica 卡，充值 3000 日元； 单独购买 Skyliner 特急券，然后使用 Suica + 特急券进站（先塞特急券），乘坐京成电铁的 Skyliner 到日暮里； 出站换乘 JR 东日本的山手线（外环），从日暮里坐到池袋。 一路上，你只需要在日暮里站换乘时刷卡进出闸机，不用再去售票机排队。很快，列车驶过高楼林立的东京市区，池袋的繁华景象已映入眼帘，你带着一身轻松走出了地铁站，正式开始了属于你的东京旅程。\n（本文未完待续，敬请期待更多关于日本铁路的进阶操作、特价情报和旅行小贴士……）\n","date":"2025年2月22日","externalUrl":null,"permalink":"/blog/%E8%BF%B7%E5%A4%B1%E5%9C%A8%E4%B8%9C%E4%BA%AC%E5%9C%B0%E9%93%81%E7%9A%84%E7%AC%AC%E4%B8%80%E5%A4%A9--%E6%97%A5%E6%9C%AC%E9%93%81%E9%81%93%E7%94%9F%E5%AD%98%E6%8C%87%E5%8D%97/","section":"Blog","summary":"","title":"迷失在东京地铁的第一天 —— 日本铁道生存指南","type":"blog"},{"content":"","date":"2025年2月22日","externalUrl":null,"permalink":"/categories/%E9%93%81%E8%B7%AF/railway/","section":"Categories","summary":"","title":"铁路/Railway","type":"categories"},{"content":"\u003c!DOCTYPE html\u003e Redirecting to Email... 
If you are not redirected, click here.\n","externalUrl":null,"permalink":"/mail/","section":"","summary":"","title":"","type":"page"},{"content":" 基本信息 # 肉身在中国的计算机系本科生.\n喜欢 Apple 生态 🍎 移动端音游玩家（Project Sekai / バンドリ！ガルパ） 🎶 Minecraft 🧱 铁道/航空迷 🚄✈️ 正在学习日语（N1）。 技术相关 # 仍在探索阶段，正在使用 C++/Python 作为主要语言，正在学习 Rust、并行计算等。\n如果你想看我写的屎山代码可以在下面找到：\nmartin-cao/rhess A bare-metal chess game for an STM32F407ZGT6 Rust 2 0 未来规划 # 希望可以到处飞，并且有一个让自己不缺钱的工作（比如独立开发），但是具体还是走一步看一步😔\n联系我 # Email: i@martincao.cc PGP 公钥 🍎 Apple Music # headinghome on Apple Music\n对于中国大陆用户，建议使用全局代理访问。\n投喂 Martin # 如果你觉得我的内容对你有帮助，可以请我喝一杯咖啡 ☕️ 支持一下：\n赞助我（Ko-Fi） 谢谢你的支持与关注！\n","externalUrl":null,"permalink":"/about/","section":"","summary":"","title":"About me","type":"page"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"}]