三太子上身的痴漢水球2.0: [Random Scribbles] One P6 Pipeline, Three Interpretations

Saturday, May 22, 2010

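Honestly, this has been bugging me for ten years. Damn it. Ever since the bizarre NetBurst showed up, Intel's standard for "pipeline depth" has been impossible to make sense of. The official P6 pipeline was cut from "12~14" down to 10 stages, supposedly counted as the misprediction penalty, but that doesn't look right either. Yeah, sure, as if P6's misprediction penalty were only 10 cycles.

Speaking of pipeline depth, Intel's Optimization Manual does mention two things: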

Intel Core Microarchitecture and Enhanced Intel Core Microarchitecture: Fourteen-stage efficient pipeline

So which definition of "efficient pipeline" does Merom/Penryn actually use?

The length of the pipeline in Intel microarchitecture (Nehalem) is two cycles longer than its predecessor in 45nm Intel Core 2 processor family, as measured by branch misprediction delay.

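Either way, Nehalem/Westmere comes out as a 16-stage "efficient pipeline".

As for Pentium M, back in the day I used a program written by hotball that deliberately forces mispredictions to test the Banias in an X31, and estimated its penalty at roughly 2 cycles more than P6. In other words, assuming Intel's new standard simply takes the existing full pipeline, drops BTB Access and Retire, and adds whatever latent delay there is, the pipeline depths of the P6 family would be: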

Pentium Pro/Pentium II/Pentium III: 12 stages (12-14)
Pentium M/Yonah: 14 stages
Merom/Penryn: 16 stages
Nehalem/Westmere: 18 stages
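
The arithmetic behind that list can be written out as a rough sketch. It rests entirely on the assumption stated above, namely that the "efficient" count is the full pipeline minus BTB Access and Retire; the names `EXCLUDED_STAGES`, `efficient_depth`, and `full_depth` are made up for illustration and do not come from any Intel document.

```python
# Stages assumed to be left out of Intel's newer "efficient pipeline" count.
# This is the post's working assumption, not an Intel-documented rule.
EXCLUDED_STAGES = ("BTB Access", "Retire")

# "Efficient" depths as quoted or estimated in the post.
efficient_depth = {
    "Pentium Pro/Pentium II/Pentium III": 10,  # official figure after the cut
    "Pentium M/Yonah": 10 + 2,                 # estimated ~2 cycles more than P6
    "Merom/Penryn": 14,                        # Optimization Manual
    "Nehalem/Westmere": 14 + 2,                # two cycles longer than Core 2
}


def full_depth(efficient: int) -> int:
    """Add back the stages assumed to be excluded from the efficient count."""
    return efficient + len(EXCLUDED_STAGES)


full = {name: full_depth(depth) for name, depth in efficient_depth.items()}
for name, depth in full.items():
    print(f"{name}: {depth} stages")
```

If the assumption is wrong, say Intel excludes a different set of stages, every number in the list shifts by the same amount, which is exactly why the definition matters.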

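Seen this way, Nehalem is pretty startling too... it is closing in on 20 stages. Argh, what am I getting so excited about when I haven't even pinned down the definition!

5 comments: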

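sam said...: Hi, I noticed this problem too...
But I think only Intel's engineers really know...
Or maybe they can't figure it out either... (just kidding)

But if this ever gets cleared up, I think I might understand why Willamette/Northwood and Prescott/Cedar Mill differ by 11(?) pipeline stages...; 7:55 PM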
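molesterwaterball said...: I once measured Prescott's misprediction penalty, and it really is 10-11 cycles more than Northwood's, so the 11-stage difference is genuine. As for Prescott's poor efficiency, it actually has little to do with the pipeline being 50% longer.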
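
For readers wondering how a penalty like that gets estimated, here is a minimal sketch of the usual arithmetic: time the same loop once with a predictable branch and once with an unpredictable one, then divide the extra cycles by the number of mispredictions. This is not the actual test program; the function name `misprediction_penalty` and all the numbers below are hypothetical and are not measurements from any CPU.

```python
def misprediction_penalty(cycles_unpredictable: float,
                          cycles_predictable: float,
                          mispredictions: int) -> float:
    """Estimate cycles lost per mispredicted branch from two timed runs."""
    if mispredictions <= 0:
        raise ValueError("need at least one misprediction")
    return (cycles_unpredictable - cycles_predictable) / mispredictions


# Hypothetical cycle counts for two imaginary CPUs, for illustration only.
penalty_a = misprediction_penalty(3_000_000, 1_000_000, 100_000)
penalty_b = misprediction_penalty(4_100_000, 1_000_000, 100_000)

# The difference between two CPUs is what gets compared against the
# claimed difference in stage counts.
print(penalty_b - penalty_a)
```

The estimate is only as good as the misprediction count, so in practice the branch pattern has to be one the predictor reliably gets wrong.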

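The problem now is that ever since Intel NetBurst introduced the trace cache, Intel has been using an "efficient pipeline" that excludes branch prediction, fetch, decode, and retire, which leaves these modern Intel x86 CPUs with almost no basis for a fair comparison.; 9:22 AM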
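sam said...: I think Intel simply doesn't want people comparing the pipelines of P6/P-M, NetBurst, Core (Merom/Penryn), and Nehalem...

I think the difference between P4 (20 stages) and Nehalem (about 20 stages?) comes down to the Drive stages, which no longer mean much these days... (Who knows how many Drive stages Prescott's pipeline has?); 11:42 AM
molesterwaterball said...: Nehalem's pipeline structure resembles that of Pentium M, Yonah, Merom, and Penryn. It basically follows P6 and is a completely different thing from NetBurst. At the very least, you will never see a "TC Fetch" in Nehalem...; 12:36 PM
sam said...: After skimming what Google turned up, I realized that I really don't understand math...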

1. Intel, ASC Training P6 Microarchitecture Tuning Guide (1999)

(Slide 9)

"P6 Microarchitecture has 12 stage pipeline
– 2 Branch Prediction stages
– 3 Instruction Fetch stages
– 2 Instruction Decode stages
– 1 Register Allocation stage
– 1 Re-order Buffer Read stage
– 1 Reservation Station stage
– 1 Re-order Buffer Write-back stage
– 1 Register Retirement File stage"

So where is the all-important Execution stage?
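
A quick tally of the slide's list, as a sanity check. The stage names and counts are transcribed from the quote above; the variable names are just for illustration.

```python
# Stage breakdown transcribed from slide 9 as quoted above.
p6_slide_stages = [
    ("Branch Prediction", 2),
    ("Instruction Fetch", 3),
    ("Instruction Decode", 2),
    ("Register Allocation", 1),
    ("Re-order Buffer Read", 1),
    ("Reservation Station", 1),
    ("Re-order Buffer Write-back", 1),
    ("Register Retirement File", 1),
]

# Sum the per-stage counts and check whether any entry names an execute stage.
total = sum(count for _, count in p6_slide_stages)
has_execute = any("exec" in name.lower() for name, _ in p6_slide_stages)

print(f"total stages: {total}, execute stage listed: {has_execute}")
```

The counts do add up to the advertised 12, yet none of the listed entries is an execute stage.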

2. Intel, A 0.6 µm BiCMOS Processor With Dynamic Execution (1995)

(Comment of Slide 13)

"...Were you to count from left to right, you would end up with 14, but we don’t consider this
machine a 14-stage pipeline, because some of these stages overlap almost all the time."

I had also seen this 14-stage version in a Hot Chips presentation before, but I never expected that it isn't actually counted as 14 stages...

3. http://www.cs.clemson.edu/~mark/330/colwell/case_p6.html

I think the so-called 10-stage misprediction pipeline is roughly what he describes there, but [2] says...

(Comment of slide 14 of [2])

"Finally, the reservation station write cycle can usually be overlapped with at least one of the clock
cycles in the next pipeline segment." (I think it refers to I8 and the following O1 and/or O2 in the Hot Chips pipeline)

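If that is right, then one stage in the misprediction pipeline should frequently disappear...

Conclusion: there is no real 10/12/14-stage distinction... I'd say it is simply a game that lets everyone use their imagination.; 6:33 PM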
