Saturday, May 22, 2010

[Doodles] One P6 Pipeline, Three Interpretations

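This has actually been bugging me for ten years. Damn. Ever since Intel's bizarre NetBurst showed up, the standard for "pipeline depth" has been impossible to pin down. The official figure for the P6 pipeline was cut from "12~14" to 10 stages, supposedly because it now counts the misprediction penalty, but that doesn't look right either. Yeah, right, as if P6's misprediction penalty were only 10 cycles.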

Speaking of pipeline depth, Intel's Optimization Manual does mention two things:
Intel Core Microarchitecture and Enhanced Intel Core Microarchitecture: Fourteen-stage efficient pipeline
So which definition of "efficient pipeline" applies to Merom/Penryn?
The length of the pipeline in Intel microarchitecture (Nehalem) is two cycles longer than its predecessor in 45nm Intel Core 2 processor family, as measured by branch misprediction delay.
In any case, that makes Nehalem/Westmere a 16-stage "efficient pipeline".

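As for Pentium M, I once ran a program written by hotball that deliberately forces mispredictions on the Banias in an X31, and estimated its penalty at roughly 2 cycles more than P6. In other words, assuming Intel's new standard simply takes the existing complete pipeline, drops BTB Access and Retire, and adds any potential delay, the pipeline depths of the P6 family are: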

Pentium Pro/Pentium II/Pentium III: 12 stages (12-14)
Pentium M/Yonah: 14 stages
Merom/Penryn: 16 stages
Nehalem/Westmere: 18 stages
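The arithmetic behind this list can be sketched in a few lines of Python. This is only a restatement of the guess above, not an Intel definition: the names `EXCLUDED_STAGES`, `effective_depth`, and `full_depth` are made up for illustration, and the "+2" rule (adding back BTB Access and Retire) is the post's own assumption.

```python
# Effective ("efficient") depths, i.e. depth as measured by branch misprediction delay.
# ASSUMPTION (the post's guess, not an official definition):
#   full depth = effective depth + BTB Access + Retire
EXCLUDED_STAGES = 2  # BTB Access and Retire

effective_depth = {
    "Pentium Pro/Pentium II/Pentium III": 10,  # Intel's revised P6 figure
    "Pentium M/Yonah": 10 + 2,                 # measured at roughly 2 cycles more than P6
    "Merom/Penryn": 14,                        # "Fourteen-stage efficient pipeline"
    "Nehalem/Westmere": 14 + 2,                # "two cycles longer than its predecessor"
}


def full_depth(effective):
    """Add back the stages the 'efficient pipeline' count is assumed to omit."""
    return effective + EXCLUDED_STAGES


full = {name: full_depth(depth) for name, depth in effective_depth.items()}

for name, depth in full.items():
    print(f"{name}: {depth} stages")
```

Under that assumption each generation comes out two stages deeper than the one before it, which is where the 12/14/16/18 figures come from.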

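Seen this way, Nehalem is pretty astonishing... that's getting close to 20 stages. Argh, what am I getting so excited about when I haven't even pinned down the definition!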

5 comments:

sam said...

Hi, I noticed this problem too...
But I think only Intel's engineers know the answer...
Or maybe they can't figure it out either... (just kidding)

But if this ever gets sorted out, I think I might understand why Willamette/Northwood and Prescott/Cedar Mill differ by 11(?) pipeline stages...

molesterwaterball said...

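I once measured Prescott's misprediction penalty, and it is indeed 10-11 cycles more than Northwood's, so the 11-stage difference is real. Prescott's poor efficiency, however, actually has little to do with its pipeline being 50% longer.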

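The problem now is that ever since NetBurst introduced the trace cache, Intel has been using an invented "efficient pipeline" that excludes branch prediction, fetch, decode, and retire, which leaves this batch of recent Intel x86 CPUs with almost no basis for fair comparison.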

sam said...

I think Intel simply doesn't want people comparing the pipelines of P6/P-M, NetBurst, Core (Merom/Penryn), and Nehalem...

In my view, the difference between P4 (20 stages) and Nehalem (roughly 20 stages?) comes down to the Drive stages, which no longer mean much today... (Who knows how many Drive stages Prescott's pipeline has?)

molesterwaterball said...

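Nehalem's pipeline structure is similar to that of Pentium M, Yonah, Merom, and Penryn. It basically follows P6 and is a completely different thing from NetBurst. At the very least, you will never see a "TC Fetch" in Nehalem...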

sam said...

After skimming what Google turned up, I realized I really can't do math...

1. Intel, ASC Training P6 Microarchitecture Tuning Guide (1999)

(Slide 9)

"P6 Microarchitecture has 12 stage pipeline
– 2 Branch Prediction stages
– 3 Instruction Fetch stages
– 2 Instruction Decode stages
– 1 Register Allocation stage
– 1 Re-order Buffer Read stage
– 1 Reservation Station stage
– 1 Re-order Buffer Write-back stage
– 1 Register Retirement File stage"

So where is the all-important Execution stage?
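As a quick sanity check, the slide's list can be added up in Python. The stage names and counts below are copied from the quote above. The variable names are just for illustration, and the substring check assumes an execution stage would have "exec" somewhere in its name.

```python
# Stage counts exactly as listed on slide 9 of the 1999 tuning guide quoted above.
p6_slide_stages = {
    "Branch Prediction": 2,
    "Instruction Fetch": 3,
    "Instruction Decode": 2,
    "Register Allocation": 1,
    "Re-order Buffer Read": 1,
    "Reservation Station": 1,
    "Re-order Buffer Write-back": 1,
    "Register Retirement File": 1,
}

# Total number of stages the slide accounts for.
total_stages = sum(p6_slide_stages.values())

# ASSUMPTION: an execution stage would contain "exec" in its name.
has_execute_stage = any("exec" in name.lower() for name in p6_slide_stages)

print(f"total stages: {total_stages}")
print(f"execution stage listed: {has_execute_stage}")
```

The list does add up to the 12 stages the slide claims, yet none of those 12 entries is labeled as an execution stage.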


2. Intel, A 0.6 µm BiCMOS Processor With Dynamic Execution (1995)

(Comment of Slide 13)

"...Were you to count from left to right, you would end up with 14, but we don’t consider this
machine a 14-stage pipeline, because some of these stages overlap almost all the time."

I had also seen this 14-stage version in the Hot Chips presentation before, but I never expected that it isn't actually counted as 14 stages...

3. http://www.cs.clemson.edu/~mark/330/colwell/case_p6.html

Although I think the so-called 10-stage misprediction pipeline is roughly what he describes, [2] says...

(Comment of slide 14 of [2])

"Finally, the reservation station write cycle can usually be overlapped with at least one of the clock
cycles in the next pipeline segment." (I think this refers to I8 in the Hot Chips pipeline and the O1 and/or O2 that follow it)

If that is right, then one stage of the misprediction pipeline should often go missing...

Conclusion: there is no real 10/12/14-stage distinction... I'd say the whole thing is a game that lets everyone exercise their imagination.