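Honestly, this has been bugging me for ten years. Dang it. Ever since Intel's bizarre NetBurst showed up, the standard for "pipeline depth" has been impossible to pin down. The official figure for the P6 pipeline got cut from "12~14" to 10 stages, supposedly measured as the misprediction penalty, but that doesn't look right either. As if P6's misprediction penalty were really only 10 cycles. On the subject of pipeline depth, Intel's Optimization Manual does mention two things: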
"Intel Core Microarchitecture and Enhanced Intel Core Microarchitecture: Fourteen-stage efficient pipeline"
So which definition of "effective pipeline" is Merom/Penryn actually using?
"The length of the pipeline in Intel microarchitecture (Nehalem) is two cycles longer than its predecessor in 45nm Intel Core 2 processor family, as measured by branch misprediction delay."
Either way, Nehalem/Westmere comes out as a 16-stage "effective pipeline".
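As for Pentium M, back then I ran a program hotball wrote to force mispredictions on the Banias in an X31, and estimated that it takes roughly 2 cycles more than P6. In other words, assuming Intel's new standard simply takes the existing full pipeline, drops BTB Access and Retire, and adds some potential delay, the pipeline depths of the P6 family are: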
Pentium Pro/Pentium II/Pentium III: 12 stages (12-14)
Pentium M/Yonah: 14 stages
Merom/Penryn: 16 stages
Nehalem/Westmere: 18 stages
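Seen this way, Nehalem is pretty impressive too... closing in on 20 stages. Ahh, what am I getting so excited about when I haven't even pinned down the definition!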
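The arithmetic behind the list above can be sketched in a few lines of Python. This is only a restatement of the post's own guess, not an Intel definition: the names `EXCLUDED_STAGES` and `full_depth` are made up for illustration, and the "+2" offset rests on the assumption that the "effective" figure leaves out exactly BTB Access and Retire.

```python
# Hypothetical sketch of the post's hypothesis, not an official Intel formula.
# Assumption: "effective" depth = full depth minus BTB Access and Retire.
EXCLUDED_STAGES = 2


def full_depth(effective_depth: int, excluded: int = EXCLUDED_STAGES) -> int:
    """Convert an 'effective' pipeline depth back to a full stage count."""
    return effective_depth + excluded


# "Effective" figures as quoted in the post.
effective = {
    "Pentium Pro/Pentium II/Pentium III": 10,  # the reduced official P6 figure
    "Merom/Penryn": 14,                        # "fourteen-stage efficient pipeline"
    "Nehalem/Westmere": 14 + 2,                # two cycles longer than Penryn
}

full = {name: full_depth(depth) for name, depth in effective.items()}

# Pentium M has no quoted "effective" figure; the post estimates it from a
# measurement of roughly 2 extra cycles over P6.
full["Pentium M/Yonah"] = full["Pentium Pro/Pentium II/Pentium III"] + 2

for name, depth in full.items():
    print(f"{name}: {depth} stages")
```

If the number of excluded stages is something other than 2, every entry shifts by the same amount, which is exactly why the undefined term makes these comparisons shaky.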
 
 
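5 comments:
Hello, I noticed this problem too...
But I think only Intel's engineers really know...
Or maybe they can't figure it out either... (just kidding)
But if this ever gets cleared up, I think I might understand why Willamette/Northwood and Prescott/Cedar Mill differ by 11(?) pipeline stages...
I once measured Prescott's misprediction penalty, and it is indeed 10-11 cycles more than Northwood's, so the 11-stage difference is real. And Prescott's poor efficiency actually has little to do with its pipeline being lengthened by 50%.
The problem now is that ever since Intel NetBurst introduced the trace cache and invented an "effective pipeline" that excludes branch prediction, fetch, decode, and retire, these recent Intel x86 CPUs have had almost no basis for a fair comparison.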
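For readers wondering how a misprediction penalty can be backed out of a timing test, here is a minimal sketch of the usual arithmetic. It is not hotball's program and not the test used for the Prescott/Northwood numbers above. The function name `estimate_penalty` and the 50% default misprediction rate (a branch on random data) are assumptions, and the inputs in the example are placeholders rather than measurements.

```python
def estimate_penalty(cycles_predictable: float,
                     cycles_unpredictable: float,
                     mispredict_rate: float = 0.5) -> float:
    """Estimate the cost of one mispredicted branch, in cycles.

    cycles_predictable:   cycles per loop iteration when the branch is
                          always predicted correctly.
    cycles_unpredictable: cycles per iteration when the branch direction
                          is random.
    mispredict_rate:      assumed fraction of branches mispredicted in the
                          unpredictable run (0.5 for a coin-flip branch).
    """
    if not 0.0 < mispredict_rate <= 1.0:
        raise ValueError("mispredict_rate must be in (0, 1]")
    extra_cycles = cycles_unpredictable - cycles_predictable
    return extra_cycles / mispredict_rate


# Placeholder inputs for illustration only, not real measurements.
example_penalty = estimate_penalty(2.0, 12.0)
print(f"estimated penalty: {example_penalty} cycles")
```

Running the same loop on two CPUs and subtracting the two estimates gives the kind of "N cycles more" figure quoted above. Its accuracy depends on how close the real misprediction rate is to the assumed one.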
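I think Intel simply doesn't want people comparing the pipelines of P6/P-M, NetBurst, Core (Merom/Penryn), and Nehalem...
In my view, the difference between P4 (20 stages) and Nehalem (roughly 20 stages?) comes down to the Drive stages, which no longer mean much... (Who knows how many Drive stages Prescott's pipeline has?)
Nehalem's pipeline structure is similar to that of Pentium M, Yonah, Merom, and Penryn. It basically follows P6 and is a completely different thing from NetBurst. At the very least, you will never see a "TC Fetch" in Nehalem...
After skimming what Google turned up, I realized I really don't understand math...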
1. Intel, ASC Training P6 Microarchitecture Tuning Guide (1999)
(Slide 9)
"P6 Microarchitecture has 12 stage pipeline
– 2 Branch Prediction stages
– 3 Instruction Fetch stages
– 2 Instruction Decode stages
– 1 Register Allocation stage
– 1 Re-order Buffer Read stage
– 1 Reservation Station stage
– 1 Re-order Buffer Write-back stage
– 1 Register Retirement File stage"
So where is the all-important Execution stage?
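A quick tally of the slide's list makes the oddity concrete. The sketch below just adds up the stage counts quoted from Slide 9 and checks whether any stage name mentions execution. The variable names are made up for illustration.

```python
# Stage names and counts exactly as quoted from Slide 9 of the 1999 guide.
P6_STAGES_1999 = [
    ("Branch Prediction", 2),
    ("Instruction Fetch", 3),
    ("Instruction Decode", 2),
    ("Register Allocation", 1),
    ("Re-order Buffer Read", 1),
    ("Reservation Station", 1),
    ("Re-order Buffer Write-back", 1),
    ("Register Retirement File", 1),
]

# Sum the per-category counts to get the advertised pipeline depth.
total_stages = sum(count for _, count in P6_STAGES_1999)

# Look for anything resembling an execute stage in the list.
has_execute_stage = any("execut" in name.lower() for name, _ in P6_STAGES_1999)

print(f"total stages: {total_stages}")
print(f"execute stage listed: {has_execute_stage}")
```

The counts do add up to the advertised 12, yet none of the eight entries is an execute stage.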
2. Intel, A 0.6 µm BiCMOS Processor With Dynamic Execution (1995)
(Comment of Slide 13)
"...Were you to count from left to right, you would end up with 14, but we don’t consider this
machine a 14-stage pipeline, because some of these stages overlap almost all the time."
I had also seen this 14-stage version in the Hot Chips presentation before, but I never expected that it isn't actually counted as 14 stages...
3. http://www.cs.clemson.edu/~mark/330/colwell/case_p6.html
Although I think the so-called 10-stage misprediction pipeline is roughly what he describes, [2] says...
(Comment of slide 14 of [2])
"Finally, the reservation station write cycle can usually be overlapped with at least one of the clock
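cycles in the next pipeline segment." (I think this refers to I8 and the following O1 and/or O2 in the Hot Chips pipeline)
If that is right, then one stage of the misprediction pipeline should frequently go missing...
Conclusion: there is no real 10/12/14-stage distinction... I'd say it's basically a game that lets everyone use their imagination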