Thursday, December 29, 2005

Japan Winter Campaign, Day 2: Comiket 69, Day 1




For me, Comiket always opens with the giant banners hanging in the station.




If it isn't crowded, it isn't Comiket.





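By the time I took this photo it was nearly 4:50 p.m., less than ten minutes before the Comiket corporate booth area closed. The other "end of line" (最後尾) signs had vanished one after another, leaving only a crowd of なのは devotees holding out to the very last minute.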



The music stops and the crowd disperses: Big Sight at the close of Comiket, and Kokusai-Tenjijo Station on the Rinkai Line.






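Finally, this is today's haul after five hours and ten minutes in line at the なのはA's Project corporate booth: four setting books and mugs of the three magical girls. I will probably give some of them away as thank-you gifts.

Below is what I posted on Bahamut (巴哈). Don't doubt it; every word is true.
Anyway, enough preamble....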

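At Comiket 69 today, do you know how long yours truly spent in line at the なのはA's Project corporate booth?

Five hours and ten minutes, from 11:44 a.m. to 4:54 p.m. By the time the outdoor area on the fourth floor of the West Hall was almost deserted, the only line left was the one for なのはA's.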

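And the line barely thinned out. People all around me were enthusiastically discussing the show, and even the Comiket floor staff came over to check on us. I don't know why the line stayed stalled for so long, but of the people I saw in it, very few gave up. Yours truly was cold, hungry, and holding it in for hours, and toughed it out anyway. In other words, my entire first day was devoted to these little girls.

All I can say is that this show is far more popular than I ever imagined.

Damn! So many lolicons!
This is my fourth Comiket and my first winter one. I didn't expect the fervor to match, or even exceed, the summer edition. That's the Japanese for you.

Japan Winter Campaign, Day 1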



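The moment I stepped onto the platform at Shinjuku I saw this poster, a real wake-up call for my kind.

Of course, being in Japan means fugu is a must. Although the Waterball (水球) Tokyo Eating Tour is already booked for 12/30, I couldn't wait and dragged my travel companion 春日 along to the Torafugu Shinjuku main branch for a feast. We didn't leave until nearly midnight.

And just like that, 12,250 yen was gone.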
















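Finally, at Comiket 69 tomorrow I will at last get to use the first of the Seven Great Weapons: the folding stool. Do not underestimate this stool. I bought it in Akihabara on the first evening of Comiket 66 last year, planning to use it while queuing on the following two mornings, and then I overslept on both of them! The stool never saw action. So I made up my mind: I will use this stool in a Comiket entry line. Fellow perverts, wish me luck!

Still, why am I scouring Comiket for Tessa (泰莎) doujinshi on behalf of a certain captain fetishist....

Wednesday, December 21, 2005

Japan itinerary finalized!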


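Dates: 12/28-1/5

12/28 Depart on Cathay Pacific; takeoff 12:45 p.m., landing 16:45 Japan time
12/29 冬コミ (Winter Comiket)
12/30 冬コミ; Waterball Tokyo Eating Tour in the evening, dinner with Japanese colleagues in the trade
12/31 Wander around, ring in the New Year
1/1 Wander around
1/2 Head to Nikko for a snow-country hot spring trip
1/3 Back to Tokyo, wander around
1/4 Wander around
1/5 Return to Taiwan on Cathay Pacific; takeoff 15:40, landing 18:35 Taiwan time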

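Hotels:

12/28-1/1 Shinjuku Washington Hotel, Main Building
1/2 Nikko 美川 guesthouse
1/3-1/4 Keio Plaza Hotel, Shinjuku

Total budget: excluding the night in Nikko (about 16,000 yen), 28,885 元 per person.

[Reader Submission] OOO & SMT

This is a submission from a certain guru-level figure; take it for reference.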
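"This could serve as a thesis topic, ideally with a simulator written for it (which makes it unsuitable for a column). If it were me, though, I would focus on increasing the number of threads and reducing the average latency per thread, rather than doing what Intel does and continually increasing the number of OOO on-the-fly instruction entries.

Why? Since the RAW problem cannot be fully solved anyway, and instruction latencies keep getting longer, when everything has to queue up regardless, why waste a pile of extra circuit area specifically on queue-jumping...

Besides, PPC instruction latencies never span as wide a range as x86's, so there is no need to make the design as complex as x86 (the culprit being x87...). Covering the longest latency of any single instruction is enough (cache/memory access latency is a separate problem). Another potential thesis topic is 'cache overhead optimization with OS support under SMT'.

As for a 256-bit version of Altivec, that is very important and absolutely must be done. XD"
The above remarks do not represent the position of yours truly. XD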

Tuesday, December 20, 2005

The Full Picture of IBM Power6

David Kanter of Real World Tech has published an article on IBM's Power6 processor, "An eCLipz Looms on the Horizon". In short, the key points of the Power6 specification are:

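.65nm process; originally scheduled for 2006, with the latest roadmap pushing it to 2007.
.Ultra High Frequency architecture: 4.8GHz in Q3 2006, with the simplified P6L (Power6 Light, possibly a single-core version) reaching 5.5GHz in Q2 2006; in Q3 2007, the mainframe and high-end server version Z6/eCLipz runs between 4GHz and 4.4GHz. According to the revised roadmap, however, the top clock in 2007 is only 4GHz to 4.5GHz.
.Tied to the "eCLipz" project: through binary translation plus partial hardware support, the zSeries, long bound to the S/360 instruction set, moves onto Power processors.
.Transistor count is about 750M.
.Possibly a dual-core chip like Power4/5, possibly quad-core.
.Each core has a 64kB, 8-way set-associative L1 D-cache.
.Each core has its own private L2 cache, with total capacity between 6MB and 12MB.
.All cores share the external L3/L4 cache.
.The Power6 core is a 4-issue, ultra-deep-pipeline design with OOOE capability, but only on the scale of the PowerPC 604e.
.SMT architecture, with each core supporting 2 simultaneous threads.
.Memory bandwidth is twice that of Power5, about 32GB/s.
.A decimal integer format is provided to support mainframes.
.Supports VMX, virtual machine architecture instructions, and the ViVA-2 vector instruction set extension.
.By the author's estimate, Power6's SPEC CPU 2000 performance could reach twice that of the current Power5+.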

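Seen this way, IBM's purpose in adopting Ultra High Frequency is clear: use high clock speeds to improve performance on mainframe software. The performance advantage of IBM mainframes, however, lies not in the CPU but in the powerful I/O system and virtualization capabilities. Moving the mainframe onto Power smoothly, and persuading customers to replace their systems, will be no simple task.

Thursday, December 15, 2005

ACK-230 acquired



It saves a lot of desk space, but after getting used to UltraNav, using this kind of "traditional" keyboard really feels like a struggle.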

Sunday, December 11, 2005

"Is Out-Of-Order Out-Of-Date?"

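I originally posted this on my branch blog, but the pLog system there is so awful that getting the article posted took enormous effort, to the point of being forced to dig the temp file out with Google Desktop, and left me fuming. So I am posting it here as well. If things don't improve, I will seriously consider shutting down the second blog.

Yesterday's post by hotball reminded me of something from five years ago.

This title originally belonged to a paper jointly presented by HP's William Worley and Jerry Huck at In-Stat/MDR's Microprocessor Forum in 2000. They argued that the OOO (Out-Of-Order) mechanism of existing processors and the RISC instruction sets were both long out of date, and that an instruction set should give the compiler room to uncover more parallelism. Unsurprisingly, that instruction set was IA-64. IBM's Martin Hopkins then published "A Critical Look at IA-64", sharply questioning the necessity of IA-64 from the angle of code density; one passage was even picked up by the third edition of P&H as the epigraph to Chapter 4. I shouldn't need to remind you what he said.

However, leaving aside whether IA-64 is any good, whether it is necessary to pour such enormous resources into chasing ILP (although some argue the spirit of IA-64 is to pursue "parallelism inside a thread"), and whether existing RISC designs really are outdated, it now looks as though OOO is indeed showing signs of being "Out-Of-Date".

Google's Luiz André Barroso has published an article titled "An Economic Case for Chip Multiprocessing". In short, he argues that future data centers do not need OOO and should instead use large numbers of simple in-order CPUs. The main points of the article are roughly as follows: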

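.Power consumption has become a significant cost factor for servers, yet the Performance/Watt that processor vendors (including Intel, AMD, and IBM) have been touting in recent years has stayed flat. For a low-end x86 server, four years of use is enough for the electricity bill to reach 40% of the hardware purchase cost.

.To amortize hardware R&D costs, desktop processors for personal computers and server processors usually share the same core: AMD Opteron and Athlon 64 are both the K8 microarchitecture, Intel Xeon and Pentium 4 are both NetBurst, the IBM PowerPC 970 derives from Power4, and so on. In practice, though, the two environments differ greatly. Servers not only need higher TLP (Thread Level Parallelism) but also already have plenty of highly parallel applications, while the personal market is the opposite. In other words, today's server processors are not necessarily suited to their "actual applications".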
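The 40% figure in the first point is easy to sanity-check against your own numbers. Below is a minimal Python sketch, assuming a constant average power draw and a flat electricity price. Every input value is a hypothetical placeholder of mine, not a figure from Barroso's article, and cooling overhead is ignored.

```python
HOURS_PER_YEAR = 365 * 24


def energy_cost_ratio(avg_power_watts, years, price_per_kwh, hardware_cost):
    """Return lifetime electricity cost as a fraction of hardware cost."""
    energy_kwh = avg_power_watts / 1000.0 * HOURS_PER_YEAR * years
    return energy_kwh * price_per_kwh / hardware_cost


# Hypothetical inputs: 200 W average draw, 4 years of service,
# 0.10 currency units per kWh, 2000 currency units of hardware.
ratio = energy_cost_ratio(200, 4, 0.10, 2000)
print(f"electricity / hardware = {ratio:.1%}")
```

Plugging in a real server's measured draw and a real tariff shows how quickly the ratio climbs once hardware gets cheaper while power draw does not.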

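In fact, Luiz André Barroso himself was in charge of the Piranha project while at DEC/Compaq. It used eight simple in-order, single-issue Alpha cores, integrated with Direct Rambus memory controllers offering 12.8GB/s of theoretical bandwidth and with Protocol Engines, in pursuit of TLP performance and the highest Performance/Watt.

As everyone knows, Alpha then died and the Piranha project came to nothing. But many will have noticed that Sun's Niagara and RMI's XLR are products of the same idea. The recent rumor that Sun is in talks with Google over Niagara systems does not look groundless.

Many people are now wondering whether IBM, which has long insisted that OOO combined with SMT remains highly effective, will really make the ultra-high-clock Power6 in-order. Barring surprises, the answer should be visible at ISSCC 2006 next February.

The same title, five years apart, and the meaning could hardly be more different.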

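Tuesday, December 06, 2005

Winter Japan itinerary officially settled

Dates: 12/28-1/5

12/28 Depart
12/29 冬コミ
12/30 冬コミ; Waterball Tokyo Eating Tour in the evening
12/31 Possibly visit a certain bearded Japanese uncle in Saitama Prefecture who has worked himself sick; ring in the New Year in the evening
1/1 Wander around
1/2 Head to Nikko for a snow-country hot spring trip
1/3 Back to Tokyo
1/4 Wander around
1/5 Return to Taiwan

Lake Yunoko in Oku-Nikko, here I come again!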

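Tuesday, November 29, 2005

The Ultra-High-Clock Monster: IBM Power6

The ISSCC 2006 program is out, and the highlight is without doubt the "Ultra High Frequency" IBM Power6. Looks like my jinxing mouth was fairly accurate. There are three Power6 sessions in total: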

.A 5GHz Duty-Cycle Correcting Clock Distribution Network for the POWER6 Microprocessor
Microprocessor global clock distribution networks use long buffered wires where reflections can be significant. Using accurate transmission-line models and optimization, these reflection effects can be exploited to improve clock-distribution characteristics. The clock distribution network of the POWER6 microprocessor is designed to run at frequencies exceeding 5GHz using only inverters and transmission lines and is capable of on-the-fly duty-cycle correction.
.4GHz+ Low Latency Fixed-Point and Binary Floating-Point Execution Units for the POWER6 Processor
A 1-pipe stage, low-latency, 13 FO4, 64b fixed-point execution unit, implemented in a 65nm SOI CMOS process, allows back-to-back execution of data dependent adds, subtracts, compares, shifts, rotates, and logical operations. A 7-pipe stage, 91 FO4, double-precision floating-point unit allows forwarding of dependent results after 6 cycles in most cases.
.A 5.6GHz 64KB Dual-Read Data Cache for the POWER6 Processor
A dual-read 8-way set-associative data cache comprising four 16kB SRAMs and 2 set-prediction macros per POWER6 core is presented. The array utilizes a 0.75μm² butted-junction split-wordline 6T cell in 65nm SOI. The design features dual power supplies, unidirectional polysilicon, and hierarchical unclamped bitlines for enhanced cell stability and performance.
My take: this Ultra High Frequency really is awfully high....
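A quick back-of-the-envelope check on those abstracts. This is a minimal Python sketch, assuming one pipe stage equals one clock cycle and that the whole cycle counts toward the FO4 budget. Both are my assumptions, not statements from the ISSCC program, and the helper names are made up for illustration.

```python
def cycle_time_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def implied_fo4_ps(freq_ghz, fo4_per_cycle=13):
    """FO4 delay implied if one cycle holds fo4_per_cycle FO4 delays (assumption)."""
    return cycle_time_ps(freq_ghz) / fo4_per_cycle


# Figures quoted in the abstracts above.
FXU_STAGES, FXU_FO4 = 1, 13   # fixed-point unit: 1 pipe stage, 13 FO4
FPU_STAGES, FPU_FO4 = 7, 91   # floating-point unit: 7 pipe stages, 91 FO4
DCACHE_KB = 4 * 16            # four 16kB SRAMs per core

# Both units work out to the same per-stage budget.
fo4_per_stage_fxu = FXU_FO4 // FXU_STAGES
fo4_per_stage_fpu = FPU_FO4 // FPU_STAGES

for freq in (4.0, 5.0, 5.6):
    print(f"{freq} GHz: cycle {cycle_time_ps(freq):.1f} ps, "
          f"implied FO4 {implied_fo4_ps(freq):.1f} ps")
```

The point is only that 91 FO4 over 7 stages is the same 13 FO4 per stage as the fixed-point unit, and that four 16kB arrays add up to the 64KB in the cache session title.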

Monday, November 28, 2005

Work projects to clear out before the end of this December

Each one really is a bigger pain than the last. =_=

˙Sun Galaxy/N1
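˙10GbE traffic test feature
˙NetScreen
˙Last September's ISP roundtable

If I can't get these done in December, I may as well forget about escaping reality in Japan.

Tuesday, November 15, 2005

Sunday, November 13, 2005

[Recommended Reading] Itanium-is there light at the end of the tunnel?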

Itanium-is there light at the end of the tunnel?

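I recommend this article mainly in the hope that the many people who just want to kick Itanium while it is down, who know nothing about the history and technical background of IA-64 and oppose it purely for the sake of opposing, can look at Itanium's future with a calmer attitude.

Wednesday, November 09, 2005

[Recommended Reading] Fabricating News: Lower Than Pigs and Dogs

Strongly recommended; a must-read.

"Translating wire stories into reports wasn't enough; now they have started making wire stories up. This is the China Times, the self-proclaimed quality paper. This is Mr. 吳清和, who claims 'a lifetime in sports, nothing but excitement, nothing but being moved'."

Sunday, November 06, 2005

Went nuts today and blew over seven thousand at 天瓏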




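I went nuts today: right after getting up I ran to 天瓏 and bought the three-volume Network Processor Design set, because the McGraw Hill book simply does not have enough content. After last night's discussion at Elder 塔麵's new home, I resolved to buy them all in one go. Good grief, for one feature story dissecting Screen OS and its related ASICs, isn't the price I am paying a bit steep? If I can't expense these three books, I am going to be livid.

Also, "while I was at it", I updated MKP's Computer Networks to the third edition.

And it suddenly hit me that it is about time to buy a new bookcase.

Thursday, November 03, 2005

Went drinking at 金色三麥 again today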



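Last time I went to the 美麗華 branch with Elder 天兵; today I went with a certain lucky guy to the 環亞 branch, which does far better business than the former. The lighting at the 環亞 branch is much too dim, though, so two old perverts ended up sitting forlornly at the bar.

The wheat beer is still too sweet; next time I'll stick to the dark beer.

Monday, October 24, 2005

MOVE32Int: Transport Triggered Architecture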





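Last night, right after I had downed two liters of beer, an "old comrade-in-arms" sent me this thing to look at, and I sobered up instantly.

Basically, this MOVE Machine has several main characteristics:

1. There is only one instruction: "move".

2. The program counter is a visible, memory-mapped design. A jump simply moves a value into the PC: an unconditional jump is "move destination PC", and subroutine linkage is the reverse. In other words, there are no conditional control-flow instructions.

3. It is a VLIW architecture that packs four instructions together.

4. Full guarding: every instruction has a 3-bit field specifying its guard condition code.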

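So here is the question: what are the drawbacks of this MOVE instruction set?

The first and more obvious one is that code density is too low. A simple addition like add r1,r2,r3 (Oint <= R2; Tadd1 <= R3; R1 <= Rint1) takes three instructions on MOVE. In other words it needs 48 bits, which is even more than IA-64 (41 bits).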
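To make the three-move addition concrete, here is a toy Python model. It is a minimal sketch under my own assumptions, not the real MOVE32Int: the names Oint, Tadd1, and Rint1 come from the example above, while the `MoveMachine` class, its register layout, and the 16-bits-per-move figure (simply 48 divided by 3) are illustrative.

```python
class MoveMachine:
    """Toy transport-triggered machine whose only instruction is move(src, dst)."""

    def __init__(self):
        # Hypothetical layout: R0 is hardwired to zero, R1..R10 are general purpose.
        self.regs = {f"R{i}": 0 for i in range(11)}
        self.operand = 0  # Oint: operand register of the integer unit
        self.result = 0   # Rint1: result register of the integer unit
        self.pc = 0       # the program counter is just another destination

    def read(self, src):
        if isinstance(src, int):  # immediate value
            return src
        if src == "Rint1":
            return self.result
        if src == "PC":
            return self.pc
        return self.regs[src]

    def move(self, src, dst):
        value = self.read(src)
        if dst == "Oint":
            self.operand = value
        elif dst == "Tadd1":
            # Writing the trigger register is what starts the addition.
            self.result = self.operand + value
        elif dst == "PC":
            self.pc = value       # a jump is simply a move into the PC
        elif dst != "R0":         # writes to the hardwired-zero register are dropped
            self.regs[dst] = value


machine = MoveMachine()
machine.move(5, "R2")
machine.move(7, "R3")

# add r1,r2,r3 expressed as three moves
add_r1_r2_r3 = [("R2", "Oint"), ("R3", "Tadd1"), ("Rint1", "R1")]
for src, dst in add_r1_r2_r3:
    machine.move(src, dst)

BITS_PER_MOVE = 48 // 3   # implied by "three moves need 48 bits"
IA64_SLOT_BITS = 41
print(machine.regs["R1"], len(add_r1_r2_r3) * BITS_PER_MOVE, IA64_SLOT_BITS)
# prints: 12 48 41
```

The operation itself never appears as an opcode; the addition happens as a side effect of writing Tadd1, which is exactly why one RISC-style add turns into three transports.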

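Another problem is register file complexity. An architecture like MOVE needs a lot of registers; cut the number of registers and more instructions are needed to get the computation done. But a centralized register file may be hard to implement, because the number of register file ports grows linearly with the issue rate, which directly adds complexity.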
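The port-count argument can be sketched numerically. This is a minimal Python sketch under two assumptions of mine: each move reads at most one register and writes at most one, and register file area grows roughly with the square of the total port count, a commonly quoted rule of thumb rather than anything measured on MOVE32Int.

```python
def worst_case_ports(moves_per_word):
    """Ports a centralized register file needs if every move touches it."""
    return {"read": moves_per_word, "write": moves_per_word}


def relative_area(total_ports):
    """Rule-of-thumb area model (assumption): area scales with ports squared."""
    return total_ports ** 2


for issue_rate in (1, 2, 4, 8):
    ports = worst_case_ports(issue_rate)
    total = ports["read"] + ports["write"]
    print(issue_rate, total, relative_area(total) // relative_area(2))
```

Under this model the four-wide machine described above needs 8 ports in the worst case, 16 times the area of the single-issue baseline, which is why splitting the register file becomes tempting.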

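More importantly, most of the register file ends up used for special registers with control and similar duties, and the actual GPRs are few. In MOVE32Int, for example, there are only 10 GPRs (not counting r0, which is hardwired to zero).

Finally, my own feeling is that, although this is a 1990s design, exception handling in the MOVE architecture cannot be simple. I need to think this part through properly; I haven't touched this stuff in a long time.

I really look forward to the fruit of his years of effort: how do you design an instruction set that combines the advantages of MOVE with high code density?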

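[Reader Submission] Let me restate Waterball's Three Great Lies XD

1. I love peace.
2. I am poor.
3. I am going to quit drinking.

Sunday, October 23, 2005

Might get to ring in the New Year in Japan

Tonight, after Elder 牛 and I ate our way through two thousand at 大和, we headed straight to a certain long-haired slacker's place to discuss a second three-man Japan trip this year. It should be seven days and six nights or six days and five nights, and my work projects for the year should all be wrapped up by mid-December. The one variable is whether a certain super-senior elder, who in theory should have come along in February but bowed out for health reasons, will join the group. It would of course be best if he could, even though expenses might balloon as a result.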

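The question now is: what to do in Japan? All I can think of is the following:

˙冬コミ. The official site hasn't announced dates or the number of days yet, but it's a near certainty.
˙A snow-country hot spring trip to Nikko or Kinugawa.
˙Wander around ヨドバシAkiba in Akihabara.
˙Visit a certain bearded Japanese uncle in Saitama Prefecture who has worked himself sick, needs long-term rest, and can't even make next spring's IDF.
˙Run from shrine to shrine on New Year's Eve and New Year's Day.
˙Simply play dead and escape reality.
˙Play the train pervert.
˙Take the chance to burn some EVA Air miles.
˙Other.

Still, I really do want to go to Japan. Ever since February I have been missing the snow there, and this year I was too busy even to make 夏コミ. Not escaping reality for a bit would be hard to justify.

Last of all, I can never forget my first-ever "Waterball's Four-Part Snow Faceplant", and I hope to give a repeat performance at year's end.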

Friday, October 21, 2005

New toys after three days of reservist recall






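For my first impressions on receiving the machine, see here, though it is all a load of rambling.

Normally, work-related diary entries (the ones written for the boss to read) go on my iThome branch blog, but today I'll make an exception and write here, because after dim sum tonight with a certain lucky guy from Chunghwa Telecom, I went back to the office on a whim and installed Windows Server 2003 x64 and the Platform SDK.

First, the Dell PE2850 itself. It continues Dell's tradition of "rugged" server mechanical design and management setup. The DRAC 4/I configuration interface in particular, compared with HP iLO and IBM RSA II, is really.... The system BIOS is much the same: nearly two years on, no progress at all. The top cover carries a half-hearted little disassembly diagram (come on, look at how HP does it), the rack indicator light design is poor, and IPMI is supported only up to version 1.5 (not that many machines support 2.0 anyway). I have used Dell OpenManage far too many times and was short on time today, so I didn't bother installing it.

"Your design has real guts; I admire it." That is about the shortest verdict I can give the Dell PE2850.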

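Also, whether it is because the ATi Rage XL has been discontinued and is in short supply (as I recall there is no RoHS-compliant version of the Rage XL), because the Mobility Radeon is just too cheap, or because of the possibility of installing Windows Vista, the PE2850 is the first machine I have handled with such a "high-end" VGA (Radeon 7000M 16MB). For a server this is of course beside the point; I just find it amusing, nothing more.

One thing does have to be mentioned. The default Rage XL driver in Windows Server 2003 x64 has a long-standing problem of defaulting to too high a refresh rate. If the monitor can't handle high refresh rates (older LCDs especially), you get no picture at boot, so on the first boot you have to enter Safe Mode and remove the Rage XL driver before the display works normally. I have hit this problem far too many times, which is why I care so much about the PE2850 using a Radeon 7000M.

That said, the PE2850 is not without merit. Leaving aside the obvious advantage of being far cheaper than the DL380G4 (otherwise Dell would be out of business), the mechanical design is quite practical. Another thing surprised me: the new (MSDN) version of Windows Server 2003 x64 actually ships with a PERC 4/Di driver, and the DRAC 4/I is fast Ethernet (IBM RSA II is only 10Mbps). More importantly, the RAID controller "appears" to be quite efficient; I'll run it properly when I'm back at the office tomorrow. This has always been HP's weak spot, and the Smart Array 6i really is lousy. The P600 is very good, just too expensive, and it's SAS.

I can't quite be bothered to get SPEC CPU 2000 running, but regardless of headline performance, it is a stability test that has to be done, particularly for memory reliability, which laymen easily overlook. Laymen who don't understand SPEC CPU (and who slept through computer architecture, or even think SPEC CPU only needs to be installed and "given a quick run") will never grasp where the real value of this "benchmark" lies. This time I will write a brand-new EM64T config; the previous version's efficiency was less than ideal.

My one regret is that the PE2850 test period is only one week, the same situation as the earlier HP triple-header, so I won't haul it over to Elder 塔麵's place and will just handle it in my own lab.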



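The ASUS 3112F is rather interesting too: what would I do with a 12-port GBIC box? The machine was delivered today to Elder 塔麵's secret underground nuclear test site; as soon as a certain Anritsu instrument is confirmed usable, it goes straight to the execution ground. Until then, Elders 塔麵 and 小貓 can play with it. Speaking of which, the ZyXEL GS-4024 has been running on a community network for two months now, and it is about time I wrote up a test report, if I still have the time.

Finally, with IBM's help and after nearly a month of talks, today we formally settled the details of the SQL Server 2005 test: the server is an 8-way x460, the storage is a DS4800, and the base configuration is SAN boot. As for the poor soul supplying a database "convincing enough for everyone", I can't say yet; just wait for the show.

There is no way around it: enterprise products of any kind are meant to be used, not merely benchmarked. The best test is a real deployment, and that is the basic requirement of enterprise product testing. The server tests yours truly runs personally rarely have a test period under one month. Leaving aside the past year's work and three server buyer's-guide specials, June's IBM x366, July's Tyan VX50 (the board with "two PCIe x16 slots" that some idiot sounding off on the anonymous board at 鳥窩 wanted to see; I, of course, was after 8-way/16-core Opteron, with the Iwill H8501 as the rival, so who cares whether it has two PCIe x16 slots), and August's IBM x460 were each actually deployed in 大台北寬頻's machine room for more than a month and a half.

Surely nobody reads what yours truly casually scribbles on this blog and really believes I write like this at work? Frankly, anyone with a shred of common sense wouldn't say something so idiotic.

Sadly, there seem to be quite a few such idiots. Sigh.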

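Thoughts on this reservist recall

Six rounds per person, so what does it mean when one target sheet has 14 holes in it.  ̄▽ ̄

The perfect score of the guy next to me was probably also the result of my "friendly sponsorship".

Tuesday, October 18, 2005

Off to three days of reservist recall

This Tuesday, Wednesday, and Thursday, yours truly will be off "serving" his country for three days, so this blog will be closed for three days as well.

Sunday, October 16, 2005

Bought a Creative X-Fi sound card today



Tonight, after Korean barbecue with the 電腦王 crew, I went to 現代生活廣場 on a whim and bought one. Price: 4,490, with a Creative HS600 headset thrown in.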

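Impressions:

1. The fabled 24-bit Crystalizer really does work, but when playing MP3s with higher bit rates it seems to introduce extra noise. I still need to look into this.

2. With the X-Fi's EAX 5.0 enabled in Battlefield 2, audio is clearly better than on the Audigy 2. However, when more characters appear on screen, CPU utilization seems to rise sharply, causing frame drops.

3. The X-Fi drivers and related tools are shockingly bloated. In Add/Remove Programs, the Creative System Information entry alone takes up 415MB; no wonder installing the driver from the CD takes so long.

Overall, the Creative X-Fi feels decent in use; it is, after all, a four-thousand-odd piece of kit. Of course, I do think some of the details still leave room for improvement.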

Saturday, October 15, 2005

IBM is substantially reworking the Cell SPE design after all

A post on the IBM Power Architecture Forum

Key points:
"There are already some modification going on within CELL. There is some progress in replacing the entire FPU inside the SPU with a full-blown DP (double precision) unit. The estimated performance should be about 1:2 against the current SP unit, which would be a major improvement compared to the current situation of about 1:10 - 1:14."
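In other words, if this comes true, Cell's double-precision FPU performance will rise substantially, to half of its single-precision performance.

This doesn't surprise me at all; the Cell SPE's double-precision FPU was overdue for improvement.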
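How big a jump would that be? A minimal Python sketch, assuming single-precision throughput stays the same across the redesign (my assumption; the forum post doesn't say). The function names are made up for illustration.

```python
def dp_throughput(sp_throughput, sp_to_dp_ratio):
    """DP throughput when DP runs at 1/sp_to_dp_ratio of the SP rate."""
    return sp_throughput / sp_to_dp_ratio


def dp_speedup(old_ratio, new_ratio):
    """Relative DP gain when the DP:SP ratio moves from 1:old_ratio to 1:new_ratio."""
    return old_ratio / new_ratio


# Ratios quoted in the forum post: currently about 1:10 to 1:14, target 1:2.
for old_ratio in (10, 14):
    print(f"1:{old_ratio} -> 1:2 is a {dp_speedup(old_ratio, 2):.0f}x DP gain")
```

So the change described would be worth roughly a 5x to 7x improvement in double precision, before any clock or SPE count changes.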

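This week's 10GbE traffic test



Related posts:

10Gbps traffic test results revealed
An unconventional way to celebrate National Day
The calm before the storm
This is a really stupid thing
How AMD Opteron sustains the effective bandwidth of 10GbE

Wednesday, October 05, 2005

Power5+ is indeed just a 90nm shrink of Power5

IBM has already published the relevant documentation and the corresponding servers on its website; I still have to figure out how to test them.

But none of that matters. What matters is this article I found on Yahoo, backed up here: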
IBM's Power6 Gets First Silicon as Power5+ Looms

By Timothy Prickett Morgan

The word on the street is that IBM Corp last month achieved first silicon on its forthcoming Power6 chip, due in servers perhaps in late 2006 and maybe in early 2007, just as it is getting ready to ship a kicker to the current Power5, appropriately called the Power5+ chip. The rumors have it that Big Blue is getting ready to launch the Power5+ in its pSeries AIX-based server line in September or October, which is consistent with past announcements and customer expectations.

IBM refused to comment on the veracity of these rumors, as is the traditional stance of all IT vendors when it comes to rumors about the timing and technical features of future products--excepting their own statements and roadmaps, of course.

Various high-level sources at IBM were very clear in late 2004, and again in early 2005, that the iSeries line of servers, also based on the "Squadron" server design and the Power5 processors, like the pSeries line of machines, would not be upgraded to Power5+ processors in 2005. While IBM has not said why this is the case, it is not hard to surmise.

The Power5+ chips will be using a new 90 nanometer chip-making process, and the yields will not be particularly high. Every one of them that comes off the line working properly will be precious, and will be delivered to customers who need the absolute best raw performance that IBM can bring to bear in the server market.

IBM has direct competition in the Unix market, and Power5+ is really aimed at these customers. To put it bluntly, in terms of green-screen performance, the iSeries line was overkill for most customers back in the late 1990s with the S-Star and I-Star processor lines, so the Power5+ must be a nuclear holocaust or something (to take a bad analogy and make it worse, with my apologies to Paul McCartney on that parenthetical).

The Power5+ chip will be a shrink of the current Power5 chip, which is based on a 130 nanometer, copper/SOI process used first in the 1.7 GHz Power4+ chip that came out in July 2003 and was subsequently used to create the 1.9 GHz Power4+ in February 2004. With the Power5+ chips, IBM is moving to a 90 nanometer copper/SOI process, a very similar process that is being used by IBM to create the "Cell" PowerPC processor that will be used by Sony and Toshiba in various electronic devices.

While IBM will probably implement some circuitry changes in the Power5+ chip, the rumor is that there will be no significant changes to the cores in the processors. IBM could possibly increase the size of the on-chip L2 cache, which is shared by both cores in the Power4 and Power5 families of chips. For instance, when IBM moved from the Power4 to the Power4+ chip, it increased the size of that shared L2 cache to 1.9 MB from 1.4 MB. IBM could tweak other things here and there, but the Power5+ chip should plug into existing Squadron machines; most server designs are created to handle at least two generations of processors.

It would be interesting if IBM could boost the logical partitioning capabilities of the Squadron platform with the Power5+ chips, perhaps doubling from the current 10 partitions per processor core to 20 partitions--or even higher. With anywhere between 30 and 60% higher performance (comparing a 1.9 GHz Power5 chip to a 2.5 GHz or 3 GHz Power5+ chip), there should be room to do this. Many customers would love to support more than 254 partitions on a big Squadron box, and frankly, it might even make sense for IBM to quadruple this and really go after big server consolidation jobs.

The main benefits of the Power5+ chip should be much lower power consumption and heat dissipation in the same clock speed range, as well as more performance in about the same heat range. The Power5 processors run at 1.5 GHz, 1.65 GHz, and 1.9 GHz (with the two lower speeds available in the iSeries line and the top-end speed only available in pSeries machines where the extra performance is critical). As I have said before, because of the differing performance and heat constraints of the server market, I think IBM will probably offer a wider variety of clock speeds for the Power5+ generation, allowing customers to push up into the 2.5 GHz to 3 GHz range for machines with about the same CPU thermals and maybe even down into the 1.5 GHz range or a little lower for customers who want about the same performance as the Power5s, but with half or less of the power consumption and heat dissipation.

If IBM doesn't offer customers options that trade off compute power and heat, it is being silly; this is what its main competitors in server processors--Intel Corp and Advanced Micro Devices--are doing. Such a chip, running at as little as 1 GHz, would make a nice entry iSeries processor. IBM, if you have a lot of duds that don't run at 2.5 GHz, make some puppy iSeries boxes out of them--don't throw them in the trash.

Moreover, offering a low-speed, low-heat Power5+ would allow IBM to create a very powerful hybrid AIX/Linux workstation. Hewlett-Packard Co and Sun Microsystems Inc have let their Unix workstation lines languish--HP withdrew support of HP-UX on Itanium workstations last summer, in fact. So there is a chance to go after flops-hungry workstation customers with Power5+ as well. But IBM may not go for this opportunity if Power5+ yields are not high.

Considering the trouble IBM's Microelectronics Division had getting its 90 nanometer processes online--and one of the reasons why it has lost Apple as a chip customer--it is hard to believe that IBM will have the chip volumes to do pSeries servers with Power5+ in 2005, and then maybe add some iSeries servers in 2006 (perhaps when i5/OS V5R4 debuts sometime next year), and then do maybe 10 times the volume of these servers in Unix/Linux workstations using Power5. But, if it could get yields on 90 nanometer for Apple Computer (eventually), maybe it can get yields for a workstation line, too.

Power6: To ECLipz or Not to ECLipz

There is a lot of chatter about what Power6 is and isn't, and 18 months has not really cleared up the confusion about what IBM's future Power6 processor is and isn't. IBM finished up the design of the chip earlier this year (prior to March, and I am not sure when) and did what is called a "tape out," which means the data that describes the process by which you make the masks to make the chips is finished and sent to the chip factory (also called a fab) so they can start making the chips.

When the first chips that function come off the assembly lines at the factory (in this case, IBM's 90 nanometer, 300mm wafer facility in East Fishkill, New York), this is called "first silicon." According to my sources, the Power6 went into first silicon sometime in July, and IBM has actually put the chips into test systems. Those sources say that IBM has booted the open source Linux operating system on the Power6 chips, but has not yet put the AIX or i5/OS operating systems on them.

The Power chip roadmaps from a few years ago caused some confusion in that they indicated that IBM would be using a 65 nanometer process for these chips, due in 2006 and 2007, and could be ramping clock speeds up as high as 6 GHz. The roadmap characterized the clock speeds on the Power6 chips as "ultra high frequency," and unlike the Power4, Power5, and Power5+ chips on the roadmap, the Power6 item did not show two cores, but simply an area that said "cores," plural. It also said "L2 caches," and said "Advanced System Features" instead of the distributed switch that occupies two sides of every Power4, Power4+, Power5, and Power5+ chip.

This distributed switch is the high-speed interconnection that allows four dual-core Power chips to be lashed together into an eight-way SMP server inside a multichip module (MCM), which is a single piece of electronics that is about as big as the palm of your hand. This MCM also contains the L3 caches. To make a big SMP box, like the 64-way Squadron i5 595 and p5 595 machines, you put eight of these eight-core MCMs on cell boards (which IBM calls books) and you have made a big, bad box.

Contrary to what a lot of people have written based on earlier roadmaps and IBM's own statements, the initial Power6 chips will use the same 90 nanometer process that is used for the Power5+ chip. Further down the road--perhaps in the late spring or late summer of 2007--IBM will roll out its Power6+ chips using a future 65 nanometer process.

Earlier this year, in clarifying Power6 clock speeds, IBM sources told me that the leap from Power5 to Power6 will involve a big jump in gigahertz--more than the jump from Power4 to Power5. The fastest initial Power4 clocked at 1.3 GHz, and the fastest Power5 clocks at 1.9 GHz, which is a bump of 46%. It seems likely that Power6 chips will probably start out at 3 GHz and then push up to 4 GHz. If it can keep the Power6 in the same thermal envelope of the Power5s, there is no reason not to do this.

I think it highly unlikely that IBM will try to push clock speeds to 6 GHz as the initial Power6 specs suggested a number of years ago. Rather than do this, I think IBM will probably have brought more electronics onto the Power6 core to boost performance. If L3 caches are not shrunk and then integrated on the chip with the Power5+, you can bet IBM will do it with the Power6, and then possibly add an external L4 cache to keep those hungry processors fed.

IBM could, of course, add more processor cores to the Power chip with the Power6. Intel is trying to get its "Tukwila" Itanium chip out the door in 2007. Tukwila is expected to have at least four Itanium cores per chip and, like the Power6+, it will use a 65 nanometer manufacturing process. IBM could take this four-core approach with Power6+ and keep the clock speed relatively low on Power6 core and dial up the number of cores on the chip from two to four. Could is the key word here.

IBM has plenty of time to change its mind with Power6+, even if Power6 is done. Remember, Intel was going to ship Montecito a year ago, but then, after taking a drubbing from IBM with the Power4 chips, decided to make Montecito a dual-core rather than a single-core chip. IBM could redraw its roadmap for Power6+ in the same way, keeping the clock speeds low and doubling the cores. On multi-threaded jobs, a four-core Power6+ chip would have four physical threads and four virtual threads through SMT, and keeping the chip count the same as the Squadron boxes, that would mean a big Power6+ box would have 256 threads. This would help databases a great deal, but its value to big batch jobs would be limited. What seems clear is that we are going to have to figure out how to thread batch jobs on all computer architectures.

Having said all that, given IBM's whole "system on a chip" philosophy, I think Big Blue might put off four cores until Power6+ in 2007, and maybe even Power7 in 2008.

Take a look at the history (and breathe deeply before you read this): The Power4 chip put what were essentially two S-Star PowerPC cores with their own L1 caches, the L1 cache controllers, a shared L2 cache, and a single L2 cache controller onto the chip and put the L3 cache off the die. With Power5, IBM added simultaneous multithreading (SMT), doubled the speed of the distributed switch interconnection on chips so it ran at full clock speed (it was half speed on the Power4s), boosted the size of the L2 cache, went from two-way to four-way set association for the caches, moved the L3 cache controller into the chip, moved the L3 cache into the chip package and, most importantly, hung that L3 cache off the L2 cache with a direct link rather than making it go through the interconnection fabric of the MCM, which it did with the Power4. (This wickedly reduced memory latencies.)

I think Power6 will include an on-die L3 cache for each core (or maybe shared by two cores), hung off of individual L2 caches (one per each core), plus an integrated L4 controller, and L4 cache that is implemented in the MCM packaging like L3 caches are today on the Power5s. As I speculated a few months ago, I think there is also a possibility that IBM ditches this hierarchical cache structure and creates a whole new scheme above the L2 caches in each core that boosts memory bandwidth beyond what is possible with a staged cache architecture.

Here's another interesting idea: Imagine if IBM used its thermal conduction module (TCM) technology from mainframes to put an entire 32-chip, 64-core machine in four blocks of ceramic, thus shortening many of the wires in a server complex and significantly reducing interprocessor and memory latencies to the very limits of physics? IBM could do this TCM packaging with the Power6, or hold off until the Power6+. IBM seems to have removed the distributed switch with the Power6 design and replaced it with "advanced system features." What is more advanced than a mainframe's TCM?

What seems clear is that the Power6 chip has been a major redesign, according to my sources, and much of this redesign is being driven apparently by the necessities of moving to a 65 nanometer chip making process. But it may also be done so IBM can do the full tilt TCM integration like it does in mainframes for its very high-end i5 and p5 boxes, as well as deliver single chip, dual-core Power6 chips for volume markets where a TCM is overkill. I think IBM is also committed to getting low-power, dual-core Power6s into entry and midrange servers, blade servers, and even embedded devices. IBM is concerned about power management, which is why it is merging simultaneous multithreading and multiple cores in the Power5 design. Both of these technologies make better use of transistors, and deliver performance without having to add significantly to clock speed.

IBM has also hinted that the Power6 chips will add a lot more functions for self-management from the microcode underpinning OS/400 and AIX, and now its Virtualization Engine hypervisor, into the Power6 chip itself. It would not be surprising for the large pieces of the virtualization embodied in the Virtualization Engine to somehow be implemented in chip transistors and firmware loaded into the processor. Intel and AMD are embedding X86 instruction set virtualization in their chips using their respective VT and Pacifica technologies. IBM could do something similar, providing electronic assist to Virtualization Engine.

The Power6 chip could, being implemented as a TCM, also consolidate the iSeries, pSeries, and zSeries lines down in some way to support mainframe as well as i5/OS, AIX, Linux workloads on the same processor complexes. This is the fabled "Project ECLipz," which IBM has not confirmed and has weakly denied. Exactly how mainframe workloads might be supported is unclear, but there is certainly a prospect of mixing and matching zSeries and Power6 processors within the same complex or TCM.

Using mainframe simulation software from Transitive is also an option. That is how Apple is going to be supporting Power-based workloads on Intel's chips in its future machines. QuickTransit, Transitive's emulation software, can already support mainframe workloads on Power, Xeon, Itanium, and Opteron processors. IBM might go so far as to license Transitive's QuickTransit, implement much of its features in silicon, and put that inside a Power6 or Power7 TCM to make a hybrid mainframe-Power box.

For ECLipz, IBM could also implement zSeries processor instructions in "millicode," a kind of on-chip microcode that would create a CISC mainframe instruction from a bunch of RISC instructions. The zSeries processors already do this a little, by the way, and so does an Itanium chip do this when it is running HP-UX workloads since the Itanium doesn't support PA-RISC instructions. Even the Pentium chip that is probably on your desktop uses similar technology; that Pentium is not using the 80486 instruction set, but has a RISC-like core that assembles these 486 CISC instructions out of smaller RISC instructions. It just tricks the software into thinking it is running 486 instructions.

Whatever IBM has decided, with the Power6 chip in first silicon, whatever it is going to do in terms of core count and mainframe support can now be found out. It is now just a matter of time.
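The arithmetic scattered through the article above (the 46% clock bump, the 64-way Squadron box, the 256-thread Power6+ box) can be re-derived in a few lines. This is a minimal Python sketch using only figures quoted in the article; the function and constant names are mine, and the four-core Power6+ configuration is the article's speculation, not a confirmed product.

```python
def pct_increase(old, new):
    """Percentage increase from old to new."""
    return (new - old) / old * 100.0


CHIPS_PER_MCM = 4         # four dual-core chips per multichip module
MCMS_PER_SYSTEM = 8       # eight MCMs in a top-end Squadron box
SMT_THREADS_PER_CORE = 2  # one physical plus one virtual thread per core


def system_totals(cores_per_chip):
    """Return (chips, cores, threads) for a fully populated box."""
    chips = CHIPS_PER_MCM * MCMS_PER_SYSTEM
    cores = chips * cores_per_chip
    threads = cores * SMT_THREADS_PER_CORE
    return chips, cores, threads


# Power4 at 1.3 GHz to Power5 at 1.9 GHz
print(f"clock bump: {pct_increase(1.3, 1.9):.0f}%")

# Dual-core Power5 today versus the speculated four-core Power6+
print("Power5 box:", system_totals(cores_per_chip=2))
print("Power6+ box:", system_totals(cores_per_chip=4))
```

With the chip count held at 32, doubling the cores per chip is all it takes to get from today's 64 cores to the 256 threads the article mentions.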