\REF{[1]} DARPA. High productivity computing systems (HPCS), vision: Focus on
the lost dimension of HPC --- ``User \& system efficiency and
productivity''. http://www.darpa.mil/ipto/programs/hpcs/vision.htm.
\REF{[2]} John Hennessy, David Patterson. Computer Architecture: A
Quantitative Approach. Fourth edition, Morgan Kaufmann, ISBN:
0123704901, 2006.
\REF{[3]} Wm A Wulf, Sally A McKee. Hitting the memory wall:
Implications of the obvious. \it ACM SIGARCH Computer Architecture News,
\rm March 1995, 23(1): 20$\sim$24.
\REF{[4]} Chen T F, Baer J L. Effective hardware-based data prefetching
for high performance processors. \it IEEE Transactions on Computers, \rm
1995, 44(5): 609$\sim$623.
\REF{[5]} Dahlgren F, Dubois M, Stenstr\"om P. Fixed and adaptive
sequential prefetching in shared-memory multiprocessors. In
{\it Proc. International Conference on Parallel Processing $($ICPP$)$},
Los Alamitos, CA, USA, CRC Press, 1993, Vol.1, pp.56$\sim$63.
\REF{[6]} Fu J, Patel J H. Data prefetching in multiprocessor vector
cache memories. In {\it Proc. the 17th Annual International
Symposium on Computer Architecture}, Toronto, Canada, 1991, pp.54$\sim$63.
\REF{[7]} Joseph D, Grunwald D. Prefetching using Markov predictors. In
{\it Proc. the 24th International Symposium on Computer
Architecture}, Denver, Colorado, 1997, pp.252$\sim$263.
\REF{[8]} Gokul Kandiraju, Anand Sivasubramaniam. Going the distance for
TLB prefetching: An application-driven study. In {\it Proc. the
International Symposium on Computer Architecture}, Anchorage, Alaska,
2002, p.195.
\REF{[9]} Alexander T, Kedem G. Distributed predictive cache design for
high performance memory system. In {\it Proc. the 2nd
International Symposium on High Performance Computer Architecture
$($HPCA$)$}, San Jose, CA, 1996, pp.254$\sim$263.
\REF{[10]} Collins J, Tullsen D, Wang H, Shen J. Dynamic speculative
precomputation. In {\it Proc. the 34th International Symposium on
Microarchitecture}, Austin, Texas, 2001, pp.306$\sim$317.
\REF{[11]} Wessam Hassanein, Jos\'e Fortes, Rudolf Eigenmann. Data
forwarding through in-memory precomputation threads. In {\it Proc.
the International Conference on Supercomputing $($ICS$)$}, 2004.
\REF{[12]} Hughes C J. Prefetching linked data structures in systems with
merged DRAM-logic [Thesis]. University of Illinois at
Urbana-Champaign, Technical Report UIUCDCS-R-2001-2221, May 2000.
\REF{[13]} Liao S, Wang P, Wang H, Hoflehner G, Lavery D, Shen J.
Post-pass binary adaptation tool for software-based speculative
precomputation. In {\it Proc. the ACM SIGPLAN Conference on
Programming Language Design and Implementation $($PLDI'02$)$}, Berlin,
Germany, 2002, pp.117$\sim$128.
\REF{[14]} Chi-Keung Luk. Tolerating memory latency through
software-controlled pre-execution in simultaneous multithreading
processors. In {\it Proc. the 28th Annual International Symposium
on Computer Architecture}, G\"oteborg, Sweden, 2001, pp.40$\sim$51.
\REF{[15]} Amir Roth, Gurindar S Sohi. Speculative data-driven
multithreading. In {\it Proc. the 7th International Symposium on
High Performance Computer Architecture}, Nuevo Le\'on, Mexico, 2001,
p.37.
\REF{[16]} Craig Zilles, Gurindar Sohi. Execution-based prediction using
speculative slices. In {\it Proc. the 28th Annual International
Symposium on Computer Architecture $($ISCA$)$}, G\"oteborg, Sweden, 2001,
pp.2$\sim$13.
\REF{[17]} Yang C L, Lebeck A R. Push vs. pull: Data movement for
linked data structures. In {\it Proc. the International Conference
on Supercomputing $($ICS$)$}, Santa Fe, New Mexico, 2000,
pp.176$\sim$186.
\REF{[18]} James E Smith. Decoupled access/execute computer
architectures. In {\it Proc. the 9th Annual International Symposium on
Computer Architecture $($ISCA$)$}, Gold Coast, Queensland, 1982, pp.112$\sim$119.
\REF{[19]} Culler D, Singh J P, Gupta A. Parallel Computer Architecture:
A Hardware/Software Approach. Morgan Kaufmann, ISBN 1558603433, August
1998.
\REF{[20]} Xian-He Sun, Surendra Byna. Data-access memory servers for
multi-processor environments. IIT CS TR-2005-001, November 2005,
http://www.cs.iit.edu/$\sim$suren/research.html.
\REF{[21]} Burger D C, Austin T M, Bennett S. Evaluating future
microprocessors: The SimpleScalar tool set. Technical Report 1308,
University of Wisconsin-Madison Computer Sciences, 1996.
\REF{[22]} Surendra Byna, Xian-He Sun, William Gropp, Rajeev Thakur.
Predicting the memory-access cost based on data access patterns. In
{\it Proc. the IEEE International Conference on Cluster Computing},
San Diego, September 2004, pp.327$\sim$336.
\REF{[23]} Annavaram M, Patel J M, Davidson E S. Data prefetching
by dependence graph pre-computation. In {\it Proc. the 28th
International Symposium on Computer Architecture $($ISCA$)$}, G\"oteborg,
Sweden, 2001, pp.52$\sim$61.
\REF{[24]} Kohout N, Choi S, Kim D, Yeung D. Multi-chain
prefetching: Effective exploitation of inter-chain memory parallelism
for pointer-chasing codes. In {\it Proc. the 10th International
Conference on Parallel Architectures and Compilation Techniques},
Barcelona, Spain, 2001, pp.268$\sim$279.
\REF{[25]} Roth A, Moshovos A, Sohi G S. Dependence based
prefetching for linked data structures. In {\it Proc. the 8th
International Conference on Architectural Support for Programming
Languages and Operating Systems}, San Jose, CA, 1998, pp.115$\sim$126.
\REF{[26]} Ilya Ganusov, Martin Burtscher. Future execution: A hardware
prefetching technique for chip multiprocessors. In {\it Proc. the
14th Annual International Conference on Parallel Architectures and
Compilation Techniques $($PACT'05$)$}, Saint Louis, MO, 2005,
pp.350$\sim$360.
\REF{[27]} Conway J H, Guy R K. The Book of Numbers.
Springer-Verlag, New York, 1996, ISBN: 038797993X.
\REF{[28]} Box G E P, Jenkins G M, Reinsel G C. Time Series
Analysis: Forecasting and Control. 3rd ed, Prentice Hall, 1994.
\REF{[29]} Jack Doweck. Inside Intel core microarchitecture and smart memory
access. White paper, Intel Research website, Available online at
http://download.intel.com/technology/architecture/sma.pdf, 2006.
\REF{[30]} Sun Microsystems. UltraSPARC IV Processor Architecture
Overview. www.sun.com/processors/white\-papers/us4\_whitepaper.pdf.
\REF{[31]} IBM. Cell Broadband Engine resource center.
http://www-128.ibm.com/developerworks/power/cell/.
\REF{[32]} Thomas R Puzak, A Hartstein, P G Emma, V Srinivasan.
When prefetching improves/degrades performance. In {\it Proc. the
2nd Conference on Computing Frontiers}, Ischia, Italy, May 04$\sim$06,
2005, pp.342$\sim$352.
\REF{[33]} Standard Performance Evaluation Corporation. SPEC Benchmarks,
http://www.spec.org/.
\REF{[34]} Jack J Dongarra, Jeremy Du Croz, Sven Hammarling, Iain Duff.
A set of level 3 basic linear algebra subprograms. \it ACM Transactions
on Mathematical Software, \rm 1990, 16(1): 1$\sim$17.
\REF{[35]} John D McCalpin. Memory bandwidth and machine balance in current
high performance computers. \it IEEE Technical Committee on Computer
Architecture, \rm 1995, http://www.cs.virginia.edu/stream.
\REF{[36]} Sherwood T, Perelman E, Calder B. Basic block distribution
analysis to find periodic behavior and simulation points in
applications. In {\it Proc. the International Conference on
Parallel Architectures and Compilation Techniques}, Barcelona, Spain,
2001, pp.3$\sim$14.
\REF{[37]} Dahlgren F, Dubois M, Stenstr\"om P. Sequential hardware
prefetching in shared-memory multiprocessors. \it IEEE Transactions on
Parallel and Distributed Systems, \rm 1995, 6(7): 733$\sim$746.
\REF{[38]} Yue Liu, David R Kaeli. Branch-directed and stride-based data
cache prefetching. In {\it Proc. the 1996 International Conference
on Computer Design, VLSI in Computers and Processors},
October 7$\sim$9, 1996, pp.225$\sim$230.
\REF{[39]} Mowry T, Gupta A. Tolerating latency through
software-controlled prefetching in shared-memory multiprocessors.
\it Journal of Parallel and Distributed Computing, \rm June
1991, 12(2): 87$\sim$106.
\REF{[40]} Pai V S, Ranganathan P, Abdel-Shafi H, Adve S. The impact
of exploiting instruction-level parallelism on shared-memory
multiprocessors. \it IEEE Transactions on Computers, \rm February 1999, 48(2): 218$\sim$226.
\REF{[41]} Zhou H. Dual-core execution: Building a highly scalable
single-thread instruction window. In {\it Proc. the 2005
International Conference on Parallel Architectures and Compilation
Techniques $($PACT'05$)$}, Saint Louis, MO, 2005, pp.231$\sim$242.
\REF{[42]} Solihin Y, Lee J, Torrellas J. Using a user-level memory
thread for correlation prefetching. In {\it Proc. International
Symposium on Computer Architecture}, Anchorage, Alaska, May 2002,
pp.171$\sim$182.