G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS '67 (Spring), pages 483-485. ACM, 1967.
D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS parallel benchmarks; summary and preliminary results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing '91, pages 158-165. ACM, 1991.
C. Balkesen, J. Teubner, G. Alonso, and M. T. Özsu. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 362-373, 2013.
M. Banikazemi, D. Poff, and B. Abali. PAM: a novel performance/power aware meta-scheduler for multi-core systems. In Proceedings of the International Conference on Supercomputing, pages 39:1-39:12, 2008.
B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz. A regression-based approach to scalability prediction. In Proceedings of the 22nd Annual International Conference on Supercomputing, ICS '08, pages 368-377. ACM, 2008.
M. Bhadauria and S. A. McKee. An approach to resource-aware co-scheduling for CMPs. In Proceedings of the 24th International Conference on Supercomputing, pages 189-199. ACM, 2010.
L. Carrington, A. Snavely, and N. Wolter. A performance prediction framework for scientific applications. Future Generation Computer Systems, 22(3):336-346, Feb. 2006.
D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005.
G. Chatzopoulos, A. Dragojević, and R. Guerraoui. ESTIMA: Extrapolating scalability of in-memory applications. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '16, 2016.
A. Collins, T. Harris, M. Cole, and C. Fensch. LIRA: Adaptive contention-aware thread placement for parallel runtime systems. In Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS '15, pages 2:1-2:8. ACM, 2015.
T. Dey, W. Wang, J. W. Davidson, and M. L. Soffa. ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity. ACM Transactions on Architecture and Code Optimization, 10(4):41:1-41:25, Dec 2013.
G. Dhiman, G. Marchetti, and T. Rosing. vGreen: A system for energy efficient computing in virtualized environments. In Proceedings of the 14th International Symposium on Low Power Electronics and Design, pages 243-248. ACM, 2009.
A. Fedorova, M. Seltzer, and M. D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 25-38. IEEE, 2007.
T. Harris and S. Kaestle. Callisto-RTS: Fine-grain parallel loops. In 2015 USENIX Annual Technical Conference, USENIX ATC '15, pages 45-56, July 2015.
D. J. Kerbyson, H. J. Alme, A. Hoisie, F. Petrini, H. J. Wasserman, and M. Gittings. Predictive performance and scalability modeling of a large-scale application. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, SC '01, pages 37-37. ACM, 2001.
R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using OS observations to improve performance in multicore systems. IEEE Micro, 28(3):54-66, May 2008.
B. Lepers, V. Quéma, and A. Fedorova. Thread and memory placement on NUMA systems: Asymmetry matters. In 2015 USENIX Annual Technical Conference, USENIX ATC '15, pages 277-289, July 2015.
J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In 14th International Conference on High-Performance Computer Architecture, HPCA-14 '08, pages 367-378, 2008.
J.-P. Lozi, B. Lepers, J. Funston, F. Gaud, V. Quéma, and A. Fedorova. The Linux scheduler: A decade of wasted cores. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16. ACM, 2016.
G. Marin and J. Mellor-Crummey. Cross-architecture performance predictions for scientific applications using parameterized models. SIGMETRICS Performance Evaluation Review, 32(1):2-13, June 2004.
R. L. McGregor, C. D. Antonopoulos, and D. S. Nikolopoulos. Scheduling algorithms for effective thread pairing on hybrid multiprocessors. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium. IEEE Computer Society, 2005.
A. Merkel, J. Stoess, and F. Bellosa. Resource-conscious scheduling for energy efficiency on multicore processors. In Proceedings of the 5th European Conference on Computer Systems, pages 153-166. ACM, 2010.
M. S. Müller, J. Baron, W. C. Brantley, H. Feng, D. Hackenberg, R. Henschel, G. Jost, D. Molka, C. Parrott, J. Robichaux, P. Shelepugin, M. van Waveren, B. Whitney, and K. Kumaran. SPEC OMP2012 - An Application Benchmark Suite for Parallel Systems Using OpenMP, pages 223-236. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
R. Nathuji, A. Kansal, and A. Ghaffarkhah. Q-clouds: Managing performance interference effects for QoS-aware clouds. In Proceedings of the 5th European Conference on Computer Systems, pages 237-250. ACM, 2010.
OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 3.0. May 2008. http://www.openmp.org/mp-documents/spec30.pdf.
M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. SIGARCH Comput. Archit. News, 35(2):381-391, June 2007.
Y. Solihin, V. Lam, and J. Torrellas. Scal-Tool: Pinpointing and quantifying scalability bottlenecks in DSM multiprocessors. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, SC '99. ACM, 1999.
R. West, P. Zaroo, C. A. Waldspurger, and X. Zhang. Online cache modeling for commodity multicore processors. SIGOPS Operating Systems Review, 44(4):19-29, Dec. 2010.
Y. Xie and G. H. Loh. Dynamic classification of program memory behaviors in CMPs. In Proceedings of the 2nd Workshop on CMP Memory Systems and Interconnects (CMP-MSI), June 2008.
A. Yasin. A top-down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 35-44, 2014.
J. Zhai, W. Chen, and W. Zheng. PHANTOM: Predicting performance of parallel applications on large-scale parallel machines using a single node. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pages 305-314. ACM, 2010.
X. Zhang, E. Tune, R. Hagmann, R. Jnagal, V. Gokhale, and J. Wilkes. CPI²: CPU performance isolation for shared compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 379-391. ACM, 2013.
S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 129-142. ACM, 2010.
S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto. Survey of scheduling techniques for addressing shared resources in multicore processors. ACM Computing Surveys, 45(1):4, 2012.