AVX2 is slower than SSE2-4.x under Windows ARM emulation

If you compile your app for AVX2 and it runs on Windows ARM under Prism emulation, is it faster or slower than compiling for SSE2-4.x? I assumed it would be roughly the same: maybe slightly slower due to emulation overhead, but with AVX2’s wider operations compensating. The headline gives it away: I was wrong.

💡 TLDR: AVX2 code runs at 2/3 the speed of equivalent SSE2-SSE4.x optimised code under emulation on Windows 11 ARM.

‘Should I compile for AVX2 if my app might run on Windows ARM?’ has a clear answer: No. At least if performance matters.

This post explains how I found out, what I measured and how, the benchmark results, and why.

Curiosity

A few weeks ago, in a Hacker News thread on the emulated performance of WoW (the game) on Windows ARM, I wondered:

‘I’ve been testing some math benchmarks on ARM emulating x64, and saw very little performance improvement with the AVX2+FMA builds, compared to the SSE4.x level. (X64 v2 to v3.) … I’ve found very little info online about this.’

Well, I nerdsniped myself, because those math benchmarks are now complete, so we have the perfect framework for testing AVX2+FMA emulation performance overhead on ARM Windows.

I have no technical reason to do so: if you use our compiler and want to run your app on Windows ARM, we encourage you to simply compile it for Windows ARM. It’s simply that I want to know. So I spent much of Sunday crunching our data and figuring it out.

ARM emulation of x86

You can skip this bit if you know about Windows ARM’s emulation and what the various Intel instruction sets from SSE through AVX2 are: go forward to Benchmarks.

Windows 11 lets you run both 32-bit and 64-bit Intel apps on ARM. It does this via emulation: essentially, x86/64 code is translated on the fly into ARM code. Windows 10 supported emulating 32-bit Intel apps, and in 2021 Windows 11 introduced emulation of 64-bit apps.

In 2024 Windows 11 was updated with a new emulation layer, Prism.
The main user-facing change seems to have been performance: ‘Microsoft told Ars Technica that Prism is as fast as Apple’s Rosetta 2’ and:

‘Most x86 apps now run without issues, and in many cases don’t even feel like they’re being emulated. These days, the majority of users won’t notice a difference between using an Intel PC or a Snapdragon one’ – Windows Central

Is emulation complete / entire?

x86 and x86_64 have not stayed the same. Over time they have gained functionality, which is exposed as instruction sets. These are the base instructions that an app can be compiled to use, and they are often focused on doing things faster.

For example, the x87 floating point math instruction set still exists (it was introduced in the 1980s!) but was succeeded a quarter century ago by SSE2, introduced with the Pentium 4. SSE2 lets you perform floating point math operations much faster. A few years later the SSE 4.x series added further improvements, largely to integer-based operations.

This is a very handwavy summary: in fac