MSVC Backend Updates since Visual Studio 2022 version 17.3

Since Visual Studio 2022 version 17.3, we have continued to improve the C++ backend with new features and new and improved optimizations. Here are some of our exciting improvements.

17.9 improvements for x86 and x64, thanks to our friends at Intel.

Support for Scalar FP intrinsics with double/float arguments
Improve code generation on x64 by replacing VINSERTPS with VBLENDPS
Support for round scalar functions

17.8 improvements

The new /ARM64XFUNCTIONPADMINX64:# flag allows specifying the number of bytes of padding for x64 functions in arm64x images
The new /NOFUNCTIONPADSECTION:sec flag allows disabling function padding for functions in a particular section
LTCG builds now take better advantage of threads, improving throughput.
Support for RAO-INT, thanks to our friends at Intel.
Address sanitizer improvements:

The Address Sanitizer flag is now compatible with C++ modules.
The compiler will now report an error when /fsanitize=address is combined with an incompatible flag, instead of silently disabling ASAN checks.
ASAN checks are now emitted for loads and stores in memchr, memcmp, and the various string functions.

Performance improvements that will help every architecture:

Improve hoisting of loads and stores outside of loops.

Performance improvements for arm64:

Improve memcmp performance on both arm64 and arm64ec.
When calling memcpy, memset, memchr, or memcmp from emulated x64 code, remove the performance overhead of switching to arm64ec versions of these functions.
Optimize scalar immediate loads (from our friends at ARM).
Combine CSET and ADD instructions into a single CINC instruction (from our friends at ARM).

Performance improvements for x86 and x64, many thanks to our friends at Intel:

Improve code generation for _mm_fmadd_sd.
Improve code generation for UMWAIT and TPAUSE, preserving implicit input registers.
Improve code generation for vector shift intrinsics by improving the auto-vectorizer.
Tune internal vectorization thresholds to improve auto-vectorization.
Implement optimization for FP classification beyond std::isnan.
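
Here is a small sketch of the kind of classification code this last item refers to, using the standard <cmath> predicates; the classify helper and the specific checks shown are illustrative, not an exhaustive list of what the optimization covers.

    #include <cmath>
    #include <cstdio>
    #include <limits>

    // Classification checks beyond std::isnan; how each one is expanded is
    // up to the compiler; this just shows the source-level pattern.
    const char* classify(double x) {
        if (std::isnan(x))     return "nan";
        if (std::isinf(x))     return "infinite";
        if (!std::isnormal(x)) return "zero or subnormal";
        return "normal";
    }

    int main() {
        std::printf("%s\n", classify(1.0 / 3.0));
        std::printf("%s\n", classify(std::numeric_limits<double>::infinity()));
    }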

Performance improvements for x64:

Generate a single PSHUFLW instruction for _mm_set1_epi16 when only the lower 64 bits of the result are used (see the sketch after this list).
Improve code generation for abs(). (Thanks to our friends at AMD)
No longer generate redundant loads and stores when LDDQU is combined with VBROADCAST128.
Generate PMADDWD instead of PMULLD where possible.
Combine two contiguous stores into a single unaligned store.
Use 32 vector registers in functions that use AVX512 intrinsics even when not compiling with /arch:AVX512.
Don’t emit unnecessary register to register moves.
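
As a sketch of the _mm_set1_epi16 case in the first item of this list: the broadcast result is only consumed through its low 64 bits, here via _mm_storel_epi64. The helper itself is illustrative.

    #include <emmintrin.h>  // SSE2
    #include <cstdint>

    // Fill an 8-byte buffer with four copies of a 16-bit value. Only the low
    // 64 bits of the _mm_set1_epi16 result are used, which is the shape the
    // single-PSHUFLW improvement targets.
    void fill4_u16(uint16_t* dst, uint16_t value) {
        __m128i v = _mm_set1_epi16(static_cast<short>(value));
        _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), v);
    }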

Performance improvements for x86:

Improve code generation for expf().

17.7 improvements

The new /jumptablerdata flag places jump tables for switch statements in the .rdata section instead of the .text section.
Link time with a cold file system cache is now faster.
Improve compilation time of POGO-instrumented builds.
Speed up LTCG compilation in a variety of ways.
OpenMP improvements with /openmp:llvm, thanks to our friends at Intel:

#pragma omp atomic update and #pragma omp atomic capture no longer need to call into the runtime, improving performance (sketched after this list).
Better code generation for OpenMP floating point atomics.
The clause schedule(static) is now respected for ordered loops.
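
Here is a minimal sketch of the atomic forms mentioned above, assuming a build with /openmp:llvm; the counters themselves are illustrative.

    #include <cstdio>

    int main() {
        double sum = 0.0;
        int tickets = 0;

        #pragma omp parallel for
        for (int i = 0; i < 1000; ++i) {
            // Atomic update on a floating-point accumulator.
            #pragma omp atomic update
            sum += i * 0.5;

            // Atomic capture: read the old value and bump the counter in one step.
            int my_ticket;
            #pragma omp atomic capture
            { my_ticket = tickets; tickets += 1; }
            (void)my_ticket;
        }

        std::printf("sum=%f tickets=%d\n", sum, tickets);
    }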

Performance improvements for all architectures:

Copy propagation optimizations are now more effective, thanks to our friends from AMD.
Improve optimization for de Bruijn tables.
Fully unroll loops of fixed size even if they contain function calls.
Improve bit optimizations.
Deeply nested loops are now optimized.

Performance improvements and additional functionality for x86 and x64, many thanks to our friends at Intel:

Support Intel Sierra Forest instruction set (AVX-IFMA, AVX-NE-CONVERT, AVX-VNNI-INT8, CMPCCXADD, Additional MSR support).
Support Intel Granite Rapids instruction set (AMX-COMPLEX).
Support LOCK_SUB.
Add overflow detection functions for addition, subtraction, and multiplication.
Implement intrinsic functions for isunordered, isnan, isnormal, isfinite, isinf, issubnormal, fmax, and fmin.
Reduce code size of bitwise vector operations.
Improve code generation for AVX2 instructions during tail call optimization.
Improve code generation for floating point instructions without an SSE version.
Remove unneeded PAND instructions.
Improve assembler output for FP16 truncating conversions to use suppress-all-exceptions instead of embedded rounding.
Eliminate unnecessary hoisting of conversions from FP to unsigned long long.

Performance improvements for x64:

No longer emit unnecessary MOVSX/MOVZX instructions.
Do a better job of devirtualizing calls to class functions (see the sketch after this list).
Improve performance of memmove.
Improve code generation for XOR-EXTRACT combination pattern.
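
For the devirtualization item above, here is one shape of call that can be resolved at compile time; the classes are illustrative, and this is only an example of a devirtualizable call rather than a description of the exact cases the compiler improved.

    struct Base {
        virtual int value() const { return 1; }
        virtual ~Base() = default;
    };

    // 'final' guarantees there are no further overrides, so a call through a
    // reference whose static type is Derived does not need a vtable lookup.
    struct Derived final : Base {
        int value() const override { return 42; }
    };

    int read_value(const Derived& d) {
        return d.value();  // candidate for devirtualization
    }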

Performance improvements for arm64:

Improve register coloring for destinations of NEON BIT, BIF, and BSL instructions, thanks to our friends at ARM.
Convert cross-binary indirect calls that use the import address table into direct calls.
Add the _CountTrailingZeros and _CountTrailingZeros64 intrinsics for counting trailing zeros in integers (see the sketch below).
Generate BFI instructions in more places.
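
A small usage sketch for the new intrinsics when targeting Arm64, assuming their prototypes in <intrin.h> follow the existing _CountLeadingZeros pattern; the sample values are arbitrary.

    #include <intrin.h>
    #include <cstdio>

    int main() {
        unsigned long x = 0x80u;              // lowest set bit is bit 7
        unsigned __int64 y = 0x100000000ull;  // lowest set bit is bit 32

        // Count the zero bits below the lowest set bit.
        unsigned int tz32 = _CountTrailingZeros(x);
        unsigned int tz64 = _CountTrailingZeros64(y);

        std::printf("%u %u\n", tz32, tz64);
    }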

17.6 improvements

The /openmp:llvm flag now supports the collapse clause on #pragma omp loop (see the sketch after this list).
The new /d2AsanInstrumentationPerFunctionThreshold:# flag allows turning off ASAN instrumentation on functions that would add more than a certain number of extra ASAN calls.
The new /OTHERARCHEXPORTS option for dumpbin /EXPORTS dumps the x64 exports of an arm64x DLL.
Build time improvements:

Improved LTCG build throughput.
Reduced LTCG build memory usage.
Reduced link time during incremental linking.
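
For the collapse item at the top of this list, here is a minimal sketch of a collapsed loop nest, assuming a build with /openmp:llvm; the flat matrix-add body is illustrative.

    #include <vector>

    void add(std::vector<float>& c, const std::vector<float>& a,
             const std::vector<float>& b, int rows, int cols) {
        #pragma omp parallel
        #pragma omp loop collapse(2)  // both loops form one iteration space
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                c[i * cols + j] = a[i * cols + j] + b[i * cols + j];
    }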

Performance improvements that will help every architecture:

Vectorize loops that use min, max, and absolute value, thanks to our friends at ARM.
Turn loops with a[i] = ((a[i]>>15)&0x10001)*0xffff into vector compares (written out below).
Hoist calculation of array bases of the form (a + constant)[i] out of the loop.
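
Written out, the pattern from the second item looks like the loop below; uint32_t keeps the arithmetic well defined, and the function name is illustrative.

    #include <cstdint>

    // Treats each 32-bit element as two 16-bit lanes: a lane whose sign bit is
    // set becomes 0xFFFF and a lane whose sign bit is clear becomes 0, which is
    // why the loop can be lowered to vector compares.
    void lane_sign_mask(uint32_t* a, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = ((a[i] >> 15) & 0x10001) * 0xffff;
    }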

Performance improvements on arm64:

Load floats directly into floating point registers instead of using integer load and FMOV instructions.
Improve code generation for abs(), thanks to our friends at ARM.
Improve code generation for vectors when NEON instructions are available.
Generate CSINC instructions when one result of the conditional (?:) operator is the constant 1, thanks to our friends at ARM (see the sketch after this list).
Improve code generation for loops that sum an array by using vector add instructions.
Combine vector extend and arithmetic instructions into a single instruction.
Remove extraneous additions, subtractions, and ORs with 0.
Auxiliary delayload IAT: new import address table for calls into delayloaded DLLs in arm64x. At runtime, Windows will patch this table to speed up program execution.
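
As a sketch of the conditional-operator shape mentioned in the CSINC item earlier in this list: one arm of the ?: expression is the constant 1. The function and names are illustrative.

    // One result of the conditional is the constant 1, which is the shape
    // that can now be emitted as a single CSINC on Arm64.
    int one_if_empty(int count, int fallback) {
        return (count == 0) ? 1 : fallback;
    }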

Performance improvements and additional features on x86 and x64, many thanks to our friends at Intel:

Support for Intel Granite Rapids x64 instruction set, specifically TDPFP16PS (AMX-FP16) and PREFETCHIT0/PREFETCHIT1.
Support for ties-to-away rounding for round and roundf intrinsic functions.
Reduce small loops to vectors.
No longer generate redundant MOVD/MOVQ instructions.
Use VBLEND instructions instead of the slower VINSERTF128 and VBLENDPS instructions on AVX512 where possible.
Promote PCLMULQDQ instructions to VPCLMULQDQ where possible with /arch:AVX or later.
Replace VEXTRACTI128 instructions that extract the lower half of a vector with VMOVDQU instructions, thanks to our friends at AMD.
Support for missing AVX512-FP16 intrinsics.
Better code generation with correct VEX/EVEX encoding for VCMPXX pseudo-ops in MASM.
Improve conversions from 64-bit integer to floating-point.
Improve code generation on x64 with correct instruction scheduling for STMXCSR.

17.5 improvements

The new /Zc:checkGwOdr flag allows enforcing the C++ standard's one-definition-rule (ODR) requirements even when compiling with /Gw.
Combine a MOV and a CSEL instruction into a CSINV instruction on arm64.
Performance and code quality improvements for x86 and x64, thanks to our friends at Intel:

Improve code generation for returns of structs consisting of 2 64-bit values on x64 (see the sketch after this list).
Type conversions no longer generate unnecessary FSTP/FLD instructions.
Improve checking floating-point values for Not-a-Number.
Emit a smaller instruction sequence in the auto-vectorizer using bit masking and reduction.
Correct expansion of round to use ROUND instruction only under /fp:fast.
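
The first item in this list refers to functions that return a struct made of two 64-bit values by value; here is a sketch of that shape, with illustrative names.

    #include <cstdint>

    struct Range {
        int64_t begin;
        int64_t end;
    };

    // Returning a struct of two 64-bit values by value is the pattern the item
    // above refers to; the codegen details are left to the compiler.
    Range make_range(int64_t begin, int64_t length) {
        return Range{ begin, begin + length };
    }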

17.4 improvements

Performance improvements that will help every architecture:

Improve bswap for signed integers.
Improve stack packing for functions with memset calls.

Improve debugging support and performance for Arm64:

Edit and Continue is now possible for programs targeting Arm64.
Added support for armv8 int8 matrix multiplication instructions.
Use BIC instructions in place of an MVN and AND (see the sketch after this list).
Use BIC_SHIFT instruction where appropriate.
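
For the BIC item above, here is the and-not pattern that maps to a single bit-clear instruction instead of an MVN followed by an AND; the helper is illustrative.

    #include <cstdint>

    // a & ~mask clears the bits of 'a' that are set in 'mask'; on Arm64 this
    // can be emitted as one BIC instruction rather than MVN + AND.
    uint64_t clear_bits(uint64_t a, uint64_t mask) {
        return a & ~mask;
    }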

Performance and code quality improvements on x64 and x86, thanks to our friends at Intel:

std::memchr now meets the additional C++17 requirement of stopping as soon as a matching byte is found.
Improve code generation for 16-bit interlocked add.
Coalesce register initialization on AVX/AVX2.
Improve code generation for returns of structs consisting of 2 64-bit values.
Improve codegen for _mm_ucomieq_ss.
Use VROUNDXX instructions for ceil, floor, trunc, and round.
Improve checking floating-point values for Not-a-Number.

Support for OpenMP Standard 3.1 under the experimental -openmp:llvm switch has been expanded to include the min and max operators on the reduction clause (see the sketch below).
Improve copy and move elision.
The new /Qspectre-jmp flag adds an int3 after unconditional jump instructions.
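
Here is a minimal sketch of the newly accepted reduction operators, assuming a build with the experimental -openmp:llvm switch; the data and bounds are illustrative.

    #include <climits>
    #include <cstdio>

    int main() {
        int data[8] = { 5, -3, 17, 9, 0, 42, -8, 11 };
        int lo = INT_MAX, hi = INT_MIN;

        // min and max are now accepted on the reduction clause.
        #pragma omp parallel for reduction(min:lo) reduction(max:hi)
        for (int i = 0; i < 8; ++i) {
            if (data[i] < lo) lo = data[i];
            if (data[i] > hi) hi = data[i];
        }

        std::printf("min=%d max=%d\n", lo, hi);
    }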

Do you want to experience the new improvements in the C++ backend? Please download the latest Visual Studio 2022 and give it a try! Any feedback is welcome. We can be reached via the comments below, Developer Community, Twitter (@VisualC), or email at visualcpp@microsoft.com.
Stay tuned for more information on updates to the latest Visual Studio.