GCC modified for the FreeChainXenon project
Tamar Christina 85002f8085 middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403].
This fixes a bug with the interaction between peeling for gaps and early break.

Before I go further, I'll first explain how I understand this to work for loops
with a single exit.

When peeling for gaps we peel N < VF iterations to scalar.
This happens by removing N iterations from the calculation of niters such that
vect_iters * VF == niters is always false.

In other words, when we exit the vector loop we always fall to the scalar loop.
The loop bounds adjustment guarantees this. Because of this we potentially
execute a vector loop iteration less.  That is, if you're at the boundary
condition where niters % VF == 0, then by peeling one or more scalar iterations
the vector loop executes one iteration less.

This is accounted for by the adjustments in vect_transform_loop.  This
adjustment happens differently based on whether the vector loop can be
partial or not:

Peeling for gaps sets the bias to 0 and then:

when not partial:  we take the floor of (scalar_upper_bound / VF) - 1 to get the
		   vector latch iteration count.

when loop is partial:  For a single exit this means the loop is masked, we take
                       the ceil to account for the fact that the loop can handle
		       the final partial iteration using masking.

Note that there's no difference between ceil and floor on the boundary condition.
There is a difference however when you're slightly above it, e.g. if scalar
iterates 14 times and VF = 4 and we peel 1 iteration for gaps.

The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations, and in effect
the partial iteration is ignored and is done as scalar.

This is fine because the niters modification has capped the vector iteration at
2.  So that when we reduce the induction values you end up entering the scalar
code with ind_var.2 = ind_var.1 + 2 * VF.
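
To make the floor/ceil arithmetic above concrete, here is a minimal sketch.
The function names are hypothetical and are not taken from
tree-vect-loop.cc; they only model the two formulas described in the text,
with the bias passed in explicitly (0 when peeling for gaps).

```c
#include <assert.h>

/* Hypothetical model: vector latch count when the final vector iteration
   cannot be partial, i.e. floor ((bound + bias) / VF) - 1.  */
unsigned
latch_bound_floor (unsigned scalar_bound, unsigned bias, unsigned vf)
{
  assert (vf > 0 && scalar_bound + bias >= vf);
  return (scalar_bound + bias) / vf - 1;
}

/* Hypothetical model: vector latch count when the final vector iteration
   may be partial, i.e. ceil ((bound + bias) / VF) - 1.  */
unsigned
latch_bound_ceil (unsigned scalar_bound, unsigned bias, unsigned vf)
{
  assert (vf > 0 && scalar_bound + bias > 0);
  return (scalar_bound + bias + vf - 1) / vf - 1;
}
```

With a bound of 13, bias 0 and VF = 4, the floor variant gives 2 and the ceil
variant gives 3.  On the boundary (a bound of 12) both give 2, which is the
"no difference" case mentioned above.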

Now let's look at early breaks.  To make it easier I'll focus on the specific
testcase:

char buffer[64];

__attribute__ ((noipa))
buff_t *copy (buff_t *first, buff_t *last)
{
  char *buffer_ptr = buffer;
  char *const buffer_end = &buffer[SZ-1];
  int store_size = sizeof(first->Val);
  while (first != last && (buffer_ptr + store_size) <= buffer_end)
    {
      const char *value_data = (const char *)(&first->Val);
      __builtin_memcpy(buffer_ptr, value_data, store_size);
      buffer_ptr += store_size;
      ++first;
    }

  if (first == last)
    return 0;

  return first;
}

Here the first, early exit is on the condition:

  (buffer_ptr + store_size) <= buffer_end

and the main exit is on condition:

  first != last

This is important, as this bug only manifests itself when the first exit has a
known constant iteration count that's lower than the latch exit count.

Because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 16
bytes per iteration.  So the exit has a known bound of 8 + 1.

The vectorizer correctly analyzes this:

Statement (exit)if (ivtmp_21 != 0)
 is executed at most 8 (bounded by 8) + 1 times in loop 1.

and as a consequence the IV is bound by 9:

  # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)>
  ...
  vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615 };
  mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 };
  if (mask_patt_22.17_126 == { -1, -1, -1, -1 })
    goto <bb 3>; [88.89%]
  else
    goto <bb 30>; [11.11%]

The important bits are these:

In this example the value of last - first = 416.

The calculated vector iteration count is:

    x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27

The bounds check generated, adjusting for gaps, is:

   x == (((x - 1) >> 2) << 2)

which means we'll always fall through to the scalar code, as intended.
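
The "always fall through" property of that check can be sketched as follows.
The helper names are hypothetical; the only thing taken from the text is the
expression ((x - 1) >> 2) << 2, which rounds x - 1 down to a multiple of 4 and
is therefore always strictly less than x for x >= 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the number of iterations the vector loop is allowed
   to cover, mirroring ((x - 1) >> 2) << 2 from the text.  */
uint64_t
vector_covered_iters (uint64_t x)
{
  assert (x >= 1);
  return ((x - 1) >> 2) << 2;
}

/* Hypothetical helper: true only if the scalar loop could be skipped,
   i.e. if the comparison from the text held.  */
bool
skips_scalar_loop (uint64_t x)
{
  return x == vector_covered_iters (x);
}
```

For x = 27 the covered count is 24, so the comparison is false and the scalar
loop runs.  The same holds when x is itself a multiple of 4: x = 24 yields 20.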

Here are two key things to note:

1. In this loop, the early exit will always be the one taken.  When it's taken
   we enter the scalar loop with the correct induction value to apply the gap
   peeling.

2. If the main exit is taken, the induction values assume you've finished all
   vector iterations.  i.e. it assumes you have completed 24 iterations, as we
   treat the main exit the same for normal loop vect and early break when not
   PEELED.
   This means the induction value is adjusted to ind_var.2 = ind_var.1 + 24 * VF;

So what's going wrong?  The vectorizer's codegen is correct and efficient;
however, when we adjust the upper bounds, that code knows that the loop's upper
bound is based on the early exit, i.e. 8 latch iterations.  In other words, it
thinks the loop iterates once.

This is incorrect, as the vector loop iterates twice: it has set up the
induction value such that it exits at the early exit.  So in effect it iterates
2.5 times.

Because the upper bound is incorrect, when we unroll, the loop now exits from
the main exit, which uses the incorrect induction value.

So there are three ways to fix this:

1.  If we take the position that the main exit should support both premature
    exits and final exits then vect_update_ivs_after_vectorizer needs to be
    skipped for this case, and vectorizable_induction updated with a third case
    where we reduce with LAST reduction based on the IVs instead of assuming
    you're at the end of the vector loop.

    I don't like this approach.  I don't think we should add a third induction
    style to cover up an issue introduced by unrolling.  It makes the code
    harder to follow and makes main exits harder to reason about.

2. We could say that vec_init_loop_exit_info should pick the exit which has the
   smallest known iteration count.  This would turn this case into a PEELED case
   and the induction values would be correct as we'd always recalculate them
   from a reduction.  This is suboptimal though as the reason we pick the latch
   exit as the IV one is to prevent having to rotate the loop.  This results
   in more efficient code for what we assume is the common case, i.e. the main
   exit.

3. In PR113734 we've established that for vectorization of early breaks that we
   must always treat the loop as partial.  Here partiality means that we have
   enough vector elements to start the iteration, but we may take an early exit
   and so never reach the latch/main exit.

   This requirement is overridden by the peeling for gaps adjustment of the
   upper bound.  I believe the bug is simply that this shouldn't be done.
   The adjustment here is to indicate that the main exit always leads to the
   scalar loop when peeling for gaps.

   But this invariant is already always true for all early exits.  Remember that
   early exits restart the scalar loop at the start of the vector iteration, so
   the induction values will start it where we want to do the gaps peeling.

I think option 3 is the correct fix, and also one that doesn't degrade code quality.
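
As a rough illustration of option 3, the sketch below models the upper-bound
adjustment with the peeling-for-gaps bias applied only when there is no early
break.  This is a simplified model, not the code in vect_transform_loop: the
function name, the parameters and the use of a single combined factor
(assumed here to be VF * unroll) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the vector latch upper bound.  */
unsigned
vector_latch_bound (unsigned scalar_latch_bound, unsigned factor,
		    bool peeling_for_gaps, bool early_break,
		    bool partial_vectors)
{
  assert (factor > 0);

  /* Option 3: peeling for gaps only drops the bias to 0 when the loop has
     no early break.  */
  unsigned bias = (peeling_for_gaps && !early_break) ? 0 : 1;

  /* Early-break loops are always treated as partial.  */
  bool final_iter_may_be_partial = partial_vectors || early_break;

  unsigned iters = scalar_latch_bound + bias;
  unsigned vec_iters = (final_iter_may_be_partial
			? (iters + factor - 1) / factor
			: iters / factor);
  assert (vec_iters >= 1);
  return vec_iters - 1;
}
```

Under these assumptions, a scalar latch bound of 8 with a factor of 8, peeling
for gaps and an early break gives a vector latch bound of 1, i.e. two vector
iterations.  With the bias forced to 0, the same inputs would give
ceil (8 / 8) - 1 = 0, i.e. a loop believed to iterate once.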

gcc/ChangeLog:

	PR tree-optimization/114403
	* tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for when
	peeling for gaps and early break.

gcc/testsuite/ChangeLog:

	PR tree-optimization/114403
	* gcc.dg/vect/vect-early-break_124-pr114403.c: New test.
	* gcc.dg/vect/vect-early-break_125-pr114403.c: New test.
2024-04-15 12:06:52 +01:00
.github Minor formatting fix for newly-added file from previous commit 2023-11-01 19:28:56 -04:00
c++tools Update copyright years. 2024-01-03 12:19:35 +01:00
config build: Check for cargo when building rust language 2024-04-15 13:03:35 +02:00
contrib Daily bump. 2024-04-13 00:17:47 +00:00
fixincludes Daily bump. 2023-11-23 00:18:14 +00:00
gcc middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403]. 2024-04-15 12:06:52 +01:00
gnattools Update Copyright year in ChangeLog files 2024-01-03 11:35:18 +01:00
gotools Daily bump. 2023-11-04 00:16:45 +00:00
include Daily bump. 2024-04-09 00:17:24 +00:00
INSTALL
libada Update copyright years. 2024-01-03 12:19:35 +01:00
libatomic Daily bump. 2024-04-08 12:15:19 +00:00
libbacktrace Daily bump. 2024-03-09 00:17:14 +00:00
libcc1 Daily bump. 2024-03-17 00:17:21 +00:00
libcody Update Copyright year in ChangeLog files 2024-01-03 11:35:18 +01:00
libcpp Daily bump. 2024-03-15 00:17:52 +00:00
libdecnumber Daily bump. 2024-04-03 00:17:29 +00:00
libffi Daily bump. 2023-10-27 00:17:12 +00:00
libgcc Daily bump. 2024-04-11 00:17:54 +00:00
libgfortran Daily bump. 2024-04-13 00:17:47 +00:00
libgm2 Daily bump. 2024-04-03 00:17:29 +00:00
libgo libgo: bump libgo version for GCC 14 release 2024-02-05 11:28:30 -08:00
libgomp Daily bump. 2024-04-09 00:17:24 +00:00
libgrust Update copyright years. 2024-02-21 13:51:26 +01:00
libiberty Daily bump. 2024-04-03 00:17:29 +00:00
libitm Daily bump. 2024-04-03 00:17:29 +00:00
libobjc Daily bump. 2024-04-03 00:17:29 +00:00
libphobos Daily bump. 2024-04-08 12:15:19 +00:00
libquadmath Daily bump. 2024-04-10 00:16:50 +00:00
libsanitizer Daily bump. 2024-02-17 00:17:08 +00:00
libssp Daily bump. 2024-02-07 00:18:31 +00:00
libstdc++-v3 Daily bump. 2024-04-14 00:16:53 +00:00
libvtv Daily bump. 2024-04-03 00:17:29 +00:00
lto-plugin Update copyright years. 2024-01-03 12:19:35 +01:00
maintainer-scripts Daily bump. 2023-11-14 12:23:39 +00:00
zlib Daily bump. 2023-10-23 00:16:43 +00:00
.dir-locals.el
.gitattributes
.gitignore *: add modern gettext 2023-11-14 00:47:11 +01:00
ABOUT-NLS
ar-lib
ChangeLog Daily bump. 2024-04-05 00:16:44 +00:00
ChangeLog.jit
ChangeLog.tree-ssa
compile
config-ml.in LoongArch: Reimplement multilib build option handling. 2023-09-15 10:42:12 +08:00
config.guess
config.rpath
config.sub
configure build: Check for cargo when building rust language 2024-04-15 13:03:35 +02:00
configure.ac build: Check for cargo when building rust language 2024-04-15 13:03:35 +02:00
COPYING
COPYING.LIB
COPYING.RUNTIME
COPYING3
COPYING3.LIB
depcomp
install-sh
libtool-ldflags
libtool.m4 Build: fix error in fixinclude configure 2023-11-22 11:54:33 +01:00
ltgcc.m4
ltmain.sh
ltoptions.m4
ltsugar.m4
ltversion.m4
lt~obsolete.m4
MAINTAINERS MAINTAINERS: Update my email address 2024-04-04 16:39:52 +02:00
Makefile.def gccrs: Fix missing build dependency 2024-01-16 16:23:02 +01:00
Makefile.in Fix up postboot dependencies [PR106472] 2024-04-02 13:40:27 +02:00
Makefile.tpl Fix up postboot dependencies [PR106472] 2024-04-02 13:40:27 +02:00
missing
mkdep
mkinstalldirs
move-if-change
multilib.am
README
SECURITY.txt SECURITY.txt: Drop "exploitable" in reference to hardening issues 2024-01-09 10:49:01 -05:00
symlink-tree
test-driver
ylwrap

This directory contains the GNU Compiler Collection (GCC).

The GNU Compiler Collection is free software.  See the files whose
names start with COPYING for copying permission.  The manuals, and
some of the runtime libraries, are under different terms; see the
individual source files for details.

The directory INSTALL contains copies of the installation information
as HTML and plain text.  The source of this information is
gcc/doc/install.texi.  The installation information includes details
of what is included in the GCC sources and what files GCC installs.

See the file gcc/doc/gcc.texi (together with other files that it
includes) for usage and porting information.  An online readable
version of the manual is in the files gcc/doc/gcc.info*.

See http://gcc.gnu.org/bugs/ for how to report bugs usefully.

Copyright years on GCC source files may be listed using range
notation, e.g., 1987-2012, indicating that every year in the range,
inclusive, is a copyrightable year that could otherwise be listed
individually.