Commit graph

Tamar Christina
0e52059129 AArch64: update costing for MLA by invariant
When determining issue rates we currently discount non-constant MLA accumulators
for Advanced SIMD but don't do it for the latency.

This means the costs for Advanced SIMD with a constant accumulator are wrong and
result in us costing SVE and Advanced SIMD the same.  This can cause us to
vectorize with Advanced SIMD instead of SVE in some cases.

This patch adds the same discount for SVE and Scalar as we do for issue rate.

This gives a 5% improvement in fotonik3d_r in SPECCPU 2017 on large
Neoverse cores.
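
As an illustrative sketch (my example, not the benchmark code), the kind of
loop this costing change targets is a multiply-accumulate where one
multiplicand is loop-invariant:

    void
    mla_by_invariant (int *restrict acc, int *restrict a, int c, int n)
    {
      for (int i = 0; i < n; i++)
        acc[i] += a[i] * c;   /* MLA with a loop-invariant multiplicand.  */
    }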

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_multiply_add_p): Update handling
	of constants.
	(aarch64_adjust_stmt_cost): Use it.
	(aarch64_vector_costs::count_ops): Likewise.
	(aarch64_vector_costs::add_stmt_cost): Pass vinfo to
	aarch64_adjust_stmt_cost.
2023-08-04 13:46:36 +01:00
Richard Biener
1a599caab8 tree-optimization/110838 - vectorization of widened right shifts
The following fixes a problem with my last attempt of avoiding
out-of-bound shift values for vectorized right shifts of widened
operands.  Instead of truncating the shift amount with a bitwise
and, we actually need to saturate it to the target precision.

The following does that and adds test coverage for the constant
and invariant but variable case that would previously have failed.
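
As a worked illustration (the numbers are mine, not from the testcase):
when demoting a 32-bit right shift to 16-bit precision with shift amount 20,
truncating with a bitwise and gives 20 & 15 = 4, which computes the wrong
value, whereas saturating gives min (20, 15) = 15, which shifts in only
sign bits exactly as the widened shift would.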

	PR tree-optimization/110838
	* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
	Fix right-shift value sanitizing.  Properly emit external
	def mangling in the preheader rather than in the pattern
	def sequence where it will fail vectorizing.

	* gcc.dg/vect/pr110838.c: New testcase.
2023-08-04 13:15:05 +02:00
Matthew Malcomson
0782b01c9e mid-end: Use integral time intervals in timevar.cc
On some AArch64 bootstrapped builds, we were getting a flaky test
because the floating point operations in `get_time` were being fused
with the floating point operations in `timevar_accumulate`.

This meant that the rounding behaviour of our multiplication with
`ticks_to_msec` was different when used in `timer::start` and when
performed in `timer::stop`.  These extra inaccuracies led to the
testcase `g++.dg/ext/timevar1.C` being flaky on some hardware.
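
A minimal sketch of the effect (illustrative values, not the timevar code):

    #include <stdio.h>

    int
    main (void)
    {
      volatile double ticks = 12345.0, ticks_to_msec = 1.0 / 3e6;
      double base = 0.125;
      /* With contraction the compiler may emit a fused multiply-add,
         rounding once; without it the multiply and add round
         separately, so two inlined copies of the same expression can
         disagree in the last ulp.  */
      printf ("%.17g\n", ticks * ticks_to_msec + base);
      return 0;
    }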

------------------------------
Avoiding the inlining which was agreed to be undesirable.  Three
alternative approaches:
1) Use `-ffp-contract=on` to avoid this particular optimisation.
2) Adjusting the code so that the "tolerance" is always of the order of
   a "tick".
3) Recording times and elapsed differences in integral values.
   - Could be in terms of a standard measurement (e.g. nanoseconds or
     microseconds).
   - Could be in terms of whatever integral value ("ticks" /
     seconds&microseconds / "clock ticks") is returned from the syscall
     chosen at configure time.

While `-ffp-contract=on` removes the problem that I bumped into, there
has been a similar bug on x86 that was to do with a different floating
point problem that also happens after `get_time` and
`timevar_accumulate` are both inlined into the same function.  Hence
it seems worth choosing a different approach.

Of the two other solutions, recording measurements in integral values
seems the most robust against slightly "off" measurements being
presented to the user -- even though it could avoid the ICE that creates
a flaky test.

I considered storing time in whatever units our syscall returns and
normalising them at the time we print out rather than normalising them
to nanoseconds at the point we record our "current time".  The logic
being that normalisation could have some rounding effect (e.g. if
TICKS_PER_SECOND is 3) that would be taken into account in calculations.

I decided against it in order to give the values recorded in
`timevar_time_def` some interpretive value so it's easier to read the
code.  Compared to the small rounding that would represent a tiny amount
of time and AIUI cannot trigger the same kind of ICEs as we are
attempting to fix, said interpretive value seems more valuable.

Recording time in microseconds seemed reasonable since all obvious
values for ticks and `getrusage` are at microsecond granularity or less
precise.  That said, since TICKS_PER_SECOND and CLOCKS_PER_SEC are both
variables given to us by the host system, I was not sure enough of
that to make this decision.
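
A hypothetical sketch of the conversion (names and the divisibility
assumption are mine, not the committed code):

    #include <stdint.h>

    #define NANOSEC_PER_SEC 1000000000

    /* Scale the integral value returned by the timing syscall to
       nanoseconds at the point we record the current time, assuming
       ticks_per_second divides NANOSEC_PER_SEC.  */
    static uint64_t
    ticks_to_nanosec (uint64_t ticks, uint64_t ticks_per_second)
    {
      return ticks * (NANOSEC_PER_SEC / ticks_per_second);
    }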

------------------------------
timer::all_zero is ignoring rows which are inconsequential to the user
and would be printed out as all zeros.  Since upon printing rows we
convert to the same double value and print out the same precision as
before, we return true/false based on the same amount of time as before.

timer::print_row casts to a floating point measurement in units of
seconds as was printed out before.

timer::validate_phases -- I'm printing out nanoseconds here rather than
floating point seconds since this is an error message for when things
have "gone wrong"; printing out the actual nanoseconds that have been
recorded seems like the best approach.
N.b. since we now print out nanoseconds instead of a floating point value
the padding requirements are different.  Originally we were padding to
24 characters and printing 18 decimal places.  This looked odd with the
now visually smaller values getting printed.  I judged 13 characters
(corresponding to 2 hours) to be a reasonable point at which our
alignment could start to degrade and this provides a more compact output
for the majority of cases (checked by triggering the error case via
GDB).

------------------------------
N.b. I use a literal 1000000000 for "NANOSEC_PER_SEC".  I believe this
would fit in an integer on all hosts that GCC supports, but am not
certain there are not strange integer sizes we support, hence am
pointing it out for special attention during review.

------------------------------
No expected change in generated code.
Bootstrapped and regtested on AArch64 with no regressions.

Hope this is acceptable -- I had originally planned to use
`-ffp-contract` as agreed until I saw mention of the old x86 bug in the
same area which was not to do with floating point contraction of
operations (PR 99903).

gcc/ChangeLog:

	PR middle-end/110316
	PR middle-end/99903
	* timevar.cc (NANOSEC_PER_SEC, TICKS_TO_NANOSEC,
	CLOCKS_TO_NANOSEC, nanosec_to_floating_sec, percent_of): New.
	(TICKS_TO_MSEC, CLOCKS_TO_MSEC): Remove these macros.
	(timer::validate_phases): Use integral arithmetic to check
	validity.
	(timer::print_row, timer::print): Convert from integral
	nanoseconds to floating point seconds before printing.
	(timer::all_zero): Change limit to nanosec count instead of
	fractional count of seconds.
	(make_json_for_timevar_time_def): Convert from integral
	nanoseconds to floating point seconds before recording.
	* timevar.h (struct timevar_time_def): Update all measurements
	to use uint64_t nanoseconds rather than seconds stored in a
	double.
2023-08-04 11:26:47 +01:00
Richard Biener
04aa0edcac tree-optimization/110838 - less aggressively fold out-of-bound shifts
The following adjusts the shift simplification patterns to avoid
touching arithmetic right shifts of possibly negative values with
out-of-bound shift amounts.  While simplifying those to zero isn't
wrong, it violates the principle of least surprise.
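
For example (a sketch of the surprising case):

    int
    f (int x)
    {
      /* The shift amount is out of bounds for 32-bit int, so folding
         to zero is permissible, but a user on a target that masks
         shift counts may expect x == -1 to keep yielding -1.  */
      return x >> 100;
    }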

	PR tree-optimization/110838
	* match.pd (([rl]shift @0 out-of-bounds) -> zero): Restrict
	the arithmetic right-shift case to non-negative operands.
2023-08-04 12:16:00 +02:00
Pan Li
2d2f090e67 Revert "RISC-V: Support RVV VFMACC rounding mode intrinsic API"
This reverts commit 51e5a5cefb.
2023-08-04 17:11:26 +08:00
Pan Li
7a6b4d87d8 Revert "RISC-V: Support RVV VFNMACC rounding mode intrinsic API"
This reverts commit 62d9c1dd8e.
2023-08-04 17:11:12 +08:00
Pan Li
b87a4739a4 Revert "RISC-V: Support RVV VFMSAC rounding mode intrinsic API"
This reverts commit dccd7e8a72.
2023-08-04 17:10:49 +08:00
Pan Li
098d6fbe64 Revert "RISC-V: Support RVV VFNMSAC rounding mode intrinsic API"
This reverts commit 236ec7aac0.
2023-08-04 17:10:29 +08:00
Georg-Johann Lay
85414e25ad AVR: Add some more devices: AVR16DD*, AVR32DD*, AVR64DD*, AVR64EA*, ATtiny42*, ATtiny82*, ATtiny162*, ATtiny322*, ATtiny10*.
gcc/
	* config/avr/avr-mcus.def (avr64dd14, avr64dd20, avr64dd28, avr64dd32)
	(avr64ea28, avr64ea32, avr64ea48, attiny424, attiny426, attiny427)
	(attiny824, attiny826, attiny827, attiny1624, attiny1626, attiny1627)
	(attiny3224, attiny3226, attiny3227, avr16dd14, avr16dd20, avr16dd28)
	(avr16dd32, avr32dd14, avr32dd20, avr32dd28, avr32dd32)
	(attiny102, attiny104): New devices.
	* doc/avr-mmcu.texi: Regenerate.
2023-08-04 10:25:18 +02:00
Georg-Johann Lay
14daa69fec Fix some minor typos in avr-mcus.def.
gcc/
	* config/avr/avr-mcus.def (avr128d*, avr64d*): Fix their FLASH_SIZE
	and PM_OFFSET entries.
2023-08-04 09:51:11 +02:00
Andrew Pinski
91c963ea6f Fix PR 110874: infinite loop in gimple_bitwise_inverted_equal_p with fre
This changes gimple_bitwise_inverted_equal_p to use 2 different match patterns
to try to match bit_not wrapped with a possible nop_convert and a comparison
also wrapped with a possible nop_convert. This is to avoid being recursive.

OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.

gcc/ChangeLog:

	PR tree-optimization/110874
	* gimple-match-head.cc (gimple_bit_not_with_nop): New declaration.
	(gimple_maybe_cmp): Likewise.
	(gimple_bitwise_inverted_equal_p): Rewrite to use gimple_bit_not_with_nop
	and gimple_maybe_cmp instead of being recursive.
	* match.pd (bit_not_with_nop): New match pattern.
	(maybe_cmp): Likewise.

gcc/testsuite/ChangeLog:

	PR tree-optimization/110874
	* gcc.c-torture/compile/pr110874-a.c: New test.
2023-08-04 00:26:42 -07:00
Drew Ross
9020da78df match.pd: Canonicalize (signed x << c) >> c [PR101955]
Canonicalizes (signed x << c) >> c into the lowest
precision(type) - c bits of x IF those bits have a mode precision or a
precision of 1. Also combines this rule with (unsigned x << c) >> c -> x &
((unsigned)-1 >> c) to prevent a duplicate pattern.
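
For instance, with 32-bit int and c = 24 the kept bits have precision
32 - 24 = 8, a mode precision, so the canonical form is a sign extension
(sketch):

    int
    f (int x)
    {
      return (x << 24) >> 24;   /* Equivalent to (int) (signed char) x.  */
    }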

	PR middle-end/101955
	* match.pd ((signed x << c) >> c): New canonicalization.

	* gcc.dg/pr101955.c: New test.
2023-08-04 09:08:05 +02:00
Pan Li
236ec7aac0 RISC-V: Support RVV VFNMSAC rounding mode intrinsic API
This patch would like to support the rounding mode API for the
VFNMSAC for the below samples.

* __riscv_vfnmsac_vv_f32m1_rm
* __riscv_vfnmsac_vv_f32m1_rm_m
* __riscv_vfnmsac_vf_f32m1_rm
* __riscv_vfnmsac_vf_f32m1_rm_m

Signed-off-by: Pan Li <pan2.li@intel.com>

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc
	(class vfnmsac_frm): New class for vfnmsac frm.
	(vfnmsac_frm_obj): New declaration.
	(BASE): Ditto.
	* config/riscv/riscv-vector-builtins-bases.h: Ditto.
	* config/riscv/riscv-vector-builtins-functions.def
	(vfnmsac_frm): New function definition.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/float-point-single-negate-multiply-sub.c:
	New test.
2023-08-04 14:03:10 +08:00
Pan Li
dccd7e8a72 RISC-V: Support RVV VFMSAC rounding mode intrinsic API
This patch would like to support the rounding mode API for the
VFMSAC for the below samples.

* __riscv_vfmsac_vv_f32m1_rm
* __riscv_vfmsac_vv_f32m1_rm_m
* __riscv_vfmsac_vf_f32m1_rm
* __riscv_vfmsac_vf_f32m1_rm_m

Signed-off-by: Pan Li <pan2.li@intel.com>

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc
	(class vfmsac_frm): New class for vfmsac frm.
	(vfmsac_frm_obj): New declaration.
	(BASE): Ditto.
	* config/riscv/riscv-vector-builtins-bases.h: Ditto.
	* config/riscv/riscv-vector-builtins-functions.def
	(vfmsac_frm): New function definition.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/float-point-single-multiply-sub.c: New test.
2023-08-04 14:02:41 +08:00
Pan Li
62d9c1dd8e RISC-V: Support RVV VFNMACC rounding mode intrinsic API
This patch would like to support the rounding mode API for the
VFNMACC for the below samples.

* __riscv_vfnmacc_vv_f32m1_rm
* __riscv_vfnmacc_vv_f32m1_rm_m
* __riscv_vfnmacc_vf_f32m1_rm
* __riscv_vfnmacc_vf_f32m1_rm_m

Signed-off-by: Pan Li <pan2.li@intel.com>

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc
	(class vfnmacc_frm): New class for vfnmacc.
	(vfnmacc_frm_obj): New declaration.
	(BASE): Ditto.
	* config/riscv/riscv-vector-builtins-bases.h: Ditto.
	* config/riscv/riscv-vector-builtins-functions.def
	(vfnmacc_frm): New function definition.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/float-point-single-negate-multiply-add.c:
	New test.
2023-08-04 10:37:59 +08:00
Hao Liu
4d8b556317 AArch64: Avoid the ICE on empty reduction definition in info_for_reduction [PR110625]
Fix the assertion failure on an empty reduction definition in info_for_reduction.
Even if a stmt is live, it may still have an empty reduction definition.  Check
the reduction definition instead of the live info before calling info_for_reduction.

gcc/ChangeLog:

	PR target/110625
	* config/aarch64/aarch64.cc (aarch64_force_single_cycle): Check
	STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pr110625_3.c: New testcase.
2023-08-04 10:34:32 +08:00
Pan Li
51e5a5cefb RISC-V: Support RVV VFMACC rounding mode intrinsic API
This patch would like to support the rounding mode API for the
VFMACC for the below samples.

* __riscv_vfmacc_vv_f32m1_rm
* __riscv_vfmacc_vv_f32m1_rm_m
* __riscv_vfmacc_vf_f32m1_rm
* __riscv_vfmacc_vf_f32m1_rm_m

Signed-off-by: Pan Li <pan2.li@intel.com>

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc
	(class vfmacc_frm): New class for vfmacc frm.
	(vfmacc_frm_obj): New declaration.
	(BASE): Ditto.
	* config/riscv/riscv-vector-builtins-bases.h: Ditto.
	* config/riscv/riscv-vector-builtins-functions.def
	(vfmacc_frm): New function definition.
	* config/riscv/riscv-vector-builtins.cc
	(function_expander::use_ternop_insn): Add frm operand support.
	* config/riscv/vector.md: Add vfmuladd to frm_mode.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/float-point-single-multiply-add.c: New test.
2023-08-04 09:41:24 +08:00
Pan Li
dd03fb9962 RISC-V: Support RVV VFWMUL rounding mode intrinsic API
This patch would like to support the rounding mode API for the
VFWMUL for the below samples.

* __riscv_vfwmul_vv_f64m2_rm
* __riscv_vfwmul_vv_f64m2_rm_m
* __riscv_vfwmul_vf_f64m2_rm
* __riscv_vfwmul_vf_f64m2_rm_m

Signed-off-by: Pan Li <pan2.li@intel.com>

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc
	(vfwmul_frm_obj): New declaration.
	(vfwmul_frm): Ditto.
	* config/riscv/riscv-vector-builtins-bases.h:
	(vfwmul_frm): Ditto.
	* config/riscv/riscv-vector-builtins-functions.def
	(vfwmul_frm): New function definition.
	* config/riscv/vector.md: (frm_mode) Add vfwmul to frm_mode.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/float-point-widening-mul.c: New test.
2023-08-04 09:40:57 +08:00
Pan Li
b7ab3938c6 RISC-V: Support RVV VFDIV and VFRDIV rounding mode intrinsic API
This patch would like to support the rounding mode API for the
VFDIV and VFRDIV for the samples below (a usage sketch follows the list).

* __riscv_vfdiv_vv_f32m1_rm
* __riscv_vfdiv_vv_f32m1_rm_m
* __riscv_vfdiv_vf_f32m1_rm
* __riscv_vfdiv_vf_f32m1_rm_m
* __riscv_vfrdiv_vf_f32m1_rm
* __riscv_vfrdiv_vf_f32m1_rm_m
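
A minimal usage sketch, assuming the _rm variants take the rounding mode
as an extra argument before vl as in the other frm intrinsics
(__RISCV_FRM_RNE being round-to-nearest-even):

    #include <riscv_vector.h>

    vfloat32m1_t
    div_rne (vfloat32m1_t a, vfloat32m1_t b, size_t vl)
    {
      /* Elementwise a / b under round-to-nearest-even.  */
      return __riscv_vfdiv_vv_f32m1_rm (a, b, __RISCV_FRM_RNE, vl);
    }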

Signed-off-by: Pan Li <pan2.li@intel.com>

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc
	(binop_frm): New declaration.
	(reverse_binop_frm): Likewise.
	(BASE): Likewise.
	* config/riscv/riscv-vector-builtins-bases.h:
	(vfdiv_frm): New extern declaration.
	(vfrdiv_frm): Likewise.
	* config/riscv/riscv-vector-builtins-functions.def
	(vfdiv_frm): New function definition.
	(vfrdiv_frm): Likewise.
	* config/riscv/vector.md: Add vfdiv to frm_mode.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/float-point-single-div.c: New test.
	* gcc.target/riscv/rvv/base/float-point-single-rdiv.c: New test.
2023-08-04 09:38:19 +08:00
GCC Administrator
86fa443330 Daily bump. 2023-08-04 00:17:17 +00:00
Jan Hubicka
4a0633d4d4 Print entry count in print_loop_info
gcc/ChangeLog:

	* tree-cfg.cc (print_loop_info): Print entry count.
2023-08-03 22:49:22 +02:00
Jan Hubicka
d6ac3aae2a Update loop iteration estimates after splitting
Hmmer's internal function has 4 loops.  The following is the profile at start:

  loop 1:
  estimate 472
  iterations by profile: 473.497707 (reliable) count in:84821 (precise, freq 0.9979)

    loop 2:
    estimate 99
    iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104)

    loop 3:
    estimate 99
    iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104)

  loop 4:
  estimate 100
  iterations by profile: 100.999596 (reliable) execution count:84167 (precise, freq 0.9902)

So the first loop is the outer loop and the second/third loops are nested.
The fourth loop is not critical.
Precise iteration counts are unknown (473 and 100 come from the profile).
Nested loop has following form:

    for (k = 1; k <= M; k++) {
      mc[k] = mpp[k-1]   + tpmm[k-1];
      if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;
      mc[k] += ms[k];
      if (mc[k] < -INFTY) mc[k] = -INFTY;

      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -INFTY) dc[k] = -INFTY;

      if (k < M) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
      }
    }

We do quite some belly dancing here.
 1) loop-ch slightly misupdates the profile, so the estimate of 99
    does not match the profile estimate of 100.
 2) loop-split splits on if (k < M) and produces two loops.
    It fails to notice that the second loop never iterates.
    It used to misupdate the profile a lot, which later caused the
    internal loop to become cold.  This is fixed now.
 3) loop-dist introduces runtime aliasing checks for both loops
 4) the tree vectorizer vectorizes some of the loop copies produced
    and drops expected iteration counts
 5) loop peeling peels the loops with expected low iteration counts
 6) complete loop unrolling kills some loops in prologues/epilogues.

We end up with quite a few loops and run out of registers:

  iterations by profile: 5.312499 (unreliable, maybe flat)
    this is vectorized internal loops after loop peeling

  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
    Those are all versioned/peeled and vectorized variants of the loop never looping

  iterations by profile: 100.000008 (unreliable)
  iterations by profile: 100.000000 (unreliable)
    Those are variants with failed aliasing checks

  iterations by profile: 9.662853 (unreliable, maybe flat)
  iterations by profile: 4.646072 (unreliable)
  iterations by profile: 100.000007 (unreliable)
  iterations by profile: 5.312500 (unreliable)
  iterations by profile: 473.497707 (reliable)
    This is loop 1

  iterations by profile: 100.999596 (reliable)
    This is the loop 4.

This patch fixes loop iteration estimate update after loop split so we get:

  iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 (guessed, freq 149.9081)
    This is the remainder of the peeled vectorized loop 2.  It misses an estimate,
    which is correct since after peeling it 6 times it is essentially
    impossible to tell what the remaining loop profile is (without histograms).

  iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374801 (guessed, freq 4.4094)
    Peeled split part of loop 2 (one that never loops).  We ought to work this out
    but at least w

  estimate 99
  iterations by profile: 100.000008 (unreliable) entry count:3945039 (guessed, freq 46.4122)
  estimate 99
  iterations by profile: 100.000000 (unreliable) entry count:35505353 (guessed, freq 417.7100)

  estimate 99
  iterations by profile: 9.662853 (unreliable, maybe flat) entry count:35505353 (guessed, freq 417.7100)
    Profile here mismatches estimate - I will need to work out why.

  estimate 5
  iterations by profile: 4.646072 (unreliable) entry count:31954818 (guessed, freq 375.9390)
    This is vectorized but not peeled loop 3
  estimate 99
  iterations by profile: 100.000007 (unreliable) entry count:7101070 (guessed, freq 83.5420)
    Unvectorized variant of loop 3
  estimate 5
  iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512)
    Another vectorized variant of loop 3
  estimate 472
  iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979)
    Outer loop

  estimate 100
  iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902)
    loop 4, not vectorized/peeled

So there is still work to do on this testcase, but with the patch we prevent 3 useless loops.

Bootstrapped/regtested x86_64-linux, plan to commit it later today.

gcc/ChangeLog:

	* tree-ssa-loop-split.cc (split_loop): Update estimated iteration counts.
2023-08-03 22:47:55 +02:00
Jan Hubicka
93236ad9e8 Fix profiledbootstrap
Profiledbootstrap fails with an ICE in update_loop_exit_probability_scale_dom_bbs
called from loop unrolling.
The reason is that, in relatively rare situations, we may run into a case where
a loop has multiple exits and all are considered likely, but then we scale down
the profile and one of the exits becomes unlikely.

We pass around unadjusted_exit_count to scale the exit probability correctly.  In
this case we may end up using an uninitialized value, and the profile-count type
intentionally bombs on that.

gcc/ChangeLog:

	PR bootstrap/110857
	* cfgloopmanip.cc (scale_loop_profile): (Un)initialize
	unadjusted_exit_count.
2023-08-03 22:42:27 +02:00
Aldy Hernandez
c83528d236 Read global value/mask in IPA.
Instead of reading the known zero bits in IPA, read the value/mask
pair which is available.

There is a slight change of behavior here.  I have removed the check
for SSA_NAME, as the ranger can calculate the range and value/mask for
INTEGER_CST.  This simplifies the code a bit, since there's no special
casing when setting the jfunc bits.  The default range for VR is
undefined, so I think it's safe just to check for undefined_p().

gcc/ChangeLog:

	* ipa-prop.cc (ipa_compute_jump_functions_for_edge): Read global
	value/mask.

gcc/testsuite/ChangeLog:

	* g++.dg/ipa/pure-const-3.C: Move source to...
	* g++.dg/ipa/pure-const-3.h: ...here, and adjust original test
	accordingly.
	* g++.dg/ipa/pure-const-3b.C: New.
2023-08-03 22:31:34 +02:00
Xiao Zeng
9e3fd33295 [PATCH 3/5] [RISC-V] Generate Zicond instruction for select pattern with condition eq or neq to 0
[ This is a partial commit.  So not all the cases mentioned by
  Xiao are currently handled. ]

This patch recognizes Zicond patterns for select patterns whose
condition is eq or neq to 0 (using eq as an example), namely the
following (a C sketch of one case follows the list):

1 rd = (rs2 == 0) ? non-imm : 0
2 rd = (rs2 == 0) ? non-imm : non-imm
3 rd = (rs2 == 0) ? reg : non-imm
4 rd = (rs2 == 0) ? reg : reg
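
For example, case 4 in C (a sketch; the exact sequence depends on the
operands):

    long
    sel (long rs2, long a, long b)
    {
      /* With Zicond this can expand branchlessly, e.g. via a
         czero.eqz/czero.nez pair, instead of a conditional branch.  */
      return rs2 == 0 ? a : b;
    }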

gcc/ChangeLog:

	* config/riscv/riscv.cc (riscv_expand_conditional_move): Recognize
	various Zicond patterns.
	* config/riscv/riscv.md (mov<mode>cc): Allow TARGET_ZICOND.  Use
	sfb_alu_operand for both arms of the conditional move.

	Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
2023-08-03 16:14:02 -04:00
Cupertino Miranda
c2a447d840 bpf: CO-RE builtins support tests.
This patch adds tests for the following builtins:
  __builtin_preserve_enum_value
  __builtin_btf_type_id
  __builtin_preserve_type_info

gcc/testsuite/ChangeLog:

	* gcc.target/bpf/core-builtin-enumvalue.c: New test.
	* gcc.target/bpf/core-builtin-enumvalue-errors.c: New test.
	* gcc.target/bpf/core-builtin-enumvalue-opt.c: New test.
	* gcc.target/bpf/core-builtin-fieldinfo-const-elimination.c: New test.
	* gcc.target/bpf/core-builtin-fieldinfo-errors-1.c: Changed.
	* gcc.target/bpf/core-builtin-fieldinfo-errors-2.c: Changed.
	* gcc.target/bpf/core-builtin-type-based.c: New test.
	* gcc.target/bpf/core-builtin-type-id.c: New test.
	* gcc.target/bpf/core-support.h: New test.
2023-08-03 19:47:26 +01:00
Cupertino Miranda
e0a81559c1 bpf: Implementation of BPF CO-RE builtins
This patch updates the support for the BPF CO-RE builtins
__builtin_preserve_access_index and __builtin_preserve_field_info,
and adds support for the CO-RE builtins __builtin_btf_type_id,
__builtin_preserve_type_info and __builtin_preserve_enum_value.

These CO-RE relocations are now converted to __builtin_core_reloc which
abstracts all of the original builtins in a polymorphic relocation
specific builtin.

The builtin processing is now split into 2 stages: the first (pack) is
executed right after the front-end and the second (process) right before
the asm output.

In expand pass the __builtin_core_reloc is converted to a
unspec:UNSPEC_CORE_RELOC rtx entry.

The data required to process the builtin is now collected in the packing
stage (after front-end), not allowing the compiler to optimize any of
the relevant information required to compose the relocation when
necessary.
At expansion, that information is recovered and CTF/BTF is queried to
construct the information that will be used in the relocation.
At this point the relocation is added to specific section and the
builtin is expanded to the expected default value for the builtin.

In order to process __builtin_preserve_enum_value, it was necessary to
hook the front-end to collect the original enum value reference.
This is needed since the parser folds all the enum values to their
integer_cst representation.

More details can be found in core-builtins.cc.

Regtested in host x86_64-linux-gnu and target bpf-unknown-none.

gcc/ChangeLog:

	PR target/107844
	PR target/107479
	PR target/107480
	PR target/107481
	* config.gcc: Added core-builtins.cc and .o files.
	* config/bpf/bpf-passes.def: Removed file.
	* config/bpf/bpf-protos.h (bpf_add_core_reloc,
	bpf_replace_core_move_operands): New prototypes.
	* config/bpf/bpf.cc (enum bpf_builtins, is_attr_preserve_access,
	maybe_make_core_relo, bpf_core_field_info, bpf_core_compute,
	bpf_core_get_index, bpf_core_new_decl, bpf_core_walk,
	bpf_is_valid_preserve_field_info_arg, is_attr_preserve_access,
	handle_attr_preserve, pass_data_bpf_core_attr, pass_bpf_core_attr):
	Removed.
	(def_builtin, bpf_expand_builtin, bpf_resolve_overloaded_builtin): Changed.
	* config/bpf/bpf.md (define_expand mov<MM:mode>): Changed.
	(mov_reloc_core<mode>): Added.
	* config/bpf/core-builtins.cc (struct cr_builtin, enum
	cr_decision, struct cr_local, struct cr_final, struct
	core_builtin_helpers, enum bpf_plugin_states): Added types.
	(builtins_data, core_builtin_helpers, core_builtin_type_defs):
	Added variables.
	(allocate_builtin_data, get_builtin_data, search_builtin_data,
	remove_parser_plugin, compare_same_kind, compare_same_ptr_expr,
	compare_same_ptr_type, is_attr_preserve_access, core_field_info,
	bpf_core_get_index, compute_field_expr,
	pack_field_expr_for_access_index, pack_field_expr_for_preserve_field,
	process_field_expr, pack_enum_value, process_enum_value, pack_type,
	process_type, bpf_require_core_support, make_core_relo, read_kind,
	kind_access_index, kind_preserve_field_info, kind_enum_value,
	kind_type_id, kind_preserve_type_info, get_core_builtin_fndecl_for_type,
	bpf_handle_plugin_finish_type, bpf_init_core_builtins,
	construct_builtin_core_reloc, bpf_resolve_overloaded_core_builtin,
	bpf_expand_core_builtin, bpf_add_core_reloc,
	bpf_replace_core_move_operands): Added functions.
	* config/bpf/core-builtins.h (enum bpf_builtins): Added.
	(bpf_init_core_builtins, bpf_expand_core_builtin,
	bpf_resolve_overloaded_core_builtin): Added functions.
	* config/bpf/coreout.cc (struct bpf_core_extra): Added.
	(bpf_core_reloc_add, output_asm_btfext_core_reloc): Changed.
	* config/bpf/coreout.h (bpf_core_reloc_add): Changed prototype.
	* config/bpf/t-bpf: Added core-builtins.o.
	* doc/extend.texi: Added documentation for new BPF builtins.
2023-08-03 19:46:44 +01:00
Andrew MacLeod
9fedc3c010 Add operand ranges to op1_op2_relation API.
With additional floating point relations in the pipeline, we can no
longer tell based on the LHS what the relation of X < Y is without knowing
the type of X and Y.

	* gimple-range-fold.cc (fold_using_range::range_of_range_op): Add
	ranges to the call to relation_fold_and_or.
	(fold_using_range::relation_fold_and_or): Add op1 and op2 ranges.
	(fur_source::register_outgoing_edges): Add op1 and op2 ranges.
	* gimple-range-fold.h (relation_fold_and_or): Adjust params.
	* gimple-range-gori.cc (gori_compute::compute_operand_range): Add
	a varying op1 and op2 to call.
	* range-op-float.cc (range_operator::op1_op2_relation): New defaults.
	(operator_equal::op1_op2_relation): New float version.
	(operator_not_equal::op1_op2_relation): Ditto.
	(operator_lt::op1_op2_relation): Ditto.
	(operator_le::op1_op2_relation): Ditto.
	(operator_gt::op1_op2_relation): Ditto.
	(operator_ge::op1_op2_relation): Ditto.
	* range-op-mixed.h (operator_equal::op1_op2_relation): New float
	prototype.
	(operator_not_equal::op1_op2_relation): Ditto.
	(operator_lt::op1_op2_relation): Ditto.
	(operator_le::op1_op2_relation): Ditto.
	(operator_gt::op1_op2_relation): Ditto.
	(operator_ge::op1_op2_relation): Ditto.
	* range-op.cc (range_op_handler::op1_op2_relation): Dispatch new
	variations.
	(range_operator::op1_op2_relation): Add extra params.
	(operator_equal::op1_op2_relation): Ditto.
	(operator_not_equal::op1_op2_relation): Ditto.
	(operator_lt::op1_op2_relation): Ditto.
	(operator_le::op1_op2_relation): Ditto.
	(operator_gt::op1_op2_relation): Ditto.
	(operator_ge::op1_op2_relation): Ditto.
	* range-op.h (range_operator): New prototypes.
	(range_op_handler): Ditto.
2023-08-03 14:19:54 -04:00
Andrew MacLeod
33f080a7f1 Provide a routine for NAME == NAME relation.
We've been assuming x == x is VREL_EQ in GORI, but this is not always going to
be true with floating point.  Provide an API to return the relation.
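
The counterexample is NaN (sketch):

    int
    self_eq (double x)
    {
      return x == x;   /* 0 when x is NaN, so x == x is not
                          unconditionally VREL_EQ for floats.  */
    }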

	* gimple-range-gori.cc (gori_compute::compute_operand1_range):
	Use identity relation.
	(gori_compute::compute_operand2_range): Ditto.
	* value-relation.cc (get_identity_relation): New.
	* value-relation.h (get_identity_relation): New prototype.
2023-08-03 14:19:54 -04:00
Andrew MacLeod
c47ceea551 Automatically set type in certain Value_Range routines.
Set routines which take a type shouldn't have to pre-set the type of the
underlying range as it is specified as a parameter already.

	* value-range.h (Value_Range::set_varying): Set the type.
	(Value_Range::set_zero): Ditto.
	(Value_Range::set_nonzero): Ditto.
2023-08-03 14:19:54 -04:00
Jeff Law
d61efa3cd3 [committed][RISC-V] Remove errant hunk of code
I'm using this hunk locally to more thoroughly exercise the zicond paths
due to inaccuracies elsewhere in the costing model.  It was never
supposed to be part of the costing commit though.  And as we've seen
it's causing problems with the vector bits.

While my testing isn't complete, this hunk was never supposed to be
pushed and it's causing problems.  So I'm just ripping it out.

There's a bigger TODO in this space WRT a top-to-bottom evaluation of
the costing on RISC-V.  I'm still formulating what that evaluation is
going to look like, so don't hold your breath waiting on it.

Pushed to the trunk.

gcc/

	* config/riscv/riscv.cc (riscv_rtx_costs): Remove errant hunk from
	recent commit.
2023-08-03 10:57:23 -04:00
David Malcolm
f80efa49b7 testsuite, analyzer: add test case [PR108171]
The ICE in PR analyzer/108171 appears to be a dup of the recently fixed
PR analyzer/110882 and is likewise fixed by it; adding this test case.

gcc/testsuite/ChangeLog:
	PR analyzer/108171
	* gcc.dg/analyzer/pr108171.c: New test.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
2023-08-03 10:47:22 -04:00
Pan Li
93fd44fde6 RISC-V: Fix one comment for binop_frm insn
The previous patch missed the vfsub comment for binop_frm; this
patch fixes that.

Signed-off-by: Pan Li <pan2.li@intel.com>

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc: Add vfsub.
2023-08-03 22:40:32 +08:00
David Malcolm
c62f93d1e0 analyzer: fix ICE on zero-sized arrays [PR110882]
gcc/analyzer/ChangeLog:
	PR analyzer/110882
	* region.cc (int_size_in_bits): Fail on zero-sized types.

gcc/testsuite/ChangeLog:
	PR analyzer/110882
	* gcc.dg/analyzer/pr110882.c: New test.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
2023-08-03 09:47:44 -04:00
Richard Biener
a9b6043983 [libbacktrace] fix up broken test
zstdtest has some inline data where some testcases lack the
uncompressed length field.  Thus it computes that length itself, but
still ends up allocating memory for the uncompressed buffer based on
the original (zero) length.  Oops.  Causes memory corruption if the
allocator returns non-NULL.

libbacktrace/
	* zstdtest.c (test_samples): Properly compute the allocation
	size for the uncompressed data.
2023-08-03 15:26:04 +02:00
Richard Sandiford
9524718654 poly_int: Handle more can_div_trunc_p cases
can_div_trunc_p (a, b, &Q, &r) tries to compute a Q and r that
satisfy the usual conditions for truncating division:

     (1) a = b * Q + r
     (2) |b * Q| <= |a|
     (3) |r| < |b|

We can compute Q using the constant component (the case when
all indeterminates are zero).  Since |r| < |b| for the constant
case, the requirements for indeterminate xi with coefficients
ai (for a) and bi (for b) are:

     (2') |bi * Q| <= |ai|
     (3') |ai - bi * Q| <= |bi|

(See the big comment for more details, restrictions, and reasoning).

However, the function works on abstract arithmetic types, and so
it has to be careful not to introduce new overflow.  The code
therefore only handled the extreme for (3'), that is:

     |ai - bi * Q| = |bi|

for the case where Q is zero.

Looking at it again, the overflow issue is a bit easier to handle than
I'd originally thought (or so I hope).  This patch therefore extends the
code to handle |ai - bi * Q| = |bi| for all Q, with Q = 0 no longer
being a separate case.

The net effect is to allow the function to succeed for things like:

     (a0 + b1 (Q+1) x) / (b0 + b1 x)

where Q = a0 / b0, with various sign conditions.  E.g. we now handle:

     (7 + 8x) / (4 + 4x)

with Q = 1 and r = 3 + 4x.
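
As a quick check of the conditions above: b * Q + r
= (4 + 4x) * 1 + (3 + 4x) = 7 + 8x = a, and |r| = 3 + 4x < 4 + 4x = |b|
for all x >= 0.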

gcc/
	* poly-int.h (can_div_trunc_p): Succeed for more boundary conditions.

gcc/testsuite/
	* gcc.dg/plugin/poly-int-tests.h (test_can_div_trunc_p_const)
	(test_can_div_trunc_p_const): Add more tests.
2023-08-03 13:54:11 +01:00
Richard Biener
29370f1387 tree-optimization/110838 - vectorization of widened shifts
The following makes sure to limit the shift operand when vectorizing
(short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
operand otherwise invokes undefined behavior.  When we determine
whether we can demote the operand we know we at most shift in the
sign bit so we can adjust the shift amount.

Note this has the possibility of un-CSEing common shift operands
as there's no good way to share pattern stmts between patterns.
We'd have to separately pattern recognize the definition.
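
A sketch of the situation in C (loop and names are illustrative):

    void
    f (short *restrict r, short *restrict x, int n)
    {
      for (int i = 0; i < n; i++)
        /* Only the sign bit is shifted in, so after demoting the
           operand to short the shift amount must be clamped to 15;
           the original 31 would be out of bounds for short.  */
        r[i] = (short) ((int) x[i] >> 31);
    }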

	PR tree-optimization/110838
	* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
	Adjust the shift operand of RSHIFT_EXPRs.

	* gcc.dg/torture/pr110838.c: New testcase.
2023-08-03 14:52:11 +02:00
Richard Biener
13dfb01e5c tree-optimization/110702 - avoid zero-based memory references in IVOPTs
Sometimes IVOPTs chooses a weird induction variable which downstream
leads to issues.  Most of the time we can fend those off during costing
by rejecting the candidate but it looks like the address description
costing synthesizes is different from what we end up generating so
the following fixes things up at code generation time.  Specifically
we avoid the create_mem_ref_raw fallback which uses a literal zero
address base with the actual base in index2.  For the case in question
we have the address

  type = unsigned long
  offset = 0
  elements = {
    [0] = &e * -3,
    [1] = (sizetype) a.9_30 * 232,
    [2] = ivtmp.28_44 * 4
  }

from which we code generate the problematical

  _3 = MEM[(long int *)0B + ivtmp.36_9 + ivtmp.28_44 * 4];

which references the object at address zero.  The patch below
recognizes the fallback after the fact and transforms the
TARGET_MEM_REF memory reference into a LEA for which this form
isn't problematic:

  _24 = &MEM[(long int *)0B + ivtmp.36_34 + ivtmp.28_44 * 4];
  _3 = *_24;

hereby avoiding the correctness issue.  We'd later conclude the
program terminates at the null pointer dereference and make the
function pure, miscompiling the main function of the testcase.

	PR tree-optimization/110702
	* tree-ssa-loop-ivopts.cc (rewrite_use_address): When
	we created a NULL pointer based access rewrite that to
	a LEA.

	* gcc.dg/torture/pr110702.c: New testcase.
2023-08-03 14:22:03 +02:00
Sheri Bernstein
4cd4d2733c ada: Add pragma Annotate for GNATcheck exemptions
Exempt the GNATcheck rule "Improper_Returns" with the rationale
"early returns for performance".

gcc/ada/

	* libgnat/s-aridou.adb: Add pragma to exempt Improper_Returns.
	* libgnat/s-atopri.adb (Lock_Free_Try_Write): Likewise.
	* libgnat/s-bitops.adb (Bit_Eq): Likewise.
	* libgnat/s-carsi8.adb: Likewise.
	* libgnat/s-carun8.adb: Likewise.
	* libgnat/s-casi16.adb: Likewise.
	* libgnat/s-casi32.adb: Likewise.
	* libgnat/s-casi64.adb: Likewise.
	* libgnat/s-caun16.adb: Likewise.
	* libgnat/s-caun32.adb: Likewise.
	* libgnat/s-caun64.adb: Likewise.
	* libgnat/s-exponn.adb: Likewise.
	* libgnat/s-expont.adb: Likewise.
	* libgnat/s-valspe.adb: Likewise.
	* libgnat/s-vauspe.adb: Likewise.
2023-08-03 14:07:36 +02:00
Vasiliy Fofanov
65a31e22a8 ada: Rewrite Set_Image_*_Unsigned routines to remove recursion.
This rewriting removes algorithm inefficiencies due to unnecessary
recursion and copying. The new version has much smaller and statically known
stack requirements and is additionally up to 2x faster.

gcc/ada/

	* libgnat/s-imageb.adb (Set_Image_Based_Unsigned): Rewritten.
	* libgnat/s-imagew.adb (Set_Image_Width_Unsigned): Likewise.
2023-08-03 14:07:36 +02:00
Eric Botcazou
3b21dae599 ada: Fix spurious error on 'Input of private type with Type_Invariant aspect
The problem is that it is necessary to break the privacy during the
expansion of the Input attribute, which may introduce a view mismatch
with the parameter of the routine checking the invariant of the type.

gcc/ada/

	* exp_util.adb (Make_Invariant_Call): Convert the expression to
	the type of the formal parameter if need be.
2023-08-03 14:07:36 +02:00
Eric Botcazou
5825635336 ada: Adjust again address arithmetics in System.Dwarf_Lines
Using the operator of System.Storage_Elements has introduced a range check
that may be tripped on, so this removes the intermediate conversion to the
Storage_Count subtype that is responsible for it.

gcc/ada/

	* libgnat/s-dwalin.adb ("-"): New subtraction operator.
	(Enable_Cache): Use it to compute the offset.
	(Symbolic_Address): Likewise.
2023-08-03 14:07:36 +02:00
Richard Biener
46c8c22545 Improve sinking with unrelated defs
statement_sink_location for loads is currently confused about
stores that are not on the paths we are sinking across.  The
following replaces the logic that tries to ensure we are not
sinking across stores: instead of walking all immediate virtual
uses and then checking whether the found stores are on the paths
we sink through, we check the live virtual operand at the
sinking location.  To obtain the live virtual operand we rely
on the new virtual_operand_live class which provides an overall
cheaper and also more precise way to check the constraints.
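
An illustrative C sketch (my example, not the testcase):

    int
    f (int *p, int *q, int flag)
    {
      int tmp = *p;   /* Sinkable to the final return: the store to *q
                         is not on the path the load sinks across, and
                         the live virtual operand at the sink location
                         now establishes that directly.  */
      if (flag)
        {
          *q = 1;
          return 0;
        }
      return tmp;
    }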

	* tree-ssa-sink.cc: Include tree-ssa-live.h.
	(pass_sink_code::execute): Instantiate virtual_operand_live
	and pass it down.
	(sink_code_in_bb): Pass down virtual_operand_live.
	(statement_sink_location): Get virtual_operand_live and
	verify we are not sinking loads across stores by looking up
	the live virtual operand at the sink location.

	* gcc.dg/tree-ssa/ssa-sink-20.c: New testcase.
2023-08-03 13:21:32 +02:00
Richard Biener
021a0cd449 Add virtual operand global liveness computation class
The following adds an on-demand global liveness computation class
computing and caching the live-out virtual operand of basic blocks
and answering live-out, live-in and live-on-edge queries.  The flow
is optimized for the intended use in code sinking which will query
live-in and possibly can be optimized further when the originating
query is for live-out.

The code relies on up-to-date immediate dominator information and
on an unchanging virtual operand state.

	* tree-ssa-live.h (class virtual_operand_live): New.
	* tree-ssa-live.cc (virtual_operand_live::init): New.
	(virtual_operand_live::get_live_in): Likewise.
	(virtual_operand_live::get_live_out): Likewise.
2023-08-03 13:21:32 +02:00
Richard Biener
3d48c11ad0 Swap loop splitting and final value replacement
The following swaps the loop splitting pass and the final value
replacement pass to avoid keeping the IV of the earlier loop
live when not necessary.  The existing gcc.target/i386/pr87007-5.c
testcase shows that we otherwise fail to elide an empty loop
later.  I don't see any good reason why loop splitting would need
final value replacement, all exit values honor the constraints
we place on loop header PHIs automatically.

	* passes.def: Exchange loop splitting and final value
	replacement passes.

	* gcc.target/i386/pr87007-5.c: Make sure we split the loop
	and eliminate both in the end.
2023-08-03 13:20:00 +02:00
Stefan Schulze Frielinghaus
fab08d12b4 s390: Try to emit vlbr/vstbr instead of vperm et al.
gcc/ChangeLog:

	* config/s390/s390.cc (expand_perm_as_a_vlbr_vstbr_candidate):
	New function which handles bswap patterns for vec_perm_const.
	(vectorize_vec_perm_const_1): Call new function.
	* config/s390/vector.md (*bswap<mode>): Fix operands in output
	template.
	(*vstbr<mode>): New insn.

gcc/testsuite/ChangeLog:

	* gcc.target/s390/s390.exp: Add subdirectory vxe2.
	* gcc.target/s390/vxe2/vlbr-1.c: New test.
	* gcc.target/s390/vxe2/vstbr-1.c: New test.
	* gcc.target/s390/vxe2/vstbr-2.c: New test.
2023-08-03 10:30:08 +02:00
Stefan Schulze Frielinghaus
8ab12576bc s390: Enable vect_bswap test cases
This enables the following tests, which rely on the vperm instruction
that has been available since z13 with the initial vector support.

testsuite/gcc.dg/vect/vect-bswap16.c
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_bswap || sse4_runtime } } } } */

testsuite/gcc.dg/vect/vect-bswap32.c
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_bswap || sse4_runtime } } } } */

testsuite/gcc.dg/vect/vect-bswap64.c
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_bswap || sse4_runtime } } } } */

gcc/testsuite/ChangeLog:

	* lib/target-supports.exp (check_effective_target_vect_bswap):
	Add s390.
2023-08-03 10:29:51 +02:00
Alexandre Oliva
b6f4b00011 Introduce -msmp to select /lib_smp/ on ppc-vx6
The .spec files used for linking on ppc-vx6, when the rtp-smp runtime
is selected, add -L flags for /lib_smp/ and /lib/.

There was a problem, though: although /lib_smp/ and /lib/ were to be
searched in this order, and the specs files do that correctly, the
compiler would search /lib/ first regardless, because
STARTFILE_PREFIX_SPEC said so, and specs files cannot override that.

With this patch, we arrange for the presence of -msmp to affect
STARTFILE_PREFIX_SPEC, so that the compiler searches /lib_smp/ rather
than /lib/ for crt files.  A separate patch for GNAT ensures that when
the rtp-smp runtime is selected, -msmp is passed to the compiler
driver for linking, along with the --specs flags.

for  gcc/ChangeLog

	* config/vxworks-smp.opt: New.  Introduce -msmp.
	* config.gcc: Enable it on powerpc* vxworks prior to 7r*.
	* config/rs6000/vxworks.h (STARTFILE_PREFIX_SPEC): Choose
	lib_smp when -msmp is present in the command line.
	* doc/invoke.texi: Document it.
2023-08-03 03:34:31 -03:00
Yanzhang Wang
39663298b5 RISCV: Add -m(no)-omit-leaf-frame-pointer support.
gcc/ChangeLog:

	* config/riscv/riscv.cc (riscv_save_reg_p): Save ra for leaf
	functions when -mno-omit-leaf-frame-pointer is enabled.
	(riscv_option_override): Override omit-frame-pointer.
	(riscv_frame_pointer_required): Save s0 for non-leaf functions.
	(TARGET_FRAME_POINTER_REQUIRED): Override definition.
	* config/riscv/riscv.opt: Add option support.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/omit-frame-pointer-1.c: New test.
	* gcc.target/riscv/omit-frame-pointer-2.c: New test.
	* gcc.target/riscv/omit-frame-pointer-3.c: New test.
	* gcc.target/riscv/omit-frame-pointer-4.c: New test.
	* gcc.target/riscv/omit-frame-pointer-test.c: New test.

Signed-off-by: Yanzhang Wang <yanzhang.wang@intel.com>
2023-08-03 14:20:54 +08:00
Roger Sayle
790c1f60a5 PR target/110792: Early clobber issues with rot32di2_doubleword on i386.
This patch is a conservative fix for PR target/110792, a wrong-code
regression affecting doubleword rotations by BITS_PER_WORD, which
effectively swaps the highpart and lowpart words, when the source to be
rotated resides in memory. The issue is that if the register used to
hold the lowpart of the destination is mentioned in the address of
the memory operand, the current define_insn_and_split unintentionally
clobbers it before reading the highpart.

Hence, for the testcase, the incorrectly generated code looks like:

        salq    $4, %rdi		// calculate address
        movq    WHIRL_S+8(%rdi), %rdi	// accidentally clobber addr
        movq    WHIRL_S(%rdi), %rbp	// load (wrong) lowpart

Traditionally, the textbook way to fix this would be to add an
explicit early clobber to the instruction's constraints.

 (define_insn_and_split "<insn>32di2_doubleword"
- [(set (match_operand:DI 0 "register_operand" "=r,r,r")
+ [(set (match_operand:DI 0 "register_operand" "=r,r,&r")
        (any_rotate:DI (match_operand:DI 1 "nonimmediate_operand" "0,r,o")
                       (const_int 32)))]

but unfortunately this currently generates significantly worse code,
due to a strange choice of reloads (effectively memcpy), which ends up
looking like:

        salq    $4, %rdi		// calculate address
        movdqa  WHIRL_S(%rdi), %xmm0	// load the double word in SSE reg.
        movaps  %xmm0, -16(%rsp)	// store the SSE reg back to the stack
        movq    -8(%rsp), %rdi		// load highpart
        movq    -16(%rsp), %rbp		// load lowpart

Note that reload's "&" doesn't distinguish between the memory being
early clobbered, vs the registers used in an addressing mode being
early clobbered.

The fix proposed in this patch is to remove the third alternative, that
allowed offsetable memory as an operand, forcing reload to place the
operand into a register before the rotation.  This results in:

        salq    $4, %rdi
        movq    WHIRL_S(%rdi), %rax
        movq    WHIRL_S+8(%rdi), %rdi
        movq    %rax, %rbp

I believe there's a more advanced solution, by swapping the order of
the loads (if first destination register is mentioned in the address),
or inserting a lea insn (if both destination registers are mentioned
in the address), but this fix is a minimal "safe" solution, that
should hopefully be suitable for backporting.

2023-08-03  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	PR target/110792
	* config/i386/i386.md (<any_rotate>ti3): For rotations by 64 bits
	place operand in a register before gen_<insn>64ti2_doubleword.
	(<any_rotate>di3): Likewise, for rotations by 32 bits, place
	operand in a register before gen_<insn>32di2_doubleword.
	(<any_rotate>32di2_doubleword): Constrain operand to be in register.
	(<any_rotate>64ti2_doubleword): Likewise.

gcc/testsuite/ChangeLog
	PR target/110792
	* g++.target/i386/pr110792.C: New 32-bit C++ test case.
	* gcc.target/i386/pr110792.c: New 64-bit C test case.
2023-08-03 07:12:04 +01:00