When determining issue rates we currently discount non-constant MLA
accumulators for Advanced SIMD, but we don't apply the same discount to the
latency.
This means the costs for Advanced SIMD with a constant accumulator are wrong,
which results in us costing SVE and Advanced SIMD the same. This can cause us
to vectorize with Advanced SIMD instead of SVE in some cases.
This patch adds the same discount for SVE and scalar code as we do for the
issue rate.
This gives a 5% improvement in fotonik3d_r in SPEC CPU 2017 on large
Neoverse cores.
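As an illustration (my own sketch, not from the patch), the affected
shape is a reduction whose multiply-accumulate chain starts from a
constant accumulator:

  int
  dot (int *a, int *b, int n)
  {
    int sum = 0;             /* constant accumulator feeding the MLA chain */
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];    /* vectorizes to multiply-accumulate (MLA) */
    return sum;
  }

Previously Advanced SIMD and SVE were costed the same for such loops,
sometimes making us pick Advanced SIMD over SVE.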
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_multiply_add_p): Update handling
of constants.
(aarch64_adjust_stmt_cost): Use it.
(aarch64_vector_costs::count_ops): Likewise.
(aarch64_vector_costs::add_stmt_cost): Pass vinfo to
aarch64_adjust_stmt_cost.
The following fixes a problem with my last attempt at avoiding
out-of-bound shift values for vectorized right shifts of widened
operands. Instead of truncating the shift amount with a bitwise AND
we actually need to saturate it to the target precision.
The following does that and adds test coverage for the constant
and invariant but variable case that would previously have failed.
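A sketch of the covered shape (my own, assuming 16-bit short and
32-bit int):

  void
  f (short *d, short *s, int n, int amount)
  {
    for (int i = 0; i < n; i++)
      d[i] = s[i] >> amount;   /* computed in int, demoted to short */
  }

If amount can be up to 31, the demoted 16-bit pattern must use
min (amount, 15) rather than amount & 15: masking would turn e.g. 20
into 4 and change the result, while saturating only ever shifts in
more copies of the sign bit.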
PR tree-optimization/110838
* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
Fix right-shift value sanitizing. Properly emit external
def mangling in the preheader rather than in the pattern
def sequence where it will fail vectorizing.
* gcc.dg/vect/pr110838.c: New testcase.
On some AArch64 bootstrapped builds, we were getting a flaky test
because the floating point operations in `get_time` were being fused
with the floating point operations in `timevar_accumulate`.
This meant that the rounding behaviour of our multiplication with
`ticks_to_msec` was different when used in `timer::start` and when
performed in `timer::stop`. These extra inaccuracies led to the
testcase `g++.dg/ext/timevar1.C` being flaky on some hardware.
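To illustrate the fusion at play (my own sketch, not from the
testcase): with contraction the multiply and add round once instead
of twice, so the two functions below can differ in the last ulp:

  #include <math.h>

  double
  separate (double a, double b, double c)
  {
    double p = a * b;        /* rounded here... */
    return p + c;            /* ...and again here */
  }

  double
  contracted (double a, double b, double c)
  {
    return fma (a, b, c);    /* a * b + c with a single rounding */
  }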
------------------------------
We could avoid the inlining, but that was agreed to be undesirable. Three
alternative approaches:
1) Use `-ffp-contract=on` to avoid this particular optimisation.
2) Adjust the code so that the "tolerance" is always of the order of
a "tick".
3) Record times and elapsed differences in integral values.
- Could be in terms of a standard measurement (e.g. nanoseconds or
microseconds).
- Could be in terms of whatever integral value ("ticks" /
seconds & microseconds / "clock ticks") is returned from the syscall
chosen at configure time.
While `-ffp-contract=on` removes the problem that I bumped into, there
has been a similar bug on x86 caused by a different floating
point problem that also appears after `get_time` and
`timevar_accumulate` have been inlined into the same function. Hence
it seems worth choosing a different approach.
Of the two other solutions, recording measurements in integral values
seems the most robust against slightly "off" measurements being
presented to the user -- even though the tolerance-based approach could
also avoid the ICE that creates the flaky test.
I considered storing time in whatever units our syscall returns and
normalising them at the time we print out rather than normalising them
to nanoseconds at the point we record our "current time". The logic
being that normalisation could have some rounding effect (e.g. if
TICKS_PER_SECOND is 3) that would be taken into account in calculations.
I decided against it in order to give the values recorded in
`timevar_time_def` some interpretive value, so it's easier to read the
code. Compared to the small rounding error, which would represent a tiny
amount of time and AIUI cannot trigger the same kind of ICEs as we are
attempting to fix, said interpretive value seems more valuable.
Recording time in microseconds seemed reasonable since all obvious
values for ticks and `getrusage` are at microsecond granularity or less
precise. That said, since TICKS_PER_SECOND and CLOCKS_PER_SEC are both
values given to us by the host system, I was not sure enough of that to
base the decision on it.
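For reference, a sketch of the conversion this implies (the macro
names come from the ChangeLog below; the definitions and the stand-in
TICKS_PER_SECOND value are my assumptions):

  #include <stdint.h>

  #define TICKS_PER_SECOND 100       /* stand-in; really host-provided */
  #define NANOSEC_PER_SEC 1000000000
  #define TICKS_TO_NANOSEC (NANOSEC_PER_SEC / TICKS_PER_SECOND)

  /* Convert a raw tick count to the uint64_t nanosecond count now
     stored in timevar_time_def.  */
  static inline uint64_t
  ticks_to_ns (uint64_t ticks)
  {
    return ticks * TICKS_TO_NANOSEC;
  }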
------------------------------
timer::all_zero is ignoring rows which are inconsequential to the user
and would be printed out as all zeros. Since upon printing rows we
convert to the same double value and print out the same precision as
before, we return true/false based on the same amount of time as before.
timer::print_row casts to a floating point measurement in units of
seconds as was printed out before.
timer::validate_phases -- I'm printing out nanoseconds here rather than
floating point seconds since this is an error message for when things
have "gone wrong". Printing out the actual nanoseconds that have been
recorded seems like the best approach.
N.b. since we now print out nanoseconds instead of a floating point value,
the padding requirements are different. Originally we were padding to
24 characters and printing 18 decimal places. This looked odd with the
now visually smaller values getting printed. I judged 13 characters
(corresponding to 2 hours) to be a reasonable point at which our
alignment could start to degrade and this provides a more compact output
for the majority of cases (checked by triggering the error case via
GDB).
------------------------------
N.b. I use a literal 1000000000 for "NANOSEC_PER_SEC". I believe this
would fit in an integer on all hosts that GCC supports, but am not
certain there are not strange integer sizes we support, hence am pointing
it out for special attention during review.
------------------------------
No expected change in generated code.
Bootstrapped and regtested on AArch64 with no regressions.
Hope this is acceptable -- I had originally planned to use
`-ffp-contract` as agreed, until I saw mention of the old x86 bug in the
same area, which had nothing to do with floating point contraction of
operations (PR 99903).
gcc/ChangeLog:
PR middle-end/110316
PR middle-end/99903
* timevar.cc (NANOSEC_PER_SEC, TICKS_TO_NANOSEC,
CLOCKS_TO_NANOSEC, nanosec_to_floating_sec, percent_of): New.
(TICKS_TO_MSEC, CLOCKS_TO_MSEC): Remove these macros.
(timer::validate_phases): Use integral arithmetic to check
validity.
(timer::print_row, timer::print): Convert from integral
nanoseconds to floating point seconds before printing.
(timer::all_zero): Change limit to nanosec count instead of
fractional count of seconds.
(make_json_for_timevar_time_def): Convert from integral
nanoseconds to floating point seconds before recording.
* timevar.h (struct timevar_time_def): Update all measurements
to use uint64_t nanoseconds rather than seconds stored in a
double.
The following adjusts the shift simplification patterns to avoid
touching arithmetic right shifts of possibly negative values by
out-of-bound shift amounts. While simplifying those to zero isn't
wrong, it violates the principle of least surprise.
PR tree-optimization/110838
* match.pd (([rl]shift @0 out-of-bounds) -> zero): Restrict
the arithmetic right-shift case to non-negative operands.
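To illustrate (my own example, not from the patch):

  int
  f (int x)
  {
    /* Undefined behaviour in C, which is what permits the folding at
       all, but on typical targets an arithmetic right shift of a
       negative x by an out-of-bound amount yields -1, not 0.  */
    return x >> 100;
  }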
This changes gimple_bitwise_inverted_equal_p to use two different match
patterns: one to match a bit_not wrapped with a possible nop_convert and
one to match a comparison, also wrapped with a possible nop_convert.
This is to avoid being recursive.
OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
gcc/ChangeLog:
PR tree-optimization/110874
* gimple-match-head.cc (gimple_bit_not_with_nop): New declaration.
(gimple_maybe_cmp): Likewise.
(gimple_bitwise_inverted_equal_p): Rewrite to use gimple_bit_not_with_nop
and gimple_maybe_cmp instead of being recursive.
* match.pd (bit_not_with_nop): New match pattern.
(maybe_cmp): Likewise.
gcc/testsuite/ChangeLog:
PR tree-optimization/110874
* gcc.c-torture/compile/pr110874-a.c: New test.
Canonicalizes (signed x << c) >> c into the lowest
precision(type) - c bits of x IF those bits have a mode precision or a
precision of 1. Also combines this rule with (unsigned x << c) >> c ->
x & ((unsigned)-1 >> c) to prevent a duplicate pattern.
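For instance (my own illustration, assuming 32-bit int and c = 24, so
precision(type) - c = 8 is a mode precision):

  int
  sext8 (int x)
  {
    return (x << 24) >> 24;   /* canonicalized to sign-extending the
                                 low 8 bits, i.e. (int)(signed char) x */
  }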
PR middle-end/101955
* match.pd ((signed x << c) >> c): New canonicalization.
* gcc.dg/pr101955.c: New test.
This patch would like to support the rounding mode API for the
VFNMSAC for the below samples.
* __riscv_vfnmsac_vv_f32m1_rm
* __riscv_vfnmsac_vv_f32m1_rm_m
* __riscv_vfnmsac_vf_f32m1_rm
* __riscv_vfnmsac_vf_f32m1_rm_m
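As an illustration (my own sketch, not from the patch; the signature
and the __RISCV_FRM_* constants are assumed from the RVV intrinsics
convention of taking the frm argument before vl):

  #include <riscv_vector.h>

  vfloat32m1_t
  f (vfloat32m1_t vd, vfloat32m1_t vs1, vfloat32m1_t vs2, size_t vl)
  {
    /* vd[i] = -(vs1[i] * vs2[i]) + vd[i], rounded to nearest even.  */
    return __riscv_vfnmsac_vv_f32m1_rm (vd, vs1, vs2, __RISCV_FRM_RNE, vl);
  }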
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc
(class vfnmsac_frm): New class for vfnmsac frm.
(vfnmsac_frm_obj): New declaration.
(BASE): Ditto.
* config/riscv/riscv-vector-builtins-bases.h: Ditto.
* config/riscv/riscv-vector-builtins-functions.def
(vfnmsac_frm): New function definition.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-single-negate-multiply-sub.c:
New test.
This patch would like to support the rounding mode API for the
VFMSAC for the below samples.
* __riscv_vfmsac_vv_f32m1_rm
* __riscv_vfmsac_vv_f32m1_rm_m
* __riscv_vfmsac_vf_f32m1_rm
* __riscv_vfmsac_vf_f32m1_rm_m
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc
(class vfmsac_frm): New class for vfmsac frm.
(vfmsac_frm_obj): New declaration.
(BASE): Ditto.
* config/riscv/riscv-vector-builtins-bases.h: Ditto.
* config/riscv/riscv-vector-builtins-functions.def
(vfmsac_frm): New function definition.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-single-multiply-sub.c: New test.
This patch would like to support the rounding mode API for the
VFNMACC for the below samples.
* __riscv_vfnmacc_vv_f32m1_rm
* __riscv_vfnmacc_vv_f32m1_rm_m
* __riscv_vfnmacc_vf_f32m1_rm
* __riscv_vfnmacc_vf_f32m1_rm_m
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc
(class vfnmacc_frm): New class for vfnmacc.
(vfnmacc_frm_obj): New declaration.
(BASE): Ditto.
* config/riscv/riscv-vector-builtins-bases.h: Ditto.
* config/riscv/riscv-vector-builtins-functions.def
(vfnmacc_frm): New function definition.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-single-negate-multiply-add.c:
New test.
Fix the assertion failure on an empty reduction definition in
info_for_reduction. Even if a stmt is live, it may still have an empty
reduction definition. Check the reduction definition instead of the live
info before calling info_for_reduction.
gcc/ChangeLog:
PR target/110625
* config/aarch64/aarch64.cc (aarch64_force_single_cycle): Check
STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/pr110625_3.c: New testcase.
This patch would like to support the rounding mode API for the
VFMACC for the below samples.
* __riscv_vfmacc_vv_f32m1_rm
* __riscv_vfmacc_vv_f32m1_rm_m
* __riscv_vfmacc_vf_f32m1_rm
* __riscv_vfmacc_vf_f32m1_rm_m
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc
(class vfmacc_frm): New class for vfmacc frm.
(vfmacc_frm_obj): New declaration.
(BASE): Ditto.
* config/riscv/riscv-vector-builtins-bases.h: Ditto.
* config/riscv/riscv-vector-builtins-functions.def
(vfmacc_frm): New function definition.
* config/riscv/riscv-vector-builtins.cc
(function_expander::use_ternop_insn): Add frm operand support.
* config/riscv/vector.md: Add vfmuladd to frm_mode.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-single-multiply-add.c: New test.
This patch would like to support the rounding mode API for the
VFWMUL for the below samples.
* __riscv_vfwmul_vv_f64m2_rm
* __riscv_vfwmul_vv_f64m2_rm_m
* __riscv_vfwmul_vf_f64m2_rm
* __riscv_vfwmul_vf_f64m2_rm_m
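An analogous usage sketch (again my own, with the same assumed
argument order), here producing the widened result type:

  #include <riscv_vector.h>

  vfloat64m2_t
  f (vfloat32m1_t op1, vfloat32m1_t op2, size_t vl)
  {
    /* Widening multiply: f32m1 x f32m1 -> f64m2, rounded towards zero. */
    return __riscv_vfwmul_vv_f64m2_rm (op1, op2, __RISCV_FRM_RTZ, vl);
  }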
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc
(vfwmul_frm_obj): New declaration.
(vfwmul_frm): Ditto.
* config/riscv/riscv-vector-builtins-bases.h (vfwmul_frm): Ditto.
* config/riscv/riscv-vector-builtins-functions.def
(vfwmul_frm): New function definition.
* config/riscv/vector.md (frm_mode): Add vfwmul to frm_mode.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-widening-mul.c: New test.
This patch would like to support the rounding mode API for the
VFDIV and VFRDIV for the below samples.
* __riscv_vfdiv_vv_f32m1_rm
* __riscv_vfdiv_vv_f32m1_rm_m
* __riscv_vfdiv_vf_f32m1_rm
* __riscv_vfdiv_vf_f32m1_rm_m
* __riscv_vfrdiv_vf_f32m1_rm
* __riscv_vfrdiv_vf_f32m1_rm_m
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc
(binop_frm): New declaration.
(reverse_binop_frm): Likewise.
(BASE): Likewise.
* config/riscv/riscv-vector-builtins-bases.h (vfdiv_frm): New extern
declaration.
(vfrdiv_frm): Likewise.
* config/riscv/riscv-vector-builtins-functions.def
(vfdiv_frm): New function definition.
(vfrdiv_frm): Likewise.
* config/riscv/vector.md: Add vfdiv to frm_mode.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-single-div.c: New test.
* gcc.target/riscv/rvv/base/float-point-single-rdiv.c: New test.
Hmmer's internal function has 4 loops. The following is the profile at start:
loop 1:
estimate 472
iterations by profile: 473.497707 (reliable) count in:84821 (precise, freq 0.9979)
loop 2:
estimate 99
iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104)
loop 3:
estimate 99
iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104)
loop 4:
estimate 100
iterations by profile: 100.999596 (reliable) execution count:84167 (precise, freq 0.9902)
So the first loop is the outer loop and the second/third loops are nested.
The fourth loop is not critical.
Precise iteration counts are unknown (473 and 100 come from the profile).
The nested loop has the following form:
for (k = 1; k <= M; k++) {
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
    if (mc[k] < -INFTY) mc[k] = -INFTY;

    dc[k] = dc[k-1] + tpdd[k-1];
    if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
    if (dc[k] < -INFTY) dc[k] = -INFTY;

    if (k < M) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
    }
}
We do quite some belly dancing here.
1) loop-ch slightly mis-updates the profile, so the estimate of 99
does not match the profile estimate of 100.
2) loop-split splits on if (k < M) and produces two loops.
It fails to notice that the second loop never iterates.
It used to mis-update the profile a lot, which later caused the internal
loop to become cold. This is fixed now.
3) loop-dist introduces runtime aliasing checks for both loops.
4) the tree vectorizer vectorizes some of the copies of the loop produced
and drops the expected iteration counts.
5) loop peeling peels the loops with expected low iteration counts.
6) complete loop unrolling kills some loops in prologues/epilogues.
We end up with quite a few loops and run out of registers:
iterations by profile: 5.312499 (unreliable, maybe flat)
this is the vectorized internal loop after loop peeling
iterations by profile: 0.009495 (unreliable, maybe flat)
iterations by profile: 0.009495 (unreliable, maybe flat)
iterations by profile: 0.009495 (unreliable, maybe flat)
iterations by profile: 0.009495 (unreliable, maybe flat)
Those are all versioned/peeled and vectorized variants of the loop that never loops
iterations by profile: 100.000008 (unreliable)
iterations by profile: 100.000000 (unreliable)
Those are variants with failed aliasing checks
iterations by profile: 9.662853 (unreliable, maybe flat)
iterations by profile: 4.646072 (unreliable)
iterations by profile: 100.000007 (unreliable)
iterations by profile: 5.312500 (unreliable)
iterations by profile: 473.497707 (reliable)
This is loop 1
iterations by profile: 100.999596 (reliable)
This is loop 4.
This patch fixes the loop iteration estimate update after loop splitting, so we get:
iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 (guessed, freq 149.9081)
This is the remainder of the peeled vectorized loop 2. It misses an estimate, which is
correct since after peeling it 6 times it is essentially
impossible to tell what the remaining loop profile is (without histograms).
iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374801 (guessed, freq 4.4094)
Peeled split part of loop 2 (the one that never loops). We ought to work this out
but at least w
estimate 99
iterations by profile: 100.000008 (unreliable) entry count:3945039 (guessed, freq 46.4122)
estimate 99
iterations by profile: 100.000000 (unreliable) entry count:35505353 (guessed, freq 417.7100)
estimate 99
iterations by profile: 9.662853 (unreliable, maybe flat) entry count:35505353 (guessed, freq 417.7100)
Profile here mismatches estimate - I will need to work out why.
estimate 5
iterations by profile: 4.646072 (unreliable) entry count:31954818 (guessed, freq 375.9390)
This is vectorized but not peeled loop 3
estimate 99
iterations by profile: 100.000007 (unreliable) entry count:7101070 (guessed, freq 83.5420)
Unvectorized variant of loop 3
estimate 5
iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512)
Another vectorized variant of loop 3
estimate 472
iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979)
Outer loop
estimate 100
iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902)
loop 4, not vectorized/peeled
So there is still work to do on this testcase, but with the patch we prevent 3 useless loops.
Bootstrapped/regtested x86_64-linux, plan to commit it later today.
gcc/ChangeLog:
* tree-ssa-loop-split.cc (split_loop): Update estimated iteration counts.
Profiledbootstrap fails with an ICE in update_loop_exit_probability_scale_dom_bbs
called from loop unrolling.
The reason is that under relatively rare situations we may run into a case
where a loop has multiple exits and all are considered likely, but then we
scale down the profile and one of the exits becomes unlikely.
We pass around unadjusted_exit_count to scale the exit probability
correctly. In this case we may end up using an uninitialized value;
the profile-count type intentionally bombs on that.
gcc/ChangeLog:
PR bootstrap/110857
* cfgloopmanip.cc (scale_loop_profile): (Un)initialize
unadjusted_exit_count.
Instead of reading the known zero bits in IPA, read the value/mask
pair which is available.
There is a slight change of behavior here. I have removed the check
for SSA_NAME, as the ranger can calculate the range and value/mask for
INTEGER_CST. This simplifies the code a bit, since there's no special
casing when setting the jfunc bits. The default range for VR is
undefined, so I think it's safe just to check for undefined_p().
gcc/ChangeLog:
* ipa-prop.cc (ipa_compute_jump_functions_for_edge): Read global
value/mask.
gcc/testsuite/ChangeLog:
* g++.dg/ipa/pure-const-3.C: Move source to...
* g++.dg/ipa/pure-const-3.h: ...here, and adjust original test
accordingly.
* g++.dg/ipa/pure-const-3b.C: New.
[ This is a partial commit. So not all the cases mentioned by
Xiao are currently handled. ]
This patch recognizes Zicond patterns when the select pattern has a
condition of eq or neq to 0 (using eq as an example), namely:
1) rd = (rs2 == 0) ? non-imm : 0
2) rd = (rs2 == 0) ? non-imm : non-imm
3) rd = (rs2 == 0) ? reg : non-imm
4) rd = (rs2 == 0) ? reg : reg
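As an illustration (my own example of case 4):

  long
  f (long rs2, long a, long b)
  {
    /* With Zicond this can expand to a czero.nez/czero.eqz pair
       combined with an or, instead of a conditional branch.  */
    return rs2 == 0 ? a : b;
  }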
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_expand_conditional_move): Recognize
various Zicond patterns.
* config/riscv/riscv.md (mov<mode>cc): Allow TARGET_ZICOND. Use
sfb_alu_operand for both arms of the conditional move.
Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
This patch adds tests for the following builtins:
__builtin_preserve_enum_value
__builtin_btf_type_id
__builtin_preserve_type_info
gcc/testsuite/ChangeLog:
* gcc.target/bpf/core-builtin-enumvalue.c: New test.
* gcc.target/bpf/core-builtin-enumvalue-errors.c: New test.
* gcc.target/bpf/core-builtin-enumvalue-opt.c: New test.
* gcc.target/bpf/core-builtin-fieldinfo-const-elimination.c: New test.
* gcc.target/bpf/core-builtin-fieldinfo-errors-1.c: Changed.
* gcc.target/bpf/core-builtin-fieldinfo-errors-2.c: Changed.
* gcc.target/bpf/core-builtin-type-based.c: New test.
* gcc.target/bpf/core-builtin-type-id.c: New test.
* gcc.target/bpf/core-support.h: New test.
This patch updates the support for the BPF CO-RE builtins
__builtin_preserve_access_index and __builtin_preserve_field_info,
and adds support for the CO-RE builtins __builtin_btf_type_id,
__builtin_preserve_type_info and __builtin_preserve_enum_value.
These CO-RE relocations are now converted to __builtin_core_reloc which
abstracts all of the original builtins in a polymorphic relocation
specific builtin.
The builtin processing is now split into two stages: the first (pack) is
executed right after the front-end and the second (process) right before
the asm output.
In the expand pass the __builtin_core_reloc is converted to an
unspec:UNSPEC_CORE_RELOC rtx entry.
The data required to process the builtin is now collected in the packing
stage (right after the front-end), so the compiler cannot optimize away
any of the relevant information required to compose the relocation.
At expansion, that information is recovered and CTF/BTF is queried to
construct the information that will be used in the relocation.
At this point the relocation is added to its specific section and the
builtin is expanded to its expected default value.
In order to process __builtin_preserve_enum_value, it was necessary to
hook into the front-end to collect the original enum value reference,
since the parser folds all enum values to their integer_cst
representation.
More details can be found in core-builtins.cc.
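A hypothetical usage sketch (not from the patch; the kind argument is
assumed to follow the clang-compatible encoding, where 0 requests the
byte offset of the field):

  struct S { int a; int b; };

  unsigned int
  off_of_b (struct S *s)
  {
    /* Relocatable byte offset of field b; internally routed through
       the new __builtin_core_reloc machinery.  */
    return __builtin_preserve_field_info (s->b, 0);
  }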
Regtested on host x86_64-linux-gnu and target bpf-unknown-none.
gcc/ChangeLog:
PR target/107844
PR target/107479
PR target/107480
PR target/107481
* config.gcc: Added core-builtins.cc and .o files.
* config/bpf/bpf-passes.def: Removed file.
* config/bpf/bpf-protos.h (bpf_add_core_reloc,
bpf_replace_core_move_operands): New prototypes.
* config/bpf/bpf.cc (enum bpf_builtins, is_attr_preserve_access,
maybe_make_core_relo, bpf_core_field_info, bpf_core_compute,
bpf_core_get_index, bpf_core_new_decl, bpf_core_walk,
bpf_is_valid_preserve_field_info_arg, is_attr_preserve_access,
handle_attr_preserve, pass_data_bpf_core_attr, pass_bpf_core_attr):
Removed.
(def_builtin, bpf_expand_builtin, bpf_resolve_overloaded_builtin): Changed.
* config/bpf/bpf.md (define_expand mov<MM:mode>): Changed.
(mov_reloc_core<mode>): Added.
* config/bpf/core-builtins.cc (struct cr_builtin, enum
cr_decision struct cr_local, struct cr_final, struct
core_builtin_helpers, enum bpf_plugin_states): Added types.
(builtins_data, core_builtin_helpers, core_builtin_type_defs):
Added variables.
(allocate_builtin_data, get_builtin_data, search_builtin_data,
remove_parser_plugin, compare_same_kind, compare_same_ptr_expr,
compare_same_ptr_type, is_attr_preserve_access, core_field_info,
bpf_core_get_index, compute_field_expr,
pack_field_expr_for_access_index, pack_field_expr_for_preserve_field,
process_field_expr, pack_enum_value, process_enum_value, pack_type,
process_type, bpf_require_core_support, make_core_relo, read_kind,
kind_access_index, kind_preserve_field_info, kind_enum_value,
kind_type_id, kind_preserve_type_info, get_core_builtin_fndecl_for_type,
bpf_handle_plugin_finish_type, bpf_init_core_builtins,
construct_builtin_core_reloc, bpf_resolve_overloaded_core_builtin,
bpf_expand_core_builtin, bpf_add_core_reloc,
bpf_replace_core_move_operands): Added functions.
* config/bpf/core-builtins.h (enum bpf_builtins): Added.
(bpf_init_core_builtins, bpf_expand_core_builtin,
bpf_resolve_overloaded_core_builtin): Added functions.
* config/bpf/coreout.cc (struct bpf_core_extra): Added.
(bpf_core_reloc_add, output_asm_btfext_core_reloc): Changed.
* config/bpf/coreout.h (bpf_core_reloc_add): Changed prototype.
* config/bpf/t-bpf: Added core-builtins.o.
* doc/extend.texi: Added documentation for new BPF builtins.
With additional floating point relations in the pipeline, we can no
longer tell based on the LHS what the relation of X < Y is without knowing
the type of X and Y.
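For example (standard IEEE semantics, not from the patch): with an
unordered operand both x < y and x >= y are false, so a false x < y no
longer implies x >= y the way it does for integers:

  int lt (double x, double y) { return x < y; }   /* lt (NAN, 1.0) == 0 */
  int ge (double x, double y) { return x >= y; }  /* ge (NAN, 1.0) == 0 too */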
* gimple-range-fold.cc (fold_using_range::range_of_range_op): Add
ranges to the call to relation_fold_and_or.
(fold_using_range::relation_fold_and_or): Add op1 and op2 ranges.
(fur_source::register_outgoing_edges): Add op1 and op2 ranges.
* gimple-range-fold.h (relation_fold_and_or): Adjust params.
* gimple-range-gori.cc (gori_compute::compute_operand_range): Add
a varying op1 and op2 to call.
* range-op-float.cc (range_operator::op1_op2_relation): New defaults.
(operator_equal::op1_op2_relation): New float version.
(operator_not_equal::op1_op2_relation): Ditto.
(operator_lt::op1_op2_relation): Ditto.
(operator_le::op1_op2_relation): Ditto.
(operator_gt::op1_op2_relation): Ditto.
(operator_ge::op1_op2_relation): Ditto.
* range-op-mixed.h (operator_equal::op1_op2_relation): New float
prototype.
(operator_not_equal::op1_op2_relation): Ditto.
(operator_lt::op1_op2_relation): Ditto.
(operator_le::op1_op2_relation): Ditto.
(operator_gt::op1_op2_relation): Ditto.
(operator_ge::op1_op2_relation): Ditto.
* range-op.cc (range_op_handler::op1_op2_relation): Dispatch new
variations.
(range_operator::op1_op2_relation): Add extra params.
(operator_equal::op1_op2_relation): Ditto.
(operator_not_equal::op1_op2_relation): Ditto.
(operator_lt::op1_op2_relation): Ditto.
(operator_le::op1_op2_relation): Ditto.
(operator_gt::op1_op2_relation): Ditto.
(operator_ge::op1_op2_relation): Ditto.
* range-op.h (range_operator): New prototypes.
(range_op_handler): Ditto.
We've been assuming x == x is VREL_EQ in GORI, but this is not always
going to be true with floating point. Provide an API to return the
relation.
* gimple-range-gori.cc (gori_compute::compute_operand1_range):
Use identity relation.
(gori_compute::compute_operand2_range): Ditto.
* value-relation.cc (get_identity_relation): New.
* value-relation.h (get_identity_relation): New prototype.
Set routines which take a type shouldn't have to pre-set the type of the
underlying range as it is specified as a parameter already.
* value-range.h (Value_Range::set_varying): Set the type.
(Value_Range::set_zero): Ditto.
(Value_Range::set_nonzero): Ditto.
I'm using this hunk locally to more thoroughly exercise the zicond paths
due to inaccuracies elsewhere in the costing model. It was never
supposed to be part of the costing commit though. And as we've seen
it's causing problems with the vector bits.
While my testing isn't complete, this hunk was never supposed to be
pushed and it's causing problems. So I'm just ripping it out.
There's a bigger TODO in this space WRT a top-to-bottom evaluation of
the costing on RISC-V. I'm still formulating what that evaluation is
going to look like, so don't hold your breath waiting on it.
Pushed to the trunk.
gcc/
* config/riscv/riscv.cc (riscv_rtx_costs): Remove errant hunk from
recent commit.
The ICE in PR analyzer/108171 appears to be a dup of the recently fixed
PR analyzer/110882 and is likewise fixed by it; adding this test case.
gcc/testsuite/ChangeLog:
PR analyzer/108171
* gcc.dg/analyzer/pr108171.c: New test.
Signed-off-by: David Malcolm <dmalcolm@redhat.com>
The previous patch missed the vfsub comment for binop_frm; this
patch would like to fix this.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc: Add vfsub.
zstdtest has some inline data where some testcases lack the
uncompressed length field. Thus it computes that, but still
ends up allocating memory for the uncompressed buffer based on
the original (zero) length. Oops. Causes memory corruption if the
allocator returns non-NULL.
libbacktrace/
* zstdtest.c (test_samples): Properly compute the allocation
size for the uncompressed data.
can_div_trunc_p (a, b, &Q, &r) tries to compute a Q and r that
satisfy the usual conditions for truncating division:
(1) a = b * Q + r
(2) |b * Q| <= |a|
(3) |r| < |b|
We can compute Q using the constant component (the case when
all indeterminates are zero). Since |r| < |b| for the constant
case, the requirements for indeterminate xi with coefficients
ai (for a) and bi (for b) are:
(2') |bi * Q| <= |ai|
(3') |ai - bi * Q| <= |bi|
(See the big comment for more details, restrictions, and reasoning).
However, the function works on abstract arithmetic types, and so
it has to be careful not to introduce new overflow. The code
therefore only handled the extreme for (3'), that is:
|ai - bi * Q| = |bi|
for the case where Q is zero.
Looking at it again, the overflow issue is a bit easier to handle than
I'd originally thought (or so I hope). This patch therefore extends the
code to handle |ai - bi * Q| = |bi| for all Q, with Q = 0 no longer
being a separate case.
The net effect is to allow the function to succeed for things like:
(a0 + b1 (Q+1) x) / (b0 + b1 x)
where Q = a0 / b0, with various sign conditions. E.g. we now handle:
(7 + 8x) / (4 + 4x)
with Q = 1 and r = 3 + 4x.
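Checking that example against the conditions above (a0 = 7, b0 = 4,
a1 = 8, b1 = 4, Q = 7 / 4 = 1):
  constant part: 7 = 4 * 1 + 3 with |3| < |4|
  (2') |b1 * Q| = 4 <= 8 = |a1|
  (3') |a1 - b1 * Q| = |8 - 4| = 4 = |b1|, the boundary case now accepted.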
gcc/
* poly-int.h (can_div_trunc_p): Succeed for more boundary conditions.
gcc/testsuite/
* gcc.dg/plugin/poly-int-tests.h (test_can_div_trunc_p_const): Add
more tests.
The following makes sure to limit the shift operand when vectorizing
(short)((int)x >> 31) via (short)x >> 31, as the out-of-bounds shift
operand otherwise invokes undefined behavior. When we determine
whether we can demote the operand we know we at most shift in the
sign bit, so we can adjust the shift amount.
Note this has the possibility of un-CSEing common shift operands
as there's no good way to share pattern stmts between patterns.
We'd have to separately pattern recognize the definition.
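Concretely (my own illustration, assuming 16-bit short): clamping the
amount to the demoted precision minus one preserves the value, because
only copies of the sign bit are shifted in:

  short
  f (short x)
  {
    return (short)((int) x >> 31);   /* demotable to x >> 15: both
                                        yield 0 or -1 by the sign bit */
  }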
PR tree-optimization/110838
* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
Adjust the shift operand of RSHIFT_EXPRs.
* gcc.dg/torture/pr110838.c: New testcase.
Sometimes IVOPTs chooses a weird induction variable which downstream
leads to issues. Most of the times we can fend those off during costing
by rejecting the candidate but it looks like the address description
costing synthesizes is different from what we end up generating so
the following fixes things up at code generation time. Specifically
we avoid the create_mem_ref_raw fallback which uses a literal zero
address base with the actual base in index2. For the case in question
we have the address
type = unsigned long
offset = 0
elements = {
[0] = &e * -3,
[1] = (sizetype) a.9_30 * 232,
[2] = ivtmp.28_44 * 4
}
from which we code generate the problematical
_3 = MEM[(long int *)0B + ivtmp.36_9 + ivtmp.28_44 * 4];
which references the object at address zero. The patch below
recognizes the fallback after the fact and transforms the
TARGET_MEM_REF memory reference into a LEA for which this form
isn't problematic:
_24 = &MEM[(long int *)0B + ivtmp.36_34 + ivtmp.28_44 * 4];
_3 = *_24;
thereby avoiding the correctness issue. We'd later conclude the
program terminates at the null pointer dereference and make the
function pure, miscompiling the main function of the testcase.
PR tree-optimization/110702
* tree-ssa-loop-ivopts.cc (rewrite_use_address): When
we created a NULL pointer based access rewrite that to
a LEA.
* gcc.dg/torture/pr110702.c: New testcase.
This rewriting removes algorithm inefficiencies due to unnecessary
recursion and copying. The new version has much smaller and statically known
stack requirements and is additionally up to 2x faster.
gcc/ada/
* libgnat/s-imageb.adb (Set_Image_Based_Unsigned): Rewritten.
* libgnat/s-imagew.adb (Set_Image_Width_Unsigned): Likewise.
The problem is that it is necessary to break the privacy during the
expansion of the Input attribute, which may introduce a view mismatch
with the parameter of the routine checking the invariant of the type.
gcc/ada/
* exp_util.adb (Make_Invariant_Call): Convert the expression to
the type of the formal parameter if need be.
Using the operator of System.Storage_Elements has introduced a range check
that may be tripped on, so this removes the intermediate conversion to the
Storage_Count subtype that is responsible for it.
gcc/ada/
* libgnat/s-dwalin.adb ("-"): New subtraction operator.
(Enable_Cache): Use it to compute the offset.
(Symbolic_Address): Likewise.
statement_sink_location for loads is currently confused about
stores that are not on the paths we are sinking across. The
following replaces the logic that tries to ensure we are not
sinking across stores: instead of walking all immediate virtual
uses and then checking whether the stores found are on the paths
we sink through, we check the virtual operand that is live at the
sinking location. To obtain the live virtual operand we rely
on the new virtual_operand_live class which provides an overall
cheaper and also more precise way to check the constraints.
* tree-ssa-sink.cc: Include tree-ssa-live.h.
(pass_sink_code::execute): Instantiate virtual_operand_live
and pass it down.
(sink_code_in_bb): Pass down virtual_operand_live.
(statement_sink_location): Get virtual_operand_live and
verify we are not sinking loads across stores by looking up
the live virtual operand at the sink location.
* gcc.dg/tree-ssa/ssa-sink-20.c: New testcase.
The following adds an on-demand global liveness computation class
computing and caching the live-out virtual operand of basic blocks
and answering live-out, live-in and live-on-edge queries. The flow
is optimized for the intended use in code sinking, which queries
live-in; it can possibly be optimized further when the originating
query is for live-out.
The code relies on up-to-date immediate dominator information and
on an unchanging virtual operand state.
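A sketch of the intended use (method names from the ChangeLog below;
exact signatures are my assumption):

  /* Inside a pass, with dominators computed and VOPs unchanging: */
  virtual_operand_live vop_live;
  tree live_in  = vop_live.get_live_in (bb);   /* live-in VOP of bb */
  tree live_out = vop_live.get_live_out (bb);  /* live-out VOP of bb */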
* tree-ssa-live.h (class virtual_operand_live): New.
* tree-ssa-live.cc (virtual_operand_live::init): New.
(virtual_operand_live::get_live_in): Likewise.
(virtual_operand_live::get_live_out): Likewise.
The following swaps the loop splitting pass and the final value
replacement pass to avoid keeping the IV of the earlier loop
live when not necessary. The existing gcc.target/i386/pr87007-5.c
testcase shows that we otherwise fail to elide an empty loop
later. I don't see any good reason why loop splitting would need
final value replacement; all exit values honor the constraints
we place on loop header PHIs automatically.
* passes.def: Exchange loop splitting and final value
replacement passes.
* gcc.target/i386/pr87007-5.c: Make sure we split the loop
and eliminate both in the end.
gcc/ChangeLog:
* config/s390/s390.cc (expand_perm_as_a_vlbr_vstbr_candidate):
New function which handles bswap patterns for vec_perm_const.
(vectorize_vec_perm_const_1): Call new function.
* config/s390/vector.md (*bswap<mode>): Fix operands in output
template.
(*vstbr<mode>): New insn.
gcc/testsuite/ChangeLog:
* gcc.target/s390/s390.exp: Add subdirectory vxe2.
* gcc.target/s390/vxe2/vlbr-1.c: New test.
* gcc.target/s390/vxe2/vstbr-1.c: New test.
* gcc.target/s390/vxe2/vstbr-2.c: New test.
The .spec files used for linking on ppc-vx6, when the rtp-smp runtime
is selected, add -L flags for /lib_smp/ and /lib/.
There was a problem, though: although /lib_smp/ and /lib/ were to be
searched in this order, and the specs files do that correctly, the
compiler would search /lib/ first regardless, because
STARTFILE_PREFIX_SPEC said so, and specs files cannot override that.
With this patch, we arrange for the presence of -msmp to affect
STARTFILE_PREFIX_SPEC, so that the compiler searches /lib_smp/ rather
than /lib/ for crt files. A separate patch for GNAT ensures that when
the rtp-smp runtime is selected, -msmp is passed to the compiler
driver for linking, along with the --specs flags.
for gcc/ChangeLog
* config/vxworks-smp.opt: New. Introduce -msmp.
* config.gcc: Enable it on powerpc* vxworks prior to 7r*.
* config/rs6000/vxworks.h (STARTFILE_PREFIX_SPEC): Choose
lib_smp when -msmp is present in the command line.
* doc/invoke.texi: Document it.
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_save_reg_p): Save ra for leaf
functions when -mno-omit-leaf-frame-pointer is enabled.
(riscv_option_override): Override omit-frame-pointer.
(riscv_frame_pointer_required): Save s0 for non-leaf functions.
(TARGET_FRAME_POINTER_REQUIRED): Override definition.
* config/riscv/riscv.opt: Add option support.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/omit-frame-pointer-1.c: New test.
* gcc.target/riscv/omit-frame-pointer-2.c: New test.
* gcc.target/riscv/omit-frame-pointer-3.c: New test.
* gcc.target/riscv/omit-frame-pointer-4.c: New test.
* gcc.target/riscv/omit-frame-pointer-test.c: New test.
Signed-off-by: Yanzhang Wang <yanzhang.wang@intel.com>
This patch is a conservative fix for PR target/110792, a wrong-code
regression affecting doubleword rotations by BITS_PER_WORD, which
effectively swaps the highpart and lowpart words, when the source to be
rotated resides in memory. The issue is that if the register used to
hold the lowpart of the destination is mentioned in the address of
the memory operand, the current define_insn_and_split unintentionally
clobbers it before reading the highpart.
Hence, for the testcase, the incorrectly generated code looks like:
salq $4, %rdi // calculate address
movq WHIRL_S+8(%rdi), %rdi // accidentally clobber addr
movq WHIRL_S(%rdi), %rbp // load (wrong) lowpart
Traditionally, the textbook way to fix this would be to add an
explicit early clobber to the instruction's constraints.
(define_insn_and_split "<insn>32di2_doubleword"
- [(set (match_operand:DI 0 "register_operand" "=r,r,r")
+ [(set (match_operand:DI 0 "register_operand" "=r,r,&r")
(any_rotate:DI (match_operand:DI 1 "nonimmediate_operand" "0,r,o")
(const_int 32)))]
but unfortunately this currently generates significantly worse code,
due to a strange choice of reloads (effectively memcpy), which ends up
looking like:
salq $4, %rdi // calculate address
movdqa WHIRL_S(%rdi), %xmm0 // load the double word in SSE reg.
movaps %xmm0, -16(%rsp) // store the SSE reg back to the stack
movq -8(%rsp), %rdi // load highpart
movq -16(%rsp), %rbp // load lowpart
Note that reload's "&" doesn't distinguish between the memory being
early clobbered, vs the registers used in an addressing mode being
early clobbered.
The fix proposed in this patch is to remove the third alternative, which
allowed offsettable memory as an operand, forcing reload to place the
operand into a register before the rotation. This results in:
salq $4, %rdi
movq WHIRL_S(%rdi), %rax
movq WHIRL_S+8(%rdi), %rdi
movq %rax, %rbp
I believe there's a more advanced solution, by swapping the order of
the loads (if first destination register is mentioned in the address),
or inserting a lea insn (if both destination registers are mentioned
in the address), but this fix is a minimal "safe" solution, that
should hopefully be suitable for backporting.
2023-08-03 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR target/110792
* config/i386/i386.md (<any_rotate>ti3): For rotations by 64 bits
place operand in a register before gen_<insn>64ti2_doubleword.
(<any_rotate>di3): Likewise, for rotations by 32 bits, place
operand in a register before gen_<insn>32di2_doubleword.
(<any_rotate>32di2_doubleword): Constrain operand to be in register.
(<any_rotate>64ti2_doubleword): Likewise.
gcc/testsuite/ChangeLog
PR target/110792
* g++.target/i386/pr110792.C: New 32-bit C++ test case.
* gcc.target/i386/pr110792.c: New 64-bit C test case.