The eagle-eyed may have spotted that my recent testcases for DImode shifts
on x86_64 included -mno-stv in the dg-options. This is because the
Scalar-To-Vector (STV) pass currently transforms these shifts to use
SSE vector operations, producing larger code even with -Os. The issue
is that compute_convert_gain currently underestimates the size of the
instructions required for inter-unit moves, which is corrected by the
patch below.
For the simple test case:
unsigned long long shl1(unsigned long long x) { return x << 1; }
without this patch, GCC -m32 -Os -mavx2 currently generates:
shl1: push %ebp // 1 byte
mov %esp,%ebp // 2 bytes
vmovq 0x8(%ebp),%xmm0 // 5 bytes
pop %ebp // 1 byte
vpaddq %xmm0,%xmm0,%xmm0 // 4 bytes
vmovd %xmm0,%eax // 4 bytes
vpextrd $0x1,%xmm0,%edx // 6 bytes
ret // 1 byte = 24 bytes total
with this patch, we now generate the shorter
shl1: push %ebp // 1 byte
mov %esp,%ebp // 2 bytes
mov 0x8(%ebp),%eax // 3 bytes
mov 0xc(%ebp),%edx // 3 bytes
pop %ebp // 1 byte
add %eax,%eax // 2 bytes
adc %edx,%edx // 2 bytes
ret // 1 byte = 15 bytes total
Benchmarking using CSiBE shows that this patch saves 1361 bytes
when compiling with -m32 -Os, and saves 172 bytes when compiling
with -Os.
2023-10-24 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-features.cc (compute_convert_gain): Provide
more accurate values (sizes) for inter-unit moves with -Os.
This patch completes the ARC back-end's transition to using pre-reload
splitters for SImode shifts and rotates on targets without a barrel
shifter. The core part is that the shift_si3 define_insn is no longer
needed, as shifts and rotates that don't require a loop are split
before reload, and then because shift_si3_loop is the only caller
of output_shift, both can be significantly cleaned up and simplified.
The output_shift function (Claudiu's "the elephant in the room") is
renamed to output_shift_loop, which handles just the four-instruction
zero-overhead loop implementations.
Aside from the clean-ups, the user-visible changes are much improved
implementations of SImode shifts and rotates on affected targets.
For the function:
unsigned int rotr_1 (unsigned int x) { return (x >> 1) | (x << 31); }
GCC with -O2 -mcpu=em would previously generate:
rotr_1: lsr_s r2,r0
bmsk_s r0,r0,0
ror r0,r0
j_s.d [blink]
or_s r0,r0,r2
with this patch, we now generate:
j_s.d [blink]
ror r0,r0
For the function:
unsigned int rotr_31 (unsigned int x) { return (x >> 31) | (x << 1); }
GCC with -O2 -mcpu=em would previously generate:
rotr_31:
mov_s r2,r0 ;4
asl_s r0,r0
add.f 0,r2,r2
rlc r2,0
j_s.d [blink]
or_s r0,r0,r2
with this patch we now generate an add.f followed by an adc:
rotr_31:
add.f r0,r0,r0
j_s.d [blink]
add.cs r0,r0,1
Shifts by constants requiring a loop have been improved for even counts
by performing two operations in each iteration:
int shl10(int x) { return x >> 10; }
which previously looked like:
shl10: mov.f lp_count, 10
lpnz 2f
asr r0,r0
nop
2: # end single insn loop
j_s [blink]
And now becomes:
shl10:
mov lp_count,5
lp 2f
asr r0,r0
asr r0,r0
2: # end single insn loop
j_s [blink]
So emulating ARC's SWAP instruction on architectures that don't have it:
unsigned int rotr_16 (unsigned int x) { return (x >> 16) | (x << 16); }
previously required 10 instructions and ~70 cycles:
rotr_16:
mov_s r2,r0 ;4
mov.f lp_count, 16
lpnz 2f
add r0,r0,r0
nop
2: # end single insn loop
mov.f lp_count, 16
lpnz 2f
lsr r2,r2
nop
2: # end single insn loop
j_s.d [blink]
or_s r0,r0,r2
now becomes just 4 instructions and ~18 cycles:
rotr_16:
mov lp_count,8
lp 2f
ror r0,r0
ror r0,r0
2: # end single insn loop
j_s [blink]
2023-10-24 Roger Sayle <roger@nextmovesoftware.com>
Claudiu Zissulescu <claziss@gmail.com>
gcc/ChangeLog
* config/arc/arc-protos.h (output_shift): Rename to...
(output_shift_loop): Tweak API to take an explicit rtx_code.
(arc_split_ashl): Prototype new function here.
(arc_split_ashr): Likewise.
(arc_split_lshr): Likewise.
(arc_split_rotl): Likewise.
(arc_split_rotr): Likewise.
* config/arc/arc.cc (output_shift): Delete local prototype. Rename.
(output_shift_loop): New function replacing output_shift to output
a zero-overhead loop for SImode shifts and rotates on ARC targets
without a barrel shifter (i.e. no hardware support for these insns).
(arc_split_ashl): New helper function to split *ashlsi3_nobs.
(arc_split_ashr): New helper function to split *ashrsi3_nobs.
(arc_split_lshr): New helper function to split *lshrsi3_nobs.
(arc_split_rotl): New helper function to split *rotlsi3_nobs.
(arc_split_rotr): New helper function to split *rotrsi3_nobs.
(arc_print_operand): Correct whitespace.
(arc_rtx_costs): Likewise.
(hwloop_optimize): Likewise.
* config/arc/arc.md (ANY_SHIFT_ROTATE): New define_code_iterator.
(define_code_attr insn): New code attribute to map to pattern name.
(<ANY_SHIFT_ROTATE>si3): New expander unifying previous ashlsi3,
ashrsi3 and lshrsi3 define_expands. Adds rotlsi3 and rotrsi3.
(*<ANY_SHIFT_ROTATE>si3_nobs): New define_insn_and_split that
unifies the previous *ashlsi3_nobs, *ashrsi3_nobs and *lshrsi3_nobs.
We now call arc_split_<insn> in arc.cc to implement each split.
(shift_si3): Delete define_insn, all shifts/rotates are now split.
(shift_si3_loop): Rename to...
(<insn>si3_loop): define_insn to handle loop implementations of
SImode shifts and rotates, calling output_shift_loop for the template.
(rotrsi3): Rename to...
(*rotrsi3_insn): define_insn for TARGET_BARREL_SHIFTER's ror.
(*rotlsi3): New define_insn_and_split to transform left rotates
into right rotates before reload.
(rotlsi3_cnt1): New define_insn_and_split to implement a left
rotate by one bit using an add.f followed by an adc.
* config/arc/predicates.md (shiftr4_operator): Delete.
The test declared 'int *carry;' and wrote to '*carry' without
initializing 'carry' first, leading to an attempted write at address
zero, and a crash.
Fix by declaring 'int carry;' and passing '&carry' instead of 'carry'
as the parameter.
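A minimal sketch of the bug and the fix (using a hypothetical helper
here; in the real test the pointer is passed to the MVE vadcq/vsbcq
intrinsics, which write the carry-out through it):
void set_carry (int *carry) { *carry = 1; }
void before (void)
{
  int *carry;          /* never initialized */
  set_carry (carry);   /* callee writes through a wild pointer, often address zero */
}
void after (void)
{
  int carry;
  set_carry (&carry);  /* callee writes into a real object */
}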
2023-09-08 Christophe Lyon <christophe.lyon@linaro.org>
gcc/testsuite/
* gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c: Fix.
The mpy_dest_reg_operand is just a wrapper for
register_operand. Remove it.
gcc/
* config/arc/arc.md (mulsi3_700): Update pattern.
(mulsi3_v2): Likewise.
* config/arc/predicates.md (mpy_dest_reg_operand): Remove it.
Signed-off-by: Claudiu Zissulescu <claziss@gmail.com>
In the case of a NOP conversion (the precisions of the two types are
equal), factoring out the conversion can be done even if
int_fits_type_p returns false and even when the conversion is defined
by a statement inside the conditional. Since it is a NOP conversion,
no zero/sign extension happens, which is why it is ok to do this here;
the restriction was only there to prevent an extra sign/zero extend
from being moved away from its definition, and no-op conversions are
not affected by that.
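As an illustration (an assumed shape of the affected code, not
necessarily the committed phi-opt-39.c), the int -> unsigned casts
below are NOP conversions defined inside the conditional; they can now
be factored out of the PHI, which then lets the later minmax
replacement in phiopt recognize the MAX pattern:
unsigned f (int a, int b)
{
  if (a > b)
    return (unsigned) a;   /* cast defined inside the conditional */
  return (unsigned) b;
}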
Bootstrapped and tested on x86_64-linux-gnu with no regressions.
gcc/ChangeLog:
PR tree-optimization/104376
PR tree-optimization/101541
* tree-ssa-phiopt.cc (factor_out_conditional_operation):
Allow nop conversions even if it is defined by a statement
inside the conditional.
gcc/testsuite/ChangeLog:
PR tree-optimization/101541
* gcc.dg/tree-ssa/phi-opt-39.c: New test.
So this pattern needs a little help on the gimple side of things to know
what the type of the popcount should be. For most builtins the result type
is the same as the input type, but for popcount and a few others it is not.
And when the pattern is used inside another outer expression, genmatch
needs some slight help to know that the return type is 'type' rather than
the argument type.
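For example (illustrative, based on the pattern named below), the
simplification popcount(X&Y) + popcount(X|Y) -> popcount(X) + popcount(Y)
must keep the int result type even though the arguments are unsigned
long long:
int f (unsigned long long x, unsigned long long y)
{
  /* Folds to __builtin_popcountll (x) + __builtin_popcountll (y);
     the result type of popcount is int, not the argument type.  */
  return __builtin_popcountll (x & y) + __builtin_popcountll (x | y);
}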
Bootstrapped and tested on x86_64-linux-gnu with no regressions.
PR tree-optimization/111913
gcc/ChangeLog:
* match.pd (`popcount(X&Y) + popcount(X|Y)`): Add the resulting
type for popcount.
gcc/testsuite/ChangeLog:
* gcc.c-torture/compile/fold-popcount-1.c: New test.
* gcc.dg/fold-popcount-8a.c: New test.
If make_uses_available was called twice for the same use,
we could end up trying to create duplicate definitions for
the same extended live range.
gcc/
* rtl-ssa/blocks.cc (function_info::create_degenerate_phi): Check
whether the requested phi already exists.
rtl_ssa::can_insert_after didn't handle insns that can throw.
Fixing that avoids a regression with a later patch.
gcc/
* rtl-ssa.h: Include cfgbuild.h.
* rtl-ssa/movement.h (can_insert_after): Replace is_jump with the
more comprehensive control_flow_insn_p.
RTL-SSA queues up some invasive changes for later. But sometimes
the insns involved in those changes can be deleted by later
optimisations, making the queued change unnecessary. This patch
checks for that case.
gcc/
* rtl-ssa/changes.cc (function_info::perform_pending_updates): Check
whether an insn has been replaced by a note.
first_any_insn_use implicitly (but contrary to its documentation)
assumed that there was at least one use.
gcc/
* rtl-ssa/member-fns.inl (first_any_insn_use): Handle null
m_first_use.
For the V2HI -> V2SI zero extension in:
typedef unsigned short v2hi __attribute__((vector_size(4)));
typedef unsigned int v2si __attribute__((vector_size(8)));
v2si f (v2hi x) { return (v2si) {x[0], x[1]}; }
ix86_expand_sse_extend would generate:
(set (reg:V2HI 102)
(const_vector:V2HI [(const_int 0 [0])
(const_int 0 [0])]))
(set (subreg:V8HI (reg:V2HI 101) 0)
(vec_select:V8HI
(vec_concat:V16HI (subreg:V8HI (reg/v:V2HI 99 [ x ]) 0)
(subreg:V8HI (reg:V2HI 102) 0))
(parallel [(const_int 0 [0])
(const_int 8 [0x8])
(const_int 1 [0x1])
(const_int 9 [0x9])
(const_int 2 [0x2])
(const_int 10 [0xa])
(const_int 3 [0x3])
(const_int 11 [0xb])])))
(set (reg:V2SI 100)
(subreg:V2SI (reg:V2HI 101) 0))
(expr_list:REG_EQUAL (zero_extend:V2SI (reg/v:V2HI 99 [ x ])))
But using (subreg:V8HI (reg:V2HI 101) 0) as the destination of
the vec_select means that only the low 4 bytes of the destination
are stored. Only the lower half of reg 100 is well-defined.
Things tend to happen to work if the register allocator ties reg 101
to reg 100. But it caused problems with the upcoming late-combine pass
because we propagated the set of reg 100 into its uses.
gcc/
* config/i386/i386-expand.cc (ix86_split_mmx_punpck): Allow the
destination to be wider than the sources. Take the mode from the
first source.
(ix86_expand_sse_extend): Pass the destination directly to
ix86_split_mmx_punpck, rather than using a fresh register that
is half the size.
I hit an ICE in aeswidekl_operation while testing the late-combine
pass on x86. The predicate tested REGNO without first testing REG_P.
gcc/
* config/i386/predicates.md (aeswidekl_operation): Protect
REGNO check with REG_P.
This patch adds a bare-bones TARGET_INSN_COST. See the comment
in the patch for the rationale.
Just to get a flavour for how much difference it makes, I tried
compiling the testsuite with -Os -fno-schedule-insns{,2} and
seeing what effect the patch had on the number of instructions.
Very few tests changed, but all the changes were positive:
Tests Good Bad Delta Best Worst Median
===== ==== === ===== ==== ===== ======
19 19 0 -177 -52 -1 -4
The change for -O2 was even smaller, but more mixed:
Tests Good Bad Delta Best Worst Median
===== ==== === ===== ==== ===== ======
6 3 3 -8 -9 6 -2
There were no obvious effects on SPEC CPU2017.
The patch is needed to avoid a regression with a later change.
gcc/
* config/aarch64/aarch64.cc (aarch64_insn_cost): New function.
(TARGET_INSN_COST): Define.
The non-LSE pattern aarch64_atomic_exchange<mode> comes before the
LSE pattern aarch64_atomic_exchange<mode>_lse. From a recog
perspective, the only difference between the patterns is that
the non-LSE one clobbers CC and needs a scratch.
However, combine and RTL-SSA can both add clobbers to make a
pattern match. This means that if they try to rerecognise an
LSE pattern, they could end up turning it into a non-LSE pattern.
This patch adds a !TARGET_LSE test to avoid that.
This is needed to avoid a regression with later patches.
gcc/
* config/aarch64/atomics.md (aarch64_atomic_exchange<mode>): Require
!TARGET_LSE.
Calling a vget/vset intrinsic without using its return value causes a
crash, because in this case e.target is null.
This patch should be backported to releases/gcc-13.
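An illustrative reduction (the intrinsic name is taken from the RVV
intrinsics spec and is an assumption here, as is compiling with an
-march that enables the vector extension; the committed test is
gcc.target/riscv/rvv/base/pr111935.c):
#include <riscv_vector.h>
void
foo (vint32m2_t v)
{
  /* Return value deliberately ignored; this previously crashed
     because e.target was null when expanding the intrinsic.  */
  __riscv_vget_v_i32m2_i32m1 (v, 0);
}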
PR target/111935
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc: Fix bug.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/pr111935.c: New test.
ICE on the demand info of a 'vsetvli a5, 8' instruction.
The AVL is const_int 8, which ICEs in the REGNO caller.
Committed as it is an obvious fix.
PR target/111947
gcc/ChangeLog:
* config/riscv/riscv-vsetvl.cc (pre_vsetvl::compute_lcm_local_properties): Add REGNO check.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/vsetvl/pr111947.c: New test.
The PR requests an enhancement to the diagnostic issued for the use of a
poisoned identifier. Currently, we show the location of the usage, but not
the location which requested the poisoning, which would be helpful for the
user if the decision to poison an identifier was made externally, such as
in a library header.
In order to output this information, we need to remember a location_t for
each identifier that has been poisoned, and that data needs to be preserved
as well in a PCH. One option would be to add a field to struct cpp_hashnode,
but there is no convenient place to add it without increasing the size of
the struct for all identifiers. Given this facility will be needed rarely,
it seemed better to add a second hash map, which is handled PCH-wise the
same as the current one in gcc/stringpool.cc. This hash map associates a new
struct cpp_hashnode_extra with each identifier that needs one. Currently
that struct only contains the new location_t, but it could be extended in
the future if there is other ancillary data that may be convenient to put
there for other purposes.
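As a quick illustration of the user-visible effect (a hypothetical
example, not the committed testcase; the exact diagnostic wording is
assumed):
#pragma GCC poison bad_func
int bad_func;   /* error: attempt to use poisoned "bad_func" */
                /* with this patch, a note now points back at the
                   #pragma GCC poison line above */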
libcpp/ChangeLog:
PR preprocessor/36887
* directives.cc (do_pragma_poison): Store in the extra hash map the
location from which an identifier has been poisoned.
* lex.cc (identifier_diagnostics_on_lex): When issuing a diagnostic
for the use of a poisoned identifier, also add a note indicating the
location from which it was poisoned.
* identifiers.cc (alloc_node): Convert to template function.
(_cpp_init_hashtable): Handle the new extra hash map.
(_cpp_destroy_hashtable): Likewise.
* include/cpplib.h (struct cpp_hashnode_extra): New struct.
(cpp_create_reader): Update prototype to...
* init.cc (cpp_create_reader): ...accept an argument for the extra
hash table and pass it to _cpp_init_hashtable.
* include/symtab.h (ht_lookup): New overload for convenience.
* internal.h (struct cpp_reader): Add EXTRA_HASH_TABLE member.
(_cpp_init_hashtable): Adjust prototype.
gcc/c-family/ChangeLog:
PR preprocessor/36887
* c-opts.cc (c_common_init_options): Pass new extra hash map
argument to cpp_create_reader().
gcc/ChangeLog:
PR preprocessor/36887
* toplev.h (ident_hash_extra): Declare...
* stringpool.cc (ident_hash_extra): ...this new global variable.
(init_stringpool): Handle ident_hash_extra as well as ident_hash.
(ggc_mark_stringpool): Likewise.
(ggc_purge_stringpool): Likewise.
(struct string_pool_data_extra): New struct.
(spd2): New GC root variable.
(gt_pch_save_stringpool): Use spd2 to handle ident_hash_extra,
analogous to how spd is used to handle ident_hash.
(gt_pch_restore_stringpool): Likewise.
gcc/testsuite/ChangeLog:
PR preprocessor/36887
* c-c++-common/cpp/diagnostic-poison.c: New test.
* g++.dg/pch/pr36887.C: New test.
* g++.dg/pch/pr36887.Hs: New test.
This is a mechanical change to move Selector_expression up in expressions.cc.
This will make it visible to Builtin_call_expression for later work.
This produces a very large "git diff", but "git diff --minimal" is clear.
Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/536642
This changes the Expression {numeric,string,boolean}_constant_value
methods to be non-const. This does not affect anything immediately,
but will be useful for later CLs in this series.
The only real effect is to Builtin_call_expression::do_export,
which remains const and can no longer call numeric_constant_value.
But it never needed to call it, as do_export runs after do_lower,
and do_lower replaces a constant expression with the actual constant.
Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/536641
In PR111794 we miss a vectorization because on riscv type precision and
mode precision differ for mask types. We can still vectorize when
allowing assignments with the same precision for dest and source, which
is what this patch does.
gcc/ChangeLog:
PR tree-optimization/111794
* tree-vect-stmts.cc (vectorizable_assignment): Add
same-precision exception for dest and source.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/slp-mask-1.c: New test.
* gcc.target/riscv/rvv/autovec/slp-mask-run-1.c: New test.
I didn't manage to get back to the generic vectorizer fallback for
popcount, so I figured I'd rather create a popcount fallback in the
riscv backend. It uses the WWG algorithm from libgcc.
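For reference, a scalar C sketch of the WWG (Wilkes-Wheeler-Gill style)
bit-manipulation algorithm that the vectorized expander mirrors, shown
here for 32-bit elements (illustrative only, not the backend code):
unsigned int
popcount32 (unsigned int x)
{
  /* Sum adjacent bits, then 2-bit fields, then nibbles...  */
  x = x - ((x >> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x + (x >> 4)) & 0x0f0f0f0f;
  /* ...and accumulate the per-byte counts into the top byte.  */
  return (x * 0x01010101) >> 24;
}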
gcc/ChangeLog:
* config/riscv/autovec.md (popcount<mode>2): New expander.
* config/riscv/riscv-protos.h (expand_popcount): Define.
* config/riscv/riscv-v.cc (expand_popcount): Vectorize popcount
with the WWG algorithm.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/unop/popcount-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-2.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount.c: New test.
The following adjusts a leftover BIT_FIELD_REF special-casing to only
cover the cases general code doesn't handle.
PR tree-optimization/111916
* tree-sra.cc (sra_modify_assign): Do not lower all
BIT_FIELD_REF reads that are sra_handled_bf_read_p.
* gcc.dg/torture/pr111916.c: New testcase.
The change to allow SLP of non-grouped accesses failed to check
for the case of mixing with grouped accesses.
PR tree-optimization/111915
* tree-vect-slp.cc (vect_build_slp_tree_1): Check all
accesses are either grouped or not.
* gcc.dg/vect/pr111915.c: New testcase.
The following addresses a mismatch in SSA name vs. symbol when
we emit a dummy assignment when not optimizing. The temporary
we create is not remapped by initialize_inlined_parameters because
we have no easy way to get at it. The following instead emits
the additional statement after we have remapped the type of
the replacement variable.
PR ipa/111914
* tree-inline.cc (setup_one_parameter): Move code emitting
a dummy load when not optimizing ...
(initialize_inlined_parameters): ... here to after when
we remapped the parameter type.
* gcc.dg/pr111914.c: New testcase.
This was a rebase error that managed to pass testing on Darwin and
Linux (but fails on bare metal).
PR libquadmath/111928
libquadmath/ChangeLog:
* Makefile.in: Regenerate.
* configure: Regenerate.
* configure.ac: Remove AC_CHECK_LIBM.
Signed-off-by: Iain Sandoe <iain@sandoe.co.uk>
The previous patch tried to remove PHI nodes that dominated the first loop;
however, the correct fix is to only remove .MEM nodes.
This patch thus makes the condition a bit stricter and only tries to remove
.MEM PHI nodes.
I couldn't figure out a way to easily determine if a particular PHI is vUSE
related, so the patch does:
1. check if the definition is a vDEF and not defined in main loop.
2. check if the definition is a PHI and not defined in main loop.
3. check if the definition is a default definition.
For cases 2 and 3 we may misidentify the PHI, but in both cases the value
is defined outside of the loop version block, which also makes it ok to
remove.
gcc/ChangeLog:
PR tree-optimization/111860
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Drop .MEM nodes only.
gcc/testsuite/ChangeLog:
PR tree-optimization/111860
* gcc.dg/vect/pr111860-2.c: New test.
* gcc.dg/vect/pr111860-3.c: New test.
This patch moves the `(a-b) CMP 0 ? (a-b) : (b-a)` optimization
from fold_cond_expr_with_comparison to match.pd.
Bootstrapped and tested on x86_64-linux-gnu.
Changes in v2:
- Remove `(a == b) ? 0 : (b - a)` handling since it was handled
  via r14-3606-g3d86e7f4a8ae.
- Change zerop to integer_zerop for `(a - b) == 0 ? 0 : (b - a)`.
- Add `(a - b) != 0 ? (a - b) : 0` handling.
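As an illustration of the kind of source the new patterns cover (an
assumed example, not necessarily the committed phi-opt-38.c; signed
arithmetic is assumed so the result can fold to an ABS_EXPR-style form):
int f (int a, int b)
{
  int t = a - b;
  return t > 0 ? t : b - a;   /* (a - b) > 0 ? (a - b) : (b - a) */
}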
gcc/ChangeLog:
* match.pd (`(A - B) CMP 0 ? (A - B) : (B - A)`):
New patterns.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/phi-opt-38.c: New test.
While working on PR c/111903, I noticed that convert will convert
integer_zero_node to the target type after an error instead of
returning error_mark_node.
From what I can tell this was the old way of not having error
recovery, since other places in this file do return error_mark_node
and the places I am replacing date from when the file was imported
into the repository (either via a gcc2 merge or earlier).
I also had to update the objc front-end to allow for the
error_mark_node change; I suspect you could hit the ICE without this
change, though.
Bootstrapped and tested on x86_64-linux-gnu with no regressions.
gcc/ChangeLog:
* convert.cc (convert_to_pointer_1): Return error_mark_node
after an error.
(convert_to_real_1): Likewise.
(convert_to_integer_1): Likewise.
(convert_to_complex_1): Likewise.
gcc/objc/ChangeLog:
* objc-gnu-runtime-abi-01.cc (build_objc_method_call): Allow
for error_operand after call to build_c_cast.
* objc-next-runtime-abi-01.cc (build_objc_method_call): Likewise.
* objc-next-runtime-abi-02.cc (build_v2_build_objc_method_call): Likewise.
convert_to_complex, when creating a COMPLEX_EXPR, does not currently
check whether the real or imag part is error_mark_node. This later
confuses the gimplifier when a SAVE_EXPR is wrapped around that
COMPLEX_EXPR.
The simple fix is, after calling convert inside convert_to_complex_1,
to check whether either result is an error_operand and return
error_mark_node in that case.
Bootstrapped and tested on x86_64-linux-gnu with no regressions.
PR c/111903
gcc/ChangeLog:
* convert.cc (convert_to_complex_1): Return
error_mark_node if either convert was an error
when converting from a scalar.
gcc/testsuite/ChangeLog:
* gcc.target/i386/float16-8.c: New test.
The unswitching code to hoist guards inserts conditions in the wrong
places. The following fixes this, simplifying the code.
PR tree-optimization/111917
* tree-ssa-loop-unswitch.cc (hoist_guard): Always insert
new conditional after last stmt.
* gcc.dg/torture/pr111917.c: New testcase.
ICE:
during RTL pass: vsetvl
<source>: In function 'riscv_lms_f32':
<source>:240:1: internal compiler error: in merge, at config/riscv/riscv-vsetvl.cc:1997
240 | }
In general compatible_p (avl_equal_p) has:
if (next.has_vl () && next.vl_used_by_non_rvv_insn_p ())
return false;
Don't fuse the AVL of a vsetvl if the VL operand is used by non-RVV
instructions. It is reasonable to add this check to 'can_use_next_avl_p',
since we don't want to fuse the AVL of a vsetvl into a scalar move
instruction which doesn't demand an AVL. And after the fusion, we will
always use compatible_p to check whether the demand is correct or not.
PR target/111927
gcc/ChangeLog:
* config/riscv/riscv-vsetvl.cc: Fix bug.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/vsetvl/pr111927.c: New test.
The vsetvl asm check is unnecessary for the vector convert tests. We
should focus on the constraint and leave the vsetvl check to the
vsetvl pass.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/unop/cvt-0.c: Remove the vsetvl
asm check from func body.
* gcc.target/riscv/rvv/autovec/unop/cvt-1.c: Ditto.
Signed-off-by: Pan Li <pan2.li@intel.com>
./multilib.am already specifies this same command, and make warns about
the earlier one being ignored when seeing the later one. All that needs
retaining to still satisfy the preceding comment is the extra
dependency.
libatomic/
* Makefile.am (all-multi): Drop commands.
* Makefile.in: Update accordingly.
For the trunc function autovec, there is a step like the one below that
takes MU for the merge operand.
rtx tmp = gen_reg_rtx (vec_int_mode);
emit_vec_cvt_x_f_rtz (tmp, op_1, mask, vec_fp_mode);
MU leaves the unmasked elements of the tmp (aka dest) register
unchanged, and they are undefined here. This patch adjusts the
MU to MA.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (emit_vec_cvt_x_f_rtz): Add insn type
arg.
(expand_vec_trunc): Take MA instead of MU for cvt_x_f_rtz.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* doc/invoke.texi (-mexplicit-relocs=style): Document.
(-mexplicit-relocs): Document as an alias of
-mexplicit-relocs=always.
(-mno-explicit-relocs): Document as an alias of
-mexplicit-relocs=none.
(-mcmodel=extreme): Mention -mexplicit-relocs=always instead of
-mexplicit-relocs.
In these cases, if we use explicit relocs, we end up with 2
instructions:
pcalau12i t0, %pc_hi20(x)
ld.d t0, t0, %pc_lo12(x)
If we use the la.local pseudo-op, in the best scenario (x is in +/- 2MiB
range) we still have 2 instructions:
pcaddi t0, %pcrel_20(x)
ld.d t0, t0, 0
If x is out of the range we'll have 3 instructions. So for these cases
just emit machine instructions with explicit relocs.
gcc/ChangeLog:
* config/loongarch/predicates.md (symbolic_pcrel_operand): New
predicate.
* config/loongarch/loongarch.md (define_peephole2): Optimize
la.local + ld/st to pcalau12i + ld/st if the address is only used
once if -mexplicit-relocs=auto and -mcmodel=normal or medium.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/explicit-relocs-auto-single-load-store.c:
New test.
* gcc.target/loongarch/explicit-relocs-auto-single-load-store-no-anchor.c:
New test.
The linker does not know how to relax TLS access for LoongArch, so let's
emit machine instructions with explicit relocs for TLS.
gcc/ChangeLog:
* config/loongarch/loongarch.cc (loongarch_explicit_relocs_p):
Return true for TLS symbol types if -mexplicit-relocs=auto.
(loongarch_call_tls_get_addr): Replace TARGET_EXPLICIT_RELOCS
with la_opt_explicit_relocs != EXPLICIT_RELOCS_NONE.
(loongarch_legitimize_tls_address): Likewise.
* config/loongarch/loongarch.md (@tls_low<mode>): Remove
TARGET_EXPLICIT_RELOCS from insn condition.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/explicit-relocs-auto-tls-ld-gd.c: New
test.
* gcc.target/loongarch/explicit-relocs-auto-tls-le-ie.c: New
test.
If we are performing LTO for a final link and linker plugin is enabled,
then we are sure any GOT access may resolve to a symbol out of the link
unit (otherwise the linker plugin will tell us the symbol should be
resolved locally and we'll use PC-relative access instead).
Produce machine instructions with explicit relocs instead of la.global
for better scheduling.
gcc/ChangeLog:
* config/loongarch/loongarch-protos.h
(loongarch_explicit_relocs_p): Declare new function.
* config/loongarch/loongarch.cc (loongarch_explicit_relocs_p):
Implement.
(loongarch_symbol_insns): Call loongarch_explicit_relocs_p for
SYMBOL_GOT_DISP, instead of using TARGET_EXPLICIT_RELOCS.
(loongarch_split_symbol): Call loongarch_explicit_relocs_p for
deciding if return early, instead of using
TARGET_EXPLICIT_RELOCS.
(loongarch_output_move): Call loongarch_explicit_relocs_p
instead of using TARGET_EXPLICIT_RELOCS.
* config/loongarch/loongarch.md (*low<mode>): Remove
TARGET_EXPLICIT_RELOCS from insn condition.
(@ld_from_got<mode>): Likewise.
* config/loongarch/predicates.md (move_operand): Call
loongarch_explicit_relocs_p instead of using
TARGET_EXPLICIT_RELOCS.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/explicit-relocs-auto-lto.c: New test.
To take a better balance between scheduling and relaxation when -flto is
enabled, add three-way -mexplicit-relocs={auto,none,always} options.
The old -mexplicit-relocs and -mno-explicit-relocs options are still
supported, they are mapped to -mexplicit-relocs=always and
-mexplicit-relocs=none.
The default choice is determined by probing assembler capabilities at
build time. If the assembler does not supports explicit relocs at all,
the default will be none; if it supports explicit relocs but not
relaxation, the default will be always; if both explicit relocs and
relaxation are supported, the default will be auto.
Currently auto is the same as none. We will make auto more clever in
following changes.
gcc/ChangeLog:
* config/loongarch/genopts/loongarch-strings: Add strings for
-mexplicit-relocs={auto,none,always}.
* config/loongarch/genopts/loongarch.opt.in: Add options for
-mexplicit-relocs={auto,none,always}.
* config/loongarch/loongarch-str.h: Regenerate.
* config/loongarch/loongarch.opt: Regenerate.
* config/loongarch/loongarch-def.h
(EXPLICIT_RELOCS_AUTO): Define.
(EXPLICIT_RELOCS_NONE): Define.
(EXPLICIT_RELOCS_ALWAYS): Define.
(N_EXPLICIT_RELOCS_TYPES): Define.
* config/loongarch/loongarch.cc
(loongarch_option_override_internal): Error out if the old-style
-m[no-]explicit-relocs option is used with
-mexplicit-relocs={auto,none,always} together. Map
-mno-explicit-relocs to -mexplicit-relocs=none and
-mexplicit-relocs to -mexplicit-relocs=always for backward
compatibility. Set a proper default for -mexplicit-relocs=
based on configure-time probed linker capability. Update a
diagnostic message to mention -mexplicit-relocs=always instead
of the old-style -mexplicit-relocs.
(loongarch_handle_model_attribute): Update a diagnostic message
to mention -mexplicit-relocs=always instead of the old-style
-mexplicit-relocs.
* config/loongarch/loongarch.h (TARGET_EXPLICIT_RELOCS): Define.
While fixing an issue, I found a typo in the VSETVL pass:
change 'use_by' into 'used_by'.
Committed as it is very obvious.
gcc/ChangeLog:
* config/riscv/riscv-vsetvl.cc (pre_vsetvl::fuse_local_vsetvl_info): Fix typo.
(pre_vsetvl::pre_global_vsetvl_info): Ditto.
The main_test function returns void, so a return statement with an
expression is a constraint violation. The test case still fails with this
change applied before the fix for PR 93262 in r14-4813.
gcc/testsuite/
* gcc.c-torture/execute/builtins/pr93262-chk.c (main_test):
Remove unnecessary return statement.