Update STORE_MAX_PIECES to allow 16/32/64 bytes only if inter-unit moves
to vector registers are enabled, since the vec_duplicate broadcast used to
implement store_by_pieces of 16/32/64 bytes requires such a move.
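For illustration, a hypothetical example (not the new tests): a 32-byte
constant fill like the one below may be expanded by store_by_pieces using a
register broadcast (vec_duplicate), which needs a GPR-to-vector move.

  /* Hypothetical example, not one of the committed testcases.  */
  void
  fill32 (char *p)
  {
    __builtin_memset (p, 0x41, 32);
  }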
gcc/
PR target/101742
* config/i386/i386.h (STORE_MAX_PIECES): Allow 16/32/64 bytes
only if TARGET_INTER_UNIT_MOVES_TO_VEC is true.
gcc/testsuite/
PR target/101742
* gcc.target/i386/pr101742a.c: New test.
* gcc.target/i386/pr101742b.c: Likewise.
To avoid stack realignment, call ix86_gen_scratch_sse_rtx to get a
scratch SSE register to copy data with an SSE register from one
memory location to another.
gcc/
PR target/101772
* config/i386/i386-expand.c (ix86_expand_vector_move): Call
ix86_gen_scratch_sse_rtx to get a scratch SSE register to copy
data with an SSE register from one memory location to another.
gcc/testsuite/
PR target/101772
* gcc.target/i386/eh_return-2.c: New test.
This patch makes use of the vector permute double immediate
instruction for constant permute vectors.
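For illustration, a hypothetical example (not the new test): a constant
permutation selecting one doubleword from each operand, as below, should now
be expandable with a single vpdi.

  /* Hypothetical example, not the committed testcase.  */
  typedef unsigned long long v2di __attribute__ ((vector_size (16)));

  v2di
  permute (v2di a, v2di b)
  {
    return __builtin_shuffle (a, b, (v2di) { 1, 2 });
  }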
gcc/ChangeLog:
* config/s390/s390.c (expand_perm_with_vpdi): New function.
(vectorize_vec_perm_const_1): Call expand_perm_with_vpdi.
* config/s390/vector.md (*vpdi1<mode>, @vpdi1<mode>): Enable a
parameterized expander.
(*vpdi4<mode>, @vpdi4<mode>): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/s390/vector/perm-vpdi.c: New test.
This patch implements the TARGET_VECTORIZE_VEC_PERM_CONST in the IBM Z
backend. The initial implementation only exploits the vector merge
instruction but there is more to come.
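For illustration, a hypothetical example (not the new tests): an interleaving
permutation like the one below corresponds to a vector merge (high) operation.

  /* Hypothetical example, not one of the committed testcases.  */
  typedef unsigned int v4si __attribute__ ((vector_size (16)));

  v4si
  merge_hi (v4si a, v4si b)
  {
    return __builtin_shuffle (a, b, (v4si) { 0, 4, 1, 5 });
  }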
gcc/ChangeLog:
* config/s390/s390.c (MAX_VECT_LEN): Define macro.
(struct expand_vec_perm_d): Define struct.
(expand_perm_with_merge): New function.
(vectorize_vec_perm_const_1): New function.
(s390_vectorize_vec_perm_const): New function.
(TARGET_VECTORIZE_VEC_PERM_CONST): Define target macro.
gcc/testsuite/ChangeLog:
* gcc.target/s390/vector/perm-merge.c: New test.
* gcc.target/s390/vector/vec-types.h: New test.
The patch gets rid of the unspec used for the vector permute double
immediate instruction and replaces it with generic rtx.
gcc/ChangeLog:
* config/s390/s390.md (UNSPEC_VEC_PERMI): Remove constant
definition.
* config/s390/vector.md (*vpdi1<mode>, *vpdi4<mode>): New pattern
definitions.
* config/s390/vx-builtins.md (*vec_permi<mode>): Emit generic rtx
instead of an unspec.
gcc/testsuite/ChangeLog:
* gcc.target/s390/zvector/vec-permi.c: Removed.
* gcc.target/s390/zvector/vec_permi.c: New test.
This patch gets rid of the unspecs we were using for the vector merge
instruction and replaces them with generic rtx.
gcc/ChangeLog:
* config/s390/s390-modes.def: Add more vector modes to support
concatenation of two vectors.
* config/s390/s390-protos.h (s390_expand_merge_perm_const): Add
prototype.
(s390_expand_merge): Likewise.
* config/s390/s390.c (s390_expand_merge_perm_const): New function.
(s390_expand_merge): New function.
* config/s390/s390.md (UNSPEC_VEC_MERGEH, UNSPEC_VEC_MERGEL):
Remove constant definitions.
* config/s390/vector.md (V_HW_2): Add mode iterators.
(VI_HW_4, V_HW_4): Rename VI_HW_4 to V_HW_4.
(vec_2x_nelts, vec_2x_wide): New mode attributes.
(*vmrhb, *vmrlb, *vmrhh, *vmrlh, *vmrhf, *vmrlf, *vmrhg, *vmrlg):
New pattern definitions.
(vec_widen_umult_lo_<mode>, vec_widen_umult_hi_<mode>)
(vec_widen_smult_lo_<mode>, vec_widen_smult_hi_<mode>)
(vec_unpacks_lo_v4sf, vec_unpacks_hi_v4sf, vec_unpacks_lo_v2df)
(vec_unpacks_hi_v2df): Adjust expanders to emit non-unspec RTX for
vec merge.
* config/s390/vx-builtins.md (V_HW_4): Remove mode iterator. Now
in vector.md.
(vec_mergeh<mode>, vec_mergel<mode>): Use s390_expand_merge to
emit vec merge pattern.
gcc/testsuite/ChangeLog:
* gcc.target/s390/vector/long-double-asm-in-out-hard-fp-reg.c:
vmrlg and vmrhg are now used instead of vpdi with immediates 0 and 5.
* gcc.target/s390/vector/long-double-asm-inout-hard-fp-reg.c: Likewise.
* gcc.target/s390/zvector/vec-types.h: New test.
* gcc.target/s390/zvector/vec_merge.c: New test.
The Neon multiply/multiply-accumulate/multiply-subtract instructions
can select the top or bottom half of the operand registers. This
selection does not change the cost of the underlying instruction and
this should be reflected by the RTL cost function.
This patch adds RTL tree traversal in the Neon multiply cost function
to match vec_select high-half of its operands. This traversal
prevents the cost of the vec_select from being added into the cost of
the multiply - meaning that these instructions can now be emitted in
the combine pass as they are no longer deemed prohibitively
expensive.
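As a hypothetical example (not the new test), a widening multiply of the high
halves written with explicit high-half extraction, as below, should now be
cheap enough for combine to fuse into a single multiply-high instruction.

  /* Hypothetical example, not the committed testcase.  */
  #include <arm_neon.h>

  int32x4_t
  mull_high (int16x8_t a, int16x8_t b)
  {
    return vmull_s16 (vget_high_s16 (a), vget_high_s16 (b));
  }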
gcc/ChangeLog:
2021-07-19 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64.c (aarch64_strip_extend_vec_half):
Define.
(aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
vec_select high-half from being added into Neon multiply
cost.
* rtlanal.c (vec_series_highpart_p): Define.
* rtlanal.h (vec_series_highpart_p): Declare.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/vmul_high_cost.c: New test.
The Neon multiply/multiply-accumulate/multiply-subtract instructions
can take various forms - multiplying full vector registers of values
or multiplying one vector by a single element of another. Regardless
of the form used, these instructions have the same cost, and this
should be reflected by the RTL cost function.
This patch adds RTL tree traversal in the Neon multiply cost function
to match the vec_select used by the lane-referencing forms of the
instructions already mentioned. This traversal prevents the cost of
the vec_select from being added into the cost of the multiply -
meaning that these instructions can now be emitted in the combine
pass as they are no longer deemed prohibitively expensive.
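As a hypothetical example (not the new test), a multiply by a duplicated lane
of another vector, as below, should now be cheap enough for combine to fuse
into the lane-referencing form of the instruction.

  /* Hypothetical example, not the committed testcase.  */
  #include <arm_neon.h>

  int32x4_t
  mull_lane (int16x4_t a, int16x8_t b)
  {
    return vmull_s16 (a, vdup_laneq_s16 (b, 3));
  }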
gcc/ChangeLog:
2021-07-19 Jonathan Wright <jonathan.wright@arm.com>
* config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
Define.
(aarch64_rtx_mult_cost): Traverse RTL tree to prevent
vec_select cost from being added into Neon multiply cost.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/vmul_element_cost.c: New test.
This patch uses a more accurate scalar iteration estimate when
comparing the epilogue of a constant-iteration loop with a candidate
replacement epilogue.
In the testcase, the patch prevents a 1-to-3-element SVE epilogue
from seeming better than a 64-bit Advanced SIMD epilogue.
gcc/
* tree-vect-loop.c (vect_better_loop_vinfo_p): Detect cases in
which old_loop_vinfo is an epilogue loop that handles a constant
number of iterations.
gcc/testsuite/
* gcc.target/aarch64/sve/cost_model_12.c: New test.
After vect_analyze_loop has successfully analysed a loop for
one base vector mode B1, it considers using following base vector
modes to vectorise an epilogue. However, for VECT_COMPARE_COSTS,
a later mode B2 might turn out to be better than B1 was. Initially
this comparison will be between an epilogue loop (for B2) and a main
loop (for B1). However, in r11-6458 I'd added code to reanalyse the
B2 epilogue loop as a main loop, partly for correctness and partly
for better costing.
This can lead to a situation in which we think that the B2 epilogue
loop was better than the B1 main loop, but that the B2 main loop is
not better than the B1 main loop. There was no dump message to say
that this had happened, which made it look like B2 had still won.
gcc/
* tree-vect-loop.c (vect_analyze_loop): Print a dump message
when a reanalyzed loop fails to be cheaper than the current
main loop.
Ignored function decls that are compiled at the start of
the assembly have bogus line numbers until the first .file
directive, as reported in PR101575.
The corresponding binutils bug report is
https://sourceware.org/bugzilla/show_bug.cgi?id=28149
The workaround for this issue is to emit a dummy .file
directive before the first function is compiled, unless
another .file directive was already emitted previously.
2021-08-04 Bernd Edlinger <bernd.edlinger@hotmail.de>
PR ada/101575
* dwarf2out.c (dwarf2out_assembly_start): Emit a dummy
.file statement when needed.
I believe PR101750 to be a testism. Fix it by giving the class a name.
gcc/testsuite/ChangeLog:
PR tree-optimization/101750
* g++.dg/vect/pr99149.cc: Name class.
This adds a gather vectorization capability to the vectorizer
without target support by decomposing the offset vector, doing
scalar loads and then building a vector from the result. This
is aimed mainly at cases where vectorizing the rest of the loop
offsets the cost of vectorizing the gather.
Note it's difficult to avoid vectorizing the offset load, but in
some cases later passes can turn the vector load + extract into
scalar loads, see the followup patch.
On SPEC CPU 2017 510.parest_r this improves runtime from 250s
to 219s on a Zen2 CPU which has its native gather instructions
disabled (using those the runtime instead increases to 254s)
using -Ofast -march=znver2 [-flto]. It turns out the critical
loops in this benchmark all perform gather operations.
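A hypothetical kernel of the kind this targets (not the new testcase):

  /* Hypothetical example, not the committed testcase.  */
  double
  gather_sum (const double *data, const int *idx, int n)
  {
    double sum = 0.;
    for (int i = 0; i < n; ++i)
      sum += data[idx[i]];
    return sum;
  }

Without native gather support the indexed load can now still be vectorized by
extracting the offsets, doing scalar loads and building the result vector.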
2021-07-30 Richard Biener <rguenther@suse.de>
* tree-vect-data-refs.c (vect_check_gather_scatter):
Include widening conversions only when the result is
still handled by a native gather or the current offset
size does not already match the data size.
Also succeed analysis in case there's no native support,
noted by an IFN_LAST ifn and a NULL decl.
(vect_analyze_data_refs): Always consider gathers.
* tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
Test for no IFN gather rather than decl gather.
* tree-vect-stmts.c (vect_model_load_cost): Pass in the
gather-scatter info and cost emulated gathers accordingly.
(vect_truncate_gather_scatter_offset): Properly test for
no IFN gather.
(vect_use_strided_gather_scatters_p): Likewise.
(get_load_store_type): Handle emulated gathers and their
restrictions.
(vectorizable_load): Likewise. Emulate them by extracting
scalar offsets, doing scalar loads and a vector construct.
* gcc.target/i386/vect-gather-1.c: New testcase.
* gfortran.dg/vect/vect-8.f90: Adjust.
Pass MAX_PIECES to op_by_pieces_d::op_by_pieces_d for move, store and
compare.
PR target/101742
* expr.c (op_by_pieces_d::op_by_pieces_d): Add a max_pieces
argument to set m_max_size.
(move_by_pieces_d): Pass MOVE_MAX_PIECES to op_by_pieces_d.
(store_by_pieces_d): Pass STORE_MAX_PIECES to op_by_pieces_d.
(compare_by_pieces_d): Pass COMPARE_MAX_PIECES to op_by_pieces_d.
The easiest way to motivate these additions to match.pd is with the
following example:
unsigned int foo(unsigned char i) {
return i | (i<<8) | (i<<16) | (i<<24);
}
which mainline with -O2 on x86_64 currently generates:
foo: movzbl %dil, %edi
movl %edi, %eax
movl %edi, %edx
sall $8, %eax
sall $16, %edx
orl %edx, %eax
orl %edi, %eax
sall $24, %edi
orl %edi, %eax
ret
but with this patch now becomes:
foo: movzbl %dil, %eax
imull $16843009, %eax, %eax
ret
Interestingly, this transformation is already applied when using
addition, allowing synth_mult to select an optimal sequence, but
not when using the equivalent bit-wise ior or xor operators.
The solution is to use tree_nonzero_bits to check that the
potentially non-zero bits of each operand don't overlap, which
ensures that BIT_IOR_EXPR and BIT_XOR_EXPR produce the same
results as PLUS_EXPR, which effectively generalizes the old
fold_plusminus_mult_expr. Technically, the transformation
is to canonicalize (X*C1)|(X*C2) and (X*C1)^(X*C2) to
X*(C1+C2) where X and X<<C are considered special cases.
2021-08-04 Roger Sayle <roger@nextmovesoftware.com>
Marc Glisse <marc.glisse@inria.fr>
gcc/ChangeLog
* match.pd (bit_ior, bit_xor): Canonicalize (X*C1)|(X*C2) and
(X*C1)^(X*C2) as X*(C1+C2), and related variants, using
tree_nonzero_bits to ensure that operands are bit-wise disjoint.
gcc/testsuite/ChangeLog
* gcc.dg/fold-ior-4.c: New test.
This teaches forwprop to rewrite more vector loads that are only
used in BIT_FIELD_REFs as scalar loads. This provides the
remaining uplift to SPEC CPU 2017 510.parest_r on Zen 2 which
has CPU gathers disabled.
In particular vector load + vec_unpack + bit-field-ref is turned
into (extending) scalar loads which avoids costly XMM/GPR
transitions. To avoid conflicting with the matching of vector load
+ bit-field-ref + vector constructor to vector load + shuffle, the
extended transform is only done after vector lowering.
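A hypothetical illustration of the simplest newly handled case (a load with a
supported vector mode used only in BIT_FIELD_REFs):

  /* Hypothetical example, not taken from the benchmark.  */
  typedef int v4si __attribute__ ((vector_size (16)));

  long
  sum_two (const v4si *p)
  {
    v4si v = *p;
    return (long) v[0] + (long) v[1];
  }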
2021-07-30 Richard Biener <rguenther@suse.de>
* tree-ssa-forwprop.c (pass_forwprop::execute): Split
out code to decompose vector loads ...
(optimize_vector_load): ... here. Generalize it to
handle intermediate widening and TARGET_MEM_REF loads
and apply it to loads with a supported vector mode as well.
The following avoids vectorizing MIN/MAX reductions on bools which,
when ending up as vector(2) <signed-boolean:64>, would need to be
adjusted because of the sign change. The fix instead avoids any
reduction vectorization where the result isn't compatible with
the original scalar type, since we don't compensate for that
either.
2021-08-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/101756
* tree-vect-slp.c (vectorizable_bb_reduc_epilogue): Make sure
the result of the reduction epilogue is compatible to the original
scalar result.
* gcc.dg/vect/bb-slp-pr101756.c: New testcase.
When parsing default arguments, we need to temporarily clear parser->omp_declare_simd
and parser->oacc_routine, otherwise they can clash with further declarations
inside of e.g. lambdas inside of those default arguments.
2021-08-04 Jakub Jelinek <jakub@redhat.com>
PR c++/101759
* parser.c (cp_parser_default_argument): Temporarily override
parser->omp_declare_simd and parser->oacc_routine to NULL.
* g++.dg/gomp/pr101759.C: New test.
* g++.dg/goacc/pr101759.C: New test.
The file has two identical halves; it looks like a patch was applied twice.
2021-08-04 Jakub Jelinek <jakub@redhat.com>
* gcc.c-torture/execute/ieee/pr29302-1.x: Undo doubly applied patch.
The define_peephole2 added by r12-2640-gf7bf03cf69ccb7dc should only
operate on general registers. Since x86 also supports mov instructions
between GPRs, SSE registers and mask registers, limit the peephole2
predicate to general_reg_operand.
gcc/ChangeLog:
PR target/101743
* config/i386/i386.md (peephole2): Refine predicate from
register_operand to general_reg_operand.
The file has two identical halves; it looks like a patch was applied twice.
2021-08-04 Jakub Jelinek <jakub@redhat.com>
* config/t-slibgcc-fuchsia: Undo doubly applied patch.
This makes tail recursion optimization produce a loop structure
manually rather than relying on loop fixup. That also allows the
loop to be marked as finite (it would eventually blow the stack
if it were not).
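A hypothetical example of the kind of function affected (not the new testcase):

  /* Hypothetical example, not the committed testcase.  */
  int
  sum_to (int n, int acc)
  {
    if (n == 0)
      return acc;
    return sum_to (n - 1, acc + n);
  }

The self tail call becomes a loop; that loop is now created and registered in
the loop tree explicitly and marked finite instead of relying on loop fixup.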
2021-08-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/101769
* tree-tailcall.c (eliminate_tail_call): Add the created loop
for the first recursion and return it via the new output parameter.
(optimize_tail_call): Pass through new output param.
(tree_optimize_tail_calls_1): After creating all latches,
add the created loop to the loop tree. Do not mark loops for fixup.
* g++.dg/tree-ssa/pr101769.C: New testcase.
Previous CLs add new language constructs in Go 1.17, specifically,
unsafe.Add, unsafe.Slice, and conversion from a slice to a pointer
to an array. This CL handles them in the escape analysis.
At the point of the escape analysis, unsafe.Add and unsafe.Slice
are still builtin calls, so just handle them in data flow.
Conversion from a slice to a pointer to an array has already been
lowered to a combination of compound expression, conditional
expression and slice info expressions, so handle them in the
escape analysis.
Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/339671
The only difference between selectnbrecv and selectnbrecv2 is that the latter
sets the input pointer value from the second return value of chanrecv.
So by making selectnbrecv return two values from chanrecv, we can get
rid of selectnbrecv2; the compiler can now call only selectnbrecv and
generate simpler code.
This is the gofrontend version of https://golang.org/cl/292890.
Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/339529
The "e" constraint character is the prefix of the "es" and "eI" constraints.
2021-08-03 Segher Boessenkool <segher@kernel.crashing.org>
* config/rs6000/constraints.md: Remove "e" from the list of available
constraint characters.
The histogram value for indirect calls was incorrectly set up.
That is fixed now.
With this change the tree-prof tests checking indirect call inlining with AutoFDO
in gcc.dg and g++.dg are passing.
Resolves:
PR gcov-profile/71672 - inlining indirect calls does not work with autofdo
gcc/ChangeLog:
PR gcov-profile/71672
* auto-profile.c (afdo_indirect_call): Fix setup of the histogram value for indirect calls.
* The create_gcov tool doesn't currently support DWARF 5, so I made a change in profopt.exp
to pass -gdwarf-4 when compiling the binary to profile.
* I updated the invocation of create_gcov in profopt.exp to pass -gcov_version=2.
I recently made a change to create_gcov to support version 2:
https://github.com/google/autofdo/pull/117 .
* I removed useless -o perf.data from the invocation of gcc-auto-profile in
target-supports.exp.
These changes contribute to fixing PR gcov-profile/71672.
gcc/testsuite/ChangeLog:
* lib/profopt.exp: Pass -gdwarf-4 when compiling the test to profile; pass -gcov_version=2.
* lib/target-supports.exp: Remove unnecessary -o perf.data passed to gcc-auto-profile.
indir-call-prof-2.c has -fno-early-inlining but AutoFDO can't work without
early inlining (it needs to match the inlining of the profiled binary).
I changed profopt.exp to always pass -fearly-inlining for AutoFDO.
With that change the indirect call inlining in indir-call-prof-2.c happens in the early inliner
so I changed the dg-final-use-autofdo.
Contributes to fixing PR gcov-profile/71672
gcc/testsuite/ChangeLog:
* gcc.dg/tree-prof/indir-call-prof-2.c: Fix dg-final-use-autofdo.
* lib/profopt.exp: Pass -fearly-inlining when compiling with AutoFDO.
* Changed several tests to use -fdump-ipa-afdo-optimized instead of -fdump-ipa-afdo
in dg-options so that the expected output can be found.
* Increased the number of iterations in several tests so that perf can have
enough sampling events.
Contributes to fixing PR gcov-profile/71672.
gcc/testsuite/ChangeLog:
* g++.dg/tree-prof/indir-call-prof.C: Fix options, increase the number of iterations.
* g++.dg/tree-prof/morefunc.C: Fix options, increase the number of iterations.
* g++.dg/tree-prof/reorder.C: Fix options, increase the number of iterations.
* gcc.dg/tree-prof/indir-call-prof-2.c: Fix options, increase the number of iterations.
* gcc.dg/tree-prof/indir-call-prof.c: Fix options.
Resolves:
PR testsuite/101688 - g++.dg/warn/Wstringop-overflow-4.C fails on 32-bit archs with new jump threader
gcc/testsuite:
PR testsuite/101688
* g++.dg/warn/Wstringop-overflow-4.C: Disable a test case in ILP32.
Copy the test for _mm_minpos_epu16 from
gcc/testsuite/gcc.target/i386/sse4_1-phminposuw.c, with
a few adjustments:
- Adjust the dejagnu directives for powerpc platform.
- Make the data not be monotonically increasing,
such that some of the returned values are not
always the first value (index 0).
- Create a list of input data testing various scenarios
including more than one minimum value and different
orders and indices of the minimum value.
- Fix a masking issue where the index was being truncated
to 2 bits instead of 3 bits, which wasn't found because
all of the returned indices were 0 with the original
generated data.
- Support big-endian.
2021-08-03 Paul A. Clarke <pc@us.ibm.com>
gcc/testsuite
* gcc.target/powerpc/sse4_1-phminposuw.c: Copy from
gcc/testsuite/gcc.target/i386, adjust dg directives to suit,
make more robust.
Add a naive implementation of the subject x86 intrinsic to
ease porting.
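A hypothetical scalar sketch of the intrinsic's semantics (the actual header
addition is a vector implementation): find the minimum of the eight unsigned
16-bit elements and the index of its first occurrence.

  /* Hypothetical scalar model, not the code added to smmintrin.h.  */
  #include <stdint.h>

  static void
  minpos_epu16_model (const uint16_t v[8], uint16_t *min, uint16_t *idx)
  {
    *min = v[0];
    *idx = 0;
    for (uint16_t i = 1; i < 8; i++)
      if (v[i] < *min)
        {
          *min = v[i];
          *idx = i;
        }
  }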
2021-08-03 Paul A. Clarke <pc@us.ibm.com>
gcc
* config/rs6000/smmintrin.h (_mm_minpos_epu16): New.
In C++17 the out-of-class definitions for static constexpr variables are
redundant, because they are implicitly inline. This change avoids
"redundant redeclaration" warnings from -Wsystem-headers -Wdeprecated.
Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:
* include/bits/random.tcc (linear_congruential_engine): Do not
define static constexpr members when they are implicitly inline.
* include/std/ratio (ratio, __ratio_multiply, __ratio_divide)
(__ratio_add, __ratio_subtract): Likewise.
* include/std/type_traits (integral_constant): Likewise.
* testsuite/26_numerics/random/pr60037-neg.cc: Adjust dg-error
line number.
This adds a partial specialization of allocator_traits, similar to what
was already done for std::allocator. This means that most uses of
polymorphic_allocator via the traits can avoid the metaprogramming
overhead needed to deduce the properties from polymorphic_allocator.
In addition, I'm changing polymorphic_allocator::delete_object to invoke
the destructor (or pseudo-destructor) directly, rather than calling
allocator_traits::destroy, which calls polymorphic_allocator::destroy
(which is deprecated). This is observable if a user has specialized
allocator_traits<polymorphic_allocator<Foo>> and expects to see its
destroy member function called. I consider explicit specializations of
allocator_traits to be wrong-headed, and this use case seems unnecessary
to support. So delete_object just invokes the destructor directly.
Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:
* include/std/memory_resource (polymorphic_allocator::delete_object):
Call destructor directly instead of using destroy.
(allocator_traits<polymorphic_allocator<T>>): Define partial
specialization.