Best GCC command options for Pandora compilation (Sourcery toolchain)

compiler options ARM GCC cortex optimization tuning


#1 OFFLINE   PatientFan


Posted 13 February 2012 - 10:35 PM

[Last updated 2012-03-04]

This is an effort to establish the optimal GCC command options for cross-compiling for the Pandora on another computer. All contributions are welcome!

This should be seen as a sane starting point; it should not stop anyone from experimenting and posting their experience here.
Note that some options may need to be set (or avoided) depending on the characteristics of the application being compiled.

You should also read the replies in this thread; they contain interesting information.

These are options specific to the Pandora's hardware and software and should (in theory) always produce the best results:

arm-none-linux-gnueabi-g++ \
  -pipe \
  -march=armv7-a \
  -mcpu=cortex-a8 \
  -mtune=cortex-a8 \
  -mfpu=neon \
  ...

These are some good optimization options to try:

  -O2 \          	# standard optimizations, should always be a safe bet, you may also want to try -O3
  -fno-exceptions \	# if it does not cause errors, USE IT: omits support for C++ try/catch exception handling [thanks to foxblock]
  -fno-rtti \    	# if it does not cause errors, USE IT: omits support for RTTI (Run-Time Type Information) [thanks to foxblock]
  ...

Originally this post was only about the free Sourcery toolchain by Mentor Graphics, but I have since learned that Sourcery is essentially GCC.
Mentor Graphics thankfully also offers the documentation for the toolchain for download.
Chapter 3.17.2 ARM Options in the compiler manual (PDF) seems like a good start for relevant options.
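
Putting the pieces together, here is a minimal sketch of a full cross-compile invocation. The file names hello.cpp and hello are placeholders, -mfloat-abi=softfp follows the replies below, and the -fno-* flags assume your code builds without exceptions/RTTI:

arm-none-linux-gnueabi-g++ \
  -pipe \
  -march=armv7-a -mcpu=cortex-a8 -mtune=cortex-a8 \
  -mfpu=neon -mfloat-abi=softfp \
  -O2 -fno-exceptions -fno-rtti \
  -o hello hello.cpp    # placeholder file names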

#2 OFFLINE   Ivanovic


Posted 13 February 2012 - 11:06 PM

The toolchain you are talking about is basically just a nicely packaged GCC, so the "manual" is basically just the GCC manual. The problem with the options is that anything besides the stock options has the potential to break some stuff (otherwise it would be the default!). I won't be able to tell you a "perfect" set of options, but this is what I use to build Wesnoth using the cross-compiler toolchain based on CodeSourcery (by the way, I just updated my toolchain installer, available at git.openpandora.org, to make use of the latest version, which now relies on GCC 4.6.1):
CFLAGS="-DPANDORA -O2 -pipe -march=armv7-a -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -fno-inline-functions" CXXFLAGS="-DPANDORA -O2 -pipe -march=armv7-a -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -fno-inline-functions"

Some stuff like "-ffast-math -fsingle-precision-constant" is possible, too, but at least in the case of Wesnoth this *might* lead to problems when playing a multiplayer game against users who don't build with these options (e.g. because they are on an x86-based system, which has fast enough math and float performance anyway).

To get a real speedup you basically have to change the code so that you don't rely on float operations and the like. What is most promising is replacing bottlenecks with NEON operations, as notaz has done for pcsxrearmed.
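
As a rough sketch of how CFLAGS/CXXFLAGS like these are typically fed into a configure-based cross build (the toolchain prefix, --host triplet and install prefix here are assumptions, not the exact Wesnoth setup):

export CC=arm-none-linux-gnueabi-gcc
export CXX=arm-none-linux-gnueabi-g++
export CFLAGS="-DPANDORA -O2 -pipe -march=armv7-a -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp"
export CXXFLAGS="$CFLAGS"
./configure --host=arm-none-linux-gnueabi --prefix=/opt/pandora    # hypothetical prefix
make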

#3 OFFLINE   Exophase


Posted 14 February 2012 - 02:10 AM

Code Sourcery is where a lot of gcc's ARM development happens, and it takes a while for it to get merged upstream. On the flip side, it's buggier/less stable than mainline gcc. Still worth trying out, though.

#4 OFFLINE   Ivanovic


Posted 14 February 2012 - 08:50 AM

Code Sourcery is where a lot of gcc's ARM development happens, and it takes a while for it to get merged upstream. On the flip side, it's buggier/less stable than mainline gcc. Still worth trying out, though.

All I wanted to say is that the manual is basically identical to the gcc online manual. Regarding the ARM options, check this page: http://gcc.gnu.org/o...RM-Options.html

#5 OFFLINE   PatientFan


Posted 16 February 2012 - 11:03 PM

The toolchain you are talking about is basically just a nicely packaged gcc.
...
CFLAGS="-DPANDORA -O2 -pipe -march=armv7-a -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -fno-inline-functions" CXXFLAGS="-DPANDORA -O2 -pipe -march=armv7-a -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -fno-inline-functions"


Thanks a lot for your input! I have updated the first post.

#6 OFFLINE   notaz


Posted 16 February 2012 - 11:46 PM

-ftree-vectorize

This very rarely does any good (at least in my projects), and it historically had many bugs; I don't know how it goes these days.

-fno-inline-functions

Why? Function inlining is usually a good thing.

To get a real speedup you basically have to change code so that you don't rely on float operations and stuff like this. What is most promising is replacing bottlenecks by neon operations as Notaz has done for pcsxrearmed.

There are almost no float ops used in rearmed; or did you have integer NEON in mind?

#7 OFFLINE   Linux-SWAT


Posted 17 February 2012 - 08:57 AM

Thanks for sharing, I've wondered about some of those options.

#8 OFFLINE   Ziz


Posted 17 February 2012 - 11:54 AM

These are the optimizations I use. Most of them are snake oil, but... they don't make it slower, and I like long gcc calls ^^

-O3
-fsingle-precision-constant
-ffast-math
-fgcse-sm
-fsched-spec-load
-fmodulo-sched
-fgcse-las
-ftracer
-funsafe-loop-optimizations -Wunsafe-loop-optimizations
-fvariable-expansion-in-unroller

And do I really need these?
-march=armv7-a -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
I thought the compiler already compiles with these parameters automatically.


#9 OFFLINE   foxblock


Posted 02 March 2012 - 03:35 PM

I like the idea behind this thread, thanks for starting it!
I just copied my gcc options from other projects and always wondered what some of them did, or rather whether they did any good and, if so, in which scenarios (like the mentioned -ftree-vectorize). So I would love to see not only a list of good options, but also an explanation of why (and maybe when not) to set them.

Another tip, which might be useful (and not obvious to novice programmers), though not for everyone:
When using C++ you can manually disable features like exceptions (which I personally don't use, as I find C++ exceptions horribly lacking) and RTTI (Run-Time Type Information, used to get information about the type of an object at runtime), which you might not need.
-fno-exceptions
-fno-rtti
will do the job.
It got me a 5-10% speed boost (which is a pretty arbitrary number obviously as it depends on your code), but at the very least it decreases the size of your binary.
Keep in mind that you might run into errors when disabling these if libraries used in your project depend on those features.
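
A quick sketch of how to measure the size effect on your own code (main.cpp here is a placeholder for any C++ source that still builds with the flags):

# build once with and once without the flags, then compare section sizes
arm-none-linux-gnueabi-g++ -O2 -o with_eh main.cpp
arm-none-linux-gnueabi-g++ -O2 -fno-exceptions -fno-rtti -o without_eh main.cpp
arm-none-linux-gnueabi-size with_eh without_eh    # compare text/data sizes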



#10 OFFLINE   PatientFan


Posted 04 March 2012 - 06:56 PM


-ftree-vectorize

This very rarely does any good (at least in my projects), and it historically had many bugs; I don't know how it goes these days.


Thanks for the input; I will change the OP to show only command options that are in some way specific to the Pandora.
"-ftree-vectorize" seems to be an aggressive optimization that is already included in "-O3" ("-ftree-vectorize is going to be turned on under -O3"), so there is no need to pass it separately at that level.

#11 OFFLINE   PatientFan


Posted 04 March 2012 - 07:09 PM

And do I really need these?

-march=armv7-a -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
I thought the compiler already compiles with these parameters automatically.


Sorry, I forgot to make it clear that I am not compiling on the Pandora itself but cross-compiling on an Intel PC with 64-bit Ubuntu.
The Sourcery toolchain is not specific to the Pandora's CPU, so I think it is a good idea to tell it the exact details of the target platform.

As for the other options: thank you for your suggestions! But I have decided to change the purpose of the OP: I will limit it to options that are either specific to the Pandora HW or have been proven to be a better starting point than the defaults.
There are probably enough discussions about GCC options in general. Nevertheless, it is surely a good idea to experiment with all options!
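
To check what the cross-compiler assumes when you pass no target flags at all (and why spelling out the Pandora's details is worthwhile), something like this should work:

# effective target options with the toolchain defaults...
arm-none-linux-gnueabi-gcc -Q --help=target | grep -E 'march|mcpu|mtune|mfpu|mfloat-abi'
# ...versus with the Pandora-specific flags
arm-none-linux-gnueabi-gcc -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp \
    -Q --help=target | grep -E 'march|mcpu|mtune|mfpu|mfloat-abi'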

#12 OFFLINE   PatientFan


Posted 04 March 2012 - 07:16 PM

Another tip, which might be useful (and not obvious to novice programmers), though not for everyone:
When using C++ you can manually disable features like exceptions (which I personally don't use, as I find C++ exceptions horribly lacking) and RTTI (Run-Time Type Information, used to get information about the type of an object at runtime), which you might not need.

-fno-exceptions
-fno-rtti
will do the job.
It got me a 5-10% speed boost (which is a pretty arbitrary number obviously as it depends on your code), but at the very least it decreases the size of your binary.
Keep in mind that you might run into errors when disabling these if libraries used in your project depend on those features.


Very interesting points, thanks! I will update the OP with "optimizations to try".

#13 OFFLINE   Linux-SWAT


Posted 09 March 2012 - 09:11 AM

http://gcc.gnu.org/o...tml#ARM-Options
https://wiki.linaro....ndbox/CoreMark1

And for newer gcc:

Specifying both -march= and -mcpu= is redundant, and may not in fact have done what you expected in previous compiler versions (maybe even depending on the order in which the arguments were given). The -march switch selects a "generic" ARMv7-A CPU, and -mcpu selects specifically a Cortex-A8 CPU, with tuning specific to that core.

Either use "-march=armv7-a -mtune=cortex-a8", or just use "-mcpu=cortex-a8".
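
One way to convince yourself that the two spellings come out equivalent on your particular toolchain is to diff the effective target options (output format varies between GCC versions, so treat this as a sketch):

arm-none-linux-gnueabi-gcc -mcpu=cortex-a8 -Q --help=target > mcpu.txt
arm-none-linux-gnueabi-gcc -march=armv7-a -mtune=cortex-a8 -Q --help=target > march_mtune.txt
diff mcpu.txt march_mtune.txt    # ideally empty, or only trivial differences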


#14 OFFLINE   Galaxis


Posted 13 March 2012 - 08:08 PM

Btw, I just ran two basic old FPU benchmarks, fbench and ffbench, with NEON and VFP. At least for those two simple test cases (one does trigonometry, the other Fourier transforms), NEON and VFP code are basically on par. Software floating point is about 40% slower (all three with -O3 -mtune=cortex-a8 otherwise).

http://www.fourmilab.ch/fbench/
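
For anyone who wants to reproduce a comparison like this, a sketch of the three builds (fbench.c being the single-file source from the fourmilab page above; run and time the binaries on the Pandora itself):

# same benchmark, three different floating-point code generation settings
arm-none-linux-gnueabi-gcc -O3 -mtune=cortex-a8 -mfpu=neon  -mfloat-abi=softfp -o fbench-neon fbench.c -lm
arm-none-linux-gnueabi-gcc -O3 -mtune=cortex-a8 -mfpu=vfpv3 -mfloat-abi=softfp -o fbench-vfp  fbench.c -lm
arm-none-linux-gnueabi-gcc -O3 -mtune=cortex-a8 -mfloat-abi=soft               -o fbench-soft fbench.c -lm
# on the Pandora:  time ./fbench-neon ; time ./fbench-vfp ; time ./fbench-soft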

#15 OFFLINE   notaz


Posted 13 March 2012 - 09:24 PM

It's more likely it didn't use NEON at all, regardless of what you set in compiler flags.
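
A simple way to check is to disassemble the object and look for NEON instructions; a sketch (fbench.o is a placeholder object name). Note that plain vadd.f64/vmul.f64 on d registers can be ordinary VFP, while vld1/vst1 and anything using q registers is NEON:

arm-none-linux-gnueabi-objdump -d fbench.o | grep -E 'vld1|vst1|q[0-9]' | head
# no output here means the compiler emitted no (obvious) NEON code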

#16 OFFLINE   Galaxis


Posted 13 March 2012 - 09:42 PM

It's more likely it didn't use NEON at all, regardless of what you set in compiler flags.


Duh. I should have checked what NEON actually does before posting that one ;). I also didn't know that -mfpu=neon enables VFP for plain scalar floating point; I thought that everything would somehow be handled by NEON with that compiler flag.

#17 OFFLINE   Linux-SWAT


Posted 25 August 2012 - 09:31 PM

What about gcc-java?
Are there any OP-related options?

#18 OFFLINE   Genboo


Posted 26 August 2012 - 09:27 PM

Good old Segfaultflags errrr CFLAGS here we go:

My GCC (4.7.1 at the moment, and 4.8.0 after it is released) CFLAGS="-Os -pipe -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=hard -ftree-vectorize -fassociative-math -funsafe-math-optimizations"

-mfloat-abi=hard:
Forget this; use -mfloat-abi=softfp or your libs/SGX video driver will break. soft(fp) and hard libs/programs can't be mixed. I'm just trying to get a hard-float SGX version running on Gentoo.

-pipe:
Makes the compilation process faster. On systems with low memory, gcc might get killed, for example when
compiling gcc itself on the Pandora :) At least without swap.

-mcpu=cortex-a8:
Should be the same as -march=armv7-a + -mtune=cortex-a8, or even better, as Linux-SWAT already said.

-mfpu=neon:
Will use NEON where possible and fall back to VFPv3 otherwise. Could mess with scientific/multimedia programs because NEON is not fully IEEE 754 compliant: segfaults, or multimedia output that looks somewhat weird.

-Os:
Optimize for size; smaller code is kinder to the caches, and NAND/SD/caches should be the bottleneck anyway.

-ftree-vectorize:
Activates auto-vectorization, but should be kicked out. Gives between zero and negligible performance gains with NEON (or overall; this is a weak part of gcc and other compilers). Part of -O3 / -Ofast.

http://wiki.debian.o...t/VfpComparison
The best performance comes from deriving parallelization using mathematical proof of the original function, and autovectorizing compilers don't do this. Pretty much all they do is unroll loops.
Therefore: make sure -ftree-vectorize is turned off :)


-fassociative-math:
Needed to enable auto-vectorization on ARM. Part of -funsafe-math-optimizations / -ffast-math / -Ofast.

-funsafe-math-optimizations:
Needed to enable auto-vectorization for NEON (because NEON is not fully IEEE 754 compliant). Part of -ffast-math / -Ofast.

So normally you should stay with:

Your GCC CFLAGS="-Os -pipe -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp"

Maybe you want to stay with -mfpu=vfpv3 if you are not 100% sure NEON is good for you.

And if you have time, you can play around with the already mentioned -fno-exceptions / -fno-rtti / -ffast-math, or even harder stuff, for individual (multimedia) packages.

But make backups... chances are good that the harder stuff will break the system :)
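
On the softfp vs. hard point above: a sketch of how to check which float ABI an object or library was actually built for, using the ARM build attributes (the tiny test.c is just a placeholder):

printf 'int f(float a, float b) { return (int)(a + b); }\n' > test.c
arm-none-linux-gnueabi-gcc -c -mfpu=neon -mfloat-abi=hard   -o hard.o   test.c
arm-none-linux-gnueabi-gcc -c -mfpu=neon -mfloat-abi=softfp -o softfp.o test.c
arm-none-linux-gnueabi-readelf -A hard.o   | grep Tag_ABI_VFP_args    # "VFP registers" = hard-float
arm-none-linux-gnueabi-readelf -A softfp.o | grep Tag_ABI_VFP_args    # no such tag = base AAPCS (soft/softfp)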

#19 OFFLINE   Linux-SWAT


Posted 26 August 2012 - 10:16 PM

Are CFLAGS also applied to Java?

#20 ONLINE   Wally


Posted 26 August 2012 - 10:25 PM

Isn't it wiser to keep the default CFLAGS for GCC, and then apply the optimisation flags later?


