* ARMeilleure: Add AVX512{F,VL,DQ,BW} detection
Add `UseAvx512Ortho` and `UseAvx512OrthoFloat` optimization flags as
short-hands for `F+VL` and `F+VL+DQ`.
* ARMeilleure: Add initial support for EVEX instruction encoding
Does not implement rounding, or exception controls.
* ARMeilleure: Add `X86Vpternlogd`
Accelerates the vector-`Not` instruction.
* ARMeilleure: Add check for `OSXSAVE` for AVX{2,512}
* ARMeilleure: Add check for `XCR0` flags
Add XCR0 register checks for AVX and AVX512F, following the guidelines
from section 14.3 and 15.2 from the Intel Architecture Software
Developer's Manual.
* ARMeilleure: Increment InternalVersion
* ARMeilleure: Remove redundant `ReProtect` and `Dispose`, formatting
* ARMeilleure: Move XCR0 procedure to GetXcr0Eax
* ARMeilleure: Add `XCR0` to `FeatureInfo` structure
* ARMeilleure: Utilize `ReadOnlySpan` for Xcr0 assembly
Avoids an additional allocation
* ARMeilleure: Formatting fixes
* A64: Add fast path for Fcvtas_Gp/S/V, Fcvtau_Gp/S/V and Frinta_S/V instructions;
they use "Round to Nearest with Ties to Away" rounding mode not supported in x86.
All instructions involved have been tested locally in both release and debug modes, in both lowcq and highcq.
The titles Mario Strikers and Super Smash Bros. U. use these instructions intensively.
* Update Ptc.cs
* A32: Add fast path for Vcvta_RM, Vrinta_RM and Vrinta_V instructions aswell.
* ARMeilleure: Add `GFNI` detection
This is intended for utilizing the `gf2p8affineqb` instruction
* ARMeilleure: Add `gf2p8affineqb`
Not using the VEX or EVEX-form of this instruction is intentional. There
are `GFNI`-chips that do not support AVX(so no VEX encoding) such as
Tremont(Lakefield) chips as well as Jasper Lake.
13df339fe7/GenuineIntel/GenuineIntel00806A1_Lakefield_LC_InstLatX64.txt (L1297-L1299)13df339fe7/GenuineIntel/GenuineIntel00906C0_JasperLake_InstLatX64.txt (L1252-L1254)
* ARMeilleure: Add `gfni` acceleration of `Rbit_V`
Passes all `Rbit_V*` unit tests on my `i9-11900k`
* ARMeilleure: Add `gfni` acceleration of `S{l,r}i_V`
Also added a fast-path for when the shift amount is greater than the
size of the element.
* ARMeilleure: Add `gfni` acceleration of `Shl_V` and `Sshr_V`
* ARMeilleure: Increment InternalVersion
* ARMeilleure: Fix Intrinsic and Assembler Table alignment
`gf2p8affineqb` is the longest instruction name I know of. It shouldn't
get any wider than this.
* ARMeilleure: Remove SSE2+SHA requirement for GFNI
* ARMeilleure Add `X86GetGf2p8LogicalShiftLeft`
Used to generate GF(2^8) 8x8 bit-matrices for bit-shifting for the `gf2p8affineqb` instruction.
* ARMeilleure: Append `FeatureInfo7Ecx` to `FeatureInfo`
* Implemented in IR the managed methods of the Saturating region ...
... of the SoftFallback class (the SatQ ones).
The need to natively manage the Fpcr and Fpsr system registers is still a fact.
Contributes to https://github.com/Ryujinx/Ryujinx/issues/2917 ; I will open another PR to implement in Intrinsics-branchless the methods of the Saturation region as well (the SatXXXToXXX ones).
All instructions involved have been tested locally in both release and debug modes, in both lowcq and highcq.
* Ptc.InternalVersion = 3665
* Addressed PR feedback.
* Implemented in IR the managed methods of the ShlReg region of the SoftFallback class.
It also includes the last two SatQ ones (following up on https://github.com/Ryujinx/Ryujinx/pull/3665).
All instructions involved have been tested locally in both release and debug modes, in both lowcq and highcq.
* Fpsr and Fpcr freed.
Handling/isolation of Fpsr and Fpcr via register for IR and via memory for Tests and Threads, with synchronization to context exchanges (explicit for SoftFloat); without having to call managed methods. Thanks to the inlining work of the previous two PRs and others in this.
Tests performed locally in both release and debug modes, in both lowcq and highcq, with FastFP to true and false (explicit FP tests included). Tested with the title Tony Hawk's PS.
Depends on shlreg.
* Update InstEmitSimdHelper.cs
* De-magic Masks.
Remove the Stride and Len flags; Fpsr.NZCV are A32 only, then moved to Fpscr: this leads to emitting less IR in reference to Get/Set Fpsr/Fpcr/Fpscr methods in reference to Mrs/Msr (A64) and Vmrs/Vmsr (A32) instructions.
* Addressed PR feedback.
* Implemented in IR the managed methods of the Saturating region ...
... of the SoftFallback class (the SatQ ones).
The need to natively manage the Fpcr and Fpsr system registers is still a fact.
Contributes to https://github.com/Ryujinx/Ryujinx/issues/2917 ; I will open another PR to implement in Intrinsics-branchless the methods of the Saturation region as well (the SatXXXToXXX ones).
All instructions involved have been tested locally in both release and debug modes, in both lowcq and highcq.
* Ptc.InternalVersion = 3665
* Addressed PR feedback.
* Implemented in IR the managed methods of the ShlReg region of the SoftFallback class.
It also includes the last two SatQ ones (following up on https://github.com/Ryujinx/Ryujinx/pull/3665).
All instructions involved have been tested locally in both release and debug modes, in both lowcq and highcq.
* Update InstEmitSimdHelper.cs
* Implement VRSRA, VRSHRN, VQSHRUN, VQMOVN, VQMOVUN, VQADD, VQSUB, VRHADD, VPADDL, VSUBL, VQDMULH and VMLAL Arm32 NEON instructions
* PPTC version
* Fix VQADD/VQSUB
* Improve MRC/MCR handling and exception messages
In case data is being recompiled as code, we don't want to throw at emit stage, instead we should only throw if it actually tries to execute
* Implemented in IR the managed methods of the Saturating region ...
... of the SoftFallback class (the SatQ ones).
The need to natively manage the Fpcr and Fpsr system registers is still a fact.
Contributes to https://github.com/Ryujinx/Ryujinx/issues/2917 ; I will open another PR to implement in Intrinsics-branchless the methods of the Saturation region as well (the SatXXXToXXX ones).
All instructions involved have been tested locally in both release and debug modes, in both lowcq and highcq.
* Ptc.InternalVersion = 3665
* Addressed PR feedback.
* Refactor CPU interface
* Use IExecutionContext interface on SVC handler, change how CPU interrupts invokes the handlers
* Make CpuEngine take a ITickSource rather than returning one
The previous implementation had the scenario where the CPU engine had to implement the tick source in mind, like for example, when we have a hypervisor and the game can read CNTPCT on the host directly. However given that we need to do conversion due to different frequencies anyway, it's not worth it. It's better to just let the user pass the tick source and redirect any reads to CNTPCT to the user tick source
* XML docs for the public interfaces
* PPTC invalidation due to NativeInterface function name changes
* Fix build of the CPU tests
* PR feedback
* Add an early `TailMerge` pass
Some translations can have a lot of guest calls and since for each guest
call there is a call guard which may return. This can produce a lot of
epilogue code for returns. This pass merges the epilogue into a single
block.
```
Using filter 'hcq'.
Using metric 'code size'.
Total diff: -1648111 (-7.19 %) (bytes):
Base: 22913847
Diff: 21265736
Improved: 4567, regressed: 14, unchanged: 144
```
* Set PTC version
* Address feedback
* Handle `void` returning functions
* Actually handle `void` returning functions
* Fix `RegisterToLocal` logging
* Add `Operand.Label` support to `Assembler`
This adds label support to `Assembler` and enables branch tightening
when compiling with relocatables. Jump management and patching has been
moved to the `Assembler`.
* Move instruction table to `Assembler.Table`
* Set PTC internal version
* Rename `Assembler.Table` to `AssemblerTable`
Seems like this is used as an optimized way to clear memory in homebrew applications. Unfortunately, calling the software fallback method every 8 bytes was not very optimal.
The existing EmitStore is used by passing in ZR as the register to get a 0 write.
Fix an issue introduced in #2190 where by 2 different count table entry
addresses were used for LCQ functions. E.g:
```asm
.L1:
mov rbp,COUNT_TABLE_0 ;; This gets an address.
mov ebp,[rbp]
lea esi,[rbp+1]
mov rdi,COUNT_TABLE_1 ;; This gets another address.
mov [rdi],esi
cmp ebp,64h
je near .L34
```
This caused LCQ functions to not tier up when they're loaded from the
PTC cache. This does not happen when they're freshly compiled.
This PR fixes the issue by ensuring only a single counter is created per
translation.
* Refactor `PtcInfo`
This change reduces the coupling of `PtcInfo` by moving relocation
tracking to the backend. `RelocEntry`s remains as `RelocEntry`s through
out the pipeline until it actually needs to be written to the PTC
streams. Keeping this representation makes inspecting and manipulating
relocations after compilations less painful. This is something I needed
to do to patch relocations to 0 to diff dumps.
Contributes to #1125.
* Turn `Symbol` & `RelocInfo` into readonly structs
* Add documentation to `CompiledFunction`
* Remove `Compiler.Compile<T>`
Remove `Compiler.Compile<T>` and replace it by `Map<T>` of the
`CompiledFunction` returned.
* Fix type mismatch in `BitwiseAnd` simplification
`TryEliminateBitwiseAnd` would turn the `BitwiseAnd` operation into a
copy of the wrong type. E.g:
Before `Simplification`:
```llvm
i64 %0 = BitwiseAnd i64 0x0, %1
```
After `Simplication`:
```llvm
i64 %0 = Copy i32 0x0
```
Since the with the changes in #2515, we iterate in reverse order and
`Simplication`, `ConstantFolding` does not indicate if it modified
the CFG, the second pass to "retype" the copy into the proper
destination type does not happen.
This also blocked copy propagation since its destination type did not
match with its source type. But in the cases I've seen, the
`PreAllocator` would insert a copy for the propagated constant, which
results in no diffs.
Since the copy remained as is, asserts are fired when generating it.
* Set PPTC version
* Turn `MemoryOperand` into a struct
* Remove `IntrinsicOperation`
* Remove `PhiNode`
* Remove `Node`
* Turn `Operand` into a struct
* Turn `Operation` into a struct
* Clean up pool management methods
* Add `Arena` allocator
* Move `OperationHelper` to `Operation.Factory`
* Move `OperandHelper` to `Operand.Factory`
* Optimize `Operation` a bit
* Fix `Arena` initialization
* Rename `NativeList<T>` to `ArenaList<T>`
* Reduce `Operand` size from 88 to 56 bytes
* Reduce `Operation` size from 56 to 40 bytes
* Add optimistic interning of Register & Constant operands
* Optimize `RegisterUsage` pass a bit
* Optimize `RemoveUnusedNodes` pass a bit
Iterating in reverse-order allows killing dependency chains in a single
pass.
* Fix PPTC symbols
* Optimize `BasicBlock` a bit
Reduce allocations from `_successor` & `DominanceFrontiers`
* Fix `Operation` resize
* Make `Arena` expandable
Change the arena allocator to be expandable by allocating in pages, with
some of them being pooled. Currently 32 pages are pooled. An LRU removal
mechanism should probably be added to it.
Apparently MHR can allocate bitmaps large enough to exceed the 16MB
limit for the type.
* Move `Arena` & `ArenaList` to `Common`
* Remove `ThreadStaticPool` & co
* Add `PhiOperation`
* Reduce `Operand` size from 56 from 48 bytes
* Add linear-probing to `Operand` intern table
* Optimize `HybridAllocator` a bit
* Add `Allocators` class
* Tune `ArenaAllocator` sizes
* Add page removal mechanism to `ArenaAllocator`
Remove pages which have not been used for more than 5s after each reset.
I am on fence if this would be better using a Gen2 callback object like
the one in System.Buffers.ArrayPool<T>, to trim the pool. Because right
now if a large translation happens, the pages will be freed only after a
reset. This reset may not happen for a while because no new translation
is hit, but the arena base sizes are rather small.
* Fix `OOM` when allocating larger than page size in `ArenaAllocator`
Tweak resizing mechanism for Operand.Uses and Assignemnts.
* Optimize `Optimizer` a bit
* Optimize `Operand.Add<T>/Remove<T>` a bit
* Clean up `PreAllocator`
* Fix phi insertion order
Reduce codegen diffs.
* Fix code alignment
* Use new heuristics for degree of parallelism
* Suppress warnings
* Address gdkchan's feedback
Renamed `GetValue()` to `GetValueUnsafe()` to make it more clear that
`Operand.Value` should usually not be modified directly.
* Add fast path to `ArenaAllocator`
* Assembly for `ArenaAllocator.Allocate(ulong)`:
.L0:
mov rax, [rcx+0x18]
lea r8, [rax+rdx]
cmp r8, [rcx+0x10]
ja short .L2
.L1:
mov rdx, [rcx+8]
add rax, [rdx+8]
mov [rcx+0x18], r8
ret
.L2:
jmp ArenaAllocator.AllocateSlow(UInt64)
A few variable/field had to be changed to ulong so that RyuJIT avoids
emitting zero-extends.
* Implement a new heuristic to free pooled pages.
If an arena is used often, it is more likely that its pages will be
needed, so the pages are kept for longer (e.g: during PPTC rebuild or
burst sof compilations). If is not used often, then it is more likely
that its pages will not be needed (e.g: after PPTC rebuild or bursts
of compilations).
* Address riperiperi's feedback
* Use `EqualityComparer<T>` in `IntrusiveList<T>`
Avoids a potential GC hole in `Equals(T, T)`.
* Add AddressTable<T>
* Use AddressTable<T> for dispatch
* Remove JumpTable & co.
* Add fallback for out of range addresses
* Add PPTC support
* Add documentation to `AddressTable<T>`
* Make AddressTable<T> configurable
* Fix table walk
* Fix IsMapped check
* Remove CountTableCapacity
* Add PPTC support for fast path
* Rename IsMapped to IsValid
* Remove stale comment
* Change format of address in exception message
* Add TranslatorStubs
* Split DispatchStub
Avoids recompilation of stubs during tests.
* Add hint for 64bit or 32bit
* Add documentation to `Symbol`
* Add documentation to `TranslatorStubs`
Make `TranslatorStubs` disposable as well.
* Add documentation to `SymbolType`
* Add `AddressTableEventSource` to monitor function table size
Add an EventSource which measures the amount of unmanaged bytes
allocated by AddressTable<T> instances.
dotnet-counters monitor -n Ryujinx --counters ARMeilleure
* Add `AllowLcqInFunctionTable` optimization toggle
This is to reduce the impact this change has on the test duration.
Before everytime a test was ran, the FunctionTable would be initialized
and populated so that the newly compiled test would get registered to
it.
* Implement unmanaged dispatcher
Uses the DispatchStub to dispatch into the next translation, which
allows execution to stay in unmanaged for longer and skips a
ConcurrentDictionary look up when the target translation has been
registered to the FunctionTable.
* Remove redundant null check
* Tune levels of FunctionTable
Uses 5 levels instead of 4 and change unit of AddressTableEventSource
from KB to MB.
* Use 64-bit function table
Improves codegen for direct branches:
mov qword [rax+0x408],0x10603560
- mov rcx,sub_10603560_OFFSET
- mov ecx,[rcx]
- mov ecx,ecx
- mov rdx,JIT_CACHE_BASE
- add rdx,rcx
+ mov rcx,sub_10603560
+ mov rdx,[rcx]
mov rcx,rax
Improves codegen for dispatch stub:
and rax,byte +0x1f
- mov eax,[rcx+rax*4]
- mov eax,eax
- mov rcx,JIT_CACHE_BASE
- lea rax,[rcx+rax]
+ mov rax,[rcx+rax*8]
mov rcx,rbx
* Remove `JitCacheSymbol` & `JitCache.Offset`
* Turn `Translator.Translate` into an instance method
We do not have to add more parameter to this method and related ones as
new structures are added & needed for translation.
* Add symbol only when PTC is enabled
Address LDj3SNuD's feedback
* Change `NativeContext.Running` to a 32-bit integer
* Fix PageTable symbol for host mapped
* Refactoring of KMemoryManager class
* Replace some trivial uses of DRAM address with VA
* Get rid of GetDramAddressFromVa
* Abstracting more operations on derived page table class
* Run auto-format on KPageTableBase
* Managed to make TryConvertVaToPa private, few uses remains now
* Implement guest physical pages ref counting, remove manual freeing
* Make DoMmuOperation private and call new abstract methods only from the base class
* Pass pages count rather than size on Map/UnmapMemory
* Change memory managers to take host pointers
* Fix a guest memory leak and simplify KPageTable
* Expose new methods for host range query and mapping
* Some refactoring of MapPagesFromClientProcess to allow proper page ref counting and mapping without KPageLists
* Remove more uses of AddVaRangeToPageList, now only one remains (shared memory page checking)
* Add a SharedMemoryStorage class, will be useful for host mapping
* Sayonara AddVaRangeToPageList, you served us well
* Start to implement host memory mapping (WIP)
* Support memory tracking through host exception handling
* Fix some access violations from HLE service guest memory access and CPU
* Fix memory tracking
* Fix mapping list bugs, including a race and a error adding mapping ranges
* Simple page table for memory tracking
* Simple "volatile" region handle mode
* Update UBOs directly (experimental, rough)
* Fix the overlap check
* Only set non-modified buffers as volatile
* Fix some memory tracking issues
* Fix possible race in MapBufferFromClientProcess (block list updates were not locked)
* Write uniform update to memory immediately, only defer the buffer set.
* Fix some memory tracking issues
* Pass correct pages count on shared memory unmap
* Armeilleure Signal Handler v1 + Unix changes
Unix currently behaves like windows, rather than remapping physical
* Actually check if the host platform is unix
* Fix decommit on linux.
* Implement windows 10 placeholder shared memory, fix a buffer issue.
* Make PTC version something that will never match with master
* Remove testing variable for block count
* Add reference count for memory manager, fix dispose
Can still deadlock with OpenAL
* Add address validation, use page table for mapped check, add docs
Might clean up the page table traversing routines.
* Implement batched mapping/tracking.
* Move documentation, fix tests.
* Cleanup uniform buffer update stuff.
* Remove unnecessary assignment.
* Add unsafe host mapped memory switch
On by default. Would be good to turn this off for untrusted code (homebrew, exefs mods) and give the user the option to turn it on manually, though that requires some UI work.
* Remove C# exception handlers
They have issues due to current .NET limitations, so the meilleure one fully replaces them for now.
* Fix MapPhysicalMemory on the software MemoryManager.
* Null check for GetHostAddress, docs
* Add configuration for setting memory manager mode (not in UI yet)
* Add config to UI
* Fix type mismatch on Unix signal handler code emit
* Fix 6GB DRAM mode.
The size can be greater than `uint.MaxValue` when the DRAM is >4GB.
* Address some feedback.
* More detailed error if backing memory cannot be mapped.
* SetLastError on all OS functions for consistency
* Force pages dirty with UBO update instead of setting them directly.
Seems to be much faster across a few games. Need retesting.
* Rebase, configuration rework, fix mem tracking regression
* Fix race in FreePages
* Set memory managers null after decrementing ref count
* Remove readonly keyword, as this is now modified.
* Use a local variable for the signal handler rather than a register.
* Fix bug with buffer resize, and index/uniform buffer binding.
Should fix flickering in games.
* Add InvalidAccessHandler to MemoryTracking
Doesn't do anything yet
* Call invalid access handler on unmapped read/write.
Same rules as the regular memory manager.
* Make unsafe mapped memory its own MemoryManagerType
* Move FlushUboDirty into UpdateState.
* Buffer dirty cache, rather than ubo cache
Much cleaner, may be reusable for Inline2Memory updates.
* This doesn't return anything anymore.
* Add sigaction remove methods, correct a few function signatures.
* Return empty list of physical regions for size 0.
* Also on AddressSpaceManager
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Use branch instead of tailcall for recursive calls
Use a branch instead of doing a tailcall for recursive calls. This
avoids having to store the dispatch address, setting up the epilogue and
keeps guest registers in host registers for longer.
The rejit check is moved down into the entry block so that the rejit
behaviour remains the same as before.
* Set PTC version
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Add EntryTable<TEntry>
* Add on translation call counting
* Add Counter
* Add PPTC support
* Make Counter a generic & use a 32-bit counter instead
* Return false on overflow
* Set PPTC version
* Print more information about the rejit queue
* Make Counter<T> disposable
* Remove Block.TailCall since it is not used anymore
* Apply suggestions from code review
Address gdkchan's feedback
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
* Fix more stale docs
* Remove rejit requests queue logging
* Make Counter<T> finalizable
Most certainly quite an odd use case.
* Make EntryTable<T>.TryAllocate set entry to default
* Re-trigger CI
* Dispose Counters before they hit the finalizer queue
* Re-trigger CI
Just for good measure...
* Make EntryTable<T> expandable
* EntryTable is now expandable instead of being a fixed slab.
* Remove EntryTable<T>.TryAllocate
* Remove Counter<T>.TryCreate
Address LDj3SNuD's feedback
* Apply suggestions from code review
Address LDj3SNuD's feedback
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* Remove useless return
* POH approach, but the sequel
* Revert "POH approach, but the sequel"
This reverts commit 5f5abaa24735726ff2db367dc74f98055d4f4cba.
The sequel got shelved
* Add extra documentation
Co-authored-by: gdkchan <gab.dark.100@gmail.com>
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* PPTC vs. giant ExeFS.
* InternalVersion = 2168
* Add new heuristic algorithm for calculating the number of threads for parallel translations that also takes into account the user's free physical memory and not just the number of CPU cores.
* Nit.
* Add an outer Header structure and add the hashes for both this new structure and the existing "inner" Header structure.
* InternalVersion = 2169
* Improve StoreToContext emission
Hoist StoreToContext in dynamic branch fast & slow paths out into
their predecessor.
Reduces register pressure, code size and compile time because we're
throwing less stuff down the pipeline.
* Set PTC internal version
* Turn EmitDynamicTableCall private
* Re-trigger CI
Signal and setup events correctly
Eliminate possible races
Use a single event
Mark volatiles and reduce scope of waithandles
Common handler
100ms -> 50ms
* Implement VCNT based on AArch64 CNT
Add tests
* Update PTC version
* Address LDj's comments
* Explicit size in encoding
* Tighter tests
* Replace SoftFallback with IR helper
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* Reduce one BitwiseAnd from IR fallback
Based on popcount64b from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
* Rename parameter and add assert
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
Co-authored-by: LDj3SNuD <35856442+LDj3SNuD@users.noreply.github.com>
* PPTC & Pool Enhancements.
* Avoid buffer allocations in CodeGenContext.GetCode(). Avoid stream allocations in PTC.PtcInfo.
Refactoring/nits.
* Use XXHash128, for Ptc.Load & Ptc.Save, x10 faster than Md5.
* Why not a nice Span.
* Added a simple PtcFormatter library for deserialization/serialization, which does not require reflection, in use at PtcJumpTable and PtcProfiler; improves maintainability and simplicity/readability of affected code.
* Nits.
* Revert #1987.
* Revert "Revert #1987."
This reverts commit 998be765cf7f7da5ff0c1c08de704c9012b0f49c.