This is probably one of the fastest possible FFI solutions, but if you're enabling experimental features anyway, Project Panama [0] is already in preview and will likely become the dominant FFI mechanism on the JVM.
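For anyone who hasn't tried it, here is a minimal sketch of a Panama downcall, assuming the `java.lang.foreign` API as it exists in JDK 21 (preview, needs `--enable-preview`) and JDK 22+ (final). The class name `PanamaStrlen` and the choice of `strlen` are mine, purely for illustration, and it assumes a 64-bit platform where the default lookup can find the C library's `strlen`.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.charset.StandardCharsets;

// Hypothetical example class: calls the C library's strlen via Panama.
public class PanamaStrlen {

    static long strlen(String s) throws Throwable {
        Linker linker = Linker.nativeLinker();

        // Look up the native symbol in the platform's default libraries.
        MemorySegment symbol = linker.defaultLookup().find("strlen").orElseThrow();

        // size_t strlen(const char*): assumes size_t is 64 bits wide.
        MethodHandle handle = linker.downcallHandle(
                symbol,
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

        // The confined arena frees the native memory when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocate(bytes.length + 1);
            MemorySegment.copy(bytes, 0, cString, ValueLayout.JAVA_BYTE, 0, bytes.length);
            // Write the NUL terminator explicitly.
            cString.set(ValueLayout.JAVA_BYTE, bytes.length, (byte) 0);

            return (long) handle.invokeExact(cString);
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(strlen("hello")); // prints 5
    }
}
```

The string is copied by hand rather than through the allocator's string helpers because those were renamed between JDK 21 and JDK 22. Newer JDKs may also print a native-access warning unless you pass `--enable-native-access=ALL-UNNAMED`.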
I spotted a JavaOne talk from NVIDIA; apparently they have revisited their collaboration with Oracle and ported their CUDA bindings to Panama.
[0]: https://openjdk.org/projects/panama/