GSoC’21 Final Report

Tirth Patel

Over the course of this summer, I integrated the library UNU.RAN into scipy.stats. I spent most of the time working on gh-14215 where I introduced a basic infrastructure to bind UNU.RAN methods. It adds two generators - TransformedDensityRejection (TDR) and DiscreteAliasUrn (DAU) - one to sample from a continuous distribution and the other to sample from a discrete distribution. A strong and comprehensive test suite was added to test the UI and correctness of the algorithms. Distributions from UNU.RAN’s test suite have been used in combination with custom distributions. It also adds a comprehensive documentation suite with tutorials for each generator. Though there is some room for improvement, the current state of documentation and tutorials serve as a good starting point. Benchmarks of the methods to test the speed of the setup step and sampling have been added. They confirm that significant improvements in sampling speed are possible. For example, with NumericalInversePolynomial, sampling from the Gauss-Hypergeometric distribution is 10000 times faster than the default method in SciPy.

My prototype served as the starting point for my work. It wasn’t thread-safe and there were some memory leaks, both of which were addressed during my work on gh-14215. Mostly, I spent my time developing Cython bindings for UNU.RAN which involved writing thunks for callbacks and ensuring internal errors in UNU.RAN and errors in the callbacks propagated to Python properly. To handle errors in callbacks, SciPy normally uses non-local returns in the thunks to return to a state where the Python exception can be raised safely. This is the strategy used in several libraries in the optimize module including quadpack and minpack. But this approach leads to some memory leaks in UNU.RAN and can’t be used to wrap its generators. Hence, I had to come up with another solution that calls PyErr_Occurred every time the thunk was called and after a call to the C routine to check for errors. As UNU.RAN is fairly large (over 130,000 LOC), patching UNU.RAN was not feasible to solve the problem. To handle errors occurring within UNU.RAN library methods, the thread-unsafe UNU.RAN global error stream was redirected to a lock-controlled in-memory file stream where the Python C API was used to detect and handle errors in a thread-safe manner.

I also patched UNU.RAN to allow the use of NumPy’s random number generators as the uniform random number generator. One of the problems I faced here was supporting the old RandomState API present in NumPy < 1.19. RandomState doesn’t have a C/Cython counterpart for fast random number sampling, necessitating a call to the slow Python function every time a uniform random number has to be sampled by UNU.RAN. This isn’t a problem with NumPy >= 1.19, because it ships a Cython API to generate random numbers. In the future, the performance can be improved by generating uniform random numbers in batches or using a thread-safe global array to hold the uniform random numbers.

There was also a lot of discussion and feedback from the community on the UI. Initially, I designed the API to accept a separate parameter for each distribution function required by the generator (e.g. pdf, cdf). But some generators require as many as 5 functions and passing each of them separately might be inconvenient. Hence, we chose to go with a common dist object which is an instance of a class with the required methods (e.g. pdf, cdf, etc), consistent with the rv_continuous and rv_discrete base classes. There was also quite a bit of discussion around what to name the parameter which accepts a seed or a NumPy generator to generate uniform random numbers. SciPy has historically called such a parameter random_state, but with the introduction of the Generator API, a name change might help users notice the difference between RandomState and Generator and promote the use of the latter. This comes at the cost of consistency and there are a lot of places in SciPy where the change is required. Moreover, we weren’t able to agree upon a single name so, in the end, we decided to keep random_state intact for the sake of consistency and tackle the change in some other PR.

Summary of my work

Other helper/auxiliary PRs:

Follow-Ups

Future Projects

Acknowledgments and Closing Remarks

Thanks to my mentors, Christoph (@chrisb83) and Nicholas (@mckib2) for being very responsive to my requests and for reviewing the PRs! To all the reviewers who helped with my pull requests! And lastly, to the developers of UNU.RAN, Josef Leydold and Wolfgang Hörmann, for granting us access to their library!

I enjoyed my summer with SciPy very much and hope to work on the follow-ups and help with future projects. I also hope to maintain the bindings for UNU.RAN generators and help with the general development of SciPy.