jzakiya Mar 27
It's been a long and interesting journey, but I've finally gotten this to a point where I feel I can release it.
This is a twinprimes generator that I did an initial version of back in 2014|5 in C++. Over time, I did a C++ parallel
version using OpenMP. Then a few months ago, at the beginning of 2018, I started looking at this again (I can't remember
the specifics of why) and said, man, I can do better than that now. I had been reimplementing the straight SSoZ in Nim, and
trying to figure out the best architecture. It turns out that in this exercise, I figured out how to use better math to allow me to
create a much simpler, and MANY TIMES faster, architecture for parallel implementation.
What you see here may be the fastest twinprimes generator, if this other program that claimed to be the fastest really
is.
There is a program called primesieve (https://primesieve.org/), which is open source, written in C++, whose author says
it's the fastest program that generates primes. It is a very nice, and well written, program, and I've used it to learn how to
do good C++ programming. In the past, I've written him about my prime sieve method(s) (SoZ, SSoZ). It's easy to
download and run. I just get the console version of the compiled binary.
So I use primesieve (ver 6.2 here) as the reference to check out my methods|techniques. After some head scratching, and
a few Aha moments, I've finally figured out how to bang on Nim to make my program faster than primesieve for
generating twinprimes.
Below are time comparisons between primesieve and mine - twinprimes_ssoz. You'll see that as the numbers get bigger,
twinprimes_ssoz becomes increasingly faster. It's optimum math now (finally) married to an optimum implementation.
To get these times, I ran both programs on a quiet system, where I shut down and rebooted, with no other apps loaded
other than a terminal to run both in, and an editor to record their times. If you have other programs that operate in the
background that use threads (e.g. browsers) or eat up memory, the (relative) times will be different|slower for both.
To run primesieve to generate twinprimes from a console do e.g.: $ ./primesieve 7e9 -c2
To run twinprimes_ssoz just do: $ ./twinprimes_ssoz<cr> and then enter number as: 7000000000
Below are the times I got (in seconds) on my system: System76 laptop, with an Intel i7 6700HQ cpu, 2.6-3.5 GHz clock, with
8 threads, and 16GB of memory, run on a 64-bit PCLinuxOS Linux distro, compiled using gcc 4.9.2, with Nim 0.17.2. Here
both programs just count the number of twinprimes (twinprimes_ssoz also prints the last twinprime value).
Value to Nim
show that Nim can be a player in the numerical analysis arena, particularly for parallel algorithms
can be used as a standard benchmark to evaluate new versions of Nim for improvements|regressions
can be used to talk stuff to C++, et al, programmers :-)
Future Development
After taking some rest, I plan to implement this new architecture in C++|OpenMP so I can compare performance. I'll
report the results afterwards. I also plan to update my paper The Segmented Sieve of Zakiya (SSoZ) (released in 2014)
to include the new math, architecture, and coding improvements.
Below are references for learning about the SoZ and SSoZ.
The Segmented Sieve of Zakiya (SSoZ)
https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
https://www.scribd.com/document/266461408/Primes-Utils-Handbook
http://mathworld.wolfram.com/TwinPrimes.html
But I encourage, implore, and welcome people to beat on the code to improve it and make it faster. What idioms are faster
than the ones I used, etc.? Also, I thought I saw it mentioned herein that OpenMP can be used with Nim. If so, it would be
interesting to see if using it in Nim would make the code faster. Whoever knows how to do that, go for it!
Below is the code and its gist (it's 317 loc, with ~60 separate loc of comments; compare that to primesieve's code size):
https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
#[
This Nim source file is a multiple threaded implementation to perform an
extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. I suspect parameter
tuning may have to be done on other hardware systems (ARM, PowerPC, etc) to
achieve optimum performance on them. It was tested on various Linux 64 bit
distros, native and in Virtual Box, using 8 or 4 threads, or 16|4GB of mem.

The code was compiled using these compiler directives|flags. Not using GC
produces a smaller executable, and may be a little faster. Try w/wo to see
the difference on your system. For optimum performance use gcc over clang.
]#
# Global parameters
var
  pcnt = 0           # number of primes from r1..sqrt(N)
  num = 0'u64        # adjusted (odd) input value
  primecnt = 0'u64   # number of twinprimes <= N
  nextp: seq[uint64] # table of resgroup vals for primes multiples
  primes: seq[int]   # list of primes r1..sqrt(N)
  seg: seq[uint8]    # segment byte array to perform ssoz
  KB = 0             # segment size for each seg restrack
  cnts: seq[uint]    # hold twinprime counts for seg bytes
  pos: seq[int]      # convert residue val to its residues index val
                     # faster than `residues.find(residue)`
  modpg: int         # PG's modulus value
  rescnt: int        # PG's residues count
  rescntp: int       # PG's twinpairs residues count
  residues: seq[int] # PG's list of residues
  restwins: seq[int] # PG's list of twinpair residues
  resinvrs: seq[int] # PG's list of residues inverses
  Bn: int            # segment size factor for PG and input number
# Select at runtime best PG and segment size factor to use for input value.
# These are good estimates derived from PG data profiling. Can be improved.
proc selectPG(num: uint) =
  if num < 10_000_000:
    (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp5
    Bn = 16
  elif num < 1_100_000_000'u:
    (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp7
    Bn = 32
  elif num < 35_500_000_000'u:
    (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp11
    Bn = 64
  else:
    (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp13
    if   num > 7_000_000_000_000'u: Bn = 384
    elif num > 2_500_000_000_000'u: Bn = 320
    elif num > 250_000_000_000'u:   Bn = 196
    else: Bn = 96
  cnts = newSeq[uint](rescntp div 2)  # twinprime sums for seg bytes
  pos = newSeq[int](modpg)            # create modpg size array to
  for i in 0..rescnt-1: pos[residues[i]-2] = i  # convert residue val -> indx
  # prms now contains the nonprime positions for the prime candidates r1..N
  # extract primes into global var 'primes' and count into global var 'pcnt'
  primes = @[]                        # create empty dynamic array for primes
  modk = 0; r = -1                    # initialize loop parameters
  for prm in prms:                    # numerate|store primes from pcs list
    r += 1; if r == rscnt: (r = 0; modk += md)
    if not prm: primes.add(modk + res[r])  # put prime in global 'primes' list
  pcnt = primes.len                   # set global count of primes
# For 'nextp' 'row_index' restrack for given residue, for primes r1..sqrt(N),
# init each col w/1st prime multiple resgroup for each prime along restrack.
proc resinit(row_indx, res: int) {.gcsafe.} =
  {.gcsafe.}:                          # for given residue 'res'
    let row = row_indx * pcnt          # along its restrack row in 'nextp'
    for j, prime in primes:            # for each primes r1..sqrt(N)
      let k = (prime-2) div modpg      # find the resgroup it's in
      let r = (prime-2) mod modpg + 2  # and its residue value
      let ri = (res*resinvrs[pos[r-2]]-2) mod modpg + 2  # compute the ri for r
      let prod = r * ri - 2            # compute residues cross-product
      # compute|store 1st prime mult resgroup at col j for prime, for 'res'
      nextp[row + j] = uint(k*(prime + ri) + prod div modpg)
# First init twinpair's 1st prime mults for their residues rows in 'nextp'.
# Perform prime sieve on selected twinpair restracks in seg of Kn resgroups.
# Set lsbs in bytes to '1' along restrack for nonprime pc resgroups where each
# of the Kn resgroup bytes along the restrack represent its prime candidates.
# Update 'nextp' array for that restrack for next segment slices accordingly.
# Then compute twinprimes count for segment, store in 'cnts' array for row.
proc twins_sieve(Kmax: uint, indx, Ks: int) {.gcsafe.} =
  {.gcsafe.}:
    let i = indx * 2                   # lower lsb seg row index for twinpair
    resinit(i, restwins[i])            # init 1st prime mults for lower lsb
    resinit(i+1, restwins[i+1])        # init 1st prime mults for upper lsb
    var sum = 0'u                      # init primes cnt for this seg byte row
    var Ki = 0'u                       # 1st resgroup seg val for each slice
    let s_row = indx * KB              # set seg row address for this twinpair
    while Ki < Kmax:                   # for Ks resgroup size slices upto Kmax
      let Kn = min(Ks, int(Kmax-Ki))   # set segment slice resgroup length
      for b in s_row..s_row+KB-1: seg[b] = 0  # set all seg restrack bits to prime
      for r in 0..1:                   # for 2 lsbs for twinpair bits in byte
        let row = (i + r) * pcnt       # set address to its restrack in 'nextp'
        let biti = uint8(1 shl r)      # set residue track bit mask
        for j in 0..pcnt-1:            # for each prime index r1..sqrt(N)
          if nextp[row + j] < Kn.uint: # if 1st mult resgroup is within 'seg'
            var k = int(nextp[row + j])  # starting from this resgroup in 'seg'
            let prime = primes[j]      # for this prime
            while k < Kn:              # for each primenth byte to end of 'seg'
              seg[s_row+k] = seg[s_row+k] or biti  # mark restrack bit nonprime
              k += prime               # compute next prime multiple resgroup
            nextp[row + j] = uint(k-Kn)  # save 1st resgroup in next eligible seg
          else: nextp[row+j] -= Kn.uint  # do if 1st mult resgroup not within seg
      for k in 0..Kn-1: (if seg[s_row+k] == 0: sum += 1)  # sum bytes with twins
      #printprms(Kn, Ki, indx)         # display twinprimes for this twinpair
      Ki += Ks.uint                    # set 1st resgroup val of next seg slice
    cnts[indx] = sum                   # save prime count for this seg twinpair
  echo("segment is [", (rescntp div 2), " x ", KB, "] bytes array")

  # This is not necessary for running the program but provides information
  # to determine the 'efficiency' of the used PG: (num of primes)/(num of pcs)
  # The closer the ratio is to '1' the higher the PG's 'efficiency'.
  var r = 0                           # starting with first residue
  while num.uint >= modk+restwins[r].uint: r += 1  # find last tp index <= num
  let maxpcs = k*rescntp.uint + r.uint  # maximum number of twinprime pcs
  let maxpairs = maxpcs div 2         # maximum number of twinpair candidates

  let Kn = min(KB, (Kmax.int mod KB)) # set number of resgroups in last slice
  var lprime = 0'u64                  # to store last twinprime value <= num
  modk = (Kmax-1).uint * modpg        # set mod for last resgroup in last segment
  k = uint(Kn-1)                      # set val for last resgroup in last segment
  let lasti = rescntp div 2 - 1       # set val for last seg twinpair row index
  r = lasti                           # starting from last resgroup twinpair byte
  while true:                         # step backwards from end of last resgroup
    var row_i = r * KB                # set 'seg' byte row for twinpair restrack
    if int(seg[row_i + k.int]) == 0:  # if both twinpair bits in byte are prime
      lprime = modk + restwins[r*2 + 1].uint  # numerate the upper twinprime val
      if lprime <= num: break         # if it's <= num it's the last prime, so exit
      primecnt -= 1                   # else reduce primecnt, keep backtracking
    # reduce restrack, next resgroup if needed
    r -= 1; if r < 0: (r = lasti; modk -= modpg; k -= 1)
twinprimes_ssoz()
mratsim Mar 27
OpenMP is easy, you can use it like this; the following defines a new -d:openmp compilation flag:
when defined(openmp):
  {.passC: "-fopenmp".}
  {.passL: "-fopenmp".}

# Alternatively
Note that OpenMP is sometimes a bit of a hard beast to tame if you work on structures smaller than 64 bytes (the size of a
cache line):
The work is split across several threads, and threads should not work on the same variable (or should use OpenMP atomics). You
might also encounter false sharing/cache invalidation: if you have an array of 8 integers, it fits in a single cache line (64B), so a
thread trying to update it will invalidate the cache line for all other threads, over and over. The result will be slower than
single threaded.
jzakiya Apr 03
New and Improved (current) version, with significantly reduced memory footprint as numbers get larger.
Now I create|initialize nextp arrays of 1st prime multiples in each thread, which get gc'd (garbage collected) at the end of the
thread, thus using a much lower constant runtime memory. (I could also generate|use the segment memory on a per
thread basis too, but at the end I need the full last segment memory to find the last twinprime and correct the count.) For this version I
need to compile using gc, so compile as below:
Hey @mratsim, I started converting the code to C++, and hit a roadblock in passing multiple outputs from a function to
multiple inputs, needed to do const parameterspxx = genPGparameters(xx) and in selectPG. And I'm not sure how to
compile to deallocate memory from a thread, as C++ doesn't use GC. Also, the information you provided on using
OpenMP is way over my head. Do you feel like trying to do an OpenMP implementation in Nim?
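For context, the Nim pattern that is awkward to port to C++ is a proc returning a tuple that is unpacked into several variables in one statement. This toy illustration is mine, not from the gist (`genParams` is a hypothetical stand-in for `genPGparameters`, using the P5 prime generator's actual modulus and residues):

```nim
# Toy stand-in (my illustration): a proc returning multiple outputs as a
# tuple, unpacked into several variables at once, as selectPG does with
# the parameterspxx constants. Values shown are the P5 PG's modulus (30),
# residue count (8), and residues coprime to 30 in (5, 31].
proc genParams(): (int, int, seq[int]) =
  (30, 8, @[7, 11, 13, 17, 19, 23, 29, 31])

var (modpg, rescnt, residues) = genParams()  # one call fills three outputs
echo modpg, " ", rescnt, " ", residues.len
```

In C++ this maps to returning a `std::tuple` and (in C++17) unpacking it with structured bindings, or `std::tie` before that.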
https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
#[
This Nim source file is a multiple threaded implementation to perform an
extremely fast Segmented Sieve of Zakiya (SSoZ) to find Twin Primes <= N.

This code was developed on a System76 laptop with an Intel I7 6700HQ cpu,
2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. I suspect parameter
tuning may have to be done on other hardware systems (ARM, PowerPC, etc) to
achieve optimum performance on them. It was tested on various Linux 64 bit
distros, native and in Virtual Box, using 8 or 4 threads, or 16|4GB of mem.

The code was compiled using these compiler directives|flags. Must use GC.
For optimum performance use gcc over clang.
]#
# Global parameters
var
  pcnt = 0           # number of primes from r1..sqrt(N)
  num = 0'u64        # adjusted (odd) input value
  twinscnt = 0'u64   # number of twinprimes <= N
  primes: seq[int]   # list of primes r1..sqrt(N)
  seg: seq[uint8]    # segment byte array to perform ssoz
  KB = 0             # segment size for each seg restrack
  cnts: seq[uint]    # hold twinprime counts for seg bytes
  pos: seq[int]      # convert residue val to its residues index val
                     # faster than `residues.find(residue)`
  modpg: int         # PG's modulus value
  rescnt: int        # PG's residues count
  rescntp: int       # PG's twinpairs residues count
  residues: seq[int] # PG's list of residues
  restwins: seq[int] # PG's list of twinpair residues
  resinvrs: seq[int] # PG's list of residues inverses
  Bn: int            # segment size factor for PG and input number
# Select at runtime best PG and segment size factor to use for input value.
# These are good estimates derived from PG data profiling. Can be improved.
proc selectPG(num: uint) =
  if num < 10_000_000:
    (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp5
    Bn = 16
  elif num < 1_100_000_000'u:
    (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp7
    Bn = 32
  elif num < 35_500_000_000'u:
    (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp11
    Bn = 64
  else:
    (modpg, rescnt, rescntp, residues, restwins, resinvrs) = parametersp13
    if   num > 7_000_000_000_000'u: Bn = 384
    elif num > 2_500_000_000_000'u: Bn = 320
    elif num > 250_000_000_000'u:   Bn = 196
    else: Bn = 96
  cnts = newSeq[uint](rescntp div 2)  # twinprime sums for seg bytes
  pos = newSeq[int](modpg)            # create modpg size array to
  for i in 0..rescnt-1: pos[residues[i]-2] = i  # convert residue val -> indx
  # prms now contains the nonprime positions for the prime candidates r1..N
  # extract primes into global var 'primes' and count into global var 'pcnt'
  primes = @[]                        # create empty dynamic array for primes
  modk = 0; r = -1                    # initialize loop parameters
  for prm in prms:                    # numerate|store primes from pcs list
    r += 1; if r == rscnt: (r = 0; modk += md)
    if not prm: primes.add(modk + res[r])  # put prime in global 'primes' list
  pcnt = primes.len                   # set global count of primes
# For 'nextp' array for given twinpair thread, for twinpair restracks 'i|i+1',
# init each col w/1st prime multiple resgroup, for the primes r1..sqrt(N).
proc resinit(i: int, nextp: seq[uint64]): seq[uint64] =
  var nextp = nextp                    # 1st mults array for this twinpair
  for indx in 0..1:                    # for both twinpair residues
    let row = indx * pcnt              # along each restrack row in 'nextp'
    let res = restwins[i + indx]       # for this twinpair residue
    for j, prime in primes:            # for each primes r1..sqrt(N)
      let k = (prime-2) div modpg      # find the resgroup it's in
      let r = (prime-2) mod modpg + 2  # and its residue value
      let ri = (res*resinvrs[pos[r-2]]-2) mod modpg + 2  # compute the ri for r
      let prod = r * ri - 2            # compute residues cross-product
      # compute|store 1st prime mult resgroup at col j for prime, for 'res'
      nextp[row + j] = uint(k*(prime + ri) + prod div modpg)
  result = nextp
# Perform in a thread, the ssoz for a given twinpair, along its seg byte row,
# for Kmax resgroups, and max segsize of Ks resgroups, for twinpairs at 'indx'.
# First create|init 'nextp' array of 1st prime mults for given twinpair, which
# at end of thread will be gc'd. For sieve, set 2 lsbs in seg byte to '1' for
# primes mults resgroups and update 'nextp' restrack seg slices accordingly.
# Then compute twinprimes count for segment, store in 'cnts' array for row.
# Can optionally compile to print mid twinprime values generated by twinpair.
proc twins_sieve(Kmax: uint, indx, Ks: int) {.gcsafe.} =
  {.gcsafe.}:
    var (sum, Ki) = (0'u, 0'u)         # init twins cnt|1st resgroup for slice
    let (i, s_row) = (indx*2, indx*KB) # set twinpair row addrs for nextp|seg
    var nextp = newSeq[uint64](pcnt * 2)  # create 1st mults array for twinpair
    nextp = resinit(i, nextp)          # init w/1st prime mults for twinpair
    while Ki < Kmax:                   # for Ks resgroup size slices upto Kmax
      let Kn = min(Ks, int(Kmax-Ki))   # set segment slice resgroup length
      for b in 0..Kn-1: seg[s_row+b] = 0  # set all seg restrack bits to prime
      for biti in 1..2:                # for 2 lsbs for twinpair bits in byte
        let row = (biti - 1) * pcnt    # set address to bit's 'nextp' restrack
        for j in 0..pcnt-1:            # for each prime index r1..sqrt(N)
          if nextp[row + j] < Kn.uint: # if 1st mult resgroup is within 'seg'
            var k = nextp[row + j].int # starting from this resgroup in 'seg'
            let prime = primes[j]      # for this prime
            while k < Kn:              # for each primenth byte to end of 'seg'
              seg[s_row+k] = seg[s_row+k] or biti.uint8  # mark kth byte nonprime
              k += prime               # update to next prime multiple resgroup
            nextp[row + j] = uint(k-Kn)  # save 1st resgroup in next eligible seg
          else: nextp[row+j] -= Kn.uint  # do if 1st mult resgroup val > seg size
      for b in 0..Kn-1: (if seg[s_row+b] == 0: sum.inc)  # sum bytes with twins
      #printprms(Kn, Ki, indx)         # display twinprimes for this twinpair
      Ki += Ks.uint                    # set 1st resgroup val of next seg slice
    cnts[indx] = sum                   # save twins count for this seg twinpair
  echo("segment is [", (rescntp div 2), " x ", KB, "] bytes array")

  # This is not necessary for running the program but provides information
  # to determine the 'efficiency' of the used PG: (num of primes)/(num of pcs)
  # The closer the ratio is to '1' the higher the PG's 'efficiency'.
  var r = 0                           # starting with first residue
  while num.uint >= modk+restwins[r].uint: r += 1  # find last tp index <= num
  let maxpcs = k*rescntp.uint + r.uint  # maximum number of twinprime pcs
  let maxpairs = maxpcs div 2         # maximum number of twinprime candidates
twinprimes_ssoz()
miran Apr 04
show that Nim can be a player in the numerical analysis arena, particularly for parallel algorithms
IMO, it would be nice if you would convert this to a blog post, which could be shared on Reddit/HN/etc.
miran Apr 04
But I encourage, implore, welcome, people to beat on the code to improve it and make it faster. What idioms are
faster than the ones I used, etc.
Here is my gist - I took your improved version, went quickly through it, and made mostly cosmetic changes -
theoretically this might be a bit quicker (e.g. variables declared outside of the loops), but I don't expect to see any
difference in practice.
Comments above function declarations are now docstrings for those functions (comments starting with ##).
I haven't changed any logic (or so I think), but please test that it works as it should.
jzakiya Apr 04
Hey @miran, thanks. I'll look at your code when I get some time today.
It would be really helpful if people could run the code on different platforms (Intel cpus, AMD, ARM, etc) with different
numbers of threads, cache sizes, etc, and if possible a|b it against primesieve on their systems, and post results. I'm going to
try and see if I can get (expert) programmers in other language communities (C++, Crystal, D, Rust, etc) to do it in them
and compare the results. I initially just want to see what their coded versions look like (idiomatically), or even if it's possible
(can they do parallel programming, gc, etc).
Below are updated times for the current version, produced on my System76 laptop, Intel i7 6700HQ cpu, 2.6-3.5 GHz clock,
with 8 threads, and 16GB of memory. I'm really interested to see how it performs under different threading systems.
jzakiya Apr 06
UPDATE, mostly more tweaking of proc twins_sieve. Tried different loop structures|idioms to see what would be faster
(for vs while loops mostly, etc). Made some cosmetic, line order changes. Code version 2018/04/05 is more compact, a
little cleaner, with clearer comments than version 2018/04/03.
https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
Biggest change was compiling the source in a VB (Virtual Box) image of the base OS distro, which has gcc 7.3.0. Compiled the source
in it to create the binary, then ran the binary on the base hardware (i7, 8 threads, 16GB mem) to a|b compare with the previous version
compiled using gcc 4.9.2. With gcc 4.9.2 the binary is 264,158 bytes; with gcc 7.3.0 it's 255,720 bytes. Not only was the
binary smaller, performance with gcc 7.3.0 was discernibly faster. Times are given below for tests done on a quiet system.
jzakiya Apr 07
UPDATE, caught, and corrected, a subtle coding error that could make the total twinprimes count off by 1 (with a
wrong last twinprime value) when the Kmax residue groups for an input is (very rarely) an exact multiple of the segment
size. If anybody is running the code please use updated version 2018/04/07.
https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
I was hoping a few people would be curious enough to run the code and post (or send me via email) their results on their
systems. If you are willing to do so, please list hardware and OS|gcc specs (cpu system, clocks, threads, mem|OS, gcc
version, distro version).
The next biggest design milestone would be to implement the algorithm using GPUs (graphics processor units). I guess the
easiest way to do that is to use something like CUDA, which I don't know if Nim supports.
miran Apr 07
I was hoping a few people would be curious enough to run the code and post (or send me via email) their
results on their systems
I have tried to call it with ./twinprimes_ssoz 1000000, but I still have to manually enter the wanted number in the next
step. Maybe in some next version you might consider parsing the argument, and only asking for a
number, as it does now, when none is given?
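One way to do what miran asks is shown below. This is only a sketch; `readLimit` is a hypothetical helper name, not a proc in the gist:

```nim
# Sketch of the suggested behavior: use the first command-line argument as
# the limit when one is given, otherwise fall back to the existing prompt.
import os, strutils

proc readLimit(): uint64 =
  if paramCount() > 0:             # e.g. ./twinprimes_ssoz 7000000000
    result = parseBiggestUInt(paramStr(1)).uint64
  else:                            # current behavior: ask for a number
    stdout.write "Enter integer number: "
    result = parseBiggestUInt(stdin.readLine.strip).uint64

# the main proc would then call readLimit() instead of reading stdin directly
```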
SolitudeSF Apr 08
miran Apr 09
Thanks @SolitudeSF !
Here are my results for i7-970 hexa-core @ 3.2 GHz. Nim 0.18.1 (devel), GCC 7.3.1, Linux kernel 4.9
jzakiya Apr 09
Hey Miran, this is Open Source Software, so you can modify it as much as you want. I think all you have to do is comment
out the line that puts out the enter-number message to get what you want.
But as I said in the initial post, people should feel free to beat on it, kick it, and modify it to their heart's content, especially if
it improves it. Want a GUI frontend? Go for it. Want to print output to a file? Same thing.
Here's one little improvement I've been thinking about that someone can pick up. It's fairly simple to conceptualize and
easy to code, and would be a good task for someone to do to give them a reason to learn a little more Nim.
primesieve displays a percentage indicator to give you a sense of how far along it is in the process. This is kinda nice to
have, especially as numbers get larger. This is a nice, short task, of redeeming value, someone can take on. Here's one
simple way to do it.
Add to the global var parameters a boolean array threadsdone: seq[bool] and initialize it at the end of selectPG as threadsdone
= newSeq[bool](rescntp div 2). Then at the bottom of twins_sieve put threadsdone[indx] = true, to indicate the
thread has finished. Now in the main routine, just monitor to see|calculate the percentage of the (rescntp div 2) threads
that are done in threadsdone. Voila, piece of cake. :) Just figure out how to display the output to give you what you want.
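The suggestion above can be sketched like this. `pairscnt` stands in for `(rescntp div 2)` and `percentDone` is a hypothetical helper, not a proc in the gist:

```nim
# Sketch of the proposed progress indicator: a done-flag per twinpair thread,
# polled from the main routine and reported as a percentage.
var
  pairscnt = 8                          # stands in for (rescntp div 2)
  threadsdone = newSeq[bool](pairscnt)  # threadsdone[indx] set by each thread

proc percentDone(): int =
  # percentage of twinpair threads that have set their done flag
  var done = 0
  for finished in threadsdone:
    if finished: inc done
  100 * done div pairscnt

# in twins_sieve, just before it returns:    threadsdone[indx] = true
# in the main routine, poll while waiting:   echo percentDone(), "% done"
```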
Now that I'm able to compile the code for P17, I'm looking at the optimum PG profiles for inputs >= 1e13. 1e13 takes on
the order of 900 secs (15 mins) and 5e13 is on the order of 90 mins. Since I only have one laptop, and need it to do real work
(this stuff is just for fun, fame, and fortune?) I tend to only run tests now either early when I wake up, or late going to bed.
But the key point I hope this real application shows is that Nim can do real parallel processing right now! And if
Nim would publicize successes like this it could make headway into the numerical processing community (attracting more
users|developers), and take mindshare away from Python, Julia, et al. People can start doing FFTs, Walsh Transforms,
Video|Audio codecs. How nice it would be to have an Opus Audio codec in Nim to showcase Nim's processing prowess.
https://opus-codec.org/
https://en.wikipedia.org/wiki/Opus_(audio_format)
The single biggest reason Ruby took off is Rails, which made people have to learn|develop Ruby. Python is now billing
itself as THE numerical analysis language (and Julia is written in Python). Nim needs its niche applications to stand out in
the software language farmer's market. When people go looking for nice, fresh, juicy tomatoes to put in their salad, your
offering must be able to stand out for some reason.
Udiknedormin Apr 10
Actually, Julia's main advantage isn't pure performance. It's ease of use. Please notice Julia is NOT a true HPC language,
as the only type of parallelism it provides is master-slave (as far as I know), which isn't even common in HPC. However, it's
handy to be able to use Python, C and Fortran (not to mention Julia's) functions from the same environment. It's
reasonably fast and provides much nicer DSLs than Python. The reason why Julia is better for this niche is that it has a REPL
and its JIT works pretty well. Compared to that, Nim has virtually no REPL. Ok, it has one, but only for small sessions, and it
actually works by super-fast compiling through tcc (btw. the tcc compiler backend doesn't work on my Linux machine).
On the other hand, Nim only has C and Python interop, no Fortran. Sounds not bad? It is bad; the Fortran codebase of
good-quality numerical code is huge. As already said, Nim lacks a true REPL, which makes its discoverability much worse
than Julia's. It also lacks in examples, tutorials and documentation. Not to mention libraries...
So no, I don't think a single "wow, it's written in Nim and it works great!" will suffice to make Nim popular.
Please notice I'm actively promoting Nim in my environment, so don't consider me negative-in-general. ;)
Araq Apr 10
bpr Apr 10
@jzakiya
No.
mratsim Apr 11
There is no problem using Fortran from Nim, and many Fortran libraries have a C header for C compat that will also deal
with converting from C 0-indexing to Fortran 1-indexing.
Nim can use Cuda and OpenCL; there are some hurdles, but in the end it's really nice to use once the low-level stuff is
abstracted away.
Also, in my (data science) experience, no one wants to touch Julia for numerical computing: R people stay with R and C++;
Python people would rather use Python, C, Numba, Cython, or whatever JIT or multiprocessing lib they can try than use
Julia. The Julia ecosystem for data science is lacking (better than Nim's theoretically, but an uphill battle in usage), the syntax
is more noisy than Python (begin, end), the speed can be worked around by using wrapped C/Fortran libraries, and indexing
starts at 1.
For me Julia is not after Python, nor a Python killer for numerical computing, but more trying to seduce the Matlab people.
miran Apr 11
[citation needed]
mratsim Apr 11
Julia has been broken on Kaggle, the largest data science community, for 1.5 years and very few ask for it:
https://www.kaggle.com/product‐feedback/25044. Also I mention data science, not say fluid dynamics. Don't take it out of
context ;﴿.
Udiknedormin Apr 12
@Araq
As far as I know, there is no easy way to do that other than "pretty much the same as in C", which means linking
complications and representation problems, for instance: Fortran's logical is NOT the same as bool in C (_Bool, but
bool ;) ). And it's not nearly as easy as with C++: I don't think you can import a Fortran class easily, can you? Can you
easily use Fortran >=90 array functions without writing a Fortran-to-C wrapper in Fortran first? I don't think so either.
Of course maybe you could write a lib for that; I'd say it sounds entirely possible. But I haven't seen a lib like that yet.
@mratsim
People don't use Julia for similar reasons they don't use Nim: it's too unstable, too buggy, and there aren't many good
tutorials, docs etc. Or libs, actually, as Julia lacks many libs too (though fewer than Nim, as far as I know). Still, as you've
already mentioned, it depends on the field. Two of my lecturers are considering trying Nim, for instance. :D Today one more
agreed to let me write my mid-semester project in Nim, as an experiment.
metaden May 04
@mratsim Julia had some problems when I last checked 2 years ago (I was frustrated with R at the time), but now the standard
library is pretty stable and there are some really cool deep learning libraries (Knet for example, completely hackable
because everything is Julia code). Good interface with Python, C, Fortran.
Also, I looked at Arraymancer in Nim. It's awesome. Keep up the good work.
https://www.reddit.com/r/Julia/comments/87qijj/how_does_julia_compare_to_your_previous_language/
jzakiya May 24
In honor of the new Nim forum update :-) I finally decided to post an update to my code. I had been coding different
versions using different methods, to minimize memory use, max speed, etc. This version is a good compromise: lower
memory usage than the previous version with no discernible speed loss.
An additional benefit of this version's architecture is that all the memory allocation necessary to process a twinpair is now
created|recovered in each thread in the proc twins_sieve. I can even abstract out needing to create a seg array for
each thread by using sets to minimize mem usage, at the expense of speed. Also, because all the sieving mem fits in a
thread, it can now possibly fit inside a GPU thread, making it possibly able to fly in a CUDA implementation.
Lastly, if Nim had a really fast bit vector implementation, replacing the segment byte array with it would probably
speed things up (and reduce mem too?).
https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
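The bit-vector idea above can be sketched as follows. `BitVec` and its procs are my names for illustration, not anything from the gist; a real replacement would also have to handle the twinpair's two lsb flags per resgroup, not one bit:

```nim
# Hypothetical sketch of a packed bit vector: 8 flags per byte instead of
# one byte per flag, trading shift/mask work per access for 8x less memory.
type BitVec = object
  bits: seq[uint8]

proc newBitVec(n: int): BitVec =
  BitVec(bits: newSeq[uint8]((n + 7) div 8))  # n flags, rounded up to bytes

proc setBit(b: var BitVec, i: int) =
  b.bits[i shr 3] = b.bits[i shr 3] or uint8(1 shl (i and 7))

proc getBit(b: BitVec, i: int): bool =
  (b.bits[i shr 3] and uint8(1 shl (i and 7))) != 0
```

Whether this is faster than the current seq[uint8] segment depends on whether the smaller footprint (better cache fit) outweighs the extra bit arithmetic, which is exactly the speed/memory trade-off mentioned above.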
dataman May 24
jzakiya May 24
dataman May 24
@jzakiya
Meanwhile, you can use the bitsets.nim module from the compiler's folder.