Hi,
First, thanks for this great ASR package called Kaldi!
Next, I believe I ran into a bug in utils/shuffle_list.pl.
To reproduce (whether the bug shows up depends on perl's implementation of sort):
i=0 ; while [ $i -lt 10000 ] ; do echo $i >> nums-10k ; i=$((i+1)) ; done
cat nums-10k | ./utils/shuffle_list.pl | tail -40
The above creates a 10k-line file in which each line contains its own index, then looks at the tail after shuffling, using perl v5.18.2 on Ubuntu. The end of the file is not properly shuffled.
Instead of shuffle_list.pl, if the lines are unique, one could use sort -R or something like this perl one-liner
Basically, the problem is that handing perl's sort a fair coin flip as its comparator doesn't guarantee uniformly shuffled output.
I'm not sure where else this script is called, but in my case it made nnet1 train on a small set of speakers at the end of each iteration.
Eric
BTW, I still cannot see why the current implementation should work the way
you described (unless the rand() is broken)
y.
On Tue, Jun 16, 2015 at 1:05 PM, Jan Trmal jtrmal@gmail.com wrote:
Related
Bugs:
#19
The reason for using that script is reproducibility, which sort -R lacks.
The core of the sorting is
@lines = sort { rand() <=> rand() } @lines;
which Karel or I found online somewhere. This algorithm is
probably incorrect (i.e. does not give fully random output), depending
on the implementation of 'sort'.
I think it would be better to prepend each line with the output of
rand() and then \t, and then sort using string order, and then remove
everything up to and including the \t before printing out. This will
still be consistent but will properly shuffle the input. Yenda, do you
have time to test this out?
Dan
On Tue, Jun 16, 2015 at 1:19 PM, Jan jtrmal@users.sf.net wrote:
I will look into it in the evening.
y.
On Tue, Jun 16, 2015 at 2:21 PM, Daniel Povey danielpovey@users.sf.net
wrote:
Just a small note, if reproducibility across platforms is also a concern,
I'm not sure perl random is consistent. See e.g.
http://www.perlmonks.org/bare/?node_id=437589
On Tue, Jun 16, 2015 at 11:21 AM, Daniel Povey danielpovey@users.sf.net
wrote:
I'm more concerned about reproducibility on the same platform, from run to run.
Across platforms, things won't be exactly reproducible for other reasons.
Dan
On Tue, Jun 16, 2015 at 4:57 PM, Eric Shellef ericshellef@users.sf.net wrote:
I just committed a fix to this. Eric, can you please check if it fixes
your issues? I checked the output and it seems "random enough" on our
cluster.
y.
On Tue, Jun 16, 2015 at 5:09 PM, Daniel Povey danielpovey@users.sf.net
wrote:
Hi Jan,
The numbers look mixed and the logic of the code makes sense.
FYI, I saw quicker convergence on a validation set when training nnet1 with
the properly shuffled audio (several hundred hours) as compared to the same
audio under the previous shuffle. The WER on a test set was accordingly better
after ten epochs with the properly shuffled sentences.
I haven't verified this trend on more than one test set, but it's worth
checking.
Thanks,
Eric
On Wed, Jun 17, 2015 at 12:07 PM, Jan jtrmal@users.sf.net wrote:
Interesting, and that makes sense. Cc'ing Karel for his info.
Dan
On Thu, Jun 18, 2015 at 11:40 PM, Eric Shellef ericshellef@users.sf.net wrote:
I'm happy it works for you. I think your observation makes sense. The
question is if it's only some specific version of perl/OS/glibc (or some
combination of those) that caused you the sorting problems or if it's just
you who actually noticed (and the issue affects many more people and
systems).
y.
On Thu, Jun 18, 2015 at 11:54 PM, Daniel Povey danielpovey@users.sf.net
wrote:
I'd suggest using the second option (sort -R might not be available
everywhere -- I remember running into troubles with it somewhere).
Let's wait for Dan.
y.
On Tue, Jun 16, 2015 at 12:54 PM, Eric Shellef ericshellef@users.sf.net
wrote: