Mallet 2.0.7 GE MaxEnt - alphabets don't match

Discussion:

Ivelina Nikolova

2011-12-09 09:39:50 UTC

Dear list members,

I am interested in using Generalized Expectation classification with
Mallet and that is why I went through the examples on the webpage
http://mallet.cs.umass.edu/ge-classification.php

The quick startup went well and my data was successfully classified,
but when I tried GE MaxEnt I got the following error:

---------------------------------
Training vectors loaded from baseball-hockey-constraints.train.vectors.unlabeled
Testing vectors loaded from baseball-hockey-constraints.train.vectors
Exception in thread "main" java.lang.RuntimeException: ladies
and
gentlemen
boys
girls
lend
me
....
Training and testing alphabets don't match!
at cc.mallet.classify.tui.Vectors2Classify.main(Vectors2Classify.java:270)
----------------------------------

This is how I imported my training and test sets. The test set is actually
the training set before its labels were removed.

bin/mallet import-dir --input train/* --output baseball-hockey-constraints.train

bin/vectors2vectors --input baseball-hockey-constraints.train \
--output baseball-hockey-constr.train.unlabeled --hide-target

And this is how I run the classifier:

bin/mallet train-classifier \
--training-file baseball-hockey-constr.train.unlabeled \
--testing-file baseball-hockey-constraints.train \
--trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,constraintsFile=\"test/constraints_baseball_hockey\"" \
--report test:accuracy

Could you please help me resolve this?

Are the train and test sets supposed to have the same alphabet? Can't I
run a test on an arbitrary test set in which some of the words occurring
in the training set may not appear?

Thank you very much in advance!
Ivelina Nikolova

Gregory Druck

2011-12-10 03:20:20 UTC


Hi Ivelina,

I just tested importing data, hiding the labels, and running GE and didn't have any trouble with Mallet 2.0.7.

If you send me a small example data set that produces the error I will try to figure out what's going wrong.

Best,
Greg


---------------------------------------------------

Gregory Druck

2012-01-10 23:05:45 UTC


Hi Ivelina,

The error occurs because the train and test data are processed using different calls to Text2Vectors, and consequently have different alphabets.

When importing your test data, use the command:

bin/mallet import-dir --input data/h-b-constr-1000/test/baseball data/h-b-constr-1000/test/hockey --use-pipe-from hb800.train --output testFile

--use-pipe-from forces the import code to use the training data pipes and alphabets when processing the test data, ensuring that train and test data are compatible. This should solve the problem.

Hope this helps,
Greg
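
The workflow described above can be sketched as a few shell helpers. This is only an illustration, not a script taken from the thread: MALLET_HOME, the function names, and every file and directory name are placeholders. The mallet subcommands and flags are the ones quoted in the messages. The key point is that the test data is imported with --use-pipe-from pointing at the training vectors, so both files share one alphabet.

```shell
# Sketch only: MALLET_HOME, the function names, and all file and directory
# names are placeholders. The mallet flags are those quoted in the thread.
MALLET_HOME="${MALLET_HOME:-.}"

# Import the labeled training directories. This call creates the pipe
# and the alphabets that everything else must reuse.
import_train() {   # usage: import_train OUT.vectors DIR...
  local out="$1"; shift
  "$MALLET_HOME/bin/mallet" import-dir --input "$@" --output "$out"
}

# Hide the labels so that GE training relies on the constraints.
hide_labels() {    # usage: hide_labels IN.vectors OUT.vectors
  "$MALLET_HOME/bin/vectors2vectors" --input "$1" --output "$2" --hide-target
}

# Import the test directories, reusing the pipe of the training vectors.
import_test() {    # usage: import_test TRAIN.vectors OUT.vectors DIR...
  local train="$1" out="$2"; shift 2
  "$MALLET_HOME/bin/mallet" import-dir --input "$@" \
    --use-pipe-from "$train" --output "$out"
}

# Train MaxEnt with GE constraints and report test accuracy.
train_ge() {       # usage: train_ge UNLABELED.vectors TEST.vectors CONSTRAINTS
  "$MALLET_HOME/bin/mallet" train-classifier \
    --training-file "$1" \
    --testing-file "$2" \
    --trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,constraintsFile=\"$3\"" \
    --report test:accuracy
}

# Example call order (all names hypothetical):
#   import_train train.vectors train/baseball train/hockey
#   hide_labels  train.vectors train.unlabeled.vectors
#   import_test  train.vectors test.vectors test/baseball test/hockey
#   train_ge     train.unlabeled.vectors test.vectors constraints.txt
```

Importing the test set with a separate, plain import-dir call (without --use-pipe-from) is the case that would be expected to trigger the "alphabets don't match" error.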

Dear Greg,
I reproduced the alphabet mismatch error and decided to send it to you nevertheless.
I would be very grateful if you could have a look at the example in the attachment.
In the 'test' folder there is a file (test_fail_alphabets_mismatch_commands) with bash commands which I executed from the mallet homedir. There are two training examples: one works, and the other fails with the "alphabets don't match" error. The only difference is that the successful one tests on the labeled train set, while the failing one tests on a different set of labeled documents.
If you paste the data and test directories in your mallet homedir, and execute the commands from the mallet homedir you should be able to reproduce it on your side.
Looking forward to receiving your comment.
Many thanks!
Ivelina

Hi Greg,
Please use the data and commands supplied in this attachment here and ignore the previous one.
Best,
Iva

Dear Greg,
Many thanks for responding!
May I first ask a question unrelated to the mallet error? The tutorial on the mallet website describes only training with unlabeled data and constraints. Is it also possible in mallet to train with labeled data and use constraints on top of that to boost recognition?
I managed to overcome the alphabet mismatch problem. I think it is related to the handling of command-line parameters, but I can't be sure because I can't reproduce it today.
------------------------------------------------------------------------------
Training vectors loaded from hb.train.unlabeled
Testing vectors loaded from hb.train
-------------------- Trial 0 --------------------
number of constraints: 2
Value (GE=-0.29545157327995475 Gaussian prior= -0.13360697179540634) = -0.4290585450753611
value difference below tolerance (oldValue: -0.4290697235892368 newValue: -0.4290585450753611
Value (GE=-0.29545157327995475 Gaussian prior= -0.13360697179540634) = -0.4290585450753611
EXITING BACKTRACK: Jump too small (alamin=2.524963901554749E-7). Exiting and using xold. Value=-0.4290585450753611
cc.mallet.optimize.OptimizationException: Line search could not step in the current direction. (This is not necessarily cause for alarm. Sometimes this happens close to the maximum, where the function may be very flat.)
at cc.mallet.optimize.LimitedMemoryBFGS.optimize(LimitedMemoryBFGS.java:147)
at cc.mallet.classify.MaxEntGETrainer.train(MaxEntGETrainer.java:190)
at cc.mallet.classify.MaxEntGETrainer.train(MaxEntGETrainer.java:136)
at cc.mallet.classify.MaxEntGETrainer.train(MaxEntGETrainer.java:39)
at cc.mallet.classify.tui.Vectors2Classify.main(Vectors2Classify.java:411)
Catching exception; saying converged.
Summary. test accuracy mean = 0.77 stddev = 0.0 stderr = 0.0
----------------------------------------------------------
at cc.mallet.optimize.LimitedMemoryBFGS.optimize(LimitedMemoryBFGS.java:147)
at cc.mallet.classify.MaxEntGETrainer.train(MaxEntGETrainer.java:190)
.....
Would you please confirm that this is correct mallet behavior?
I'm still attaching the dataset, constraints file and a text file with the commands I've executed if you are curious to check. Folders which are in the zip must be extracted in the mallet dir.
Please let me know if this runs the same way on your side too.
------------------
FYI
-----------------
One of my problems was actually that when loading the data (import-dir option) the program outputs
Labels =
data/h-b-constr/train/baseball
data/h-b-constr/train/hockey
where labels contain the whole path from the working dir to the dirs with labeled data.
hockey baseball:0.1 hockey:0.9
baseball baseball:0.9 hockey:0.1
This took some time to debug. I would suggest printing the labels as follows in future releases:
Labels =
baseball
hockey
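
A quick way to spot this kind of mismatch is to list the label names that a constraints file actually uses and compare them by eye with the "Labels =" output of import-dir. The helper below is a hypothetical sketch, not part of Mallet. It assumes the constraints format shown above (a feature followed by label:probability pairs), and it assumes, as this message implies, that the label names in the constraints file have to match the labels printed at import time.

```shell
# Hypothetical helper, not part of Mallet.
# Prints each distinct label name used in a constraints file whose lines
# look like:  feature label:prob label:prob
constraint_labels() {   # usage: constraint_labels CONSTRAINTS_FILE
  awk '{
    for (i = 2; i <= NF; i++) {
      sub(/:[^:]*$/, "", $i)   # strip the trailing ":probability"
      print $i
    }
  }' "$1" | sort -u
}

# Example (file name is a placeholder):
#   constraint_labels test/constraints_baseball_hockey
```

If the names printed here are bare (for example baseball) while import-dir reports labels with a directory prefix, the two sets differ and are worth reconciling before training.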
Many thanks again and sorry for bothering!
Ivelina


<malletDir1.zip>

---------------------------------------------------