tgies | Ha ha ha hell yes

I've been extending my Japanese morphological analyzer into a Japanese-English machine translator lately. I just finished Snapping In all the Necessary Subsystems, such that I should theoretically get intelligible English output for most input written in a detached formal literary tone, and it frickin' worked.

Here's my test input. It's an intentionally difficult compound sentence, but it wouldn't be asking too much to expect to see it in General Writin', and there's absolutely nothing Actually Wrong with it.

私はあなたが現在いる部屋と異なった部屋に座って私の話し声の音を録音しています。

(or, "watashi wa anata ga genzai iru heya to kotonatta heya ni suwatte watashi no hanashigoe no oto o rokuon shite imasu")

This means "I am sitting in a room, different from the one which you are in now, and recording the sound of my speaking voice."

Babelfish/Systran says:

I sitting down in the room which differs from the room where you presently are have recorded the sound of my voice.

not rly. also way to make up a whole new verb tense

Anyway, here's my output (snip about 150,000 lines of morphological breakdown and such):

I am sitting in a the room different from the room you are now in, recording my speaking voice's sound.

("a the" is a known really trivial bug which I forgot to fix -- it means to say "a")

And using an experimental algorithm to try to make things sound more natural:

I'm sitting in a the room different from the one you're in, recording my speaking voice sound.

>:DDDDD?

And none of this is statistical or otherwise done in fuzzy logic. It's basically doing morphological analysis, looking up words (EDICT lol), consulting a pretty small database of idiomatic constructions, and issuing output. I'm going to figure out how to work some statistical junk into it for constructions which don't follow the rules, though.
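To make that pipeline a bit more concrete, here's a very rough toy sketch of the same shape: segment, check an idiom table, fall back to dictionary lookup, emit output. Everything in it is an assumption made up for illustration: `LEXICON` is a six-entry stand-in for a real dictionary like EDICT, `IDIOMS` is a one-entry stand-in for the idiom database, and the greedy longest-match segmenter is a swapped-in textbook technique, not the real analyzer. It only produces a word-by-word gloss; reordering into actual English is left out entirely.

```python
# Toy lexicon: surface form -> (part of speech, English gloss).
# Hypothetical stand-in for a real dictionary such as EDICT.
LEXICON = {
    "私": ("pronoun", "I"),
    "は": ("particle", "<topic>"),
    "部屋": ("noun", "room"),
    "に": ("particle", "in"),
    "座って": ("verb-te", "sit"),
    "います": ("aux", "<progressive>"),
}

# Hypothetical idiom table: a run of morphemes -> a fixed English rendering.
IDIOMS = {
    ("座って", "います"): "am sitting",
}


def segment(text):
    """Greedy longest-match segmentation against LEXICON."""
    morphemes = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                morphemes.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no lexicon entry at position {i}: {text[i:]!r}")
    return morphemes


def gloss(morphemes):
    """Check the idiom table first, then fall back to per-word lookup."""
    out = []
    i = 0
    while i < len(morphemes):
        for pattern, rendering in IDIOMS.items():
            if tuple(morphemes[i:i + len(pattern)]) == pattern:
                out.append(rendering)
                i += len(pattern)
                break
        else:
            out.append(LEXICON[morphemes[i]][1])
            i += 1
    return out


morphemes = segment("私は部屋に座っています")
print(morphemes)         # ['私', 'は', '部屋', 'に', '座って', 'います']
print(gloss(morphemes))  # ['I', '<topic>', 'room', 'in', 'am sitting']
```

The point is just that every step is a deterministic table lookup or rule, with no probabilities anywhere, which is why constructions that break the rules need something else bolted on.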

I'm still tweaking the weighting of the different morpheme-tagging passes and stuff, but I'm going to throw this on the Internet once that's all taken care of. I'm not sure about releasing the source code -- older parts of it are a bit of a mess, and I don't have the right to redistribute a part of the Contextual Forensics code (I didn't make that bit entirely by myself), but we'll see. I think I might want to see if I can't get some cash money out of a few of the more clever subsystems, actually.
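On the weighting of the tagging passes: the post doesn't say how the passes actually get combined, so the following is only a guess at the general idea, written as a plain weighted vote. The pass names (`dictionary`, `suffix_rules`, `context`), the tags, and the weights are all hypothetical. Each pass proposes one tag per morpheme, and the tag with the highest total weight wins.

```python
from collections import defaultdict


def combine_passes(pass_outputs, weights):
    """Pick, for each morpheme, the tag with the highest summed pass weight.

    pass_outputs: {pass_name: [one tag per morpheme]}
    weights:      {pass_name: weight of that pass}
    """
    lengths = {len(tags) for tags in pass_outputs.values()}
    if len(lengths) != 1:
        raise ValueError("all passes must tag the same number of morphemes")
    n = lengths.pop()

    combined = []
    for idx in range(n):
        scores = defaultdict(float)
        for name, tags in pass_outputs.items():
            scores[tags[idx]] += weights[name]
        combined.append(max(scores, key=scores.get))
    return combined


# Hypothetical passes disagreeing about a three-morpheme input.
passes = {
    "dictionary":   ["noun", "particle", "verb"],
    "suffix_rules": ["noun", "particle", "noun"],
    "context":      ["verb", "particle", "noun"],
}
weights = {"dictionary": 2.0, "suffix_rules": 1.0, "context": 1.5}

print(combine_passes(passes, weights))  # ['noun', 'particle', 'noun']
```

Under this assumption, "tweaking the weighting" amounts to nudging the numbers in `weights` until the combined tags stop being wrong on known-good test sentences.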

in summary i'm お尻お知り