Posts tagged "MT"

Adobe gave a presentation about Moses Tool Set at TAUS Asia Translation Summit 2012

This article was originally written in English. Text in other languages is provided via machine translation.

[tp no_translate=”y”]TAUS Asia Translation Summit 2012[/tp] was organized by [tp no_translate=”y”]Translation Automation User Society (TAUS)[/tp] in cooperation with [tp no_translate=”y”]China Center for Information Industry Development (CCID)[/tp] and [tp no_translate=”y”]Translators Association of China (TAC)[/tp]. More than 80 attendees, from product companies such as Adobe, Baidu, EMC, Google, and Microsoft as well as from language service providers (LSPs), took part in the summit, held in Beijing on April 24 – 25, 2012, and in the complimentary half-day event [tp no_translate=”y”]TAUS Open Source Machine Translation Showcase[/tp], held at the same venue on April 23. The summit gave attendees an excellent platform for sharing knowledge and experience in the MT domain.

I was invited by TAUS to give the audience an introduction to what Adobe has done with open source MT. In my presentation, I shared how Adobe uses the open source MT tool [tp no_translate=”y”]Moses[/tp] in its localization workflow. We developed a set of tools called [tp no_translate=”y”]Moses Tool Set[/tp] to simplify the use of Moses. With this tool set, the Moses training process becomes easier and more intuitive. It consists of four tools: [tp no_translate=”y”]Corpus Clean Tool, Corpus Splitting Tool, Moses Training Harness[/tp], and [tp no_translate=”y”]Moses Scoring Harness[/tp]. Each tool can work independently, and they can also be combined into a single job that lets users complete the whole training process in one click.

Many in the audience, especially those from LSPs that had just started their adventure in open source MT, showed strong interest in the Moses Tool Set. It is good to see that finding ways to improve localization productivity is no longer the responsibility of language service buyers alone. Some LSPs have also started their own efforts in the MT field. [tp no_translate=”y”]Moses[/tp] is a good option for them because of its low cost of entry. In the offline discussion, however, I heard a lot of complaints from these potential Moses users about how hard Moses is to use. For those who have not dived deeply into statistical machine translation, Moses is too complicated to start with. Many parameters are required to generate a trained MT engine, and the lack of a friendly user interface is another headache. No wonder the very first thing the audience was eager to know was where they could find and download [tp no_translate=”y”]Moses Tool Set[/tp].

In fact, [tp no_translate=”y”]Moses Tool Set[/tp] is an open source project. Both its installer packages and its source code are available on Google Code.

Adobe Machine Translation Tooling For Moses Presented At MT Summit 13

Members of the Adobe MT team were on hand at MT Summit 13 in Xiamen, China, to present Adobe’s MT achievements and demonstrate next-generation tooling for the Moses open source MT platform. Adobe team members Ray Flournoy, Yu Gong, Christine Duran, and Jeff Rueppel made the journey to attend the 4-day biennial conference. The conference moves between regions such as Europe and North America, and this year it was hosted in China for the first time. (Adobe Summer 2011 intern Yifan Hi took a break from his postdoctoral duties and presented his research as well.)

Yu Gong and Jeff Rueppel gave a demonstration of 3 new Adobe tools for streamlining the development of Machine Translation engines using the Moses open source system.

(Adobe’s Scoring Harness Tool)

Adobe employees demonstrated Adobe’s Scoring Harness Tool (seen above). The scoring harness is one of several building blocks Adobe is putting in place to facilitate the automation of MT engine development and deployment. The scoring harness automates the quality testing of MT engines using industry-recognized metrics for engine quality (BLEU, NIST, METEOR, and TER) and will permit the dynamic testing of new engines against engines already used in production.
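To make the idea of scoring a new engine against a production engine concrete, here is a minimal sketch in Python. It is not Adobe's Scoring Harness: the function names `corpus_bleu` and `compare_engines` are made up for illustration, only a simplified BLEU is computed (whitespace tokens, one reference per segment, no smoothing), and NIST, METEOR, and TER are left out entirely.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU with one reference per segment.

    Scores will not match the official scoring tools exactly, because this
    sketch uses whitespace tokenization and no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


def compare_engines(production_out, candidate_out, references):
    """Score both engines on the same test set and report which scores higher."""
    prod = corpus_bleu(production_out, references)
    cand = corpus_bleu(candidate_out, references)
    return {"production": prod, "candidate": cand, "candidate_wins": cand > prod}
```

In a real harness the three lists would come from a held-out test set translated by each engine, and a candidate would probably need to beat production on more than one metric before being promoted.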

The Adobe Moses Corpus Tool – And Crossing That Bridge When You Come To It.

Here is the scenario:

It’s the 1950s. You are at the head of an expedition in Nepal, the brave leader of a dozen mountaineers plus a couple hundred porters, all walking deep into the Himalayas in search of an unclimbed summit. The risks of the journey are high, but you will be showered in glory by your nation, ticker tape parade and everything, when you return home successful. Entering a deep valley, you come upon a long and narrow rope bridge which the whole expedition will have to cross. The bridge is too weak to hold more than one person at a time, and it takes 5 minutes for each person to cross.

You can get the first 12 climbers across in an hour.

(12 Climbers x 5 minutes each = 60 minutes) so 1 hour to cross.

But the very last porter won’t make it across until almost 18 hours after the first climber starts out.

(200 Porters x 5 minutes each = 1000 minutes) or an additional 16.7 hours to cross!

You may not be getting that ticker tape parade after all.

The success of the entire expedition is at stake. Valuable resources (food, tents, climbing gear, etc.) are going to end up spread all up and down the trail with their respective porters. This means they won’t be arriving at base camp when and where you need them. This is not a good way to get started.

The bridge crossing metaphor used here is a textbook example of encountering the limiting factor in your process chain. No matter how many resources you can bring to bear on the project, there is a choke point. It can take many forms, but identifying and solving this problem will be critical to reaching your goals. It doesn’t matter how fast you proceed through all the other steps of your plan; you are going to lose that time here unless something changes.

Does the narrow rope bridge, which will only let one person across at a time, sound like an unlikely obstacle to face in your machine translation project? It’s not. When we launched the Adobe Moses MT project last spring, getting across this bridge was the first problem we faced. Why? Quite simply, we had years of translation memory (TM) stored up from Adobe localization projects. All those years of TM were the raw materials to be used in building Adobe-specific engines. We knew that with them we could build better engines for translating Adobe products than we would ever find on the open market. However, the sheer volume of TM that needed to be processed into a Moses-ready corpus represented a blockage of serious proportions.

A quick back-of-the-napkin metric to put this in perspective:

We found, given the existing tooling for corpus work, that it required 1-2 weeks of an engineer’s time to process 5-10 million words of translation from .tmx format into a pair of aligned flat corpus files (i.e. Moses-ready files).
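For readers who have not seen what "a pair of aligned flat corpus files" means in practice, here is a minimal sketch in Python using only the standard library. It is an illustration, not the Adobe tool: the function names `tmx_to_pairs` and `write_corpus` are hypothetical, and it assumes a simple TMX layout in which each `<tu>` holds one `<tuv>` per language with a single `<seg>`.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from TMX markup."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens inline elements such as <ph>,
                # keeping their text content; whitespace is normalized.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        src, tgt = segs.get(src_lang.lower()), segs.get(tgt_lang.lower())
        # Skip units that are missing either side of the pair.
        if src and tgt:
            yield src, tgt


def write_corpus(pairs, src_path, tgt_path):
    """Write pairs to two line-aligned files: line N of each file is one pair."""
    count = 0
    with open(src_path, "w", encoding="utf-8") as src_f, \
            open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for src, tgt in pairs:
            src_f.write(src + "\n")
            tgt_f.write(tgt + "\n")
            count += 1
    return count
```

The key property is that the two output files always have the same number of lines, so line N of one file is the translation of line N of the other.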

Moses does come with a set of support scripts for working on these problems (tokenizer.perl, clean-corpus-n.perl, etc.), and they are functional. That said, the effort is time consuming. The scripts are all run from the command line, and a great deal of organization and discipline is required of the user, or the required steps can quickly get confusing.

If you have millions of words across multiple languages, as Adobe did, you can see it’s going to take a long time for that one engineer to process those .tmx files. If you add a couple more engineers, you can speed up the process, but the overall time required per unit of .tmx cleaned hasn’t gone down. This would be the equivalent of building a couple more bridges across that chasm in the Himalayas. It speeds things up, but it’s expensive now and doesn’t lower costs in the future.

So if we’ve only got one bridge to cross then the solution is to reduce the time it takes us to cross that bridge.

The Adobe Moses Corpus tool was our solution to this problem. While none of the individual steps in taking a .tmx file to a Moses-ready state is too time consuming, those small steps all add up. We decided to solve the problem once and for all and to develop a lightweight, modular, GUI-based AIR app that any user could install and use to process TM files for Moses. What does it do? Quite simply, it lets you automate your corpus cleaning to improve efficiency. It takes the multiple command-line options available and lets the user orchestrate them on any .tmx file without the worry of calling scripts and passing parameters. How much does it help? While these numbers are loose, we’ve been able to increase the productivity of a single engineer working on corpus cleaning by up to 10x.

We can now do in 2 days what used to take 2 weeks.
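The difference between adding engineers and speeding up the work itself can be put into a few lines of arithmetic. The sketch below is only an illustration: the 50-million-word corpus is a made-up figure, the rate of 5 million words per engineer-week is one reading of the loose range quoted earlier, and `processing_estimate` is a hypothetical helper name.

```python
def processing_estimate(total_words, words_per_engineer_week, engineers=1, speedup=1.0):
    """Return (engineer_weeks, calendar_weeks) for cleaning a corpus.

    engineer_weeks is the total effort paid for; calendar_weeks is how long
    the work takes when that effort is split across several engineers.
    """
    engineer_weeks = total_words / (words_per_engineer_week * speedup)
    calendar_weeks = engineer_weeks / engineers
    return engineer_weeks, calendar_weeks


corpus = 50_000_000  # hypothetical corpus size in words
rate = 5_000_000     # assumed words cleaned per engineer-week with the old tooling

baseline = processing_estimate(corpus, rate)
more_people = processing_estimate(corpus, rate, engineers=3)
better_tool = processing_estimate(corpus, rate, speedup=10.0)

print("one engineer, old tooling:   ", baseline)
print("three engineers, old tooling:", more_people)
print("one engineer, 10x tooling:   ", better_tool)
```

Adding engineers shortens the calendar time but leaves the total engineer-weeks unchanged, which is the "more bridges" case. A faster tool reduces both.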

When you have millions of words of translation memory, this is a big deal. If you want to do MT for yourself, you will need to solve this problem. For us, the Adobe Moses Corpus tool continues to evolve as we learn more about the cleaning steps we want access to and how to order these steps. It is our vision that it will fit into a larger, more comprehensive package of MT-related tools, which may include the automatic testing and tuning of engines. We continue to consider all the possibilities this tool would open up for the wider MT-interested public, and we are open to ideas and collaborations with others around its improvement and extension.

There are plenty of bridges to cross on the way to building MT systems. Corpus handling is just one of them. Hopefully this knowledge makes your journey a bit clearer. Now get out there and build an engine!

A quick (but by no means complete) list of things that could be done to improve MT engine quality:

This is a short list of the steps the Adobe Moses Corpus tool can currently perform.  We are open to suggestions about adding other steps or refining the nature of these steps.

Clean Placeholder Tags

Clean URLs

Tokenize

Lowercase

Clean Numbers

Clean Duplicate Lines

Clean Long Segments

Clean Misaligned Pairs

The efficacy of each of these steps could be debated around the MT round table, but in general most people will need to run their TM files through these steps before they can be used with Moses, both to build engines at all and to improve their quality.
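Several of the steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the behavior of the Adobe Moses Corpus tool: the regular expressions, the thresholds `max_tokens` and `max_ratio`, the naive tokenizer, and the function names are all made up for this example, and "Clean Numbers" is left out because the list does not say what that step does.

```python
import re

# Assumed placeholder shapes: markup tags like <b> and numbered slots like {0}.
TAG_RE = re.compile(r"<[^>]+>|\{\d+\}")
URL_RE = re.compile(r"https?://\S+")
# Naive tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def clean_line(text):
    """Per-line steps: strip placeholder tags and URLs, tokenize, lowercase."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(TOKEN_RE.findall(text)).lower()


def clean_corpus(pairs, max_tokens=80, max_ratio=3.0):
    """Return cleaned (source, target) pairs, dropping unusable ones."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = clean_line(src), clean_line(tgt)
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # nothing left on one side after cleaning
        if n_src > max_tokens or n_tgt > max_tokens:
            continue  # long segment
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths differ too much, so the pair is likely misaligned
        if (src, tgt) in seen:
            continue  # duplicate line
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

Both sides of a pair are always kept or dropped together, so the output stays line-aligned. The order of the steps matters: here duplicates are detected after tag removal and lowercasing, so segments that differ only in markup or case collapse into one.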