[Paper Reading] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments


METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee   Alon Lavie 
Language Technologies Institute  
Carnegie Mellon University  
Pittsburgh, PA 15213  
banerjee+@cs.cmu.edu  alavie@cs.cmu.edu


Important Snippets:

1. In order to be both effective and useful, an automatic metric for MT evaluation has to satisfy several basic criteria. The primary and most intuitive requirement is that the metric have very high correlation with quantified human notions of MT quality. Furthermore, a good metric should be as sensitive as possible to differences in MT quality between different systems, and between different versions of the same system. The metric should be consistent (same MT system on similar texts should produce similar scores), reliable (MT systems that score similarly can be trusted to perform similarly) and general (applicable to different MT tasks in a wide range of domains and scenarios). Needless to say, satisfying all of the above criteria is extremely difficult, and all of the metrics that have been proposed so far fall short of adequately addressing most if not all of these requirements.


2. It is based on an explicit word-to-word matching between the MT output being evaluated and one or more reference translations. Our current matching supports not only matching between words that are identical in the two strings being compared, but can also match words that are simple morphological variants of each other.


3. Each possible matching is scored based on a combination of several features. These currently include unigram precision, unigram recall, and a direct measure of how out-of-order the words of the MT output are with respect to the reference.


4. Furthermore, our results demonstrated that recall plays a more important role than precision in obtaining high levels of correlation with human judgments.


5. BLEU does not take recall into account directly.


6. BLEU does not use recall because the notion of recall is unclear when matching simultaneously against a set of reference translations (rather than a single reference). To compensate for recall, BLEU uses a Brevity Penalty, which penalizes translations for being “too short”.


7. BLEU and NIST suffer from several weaknesses:

   > The Lack of Recall

   > Use of Higher-Order N-grams

   > Lack of Explicit Word-matching Between Translation and Reference

   > Use of Geometric Averaging of N-grams


8. METEOR was designed to explicitly address the weaknesses in BLEU identified above. It evaluates a translation by computing a score based on explicit word-to-word matches between the translation and a reference translation. If more than one reference translation is available, the given translation is scored against each reference independently, and the best score is reported.


9. Given a pair of translations to be compared (a system translation and a reference translation), METEOR creates an alignment between the two strings. We define an alignment as a mapping between unigrams, such that every unigram in each string maps to zero or one unigram in the other string, and to no unigrams in the same string.


10. This alignment is incrementally produced through a series of stages, each stage consisting of two distinct phases.


11. In the first phase an external module lists all the possible unigram mappings between the two strings.


12. Different modules map unigrams based on different criteria. The “exact” module maps two unigrams if they are exactly the same (e.g. “computers” maps to “computers” but not “computer”). The “porter stem” module maps two unigrams if they are the same after they are stemmed using the Porter stemmer (e.g. “computers” maps to both “computers” and to “computer”). The “WN synonymy” module maps two unigrams if they are synonyms of each other.
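The three matching modules can be sketched as follows. This is a minimal illustration only: the real system uses the Porter stemmer and WordNet, which are replaced here by toy stand-ins (`toy_stem` and the `synonyms` dict are assumptions, not the actual implementation).

```python
def exact_match(u, v):
    # "exact" module: unigrams must be identical strings.
    return u == v

def toy_stem(word):
    # Hypothetical stand-in for the Porter stemmer: just strips a plural "s".
    return word[:-1] if word.endswith("s") else word

def stem_match(u, v):
    # "porter stem" module: match if the (toy) stems are equal.
    return toy_stem(u) == toy_stem(v)

def synonym_match(u, v, synonyms):
    # "WN synonymy" module: match if either word lists the other as a synonym.
    # `synonyms` is a dict word -> set of synonyms, standing in for WordNet.
    return u == v or v in synonyms.get(u, set()) or u in synonyms.get(v, set())
```

Note that each later stage is more permissive than the one before it: everything the exact module matches is also matched by the stem module.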


13. In the second phase of each stage, the largest subset of these unigram mappings is selected such that the resulting set constitutes an alignment as defined above.


14. If more than one such subset exists, METEOR selects the one that has the least number of unigram mapping crosses.
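The crossing criterion can be made concrete with a small sketch: if each mapping is represented as a (system-position, reference-position) index pair, two mappings cross when their relative order differs between the two strings. A brute-force count, for illustration only:

```python
def count_crossings(alignment):
    """Count crossing pairs in an alignment, where each mapping is a
    (system_position, reference_position) index pair."""
    crossings = 0
    for a in range(len(alignment)):
        for b in range(a + 1, len(alignment)):
            (i1, j1), (i2, j2) = alignment[a], alignment[b]
            # Opposite relative order in the two strings => the pair crosses.
            if (i1 - i2) * (j1 - j2) < 0:
                crossings += 1
    return crossings
```

For example, the monotone alignment [(0, 0), (1, 1)] has zero crossings, while the swapped alignment [(0, 1), (1, 0)] has one.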


15. By default the first stage uses the “exact” mapping module, the second the “porter stem” module, and the third the “WN synonymy” module.

16. The score is computed from:

      unigram precision (P)

      unigram recall (R)

      Fmean, obtained by combining precision and recall via a harmonic mean
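In the paper, P is the fraction of unigrams in the system translation that are matched, R is the fraction of unigrams in the reference that are matched, and Fmean = 10PR / (R + 9P), a harmonic mean that weights recall nine times as much as precision. A minimal sketch:

```python
def fmean(matches, sys_len, ref_len):
    """METEOR's Fmean: harmonic mean of unigram precision and recall,
    with recall weighted 9x as much as precision (Fmean = 10PR / (R + 9P))."""
    p = matches / sys_len  # unigram precision
    r = matches / ref_len  # unigram recall
    return 10 * p * r / (r + 9 * p)
```

With all six unigrams of a six-word translation matched against a six-word reference, P = R = Fmean = 1.0; with only half matched, Fmean = 0.5.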


To take into account longer matches, METEOR computes a penalty for a given alignment as follows.

The matched unigrams are grouped into the fewest possible number of chunks, such that the unigrams in each chunk are in adjacent positions in the system translation, and are also mapped to unigrams that are in adjacent positions in the reference translation.
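Per the paper, Penalty = 0.5 · (#chunks / #unigrams_matched)³, and the final score is Fmean · (1 − Penalty), so a fragmented alignment can lose at most half of its Fmean. A minimal sketch putting the pieces together:

```python
def meteor_score(matches, chunks, sys_len, ref_len):
    """Final METEOR score for a single reference:
    Fmean scaled down by the fragmentation penalty,
    Penalty = 0.5 * (chunks / matches)**3, Score = Fmean * (1 - Penalty)."""
    p = matches / sys_len
    r = matches / ref_len
    fmean = 10 * p * r / (r + 9 * p)
    penalty = 0.5 * (chunks / matches) ** 3
    return fmean * (1 - penalty)
```

In the worst case every matched unigram is its own chunk (chunks == matches), the penalty reaches its maximum of 0.5, and the score is half of Fmean; a single contiguous chunk incurs almost no penalty.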



Conclusion: METEOR weights recall more heavily than precision, whereas BLEU does the converse. METEOR also incorporates additional information into its matching, such as stemming and WordNet synonymy.


Publisher: 全栈程序员-站长. Original link: https://javaforall.net/117748.html



