[Paper Reading] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments


METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee   Alon Lavie 
Language Technologies Institute  
Carnegie Mellon University  
Pittsburgh, PA 15213  
banerjee+@cs.cmu.edu  alavie@cs.cmu.edu


Important Snippets:

1. In order to be both effective and useful, an automatic metric for MT evaluation has to satisfy several basic criteria. The primary and most intuitive requirement is that the metric have very high correlation with quantified human notions of MT quality. Furthermore, a good metric should be as sensitive as possible to differences in MT quality between different systems, and between different versions of the same system. The metric should be consistent (same MT system on similar texts should produce similar scores), reliable (MT systems that score similarly can be trusted to perform similarly) and general (applicable to different MT tasks in a wide range of domains and scenarios). Needless to say, satisfying all of the above criteria is extremely difficult, and all of the metrics that have been proposed so far fall short of adequately addressing most if not all of these requirements.


2. It is based on an explicit word-to-word matching between the MT output being evaluated and one or more reference translations. Our current matching supports not only matching between words that are identical in the two strings being compared, but can also match words that are simple morphological variants of each other.


3. Each possible matching is scored based on a combination of several features. These currently include unigram precision, unigram recall, and a direct measure of how out-of-order the words of the MT output are with respect to the reference.


4. Furthermore, our results demonstrated that recall plays a more important role than precision in obtaining high levels of correlation with human judgments.


5. BLEU does not take recall into account directly.


6. BLEU does not use recall because the notion of recall is unclear when matching simultaneously against a set of reference translations (rather than a single reference). To compensate for recall, BLEU uses a Brevity Penalty, which penalizes translations for being “too short”.
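The brevity penalty mentioned in item 6 can be sketched as follows. This is a minimal illustration of the standard BLEU definition (1 when the candidate is longer than the reference, otherwise exp(1 - r/c)); the function name and the zero-length guard are my own choices, not part of the notes above.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """BLEU-style brevity penalty for a candidate of length c and reference length r."""
    if candidate_len == 0:
        # Guard added for this sketch: an empty candidate gets the maximum penalty.
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    # Candidates no longer than the reference are scaled down exponentially.
    return math.exp(1 - reference_len / candidate_len)
```

A candidate half as long as the reference is scaled by exp(-1), roughly 0.37, while a candidate at least as long as the reference is not penalized at all.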


7. BLEU and NIST suffer from several weaknesses:

   > The Lack of Recall

   > Use of Higher Order N-grams

   > Lack of Explicit Word-matching Between Translation and Reference

   > Use of Geometric Averaging of N-grams


8. METEOR was designed to explicitly address the weaknesses in BLEU identified above. It evaluates a translation by computing a score based on explicit word-to-word matches between the translation and a reference translation. If more than one reference translation is available, the given translation is scored against each reference independently, and the best score is reported.
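The "score against each reference independently and report the best" rule can be sketched like this. The scorer `unigram_overlap` is a toy stand-in invented for this example, not the METEOR score itself; any function of (system, reference) could be passed in instead.

```python
def unigram_overlap(system, reference):
    """Toy stand-in scorer: fraction of reference words that appear in the system output."""
    system_words = set(system.split())
    reference_words = reference.split()
    if not reference_words:
        return 0.0
    return sum(word in system_words for word in reference_words) / len(reference_words)


def best_reference_score(system, references, score_fn):
    """Score the system output against every reference separately and keep the best."""
    return max(score_fn(system, reference) for reference in references)


references = ["the cat sat on the mat", "a cat was sitting on the mat"]
best = best_reference_score("the cat sat on a mat", references, unigram_overlap)
```

Because each reference is handled on its own, recall stays well defined, which is the difficulty item 6 attributes to BLEU's pooled references.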


9. Given a pair of translations to be compared (a system translation and a reference translation), METEOR creates an alignment between the two strings. We define an alignment as a mapping between unigrams, such that every unigram in each string maps to zero or one unigram in the other string, and to no unigrams in the same string.


10. This alignment is incrementally produced through a series of stages, each stage consisting of two distinct phases.


11. In the first phase an external module lists all the possible unigram mappings between the two strings.


12. Different modules map unigrams based on different criteria. The “exact” module maps two unigrams if they are exactly the same (e.g. “computers” maps to “computers” but not “computer”). The “porter stem” module maps two unigrams if they are the same after they are stemmed using the Porter stemmer (e.g. “computers” maps to both “computers” and to “computer”). The “WN synonymy” module maps two unigrams if they are synonyms of each other.
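The first phase and the three module types can be sketched as below. This is an illustration only: `toy_stem` is a crude suffix stripper standing in for the Porter stemmer, and `TOY_SYNONYMS` is a hand-made table standing in for WordNet; both are assumptions made to keep the example self-contained.

```python
def exact_match(a, b):
    """'exact' module: map two unigrams only if they are identical."""
    return a == b


def toy_stem(word):
    """Crude suffix stripping; a stand-in for the Porter stemmer, not the real algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_match(a, b):
    """'porter stem' module (approximated): map unigrams whose stems agree."""
    return toy_stem(a) == toy_stem(b)


# Hypothetical synonym table; the paper uses WordNet synonymy instead.
TOY_SYNONYMS = {frozenset(("car", "automobile"))}


def synonym_match(a, b):
    """'WN synonymy' module (approximated): map unigrams listed as synonyms."""
    return frozenset((a, b)) in TOY_SYNONYMS


def candidate_mappings(system, reference, module):
    """Phase one: list every (system index, reference index) pair the module accepts."""
    return [
        (i, j)
        for i, s in enumerate(system)
        for j, r in enumerate(reference)
        if module(s, r)
    ]
```

For example, `candidate_mappings(["the", "computers"], ["computer", "the"], exact_match)` finds only the pair for “the”, while the stem-based module also maps “computers” to “computer”.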


13. In the second phase of each stage, the largest subset of these unigram mappings is selected such that the resulting set constitutes an alignment as defined above.


14. METEOR selects that set that has the least number of unigram mapping crosses.
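The second phase can be sketched with a brute-force search: keep the largest one-to-one subset of the candidate mappings and break ties by the fewest crossings. Two mappings are treated as crossing when their order in the system translation is the reverse of their order in the reference. The exhaustive enumeration is an assumption made for clarity; it is exponential and only suitable for tiny inputs, and the function names are hypothetical.

```python
from itertools import combinations


def is_one_to_one(pairs):
    """True if no system index and no reference index is used more than once."""
    system_side = [s for s, _ in pairs]
    reference_side = [r for _, r in pairs]
    return (len(set(system_side)) == len(system_side)
            and len(set(reference_side)) == len(reference_side))


def count_crossings(pairs):
    """Count pairs of mappings whose relative order differs between the two strings."""
    return sum(
        1
        for (s1, r1), (s2, r2) in combinations(pairs, 2)
        if (s1 - s2) * (r1 - r2) < 0
    )


def select_alignment(candidates):
    """Largest valid subset of candidate mappings; ties broken by fewest crossings."""
    for size in range(len(candidates), 0, -1):
        valid = [subset for subset in combinations(candidates, size)
                 if is_one_to_one(subset)]
        if valid:
            return list(min(valid, key=count_crossings))
    return []
```

With system words "a b a" and reference words "a b", the candidates are (0, 0), (1, 1) and (2, 0). Both two-mapping subsets that are one-to-one are equally large, and the one without a crossing, [(0, 0), (1, 1)], is selected.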


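15. By default the first stage uses the “exact” mapping module, the second the “porter stem” module and the third the “WN synonymy” module.

16. Once the alignment is fixed, METEOR computes:

      unigram precision (P): the fraction of unigrams in the system translation that are mapped to the reference

      unigram recall (R): the fraction of unigrams in the reference translation that are mapped to the system translation

      Fmean, which combines precision and recall via a harmonic mean that places most of the weight on recall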

      Fmean = 10PR / (R + 9P)
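A small sketch of item 16, assuming the number of mapped unigrams is already known from the alignment. The function names and the zero-division guards are my own; the weighting follows the paper's recall-heavy harmonic mean.

```python
def unigram_precision_recall(matches, system_len, reference_len):
    """P = matches / system length, R = matches / reference length."""
    precision = matches / system_len if system_len else 0.0
    recall = matches / reference_len if reference_len else 0.0
    return precision, recall


def fmean(precision, recall):
    """Harmonic mean weighting recall nine times as much as precision: 10PR / (R + 9P)."""
    denominator = recall + 9 * precision
    return 10 * precision * recall / denominator if denominator else 0.0
```

With 3 matched unigrams, a 4-word system translation and a 6-word reference, P = 0.75 and R = 0.5, and Fmean is about 0.517, much closer to recall than to precision.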

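To take into account longer matches, METEOR computes a penalty for a given alignment as follows.

First, all the unigrams in the system translation that are mapped to unigrams in the reference translation are grouped into the fewest possible number of chunks such that the unigrams in each chunk are in adjacent positions in the system translation, and are also mapped to unigrams that are in adjacent positions in the reference translation.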

     Penalty = 0.5 * (#chunks / #unigrams_matched)^3

     Score = Fmean * (1 - Penalty)
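The chunk count and the resulting penalty can be sketched as follows, assuming the alignment is a one-to-one list of (system index, reference index) pairs. The function names are hypothetical; the penalty is 0.5 times the cube of chunks over matched unigrams, as in the paper.

```python
def count_chunks(alignment):
    """Fewest chunks: maximal runs that are adjacent and in order in both strings."""
    chunks = 0
    previous = None
    for s, r in sorted(alignment):
        # A new chunk starts whenever adjacency breaks on either side.
        if previous is None or s != previous[0] + 1 or r != previous[1] + 1:
            chunks += 1
        previous = (s, r)
    return chunks


def fragmentation_penalty(alignment):
    """Penalty = 0.5 * (chunks / matched unigrams) ** 3; zero when nothing is matched."""
    if not alignment:
        return 0.0
    return 0.5 * (count_chunks(alignment) / len(alignment)) ** 3


def meteor_score(fmean_value, alignment):
    """Score = Fmean * (1 - Penalty)."""
    return fmean_value * (1 - fragmentation_penalty(alignment))
```

A fully ordered alignment such as [(0, 0), (1, 1), (2, 2)] forms a single chunk and is barely penalized, whereas an alignment with no adjacent matches has as many chunks as unigrams and loses half of its Fmean.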


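Conclusion: METEOR weights recall more heavily than precision, whereas BLEU is precision-based and accounts for recall only indirectly through its brevity penalty. METEOR also draws on more information than exact n-gram overlap: stemming, synonymy, and word order.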


Published by: 全栈程序员-用户IM. When reposting, please credit the source: https://javaforall.cn/117748.html Original link: https://javaforall.cn
