word2vec: Principles and Implementation



Definition

word2vec is a method for mapping words into a vector space in which relationships between words, including their contextual relationships, are captured to some degree.
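
To make this concrete, here is a minimal sketch of what "relationships captured in the vector space" looks like in practice: with word vectors in hand, similarity can be measured with cosine similarity, and some analogies correspond roughly to vector offsets. The vectors below are hand-made toy values, not trained ones.

import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated directions
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy, hand-made 3-dimensional "word vectors" (illustrative only)
vec = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.5, 0.9, 0.0]),
    'woman': np.array([0.5, 0.1, 0.9]),
}

print(cosine(vec['king'], vec['queen']))   # related words point in similar directions
# The classic analogy check: king - man + woman should land near queen
print(cosine(vec['king'] - vec['man'] + vec['woman'], vec['queen']))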

Method

There are two architectures for learning the word vectors: CBOW (Continuous Bag of Words) and Skip-gram.
The figure below shows the network structure of CBOW.
(Figure: CBOW network architecture)
In the figure, x1, x2, ..., xC are the words of the context list in the source code; the size of this context window is sampled at random for every center word.
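
The following is a minimal sketch of how that randomized context window and the CBOW input layer are built, assuming a toy sentence of word indices and a small NumPy matrix syn0 of input-side vectors; the variable names deliberately mirror the ones used in the code excerpt later in this article.

import numpy as np

dim = 4                                                    # vector dimensionality (toy value)
syn0 = np.random.uniform(-0.5 / dim, 0.5 / dim, (6, dim))  # input-side word vectors
sent = [0, 3, 1, 4, 2]                                     # a sentence as word indices
sent_pos = 2                                               # position of the current center word
win = 5                                                    # maximum window size

# Sample an effective window size in [1, win], as the implementation below does
current_win = np.random.randint(low=1, high=win + 1)
context_start = max(sent_pos - current_win, 0)
context_end = min(sent_pos + current_win + 1, len(sent))
context = sent[context_start:sent_pos] + sent[sent_pos + 1:context_end]  # x1 .. xC

# CBOW input layer: the mean of the context word vectors
neu1 = np.mean(np.array([syn0[c] for c in context]), axis=0)
print(neu1.shape)  # (dim,)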

Source Code Walkthrough

The walkthrough below uses an open-source implementation: Word2vec GitHub code
Training flow:

  1. Load the training file and build the vocabulary
  2. Initialize the network weights and the Huffman tree (a sketch of the weight initialization follows this list)
  3. Train with multiple worker processes
    1. Iterate over each line of the corpus and turn it into a list of word indices
      1. Build each word's context according to the window size
      2. Feed it into the network for one training step
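
The excerpt below does not include init_net from step 2, so here is a minimal sketch of what it typically does in this kind of implementation, assuming plain NumPy arrays: the input-side vectors syn0 get small uniform random values and the output-side weights syn1 start at zero. The real implementation also has to share these arrays across the worker processes; that plumbing is omitted here.

import numpy as np

def init_net_sketch(dim, vocab_size):
    # Input-side vectors (syn0): small uniform random values, one row per word
    syn0 = np.random.uniform(low=-0.5 / dim, high=0.5 / dim, size=(vocab_size, dim))
    # Output-side weights (syn1): zeros; rows are indexed either by Huffman tree
    # inner nodes (hierarchical softmax) or by output words (negative sampling)
    syn1 = np.zeros((vocab_size, dim))
    return syn0, syn1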

Core training routine:


# Dependencies needed by this excerpt; helpers such as Vocab, init_net,
# UnigramTable, sigmoid, save, __init_process and starting_alpha are defined
# elsewhere in the same module (the code is Python 2).
import sys
import time
import numpy as np
from multiprocessing import Pool, Value

def train(fi, fo, cbow, neg, dim, alpha, win, min_count, num_processes, binary):
    # Read train file to init vocab
    vocab = Vocab(fi, min_count)

    # Init net
    syn0, syn1 = init_net(dim, len(vocab))

    global_word_count = Value('i', 0)
    table = None
    if neg > 0:  # the default is 5 negative samples
        print 'Initializing unigram table'
        table = UnigramTable(vocab)
    else:  # no negative sampling, so use hierarchical softmax
        print 'Initializing Huffman tree'
        vocab.encode_huffman()

    # Begin training using num_processes workers
    t0 = time.time()
    pool = Pool(processes=num_processes, initializer=__init_process,
                initargs=(vocab, syn0, syn1, table, cbow, neg, dim, alpha,
                          win, num_processes, global_word_count, fi))
    pool.map(train_process, range(num_processes))
    t1 = time.time()
    print
    print 'Completed training. Training took', (t1 - t0) / 60, 'minutes'

    # Save model to file
    save(vocab, syn0, fo, binary)

def train_process(pid):
    # Set fi to point to the right chunk of training file:
    # each worker handles its own byte range, determined by its process id
    start = vocab.bytes / num_processes * pid
    end = vocab.bytes if pid == num_processes - 1 else vocab.bytes / num_processes * (pid + 1)
    fi.seek(start)
    #print 'Worker %d beginning training at %d, ending at %d' % (pid, start, end)

    alpha = starting_alpha

    word_count = 0
    last_word_count = 0

    # Iterate over this worker's chunk of the file
    while fi.tell() < end:
        line = fi.readline().strip()
        # Skip blank lines
        if not line:
            continue

        # Turn the line into a list of word indices
        sent = vocab.indices(['<bol>'] + line.split() + ['<eol>'])

        # Iterate over every word in the sentence
        for sent_pos, token in enumerate(sent):
            if word_count % 10000 == 0:
                global_word_count.value += (word_count - last_word_count)
                last_word_count = word_count

                # Recalculate the learning rate alpha (linear decay with a floor)
                alpha = starting_alpha * (1 - float(global_word_count.value) / vocab.word_count)
                if alpha < starting_alpha * 0.0001: alpha = starting_alpha * 0.0001

                # Print progress info
                sys.stdout.write("\rAlpha: %f Progress: %d of %d (%.2f%%)" %
                                 (alpha, global_word_count.value, vocab.word_count,
                                  float(global_word_count.value) / vocab.word_count * 100))
                sys.stdout.flush()

            # Randomize window size, where win is the max window size
            current_win = np.random.randint(low=1, high=win+1)
            context_start = max(sent_pos - current_win, 0)
            context_end = min(sent_pos + current_win + 1, len(sent))
            # Build the input context [x1, x2, ..., xC]
            context = sent[context_start:sent_pos] + sent[sent_pos+1:context_end]  # Turn into an iterator?

            # CBOW
            if cbow:
                # Compute neu1: average the context word vectors to form the input layer
                neu1 = np.mean(np.array([syn0[c] for c in context]), axis=0)
                assert len(neu1) == dim, 'neu1 and dim do not agree'

                # Init neu1e with zeros
                neu1e = np.zeros(dim)

                # Compute neu1e and update syn1.
                # Build the (target, label) pairs: the true word plus sampled negatives, ...
                if neg > 0:
                    classifiers = [(token, 1)] + [(target, 0) for target in table.sample(neg)]
                else:
                    # ... or the word's Huffman path nodes paired with its code bits
                    classifiers = zip(vocab[token].path, vocab[token].code)
                for target, label in classifiers:
                    z = np.dot(neu1, syn1[target])
                    p = sigmoid(z)
                    g = alpha * (label - p)
                    neu1e += g * syn1[target]  # Error to backpropagate to syn0
                    syn1[target] += g * neu1   # Update syn1 (output-side weights)

                # Update syn0
                for context_word in context:
                    syn0[context_word] += neu1e

            # Skip-gram
            else:
                for context_word in context:
                    # Init neu1e with zeros
                    neu1e = np.zeros(dim)

                    # Compute neu1e and update syn1
                    if neg > 0:
                        classifiers = [(token, 1)] + [(target, 0) for target in table.sample(neg)]
                    else:
                        classifiers = zip(vocab[token].path, vocab[token].code)
                    for target, label in classifiers:
                        z = np.dot(syn0[context_word], syn1[target])
                        p = sigmoid(z)
                        g = alpha * (label - p)
                        neu1e += g * syn1[target]              # Error to backpropagate to syn0
                        syn1[target] += g * syn0[context_word] # Update syn1

                    # Update syn0
                    syn0[context_word] += neu1e

            word_count += 1

    # Print progress info
    global_word_count.value += (word_count - last_word_count)
    sys.stdout.write("\rAlpha: %f Progress: %d of %d (%.2f%%)" %
                     (alpha, global_word_count.value, vocab.word_count,
                      float(global_word_count.value)/vocab.word_count * 100))
    sys.stdout.flush()
    fi.close()
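
Two helpers referenced above, sigmoid and UnigramTable, are not part of the excerpt. The sketch below shows the standard word2vec recipe they are usually based on: a logistic function, and a negative-sampling table that draws words with probability proportional to count^0.75. The class name and internals here are illustrative, not the repository's exact code.

import numpy as np

def sigmoid(z):
    # Logistic function used by both negative sampling and hierarchical softmax
    return 1.0 / (1.0 + np.exp(-z))

class UnigramTableSketch:
    # Illustrative negative-sampling table: words are drawn with probability
    # proportional to count ** 0.75, the usual word2vec smoothing of the
    # unigram distribution.
    def __init__(self, vocab, table_size=1000000):
        counts = np.array([t.count for t in vocab], dtype=np.float64)
        probs = counts ** 0.75
        probs /= probs.sum()
        # Pre-draw a large table of indices so sample() is a cheap lookup
        self.table = np.random.choice(len(counts), size=table_size, p=probs)

    def sample(self, count):
        # Return `count` word indices drawn from the smoothed unigram distribution
        positions = np.random.randint(low=0, high=len(self.table), size=count)
        return [self.table[i] for i in positions]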

The training code relies on an important class, Vocab, which builds the vocabulary and handles the Huffman encoding of its words:

# Stores one vocabulary entry: its surface form, frequency, and (after
# encode_huffman) its Huffman path and code
class VocabItem:
    def __init__(self, word):
        self.word = word
        self.count = 0
        self.path = None # Path (list of indices) from the root to the word (leaf)
        self.code = None # Huffman encoding

class Vocab:
    def __init__(self, fi, min_count):
        vocab_items = []
        vocab_hash = {}
        word_count = 0
        fi = open(fi, 'r')

        # Add special tokens <bol> (beginning of line) and <eol> (end of line)
        for token in ['<bol>', '<eol>']:
            vocab_hash[token] = len(vocab_items)
            vocab_items.append(VocabItem(token))

        for line in fi:
            tokens = line.split()
            for token in tokens:
                if token not in vocab_hash:
                    vocab_hash[token] = len(vocab_items)
                    vocab_items.append(VocabItem(token))

                #assert vocab_items[vocab_hash[token]].word == token, 'Wrong vocab_hash index'
                vocab_items[vocab_hash[token]].count += 1
                word_count += 1

                if word_count % 10000 == 0:
                    sys.stdout.write("\rReading word %d" % word_count)
                    sys.stdout.flush()

            # Add special tokens <bol> (beginning of line) and <eol> (end of line)
            vocab_items[vocab_hash['<bol>']].count += 1
            vocab_items[vocab_hash['<eol>']].count += 1
            word_count += 2

        self.bytes = fi.tell()
        self.vocab_items = vocab_items         # List of VocabItem objects
        self.vocab_hash = vocab_hash           # Mapping from each token to its index in vocab
        self.word_count = word_count           # Total number of words in train file

        # Add special token <unk> (unknown),
        # merge words occurring less than min_count into <unk>, and
        # sort vocab in descending order by frequency in train file
        self.__sort(min_count)

        #assert self.word_count == sum([t.count for t in self.vocab_items]), 'word_count and sum of t.count do not agree'
        print 'Total words in training file: %d' % self.word_count
        print 'Total bytes in training file: %d' % self.bytes
        print 'Vocab size: %d' % len(self)

    def __getitem__(self, i):
        return self.vocab_items[i]

    def __len__(self):
        return len(self.vocab_items)

    def __iter__(self):
        return iter(self.vocab_items)

    def __contains__(self, key):
        return key in self.vocab_hash

    def __sort(self, min_count):
        tmp = []
        tmp.append(VocabItem('<unk>'))
        unk_hash = 0

        count_unk = 0
        for token in self.vocab_items:
            if token.count < min_count:
                count_unk += 1
                tmp[unk_hash].count += token.count
            else:
                tmp.append(token)

        tmp.sort(key=lambda token : token.count, reverse=True)

        # Update vocab_hash
        vocab_hash = {}
        for i, token in enumerate(tmp):
            vocab_hash[token.word] = i

        self.vocab_items = tmp
        self.vocab_hash = vocab_hash

        print
        print 'Unknown vocab size:', count_unk

    def indices(self, tokens):
        return [self.vocab_hash[token] if token in self else self.vocab_hash['<unk>'] for token in tokens]

    def encode_huffman(self):
        # Build a Huffman tree
        vocab_size = len(self)
        count = [t.count for t in self] + [1e15] * (vocab_size - 1)
        parent = [0] * (2 * vocab_size - 2)
        binary = [0] * (2 * vocab_size - 2)

        pos1 = vocab_size - 1
        pos2 = vocab_size

        for i in xrange(vocab_size - 1):
            # Find min1
            if pos1 >= 0:
                if count[pos1] < count[pos2]:
                    min1 = pos1
                    pos1 -= 1
                else:
                    min1 = pos2
                    pos2 += 1
            else:
                min1 = pos2
                pos2 += 1

            # Find min2
            if pos1 >= 0:
                if count[pos1] < count[pos2]:
                    min2 = pos1
                    pos1 -= 1
                else:
                    min2 = pos2
                    pos2 += 1
            else:
                min2 = pos2
                pos2 += 1

            count[vocab_size + i] = count[min1] + count[min2]
            parent[min1] = vocab_size + i
            parent[min2] = vocab_size + i
            binary[min2] = 1

        # Assign binary code and path pointers to each vocab word
        root_idx = 2 * vocab_size - 2
        for i, token in enumerate(self):
            path = [] # List of indices from the leaf to the root
            code = [] # Binary Huffman encoding from the leaf to the root

            node_idx = i
            while node_idx < root_idx:
                if node_idx >= vocab_size: path.append(node_idx)
                code.append(binary[node_idx])
                node_idx = parent[node_idx]
            path.append(root_idx)

            # These are path and code from the root to the leaf
            token.path = [j - vocab_size for j in path[::-1]]
            token.code = code[::-1]
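
Finally, a minimal usage sketch of the class above (Python 2, to match the excerpt). The file name corpus.txt is a hypothetical whitespace-tokenized training file; after encode_huffman every vocabulary entry carries its Huffman code (the 0/1 decisions from the root) and its path (the inner-node indices that select rows of syn1 during hierarchical-softmax training).

vocab = Vocab('corpus.txt', min_count=5)   # corpus.txt is a hypothetical training file
vocab.encode_huffman()

# Inspect the Huffman code and inner-node path of one entry; <unk> always exists
token = vocab[vocab.vocab_hash['<unk>']]
print token.word, token.code, token.path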