
I. Building the foreign-trade enterprise relationship graph

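I'm a little embarrassed to admit that I never wrote a blog post during my undergraduate or graduate studies. I happen to be working on my thesis right now and want to base it on a project I developed, so this post records how I used the Neo4j graph database to run enterprise-similarity queries, for future reference.
The graph is built from data in the test database (Oracle) of an earlier project. The data was exported to CSV and then loaded into Neo4j with the Python library py2neo.
Because the data contains confidential project information, it is not shared here.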

1. Exporting the data from Oracle

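Part of the table's structure in the Oracle database is shown below:
[Image: partial structure of the Oracle table]
The database currently holds a little over 300,000 foreign-trade enterprise records. After two rounds of data cleaning and filtering, I selected about 120,000 of them for export and saved them in CSV format.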

2. Importing the data into Neo4j

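Neo4j ships with its own CSV import tool, and CSV data can also be imported with Cypher statements, but here I used Python's py2neo library to do the import.
The Python code is structured as follows:
[Image: structure of the Python code]
The implementation of each function is described below: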

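# Imports used by the methods below, which all belong to the EnterpriseGragh class
import os
import numpy as np
import pandas as pd
from py2neo import Graph, Node

def __init__(self, data):
    '''Initialize and connect to Neo4j'''
    self.data = data
    self.g = Graph(
        host="127.0.0.1",  # IP address of the server running Neo4j
        http_port=7474,  # port the Neo4j server listens on
        user="neo4j",  # database user name
        password="112233")  # password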

def read_nodes(self):
    '''Read the data and build the node sets and relationship lists'''
    # Five kinds of nodes in total
    enterprise = self.data['COMP_NAME_CH']  # enterprises
    region = set(self.data['PROVINCE_CH'].dropna())  # regions, missing values dropped
    country = []  # export countries
    for index, row in self.data.iterrows():
        # Assumes EXPORT_COUNTRY_MXT is never missing after data cleaning
        for r in row['EXPORT_COUNTRY_MXT'].split(','):
            country.append(r)
    # Enterprise type codes: 1 = manufacturer (生产型), 2 = trader (贸易型, 贸信通), 3 = service (服务型)
    enterprise_type = ['生产型', '贸易型', '服务型']  # enterprise types
    legal_representative = self.data['LEGAL_REPRESENTATIVE']  # legal representatives

    # Build the relationships between entities
    rels_region = []  # enterprise-region relationship: locate
    rels_country = []  # enterprise-export country relationship: export
    rels_type = []  # enterprise-enterprise type relationship: type
    # rels_product = []  # enterprise-product relationship: product
    rels_legal = []  # enterprise-legal representative relationship: legal
    for index, row in self.data.iterrows():
        if pd.notna(row['PROVINCE_CH']):
            rels_region.append([row['COMP_NAME_CH'], row['PROVINCE_CH']])
        for r in row['EXPORT_COUNTRY_MXT'].split(','):
            # One enterprise can have several export countries
            rels_country.append([row['COMP_NAME_CH'], r])
        # Map the type code to its name, following the code table above
        rels_type.append([row['COMP_NAME_CH'], '生产型' if row['COMP_TYPE'] == 1
            else ('贸易型' if row['COMP_TYPE'] == 2 else '服务型')])
        rels_legal.append([row['COMP_NAME_CH'], row['LEGAL_REPRESENTATIVE']])

    return set(enterprise), set(region), set(country), set(enterprise_type), set(legal_representative), \
           rels_region, rels_country, rels_type, rels_legal
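
To make the shape of the relationship lists concrete, here is a small standalone sketch that uses plain dictionaries instead of a DataFrame. The two sample rows and the helper name `build_export_edges` are made up for illustration; only the column names come from the code above. Unlike the original loop, this version also strips whitespace and skips empty fragments, which is an assumption about what the data might need.

```python
def build_export_edges(rows):
    """Expand each row's comma-separated export countries into [enterprise, country] pairs."""
    edges = []
    for row in rows:
        countries = row.get('EXPORT_COUNTRY_MXT') or ''  # treat a missing value as empty
        for country in countries.split(','):
            country = country.strip()
            if country:  # skip empty fragments such as a trailing comma
                edges.append([row['COMP_NAME_CH'], country])
    return edges


# Hypothetical sample rows (not taken from the real data set)
sample_rows = [
    {'COMP_NAME_CH': 'Company A', 'EXPORT_COUNTRY_MXT': 'US,Japan'},
    {'COMP_NAME_CH': 'Company B', 'EXPORT_COUNTRY_MXT': 'Japan'},
]
export_edges = build_export_edges(sample_rows)
print(export_edges)
```

Each pair later becomes one `export` relationship from an Enterprise node to a Country node.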

def create_node(self, label, nodes):
    '''Create nodes that carry a single label'''
    count = 0
    for node_name in nodes:
        node = Node(label, name=node_name)
        self.g.create(node)
        count += 1
        print(count, len(nodes))  # progress: created so far / total
    return

def create_enterprise_nodes(self):
    '''Create the foreign-trade enterprise nodes of the knowledge graph'''
    count = 0
    for index, row in self.data.iterrows():
        # '万人民币' is the unit suffix (ten thousand RMB) stored with the registered capital
        node = Node("Enterprise", name=row['COMP_NAME_CH'], credit_code=row['CREDIT_CODE'],
                    setup_time=row['SETUP_TIME'], address=row['ADDRESS_CH'],
                    capital=str(row['REG_CAPITAL']) + '万人民币')
        self.g.create(node)
        count += 1
        print(count)
    return

def create_relationship(self, start_node, end_node, edges, rel_type, rel_name):
    '''Create the relationship edges between entities'''
    count = 0
    # Remove duplicate edges
    unique_edges = set(tuple(edge) for edge in edges)
    total = len(unique_edges)
    # Labels and relationship types cannot be passed as parameters, so they are
    # formatted into the query; the names are passed as parameters so that a
    # quote inside a name cannot break the query
    query = "match(p:%s),(q:%s) where p.name=$p and q.name=$q create (p)-[rel:%s{name:$rel_name}]->(q)" % (
        start_node, end_node, rel_type)
    for p, q in unique_edges:
        try:
            self.g.run(query, p=p, q=q, rel_name=rel_name)
            count += 1
            print(rel_type, count, total)
        except Exception as e:
            print(e)
    return

def create_graphnodes(self):
    '''Create the entity nodes of the knowledge graph'''
    # Get all nodes and relationships
    Enterprises, Regions, Countries, Enterprise_types, Legal_representatives, \
    rels_region, rels_country, rels_type, rels_legal = self.read_nodes()
    # Create the nodes in the graph database
    self.create_enterprise_nodes()  # enterprises
    self.create_node('Region', Regions)  # regions
    print('Regions: ' + str(len(Regions)))
    self.create_node('Country', Countries)  # export countries
    print('Export countries: ' + str(len(Countries)))
    self.create_node('Type', Enterprise_types)  # enterprise types
    print('Enterprise types: ' + str(len(Enterprise_types)))
    # This node type and its relationship are not needed for now
    # self.create_node('Legal', Legal_representatives)  # legal representatives
    # print('Legal representatives: ' + str(len(Legal_representatives)))
    return

def create_graphrels(self):
    '''Create the relationship edges of the knowledge graph'''
    # Get all relationship lists
    Enterprises, Regions, Countries, Enterprise_types, Legal_representatives, \
    rels_region, rels_country, rels_type, rels_legal = self.read_nodes()
    self.create_relationship('Enterprise', 'Region', rels_region, 'locate', '所在地区')
    self.create_relationship('Enterprise', 'Country', rels_country, 'export', '出口')
    self.create_relationship('Enterprise', 'Type', rels_type, 'type', '类型')
    # This relationship does not need to be imported for now
    # self.create_relationship('Enterprise', 'Legal', rels_legal, 'legal', '法人')

Finally, the main function:

if __name__ == '__main__':
    # Directory that contains this script
    cur_dir = os.path.dirname(os.path.abspath(__file__))
    data_path = os.path.join(cur_dir, 'TB_ENTERPRISEINFO_FUSE_BAK.csv')
    print('read_csv from:' + data_path)
    data = pd.read_csv(data_path)
    # Create the instance
    handler = EnterpriseGragh(data)
    # Build the nodes and relationships of the enterprise graph
    handler.create_graphnodes()
    handler.create_graphrels()

3. Viewing the data in Neo4j

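The import ran for more than 20 hours before the graph was finally built in Neo4j. My code is probably not well optimized =_=||; Neo4j's built-in import tool would likely have been a good deal faster.
The database information and a sample query result are shown below (4 kinds of nodes and 3 kinds of relationships in total):
[Image: database information and query result in Neo4j]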
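
One likely reason for the long run time is that every relationship is created by its own query, and each query has to find both endpoints by name. A common way to reduce this overhead is to send rows in batches with `UNWIND` and to index the `name` property beforehand. The sketch below is an assumption about how that could look and was not measured on this data set: `run_query` stands for any callable that executes Cypher with keyword parameters (for example `handler.g.run`), the helper names are hypothetical, and the batch size of 1000 is arbitrary.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def create_relationships_batched(run_query, start_label, end_label, edges,
                                 rel_type, rel_name, batch_size=1000):
    """Create relationships in batches; returns the number of batches sent."""
    # Labels and relationship types cannot be Cypher parameters, so they are
    # formatted into the text; the node names travel as parameters.
    query = (
        "UNWIND $rows AS row "
        "MATCH (p:%s {name: row.start}), (q:%s {name: row.end}) "
        "CREATE (p)-[:%s {name: $rel_name}]->(q)"
        % (start_label, end_label, rel_type)
    )
    unique_edges = sorted({(s, e) for s, e in edges})  # de-duplicate the edge list
    batches = 0
    for batch in chunked(unique_edges, batch_size):
        rows = [{'start': s, 'end': e} for s, e in batch]
        run_query(query, rows=rows, rel_name=rel_name)
        batches += 1
    return batches


# Stand-in for graph.run so the sketch runs without a database
sent = []

def fake_run(query, **params):
    sent.append((query, params))

n_batches = create_relationships_batched(
    fake_run, 'Enterprise', 'Country',
    [['A', 'US'], ['A', 'US'], ['B', 'Japan']], 'export', '出口', batch_size=1)
print(n_batches, len(sent))
```

On Neo4j 3.x an index on the lookup property can be created with `CREATE INDEX ON :Enterprise(name)`; newer versions use a different index syntax.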

II. Enterprise association queries with Cypher

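I won't list the simple queries here. The following ones seem more useful as a reference: each finds the enterprises most closely related to a given enterprise and returns them as the result.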

1. Multi-level relationship query

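Because the directed relationships in this graph are only one level deep (every relationship points from an enterprise to a region, country, or type node), the query must not specify a relationship direction. Taking '陕西和沃进出口有限公司' as the example, the multi-level relationships of this enterprise are shown below:
[Image: result of the multi-level relationship query]
The corresponding Cypher query is: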

match p=(n:Enterprise{name:'陕西和沃进出口有限公司'})-[*2..3]-() return p limit 20
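
For readers unfamiliar with variable-length patterns: `-[*2..3]-` matches paths of two or three relationships, followed in either direction, where no relationship is used twice within one path. The pure-Python sketch below mimics that behaviour on a tiny invented graph. The node names and the `paths_between_hops` helper are illustrative only and say nothing about how Neo4j executes the query internally.

```python
def paths_between_hops(edges, start, min_hops, max_hops):
    """Enumerate undirected paths from `start` with min_hops..max_hops edges,
    never reusing an edge within a path."""
    # Build an undirected adjacency list that remembers each edge's index
    adjacency = {}
    for idx, (a, b) in enumerate(edges):
        adjacency.setdefault(a, []).append((b, idx))
        adjacency.setdefault(b, []).append((a, idx))

    results = []

    def walk(node, path, used):
        hops = len(path) - 1
        if hops >= min_hops:
            results.append(list(path))
        if hops == max_hops:
            return
        for neighbour, idx in adjacency.get(node, []):
            if idx not in used:
                path.append(neighbour)
                used.add(idx)
                walk(neighbour, path, used)
                used.remove(idx)
                path.pop()

    walk(start, [start], set())
    return results


# Invented toy graph: two enterprises sharing one export country
toy_edges = [
    ('Company A', 'Japan'),    # A -export-> Japan
    ('Company B', 'Japan'),    # B -export-> Japan
    ('Company B', 'Shaanxi'),  # B -locate-> Shaanxi
]
found = paths_between_hops(toy_edges, 'Company A', 2, 3)
for p in found:
    print(' - '.join(p))
```

With a direction on the pattern, Company B could never be reached from Company A in this toy graph, because both `export` edges point into the country node.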

2. Jaccard similarity based on neighbor information

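Again taking '陕西和沃进出口有限公司' as the example, the Jaccard similarity between enterprises is computed from their export countries and used as the similarity measure. (Because the Jaccard computation is based only on the export relationship, its results differ from those of the weighted score in subsection 3.)
The Jaccard similarity is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B are the sets of export countries of the two enterprises.
The results obtained with this formula are shown below:
[Image: result of the Jaccard similarity query]
The corresponding Cypher query is: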

MATCH (n:Enterprise{name:'陕西和沃进出口有限公司'})-[:export]->(c:Country)<-[:export]-(other:Enterprise)
WITH n, other, count(c) AS intersection
MATCH (n)-[:export]->(nc:Country)
WITH n, other, intersection, collect(nc.name) AS s1
MATCH (other)-[:export]->(oc:Country)
WITH n, other, intersection, s1, collect(oc.name) AS s2
WITH n, other, intersection, s1 + [x IN s2 WHERE NOT x IN s1] AS uni, s1, s2
RETURN n.name, other.name, s1, s2, ((1.0*intersection)/size(uni)) AS jaccard
ORDER BY jaccard DESC
LIMIT 20
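
The same computation can be checked by hand on small sets. The following minimal sketch computes the Jaccard similarity in plain Python and ranks candidates by it; the company names and country sets are invented for illustration and are not results from the real graph.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Invented export-country sets, for illustration only
target = {'US', 'Japan', 'Germany'}
candidates = {
    'Company B': {'US', 'Japan'},
    'Company C': {'Japan', 'Brazil', 'India', 'Germany'},
}
ranking = sorted(
    ((name, jaccard(target, countries)) for name, countries in candidates.items()),
    key=lambda item: item[1], reverse=True)
print(ranking)
```

Company B shares two of three countries in the union (2/3), while Company C shares two of five (2/5), so Company B ranks first even though both have the same number of countries in common with the target.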

3. Weighted association score

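Taking '陕西和沃进出口有限公司' as the example once more, we find the nodes that share relationships with this enterprise, compute a weighted sum over the three relationships (type, locate, export, i.e. enterprise type, region, and export country), and use that score as the measure of enterprise similarity. The most closely related enterprises obtained this way are shown below.
[Image: result of the weighted association score query]
The corresponding Cypher query is: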

MATCH (n:Enterprise) where n.name='陕西和沃进出口有限公司'
match (n)-[:type]->(t:Type)<-[:type]-(other:Enterprise)
with n,other,count(t) as tn
optional match (n)-[:locate]->(r:Region)<-[:locate]-(other)
with n,other,tn,count(r) as rn
optional match (n)-[:export]->(c:Country)<-[:export]-(other)
with n,other,tn,rn,count(c) as cn
return other.name as 推荐企业,tn as 相同企业类型,rn as 相同地区,cn as 相同出口国家,(3*tn)+(3*rn)+(1*cn) as score
ORDER BY score DESC
limit 100
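
As a worked example of the scoring rule, the sketch below applies the same weights (3 for a shared type, 3 for a shared region, 1 per shared export country) to two invented records. It assumes each enterprise has exactly one type and at most one region, so those two counts are 0 or 1; the record layout and the `weighted_score` helper are hypothetical. Note that the query's first `MATCH` is not optional, so the query only ever returns enterprises of the same type, whereas this function also scores enterprises of a different type.

```python
def weighted_score(target, other, w_type=3, w_region=3, w_country=1):
    """Score `other` against `target` with the weights used in the query (3, 3, 1)."""
    tn = 1 if target['type'] == other['type'] else 0        # same enterprise type
    rn = 1 if target['region'] == other['region'] else 0    # same region
    cn = len(set(target['countries']) & set(other['countries']))  # shared export countries
    return w_type * tn + w_region * rn + w_country * cn


# Invented records, for illustration only
target = {'type': '贸易型', 'region': 'Shaanxi', 'countries': {'US', 'Japan'}}
other = {'type': '贸易型', 'region': 'Guangdong', 'countries': {'Japan', 'Brazil'}}
score = weighted_score(target, other)
print(score)
```

Here the two enterprises share a type (3 points) and one export country (1 point) but not a region, giving a score of 4.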

III. Summary

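That is the whole process of building and querying the foreign-trade enterprise relationship graph; it is fairly basic.

In my view, possible directions for application and research include partner discovery, similar-enterprise recommendation, investment risk prediction, and enterprise market forecasting.

It all looks plausible enough, but I have not actually started writing the thesis yet...
I hope to settle on a thesis direction soon. Wish me luck!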

Published by: 全栈程序员-用户IM. Please credit the source when reposting: https://javaforall.cn/153327.html Original link: https://javaforall.cn


Enterprise similarity queries on a foreign-trade enterprise relationship graph built with Neo4j
