Data Collection Tool: DataX

0. Background

To compare the features of Kettle and DataX, I first deployed DataX for a quick technical proof of concept (POC).

Required software:

(1) The DataX github repository

(2) The DataX Web github repository

(3) The Hadoop Common download page

(4) The winutils github repository

(5) The test data: the "China 5-level administrative divisions MySQL database" repository, with 758,049 rows.

1. Preparing the Databases

(1) Prepare the MySQL database

① System database

Create a datax_web database in MySQL with the utf8mb4 character set.

Then import the data from datax_web.sql into it.
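
A minimal sketch of those two steps from the mysql command line (assuming a local server and a root account; adjust host and credentials as needed):

# create the system database with the utf8mb4 character set
mysql -uroot -p -e "CREATE DATABASE datax_web DEFAULT CHARACTER SET utf8mb4;"
# load the schema and seed data shipped with datax-web
mysql -uroot -p datax_web < datax_web.sql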

② Source database for the collection test

Extract the cnarea20200630.7z file from the cloned "China 5-level administrative divisions MySQL database" repository.

Then import the extracted cnarea20200630.sql script into a cnarea20200630 database in MySQL.
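
A sketch of that flow (assuming the 7-Zip command-line tool is on PATH):

# unpack the dump, create the target database, then import the script
7z x cnarea20200630.7z
mysql -uroot -p -e "CREATE DATABASE cnarea20200630;"
mysql -uroot -p cnarea20200630 < cnarea20200630.sql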

During the import, this error comes up: [ERR] 2006 - MySQL server has gone away

The fix:

 -- raise the packet-size limit, then verify the new value
 set global max_allowed_packet=1024*1024*128;
 show global variables like 'max_allowed_packet';

My MySQL instance had max_allowed_packet set to 4194304, i.e. 1024*1024*4.

After repeated failures I raised it bit by bit; the import finally succeeded at 1024*1024*128, i.e. 134217728.

(2) Prepare the PostgreSQL database

PostgreSQL serves as the target of the collection job: the data in MySQL's cnarea20200630 database will be copied into it.

Create a test database in PostgreSQL for later use.
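
For example, from the psql side (assuming a local PostgreSQL reachable as the postgres superuser):

# create the empty target database for the collection test
psql -U postgres -c "CREATE DATABASE test;"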

2. Using DataX Web

(1) Log in

Open http://127.0.0.1:8080/index.html to reach the DataX Web login page.

The default account is admin; the default password is 123456.

After logging in you land on a "Run Reports" dashboard.

(2) Add a "data source"

Click the "Data Source Management" menu to enter the data source management page.

Click the "Add" button to add a MySQL data source.

Fill in the data source configuration and click the "Test Connection" button to check whether the data source is reachable.

A popup reading "Success Tested Successfully" confirms the data source is reachable; click "Confirm".

In the same way, set up a data source for the PostgreSQL test database.

(3) DataX task templates

Under the "Task Management" menu, click "DataX Task Template", then click the "Add" button to create a task template.

(4) Task building

Under the "Task Management" menu, click "Task Build" to construct a task.

Step 1, build the reader: set the data source and table name.

Step 2, build the writer: set the data source, schema, and table name.

Step 3, field mapping: set the data source and table name.

Step 4, build: set the source fields and the target fields.

After clicking "Next", three buttons appear: 1. Build, 2. Select Template, 3. Copy JSON.

Clicking "1. Build" generates a job JSON:

{
  "job": {
    "setting": {
      "speed": {
        "channel": 3,
        "byte": 1048576
      },
      "errorLimit": {
        "record": 0,
        "percentage": 0.02
      }
    },
    "content": [
      {
        "reader": {
          "name": "postgresqlreader",
          "parameter": {
            "username": "XVko54UY9nOe/3JQGQUikw==",
            "password": "XCYVpFosvZBBWobFzmLWvA==",
            "column": [
              "\"id\"",
              "\"level\"",
              "\"parent_code\"",
              "\"area_code\"",
              "\"zip_code\"",
              "\"city_code\"",
              "\"name\"",
              "\"short_name\"",
              "\"merger_name\"",
              "\"pinyin\"",
              "\"lng\"",
              "\"lat\""
            ],
            "splitPk": "",
            "connection": [
              {
                "table": [
                  "cnarea_2020"
                ],
                "jdbcUrl": [
                  "jdbc:postgresql://192.168.1.100:5432/test"
                ]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "yRjwDFuoPKlqya9h9H2Amg==",
            "password": "XCYVpFosvZBBWobFzmLWvA==",
            "column": [
              "`id`",
              "`level`",
              "`parent_code`",
              "`area_code`",
              "`zip_code`",
              "`city_code`",
              "`name`",
              "`short_name`",
              "`merger_name`",
              "`pinyin`",
              "`lng`",
              "`lat`"
            ],
            "connection": [
              {
                "table": [
                  "cnarea_2020"
                ],
                "jdbcUrl": "jdbc:mysql://192.168.1.100:3306/cnarea20200630"
              }
            ]
          }
        }
      }
    ]
  }
}
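
This JSON can also be run by hand against a standalone DataX install, which is handy for debugging outside the web UI. A minimal sketch, assuming the JSON was saved as job/mysql2pg.json (a hypothetical file name) under the datax directory:

# invoke the DataX engine directly with the generated job file
python bin/datax.py job/mysql2pg.json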

Click the "2. Select Template" button to pick a task template.

Once selected, the second button changes to the DataX task template's "task ID (task description)",

i.e. it becomes "22 (every 10 minutes after 10 p.m.)". Then click the "Next" button at the bottom of the page, and the task is created.

(5) Task management

Click the "Task Management" menu item under "Task Management" to see the task "cnarea_2020" created in the previous step.

Check its "registered node".

Check its "next trigger time".

Wait a while to see whether data gets collected; that is it for the UI walkthrough.

Next, let's go through how the DataX Web above was deployed.

After all that writing I noticed nothing had actually run; it turned out the "Status" column showed the task was never started.

Click that button to start the task. 2021-11-20 22:13:36

3. Deploying DataX

(1) Download

Download the datax package from the datax.tar.gz download address and extract it.
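
A sketch of the download-and-verify step (the official tarball ships a stream-to-stream sample job at job/job.json, which makes a handy smoke test, assuming a working Python on PATH):

# unpack the release and run the bundled sample job as a smoke test
tar -zxvf datax.tar.gz
cd datax
python bin/datax.py job/job.json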

4. Deploying Hadoop Common

(1) Download hadoop-2.7.3.tar.gz and extract it

(2) Configure the HADOOP_HOME environment variable

Create a "HADOOP_HOME" environment variable pointing at "D:\code\hadoop-2.7.3".
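
This can be done through the Windows system properties dialog or, as a sketch, from a console (setx writes a user-level variable; open a new terminal afterwards for it to take effect):

# point HADOOP_HOME at the extracted hadoop-2.7.3 directory
setx HADOOP_HOME "D:\code\hadoop-2.7.3"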

5. Deploying the winutils Tool

(1) Clone the code

Clone the code from the "winutils github repository":

code@code MINGW64 /d/code
$ git clone https://github.com/cdarlint/winutils.git
Cloning into 'winutils'...
remote: Enumerating objects: 434, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 434 (delta 19), reused 35 (delta 8), pack-reused 386
Receiving objects: 100% (434/434), 5.84 MiB | 989.00 KiB/s, done.

Resolving deltas: 100% (312/312), done.
Updating files: 100% (448/448), done.

(2) Copy the executables

Copy the files from the repository's "hadoop-2.7.3/bin" directory into the bin directory under "HADOOP_HOME", as sketched below.

Overwriting everything there is fine.
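
From the same Git Bash session:

# overwrite HADOOP_HOME/bin with the winutils binaries built for hadoop 2.7.3
cp -f /d/code/winutils/hadoop-2.7.3/bin/* /d/code/hadoop-2.7.3/bin/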

(3) Create the CLASSPATH environment variable

All of this Hadoop setup exists mainly to resolve the following error:

18:51:15.569 admin [main] ERROR o.a.h.u.Shell - Failed to locate the winutils binary in the hadoop binary path

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
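
The notes above do not record the exact CLASSPATH value used; one value commonly suggested in winutils setup guides is the winutils.exe path itself (an assumption, not taken from the original):

# point CLASSPATH at winutils.exe (a commonly suggested value; verify for your setup)
setx CLASSPATH "D:\code\hadoop-2.7.3\bin\winutils.exe"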

6. Building and Running DataX Web

(1) Clone the code
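
A sketch of the clone-and-build step (assuming the WeiYe-Jing/datax-web repository and a standard Maven build; the exact goals and profiles may differ):

git clone https://github.com/WeiYe-Jing/datax-web.git
cd datax-web
mvn clean install -DskipTests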

(2) Modify the datax-admin configuration

server:
  port: 8080
#  port: ${server.port}
spring:
  # data source
  datasource:
    username: root
    password: 123456
    url: jdbc:mysql://192.168.1.100:3306/datax_web?serverTimezone=Asia/Shanghai&useLegacyDatetimeCode=false&useSSL=false&nullNamePatternMatchesAll=true&useUnicode=true&characterEncoding=UTF-8
#    password: ${DB_PASSWORD:password}
#    username: ${DB_USERNAME:username}
#    url: jdbc:mysql://${DB_HOST:127.0.0.1}:${DB_PORT:3306}/${DB_DATABASE:dataxweb}?serverTimezone=Asia/Shanghai&useLegacyDatetimeCode=false&useSSL=false&nullNamePatternMatchesAll=true&useUnicode=true&characterEncoding=UTF-8
    driver-class-name: com.mysql.jdbc.Driver


    hikari:
      ## minimum number of idle connections
      minimum-idle: 5
      ## maximum idle time for a connection; default 600000 (10 minutes)
      idle-timeout: 180000
      ## maximum pool size; default 10
      maximum-pool-size: 10
      ## connection timeout; default 30 seconds (30000)
      connection-timeout: 30000
      connection-test-query: SELECT 1
      ## maximum lifetime of a pooled connection; 0 means unlimited; default 1800000 (30 minutes)
      max-lifetime: 1800000

  # datax-web email
  mail:
    host: smtp.qq.com
    port: 25
    username: gree2@qq.com
    password: xxxxxxxx
#    username: ${mail.username}
#    password: ${mail.password}
    properties:
      mail:
        smtp:
          auth: true
          starttls:
            enable: true
            required: true
        socketFactory:
          class: javax.net.ssl.SSLSocketFactory


management:
  health:
    mail:
      enabled: false
  server:
    servlet:
      context-path: /actuator

mybatis-plus:
  # scan for mapper.xml files
  mapper-locations: classpath*:/mybatis-mapper/*Mapper.xml
  # entity scan; separate multiple packages with commas or semicolons
  #typeAliasesPackage: com.yibo.essyncclient.*.entity
  global-config:
    # database-related settings
    db-config:
      # primary key type. AUTO: database auto-increment; INPUT: user-supplied ID; ID_WORKER: globally unique numeric ID; UUID: globally unique UUID
      id-type: AUTO
      # field strategy. IGNORED: skip the check; NOT_NULL: non-NULL check; NOT_EMPTY: non-empty check
      field-strategy: NOT_NULL
      # camel-case to underscore conversion
      column-underline: true
      # logical delete
      logic-delete-value: 0
      logic-not-delete-value: 1
      # database type
      db-type: mysql
    banner: false
  # native MyBatis settings
  configuration:
    map-underscore-to-camel-case: true
    cache-enabled: false
    call-setters-on-nulls: true
    jdbc-type-for-null: 'null'
    type-handlers-package: com.wugui.datax.admin.core.handler

# have MyBatis Plus print SQL logs
logging:
  level:
    com.wugui.datax.admin.mapper: info
    path: ./data/applogs/admin
#  level:
#    com.wugui.datax.admin.mapper: error
#    path: ${data.path}/applogs/admin



#datax-job, access token
datax:
  job:
    accessToken:
    #i18n (empty defaults to Chinese; "en" for English)
    i18n:
    ## triggerpool max size
    triggerpool:
      fast:
        max: 200
      slow:
        max: 100
      ### log retention days
    logretentiondays: 30

datasource:
  aes:
    key: AD42F6697B035B75

(3) Modify the datax-executor configuration

# web port
server:
#  port: ${server.port}
  port: 8081

# log config
logging:
  config: classpath:logback.xml
#  path: ${data.path}/applogs/executor/jobhandler
  path: ./data/applogs/executor/jobhandler

datax:
  job:
    admin:
      ### datax admin address list, such as "http://address" or "http://address01,http://address02"
      addresses: http://127.0.0.1:8080
#      addresses: http://127.0.0.1:${datax.admin.port}
    executor:
      appname: datax-executor
      ip: 192.168.1.9
      port: 9999
#      port: ${executor.port:9999}
      ### job log path
      logpath: ./data/applogs/executor/jobhandler
#      logpath: ${data.path}/applogs/executor/jobhandler
      ### job log retention days
      logretentiondays: 30
    ### job, access token
    accessToken:

  executor:
    jsonpath: D:\\code\\datax\\job
#    jsonpath: ${json.path}

  pypath: D:\\code\\datax\\bin\\datax.py
#  pypath: ${python.path}

(4) Run
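
The admin and executor modules are started separately. A sketch of one way to launch them (using Maven's Spring Boot plugin here is an assumption about the build; the admin's main class, DataXAdminApplication, is confirmed by the log below):

# start the admin (port 8080) and the executor (port 8081) in separate terminals
mvn -pl datax-admin spring-boot:run
mvn -pl datax-executor spring-boot:run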

...
18:51:27.066 admin [main] INFO  o.s.b.w.e.t.TomcatWebServer - Tomcat started on port(s): 8080 (http) with context path ''
18:51:27.073 admin [main] INFO  c.w.d.a.DataXAdminApplication - Started DataXAdminApplication in 26.306 seconds (JVM running for 29.794)
18:51:27.105 admin [main] INFO  c.w.d.a.DataXAdminApplication - Access URLs:
----------------------------------------------------------
	Local-API: 		http://127.0.0.1:8080/doc.html
	External-API: 	http://172.26.240.1:8080/doc.html
	web-URL: 		http://127.0.0.1:8080/index.html
	----------------------------------------------------------
18:51:39.572 admin [http-nio-8080-exec-1] INFO  o.a.c.c.C.[.[.[/] - Initializing Spring DispatcherServlet 'dispatcherServlet'
18:51:39.572 admin [http-nio-8080-exec-1] INFO  o.s.w.s.DispatcherServlet - Initializing Servlet 'dispatcherServlet'
18:51:39.591 admin [http-nio-8080-exec-1] INFO  o.s.w.s.DispatcherServlet - Completed initialization in 18 ms
...

That completes the initial POC, although per the official documentation the task executed earlier could not have succeeded.

The reasons:

1. The test machine's Python environment was installed via Anaconda 3, and its version is Python 3.8.8.

2. The DataX docs say the default launcher scripts target Python 2.7; to use Python 3, separate replacement scripts must be copied in.

Putting those together, the run was bound to fail.

=================================================================

2021-12-11 22:37:17

Tonight I rebuilt the environment on a new laptop

and ran the job; the run log is included below.

Python 3 support:

Copy the 3 .py files under datax-web/doc/datax-web/datax-python3

and replace the 3 .py files under datax/bin with them.
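
As a sketch, from the repository checkout (paths assume the layout used earlier in these notes):

# swap the Python 2 launcher scripts for the Python 3 versions
cp -f /d/code/datax-web/doc/datax-web/datax-python3/*.py /d/code/datax/bin/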

2021-12-11 22:28:27 [JobThread.run-130] <br>----------- datax-web job execute start -----------<br>----------- Param:
2021-12-11 22:28:27 [BuildCommand.buildDataXParam-100] ------------------Command parameters:
2021-12-11 22:28:27 [ExecutorJobHandler.execute-57] ------------------DataX process id: 18292
2021-12-11 22:28:27 [ProcessCallbackThread.callbackLog-186] <br>----------- datax-web job callback finish.
2021-12-11 22:28:27 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:27 [AnalysisStatistics.analysisStatisticsLog-53] DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
2021-12-11 22:28:27 [AnalysisStatistics.analysisStatisticsLog-53] Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2021-12-11 22:28:27 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:27 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.162 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.167 [main] INFO  Engine - the machine info  => 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	osInfo:	Oracle Corporation 1.8 25.202-b08
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	jvmInfo:	Windows 10 amd64 10.0
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	cpu num:	16
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	totalPhysicalMemory:	-0.00G
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	freePhysicalMemory:	-0.00G
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	maxFileDescriptorCount:	-1
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	currentOpenFileDescriptorCount:	-1
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	GC Names	[PS MarkSweep, PS Scavenge]
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	MEMORY_NAME                    | allocation_size                | init_size                      
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	PS Eden Space                  | 256.00MB                       | 256.00MB                       
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	Code Cache                     | 240.00MB                       | 2.44MB                         
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	PS Survivor Space              | 42.50MB                        | 42.50MB                        
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	PS Old Gen                     | 683.00MB                       | 683.00MB                       
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	Metaspace                      | -0.00MB                        | 0.00MB                         
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.187 [main] INFO  Engine - 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] {
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	"content":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 		{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 			"reader":{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 				"name":"mysqlreader",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 				"parameter":{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"column":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`id`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`level`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`parent_code`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`area_code`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`zip_code`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`city_code`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`name`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`short_name`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`merger_name`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`pinyin`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`lng`",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"`lat`"
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					],
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"connection":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 							"jdbcUrl":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 								"jdbc:mysql://home:3307/cnarea20200630"
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 							],
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 							"table":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 								"cnarea_2020"
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 							]
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						}
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					],
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"password":"******",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"splitPk":"id",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"username":"root"
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 				}
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 			},
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 			"writer":{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 				"name":"postgresqlwriter",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 				"parameter":{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"column":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"id\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"level\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"parent_code\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"area_code\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"zip_code\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"city_code\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"name\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"short_name\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"merger_name\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"pinyin\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"lng\"",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"\"lat\""
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					],
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"connection":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 							"jdbcUrl":"jdbc:postgresql://home:5432/datax",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 							"table":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 								"public.cnarea_2020"
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 							]
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						}
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					],
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"password":"******",
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"preSql":[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 						"delete from cnarea_2020"
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					],
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 					"username":"postgres"
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 				}
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 			}
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 		}
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	],
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	"setting":{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 		"errorLimit":{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 			"percentage":0.02,
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 			"record":0
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 		},
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 		"speed":{
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 			"byte":1048576,
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 			"channel":3
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 		}
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 	}
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] }
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.203 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.204 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.205 [main] INFO  JobContainer - DataX jobContainer starts job.
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.206 [main] INFO  JobContainer - Set jobId = 0
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.480 [job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://home:3307/cnarea20200630?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.495 [job-0] INFO  OriginalConfPretreatmentUtil - table:[cnarea_2020] has columns:[id,level,parent_code,area_code,zip_code,city_code,name,short_name,merger_name,pinyin,lng,lat].
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.581 [job-0] INFO  OriginalConfPretreatmentUtil - table:[public.cnarea_2020] all columns:[
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] id,level,parent_code,area_code,zip_code,city_code,name,short_name,merger_name,pinyin,lng,lat
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] ].
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.607 [job-0] INFO  OriginalConfPretreatmentUtil - Write data [
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] INSERT INTO %s ("id","level","parent_code","area_code","zip_code","city_code","name","short_name","merger_name","pinyin","lng","lat") VALUES(?,?,?,?,?,?,?,?,?,?,?,?)
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] ], which jdbcUrl like:[jdbc:postgresql://home:5432/datax]
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.607 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.607 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.608 [job-0] INFO  JobContainer - DataX Writer.Job [postgresqlwriter] do prepare work .
2021-12-11 22:28:28 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:28.619 [job-0] INFO  CommonRdbmsWriter$Job - Begin to execute preSqls:[delete from cnarea_2020]. context info:jdbc:postgresql://home:5432/datax.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.778 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.779 [job-0] INFO  JobContainer - Job set Max-Byte-Speed to 1048576 bytes.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.787 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.788 [job-0] INFO  JobContainer - DataX Writer.Job [postgresqlwriter] splits to [1] tasks.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.810 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.815 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.817 [job-0] INFO  JobContainer - Running by standalone Mode.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.830 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.834 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.835 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.847 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:31.851 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Begin to read record by Sql: [select `id`,`level`,`parent_code`,`area_code`,`zip_code`,`city_code`,`name`,`short_name`,`merger_name`,`pinyin`,`lng`,`lat` from cnarea_2020 
2021-12-11 22:28:31 [AnalysisStatistics.analysisStatisticsLog-53] ] jdbcUrl:[jdbc:mysql://home:3307/cnarea20200630?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2021-12-11 22:28:41 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:41.851 [job-0] INFO  StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 0.00%
2021-12-11 22:28:51 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:28:51.852 [job-0] INFO  StandAloneJobContainerCommunicator - Total 96768 records, 8931326 bytes | Speed 872.20KB/s, 9676 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 9.596s |  All Task WaitReaderTime 0.239s | Percentage 0.00%
2021-12-11 22:29:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:01.868 [job-0] INFO  StandAloneJobContainerCommunicator - Total 197120 records, 18204903 bytes | Speed 905.62KB/s, 10035 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 19.407s |  All Task WaitReaderTime 0.397s | Percentage 0.00%
2021-12-11 22:29:11 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:11.877 [job-0] INFO  StandAloneJobContainerCommunicator - Total 295424 records, 27029877 bytes | Speed 861.81KB/s, 9830 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 29.157s |  All Task WaitReaderTime 0.547s | Percentage 0.00%
2021-12-11 22:29:21 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:21.890 [job-0] INFO  StandAloneJobContainerCommunicator - Total 389632 records, 35786259 bytes | Speed 855.12KB/s, 9420 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 39.115s |  All Task WaitReaderTime 0.691s | Percentage 0.00%
2021-12-11 22:29:31 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:31.893 [job-0] INFO  StandAloneJobContainerCommunicator - Total 485888 records, 44464293 bytes | Speed 847.46KB/s, 9625 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 49.023s |  All Task WaitReaderTime 0.836s | Percentage 0.00%
2021-12-11 22:29:41 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:41.895 [job-0] INFO  StandAloneJobContainerCommunicator - Total 576000 records, 52520355 bytes | Speed 786.72KB/s, 9011 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 58.670s |  All Task WaitReaderTime 0.985s | Percentage 0.00%
2021-12-11 22:29:51 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:51.902 [job-0] INFO  StandAloneJobContainerCommunicator - Total 664064 records, 60429822 bytes | Speed 772.41KB/s, 8806 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 68.620s |  All Task WaitReaderTime 1.128s | Percentage 0.00%
2021-12-11 22:29:51 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:51.996 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Finished read record by Sql: [select `id`,`level`,`parent_code`,`area_code`,`zip_code`,`city_code`,`name`,`short_name`,`merger_name`,`pinyin`,`lng`,`lat` from cnarea_2020 
2021-12-11 22:29:51 [AnalysisStatistics.analysisStatisticsLog-53] ] jdbcUrl:[jdbc:mysql://home:3307/cnarea20200630?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2021-12-11 22:29:52 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:52.353 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[80507]ms
2021-12-11 22:29:52 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:29:52.354 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.909 [job-0] INFO  StandAloneJobContainerCommunicator - Total 758049 records, 70508004 bytes | Speed 984.20KB/s, 9398 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 78.400s |  All Task WaitReaderTime 1.289s | Percentage 100.00%
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.910 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.910 [job-0] INFO  JobContainer - DataX Writer.Job [postgresqlwriter] do post work.
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.910 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do post work.
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.910 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.911 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: D:\code\datax\datax\hook
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.912 [job-0] INFO  JobContainer - 
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 	 [total cpu info] => 
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 		averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 		-1.00%                         | -1.00%                         | -1.00%
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53]                         
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 	 [total gc info] => 
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 		 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 		 PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 		 PS Scavenge          | 19                 | 19                 | 19                 | 0.088s             | 0.088s             | 0.088s             
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.912 [job-0] INFO  JobContainer - PerfTrace not enable!
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.912 [job-0] INFO  StandAloneJobContainerCommunicator - Total 758049 records, 70508004 bytes | Speed 765.06KB/s, 8422 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 78.400s |  All Task WaitReaderTime 1.289s | Percentage 100.00%
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 2021-12-11 22:30:01.914 [job-0] INFO  JobContainer - 
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 任务启动时刻                    : 2021-12-11 22:28:28
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 任务结束时刻                    : 2021-12-11 22:30:01
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 任务总计耗时                    :                 93s
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 任务平均流量                    :          765.06KB/s
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 记录写入速度                    :           8422rec/s
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 读出记录总数                    :              758049
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 读写失败总数                    :                   0
2021-12-11 22:30:01 [AnalysisStatistics.analysisStatisticsLog-53] 
2021-12-11 22:30:01 [JobThread.run-165] <br>----------- datax-web job execute end(finish) -----------<br>----------- ReturnT:ReturnT [code=200, msg=LogStatistics{taskStartTime=2021-12-11 22:28:28, taskEndTime=2021-12-11 22:30:01, taskTotalTime=93s, taskAverageFlow=765.06KB/s, taskRecordWritingSpeed=8422rec/s, taskRecordReaderNum=758049, taskRecordWriteFailNum=0}, content=null]
2021-12-11 22:30:02 [TriggerCallbackThread.callbackLog-186] <br>----------- datax-web job callback finish.

7. Packaging

Later I will gather the various packages used in the deployment above and upload them to Baidu Cloud,

and add the Python 3-related configuration.

=================================================================

2021-12-11 22:45:22

Link: pan.baidu.com/s/1iyIo9B

Extraction code: sclo
