ETL: CSV to Neo4j

images/download/thumbnails/18580660/etl_2.png

Load data from CSV into Neo4j using Cypher (LOAD CSV command).

Used drivers:

  • CSV

  • Neo4j or Neo4jCsvNodes with Neo4jCsvRelationships

You can use Neo4jCsvNodes and Neo4jCsvRelationships to automatically generate Cypher for you or you can use Neo4j driver to execute custom Cypher. See examples below. Both examples import same data and the result in Neo4j is the same. Here are CSV files which are used in the examples.

Nodes

logicalName;type;subtype;comment;vendor;responsible
Server1;Server;Physical;HP server;HP;ITOPS
Server2;Server;Physical;HP server;HP;ITOPS
App1;Application modul;Application;IB application;Microsoft;MS support
Server3;Server;Physical;HP server;HP;ITOPS
App2;Application modul;Application;IB application;Microsoft;MS support

Relationships

sourceLogicalName;targetLogicalName;relationshipName;type;subtype
App1;Server1;App1-Server1;Physical;Uses
App2;Server2;App2-Server2;Physical;Uses

Import using Neo4j driver

This import uses Neo4j driver where any Cypher command can by written. To import CSV we uses LOAD CSV command.

Advantage: You can make some changes in the name of attributes or add some conditions to the Cypher

Disadvantage: You need to write long Cypher to set every column of CSV. If you will add some new columns into CSV you need to change Cypher in ETL job. If you need to set the same attribute names as CSV column names use Neo4jCsvNodes and Neo4jCsvRelationships drivers instead of Neo4j driver.

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
<description>Load CSV nodes and rels into Neo4j</description>
<properties>
csv.file1=/nodes.csv
csv.file2=/rels.csv
</properties>
<connection id="neo4j" driver="neo4j" url="bolt://localhost:7687" user="neo4j" password="admin"/>
<script connection-id="neo4j">
LOAD CSV WITH HEADERS FROM 'file:$csv.file1' AS line FIELDTERMINATOR ';'
MERGE (n:Ci{logicalName:line.logicalName})
SET
n.logicalName=line.logicalName,
n.type=line.type,
n.subtype=line.subtype,
n.comment=line.comment,
n.vendor=line.vendor,
n.responsible=line.responsible,
n._created=timestamp(),
n._updated=timestamp()
</script>
<script connection-id="neo4j">
LOAD CSV WITH HEADERS FROM 'file:$csv.file2' AS line FIELDTERMINATOR ';'
MATCH (n1:Ci{logicalName:line.sourceLogicalName}),(n2:Ci{logicalName:line.targetLogicalName}) MERGE (n1)-[r:RELATED{relationshipName:line.relationshipName}]->(n2)
SET
r.type=line.type,
r.subtype=line.subtype,
r._created=timestamp(),
r._updated=timestamp()
</script>
</etl>

Import using Neo4jCsvNodes and Neo4jCsvRelationships driver

This import uses Neo4jCsvNodes and Neo4jCsvRelationships drivers which automatically generate LOAD CSV Cypher for you in a background. There is no need to write Cypher to set every column of CSV.

Advantage: Short ETL script. There is no need to change ETL job if some columns are added into CSV.

Disadvantage: Cypher is generated in a background so you can't modify this Cypher to add conditions or something else. But you can use Neo4j driver after Neo4jCsvNodes and Neo4jCsvRelationships drivers to execute some custom Cypher and modify data in Neo4j.

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
<description>Load CSV nodes and rels into Neo4j</description>
<properties>
csv.file1=/nodes.csv
csv.file2=/rels.csv
</properties>
<connection id="nodeImport" driver="neo4jCsvNodes" url="bolt://localhost:7687" user="neo4j" password="admin">
nodeLabels=Ci
mergeColumns=logicalName
emptyStringIsNull=true
timestampProperties=_created,_updated
csvPath=$csv.file1
</connection>
<connection id="relationshipImport" driver="neo4jCsvRelationships" url="bolt://localhost:7687" user="neo4j" password="admin">
sourceNodeLabels=Ci
targetNodeLabels=Ci
relationshipType=RELATED
sourceNodeMatchColumns=logicalName:sourceLogicalName
targetNodeMatchColumns=logicalName:targetLogicalName
relationshipMergeColumns=relationshipName
emptyStringIsNull=true
timestampProperties=_created,_updated
csvPath=$csv.file2
</connection>
<script connection-id="nodeImport"/>
<script connection-id="relationshipImport"/>
</etl>

Import using Neo4j driver without LOAD CSV

This import uses Neo4j driver where any Cypher command can by written. To import CSV we don't use LOAD CSV command in this case.

Advantage: You can make some custom CREATE or MERGE statements. For example to customize what node label will be created or what relationship type will be created. Node label and relationship type can't be customized in LOAD CSV (label and relationship type has to be a constant string). For example, you can write Cypher query:

MERGE (n:'$type'{id:toint('$id')}) SET n.logicalName='$logicalName'

Disadvantage: This import is very inefficient because every record is inserted separately and this cause poor performance.

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
<description>CSV to Neo4j. Insert every record from CSV by own Cypher</description>
<properties>
csv.nodes=/nodes.csv
csv.rels=/rels.csv
</properties>
<connection id="csv_nodes" driver="csv" url="$csv.nodes">
separator=;
</connection>
<connection id="csv_rels" driver="csv" url="$csv.rels">
separator=;
</connection>
<connection id="neo4j" driver="neo4j" url="bolt://localhost:7687" user="neo4j" password="admin"/>
<query connection-id="csv_nodes">
<script connection-id="neo4j">
MERGE (n:Ci{logicalName:'$logicalName'})
SET
n.logicalName='$logicalName',
n.type='$type',
n.subtype='$subtype',
n.comment='$comment',
n.vendor='$vendor',
n.responsible='$responsible',
n._created=timestamp(),
n._updated=timestamp()
</script>
</query>
<query connection-id="csv_rels">
<script connection-id="neo4j">
MATCH (n1:Ci{logicalName:'$sourceLogicalName'}),(n2:Ci{logicalName:'$targetLogicalName'}) MERGE (n1)-[r:RELATED{relationshipName:'$relationshipName'}]->(n2)
SET
r.type='$type',
r.subtype='$subtype',
r._created=timestamp(),
r._updated=timestamp()
</script>
</query>
</etl>