Cypher: LOAD JSON from URL AS data

Neo4j’s query language Cypher supports loading data from CSV directly but not from JSON files or URLs.

Almost every site offers some kind of API or endpoint that returns JSON and we can also query many NOSQL databases via HTTP and get JSON responses back.

It’s quite useful to be able to ingest document structured information from all those different sources into a more usable graph model.

I want to show here that retrieving that data and ingesting it into Neo4j using Cypher is really straightforward and takes only little effort.

As Cypher is already pretty good at deconstructing nested documents, it’s actually not that hard to achieve it from a tiny program.

I want to show you today how you can achieve this from Python, Javascript, Ruby, Java, and Bash.

The Domain: StackOverflow

Being a developer I love StackOverflow. Just crossed 20k reputation by only answering 1100 Neo4j related questions :). You can do that too. That’s why I want to use StackOverflow users with their question, answers and comments and tags as our domain today.

stack overflow logo

Pulling StackOverflow information into a graph model allows me to find interesting insights, like:

  • what are the people asking or answering about Neo4j also interested in,

  • how is their activity distributed across tags and between questions, answers and comments

  • which kinds of questions attract answers and which don’t

  • looking at my own data, which answers to what kinds of questions got the highest approval rates

We need some data and a model suited to answer those questions.

StackOverflow API

Stackoverflow offers an API to retrieve that information, it’s credential protected as usual, but there is the cool option to pre-generate an API-URL that encodes your secrets and allows you to retrieve data without sharing them. You can still control some parameters like tags, page size and page-number though.

With this API-URL

We load the last 10 questions with the Neo4j tag. You get a like this.

Overall Response Structure
{ "items": [{
	"question_id": 24620768,
	"link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements",
	"title": "Neo4j cypher query: get last N elements",
 	"answer_count": 1,
 	"score": 1,
 	.....
 	"creation_date": 1404771217,
 	"body_markdown": "I have a graph that contains a relationship between users, s....\r\nWhat I need is to delete the first relationship in this list. How can I do that?",
 	"tags": ["neo4j", "cypher"],
 	"owner": {
 		"reputation": 815,
 		"user_id": 1212067,
        ....
 		"link": "http://stackoverflow.com/users/1212067/c%c3%a9sar-garc%c3%ada-tapia"
 	},
 	"answers": [{
 		"owner": {
 			"reputation": 488,
 			"user_id": 737080,
 			"display_name": "Chris Leishman",
            ....
 		},
 		"answer_id": 24620959,
 		"share_link": "http://stackoverflow.com/a/24620959",
        ....
 		"body_markdown": "The simplest would be to use an ... some discussion on this here: http://docs.neo4j.org/chunked/stable/cypherdoc-linked-lists.html)",
 		"title": "Neo4j cypher query: get last N elements"
 	}]
 }

Graph Model

So what does the graph-model look like? We can develop it by looking at the questions we want to answer and the entities and relationships they refer to.

We need this model upfront to know where to put our data when we insert it into the graph. After all we don’t want to have loose ends.

stackoverflow model

Cypher Import Statement

The Cypher query to create that domain is also straightforward. You can deconstruct maps with dot notation map.key and arrays with slices array[0..4]. You’d use UNWIND to convert collections into rows and FOREACH to iterate over a collection with update statements. To create nodes and relationships we use MERGE and CREATE commands.

The JSON response that we retrieved from the API call is passed in as a parameter {json} to the Cypher statement.

WITH {json} as data
UNWIND data.items as q
MERGE (question:Question {id:q.question_id}) ON CREATE
  SET question.title = q.title, question.share_link = q.share_link, question.favorite_count = q.favorite_count

MERGE (owner:User {id:q.owner.user_id}) ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)

FOREACH (tagName IN q.tags | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag))
FOREACH (a IN q.answers |
   MERGE (question)<-[:ANSWERS]-(answer:Answer {id:a.answer_id})
   MERGE (answerer:User {id:a.owner.user_id}) ON CREATE SET answerer.display_name = a.owner.display_name
   MERGE (answer)<-[:PROVIDED]-(answerer)
)

Calling Cypher with the JSON parameters

To pass in the JSON to Cypher we have to programmatically call the Cypher endpoint of the Neo4j server, which can be done via one of the many drivers for Neo4j or manually. We can also call the Java API.

So without further ado here are our examples:

Python

We use the py2neo driver by Nigel Small to execute the statement:

import os
import requests
from py2neo import neo4j

# Connect to graph and add constraints.
neo4jUrl = os.environ.get('NEO4J_URL',"http://localhost:7474/db/data/")
graph = neo4j.GraphDatabaseService(neo4jUrl)

# Add uniqueness constraints.
neo4j.CypherQuery(graph, "CREATE CONSTRAINT ON (q:Question) ASSERT q.id IS UNIQUE;").run()

# Build URL.
apiUrl = "https://api.stackexchange.com/2.2/questions...." % (tag,page,page_size)
# Send GET request.
json = requests.get(apiUrl, headers = {"accept":"application/json"}).json()

# Build query.
query = """
UNWIND {json} AS data ....
"""

# Send Cypher query.
neo4j.CypherQuery(graph, query).run(json=json)

We did something similar with getting tweets from the Twitter search API into Ne4oj for the OSCON conference.

Javascript

For JavaScript I want to show how to call the transactional Cypher endpoint directly, by just using the request node module.

var r=require("request");
var neo4jUrl = (env["NEO4J_URL"] || "http://localhost:7474") + "/db/data/transaction/commit";

function cypher(query,params,cb) {
  r.post({uri:neo4jUrl,
          json:{statements:[{statement:query,parameters:params}]}},
         function(err,res) { cb(err,res.body)})
}

var query="UNWIND {json} AS data ....";
var apiUrl = "https://api.stackexchange.com/2.2/questions....";

r.get({url:apiUrl,json:true,gzip:true}, function(err,res,json) {
  cypher(query,{json:json},function(err, result) { console.log(err, JSON.stringify(result))});
});

Java

With Java I want to show how to use the Neo4j embedded API.

import org.apache.http.*;
import org.codehaus.jackson.map.ObjectMapper;
import org.neo4j.graphdb.*;

// somewhere in your application-scoped setup code
ObjectMapper mapper = new ObjectMapper();
HttpClient http = HttpClients.createMinimal();
GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedGraphDatabase(PATH);

// execute API request and parse response as JSON
HttpResponse response = http.execute(new HttpGet( apiUrl ));
Map json = mapper.readValue(response.getEntity().getContent(), Map.class)

// execute Cypher
String query = "UNWIND {json} AS data ....";
db.execute(query, singletonMap("json",json));

// application scoped shutdown, or JVM-shutdown-hook
db.shutdown();

Ruby

Using the neo4j-core Gem, we can talk to Neo4j server or embedded (using jRuby) by just changing a single line of configuration.

require 'rubygems'
require 'neo4j-core'
require 'rest-client'
require 'json'

QUERY="UNWIND {json} AS data ...."
API = "https://api.stackexchange.com/2.2/questions...."

res = RestClient.get(API)
json = JSON.parse(res.to_str)

session = Neo4j::Session.open
session.query(QUERY, json: json)

Bash

Bash is of course most fun, as we have to do fun substitutions to make this work.

load_json.sh
#!/bin/bash
echo "Usage load_json.sh 'http://json.api.com?params=values' import_json.cypher"
echo "Use {data} as parameter in your query for the JSON data"
JSON_API="$1"
QUERY=`cat "$2"` # cypher file
JSON_DATA=`curl --compress -s -H accept:application/json -s "$JSON_API"`
POST_DATA="{\"statements\":[{\"statement\": \"$QUERY\", \"parameters\": {\"data\":\"$JSON_DATA\"}}]}"
DB_URL=${NEO4J_URL-http://localhost:7474}
curl -i -H accept:application/json -H content-type:application/json -d "$POST_DATA" -XPOST "$DB_URL/db/data/transaction/commit"

So as you can see, even with LOAD JSON not being part of the language, it’s possible to retrieve JSON data from an API endpoint and deconstruct and insert it into Neo4j by just using plain Cypher.

Example Use-Cases

Here are some simple example queries that I now can run on top of this imported dataset.

To not overload this blog post with too much information, we’ll answer our original questions in Part 2.

Find the User who was most active

MATCH (u:User)
OPTIONAL MATCH (u)-[:PROVIDED|ASKED|COMMENTED]->()
RETURN u,count(*)
ORDER BY count(*) DESC
LIMIT 5

Find co-used Tags

MATCH (t:Tag)
OPTIONAL MATCH (t)<-[:TAGGED]-(question)-[:TAGGED]->(t2)
RETURN t.name,t2.name,count(distinct question) as questions
ORDER BY questions DESC
MATCH (t:Tag)<-[r:TAGGED]->(question)
RETURN t,r,question

Stackoverflow Response

{
	"items": [{
		"answers": [{
			"owner": {
				"reputation": 488,
				"user_id": 737080,
				"user_type": "registered",
				"accept_rate": 45,
				"profile_image": "https://www.gravatar.com/avatar/ffa6eed1e8a9c1b2adb37ca88c07dede?s=128&d=identicon&r=PG",
				"display_name": "Chris Leishman",
				"link": "http://stackoverflow.com/users/737080/chris-leishman"
			},
			"tags": [],
			"comment_count": 0,
			"down_vote_count": 0,
			"up_vote_count": 2,
			"is_accepted": false,
			"score": 2,
			"last_activity_date": 1404772223,
			"creation_date": 1404772223,
			"answer_id": 24620959,
			"question_id": 24620768,
			"share_link": "http://stackoverflow.com/a/24620959",
			"body_markdown": "The simplest would be to use an ... some discussion on this here: http://docs.neo4j.org/chunked/stable/cypherdoc-linked-lists.html)",
			"link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements/24620959#24620959",
			"title": "Neo4j cypher query: get last N elements"
		}],
		"tags": ["neo4j", "cypher"],
		"owner": {
			"reputation": 815,
			"user_id": 1212067,
			"user_type": "registered",
			"accept_rate": 73,
			"profile_image": "http://i.stack.imgur.com/nnyS1.png?s=128&g=1",
			"display_name": "C&#233;sar Garc&#237;a Tapia",
			"link": "http://stackoverflow.com/users/1212067/c%c3%a9sar-garc%c3%ada-tapia"
		},
		"comment_count": 0,
		"delete_vote_count": 0,
		"close_vote_count": 0,
		"is_answered": true,
		"view_count": 14,
		"favorite_count": 0,
		"down_vote_count": 0,
		"up_vote_count": 1,
		"answer_count": 1,
		"score": 1,
		"last_activity_date": 1404772230,
		"creation_date": 1404771217,
		"question_id": 24620768,
		"share_link": "http://stackoverflow.com/q/24620768",
		"body_markdown": "I have a graph that contains a relationship between users, s....\r\nWhat I need is to delete the first relationship in this list. How can I do that?",
		"link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements",
		"title": "Neo4j cypher query: get last N elements"
	}, {
		"tags": ["neo4j", "cypher"],
		"owner": {
			"reputation": 63,
			"user_id": 845435,
			"user_type": "registered",
			"accept_rate": 67,
			"profile_image": "https://www.gravatar.com/avatar/610458a30958c9d336ee691fa1a87369?s=128&d=identicon&r=PG",
			"display_name": "user845435",
			"link": "http://stackoverflow.com/users/845435/user845435"
		},
		"comment_count": 0,
		"delete_vote_count": 0,
		"close_vote_count": 0,
		"is_answered": false,
		"view_count": 16,
		"favorite_count": 0,
		"down_vote_count": 0,
		"up_vote_count": 0,
		"answer_count": 0,
		"score": 0,
		"last_activity_date": 1404768987,
		"creation_date": 1404768987,
		"question_id": 24620297,
		"share_link": "http://stackoverflow.com/q/24620297",
		"body_markdown": "I&#39;m trying to implement a simple graph db for NYC subway................Thanks!\r\n",
		"link": "http://stackoverflow.com/questions/24620297/cypher-query-with-infinite-relationship-takes-forever",
		"title": "Cypher query with infinite relationship takes forever"
	}],
	"has_more": true,
	"quota_max": 300,
	"quota_remaining": 205
}