DGraph is a collection of large-scale dynamic graph datasets consisting of interactive objects, events, and labels that evolve over time. It supports multiple tasks (e.g., node classification, link prediction, graph classification) on massive real-world data and has a wide range of applications, with a particular focus on the financial sector. Because they contain realistic and valuable graph data, the benchmark datasets are expected to promote relevant research and broaden practical applications.
Please kindly cite our paper if you would like to use this dataset:
Xuanwen Huang, Yang Yang, Yang Wang, Chunping Wang, Zhisheng Zhang, Jiarong Xu, and Lei Chen. DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection. Preprint. [PDF]
This dataset was crawled from AnJuKe, an online platform for real estate sales and rentals. It covers over 18K real estate properties in Shanghai in 2017.
In the data, each line describes one real estate property, with the following fields:
"name" is the Chinese name of the real estate;
"price" is the average housing price of the real estate;
"latitude" and "longitude" give the location of the real estate, as obtained from Baidu Maps.
Please kindly cite our paper if you would like to use this dataset:
Yang Yang, Zongtao Liu, Chenhao Tan, Fei Wu, Yueting Zhuang, and Yafeng Li. To Stay or to Leave: Churn Prediction for Urban Migrants in the Initial Period. In WWW'18. [BIB] [PDF]
This dataset is sampled from a real Twitter-like Chinese social media platform. The sampling process is as follows: we first select the top 100 source posts (those that do not retweet others) with the most retweets from Oct. 1st, 2012 to Oct. 7th, 2012 and put these posts into a set V. We then scan all other posts and add to V those that retweet one of the posts in V. The process is repeated until no new posts are added. In this way, we obtain the complete cascade process of those 100 source posts, involving 96,782 posts in total.
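The sampling procedure above amounts to a fixed-point expansion over retweet links. The following is a minimal sketch of that idea, not the authors' actual code; the function name `collect_cascade` and the input shape (pairs of post index and parent post index, with 0 meaning "no parent") are assumptions made for illustration.

```python
def collect_cascade(seed_ids, posts):
    """Return all posts reachable from the seed posts via retweet links.

    seed_ids: indices of the initial source posts (the starting set V).
    posts: iterable of (post_id, parent_id) pairs; parent_id is assumed
           to be 0 when the post is a source post.
    """
    selected = set(seed_ids)
    posts = list(posts)
    changed = True
    # Rescan the full post list until a pass adds nothing new.
    while changed:
        changed = False
        for post_id, parent_id in posts:
            if post_id not in selected and parent_id in selected:
                selected.add(post_id)
                changed = True
    return selected


# Hypothetical toy data: posts 2, 3 and 6 descend from post 1;
# post 5 descends from post 4, which is not a seed.
toy_posts = [(6, 3), (3, 2), (2, 1), (1, 0), (4, 0), (5, 4)]
cascade = collect_cascade({1}, toy_posts)
```

Repeated scanning mirrors the description directly. For a dataset of this size, a single breadth-first traversal over a parent-to-children index would likely be faster, but it yields the same set.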
In the data, each line describes a post, with the following format:
post_idx post_time user_id root_id root_user_id parent_id parent_user_id
"post_idx" is the unique index of the post;
"post_time" is the time when the post was published;
"user_id" is the unique index of the user who published the post;
"root_id" is the index of the source post; it is zero if the current line describes a source post;
"root_user_id" is the user who published the source post; it is zero if the current line describes a source post;
"parent_id" is the index of the post that the current post retweeted; it is zero if the current line describes a source post;
"parent_user_id" is the user who published the parent post; it is zero if the current line describes a source post.
Note: all content information has been removed due to privacy concerns.
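As an illustration of the record layout, here is a hedged parsing sketch. It assumes that fields are separated by whitespace, that all identifiers are integers, and that "post_time" contains no internal spaces; none of these details are stated in the description, so adjust them to the actual file. The names `Post` and `parse_post` are hypothetical.

```python
from typing import NamedTuple


class Post(NamedTuple):
    post_idx: int
    post_time: str  # kept as raw text; the time encoding is not specified
    user_id: int
    root_id: int
    root_user_id: int
    parent_id: int
    parent_user_id: int

    @property
    def is_source(self) -> bool:
        # Source posts are described as having zero in the parent fields.
        return self.parent_id == 0


def parse_post(line: str) -> Post:
    """Parse one line, assuming seven whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    idx, time, user, root, root_user, parent, parent_user = fields
    return Post(int(idx), time, int(user), int(root),
                int(root_user), int(parent), int(parent_user))


# Hypothetical example lines, not taken from the dataset.
retweet = parse_post("12 1349049600 7 3 5 9 8")
source = parse_post("3 1349049600 5 0 0 0 0")
```

Keeping "post_time" as a string avoids guessing its encoding; convert it once the actual format is known.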
Please kindly cite our paper if you would like to use this dataset:
Yang Yang, Jie Tang, Cane Wing-Ki Leung, Yizhou Sun, Qicong Chen, Juanzi Li, and Qiang Yang. RAIN: Social Role-Aware Information Diffusion. In AAAI'15. 2015. [BIB] [PDF]