Databricks Hosted Datasets
The data contained within this directory is hosted for users to build data pipelines using Apache Spark and Databricks.
Rdatasets
Rdatasets is a collection of 747 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.
The list of available datasets (csv and docs) is available here:
For more information, please see the README file within the latest data subdirectory.
Versions
- data-001 is from the git hash: aa0d6940a9
Airline On-Time Statistics and Delay Causes
Background
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
FAQ Information is available at http://www.rita.dot.gov/bts/help_with_data/aviation/index.html
Data Source
http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
Usage Restrictions
- The data is released under the Freedom of Information Act.
- More information can be found at http://www.fas.org/sgp/foia/citizen.html
Amazon Reviews datasets
The data20K and test4K datasets were created by Professor Julian McAuley at the University of California San Diego, with permission for use in the databricks-datasets bucket by Databricks users.
Source: Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.
Flight Performance Datasets 1997-2008
http://stat-computing.org/dataexpo/2009/the-data.html
Planes dataset http://stat-computing.org/dataexpo/2009/supplemental-data.html
Bike Sharing Dataset
Hadi Fanaee-T
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto INESC Porto, Campus da FEUP Rua Dr. Roberto Frias, 378 4200 - 465 Porto, Portugal
Background
Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike from one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.
Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
Dataset
The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions,
precipitation, day of week, season, hour of the day, etc. can affect rental behavior. The core data set is the
two-year historical log for the years 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is
publicly available at http://capitalbikeshare.com/system-data. We aggregated the data on an hourly and a daily basis and then
extracted and added the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com.
Associated Tasks
* Regression:
* Prediction of the hourly or daily bike rental count based on the environmental and seasonal settings.
* Event and Anomaly Detection:
* The count of rented bikes is also correlated with some events in the town, which are easily traceable via search engines.
For instance, a query like "2012-10-30 washington d.c." in Google returns results related to Hurricane Sandy. Some of the important events are
identified in [1]. Therefore the data can also be used to validate anomaly or event detection algorithms.
Files
* hour.csv : bike sharing counts aggregated on hourly basis. Records: 17379 hours
* day.csv : bike sharing counts aggregated on daily basis. Records: 731 days
Dataset characteristics
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
- instant: record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
- weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
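Because temp, atemp, hum, and windspeed are stored as fractions of a stated maximum, a reader usually wants to scale them back before plotting or reporting. The following is a minimal sketch, assuming the divisors quoted in the field list are the only transformation applied; the `denormalize` helper and the sample row are hypothetical and are not taken from hour.csv.

```python
# Divisors quoted in the field list (each is the stated maximum).
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}


def denormalize(record):
    """Return a copy of `record` with the normalized fields scaled back up."""
    restored = dict(record)
    for field, factor in SCALE.items():
        if field in restored:
            restored[field] = float(restored[field]) * factor
    return restored


# Hypothetical row for illustration only.
sample = {"hr": 8, "temp": 0.5, "atemp": 0.5, "hum": 0.25,
          "windspeed": 0.0, "casual": 20, "registered": 100, "cnt": 120}
restored = denormalize(sample)
print(restored["temp"], restored["atemp"], restored["hum"])
```

Count fields such as casual, registered, and cnt are left untouched, since only the four weather fields are described as normalized.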
License
Use of this dataset in publications must be cited to the following publication:
[1] Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
@article{ year={2013}, issn={2192-6352}, journal={Progress in Artificial Intelligence}, doi={10.1007/s13748-013-0040-3}, title={Event labeling combining ensemble detectors and background knowledge}, url={http://dx.doi.org/10.1007/s13748-013-0040-3}, publisher={Springer Berlin Heidelberg}, keywords={Event labeling; Event detection; Ensemble learning; Background knowledge}, author={Fanaee-T, Hadi and Gama, Joao}, pages={1-15} }
Contact
For further information about this dataset please contact Hadi Fanaee-T (hadi.fanaee@fe.up.pt)
CAVIAR Test Case Scenarios: Clips from INRIA (1st Set)
This data set was obtained from http://homepages.inf.ed.ac.uk/rbf/CAVIAR/. The source of this data is: EC Funded CAVIAR project/IST 2001 37540.
Data Set Information
The dataset structure is as follows:
Clips from INRIA (1st Set) from the CAVIAR Test Case Scenarios[1]:
* /databricks-datasets/cctvVideos/train/
* /databricks-datasets/cctvVideos/test/
Derived from the above datasets
All other folders contain datasets derived from the above Clips from INRIA (1st Set) from the CAVIAR Test Case Scenarios, as described below.
/databricks-datasets/cctvVideos/mp4/ # MP4 videos generated from the above videos
/databricks-datasets/cctvVideos/labels/ # Manually created labels categorizing suspicious images
/databricks-datasets/cctvVideos/train_images # Hive-style partitioning of labelled images
MP4 version of videos
The MP4 videos stored in /databricks-datasets/cctvVideos/mp4/ were created by Databricks using the following commands.
brew install ffmpeg
# Convert every MPG clip in the current directory to MP4 (H.264 video, AAC audio).
for x in *.MPG; do
  ffmpeg -i "$x" -strict experimental -f mp4 \
    -vcodec libx264 -acodec aac \
    -ab 160000 -ac 2 -preset slow \
    -crf 22 "${x/.MPG/.mp4}"
done
Labels
Stored within /databricks-datasets/cctvVideos/labels/; these are manually created labels that identify which images (extracted from the training videos) are considered suspicious, per the blog post Identify Suspicious Behavior in Video with Databricks Runtime for Machine Learning.
Training Labeled Images
Stored within /databricks-datasets/cctvVideos/train_images; these are images labeled using Hive-style partitioning, where label=0 denotes non-suspicious images and label=1 denotes suspicious images, per the previously noted Labels section.
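With Hive-style partitioning, the label lives in a directory name of the form label=<value> rather than inside the file. The sketch below shows one way to recover it from a path using only the standard library; the `label_from_path` helper and the file name frame0001.jpg are hypothetical, and the exact directory depth under train_images is an assumption.

```python
import re

# Matches a path component of the form "label=<digits>".
_LABEL = re.compile(r"(?:^|/)label=(\d+)(?:/|$)")


def label_from_path(path):
    """Return the integer label encoded in a Hive-style path, or None."""
    match = _LABEL.search(path)
    return int(match.group(1)) if match else None


# Hypothetical file name; only the label=1 component matters here.
example = "/databricks-datasets/cctvVideos/train_images/label=1/frame0001.jpg"
print(label_from_path(example))
```

Readers that understand partitioned layouts, such as Spark's file sources, typically expose such a component as a column automatically, so a helper like this is mainly useful when listing files directly.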
Citation
Applicable citations:
- EC Funded CAVIAR project/IST 2001 37540, found at URL: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. The feature ‘pcaVector’ is a Vector of the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘time’ and ‘amountRange’. Feature ‘time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘amountRange’ is the approximate amount of the transaction. The ranges are represented as an integer between 0 and 7, corresponding to the ranges (in dollars) 0-1, 1-5, 5-10, 10-20, 20-50, 50-100, 100-200 and 200+ respectively. Feature ‘label’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.

This dataset is a slightly modified version of the dataset collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Please cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015
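Two points in the credit card fraud description lend themselves to a short worked example: the integer encoding of amountRange and the reason plain accuracy is uninformative at this class ratio. The sketch below is illustrative only. The `amount_range` helper is hypothetical, and treating each lower bound as inclusive (so that exactly 5 dollars falls in 5-10) is an assumption, since the description does not say how boundary amounts are assigned.

```python
from bisect import bisect_right

# Upper bounds (in dollars) of buckets 0..6; bucket 7 is "200+".
UPPER_BOUNDS = [1, 5, 10, 20, 50, 100, 200]
RANGE_LABELS = ["0-1", "1-5", "5-10", "10-20", "20-50",
                "50-100", "100-200", "200+"]


def amount_range(amount):
    """Map a dollar amount to a bucket index 0..7 (lower bound inclusive, assumed)."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return bisect_right(UPPER_BOUNDS, amount)


# Counts quoted in the dataset description.
FRAUDS, TOTAL = 492, 284_807
fraud_rate_percent = 100 * FRAUDS / TOTAL

# A model that always predicts "not fraud" is right on every non-fraud row,
# which is why accuracy alone says little on this data.
baseline_accuracy = (TOTAL - FRAUDS) / TOTAL

print(RANGE_LABELS[amount_range(37.5)], round(baseline_accuracy, 4))
```

The always-negative baseline exceeds 99.8% accuracy while detecting no fraud at all, which is the motivation for the AUPRC recommendation.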
Data.gov Datasets
This folder houses data that is copied from http://www.data.gov/. This vast trove of data is published and maintained by the government of the United States.
We only provide a small subset of datasets that are published on the site and it’s worth exploring http://www.data.gov/ itself if you want to find other data to work with!
Datasets
This folder contains all of the datasets used in The Definitive Guide.
The datasets are as follows.
Flight Data
This data comes from the United States Bureau of Transportation. Please see the website for more information: https://www.rita.dot.gov/bts/help_with_data/aviation/index.html
Retail Data
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
The data was downloaded from the UCI Machine Learning Repository. Please see this page for more information: http://archive.ics.uci.edu/ml/datasets/Online+Retail
Bike Data
This data comes from the Bay Area Bike Share network. Please see this page for more information: http://www.bayareabikeshare.com/open-data
Sensor Data (Heterogeneity Human Activity Recognition Dataset)
Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen, “Smart Devices are Different: Assessing and Mitigating Mobile Sensing Heterogeneities for Activity Recognition”, in Proc. 13th ACM Conference on Embedded Networked Sensor Systems (SenSys 2015), Seoul, Korea, 2015.
The data was downloaded from the UCI Machine Learning Repository. It is formally known as the Heterogeneity Human Activity Recognition Dataset. Please see this page for more information: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition
On-Time Performance Datasets
The source airports dataset can be found at OpenFlights Airport, airline and route data.
The flights dataset, also known as departuredelays, can be found at Airline On-Time Performance and Causes of Flight Delays: On_Time Data.
Flowers (images)
This data set was obtained from
https://www.tensorflow.org/datasets/catalog/tf_flowers
The source of the data is:
Author: “The TensorFlow Team”, Title: “Flowers”, Url: “http://download.tensorflow.org/example_images/flower_photos.tgz”
Data Set Information
A large set of images of flowers.
License and/or Citation
All images in this archive are licensed under the Creative Commons By-Attribution License, available at: https://creativecommons.org/licenses/by/2.0/
The photographers are listed below, thanks to all of them for making their work available, and please be sure to credit them for any use as per the license.
See the full list of photos and photographers in LICENSE.txt.
Citation:
@ONLINE {tfflowers, author = "The TensorFlow Team", title = "Flowers", month = "jan", year = "2019", url = "http://download.tensorflow.org/example_images/flower_photos.tgz" }
Flowers
This data set was obtained from
https://www.tensorflow.org/datasets/catalog/tf_flowers
The source of the data is:
Author: “The TensorFlow Team”, Title: “Flowers”, Url: “http://download.tensorflow.org/example_images/flower_photos.tgz”
Data Set Information
A Delta table contains a large set of images of flowers. The ‘content’ column is a binary column of the images, and the ‘label’ column is a string column of the labels. The ‘path’ column contains the DBFS path of the image, and the ‘size’ column contains the width and height of the image.
License and/or Citation
All images in this archive are licensed under the Creative Commons By-Attribution License, available at: https://creativecommons.org/licenses/by/2.0/
The photographers are listed below, thanks to all of them for making their work available, and please be sure to credit them for any use as per the license.
(See the full list of photos and photographers in LICENSE.txt.)
Citation:
@ONLINE {tfflowers, author = "The TensorFlow Team", title = "Flowers", month = "jan", year = "2019", url = "http://download.tensorflow.org/example_images/flower_photos.tgz" }
VEP Cache, RefSeq Transcripts, GRCh38
This data set was obtained from ftp://ftp.ensembl.org/pub/release-96/variation/VEP/homo_sapiens_refseq_vep_96_GRCh38.tar.gz.
The sources of the data are: Laurent Gil, Sarah E. Hunt, William McLaren (wm2@ebi.ac.uk), Anja Thormann, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Data Set Information
Variant Effect Predictor cache for Assembly GRCh38, RefSeq transcripts (Ensembl release 96).
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biology Jun 6;17(1):122. (2016) doi:10.1186/s13059-016-0974-4
License and/or Citation
This data set has no restrictions: https://uswest.ensembl.org/info/about/legal/disclaimer.html
SafeGraph FootTraffic Dataset
This data set was obtained from http://databricks.com/notebooks/safegraph_patterns_simulated__1_-91d51.csv. The source of the data is simulated Monthly Foot Traffic Time Series in SafeGraph format.
Data Set Information
The Data Set Information details are on the SafeGraph page: Guide to Points of Interest Data.
License and/or Citation
This data set is derived from SafeGraph’s data schema.
IOT Device Data
This dataset was created by Databricks.
It contains fake generated data in json and csv formats.
e.g.
{"user_id": 12, "calories_burnt": 489.79998779296875, "num_steps": 9796, "miles_walked": 4.8979997634887695, "time_stamp": "2018-07-24 03:54:00.893775", "device_id": 10}
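A record like the one above can be decoded with the standard library alone. This is a minimal sketch, assuming each JSON record occupies one line and that time_stamp always uses the microsecond format shown in the example; neither assumption is confirmed by the dataset description.

```python
import json
from datetime import datetime

# The sample record shown above, as a single line of JSON text.
line = ('{"user_id": 12, "calories_burnt": 489.79998779296875, '
        '"num_steps": 9796, "miles_walked": 4.8979997634887695, '
        '"time_stamp": "2018-07-24 03:54:00.893775", "device_id": 10}')

record = json.loads(line)

# Assumed format: date, time, and six fractional-second digits.
recorded_at = datetime.strptime(record["time_stamp"], "%Y-%m-%d %H:%M:%S.%f")

print(record["user_id"], record["num_steps"], recorded_at.isoformat())
```

Note that the sample record names its time field time_stamp, so the key used here follows the sample rather than any schema listing.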
Data Set Information
Schema for data-device:
[StructField(id,LongType,false),
StructField(user_id,LongType,true),
StructField(device_id,LongType,true),
StructField(num_steps,LongType,true),
StructField(miles_walked,FloatType,true),
StructField(calories_burnt,FloatType,true),
StructField(timestamp,StringType,true),
StructField(value,StringType,true)]
Schema for data-user:
[StructField(userid,IntegerType,true),
StructField(gender,StringType,true),
StructField(age,IntegerType,true),
StructField(height,IntegerType,true),
StructField(weight,IntegerType,true),
StructField(smoker,StringType,true),
StructField(familyhistory,StringType,true),
StructField(cholestlevs,StringType,true),
StructField(bp,StringType,true),
StructField(risk,IntegerType,true)]
License and/or Citation
Copyright (2018) Databricks, Inc. This dataset is licensed under a Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/.
Learning Spark - Example Data From The Book
This dataset holds the files for examples in the Learning Spark book. These examples are used throughout the book.
For more information, please see the README from the Learning Spark github project
License
The files in the Learning Spark github project are licensed with the MIT license as defined in https://github.com/databricks/learning-spark/blob/master/LICENSE.md
Versions
- data-001 is from the git hash: 13c39f22b1
Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Online Documentation
You can find the latest Spark documentation, including a programming guide, on the project web page and project wiki. This README file only contains basic setup instructions.
Building Spark
Spark is built using Apache Maven. To build Spark and its example programs, run:
build/mvn -DskipTests clean package
(You do not need to do this if you downloaded a pre-built package.) More detailed documentation is available from the project site, at “Building Spark”.
Interactive Scala Shell
The easiest way to start using Spark is through the Scala shell:
./bin/spark-shell
Try the following command, which should return 1000:
scala> sc.parallelize(1 to 1000).count()
Interactive Python Shell
Alternatively, if you prefer Python, you can use the Python shell:
./bin/pyspark
And run the following command, which should also return 1000:
>>> sc.parallelize(range(1000)).count()
Example Programs
Spark also comes with several sample programs in the examples directory.
To run one of them, use ./bin/run-example <class> [params]. For example:
./bin/run-example SparkPi
will run the Pi example locally.
You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
“yarn” to run on YARN, and “local” to run
locally with one thread, or “local[N]” to run locally with N threads. You
can also use an abbreviated class name if the class is in the examples
package. For instance:
MASTER=spark://host:7077 ./bin/run-example SparkPi
Many of the example programs print usage help if no params are given.
Running Tests
Testing first requires building Spark. Once Spark is built, tests can be run using:
./dev/run-tests
Please see the guidance on how to run tests for a module, or individual tests.
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at “Specifying the Hadoop Version” for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
Configuration
Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.
Lending Club Statistics
This data set was obtained from https://www.lendingclub.com/info/download-data.action. The source of the data is: LendingClub, LendingClub Corporation, Dept. 34268, P.O. Box 39000, San Francisco, CA 94139
Data Set Information
These files contain complete loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the “present” contains complete loan data for all loans issued through the previous completed calendar quarter.
License and/or Citation
Lending Club’s website does not explicitly state which license the data is shared under. However, the page where the data is downloaded states: “Want to slice and dice the data? Help yourself to the following exports of our loan databases.”
MNIST handwritten digits dataset
Data Source
LibSVM Datasets https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
Original Data Set Source
Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/
20 Newsgroups Dataset – Binary Classification
This is a processed version of the 20 Newsgroup Dataset, saved in a parquet format.
Attribute Information
- newsgroup:string, Name of Newsgroup
- content:string, Document Content
- relatedToSci:integer, 1/0 binary indicator to determine if article belongs to a sci newsgroup or not
List of Newsgroups:
- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc
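The relatedToSci indicator can be reproduced from the newsgroup name. The sketch below assumes the flag is simply 1 for names beginning with "sci." and 0 otherwise; the description says only that it marks whether an article belongs to a sci newsgroup, so the prefix rule and the `related_to_sci` helper are assumptions for illustration.

```python
NEWSGROUPS = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]


def related_to_sci(newsgroup):
    """Return 1 if the newsgroup is in the sci.* hierarchy, else 0 (assumed rule)."""
    return 1 if newsgroup.startswith("sci.") else 0


sci_groups = [name for name in NEWSGROUPS if related_to_sci(name)]
print(sci_groups)
```

Under this rule 4 of the 20 newsgroups are positive, so a classifier trained on relatedToSci faces a moderate class imbalance if the groups are of similar size.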
Source
Original Owner and Donor
Tom Mitchell, School of Computer Science, Carnegie Mellon University, tom.mitchell@cmu.edu
Date Donated: September 9, 1999
Acknowledgements, Copyright Information, and Availability
You may use this material free of charge for any educational purpose, provided attribution is given in any lectures or publications that make use of this material.
NYC Taxi Dataset
This dataset was obtained from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. The source of the data is: The New York City Taxi and Limousine Commission, Office of Legal Affairs, 33 Beaver Street, 22nd Floor, New York, NY 10004; Attn.: Records Access Officer
Data Set Information
This dataset contains aggregated data from the NYC Taxi and Limousine Commission on its various indicators, trip counts, crash history, etc., as well as raw trip data from a variety of sources.
License and/or Citation
Public domain: this data is freely available without restriction from https://www1.nyc.gov/site/tlc/about/request-data.page
Combined Cycle Power Plant Data Set
Power Plant Sensor Readings Data Set
Source
http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
Summary
The example data is provided by UCI at the UCI Machine Learning Repository: Combined Cycle Power Plant Data Set. You can read the background on the UCI page, but in summary, we have collected a number of readings from sensors at a gas-fired power plant (also called a peaker plant), and we want to use those sensor readings to predict how much power the plant will generate.
Usage License
If you publish material based on databases obtained from this repository, then, in your acknowledgements, please note the assistance you received by using this repository. This will help others to obtain the same data sets and replicate your experiments. We suggest the following reference format for referring to this repository:
Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615.
Heysem Kaya, Pınar Tüfekci , Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai)
Synthetic Retail Dataset
This dataset is a collection of files representing different dimensions and facts for a retail organization.
Provenance
This dataset was generated by Databricks.
Data Set Information
- Sales Orders: sales_orders/sales_orders.json records the customers’ originating purchase order.
- Purchase Orders: purchase_orders/purchase_orders.xml contains the raw materials that are being purchased.
- Products: products/products.csv contains products that the company sells.
- Goods Receipt: goods_receipt/goods_receipt.parquet contains the arrival time of purchased orders.
- Customers: customers/customers.csv contains those customers who are located in the US and are buying the finished products.
- Suppliers: suppliers/suppliers.csv contains suppliers that provide raw materials in the US.
- Sales Stream: sales_stream/sales_stream.json/ is a folder containing JSON files for streaming purposes.
- Promotions: promotions/promotions.csv contains additional benefits on top of normal purchases.
- Active Promotions: active_promotions/active_promotions.parquet shows how customers are progressing towards becoming eligible for promotions.
- Loyalty Segment: loyalty_segment/loyalty_segment.csv contains segmented customer data to appeal to all types of guests using targeted rewards and promotions.
License and/or Citation
Copyright (2020) Databricks, Inc. This dataset is licensed under a Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/
README
Introduction
Fire Calls-For-Service includes all fire units responses to calls. Each record includes the call number, incident number, address, unit identifier, call type, and disposition. All relevant time intervals are also included. Because this dataset is based on responses, and since most calls involved multiple units, there are multiple records for each call number. Addresses are associated with a block number, intersection or call box, not a specific address.
License
The data itself is available under an ODC Public Domain Dedication and License.
Additional Information
See https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3
2013 SFO Customer Survey Data Set + Dictionary
SFO conducts a yearly comprehensive survey of our guests to gauge satisfaction with our facilities, services, and amenities. SFO compares results to previous surveys to look for areas of improvement and discover elements of the guest experience that are not satisfactory.
Source: https://data.sfgov.org/Transportation/2013-SFO-Customer-Survey-Data-Set-Dictionary/mjr8-p6m5
SMS Spam Collection v. 1
The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as legitimate (ham) or spam.
Composition
This corpus has been collected from sources on the Internet that are free or free for research:
- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.
- A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.
- A list of 450 SMS ham messages collected from Caroline Tagg’s PhD Thesis, available at http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf.
- Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and is publicly available at: http://www.esp.uem.es/jmgomez/smsspamcorpus/.
You can find more useful information about the SMS Spam Collection v.1 at the following page of the UCI Repository.
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
Usage
The collection consists of just one file, where each line has the correct class (ham or spam) followed by the raw message.
ham What you doing?how are you?
ham Ok lar... Joking wif u oni...
ham dun say so early hor... U c already then say...
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham Siva is in hostel aha:-.
ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
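The class-then-message layout described above can be split with a few lines of Python. This sketch assumes the class and the message are separated by a single tab character, which the description does not state; if the copy at hand uses a different separator, the `parse_line` helper (a hypothetical name) needs adjusting.

```python
def parse_line(line, separator="\t"):
    """Split one record into (label, message); the separator is an assumption."""
    label, _, message = line.rstrip("\n").partition(separator)
    if label not in ("ham", "spam"):
        raise ValueError(f"unexpected class label: {label!r}")
    return label, message


# Two records from the sample above, joined with the assumed tab separator.
lines = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you canname "
    "the capital of Australia? Text MQUIZ to 82277. B\n",
]
parsed = [parse_line(line) for line in lines]
spam_count = sum(1 for label, _ in parsed if label == "spam")
print(parsed[0], spam_count)
```

Splitting only on the first separator keeps any later tab characters inside the message text intact.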
We would appreciate:
If you find this collection useful, make a reference to the paper below and the web page: http://dcomp.sor.ufscar.br/talmeida/smspamcollection/.
Send us a message either to talmeida < AT > ufscar.br or jmgomezh
Publication and More Information
We offer a comprehensive study of this corpus in the following papers. These works present a number of interesting statistics, studies and baseline results for many traditional machine learning methods.
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011. [preprint]
Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012. [preprint]
Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013. [Invited paper - full version]
About
The SMS Spam Collection has been created by Tiago A. Almeida and José María Gómez Hidalgo.
We would like to thank Min-Yen Kan and his team for making the NUS SMS Corpus available.
Sample of Million Song Dataset
Source
This data is a small subset of the Million Song Dataset. The original data was contributed by The Echo Nest. Prepared by T. Bertin-Mahieux <tb2332 ‘@’ columbia.edu>
Attribute Information
- artist_id:string
- artist_latitude:double
- artist_longitude:double
- artist_location:string
- artist_name:string
- duration:double
- end_of_fade_in:double
- key:int
- key_confidence:double
- loudness:double
- release:string
- song_hotnes:double
- song_id:string
- start_of_fade_out:double
- tempo:double
- time_signature:double
- time_signature_confidence:double
- title:string
- year:double
- partial_sequence:int
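The attribute list above can be turned into a schema description mechanically. The sketch below, in plain Python, converts the `name:type` entries into a DDL-style string of the form `name type, ...`. The helper name `to_ddl` is hypothetical, and whether the hosted files need an explicit schema at all is an assumption.

```python
# Attribute list copied from this section, in "name:type" form.
ATTRIBUTES = """
artist_id:string
artist_latitude:double
artist_longitude:double
artist_location:string
artist_name:string
duration:double
end_of_fade_in:double
key:int
key_confidence:double
loudness:double
release:string
song_hotnes:double
song_id:string
start_of_fade_out:double
tempo:double
time_signature:double
time_signature_confidence:double
title:string
year:double
partial_sequence:int
""".split()


def to_ddl(attributes):
    """Turn "name:type" entries into a "`name` type, ..." DDL string."""
    columns = []
    for entry in attributes:
        name, _, dtype = entry.partition(":")
        # Backticks quote the column name so words like "key" stay safe.
        columns.append(f"`{name}` {dtype}")
    return ", ".join(columns)


schema_ddl = to_ddl(ATTRIBUTES)
# Assumption: in a Spark session, a DDL string like this could be passed to
# spark.read.schema(schema_ddl) when loading the files.
```

Keeping the list as data means the schema string stays in step with the documented attributes if an entry changes.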
Citation
Using the dataset? Please cite the following paper:
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
Acknowledgements
The Million Song Dataset was created under a grant from the National Science Foundation, project IIS-0713334. The original data was contributed by The Echo Nest, as part of an NSF-sponsored GOALI collaboration. Subsequent donations from SecondHandSongs.com, musiXmatch.com, and last.fm, as well as further donations from The Echo Nest, are gratefully acknowledged.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
Fire Department Calls for Service
This data set was obtained from https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3 as of 11/11/2019. The source is open data published by the San Francisco Fire Department, which updates it daily at that link.
Data Set Information
Fire Calls-For-Service includes all fire unit responses to calls. Each record includes the call number, incident number, address, unit identifier, call type, and disposition. All relevant time intervals are also included. Because this dataset is based on responses, and most calls involve multiple units, there are multiple records for each call number. Addresses are associated with a block number, intersection, or call box, not a specific address.
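Because one call can produce several response records, counting rows overstates the number of calls. The following Python sketch groups records by call number to separate the two counts. The records and the field names `call_number` and `unit_id` are hypothetical and may not match the actual column names.

```python
from collections import defaultdict

# Hypothetical (call_number, unit_id) records; values are made up.
responses = [
    ("001", "E01"),
    ("001", "T03"),
    ("002", "M12"),
    ("001", "B02"),
]

# Collect the responding units under each call number.
units_by_call = defaultdict(list)
for call_number, unit_id in responses:
    units_by_call[call_number].append(unit_id)

# Distinct calls versus response records.
call_count = len(units_by_call)
record_count = len(responses)
```

Here four response records collapse to two calls, which is the distinction to keep in mind when aggregating this data set.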
License and/or Citation
This data set is licensed under the following license: Open Data Commons Public Domain Dedication and License (https://opendatacommons.org/licenses/pddl/1.0/)
TPC-H Data
The data in this directory was generated to run the TPC-H benchmark.
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.
For more information, refer to the Transaction Processing Performance Council’s TPC-H page.
Versions
- data-001 is a ~10GB TPC-H dataset generated by Parviz Deyhim (parviz@databricks.com)
Travel Recommendations Data Set
A synthetic dataset related to travel recommendations.
License and/or Citation
The dataset was generated using the Databricks Labs Data Generator (https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
Seattle Temperature Recordings Data Set
This data set was obtained from https://w2.weather.gov/climate/index.php?wfo=sew. The source of the data is the National Weather Service. National Weather Service data is not subject to copyright protection.
Data Set Information
This is a historical weather data set containing all the high and low temperatures recorded in Seattle, WA between 01/01/2015 and 09/30/2018.
Attribute Information
- date: The date of the temperature recording.
- temp: The daily maximum or minimum temperature in Fahrenheit.
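Since `temp` holds either the daily maximum or the daily minimum, a date can appear in more than one row. A minimal Python sketch of recovering the low and high for each date is shown below. The rows are made-up values, and the date string format is an assumption rather than the format used in the files.

```python
from collections import defaultdict

# Hypothetical (date, temp) rows in Fahrenheit; values are illustrative.
rows = [
    ("2015-01-01", 41.0),
    ("2015-01-01", 28.0),
    ("2015-01-02", 44.0),
    ("2015-01-02", 33.0),
]

# Gather every reading recorded for the same date.
temps_by_date = defaultdict(list)
for date, temp in rows:
    temps_by_date[date].append(temp)

# For each date, the smallest reading is the low and the largest the high.
daily_range = {
    date: (min(temps), max(temps)) for date, temps in temps_by_date.items()
}
```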
Wine Quality Data Set
Two datasets related to red and white variants of the Portuguese “Vinho Verde” wine.
Provenance
This data set was obtained from http://archive.ics.uci.edu/ml/datasets/wine+quality. The source of the data is:
- Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
- A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal, @2009
License and/or Citation
This data set is licensed under the following license: See citations.
Applicable citations: Cortez, Paulo (2009). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
License
Unless otherwise noted (e.g. within the README for a given data set), the data is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), which can be viewed at the following URL: http://creativecommons.org/licenses/by/4.0/legalcode
Contributions and Requests
To request or contribute new datasets to this repository, please send an email to: hosted-datasets@databricks.com.
When making the request, include the README.md file you want to publish. Make sure the file includes information about the source of the data, the license, and how to get additional information. Please ensure the license for this data allows it to be hosted by Databricks and consumed by the public.