There is a newer version of this record available.

Dataset | Open Access

Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter Notebooks

Anonymous


JSON-LD (Schema.org) Export

{
  "inLanguage": {
    "alternateName": "eng", 
    "@type": "Language", 
    "name": "English"
  }, 
  "description": "<p>The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.</p>\n\n<p>This repository contains two files:</p>\n\n<ul>\n\t<li>dump.tar.bz2</li>\n\t<li>jupyter_reproducibility.tar.bz2</li>\n</ul>\n\n<p>The <em>dump.tar.bz2</em> file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.</p>\n\n<p>The <em>jupyter_reproducibility.tar.bz2</em> file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:</p>\n\n<ul>\n\t<li>analyses: this folder has all the notebooks we used to analyze the data in the PostgreSQL database.</li>\n\t<li>archaeology: this folder has all the scripts we used to query, download, and extract data from GitHub notebooks.</li>\n\t<li>paper: empty folder; the notebook analyses/N11.To.Paper.ipynb moves data into it.</li>\n</ul>\n\n<p>In the remainder of this text, we give instructions for reproducing the analyses by using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.</p>\n\n<p><strong>Reproducing the Analysis</strong></p>\n\n<p>This section shows how to load the data into the database and run the analysis notebooks. 
In the analysis, we used the following environment:</p>\n\n<blockquote>\n<p>Ubuntu 18.04.1 LTS<br>\nPostgreSQL 10.6<br>\nConda 4.5.1<br>\nPython 3.6.8<br>\nPdfCrop 2012/11/02 v1.38</p>\n</blockquote>\n\n<p>First, download <em>dump.tar.bz2</em> and extract it:</p>\n\n<pre><code class=\"language-bash\">tar -xjf dump.tar.bz2</code></pre>\n\n<p>It extracts the file <em>db2019-01-13.dump</em>. Create a database in PostgreSQL (we call it &quot;jupyter&quot;), and use psql to restore the dump:</p>\n\n<pre><code class=\"language-bash\">psql jupyter &lt; db2019-01-13.dump</code></pre>\n\n<p>It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable <em>JUP_DB_CONNECTION</em>:</p>\n\n<pre><code class=\"language-bash\">export JUP_DB_CONNECTION=\"postgresql://user:[email protected]/jupyter\";</code></pre>\n\n<p>Download and extract <em>jupyter_reproducibility.tar.bz2</em>:</p>\n\n<pre><code class=\"language-bash\">tar -xjf jupyter_reproducibility.tar.bz2</code></pre>\n\n<p>Create a conda environment with Python 3.6:</p>\n\n<pre><code class=\"language-bash\">conda create -n py36 python=3.6</code></pre>\n\n<p>Go to the analyses folder and install all the dependencies listed in <em>requirements.txt</em>:</p>\n\n<pre><code class=\"language-bash\">cd jupyter_reproducibility/analyses\npip install -r requirements.txt</code></pre>\n\n<p>To reproduce the analyses, run Jupyter in this folder:</p>\n\n<pre><code class=\"language-bash\">jupyter notebook</code></pre>\n\n<p>Execute the notebooks in this 
order:</p>\n\n<ul>\n\t<li>N0.Index.ipynb</li>\n\t<li>N1.Repository.ipynb</li>\n\t<li>N2.Notebook.ipynb</li>\n\t<li>N3.Cell.ipynb</li>\n\t<li>N4.Features.ipynb</li>\n\t<li>N5.Modules.ipynb</li>\n\t<li>N6.AST.ipynb</li>\n\t<li>N7.Name.ipynb</li>\n\t<li>N8.Execution.ipynb</li>\n\t<li>N9.Cell.Execution.Order.ipynb</li>\n\t<li>N10.Markdown.ipynb</li>\n\t<li>N11.To.Paper.ipynb</li>\n</ul>\n\n<p><strong>Reproducing or Expanding the Collection</strong></p>\n\n<p>The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.</p>\n\n<p><em><strong>Requirements</strong></em></p>\n\n<p>This time, we have extra requirements:</p>\n\n<blockquote>\n<p>All the analysis requirements<br>\nlbzip2 2.5<br>\ngcc 7.3.0<br>\nGitHub account<br>\nGmail account</p>\n</blockquote>\n\n<p><em><strong>Environment</strong></em></p>\n\n<p>First, set the following environment variables:</p>\n\n<pre><code class=\"language-bash\">export JUP_MACHINE=\"db\"; # machine identifier\nexport JUP_BASE_DIR=\"/mnt/jupyter/github\"; # place to store the repositories\nexport JUP_LOGS_DIR=\"/home/jupyter/logs\"; # log files\nexport JUP_COMPRESSION=\"lbzip2\"; # compression program\nexport JUP_VERBOSE=\"5\"; # verbosity level\nexport JUP_DB_CONNECTION=\"postgresql://user:[email protected]/jupyter\"; # sqlalchemy connection\nexport JUP_GITHUB_USERNAME=\"github_username\"; # your github username\nexport JUP_GITHUB_PASSWORD=\"github_password\"; # your github password\nexport JUP_MAX_SIZE=\"8000.0\"; # maximum size of the repositories directory (in GB)\nexport JUP_FIRST_DATE=\"2013-01-01\"; # initial date to query github\nexport JUP_EMAIL_LOGIN=\"[email protected]\"; # your gmail address\nexport JUP_EMAIL_TO=\"[email protected]\"; # email that receives notifications\nexport JUP_OAUTH_FILE=\"~/oauth2_creds.json\" # oauth2 authentication file\nexport JUP_NOTEBOOK_INTERVAL=\"\"; # notebook id interval for this machine. 
Leave it blank\nexport JUP_REPOSITORY_INTERVAL=\"\"; # repository id interval for this machine. Leave it blank\nexport JUP_WITH_EXECUTION=\"1\"; # execute the python notebooks\nexport JUP_WITH_DEPENDENCY=\"0\"; # run notebooks with and without declared dependencies\nexport JUP_EXECUTION_MODE=\"-1\"; # run following the execution order\nexport JUP_EXECUTION_DIR=\"/home/jupyter/execution\"; # temporary directory for running notebooks\nexport JUP_ANACONDA_PATH=\"~/anaconda3\"; # conda installation path\nexport JUP_MOUNT_BASE=\"/home/jupyter/mount_ghstudy.sh\"; # bash script to mount the base dir\nexport JUP_UMOUNT_BASE=\"/home/jupyter/umount_ghstudy.sh\"; # bash script to unmount the base dir\nexport JUP_NOTEBOOK_TIMEOUT=\"300\"; # timeout for the extraction\n\n\n# Frequency of log reports\nexport JUP_ASTROID_FREQUENCY=\"5\";\nexport JUP_IPYTHON_FREQUENCY=\"5\";\nexport JUP_NOTEBOOKS_FREQUENCY=\"5\";\nexport JUP_REQUIREMENT_FREQUENCY=\"5\";\nexport JUP_CRAWLER_FREQUENCY=\"1\";\nexport JUP_CLONE_FREQUENCY=\"1\";\nexport JUP_COMPRESS_FREQUENCY=\"5\";\n\nexport JUP_DB_IP=\"localhost\"; # postgres database IP</code></pre>\n\n<p>Then, configure the file <em>~/oauth2_creds.json</em> according to the yagmail documentation: <a href=\"//media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf\">//media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf</a></p>\n\n<p>Configure the <em>mount_ghstudy.sh</em> and <em>umount_ghstudy.sh</em> scripts. The first one should mount the folder that stores the repositories. The second one should unmount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.</p>\n\n<p><em><strong>Scripts</strong></em></p>\n\n<p>Download and extract <em>jupyter_reproducibility.tar.bz2</em>:</p>\n\n<pre><code class=\"language-bash\">tar -xjf jupyter_reproducibility.tar.bz2</code></pre>\n\n<p>Create five raw conda environments and five Anaconda environments, one pair for each Python version. 
In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):</p>\n\n<p><em><strong>Conda 2.7</strong></em></p>\n\n<pre><code class=\"language-bash\">conda create -n raw27 python=2.7 -y\nconda activate raw27\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology</code></pre>\n\n<p><em><strong>Anaconda 2.7</strong></em></p>\n\n<pre><code class=\"language-bash\">conda create -n py27 python=2.7 anaconda -y\nconda activate py27\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology\n</code></pre>\n\n<p><strong><em>Conda 3.4</em></strong></p>\n\n<p>It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.</p>\n\n<pre><code class=\"language-bash\">conda create -n raw34 python=3.4 -y\nconda activate raw34\nconda install jupyter -c conda-forge -y\nconda uninstall jupyter -y\npip install --upgrade pip\npip install jupyter\npip install pipenv\npip install -e jupyter_reproducibility/archaeology\npip install pathlib2</code></pre>\n\n<p><em><strong>Anaconda 3.4</strong></em></p>\n\n<pre><code class=\"language-bash\">conda create -n py34 python=3.4 anaconda -y\nconda activate py34\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology</code></pre>\n\n<p><em><strong>Conda 3.5</strong></em></p>\n\n<pre><code class=\"language-bash\">conda create -n raw35 python=3.5 -y\nconda activate raw35\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology</code></pre>\n\n<p><em><strong>Anaconda 3.5</strong></em></p>\n\n<p>It requires the manual installation of additional Anaconda packages.</p>\n\n<pre><code class=\"language-bash\">conda create -n py35 python=3.5 anaconda -y\nconda install -y appdirs atomicwrites keyring secretstorage libuuid 
navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator\nconda activate py35\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology</code></pre>\n\n<p><em><strong>Conda 3.6</strong></em></p>\n\n<pre><code class=\"language-bash\">conda create -n raw36 python=3.6 -y\nconda activate raw36\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology</code></pre>\n\n<p><em><strong>Anaconda 3.6</strong></em></p>\n\n<pre><code class=\"language-bash\">conda create -n py36 python=3.6 anaconda -y\nconda activate py36\nconda install -y anaconda-navigator jupyterlab_server navigator-updater\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology</code></pre>\n\n<p><em><strong>Conda 3.7</strong></em></p>\n\n<pre><code class=\"language-bash\">conda create -n raw37 python=3.7 -y\nconda activate raw37\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology</code></pre>\n\n<p><em><strong>Anaconda 3.7</strong></em></p>\n\n<p>When we executed the experiments, the anaconda package for Python 3.7 was not complete. 
So, we attempted to install all the Anaconda 3.x dependencies manually:</p>\n\n<pre><code class=\"language-bash\">conda create -n py37 python=3.7 anaconda -y\nconda activate py37\nconda install -y _ipyw_jlab_nb_ext_conf alabaster anaconda-client anaconda-navigator anaconda-project appdirs asn1crypto astroid astropy atomicwrites attrs automat\nconda install -y babel backports backports.shutil_get_terminal_size beautifulsoup4 bitarray bkcharts blaze blosc bokeh boto bottleneck bzip2\nconda install -y cairo colorama constantly contextlib2 curl cycler cython\nconda install -y defusedxml docutils et_xmlfile fastcache filelock fribidi\nconda install -y get_terminal_size gevent glob2 gmpy2 graphite2 greenlet\nconda install -y harfbuzz html5lib hyperlink imageio imagesize incremental isort\nconda install -y jbig jdcal jeepney jupyter jupyter_console jupyterlab_launcher keyring kiwisolver\nconda install -y libtool libxslt lxml matplotlib mccabe mkl-service mpmath navigator-updater\nconda install -y nltk nose numpydoc openpyxl pango patchelf path.py pathlib2 patsy pep8 pkginfo ply pyasn1 pyasn1-modules pycodestyle pycosat pycrypto pycurl pyflakes pylint pyodbc pywavelets\nconda install -y rope scikit-image scikit-learn seaborn service_identity singledispatch spyder spyder-kernels statsmodels sympy\nconda install -y tqdm traitlets twisted unicodecsv xlrd xlsxwriter xlwt zope zope.interface\nconda install -y sortedcollections typed-ast\npip install --upgrade pip\npip install pipenv\npip install -e jupyter_reproducibility/archaeology</code></pre>\n\n<p><em><strong>Stopwords</strong></em></p>\n\n<p>Use nltk to download the stopwords:</p>\n\n<pre><code class=\"language-bash\">conda activate py36\npython -c \"import nltk; nltk.download('stopwords')\"</code></pre>\n\n<p>Everything should now be set up to run.</p>\n\n<p><em><strong>Executing</strong></em></p>\n\n<p>In this step, we recommend using the <em>py36</em> environment to orchestrate the execution. 
We designed the scripts for Python 3.6; if they are correctly configured, they can invoke the other environments.</p>\n\n<pre><code class=\"language-bash\">conda activate py36</code></pre>\n\n<p>If you want to extend the execution to more environments, configure them in the file <em>archaeology/config.py</em>.</p>\n\n<p>To query and download repositories from GitHub, run in the <em>jupyter_reproducibility/archaeology</em> directory:</p>\n\n<pre><code class=\"language-bash\">python s0_repository_crawler.py</code></pre>\n\n<p>To extract data from the repositories and notebooks, run in this order:</p>\n\n<pre><code class=\"language-bash\">python s1_notebooks_and_cells.py\npython s2_requirement_files.py\npython s3_compress.py\npython s4_markdown_features.py\npython s5_extract_files.py\npython s6_cell_features.py\npython s7_execute_repositories.py\npython p0_local_possibility.py\npython p1_notebooks_and_cells.py</code></pre>\n\n<p>Alternatively, execute the following script, which orchestrates all the executions and sends a notification when they finish:</p>\n\n<pre><code class=\"language-bash\">python main_with_crawler.py</code></pre>\n\n<p>If some script fails to process all repositories/notebooks/cells, use the &quot;-e&quot; option to rerun it and force re-extraction.</p>\n\n<p>After this process, refer to the <em>Reproducing the Analysis</em> section for analyzing the collected data.</p>\n\n<p><strong>Changelog</strong></p>\n\n<p>2019/01/14 - Version 1 - Initial version<br>\n2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason</p>", 
  "license": "//creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "affiliation": "Anonymous", 
      "@type": "Person", 
      "name": "Anonymous"
    }
  ], 
  "url": "//americinnmankato.com/record/2546834", 
  "datePublished": "2019-01-22", 
  "keywords": [
    "jupyter notebook", 
    "github", 
    "reproducibility"
  ], 
  "@context": "//schema.org/", 
  "distribution": [
    {
      "contentUrl": "//americinnmankato.com/api/files/d14bdbd9-4f63-46ef-bce0-0e2bc450a4fd/dump.tar.bz2", 
      "encodingFormat": "bz2", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "//americinnmankato.com/api/files/d14bdbd9-4f63-46ef-bce0-0e2bc450a4fd/jupyter_reproducibility.tar.bz2", 
      "encodingFormat": "bz2", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "//doi.org/10.5281/zenodo.2546834", 
  "@id": "//doi.org/10.5281/zenodo.2546834", 
  "@type": "Dataset", 
  "name": "Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter Notebooks"
}
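The JSON-LD export above can be consumed programmatically to locate the dataset's files. A minimal sketch using only Python's standard json module; the record below is an abbreviated, illustrative excerpt of the export (most metadata fields omitted), not the full document:

```python
import json

# Abbreviated excerpt of the JSON-LD export shown above;
# the real export also carries the description, creator, keywords, etc.
record = json.loads("""
{
  "@context": "//schema.org/",
  "@type": "Dataset",
  "identifier": "//doi.org/10.5281/zenodo.2546834",
  "distribution": [
    {
      "contentUrl": "//americinnmankato.com/api/files/d14bdbd9-4f63-46ef-bce0-0e2bc450a4fd/dump.tar.bz2",
      "encodingFormat": "bz2",
      "@type": "DataDownload"
    },
    {
      "contentUrl": "//americinnmankato.com/api/files/d14bdbd9-4f63-46ef-bce0-0e2bc450a4fd/jupyter_reproducibility.tar.bz2",
      "encodingFormat": "bz2",
      "@type": "DataDownload"
    }
  ]
}
""")

# Collect the download URLs from the schema.org "distribution" entries.
urls = [d["contentUrl"] for d in record["distribution"]
        if d["@type"] == "DataDownload"]
for url in urls:
    print(url)
```

The schema.org `Dataset`/`DataDownload` keys used here are exactly those present in the export; only the truncation of the record is an assumption for brevity.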
                    All versions    This version
Views:              2,467           248
Downloads:          494             50
Data volume:        2.5 TB          214.4 GB
Unique views:       2,096           217
Unique downloads:   331             30
