From 2f7bbe671a257120895e54cda666d650ecda0241 Mon Sep 17 00:00:00 2001 From: kushalbakshi Date: Fri, 27 Dec 2024 11:25:14 -0500 Subject: [PATCH 01/49] Document `dj.Top()` and add missing pages --- .vscode/settings.json | 2 +- docs/src/concepts/data-model.md | 131 +++- docs/src/concepts/data-pipelines.md | 14 +- docs/src/design/alter.md | 52 ++ docs/src/design/tables/blobs.md | 27 +- docs/src/faq.md | 23 +- docs/src/internal/transpilation.md | 14 +- docs/src/manipulation/transactions.md | 2 +- docs/src/publish-data.md | 2 +- docs/src/query/restrict.md | 12 + docs/src/sysadmin/bulk-storage.md | 22 +- docs/src/sysadmin/database-admin.md | 2 +- docs/src/tutorials/dj-top.ipynb | 1022 +++++++++++++++++++++++++ docs/src/tutorials/json.ipynb | 16 +- 14 files changed, 1253 insertions(+), 88 deletions(-) create mode 100644 docs/src/tutorials/dj-top.ipynb diff --git a/.vscode/settings.json b/.vscode/settings.json index 00ebd4b97..c4e61c07a 100755 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -17,5 +17,5 @@ "[dockercompose]": { "editor.defaultFormatter": "disable" }, - "files.autoSave": "off" + "files.autoSave": "afterDelay" } \ No newline at end of file diff --git a/docs/src/concepts/data-model.md b/docs/src/concepts/data-model.md index 71220e168..ce9bf311d 100644 --- a/docs/src/concepts/data-model.md +++ b/docs/src/concepts/data-model.md @@ -2,11 +2,23 @@ ## What is a data model? -A **data model** refers to a conceptual framework for thinking about data and about -operations on data. -A data model defines the mental toolbox of the data scientist; it has less to do with -the architecture of the data systems, although architectures are often intertwined with -data models. +A **data model** is a conceptual framework that defines how data is organized, +represented, and transformed. It gives us the components for creating blueprints for the +structure and operations of data management systems, ensuring consistency and efficiency +in data handling. + +Data management systems are built to accommodate these models, allowing us to manage +data according to the principles laid out by the model. If you’re studying data science +or engineering, you’ve likely encountered different data models, each providing a unique +approach to organizing and manipulating data. + +A data model is defined by considering the following key aspects: + ++ What are the fundamental elements used to structure the data? ++ What operations are available for defining, creating, and manipulating the data? ++ What mechanisms exist to enforce the structure and rules governing valid data interactions? + +## Types of data models Among the most familiar data models are those based on files and folders: data of any kind are lumped together into binary strings called **files**, files are collected into @@ -24,17 +36,16 @@ objects in memory with properties and methods for transformations of such data. ## Relational data model The **relational model** is a way of thinking about data as sets and operations on sets. -Formalized almost a half-century ago -([Codd, 1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data -model provides the most rigorous approach to structured data storage and the most -precise approach to data querying. -The model is defined by the principles of data representation, domain constraints, -uniqueness constraints, referential constraints, and declarative queries as summarized -below. 
+Formalized almost a half-century ago ([Codd, +1969](https://dl.acm.org/citation.cfm?doid=362384.362685)). The relational data model is +one of the most powerful and precise ways to store and manage structured data. At its +core, this model organizes all data into tables--representing mathematical +relations---where each table consists of rows (representing mathematical tuples) and +columns (often called attributes). ### Core principles of the relational data model -**Data representation** +**Data representation:** Data are represented and manipulated in the form of relations. A relation is a set (i.e. an unordered collection) of entities of values for each of the respective named attributes of the relation. @@ -43,26 +54,26 @@ below. A collection of base relations with their attributes, domain constraints, uniqueness constraints, and referential constraints is called a schema. -**Domain constraints** - Attribute values are drawn from corresponding attribute domains, i.e. predefined sets - of values. - Attribute domains may not include relations, which keeps the data model flat, i.e. - free of nested structures. +**Domain constraints:** + Each attribute (column) in a table is associated with a specific attribute domain (or + datatype, a set of possible values), ensuring that the data entered is valid. + Attribute domains may not include relations, which keeps the data model + flat, i.e. free of nested structures. -**Uniqueness constraints** +**Uniqueness constraints:** Entities within relations are addressed by values of their attributes. To identify and relate data elements, uniqueness constraints are imposed on subsets of attributes. Such subsets are then referred to as keys. One key in a relation is designated as the primary key used for referencing its elements. -**Referential constraints** +**Referential constraints:** Associations among data are established by means of referential constraints with the help of foreign keys. A referential constraint on relation A referencing relation B allows only those entities in A whose foreign key attributes match the key attributes of an entity in B. -**Declarative queries** +**Declarative queries:** Data queries are formulated through declarative, as opposed to imperative, specifications of sought results. This means that query expressions convey the logic for the result rather than the @@ -86,23 +97,26 @@ Similar to spreadsheets, relations are often visualized as tables with *attribut corresponding to *columns* and *entities* corresponding to *rows*. In particular, SQL uses the terms *table*, *column*, and *row*. -## DataJoint is a refinement of the relational data model - -DataJoint is a conceptual refinement of the relational data model offering a more -expressive and rigorous framework for database programming -([Yatsenko et al., 2018](https://arxiv.org/abs/1807.11104)). -The DataJoint model facilitates clear conceptual modeling, efficient schema design, and -precise and flexible data queries. -The model has emerged over a decade of continuous development of complex data pipelines -for neuroscience experiments -([Yatsenko et al., 2015](https://www.biorxiv.org/content/early/2015/11/14/031658)). -DataJoint has allowed researchers with no prior knowledge of databases to collaborate -effectively on common data pipelines sustaining data integrity and supporting flexible -access. -DataJoint is currently implemented as client libraries in MATLAB and Python. 
-These libraries work by transpiling DataJoint queries into SQL before passing them on -to conventional relational database systems that serve as the backend, in combination -with bulk storage systems for storing large contiguous data objects. +## The DataJoint Model + +DataJoint is a conceptual refinement of the relational data model offering a more +expressive and rigorous framework for database programming ([Yatsenko et al., +2018](https://arxiv.org/abs/1807.11104)). The DataJoint model facilitates conceptual +clarity, efficiency, workflow management, and precise and flexible data +queries. By enforcing entity normalization, +simplifying dependency declarations, offering a rich query algebra, and visualizing +relationships through schema diagrams, DataJoint makes relational database programming +more intuitive and robust for complex data pipelines. + +The model has emerged over a decade of continuous development of complex data +pipelines for neuroscience experiments ([Yatsenko et al., +2015](https://www.biorxiv.org/content/early/2015/11/14/031658)). DataJoint has allowed +researchers with no prior knowledge of databases to collaborate effectively on common +data pipelines sustaining data integrity and supporting flexible access. DataJoint is +currently implemented as client libraries in MATLAB and Python. These libraries work by +transpiling DataJoint queries into SQL before passing them on to conventional relational +database systems that serve as the backend, in combination with bulk storage systems for +storing large contiguous data objects. DataJoint comprises: @@ -115,3 +129,44 @@ modeled entities The key refinement of DataJoint over other relational data models and their implementations is DataJoint's support of [entity normalization](../design/normalization.md). + +### Core principles of the DataJoint model + +**Entity Normalization** + DataJoint enforces entity normalization, ensuring that every entity set (table) is + well-defined, with each element belonging to the same type, sharing the same + attributes, and distinguished by the same primary key. This principle reduces + redundancy and avoids data anomalies, similar to Boyce-Codd Normal Form, but with a + more intuitive structure than traditional SQL. + +**Simplified Schema Definition and Dependency Management** + DataJoint introduces a schema definition language that is more expressive and less + error-prone than SQL. Dependencies are explicitly declared using arrow notation + (->), making referential constraints easier to understand and visualize. The + dependency structure is enforced as an acyclic directed graph, which simplifies + workflows by preventing circular dependencies. + +**Integrated Query Operators producing a Relational Algebra** + DataJoint introduces five query operators (restrict, join, project, aggregate, and + union) with algebraic closure, allowing them to be combined seamlessly. These + operators are designed to maintain operational entity normalization, ensuring query + outputs remain valid entity sets. + +**Diagramming Notation for Conceptual Clarity** + DataJoint’s schema diagrams simplify the representation of relationships between + entity sets compared to ERM diagrams. Relationships are expressed as dependencies + between entity sets, which are visualized using solid or dashed lines for primary + and secondary dependencies, respectively. 
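+
+To make the preceding principles concrete, the following minimal sketch shows how a
+dependency is declared with the arrow notation and how the query operators compose.
+The schema, table, and attribute names are purely illustrative and not part of any
+existing pipeline:
+
+```python
+import datajoint as dj
+
+schema = dj.Schema("lab_demo")  # hypothetical schema name, assumes a configured connection
+
+
+@schema
+class Session(dj.Manual):
+    definition = """
+    subject_id   : int    # subject identifier
+    session      : int    # session number within subject
+    ---
+    session_date : date
+    """
+
+
+@schema
+class Recording(dj.Manual):
+    definition = """
+    -> Session            # arrow notation declares the dependency on Session
+    recording    : int
+    ---
+    duration     : float  # recording duration in seconds
+    """
+
+
+# The operators compose with algebraic closure: restrict, join, and project
+# combine in a single expression whose result is itself a valid entity set.
+query = ((Session & "session_date > '2024-01-01'") * Recording).proj("duration")
+```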
+ +**Unified Logic for Binary Operators** + DataJoint simplifies binary operations by requiring attributes involved in joins or + comparisons to be homologous (i.e., sharing the same origin). This avoids the + ambiguity and pitfalls of natural joins in SQL, ensuring more predictable query + results. + +**Optimized Data Pipelines for Scientific Workflows** + DataJoint treats the database as a data pipeline where each entity set defines a + step in the workflow. This makes it ideal for scientific experiments and complex + data processing, such as in neuroscience. Its MATLAB and Python libraries transpile + DataJoint queries into SQL, bridging the gap between scientific programming and + relational databases. diff --git a/docs/src/concepts/data-pipelines.md b/docs/src/concepts/data-pipelines.md index 998ad372a..9ae2dfb87 100644 --- a/docs/src/concepts/data-pipelines.md +++ b/docs/src/concepts/data-pipelines.md @@ -157,10 +157,10 @@ with external groups. ## Summary of DataJoint features 1. A free, open-source framework for scientific data pipelines and workflow management -1. Data hosting in cloud or in-house -1. MySQL, filesystems, S3, and Globus for data management -1. Define, visualize, and query data pipelines from MATLAB or Python -1. Enter and view data through GUIs -1. Concurrent access by multiple users and computational agents -1. Data integrity: identification, dependencies, groupings -1. Automated distributed computation +2. Data hosting in cloud or in-house +3. MySQL, filesystems, S3, and Globus for data management +4. Define, visualize, and query data pipelines from MATLAB or Python +5. Enter and view data through GUIs +6. Concurrent access by multiple users and computational agents +7. Data integrity: identification, dependencies, groupings +8. Automated distributed computation diff --git a/docs/src/design/alter.md b/docs/src/design/alter.md index fe791a11f..70ed39341 100644 --- a/docs/src/design/alter.md +++ b/docs/src/design/alter.md @@ -1 +1,53 @@ # Altering Populated Pipelines + +Tables can be altered after they have been declared and populated. This is useful when +you want to add new secondary attributes or change the data type of existing attributes. +Users can use the `definition` property to update a table's attributes and then use +`alter` to apply the changes in the database. Currently, `alter` does not support +changes to primary key attributes. + +Let's say we have a table `Student` with the following attributes: + +```python +@schema +class Student(dj.Manual): + definition = """ + student_id: int + --- + first_name: varchar(40) + last_name: varchar(40) + home_address: varchar(100) + """ +``` + +We can modify the table to include a new attribute `email`: + +```python +Student.definition = """ +student_id: int +--- +first_name: varchar(40) +last_name: varchar(40) +home_address: varchar(100) +email: varchar(100) +""" +Student.alter() +``` + +The `alter` method will update the table in the database to include the new attribute +`email` added by the user in the table's `definition` property. + +Similarly, you can modify the data type or length of an existing attribute. 
For example, +to alter the `home_address` attribute to have a length of 200 characters: + +```python +Student.definition = """ +student_id: int +--- +first_name: varchar(40) +last_name: varchar(40) +home_address: varchar(200) +email: varchar(100) +""" +Student.alter() +``` diff --git a/docs/src/design/tables/blobs.md b/docs/src/design/tables/blobs.md index 76847983e..55cc0faff 100644 --- a/docs/src/design/tables/blobs.md +++ b/docs/src/design/tables/blobs.md @@ -1 +1,26 @@ -# Work in progress +# Overview + +DataJoint provides functionality for serializing and deserializing complex data types +into binary blobs for efficient storage and compatibility with MATLAB's mYm +serialization. This includes support for: + ++ Basic Python data types (e.g., integers, floats, strings, dictionaries). ++ NumPy arrays and scalars. ++ Specialized data types like UUIDs, decimals, and datetime objects. + +## Serialization and Deserialization Process + +Serialization converts Python objects into a binary representation for efficient storage +within the database. Deserialization converts the binary representation back into the +original Python object. + +Blobs over 1 KiB are compressed using the zlib library to reduce storage requirements. + +## Supported Data Types + +DataJoint supports the following data types for serialization: + ++ Scalars: Integers, floats, booleans, strings. ++ Collections: Lists, tuples, sets, dictionaries. ++ NumPy: Arrays, structured arrays, and scalars. ++ Custom Types: UUIDs, decimals, datetime objects, MATLAB cell and struct arrays. diff --git a/docs/src/faq.md b/docs/src/faq.md index a3d5fd92d..d22e64241 100644 --- a/docs/src/faq.md +++ b/docs/src/faq.md @@ -4,17 +4,18 @@ It is common to enter data during experiments using a graphical user interface. -1. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open -source project for data entry. +1. The [DataJoint Works](https://works.datajoint.com) platform is a web-based, fully +managed service to host and execute data pipelines. -2. The DataJoint Works platform is set up as a fully managed service to host and -execute data pipelines. +2. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open +source project for data entry but is no longer actively maintained. ## Does DataJoint support other programming languages? -DataJoint [Python](https://datajoint.com/docs/core/datajoint-python/) and -[Matlab](https://datajoint.com/docs/core/datajoint-matlab/) APIs are both actively -supported. Previous projects implemented some DataJoint features in +DataJoint [Python](https://datajoint.com/docs/core/datajoint-python/) is the most +up-to-date version and all future development will focus on the Python API. The +[Matlab](https://datajoint.com/docs/core/datajoint-matlab/) API was actively developed +through 2023. Previous projects implemented some DataJoint features in [Julia](https://github.com/BrainCOGS/neuronex_workshop_2018/tree/julia/julia) and [Rust](https://github.com/datajoint/datajoint-core). DataJoint's data model and data representation are largely language independent, which means that any language with a @@ -92,7 +93,7 @@ The entry of metadata can be manual, or it can be an automated part of data acqu into the database). Depending on their size and contents, raw data files can be stored in a number of ways. 
-In the simplest and most common scenario, raw data continue to be stored in either a +In the simplest and most common scenario, raw data continue to be stored in either a local filesystem or in the cloud as collections of files and folders. The paths to these files are entered in the database (again, either manually or by automated processes). @@ -100,7 +101,7 @@ This is the point at which the notion of a **data pipeline** begins. Below these "manual tables" that contain metadata and file paths are a series of tables that load raw data from these files, process it in some way, and insert derived or summarized data directly into the database. -For example, in an imaging application, the very large raw .TIFF stacks would reside on +For example, in an imaging application, the very large raw `.TIFF` stacks would reside on the filesystem, but the extracted fluorescent trace timeseries for each cell in the image would be stored as a numerical array directly in the database. Or the raw video used for animal tracking might be stored in a standard video format on @@ -163,8 +164,8 @@ This brings us to the final important question: ## How do I get my data out? -This is the fun part. See [queries](query/operators.md) for details of the DataJoint -query language directly from MATLAB and Python. +This is the fun part. See [queries](query/operators.md) for details of the DataJoint +query language directly from Python. ## Interfaces diff --git a/docs/src/internal/transpilation.md b/docs/src/internal/transpilation.md index a2ff1d0c4..c8fa09b0e 100644 --- a/docs/src/internal/transpilation.md +++ b/docs/src/internal/transpilation.md @@ -34,7 +34,7 @@ restriction appending the new condition to the input's restriction. Property `support` represents the `FROM` clause and contains a list of either `QueryExpression` objects or table names in the case of base queries. -The joint operator `*` adds new elements to the `support` attribute. +The join operator `*` adds new elements to the `support` attribute. At least one element must be present in `support`. Multiple elements in `support` indicate a join. @@ -56,10 +56,10 @@ self: `heading`, `restriction`, and `support`. The input object is treated as a subquery in the following cases: -1. A restriction is applied that uses alias attributes in the heading -1. A projection uses an alias attribute to create a new alias attribute. -1. A join is performed on an alias attribute. -1. An Aggregation is used a restriction. +1. A restriction is applied that uses alias attributes in the heading. +2. A projection uses an alias attribute to create a new alias attribute. +3. A join is performed on an alias attribute. +4. An Aggregation is used a restriction. An error arises if @@ -117,8 +117,8 @@ input — the *aggregated* query expression. The SQL equivalent of aggregation is 1. the NATURAL LEFT JOIN of the two inputs. -1. followed by a GROUP BY on the primary key arguments of the first input -1. followed by a projection. +2. followed by a GROUP BY on the primary key arguments of the first input +3. followed by a projection. The projection works the same as `.proj` with respect to the first input. 
With respect to the second input, the projection part of aggregation allows only diff --git a/docs/src/manipulation/transactions.md b/docs/src/manipulation/transactions.md index fa4f4294b..5e0d7ed07 100644 --- a/docs/src/manipulation/transactions.md +++ b/docs/src/manipulation/transactions.md @@ -6,7 +6,7 @@ interrupting the sequence of such operations halfway would leave the data in an state. While the sequence is in progress, other processes accessing the database will not see the partial results until the transaction is complete. -The sequence make include [data queries](../query/principles.md) and +The sequence may include [data queries](../query/principles.md) and [manipulations](index.md). In such cases, the sequence of operations may be enclosed in a transaction. diff --git a/docs/src/publish-data.md b/docs/src/publish-data.md index e68a2843a..522d5bc35 100644 --- a/docs/src/publish-data.md +++ b/docs/src/publish-data.md @@ -27,7 +27,7 @@ The code and the data can be found at https://github.com/sinzlab/Sinz2018_NIPS ## Exporting into a collection of files -Another option for publishing and archiving data is to export the data from the +Another option for publishing and archiving data is to export the data from the DataJoint pipeline into a collection of files. DataJoint provides features for exporting and importing sections of the pipeline. Several ongoing projects are implementing the capability to export from DataJoint diff --git a/docs/src/query/restrict.md b/docs/src/query/restrict.md index 0cb3cc29b..f66d91126 100644 --- a/docs/src/query/restrict.md +++ b/docs/src/query/restrict.md @@ -191,3 +191,15 @@ experiments that are part of sessions performed by Alice. query = Session & 'user = "Alice"' Experiment & query ``` + +## Restriction by `dj.Top` + +Restriction by `dj.Top` returns the number of entities specified by the `limit` +argument. These entities can be returned in the order specified by the `order_by` +argument. And finally, the `offset` argument can be used to offset the returned entities +which is useful for pagination in web applications. + +```python +# Return the first 10 sessions in descending order of session date +Session & dj.Top(limit=10, order_by='session_date DESC') +``` diff --git a/docs/src/sysadmin/bulk-storage.md b/docs/src/sysadmin/bulk-storage.md index 1289b8c9b..12af44791 100644 --- a/docs/src/sysadmin/bulk-storage.md +++ b/docs/src/sysadmin/bulk-storage.md @@ -8,18 +8,17 @@ significant and useful for a number of reasons. ### Cost -One of these is that the high-performance storage commonly used in -database systems is more expensive than that used in more typical -commodity storage, and so storing the smaller identifying information -typically used in queries on fast, relational database storage and -storing the larger bulk data used for analysis or processing on lower -cost commodity storage can allow for large savings in storage expense. +One reason is that the high-performance storage commonly used in database systems is +more expensive than typical commodity storage. Therefore, storing the smaller identifying +information typically used in queries on fast, relational database storage and storing +the larger bulk data used for analysis or processing on lower cost commodity storage +enables large savings in storage expense. ### Flexibility Storing bulk data separately also facilitates more flexibility in usage, since the bulk data can managed using separate maintenance -processes than that in the relational storage. 
+processes than those in the relational storage. For example, larger relational databases may require many hours to be restored in the event of system failures. If the relational portion of @@ -40,11 +39,10 @@ been retrieved in previous queries. ### Data Sharing -DataJoint provides pluggable support for different external bulk -storage backends, which can provide benefits for data sharing by -publishing bulk data to S3-Protocol compatible data shares both in the -cloud and on locally managed systems and other common tools for data -sharing, such as Globus, etc. +DataJoint provides pluggable support for different external bulk storage backends, +allowing data sharing by publishing bulk data to S3-Protocol compatible data shares both +in the cloud and on locally managed systems and other common tools for data sharing, +such as Globus, etc. ## Bulk Storage Scenarios diff --git a/docs/src/sysadmin/database-admin.md b/docs/src/sysadmin/database-admin.md index 64bf92cd8..e56cd833d 100644 --- a/docs/src/sysadmin/database-admin.md +++ b/docs/src/sysadmin/database-admin.md @@ -179,7 +179,7 @@ grouped together by common prefixes. For example, a lab may have a collection of schemas that begin with `common_`. Some common processing may be organized into several schemas that begin with `pipeline_`. Typically each user has all privileges to schemas that -begin with her username. +begin with their username. For example, alice may have privileges to select and insert data from the common schemas (but not create new tables), and have all diff --git a/docs/src/tutorials/dj-top.ipynb b/docs/src/tutorials/dj-top.ipynb new file mode 100644 index 000000000..4e0604af0 --- /dev/null +++ b/docs/src/tutorials/dj-top.ipynb @@ -0,0 +1,1022 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using the `dj.Top` restriction\n", + "\n", + "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://datajoint.com/docs/core/glossary/#data-pipeline).\n", + "\n", + "Now let's start by importing the `datajoint` client." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[2024-12-20 11:10:20,120][INFO]: Connecting root@127.0.0.1:3306\n", + "[2024-12-20 11:10:20,259][INFO]: Connected root@127.0.0.1:3306\n" + ] + } + ], + "source": [ + "import datajoint as dj\n", + "dj.config[\"database.host\"] = \"127.0.0.1\"\n", + "schema = dj.Schema('university')" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "@schema\n", + "class Student(dj.Manual):\n", + " definition = \"\"\"\n", + " student_id : int unsigned # university-wide ID number\n", + " ---\n", + " first_name : varchar(40)\n", + " last_name : varchar(40)\n", + " sex : enum('F', 'M', 'U')\n", + " date_of_birth : date\n", + " home_address : varchar(120) # mailing street address\n", + " home_city : varchar(60) # mailing address\n", + " home_state : char(2) # US state acronym: e.g. OH\n", + " home_zip : char(10) # zipcode e.g. 93979-4979\n", + " home_phone : varchar(20) # e.g. 414.657.6883x0881\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "@schema\n", + "class Department(dj.Manual):\n", + " definition = \"\"\"\n", + " dept : varchar(6) # abbreviated department name, e.g. 
BIOL\n", + " ---\n", + " dept_name : varchar(200) # full department name\n", + " dept_address : varchar(200) # mailing address\n", + " dept_phone : varchar(20)\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "@schema\n", + "class StudentMajor(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Student\n", + " ---\n", + " -> Department\n", + " declare_date : date # when student declared her major\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[2024-12-26 12:03:01,311][INFO]: Table altered\n" + ] + } + ], + "source": [ + "StudentMajor.definition = \"\"\"\n", + "-> Student\n", + "---\n", + "-> Department\n", + "declare_date : date # when student declared her major\n", + "\"\"\"\n", + "StudentMajor.alter()" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-> Student\n", + "---\n", + "-> Department\n", + "declare_date : date # when student declared her major\n", + "\n" + ] + } + ], + "source": [ + "print(StudentMajor.describe())" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "@schema\n", + "class Course(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Department\n", + " course : int unsigned # course number, e.g. 1010\n", + " ---\n", + " course_name : varchar(200) # e.g. \"Neurobiology of Sensation and Movement.\"\n", + " credits : decimal(3,1) # number of credits earned by completing the course\n", + " \"\"\"\n", + " \n", + "@schema\n", + "class Term(dj.Manual):\n", + " definition = \"\"\"\n", + " term_year : year\n", + " term : enum('Spring', 'Summer', 'Fall')\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Section(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Course\n", + " -> Term\n", + " section : char(1)\n", + " ---\n", + " auditorium : varchar(12)\n", + " \"\"\"\n", + " \n", + "@schema\n", + "class CurrentTerm(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Term\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Enroll(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Student\n", + " -> Section\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class LetterGrade(dj.Lookup):\n", + " definition = \"\"\"\n", + " grade : char(2)\n", + " ---\n", + " points : decimal(3,2)\n", + " \"\"\"\n", + " contents = [\n", + " ['A', 4.00],\n", + " ['A-', 3.67],\n", + " ['B+', 3.33],\n", + " ['B', 3.00],\n", + " ['B-', 2.67],\n", + " ['C+', 2.33],\n", + " ['C', 2.00],\n", + " ['C-', 1.67],\n", + " ['D+', 1.33],\n", + " ['D', 1.00],\n", + " ['F', 0.00]\n", + " ]\n", + "\n", + "@schema\n", + "class Grade(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Enroll\n", + " ---\n", + " -> LetterGrade\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm import tqdm\n", + "import faker\n", + "import random\n", + "import datetime\n", + "fake = faker.Faker()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "def yield_students():\n", + " fake_name = {'F': fake.name_female, 'M': fake.name_male}\n", + " while True: # ignore invalid values\n", + " try:\n", + " sex = random.choice(('F', 'M'))\n", + " first_name, last_name = fake_name[sex]().split(' ')[:2]\n", + " street_address, city = 
fake.address().split('\\n')\n", + " city, state = city.split(', ')\n", + " state, zipcode = state.split(' ') \n", + " except ValueError:\n", + " continue\n", + " else:\n", + " yield dict(\n", + " first_name=first_name,\n", + " last_name=last_name,\n", + " sex=sex,\n", + " home_address=street_address,\n", + " home_city=city,\n", + " home_state=state,\n", + " home_zip=zipcode,\n", + " date_of_birth=str(\n", + " fake.date_time_between(start_date=\"-35y\", end_date=\"-15y\").date()),\n", + " home_phone = fake.phone_number()[:20])" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "Student.insert(\n", + " dict(k, student_id=i) for i, k in zip(range(100,300), yield_students()))\n", + "\n", + "Department.insert(\n", + " dict(dept=dept, \n", + " dept_name=name, \n", + " dept_address=fake.address(), \n", + " dept_phone=fake.phone_number()[:20])\n", + " for dept, name in [\n", + " [\"CS\", \"Computer Science\"],\n", + " [\"BIOL\", \"Life Sciences\"],\n", + " [\"PHYS\", \"Physics\"],\n", + " [\"MATH\", \"Mathematics\"]])\n", + "\n", + "StudentMajor.insert({**s, **d, \n", + " 'declare_date':fake.date_between(start_date=datetime.date(1999,1,1))}\n", + " for s, d in zip(Student.fetch('https://github.com/KEY'), random.choices(Department.fetch('https://github.com/KEY'), k=len(Student())))\n", + " if random.random() < 0.75)\n", + "\n", + "# from https://www.utah.edu/\n", + "Course.insert([\n", + " ['BIOL', 1006, 'World of Dinosaurs', 3],\n", + " ['BIOL', 1010, 'Biology in the 21st Century', 3],\n", + " ['BIOL', 1030, 'Human Biology', 3],\n", + " ['BIOL', 1210, 'Principles of Biology', 4],\n", + " ['BIOL', 2010, 'Evolution & Diversity of Life', 3],\n", + " ['BIOL', 2020, 'Principles of Cell Biology', 3],\n", + " ['BIOL', 2021, 'Principles of Cell Science', 4],\n", + " ['BIOL', 2030, 'Principles of Genetics', 3],\n", + " ['BIOL', 2210, 'Human Genetics',3],\n", + " ['BIOL', 2325, 'Human Anatomy', 4],\n", + " ['BIOL', 2330, 'Plants & Society', 3],\n", + " ['BIOL', 2355, 'Field Botany', 2],\n", + " ['BIOL', 2420, 'Human Physiology', 4],\n", + "\n", + " ['PHYS', 2040, 'Classcal Theoretical Physics II', 4],\n", + " ['PHYS', 2060, 'Quantum Mechanics', 3],\n", + " ['PHYS', 2100, 'General Relativity and Cosmology', 3],\n", + " ['PHYS', 2140, 'Statistical Mechanics', 4],\n", + " \n", + " ['PHYS', 2210, 'Physics for Scientists and Engineers I', 4], \n", + " ['PHYS', 2220, 'Physics for Scientists and Engineers II', 4],\n", + " ['PHYS', 3210, 'Physics for Scientists I (Honors)', 4],\n", + " ['PHYS', 3220, 'Physics for Scientists II (Honors)', 4],\n", + " \n", + " ['MATH', 1250, 'Calculus for AP Students I', 4],\n", + " ['MATH', 1260, 'Calculus for AP Students II', 4],\n", + " ['MATH', 1210, 'Calculus I', 4],\n", + " ['MATH', 1220, 'Calculus II', 4],\n", + " ['MATH', 2210, 'Calculus III', 3],\n", + " \n", + " ['MATH', 2270, 'Linear Algebra', 4],\n", + " ['MATH', 2280, 'Introduction to Differential Equations', 4],\n", + " ['MATH', 3210, 'Foundations of Analysis I', 4],\n", + " ['MATH', 3220, 'Foundations of Analysis II', 4],\n", + " \n", + " ['CS', 1030, 'Foundations of Computer Science', 3],\n", + " ['CS', 1410, 'Introduction to Object-Oriented Programming', 4],\n", + " ['CS', 2420, 'Introduction to Algorithms & Data Structures', 4],\n", + " ['CS', 2100, 'Discrete Structures', 3],\n", + " ['CS', 3500, 'Software Practice', 4],\n", + " ['CS', 3505, 'Software Practice II', 3],\n", + " ['CS', 3810, 'Computer Organization', 4],\n", + " ['CS', 4400, 
'Computer Systems', 4],\n", + " ['CS', 4150, 'Algorithms', 3],\n", + " ['CS', 3100, 'Models of Computation', 3],\n", + " ['CS', 3200, 'Introduction to Scientific Computing', 3],\n", + " ['CS', 4000, 'Senior Capstone Project - Design Phase', 3],\n", + " ['CS', 4500, 'Senior Capstone Project', 3],\n", + " ['CS', 4940, 'Undergraduate Research', 3],\n", + " ['CS', 4970, 'Computer Science Bachelor''s Thesis', 3]])\n", + "\n", + "Term.insert(dict(term_year=year, term=term) \n", + " for year in range(1999, 2019) \n", + " for term in ['Spring', 'Summer', 'Fall'])\n", + "\n", + "Term().fetch(order_by=('term_year DESC', 'term DESC'), as_dict=True, limit=1)[0]\n", + "\n", + "CurrentTerm().insert1({\n", + " **Term().fetch(order_by=('term_year DESC', 'term DESC'), as_dict=True, limit=1)[0]})\n", + "\n", + "def make_section(prob):\n", + " for c in (Course * Term).proj():\n", + " for sec in 'abcd':\n", + " if random.random() < prob:\n", + " break\n", + " yield {\n", + " **c, 'section': sec, \n", + " 'auditorium': random.choice('ABCDEF') + str(random.randint(1,100))} \n", + "\n", + "Section.insert(make_section(0.5))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 200/200 [00:27<00:00, 7.17it/s]\n" + ] + } + ], + "source": [ + "# Enrollment \n", + "terms = Term().fetch('https://github.com/KEY')\n", + "quit_prob = 0.1\n", + "for student in tqdm(Student.fetch('https://github.com/KEY')):\n", + " start_term = random.randrange(len(terms))\n", + " for term in terms[start_term:]:\n", + " if random.random() < quit_prob:\n", + " break\n", + " else:\n", + " sections = ((Section & term) - (Course & (Enroll & student))).fetch('https://github.com/KEY')\n", + " if sections:\n", + " Enroll.insert({**student, **section} for section in \n", + " random.sample(sections, random.randrange(min(5, len(sections)))))\n", + " \n", + "# assign random grades\n", + "grades = LetterGrade.fetch('https://github.com/grade')\n", + "\n", + "grade_keys = Enroll.fetch('https://github.com/KEY')\n", + "random.shuffle(grade_keys)\n", + "grade_keys = grade_keys[:len(grade_keys)*9//10]\n", + "\n", + "Grade.insert({**key, 'grade':grade} \n", + " for key, grade in zip(grade_keys, random.choices(grades, k=len(grade_keys))))" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

student_id

\n", + " university-wide ID number\n", + "
\n", + "

dept

\n", + " abbreviated department name, e.g. BIOL\n", + "
\n", + "

course

\n", + " course number, e.g. 1010\n", + "
\n", + "

term_year

\n", + " \n", + "
\n", + "

term

\n", + " \n", + "
\n", + "

section

\n", + " \n", + "
\n", + "

grade

\n", + " \n", + "
\n", + "

points

\n", + " \n", + "
100MATH22802018FallaA-3.67
191MATH22102018SpringbA4.00
211CS21002018FallaA4.00
273PHYS21002018SpringaA4.00
282BIOL20212018SpringdA4.00
\n", + " \n", + "

Total: 5

\n", + " " + ], + "text/plain": [ + "*student_id *dept *course *term_year *term *section *grade points \n", + "+------------+ +------+ +--------+ +-----------+ +--------+ +---------+ +-------+ +--------+\n", + "100 MATH 2280 2018 Fall a A- 3.67 \n", + "191 MATH 2210 2018 Spring b A 4.00 \n", + "211 CS 2100 2018 Fall a A 4.00 \n", + "273 PHYS 2100 2018 Spring a A 4.00 \n", + "282 BIOL 2021 2018 Spring d A 4.00 \n", + " (Total: 5)" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "(Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=5, order_by='points DESC', offset=5)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\"SELECT `grade`,`student_id`,`dept`,`course`,`term_year`,`term`,`section`,`points` FROM `university`.`#letter_grade` NATURAL JOIN `university`.`grade` WHERE ( (term_year='2018')) ORDER BY `points` DESC LIMIT 10\"" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "((LetterGrade * Grade) & \"term_year='2018'\" & dj.Top(limit=10, order_by='points DESC', offset=0)).make_sql()" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\"SELECT `student_id`,`dept`,`course`,`term_year`,`term`,`section`,`grade`,`points` FROM `university`.`grade` NATURAL JOIN `university`.`#letter_grade` WHERE ( (term_year='2018')) ORDER BY `points` DESC LIMIT 20\"" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "((Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=20, order_by='points DESC', offset=0)).make_sql()" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

student_id

\n", + " university-wide ID number\n", + "
\n", + "

dept

\n", + " abbreviated department name, e.g. BIOL\n", + "
\n", + "

course

\n", + " course number, e.g. 1010\n", + "
\n", + "

term_year

\n", + " \n", + "
\n", + "

term

\n", + " \n", + "
\n", + "

section

\n", + " \n", + "
\n", + "

grade

\n", + " \n", + "
\n", + "

points

\n", + " \n", + "
100CS32002018FallcA4.00
100MATH22802018FallaA-3.67
100PHYS22102018SpringdA4.00
122CS10302018FallcB+3.33
131BIOL20302018SpringaA4.00
131CS32002018FallbB+3.33
136BIOL22102018SpringcB+3.33
136MATH22102018FallbB+3.33
141BIOL20102018SummercB+3.33
141CS24202018FallbA4.00
141CS32002018FallbA-3.67
182CS14102018SummercA-3.67
\n", + "

...

\n", + "

Total: 20

\n", + " " + ], + "text/plain": [ + "*student_id *dept *course *term_year *term *section *grade points \n", + "+------------+ +------+ +--------+ +-----------+ +--------+ +---------+ +-------+ +--------+\n", + "100 CS 3200 2018 Fall c A 4.00 \n", + "100 MATH 2280 2018 Fall a A- 3.67 \n", + "100 PHYS 2210 2018 Spring d A 4.00 \n", + "122 CS 1030 2018 Fall c B+ 3.33 \n", + "131 BIOL 2030 2018 Spring a A 4.00 \n", + "131 CS 3200 2018 Fall b B+ 3.33 \n", + "136 BIOL 2210 2018 Spring c B+ 3.33 \n", + "136 MATH 2210 2018 Fall b B+ 3.33 \n", + "141 BIOL 2010 2018 Summer c B+ 3.33 \n", + "141 CS 2420 2018 Fall b A 4.00 \n", + "141 CS 3200 2018 Fall b A- 3.67 \n", + "182 CS 1410 2018 Summer c A- 3.67 \n", + " ...\n", + " (Total: 20)" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "(Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=20, order_by='points DESC', offset=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

grade

\n", + " \n", + "
\n", + "

student_id

\n", + " university-wide ID number\n", + "
\n", + "

dept

\n", + " abbreviated department name, e.g. BIOL\n", + "
\n", + "

course

\n", + " course number, e.g. 1010\n", + "
\n", + "

term_year

\n", + " \n", + "
\n", + "

term

\n", + " \n", + "
\n", + "

section

\n", + " \n", + "
\n", + "

points

\n", + " \n", + "
A100CS32002018Fallc4.00
A100PHYS22102018Springd4.00
A131BIOL20302018Springa4.00
A141CS24202018Fallb4.00
A186PHYS22102018Springa4.00
A191MATH22102018Springb4.00
A211CS21002018Falla4.00
A273PHYS21002018Springa4.00
A282BIOL20212018Springd4.00
A-100MATH22802018Falla3.67
A-141CS32002018Fallb3.67
A-182CS14102018Summerc3.67
\n", + "

...

\n", + "

Total: 20

\n", + " " + ], + "text/plain": [ + "*grade *student_id *dept *course *term_year *term *section points \n", + "+-------+ +------------+ +------+ +--------+ +-----------+ +--------+ +---------+ +--------+\n", + "A 100 CS 3200 2018 Fall c 4.00 \n", + "A 100 PHYS 2210 2018 Spring d 4.00 \n", + "A 131 BIOL 2030 2018 Spring a 4.00 \n", + "A 141 CS 2420 2018 Fall b 4.00 \n", + "A 186 PHYS 2210 2018 Spring a 4.00 \n", + "A 191 MATH 2210 2018 Spring b 4.00 \n", + "A 211 CS 2100 2018 Fall a 4.00 \n", + "A 273 PHYS 2100 2018 Spring a 4.00 \n", + "A 282 BIOL 2021 2018 Spring d 4.00 \n", + "A- 100 MATH 2280 2018 Fall a 3.67 \n", + "A- 141 CS 3200 2018 Fall b 3.67 \n", + "A- 182 CS 1410 2018 Summer c 3.67 \n", + " ...\n", + " (Total: 20)" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "(LetterGrade * Grade) & \"term_year='2018'\" & dj.Top(limit=20, order_by='points DESC', offset=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "elements", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/src/tutorials/json.ipynb b/docs/src/tutorials/json.ipynb index f83b960bc..a33c4b6c5 100644 --- a/docs/src/tutorials/json.ipynb +++ b/docs/src/tutorials/json.ipynb @@ -39,7 +39,7 @@ "metadata": {}, "outputs": [], "source": [ - "import datajoint as dj\n" + "import datajoint as dj" ] }, { @@ -57,9 +57,9 @@ "source": [ "For this exercise, let's imagine we work for an awesome company that is organizing a fun RC car race across various teams in the company. Let's see which team has the fastest car! 🏎️\n", "\n", - "This establishes 2 important entities: a `Team` and a `Car`. Normally we'd map this to their own dedicated table, however, let's assume that `Team` is well-structured but `Car` is less structured then we'd prefer. In other words, the structure for what makes up a *car* is varing too much between entries (perhaps because users of the pipeline haven't agreed yet on the definition? 🤷).\n", + "This establishes 2 important entities: a `Team` and a `Car`. Normally the entities are mapped to their own dedicated table, however, let's assume that `Team` is well-structured but `Car` is less structured than we'd prefer. In other words, the structure for what makes up a *car* is varying too much between entries (perhaps because users of the pipeline haven't agreed yet on the definition? 🤷).\n", "\n", - "This would make it a good use-case to keep `Team` as a table but make `Car` actually a `json` type defined within the `Team` table.\n", + "This would make it a good use-case to keep `Team` as a table but make `Car` a `json` type defined within the `Team` table.\n", "\n", "Let's begin." 
] @@ -80,7 +80,7 @@ } ], "source": [ - "schema = dj.Schema(f\"{dj.config['database.user']}_json\")\n" + "schema = dj.Schema(f\"{dj.config['database.user']}_json\")" ] }, { @@ -99,7 +99,7 @@ " car=null: json # A car belonging to a team (null to allow registering first but specifying car later)\n", " \n", " unique index(car.length:decimal(4, 1)) # Add an index if this key is frequently accessed\n", - " \"\"\"\n" + " \"\"\"" ] }, { @@ -145,7 +145,7 @@ " ],\n", " },\n", " }\n", - ")\n" + ")" ] }, { @@ -193,7 +193,7 @@ " },\n", " },\n", " ]\n", - ")\n" + ")" ] }, { @@ -1044,7 +1044,7 @@ "metadata": {}, "outputs": [], "source": [ - "schema.drop()\n" + "schema.drop()" ] }, { From 27b09859ffb29365761521e3c0a10ce700847ae2 Mon Sep 17 00:00:00 2001 From: kushalbakshi Date: Fri, 10 Jan 2025 15:46:30 -0500 Subject: [PATCH 02/49] Update(teamwork.md): diagrams and rendering --- docs/src/concepts/teamwork.md | 69 +++++++++++++++++------------------ 1 file changed, 33 insertions(+), 36 deletions(-) diff --git a/docs/src/concepts/teamwork.md b/docs/src/concepts/teamwork.md index 46bd9e3a9..b203e1dea 100644 --- a/docs/src/concepts/teamwork.md +++ b/docs/src/concepts/teamwork.md @@ -5,10 +5,9 @@ Science labs organize their projects as a sequence of activities of experiment design, data acquisition, and processing and analysis. -
- ![data science in a science lab](../images/data-science-before.png){: style="width:520px; align:center"} -
Workflow and dataflow in a common findings-centered approach to data science in a science lab.
-
+![data science in a science lab](../images/data-science-before.png){: style="width:510px; display:block; margin: 0 auto;"} + +
Workflow and dataflow in a common findings-centered approach to data science in a science lab.
Many labs lack a uniform data management strategy that would span longitudinally across the entire project lifecycle as well as laterally across different projects. @@ -29,10 +28,9 @@ This approach requires formulating a general data science plan and upfront inves for setting up resources and processes and training the teams. The team uses DataJoint to build data pipelines to support multiple projects. -
- ![data science in a science lab](../images/data-science-after.png){: style="width:510px; align:center"} -
Workflow and dataflow in a data pipeline-centered approach.
-
+![data science in a science lab](../images/data-science-after.png){: style="width:510px; display:block; margin: 0 auto;"} + +
Workflow and dataflow in a data pipeline-centered approach.
Data pipelines support project data across their entire lifecycle, including the following functions @@ -55,42 +53,41 @@ data integrity. The adoption of a uniform data management framework allows separation of roles and division of labor among team members, leading to greater efficiency and better scaling. -
- ![data science vs engineering](../images/data-engineering.png){: style="width:350px; align:center"} -
Distinct responsibilities of data science and data engineering.
-
+![data science in a science lab](../images/data-engineering.png){: style="width:510px; display:block; margin: 0 auto;"} + +
Distinct responsibilities of data science and data engineering.
-Scientists +### Scientists - design and conduct experiments, collecting data. - They interact with the data pipeline through graphical user interfaces designed by - others. - They understand what analysis is used to test their hypotheses. +Design and conduct experiments, collecting data. +They interact with the data pipeline through graphical user interfaces designed by +others. +They understand what analysis is used to test their hypotheses. -Data scientists +### Data scientists - have the domain expertise and select and implement the processing and analysis - methods for experimental data. - Data scientists are in charge of defining and managing the data pipeline using - DataJoint's data model, but they may not know the details of the underlying - architecture. - They interact with the pipeline using client programming interfaces directly from - languages such as MATLAB and Python. +Have the domain expertise and select and implement the processing and analysis +methods for experimental data. +Data scientists are in charge of defining and managing the data pipeline using +DataJoint's data model, but they may not know the details of the underlying +architecture. +They interact with the pipeline using client programming interfaces directly from +languages such as MATLAB and Python. - The bulk of this manual is written for working data scientists, except for System - Administration. +The bulk of this manual is written for working data scientists, except for System +Administration. -Data engineers +### Data engineers - work with the data scientists to support the data pipeline. - They rely on their understanding of the DataJoint data model to configure and - administer the required IT resources such as database servers, data storage - servers, networks, cloud instances, [Globus](https://globus.org) endpoints, etc. - Data engineers can provide general solutions such as web hosting, data publishing, - interfaces, exports and imports. +Work with the data scientists to support the data pipeline. +They rely on their understanding of the DataJoint data model to configure and +administer the required IT resources such as database servers, data storage +servers, networks, cloud instances, [Globus](https://globus.org) endpoints, etc. +Data engineers can provide general solutions such as web hosting, data publishing, +interfaces, exports and imports. - The System Administration section of this tutorial contains materials helpful in - accomplishing these tasks. +The System Administration section of this tutorial contains materials helpful in +accomplishing these tasks. DataJoint is designed to delineate a clean boundary between **data science** and **data engineering**. From de1fe189f1a172a4a3ed872a581796f9e317c73e Mon Sep 17 00:00:00 2001 From: kushalbakshi Date: Thu, 23 Jan 2025 10:42:47 -0500 Subject: [PATCH 03/49] Various updates throughout docs --- docs/src/concepts/data-model.md | 8 +-- docs/src/design/integrity.md | 2 +- docs/src/design/tables/blobs.md | 2 +- docs/src/design/tables/customtype.md | 81 ++++++++++++++++++++++- docs/src/design/tables/indexes.md | 98 +++++++++++++++++++++++++++- docs/src/publish-data.md | 4 +- 6 files changed, 185 insertions(+), 10 deletions(-) diff --git a/docs/src/concepts/data-model.md b/docs/src/concepts/data-model.md index ce9bf311d..65fdf991d 100644 --- a/docs/src/concepts/data-model.md +++ b/docs/src/concepts/data-model.md @@ -120,10 +120,10 @@ storing large contiguous data objects. 
DataJoint comprises: -- a schema [definition](../design/tables/declare.md) language -- a data [manipulation](../manipulation/index.md) language -- a data [query](../query/principles.md) language -- a [diagramming](../design/diagrams.md) notation for visualizing relationships between ++ a schema [definition](../design/tables/declare.md) language ++ a data [manipulation](../manipulation/index.md) language ++ a data [query](../query/principles.md) language ++ a [diagramming](../design/diagrams.md) notation for visualizing relationships between modeled entities The key refinement of DataJoint over other relational data models and their diff --git a/docs/src/design/integrity.md b/docs/src/design/integrity.md index 56416e4d7..e24ff550c 100644 --- a/docs/src/design/integrity.md +++ b/docs/src/design/integrity.md @@ -1,6 +1,6 @@ # Data Integrity -The term **data integrity** describes guarantees made by the data management process +The term **data integrity** describes guarantees made by the data management process that prevent errors and corruption in data due to technical failures and human errors arising in the course of continuous use by multiple agents. DataJoint pipelines respect the following forms of data integrity: **entity diff --git a/docs/src/design/tables/blobs.md b/docs/src/design/tables/blobs.md index 55cc0faff..9f73d54d4 100644 --- a/docs/src/design/tables/blobs.md +++ b/docs/src/design/tables/blobs.md @@ -1,4 +1,4 @@ -# Overview +# Blobs DataJoint provides functionality for serializing and deserializing complex data types into binary blobs for efficient storage and compatibility with MATLAB's mYm diff --git a/docs/src/design/tables/customtype.md b/docs/src/design/tables/customtype.md index 76847983e..823dd987c 100644 --- a/docs/src/design/tables/customtype.md +++ b/docs/src/design/tables/customtype.md @@ -1 +1,80 @@ -# Work in progress +# Custom Types + +In modern scientific research, data pipelines often involve complex workflows that +generate diverse data types. From high-dimensional imaging data to machine learning +models, these data types frequently exceed the basic representations supported by +traditional relational databases. For example: + ++ A lab working on neural connectivity might use graph objects to represent brain + networks. ++ Researchers processing raw imaging data might store custom objects for pre-processing + configurations. ++ Computational biologists might store fitted machine learning models or parameter + objects for downstream predictions. + +To handle these diverse needs, DataJoint provides the `dj.AttributeAdapter` method. It +enables researchers to store and retrieve complex, non-standard data types—like Python +objects or data structures—in a relational database while maintaining the +reproducibility, modularity, and query capabilities required for scientific workflows. + +## Uses in Scientific Research + +Imagine a neuroscience lab studying neural connectivity. Researchers might generate +graphs (e.g., networkx.Graph) to represent connections between brain regions, where: + ++ Nodes are brain regions. ++ Edges represent connections weighted by signal strength or another metric. + +Storing these graph objects in a database alongside other experimental data (e.g., +subject metadata, imaging parameters) ensures: + +1. Centralized Data Management: All experimental data and analysis results are stored + together for easy access and querying. +2. 
Reproducibility: The exact graph objects used in analysis can be retrieved later for + validation or further exploration. +3. Scalability: Graph data can be integrated into workflows for larger datasets or + across experiments. + +However, since graphs are not natively supported by relational databases, here’s where +`dj.AttributeAdapter` becomes essential. It allows researchers to define custom logic for +serializing graphs (e.g., as edge lists) and deserializing them back into Python +objects, bridging the gap between advanced data types and the database. + +### Example: Storing Graphs in DataJoint + +To store a networkx.Graph object in a DataJoint table, researchers can define a custom +attribute type in a datajoint table class: + +```python +import datajoint as dj + +class GraphAdapter(dj.AttributeAdapter): + + attribute_type = 'longblob' # this is how the attribute will be declared + + def put(self, obj): + # convert the nx.Graph object into an edge list + assert isinstance(obj, nx.Graph) + return list(obj.edges) + + def get(self, value): + # convert edge list back into an nx.Graph + return nx.Graph(value) + + +# instantiate for use as a datajoint type +graph = GraphAdapter() + + +# define a table with a graph attribute +schema = dj.schema('test_graphs') + + +@schema +class Connectivity(dj.Manual): + definition = """ + conn_id : int + --- + conn_graph = null : # a networkx.Graph object + """ +``` diff --git a/docs/src/design/tables/indexes.md b/docs/src/design/tables/indexes.md index 76847983e..8c0b53f15 100644 --- a/docs/src/design/tables/indexes.md +++ b/docs/src/design/tables/indexes.md @@ -1 +1,97 @@ -# Work in progress +# Indexes + +Table indexes are data structures that allow fast lookups by an indexed attribute or +combination of attributes. + +In DataJoint, indexes are created by one of the three mechanisms: + +1. Primary key +2. Foreign key +3. Explicitly defined indexes + +The first two mechanisms are obligatory. Every table has a primary key, which serves as +an unique index. Therefore, restrictions by a primary key are very fast. Foreign keys +create additional indexes unless a suitable index already exists. + +## Indexes for single primary key tables + +Let’s say a mouse in the lab has a lab-specific ID but it also has a separate id issued +by the animal facility. + +```python +@schema +class Mouse(dj.Manual): + definition = """ + mouse_id : int # lab-specific ID + --- + tag_id : int # animal facility ID + """ +``` + +In this case, searching for a mouse by `mouse_id` is much faster than by `tag_id` +because `mouse_id` is a primary key, and is therefore indexed. + +To make searches faster on fields other than the primary key or a foreign key, you can +add a secondary index explicitly. + +Regular indexes are declared as `index(attr1, ..., attrN)` on a separate line anywhere in +the table declration (below the primary key divide). + +Indexes can be declared with unique constraint as `unique index (attr1, ..., attrN)`. + +Let’s redeclare the table with a unique index on `tag_id`. + +```python +@schema +class Mouse(dj.Manual): + definition = """ + mouse_id : int # lab-specific ID + --- + tag_id : int # animal facility ID + unique index (tag_id) + """ +``` +Now, searches with `mouse_id` and `tag_id` are similarly fast. + +## Indexes for tables with multiple primary keys + +Let’s now imagine that rats in a lab are identified by the combination of `lab_name` and +`rat_id` in a table `Rat`. 
+ +```python +@schema +class Rat(dj.Manual): + definition = """ + lab_name : char(16) + rat_id : int unsigned # lab-specific ID + --- + date_of_birth = null : date + """ +``` +Note that despite the fact that `rat_id` is in the index, searches by `rat_id` alone are not +helped by the index because it is not first in the index. This is similar to searching for +a word in a dictionary that orders words alphabetically. Searching by the first letters +of a word is easy but searching by the last few letters of a word requires scanning the +whole dictionary. + +In this table, the primary key is a unique index on the combination `(lab_name, rat_id)`. +Therefore searches on these attributes or on `lab_name` alone are fast. But this index +cannot help searches on `rat_id` alone. Similarly, searing by `date_of_birth` requires a +full-table scan and is inefficient. + +To speed up searches by the `rat_id` and `date_of_birth`, we can explicit indexes to +`Rat`: + +```python +@schema +class Rat2(dj.Manual): + definition = """ + lab_name : char(16) + rat_id : int unsigned # lab-specific ID + --- + date_of_birth = null : date + + index(rat_id) + index(date_of_birth) + """ +``` diff --git a/docs/src/publish-data.md b/docs/src/publish-data.md index 522d5bc35..83471cea1 100644 --- a/docs/src/publish-data.md +++ b/docs/src/publish-data.md @@ -23,7 +23,7 @@ populated DataJoint pipeline. One example of publishing a DataJoint pipeline as a docker container is > Sinz, F., Ecker, A.S., Fahey, P., Walker, E., Cobos, E., Froudarakis, E., Yatsenko, D., Pitkow, Z., Reimer, J. and Tolias, A., 2018. Stimulus domain transfer in recurrent models for large scale cortical population prediction on video. In Advances in Neural Information Processing Systems (pp. 7198-7209). https://www.biorxiv.org/content/early/2018/10/25/452672 -The code and the data can be found at https://github.com/sinzlab/Sinz2018_NIPS +The code and the data can be found at [https://github.com/sinzlab/Sinz2018_NIPS](https://github.com/sinzlab/Sinz2018_NIPS). ## Exporting into a collection of files @@ -31,4 +31,4 @@ Another option for publishing and archiving data is to export the data from the DataJoint pipeline into a collection of files. DataJoint provides features for exporting and importing sections of the pipeline. Several ongoing projects are implementing the capability to export from DataJoint -pipelines into [Neurodata Without Borders](https://www.nwb.org/) files. +pipelines into [Neurodata Without Borders](https://www.nwb.org/) files. From f9aeb43e705597e26f7860bb5f056c9a8b9e8683 Mon Sep 17 00:00:00 2001 From: kushalbakshi Date: Thu, 13 Feb 2025 16:15:12 -0500 Subject: [PATCH 04/49] Small fixes for web rendering --- docs/src/client/stores.md | 1 - docs/src/faq.md | 4 +- docs/src/quick-start.md | 9 ++++ docs/src/tutorials/dj-top.ipynb | 74 +++++++++++++-------------------- docs/src/tutorials/json.ipynb | 2 +- 5 files changed, 40 insertions(+), 50 deletions(-) delete mode 100644 docs/src/client/stores.md diff --git a/docs/src/client/stores.md b/docs/src/client/stores.md deleted file mode 100644 index 76847983e..000000000 --- a/docs/src/client/stores.md +++ /dev/null @@ -1 +0,0 @@ -# Work in progress diff --git a/docs/src/faq.md b/docs/src/faq.md index d22e64241..b86692979 100644 --- a/docs/src/faq.md +++ b/docs/src/faq.md @@ -4,8 +4,8 @@ It is common to enter data during experiments using a graphical user interface. -1. 
The [DataJoint Works](https://works.datajoint.com) platform is a web-based, fully -managed service to host and execute data pipelines. +1. The [DataJoint Works](https://works.datajoint.com) platform is a web-based, + end-to-end platform to host and execute data pipelines. 2. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open source project for data entry but is no longer actively maintained. diff --git a/docs/src/quick-start.md b/docs/src/quick-start.md index 65a5df433..7ff26a8d6 100644 --- a/docs/src/quick-start.md +++ b/docs/src/quick-start.md @@ -1,5 +1,14 @@ # Quick Start Guide +## Tutorials + +The easiest way to get started is through the [DataJoint +Tutorials](https://github.com/datajoint/datajoint-tutorials). These tutorials are +configured to run using [GitHub Codespaces](https://github.com/features/codespaces) +where the full environment including the database is already set up. + +Advanced users can install DataJoint locally. Please see the installation instructions below. + ## Installation First, please [install Python](https://www.python.org/downloads/) version diff --git a/docs/src/tutorials/dj-top.ipynb b/docs/src/tutorials/dj-top.ipynb index 4e0604af0..bbfe59f11 100644 --- a/docs/src/tutorials/dj-top.ipynb +++ b/docs/src/tutorials/dj-top.ipynb @@ -4,8 +4,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Using the `dj.Top` restriction\n", - "\n", + "# Using the dj.Top restriction" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://datajoint.com/docs/core/glossary/#data-pipeline).\n", "\n", "Now let's start by importing the `datajoint` client." 
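For readers following the quick start locally rather than in Codespaces, the connection that the notebook cells below rely on is typically configured once, roughly as sketched here; the host, user, and password values are placeholders, not required settings:

```python
import datajoint as dj

dj.config["database.host"] = "127.0.0.1"   # placeholder address of your MySQL server
dj.config["database.user"] = "root"        # placeholder user name
dj.config["database.password"] = "simple"  # placeholder password

dj.config.save_local()  # optionally persist settings to ./dj_local_conf.json
dj.conn()               # establish the connection; raises an error if it cannot connect
```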
@@ -31,6 +36,13 @@ "schema = dj.Schema('university')" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Table Definition" + ] + }, { "cell_type": "code", "execution_count": 2, @@ -87,50 +99,6 @@ " \"\"\"" ] }, - { - "cell_type": "code", - "execution_count": 59, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[2024-12-26 12:03:01,311][INFO]: Table altered\n" - ] - } - ], - "source": [ - "StudentMajor.definition = \"\"\"\n", - "-> Student\n", - "---\n", - "-> Department\n", - "declare_date : date # when student declared her major\n", - "\"\"\"\n", - "StudentMajor.alter()" - ] - }, - { - "cell_type": "code", - "execution_count": 60, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "-> Student\n", - "---\n", - "-> Department\n", - "declare_date : date # when student declared her major\n", - "\n" - ] - } - ], - "source": [ - "print(StudentMajor.describe())" - ] - }, { "cell_type": "code", "execution_count": 5, @@ -207,6 +175,13 @@ " \"\"\"" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Insert" + ] + }, { "cell_type": "code", "execution_count": 6, @@ -389,6 +364,13 @@ " for key, grade in zip(grade_keys, random.choices(grades, k=len(grade_keys))))" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# dj.Top Restriction" + ] + }, { "cell_type": "code", "execution_count": 29, diff --git a/docs/src/tutorials/json.ipynb b/docs/src/tutorials/json.ipynb index a33c4b6c5..f39b43e33 100644 --- a/docs/src/tutorials/json.ipynb +++ b/docs/src/tutorials/json.ipynb @@ -6,7 +6,7 @@ "id": "7fe24127-c0d0-4ff8-96b4-6ab0d9307e73", "metadata": {}, "source": [ - "# Using the `json` type" + "# Using the json type" ] }, { From 5f074ab133cc53bd1ebe768ba20eece926bb5d42 Mon Sep 17 00:00:00 2001 From: kushalbakshi Date: Mon, 17 Feb 2025 15:36:02 -0500 Subject: [PATCH 05/49] Website formatting fix --- docs/src/sysadmin/external-store.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/src/sysadmin/external-store.md b/docs/src/sysadmin/external-store.md index 301270043..8215f4084 100644 --- a/docs/src/sysadmin/external-store.md +++ b/docs/src/sysadmin/external-store.md @@ -255,19 +255,19 @@ to upgrade to DataJoint v0.12, the following process should be followed: 5. Migrate external tracking tables for each schema to use the new format. For instance in Python: - ```python - import datajoint.migrate as migrate - db_schema_name='schema_1' - external_store='raw' - migrate.migrate_dj011_external_blob_storage_to_dj012(db_schema_name, external_store) - ``` + ```python + import datajoint.migrate as migrate + db_schema_name='schema_1' + external_store='raw' + migrate.migrate_dj011_external_blob_storage_to_dj012(db_schema_name, external_store) + ``` 6. Verify pipeline functionality after this process has completed. 
For instance in Python: - ```python - x = myschema.TableWithExternal.fetch('https://github.com/external_field', limit=1)[0] - ``` + ```python + x = myschema.TableWithExternal.fetch('https://github.com/external_field', limit=1)[0] + ``` Note: This migration function is provided on a best-effort basis, and will convert the external tracking tables into a format which is compatible From e71418df4e740694ecf1608d8748dbaf2ba4428e Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Thu, 27 Feb 2025 07:06:28 -0600 Subject: [PATCH 06/49] document autopopulate.make logic --- datajoint/autopopulate.py | 76 +++++++++++++++++++++++++++++++++++---- 1 file changed, 70 insertions(+), 6 deletions(-) diff --git a/datajoint/autopopulate.py b/datajoint/autopopulate.py index 6d72b7aa7..22053d5cd 100644 --- a/datajoint/autopopulate.py +++ b/datajoint/autopopulate.py @@ -93,13 +93,75 @@ def _rename_attributes(table, props): def make(self, key): """ - Derived classes must implement method `make` that fetches data from tables - above them in the dependency hierarchy, restricting by the given key, - computes secondary attributes, and inserts the new tuples into self. + This method must be implemented by derived classes to perform automated computation. + The method must implement the following three steps: + + 1. Fetch data from tables above in the dependency hierarchy, restricted by the given key. + 2. Compute secondary attributes based on the fetched data. + 3. Insert the new tuples into the current table. + + The method can be implemented either as: + (a) Regular method: All three steps are performed in a single database transaction. + The method must return None. + (b) Generator method: + The make method is split into three functions: + - `make_fetch`: Fetches data from the parent tables. + - `make_compute`: Computes secondary attributes based on the fetched data. + - `make_insert`: Inserts the computed data into the current table. + + Then populate logic is executes as follows: + + + fetched_data1 = self.make_fetch(key) + computed_result = self.make_compute(key, *fetched_data1) + begin transaction: + fetched_data2 = self.make_fetch(key) + if fetched_data1 != fetched_data2: + cancel transaction + else: + self.make_insert(key, *computed_result) + commit_transaction + + + Importantly, the output of make_fetch is a tuple that serves as the input into `make_compute`. + The output of `make_compute` is a tuple that serves as the input into `make_insert`. + + The functionality must be strictly divided between these three methods: + - All database queries must be completed in `make_fetch`. + - All computation must be completed in `make_compute`. + - All database inserts must be completed in `make_insert`. + + DataJoint may programmatically enforce this separation in the future. + + :param key: The primary key value used to restrict the data fetching. + :raises NotImplementedError: If the derived class does not implement the required methods. 
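To make the three-part contract described above concrete, here is a minimal sketch of a computed table written in the generator style. The schema name and the `Image`/`Segmentation` tables are invented for illustration, and the thresholding step stands in for a real computation; only the division of labor between `make_fetch`, `make_compute`, and `make_insert` follows the documented logic.

```python
import datajoint as dj

schema = dj.Schema("example_pipeline")  # hypothetical schema name


@schema
class Image(dj.Manual):
    definition = """
    image_id : int
    ---
    frame : longblob        # 2D pixel array
    """


@schema
class Segmentation(dj.Computed):
    definition = """
    -> Image
    ---
    mask : longblob         # binary mask derived from the frame
    """

    def make_fetch(self, key):
        # all database reads happen here, restricted by key; returns a tuple
        return ((Image & key).fetch1("frame"),)

    def make_compute(self, key, frame):
        # pure computation, no database access; returns a tuple for make_insert
        return (frame > frame.mean(),)

    def make_insert(self, key, mask):
        # all database writes happen here
        self.insert1(dict(key, mask=mask))
```

With this split, `populate` can run the compute step outside the insert transaction and repeat the fetch inside it to confirm the inputs are unchanged before inserting — the check that later patches in this series refine.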
""" - raise NotImplementedError( - "Subclasses of AutoPopulate must implement the method `make`" - ) + + if not ( + hasattr(self, "make_fetch") + and hasattr(self, "make_insert") + and hasattr(self, "make_compute") + ): + # user must implement `make` + raise NotImplementedError( + "Subclasses of AutoPopulate must implement the method `make` or (`make_fetch` + `make_compute` + `make_insert`)" + ) + + # User has implemented `_fetch`, `_compute`, and `_insert` methods instead + + # Step 1: Fetch data from parent tables + fetched_data = self.make_fetch(key) # fetched_data is a tuple + computed_result = yield fetched_data # passed as input into make_compute + + # Step 2: If computed result is not passed in, compute the result + if computed_result is None: + # this is only executed in the first invocation + computed_result = self.make_compute(key, *fetched_data) + yield computed_result # this is passed to the second invocation of make + + # Step 3: Insert the computed result into the current table. + self.make_insert(key, *computed_result) + yield @property def target(self): @@ -347,6 +409,8 @@ def _populate1( ] ): # rollback due to referential integrity fail self.connection.cancel_transaction() + logger.warning( + f"Referential integrity failed for {key} -> {self.target.full_table_name}") return False gen.send(computed_result) # insert From 3726e6f065e46e823c27abe91a4d540cc04a67ed Mon Sep 17 00:00:00 2001 From: github-actions Date: Thu, 17 Apr 2025 17:01:35 +0000 Subject: [PATCH 07/49] Update version.py to 0.14.4 --- datajoint/version.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datajoint/version.py b/datajoint/version.py index c980ad0d0..3f48dc939 100644 --- a/datajoint/version.py +++ b/datajoint/version.py @@ -1,6 +1,6 @@ # version bump auto managed by Github Actions: # label_prs.yaml(prep), release.yaml(bump), post_release.yaml(edit) # manually set this version will be eventually overwritten by the above actions -__version__ = "0.14.3" +__version__ = "0.14.4" assert len(__version__) <= 10 # The log table limits version to the 10 characters From b737b41c7890389b5f4f9a919cc6d53cbb555650 Mon Sep 17 00:00:00 2001 From: github-actions Date: Thu, 17 Apr 2025 17:01:35 +0000 Subject: [PATCH 08/49] Update README.md badge to v0.14.4 --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index fe677fd95..8a36818e6 100644 --- a/README.md +++ b/README.md @@ -30,8 +30,8 @@ Since Release - - commit since last release + + commit since last release From 31b04e3aa0ac47592d7d6c4b7cb67c589c6ea2b8 Mon Sep 17 00:00:00 2001 From: Drew Yang Date: Sat, 3 May 2025 13:49:04 -0500 Subject: [PATCH 09/49] chore: yambottle->drewyangdev --- .github/workflows/post_draft_release_published.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/post_draft_release_published.yaml b/.github/workflows/post_draft_release_published.yaml index 3daac2f5d..20160e62b 100644 --- a/.github/workflows/post_draft_release_published.yaml +++ b/.github/workflows/post_draft_release_published.yaml @@ -132,7 +132,7 @@ jobs: --body "This PR updates \`version.py\` to match the latest release: ${{ github.event.release.name }}" \ --base master \ --head ${{ env.BRANCH_NAME }} \ - --reviewer dimitri-yatsenko,yambottle,ttngu207 + --reviewer dimitri-yatsenko,drewyangdev,ttngu207 - name: Post release notification to Slack if: ${{ env.TEST_PYPI == 'false' }} uses: slackapi/slack-github-action@v2.0.0 From 080bb44bd55cbc49d2af63db0dad313e5429befd Mon 
Sep 17 00:00:00 2001 From: Drew Yang Date: Sat, 3 May 2025 13:54:20 -0500 Subject: [PATCH 10/49] docs: fix typo --- docs/src/design/tables/indexes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/design/tables/indexes.md b/docs/src/design/tables/indexes.md index 8c0b53f15..fcd1b5702 100644 --- a/docs/src/design/tables/indexes.md +++ b/docs/src/design/tables/indexes.md @@ -35,7 +35,7 @@ To make searches faster on fields other than the primary key or a foreign key, y add a secondary index explicitly. Regular indexes are declared as `index(attr1, ..., attrN)` on a separate line anywhere in -the table declration (below the primary key divide). +the table declaration (below the primary key divide). Indexes can be declared with unique constraint as `unique index (attr1, ..., attrN)`. From fb0ee7db86cf28b03f9b95bc01c0db33cebd7b0d Mon Sep 17 00:00:00 2001 From: MilagrosMarin Date: Thu, 15 May 2025 14:00:09 +0100 Subject: [PATCH 11/49] fix: Update home URL from `datajoint.com/docs` to `docs.datajoint.com` add exclamation mark removed previously --- README.md | 37 +++-- datajoint/diagram.py | 2 +- docs/README.md | 5 +- docs/src/faq.md | 2 +- docs/src/index.md | 4 +- docs/src/query/operators.md | 2 +- docs/src/tutorials/dj-top.ipynb | 263 ++++++++++++++++++-------------- docs/src/tutorials/json.ipynb | 19 +-- pyproject.toml | 6 +- 9 files changed, 187 insertions(+), 153 deletions(-) diff --git a/README.md b/README.md index 8a36818e6..eecee41a0 100644 --- a/README.md +++ b/README.md @@ -54,7 +54,7 @@ Doc Status - + doc status @@ -68,12 +68,12 @@ - Developer Chat - - + Developer Chat + + datajoint slack - - + + License @@ -84,21 +84,20 @@ - Citation - - - bioRxiv - + Citation + + + bioRxiv +
zenodo - - + + - DataJoint for Python is a framework for scientific workflow management based on relational principles. DataJoint is built on the foundation of the relational data model and prescribes a consistent method for organizing, populating, computing, and @@ -110,7 +109,7 @@ volumes of data streaming from regular experiments. Starting in 2011, DataJoint been available as an open-source project adopted by other labs and improved through contributions from several developers. Presently, the primary developer of DataJoint open-source software is the company -DataJoint (https://datajoint.com). +DataJoint (). ## Data Pipeline Example @@ -132,13 +131,13 @@ DataJoint (https://datajoint.com). pip install datajoint ``` -- [Documentation & Tutorials](https://datajoint.com/docs/core/datajoint-python/) +- [Documentation & Tutorials](https://docs.datajoint.com/core/datajoint-python/) - [Interactive Tutorials](https://github.com/datajoint/datajoint-tutorials) on GitHub Codespaces -- [DataJoint Elements](https://datajoint.com/docs/elements/) - Catalog of example pipelines for neuroscience experiments +- [DataJoint Elements](https://docs.datajoint.com/elements/) - Catalog of example pipelines for neuroscience experiments - Contribute - - [Contribution Guidelines](https://datajoint.com/docs/about/contribute/) + - [Contribution Guidelines](https://docs.datajoint.com/about/contribute/) - - [Developer Guide](https://datajoint.com/docs/core/datajoint-python/latest/develop/) + - [Developer Guide](https://docs.datajoint.com/core/datajoint-python/latest/develop/) diff --git a/datajoint/diagram.py b/datajoint/diagram.py index cb3daf4d3..aa505fb54 100644 --- a/datajoint/diagram.py +++ b/datajoint/diagram.py @@ -35,7 +35,7 @@ class Diagram: Entity relationship diagram, currently disabled due to the lack of required packages: matplotlib and pygraphviz. To enable Diagram feature, please install both matplotlib and pygraphviz. For instructions on how to install - these two packages, refer to https://datajoint.com/docs/core/datajoint-python/0.14/client/install/ + these two packages, refer to https://docs.datajoint.com/core/datajoint-python/0.14/client/install/ """ def __init__(self, *args, **kwargs): diff --git a/docs/README.md b/docs/README.md index df42fe764..4aecf0a69 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,6 +1,6 @@ # Contribute to DataJoint Documentation -This is the home for DataJoint software documentation as hosted at https://datajoint.com/docs/core/datajoint-python/latest/ +This is the home for DataJoint software documentation as hosted at https://docs.datajoint.com/core/datajoint-python/latest/. ## VSCode Linter Extensions and Settings @@ -89,9 +89,8 @@ INFO - Doc file 'index.md' contains an unrecognized relative link './develop - `/docs/core/datajoint-python/` is the actual docs site hosted by datajoint/datajoint-python's github pages - `/docs/elements/element-*/` is the actual docs site hosted by each element's github pages - ```log WARNING - Doc file 'query/operators.md' contains a link '../../../images/concepts-operators-restriction.png', but the target '../../images/concepts-operators-restriction.png' is not found among documentation files. ``` -- We use Github Pages to host our docs, the image references needs to follow the mkdocs's build directory structure, under `site/` directory once you run mkdocs. 
\ No newline at end of file +- We use Github Pages to host our docs, the image references needs to follow the mkdocs's build directory structure, under `site/` directory once you run mkdocs. diff --git a/docs/src/faq.md b/docs/src/faq.md index 06ebbc2db..1de69bb31 100644 --- a/docs/src/faq.md +++ b/docs/src/faq.md @@ -12,7 +12,7 @@ source project for data entry but is no longer actively maintained. ## Does DataJoint support other programming languages? -DataJoint [Python](https://datajoint.com/docs/core/datajoint-python/) is the most +DataJoint [Python](https://docs.datajoint.com/core/datajoint-python/) is the most up-to-date version and all future development will focus on the Python API. The [Matlab](https://datajoint.com/docs/core/datajoint-matlab/) API was actively developed through 2023. Previous projects implemented some DataJoint features in diff --git a/docs/src/index.md b/docs/src/index.md index 8c5f8fcb1..6e3bf2a2d 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -36,9 +36,9 @@ Presently, the primary developer of DataJoint open-source software is the compan - [Interactive Tutorials](https://github.com/datajoint/datajoint-tutorials){:target="_blank"} on GitHub Codespaces -- [DataJoint Elements](https://datajoint.com/docs/elements/) - Catalog of example pipelines for neuroscience experiments +- [DataJoint Elements](https://docs.datajoint.com/elements/) - Catalog of example pipelines for neuroscience experiments - Contribute - [Development Environment](./develop) - - [Guidelines](https://datajoint.com/docs/community/contribute/) + - [Guidelines](https://docs.datajoint.com/about/contribute/) diff --git a/docs/src/query/operators.md b/docs/src/query/operators.md index 39f2488dd..ee3549f35 100644 --- a/docs/src/query/operators.md +++ b/docs/src/query/operators.md @@ -392,4 +392,4 @@ dj.U().aggr(Session, n="max(session)") # (3) `dj.U()`, as shown in the last example above, is often useful for integer IDs. For an example of this process, see the source code for -[Element Array Electrophysiology's `insert_new_params`](https://datajoint.com/docs/elements/element-array-ephys/latest/api/element_array_ephys/ephys_acute/#element_array_ephys.ephys_acute.ClusteringParamSet.insert_new_params). +[Element Array Electrophysiology's `insert_new_params`](https://docs.datajoint.com/elements/element-array-ephys/latest/api/element_array_ephys/ephys_acute/#element_array_ephys.ephys_acute.ClusteringParamSet.insert_new_params). diff --git a/docs/src/tutorials/dj-top.ipynb b/docs/src/tutorials/dj-top.ipynb index bbfe59f11..7ed9f97cc 100644 --- a/docs/src/tutorials/dj-top.ipynb +++ b/docs/src/tutorials/dj-top.ipynb @@ -11,7 +11,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://datajoint.com/docs/core/glossary/#data-pipeline).\n", + "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/core/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", "\n", "Now let's start by importing the `datajoint` client." ] @@ -32,8 +32,9 @@ ], "source": [ "import datajoint as dj\n", + "\n", "dj.config[\"database.host\"] = \"127.0.0.1\"\n", - "schema = dj.Schema('university')" + "schema = dj.Schema(\"university\")" ] }, { @@ -114,7 +115,8 @@ " course_name : varchar(200) # e.g. 
\"Neurobiology of Sensation and Movement.\"\n", " credits : decimal(3,1) # number of credits earned by completing the course\n", " \"\"\"\n", - " \n", + "\n", + "\n", "@schema\n", "class Term(dj.Manual):\n", " definition = \"\"\"\n", @@ -122,6 +124,7 @@ " term : enum('Spring', 'Summer', 'Fall')\n", " \"\"\"\n", "\n", + "\n", "@schema\n", "class Section(dj.Manual):\n", " definition = \"\"\"\n", @@ -131,13 +134,15 @@ " ---\n", " auditorium : varchar(12)\n", " \"\"\"\n", - " \n", + "\n", + "\n", "@schema\n", "class CurrentTerm(dj.Manual):\n", " definition = \"\"\"\n", " -> Term\n", " \"\"\"\n", "\n", + "\n", "@schema\n", "class Enroll(dj.Manual):\n", " definition = \"\"\"\n", @@ -145,6 +150,7 @@ " -> Section\n", " \"\"\"\n", "\n", + "\n", "@schema\n", "class LetterGrade(dj.Lookup):\n", " definition = \"\"\"\n", @@ -153,18 +159,19 @@ " points : decimal(3,2)\n", " \"\"\"\n", " contents = [\n", - " ['A', 4.00],\n", - " ['A-', 3.67],\n", - " ['B+', 3.33],\n", - " ['B', 3.00],\n", - " ['B-', 2.67],\n", - " ['C+', 2.33],\n", - " ['C', 2.00],\n", - " ['C-', 1.67],\n", - " ['D+', 1.33],\n", - " ['D', 1.00],\n", - " ['F', 0.00]\n", - " ]\n", + " [\"A\", 4.00],\n", + " [\"A-\", 3.67],\n", + " [\"B+\", 3.33],\n", + " [\"B\", 3.00],\n", + " [\"B-\", 2.67],\n", + " [\"C+\", 2.33],\n", + " [\"C\", 2.00],\n", + " [\"C-\", 1.67],\n", + " [\"D+\", 1.33],\n", + " [\"D\", 1.00],\n", + " [\"F\", 0.00],\n", + " ]\n", + "\n", "\n", "@schema\n", "class Grade(dj.Manual):\n", @@ -192,6 +199,7 @@ "import faker\n", "import random\n", "import datetime\n", + "\n", "fake = faker.Faker()" ] }, @@ -202,14 +210,14 @@ "outputs": [], "source": [ "def yield_students():\n", - " fake_name = {'F': fake.name_female, 'M': fake.name_male}\n", + " fake_name = {\"F\": fake.name_female, \"M\": fake.name_male}\n", " while True: # ignore invalid values\n", " try:\n", - " sex = random.choice(('F', 'M'))\n", - " first_name, last_name = fake_name[sex]().split(' ')[:2]\n", - " street_address, city = fake.address().split('\\n')\n", - " city, state = city.split(', ')\n", - " state, zipcode = state.split(' ') \n", + " sex = random.choice((\"F\", \"M\"))\n", + " first_name, last_name = fake_name[sex]().split(\" \")[:2]\n", + " street_address, city = fake.address().split(\"\\n\")\n", + " city, state = city.split(\", \")\n", + " state, zipcode = state.split(\" \")\n", " except ValueError:\n", " continue\n", " else:\n", @@ -222,8 +230,10 @@ " home_state=state,\n", " home_zip=zipcode,\n", " date_of_birth=str(\n", - " fake.date_time_between(start_date=\"-35y\", end_date=\"-15y\").date()),\n", - " home_phone = fake.phone_number()[:20])" + " fake.date_time_between(start_date=\"-35y\", end_date=\"-15y\").date()\n", + " ),\n", + " home_phone=fake.phone_number()[:20],\n", + " )" ] }, { @@ -232,95 +242,106 @@ "metadata": {}, "outputs": [], "source": [ - "Student.insert(\n", - " dict(k, student_id=i) for i, k in zip(range(100,300), yield_students()))\n", + "Student.insert(dict(k, student_id=i) for i, k in zip(range(100, 300), yield_students()))\n", "\n", "Department.insert(\n", - " dict(dept=dept, \n", - " dept_name=name, \n", - " dept_address=fake.address(), \n", - " dept_phone=fake.phone_number()[:20])\n", + " dict(\n", + " dept=dept,\n", + " dept_name=name,\n", + " dept_address=fake.address(),\n", + " dept_phone=fake.phone_number()[:20],\n", + " )\n", " for dept, name in [\n", " [\"CS\", \"Computer Science\"],\n", " [\"BIOL\", \"Life Sciences\"],\n", " [\"PHYS\", \"Physics\"],\n", - " [\"MATH\", \"Mathematics\"]])\n", + " [\"MATH\", \"Mathematics\"],\n", 
+ " ]\n", + ")\n", "\n", - "StudentMajor.insert({**s, **d, \n", - " 'declare_date':fake.date_between(start_date=datetime.date(1999,1,1))}\n", - " for s, d in zip(Student.fetch('https://github.com/KEY'), random.choices(Department.fetch('https://github.com/KEY'), k=len(Student())))\n", - " if random.random() < 0.75)\n", + "StudentMajor.insert(\n", + " {**s, **d, \"declare_date\": fake.date_between(start_date=datetime.date(1999, 1, 1))}\n", + " for s, d in zip(\n", + " Student.fetch(\"KEY\"), random.choices(Department.fetch(\"KEY\"), k=len(Student()))\n", + " )\n", + " if random.random() < 0.75\n", + ")\n", "\n", "# from https://www.utah.edu/\n", - "Course.insert([\n", - " ['BIOL', 1006, 'World of Dinosaurs', 3],\n", - " ['BIOL', 1010, 'Biology in the 21st Century', 3],\n", - " ['BIOL', 1030, 'Human Biology', 3],\n", - " ['BIOL', 1210, 'Principles of Biology', 4],\n", - " ['BIOL', 2010, 'Evolution & Diversity of Life', 3],\n", - " ['BIOL', 2020, 'Principles of Cell Biology', 3],\n", - " ['BIOL', 2021, 'Principles of Cell Science', 4],\n", - " ['BIOL', 2030, 'Principles of Genetics', 3],\n", - " ['BIOL', 2210, 'Human Genetics',3],\n", - " ['BIOL', 2325, 'Human Anatomy', 4],\n", - " ['BIOL', 2330, 'Plants & Society', 3],\n", - " ['BIOL', 2355, 'Field Botany', 2],\n", - " ['BIOL', 2420, 'Human Physiology', 4],\n", + "Course.insert(\n", + " [\n", + " [\"BIOL\", 1006, \"World of Dinosaurs\", 3],\n", + " [\"BIOL\", 1010, \"Biology in the 21st Century\", 3],\n", + " [\"BIOL\", 1030, \"Human Biology\", 3],\n", + " [\"BIOL\", 1210, \"Principles of Biology\", 4],\n", + " [\"BIOL\", 2010, \"Evolution & Diversity of Life\", 3],\n", + " [\"BIOL\", 2020, \"Principles of Cell Biology\", 3],\n", + " [\"BIOL\", 2021, \"Principles of Cell Science\", 4],\n", + " [\"BIOL\", 2030, \"Principles of Genetics\", 3],\n", + " [\"BIOL\", 2210, \"Human Genetics\", 3],\n", + " [\"BIOL\", 2325, \"Human Anatomy\", 4],\n", + " [\"BIOL\", 2330, \"Plants & Society\", 3],\n", + " [\"BIOL\", 2355, \"Field Botany\", 2],\n", + " [\"BIOL\", 2420, \"Human Physiology\", 4],\n", + " [\"PHYS\", 2040, \"Classcal Theoretical Physics II\", 4],\n", + " [\"PHYS\", 2060, \"Quantum Mechanics\", 3],\n", + " [\"PHYS\", 2100, \"General Relativity and Cosmology\", 3],\n", + " [\"PHYS\", 2140, \"Statistical Mechanics\", 4],\n", + " [\"PHYS\", 2210, \"Physics for Scientists and Engineers I\", 4],\n", + " [\"PHYS\", 2220, \"Physics for Scientists and Engineers II\", 4],\n", + " [\"PHYS\", 3210, \"Physics for Scientists I (Honors)\", 4],\n", + " [\"PHYS\", 3220, \"Physics for Scientists II (Honors)\", 4],\n", + " [\"MATH\", 1250, \"Calculus for AP Students I\", 4],\n", + " [\"MATH\", 1260, \"Calculus for AP Students II\", 4],\n", + " [\"MATH\", 1210, \"Calculus I\", 4],\n", + " [\"MATH\", 1220, \"Calculus II\", 4],\n", + " [\"MATH\", 2210, \"Calculus III\", 3],\n", + " [\"MATH\", 2270, \"Linear Algebra\", 4],\n", + " [\"MATH\", 2280, \"Introduction to Differential Equations\", 4],\n", + " [\"MATH\", 3210, \"Foundations of Analysis I\", 4],\n", + " [\"MATH\", 3220, \"Foundations of Analysis II\", 4],\n", + " [\"CS\", 1030, \"Foundations of Computer Science\", 3],\n", + " [\"CS\", 1410, \"Introduction to Object-Oriented Programming\", 4],\n", + " [\"CS\", 2420, \"Introduction to Algorithms & Data Structures\", 4],\n", + " [\"CS\", 2100, \"Discrete Structures\", 3],\n", + " [\"CS\", 3500, \"Software Practice\", 4],\n", + " [\"CS\", 3505, \"Software Practice II\", 3],\n", + " [\"CS\", 3810, \"Computer Organization\", 4],\n", + " 
[\"CS\", 4400, \"Computer Systems\", 4],\n", + " [\"CS\", 4150, \"Algorithms\", 3],\n", + " [\"CS\", 3100, \"Models of Computation\", 3],\n", + " [\"CS\", 3200, \"Introduction to Scientific Computing\", 3],\n", + " [\"CS\", 4000, \"Senior Capstone Project - Design Phase\", 3],\n", + " [\"CS\", 4500, \"Senior Capstone Project\", 3],\n", + " [\"CS\", 4940, \"Undergraduate Research\", 3],\n", + " [\"CS\", 4970, \"Computer Science Bachelors Thesis\", 3],\n", + " ]\n", + ")\n", "\n", - " ['PHYS', 2040, 'Classcal Theoretical Physics II', 4],\n", - " ['PHYS', 2060, 'Quantum Mechanics', 3],\n", - " ['PHYS', 2100, 'General Relativity and Cosmology', 3],\n", - " ['PHYS', 2140, 'Statistical Mechanics', 4],\n", - " \n", - " ['PHYS', 2210, 'Physics for Scientists and Engineers I', 4], \n", - " ['PHYS', 2220, 'Physics for Scientists and Engineers II', 4],\n", - " ['PHYS', 3210, 'Physics for Scientists I (Honors)', 4],\n", - " ['PHYS', 3220, 'Physics for Scientists II (Honors)', 4],\n", - " \n", - " ['MATH', 1250, 'Calculus for AP Students I', 4],\n", - " ['MATH', 1260, 'Calculus for AP Students II', 4],\n", - " ['MATH', 1210, 'Calculus I', 4],\n", - " ['MATH', 1220, 'Calculus II', 4],\n", - " ['MATH', 2210, 'Calculus III', 3],\n", - " \n", - " ['MATH', 2270, 'Linear Algebra', 4],\n", - " ['MATH', 2280, 'Introduction to Differential Equations', 4],\n", - " ['MATH', 3210, 'Foundations of Analysis I', 4],\n", - " ['MATH', 3220, 'Foundations of Analysis II', 4],\n", - " \n", - " ['CS', 1030, 'Foundations of Computer Science', 3],\n", - " ['CS', 1410, 'Introduction to Object-Oriented Programming', 4],\n", - " ['CS', 2420, 'Introduction to Algorithms & Data Structures', 4],\n", - " ['CS', 2100, 'Discrete Structures', 3],\n", - " ['CS', 3500, 'Software Practice', 4],\n", - " ['CS', 3505, 'Software Practice II', 3],\n", - " ['CS', 3810, 'Computer Organization', 4],\n", - " ['CS', 4400, 'Computer Systems', 4],\n", - " ['CS', 4150, 'Algorithms', 3],\n", - " ['CS', 3100, 'Models of Computation', 3],\n", - " ['CS', 3200, 'Introduction to Scientific Computing', 3],\n", - " ['CS', 4000, 'Senior Capstone Project - Design Phase', 3],\n", - " ['CS', 4500, 'Senior Capstone Project', 3],\n", - " ['CS', 4940, 'Undergraduate Research', 3],\n", - " ['CS', 4970, 'Computer Science Bachelor''s Thesis', 3]])\n", + "Term.insert(\n", + " dict(term_year=year, term=term)\n", + " for year in range(1999, 2019)\n", + " for term in [\"Spring\", \"Summer\", \"Fall\"]\n", + ")\n", "\n", - "Term.insert(dict(term_year=year, term=term) \n", - " for year in range(1999, 2019) \n", - " for term in ['Spring', 'Summer', 'Fall'])\n", + "Term().fetch(order_by=(\"term_year DESC\", \"term DESC\"), as_dict=True, limit=1)[0]\n", "\n", - "Term().fetch(order_by=('term_year DESC', 'term DESC'), as_dict=True, limit=1)[0]\n", + "CurrentTerm().insert1(\n", + " {**Term().fetch(order_by=(\"term_year DESC\", \"term DESC\"), as_dict=True, limit=1)[0]}\n", + ")\n", "\n", - "CurrentTerm().insert1({\n", - " **Term().fetch(order_by=('term_year DESC', 'term DESC'), as_dict=True, limit=1)[0]})\n", "\n", "def make_section(prob):\n", " for c in (Course * Term).proj():\n", - " for sec in 'abcd':\n", + " for sec in \"abcd\":\n", " if random.random() < prob:\n", " break\n", " yield {\n", - " **c, 'section': sec, \n", - " 'auditorium': random.choice('ABCDEF') + str(random.randint(1,100))} \n", + " **c,\n", + " \"section\": sec,\n", + " \"auditorium\": random.choice(\"ABCDEF\") + str(random.randint(1, 100)),\n", + " }\n", + "\n", "\n", "Section.insert(make_section(0.5))" ] 
@@ -339,29 +360,35 @@ } ], "source": [ - "# Enrollment \n", - "terms = Term().fetch('https://github.com/KEY')\n", + "# Enrollment\n", + "terms = Term().fetch(\"KEY\")\n", "quit_prob = 0.1\n", - "for student in tqdm(Student.fetch('https://github.com/KEY')):\n", + "for student in tqdm(Student.fetch(\"KEY\")):\n", " start_term = random.randrange(len(terms))\n", " for term in terms[start_term:]:\n", " if random.random() < quit_prob:\n", " break\n", " else:\n", - " sections = ((Section & term) - (Course & (Enroll & student))).fetch('https://github.com/KEY')\n", + " sections = ((Section & term) - (Course & (Enroll & student))).fetch(\"KEY\")\n", " if sections:\n", - " Enroll.insert({**student, **section} for section in \n", - " random.sample(sections, random.randrange(min(5, len(sections)))))\n", - " \n", + " Enroll.insert(\n", + " {**student, **section}\n", + " for section in random.sample(\n", + " sections, random.randrange(min(5, len(sections)))\n", + " )\n", + " )\n", + "\n", "# assign random grades\n", - "grades = LetterGrade.fetch('https://github.com/grade')\n", + "grades = LetterGrade.fetch(\"grade\")\n", "\n", - "grade_keys = Enroll.fetch('https://github.com/KEY')\n", + "grade_keys = Enroll.fetch(\"KEY\")\n", "random.shuffle(grade_keys)\n", - "grade_keys = grade_keys[:len(grade_keys)*9//10]\n", + "grade_keys = grade_keys[: len(grade_keys) * 9 // 10]\n", "\n", - "Grade.insert({**key, 'grade':grade} \n", - " for key, grade in zip(grade_keys, random.choices(grades, k=len(grade_keys))))" + "Grade.insert(\n", + " {**key, \"grade\": grade}\n", + " for key, grade in zip(grade_keys, random.choices(grades, k=len(grade_keys)))\n", + ")" ] }, { @@ -517,7 +544,9 @@ } ], "source": [ - "(Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=5, order_by='points DESC', offset=5)" + "(Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(\n", + " limit=5, order_by=\"points DESC\", offset=5\n", + ")" ] }, { @@ -537,7 +566,11 @@ } ], "source": [ - "((LetterGrade * Grade) & \"term_year='2018'\" & dj.Top(limit=10, order_by='points DESC', offset=0)).make_sql()" + "(\n", + " (LetterGrade * Grade)\n", + " & \"term_year='2018'\"\n", + " & dj.Top(limit=10, order_by=\"points DESC\", offset=0)\n", + ").make_sql()" ] }, { @@ -557,7 +590,11 @@ } ], "source": [ - "((Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=20, order_by='points DESC', offset=0)).make_sql()" + "(\n", + " (Grade * LetterGrade)\n", + " & \"term_year='2018'\"\n", + " & dj.Top(limit=20, order_by=\"points DESC\", offset=0)\n", + ").make_sql()" ] }, { @@ -763,7 +800,9 @@ } ], "source": [ - "(Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=20, order_by='points DESC', offset=0)" + "(Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(\n", + " limit=20, order_by=\"points DESC\", offset=0\n", + ")" ] }, { @@ -969,7 +1008,9 @@ } ], "source": [ - "(LetterGrade * Grade) & \"term_year='2018'\" & dj.Top(limit=20, order_by='points DESC', offset=0)" + "(LetterGrade * Grade) & \"term_year='2018'\" & dj.Top(\n", + " limit=20, order_by=\"points DESC\", offset=0\n", + ")" ] }, { diff --git a/docs/src/tutorials/json.ipynb b/docs/src/tutorials/json.ipynb index f39b43e33..9c5feebf6 100644 --- a/docs/src/tutorials/json.ipynb +++ b/docs/src/tutorials/json.ipynb @@ -27,7 +27,7 @@ "id": "67cf93d2", "metadata": {}, "source": [ - "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data 
pipeline](https://datajoint.com/docs/core/glossary/#data-pipeline).\n", + "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/core/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", "\n", "Now let's start by importing the `datajoint` client." ] @@ -406,7 +406,7 @@ ], "source": [ "# Which team has a `car` equal to 100 inches long?\n", - "Team & {'car.length': 100}" + "Team & {\"car.length\": 100}" ] }, { @@ -592,7 +592,7 @@ ], "source": [ "# Any team that has had their car inspected?\n", - "Team & [{'car.inspected:unsigned': True}, {'car.safety_inspected:unsigned': True}]" + "Team & [{\"car.inspected:unsigned\": True}, {\"car.safety_inspected:unsigned\": True}]" ] }, { @@ -820,7 +820,7 @@ "source": [ "# Only interested in the car names and the length but let the type be inferred\n", "q_untyped = Team.proj(\n", - " car_name='car.name',\n", + " car_name=\"car.name\",\n", " car_length=\"car.length\",\n", ")\n", "q_untyped" @@ -950,7 +950,7 @@ "source": [ "# Nevermind, I'll specify the type explicitly\n", "q_typed = Team.proj(\n", - " car_name='car.name',\n", + " car_name=\"car.name\",\n", " car_length=\"car.length:float\",\n", ")\n", "q_typed" @@ -1058,7 +1058,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3.7.16 64-bit", + "display_name": "all_purposes", "language": "python", "name": "python3" }, @@ -1072,12 +1072,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.16" - }, - "vscode": { - "interpreter": { - "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1" - } + "version": "3.9.18" } }, "nbformat": 4, diff --git a/pyproject.toml b/pyproject.toml index c484072bd..075bb92b7 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -35,7 +35,7 @@ maintainers = [ {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"}, {name = "DataJoint Contributors", email = "support@datajoint.com"}, ] -# manually sync here: https://datajoint.com/docs/core/datajoint-python/latest/#welcome-to-datajoint-for-python +# manually sync here: https://docs.datajoint.com/core/datajoint-python/latest/#welcome-to-datajoint-for-python description = "DataJoint for Python is a framework for scientific workflow management based on relational principles. DataJoint is built on the foundation of the relational data model and prescribes a consistent method for organizing, populating, computing, and querying data." 
readme = "README.md" license = {file = "LICENSE.txt"} @@ -69,8 +69,8 @@ classifiers = [ ] [project.urls] -Homepage = "https://datajoint.com/docs" -Documentation = "https://datajoint.com/docs" +Homepage = "https://docs.datajoint.com/" +Documentation = "https://docs.datajoint.com/" Repository = "https://github.com/datajoint/datajoint-python" "Bug Tracker" = "https://github.com/datajoint/datajoint-python/issues" "Release Notes" = "https://github.com/datajoint/datajoint-python/releases" From 557e11a5972a5b4d58462b9af10c9696b89e3107 Mon Sep 17 00:00:00 2001 From: MilagrosMarin Date: Thu, 15 May 2025 15:16:43 +0100 Subject: [PATCH 12/49] fix: typo for codespell --- docs/src/design/tables/indexes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/design/tables/indexes.md b/docs/src/design/tables/indexes.md index 8c0b53f15..fcd1b5702 100644 --- a/docs/src/design/tables/indexes.md +++ b/docs/src/design/tables/indexes.md @@ -35,7 +35,7 @@ To make searches faster on fields other than the primary key or a foreign key, y add a secondary index explicitly. Regular indexes are declared as `index(attr1, ..., attrN)` on a separate line anywhere in -the table declration (below the primary key divide). +the table declaration (below the primary key divide). Indexes can be declared with unique constraint as `unique index (attr1, ..., attrN)`. From 7a0fe5aff8a59277d1e12e72d27dd8373e7f53b0 Mon Sep 17 00:00:00 2001 From: MilagrosMarin Date: Sat, 31 May 2025 02:01:58 +0100 Subject: [PATCH 13/49] fix(URL): remove `core` in `docs.datajoint.com/core/datajoint-python` --- README.md | 4 ++-- docs/README.md | 2 +- docs/src/tutorials/dj-top.ipynb | 2 +- docs/src/tutorials/json.ipynb | 2 +- pyproject.toml | 2 +- 5 files changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index eecee41a0..00bdb6928 100644 --- a/README.md +++ b/README.md @@ -131,7 +131,7 @@ DataJoint (). pip install datajoint ``` -- [Documentation & Tutorials](https://docs.datajoint.com/core/datajoint-python/) +- [Documentation & Tutorials](https://docs.datajoint.com/datajoint-python/) - [Interactive Tutorials](https://github.com/datajoint/datajoint-tutorials) on GitHub Codespaces @@ -140,4 +140,4 @@ DataJoint (). - Contribute - [Contribution Guidelines](https://docs.datajoint.com/about/contribute/) - - [Developer Guide](https://docs.datajoint.com/core/datajoint-python/latest/develop/) + - [Developer Guide](https://docs.datajoint.com/datajoint-python/latest/develop/) diff --git a/docs/README.md b/docs/README.md index 4aecf0a69..3fe48a691 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,6 +1,6 @@ # Contribute to DataJoint Documentation -This is the home for DataJoint software documentation as hosted at https://docs.datajoint.com/core/datajoint-python/latest/. +This is the home for DataJoint software documentation as hosted at https://docs.datajoint.com/datajoint-python/latest/. 
## VSCode Linter Extensions and Settings diff --git a/docs/src/tutorials/dj-top.ipynb b/docs/src/tutorials/dj-top.ipynb index 7ed9f97cc..b3472f1b2 100644 --- a/docs/src/tutorials/dj-top.ipynb +++ b/docs/src/tutorials/dj-top.ipynb @@ -11,7 +11,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/core/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", + "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", "\n", "Now let's start by importing the `datajoint` client." ] diff --git a/docs/src/tutorials/json.ipynb b/docs/src/tutorials/json.ipynb index 9c5feebf6..cb583b2ad 100644 --- a/docs/src/tutorials/json.ipynb +++ b/docs/src/tutorials/json.ipynb @@ -27,7 +27,7 @@ "id": "67cf93d2", "metadata": {}, "source": [ - "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/core/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", + "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", "\n", "Now let's start by importing the `datajoint` client." ] diff --git a/pyproject.toml b/pyproject.toml index 075bb92b7..02c61d2df 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -35,7 +35,7 @@ maintainers = [ {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"}, {name = "DataJoint Contributors", email = "support@datajoint.com"}, ] -# manually sync here: https://docs.datajoint.com/core/datajoint-python/latest/#welcome-to-datajoint-for-python +# manually sync here: https://docs.datajoint.com/datajoint-python/latest/#welcome-to-datajoint-for-python description = "DataJoint for Python is a framework for scientific workflow management based on relational principles. DataJoint is built on the foundation of the relational data model and prescribes a consistent method for organizing, populating, computing, and querying data." readme = "README.md" license = {file = "LICENSE.txt"} From b42c3051db1fa853f9180a3b115f81a67ed62763 Mon Sep 17 00:00:00 2001 From: MilagrosMarin Date: Sat, 31 May 2025 02:22:41 +0100 Subject: [PATCH 14/49] fix(URL): add `datajoint-docs` before `elements` --- README.md | 2 +- docs/src/index.md | 2 +- docs/src/query/operators.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 00bdb6928..e839d81bc 100644 --- a/README.md +++ b/README.md @@ -135,7 +135,7 @@ DataJoint (). 
- [Interactive Tutorials](https://github.com/datajoint/datajoint-tutorials) on GitHub Codespaces -- [DataJoint Elements](https://docs.datajoint.com/elements/) - Catalog of example pipelines for neuroscience experiments +- [DataJoint Elements](https://docs.datajoint.com/datajoint-docs/elements/) - Catalog of example pipelines for neuroscience experiments - Contribute - [Contribution Guidelines](https://docs.datajoint.com/about/contribute/) diff --git a/docs/src/index.md b/docs/src/index.md index 6e3bf2a2d..64a4a6ea0 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -36,7 +36,7 @@ Presently, the primary developer of DataJoint open-source software is the compan - [Interactive Tutorials](https://github.com/datajoint/datajoint-tutorials){:target="_blank"} on GitHub Codespaces -- [DataJoint Elements](https://docs.datajoint.com/elements/) - Catalog of example pipelines for neuroscience experiments +- [DataJoint Elements](https://docs.datajoint.com/datajoint-docs/elements/) - Catalog of example pipelines for neuroscience experiments - Contribute - [Development Environment](./develop) diff --git a/docs/src/query/operators.md b/docs/src/query/operators.md index ee3549f35..c18612429 100644 --- a/docs/src/query/operators.md +++ b/docs/src/query/operators.md @@ -392,4 +392,4 @@ dj.U().aggr(Session, n="max(session)") # (3) `dj.U()`, as shown in the last example above, is often useful for integer IDs. For an example of this process, see the source code for -[Element Array Electrophysiology's `insert_new_params`](https://docs.datajoint.com/elements/element-array-ephys/latest/api/element_array_ephys/ephys_acute/#element_array_ephys.ephys_acute.ClusteringParamSet.insert_new_params). +[Element Array Electrophysiology's `insert_new_params`](https://docs.datajoint.com/datajoint-docs/elements/element-array-ephys/latest/api/element_array_ephys/ephys_acute/#element_array_ephys.ephys_acute.ClusteringParamSet.insert_new_params). From 9781d6e47348175925027298bb4ed5bcbc6498a2 Mon Sep 17 00:00:00 2001 From: MilagrosMarin Date: Sat, 31 May 2025 02:27:47 +0100 Subject: [PATCH 15/49] fix(URL): add `datajoint-docs` before `contribute` --- README.md | 2 +- docs/src/index.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index e839d81bc..bd2236145 100644 --- a/README.md +++ b/README.md @@ -138,6 +138,6 @@ DataJoint (). 
- [DataJoint Elements](https://docs.datajoint.com/datajoint-docs/elements/) - Catalog of example pipelines for neuroscience experiments - Contribute - - [Contribution Guidelines](https://docs.datajoint.com/about/contribute/) + - [Contribution Guidelines](https://docs.datajoint.com/datajoint-docs/about/contribute/) - [Developer Guide](https://docs.datajoint.com/datajoint-python/latest/develop/) diff --git a/docs/src/index.md b/docs/src/index.md index 64a4a6ea0..59ffef4f3 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -41,4 +41,4 @@ Presently, the primary developer of DataJoint open-source software is the compan - Contribute - [Development Environment](./develop) - - [Guidelines](https://docs.datajoint.com/about/contribute/) + - [Guidelines](https://docs.datajoint.com/datajoint-docs/about/contribute/) From a7ffe2ebe90396ed8d0b9db3552bf885b741b89b Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 2 Jun 2025 08:04:52 -0500 Subject: [PATCH 16/49] Revert "fix(URL): broken routing and redirects on docs.datajoint.com" --- README.md | 8 ++++---- docs/README.md | 2 +- docs/src/index.md | 4 ++-- docs/src/query/operators.md | 2 +- docs/src/tutorials/dj-top.ipynb | 2 +- docs/src/tutorials/json.ipynb | 2 +- pyproject.toml | 2 +- 7 files changed, 11 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index bd2236145..eecee41a0 100644 --- a/README.md +++ b/README.md @@ -131,13 +131,13 @@ DataJoint (). pip install datajoint ``` -- [Documentation & Tutorials](https://docs.datajoint.com/datajoint-python/) +- [Documentation & Tutorials](https://docs.datajoint.com/core/datajoint-python/) - [Interactive Tutorials](https://github.com/datajoint/datajoint-tutorials) on GitHub Codespaces -- [DataJoint Elements](https://docs.datajoint.com/datajoint-docs/elements/) - Catalog of example pipelines for neuroscience experiments +- [DataJoint Elements](https://docs.datajoint.com/elements/) - Catalog of example pipelines for neuroscience experiments - Contribute - - [Contribution Guidelines](https://docs.datajoint.com/datajoint-docs/about/contribute/) + - [Contribution Guidelines](https://docs.datajoint.com/about/contribute/) - - [Developer Guide](https://docs.datajoint.com/datajoint-python/latest/develop/) + - [Developer Guide](https://docs.datajoint.com/core/datajoint-python/latest/develop/) diff --git a/docs/README.md b/docs/README.md index 3fe48a691..4aecf0a69 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,6 +1,6 @@ # Contribute to DataJoint Documentation -This is the home for DataJoint software documentation as hosted at https://docs.datajoint.com/datajoint-python/latest/. +This is the home for DataJoint software documentation as hosted at https://docs.datajoint.com/core/datajoint-python/latest/. 
## VSCode Linter Extensions and Settings diff --git a/docs/src/index.md b/docs/src/index.md index 59ffef4f3..6e3bf2a2d 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -36,9 +36,9 @@ Presently, the primary developer of DataJoint open-source software is the compan - [Interactive Tutorials](https://github.com/datajoint/datajoint-tutorials){:target="_blank"} on GitHub Codespaces -- [DataJoint Elements](https://docs.datajoint.com/datajoint-docs/elements/) - Catalog of example pipelines for neuroscience experiments +- [DataJoint Elements](https://docs.datajoint.com/elements/) - Catalog of example pipelines for neuroscience experiments - Contribute - [Development Environment](./develop) - - [Guidelines](https://docs.datajoint.com/datajoint-docs/about/contribute/) + - [Guidelines](https://docs.datajoint.com/about/contribute/) diff --git a/docs/src/query/operators.md b/docs/src/query/operators.md index c18612429..ee3549f35 100644 --- a/docs/src/query/operators.md +++ b/docs/src/query/operators.md @@ -392,4 +392,4 @@ dj.U().aggr(Session, n="max(session)") # (3) `dj.U()`, as shown in the last example above, is often useful for integer IDs. For an example of this process, see the source code for -[Element Array Electrophysiology's `insert_new_params`](https://docs.datajoint.com/datajoint-docs/elements/element-array-ephys/latest/api/element_array_ephys/ephys_acute/#element_array_ephys.ephys_acute.ClusteringParamSet.insert_new_params). +[Element Array Electrophysiology's `insert_new_params`](https://docs.datajoint.com/elements/element-array-ephys/latest/api/element_array_ephys/ephys_acute/#element_array_ephys.ephys_acute.ClusteringParamSet.insert_new_params). diff --git a/docs/src/tutorials/dj-top.ipynb b/docs/src/tutorials/dj-top.ipynb index b3472f1b2..7ed9f97cc 100644 --- a/docs/src/tutorials/dj-top.ipynb +++ b/docs/src/tutorials/dj-top.ipynb @@ -11,7 +11,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", + "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/core/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", "\n", "Now let's start by importing the `datajoint` client." ] diff --git a/docs/src/tutorials/json.ipynb b/docs/src/tutorials/json.ipynb index cb583b2ad..9c5feebf6 100644 --- a/docs/src/tutorials/json.ipynb +++ b/docs/src/tutorials/json.ipynb @@ -27,7 +27,7 @@ "id": "67cf93d2", "metadata": {}, "source": [ - "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", + "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/core/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", "\n", "Now let's start by importing the `datajoint` client." 
] diff --git a/pyproject.toml b/pyproject.toml index 02c61d2df..075bb92b7 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -35,7 +35,7 @@ maintainers = [ {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"}, {name = "DataJoint Contributors", email = "support@datajoint.com"}, ] -# manually sync here: https://docs.datajoint.com/datajoint-python/latest/#welcome-to-datajoint-for-python +# manually sync here: https://docs.datajoint.com/core/datajoint-python/latest/#welcome-to-datajoint-for-python description = "DataJoint for Python is a framework for scientific workflow management based on relational principles. DataJoint is built on the foundation of the relational data model and prescribes a consistent method for organizing, populating, computing, and querying data." readme = "README.md" license = {file = "LICENSE.txt"} From 43fabad0602d8e757f8788ae67f1fd88795613b2 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 6 Jun 2025 15:19:21 -0500 Subject: [PATCH 17/49] fix error message for the case when attempting to delete without the REFERENCE privilege --- datajoint/autopopulate.py | 5 ++--- datajoint/declare.py | 2 +- datajoint/table.py | 21 +++++++++++---------- 3 files changed, 14 insertions(+), 14 deletions(-) diff --git a/datajoint/autopopulate.py b/datajoint/autopopulate.py index 22053d5cd..e4d7ba80b 100644 --- a/datajoint/autopopulate.py +++ b/datajoint/autopopulate.py @@ -98,7 +98,7 @@ def make(self, key): 1. Fetch data from tables above in the dependency hierarchy, restricted by the given key. 2. Compute secondary attributes based on the fetched data. - 3. Insert the new tuples into the current table. + 3. Insert the new tuple(s) into the current table. The method can be implemented either as: (a) Regular method: All three steps are performed in a single database transaction. @@ -263,9 +263,8 @@ def populate( self.connection.schemas[self.target.database].jobs if reserve_jobs else None ) - # define and set up signal handler for SIGTERM: if reserve_jobs: - + # Define a signal handler for SIGTERM def handler(signum, frame): logger.info("Populate terminated by SIGTERM") raise SystemExit("SIGTERM received") diff --git a/datajoint/declare.py b/datajoint/declare.py index b1194880f..d061aa879 100644 --- a/datajoint/declare.py +++ b/datajoint/declare.py @@ -302,7 +302,7 @@ def declare(full_table_name, definition, context): name=table_name, max_length=MAX_TABLE_NAME_LENGTH ) ) - + ( table_comment, primary_key, diff --git a/datajoint/table.py b/datajoint/table.py index db9eaffa1..5b1ba1103 100644 --- a/datajoint/table.py +++ b/datajoint/table.py @@ -135,7 +135,7 @@ def alter(self, prompt=True, context=None): sql, external_stores = alter(self.definition, old_definition, context) if not sql: if prompt: - logger.warn("Nothing to alter.") + logger.warning("Nothing to alter.") else: sql = "ALTER TABLE {tab}\n\t".format( tab=self.full_table_name @@ -518,7 +518,13 @@ def cascade(table): try: delete_count = table.delete_quick(get_count=True) except IntegrityError as error: - match = foreign_key_error_regexp.match(error.args[0]).groupdict() + match = foreign_key_error_regexp.match(error.args[0]) + if match is None: + raise DataJointError( + "Cascading deletes failed because the error message is missing foreign key information." + "Make sure you have REFERENCES privilege to all dependent tables." 
+ ) from None + match = match.groupdict() # if schema name missing, use table if "`.`" not in match["child"]: match["child"] = "{}.{}".format( @@ -641,7 +647,7 @@ def cascade(table): # Confirm and commit if delete_count == 0: if safemode: - logger.warn("Nothing to delete.") + logger.warning("Nothing to delete.") if transaction: self.connection.cancel_transaction() elif not transaction: @@ -651,12 +657,12 @@ def cascade(table): if transaction: self.connection.commit_transaction() if safemode: - logger.info("Deletes committed.") + logger.info("Delete committed.") else: if transaction: self.connection.cancel_transaction() if safemode: - logger.warn("Deletes cancelled") + logger.warning("Delete cancelled") return delete_count def drop_quick(self): @@ -724,11 +730,6 @@ def size_on_disk(self): ).fetchone() return ret["Data_length"] + ret["Index_length"] - def show_definition(self): - raise AttributeError( - "show_definition is deprecated. Use the describe method instead." - ) - def describe(self, context=None, printout=False): """ :return: the definition string for the query using DataJoint DDL. From e7c8943705b512bc04289c8e26398849be4c3fd9 Mon Sep 17 00:00:00 2001 From: Thinh Nguyen Date: Fri, 13 Jun 2025 13:36:05 -0500 Subject: [PATCH 18/49] fix: improve error handling when `make_fetch` referential integrity fails --- datajoint/autopopulate.py | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/datajoint/autopopulate.py b/datajoint/autopopulate.py index d5cabe062..c88891049 100644 --- a/datajoint/autopopulate.py +++ b/datajoint/autopopulate.py @@ -412,11 +412,10 @@ def _populate1( != deepdiff.DeepHash(fetched_data, ignore_iterable_order=False)[ fetched_data ] - ): # rollback due to referential integrity fail - self.connection.cancel_transaction() - logger.warning( - f"Referential integrity failed for {key} -> {self.target.full_table_name}") - return False + ): # raise error if fetched data has changed + raise DataJointError( + "Referential integrity failed - the `make_fetch` data has changed." + ) gen.send(computed_result) # insert except (KeyboardInterrupt, SystemExit, Exception) as error: From e55bbcb6e935ec72e3cfe8c7f6e3cbd5e023c2f4 Mon Sep 17 00:00:00 2001 From: Thinh Nguyen Date: Fri, 13 Jun 2025 13:38:33 -0500 Subject: [PATCH 19/49] style: black format --- datajoint/autopopulate.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/datajoint/autopopulate.py b/datajoint/autopopulate.py index c88891049..1b0e6c12c 100644 --- a/datajoint/autopopulate.py +++ b/datajoint/autopopulate.py @@ -105,8 +105,8 @@ def make(self, key): The method can be implemented either as: (a) Regular method: All three steps are performed in a single database transaction. The method must return None. - (b) Generator method: - The make method is split into three functions: + (b) Generator method: + The make method is split into three functions: - `make_fetch`: Fetches data from the parent tables. - `make_compute`: Computes secondary attributes based on the fetched data. - `make_insert`: Inserts the computed data into the current table. @@ -124,7 +124,7 @@ def make(self, key): self.make_insert(key, *computed_result) commit_transaction - + Importantly, the output of make_fetch is a tuple that serves as the input into `make_compute`. The output of `make_compute` is a tuple that serves as the input into `make_insert`. 
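To make the tuple contract described in the docstring above concrete, here is a minimal sketch of a `dj.Computed` table implementing the split `make` (the schema name, table definitions, and the toy thresholding step are illustrative assumptions, not part of these patches):

```python
import datajoint as dj
import numpy as np

schema = dj.Schema("demo_three_part_make")  # hypothetical schema name


@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    ---
    trace : longblob
    fs    : float
    """


@schema
class Spikes(dj.Computed):
    definition = """
    -> Recording
    ---
    spike_times : longblob
    """

    def make_fetch(self, key):
        # Phase 1: fetch inputs; fetch1 with two attributes returns a tuple,
        # which is unpacked as the arguments of make_compute
        return (Recording & key).fetch1("trace", "fs")

    def make_compute(self, key, trace, fs):
        # Phase 2: the long computation, executed outside any transaction;
        # the result must again be a sequence (here a 1-tuple)
        spike_times = np.nonzero(trace > 3 * np.std(trace))[0] / fs
        return (spike_times,)

    def make_insert(self, key, spike_times):
        # Phase 3: insert the computed result inside the transaction
        self.insert1(dict(key, spike_times=spike_times))
```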
From 964743efdb45f43bce9564cb625b7c890454daa6 Mon Sep 17 00:00:00 2001 From: Thinh Nguyen Date: Fri, 13 Jun 2025 13:40:11 -0500 Subject: [PATCH 20/49] style: format --- datajoint/autopopulate.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datajoint/autopopulate.py b/datajoint/autopopulate.py index 1b0e6c12c..461972cfa 100644 --- a/datajoint/autopopulate.py +++ b/datajoint/autopopulate.py @@ -414,7 +414,7 @@ def _populate1( ] ): # raise error if fetched data has changed raise DataJointError( - "Referential integrity failed - the `make_fetch` data has changed." + "Referential integrity failed! The `make_fetch` data has changed" ) gen.send(computed_result) # insert From 32918e573b398223ed42e15770a7c9f056aca344 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Thu, 3 Jul 2025 15:01:50 -0500 Subject: [PATCH 21/49] Fix missing final statement in parse_sql --- datajoint/utils.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/datajoint/utils.py b/datajoint/utils.py index 1aae610d8..cbf5f51ef 100644 --- a/datajoint/utils.py +++ b/datajoint/utils.py @@ -146,3 +146,5 @@ def parse_sql(filepath): if line.endswith(delimiter): yield " ".join(statement) statement = [] + if statement: + yield " ".join(statement) From da8b68082c36ba053d9f2fdc6ac46010c45ff39f Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Thu, 3 Jul 2025 15:12:11 -0500 Subject: [PATCH 22/49] Remove trailing spaces --- datajoint/declare.py | 2 +- docs/src/concepts/data-model.md | 10 +++++----- docs/src/concepts/teamwork.md | 20 ++++++++++---------- docs/src/design/integrity.md | 4 ++-- docs/src/design/tables/customtype.md | 8 ++++---- docs/src/design/tables/indexes.md | 4 ++-- docs/src/faq.md | 12 ++++++------ docs/src/internal/transpilation.md | 2 +- docs/src/manipulation/transactions.md | 2 +- docs/src/publish-data.md | 6 +++--- docs/src/quick-start.md | 2 +- 11 files changed, 36 insertions(+), 36 deletions(-) diff --git a/datajoint/declare.py b/datajoint/declare.py index 1d62d15c4..304476798 100644 --- a/datajoint/declare.py +++ b/datajoint/declare.py @@ -304,7 +304,7 @@ def declare(full_table_name, definition, context): name=table_name, max_length=MAX_TABLE_NAME_LENGTH ) ) - + ( table_comment, primary_key, diff --git a/docs/src/concepts/data-model.md b/docs/src/concepts/data-model.md index 14528fe04..90460361a 100644 --- a/docs/src/concepts/data-model.md +++ b/docs/src/concepts/data-model.md @@ -54,7 +54,7 @@ columns (often called attributes). A collection of base relations with their attributes, domain constraints, uniqueness constraints, and referential constraints is called a schema. -**Domain constraints:** +**Domain constraints:** Each attribute (column) in a table is associated with a specific attribute domain (or datatype, a set of possible values), ensuring that the data entered is valid. Attribute domains may not include relations, which keeps the data model @@ -68,13 +68,13 @@ columns (often called attributes). One key in a relation is designated as the primary key used for referencing its elements. **Referential constraints:** - Associations among data are established by means of referential constraints with the + Associations among data are established by means of referential constraints with the help of foreign keys. A referential constraint on relation A referencing relation B allows only those entities in A whose foreign key attributes match the key attributes of an entity in B. 
**Declarative queries:** - Data queries are formulated through declarative, as opposed to imperative, + Data queries are formulated through declarative, as opposed to imperative, specifications of sought results. This means that query expressions convey the logic for the result rather than the procedure for obtaining it. @@ -106,7 +106,7 @@ clarity, efficiency, workflow management, and precise and flexible data queries. By enforcing entity normalization, simplifying dependency declarations, offering a rich query algebra, and visualizing relationships through schema diagrams, DataJoint makes relational database programming -more intuitive and robust for complex data pipelines. +more intuitive and robust for complex data pipelines. The model has emerged over a decade of continuous development of complex data pipelines for neuroscience experiments ([Yatsenko et al., @@ -123,7 +123,7 @@ DataJoint comprises: + a schema [definition](../design/tables/declare.md) language + a data [manipulation](../manipulation/index.md) language + a data [query](../query/principles.md) language -+ a [diagramming](../design/diagrams.md) notation for visualizing relationships between ++ a [diagramming](../design/diagrams.md) notation for visualizing relationships between modeled entities The key refinement of DataJoint over other relational data models and their diff --git a/docs/src/concepts/teamwork.md b/docs/src/concepts/teamwork.md index 4cccea9f5..a0a782dde 100644 --- a/docs/src/concepts/teamwork.md +++ b/docs/src/concepts/teamwork.md @@ -60,33 +60,33 @@ division of labor among team members, leading to greater efficiency and better s ### Scientists Design and conduct experiments, collecting data. -They interact with the data pipeline through graphical user interfaces designed by +They interact with the data pipeline through graphical user interfaces designed by others. They understand what analysis is used to test their hypotheses. ### Data scientists -Have the domain expertise and select and implement the processing and analysis +Have the domain expertise and select and implement the processing and analysis methods for experimental data. -Data scientists are in charge of defining and managing the data pipeline using -DataJoint's data model, but they may not know the details of the underlying +Data scientists are in charge of defining and managing the data pipeline using +DataJoint's data model, but they may not know the details of the underlying architecture. -They interact with the pipeline using client programming interfaces directly from +They interact with the pipeline using client programming interfaces directly from languages such as MATLAB and Python. -The bulk of this manual is written for working data scientists, except for System +The bulk of this manual is written for working data scientists, except for System Administration. ### Data engineers Work with the data scientists to support the data pipeline. -They rely on their understanding of the DataJoint data model to configure and -administer the required IT resources such as database servers, data storage +They rely on their understanding of the DataJoint data model to configure and +administer the required IT resources such as database servers, data storage servers, networks, cloud instances, [Globus](https://globus.org) endpoints, etc. -Data engineers can provide general solutions such as web hosting, data publishing, +Data engineers can provide general solutions such as web hosting, data publishing, interfaces, exports and imports. 
-The System Administration section of this tutorial contains materials helpful in +The System Administration section of this tutorial contains materials helpful in accomplishing these tasks. DataJoint is designed to delineate a clean boundary between **data science** and **data diff --git a/docs/src/design/integrity.md b/docs/src/design/integrity.md index 299a2a45a..cb7122755 100644 --- a/docs/src/design/integrity.md +++ b/docs/src/design/integrity.md @@ -1,7 +1,7 @@ # Data Integrity -The term **data integrity** describes guarantees made by the data management process -that prevent errors and corruption in data due to technical failures and human errors +The term **data integrity** describes guarantees made by the data management process +that prevent errors and corruption in data due to technical failures and human errors arising in the course of continuous use by multiple agents. DataJoint pipelines respect the following forms of data integrity: **entity integrity**, **referential integrity**, and **group integrity** as described in more diff --git a/docs/src/design/tables/customtype.md b/docs/src/design/tables/customtype.md index 823dd987c..aad194ff5 100644 --- a/docs/src/design/tables/customtype.md +++ b/docs/src/design/tables/customtype.md @@ -49,9 +49,9 @@ attribute type in a datajoint table class: import datajoint as dj class GraphAdapter(dj.AttributeAdapter): - + attribute_type = 'longblob' # this is how the attribute will be declared - + def put(self, obj): # convert the nx.Graph object into an edge list assert isinstance(obj, nx.Graph) @@ -60,7 +60,7 @@ class GraphAdapter(dj.AttributeAdapter): def get(self, value): # convert edge list back into an nx.Graph return nx.Graph(value) - + # instantiate for use as a datajoint type graph = GraphAdapter() @@ -75,6 +75,6 @@ class Connectivity(dj.Manual): definition = """ conn_id : int --- - conn_graph = null : # a networkx.Graph object + conn_graph = null : # a networkx.Graph object """ ``` diff --git a/docs/src/design/tables/indexes.md b/docs/src/design/tables/indexes.md index fcd1b5702..9d8148c36 100644 --- a/docs/src/design/tables/indexes.md +++ b/docs/src/design/tables/indexes.md @@ -62,7 +62,7 @@ Let’s now imagine that rats in a lab are identified by the combination of `lab @schema class Rat(dj.Manual): definition = """ - lab_name : char(16) + lab_name : char(16) rat_id : int unsigned # lab-specific ID --- date_of_birth = null : date @@ -86,7 +86,7 @@ To speed up searches by the `rat_id` and `date_of_birth`, we can explicit indexe @schema class Rat2(dj.Manual): definition = """ - lab_name : char(16) + lab_name : char(16) rat_id : int unsigned # lab-specific ID --- date_of_birth = null : date diff --git a/docs/src/faq.md b/docs/src/faq.md index 1de69bb31..c4c82d014 100644 --- a/docs/src/faq.md +++ b/docs/src/faq.md @@ -7,13 +7,13 @@ It is common to enter data during experiments using a graphical user interface. 1. The [DataJoint platform](https://works.datajoint.com) platform is a web-based, end-to-end platform to host and execute data pipelines. -2. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open +2. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open source project for data entry but is no longer actively maintained. ## Does DataJoint support other programming languages? DataJoint [Python](https://docs.datajoint.com/core/datajoint-python/) is the most -up-to-date version and all future development will focus on the Python API. 
The +up-to-date version and all future development will focus on the Python API. The [Matlab](https://datajoint.com/docs/core/datajoint-matlab/) API was actively developed through 2023. Previous projects implemented some DataJoint features in [Julia](https://github.com/BrainCOGS/neuronex_workshop_2018/tree/julia/julia) and @@ -93,7 +93,7 @@ The entry of metadata can be manual, or it can be an automated part of data acqu into the database). Depending on their size and contents, raw data files can be stored in a number of ways. -In the simplest and most common scenario, raw data continues to be stored in either a +In the simplest and most common scenario, raw data continues to be stored in either a local filesystem or in the cloud as collections of files and folders. The paths to these files are entered in the database (again, either manually or by automated processes). @@ -101,8 +101,8 @@ This is the point at which the notion of a **data pipeline** begins. Below these "manual tables" that contain metadata and file paths are a series of tables that load raw data from these files, process it in some way, and insert derived or summarized data directly into the database. -For example, in an imaging application, the very large raw `.TIFF` stacks would reside on -the filesystem, but the extracted fluorescent trace timeseries for each cell in the +For example, in an imaging application, the very large raw `.TIFF` stacks would reside on +the filesystem, but the extracted fluorescent trace timeseries for each cell in the image would be stored as a numerical array directly in the database. Or the raw video used for animal tracking might be stored in a standard video format on the filesystem, but the computed X/Y positions of the animal would be stored in the @@ -164,7 +164,7 @@ This brings us to the final important question: ## How do I get my data out? -This is the fun part. See [queries](query/operators.md) for details of the DataJoint +This is the fun part. See [queries](query/operators.md) for details of the DataJoint query language directly from Python. ## Interfaces diff --git a/docs/src/internal/transpilation.md b/docs/src/internal/transpilation.md index b263c7528..b8d81d42a 100644 --- a/docs/src/internal/transpilation.md +++ b/docs/src/internal/transpilation.md @@ -59,7 +59,7 @@ The input object is treated as a subquery in the following cases: 1. A restriction is applied that uses alias attributes in the heading. 2. A projection uses an alias attribute to create a new alias attribute. 3. A join is performed on an alias attribute. -4. An Aggregation is used a restriction. +4. An Aggregation is used a restriction. An error arises if diff --git a/docs/src/manipulation/transactions.md b/docs/src/manipulation/transactions.md index c7d6951a7..58b9a3167 100644 --- a/docs/src/manipulation/transactions.md +++ b/docs/src/manipulation/transactions.md @@ -6,7 +6,7 @@ interrupting the sequence of such operations halfway would leave the data in an state. While the sequence is in progress, other processes accessing the database will not see the partial results until the transaction is complete. -The sequence may include [data queries](../query/principles.md) and +The sequence may include [data queries](../query/principles.md) and [manipulations](index.md). In such cases, the sequence of operations may be enclosed in a transaction. 
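For context, the transaction mechanism referenced above is typically used as a context manager. A minimal sketch, assuming the `connection.transaction` interface described on that page; `Session` and `Trial` are hypothetical tables:

```python
import datajoint as dj

connection = dj.conn()

# Session and Trial stand in for any two related tables defined elsewhere.
# Other processes see either all of these rows or none of them.
with connection.transaction:
    Session.insert1(dict(subject_id=1, session=5))
    Trial.insert(dict(subject_id=1, session=5, trial=t) for t in range(10))
```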
diff --git a/docs/src/publish-data.md b/docs/src/publish-data.md index d766f49da..3ec2d7211 100644 --- a/docs/src/publish-data.md +++ b/docs/src/publish-data.md @@ -27,8 +27,8 @@ The code and the data can be found at [https://github.com/sinzlab/Sinz2018_NIPS] ## Exporting into a collection of files -Another option for publishing and archiving data is to export the data from the +Another option for publishing and archiving data is to export the data from the DataJoint pipeline into a collection of files. -DataJoint provides features for exporting and importing sections of the pipeline. -Several ongoing projects are implementing the capability to export from DataJoint +DataJoint provides features for exporting and importing sections of the pipeline. +Several ongoing projects are implementing the capability to export from DataJoint pipelines into [Neurodata Without Borders](https://www.nwb.org/) files. diff --git a/docs/src/quick-start.md b/docs/src/quick-start.md index f3309c066..a7f255658 100644 --- a/docs/src/quick-start.md +++ b/docs/src/quick-start.md @@ -5,7 +5,7 @@ The easiest way to get started is through the [DataJoint Tutorials](https://github.com/datajoint/datajoint-tutorials). These tutorials are configured to run using [GitHub Codespaces](https://github.com/features/codespaces) -where the full environment including the database is already set up. +where the full environment including the database is already set up. Advanced users can install DataJoint locally. Please see the installation instructions below. From 0709c379d8d3220833fd07e295ccc959856ff829 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Thu, 3 Jul 2025 16:31:17 -0500 Subject: [PATCH 23/49] minor format --- datajoint/autopopulate.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/datajoint/autopopulate.py b/datajoint/autopopulate.py index e95de6b0d..ae731d8f1 100644 --- a/datajoint/autopopulate.py +++ b/datajoint/autopopulate.py @@ -146,7 +146,8 @@ def make(self, key): ): # user must implement `make` raise NotImplementedError( - "Subclasses of AutoPopulate must implement the method `make` or (`make_fetch` + `make_compute` + `make_insert`)" + "Subclasses of AutoPopulate must implement the method `make` " + "or (`make_fetch` + `make_compute` + `make_insert`)" ) # User has implemented `_fetch`, `_compute`, and `_insert` methods instead From ac141e7c41141ae74608613b1775b58104355416 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Thu, 3 Jul 2025 16:33:03 -0500 Subject: [PATCH 24/49] blackify --- datajoint/autopopulate.py | 2 +- datajoint/blob.py | 4 ++-- datajoint/condition.py | 2 +- datajoint/preview.py | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/datajoint/autopopulate.py b/datajoint/autopopulate.py index ae731d8f1..226e64dda 100644 --- a/datajoint/autopopulate.py +++ b/datajoint/autopopulate.py @@ -147,7 +147,7 @@ def make(self, key): # user must implement `make` raise NotImplementedError( "Subclasses of AutoPopulate must implement the method `make` " - "or (`make_fetch` + `make_compute` + `make_insert`)" + "or (`make_fetch` + `make_compute` + `make_insert`)" ) # User has implemented `_fetch`, `_compute`, and `_insert` methods instead diff --git a/datajoint/blob.py b/datajoint/blob.py index 82e1c3d18..639789680 100644 --- a/datajoint/blob.py +++ b/datajoint/blob.py @@ -140,7 +140,7 @@ def read_blob(self, n_bytes=None): "S": self.read_struct, # matlab struct array "C": self.read_cell_array, # matlab cell array # basic data types - "\xFF": self.read_none, # None + 
"\xff": self.read_none, # None "\x01": self.read_tuple, # a Sequence (e.g. tuple) "\x02": self.read_list, # a MutableSequence (e.g. list) "\x03": self.read_set, # a Set @@ -401,7 +401,7 @@ def read_none(self): @staticmethod def pack_none(): - return b"\xFF" + return b"\xff" def read_tuple(self): return tuple( diff --git a/datajoint/condition.py b/datajoint/condition.py index 7fbe0c7bc..96cfbb6ef 100644 --- a/datajoint/condition.py +++ b/datajoint/condition.py @@ -1,4 +1,4 @@ -""" methods for generating SQL WHERE clauses from datajoint restriction conditions """ +"""methods for generating SQL WHERE clauses from datajoint restriction conditions""" import collections import datetime diff --git a/datajoint/preview.py b/datajoint/preview.py index 775570432..564c92a0a 100644 --- a/datajoint/preview.py +++ b/datajoint/preview.py @@ -1,4 +1,4 @@ -""" methods for generating previews of query expression results in python command line and Jupyter """ +"""methods for generating previews of query expression results in python command line and Jupyter""" from .settings import config From 7c570d1a61f3d979fb748d35c8894d25ea12fc92 Mon Sep 17 00:00:00 2001 From: github-actions Date: Fri, 25 Jul 2025 13:37:31 +0000 Subject: [PATCH 25/49] Update version.py to 0.14.5 --- datajoint/version.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datajoint/version.py b/datajoint/version.py index 3f48dc939..b51d5935a 100644 --- a/datajoint/version.py +++ b/datajoint/version.py @@ -1,6 +1,6 @@ # version bump auto managed by Github Actions: # label_prs.yaml(prep), release.yaml(bump), post_release.yaml(edit) # manually set this version will be eventually overwritten by the above actions -__version__ = "0.14.4" +__version__ = "0.14.5" assert len(__version__) <= 10 # The log table limits version to the 10 characters From a6ebe19d47ff3b5500890608eafecf7a1a38eace Mon Sep 17 00:00:00 2001 From: github-actions Date: Fri, 25 Jul 2025 13:37:31 +0000 Subject: [PATCH 26/49] Update README.md badge to v0.14.5 --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index eecee41a0..da0ce3c02 100644 --- a/README.md +++ b/README.md @@ -30,8 +30,8 @@ Since Release - - commit since last release + + commit since last release From fb77a486e21c74796edc288a538ed21de18b286b Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 25 Jul 2025 08:51:56 -0500 Subject: [PATCH 27/49] docs: redirect changelog --- CHANGELOG.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 1a7b86032..4bf094509 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,8 @@ ## Release notes +**Note:** This file is no longer updated. See the GitHub change log page for the +latest release notes: . 
+ ### 0.14.3 -- Sep 23, 2024 - Added - `dj.Top` restriction - PR [#1024](https://github.com/datajoint/datajoint-python/issues/1024)) PR [#1084](https://github.com/datajoint/datajoint-python/pull/1084) - Fixed - Added encapsulating double quotes to comply with [DOT language](https://graphviz.org/doc/info/lang.html) - PR [#1177](https://github.com/datajoint/datajoint-python/pull/1177) From d220c72be16849f2b1ca952ff08dc65933adbed8 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 25 Jul 2025 09:05:11 -0500 Subject: [PATCH 28/49] begin preparing 0.14.6 --- CHANGELOG.md | 3 +++ datajoint/version.py | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 1a7b86032..4bf094509 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,8 @@ ## Release notes +**Note:** This file is no longer updated. See the GitHub change log page for the +latest release notes: . + ### 0.14.3 -- Sep 23, 2024 - Added - `dj.Top` restriction - PR [#1024](https://github.com/datajoint/datajoint-python/issues/1024)) PR [#1084](https://github.com/datajoint/datajoint-python/pull/1084) - Fixed - Added encapsulating double quotes to comply with [DOT language](https://graphviz.org/doc/info/lang.html) - PR [#1177](https://github.com/datajoint/datajoint-python/pull/1177) diff --git a/datajoint/version.py b/datajoint/version.py index 3f48dc939..5fb608cef 100644 --- a/datajoint/version.py +++ b/datajoint/version.py @@ -1,6 +1,6 @@ # version bump auto managed by Github Actions: # label_prs.yaml(prep), release.yaml(bump), post_release.yaml(edit) # manually set this version will be eventually overwritten by the above actions -__version__ = "0.14.4" +__version__ = "0.14.6" assert len(__version__) <= 10 # The log table limits version to the 10 characters From 88c0856ee196c14c21225fcd110077c4ff46bc13 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 28 Jul 2025 15:58:04 -0600 Subject: [PATCH 29/49] fix Dev Container configuration --- .devcontainer/docker-compose.yml | 6 +++--- Dockerfile | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/.devcontainer/docker-compose.yml b/.devcontainer/docker-compose.yml index 71d74e46f..449212a42 100644 --- a/.devcontainer/docker-compose.yml +++ b/.devcontainer/docker-compose.yml @@ -7,13 +7,13 @@ services: # docker-compose.yml file (the first in the devcontainer.json "dockerComposeFile" # array). The sample below assumes your primary file is in the root of your project. container_name: datajoint-python-devcontainer - image: datajoint/datajoint-python-devcontainer:${PY_VER:-3.11}-${DISTRO:-buster} + image: datajoint/datajoint-python-devcontainer:${PY_VER:-3.11}-${DISTRO:-bookworm} build: context: . 
- dockerfile: .devcontainer/Dockerfile + dockerfile: Dockerfile args: - PY_VER=${PY_VER:-3.11} - - DISTRO=${DISTRO:-buster} + - DISTRO=${DISTRO:-bookworm} volumes: # Update this to wherever you want VS Code to mount the folder of your project diff --git a/Dockerfile b/Dockerfile index dce8a6438..0d727f6b4 100644 --- a/Dockerfile +++ b/Dockerfile @@ -2,7 +2,7 @@ ARG IMAGE=mambaorg/micromamba:1.5-bookworm-slim FROM ${IMAGE} ARG CONDA_BIN=micromamba -ARG PY_VER=3.9 +ARG PY_VER=3.11 ARG HOST_UID=1000 RUN ${CONDA_BIN} install --no-pin -qq -y -n base -c conda-forge \ From 55e23ea965c6866c5511bea505dbe3060cc3642d Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 28 Jul 2025 16:09:54 -0600 Subject: [PATCH 30/49] revert to .devcontainer/Dockerfile --- .devcontainer/docker-compose.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.devcontainer/docker-compose.yml b/.devcontainer/docker-compose.yml index 449212a42..949243ce5 100644 --- a/.devcontainer/docker-compose.yml +++ b/.devcontainer/docker-compose.yml @@ -10,7 +10,7 @@ services: image: datajoint/datajoint-python-devcontainer:${PY_VER:-3.11}-${DISTRO:-bookworm} build: context: . - dockerfile: Dockerfile + dockerfile: .devcontainer/Dockerfile args: - PY_VER=${PY_VER:-3.11} - DISTRO=${DISTRO:-bookworm} From b588e16f75e161838b60fabb8c117c52ba12b5e9 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 28 Jul 2025 17:20:02 -0600 Subject: [PATCH 31/49] minor --- .devcontainer/docker-compose.yml | 1 - pyproject.toml | 1 + 2 files changed, 1 insertion(+), 1 deletion(-) diff --git a/.devcontainer/docker-compose.yml b/.devcontainer/docker-compose.yml index 949243ce5..5c22aaf14 100644 --- a/.devcontainer/docker-compose.yml +++ b/.devcontainer/docker-compose.yml @@ -1,4 +1,3 @@ -version: '2.4' services: # Update this to the name of the service you want to work with in your docker-compose.yml file app: diff --git a/pyproject.toml b/pyproject.toml index 075bb92b7..fd675bcb9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -27,6 +27,7 @@ dependencies = [ requires-python = ">=3.9,<4.0" authors = [ {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"}, + {name = "Thinh Nguen", email = "thinh@datajoint.com"} {name = "Raphael Guzman"}, {name = "Edgar Walker"}, {name = "DataJoint Contributors", email = "support@datajoint.com"}, From ad22ece289c53941d3dc7e82353d3d46ea6c4820 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 28 Jul 2025 17:45:00 -0600 Subject: [PATCH 32/49] remove old plugins --- datajoint/attribute_adapter.py | 7 +--- datajoint/connection.py | 37 ++------------------ datajoint/errors.py | 20 ----------- datajoint/plugin.py | 46 ------------------------- pyproject.toml | 4 +-- tests/test_plugin.py | 62 ---------------------------------- 6 files changed, 5 insertions(+), 171 deletions(-) delete mode 100644 datajoint/plugin.py delete mode 100644 tests/test_plugin.py diff --git a/datajoint/attribute_adapter.py b/datajoint/attribute_adapter.py index e062f4c57..2a8e59a51 100644 --- a/datajoint/attribute_adapter.py +++ b/datajoint/attribute_adapter.py @@ -1,7 +1,6 @@ import re from .errors import DataJointError, _support_adapted_types -from .plugin import type_plugins class AttributeAdapter: @@ -44,11 +43,7 @@ def get_adapter(context, adapter_name): raise DataJointError("Support for Adapted Attribute types is disabled.") adapter_name = adapter_name.lstrip("<").rstrip(">") try: - adapter = ( - context[adapter_name] - if adapter_name in context - else type_plugins[adapter_name]["object"].load() - ) + 
adapter = context[adapter_name] except KeyError: raise DataJointError( "Attribute adapter '{adapter_name}' is not defined.".format( diff --git a/datajoint/connection.py b/datajoint/connection.py index 6e21b5fef..8fae80cfa 100644 --- a/datajoint/connection.py +++ b/datajoint/connection.py @@ -16,7 +16,6 @@ from .blob import pack, unpack from .dependencies import Dependencies from .hash import uuid_from_buffer -from .plugin import connection_plugins from .settings import config from .version import __version__ @@ -27,33 +26,6 @@ cache_key = "query_cache" # the key to lookup the query_cache folder in dj.config -def get_host_hook(host_input): - if "://" in host_input: - plugin_name = host_input.split("://")[0] - try: - return connection_plugins[plugin_name]["object"].load().get_host(host_input) - except KeyError: - raise errors.DataJointError( - "Connection plugin '{}' not found.".format(plugin_name) - ) - else: - return host_input - - -def connect_host_hook(connection_obj): - if "://" in connection_obj.conn_info["host_input"]: - plugin_name = connection_obj.conn_info["host_input"].split("://")[0] - try: - connection_plugins[plugin_name]["object"].load().connect_host( - connection_obj - ) - except KeyError: - raise errors.DataJointError( - "Connection plugin '{}' not found.".format(plugin_name) - ) - else: - connection_obj.connect() - def translate_query_error(client_error, query): """ @@ -177,7 +149,6 @@ class Connection: """ def __init__(self, host, user, password, port=None, init_fun=None, use_tls=None): - host_input, host = (host, get_host_hook(host)) if ":" in host: # the port in the hostname overrides the port argument host, port = host.split(":") @@ -190,11 +161,9 @@ def __init__(self, host, user, password, port=None, init_fun=None, use_tls=None) use_tls if isinstance(use_tls, dict) else {"ssl": {}} ) self.conn_info["ssl_input"] = use_tls - self.conn_info["host_input"] = host_input self.init_fun = init_fun self._conn = None self._query_cache = None - connect_host_hook(self) if self.is_connected: logger.info( "DataJoint {version} connected to {user}@{host}:{port}".format( @@ -232,7 +201,7 @@ def connect(self): **{ k: v for k, v in self.conn_info.items() - if k not in ["ssl_input", "host_input"] + if k not in ["ssl_input"] }, ) except client.err.InternalError: @@ -245,7 +214,7 @@ def connect(self): k: v for k, v in self.conn_info.items() if not ( - k in ["ssl_input", "host_input"] + k in ["ssl_input"] or k == "ssl" and self.conn_info["ssl_input"] is None ) @@ -352,7 +321,7 @@ def query( if not reconnect: raise logger.warning("Reconnecting to MySQL server.") - connect_host_hook(self) + self.connect() if self._in_transaction: self.cancel_transaction() raise errors.LostConnectionError( diff --git a/datajoint/errors.py b/datajoint/errors.py index 427e8d1ad..03555bf13 100644 --- a/datajoint/errors.py +++ b/datajoint/errors.py @@ -5,32 +5,12 @@ import os -# --- Unverified Plugin Check --- -class PluginWarning(Exception): - pass - - # --- Top Level --- class DataJointError(Exception): """ Base class for errors specific to DataJoint internal operation. 
""" - def __init__(self, *args): - from .plugin import connection_plugins, type_plugins - - self.__cause__ = ( - PluginWarning("Unverified DataJoint plugin detected.") - if any( - [ - any([not plugins[k]["verified"] for k in plugins]) - for plugins in [connection_plugins, type_plugins] - if plugins - ] - ) - else None - ) - def suggest(self, *args): """ regenerate the exception with additional arguments diff --git a/datajoint/plugin.py b/datajoint/plugin.py deleted file mode 100644 index 8cb668092..000000000 --- a/datajoint/plugin.py +++ /dev/null @@ -1,46 +0,0 @@ -import logging -from pathlib import Path - -import pkg_resources -from cryptography.exceptions import InvalidSignature -from otumat import hash_pkg, verify - -from .settings import config - -logger = logging.getLogger(__name__.split(".")[0]) - - -def _update_error_stack(plugin_name): - try: - base_name = "datajoint" - base_meta = pkg_resources.get_distribution(base_name) - plugin_meta = pkg_resources.get_distribution(plugin_name) - - data = hash_pkg(pkgpath=str(Path(plugin_meta.module_path, plugin_name))) - signature = plugin_meta.get_metadata(f"{plugin_name}.sig") - pubkey_path = str(Path(base_meta.egg_info, f"{base_name}.pub")) - verify(pubkey_path=pubkey_path, data=data, signature=signature) - logger.info(f"DataJoint verified plugin `{plugin_name}` detected.") - return True - except (FileNotFoundError, InvalidSignature): - logger.warning(f"Unverified plugin `{plugin_name}` detected.") - return False - - -def _import_plugins(category): - return { - entry_point.name: dict( - object=entry_point, - verified=_update_error_stack(entry_point.module_name.split(".")[0]), - ) - for entry_point in pkg_resources.iter_entry_points( - "datajoint_plugins.{}".format(category) - ) - if "plugin" not in config - or category not in config["plugin"] - or entry_point.module_name.split(".")[0] in config["plugin"][category] - } - - -connection_plugins = _import_plugins("connection") -type_plugins = _import_plugins("datatype") diff --git a/pyproject.toml b/pyproject.toml index fd675bcb9..b41727bd0 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -18,16 +18,14 @@ dependencies = [ "pydot", "minio>=7.0.0", "matplotlib", - "otumat", "faker", - "cryptography", "urllib3", "setuptools", ] requires-python = ">=3.9,<4.0" authors = [ {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"}, - {name = "Thinh Nguen", email = "thinh@datajoint.com"} + {name = "Thinh Nguen", email = "thinh@datajoint.com"}, {name = "Raphael Guzman"}, {name = "Edgar Walker"}, {name = "DataJoint Contributors", email = "support@datajoint.com"}, diff --git a/tests/test_plugin.py b/tests/test_plugin.py deleted file mode 100644 index 7fd9aff22..000000000 --- a/tests/test_plugin.py +++ /dev/null @@ -1,62 +0,0 @@ -from os import path - -import pkg_resources -import pytest - -import datajoint.errors as djerr -import datajoint.plugin as p - - -@pytest.mark.skip(reason="marked for deprecation") -def test_check_pubkey(): - base_name = "datajoint" - base_meta = pkg_resources.get_distribution(base_name) - pubkey_meta = base_meta.get_metadata("{}.pub".format(base_name)) - - with open( - path.join(path.abspath(path.dirname(__file__)), "..", "datajoint.pub"), "r" - ) as f: - assert f.read() == pubkey_meta - - -def test_normal_djerror(): - try: - raise djerr.DataJointError - except djerr.DataJointError as e: - assert e.__cause__ is None - - -def test_verified_djerror(category="connection"): - try: - curr_plugins = getattr(p, "{}_plugins".format(category)) - setattr( - p, - 
"{}_plugins".format(category), - dict(test_plugin_id=dict(verified=True, object="example")), - ) - raise djerr.DataJointError - except djerr.DataJointError as e: - setattr(p, "{}_plugins".format(category), curr_plugins) - assert e.__cause__ is None - - -def test_verified_djerror_type(): - test_verified_djerror(category="type") - - -def test_unverified_djerror(category="connection"): - try: - curr_plugins = getattr(p, "{}_plugins".format(category)) - setattr( - p, - "{}_plugins".format(category), - dict(test_plugin_id=dict(verified=False, object="example")), - ) - raise djerr.DataJointError("hello") - except djerr.DataJointError as e: - setattr(p, "{}_plugins".format(category), curr_plugins) - assert isinstance(e.__cause__, djerr.PluginWarning) - - -def test_unverified_djerror_type(): - test_unverified_djerror(category="type") From ecee483576c98c6ac61977baf32fc9eec5359433 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 28 Jul 2025 17:46:06 -0600 Subject: [PATCH 33/49] typo --- pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index fd675bcb9..e67503e70 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -27,7 +27,7 @@ dependencies = [ requires-python = ">=3.9,<4.0" authors = [ {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"}, - {name = "Thinh Nguen", email = "thinh@datajoint.com"} + {name = "Thinh Nguen", email = "thinh@datajoint.com"}, {name = "Raphael Guzman"}, {name = "Edgar Walker"}, {name = "DataJoint Contributors", email = "support@datajoint.com"}, From 24bacc8ab2b0c7db166da3c5f5583b9b17bdecd3 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 28 Jul 2025 18:41:18 -0600 Subject: [PATCH 34/49] fix #1252 - deprecate otumat --- .vscode/launch.json | 25 +++++++++++++------------ datajoint/connection.py | 1 + 2 files changed, 14 insertions(+), 12 deletions(-) diff --git a/.vscode/launch.json b/.vscode/launch.json index 0746b2a85..ea4656fab 100644 --- a/.vscode/launch.json +++ b/.vscode/launch.json @@ -1,16 +1,17 @@ { - // Use IntelliSense to learn about possible attributes. - // Hover to view descriptions of existing attributes. 
- // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387 "version": "0.2.0", "configurations": [ - { - "name": "Python: Current File", - "type": "python", - "request": "launch", - "program": "${file}", - "console": "integratedTerminal", - "justMyCode": false - } + { + "name": "Debug pytest test", + "type": "python", + "request": "launch", + "module": "pytest", + "args": [ + "tests/", // Replace with your actual test folder or file + "-s" + ], + "console": "integratedTerminal", + "justMyCode": false + } ] -} + } \ No newline at end of file diff --git a/datajoint/connection.py b/datajoint/connection.py index 8fae80cfa..c68ef3da9 100644 --- a/datajoint/connection.py +++ b/datajoint/connection.py @@ -164,6 +164,7 @@ def __init__(self, host, user, password, port=None, init_fun=None, use_tls=None) self.init_fun = init_fun self._conn = None self._query_cache = None + self.connect() if self.is_connected: logger.info( "DataJoint {version} connected to {user}@{host}:{port}".format( From e7c0528d46164768cf10185939b243853cee7eae Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 28 Jul 2025 18:50:31 -0600 Subject: [PATCH 35/49] Update pyproject.toml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index b41727bd0..b98361fe8 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -25,7 +25,7 @@ dependencies = [ requires-python = ">=3.9,<4.0" authors = [ {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"}, - {name = "Thinh Nguen", email = "thinh@datajoint.com"}, + {name = "Thinh Nguyen", email = "thinh@datajoint.com"}, {name = "Raphael Guzman"}, {name = "Edgar Walker"}, {name = "DataJoint Contributors", email = "support@datajoint.com"}, From d1202f6a5b2c8cbd64b4f342a3ad3175bfd9afc4 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 01:02:41 +0000 Subject: [PATCH 36/49] fix: add pre-commit hook for black formatting --- .pre-commit-config.yaml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index b8992481a..4a58e0483 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -21,18 +21,18 @@ repos: hooks: - id: codespell - repo: https://github.com/pycqa/isort - rev: 5.12.0 # Use the latest stable version + rev: 6.0.1 # Use the latest stable version hooks: - id: isort args: - --profile=black # Optional, makes isort compatible with Black - repo: https://github.com/psf/black - rev: 24.2.0 # matching versions in pyproject.toml and github actions + rev: 25.1.0 # matching versions in pyproject.toml and github actions hooks: - id: black args: ["--check", "-v", "datajoint", "tests", "--diff"] # --required-version is conflicting with pre-commit - repo: https://github.com/PyCQA/flake8 - rev: 7.1.2 + rev: 7.3.0 hooks: # syntax tests - id: flake8 From 3af49a1ba4fe351d297c01334de2ce036760d229 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 01:13:59 +0000 Subject: [PATCH 37/49] Add SSH agent forwarding to devcontainer --- .devcontainer/devcontainer.json | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/.devcontainer/devcontainer.json b/.devcontainer/devcontainer.json index 9099347df..6ed3c52c4 100644 --- a/.devcontainer/devcontainer.json +++ b/.devcontainer/devcontainer.json @@ -24,6 +24,12 @@ 8080, 9000 ], + "mounts": [ + "type=bind,source=${env:SSH_AUTH_SOCK},target=/ssh-agent" + ], + "containerEnv": { + 
"SSH_AUTH_SOCK": "/ssh-agent" + }, // Uncomment the next line if you want start specific services in your Docker Compose config. // "runServices": [], // Uncomment the next line if you want to keep your containers running after VS Code shuts down. From 64e0ce92296632282c9739f53a97673d27c9a6ac Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 01:23:00 +0000 Subject: [PATCH 38/49] Restore launch.json --- .vscode/launch.json | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/.vscode/launch.json b/.vscode/launch.json index ea4656fab..0746b2a85 100644 --- a/.vscode/launch.json +++ b/.vscode/launch.json @@ -1,17 +1,16 @@ { + // Use IntelliSense to learn about possible attributes. + // Hover to view descriptions of existing attributes. + // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387 "version": "0.2.0", "configurations": [ - { - "name": "Debug pytest test", - "type": "python", - "request": "launch", - "module": "pytest", - "args": [ - "tests/", // Replace with your actual test folder or file - "-s" - ], - "console": "integratedTerminal", - "justMyCode": false - } + { + "name": "Python: Current File", + "type": "python", + "request": "launch", + "program": "${file}", + "console": "integratedTerminal", + "justMyCode": false + } ] - } \ No newline at end of file +} From 9342dd7f328581a3d51a75771f6c992c7b766bb1 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 01:35:31 +0000 Subject: [PATCH 39/49] formatting --- datajoint/connection.py | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/datajoint/connection.py b/datajoint/connection.py index c68ef3da9..f03650bfe 100644 --- a/datajoint/connection.py +++ b/datajoint/connection.py @@ -26,7 +26,6 @@ cache_key = "query_cache" # the key to lookup the query_cache folder in dj.config - def translate_query_error(client_error, query): """ Take client error and original query and return the corresponding DataJoint exception. @@ -214,11 +213,9 @@ def connect(self): **{ k: v for k, v in self.conn_info.items() - if not ( - k in ["ssl_input"] - or k == "ssl" - and self.conn_info["ssl_input"] is None - ) + if k == "ssl_input" + or k == "ssl" + and self.conn_info["ssl_input"] is None }, ) self._conn.autocommit(True) From cb332a9e3fa043d50c41bdb651ff8472e1798d7c Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 02:39:35 +0000 Subject: [PATCH 40/49] fix #1246, updated docs to explain three-part make pattern and generator function implementation --- docs/src/compute/populate.md | 187 +++++++++++++++++++++++++++++++++++ 1 file changed, 187 insertions(+) diff --git a/docs/src/compute/populate.md b/docs/src/compute/populate.md index 76fc62aee..8a1612281 100644 --- a/docs/src/compute/populate.md +++ b/docs/src/compute/populate.md @@ -65,6 +65,193 @@ The `make` callback does three things: `make` may populate multiple entities in one call when `key` does not specify the entire primary key of the populated table. +### Three-Part Make Pattern for Long Computations + +For long-running computations, DataJoint provides an advanced pattern called the +**three-part make** that separates the `make` method into three distinct phases. +This pattern is essential for maintaining database performance and data integrity +during expensive computations. 
+ +#### The Problem: Long Transactions + +Traditional `make` methods perform all operations within a single database transaction: + +```python +def make(self, key): + # All within one transaction + data = (ParentTable & key).fetch1() # Fetch + result = expensive_computation(data) # Compute (could take hours) + self.insert1(dict(key, result=result)) # Insert +``` + +This approach has significant limitations: +- **Database locks**: Long transactions hold locks on tables, blocking other operations +- **Connection timeouts**: Database connections may timeout during long computations +- **Memory pressure**: All fetched data must remain in memory throughout the computation +- **Failure recovery**: If computation fails, the entire transaction is rolled back + +#### The Solution: Three-Part Make Pattern + +The three-part make pattern splits the `make` method into three distinct phases, +allowing the expensive computation to occur outside of database transactions: + +```python +def make_fetch(self, key): + """Phase 1: Fetch all required data from parent tables""" + fetched_data = ((ParentTable & key).fetch1(),) + return fetched_data # must be a sequence, eg tuple or list + +def make_compute(self, key, *fetched_data): + """Phase 2: Perform expensive computation (outside transaction)""" + computed_result = expensive_computation(*fetched_data) + return computed_result # must be a sequence, eg tuple or list + +def make_insert(self, key, *computed_result): + """Phase 3: Insert results into the current table""" + self.insert1(dict(key, result=computed_result)) +``` + +#### Execution Flow + +To achieve data intensity without long transactions, the three-part make pattern follows this sophisticated execution sequence: + +```python +# Step 1: Fetch data outside transaction +fetched_data1 = self.make_fetch(key) +computed_result = self.make_compute(key, *fetched_data1) + +# Step 2: Begin transaction and verify data consistency +begin transaction: + fetched_data2 = self.make_fetch(key) + if fetched_data1 != fetched_data2: # deep comparison + cancel transaction # Data changed during computation + else: + self.make_insert(key, *computed_result) + commit_transaction +``` + +#### Key Benefits + +1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration +2. **Connection Efficiency**: Database connections are only used briefly for data transfer +3. **Memory Management**: Fetched data can be processed and released during computation +4. **Fault Tolerance**: Computation failures don't affect database state +5. **Scalability**: Multiple computations can run concurrently without database contention + +#### Referential Integrity Protection + +The pattern includes a critical safety mechanism: **referential integrity verification**. +Before inserting results, the system: + +1. Re-fetches the source data within the transaction +2. Compares it with the originally fetched data using deep hashing +3. Only proceeds with insertion if the data hasn't changed + +This prevents the "phantom read" problem where source data changes during long computations, +ensuring that results remain consistent with their inputs. 
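+
+For illustration, this check amounts to a deep-hash comparison of the two fetches,
+using the same `deepdiff.DeepHash` call that `autopopulate.py` relies on (the helper
+name below is illustrative):
+
+```python
+import deepdiff
+
+def inputs_unchanged(fetched_data1, fetched_data2):
+    """True when the re-fetched inputs hash identically to the original fetch."""
+    hash1 = deepdiff.DeepHash(fetched_data1, ignore_iterable_order=False)[fetched_data1]
+    hash2 = deepdiff.DeepHash(fetched_data2, ignore_iterable_order=False)[fetched_data2]
+    return hash1 == hash2
+```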
+ +#### Implementation Details + +The pattern is implemented using Python generators in the `AutoPopulate` class: + +```python +def make(self, key): + # Step 1: Fetch data from parent tables + fetched_data = self.make_fetch(key) + computed_result = yield fetched_data + + # Step 2: Compute if not provided + if computed_result is None: + computed_result = self.make_compute(key, *fetched_data) + yield computed_result + + # Step 3: Insert the computed result + self.make_insert(key, *computed_result) + yield +``` +Therefore, it is possible to override the `make` method to implement the three-part make pattern by using the `yield` statement to return the fetched data and computed result as above. + +#### Use Cases + +This pattern is particularly valuable for: + +- **Machine learning model training**: Hours-long training sessions +- **Image processing pipelines**: Large-scale image analysis +- **Statistical computations**: Complex statistical analyses +- **Data transformations**: ETL processes with heavy computation +- **Simulation runs**: Time-consuming simulations + +#### Example: Long-Running Image Analysis + +Here's an example of how to implement the three-part make pattern for a +long-running image analysis task: + +```python +@schema +class ImageAnalysis(dj.Computed): + definition = """ + # Complex image analysis results + -> Image + --- + analysis_result : longblob + processing_time : float + """ + + def make_fetch(self, key): + """Fetch the image data needed for analysis""" + return (Image & key).fetch1('image'), + + def make_compute(self, key, image_data): + """Perform expensive image analysis outside transaction""" + import time + start_time = time.time() + + # Expensive computation that could take hours + result = complex_image_analysis(image_data) + processing_time = time.time() - start_time + return result, processing_time + + def make_insert(self, key, analysis_result, processing_time): + """Insert the analysis results""" + self.insert1(dict(key, + analysis_result=analysis_result, + processing_time=processing_time)) +``` + +The exact same effect may be achieved by overriding the `make` method as a generator function using the `yield` statement to return the fetched data and computed result as above: + +```python +@schema +class ImageAnalysis(dj.Computed): + definition = """ + # Complex image analysis results + -> Image + --- + analysis_result : longblob + processing_time : float + """ + + def make(self, key): + fetched_data = (Image & key).fetch1('image'), + computed_result = yield fetched_data + + if computed_result is None: + # Expensive computation that could take hours + import time + start_time = time.time() + result = complex_image_analysis(image_data) + processing_time = time.time() - start_time + computed_result = result, processing_time + yield computed_result + + result, processing_time = computed_result + self.insert1(dict(key, + analysis_result=result, + processing_time=processing_time)) + yield # yield control back to the caller +``` +We expect that most users will prefer to use the three-part implementation over the generator function implementation due to its conceptual complexity. 
+ ## Populate The inherited `populate` method of `dj.Imported` and `dj.Computed` automatically calls From ce9b1a9110e168a095bc6871938a4e09399f7672 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 02:42:17 +0000 Subject: [PATCH 41/49] typo --- pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index e67503e70..c787cc11d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -27,7 +27,7 @@ dependencies = [ requires-python = ">=3.9,<4.0" authors = [ {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"}, - {name = "Thinh Nguen", email = "thinh@datajoint.com"}, + {name = "Thinh Nguyen", email = "thinh@datajoint.com"}, {name = "Raphael Guzman"}, {name = "Edgar Walker"}, {name = "DataJoint Contributors", email = "support@datajoint.com"}, From 66a3f649991511fc39670b106dccd3da66041e38 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 02:53:11 +0000 Subject: [PATCH 42/49] fix error in docs --- docs/src/compute/populate.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/src/compute/populate.md b/docs/src/compute/populate.md index 8a1612281..c05c6236e 100644 --- a/docs/src/compute/populate.md +++ b/docs/src/compute/populate.md @@ -232,8 +232,8 @@ class ImageAnalysis(dj.Computed): """ def make(self, key): - fetched_data = (Image & key).fetch1('image'), - computed_result = yield fetched_data + image_data = (Image & key).fetch1('image') + computed_result = yield (image, ) # pack fetched_data if computed_result is None: # Expensive computation that could take hours @@ -241,10 +241,10 @@ class ImageAnalysis(dj.Computed): start_time = time.time() result = complex_image_analysis(image_data) processing_time = time.time() - start_time - computed_result = result, processing_time + computed_result = result, processing_time #pack yield computed_result - result, processing_time = computed_result + result, processing_time = computed_result # unpack self.insert1(dict(key, analysis_result=result, processing_time=processing_time)) From 220eaf8220f9a4ac62e5d20ba61d2d10e2c1fd70 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 03:03:30 +0000 Subject: [PATCH 43/49] fix logic error in connection.py (introduced in a recent commit) --- datajoint/connection.py | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/datajoint/connection.py b/datajoint/connection.py index f03650bfe..21b1c97a4 100644 --- a/datajoint/connection.py +++ b/datajoint/connection.py @@ -213,9 +213,11 @@ def connect(self): **{ k: v for k, v in self.conn_info.items() - if k == "ssl_input" - or k == "ssl" - and self.conn_info["ssl_input"] is None + if not ( + k == "ssl_input" + or k == "ssl" + and self.conn_info["ssl_input"] is None + ) }, ) self._conn.autocommit(True) From b3009dbc5b5a0bcd0a6c95499cb2f1d05ea95310 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Mon, 28 Jul 2025 21:11:38 -0600 Subject: [PATCH 44/49] Update docs/src/compute/populate.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/src/compute/populate.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/compute/populate.md b/docs/src/compute/populate.md index c05c6236e..329723fec 100644 --- a/docs/src/compute/populate.md +++ b/docs/src/compute/populate.md @@ -233,7 +233,7 @@ class ImageAnalysis(dj.Computed): def make(self, key): image_data = (Image & key).fetch1('image') - computed_result = yield (image, ) # pack fetched_data + computed_result = yield 
(image_data, ) # pack fetched_data if computed_result is None: # Expensive computation that could take hours From 4a93aad39267b45a016ea2e90fa4cf8fc475c395 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Tue, 29 Jul 2025 03:52:45 +0000 Subject: [PATCH 45/49] documentation:Move explanation of the make method to the make.md file. --- docs/src/compute/make.md | 190 +++++++++++++++++++++++++++++++++++ docs/src/compute/populate.md | 190 +---------------------------------- 2 files changed, 191 insertions(+), 189 deletions(-) diff --git a/docs/src/compute/make.md b/docs/src/compute/make.md index c67711079..1b5569b65 100644 --- a/docs/src/compute/make.md +++ b/docs/src/compute/make.md @@ -23,3 +23,193 @@ The `make` call of a master table first inserts the master entity and then inser the matching part entities in the part tables. None of the entities become visible to other processes until the entire `make` call completes, at which point they all become visible. + +### Three-Part Make Pattern for Long Computations + +For long-running computations, DataJoint provides an advanced pattern called the +**three-part make** that separates the `make` method into three distinct phases. +This pattern is essential for maintaining database performance and data integrity +during expensive computations. + +#### The Problem: Long Transactions + +Traditional `make` methods perform all operations within a single database transaction: + +```python +def make(self, key): + # All within one transaction + data = (ParentTable & key).fetch1() # Fetch + result = expensive_computation(data) # Compute (could take hours) + self.insert1(dict(key, result=result)) # Insert +``` + +This approach has significant limitations: +- **Database locks**: Long transactions hold locks on tables, blocking other operations +- **Connection timeouts**: Database connections may timeout during long computations +- **Memory pressure**: All fetched data must remain in memory throughout the computation +- **Failure recovery**: If computation fails, the entire transaction is rolled back + +#### The Solution: Three-Part Make Pattern + +The three-part make pattern splits the `make` method into three distinct phases, +allowing the expensive computation to occur outside of database transactions: + +```python +def make_fetch(self, key): + """Phase 1: Fetch all required data from parent tables""" + fetched_data = ((ParentTable1 & key).fetch1(), (ParentTable2 & key).fetch1()) + return fetched_data # must be a sequence, eg tuple or list + +def make_compute(self, key, *fetched_data): + """Phase 2: Perform expensive computation (outside transaction)""" + computed_result = expensive_computation(*fetched_data) + return computed_result # must be a sequence, eg tuple or list + +def make_insert(self, key, *computed_result): + """Phase 3: Insert results into the current table""" + self.insert1(dict(key, result=computed_result)) +``` + +#### Execution Flow + +To achieve data intensity without long transactions, the three-part make pattern follows this sophisticated execution sequence: + +```python +# Step 1: Fetch data outside transaction +fetched_data1 = self.make_fetch(key) +computed_result = self.make_compute(key, *fetched_data1) + +# Step 2: Begin transaction and verify data consistency +begin transaction: + fetched_data2 = self.make_fetch(key) + if fetched_data1 != fetched_data2: # deep comparison + cancel transaction # Data changed during computation + else: + self.make_insert(key, *computed_result) + commit_transaction +``` + +#### Key Benefits + +1. 
**Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration +2. **Connection Efficiency**: Database connections are only used briefly for data transfer +3. **Memory Management**: Fetched data can be processed and released during computation +4. **Fault Tolerance**: Computation failures don't affect database state +5. **Scalability**: Multiple computations can run concurrently without database contention + +#### Referential Integrity Protection + +The pattern includes a critical safety mechanism: **referential integrity verification**. +Before inserting results, the system: + +1. Re-fetches the source data within the transaction +2. Compares it with the originally fetched data using deep hashing +3. Only proceeds with insertion if the data hasn't changed + +This prevents the "phantom read" problem where source data changes during long computations, +ensuring that results remain consistent with their inputs. + +#### Implementation Details + +The pattern is implemented using Python generators in the `AutoPopulate` class: + +```python +def make(self, key): + # Step 1: Fetch data from parent tables + fetched_data = self.make_fetch(key) + computed_result = yield fetched_data + + # Step 2: Compute if not provided + if computed_result is None: + computed_result = self.make_compute(key, *fetched_data) + yield computed_result + + # Step 3: Insert the computed result + self.make_insert(key, *computed_result) + yield +``` +Therefore, it is possible to override the `make` method to implement the three-part make pattern by using the `yield` statement to return the fetched data and computed result as above. + +#### Use Cases + +This pattern is particularly valuable for: + +- **Machine learning model training**: Hours-long training sessions +- **Image processing pipelines**: Large-scale image analysis +- **Statistical computations**: Complex statistical analyses +- **Data transformations**: ETL processes with heavy computation +- **Simulation runs**: Time-consuming simulations + +#### Example: Long-Running Image Analysis + +Here's an example of how to implement the three-part make pattern for a +long-running image analysis task: + +```python +@schema +class ImageAnalysis(dj.Computed): + definition = """ + # Complex image analysis results + -> Image + --- + analysis_result : longblob + processing_time : float + """ + + def make_fetch(self, key): + """Fetch the image data needed for analysis""" + image_data = (Image & key).fetch1('image') + params = (Params & key).fetch1('params') + return (image_data, params) # pack fetched_data + + def make_compute(self, key, image_data, params): + """Perform expensive image analysis outside transaction""" + import time + start_time = time.time() + + # Expensive computation that could take hours + result = complex_image_analysis(image_data, params) + processing_time = time.time() - start_time + return result, processing_time + + def make_insert(self, key, analysis_result, processing_time): + """Insert the analysis results""" + self.insert1(dict(key, + analysis_result=analysis_result, + processing_time=processing_time)) +``` + +The exact same effect may be achieved by overriding the `make` method as a generator function using the `yield` statement to return the fetched data and computed result as above: + +```python +@schema +class ImageAnalysis(dj.Computed): + definition = """ + # Complex image analysis results + -> Image + --- + analysis_result : longblob + processing_time : float + """ + + def make(self, key): 
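+        # A brief sketch of the flow described above (assuming the generator-driving
+        # behavior of `AutoPopulate`): the first `yield` hands the fetched inputs to
+        # the caller, which sends back either a previously computed result or None,
+        # in which case the computation runs below.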
+ image_data = (Image & key).fetch1('image') + params = (Params & key).fetch1('params') + computed_result = yield (image, params) # pack fetched_data + + if computed_result is None: + # Expensive computation that could take hours + import time + start_time = time.time() + result = complex_image_analysis(image_data, params) + processing_time = time.time() - start_time + computed_result = result, processing_time #pack + yield computed_result + + result, processing_time = computed_result # unpack + self.insert1(dict(key, + analysis_result=result, + processing_time=processing_time)) + yield # yield control back to the caller +``` +We expect that most users will prefer to use the three-part implementation over the generator function implementation due to its conceptual complexity. \ No newline at end of file diff --git a/docs/src/compute/populate.md b/docs/src/compute/populate.md index c05c6236e..eb7ae5f0b 100644 --- a/docs/src/compute/populate.md +++ b/docs/src/compute/populate.md @@ -62,195 +62,7 @@ The `make` callback does three things: 2. Computes and adds any missing attributes to the fields already in `key`. 3. Inserts the entire entity into `self`. -`make` may populate multiple entities in one call when `key` does not specify the -entire primary key of the populated table. - -### Three-Part Make Pattern for Long Computations - -For long-running computations, DataJoint provides an advanced pattern called the -**three-part make** that separates the `make` method into three distinct phases. -This pattern is essential for maintaining database performance and data integrity -during expensive computations. - -#### The Problem: Long Transactions - -Traditional `make` methods perform all operations within a single database transaction: - -```python -def make(self, key): - # All within one transaction - data = (ParentTable & key).fetch1() # Fetch - result = expensive_computation(data) # Compute (could take hours) - self.insert1(dict(key, result=result)) # Insert -``` - -This approach has significant limitations: -- **Database locks**: Long transactions hold locks on tables, blocking other operations -- **Connection timeouts**: Database connections may timeout during long computations -- **Memory pressure**: All fetched data must remain in memory throughout the computation -- **Failure recovery**: If computation fails, the entire transaction is rolled back - -#### The Solution: Three-Part Make Pattern - -The three-part make pattern splits the `make` method into three distinct phases, -allowing the expensive computation to occur outside of database transactions: - -```python -def make_fetch(self, key): - """Phase 1: Fetch all required data from parent tables""" - fetched_data = ((ParentTable & key).fetch1(),) - return fetched_data # must be a sequence, eg tuple or list - -def make_compute(self, key, *fetched_data): - """Phase 2: Perform expensive computation (outside transaction)""" - computed_result = expensive_computation(*fetched_data) - return computed_result # must be a sequence, eg tuple or list - -def make_insert(self, key, *computed_result): - """Phase 3: Insert results into the current table""" - self.insert1(dict(key, result=computed_result)) -``` - -#### Execution Flow - -To achieve data intensity without long transactions, the three-part make pattern follows this sophisticated execution sequence: - -```python -# Step 1: Fetch data outside transaction -fetched_data1 = self.make_fetch(key) -computed_result = self.make_compute(key, *fetched_data1) - -# Step 2: Begin transaction and verify 
data consistency -begin transaction: - fetched_data2 = self.make_fetch(key) - if fetched_data1 != fetched_data2: # deep comparison - cancel transaction # Data changed during computation - else: - self.make_insert(key, *computed_result) - commit_transaction -``` - -#### Key Benefits - -1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration -2. **Connection Efficiency**: Database connections are only used briefly for data transfer -3. **Memory Management**: Fetched data can be processed and released during computation -4. **Fault Tolerance**: Computation failures don't affect database state -5. **Scalability**: Multiple computations can run concurrently without database contention - -#### Referential Integrity Protection - -The pattern includes a critical safety mechanism: **referential integrity verification**. -Before inserting results, the system: - -1. Re-fetches the source data within the transaction -2. Compares it with the originally fetched data using deep hashing -3. Only proceeds with insertion if the data hasn't changed - -This prevents the "phantom read" problem where source data changes during long computations, -ensuring that results remain consistent with their inputs. - -#### Implementation Details - -The pattern is implemented using Python generators in the `AutoPopulate` class: - -```python -def make(self, key): - # Step 1: Fetch data from parent tables - fetched_data = self.make_fetch(key) - computed_result = yield fetched_data - - # Step 2: Compute if not provided - if computed_result is None: - computed_result = self.make_compute(key, *fetched_data) - yield computed_result - - # Step 3: Insert the computed result - self.make_insert(key, *computed_result) - yield -``` -Therefore, it is possible to override the `make` method to implement the three-part make pattern by using the `yield` statement to return the fetched data and computed result as above. 
- -#### Use Cases - -This pattern is particularly valuable for: - -- **Machine learning model training**: Hours-long training sessions -- **Image processing pipelines**: Large-scale image analysis -- **Statistical computations**: Complex statistical analyses -- **Data transformations**: ETL processes with heavy computation -- **Simulation runs**: Time-consuming simulations - -#### Example: Long-Running Image Analysis - -Here's an example of how to implement the three-part make pattern for a -long-running image analysis task: - -```python -@schema -class ImageAnalysis(dj.Computed): - definition = """ - # Complex image analysis results - -> Image - --- - analysis_result : longblob - processing_time : float - """ - - def make_fetch(self, key): - """Fetch the image data needed for analysis""" - return (Image & key).fetch1('image'), - - def make_compute(self, key, image_data): - """Perform expensive image analysis outside transaction""" - import time - start_time = time.time() - - # Expensive computation that could take hours - result = complex_image_analysis(image_data) - processing_time = time.time() - start_time - return result, processing_time - - def make_insert(self, key, analysis_result, processing_time): - """Insert the analysis results""" - self.insert1(dict(key, - analysis_result=analysis_result, - processing_time=processing_time)) -``` - -The exact same effect may be achieved by overriding the `make` method as a generator function using the `yield` statement to return the fetched data and computed result as above: - -```python -@schema -class ImageAnalysis(dj.Computed): - definition = """ - # Complex image analysis results - -> Image - --- - analysis_result : longblob - processing_time : float - """ - - def make(self, key): - image_data = (Image & key).fetch1('image') - computed_result = yield (image, ) # pack fetched_data - - if computed_result is None: - # Expensive computation that could take hours - import time - start_time = time.time() - result = complex_image_analysis(image_data) - processing_time = time.time() - start_time - computed_result = result, processing_time #pack - yield computed_result - - result, processing_time = computed_result # unpack - self.insert1(dict(key, - analysis_result=result, - processing_time=processing_time)) - yield # yield control back to the caller -``` -We expect that most users will prefer to use the three-part implementation over the generator function implementation due to its conceptual complexity. +`make` may populate multiple entities in one call when `key` does not specify the entire primary key of the populated table. 
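+
+For example, a minimal `make` for a computed table might look like this (the
+table and attribute names below are illustrative, not part of this codebase):
+
+```python
+@schema
+class ImageMean(dj.Computed):
+    definition = """
+    -> Image
+    ---
+    mean_intensity : float
+    """
+
+    def make(self, key):
+        image = (Image & key).fetch1('image')                    # 1. fetch inputs for `key`
+        mean_intensity = image.mean()                            # 2. compute the missing attributes
+        self.insert1(dict(key, mean_intensity=mean_intensity))   # 3. insert the entire entity into self
+```
+
+Calling `ImageMean.populate()` would then invoke this `make` once for every
+`Image` entry that does not yet have a matching `ImageMean` entry.
+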
## Populate From a9eaf22b8064ff5ada62cf70f71072ff15d5959a Mon Sep 17 00:00:00 2001 From: github-actions Date: Thu, 31 Jul 2025 22:06:46 +0000 Subject: [PATCH 46/49] Update version.py to 0.14.6 --- datajoint/version.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datajoint/version.py b/datajoint/version.py index b51d5935a..5fb608cef 100644 --- a/datajoint/version.py +++ b/datajoint/version.py @@ -1,6 +1,6 @@ # version bump auto managed by Github Actions: # label_prs.yaml(prep), release.yaml(bump), post_release.yaml(edit) # manually set this version will be eventually overwritten by the above actions -__version__ = "0.14.5" +__version__ = "0.14.6" assert len(__version__) <= 10 # The log table limits version to the 10 characters From 258bb4cceef42fe21572377680e47a9fdf47601c Mon Sep 17 00:00:00 2001 From: github-actions Date: Thu, 31 Jul 2025 22:06:46 +0000 Subject: [PATCH 47/49] Update README.md badge to v0.14.6 --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index da0ce3c02..e582c8ec5 100644 --- a/README.md +++ b/README.md @@ -30,8 +30,8 @@ Since Release - - commit since last release + + commit since last release From 8631da4ca1aefadf41d1bab496d67a2093faf330 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Sun, 10 Aug 2025 12:17:36 -0500 Subject: [PATCH 48/49] doc typo Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/src/compute/populate.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/compute/populate.md b/docs/src/compute/populate.md index 476f86330..45c863f17 100644 --- a/docs/src/compute/populate.md +++ b/docs/src/compute/populate.md @@ -64,7 +64,7 @@ The `make` callback does three things: A single `make` call may populate multiple entities when `key` does not specify the entire primary key of the populated table, when the definition adds new attributes to the primary key. -This is design is uncommon and not recommended. +This design is uncommon and not recommended. The standard practice for autopopulated tables is to have its primary key composed of foreign keys pointing to parent tables. From e40a2589fd734bf91b2b319f10067cd8a8e52ac5 Mon Sep 17 00:00:00 2001 From: Davis Vann Bennett Date: Thu, 28 Aug 2025 13:52:18 +0000 Subject: [PATCH 49/49] bump python version in docker compose to ensure that tests pass --- docker-compose.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docker-compose.yaml b/docker-compose.yaml index d09d06d49..40b211756 100644 --- a/docker-compose.yaml +++ b/docker-compose.yaml @@ -40,7 +40,7 @@ services: context: . dockerfile: Dockerfile args: - PY_VER: ${PY_VER:-3.8} + PY_VER: ${PY_VER:-3.9} HOST_UID: ${HOST_UID:-1000} depends_on: db:
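As a supplementary illustration of the logic error addressed in PATCH 43
(`datajoint/connection.py`): in Python, `and` binds more tightly than `or`, so without
the added `not (...)` the comprehension kept *only* the ssl-related keys and dropped the
real connection arguments. A standalone sketch (key names and values are made up for
illustration):

```python
conn_info = {"host": "db", "user": "root", "passwd": "x",
             "ssl": {"ca": "/path/ca.pem"}, "ssl_input": None}

# Before the fix: `k == "ssl_input" or (k == "ssl" and ...)` selects the ssl keys,
# so only they survive the filter -- the opposite of what connect() needs.
broken = {k: v for k, v in conn_info.items()
          if k == "ssl_input" or k == "ssl" and conn_info["ssl_input"] is None}
assert set(broken) == {"ssl", "ssl_input"}

# After the fix: wrapping the same predicate in `not (...)` always drops `ssl_input`
# and drops `ssl` only when it was not explicitly provided.
fixed = {k: v for k, v in conn_info.items()
         if not (k == "ssl_input" or k == "ssl" and conn_info["ssl_input"] is None)}
assert set(fixed) == {"host", "user", "passwd"}
```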