Steve Hoberman

Author of 9 books including “The Rosedata Stone”
Data Modeling Zone Conference Chair
Inventor of the Data Model Scorecard
Columbia University Applied Analytics Professor
Steve has been a data modeler for over 30 years, and thousands of business and data professionals have completed his Data Modeling Master Class. He has written nine books on data modeling, including The Rosedata Stone and Data Modeling Made Simple, as well as the blockchain bestseller Blockchainopoly. A frequent consulting assignment of Steve’s is reviewing data models using the Data Model Scorecard® technique. He wrote the Data Modeling Institute’s Data Modeling Certification exam, chairs the Data Modeling Zone conferences, manages Technics Publications, lectures at Columbia University, and received the Data Administration Management Association (DAMA) International Professional Achievement Award.
Are you following Steve on Twitter or LinkedIn? Every once in a while, inspiration kicks in and a little tweet pops out.
Steve the Data Modeler
Steve the Superhero
- Steve Solves Stonehenge"These stones represent the Druid’s first attempt at data modeling,” says Steve. “In the days before data modeling tools, early data modelers had to use stone.”
- Steve Saves CitySteve catches boulder and saves city below from avalanche destruction.
- Data Modeling ProdigySteve invents 8NF (Eighth Normal Form) at age 8! Imagine what level of normalization he will invent when he grows up!
- Steve Rescues CrewSteve amazingly pulls broken boat (his own boat) across the depths and perils of the Barnegat Bay.
Data Modeling Challenges
- Can an indicator be NULL? An attribute that is defined as an indicator has only two values, such as Active/Inactive, Yes/No, True/False, and On/Off. I was asked during a training class this week if NULL (empty) can also be a value. Can an indicator attribute have three values, such as Active, Inactive, and NULL? That is, can an indicator attribute be optional? If an indicator attribute value is always required, then only two values are possible (e.g., Active/Inactive), and a NULL value, if it occurs, will be replaced with a default value (such as Active). Which do you think is better and why: an indicator attribute that is defined as NOT NULL and uses a default value, or an attribute that is defined as NULL and allows the attribute to be optional? The two designs are sketched below.
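To make the two options concrete, here is a minimal SQL DDL sketch; the table and column names are hypothetical, and the codes 'A' and 'I' are just one way to encode the indicator:

    -- Option 1: NOT NULL with a default value; only two states are possible
    CREATE TABLE Student_Status_Option_1 (
        Student_Number     INTEGER NOT NULL,
        Student_Active_Ind CHAR(1) DEFAULT 'A' NOT NULL,  -- 'A' = Active, 'I' = Inactive
        CONSTRAINT Student_Status_1_PK PRIMARY KEY (Student_Number),
        CONSTRAINT Student_Active_Ind_1_CK CHECK (Student_Active_Ind IN ('A', 'I'))
    );

    -- Option 2: nullable indicator; NULL acts as a third, "unknown" state
    CREATE TABLE Student_Status_Option_2 (
        Student_Number     INTEGER NOT NULL,
        Student_Active_Ind CHAR(1),                        -- 'A', 'I', or NULL
        CONSTRAINT Student_Status_2_PK PRIMARY KEY (Student_Number),
        -- A CHECK that evaluates to unknown still passes, so NULL rows are allowed
        CONSTRAINT Student_Active_Ind_2_CK CHECK (Student_Active_Ind IN ('A', 'I'))
    );

Note that Option 2 forces every consumer of the data to decide what NULL means (unknown? not yet set? not applicable?), while Option 1 forces the modeler to decide up front.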
- Ambiguity in definitions: I review many data models throughout the year using the Data Model Scorecard technique, and over 95% of the entity and attribute definitions reviewed lack precision. There are certain terms in a definition that, if not explained, automatically make the definition imprecise and vague. The terms I look for include “Generally,” “Sometimes,” “Normally,” “Most of the time,” and “With few exceptions.”
What terms in a definition do you look for that scream ambiguity?
- Is stability an important property for a candidate key on the logical data model? In my data modeling classes, I stress that a candidate key must have three properties (a SQL sketch follows the list):
Unique. If we have 100 students, we will have 100 unique Student Numbers and 100 unique combinations of Student First Name, Student Last Name, and Student Birth Date. For example, we cannot (on this model) have two students with the same first and last name born on the same date.
Mandatory. A candidate key must always be populated with a value; for example, Student Number cannot be null (empty).
Stable. We cannot update the value of a candidate key. For example, we cannot change Student Number 123 to a different value such as 124.
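Here is a minimal sketch, assuming a hypothetical Student table, of how the three properties surface in SQL DDL. Unique and mandatory can be declared; stability cannot be expressed in standard SQL and is usually enforced by policy or triggers:

    CREATE TABLE Student (
        Student_Number     INTEGER     NOT NULL,  -- mandatory
        Student_First_Name VARCHAR(50) NOT NULL,
        Student_Last_Name  VARCHAR(50) NOT NULL,
        Student_Birth_Date DATE        NOT NULL,
        -- Candidate key 1, chosen as the primary key: unique and mandatory
        CONSTRAINT Student_PK PRIMARY KEY (Student_Number),
        -- Candidate key 2, the alternate key: also unique and mandatory
        CONSTRAINT Student_AK UNIQUE (Student_First_Name, Student_Last_Name, Student_Birth_Date)
    );
    -- Stability is a policy, not a declarable constraint; nothing above stops
    -- UPDATE Student SET Student_Number = 124 WHERE Student_Number = 123;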
Someone challenged me recently on the stability property. His comment was: if a logical data model represents a point-in-time perspective (that is, it is timeless), why is the stability property relevant? In our physical design, we manage how data changes over time, such as a slowly changing dimension managing changes to Student Last Name, but is stability important on the logical? Why or why not?
What are your thoughts?
- A data modeler joined a panel…: “A data modeler joined a panel…” sounds like the beginning of a good joke. However, a couple of weeks ago I participated in a data management panel at Columbia University.
I volunteered to participate in this panel because another theme from this past year’s Data Modeling Hackathon was educating university students on data modeling. I wasn’t sure how many students in this audience knew data modeling. However, I was shocked when the first question posed to the panelists was directed at me! A student asked, “How relevant is data modeling in the world of NoSQL?” I smiled and gave my answer.
How would you answer this question?
- Can a conceptual data model contain attributes? I was asked recently to review a conceptual data model (CDM). My expectation was that the model would contain somewhere between 50 and 100 entities. Upon entering the room and seeing the model for the first time, I realized my assumption was correct, as there were close to 100 entities. However, each of the entities on the model contained attributes! There were over 500 attributes on the model.
The facilitator stressed that this is a conceptual data model because all of the entities and attributes represent business concepts. The business created this model to capture its own view of the business. The model was extremely well-organized for readability, and there was very little abstraction, as these terms were true business terms.
What do you think? Can a conceptual data model contain attributes? Please share your thoughts.
- Blockchain on data modeling: I am talking about the impact of blockchain on data modeling for an upcoming ER/Studio Webinar, and here are a few of the impacts I think we will face:
Herding cats. If you have the word “data” or “analyst” in your job title, no doubt at some point you have come across Excel spreadsheets or Access databases that were used to run critical parts of the business. Excel and Access are easy to use and quick to get up and running, and therefore they exist in too many places and we lose control over this data. Blockchain is Excel on steroids, allowing us to store more data in spreadsheet format easily and letting many more people use the spreadsheet than if it existed on a single computer. We will need to model the data put on the blockchain, before it is put on the blockchain, making the logical data model an even more important deliverable.
It’s about the keys. Blockchain creates a “trustless” system by never exposing sensitive business keys, such as Social Security numbers or credit card information. Blockchain calls the business keys “private keys” and the keys that can be exposed “public keys.” Most business keys will be treated as private keys, and most surrogate keys will become public keys. However, there will be exceptions, and the physical data model will need to capture not just surrogate and business keys, but also public and private keys. There could be additional keys, or the same key may play multiple roles. A hypothetical sketch of how a physical model might record this appears after the list.
Bet on the enterprise data model. Blockchain will require tighter alignment between data modeling and data architecture, making the enterprise data model even more significant. We need to know what data is in which blockchains, for example. We need to know, if we modify the length of an attribute, what the impact will be. We need to know, if we retire a blockchain, who needs to be notified.
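As a purely illustrative sketch (the table, columns, and classification codes below are hypothetical, not an established standard), a blockchain-aware physical model could tag each key by both its origin and its exposure:

    -- Hypothetical reference table: each key column is classified two ways
    CREATE TABLE Key_Classification (
        Table_Name        VARCHAR(128) NOT NULL,
        Column_Name       VARCHAR(128) NOT NULL,
        Key_Origin_Code   CHAR(1)      NOT NULL,  -- 'B' = business key, 'S' = surrogate key
        Key_Exposure_Code CHAR(1)      NOT NULL,  -- 'P' = private (never exposed), 'X' = public
        CONSTRAINT Key_Classification_PK PRIMARY KEY (Table_Name, Column_Name),
        CONSTRAINT Key_Origin_CK   CHECK (Key_Origin_Code IN ('B', 'S')),
        CONSTRAINT Key_Exposure_CK CHECK (Key_Exposure_Code IN ('P', 'X'))
    );

    -- Example rows showing the typical pattern
    INSERT INTO Key_Classification VALUES ('Customer', 'Customer_Tax_ID', 'B', 'P');
    INSERT INTO Key_Classification VALUES ('Customer', 'Customer_SK',     'S', 'X');

Because origin and exposure are captured as separate codes, the exceptions mentioned above (for example, a business key that is safely public) can be recorded without bending the typical pattern.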
Please share your thoughts!
- Foreign key on a logical data model? I often get into passionate discussions with data modelers on whether a logical data model (LDM) should contain foreign keys. Those who believe an LDM should not include foreign keys explain that foreign keys are relational database constructs and therefore should appear only on the physical model. Those who show foreign keys explain that, among other reasons, foreign keys often increase model readability by making it easier to see how entities relate to each other. What are your thoughts? Do you show foreign keys on a logical model? What are your reasons? A sketch of the construct under debate appears below.
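For concreteness, here is a minimal sketch using hypothetical Customer and Customer_Order entities. On the physical side the foreign key constraint always exists; the debate is whether Customer_Number should also appear as an attribute of the Order entity on the logical model, or whether the relationship line alone should imply it:

    CREATE TABLE Customer (
        Customer_Number INTEGER      NOT NULL,
        Customer_Name   VARCHAR(100) NOT NULL,
        CONSTRAINT Customer_PK PRIMARY KEY (Customer_Number)
    );

    CREATE TABLE Customer_Order (
        Order_Number     INTEGER NOT NULL,
        Customer_Number  INTEGER NOT NULL,  -- the migrated foreign key under debate
        Order_Entry_Date DATE    NOT NULL,
        CONSTRAINT Customer_Order_PK PRIMARY KEY (Order_Number),
        -- The physical model always carries this constraint; the question is
        -- whether the logical model should display Customer_Number at all
        CONSTRAINT Customer_Order_FK FOREIGN KEY (Customer_Number)
            REFERENCES Customer (Customer_Number)
    );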
- From VSAM to MongoDB: An organization facing big data pressures is migrating from VSAM to MongoDB. Although VSAM is 1970s technology, it stores data hierarchically, which is very similar in structure to NoSQL document-based databases such as MongoDB. Therefore, the data migration from VSAM to MongoDB could be straightforward. However, there is a debate going on over whether a fully normalized relational logical data model should be built prior to migrating to MongoDB. Normalizing leads to well-understood structures and enables integration with existing data models. Some feel, though, that the effort to normalize will not be worth the value, because the current lack of metadata in the VSAM structures will lead to many unanswered questions that will only get answered once users can access the data in MongoDB. Also, there is a current lack of tools for maintaining the lineage between the relational logical and the NoSQL physical, so the logical data model built may quickly become out of sync with its physical. What are your thoughts? A sketch contrasting the two shapes follows.
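To illustrate the contrast (the Order_Header and Order_Line names, and the sample document, are hypothetical), the same hierarchical record can be modeled both ways; the normalized relational form is shown as DDL, with the equivalent MongoDB document shape in a comment:

    -- Normalized relational form of a hierarchical VSAM-style record
    CREATE TABLE Order_Header (
        Order_Number     INTEGER NOT NULL,
        Order_Entry_Date DATE    NOT NULL,
        CONSTRAINT Order_Header_PK PRIMARY KEY (Order_Number)
    );

    CREATE TABLE Order_Line (
        Order_Number INTEGER     NOT NULL,
        Line_Number  INTEGER     NOT NULL,
        Product_Code VARCHAR(20) NOT NULL,
        Quantity     INTEGER     NOT NULL,
        CONSTRAINT Order_Line_PK PRIMARY KEY (Order_Number, Line_Number),
        CONSTRAINT Order_Line_FK FOREIGN KEY (Order_Number)
            REFERENCES Order_Header (Order_Number)
    );

    /* The same record as a single MongoDB document keeps the hierarchy
       intact, much like the original VSAM layout:
       {
         "orderNumber": 123,
         "orderEntryDate": "2018-06-01",
         "lines": [
           { "lineNumber": 1, "productCode": "ABC", "quantity": 2 },
           { "lineNumber": 2, "productCode": "XYZ", "quantity": 1 }
         ]
       }
    */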
- Unstructured data classwords: Each attribute on our data model ends in a classword. For example, “Number” is the classword in Account Number, “Name” is the classword in Customer Last Name, and “Date” is the classword in Order Entry Date. As we start modeling more and more unstructured data, we need to get more precise than the generic “Object” or “Blob” classwords, which today are often the catch-all for video, audio, photos, documents, social media posts, etc. For example, we may introduce a “Document” classword for Microsoft Word, PDF, Google Docs, etc. What other classwords would you recommend we introduce? A small illustration appears below.
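As a small illustration (the table and the proposed classwords are hypothetical), classwords appear as the final word of each attribute name and often drive a default datatype. A proposed “Document” or “Audio” classword may still map to BLOB physically; the precision it adds is semantic:

    -- Classwords as attribute-name suffixes, each with a default datatype
    CREATE TABLE Customer_Interaction (
        Customer_Interaction_Number INTEGER       NOT NULL,  -- "Number" classword
        Customer_Interaction_Date   DATE          NOT NULL,  -- "Date" classword
        Customer_Complaint_Text     VARCHAR(4000),            -- "Text" classword
        Customer_Contract_Document  BLOB,                     -- proposed "Document" classword
        Customer_Call_Audio         BLOB,                     -- proposed "Audio" classword
        CONSTRAINT Customer_Interaction_PK PRIMARY KEY (Customer_Interaction_Number)
    );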
- Enterprise Data Model ROI: Here’s the challenge. You are the manager of a newly formed enterprise modeling group, and you need to “sell” the enterprise data model (EDM) to senior management. You plan on explaining the benefits of increased business understanding across departments, higher quality data, and lower software development costs over the long term. But…you know they will ask you what the Return On Investment (ROI) will be for the EDM. How would you answer the question, “What is the EDM ROI?”
- Broken shells: I was about to go for a run on the beach when my 4-year-old daughter asked, “Daddy, can you bring me back some shells? It’s ok if the shell is chipped or missing a piece, but I don’t want any shell pieces.” Although it was easy for her to distinguish a chipped shell from a shell piece, I had a much more difficult time separating the two while jogging along the beach. Where do we draw the line between a chipped shell and a shell piece? That is, at what point does a shell change so dramatically that it is no longer a shell but a piece of a shell?
After returning home completely ‘shell-less’, I started thinking about how this shell story relates to a challenge we sometimes face in capturing requirements. William Kent in “Data and Reality” raises a similar dilemma on replacing parts in a car. If you replace the windshield wipers, it is still the same car. But what if you replace the car engine and car body? Is it then still the same car? In your organization, if a customer moves to a different location, is it a new customer? If not, is there anything about a customer that could change (name, tax identifier, or even gender) that could turn an existing customer into a different customer? What about a product or employee or any other concept important to your organization?
Have you ever had to analyze or model a concept in your organization where enough changes could occur to create a completely new concept? How did you handle it?