Alright, so welcome back to computer science S-75 this is lecture 4 on sequel. So, what did we do Monday? Anyone? XML, alright so good. So, we talked about XML stands for extensible markup language and in one sentence what is it or what is it good for? Yes. 

It's good for structuring data in an easy way. 

Okay, good for structuring data in an easy way and we looked at a few examples. We looked at how you might structure lectures for instance. We looked at how you might structure a pizza menu or at least the beginnings thereof, and what are some of the design constraints on an XML file? What features can you have or what features can you not have? How about someone else? Yeah. 

Can only have one root. 

Can only have one root element, okay good. What's your name again? 

Louis [assumed spelling]. 

Louis, okay. So, you can only have one root element. You can have stuff at the top of the file like comments, you can have the XML declaration which his optional so there might be some stuff up there and all of that hangs below the so-called document note and that document note could become more relevant once we get the JavaScript and even if you're already familiar with some JavaScript you might know that there's this global object known as document in all lower case if you've programed in JavaScript before. So, more on that to come. Alright, so with XML what are some other features of this markup language? Yeah, Axel. 

I really don't know if [inaudible] is a feature, but elements can have attributes. 

Okay, so elements can have attributes. So, if you want to markup some data but then associate some additional data with it, you can have these things called attributes and an attribute is usually key equals "value", you can have multiple attributes on an XML element, you can have zero attributes on an XML element, you can't have identically named attributes on the same element and you have to have those attributes values quoted, but besides that you at least have an alternative to this approach. What's the alternative to making some piece of data an attribute in terms of the design of an XML file? Yeah. 

Making children elements within the element. 

Yeah, exactly. Making children of an element is an alternative. So, what's an argument for or against either of these approaches? Attributes versus children? Someone else, otherwise it's just going to me and Axel today. Yes, Axel in the front. 

Okay, if the data itself doesn't have a child you might want to use an attribute. 

Okay. 

But if it would an additional child then you have to use an element. 

Okay, good. So, if you think that your piece of data that you're representing might at some point need to be extended in the same way that a person was extended in our simple little Amazon purchase order example whereby we added an initial, we added an address then maybe it should indeed be a child element so that you've reserved that flexibility for yourself. If instead you make is an attribute, that's pretty much a, the buck stops there, you can't further extend the definition so attributes tend to be good for small pieces of data like IDs, or short words, or phrases, but you wouldn't typically put for instance a paragraph long description of a pizza for instance in an attribute, you could, but at that point it just doesn't lend itself to readability then it gets harder to have quote marks in the attributes value, so in short this is one of those things that you just kind of start to know when you see it, but when it comes to your pizza menu, recall that going to want to be careful with the design because if you for instance create a pizza child; what was one of the downsides even though we talked very briefly about it on Monday, what was the downside arguably of coming up with your own pizza element? Yeah, Jack. 

Well, if you want to use the same thing to find the pizza element, to find let's say something in your sandwiches. 

Okay. 

You're not going to be able to do the same thing because they're not the same element. 

Exactly. If you want to find two different things in the file you have to know in advance what those things are. You have to know that there's a pizza element, you have to know there's a salad element, you have to know that there's a grinder element and what not and so, you've written now a lot of PHP code that has these keywords hardcoded. Now, that's fine if the pizza guy wants to add more pizzas to menu or add more salads, but under what scenario would this design break? And by break I mean you the developer have to go back to the pizza shop and change actual PHP code as opposed to XML. 

They want a whole new item. 

They want a whole item. They want to sell ice cream or frozen yogurt or something that just wasn't anticipated by you and so now you need to have a FROZEN yogurt element and hardcode that, so there again is sort of the manifestation of a poor design decision and so we started talking about alternatives to this, and again, I disclaim there's no one right way so you don't just have to copy what we started to do in Notepad that other day, but instead we have category elements and we had item elements and we put the names for those categories and items in that case in the attribute, in attribute values. We could have done child elements but at least there we had a more general design so that if you're writing PHP code as you will be this week and this weekend, you can have a for loop or a while loop that's iterating over all of the item elements in the DOM and you don't have to worry too much about whether it's a pizza or a salad cause you can figure that out on the fly, so again, that's the thought process you should have when sketching out your own XML file. Alright. Alright, so just a word on Project Zero, so do bear in mind on page 1 of the project spec, as well as in the syllabus is the specification for the course's policies on academic honesty and they essentially boil down to this; you're welcome and encouraged to collaborate verbally with each other, stand in front of a whiteboard if you've gotten friendly with classmates and talk through various design decisions, but when it comes time to write code that should be done completely separately so that everyone ultimately is submitting his or her own projects, so discussions in English good, in pseudocode good, in code bad. So, just bear in mind that and read through it in more detail. Alright, so that was XML and what was the point of XML? Well, we'll see today an alternative to something like XML. We'll actually look at databases and we'll look at a language called SEQUEL, structured query language that allows you to express yourself even more powerfully than you can with XPath, whereby XPath refers to what? Ah, Jack. 

PHP method for going into and finding XML files. 

Good, the PHP method for going through XML files and finding things and just to be clear, it's not a PHP feature per se that happens to be PHPs implementation thereof, it's a feature of PHP SimpleXML API, but XPath is language independent and indeed there exist XPath processors for Java, for C++ for all sorts of languages, so XPath itself is a language separate from SEQUEL, from PHP, from all of these other alternative languages, so we looked at a few features of XPath now. So, we had these location paths that kind of feel a little like C:/ program files or what not, so something a little familiar to people but what kinds of features does an XPath expression have, an XPath location path? We had these things called steps and a step is kind of like the word implies, it allows you to start at the root of the document and dive in deeper and deeper and deeper like taking steps into the tree, what else did steps do? 

Yeah, you can also say which attribute in the certain step [inaudible]. 

Exactly. If you want to look at an attribute's value like we did for the lecture number in our very short example involving a generation of an unordered HTML list of all of the lectures we've had thus far, we grabbed the lecture number attributes using @number for instance or rather that's what it would be in XPath, we happen to do that particular example in PHP where we use the square bracket notation. So, just realize we looked at a couple of different things in PHP we had the SimpleXML API that allows you to use the arrow operator, so hyphen angled bracket to stepdown into a child and XPath allows you to do essentially the same thing but more powerfully, it's more of an, it's a more expressive language than just go to a child or go to an attribute because even though we just scratched the surface or recall that XPath has multiple axes and an axis is something like child colon, colon or attribute colon, colon and those are generally not even written explicitly just cause they would be incredibly tedious to write XPath expressions with those long words, but there are things like decedents, or ancestor, or sibling and those might seem kind of unnecessary right now, but indeed there are scenarios where it's useful. If you find yourself navigating for instance your XML file and you're at a pizza element and for whatever reason you want to get all of the other pizza elements at that same level in the tree, that level in the DOM, you can use the sibling axis to just get all of those pizza elements for instance and more powerfully think about this way, when you're implementing your pizza ML E-Commerce sites and a user has added an item to his or her shopping cart; let's give some thought as to what should go in the shopping cart now. So, the shopping cart is implemented in code via what PHP feature? Yeah, Jack. 

Session. 

Yeah, the session. So, the session is the super global that allows you to illusion of state with the user and by illusion I mean that even http is stateless, as soon as you stop requesting an HTML file cause it's been downloaded, that's it, the network connection does not stay on the present, but there's this thing of cookies and this handstamp metaphor whereby you are reminding the server who you are so the server can in turn give you back the illusion of that same shopping cart. So, what should you put in the shopping cart? A super global like session is just an associative array which means you can put keys and values in it, which is actually pretty versatile. It means you can put pretty much anything in it you want. So, what should you be putting into the session super global when a user says, give me a medium cheese pizza, submit or whatever your mechanism is for getting that input from the user? What should you put in the session? Yeah. 

Size, item, price. 

Good, so size, item and what the last? 

Price. 

Price. Okay, and what's you name again? 

Ben. 

Ben. Okay, so size, item, price are candidates. What else? 

Maybe the XML path to that particular element. 

Interesting, so maybe the XML path to that element. Why would you want to store that perhaps? 

Because on my checkout page maybe I actually want to show the item and the name and everything in it instead of just storing everything inside the session which kind of takes up a lot of space. 

Okay. 

With a long name and a description. I could just tell my checkout page where to find it. 

Okay. Interesting, so storing this location path, an XPath expression essentially would enable you to query the XML file later to get back that same elements, the upside of which is you're not storing all these pieces of data like price and description and what not, things that you would want the user to see in their shopping cart, in their checkout page for instance, so there's that optimization and at the same time, the location path is fairly long, so it's not necessarily superefficient and what if for instance, and this is admittedly a corner case and it's not something it's something you can consciously choose to ignore, but you should at least trip over it, if you did store prices as was it Ben? 

Yeah. 

Ben proposed, what's a downside there? Find a bug even it's an obscure corner case in storing prices and names and descriptions in the shopping cart? Yeah. 

The dollar sign. 

Okay, so. [ Inaudible ] Oh, interesting. Okay, so if we're using US dollars we have the dollar sign symbol which usually denotes a variable. Now, we can work around that by using for instance single quotes and so forth, but potential bug. Jack. 

What if someone can go in and change their own session like make negative money so that they are charging to yourself? 

So, what if a user make negative money by mutating the session object? So, that's a good thought. In this case because the session is stored server side, they couldn't change the contents of it, they could spoof someone else's cookie by sniffing it wirelessly and then seeing someone else's shopping cart, but even in that scenario they're going to end up ordering the wrong food or they're going to be buying something, yeah, someone else put in their shopping cart. So, a good thought but in this case I think we're safe from that particular attack. Okay, Axel. 

Yeah, what if a user orders a pizza and then you store the price in the session, but then the price is change? 

Yeah. 

For whatever reason and then the guy who actually buys the pizza pays the wrong price, a higher price-lower price? 

Good. So, this is a legitimate corner case and maybe it's not a big deal cause if you're changing your price what are you going to, you're either going to gain or lose a few cents, maybe not a big deal, but in principle here if someone knew your prices were going up and they know that the items are stored in the server side, the prices are stored in the server side session, if they just keep around that cookie that gives then access to that shopping cart and they maybe pre-create a whole bunch of shopping carts using different browsers or what not, again obscure corner case, they will get access to the old price whether it's lower or higher, so not necessarily bad but it's a little weird, right, especially if your database is logging the prices people paid and even though there's not a database in this scenario, now all of a sudden even though your prices changed with the new fiscal year, July 1st of 2012, somehow your customers are still paying the old prices, so in short, it's just a little weird, it's a little inconsistent and it's also just not necessary. What would be an alternative to storing these various pieces of metadata or even storing something as long and as string-like as an XPath expression? What else could we put in the shopping cart, yet still enable this guy to sell pizzas and more? Yeah. 

You could implement and identifier. 

Okay. 

Of each item so they would have like a number or a sequence of letters. 

Okay, good. 

And then you probably find that and essentially find everything. 

Okay, good. So, what if we instead gave everything a unique identifier? Maybe it's a number like this is item number 1, 2, 3. Maybe it's instead and alphanumeric string like this is pizza underscore S for small, underscore anchovies or something like that. It could go into a bunch of different directions but just having this principle of a unique identifier it's pretty compelling. Now, push back on that, so have an argument among ourselves here, what's bad about that design? Jack. 

It's the person who's editing and going through all these unique identifiers and decides that they want to change certain identifiers or that they need to add something somewhere it wasn't before. 

Ah hum. 

So, it's kind of messy for the person who has to implement new items. 

Okay, good. So, what if the pizzeria guy in short needs to change things later, update things, you're assuming a lot of sophistication now from this guy who told you build him a site that allows him to edit it with TextEdit or Notepad.exe and if you now have to put in your instructions to this fellow, well you now have to make sure you choose a unique identifier for every one of your products, you know, the questions going be frankly what's a unique identifier which, you know, even though that's, you know, says what it means to a layman that's not necessarily the clearest statement so this is also just assuming that the human involved is going to care about these details too and frankly a pizza guy and even a technical guy should not, they should for the record, they might be in the mutual exclusive in this world, they it's just not a detail you should have to care about, right? Why do I the human have to come up with these unique identifiers? So, in short where we get some nice flexibility of XML and it's very low-impacted, we don't have to know anything about databases, we don't need a database server. We can run this on a PC under the desk at the pizza shop so long as it has a public IP address. It's pretty simple, but we're paying a prince for it and even now, I don't think we've reached closure on what the best design is here, but again, these are the kinds of things you should be struggling with or tripping over so that at least when you've designed your file even if you know this is not perfect this world of XML, at least you've made a conscious design choice and as we've said, as I think we've said earlier any time you have given some thought to a design decision like this in code or in XML or in HTML even, and you think hum you know a sharp teaching fellow might think that I'm an idiot for having done this but I really did give it some thought just comment it, right? Put PHP comments, XML comments, to just convey to us why you did what you did and even if we disagree at least we know you too are a rational being who gave this some thought, so that's as important as the decision itself. Alright, so any questions then on XML, XPath, pizza or the like? Alright, so let's try to solve some of these problems then today, databases and SEQUELs specifically. What is a database in layman's terms? Axel. 

You can call it big table with rows and columns. 

Good. So, big table with rows and columns or at least that's one type of database that's been popular for some time, there's alternatives that we'll touch upon, but a table with rows and columns. So, you might think of just frankly Excel or Apple numbers any kind of spreadsheet program is effectively giving you a database and in fact if you've used Microsoft Access, I mean that even looks like a spreadsheet at least the UI part of it and it allows you to store data in rows and columns. So, what kinds of things might you store in a database? Well, what about orders from a pizzeria? Suppose we did go this route of having a unique identifier for pizzas, pizza underscore S underscore anchovies, or the location path, or a combination of fields like Ben proposed "pizza", and "S", and "anchovies", or the price and whatever your approach, suppose that one of our columns represented that, whatever we're using to identify the thing in the shopping cart and in the other columns what might we have? What are the other pieces of information you probably need to remember when someone is buying something from a pizza site or any site really? Yeah. 

You would do the price, so if you were in the database with an identifier you could get that price. 

Okay. 

Maybe the name, maybe availability if it's. 

Great. Yeah, any of this metadata, again quantity of pizzas bought or available, how many of these things are in stock, how much dough do you have for instance, what is the unit price or in the case of an order that user is placed how many pizzas did they buy, so in short any of the data you might remember that you're throwing away right now for pizza ML cause we don't require that you have a database, but we do require that you allow these or to pretend to checkout. Once you have a database you can also write to this storage mechanism and not just read as you are from the XML file. Oh, and just as an aside so it's clear, one of the key motivations for XPath realize is that if you do take the approach that Axel proposed or Ben proposed or this approach of a unique identifier realize that if you do have at least the first and third of those, if you have a location path or a unique identifier you can use XPath as we did to home in on the specific element in the XML file that you want to get back because you want to check it's price, you want to check it's availability and so forth, so realize that the XPath function in PHP even though it will return by definition an array of matching elements, realize that array could just be a size one which it will be if you've uniquely identified a node. So, you can just grab bracket zero for instance of that array to get back the thing that you queried for, so that's where XPath's power comes from so that you the developer, don't have to write a bunch of nested for loops iterating over every possible element in the tree, looking is this one, is this the one, is this the one, is the one with Ifs and ELSEs and so forth, you can do it with XPath, so realize you have that expressive capability. Alright, so what forms do databases come in? Well, what's CSV? What's it stand for even? Yeah, Ben. 

Comma-separated values. 

Yeah, comma-separated values. So, this is really like a quick and dirty style database, so much like a spreadsheet has rows and columns, you can kind of mimic the idea of rows and columns in a text file whereby every time you hit enter you get a new row, that's pretty trivial and every time you put a comma you get essentially a different column, so if your first row has three commas and therefore four fields, so field, field, field, field and the next row also has three commas and the next one has three commas even though the spacing might not line up perfectly you will have effectively four columns in the CSV file. So, you have the beginnings of a database not unlike a spreadsheet, but there's some problems. If you're using commas to separate your "columns" what's perhaps the most glaring deficiency of this representation? Yeah, Conner. [ Inaudible Response ] Right. You're kind of screwed is your words have commas in them because now it's going to be ambiguous to the program reading this CSV or even the human reading this CSV file what the comma represents, is it a comma for grammar or is it a comma for field separation? So, it wasn't so long ago that people found a workaround for this. They have PSVs. Anyone know what a PSV is? 

Period-separation. 

Period, so not a bad guess. It's not period but it's a pipe-separated values. It's that vertical bar that you get by holding shift usually and hitting a key on the keyboard. So, why did the world introduce PSVs? Well, frankly to address this problem. Now, of course there's a corner case here, what's the problem with PSVs? Yeah. 

If anyone ever needs to use one those. 

Exactly. If you ever need a vertical bar which is less common in fairness, but possible especially if it's right there on the keyboard you're screwed again. Okay, so then the world came up with TSVs. What are TSVs? 

Tilde. 

Tilde, no but good guess, I mean frankly we can make up our own file formats here pretty easily. It's tab-separated values, so that is good in that at least you're not likely to have a tab in the middle of the sentence, but what if you're actually storing paragraphs or what if you actually are using text editor as some of you for programming might use that doesn't really support tabs per se cause if they automatically converted to spaces, that system kind of breaks down, so tabs in general is a very fragile mechanism in programming in general, so that's not so hot either, so there are all these various formats. There are some workarounds, what you can actually do with any of these formats commas, pipes, tabs is you simply add some quotes to the situation so you quote the entire columns value so even if that value has a comma in it for grammatical purposes, you look to the quotes to say what is the string in this column so to speak. Now, of course what's the pushback there? Well, what if you want to have a quote in your quote, well then you have to introduce backslash quote marks so we a solution there, but what if you want to have backslash quote marks literally in your, well then you do backslash, backslash quote, but that might be ridiculous but that's the world we're already in with programming itself, so just realize there are these various issues. But more importantly, what's nice about CSVs is one, they're completely language independent. It doesn't matter if you're Java programmer, PHP programmer, or whatever these are compatible with any language cause it's just silly text. You have to write a program that reads it, but once you do anyone in the world in theory could use that library and voila, you now have CSV support indeed in PHP they wrote it for us. There's a couple of functions, F get CSV and F put CSV that does all of the annoying ParSet for you, so with F get CSV if you look up its documentation on PHP.net you'll see that it takes an argument which for instance is a reference to a file, food.csv on the hard drive and it will ParSet for you top to bottom, left to right and what it will return to you is an array of all of the rows and then you can index into those rows because they themselves are arrays. So, it returns an array of arrays where the big array represents all your rows and each individual array represents all of the cells or the columns in that row, so it's nice. It just deals with all the stupid space issues, the pipes, the CS, the commas, the tabs, and in fact even though it's called F get CSV, you can override the default delimiter and tell it don't use comma, use tab, use pipe, use tilde, use period, whatever you want it to be, so it's nice in that regard and more importantly if there's F put CSV which is compelling cause it does the opposite, it writes CSV files and it deals with the headache of figuring out what needs to be quoted, what needs to be backslashed escaped and so forth, so really handy. Alright. Then there's XML, SimpleXML API actually has some faults it's sometimes too simple and for instance case in point the X file function always returns an array even if you know there's only going to be one node in the tree that has this unique identifier, so occasionally you'll run into just some, hum it's not quite as user friendly as you like, but it's definitely simple. An alternative to it is the DOM, document object model API and PHP which you're welcome to use if you really want. It's a little more sophisticated and it's not necessary technologically for pizza ML, but realize there are other parsers so to speak. A parser is a program that reads files and does something with them, and then we have MySQL. What is MySQL? Yeah. 

A query language. 

It's actually not a query language, SEQUEL is a query language, but what's MySQL? 

That's the type of database. 

Yeah, so it's a specific make and model of database, so it's a vendor called MySQL that sells and also makes available for free a database server called MySQL, so it's software that you can download from Mac OS, Windows, Linux and the like. You install it on your physical computer or your physical server and it is just another service running on a port listening in this case to TCP port 3306, sort of slightly random trivia that you only rarely need to know, but it's just in that regard like a web server like an email server, it's listening on a port like we discussed in lecture zero. So, MySQL, MySQL PDO we'll talk about at least a couple of these tonight and on Monday but there are, there's built-in support in PHP from MySQL which means just like you have these built-in functions for CSVs, and TSVs, and PSVs, and XML, similarly do get a lot of free functionality with PHP itself for interacting with a database and what is this database MySQL? It's essentially a super fancy version of rows and columns but with multiple tables, multiple spreadsheets if you will so you store lots and lots of data as we'll start to tonight. So, SQLite anyone know what this is? It's yet another alternative for a database. So, SQLite allows you to use SEQUEL the structured query language we'll start looking at tonight in lecture and in section and it lets you use the language, but without needing a database server per se, a SQLite database is literally just a binary file, zeros and ones on disc usually called something dot db for database and it stores rows and columns; how? You don't have to care about or know about but it creates the illusion of rows and columns even though it itself is just one big file, so it allows you to use SEQUEL as we'll see it's quite powerful, flexible, way more user, well its sort of user friendly, way more expressive than CSVs and even XML files especially when it comes to searching and such, but you don't need to figure out how to turn on a database or run a database or use the RAM for a database if it's a quick and dirty application or a small application with tens of users; this means you can package it all up in one folder and you don't need any infrastructure beyond that, so some of the projects that I've ran, I just reach for SQLite where I want to use SEQUEL cause as you'll see it's a very powerful language but I don't want to setup a database, setup a user name and a password when I give the code someone else then they have to setup a database, import that the database, this way you literally can send them a Zip file with everything in it so long as their server supports whatever language the project was written in whether its PHP or something else. Alright. So, rows looks familiar, here's Microsoft Excel and just to paint a very concrete picture, rows and columns we're sort of talking about building with something like that in memory and we can do it the old school way using a terminal window, so you might have dived in to Project Zero already with SSH or with opening a terminal window inside of the appliance. This gives you a black and white or white and black interface not unlike this and MySQL because it's a server that's listening on a port, you can talk to it just like a web browser can talk to a server or an email client can talk to a server, so one way in which you can start interacting with MySQL is via this command prompt. So, let me actually go over to the appliance here, let me open up a terminal window, and what I'm going to do is, what I'm going to do here is type MySQL, enter and we'll see access denied and let me make one tweak here just so that we can zoom in on this, give me one second, so I can zoom in for us. There we go. Okay, so access denied for user jharvard@localhost. Recall that localhost refers to the current computer. So I need to type a slightly more involved command. MySQL-ujharvard is fine and then dash p, enter now I'm being asked for a password, as you may have seen in the appliances documentation, is that John Harvard's password is always crimson by default so now we're at the point of where the slide was at just a moment ago, so now I can start typing commands and this will very quickly get a bit tedious, but I can type show databases; and here the databases that come by default with a MySQL server, information schema, MySQL, performance schema, test; for the most part you should never, ever touch these databases in fact, we typically can figure the appliance in such a way that when you log into the database server with a GUI as we'll soon see called PHPMyadmin we hide these by default cause generally a developer should not touch them, they're used internally by MySQL with a couple of exceptions. So, there's nothing interesting here just yet, but I could start typing commands, like I could start saying if I need a database for lecture, I could say create database and then specify what kinds of rows and columns do I want. Do I want numbers, do I want letters, do I want dates, do I want times, we're going to able to specify with much more precision that type of data we're storing where a CSV and XML they're all strings in that case unless it, and even though it might look like int, it's still "an integer." So, let's actually transition from this black and white environment which while powerful is just not very user friendly and let me open up something called PHPMyadmin. So, notice here, I'm back in my Mac but you can do this with any operating system, Windows or Linux or the like, notice I've gone to http:// even though the browser is hiding that, appliance/PHPMyadmin. Now, this will not work for you out of the box, when you've booted up the appliance you can pull up this utility called PHPMyadmin using Chrome inside of the appliance by using what URL? If I open up Chrome inside of the CSV 50 appliance, and I type http:// it's kind spoiled by autocomplete now. Where can I go to see this? 

Localhost. 

Localhost. When you're inside of the computer as I am now, I'm in Linux cause that's the screen I've pulled up, I can call up localhost and I'll get to the same place. I'm going to type in jharvard and crimson for my user name and password and I see that same UI here. The browser looks ever so slightly different in Linux than in Mac OS, but it's the same idea. However, on my Mac I have instead appliance/localhost. How did I make that possible? Yeah, Axel. 

Probably at c host. 

Yeah, at c host, so recall that there's this trick on Mac OS and Linux and also on Windows where there's a file on the system and the project spec tells you how you can edit this for Project Zero if you want. That allows you to specify a synonym for an IP address. So, what I have done is edited at c host, to say that this IP address w.x.y.z should actually be known as appliance; why? Just more user friendly. It would be wrong for me type localhost/PHPMadmin on my Mac why? Yeah. 

Because that would refer back to your Mac which you [inaudible]. 

Exactly, localhost if I'm on my Mac refers obviously to my Mac cause that is indeed the localhost and there is no PHPMyadmin on there, at least by default. You can install it, you can use XAMPP or WAMP or these various tools we talked about briefly in the first week as alternatives, but if you're not running it on your actual Mac or PC that's going to be a dead end, so instead we want to resolve it to the IP address and as for the appliance it's always going to change but you can always check the appliances IP address by checking the bottom right-hand corner here and if you indeed have network access on your computer and all is well and nothing's broken, you will be able to visit the appliance via that IP address which in this case is 17216100.129, but will be different for you. Alright, so what can we do when we're in this environment? Well, let me go back to my Mac and why do I use my Mac? It's just a little either, they're functionally equivalent. So, what can we do here? Well, one now I'm being told there's no databases, but that's because the appliance is lying to you. We've hidden some of those more administrative type databases just so you don't accidentally break things or get distracted by things you don't care about and I'm going to go ahead and click databases at top left and you'll see I'm prompted to create a database. I'm going to go ahead and create a database called jharvard underscore lecture and notice that I'm choosing my user name underscore and then the name of the database that I actually want which is a common convention and I'm going click create and now notice what PHPMyadmin did. PHPMyadmin is coincidentally written in PHP. We don't care that it's written in PHP that's just what the person who wrote it called it. It, however, is a GUI graphical user interface that allows you to administer a MySQL server that could be running really anywhere, in this case it's running on my appliance, but if you have a commercial web post they might say, hey we support PHPMyadmin go to something .com/PHPMYadmin and then you log in with your commercial user name and password in that case. So, you'll see hopefully that this is just a nice way of navigating a database especially once you start getting data in it cause you can just see it visually rather than in black and white text and also pedagogically we'll be able to see the features of MySQL thanks some nice dropdowns in just a moment. So, this text here in purple and green create database/jharvard underscorelecture/; that is sequel code. So, if I instead go back to my terminal window, my black and white prompt from before, I could have finished this sentence and I could have said, create database jharvard underscorelecture`; and notice it fails, why? 

Because it says it exists. 

I already exists, right? And if I change the name like lecture 2, enter. That seemed to be okay and notice the feedback is literally query okay one row affect and it did it really fast zero seconds now let's go back to PHPMyadmin, let's go back to the home icon here and voila now I have two databases. If I want to delete it I can go ahead and click here and, or actually let me go here. If I go to the databases tab noice I see my two databases, I can check lecture 2, click drop which is the lexicon for removing a database, notice it's yelling at me "are you sure you want to execute this command?" And I am, and in fact I'm going to do it the other way though, I'm going to go here and say drop database jharvard lecture 2; query okay, back here, reload and click the home icon and sometimes it caches so you have to reload. Oh, did I delete the wrong one? I did, that's okay. Here's how you rename databases. So, now I'm going to go up to lecture 2, now notice there's a bunch of tabs and we won't play with all of them today and most, for the most part they're self-explanatory once you get comfortable exploring. I'm going to click operations and now notice I can just change this here jharvard underscore lecture and there's some other stuff I can drop it, I can create a table in it, I can copy the database and so forth, but let me click rename and now notice, and this where it's pedagogically instructive even though we're using this click, user friendly click and drag and interface, notice that it's telling us the SEQUEL that you could have executed manually so you can infer what the syntax is like, create database jharvard lecture, drop database jharvard lecture 2 is the way it chose to rename it in this case, but it would have preserved our data if we had any. Yeah, Conner. 

I noticed all these are, all these commands are in caps, is that case sensitive? 

It is not, so even though you're seeing them in purple capital letters, that's typically a convention and it's a style thing that I would actually encourage you to adopt because when you're reading you code in a syntax highlighted terminal IDE or code editor, or when you're just reading it as a human with eyes it's just easier to see the SEQUEL keywords as distinct from your own table and column names which are generally should be in lowercase with no spaces or weird punctuation. Good question. Alright, so let's now use this, but let's first motivate the problem somehow. I want to create a database for authentication. Recall that on Monday we looked at a number of PHP examples among which, actually last week we looked at the log in examples where we used session and a cookie and we remembered that jharvard was logged in or not logged in and we went through a few iterations log in 1, log in 2, log in 3 and so forth and none of those though used the database, at best we used a constant and just hardcoded John Harvard's user name and password, not very scalable, not very conducive to having multiple users. So, now we have a database. So, now I have the ability to store rows and columns of information related to users. So, that begs the question, if you were storing a database of users even if you just think of this now as an Excel spreadsheet, what would the columns be in that spreadsheet that you would want to remember for each of your users which is going to represent a row, so users go in rows, but what each column represent if this is a spreadsheet of users? What kinds of fields? [ Inaudible Response ] Yeah, okay so let's do that. So, you might want a name, okay so what else besides name? 

Location, phone number. 

Location, phone number. What else, remember that they need to be able to log in? Yeah. 

Password. 

Password. So, we need some kind of password in the system, so a few fields. So, let's do this. Let's create the first of our spreadsheets called a table in a database and to be clear, the fact that we're using a database that has tables means it's a relational database, RDMS relational database management system is the buzz word there and this just means you're using tables to store your information and we'll talk in the future about alternatives to this among which are no SEQUEL databases or object oriented databases or document stores which generally means you just store data differently, you don't use rows and columns, you instead store actual say PHP objects which is a bit of an oversimplification, but that idea. Alright, so let's call my table of users, again, I would say a common convention is used all lowercase, no special characters and so forth. You could call it users, it's a little messy I wouldn't call it users. I wouldn't call it my users, this is just bad, it will work but you'll have to quote the string everywhere cause it has a space, so in short, best practice would generally say you should lowercase simple words that says what they are, like don't call table, call it users. Alright, how many columns? Well, let's keep this simple for now and let's just go with two for the moment, username and password and we can add to it later. So, let me now click on go or hit enter and now I get this field, this form that I can fill out, so I've not created the table yet but I'm being asked a few questions so let's go through this top to bottom and here's where frankly PHPMyadmin interface it's a little ugly and it needs some work, but it's definitely more user friendly than the command line for this particular task. So, I'm going to give this field a name, username, this one the name password, though I could call them whatever I want but like with variables in a language use descriptive words and here is where now the dropdowns get a little instructive. Turns out I have a whole bunch of datatypes with which it's defining my schema, so a schema refers to all of the decisions we're about to make. What is the format of your table? So, for username what do you want to go with here in terms of datatype? 

Variable char. 

Yeah, so there's this thing here variable char or VARCHAR. This is probably the best candidate and so let's put that at top of our mental list but let's see what else we might want to rule in or out. Text, sounds like it's related so let's maybe keep that on the mental list. DATE, no. TINYINT all these numbers are clearly wrong, BOOLEAN wrong, DATE and TIME wrong, okay STRING looks like we might have to give this one a bit of thought, CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT and the rest down there is some binary stuff. So, you say VARCHAR variable char, why? Or what does that represent? What is a VARCHAR? 

If you specify that there is a ten character law is not exactly ten maybe less than ten. 

Okay. 

Say one through 10. 

Okay. 

If is VARCHAR then they must be three. 

Exactly. So, a VARCHAR we're about to see also requires that we tell it how long it should be maximally. So, variable char means it's a string but of variable length, but you have to give it an upper bound. So, as you proposed if we say that this is a VARCHAR of size ten that means you can store usernames of length one, or two, or three, or even zero, or one, or two, or three, or four, five, six, seven, eight, nine, ten but not eleven or beyond, but the upside of this is that if you're username is just foo, three letters it's only going to use three characters and then you're going to save those seven additional characters, seven additional bytes or what not and so you have some space saving, now for one row who cares, for ten rows who cares, for a million rows that's absolutely going to start adding up. But then CHAR contrary to the name it's not a single CHAR you still specify the size but a CHAR field is a fixed size, so if you say ten, it's going to use ten bytes. If you say eleven, eleven bytes or maybe more if it's using sixteen bit characters for other languages. So, CHAR, VARCHAR why would you choose one versus the other or why does CHAR even exist if clearly it's more wasteful potentially? Conner. 

I mean maybe CHAR they're all going to be that length. 

So, maybe CHAR if they're all going to be that length, so that's good, so even though users names might vary in length, what might be a piece of data, even though we don't have room for it yet, that's a fixed length that people have? 

Middle initial. 

I'm sorry. 

Middle initial. 

So, middle initial if you just want one initial, a single CHAR would do the trick. What else? 

You wouldn't use a CHAR for this, but a date usually has the same. 

Okay, so a date, now it turns out there are date field but something like that that's a known fixed length, anything else? Maybe phone number, you know, little tricky with different country codes and what not which could be variable length but for US numbers you can use ten digits for a state you could maybe use seven if it only has one zip code, so there's some optimizations there if you're not storing the hyphens and parentheses and things like that. But more compellingly, CHARs are also useful for performance. It turns out, and this is a lower level detail, but it turns out that if you do specify CHAR because you know the length in advance, the database can be faster at searching that because if you think of it just in the most naïve way, if you have variable length CHARs in a column, you know, the column's going to look like this, it's going to be a ragged column which means this string is this length, this one's this length, this one, you know, it's kind of uneven which means how do you go from one string to another if you're doing something silly like linear search? Well, you have to know the length of each of those strings and in the worst case you have to search through each of those strings looking for the end of the string if you come from C or C++/zero represents the end of a string; you essentially have to look for that throughout the column looking for the beginnings of new words. By contrast, if it's all CHARs and of length ten, the column now looks beautiful like this even if you're wasting some of those bytes at least they're on the same boundaries which needs to go from this row to this one or this string to this string, it's just plus ten, plus ten, plus ten, plus ten, so you essentially get random access like you would in an array. So, that's the tradeoff you have to make and this is one of these nonobvious decisions just like in Project Zero, there's no one right answer to menu.XML, similarly for Project One there's going to be no one right answer to how you design your database for that project but these are the kinds of decisions you need to think through and make and sometimes struggle with and even then you and your, you know, business partner if you go off after this class and work on something collaboratively, might not agree but hopefully you'll at least arrive to the table a little more informed than him or her, so CHAR versus VARCHAR but it looks like we have four other options, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT so these are generally bigger whereby, I think text is usually something like, it's big 30, damn I can't remember, it's like a few megabytes I think is for a field like that, so this is like if you want people uploading their resumes for instance or some long document or something, or a huge HTML page that you've screen scraped or something like that, it might not be a few megabytes, I'm misremembering, but it's bigger than any of the fields we just discussed thus far. Generally, CHAR, VARCHAR are caps, years ago the cap was 255 characters, these days it's 65,535 or 36 characters; TEXT is more than that, I just don't remember what it is off hand, but the documentation would have this, so the upside of this is that you can store big quantities of text. Well that seems great, right? If I, and they are variable length, so they're like big, big, big VARCHARs so now pushback. If you just have this ability to just, hell, let's use long text cause I have no idea how long someone's name is going to be. I don't want to choose some arbitrary cutoff like ten or thirty-two or 255, it will let it be as big as the database supports. What must the price be that we're paying to have that flexibility of being able to store names that are millions of characters long? What could it be? Conner, take a guess anything. 

So, so are you saying like the downside to that? 

Not so much, I'll, so sure you can view it that way. There's some price we're paying, right, cause otherwise if you could use long text for everything, it's begs the question, why do any of the other datatypes exist? Now, we know why CHAR exists, but what about VARCHAR? 

Just because of the performance, I think it would slow it down too much having that much data. 

Okay, so that's not bad, that is right, so it is actually a performance thing because text and long text and even tiny text which is still big, it's just smaller than text is meant to be pretty darn big. For efficiency reasons when you store the database on disc and this is not a design detail we developers have to care about, it's something that my MySQL people have to care about. When they implement this database, they pretty much store the tables on disc or in RAM contiguously and conceptually this long column of text, but when it's a really long text what they instead do is put in the column a pointer effectively to some huge chunk text elsewhere, so in other words the text when it's really big is not stored right there so you can't just kind of hop around looking for it, you have to go here and then look over here, back to the table look over here, back to the table, over here. So, in short, it's just not as local to the rest of your data as everything else so this actually has real world implications for caching, if you're familiar with L1 caches, L2 caches, some hardware type things even in memory caches that a database would have, in short, the farther away your data is, the less likely it is to be in RAM or in caches at any given time, so in short, this might just have performance impact as Conner proposed and you might not notice it until you really have huge tables, but it's again one of those design decisions where for the courses projects certainly, you really shouldn't be dabbling in the text field unless you get to the point of uploading, you know, large documents or screen scraping stuff or just big corpses of text then it's compelling. Alright, so that's username, let's go with VARCHAR and, oh damn it, now we have to have a conversation about how long a username should be and this too, not obvious, but let's take a suggestion or two. How long should our variable length username be? Yeah. 

Thirty-two characters. 

Okay, 32 does anyone disagree? Yeah, frankly 31 is pretty compelling. I don't know anyone with a 32 character username. Jack. 

I would say 15 cause I don't know anyone who's goes that far either. 

Okay, 15 so I can actually think of some undergrads who have crazy long usernames just because their first and their last names are really long. So, 15 makes me a little uncomfortable cause now we're going to, you're going to have like people with stupid user names where they're like losing one or two letters. So, maybe 32 but even there it's like who knows what's best. There's some sweet spot but you don't want to error too high because you will potentially pay a performance impact and not all the much, but it's just you know we should to try to keep it as close to reality as we need to, so 32 frankly is not bad cause any more than that and no one's probably ever typing your email address anyway, so 32 fields at least reasonable. Alright, how about password? What datatype should this thing be? 

VARCHAR maybe? 

Yeah, VARCHARs fine and in fact, for most of our test fields VARCHAR is fine and how long should the password be? I don't know, maybe 32 again, you know that field's long. Most people in this room probably don't have 32 character passwords unless your super paranoid like Jack. 

No, I was just going to say that most websites say that a minimum password length is 8 and then I don't know anyone who really goes beyond 16. 

Okay, so 8-16, so somewhere in there and I'll compromise this time, let's say 16. Sure, that's fine. So, your user name is more secure than your password in some sense, but that's fine, so in short, just decisions you need to make and I would generally just as a matter of being anal at least choose some standard numbers, powers of two or just heuristic so that you're not making judgment calls every which way, just pick some consistent pattern. So, frankly I might even say 32 here just to be consistent, but I have no good argument for that other than the consistency. Alright, let's see what else we can choose here to the right. So, now in the top row we have username, bottom row we have password, default, so it turns out that a database unlike a CSV file, unlike an XML file allows you to specify default values, though in fairness there's ways to do this in XML but it's not nearly as straight forward as this. So, should a user have a default username whereby if they don't provide a username my database will just put something there for them? Well, you don't have that much flexibility. You can either choose as defined which means we can all everyone, for instance John Dow if they don't give us a username, or we can just say, ah that's fine their username can be Null or it can be a timestamp which is just wrong here. So, what's the right decision in this dropdown for default values for username? This one I'd argue there's a right answer so don't guess wrong. Yeah. 

I would use none. 

Yeah, good. So, right if they're creating an account they need a username you can't just like give them a fixed name like John Dow cause it's only going to work for the first person, so in this case it's none which means they have to fill in this field, yeah. Axel. 

The length of the password. 

Okay. 

If you, I mean [inaudible] you don't store the password in plain text. 

Indeed. 

You store with some kind of encryption and if you use like say MD5 it's going to be 32. 

I agree. So, let's actually come back to this concern. Let's do it this sort of naïve clear text way then and then we'll come back and realize shoot that was really stupid and then we'll actually refine that field in particular and it also allows an opportunity to edit the tables which you can do after the fact, so we're not writing this in stone. And password value, probably none either, we want them to give us a password as well. Coalition is generally not worrisome, you'll see that this just has to do with encoding of characters. You don't have to even bother filling this in, the database will choose one for you, don't worry when it says it's Swedish, just happens that the original author's in MySQL were Swedish and so the coalition by default is a Swedish encoding set, but there is perfect overlap between that and English so it's not a problem. You don't have to fret over that detail. So, I've pretty much always left coalition blank. Now attributes, this is unrelated for now, but we'll come back to this. You can have a binary attribute, an unsigned, unsigned zero fill, or on update current timestamp, we'll come back to those, but there some additional nice fields that a database can do for us automatically so we don't have to write it in code. Now, on the right-hand side here, notice that we have, whoops, notice that we have a couple of remaining options, the first of which is Null, can this field be Null? So, this is distinct from the default value even though we saw Null a moment ago, that just meant make the default value for this field Null. This checkbox lets us make the decision can it even be Null, yes or no? So, this is a good one to make a decision on. In this case I think the story is the same, no they cannot be Null so I should not check this box, but even if they could be Null I still wouldn't want a default value of Null, but sometimes you might want field. What's a situation? Suppose we were making more sophisticated users that have more than just users names and passwords, what field might be reasonable to have, but allow the user to leave it Null? 

Like middle name. 

Yeah, middle name not everyone has a middle name, something else? 

When you did the signup thing for the sports, it's captain and if you're not captain. 

Yeah, exactly. So, captain could be a field. It could be a Boolean field where you at least have a zero or a one there explicitly or it could just be Null or non-Null that would work too, so in short, optional fields where you don't want it to be the empty string, you don't want to waste any amount of space representing an empty string, you just want to say there's nothing even here, you can do that. Now, index let's come back to but we'll have a discussion shortly about primary unique index and full text AI is completely nondescriptive, but it means autoincrement it's also implacable here but we'll see when that might be useful and comments is just for us humans. So, at this point I'm going to go ahead to the bottom of the box and I'm going to say save and hit that button and now notice what was just inputted to the database. Here, again, could have done all of this manually at the black and white prompt just gets a little tedious and frankly I never remember the punctuation so I still use tools like this. Create table, jharvard underscore lecture. These are back ticks, so for US keyboards this is the character usually on the top left of your keyboard near the tilde, that is my MySQL's way of escaping things that might even have single or double quotes in them, so just realize that's a MySQL thing, these back ticks they actually have meaning. Dot user, so what is this saying? This is saying specifically create a user's table in what database? Jharvard lecture. Technically you can omit jharvard underscore lecture. if you just say users, the database will assume whatever database we are in now and recall that I clicked on my database at the top left earlier, so I've selected a database so this is more rebus than it needs to be. User name now is of type VARCHAR 32, the spaces are of relevant this is just a pretty printing thing. Not Null, obviously just reiterating what we said before and password has the same constraints. Engine we'll talk about next week, but there are different formats in which you can store database tables, one of them is innoDB another MyISAM, another is HEAP there's a whole bunch of others as well. For now, know that they just have to do with performance and certain fancy features and you can think of an analog as being a Mac OS has HFS+ for instance, the Mac OS file system, Windows has NTFS or FAT32, FAT16. The database engine is similar in spirit. You can still store data in different engines, but they have different features just like you can store files on Macs and a PCs even though those file systems have different features, so we'll come back to that detail. Alright, so now down here we see our little Swedish table that has usernames and passwords, the schema therefore, but no rows and columns. How do I know that? Well, if I go up at top and try to click browse the GUI is just yelling at me, there the "table is empty", so clicking browse gets me nowhere. So, let's do this sort of in a cheating fashion, let's click on insert and manually insert some users here, so for my username field, I'm going to type in a value of jharvard, and for the password I'm going type crimson and then I could give it another row and column, this ignore thing, this is again, this has nothing to do intellectually with databases, this is just a GUI that's making it easier to put data into this database. So, now I'm going to go ahead and click go, notice what just happened. I'm being reminded of the SEQUEL that I could have typed if I wanted to, and what happened here is a new SEQUEL command. We've seen create database, we seen create table, now we're seeing insert which is definitely one of the most popular ones. Insert into jharvardl ecture.users, and again, which of those words could we leave off? "Jharvard lecture" and the dot not strictly necessary; ("username", "password") and in fact the quotes are not always necessary, it depends on if you have special keywords of the like, bur realize that PHPMyadmin always puts them for good measure. Then values. So, values is a little cryptic in that it's not, like there's no colon separating keys from values, but notice that username was first, password. So, similarly is jharvard first, crimson in this case. So, now let's do this manually and notice that PHPMyadmin has a SEQUEL tab up here and that's indeed where we just were, now notice by default it puts in this query select star from users where one. This is another common SEQUEL command, select and it is what you use to search a database. So, in this case select star from users where one. Translate that even if you've never seen SEQUEL before just based on instinct to English. What is this doing? Yeah, Conner. 

Is it just saying select all like items and users that exist basically. 

Exactly, select all users, all fields, or all columns from users that exist, cause where one is just obviously true. So, we can actually wittle this down, the where one technically doesn't need to be there, it's pointing now that this, you can have conditionals in a sentence kind of like a predicate that had in XPath. So, this is equivalent and you know what's also equivalent here is select username, password from users. That also is equivalent, the star is nice shorthand notion at least when you have a bunch of fields. So, now let me click go and see what we get back. So, now notice it's a little overwhelming in terms of the GUI, the only thing that matters is the temporary table that came back. What select does is it selects rows from your table and returns to effectively a temporary table. Now, at the moment this is really uninteresting because I selected everything so it's equivalent to what I saw when I browsed this table a moment ago. Let's see in another format just so as not to get too distracted by the GUI. If I go back to my terminal window, let me now do show, I'll do it in caps, SHOW DATABASES; notice if you forget a semicolon, you instead get another prompt so you can then hit the semicolon there to make it behave. So, now jharvard lecture, so I could do select star from jharvard lecture.users where one; enter and there's my columns and rows. So, this what I meant by old school before, like this really is ASCII Art representing a database table and this is fine for small datasets, this is a nightmare when you actually have lots of rows and columns. You just can't do it at the command line like this, but now let's trim this query to be a little more user friendly. I don't need this useless predicate, I also don't need this or do I? Let's see select star from users. So, I haven't selected a database, so I can actually do this use jharvard underscore lecture; so now notice databased changed. I can hit up and down to go through my history, so now if I do select star from users, enter, now it works. So, why is this relevant? PHP is going to have the same exact mechanism where you first connect to a database like I did at the command line and then you select a database and then you don't have to worry about hardcoding you table, your database name all over the place which gets tedious and also means if you ever change your database name because you move your code from one server to another, you have to change hundreds of lines of code potentially or one if you factored it out to a configuration file. Alright, so in short, where are going with this? So, SEQUEL has a whole bunch of common statements; create which we've seen, haven't seen altered yet; kind of saw drop cause I did it manually to delete a database I didn't want; select we're going to see, we saw insert, we'll see update; delete and then a whole bunch of others but why don't we go ahead here, take five minute break and then regroup after that. Alright so we are back. So, let's actually do something with this, so thus far all we've done is play around with the database both the command line and the GUI, but let's actually try talking now to the database with actual code. So, let me go ahead and open up a somewhat familiar example from a couple of lectures ago namely these guys here. Let me go ahead and pull up in the appliance today's source code and let me go into the log in directory and notice that we have these files this time. So, a couple of these are familiar home.PHP and probably logout.PHP, but we left off with 1, 2, 3, 4 last time, now we have a few more variants, 5, 6, 7, and 8 which now actually introduce a database. So, let's see what home.PHP looks like. Let me go over to my appliance, go into the root here, choose log in and choose home and to be clear now that you do have presumably or will have this week for the project the appliance, realize that you can download of these examples, put them in John Harvard's public HTML directly or you can create a Vhost like you were guided through for Project Zero and you can call it whatever you want. I called mine appliance and you can actually play with all of these examples hands on. Alright, so you are not logged in and does anyone recall how this file, home.PHP knows that I am not logged in? Yeah. 

So, it looked into sessions super global and looked after log in true or something. 

Exactly, so it looked in the session super global and it checked for a flag called authenticated, a key called authenticated and if it's set there that means the user is indeed logged in. Alright, so let's go ahead and try logging in here. Version 5 has this problem though at the moment. So, not bad it's deliberately supposed to happen because I haven't provided any of the username or password or database details that we just created on the fly using PHPMyadmin and you can kind of infer as much in orange, MySQL connect, access denied for user "at localhost" so that already doesn't look right. Using password no, that doesn't sound right. So, a whole bunch of problems all of which can be traced back apparently line 17. So, let me go into the code in line 17 with Geany on log in 5 and we'll see here in my text editor, whoops, we'll see here in my text editor following source code. So, how is this starting? So, first I'm calling session start. Let me zoom in. What is session start doing for me again? Yeah, Axel. 

It's starting the session so. 

I could have told you that. 

But that they connect the super globals and everything after you do that. 

Okay, good so it's enabling you to access the super global as dollar sign underscore session. Let's be more concrete. What is it functionally doing underneath the hood? How is it achieving that end result? Put another way, how do sessions work? How do they get started? Huh? 

I believe, I think something is sent in the, from when you request a page and the session start is included the things you actually receive has like a session ID. 

Good. So, calling session start ensures that the server will send if necessary a set cookie, HTP header with a PHP ses ID cookie whose value is a big random sequence of letters and numbers and will also ensure that if the user sent us a cookie:header, so not set cookie, but cookie:header and that T-value is present then it will use that key value to look by default in slash temp where there are files named almost identically to the cookie value, which is again big random number, and will give me the developer the illusion of having access to the same shopping cart or what not that I previously had access to. Now, it's not a shopping cart here, it's just a storage for the authenticated flag, but same idea. Alright, so the next line connect the database is the comment. So, there's a new function here and we can do better than this function, but it turns out this is one that you'll see fairly on the presently in code online, open source code, and the like called MySQL underscore connect but we'll better versions next week that will hide some of the details we'll now get our hands dirty with. So, MySQL connect takes three arguments which you can see in documentation if you pull it up on PHP.net, but I happen to remember that the first field is the IP address or the name of the server which in this case is going to be localhost because I'm in the appliance and the database server happens to be on the same physical server as the web server, so it doesn't need to be one in the same, but in this case it is. My username recall is John Harvard, my password is crimson and what am I checking for here? So, if connection equals MySQL connect equals, equals, equals, false so some new syntax. So, first of all MySQL connect is a function that does exactly what it says. It connects to the database using those credentials and that database server, localhost. What does it return? It returns a pointer to a connection object, I don't really know what that means yet, but it's some kind of reference to the open connection so that I can subsequently send commands to the database server and then equals, equals, false what is this here for? Well, if something goes wrong MySQL connect is going to return false and it actually did when we saw that big orange error message a moment ago, it was, that was returning false but why is it equals, equals, equals feels like a typo? Anyone know? Yeah. 

I think two equals is the value and two equals is also the type. 

Exactly. So, this is exists in other languages to JavaScript for instance, because PHP is weakly typed whereby you don't specify in sin floats and strings explicitly you just use variables however you want. If you nonetheless want to ensure that you're testing a variable against a value not only bases on its bitwise, not only based on its value like true or false, but also based on its type so that is has to be false and it has to be a Boolean you use equals, equals, equals or not bang equals, equals. So, you use three characters instead of the usual two, so in short, this is the identity operator it's not the equality operator, it's the identity operator which means test for equality and for type. Now, why is this relevant? Because if just use equals, equals and MySQL connect returns zero, well zero is not false per se, but it is pretty darn close to false and in fact if you convert an integer like zero to a Boolean what do you get typically? 

False. 

So, you get false but zero is not really false and what if, even though this is not the case, what if MySQL connects purpose in life was to return integers? Well, then you couldn't distinguish the integer zero from the Boolean return value of false, so this is a little weird in PHP in that you can return multiple datatypes, but it's a very common paradigm and frankly it's a little useful in that you can return one datatype 99.9% of the time like ints or strings or the like, but just in case something goes wrong, you can return a different type like a Bool namely false. So, this is different from a lot of languages whereby if you want to return a sentinel value to signal an error, you have to reserve certain numbers for instance or certain strings so it's common convention in a lot of languages to return negative one to signify an error. The problem with that of course is that you're killing half of your address space. If you're reserving all two billion negative integers for error messages, I mean that's a lot of potential mistakes and you're killing half of your possible address space, so this is much nicer even though it's a little bit messy. Alright, so connections; suppose it goes through, it doesn't return false what do we have to do next? Well, you can kind of guess from this next line based on the example we did earlier, select database. So, this just chooses a default database so we don't have to type the database name all over the place so that's helpful too and now notice I'm passing in two things, one of which I need to fill in the blank for, the other of which is a reference to the previous line, so now I'm going to go ahead and pass in the name of the database which was what? 

Jharvard underscore lecture. 

Yeah. Jharvard underscore lecture and as an aside, this is not strictly necessary, if you don't bother passing that in and you don't even bother retaining that value, MySQL connect and MySQL select database will assume you want the most recently open connection that was just open so that's a nice little convenience, but I'll go back to the original slightly more rigorous way. Alright, so now some of the code is pretty familiar even though it's been a few days. If is set user and if is set pass, what is the implication? Well, that means that the user submitted the form, right and we had this conversation a couple lectures ago where this is one way of inferring that a form was in fact submitted. Alright, what comes next? Well, now we have some SEQUEL code and there's a few ways to do this and let me do it the wrong way first, so I'm going to delete what's here now and we're going to do it more of a naïve way first. So, let me go ahead and do this. A SEQUEL variable and I'm going to just do select star from users where user equals "jharvard" okay. So, this line is obviously wrong functionally, syntactically it is in fact correct, but why is this obviously the wrong way to implement this? Yeah. 

Because it's going to look for jharvard no matter what user [inaudible]. 

Exactly. It's going to look for John Harvard no matter what even if the user is David, or Chris, or Alon or something else altogether. So, obviously broken, but at least it lets you see the SEQUEL syntax a little more clearly for the moment. Now, what's the goal? The goal at hand is to determine whether or not the person who's trying to log in, their username and their password is correct. We can do this is a whole bunch of ways, but intuitively if I can find the same user and password in the database that's a pretty good start because if I can find that username and password it means this person has typed in someone's user name and password correctly even though maybe it's not actually John Harvard, but that's a different problem altogether, so jharvard isn't quite right, so I could do this $ underscore POST "username" but there's a couple things wrong with this. One, you can tell from the quotes I'm kind of throwing off the balance here. Two, this is such an ugly looking variable it needs to be explicitly interpolated by putting these curly quotes around it, or curly braces around it. Alternatively, I could do this, let me try this another way. I could do ".username.' so that's another way and this is kind of a stylistic decision, frankly you see this a lot even though I find it harder to read. So, any problems with this? First, it's first noting the single quote is important. I this case doesn't matter if I'm using double quotes and the single quotes or single quotes and then double quotes, but anytime you are searching based on a string you do need to quote things in MySQL, so this is important having the single quotes around the username. So, that else could we do here? Well, let me adopt this approach and let me steal this code and move this up a line. So, what about this version? So, you might not have seen this function before, but it exists in a bunch of languages sprintf, stringprintf, notice that I can use printf like format strings as I have here as %s as a placeholder and then sprintf plugs in its second argument to that placeholder and then optionally it's third and its fourth and so forth, so why is the useful? Really, this just is an esthetic thing for me. I actually find at the moment the third line in orange much more readable that says select star from users where a user equals "%s", why? Cause I can kind of read it all in one byte and then I can mentally go back and plug in the values to the %s, but I'm also doing something else. One of the stupidest names for a function ever, but what do you think it's doing? What's the point of this function called? Isaac. 

I think it's to, if the username posts a malicious code. 

Yeah, so if the user somehow inserts malicious code. Similar in spirit to the XSS cross site scripting attack we talked about earlier albeit in the world of JavaScript and HTML. If they type in bad characters, maybe quotes feels like quotes can be a dangerous character if I'm using quotes in my string. There's other things that can be dangerous. What if the user says their username is delete that feels like it could be bad if I screw up with the interpretation of it, so in short MySQL real escape string protects against a class of attacks known as SEQUEL injection attacks and we'll come back to those next week in our security lecture, but for now know that this is good and leaving this off is bad. Now, I say it's kind of stupid name for a function because years ago PHP had no notion of name space support, did not have object oriented support and so the way in which they added more and more and more functionality to PHP is they arbitrarily use this underscore convention, so all of the MySQL functions begin with MySQL underscore something, why they called it MySQL real escape string is a ridiculous. There was a MySQL escape string function which is apparently a little bit flawed so if someone else created a real escape string function I mean it's like a bad joke that's become popular convention, so anyhow, we'll do better that this next time when we introduce another API for database connectivity, but the point for today is that super important to do this, otherwise you put your data at risk to various attacks among which include deletions or random insertions into your tables or theft outright of like your users and passwords or hashes thereof. Okay, so all three of these lines for the moment are equivalent but the third one is the best because it has this additional security check, so let's go back to just the original and delete the other two that I've prepared here and let's see what I now do. Here, is how you execute a command in PHP against this MySQL server, so when I use the GUI a bit ago, PHPMyadmin which again is just an administrative tool for poking around the database, has no, it's something you're going use in writing your Project One or projects in general. MySQL query is the equivalent in PHP of my having clicked the submit button a few minutes ago, so what does it return? So, it returns what's called a result set and you can think of a result set as a collection of rows. It's a temporary table that is the result of executing that query which in this case is select. So, the temporary table I'm going to back is going to have how many columns based on the orange query up there? Axel. 

How many columns? 

How many columns will this query? Sorry? 

Two columns. 

Two, and what are they? 

Username, password. 

Good. So, even though I use star the table itself recalled that we made only has two columns so the temporary table I'm going to get back by selecting from that table also has two columns. Now, how many rows is it going to have? Well, that depends on what the user typed in. Suppose the user did type in jharvard and crimson, how many rows obviously should the table, the temporary table have? Just the one. Now, in this case it's coincidence the temporary table is identical to the actual table, but now assume we're implementing Facebook and we have 500 million users, only one of whom is John Harvard so then our temporary table is obviously much, much smaller than the actual table, so then it make sense that would be doing this selection. Alright, so if result equal, equals, equals false that just means something bad went wrong; how do I know to check for that? The documentation for the MySQL query function on PHP.net says "upon error this function returns false" so that's how I knew to check for that. I'm using the identity operator again to be super correct. Die sort of an unfortunately named function, but it's exit with a non-zero exit code in this case so this just means something really went wrong. You should not typically die in production code. So, if you're implementing a real website for real users selling stuff, getting popularity and so forth, you should not die, you should instead return some kind of error message to the user but given that we're just trying to demonstrate MySQL here and we don't want to get into views and MVC again all of that, it's reasonable here to just say die with and informative error message. The user will see it, but it won't be very user friendly, it's going to be text. So, now we have to do something with this result set. So, just like XPath the function in PHP returns a node set which is a collection of nodes or really an array of nodes similarly is a result set of collection or really an array of rows from a temporary table. So, I can count those rows with MySQL num rows and if MySQL num rows when past that result set returns the number one that means there's one row in it. So, why, what do I thing do? If there is in fact one row that I got back I call this line, MySQL fetch assoc. There's a few different versions of this you can call MySQL fetch array, you can call MySQL fetch objects, you can call MySQL fetch assoc; in this case MySQL fetch assoc is associative array so it's nice because it's going to give me back an associative array of key value pairs where the keys are, take a guess. 

Username, password. 

Correct. The column names more generally username and password in this case and the values are what's actually in the cells. Yeah. 

One thing, instead of doing the MySQL count rows thing. 

Ah hah. 

Can't you just as a SQL do limit one? 

Yes, I could do that. I could add a limit one which is another piece of SEQUEL syntax up here, however, it could still return zero, so I need to check for something and so that's correct. If my database is correct though and I add something called unique index which we'll come back to, that should hopefully be unnecessary but it's still good practice, still a good addition so I'll leave it in there. Alright, so now how about below this fetch line. So, what do I have do at this point? At this point in the story I have presumably, at this point in the story I have selected John Harvard from the database, but I haven't checked what? I only selected based on username, so I have. 

Have to check password. 

Right, so I still need to do this additional step just because jharvard exists does not mean this person trying to log in is jharvard until we actually check the password. So, let's not do this. It's pretty trivial. So, after I've called MySQL fetch assoc, I have an associative array for that row whose keys are the columns whose values are the cells in that row. So, if row "pass" whoops I got to change this cause I chose different names before. So, if row "password" equals post "password" actually it pass in the HTML so I'll leave that be even though I'm being slightly inconsistent here. What do I want to do? Remember that the user's logged in and then all of these lines down here is just the cryptic stuff that we used last time to do the redirection of the URLs to redirect the user back to home.PHP and that's it. If they are not logged in notice that we have all of this stuff down here which is just a simple form that we've seen a couple lectures ago. So, let's see this in action and see if we got it all right, so let me go back to my browser, and let me reload log in five, good we got rid of the orange message, how? Well, we're actually connecting to a database properly, so it doesn't mean my, the rest of my code is correct so let's try this, jharvard with no password, let's see if I can sneak my way into the site. Log in, could not query database, that actually does not sound right, so what could have gone wrong there? Let's go back here and what am I missing? Jharvard crimson want to connect to jharvard lecture and it's saying could not connect to database, why is that? Jharvard crimson could not query database that's different message, sorry so we are connecting, sprintf MySQL select database users where, oh okay so this is failing and this is just my unfortunate choice of names earlier. What's the field actually called? 

Username. 

Good. Alright, so I made the mistake when I prefabbed the code I called it user instead of username so let's see if that now solves our problem there. Reload, we're going to resubmit the same form with just jharvard no password, hum, I ended up back here, why? Yeah. 

Because you were not allowed to log in so you were. 

Exactly, these are super simple examples the fact that I've been, I've seen the form again means that I got it wrong. So, let's try jharvard and some random sequence of characters, log in, still not working, jharvard and crimson log in, nice and so now I'm logged in. I can log out and now I'm logged out, back at home, I can try logging again. So, let's prove that this isn't faked. Let's go to where is it? Let's go to the appliance PHPMyadmin and let's go to our user's table, zoom out a bit here, let's edit jharvard crimson and let's see the update command, so I'm going to change John Harvard's password to password and then I'm click go, and notice now the SEQUEL that was executed, here's how you update a field, update, users, set, password equals "password" alright so slightly confusing but this is the column name, this is the value, where user name equals "jharvard" and in this case, whoops, and let's see, my cache expired, let me do this again, edit, we're going to change it 1, 2, 3, 4, 5 so it's more explicit. So, and password equals password, so it is trying to change John Harvard's password to 1, 2, 3, 4, 5 for any row for which it is the case that username equals jharvard and password equals 1, 2, 3, 4, 5. So, that's one way of uniquely identifying this row hopefully unless we screwed up and we allowed two John Harvards who coincidentally have the same password in the database in which case both of their passwords are going to be changed which just is wrong, so we'll fix that shortly. Alright, so what else can we do here? How can we improve upon this? Well, let me go back into the code here and let me go ahead and open up log in 6.PHP instead, so when log in 6.PHP notice that we have a couple of changes here and what different if we focus on this? Much of the code is the same, but what have I fundamentally done that's different? Yeah, Axel. 

Well, in the SQL you queried for the username and the password at the same time. 

Good, so in MySQL query in this middle thing here select one from users where user equals %s and pass equals %s, I need to fix this again and change user to username and pass to password, but that's not a big deal so that fixes that same issue again. So, selecting one why am I selecting one? Anyone want to hazard a guess at why I could take this alternative approach, what's an upside, what's a downside? If any? What do you like about it? [ Pause ] Nothing? Okay, what do you dislike about it? Alright, no one cares about it. So, take a guess like think back to the previous example where I only selected based on jharvard and I did some stuff. This time I'm selecting based on jharvard and crimson and I'm doing less stuff, right? What do I not have to do in the case of this query? [ Inaudible Response ] Yeah, I'm just saving myself some code, right? And this speaks to the beginnings of the power of an actual database server. You have the ability do filtration and selection and conditionals. You don't need to reimplement that wheel in the code, in PHP code or any language for that matter, so if you have the ability to do Booleans ANDs and you do here apparent in SEQUEL with capital AND in this case, well just punt to the database, let the database figure out if John Harvard exists and if this is indeed his password cause it means I can write less code and frankly, I'm going to assume that some professional database people in MySQL are smarter than I am when it comes to optimizing certain types of queries like comparisons like this. Now, granted I still have to do the thought upfront as to how long the field should be, what should the datatypes be to kind of help the database be highly performing, but once I do those initial hints when configuring the schema frankly the database should be able to do this pretty fast if It's doing its job well, so in the case I'm still escaping both just so that there's no danger and I'm only selecting one as a slight modification here. Typically, what did we say is returned when you do a select statement? You get back a temporary table inside of which is what? 

The things you asked for. 

The things I asked for. So, technically I'm not really asking for much now. All I want is the number one back from the database, so this is a minor performance improvement for at least a small database here, but if you think about in general, if you were returning a whole, if you were selecting star and we had more rows, more columns rather than just username and password and we had phone number and email address and, you know, GChat ID and all these various pieces of data you might have associated with the user, why in the hell do I need to select all of those pieces of data when all I want is an answer to the question, is your password correct? So, the way to return that answer as a efficiently as possible is just give me back a temporary table with one column and one row inside of which is literally the number one if my query is, so to speak, correct otherwise I get back no rows in which case it's clearly not his username and password. So, a minor performance improvement for again small datasets, but the fundamental idea is that don't select more data than you actually need, and indeed, previously when I did select star, I was being kind of lazy. I needed to get back John Harvard's password, but did I need to get back John Harvard's username? I mean, no I already had it, right, I gave it to the database so that too was just lazy on my part, so in general, avoid using star cause you're just going to waste time transferring information from the database to the web server, to your PHP code to access and if you don't care about those fields you should instead enumerate the ones you do care about one by one unless you want them all back. Alright, so the rest of this code is then thereafter the same. So, let's see another variance of this that allows us to take it in a slightly different direction. So, in this case, we're solving the problem that Axel referred to earlier. So, this design we started with it's kind of stupid. It's not incorrect, it's just not very good for security and by that I mean my usernames are in clear text and that's okay that kind of is necessary, but my passwords are also in clear text and by clear text I mean if John Harvard's password is crimson, what do we see in the database? Crimson, what does that mean? It means if someone's looking over the database administrator's shoulder he or she sees that same password, it means if the database is somehow compromised and a bad guy physically walks up to the computer and copies it or somehow someone on the Internet steals that database or someone on the Internet executes a SEQUEL injection attack inside of which is a select statement so now some random kid on the Internet has selected all of the users and passwords on my database, in short, not so good if by losing the database table you're also losing your user's passwords. So, what's an alternative? So, you proposed what before Axel? 

You can one way encrypt the password. 

Okay, what does it mean to one way encrypt the passwords? 

Well, it means to scramble the letters and numbers in a way that they can't engineer it backwards. 

Good. So, a one way hash as it's generally called is kind of like encryption but it's encryption in one direction and it's encryption in the sense that it does scramble the input, so your clear text becomes some kind of cypher text, but it's one way in the sense that you can't undo those effects, so if my password is crimson for crimson what's stored in the database is some crazy looking sequence of characters, it's not going to be crimson. The only problem with that is that no one now can see what my password is. So it feels like this is not the right solution, right? If I am not storing crimson in the database I'm instead storing a hash of it, a mutation a scrambling of it somehow and yet me the human, knows I know what my password is. I don't know what that random series of text is, only the database does. How do I subsequently log in after my password has been stored in this scrambled fashion? Yeah. 

Encrypt it and [inaudible]. 

Yeah, exactly so there is a solution here. So, when I create my account, when I register for an account, they don't store the database crimson any more, they instead run it through this one way hash which is just a function that returns some random characters, then they store that string in the database. So, what does that mean for subsequent log in's? Well, I obviously don't know what that random string is or that seemingly random string is, but I do know that my password's crimson, so all we have to do is that same math, those same mathematics again. The next time you try to log in, take my password crimson which I typed into the post submission, encrypt that or hash that so to speak and then compare what? Compare that result against the result you stored in the database and if they match then it must be me. Now, small white lie it's possible with various hashes that two people's passwords could hash to the same value, so if my password is crimson and your password is 1, 2, 3, 4, 5 in theory because of the way hashing functions work, the random sequence of characters that's stored in the database could actually be identical, so what does this mean? This actually means that I could log in with either crimson or with 1, 2, 3, 4, 5 the only catch is that I have no idea mathematically what else hash is to that same value so the reality is almost all of us who have accounts on sites on the Internet odds are you don't just have one password, you could have maybe two, maybe even three or more passwords it's just you have no idea what they are and you could try to figure it out, but if passwords can be 12 characters, 16 characters and they can be letters and numbers and punctuation, that's going to take a lot of time to brute force figure out what your other possible passwords are, but this is just the nature of hashing function. So, how are we doing it here? In orange here in the middle, almost the same code but notice that I'm practicing what I'm preaching here. I'm comparing not just the username with %s ignoring the misnamed user and pass field, not just comparing the user fields, I'm also comparing the pass field but I'm not comparing the pass field against %s, what am I comparing the pass field against? Capital letters, PASSWORD ("%s") so it turns out that MySQL as a database server can do more than just store rows and columns. It can also allow you to call functions, and in fact, databases often support what are called stored procedures whereby you can write your own custom functions, store them in the database and then call them, just so happens that MySQL gives you one such function for free called password which does exactly this one way hashing that we've been talking about. There are other functions like average and sum and count and various like useful things for a toolbox to have, but this is one relevant to passwords, now, there is a catch. This exists, but does anyone know what the caveat here is? Yeah. 

Can reverse engineering. 

Can reverse engineering. It sucks is really what it boils down to. The password function is the right idea, but not very well implemented, and in fact, with relatively little effort can you brute force the password function and figure out alternatives to those hashes so that you can figure out effectively how to log in as someone else. So, there's alternatives and Axel mentioned MD5 before, there's SHA-1, there's MD5, there's SHA-256, SHA-512 and yet others, the reality is that MySQL does not have great built in support for this kind of hashing. There's faults in most everyone of the algorithms that it does support out of the box, so generally even I in recent years have taken to using third party libraries and doing this in code, PHP pass is an alternative, it's a freely open source PHP library that actually does the mathematics for you so you call an actual PHP function and you pass that string into the database rather than letting the database do it yourself, arguably this is better too because then if you every do change from MySQL to Oracle or PostgreSQL you're not relying on database specific functionality you're only calling the most basic of SEQUEL functions like average and count and summation and so forth, so your code is arguably a little more portable, but this is one of those minor headaches of MySQL is that this function perfectly named though it is, is horribly implemented so just don't use it, but also don't store things in clear text, your better bet is using a library in PHP or whatever language you're writing in but the point ultimately is to hash it. Alright, any questions? Alright, so one last variance of this that was seven and here is now eight. Alright, so in version eight, no I'm not going to show you this one cause this one's just similarly not so great. Let's do this instead. Alright. So, MySQL, unfortunately documentation not so good. It's correct and it's complete, but it's not nearly as user friendly to navigate to be honest, so when it comes to like looking up things related to MySQL or SEQUEL to be honest, I think you find that Google and various free write websites some of which we've linked on the resources page on the course's homepage, and will point you at as needed in the pdf of Project One are probably better resources but if you ever need an authoritative answer about MySQL realize that the manual this is the place to go and this is in fact useful for things like how big is an int field? How big is a big int field? How big is a date field? How big is a text field and so forth? When you want definitive answers to those kinds of things, best not to trust random people on the Internet, but to actually go to the official documentation much like you would for various PHP functions but we've glossed over one important and compelling detail earlier in one of those dropdown menus in PHPMyadmin, name the primary key index unique and full text. So, in our user's table, we have username and password, which of those should be by nature unique? And by unique I mean only one user should ever have that value for username or for password? Yeah. 

It should be the username. 

Okay, username why? 

Because people, oh yes if you only want one person with one particular username. 

Perfect. 

And the passwords, well you can't really tell people the passwords they should have and then lots of people have like [inaudible] 1, 2, 3, 4. 

Okay, good. So, username it's reasonable to expect that really should be unique otherwise consider the alternative, right? Consider on Facebook if you could log in as jhavard but so could someone else, like whose profile are you going to see? So, like that clearly needs to be unique. Now, password doesn't need to be unique, right? Because it'd be nice if we could have the same password, it'd be nice if it's not something silly like 1, 2, 3, 4, 5, but by chance you have 500 million plus users and two people are going to have the same password, so to impose that constraint seems a little foolish, plus if you really think about it, well if you only had two users in the system and you try to choose a password and you're told that someone already has it, you know the other guy's password, right, but that's a corner case. Alright, so in terms of enforcing this, what can you do? Well, if you're writing code that registers users, signs them up for your website you could just do it in PHP, right, you select; if a user wants to register as John Harvard with username jharvard, you first check does jharvard exist and how do you do that? You do a SEQUEL select and then if it doesn't exist, you get back zero rows then you do your insert, if instead you do see a jharvard already you just yell at the user and you say pick another user name. Now, there is a problem here that we'll come back to next week whereby what if two John Harvard's are sitting next to each other or sitting in Starbucks across, halfway across the world from each other, but at roughly the same time they both try to register for the site, so they type in jharvard and then both hit enter at roughly the same time their post requests go to the webserver, the webserver processes them maybe even in parallel alright this is a multicore computer, multiple CPUs it has multiple threads therefore literally stuff can happen at that the same time and even if it can suppose one of them gets in ever so slightly before the other what might happen? Well, both of these guy's posts might be handled by the webserver. The webserver is going to treat them similarly and it's going to do a select from the database to see if jharvard exists, but then think about how computers typically work. You don't get 100% of the CPU's attention in multitasks among all of the various threads or things going inside of the computer, so right after your select statement for jharvard this thread so to speak might be put to sleep briefly, split second, but that means this guy might be woken up and his select statement gets checked. The answer now to both of these guys is jharvard does not exist. So, now your second line of code executes and you do the inserts then this guy gets put to sleep just because of the operating system is behaving. This guy then tries to do the insert, what happens? It breaks somehow, right? Either you're trying to insert another jharvard and worse case you now have two jharvards which is like the Facebook problem, now which is the right jharvard whose profile to show and so forth, or there's a failure case or somehow this guy's password is now the same for both people, in short, this is not a good situation. So, we'll solve that problem next time with this notion of locks or transactions, but for now this is kind of a bad situation but this where two databases are so much more powerful than XML files and CSV files where the [inaudible] is otherwise on you, here we can tell the database make the username field unique so that even if I screw up, I being the developer, I can at least still have a defense in place, so I'm going to go users, I'm going to click not edit cause edit, it would just edit my table instead I'm going to click on structure at the top and now here's just a reminder of what we did earlier, so I've got my username field, VARCHAR, Swedish and so forth. I'm going to go ahead and check all and then click change and now I'm just going to see again that form that we saw a while ago. So, what do I want to change here, I'm going to scroll over to the side and I'm going to change the notice here I can longer do it here, I actually forget, can't do this here. So, no longer do I have that dropdown that had uniqueness and that's because the table already exists, so it's actually on another screen and this is just a PHPMyadmin thing, it's not compelling for any other reason and if I scroll down here notice that next to user name I have a few options. So, I have an add unique index, add index, add spatial index, add full text index and actually there's another way I can actually do this, if I instead go over here I can check username and then notice these icons here, I can check primary or unique. Frankly this UI is a bit of a mess, but in short I have a number of alterations I can now make on this field. So, what do I want to say? I do want to say username should be unique, but there's a special word given for unique keys if those keys or rather those fields also are supposed to uniquely identify rows in your table. So, in this case username uniquely identifies my users or should at least, so I could make a unique field by clicking where the cursor is now at top right, clicking unique but technically in the future if I'm going to use this field jharvard to uniquely identify users, this is by definition what's called a primary key. It is the key, the field, the column that you use to uniquely identify your users which it to say, your table can have other unique fields even if you don't actually use them to identify your users, for instance, what's another field that I in theory a user, only one user in the world should have one of? Phone number? [ Pause ] Cell phone maybe, but you know some of us, some of you still have landlines, right? And whole families or roommates. What? 

Social security number. 

Okay, so social security number. That is supposed to be a unique number per person and so you might want to enforce that in the database but frankly not everyone in the world has a social security number so you might want that field to be Null, but when it's not Null you want it to be unique so you can have the database at least enforce that. What about email address? Well, email address in theory should be unique unless you're sharing an account or something, but it isn't necessarily what I want to identify my users with, if only because an email address might be this long. What would actually be a better datatype to use in general for uniquely identifying rows and a table do you think? What does Excel use? 

Index. 

What's an index though? 

It's like an iterative number. 

Yeah, [inaudible] number. So, just a number, right? One, 2, 3, or maybe we started with zero, 1, 2, 3. So, just a number, why? Cause then it's only 32 bits or 64 bits, so it's only 4 or 8 bytes meanwhile my email address, you know, it could 10 characters, 20 characters even depending; social security number is similarly long with hyphens and what not, so in short, the best candidate for a unique key is probably not a string at all. So, let's actually alter this table. I'm going to go ahead and before I change the key notice down here I can make some alterations. I'm going to say add a column at the beginning of the table, although where it goes doesn't really matter, but it's common convention to put your primary key at the top of, at the beginning of your table, I'm click go, this is going to allow me to add a new field and I'm going to call it aptly ID and I'm going to leave this blank, this blank, this blank, I'm going to leave this as none, I don't want a default value and I'm going to go ahead and click save and now notice I have in int whose size or type is apparently 11. This is legacy thing, this does not mean your integer can have 11 digits, it does not mean your integer can have 11 bits, it means when you are in this black and white window that we saw earlier and you print out these old school looking columns like this thing here, that means your integer will use 11 characters in this black and white interface, so completely legacy, so it's pretty much irrelevant; 11 was just the default now. Alright, so now I have some power. I'm going to go ahead and change this. I'm going to go ahead, whoops, let me go back to my structure. I'm going to go to ID and let's say what do I want to do here, change, I'm going to go to why is not letting me, there it is, change, oh I didn't scroll over far enough before this was here, so AI, does anyone recall what this is? Yeah. 

Part of auto-increment. 

Auto-increment. 

I think it will when you add another user it's automatically going to be two if the first one is one. 

Exactly. 

Add [inaudible]. 

So, this is yet another feature of the database that's not just nice to have, it's really compelling. So, if you want to uniquely identify your users as efficiently as possible much like Facebook does with a number in the URL unless you chose a custom yeah username for Facebook; a number is the way to go; 32 bits, 64 bits, definitely nice and predictable instead of a variable length string, but you don't really want to be in the business frankly of figuring out what user ID is available, right, two would be kind of annoying if anytime a user registered you first had to figure out the highest number of the previous user's ID and then choose the next number by adding one, it's not hard, I mean it's trivial mentally to do that but it's just extra work plus you run into the situation of what if two people try to register ever so slightly the same time, you're going to get into this so-called race condition, but again more on that next time. So, I'm going to choose auto-increment here because what this means is when I do an insert in the future I am going to have the ID field automatically assigned for me. Now, I can't just do this yet cause notice incorrect table definition, there can only be one auto column and it must be defined as a key. We haven't finished that part of the story of making this a key. So, rather than use username as my primary key that uniquely identifies my users, I'm instead going to say, you're going what, ID will be my primary key and this is the SEQUEL query that was just executed, alter table users, add primary key on ID and now I'm going to go and change this field and scroll over to the side make it auto-increment and save and now notice under the extra column, I'm just being reminded that this has the auto-increment flag on it and here's how this worked; alter table users, change ID, ID to not Null, auto-increments. Alright, so username that still leaves the question of username. Should username have any kind of index or key? Well, do you want username to be unique? So, we still do. You can only have one primary key though and we'll see why next time too why you want to have a primary key as opposed to just a unique index, but I'm going to go over here and say this should be unique, so now what will happen, whoops, my cursor is doing strange things. Here's browse, we only have one user, so let's try to insert someone else into the database and we'll do it the real way with SEQUEL commands. So, insert into users a username and a password and I'll put my quotes around everything just good measure, also because password is a function recall. The values of "malan" "1, 2, 3, 4, 5" and semicolon is not necessary here, it's only necessary in the command line clients. So, let's do go, okay so that was inserted. If I go to browse now, notice I have two rows, malan and jharvard. Now, let's try to register malan again but notice one thing first, what was John Harvard's ID and what was malan's ID? One and two. So, exactly what Axel promised would happen, the nice thing here is I didn't have to think about that, I didn't have to insert two myself. I won't have to insert three myself, if it's auto-increment auto is literally the keyword here it happens automatically for you and you avoid this thing called a race condition where two people might be registering at the same time and might otherwise, if I were writing the code, give them the same ID which would be bad. So, now let's test our uniqueness constraint. Let's go into SEQUEL again and let's do insert into users a username and a password of, with these values malan and a password of 5555, so different password, alright go; duplicate entry malan for key username. So, what does this really mean? So, this is an error now that's happened so if you called MySQL query you would actually be in form programmatically in PHP an error has happened which is you're way of inferring I must have tried inserting the same username twice, so we now have this defense in place. Alright, so this table is coming along and it turns out that primary keys are going to have a relationship with something called foreign keys, but more on those separately. What else motivates this choice? What might you gain by telling the database in advance that this field is special, that this field is unique? What might that lend itself to that's not too big? It's actually the answer to; it's the same answer to every question thus far tonight. Yeah. 

Maybe it has to do with something with performance. 

Yeah, performance. Good answer. So, it does actually have to do with performance. When you create an index as it's called or a key, key index they're essentially referring to the same thing. When you create an index on a field, on a column in a database, the database is going to spend some effort, some time and some disc space upfront to optimize that column and by that I mean it's going to build up a secondary data structure which is usually a data structure of a tree, a B-tree in fact if you've take a data structure's class and a B-tree is essentially a very shallow tree that lends itself to searching large chunks of data like you might have in a big database table. It creates this index, this B-tree that makes it much easier to answer questions of the form who is ID two? Or is jharvard in this table? Any query you might want to ask about a specific field in this case ID or even the username field will be faster. Previously, when we did the select star from users where username equals jharvard, that was a linear search of the entire table. Now, you were not unimpressed because there was one person in the table, so it's obviously not that slow to search the table, but if we did have Facebook's 500 million plus or 800 million plus users, that would have had been a linear search looking for John Harvard, awful, especially for large datasets. Now, if you instead click that button and say make this a unique index, or make this a primary key index, then the database is going to churn through that list and will do linear search maybe once, maybe twice, maybe three times, but the output of that process is going to be some kind of tree structure that's kept around in RAM so that the next time you ask me a question, it's going to be much faster to answer the query and it's going to do something like binary search or even something fancier than that. So, indexes are huge and if you've ever visited a really bad website in terms of performance, it's slow, going from page to page, has nothing to do with your Internet connection cause you're at home on Broadband or what not, just a really crappy website odds are it has to do with one, server could just be overloaded and there are too popular for their own good, or are very likely they just didn't know what they were doing when creating their database tables and they just created rows and columns like you would in Excel; they gave not thought to primary keys, indexes, uniqueness, or anything like that. Full text is similar in spirit, it can be used for text fields, so here's one of these design tradeoffs we didn't touch on earlier, VARCHARs are great for variable length strings. Text fields are great for variable length strings that are even longer, but you do pay a performance penalty cause the text fields remember end up elsewhere which just means they're not as local, caching issues and so forth, but the upside of using a text field which is bigger is that you can put a full text index on it which means you can do Google like query is on the search saying return this row if it has the keywords foo, and bar, and not baz or things like that and you don't have to implement that yourself the database can do it for you. The price you pay though is performance for just selecting the data potentially. So, again, there's no perfect solution here it really depends on the use case you're trying to solve. So, let's consider a problem that we're not going to solve tonight, but that does kind remain. Suppose, we augment our user's table to include not just username and password and not even just ID, but again what are some things we might want to associate with a user? Axel. 

Their email address. 

Email address, good. Give me something else. Yeah, Conner. 

Phone number. 

Phone number, good. Something else. 

Gender. 

Gender, good. Something else, we'll keep doing this until we get the answer I need to tell the story. Yeah. 

Pictures of the user. 

What's that? 

Pictures. 

Pictures of the user, good. Oh, actually we can tell a quick story about this. How do you store binary data? So, you actually can store it in a database and in fact one of the fields that we didn't look at but was on the screen there briefly when you create a new field, let me try to simulate it, let me go to structure, let me go ahead and add a new field and show the types. At the bottom here, there's some fancier features that we really haven't even scratched the surface of BLOBs, binary large objects and this just refers to binary data, so you could actually store photographs of users in the database or you could store them on disc and this is actually one of these other nonobvious design decisions, but this one at least there's some good rules of thumb. Frankly, I am of the philosophy that data belongs in a database and files belong on a file system and by this I mean if you are having your users upload photos or resumes or what not, storing them in the database is probably the not the right place, cause the database is going to get bigger and bigger and bigger, I mean files are generally much bigger than textual rows and so you're going run eventually into performance issues, you're going to run potentially into disc space issues, you're going to run into replication issues whereby very much in Vogue [assumed spelling] these days or CDNs, content delivery networks which are just severs like Akamai and Google and the like, and Facebook has this too, hundreds or thousands of servers whose purpose in life is just to serve up static content, JavaScript files, jpegs, ping, movie files and what not and that's all they do. If instead you're storing that data in a database now you have to replicate your database around the entire world and MySQL databases here and here and here which is totally possible, but completely unnecessary for scalability, so we'll talk more about this at the end of the semester but in general, I think in like that past 6-10 years I've never stored binary data in a database really, at least not files. Better to store the file on disc in a folder for instance of uploads that's owned by you, [inaudible] suPHP which we talked about a couple times ago, but what could you still store in the database? 

Path. 

The path. So you store the path or the name of the file that's uploaded, so you still associate the thing with the user but you store it in a more natural environment that lends itself to scalability and access controls and the like. Now, there's some other cool ones if you like things like geography and anything related to points and latitudes and longitudes; there's nice built in support in MySQL for that kind of stuff too where you can actually do queries like is this a latitude, longitudinal point nearby this other one and similar queries which you would have to otherwise implement yourself in CSV or XML or any other domain like that. Alright, so let me try to coax this last story a little further along, what else might you want to associate with a user? 

Address. 

Address, there we go. Okay, so address is interesting because it may be the case that you have unique postal addresses or home addresses unless you have a family or a roommate, so there's some corner cases there but let's start telling this story. So, now John Harvard lives for instance at 33 Oxford Street, Cambridge, Mass. 02138. Suppose, David Malan also lives at 33 Oxford Street, Cambridge, Mass. 02138 and so and so on and so on, what's the redundancy there in particular? Yeah. 

You're storing the same string or whatever multiple times. 

Exactly. What string is really redundant here or strings? Jack. 

Cambridge, Mass. 

Cambridge, Mass. Why the heck am I storing Cambridge, Mass; Cambridge, Mass; Cambridge, Mass; Cambridge, Mass. again and again when really I could identify Cambridge, Mass. by what? 

Zip code. 

By zip code. Now as aside, the US is kind of a mess with the zip codes and over time I've learned that zip codes actually don't always follow town boundaries and there can be weird overlap and so forth so this is a nicer story than it is actually in practice. Sometimes town share area, zip codes and vice versa. So, kind of a mess but that's the human's fault not the computer people's fault in this case, so let's at least assume that zip codes do uniquely identify cities and states, so this is great cause I can store five digits of CHAR field even, 02138 and then how do I remember that 02138 maps to Cambridge, Mass? Jack. 

Is there another database somewhere that has all of the areas that have the existed or have been put up on your server? 

Yeah, exactly. So, you can buy off the Internet a big database of zip codes with city, states. So, you could just buy that or find it somewhere and then you could store your own local copy cause what you can do with MySQL even though we haven't done it yet is you can have more than one table, right? Just like Excel can have multiple worksheets. MySQL can have multiple tables, so I can have one table called users and another table called cities or another table called zips whatever I want to call it and what would the primary key be in the zips table? And what would the columns be? Yeah. 

The zip code. 

Yeah, probably the zip code. Could be a unique number like an integer but if I already have a number and it's a fixed length like this is actually a pretty good candidate for a primary key, so what else would be in the zips table? 

The address. 

Yeah, city and state and I could push a little harder. I could factor out anyone who has 33 Oxford Street somehow, but frankly storing the street address redundantly is probably okay. Storing city, state again and again doesn't feel nearly as okay cause right, how do I uniquely identify 33 Oxford Street? I have to standardize on a name for it like 33 space Oxford space, we'd have to make more of a design decision there whereas at least with city, state and zip that in theory should have a nicer relationship, so I can factor that out. So, now the issue of primary key should maybe make a little more sense. Now, we can have a zips table whose primary key is zip code and whose other columns are city and state and so forth. We can then have my user's table and we add another column to called zip and what do I store in the zip field? Well, something like 02138, at that point in the story, 02138 or specifically zip code is a primary key in the zip's table and it's what's called a foreign key in my user's table, so this too is why it's advantageous to define these kinds of keys. Primary key again means uniquely identifies rows in this table. Foreign key means, is a primary key in another table and what this will allow us to do is take the user's table, take the zip's table and if I want to see all of that data together, I want to see jharvard crimson 02138 Cambridge, Mass. I want to join all of that information together so I can get at it with one associative array, one, MySQL fetch assoc call, what we can do is if the right most column of this user's table is zip and just for the sake of pictures the left most column of, it's the opposite to you guys, the left most column of the zip's table is zip, right so we have zips, zips, we can effectively overlap them and join them so that now we have a wider table that has all of the data we care about; redundantly but at least now it's a temporary table, so this refers to generally the process of normalization and factoring out data that would otherwise be redundant is not necessary to keep and this is the bread and butter of relational databases; you put as little information as you need to solve the problem in a given table, you factor out as much as you can and you leverage an feature of SEQUEL known as joins to actually rejoin the data later which even though it costs you a bit in CPU cycles, saves you significantly in space especially when you have many, many, many, users. Alright, any questions? Alright, why don't we go ahead and call all on Louis. 

Louis. 

Louis. 

When we're doing a project, so PHPMyadmin that's in the appliance? 

Ah hum. 

Is there any other, GUI that we can use or do we have use that because that's [inaudible]? 

It's in the appliance already, if you're running Windows there is a Windows when MySQL client that you can download somewhere on MySQL.com you would then configure that Windows program to talk to the IP address of the appliance with user name jhavard password crimson and you could use that as well, to be honest, I would, even though I have my qualms with its UI, it's actually a wonderful useful tool that just gets the job done and it's not again a key part of the project, it's just a user friendly way of getting at the data and creating tables and such. 

We can use any methods to create a database? 

Yep, absolutely. 

PHPMyadmin and stuff? 

So, exactly. So, at the end well and actually for Project Zero you don't need MySQL at all, you don't need PHPMydmin at all, so this will only be relevant next week for Project One which will actually use MySQL, but at that point you're welcome to use any development environment you want, any tools that you want, so long as, as we say in the spec your code works properly when you install it in the appliance so that when we and the teaching staff install it in our appliance, it's guaranteed to work and it's not tied to your random PC configuration or Mac configuration, there's at least a standard installation setup. Let's adjourn there. I'll stick around for one on one questions, otherwise Alon will start with section in just a little bit. See you on Monday.