All right, welcome back to Computer Science E-75. This is lecture one, in which we actually dive into PHP. And so you pulled up your browser, you typed www.google.com and you hit enter. Can we play back that story, what happens first, and try to impress everyone with as much technical detail, one step at a time, as possible? Give me one step in this process. You have hit enter, what happens? Yes. Communication with the DNS server. OK, so there's some communication with the DNS server, whereby your browser asks the local operating system, what is the IP address of google.com? If your operating system itself does not know, it in turn asks the local DNS server. And who typically owns or controls these DNS servers? [ Inaudible Remark ] Yeah, your ISP. So Verizon, Comcast, Harvard, your company, anyone along those lines. And if your company or your ISP does not know what the IP is for google.com, what happens next? Yup. They probably know another DNS provider that knows, so it may direct the request to that. Excellent, they probably know some other DNS server, and so they ask a bigger fish, followed by a bigger fish and so forth. And worst case, there are these root servers that at least know where the other authorities are for the various .coms, .nets, .orgs. And the reason that all works is that when you buy google.com or your own personal domain, you at least have to tell your registrar what? Yeah. The DNS server of-- where you're getting your website. The DNS servers of the hosting company or whatnot where your website lives, and those are typically called NS1 and NS2, just by convention. But the important detail is that there are usually two DNS servers that in turn know your website's IP address, know your domain name's e-mail servers and the like. OK, so now my browser knows the IP address of google.com, what happens next? Yeah. It sends a GET request. Good. Yeah, the room of actual hard drive. 
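The lookup chain described above (operating system, then the ISP's DNS server, then a bigger fish) can be sketched as a toy model. This is only an illustration under stated assumptions: the `Resolver` class, the layer names and the IP address are all invented here, and real resolvers speak the DNS protocol and follow referrals from root servers rather than calling each other like this.

```python
# Toy model of the lookup chain: each layer answers from its own cache or
# asks the next, bigger fish. All names and the IP address are invented.

class Resolver:
    def __init__(self, name, records=None, upstream=None):
        self.name = name
        self.cache = dict(records or {})   # hostname -> IP address
        self.upstream = upstream           # next resolver to ask, if any

    def resolve(self, hostname, trail=None):
        """Return (ip, trail), where trail lists every layer that was asked."""
        trail = [] if trail is None else trail
        trail.append(self.name)
        if hostname in self.cache:
            return self.cache[hostname], trail
        if self.upstream is None:
            raise LookupError(f"nobody knows {hostname}")
        ip, trail = self.upstream.resolve(hostname, trail)
        self.cache[hostname] = ip          # remember the answer for next time
        return ip, trail


# Hypothetical chain: the authoritative server is the NS1/NS2 the registrar
# was told about; 192.0.2.10 is a documentation address, not Google's.
authoritative = Resolver("ns1 (authoritative)", {"google.com": "192.0.2.10"})
isp = Resolver("ISP DNS server", upstream=authoritative)
operating_system = Resolver("operating system", upstream=isp)

first_ip, first_trail = operating_system.resolve("google.com")
second_ip, second_trail = operating_system.resolve("google.com")
print(first_trail)   # every layer was consulted
print(second_trail)  # answered from the operating system's cache
```

The second lookup never leaves the first layer, which is the efficiency gain of caching and also the reason a stale answer can linger after a site moves.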
Good, so we told the story of the virtual envelope, a.k.a. packet, and that's sent from point A, you, to point B, Google. And inside that envelope is this message, "get me slash," and then there's some reminder of the protocol that's being spoken, "HTTP slash 1.1" or whatnot. What's also inside of that packet? What other information? Could be a reminder of what actual web address the user typed in. Good, so a reminder of the address that the user typed in, which is the Host HTTP header, and this is crucial for what feature offered by today's web servers? Someone else, yeah. Virtual hosting. Virtual hosting, whereby you can put many websites on the same physical machine and even on the same IP address, because browsers thankfully will remind the server what host name was actually requested so that the web server can distinguish between your website, someone else's website and so forth. All right, this virtual envelope goes to Google. Google opens the envelope, so to speak. Sees get slash, dot, dot, dot. Realizes, "Oh, you want the root of our website?" In Google's case, that's all the HTML and other assets that compose their home page for searching. And so they respond with a packet of their own, or more packets of their own, inside of which is all that HTML. Your browser receives it, renders it, and the connection is closed. Now in terms of more subtle details, browsers these days are fairly smart in that, rather than always having to ask the operating system, Mac OS, Windows, whatever, what the IP address is of google.com, a browser will typically cache that IP address. So this just means it's slightly more efficient than asking the operating system and certainly more efficient than asking local DNS servers. But there's a gotcha, and one of the themes of this course will be to try to point out some of these details. 
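To make the contents of the virtual envelope concrete, here is a minimal sketch of the text a browser sends and how a server could read the Host header back out of it. The helper names `build_request` and `parse_request` are made up for this illustration, and the `Connection: close` header is an added assumption; a real browser sends many more headers.

```python
def build_request(host, path="/"):
    """Build the text of a minimal HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"          # reminds the server which site was asked for
        "Connection: close\r\n"
        "\r\n"                       # a blank line ends the headers
    )


def parse_request(raw):
    """Split a raw request into (method, path, version, headers)."""
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers


method, path, version, headers = parse_request(build_request("www.google.com"))
print(method, path, version, headers["host"])
```

Because the host name travels inside the request itself, a server holding many sites on one IP address has what it needs to tell them apart.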
Because if you are not just a user but you're actually a web developer trying to build new websites, suppose that the IP address has been cached, but suppose that you moved the website to another server or another virtual machine. There are these gotchas you might run into. And so one of the recurring themes of any sort of web development, especially in this PHP world, is to constantly be clearing your cache. And one of the upsides of using Chrome, frankly, for primary development is it has incognito mode, which, well, usually is used so you can browse sketchy places online. It can also be used to a developer's advantage in that it will prevent cookies from being saved and other details from being cached. But even then, it's not perfect, and even I often have to quit the browser entirely and clear my cache manually. If you ever notice anomalies happening, like, "I know I changed that file," it could just be some stupid cache issue. Just-- so put that in the back of your mind so that you don't waste 10, 20 minutes some night this summer chasing down a bug that you actually already fixed. Caching takes many forms and DNS is just one of them. All right. So any questions on that big picture of HTTP? None? All right. So where does this all fit in? So this is the picture we essentially just painted verbally, so what's on the end at point B? In this case Google or some other server. So one of the most popular web servers out there is Apache. This is freely available software. It can run on Linux computers, Macs, Windows computers, but it's super common in the Linux and Unix world in particular, and those tend to be the machines used for web servers these days. It is the A in LAMP. So LAMP is just a silly buzzword for Linux, Apache, MySQL, PHP, and that's just a buzzword saying, "I'm using all of these various technologies." But common jargon in the industry is to say, "I'm running a LAMP stack." 
And that just means you have Linux as your operating system, Apache as your web server and so forth. And so there's nothing technical about the term, but we'll be looking at the individual pieces. So one of the latest versions of Apache is 2.2 something. This is the documentation there. I will say from personal experience, I've never found it the most user-friendly. So frankly Google is a better friend to me at least than Apache is, along with websites like stackoverflow.com and serverfault.com. These are wonderful places where smart technical people post generally useful solutions to common problems. So keep an eye out for-- or make use of those resources as you see fit. But what are the kinds of things that you can do with the web server configuration? Well, virtual host name. So this is a representative snippet from a file called httpd.conf. And let me just pull up a little scratch pad so we can type out some notes here. And the blackboards are occluded by the projector here so we'll use TextEdit. So this just so happens to be the name, typically, of a configuration file; however, you might also see it as apache.conf or apache2.conf. It really depends on your operating system or the distribution of Linux, for instance, that you're using. But the important takeaway is that this is typically the main configuration file for an Apache-based web server, and [inaudible] Microsoft's IIS server has similar features. There's other web server software, but Apache is definitely among the most common. And here is a representative snippet from that file that apparently is implementing what feature for the web server, if you can infer? Kind of just guess by reading it. Yeah. First off, it's port 80, so that's a regular website. OK. Good, so you see port 80 at the very top there, which suggests it's indeed a sort of standard website living on a standard port. What else comes to mind? What other feature is being conveyed by this config? Yeah. A database. 
A database, where are you inferring database from? Is that port 443? 44-- so not-- 443 is actually used for SSL. So there's two pieces here. We can-- and we'll focus on both, but first, the top one, port 80, is sort of the simpler of the two, so let's look there first. So I'll put this one up. So virtual hosting, this feature whereby a web server can use multiple-- the same IP address for multiple websites, is implemented literally by way of a file like this. This is telling the web server, and the top thing there is just a comment, this is telling the web server, "Hey, define a virtual host, or Vhost, on port 80 of any IP address that the server has." So star denotes anything, and in this case, it's meant to mean an IP address. And this is relevant because if the web server just so happens to have multiple IP addresses, this is a wildcard character that just says, it doesn't matter what IP address the request comes in on, go ahead and just listen on port 80 on all of those IPs. So another common thing, especially if you're developing on your local virtual machine, which is increasingly common, and this again is what we'll do in the class: sometimes you do need to know the IP address, especially in various cloud environments. So just be mindful that sometimes star is not sufficient unless you have another layer of configuration that I'll wave my hand at for now because we're just looking at a snippet here. So this says, listen on port 80 on any IP address that the server has for incoming requests. Now, when requests do come in to the server, thankfully, they should have that Host colon HTTP header that reminds the server what this request was for. So, if you skim through some of these, and let's skip the top part now, server name, this is where the Vhost's name is actually defined. And we'll see it down here, too. For the SSL version, the name of this website will be the same. But I've also defined what we call an alias, which is just what in this case? 
Quick sanity check. Yeah. The same site [inaudible]. Exactly. The alias here is just cs75.net with no www. So, this is just one of the steps necessary to ensure that both www.cs75.net and cs75.net work. So, the quick story I told on Monday about certain websites just not working with just something.com or the like is because someone did not think to configure a fairly minor detail like this. Again, this is Apache, but other web servers, Lighttpd, Nginx and others, have similar features. So, this is one step, and just to tie Monday into tonight, what was the other key detail that you need to take care of to ensure that both work? Both www and not www. Not a redirect. A redirect is really just to ensure the user ends up at the place you want; here you want both destinations fundamentally to work. [ Inaudible Remark ] So, we needed a DNS record, an A record in particular. So, we needed to specify that cs75.net itself has an A record, and we need to specify that www.cs75.net has an A record or, what other type of record could it be? [ Inaudible Remark ] Multiple aliases, which we called CNAMEs on Monday. So, CNAME is canonical name. Now, this too is sort of a corner case. Technically, unfortunately, you cannot generally make CNAMEs for the root of a domain. cs75.net cannot be a CNAME for something else, but something with a host name, www, ftp, mail, dot something.com, those can all be CNAMEs, and that's a bit of an oversimplification even. You can have cs75.net be a CNAME technically, but things like e-mail tend to break as a result. So, let me just make the blanket statement that this has to be an A record, and this can be an A record or a CNAME. So, just little things you need to keep in mind when setting up, for instance, your own domain name that you just bought. 
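The A record versus CNAME distinction above can be sketched with a tiny in-memory zone. This is a simplified model, not a DNS implementation: the `ZONE` table, the `lookup` helper and the address 192.0.2.75 are all hypothetical, and the bare domain is given an A record in line with the blanket statement above.

```python
# Hypothetical zone: the bare domain gets an A record; www is an alias (CNAME).
ZONE = {
    "cs75.net": ("A", "192.0.2.75"),
    "www.cs75.net": ("CNAME", "cs75.net"),
}


def lookup(name, zone=ZONE, max_hops=8):
    """Follow CNAME aliases until an A record yields an IP address."""
    name = name.lower().rstrip(".")
    for _ in range(max_hops):
        record_type, value = zone[name]   # KeyError means no record: a dead end
        if record_type == "A":
            return value
        name = value                      # CNAME: try again with the canonical name
    raise RuntimeError("CNAME chain too long")


print(lookup("cs75.net"))
print(lookup("www.cs75.net"))
```

Both names end up at the same address, which is the DNS half of making the site work with and without www; a name with no record at all, such as a hypothetical ftp.cs75.net here, simply fails to resolve.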
Server admin, so this is just a fluffy detail so that if there's ever an error on your website and you see, like, a 404 or something like that, if you haven't customized the error message, the footer of the web page is generally going to give the email address of the webmaster at something.com. In this case, we're telling it to use this address just because. So, it's not something like webmaster, which doesn't exist in our case since we're such a small shop. Lastly, custom log, error log, these kind of do what they say. They just specify the folder in which you want logs to be stored, and the most important line here, though, perhaps is document root. Now, this is kind of crazy long and cryptic. It just is what we as a class decided to do in terms of the layout of our hard drive. However, all this is telling the virtual host is that the HTML files or PHP files, GIFs or PNGs for this virtual host called www. [inaudible].net live specifically in this directory on the server. Very often this will be a much shorter path for normal people, but we've kind of laid ourselves out fairly hierarchically, which is why it's so long, but that's all it means. All right, any questions? And again, this is something that for the first project you'll have an opportunity to tinker with and even break if you want, and you'll be able to restore it rather easily. All right. So the virtual host on port 443 is a little more interesting but also mostly a duplicate, but a few lines are new. Which ones jump out at you as obviously new? So, all the SSL stuff at the bottom. So SSL is kind of a pain to set up, at least with certain web servers, whereby you have to configure a few files. So what is SSL? SSL is Secure Sockets Layer. This is the protocol that websites use to communicate securely with browsers, but what is necessary before you can actually use SSL on your website? Does anyone know? What's involved in doing this? Yeah. 
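As a rough sketch of how the server name, alias and document root fit together, the snippet below picks a virtual host from the port and the Host header. It is a model only, not Apache's implementation: the `VirtualHost` class, the second site and every path are invented for illustration. The fallback to the first listed host on the port mirrors how Apache treats name-based virtual hosts when no name matches.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualHost:
    port: int
    server_name: str
    document_root: str                      # hypothetical paths, for illustration
    server_aliases: list = field(default_factory=list)

    def serves(self, host):
        names = [self.server_name, *self.server_aliases]
        return host.lower() in (name.lower() for name in names)


def select_vhost(vhosts, port, host_header):
    """Pick the vhost for a request that arrived on `port` with this Host header."""
    host = host_header.split(":")[0]        # drop an optional ":port" suffix
    candidates = [vh for vh in vhosts if vh.port == port]
    for vh in candidates:
        if vh.serves(host):
            return vh
    # No name matched: fall back to the first vhost listed for that port.
    return candidates[0] if candidates else None


VHOSTS = [
    VirtualHost(80, "www.cs75.net", "/srv/www/cs75/html", ["cs75.net"]),
    VirtualHost(80, "www.example.com", "/srv/www/example/html"),
    VirtualHost(443, "www.cs75.net", "/srv/www/cs75/html", ["cs75.net"]),
]

print(select_vhost(VHOSTS, 80, "cs75.net").document_root)
```

The alias is what lets a request for cs75.net land in the same document root as www.cs75.net instead of falling through to whichever site happens to be listed first.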
I think you need to distribute a certificate that the user will have to get. Exactly, you need to distribute a certificate, part of which the user will need to get from you. Thankfully, it's all automatic. So how do you go about getting an SSL certificate? So there's a couple of things you can do. You can either create one and sign it, so to speak, yourself, or you can pay someone else. And have you ever been to a website that said-- whereby the browser, upon visiting, yells at you, saying something like "this website cannot be trusted" or, you know, "you should not go here" for some reason like that? So that's because that website probably doesn't have a certificate that was signed by what's called a certificate authority. And I think I can actually simulate this. I just happened to come across this the other day because I wanted to make one of my university websites run over SSL. So let me open up Chrome here and type in https://cs.harvard.edu, enter. Perfect, perfect example. So the CS department has not paid for what's called an SSL certificate, ironically. And I will fix this, but it's a great demonstration. So what does this mean? It means that the site isn't necessarily insecure, per se. It pretty much boils down, and this is somewhat pessimistic, to the fact that we have not paid for an SSL certificate. We have created an SSL certificate, whereby that's just a command. On a Linux computer, you typically run a command called openssl with some fairly arcane command-line arguments and hit enter. And that gives you what's called a public key and a private key. What does that mean? Well, for our purposes here, just know that there's a fancy mathematical relationship between this thing called a public key and a private key. They're really just big random numbers, and mathematically, people on the internet can use CS75's public key to encrypt information to us. 
So if some random user is visiting, trying to visit Harvard's CS website, their browser automatically will say to cs.harvard.edu, "Can I please have your public key?" And the server will send it for free and over the internet publicly; it's not something that's secret. The public key is meant to be-- by definition public. That browser will, behind the scenes, unbeknownst to the user, use that public key, that big random number, to encrypt their request. And the request can be something stupid like get slash, and that's literally all my request just now was. But it encrypts it nonetheless. And you could probably guess, what is the only number in the world that can decrypt something that's been encrypted with the public key? The private key. And that's something that my server or the CS department's server keeps to itself. And you don't give it out, and the web server's never going to send it. It's stored somewhere on the hard drive. Now mathematically, that key will be used with a mathematical formula to reverse the effects, essentially, of the encryption. So that what the CS department's web server finally sees is get slash or whatever it is the user wants. And conversely it works in the other direction. When you install a browser, your browser generates a public and private key pair, so that the traffic can work in the opposite direction as well if necessary. So what's the takeaway here? We did all that in the CS department, but we didn't pay someone else to certify that we are Harvard University's CS department. So the way SSL works on a higher level is that there is this chain of trust that humans in the world have tried to build up, whereby there's big companies, like Verisign is one of them, GoDaddy is another, and maybe even Namecheap does this, even more cheaply than others, whereby you have these fairly big entities in the world who charge you money to then stamp, so to speak, your certificate as valid. What does that mean? They digitally sign it. 
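The "fancy mathematical relationship" between the two keys can be illustrated with a toy version of RSA, one well-known public-key scheme. This is a sketch under heavy assumptions: the lecture does not say which algorithm is in use, the primes here are absurdly small, real systems add padding and never encrypt byte by byte, and the three-argument `pow` with a negative exponent needs Python 3.8 or later.

```python
# Toy RSA with tiny textbook primes; real keys are hundreds of digits long.
p, q = 61, 53
n = p * q                      # 3233, shared by both keys
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent
d = pow(e, -1, phi)            # private exponent: inverse of e modulo phi

public_key = (e, n)            # handed out to anyone who asks
private_key = (d, n)           # never leaves the server


def encrypt(number, key):
    exponent, modulus = key
    return pow(number, exponent, modulus)


def decrypt(number, key):
    exponent, modulus = key
    return pow(number, exponent, modulus)


request = b"GET /"
ciphertext = [encrypt(byte, public_key) for byte in request]
recovered = bytes(decrypt(c, private_key) for c in ciphertext)
print(ciphertext)
print(recovered)
```

Anyone can run `encrypt` because the public key is public, but only the holder of `d` can undo it, which is why the private key stays on the server's hard drive.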
So there's actually some interesting mathematics there that are involved, but at the end of the day, it's in part a marketing thing, whereby we, the whole world of internet users, are trusting that if Verisign says that this SSL certificate belongs to cs.harvard.edu, and if I trust Verisign, I should trust this website. Now how does Verisign do the authorization? Well, some of these registrars or these sellers of SSL certificates, they'll go to reasonable lengths to make sure. They'll call you on the phone, they'll check some business records. That's what you get if they're really being diligent. But the reality is all they do is send an email, typically, to whoever is on file as the owner of the domain name, and in this case it's Drew Faust or someone like that for harvard.edu. And that person has to say, "Yes, I own this domain and I approve this digital signing of this certificate." And then, you get back your digitally signed certificate. And what you do as the system administrator is you install that digitally signed certificate, which frankly is a big number supplemented by another big number, and you install it on your web server using the syntax that we just saw and we'll see again in just a moment. So how do you get this certificate? Well, you can go to someone like Verisign-- and let's do that. Verisign.com, and here we have-- let's see, lots of products. So, oh, here we go. Buy SSL certificates and OK. You know it's going to be expensive when they don't tell you the price right away on the page, so let's compare all SSL certificates. OK. So what do we get? Let's see, let's just spoil the-- OK. Here we go. OK, they're still not-- oh, there we go. OK. So here's what an SSL certificate apparently costs if you go through Verisign. And mind you, it's just for one year. So you're essentially renting their approval for a year. What you get now is what here? Different encryption strengths. 
So if you're familiar with cryptography, the more bits in the cipher, in the encryption algorithm, the more secure in theory the transmission is. Extended validation, not quite sure what this means, probably has to do with something like the duration of it. The warranty, I've never really understood. You know, you're going to pay $400 and somehow they're warranting your website for $1.5 million. I assume the fine print says something like, "If the cryptography we use is broken, fundamentally, we will pay out this amount." I'm just making that up. But the reality is this is pretty meaningless, all of this. And the fact that you get the right to put a Norton Secured Seal on your website is atrocious. Because anyone can put an image tag on a website that says something like that. So a lot of this really is trying to create an industry around sending a message of security to end users. But seeing this should never mean anything to anyone. It just means that someone knows how to embed an image on a website. And the takeaway here too is that using Verisign isn't necessarily all that compelling. If we instead go to GoDaddy.com, which again tries to sell you everything and the kitchen sink when you visit their website, it at least is more reasonable when it comes to SSL certificates, whereby you can get away with $69.99 a year, or the premium SSL. And in this case premium SSL, which is a feature a lot of these SSL providers have tried to market in recent years, makes really one fundamental difference. What does it mean when you visit a website and the address bar not only says HTTPS but also turns green and says the company's name in that address bar? What does it mean? It's supposed to mean this site is really secure and you can really trust it. Right. But in reality what does it effectively mean based on this-- They paid a hundred dollars [inaudible]. Exactly, they paid a hundred dollars instead of $70 to get that right. 
Now before we just said these sentences, how many of you knew that a green address bar meant something fundamentally different? OK. So-- OK. Even, eh, like-- so there's the question. Is it really worth $30 to convince no one in this room that your site is more secure? So I'm being a little pessimistic with all of this. But frankly I do think this is a bit of a scam. We've built up this whole industry that in theory is actually a wonderful idea. These chains of trust, whereby if you trust someone authoritative, like Verisign or the like, you can then trust anyone they trust. But the reality is, it's so easy to get an SSL certificate these days. And even until recently most browsers did not put this crazy-sounding message in front of the user. You might see a little broken link or a broken padlock icon, but they didn't really raise the bar. One thing Google has started doing is putting up a page like this. But I dare say, and this is a made-up statistic, 9 times out of 10, when you see this message, it's just because someone hasn't paid for their SSL certificate for the year or it has lapsed. I do this all the time; once a year our website starts saying this because I forgot to pay the bill for the SSL certificate. But fundamentally, it's a wonderful idea because it means that you might be visiting a site that is not who they claim to be. Because rather, you're the victim of what might be called a man-in-the-middle attack, whereby someone has gotten into the middle of your DNS traffic, and even though you think you're visiting cs.harvard.edu, some bad guy sitting in Starbucks has actually led you to his website instead and is trying to trick you into typing in your user name and password and the like. So again the mathematics, the technology itself is wonderful, but the fact that there is this market where people are paying hundreds of dollars versus tens of dollars is a bit unfortunate, that that's where we're at. Yeah. 
This message will only appear if port 443 is active and SSL is being offered. Enabled. Exactly. Otherwise it will not. Correct. If the web server itself is not configured to listen, so to speak, on port 443, then this-- you will just get a dead end and you will get a generic browser message saying "server not found" or something to that effect. So you must, per the configuration we started glancing at, at least have your website configured to listen on both of those TCP ports. Recall our discussion of ports on Monday. We can do a little introspection here. If I click the X up here and then zoom in: server's certificate does not match the URL, server's certificate has expired, server's certificate is not trusted. So, we're really not doing so well here. So let's click on certificate information just to see what-- oh, but the irony is-- but we have a very secure connection to whoever the hell this is on the internet. So, let's click certificate information and we'll get a little more detail. So it looks like this certificate expired in May. So I'm guilty of the same, so I can't really poke fun at them for doing this. But if we click details and scroll down, we see that the certificate they're actually using for cs.harvard.edu should actually be for eecs.harvard.edu. That's Electrical Engineering and Computer Science. So there is a-- unfortunately, I've just revealed who's responsible for this certificate, but he's no longer here, so it is OK. But the takeaway here is that there's a few solutions. Either one, you pay the bill, and then at least one of those messages goes away. And it's not just a matter of paying the bill; you have to download an updated certificate to install in your web server with an updated date for expiration. But more than that, we also have to fix the domain name. And so you have a few options here. You either buy a separate SSL certificate for cs.harvard.edu in addition to eecs.harvard.edu, or you can buy what's called a wildcard certificate. 
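The three complaints the browser listed (name mismatch, expiry, untrusted issuer) can be sketched as three independent checks. This is only a model of the idea: the dictionary layout, the `certificate_problems` helper, the issuer labels and both dates are invented, since the lecture gives only the month of expiry and not the certificate's actual contents beyond its name.

```python
from datetime import date


def certificate_problems(cert, requested_host, today, trusted_issuers):
    """Return the list of complaints a browser might raise about `cert`."""
    problems = []
    if requested_host.lower() not in (name.lower() for name in cert["names"]):
        problems.append("certificate does not match the URL")
    if today > cert["not_after"]:
        problems.append("certificate has expired")
    if cert["issuer"] not in trusted_issuers:
        problems.append("certificate is not trusted")
    return problems


# Hypothetical certificate resembling the one in the demo; the dates are made up.
cert = {
    "names": ["eecs.harvard.edu"],
    "not_after": date(2000, 5, 31),
    "issuer": "self-signed",
}
trusted = {"Some Certificate Authority"}   # stand-in for the browser's built-in list

for problem in certificate_problems(cert, "cs.harvard.edu", date(2000, 7, 1), trusted):
    print(problem)
```

Each check fails for a different reason, which is why paying the bill alone would clear only one of the messages.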
And for instance, for the course, CS75, we have this ourselves. It's unfortunately like $199 a year, but what that means for your money is that you can protect and avoid these kinds of warnings for *.cs75.net. Any subdomains you want, and we happen to use things like mail and others for back-end technical reasons. So for us that actually tends to make sense. So there's a few solutions here. And I should say too, one of the other reasonably compelling reasons to pay more money to a bigger fish than someone like GoDaddy or Namecheap for SSL certificates is that, as part of this chain of trust, the various browser manufacturers, Microsoft, Google, Apple and so forth, ship their browsers, Safari, IE and so forth, with certain certificate authorities' own certificates installed. So in other words it's up to those big browser companies to decide which certificate authorities you should trust. And some of those vendors, Microsoft, they might have a list of trusted certificate authorities that's this long, Google's might be this long, it really depends on the company. So if you go with some fly-by-night operation or you yourself digitally sign your own certificate, which is mathematically possible, if you are not trusted or that fly-by-night SSL company is not trusted by Microsoft or Google, you're going to get this kind of warning. So one of the things you're paying-- and if you frankly are a Fortune 500 company and the difference between $300 and $1000 is not such a big deal to make sure that more of your customers reach your website correctly, it might be worth spending more money, because it could be that someone has got the latest version of Android and for whatever reason it did not ship with the right certificates, or someone's using version 1.0 of Netscape or something like that, and so certificates aren't trusted inside of that. 
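What a wildcard such as *.cs75.net covers can be sketched as a small matching function. This is a simplified version of the usual rule, in which the star stands for exactly one leading label; the `hostname_matches` helper is hypothetical, and real browsers apply further restrictions that are not modeled here.

```python
def hostname_matches(pattern, hostname):
    """Return True if `hostname` is covered by a certificate name `pattern`."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if not pattern.startswith("*."):
        return pattern == hostname          # ordinary certificate: exact match
    suffix = pattern[1:]                    # "*.cs75.net" -> ".cs75.net"
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]        # what the star has to stand for
    return label != "" and "." not in label


for host in ["mail.cs75.net", "www.cs75.net", "cs75.net", "a.b.cs75.net"]:
    print(host, hostname_matches("*.cs75.net", host))
```

Under this rule one wildcard certificate covers mail, www and any other single-label subdomain, while the bare domain and deeper names fall outside it.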
So again, you're paying to minimize the risk of users running into this kind of unrecognized message, but that's orthogonal to the expiration, which is just a matter of we let the bill lapse. Any questions? No? All right. So how do you actually configure this? Well, when you create your certificate, running a command on the computer, you end up with two files; one is a key and one is a certificate. The key-- rather, one is a private key, one is a public key. This line here, SSL certificate key file, this is literally where our private key can be found on the server. For security reasons, I've faked it as path to cs75.key, but it's somewhere on the hard drive. And I should make it clear, it is not in the same location as your HTML files and GIFs, because that would be stupid if anyone could just download it. So it's somewhere else. The certificate key-- SSL certificate file, this is what you're paying for. You upload your public key to GoDaddy or Verisign, and they then send you back, via email or a download, a digitally signed copy, which has your big number and essentially another very big number. And then you install that here. And then lastly, this chain file just has to do with some registrars, some SSL providers, whereby just in case one of the certificate authority's certificates didn't ship with the browser, this chain certificate essentially says, we trust this person, so it's OK if your certificate is signed by them. So I have glossed over some of the technical detail, and it turns out, as nice as the theory is, SSL itself is still completely broken, like it can be circumvented. And I'll actually try to dig up an article and I'll post it on the lectures page after tonight, if you're curious to see an interesting presentation on the various ways in which you can thwart SSL and trick users into thinking it's secure when it really isn't. So nice story, but the whole world is broken anyway. Any questions about SSL? 
There's one corner case that you need to be mindful of when setting up your own website: running SSL on your website requires that you have a unique-- fill in the blank. Requires that your website have a unique IP. And this is one of the genuine gotchas with SSL. You have a sort of Catch-22 with SSL. Because SSL is about encrypting information, what gets encrypted? Really everything in the request and the response. So everything inside of the virtual envelope is encrypted. What are some of the things inside the virtual envelope? Well, the get line and also-- The specific server you would be on, a virtual-- Exactly. The Host header, which tells the server which Vhost this is for. But the problem is, as we're looking at the configuration here, every Vhost can obviously have its own SSL certificate because it might be foo.com, bar.com. These could be unrelated entities. This is a snippet of a shared web host's web server configuration. So, you're getting an encrypted request, but the only way to figure out who the request is for is to decrypt the request. But to decrypt the request, you have to know who it is for, because the SSL certificate key, the private key you should use, is tied to that Vhost. You again have this Catch-22. You can only figure out who it's for by knowing who it's for. And so, there's, you know, in theory a workaround: you could try decrypting with all possible private keys you have on the system, but that's not necessarily deterministic and it is also a little hackish, especially if you have hundreds of Vhosts on the server. So the de facto result is that you just can't do it. But if you give every Vhost a unique IP address and then associate effectively the certificate with that IP address, then you're safe. Because then, you can just assume that if it comes in on IP address w.x.y.z it must be using this SSL certificate. And there is one corner case. 
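The way a unique IP address breaks the Catch-22 can be sketched as a lookup that happens before anything is decrypted: the destination address is visible on the outside of the envelope, so it can select the key. The table below is hypothetical, including the host names baz.com and qux.com, the key file names and the documentation-range addresses.

```python
# Hypothetical shared-host configuration: which vhost lives on which address.
VHOSTS = [
    {"name": "foo.com", "ip": "192.0.2.1", "key_file": "foo.key"},
    {"name": "bar.com", "ip": "192.0.2.2", "key_file": "bar.key"},
    # Two unrelated sites sharing one address: the ambiguous case.
    {"name": "baz.com", "ip": "192.0.2.3", "key_file": "baz.key"},
    {"name": "qux.com", "ip": "192.0.2.3", "key_file": "qux.key"},
]


def private_key_for(destination_ip, vhosts=VHOSTS):
    """Choose the private key using only the address the request arrived on."""
    keys = {vh["key_file"] for vh in vhosts if vh["ip"] == destination_ip}
    if len(keys) != 1:
        # The Host header is still encrypted, so there is nothing else to go on.
        raise LookupError(f"{len(keys)} candidate keys for {destination_ip}")
    return keys.pop()


print(private_key_for("192.0.2.1"))
print(private_key_for("192.0.2.2"))
```

Because the candidate keys are collected into a set, several virtual hosts on one address that share a single key file would still resolve cleanly; only unrelated keys on the same address leave the server guessing.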
If you have a wildcard certificate like we, the course, do, thankfully with the wildcard, we don't need a unique IP address for all of our subdomains, FTP, mail, web and so forth. Because if they all come to the same server, you can use the same wildcard certificate to decrypt all of that traffic. So in short, when you sign up for a web host, if you want SSL, which frankly these days is just a good thing to have, good practice to get into, it's probably worth paying a few dollars more to get a unique IP address. Because otherwise, your users will get that very scary, red message. And Google makes it, Chrome makes it easy to click through. In Firefox, you literally have to click like five buttons in order to get past the warnings. It's atrocious. No normal user will ever figure it out. So paying for an SSL certificate is sort of a necessary evil these days. The end result is great, cryptography. But there's a bunch of hoops you have to jump through. All right, any questions? Yeah. When will you post the paper that you mentioned [inaudible]? By morning. I'll dig up the URL and then I will post it on the lectures page of the website. So that, if of interest, you can check that out. Let me just pull up our slides here. And go to-- so what about this? This is among the more cryptic pieces of syntax that's useful to know or at least get comfortable with, or get comfortable copying and pasting. Because with Apache you can actually start to do fairly powerful things. And this is perhaps one of the most common. This is using a feature of Apache, and other web servers have very similar functionality, though they might call it something different, called URL rewriting. So mod rewrite just means module rewrite; this is an optional feature you can enable in an Apache web server that lets you rewrite URLs. Now even if you've never seen this syntax before, what do you think these three lines of monospaced text are doing? Yeah. Compensating for omissions and misspellings. 
Compensating for omissions and misspellings? Sort of. That's actually a good thought. The only catch there is that if the user does mistype the address, it won't necessarily work unless DNS is configured to at least deliver the user to this endpoint. So in other words, if they accidentally type wwww.cs75.net, that will be a dead end unless we in DNS have allowed that to work with an A record, for instance, or a wildcard record, which is also possible. What else might this be doing, though? That's on the right track, and this is a very concrete case that we're solving. Maybe it has something to do with checking if it's HTTPS. OK, good. So is it checking whether it's HTTPS? It's technically not, though it's very close to doing that. We could tweak it in a certain way. Yeah? Is this kind of like a redirect or something? It is a redirect, and what's it redirecting from and to, do you think? From the top one to the bottom one. From the top one to the bottom one, sort of. So that's actually pretty close. So let's start teasing this apart. So the very first line does what it says. It turns this so-called rewrite engine on. Without that-- and this is a common thing I often forget about-- nothing is going to work unless you explicitly turn the engine on. So the first line does that. The second line is a condition. So you can think of this as sort of a cryptic way of implementing an if-else type condition. So if the HTTP_HOST variable-- so what is this? Anything with the percent sign, curly brace, and then a capitalized phrase like that is what's called an environment variable on the web server. There's a whole bunch of variables that are set sort of automatically for you when a user visits; among them is HTTP_HOST. And that is a variable that specifies the IP address or, literally, the host name or domain name that the user visited. It's equivalent to the Host line, if you will, from the HTTP request. So bang here is part now of a regular expression.
So if unfamiliar, a regular expression is a pattern that you're trying to match. Bang is the opposite of true, so it means the condition holds only if HTTP_HOST does not match the following pattern. So what are we trying to match? The caret symbol means what in a regex? Regex is a fancy way of saying regular expression. Anything? Not anything; that would be dot. Not any case. Not any case. Caret symbol, anyone else? Begins. Begins, perfect. So the caret symbol means the variable's value must start with www. This is to avoid accidental substring matching, where you're matching part of the domain name but not all of it. So this means you must start matching from www. In other words, the first characters in the host name must actually be www. It can't be xyz.www. So www, then I have a backslash dot. Based on what I just said about dot's significance, what is backslash dot? It's an escape character, so it means literally a period. If you just say period, that means any character can be here; backslash dot means only a dot can be here. cs75\.net means it must match, literally, a .net, and then this NC, which is fairly arcane, just means no case. It's case insensitive. It doesn't matter if the user has the caps lock key on; this will still match if the word is correct. So if the HTTP host is not equal to literally www.cs75.net, proceed to the following line. What does the following line say? This is a rewrite rule. So if you have an if-else, this is the if-then part of the expression. So if, then do this. So this thing here, let's come back to and focus on this. I am going to rewrite the user, or rather redirect the user, to https://www.cs75.net/$1. What might $1 refer to, for those familiar with regexes? I think it's whatever the user typed in after .net? Exactly. So that you wouldn't have like a [inaudible]. Exactly. So let's go back to this. What is this doing?
Parentheses, in the context of regular expressions, generally mean capturing parentheses. So this cryptic sequence of symbols here means dot star. So dot is any character, star means zero or more of the preceding thing. So this means zero or more characters: capture them. Where are you capturing them from? Exactly what you said: anything after the slash that the user typed in is captured by these parentheses and by convention is stored in a variable called $1. If I had a second pair of parentheses over here for whatever reason, then I would have access to $1 and $2, and with more pairs $3 and $4. So it's a generic way of not knowing in advance how many parentheses you might have, but you can at least express yourself after the fact. So this just ensures that if the user visits something/abc, I will not be redirecting the user to www.cs75.net and that's it. I will also have the courtesy of sending them to /abc. And it's infuriating how few websites actually do this, especially on mobile phones. If you're in the habit of reading news or whatnot on your phone, this is a detail that drives me nuts. I'll go to like Google News, which has links to all sorts of websites, I'll click through, and for whatever stupid reason, the website will decide, "Oh, you came to us from Google News, but we want to show you the mobile version of our website, so let us send you to m.news.com" or whatever it is, completely forgetting what the URL was that you were at. So the end result is you can't view the article that you clicked on. How do you fix this? It's as simple as something like this. Now, if they're not using Apache, it's going to be a little different, but in effect it's fundamentally that simple to remember what the user typed in.
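The slide itself isn't reproduced in this transcript, so as a rough sketch only, here is the logic those three lines are described as performing, modeled in Python rather than in Apache's own syntax. The function and constant names are made up for illustration; the pattern, the case-insensitive flag, the captured path and the 301 come from the description above.

```python
import re

# Hypothetical names; only the pattern and the target URL come from the lecture.
CANONICAL_HOST = "www.cs75.net"
HOST_PATTERN = re.compile(r"^www\.cs75\.net", re.IGNORECASE)  # IGNORECASE plays the role of NC
PATH_PATTERN = re.compile(r"^/(.*)")  # parentheses capture everything after the slash


def canonical_redirect(host, path):
    """Return (status, location) if the visitor should be redirected, else None."""
    if HOST_PATTERN.search(host):
        # The condition is negated with bang: a host that matches needs no redirect.
        return None
    match = PATH_PATTERN.match(path)
    captured = match.group(1) if match else path  # this is what $1 would hold
    return 301, "https://" + CANONICAL_HOST + "/" + captured


print(canonical_redirect("cs75.net", "/abc"))
print(canonical_redirect("www.cs75.net", "/abc"))
```

The point of carrying the captured text along is that someone who asked for /abc lands on /abc at the canonical host, not on the home page.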
So again, in terms of user experience, in terms of running your own websites, it's a super simple thing to do and certainly to your users' advantage, because if you're like me, you just leave that news site and never come back because it was annoying to visit in that case. All right. And how about a couple more technical details? R equals 301. Anyone want to guess what that's referring to? Isn't that the redirect one? Yeah. The redirect status code that we talked about on Monday. 301 means what, specifically? Moved-- Permanently. -- permanently. So this is in contrast with 302, which happens to be moved temporarily. Who cares? Like, why are there two separate codes, do you think, whose functionality is essentially the same? If it's moved permanently or computers don't save that. Good. If it's a 301 and thus permanent, the browser, if it's smart, will cache that response, and the next time you, the human, try to visit the same page, you're just going to be automatically redirected without wasting the server's time asking the same question. Whereas 302 means it's temporary; you probably should check back with me. So the upside is you save a little time. The user gets a response a little bit faster. The downside, though, is what? What's the downside of 301, do you think? Again, start thinking about corner cases and problems you might be creating by trying to be helpful. For the-- what's that? In case it will revert that back. In case it reverts back. Suppose that you just decide to reconfigure your server or you change the name of it or whatever. You know, it's not something you do commonly, but the day you do it, are you going to be tricking your users into visiting a dead end?
And so you have to be mindful, especially if you're the person doing the web server configuration, not the development of the website: you know, maybe we should make sure both of these continue working for some number of days or weeks so that anyone in the world who had cached this response finally reboots their computer or quits their browser. So these are the kinds of corner cases to be mindful of, especially when you care ever so much about uptime and making sure your users don't hit dead ends. L, you probably won't guess this, means last. This just means if you have a whole bunch of these rewrite conditions and rules in the same file, this is just a way of saying, "That's it. Don't bother processing anything else in the file. We want this redirect to kick in first." So find fault with this. I'm kind of looking at my own line here, and there's technically a bug even though it's not likely ever to be encountered. How could I have been a little more rigorous with defining this, do you think? Specifically, I'm thinking about my pattern matching. It's not quite as robust or correct as I think it probably should be, if you want to be really nit-picky. Yeah. With-- I don't know that would help if you could add HTTP in front of www. HTTP in front of www. Oh, good thought. To put HTTP there would actually break it, because HTTP_HOST, the variable, by definition, and you can only know this by reading the manual, does not contain the protocol; it only contains the host. How about if I point at the end here? What could I be doing better? Yeah. The slash at the end. So good thought too. Slash also doesn't belong, though, because it's part of the path, and host is literally just the host. But there is something there. If you are familiar with regular expressions, it could be-- Sets, I think it corresponds, gives [inaudible] toward the end.
Exactly, and for some crazy reason-- it'd be a nice world if the caret symbol represented both the beginning and the end of a string, but the world chose dollar sign for the end. So I should really put a dollar sign after the T here, because that would mean you have to literally match NET and that's it. Now, why is that relevant? Well, it's probably not that relevant, because I do not know of any top-level domains that exist today that start with NET and have more letters after them. But there's this tendency now where the world is creating much bigger names. And in fact, if you pay like $100,000, you can get .google or .apple. But someone could get .networksolutions. And as soon as they do that, then again, the pattern match is not quite right. But again, it has no real material effect, because if DNS weren't set up, the user would never reach me. But again, just a little thing to be mindful of that is not as precise as it could be. All right. So, what is this-- OK, that was really technical. Who cares? What is this really doing? Why would the user ever reach my website and not already be at www.cs75.net? What is the point of these three lines from a user's-- or really just big picture here? How else could you visit www.cs75.net? Even today, with your laptops? Yeah. Use FTP. OK, FTP, but then this won't even kick in, because this is just web, just port 80, just HTTP. How else could you visit the course's home page? Yeah. There could be error in one of the DNS server that [inaudible]-- OK. Someone to your ID and [inaudible] who doesn't intend to go your actual [inaudible]. Oh, so that's good. So if there's a DNS error or there's just some maliciousness going on, you could be led to our website and-- right? We did this Monday. What was this stupid little demo I did on the fly that made a certain news company look a little silly? Change the name CNN. Yeah, right?
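To make the missing dollar sign concrete, here is a small Python sketch comparing the pattern as described with the stricter version. The variable and function names are hypothetical, and the .networksolutions host is the same made-up example from the discussion, not a real site.

```python
import re

# The pattern as described in lecture: anchored at the start only.
LOOSE = re.compile(r"^www\.cs75\.net", re.IGNORECASE)
# The stricter version, with a dollar sign anchoring the end as well.
STRICT = re.compile(r"^www\.cs75\.net$", re.IGNORECASE)


def matches(pattern, host):
    """True if the host name satisfies the pattern."""
    return pattern.search(host) is not None


for host in ("www.cs75.net", "www.cs75.networksolutions", "xyz.www.cs75.net"):
    print(host, matches(LOOSE, host), matches(STRICT, host))
```

Only the middle host separates the two patterns: the loose one accepts it because nothing says the match has to stop after NET.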
I think I had davidnews.com all of a sudden, and we went there and we stayed there. And I mentioned at the time that CNN, if they just put like two lines of configuration in the file, could fix this and immediately redirect the user to protect their branding so that it goes back to www.cnn.com. This is exactly the fix. Now, we're not doing it because we're worried people are going to come up with like a fake cs75.com or stupid stuff like that. It's just the simpler case: what if they just visit http://cs75.net, enter? We just decided as a course that, like most websites on the internet, we want to standardize not on cs75.net, which we want to work, but we want to redirect the users so that they end up at www.cs75.net. Now why? Part of it is just, you know, branding. There's something to be said for at least standardizing what your URLs look like, whether they have the www or not. But more than that, we mentioned briefly on Monday, and we'll revisit this in time, the cookie issue, whereby if you do have a subdomain, you can then isolate cookies to be part of the www subdomain, and they don't have to be global to your whole domain name, cs75.net. So in other words, that's all these lines are doing for us, and these are literally the lines we use on our website. If I go to http://cs75.net, enter, where do I end up? Well, a couple of things happened. One, I ended up at www, just because. But I also ended up at the SSL version, also just because. And then this is just because we're using MediaWiki, software that automatically makes the default page called Main Page for no good reason. So there's a few things going on there. So you can infer from this, though: how can you enforce use of SSL on your website? Suppose you're a bank, suppose you're Gmail these days, and you want to force users to stay on HTTPS even if they visit HTTP. How do you do it? Well, it's pretty much the same trick here.
But rather than check the host name, which is not the problem now, you want to check SSL. So what you can really do in this case is instead a line like this: RewriteCond HTTPS not equal on. So this is a slightly different syntax, but this is a different condition we could use that asks a different question. If the environment variable called HTTPS is not equal to on, what's the implication? It means it's off. And so what should you do? Well, the next line is that same rewrite rule: you redirect the user. So this is one way you can enforce SSL on a website. Yeah. And so this checks for every page to say somewhat about [inaudible] .com slash banking-- Exactly. -- but still work [inaudible] send to the HTTPS. Exactly. This will work for every page on the website because we had that additional use of the capturing parentheses to ensure that they don't just go back to the generic home page, which is just annoying, at least in my experience, but rather they go to slash whatever they were at. And this gets installed, to be clear, either in that file called httpd.conf, or, as you'll also see, there are per-directory configuration files that Apache supports called HT access files, literally just a text file called period H-T-A-C-C-E-S-S. And that syntax looks very similar to this. But you can't necessarily do everything in an HT access file that you can in the main server configuration. It depends if people like us, the system administrators of a website, let you put certain commands in a directory. So you can use .htaccess files to password-protect directories, for instance, or to change MIME types, so to speak, some fairly arcane details. But this is one of the most compelling. And there's actually another one. Facebook, if you're a user: many of the URLs end in what file extension, as we said on Monday?
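As a sketch of that check, here it is written in Python instead of Apache's syntax. It assumes the server exposes an HTTPS variable holding "on" or "off", as described; the function name and the dictionary standing in for the server's variables are hypothetical.

```python
def enforce_https(env, path, canonical_host="www.cs75.net"):
    """Return (status, location) when the request did not arrive over SSL, else None.

    `env` is a stand-in for the server's variables; only its "HTTPS" key is read.
    """
    if env.get("HTTPS", "off").lower() == "on":
        return None  # already on HTTPS, nothing to do
    # Keep whatever path the user asked for, so /banking stays /banking.
    return 301, "https://" + canonical_host + "/" + path.lstrip("/")


print(enforce_https({"HTTPS": "off"}, "/banking"))
print(enforce_https({"HTTPS": "on"}, "/banking"))
```

The shape is the same as the host-name redirect: a condition that decides whether to act, then a redirect that preserves the requested path.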
So, .php, just because, like, for historical reasons, they still use PHP for a lot of their front-end stuff. But there's no technical reason to expose what language you're using on your server. In fact, it feels like it's just a waste of four bytes, right? Why bother sending .php when it's strictly not necessary? And frankly, it's very Web 2.0 these days to have cleaner URLs, prettier URLs. They just don't have cruft like file extensions. These httpd.conf and also HT access files can also be used to let you avoid ever putting .php in your URLs. Your files on your hard drive can still be called hello.php, but the user could just visit /hello, and using mod_rewrite, you can essentially tell the web server: if the file /hello does not exist, look for /hello.php. And if that exists, serve that up instead. Yeah. No, nothing. OK. So, lots of power. I will say too, this is one of the things that frustrates some people, including myself, the most, because with the slightest syntax error anywhere, or if you get the permissions of the file wrong, your whole website can break. So it's a lot of power and a lot of trial and error and a lot of googling sometimes to solve these problems. All right. Any questions? No? All right. So, where can you use stuff like this? Well, next week, when we start talking about the first project, we'll introduce this appliance, this virtual machine in which you have your own version of Apache running. But certainly after the course, or even during the course if you want to experiment with other approaches, it's actually very easy to get LAMP onto your own computer. You don't need to pay for a web host; you don't need to set up a Linux computer. You can do it on your own Mac or PC. In fact, Mac OS these days comes with Apache, comes with PHP, comes with Python, comes with Perl, a lot of support for web-programming-related stuff built in, even though you sometimes have to run some commands to actually enable it.
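The fallback from /hello to /hello.php can be sketched as a lookup rule. This is a Python illustration of the idea only, not how Apache implements it; the function names are hypothetical, and a real server would also need to guard against paths that escape the document root.

```python
import tempfile
from pathlib import Path


def resolve(docroot, url_path):
    """Map a URL path to a file: try it as-is, then with .php appended."""
    relative = url_path.lstrip("/")
    if not relative:
        return None  # the home page is out of scope for this sketch
    candidate = Path(docroot) / relative
    if candidate.is_file():
        return candidate
    with_php = candidate.with_name(candidate.name + ".php")
    if with_php.is_file():
        return with_php
    return None


def demo():
    """Create a throwaway document root containing hello.php and resolve a few URLs."""
    with tempfile.TemporaryDirectory() as docroot:
        (Path(docroot) / "hello.php").write_text("<?php echo 'hi'; ?>")
        found = [resolve(docroot, p) for p in ("/hello", "/hello.php", "/missing")]
        return [f.name if f else None for f in found]


print(demo())
```

Both /hello and /hello.php end up at the same file on disk, so the extension never has to appear in a link.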
Your laptop is not a web server by default, even though Apache is in there if it's a Mac. Windows tends not to come with as much software along these lines, but either way, there are some packages, and this is one of them, XAMPP, that make it pretty easy to set up a web environment on your own computer, not necessarily for serving content to real users. We had that discussion on Monday that, you know, getting users from the outside world to your home with your cable modem and all that is not trivial, and your ISP might not even let you or the like, but for development purposes, you don't need an actual web server per se. You don't need to pay anyone to start doing web development. You can do it on your own local hard drive, even if it's not static content, HTML files, but it's actually dynamic with something like PHP. So XAMPP is just the product name for free software that includes support for Linux, Mac OS, Solaris, and Windows. So it doesn't matter what OS you have, and it installs for you Apache, MySQL, PHP and also even Perl, which is the other P in LAMP sometimes. Or actually no, that's the P in XAMPP in LAMP. So what does this mean? It means you go to their website; you just google XAMPP to pull up their page. You can install the software. And ideally, you then have some nice documentation locally and your own database, your own web server, your own installation of PHP, so you can do all your development locally, which is nice because it's super fast. And it means you can work in a cafe or whatnot without even having internet access. There are some corner cases. XAMPP hasn't been the easiest historically to set up. Sometimes it does not quite work on everyone's computers, which is why we actually transitioned to the VM approach, where we can guarantee that everyone's environment is the same and works correctly. But certainly moving forward, when you no longer want to rely on course-provided software, realize this is a nice local development option as well.
And similarly that you configure most anything you would like. Any questions then? All right. It feels like a good point to take a five-minute break and when we return, why do not we dive into PHP and actually finishing the back end of something like google.com. So let's take five. All right, we're back. So just a couple of details, you should have or should soon receive an email invitation from the course's discussion tool. We'll post a link and announcement on the course's home page to explain to where to go and how to go if you do not receive such a link but it would have gone to this e-mail to the e-mail address with which you registered for the course, FYI, in case that's not in address you use quite commonly. But again, more details on the course's home page by tomorrow. Let me introduce another of the course's TF's alone who if you would not mind coming up close to my microphone, would like to say hello to the class. Hi everyone. My name is Allan. You can call me Allen. It's-- These are for you and I'm here to take your questions and help you out with anything you-- OK. Excellent. And Peter how we met on Monday will be back shortly this evening and once lecture wraps, we'll dive into section. Which again will be an opportunity for slightly more intimate Q&A to go over concepts that might be a little more abstract and particularly once the first project is released which will be on July 9th is when the first one will go out, it will be an opportunity particularly to focus on the project and get direction and guidance and design tips on them. So, more on that to come. All right. So, time for some PHP. So recall that we talked briefly about some of the basic UI mechanisms that browsers allow. Radio buttons, text fields, text areas, checked boxes and the like. 
And these really are going to be the fundamental mechanisms whereby we go from static websites with just HTML and CSS to dynamic websites with some kind of server-side intelligence that does something based on user input to produce dynamic output. So these days, thankfully, the web is getting more interesting and sexier than some of these more old-school UI mechanisms. But even the fanciest of autocomplete widgets that you see, and calendaring things where you can choose calendar dates and the like, are still built on top of these, but all the more stylized these days with JavaScript and with CSS. And so we'll look at some of that fancier use of input mechanisms in a few weeks when we get to AJAX and JavaScript itself. So here is a representative snippet of Google. Recall that on Monday we started implementing the same interface, even though it was all black and white text. But we did have a text box and we did have a couple of buttons, and when you clicked that submit button, you actually ended up initially nowhere, right? We ended up on my same file, which is not dynamic at all. But then I went in and changed the action attribute so that we actually submitted it to Google. So technically we cut some corners and didn't implement a dynamic website ourselves, but we did look at the basic mechanism whereby form input becomes a GET request, or an alternative to GET is POST. For those familiar, what is one or more of the fundamental differences between using GET versus POST? Yeah. Oh, GET is actually going to include what you entered in the form in the URL. OK. And POST is just not going to do that. OK. Excellent. So a GET request will put that state in the URL itself. And that's exactly what we saw on Monday with Google, where we had question mark, what came next? Question mark-- Q. -- Q equals whatever-- harvard, whatever I type in or the user types in. So POST does not do that.
So that's a nice distinction, but what are some more distinctions, or what would motivate you to use GET versus POST if functionally they could be the same? You could still get search results, for instance, even though Google, as an aside, does not support POST. What else should drive you to GET versus POST or vice versa? Yeah. Well if you're on the site that tends to deal with uploads here then why don't you suppose with it had special ways to deal with large files-- Excellent. Yeah. So GET requests are not so great for things like file uploads, photo uploads, right? If anything, conceptually this just makes no sense: how do you upload a file in a URL? Now technically, you can encode it using something called base64 encoding, where you convert the binary image of zeros and ones to As and Bs and Cs and 1, 2, 3s and so forth. But the other gotcha is that most browsers have a limit on the maximum length of the URL. Unfortunately, this is not standardized and it's barely even documented. But the rough rule of thumb is if your URL is several hundred characters long, it's probably too long. And a reasonable cutoff is something like 1024 characters; there you're definitely pushing your limits. However, it's completely browser dependent. Some browsers support 8000-character URLs, some 1000-character URLs, but the takeaway is that, really, you have to deal with the lowest common denominator, whatever that is. And so anytime your URLs start getting long, it's probably time to rethink your design and start using something called AJAX, which again we'll look at, or using POST. POST does not have a limit. In fact, one of the upsides of POST is that the browser, in the HTTP headers, will tell the server how big the file or parameters are that are being posted, so to speak, so that the server knows when it's received everything. So the browser figures out:
OK, this is like a 5-megabyte photo, so I'm going to tell the web server through the headers: expect 5 megabytes. And then what the server gets below all the headers is all the crazy zeros and ones, or equivalently As, Bs, Cs, 1, 2, 3s, but it knows where they stop. So it knows when it's received the whole photo. So POST is great for that. What else is POST compelling for? What other use cases besides file uploads? And put on your paranoid hat. If you're using GET, what are you at risk for? Yeah. Somebody is actually sniffing what the user sends. Perfect. So if-- and what might the user send that could be sensitive? I mean, you wouldn't really send the password or a-- Good. -- username with the GET list. OK, good. So sending usernames, passwords, credit card numbers, anything that's arguably sensitive, probably should not be submitted by a GET, because it ends up in the URL. And why is that bad? Well, fundamentally, it's still being sent to the web server, and if it's over SSL, it's at least encrypted. However, it's not encrypted from your family members or your friends or your roommates who might sit down at your same computer. And you know what you can do with most browsers today: browse through the history, right? And if it's in the URL, that means it's going to get logged and it's going to end up in autocomplete until the cache is manually cleared. It's just too easy then for someone to find it. And it's also going to end up somewhere else. Even though it might be transmitted over SSL, so random people on the internet or at Starbucks can't see it, once the server gets the request, many servers, as you can maybe infer from the httpd.conf configuration file, have logs. And what tends to get logged in logs? Not POST data, because it could be huge, 5 megabytes and whatnot. But typically what is logged in logs? GET requests, including the URL that was visited.
Which means for any website that's ever used GET for password authentication or credit card submission-- which would be rare but could happen, especially if the person does not know what they are doing-- it's ending up in the logs. Which means some random person's unencrypted log files have all of your sensitive information. So in short, anytime something is big or anytime something is sensitive, GET is not the way to go. However, that would seem to suggest, fine, just use POST all the time, right? Just avoid all these issues altogether. I do not have to remember what the difference is. But what's the downside of using POST? Based on your own, maybe even non-technical, user experience, what's the downside? Yeah. Can't copy-paste the URL-- Yeah. -- available [inaudible]. Perfect. You can't copy-paste the URL. Completely reasonable concern, especially from the user experience perspective. Because it's very reasonable for someone to want to copy the URL and say "Oh, check out this book" or "check out this link," whatever it is you're looking at. And it's actually pretty infuriating when the person who receives the email says "Oh, I only see their home page," because they just redirected them, because of a number of things. One, the state that was necessary to remember that book, the ISBN or whatever, was not in the URL because they were using POST. Or it's even worse; some websites-- even I think the Harvard Coop does this. When you navigate around their website, the URL similarly doesn't change, because the information is being stored, as best I can tell, in their session cookies, something we'll talk about next week or later tonight, whereby it's only remembered by the server, thanks to a cookie, where you are, which means even you can't bookmark your own pages that are of interest to you. So in short, horrible design, and some websites are very much guilty of this.
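To see where the form data actually travels in each case, here is a rough Python sketch that builds the raw text of a GET and of a POST for the same search field. The helper names are hypothetical, the /search path is only a stand-in, and a real browser sends more headers than these.

```python
from urllib.parse import urlencode


def build_get(host, path, params):
    """GET: the parameters ride in the URL, so they show up in history and logs."""
    query = urlencode(params)
    return f"GET {path}?{query} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def build_post(host, path, params):
    """POST: the parameters ride in the body; Content-Length says how long it is."""
    body = urlencode(params).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


print(build_get("www.google.com", "/search", {"q": "harvard"}).decode())
print(build_post("www.google.com", "/search", {"q": "harvard"}).decode())
```

In the GET the word harvard is part of the first line, which is what gets bookmarked and logged. In the POST the first line carries only the path, and the Content-Length header tells the server how many bytes of body to expect.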
So anytime you want the user to be able to save state in a URL, whether in an email or just with the back button too, it's helpful to make sure it is in the URL itself. Of course there's another reason. This is getting better these days with modern browsers, but typically with POSTs, if you click reload, you'll often get prompted, and the browser will say "Are you sure you want to resubmit this form?" So there's also issues of resubmitting forms and whatnot that are typically bad. And so one of the things that's gotten more common these days, to avoid people accidentally checking out twice or buying things twice on an online store, you know, having that message say, wait a minute, are you sure you want to submit this form? What you can often do is-- once the user does a POST because they have uploaded something or they bought something, what you then do is immediately redirect them with a 301 or a 302, which only use GETs. You cannot use redirects to re-POST somewhere else, FYI. Then the user, if they accidentally hit reload or hit back in their browser, is only going to go back and forth between GET requests, not a POST. So you can also discourage the user from submitting a form again. And there's other protections you can put in place, but that's another reason, too, if you want to avoid resubmission of forms. Sending a GET via redirect can be one level of protection against that. All right. So, here we go with PHP. This is going to be a fairly rapid tour of this, because again the course does assume nontrivial prior programming experience. So this is another detail where, if you find yourself asking what is programming, again, we should have a conversation right after class, or with Allan or with Peter if you're more comfortable, about what your own background is, because we're going to start talking about things like arrays and hash tables and associative arrays.
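That redirect-after-POST trick can be sketched with a toy request handler. This is a Python illustration under made-up assumptions: the /checkout and /receipt paths, the in-memory order list and the handler's return shape are all hypothetical, and 302 is used because it is one of the two codes mentioned above.

```python
orders = []  # stands in for a real database


def handle(method, path, form=None):
    """Toy handler returning (status, headers, body)."""
    if method == "POST" and path == "/checkout":
        orders.append(form["item"])  # the side effect happens exactly once
        order_id = len(orders)
        # Redirect right away, so what the browser ends up displaying came from a GET.
        return 302, {"Location": f"/receipt?order={order_id}"}, ""
    if method == "GET" and path.startswith("/receipt"):
        return 200, {}, "Thanks for your order."
    return 404, {}, "Not found"


status, headers, _ = handle("POST", "/checkout", {"item": "book"})
print(status, headers["Location"])
print(handle("GET", headers["Location"])[0])  # first view of the receipt
print(handle("GET", headers["Location"])[0])  # a reload repeats only the GET
print(len(orders))
```

Reloading the receipt page repeats a harmless GET, so the order list does not grow no matter how many times the user hits reload.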
And if this is all new to you, it's definitely going to be a bigger challenge, but we've certainly had students do it before, so use your judgment along the way. So, one of the best things about PHP is its documentation, to be honest. It's actually fairly user-friendly, very nice to navigate, and so let me just pull up an arbitrary example, kind of a boring function but one that's commonly used. If I Google PHP date function, I can pull up a representative documentation page here. And just to give you a quick tour of something you'll see much more when you dive into the course's projects, along the left-hand side of the website is a list of all of the related or available functions-- PHP is actually not this slow of a language usually. Let's try reloading. OK. She didn't-- oh, so, actually there's an interesting lesson there. So actually, let's try this rather than just give up on this altogether. Let me see if we can-- oh, damn network. So I was going to pull up Chrome's network tab so we could look at exactly what was hanging there, but it seems to have resolved itself. So, a quick tour then of the page here. So on the left-hand side are all of the related functions, just FYI, a little overwhelming at first, but the reality is for this class, and really in general, you're not going to need to know every one of these functions; just looking it up on demand is useful enough, typically. On the right-hand side is the canonical layout of a function. So it tells you first what version of PHP supports this function. This is actually important, not so much when you control your own server, because if it's your own server you'll be running a pretty recent version of PHP, 5.1, 5.2, 5.3, 5.4, or fairly recent incarnations, with 5.4 the latest.
But there are some web hosting companies that might still be running PHP 4, not terribly common but you will lose a huge amount of functionality including object-oriented programming support, if you are something as old as PHP 4, just FYI. So, what does this function do? It formats a local date and time which means if I give it a string like H colon M for hours colon minutes or something like that. It should return to me a formatted string like 3:00 p.m. or something like that. So, that's what it does, it gives me the current time or it converts a numeric time stamp to a date. So, here is how you parse the signatures. This means it returns a string. This means it takes a string as its first argument, which is the format. Any variable in PHP as we'll see quite a bit is it starts with a dollar sign. Square brackets in documentation means it's optional, which means if you want to override the current date and time you can pass a new numeric time stamp. For those unfamiliar, a time stamp in many programming languages is the number of seconds since January 1st 1970, the so-called epoch. And then you can override the default behavior. Useful if you've stored time stamps in like a database and you want to display them in some human friendly way after the facts. All right. This returns a string format [inaudible] to given format string dot, dot, dot. Here's just some more detail on the format, so the format parameter can apparently be a quoted string containing all of these various placeholders, D for day, J for day of the month without leading zeroes and so forth. Memorizing this is not a good use of any human's time, but looking it up is reasonable. Let's just scroll down, past all of that. Timestamp does what I promised. Return value returns a formatted date string. If you do something wrong, it goes on to explain that there's an error. And then let me scroll down here. The examples, frankly, is where my eye is typically drawn most immediately. 
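The shape of that signature, a format string plus an optional numeric timestamp counted in seconds since the epoch, can be sketched in Python. This is an analogy, not PHP's date function: the function name is hypothetical, Python's placeholder letters differ from PHP's, and the sketch uses UTC so the output doesn't depend on the machine's time zone.

```python
from datetime import datetime, timezone


def format_timestamp(fmt, timestamp=None):
    """Format the current time, or a given number of seconds since the epoch."""
    if timestamp is None:
        moment = datetime.now(timezone.utc)  # default: right now
    else:
        moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime(fmt)


print(format_timestamp("%Y-%m-%d %H:%M", 0))   # timestamp 0 is the epoch itself
print(format_timestamp("%A", 0))               # day of the week for the epoch
print(format_timestamp("%H:%M"))               # no timestamp given, so the current time
```

Passing a stored timestamp as the second argument is the "override the default behavior" case: the same format string works on a value pulled from a database as on the current time.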
So, if I take a look here, this gives me some little cheat sheets. If I want to print out the day of the week, it's echo date "l", since for whatever reason L denotes the day of the week. If it is Monday, it would print Monday. Today, it would print Wednesday, dynamically. Here's some more complicated string that they claim will print out this, and so on and so forth. This is the kind of thing that this function does. But the takeaway, for our purposes now, is just that PHP's documentation is always structured in this way. Summary of the function up top, description of the parameters, some version notes in case you need to be aware of what version of PHP you have. Example one, example two, example three. And then at the bottom, there's generally some pretty intelligent discussion in the comment threads that are there. It's not really crazy talk. They seem to moderate it quite well, so you actually see people sharing useful code, workarounds, or common tricks that someone might want to do related to the date function. So in short, the documentation will be your friend, and what we will do in lecture is not go through mind-numbing tours of the various functions that exist and so forth, but focus much more on the concepts, on the syntax, and on the overall framework, so that as you dive in to "how do I do this, how do I do that," you know where it fits in the big picture in terms of a project. So PHP is an interpreted language. What does it mean for a language to be interpreted? Or what is the opposite of an interpreted language, even though they're not truly, literally opposites? Yeah? [ Inaudible Remark ] A compiled language. So a compiled language is something like C or C++, a language that has source code written in English-like syntax, but you have to run it through a compiler like GCC or Visual Studio or the like, and it outputs what's generally called object code, or more specifically zeroes and ones that are patterned in a way that a CPU like an Intel CPU understands.
An interpreted language skips that step, essentially, whereby instead you write the source code and then you pass your source code through what's called an interpreter instead of a compiler and then an interpreter essentially reads that language that you've written, the source code you've written, top to bottom, left to right, doing line by line exactly what you tell it to do. So the upside is, there's no intermediate step, you don't have to run the compiler then run your program. In an interpreted world, you just run your program through the interpreter and it's that. It's one step instead of two. But what's the downside of the fact that it's interpreting it line by line as opposed to converting it to zeroes and ones? Performance. Performance, typically. So compiled languages tend to be faster because you're spending more time in memory and disk space upfront to convert source codes to object codes, zeroes and ones, but once it zeroes and ones, it's super ready to be read and understood by the CPU. Whereas an interpreted language typically needs to be literally interpreted again and again, and every time I call the date function, D-A-T-E needs to be parsed or read and then converted effectively to the underlying functionality. Now, there exists compilers of sorts for PHP and for other interpreted languages and what are called opcode caches. More on this at the end of this semester when we talk about scalability, which simply means for now, that smart web servers and interpreters will do the interpretation once, convert it to some intermediate format and then save that intermediate format, which in the PHP world is called opcodes, O-P-C-O-D-E-S. And this just means it will skip that step the next time around. It's not quite compiled but at least it's better, it's a closer approximation to it. Frankly, it's a nice thing with interpreted languages because you don't have to go through that annoying step of recompiling and recompiling. 
Every time you make a change, you can interact with your code a lot more fluidly. It just saves some steps, especially for large projects which might have a large number of files and lines of code to actually compile otherwise. So, some upsides and some downsides. If you're crazy popular, like Facebook, Facebook actually has a framework called HipHop for PHP, which they released as open source a while back, which actually compiles PHP down to C++, which is then compiled and turned into object code to get maximal performance out of the code that they write. And this is motivated by a number of things, but among the things they discuss publicly is this: PHP is fairly omnipresent and it's a fairly easy language for people to learn, especially coming out of college and the like. So it means they can have their developers using a language that's fairly easy to learn, they probably already know it, and they can then defer the performance details that are typically associated with the language to some of their more advanced engineers, who can then take PHP code down to something that's even more highly performing. So that's among the options that exist these days. So a lot of the arguments you might see on the web are about performance of PHP versus Ruby versus Python versus Java versus this. There are many, many different technical solutions to the performance question. And a very valid heuristic, I think, when choosing a language, whether it's going to be PHP or another, is what you already know and what the cost is to you of developing in or learning something else, what friends know or what your colleagues know, and also what tools exist to mitigate the prices you pay to use something like an interpreted language. So suPHP, this is something that will be installed in the CS50 Appliance. It is installed on some web hosts, but not nearly as many as would be good. So suPHP is substitute user PHP and it exists to solve the following problem.
When you have a web server, you have software running on it that listens for connections on port 80 and so forth. Years ago, most such servers ran as a user called root. Root is the administrative user and running anything as root is generally bad, why? Yes? Well, you can, like, destroy your computer. You can destroy your computer, how? Be more specific. Well, you can remove, like, files that are essential to the operating system. Good. So the root user has full-fledged access to the system. If it's a web server, and you're running web code, and you make a mistake and you accidentally delete the wrong directory, that is permanent. You can touch anything on the system, including the password file, which even though it's encrypted should not generally be shared with the world. So in short, running anything as root puts you at risk if what root is doing is buggy. And odds are you're human, you err, you're going to write buggy code sometimes, and that means who's running the buggy code? The most important user on the system, which means your entire machine could be compromised if you screw up. So finally, the world years ago got into the habit of at least running web servers in particular as a different username. Sometimes "nobody," literally the username nobody, or www or apache or httpd. It doesn't really matter what it is; it matters that it's not root. But some problems arise, especially in this popular world these days of virtual hosting and web hosts, commercial web hosts. Because just think of this: if you are customer A and there's a customer B, and you have someone like DreamHost or the like, you each have accounts with them and you have your own usernames and passwords and you have your own home directory, so to speak, where you can store your code. But the web server runs under the username apache, for instance, Apache again being the web server software.
In terms of permissions, Apache is not you, obviously, because you are A or you are B, so you have different usernames. But if Apache is the web server and the web server needs to obviously be able to see your files in order to serve them up, what kinds of permission do your files need if you're familiar with Linux file permissions or Windows, really file permissions in general? Your files have to be what's called world readable, typically. You can do more fine grain permissions, but the reality is on most systems, the easier approaches, you're told to chmod your file 644, more on that in the future. But make your files world readable. Why? Because you don't really need the world to read them, you need the web server to be able to read them including specially your PHP code, which we'll about-- were about to start writing. So, what's the implication, though? There's a few things. If your files are world readable so that this middle man Apache can read them, that's great. It makes the website work. But it also means that someone else can read your file, too. That would probably be the other customer. The other customer, customer B, right? Because world readable is world readable. Now, if your files are being served up on a web server, that means you can see your files at URLs like /hello.php. So that means anyone in the internet knows what your files are called. So, other customer because he or she can log in to the same server can just enter your directory, and even though they might not be able to see all of your files, if they know what they're called, they can then definitely see your files by just using a text editor or some kind of program that just opens these files. Now, that's not such a big deal for JavaScript, for CSS, because frankly, who cares? That stuff is by nature of JavaScript and CSS, going to be sent to the browser in the whole world anyway. 
And you might try to obfuscate, as we'll discuss in a few weeks, with minification and hiding things from users. But you can't really protect your intellectual property when it comes to JavaScript and CSS because the browser and the whole world have to see it. But PHP, you might put a lot of heart into it, and you've put a lot of intellectual property into your PHP code, which is really the secret sauce of your business or whatever. But now, the web server needs to be able to read it, as can any customer. So now, you are at risk of another customer seeing all the hard work you've done. In fact, what might your files contain out of necessity, if you're familiar with databases and the like? Yeah? The PHP would need to contain the name of the database and the password. Exactly. Things like usernames and passwords for databases, for caching engines, for Facebook APIs, whatever it is. Your PHP code might have inside of it variables that do need to be there, but you don't want customer B being able to see them. So in short, running a web server as apache is great for the security of the whole site, bad for the security of customers A and B and C, who probably don't even know each other and certainly shouldn't trust each other. So, thankfully there's a solution here and it comes in different forms. One of the solutions for PHP is suPHP. In the suPHP model, customer A's code is executed by a username called A, that customer's own username. B's code is executed by username B. In other words, the web server sort of magically transforms itself into user A when it's time to execute A's code and transforms itself into user B when it's time to execute B's code, which means if you screw up and have buggy code and you're customer A, whose files could you possibly delete under this model? Only your own. And, you know, that still might be unfortunate, but at least you're not compromising anyone else on the system.
And it's your own fault if you delete your own files, but it's a good thing that you can't delete anyone else's files. This also solves another issue. Suppose your website is like a commercial website, even if it's small with only hundreds or thousands of customers, and those customers need to upload files like photos or videos or stuff that's not meant to be public in the Facebook sense but fairly private, at least in the limited privacy sense. So, the upside here is when a user uploads a file now and the web server is using suPHP, that file will be saved on the disk as owned by user A, and B's files will be saved as user B. By contrast, in the other model where everything gets run by apache, who saves the files? Apache, which means apache owns the files, and that means the only way to ensure that they can be accessed subsequently is to make them world readable, which means all of the new content your users are uploading is going to be readable by customer A, and B, and C, and D on the system. So, in short, this is good, and this is not a feature that's typically advertised significantly by web hosts. I don't even know if DreamHost does it these days. I'm going to guess they don't because we didn't see mention of it, but don't hold me to that. You might want to dig a little deeper into the fine print. But if you are using something like a virtual private server, you can also avoid this issue altogether because at least if you own the whole server, even if it's a rented virtual machine, at least there are no other customers on the same server. So, again, something to be mindful of so that, you know, when you pay $8.95, $5.99, whatever it is per month, again you're getting what you pay for. And if you care about your intellectual property and the security of your site, these are the kinds of questions you should be mindful of asking or reading up on before signing up. So, suPHP is something that will be installed in the appliance.
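As a rough illustration of what "world readable" means for a mode like 644, the sketch below tests the relevant permission bit. The helper name is_world_readable is hypothetical, and the sketch works on a numeric mode rather than a real file; for a real file you might pass fileperms($path) & 0777 instead.

```php
<?php
// Hypothetical helper: given a numeric permission mode such as 0644,
// report whether the "world" (everyone else) triple includes read access.
function is_world_readable($mode)
{
    // The lowest octal digit is the world triple; 4 is its read bit.
    return ($mode & 0004) !== 0;
}

var_dump(is_world_readable(0644)); // bool(true): owner rw, group r, world r
var_dump(is_world_readable(0600)); // bool(false): owner rw only
var_dump(is_world_readable(0640)); // bool(false): owner rw, group r, world nothing
```

Under the shared-apache model the first case is what a host typically asks for; under the suPHP model a tighter mode like the second can be enough, because the code runs as the file's owner.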
So, for those who would like to read up on the language itself, this week, there will just be recommended readings of sorts. Realize that there are some good tutorials online. And again, if you have a programming background in any syntactically similar language, some of these might even be boring, which would be great because they will walk you through for loops and while loops and the like. So, we'll just do a quick tour of some of these syntactic details tonight, but then focus on some of the higher level concepts that will be distinct to web programming. So, without further ado, one of the more stupid details, but I just put it out there because it's the first thing you see. Variables, again, start with dollar signs in PHP and here is the rule as to what is valid. In short, I would choose variable names in a sort of normal way, typically with alphabetical letters, but there are some other things you can use like underscores and numbers and the like. But again, we won't spend too much time on this kind of level of detail. Data types. PHP is what's called a loosely typed language, which means the data types exist, kind of, but they're not enforced in the same way that they are in Java or in C or in C++. So, what data types exist? Booleans, integers, floats, and strings. But when you declare a variable, you do not specify its type. It is inferred from the type of value you actually put inside of it. So, if you say something like $x, because again dollar sign means this is a variable, and $x is a very boring name for a variable, but it's a variable, then equal sign, 123, semicolon, the data type of that value will be integer even though I didn't specify it as such. If by contrast I say $x equals 1.23 semicolon, it's instead going to be what? Yeah? Float. Float, a floating point value, a real number. If you instead say equals true or equals false, it's going to be a bool. If you instead say "hello", it's going to be a string. But that type is not invariant.
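The type inference described above can be observed directly with PHP's built-in gettype function. This is only a small sketch; the variable name $x mirrors the spoken example.

```php
<?php
// The same variable takes on whatever type its current value has.
$x = 123;
echo gettype($x), "\n"; // integer

$x = 1.23;
echo gettype($x), "\n"; // double (the name gettype uses for floats)

$x = true;
echo gettype($x), "\n"; // boolean

$x = "hello";
echo gettype($x), "\n"; // string
```

Note that nothing in the code declares a type; reassigning $x simply changes what gettype reports.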
If you try to use a string in a boolean context, then you get a lot of implicit conversion. So, in other words, in an if condition, normally you would say something like if x equals y or you would say if true, something like that. If instead you say if "hello", well, hello will be implicitly cast to a boolean, and because hello is neither the empty string nor the string "0", the boolean value of hello is going to be true. So, you can use strings even as truth values, which can encourage sloppy programming, and we'll see some examples of these, but it's also useful sometimes in that it's not as pedantic a language as something like Java where you are constantly casting things back and forth. Yeah? Can you perform string operations on integers or vice versa? Good question. Can you perform string operations on integers and vice versa? Yes, they will be cast to a string in that case and become part of the string itself. And one of the motivations for this is that PHP from the start was really designed to be web-centric, and the reality is when you're writing web software, you're interacting with the user entirely via strings. Now, the user might type in 123, but as we've seen via HTTP GETs and talked about with HTTP POST, it's all text at the end of the day. There's no data type associated with an HTML input field. So, even though the user might type the number 123, what's going to be sent to the server is the string "123". And so the fact that there is this loose typing is reasonably consistent with what you're getting from the user anyway, even though again it can feel a little messy, and it is in some sense, but that's at least one of the original motivations for it. In terms of objects and collections, PHP has arrays and it also has objects, more on those to come. And there are also things called resources. A resource is something like when you open a file: what you get back is not the file per se, you get sort of like a pointer or a reference in C or Java-speak.
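The implicit conversions discussed above can be checked with a few one-liners. This is a sketch rather than an exhaustive table; the variable names are made up for illustration.

```php
<?php
// Strings in a boolean context: only "" and the exact string "0" are false.
var_dump((bool) "hello"); // bool(true)
var_dump((bool) "");      // bool(false)
var_dump((bool) "0");     // bool(false)
var_dump((bool) "0.0");   // bool(true)

// An integer used in a string operation becomes part of the string.
$count = 3;
$message = "You have " . $count . " items";
echo $message, "\n"; // You have 3 items

// A numeric string, such as one received from a form field, used in arithmetic.
$typed = "123";
$sum = $typed + 1;
var_dump($sum); // int(124)
```

The last case is the web-centric motivation in miniature: the value arrives as text, and PHP converts it when it is used as a number.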
And that reference is to a resource, which is sort of like a special object that contains interesting information: the size of the file, your location in it, the type of it, and so forth, details like that. Null is null. It's when you have no value there. You can have the value null as a placeholder, but variables in PHP, as we'll see, can also be set or not set. So, null is an actual value. It doesn't mean the absence of a value. You can have the absence of a value, as we'll soon see. And then there's mixed. So, mixed isn't really a type but you'll see these things in documentation, in particular. If you see PHP.net documentation that says this function takes mixed, what does that mean? Well, it means it can take any number of different types. It can accept a string or a number, and this is where PHP is both handy but also a little sloppy, in that it's not strictly typed or strongly typed. Number means integer or float, if the function doesn't care. And a callback is a function pointer. We won't spend too much time on those, but you can pass functions around by pointers or by references, generally known as callbacks in this case. Another word on mixed: PHP is well known for its design of returning mixed data types. So it's very common in PHP for a function, even like date, to return strings. But if something goes wrong, it could actually return a bool. And what's it going to return in that case? False. So, it's very common in PHP functions that 99% of the time a function will return a certain data type, but it could return something very different. So, learning to check for that correctly is good in the context of PHP. So, I'll point that out along the way as well. So now, some special variables before we start writing some code. So, in PHP, there are special global variables that are called superglobals. They are in scope, so to speak, everywhere.
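One common place where checking a mixed return value correctly matters is the built-in strpos, which returns an integer position or false. It is used here only as an illustration (the lecture mentions date); the helper name contains is hypothetical.

```php
<?php
// strpos returns the position of the match, or false if there is none.
// Position 0 is a valid match, yet 0 is "falsy", so a loose check misfires.
$pos = strpos("hello", "h");
var_dump($pos);           // int(0)
var_dump($pos == false);  // bool(true): misleading, the match was found
var_dump($pos === false); // bool(false): correct, compares type as well as value

// Hypothetical helper that checks the return value strictly.
function contains($haystack, $needle)
{
    return strpos($haystack, $needle) !== false;
}

var_dump(contains("hello", "h")); // bool(true)
var_dump(contains("hello", "z")); // bool(false)
```

The habit to take away is comparing against false with === or !== whenever a function can return either a real value or false.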
In any line of code you write, so long as it's executed by a web server, you have access to these variables. They start with a dollar sign and an underscore, and then they're all caps. So, $_GET is a variable. It's going to be an array. It's going to be an associative array, AKA hash table, AKA-- not really an object, but it's a key-value store. What do you think is in that variable called $_GET? Take a guess. Yeah. All the things that are in the URL from the form. Exactly. So, q equals Harvard, foo equals bar, baz equals qux, whatever the user submitted via the form is going to be handed to you on a platter, so to speak, in the form of this variable, so that if you want the value of q, you just have to look inside that variable. And this is one of the things that's compelling about PHP. In contrast, in a language like Perl, which was very popular years ago for web programming, you either jump through hoops or use a popular library to actually parse the HTTP requests to get access to the keys and values. PHP and frameworks like Django and Ruby on Rails make this so much easier these days. And PHP does this through the superglobals. $_POST, well, you can guess what that does: anything you post ends up in that array. $_FILES is great, too. If you do let the user upload photos or whatnot, you're handed the files in the form of an array. You don't have to parse it or figure out how to deal with file uploads yourself, super easy in that sense. Some of the more esoteric ones now are SERVER and ENV. SERVER contains things like the user's IP address and their user agent. What was the user agent? Yeah? The browser and the operating system. The browser and the operating system, that cryptic string that is apparently being sent every time the browser visits you. And this ENV variable is rarely used but it gives you access to lower level details on the machine. COOKIE is nice and we'll come back to that next week.
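Looking a submitted value up in $_GET might look like the sketch below. The helper name search_term and the parameter name q are assumptions chosen to match the form discussed in lecture; the helper takes the array as an argument so it can be tried outside a web server too.

```php
<?php
// Hypothetical helper: pull the search term out of a $_GET-style array.
// Returns null when the parameter is missing or is not a plain string.
function search_term($get)
{
    if (isset($get['q']) && is_string($get['q'])) {
        return trim($get['q']);
    }
    return null;
}

// For a request like /search.php?q=Harvard, $_GET would hold ['q' => 'Harvard'].
$q = search_term($_GET);
if ($q === null) {
    echo "No query submitted\n";
} else {
    // Escape user input before echoing it back into a page.
    echo "You searched for ", htmlspecialchars($q), "\n";
}
```

Checking isset first matters because the key simply does not exist when the user did not submit it, which ties back to the distinction between null and "not set".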
But COOKIE stores cookies, keys and values that you might send to or receive from browsers. $_REQUEST, by default, combines what's in GET and POST and COOKIE into one array, so you can look up a key without caring how it arrived. The rawer details about the user's request (What path did they request? Was there a question mark in the URL with parameters?) live in SERVER, before they end up in a more user-friendly place like GET and POST and COOKIE. And SESSION is one of the most powerful ones, arguably. It is the thing that allows you to implement state and implement things like shopping carts, even though HTTP, as we sort of began to discuss on Monday, is stateless, in that as soon as you visit a page and the page is loaded, you disconnect from the server; you no longer have a connection to the server anymore. Via cookies, you can remember, or rather a server can remember, that you're logged in, and we likened the cookie on Monday to a hand stamp. And what is SESSION? SESSION is this amazing superglobal in PHP that you the programmer can put anything you want in: any keys and values, any numbers, any strings, any ISBNs of things a user put in their shopping cart. And the next time the user visits your website, so long as their cookie hasn't expired, you can access that exact same data in $_SESSION, magically, so to speak. You don't have to worry about figuring out who the user is. PHP and in turn the web server do all that for you out of the box. So again, another upside of a language like this. So let's actually see this in action rather than talking about it in the abstract. So, last time, recall that we had this file. Let me go to cs75.net, lectures, where we posted a video and more. And in our source code directory, typically if we write some source code on the fly during lecture, I'll clean it up and then upload it the next day to the server if you want to play around, so you don't have to write down code and whatnot. Or if we have some stuff in advance, I'll put it there.
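The shopping-cart idea behind $_SESSION, mentioned above, can be sketched as follows. The helper add_to_cart, the cart key, and the sample ISBN are all hypothetical; the helper takes the session array by reference so the logic can be exercised without a web server. In a real page you would call session_start() first and pass $_SESSION.

```php
<?php
// Hypothetical helper: record one more copy of an item in a session-style array.
function add_to_cart(&$session, $isbn)
{
    if (!isset($session['cart'])) {
        $session['cart'] = [];
    }
    if (!isset($session['cart'][$isbn])) {
        $session['cart'][$isbn] = 0;
    }
    $session['cart'][$isbn]++;
}

// In a real page (assumption about typical usage):
//   session_start();
//   add_to_cart($_SESSION, $_POST['isbn']);
// Here a plain array stands in for $_SESSION.
$fake_session = [];
add_to_cart($fake_session, '978-0131103627');
add_to_cart($fake_session, '978-0131103627');
print_r($fake_session['cart']);
```

Whatever ends up in the array is what PHP would hand back on the user's next request, so long as the session cookie is still valid.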
So this is from Monday, and we had this site, Google and Google Search. And when I submitted this, recall that if I search for Harvard enter, I ended up at-- enter, oh, we broke it. I should fix-- I will fix this. Recall that I broke it at the very end of class, by changing the value of Q to something else altogether because I think I said like QQQ or something random like that. So, let's now instead of using Google to do our back end, let's instead write the back end ourselves. So I'm going to go ahead and do this. First let me grab this page source and I'm going to open up our little text editor as before. And yeah, this is what I did wrong last time. So now it's back to Q. But this time, I'm going to change this to point at my own server. So then a word on a server, if I scroll over here, this is my CS50 Appliance, the virtual machine that in a week and a half's time, we'll start using as well, and it's in Linux computer, but more than that, even though it looks like a desktop with a little Start-like menu in Windows, it's still a server, and I can see this as follows. If I go ahead and inside of the appliance I visit, Google, I see Google, but if instead I do http://local host, local host is the common name for a Linux computer when you're on the Linux computer itself. And this is true in Mac OS as well and sort of in Windows. Local host refers to the computer you're on. So when I visit http://local host and let's just say slash, enter, I should see the root directory of the web server. So this is what I'm seeing. The fact that I'm seeing this page, and actually, it tells us literally what it is, "This page is used to test the proper operation of the Apache HTTP server after it's been installed. If you can read this page, it means the web server installed at this site working properly but it's not yet being configured." So that's great, that's exactly what I wanted to see. 
Some indication that the web server is working, and now it's up to me to actually populate it with some data. Now I can do something else. And these kinds of steps, if you're unfamiliar, we will explain before the first project. What I've done right now is I've opened up a so-called terminal window. This is an old school black and white interface for navigating the contents of a computer. It's like the DOS prompt of yesteryear. On Mac OS, it's the terminal window. Windows sort of has an analog in the command prompt, but it's not as flexible as on Linux and Mac OS. And I can do a few things here. Again, we'll document this more in the future, but this is a fairly arcane command for making a directory: mkdir, space, the name of the directory. Now I'm going to go ahead and hit enter. And what that will do for me, ignore the control C, is I can now do cd public html, and that stands for change directory. And having changed directory, now I am inside of this, so cd is like double clicking a folder in a modern operating system, which then opens a new window. So cd has now put me inside of public html. So now I'm going to go ahead and do this. I'm going to go ahead and run a command like gedit hello, or let's do google.html. Gedit happens to be a text editor for Linux, so it's like TextEdit, it's like notepad.exe, but this one's a little nicer in that it supports something called syntax highlighting, whereby my code will be colorized to be more user-friendly. So let me go ahead and copy what we wrote on Monday over here and paste it in. So this is what I mean by syntax highlighted. It's just pink and purple and whatnot, just to draw our attention to semantically different parts of the web page, and now I'm going to go ahead and hit save. So control S, or I can go to the file menu. And now, let me go back to that terminal window, and again I'm back in Linux here, and I'm going to go ahead and do ls, and notice I have a file called google.html.
And I can do all sorts of commands. There's the cat command which shows you the contents of files. There's the more command which shows you the contents of files. You can do any number of things. I can accidentally delete it with the RM command, don't do that. But I can do all sorts of things at the so-called command line that I could with a mouse and a keyboard, traditionally. So, what's the takeaway here? Now that I have google.html, notice that I have it in my public html directory, but if you can infer, who am I at the moment? What's my user name? Yeah? Jharvard? Jharvard, so I am John Harvard. Why is that? Well we configured this particular virtual machine with a generic username, John Harvard, so that anyone can use it and so that in documentation and whatnot, we can tell you exactly what your username is. It just gets a little more annoying if everyone has unique addresses because troubleshooting is harder and so forth. So just assumed you've signed up for a web hosting company. They have arbitrarily told you your username will be jharvard instead of A or B. So now I'm in John Harvard so-called home directory, the folder that I get for all my storage. And in there I created the public html subdirectory or folder. And in there, just to be clear, what's inside of public html at this point in the story? Google.html. So how do I visit google.html? Well, I'm going to open my Chrome browser and rather than visit just local host, I'm going to actually do this, http://localhost/ tilde jharvard/google.html. So this is a convention on a lot of web servers. When you want to access a specific person's home directory, you do slash tilde username, slash filename. You do not type what apparently? Public html. So public html is implied by the fact that you're using the URL, so don't type public html in URL itself. And now I'm going to go ahead and hit enter. And voila-- damn it, broken. So what does this mean? First of all, which-- what's the status code here? 
Has anyone spotted it? Yeah? Forbidden. Forbidden, 403, you can see it in the tab at the very top. So that's one of those more arcane status codes. 404 is a little more common, File Not Found. Here the file is there, but I'm forbidden. So just high level, in English, what does this mean? Yeah? Didn't set the permissions. I haven't set the permissions. So we talked earlier about the idea of global permissions. Now let's frame this in a Linux context. And again, Mac OS is very similar. Windows isn't quite the same process, but the ideas exist on all of these platforms. So, let me do ls for list again. This is like dir if you come from a Windows world. And I see google.html, not all that enlightening. But I can do a long listing. So ls -l and then hit enter. So -l, for those less familiar with Linux or unfamiliar, is the switch, the command line switch or flag or option, whatever you want to call it, that modifies the behavior of the command, which in this case is called ls. Enter. And now I see more output. What do I see now? I see first who owns the file, then what their group is, and by default the appliance is configured so that there's a students group and there's only one student for everyone, called jharvard. But when you install your appliance, you're not sharing the same appliance. You have your own copy of the appliance with its own jharvard account. This means it is 424 bytes, which means 424 characters I typed into that file. This is when we last edited it. This is the name of the file. And I skipped the most interesting part, which is over here. Now, this is maybe a little cryptic but rw generally denotes read and write. And what we have here is an indication of three types of permissions. So this is very much a crash course. Again, you don't need to commit all of this to memory yet because they'll come up again in the actual projects. But what we've just done here is-- let me actually copy and paste this. We have this sequence here. What in the world does this mean?
Well first, I'm going to cheat and I'm going to get rid of this one, the first dash is either a D if it's a directory, or a hyphen if it's a file, something else if it's something else. But for now, let's just assume that directories and files are all that exist. So now there's this and let me put some spaces then. It looks like we have a pattern of triples here. The first triple is the owner, so to speak. The second sequence is the group, in this case, students. And then the last is the world. So what is the implication right now? The owner can read and write this file. The group, students, can read and write. That feels a little worrisome, but in this case, the virtual machine is on my own computer. There's a students group but I'm the only student. So this is kind of immaterial. So it's not great but not bad, it doesn't really-- it's not applicable at the moment. The whole world though can read this and that's what I want for an html file. So it feels like my permissions are right. What else could be wrong then? Again context is web server is running as Apache or some username that's not me right now, but we have to give him access to it. Yeah? I have a question. OK. [ Inaudible Remark ] The last dash? In this case, no. This is actually OK and others would be possible. Technically rw or r would be fine. Or even rw nothing r would be fine. Point is that the world has to be able to read it. But what else does the world have to be able to have access to, do you think? Directory. The directory, right? We got to go one level higher. So how can I do this? Well, when I did ls -l a moment ago, I only saw the file. Let me do ls -al which is all in long or I can do this, you can combine switches typically in Linux just for our convenience like this, al. Now I see more. The first two lines are dot and dot-dot. What does dot represent in a typical file system? Sure. [ Inaudible Remark ] Close. What does dot represent? Oh, let me change the question. 
What does dot-dot represent? Excellent. It means the directory above, so dot-dot. So dot, though, by contrast represents? Current folder. The current folder, the one that you're in. So dot is where you are, dot-dot is your so-called parent, which just means the folder that your folder is inside of. So dot here refers to a directory called public_html. Dot-dot refers to my home directory. And now-- I know what it is. Damn it. OK. So, I'm going to have to fake the story slightly for just a moment. Everything is actually correct. There is another secret setting that I changed earlier in the week while playing with the virtual machine that explains this. It's a feature called SELinux, for Security-Enhanced Linux, which disallows anyone including John Harvard from using the web. So let me see if I can quickly fix this, but this was a wonderful stroll down the diagnostic techniques that would have led us to the solution. [ Pause ] Uh-huh, oops, and we go here. OK. So, OK. So, this is a detail you will not trip over yourself because by default what I just did is already done for you. It's just that I disabled it while playing around the other day. This was an additional security mechanism called SELinux, which comes with flavors of Linux like Fedora and CentOS and Red Hat, and it's meant to lock down systems even more. But it doesn't matter, because the story we told is still very much the same. In fact, I can simulate now how we could have created a problem for ourselves as follows. Let me go into this directory, and everything now looks correct. All of this is good because it means a few things. Google.html is readable by the world. What do you think x means for both dot and dot-dot? Executable. Executable. Now normally, executable means like execute a file, run a program, but that's not the case for directories, because notice the d and the d? For directories, if a directory is executable, that means someone can get into it.
They can't necessarily read it and see the contents; that is read. Execute means they can do the equivalent of cd into it, or they can visit the URL that contains that directory's name. So the fact that this is x, this is x, and this is r is actually perfect. That's what we want. But I can simulate it being wrong. Suppose that by default when I'd created this file, it looked like this. What's wrong now with this picture? What jumps out at you? Yeah. That it's only read, written by the owner and no one else can access it. Perfect. Only readable, writable by the owner, no one else can read it. That's a problem. So there's a bunch of ways to fix this, but the way we'll introduce for now is chmod, which is change mode, and then a for all, aka everybody, plus, what do I want to give everyone, r. So a little arcane, the syntax, but this command gives us what we want. Change the mode of google.html to give everyone r. The plus means give, minus means take away. So enter, ls -al, and now that problem is solved. By contrast, if the directories looked like this, propose to me how we fix this problem now. Now my dot and dot-dot directories are no longer executable, which means my file is readable but no one can get into this directory via the web. How do I fix this? A plus x. OK. Good, a+x for executability and then the name of the file, which is-- or folder, which is dot, and I can actually put a space-separated list of these things on the command line. I can hit that, and now ls -al, we've fixed that problem, too. Now suppose I goofed and suppose I do chmod a+x google.html, you can maybe guess what's going to change. So think to yourself what this line is going to look like. In just a second, now it has an x everywhere as well. Does this mean anything? In this case, no, it's an HTML file, it's a static file. Making it executable means nothing. And so, is this going to break anything? No, it's just kind of wrong in principle.
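To make the chmod arithmetic above concrete, here is a minimal sketch of how a symbolic spec like `a+r` or `a+x` changes the permission bits. The function `apply_symbolic_mode` is hypothetical and handles only the simple who/plus-or-minus/permission forms used in the lecture; the starting modes are assumed for illustration. In real chmod, `u` is the owning user, `g` the group, `o` others (the "world"), and `a` all three.

```php
<?php
// Hypothetical helper (not from the lecture): apply a symbolic chmod spec
// such as "a+r" or "g-w" to a numeric mode and return the new mode.
function apply_symbolic_mode(int $mode, string $spec): int
{
    if (!preg_match('/^([ugoa]+)([+-])([rwx]+)$/', $spec, $m)) {
        throw new InvalidArgumentException("Unsupported spec: $spec");
    }

    $whoShift = ['u' => 6, 'g' => 3, 'o' => 0]; // bit offset of each triple
    $permBit  = ['r' => 4, 'w' => 2, 'x' => 1]; // value within a triple

    // 'a' (all) is shorthand for user, group, and others together.
    $who = strpos($m[1], 'a') !== false ? ['u', 'g', 'o'] : str_split($m[1]);

    $bits = 0;
    foreach ($who as $w) {
        foreach (str_split($m[3]) as $p) {
            $bits |= $permBit[$p] << $whoShift[$w];
        }
    }

    // '+' gives the bits, '-' takes them away.
    return $m[2] === '+' ? ($mode | $bits) : ($mode & ~$bits);
}

// A file only its owner can read and write (rw-------), then `chmod a+r`.
printf("%04o\n", apply_symbolic_mode(0600, 'a+r')); // 0644, i.e. rw-r--r--

// A directory nobody but the owner can enter (rwxr--r--), then `chmod a+x`.
printf("%04o\n", apply_symbolic_mode(0744, 'a+x')); // 0755, i.e. rwxr-xr-x
```

PHP's built-in `chmod()` takes the numeric form directly, so the result of a calculation like this could be passed to it; on the command line, chmod accepts either form.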
However, sometimes with PHP, your PHP files need to be executable. That is not the case on most web servers. Typically, they just need to be readable. And we'll now see some PHP, all right. So that was a lot of fun making google.html. Now, let us pretend to implement a Google server. I'm going to go ahead and hit New, let me copy this temporarily. So new file, I'm going to save this as server.php. So our very first PHP file, we're going to pretend to be Google for a moment, enter. And now I'm going to start, you know, I'm going to cheat here and say, you know what, I don't want to do any of this just yet. I'm going to just do something silly like "Coming soon." So this, I argue, is PHP. I named the file server.php, I claim you now know PHP. And why is that? Well, in the world of PHP you can actually commingle HTML and CSS with raw PHP code. So the fact that I haven't actually written any PHP code is actually kind of sad, because this is not PHP, but this will still work. So let's actually take a look at what happens. I'm going to go into google.html now, which again we made Monday. And I've already fixed the query string. But I don't want to go to search on google.com now, I'm instead going to change this to server.php. In other words, when I submit this form now, I want it going to my own file just to see what happens. So let's go ahead and pull this up. And let me go ahead and type in Harvard again, enter. Wait a minute, something is wrong. What did I do that's wrong? I did not implement this, certainly. Yeah. [ Inaudible Remark ] Perfect, right? Stupid mistake, right? Caching, right? The browser has to be reloaded to actually get the new copy of the HTML. So let's hit the back button, and let's then reload here. And now, let me do a sanity check. I'm going to right click and view page source, now it's correct. This is what the browser is now seeing, server.php. So here we go, I'm going to search for Harvard now and hit enter. Hmm, problem.
So this is a security feature that's actually provided by suPHP. Just for good measure, suPHP does not want your PHP files to be writable by others. Why? Because if you screw up, if the file is writable, someone could change the file itself somehow. So we can fix this using what we know already of chmod. ls -al-- oops-- ls -al, the problem is that the PHP file is writable by group. How do I take away that w from my group, do you think? Yeah. Use g minus [inaudible]. Perfect, g minus w for server.php, enter. And now I do ls -al and that's OK. And you know what, I'm going to do one more thing, chmod, I'm going to do a minus r of server.php. And now, here is the output. This is actually wrong now. I need to give myself read back. So a chmod owner, o plus r of server.php, ls -al-- oops-- let's cheat here. So now what do we see? OK. So now, I argue that this is sufficient for PHP. Whereas JavaScript and HTML and CSS and GIFs and PNGs and JPEGs need to be readable by all, I argue now that PHP files only have to be readable by me. Why this distinction? Why does this make sense in the context of what we've discussed thus far today? Yeah. This is just a wild guess. That maybe the PHP is just run on the server, not by the actual user on the other side-- Perfect. -- it's just getting what the PHP needs to, which is irrelevant in this case. Exactly. So whereas static files like JavaScript, CSS, HTML, JPEG are ultimately sent literally to the user to be viewed and seen by him or her, PHP is meant to be first interpreted by the server, and then the server will send the output of that PHP file to the browser. Now at the moment, we have kind of a silly example. Inside of server.php is no PHP code whatsoever. What's inside of there? Just HTML. So what's going to happen when I reload the page and resubmit that form? The web server, Apache, is going to realize, "Oh, you have submitted a form to a PHP file." Why? Because it ends in .php.
I am configured, because of the way the LAMP stack works, to interpret .php files using the PHP interpreter, which is just a program that understands PHP. Now, the PHP interpreter is going to look for PHP code. Anything that's not PHP code, it's designed to just spit out raw. So anything in the file, even if it ends in .php, if it's not PHP code itself, it just gets sent raw to the browser. So what is the user going to see in this case? Literally all of my HTML, because I haven't written a single line of PHP code yet. But the point though is that because it did end in .php, the principle is the same: only the web server has to be able to read that PHP file in order to interpret it. But who is the web server going to be running as for PHP files? Yeah. Jharvard. Jharvard, because of the suPHP feature. Substitute user PHP means for any PHP file, substitute the user who owns the file, so that the security mechanism we discussed is in place. So I'm going back to my browser, I'm going to go back to the form, I'm going to resubmit Harvard to my fake Google search. And now enter. Now, notice the URL is server.php question mark q equals Harvard, "Coming soon." Now, let's write some PHP code. One of the most powerful things you can do in a dynamic website is actually spit out what the user has done. So here is my PHP code, rather-- well, it's sort of meaningless because there is no PHP, let me-- your server.php. Instead of "Coming soon," let me do something like, "You wanted to search for:", let me do a bold tag, and let me really cheat now, Harvard, save this. All right. Now, nobody should be fooled by this. When I go back here, go back, do I have to reload the form? No, because I only changed the server.php file. You don't need to refresh everything. I didn't change google.html. Let me go ahead and click Google Search, oh my God, we now have a dynamic website. I typed Harvard and Harvard appeared on the screen, but not really, right?
Because if I go back again, and I type in Yale and hit Google Search, OK, I'm clearly cheating. So let's be a little more genuinely dynamic. Let's go here, and I don't want to spit out Harvard. But based on the discussion of superglobals earlier, where in the world can we find what the user typed in for q? Yeah, go ahead, yup. [ Inaudible Remark ] In the GET superglobal, yeah. So let's do this, we now need to insert the value of that variable. And you might just want to do this, $_GET, here is the syntax for going into a superglobal. You do square bracket, quote, unquote, the name of the thing you want to get, close bracket. All right, so this is a superglobal itself, but it's more specifically an associative array, otherwise known as a hash table, hash map, whatever you're familiar with. And that means you index into it using not numbers but words or letters, and what you get out of it is the values. So in this case, we should get back H-A-R-V-A-R-D or Y-A-L-E, but not quite. So, let me try this just to prove that I'm wrong. Let me go back here, reload, search. OK, clearly not what I want, but I need to tell the server, here is PHP code. Otherwise, it's just cryptic-looking English. The means by which I do that is I have to enter PHP mode, open bracket question mark php space. And then on the end, kind of the opposite, question mark close bracket. If you've come from the world of ASP in Windows or JSP in Java, you might have seen similar tags. This just means, enter PHP mode, do something, exit PHP mode. So let's see what the end result is here. Let me go back to Google, reverse, Google search for Yale, interesting. What is missing here now? What did I do wrong? Yeah? Well, you have to actually set the value of the GET. Exactly. So think about any programming language you know. Generally, if you want to print the value of a variable, it's not sufficient just to write the name of the variable in your program. Echo.
Echo would work, but we have a couple of options here. We can say echo, literally, we can say print, and then we can do parentheses to make an actual function call. I'll go with this one for now. But echo is also a viable option, and now we're explicitly telling the interpreter, print the value of this variable here. So let's go back to my browser, go back, resubmit Yale. And now, we have some dynamism to it. Yeah. Is there a difference between echo and print? Is there a difference between echo and print? Not really. Print is a proper function, echo is a language construct, and the crazy people on the internet have done benchmarks comparing print and echo. And every blog post that I've read pretty much says they're equivalent. Now, except for microseconds or milliseconds if you're echoing millions of things, but for all intents and purposes, they're the same. So we can do something else here. And now this is a religious thing that I'm sure some people on the Internet will hate me for saying. But I've always thought this is an atrocious construct for saying enter PHP mode. And indeed, PHP also supports what are called short tags, open bracket question mark and that's it. This is a short tag, because why? It's shorter than what I previously typed. Now, there are corner cases you can get into, and if you read the crazy religious debates online, you'll see that one of the reasonably compelling reasons against them is that, if a web server is not configured with support for short tags, then you do run the risk of having your raw PHP code transmitted to the user as though it's just HTML or the like, at which point you've compromised the sanctity of your intellectual property, or worse, disclosed your user names and passwords. So that's kind of a legitimate concern.
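The steps above (enter PHP mode, print one value out of the GET superglobal, exit PHP mode) can be sketched roughly as follows. This is an illustration, not the lecturer's actual server.php: on a real server `$_GET` is populated from the query string, so assigning it by hand and capturing the page with output buffering are both assumptions added only so that the file also runs from the command line. Note that it prints the user's input raw, exactly as at this point in the lecture; the escaping that a real page needs comes up later in the session.

```php
<?php
// Demo-only stand-in for a request like server.php?q=Yale.
$_GET = ['q' => 'Yale'];

// Demo-only: capture the page so it can be inspected after rendering.
ob_start();
?>
<!DOCTYPE html>
<html>
  <body>
    <!-- print with parentheses, as typed in the lecture -->
    You wanted to search for: <b><?php print($_GET['q']); ?></b>
    <br>
    <!-- echo is interchangeable here -->
    You wanted to search for: <b><?php echo $_GET['q']; ?></b>
  </body>
</html>
<?php
$page = ob_get_clean();
echo $page;
```

Everything outside the `<?php ... ?>` tags is passed through untouched, which is why the earlier "Coming soon" file worked without a single line of PHP in it.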
But if you are running your own web server, and have control over the short tags feature in a file called php.ini, which is a config file I think we mentioned briefly on Monday, that will be on the appliance for you to tinker with if you want, frankly, I just think there's an elegance about the symmetry of this. But typically when you're writing code that won't necessarily run on your own server but could be posted as open source code, or you're writing it for a corporate project where you don't have control over the web servers themselves, the first way I did it, with open bracket question mark php, is the preferred way. Because it's more portable, it's not going to break. Because the worst thing is if you download code that someone else has written and it's all short tags, and your web server doesn't support short tags, and you might not control your web server because it's a third-party web host. It's a pain in the neck. You go through thousands of lines of code changing your short tags to long tags or vice versa. So just FYI, you'll see both tricks online. So this is nice, but can we do better than this? Well, let's actually try something a little more general. Let me go in here instead, let me create a new form, and let's do a few different data types this time. Let me go ahead here and paste this in just to get it started. And then I'll have a registration form, and center Google Registration. Again, we'll do register dot-- or this time we'll do register.php. And let's do a few things this time. I'm going to do input, name, equals name. And I'm going to say, let's do this quick and dirty for a registration form for like a conference or student group or something like that, input name equals name, type equals text. And now, let's do a line break here, and let's just do another something here, like, let's do Gender and let's do this check-- or rather, radio, for something like gender. And then I'll say value equals M for male and I'll say M here.
And then I'll say over here, input type equals name-- nope, gender-- nope, name equals gender, type equals radio, value equals F, and now I'll put F here. And then should we do one more, let's do one, just a simple drop-down down here. Let's do a select, name equals state, close select. Let's do this here, option value equals, let's say, Connecticut, close option, and Massachusetts. So our registration form for whatever reason will only support people from Connecticut or Massachusetts, just so we don't get bored typing them all out. OK, so I've made a very quick and dirty form in-- sadly a file called google.php. So, I'll restore that later so you can have the original code back. Let's go ahead and save this as something else. So, register.html. OK. So, now let me pull this up in my browser. Server is going to change to register.html. OK. So there we have a pretty atrocious-looking website. And in fact I've omitted one of the more important pieces. So, what do we need? It'd be nice if we had a submit button. So, let's go in here, input type equals submit, value equals register, close bracket, reload. OK. So, there's our very simple website. It's a little more interesting than our fake Google site because at least now we have a couple of user input mechanisms that we didn't have before. So, then let's now look on the back end at what we're going to get. So, first, let me fill this out as a sample. David, male, we'll change this to Massachusetts, and now I'm going to click register. But let me zoom out so we can see the URL change. Register, and now register.php was not found on the server, but that makes sense because we haven't created it yet. So, let's go ahead and do that. Let me go back to my text editor, let me copy this temporarily, make a new file, paste this in, we'll call this register.php. And I want to say here registered, and we'll say something like, "Hello", open bracket, print, dollar sign, underscore, GET, name, close bracket, close bracket there.
So, let's take this one step at a time. First, I'm just going to say hello to whoever it was that registered, OK? So, let's get back over to the browser. We'll go back, we submit the form and, damn it, same bug again. So, quick, how do we fix this? Writable by group, that was the problem. Chmod. G. G. Plus? Minus, w, register.php. OK, fixed. Let me go back here. And notice, new status code, 500. 500 is generally the worst, it means it really did something wrong. All right. So, let's go back here. Let's reload the form and voilà, Hello David. OK, so some progress there. So that's good. And let me introduce one other syntactic trick. Frankly, this isn't the prettiest thing, printing out a symbol there. There is this trick you can do with short tags which is very compelling. If you want to insert the value of a variable, you can put open bracket question mark equal sign with no space in between them. So, just to confirm, let me go back to the page, let me reload, and it seems to be staying the same, which is great. Now, let's look at the URL. It's more complex than Google's because we have multiple inputs. David, male, state equals MA. How do we get access to these other values? Well, first let's do a quick and dirty thing and let's just look at the entire contents of GET. So, let me go into register.php and I'm going to cheat now, I'm going to output a pre tag. Recall that pre-formatted text uses a monospaced font, just so everything looks like code. And what I'm going to do in here is instead going to do ?=$_GET. But this isn't quite right. Actually let me put this on this line. So, you'd like to think this will just print out the entirety of GET. But let's see what I see instead. If I go to here, let me reload. OK, not that enlightening. It just says Array. But that makes sense because I did say GET is an array. So, we need to print it recursively to see what's inside the array. And the trick you can use, and this is not generally for production code.
You don't say print, you say print_r for recursive, and it's a wonderful way of just taking a quick peek inside of variables. So, I'm going to go to registered, reload, and there we go. So, this is what it looks like. This is completely arbitrary formatting. This has nothing to do with the underlying implementation, it's just a pretty way of printing the information. And now, I see three keys, name, gender, state, followed by three values. So, this is just a nice sanity check as to what's actually in there. So, now I can do something like this. Let me go back in the P-- register.php, let me go back to saying h1, Hello, equals $_GET name, close bracket, exclamation point, close h1. Now, let me do this again and I'm going to say something like "You are a"-- this is going to be a little underwhelming at first. Let's just do gender and then close that h1 tag. And then finally, "You are from" state. So, this should hopefully follow logically from what we did a moment ago. So, let's reload now. And the fonts are a little big. Not the most user friendly thing, but at least we're on our way. However, notice that there is no security mechanism in place here right now. There is no sanity checking of the user's input. And notice, we used GET, recall the URL looks like this. So, what if I instead do something like this? This is not like a correct website right now. So, there's opportunities here, right? There's opportunities to, one, make sure that what we provided to the user as options are actually checked on the server side. Two, we can make it more user friendly. It'd be nice if Massachusetts said Massachusetts, not MA. It'd be nice if the M became male in lower case, or "you are a guy" or "you are a girl" or just something. So, there seem to be opportunities here for if conditions and elses and some kind of conditional checks and so forth. So, we can build up from there.
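The two opportunities just mentioned, checking the submitted options on the server side and printing friendlier labels than M or MA, might be sketched as follows. The function name `describe_registrant`, the lookup tables, and the keys M/F and CT/MA are assumptions for illustration; the URL in the lecture shows state equals MA, but the full set of form values is not spelled out in the transcript.

```php
<?php
// Hypothetical helper (not from the lecture): validate the registration
// values against the options the form offered, then build a friendly message.
function describe_registrant(array $query): ?string
{
    // Assumed allowed values, mapped to friendlier labels.
    $genders = ['M' => 'male', 'F' => 'female'];
    $states  = ['CT' => 'Connecticut', 'MA' => 'Massachusetts'];

    $name   = $query['name']   ?? '';
    $gender = $query['gender'] ?? '';
    $state  = $query['state']  ?? '';

    // A hand-edited URL can send anything, including arrays, so check types.
    if (!is_string($name) || !is_string($gender) || !is_string($state)) {
        return null;
    }

    // Reject an empty name or any value the form never offered.
    if (trim($name) === ''
        || !array_key_exists($gender, $genders)
        || !array_key_exists($state, $states)) {
        return null;
    }

    // The name is free text, so it is escaped before going into HTML.
    return sprintf(
        'Hello, %s! You are %s and you are from %s.',
        htmlspecialchars($name, ENT_QUOTES, 'UTF-8'),
        $genders[$gender],
        $states[$state]
    );
}

// Stand-in for $_GET after submitting the form as David, male, Massachusetts.
$sample = ['name' => 'David', 'gender' => 'M', 'state' => 'MA'];

print_r($sample);                        // quick peek, not for production
echo describe_registrant($sample), "\n";

// A tampered URL with a state the form never offered is rejected.
var_dump(describe_registrant(['name' => 'David', 'gender' => 'M', 'state' => 'ZZ']));
```

Returning null for bad input is just one possible design; a real page would more likely redisplay the form with an error message.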
But one of the most important takeaways is that right now, we're just trusting what the user has submitted to the form, and this in and of itself is not a good assumption, because we can do something even worse than this. This is a very common thing known as a cross-site scripting attack, which we'll talk about more toward the end of the semester. But if you're familiar with JavaScript even minimally, what if I do something crazy like this, "You have been hacked," question mark here, close script tag. OK, that's my name, I claim, all right? So, what's going to happen now? Well, because of the server-side code, what am I doing with the name parameter? Yeah. You're closing the script tag? I'm closing the script. Well, and well, actually you notice here, I closed it but I also opened it. What am I doing in register.php with that value? Yeah. You're actually going to get-- you're going to send that string to the user and the user [inaudible] is going to interpret it as JavaScript. Exactly. I'm literally going to spit out what the user typed. But if the user typed HTML, that's going to be added to the page and that HTML is going to be executed or interpreted, and if it's a script tag, it means the JavaScript code is going to run. So, in short, what we just did is amazingly simple, too simple. Very bad, like this is not good code. And many websites make this mistake, because watch what happens now. If I go here and click register, uh-oh, what did I do wrong? Register [inaudible] script, all right, stand by one second. My dramatic alert. You have been hacked. Hmm, Chrome, are you doing this to me? [ Inaudible Remark ] That should be OK where we put it. Semicolon. Semicolon, let me go back. You have been hacked. That should be OK, let me try one other thing, otherwise this is going to be a very underwhelming-- type equals-- oh my God. All right. Stand by for one second. We're going to try one other thing here. Otherwise, you will never believe anything else I say.
OK, 151, 128. Register.html. OK, so before I tell you what I just did, we're going to try this again. Script, alert, you have been hacked, Massachusetts. Oh, damn you, Chrome. OK, Google has been too helpful for its own good. So, Google is detecting what we just did and is scrubbing that apparently for us, which is rather good and bad of them. So, this was the effect I was trying to create. So, I very quickly opened up Firefox instead, which apparently doesn't have this protection in place, and this is not the behavior we wanted. But as soon as I click OK, we should at least see some of the behavior I expected, but not quite all of it. Now, this is stupid, right? You're an idiot if you're trying to, like, trick yourself into executing JavaScript alerts. Like this is not really threatening anyone other than myself. However, if you think about how we did it, notice what's in the URL there. So, apparently you can trigger these kinds of tricks by typing input manually into forms, but that's the silly way of doing it. What if instead you are a bad guy and you're doing like a phishing attack, sending people bogus emails, and you're telling them to click a link, and they don't necessarily see the whole link because it's hidden with HTML email formatting. But they click that link, they get led to my page, and then some JavaScript code executes. Well, this too is stupid JavaScript. Triggering an alert is not hacking anyone. But as we'll see in a few weeks with JavaScript, you also have access to a user's cookies in JavaScript, which means there are attacks that we'll talk about later in the semester whereby you can steal someone's session cookie, hijacking their session in the same way we discussed on Monday with Firesheep and Starbucks and the like, by having tricked the user into typing or clicking a link that takes advantage of this failure to escape the user's input.
So, the fix here is actually relatively simple, if tedious. In my code, you never, ever, ever, ever want to trust what the user has typed in. So, the real way to echo user input is something like this, htmlspecialchars, which is an annoyingly long function name, but it is a very good function, in that it will ensure that any potentially dangerous characters, among them the open bracket, which as you know demarcates the start of an HTML tag, will be escaped. So that now, if I go back and resubmit the exact same form-- now, I look like the idiot because I've typed in-- it's displaying exactly what I typed in, which you would think is the expected behavior anyway. So, one of the recurring themes that we'll discuss not just at the end of the semester but throughout is how to take advantage of things like escaping, both for user input here, for JavaScript inputs, and most importantly for database inputs, so that ultimately you are not vulnerable to attacks like this. So, what did I do to work around this? In Firefox, notice my URL is very different. In Firefox, what URL did I use to visit the website? The same website. Yeah. So, go ahead, what is it? [ Inaudible Remark ] Yeah, so this private IP, 192.168.151.128. Where did that come from? Well, the CS50 appliance, the virtual machine I've been running, it's just a computer on the internet, albeit a virtual one, and because I'm running it in the program called VMware, which is again a hypervisor that allows you to run one operating system on another, notice in the bottom right hand corner of the appliance, there is mention of my IP address. And this can change all the time. VMware in this case is acting as the so-called DHCP server, giving the appliance a different IP potentially every time I turn it on. But this is just a configuration we put here to always remind the human what IP address he or she has. So, what is the implication?
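Returning to the escaping fix described above, here is a small sketch of what htmlspecialchars does to a script tag submitted as a name. The exact string typed in the lecture is not in the transcript, so the payload below is an assumed stand-in; the explicit `ENT_QUOTES` and `'UTF-8'` arguments are a choice made here so that the result does not depend on the PHP version's defaults.

```php
<?php
// Assumed stand-in for what was typed into the name field.
$name = '<script>alert("You have been hacked");</script>';

// Unsafe: echoing $name as-is would hand the browser a live script tag.
// Safe: escape the characters that have special meaning in HTML first.
$safe = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

echo "<h1>Hello, $safe!</h1>\n";

// The angle brackets and quotes are now entities, so the browser displays
// the text exactly as typed instead of running it:
//   <  becomes  &lt;
//   >  becomes  &gt;
//   "  becomes  &quot;
```

The same call would wrap every place the earlier pages echoed a value from the GET superglobal, including the search term in server.php.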
This is nice because it means I can, as I promised on Monday, minimize the appliance altogether. Not even have to worry about getting too comfortable with the actual Linux environment, and I can just treat this as a remote server. Now, it's remote in the sense that I-- it's as though it's remote. It's actually physically present, but I can still address it via an IP address here. And if I'm on my own Mac or my PC, depending on your OS, I can now just visit that actual URL with the browser, as though I'm visiting a remote server. And if I'm really particular, and I just don't like looking at this address, what I can do is what I did on Monday, whereby I can open up the terminal window and I can edit /etc/hosts, type in my password. And then remember we did this trick here, so let me go here and then I can do davidsecretwebsite.com. And now, because I've taught my Mac to make the DNS association for me, I can change this to this, and now notice, davidsecretwebsite.com is born, albeit only on my own local computer. So, when I mentioned earlier that you can do development on your own computer, it's a wonderful way of doing website development because you can still simulate all of the realities of HTTP and DNS. But locally, without needing an internet connection, without needing a remote server, without having to pay anyone for those services, you can spend those months upfront working at home, at a café, at work, all without needing any of the physical infrastructure that's typically associated with the internet. So, we'll also introduce you in the first project to this approach. But we've only just scratched the surface. So one, all I've been doing is echoing out input, but clearly a website like Facebook or Google takes input, it checks the inputs with if conditions and elses and loops and whatnot, it does things like writing to databases. And it would also be nice to move away from what seems to be a very sloppy start.
Whereby, we've been writing HTML and then I kind of dropped into PHP mode very quickly, then went back to HTML. This is not going to scale very well. So, if you're coming to the course with a background in ASP or JSP or even Django or Rails, there are ways of cleaning up our code so that we can practice some good principles like, let's keep presentation separate from our data. This is one of these mantras that makes good sense especially for large projects, where you keep your HTML separate from your CSS, separate from your JavaScript, separate from your data, separate now from your PHP code. So, even though tonight we've started to dive in with this commingling approach, and on Monday we'll do some more of the same, we'll also look at some common paradigms, among them MVC, Model-View-Controller, where you can really start to separate these things into more complex, more sophisticated, or rather more cleanly designed applications. But for now, why don't we go ahead and adjourn here officially. We'll take a 5, 10-minute break. Peter will get set up, and if you'd like to remain for section, by all means do. Otherwise section will be filmed as usual and be placed online by sometime tomorrow. And I'll linger around for one-on-one questions. All right, we'll see you on Monday. [ Silence ] END